Generalized Learning Vector 
Quantization 
Atsushi Sato & Keiji Yamada 
Information Technology Research Laboratories, 
NEC Corporation 
1-1, Miyazaki 4-chome, Miyamae-ku, 
Kawasaki, Kanagawa 216, Japan 
E-mail: {asato, yamada)@pat.cl.nec.co.jp 
Abstract 
We propose a new learning method, "Generalized Learning Vec- 
tor Quantization (GLVQ)," in which reference vectors are updated 
based on the steepest descent method in order to minimize the cost 
function. The cost function is deternfined so that the obtained 
learning rule satisfies the convergence condition. We prove that 
Kohonen's rule as used in LVQ does not satisfy the convergence 
condition and thus degrades recognition ability. Experimental re- 
sults for printed Chinese character recognition reveal that GLVQ 
is superior to LVQ in recognition ability. 
1 INTRODUCTION 
Artificial neural network models have been applied to character recognition with 
good results for small-set characters such as alphanumerics (Le Cun et al., 1989) 
(Yamada et al., 1989). However, applying the models to large-set characters such 
as Japanese or Chinese characters is difficult because most of the models are based 
on Multi-Layer Perceptron (MLP) with the back propagation algorithm, which has 
a problem in regard to local minima as well as requiring a lot of calculation. 
Classification methods based on pattern matching have commonly been used for 
large-set character recognition. Learning Vector Quantization (LVQ) has been stud- 
ied to generate optimal reference vectors because of its simple and fast learning al- 
gorithm (Kohonen, 1989; 1995). However, one problem with LVQ is that reference 
vectors diverge and thus degrade recognition ability. Much work has been done on 
improving LVQ (Lee & Song, 1993) (Miyalmra & Yoda, 1993) (Sato & Tsukumo, 
1994), but the problem remains unsolved. 
Recently, a generalization of the Simple Competitive Learning (SCL) has been under 
424 A. SATO, K. YAMADA 
study (Pal et al., 1993) (Gonzalez et al., 1995), and one unsupervised learning 
rule has been derived based on the steepest descent method to minimize the cost 
function. Pal et al. call their model "Generalized Learning Vector Quantization," 
but it is not a generalization of Kohonen's LVQ. 
In this paper, we propose a new learning method for supervised learning, in which 
reference vectors are updated based on the steepest descent method, to minimize 
the cost function. This is a generalization of Kohonen's LVQ, so we call it "Gener- 
alized Learning Vector Quantization (GLVQ)." The cost function is determined so 
that the obtained learning rule satisfies the convergence condition. We prove that 
Kohonen's rule as used in LVQ does not satisfy the convergence condition and thus 
degrades recognition ability. Preliminary experiments revealed that non-linearity 
in the cost function is very effective for improving recognition ability. Printed Chi- 
nese character recognition experiments were carried out, and we can show that the 
recognition ability of GLVQ is very high compared with LVQ. 
2 REVIEW OF LVQ 
Assume that a number of reference vectors wk are placed in the input space. Usu- 
ally, several reference vectors are assigned to each class. An input vector x is decided 
to belong to the same class to which the nearest reference vector belongs. Let wk(t) 
represent sequences of the w in the discrete-time domain. Heretofore, several LVQ 
algorithms have been proposed (Kohonen, 1995), but in this section, we will focus 
on LVQ2.1. Starting with properly defined initial values, the reference vectors are 
updated as follows by the LVQ2.1 algorithm: 
wi(t + 1) = wi(t) - a(t)(x - wi(t)), (1) 
oj(t + 1) = oj(t) +.(t)(x - (2) 
where 0 < a(t) < 1, and a(t) may decrease monotonically with time. The two 
reference vectors wi and wj are the nearest to x; a and wj belong to the same 
class, while x and wi belong to different classes. Furthermore, a: must fall into 
the "window," which is defined around the midplane of wi and wj. That is, if the 
following condition is satisfied, wi and wj are updated: 
min (d d) 
. >s. (3) 
where di = Ix - wi I, dj = la:- wjl. The LVQ2.1 algorithm is based on the idea 
of shifting the decision boundaries toward the Bayes limits with attractive and 
repulsive forces from a:. However, no attention is given to what might happen to 
the location of the w, so the reference vectors diverge in the long run. LVQ3 
has been proposed to ensure that the reference vectors continue approximating the 
class distributions, but it must be noted that if only one reference vector is assigned 
to each class, LVQ3 is the same as LVQ2.1, and the problem of reference vector 
divergence remains unsolved. 
3 GENERALIZED LVQ 
To ensure that the reference vectors continue approximating the class distributions, 
we propose a new learning method based on minimizing the cost function. Let w 
be the nearest reference vector that belongs to the same class of a, and likewise let 
w2 be the nearest reference vector that belongs to a different class from x. Let us 
consider the relative distance difference p(a) defined as follows: 
d I - d 2 
p(x) = d + d2' (4) 
Generalized Learning Vector Quantization 425 
where d and d2 are the distances of r from Wl and w2, respectively. p(x) ranges 
between -1 and +1, and if p(x) is negative, x is classified correctly; otherwise, x 
is classified incorrectly. In order to improve error rates, it(x) should decrease for 
all input vectors. Thus, a criterion for learning is formulated as the minimizing of 
a cost function $ defined by 
N 
(5) 
i'-1 
where N is the number of input vectors for training, and f() is a monotonically 
increasing function. To minimize S, Wl and w2 are updated based on the steepest 
descent method with a small positive constant ct as follows: 
os 
wiwi-CtOw i, i=1,2 (6) 
Ot Od 1 Owl 
OS O Od2 
O!a Od2 0w2 
, is used, we can obtain the following. 
Of 4d2 
0p (d 1 -{- d2) 2 (3 - Wl) (7) 
Of 4dl 
0[t (d 1 -{- 42) 2 (X - W2) (8) 
If squared Euclid distance, di = Ix - wi[2 
OS OS O Od 1 
Owl 
os 
(9) 
Therefore, the GLVQ's learning rule can be described as follows: 
Of d2 
W 1 --- W 1 -{- Ot 0/ (41 - d2) 2 (x Wl) 
Of dl 
W2 --' W2 -- "0/ (dl - d2) 2 ( - w2) (10) 
Let us discuss the meaning of f(). O flOp is a kind of gain factor for updating, 
and its value depends on x. In other words, O flOp is a weight for each x. To 
decrease the error rate, it is effective to update reference vectors mainly by input 
vectors around class boundaries, so that the decision boundaries are shifted toward 
the Bayes limits. Accordingly, f(p) should be a non-linear monotonically increasing 
function, and it is considered that classification ability depends on the definition 
of f(p). In this paper, Of/Op = f(p,t){1 - f(p,t)) was used in the experiments, 
where t is learning time and f(p,t) is a sigmoid function of 1/(1 + e-"e). In this 
case, O flOP has a single peak at p = O, and the peak width becomes narrower as t 
increases, so the input vectors that affect learning are gradually restricted to those 
around the decision boundaries. 
Let us discuss the meaning of p. w 1 and w2 are updated by attractive and repulsive 
forces from x, respectively, as shown in Eqs. (9) and (10), and the quantities of 
updating, IAw11 and ]Aw2], depend on derivatives of . Reference vectors will 
converge to the equilibrium states defined by attractive and repulsive forces, so it 
is considered that convergence property depends on the definition of . 
4 DISCUSSION 
First, we show that the conventional LVQ algorithms can be derived based on the 
framework of GLVQ. If y -- dl for dl < d2, y -- -d for d > d2, and f(p) - y, the 
cost function is written as S = d, <d2 dl - d >d2 d2. 
following: 
W 1 --' W 1 -{-t(X-- W 1), W 2 --' W 2 
W2 --- W2 -- (X-- W2), Wl --- Wl 
Then, we can obtain the 
for dl < d2 (11) 
for dl > d2 (12) 
426 A. SATO, K. YAMADA 
This learning algorithm is the same as LVQ1. If p = dl-d 2 and f(p) = p for < s, 
f(tt) = const for [tt[ > s, the cost function is written as S = Y.ll<8(dl - d9.) + C. 
Then, we can obtain the following: 
falls into the window) 
Wl + - 
if < s 
W 1 (13) 
w2 (14) 
In this case, wl and w2 are updated simultaneously, and this learning algorithm 
is the same as LVQ2.1. So it can be said that GLVQ is a generalized model that 
includes the conventional LVQs. 
Next, we discuss the convergence condition. We can obtain other learning algo- 
rithms by defining a different cost function, but it must be noted that the conver- 
gence property depends on the definition of the cost function. The main difference 
between GLVQ and LVQ2.1 is the definition of p; p =(dl-d2)/(dl +d2) in GLVQ, 
p = dl - d2 in LVQ2.1. Why do the reference vectors diverge in LVQ2.1, while they 
converge in GLVQ, as shown later? In order to clarify the convergence condition, 
let us consider the following learning rule: 
Wl <--' Wl q- OZIX -- w2Ik(x -- Wl) 
.- - - Wll/( x - 
Here, 
repulsive forces, respectively. The ratio of these two is cculated as follows: 
[Wl[ -- W=II - Wl[ [x - w2[ k-1 
[w2l - [x- wllklx - w2l [- Wl[ k-1 
(15) 
(16) 
I/Xw11 and law9.] are the quantities of updating by the attractive and the 
(17) 
If the initial values of reference vectors are properly defined, most x's will satisfy 
[a: - w[ < la: - w9.1. Therefore, if k > 1, the attractive force is greater than the 
repulsive force, and the reference vectors will converge, because the attractive forces 
come from a:'s that belong to the same class of Wl. In GLVQ, k = 2 as shown in 
Eqs. (9) and (10), and the vectors will converge, while they will diverge in LVQ2.1 
because k - 0. According to the above discussion, we can use di/(di + d9.) or just 
di, instead of di/(d] +d2) 9' in Eqs. (9) and (10). This correction does not affect the 
convergence condition. The essential problem in LVQ2.1 results from the drawback 
in Kohonen's rule with k = 0. In other words, the cost function used in LVQ is not 
determined so that the obtained learning rule satisfies the convergence condition. 
5 EXPERIMENTS 
5.1 PRELIMINARY EXPERIMENTS 
The experimental results using Eqs. (15) and (16) with a = 0.001, shown in Fig. 1, 
support the above discussion on the convergence condition. Two-dimensional input 
vectors with two classes shown in Fig. 1(a) were used in the experiments. The ideal 
decision boundary that minimizes the error rate is shown by the broken line. One 
reference vector was assigned to each class with initial values (x, y) = (0.3, 0.5) for 
Class A and (x,y) = (0.7,0.5) for Class B. Figure l(b) shows the distance between 
the two reference vectors during learning. The distance remains the same value for 
k > 1, while it increases with time for k _< 1; that is, the reference vectors diverge. 
Figure 2 shows the experimental results from GLVQ for linearly non-separable pat- 
terns compared with LVQ2.1. The input vectors shown in Fig. 2(a) were obtained 
by shifting all input vectors shown in Fig. l(a) to the right by [y - 0.5 I. The ideal 
Generalized Learning Vector Quantization 42 7 
1.0 
0.8 
0.6 
0.4 
0.2 
Class A 
Class B 
[] x 
[3 
x 
0.0 
0.0 0.2 0.4 0.6 0.8 1.0 
X Position 
(a) 
' k=O.O 
5.0 -  k = 0,5 
/ , k = 1.0 
/ / k= 1,5 "x-" 
4.0 -' I k=2.0 -- 
1.0 
0.0 '  
0 10 20 30 40 50 
Iteration 
(b) 
Figure 1: Experimental results that support the discussion on the convergence 
condition with one reference vector for each class. (a) Input vectors used in the 
experiments. The broken line shows the ideal decision boundary. (b) Distance 
between two reference vectors for each k value during learning. The distance remains 
the same value for k > 1, while it diverges for k _< 1. 
decision boundary that minimizes the error rate is shown by the broken line. Two 
reference vectors were assigned to each class with initial values (x,y) - (0.3,0.4) 
and (0.3, 0.6) for Class A, and (x, y) = (0.7, 0.4) and (0.7, 0.6) for Class B. The gain 
factor a was 0.004 in GLVQ and LVQ2.1, and the window parameter s in LVQ2.1 
was 0.8 in the experiments. 
Figure 2(b) shows the number of error counts for all the input vectors during 
learning. GLVQ(NL) shows results by GLVQ with a non-linear function; that is, 
Of/O!a -- f(tt, t){1 - f(tt, t)}. The number of error counts decreased with time to 
the minimum determined by the Bayes limit. GLVQ(L) shows results by GLVQ 
with a linear function; that is, Of/Oft = 1. The number of error counts did not 
decrease to the minimum. This indicates that non-linearity of the cost function is 
very effective for improving recognition ability. Results using LVQ2.1 show that the 
number of error counts decreased in the beginning, but overall increased gradually 
with time. The degradation in the recognition ability results from the divergence 
of the reference vectors, as we have mentioned earlier. 
5.2 CHARACTER RECOGNITION EXPERIMENTS 
Printed Chinese character recognition experiments were carried out to examine the 
performance of GLVQ. Thirteen kinds of printed fonts with 500 classes were used 
in the experiments. The total number of characters was 13,000; half of which were 
used as training data, and the other half were used as test data. As input vectors, 
256-dimensional orientation features were used (Hamanaka et al., 1993). Only one 
reference vector was assigned to each class, and their initial values were defined by 
averaging training data for each class. 
Recognition results for test data are tabulated in Table 1 compared with other 
methods. TM is the template matching method using mean vectors. LVQ2 is the 
earlier version of LVQ2.1. The learning algorithm is the same as LVQ2.1 described 
in Section 2, but di must be less than dj. The gain factor a was 0.05, and the window 
parameter s was 0.65 in the experiments. The experimental result by LVQ3 was 
428 A. SATO, K. YAMADA 
1.0 
0.8 
 0.6 
>. 0.4 
[] C. tas B 
160 
140 
120 
100 
0.2 x'< x 
0.0 I    " 40 
0.0 0.2 0.4 0.6 0.8 1.0 1.2 
X Position 
GLVQ(NL) 
GLVQ(L) _ -e__- 
LVQ2.1 -a--- 
0 20 40 60 80 1 O0 
Iteration 
Figure 2: Experimental results for linearly non-separable patterns with two refer- 
ence vectors for each class. (a) Input vectors used in the experiments. The broken 
line shows the ideal decision boundary. (b) The number of error counts during learn- 
ing. GLVQ (NL) and GLVQ (L) denote the proposed method using a non-linear 
and linear function in the cost function, respectively. This shows that non-linearity 
of the cost function is very effective for improving classification ability. 
Table 1: Experimental results for printed Chinese character recognition compared 
with other methods. 
Methods Error rates(%) 
TM  0.23 
LVQ22 0.18 
LVQ2.1 0.11 
IVQ s 0.08 
GLVQ 0.05 
1 Template matching using mean vectors. 
2The earlier version of LVQ2.1. 
SOur previous model (Improved Vector Quantization). 
the same as that by LVQ2.1, because only one reference vector was assigned to 
each class. IVQ (Improved Vector Quantization) is our previous model based on 
Kohonen's rule (Sato & Tsukumo, 1994). 
The error rate was extremely low for GLVQ, and a recognition rate of 99.95% was 
obtained. Ambiguous results can be rejected by thresholding the value of/t(x). If 
input vectors with/t(x) _ -0.02 were rejected, a recognition rate of 100% would 
be obtained, with a rejection rate of 0.08% for this experiment. 
6 CONCLUSION 
We proposed the Generalized Learning Vector Quantization as a new learning 
method. We formulated the criterion for learning as the minimizing of the cost 
function, and obtained the learning rule based on the steepest descent method. 
GLVQ is a generalized method that includes LVQ. We discussed the convergence 
condition and showed that the convergence property depends on the definition of 
Generalized Learning Vector Quantization 429 
the cost function. We proved that the essential problem of the divergence of the 
reference vectors in LVQ2.1 results from a drawback of Kohonen's rule that does 
not satisfy the convergence condition. Preliminary experiments revealed that non- 
linearity in the cost function is very effective for improving recognition ability. We 
carried out printed Chinese character recognition experiments and obtained a recog- 
nition rate of 99.95%. The experimental results revealed that GLVQ is superior to 
the conventional LVQ algorithms. 
Acknowledgements 
We are indebted to Mr. Jun Tsukumo and our colleagues in the Pattern Recognition 
Research Laboratory for their helpful cooperation. 
References 
Y. Le Cun, B. Bose, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and 
L. D. Jackel, "Handwritten Digit Recognition with a Back-Propagation Network," 
Neural Information Processing Systems 2, pp. 396-404 (1989). 
K. Yamada, H. Kanfi, J. Tsukumo, and T. Temma, "Handwritten Numeral Recog- 
nition by Multi-Layered Neural Network with Improved Learning Algorithm," Proc. 
of the International Joint Conference on Neural Networks 89, Vol. 2, pp. 259-266 
(1989). 
T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Springer-Verlag 
(1989). 
T. Kohonen, "LVQ_PAK Version 3.1 -- The Learning Vector Quantization Program 
Package," LVQ Programming Team of the Helsinki University of Technology, (1995). 
S. W. Lee and H. H. Song, "Optimal Design of Reference Models Using Simulated 
Annealing Combined with an Improved LVQ3," Proc. of the International Confer- 
ence on Document Analysis and Recognition, pp. 244-249 (1993). 
K. Miyahara and F. Yoda, "Printed Japanese Character Recognition Based on 
Multiple Modified LVQ Neural Network," Proc. of the International Conference on 
Document Analysis and Recognition, pp. 250-253 (1993). 
A. Sato and J. Tsukumo, "A Criterion for Training Reference Vectors and Improved 
Vector Quantization," Proc. of the International Conference on Neural Networks, 
Vol. 1, pp.161-166 (1994). 
N. R. Pal, J. C. Bezdek, and E. C.-K. Tsao, "Generalized Clustering Networks and 
Kohonen's Self-organizing Scheme," IEEE Trans. of Neural Networks, Vol. 4, No. 4, 
pp. 549-557 (1993). 
A. I. Gonzalez, M. Grafia, and A. D'Anjou, "An Analysis of the GLVQ Algorithm," 
IEEE Trans. of Neural Networks, Vol. 6, No. 4, pp. 1012-1016 (1995). 
M. Hamanaka, K. Yamada, and J. Tsukumo, "On-Line Japanese Character Recog- 
nition Experiments by an Off-Line Method Based on Normalization-Cooperated 
Feature Extraction," Proc. of the International Conference on Document Analysis 
and Recognition, pp. 204-207 (1993). 
