S-Map: A network with a simple 
self-organization algorithm for generative 
topographic mappings 
Kimmo Kiviluoto 
Laboratory of Computer and 
Information Science 
Helsinki University of Technology 
P.O. Box 2200 
FIN-02015 HUT, Espoo, Finland 
Ki mmo. Kiviluot oChut. f i 
Erkki Oja 
Laboratory of Computer and 
Information Science 
Helsinki University of Technology 
P.O. Box 2200 
FIN-02015 HUT, Espoo, Finland 
Erkki. Oj ahu. f i 
Abstract 
The S-Map is a network with a simple learning algorithm that com- 
bines the self-organization capability of the Self-Organizing Map 
(SOM) and the probabilistic interpretability of the Generative To- 
pographic Mapping (GTM). The simulations suggest that the S- 
Map algorithm has a stronger tendency to self-organize from ran- 
dom initial configuration than the GTM. The S-Map algorithm 
can be further simplified to employ pure Hebbian learning, with- 
out changing the qualitative behaviour of the network. 
I Introduction 
The self-organizing map (SOM; for a review, see [1]) forms a topographic mapping 
from the data space onto a (usually two-dimensional) output space. The SOM has 
been succesfully used in a large number of applications [2]; nevertheless, there are 
some open theoretical questions, as discussed in [1, 3]. Most of these questions 
arise because of the following two facts: the SOM is not a generative model, i.e. it 
does not generate a density in the data space, and it does not have a well-defined 
objective function that the training process would strictly minimize. 
Bishop et al. [3] introduced the generative topographic mapping (GTM) as a solution 
to these problems. However, it seems that the GTM requires a careful initialization 
to self-organize. Although this can be done in many practical applications, from a 
theoretical point of view the GTM does not yet offer a fully satisfactory model for 
natural or artificial self-organizing systems. 
550 K. Kiviluoto and E. Oja 
In this paper, we first briefly review the SOM and GTM algorithms (section 2); then 
we introduce the S-Map, which may be regarded as a crossbreed of SOM and GTM 
(section 3); finally, we present some simulation results with the three algorithms 
(section 4), showing that the S-Map manages to combine the computational sim- 
plicity and the ability to self-organize of the SOM with the probabilistic framework 
of the GTM. 
2 SOM and GTM 
2.1 The SOM algorithm 
The self-organizing map associates each data vector t with that map unit that has 
its weight vector closest to the data vector. The activations rl of the map units are 
given by 
r/ = {1' when IIt, i- '11 < IIt,j vj # i 
O, otherwise (1) 
where/z i is the weight vector of the ith map unit i, i = 1,...,K. Using these 
activations, the SOM weight vector update rule can be written as 
K 
t.j := + 5 * Q; * - (2) 
i=1 
Here parameter 5 t is a learning rate parameter that decreases with time. The neigh- 
borhood function h(i, j;/t) is a decreasing function of the distance between map 
units i and j;/t is a width parameter that makes the neighborhood function get 
narrower as learning proceeds. One popular choice for the neighborhood function 
is a Gaussian with inverse variance/t. 
2.2 The GTM algorithm 
In the GTM algorithm, the map is considered as a latent space, from which a 
nonlinear mapping to the data space is first defined. Specifically, a point  in the 
latent space is mapped to the point v in the data space according to the formula 
L 
v(;M) = Mb() = E qbj()/zj (3) 
j=l 
where qb is a vector consisting of L Gaussian basis functions, and M is a D x L 
matrix that has vectors luj as its columns, D being the dimension of the data space. 
The probability density p() in the latent space generates a density to the manifold 
that lies in the data space and is defined by (3). If the latent space is of lower 
dimension than the data space, the manifold would be singular, so a Gaussian noise 
model is added. A single point in the latent space generates thus the following 
density in the data space: 
p(]; M,/) ()D/2 
= exp[-l]v(;M)-,] 2] (4) 
where/ is the inverse of the variance of the noise. 
The key point of the GTM is to approximate the density in the data space by 
assuming the latent space prior p() to consist of equiprobable delta functions that 
S-Map 551 
form a regular lattice in the latent space. The centers i of the delta functions are 
called the latent vectors of the GTM, and they are the GTM equivalent to the SOM 
map units. The approximation of the density generated in the data space is thus 
given by 
K 
1 
p({IM,/) =  EP({[i;M,/) (5) 
i=1 
The parameters of the GTM are determined by minimizing the negative log likeli- 
hood error 
(M,/) - - Eln ip({tl&;M,/) (6) 
t=l i----1 
over the set of sample vectors {{t}. The batch version of the GTM uses the EM 
algorithm [4]; for details, see [3]. One may also resort to an on-line gradient descent 
procedure that yields the GTM update steps 
K 
/t+l .__ t 5t /t 
J '-- J q- E rl(Mt'/t)qJ(i)[{t - v(i;Mt)] (7) 
i----1 
/t+ :=/ +5 t v(Mt,/)ll{t-v(i;Mt)ll 2 -. (8) 
where /(M,/) is the GTM counterpart to the SOM unit activation, the posterior 
probability p(&lt; M,/) of the latent vector i given data vector t. 
r/ (M,/) = p(&lt;M,/) 
p({tl&;M,/) 
-2iK,= p({tli, ; M, / ) (9) 
exp[-  IIv(&; M) - tl12] 
- Z=x exp[-llv(i,; M) - tll 
2.3 Connections between SOM and GTM 
Let us consider a GTM that has an equal number of latent vectors and basis func- 
tions  , each latent vector i being the center for one Gaussian basis function bi(). 
Latent vector locations may be viewed as units of the SOM, and consequently the 
basis functions may be interpreted as connection strengths between the units. Let 
us use the shorthand notation b E qj(i). Note that b. -- , and assume that 
the basis functions be normalized so that E=i ' -- E/K=I (' '- 1. 
At the zero-noise limit, or when/ - c, the softmax activations of the GTM given 
in (9) approach the winner-take-all function (1) of the SOM. The winner unit c(t) 
for the data vector t is the map unit that has its image closest to the data vector, 
so that the index c(t) is given by 
c(t)=argminllv(&)-{tll=argminl(kc/.lj) -{tll (10) 
i i j----1 
Note that this choice serves the purpose of illustration only; to use GTM properly, 
one should choose much more latent vectors than basis functions. 
552 K. Kiviluoto and E. Oja 
The GTM weight update step (7) then becomes 
tq-1 t (j [ _ v(c(t);Mt)] 
J ..__ #j q_ (t (t) t 
(11) 
This resembles the variant of SOM, in which the winner is searched with the rule (10) 
and weights axe updated as 
tq-1 t j (t -- #j) 
J 1-" lJ q_ (t (t) t 
(12) 
Unlike the original SOM rules (1) and (2), the modified SOM with rules (10) 
and (12) does minimize a well-defined objective function: the $OM distortion mea- 
sure [5, 6, 7, 1]. However, there is a difference between GTM and SOM learning 
rules (11) and (12). With SOM, each individual weight vector moves towards the 
data vector, but with GTM, the image of the winner latent vector v((t); M) moves 
towaxds the data vector, and all weight vectors/zj move to the same direction. 
For nonzero noise, when 0 </ < c, there is more difference between GTM and 
SOM: with GTM, not only the winner unit but activations from other units as well 
contribute to the weight update. 
3 S-Map 
Combining the softmax activations of the GTM and the learning rule of the SOM, 
we axrive at a new algorithm: the S-Map. 
3.1 The S-Map algorithm 
The S-Map resembles a GTM with an equal number of latent vectors and basis 
functions. The position of the ith unit on the map is is given by the latent vector 
i; the connection strength of the unit to another unit j is b-, and a weight vector 
/z i is associated with the unit. The activation of the unit is obtained using rule (9). 
The S-Map weights leaxn proportionally to the activation of the unit that the weight 
is associated with, and the activations of the neighboring units: 
:= uj + W- u}) (13) 
i=1 
which can be further simplified to a fully Hebbian rule, updating each weight pro- 
portionally to the activation of the corresponding unit only, so that 
/t+ t 
j :---- j q_ 5t]( t -- ) (14) 
The paxameter/ value may be adjusted in the following way: start with a small 
value, slowly increase it so that the map unfolds and spreads out, and then keep 
increasing the value as long as the error (6) decreases. The parameter adjustment 
scheme could also be connected with the topographic error of the mapping, as 
proposed in [9] for the SOM. 
Assuming normalized input and weight vectors, the "dot-product metric" form of 
the learning rules (13) and (14) may be written as 
i----1 
(15) 
S-Map 553 
and 
t (16) 
respectively; the matrix in the second penthesis keeps the weight vectors norm- 
ized to unit lenh, sumg a sml ue for the leaning rate paxmeter 5 t [8]. 
The dot-product metric form of a unit tivity is 
exp ] 
which appromates the posterior probability p(iIt;M,) that the data vector 
were generated by that specific unit. This is bed on the obsermtion that if the 
data vectors {t} e normized to unit lenh, the density generated in the data 
space (unit sphere in B ) becomes 
P(li; M, 3) = , constant ] x exp b}/j  
j=l 
(18) 
3.2 S-Map algorithm minimizes the GTM error function in 
dot-product metric 
The GTM error function is the negative log likelihood, which is given by (6) and is 
reproduced here: 
(M,) = - In ip(ttli;M,) (19) 
When the weights axe updated using a batch version of (15), accumulating the 
updates for one epoch, the expected value of the error [4] for the unit i is 
T 
E( ew) = - E pld(i!t;M, ) lnnew(i)pnew(t[i;M,])] 
t=l r/ta. t =i/K 
T (20) 
k]? ld't k i. new t 
= - . rkjpj ) + terms not involving the weight vectors 
t=l j=l 
The change of the error for the whole map after one epoch is thus 
K T K 
__ -- --  old,t0i new _ flld)rt 
E(new old)  E  }i ]J 
i=1 t=l j=l 
k(kk__old,ti,t T fldfld T) (k k 
j=l , t=l i=1 , , tt=l i'=1 -- (21) 
K 
-[ ) l 0 
j=l 
with equality only when the weights are ready in the error minimum. 
554 K. Kiviluoto and E. Oja 
4 Experimental results 
The self-organization ability of the SOM, the GTM, and the S-Map was tested on 
an artificial data set: 500 points from a uniform random distribution in the unit 
square. 
The initial weight vectors for all models were set to random values, and the final 
configuration of the map was plotted on top of the data (figure 1). For all the algo- 
rithms, the batch version was used. The $OM was trained as recommended in [1] - 
in two phases, the first staxting with a wide neighborhood function, the second with 
a narrow neighborhood. The GTM was trained using the Matlab implementation 
by Svensn, following the recommendations given in [10]. The S-Map was trained in 
two ways: using the "full" rule (13), and the simplified rule (14). In both cases, the 
parameter/ value was slowly increased every epoch; by monitoring the error (6) of 
the S-Map (see the error plot in the figure) the suitable value for/ can be found. 
In the GTM simulations, we experimented with many different choices for basis 
function width and their number, both with normalized and unnormalized basis 
functions. It turned out that GTM is somewhat sensitive to these choices: it had 
difficulties to unfold after a random initialization, unless the basis functions were set 
so wide (with respect to the weight matrix prior) that the map was well-organized 
already in its initial configuration. On the other hand, using very wide basis func- 
tions with the GTM resulted in a map that was too rigid to adapt well to the data. 
We also tried to update the parameter/ according to an annealing schedule, as 
with the S-Map, but this did not seem to solve the problem. 
Figure 1: Random initialization (top left), SOM (top middle), GTM (top right), 
"full" S-Map (bottom left), simplified S-Map (bottom middle). On bottom right, 
the S-Map error as a function of epochs is displayed; the parameter/ was slightly 
increased every epoch, which causes the error to increase in the early (unfolding) 
phase of the learning, as the weight update only minimizes the error for a given/. 
5 Conclusions 
The S-Map and SOM seem to have a stronger tendency to self-organize from random 
initialization than the GTM. In data analysis applications, when the GTM can 
S-Map 555 
be properly initialized, SOM, S-Map, and GTM yield comparable results; those 
obtained using the latter two algorithms are also straightforward to interpret in 
probabilistic terms. In Euclidean metric, the GTM has the additional advantage 
of guaranteed convergence to some error minimum; the convergence of the S-Map 
in Euclidean metric is still an open question. On the other hand, the batch GTM 
is computationally clearly heavier per epoch than the S-Map, while the S-Map is 
somewhat heavier than the SOM. 
The SOM has an impressive record of proven applications in a variety of different 
tasks, and much more experimenting is needed for any alternative method to reach 
the same level of practicality. SOM is also the basic bottom-up procedure of self- 
organization in the sense that it starts from a minimum of functional principles 
realizable in parallel neural networks. This makes it hard to analyze, however. 
A probabilistic approach like the GTM stems from the opposite point of view by 
emphasizing the statistical model, but as a trade-off, the resulting algorithm may 
not share all the desirable properties of the SOM. Our new approach, the S-map, 
seems to have succeeded in inheriting the strong self-organization capability of the 
SOM, while offering a sound probabilistic interpretation like the GTM. 
References 
[lO] 
[1] T. Kohonen, Self-Organizing Maps. Springer Series in Information Sciences 30, 
Berlin Heidelberg New York: Springer, 1995. 
[2] T. Kohonen, E. Oja, O. Simula, A. Visa, and J. Kangas, "Engineering applica- 
tions of the self-organizing map," Proceedings of the IEEE, vol. 84, pp. 1358- 
1384, Oct. 1996. 
[3] C. M. Bishop, M. Svensen, and C. K. I. Williams, "GTM: A principled alterna- 
tive to the self-organizing map," in Advances in Neural Information Processing 
Systems (to appear) (M. C. Mozer, M. I. Jordan, and T. Petche, eds.), vol. 9, 
MIT Press, 1997. 
[4] A. P. Dempster, N.M. Laird, and D. B. Rubin, "Maximum likelihood from in- 
complete data via the EM algorithm," Journal of the Royal Statistical Society, 
vol. B 39, no. 1, pp. 1-38, 1977. 
[5] S. P. Luttrell, "Code vector density in topographic mappings," Memorandum 
4669, Defense Research Agency, Malvern, UK, 1992. 
[6] T. M. Heskes and B. Kappen, "Error potentials for self-organization," in Pro- 
ceedings of the International Conference on Neural Networks (ICNN'93), vol. 3, 
(Piscataway, New Jersey, USA), pp. 1219-1223, IEEE Neural Networks Coun- 
cil, Apr. 1993. 
[7] S. P. Luttrell, "A Bayesian analysis of self-organising maps," Neural Compu- 
tation, vol. 6, pp. 767-794, 1994. 
[8] E. Oja, "A simplified neuron model as a principal component analyzer," Jour- 
nal of Mathematical Biology, vol. 15, pp. 267-273, 1982. 
[9] K. Kiviluoto, "Topology preservation in self-organizing maps," in Proceedings 
of .the International Conference on Neural Networks (ICNN'96), vol. 1, (Pis- 
cataway, New Jersey, USA), pp. 294-299, IEEE Neural Networks Council, June 
1996. 
M. Svensn, The GTM toolbox - user's guide. Neural Computing Research 
Group ] Aston University, Birmingham, UK, 1.0 ed., Oct. 1996. Available at 
URL http://neural-server. aston. ac. uk/GTM/MATLAB_Impl. html. 
