511 
CONVERGENCE AND PATTERN STABILIZATION 
IN THE BOLTZMANN MACHINE 
Moshe Kam 
Dept. of Electrical and Computer Eng. 
Drexel University, Philadelphia PA 19104 
Roger Cheng 
Dept. of Electrical Eng. 
Princeton University, NJ 08544 
ABSTRACT 
The Boltzmann Machine has been introduced as a means to perform 
global optimization for multimodal objective functions using the 
principles of simulated annealing. In this paper we consider its utility 
as a spurious-free content-addressable memory, and provide bounds on 
its performance in this context. We show how to exploit the machine's 
ability to escape local minima, in order to use it, at a constant 
temperature, for unambiguous associative pattern-retrieval in noisy 
environments. An association rule, which creates a sphere of influence 
around each stored pattern, is used along with the Machine's dynamics 
to match the machine's noisy input with one of the pre-stored patterns. 
Spurious fixed points, whose ragions of attraction are not recognized by 
the rule, are skipped, due to the Machine's finite probability to escape 
from any state. The results apply to the Boltzmann machine and to the 
asynchronous net of binary threshold elements (*rlopfield model'). They 
provide the network designer with worst-case and best-case bounds for 
the network's performance, and allow polynomial-time tradeoff studies 
of design parameters. 
I. INTRODUCTION 
The suggestion that artificial neural networks can be utilized for classification, pattern 
recognition and associative recall has been the main theme of numerous studies which 
appeared in the last decade (e.g. Rumelhart and McClelland (1986) and Grossberg (1988) - 
and their references.) Among the most popular families of neural networks are fully 
connected networks of binary threshold elements (e.g. Amari (1972), Hopfield (1982).) 
These structures, and the related family of fully connected networks of sigmoidal threshold 
elements have been used as error-correcting decoders in many applications, among which 
were interesting applications in optimization (Hopfield and Tank, 1985; Tank and 
Hopfield, 1986; Kennedy and Chua, 1987.) A common drawback of the many studied 
schemes is the abundance of 'spurious' local minima, which 'trap' the decoder in 
undesirable, and often non-interpretable, states during the process of input / stored-pattern 
association. It is generally accepted now that while the number of arbitrary binary patterns 
that can be stored in a fully-connected network is of the order of magnitude of N (N = 
number of the neurons in the network,) the number of created local attractors in the 
512 Kam and Cheng 
network's state space is exponential in N. 
It was proposed (Ackley et al., 1985; Hinton and Sejnowski, 1986) that fully-connected 
binary neural networks, which update their states on the basis of  
state-reassessment rules, could be used for global optimization when the objective 
function is multi-modal. The suggested architecture, the Boltzmann machine, is based on 
the principles of simulated annealing ( Kirkpatrick et al., 1983; Geman and Geman, 1984; 
Aarts et al., 1985; Szu, 1986,) and has been shown to perform interesting tasks of 
decision making and optimization. However, the learning algorithm that was proposed for 
the Machine, along with its "cooling" procedures, do not lend themselves to real-time 
operation. Most studies so far have concentrated on the properties of the Machine in 
global optimization and only few studies have mentioned possible utilization of the 
Machine (at constant 'temperature') as a content-addressable memory (e.g. for local 
optimization.) 
In the present paper we propose to use the Boltzmann machine for associative retrieval, 
and develop bounds on its performance as a content-addressable memory. We introduce a 
learning algorithm for the Machine, which locally maximizes the stabilization probability 
of learned patterns. We then proceed to calculate (in polynomial time) upper and lower 
bounds on the probability that a tuple at a given initial Hamming distance from a stored 
pattern will get attracted to that pattern. A proposed association rule creates a sphere of 
influence around each stored pattern, and is indifferent to 'spurious' attractors. Due to the 
fact that the Machine has a nonzero probability of escape from any state, the 'spurious' 
attractors are ignored. The obtained bounds allow the assessment of retrieval probabilities, 
different learning algorithms and necessary learning periods, network 'temperatures' and 
coding schemes for items to be stored. 
H. THE MACHINE AS A CONTENT ADDRESSABLE 
MEMORY 
The Boltzmann Machine is a multi-connected network of N simple processors called 
probabilistic binary neurons. The ith neuron is characterized by N-1 real numbers 
representing the synaptic weights (wij, j=l,2,...,i-l,i+l,...,N; wii is assumed to be zero 
for all i), a real threshold (xi) and a binary activity level (ui  B ={-1,1},) which we 
shall also refer to as the neuron's state. The binary N-tuple U = [Ul,U2,.   ,UN] is called 
the network's state. We distinguish between two phases of the network's operation: 
a) The learning phase - when the network parameters wij and xi are determined. This 
determination could be achieved through autonomous learning of the binary pattern 
environment by the network (unsupervised learning); through learning of the environment 
with the help of a 'teacher' which supplies evaluafive reinforcement signals (supervised 
learning); or by an external fixed assignment of parameter values. 
b) Th? production phase - when the network's state U is determined. This determination 
could be performed synchronously by all neurons at the same predetermined time instants, 
or asynchronously - each neuron reassesses its state independently of the other neurons at 
random times. The decisions of the neurons regarding their states during reassessment can 
be arrived at deterministically (the set of neuron inputs determines the neuron's state) or 
Convergence and Pattern-Stabilization 513 
stochastically (the set of neuron inputs shapes a probability distribution for the 
state-selection law.) 
We shall describe first the (asynchronous and stochastic) production rule which our 
network employs. At random times during the production phase, asynchronously and 
independently of all other neurons, the i th neuron decides upon its next state, using the 
probabilistic decision rule 
1 with probabilty 
(1) 
-1 with probabilty 'ar where 
l+e-' 
N 
AEi= Z WijUj --'i 
is called the ith energy gap, and Te is a predetermined real number called temperature. 
The state-updating rule (1) is related to the network's energy level which is described by 
E--- i=l 
If the network is to find a local minimum of E in equation (2), then the ith neuron, when 
chosen (at random) for state updating, should choose deterministically 
u i = sgn Wij Uj -- 'r, i . (3) 
Lj=l,ji 
W'e note that rule in equation (3) is obtained from rule in equation (1) as Te - 0. This 
deterministic choice of ui guarantees, under symmetry conditions on the weights 
(wij=wji), that the network's state will stabilize at afixed point in the 2N-tuple state 
space of the network (Hopfield, 1982), where 
Dfinition 1; A state Uf BN is called a fixed point in the state space of the N-neuron 
network ff 
PJ(n+l)--Uf I U(h)-Uf]= 1. (4) 
A fixed point found through iterations of equation (3) (with i chosen at random at each 
iteration) may not be the global minimum of the energy in equation (2). A mechanism 
which seeks the global minimum should avoid local-minimum "traps" by allowing 
'uphill' climbing with respect to the value of E. The decision scheme of equation (1) is 
devised for that purpose, allowing an increase in E with nonzero probability. This 
provision for progress in the locally 'wrong' direction in order to reap a 'global' advantage 
later, is in accordance with the principles of simulated annealing techniques, which are 
used in multimodal optimization. In our case, the probabilities of choosing the locally 
'fight' decision (equation (3)) and the locally 'wrong' decision are determined by the ratio 
and Oheng 
of the energy gap AF.i to the 'temperature' shaping constant Te. 
The Boltzmann Machine has been proposed for global minimization and a considerable 
amount of effort has been invested in devising a good cooling scheme, namely a means to 
control Te in order to guarantee the finding of a global minimum in a short time (Geman 
and Geman, 1984, Szu, 1987.) However, the network may also be used as a selective 
content addressable memory which does not suffer from inadvertently-installed spurious 
local minima. 
We consider the following application of the Boltzmann Machine as a scheme for pattern 
classification under noisy conditions: let an encoder emit a sequence of NX1 binary code 
vectors from a set of Q codewords (or 'patterns',) each having a probability of occurrence 
of rl m (m = 1,2,... ,Q). The emitted pattern encounters noise and distortion before it 
arrives at the decoder, resulting in some of its bits being reversed. The Boltzmann 
Machine, which is the decoder at the receiving end, accepts this distorted pattern as its 
initial state (U(0)), and observes the consequent time evolution of the network's state U. 
At a certain time instant no, the Machine will declare that the input pattern U(0) is to be 
associated with pattern Bm if U at that instant (u(n0)) is 'close enough' to Bin. For this 
purpose we define 
Definition 2: The dnmx-sphere of influence of pattern Bin, c( Bm, dmax) is 
O(Bm, dmax)= {U BN: HD(U, Bm) <- dmax}. (5) 
dmax is prespecified. 
Let Y_.(dmax) = u O(B m, Clma x ) and let n o be the smallest integer such that u(nO Y-.(dmax 
in 
Therule of association is' associate U 4) with Bmat time n o, ifU (n0 which has evolved 
from U () satisfies: u4n0 o(B m, dmax). 
Due to the finite probability of escape from any minimum, the energy minima which 
correspond to spurious fixed points are skipped by the network on its way to the energy 
valleys induced by the designed fixed points (i.e. B 1,.   ,BQ.) 
III. ONE-STEP CONTRACTION PROBABILITIES 
Using the association rule, the utility of the Boltzmann machine for error correction 
involves the probabilities 
Pr{HD[U(n),Bin]<dmax [HD[U(0),Bm]=d} re=l,2,..., Q (6) 
for predetermined n and dmax. In order to calculate (6) we shall first calculate the 
following one-step attraction probabilities: 
P03m,Cl,)=Pr{HD[h, + 1),Bin]=d+ IHD[U(%),Bin]=d}where=-l, 0, 1 (7) 
For  = -1 we obtain the probability of convergence ; For 15 = +1 we obtain the 
probability of divergence; For 15 = 0 we obtain the probability of stalemate. 
An exact calculation of the attraction probabilities in equation (7) is time-exponential and 
we shall therefore seek lower and upper bounds which can be calculated in polynomial 
time. We shall reorder the weights of each neuron according to their contribution to the 
Convergence and Pattern-Stabilization 51  
AEi for each pattern, using the notation 
Wl"={Wilbm,,webmv..., wbm3 =maxWi 
fs = max{ W[ni- (f, f2, "- ,fs-1 } } 
(8) 
i =l,2,...,N, s =2,3, ...,N, m=1,2, ...,Q 
d d 
Let AEgi(d)= Agid - 2E f  and AEmi+.(d)= Agid 
r=-I r=-I 
- CN+,-r (9) 
These values represent the maximum and minimum values that the ith energy gap could 
assume when the network is at HD of d from Bin. Using these extrema, we can find the 
worst case attraction probabilities: 
d il I U-l(bmi) 1 - U-l(bmi) 
 + (10a) 
PW(Bm, d,-1) = ' Pi AE,(d) AE(d) e- 
l+d- 'T l+e- T, 
N U_l(bmi) e- T, 
l+e- T e 
1 - U_i (brai) 
l+e- Te 
and the best case attraction probabilities  
N 
Pb(Bm, d,- 1) =' , Pi ---- 'd) + AE(d) e 
l+e-  l+e- T 
(lOb) 
(11a) 
pbC(B m, d, 1 ) = 
N 
i=l 
where for both cases 
U_l(bmi) e- T 1 - U_l(bmi) 
+ 
l+e- T l+e- T, 
PWC(be)(B m, d, 0)= 1 -PWC0ac) m, d, - 1)-PWe(bC)(Bm, d, 1). 
(lib) 
(12) 
For the worst- (respectively, best-) case probabilities, we have used the extreme values of 
AEmi(d) to underestimate (overestimate) convergence and overestimate (underestimate) 
divergence, given that there is a disagreement in d of the N positions between the 
network's state and Bm; we assume that errors are equally likely at each one of the bits. 
IV. ESTIMATION OF RETRIEVAL PROBABILITIES 
To estimate the retrieval probabilities, we shall study the Hamming distance of the 
516 Earn and Cheng 
network's state from a stored pattern. The evolution of the network from state to state, as 
affecting the distance from a stored pattern, can be interpreted in terms of a birth-and-death 
Markov process (e.g. Howard, 1971) with the probability transition matrix 
 lPbi, Pdi) = 
1- 
0 Pa2 
o o 
o 
o 
o 
0 0 0 0 0 
p 0 0 0 
1-12-Pd P2 0 0 
0 
0 p 1-1k-P lk 0 
o 
o PdN-1 I-PbN-I-PdN-1 PbN-1 
0 0 PdN 1-PdN 
(13) 
where the birth probability Phi is the divergence probability of increasing the HD from i 
to i+l, and the death probability Pdi is the contraction probability of decreasing the HD 
from i to i-1. 
Given that an input pattern was at HD of do from Bm, the probability that after n steps 
the network will associate it with Bm is 
PtU(n)-gm] I HD[0)m]= c10}=Pr[HDCUCn),Bm)=r I HDOJ),Bm)= c!0] (14) 
r=0 
where we can use the one-step bounds found in section III in order to calculate the 
worst-case and best-case probabilities of association. Using equations (10) and (11) we 
define two matrices for each pattern Bin; a worst case matrix, , and a best case matrix,: 
Worst case matrix 
We  
Pbi-P (Bm,  ,+1) 
Pdi= pwC(Bm, i ,-1) 
Best case matfix 
Pbi =Pb(Bm, i ,+1) 
Pdi=pb(Bm, i ,-1). 
Using these matrices, it is now possible to calculate lower and upper bounds for the 
association probabilities needed in equation (14): 
[tgdo (xle)n]r_( Pr [HD(U(n),Bm)= r J HD(u(O),Bm)= dO] _<[rdo OPl:!)n]r (15a) 
where Ix] i indicates the i t element of the vector x, andrao is the unit 1 x n+ 1 vector 
Convergence and Pattern-Stabilization 51 7 
1 i=d 0 
[;doli= l (lSb) 
0 id o 
The bounds of equation 14(a) can be used to bound the association probability in equation 
(13). The upper bound of the association probability is obtained by replacing the summed 
terms in (13) by their upper-bound values: 
Pr {u(n)-->Bm] I HD [U(0),Bm] = do} <E[do (w)n]r (16a) 
r--0 
The lower bound cannot be treated similarly, since it is possible that at some instant of 
time prior to the present time-step (n), the network has already associated its state U with 
one of the other patterns. We shall therefore use as the lower bound on the convergence 
probability in equation (14): 
Z[tr%( n]r< Pr{lJ(n)-B m] IHD 
) _ [U(0),Bm} 
r=0 
where the  matrix is the birth-and-death matrix (13) with 
(16b) 
Phi--' 
pwc(B m, i ,+1 ) 
PW(Bm, i,-1) for i= 1,2,...,gi-1 
(16c) 
0 for i = gi, gi +1,..., N 
!i=minHD(Bi, Bj)-dma x j= 1,...,Q,j:i (16d) 
Equation (16c) and (16d) assume that the network wanders into the dmax- sphere of 
influence of a pattern other than Bi, whenever its distance from Bi is [ti or more. This 
assumption is very conservative, since I.q represents the shortest distance to a competing 
dmax- sphere of influence, and the network could actually wander to distances larger than 
[li and still converge eventually into the dmax -sphere of influence of B i. 
CONCLUSION 
We have presented how the Boltzmann Machine can be used as a content-addressable 
memory, exploiting the stochastic nature of its state-selection procedure in order to escape 
undesirable minima. An association rule in terms of patterns' spheres of influence is used, 
along with the Machine's dynamics, in order to match an input tuple with one of the 
predetermined stored patterns. The system is therefore indifferent to 'spurious' states, 
whose spheres of influence are not recognized in the retrieval process. We have detailed a 
technique to calculate the upper and lower bounds on retrieval probabilities of each stored 
518 Kam and Cheng 
pattern. These bounds are functions of the network's parameters (i.e. assignment or 
learning rules, and the pattern sets); the initial Hamming distance from the stored pattern; 
the association rule; and the number of production steps. They allow a polynomial-time 
assessment of the network's capabilities as an associative memory for a given set of 
patterns; a comparison of different coding schemes for patterns to be stored and retrieved; 
an assessment of the length of the learning period necessary in order to guarantee a 
prespecified probability of retrieval; and a comparison of different leaming/assignment 
rules for the network parameters. Examples and additional results are provided in a 
companion paper (Karo and Cheng, 1989). 
Acknowledgements 
This work was supported by NSF grant IRI 8810186. 
References 
[1] Aarts,E.H.L., Van Laarhoven,P.J.M. : "Statistical Cooling: A General 
Approach to Combinatorial Optimization Problems," Phillips J. Res., Vol. 40, 1985. 
[2] Ackley,D.H., Hinton,J.E., Sejnowski,T.J. :"A Learning Algorithm for 
Boltzmann Machines," Cognitive Science, Vol. 19, pp. 147-169, 1985. 
[3] Amari,S-I: "Learning Patterns and Pattern Sequences by Self-Organizing Nets of 
Threshold Elements," IEEE Trans. Computers, Vol. C-21, No. 11, pp. 1197-1206, 1972. 
[4] Geman,S., Geman,D.: "Stochastic Relaxation, Gibbs Distributions, and the 
Bayesian Restoration of Images" IEEE Trans. Patt. Anal. Mach. Int., pp. 721-741, 1984. 
[5] Grossberg,S.: "Nonlinear Neural Networks: Principles, Mechanisms, and 
Architectures," Neural Networks, Vol. 1, 1988. 
[6] Hebb,D.O.: The Organization of Behavior, New York:Wiley, 1949. 
[7] Hinton,J.E., Sejnowski,T.J. "Learning and Relearning in the Boltzmann 
Machine," in [14] 
[8] Hopfield,J.J. : "Neural Networks and Physical Systems with Emergent 
Collective Computational Abilities," Proc. Nat. Acad. Sci. USA, pp. 2554-2558, 1982. 
[9] Hopfield,J.J., Tank,D. :" 'Neural' Computation of Decision in Optimization 
Problems," Biological Cybernetics, Vol. 52, pp. 1-12, 1985. 
[10] Howard,R.A.: Dynamic Probabilistic Systems, New York:Wiley, 1971. 
[11] Kam,M., Cheng,R.: "Decision Making with the Boltzmann Machine," 
Proceedings of the 1989 American Control Conference, Vol. 1, Pittsburgh, PA, 1989. 
[12] Kennedy,M.P., Chua, L.O. :"Circuit Theoretic Solutions for Neural 
Networks," Proceedings of the ISt Int. Conf. on Neural Networks, San Diego, CA, 1987. 
[13] Kirkpatrick,S, Gellat,C.D.,Jr., Vecchi,M.P. :"Optimization by 
Simulated Annealing," Science, 220, pp. 671-680, 1983. 
[14] Rumelhart,D.E., McClelland,J.L. (editors): Parallel Distributed 
Processing, Volume I: Foundations, Cambridge:MIT press, 1986. 
[15] Szu,H.: "Fast Simulated Annealing," in Denker,J.S.(editor) : Neural 
Networks for Computing, New York:American Inst. Phys., Vol. 51.,pp. 420-425, 1986. 
[16] Tank,D.W., Hopfield, J.J.: "Simple 'Neural' Optimization Networks," IEEE 
Transactions on Circuits and Systems, Vol. CAS-33, No. 5, pp. 533-541, 1986. 
