Boltzmann Machine learning using mean 
field theory and linear response correction 
H.J. Kappen 
Department of Biophysics 
University of Nijmegen, Geert Grooteplein 21 
NL 6525 EZ Nijmegen, The Netherlands 
F. B. Rodriguez 
Instituto de Ingenieria del Conocimiento & Departamento de Ingenieria InformS.tica, 
Universidad Autdnoma de Madrid, Canto Blanco,28049 Madrid, Spain 
Abstract 
We present a new approximate learning algorithm for Boltzmann 
Machines, using a systematic expansion of the Gibbs free energy to 
second order in the weights. The linear response correction to the 
correlations is given by the Hessian of the Gibbs free energy. The 
computational complexity of the algorithm is cubic in the number 
of neurons. We compare the performance of the exact BM learning 
algorithm with first order (Weiss) mean field theory and second 
order (TAP) mean field theory. The learning task consists of a fully 
connected Ising spin glass model on 10 neurons. We conclude that 
1) the method works well for paramagnetic problems 2) the TAP 
correction gives a significant improvement over the Weiss mean field 
theory, both for paramagnetic and spin glass problems and 3) that 
the inclusion of diagonal weights improves the Weiss approximation 
for parame[gnetic problems, but not for spin glass problems. 
I Introduction 
Boltzmann Machines (BMs) [1], are networks of binary neurons with a stochastic 
neuron dynamics, known as Glauber dynamics. Assuming symmetric connections 
between neurons, the probability distribution over neuron states g will become 
stationary and is given by the Boltzmann-Gibbs distribution P(. The Boltzmann 
distribution is a known function of the weights and thresholds of the network. 
However, computation of P(s-) or any statistics involving P(s-), such as mean firing 
rates or correlations, requires exponential time in the number of neurons. This is 
Boltzmann Machine Learning Using Mean Field Theory 281 
due to the fact that P(s-') contains a normalization term Z, which involves a sum 
over all states in the network, of which there are exponentially many. This problem 
is particularly important for BM learning. 
Using statistical sampling techiques [2], learning can be significantly improved [1]. 
However, the method has rather poor convergence and can only be applied to small 
networks. 
In [3, 4], an acceleration method for learning in BMs is proposed using mean field 
theory by replacing (sisj} by mimj in the learning rule. It can be shown [5] that 
such a naive mean field approximation of the learning rules does not converge in 
general. Furthermore, we argue that the correlations can be computed using the 
linear response theorem [6]. 
In [7, 5] the mean field approximation is derived by making use of the properties 
of convex functions (Jensen's inequality and tangential bounds). In this paper we 
present an alternative derivation which uses a Legendre transformation and a small 
coupling expansion [8]. It has the advantage that higher order contributions (TAP 
and higher) can be computed in a systematic manner and that it may be applicable 
to arbitrary graphical models. 
2 Boltzmann Machine learning 
The Boltzmann Machine is defined as follows. The possible configurations of the 
network can be characterized by a vector '- (s, .., si, .., sn), where si = +1 is the 
state of the neuron i, and n the total number of the neurons. Neurons are updated 
using Glauber dynamics. 
Let us define the energy of a configuration g as 
After long times, the probability to find the network in a state g becomes indepen- 
dent of time (thermal equilibrium) and is given by the Boltzmann distribution 
1 exp{-E(s-')}. (1) 
Z - y,exp{-E(s-')} is the partition function which normalizes the probability 
distribution. 
Learning [1] consists of adjusting the weights and thresholds in such a way that 
the Boltzmann distribution approximates a target distribution q(s-') as closely as 
possible. 
A suitable measure of the difference between the distributions p(s-) and q(s-') is the 
Kullback divergence [9] 
I: = E q(s-') log q(s--') (2) 
' p(s-)' 
Learning consists of minimizing I: using gradient descent [1] 
The parameter r/is the learning rate. The brackets (.) and ('}c denote the 'free' and 
'clamped' expectation values, respectively. 
282 H. J. Kappen and E B. Rodrt'guez 
The computation of both the free and the clamped expectation values is intractible, 
because it consists of a sum over all unclamped states. As a result, the BM learning 
algorithm can not be applied to practical problems. 
3 The mean field approximation 
We derive the mean field free energy using the small 7 expansion as introduced by 
Plefka [8]. The energy of the network is given by 
E(s,w,h,7) - 7Eint-E0isi 
Ein t 
i 
--5 y WijSiSj 
ij 
for 7 = 1. The free energy is given by 
F(w, O, 7) = - log Tr, e-E(*'w'e'v) 
and is a function of the independent variables wij, Oi and 7. We perform a Legendre 
OF The Gibbs free energy 
transformation on the variables Oi by introducing mi = ae' 
i 
is now a function of the independent variables mi and wij, and Oi is implicitly 
given by 
interaction 
-- mi. The expectation (.).is with respect to the full model with 
G(7) = G(0) + 7G'(0) + 72G"(0) + 0(73) 
We expand 
We directly obtain from [8] 
(7'(7) : (Eint). 
OOi 
+ SintE (si--mi) 
i ,y 
G"(7) - (ant) - 
For 7 = 0 the expectation values (.)v 
can directly compute: 
1 ( 
G(0) = Z (l+mi)log (l+mi)+(1- 
$ 
1 
(7'(0) = 
ij 
1 
G"(O) = 4 E w'j(1 - m/2)(1 - m) 
ij 
Thus 
become the mean field expectations which we 
log - 
1 ( 1 (1 + mi) + (1 - 
E (l q-rni)log 
1 
ij 
1 
- E wj(1 - m)(1 - m.) + O(w3f(m)) 
ij 
mi) log (1 -- mi)) 
Boltzmann Machine Learning Using Mean Field Theory 283 
where f(m) is some unknown function of m. 
The mean field equations are given by the inverse Legendre transformation 
Oi : OG _ tanh_(mi ) _ E wijmj q- E WJ lrti(1 -- m.) 
J 
which we recognize as the mean field equations. 
The correlations are given by 
02F 
OOi OOj OOj 
We therefore obtain from Eq. 3 
with 
__ 
Om J ij Ore2 J 
- = 
(4) 
(1 ) 
(A-)ij = 6'ij 1 - rn +  wu(1 - m) - w,j - 2m, mjwUj (5) 
Thus, for given Wij and Oi, we obtain the approximate mean firing rates mi by solv- 
ing Eqs. 4 and the correlations by their linear response approximations Eqs. 5. The 
inclusion of hidden units is straigthforward. One applies the above approximations 
in the free and the clamped phase separately [5]. The complexity of the method is 
O(n3), due to the matrix inversion. 
4 Learning without hidden units 
We will assess the accuracy of the above method for networks without hidden units. 
Let us define Cij -- (sisj) c - {si)c (sj} c, which can be directly computed from the 
data. The fixed point equation for AOi gives 
AOi = 0  mi = (si). (6) 
The fixed point equation for Awij , using Eq. 6, gives 
Awij = 0 ,t Aij = Cij, i  j. (7) 
From Eq. 7 and Eq. 5 we can solve for wij, using a standard least squares method. 
In our case, we used fsolve from Matlab. Subsequently, we obtain Oi from Eq. 4. 
We refer to this method as the TAP approximation. 
In order to assess the effect of the TAP term, we also computed the weights and 
thresholds in the same way as described above, but without the terms of order w 2 
in Eqs. 5 and 4. Since this is the standard Weiss mean field expression, we refer to 
this method as the Weiss approximation. 
The fixed point equations are only imposed for the off-diagonal elements of Awij 
because the Boltzmann distribution Eq. 1 does not depend on the diagonal elements 
wii. In [5], we explored a variant of the Weiss approximation, where we included 
diagonal weight terms. As is discussed there, if we were to impose Eq. 7 for i = j 
as well, we have A = C. If C is invertible, we therefore have A - = C - However, 
we now have more constraints than variables. Therefore, we introduce diagonal 
weights wii by adding the term wiimi to the righthandside of Eq. 4 in the Weiss 
approximation. Thus, 
Wij -- 1 - m 
and Oi is given by Eq. 4 in the Weiss approximation. Clearly, this method is com- 
putationally simpler because it gives an explicit expression for the solution of the 
weights involving only one matrix inversion. 
284 H. J. Kappen and E B. Rodrguez 
5 
Numerical results 
For the target distribution q(s) in Eq. 2 we chose a fully connected Ising spin glass 
model with equilibrium distribution 
1 exp{- 1 
q(s) = 2  E Jijsisj } 
ij 
with Jij i.i.d. Gaussian variables with mean  and variance a: 
-x' This model 
is lnown as the Sherrington-Kirkpatrick (SK) model [10]. Depending on the val- 
ues of J and J0, the model displays a para-magnetic (unordered), ferro-magnetic 
(ordered) and a spin-glass (frustrated) phase. For J0 = o, the para-magnetic (spin- 
glass) phase is obtained for J < 1 (J > 1). We will assess the effectiveness of our 
approximations for finite n, for J0 = 0 and for various values of J. Since this is a 
realizable task, the optimal KL divergence is zero, which is indeed observed in our 
simulations. 
We measure the quality of the solutions by means of the Kullback divergence. There- 
fore, this comparison is only feasible for small networks. The reason is that the 
computation of the Kullback divergence requires the computation of the Boltzmann 
distribution, Eq. 1, which requires exponential time due to the partition function 
Z. 
We present results for a network of n = 10 neurons. For J0 = 0, we generated 
for each value of 0.1 < J < 3, 10 random weight matrices Jij. For each weight 
matrix, we computed the q(s-*) on all 2 ' states. For each of the 10 problems, we 
applied the TAP method, the Weiss method and the Weiss method with diagonal 
weights. In addition, we applied the exact Boltzmann Machine learning algorithm 
using conjugate gradient descent and verified that it gives KL divergence equal 
to zero, as it should. We also applied a factorized model p(s-): I-Ii (1 + misi) 
with m i -- ($i)c to assess the importance of correlations in the target distribution. 
In Fig. la, we show for each J the average KL divergence over the 10 problem 
instances as a function of J for the TAP method, the Weiss method, the Weiss 
method with diagonal weights and the factorized model. We observe that the TAP 
method gives the best results, but that its performance deteriorates in the spin-glass 
phase (J > 1). 
The behaviour of all approximate methods is highly dependent on the individual 
problem instance. In Fig. lb, we show the mean value of the KL divergence of the 
TAP solution, together with the minimum and maximum values obtained on the 
10 problem instances. 
Despite these large fluctuations, the quality of the TAP solution is consistently 
better than the Weiss solution. In Fig. lc, we plot the difference between the TAP 
and Weiss solution, averaged over the 10 problem instances. 
In [5] we concluded that the Weiss solution with diagonal weights is better than 
the standard Weiss solution when learning a finite number of randomly generated 
patterns. In Fig. ld we plot the difference between the Weiss solution with and 
without diagonal weights. We observe again that the inclusion of diagonal weights 
leads to better results in the paramagnetic phase (J < 1), but leads to worse results 
in the spin-glass phase. For J > 2, we encountered problem instances for which 
either the matrix C is not invertible or the KL divergence is infinite. This problem 
becomes more and more severe for increasing J. We therefore have not presented 
results for the Weiss approximation with diagonal weigths for J > 2. 
Boltzmann Machine Learning Using Mean Field Theory 285 
Comparison mean values 
0 
0 
fact ... .' 
weiss+d 
weiss 
tap 
 / 
/ 
1 2 3 
J 
Difference WEISS and TAP 
_1- 0.5 
o 
1.5 
.>_ 
0.5 
o 
o 
'-' 0.5 
+ 
03 
uJ 
o 
-0.5 .... 0.5 
0 1 2 3 0 
TAP 
mean 
min .' 
max " 
1 2 
J 
Difference WEISS+D and WEISS 
J J 
Figure 1: Mean field learning of paramagnetic (J < 1) and spin glass (J > 1) 
problems for a network of 10 neurons. a) Comparison of mean KL divergences for 
the factorized model (fact), the Weiss mean field approximation with and without 
diagonal weights (weiss+d and weiss), and the TAP approximation, as a function 
of J. The exact method yields zero KL divergence for all J. b) The mean, mini- 
mum and maximum KL divergence of the TAP approximation for the 10 problem 
instances, as a function of J. c) The mean difference between the KL divergence 
for the Weiss approximation and the TAP approximation, as a function of J. d) 
The mean difference between the KL divergence for the Weiss approximation with 
and without diagonal weights, as a function of J. 
6 Discussion 
We have presented a derivation of mean field theory and the linear response correc- 
tion based on a small coupling expansion of the Gibbs free energy. This expansion 
can in principle be computed to arbitrary order. However, one should expect that 
the solution of the resulting mean field and linear response equations will become 
more and more difficult to solve numerically. The small coupling expansion should 
be applicable to other network models such as the sigmoid belief network, Potts 
networks and higher order Boltzmann Machines. 
The numerical results show that the method is applicable to paramagnetic problems. 
This is intuitively clear, since paramagnetic problems have a unimodal probability 
distribution, which can be approximated by a mean and correlations around the 
mean. The method performs worse for spin glass problems. However, it still gives 
a useful approximation of the correlations when compared to the factorized model 
which ignores all correlations. In this regime, the TAP approximation improves 
286 H. J. Kappen and E B. Rodrt'guez 
significantly on the Weiss approximation. One may therefore hope that higher order 
approximation may further improve the method for spin glass problems. Therefore. 
we cannot conclude at this point whether mean field methods are restricted to 
unimodal distributions. In order to further investigate this issue, one should also 
study the ferromagnetic case (J0 > 1, J > 1), which is multimodal as well but less 
challenging than the spin glass case. 
It is interesting to note that the performance of the exact method is absolutely 
insensitive to the value of J. Naively, one might have thought that for highly 
multi-modal target distributions, any gradient based learning method will suffer 
from local minima. Apparently, this is not the case: the exact KL divergence has 
just one minimum, but the mean field approximations of the gradients may have 
multiple solutions. 
Acknowledgement 
This research is supported by the Technology Foundation STW, applied science 
division of NWO and the techology programme of the Ministry of Economic Affairs. 
References 
[1] 
[2] 
[3] 
[4] 
[5] 
[6] 
[7] 
[8] 
[9] 
[lO] 
D. Acldey, G. Hinton, and T. Sejnowsld. A learning algorithm for Boltzmann Ma- 
chines. Cognitive Science, 9:147-169, 1985. 
C. Itzykson and J-M. Drouffe. Statistical Field Theory. Cambridge monographs on 
mathematical physics. Cambridge University Press, Cambridge, UK, 1989. 
C. Peterson and J.R. Anderson. A mean field theory learning algorithm for neural 
networks. Complex Systems, 1:995-1019, 1987. 
G.E. Hinton. Deterministic Boltzmann learning performs steepest descent in weight- 
space. Neural Computation, 1:143-150, 1989. 
H.J. Kappen and F.B. Rodriguez. Efficient learning in Boltzmann Machines using 
linear response theory. Neural Computation, 1997. In press. 
G. Parisi. Statistical Field Theory. Frontiers in Physics. Addison-Wesley, 1988. 
L.K. Saul, T. Jaakkola, and M.I. Jordan. Mean field theory for sigmoid belief net- 
works. Journal of artificial intelligence research, 4:61-76, 1996. 
T. Plefka. Convergence condition of the TAP equation for the infinite-range Ising 
spin glass model. Journal of Physics A, 15:1971-1978, 1982. 
S. Kullback. Information Theory and Statistics. Wiley, New York, 1959. 
D. Sherrington and S. Kirkpatrick. Solvable model of Spin-Glass. Physical review 
letters, 35:1792-1796, 1975. 
