A Revolution: Belief Propagation in 
Graphs With Cycles 
Brendan J. Frey* 
http://n. cs. utoronto. ca/frey 
Department of Computer Science 
University of Toronto 
David J. C. MacKay 
http://wol. ra. phy. cam. ac. uk/mackay 
Department of Physics, Carendish Laboratory 
Cambridge University 
Abstract 
Until recently, artificial intelligence researchers have frowned upon 
the application of probability propagation in Bayesian belief net- 
works that have cycles. The probability propagation algorithm is 
only exact in networks that are cycle-free. However, it has recently 
been discovered that the two best error-correcting decoding algo- 
rithms are actually performing probability propagation in belief 
networks with cycles. 
I Communicating over a noisy channel 
Our increasingly wired world demands efficient methods for communicating bits of 
information over physical channels that introduce errors. Examples of real-world 
channels include twisted-pair telephone wires, shielded cable-TV wire, fiber-optic 
cable, deep-space radio, terrestrial radio, and indoor radio. Engineers attempt 
to correct the errors introduced by the noise in these channels through the use 
of channel coding which adds protection to the information source, so that some 
channel errors can be corrected. A popular model of a physical channel is shown 
in Fig. 1. A vector of K information bits u = (u,... ,UK), uk  {0, 1} is encoded, 
and a vector of N codeword bits x = (x,..., XN) is transmitted into the channel. 
Independent Gaussian noise with variance a ' is then added to each codeword bit, 
* Brendan rey is currently a Beckman Fellow at the Beckman Institute for Advanced 
Science and Technology, University of Illinois at Urbana-Champaign. 
480 B. J. Frey and D. J. C. MacKay 
u 
Gaussian noise 
with variance er 
Decoder 
Figure 1: A communication system with a channel that adds Gaussian noise to the 
transmitted discrete-time sequence. 
producing the real-valued channel output vector y -- (Yl,..., YN). The decoder 
must then use this received vector to make a guess fi at the original information 
vector. The probability Pb(e) of bit error is minimized by choosing the uk that 
maximizes P(ukly ) for k - 1,...,K. The rate KIN of a code is the number of 
information bits communicated per codeword bit. We will consider rate ~ 1/2 
systems in this paper, where N = 2K. 
The simplest rate 1/2 encoder duplicates each information bit: x9._l = x2 - u, 
k = 1,..., K. The optimal decoder for this repetition code simply averages together 
pairs of noisy channel outputs and then applies a threshold: 
 - I if (Y9k-1 q- y9.)/2  0.5, 0 otherwise. (1) 
Clearly, this procedure has the effect of reducing the noise variance by a factor of 
1/2. The resulting probability Pb(e) that an information bit will be erroneously 
decoded is given by the area under the tail of the noise Gaussian: 
-0.5 
where (0 is the cumulative standard normal distribution. A plot of P(e) versus a 
for this repetition code is shown in Fig. 2, along with a thumbnail picture that shows 
the distribution of noisy received signals at the noise level where the repetition code 
gives Po(e) = 10-*. 
More sophisticated channel encoders and decoders can be used to increase the toler- 
able noise level without increasing the probability of a bit error. This approach can 
in principle improve performance up to a bound determined by Shannon (1948). For 
a given probability of bit error P0 (e), this limit gives the maximum noise level that 
can be tolerated, no matter what channel code is used. Shannon's proof was non- 
constructive, meaning that he showed that there exist channel codes that achieve his 
limit, but did not present practical encoders and decoders. The curve for Shannon's 
limit is also shown in Fig. 2. 
The two curves described above define the region of interest for practical channel 
coding systems. For a given Po(e), if a system requires a lower noise level than the 
repetition code, then it is not very interesting. At the other extreme, it is impossible 
for a system to tolerate a higher noise level than Shannon's limit. 
2 Decoding Hamming codes by probability propagation 
One way to detect errors in a string of bits is to add a parity-check bit that is 
chosen so that the sum modulo 2 of all the bits is 0. If the channel flips one bit, 
the receiver will find that the sum modulo 2 is 1, and can detect than an error 
occurred. In a simple Hamming code, the codeword x consists of the original vector 
A Revolution: Belief Propagation in Graphs with Cycles 481 
le-1 
le-5 
0.2 
Repetitin"  T 
Shannon 
1 / H-PP// Concatenated 
0.3 0.4 0.5 0.6 0.7 0.8 
Standard deviation of Gaussian noise, a 
Figure 2: Probability of bit error Pb(e) versus noise level a for several codes with 
rates near 1/2, using 0/1 signalling. It is impossible to obtain a Pb(e) below Shan- 
non's limit (shown on the far right for rate 1/2). "H-PP" = Hamming code (rate 
4/7) decoded by probability propagation (5 iterations); "H-Exact" = Hamming 
code decoded exactly; "LDPCC-PP" = low-density parity-check coded decoded by 
probability propagation; "TC-PP" = turbocode decoded by probability propaga- 
tion. The thumbnail pictures show the distribution of noisy received signals at the 
noise levels where the repetition code and the Shannon limit give P(e) - 10 -5. 
u in addition to several parity-check bits, each of which depends on a different 
subset of the information bits. In this way, the Hamming code can not only detect 
errors but also correct them. 
The code can be cast in the form of the conditional probabilities that specify a 
Bayesian network. The Bayesian network for a K = 4, N -- 7 rate 4/7 Hamming 
code is shown in Fig. 3a. Assuming the information bits are uniformly random, 
we have P(uk) = 0.5, uk  {0,1}, k = 1,2,3,4. Codeword bits I to 4 are di- 
rect copies of the information bits: P(xJu) = 5(x,u), k = 1,2,3,4, where 
5(a, b) = I if a = b and 0 otherwise. Codeword bits 5 to 7 are parity-check 
bits: P(xs[ul, u2, us) = 5(x5, Ul ( u2 ( u3), P(x6lZtl, u2, u4)  ((x6, 1 ( u2 ( u4), 
P(x7[uo., us, u4) = 5(x7, u2 )us )u4), where ) indicates addition modulo 2 (XOR). 
Finally, the conditional channel probability densities are 
P(YnIXn)= I e_(y,,_x,,)2/.,2 (3) 
X/2.a. ' 
forn- 1,...,7. 
The probabilities P(uk]y) can be computed exactly in this belief network, using 
Lauritzen and Spiegelhalter's algorithm (1988) or just brute force computation. 
However, for the more powerful codes discussed below, exact computations are 
intractable. Instead, one way the decoder can approximate the probabilities P(u lY) 
is by applying the probability propagation algorithm (Pearl 1988) to the Bayesian 
network. Probability propagation is only approximate in this case because the 
482 B. J. Frey and D. J. C. MacKay 
(a) 
(c) 
(b) 
54 
(d) 
Permuter I 
Figure 3: (a) The Bayesian network for a K = 4, N -- 7 Hamming code. (b) The 
Bayesian network for a K = 4, N = 8 low-density parity-check code. (c) A block 
diagram for the turbocode linear feedback shift register. (d) The Bayesian network 
for a K = 6, N = 12 turbocode. 
network contains cycles (ignoring edge directions), e.g., 1-lrs-u2-1r6-u1. Once a 
channel output vector y is observed, propagation begins by sending a message from 
Yn to Xn for n = 1,..., 7. Then, a message is sent from xk to uk for k = 1, 2, 3, 4. 
An iteration now begins by sending messages from the information variables u, 
us, u4 to the parity-check variables xs, x6, x7 in parallel. The iteration finishes by 
sending messages from the parity-check variables back to the information variables 
in parallel. Each time an iteration is completed, new estimates of P(uly ) for 
k - 1, 2, 3, 4 are obtained. 
The Pb (e) -- a curve for optimal decoding and the curve for the probability propaga- 
tion decoder (5 iterations) are shown in Fig. 2. Quite surprisingly, the performance 
of the iterative decoder is quite close to that of the optimal decoder. Our expec- 
tation was that short cycles would confound the probability propagation decoder. 
However, it seems that good performance can be obtained even when there are short 
cycles in the code network. 
For this simple Hanuning code, the complexities of the probability propagation de- 
coder and the exact decoder are comparable. However, the similarity in performance 
between these two decoders prompts the question: "Can probability propagation 
decoders give performances comparable to exact decoding in cases where exact de- 
coding is computationally intractable?" 
A Revolution: Belief Propagation in Graphs with Cydes 483 
3 A leap towards the limit: Low-density parity-check codes 
Recently, there has been an explosion of interest in the channel coding community in 
two new coding systems that have brought us a leap closer to Shannon's limit. Both 
of these codes can be described by Bayesian networks with cycles, and it turns out 
that the corresponding iterative decoders are performing probability propagation 
in these networks. 
Fig. 3b shows the Bayesian network for a simple low-density parity-check code (Gal- 
lager 1963). In this network, the information bits are not represented explicitly. 
Instead, the network defines a set of allowed configurations for the codewords. Each 
parity-check vertex qi requires that the codeword bits {x,},eQ  to which qi is con- 
nected have even parity: 
P(qil{xn}nQ,) = 5(qi, {]) 
nQi 
(4) 
where q is clamped to 0 to ensure even parity. Here, Qi is the set of indices of 
the codeword bits to which parity-check vertex qi is connected. The conditional 
probability densities for the channel outputs are the same as in Eq. 3. 
One way to view the above code is as N binary codeword variables along with a 
set of linear (modulo 2) equations. If in the end we want there to be K degrees 
of freedom, then the number of linearly independent parity-check equations should 
be N - K. In the above example, there are N = 8 codeword bits and 4 parity- 
checks, leaving K -- 8 - 4 - 4 degrees of freedom. It is these degrees of freedom 
that we use to represent the information vector u. Because the code is linear, a 
K-dimensional vector u can be mapped to a valid x simply by multiplying by an 
N x K matrix (using modulo 2 addition). This is how an encoder can produce a 
low-density parity-check codeword for an input vector. 
Once a channel output vector y is observed, the iterative probability propagation 
decoder begins by sending messages from y to x. An iteration now begins by sending 
messages from the codeword variables x to the parity-check constraint variables q. 
The iteration finishes by sending messages from the parity-check constraint variables 
back to the codeword variables. Each time an iteration is completed, new estimates 
of P(xnly) for n = 1,..., N are obtained. After a valid (but not necessarily correct) 
codeword has been found, or a prespecified limit on the number of iterations has 
been reached, decoding stops. The estimate of the codeword is then mapped back 
to an estimate fi of the information vector. 
Fig. 2 shows the performance of a K = 32,621, N = 65,389 low-density parity- 
check code when decoded as described above. (See MacKay and Neal (1996) for 
details.) It is impressively close to Shannon's limit -- significantly closer than the 
"Concatenated Code" (described in Lin and Costello (1983)) which was considered 
the best practical code until recently. 
4 Another leap: Turbocodes 
The codeword for a turbocode (Berrou et al. 1996) consists of the original informa- 
tion vector, plus two sets of bits used to protect the information. Each of these two 
sets is produced by feeding the information bits into a linear feedback shift register 
(LFSR), which is a type of finite state machine. The two sets differ in that one 
set is produced by a permuted set of information bits; i.e., the order of the bits is 
scrambled in a fixed way before the bits are fed into the LFSR. Fig. 3c shows a block 
diagram (not a Bayesian network) for the LFSR that was used in our experiments. 
484 B. J. Frey and D. J. C. MacKay 
Each box represents a delay (memory) element, and each circle performs addition 
modulo 2. When the kth information bit arrives, the machine has a state sk which 
can be written as a binary string of state bits bab3b2bbo as shown in the figure. b0 
of the state sk is determined by the current input un and the previous state sn_x. 
Bits b to b4 are just shifted versions of the bits in the previous state. 
Fig. 3d shows the Bayesian network for a simple turbocode. Notice that each 
state variable in the two constituent chains depends on the previous state and an 
information bit. In each chain, every second LFSR output is not transmitted. In 
this way, the overall rate of the code is 1/2, since there are K = 6 information bits 
and N = 6 + 3 + 3 - 12 codeword bits. The conditional probabilities for the states 
of the nonpermuted chain are 
P(sls_l,Uk ) 1 if state s follows  for input u, 0 otherwise. (5) 
-- 8k_l 
The conditional probabilities for the states in the other chain are similar, except that 
the inputs are permuted. The probabilities for the information bits are uniform, 
and the conditional probability densities for the channel outputs are the same as in 
Eq. 3. 
Decoding proceeds with messages being passed from the channel output variables 
to the constituent chains and the information bits. Next, messages are passed 
from the information variables to the first constituent chain, s 1. Messages are 
passed forward and then backward along this chain, in the manner of the forward- 
backward algorithm (Smyth et al. 1997). After messages are passed from the 
first chain to the second chain s 2, the second chain is processed using the forward- 
backward algorithm. To complete the iteration, messages are passed from s 9' to the 
information bits. 
Fig. 2 shows the performance of a K - 65,536, N - 131,072 turbocode when 
decoded as described above, using a fixed number (18) of iterations. (See Frey 
(1998) for details.) Its performance is significantly closer to Shannon's limit than the 
performances of both the low-density parity-check code and the textbook standard 
"Concatenated Code". 
5 Open questions 
We are certainly not claiming that the NP-hard problem (Cooper 1990) of proba- 
bilistic inference in general Bayesian networks can be solved in polynomial time by 
probability propagation. However, the results presented in this paper do show that 
there are practical problems which can be solved using approximate inference in 
graphs with cycles. Iterative decoding algorithms are using probability propagation 
in graphs with cycles, and it is still not well understood why these decoders work 
so well. Compared to other approximate inference techniques such as variational 
methods, probability propagation in graphs with cycles is unprincipled. How well 
do more principled decoders work? In (MacKay and Neal 1995), a variational de- 
K 
coder that maximized a lower bound on lk_-i P(un]y) was presented for low-density 
parity-check codes. However, it was found that the performance of the variational 
decoder was not as good as the performance of the probability propagation decoder. 
It is not difficult to design small Bayesian networks with cycles for which probability 
propagation is unstable. Is there a way to easily distinguish between those graphs for 
which propagation will work and those graphs for which propagation is unstable? A 
belief that is not uncommon in the graphical models community is that short cycles 
are particularly apt to lead probability propagation astray. Although it is possible to 
design networks where this is so, there seems to be a variety of interesting networks 
A Revolution: Belief Propagation in Graphs with Cycles 485 
(such as the Hamming code network described above) for which propagation works 
well, despite short cycles. 
The probability distributions that we deal with in decoding are very special distribu- 
tions: the true posterior probability mass is actually concentrated in one microstate 
in a space of size 2 M where M is large (e.g., 10,000). The decoding problem is to 
find this most probable microstate, and it may be that iterative probability prop- 
agation decoders work because the true probability distribution is concentrated in 
this microstate. 
We believe that there are many interesting and contentious issues in this area that 
remain to be resolved. 
Acknowledgements 
We thank Frank Kschischang, Bob McEliece, and Radford Neal for discussions re- 
lated to this work, and Zoubin Ghahramani for comments on a draft of this paper. 
This research was supported in part by grants from the Gatsby foundation, the In- 
formation Technology Research Council, and the Natural Sciences and Engineering 
Research Council. 
References 
C. Berrou and A. Glavieux 1996. Near optimum error correcting coding and de- 
coding: Turbo-codes. IEEE Transactions on Communications 44, 1261-1271. 
G. F. Cooper 1990. The computational complexity of probabilistic inference using 
Bayesian belief networks. Artificial Intelligence 42, 393-405. 
B. J. Frey 1998. Graphical Models for Machine Learning and Digital Communica- 
tion, MIT Press, Cambridge, MA. See http://m,. cs.utoronto. ca/~frey. 
R. G. Gallager 1963. Low-Density Parity-Check Codes, MIT Press, Cambridge, 
MA. 
S. Lin and D. J. Costello, Jr. 1983. Error Control Coding: Fundamentals and 
Applications, Prentice-Hall Inc., Englewood Cliffs, NJ. 
S. L. Lauritzen and D. J. Spiegelhalter 1988. Local computations with probabilities 
on graphical structures and their application to expert systems. Journal of the 
Royal Statistical Society B 50, 157-224. 
D. J. C. MacKay and R. M. Neal 1995. Good codes based on very sparse matrices. 
In Cryptography and Coding. 5th IMA Conference, number 1025 in Lecture Notes 
in Computer Science, 100-111, Springer, Berlin Germany. 
D. J. C. MacKay and R. M. Neal 1996. Near Shannon limit performance of low 
density parity check codes. Electronics Letters 32, 1645-1646. Due to editing 
errors, reprinted in Electronics Letters 33, 457-458. 
J. Pearl 1988. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, 
San Mateo, CA. 
C. E. Shannon 1948. A mathematical theory of communication. Bell System Tech- 
nical Journal 27, 379-423, 623-656. 
P. Smyth, D. Heckerman, and M. I. Jordan 1997. Probabilistic independence net- 
works for hidden Markov probability models. Neural Computation 9, 227-270. 
