Discovering Structure in Continuous 
Variables Using Bayesian Networks 
Reimar Hofmann and Volker Tresp* 
Siemens AG, Central Research 
Otto-Hahn-Ring 6 
81730 München, Germany
Abstract 
We study Bayesian networks for continuous variables using non- 
linear conditional density estimators. We demonstrate that use- 
ful structures can be extracted from a data set in a self-organized 
way and we present sampling techniques for belief update based on 
Markov blanket conditional density models. 
1 Introduction
One of the strongest types of information that can be learned about an unknown 
process is the discovery of dependencies and --even more important-- of indepen- 
dencies. A superior example is medical epidemiology where the goal is to find the 
causes of a disease and exclude factors which are irrelevant. Whereas complete 
independence between two variables in a domain might be rare in reality (it
would mean that the joint probability density of variables A and B can be factored:
p(A,B) = p(A)p(B)), conditional independence is more common and is often a
result of true or apparent causality: consider the case that A is the cause of B
and B is the cause of C, then p(C|A,B) = p(C|B) and A and C are independent
under the condition that B is known. Precisely this notion of cause and effect and 
the resulting independence between variables is represented explicitly in Bayesian 
networks. Pearl (1988) has convincingly argued that causal thinking leads to clear 
knowledge representation in the form of conditional probabilities and to efficient local
belief propagation rules.
Bayesian networks form a complete probabilistic model in the sense that they repre- 
sent the joint probability distribution of all variables involved. Two of the powerful 
*Reimar.Hofmann@zfe.siemens.de, Volker.Tresp@zfe.siemens.de
features of Bayesian networks are that any variable can be predicted from any sub- 
set of known other variables and that Bayesian networks make explicit statements 
about the certainty of the estimate of the state of a variable. Both aspects are par- 
ticularly important for medical or fault diagnosis systems. More recently, learning 
of structure and of parameters in Bayesian networks has been addressed allowing 
for the discovery of structure between variables (Buntine, 1994; Heckerman, 1995).
Most of the research on Bayesian networks has focused on systems with discrete 
variables, linear Gaussian models or combinations of both. Except for linear mod- 
els, continuous variables pose a problem for Bayesian networks. In Pearl's words 
(Pearl, 1988): "representing each [continuous] quantity by an estimated magnitude 
and a range of uncertainty, we quickly produce a computational mess. [Continuous 
variables] actually impose a computational tyranny of their own." In this paper we 
present approaches to applying the concept of Bayesian networks towards arbitrary 
nonlinear relations between continuous variables. Because they are fast learners we 
use Parzen windows based conditional density estimators for modeling local depen- 
dencies. We demonstrate how a parsimonious Bayesian network can be extracted 
out of a data set using unsupervised self-organized learning. For belief update we 
use local Markov blanket conditional density models which --in combination with 
Gibbs sampling-- allow relatively efficient sampling from the conditional density of 
an unknown variable. 
2 Bayesian Networks 
This brief introduction to Bayesian networks closely follows Heckerman (1995).
Considering a joint probability density¹ $p(x)$ over a set of variables $\{x_1, \ldots, x_N\}$ we can
decompose it using the chain rule of probability:
$$p(x) = \prod_{i=1}^{N} p(x_i \mid x_1, \ldots, x_{i-1}). \qquad (1)$$
For each variable $x_i$, let the parents of $x_i$, denoted by $\mathcal{P}_i \subseteq \{x_1, \ldots, x_{i-1}\}$, be a set
of variables² that renders $x_i$ and $\{x_1, \ldots, x_{i-1}\}$ independent, that is
$$p(x_i \mid x_1, \ldots, x_{i-1}) = p(x_i \mid \mathcal{P}_i). \qquad (2)$$
Note that $\mathcal{P}_i$ does not need to include all elements of $\{x_1, \ldots, x_{i-1}\}$, which
indicates conditional independence between those variables not included in $\mathcal{P}_i$ and $x_i$
given that the variables in $\mathcal{P}_i$ are known. The dependencies between the variables
are often depicted as directed acyclic³ graphs (DAGs) with directed arcs from the
members of $\mathcal{P}_i$ (the parents) to $x_i$ (the child). Bayesian networks are a natural
description of dependencies between variables if they depict causal relationships be- 
tween variables. Bayesian networks are commonly used as a representation of the 
knowledge of domain experts. Experts both define the structure of the Bayesian 
network and the local conditional probabilities. Recently there has been great 
¹ For simplicity of notation we will only treat the continuous case. Handling mixtures
of continuous and discrete variables does not impose any additional difficulties.
² Usually the smallest set will be used. Note that $\mathcal{P}_i$ is defined with respect to a
given ordering of the variables.
³ i.e. not containing any directed loops.
emphasis on learning structure and parameters in Bayesian networks (Heckerman, 
1995). Most previous work has concentrated on models with only discrete variables
or on linear models of continuous variables where the probability distribution of all
continuous variables given all discrete variables is a multidimensional Gaussian. In this paper
we apply these ideas to continuous variables with nonlinear dependencies.
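As a minimal sketch of the factorization in Equations 1 and 2, the joint log-density of a network can be written as a sum of local conditional log-densities. The three-variable chain, the linear-Gaussian local models, and all weights and noise variances below are illustrative assumptions and do not come from the paper.

```python
import math

# Hypothetical network x1 -> x2 -> x3: each node lists its parents.
parents = {"x1": [], "x2": ["x1"], "x3": ["x2"]}

# Illustrative local models (assumption): each child is Gaussian around a
# weighted sum of its parents, with a fixed noise variance.
weights = {"x1": {}, "x2": {"x1": 0.8}, "x3": {"x2": -1.2}}
noise_var = {"x1": 1.0, "x2": 0.25, "x3": 0.25}


def log_normal(x, mean, var):
    """Log of a univariate normal density with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def local_log_density(node, sample):
    """log p(x_i | parents of x_i) for one node of the network."""
    mean = sum(weights[node][p] * sample[p] for p in parents[node])
    return log_normal(sample[node], mean, noise_var[node])


def joint_log_density(sample):
    """log p(x) as the sum of the local terms, one per node."""
    return sum(local_log_density(node, sample) for node in parents)


sample = {"x1": 0.3, "x2": 0.1, "x3": -0.4}
print(joint_log_density(sample))
```

Because x3 has x2 as its only parent, changing x1 in `sample` leaves `local_log_density("x3", sample)` unchanged, which is the conditional independence expressed by Equation 2.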
3 Learning Structure and Parameters in Nonlinear
Continuous Bayesian Networks
Many of the structures developed in the neural network community can be used to
model the conditional density of continuous variables $p(x_i \mid \mathcal{P}_i)$. Under
the usual signal-plus-independent-Gaussian-noise model, a feedforward neural network
$NN(\cdot)$ is a conditional density model such that $p(x_i \mid \mathcal{P}_i) = G(x_i; NN(\mathcal{P}_i), \sigma^2)$,
where $G(x; c, \sigma^2)$ is our notation for a normal density centered at $c$ and with variance
$\sigma^2$. More complex conditional densities can, for example, be modeled by mixtures
of experts or by Parzen windows based density estimators, which we used in our
experiments (Section 5). We will use $p^M(x_i \mid \mathcal{P}_i)$ for a generic conditional probability
model. The joint probability model is then
$$p^M(x) = \prod_{i=1}^{N} p^M(x_i \mid \mathcal{P}_i) \qquad (3)$$
following Equations 1 and 2. Learning Bayesian networks is usually decomposed
into the problems of learning structure (that is, the arcs in the network) and of
learning the conditional density models $p^M(x_i \mid \mathcal{P}_i)$ given the structure⁴. First
assume the structure of the network is given. If the data set only contains complete
data, we can train the conditional density models $p^M(x_i \mid \mathcal{P}_i)$ independently of each
other since the log-likelihood of the model decomposes conveniently into the
individual log-likelihoods of the models for the conditional probabilities. Next, consider
two competing network structures. We are basically faced with the well-known 
bias-variance dilemma: if we choose a network with too many arcs, we introduce 
large parameter variance and if we remove too many arcs we introduce bias. Here, 
the problem is even more complex since we also have the freedom to reverse arcs. 
In our experiments we evaluate different network structures based on the model 
likelihood using leave-one-out cross-validation which defines our scoring function 
for different network structures. More explicitly, the score for network structure
$S$ is $\text{Score} = \log p(S) + L^{cv}$, where $p(S)$ is a prior over the network structures
and
$$L^{cv} = \sum_{k=1}^{D} \log p^M(x^k \mid S, X - \{x^k\})$$
is the leave-one-out cross-validation log-likelihood (later referred to as
cv-log-likelihood). $X = \{x^k\}_{k=1}^{D}$ is the set of training
samples, and $p^M(x^k \mid S, X - \{x^k\})$ is the probability density of sample $x^k$ given the
structure $S$ and all other samples. Each of the terms $p^M(x^k \mid S, X - \{x^k\})$ can be
computed from local densities using Equation 3.
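To spell out why the score can be computed from local densities, the cv-log-likelihood can be expanded with Equation 3. The symbol $L_i^{cv}$ for the contribution of node $i$ is notation introduced here for illustration and does not appear in the original text.

```latex
\begin{align*}
L^{cv} &= \sum_{k=1}^{D} \log p^M\!\left(x^k \mid S, X - \{x^k\}\right) \\
       &= \sum_{k=1}^{D} \sum_{i=1}^{N} \log p^M\!\left(x_i^k \mid \mathcal{P}_i^k, X - \{x^k\}\right) \\
       &= \sum_{i=1}^{N} L_i^{cv},
\qquad
L_i^{cv} = \sum_{k=1}^{D} \log p^M\!\left(x_i^k \mid \mathcal{P}_i^k, X - \{x^k\}\right).
\end{align*}
```

Under this decomposition, changing the parent set of a single node only changes that node's term $L_i^{cv}$, so the scores of two structures that differ in one arc can be compared by re-evaluating one local model (two for an arc reversal).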
Even for small networks it is computationally infeasible to calculate the score for all
possible network structures, and the search for the globally optimal network structure
⁴ Differing from Heckerman we do not follow a fully Bayesian approach in which priors
are defined on parameters and structure; a fully Bayesian approach is elegant if the
occurring integrals can be solved in closed form, which is not the case for general nonlinear
models or if data are incomplete.
is NP-hard. In Section 5 we describe a heuristic search which is closely related to
search strategies commonly used in discrete Bayesian networks (Heckerman, 1995).
4 Prior Models 
In a Bayesian framework it is useful to provide means for exploiting prior knowledge, 
typically introducing a bias for simple structures. Biasing models towards simple 
structures is also useful if the model selection criterion is based on cross-validation,
as in our case, because of the variance in this score. In the experiments we added
a penalty per arc to the log-likelihood, i.e. $\log p(S) \propto -\alpha N_A$, where $N_A$ is the
number of arcs and the parameter $\alpha$ determines the weight of the penalty. Given
more specific knowledge in the form of a structure defined by a domain expert, we
can alternatively penalize the deviation in the arc structure (Heckerman, 1995).
Furthermore, prior knowledge can be introduced in the form of a set of artificial training
data. These can be treated identically to real data and loosely correspond to the
concept of a conjugate prior.
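The arc penalty can be sketched as a small scoring helper. The function name, the dictionary layout, and the choice to apply the penalty to the summed local cv-log-likelihoods are assumptions made for illustration. The default weight of 0.1 mirrors the penalty per arc quoted in the caption of Figure 1.

```python
def structure_score(local_cv_loglik, parents, alpha=0.1):
    """Penalized score: sum of local cv-log-likelihoods minus alpha per arc.

    local_cv_loglik: dict node -> cv-log-likelihood of that node's local model
    parents:         dict node -> collection of parent nodes (one arc per parent)
    """
    n_arcs = sum(len(ps) for ps in parents.values())
    return sum(local_cv_loglik[node] for node in parents) - alpha * n_arcs


# Hypothetical numbers, not taken from the experiments.
parents = {"a": [], "b": ["a"], "c": ["a", "b"]}
local_cv_loglik = {"a": -1.0, "b": -0.5, "c": -0.25}
print(structure_score(local_cv_loglik, parents))
```

With three arcs and alpha = 0.1 the penalty is 0.3, so the example evaluates to -1.75 - 0.3 = -2.05 up to floating-point rounding.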
5 Experiment 
In the experiment we used Parzen windows based conditional density estimators to
model the conditional densities $p^M(x_i \mid \mathcal{P}_i)$ from Equation 2, i.e.
$$p^M(x_i \mid \mathcal{P}_i) = \frac{\sum_{k=1}^{D} G((x_i, \mathcal{P}_i); (x_i^k, \mathcal{P}_i^k), \sigma_i^2)}{\sum_{k=1}^{D} G(\mathcal{P}_i; \mathcal{P}_i^k, \sigma_i^2)}, \qquad (4)$$
where $\{x^k\}_{k=1}^{D}$ is the training set. The Gaussians in the numerator are centered
at $(x_i^k, \mathcal{P}_i^k)$, which is the location of the k-th sample in the joint input/output (or
parent/child) space, and the Gaussians in the denominator are centered at $\mathcal{P}_i^k$,
which is the location of the k-th sample in the input (or parent) space. For each
conditional model, $\sigma_i$ was optimized using leave-one-out cross-validation⁵.
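The conditional Parzen estimator of Equation 4 and the leave-one-out choice of its width can be sketched as follows. This is a minimal illustration, assuming isotropic Gaussians and a simple grid search over the width. The function names, the grid, and the synthetic data are assumptions, not the setup used in the experiments.

```python
import numpy as np


def log_kernels(points, centers, var):
    """log G(point; center, var * I) for all pairs, shape (n_points, n_centers)."""
    dim = points.shape[1]
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * (dim * np.log(2.0 * np.pi * var) + sq_dist / var)


def log_sum_exp(a):
    """Numerically stable log(sum(exp(a))) along each row."""
    top = a.max(axis=1, keepdims=True)
    return top[:, 0] + np.log(np.exp(a - top).sum(axis=1))


def cv_loglik(child, parents, sigma):
    """Leave-one-out log-likelihood of the conditional Parzen model, summed over samples."""
    joint = np.column_stack([child, parents])
    log_num = log_kernels(joint, joint, sigma ** 2)      # numerator kernels
    log_den = log_kernels(parents, parents, sigma ** 2)  # denominator kernels
    np.fill_diagonal(log_num, -np.inf)  # exclude sample k from its own estimate
    np.fill_diagonal(log_den, -np.inf)
    return float((log_sum_exp(log_num) - log_sum_exp(log_den)).sum())


def best_width(child, parents, grid):
    """Pick the kernel width with the highest leave-one-out log-likelihood."""
    scores = [cv_loglik(child, parents, s) for s in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), scores[best]


# Synthetic data (assumption): the child depends nonlinearly on x_parent only.
rng = np.random.default_rng(0)
x_parent = rng.uniform(-1.0, 1.0, size=(300, 1))
x_noise = rng.uniform(-1.0, 1.0, size=(300, 1))  # unrelated to the child
x_child = np.sin(3.0 * x_parent[:, 0]) + 0.05 * rng.normal(size=300)

grid = np.geomspace(0.02, 1.0, 15)
sigma_rel, ll_rel = best_width(x_child, x_parent, grid)
sigma_irr, ll_irr = best_width(x_child, x_noise, grid)
print(sigma_rel, ll_rel, sigma_irr, ll_irr)
```

In this toy setting the model conditioned on the informative parent should obtain a clearly higher cv-log-likelihood than the one conditioned on the unrelated variable, which is the signal the structure search relies on.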
The unsupervised structure optimization procedure starts with a complete Bayesian
model corresponding to Equation 1, i.e. a model where there is an arc between
any pair of variables⁶. Next, we tentatively try all possible arc direction changes,
arc removals and arc additions which do not produce directed loops and evaluate
the change in score. After evaluating all legal single modifications, we accept the
change which improves the score the most. The procedure stops if every arc change
decreases the score. This greedy strategy can get stuck in local optima, which
could in principle be avoided if changes which result in worse performance are also
accepted with a nonzero probability⁷ (such as in annealing strategies, Heckerman,
1995). Calculating the new score at each step requires only local computation.
The removal or addition of an arc corresponds to a simple removal or addition of
the corresponding dimension in the Gaussians of the local density model. However,
⁵ Note that if we maintained a global $\sigma$ for all density estimators, we would maintain
likelihood equivalence, which means that each network displaying the same independence
model gets the same score on any test set.
⁶ The order of nodes determining the direction of initial arcs is random.
⁷ In our experiments we treated very small changes in score as if they were exactly zero,
thus allowing small decreases in score.
[Figure 1: two plots. Left: x-axis "Number of Iterations". Right: x-axis "Number of inputs".]
Figure 1: Left: evolution of the cv-log-likelihood (dashed) and of the log-likelihood 
on the test set (continuous) during structure optimization. The curves are averages 
over 20 runs with different partitions of training and test sets and the likelihoods 
are normalized with respect to the number of cv- or test-samples, respectively. The 
penalty per arc was c: 0.1. The dotted line shows the Parzen joint density model 
commonly used in statistics, i.e. assuming no independencies and using the same 
width for all Gaussians in all conditional density models. Right: log-likelihood 
of the local conditional Parzen model for variable 3 (pm(x3lP3)) on the test set 
(continuous) and the corresponding cv-log-likelihood (dashed) as a function of the 
number of parents (inputs). 
1 crime rate
2 percent land zoned for lots 
3 percent nonretail business 
4 located on Charles river? 
5 nitrogen oxide concentration 
6 average number of rooms 
7 percent built before 1940 
8 weighted distance to employment center 
9 access to radial highways 
10 tax rate 
11 pupil/teacher ratio 
12 percent black 
13 percent lower-status population 
14 median value of homes 
Figure 2: Final structure of a run on the full data set. 
after each such operation the widths of the Gaussians $\sigma_i$ in the affected local models
have to be optimized. An arc reversal is simply the execution of an arc removal 
followed by an arc addition. 
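The greedy procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions: the `score` callable stands in for the penalized cv-log-likelihood, the whole structure is rescored for every candidate instead of only the affected local models, and the search stops as soon as no single change strictly improves the score. The function names and the toy scoring function are hypothetical.

```python
from itertools import permutations


def has_cycle(parents):
    """True if the graph given as node -> set of parents contains a directed loop."""
    state = {}  # 1 = on the current search path, 2 = fully explored

    def visit(node):
        if state.get(node) == 1:
            return True
        if state.get(node) == 2:
            return False
        state[node] = 1
        found = any(visit(p) for p in parents[node])
        state[node] = 2
        return found

    return any(visit(node) for node in parents)


def neighbours(parents):
    """Yield every structure reachable by one legal arc removal, reversal or addition."""
    for a, b in permutations(parents, 2):
        cand = {n: set(ps) for n, ps in parents.items()}
        if a in parents[b]:                 # arc a -> b exists
            cand[b].discard(a)              # removal never creates a loop
            yield cand
            rev = {n: set(ps) for n, ps in cand.items()}
            rev[a].add(b)                   # reversal = removal followed by addition
            if not has_cycle(rev):
                yield rev
        elif b not in parents[a]:           # no arc between a and b yet
            cand[b].add(a)                  # addition of a -> b
            if not has_cycle(cand):
                yield cand


def greedy_search(start, score):
    """Accept the best single change until no change improves the score."""
    current = {n: set(ps) for n, ps in start.items()}
    best = score(current)
    while True:
        candidates = [(score(c), c) for c in neighbours(current)]
        if not candidates:
            return current, best
        top_score, top = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            return current, best
        current, best = top, top_score


# Toy score (assumption): negative number of arcs that differ from a target DAG.
target = {"a": set(), "b": {"a"}, "c": {"b"}, "d": set()}


def arcs(parents):
    return {(p, child) for child, ps in parents.items() for p in ps}


def toy_score(parents):
    return -len(arcs(parents) ^ arcs(target))


# Start from a complete DAG with node order d, c, b, a.
start = {"d": set(), "c": {"d"}, "b": {"d", "c"}, "a": {"d", "c", "b"}}
final, final_score = greedy_search(start, toy_score)
print(final, final_score)
```

With this toy score an improving move always exists until the target is reached, so the search recovers it exactly. With a cross-validation score the same loop can stop in a local optimum, as noted in the text.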
In our experiment, we used the Boston housing data set, which contains 506 samples.
Each sample consists of the housing price and 13 variables which supposedly
influence the housing price in a Boston neighborhood (Figure 2). Figure 1 (left)
shows an experiment where one third of the samples was reserved as a test set to
monitor the process. Since the algorithm never sees the test data, the increase in
likelihood of the model on the test data is an unbiased estimator for how much
the model has improved by the extraction of structure from the data. The large
increase in the log-likelihood can be understood by studying Figure 1 (right). Here
we picked a single variable (node 3) and formed a density model to predict this variable
from the remaining 13 variables. Then we removed input variables in the order
of their significance. After the removal of a variable, $\sigma_3$ is optimized. Note that the
cv-log-likelihood increases until only three input variables are left due to the fact
that irrelevant variables or variables which are well represented by the remaining 
input variables are removed. The log-likelihood of the fully connected initial model 
is therefore low (Figure 1 left). 
We did a second set of 15 runs with no test set. The scores of the final structures 
had a standard deviation of only 0.4. However, comparing the final structures in 
terms of undirected arcs⁸ the difference was 18% on average. The structure from one
of these runs is depicted in Figure 2 (right). In comparison to the initial complete 
structure with 91 arcs, only 18 arcs are left and 8 arcs have changed direction. 
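The comparison of two structures in terms of undirected arcs can be sketched as below: count the arcs present in exactly one of the two structures and normalize by the number of arcs in a fully connected network. The function names and the three-node example are assumptions made for illustration.

```python
def undirected_arcs(parents):
    """Set of arcs with direction ignored, from a node -> parents mapping."""
    return {frozenset((p, child)) for child, ps in parents.items() for p in ps}


def structure_difference(parents_a, parents_b):
    """Fraction of possible arcs that are present in exactly one structure."""
    n_nodes = len(parents_a)
    n_full = n_nodes * (n_nodes - 1) // 2  # arcs in a fully connected network
    differing = undirected_arcs(parents_a) ^ undirected_arcs(parents_b)
    return len(differing) / n_full


# Hypothetical structures over three nodes.
struct_a = {"a": set(), "b": {"a"}, "c": {"b"}}   # a -> b, b -> c
struct_b = {"a": {"b"}, "b": set(), "c": {"a"}}   # b -> a, a -> c
print(structure_difference(struct_a, struct_b))
```

The arc between a and b is shared once direction is ignored, while the arcs b-c and a-c each appear in only one structure, so the example gives 2 out of 3 possible arcs. For the 14 variables of the experiment the normalizer would be 91, matching the arc count of the complete structure.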
One of the advantages of Bayesian networks is that they can be easily interpreted. 
The goal of the original Boston housing data experiment was to examine whether 
the nitrogen oxide concentration (5) influences the housing price (14). Under the 
structure extracted by the algorithm, 5 and 14 are dependent given all other vari- 
ables because they have a common child, 13. However, if all variables except 13 are 
known then they are independent. Another interesting question is what the rele- 
vant quantities are for predicting the housing price, i.e. which variables have to be 
known to render the housing price independent from all other variables. These are 
the parents, children, and children's parents of variable 14, that is variables 8, 10, 
11, 6, 13 and 5. It is well known that in Bayesian networks, different constellations 
of directions of arcs may induce the same independencies, i.e. that the direction 
of arcs is not uniquely determined. It can therefore not be expected that the arcs 
actually reflect the direction of causality. 
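Reading off the relevant variables for a node amounts to collecting its parents, its children and its children's other parents. The sketch below does this for a node -> parents mapping. The example graph is hypothetical: only the common child 13 of nodes 5 and 14 is taken from the text, and the remaining arc directions are invented so that the result matches the set of variables listed above.

```python
def markov_blanket(parents, node):
    """Parents, children and children's other parents of a node."""
    children = {c for c, ps in parents.items() if node in ps}
    co_parents = {p for c in children for p in parents[c]}
    return (set(parents[node]) | children | co_parents) - {node}


# Hypothetical fragment of a structure (arc directions partly assumed).
parents = {
    5: set(), 6: set(), 8: set(), 10: set(), 11: set(),
    14: {6, 8, 10, 11},
    13: {5, 14},
}
print(sorted(markov_blanket(parents, 14)))
```

For this fragment the blanket of node 14 contains nodes 5, 6, 8, 10, 11 and 13. Node 5 enters only because it shares the child 13 with node 14.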
6 Missing Data and Markov Blanket Conditional Density
Model
Bayesian networks are typically used in applications where variables might be miss- 
ing. Given partial information (i.e. the states of a subset of the variables) the goal 
is to update the beliefs (i.e. the probabilities) of all unknown variables. Whereas 
there are powerful local update rules for networks of discrete variables without 
(undirected) loops, the belief update in networks with loops is in general NP-hard. 
A generally applicable update rule for the unknown variables in networks of discrete 
or continuous variables is Gibbs sampling. Gibbs sampling can be roughly described 
as follows: for all variables whose state is known, fix their states to the known val- 
ues. For all unknown variables choose some initial states. Then pick a variable xi 
which is not known and update its value following the probability distribution 
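$$p(x_i \mid \{x_1, \ldots, x_N\} \setminus x_i) \propto p(x_i \mid \mathcal{P}_i) \prod_{j:\, x_i \in \mathcal{P}_j} p(x_j \mid \mathcal{P}_j). \qquad (5)$$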
Do this repeatedly for all unknown variables. Discard the first samples. Then, 
the samples which are generated are drawn from the probability distribution of the 
unknown variables given the known variables. Using these samples it is easy to 
calculate the expected value of any of the unknown variables, estimate variances, 
covariances and other statistical measures such as the mutual information between 
variables. 
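The Gibbs sampling loop can be illustrated on a case where every full conditional is available in closed form. The sketch assumes a hypothetical linear-Gaussian chain x1 -> x2 -> x3 with x3 observed. All weights and variances are invented, and this is not the nonlinear model of the paper. Each full conditional is proportional to the variable's own local density times the local densities of its children, which here is again Gaussian.

```python
import numpy as np

# Hypothetical chain: x1 ~ N(0, 1), x2 | x1 ~ N(W1 * x1, V2), x3 | x2 ~ N(W2 * x2, V3)
W1, V2 = 0.8, 0.25
W2, V3 = -1.2, 0.25


def gibbs(x3, n_samples=20000, burn_in=1000, seed=0):
    """Sample (x1, x2) given the observed value x3."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0  # arbitrary initial states of the unknown variables
    draws = np.empty((n_samples, 2))
    prec1 = 1.0 + W1 ** 2 / V2        # precision of x1 given x2
    prec2 = 1.0 / V2 + W2 ** 2 / V3   # precision of x2 given x1 and x3
    for t in range(burn_in + n_samples):
        # x1 | x2: prior N(0, 1) times the child term p(x2 | x1)
        x1 = rng.normal((W1 * x2 / V2) / prec1, np.sqrt(1.0 / prec1))
        # x2 | x1, x3: local term p(x2 | x1) times the child term p(x3 | x2)
        x2 = rng.normal((W1 * x1 / V2 + W2 * x3 / V3) / prec2, np.sqrt(1.0 / prec2))
        if t >= burn_in:  # discard the first samples
            draws[t - burn_in] = (x1, x2)
    return draws


draws = gibbs(x3=1.0)
print(draws.mean(axis=0))  # estimates of E[x1 | x3] and E[x2 | x3]
print(np.cov(draws.T))     # estimated posterior covariance
```

For this Gaussian toy case the sample means can be checked against the exact conditional expectations, which is not possible in general and is the reason sampling is used.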
⁸ Since the direction of arcs is not unique we used the difference in undirected arcs to
compare two structures. We used the number of arcs present in one and only one of the
structures normalized with respect to the number of arcs in a fully connected network.
Gibbs sampling requires sampling from the univariate probability distribution in 
Equation 5 which is not straightforward in our model since the conditional den- 
sity does not have a convenient form. Therefore, sampling techniques such as 
importance sampling have to be used. In our case they typically produce many 
rejected samples and are therefore inefficient. An alternative is sampling based 
on Markov blanket conditional density models. The Markov blanket of $x_i$, $\mathcal{M}_i$, is
the smallest set of variables such that $p(x_i \mid \{x_1, \ldots, x_N\} \setminus x_i) = p(x_i \mid \mathcal{M}_i)$ (given a
Bayesian network, the Markov blanket of a variable consists of its parents, its children
and its children's parents). The idea is to form a conditional density model
$p^M(x_i \mid \mathcal{M}_i) \approx p(x_i \mid \mathcal{M}_i)$ for each variable in the network instead of computing it
according to Equation 5. Sampling from this model is simple using conditional
Parzen models: the conditional density is a mixture of Gaussians from which we
can sample without rejection⁹. Markov blanket conditional density models are also
interesting if we are only interested in always predicting one particular variable, as 
in most neural network applications. Assuming that a signal-plus-noise model is a 
reasonably good model for the conditional density, we can train an ordinary neural 
network to predict the variable of interest. In addition, we train a model for each 
input variable predicting it from the remaining variables. In addition to having ob- 
tained a model for the complete data case, we can now also handle missing inputs 
and do backward inference using Gibbs sampling. 
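Sampling from a conditional Parzen model without rejection can be sketched as follows. With isotropic Gaussians, the conditional density given the blanket values is a mixture of Gaussians centered at the training values of the variable, with mixture weights proportional to the kernel evaluated in the blanket space. The function name, the synthetic data and the fixed width are assumptions made for illustration.

```python
import numpy as np


def sample_conditional(blanket_value, train_child, train_blanket, sigma, rng, size=1):
    """Draw from the conditional Parzen model given the Markov blanket values."""
    sq_dist = ((train_blanket - blanket_value) ** 2).sum(axis=1)
    log_w = -0.5 * sq_dist / sigma ** 2
    w = np.exp(log_w - log_w.max())  # subtract the maximum for numerical stability
    w /= w.sum()                     # mixture weights of the training samples
    k = rng.choice(len(train_child), size=size, p=w)         # pick components
    return train_child[k] + sigma * rng.normal(size=size)    # add kernel noise


# Synthetic training data (assumption): the variable is about twice its blanket value.
rng = np.random.default_rng(0)
train_blanket = rng.uniform(-1.0, 1.0, size=(500, 1))
train_child = 2.0 * train_blanket[:, 0] + 0.05 * rng.normal(size=500)

draws = sample_conditional(np.array([0.5]), train_child, train_blanket, 0.05, rng, size=2000)
print(draws.mean(), draws.std())
```

Every draw is accepted, so the cost per Gibbs update is one pass over the training samples to compute the weights. In the toy example the draws concentrate around 1.0, twice the conditioning value.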
7 Conclusions 
We demonstrated that Bayesian models of local conditional density estimators form 
promising nonlinear dependency models for continuous variables. The conditional 
density models can be trained locally if training data are complete. In this paper 
we focused on the self-organized extraction of structure. Bayesian networks can 
also serve as a framework for a modular construction of large systems out of smaller 
conditional density models. The Bayesian framework provides consistent update 
rules for the probabilities i.e. communication between modules. Finally, consider 
input pruning or variable selection in neural networks. Note that our pruning
strategy in Figure 1 can be considered a form of variable selection by not only
removing variables which are statistically independent of the output variable but 
also removing variables which are represented well by the remaining variables. This 
way we obtain more compact models. If input values are missing then the indirect 
influence of the pruned variables on the output will be recovered by the sampling 
mechanism. 
References 
Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial
Intelligence Research 2:159-225.
Heckerman, D. (1995). A tutorial on learning Bayesian networks. Microsoft Research,
Technical Report MSR-TR-95-06.
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan 
Kaufmann. 
⁹ There are, however, several open issues concerning consistency between the conditional
models.
