Variational Inference for Bayesian 
Mixtures of Factor Analysers 
Zoubin Ghahramani and Matthew J. Beal 
Gatsby Computational Neuroscience Unit 
University College London 
17 Queen Square, London WCIN 3AR, England 
{zoubin,m. bealegat sby. ucl. ac. uk 
Abstract 
We present an algorithm that infers the model structure of a mix- 
ture of factor analysers using an efficient and deterministic varia- 
tional approximation to full Bayesian integration over model pa- 
rameters. This procedure can automatically determine the opti- 
mal number of components and the local dimensionality of each 
component (i.e. the number of factors in each factor analyser). 
Alternatively it can be used to infer posterior distributions over 
number of components and dimensionalities. Since all parameters 
are integrated out the method is not prone to overfitting. Using a 
stochastic procedure for adding components it is possible to per- 
form the variational optimisation incrementally and to avoid local 
maxima. Results show that the method works very well in practice 
and correctly infers the number and dimensionality of nontrivial 
synthetic examples. 
By importance sampling from the variational approximation we 
show how to obtain unbiased estimates of the true evidence, the 
exact predictive density, and the KL divergence between the varia- 
tional posterior and the true posterior, not only in this model but 
for variational approximations in general. 
1 Introduction 
Factor analysis (FA) is a method for modelling correlations in multidimensional 
data. The model assumes that each p-dimensional data vector y was generated by 
first linearly transforming a k < p dimensional vector of unobserved independent 
zero-mean unit-variance Gaussian sources, x, and then adding a p-dimensional zero- 
mean Gaussian noise vector, n, with diagonal covariance matrix : i.e. y - Ax + n. 
Integrating out x and n, the marginal density of y is Gaussian with zero mean 
and covariance AA r + . The matrix A is known as the factor loading matrix. 
Given data with a sample covariance matrix E, factor analysis finds the A and  
that optimally fit E in the maximum likelihood sense. Since k < p, a single factor 
analyser can be seen as a reduced parametrisation of a full-covariance Gaussian.  
Factor analysis and its relationship to principal components analysis (PCA) and mix- 
ture models is reviewed in [10]. 
450 Z. Ghahramani and M. J. Beal 
A mixture of factor analysers (MFA) models the density for y as a weighted average 
of factor analyser densities 
s 
P(ylA, = P(sl)P(yls, A s, (1) 
s----1 
where r is the vector of mixing proportions, s is a discrete indicator variable, and 
A s is the factor loading matrix for factor analyser s which includes a mean vector 
for y. 
By exploiting the factor analysis parameterisation of covariance matrices, a mix- 
ture of factor analysers can be used to fit a mixture of Gaussians to correlated high 
dimensional data without requiring O(p 2) parameters or undesirable compromises 
such as axis-aligned covariance matrices. In an MFA each Gaussian cluster has in- 
trinsic dimensionality k (or ks if the dimensions are allowed to vary across clusters). 
Consequently, the mixture of factor analysers simultaneously addresses the prob- 
lems of clustering and local dimensionality reduction. When  is a multiple of the 
identity the model becomes a mixture of probabilistic PCAs. Tractable maximum 
likelihood procedure for fitting MFA and MPCA models can be derived from the 
Expectation Maximisation algorithm [4, 11]. 
The maximum likelihood (ML) approach to MFA can easily get caught in local 
maxima. 2 Ueda et al. [12] provide an effective deterministic procedure for avoiding 
local maxima by considering splitting a factor analyser in one part of space and 
merging two in a another part. But splits and merges have to be considered simul- 
taneously because the number of factor analysers has to stay the same since adding 
a factor analyser is always expected to increase the training likelihood. 
A fundamental problem with maximum likelihood approaches is that they fail to 
take into account model complexity (i.e. the cost of coding the model parameter- 
s). So more complex models are not penalised, which leads to overfitting and the 
inability to determine the best model size and structure (or distributions thereof) 
without resorting to costly cross-validation procedures. Bayesian approaches over- 
come these problems by treating the parameters 0 as unknown random variables 
and averaging over the ensemble of models they define: 
P() = f ). (2) 
P(Y) is the evidence for a data set Y = {y,... ,yN}. Integrating out parameters 
penalises models with more degrees of freedom since these models can a priori 
model a larger range of data sets. All information inferred from the data about the 
parameters is captured by the posterior distribution P(OIY ) rather than the ML 
point estimate 
While Bayesian theory deals with the problems of overfitting and model selec- 
tion/averaging, in practice it is often computationally and analytically intractable to 
perform the required integrals. For Gaussian mixture models Markov chain Monte 
Carlo (MCMC) methods have been developed to approximate these integrals by 
sampling [8, 7]. The main criticism of MCMC methods is that they are slow and 
2Technically, the log likelihood is not bounded above if no constraints are put on the 
determinant of the component covariances. So the real ML objective for MFA is to find 
the highest finite local maximum of the likelihood. 
awe sometimes use 0 to refer to the parameters and sometimes to all the unknown 
quantities (parameters and hidden variables). Formally the only difference between the two 
is that the number of hidden variables grows with N, whereas the number of parameters 
usually does not. 
Variational Inference for Bayesian Mixtures of Factor Analysers 451 
it is usually difficult to assess convergence. Furthermore, the posterior density over 
parameters is stored as a set of samples, which can be inefficient. 
Another approach to Bayesian integration for Gaussian mixtures [9] is the Laplace 
approximation which makes a local Gaussian approximation around a maximum a 
posteriori parameter estimate. These approximations are based on large data limits 
and can be poor, particularly for small data sets (for which, in principle, the advan- 
tages of Bayesian integration over ML are largest). Local Gaussian approximations 
are also poorly suited to bounded or positive parameters such as the mixing pro- 
portions of the mixture model. Finally, it is difficult to see how this approach can 
be applied to online incremental changes to model structure. 
In this paper we employ a third approach to Bayesian inference: variational ap- 
proximation. We form a lower bound on the log evidence using Jensen's inequality: 
f f P(Y,O) 
/2 -- in P(Y) = In dO P(Y, O) _> dO Q(O) In Q(O) - 
(3) 
which we seek to maximise. Maximising 5 r is equivalent to minimising the KL- 
divergence between Q(O) and P(OIY), so a tractable Q can be used as an approx- 
imation to the intractable posterior. This approach draws its roots from one way 
of deriving mean field approximations in physics, and has been used recently for 
Bayesian inference [13, 5, 1]. 
The variational method has several advantages over MCMC and Laplace approxi- 
mations. Unlike MCMC, convergence can be assessed easily by monitoring 5 r. The 
approximate posterior is encoded efficiently in Q(O). Unlike Laplace approxima- 
tions, the form of Q can be tailored to each parameter (in fact the optimal form 
of Q for each parameter falls out of the optimisation), the approximation is global, 
and Q optimises an objective function. Variational methods are generally fast, 5 r 
is guaranteed to increase monotonically and transparently incorporates model com- 
plexity. To our knowledge, no one has done a full Bayesian analysis of mixtures of 
factor analysers. 
Of course, vis-a-vis MCMC, the main disadvantage of variational approximations 
is that they are not guaranteed to find the exact posterior in the limit. However, 
with a straightforward application of sampling, it is possible to take the result of 
the variational optimisation and use it to sample from the exact posterior and exact 
predictive density. This is described in section 5. 
In the remainder of this paper we first describe the mixture of factor analysers in 
more detail (section 2). We then derive the variational approximation (section 3). 
We show empirically that the model can infer both the number of components and 
their intrinsic dimensionalities, and is not prone to overfitting (section 6). Finally, 
we conclude in section 7. 
2 The Model 
Starting from (1), the evidence for the Bayesian MFA is obtained by averaging the 
likelihood under priors for the parameters (which have their own hyperparameters): 
P(Y) 
(4) 
452 Z. Ghahramani and M. . Beal 
Here {a, a, b, ) are hyperparameters 4, v are precision parameters (i.e. inverse vari- 
ances) for the columns of A. The conditional independence relations between the 
variables in this model are shown 
tation in Figure 1. 
Figure 1: Generative model for 
variational Bayesian mixture of 
factor analysers. Circles denote 
random variables, solid rectangles 
denote hyperparameters, and the 
dashed rectangle shows the plate 
(i.e. repetitions) over the data. 
graphically in the usual belief network represen- 
While arbitrary choices could be made for the 
priors on the first line of (4), choosing priors that 
are conjugate to the likelihood terms on the sec- 
ond line of (4) greatly simplifies inference and 
interpretability. 5 So we choose P(rla ) to be 
symmetric Dirichlet, which is conjugate to the 
multinomial P(slr ). 
The prior for the factor loading matrix plays a 
key role in this model. Each component of the 
mixture has a Gaussian prior P(A81vs), where 
each element of the vector 8 is the precision of 
a column of A. If one of these precisions  -> 0, 
then the outgoing weights for factor xt will go to 
zero, which allows the model to reduce the in- 
trinsic dimensionality of x if the data does not 
warrant this added dimension. This method of 
intrinsic dimensionality reduction has been used 
by Bishop [2] for Bayesian PCA, and is closely 
related to MacKay and Neal's method for auto- 
matic relevance determination (ARD) for inputs 
to a neural network [6]. 
To avoid overfitting it is important to integrate out all parameters whose cardinality 
scales with model complexity (i.e. number of components and their dimensionali- 
ties). We therefore also integrate out the precisions using Gamma priors, P(tla , b). 
3 The Variational Approximation 
Applying Jensen's inequality repeatedly to the log evidence (4) we lower bound it 
using the following factorisation of the distribution of parameters and hidden vari- 
ables: Q (A) Q (r, ) Q (s, x). Given this factorisation several additional factorisations 
fall out of the conditional independencies in the model resulting in the variational 
objective function: 
N S 
q- E E Q(sn) [/ dr Q(r)In p(sn[r) / p(xn) 
n:l s":l Q($n) q- dxnQ(xn[$ n) In Q(xnlsn) 
+/dASQ(AS)/dxnQ(xnlsn)lnP(ynlxn, sn, As,)] (5) 
The variational posteriors Q(.), as given in the Appendix, are derived by performing 
a free-form extremisation of  w.r.t.Q. It is not difficult to show that these extrema 
are indeed maxima of . The optimal posteriors Q are of the same conjugate forms 
as the priors. The model hyperparameters which govern the priors can be estimated 
in the same fashion (see the Appendix). 
4We currently do not integrate out 9, although this can also be done. 
5Conjugate priors have the same effect as pseudo-observations. 
Variational Inference for Bayesian Mixtures of Factor Analysers 453 
4 Birth and Death 
When optimising jr, occasionally one finds that for some s: -n Q(s') = 0. These 
zero responsibility components are the result of there being insufficient support from 
the local data to overcome the dimensional complexity prior on the factor loading 
matrices. So components of the mixture die of natural causes when they are no 
longer needed. Removing these redundant components increases 
Component birth does not happen spontaneously, so we introduce a heuristic. 
Whenever .T has stabilised we pick a parent-component stochastically with prob- 
ability proportional to e -y' and attempt to split it into two; s is the s-specific 
contribution to jr with the last bracketed term in (5) normalised by ',Q(sn). 
This works better than both cycling through components and picking them at ran- 
dom as it concentrates attempted births on components that are faring poorly. The 
parameter distributions of the two Gaussians created from the split are initialised 
by partitioning the responsibilities for the data, Q(s'), along a direction sampled 
from the parent's distribution. This usually causes  to decrease, so by monitoring 
the future progress of  we can reject this attempted birth if  does not recover. 
Although it is perfectly possible to start the model with many components and let 
them die, it is computationally more efficient to start with one component and allow 
it to spawn more when necessary. 
5 Exact Predictive Density, True Evidence, and KL 
By importance sampling from the variational approximation we can obtain unbiased 
estimates of three important quantities: the exact predictive density, the true log 
evidence , and the KL divergence between the variational posterior and the true 
posterior. Letting 0 = (A, r), we sample Oi ~ Q(O). Each such sample is an instance 
of a mixture of factor analysers with predictive density given by (1). We weight 
these predictive densities by the importance weights wi = P(Oi, Y)/Q(Oi), which 
are easy to evaluate. This results in a mixture of mixtures of factor analysers, and 
will converge to the exact predictive density, P(ylY), as long as Q(O) > 0 wherever 
P(O[Y)  O. The true log evidence can be similarly estimated by  = ln(w), where 
(-) denotes averaging over the importance samples. Finally, the KL divergence is 
given by: KL(Q(O)[[P(OIY)) = ln(w) - (ln w). 
This procedure has three significant properties. First, the same importance weights 
can be used to estimate all three quantities. Second, while importance sampling 
can work very poorly in high dimensions for ad hoc proposal distributions, here the 
variational optimisation is used in a principled manner to pick Q to be a good ap- 
proximation to P and therefore hopefully a good proposal distribution. Third, this 
procedure can be applied to any variational approximation. A detailed exposition 
can be found in [3]. 
6 Results 
Experiment 1: Discovering the number of components. We tested the 
model on synthetic data generated from a mixture of 18 Gaussians with 50 points 
per cluster (Figure 2, top left). The variational algorithm has little difficulty finding 
the correct number of components and the birth heuristics are successful at avoiding 
local maxima. After finding the 18 Gaussians repeated splits are attempted and 
rejected. Finding a distribution over number of components using  is also simple. 
Experiment 2: The shrinking spiral. We used the dataset of 800 data points 
from a shrinking spiral from [12] as another test of how well the algorithm could 
454 Z. Ghahramani and M. J. Beal 
Figure 2: (top) Exp 1: The frames from left to right are the data, and the 2 S.D. Gaussian 
ellipses after 7, 14, 16 and 22 accepted births. (bottom) Exp 2: Shrinking spiral data 
and i S.D. Gaussian ellipses after 6, 9, 12, and 17 accepted births. Note that the number 
of Gaussians increases from left to right. 
-64O0 
-660( 
-680( 
-700(; 
-720(; 
-740C 
-760C 
-760C 
of points 
per cluster 
8 
8 
intrinsic dimensionalities 
7 4 3 2 2 
I ; II 1 I 
I i II 2 
 I 4 I 2 
1 6 3 3 2 2 
1 7 4 3 2 2 
1 7 4 3 2 2 
Figure 3: (left) Exp 2: . as function of iteration for the spiral problem on a typical run. 
Drops in . constitute component births. Thick lines are accepted attempts, thin lines are 
rejected attempts. (middle) Exp 3: Means of the factor loading matrices. These results 
are analogous to those given by Bishop [2] for Bayesian PCA. (right) Exp 3: Table with 
learned number of Gaussians and dimensionalities as training set size increases. Boxes 
represent model components that capture several of the clusters. 
escape local maxima and how robust it was to initial conditions (Figure 2, bottom). 
Again local maxima did not pose a problem and the algorithm always found between 
12-14 Gaussians regardless of whether it was initialised with 0 or 200. These runs 
took about 3-4 minutes on a 500MHz Alpha EV6 processor. A plot of J shows that 
most of the compute time is spent on accepted moves (Figure 3, left). 
Experiment 3: Discovering the local dimensionalities. We generated a syn- 
thetic data set of 300 data points in each of 6 Gaussians with intrinsic dimension- 
alities (7 4 3 2 2 1) embedded in 10 dimensions. The variational Bayesian approach 
correctly inferred both the number of Gaussians and their intrinsic dimensionalities 
(Figure 3, middle). We varied the number of data points and found that as expected 
with fewer points the data could not provide evidence for as many components and 
intrinsic dimensions (Figure 3, right). 
7 Discussion 
Search over model structures for MFAs is computationally intractable if each factor 
analyser is allowed to have different intrinsic dimensionalities. In this paper we have 
shown that the variational Bayesian approach can be used to efficiently infer this 
model structure while avoiding overfitting and other deficiencies of ML approaches. 
One attraction of our variational method, which can be exploited in other models, 
is that once a factorisation of Q is assumed all inference is automatic and exact. 
We can also use J to get a distribution over structures if desired. Finally we derive 
Varational Inference for Bayesian Mixtures of Factor Analysers 455 
a generally applicable importance sampler that gives us unbiased estimates of the 
true evidence, the exact predictive density, and the KL divergence between the 
variational posterior and the true posterior. 
Encouraged by the results on synthetic data, we have applied the Bayesian mixture 
of factor analysers to a real-world unsupervised digit classification problem. We 
will report the results of these experiments in a separate article. 
Appendix: Optimal Q Distributions and Hyperparameters 
Q(xls) ~ Af(', E*) Q(Aq)~Af(X,E q'*) Q(yt)~6(a,bt) Q(,r) ~ z)(wu) 
I A, ' 
lnQ(s)=[(wu,)-(w)]+lnl*[+(lnP(ylx,s , ))+c 
'*=*X*-y, X= -xQ(s)y'** q'* a=a+ p b=b+ (At 2 
n=l . q q=l 
N N 
E * = <A*Z-A*> +I, E q'* =q- Q(s)<xxZ> +diag<y*>, wu =  
where {, 6, } denote Normal, Gamma d Dirichlet distributions respectively, {.> de- 
notes expectation under the variational posterior, d (x) is the digaroma function 
(x)   lnF(x). Note that the optimal distributions Q(A *) have block diagonal co- 
variance structure; even though each A * is a p x q matrix, its coviance only h Oq 2) 
pameters. Differentiating  with respect to the pmeters, a and b, of the precision pri- 
or we get fixed point equations (a) = {ln y> +ln b and b = a/{y>. Similly the fixed point 
for the pameters of the Dirichlet prior is (a) - (/S) + E [(wu,) - (w)]/S = O. 
References 
[1] H. Attias. Inferring parameters and structure of latent variable models by variational 
Bayes. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence, 1999. 
[2] C.M. Bishop. Variational PCA. In Proc. Ninth Int. Conf. on Artificial Neural Net- 
works. ICANN, 1999. 
[3] Z. Ghahramani, H. Attias, and M.J. Beal. Learning model structure. Technical 
Report GCNU-TR-1999-006, (in prep.) Gatsby Unit, Univ. College London, 1999. 
[4] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of fac- 
tor analyzers. Technical Report CRG-TR-96-1 [http://www.gatsby.ucl.ac.uk/ 
zoubin/papers/tr-96-1. ps. gz], Dept. of Comp. Sci., Univ. of Toronto, 1996. 
[5] D.J.C. MacKay. Ensemble learning for hidden Markov models. Technical report, 
Cavendish Laboratory, University of Cambridge, 1997. 
[6] R.M. Neal. Assessing relevance determination methods using DELVE. In C.M. Bish- 
op, editor, Neural Networks and Machine Learning, 97-129. Springer-Verlag, 1998. 
[7] C.E. Rasmussen. The infinite gaussian mixture model. In Adv. Neur. Inf. Proc. $ys. 
12. MIT Press, 2000. 
[8] S. Richardson and P.J. Green. On Bayesian analysis of mixtures with an unknown 
number of components. J. Roy. Star. $oc.-$er. B, 59(4):731-758, 1997. 
[9] S.J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to Gaussian 
mixture modeling. IEEE PAMI, 20(11):1133-1142, 1998. 
[10] S. T. Roweis and Z. Ghahramani. A unifying review of linear Gaussian models. Neural 
Computation, 11(2):305-345, 1999. 
[11] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component ana- 
lyzers. Neural Computation, 11(2):443-482, 1999. 
[12] N. Ueda, R. Nakano, Z. Ghahramani, and G.E. Hinton. SMEM algorithm for mixture 
models. In Adv. Neut. Inf. Proc. $ys. 11. MIT Press, 1999. 
[13] S. Waterhouse, D.J.C. Mackay, and T. Robinson. Bayesian methods for mixtures of 
experts. In Adv. Neur. Inf. Proc. $ys. 7. MIT Press, 1995. 
