Active Learning with Statistical Models 
David A. Cohn, Zoubin Ghahramani, and Michael I. Jordan 
cohnpsyche. mi. edu, zoubinpsyche. mi. edu, j ordanepsyche. mi. edu 
Department of Brain and Cognitive Sciences 
Massachusetts Institute of Technology 
Cambridge, MA 02139 
Abstract 
For many types of learners one can compute the statistically "op- 
timal" way to select data. We review how these techniques have 
been used with feedforward neural networks [MacKay, 1992; Cohn, 
1994]. We then show how the same principles may be used to select 
data for two alternative, statistically-based learning architectures: 
mixtures of Gaussians and locally weighted regression. While the 
techniques for neural networks are expensive and approximate, the 
techniques for mixtures of Gaussians and locally weighted regres- 
sion are both efficient and accurate. 
I ACTIVE LEARNING- BACKGROUND 
An active learning problem is one where the learner has the ability or need to 
influence or select its own training data. Many problems of great practical interest 
allow active learning, and many even require it. 
We consider the problem of actively learning a mapping X - Y based on a set of 
training examples {(xi, Yi)}?=l, where xi E X and yi E Y. The learner is allowed 
to iteratively select new inputs  (possibly from a constrained set), observe the 
resulting output , and incorporate the new examples (f, ) into its training set. 
The primary question of active learning is how to choose which  to try next. 
There are many heuristics for choosing f based on intuition, including choosing 
places where we don't have data, where we perform poorly [Linden and Weber, 
1993], where we have low confidence [Thrun and MSller, 1992], where we expect it 
706 David Cohn, Zoubin Ghahramani, Michael L Jordon 
to change our model [Cohn et al, 1990], and where we previously found data that 
resulted in learning [Schmidhuber and Storck, 1993]. 
In this paper we consider how one may select  "optimally" from a statistical 
viewpoint. We first review how the statistical approach can be applied to neural 
networks, as described in MacKay [1992] and Cohn [1994]. We then consider two 
alternative, statistically-based learning architectures: mixtures of Gaussians and 
locally weighted regression. While optimal data selection for a neural network is 
computationally expensive and approximate, we find that optimal data selection for 
the two statistical models is efficient and accurate. 
2 ACTIVE LEARNING- A STATISTICAL APPROACH 
We denote the learner's output given input x as O(x). The mean squared error of 
this output can be expressed as the sum of the learner's bias and variance. The 
variance a(x) indicates the learner's uncertainty in its estimate at x. x Our goal 
will be to select a new example : such that when the resulting example (', .) is 
added to the training set, the integrated variance IV is minimized: 
IV = f o'P(x)dx. (1) 
Here, P(x) is the (known) distribution over X. In practice, we will compute a 
2 at a number of random 
Monte Carlo approximation of this integral, evaluating 
points drawn according to P(x). 
~ 2 the new variance at x given 
Selecting  so as to minimize IV requires computing 
(', .0). Until we actually commit to an , we do not know what corresponding .0 we 
will see, so the minimization cannot be performed deterministically? Many learning 
architectures, however, provide an estimate of P(.[) based on current data, so we 
~2 Selecting  to minimize 
can use this estimate to compute the e:rpectation of 
the expected integrated variance provides a solid statistical basis for choosing new 
examples. 
2.1 
EXAMPLE: ACTIVE LEARNING WITH A NEURAL 
NETWORK 
In this section we review the use of techniques from Optimal Experiment Design 
(OED) to minimize the estimated variance of a neural network [Fedorov, 1972; 
MacKay, 1992; Cohn, 1994]. We will assume we have been given a learner .0 - f0, 
a training set {(xi, Yi)}?= and a parameter vector b that maximizes a likelihood 
measure. One such measure is the minimum sum squared residual 
S2 1 i 
- m (yi - .O(xi)) 2 
'. 
2 
 Unless explicitly denoted, .0 and aS are functions of x. For simplicity, we present our 
results in the univariate setting. All results in the paper extend easily to the multivariate 
case. 
2This contrasts with related work by Plutowski and White [1993], which is concerned 
with filtering an existing data set. 
Active Learning with Statistical Models 707 
The estimated output variance of the network is 
2  S2 ( O__(wX ) :r / 02 S2 
-1 
The standard OED approach assumes normality and local linearity. These as- 
sumptions allow replacing the distribution P(I) by its estimated mean () and 
variance $2 The expected value of the new variance, ~2 is then: 
 r, 
- + ool. 
where we define 
For empirical results on the predictive power of Equation 2, see Cohn [1994]. 
The advantages of minimizing this criterion are that it is grounded in statistics, 
and is optimal given the assumptions. Furthermore, the criterion is continuous 
and differentiable. As such, it is applicable in continuous domains with continuous 
action spaces, and allows hillclimbing to find the "best" . 
For neural networks, however, this approach has many disadvantages. The criterion 
relies on simplifications and strong assumptions which hold only approximately. 
Computing the variance estimate requires inversion of a Iwl x Iwl matrix for each 
new example, and incorporating new examples into the network requires expensive 
retraining. Paass and Kindermann [1995] discuss an approach which addresses some 
of these problems 
3 MIXTURES OF GAUSSIANS 
The mixture of Gaussians model is gaining popularity among machine learning prac- 
titioners [Nowlan, 1991; Specht, 1991; Ghahramani and Jordan, 1994]. It assumes 
that the data is produced by a mixture of N Gaussians gi, for i - 1, ..., N. We 
can use the EM algorithm [Dempster et al, 1977] to find the best fit to the data, 
after which the conditional expectations of the mixture can be used for function 
approximation. 
For each Gaussian gi we will denote the estimated input/output means as/,i and 
2 a 2  and O'xy,i. The conditional variance of 
y,i and estimated covariances as r,i, 
y given x may then be written 
2 
o'y21x,i -- o.2 . O'xy, i 
y, 2 ' 
O'x , i 
We will denote as ni the (possibly fractional) number of training examples for which 
gi takes responsibility: 
m 
= yli) 
708 David Cohn, Zoubin Ghahramani, Michael I. Jordon 
For an input x each gi has conditional expectation i and variance er .. 
' y,z 
O'xy, i [ -- ]Ix,i) , 
O. x2,i \X 
er 2  eryl'i i + 
y , -- rl i 
These expectations and variances are mixed according to the prior probability tha't 
gi has of being responsible for x' 
P(xli) 
hi .-:--hi(x)-- Ej?= 1 P(xlj ) ' 
For input x then, the conditional expectation  of the resulting mixture and its 
variance may be written: 
-- hi Oi, erO _.  i ylx,i 1+ -- . 
i:1 i=1 li O'x, i 
2 
In contrast to the variance estimate computed for a neural network, here er9 can be 
computed efficiently with no approximations. 
3.1 ACTIVE LEARNING WITH A MIXTURE OF GAUSSIANS 
We want to select  to minimize (). With a mixture of Gaussians, the model's 
estimated distribution of  given  is explicit: 
N N 
i=1 i=1 
where ;i = hi(). Given this, calculation of (ff) is straightforward: we model the 
change in each gi separately, calculating its expected variance given a new point 
sampled from P(l , i) and weight this change by hi. The new expectations combine 
to form the learner's new expected variance 
: (3) 
i=1 ni + hi erz,i 
where the expectation can be computed exactly in closed form: 
~2 
O'x , i -- 
O'x, i 
nilzx,i -4- i ' 
ni + i 
n, n(- ,)" n + i + (n + h)" 
Active Learning with Statistical Models 709 
4 LOCALLY WEIGHTED REGRESSION 
We consider here two forms of locally weighted regression (LWR): kernel regression 
and the LOESS model Cleveland et al, 1988]. Kernel regression computes fi as an 
average of the Yi in the data set, weighted by a kernel centered at x. The LOESS 
model performs a linear regression on points in the data set, weighted by a kernel 
centered at x. The kernel shape is a design parameter: the original LOESS model 
uses a "tricubic" kernel; in our experiments we use the more common Gaussian 
hi(x)  h(x - xi) = exp(-k(x - xi)2), 
where k is a smoothing constant. For brevity, we will drop the argument x for hi(x), 
and define n = Y4 hi. We can then write the estimated means and covariances as: 
-- , x -- , cy -- 
n / n 
y =  hy 2 _ ; h(yi - y)2 2 2 _  
n ' y -- n ' y Ix = y  ' 
We use them to express the conditional expectations and their estimated variances: 
2 
2 ry 
kernel:  = py, r = (4) 
%1 1 + -- (5) 
LOESS: 9 = py + ?- (x- p),  = n er 2 
O' x \ 
4.1 
ACTIVE LEARNING WITH LOCALLY WEIGHTED 
REGRESSION 
Again we want to select : to minimize (). With LWR, the model's estimated 
distribution of  given : is explicit: 
P(I ) -- N(9(), 1()) 
The estimate of () is also explicit. Defining  as the weight assigned to: by the 
kernel, the learner's expected new variance is 
%1 (x- ) 2 
kernel: 
where the expectation can be computed exactly in closed form: 
(%) () =  + () (()- ) 
 +   h(-)(9()- ) 
 - d ;] + (n + )" ' ( + ) 
710 David Cohn, Zoubin Ghahramani, Michael I. Jordon 
5 EXPERIMENTAL RESULTS 
Below we describe two sets of experiments demonstrating the predictive power of 
the query selection criteria in this paper. In the first set, learners were trained on 
data from a noisy sine wave. The criteria described in this paper were applied to 
predict how a new training example selected at point : would decrease the learner's 
variance. These predictions, along with the actual changes in variance when the 
training points were queried and added, are plotted in Figure 1. 
1,Neural Network 
  ....... predicted change 
0 0.2 0.4 0.6 0.8 
,Kernel, Regression 
0.5 
o 
.......... predicted change 
 i -- actual change-., 
-0.5 . '," ' '  ,  '.-- 
-10 0.2 0.4 0.6 0.8 
1Mixture of Gaussian,s , 
{ v . 
......... predicted change 
/ -- actual change 
o o'.2 o'.4 o16 o'.8 
LOESS 
......... predicted change 
 ' actual change 
o'.2 o'.n 0'.6 0'.8 
Figure 1: The upper portion of each plot indicates each learner's fit to noisy sinu- 
soidal data. The lower portion of each plot indicates predicted and actual changes 
in the learner's average estimated variance when Y: is queried and added to the 
training set, for  E [0, 1]. Changes are not plotted to scale with learners' fits. 
In the second set of experiments, we gpplied the techniques of this paper to learning 
the kinematics of a two-joint planar arm (Figure 2; see Cohn [1994] for details). 
Below, we illustrate the problem using the LOESS algorithm. 
An example of the correlation between predicted and actual changes in variance 
on this problem is plotted in Figure 2. Figure 3 demonstrates that this cor- 
relation may be exploited to guide sequential query selection. We compared a 
LOESS learner which selected each new query so as to minimize expected variance 
Active Learning with Statistical Models 711 
with LOESS learners which selected queries according to various heuristics. The 
variance-minimizing learner significantly outperforms the heuristics in terms of both 
variance and MSE. 
0.025 
0.02 
 0.015 
,- 
 0.01 
 0.005 
 o 
-0.005 
, 
-0.005 
0 O' O' 
0 0 .0 
0 
o o 
 0.05 0.1 0.015 0.02 
predicted delta variance 
0.025 
Figure 2: (left) The arm kinematics problem. (right) Predicted vs. actual changes 
in model variance for LOESS on the arm kinematics problem. 100 candidate points 
are shown for a model trained with 50 initial random examples. Note that most 
of the potential queries produce very little improvement, and that the algorithm 
successfully identifies those few that will help most. 
0.2 
0.1 
VarianceO.O 4 
0.02 
0.01 
 , bias _ 
o SUI3DOI/ 
0 100 150 200 250 300 350 400 450 500 
training examples 
30- 
10 
3 
MSE 
1 
0.3 
0.1 
randem.. 
,, nsmvay 
las . 
/ s su13port 
.., variance 
o ' ' o ' o ' ' ' 
100 150 2 250 400 450 500 
training examples 
Figure 3: Variance and MSE for a LOESS learner selecting queries according to 
the variance-minimizing criterion discussed in this paper and according to several 
heuristics. "Sensitivity" queries where output is most sensitive to new data, "Bias" 
queries according to a bias-minimizing criterion, "Support" queries where the model 
has the least data support. The variance of "Random" and "Sensitivity" are off the 
scale. Curves are medians over 15 runs with non-Gaussian noise. 
712 David Cohn, Zoubin Ghahramani, Michael L Jordon 
6 SUMMARY 
Mixtures of Gaussians and locally weighted regression are two statistical models 
that offer elegant representations and efficient learning algorithms. In this paper 
we have shown that they also offer the opportunity to perform active learning in an 
efficient and statistically correct manner. The criteria derived here can be computed 
cheaply and, for problems tested, demonstrate good predictive power. 
Acknowledgement s 
This work was funded by NSF grant CDA-9309300, the McDonnell-Pew Foundation, 
ATR Human Information Processing Laboratories and Siemens Corporate Research. 
We thank Stefan Schaal for helpful discussions about locally weighted regression. 
References 
W. Cleveland, S. Devlin, and E. Grosse. (1988) Regression by local fitting. Journal of 
Econometrics 37:87-114. 
D. Cobh, L. Atlas and R. Ladnet. (1990) Training Connectionist Networks with Queries 
and Selective Sampling. In D. Touretzky, ed., Advances in Neural Information Processing 
Systems 2, Morgan Kaufmann. 
D. Cohn. (1994) Neural network exploration using optimal experiment design. In J. Cowan 
et al., eds., Advances in Neural Information Processing Systems 6. Morgan Kaufmann. 
A. Dempster, N. Laird and D. Rubin. (1977) Maximum likelihood from incomplete data. 
via. the EM algorithm. J. Royal Statistical Society Series B, 39:1-38. 
V. Fedorov. (1972) Theory of Optimal Experiments. Aca.demic Press, New York. 
Z. Gha.hra.ma.ni a.nd M. Jorda.n. (1994) Supervised lea.ming from incomplete da.ta. via. a.n 
EM a.pproach. In J. Cowa.n et al., eds., Advances in Neural Information Processing Systems 
6. Morga.n Ka.ufma.nn. 
A. Linden a.nd F. Weber. (1993) Implementing inner drive by competence reflection. In 
H. Roitbla.t et al., eds., Proc. nd Int. Conf. on Simulation of Adaptive Behavior, MIT 
Press, Ca.mbridge. 
D. MacKay. (1992) Informa.tion-based objective functions for active da.ta. selection, Neural 
Computation 4(4): 590-604. 
S. Nowla.n. (1991) Soft Competitive Ada.pta.tion: Neural Network Lea.ming Algorithms 
ba.sed on Fitting Statistical Mixtures. CMU-CS-91-126, School of Computer Science, 
Carnegie Mellon University, Pittsburgh, PA. 
Pa.a.ss, G., a.nd Kinderma.nn, J. (1995). Ba.yesia.n Query Construction for Neural Network 
Models. In this volume. 
M. Plutowski a.nd H. White (1993). Selecting concise training sets from clea.n da.ta.. IEEE 
Transactions on Neural Networks, 4, 305-318. 
S. Scha.al a.nd C. Atkeson. (1994) Robot Juggling: An Implementa.tion of Memory-based 
Lea.ming. Control Systems Magazine, 14(1):57-71. 
J. Schmidhuber a.nd J. Storck. (1993) Reinforcement driven informa.tion acquisition in 
nondeterministic environments. Tech. Report, Fa.kult/kt fiir Informa.tik, Technische Uni- 
versit gt Miinchen. 
D. Specht. (1991) A general regression neural network. IEEE Trans. Neural Networks, 
2(6):568-576. 
S. Thrun and K. M611er. (1992) Active explora.tion in dyna.mic environments. In J. Moody 
et al., editors, Advances in Neural Information Processing Systems d. Morga.n Ka.ufma.nn. 
