Active Learning in Multilayer 
Perceptrons 
Kenji Fukumizu 
Information and Communication R&D Center, Ricoh Co., Ltd. 
3-2-3, Shin-yokohama, Yokohama, 222 Japan 
E-mail: fukuic.rdc.ricoh.co.jp 
Abstract 
We propose an active learning method with hidden-unit reduction, 
which is devised specially for multilayer perceptrons (MLP). First, 
we review our active learning method, and point out that many 
Fisher-information-based methods applied to MLP have a critical 
problem: the information matrix may be singular. To solve this 
problem, we derive the singularity condition of an information ma- 
t fix, and propose an active learning technique that is applicable to 
MLP. Its effectiveness is verified through experiments. 
I INTRODUCTION 
When one trains a learning machine using a set of data given by the true system, its 
ability can be improved if one selects the training data actively. In this paper, we 
consider the problem of active learning in multilayer perceptrons (MLP). First, we 
review our method of active learning (Fukumizu el al., 1994), in which we prepare a 
probability distribution and obtain training data as samples from the distribution. 
This methodology leads us to an information-matrix-based criterion similar to other 
existing ones (Fedorov, 1972; Pukelsheim, 1993). 
Active learning techniques have been recently used with neural networks (MacKay, 
1992; Cohn, 1994). Our method, however, as well as many other ones has a crucial 
problem: the required inverse of an information matrix may not exist (White, 1989). 
We propose an active learning technique which is applicable to three-layer percep- 
trons. Developing a theory on the singularity of a Fisher information matrix, we 
present an active learning algorithm which keeps the information marlex nonsingu- 
lar. We demonstrate the effectiveness of the algorithm through experiments. 
296 K. FUKUMIZU 
2 STATISTICALLY OPTIMAL TRAINING DATA 
2.1 A CRITERION OF OPTIMALITY 
We review the criterion of statistically optimal training data (Fukumizu et al., 1994). 
We consider the regression problem in which the target system maps a given input 
a to y according to 
y = z, 
where f(a) is a deterministic function from R L to R a, and Z is a random variable 
whose law is a normal distribution N(O, a2IM), (Ia4 is the unit M x M matrix). 
Our objective is to estimate the true function f s accurately  possible. 
Let f(; 0)} be a parametric model for estimation. We use the maximum likelihood 
estimator (MLE)  for training data (((", y(")}ff=, which minimizes the sum of 
squared errors in this ce. In theoretical derivations, we sume that the target 
function f is included in the model and equal to f(.; 00). 
Wc make a training example by choosing (" to try. observing the resulting output 
y(", and pairing them. The problem of active learning is how to determine input 
data ()  
},=x to minimize the estimation error after training. Our approach is 
a statistical one using a probability for training. r(), and choosing {()  
independent samples from r() to minimize the expectation of the MSE in the 
actual environment: 
EMSE = E((,y(,)} [/ / [[y - f(;)[[p(y[)dydQ] . (1) 
In the above equation, Q is the environmental probability which gives input vectors 
to the true system in the actual environment, and E((.y',)} means the expec- 
tation on training data. Eq.(1), therefore, shows the average error of the trained 
machine that is used as a substitute of the true finction in the actual environment. 
2.2 REVIEW OF AN ACTIVE LEARNING METHOD 
Using statistical asymptotic theory, Eq. (1) is approximated as follows: 
2 
EMSE = 2 + Tr[I(Oo)J-X(0o)] + O(N-3/a), 
where the matrixes I and J are (Fisher) information matrixes defined by 
Iab(z; 0) = 
(2) 
OOa 
, t(o) = f 
:(o) = f 
The essential part of Eq.(2) is Tr[I(Oo)J-X(0o)], computed by the unavailable pa- 
rameter 00. We have proposed a practical algorithm in which we replace 00 with . 
prepare a family of probability {r(a; v) I v: paramater} to choose training samples, 
and optimize v and  iteratively (Fukumizu et al., 1994). 
Active Learning Algorithm 
1. Select an initial training data set D[0] fi'om r(;r; v[0]). and compute [0]. 
2. k:=l. 
3. Compute the optimal v = Vial to minimize Tr[I([a_])J-([a_x])]. 
Active Learning in Multilayer Perceptrons 297 
4. Choose N. new training data from r(a;v[]) and let D[] be a union of 
D[_] and the new data. 
5. Compute the MLE [1 based on the training data set D[]. 
6. k:=k+landgoto3. 
The above method utilizes a probability to generate training data. It has the 
advantage of making many data in one step compared to existing ones in which 
only one data is chosen in each step, though their criterions are similar to each 
other. 
3 SINGULARITY OF AN INFORMATION MATRIX 
3.1 A PROBLEM ON ACTIVE LEARNING IN MLP 
Hereafter, we focus on active learning in three-layer perceptrons with H hidden 
units, Afs = {J'(;, 0)}. The map J'(;; 0) is defined by 
H L 
fi(a:;O) = y] wij s(y] ujaz + (j) + rli, (1 < i < M), (3) 
j=l k=l 
where s(t) is the sigmoidal function: s(t) = 1/(1 + e-t). 
Our active learning method as well as many other ones requires the inverse of an 
information natrix J. The information matrix of MLP, however, is not always 
invertible (White, 1989). Any statistical aigorithms utilizing the inverse, then, 
cannot be applied directly to MLP (Hagiwara et al., 1993). Such problems do not 
arise in linear models, which almost always have a nonsingular information matrix. 
3.2 SINGULARITY OF AN INFORMATION MATRIX OF MLP 
The following theorem shows that the information matrix of a three-layer perceptron 
is singular if and only if the network has redundant hidden units. We can deduce 
that if the information matrix is singular, we can make it nonsingular by eliminating 
redundant hidden units without changing the input-output map. 
Theorem 1 Assume r(a) is continuous and positive at any ;r. Then. the Fisher 
information matrix J is singular if and only if at least one of the following three 
conditions is satisfied: 
(1) uj := (ujx,...,uj) T = O, for some j. 
(2) wj := (Wj, . . . , WMj) = 0 T. for some j. 
(3) For different jx and j2, 
The rough sketch of the proof is shown below. The complete proof will appear in a 
forthconfing paper '(Fukumizu, 1996). 
Rough sketch of the proof. We know easily that an information matrix is singular if 
o(;0) } are linearly dependent. The sufficiency can be proved easily. 
and only if { o0o 
To show the necessity, we show that the derivatives are linearly independent if none 
of the three conditions is satisfied. Assume a linear relation: 
M H 
i=1 j=l 
+ E E + E 
i=1 j=l k=l j=l 
-- =o. (4) 
298 K. FUKUMIZU 
We can show there exists a basis of R L, ((),..., (L)}, such that uj. () y 0 for 
vj, Vl, and uj  ;r (t) + j  4-(uj2  (t) + Q) for j  j2,l. We replace  in 
eq.(4) by(Ot(tSR).Letm(f):=uj.(O,S (0 tzeCl z ((2n+ 1) 
j :  -- 
, _ ,,..q(0 The points in the singularities 
(j)/m(f) n  Z}, and D (0 := C jj . S t) are 
of s(m(f)z + Q). We define holomorphic functions on D(0 as 
.-- j=iaij + 
a '(m(.l)z + (j), (1 < i < M) 
+ =owos   
From eq.(4), we have 9)(t) = 0 for all t  R. Using standard arguments on isolated 
singularities of holomorphic functions, we know S l) are removable singularities of 
9(t)(z/ and finly obtn 
, EL& ' = 0, p0: 0, : 0, 0: 0. 
It is ey to see flj = 0. This completes the proo5 
3.3 REDUCTION PROCEDURE 
We introduce the following reduction procedure based on Theorem 1. Used dur- 
ing BP training, it eliminates redundant hidden units and keeps the information 
matrix nonsingular. The criterion of elimination is very important, because exces- 
sive eliurination of hidden units degrades the approximation capacity. We propose 
an algorithm which does not increase the mean squared error on average. In the 
following, let j := s(itj.  + j) and e(N) = A/N for a positive number A. 
Reduction 
1. If 
and 
If 
If 
then 
If 
e 
then eliminate the j2th hidden unit and ij 
Wij2 for all i. 
Procedure 
I1%11 ' f(j -s())'dQ < (N), then eliurinate the jth hidden unit, 
Oi '-} Oi + bijs(j) for all i. 
I1%11 ' ()'dQ < 4N), then eliminate the jth hidden unit. 
I1%11" (j2 - )'dQ < (N) for different j and j2, 
eliminate the j2th hidden unit and bi  bij + bij. for all i. 
I1%11 ' (1 - y - ,)'dQ < 4N) for different j and j., 
rrom Theorem l, weknowthatj,/tj, (/if2 j)_(it.T j),or(/t,  
)+%, ) 
' $1' ~ ' 
can be reduced to 0 if the information matrix is singular. Let 0  Afar denote 
the reduced parameter from  according to the above procedure. The above four 
conditions are, then, given by calculating f I1'(; )- '(; )11'd  
We briefly explain how the procedure keeps the information matrix nonsingular 
and does not increase EMSE in high probability. First, suppose der J(00) = 0, then 
there exists 00 x  Afar (K < H) such that f(x;00) = f(x;00 x) and detJ(0ff) y 0 
in Afar. The elimination of hidden units up to K, of course, does not increase the 
EMSE. Therefore, we have only to consider the case in which der J(00) y 0 and 
hidden units are eliminated. 
Suppose f II'(; 0o ) - '(; Oo)ll'dq > O(N -x) for any reduced parameter 0oar frmn 
00. The probability of satisfying f II'(; ) - Y(,;d)ll'dQ < A/N is very small for 
Active Learning in Multilayer Perceptrons 299 
a sufficiently small A. Thus, the elimination of hidden units occurs in very tiny 
probability. Next, suppose f 00 - '(; Oo)ll'dq - O(N Let t 6 A/s: be 
a reduced parameter made from  with the same procedure  we obtain 0 kom 
00. We 11 show for a suciently snall A, 
where  is MLE computed in . We write 9 = (9(x),9 (2)) in which 9 (2) is 
changed to 0 in reduction, channg the coordinate system if necessary. The Taylor 
expansion and asymptotic theory ve 
E [fllf(;O K) - f(;Oo)l]dQ]  
where Iii and Jii denote the local information matrixes w.r.t. 9 (i) (i = 1, 2). Thus, 
if2 
+ 0) - 00)ll 
N ' 
Since the sum of the last two terms is positive, the 1.h.s is positive if E[f I]f(; )- 
f(; 0)l]adQ] < BIN for a suciently small B. Athough we cannot know the vaue 
of this expectation, we can make the probability of holding this enequality very high 
by taking a small A. 
4 
ACTIVE LEARNING WITH REDUCTION 
PROCEDURE 
The reduction procedure keeps the information matrix nonsingular and makes the 
active learning algorithm applicable to MLP even with surplus hidden units. 
Active 
1. 
2. 
3. 
Learning with Hidden Unit Reduction 
Select initial training data set Do from r(a; v[0]). and compute t[o]. 
k := 1, and do REDUCTION PROCEDURE. 
Conpute the optimal v = vial to minimize Tr[I([a_Xl)J-X(O[_x])], using 
the steepest descent method. 
Choose Na new training data from r(;vtal) and let Dt ] be a union of 
D[-x] and the new data. 
Compute the MLE [a] based on the training data Dial using BP with 
REDUCTION PROCEDURE. 
k := k -t- 1 and go to 3. 
The BP witIx reduction procedure is applicable not only to active learning, but to 
a variety of statistical techniques that require the inverse of an information matrix. 
We do not discuss it in this paper. however. 
300 K. FUKUMIZU 
 0.000001  
: Active Learning 
 Active Learning [Av-Sd,Av+Sd] 
- --. Passive Learning 
+ Passive Learning 
%,-,...  
.---,.. * , 
4.% 'l--,. 
+ + 
 
  
i i  i 
The Number of Training Oa 
0.00001 
0.000001 
0.0000001 
: Learning Curve 
... -- --- # of hidden units 
I I I I I I I I 
-4 
loo :o 300 40o .500 ro 7oo aoo 9oo looo 
The Number of Training Data 
Figure 1: Active/Passive Learning' f(x) = s(x) 
5 EXPERIMENTS 
We demonstrate the effect of the proposed active learning algorithm through ex- 
periments. First we use a three-layer model with 1 input unit, 3 hidden units, and 
1 output unit. The true function f is a MLP network with 1 hidden unit. The 
information matrix is singular at 00, then. The environmental probability, Q, is 
a normal distribution N(0, 4). We evaluate the generalization error in the actual 
environment using the following nean squared error of the function values: 
We set the deviation in the true system a = 0.01. As a family of distributions 
for training {r(;c;v)}, a nixture model of 4 normal distributions is used. In each 
step of active learning, 100 new samples are added. A network is trained using 
online BP, presented with all training data 10000 times in each step, and operated 
the reduction procedure once a 100 cycles between 5000th and 10000th cycle. We 
try 30 trainings changing the seed of random numbers. In comparison, we train a 
network passively based on training samples given by the probability Q. 
Fig.1 shows the averaged learning curves of active/passive learning and the number 
of hidden units in a typical learning curve. The advantage of the proposed active 
learning algorithm is clear. We can find that the algorithm has expected effects on 
a simple, ideal approxination problem. 
Second, we apply the algorithm to a problem in which the true function is not 
included in the MLP model. We use MLP with 4 input units, 7 hidden units, and 1 
output unit. The true function is given by f(;r) = erf(xx), where erf(t) is the error 
function. The graph of the error function resembles that of the sigmoidal function, 
while they never coincide by any arline transforms. We set Q = N(0, 25 x 14). We 
train a network actively/passively based on 10 data sets, and evaluate MSE's of 
function values. Other conditions are the same as those of the first experiment. 
Fig.2 shows the averaged learning cmwes and the number of hidden units in a 
typical learning curve. We find that the active learning algorithm reduces the errors 
though the theoretical condition is not perfectly satisfied in this case. It suggests 
the robustness of our active learning algorithm. 
Active Learning in Multilayer Perceptrons 301 
-' Active Learning 
- .a-. Passive Learning 
The Number of Training Data 
-8 
.... -' Learning Curve 
 .... a... "a" # of hidden units 
I I I I I I I 4 
The Number of Training Data 
Figure 2: Active/Passive Learning: f(a)= erf(x) 
6 CONCLUSION 
We review statistical active learning methods and point out a problem in their ap- 
plication to MLP: the required inverse of an information matrix does not exist if the 
network has redundant hidden units. We characterize the singularity condition of 
an information matrix and propose an active learning algorithm which is applicable 
to MLP with any number of hidden units. The effectiveness of the algorithm is 
verified through computer simulations, even when the theoretical assumptions are 
not perfectly satisfied. 
References 
D. A. Cohn. (1994) Neural network exploration using optimal experiment design. 
In J. Cowan et al. (ed.), Advances in Neural Information Processing Systems 6, 
679-686. San Mateo, CA: Morgan Kaufmann. 
V. V. Fedorov. (1972) Theory of Optimal Ea:periments. NY: Academic Press. 
K. Fukumizu. (1996) A Regularity Condition of the Information Matrix of a Mul- 
tilayer Perceptron Network. Neural Networks, to appear. 
K. Fukumizu, & S. Watanabe. (1994) Error Estimation and Learning Data Ar- 
rangement for Neural Networks. Proc. IEEE Int. Conf. Neural Networks :777-780. 
K. Hagiwara, N. Toda, & S. Usui. (1993) On the problem of applying AIC to 
determine the structure of a layered feed-forward neural network. Proc. 1993 Int. 
Joint Conf. Neural Networks :2263-2266. 
D. MacKay. (1992) Information-based objective functions for active data selection, 
Neural Computation 4(4):305-318. 
F. Pukelsheim. (1993) Optimal Design of Ezperiments. NY: John Wiley & Sons. 
H. White. (1989) Learning in artificial neural networks: A statistical perspective 
Neural Comp'utation 1 (4):425-464. 
