I II I I Ill II I I I 
Unification of Information Maximization 
and Minimization 
Ryotaro Kamimura 
Information Science Laboratory 
Tokai University 
1117 Kitakaname Hiratsuka Kanagawa 259-12, Japan 
E-mail: ryo@cc.u-tokai.ac.jp 
Abstract 
In the present paper, we propose a method to unify information 
maximization and minimization in hidden units. The information 
maximization and minimization are performed on two different lev- 
els: collective and individual level. Thus, two kinds of information: 
collective and individual information are defined. By maximizing 
collective information and by minimizing individual information, 
simple networks can be generated in terms of the number of con- 
nections and the number of hidden units. Obtained networks are 
expected to give better generalization and improved interpretation 
of internal representations. This method was applied to the infer- 
ence of the maximum onset principle of an artificial language. In 
this problem, it was shown that the individual information min- 
imization is not contradictory to the collective information max- 
imization. In addition, experimental results confirmed improved 
generalization performance, because over-training can significantly 
be suppressed. 
I Introduction 
There have been many attempts to interpret neural networks from the information 
theoretical point of view [2], [4], [5]. Applied to the supervised learning, informa- 
tion has been maximized and minimized, depending on problems. In these methods, 
information is defined by the outputs of hidden units. Thus, the methods aim to 
control hidden unit activity patterns in an optimal manner. Information maximiza- 
tion methods have been used to interpret explicitly internal representations and 
simultaneously to reduce the number of necessary hidden units [5]. On the other 
hand, information minimization methods have been especially used to improve gen- 
eralization performance [2], [4] and to speed up learning. Thus, if it is possible to 
Unification of lnformation Maximization and Minimization 509 
maximize and minimize information simultaneously, information theoretic methods 
are expected to be applied to a wide range of problems. 
In this paper, we unify the above mentioned two methods, namely, information 
maximization and minimization methods, into one framework to improve general- 
ization performance and to interpret explicitly internal representations. However, it 
is apparently impossible to maximize and minimize simultaneously the information 
defined by the hidden unit activity. Our goal is to maximize and to minimize infor- 
mation on two different levels, namely, collective and individual levels. This means 
that information can be maximized in collective ways and information is minimized 
for individual input-hidden connections. The seeming contradictory proposition of 
the simultaneous information maximization and minimization can be overcome by 
assuming the existence of the two levels for the information control. 
Information is supposed to be controlled by an information controller located outside 
neural networks and used exclusively to control information. By assuming the 
information controller, we can clearly see how information appropriately defined can 
be maximized or minimized. In addition, the actual implementation of information 
methods is much easier by introducing a concept of the information controller. 
2 Concept of Information 
In this section, we explain a concept of information in a general framework of an in- 
formation theory. Let Y take on a finite number of possible values y, y2,..., YM with 
probabilities p(y), p(y2), ..., P(YM), respectively. Then, initial uncertainty H(Y) of 
a random variable Y is defined by 
M 
H(Y) = - E P(YJ) logp(yj). 
where 
I(Y l - yp(yj l a,)log p(y 
j=l 
Now, consider conditional uncertainty after the observation of another random vari- 
able X, taking possible values a:, a:, ..., a:$ with probabilities p(a:), p(a:), ..., p(:cM), 
respectively. Conditional uncertainty H(Y I X) can be defined as 
$ M 
H(YIX)- - ZP(X) EP(Yj I a:) lgp(a:j lye). (2) 
=1 j=l 
We can eily verify that conditional uncertainty is always less than or equal to 
initial uncertainty. Information is usually defined  the decrede of this uncertainty 
I(Y I X ) = H(Y)- H(Y I X ) 
M S M 
= - p(yj)logp(yj)+ p() p(yj I,)1ogp(yj I) 
j=l s=l j=l 
= P(x)p(Yj ])log p(yj ) 
 j P(YJ) 
= 
510 R. Karnirnura 
I m,)logp(y)+ P(Yi I m,)lgp(Yi I m,), 
(4) 
which is referred to as conditional information. Especially, when prior uncertainty 
is maximum, that is, a prior probability is equi-probable (l/M), then inforrnation 
is 
$ M 
I(Y I x) = logM + yp(z,)p(yj I z,)logp(y1 Ix,) (5) 
s=l j=l 
where log M is maximum uncertainty concering A. 
3 Formulation of Information Controller 
In this section, we apply a concept of information to actual network architectures 
and define collective information and individual information. The notation in the 
above section is changed into ordinary notation used in the neural network. 
3.1 Unification by Information Controller 
Two kinds of information, collective information and individual information, are 
controlled by using an information controller. The information controller is devised 
to interpret the mechanism of the information maximization and minimization more 
explicitly. As shown in Figure 1, the information controller is composed of two 
subcomponents, that is, an individual information minimizer and collective infor- 
mation maximizer. A collective information maximizer is used to increase collective 
information as much as possible. An individual information minimizer is used to 
decrease individual information. By this minimization, the majority of connections 
are pushed toward zero. Eventually, all the hidden units tend to be intermediately 
activated. Thus, when the collective information maximizer and individual infor- 
mation maximizer are simultaneously applied, a hidden unit activity pattern is a 
pattern of the maximum information in which only one hidden unit is on, while 
all the other hidden units are off. However, multiple strongly negative connections 
to produce a maximum information state, are replaced by extremely weak input- 
hidden connections. Strongly negative connections are inhibited by the individual 
information minimization. This means that by the information controller, informa- 
tion can be maximized and at the same time one of the most important properties 
of the information minimization, namely, weight decay or weight elimination, can 
approximately be realized. Consequently, the information controller can generate 
much simplified networks in terms of hidden units and in terms of input-hidden 
connections. 
3.2 Collective Information Maximlzer 
A neural network to be controlled is composed of input, hidden and output units 
with bias, as shown in Figure 1. The jth hidden unit receives a net input from 
input units and at the same time fi'om a collective information maximizer: 
L 
k=O 
where x i is an information maximizer from the jth collective information maximizer 
to the jth hidden unit,  is the number of input units, wik is a connection from 
the kth input unit to the jth hidden unit and . is the kth element of the sth input 
Unification of lnforrnation Maximization and Minimization 511 
Bias  Bias- Itidden 
\ ,--, Connections Bias-Output 
\ ,X ',, / 
Xv'  \ / Bias ] 
Inpu  W i Hidden 
u..  X'J o u.. xW. 
 * ' - 'Ouut 
I  Wi' 
  , utput. 
k / -i X x. 
,  ; J 
: 
  Information 
; . 
,' Maximizers 
Individual 
Information 
Minimizer 
Collective 
Information 
Maximizer 
Information Controller 
Figure 1: A network architecture, realizing the information con- 
troller. 
pattern. The jth hidden unit produces an activity or an output by a sigmoidal 
activation function: 
1 
1 q- exp(-uj)' 
(7) 
The collective information maximizer is used to maximize the information contained 
in hidden units. For this purpose, we should define collective information. Now, 
suppose that in the previous formulation in information, a symbol X and Y repre- 
sent a set of input patterns and hidden units respectively. Then, let us approximate 
a probability p(yj ] :r,,) by a normalized output pj of the jth hidden unit computed 
by 
; (8) 
Em=l Vrn 
where the summation is over all the hidden units. Then, it is reasonable to suppose 
that at an initial stage all the hidden units are activated randomly or uniformly 
and all the input patterns are also randomly given to networks. Thus, a probability 
p(yj) of the activation of hidden units at the initial stage is equi-probable, that is, 
1/M. A probability p(a:) of input patterns is also supposed to be equi-probable, 
namely, 1/$. Thus, information in the equation (3) is rewritten as 
I(YIX)  
M $ M 
1 1 1 
- E  log +  E EP; logp; 
j=l s=lj=l 
512 R. Kamimura 
Input Unit () Hidden Unit (1)) 
k J 
pj Fires 
Fire 
1-p jk  Do$ not fire 
Figure 2: An interpretation of an input-hidden connection for 
defining the individual information. 
$ M 
= logM+EEpjlogpj (9) 
*=lj=l 
where logM is maximum uncertainty. This information is considered to be the 
information acquired in a course of learning. Information maximizers are updated to 
increase collective information. For obtaining update rules, we should differentiate 
the information function with respect to information maximizers 
A - fi$ OI(Y I X) 
) 
= ogj- 
=1 m=l 
where fi is a parameter. 
3.3 Individual Information Minimization 
For representing individual information by a concept of information discussed in 
the previous section, we consider an output P. ik from the jth hidden unit only with 
a connection from the kth input unit to the jth output unit: 
p./:: f(wj:) (11) 
which is supposed to be a probability of the firing of the jth hidden unit, given the 
firing of the kth input unit, as shown in Figure 2. Since this probability is considered 
to be a probability, given the firing of the kth input unit, conditional information is 
appropriate for measuring the information. In addition, it is reasonable to suppose 
that a probability of the firing of the jth hidden unit is 1/2 at an initial stage 
of learning, because we have no knowledge on hidden units. Thus, conditional 
information for a pair of the kth unit and the jth hidden unit is formulated as 
5(D l fives)  
If connections are close to zero, 
meaning that it is impossible to estimate the firing of the kth hidden unit. 
(1) 
i - (1 - p/:) log 1-. 
-Pj t, log  
+p./: logp./: + (1 - p./:) log(1 - p./:) 
1og2 +pjlogpjk + (1-pj)log(1-pjk) (12) 
this function is close to minimum information, 
If 
Unification of lnformation Maximization and Minimization 
Table 1: An example of obtained input-hidden connections wjk 
by the information controller. The parameter 6, / and r/ were 
0.015, 0.0008, and 0.01. 
513 
Hidden Input Units  Informo. tion 
Units v 1 2 3 4 Bias wjo Mximizer j 
i 3.09 10.77 26.48 13.82 22.07 -60.88 
2 -3.35 0.11 0.33 -3.08 -0.95 1.63 
3 -0.01 0.00 0.00 -0.01 0.00 -10.93 
4 0.00 0.00 0.00 0.00 0.00 -10.94 
5 0.00 0.00 -0.01 0.00 0.00 -10.97 
6 0.02 0.01 -0.04 0.01 0.06 -12.01 
7 0.00 0.00 -0.01 0.00 0.00 -11.01 
8 0.00 0.00 -0.01 0.00 0.00 -11.00 
9 0.02 0.01 -0.03 0.01 0.03 -11.61 
10 0.01 0.00 -0.02 0.00 0.07 -11.67 
connections are larger, the information is larger and correlation between input and 
hidden units is larger. Total individual information is the sum of all the individual 
individual information, namely, 
M L 
t(Z>lsi,es) = jk(r>lSies), (13) 
j=l k=O 
because each connection is treated separately or independently. The individual 
information minimization directly controls the input-hidden connections. By dif- 
ferentiating the individual information function and a cross entropy cost function 
with respect to input-hidden connections w, we have rules for updating concerning 
input-hidden connections: 
Awj 
O!(D l fires) OC 
= -P Ow.i  00wj 
s 
(14) 
where 5 is an ordinary delta for the cross entropy function and r/and/ are pa- 
rameters. Thus, rules for updating with respect to input-hidden connections are 
closely related to the weight decay method. Clearly, as the individual information 
minimization corresponds to diminishing the strength of input-hidden connections. 
4 Results and Discussion 
The information controller was applied to the segmentation of strings of an artifi- 
cial language into appropriate minimal elements, that is, syllables. Table i shows 
input-hidden connections with the bias and the information maximizers. Hidden 
units were ordered by the magnitude of the relevance of each hidden unit [6]. Col- 
lective information and individual information could sufficiently be maximized and 
minimized. Relative collective and individual information were 0.94 and 0.13. In 
this state, all the input-hidden connections except connections into the first two 
hidden units are almost zero. Information maximizers a:j are all strongly negative 
for these cases. These negative information maximizers make eight hidden units 
(from the third to tenth hidden unit) inactive, that is, close to zero. By carefully 
514 R. Karnirnura 
Table 2: Generalization performance comparison for 200 and 200 train- 
ing patterns. Averages in the table are average generalization errors over 
seven errors of ten errors with ten different initial values. 
(a) 200 patterns 
Methods 
Generalization Errors 
RMS 
Averages Std. Dev. 
Error Rates 
Averages Std. Der. 
Standard 0.188 
Weight Decay 0.183 
Weight Elimination 0.172 
hfformation Controller 0.167 
0.010 0.087 0.015 
0.004 0.082 0.009 
0.014 0.064 0.015 
0.011 0.052 0.008 
(b) 300 patterns 
Methods 
Generalization Errors 
RMS Error Rates 
Averages Std. Dev. Averages Std. Dev. 
Standard 0.108 
Weight Decay 0.110 
Weight Elimination 0.083 
Information Controller 0.072 
0.009 0.024 0.009 
0.003 0.012 0.004 
0.005 0.009 0.006 
0.006 0.008 0.004 
examing the first two hidden units, we could see that the first hidden unit and the 
second hidden unit are concerned with rules for syllabification and a exceptional 
Then, networks were trained to infer the well-formedness of strings in addition to 
the segmentation to exanfine generalization performance. Table 2 shows general- 
ization errors for 200 and 300 training patterns. As clearly shown in the figure, 
the best generalization performance in terms of RMS and error rates is obtained by 
the information controller. Thus, experimental results confirmed that in all cases 
the generalization performance of the information controller is well over the other 
methods. In addition, experimental results explicitly confirmed that better gener- 
alization performance is due to the suppression of over-training by the information 
controller. 
References 
[1] R. Ash, Information Theory, John Wiley & Sons: New York, 1965. 
[2] G. Deco, W. Finnof and H. G. Zimmermann, "Unsupervised mutual infor- 
mation criterion for elimination of overtraining in Supervised Multilayer Net- 
works," Neural Computation, Vol. 7, pp.86-107, 1995. 
[3] R. Kamimura "Entropy minimization to increase the selectivity: selection and 
competition in neural networks," Intelligent Engineering Systems through Ar- 
tificial Neural Networks, ASME Press, pp.227-232, 1992. 
[4] R. Kamimura, T. Takagi and S. Nakanishi, "Improving generalization perfor- 
mance by information minimization," IEICE Transactions on Information and 
Systems, Vol. E78-D, No.2, pp.163-173, 1995. 
[5] R. Kamimura and S. Nakanishi, "Hidden information maximization for feature 
detection and rule discovery," Network: Computation in Neural Systems, Vol.6, 
pp.577-602, 1995. 
[6] M. C. Mozer and P. Smolensky, "Using relevance to reduce network size auto- 
matically," Connection Science, Vo.1, No.1, pp.3-16, 1989. 
