The effect of correlated input data on the 
dynamics of learning 
Sren Halkjar and Ole Winther 
CONNECT, The Niels Bohr Institute 
Blegdamsvej 17 
2100 Copenhagen, Denmark 
halkj aer, winherQconnec. nbi. dk 
Abstract 
The convergence properties of the gradient descent algorithm in the 
case of the linear perceptton may be obtained from the response 
function. We derive a general expression for the response function 
and apply it to the case of data with simple input correlations. It 
is found that correlations severely may slow down learning. This 
explains the success of PCA as a method for reducing training time. 
Motivated by this finding we furthermore propose to transform the 
input data by removing the mean across input variables as well as 
examples to decrease correlations. Numerical findings for a medical 
classification problem are in fine agreement with the theoretical 
results. 
1 INTRODUCTION 
Learning and generalization are important areas of research within the field of neu- 
ral networks. Although good generalization is the ultimate goal in feed-forward 
networks (perceptrons), it is of practical importance to understand the mechanism 
which control the amount of time required for learning, i.e. the dynamics of learn- 
ing. This is of course particularly important in the case of a large data set. An exact 
analysis of this mechanism is possible for the linear perceptron and as usual it is 
hoped that the results to some extend may be carried over to explain the behaviour 
of non-linear perceptrons. 
We consider N dimensional input vectors x G TN and scalar output y. The linear 
170 S. Halkjcer and O. Winther 
perceptron is parametrized by the weight vector w  T N 
1 wT x (1) 
Y(x) = vW 
Let the training set be {(x, y),/- 1,...,p) and the training error be the usual 
1 
squared error, E(w) -  -. (y- y(x)) 2. We will use the well-known gradient 
descent algorithm 1 w(k q- 1) - w(k)- r/VE(w(k)) to estimate the minimum points 
w* of E. Here r/ denotes the learning parameter. Collecting the input examples 
in the N x p matrix X and the corresponding output in y, the error function is 
I 1 
written E(w) - (wTRw--2qTw+c), where R _= N Z XI(X/)T' q = Xy and 
c = yTy. As in (Le Cun t al., 1991) the convergence properties of the minimum 
points w* are examined in the coordinate system where R is diagonal. Let U denote 
the matrix whose columns are the eigenvectors of R and A = diag(A1,..., AN) the 
diagonal matrix containing the eigenvalues of R. The new coordinates then become 
v = U T(w - w*) with corresponding error function 2 
S(v) = ivYv + s0 = 1 
+ z0 
i 
where E0 = E(w*). Gradient descent now leads to the decoupled equations 
vi(k + 1)= (1 - rlAi)vi(k ) = (1 - rlAi)kvi(O) (3) 
with i = 1,..., N. Clearly, v -+ 0 requires ]1 - r/Ai I < i for all i, so that r/must 
be chosen in the interval 0 < r/ < 2/Amax. In the extreme case Ai = A we will 
have convergence in one step for r/= 1/A. However, in the usual case of unequal Ai 
the convergence for large k will be exponential vi(k): exp(-rlAik)vi(O). (r/Ai) -1 
therefore defines the time constant of the i'th equation giving a slowest time constant 
(rlAmin) -. A popular choice for the learning parameter is r/= 1/Amax resulting in 
a slowest time constant ,Xrna/,X,in called the learning time r in the following. The 
convergence properties of the gradient descent algorithm is thus characterized by 
r. In the case of a singular matrix R, one or more of the eigenvalues will be zero, 
and there will be no convergence along the corresponding eigendirections. This has 
however no influence on the error according to (2). Thus, Amin will in the following 
denote the smallest non-zero eigenvalue. 
We will in the article calculate the eigenvalue spectrum of R in order to obtain the 
learning time of the gradient descent algorithm. This may be done by introducing 
the response function 
-- i - / i / 
G. --G(L,H)= TrL 1-RH - L 1-':RH (4) 
where L, H are arbitrary N x N matrices. Using a standard representation of the 
Dirac 6-function (Krogh, 1992) we may write the eigenvalue spectrum of R as 
1 , 
= N - = --lIm (5) 
i 
The Newton-Raphson method, w(k + 1) = w(k)- VE(w(k))(V2E(w(k))) - is of 
course much more effective in the linear case since it gives convergence in one step. This 
method however requires an inversion of the Hessian matrix. 
2Note that this analysis is valid for any part of an error surface in which a quadratic 
approximation is valid. In the general case R should be exchanged with the Hessian 
WE(w*). 
The Effect of Correlated Input Data on the Dynamics of Learning 171 
where  has an infinitesimal imaginary part which is set equal to zero at the end of 
the calculation. 
In the 'thermodynamic limit' N -+ oc keeping a = n z constant and finite, G 
(and thus the eigenvalue spectrum) is a self-averaging quantity (Sollich, 1996) i. 
e. G -  = O(N-), where U is defined as the response function averaged over 
the input distribution. Previously U has been calculated for independent input 
variables (Hertz et al., 1989; Sollich, 1996). In section 2 we derive an implicit 
equation for the averaged response function for arbitrary correlations using random 
matrix techniques (Brody et al., 1981). This equation is solved showing that simple 
input correlations may slow down learning significantly. Based on this finding we 
propose in section 3 data transformations for improving the learning speed and test 
the transformation numerically on a medical classification problem in section 4. We 
conclude in section 5 with a discussion of the results. 
2 THE RESPONSE FUNCTION 
The method for deriving the averaged response function is based on the fact that the 
response function (4)may be written as a geometrical series Gr. = -]r__0 (L(RH)r). 
We will assume that the input examples x  are drawn independently from a 
Gaussian distribution with means rni and correlations xixj -rnirnj -- (fij, i.e. 
x(x )r = 6Z and  = aZ where Z = C +mm r. The Gaussian distribution 
has the property that the average of products of x's can be calculated by making 
all possible pair correlations, e.g. xixjxxl = ZijZi + ZiZjt + ZaZj. To take 
the average of (L(RH)), we must therefore make all possible pairs of the x's and 
exchange each pair 3cixj with Zij. This combinatorial problem will be solved below 
in a recursive fashion leading to an implicit equation for G. Using underbraces to 
indicate pairings of x's, we get for r >_ 2 
<L(RH)> 
(6) 
Resumming this we get the response function 
L = <L) + c  <LZH(RH) r ) +   <L(RH) r--l) <ZH(RH) ) 
r=0 r=0 s=0 
(7) 
Exchanging the order of summation in the last term we can write everything in 
terms of the response function 
= <L)+ ar.ZH +   <L(RH) -) <ZH(RH) ) 
s=O r=s+l 
: (L) + aGr.zH + (Gr. - 
172 S. Halkjcer and O. Winther 
OGLZH 
(L) + 1 - ZH (8) 
Using (8) recursively setting L equal to LZH, L(ZH) 2 etc. one obtains 
GL =  L = L ZH 
=o 1 -- ZH 1 -- 1-G--zH 
(9) 
This is our main result for the response function. To get the response function 
_ = G(A -, A -1) requires two steps, first set L = ZH and solve for ZH and then 
solve for Gz. If Z has a particularly simple form (9) may be solved analytically, 
but in general it must be solved numerically. 
In the following we will calculate the eigenvalue spectrum for a correlation matrix 
on the form C = nn T + rI and general mean m. To ensure that C is positive 
semi definite r > 0 and Inl 2 + r >_ 0 where Inl 2 -- n.h. The eigenvalues of 
Z = nn T + turn T + rI are straight forwardly shown to be a = r (with multiplicity 
N -2), 32 =r+ [In[2 + [ml 2- v/-]/2 and 33:rq- []n]2 + ]ml2+ V/-] /2 with 
D = (Inl 2 - Im12) 2 + 4(n. m) 2. Carrying out the trace in eq. (9) we get 
N-2 i i i i 1 
Gx = + -- + -- (10) 
 N A a NA- a NA- a 
1--GzH 1--GzH 1--GzH 
This expression suggests that we may solve G k in powers of 1IN (see e.g. (Sollich, 
1996)). However for purposes of the discussion of learning times the only -term 
that will be of importance is the last term above. We therefore only need to solve 
for GZH (setting L = ZH in (9)) to leading order 
A + al(1 - c 0 - v/(A + al(1 - c02) - 4a 
ZH -- 2A 
(11) 
Note that GZH will vanish for large A implying that the last term in (10) to leading 
order is singular for A = aaa. Inserting the result in (10) gives 
1 A+a'(1-a)-x/(A+a'(1-a))2-4Aa2 + NA cta 
Gk = 2Aa--- - 
According to (__5) the eigenvalue spectrum is determined by the imaginary part and 
poles of Gz. Gz has an imaginary part for A_ < A < A+ where A+ = a(1 4- x/-a-) 2 
and the poXles A x = 0, A = aaa. The poles contribute each with a &function such 
that the eigenvalue spectrum up to corrections of order  becomes 
(A) = (1 - a)O(1 - a)6(A) + 
1 6(A - aa3) q- -- 
27rAa1 
v/(A+ - A)(A- A_) (13) 
where O(x) = 1 for x > 0 and 0 otherwise. The first term expresses the trivial fact 
that for p < N the whole input space is not spanned and R will have a fraction of 
1 - a zero-eigenvalues. The continuous spectrum (the root term) only contributes 
for A_ < A < A+. Numerical simulations has been performed to test the validity 
of the spectrum (13) (Hulkjeer, 1996). They are in good agreement with predicted 
results indicating that finite size effects are unimportant. The continuous spectrum 
'n (13) has also been calculated using the replica method (Halkjaer, 1996). 
The Effect of Correlated Input Data on the Dynamics of Learning 173 
From the spectrum the learning time r may be read of directly 
(A+ aaa) ((l+v/) 2 
aaa (14) 
r = max __, __ = max 1 --  'a 1 (1 - )2 
To illustrate how input correlations and bi may affect learning time consider 
simple correlations Cij = 6ijv(1 - c) + vc and mi = m. With this special choice 
N(m+ vc) m 2 
r  v(X_c)(x_). For + cv > 0, i.e. for non-zero mean or positive correlations, 
the convergence time will blow up by a factor proportional to N. The input bi 
effect h previously been observed by (Le Cun et a/.,Wendemuth et al.). In the 
next section we will consider transformations to remove the large eigenvalue and 
thus to speed up learning. 
3 
DATA TRANSFORMATIONS FOR INCREASING 
LEARNING SPEED 
In this section we consider two data transformations for minimizing the learning 
time r of a data set, based on the results obtained in the previous sections. 
The PCA transformation (Jackson, 1991) is a data transformation often used in data 
analysis. Let U be the matrix whose columns are the eigenvectors of the sample 
covariance matrix and let x mean denote the sample average vector (see below). It 
is easy to check that the transformed data set 
= ur(x. _ xmean) 
(15) 
have uncorrelated (zero-mean) variables. However, the new PCA variables will often 
have a large spread in variances which might result in slow convergence. A simple 
rescaling of the new variables will remove this problem, such that according to (14) 
a PCA transformed data set with rescaled variables will have optimal convergence 
properties. 
The other transformation, which we will call double centering, is based on the re- 
moval of the observation means and the variable means. However, whereas the 
PCA transformation doesn't care about the initial distribution, this transformation 
is optimal for a data set generated from the matrix Zij -- 5ij v(1--c)+vc+mi rrlj stud- 
ied earlier. Define x? ean= k  x (mean of the i'th variable), x  =  i x 
p mean 
_mean __  Y'-,i X (grand mean). Consider first 
(mean of the p'th example) and iCmean -- pN 
the transformed data set 
= xi - x2 ean + __mean 
-- Xmean Jrmean 
The new variables are readily seen to have zero mean, variance  = v(1-c)- (1-c) 
and correlation 5 = -1 Since (1- 5) = v(1- c) we immediately get from 
N-I' 
(13) that the continuous eigenvalue spectrum is unchanged by this transformation. 
Furthermore the 'large' eigenvalue cta is equal to zero and therefore uninteresting. 
Thus the learning time becomes r = (1 + v/)2/(1 - v/) . This transformation 
however removes perhaps important information from the data set, namely the 
observation means. Motivated by these findings, we create a new data set {} 
where this information is added as an extra component 
 = (1, -. ., v,  mean, (16) 
38mean -- 38mean ] 
174 S. Halkjcer and O. Winther 
Table 1: Required number of iterations and corresponding learning times for differ- 
ent data transformations. 'Raw' is the original data set, 'Var. cent.' indicates the 
variable centered (mi = 0) data set, 'Doub. cent.' denotes (16), while the two last 
columns concerns the PCA transformed data set (15) without and with rescaled 
variables. 
Raw Var. cent. Doub. cent. PCA PCA (res.) 
Iterations oc 300 50 630 7 
r 161190 3330 237 3330 i 
The matrix fl. resulting from this data set is identical to the above case except that 
a column and a row have been added. We therefore conclude that the eigenvalue 
spectrum of this data set consists of a continuous spectrum equal to the above 
and a single eigenvalue which is found to be /k - v(1- c) q-cv. For c  0 we 
will therefore have a learning time r of order one indicating fast convergence. For 
independent variables (c - 0) the transformation results in a learning time of order 
N but in this case a simple removal of the variable means will be optimal. After 
training, when an (N + 1)-dim parameter set r has been obtained, it is possible to 
transform back to the original data set using the parameter transformation wl - 
+ 
4 NUMERICAL INVESTIGATIONS 
The suggested transformations for improving the convergence properties have been 
tested on a medical classification problem. The data set consisted of 40 regional 
values of cerebral glucose metabolism from 85 patients, 48 HIV-negatives and 37 
HIV-positives. A simple perceptton with sigmoidal tanh output was trained using 
gradient descent on the entropic error function to diagnose the 85 patients cor- 
rectly. The choice of an entropic error function was due to it's superior convergence 
properties compared to the quadratic error function considered in the analysis. The 
learning was stopped once the perceptton was able to diagnose all patients correctly. 
Table 1 shows the average number of required iterations for each of the transformed 
data sets (see legend) as well as the ratio r - Amax/Arnin for the corresponding 
matrix R. The 'raw' data set could not be learned within the allowed 1000 itera- 
tions which is indicated by an o. Overall, there's fine agreement between the order 
of calculated learning times and the corresponding order of required number of it- 
erations. Note especially the superiority of the PCA transformation with rescaled 
variables. 
5 CONCLUSION 
For linear networks the convergence properties of the gradient descent algorithm 
may be derived from the eigenvalue spectrum of the covariance matrix of the input 
data. The convergence time is controlled by the ratio between the largest and 
smallest (non-zero) eigenvalue. In this paper we have calculated the eigenvalue 
spectrum of a covariance matrix for correlated and biased inputs. It turns out that 
correlation and bias give rise to an eigenvalue of order the input dimension as well as 
a continuous spectrum of order one. This explains why a PCA transformation (with 
The Effect of Correlated Input Data on the Dynamics of Learning 175 
a variable rescaling) may increase learning speed significantly. We have proposed 
to center (setting equal to zero) the empirical mean both for each variable and 
each observation in order to remove the large eigenvalue. We add an additional 
component containing the observation mean to the input vector in order have this 
information in the training set. At the end of training it is possible to transform 
the solution back to the original representation. Numerical investigations are in fine 
agreement with the theoretical analysis for improving the convergence properties. 
6 ACKNOWLEDGMENTS 
We would like to thank Sara A. Solla and Lars Kai Hansen for valuable comments 
and discussions. Furthermore we wish to thank Ido Kanter for providing us with 
notes on some of his previous work. This work has been supported by the Dan- 
ish National Councils for the Natural and Technical Sciences through the Danish 
Computational Neural Network Center CONNECT. 
REFERENCES 
Brody, T. A., Flores J., French J. B., Mello, P. A., Pendey, A., & Wong, S.S. (1981) 
Random-matrix physics. Rev. Mod. Phys. 53:385. 
Halkjaer, S. (1996) Dynamics of learning in neural networks: application to the di- 
agnosis of HIV and Alzheimer patients. Master's thesis, University of Copenhagen. 
Hertz, J. A., Krogh, A. & Thorbergsson G. I. (1989) Phase transitions in simple 
learning. J. Phys. A 22:2133-2150. 
Jackson, J. E. (1991) A User's Guide to Principal Components. John Wiley & Sons. 
Krogh, A. (1992) Learning with noise in a linear perceptton J. Phys A 25:1119-1133. 
Le Cun, Y., Kanter, I. & Solla, S.A. (1991) Second Order Properties of Error 
Surfaces: Learning Time and Generalization. NIPS, 3:918-924. 
Sollich, P. (1996) Learning in large linear percepttons and why the thermodynamic 
limit is relevant to the real world. NIPS, 7:207-214 
Wendemuth, A., Opper, M. & Kinzel W. (1993) The effect of correlations in neural 
networks, J. Phys. A 26:3165. 
