Asymptotic Theory for Regularization: 
One-Dimensional Linear Case 
Petri Koistinen 
Roll Nevanlinna Institute, P.O. Box 4, FIN-00014 University of Helsinki, 
Finland. Emaih Petri.Koistinen@rni.helsinki.fi 
Abstract 
The generalization ability of a neural network can sometimes be 
improved dramatically by regularization. To analyze the improve- 
ment one needs more refined results than the asymptotic distri- 
bution of the weight vector. Here we study the simple case of 
one-dimensional linear regression under quadratic regularization, 
i.e., ridge regression. We study the random design, misspecified 
case, where we derive expansions for the optimal regularization pa- 
rameter and the ensuing improvement. It is possible to construct 
examples where it is best to use no regularization. 
1 INTRODUCTION 
Suppose that we have available training data (X, Y1),..., (X,,, Y,,) consisting of 
pairs of vectors, and we try to predict Y/on the basis of Xi with a neural network 
with weight vector w. One popular way of selecting w is by the criterion 
n 
I , t(Xi, Yi, w) + AQ(w) - min!, 
(1)  
1 
where the loss (x,y,w) is, e.g., the squared error ]]y- g(x,w)]] 2, the function 
g(.,w) is the input/output function of the neural network, the penalty Q(w) is 
a real function which takes on small values when the mapping g(-, w) is smooth 
and high values when it changes rapidly, and the regularization parameter A is a 
nonnegative scalar (which might depend on the training sample). We refer to the 
setup (1) as (training with) regularization, and to the same setup with the choice 
A - 0 as training without regularization. Regularization has been found to be very 
effective for improving the generalization ability of a neural network especially when 
the sample size n is of the same order of magnitude as the dimensionality of the 
parameter vector w, see, e.g., the textbooks (Bishop, 1995; Ripley, 1996). 
Asymptotic Theory for Regularization: One-Dimensional Linear Case 295 
In this paper we deal with asymptotics in the case where the architecture of the 
network is fixed but the sample size grows. To fix ideas, let us assume that the 
training data is part of an i.i.d. (independent, identically distributed) sequence 
(X,); (Xl,1), (X2,2),... of pairs of random vectors, i.e., for each i the pair 
(Xi,Y/) has the same distribution as the pair (X,) and the collection of pairs is 
independent (X and  can be dependent). Then we can define the (prediction) risk 
of a network with weights w as the expected value 
:= 
Let us denote the minimizer of (1) by tb,,(,k), and a minimizer of the risk r by 
w*. The quantity r(,,(,k)) is the average prediction error for data independent of 
the training sample. This quantity r(t,,(,k)) is a random variable which describes 
the generalization performance of the network: it is bounded below by r(w*) and 
the more concentrated it is about r(w*), the better the performance. We will 
quantify this concentration by a single number, the expected value E r( (h)). We 
are interested in quantifying the gain (if any) in generalization for training with 
versus training without regularization defined by 
(3) 
- 
When regularization helps, this is positive. 
However, relatively little can be said about the quantity (3) without specifying in 
detail how the regularization parameter is determined. We show in the next section 
that provided  converges to zero sufficiently quickly (at the rate op(n-/2)), then 
F. r(n (0)) and F. r(n ()) are equal to leading order. It turns out, that the optimal 
regularization parameter resides in this asymptotic regime. For this reason, delicate 
analysis is required in order to get an asymptotic approximation for (3). In this 
article we derive the needed asymptotic expansions only for the simplest possible 
case: one-dimensional linear regression where the regularization parameter is chosen 
independently of the training sample. 
2 REGULARIZATION IN LINEAR REGRESSION 
We now specialize the setup (1) to the case of linear regression and a quadratic 
smoothness penalty, i.e., we take (x, y, w) -- [y- xTw] 2 and Q(w) - wTRw, where 
now y is scalar, x and w are vectors, and R is a symmetric, positive definite matrix. 
It is well known (and easy to show) that then the minimizer of (1) is 
(4)  (,k) XiX + ,kR 1 ' 
= - XY. 
n 
i 1 
This is called the generalized ridge regression estimator, see, e.g., (Titterington, 
1985); ridge regression corresponds to the choice R = I, see (Hoerl and Kennard, 
1988) for a survey. Notice that (generalized) ridge regression is usually studied in 
the fixed design ca.e, where Xi:s are nonrandom. Further, it is usually assumed 
that the model is correctly specified, i.e., that there exists a parameter such that 
Yi = XiTw * + ei, and such that the distribution of the noise term ei does not depend 
on Xi. In contrast, we study the random design, misspecified case. 
Assuming that E IIxII 2 <  and that E [XX T] is invertible, the minimizer of the 
risk (2) and the risk itself can be written as 
(5) w* -- A-iI[XY], with A := I[XX T] 
(6) r(w) = r(w*) + (w - w*) n(w - $*). 
296 P Koistinen 
If Z, is a sequence of random variables, then the notation Z,, = Op(n -a) means 
that naZ,, converges to zero in probability as n  vo. For this notation and the 
mathematical tools needed for the following proposition see, e.g., (Serfling, 1980, 
Ch. 1) or (Brockwell and Davis, 1987, Ch. 6). 
Proposition 1 Suppose that EY 4 < oo, E IlXll 4 < oo and that A - E [XX T] is in- 
vertible. If A = op(n-1/2), then both v/-(.(0)-w*) and v/-0b.(A)- w *) converge 
in distribution to N(O,C), a normal distribution with mean zero and covariance 
matrix C. 
The previous proposition also generalizes to the nonlinear case (under more compli- 
cated conditions). Given this proposition, it follows (under certain additional con- 
ditions) by Taylor expansion that both Er(b,())-r(w*) and Er(b(0))- r(w*) 
admit the expansion fin -1 + o(n -) with the same constant fi. Hence, in the 
regime A = Op(n -1/2) we need to consider higher order expansions in order to 
compare the performance of b, (A) and b (0). 
3 ONE-DIMENSIONAL LINEAR REGRESSION 
We now specialize the setting of the previous section to the case where x is scalar. 
Also, from now on, we only consider the case where the regularization parameter 
for given sample size n is deterministic; especially A is not allowed to depend on 
the training sample. This is necessary, since coefficients in the following type of 
asymptotic expansions depend on the details of how the regularization parameter 
is determined. The deterministic case is the easiest one to analyze. 
We develop asymptotic expansions for the criterion 
(7) 
&(k) := - 
where now the regularization parameter k is deterministic and nonnegative. The 
expansions we get turn out to be valid uniformly for k > 0. We then develop 
asymptotic formulas for the minimizer of J, and also for J,(0) - inf J. The last 
quantity can be interpreted as the average improvement in generalization perfor- 
mance gained by optimal level of regularization, when the regularization constant 
is allowed to depend on n but not on the training sample. 
From now on we take Q(w) - w 2 and assume that A - E X 2 = i (which could be 
arranged by a linear change of variables). Referring back to formulas in the previous 
section, we see that 
(8) 
whence J,(k) = Eh(r,,,,,k), where we have introduced the function h (used 
heavily in what follows) as well as the arithmetic means , and z, 
(9) 
(10) 
r,,:=-yUi, z '--1 V/, with 
1 1 
ui := 
For convenience, also define U := X 2 - 1 and V := XY -w*X 2. Notice that 
U; U, U2,... are zero mean i.i.d. random variables, and that V; , V2,... satisfy 
the same conditions. Hence f, and c, converge to zero, and this leads to the idea 
of using the Taylor expansion of h(u, v, k) about the point (u, v) = (0, 0) in order 
to get an expansion for J,,(k). 
Asymptotic Theory for Regutarization: One-Dimensional Linear Case 297 
To outline the ideas, let Tj (u, v, k) be the degree j Taylor polynomial of (u, v)  
h(u,v,k) about (0,0), i.e., Tj(u,v,k) is a polynomial in u and v whose coeffi- 
cients are functions of k and whose degree with respect to u and v is j. Then 
E Tj(rn, zn, k) depends on n and moments of U and V. By deriving an upper 
bound for the quantity llg ]h(n, zn, k)- Tj(n,Zn, k)l we get an upper bound for 
the error committed in approximating Jn(k) by E Tj(n, n, k). It turns out that 
for odd degrees j the error is of the same order of magnitude in n as for degree 
j - 1. Therefore we only consider even degrees j. It also turns out that the error 
bounds are uniform in k _ 0 whenever j _ 2. To proceed, we need to introduce 
assumptions. 
Assumption 1 E iXl r < o and E IYI s < c for high enough r and s. 
Assumption 2 Either (a) for some constant  > 0 almost surely IX] >/ or (b) 
X has a density which is bounded in some neighborhood of zero. 
Assumption 1 guarantees the existence of high enough moments; the values r = 20 
and s = 8 are sufficient for the following proofs. E.g., if the pair (X, Y) has a 
normal distribution or a distribution with compact support, then moments of all 
orders exist and hence in this case assumption 1 would be satisfied. Without some 
condition such as assumption 2, Jn(0) might fail to be meaningful or finite. The 
following technical result is stated without proof. 
Proposition 2 Let p > 0 and let 0 < EX 2 < oo. If assumption 2 holds, then 
--> [I (X2)] --p , as l], --> 00, 
where the expectation on the left is finite (a) for n >_ 1 (b) for n > 2p provided that 
assumption 2 (a), respectively 2 (b) holds. 
Proposition 3 Let assumptions i and  hold. Then there exist constants no and 
M such that 
w* kE UV 
PROOF SKETCH The formula for ET2(On,'n,k) follows easily by integrating the 
degree two Taylor polynomial term by term. To get the upper bound for R(n, k), 
consider the residual 
-2(k + 1)auv 2 + -4(w*)2k2(k + 1)u a +.-. 
h(t, v, ) - T2(u,v,) -- (u q- 1 q- k)2( q_ 1)4 , 
where we have omitted four similar terms. Using the bound 
+ + + _> x, 
1 I 
2 
, Vk_>0, 
298 P Koistinen 
the L1 triangle inequality, and the Cauchy-Schwartz inequality, we get 
{2(k + 1)3[E (112114)] 1/2 +4(w*)2k2(k + 1)[E 11]/2... } 
1 n 
By proposition 2, here E[([  X) -4] = O(1). Next we use the following fact, cf. 
(Serfling, 1980, Lemma B, p. 68). 
Fact 1 
Then 
Let {Zi} be i.i.d. 
with I[Z1] -- 0 and with E[ZI  < c for some v > 2. 
zi = 
n 1 
Applying the Cauchy-Schwartz inequality and this fact, we get, e.g., that 
[E ([n121rn14)] 1/2 _< [(E lnla)l/2(E lrnlS)l/2] /2 -- 0(n-3/2). 
Going through all the terms carefully, we see that the bound holds. 
Proposition 4 Let assumptions i and  hold, assume that w*  O, and set 
O 1 :-- (]V 2 - 2w*E[UV])/(w*) 2. 
If a > 0,_ then there exists a constant n such that for all n _> n the function 
k -> ET2(U,,,,,,k) has a unique minimum on [0,c) at the point k admitting the 
expansion 
k = an - + O(n-2); further, 
J(0) - inf{J(k)  k _> 0} = J(0) - Jn(an -) = a2(w*)2n -2 + O(n-S/2). 
If a _< O, then 
inf{J,,(k)  k > 0} = J(0) + O(n-S/2). 
PROOF SKETCH The proof is based on perturbation expansion considering 1In a 
small parameter. By the previous proposition, $n(k) := ET2(r, z, k) is the sum 
of (w*)2k2/(1 + k) 2 and a term whose supremum over k _> ko > -1 goes to zero 
as n --> c. Here the first term has a unique minimum on (-1, c) at k = 0. 
Differentiating $,, we get 
Stn(k) -- [2(w*)2k(k + 1) 2 q- n-lp2(k)]/(k q- 1) 5, 
where p2(k) is a second degree polynomial in k. The numerator polynomial has 
three roots, one of which converges to zero as n - c. A regular perturbation 
expansion for this root, k = an -1 + a2n -2 + ..., yields the stated formula for 
c. This point is a minimum for all sufficiently large n; further, it is greater than 
zero for all sufficiently large n if and only if a > 0. 
The estimate for J,,(0) - inf{J,,(k)  k _> 0} in the case a > 0 follows by noticing 
that 
= 
where we now use a third degree Taylor expansion about (u, v, k) = (0, 0, 0) 
- = 
2w*kv-(w*)2k2-4w*kuv+ 2(w*)2k2u + 2kv 2 -4w*k2v+ 2(w*)2k 3 +r(u,v,k). 
Asymptotic Theory for Regularization: One-Dimensional Linear Case 299 
0.2 
0.18 
0.16 
0.14 
0.12 
0.1 
I I I I I I I I I 
0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 
Figure 1: Illustration of the asymptotic approximations in the situation of equation 
(11). Horizontal axis k; vertical axis _Jn(k_) and its asymptotic a_ppro_ximations. 
Legend: markers J,, (k); solid line 1 T2 (U,,, V,,, k); dashed line E T4 (U,,, V,,, k). 
Using_the techniques of the previous proposition, it can be shown that 
1 Ir(Un, n, ,)1 ---- 0(n-5/2)  Integrating the Taylor polynomial and using this 
estimate gives 
Jn(0) - Jn(o1//],) = 012(w*)2/1, -2 -- 
Finally, by the mean value theorem, 
Jn(0)-inf{Jn(k)- k _ 0} - Jn(0)- Jn(al/n) + k[Jn(0)- Jn(k)]lk=a(k,-ax/n ) 
= &(0) - + -2) 
where 0 lies between k, and al/n, and where we have used the fact that the indi- 
cated derivative evaluated at 0 is of order O(n-), as can be shown with moderate 
effort. [] 
Remark In the preceding we assumed that A = E X 2 equals 1. If this is not 
the case, then the formula for al has to be divided by A; again, if a > 0, then 
=mn + O(n-2). 
If the model is correctly specified in the sense that Y = w*X + , where  is 
independent of X and E = 0, then V = Xe and E[UV] = 0. Hence we have 
Oq -- E [e2]/(W*)2, and this is strictly positive expect in the degenerate case where 
e - 0 with probability one. This means that here regularization helps provided the 
regularization parameter is chosen around the value al/n and n is large enough. 
See Figure 1 for an illustration in the case 
(11) X-N(0,1), Y=w*X+e, e---N(0,1), w*=l, 
where e and X are independent. J,(k) is estimated on the basis of 1000 repetitions 
of the task for n = 8. In addition to IT2(,,Q;,,k) the function ET4(U,,Q',,k) 
is also plotted. The latter can be shown to give J,(k) correctly up to order 
O(n-5/2(k + 1)-3). Notice that although llg T2 (n, Zn, k) does not give that good an 
approximation for Jr,(k), its minimizer is near the minimizer of J**(k), and both of 
these minimizers lie near the point a/n = 0.125 as predicted by the theory. In the 
situation (11) it can actually be shown by lengthy calculations that the minimizer 
of J,,(k) is exactly c/n for each sample size n _> 1. 
It is possible to construct cases where a < 0. For instance, take 
I 1 
X-- Uniform(a,b), a = ,b= (3Vr - 1) 
Y = c/X + d + Z, c=-5, d=8, 
300 P.. Koistinen 
and Z  N(0, a 2) with Z and X independent and 0_< a < 1.1. In such a case 
regularization using a positive regularization parameter only makes matters worse; 
using a properly chosen negative regularization parameter would, however, help in 
this particular case. This would, however, amount to rewarding rapidly changing 
functions. In the case (11) regularization using a negative value for the regulariza- 
tion parameter would be catastrophic. 
4 DISCUSSION 
We have obtained asymptotic approximations for the optimal regularization param- 
eter in (1) and the amount of improvement (3) in the simple case of one-dimensional 
linear regression when the regularization parameter is chosen independently of the 
training sample. It turned out that the optimal regularization parameter is, to 
leading order, given by an - and the resulting improvement is of order O(n-2). 
We have also seen that if o 1  0 then regularization only makes matters worse. 
Also (Larsen and Hansen, 1994) have obtained asymptotic results for the optimal 
regularization parameter in (1). They consider the case of a nonlinear network; 
however, they assume that the neural network model is correctly specified. 
The generalization of the present results to the nonlinear, misspecified case might 
be possible using, e.g., techniques from (Bhattacharya and Ghosh, 1978). General- 
ization to the case where the regularization parameter is chosen on the basis of the 
sample (say, by cross validation) would be desirable. 
Acknowledgements 
This paper was prepared while the author was visiting the Department for Statis- 
tics and Probability Theory at the Vienna University of Technology with financial 
support from the Academy of Finland. I thank F. Leisch for useful discussions. 
References 
Bhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal Edge- 
worth expansion. The Annals of Statistics, 6(2):434-451. 
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University 
Press. 
Brockwell, P. J. and Davis, R. A. (1987). Time Series: Theory and Methods. 
Springer series in statistics. Springer-Verlag. 
Hoerl, A. E. and Kennard, R. W. (1988). Ridge regression. In Kotz, S., Johnson, 
N. L., and Read, C. B., editors, Encyclopedia of Statistical Sciences. John Wiley 
& Sons, Inc. 
Larsen, J. and Hansen, L. K. (1994). Generalization performance of regularized 
neural network models. In Vlontos, J., Whang, J.-N., and Wilson, E., editors, 
Proc. of the Jth IEEE Workshop on Neural Networks for Signal Processing, 
pages 42-51. IEEE Press. 
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Uni- 
versity Press. 
Serfiing, R. J. (1980). Approximation Theorems of Mathematical Statistics. John 
Wiley & Sons, Inc. 
Titterington, D. M. (1985). Common structure of smoothing techniques in statistics. 
International Statistical Review, 53:141-170. 
