Outcomes of the Equivalence of Adaptive Ridge 
with Least Absolute Shrinkage 
Yves Grandvalet Stéphane Canu
Heudiasyc, UMR CNRS 6599, Université de Technologie de Compiègne,
BP 20.529, 60205 Compiègne cedex, France
Yves.Grandvalet@hds.utc.fr
Abstract 
Adaptive Ridge is a special form of Ridge regression, balancing the 
quadratic penalization on each parameter of the model. It was shown to 
be equivalent to Lasso (least absolute shrinkage and selection operator), 
in the sense that both procedures produce the same estimate. Lasso can 
thus be viewed as a particular quadratic penalizer. 
From this observation, we derive a fixed point algorithm to compute the
Lasso solution. The analogy also provides a new hyper-parameter for
effectively tuning the model complexity. We finally present a series of
possible extensions of Lasso performing sparse regression in kernel smoothing,
additive modeling and neural net training.
1 INTRODUCTION 
In supervised learning, we have a set of explanatory variables $x$ from which we wish to
predict a response variable $y$. To solve this problem, a learning algorithm is used to produce a
predictor $f(x)$ from a learning set $s_\ell = \{(x_i, y_i)\}_{i=1}^{\ell}$ of examples. The goal of prediction
may be: 1) to provide an accurate prediction of future responses, accuracy being measured
by a user-defined loss function; 2) to quantify the effect of each explanatory variable in the
response; 3) to better understand the underlying phenomenon.
Penalization is extensively used in learning algorithms. It decreases the predictor variability 
to improve the prediction accuracy. It is also expected to produce models with few non-zero 
coefficients if interpretation is planned. 
Ridge regression and Subset Selection are the two main penalization procedures. The
former is stable, but does not shrink parameters to zero; the latter gives simple models, but is
unstable [1]. These observations motivated the search for new penalization techniques such
as Garrotte, Non-Negative Garrotte [1], and Lasso (least absolute shrinkage and selection
operator) [10].
446 Y. Grandvalet and S. Canu 
Adaptive Ridge was proposed as a means to automatically balance penalization on different 
coefficients. It was shown to be equivalent to Lasso [4]. Section 2 presents Adaptive Ridge 
and recalls the equivalence statement. The following sections give some of the main out- 
comes of this connection. They concern algorithmic issues in section 3, complexity control 
in section 4, and some possible generalizations of lasso to non-linear regression in section 5. 
2 ADAPTIVE RIDGE REGRESSION 
For clarity of exposure, the formulae are given here for linear regression with quadratic loss.
The predictor is defined as $f(x) = \beta^T x$, with $\beta^T = (\beta_1, \ldots, \beta_d)$. Adaptive Ridge is a
modification of the Ridge estimate, which is defined by the quadratic constraint
$\sum_{j=1}^{d} \beta_j^2 \le C$ applied to the parameters. It is usually computed by minimizing the Lagrangian
$$\hat\beta = \operatorname*{Argmin}_{\beta} \sum_{i=1}^{\ell} \Big( \sum_{j=1}^{d} \beta_j x_{ij} - y_i \Big)^2 + \lambda \sum_{j=1}^{d} \beta_j^2 , \qquad (1)$$
where $\lambda$ is the Lagrange multiplier varying with the bound $C$ on the norm of the parameters.
When the ordinary least squares (OLS) estimate $\hat\beta^o$ maximizes likelihood¹, the Ridge estimate
may be seen as a maximum a posteriori estimate. The Bayes prior distribution is a centered
normal distribution, with variance proportional to $1/\lambda$. This prior distribution treats all
covariates similarly. It is not appropriate when we know that all covariates are not equally
relevant.
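The Ridge estimate of equation (1) has a closed form, which the following minimal sketch computes. The function name `ridge` and the use of NumPy are illustrative assumptions, not part of the original paper.

```python
import numpy as np

def ridge(X, y, lam):
    """Minimize ||X beta - y||^2 + lam * ||beta||^2 (equation (1)).

    Setting the gradient to zero gives (X^T X + lam I) beta = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# With an orthonormal design, every OLS coefficient is shrunk by 1 / (1 + lam),
# whatever its size: the same prior is applied to all covariates.
beta = ridge(np.eye(2), np.array([2.0, 4.0]), lam=1.0)
```

Because the shrinkage factor is the same for every coefficient, none of them is set exactly to zero, which is the limitation the balanced penalties of Adaptive Ridge address.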
The garrotte estimate [1] is based on the OLS estimate $\hat\beta^o$. The standard quadratic constraint
is replaced by $\sum_{j=1}^{d} \beta_j^2 / (\hat\beta_j^o)^2 \le C$. The coefficients with smaller OLS estimate are thus
more heavily penalized. Other modifications are better explained with the prior distribution
viewpoint. Mixtures of Gaussians may be used to cluster different sets of covariates.
Several models have been proposed, with data dependent clusters [9], or classes defined a
priori [7]. The Automatic Relevance Determination model [8] ranks in the latter type. In [4],
we propose to use such a mixture, in the form
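$$\hat\beta = \operatorname*{Argmin}_{\beta} \sum_{i=1}^{\ell} \Big( \sum_{j=1}^{d} \beta_j x_{ij} - y_i \Big)^2 + \sum_{j=1}^{d} \lambda_j \beta_j^2 . \qquad (2)$$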
Here, each coefficient has its own prior distribution. The priors are centered normal
distributions with variances proportional to $1/\lambda_j$. To avoid the simultaneous estimation of these
d hyper-parameters by trial, the constraint
$$\frac{1}{d} \sum_{j=1}^{d} \frac{1}{\lambda_j} = \frac{1}{\lambda} , \quad \lambda_j > 0 \qquad (3)$$
is applied on $(\lambda_1, \ldots, \lambda_d)^T$, where $\lambda$ is a predefined value. This constraint is a link
between the d prior distributions. Their mean variance is proportional to $1/\lambda$. The values of
$\lambda_j$ are automatically² induced from the sample, hence the qualifier adaptive. Adaptivity
refers here to the penalization balance on $\{\beta_j\}$, not to the tuning of the hyper-parameter $\lambda$.
¹ If $\{x_i\}$ are independently and identically drawn from some distribution, and if some $\beta^*$ exists
such that $Y_i = \beta^{*T} x_i + \varepsilon$, where $\varepsilon$ is a centered normal random variable, then the empirical cost
based on the quadratic loss is proportional to the log-likelihood of the sample. The OLS estimate $\hat\beta^o$
is thus the maximum likelihood estimate.
² Adaptive Ridge, as Ridge or Lasso, is not scale invariant, so that the covariates should be
normalized to produce sensible estimates.
Equivalence of Adaptive Ridge with Least Absolute Shrinkage 44 7 
It was shown [4] that Adaptive Ridge and least absolute value shrinkage are equivalent, in
the sense that they yield the same estimate. We recall that the Lasso estimate is defined by
$$\hat\beta = \operatorname*{Argmin}_{\beta} \sum_{i=1}^{\ell} \Big( \sum_{j=1}^{d} \beta_j x_{ij} - y_i \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{d} |\beta_j| \le K . \qquad (4)$$
The only difference in the definition of the Adaptive Ridge and the Lasso estimate is that
the Lagrangian form of Adaptive Ridge uses the constraint $\big( \sum_{j=1}^{d} |\beta_j| \big)^2 / d \le K^2$.
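The link between the quadratic penalty of (2) under constraint (3) and an absolute-value penalty can be checked by hand. The derivation below is a sketch added for illustration, not the proof given in [4]; the multiplier $\mu$ is a symbol introduced here, and all $\beta_j$ are assumed non-zero.

```latex
\begin{align*}
% For a fixed beta, minimize the penalty of (2) over the lambda_j under constraint (3)
&\min_{\lambda_1,\dots,\lambda_d > 0} \; \sum_{j=1}^{d} \lambda_j \beta_j^2
  \quad \text{subject to} \quad \frac{1}{d}\sum_{j=1}^{d} \frac{1}{\lambda_j} = \frac{1}{\lambda} \\
% Stationarity of the Lagrangian with multiplier mu
&\beta_j^2 - \frac{\mu}{\lambda_j^2} = 0
  \;\Longrightarrow\; \lambda_j = \frac{\sqrt{\mu}}{|\beta_j|} \\
% Substituting lambda_j into constraint (3)
&\frac{1}{d}\sum_{k=1}^{d} \frac{|\beta_k|}{\sqrt{\mu}} = \frac{1}{\lambda}
  \;\Longrightarrow\; \sqrt{\mu} = \frac{\lambda}{d}\sum_{k=1}^{d} |\beta_k| \\
% Value of the penalty at the optimal lambda_j
&\sum_{j=1}^{d} \lambda_j \beta_j^2 = \sqrt{\mu}\,\sum_{j=1}^{d} |\beta_j|
  = \frac{\lambda}{d}\Big(\sum_{j=1}^{d} |\beta_j|\Big)^2
\end{align*}
```

Once the $\lambda_j$ are eliminated, the penalty depends on $\beta$ only through $\sum_j |\beta_j|$, which is why bounding it amounts to the constraint of (4) with a rescaled bound.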
3 OPTIMIZATION ALGORITHM 
Tibshirani [10] proposed to use quadratic programming to find the Lasso solution, with 2d
variables (positive and negative parts of $\beta$) and 2d + 1 constraints (signs of positive and
negative parts of $\beta$ plus constraint (4)). Equations (2) and (3) suggest using a fixed point
(FP) algorithm. At each step s, the FP algorithm estimates the optimal parameters $\lambda_j^{(s)}$ of
the Bayes prior based on the estimate $\hat\beta^{(s-1)}$, and then maximizes the posterior to compute
the current estimate $\hat\beta^{(s)}$.
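As the parameterization $(\beta, \lambda_1, \ldots, \lambda_d)$ may lead to divergent solutions, we define new variables
$$\gamma_j = \sqrt{\frac{\lambda_j}{\lambda}}\, \beta_j \quad \text{and} \quad c_j = \sqrt{\frac{\lambda}{\lambda_j}} \quad \text{for } j = 1, \ldots, d . \qquad (5)$$
The FP algorithm updates alternatively $c$ and $\gamma$ as follows:
$$c_j^{(s)\,2} = \frac{d\, \gamma_j^{(s-1)\,2}}{\sum_{k=1}^{d} \gamma_k^{(s-1)\,2}} , \qquad
\gamma^{(s)} = \big( \mathrm{diag}(c^{(s)})\, X^T X\, \mathrm{diag}(c^{(s)}) + \lambda I \big)^{-1} \mathrm{diag}(c^{(s)})\, X^T y , \qquad (6)$$
where $X_{ij} = x_{ij}$, $I$ is the identity matrix, and $\mathrm{diag}(c)$ is the square matrix with the vector
$c$ on its diagonal.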
The algorithm can be initialized by the Ridge or the OLS estimate. In the latter case, $\hat\beta^{(1)}$ is
the garrotte estimate.
Practically, if $\gamma_j^{(s-1)}$ is small compared to numerical accuracy, then $c_j^{(s)}$ is set to zero. In
turn, $\gamma_j^{(s)}$ is zero, and the system to be solved in the second step to determine $\gamma$ can be
reduced to the other variables. If $c_j$ is set to zero at any time during the optimization
process, the final estimate $\hat\beta_j$ will be zero. The computations are simplified, but it is not clear
whether global convergence can be obtained with this algorithm. It is easy to show the
convergence towards a local minimum, but we did not find general conditions ensuring global
convergence. If these conditions exist, they rely on initial conditions.
Finally, we stress that the optimality conditions for $c$ (or in a less rigorous sense for $\lambda_j$) do
not depend on the first part of the cost minimized in (2). In consequence, the equivalence
between Adaptive Ridge and Lasso holds for any model or loss function. The FP algorithm
can be applied to these other problems, without modifying the first step.
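The two alternating steps of the FP algorithm can be sketched as follows for linear regression with quadratic loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `adaptive_ridge_fp`, the fixed iteration count `n_iter`, the pruning threshold `tol` and the Ridge initialization are choices made here.

```python
import numpy as np

def adaptive_ridge_fp(X, y, lam, n_iter=200, tol=1e-12):
    """Fixed-point sketch: alternate the update of c and the solve for gamma."""
    d = X.shape[1]
    # Assumed initialization: the Ridge estimate, i.e. c_j = 1 for all j.
    gamma = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    c = np.ones(d)
    for _ in range(n_iter):
        norm2 = np.sum(gamma ** 2)
        if norm2 == 0.0:                 # every coefficient has vanished
            break
        # Step 1: rebalance the penalties, c_j^2 = d * gamma_j^2 / sum_k gamma_k^2.
        c = np.sqrt(d * gamma ** 2 / norm2)
        c[np.abs(gamma) < tol] = 0.0     # prune numerically null coefficients
        # Step 2: ridge-like solve in the rescaled variables (X * c equals X diag(c)).
        Xc = X * c
        gamma = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ y)
    return c * gamma                     # back to beta_j = c_j * gamma_j

# Toy problem with an identity design, small enough to solve by hand.
X_toy = np.eye(3)
y_toy = np.array([3.0, 1.0, 0.1])
beta_toy = adaptive_ridge_fp(X_toy, y_toy, lam=3.0)
```

On this toy problem the minimizer of the squared error plus $(\lambda/d)(\sum_j |\beta_j|)^2$ can be computed by hand: only the first coefficient stays active, and $\hat\beta = (1.5, 0, 0)$, which the iteration reaches. The sketch solves the full system at each step rather than reducing it to the non-zero variables.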
4 COMPLEXITY TUNING 
The Adaptive Ridge estimate depends on the learning set $s_\ell$ and on the hyper-parameter
$\lambda$. When the estimate is defined by (2) and (3), the analogy with Ridge suggests $\lambda$ as the
"natural" hyper-parameter for tuning the complexity of the regressor. As $\lambda$ goes to zero, $\hat\beta$
approaches the OLS estimate $\hat\beta^o$, and the number of effective parameters is d. As $\lambda$ goes to
infinity, $\hat\beta$ goes to zero and the number of effective parameters is zero.
When the estimate is defined by (4), there is no obvious choice for the hyper-parameter
controlling complexity. Tibshirani [10] proposed to use
$\nu = \sum_{j=1}^{d} |\hat\beta_j| \, / \sum_{j=1}^{d} |\hat\beta_j^o|$. As $\nu$ goes
to one, $\hat\beta$ approaches $\hat\beta^o$; as $\nu$ goes to zero, $\hat\beta$ goes to zero.
The weaess of v is that it is explicitly defined from the OLS estimate. As a result, it is 
variable when the design matrix is badly conditioned. The estimation of v is thus harder, 
and the overall procedure looses in stabiliW. This is illustrated on an experiment following 
Breiman's benchark [ 1 ] with 30 highly co,elated predictors  (Xj X ) = plJ- 1, with 
p = 1 -- 10 -3. 
We generate 1000 i.i.d. samples of size $\ell$ = [?]0. For each sample $s_k$, the modeling error
(ME) is computed for several values of $\nu$ and $\lambda$. We select $\nu_k$ and $\lambda_k$ achieving the
lowest ME. For one sample, there is a one to one mapping from $\nu$ to $\lambda$. Thus ME is the
same for $\nu_k$ and $\lambda_k$. Then, we compute $\nu^*$ and $\lambda^*$ achieving the best average ME on
the 1000 samples. As $\nu_k$ and $\lambda_k$ achieve the lowest ME for $s_k$, the ME for $s_k$ is higher
or equal for $\nu^*$ and $\lambda^*$. Due to the wide spread of $\{\nu_k\}$, the average loss encountered is
twice as large for $\nu^*$ as for $\lambda^*$:
$\frac{1}{1000} \sum_{k=1}^{1000} \big( ME(s_k, \nu^*) - ME(s_k, \nu_k) \big) = 4.6 \times 10^{-[?]}$, and
$\frac{1}{1000} \sum_{k=1}^{1000} \big( ME(s_k, \lambda^*) - ME(s_k, \lambda_k) \big) = 2.3 \times 10^{-[?]}$. The average modeling errors are
$ME(\nu^*) = 1.9 \times 10^{-1}$ and $ME(\lambda^*) = 1.7 \times 10^{-1}$.
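A design with the correlation structure $E(X_j X_k) = \rho^{|j-k|}$ can be simulated as in the following sketch. It assumes zero-mean Gaussian predictors with unit variance; the function name `correlated_design`, the seed, and the sample size passed by the caller are illustrative choices, and the response model of the benchmark is not reproduced here.

```python
import numpy as np

def correlated_design(n, d=30, rho=1 - 1e-3, seed=0):
    """Draw n rows of d Gaussian predictors with E(X_j X_k) = rho^|j-k|."""
    idx = np.arange(d)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])   # target covariance
    chol = np.linalg.cholesky(cov)                     # cov = chol @ chol.T
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) @ chol.T           # rows have covariance cov
    return X, cov

X_demo, cov_demo = correlated_design(n=5)
```

With $\rho$ this close to one the covariance matrix is badly conditioned, which is the regime where an OLS-based quantity such as $\nu$ becomes variable.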
The estimates of prediction error, such as leave-one-out cross-validation, tend to be variable.
Hence, complexity tuning is often based on the minimization of some estimate of the mean
prediction error (e.g. bootstrap, K-fold cross-validation). Our experiment supports that,
regarding mean prediction error, the optimal $\lambda$ performs better than the optimal $\nu$. Thus, $\lambda$ is
the best candidate for complexity tuning.
Although $\lambda$ and $\nu$ are respectively the control parameters of the FP and QP algorithms, the
preceding statement does not imply that we should use the FP algorithm. Once the solution
$\hat\beta$ is known, $\nu$ or $\lambda$ are easily computed. The choice of one hyper-parameter is not linked to
the choice of the optimization algorithm.
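As an illustration of how both hyper-parameters can be read off a known solution, the sketch below computes $\nu$ from its definition and recovers $\lambda$ from a stationarity condition. That condition, $|x_j^T (y - X\hat\beta)| = (\lambda/d) \sum_k |\hat\beta_k|$ for every non-zero $\hat\beta_j$, is derived here from the penalty $(\lambda/d)(\sum_j |\beta_j|)^2$ and is an assumption of this sketch rather than a formula stated in the paper. The function name and the threshold `tol` are hypothetical.

```python
import numpy as np

def hyperparameters_from_solution(X, y, beta, tol=1e-10):
    """Return (nu, lam) for a non-zero solution beta of the penalized problem."""
    d = X.shape[1]
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # OLS estimate
    l1 = np.sum(np.abs(beta))
    nu = l1 / np.sum(np.abs(beta_ols))                 # ratio of absolute-value norms
    active = np.abs(beta) > tol                        # non-zero coefficients
    corr = np.abs(X.T @ (y - X @ beta))[active]        # |x_j^T residual| on the active set
    lam = d * corr.mean() / l1                         # averaged over the active set
    return nu, lam

# Identity design, with the solution (1.5, 0, 0) that corresponds to lam = 3.
nu_demo, lam_demo = hyperparameters_from_solution(
    np.eye(3), np.array([3.0, 1.0, 0.1]), np.array([1.5, 0.0, 0.0]))
```

Averaging over the active set is only a way to absorb numerical error; at an exact solution every active coefficient gives the same value.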
5 APPLICATIONS 
Adaptive Ridge may be applied to a variety of regression techniques. They include kernel 
smoothing, additive and neural net modeling. 
5.1 KERNEL SMOOTHING 
Soft-thresholding was proved to be efficient in wavelet functional estimation [2]. Kernel
smoothers [5] can also benefit from the sparse representation given by soft-thresholding
methods. For these regressors, $f(x) = \sum_{i=1}^{\ell} \beta_i K(x, x_i) + \beta_0$, there are as many covariates
as pairs in the sample. The quadratic programming procedure of Lasso with $2\ell + 1$ constraints becomes
computationally expensive, but the FP algorithm of Adaptive Ridge is reasonably fast to
converge.
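For a kernel smoother, the design matrix fed to the penalized regression has one column per training point. The sketch below builds it for one-dimensional inputs with a Gaussian kernel; the function name `gaussian_kernel_design` and the `width` parameter are assumptions made here, and the intercept $\beta_0$ is left to the caller (for instance by centering the responses).

```python
import numpy as np

def gaussian_kernel_design(x, width):
    """Return the matrix K with K[i, k] = exp(-(x_i - x_k)^2 / (2 width^2))."""
    x = np.asarray(x, dtype=float)
    diff = x[:, None] - x[None, :]          # pairwise differences between inputs
    return np.exp(-0.5 * (diff / width) ** 2)

K_demo = gaussian_kernel_design([0.0, 1.0, 3.0], width=1.0)
```

Each column of the matrix plays the role of one covariate, so a sparse coefficient vector selects a few prototype points among the training inputs.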
An example of least squares fitting is shown in fig. 1 for the motorcycle dataset [5]. On
this example, the hyper-parameter $\lambda$ has been estimated by .632 bootstrap (with 50
bootstrap replicates) for Ridge and Adaptive Ridge regressions. For tuning $\lambda$, it is not necessary
to determine the coefficients $\beta$ with high accuracy. Hence, compared to Ridge regression,
the overall amount of computation required to get the Adaptive Ridge estimate was about
six times larger. For evaluation, Adaptive Ridge is ten times faster than Ridge
regression, as the final fitting uses only a few kernels (11 out of 133).
Figure 1: Adaptive Ridge (AR) and Ridge (R) in kernel smoothing on the
motorcycle data. The + are data points, and * are the prototypes
corresponding to the kernels with non-zero coefficients in AR. The Gaussian
kernel used is represented dotted in the lower right-hand corner.
Girosi [3] showed an equivalence between a version of least absolute shrinkage applied to 
kernel smoothing, and Support Vector Machine (SVM). However, Adaptive Ridge, as ap- 
plied here, is not equivalent to SVM, as the cost minimized is different. The fit and proto- 
types are thus different from the fit and support vectors that would be obtained from SVM. 
5.2 ADDITIVE MODELS 
Additive models [6] are sums of univariate functions, $f(x) = \sum_{j=1}^{d} f_j(x_j)$. In the
non-parametric setting, $\{f_j\}$ are smooth but unspecified functions. Additive models are easily
represented and thus interpretable, but they require the choice of the relevant covariates to
be included in the model, and of the smoothness of each $f_j$.
In the form presented in the two previous sections, Adaptive Ridge regression penalizes
differently each individual coefficient, but it is easily extended to the pooled penalization of
coefficients. Adaptive Ridge may thus be used as an alternative to BRUTO [6] to balance
the penalization parameters on each $f_j$.
A classical choice for $f_j$ is cubic spline smoothing. Let $B_j$ denote the $\ell \times (\ell + 2)$ matrix of
the unconstrained B-spline basis, evaluated at $x_{ij}$. Let $\Omega_j$ be the $(\ell + 2) \times (\ell + 2)$ matrix
corresponding to the penalization of the second derivative of $f_j$. The coefficients of $f_j$ in
the unconstrained B-spline basis are noted $\beta_j$. The "natural" extension of Adaptive Ridge
is to minimize
$$\Big\| \sum_{j=1}^{d} B_j \beta_j - y \Big\|^2 + \sum_{j=1}^{d} \lambda_j\, \beta_j^T \Omega_j \beta_j , \qquad (7)$$
subject to constraint (3). This problem is easily shown to have the same solution as the
minimization of
$$\Big\| \sum_{j=1}^{d} B_j \beta_j - y \Big\|^2 + \frac{\lambda}{d} \Big( \sum_{j=1}^{d} \sqrt{\beta_j^T \Omega_j \beta_j} \Big)^2 . \qquad (8)$$
Note that if the cost (8) is optimized with respect to a single covariate, the solution is a usual
smoothing spline regression (with quadratic penalization). In the multidimensional case,
$c_j = \sqrt{\hat\beta_j^T \Omega_j \hat\beta_j} = \sqrt{\int \{ \hat f_j''(t) \}^2 dt}$ may be used to summarize the non-linearity of $f_j$, thus $|c_j|$
can be interpreted as a relevance index operating besides linear dependence of feature j.
The penalizer in (8) is a least absolute shrinkage operator applied to $c_j$. Hence, formula (8)
may be interpreted as "quadratic penalization within, and soft-thresholding between
covariates". The FP algorithm of section 3 is easily modified to minimize (8), and backfitting may
be used to solve the second step of this procedure.
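The pooled penalty can be checked numerically. Writing $q_j = \beta_j^T \Omega_j \beta_j$, the sketch below computes the $\lambda_j$ that minimize $\sum_j \lambda_j q_j$ under constraint (3) and the resulting penalty $(\lambda/d)(\sum_j \sqrt{q_j})^2$. The closed form for $\lambda_j$ is derived here by the same Lagrangian argument as in the unpooled case and is an assumption of this sketch; the function name is hypothetical and all $q_j$ are taken strictly positive.

```python
import numpy as np

def pooled_penalty(q, lam):
    """q[j] = beta_j^T Omega_j beta_j > 0. Return (lambda_j, pooled penalty)."""
    q = np.asarray(q, dtype=float)
    d = q.size
    root = np.sqrt(q)
    lam_j = lam * root.sum() / (d * root)      # smoother components get larger lambda_j
    penalty = lam / d * root.sum() ** 2        # soft-thresholding between covariates
    return lam_j, penalty

q_demo = np.array([4.0, 1.0, 0.25])
lam_j_demo, penalty_demo = pooled_penalty(q_demo, lam=2.0)
```

In this example the weighted quadratic penalty $\sum_j \lambda_j q_j$ evaluated at the returned $\lambda_j$ equals the pooled penalty, and the uniform choice $\lambda_j = \lambda$, which also satisfies (3), gives a larger value.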
A simulated example in dimension five is shown in fig. 2. The fitted univariate functions
are plotted for five values of $\lambda$. There is no dependency between the explained variable
and the last covariate. The other covariates affect the response, but the dependency on the
first features is smoother, hence easier to capture and more relevant for the spline smoother.
For a small value of $\lambda$, the univariate functions are unsmooth, and the additive model is
interpolating the data. For $\lambda = 10^{-4}$, the dependencies are well estimated on all covariates.
As $\lambda$ increases, the covariates with higher coordinate number are more heavily penalized,
and the corresponding $f_j$ tend to be linear.
[Figure 2 panels, left to right: $x_1$, $x_2$, $x_3$, $x_4$, $x_5$]
Figure 2: Adaptive Ridge in additive modeling on simulated data. The
true model is $y = x_1 + \cos(\pi x_2) + \cos(2\pi x_3) + \cos(3\pi x_4) + \varepsilon$. The
covariates are independently drawn from a uniform distribution on [-1, 1]
and $\varepsilon$ is a Gaussian noise of standard deviation $\sigma = 0.3$. The solid curves
are the estimated univariate functions for different values of $\lambda$, and + are
partial residuals.
Linear trends are not penalized in cubic spline smoothing. Thus, when after convergence
$\hat\beta_j^T \Omega_j \hat\beta_j = 0$, the jth covariate is not eliminated. This can be corrected by applying
Adaptive Ridge a second time. To test if a significant linear trend can be detected, a linear
(penalized) model may be used for $f_j$, the remaining $f_k$, $k \ne j$, being cubic splines.
5.3 MLP FITTING 
The generalization to the pooled penalization of coefficients can also be applied to
Multi-Layered Perceptrons to control the complexity of the fit. If weights are penalized
individually, Adaptive Ridge is equivalent to the Lasso. If weights are pooled by layer, Adaptive
Ridge automatically tunes the amount of penalization on each layer, thus avoiding the
multiple hyper-parameter tuning necessary in weight-decay [7].
[Figure 3 diagrams: two networks with inputs $x_1, x_2, \ldots, x_{d-1}, x_d$]
Figure 3: Groups of weights for two examples of Adaptive Ridge in MLP
fitting. Left: hidden node soft-thresholding. Right: input penalization
and selection, and individual smoothing coefficient for each output unit.
Two other interesting configurations are shown in fig. 3. If weights are pooled by incoming
and outgoing weights of a unit, node penalization/pruning is performed. The weight
groups may also gather the outgoing weights from each input unit, or the incoming weights
from each output unit (one set per input plus one per output). The goal here is to
penalize/select the input variables according to their relevance, and each output variable
according to the smoothness of the corresponding mapping. This configuration proves
especially useful in time series prediction, where the number of inputs to be fed into the network
is not known in advance. There are also more complex choices of pooling, such as the one
proposed to encourage additive modeling in Automatic Relevance Determination [8].
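The pooling schemes above differ only in how the weights are grouped. The following sketch builds the groups for a one-hidden-layer network and computes the weight-decay coefficient induced on each group. The layout of `W1` and `W2`, the omission of biases, the function names, and the formula for the induced coefficients (the minimizer of $\sum_g \lambda_g \|w_g\|^2$ under constraint (3), with d replaced by the number of groups) are all assumptions of this illustration.

```python
import numpy as np

def groups_by_layer(W1, W2):
    """One group per layer. W1: (hidden, inputs), W2: (outputs, hidden)."""
    return [W1.ravel(), W2.ravel()]

def groups_by_hidden_unit(W1, W2):
    """One group per hidden unit: its incoming and outgoing weights."""
    return [np.concatenate([W1[h, :], W2[:, h]]) for h in range(W1.shape[0])]

def induced_decay(groups, lam):
    """Per-group coefficients lambda_g, assuming every group has a non-zero norm."""
    norms = np.array([np.linalg.norm(g) for g in groups])
    return lam * norms.sum() / (len(groups) * norms)

W1_demo = np.array([[3.0, 4.0], [0.0, 1.0]])   # 2 hidden units, 2 inputs
W2_demo = np.array([[0.0, 2.0]])               # 1 output, 2 hidden units
unit_groups = groups_by_hidden_unit(W1_demo, W2_demo)
decay_demo = induced_decay(unit_groups, lam=0.1)
```

Groups with a small norm receive a large coefficient and are driven further towards zero, which is the pruning behaviour described for units or inputs.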
References 
[1] L. Breiman. Heuristics of instability and stabilization in model selection. The Annals
of Statistics, 24(6):2350-2383, 1996.
[2] D.L. Donoho and I.M. Johnstone. Minimax estimation via wavelet shrinkage. Ann.
Statist., 26(3):879-921, 1998.
[3] F. Girosi. An equivalence between sparse approximation and support vector machines.
Technical Report 1606, M.I.T. AI Laboratory, Cambridge, MA., 1997.
[4] Y. Grandvalet. Least absolute shrinkage is equivalent to quadratic penalization. In
L. Niklasson, M. Bodén, and T. Ziemke, editors, ICANN'98, volume 1 of Perspectives
in Neural Computing, pages 201-206. Springer, 1998.
[5] W. Härdle. Applied Nonparametric Regression, volume 19 of Econometric Society
Monographs. Cambridge University Press, New York, 1990.
[6] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models, volume 43 of
Monographs on Statistics and Applied Probability. Chapman & Hall, New York, 1990.
[7] D.J.C. MacKay. A practical Bayesian framework for backprop networks. Neural
Computation, 4(3):448-472, 1992.
[8] R. M. Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics.
Springer, New York, 1996.
[9] S.J. Nowlan and G.E. Hinton. Simplifying neural networks by soft weight-sharing.
Neural Computation, 4(4):473-493, 1992.
[10] R.J. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, B, 58(1):267-288, 1995.
