Some Theoretical Results Concerning the 
Convergence of Compositions of Regularized 
Linear Functions 
Tong Zhang 
Mathematical Sciences Department 
IBM T.J. Watson Research Center 
Yorktown Heights, NY 10598 
tzhang@watson.ibm.com 
Abstract 
Recently, sample complexity bounds have been derived for problems in- 
volving linear functions such as neural networks and support vector ma- 
chines. In this paper, we extend some theoretical results in this area by 
deriving dimensional independent covering number bounds for regular- 
ized linear functions under certain regularization conditions. We show 
that such bounds lead to a class of new methods for training linear clas- 
sifiers with similar theoretical advantages of the support vector machine. 
Furthermore, we also present a theoretical analysis for these new meth- 
ods from the asymptotic statistical point of view. This technique provides 
better description for large sample behaviors of these algorithms. 
1 Introduction 
In this paper, we are interested in the generalization performance of linear classifiers ob- 
tained from certain algorithms. From computational learning theory point of view, such 
performance measurements, or sample complexity bounds, can be described by a quanti- 
ty called covering number [ 11, 15, 17], which measures the size of a parametric function 
family. For two-class classification problem, the covering number can be bounded by a 
combinatorial quantity called VC-dimension [12, 17]. Following this work, researchers 
have found other combinatorial quantities (dimensions) useful for bounding the covering 
numbers. Consequently, the concept of VC-dimension has been generalized to deal with 
more general problems, for example in [ 15, 11 ]. 
Recently, Vapnik introduced the concept of support vector machine [ 16] which has been 
successful applied to many real problems. This method achieves good generalization by 
restricting the 2-norm of the weights of a separating hyperplane. A similar technique has 
been investigated by Bartlett [3], where the author studied the performance of neural net- 
works when the 1-norm of the weights is bounded. The same idea has also been applied 
in [ 13] to explain the effectiveness of the boosting algorithm. In this paper, we will extend 
their results and emphasize the importance of dimension independence. Specifically, we 
consider the following form of regularization method (with an emphasis on classification 
problems) which has been widely studied for regression problems both in statistics and in 
Convergence of Regularized Linear Functions 3 71 
numerical mathematics: 
inf Ex,yL(w, z, y) = inf Ex,y f(w 'zy) + ,g(w), (1) 
where E,y is the expectation over a distribution of (z, y), and y E {-1, 1} is the binary 
label of data vector z. To apply this formulation for the purpose of training linear classifiers, 
we can choose f as a decreasing function, such that f(-) _> 0, and choose g(w) > 0 as 
a function that penalizes large w (lirn g(w) -+ oo). X is an appropriately chosen 
positive parameter to balance the two terms. 
The paper is organized as follows. In Section 2, we briefly review the concept of covering 
numbers as well as the main results related to analyzing the performance of learning algo- 
rithms. In Section 3, we introduce the regularization idea. Our main goal is to construct 
regularization conditions so that dimension independent bounds on covering numbers can 
be obtained. Section 4 extends results from the previous section to nonlinear composition- 
s of linear functions. In Section 5, we give an asymptotic formula for the generalization 
performance of a learning algorithm, which will then be used to analyze an instance of 
SVM. Due to the space limitation, we will only present the main results and discuss their 
implications. The detailed derivations can be found in [ 18]. 
2 Covering numbers 
We formulate the learning problem as to find a parameter from random observations to 
minimize risk: given a loss function L(a, z) and n observations X = {zx,... , z,} 
independently drawn from a fixed but unknown distribution D, we want to find a that 
minimizes the expected loss over z (risk): 
R(): E L(, )= f L(, ) dP(z). (2) 
The most natural method for solving (2) using a limited number of observations is by the 
empirical risk minimization (ERM) method (cf [15, 16]). We simply choose a parameter 
a that minimizes the observed risk: 
(3) 
) = 
i=1 
We denote the parameter obtained in this way as Oerm(X). The convergence behavior 
of this method can be analyzed by using the VC theoretical point of view, which relies 
on the uniform convergence of the empirical risk (the uniform law of large numbers): 
sups IR(a, X) - R(a)[. Such a bound can be obtained from quantities that measure 
the size of a Glivenko-Cantelli class. For finite number of indices, the family size can be 
measured simply by its cardinality. For general function families, a well known quantity to 
measure the degree of uniform convergence is the covering number which can be be dated 
back to Kolmogrov [8, 9]. The idea is to discretize (which can depend on the data X) the 
parameter space into N values a,... , aN so that each L(a, .) can be approximated by 
L(ai, .) for some i. We shall only describe a simplified version relevant for our purposes. 
Definition 2.1 Let B be a metric space with metric p. Given a norm p, observations X - 
[z, . . . , z,], and vectors f(a, X ) = [f(a, z ), . . . , f(a, z,)] E B ' parameterized by 
a, the covering number in p-norm, denoted as A/'p( f, e, X ), is the minimum number of a 
collection of vectors vt, . . . , v, E B ' such that Va, 3vi' [Ip(f(.,x?), 
We also denote Alp(f, e, n): maxx Alp (f, e, Xp). 
Note that from the definition and the Jensen's inequality, we have ./V'p _< ./V'q for p _< q. We 
will always assume the metric on R to be [zx - z21 if not explicitly specified otherwise. 
The following theorem is due to Pollard [11]' 
372 T. Zhang 
Theorem 2.1 ([lid n, e > 0 and distribution D. 
_he 2 
P[supIR(a,X)- R(c)[ > e] <_ 8E[A/(L,e/S,X)]exp(128M2), 
where M = sup,,x L(a, z) - inf,.x L(a, ), and X[* = (, . . . , ,) are independently 
drawn from D. 
The constants in the above theorem can be improved for certain problems: see [4, 6, 15, ! 6] 
for related results. However, they yield very similar bounds. The result most relevant for 
this paper is a lemma in [3] where the 1-norm covering number is replaced by the cx>-norm 
covering number. The latter can be bounded by a scale-sensitive combinatorial dimension 
[ 1 ], which can be bounded from the 1-norm covering number if this covering number does 
not depend on n. These results can replace Theorem 2.1 to yield better estimates under 
certain circumstances. 
Since Bartlett's lemma in [3] is only for binary loss functions, we shall give a generalization 
so that it is comparable to Theorem 2.1' 
Theorem 2.2 Let fx and f2 be two functions.' R '  [0, 1] such that lyx - y2l _< implies 
f (y) <_ fa(y2 ) _< f2(Y ) where rs' R  -+ [0, 1] is a reference separating function, then 
--he 2 
P[sup[Efx(L(a,z)) - Exl f2(L(a,z))] > e] _< 4E[Afoo(L,'x,X)]exp(). 
Note that in the extreme case that some choice of a achieves perfect generalization: 
Ef2(L(a, z)) - 0, and assume that our choices of a(X) always satisfy the condition 
Ex? f2 (L(a, z)) = 0, then better bounds can be obtained by using a refined version of the 
Chernoffbound. 
3 Covering number bounds for linear systems 
In this section, we present a few new bounds on covering numbers for the following form 
of real valued loss functions: 
d 
L(w, z): z T w - E ziwi. (4) 
i:1 
As we shall see later, these bounds are relevant to the convergence properties of (1). Note 
that in order to apply Theorem 2.1, since . < A/'2, therefore it is sufficient to estimate 
A/'2(L, e, n) for e > 0. It is clear that A/'2(L, e, n- is not finite if no restrictions on z and w 
are imposed. Therefore in the following, we will assume that each is bounded, and 
study conditions of llwllq so that log A/'( f, e, n) is independent or weakly dependent of d. 
Our first result generalizes a theorem of Bartlett [3]. The original results is with p -- oo 
and q = 1, and the related technique has also appeared in [10, 13]. The proof uses a lemma 
that is attributed to Maurey (cf. [2, 7]). 
Theorem 3.1 r/llllp 5 b andllvollq a, where I/p+ 1/q = 1 and2 _< p _< oo, then 
a 2 b 2 
The above bound on the covering number depends logarithmically on d, which is already 
quite weak (as compared to linear dependency on d in the standard situation). However, the 
bound in Theorem 3.1 is not tight for p < oo. For example, the following theorem improves 
the above bound for p = 2. Our technique of proof relies on the SVD decomposition [5] 
for matrices, which improves a similar result in [14] by a logarithmic factor. 
Convergence of Regularized Linear Functions $ 73 
Theorem 3.2 Ifllill b and IIll a, the, 
log..M.(L, e, n)< [ 2a'b' ] log.(4a'b'/e+ 1). 
The next theorem shows that if 1/p + 1/q > 1, then the 2-norm covering number is also 
ndependent of dimension. 
Theorem 3.3 Let (w,z) = z'w. f'llillp 5 bandllwllq 5 a, where 1 <_ q _< 2 and 
5 = l/p + l/q- l > O. then 
.4a2b 2 
log2N'2(,e,n)_< | -2 1g2(2(2ab/e) /+ 1). 
One consequence of this theorem is a potentially refined explanation tbr the boosting al- 
gorithm. In [ 13], the boosting algorithm has been analyzed by using a technique related to 
results in [3] which essentially rely on Theorem 3.1 with p -- oo. Unfommately, the bound 
contains a logarithmic dependency on d (in the most general case) which does not seem to 
fully explain the fact that in many cases the performance of the boosting algorithm keeps 
improving as d increases. However, this seemingly mysterious behavior might be better 
understood from Theorem 3.3 under the assumption that the data is more restricted than 
simply being cr)-norm bounded. For example, when the contribution of the wrong predic- 
tions is bounded by a constant (or grow very slowly as d increases), then we can regard 
its p-th norm bounded for some p < oo. In this case, Theorem 3.3 implies dimensional 
independent generalization. 
If we want to apply Theorem 2.2, then it is necessary to obtain bounds fbr infinity-norm 
covering numbers. The following theorem gives such bounds by using a result from online 
learning. 
Theorem 3.4 I/Ill, lip 5 b and ll11q 5 a, where 2 <_ p < oo and 1/p + 1/q : 1, then 
Ve> O, 
a2b 2 
log2,&(L, e, n ) _< 36(p- 1)--T- log21214ab/e + 2]n + 1]. 
In the case ofp = oo, an entropy condition can be used to obtain dimensional independent 
covering number bounds. 
Definition 3.1 Let/ = [/i] be a vector with positive entries such that IIll: i an this 
case, we call i  a distribution vector). Let z = [zi]  0 be a vector of the same length, then 
we define the weighted relative entropy of z with respect to i  as.' 
i 
Theorem 3.5 Given a distribution vector i, If 1111 5 b and IIl[ 
entr%(w) _< c, where we assume that w has non-negative entries, then  > O, 
36b2(a 2 q- ac) log21214ab/e q- 2In q- 1]. 
log2 -/foo (L, e, n) _< e2 
< a and 
Theorems in this section can be combined with Theorem 4.1 to form more complex cover- 
ing number bounds for nonlinear compositions of linear functions. 
374 T. Zhang 
4 Nonlinear extensions 
Consider the following system: 
L([a, w], :*) = f(g(a, :*)+ w'h(a, :*)), (5) 
where :* is the observation, and [a, w] is the parameter. We assume that f is a nonlinear 
function with bounded total variation. 
Definition 4.1 A function f: R -+ R is said to satisJ the Lipschitz condition with param- 
eter 9' ifV:*, y: If(:*) - f(v)l <_ 3'1:* - Yl. 
Definition 4.2 The total variation of a function f: R -+ R is defined as 
TV(f, :*) -- sup  If(:*i) - f(:*-)l. 
We also denote TV(f, oo) as TV(f). 
Theorem 4.1 IfL([ct, w],:*) = f(g(a,:*) + wTh(a,:*)), where TV(f) < c and f is 
Lipschitz with parameter % Assume also that w is a d-dimensional vector and Ilwllq _ c, 
then Vq, e2 > O, and n > 2(d + 1): 
en TV(f)j,1)]+logA([g,h],e/v,n), 
+ (a+ 
where the metric of[g, h] is defined as [gx - g21 + cll& - n2ll n/p + 1/q = 1). 
Example 4.1 Consider classification by hyperplane: L(w, :*) = I(wT:* < 0) where I is 
the set indicator function. Let L'(w, :*) - fo(wT:*) be another loss function where 
1 z<i0 
fo(z) : 1- z z C ,1]. 
0 z> 
Instead of using ERM for estimating parameter that minimizes the risk of L, consider the 
scheme of minimize empirical risk associated with L ', under the assumption that [I :.11 < b 
and constraint that Ilwl12 _< a. Denote the estimated parameter by w. It follows from the 
covering number bounds and Theorem 2.1 that with probability of at least 1 - 
n/2ab In(nab + 2) + In 
EzI(wnT :* < 0) < inf Ezfo(w T :*) + O( v ). 
If we apply a slight generalization of Theorem 2.2 and the covering number bound of 
Theorem 3.4, then with probability of at least 1 - r: 
/1 a2 b 2 
EI(w:* _< O)_< Exi, I(w:* _< 27) + O(V(--5--ln(a*/7 + 2) +lnn + In )) 
for all 7 C (0, 1]. [] 
Bounds given in this paper can be applied to show that under appropriate regularization 
conditions and assumptions on the data, methods based on (1) lead to generalization per- 
formances of the form (1/x/), where ( symbol (which is independent of d) is used to 
indicate that the hidden constant may include a polynomial dependency on log(n). It is 
also important to note that in certain cases, , will not appear (or it has a small influence on 
the convergence) in the constant of (, as being demonstrated by the example in the next 
section. 
Convergence of Regularized Linear Functions 375 
5 Asymptotic analysis 
The convergence results in the previous sections are in the form of VC style convergence 
in probability, which has a combinatorial flavor. However, for problems with differen- 
tiable function families involving vector parametem, it is often convenient to derive precise 
asymptotic results using the differential structure. 
Assume that the parameter a E R " in (2) is a vector and L is a smooth function. Let 
a* denote the optimal parameter; V denote the derivative with respect to a; and 9(ct, z) 
denote VL(a, ). Assume that 
U = f t(*, )t(*, ) T 
Then under certain regularity conditions, the asymptotic expected generalization error is 
given by 
E J(erm)= /(O?)-3' 2--tr(v-lu). (6) 
More generally, for any evaluation function h(a) such that Vh(a*): 0: 
E h(erm) + 2tr(V-V2' (7) 
where Veh is the Hessian matrix of h at a*. Note that this approach assumes that the op- 
timal solution is unique. These results are exact asymptotically and provide better bounds 
than those from the standard PAC analysis. 
Example 5.1 We would like to study a form of the support vector machine: Consider 
(c, z) -- f(cTz) + ,kc 2, 
{10-z z<l 
f(z) : - 
z>l 
Because of the discontinuity in the derivative of f, the asymptotic formula may not hold. 
However, if we make an assumption on the smoothness of the distribution , then the 
expectation of the derivative over e can still be smooth. In this case, the smoothness of 
f itself is not crucial. Furthermore, in a separate report, we shall illustrate that similar 
small sample bounds without any assumption on the smoothness of the distribution can be 
obtained by using techniques related to asymptotic analysis. 
Consider the optimal parameter a* and let S = {z  a*Tz < 1}. Note that Xa* = 
and U = Ees(z - Esz)(z - Esz) T. Assume that 7 > 0 s.t. P(a*Tz _< 7): O, 
then V: )I + B where B is a positive semi-definite matrix. It follows that 
Ez6s 2 
tr(V-S) <_ r(S)/X _< II*ll _< sup Ilzll11*11/% 
Now, consider a, obtained from observations X = [ ,..., ,] by minimizing empirical 
risk associated with loss function (a, ), then 
1 
E=L(ae,p, z) _< inaf E=L(a, z) + n sup 
asymptotically. Let  --> 0, this scheme becomes the optimal separating hyperplane [ 16]. 
This asymptotic bound is better than typical PAC bounds with fixed . [] 
Note that although the bound obtained in the above example is very similar to the mistake 
bound for the perceptron online update algorithm, we may in practice obtain much better 
estimates from (6) by plugging in the empirical data. 
376 T. Zhang 
References 
Ill 
[2] 
[3] 
[4] 
[5] 
[6] 
[7] 
[8] 
[9] 
[10] 
[11] 
[12] 
[13] 
[14] 
[15] 
[16] 
[17] 
[18] 
N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimen- 
sions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 
1997. 
A.R. Barron. Universal approximation bounds tbr superpositions ofa sigmoidal func- 
tion. IEEE Transactions on InJbrmation Theory, 39(3):930-945, 1993. 
P.L. Bartlett. The sample complexity of pattern classification with neural networks: 
the size of the weights is more important than the size of the network. IEEE Transac- 
tions on Information Theory, 44(2):525-536, 1998. 
R.M. Dudley. A course on empirical processes, volume 1097 of Lecture Notes in 
Mathematics. 1984. 
G.H. Golub and C.F. Van Loan. Matrix computations. Johns Hopkins University 
Press, Baltimore, MD, third edition, 1996. 
D. Haussler. Generalizing the PAC model: sample size bounds from metric 
dimension-based uniform convergence results. In Proc. 30th IEEE Symposium on 
Foundations of Computer Science, pages 40-45, 1989. 
Lee K. Jones. A simple lemma on greedy approximation in Hilbert space and con- 
vergence rates for projection pursuit regression and neural network training. Ann. 
Statist., 20(1):608--613, 1992. 
A.N. Kolmogorov. Asymptotic characteristics of some completely bounded metric 
spaces. Dokl. Akad. Nauk. SSSR, 108:585-589, 1956. 
A.N. Kolmogorov and V.M. Tihomirov. e-entropy and e-capacity of sets in functional 
spaces. Amer. Math. Soc. Transl., 17(2):277-364, 1961. 
Wee Sun Lee, P.L. Bartlett, and R.C. Williamson. Efficient agnostic learning of 
neural networks with bounded fan-in. IEEE Transactions on In, formation Theory,, 
42(6):2118-2132, 1996. 
D. Pollard. Convergence of stochastic processes. Springer-Verlag, New York, 1984. 
N. Sauer. On the density of families of sets. Journal of Combinatorial Theory (Series 
,4), 13:145-147, 1972. 
Robert E. Schapire, Yoav Freun& Peter Bartlett, and Wee Sun Lee. Boosting the 
margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 
26(5): 1651-1686, 1998. 
J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk 
minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5): 1926- 
1940, 1998. 
V.N. Vapnik. Estimation of dependences based on empirical data. Springer-Verlag, 
New York, 1982. Translated from the Russian by Samuel Kotz. 
V.N. Vapnik. The nature of statistical learning theory. Springer-Verlag, New York, 
1995. 
V.N. Vapnik and A.J. Chervonenkis. On the uniform convergence of relative fre- 
quencies of events to their probabilities. Theory of Probability and Applications, 
16:264-280, 1971. 
Tong Zhang. Analysis of regularized linear functions for classification problems. 
Technical Report RC-21572, IBM, 1999. 
PART IV 
ALGORITHMS AND ARCHITECTURE 
