A Bound on the Error of Cross Validation Using 
the Approximation and Estimation Rates, with 
Consequences for the Training-Test Split 
Michael Kearns 
AT&T Research 
1 INTRODUCTION 
We analyze the performance of cross validation¹ in the context of model selection and 
complexity regularization. We work in a setting in which we must choose the right number 
of parameters for a hypothesis function in response to a finite training sample, with the goal 
of minimizing the resulting generalization error. There is a large and interesting literature 
on cross validation methods, which often emphasizes asymptotic statistical properties, or 
the exact calculation of the generalization error for simple models. Our approach here is 
somewhat different, and is primarily inspired by two sources. The first is the work of Barron 
and Cover [2], who introduced the idea of bounding the error of a model selection method 
(in their case, the Minimum Description Length Principle) in terms of a quantity known as 
the index of resolvability. The second is the work of Vapnik [5], who provided extremely 
powerful and general tools for uniformly bounding the deviations between training and 
generalization errors. 

¹Perhaps in conflict with accepted usage in statistics, here we use the term "cross validation" to 
mean the simple method of saving out an independent test set to perform model selection. Precise 
definitions will be stated shortly. 
We combine these methods to give a new and general analysis of cross validation 
performance. In the first and more formal part of the paper, we give a rigorous bound on the error 
of cross validation in terms of two parameters of the underlying model selection problem: 
the approximation rate and the estimation rate. In the second and more experimental part 
of the paper, we investigate the implications of our bound for choosing $\gamma$, the fraction of 
data withheld for testing in cross validation. The most interesting aspect of this analysis is 
the identification of several qualitative properties of the optimal $\gamma$ that appear to be invariant 
over a wide class of model selection problems: 
- When the target function complexity is small compared to the sample size, the 
  performance of cross validation is relatively insensitive to the choice of $\gamma$. 
- The importance of choosing $\gamma$ optimally increases, and the optimal value for $\gamma$ 
  decreases, as the target function becomes more complex relative to the sample 
  size. 
- There is nevertheless a single fixed value for $\gamma$ that works nearly optimally for a 
  wide range of target function complexities. 
2 THE FORMALISM 
We consider model selection as a two-part problem: choosing the appropriate number of 
parameters for the hypothesis function, and tuning these parameters. The training sample 
is used in both steps of this process. In many settings, the tuning of the parameters is 
determined by a fixed learning algorithm such as backpropagation, and then model selection 
reduces to the problem of choosing the architecture. Here we adopt an idealized version of 
this division of labor. We assume a nested sequence of function classes $H_1 \subseteq \cdots \subseteq H_d \subseteq \cdots$, 
called the structure [5], where $H_d$ is a class of boolean functions of $d$ parameters, each 
function being a mapping from some input space $X$ into $\{0, 1\}$. For simplicity, in this 
paper we assume that the Vapnik-Chervonenkis (VC) dimension [6, 5] of the class $H_d$ is 
$O(d)$. To remove this assumption, one simply replaces all occurrences of $d$ in our bounds by 
the VC dimension of $H_d$. We assume that we have in our possession a learning algorithm 
$L$ that on input any training sample $S$ and any value $d$ will output a hypothesis function 
$\tilde{h}_d \in H_d$ that minimizes the training error over $H_d$; that is, $\epsilon_t(\tilde{h}_d) = \min_{h \in H_d}\{\epsilon_t(h)\}$, 
where $\epsilon_t(h)$ is the fraction of the examples in $S$ on which $h$ disagrees with the given label. 
In many situations, training error minimization is known to be computationally intractable, 
leading researchers to investigate heuristics such as backpropagation. The extent to which 
the theory presented here applies to such heuristics will depend in part on the extent to 
which they approximate training error minimization for the problem under consideration. 
Model selection is thus the problem of choosing the best value of $d$. More precisely, we 
assume an arbitrary target function $f$ (which may or may not reside in one of the function 
classes in the structure $H_1 \subseteq \cdots \subseteq H_d \subseteq \cdots$), and an input distribution $P$; $f$ and $P$ together 
define the generalization error function $\epsilon_g(h) = \Pr_{x \in P}[h(x) \neq f(x)]$. We are given a 
training sample $S$ of $f$, consisting of $m$ random examples drawn according to $P$ and labeled 
by $f$ (with the labels possibly corrupted by a noise process that randomly complements each 
label independently with probability $\eta < 1/2$). The goal is to minimize the generalization 
error of the hypothesis selected. 
In this paper, we will make the rather mild but very useful assumption that the structure has 
the property that for any sample size $m$, there is a value $d_{\max}(m)$ such that $\epsilon_t(\tilde{h}_{d_{\max}(m)}) = 
0$ for any labeled sample $S$ of $m$ examples. We call the function $d_{\max}(m)$ the fitting 
number of the structure. The fitting number formalizes the simple notion that with enough 
parameters, we can always fit the training data perfectly, a property held by most sufficiently 
powerful function classes (including multilayer neural networks). We typically expect the 
fitting number to be a linear function of $m$, or at worst a polynomial in $m$. The significance 
of the fitting number for us is that no reasonable model selection method should choose $\tilde{h}_d$ 
for $d \ge d_{\max}(m)$, since doing so simply adds complexity without reducing the training 
error. 
In this paper we concentrate on the simplest version of cross validation. We choose a 
parameter $\gamma \in [0, 1]$, which determines the split between training and test data. Given the 
input sample $S$ of $m$ examples, let $S'$ be the subsample consisting of the first $(1-\gamma)m$ 
examples in $S$, and $S''$ the subsample consisting of the last $\gamma m$ examples. In cross validation, 
rather than giving the entire sample $S$ to $L$, we give only the smaller sample $S'$, resulting in 
the sequence $\tilde{h}_1, \ldots, \tilde{h}_{d_{\max}((1-\gamma)m)}$ of increasingly complex hypotheses. Each hypothesis 
is now obtained by training on only $(1-\gamma)m$ examples, which implies that we will only 
consider values of $d$ smaller than the corresponding fitting number $d_{\max}((1-\gamma)m)$; let 
us introduce the shorthand $d'_{\max}$ for $d_{\max}((1-\gamma)m)$. Cross validation chooses the $\tilde{h}_d$ 
satisfying $\tilde{h}_d = \arg\min_{i \in \{1,\ldots,d'_{\max}\}}\{\epsilon''_t(\tilde{h}_i)\}$, where $\epsilon''_t(\tilde{h}_i)$ is the error of $\tilde{h}_i$ on the subsample 
$S''$. Notice that we are not considering multifold cross validation, or other variants that 
make more efficient use of the sample, because our analyses will require the independence 
of the test set. However, we believe that many of the themes that emerge here may apply to 
these more sophisticated variants as well. 
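The selection rule just described is simple enough to sketch in code. The following is a 
minimal illustration (ours, not the paper's; the function names and the signature of the 
training-error minimizer are hypothetical), selecting a hypothesis by training on the first 
$(1-\gamma)m$ examples and testing on the last $\gamma m$: 

    import numpy as np

    def cross_validate(xs, ys, train_error_minimizer, gamma, d_values):
        """Train/test-split cross validation as described above.

        train_error_minimizer(xs, ys, d) is assumed to return a hypothesis
        h: x -> {0, 1} minimizing training error within H_d; it stands in
        for the idealized learning algorithm L of the text.
        """
        m = len(xs)
        split = int(round((1.0 - gamma) * m))
        train_x, train_y = xs[:split], ys[:split]   # S': first (1 - gamma)m examples
        test_x, test_y = xs[split:], ys[split:]     # S'': last gamma*m examples
        best_h, best_err = None, np.inf
        for d in d_values:                          # d = 1, ..., d'_max
            h = train_error_minimizer(train_x, train_y, d)              # \tilde{h}_d
            err = np.mean([h(x) != y for x, y in zip(test_x, test_y)])  # error on S''
            if err < best_err:
                best_h, best_err = h, err
        return best_h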
We use c,, (m) to denote the generalization error %(ha) of the hypothesis ha chosen by cross 
validation when given as input a sample $ of m random examples of the target function. 
Obviously, c,, (m) depends on $, the structure, f, P, and the noise rate. When bounding 
cv (m), we will use the expression "with high probability" to mean with probability 1 - 6 
over the sample $, for some small fixed constant 6 > 0. All of our results can also be 
stated with 6 as a parameter at the cost of a log(1/6) factor in the bounds, or in terms of the 
expected value of ,, (m). 
3 THE APPROXIMATION RATE 
It is apparent that any nontrivial bound on $\epsilon_{cv}(m)$ must take account of some measure of the 
"complexity" of the unknown target function $f$. The correct measure of this complexity is 
less obvious. Following the example of Barron and Cover's analysis of MDL performance 
in the context of density estimation [2], we propose the approximation rate as a natural 
measure of the complexity of $f$ and $P$ in relation to the chosen structure $H_1 \subseteq \cdots \subseteq H_d \subseteq \cdots$. 
Thus we define the approximation rate function $\epsilon_g(d)$ to be $\epsilon_g(d) = \min_{h \in H_d}\{\epsilon_g(h)\}$. The 
function $\epsilon_g(d)$ tells us the best generalization error that can be achieved in the class $H_d$, 
and it is a nonincreasing function of $d$. If $\epsilon_g(s) = 0$ for some sufficiently large $s$, this 
means that the target function $f$, at least with respect to the input distribution, is realizable 
in the class $H_s$, and thus $s$ is a coarse measure of how complex $f$ is. More generally, even 
if $\epsilon_g(d) > 0$ for all $d$, the rate of decay of $\epsilon_g(d)$ still gives a nice indication of how much 
representational power we gain with respect to $f$ and $P$ by increasing the complexity of 
our models. Still missing, of course, is some means of determining the extent to which this 
representational power can be realized by training on a finite sample of a given size, but 
this will be added shortly. First we give examples of the approximation rate that we will 
examine following the general bound on $\epsilon_{cv}(m)$. 
The Intervals Problem. In this problem, the input space $X$ is the real interval $[0, 1]$, and 
the class $H_d$ of the structure consists of all boolean step functions over $[0, 1]$ of at most 
$d$ steps; thus, each function partitions the interval $[0, 1]$ into at most $d$ disjoint segments 
(not necessarily of equal width), and assigns alternating positive and negative labels to 
these segments. The input space is one-dimensional, but the structure contains arbitrarily 
complex functions over $[0, 1]$. It is easily verified that our assumption that the VC dimension 
of $H_d$ is $O(d)$ holds here, and that the fitting number obeys $d_{\max}(m) \le m$. Now suppose 
that the input density $P$ is uniform, and suppose that the target function $f$ is the function 
of $s$ alternating segments of equal width $1/s$, for some $s$ (thus, $f$ lies in the class $H_s$). 
We will refer to these settings as the intervals problem. Then the approximation rate is 
$\epsilon_g(d) = (1/2)(1 - d/s)$ for $1 \le d \le s$ and $\epsilon_g(d) = 0$ for $d \ge s$ (see Figure 1). 
The Perceptron Problem. In this problem, the input space $X$ is $\mathbb{R}^N$ for some large 
natural number $N$. The class $H_d$ consists of all perceptrons over the $N$ inputs in which 
at most $d$ weights are nonzero. If the input density is spherically symmetric (for instance, 
the uniform density on the unit ball in $\mathbb{R}^N$), and the target function is the function in $H_s$ 
with all $s$ nonzero weights equal to 1, then it can be shown that the approximation rate 
is $\epsilon_g(d) = (1/\pi)\cos^{-1}(\sqrt{d/s})$ for $d \le s$ [4], and of course $\epsilon_g(d) = 0$ for $d \ge s$ (see 
Figure 1). 
Power Law Decay. In addition to the specific examples just given, we would also like 
to study reasonably natural parametric forms of $\epsilon_g(d)$, to determine the sensitivity of our 
theory to a plausible range of behaviors for the approximation rate. This is important, 
because in practice we do not expect to have precise knowledge of $\epsilon_g(d)$, since it depends 
on the target function and input distribution. Following the work of Barron [1], who shows 
a $c/d$ bound on $\epsilon_g(d)$ for the case of neural networks with one hidden layer under a squared 
error generalization measure (where $c$ is a measure of target function complexity in terms 
of a Fourier transform integrability condition)², we can consider approximation rates of 
the form $\epsilon_g(d) = (c/d)^\alpha + \epsilon_{min}$, where $\epsilon_{min} \ge 0$ is a parameter representing the "degree 
of unrealizability" of $f$ with respect to the structure, and $c, \alpha > 0$ are parameters capturing 
the rate of decay to $\epsilon_{min}$ (see Figure 1). 

²Since the bounds we will give have straightforward generalizations to real-valued function 
learning under squared error, examining behavior for $\epsilon_g(d)$ in this setting seems reasonable. 
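For concreteness, the three approximation rates just described can be written down directly. 
A minimal sketch (the function and parameter names are ours, not the paper's): 

    import numpy as np

    def eps_g_intervals(d, s):
        # epsilon_g(d) = (1/2)(1 - d/s) for d < s, and 0 for d >= s
        return 0.5 * (1.0 - d / s) if d < s else 0.0

    def eps_g_perceptron(d, s):
        # epsilon_g(d) = (1/pi) arccos(sqrt(d/s)) for d < s, and 0 for d >= s
        return np.arccos(np.sqrt(d / s)) / np.pi if d < s else 0.0

    def eps_g_power_law(d, c, alpha=1.0, eps_min=0.0):
        # epsilon_g(d) = (c/d)^alpha + eps_min
        return (c / d) ** alpha + eps_min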
4 THE ESTIMATION RATE 
For a fixed $f$, $P$ and $H_1 \subseteq \cdots \subseteq H_d \subseteq \cdots$, we say that a function $\rho(d, m)$ is an estimation rate 
bound if for all $d$ and $m$, with high probability over the sample $S$ we have $|\epsilon_t(\tilde{h}_d) - \epsilon_g(\tilde{h}_d)| \le 
\rho(d, m)$, where as usual $\tilde{h}_d$ is the result of training error minimization on $S$ within $H_d$. 
Thus $\rho(d, m)$ simply bounds the deviation between the training error and the generalization 
error of $\tilde{h}_d$. Note that the best such bound may depend in a complicated way on all of 
the elements of the problem: $f$, $P$ and the structure. Indeed, much of the recent work 
on the statistical physics theory of learning curves has documented the wide variety of 
behaviors that such deviations may assume [4, 3]. However, for many natural problems 
it is both convenient and accurate to rely on a universal estimation rate bound provided 
by the powerful theory of uniform convergence: namely, for any $f$, $P$ and any structure, 
the function $\rho(d, m) = \sqrt{(d/m)\log(m/d)}$ is an estimation rate bound [5]. Depending 
upon the details of the problem, it is sometimes appropriate to omit the $\log(m/d)$ factor, 
and often appropriate to refine the $\sqrt{d/m}$ behavior to a function that interpolates smoothly 
between $d/m$ behavior for small $\epsilon_t$ and $\sqrt{d/m}$ behavior for large $\epsilon_t$. Although such refinements are 
both interesting and important, many of the qualitative claims and predictions we will make 
are invariant to them as long as the deviation $|\epsilon_t(\tilde{h}_d) - \epsilon_g(\tilde{h}_d)|$ is well-approximated by a 
power law $(d/m)^\alpha$ ($\alpha > 0$); it will be more important to recognize and model the cases in 
which power law behavior is grossly violated. 
Note that this universal estimation rate bound holds only under the assumption that the 
training sample is noise-free, but straightforward generalizations exist. For instance, if the 
training data is corrupted by random label noise at rate $0 \le \eta < 1/2$, then $\rho(d, m) = 
\sqrt{(d/((1 - 2\eta)^2 m))\log(m/d)}$ is again a universal estimation rate bound. 
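In code, the universal bound and its noise-corrected variant are straightforward (a sketch; 
the constants suppressed by the theory are taken as 1): 

    import numpy as np

    def rho_universal(d, m):
        # universal estimation rate bound: sqrt((d/m) log(m/d))
        return np.sqrt((d / m) * np.log(m / d))

    def rho_noisy(d, m, eta):
        # label noise at rate eta shrinks the effective sample size to (1 - 2*eta)^2 m
        return np.sqrt((d / ((1.0 - 2.0 * eta) ** 2 * m)) * np.log(m / d))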
5 THE BOUND 
Theorem 1 Let H C ... C Ha... be any structure, where the VC dimension of Ha is 
O(d). Let f and P be any target function and input distribution, let eg(d) be the ap- 
proximation rate function for the structure with respect to f and P, and let p(d, m) be an 
estimation rate bound for the structure with respect to f and P. Then for any m, with high 
probability 
(/1g(dm,)) 
eel(m)< min {e(d)+o(d,(1-7)m)}+O V  (1) 
-  <a<a 
where 7 is the fraction of the training sample used for testing, and d'm,a is the fitting number 
dm, a((1 - 7)m). Using the universal estimation bound rate and the rather weak assumption 
that dm,a(m) is polynomial in m, we obtain that with high probability 
toy(m)_< min ,le(d).4-O (( d (-))/ ((log((1-7)m)) 
<a<a',L (1 - 7)m log + 0 7m ' 
(2) 
Straightforward generalizations of these bounds for the case where the data is corrupted 
by classification noise can be obtained, using the modified estimation rate bound given in 
Section 4.³ 

³The main effect of classification noise at rate $\eta$ is the replacement of occurrences in the bound of 
the sample size $m$ by the smaller "effective" sample size $(1 - 2\eta)^2 m$. 
We delay the proof of this theorem to the full paper due to space considerations. However, 
the central idea is to appeal twice to uniform convergence arguments: once within each class 
$H_d$ to bound the generalization error of the resulting training error minimizer $\tilde{h}_d \in H_d$, and 
a second time to bound the generalization error of the $\tilde{h}_d$ minimizing the error on the test 
set of $\gamma m$ examples. 
In the bounds given by (1) and (2), the $\min\{\cdot\}$ expression is analogous to Barron and Cover's 
index of resolvability [2]; the final term in the bounds represents the error introduced by 
the testing phase of cross validation. These bounds exhibit tradeoff behavior with respect 
to the parameter $\gamma$: as we let $\gamma$ approach 0, we are devoting more of the sample to training 
the $\tilde{h}_d$, and the estimation rate bound term $\rho(d, (1-\gamma)m)$ is decreasing. However, the 
test error term $O(\sqrt{\log(d'_{\max})/(\gamma m)})$ is increasing, since we have less data to accurately 
estimate the $\epsilon_g(\tilde{h}_d)$. The reverse phenomenon occurs as we let $\gamma$ approach 1. 
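To make the tradeoff concrete, here is a small sketch (ours; the hidden constants in the 
$O(\cdot)$ notation are set to 1 for illustration) of the two $\gamma$-dependent terms of bound (2): 

    import numpy as np

    def estimation_term(d, m, gamma):
        # training-side term of (2): shrinks as gamma -> 0 (more training data)
        n = (1.0 - gamma) * m
        return np.sqrt((d / n) * np.log(n / d))

    def test_term(m, gamma):
        # test-side term of (2): shrinks as gamma -> 1 (more test data)
        return np.sqrt(np.log((1.0 - gamma) * m) / (gamma * m))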
While we believe Theorem 1 to be enlightening and potentially useful in its own right, 
we would now like to take its interpretation a step further. More precisely, suppose we 
assume that the bound is an approximation to the actual behavior of $\epsilon_{cv}(m)$. Then in 
principle we can optimize the bound to obtain the best value for $\gamma$. Of course, in addition 
to the assumptions involved (the main one being that $\rho(d, m)$ is a good approximation to 
the training-generalization error deviations of the $\tilde{h}_d$), this analysis can only be carried out 
given information that we should not expect to have in practice (at least in exact form) -- 
in particular, the approximation rate function $\epsilon_g(d)$, which depends on $f$ and $P$. However, 
we argue in the coming sections that several interesting qualitative phenomena regarding 
the choice of $\gamma$ are largely invariant to a wide range of natural behaviors for $\epsilon_g(d)$. 
6 A CASE STUDY: THE INTERVALS PROBLEM 
We begin by performing the suggested optimization of $\gamma$ for the intervals problem. Recall 
that the approximation rate here is $\epsilon_g(d) = (1/2)(1 - d/s)$ for $d < s$ and $\epsilon_g(d) = 0$ for 
$d \ge s$, where $s$ is the complexity of the target function. Here we analyze the behavior 
obtained by assuming that the estimation rate $\rho(d, m)$ actually behaves as $\rho(d, m) = 
\sqrt{d/((1 - \gamma)m)}$ (so we are omitting the log factor from the universal bound), and to simplify 
the formal analysis a bit (but without changing the qualitative behavior) we replace the 
term $\sqrt{\log((1 - \gamma)m)/(\gamma m)}$ by the weaker $\sqrt{\log(m)/m}$. Thus, if we define the function 
$F(d, m, \gamma) = \epsilon_g(d) + \sqrt{d/((1-\gamma)m)} + \sqrt{\log(m)/(\gamma m)}$, then following Equation (1), we 
are approximating $\epsilon_{cv}(m)$ by $\epsilon_{cv}(m) \approx \min_{1 \le d \le d'_{\max}}\{F(d, m, \gamma)\}$.⁴ 

⁴Although there are hidden constants in the $O(\cdot)$ notation of the bounds, it is the relative weights 
of the estimation and test error terms that is important, and choosing both constants equal to 1 is a 
reasonable choice (since both terms have the same Chernoff bound origins). 
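This approximation is easy to evaluate numerically. The sketch below (our own illustration 
for the intervals problem, not from the paper) computes $F$, minimizes it over $d$, and locates 
the best $\gamma$ by grid search, for comparison against the closed form $\gamma_{opt}$ derived in the next 
paragraph: 

    import numpy as np

    def eps_g(d, s):
        # intervals approximation rate: (1/2)(1 - d/s) for d < s, else 0
        return np.where(d < s, 0.5 * (1.0 - d / s), 0.0)

    def F(d, m, gamma, s):
        # F(d, m, gamma) = eps_g(d) + sqrt(d/((1-gamma)m)) + sqrt(log(m)/(gamma m))
        return eps_g(d, s) + np.sqrt(d / ((1.0 - gamma) * m)) \
            + np.sqrt(np.log(m) / (gamma * m))

    def predicted_cv_error(m, gamma, s):
        # approximate eps_cv(m) as the minimum of F over 1 <= d <= d'_max
        d = np.arange(1, int((1.0 - gamma) * m) + 1)
        return F(d, m, gamma, s).min()

    m, s = 10000, 100
    gammas = np.linspace(0.01, 0.99, 99)
    best_gamma = gammas[np.argmin([predicted_cv_error(m, g, s) for g in gammas])]
    gamma_opt = (np.log(m) / s) ** (1 / 3) / (1.0 + (np.log(m) / s) ** (1 / 3))
    print(f"grid-search gamma: {best_gamma:.3f}, closed-form gamma_opt: {gamma_opt:.3f}")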
The first step of the analysis is to fix a value for $\gamma$ and differentiate $F(d, m, \gamma)$ with respect 
to $d$ to discover the minimizing value of $d$; the second step is to differentiate with respect to 
$\gamma$. It can be shown (details omitted) that the optimal choice of $\gamma$ under the assumptions is 
$\gamma_{opt} = (\log(m)/s)^{1/3}/(1 + (\log(m)/s)^{1/3})$. It is important to remember at this point that 
despite the fact that we have derived a precise expression for $\gamma_{opt}$, due to the assumptions 
and approximations we have made in the various constants, any quantitative interpretation 
of this expression is meaningless. However, we can reasonably expect that this expression 
captures the qualitative way in which the optimal $\gamma$ changes as the amount of data $m$ 
changes in relation to the target function complexity $s$. On this score the situation initially 
appears rather bleak, as the function $(\log(m)/s)^{1/3}/(1 + (\log(m)/s)^{1/3})$ is quite sensitive 
to the ratio $\log(m)/s$, which is something we do not expect to have the luxury of knowing 
in practice. However, it is both fortunate and interesting that $\gamma_{opt}$ does not tell the entire 
story. In Figure 2, we plot the function $F(s, m, \gamma)$ as a function of $\gamma$ for $m = 10000$ and for 
several different values of $s$ (note that for consistency with the later experimental plots, the 
$x$ axis of the plot is actually the training fraction $1 - \gamma$). Here we can observe four important 
qualitative phenomena, which we list in order of increasing subtlety: (A) When $s$ is small 
compared to $m$, the predicted error is relatively insensitive to the choice of $\gamma$: as a function 
of $\gamma$, $F(s, m, \gamma)$ has a wide, flat bowl, indicating a wide range of $\gamma$ yielding essentially the 
same near-optimal error. (B) As $s$ becomes larger in comparison to the fixed sample size 
$m$, the relative superiority of $\gamma_{opt}$ over other values for $\gamma$ becomes more pronounced. In 
particular, large values for $\gamma$ become progressively worse as $s$ increases. For example, the 
plots indicate that for $s = 10$ (again, $m = 10000$), even though $\gamma_{opt} = 0.524\ldots$, the choice 
$\gamma = 0.75$ will result in error quite near that achieved using $\gamma_{opt}$. However, for $s = 500$, 
$\gamma = 0.75$ is predicted to yield greatly suboptimal error. Note that for very large $s$, the bound 
predicts vacuously large error for all values of $\gamma$, so that the choice of $\gamma$ again becomes 
irrelevant. (C) Because of the insensitivity to $\gamma$ for $s$ small compared to $m$, there is a fixed 
value of $\gamma$ which seems to yield reasonably good performance for a wide range of values 
for $s$. This value is essentially the value of $\gamma_{opt}$ for the case where $s$ is large but nontrivial 
generalization is still possible, since choosing the best value for $\gamma$ is more important there 
than for the small $s$ case. (D) The value of $\gamma_{opt}$ is decreasing as $s$ increases. This is slightly 
difficult to confirm from the plot, but can be seen clearly from the precise expression for 
$\gamma_{opt}$. 
In Figure 3, we plot the results of experiments in which labeled random samples of size 
$m = 5000$ were generated for a target function of $s$ equal-width intervals, for $s = 10, 100$ 
and $500$. The samples were corrupted by random label noise at rate $\eta = 0.3$. For each 
value of $\gamma$ and each value of $d$, $(1 - \gamma)m$ of the sample was given to a program performing 
training error minimization within $H_d$; the remaining $\gamma m$ examples were used to select the 
best $\tilde{h}_d$ according to cross validation. The plots show the true generalization error of the 
$\tilde{h}_d$ selected by cross validation as a function of $\gamma$ (the generalization error can be computed 
exactly for this problem). Each point in the plots represents an average over 10 trials. 
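The core of such an experiment can be sketched as follows. The data generation matches 
the description above; the dynamic program is our own reconstruction of training error 
minimization within $H_d$ for this problem (the paper does not specify an implementation), 
and the backpointers needed to recover the hypothesis itself for scoring on the test set are 
omitted for brevity: 

    import numpy as np

    def min_intervals_train_error(x, y, d):
        """Fewest disagreements achievable by a step function on [0,1] with
        at most d alternating segments (the class H_d), via dynamic
        programming over the sample sorted by x."""
        y = y[np.argsort(x)]
        INF = float("inf")
        # dp[j][b] = fewest mistakes so far using j segments, current segment labeled b
        dp = [[INF, INF] for _ in range(d + 1)]
        dp[1] = [0, 0]  # first segment open, either label, no points seen yet
        for label in y:
            new = [[INF, INF] for _ in range(d + 1)]
            for j in range(1, d + 1):
                for b in (0, 1):
                    # extend the current segment, or open a new opposite-label one
                    best = min(dp[j][b], dp[j - 1][1 - b])
                    if best < INF:
                        new[j][b] = best + (label != b)
            dp = new
        return min(min(row) for row in dp[1:]) / len(y)

    # Sample generation as described: s equal-width alternating intervals, noise eta.
    m, s, eta = 5000, 100, 0.3
    rng = np.random.default_rng(0)
    x = rng.random(m)
    y_clean = np.floor(x * s).astype(int) % 2                 # label flips every 1/s
    y = np.where(rng.random(m) < eta, 1 - y_clean, y_clean)   # flip each label w.p. eta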
While there are obvious and significant quantitative differences between these experimental 
plots and the theoretical predictions of Figure 2, the properties (A), (B) and (C) are rather 
clearly borne out by the data: (A) In Figure 3, when $s$ is small compared to $m$, there 
is a wide range of acceptable $\gamma$; it appears that any choice of $\gamma$ between 0.10 and 0.50 
yields nearly optimal generalization error. (B) By the time $s = 100$, the sensitivity to $\gamma$ is 
considerably more pronounced. For example, the choice $\gamma = 0.50$ now results in clearly 
suboptimal performance, and it is more important to have $\gamma$ close to 0.10. (C) Despite these 
complexities, there does indeed appear to be a single value of $\gamma$ -- approximately 0.10 -- 
that performs nearly optimally for the entire range of $s$ examined. 
The property (D) -- namely, that the optimal $\gamma$ decreases as the target function complexity 
is increased relative to a fixed $m$ -- is certainly not refuted by the experimental results, 
but any such effect is simply too small to be verified. It would be interesting to verify 
this prediction experimentally, perhaps on a different problem where the predicted effect is 
more pronounced. 
7 CONCLUSIONS 
For the cases where the approximation rate $\epsilon_g(d)$ obeys either power law decay or is that 
derived for the perceptron problem discussed in Section 3, the behavior of $\epsilon_{cv}(m)$ as a 
function of $\gamma$ predicted by our theory is largely the same (for example, see Figure 4). In the 
full paper, we describe some more realistic experiments in which cross validation is used 
to determine the number of backpropagation training epochs. Figures similar to Figures 2 
through 4 are obtained, again in rough accordance with the theory. 
In summary, our theory predicts that although significant quantitative differences in the 
behavior of cross validation may arise for different model selection problems, the properties 
(A), (B), (C) and (D) should be present in a wide range of problems. At the very least, 
the behavior of our bounds exhibits these properties for a wide range of problems. It 
would be interesting to try to identify natural problems for which one or more of these 
properties is strongly violated; a potential source for such problems may be those for which 
the underlying learning curve deviates from classical power law behavior [4, 3]. 
Acknowledgements: I give warm thanks to Yishay Mansour, Andrew Ng and Dana Ron 
for many enlightening conversations on cross validation and model selection. Additional 
thanks to Andrew Ng for his help in conducting the experiments. 
References 
[1] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE 
Transactions on Information Theory, 39:930-945, 1993. 
[2] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on 
Information Theory, 37:1034-1054, 1991. 
[3] D. Haussler, M. Kearns, H. S. Seung, and N. Tishby. Rigorous learning curve bounds from 
statistical mechanics. In Proceedings of the Seventh Annual ACM Conference on Computational 
Learning Theory, pages 76-87, 1994. 
[4] H. S. Seung, H. Sompolinsky, and N. Tishby. Statistical mechanics of learning from examples. 
Physical Review A, 45:6056-6091, 1992. 
[5] V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 
1982. 
[6] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of 
events to their probabilities. Theory of Probability and its Applications, 16(2):264-280, 1971. 
Figure 1: Plots of three approximation rates: for the intervals problem with target complexity 
$s = 250$ intervals (linear plot intersecting the $d$-axis at 250), for the perceptron problem with 
target complexity $s = 150$ nonzero weights (nonlinear plot intersecting the $d$-axis at 150), 
and for power law decay asymptoting at $\epsilon_{min} = 0.05$. 
Figure 2: Plot of the predicted generalization error of cross validation for the intervals model 
selection problem, as a function of the fraction $1 - \gamma$ of data used for training. (In the plot, 
the fraction of training data is 0 on the left ($\gamma = 1$) and 1 on the right ($\gamma = 0$).) The fixed 
sample size $m = 10{,}000$ was used, and the 6 plots show the error predicted by the theory 
for target function complexity values $s = 10$ (bottom plot), 50, 100, 250, 500, and 1000 
(top plot). 
Figure 3: Experimental plots of cross validation generalization error in the intervals problem 
as a function of training set size $(1 - \gamma)m$. Experiments with the three target complexity 
values $s = 10, 100$ and $500$ (bottom plot to top plot) are shown. Each point represents 
performance averaged over 10 trials. 
Figure 4: Plot of the predicted generalization error of cross validation for the power law case 
$\epsilon_g(d) = c/d$, as a function of the fraction $1 - \gamma$ of data used for training. The fixed sample 
size $m = 25{,}000$ was used, and the 6 plots show the error predicted by the theory for target 
function complexity values $c = 1$ (bottom plot), 25, 50, 75, 100, and 150 (top plot). 
