Estimating the Bayes Risk from Sample Data 
Robert R. Snapp* and Tong Xu 
Computer Science and Electrical Engineering Department 
University of Vermont 
Burlington, VT 05405 
Abstract 
A new nearest-neighbor method is described for estimating the Bayes risk 
of a multiclass pattern classification problem from sample data (e.g., a 
classified training set). Although it is assumed that the classification prob- 
lem can be accurately described by sufficiently smooth class-conditional 
distributions, neither these distributions, nor the corresponding prior prob- 
abilities of the classes are required. Thus this method can be applied to 
practical problems where the underlying probabilities are not known. This 
method is illustrated using two different pattern recognition problems. 
1 INTRODUCTION 
An important application of artificial neural networks is to obtain accurate solutions to 
pattern classification problems. In this setting, each pattern, represented as an n-dimensional 
feature vector, is associated with a discrete pattern class, or state of nature (Duda and Hart, 
1973). Using available information, (e.g., a statistically representative set of labeled feature 
vectors {(xi, i)}, where xi E [' denotes a feature vector and ti 
its correct pattern class), one desires a function (e.g., a neural network classifier) that assigns 
new feature vectors to pattern classes with the smallest possible misclassification cost. 
If the classification problem is stationary, such that the patterns from each class are generated 
according to known probability distributions, then it is possible to construct an optimal 
classifier that assigns each pattern to a class with minimal expected risk. Although our 
method can be generalized to problems in which different types of classification errors 
incur different costs, we shall simplify our discussion by assuming that all errors are equal. 
In this case, a Bayes classifier assigns each feature vector to a class with maximum posterior 
probability. The expected risk of this classifier, or Bayes risk then reduces to the probability 
of error 
t:lB = fs [1- sup P(glx)] f(x)dx , (1) 
* E-mail:snappemba. uvm. edu 
Estimating the Bayes Risk from Sample Data 233 
(Duda and Hart, 1973). Here, P(elx) denotes the posterior probability of class t conditioned 
on observing the feature vector x, f(x) denotes the unconditional mixture density of the 
feature vector x, and $ C R n denotes the probability-one support of f. 
Knowing how to estimate the value of the Bayes risk of a given classification problem with 
a specific input representation, may facilitate the design of more accurate classifiers. For 
example, since the value of RB depends upon the set of features chosen to represent each 
pattern (e.g., the significance of the input units in a neural network classifier), one might 
compare estimates of the Bayes risk for a number of different feature sets, and then select 
the representation that yields the smallest value. Unfortunately, it is necessary to know the 
explicit probability distributions to evaluate (1). Thus with the possible exception of trivial 
examples, the Bayes risk cannot be determined exactly for practical classification problems. 
Lacking the means to evaluate the B ayes risk exactly, motivates the development of statistical 
estimators of RB. In this paper, we use a recent asymptotic analysis of the finite-sample 
risk of the k-nearest-neighbor classifier to obtain a new procedure for estimating the Bayes 
risk from sample data. Section 2 describes the k-nearest-neighbor algorithm, and briefly 
describes how estimates of its finite-sample risk have been used to estimate RB. Section 3 
describes how a recent asymptotic analysis of the finite-sample risk can be applied to obtain 
new statistical estimators of the Bayes risk. In Section 4 the k-nearest-neighbor algorithm 
is used to estimate the Bayes risk of two example problems. Section 5 contains some 
concluding remarks. 
2 THE k-NEAREST-NEIGHBOR CLASSIFIER 
Due to its analytic tractability, and its nearly optimal performance in the large sample limit, 
the k-nearest-neighbor classifier has served as a useful framework for estimating the Bayes 
risk from classified samples. Recall, that the k-nearest-neighbor algorithm (Fix and Hodges, 
1951) classifies an n-dimensional feature vector x by consulting a reference sample of re 
correctly classified feature vectors A', = {(x,g): i = 1,...re}. First, the algorithm 
identifies the k nearest neighbors of x, i.e., the k feature vectors within A', that lie closest 
to x with respect to a given metric. Then, the classifier assigns x to the most frequent class 
label represented by the k nearest neighbors. (A variety of procedures can be used to resolve 
ties.) In the following, (7 denotes the number of pattern classes. 
The finite-sample risk of this algorithm, Rm, equals the probability that the k-nearest- 
neighbor classifier assigns x to an incorrect class, averaged over all input vectors x, and 
all m-samples, A',. The following properties have been shown to be true under weak 
assumptions: 
Property 1 
with 
(Cover and Hart, 1967): For fixed k, 
R,  Rod(k), as 
RB _< Roo(1) _< R (2 
C Rs) 
C-1 ' 
(2) 
Property 2 (Devroye, 1981): If k >_ 5, and C = 2, then there exist universal constants 
a = 0.3399..., and/3 = 0.9749... such that Rod(k) is bounded by 
R < Roo(k) < (l + ak)R, where ak - k _ 3.25 ' 
More generally, if C = 2, then 
234 R.R. SNAPP, T. XU 
By the latter property, this algorithm is said to be Bayes consistent in that for any e > 0, it 
is possible to construct a k-nearest-neighbor classifier such that IR, - Rsl < e if m and 
k are sufficiently large. Bayes consistency is also evident in other nonparametric pattern 
classifiers. 
Several methods for estimating RB from sample data have previously been proposed, e.g., 
(Devijver, 1985), (Fukunaga, 1985), (Fukunaga and Hummels, 1987), (Garnett and Yau, 
1977), and (Loizou and Maybank, 1987). Typically, these methods involve constructing 
sequences of k-nearest neighbor classifiers, with increasing values of k and m. The mis- 
classification rates are estimated using an independent test sample, from which upper and 
lower bounds to Rs are obtained. Because these experiments are necessarily performed 
with finite reference samples, these bounds are often imprecise. This is especially true for 
problems in which Rm converges to Ro(k) at a slow rate. In order to remedy this deficiency, 
it is necessary to understand the manner in which the limit in Property 1 is achieved. In the 
next section we describe how this information can be used to construct new estimators for 
the Bayes risk of sufficiently smooth classification problems. 
3 NEW ESTIMATORS OF THE BAYES RISK 
For a subset of multiclass classification problems that can be described by probability 
densities with uniformly bounded partial derivatives up through order N + 1 (with N > 2), 
the finite-sample risk of a k-nearest-neighbor classifier that uses a weighted L v metric can 
be represented by the truncated asymptotic expansion 
N 
Rm = Ro(k)+ E cjm-J/n + 0 (m-(N+l)/n) , (4) 
j=2 
(Psaltis, Snapp, and Venkatesh, 1994), and (Snapp and Venkatesh, 1995). In the above, 
n equals the dimensionality of the feature vectors, and Rod(k), c2,..., cv, are the ex- 
pansion coefficients that depend upon the probability distributions that define the pattern 
classification problem. 
This asymptotic expansion provides a parametric description of how the finite-sample risk 
Rm converges to its infinite sample limit Ro(k). Using a large sample of classified data, 
one can obtain statistical estimates of the finite-sample risk/m for different values of 
m. Specifically, let {rn} denote a sequence of M different sample sizes, and select fixed 
values for k and N. For each value of rn, construct an ensemble of k-nearest-neighbor 
classifiers, i.e., for each classifier construct a random reference sample 3Jm by selecting 
rn patterns with replacement from the original large sample. Estimate the empirical risk 
of each classifier in the ensemble with an independently drawn set of "test" vectors. Let 
tm, denote the average empirical risk of the i-th ensemble. Then, using the resulting set 
of data points {(m, ,)}, find the values of the coefficients Ro(k), and c2 through cv, 
that minimizes the sum of the squares: 
E lm, - Rm(k)- E cjrn-J/n (5) 
i=l 
Several inequalities can then be used obtain approximations of Rs from the estimated value 
of Ro(k). For example, if k = 1, then Cover and Hart's inequality in Property I implies 
that 
Rod(l) 
_< Rs _< Rod(l). 
2 
To enable an estimate of Rs with precision e, choose k > 2/e , and estimate Rod(k) by the 
above method. Then Devroye's inequality (3) implies 
Ro(k)- < Rod(k)(1 -) < Rs < Rod(k). 
Estimating the Bayes Risk from Sample Data 235 
4 EXPERIMENTAL RESULTS 
The above procedure for estimating ];lB was applied to two pattern recognition problems. 
First consider the synthetic, two-class problem with prior probabilities P = P2 = 1/2, and 
normally distributed, class-conditional densities 
1  ((x +(-1]]+? "* x?) 
fdx) = (2r),/2 e - ' ' ' '=  , 
for f = 1 and 2. Pseudorandom labeled feature vectors (x, f) were numerically generated in 
accordance with the above for dimensions r = 1 and r, = 5. Twelve sample sizes between 
10 and 3000 were examined. For each dimension and sample size the risks R, of many 
independent k-nearest-neighbor classifiers with k = 1, 7, and 63 were empirically estimated. 
(Because the asymptotic expansion (4) does not accurately describe the very small sample 
behavior of the k-nearest-neighbor classifier, sample sizes smaller than 2k were not included 
in the fit.) 
Estimates of the coefficients in (5) for six different fits appear in the first equation of each cell 
in the third and fourth columns of Table 1. For reference, the second column contains the 
values of ROd(k) that were obtained by numerically evaluating an exact integral expression 
(Cover and Hart, 1967). Estimates of the Bayes risk appear in the second equation of each 
cell in the third and fourth columns. Cover and Hart's inequality (2) was used for the 
experiments that assumed k = 1, and Devroye's inequality (3) was used if k > 7. For this 
problem, formula (1) evaluates to RB = (1/2) erfc( 1 / x/) = 0.15865. 
Table 1: Estimates of the model coefficients and Bayes error for a classification problem 
with two normal classes. 
k Rod(k) n=l (N=2) n=5 (N=6) 
1 0.2248 
0.6536 0.1121 0.2001 
R, = 0.2287 + ra  R, = 0.2287 + ra/s + -- 
RB =0.172 4- 0.057 RB =0.172 4- 0.057 
0.0222 
rn,4/5 rn?/5 
7 0.1746 
4.842 0.2218 1.005 3.782 
R,=0.1744+ ra----  R,=0.1700+ rrt2/5 m4/-----  + rrt6/--g- 
R =0.152 + 0.023 R =0.148 + 0.022 
63 0.1606 
20.23 0.1002 1.426 10.96 
R,,=0.1606+ m- T- R,=0.1595+ rrt2/5 m4/-  + rrt6/-----g- 
R =0.157 + 0.004 RB =0.156 + 0.004 
The second pattern recognition problem uses natural data; thus the underlying probability 
distributions are not known. A pool of 2:2 classified multispectral pixels were was extracted 
from a seven band satellite image. Each pixel was represented by five spectral components, 
x = (z,... ,zs), each in the range 0 < z,, _< 255. (Thus, n = 5.) The class label of 
each pixel was determined by one of the remaining spectral components, 0 < gt < 255. 
Two pattern classes were then defined: w = {gt < 0}, and w: = {gt > 0}, where 0 was a 
predetermined threshold. (This particular problem was chosen to test the feasibility of this 
method. In future work, we will examine more interesting pixel classification problems.) 
236 R.R. SNAPP, T. XU 
Table 2: Coefficients that minimize the squared error fit for different N. Note that cs = 0 
and cs = 0 in (2) if n >_ 4 (Psaltis, Snapp, and Venkatesh, 1994). 
N Roo(1) c c4 c 
2 0.0757133 0.126214 
4 0.0757846 0.124007 0.0132804 
6 0.0766477 0.0785847 0.689242 
-2.68818 
With k = 1, a large number of Bernoulli trials (e.g., 200 1000) were performed for each 
value of rat. Each trial began by constructing a reference sample of rat classified pixels 
chosen at random from the pool. The risk of each reference sample was then estimated by 
classifying t pixels with the nearest-neighbor algorithm under a Euclidean metric. Here, 
the t pixels, with 2000 < t < 20000, were chosen independently, with replacement, from 
the pool. The risk/,m was then estimated as the average risk of each reference sample 
of size mi. (The number of experiments performed for each value of rat, and the values 
oft, were chosen to ensure that the variance of/ was sufficiently small, less than 10 -4 
in this case.) This process was repeated for M = 33 different values of ra in the range 
100 _< rn _< 15000. Results of these experiments are displayed in Table 2 and Figure 1 
for three different values of N. Note that the robustness of the fit begins to dissolve, for this 
data, at N = 6, either the result of overfitting, or insufficient smoothness in the underlying 
probability distributions. However, the estimate for Rod(l) appears to be stable. For this 
classification problem, we thus obtain Rs = 0.0568 + 0.0190. 
5 CONCLUSION 
The described method for estimating the Bayes risk is based on a recent asymptotic analysis 
of the finite-sample risk of the k-nearest-neighbor classifier (Snapp and Venkatesh, 1995). 
Representing the finite-sample risk as a truncated asymptotic series enables an efficient 
estimation of the infinite-sample risk Rod(k) from the classifier's finite-sample behavior. 
The Bayes risk can then be estimated by the Bayes consistency of the k-nearest-neighbor 
algorithm. Because such finite-sample analyses are difficult, and consequently rare, this 
new method has the potential to evolve into a useful algorithm for estimating the Bayes risk. 
Further improvements in efficiency may be obtained by incorporating principles of optimal 
experimental design, cf., (Elfving, 1952) and (Federov, 1972). 
It is important to emphasize, however, that the validity of (4) rests on several rather strong 
smoothness assumptions, including a high-degree ofdifferentiability of the class-conditional 
probability densities. For problems that do not satisfy these conditions, other finite-sample 
descriptions need to be constructed before this method can be applied. Nevertheless, there 
is much evidence that nature favors smoothness. Thus, these restrictive assumptions may 
still be applicable to many important problems. 
Acknowledgments 
The work reported here was supported in part by the National Science Foundation under 
Grant No. NSF OSR-9350540 and by Rome Laboratory, Air Force Material Command, 
USAF, under grant number F30602-94-1-0010. 
Estimating the Bayes Risk from Sample Data 237 
-1.8 
-2.0 
-2.4 
100 1000 10000 
m 
Figure 1: The best fourth-order (N = 4) fit of Eqn. (5) to 33 empirical estimates of ft,,, 
for a pixel classification problem obtained from a multispectral Landsat image. Using 
Roo = 0.0758, the fourth-order fit, R, = 0.0758 + 0.124ra -2/s + 0.0133m -4/s, is plotted 
on a log-log scale to reveal the significance of the j = 2 term. 
References 
T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inform. 
Theory, vol. IT-13, 1967, pp. 21-27. 
P. A. Devijver, "A multiclass, k - NN approach to Bayes risk estimation," Pattern Recog- 
nition Letters, vol. 3, 1985, pp. 1-6. 
L. Dewoye, "On the asymptotic probability of error in nonparametric discrimination," An- 
nals of Statistics, vol. 9, 1981, pp. 1320-1327. 
R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York, New York: 
John Wiley & Sons, 1973. 
G. Elfving, "Optimum allocation in linear regression theory," Ann. Math. Statist., vol. 23, 
1952, pp. 255-262. 
V. V. Federov, Theory of Optimal Exper#nents, New York, New York: Academic Press, 
1972. 
E. Fix and J. L. Hodges, "Discriminatory Analysis: Nonparametric Discrimination: Con- 
sistency Properties," from Project 21-49-004, Report Number 4, UASF School of Aviation 
Medicine, Randoll Field, Texas, 1951, pp. 261-279. 
238 R.R. SNAPP, T. XU 
K. Fukunaga, "The estimation of the Bayes error by the k-nearest neighbor approach," in L. 
N. Kanal and A. Rosenreid (ed.), Progress in Pattern Recognition, vol. 2, Elesvier Science 
Publishers B.V. (North Holland), 1985, pp. 169-187. 
K. Fukunaga and D. Hummels, "Bayes error estimation using Parzen and k-NN procedures," 
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 9, 1987, pp. 634-643. 
J. M. Garnett, III and S.S. Yau, "Nonparametric estimation of the Bayes error of feature 
extractors using ordered nearest neighbor sets," IEEE Transactions on Computers, vol. 26, 
1977, pp. 46-54. 
G. Loizou and S. J. Maybank, "The nearest neighbor and the Bayes error rate;' IEEE 
Transactions on Pattern Analysis and Machine Intelligence, vol. 9, 1987, pp. 254-262. 
D. Psaltis, R. R. Snapp, and S.S. Venkatesh, "On the finite sample performance of the 
nearest neighbor classifier," IEEE Trans. Inform. Theory, vol. IT-40, 1994, pp. 820 837. 
R. R. Snapp and S.S. Venkatesh, "k Nearest Neighbors in Search of a Metric," 1995, 
(submitted). 
