Asymptotic slowing down of the 
nearest-neighbor classifier 
Robert R. Snapp 
CS/EE Department 
University of Vermont 
Burlington, VT 05405 
Demetri Psaltis 
Electrical Engineering 
Caltech 116-81 
Pasadena, CA 91125 
Santosh S. Venkatesh 
Electrical Engineering 
University of Pennsylvania 
Philadelphia, PA 19104 
Abstract 
If patterns are drawn from an n-dimensional feature space according to a 
probability distribution that obeys a weak smoothness criterion, we show 
that the probability that a random input pattern is misclassified by a 
nearest-neighbor classifier using M random reference patterns asymptoti- 
cally satisfies 
a 
PM(error) ~ Po(error) + 
for sufficiently large values of M. Here, Poo(error) denotes the probability 
of error in the infinite sample limit, and is at most twice the error of a 
Bayes classifier. Although the value of the coefficient a depends upon the 
underlying probability distributions, the exponent of M is largely distri- 
bution free. We thus obtain a concise relation between a classifier's ability 
to generalize from a finite reference sample and the dimensionality of the 
feature space, as well as an analytic validation of Bellman's well known 
"curse of dimensionality." 
I INTRODUCTION 
One of the primary tasks assigned to neural networks is pattern classification. Com- 
mon applications include recognition problems dealing with speech, handwritten 
characters, DNA sequences, military targets, and (in this conference) sexual iden- 
tity. Two fundamental concepts associated with pattern classification are general- 
ization (how well does a classifier respond to input data it has never encountered 
before?) and scalability (how are a classifier 's processing and training requirements 
affected by increasing the number of features that describe the input patterns?). 
932 
Asymptotic Slowing Down of the Nearest-Neighbor Classifier 933 
Despite recent progress, our present understanding of these concepts in the con- 
text of neural networks is obstructed by complexities in the functional form of the 
network and in the classification problems themselves. 
In this correspondence we will present analytic results on these issues for the nearest- 
neighbor classifier. Noted for its algorithmic simplicity and nearly optimal perfor- 
mance in the infinite sample limit, this pattern classifier plays a central role in the 
field of pattern recognition. Furthermore, because it uses proximity in feature space 
as a measure of class similarity, its performance on a given classification problem 
should yield qualitative cues to the performance of a neural network. Indeed, a 
nearest-neighbor classifier can be readily implemented as a "winner-take-all" neural 
network. 
2 THE TASK OF PATTERN CLASSIFICATION 
We begin with a formulation of the two-class problem (Duda and Hart, 1973): 
Let the labels wt and w2 denote two states of nature, or pattern classes. 
A pattern belonging to one of these two classes is selected, and a vector of 
n features, x, that describe the selected pattern is presented to a pattern 
classifier. The classifier then attempts to guess the selected pattern's class 
by assigning x to either w or 
As an example, the two class labels might represent the states benign and malignant 
as they pertain to the diagnosis of cancer tumors; the feature vector could then be 
a 1024 x 1024 pixel, real-valued representation of an electron-microscope image. A 
pattern classifier can thus be viewed as a mapping from an n-dimensional feature 
space to the discrete set {wt,w2), and can be specified by demarcating the regions 
in the n-dimensional feature space that correspond to wt and w2. We define the 
decision region 7 as the set of feature vectors that the pattern classifier assigns to 
wt, with an analogous definition for 72. A useful figure of merit is the probability 
that the feature vector of a randomly selected pattern is assigned to the correct 
class. 
2.1 THE BAYES CLASSIFIER 
If sufficient information is available, it is possible to construct an optimal pattern 
classifier. Let P(wt) and P(w2) denote the prior probabilities of the two states of 
nature. (For our cancer diagnosis problem, the prior probabilities can be estimated 
by the relative frequency of each type of tumor in a large statistical sample.) Fur- 
ther, let p(x I wt ) and p(x I w2) denote the class-conditional probability densities of 
the feature vector for the two class problem. The total probability density is now 
defined by p(x) = p(x I wt)P(wl) + p(x I w2)P(w2), and gives the unconditional 
distribution of the feature vector. Where p(x)  0 we can now use Bayes' rule to 
compute the osterior probabilities: 
P(l I x) -- p(x I and P(w2 
p(x) P(x) 
The Bayes classifier assigns an unclassified feature vector x to the class label having 
934 Snapp, Psaltis, and Venkatesh 
the greatest posterior probability. (If the posterior probabilities happen to be equal, 
then the class assignment is arbitrary.) With 7gx and 7E2 denoting the two decision 
regions induced by this strategy, the probability of error of the Bayes classifier, PB, 
is just the probability that x is drawn from class wl but lies in the Bayes decision 
region 7E2, or conversely, that x is drawn from class w2 but lies in the Bayes decision 
region 7gx: 
The reader may verify that the Bayes classifier minimizes the probability of error. 
Unfortunately, it is usually impossible to obtain expressions for the class-conditional 
densities and prior probabilities in practice. Typically, the available information 
resides in a set of correctly labeled patterns, which we collectively term a training 
or reference sarapk. Over the last few decades, numerous pattern classification 
strategies have been developed that attempt to learn the structure of a classification 
problem from a finite training sample. (The backpropagation algorithm is a recent 
example.) The underlying hope is that the c lassifier 's performance can be made 
acceptable with a sufficiently large reference sample. In order to understand how 
large a sample may be needed, we turn to what is perhaps the simplest learning 
algorithm of this class. 
3 THE NEAREST-NEIGHBOR CLASSIFIER 
Let rim = {(x(), 00)), (x(2),0(2)),..., (x(M),o(M))} denote a finite reference sam- 
ple of M feature vectors, x (i)  R", with corresponding known class assignments, 
0 (i)  {wx,w2}. The nearest-neighbor 
rule assigns each feature vector x to 
class cox or w2 as a function of the ref- 
erence M-sample as follows: 
Identify (x', 0')  ,YM such that 
IIx-x'll _< IIx-x(i)ll for i ranging 
from i through M; 
Assign x to class 0 . 
Here, IIx-yll = X/Ejl(xj- yj)2 de- 
notes the Euclidean metric in R ". The 
nearest-neighbor rule hence classifies 
each feature vector x according to the 
label, 0 , of the closest point, x , in 
the reference sample. As an example, 
we sketch the nearest-neighbor deci- 
sion regions for a two-dimensional clas- 
siftcation problem in Fig. 1. 
g [/:[:i:i:i:[:i:i:[:i:i:i:i:E::E::::::::::::::::::::::::::::::::::::::: 
_..::i:i:i:i:i:i:i:i:i:i::i:i:i?!: :!::::::: :.1 
================================================================= 
=================================================== :i:i: ::!l 
=========================================================== 
:!:!:!:!:!:!:!:i:i:i:i:i:i:i:::i:i::i:!:i:!:11 
Figure 1: The decision regions induced 
by a nearest-neighbor classifier with a 
seven-element reference set in the plane. 
Other metrics, such as the more general Minkowski-r metric, are also possible. 
Asymptotic Slowing Down of the Nearest-Neighbor Classifier 935 
It is interesting to consider how the performance of this classifier compares with that 
of a Bayes classifier. To facilitate this analysis, we assume that the reference patterns 
are selected from the total probability density p(x) in a statistically independent 
manner (i.e., the choice of A/1 does not in any way bias the selection of x (1+) and 
0(1+)). Furthermore, let PM(error) denote the probability of error of a nearest- 
neighbor classifier working with the reference sample A:'M, and let Peo(error) denote 
this probability in the infinite sample limit (M -- c). We will also let $ denote 
the volume in feature space over which p(x) is nonzero. The following well known 
theorem shows that the nearest-neighbor classifier, with an infinite reference sample, 
is nearly optimal (Cover and Hart, 1967). 2 
Theorem I For the two-class problem in the infinite sample limit, the probability 
of error of a nearest-neighbor classifier tends toward the value, 
P(error) = 2 P(w I x)P(w2 I x)p(x) dm, 
which is furthermore bounded by the two inequalities, 
PB _ Pc(error) _ 2PB(1- PB), 
where PB is the probability of error of a Bayes classifier. 
This encouraging result is not so surprising if one considers that, with probability 
one, about every feature vector x is centered a ball of radius e that contains an 
infinite number of reference feature vectors for every e > 0. The annoying factor of 
two accounts for the event that the nearest neighbor to x belongs to the class with 
smaller posterior probability. 
3.1 THE ASYMPTOTIC CONVERGENCE RATE 
In order to satisfactorily address the issues of generalization and scalability for the 
nearest-neighbor classifier, we need to consider the rate at which the performance of 
the classifier approaches its infinite sample limit. The following theorem applicable 
to nearest-neighbor classification in one-dimensional feature spaces was shown by 
Cover (1968). 
Theorem 2 Let p(x I wl) and p(x [ w2) have uniformly bounded third derivatives 
and let p(x) be bounded away from zero on $. Then for sufficiently large M, 
(1) 
PM(error) = Poo(error) + O - . 
Note that this result is also very encouraging in that an order of magnitude increase 
in the sample size, decreases the error rate by two orders of magnitude. 
The following theorem is our main result which extends Cover's theorem to n- 
dimensional feature spaces: 
2 Originally, this theorem was stated for multiclass decision problems; it is here presented 
for the two class problem only for simplicity. 
936 Snapp, Psaltis, and Venkatesh 
Theorem $ Let p(x I w,), p(x I w2), and p(x) satisfy the same conditions as in 
Theorem 2. Then, there exists a scalar a (depending on n) such that 
PM(error)  Pc(error) + -- 
M2/n ' 
where the right-hand side describes the first two terms of an asymptotic expansion 
in reciprocal powers of M 2/n. Explicitly, 
a  
r (1+ + 1)) 
nw 
where, 
I X)p(2 Ix) 
o2P( Ix) 
Ix). 
For n = 1 this result agrees with Cover's theorem. With increasing n, however, 
the convergence rate significantly slows down. Note that the constant a depends on 
the way in which the class-conditional densities overlap. If a is bounded away from 
zero, then for sufficiently small 5 > 0, PM(error) -- Pc(error) < 5 is satisfied only 
if M > (a/5) n/ so that the sample size required to achieve a given performance 
criterion is exponential in the dimensionality of the feature space. The above pro- 
vides a sufficient condition for Bellman's well known "curse of dimensionality" in 
this context. 
It is also interesting to note that one can easily construct classification problems for 
which a vanishes. (Consider, for example, p(x [wx) = p(x I w) for all x.) In these 
cases the higher-order terms in the asymptotic expansion are important. 
4 A NUMERICAL EXPERIMENT 
A conspicuous weakness in the above theorem is the requirement that p(x) be 
bounded away from zero over $. In exchange for a uniformly convergent asymptotic 
expansion, we have omitted many important probability distributions, including 
normal distributions. Therefore we numerically estimate the asymptotic behavior 
of PM(error) for a problem consisting of two normally distributed classes in Rn: 
p(x I w) = (27ro.2)n/2 exp [--2---,- ((Xl- kt)2+ Y'4=2 x])], 
i  ((xt+y)2+r,=2x])] ' 
p(x [w2) = (2a2)/2 exp [- 
Assuming that P(wx)= P(w2= 1/2, we find 
Pc(error) = 1 _/2, fo m 
av/ e 
e-Z/' sech (-)dx. 
Asymptotic Slowing Down of the Nearest-Neighbor Classifier 937 
- 1.0 
-2.0 
- 3.0 
0 n=l 
+ n=2 
 x n=3 
12 n=4 
 n=5 
ooO o + 
o o 
o 
o 
o 
o 
0.0 0.5 1.0 1.5 2.0 2.5 
1og0 (M) 
Figure 2' Numerical validation of the nearest-neighbor scaling hypothesis for two 
normally distributed classes in R a. 
For/ = a = 1, Poo(error) is numerically found to be 0.22480, which is consistent 
with the Bayes probability of error, PB = (1/2)erfc(1/V) = 0.15865. (Note that 
the expression for a given in Theorem 3 is undefined for these distributions.) For 
n ranging from 1 to 5, and M ranging from 1 to 200, three estimates of PM(error) 
were obtained, each as the fraction of "failures" in 160,000 or more Bernoulli trials. 
Each trial consists of constructing a pseudo-random sample of M reference patterns, 
followed by a single attempt to correctly classify a random input pattern. These 
estimates of PM are represented in Figure 2 by circular markers for n = 1, crosses 
for n = 2, etc. The lines in Figure 2 depict the power law 
PM(error) = Poo(error) + bM -2/n, 
where, for each n, b is chosen to obtain an appealing fit. The agreement between 
these lines and data points suggests that the asymptotic scaling hypothesis of The- 
orem 3 can be extended to a wider class of distributions. 
938 Snapp, Psaltis, and Venkatesh 
5 DISCUSSION 
The preceding analysis indicates that the convergence rate of the nearest-neighbor 
classifier slows down dramatically as the dimensionality of the feature space in- 
creases. This rate reduction suggests that proximity in feature space is a less effec- 
tive measure of class identity in higher dimensional feature spaces. It is also clear 
that some degree of smoothness in the class-conditional densities is necessary, as 
well as sufficient, for the asymptotic behavior described by our analysis to occur: 
in the absence of smoothness conditions, one can construct classification problems 
for which the nearest-neighbor convergence rate is arbitrarily slow, even in one di- 
mension (Cover, 1968). Fortunately, the most pressing classification problems are 
typically smooth in that they are constrained by regularities implicit in the laws of 
nature (Marr, 1982). With additional prior information, the convergence rate may 
be enhanced by selecting a fewer number of descriptive features. 
Because of their smooth input-output response, neural networks appear to use prox- 
imity in feature space as a basis for classification. One might, therefore, expect the 
required sample size to scale exponentially with the dimensionality of the feature 
space. Recent results from computational learning theory, however, imply that with 
a sample size proportional to the capacity--a combinatorial quantity which is char- 
acteristic of the network architecture and which typically grows polynomially in the 
dimensionality of the feature space one can in principle identify network param- 
eters (weights) which give (close to) the smallest classification error for the given 
architecture (Baum and Haussler, 1989). There are two caveats, however. First, 
the information-theoretic sample complexities predicted by learning theory give no 
clue as to whether, given a sample of the requisite size, there exist any algorithms 
that can specify the appropriate parameters in a reasonable time frame. Second, 
and more fundamental, one cannot in general determine whether a particular ar- 
chitecture is intrinsically well suited to a given classification problem. The best 
performance achievable may be substantially poorer than that of a Bayes classifier. 
Thus, without sufficient prior information, one must search through the space of 
all possible network architectures for one that does fit the problem well. This situ- 
ation now effectively resembles a non-parametric classifier and the analytic results 
for the sample complexities of the nearest-neighbor classifier should provide at least 
qualitative indications of the corresponding case for neural networks. 
References 
Baum, E. B. and Haussler, D. (1989), "What size net gives valid generalization," 
Neural Computation, 1, pp. 151-160. 
Cover, T. M. (1968), "Rates of convergence of nearest neighbor decision proce- 
dures," Proc. First Annual Hawaii Conference on Systems Theory, pp. 413-415. 
Cover, T. M. and P. E. Hart (1967), "Nearest neighbor pattern classification," IEEE 
Trans. Info. Theory, vol. IT-13, pp. 21-27. 
Duda, R. O. and P. E. Hart (1973), Pattern Classification and Scene Analysis. New 
York: John Wiley & Sons. 
Marr, D. (1982), Vision, San Francisco: W. H. Freeman. 
