Shooting Craps in Search of an Optimal Strategy for 
Training Connectionist Pattern Classifiers 
J. B. Hampshire II and B.V.K. Vijaya Kumar 
Department of Electrical & Computer Engineering 
Carnegie Mellon University 
Pittsburgh, PA 15213-3890
hamps@speechl.cs.cmu.edu and kumar@gauss.ece.cmu.edu
Abstract 
We compare two strategies for training connectionist (as well as non- 
connectionist) models for statistical pattern recognition. The probabilistic strat- 
egy is based on the notion that Bayesian discrimination (i.e., optimal classifica- 
tion) is achieved when the classifier learns the a posteriori class distributions of 
the random feature vector. The differential strategy is based on the notion that 
the identity of the largest class a posteriori probability of the feature vector is 
all that is needed to achieve Bayesian discrimination. Each strategy is directly 
linked to a family of objective functions that can be used in the supervised training 
procedure. We prove that the probabilistic strategy -- linked with the error measure
objective functions (such as mean-squared-error and cross-entropy) typically used to
train classifiers -- necessarily requires larger training sets and more complex
classifier architectures than those needed to approximate the Bayesian discriminant
function. In contrast, we prove that the differential strategy -- linked with
classification figure-of-merit objective functions (CFM_mono) [3] -- requires
the minimum classifier functional complexity and the fewest training examples
necessary to approximate the Bayesian discriminant function with specified precision
(measured in probability of error). We present our proofs in the context of
a game of chance in which an unfair C-sided die is tossed repeatedly. We show 
that this rigged game of dice is a paradigm at the root of all statistical pattern 
recognition tasks, and demonstrate how a simple extension of the concept leads 
us to a general information-theoretic model of sample complexity for statistical 
pattern recognition. 
1125 
1126 Hampshire and Kumar 
1 Introduction 
Research on creating a connectionist pattern classifier that generalizes well to novel test
data has recently focused on the process of finding the network architecture with the minimum
functional complexity necessary to model the training data accurately (see, for example, the
works of Baum, Cover, Haussler, and Vapnik). Meanwhile, relatively little attention has been
paid to the effect on generalization of the objective function used to train the classifier.
In fact, the choice of objective function used to train the classifier is tantamount to a
choice of training strategy, as described in the abstract [2, 3].
We formulate the proofs outlined in the abstract in the context of a rigged game of dice in 
which an unfair C-sided die is tossed repeatedly. Each face of the die has some probability of 
turning up. We assume that one face is always more likely than all the others. As a result, all 
the probabilities may be different, but at most C - 1 of them can be identical. The objective 
of the game is to identify the most likely die face with specified high confidence. The 
relationship between this rigged dice paradigm and statistical pattern recognition becomes 
clear if one realizes that a single unfair die is analogous to a specific point on the domain 
of the random feature vector being classified. Just as there are specific class probabilities 
associated with each point in feature vector space, each die has specific probabilities 
associated with each of its faces. The number of faces on the die equals the number of 
classes associated with the analogous point in feature vector space. Identifying the most 
likely die face is equivalent to identifying the maximum class a posteriori probability for 
the analogous point in feature vector space -- the requirement for Bayesian discrimination. 
We formulate our proofs for the case of a single die, and conclude by showing how a simple 
extension of the mathematics leads to general expressions for pattern recognition involving 
both discrete and continuous random feature vectors. 
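As a minimal sketch of the rigged game (an illustration, not the authors' code), the following Python simulates repeated tosses of an unfair die and reports the face that turned up most often. The face probabilities, the seed, and the names `toss_die` and `most_frequent_face` are assumptions made for this example only.

```python
import random
from collections import Counter


def toss_die(face_probs, n_tosses, rng):
    """Toss an unfair die n_tosses times; return how often each face turned up."""
    faces = range(len(face_probs))
    outcomes = rng.choices(faces, weights=face_probs, k=n_tosses)
    counts = Counter(outcomes)
    return [counts.get(face, 0) for face in faces]


def most_frequent_face(counts):
    """Index of the face seen most often (ties go to the lowest index)."""
    return max(range(len(counts)), key=lambda face: counts[face])


if __name__ == "__main__":
    rng = random.Random(0)          # fixed seed so the run is repeatable
    probs = [0.5, 0.3, 0.2]         # hypothetical 3-sided die
    counts = toss_die(probs, 1000, rng)
    print("counts:", counts, "empirical winner:", most_frequent_face(counts))
```

With few tosses the empirical winner can differ from the truly most likely face; the sample size needed to make that unlikely is the question the proofs address.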
Authors' Note: In the interest of brevity, our proofs are posed as answers to questions that 
pertain to the rigged game of dice. It is hoped that the reader will find the relevance of 
each question/answer to statistical pattern recognition clear. Owing to page limitations, we 
cannot provide our proofs in full detail; the reader seeking such detail should refer to [1]. 
Definitions of symbols used in the following proofs are given in table 1. 
Table 1: Definitions of symbols used to describe die faces, probabilities, probabilistic
differences, and associated estimates.

Symbol                    Definition
$\omega_{rj}$             The true $j$th most likely die face ($\hat{\omega}_{rj}$ is the estimated $j$th most likely face).
$P(\omega_{rj})$          The probability of the true $j$th most likely die face.
$k_{rj}$                  The number of occurrences of the true $j$th most likely die face.
$\hat{P}(\omega_{rj})$    An empirical estimate of the probability of the true $j$th most likely die face:
                          $\hat{P}(\omega_{rj}) = k_{rj}/n$ (note $n$ denotes the sample size).
$\Delta_{ri}$             The probabilistic difference involving the true rankings and probabilities of the
                          $C$ die faces: $\Delta_{ri} = P(\omega_{ri}) - \sup_{j \neq i} P(\omega_{rj})$.
$\hat{\Delta}_{ri}$       The probabilistic difference involving the true rankings but empirically estimated
                          probabilities of the $C$ die faces:
                          $\hat{\Delta}_{ri} = \hat{P}(\omega_{ri}) - \sup_{j \neq i} \hat{P}(\omega_{rj})$.

1.1 A Fixed-Point Representation
The $M_q$-bit approximation $q_M[x]$ to the real number $x \in (-1, 1]$ is of the form
$$\begin{aligned}
\text{MSB (most significant bit)} &= \mathrm{sign}[x] \\
\text{MSB} - 1 &= 2^{-1} \\
&\;\;\vdots \\
\text{LSB (least significant bit)} &= 2^{-(M_q - 1)}
\end{aligned} \tag{1}$$
with the specific value defined as the mid-point of the $2^{-(M_q - 1)}$-wide interval in which $x$
is located:
$$q_M[x] \triangleq \begin{cases}
\mathrm{sign}[x] \cdot \left( \lfloor |x| \cdot 2^{(M_q - 1)} \rfloor \cdot 2^{-(M_q - 1)} + 2^{-M_q} \right), & |x| < 1 \\
\mathrm{sign}[x] \cdot \left( 1 - 2^{-M_q} \right), & |x| = 1
\end{cases} \tag{2}$$
The lower and upper bounds on the quantization interval are
$$L_{M_q}[x] < x \leq U_{M_q}[x] \tag{3}$$
where
$$L_{M_q}[x] = q_M[x] - 2^{-M_q} \tag{4}$$
and
$$U_{M_q}[x] = q_M[x] + 2^{-M_q} \tag{5}$$
The fixed-point representation described by (1) - (5) differs from standard fixed-point
representations in its choice of quantization interval. The choice of (2) - (5) represents zero
as a negative -- more precisely, a non-positive -- finite-precision number. See [1] for the
motivation of this format choice.
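The quantizer can be sketched in a few lines of Python. This is an illustrative reading of the format, not the authors' implementation: it takes $q_M[x]$ to be the mid-point of the $2^{-(M_q-1)}$-wide half-open interval $(L, U]$ that contains $x$, so that zero falls in a non-positive interval as stated above. The function names `quantize` and `quantization_bounds` are assumptions of this example.

```python
import math


def quantize(x, m_q):
    """Mid-point of the 2**-(m_q - 1)-wide interval (L, U] that contains x."""
    if not -1.0 < x <= 1.0:
        raise ValueError("x must lie in (-1, 1]")
    if m_q < 1:
        raise ValueError("m_q must be at least 1 (the sign bit)")
    width = 2.0 ** -(m_q - 1)
    cell = math.ceil(x / width) - 1      # cell * width < x <= (cell + 1) * width
    return cell * width + width / 2


def quantization_bounds(x, m_q):
    """Lower and upper interval bounds: q - 2**-m_q and q + 2**-m_q."""
    q = quantize(x, m_q)
    half_width = 2.0 ** -m_q
    return q - half_width, q + half_width


if __name__ == "__main__":
    for bits in (1, 2, 3):
        print(bits, quantize(0.3, bits), quantization_bounds(0.3, bits))
```

With a single bit only the sign survives: every positive input maps to the interval $(0, 1]$ and every non-positive input to $(-1, 0]$.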
1.2 A Mathematical Comparison of the Probabilistic and Differential Strategies 
The probabilistic strategy for identifying the most likely face on a die with $C$ faces involves
estimating the $C$ face probabilities. In order for us to distinguish $P(\omega_{r1})$ from
$P(\omega_{r2})$, we must choose $M_q$ (i.e., the number of bits in our fixed-point
representation of the estimated probabilities) such that
$$q_M[\hat{P}(\omega_{r1})] > q_M[\hat{P}(\omega_{r2})] \tag{6}$$
The distinction between the differential and probabilistic strategies is made more clear if
one considers the way in which the $M_q$-bit approximation $\hat{\Delta}_{r1}$ is computed from a
random sample containing $k_{r1}$ occurrences of die face $\omega_{r1}$ and $k_{r2}$ occurrences of
die face $\omega_{r2}$. For the differential strategy
$$\hat{\Delta}_{r1\,\mathrm{differential}} \triangleq q_M\left[\frac{k_{r1} - k_{r2}}{n}\right] \tag{7}$$
and for the probabilistic strategy
$$\hat{\Delta}_{r1\,\mathrm{probabilistic}} \triangleq q_M\left[\frac{k_{r1}}{n}\right] - q_M\left[\frac{k_{r2}}{n}\right] \tag{8}$$
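where
$$\Delta_i \triangleq P(\omega_i) - \sup_{j \neq i} P(\omega_j), \qquad i = 1, 2, \ldots, C \tag{9}$$
Note that when $i = r1$
$$\Delta_{r1} = P(\omega_{r1}) - P(\omega_{r2}) \tag{10}$$
and when $i \neq r1$
$$\Delta_{i \neq r1} = P(\omega_i) - P(\omega_{r1}) \tag{11}$$
Note also
$$\Delta_{r1} = -\Delta_{r2} \tag{12}$$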
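Since
$$\sum_{j=1}^{C} P(\omega_{rj}) = C\,P(\omega_{r1}) - \Delta_{r1} + \sum_{j=3}^{C} \Delta_{rj} = 1 \tag{13}$$
we can show that the $C$ differences of (9) yield the $C$ probabilities by
$$P(\omega_{r1}) = \frac{1}{C}\left[1 + \Delta_{r1} - \sum_{j=3}^{C} \Delta_{rj}\right], \qquad
P(\omega_{rj}) = \Delta_{rj} + P(\omega_{r1}) \quad \forall\, j > 1 \tag{14}$$
Thus, estimating the $C$ differences of (9) is equivalent to estimating the $C$ probabilities
$P(\omega_1), P(\omega_2), \ldots, P(\omega_C)$.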
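The equivalence can be checked numerically. The sketch below is an illustration under stated assumptions: it computes each difference as a face probability minus the largest of the other face probabilities, then recovers the probabilities from the differences using the fact that every lower-ranked probability equals its (negative) difference plus the top probability, and that all probabilities sum to one. The function names and the example die are choices made for this example.

```python
def differences(ranked_probs):
    """Delta_i = P_i - max_{j != i} P_j, for probabilities listed in descending order."""
    deltas = []
    for i, p_i in enumerate(ranked_probs):
        others = ranked_probs[:i] + ranked_probs[i + 1:]
        deltas.append(p_i - max(others))
    return deltas


def probabilities_from_differences(deltas):
    """Invert `differences`: recover the ranked probabilities from the C differences."""
    c = len(deltas)
    # Summing P_rj = Delta_rj + P_r1 over j > 1 and using Delta_r2 = -Delta_r1 gives
    # 1 = C * P_r1 - Delta_r1 + sum_{j >= 3} Delta_rj, which is solved for P_r1 here.
    p_top = (1.0 + deltas[0] - sum(deltas[2:])) / c
    return [p_top] + [delta + p_top for delta in deltas[1:]]


if __name__ == "__main__":
    probs = [0.37, 0.28, 0.20, 0.10, 0.05]   # example die, ranked by probability
    deltas = differences(probs)
    print("differences:", deltas)
    print("recovered:  ", probabilities_from_differences(deltas))
```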
Clearly, the sign of $\hat{\Delta}_{r1}$ in (7) is modeled correctly (i.e.,
$\hat{\Delta}_{r1\,\mathrm{differential}}$ can correctly identify the most likely face) when
$M_q = 1$, while this is typically not the case for $\hat{\Delta}_{r1\,\mathrm{probabilistic}}$
in (8). In the latter case, $\hat{\Delta}_{r1\,\mathrm{probabilistic}}$ is zero when $M_q = 1$
because $q_M[\hat{P}(\omega_{r1})]$ and $q_M[\hat{P}(\omega_{r2})]$ are indistinguishable for
$M_q$ below some minimal value implied by (6). That minimal value of $M_q$ can be found by
recognizing that the number of bits necessary for (6) to hold for asymptotically large $n$
(i.e., for the quantized difference in (8) to exceed one LSB) is
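$$M_{q\,min} = \begin{cases}
\underbrace{1}_{\text{sign bit}} + \underbrace{\lceil -\log_2[\Delta_{r1}] \rceil}_{\text{magnitude bits}}, & -\log_2[P(\omega_{rj})] \notin Z^+, \quad j \in \{1, 2\} \\
\underbrace{1}_{\text{sign bit}} + \underbrace{\lceil -\log_2[\Delta_{r1}] \rceil + 1}_{\text{magnitude bits}}, & \text{otherwise}
\end{cases} \tag{15}$$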
where $Z^+$ represents the set of all positive integers. Note that the conditional nature of
$M_{q\,min}$ in (15) prevents the case in which
$\lim_{\epsilon \to 0} P(\omega_{r1}) - \epsilon = L_{M_q}[P(\omega_{r1})]$ or
$P(\omega_{r2}) = U_{M_q}[P(\omega_{r2})]$; either case would require an infinitely large sample
size before the variance of the corresponding estimated probability became small enough to
distinguish $q_M[\hat{P}(\omega_{r1})]$ from $q_M[\hat{P}(\omega_{r2})]$. The sign bit in (15) is
not required to estimate the probabilities themselves in (8), but it is necessary to compute the
difference between the two probabilities in that equation -- this difference being the ultimate
computation by which we choose the most likely die face.
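A short numerical sketch makes the contrast concrete. It assumes the quantizer returns the mid-point of the $2^{-(M_q-1)}$-wide interval $(L, U]$ containing its argument, and it uses made-up counts (84 and 64 occurrences in 227 tosses). Rather than evaluating a closed-form bit count, it simply searches for the smallest $M_q$ at which two probabilities quantize to different values. All function names are assumptions of this example.

```python
import math


def quantize(x, m_q):
    """Mid-point of the 2**-(m_q - 1)-wide interval (L, U] that contains x."""
    width = 2.0 ** -(m_q - 1)
    return (math.ceil(x / width) - 1) * width + width / 2


def differential_estimate(k1, k2, n, m_q):
    """Quantize the difference of the two relative frequencies."""
    return quantize((k1 - k2) / n, m_q)


def probabilistic_estimate(k1, k2, n, m_q):
    """Quantize each relative frequency first, then subtract."""
    return quantize(k1 / n, m_q) - quantize(k2 / n, m_q)


def min_bits_to_separate(p1, p2, max_bits=32):
    """Smallest m_q for which the quantized p1 exceeds the quantized p2 (search, not a formula)."""
    for m_q in range(1, max_bits + 1):
        if quantize(p1, m_q) > quantize(p2, m_q):
            return m_q
    return None


if __name__ == "__main__":
    k1, k2, n = 84, 64, 227   # hypothetical counts for the two most likely faces
    for bits in (1, 5):
        print(bits, differential_estimate(k1, k2, n, bits),
              probabilistic_estimate(k1, k2, n, bits))
    print("bits needed to separate 0.37 from 0.28:", min_bits_to_separate(0.37, 0.28))
```

With one bit the differential estimate already carries the correct sign, while the probabilistic estimate collapses to zero because both frequencies fall in the same interval.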
1.3 The Sample Complexity Product 
We introduce the sample complexity product (SCP) as a measure of both the number of 
samples and the functional complexity (measured in bits) required to identify the most 
likely face of an unfair die with specified probability. 
$$SCP = n \cdot M_q \quad \text{s.t.} \quad P(\text{most likely face correctly ID'd}) \geq \alpha \tag{16}$$
2 A Comparison of the Sample Complexity Requirements for the Probabilistic and Differential Strategies
Axiom 1 We view the number of bits $M_q$ in the finite-precision approximation $q_M[x]$ to
the real number $x \in (-1, 1]$ as a measure of the approximation's functional complexity.
That is, the functional complexity of an approximation is the number of bits with which it
represents a real number on $(-1, 1]$.
Assumption 1 If $\hat{P}(\omega_{r1}) > \hat{P}(\omega_{r2})$, then $\hat{P}(\omega_{r1})$ will be
greater than $\hat{P}(\omega_{rj})\ \forall\, j > 2$ (see [1] for an analysis of cases in which
this assumption is invalid).
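Question: What is the probability that the most likely face of an unfair die will be
empirically identifiable after $n$ tosses?
Answer for the probabilistic strategy:
$$P\left(q_M[\hat{P}(\omega_{r1})] > q_M[\hat{P}(\omega_{rj})],\ \forall\, j > 1\right) \cong
n! \sum_{k_{r1} = \lambda_1}^{\nu_1} \frac{P(\omega_{r1})^{k_{r1}}}{k_{r1}!}
\sum_{k_{r2} = \lambda_2}^{\nu_2}
\frac{P(\omega_{r2})^{k_{r2}} \left(1 - P(\omega_{r1}) - P(\omega_{r2})\right)^{(n - k_{r1} - k_{r2})}}
{k_{r2}!\,(n - k_{r1} - k_{r2})!} \tag{17}$$
where
$$\begin{aligned}
\lambda_1 &= \max\left(B + 1,\ \frac{n - k_{r2}}{C - 1} + 1\right), \quad C > 2 \\
\nu_1 &= n \\
\lambda_2 &= 0 \\
\nu_2 &= \min\left(B,\ n - k_{r1}\right) \\
B &\in \{B_{M_q}\}, \quad B = k_{U_{M_q}[P(\omega_{r2})]} = k_{L_{M_q}[P(\omega_{r1})]} - 1
\end{aligned} \tag{18}$$
There is a simple recursion in [1] by which every possible boundary for $M_q$-bit quantization
leads to itself and two additional boundaries in the set $\{B_{M_q}\}$ for $(M_q + 1)$-bit
quantization.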
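Answer for the differential strategy:
$$P\left(L_{M_q}[\Delta_{r1}] < \hat{\Delta}_{r1} \leq U_{M_q}[\Delta_{r1}],\ \hat{\Delta}_{rj} < 0\ \forall\, j > 1\right) \cong
n! \sum_{k_{r1} = \lambda_1}^{\nu_1} \frac{P(\omega_{r1})^{k_{r1}}}{k_{r1}!}
\sum_{k_{r2} = \lambda_2}^{\nu_2}
\frac{P(\omega_{r2})^{k_{r2}} \left(1 - P(\omega_{r1}) - P(\omega_{r2})\right)^{(n - k_{r1} - k_{r2})}}
{k_{r2}!\,(n - k_{r1} - k_{r2})!} \tag{19}$$
where
$$\begin{aligned}
\lambda_1 &= \max\left(k_{L_{M_q}[\Delta_{r1}]},\ \frac{n - k_{r2}}{C - 1} + 1\right), \quad C > 2 \\
\nu_1 &= n \\
\lambda_2 &= \max\left(0,\ k_{r1} - k_{U_{M_q}[\Delta_{r1}]}\right) \\
\nu_2 &= \min\left(k_{r1} - k_{L_{M_q}[\Delta_{r1}]},\ n - k_{r1}\right)
\end{aligned} \tag{20}$$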
Since the multinomial distribution is positive semi-definite (i.e., non-negative), it should be
clear from a comparison of (17)-(18) and (19)-(20) that
$P\left(L_{M_q}[\Delta_{r1}] < \hat{\Delta}_{r1} \leq U_{M_q}[\Delta_{r1}]\right)$ is largest
(and larger than any possible
$P\left(q_M[\hat{P}(\omega_{r1})] > q_M[\hat{P}(\omega_{rj})],\ \forall\, j > 1\right)$) for a given
sample size $n$ when the differential strategy is employed with $M_q = 1$ such that
$L_{M_q}[\Delta_{r1}] = 0$ and $U_{M_q}[\Delta_{r1}] = 1$ (i.e., $k_{L_{M_q}[\Delta_{r1}]} = 1$ and
$k_{U_{M_q}[\Delta_{r1}]} = n$). The converse is also true, to wit:
Theorem 1 For a fixed value of $n$ in (19), the 1-bit approximation to $\Delta_{r1}$ yields the
highest probability of identifying the most likely die face $\omega_{r1}$.
It can be shown that theorem 1 does not depend on the validity of assumption 1 [1]. Given
Axiom 1, the following corollaries to theorem 1 hold:
Corollary 1 The differential strategy's minimum-complexity 1-bit approximation of $\Delta_{r1}$
yields the highest probability of identifying the most likely die face $\omega_{r1}$ for a given
number of tosses $n$.
Corollary 2 The differential strategy's minimum-complexity 1-bit approximation of $\Delta_{r1}$
requires the smallest sample size necessary ($n_{min}$) to identify $P(\omega_{r1})$ -- and thereby
the most likely die face $\omega_{r1}$ -- correctly with specified confidence. Thus, the
differential strategy requires the minimum SCP necessary to identify the most likely die face
with specified confidence.
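For the 1-bit differential case, the probability in question can be evaluated directly in a simplified setting. The sketch below is an illustration under stated assumptions, not the search procedure of [1]: it relies on assumption 1 to reduce the die to three outcomes (the most likely face, the runner-up, and everything else), ignores the $C$-dependent lower limit on $k_{r1}$, and sums the trinomial probabilities of all outcomes in which the top face strictly out-counts the runner-up. The function names are choices made for this example.

```python
import math


def log_trinomial(n, k1, k2, p1, p2):
    """Log-probability of k1 top-face, k2 runner-up and n - k1 - k2 other outcomes."""
    k3 = n - k1 - k2
    p3 = 1.0 - p1 - p2                      # assumed strictly positive
    return (math.lgamma(n + 1) - math.lgamma(k1 + 1)
            - math.lgamma(k2 + 1) - math.lgamma(k3 + 1)
            + k1 * math.log(p1) + k2 * math.log(p2) + k3 * math.log(p3))


def prob_top_beats_runner_up(n, p1, p2):
    """P(k_r1 - k_r2 >= 1) after n tosses, i.e. the 1-bit difference has the right sign."""
    total = 0.0
    for k1 in range(1, n + 1):
        # k2 runs from 0 up to min(k1 - 1, n - k1): strictly fewer than k1, and feasible.
        for k2 in range(0, min(k1 - 1, n - k1) + 1):
            total += math.exp(log_trinomial(n, k1, k2, p1, p2))
    return total


if __name__ == "__main__":
    for tosses in (50, 100, 227):
        print(tosses, round(prob_top_beats_runner_up(tosses, 0.37, 0.28), 4))
```

Working with log-gamma terms keeps the factorials from overflowing for sample sizes in the hundreds.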
2.1 Theoretical Predictions versus Empirical Results 
Figures 1 and 2 compare theoretical predictions of the number of samples $n$ and the number
of bits $M_q$ necessary to identify the most likely face of a particular die versus the actual
requirements obtained from 1000 games (3000 tosses of the die in each game). The die has
five faces with probabilities $P(\omega_{r1}) = 0.37$, $P(\omega_{r2}) = 0.28$,
$P(\omega_{r3}) = 0.2$, $P(\omega_{r4}) = 0.1$, and $P(\omega_{r5}) = 0.05$. The theoretical
predictions for $M_q$ and $n$ (arrows with boxed labels based on iterative searches employing
equations (17) and (19)) that would with 0.95 confidence correctly identify the most likely die
face $\omega_{r1}$ are shown to correspond with the empirical results: in figure 1 the empirical
0.95 confidence interval is marked by the lower bound of the dark gray and the upper bound of
the light gray; in figure 2 the empirical 0.95 confidence interval is marked by the lower bound
of the $\hat{P}(\omega_{r1})$ distribution and the upper bound of the $\hat{P}(\omega_{r2})$
distribution. These figures illustrate that the differential strategy's minimum SCP
is 227 ($n = 227$, $M_q = 1$) while the minimum SCP for the probabilistic strategy is 2720
($n = 544$, $M_q = 5$). A complete tabulation of SCP as a function of $P(\omega_{r1})$,
$P(\omega_{r2})$, and the worst-case choice for $C$ (the number of classes/die faces) is given
in [1].

Figure 1: Theoretical predictions of the number of tosses needed to identify the
most likely face $\omega_{r1}$ with 95% confidence (Die 1): Differential strategy prediction
superimposed on empirical results of 1000 games (3000 tosses each).

Figure 2: Theoretical predictions of the number of tosses needed to identify the
most likely face $\omega_{r1}$ with 95% confidence (Die 1): Probabilistic strategy prediction
superimposed on empirical results of 1000 games (3000 tosses each).
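The comparison can be re-created in spirit with a small Monte Carlo experiment. The sketch below is an assumption-laden illustration rather than the authors' procedure: it plays 1000 simulated games with the five-faced die described in this section, stops each game after a fixed number of tosses, and counts how often each strategy singles out the most likely face. The quantizer is taken to return the mid-point of the $2^{-(M_q-1)}$-wide interval $(L, U]$ containing its argument; the seed and all function names are choices made for this example.

```python
import math
import random

FACE_PROBS = [0.37, 0.28, 0.20, 0.10, 0.05]   # die from the text, ranked by probability


def quantize(x, m_q):
    """Mid-point of the 2**-(m_q - 1)-wide interval (L, U] that contains x."""
    width = 2.0 ** -(m_q - 1)
    return (math.ceil(x / width) - 1) * width + width / 2


def play_game(n_tosses, rng):
    """Toss the die n_tosses times and return the count for each face."""
    counts = [0] * len(FACE_PROBS)
    for face in rng.choices(range(len(FACE_PROBS)), weights=FACE_PROBS, k=n_tosses):
        counts[face] += 1
    return counts


def differential_wins(counts, n_tosses):
    """1-bit differential test: k_r1 - k_rj is positive for every other face."""
    return all(counts[0] > k for k in counts[1:])


def probabilistic_wins(counts, n_tosses, m_q=5):
    """Probabilistic test: the quantized top-face frequency exceeds all the others."""
    q = [quantize(k / n_tosses, m_q) for k in counts]
    return all(q[0] > q_j for q_j in q[1:])


def success_rate(n_tosses, wins, games=1000, seed=0):
    """Fraction of simulated games in which `wins` picks out the most likely face."""
    rng = random.Random(seed)
    hits = sum(wins(play_game(n_tosses, rng), n_tosses) for _ in range(games))
    return hits / games


if __name__ == "__main__":
    print("differential, 1 bit,  n=227:", success_rate(227, differential_wins))
    print("probabilistic, 5 bits, n=227:", success_rate(227, probabilistic_wins))
    print("probabilistic, 5 bits, n=544:", success_rate(544, probabilistic_wins))
    print("SCP products:", 227 * 1, "versus", 544 * 5)
```

Because the rates are Monte Carlo estimates from 1000 games, they fluctuate by roughly a percentage point from seed to seed; they are meant to be compared with the 0.95 confidence level quoted above, not to reproduce the figures.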
3 Conclusion 
The sample complexity product (SCP) notion of functional complexity set forth herein is 
closely aligned with the complexity measures of Kolmogorov and Rissanen [4, 6]. We have 
used it to prove that the differential strategy for learning the Bayesian discriminant function 
is optimal in terms of its minimum requirements for classifier functional complexity and 
number of training examples when the classification task is identifying the most likely face 
of an unfair die. It is relatively straightforward to extend theorem 1 and its corollaries to
the general pattern recognition case in order to show that the expected SCP for the 1-bit
differential strategy
$$E[SCP]_{\mathrm{differential}} \triangleq \int_{X} n_{min}\left[P(\omega_{r1} \mid x), P(\omega_{r2} \mid x)\right] \cdot
\underbrace{M_{q\,min}\left[P(\omega_{r1} \mid x), P(\omega_{r2} \mid x)\right]}_{=1}\ \rho(x)\,dx \tag{21}$$
(or the discrete random vector analog of this equation) is minimal [1]. This is because $n_{min}$
is by corollary 2 the smallest sample size necessary to distinguish any and all $P(\omega_{r1})$
from lesser $P(\omega_{r2})$. The resulting analysis confirms that the classifier trained with the
differential strategy for statistical pattern recognition (i.e., using a CFM_mono objective
function) has the highest probability of learning the Bayesian discriminant function when the
functional capacity of the classifier and the available training data are both limited.
The relevance of this work to the process of designing and training robust connectionist
pattern classifiers is evident if one considers the practical meaning of the terms
$n_{min}\left[P(\omega_{r1} \mid x), P(\omega_{r2} \mid x)\right]$ and
$M_{q\,min}\left[P(\omega_{r1} \mid x), P(\omega_{r2} \mid x)\right]$ in the sample complexity
product of (21). Given one's choice of connectionist model to employ as a classifier, the
$M_{q\,min}$ term dictates the minimum necessary connectivity of that model. For example, (21)
can be used to prove that a partially connected radial basis function (RBF) with trainable
variance parameters and three hidden layer "nodes" has the minimum $M_q$ necessary for
Bayesian discrimination in the 3-class task described by [5]. However, because both SCP
terms are functions of the probabilistic nature of the random feature vector being classified
and the learning strategy employed, that minimal RBF architecture will only yield Bayesian
discrimination if trained using the differential strategy. The probabilistic strategy requires
significantly more functional complexity in the RBF in order to meet the requirements
of the probabilistic strategy's SCP [1]. Philosophical arguments regarding the use of the
differential strategy in lieu of the more traditional probabilistic strategy are discussed at
length in [1].
Acknowledgement 
This research was funded by the Air Force Office of Scientific Research under grant 
AFOSR-89-0551. We gratefully acknowledge their support. 
References 
[1] J. B. Hampshire II. A Differential Theory of Statistical Pattern Recognition. PhD 
thesis, Carnegie Mellon University, Department of Electrical & Computer Engineering, 
Hammerschlag Hall, Pittsburgh, PA 15213-3890, 1992. Manuscript in progress.
[2] J. B. Hampshire II and B. A. Pearlmutter. Equivalence Proofs for Multi-Layer Per- 
ceptron Classifiers and the Bayesian Discriminant Function. In Touretzky, Elman, 
Sejnowski, and Hinton, editors, Proceedings of the 1990 Connectionist Models Sum- 
mer School, pages 159-172, San Mateo, CA, 1991. Morgan-Kaufmann. 
[3] J. B. Hampshire II and A. H. Waibel. A Novel Objective Function for Improved 
Phoneme Recognition Using Time-Delay Neural Networks. IEEE Transactions on 
Neural Networks, 1(2):216-228, June 1990. A revised and extended version of work 
first presented at the 1989 International Joint Conference on Neural Networks, vol. I, 
pp. 235-241. 
[4] A. N. Kolmogorov. Three Approaches to the Quantitative Definition of Information. 
Problems of Information Transmission, 1(1):1-7, Jan. - Mar. 1965. Faraday Press 
translation of Problemy Peredachi Informatsii. 
[5] M. D. Richard and R. P. Lippmann. Neural Network Classifiers Estimate Bayesian a
posteriori Probabilities. Neural Computation, 3(4):461-483, 1991.
[6] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978. 
