On Learning p-Perceptron Networks with 
Binary Weights 
Mostera Golea 
Ottawa-Carleton Institute for Physics 
University of Ottawa 
Ottawa, Ont., Canada K1N 6N5 
050287acadvml .uottawa.ca 
Mario Marchand 
Ottawa-Carleton Institute for Physics 
University of Ottawa 
Ottawa, Ont., Canada KIN 6N5 
mmmsj acadvm 1 .uottawa. ca 
Thomas R. Hancock 
Siemens Corporate Research 
755 College Road East 
Princeton, NJ 08540 
hancockicarning.sicmens.com 
Abstract 
Neural networks with binary weights are very important from both 
the theoretical and practical points of view. In this paper, we in- 
vestigate the learnability of single binary percepttons and unions of 
p-binary-perceptron networks,/. e. an "OR" of binary percepttons 
where each input unit is connected to one and only one percep- 
tron. We give a polynomial time algorithm that PAC learns these 
networks under the uniform distribution. The algorithm is able 
to identify both the network connectivity and the weight values 
necessary to represent the target function. These results suggest 
that, under reasonable distributions, p-perceptron networks may 
be easier to learn than fully connected networks. 
I Introduction 
The study of neural networks with binary weights is well motivated from both the 
theoretical and practica] points of view. Although the number of possible states 
591 
592 Golea, Marchand, and Hancock 
in the weight space of a binary network are finite, the capacity of the network is 
not much inferior to that of its continuous counterpart (Barkai and Kanter 1991). 
Likewise, the hardware realization of binary networks may prove simpler. 
A major obstacle impeding the development of binary networks is that, under an 
arbitrary distribution of examples, the problem of learning binary weights is NP- 
complete(Pitt and Valiant 1988). However, the question of the learnability of this 
class of networks under some reasonable distributions is still open. A first step in 
this direction was reported by Venkatesh (1991) for single majority functions. 
In this paper we investigate, within the PAC model (Valiant 1984; Biumer et al. 
1989), the learnability of two interesting concepts under the uniform distribution: 
1) Single binary percepttons with arbitrary thresholds and 2) Unions of p-binary 
percepttons, i.e. an "OR" of binary percepttons where each input unit is connected 
to one and only one perceptton (fig. 1) 1. These functions are related to so-called 
p or read-once formulas (Kearns et al. 1987). Learning these functions on special 
distributions is presently a very active research area (Paga!io and Haussler 1989; 
Valiant and Warmuth 1991). The/z restriction may seem to be a severe one but 
it is not. First, under an arbitrary distribution, p-formulas are not easier to learn 
than their unrestricted counterpart (Kearns et al. 1987). Second, the p assumption 
brings up a problem of its own: determining the network connectivity, which is a 
challenging task in itself. 
Our main results are polynomial time algorithms that PAC learn single binary 
percepttons and unions of p-binary percepttons under the uniform distribution of 
examples. These results suggest that p-perceptron networks may be somewhat eas- 
ier to learn than fully connected networks if one restricts his attention to reasonable 
distributions of examples. Because of the limited space, only a sketch of the proofs 
is given in this abstract. 
2 Definitions 
Let I denote the set (0, 1 ). A perceptron g on I" is specified by a vector of n real 
valued weights w and a single real valued threshold O. For x = (xl,x2, ...,x,) E I", 
we have: 
g(x) = 0 if =" (1) 
=wx < 0 
A perceptron is said to be positive if wi _> 0 for i = 1, ...,n. We are interted 
in the e where the weights are binary lu (1). We ume, without 1 of 
generality (w.l.o.g.), that 0 is integer. 
A function f is said to be a union of percepttons if it can be written  a disjunction 
of perptrons. If th pereeptrons do not share any variable, f is sd to be a 
union of -pereeptrons (fig. 1), and we write 
f=gO) Vg()V...Vg(,) lsn (2) 
We shall sume, w.l.o.g., that f is expre with the mimum poible number 
of g(0's, d h the minimum psible number of non-zero weights. 
The intention is simply the mplement of the ion d c  treat silly. 
On Learning !x-Perceptron Networks with Binary Weights 593 
Output Unit 
I il I I I 
Input Units 
Figure 1: A two layer network representing a union of p-perceptrons. Note that 
each input unit is connected to one and only one perceptron (hidden unit). The 
output unit computes an OR function. 
We denote by P(A) the probability of an event A and by iS(A) its empirical estimate 
based on a given finite sample. All probabilities are taken with respect to the 
uniform distribution D on I n. If a E {0, 1}, we denote by P(f = 11a: = a) the 
conditional probability that f = 1 given the fact that x = a. 
The algorithm will make use of the following probabilistic quantities: 
The influence of a variable xi, denoted Inf(xi), is defined as 
Inf(x,) - P(f = llx, - 1) - P(f = llx, - 0). 
Intuitively, the influence of a variable is positive (negative) if its weight is positive 
(negative). 
The correlation of a variable x1 with xi, denoted O(xi, xi), is defined as 
C(x,,) = P(I= 11, = 1)- P(f = llw, = 1) 
P(I = 11w = 1)- P(f = 1) 
where xix = 1 stands for xi = x. = 1. This quantity depends largely on whether 
or not the two variables are in same perceptton. 
In what follows, we adopt the PAC learning model (Valiant 1984; Biumer et al. 
1989). Here the methodology is to draw a sample of a certain size labeled according 
to the unknown target function f and then to find a "good" approximation h of 
f. The error of the hypothesis function h, with respect to the target f, is defined 
to be P(h  f) = P(h(x)  f(x)), where x is distributed according to D. An 
algorithm learns from examples a target class F using an hypothesis class H under 
the distribution D on I n, if for every f E F, and any 0 < , 6 < 1, the algorithm 
runs in time polynomial in (n, , 6) and outputs an hypothesis h  H such that 
P[ P(a# f) > ] < , 
594 Golea, Marchand, and Hancock 
3 Learning Networks with Binary Weights 
3.1 Learning Single Binary Porceptrons 
Let us assume that the target function f is a single binary perceptron g given by 
eq. (1). Let us assume also that the distribution generating the examples is uniform 
on (0, 1)n. The learning algorithm proceeds in two steps: 
1. Estimating the weight values (signs). This is done by estimating the influ- 
ence of each variable. Then the target perceptron is reduced to a pos/tive 
perceptron by simply changing x, to 1 - x, whenever w, = -1. 
2. Estimating the threshold of the positive target perceptron and hence the 
threshold of the original perceptron. 
To simplify the analysis, we introduce the following notation. Let N be the number 
of negative weights in g and let y be defined as 
Ix, if w, = 1 
Y' = I - x, if wi = -1 (3) 
Then eq. (1) can be written as 
g(Y) = 0 if E,--St, < fl (4) 
where the renormalized threshold fl is related to the original threshold 0 by: tl = 
0 + N. We assume, w.l.o.g., that 1 (_ fl _( n. Note that D(x) = D(y). 
The following iemma, which follows directly from Bahadur's expansion (Bahadur 
1960), will be used throughout this paper to approximate binomial distributions. 
n (d(n, 
Lemma 1 Let d and n be two integers. Then, if  _ _ 
() 
a n 1 n 1 
i < d 1 z 
!n-- And if O ( d ( , 
where z = 2 4+ l . - - 
d 
E(n) l(n) 1 
i <  d 1 
i=0 
where z t = _1 n-k l 
2 n-d-k1 ' 
As we said earlier, intuition suggests that the influence of a variable is positive 
(negative) if its weight is positive (negative). The following iemma strengthens this 
intuition by showing that there is a measurable gap between the two cases. This 
gap will be used to estimate the weight values (signs). 
Lemma 2 Let g be a perceptton such that P(g = 1), P(g = O) > p, where 0 < p < 
1. Then, 
Inf(x,){ > -+_ if w, = 1 
< --+ if w,=-I 
On Learning !x-Perceptron Networks with Binary Weights 595 
Proof sketch for w = 1: We exploit the equivalence between eq. (1) and eq. (4), 
and the fact that the distribution is uniform: 
Inf(a:) = P(g(x)= llx, = 1)- P(g(x)-- llx =0) 
= P(g(y)= lly, = 1)- P(g(y)= 1]y = O) 
= 2 P(g= 1) 
i 
(5) 
Applying lemma 1 to either eq. (5) (case: Ct _> ) or eq.(6) (case: Ct _< -) yields 
the desired result. El 
Once we determine the weight values, we reduce g to its positive form by changing 
x to 1 -x whenever w = -1. The next step is to estimate the renorma]ized 
threshold Ct and hence, the original threshold 0. 
Lemma 3 Let g be the positive perceptton with a renofinalized threshold . Let g' 
the positive perceptton obtained from g by substituting r for . Then, if r _< , 
P(9  9') <- 1 - P(9 = 11g'= 1) 
So, if we estimate P(g = llg' = 1) for r = 1,2, ... and then choose as the renormal- 
ized threshold the least r for which P(g = 11g' = 1) _> (1 - e), we are guaranteed 
to have P(g  g') _< e. Obviously, such r exists and is always _< t because 
P(g- 11g' - 1)- 1 for r- Ct. 
A sketch of the algorithm for learning single binary perceptrons is given in fig. 2. 
Theorem 1 The class of binary percepttons is PA C learnable under the uniform 
distribution. 
Proof sketch: A sample of size O( n2 ln) is sufficient to ensure that the different 
probabilities are estimated to within a sufficient precision (llagerup and Rub 1989). 
 and 
Steps 2 and 3 of the algorithm are obvious. If we reach step 5, P(g = 1) >  
P(g = 0) > 3' Then one has only to set p =  and apply iemma 2 and 3 to step 5 
and 6 respectively. Final]y, the algorithm runs in time polynomial in n,  and 5.C3 
3.2 Learning Unions of p-binary Perceptrons 
Let us assume now that the target function f is a union of p-binary perceptrons 
as in eq. (2). Note that we do not assume that the architecture (connectivity) is 
known in advance. Rather, it is the task of the learning algorithm to determine 
which variables are in a given perceptton. The learning algorithm proceeds in three 
steps: 
596 Golea, Marchand, and Hancock 
Algorithm LEA RN-BINA R Y-PERCEPTRON(n,e,5) 
Parameters: n is the number of input variables, e is the accuracy parameter and  is 
the confidence parameter. 
Output: a binary perceptton h. 
Description: 
1 Call M = [so(,?)]: In sn examples. This sample is be used to estimate the 
different probabilities. [nitiahze h to the constant perceptton 0. Initialize N to 0. 
2. (Are most examples positive?) If ifi(g = 1) _> (1 - ) then set h = 1. Return h. 
 then return h. 
3. (Are most examples negative?) If P(g = 1) <_  
4. Set p= . 
5. (Reduce g to a positive perceptton) For each input variable 
(a) Ifilf (x,)> P setw=l. 
n+2' 
(b) Else ifIf(x)<- e setN=N+l,w=-l, andchangextol- 
n+2' 
(c) Else delete x (update n accordingly). 
6. (Estimating the bias) Let g' be as defined in iemma 3. Initialize r to 1. 
(a) Estimate P(g = llg'= 1). 
(b) If P(g = 11g'= 1) > 1 - e, set fl = r and go to step 7. 
(c) r = r + 1. Go to step 6a. 
7. Set # = q - N. Return h (that is (w,...,w,;O)). 
Figure 2: An algorithm for learning single binary perceptrons. 
1. Estimating the weight values (signs). This is again done by estimating the 
influence of each variable. Then the target function is reduced to a union 
of positive percepttons by simply changing x to 1 - x whenever w = -1. 
2. Estimating which variables belong to the same perceptton. This is done by 
estimating correlations between different variables. 
3. Estimating the renofinalized threshold of the each positive target percep- 
tron and hence, the threshold of the original perceptton. 
The following lemma is a straightforward generalization of lcmma 2. 
Lemma 4 Let f be a union of p-perceptrons as in (). Let g(a) be a perceptton in 
f and let x  g(). Assume that P(f = 1) < 1 -7 and P(g - 1),P(g = O) > p 
where O < 7, P< 1. Then 
7P if w 
Inf(xi) > = 1 
< -"-+ if w---1 
Proof sketch for wi - 1: Let m be the disjunction of all perceptrons in f except 
g. Then, using the inclusion-exclusion property and the fact that perceptrons in f 
do not share any variables, 
Inf(x,) = (1-P(m=l))(P(g=llx,=l)-(P(g=l[x,=0)) 
On Learning -Perceptron Networks with Binary Weights 597 
> 7(P(g = 1Ix, = 1)- (P(g = 1Ix,- 0)) 
> (7) 
n+2 
Inequality (7) follows from the fact that I - P(m = 1) > 1 - P(f = 1) > y and 
from lemma 2.D 
Lemma 4 enables us to determine the weight values and reduce f to its positive 
form. Note that we can assume, w.i.o.g., that y _> e/2 and p _> e/2n. The next 
step is to determine which variables belong to the same perceptton. Starting with 
a variable, say xi, the procedure uses the correlation measure to decide whether or 
not another variable, say xj, belongs to xi's perceptton. We appeal to the following 
lemma where we assume that f is already reduced to its positive form. 
Lemma 5 
tron in f. 
Then 
Let f be a union of p-perceptrons (positive form). Let g(a) be a percep- 
Let xi E g(a) and let xj and x: be two other influential variables in f . 
0 if xiEg(a) and xk6g(a) 
C(x,,x) -C(x,,x,)= 0 if xl f g(,O and x, f g(*) 
_> -: if xg(a) and xg(a) 
The lengthy proof of this important lemma will appear in the full paper. 
If we estimate the correlations to within a sufficient precision, the correlation gap, 
established in iemma 5, enables us to decide which variables are in the same per- 
ceptron. 
The last step is to estimate the bias of each perceptton. Let g be a perceptton in f 
and let g' be the perceptton obtained from g by setting its tenorrealized bias to r. 
Estimating g's bias may be done by simply estimating P(f = 1 Ig' = 1) for different 
values of the renormalized threshold, r = 2, 3,..., and choosing the least r such that 
P(f-- llg'- 1) (1 -e/n) (see lemma 3 and step 6 in figure 2). 
Theorem 2 The class of p-binary perceptton unions are PA C learnable under the 
uniform distribution. 
4 Conclusion and Open Problems 
We presented a simple algorithm for PAC learning single binary perceptrons and 
p-binary-perceptron unions, under the uniform distribution of examples. The hard- 
hess results reported in the literature (see (Lin and Vittcr 1991)) suggest that one 
can not avoid the training difficulties simply by considering only very simple neural 
networks. Our results (see also (Marchand and Golca 1992)) suggest that the com- 
bination of simple networks and reasonable distributions may be needed to achieve 
any degree of succe. 
The results reported here are part of an ongoing research aimed at understanding 
nonoverlapping perceptton networks (Itancock et al. 1991). The extension of these 
results to more complicated networks will appear in the full paper. 
598 Golea, Marchand, and Hancock 
Acknowledgements 
M. Golea and M. Marchand are supported by NSERC grant OGP0122405. This 
research was conducted while T. Hancock was a graduate student at Harvard Uni- 
versity, supported by ONR grant N00014-85.K-0445 and NSF grant NSF-CCR-89- 
02500. 
References 
[1] Bahadur R. (1960) "Some Approximations to the Binomial Distribution Func- 
tion", Annals Math. Star., Vol.31, 43-54. 
[2] Barkai E. & Kanter I. (1991) "Storage Capacity of a Multilayer Neural Network 
with Binary weights", Europhys. Lett., Vol. 14, 107-112. 
[3] Blumer A., Ehrenfeucht A., Haussler D., and Warmuth K. (1989) "Learnability 
and the Vapnik-Chervonenkis Dimension" J. ACM, Vol. 36, 929-965. 
[4] Hagerup T. & Rub C. (1989) "A Guided Tour to Chernoff Bounds" Info. Proc. 
Lett., Vol. 33, 305-308. 
[5] Hancock T., Golea M., and Marchand M. (1991) "Learning Nonoverlapping 
Perceptton Networks From Examples and Membership Queries", TR-26-91, Center 
for Research in Computing Technology, Ilarvard University. Submitted to Machine 
Learning. 
[6] Kearns M., Li M., Pitt L., and Valiant L. (1987) "On the Learnability of Boolean 
Formulas", in Proc. of the 19th Annum ACM Symposium on Theory of Computing, 
285-295, New York, NY. 
[7] Lin J.H. & Vitter J.S. (1991) "Complexity Results on Learning by Neural Nets" 
Machine Learning, Vol. 6, 211-230. 
[8] Marchand M. & Golea M. (1992) "On Learning Simple Neural Concepts", to 
appear in Network. 
[9] Pagallo G. & Ilaussler D. (1989) "A Greedy Method for learning/DNF functions 
under the Uniform Distribution". Technical Report UCSC-CRL-89-12, Santa Cruz: 
Dept. of Computer and Information Science, University of California at Santa Cruz. 
[10] Pitt L. & Valiant L.G. (1988) "Computational Limitations on Learning from 
Examples", J. ACM, Vol. 35, 965-984. 
[11] Valiant L.G. (1984) "A Theory of the Learnable", Comm. ACM, Vol. 27, 
1134-1142. 
[12] Valiant L.G. & Warmuth K. (Editors) (1991) Proc. of the 4st Workshop on 
ComputationM Learning Theory, Morgan Kaufman. 
[13] Venkatesh S. (1991) "On Learning Binary Weights for Majority Functions", in 
Proc. of the 4th Workshop on Computational Learning Theory, 257-266, Morgan 
Kaufman. 
