Strong Unimodality and Exact Learning 
of Constant Depth/-Perceptron 
Networks 
Mario Marchand 
Department of Computer Science 
University of Ottawa 
Ottawa, Ont., Canada KIN 6N5 
marchand@csi.uottawa. ca 
Saeed Hadjifaradji 
Department of Physics 
University of Ottawa 
Ottawa, Ont., Canada KIN 6N5 
saeed@physics.uottawa.ca 
Abstract 
We present a statistical method that exactly learns the class of 
constant depth /-perceptron networks with weights taken from 
f-l, 0 q- 1) and arbitrary thresholds when the distribution that 
generates the input examples is member of the family of product 
distributions. These networks (also known as nonoverlapping per- 
ceptron networks or read-once formulas over a weighted threshold 
basis) are loop-free neural nets in which each node has only one 
outgoing weighl{. With arbitrary high probability, the learner is 
able to exactly identify the connectivity (or skeleton) of the target 
/-perceptron network by using a new statistical test which exploits 
the strong unimodality property of sums of independent random 
variables. 
1 INTRODUCTION 
From a computational learning theory perspective, it is well known that efficient 
learning of non trivial (or non simple) neural network function classes is possible 
only when either (1) the learner is able to use membership queries or (2) the distri- 
bution that generates the input examples is not arbitrary but member of some well 
defined family. Following several positive learnability results on different classes of 
read-once Boolean formulas, a membership query algorithm has been recently pro- 
posed [4] for learning the class of nonoverlapping perceptron networks. These net- 
works (also known as/-perceptron networks or read-once formulas over a weighted 
threshold basis) are loop-free neural nets in which each node has only one outgoing 
weight. If membership queries are not permitted (as we assume throughout this 
paper), learning this class becomes intractable [6] under arbitrary input distribu- 
Strong Unimodality and Exact Learning of Constant Depth g-Perceptron Networks 289 
tions. However, under the uniform distribution, a PAC learning algorithm has been 
proposed recently [2] for a quite restricted subclass called generalized/-perceptron 
decision lists. As an important step towards the learnability of the whole class of 
/-perceptron networks under "simple" distributions, we present in this paper a sta- 
tistical method that exactly learns the class of constant depth/-perceptron networks 
under the family of product distributions, i.e. distributions in which the setting of 
each input variable is chosen independently of the other variables. Eventhough the 
depth of the network must be fixed to a constant, we satisfy here a harder learning 
criterion than the one proposed by the PAC model [9]. Indeed, with arbitrary high 
probability, the proposed algorithm is able to exactly identify [1] the target function. 
Moreover, because of its statistical nature [7], the proposed algorithm can tolerate 
a classification noise rate  up to the information theoretic limit of  = 1/2. 
There exist other statistical methods to learn other classes of read-once formulas un- 
der particular distributions [1] and product distributions [8]. They all basically differ 
in the statistical tests they use to identify the gate parameters and the formula's 
skeleton. Our key novel contribution is to introduce a new test (for discovering the 
network's connectivity) which exploits the strong unimodality [5] property of sums 
of independent random variables. 
2 DEFINITIONS 
We consider the problem of learning Boolean functions of the Boolean domain 
{0,1} n. Let X = {x,x2... ,Xn} be the set of n input variables and x C {0,1} n be 
some assignment of these n variables, we denote by xv the restriction of assignment 
x on the variables in V C_ X. A perceptron g on V is defined by a vector of v = I VI 
weights wi and a single threshold t9. As usual, for any xv c {0, 1} v, the output of 
g(xv) is I whenever -ev wx > 9 and 0 otherwise. 
We restrict ourselves to the case where each wi  {-1,0, +1} but the thresholds 
are arbitrary so that, without loss of generality (w.l.o.g.), t9 E {-v - 1,-..v}. A 
perceptron is said to be positive if all its incoming weights are +1. The learning 
algorithm will use the following classification for positive perceptrons. 
T1 perceptrons: These are perceptrons which output I iff one or more of its inputs 
are set to 1. These are OR gates of multiple inputs. 
TO perceptrons: These are perceptrons which output 0 iff one or more of its inputs 
are set to 0. These are AND gates of multiple inputs. 
Tll perceptrons: These are perceptrons which output I iff two or more of its inputs 
are set to 1. These include majority gates of three inputs. 
TOO perceptrons: These are perceptrons which output 0 iff two or more of its inputs 
are set to 0. These include majority gates of four variables. 
TG perceptrons: All the perceptrons which do not belong to any one of the above 
four categories. They must, therefore, have at least five inputs. 
Each perceptron can have variables and/or other perceptrons as inputs. Hence, 
a node will denote either a variable or a perceptron. The class of/-perceptron 
networks is the set of all Boolean functions that can be represented as a loop- 
free network of percepttons where each node (including input units) has only one 
outgoing weight. The output unit of a network will often be referred as the root node. 
We say that a node is a child of the parent perceptron g if it is an immediate input 
to perceptron g. Children of the same perceptron are called siblings. A perceptron 
is said to be a bottom level perceptron if all its children are variables. The depth of a 
node is defined as the number of perceptrons (including the parent of the node and 
the root node) on the path from the parent of the node to the root. The perceptrons 
290 M. MARCHAND, S. HADJIFARADJI 
on this path are called the ancestors of that node. The descendants of perceptron 
g, denoted by desc(g), is the set of nodes that have perceptton g as an ancestor. 
The depth of a network is defined as the depth of the deepest variable in the net. 
The least common ancestor of a set V of nodes, denoted by lca(V), is defined as 
the deepest ancestor which is common to every node in V. Variables {xi, xj, xk} 
are said to meet at perceptron g, iff lca(xi,xj) - lca(xi,xk) -- lca(xj,xk) -- g. If 
there does not exist a perceptron g having this property, then variables {xi, xj, x} 
are said not to meet. 
In this paper, we use a learning criterion which is more ambitious than the PAC 
criterion introduced by Valiant [9]. We consider that each training example x 
is generated by an unknown product distribution D on 0, 1} n and then labeled 
according to an unknown target Boolean function f representable as a/-perceptron 
network. After observing a set of such examples, the goal of the learning algorithm 
is to produce an hypothesis function h which is the exact equivalent of f. More 
formally we say that algorithm A exactly learns (or exactly identifies) a class F of 
of Boolean functions iff for any 0 < 5 < 1, any product distribution D on 0, 1} n, 
and any target function f C F, algorithm A outputs, with probability at least 1 - 5 
an hypothesis function h such that h(x) - f(x) V x c 0, 1} n. 
The learning algorithm will perform several statistical tests to build its hypothesis. 
Namely, for each variable xi, it will estimate its influence, defined as: 
Infi(xi) de=f Pr(f = llxi = 1)- Pr(f = 1Ix  = 0) 
(1) 
where all probabilities (here and in the sequel) are defined with respect to the 
(unknown) training product distribution D. The empirical estimate of Pt(A) will 
be denoted as lfr(A). We will also use, Infig(xi) to denote the influence of xi on the 
subformula of f which is rooted at perceptton g. Also, Infi(xilxj = a) will denote 
the influence of xi given that variable xj is fixed to value a. To discover the skeleton 
of the target function, the learner will compute the coinfluence of several triples of 
variables, defined as: 
c/k,j de__f Infl(xjlxi = 1,x: 0) Infl(xjlx = 1, x = 1) 
- Infi(xlx, - 0, xk -- 0) - Infi(xlx, - 0, - 1) (2) 
Because h must make zero error with f, the learner must produce an hypothesis 
h which contains all the input variables and all the perceptrons of f (except those 
variables and perceptrons which are fixed to a constant value). Consequently, for a 
target f defined on n input variables xi and containing r perceptrons g, we define 
s as: 
der 
% = min {Pr(zi = a)/e{0,...n},ae{0,u,Pr(gk = b)e{0,...},be{0,}} (3) 
Hence, Vi E {0,...n} we have: %  Pr(z = 1) % l-e, andV k E {0,..-r} 
we have: % N Pr(g = 1) N 1 - %. To exactly learn the cls of constant depth 
-percepton network, the proposed algorithm needs a number of examples which 
is polynomial in l/e, (see the algorithm LearnNPN). 
3 THE LEARNING ALGORITHM 
We first perform some simplifying reductions that hold for any target/-perceptron 
net f. (1) We can assume, w.l.o.g., that only input variables have a negative 
outgoing weight. Indeed, if a perceptron g has a -1 outgoing weight, we can 
replace it by a perceptron which has all its incoming weights negated and a +1 
Strong Unimodality and Exact Learning of Constant Depth g-Perceptron Networks 291 
outgoing weight; this leaves the computation by f unchanged when we add +1 to 
the threshold of g's parent. In this manner, all -1 weights are pushed to the input 
variables. (2) T1 perceptrons do not have T1 perceptrons as children since such 
nodes can always be merged. The same remark is true for TO perceptrons. (3) 
Because the output of each node is Boolean valued, each perceptron has at least 
two inputs. This implies that f has at most n - I perceptrons. 
The first step of the algorithm is to identify the weight wi that springs out of each 
input variable xi. For this we appeal to the following lemma: 
Lemma ! Let f be any/-perceptron network with weights taken from (-1, 0, +1) 
and arbitrary thresholds. Let D be any product distribution on CO, 1) n. Let g be any 
perceptron with v weights and for which p <_ Pr(g = 1) <_ I - p. Let input variable 
xi be a child of g. Then: 
Infis(xi) = 
Moreover, if xi has depth d, then, we have: 
+p/(2v) if wi = +1 
0 if wi = 0 
-p/(2v) if wi =-1 
sld 
Thus Infi(xi) has a gap of O([%/n] d) that separates the three possible values for 
wi. From Chernoff bounds [3], this implies that a sample size polynomial in %/n is 
sufficient to find, with high probability, the exact value of wi when d is fixed. After 
having identified all the weights in this manner, we transform the target function 
into its positive form simply by changing xi to 1 -xi (and adding +1 to the threshold 
of xi's parent) whenever wi = -1. 
To find the skeleton of the target function, the algorithm will first find all the bot- 
tom level perceptrons (i.e. perceptrons whose children are all variables). Then, 
after finding the exact thresholds (for TG perceptrons), we will consider these bot- 
tom level perceptrons as new "meta" variables (that replace their children) from 
which we can find their parent perceptrons. In this manner similar to Schapire's 
algorithm [8], we will build every perceptron of the net until we reach the root. 
The coinfiuence function will enable the learner to determine if certain variables 
are siblings of a perceptron g and if g is fed by other perceptrons. This is possi- 
ble because the distribution of a sum of independent random variables is strongly 
unimodal [5]. More specifically, we have (and need) here a stronger property: 
Lemma 2 Let (Xl,X2... ,Xv) be v independent random Boolean variables, 
with Pr(xi 1) qi and let S aef v 
 --  Ei----1 mi' 
k C (1,-..,v)' 
Pr($=k-1) Pr($=k-2) > _ 1 
Pr(S=k) Pr(S=k-1) - v<C>v 
where (id--efqi/(1 -- qi) and < c >v 
each 
Then for any (ql'",qv) and any 
Pr($ = O) 
Pr($: 1) 
def -v Oli/V. 
Proof: Omitted from this abstract but one can easily verify its exactness in the 
case where qi = q for all i = 1..., v. 3. 
The next lemma constitutes our main tool for finding the connectivity of f. It is 
expressed in terms of what we call the strong unimodal gap %: 
def { 1 1 } 
% = min , 
n<c>a n<l/c>a 
292 M. MARCHAND, S. HADJIFARADJI 
der x-n Oz -1/n. 
where: < 
Lemma 3 Let {xi,xj, xk} be any triple of variables such that each is a child of 
some TG perceptron. Then: 
1. C.  . > fn if {xi, xj, xl} are siblings of a perceptton g. 
2. C .. = 0 if x  desc(lca(xi,xj)) 
o 
C.  . = C k 
,3 i,z if {xi,xj,x} meet at a perceptron g that has a perceptron gj as 
a child with the property that both xj and xt feed gj. 
If  C.   > fn and xt feeds lca(xi, xj) through a perceptron gt which is not 
fed by (Xi,Xj) } Thell C,j > fn 3 . Infi(xt) 
Proof sketch: If {xi, xj,xk} are siblings of a perceptron g of threshold 0, then 
C. . = [Pr(S = 0- 1)/Pr(S = 0)]- [Pr(S = 0- 2)/Pr(S = 0- 1)] which, from 
lemma 2, establishes fact 1. Let g = lca(xi, xj). Then, if x  desc(lca(xi, xj)), 
Infi(xjlxi = a, xk = b) = Infi(glx = b) . Infia(xjlxi = a) which establishes fact 2. 
The proofs for fact 3 and 4 are omitted from this abstract. 
The constraints on xi and x in lemma 3 are to avoid vanishing denominators 
in Cj. This does not create any problems since by using simpler tools, we can 
always find the children of the TO, T1, TOO, Tll perceptrons before those of the TG 
perceptrons. In the following we also explain how to identify the non-TG bottom 
level perceptrons. 
Lemma 4 Variable xi is a child of a T1 perceptron iff there exist xj such that 
Infl(xjlxi = 1)=0. Otherwise Infl(xjlx i = 1)> Infi(xj)'fin for all xj  xi. 
Moreover, a set W of variables, each of which is a child of a T1 perceptton, is a set 
of siblings iffInfl(xjlx i ---- 1) ---- 0 V xi,xj}  W. 
Moreover, If { W _ V is a set of variables, all siblings of a T1 perceptron g, such 
that no children of g is in V-W  Then  g is a bottom level perceptron with respect 
to V iffInfi(xklxi = 1) > Infi(x). % for all x  V- W, xi  W. Otherwise there 
exist x  V - W and xi  W such that Infi(xlx i -- 1) = 0 ). 
The lemma is valid when we replace T1 by TO if the condition xi  I is replaced by 
X i =0. 
Proof idea: Directly follows from lemma 2, the definitions of T1 and TO and from 
the fact that no two consecutive T1 (nor TO) perceptrons occur in f. [] 
From this lemma, we define a routine, Find-bl-Tl(V), that finds all T1 perceptrons 
which are bottom level with respect to the set V of variables (or meta variables). It 
achieves this by testing, for each pair of variables, if Ifl(xjIxi ---- 1)/I{fi(xj) > Vn/2. 
(By using Chernoff bounds, we find the probability of making the correct decision 
for each variable as a function of the sample size m.) Moreover, the output of this 
routine is a set V  which consists of the original set V from which the siblings have 
been replaced by their bottom level T1 parents (with their children connected) as 
new meta variables. It also tags those variables in V that are children of some non- 
bottom level T1 perceptron. This is to warn the subsequent routines of not using 
these variables to find out if they are children of other types of perceptrons. An 
identical definition and a similar operation applies for Find-bl-T0(V). The same 
applies also for Find-bl-Tll(V), Find-bl-T00(V) and Find-bl-TG(V) but for 
them, we need to use the following lemmas. 
Strong Unimodality and Exact Learning of Constant Depth t-Perceptron Networks 293 
Lemma 5 Let xi and xj be variables which neither is a child of a T1 or a TO 
perceptton. Then {xi, xj} are siblings of a Tll perceptton iff there exist xk such 
that Infi(xklxi = xj-- 1)=0. Otherwise Infi(xklxi- xj- 1)> Infi(Xk).7n  for all 
Moreover, let V be a set of variables for which no one is a child of some TO or T1 
perceptron. Let W C_ V be a set of variables, all siblings of a Tll perceptron g and 
such that no children of g is in V - W. Then g is a bottom level perceptron with 
respect to V iff C. . = 0 for all xk E V - W and {xi, xj   W. Otherwise there 
exist x  V- W and {xi, xj)  W such that C. . > Infi(x).%3 
,3 ' 
The lemma is valid when we replace Tll by TOO if the condition xi -- x - I is 
replaced by xi = xj = O. 
Proof idea: Follows from lemmas 2 and 3 and from the definitions of Tll and TOO 
perceptrons. [] 
Lemma 6 Let V be a set of variables, each of which is a child of some TG per- 
ceptron. Let W C V be a set of variables for which C. .  % V {xi, x,x)  W. 
Then W is a set of siblings of a bottom level TG perceptton g (and thus g is bottom 
level with respect to V) iff there does not exist any {1, m)  V - W and {i,j)  W 
for which all of these properties hold: 
1. C,j > "/n 3. Infi(xl) and Cm. a. Infi(xm) 
2. C  - C j - 0 
l,m -- l,m -- 
Moreover, the threshold 0 of a bottom level TG perceptron g (in positive form) is 
obtained bythe value ofk for which Pr(f = 1]S = k+l)-Pr(f = 1IS = k) >_ Infi(xj) 
where xj can be any child of g and $ denotes the sum over all its children. This 
difference is zero if k  . 
Having sketched the action of the different Find-bl-T* routines, we now propose 
the following learning algorithm. 
Algorithm LearnNPN(n, es, 5) 
1. Call m = In 5 training examples. 
2. For every xi 6 X, let wi = +1 if Ifi(xi) > es/4n, let wi = --1 if Ifi(xi) < 
--%4n and let wi -- 0 otherwise. Let xi = 1- xi whenever wi = 1 
(conversion into the positive form). Let V = X. 
3. Repeat { 
Repeat { 
Repeat { 
Repeat {Let V = V; V = Find-bl-Tl(Find-bl-T0(V0) 
} Until V = V/ 
Let V = V; V = Find-bl-T00(V 0 
} Until V = V/ 
Let V = V; V = Find-bl-Tll(V 0 
} Until V =  
V = Find-bl-TG(V) 
} Until only one meta-variable remains in V 
2 94 M. MARCHAND, S. HADJIFARADJI 
4. Return this meta-variable (with all the others attached to it) as our hy- 
pothesis network h and convert from positive to normal form. 
The nested loops insure that, every time the set V of meta-variables is updated, all 
bottom level TO and T1 perceptrons are found before the TOO and Tll perceptrons 
which are themselves found before the TG perceptrons. This is essential in order 
that the Find-bl-T* routines make proper use of lemma 5 and 6. 
Theorem 1 Under product distributions, the algorithm LearnNPN exactly learns 
the class of I-perceptrons networks of depth at most d with weights taken from 
{-1,0, +1) and arbitrary thresholds. The algorithm runs in time of O(m x d x nS). 
Proof idea: By using Chernoff bounds [3], one can verify that the above sample of 
m examples is sufficient to ensure that all probabilities are estimated with enough 
precision to have h(x) = f(x) Vx c {0,1) n with probability at least I -6. 
Acknowledgments 
We thank Mostefa Golea and Hans U. Simon for useful comments and discussions 
about technical points. M. Marchand is supported by NSERC grant OGP0122405. 
Saeed Hadjifaradji is supported by the MCHE of Iran. 
References 
[1] Goldman S., Kearns M.J., & Schapire R., (1990). "Exact identification of circuits 
using fixed points of amplification functions. Proceedings of the 31st Symposium on 
Foundations of Computer Science, Los Alamitos, CA: IEEE Computer Society Press. 
[2] Golea M., Marchand M., & Hancock T.R., (1995) "On Learning/-Perceptron Net- 
works On the Uniform Distribution", to appear in Neural Networks. For a short 
version see: Advances in Neural Information Processing Systems 5, pp. 591-598, San 
Mateo CA, Morgan Kaufinann Publishers, (1993). Also: Computational Learning 
Theory: EuroCOLT'93, pp. 47-60, Oxford University Press, (1994). 
[3] Hagerup T. & Rub C. (1989) "A Guided Tour to Chernoff Bounds", Info. Proc. Left., 
Vol. 33, 305-308. 
[4] Hancock T.R., Golea M., & Marchand M., (1994) "Learning Nonoverlapping Percep- 
tron Networks from Examples and Membership Queries", Machine Learning, vol. 16, 
pp. 161-183. 
[5] Ibragimov I.A., (1956) "On the composition of unimodal distributions", Theor. Prob- 
ability Appl. vol. 1,255-266. Also: Keilson J. & Gerber H., (1971) "Some results for 
discrete unimodality", J. Amer. Statist. Assoc., vol. 66, 386-389. 
[6] Kearns M.J., Li M., Pitt L. & Valiant L. G., (1987). "On the learnability of boolean 
formulae" Proceedings of the Nineteenth Annual ACM Symposium on Theory of 
Computing (pp. 285-295). New York: ACM Press. 
[7] Kearns M. (1993). "Efficient Noise-Tolerant Learning from Statistical Queries", Pro- 
ceedings of the Twenty Fifth Annual ACM Symposium on the Theory of Computa- 
tion, p. 392. 
[8] Schapire R., (1994) "Learning probabilistic read-once formulas on product distribu- 
tions" Machine Learning vol. 14, 47-81. San Mateo, CA: Morgan Kaufman. 
[9] Valiant L.G. (1984) "A Theory of the Learnable", Comm. ACM, Vol. 27, 1134-1142. 
