Learning Stochastic Perceptrons Under 
k-Blocking Distributions 
Mario Marchand 
Ottawa-Carleton Institute for Physics 
University of Ottawa 
Ottawa, Ont., Canada K1N 6N5
mario@physics.uottawa.ca
Saeed Hadjifaradji 
Ottawa-Carleton Institute for Physics 
University of Ottawa 
Ottawa, Ont., Canada K1N 6N5
saeed@physics.uottawa.ca
Abstract 
We present a statistical method that PAC learns the class of stochastic
perceptrons with arbitrary monotonic activation function and weights
$w_i \in \{-1, 0, +1\}$ when the probability distribution that generates the
input examples is a member of a family that we call k-blocking distributions.
Such distributions represent an important step beyond the case where each input
variable is statistically independent since the 2k-blocking family contains all
the Markov distributions of order k. By stochastic perceptron we mean a
perceptron which, upon presentation of input vector x, outputs 1 with
probability $f(\sum_i w_i x_i - \theta)$. Because the same algorithm works for
any monotonic (nondecreasing or nonincreasing) activation function f on Boolean
domain, it handles the well studied cases of sigmoids and the "usual" radial
basis functions.
1 INTRODUCTION
Within recent years, the field of computational learning theory has emerged to
provide a rigorous framework for the design and analysis of learning algorithms.
A central notion in this framework, known as the "Probably Approximately
Correct" (PAC) learning criterion (Valiant, 1984), has recently been extended
(Haussler, 1992) to analyze the learnability of probabilistic concepts (Kearns
and Schapire, 1994; Schapire, 1992). Such concepts, which are stochastic rules
that give the probability that input example x is classified as being positive,
are natural probabilistic extensions of the deterministic concepts originally
studied by Valiant (1984).
Motivated by the stochastic nature of many "real-world" learning problems and by
the indisputable fact that biological neurons are probabilistic devices, some
preliminary studies about the PAC learnability of simple probabilistic neural
concepts have been reported recently (Golea and Marchand, 1993; Golea and
Marchand, 1994). However, the probabilistic behaviors considered in these
studies are quite specific and clearly need to be extended. Indeed, only
classification noise superimposed on a deterministic signum function was
considered in Golea and Marchand (1993). The probabilistic network analyzed in
Golea and Marchand (1994) consists of a linear superposition of signum functions
and is thus solvable as a (simple) case of linear regression. What is clearly
needed is the extension to the non-linear cases of sigmoids and radial basis
functions. Another criticism about Golea and Marchand (1993, 1994) is the fact
that their learnability results were established only for distributions where
each input variable is statistically independent from all the others (sometimes
called product distributions). In fact, very few positive learning results for
non-trivial p-concept classes are known to hold for larger classes of
distributions. Therefore, in an effort to find algorithms that will work in
practice, we introduce in this paper a new family of distributions that we call
k-blocking. As we will argue, this family has the dual advantage of avoiding
malicious and unnatural distributions that are prone to render simple concept
classes unlearnable (Lin and Vitter, 1991) and of being likely to contain
several distributions found in practice.
Our main contribution is to present a simple statistical method that PAC learns
(in polynomial time) the class of stochastic perceptrons with monotonic (but
otherwise arbitrary) activation functions and weights $w_i \in \{-1, 0, +1\}$
when the input examples are generated according to any distribution member of
the k-blocking family. Due to space constraints, only a sketch of the proofs is
presented here.
2 DEFINITIONS 
The instance (input) space, $I^n$, is the Boolean domain $\{-1, +1\}^n$. The set
of all input variables is denoted by X. Each input example x is generated
according to some unknown distribution D on $I^n$. We will often use $p_D(x)$,
or simply $p(x)$, to denote the probability of observing the vector value x
under distribution D. If U and V are two disjoint subsets of X, $x_U$ and $x_V$
will denote the restriction (or projection) of x over the variables of U and V
respectively and $p_D(x_U | x_V)$ will denote the probability, under
distribution D, of observing the vector value $x_U$ (for the variables in U)
given that the variables in V are set to the vector value $x_V$.
Following Kearns and Schapire (1994), a probabilistic concept (p-concept) is a
map $c: I^n \to [0, 1]$ for which $c(x)$ represents the probability that example
x is classified as positive. More precisely, upon presentation of input x, an
output of $\sigma = 1$ is generated (by an unknown target p-concept) with
probability $c(x)$ and an output of $\sigma = 0$ is generated with probability
$1 - c(x)$.
A stoclastic perceptton is a p-concept parameterized by a vector of n weights wi 
and a activation function f(.) such that, the probability that input example x is 
Learning Stochastic Perceptrons under k-Blocking Distributions 281 
classified as positive is given by 
i----1 
We consider the case of a non-linear function $f(\cdot)$ since the linear case
can be solved by a standard least-squares approximation like the one performed
by Kearns and Schapire (1994) for linear sums of basis functions. We restrict
ourselves to the case where $f(\cdot)$ is monotonic, i.e. either nondecreasing
or nonincreasing. But since any nonincreasing $f(\cdot)$ combined with a weight
vector w can always be represented by a nondecreasing $f(\cdot)$ combined with a
weight vector -w, we can assume without loss of generality that the target
stochastic perceptron has a nondecreasing $f(\cdot)$. Hence, we allow any
sigmoid-type of activation function (with arbitrary threshold). Also, since our
instance space $I^n$ is on an n-sphere, eq. 1 also includes any nonincreasing
radial basis function of the type $\phi(z^2)$ where $z = |x - w|$ and w is
interpreted as the "center" of $\phi$. The only significant restriction is on
the weights where we allow only for $w_i \in \{-1, 0, +1\}$.
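As an illustration of the radial basis remark, the following minimal sketch evaluates a stochastic perceptron and checks numerically that, on $\{-1,+1\}^n$, $|x - w|^2 = n + |w|^2 - 2\sum_i w_i x_i$, so that a nonincreasing function of $z^2$ is a nondecreasing function of $\sum_i w_i x_i$. The logistic activation, the threshold argument `theta`, the particular `phi` and all function names are assumptions made for the example; they do not come from the paper.

```python
import itertools
import math

def perceptron_prob(x, w, f):
    """Eq. 1: c(x) = f(sum_i w_i x_i), x in {-1,+1}^n, w in {-1,0,+1}^n."""
    return f(sum(wi * xi for wi, xi in zip(w, x)))

def logistic(theta):
    """Illustrative nondecreasing activation with threshold theta (assumed)."""
    return lambda s: 1.0 / (1.0 + math.exp(-(s - theta)))

def rbf_as_activation(w, phi):
    """Turn phi(|x - w|^2) into a function of s = sum_i w_i x_i.

    On {-1,+1}^n we have |x - w|^2 = n + |w|^2 - 2 s.
    """
    n = len(w)
    w_norm2 = sum(wi * wi for wi in w)
    return lambda s: phi(n + w_norm2 - 2 * s)

w = (1, -1, 0, 1)
phi = lambda z2: math.exp(-z2 / 8.0)  # illustrative nonincreasing phi
f_rbf = rbf_as_activation(w, phi)

# Largest gap between the direct RBF value and its perceptron form.
max_gap = 0.0
for x in itertools.product((-1, 1), repeat=len(w)):
    direct = phi(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))
    max_gap = max(max_gap, abs(direct - perceptron_prob(x, w, f_rbf)))
```

Here `max_gap` stays at zero over all 16 inputs, and `f_rbf` is nondecreasing in s, which is the form assumed for the target activation function.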
As usual, the goal of the learner is to return a hypothesis h which is a good
approximation of the target p-concept c. But, in contrast with decision rule
learning which attempts to "filter out" the noisy behavior by returning a
deterministic hypothesis, the learner will attempt the harder (and more useful)
task of modeling the target p-concept by returning a p-concept hypothesis. As a
measure of error between the target and the hypothesis p-concepts we adopt the
variation distance $d_v(\cdot, \cdot)$ defined as:
$$ \mathrm{err}(h, c) = d_v(h, c) \stackrel{\mathrm{def}}{=} \sum_x p_D(x)\, |h(x) - c(x)| \qquad (2) $$
where the summation is over all the $2^n$ possible values of x. Hence, the same
D is used for both training and testing. The following formulation of the PAC
criterion (Valiant, 1984; Haussler, 1992) will be sufficient for our purpose.
Definition 1 Algorithm A is said to PAC learn the class C of p-concepts by using
the hypothesis class H (of p-concepts) under a family $\mathcal{D}$ of
distributions on instance space $I^n$, iff for any $c \in C$, any
$D \in \mathcal{D}$, any $0 < \epsilon, \delta \le 1$, algorithm A returns in a
time polynomial in $(1/\epsilon, 1/\delta, n)$, a hypothesis $h \in H$ such that
with probability at least $1 - \delta$, $\mathrm{err}(h, c) \le \epsilon$.
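The error measure used in this definition (the variation distance of eq. 2) can be computed by brute force for small n. The sketch below is only an illustration; the uniform distribution, the target p-concept and the constant hypothesis are made-up examples, not objects from the paper.

```python
import itertools

def variation_distance(h, c, p, n):
    """Eq. 2: sum over all x in {-1,+1}^n of p(x) * |h(x) - c(x)|."""
    return sum(p(x) * abs(h(x) - c(x))
               for x in itertools.product((-1, 1), repeat=n))

n = 3
uniform = lambda x: 1.0 / 2 ** n                 # illustrative distribution D
target = lambda x: 0.9 if x[0] == 1 else 0.1     # illustrative p-concept c
constant = lambda x: 0.5                         # a poor hypothesis h
err = variation_distance(constant, target, uniform, n)
```

For this toy target the constant hypothesis is off by 0.4 on every input, so `err` is 0.4 up to rounding; a hypothesis equal to the target has error 0.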
3 K-BLOCKING DISTRIBUTIONS 
To learn the class of stochastic perceptrons, the algorithm will try to discover
each weight $w_i$ that connects to input variable $x_i$ by estimating how the
probability of observing a positive output ($\sigma = 1$) is affected by
"hard-wiring" variable $x_i$ to some fixed value. This should clearly give some
information about $w_i$ when $x_i$ is statistically independent from all the
other variables as was the case for Golea and Marchand (1993) and Schapire
(1992). However, if the input variables are correlated, then the process of
fixing variable $x_i$ will carry over to neighboring variables which in turn
will affect other variables until all the variables are perturbed (even in the
simplest case of a first order Markov chain). The information about $w_i$ will
then be smeared by all the other weights. Therefore, to obtain information only
on $w_i$, we need to break this "chain reaction" by fixing some other variables.
The notion of blocking sets serves this purpose.
Loosely speaking, a set of variables is said to be a blocking set¹ for variable
$x_i$ if the distribution on all the remaining variables is unaffected by the
setting of $x_i$ whenever all the variables of the blocking set are set to a
fixed value. More precisely, we have:
Definition 2 Let B be a subset of X and let $U = X - (B \cup \{x_i\})$. Let
$x_B$ and $x_U$ be the restriction of x on B and U respectively and let b be an
assignment for $x_B$. Then B is said to be a blocking set for variable $x_i$
(with respect to D), iff:
$$ p_D(x_U | x_B = b, x_i = +1) = p_D(x_U | x_B = b, x_i = -1) \quad \text{for all } b \text{ and } x_U $$
In addition, if B is no longer a blocking set when we remove any one of its
variables, we then say that B is a minimal blocking set for variable $x_i$.
We thus adopt the following definition for the k-blocking family. 
Definition 3 Distribution D on $I^n$ is said to be k-blocking iff $|B_i| \le k$
for $i = 1, 2, \ldots, n$ when each $B_i$ is a minimal blocking set for variable
$x_i$.
The k-blocking family is quite a large class of distributions. In fact we have the 
following property: 
Property 1 All Markov distributions of kth order are members of the 2k-blocking
family.
Proof: By kth order Markov distributions, we mean distributions which can be
exactly written as a Chow(k) expansion (see Hoeffgen, 1993) for some permutation
of the variables. We prove it here (by using standard techniques such as in
Abend et al., 1965) for first order Markov distributions; the generalization for
k > 1 is straightforward. Recall that for Markov chain distributions we have:
$p(x_j | x_{j-1}, \ldots, x_1) = p(x_j | x_{j-1})$ for $1 < j \le n$. Hence:
$$
\begin{aligned}
& p(x_1 \ldots x_{j-2}, x_{j+2} \ldots x_n \mid x_{j-1}, x_j, x_{j+1}) \\
&= p(x_1)\, p(x_2|x_1) \cdots p(x_j|x_{j-1})\, p(x_{j+1}|x_j) \cdots p(x_n|x_{n-1}) \,/\, p(x_{j-1}, x_j, x_{j+1}) \\
&= p(x_1)\, p(x_2|x_1) \cdots p(x_{j-1}|x_{j-2})\, p(x_{j+2}|x_{j+1}) \cdots p(x_n|x_{n-1}) \,/\, p(x_{j-1}) \\
&= p(x_1 \ldots x_{j-2}, x_{j+2} \ldots x_n \mid x_{j-1}, \bar{x}_j, x_{j+1})
\end{aligned}
$$
where $\bar{x}_j$ denotes the negation of $x_j$. Thus, we see that Markov chain
distributions are a special case of 2-blocking distributions: the blocking set
of each variable consisting only of the two first-neighbor variables. □
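The first-order case can also be checked numerically on a small chain. The following sketch tests Definition 2 by brute-force enumeration; the homogeneous chain, its parameters `p_first` and `p_stay`, the 0-based variable indices and the function names are all assumptions made for the illustration.

```python
import itertools

def chain_prob(x, p_first=0.3, p_stay=0.8):
    """p(x) = p(x_1) * prod_j p(x_j | x_{j-1}) for a homogeneous chain (assumed)."""
    prob = p_first if x[0] == 1 else 1.0 - p_first
    for prev, cur in zip(x, x[1:]):
        prob *= p_stay if cur == prev else 1.0 - p_stay
    return prob

def is_blocking_set(p, n, i, B, tol=1e-12):
    """Brute-force check of Definition 2 for variable index i and index list B."""
    U = [j for j in range(n) if j != i and j not in B]
    points = list(itertools.product((-1, 1), repeat=n))
    for b in itertools.product((-1, 1), repeat=len(B)):
        cond = {}
        for v in (+1, -1):
            match = [x for x in points
                     if x[i] == v and all(x[j] == bj for j, bj in zip(B, b))]
            total = sum(p(x) for x in match)
            if total == 0.0:
                break  # zero-probability conditioning event: nothing to compare
            # Conditional distribution of x_U given x_B = b and x_i = v.
            cond[v] = {tuple(x[j] for j in U): p(x) / total for x in match}
        else:
            if any(abs(cond[+1][u] - cond[-1][u]) > tol for u in cond[+1]):
                return False
    return True

n = 5
neighbours_block = is_blocking_set(chain_prob, n, 2, [1, 3])  # both neighbours
one_side_only = is_blocking_set(chain_prob, n, 2, [1])        # one neighbour
end_point = is_blocking_set(chain_prob, n, 0, [1])            # chain end
```

On this chain the two neighbours of an interior variable form a blocking set, a single neighbour does not, and an end variable is blocked by its only neighbour.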
The proposed algorithm for learning stochastic perceptrons needs to be provided
with a blocking set (of at most k variables) for each input variable. Hoeffgen
(1993) has recently proven that Chow(1) and Chow(k > 1) expansions are
efficiently learnable; the latter under some restricted conditions. We can thus
use these algorithms to discover the blocking sets for such distributions.
However, the efficient learnability of unrestricted Chow(k > 1) expansions and
larger classes of distributions, such as the k-blocking family, is still
unknown. In fact, from the hardness results of Hoeffgen (1993), we can see that
it is definitely very hard (perhaps NP-complete) to find the blocking sets if
the learner has no information available other than the fact that the
distribution is k-blocking. On the other hand, we can argue that the "natural"
ordering of the variables present in many "real-world" situations is such that
the blocking set of any given variable is among the neighboring variables. In
vision for example, we expect that the setting of a pixel will directly affect
only those located in its neighborhood; the other pixels being affected only
through this neighborhood. In such cases, the neighborhood of a variable
"naturally" provides its blocking set.
¹ The wording "blocking set" was also used by Hancock & Mansour (Proc. of
COLT'91, 179-183, Morgan Kaufmann Publ.) to denote a property of the target
concept. In contrast, our definition of blocking set denotes a property of the
input distribution only.
4 LEARNING STOCHASTIC PERCEPTRONS 
We first establish (the intuitive fact) that, without making much error, we can 
always consider that the target p-concept is defined only over the variables which 
are not almost always set to the same value. 
Lemma 1 Let V be a set of v variables $x_i$ for which
$\Pr(x_i = a_i) > 1 - \alpha$. Let c be a p-concept and let c' be the same
p-concept as c except that the reading of each variable $x_i \in V$ is replaced
by the reading of the constant value $a_i$. Then
$\mathrm{err}(c', c) < v \cdot \alpha$.
Proof: Let a be the vector obtained from the concatenation of all $a_i$'s and
let $x_V$ be the vector obtained from x by keeping only the components $x_i$
which are in V. Then
$\mathrm{err}(c', c) \le \Pr(x_V \ne a) \le \sum_{i \in V} \Pr(x_i \ne a_i) < v \cdot \alpha$. □
For a given set of blocking sets $\{B_i\}_{i=1}^n$, the algorithm will try to
discover each weight $w_i$ by estimating the blocked influence of $x_i$ defined
as:
$$ \mathrm{Binf}(x_i | b_i) \stackrel{\mathrm{def}}{=} \Pr(\sigma = 1 | x_{B_i} = b_i, x_i = +1) - \Pr(\sigma = 1 | x_{B_i} = b_i, x_i = -1) $$
where $x_{B_i}$ denotes the restriction of x on the blocking set $B_i$ for
variable $x_i$ and $b_i$ is an assignment for $x_{B_i}$. The following lemma
assures the learner that $\mathrm{Binf}(x_i | b_i)$ contains enough information
about $w_i$.
Lemma 2 Let the target p-concept be a stochastic perceptron on $I^n$ having a
nondecreasing activation function and weights taken from $\{-1, 0, +1\}$. Then,
for any assignment $b_i$ for the variables in the blocking set $B_i$ of variable
$x_i$, we have:
$$
\mathrm{Binf}(x_i | b_i) \;
\begin{cases}
\ge 0 & \text{if } w_i = +1 \\
= 0 & \text{if } w_i = 0 \\
\le 0 & \text{if } w_i = -1
\end{cases}
\qquad (3)
$$
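Proof sketch: Let $U = X - (B_i \cup \{x_i\})$, $s = \sum_{j \in U} w_j x_j$ and
$\xi = \sum_{k \in B_i} w_k b_k$. Let $p(s)$ denote the probability of observing
s (under D) when $x_{B_i} = b_i$; by Definition 2, this probability is the same
for $x_i = +1$ and $x_i = -1$. Then
$\mathrm{Binf}(x_i | b_i) = \sum_s p(s)\, [f(s + \xi + w_i) - f(s + \xi - w_i)]$,
from which we find the desired result for a nondecreasing $f(\cdot)$.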
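The sign pattern of eq. 3 can be verified exactly on a small example. The sketch below computes the blocked influence by enumeration for a first-order Markov chain, using the two neighbours of each variable as its blocking set; the chain parameters, the logistic activation, the target weights and the helper names are illustrative assumptions.

```python
import itertools
import math

def chain_prob(x, p_first=0.4, p_stay=0.7):
    """Homogeneous first-order Markov chain on {-1,+1}^n (illustrative)."""
    prob = p_first if x[0] == 1 else 1.0 - p_first
    for prev, cur in zip(x, x[1:]):
        prob *= p_stay if cur == prev else 1.0 - p_stay
    return prob

def blocked_influence(c, p, n, i, B, b):
    """Binf(x_i | b) computed exactly by enumerating {-1,+1}^n."""
    def positive_rate(v):
        # Pr(sigma = 1 | x_B = b, x_i = v)
        num = den = 0.0
        for x in itertools.product((-1, 1), repeat=n):
            if x[i] == v and all(x[j] == bj for j, bj in zip(B, b)):
                den += p(x)
                num += p(x) * c(x)
        return num / den
    return positive_rate(+1) - positive_rate(-1)

w = (1, 0, -1, 1, -1)                       # illustrative target weights
f = lambda s: 1.0 / (1.0 + math.exp(-s))    # illustrative nondecreasing f
c = lambda x: f(sum(wi * xi for wi, xi in zip(w, x)))
n = len(w)
neighbours = [[j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)]

signs_ok = True
for i in range(n):
    for b in itertools.product((-1, 1), repeat=len(neighbours[i])):
        binf = blocked_influence(c, chain_prob, n, i, neighbours[i], b)
        if w[i] == 0:
            signs_ok &= abs(binf) < 1e-12   # zero up to rounding
        else:
            signs_ok &= w[i] * binf > 0     # same sign as w_i
```

With a strictly increasing f, as here, the influence is strictly positive or negative whenever the weight is nonzero; for a merely nondecreasing f it can vanish, which is why eq. 3 only gives weak inequalities.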
In principle, lemma 2 enables the learner to discover $w_i$ from
$\mathrm{Binf}(x_i | b_i)$. The learner, however, has only access to its
empirical estimate $\widehat{\mathrm{Binf}}(x_i | b_i)$ from a finite sample.
Hence, we will use Hoeffding's inequality (Hoeffding, 1963) to find the number
of examples needed for a probability p to be close to its empirical estimate
$\hat{p}$ with high probability.
Lemma 3 (Hoeffding, 1963) Let $Y_1, \ldots, Y_m$ be a sequence of m independent
Bernoulli trials, each succeeding with probability p. Let
$\hat{p} = \sum_{i=1}^{m} Y_i / m$. Then:
$$ \Pr(|\hat{p} - p| > \epsilon) \le 2 \exp(-2 m \epsilon^2) $$
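Solving $2\exp(-2m\epsilon^2) \le \delta$ for m gives $m \ge \ln(2/\delta) / (2\epsilon^2)$. A minimal sketch of this arithmetic follows; the function name is an assumption, and the bound applies to a single estimated probability, not to the full sample size of the learning algorithm.

```python
import math

def hoeffding_sample_size(eps, delta):
    """Smallest integer m with 2 * exp(-2 * m * eps**2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

m = hoeffding_sample_size(0.1, 0.05)
```

For example, estimating one probability to within 0.1 with confidence 0.95 needs 185 independent trials by this bound.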
Hence, by writing $\mathrm{Binf}(x_i | b_i)$ in terms of (unconditional)
probabilities that can be estimated from all the training examples, we find from
lemma 3 that the number $m_0(\epsilon, \delta, n)$ of examples needed to have
$|\widehat{\mathrm{Binf}}(x_i | b_i) - \mathrm{Binf}(x_i | b_i)| < \epsilon$
with probability at least $1 - \delta$ is given by:
$$ m_0(\epsilon, \delta, n) > [\ldots] \, \ln [\ldots] $$
where $\kappa = \alpha^{k+1}$ is the lowest permissible value for
$p_D(b_i, x_i)$ (see lemma 1). So, if the minimal nonzero value for
$|\mathrm{Binf}(x_i | b_i)|$ is $\beta$, then the number of examples needed to
find, with confidence at least $1 - \delta$, the exact value for $w_i$ among
$\{-1, 0, +1\}$ is such that we need to have:
$\Pr(|\widehat{\mathrm{Binf}}(x_i | b_i) - \mathrm{Binf}(x_i | b_i)| < \beta/2) > 1 - \delta$.
Thus, whenever $\beta$ is $\Omega(e^{-n})$, we will need $O(e^{2n})$ examples to
find (with probability greater than $1 - \delta$) the value for $w_i$. So, in
order to be able to PAC learn from a polynomial sample, we must arrange matters
so that we do not need to worry about such low values for
$|\mathrm{Binf}(x_i | b_i)|$. We therefore consider the maximum blocked
influence defined as:
$$ \mathrm{Binf}(x_i) \stackrel{\mathrm{def}}{=} \mathrm{Binf}(x_i | b_i^*) $$
where $b_i^*$ is the vector value for which $|\mathrm{Binf}(x_i | b_i)|$ is the
largest. We now show that the learner can ignore all variables $x_i$ for which
$|\mathrm{Binf}(x_i)|$ is too small (without making much error).
Lemma 4 Let c be a stochastic perceptron with nondecreasing activation function
$f(\cdot)$ and weights taken from $\{-1, 0, +1\}$. Let $V \subset X$ and let
$c_V$ be the same stochastic perceptron as c except that $w_i = 0$ for all
$x_i \in V$ and its activation function is changed to $f(\cdot + \theta)$. Then,
there always exists a value for $\theta$ such that:
$$ \mathrm{err}(c_V, c) \le \sum_{i \in V} |\mathrm{Binf}(x_i)| $$
Proof sketch: By induction on $|V|$. To first verify the lemma for
$V = \{x_1\}$, let b be a vector of values for the setting of all
$x_i \in B_1$ and let $x_U$ be a vector of values for the setting of all
$x_j \in U = X - (B_1 \cup \{x_1\})$. Let $s = \sum_{j \in U} w_j x_j$ and
$\xi = \sum_{j \in B_1} w_j x_j$, then for $\theta = w_1$, we have:
$$
\mathrm{err}(c_V, c) = \sum_{x_U} \sum_{b} p_D(x_U | b)\, p_D(b | x_1 = -1)\, p_D(x_1 = -1)
\times |f(s + \xi + w_1) - f(s + \xi - w_1)| \le |\mathrm{Binf}(x_1)|
$$
We now assume that the lemma holds for $V = \{x_1, x_2, \ldots, x_k\}$ and prove
it for $W = V \cup \{x_{k+1}\}$. Let $S = \{x_{k+1}\}$ and let
$f(\cdot + \theta_W)$, $f(\cdot + \theta_V)$ and $f(\cdot + \theta_S)$ denote
respectively the activation function for $c_W$, $c_V$ and $c_S$. By inspecting
the expressions for $\mathrm{err}(c_V, c)$ and $\mathrm{err}(c_W, c_S)$, we can
see that there always exists a value for
$\theta_W \in \{\theta_V + w_{k+1}, \theta_V - w_{k+1}\}$ and
$\theta_S \in \{w_{k+1}, -w_{k+1}\}$ such that
$\mathrm{err}(c_W, c_S) \le \mathrm{err}(c_V, c)$. And since $d_v(\cdot, \cdot)$
satisfies the triangle inequality,
$\mathrm{err}(c_W, c) \le \mathrm{err}(c_V, c) + |\mathrm{Binf}(x_{k+1})|$. □
After discovering the weights, the hypothesis p-concept h returned by the
learner will simply be the table look-up of the estimated probabilities of
observing a positive classification given that $\sum_{i=1}^{n} w_i x_i = s$ for
all s values that are observed with sufficient probability (the hypothesis can
output any value for the values of s that are observed very rarely). We thus
have the following learning algorithm for stochastic perceptrons.
Algorithm LearnSP($n, \epsilon, \delta, \{B_i\}_{i=1}^n$)
1. Call $m = 128\, ([\ldots])^{2k+4} \ln([\ldots])$ training examples (where
   $k = \max_i |B_i|$).
2. Compute $\widehat{\Pr}(x_i = +1)$ for each variable $x_i$. Neglect $x_i$
   whenever we have $\widehat{\Pr}(x_i = +1) < \epsilon/(4n)$ or
   $\widehat{\Pr}(x_i = +1) > 1 - \epsilon/(4n)$.
3. For each variable $x_i$ and for each of its blocking vector values $b_i$,
   compute $\widehat{\mathrm{Binf}}(x_i | b_i)$. Let $b_i^*$ be the value of
   $b_i$ for which $|\widehat{\mathrm{Binf}}(x_i | b_i)|$ is the largest. Let
   $\widehat{\mathrm{Binf}}(x_i) = \widehat{\mathrm{Binf}}(x_i | b_i^*)$.
4. For each variable $x_i$:
   (a) Let $w_i = +1$ whenever $\widehat{\mathrm{Binf}}(x_i) > \epsilon/(4n)$.
   (b) Let $w_i = -1$ whenever $\widehat{\mathrm{Binf}}(x_i) < -\epsilon/(4n)$.
   (c) Otherwise let $w_i = 0$.
5. Compute $\widehat{\Pr}(\sum_{i=1}^{n} w_i x_i = s)$ for $s = -n, \ldots, +n$.
6. Return the hypothesis p-concept h formed by the table look-up:
   $$ h(x) = h'(s) = \widehat{\Pr}\left(\sigma = 1 \,\Big|\, \sum_{i=1}^{n} w_i x_i = s\right) $$
   for all s for which
   $\widehat{\Pr}(\sum_{i=1}^{n} w_i x_i = s) > \epsilon/(8n + 8)$. For the
   other s values, let $h'(s) = 0$ (or any other value).
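Steps 2 to 6 can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the caller supplies the sample instead of drawing the m examples of step 1, variables are indexed from 0, the blocked influence is estimated directly from conditional frequencies (skipping assignments for which one of the two values of $x_i$ was never observed), and the name `learn_sp` and the toy data are hypothetical.

```python
import itertools
from collections import Counter, defaultdict

def learn_sp(samples, n, eps, blocking_sets):
    """Sketch of steps 2-6 of LearnSP on a list of (x, sigma) pairs."""
    m = len(samples)
    thr = eps / (4 * n)
    # Step 2: neglect variables that are almost always set to the same value.
    freq = [sum(1 for x, _ in samples if x[i] == 1) / m for i in range(n)]
    active = [i for i in range(n) if thr <= freq[i] <= 1 - thr]
    w = [0] * n
    for i in active:
        # Step 3: empirical blocked influence for each observed assignment b_i.
        # count[b][v] = [positives, total] among examples with x_B = b, x_i = v.
        count = defaultdict(lambda: {+1: [0, 0], -1: [0, 0]})
        for x, sigma in samples:
            cell = count[tuple(x[j] for j in blocking_sets[i])][x[i]]
            cell[0] += sigma
            cell[1] += 1
        best = 0.0
        for cells in count.values():
            (pos_p, tot_p), (pos_m, tot_m) = cells[+1], cells[-1]
            if tot_p and tot_m:
                binf = pos_p / tot_p - pos_m / tot_m
                if abs(binf) > abs(best):
                    best = binf
        # Step 4: threshold the maximum blocked influence.
        w[i] = 1 if best > thr else -1 if best < -thr else 0
    # Steps 5-6: table look-up of Pr(sigma = 1 | sum_i w_i x_i = s).
    tot, pos = Counter(), Counter()
    for x, sigma in samples:
        s = sum(wi * xi for wi, xi in zip(w, x))
        tot[s] += 1
        pos[s] += sigma
    table = {s: pos[s] / tot[s] for s in tot if tot[s] / m > eps / (8 * n + 8)}
    h = lambda x: table.get(sum(wi * xi for wi, xi in zip(w, x)), 0.0)
    return w, h

# Deterministic toy sample (all values illustrative): uniform inputs on
# {-1,+1}^3, target weights (1, -1, 0), and the number of positives per 10
# copies of each input given by s -> {-2: 1, 0: 5, 2: 9}.
positives = {-2: 1, 0: 5, 2: 9}
samples = []
for x in itertools.product((-1, 1), repeat=3):
    k = positives[x[0] - x[1]]
    samples += [(x, 1)] * k + [(x, 0)] * (10 - k)
w_hat, h = learn_sp(samples, n=3, eps=0.2, blocking_sets=[[1], [0, 2], [1]])
```

On this toy sample the sketch recovers the weights (1, -1, 0) and the look-up table reproduces the positive rates 0.1, 0.5 and 0.9. Under a uniform (product) distribution any set of other variables is a blocking set, which is why the neighbour sets passed here are admissible.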
Theorem 1 Algorithm LearnSP PAC learns the class of stochastic perceptrons on
$I^n$ with monotonic activation functions and weights $w_i \in \{-1, 0, +1\}$
under any k-blocking distribution (when a blocking set for each variable is
known). The number of examples required is
$m = 128\, ([\ldots])^{2k+4} \ln([\ldots])$ (and the time needed is
$O(n \times m)$) for the returned hypothesis to make error at most $\epsilon$
with confidence at least $1 - \delta$.
Proof sketch: From Hoeffding's inequality (lemma 3) we can show that this sample
size is sufficient to ensure that:
• $|\widehat{\Pr}(x_i = +1) - \Pr(x_i = +1)| < \epsilon/(4n)$ with confidence at
  least $1 - \delta/(4n)$
• $|\widehat{\mathrm{Binf}}(x_i) - \mathrm{Binf}(x_i)| < \epsilon/(4n)$ with
  confidence at least $1 - \delta/(4n)$
• $|\widehat{\Pr}(\sum_{i=1}^{n} w_i x_i = s) - \Pr(\sum_{i=1}^{n} w_i x_i = s)| < \epsilon^2/[64(n + 1)]$
  with confidence at least $1 - \delta/(4n + 4)$
• $[\ldots] < \epsilon/4$ with confidence at least $1 - \delta/4$
From this and from lemmas 1, 2 and 4, it follows that the returned hypothesis
will make error at most $\epsilon$ with confidence at least $1 - \delta$. □
Acknowledgments 
We thank Mostefa Golea, Klaus-U. Hoeffgen and Stefan Poelt for useful comments
and discussions about technical points. M. Marchand is supported by NSERC grant
OGP0122405. Saeed Hadjifaradji is supported by the MCHE of Iran.
References 
Abend K., Hartley T.J. & Kanal L.N. (1965) "Classification of Binary Random
Patterns", IEEE Trans. Inform. Theory vol. IT-11, 538-544.
Golea M. & Marchand M. (1993) "On Learning Perceptrons with Binary Weights",
Neural Computation vol. 5, 765-782.
Golea M. & Marchand M. (1994) "On Learning Simple Deterministic and
Probabilistic Neural Concepts", in Shawe-Taylor J., Anthony M. (eds.),
Computational Learning Theory: EuroCOLT'93, Oxford University Press, pp. 47-60.
Haussler D. (1992) "Decision Theoretic Generalizations of the PAC Model for
Neural Net and Other Learning Applications", Information and Computation
vol. 100, 78-150.
Hoeffgen K.U. (1993) "On Learning and Robust Learning of Product Distributions",
Proceedings of the 6th ACM Conference on Computational Learning Theory, ACM
Press, 77-83.
Hoeffding W. (1963) "Probability inequalities for sums of bounded random
variables", Journal of the American Statistical Association, vol. 58(301),
13-30.
Kearns M.J. & Schapire R.E. (1994) "Efficient Distribution-free Learning of
Probabilistic Concepts", Journal of Computer and System Sciences, vol. 48,
464-497.
Lin J.H. & Vitter J.S. (1991) "Complexity Results on Learning by Neural Nets",
Machine Learning, vol. 6, 211-230.
Schapire R.E. (1992) The Design and Analysis of Efficient Learning Algorithms,
Cambridge MA: MIT Press.
Valiant L.G. (1984) "A Theory of the Learnable", Comm. ACM, vol. 27, 1134-1142.
