Polynomial Uniform Convergence of 
Relative Frequencies to Probabilities 
Alberto Bertoni, Paola Campadelll;' Anna Morpurgo, Sandra Panlzza 
Dipartimento di Scienze dell'Informazione 
Universitk degli Studi di Milano 
via Comelico, 39 - 20135 Milano- Italy 
Abstract 
We define the concept of polynomial uniform convergence of relative 
frequencies to probabilities in the distribution-dependent context. Let 
X = {0, 1} , let P be a probability distribution on X and let F C 2 x' 
be a family of events. The family {(X,,P,, F,)}>i has the property 
of polynomial uniform convergence if the probability that the maximum 
difference (over F) between the relative frequency and the probabil- 
ity of an event exceed a given positive e be at most 5 (0 < 5 < 1), 
when the sample on which the frequency is evaluated has size polynomial 
in n, 1/e,1/5. Given a t-sample (Xl,...,xt), let C(nt)(Xl,...,Xt) be the 
Vapnik-Chervonenkis dimension of the family {{Xl,...,xt} FI f lf e Fn} 
and M(n,t) the expectation E(C(nt)/t). We show that {(Xn,Pn, Fn)}n>i 
has the property of polynomial uniform convergence iff there exists/ > 0 
such that M(n,t) = O(n/t). Applications to distribution-dependent 
PAC learning are discussed. 
1 INTRODUCTION 
The probably approximately correct (PAC) learning model proposed by Valiant 
[Valiant, 1984] provides a complexity theoretical basis for learning from examples 
produced by an arbitrary distribution. As shown in [Blumer et al., 1989], a cen- 
*Also at CNR, Istituto di Fisiologia dei Centri Nervosi, via Mario Bianco 9, 20131 
Milano, Italy. 
904 
Polynomial Uniform Convergence 905 
tral notion for distribution-free learnability is the Vapnik-Chervonenkis dimension, 
which allows obtaining estimations of the sample size adequate to learn at a given 
level of approximation and confidence. This combinatorial notion has been defined 
in [Vapnik k Chervonenkis, 1971] to study the problem of uniform convergence of 
relative frequencies of events to their corresponding probabilities in a distribution- 
free framework. 
In this work we define the concept of polynomial uniform convergence of relative 
frequencies of events to probabilities in the distribution-dependent setting. More 
precisely, consider, for any n, a probability distribution on {0, 1} n and a family of 
events Fn _C 2{,1}': our request is that the probability that the maximum difference 
(over Fn) between the relative frequency and the probability of an event exceed a 
given arbitrarily small positive constant e be at most 5 (0 < 5 < 1) when the sample 
on which we evaluate the relative frequencies has size polynomial in n, l/e, 1/5. 
The main result we present here is a necessary and sufficient condition for polyno- 
mial uniform convergence in terms of "average information per example". 
In section 2 we give preliminary notations and results; in section 3 we introduce the 
concept of polynomial uniform convergence in the distribution-dependent context 
and we state our main result, which we prove in section 4. Some applications to 
distribution-dependent PAC learning are discussed in section 5. 
2 PRELIMINARY DEFINITIONS AND RESULTS 
Let X be a set of elementary events on which a probability measure P is defined 
and let F be a collection of boolean functions on X, i.e. functions f  X - {0, 1}. 
For f  F the set f-(1) is said event, and 7! denotes its probability. A t-sample 
(or sample ofsize t) on X is a sequence _x = (31,...,St) , where z,  X (1 _< k _< t). 
Let X (t) denote the space of t-samples and p(t) the probability distribution induced 
by P on X (t), such that P(t)(z,. ..,zt) = P(z)P(z2). . . P(zt). 
Given a t-sample __x and a set f  F, let v3t)(_x) be the relative frequency of f in the 
t-sample _x, i.e. 
vSt)(_x) = E=l f(x,) 
t 
Consider now the random variable II?  X (t) -- [0 1], defined over 
where 
II()(x,...,xt) = sup I/5t)(31,...,3t)- 'Pj I' 
(X(t),P(t)), 
The relative frequencies of the events are said to converge to the probabilities uni- 
formly over F if, for every e > 0, limt--.oo P(t){__x I II?(_x) > e} = 0. 
In order to study the problem of uniform convergence of the relative frequencies 
to the probabilities, the notion of index A?(_x) of a family F with respect to a 
t-sample _x has been introduced [Vapnik & Chervonenkis, 1971]. Fixed a t-sample 
__--- (Xl,.. 
A,(_x) = #{f-(1)91 {x,...,xt}lf  F}. 
906 Bertoni, Campadelli, Morpurgo, and Panizza 
Obviously AF(Xl,...,xt) _< 2t; a set {;rl,...,xt} is said shattered by F iff 
AF(Xi,...,x) = 2t; the maximum t such that there is a set {xi,...,x} shat- 
tered by F is said the Vapnik-Chervonenkis dimension dr of F. The following 
result holds [Vapnik &: Chervonenkis, 1971]. 
Theorem 2.1 For all distribution probabilities on X, the relative frequencies of the 
events converge (in probability) to their corresponding probabilities uniformly over 
F iff dr < oo. 
We recall that the Vapnik-Chervonenkis dimension is a very useful notion in 
the distribution-independent PAC learning model [Blumer et al., 1989]. In the 
distribution-dependent framework, where the probability measure P is fixed and 
known, let us consider the expectation E[log 2 Ar(_X)], called entropy HE(t) of the 
family F in samples of size t; obviously Hr(t) depends on the probability distribu- 
tion P. The relevance of this notion is showed by the following result [Vapnik &: 
Chervonenkis, 1971]. 
Theorem 2.2 A necessary and sufficient condition for the relative frequencies of 
the events in F to converge uniformly over F (in probability) to their corresponding 
probabilities is that 
lira Hr(t) = O. 
3 POLYNOMIAL UNIFORM CONVERGENCE 
Consider the family {(Xn,Pn,Fn))n_>l, where X. = {0,1)", P. is a probability 
distribution on X. and F. is a family of boolean functions on X.. 
Since Xn is finite, the frequencies trivially converge uniformly to the probabilities; 
therefore we are interested in studying the problem of convergence with constraints 
on the sample size. To be more precise, we introduce the following definition. 
Definition 3.1 Given 
events in Fn converge 
over F, iff there ezists 
the family { {Xn, Pn, Fn) }n>_l, the relative frequencies of the 
polynomially to their corresponding probabilities uniformly 
a polynomial p(n, l/e, 1/5) such that 
Ve,5>OVn(t>p(n,1/e,1/5)=P(t){_xl r.t_x) > < 
In this context e and 5 are the approximation and confidence parameters, respec- 
tively. 
The problem we consider now is to characterize the family {{X, P., F.)}n>_l such 
that the relative frequencies of events in F. converge polynomially to the probabil- 
ities. Let us introduce the random variable [t) . X(.t) --, N, defined as 
C(nt)(Xl,...,X,) = max{#A I A C_ {1,.- .,,) A A is shattered by Fn }. 
In this notation it is understood that C( t) refers to F,. The random variable C[ t) 
and the index function AF, are related to one another; in fact, the following result 
can be easily proved. 
Polynomial Uniform Convergence 907 
Lemma 3.1 C(n t)(x) _< log A?,, () _< (n t)(x) log t. 
Let M(n, t) = E(-7-- ) be the expectation of the random variable T 
3.1 readily follows that 
 From Lemma 
M(n,t) < HF.(t) < M(n,t)logt' 
therefore M(n,t) is very close to HF,(t)/t, which can be interpreted as "average 
information for example" for samples of size t. 
Our main result shows that M(n,t) is a useful measure to verify whether 
{(Xn, Pn, Fn)}n_>l satisfies the property of polynomial convergence, as shown by 
the following theorem. 
Theorem 3.1 Given { (X,, P,, Fn) }n> , the following conditions are equivalent: 
C1. The relative frequencies of events in Fn converge polynomially to their corre- 
sponding probabilities. 
C2. There ezists /3 > 0 such that M(n,t)= O(n/tls). 
ca. There ezists a polynomial (n, l/e) such that 
Vs n (t _> O(n, 1/6):=:> m(n,t)  6). 
C2 :O C3 is readily veirfied. In fact, condition C2 says there exist c,/3 > 0 
such that M(n,t) <_ an/tt; now, observing that t _> (an/6)} implies 
an/t  <_ 6, condition C3 immediately follows. 
C3 :o C2. As stated by condition C3, there exist a, b, c > 0 such that if 
t >_ anb/6  then M(n,t) _< 6. Solving the first inequality with respect to 6 
b 
gives, in the worst case, 6 = (an /t);, and substituting for 6 in the second 
inequality yields M(n,t) _< (and/t) - x  x  < 1 we immediately 
 = aan/t. If; _ 
obtain m(n,t) <_ an,/t; _< a;n/t{. Otherwise, if; > 1, since m(n,t) _< 1, 
c ! 
we have m(n,t) g min{1,an/Z{}_< min{1,(a{n/tx){}_< an/t. [] 
The proof of the equivalence between propositions C1 and C3 will be given in the 
next section. 
4 PROOF OF THE MAIN THEOREM 
First of all, we prove that condition C3 implies condition C1. The proof is based 
on the following lemma, which is obtained by minor modifications of [Vapnik &: 
Chervonenkis, 1971 (Lemma 2, Theorem 4, and Lemma 4)]. 
908 
Bertoni, Campadelli, Morpurgo, and Panizza 
Lemma 4.1 Given the family {(X,,P. Fn))n>l if limt-oo Hr.(t) 
132t0 H (') e) (5), 
ve ve v, (t _> e-T =* e.(o{_ I r (-) > 
where to is such that H(to)/to < /64. 
= 0 then 
As a consequence, we can prove the following. 
Theorem 4.1 Given { (Xn, Pn, Fn) )n> 1, if there exists a polynomial ;b(n, 1/c) such 
that 
ww (t _> (, 1/). t - ' 
then the relative frequencies of events in Fn converge polynomially to their proba- 
bilities. 
Proof (outline). It is sufficient to observe that if we choose to = ;b(n, 64/c2), by 
hypothesis it holds that H?,(to)/to _< c2/64; therefore, from Lemma 4.1, if 
t>-- 
n(*)Zx  c} < 5. 
then Pn(t){x l ..F.,_l > 
132t0 
132 64 
= . ;0(n, ), 
An immediate consequence of Theorem 4.1 and of the relation M(n,t) < Hr.(t)/t < 
M(n, t)logt is that condition C3 implies condition C1. 
We now prove that condition C1 implies condition C3. For the sake of simplicity it 
is convenient to introduce the following notations: 
a? = (.t) Pa(n,e,t): P?){_xla?(_x ) < e}. 
The following lemma, which relates the problem of polynomial uniform convergence 
of a family of events to the parameter Pa(n, c, t), will only be stated since it can be 
proved by minor modifications of Theorem 4 in [Vapnik &; Chervonenkis, 1971]. 
Lemma 4.2 /ft > 16/e 2 then Pn(t){xl (t) e} >_ (1 P(n,8e 2t)). 
_ _ n:(_) > - , 
A relevant property of P(n, c, t) is given by the following lemma. 
Lemna 4.3 a >_ i P(n,e/a, at) _< P(n,e,t). 
Proof. Let C$-1,.   ,-xa) be an at-sample obtained by the concatenation of a elements 
_Xl,...,__x,  X (t). It is easy to verify that (nat)(__Xl,...,_xa) > max/:l,...,a (nt)(xi). 
Therefore 
p(at).,(at), _ k} < P(at),fc(t)(_Xl) _< k A A c(nt)(_xa) _< 
By the independency of the events c(nt)(_xi) _< k we obtain 
ot 
p,;(at)...(at). .. . i=1 
Polynomial Uniform Convergence 909 
Recalling that a? = Ct)/t and substituting k = st, the thesis follows. 
A relation between Pa(n, e, t) and the parameter M(n, t), which we have introduced 
to characterize the polynomial uniform convergence of { {Xn Pn Fn) }n_> 1, is shown 
in the following lemma. 
Lemma 4.4 For every e (0 < e < 1/4), if M(n,t) > 2V/ then Pa(n,e,t) < 1/2. 
Proof. For the sake of simplicity, let m = M(n, l). If m > 5 > 0 , we have 
f01 f,,2 1 
5 < m = xdP = xdP + xdPa 
.,o /2 
5 5 l)+l_e(n, 5 l). 
< 
Since 0 < 5 < 1, we obtain 
1-5 5 
5 /)< _ - 
Pa(n, 1:--372 < i . 
By applying Lemma 4.3 it is proved that, for every ( _> 1, 
( 
e(n, , al) _< 1- 
2 
For a =  we obtain 
52 21 e- 1 1 
7) < < -. 
4' 2 
For e = 52/4 and t = 21/5, the previous result implies that, if M(n, tv ) > 
then P(n, e, t) < 1/2. 
It is easy to verify that 't)(_x,...,__x,) _< Y]il t)(--xi) for every a >_ 1. This 
implies M(n, at) <_ M(n,t) for a >_ 1, hence M(n, tv) > M(n,t), from which the 
thesis follows. 121 
Theorem 4.2 If for the family {(Xn, Pn, .Tn)}n> the relative frequencies of 
events in F, converge polynomially to their probabilities, then there ezists a polyno- 
mial b(n, l/e) such that 
Ve Vn(t _> (n, 1/e) : M(n,t) < e). 
Proof. By contradiction. Let us suppose that {(X,,P,,F,)},>i polynomially 
converges and that for all polynomial functions b(n, l/e) there exist e, n, t such 
that t _> 0(n, l/e) and M(n,t) > e. 
Since M(n,t) is a monotone, non-increasing function with respect to t it fol- 
lows that for every b there exist e, n such that M(n,b(n, l/e)) > e. Consid- 
ering the one-to-one corrispondence T between polynomial functions defined by 
Tb(n, l/e) = 9(n, 4/e2), we can conclude that for any 9 there exist e, n such that 
M(n, 9(n, l/e)) > 2x/. From Lemma 4.4 it follows that 
(1) 
910 
Bertoni, Campadelli, Morpurgo, and Panizza 
Since, by hypothesis, {(X,, P,,, F',,}},,>l polynomiMly converges, fixed 6 = 1/20, 
there exists a polynomial 05 such that 
_ II (t) ,' , } 
From Lemma 4. we know that if t  lfi/s = then 
> 
) 4 
Ift  max{1B/e 2 0(n, 1/6)}, then {(1-P(n,86, 2)) < , hence P(n, 86,2t) > . 
Fixed a polynomial p(n, l/s)such that 2p(n, 8/s)k max{lB/s=, 0(n, l/s)}, we can 
conclude that 
4 
1 4 
From sertions (1) and (2) the contradiction  < g can eily be derived. n 
An immediate consequence of Theorem 4.2 is that, in Theorem 3.1, condition C1 
implies condition C3. Theorem 3.1 is thus proved. 
5 DISTRIBUTION-DEPENDENT PAC LEARNING 
In this section we briefly recall the notion of learnability in the distribution- 
dependent PAC model and we discuss some applications of the previous re- 
sults. Given {{X,, P,, F,)},>l, a labelled /-sample S! for f  F, is a sequence 
(Ixl,f(xl)),...,Ixt,f(xt)}), here (xl,...,xt) is a t-sample on X,. We say that 
fl,f2  Fn are e-close with respect to Pn iff Pn{Xlfl(x )  f2(x)} _< e. 
A learning algorithm A for {{Xn,Pn, Fn)}n>I is an algorithm that, given in input 
e, 5 > 0, a labelled t-sample $! with f  Fn, outputs the representation of a function 
g which, with probability 1 - 8, is s-close to f. The family {(Xn, Pn, Fn)}n>I is 
said polynomially learnable iff there exists a learning algorithm A working in time 
bounded by a polynomial p(n, l/e, 1/8). 
Bounds on the sample size necessary to learn at approximation e and confidence 
1 -8 have been given in terms of e-covers [Benedek &: Itai, 1988]; classes which 
are not learnable in the distribution-free model, but are learnable for some specific 
distribution, have been shown (e.g. /-terms DNF [Kucera et al., 1988]). 
The following notion is expressed in terms of relative frequencies. 
Definition 5.1 A quasi-consistent algorithm for the family { (Xn, P., Fn) }n>_l is 
an algorithm that, given in input 8, e > 0 and a labelled t-sample $! with f  Fn, 
outputs in time bounded by a polynomial p(n, l/e, 1/8) the representation of a func- 
tion g  F. such that 
> < 
By Theorem 3.1 the following result can easily be derived. 
Polynomial Uniform Convergence 911 
Theorem 5.1 Given {(Xn,Pn, Fn)}n>, if there ezists /3 > 0 such that M(n,t)= 
O(n/t) and there exists a quasi-consistent algorithm for {(X.,P., F.)}.>i then 
{ (Xn, Pn, Fn)}n>lis polynomially learnable. 
6 CONCLUSIONS AND OPEN PROBLEMS 
We have characterized the property of polynomial uniform convergence of 
{(Xn,Pn,Fn)}n> by means of the parameter M(n,t). In particular we proved 
that { (X,, P,, F,) },> 1 has the property of polynomial convergence iff there exists 
/3 > 0 such that M(n, t) = O(n/t), but no attempt has been made to obtain better 
upper and lower bounds on the sample size in terms of M(n, t). 
With respect to the relation between polynomial uniform convergence and PAC 
learning in the distribution-dependent context, we have shown that if a family 
{ (X., P., F) }.> 1 satisfies the property of polynomial uniform convergence then it 
can be PAC learned with a sample of size bounded by a polynomial function in 
n, l/e, 1/6. 
It is an open problem whether the converse implication also holds. 
Acknowledgement s 
This research was supported by CNR, project Sistemi Informatici e Calcolo Paral- 
lelo. 
References 
G. Benedek, A. Itai. (1988)"Learnability by Fixed Distributions". Proc. COLT'88, 
80-90. 
A. Blumer, A. Ehrenfeucht, D. Haussler, K. Warmuth. (1989) "Learnability and 
the Vapnik-Chervonenkis Dimension". J. ACM 36,929-965. 
L. Kucera, A. Marchctti-Spaccamela, M. Protasi. (1988) "On the Learnability of 
DNF Formulae". Proc. XV Coil. on Automata, Languages, and Programming, 
L.N.C.S. 317, Springer Verlag. 
L.G. Valiant. (1984) "A Theory of the Learnable". Communications of the ACM 
27 (11), 1134-1142. 
V.N. Vapnik, A.Ya. Chervoncnkis. (1971) "On the uniform convergence of relative 
frequencies of events to their probabilities". Theory of Prob. and its Appl. 16 (2), 
265-280. 
