Let $X_1, X_2, \dots$ be a sequence of independent identically distributed random variables with values in the state space $S$ with probability distribution $P$.
Definition
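- The empirical measure $P_n$ is defined for measurable subsets $A$ of $S$ and given by
  $$P_n(A) = \frac{1}{n}\sum_{i=1}^n I_A(X_i) = \frac{1}{n}\sum_{i=1}^n \delta_{X_i}(A),$$
  where $I_A$ is the indicator function of $A$ and $\delta_X$ is the Dirac measure at $X$.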
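The definition translates almost directly into code. The following is a minimal sketch, assuming the set $A$ is encoded as a boolean predicate; the function name `empirical_measure` and the sample values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def empirical_measure(sample: Sequence[T], indicator: Callable[[T], bool]) -> float:
    """Return P_n(A) = (1/n) * sum_i I_A(X_i), with A encoded by `indicator`."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    # Count the sample points that fall in A and divide by the sample size.
    return sum(1 for x in sample if indicator(x)) / n


# Hypothetical sample of n = 5 points in S = R.
sample = [0.2, 1.5, -0.7, 3.1, 0.9]

# A = [0, 1]: two of the five points (0.2 and 0.9) fall in A.
p_n_A = empirical_measure(sample, lambda x: 0.0 <= x <= 1.0)
print(p_n_A)  # 0.4
```

Passing the set as a predicate keeps the sketch independent of how $S$ is represented; any measurable set that can be tested pointwise fits this interface.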
Properties
- For a fixed measurable set $A$, $nP_n(A)$ is a binomial random variable with mean $nP(A)$ and variance $nP(A)(1 - P(A))$.
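- For a fixed partition $A_1, \dots, A_k$ of $S$ into measurable sets, the random variables $Y_i = nP_n(A_i)$ jointly follow a multinomial distribution with event probabilities $P(A_i)$.
- The covariance matrix of this multinomial distribution is $\operatorname{Cov}(Y_i, Y_j) = nP(A_i)(\delta_{ij} - P(A_j))$, where $\delta_{ij}$ is the Kronecker delta.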
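As an illustration of the partition properties, the sketch below computes the cell counts $Y_i = nP_n(A_i)$ for a sample and the multinomial covariance matrix $nP(A_i)(\delta_{ij} - P(A_j))$. The helper names, the six-point state space, and the cell probabilities are assumptions made for this example only.

```python
from collections import Counter


def partition_counts(sample, cell_of, n_cells):
    """Return [Y_0, ..., Y_{k-1}] where Y_i counts the sample points in cell A_i."""
    counts = Counter(cell_of(x) for x in sample)
    return [counts.get(i, 0) for i in range(n_cells)]


def multinomial_covariance(n, probs):
    """Return the matrix Cov(Y_i, Y_j) = n * p_i * (delta_ij - p_j)."""
    k = len(probs)
    return [
        [n * probs[i] * ((1.0 if i == j else 0.0) - probs[j]) for j in range(k)]
        for i in range(k)
    ]


# Hypothetical setup: S = {0, ..., 5}, partitioned into A_0 = {0, 1},
# A_1 = {2, 3}, A_2 = {4, 5}; the cell index of x is x // 2.
sample = [0, 3, 5, 1, 2, 4, 0, 1]
counts = partition_counts(sample, lambda x: x // 2, 3)
print(counts)  # [4, 2, 2]

# Assumed event probabilities P(A_i), used only to evaluate the formula.
cov = multinomial_covariance(len(sample), [0.5, 0.25, 0.25])
print(cov[0])  # [2.0, -1.0, -1.0]
```

The counts always sum to $n$, which is why every row of the covariance matrix sums to zero.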
Definition
- $\bigl(P_n(c)\bigr)_{c \in \mathcal{C}}$ is the empirical measure indexed by $\mathcal{C}$, a collection of measurable subsets of $S$.
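To generalize this notion further, observe that the empirical measure $P_n$ maps measurable functions $f : S \to \mathbb{R}$ to their empirical mean,
$$f \mapsto P_n f = \int_S f \, dP_n = \frac{1}{n}\sum_{i=1}^n f(X_i).$$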
In particular, the empirical measure of $A$ is simply the empirical mean of the indicator function, $P_n(A) = P_n I_A$.
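Viewing the empirical measure as a map on functions can be sketched as follows. The function name `empirical_mean` and the sample are hypothetical; the second call illustrates that applying the map to an indicator function gives the empirical measure of the corresponding set.

```python
def empirical_mean(sample, f):
    """Return the average of f over the sample, i.e. (1/n) * sum_i f(X_i)."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum(f(x) for x in sample) / len(sample)


# Hypothetical sample of n = 4 real-valued observations.
sample = [1.0, 2.0, 3.0, 4.0]

# Empirical mean of f(x) = x^2: (1 + 4 + 9 + 16) / 4.
mean_sq = empirical_mean(sample, lambda x: x * x)
print(mean_sq)  # 7.5


# Indicator function of A = (2.5, infinity), returning 1.0 or 0.0.
def indicator_A(x):
    return 1.0 if x > 2.5 else 0.0


# The empirical mean of the indicator is the fraction of points in A.
p_n_A = empirical_mean(sample, indicator_A)
print(p_n_A)  # 0.5
```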
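For a fixed measurable function $f$ with $\mathbb{E}f^2 < \infty$, $P_n f$ is a random variable with mean $\mathbb{E}f = \int_S f \, dP$ and variance $\frac{1}{n}\mathbb{E}(f - \mathbb{E}f)^2$.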
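The mean and variance follow from linearity of expectation and the independence of the observations. A short derivation, assuming the second moment of $f(X_1)$ is finite and writing $\mathbb{E}f$ for $\mathbb{E}f(X_1) = \int_S f \, dP$:

```latex
\begin{aligned}
\mathbb{E}[P_n f] &= \frac{1}{n}\sum_{i=1}^n \mathbb{E} f(X_i) = \mathbb{E}f, \\
\operatorname{Var}(P_n f) &= \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var} f(X_i)
  = \frac{1}{n}\,\mathbb{E}(f - \mathbb{E}f)^2 .
\end{aligned}
```

Taking $f = I_A$ gives $\operatorname{Var}(P_n(A)) = \frac{1}{n}P(A)(1 - P(A))$, which agrees with the binomial variance $nP(A)(1 - P(A))$ of $nP_n(A)$ after multiplying by $n^2$.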
By the strong law of large numbers, $P_n(A)$ converges to $P(A)$ almost surely for fixed $A$. Similarly, $P_n f$ converges to $\mathbb{E}f$ almost surely for a fixed measurable function $f$ with $\mathbb{E}|f| < \infty$. The problem of uniform convergence of $P_n$ to $P$ was open until Vapnik and Chervonenkis solved it in 1968.[1]
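If the class $\mathcal{C}$ of measurable sets (or the class $\mathcal{F}$ of measurable functions) is Glivenko–Cantelli with respect to $P$, then $P_n$ converges to $P$ uniformly over $c \in \mathcal{C}$ (or $f \in \mathcal{F}$). In other words, with probability 1 we have
$$\|P_n - P\|_{\mathcal{C}} = \sup_{c \in \mathcal{C}} |P_n(c) - P(c)| \to 0,$$
$$\|P_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |P_n f - \mathbb{E}f| \to 0.$$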
The empirical distribution function provides an example of empirical measures. For real-valued iid random variables $X_1, \dots, X_n$ it is given by
$$F_n(x) = P_n((-\infty, x]) = P_n I_{(-\infty, x]}.$$
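In this case, empirical measures are indexed by the class $\mathcal{C} = \{(-\infty, x] : x \in \mathbb{R}\}$. It has been shown that $\mathcal{C}$ is a uniform Glivenko–Cantelli class; in particular, for every distribution function $F(x) = P((-\infty, x])$ of the $X_i$,
$$\|F_n - F\|_\infty = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)| \to 0$$
with probability 1.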
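The empirical distribution function and its sup-norm distance to a target distribution function can be sketched as below. The names `ecdf`, `sup_distance`, and `uniform_cdf` are hypothetical, and the sketch assumes the target $F$ is continuous, so that the supremum is attained at the jump points of $F_n$.

```python
import bisect


def ecdf(sample):
    """Return F_n, where F_n(x) is the fraction of sample points <= x."""
    xs = sorted(sample)
    n = len(xs)

    def F_n(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(xs, x) / n

    return F_n


def sup_distance(sample, F):
    """Return sup_x |F_n(x) - F(x)| for a continuous distribution function F.

    F_n is a step function, so it suffices to compare F with the value of
    F_n at each jump and with the left limit of F_n just before the jump.
    """
    xs = sorted(sample)
    n = len(xs)
    return max(max((i + 1) / n - F(x), F(x) - i / n) for i, x in enumerate(xs))


def uniform_cdf(x):
    """Distribution function of Uniform(0, 1), assumed here as the target F."""
    return min(max(x, 0.0), 1.0)


# Hypothetical sample of n = 3 points in [0, 1].
sample = [0.1, 0.4, 0.7]
F_n = ecdf(sample)
print(F_n(0.5))  # two of three points are <= 0.5

d = sup_distance(sample, uniform_cdf)
print(d)  # roughly 0.3, attained at the jump at x = 0.7
```

Here the class of half-lines $(-\infty, x]$ is infinite, but the step structure of $F_n$ reduces the supremum to finitely many comparisons.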
Vapnik, V.; Chervonenkis, A. (1968). "Uniform convergence of frequencies of occurrence of events to their probabilities". Dokl. Akad. Nauk SSSR. 181.