Concentration inequalities

In words...

Concentration inequalities allow us to bound the probability that the empirical average is far away from the true mean of a random variable.

A major advantage of such inequalities is that they are distribution-free, i.e., they hold for any distribution of the random variables.

In pictures...

Markov's inequality

Here is an illustration of Markov's inequality in which you can drag-and-drop $t$
and $\mu$ (the mean of $X$).

$-t$ $t$ $\mu$

The plot shows the probability density function of $X$ of mean $\mu$ with a shaded area that corresponds to the probability of $X$ taking a value with magnitude greater than $t$:
$P_X(|X|\geq t) = $

The other area shaded in red represents the expectation of the absolute value of $X$ divided by $t$:
$\displaystyle{\frac{\E_X[|X|]}{t} =} $

Markov's inequality states that
the blue area cannot be larger than the red area.

Hoeffding's inequality

Here is a plot of the pdf of a bounded variable $X\in[-4,4]$ with mean $\mu=\E_X[X]$.
We estimate the mean by the empirical average of a realization of an $n$-sample.

$\mu$ $2 \epsilon$ $\hat{\mu}$

The plot below shows the upper bound given by Hoeffding's inequality on the probability of observing an error larger than $\epsilon$ between the sample average and the true mean.

$N$

Choose a sample size: $n = $

Choose a tolerance: $\epsilon = $

The empirical average is $\hat{\mu} = $ , which is close to the true mean $\mu = $ .

Hoeffding's inequality yields the upper bound

$P(|\hat{\mu}-\mu| \geq \epsilon) \leq $ .

In maths...

In the following we consider continuous random variables taking values in $\R$ and assume that all quantities involved in the discussion (expectation, variance...) exist for these random variables.

Markov's inequality

Markov's inequality relates the probability of a continuous random variable $X$ to take large values and the expectation of its magnitude:

$$ \forall t>0,\quad P(|X| \geq t) \leq \frac{\E_X [ |X| ]}{t} $$

Proof: \begin{align*} \E_X [ |X| ] & = \int_{\R} |x|\, p_X(x)\, dx \\ & = \int_{|x|< t} |x|\, p_X(x)\, dx + \int_{|x|\geq t} |x|\, p_X(x)\, dx \\ & \geq \int_{|x| \geq t} |x|\, p_X(x)\, dx & \mbox{(due to the first integral being positive)} \\ & \geq \int_{|x| \geq t} t\, p_X(x)\, dx = t \int_{|x| \geq t} p_X(x)\, dx & \mbox{(due to } |x|\geq t ) \\ & \geq t P(|X| \geq t ) & \mbox{(due to the }\href{pdf.html}{\mbox{definition of the density}}) \end{align*}

Bienaymé-Chebyshev's inequality

This inequality relates the distance of a random variable $X$ from its mean $\E_X[X]$ to its variance $Var(X)$:

$$ \forall t>0,\quad P(|X - \E_X[X]| \geq t) \leq \frac{Var ( X )}{t^2} $$

Proof: \begin{align*} P(|X - \E_X[X]| \geq t) & = P((X - \E_X[X])^2 \geq t^2) \end{align*} By using Markov's inequality for the random variable $Z = (X - \E_X[X])^2$ with the constant $t^2$ then yields $$ P((X - \E_X[X])^2 \geq t^2) \leq \frac{\E_X [ (X - \E_X[X])^2 ]}{t^2} = \frac{Var ( X )}{t^2} $$

Quantitative law of large numbers / Chebyshev's inequality for empirical averages

For $n$ independent and identically distributed (i.i.d.) random variables $X_i$, seen here as independent copies of $X$, we have

$$ P\left\{ \left|\frac{1}{n}\sum_{i=1}^n X_i - \E[X] \right| \geq t\right\} \leq \frac{Var(X)}{nt^2} $$

Proof: applying Chebychev's inequality to the random variable $\frac{1}{n}\sum_{i=1}^n X_i$, we obtain $$ P\left\{ \left|\frac{1}{n}\sum_{i=1}^n X_i - \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| \geq t\right\} \leq \frac{Var (\frac{1}{n}\sum_{i=1}^n X_i )}{t^2} , $$ where the linearity of the expectation yields $$ \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \E [ X_i ] = \frac{1}{n}\sum_{i=1}^n \E [ X ] = \E [ X ] $$ (due to the $X_i$'s being identically distributed and sharing the same expectation $\E[X]$). Since the $X_i$'s are independent, we also have $$ Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right) =\frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2}\sum_{i=1}^n Var(X) = \frac{Var(X)}{n}, $$ which concludes the proof.

Chernoff's bound

For any random variable $X$,

$$ \forall t>0,\ s> 0, \quad P(X \geq t) \leq \frac{\E_X [ e^{sX} ]}{e^{st}} $$

This bound is part of Chernoff's bounding method, in which we find an $s$ that minimizes (or at least yields a small value for) the right-hand side of the bound.

Proof: From the strictly increasing nature of the exponential and the positivity of $s$, we have $$ P(X \geq t) = P(e^{sX} \geq e^{st} ) $$ Then the proof is a direct consequence of Markov's inequality applyied to the (positive) random variable $e^{sX}$ with the positive constant $e^{st}$.

Hoeffding's inequality

For $n$ i.i.d. random variables $X_i$ that are copies of a bounded random variable $X\in[a,b]$, Hoeffding's inequality states that, for all $\epsilon > 0$, $$ P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right), $$ $$ P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right), $$ $$ P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \E [X]\right| \geq \epsilon \right\} \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right). $$

Step 1. We first show that, for a bounded random variable $X\in[a,b]$ with $\E_X [X] = 0$, we have $$ \E_X [ e^{sX} ] \leq e^{s^2(b-a)^2/8} $$ as follows. For all $x\in[a,b]$, written as $x = \lambda a + (1-\lambda)b$, the convexity of the exponential implies that $e^x \leq \lambda e^a + (1-\lambda) e^b$ with $\lambda = $. A similar reasoning applied to $sx \in[sa,sb]$ yields $$ e^{sx} \leq \frac{b-x}{b-a} e^{sa} + \left(1-\frac{b-x}{b-a}\right) e^{sb} = \frac{b-x}{b-a} e^{sa} + \frac{x-a}{b-a} e^{sb} $$ Thus, \begin{align*} \E_X[ e^{sX} ] &\leq \E_X\left[ \frac{b-X}{b-a} e^{sa} + \frac{X-a}{b-a} e^{sb} \right] \\ & \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} + \E_X[X] \left( \frac{-1}{b-a} e^{sa} + \frac{1}{b-a} e^{sb} \right) & \mbox{(due to the }\href{expectation.html}{\mbox{linearity of the expectation}})\\ & \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} & \mbox{(due to } \E_X[X] = 0) \end{align*} Let define $c=\frac{-a}{b-a}$, $u=s(b-a)$ and $f(u) = -cu + \log(1-c +ce^u)$. Then, \begin{align*} \E_X[ e^{sX} ] &\leq (1 - c) e^{sa} + c e^{sb} = (1 - c) e^{-c s (b-a)} + c e^{(1-c)s(b-a)} & \left( \mbox{due to } a=-c(b-a) \mbox{ and } b = (1-c)(b-a) \right)\\ & \leq (1 - c + c e^{s(b-a)}) e^{-c s (b-a)} = (1 - c + c e^{u}) e^{-c u} &\left(\mbox{due to } u=s(b-a) \right)\\ & \leq e^{f(u)} \end{align*} We have $f(0)=0$ and the derivative \begin{align*} f^\prime(u) &= -c + \frac{ce^u}{1-c+ce^u} = -c + \frac{c}{(1-c)e^{-u} + c} \end{align*} that is also zero at $u=0$. The second derivative is bounded by $$ f^{\prime\prime}(u) = \frac{c (1-c)e^{-u} }{( (1-c)e^{-u} + c)^2} = \frac{\alpha\beta} {(\alpha+\beta)^2} \leq \frac{1}{4} $$ since $(\alpha+\beta)^2 - 4\alpha\beta = (\alpha-\beta)^2 \geq 0$. Therefore, the Taylor-Lagrange formula giving the expansion of $f$ at $0$ yields, for some $\theta\in[0,u]$, $$ f(u) = f(0) + u f^\prime(0) + \frac{u^2}{2}f^{\prime\prime}(\theta) \leq \frac{u^2}{8} = \frac{s^2(b-a)^2}{8} $$
Step 2. We can now apply Chernoff's bound to the random variable $\sum_{i=1}^n \left(X_i - \E[X]\right)$: \begin{align*} P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st}\E \left[ e^{s \sum_{i=1}^n (X_i - \E[X]) } \right] = e^{-st}\E \left[ \prod_{i=1}^n e^{s (X_i - \E[X])} \right] \end{align*} Due to the independence of the $X_i$'s, the expectation can be introduced in the product as: \begin{align*} P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st} \prod_{i=1}^n \E \left[e^{s (X_i - \E[X])} \right] \end{align*} At this point we can use the result of step 1 to the random variables $( X_i - \E[X] )$, since we have the bounds $$ a_i - \E[X] \leq X_i - \E[X] \leq b_i - \E[X] $$ and the fact that $$ \E\left[ X_i - \E[X] \right] = \E [X_i] - \E \E[X] = \E [X] - \E[X] = 0 . $$ This yields $$ \E \left[e^{s (X_i - \E[X])} \right] \leq e^{s^2(b_i-a_i)^2/8} $$ which leads to the following when plugged in the Chernoff's bound: \begin{align*} P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st} \prod_{i=1}^n e^{s^2(b_i-a_i)^2/8} = \exp\left( -st + \frac{s^2\sum_{i=1}^n (b_i-a_i)^2}{8} \right) \end{align*} Choosing $s= 4t / \sum_{i=1}^n (b_i-a_i)^2$ then gives $$ P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} \leq \exp\left( \frac{-4t^2}{\sum_{i=1}^n (b_i-a_i)^2} + \frac{16t^2}{8\sum_{i=1}^n (b_i-a_i)^2} \right) = \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right) $$
Step 3. The particular version of Hoeffding's inequality involving the average instead of the sum, as in the first of the three inequalities stated above, is obtained by writing \begin{align*} P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \geq \frac{t}{n}\right\} &= P\left\{ \frac{1}{n}\sum_{i=1}^n (X_i - \E[X]) \geq \frac{t}{n}\right\} \\ & = P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} \\ & \leq \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right) \end{align*} and choosing $t=n\epsilon$ with $a_i=a$ and $b_i=b$.
Step 4. The second of the three inequalities is obtained with a similar reasoning applied to the random variable $\sum_{i=1}^n \left(\E[X] - X_i\right)$ instead of $\sum_{i=1}^n \left(X_i - \E[X]\right)$. Using the bounds $$ \E[X] - b_i \leq \E[X] - X_i \leq \E[X] - a_i $$ this leads to the same kind of result: $$ P\left\{ \E[X] -\frac{1}{n}\sum_{i=1}^n X_i \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right) $$ from which we deduce $$ P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \leq -\epsilon \right\} = P\left\{ \E[X] -\frac{1}{n}\sum_{i=1}^n X_i \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right). $$ The third inequality involving the absolute value is a direct consequence of the two first and the union bound: \begin{align*} P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \E [X]\right| > \epsilon \right\} &= P\left\{ \left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] > \epsilon\right\} \cup \left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] < - \epsilon\right\} \right\} \\ & \leq P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E [X] > \epsilon\right\} + P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] < - \epsilon\right\} \\ & \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right). \end{align*}

Bounded differences inequality

The inequalities below bound the probability of observing a large difference between a function of several variables and its expectation when the effect of a change of a single variable on the function value can be bounded.

Let $f: \X^n \rightarrow \R$ be a function of $n$ variables such that for some nonnegative constants $c_i\geq 0$, $i=1,\dots,n$, $$ \sup_{(x_1,\dots, x_n) \in \X^n, x_i^\prime \in \X} |f(x_1,\dots, x_n) - f(x_1, \dots,x_{i-1}, x_i^\prime, x_{i+1}, \dots, x_n)| \leq c_i,\quad i=1,\dots, n, $$ holds and let $X_i$, $i=1,\dots,n$, be $n$ independent (but not necessarily identically distributed) random variables. Then, $$ P\left\{ f(X_1,\dots, X_n) - \E f(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right), $$ $$ P\left\{ \E f(X_1,\dots, X_n) - f(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right), $$ and $$ P\left\{ \left|f(X_1,\dots, X_n) - \E f(X_1,\dots, X_n) \right| \geq \epsilon \right\} \leq 2\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right). $$

coming soon...

Notes

For the sake of simplicity, we restrained the discussion to continuous random variables with densities. However, these inequalities can be proved in a more general setting.
Hoeffding's inequality originally applies to sums of random variables rather than averages as presented here. This more standard formulation is the one obtained at the end of the second step of the proof.