Concentration inequalities allow us to bound the probability that the empirical average is far away from the true mean of a random variable.

A major advantage of such inequalities is that they are distribution-free, i.e., they hold for any distribution of the random variables.


The plot shows the probability density function of $X$ with mean $\mu$. A first area, shaded in blue, corresponds to the probability of $X$ taking a value with magnitude greater than $t$, i.e., $P_X(|X|\geq t)$. A second area, shaded in red, represents the expectation of the absolute value of $X$ divided by $t$, i.e., $\E_X[|X|]/t$.

Markov's inequality states that the blue area cannot be larger than the red area.

Here is a plot of the pdf of a bounded variable $X\in[-4,4]$ with mean $\mu=\E_X[X]$.

We estimate the mean by the empirical average of a realization of an $n$-sample.

The plot below shows the upper bound given by Hoeffding's inequality on the probability of observing an error larger than $\epsilon$ between the sample average and the true mean.


Choosing a sample size $n$ and a tolerance $\epsilon$, we compute the empirical average $\hat{\mu}$, which is close to the true mean $\mu$. Hoeffding's inequality then yields the upper bound

$$ P(|\hat{\mu}-\mu| \geq \epsilon) \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right) $$

with $[a,b] = [-4,4]$.
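The quantities displayed by this demo can be reproduced numerically. Here is a minimal sketch, assuming for illustration that $X$ is uniform on $[-4,4]$ (any distribution supported on $[-4,4]$ would do, and the function name `hoeffding_bound` is ours):

```python
import math
import random

# Two-sided Hoeffding bound for a bounded variable X in [a, b].
def hoeffding_bound(n, eps, a=-4.0, b=4.0):
    """Upper bound on P(|empirical average - true mean| >= eps)."""
    return 2.0 * math.exp(-2.0 * n * eps**2 / (b - a) ** 2)

random.seed(0)
n, eps = 1000, 0.5
sample = [random.uniform(-4.0, 4.0) for _ in range(n)]  # true mean mu = 0
mu_hat = sum(sample) / n  # empirical average, close to mu = 0

print(f"empirical average: {mu_hat:.4f}")
print(f"Hoeffding bound for eps = {eps}: {hoeffding_bound(n, eps):.6f}")
```

Note how quickly the bound decreases: doubling $n$ squares the factor $\exp(-2n\epsilon^2/(b-a)^2)$.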

Markov's inequality relates the probability that a continuous random variable $X$ takes large values to the expectation of its magnitude:

$$ \forall t>0,\quad P(|X| \geq t) \leq \frac{\E_X [ |X| ]}{t} $$

**Proof:**
\begin{align*}
\E_X [ |X| ] & = \int_{\R} |x|\, p_X(x)\, dx \\
& = \int_{|x|< t} |x|\, p_X(x)\, dx + \int_{|x|\geq t} |x|\, p_X(x)\, dx \\
& \geq \int_{|x| \geq t} |x|\, p_X(x)\, dx & \mbox{(due to the first integral being nonnegative)} \\
& \geq \int_{|x| \geq t} t\, p_X(x)\, dx = t \int_{|x| \geq t} p_X(x)\, dx & \mbox{(due to } |x|\geq t ) \\
& = t P(|X| \geq t ) & \mbox{(due to the }\href{pdf.html}{\mbox{definition of the density}})
\end{align*}
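The inequality can be checked by simulation. A minimal sketch, assuming for illustration that $X$ is exponential with rate $1$ (our choice; the inequality is distribution-free):

```python
import math
import random

# Monte Carlo check of Markov's inequality P(|X| >= t) <= E[|X|] / t.
random.seed(1)
samples = [random.expovariate(1.0) for _ in range(100_000)]

t = 3.0
tail_prob = sum(x >= t for x in samples) / len(samples)         # "blue area"
markov_bound = sum(abs(x) for x in samples) / len(samples) / t  # "red area"

print(tail_prob, markov_bound)  # the tail probability stays below the bound
```

For this distribution the true tail is $e^{-3}\approx 0.05$ while the bound is about $1/3$: Markov's inequality holds but can be loose.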

This inequality relates the distance of a random variable $X$ from its mean $\E_X[X]$ to its variance $Var(X)$:

$$ \forall t>0,\quad P(|X - \E_X[X]| \geq t) \leq \frac{Var ( X )}{t^2} $$

**Proof:**
\begin{align*}
P(|X - \E_X[X]| \geq t) & = P((X - \E_X[X])^2 \geq t^2)
\end{align*}
Using Markov's inequality for the random variable $Z = (X - \E_X[X])^2$ with the constant $t^2$ then yields
$$
P((X - \E_X[X])^2 \geq t^2) \leq \frac{\E_X [ (X - \E_X[X])^2 ]}{t^2} = \frac{Var ( X )}{t^2}
$$
Note that since $Z$ is a function of $X$, taking the expectation with respect to $Z$ or with respect to $X$ is equivalent: $\E_Z[Z] = \E_X[(X - \E_X[X])^2]$.
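Chebyshev's inequality can also be checked by simulation. A minimal sketch, assuming for illustration that $X$ is standard normal (our choice; again, the bound is distribution-free):

```python
import random

# Monte Carlo check of Chebyshev's inequality
# P(|X - E[X]| >= t) <= Var(X) / t^2, with X ~ N(0, 1).
random.seed(2)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)

t = 2.0
tail = sum(abs(x - mean) >= t for x in xs) / len(xs)
chebyshev = var / t**2

print(tail, chebyshev)  # true tail ~ 0.0455 versus the (loose) bound ~ 0.25
```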

For $n$ independent and identically distributed (i.i.d.) random variables $X_i$, seen here as independent copies of $X$, we have

$$ P\left\{ \left|\frac{1}{n}\sum_{i=1}^n X_i - \E[X] \right| \geq t\right\} \leq \frac{Var(X)}{nt^2} $$

**Proof:** applying Chebyshev's inequality with constant $t$ to the random variable $\frac{1}{n}\sum_{i=1}^n X_i$, we obtain
$$
P\left\{ \left|\frac{1}{n}\sum_{i=1}^n X_i - \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| \geq t\right\} \leq \frac{Var (\frac{1}{n}\sum_{i=1}^n X_i )}{t^2} ,
$$
where the linearity of the expectation yields
$$
\E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \E [ X_i ] = \frac{1}{n}\sum_{i=1}^n \E [ X ] = \E [ X ]
$$
(due to the $X_i$'s being identically distributed and sharing the same expectation $\E[X]$). Since the $X_i$'s are independent, we also have
$$
Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right) =\frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2}\sum_{i=1}^n Var(X) = \frac{Var(X)}{n} ,
$$
which concludes the proof.
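The key step, $Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right) = Var(X)/n$, can be verified empirically by estimating the variance of the sample mean over many independent repetitions. A minimal sketch, assuming for illustration that $X$ is uniform on $[0,1]$ (so $Var(X) = 1/12$):

```python
import random

# Empirical check that Var((1/n) * sum X_i) = Var(X) / n.
random.seed(3)

def sample_mean(n):
    # Mean of n i.i.d. uniform [0, 1] variables; Var(X) = 1/12.
    return sum(random.uniform(0.0, 1.0) for _ in range(n)) / n

n, reps = 50, 20_000
means = [sample_mean(n) for _ in range(reps)]
m = sum(means) / reps
var_of_mean = sum((x - m) ** 2 for x in means) / reps

print(var_of_mean, (1.0 / 12.0) / n)  # the two values should be close
```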

$$ \forall t>0,\ s> 0, \quad P(X \geq t) \leq \frac{\E_X [ e^{sX} ]}{e^{st}} $$

This bound is part of Chernoff's bounding method, in which we find an $s$ that minimizes (or at least yields a small value for) the right-hand side of the bound.
**Proof:**
From the strictly increasing nature of the exponential and the positivity of $s$, we have
$$
P(X \geq t) = P(e^{sX} \geq e^{st} )
$$
Then the proof is a direct consequence of Markov's inequality applied to the (positive) random variable $e^{sX}$ with the positive constant $e^{st}$.
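The minimization over $s$ can be illustrated with a case where the right-hand side has a closed form. A minimal sketch, assuming $X \sim N(0,1)$ (our choice), for which $\E_X[e^{sX}] = e^{s^2/2}$ so the bound reads $e^{s^2/2 - st}$ and is minimized at $s = t$:

```python
import math

# Chernoff's bounding method for X ~ N(0, 1):
# P(X >= t) <= E[e^{sX}] / e^{st} = e^{s^2/2 - s*t},
# minimized over s > 0 at s = t, giving e^{-t^2/2}.
def chernoff_bound(t, s):
    return math.exp(s**2 / 2.0 - s * t)

t = 2.0
grid = [0.5 * k for k in range(1, 11)]  # candidate s in {0.5, 1.0, ..., 5.0}
best_s = min(grid, key=lambda s: chernoff_bound(t, s))

print(best_s, chernoff_bound(t, best_s))  # s = 2.0, bound e^{-2} ~ 0.135
```

A grid search stands in here for the calculus step; setting the derivative $s - t$ of the exponent to zero gives the same $s = t$.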

For $n$ i.i.d. random variables $X_i$ that are copies of a bounded random variable $X\in[a,b]$, Hoeffding's inequality states that, for all $\epsilon > 0$, $$ P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right), $$ $$ P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right), $$ $$ P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \E [X]\right| \geq \epsilon \right\} \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right). $$

**Step 1.**
We first show that, for a bounded random variable $X\in[a,b]$ with $\E_X [X] = 0$, we have
$$
\E_X [ e^{sX} ] \leq e^{s^2(b-a)^2/8}
$$
as follows. For all $x\in[a,b]$, written as $x = \lambda a + (1-\lambda)b$ with $\lambda = \frac{b-x}{b-a}$, the convexity of the exponential implies
that $e^x \leq \lambda e^a + (1-\lambda) e^b$. A similar reasoning applied to $sx \in[sa,sb]$ yields
$$
e^{sx} \leq \frac{b-x}{b-a} e^{sa} + \left(1-\frac{b-x}{b-a}\right) e^{sb} =
\frac{b-x}{b-a} e^{sa} + \frac{x-a}{b-a} e^{sb}
$$
Thus,
\begin{align*}
\E_X[ e^{sX} ] &\leq \E_X\left[ \frac{b-X}{b-a} e^{sa} + \frac{X-a}{b-a} e^{sb} \right] \\
& \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} + \E_X[X] \left( \frac{-1}{b-a} e^{sa} + \frac{1}{b-a} e^{sb} \right) & \mbox{(due to the }\href{expectation.html}{\mbox{linearity of the expectation}})\\
& \leq \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} & \mbox{(due to } \E_X[X] = 0)
\end{align*}
Let us define $c=\frac{-a}{b-a}$, $u=s(b-a)$, and $f(u) = -cu + \log(1-c +ce^u)$. Then,
\begin{align*}
\E_X[ e^{sX} ] &\leq (1 - c) e^{sa} + c e^{sb} = (1 - c) e^{-c s (b-a)} + c e^{(1-c)s(b-a)} & \left( \mbox{due to } a=-c(b-a) \mbox{ and } b = (1-c)(b-a) \right)\\
& \leq (1 - c + c e^{s(b-a)}) e^{-c s (b-a)} = (1 - c + c e^{u}) e^{-c u} &\left(\mbox{due to } u=s(b-a) \right)\\
& \leq e^{f(u)}
\end{align*}
We have $f(0)=0$ and the derivative
\begin{align*}
f^\prime(u) &= -c + \frac{ce^u}{1-c+ce^u} = -c + \frac{c}{(1-c)e^{-u} + c}
\end{align*}
that is also zero at $u=0$. With $\alpha = (1-c)e^{-u}$ and $\beta = c$, the second derivative is bounded by
$$
f^{\prime\prime}(u) = \frac{c (1-c)e^{-u} }{( (1-c)e^{-u} + c)^2}
= \frac{\alpha\beta} {(\alpha+\beta)^2} \leq \frac{1}{4}
$$
since $(\alpha+\beta)^2 - 4\alpha\beta = (\alpha-\beta)^2 \geq 0$.
Therefore, the Taylor-Lagrange formula giving the expansion of $f$ at $0$ yields, for some $\theta\in[0,u]$,
$$
f(u) = f(0) + u f^\prime(0) + \frac{u^2}{2}f^{\prime\prime}(\theta) \leq \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}
$$
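The bound of Step 1 can be checked numerically on a concrete zero-mean variable. A minimal sketch, assuming $X = \pm 1$ with probability $1/2$ each (our choice), so $a=-1$, $b=1$ and $\E_X[e^{sX}] = \cosh(s)$:

```python
import math

# Check of the Step 1 bound E[e^{sX}] <= e^{s^2 (b-a)^2 / 8}
# for the zero-mean two-point variable X = +1 or -1 w.p. 1/2 each.
def mgf(s):
    return math.cosh(s)  # (e^s + e^{-s}) / 2 = E[e^{sX}]

def lemma_bound(s, a=-1.0, b=1.0):
    return math.exp(s**2 * (b - a) ** 2 / 8.0)

for s in [0.1, 0.5, 1.0, 2.0, 5.0]:
    assert mgf(s) <= lemma_bound(s)
print("E[e^{sX}] <= e^{s^2 (b-a)^2 / 8} verified on all test points")
```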

**Step 2.**
We can now apply Chernoff's bound to the random variable $\sum_{i=1}^n \left(X_i - \E[X]\right)$:
\begin{align*}
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st}\E \left[ e^{s \sum_{i=1}^n (X_i - \E[X]) } \right]
= e^{-st}\E \left[ \prod_{i=1}^n e^{s (X_i - \E[X])} \right]
\end{align*}
Due to the independence of the $X_i$'s, the expectation of the product factorizes into the product of the expectations:
\begin{align*}
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st} \prod_{i=1}^n \E \left[e^{s (X_i - \E[X])} \right]
\end{align*}
At this point we can apply the result of Step 1 to the random variables $( X_i - \E[X] )$ (allowing each $X_i$ to take values in its own interval $[a_i,b_i]$; in our i.i.d. setting, $a_i=a$ and $b_i=b$), since we have the bounds
$$
a_i - \E[X] \leq X_i - \E[X] \leq b_i - \E[X]
$$
and the fact that
$$
\E\left[ X_i - \E[X] \right] = \E [X_i] - \E[X] =
\E [X] - \E[X] = 0 .
$$
This yields
$$
\E \left[e^{s (X_i - \E[X])} \right] \leq e^{s^2(b_i-a_i)^2/8}
$$
which leads to the following when plugged into Chernoff's bound:
\begin{align*}
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st} \prod_{i=1}^n e^{s^2(b_i-a_i)^2/8}
= \exp\left( -st + \frac{s^2\sum_{i=1}^n (b_i-a_i)^2}{8} \right)
\end{align*}
Choosing $s= 4t / \sum_{i=1}^n (b_i-a_i)^2$ then gives
$$
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} \leq \exp\left( \frac{-4t^2}{\sum_{i=1}^n (b_i-a_i)^2} + \frac{16t^2}{8\sum_{i=1}^n (b_i-a_i)^2} \right) = \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right)
$$

**Step 3.**
The particular version of Hoeffding's inequality involving the average instead of the sum, as in the first of the three inequalities stated above, is obtained by writing
\begin{align*}
P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \geq \frac{t}{n}\right\} &= P\left\{ \frac{1}{n}\sum_{i=1}^n (X_i - \E[X]) \geq \frac{t}{n}\right\} \\
& = P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} \\
& \leq \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right)
\end{align*}
and choosing $t=n\epsilon$ with $a_i=a$ and $b_i=b$.

**Step 4.**
The second of the three inequalities is obtained with a similar reasoning applied to the random variable $\sum_{i=1}^n \left(\E[X] - X_i\right)$ instead of $\sum_{i=1}^n \left(X_i - \E[X]\right)$. Using the bounds
$$
\E[X] - b_i \leq \E[X] - X_i \leq \E[X] - a_i
$$
this leads to the same kind of result:
$$
P\left\{ \E[X] -\frac{1}{n}\sum_{i=1}^n X_i \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)
$$
from which we deduce
$$
P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \leq -\epsilon \right\} = P\left\{ \E[X] -\frac{1}{n}\sum_{i=1}^n X_i \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right).
$$
The third inequality involving the absolute value is a direct consequence of the first two and the union bound:
\begin{align*}
P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \E [X]\right| \geq \epsilon \right\}
&= P\left\{ \left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon\right\} \cup \left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon\right\} \right\} \\
& \leq P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon\right\} + P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon\right\} \\
& \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right).
\end{align*}

Let $f: \X^n \rightarrow \R$ be a function of $n$ variables such that for some nonnegative constants $c_i\geq 0$, $i=1,\dots,n$, $$ \sup_{(x_1,\dots, x_n) \in \X^n, x_i^\prime \in \X} |f(x_1,\dots, x_n) - f(x_1, \dots,x_{i-1}, x_i^\prime, x_{i+1}, \dots, x_n)| \leq c_i,\quad i=1,\dots, n, $$ holds and let $X_i$, $i=1,\dots,n$, be $n$ independent (but not necessarily identically distributed) random variables. Then, $$ P\left\{ f(X_1,\dots, X_n) - \E f(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right), $$ $$ P\left\{ \E f(X_1,\dots, X_n) - f(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right), $$ and $$ P\left\{ \left|f(X_1,\dots, X_n) - \E f(X_1,\dots, X_n) \right| \geq \epsilon \right\} \leq 2\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right). $$
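As a sanity check on the statement, applying it to the empirical mean $f(x_1,\dots,x_n) = \frac{1}{n}\sum_i x_i$ of variables in $[a,b]$ recovers Hoeffding's inequality: changing one coordinate moves $f$ by at most $c_i = (b-a)/n$, so $\sum_i c_i^2 = (b-a)^2/n$. A minimal sketch (function names are ours):

```python
import math

# McDiarmid applied to the empirical mean: with c_i = (b - a) / n,
# sum_i c_i^2 = (b - a)^2 / n, so the two-sided McDiarmid bound
# coincides with the two-sided Hoeffding bound.
def mcdiarmid_bound(eps, c):
    return 2.0 * math.exp(-2.0 * eps**2 / sum(ci**2 for ci in c))

def hoeffding_bound(n, eps, a, b):
    return 2.0 * math.exp(-2.0 * n * eps**2 / (b - a) ** 2)

n, eps, a, b = 100, 0.3, 0.0, 1.0
c = [(b - a) / n] * n  # bounded-differences constants for the mean
print(mcdiarmid_bound(eps, c), hoeffding_bound(n, eps, a, b))  # equal
```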

coming soon...

- For the sake of simplicity, we restricted the discussion to continuous random variables with densities. However, these inequalities can be proved in a more general setting.
- Hoeffding's inequality originally applies to sums of random variables rather than averages as presented here. This more standard formulation is the one obtained at the end of the second step of the proof.