Concentration inequalities allow us to bound the probability that the empirical average is far away from the true mean of a random variable.
A major advantage of such inequalities is that they are distribution-free, i.e., they hold for any distribution of the random variables.
The plot shows the probability density function of $X$ with mean $\mu$. One shaded area (in blue) corresponds to the probability $P_X(|X|\geq t)$ of $X$ taking a value with magnitude at least $t$, while the other area, shaded in red, represents the ratio $\frac{\E_X[|X|]}{t}$ of the expectation of the magnitude of $X$ to $t$. Markov's inequality states that the blue area cannot be larger than the red area.
A second demo lets you choose a sample size $n$ and a tolerance $\epsilon$; it then reports the empirical average $\hat{\mu}$, which is close to the true mean $\mu$, together with the resulting bound on $P(|\hat{\mu}-\mu| \geq \epsilon)$.
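This demo can be reproduced offline with a few lines of code. The sketch below is only an illustration: it assumes $X$ uniform on $[0,1]$ (so $\mu = 0.5$) and picks $n$ and $\epsilon$ arbitrarily.

```python
# A minimal sketch of the demo: draw n samples of X (here uniform on [0, 1], so mu = 0.5),
# compute the empirical average, and compare it to the true mean.
import numpy as np

rng = np.random.default_rng(0)
n, eps = 1000, 0.05                         # sample size and tolerance (arbitrary choices)
x = rng.uniform(0.0, 1.0, size=n)           # n i.i.d. samples of X
mu_hat = x.mean()                           # empirical average
print(f"mu_hat = {mu_hat:.4f}, mu = 0.5, |mu_hat - mu| = {abs(mu_hat - 0.5):.4f}, eps = {eps}")
```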
Markov's inequality relates the probability that a continuous random variable $X$ takes large values to the expectation of its magnitude:
$$ \forall t>0,\quad P(|X| \geq t) \leq \frac{\E_X [ |X| ]}{t} $$
Proof: \begin{align*} \E_X [ |X| ] & = \int_{\R} |x|\, p_X(x)\, dx \\ & = \int_{|x|< t} |x|\, p_X(x)\, dx + \int_{|x|\geq t} |x|\, p_X(x)\, dx \\ & \geq \int_{|x| \geq t} |x|\, p_X(x)\, dx & \mbox{(due to the first integral being nonnegative)} \\ & \geq \int_{|x| \geq t} t\, p_X(x)\, dx = t \int_{|x| \geq t} p_X(x)\, dx & \mbox{(due to } |x|\geq t ) \\ & = t P(|X| \geq t ) & \mbox{(due to the }\href{pdf.html}{\mbox{definition of the density}}) \end{align*}
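As a quick sanity check (not part of the original proof), the inequality can be verified by Monte Carlo simulation. The sketch below assumes $X$ follows an exponential distribution with unit rate, an arbitrary choice made for illustration.

```python
# Monte Carlo check of Markov's inequality for an exponential X (arbitrary example distribution).
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # samples of X >= 0, so |X| = X
mean_abs = np.abs(x).mean()                      # estimates E[|X|] (here equal to 1)

for t in [0.5, 1.0, 2.0, 4.0]:
    tail = np.mean(np.abs(x) >= t)               # estimates P(|X| >= t)
    print(f"t={t:3.1f}   P(|X|>=t) ~ {tail:.4f}   E[|X|]/t = {mean_abs / t:.4f}")
```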
Chebyshev's inequality relates the distance of a random variable $X$ from its mean $\E_X[X]$ to its variance $Var(X)$:
$$ \forall t>0,\quad P(|X - \E_X[X]| \geq t) \leq \frac{Var ( X )}{t^2} $$
Proof: \begin{align*} P(|X - \E_X[X]| \geq t) & = P((X - \E_X[X])^2 \geq t^2) \end{align*} Applying Markov's inequality to the nonnegative random variable $Z = (X - \E_X[X])^2$ with the constant $t^2$ then yields $$ P((X - \E_X[X])^2 \geq t^2) \leq \frac{\E_X [ (X - \E_X[X])^2 ]}{t^2} = \frac{Var ( X )}{t^2} $$
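Since the Gaussian tail probability is available in closed form, Chebyshev's bound can be compared to the exact value without simulation. The snippet below is only an illustration and assumes $X$ is Gaussian with standard deviation $\sigma=1$ (so that $P(|X - \E_X[X]| \geq t) = \mathrm{erfc}(t/(\sigma\sqrt{2}))$).

```python
# Chebyshev's bound versus the exact tail of a Gaussian X with standard deviation sigma.
import math

sigma = 1.0
for t in [1.0, 2.0, 3.0]:
    exact = math.erfc(t / (sigma * math.sqrt(2)))    # exact P(|X - E[X]| >= t)
    bound = sigma**2 / t**2                          # Chebyshev: Var(X) / t^2
    print(f"t={t:.1f}   exact={exact:.4f}   bound={bound:.4f}")
```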
For $n$ independent and identically distributed (i.i.d.) random variables $X_i$, seen here as independent copies of $X$, we have
$$ P\left\{ \left|\frac{1}{n}\sum_{i=1}^n X_i - \E[X] \right| \geq t\right\} \leq \frac{Var(X)}{nt^2} $$
Proof: applying Chebyshev's inequality to the random variable $\frac{1}{n}\sum_{i=1}^n X_i$, we obtain $$ P\left\{ \left|\frac{1}{n}\sum_{i=1}^n X_i - \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| \geq t\right\} \leq \frac{Var (\frac{1}{n}\sum_{i=1}^n X_i )}{t^2} , $$ where the linearity of the expectation yields $$ \E\left[\frac{1}{n}\sum_{i=1}^n X_i\right] = \frac{1}{n}\sum_{i=1}^n \E [ X_i ] = \frac{1}{n}\sum_{i=1}^n \E [ X ] = \E [ X ] $$ (due to the $X_i$'s being identically distributed and sharing the same expectation $\E[X]$). Since the $X_i$'s are independent, we also have $$ Var\left(\frac{1}{n}\sum_{i=1}^n X_i\right) =\frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n^2}\sum_{i=1}^n Var(X) = \frac{Var(X)}{n}, $$ which concludes the proof.
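The bound can also be observed numerically by repeating the experiment many times. The sketch below is an illustration with arbitrary choices: $X$ uniform on $[0,1]$ (so $\E[X]=1/2$ and $Var(X)=1/12$), a fixed $t$, and a fixed number of repetitions.

```python
# Monte Carlo check of P(|mean - E[X]| >= t) <= Var(X) / (n t^2) for X uniform on [0, 1].
import numpy as np

rng = np.random.default_rng(0)
t, reps = 0.02, 1000                                              # tolerance and repetitions (arbitrary)
for n in [100, 1000, 10000]:
    means = rng.uniform(0.0, 1.0, size=(reps, n)).mean(axis=1)    # reps empirical averages of n samples
    freq = np.mean(np.abs(means - 0.5) >= t)                      # estimates the left-hand side
    bound = (1 / 12) / (n * t**2)                                 # Var(X) / (n t^2)
    print(f"n={n:6d}   estimated probability={freq:.3f}   bound={bound:.3f}")
```

For small $n$ the bound exceeds $1$ and is vacuous; it only becomes informative once $n$ is large relative to $Var(X)/t^2$.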
$$ \forall t>0,\ s> 0, \quad P(X \geq t) \leq \frac{\E_X [ e^{sX} ]}{e^{st}} $$
This bound is part of Chernoff's bounding method, in which we choose an $s$ that minimizes (or at least yields a small value for) the right-hand side of the bound. Proof: Since the exponential is strictly increasing and $s$ is positive, we have $$ P(X \geq t) = P(e^{sX} \geq e^{st} ) $$ The result is then a direct consequence of Markov's inequality applied to the (positive) random variable $e^{sX}$ with the positive constant $e^{st}$.
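As an illustration of the method (not taken from the page itself), consider a standard normal $X$, for which $\E_X[e^{sX}] = e^{s^2/2}$: minimizing the right-hand side over $s$ gives $e^{-t^2/2}$ at $s=t$, which the sketch below recovers numerically with a simple grid search.

```python
# Chernoff's bounding method for a standard normal X, where E[exp(sX)] = exp(s^2 / 2).
import math

t = 2.0                                                     # arbitrary threshold
s_grid = [0.01 * k for k in range(1, 1001)]                 # grid of candidate values for s
best = min(math.exp(0.5 * s**2 - s * t) for s in s_grid)    # min over s of E[exp(sX)] / exp(st)
exact = 0.5 * math.erfc(t / math.sqrt(2))                   # exact tail P(X >= t)
print(f"exact P(X>=t) = {exact:.5f}")
print(f"optimized Chernoff bound = {best:.5f}   (exp(-t^2/2) = {math.exp(-0.5 * t**2):.5f})")
```

In this Gaussian example the optimized bound $e^{-t^2/2}$ decays much faster than the Markov or Chebyshev bounds, although it still overestimates the exact tail by a polynomial factor.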
For $n$ i.i.d. random variables $X_i$ that are copies of a bounded random variable $X\in[a,b]$, Hoeffding's inequality states that, for all $\epsilon > 0$, $$ P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right), $$ $$ P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right), $$ $$ P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \E [X]\right| \geq \epsilon \right\} \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right). $$
Step 1.
We first show that, for a bounded random variable $X\in[a,b]$ with $\E_X [X] = 0$, we have
$$
\E_X [ e^{sX} ] \leq e^{s^2(b-a)^2/8}
$$
as follows. For all $x\in[a,b]$, written as $x = \lambda a + (1-\lambda)b$, the convexity of the exponential implies
that $e^x \leq \lambda e^a + (1-\lambda) e^b$ with $\lambda = \frac{b-x}{b-a}$. A similar reasoning applied to $sx \in[sa,sb]$ yields
$$
e^{sx} \leq \frac{b-x}{b-a} e^{sa} + \left(1-\frac{b-x}{b-a}\right) e^{sb} =
\frac{b-x}{b-a} e^{sa} + \frac{x-a}{b-a} e^{sb}
$$
Thus,
\begin{align*}
\E_X[ e^{sX} ] &\leq \E_X\left[ \frac{b-X}{b-a} e^{sa} + \frac{X-a}{b-a} e^{sb} \right] \\
& = \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} + \E_X[X] \left( \frac{-1}{b-a} e^{sa} + \frac{1}{b-a} e^{sb} \right) & \mbox{(due to the }\href{expectation.html}{\mbox{linearity of the expectation}})\\
& = \frac{b}{b-a} e^{sa} - \frac{a}{b-a} e^{sb} & \mbox{(due to } \E_X[X] = 0)
\end{align*}
Now define $c=\frac{-a}{b-a}$, $u=s(b-a)$, and $f(u) = -cu + \log(1-c +ce^u)$. Then,
\begin{align*}
\E_X[ e^{sX} ] &\leq (1 - c) e^{sa} + c e^{sb} = (1 - c) e^{-c s (b-a)} + c e^{(1-c)s(b-a)} & \left( \mbox{due to } a=-c(b-a) \mbox{ and } b = (1-c)(b-a) \right)\\
& = (1 - c + c e^{s(b-a)}) e^{-c s (b-a)} = (1 - c + c e^{u}) e^{-c u} &\left(\mbox{due to } u=s(b-a) \right)\\
& = e^{f(u)}
\end{align*}
We have $f(0)=0$ and the derivative
\begin{align*}
f^\prime(u) &= -c + \frac{ce^u}{1-c+ce^u} = -c + \frac{c}{(1-c)e^{-u} + c}
\end{align*}
that is also zero at $u=0$. The second derivative is bounded, with $\alpha = (1-c)e^{-u}$ and $\beta = c$, by
$$
f^{\prime\prime}(u) = \frac{c (1-c)e^{-u} }{( (1-c)e^{-u} + c)^2}
= \frac{\alpha\beta} {(\alpha+\beta)^2} \leq \frac{1}{4}
$$
since $(\alpha+\beta)^2 - 4\alpha\beta = (\alpha-\beta)^2 \geq 0$.
Therefore, the Taylor-Lagrange formula giving the expansion of $f$ at $0$ yields, for some $\theta\in[0,u]$,
$$
f(u) = f(0) + u f^\prime(0) + \frac{u^2}{2}f^{\prime\prime}(\theta) \leq \frac{u^2}{8} = \frac{s^2(b-a)^2}{8}
$$
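As an aside, the bound of Step 1 can be checked numerically on a simple centered two-point distribution. The sketch below (not part of the proof) takes $P(X=a)=\frac{b}{b-a}$ and $P(X=b)=\frac{-a}{b-a}$, which gives $\E_X[X]=0$, with the arbitrary choice $a=-1$, $b=3$.

```python
# Numerical check of E[exp(sX)] <= exp(s^2 (b-a)^2 / 8) for a centered two-point variable X.
import math

a, b = -1.0, 3.0                          # arbitrary bounds with a < 0 < b
pa, pb = b / (b - a), -a / (b - a)        # P(X=a) and P(X=b), chosen so that E[X] = 0
for s in [0.1, 0.5, 1.0, 2.0]:
    mgf = pa * math.exp(s * a) + pb * math.exp(s * b)   # E[exp(sX)]
    bound = math.exp(s**2 * (b - a)**2 / 8)
    print(f"s={s:3.1f}   E[exp(sX)]={mgf:10.4f}   bound={bound:10.4f}")
```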
Step 2.
We can now apply Chernoff's bound to the random variable $\sum_{i=1}^n \left(X_i - \E[X]\right)$:
\begin{align*}
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st}\E \left[ e^{s \sum_{i=1}^n (X_i - \E[X]) } \right]
= e^{-st}\E \left[ \prod_{i=1}^n e^{s (X_i - \E[X])} \right]
\end{align*}
Due to the independence of the $X_i$'s, the expectation of the product factorizes into the product of the expectations:
\begin{align*}
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st} \prod_{i=1}^n \E \left[e^{s (X_i - \E[X])} \right]
\end{align*}
At this point, we can apply the result of Step 1 to the random variables $( X_i - \E[X] )$: writing $X_i\in[a_i,b_i]$ (here $a_i=a$ and $b_i=b$, but the general notation costs nothing), we have the bounds
$$
a_i - \E[X] \leq X_i - \E[X] \leq b_i - \E[X]
$$
and the fact that
$$
\E\left[ X_i - \E[X] \right] = \E [X_i] - \E[X] =
\E [X] - \E[X] = 0 .
$$
This yields
$$
\E \left[e^{s (X_i - \E[X])} \right] \leq e^{s^2(b_i-a_i)^2/8}
$$
which leads to the following when plugged into Chernoff's bound:
\begin{align*}
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} &\leq e^{-st} \prod_{i=1}^n e^{s^2(b_i-a_i)^2/8}
= \exp\left( -st + \frac{s^2\sum_{i=1}^n (b_i-a_i)^2}{8} \right)
\end{align*}
Choosing $s= 4t / \sum_{i=1}^n (b_i-a_i)^2$ (the value that minimizes the exponent, obtained by setting its derivative with respect to $s$ to zero) then gives
$$
P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} \leq \exp\left( \frac{-4t^2}{\sum_{i=1}^n (b_i-a_i)^2} + \frac{16t^2}{8\sum_{i=1}^n (b_i-a_i)^2} \right) = \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right)
$$
Step 3.
The particular version of Hoeffding's inequality involving the average instead of the sum, as in the first of the three inequalities stated above, is obtained by writing
\begin{align*}
P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \geq \frac{t}{n}\right\} &= P\left\{ \frac{1}{n}\sum_{i=1}^n (X_i - \E[X]) \geq \frac{t}{n}\right\} \\
& = P\left\{ \sum_{i=1}^n (X_i - \E[X]) \geq t\right\} \\
& \leq \exp\left( \frac{-2t^2}{\sum_{i=1}^n (b_i-a_i)^2} \right)
\end{align*}
and choosing $t=n\epsilon$ with $a_i=a$ and $b_i=b$, so that the right-hand side becomes $\exp\left(\frac{-2n^2\epsilon^2}{n(b-a)^2}\right) = \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)$.
Step 4.
The second of the three inequalities is obtained with a similar reasoning applied to the random variable $\sum_{i=1}^n \left(\E[X] - X_i\right)$ instead of $\sum_{i=1}^n \left(X_i - \E[X]\right)$. Using the bounds
$$
\E[X] - b_i \leq \E[X] - X_i \leq \E[X] - a_i
$$
this leads to the same kind of result:
$$
P\left\{ \E[X] -\frac{1}{n}\sum_{i=1}^n X_i \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)
$$
from which we deduce
$$
P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \leq -\epsilon \right\} = P\left\{ \E[X] -\frac{1}{n}\sum_{i=1}^n X_i \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right).
$$
The third inequality, involving the absolute value, is a direct consequence of the first two and the union bound:
\begin{align*}
P\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \E [X]\right| \geq \epsilon \right\}
&= P\left\{ \left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon\right\} \cup \left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon\right\} \right\} \\
& \leq P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E [X] \geq \epsilon\right\} + P\left\{\frac{1}{n}\sum_{i=1}^n X_i - \E [X] \leq - \epsilon\right\} \\
& \leq 2\exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right).
\end{align*}
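Hoeffding's inequality can also be observed in simulation. The sketch below is an illustration with arbitrary parameter choices: Bernoulli variables with success probability $p=0.3$, so that $[a,b]=[0,1]$ and $\E[X]=p$.

```python
# Monte Carlo illustration of the two-sided Hoeffding bound for Bernoulli(p) variables.
import numpy as np

rng = np.random.default_rng(0)
p, eps, reps = 0.3, 0.05, 2000                                 # arbitrary parameter choices
for n in [50, 200, 1000]:
    means = rng.binomial(1, p, size=(reps, n)).mean(axis=1)    # reps empirical averages
    freq = np.mean(np.abs(means - p) >= eps)                   # estimated deviation probability
    bound = 2 * np.exp(-2 * n * eps**2)                        # 2 exp(-2 n eps^2 / (b-a)^2), b-a = 1
    print(f"n={n:5d}   estimated probability={freq:.3f}   Hoeffding bound={bound:.3f}")
```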
Let $f: \X^n \rightarrow \R$ be a function of $n$ variables such that, for some constants $c_i\geq 0$, $i=1,\dots,n$, $$ \sup_{(x_1,\dots, x_n) \in \X^n, x_i^\prime \in \X} |f(x_1,\dots, x_n) - f(x_1, \dots,x_{i-1}, x_i^\prime, x_{i+1}, \dots, x_n)| \leq c_i,\quad i=1,\dots, n, $$ and let $X_i$, $i=1,\dots,n$, be $n$ independent (but not necessarily identically distributed) random variables. Then, $$ P\left\{ f(X_1,\dots, X_n) - \E f(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right), $$ $$ P\left\{ \E f(X_1,\dots, X_n) - f(X_1,\dots, X_n) \geq \epsilon \right\} \leq \exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right), $$ and $$ P\left\{ \left|f(X_1,\dots, X_n) - \E f(X_1,\dots, X_n) \right| \geq \epsilon \right\} \leq 2\exp\left(\frac{-2\epsilon^2}{\sum_{i=1}^n c_i^2}\right). $$
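As a simple sanity check of this statement (an example added here, not spelled out above), take $f(x_1,\dots,x_n) = \frac{1}{n}\sum_{i=1}^n x_i$ with $\X=[a,b]$: changing a single $x_i$ moves $f$ by at most $c_i = \frac{b-a}{n}$, so that $\sum_{i=1}^n c_i^2 = \frac{(b-a)^2}{n}$. For i.i.d. $X_i$ (copies of $X$), the first bound then reads
$$
P\left\{ \frac{1}{n}\sum_{i=1}^n X_i - \E[X] \geq \epsilon \right\} \leq \exp\left(\frac{-2n\epsilon^2}{(b-a)^2}\right),
$$
which is exactly Hoeffding's inequality.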
coming soon...