Covering numbers, epsilon-nets and packing numbers

In words...

A covering number is the minimal number of balls of a given size required to entirely cover a given set.

More formally, in a space equipped with a (possibly pseudo) metric, we consider covering a given subset of the space with open balls of a given radius, say $\epsilon$. Then, any set of balls entirely convering the considered subset is called an $\epsilon$-cover of the subset, while the set of their centers is an $\epsilon$-net.

The cardinality of the smallest such $\epsilon$-net is a covering number of the subset for the chosen value of $\epsilon$.

A subset is said to be $\epsilon$-separated if the distance between all pairs of elements from the subset is at least $\epsilon$. An $\epsilon$-packing number is the maximal number of elements that can be $\epsilon$-separated.

Covering and packing numbers are particularly useful to measure the capacity of real-valued function classes, as for instance in obtaining margin-based error bounds.

In pictures...

In maths...

Covering numbers

Let $(E,\rho)$ be a (possibly pseudo) metric space and consider a set $F\subset E$ and a scale parameter $\epsilon\in (0,+\infty)$. An $\epsilon$-cover of $F$ is a coverage of $F$ with open balls of radius $\epsilon$ and centers in $E$. These centers form an $\epsilon$-net of $F$. If $F$ has an $\epsilon$-net of finite cardinality, then its covering number is the smallest cardinality of its $ \epsilon$-nets: $$ \mathcal{N}(\epsilon, F, \rho) = \min_{C \subset E} |C|,\quad s.t. \ \forall f\in F,\ \rho(f,C) < \epsilon , $$ where the dependence of the covering number $\mathcal{N}(\epsilon, F, \rho)$ on $\epsilon$ and the metric $\rho$ appears explicitly in the notation and where the distance from an element $f$ of $F$ to the set of centers $C$ is defined as $$ \rho(f, C) = \min_{c\in C} \rho( f,c). $$ If there is no finite $\epsilon$-net, then the covering number is defined to be infinite.

Packing numbers

With the notations above, a finite set $P\subset F$ is said to be $\epsilon$-separated if all pairs of distinct elements of $P$ are $\epsilon$-separated, i.e., if $$ \forall p\in P, \forall p^\prime \in P\setminus\{p\}, \quad \rho(p,p^\prime) \geq \epsilon. $$ The maximal size of $\epsilon$-separated sets in $F$ is the packing number of $F$ at scale $\epsilon$: $$ \mathcal{M}(\epsilon, F, \rho) = \max |P|,\ \mbox{s.t. } P \mbox{ is } \epsilon\mbox{-separated}. $$

Uniform covering and packing numbers

For a function class $F \subseteq \R^{\X}$ and a sequence of points $\g x_n = \{x_i\}_{i=1}^n \in \X^n$, we define the $\ell_{\infty}$-pseudo metric as $$ \forall (f,f^\prime) \in F^2,\quad d_{\ell_{\infty}(\g x_n)}(f,f^\prime) = \max_{i\in\{1,\dots,n\}} |f(x_i) - f^\prime(x_i) |. $$ Such a metric depends on the sample $\g x_n$ and leads to sample-dependent covering and packing numbers. In this case, we also define the sample-independent uniform covering numbers $$ \mathcal{N}_{\infty}(\epsilon, F, n) = \max_{\g x_n \in \X^n}\ \mathcal{N}(\epsilon, F, d_{\ell_{\infty}(\g x_n)}) $$ which upper bound the covering numbers with respect to the $\ell_{\infty}$ pseudo-metric induced by any sample $\g x_n$, and, similarly, the uniform packing numbers $$ \mathcal{M}_{\infty}(\epsilon, F, n) = \max_{\g x_n \in \X^n}\ \mathcal{M}(\epsilon, F, d_{\ell_{\infty}(\g x_n)}). $$ Similar definitions can be introduced for other $\ell_p$-pseudo metrics.

Relationship between covering and packing numbers

We have the following relationship: $$ \mathcal{M}(2 \epsilon, F, \rho) \leq \mathcal{N}(\epsilon, F, \rho) \leq \mathcal{M}(\epsilon, F, \rho) . $$

We prove that $\mathcal{M}(2 \epsilon, F, \rho) \leq \mathcal{N}(\epsilon, F, \rho) $ as follows. For $\mathcal{M}(2 \epsilon, F, \rho)=1$ this is obvious. Otherwise, let $P$ be the $2\epsilon$-separated set of maximal cardinality $\mathcal{M}(2 \epsilon, F, \rho)\geq 2$ and $\mathcal{C}$ be the $\epsilon$-net with minimal cardinality $ \mathcal{N}(\epsilon, F, \rho)$. Note that any $\epsilon$-cover of $F$ is also an $\epsilon$-cover of $P\subset F$ and thus, given a point $p\in P$, there is a center $c\in\mathcal{C}$ for which $$ \rho(p,c) < \epsilon. $$ On the other hand, the triangle inequality gives \begin{align} &\rho(p,p^\prime) \leq \rho(p, c) + \rho(c, p^\prime) \\ \Rightarrow\ & \rho(c, p^\prime) \geq \rho(p,p^\prime) - \rho(p, c) \end{align} Since $P$ is $2\epsilon$-separated, for any $p\in P$ and $p^\prime \in P$ such that $p^\prime \neq p$, this yields $$ \rho(c, p^\prime) \geq 2\epsilon - \rho(p, c) > \epsilon. $$ Therefore, any $\epsilon$-net of $P$ must contain at least one center per point in $P$, hence $|\mathcal{C}| \geq |P|$ and we obtain the desired result.

Let $P\subseteq F$ be an $\epsilon$-separated set. If $P$ is not an $\epsilon$-net for $F$, then there is some $p \in F$ for which $\forall p\in P$, $\rho(p,p^\prime) \geq \epsilon$. Thus, $P\cup\{p^\prime\}$ is also an $\epsilon$-separated set and we conclude that any $\epsilon$-separated set of maximal cardinality $\mathcal{M}(\epsilon, F, \rho)$ must also be an $\epsilon$-net for $F$. This implies that the minimal cardinality $\mathcal{N}(\epsilon, F, \rho)$ of an $\epsilon$-net of $F$ is at least $\mathcal{M}(\epsilon, F, \rho)$.

The relationship also holds for the uniform covering and packing numbers: for an $\ell_p$ pseudo-metric $d_{\ell_p(\g x_n)}$, we have $$ \mathcal{M}_p(2 \epsilon, F, n) \leq \mathcal{N}_p(\epsilon, F, n) \leq \mathcal{M}_p(\epsilon, F, n) . $$

Let $\hat{\g x}_n \in \arg\max_{\g x_n \in \X^n } \mathcal{N}(\epsilon, F, d_{\ell_{p}(\g x_n)})$. Then, the classical relationship for $\rho = d_{\ell_{p}(\hat{\g x}_n)}$ leads to $$ \mathcal{N}_{p}(\epsilon, F, n) = \mathcal{N}(\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{M}(\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{M}_p(\epsilon, F, n) $$ Similarly, with $\hat{\g x}_n \in \arg\max_{\g x_n \in \X^n } \mathcal{M}(2\epsilon, F, d_{\ell_{p}(\g x_n)})$, we have $$ \mathcal{M}_p(2\epsilon, F, n) = \mathcal{M}(2\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{N}(\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{N}_{p}(\epsilon, F, n). $$