A covering number is the minimal number of balls of a given size required to entirely cover a given set.
More formally, in a space equipped with a (possibly pseudo) metric, we consider covering a given subset of the space with open balls of a given radius, say $\epsilon$. Then, any set of balls entirely convering the considered subset is called an $\epsilon$-cover of the subset, while the set of their centers is an $\epsilon$-net.
The cardinality of the smallest such $\epsilon$-net is a covering number of the subset for the chosen value of $\epsilon$.
A subset is said to be $\epsilon$-separated if the distance between all pairs of elements from the subset is at least $\epsilon$. An $\epsilon$-packing number is the maximal number of elements that can be $\epsilon$-separated.
Covering and packing numbers are particularly useful to measure the capacity of real-valued function classes, as for instance in obtaining margin-based error bounds.
Let $(E,\rho)$ be a (possibly pseudo) metric space and consider a set $F\subset E$ and a scale parameter $\epsilon\in (0,+\infty)$. An $\epsilon$-cover of $F$ is a coverage of $F$ with open balls of radius $\epsilon$ and centers in $E$. These centers form an $\epsilon$-net of $F$. If $F$ has an $\epsilon$-net of finite cardinality, then its covering number is the smallest cardinality of its $ \epsilon$-nets: $$ \mathcal{N}(\epsilon, F, \rho) = \min_{C \subset E} |C|,\quad s.t. \ \forall f\in F,\ \rho(f,C) < \epsilon , $$ where the dependence of the covering number $\mathcal{N}(\epsilon, F, \rho)$ on $\epsilon$ and the metric $\rho$ appears explicitly in the notation and where the distance from an element $f$ of $F$ to the set of centers $C$ is defined as $$ \rho(f, C) = \min_{c\in C} \rho( f,c). $$ If there is no finite $\epsilon$-net, then the covering number is defined to be infinite.
We prove that $\mathcal{M}(2 \epsilon, F, \rho) \leq \mathcal{N}(\epsilon, F, \rho) $ as follows. For $\mathcal{M}(2 \epsilon, F, \rho)=1$ this is obvious. Otherwise, let $P$ be the $2\epsilon$-separated set of maximal cardinality $\mathcal{M}(2 \epsilon, F, \rho)\geq 2$ and $\mathcal{C}$ be the $\epsilon$-net with minimal cardinality $ \mathcal{N}(\epsilon, F, \rho)$. Note that any $\epsilon$-cover of $F$ is also an $\epsilon$-cover of $P\subset F$ and thus, given a point $p\in P$, there is a center $c\in\mathcal{C}$ for which $$ \rho(p,c) < \epsilon. $$ On the other hand, the triangle inequality gives \begin{align} &\rho(p,p^\prime) \leq \rho(p, c) + \rho(c, p^\prime) \\ \Rightarrow\ & \rho(c, p^\prime) \geq \rho(p,p^\prime) - \rho(p, c) \end{align} Since $P$ is $2\epsilon$-separated, for any $p\in P$ and $p^\prime \in P$ such that $p^\prime \neq p$, this yields $$ \rho(c, p^\prime) \geq 2\epsilon - \rho(p, c) > \epsilon. $$ Therefore, any $\epsilon$-net of $P$ must contain at least one center per point in $P$, hence $|\mathcal{C}| \geq |P|$ and we obtain the desired result.
Let $P\subseteq F$ be an $\epsilon$-separated set. If $P$ is not an $\epsilon$-net for $F$, then there is some $p \in F$ for which $\forall p\in P$, $\rho(p,p^\prime) \geq \epsilon$. Thus, $P\cup\{p^\prime\}$ is also an $\epsilon$-separated set and we conclude that any $\epsilon$-separated set of maximal cardinality $\mathcal{M}(\epsilon, F, \rho)$ must also be an $\epsilon$-net for $F$. This implies that the minimal cardinality $\mathcal{N}(\epsilon, F, \rho)$ of an $\epsilon$-net of $F$ is at least $\mathcal{M}(\epsilon, F, \rho)$.
The relationship also holds for the uniform covering and packing numbers: for an $\ell_p$ pseudo-metric $d_{\ell_p(\g x_n)}$, we have $$ \mathcal{M}_p(2 \epsilon, F, n) \leq \mathcal{N}_p(\epsilon, F, n) \leq \mathcal{M}_p(\epsilon, F, n) . $$
Let $\hat{\g x}_n \in \arg\max_{\g x_n \in \X^n } \mathcal{N}(\epsilon, F, d_{\ell_{p}(\g x_n)})$. Then, the classical relationship for $\rho = d_{\ell_{p}(\hat{\g x}_n)}$ leads to $$ \mathcal{N}_{p}(\epsilon, F, n) = \mathcal{N}(\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{M}(\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{M}_p(\epsilon, F, n) $$ Similarly, with $\hat{\g x}_n \in \arg\max_{\g x_n \in \X^n } \mathcal{M}(2\epsilon, F, d_{\ell_{p}(\g x_n)})$, we have $$ \mathcal{M}_p(2\epsilon, F, n) = \mathcal{M}(2\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{N}(\epsilon, F, d_{\ell_{p}(\hat{\g x}_n)}) \leq \mathcal{N}_{p}(\epsilon, F, n). $$