Output labels

In words...

Machine learning works from data. In supervised learning, the machine is given both input data and output labels, from which it produces a model that can predict the label for any input. In unsupervised learning, the machine is only given a set of input data from which it should extract knowledge, typically by assigning labels to the data.

Therefore, in machine learning, each input is associated with a label (either given or to be predicted).

The nature of the labels depends directly on the type of problem. In a classification problem, where we aim to group the data into different categories, the labels take a finite number of possible values corresponding to the names of the categories (or a set of integers identifying the categories).

In a regression problem, a label can be any number.

In this book, we limit ourselves to these two cases, although machine learning techniques also exist for multi-dimensional labels.
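As an illustration of the two kinds of labels described above, here is a minimal Python sketch. The category names, the integer encoding, and all the label values are invented for the example; they are assumptions, not data from this book.

```python
# Classification: labels come from a finite set of category names.
category_names = ["apple", "banana"]  # hypothetical categories

# Equivalent encoding: one integer per category.
name_to_int = {name: i for i, name in enumerate(category_names)}

raw_labels = ["banana", "apple", "apple", "banana"]  # made-up labels
int_labels = [name_to_int[name] for name in raw_labels]

# Regression: a label can be any (real) number.
regression_labels = [0.25, -3.7, 12.0]  # made-up values

print(int_labels)
print(regression_labels)
```

Whether the categories are stored as names or as integers does not change the problem; the integer form is simply more convenient for computations.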

In pictures...

Making predictions about fruits

Recall the example used to introduce input data:
We want to make a computer learn to distinguish between apples and bananas. To do this, we have to give the computer data about a given fruit. There are many ways to do this; here we consider a simple scheme where the user enters the width and height of the fruit, i.e., a vector with 2 components that can be represented as a point in a 2D plot.

In the corresponding pictures below, the labels of the data can be either "apple" or "banana" and are represented in the right-hand plot by the color of the points. The data are not yet labeled, and the first step in supervised learning would be to label them.

You can click on fruits to set their label and click again to change it.
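The labeling step can also be sketched in code. The following is only a possible representation, assuming each fruit is stored as a (width, height) pair and an unlabeled fruit is marked with `None`. The measurements and the helper `set_label` are hypothetical.

```python
ALLOWED_LABELS = ("apple", "banana")

# Each fruit is a 2D input vector (width, height); the values are invented.
fruits = [(7.5, 7.0), (3.5, 18.0), (8.0, 7.8), (4.0, 20.0)]

# Initially, no fruit has a label.
labels = [None] * len(fruits)

def set_label(index, label):
    """Set (or overwrite) the label of the fruit at position `index`."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")
    labels[index] = label

set_label(0, "apple")
set_label(1, "apple")   # a labeling mistake...
set_label(1, "banana")  # ...corrected by setting the label again

# Labeled pairs (input, label) are what a supervised learner would use.
labeled = [(x, y) for x, y in zip(fruits, labels) if y is not None]
unlabeled = [x for x, y in zip(fruits, labels) if y is None]

print(labeled)
print(unlabeled)
```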

In maths...

We denote a label by $y$ and assume that all labels for a given problem are taken from a set of possible labels $\Y$.

Throughout this book, we restrict ourselves to labels $y$ that are taken either from a discrete set of integers, i.e., $\Y\subset \mathbb{Z}$, or from the real numbers, i.e., $\Y \subseteq \R$, depending on whether we are dealing with a classification or a regression problem.

Most of the time, we will also consider labels as random variables $Y$ taking values in $\Y$. In this context, $y$ refers to a realization of $Y$.
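To make the distinction between the random variable $Y$ and a realization $y$ concrete, here is a small Python sketch. The label set, the probabilities, and the Gaussian distribution are all assumptions chosen for illustration; nothing in this section prescribes a particular distribution for $Y$.

```python
import random

rng = random.Random(0)  # fixed seed so that the draws are reproducible

# Classification: Y takes values in a discrete set of integers.
label_set = [0, 1]
probabilities = [0.3, 0.7]  # assumed P(Y = 0) and P(Y = 1)

# One draw gives a realization y of the random variable Y.
y_class = rng.choices(label_set, weights=probabilities, k=1)[0]

# Regression: Y takes real values; a standard Gaussian is assumed here.
y_reg = rng.gauss(0.0, 1.0)

print(y_class, y_reg)
```

Each call that draws a value produces a new realization, while the distribution of $Y$ itself stays the same.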