\[ \def\range{\text{range}} \def\Real{\mathbb{R}} \def\null{\text{null}} \def\zero{0} \def\one{\mathbf{1}} \def\tran{^\top} \def\pinv{^\dagger} \def\inv{^{-1}} \def\norm#1{\left\|#1\right\|} \def\inner#1{\left<#1\right>} \def\set#1{\left\{#1\right\}} \def\abs#1{\left|#1\right|} \def\round#1{\left(#1\right)} \]
Lecture 1: What is Machine Learning?
Systems consist of some inputs and some outputs. In this view, any system is just a map from the input space to the output space. There are multiple ways to provide a recipe for such a map.
Machine learning is a broad class of methods for learning such maps from data. At a high level, the key problem of machine learning is so-called Supervised Learning, where we are provided \(n\) pairs of examples \(\mathcal{D}_n=\left((x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)\right)\in(\mathbb{X}\times\mathbb{Y})^n\). The domain of the inputs is \(\mathbb X\) and the range of the outputs is \(\mathbb Y\), so \(x_i\in\mathbb X\) and \(y_i\in\mathbb Y\).
The main object of interest
is a map \(\mathbb{X}\to\mathbb{Y}\) that we wish to learn from these \(n\) examples, \[f_{\mathcal{D}_n}:\mathbb X\to\mathbb Y.\]
How good is a proposed map \(f\)?
Once a map \(f_{\mathcal{D}_n}\) is chosen, we ask: does this map generalize? That is, is \[f_{\mathcal{D}_n}(x_{n+1})\overset{?}{=}y_{n+1}\] for a pair \((x_{n+1},y_{n+1})\) that is potentially not an element of \(\mathcal{D}_n\)?
Prima facie, this may seem like a hopeless endeavour if \((x_{n+1},y_{n+1})\notin\mathcal{D}_n\). However, we as humans are routinely able to come up with such maps.
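To make the setup concrete, here is a minimal Python sketch; the dataset, the look-up table, and the 1-nearest-neighbour rule below are illustrative choices, not part of the lecture's formal development.

```python
# A toy instance of the supervised-learning setup: X = R, Y = {-1, +1}.
# The dataset and both candidate maps are made up purely for illustration.

# D_n: n = 4 example pairs (x_i, y_i).
D_n = [(0.1, -1), (0.4, -1), (0.6, +1), (0.9, +1)]

# Candidate map 1: pure memorization, i.e., a look-up table over the seen inputs.
lookup = dict(D_n)

# Candidate map 2: 1-nearest-neighbour, i.e., copy the label of the closest seen input.
def f_Dn(x):
    _, y_nearest = min(D_n, key=lambda pair: abs(pair[0] - x))
    return y_nearest

# A new pair (x_{n+1}, y_{n+1}) that does not appear in D_n.
x_new, y_new = 0.8, +1

print(lookup.get(x_new))       # None: memorization says nothing about unseen inputs
print(f_Dn(x_new) == y_new)    # True: the 1-NN map happens to generalize here
```

A pure memorizer has nothing to say about an unseen input, whereas even a crude rule like 1-nearest-neighbour can at least hazard a guess; whether such guesses are reliably correct for pairs outside \(\mathcal{D}_n\) is exactly the generalization question.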
Some intuition about the key challenge in Machine Learning
The problem is clearly not interesting if \(|\mathbb X|\) is small, since we can just store a look-up table in memory once and for all. What’s the hoopla all about? Well, in many practical applications \(\mathbb X\) is huge (often exponentially large), so storing a look-up table becomes impractical.
Question: Suppose \(\mathbb X\) and \(\mathbb Y\) are finite sets. How many functions of the form \(\mathbb X \to\mathbb Y\) exist?
Answer: \(|\mathbb Y|^{|\mathbb X|}\). For any non-trivial range \(\mathbb Y\), this is huge: even for \(\mathbb Y=\{-1,+1\}\), which corresponds to the problem of Binary Classification, there are \(2^{|\mathbb X|}\) such functions.
We will later see that the hardness[1] of learning a problem is related roughly to \[\log(\text{size of the search space})\approx |\mathbb{X}|\cdot\log|\mathbb{Y}|.\]
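As a quick sanity check on this counting argument, the sketch below compares the size of the search space with its logarithm; the choice of input space (all \(4\times4\) binary images) is an arbitrary illustration, not from the lecture.

```python
import math

# Illustrative input space: all 4x4 binary images, so |X| = 2^16 = 65,536.
size_X = 2 ** (4 * 4)
# Binary classification: Y = {-1, +1}, so |Y| = 2.
size_Y = 2

# The search space contains |Y|^|X| distinct classifiers. Even writing this
# number down is hopeless, so we only report how many decimal digits it has.
digits = size_X * math.log10(size_Y)
print(f"|Y|^|X| has about {digits:,.0f} decimal digits")          # ~19,728

# Its logarithm, |X| * log|Y|, is a perfectly manageable quantity.
print(f"|X| * log|Y| = {size_X * math.log(size_Y):,.1f}")          # ~45,426.1
```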
We will spend quite some time on the simplest such problem, binary classification, which is simpler than the regression problem, where \(\mathbb{Y}\subset\mathbb{R}\), i.e., the range can contain real numbers.
Note that while the regression problem is harder in a statistical sense, it ends up being somewhat simpler analytically; hence we will start our discussion with a regression problem.
The curse of dimensionality
If we consider the space \(\mathbb{X}=[0,1]\), or rather a (uniformly) discretized version of it, \(\mathbb{X}=\{0,\varepsilon,2\varepsilon,\ldots,1-\varepsilon,1\}\), then \(|\mathbb{X}|=1+\frac1{\varepsilon}\) for some numerical precision \(\varepsilon\).
What’s worse: if we consider problems where the domain is the uniformly discretized unit cube in \(d\) dimensions, \(\mathbb{X}=[0,1]^d\), then \[|\mathbb{X}|=\left(1+\frac1{\varepsilon}\right)^d.\]
This is often called the curse of dimensionality.
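The short sketch below (with an arbitrary precision \(\varepsilon=0.1\) and a few arbitrary dimensions) tabulates how quickly this grid size grows with \(d\).

```python
# Grid size of the uniformly discretized unit cube [0,1]^d at precision eps.
# The precision and the dimensions shown are arbitrary illustrative choices.
eps = 0.1
for d in (1, 2, 5, 10, 20):
    grid_points = (1 + 1 / eps) ** d        # |X| = (1 + 1/eps)^d
    print(f"d = {d:2d}:  |X| = {grid_points:.3e}")
```

Already at \(d=20\) with a coarse precision of \(0.1\), the grid has roughly \(6.7\times10^{20}\) points, far too many to enumerate, let alone label.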
At a very high level, perhaps one important way in which machine learning differs from classical curve fitting is that it has enabled us to solve problems in high dimensions, despite the curse of dimensionality.
Footnotes
[1] There are several notions of hardness in machine learning: statistical hardness, computational hardness, and optimization hardness. Here we are referring to statistical hardness, as measured through quantities such as the Rademacher complexity and the VC dimension, to be defined later in the course.