
(Introduction to) Machine Learning and Evolutionary Robotics

456MI, 470SM

Eric Medvet

A.Y. 2024/2025

1 / 366

Lecturer

Eric Medvet

Research interests:

  • evolutionary computation
  • embodied artificial intelligence
  • machine learning applications

Labs:

2 / 366

Computer Engineering (ING-INF/05) group

Sylvio Barbon Jr.
Fondamenti di informatica
Progettazione del software e dei sistemi informativi
meta learning, applied ML, process mining

Alberto Bartoli
Reti di calcolatori
Computer networks 2 and introduction to cybersecurity
security, applied ML, evolutionary computation

Andrea De Lorenzo
Basi di dati
Programmazione web
security, applied AI&ML, information retrieval, GP

Eric Medvet
Programmazione avanzata
Introduction to machine learning and evolutionary robotics
evolutionary computation, embodied AI, applied ML

Laura Nenzi
Cyber-physical systems
Introduction to Artificial Intelligence
formal methods, runtime verification

Martino Trevisan
Reti di calcolatori
Sistemi operativi
Architetture dei sistemi digitali
network measurements, data privacy, big data

3 / 366

Structure of the course

1st part (6 CFUs, 48 hours): for all of IN23, IN19, SM38, SM36, SM34, SM23, SM28, SM13, and SM64

2nd part (3 CFUs, 24 hours): just for IN23 and IN19

  • what is evolutionary computation?
  • significant applications in robotics

Focus on methodology:

  • how to design, build, and evaluate an ML (or EC) system?
4 / 366

Materials

Teacher slides:

  • available on the course web page
  • might be updated during the course

Notebooks for the lab activity:

  • available on the course web page
  • please, to fully enjoy lab activities, do not look at notebooks in advance

Textbooks:

  • 1st part: James, Gareth, et al.; An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013. Available in the UniTs library.
  • 2nd part: De Jong, Kenneth A.; Evolutionary Computation: A Unified Approach. MIT Press, 2006.

Disclaimer: the overlap between the textbooks and the course material is very partial!

5 / 366

How to attend lectures

Depending on your learning style and habits, you might want to take notes to augment the slide content.

6 / 366

Visual syntax

This is an important concept.

This is a very important key concept, like a definition.

Sometimes there is something that is marginally important: an aside, like this.

There will be scientific papers or books to be referred to, like this book: James, Gareth, et al.; An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.

External resources (e.g., videos, software tools, ...) will be linked directly.

The palette is color-blind safe.

Pseudo-code for describing algorithms in an abstract way:

function factorial(n) {
  p \gets 1
  while n > 1 {
    p \gets n p
    n \gets n - 1
  }
  return p
}

Code in a concrete programming language:

public static String sayHello(String name) {
return "Hello %s".formatted(name);
}
7 / 366

Lab activities and how to attend

Focus on methodology:

  • how to design, build, and evaluate an ML (or EC) system?

Practice (in designing, building, evaluating) is fundamental!

You'll practice doing lab activities:

  • \approx 15 hours in the 1st part
  • in classroom
    • the teacher is there and always available
    • the teacher actively monitors your progress
    • ... but you can do the activities also at home
  • "solution" shown at the end
    • solution = one way of doing design, build, evaluate
  • agnostic w.r.t. concrete tools used
    • teacher is more familiar with R
    • tutor is more familiar with Python
  • suitable to be done in small group (2–4 students)
8 / 366

Lecture times

Where:

  • Room H, building C1, Piazzale Europa Campus
  • Room 3B, building H3, Piazzale Europa Campus
  • Room 2A, building D, Piazzale Europa Campus

When:

  • Monday, 16.00–19.00, H, C1 \rightarrow 16.00–18.30
  • Tuesday, 11.00–13.00, 3B, H3 \rightarrow 11.00–12.30
  • Wednesday, 10.00–12.00, 2A, D \rightarrow 10.00–11.30
9 / 366

Tutor

Michel El Saliby

Role of the tutor:

  • assisting students during lab activities, together with the teacher
  • first point-of-contact for course-related questions by students
    • the teacher is always available
10 / 366

Exam

The exam may be done in two ways:

  1. project and written test
  2. written test only

The written test consists of a few (\approx 6) questions, some with medium-length answers, some with short answers, to be completed in 1 hour.

The project consists of the design, development, and assessment of an ML system dealing with one "problem" chosen among a few options (examples).

  • the student delivers a description, not the software
  • the description is evaluated for clarity, technical soundness, (amount of) results
  • may be done in a group (you are encouraged to form groups!)

The grade is the average of written test and project grades:

  • both must be \ge 18
  • parts can be repeated
  • honors (lode) if and only if both parts are \ge 30 and one is > 30
11 / 366

You?

12 / 366

Basic concepts

13 / 366

What is Machine Learning?

Machine Learning is the science of getting computers to learn without being explicitly programmed.

A few considerations:

  • defining a field of science is hard: science evolves, its boundaries change
  • ML "comes" from many communities' (statistics, computer science, ...) efforts: this (the use of computers) is just one point of view
  • it captures just some "parts" of ML: we'll see

Let's analyze it in detail:

  • is the science: what's science?
  • getting computers: who is doing that?
  • to learn: to learn what? this appears to be the key point!
  • without being explicitly programmed: who is not doing that?
14 / 366

An example: spam detection

GMail screenshot with spam folder

15 / 366

Spam detection: under the hood

What the user sees:

  • unwanted emails (spam) are in a separate place (the spam folder)
  • sometimes some spam email is not put in the spam folder
  • sometimes some not-spam email is put in the spam folder

What the web-based email system (a computer) does:

  • whenever an email arrives, it decides whether the email is spam or not
  • if spam, it moves the email to the spam folder; otherwise, it leaves it in the main place

In brief: a computer is making a decision about an email

16 / 366

Making a decision

Let's be formal:

y = f(x)

  • x: the entity about which the decision has to be made (the email)
  • y: the decision (spam or not-spam)
  • f(\cdot): some procedure that, given an x, results in a decision y

y = f(x) is a formal notation capturing the idea that y is obtained from x by applying f to it.
But it says nothing about the nature of x and y.

f: X \to Y

  • X: the set of all x, the domain of f (all the possible emails)
  • Y: the set of all y, the codomain of f (Y = \{\text{spam}, \text{not-spam}\})

Neither notation says how f works internally.

17 / 366
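To make the notation concrete, here is a minimal sketch (not from the slides) of a hand-written f: X \to Y for spam detection, with X = the set of all strings and Y = \{\text{spam}, \text{not-spam}\} encoded as a boolean; the keyword list is purely illustrative:

import java.util.List;

public class SpamPredicate {

  // purely illustrative keyword list: a hand-crafted notion of "spam"
  static final List<String> KEYWORDS = List.of("lottery", "prize", "winner");

  // f: X -> Y, with X = strings (emails) and Y = {spam, not-spam} as a boolean
  public static boolean isSpam(String email) {
    String lower = email.toLowerCase();
    return KEYWORDS.stream().anyMatch(lower::contains);
  }

  public static void main(String[] args) {
    System.out.println(isSpam("You are the lottery winner!")); // true
    System.out.println(isSpam("Meeting at 10am"));             // false
  }
}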

x and y names

  • x is an observation
    • something that can be observed, precisely because a decision has to be made about it
  • y is the response (for a given x)
    • if you feed the decision system with an x, the system responds with a y

Alternative names:

  • x is an/the input, y is an/the output
    • f as an information processing system
  • x is a data point
    • assuming it carries some data about the underlying entity
  • x is an instance
    • instance [ˈɪnst(ə)ns]: an example or single occurrence of something

Names are used interchangeably; some communities tend to prefer some names.

18 / 366

to learn what?

Machine Learning is the science of getting computers to learn without being explicitly programmed.

  • is the science: what's science?
  • getting computers: who is doing that?
  • to learn: to learn what?
    • how to make a decision y about an observation x. That is: f: X \to Y
  • without being explicitly programmed: who is not doing that?

New version:

Machine Learning is the science of getting computers to learn f: X \to Y without being explicitly programmed.

we want the computer to learn f and use it, not just learn it

19 / 366

Prediction

f is often denoted as f\subtext{predict} since, given an x, it predicts a y

  • when used in practice, i.e., in the prediction phase, f\subtext{predict} produces a guess \hat{y} about an unknown, real y
20 / 366

ff for a computer

Computers execute instructions grouped in programs and expressed according to some language.
f is the mathematical, abstract notation for a computer program that, when executed on an input x \in X, outputs a y \in Y.

Mathematical notation:

X = \mathbb{R}^2 \qquad Y = \mathbb{R} \qquad f: \mathbb{R}^2 \to \mathbb{R} \qquad f(\vect{x}) = f((x_1, x_2)) = \left\lvert\frac{x_1 - x_2}{x_1}\right\rvert

\vect{x} is a notation for vectors or, more broadly, for sequences of homogeneous elements, used in place of \vec{x}

Computer language:

public double f(double[] xs) {
  return Math.abs((xs[0] - xs[1]) / xs[0]);
}

Most (not all) typed languages make the connection clear:

  • double[] is X, i.e., \mathbb{R}^2 (actually \mathbb{R}^p, with p \ge 1)
  • double is Y, i.e., \mathbb{R}
  • xs is \vect{x}
  • f is f: types correspond!
  • there is no explicit counterpart for y

21 / 366

Further point of view

Abstract definition (\approx the signature):

  • just domain and codomain, not how the function works

f: \mathbb{R}^2 \to \mathbb{R}

double f(double[] xs)

[diagram: \vect{x} \in \mathbb{R}^2 \to f \to y \in \mathbb{R}]

Concrete definition (\approx signature and code):

  • domain, codomain, and how the function works

f: \mathbb{R}^2 \to \mathbb{R} \qquad y = f(\vect{x}) = f((x_1, x_2)) = x_1 + x_2

double f(double[] xs) {
  return xs[0] + xs[1];
}

[diagram: \vect{x} \in \mathbb{R}^2 \to x_1 + x_2 \to y \in \mathbb{R}]
22 / 366

Writing f

Usually, computer programs are written by humans, but here:

Machine Learning is the science of getting computers to learn f\subtext{predict}: X \to Y without being explicitly programmed.

without being explicitly programmed means that f\subtext{predict} is not written by a human!

It appears verbose; let's get rid of it.

New version:

Machine Learning is the science of getting computers to learn f\subtext{predict}: X \to Y autonomously.

23 / 366

Finding/writing a program

Alice (computer science instructor) to Bob (student):
"Please, write a program that, given a string, returns the number of vowel occurrences in the string"

Alternative version:
"Please, find a program that, given a string, returns the number of vowel occurrences in the string"

"Find" suggests Bob to apprach the task in two steps:

  1. consider the universe of all the possible programs
  2. choose the one (or ones) that does what expected

In ff terms:

  1. consider FXY={f,f:XY}\mathcal{F}_{X \to Y} = \{f, f: X \to Y\}
  2. choose one fFXYf \in \mathcal{F}_{X \to Y} that does what expected
24 / 366

Desired behavior of ff

  1. consider \mathcal{F}_{X \to Y} = \{f, f: X \to Y\}
  2. choose one f \in \mathcal{F}_{X \to Y} that does what is expected

Step 2 is fundamental in practice

  • "find a program that, given a string, returns a number" wouldn't make sense alone!

... but it is hard to formalize further in general.

There has to be some supervision facilitating the search for a good f.

25 / 366

Supervised learning

When the supervision is in the form of some examples (observation \rightarrow response) and the learned f\subtext{predict} should process them correctly.

  • example: "if I give you this observation x, you should predict this response y"

New version:

Supervised (Machine) Learning is the science of getting computers to learn f\subtext{predict}: X \to Y from examples autonomously.

In unsupervised learning there is no supervision, i.e., there are no examples:

  • nevertheless, there is some implicit expectation about how to process x
  • we'll discuss unsupervised learning later
  • we'll discuss unsupervised learning later
26 / 366

Examples

Formally, the examples available for learning f\subtext{predict} are pairs (x, y).

A dataset compatible with X and Y is a bag of pairs (x, y): D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{i=n}, with \forall i: x^{(i)} \in X, y^{(i)} \in Y and |D| = n.

Or, more briefly, D = \{(x^{(i)}, y^{(i)})\}_i. examples are also denoted by (x_i, y_i), depending on the community

A bag (D should rather be called a databag...):

  • can have duplicates (bag \ne set)
  • does not imply any order among its elements (bag \ne sequence)

In most algorithms, and in their program counterparts, datasets are actually processed sequentially, though.

27 / 366

Learning set

A learning set is a dataset that is used for learning an f\subtext{predict}.

  • may be denoted by D\subtext{learn}, or L, or T (for training set)

The learning set has to be consistent with the domain and codomain of the function f\subtext{predict} to be learned:

  • if f\subtext{predict} \in \mathcal{F}_{X \to Y}, then D\subtext{learn} \in \mathcal{P}^*(X \times Y)
    • X \times Y is the Cartesian product of X and Y, i.e., the set of all possible (x, y) pairs
    • \mathcal{P}(A) is the powerset of A, i.e., the set of all the possible subsets of A
    • \mathcal{P}^*(A) is a custom notation for the powerset with duplicates, i.e., the set of all the possible multisets of A
28 / 366
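As a concrete (Java) counterpart of these definitions, here is a minimal sketch, not from the slides, of a dataset as a bag of (x, y) pairs; the names Example, DatasetDemo, and dLearn are made up for illustration:

import java.util.List;

// an example is a pair (x, y); a List can hold duplicates and we simply
// ignore its order, matching the bag semantics of D in P*(X x Y)
record Example<X, Y>(X x, Y y) {}

class DatasetDemo {
  public static void main(String[] args) {
    // a D_learn compatible with X = strings (emails) and Y = booleans (spam?)
    List<Example<String, Boolean>> dLearn = List.of(
        new Example<>("You are the lottery winner!", true),
        new Example<>("Meeting at 10am", false),
        new Example<>("Meeting at 10am", false) // duplicates are allowed in a bag
    );
    System.out.println(dLearn.size()); // |D| = n = 3
  }
}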

Learning technique

Supervised (Machine) Learning is the science of getting computers to learn f\subtext{predict}: X \to Y from examples autonomously.

In brief: given a D\subtext{learn} \in \mathcal{P}^*(X \times Y), learn an f\subtext{predict} \in \mathcal{F}_{X \to Y}.

A supervised learning technique is a way for learning an f\subtext{predict} \in \mathcal{F}_{X \to Y} given a D\subtext{learn} \in \mathcal{P}^*(X \times Y).

f\subtext{learn}: \mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y} \qquad f\subtext{predict} = f\subtext{learn}(D\subtext{learn})

[diagram: D\subtext{learn} \to f\subtext{learn} \to f\subtext{predict}]

  • learning phase: when f\subtext{learn} is applied to obtain f\subtext{predict} from D\subtext{learn}
  • prediction phase: when f\subtext{predict} is applied to obtain a y from an x
29 / 366

Learning techniques

A supervised learning technique is a way for learning an f\subtext{predict} \in \mathcal{F}_{X \to Y} given a D\subtext{learn} \in \mathcal{P}^*(X \times Y).

Why isn't a single learning technique enough? Why are there many of them?

They differ in:

  • applicability with respect to X and/or Y
    • e.g., some require X = \mathbb{R}^p, some require Y = \mathbb{R}
  • efficiency with respect to |D\subtext{learn}|
    • e.g., some are really fast in producing f\subtext{predict} (\mathcal{O}(|D\subtext{learn}|^{\approx 0})), some are slow (\mathcal{O}(|D\subtext{learn}|^2))
  • effectiveness in terms of the quality of the learned f\subtext{predict}
  • attributes of the learned f\subtext{predict}
    • nature/type of f\subtext{predict} (a formula, a text, a tree...)
    • interpretability of f\subtext{predict}
30 / 366

Who?

Supervised (Machine) Learning is the science of getting computers to learn f\subtext{predict}: X \to Y from examples autonomously.

getting computers: who is doing that?

  • the user of a learning technique, who is likely the designer/developer of an ML system

is the science: what's science?

  • there's not only the user: someone designs/develops learning techniques

New version:

Supervised (Machine) Learning is about designing and applying supervised learning techniques.

31 / 366

Learning as optimization

A supervised learning technique f\subtext{learn}: \mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y} can be seen as a form of optimization:

  1. consider \mathcal{F}_{X \to Y} = \{f, f: X \to Y\}
  2. find the one f\subtext{predict} \in \mathcal{F}_{X \to Y} that works best on D\subtext{learn}

Could we use a general optimization technique?
In principle, yes, but:

  • X (and maybe Y) might be infinite (e.g., X = \mathbb{R}^p)
  • X \times Y is "more" infinite
  • \mathcal{F}_{X \to Y} is "hugely more" infinite

Practical solution: reduce the size of \mathcal{F}_{X \to Y} by considering only the f of some given nature:

  • e.g., for X = Y = \mathbb{R}, consider \mathcal{F}'_{\mathbb{R} \to \mathbb{R}} = \{f: f(x) = ax + b \text{ with } a, b \in \mathbb{R}\}
  • e.g., for x a UTF-8 string and y a Boolean, consider only the f expressible as regular expressions
32 / 366

Templating f

Often a learning technique works on a reduced \mathcal{F}'_{X \to Y} which is based on a template f':

  • most parts of f' are defined; some parts are undefined, variable
  • f' can be used for prediction only if the undefined parts are defined

E.g., for X = Y = \mathbb{R}, f'(x) = ax + b (see the sketch after this slide):

  • you need concrete values for a, b in order to apply f to an x, i.e., to obtain a response y out of an x
  • this is univariate linear regression: we'll expand
    • univariate because X has one dimension
    • regression because Y = \mathbb{R}
    • linear because of the template
33 / 366
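A minimal sketch (not from the slides) of this template in Java: the structure ax + b is fixed, while the pair (a, b) is the undefined, variable part that a learning technique would have to fill in; all names are made up for illustration:

// the undefined part of the template: one m in M = R^2
record LinearModel(double a, double b) {}

class TemplateDemo {

  // f'_predict: X x M -> Y; usable only once (a, b) have concrete values
  static double predict(double x, LinearModel m) {
    return m.a() * x + m.b();
  }

  public static void main(String[] args) {
    LinearModel m = new LinearModel(2.0, 1.0);
    System.out.println(predict(3.0, m)); // y = 2 * 3 + 1 = 7
  }
}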

Model

We can make the undefined part of the template explicit: f\subtext{predict}(x) = f'\subtext{predict}(x, m), where m \in M is the undefined part.

  • e.g., f'\subtext{predict}(x, a, b) = ax + b and M = \mathbb{R}^2

Note that f'\subtext{predict} is fixed for a given learning technique and defines the reduced \mathcal{F}'_{X \to Y} \subset \mathcal{F}_{X \to Y} where the learning will look for an f\subtext{predict}.

Given a template f'\subtext{predict}, m defines an f\subtext{predict} that can be used to predict a y from an x.
That is, m is a model of how y depends on x.

34 / 366

Learning a model

For techniques based on a template, f\subtext{learn} actually searches just \mathcal{F}'_{X \to Y}, hence M, for an f\subtext{predict}.

General case:

f\subtext{learn}: \mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y} \qquad f\subtext{predict}: X \to Y

The learning technique is defined by f\subtext{learn}.

[diagram: D\subtext{learn} \to f\subtext{learn} \to f\subtext{predict}]

With template:

f'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M \qquad f'\subtext{predict}: X \times M \to Y

The learning technique is defined by the pair f'\subtext{learn}, f'\subtext{predict} (see the sketch after this slide).

[diagram: D\subtext{learn} \to f'\subtext{learn} \to m; \; x, m \to f'\subtext{predict} \to y]
35 / 366
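The pair f'\subtext{learn}, f'\subtext{predict} can be rendered as a pair of methods; this is a minimal sketch under the assumption of the Example record introduced earlier, with made-up names:

import java.util.List;

// a templated supervised learning technique as the pair f'_learn, f'_predict
interface LearningTechnique<X, Y, M> {
  M learn(List<Example<X, Y>> dLearn); // f'_learn: P*(X x Y) -> M
  Y predict(X x, M m);                 // f'_predict: X x M -> Y
}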

Examples of templated ff

Problem: price of a flat from its surface

X = \mathbb{R}^+, Y = \mathbb{R}^+
\mathcal{F}_{\mathbb{R}^+ \to \mathbb{R}^+} = \{\dots, x^2, 3, \pi\frac{x^3 + 5x}{0.1 + x}, \dots\}

Learning technique: linear regression

f'\subtext{predict}(x, a, b) = ax + b
M = \mathbb{R} \times \mathbb{R} = \{(a, b): a \in \mathbb{R} \land b \in \mathbb{R}\}
\mathcal{F}'_{\mathbb{R}^+ \to \mathbb{R}^+} = \{\dots, x + 1, 3, \pi x + 5, \dots\}

Problem: classify an email as spam/not-spam

X = A^*, Y = \{\text{spam}, \neg\text{spam}\}, with A = UTF-8 and A^* = \bigcup_{i=0}^{i=\infty} A^i
\mathcal{F}_{A^* \to Y} = \{\dots\} (all predicates on UTF-8 strings)

Learning technique: regex-based flagging (see the sketch after this slide)

f'\subtext{predict}(x, r) = \begin{cases} \text{spam} & \text{if } x \text{ matches } r \newline \neg\text{spam} & \text{otherwise} \end{cases}
M = regexes = \{\dots, ca.++, [a-z]+a.+, \dots\}
\mathcal{F}'_{A^* \to Y} = \{\dots, f'\subtext{predict}(\cdot, [a-z]+a.+), \dots\}

Choosing the learning technique means choosing one \mathcal{F}'_{X \to Y}!

36 / 366
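A minimal sketch (not from the slides) of the regex-based flagging technique in Java; whether "matches" means a full match or containment is a design choice, here containment:

import java.util.regex.Pattern;

class RegexFlagging {

  // f'_predict(x, r): spam if x matches the regex r, not-spam otherwise
  static String predict(String x, Pattern r) {
    return r.matcher(x).find() ? "spam" : "¬spam";
  }

  public static void main(String[] args) {
    Pattern r = Pattern.compile("[a-z]+a.+"); // one m in M = regexes (from the slide)
    System.out.println(predict("buy viagra now", r)); // spam
    System.out.println(predict("Hi!", r));            // ¬spam
  }
}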

Alternative views/terminology

The model m is learned on a dataset D.

  • m is learned from the examples in D

The model m is trained on a dataset D.

  • m is trained to correctly work on the examples in D

The model m is fitted on a dataset D.

  • m is adjusted until it works well on the examples in D

Formally, a model is one specific m \in M that has been found upon learning.
However, "model" is often used to denote a generic (e.g., still untrained/unfitted) artifact.

  • "fit the model": a model exists before fitting (e.g., before the learning phase)
  • "learn a model": the model is the outcome of the learning phase
37 / 366

Common cases and terminology

Supervised learning techniques may be categorized depending on the kind of X, Y, M they deal with.

With respect to Y, the most important cases are:

  • Y is a finite set without intrinsic ordering \rightarrow classification
    • y is said to be a categorical (or nominal) variable
    • if |Y| = 2 \rightarrow binary classification
      otherwise \rightarrow multiclass classification
  • Y = \mathbb{R} (or Y \subseteq \mathbb{R}) \rightarrow regression
    • y is said to be a numerical variable

With respect to X, common cases:

  • X = X_1 \times \dots \times X_p, with each X_i being \mathbb{R} or a finite unordered set (each x is a p-sized tuple)
    • X is multivariate and each x_i is either numerical or categorical
  • X is the set of all strings \rightarrow text mining (we'll see)
38 / 366

Variables terminology

In the common case of a multivariate X = X_1 \times \dots \times X_p:

  • each x_i is called an independent variable
    • or feature, since it is a feature of an x \in X
    • or attribute, since it is an attribute of an x \in X
    • or predictor, since it hopefully helps predicting a y
  • y is called the dependent variable, since it is hoped to depend on x
    • or response variable

Given a dataset D with |D| = n examples defined over X, Y:

D = \begin{pmatrix} x_1^{(1)} & \dots & x_j^{(1)} & \dots & x_p^{(1)} & y^{(1)} \newline \dots & \dots & \dots & \dots & \dots & \dots \newline x_1^{(i)} & \dots & x_j^{(i)} & \dots & x_p^{(i)} & y^{(i)} \newline \dots & \dots & \dots & \dots & \dots & \dots \newline x_1^{(n)} & \dots & x_j^{(n)} & \dots & x_p^{(n)} & y^{(n)} \end{pmatrix}

  • x^{(i)} is the i-th observation
  • \{x_j^{(i)}\}_i are the values of the j-th feature
  • x_j^{(i)} is the value of the j-th feature for the i-th observation (recall: order does not matter in D)
  • y^{(i)} is the response for the i-th observation
    • if y is categorical \rightarrow class label
39 / 366

Size of the "problem"

The common notation for the size of a multivariate dataset (i.e., a dataset with a multivariate X = X_1 \times \dots \times X_p) is:

  • n: the number of observations
  • p: the number of (independent) variables

On the assumption that a dataset D implicitly defines the problem (since it bounds X and Y and hence \mathcal{F}_{X \to Y}), n and p also describe the size of the problem.

40 / 366

What (sup. learning techniques) we will see

A family of learning techniques (tree-based) for:

  • multivariate X = X_1 \times \dots \times X_p, each variable being categorical or numerical
  • classification (binary or multiclass) and regression

A family of learning techniques (SVM) for:

  • X = \mathbb{R}^p
  • binary classification

A learning technique (kNN) for:

  • any X with a similarity metric (including X = \mathbb{R}^p)
  • classification (binary or multiclass) and regression

A learning technique (naive Bayes) for:

  • multivariate X = X_1 \times \dots \times X_p, each variable being categorical (with a mention of the hybrid case)
  • classification (binary or multiclass)
41 / 366

... and...

What if none of the above learning techniques fits the problem (X, Y) at hand?

We'll see:

  • a method for applying techniques suitable for X = \mathbb{R}^p to problems where a multivariate X includes categorical variables
  • a few methods for applying techniques suitable for X = \mathbb{R}^p to problems where X = strings
  • two methods for applying techniques suitable for binary classification (|Y| = 2) to multiclass classification problems (|Y| \ge 2)

What about the other kinds of problems?

42 / 366

ML system

An information processing system in which there is:

  • a supervised learning technique (i.e., a pair f'\subtext{learn}, f'\subtext{predict})
  • other components operating on X or Y
    • pre-processing, if "before" the learning technique, i.e., X \to X'
    • post-processing, if "after" the learning technique, i.e., Y' \to Y
43 / 366

ML system example: Twitter profiling

Goal: given a tweet, determine age range and gender of the author

  • problem 1: X = A^{280}, A = UTF-16, Y = \{\text{0--16}, \text{17--29}, \text{30--49}, \text{50--}\}
  • problem 2: X = A^{280}, A = UTF-16, Y = \{\text{M}, \text{F}\} (or broader)

One possible ML system for this problem:

  • f\subtext{text-to-num}: A^{280} \to [0,1]^{50} (chosen among a few options, maybe adjusted)
  • f\subtext{foreach}: X^* \times \mathcal{F}_{X \to Y} \to Y^* (given an f: X \to Y and a sequence \{x_i\}_i, apply f to each x_i)
  • f'_{\text{learn},1}, f'_{\text{predict},1} and f'_{\text{learn},2}, f'_{\text{predict},2} (two learning techniques suitable for classification)

Learning phase:

D'\subtext{learn} = f\subtext{foreach}(D\subtext{learn}, f\subtext{text-to-num}) (just the x part)
m\subtext{age} = f'_{\text{learn},1}(D'\subtext{learn})
m\subtext{gender} = f'_{\text{learn},2}(D'\subtext{learn})

Prediction phase:

x' = f\subtext{text-to-num}(x)
y\subtext{age} = f'_{\text{predict},1}(x', m\subtext{age})
y\subtext{gender} = f'_{\text{predict},2}(x', m\subtext{gender})

44 / 366

Designing an ML system

  • Who chooses the learning technique(s)?
    • And its parameter values?
  • Who chooses/designs the pre- and post-processing components?
    • And their parameter values?

The designer of the ML system, that is, you¹!


  1. Can those choices be made automatically? "Yes", it's called Auto-ML
45 / 366

Phases of design of an ML system

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement):
    • define XX and YY
    • define a way for assessing solutions
      • before designing!
      • applicable to any compatible ML solution
  4. Design the ML system
    • choose a learning technique
    • choose/design pre- and post-processing steps
  5. Implement the ML system
    • learning/prediction phases
    • obtain the data
  6. Assess the ML system

Steps 4–6 are usually iterated many times

Skills of the ML practitioner/designer:

  • knowing the main ML techniques
  • knowing common pre- and post-processing techniques
  • knowing the main (comparative) assessment techniques
  • implementing them in production
  • motivating all choices

Skills of the ML researcher:

  • (as above, and)
    • implementing them as prototypes
  • designing new ML/pre-/post-processing/assessment techniques
  • formally/experimentally motivating them

Experience, practice, knowledge!

46 / 366

Should I use Machine Learning?

Recall: we need an f\subtext{predict}: X \to Y to make a decision y about an x

Reasons for running f\subtext{predict} on a machine:

  • y has to be computed very quickly
    • a human couldn't keep the pace
  • y has to be computed in a dangerous context
    • or a human is simply not available
  • the value of y is very low
  • it is believed that a human would be biased in deciding y

Even if f\subtext{predict} is run on a machine, f\subtext{predict} might still be designed by a human.

  • human "learning", not machine learning

Reasons for running f\subtext{learn} on a machine, i.e., for obtaining f\subtext{predict} through learning:

  • humans cannot design a reasonable f\subtext{predict}
  • a human-made f\subtext{predict} is too costly/slow
  • a human-made f\subtext{predict} is not good
    • does not make good decisions

Factors:

  • efficiency
  • effectiveness
  • human dignity (cost)
47 / 366

Domain knowledge and data exploration

Reasons for running f\subtext{learn} on a machine:

  • humans cannot design a reasonable f\subtext{predict}: yes or no?
  • a human-made f\subtext{predict} is too costly/slow: yes or no?
  • a human-made f\subtext{predict} is not good: yes or no?

Answering these questions requires knowledge of the domain

  • (necessary, not sufficient)
  • better/more with exploration of the data
    • which data?
    • how to explore it? \rightarrow data visualization
48 / 366

How to choose components?

Component:

  • learning technique
  • pre- or post-processing technique
  • dataset
  • assessment technique

Factors: beyond applicability, which is a yes/no matter

  • effectiveness
    • the component works well (experimental assessment, evaluation metrics and methods)
  • efficiency
    • using the component consumes low resources
  • interpretability
    • the working of the component and/or its outcomes is understandable
  • familiarity
    • the designer does little effort for using the component: e.g., already knows the software tool, good parameter values, ...
  • technological constraints
49 / 366

Example of Iris

Once upon a time, there were Alice, a ML expert, and Bob, an amateur botanist...

Why a story?

  • we need a concrete case in order to practice the phases of ML design (steps 1–3)
  • those steps cannot be practiced on an abstract case
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
50 / 366

Iris species

Once upon a time, there were Alice, a ML expert, and Bob, an amateur botanist.

Bob liked to collect Iris flowers and to sort them properly in his collection boxes at home. He organized the collected flowers depending on their species.

[photos: Iris setosa, Iris versicolor, Iris virginica]

51 / 366

Bob's need

Alice: What's up, Bob?
Bob: I'd like to put new flowers into proper boxes.
Well... I'm not an expert of flowers. Can't you do it by yourself?
No, actually I cannot. But I heard you now master the art of machine learning...
Mmmmhhh... I see that you already have flowers in boxes. How did you sort them? Why ML now?
Well, I used to go to a professional botanist, who was able to tell me the species of each Iris flower I collected. I don't want to bother her anymore and her lab is far from here and it takes time to get there and the fuel is getting more and more pricey... 🦖
Ok, I understand. So you think ML can be helpful here. Let's see...

Some information about the context up to here (Alice's thoughts 💭):

  • problem timings: no real hurry to make a decision
  • scale of the problem: how many flowers would Bob collect per unit of time?
  • cost of the solution: Bob is basically trying to replace a free service with another free service...
  • expected quality of the solution: how picky will Bob be?

No car accidents to be avoided (timing), no billions of emails to be analyzed (scale), no big business process involved (cost), no loan decisions to be made (quality).

52 / 366

Tackling the Iris problem: phase 1 - ML?

Reasons for running f\subtext{predict} on a machine:

  • 👎 y has to be computed very quickly
  • 👎 y has to be computed in a dangerous context
  • 🤏 the value of y is very low
  • 🤌 a human would be biased in deciding y

Reasons for learning f\subtext{predict} on a machine:

  • 👍 humans cannot design a reasonable f\subtext{predict}
  • 🤌 a human-made f\subtext{predict} is too costly/slow
  • 🤏 a human-made f\subtext{predict} is not good

👍: yes!; 👎: no!; 🤏: maybe a bit; 🤌: who knows...

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Outcome: ok, let's use ML!

53 / 366

Phase 2 - supervised vs. unsupervised

Do we have examples at hand?

Yes, Bob already collected some flowers and organized them in boxes. For each of them, there's a species label that has been assigned by an expert (the professional botanist). We assume those labels are correctly assigned.

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Outcome: it's supervised learning!

54 / 366

Phase 3 - problem statement

In natural language: given an Iris flower, assign a species

Formally:
X = \{x: x \text{ is an Iris flower}\}
Y = \{\text{setosa}, \text{versicolor}, \text{virginica}\}

Issues with this X: 🤔

  • is that a useful definition? that is: can it be used for judging the membership of an object in X? 🌸 \overset{?}{\in} X
  • is an x \in X processable by a machine? recall that in later phases:
    • we want to take an f\subtext{learn} that is able to learn an f\subtext{predict}: X \to Y and use f\subtext{learn} on a machine
    • we want to use the learned f\subtext{predict} on a machine
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
55 / 366

Phase 3 - shaping X

We cannot just take another X, because the problem is "given an Iris flower, assign a species". But we can introduce some pre-processing¹ steps that transform an x \in X into an x' \in X', with X' being better, more suitable, for later steps.

That is, we can design an f\subtext{pre-proc}: X \to X' and an X'!

Requirements:

  • designing and applying f\subtext{pre-proc} should have an acceptable cost
  • an x' = f\subtext{pre-proc}(x) should retain the information of x that is useful for obtaining a y
  • X' should be compatible with one or more learning techniques (see)
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
  1. If x \in X is not digital, we consider f\subtext{pre-proc} to be applied outside the ML system, hence its definition is part of the problem statement; otherwise, if x \in X is natively digital, then each f\subtext{pre-proc} can be considered part of the ML system, and its definition is done in phase 4.
56 / 366

Phase 3 - feature engineering

Since most learning techniques are designed to work on a multivariate X, we are going to design an f\subtext{pre-proc}: X \to X' = X'_1 \times \dots \times X'_p. That is, we are going to define the features and the way to compute them out of an x.

This step is called feature engineering and is in practice a key step in the design of an ML system, often more important than the choice of the learning technique:

  • because of the key requirement of retaining the information contained in x
  • because it is often done before collecting the dataset, which may be a costly, hardly repeatable operation
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
57 / 366

Phase 3 - feature engineering for Iris

Some options:

Function f\subtext{pre-proc}            Set X'                           Cost    Info¹  Comp.²
x' is a textual description of x        strings                          🫰🫰    🫳     👍
x' is a digital picture of x            [0,1]^{512 \times 512 \times 3}  🫰      🫳     👍³
x' is "the" DNA of x                    \{A, C, G, T\}^*                 🫰🫰🫰  👍     👍
x' is some measurements of x            \mathbb{R}^p                     🫰      🫳     👍
  1. Info retain: 👍: large, i.e., good; 🫳: medium; 👎: small, i.e., bad.
  2. Compatibility: 👍: large, i.e., good; 🫳: medium; 👎: small, i.e., bad.
  3. Not if Alice just attends this course...

The actual decision should be taken by Alice and Bob together, based on domain knowledge of the latter and ML knowledge of the former.

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Requirements for f\subtext{pre-proc}: X \to X':

  • proper cost
  • retaining information
  • compatibility
58 / 366

Phase 3 - flower to vector

Assume the choice "x' is some measurements of x", namely 4 measurements; then f\subtext{pre-proc}: X \to \mathbb{R}^4 and f\subtext{pre-proc}(x) = \vect{x}' = (x'_1, x'_2, x'_3, x'_4), with:

  • x'_1 being the¹ sepal length of x in cm
  • x'_2 being the sepal width of x in cm
  • x'_3 being the petal length of x in cm
  • x'_4 being the petal width of x in cm

[image: Iris sepal and petal measurements]

x'_1  x'_2  x'_3  x'_4  y
5.1   3.5   1.4   0.2   setosa
7.0   3.2   4.7   1.4   versicolor
6.3   3.3   6.0   2.5   virginica

  1. Which one? It has to be decided! E.g., the longest one, the mean value, ...
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Requirements for f\subtext{pre-proc}: X \to X':

  • proper cost
  • retaining information
  • compatibility
59 / 366
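A minimal Java sketch (not from the slides) of the outcome of this f\subtext{pre-proc}: each flower becomes a vector \vect{x}' \in \mathbb{R}^4, paired in the learning set with its species label y; the names IrisExample and IrisDemo are made up for illustration:

// x' = (x'_1, x'_2, x'_3, x'_4) in R^4, plus the response y (the species)
record IrisExample(double sepalLength, double sepalWidth,
                   double petalLength, double petalWidth, String species) {}

class IrisDemo {
  public static void main(String[] args) {
    // the first row of the table above
    IrisExample e = new IrisExample(5.1, 3.5, 1.4, 0.2, "setosa");
    System.out.println(e);
  }
}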

Phases 1 and 3 - explore the data

Alice's thoughts 💭: Is it true that we cannot design a reasonable f\subtext{predict}? Are we retaining information?

Let's look at the data!

  • which data? Bob, give me your samples and let's measure them
  • what to look at?
    1. mean values per species and feature
    2. boxplots of values per species and feature
    3. pairwise (with respect to feature) scatterplots of observations

How does Alice choose these 3 approaches, in this order?

  • experience
  • nature of X' (here \mathbb{R}^4)
  • knowledge of basic plots and their cost
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
60 / 366

Phases 1 and 3 - data mean values

Mean values per species and feature:

iris %>% group_by(Species) %>% summarise_all(mean)
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33
3 virginica          6.59        2.97         5.55       2.03

Findings: setosa looks more different

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
61 / 366

Phases 1 and 3 - boxplots

Boxplots of values per species and feature:

iris %>% pivot_longer(cols=!Species) %>%
  ggplot(aes(x=name, y=value, color=Species)) + geom_boxplot()

[plot: Iris boxplots]

Findings: overlap between versicolor and virginica, for all features

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
62 / 366

Phases 1 and 3 - pairwise scatterplots

Pairwise scatterplots of observations:

ggpairs(iris, columns=1:4, aes(color=Species, alpha=0.5),
        upper=list(continuous="points"))

[plot: Iris pairwise scatterplots]

Findings: overlap!

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Questions:

  • cannot design an f\subtext{predict}?
  • retaining information?

Outcome:
Yes, let's use ML!

63 / 366

Phase 3 - solution assessment

Problem statement:

  • define XX and YY
  • define a way for assessing solutions ❌

How?

  • next part
64 / 366

The true story of the Iris dataset

Anderson, Edgar. "The species problem in Iris." Annals of the Missouri Botanical Garden 23.3 (1936): 457-509.

Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179-188.

1936!!!

65 / 366

Basic concepts

Brief recap

66 / 366

Refining a definition of ML

Machine Learning is the science of getting computers to learn without being explicitly programmed.
\downarrow
Machine Learning is the science of getting computers to learn f: X \to Y without being explicitly programmed.
\downarrow
Machine Learning is the science of getting computers to learn f: X \to Y autonomously.
\downarrow
Supervised (Machine) Learning is the science of getting computers to learn f: X \to Y from examples autonomously.
\downarrow
Supervised (Machine) Learning is about designing and applying supervised learning techniques. A supervised learning technique is a way for learning an f\subtext{predict} \in \mathcal{F}_{X \to Y} given a D\subtext{learn} \in \mathcal{P}^*(X \times Y).

67 / 366

Key terms

  • each x \in X is an observation, input, data point, or instance
  • each y \in Y is a response or output
  • D = \{(x^{(i)}, y^{(i)})\}_i is a dataset compatible with X and Y; a learning set if used for learning
  • the learning phase is when f\subtext{learn} is being applied
  • the prediction phase is when f\subtext{predict} is being applied
  • a model is the variable part m of a templated f\subtext{predict}(x) = f'\subtext{predict}(x, m)
  • if Y is finite and without ordering, it's a classification problem
    • if |Y| = 2, it's binary classification
    • if |Y| > 2, it's multiclass classification
  • if Y = \mathbb{R}, it's a regression problem
  • if X = X_1 \times \dots \times X_p is multivariate, each x_i is an independent variable, feature, attribute, or predictor
  • y is the dependent variable or response variable
    • in classification, y is the class label
68 / 366

Supervised learning technique

A learning technique is defined by the pair f'\subtext{learn}, f'\subtext{predict}:

f'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M \qquad f'\subtext{predict}: X \times M \to Y

[diagram: D\subtext{learn} \to f'\subtext{learn} \to m; \; x, m \to f'\subtext{predict} \to y]

Supervised (Machine) Learning is about designing and applying supervised learning techniques. A supervised learning technique is defined by:

  • a way f'\subtext{learn} for learning a model m \in M given a D\subtext{learn} \in \mathcal{P}^*(X \times Y);
  • a way f'\subtext{predict} for computing a response y given an observation x and a model m.
69 / 366

Phases of design of an ML system

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement):
    • define XX and YY
    • feature engineering
    • define a way for assessing solutions
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Arguments for f\subtext{predict} on a machine:

  • computing y quickly
  • dangerous context
  • low value of y
  • avoiding human bias

Arguments for f\subtext{learn} on a machine:

  • cannot build f\subtext{predict} manually
  • cost of building f\subtext{predict} manually
  • quality of a manually built f\subtext{predict}

Requirements for f\subtext{pre-proc}: X \to X':

  • proper cost
  • retaining information
  • compatibility
70 / 366

Assessing supervised ML

71 / 366

What to assess?

Subject of the assessment:

  • an ML system (all components)
  • a supervised learning technique (f\subtext{learn} and f\subtext{predict})
  • a model (m used in an f'\subtext{predict})
72 / 366

Axes of assessment

Assume something is assessed with respect to a given goal:

  • Effectiveness: to which degree is the goal achieved?
    • goal poorly achieved \rightarrow low effectiveness 😢
    • goal completely achieved \rightarrow high effectiveness 😁
  • Efficiency: how many resources are consumed for achieving the goal (to some degree)?
    • large amount of resources \rightarrow low efficiency 😢
    • small amount of resources \rightarrow high efficiency 😁
  • Interpretability (or explainability): to which degree is the way the goal is achieved (or not achieved) explainable?
    • poorly explainable \rightarrow low interpretability 😢
    • fully explainable \rightarrow high interpretability 😁
73 / 366

Purposes of assessment

Given an axis a of assessment:

  • absolute assessment: does something meet the expectation in terms of a?
    • is a model effective enough?
    • is a learning technique explainable enough?
    • is an ML system efficient enough?
  • comparison: is one thing better than another thing in terms of a?
    • is model m_1 more effective than model m_2? (maybe obtained with the same technique and different parameters)
    • is this learning technique more efficient than that learning technique?

"enough" represents some expectation, some minimum degree of a to be reached.

If the outcome of assessment is a quantity (i.e., a number) with a monotonic semantics:

  • comparison corresponds to checking for > or <
  • absolute assessment corresponds to:
    • establishing a threshold and
    • checking for > or <

We want assessment to produce a number!

74 / 366

Effectiveness and subject

A ML system can be seen as a composite learning technique. It has two running modes: one in which it tunes itself, one in which it makes decisions. ML system goals are:

  • tuning properly (i.e., such that, after tuning it makes good decisions)
  • making good decisions

A supervised learning technique is a pair flearn,fpredictf\subtext{learn},f\subtext{predict}. Its goals are:

  • learning (with flearnf\subtext{learn}) a good fpredictf\subtext{predict}, i.e., an fpredictf\subtext{predict} that makes good decisions
  • making good decisions

A model has one goal:

  • making good decisions (when used in an fpredictf'\subtext{predict})

Eventually, effectiveness is about making good decisions!

  • Ideally, we want to measure effectiveness with numbers.
75 / 366

Model vs. real system

How to measure if an fpredictf'\subtext{predict} is making good decisions?

Recall: fpredictf\subtext{predict}, possibly through fpredictf'\subtext{predict} and a model mm, models the dependency of yy on xx.

Key underlying assumption: yy depends on xx. That is, there exists some real system s:XYs: X \to Y that, given an xx, produces a yy based on xx, i.e., sFXYs \in \mathcal{F}_{X \to Y}:

  • given a flat xx, an economic system determines the price yy of xx on the real estate market
  • given two basketball teams about to play a match xx, a sporting event determines the outcome yy of xx

Or, there exists in reality some system s1:YXs^{-1}: Y \to X that, given a yy, produces an xx based on yy:

  • given a seed of an Iris flower of a given species yy, nature eventually develops it into an Iris flower xx

Model mm (or fpredictf\subtext{predict})

fpredict(,m)f'\subtext{predict}(\cdot, m)xxyy

Real system ss

ssxxyy

A templated fpredict:X×MYf'\subtext{predict}: X \times M \to Y with a fixed model mm is an fpredict:XYf\subtext{predict}: X \to Y.

76 / 366

Comparing mm and ss

Model mm (or fpredictf\subtext{predict})

fpredict(,m)f'\subtext{predict}(\cdot, m)xxyy

Real system ss

ssxxyy

How to see if the model mm is modeling the system ss well?

Direct comparison:

  1. "open" ss and look inside
  2. "open" mm and look inside
  3. compare internals of ss and mm

Issues:

  • in practice, ss can rarely/hardly be opened
  • mm might be hard to open

Comparison of behaviors:

  1. collect some examples of the behavior of ss
  2. feed mm with examples
  3. compare responses of ss and mm

Ideally, we want the comparison (step 3) outcome to be a number.

77 / 366

Comparing behaviors

fcomp-behavior:FXY×FXYRf\subtext{comp-behavior}: \mathcal{F}_{X \to Y} \times \mathcal{F}_{X \to Y} \to \mathbb{R}

fcomp-behaviorf\subtext{comp-behavior}fpredict,sf\subtext{predict},sveffectv\subtext{effect}

Or, to highlight the presence of a model in a templated fpredictf\subtext{predict}:

fcomp-behavior:FX×MY×M×FXYRf\subtext{comp-behavior}: \mathcal{F}_{X \times M \to Y} \times M \times \mathcal{F}_{X \to Y} \to \mathbb{R}

fcomp-behaviorf\subtext{comp-behavior}fpredict,m,sf'\subtext{predict},m,sveffectv\subtext{effect}

In both cases:

function comp-behavior(fpredict,s)\text{comp-behavior}(f\subtext{predict}, s) {
{x(i)}icollect()\{x^{(i)}\}_i \gets \text{collect}()
{y(i)}iforeach({x(i)}i,s)\{y^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, s)
{y^(i)}iforeach({x(i)}i,fpredict)\{\hat{y}^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, f\subtext{predict})
veffectcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets \text{comp-resps}(\{(y^{(i)},\hat{y}^{(i)})\}_i)
return veffectv\subtext{effect};
}

  1. collect some examples of the behavior of ss
  2. feed mm with examples
  3. compare responses of ss and mm

More correctly, {(y(i),y^(i))}iforeach({x(i)}i,both(,s,fpredict))\seq{(y^{(i)},\hat{y}^{(i)})}{i} \gets \text{foreach}(\seq{x^{(i)}}{i}, \text{both}(\cdot, s, f\subtext{predict})) with fboth:X×FXY2Y2f\subtext{both}: X \times \mathcal{F}^2_{X \to Y} \to Y^2 and fboth(x,f1,f2)=(f1(x),f2(x))f\subtext{both}(x, f_1, f_2) = (f_1(x),f_2(x)).
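To make the pseudo-code concrete, here is a minimal Java sketch (all names are hypothetical): the abstract parts, fcollectf\subtext{collect} and fcomp-respsf\subtext{comp-resps}, are passed in as parameters.

import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

// Minimal sketch of comp-behavior; all names are hypothetical.
// fPredict plays the role of f_predict, s of the (observable) real system;
// xs stands for the output of collect(), compResps for f_comp-resps.
public class CompBehavior {

  record Pair<A, B>(A first, B second) {}

  static <X, Y> double compBehavior(
      Function<X, Y> fPredict,
      Function<X, Y> s,
      List<X> xs,
      ToDoubleFunction<List<Pair<Y, Y>>> compResps
  ) {
    List<Pair<Y, Y>> pairs = xs.stream()
        .map(x -> new Pair<>(s.apply(x), fPredict.apply(x))) // (y^(i), ŷ^(i))
        .toList();
    return compResps.applyAsDouble(pairs); // v_effect
  }
}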

78 / 366

Remarks on fcomp-behaviorf\subtext{comp-behavior}

function comp-behavior(fpredict,s)\text{comp-behavior}(f\subtext{predict}, s) {
{x(i)}icollect()\{x^{(i)}\}_i \gets \text{collect}()
{y(i)}iforeach({x(i)}i,s)\{y^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, s)
{y^(i)}iforeach({x(i)}i,fpredict)\{\hat{y}^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, f\subtext{predict})
veffectcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets \text{comp-resps}(\{(y^{(i)},\hat{y}^{(i)})\}_i)
return veffectv\subtext{effect};
}

fcomp-behaviorf\subtext{comp-behavior}fpredict,sf\subtext{predict},sveffectv\subtext{effect}
  1. collect examples of ss behavior
  2. feed mm with examples
  3. compare responses of ss and mm
  • it's a partially abstract function: fcollectf\subtext{collect} and fcomp-respsf\subtext{comp-resps} are abstract (i.e., not given here)
  • we may reason about effectiveness and efficiency of fcomp-behaviorf\subtext{comp-behavior}, but both depend on concrete fcollectf\subtext{collect} and fcomp-respsf\subtext{comp-resps}
    • effectiveness: to which degree fcomp-behaviorf\subtext{comp-behavior} measures if mm behaves like ss?
    • efficiency: how much resources are consumed to apply fcomp-behaviorf\subtext{comp-behavior}?

We'll see many concrete options for fcomp-respsf\subtext{comp-resps}

fcollectf\subtext{collect} is instead hard to define, but it's more important than fcomp-respsf\subtext{comp-resps}

  • working with good data is important!
79 / 366

The importance of fcollectf\subtext{collect} in assessment

  • How many observations to collect? (data size) nn in {(x(i))}i=1i=ncollect()\{(x^{(i)})\}_{i=1}^{i=n} \gets \text{collect}()
  • Which observations to collect? (data coverage)

Goal: the behavior {(x(i),y(i))}i=1i=n\{(x^{(i)},y^{(i)})\}_{i=1}^{i=n} has to be representative of the real system ss

  • the larger nn, the more representative
  • the better the coverage of XX, the more representative

Concerning size nn:

  • small nn, poor effectiveness 👎, great efficiency 👍
  • large nn, great effectiveness 👍, poor efficiency 👎

Concerning coverage of XX

  • poor coverage, poor effectiveness 👎
  • good coverage, good effectiveness 👍

Focus on coverage, rather than size, because it has no drawbacks!

80 / 366

Comparing responses with fcomp-respsf\subtext{comp-resps}

Formally:

fcomp-resps:P(Y2)Rf\subtext{comp-resps}: \mathcal{P}^*(Y^2) \to \mathbb{R}

fcomp-respsf\subtext{comp-resps}{(y(i),y^(i))}i\{(y^{(i)},\hat{y}^{(i)})\}_iveffectv\subtext{effect}

function comp-behavior(fpredict,s)\text{comp-behavior}(f\subtext{predict}, s) {
{x(i)}icollect()\{x^{(i)}\}_i \gets \text{collect}()
{y(i)}iforeach({x(i)}i,s)\{y^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, s)
{y^(i)}iforeach({x(i)}i,fpredict)\{\hat{y}^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, f\subtext{predict})
veffectcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets \text{comp-resps}(\{(y^{(i)},\hat{y}^{(i)})\}_i)
return veffectv\subtext{effect};
}

where {(y(i),y^(i))}iP(Y2)\{(y^{(i)},\hat{y}^{(i)})\}_i \in \mathcal{P}^*(Y^2) is a multiset of pairs of yy.

Depends only on YY, not on XX!

We'll see a few options for the main cases:

  • classification
    • all (i.e., agnostic with respect to Y|Y|): error, accuracy
    • binary: FPR and FNR (and variants), EER, AUC
    • multiclass: weighted accuracy
  • regression: MAE, MSE, RMSE, MAPE

All of these are performance indexes.

81 / 366

Assessing models

Classification

82 / 366

Classification error

Recall: in classification YY is a finite set with no ordering

Classification error: more verbosely: classification error rate ferr({(y(i),y^(i))}i=1i=n)=1ni=1i=n1(y(i)y^(i))f\subtext{err}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n})=\frac{1}{n}\sum_{i=1}^{i=n}\mathbf{1}(y^{(i)}\ne \hat{y}^{(i)}) where 1:{false,true}{0,1}\mathbf{1}: \{\text{false},\text{true}\} \to \{0,1\} is the indicator function: 1(b)={1if b=true0otherwise\mathbf{1}(b) = \begin{cases} 1 &\text{if } b = \text{true}\\ 0 &\text{otherwise} \end{cases}

  • ferrf\subtext{err} is a concrete instance of fcomp-respsf\subtext{comp-resps}
  • the codomain of ferrf\subtext{err} is [0,1][0,1]: [0,1]R[0,1] \subseteq{\mathbb{R}}, so it can be a concrete instance
    • 00 means no errors, it's good 👍
    • 11 means all errors, it's bad 👎
  • in general, numbers in [0,1][0,1] can be expressed as percentages in [0,100][0,100]: xx is the same as 100x%100 x\%
83 / 366

Classification accuracy

Classification accuracy: facc({(y(i),y^(i))}i=1i=n)=1ni=1i=n1(y(i)=y^(i))f\subtext{acc}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n})=\frac{1}{n}\sum_{i=1}^{i=n}\mathbf{1}(y^{(i)} \c{3}{=} \hat{y}^{(i)})

Clearly, facc({(y(i),y^(i))}i=1i=n)=1ferr({(y(i),y^(i))}i=1i=n)f\subtext{acc}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n})=1-f\subtext{err}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n}).

The codomain of faccf\subtext{acc} is also [0,1][0,1]:

  • 11 means no errors, it's good 👍
  • 00 means all errors, it's bad 👎

For accuracy, the greater, the better.
For error, the lower, the better.

In principle, the only requirement concerning YY for both ferrf\subtext{err} and faccf\subtext{acc} is that there is an equivalence relation on YY, i.e., that == is defined over YY. However, in practice YY is a finite set without ordering.
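As a minimal concrete sketch (Java, hypothetical names), both indexes need nothing more than equals() on YY:

import java.util.List;

// Minimal sketch: f_err and f_acc as concrete instances of f_comp-resps.
public class ErrorAndAccuracy {

  static <Y> double error(List<Y> ys, List<Y> yHats) {
    int wrong = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (!ys.get(i).equals(yHats.get(i))) wrong++; // indicator 1(y != ŷ)
    }
    return (double) wrong / ys.size();
  }

  static <Y> double accuracy(List<Y> ys, List<Y> yHats) {
    return 1d - error(ys, yHats); // f_acc = 1 - f_err
  }

  public static void main(String[] args) {
    System.out.println(error(
        List.of("spam", "spam", "not-spam"),
        List.of("spam", "not-spam", "not-spam"))); // 0.333...
  }
}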

84 / 366

Bounds for accuracy (and error)

In principle, accuracy is in [0,1][0,1].

Recall that faccf\subtext{acc} is part of an fcomp-behaviorf\subtext{comp-behavior} that should measure how well a model mm models a real system ss.
What are the ideal extreme cases in practice?

  • mm is ss, so it perfectly models ss
  • mm is random, does not model any dependency of yy on xx

From another point of view, what would be the accuracy of a:

  • model that perfectly models the system?
  • random model?
85 / 366

The random classifier (lower bound)

The random classifier¹ is an XYX \to Y function doing:

frnd(x)=yi with iU({1,,Y})f\subtext{rnd}(x) = y_i \text{ with } i \sim U(\{1,\dots,|Y|\})

where iU(A)i \sim U(A) means choosing an item of AA with uniform probability.

Here A={1,,Y}A=\{1,\dots,|Y|\}, hence frnd(x)f\subtext{rnd}(x) gives a random yy, without using xx, i.e., no dependency.

Considering all possible multisets of responses P(Y)\mathcal{P}^*(Y), the accuracy of the random classifier is, on average, 1Y\frac{1}{|Y|}.

  1. classifier is a shorthand for:
    • a model for doing classification, i.e., an fpredictf'\subtext{predict} with categorical YY
    • a supervised learning technique for classification, i.e., a pair flearn,fpredictf'\subtext{learn}, f'\subtext{predict} with categorical YY
86 / 366

Dummy classifier (better lower bound)

Given one specific multiset of responses {y(i)}i\{y^{(i)}\}_i, the dummy classifier is the one that always predicts the most frequent class in {y(i)}i\{y^{(i)}\}_i: fdummy,{y(i)}i(x)=arg maxyY1ni=1i=n1(y=y(i))=arg maxyYFr ⁣(y,{y(i)}i)f_{\text{dummy},\{y^{(i)}\}_i}(x) = \argmax_{y \in Y} \frac{1}{n} \sum_{i=1}^{i=n} \mathbf{1}(y=y^{(i)})=\argmax_{y \in Y} \freq{y, \{y^{(i)}\}_i} On the {y(i)}i\{y^{(i)}\}_i on which it is built, the accuracy of the dummy classifier is maxyYFr ⁣(y,{y(i)}i)\max_{y \in Y} \freq{y, \{y^{(i)}\}_i}.

Recall: we use faccf\subtext{acc} on one specific {y(i)}i\{y^{(i)}\}_i.

Like the random classifier, the dummy classifier does not use xx.

Dummy

dummy [duhm-ee]: a representation of a human figure, as for displaying clothes in store windows

Looks like a human, but does nothing!
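A minimal Java sketch of the two baselines (hypothetical names); both ignore xx, and the dummy one just memorizes the most frequent class at learning time:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal sketch of the random and dummy baseline classifiers.
public class Baselines {

  // f_rnd: ignores x, picks y_i with i ~ U({1,...,|Y|})
  static <Y> Y predictRandom(List<Y> classes, Random random) {
    return classes.get(random.nextInt(classes.size()));
  }

  // dummy "learning": find the most frequent class in {y^(i)}_i
  static <Y> Y learnDummy(List<Y> ys) {
    Map<Y, Integer> counts = new HashMap<>();
    ys.forEach(y -> counts.merge(y, 1, Integer::sum));
    return counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .orElseThrow()
        .getKey(); // argmax_y Fr(y, {y^(i)}_i)
  }

  public static void main(String[] args) {
    List<String> ys = List.of("tails", "tails", "heads", "tails");
    System.out.println(learnDummy(ys)); // tails; accuracy on ys: 3/4 = 75%
  }
}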

87 / 366

Random/dummy classifier: examples

Case: coin tossing, Y={heads,tails}Y=\{\c{1}{\text{heads}},\c{2}{\text{tails}}\}

Random on average (with frndf\subtext{rnd}):

(examples of response multisets {y(i)}i\seq{y^{(i)}}{i} and random predictions {y^(i)}i\seq{\hat{y}^{(i)}}{i}, with accuracies 50%50\%, 25%25\%, ..., 100%100\%, 0%0\%)
Average accuracy = 50%50\%

Dummy on {y(i)}i=⬤⬤⬤⬤\seq{y^{(i)}}{i}=\htmlClass{col2 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}
(with fdummy,⬤⬤⬤⬤f_{\text{dummy},\htmlClass{col2 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}}):

facc(⬤⬤⬤⬤,⬤⬤⬤⬤)=75%f\subtext{acc}(\htmlClass{col2 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}},\htmlClass{col2 st}{\text{⬤⬤⬤⬤}}) = 75\%

Case: Iris, Y={setosa,versicolor,virginica}Y=\{\c{1}{\text{setosa}},\c{2}{\text{versicolor}},\c{3}{\text{virginica}}\}

Random on average (with frndf\subtext{rnd}):

(examples of response multisets {y(i)}i\seq{y^{(i)}}{i} and random predictions {y^(i)}i\seq{\hat{y}^{(i)}}{i}, with accuracies 17%\approx 17\%, 50%50\%, ..., 100%100\%, 0%0\%)
Average accuracy 33%\approx 33\%

Dummy on {y(i)}i=⬤⬤⬤⬤\seq{y^{(i)}}{i}=\htmlClass{col3 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}
(with fdummy,⬤⬤⬤⬤f_{\text{dummy},\htmlClass{col3 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}}):

facc(⬤⬤⬤⬤,⬤⬤⬤⬤)=50%f\subtext{acc}(\htmlClass{col3 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤⬤⬤⬤}}) = 50\%

Here facc(⬤⬤⬤,⬤⬤⬤)f\subtext{acc}(\htmlClass{col3 st}{\text{⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤⬤⬤}}) stands for facc({(,),(,),(,)})f\subtext{acc}(\{(\htmlClass{col3 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤}}),(\htmlClass{col1 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤}}),(\htmlClass{col2 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤}})\}).

88 / 366

The perfect classifier (upper bound)

A classifier that works exactly as ss:

fperfect(x)=s(x)f\subtext{perfect}(x) = s(x)

If ss is deterministic, the accuracy of fperfect(x)f\subtext{perfect}(x) is 100% on every {x(i)}i\seq{x^{(i)}}{i}, by definition.

Are real systems deterministic in practice?

  • the system that determines whether an email is spam or not-spam
  • Iris species (where nature is an s1s^{-1}...)
  • a bank employee who decides whether or not to grant a loan
  • the real estate market forming the price of a flat (Y=R+Y=\mathbb{R}^+)
89 / 366

The Bayes classifier (better upper bound)

A non-deterministic system (i.e., a stochastic or random system) is one that, given the same xx, may output different values of yy.

The Bayes classifier is an ideal model of a real system that is not deterministic:

fBayes(x)=arg maxyYPr ⁣(s(x)=yx)f\subtext{Bayes}(x) = \argmax_{y \in Y} \prob{s(x)=y \mid x}

where Pr ⁣(s(x)=yx)\prob{s(x)=y \mid x} is the probability that ss gives yy for xx.

Key facts:

  • on a given {x(i)}i\seq{x^{(i)}}{i} the accuracy of the Bayes classifier is 100%\le 100\% (it may be lower than 100%)
  • on P(X)\mathcal{P}^*(X), i.e., on all possible multisets of observations xx, the Bayes classifier is the optimal classifier, i.e., no other classifier can score a better accuracy it can be proven, not here!

In practice:

  • the Bayes classifier is an ideal classifier: "building" it requires knowing how ss works, which is infeasible in practice
  • intuitively, the more random the system, the lower the accuracy of the Bayes classifier
90 / 366

The Bayes classifier: example

The real system ss is the professor deciding if a student will pass or fail the exam of Introduction to ML. The professor just looks at the student's degree course to decide ❗ fake! and is a bit stochastic.

X={IN19,IN20,SM34,SM35,SM64}X =\{\text{IN19},\text{IN20},\text{SM34},\text{SM35},\text{SM64}\}
Y={fail,pass}Y = \{\text{fail},\text{pass}\}

The probability according to which the professor "reasons" is completely known:

fail\text{fail} pass\text{pass}
IN19\text{IN19} 20%20\% 80%80\%
IN20\text{IN20} 15%15\% 85%85\%
SM34\text{SM34} 60%60\% 40%40\%
SM35\text{SM35} 80%80\% 20%20\%
SM64\text{SM64} 20%20\% 80%80\%
❗ these are fake numbers!

Pr ⁣(s(x)=yx)={20%if x=IN19y=fail80%if x=IN19y=pass15%if x=IN20y=fail80%if x=SM64y=pass\prob{s(x)=y \mid x}=\begin{cases} 20\% &\text{if } x=\text{IN19} \land y=\text{fail} \\ 80\% &\text{if } x=\text{IN19} \land y=\text{pass} \\ 15\% &\text{if } x=\text{IN20} \land y=\text{fail} \\ \dots \\ 80\% &\text{if } x=\text{SM64} \land y=\text{pass} \end{cases} the table is a compact form for this probability

fBayes(x)={passif x=IN19passif x=IN20failif x=SM34failif x=SM35passif x=SM64f\subtext{Bayes}(x) = \begin{cases} \text{pass} &\text{if } x=\text{IN19} \\ \text{pass} &\text{if } x=\text{IN20} \\ \text{fail} &\text{if } x=\text{SM34} \\ \text{fail} &\text{if } x=\text{SM35} \\ \text{pass} &\text{if } x=\text{SM64} \end{cases} built using the definition fBayes(x)=arg maxyYPr ⁣(s(x)=yx)f\subtext{Bayes}(x) = \argmax_{y \in Y} \prob{s(x)=y \mid x}
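A minimal Java sketch of fBayesf\subtext{Bayes} (hypothetical names), with the table above encoded as a nested map from xx to the distribution over YY:

import java.util.Map;

// Minimal sketch: f_Bayes as an argmax over a fully known Pr(s(x)=y | x).
public class BayesClassifier {

  static String fBayes(String x, Map<String, Map<String, Double>> pr) {
    return pr.get(x).entrySet().stream()
        .max(Map.Entry.comparingByValue()) // argmax_y Pr(s(x)=y | x)
        .orElseThrow()
        .getKey();
  }

  public static void main(String[] args) {
    Map<String, Map<String, Double>> pr = Map.of(
        "IN19", Map.of("fail", 0.20, "pass", 0.80),
        "SM34", Map.of("fail", 0.60, "pass", 0.40));
    System.out.println(fBayes("SM34", pr)); // fail
  }
}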

Questions

  • what's the accuracy of fBayesf\subtext{Bayes}? What's the model for the Bayes classifier? What's MM?
  • what's the accuracy of fdummyf\subtext{dummy}? And of frndf\subtext{rnd}?
91 / 366

Classification accuracy bounds

Lower Upper
By definition 00 11
Bounds, all data 1Y\frac{1}{\lvert Y\rvert} 11
Better bounds, with one {x(i)}i\seq{x^{(i)}}{i} maxyYFr ⁣(y,{s(x(i))}i)\max_{y \in Y} \freq{y, \{s(x^{(i)})\}_i} 1\le 1

If {x(i)}i\seq{x^{(i)}}{i} is collected properly, it is representative of the behavior of the real system (together with the corresponding {s(x(i))}i\seq{s(x^{(i)})}{i}), hence the third case is the most relevant one:

facc()[maxyYFr ⁣(y,{s(x(i))}i),1ϵ]f\subtext{acc}(\cdot) \in [\max_{y \in Y} \freq{y, \{s(x^{(i)})\}_i}, 1 - \epsilon] ϵ>0\epsilon > 0 is actually unknown

In practice, use the random classifier as a baseline and

  • do not cry 😭 for a missed 100%100\%
  • do not be too happy 🥳 just because you score >0%> 0\%
92 / 366

All data

All data means all the theoretically possible datasets, i.e., for just yy, P(Y)\mathcal{P}^*(Y).

  • on average in P(Y)\mathcal{P}^*(Y), the frequency of each yiYy_i \in Y is 1Y\frac{1}{|Y|}

In practice not all possible datasets are equally probable.

  • often, the frequencies fif_i of yiy_i are known (at least an approximation of them).
  • in these cases, the (approximate) lower bound (that of the dummy classifier built with these frequencies) is: maxifi\max_i f_i

Example: for spam, xx is an email, i.e., a string of text, yy is spam\text{spam} or ¬spam\neg\text{spam}:

  • are we interested in measuring the accuracy of a spam filter on all possible strings (theory)?
  • or are we more interested in knowing its accuracy for actual emails (practice)?
93 / 366

Building the dummy classifier

Consider the dummy classifier as a supervised learning technique:

  • in learning phase: compute frequencies/probability of classes concrete
  • in prediction phase: choose the most frequent class concrete

Hence, formally:

  • a model mMm \in M is: these are alternative options
    1. the class frequencies f=(f1,,fY)\c{2}{\vect{f} = (f_1,\dots,f_{|Y|})}, with M=FY={f[0,1]Y:f1=1}M=F_Y=\{\vect{f} \in [0,1]^{|Y|}: \lVert \vect{f} \rVert_1=1\}

      x1\lVert \vect{x} \rVert_1 is the 1-norm of a vector x=(x1,,xp)\vect{x}=(x_1,\dots,x_p) with x1\lVert \vect{x} \rVert_1 =ixi=\sum_i x_i

    2. a discrete probability distribution pp over YY, with M=PY={p:Y[0,1] s.t. yYp(y)=1}M=P_Y=\{p: Y \to [0,1] \text{ s.t. } \sum_{y' \in Y} p(y')=1\} s.t.\text{s.t.} stands for "such that"
    3. the yy part {y(i)}i\seq{y^{(i)}}{i} of a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}, with M=P(Y)M=\mathcal{P}^*(Y)
    4. just the most frequent class yy^\star, with M=YM=Y
  • flearn:P(X×Y)Mf'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M abstract
  • fpredict:X×MYf'\subtext{predict}: X \times M \to Y abstract
94 / 366

Building the dummy classifier (options 1 and 2)

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}m\c{2}{m}
fpredictf'\subtext{predict}x,mx, \c{2}{m}yy

Option 1: the model mm is a vector of frequencies: assume Y={y1,y2,}Y=\{y_1, y_2, \dots\}

flearn({(x(i),y(i))}i)=f=(Fr ⁣(yj,{y(i)}i))jf'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{\vect{f}} = \left(\freq{y_j, \seq{y^{(i)}}{i}}\right)_j

fpredict(x,f)=yif'\subtext{predict}(x,\c{2}{\vect{f}})=y_i with i=arg maxifii = \argmax_i f_i

Option 2: the model mm is a discrete probability distribution: here flearnf'\subtext{learn} is a function that returns a function

flearn({(x(i),y(i))}i)=p:p(y)=Fr ⁣(y,{y(i)}i)f'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{p}: p(y)= \freq{y, \seq{y^{(i)}}{i}}

fpredict(x,p)=arg maxyYp(y)f'\subtext{predict}(x,\c{2}{p})=\argmax_{y \in Y} \c{2}{p}(y)

95 / 366

Building the dummy classifier (options 3 and 4)

Option 3: the model mm is simply the learning dataset: just the yy part of it

flearn({(x(i),y(i))}i)={y(i)}if'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{\seq{y^{(i)}}{i}}

fpredict(x,{y(i)}i)=arg maxyYFr ⁣(y,{y(i)}i)f'\subtext{predict}(x,\seq{y^{(i)}}{i})=\argmax_{y \in Y} \freq{y,\c{2}{\seq{y^{(i)}}{i}}}

Option 4: the model mm is the most frequent class yy^\star:

flearn({(x(i),y(i))}i)=y=arg maxyYFr ⁣(y,{y(i)}i)f'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{y^\star}=\argmax_{y \in Y} \freq{y,\seq{y^{(i)}}{i}}

fpredict(x,y)=yf'\subtext{predict}(x,y^\star)=\c{2}{y^\star}

For all options, works with:

  • any XX (xx never appears in flearnf'\subtext{learn} and fpredictf'\subtext{predict} bodies)
  • finite YY (categorical yy)

Are they different? How?

They differ in efficiency, are equal in effectiveness:

  • effectiveness as supervised learning techniques, same by definition
  • efficiency, always high, but: just an implementation matter
    • more or less memory for storing the model mm
    • computational effort more in the learning or prediction phase
96 / 366

Assessing models

Binary classification

97 / 366

Binary classification

Binary classification is a very common scenario.

  • assessment is particularly important
  • there are many indexes

Examples:

  • spam detection
  • decide whether there is a dog in a picture
  • clinical test (more properly: diagnostic test)
98 / 366

Example: diagnostic test

Suppose there is an (ML-based) diagnostic test for a given disease dd. just to give it a name, without calling bad luck...

You are told that the accuracy of the test is 99.8%99.8\%.

Is this a good test or not?

In "formal" terms, the test is an fpredict:XYf\subtext{predict}: X \to Y with:

  • X={X=\{🧑‍🦰, 👱, 🙍, 🙎, }\dots\} the set of persons¹
  • Y={has the disease d,does not have the disease d}Y=\{\text{has the disease } d, \text{does not have the disease } d\}

Since Y=2|Y|=2 this is a binary classification problem.

1: or, from another point of view, X={X =\{🧑‍🦰, 👱, 🙍, 🙎, }×T\dots\} \times T, with TT being the time, because you test a person at a given time tt, and the outcome might be different from the test outcome for the same person at a later tt'.

99 / 366

The rare disease

Suppose dd is a rare¹ disease which affects 2\approx 2 people every 10001000 and let the accuracy be again 99.8%99.8\%.

Is this a good test or not?

  1. the definition of rare for a disease varies from country to country, based on the prevalence, with thresholds ranging from 1 in 1538 (Brazil) to 1 in 100000 (Peru).

Consider a trivial test that always says "you don't have the disease dd": its accuracy would be 99.8%99.8\%:

  • on 10001000 persons, the trivial test would make correct decisions on 998998 cases
  • is our test good if it works like the trivial test?

    The trivial test is actually the dummy classifier built knowing that the prevalence is 0.2%0.2\%.


100 / 366

The fallacy of the accuracy

99.8%99.8\% was soooo nice, but the test was actually just always saying one yy.

The accuracy alone was not able to capture such a gross error.

  • Can we spot this trivially wrong behavior?
  • From another point of view, can we check how badly the classifier behaves for each class yy?

Yes, also because we are in binary classification and there are only 2=Y2=|Y| possible values for yy (i.e., 2 classes).

There are performance indexes designed with exactly this aim.

101 / 366

Positives and negatives

First, let's give a standard name to the two possible yy values:

  • positive (one case, denoted with pos\text{pos})
  • negative (the other case, denoted with neg\text{neg})

How to associate positive/negative with actual YY elements?

  • e.g., spam,¬spam\text{spam}, \neg\text{spam}
  • e.g., has the disease d,does not have the disease d\text{has the disease } d, \text{does not have the disease } d

Common practice:

  • associate positive with the rarest case
  • otherwise, if no rarest case exists or is known, clearly state what's your positive
102 / 366

FPR and FNR

Goal: measuring the error on each of the two classes in binary classification.

The False Positive Rate (FPR) is the rate of negatives that are wrongly¹ classified as positives: fFPR({(y(i),y^(i))}i)=i1(y(i)=negy(i)y^(i))i1(y(i)=neg)f\subtext{FPR}(\{(y^{(i)},\hat{y}^{(i)})\}_i)=\frac{\sum_i\mathbf{1}(\c{1}{y^{(i)}=\text{neg}} \land \c{2}{y^{(i)} \ne \hat{y}^{(i)}})}{\sum_i\mathbf{1}(\c{1}{y^{(i)}=\text{neg}})}

The False Negative Rate (FNR) is the rate of positives that are wrongly classified as negatives: fFNR({(y(i),y^(i))}i)=i1(y(i)=posy(i)y^(i))i1(y(i)=pos)f\subtext{FNR}(\{(y^{(i)},\hat{y}^{(i)})\}_i)=\frac{\sum_i\mathbf{1}(\c{3}{y^{(i)}=\text{pos}} \land \c{2}{y^{(i)} \ne \hat{y}^{(i)}})}{\sum_i\mathbf{1}(\c{3}{y^{(i)}=\text{pos}})}

For both:

  • the codomain is [0,1][0,1] may be 00\frac{0}{0}, i.e., NaN, if no negatives (FPR) or positives (FNR) in the data
  • the lower, the better (like the error)
  • each one is formally an fcomp-respsf\subtext{comp-resps} considering just a part {(y(i),y^(i))}i\seq{(y^{(i)},\hat{y}^{(i)})}{i}
  1. wrongly \rightarrow falsely \rightarrow false
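A minimal Java sketch of the two definitions above (hypothetical names), encoding pos\text{pos} as true and neg\text{neg} as false:

import java.util.List;

// Minimal sketch of FPR and FNR; true encodes pos, false encodes neg.
public class Rates {

  // FPR = FP / N: rate of negatives wrongly classified as positives
  static double fpr(List<Boolean> ys, List<Boolean> yHats) {
    int fp = 0, n = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (!ys.get(i)) {
        n++;
        if (yHats.get(i)) fp++;
      }
    }
    return (double) fp / n; // NaN (0/0) if there are no negatives
  }

  // FNR = FN / P: rate of positives wrongly classified as negatives
  static double fnr(List<Boolean> ys, List<Boolean> yHats) {
    int fn = 0, p = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (ys.get(i)) {
        p++;
        if (!yHats.get(i)) fn++;
      }
    }
    return (double) fn / p; // NaN (0/0) if there are no positives
  }
}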
103 / 366

More comfortable notation

FPR=FPN\text{FPR}=\frac{\text{FP}}{\text{N}}

FNR=FNP\text{FNR}=\frac{\text{FN}}{\text{P}}

Assuming that:

  • there is a {(y(i),y^(i))}i\seq{(y^{(i)},\hat{y}^{(i)})}{i}, even if it's not written
  • FP\text{FP} is the number of false positives; FN\text{FN} is the number of false negatives
    • you need both y(i)y^{(i)} and y^(i)\hat{y}^{(i)} for counting them
    • negative/positive is for y^(i)\hat{y}^{(i)}; false is for y(i)y^{(i)}, but considering y^(i)\hat{y}^{(i)}
  • P\text{P} is the number of positives and N\text{N} is the number of negatives
    • you need only y(i)y^{(i)} for counting them
104 / 366

FPR, FNR for the trivial test

Suppose dd is a rare¹ disease which affects 2\approx 2 persons every 10001000 and consider a trivial test that always says "you don't have the disease dd"

  • on 10001000 persons, the trivial test would make correct decisions on 998998 cases 😁 Acc=99.8%\text{Acc} = 99.8\%
  • on the 998998 negative persons, the trivial test does not make any wrong prediction 😁 FPR=FPN=0998=0%\text{FPR}=\frac{\text{FP}}{\text{N}} = \frac{0}{998} = 0 \%
  • on the 22 positive persons, the trivial test makes only wrong predictions 🙁 FNR=FNP=22=100%\text{FNR}=\frac{\text{FN}}{\text{P}} = \frac{2}{2} = 100 \%

Acc\text{Acc} is the more comfortable notation for the accuracy; Err\text{Err} for the error.

105 / 366

Accuracy or FPR, FNR?

When to use accuracy? When to use FPR and FNR?

tl;dr¹: use FPR and FNR in binary classification!

In decreasing order of informativeness (i.e., effectiveness of the assessment of effectiveness) and of verbosity:

  • give accuracy, FPR, FNR, frequencies of classes² in YY, possibly other indexes we'll see later
  • give accuracy, FPR, FNR, frequencies of classes
  • FPR, FNR, frequencies of classes
  • FPR, FNR
  • accuracy, frequencies of classes
  • accuracy

Accuracy alone in binary classification is evil! 👿

Just FPR, or just FNR is evil too, but also weird.

  1. too long; didn't read
  2. you need to show them just once, if using the "natural" distribution
106 / 366

The many relatives of FPR, FNR: TPR, TNR

Binary classification and its assessment are so practically relevant that there exist many other "synonyms" of FPR and FNR.

True Positive Rate (TPR), positives correctly classified as positives: TPR=TPP=1FNR\text{TPR}=\frac{\text{TP}}{\text{P}}=1-\text{FNR}

True Negative Rate (TNR), negatives correctly classified as negatives: TNR=TNN=1FPR\text{TNR}=\frac{\text{TN}}{\text{N}}=1-\text{FPR}

For both, the greater, the better (like accuracy); codomain is [0,1][0,1].

Relation with accuracy and error:

Err=FP+FNN+P=P  FNR+N  FPRP+N\text{Err} =\frac{\text{FP}+\text{FN}}{\text{N}+\text{P}} =\frac{\text{P} \; \text{FNR}+\text{N} \; \text{FPR}}{\text{P}+\text{N}}

Acc=1Err=TP+TNN+P=P  TPR+N  TNRP+N\text{Acc} =1-\text{Err} =\frac{\text{TP}+\text{TN}}{\text{N}+\text{P}} =\frac{\text{P} \; \text{TPR}+\text{N} \; \text{TNR}}{\text{P}+\text{N}}

107 / 366

On balanced data

In classification (binary and multiclass), a dataset is balanced, with respect to the response variable yy, if the frequency of each value of yy is roughly the same.

For a balanced dataset in binary classification, P=N\text{P}=\text{N}, hence:

  • the error rate is the average of FPR and FNR Err=FP+FNN+P=P  FNR+N  FPRP+N=N(FNR+FPR)N+N=12(FNR+FPR)\text{Err} =\frac{\text{FP}+\text{FN}}{\text{N}+\text{P}}=\frac{\text{P} \; \text{FNR}+\text{N} \; \text{FPR}}{\text{P}+\text{N}} =\frac{\text{N} (\text{FNR} + \text{FPR})}{\text{N}+\text{N}} =\frac{1}{2} (\text{FNR} + \text{FPR})
  • the accuracy is the average of TPR and TNR Acc=TP+TNN+P=P  TPR+N  TNRP+N=N(TPR+TNR)N+N=12(TNR+TPR)\text{Acc} =\frac{\text{TP}+\text{TN}}{\text{N}+\text{P}} =\frac{\text{P} \; \text{TPR}+\text{N} \; \text{TNR}}{\text{P}+\text{N}} =\frac{\text{N} (\text{TPR}+\text{TNR})}{\text{N}+\text{N}} =\frac{1}{2} (\text{TNR} + \text{TPR})

The more unbalanced a dataset, the farther the error (accuracy) from the average of FPR and FNR (TPR and TNR), the more misleading 👿 giving error (accuracy) only!

108 / 366

Precision and recall

Precision: Prec=TPTP+FP\text{Prec}=\frac{\text{TP}}{\text{TP}+\text{FP}} may be 00\frac{0}{0}, i.e., NaN, if the classifier never says positive

Recall: Rec=TPP=TPR\text{Rec}=\frac{\text{TP}}{\text{P}}=\text{TPR}

F-measure: or F1, F1-score, F-score F-measure=2PrecRecPrec+Rec\text{F-measure}=2\frac{\text{Prec} \cdot \text{Rec}}{\text{Prec}+\text{Rec}} harmonic mean of precision and recall

They come from the information retrieval scenario:

  • imagine a set of documents DD (e.g., the web)
  • imagine a query qq with an ideal subset DDD^\star \subseteq D as response (relevant documents)
  • the search engine retrieves a subset DDD' \subseteq D of documents (retrieved documents)
  • retrieving a document as binary classification: is dDd \in D relevant or not? relevant = positive

Precision: how many retrieved documents are actually relevant? Prec=DDD=DDDD+DD=TPTP+FP\text{Prec}=\frac{|D' \cap D^\star|}{|D'|}=\frac{\c{1}{|D' \cap D^\star|}}{\c{1}{|D' \cap D^\star|}+\c{2}{|D' \setminus D^\star|}}=\frac{\c{1}{\text{TP}}}{\c{1}{\text{TP}}+\c{2}{\text{FP}}}

Recall: how many of the relevant documents are actually retrieved? Rec=DDD=TPP\text{Rec}=\frac{\c{1}{|D' \cap D^\star|}}{\c{3}{|D^\star|}}=\frac{\c{1}{\text{TP}}}{\c{3}{\text{P}}}

The greater, the better (like accuracy); precision [0,1]\in [0,1] \cup NaN, recall [0,1]\in [0,1], F-measure [0,1]\in [0,1].
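A minimal Java sketch of the three indexes from the raw counts (hypothetical names):

// Minimal sketch: precision, recall, and F-measure from TP, FP, P counts.
public class PrecisionRecall {

  static double precision(int tp, int fp) {
    return (double) tp / (tp + fp); // NaN if the classifier never says pos
  }

  static double recall(int tp, int p) {
    return (double) tp / p; // = TPR
  }

  static double fMeasure(int tp, int fp, int p) {
    double prec = precision(tp, fp);
    double rec = recall(tp, p);
    return 2 * prec * rec / (prec + rec); // harmonic mean
  }

  public static void main(String[] args) {
    // e.g., TP = 5, FP = 1, P = 6: Prec = Rec = F-measure ≈ 0.83
    System.out.println(fMeasure(5, 1, 6));
  }
}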

109 / 366

Sensitivity and specificity (and more)

Sensitivity: Sensitivity=TPP=TPR\text{Sensitivity}=\frac{\text{TP}}{\text{P}}=\text{TPR}

Specificity: Specificity=TNN=TNR\text{Specificity}=\frac{\text{TN}}{\text{N}}=\text{TNR}

The greater, the better (like accuracy); both in [0,1][0,1].

Other similar indexes:

  • Type I error for FPR
  • Type II error for FNR

For both, the lower, the better (like error).

110 / 366

Which terminology?

Rule of thumb¹ (in binary classification)

  • precision and recall, if in an information retrieval scenario
    • refer to the act of retrieving
  • sensitivity and specificity, if working with a diagnostic test
    • refer to the quality of the test
  • FPR and FNR, otherwise
    • refer to the name of the class

No good reasons imho for using Type I and Type II error:

  • what do they refer to?
  • is there a Type III? 🤔 (No!)
  1. rule of thumb [ˌruːl əv ˈθʌm]: a broadly accurate guide or principle, based on practice rather than theory
111 / 366

Comparison with FPR and FNR

Suppose you have two models and you compute them on the same data:

  • model m1m_1 with its fpredictf'\subtext{predict} scores FPR=6%\text{FPR}=6\% and FNR=4%\text{FNR}=4\%
  • model m2m_2 with its fpredictf'\subtext{predict} scores FPR=10%\text{FPR}=10\% and FNR=1%\text{FNR}=1\%

Which one is the best?

In general, it depends on:

  • the cost of the error, possibly different between FPs and FNs
  • the number of positives or negatives
112 / 366

Cost of the error

Assumptions:

  • once fpredictf\subtext{predict} outputs a yy, some action is taken
    • otherwise, taking a decision yy is pointless
  • if the action is wrong, there is some cost to be paid with respect to the correct action (the other one, in binary classification) assume the correct decision has 00 cost
    • otherwise, attempting to take the correct decision is pointless

Given P+N\text{P}+\text{N} observations, the overall cost cc is: c=cFP  FPR  N+cFN  FNR  Pc = c\subtext{FP} \; \text{FPR} \; \text{N} + c\subtext{FN} \; \text{FNR} \; \text{P} with cFPc\subtext{FP} and cFNc\subtext{FN} the cost of FPs and FNs.

If you know cFPc\subtext{FP}, cFNc\subtext{FN}, N\text{N}, and P\text{P}: (the costs cFPc\subtext{FP}, cFNc\subtext{FN} should come from domain knowledge)

  • you can compute cc (and compare the cost for two models)
  • find a good trade-off for FPR\text{FPR} and FNR\text{FNR} more later
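Worked example (with assumed numbers): take the two models of the previous slide, m1m_1 with FPR=6%\text{FPR}=6\%, FNR=4%\text{FNR}=4\% and m2m_2 with FPR=10%\text{FPR}=10\%, FNR=1%\text{FNR}=1\%, and suppose (assumptions, not given above) balanced data with N=P=500\text{N}=\text{P}=500, cFP=1c\subtext{FP}=1, and cFN=2c\subtext{FN}=2. Then c1=10.06500+20.04500=30+40=70c_1 = 1 \cdot 0.06 \cdot 500 + 2 \cdot 0.04 \cdot 500 = 30+40 = 70 and c2=10.10500+20.01500=50+10=60c_2 = 1 \cdot 0.10 \cdot 500 + 2 \cdot 0.01 \cdot 500 = 50+10 = 60: despite its larger FPR, m2m_2 is preferable here.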
113 / 366

Balancing FPR and FNR

Given a model (not a learning technique), can we "tune" it to prefer avoiding FPs rather than FNs (or vice versa)?

  • e.g., can we make a diagnostic test more sensitive to positives (i.e., prefer avoiding FNs) during a pandemic wave?

Yes! It turns out that for many learning techniques (for classification), the fpredictf'\subtext{predict} internally computes a discrete probability distribution over YY before actually returning one yy.

114 / 366

Model with probability

Formally:

fpredict:X×MPYf''\subtext{predict}: X \times M \to P_{Y} fpredict(x,m)=pf''\subtext{predict}(x, m) = p

fpredict:X×MYf'\subtext{predict}: X \times M \to Y fpredict(x,m)=arg maxyY(fpredict(x,m))(y)=arg maxyYp(y)f'\subtext{predict}(x, m)= \argmax\sub{y \in Y} (f''\subtext{predict}(x, m))(y) = \argmax\sub{y \in Y} p(y)

where PYP_Y is the set of discrete probability distributions over YY.

Example: for spam detection, given an mm and an email xx, fpredict(x,m)f'\subtext{predict}(x, m) might return: p(y)={80%if y=spam20%if y=¬spamp(y)= \begin{cases} 80\% &\text{if } y=\text{spam} \\ 20\% &\text{if } y=\neg\text{spam} \end{cases} For another email, it might return a 30%/70%, instead of an 80%/20%.

115 / 366

Learning technique with probability

A supervised learning technique with probability (for classification) is defined by:

  • an flearn:P(X×Y)Mf'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M, for learning a model from a dataset
  • an fpredict:X×MPYf''\subtext{predict}: X \times M \to P_{Y}, for giving a probability distribution from an observation and a model

For all the techniques of this kind, fpredict:X×MYf'\subtext{predict}: X \times M \to Y and fpredictf\subtext{predict} are always the same: concrete

  • fpredict(x,m)=arg maxyY(fpredict(x,m))(y)f'\subtext{predict}(x, m)= \argmax\sub{y \in Y} (f''\subtext{predict}(x, m))(y)
  • fpredict(x)=fpredict(x,m)f\subtext{predict}(x) = f'\subtext{predict}(x, m)
xxmmfpredictf''\subtext{predict}pparg maxyY\argmax\sub{y \in Y}yy

"internally computes" \rightarrow pp is indeed available internally, but can be obtained from outside

  • in practice, software tools allow using both fpredictf'\subtext{predict} and fpredictf''\subtext{predict}
116 / 366

Probability and binary classification

In binary classification, with Y={pos,neg}Y=\{\text{pos},\text{neg}\}, pPYp \in P_Y has always this form: p(y)={pposif y=pos1pposif y=negp(y)= \begin{cases} p\subtext{pos} &\text{if } y=\text{pos} \\ 1-p\subtext{pos} &\text{if } y=\text{neg} \end{cases} with ppos[0,1]p\subtext{pos} \in [0,1].

Hence, prediction can be seen as:

fpredict:X×M[0,1]f'''\subtext{predict}: X \times M \to [0,1] fpredict(x,m)=pposf'''\subtext{predict}(x,m)=p\subtext{pos}

fpredict:X×MYf'\subtext{predict}: X \times M \to Y fpredict(x,m)={posif ppos0.5negotherwisef'\subtext{predict}(x,m)= \begin{cases} \text{pos} &\text{if } p\subtext{pos} \ge 0.5 \\ \text{neg} &\text{otherwise} \end{cases}

xxmmfpredictf'''\subtext{predict}pposp\subtext{pos}0.5\ge 0.5yy
117 / 366

Probability and confidence

p(y)={pposif y=pos1pposif y=negp(y)= \begin{cases} p\subtext{pos} &\text{if } y=\text{pos} \\ 1-p\subtext{pos} &\text{if } y=\text{neg} \end{cases}

The closer pposp\subtext{pos} to 0.50.5, the lower the confidence of the model in its decision:

  • ppos=0.51p\subtext{pos}=0.51 means "I think it's a positive, but I'm not sure"
  • ppos=0.49p\subtext{pos}=0.49 means "I think it's a negative, but I'm not sure"
  • ppos=0.98p\subtext{pos}=0.98 means "I'm rather sure it's a positive!"

We may measure the confidence in the binary decision as: conf(x,m)=ppos0.50.5=fpredict(x,m)0.50.5\text{conf}(x,m)=\frac{\abs{p\subtext{pos}-0.5}}{0.5}=\frac{\abs{f'''\subtext{predict}(x,m)-0.5}}{0.5}

conf[0,1]\text{conf} \in [0,1]: the greater, the more confident.

118 / 366

Changing the threshold

If we replace the fixed 0.50.5 threshold with a parameter τ\tau, we obtain a new function:

fpredictτ:X×[0,1]Yf^\tau\subtext{predict}: X \times [0,1] \to Y fpredictτ(x,τ)={posif fpredict(x,m)τnegotherwisef^\tau\subtext{predict}(x,\tau)= \begin{cases} \text{pos} &\text{if } f'''\subtext{predict}(x,m) \ge \tau \\ \text{neg} &\text{otherwise} \end{cases}

xxτ\taummfpredictf'''\subtext{predict}pposp\subtext{pos}τ\ge \tauyy

Note that:

  • for using fpredictτf^\tau\subtext{predict} on an xx, you need a concrete value for τ\tau
    • fpredict(x)=fpredictτ(x,0.5)f\subtext{predict}(x)=f^\tau\subtext{predict}(x, 0.5), i.e., 0.50.5 is the default value for τ\tau in fpredictf\subtext{predict}
  • like for fpredictf\subtext{predict}, the model is inside fpredictτf^\tau\subtext{predict}
  • you can obtain several predictions for the same observation xx by varying τ\tau

Example: if we want our diagnostic test to be more sensitive to positives, we lower τ\tau without changing the model!
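A minimal Java sketch of fpredictτf^\tau\subtext{predict} (hypothetical names), built on top of an fpredictf'''\subtext{predict} that returns pposp\subtext{pos}:

import java.util.function.BiFunction;

// Minimal sketch of f^tau_predict: same model m, tunable threshold tau.
public class ThresholdedClassifier<X, M> {

  private final BiFunction<X, M, Double> fPredictPPos; // f'''_predict
  private final M model;

  ThresholdedClassifier(BiFunction<X, M, Double> fPredictPPos, M model) {
    this.fPredictPPos = fPredictPPos;
    this.model = model;
  }

  boolean predict(X x, double tau) { // true = pos, false = neg
    return fPredictPPos.apply(x, model) >= tau;
  }

  boolean predict(X x) { // f_predict: the default threshold is tau = 0.5
    return predict(x, 0.5);
  }
}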

119 / 366

Threshold τ\tau vs. FPR, FNR

Given the same mm and the same {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}:

  • the greater τ\tau, the less frequent y=posy=\text{pos}, the lower FPR\text{FPR}, the greater FNR\text{FNR}
  • the lower τ\tau, the more frequent y=posy=\text{pos}, the greater FPR\text{FPR}, the lower FNR\text{FNR}

Example:

Example of tau vs. FPR and FNR

  • for the default threshold τ=0.5\tau=0.5, FPR20%\text{FPR}\approx 20\%, FNR15%\text{FNR}\approx 15\%
  • if you want to be more sensitive to positives, set, e.g., τ=0.25\tau=0.25, so there will be a lower FNR13%\text{FNR} \approx 13\%
  • if you know the cost of an FN is \approx double the cost of an FP and the data is balanced, then you should set τ0.12\tau\approx 0.12
why FNR=0%\text{FNR}=0\% for τ=0\tau=0 but FPR>0%\text{FPR}>0\% for τ=1\tau=1?
120 / 366

Equal Error Rate

For a model mm and a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}, the Equal Error Rate (EER) is the value of FPR (and FNR) for the τ=τEER\tau=\tau\subtext{EER} value for which FPR=FNR\text{FPR}=\text{FNR}.

For EER: the lower, the better (like error); codomain is [0,1][0,1] in practice [0,0.5][0,0.5]

Example of EER

  • for τ=0.65\tau=0.65 (vertical dashed line), FPR=FNR\text{FPR}=\text{FNR}
  • EER19%\text{EER}\approx 19\% (horizontal solid line)
121 / 366

The ROC curve

For a model mm and a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i} and a sequence (τi)i(\tau_i)_i, the Receiver operating characteristic¹ (ROC) curve is the plot of TPR\text{TPR} (=1FNR= 1-\text{FNR}) vs. FPR\text{FPR} for the different values of τ(τi)i\tau \in (\tau_i)_i.

Example of a ROC curve

  • red line: ROC curve
    • each point lies at (FPR,TPR)(\text{FPR},\text{TPR}) for a given τ\tau
  • solid black line: points for which FPR=FNR\text{FPR}=\text{FNR}
    • the xx-coord of the intersection with the red line is EER\text{EER}
    • point at top-left (FPR=FNR=0\text{FPR}=\text{FNR}=0) is the perfect classifier
  • the intersection of dashed and solid black lines is at FPR=FNR=0.5\text{FPR}=\text{FNR}=0.5
    • it is the random classifier
  • points on the dashed line are random classifiers with τ0.5\tau \ne 0.5
    • the ROC curve of a healthy classifier should never lie to the right of the dashed line!
  1. The name comes from its usage as a graphical tool for assessing radar stations during WW2.
122 / 366

Area Under the Curve (AUC)

For a model mm and a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i} and a sequence (τi)i(\tau_i)_i, the Area Under the Curve (AUC) is the area under the ROC curve.

For AUC: the greater, the better (like accuracy); codomain is [0,1][0,1] in practice [0.5,1][0.5,1]

Example of AUC

  • for the random classifier, AUC=0.5\text{AUC}=0.5
  • for the ideal classifier, AUC=1\text{AUC}=1
123 / 366

How to choose τ\tau values?

For computing both EER\text{EER} and AUC\text{AUC}, you need to compute FPR\text{FPR} and FNR\text{FNR} for many values of τ\tau.

Ingredients:

  • fpredictτf^\tau\subtext{predict}
    • i.e., fpredictf'''\subtext{predict} and a model mm
  • a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}
  • a sequence (τi)i(\tau_i)_i of threshold values
xxτ\taummfpredictf'''\subtext{predict}pposp\subtext{pos}τ\ge \tauyy

How to choose (τi)i(\tau_i)_i? recall: τ[0,1]\tau \in [0,1]; by convention, you always take also τ=0\tau=0 and τ=1\tau=1

  • evenly spaced in [0,1][0,1] at n+1n+1 points: (τi)i=(in)i=0i=n(\tau_i)_i=(\frac{i}{n})_{i=0}^{i=n}
  • evenly spaced in [τmin,τmax][\tau\subtext{min},\tau\subtext{max}]: (τi)i=(τmin+in(τmaxτmin))i=0i=n(\tau_i)_i=(\tau\subtext{min}+\frac{i}{n}(\tau\subtext{max}-\tau\subtext{min}))_{i=0}^{i=n}
    • with τmin=minifpredict(x(i),m)\tau\subtext{min}=\min_i f'''\subtext{predict}(x^{(i)},m) and τmax=maxifpredict(x(i),m)\tau\subtext{max}=\max_i f'''\subtext{predict}(x^{(i)},m)
  • taking midpoints of (ppos(i))i(p\subtext{pos}^{(i)})_i i.e., sorted {ppos(i)}i\seq{p\subtext{pos}^{(i)}}{i}
    • with ppos(i)=fpredict(x(i),m)p\subtext{pos}^{(i)}=f'''\subtext{predict}(x^{(i)},m)
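A minimal Java sketch tying the ingredients together (hypothetical names): thresholds at midpoints plus 0 and 1, one (FPR,TPR)(\text{FPR},\text{TPR}) point per τ\tau, and AUC by the trapezoidal rule:

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: ROC points from midpoint thresholds, AUC by trapezoids.
// pPos.get(i) is f'''_predict(x^(i), m); ys.get(i) is true for pos.
public class RocSketch {

  static List<double[]> rocPoints(List<Double> pPos, List<Boolean> ys) {
    List<Double> taus = new ArrayList<>(List.of(0d, 1d)); // always take 0 and 1
    List<Double> sorted = pPos.stream().sorted().toList();
    for (int i = 0; i < sorted.size() - 1; i++) {
      taus.add((sorted.get(i) + sorted.get(i + 1)) / 2); // midpoints
    }
    List<double[]> points = new ArrayList<>(); // one (FPR, TPR) per tau
    for (double tau : taus) {
      int tp = 0, fp = 0, p = 0, n = 0;
      for (int i = 0; i < ys.size(); i++) {
        boolean yHat = pPos.get(i) >= tau; // f^tau_predict
        if (ys.get(i)) { p++; if (yHat) tp++; }
        else { n++; if (yHat) fp++; }
      }
      points.add(new double[] {(double) fp / n, (double) tp / p});
    }
    return points;
  }

  static double auc(List<double[]> points) {
    List<double[]> sorted = new ArrayList<>(points);
    sorted.sort((a, b) -> Double.compare(a[0], b[0])); // sort by FPR
    double area = 0;
    for (int i = 0; i < sorted.size() - 1; i++) {
      area += (sorted.get(i + 1)[0] - sorted.get(i)[0])
          * (sorted.get(i)[1] + sorted.get(i + 1)[1]) / 2; // trapezoid
    }
    return area;
  }
}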
124 / 366

Example: τ\tau and its values

Y={pos,neg}Y=\{\c{1}{\text{pos}},\c{2}{\text{neg}}\}

y(i)y^{(i)} ppos(i)p\subtext{pos}^{(i)} y^(i)\hat{y}^{(i)} out¹
pos 0.49 neg FN
neg 0.29 neg TN
pos 0.63 pos TP
pos 0.51 pos TP
pos 0.52 pos TP
neg 0.47 neg TN
pos 0.94 pos TP
pos 0.75 pos TP
neg 0.53 pos FP
neg 0.45 neg TN
  1. with τ=0.5\tau=0.5
τ\tau FPR\text{FPR} FNR\text{FNR}
0.50.5 14=25%\frac{1}{4}=25\% 1617%\frac{1}{6}\approx 17\%
0.40.4 34=75%\frac{3}{4}=75\% 06=0%\frac{0}{6}=0\%
0.60.6 04=0%\frac{0}{4}=0\% 36=50%\frac{3}{6}=50\%

(τi)i(\tau_i)_i evenly spaced in [0,1][0,1] 9+2 values \rightarrow raw: 7 of 11 values give different rates

(τi)i(\tau_i)_i evenly spaced in [τmin,τmax]=[0.29,0.94][\tau\subtext{min},\tau\subtext{max}]=[0.29,0.94] 9+2 values \rightarrow better, but still 7 of 11 values give different rates

(τi)i(\tau_i)_i at midpoints 9+2 values \rightarrow optimal: 11 of 11 values give different rates
125 / 366

Cost of errors, index, and τ\tau

If you know the cost of error (cFPc\subtext{FP} and cFNc\subtext{FN}) and the class frequencies:

  • choose a proper τ\tau and measure FPR\text{FPR}, FNR\text{FNR}, cc

If you don't know the cost of error and you know the classifier will work at a fixed τ\tau:

  • measure FPR\text{FPR}, FNR\text{FNR} for τ=0.5\tau=0.5
  • measure EER\text{EER}

If you don't know the cost of error and don't know at which τ\tau the classifier will work:

  • measure FPR\text{FPR}, FNR\text{FNR} for τ=0.5\tau=0.5
  • measure AUC\text{AUC}

If you can afford, i.e., you have time/space:

  • measure "everything"
126 / 366

Confusion matrix

Given a multiset {(y(i),y^(i))}i\seq{(y^{(i)},\hat{y}^{(i)})}{i} of pairs, the confusion matrix has:

  • one row for each possible value yy of YY, associated with y(i)y^{(i)} (true labels)
  • one column for each possible value y^\hat{y} of YY, associated with y^(i)\hat{y}^{(i)} (predicted labels)
  • the number of pairs for which y^(i)=y^\hat{y}^{(i)}=\hat{y} and y(i)=yy^{(i)}=y in the cell

Y={pos,neg}Y=\{\c{1}{\text{pos}},\c{2}{\text{neg}}\}

y(i)y^{(i)} ppos(i)p\subtext{pos}^{(i)} y^(i)\hat{y}^{(i)} out
pos 0.49 neg FN
neg 0.29 neg TN
pos 0.63 pos TP
pos 0.51 pos TP
pos 0.52 pos TP
neg 0.47 neg TN
pos 0.94 pos TP
pos 0.75 pos TP
neg 0.53 pos FP
neg 0.45 neg TN

For this case:

yyy^\hat{y} pos\text{pos} neg\text{neg}
pos\text{pos} 5 1
neg\text{neg} 1 3

For binary classification:

yyy^\hat{y} pos\text{pos} neg\text{neg}
pos\text{pos} TP\text{TP} FN\text{FN}
neg\text{neg} FP\text{FP} TN\text{TN}

In general, being c\vect{c} the confusion matrix, it holds that:

  • the accuracy is the ratio between the sum of the diagonal and the sum of the matrix: Acc=diag(c)1c1\text{Acc} = \frac{\lVert \text{diag}(\vect{c}) \rVert_1}{\lVert \vect{c} \rVert_1}
  • TPR is the ratio of cpos,posc_{\text{pos},\text{pos}} on the sum of the first row, i.e., the row for which y=posy=\text{pos}
  • TNR is the ratio of cneg,negc_{\text{neg},\text{neg}} on the sum of the second row, i.e., the row for which y=negy=\text{neg}
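A minimal Java sketch (hypothetical names) building the confusion matrix as nested counts indexed by (y,y^)(y,\hat{y}):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: confusion matrix as counts, rows = true y, columns = ŷ.
public class ConfusionMatrixSketch {

  static <Y> Map<Y, Map<Y, Integer>> build(List<Y> ys, List<Y> yHats) {
    Map<Y, Map<Y, Integer>> matrix = new HashMap<>();
    for (int i = 0; i < ys.size(); i++) {
      matrix
          .computeIfAbsent(ys.get(i), y -> new HashMap<>()) // row: true label
          .merge(yHats.get(i), 1, Integer::sum); // column: predicted label
    }
    return matrix;
  }

  public static void main(String[] args) {
    List<String> ys = List.of("pos", "pos", "neg", "neg");
    List<String> yHats = List.of("pos", "neg", "neg", "neg");
    // e.g., {pos={pos=1, neg=1}, neg={neg=2}} (iteration order may vary)
    System.out.println(build(ys, yHats));
  }
}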
127 / 366

Multiclass classification and regression

128 / 366

Weighted accuracy for multiclass classification

Besides accuracy and error, for unbalanced datasets, the weighted accuracy (or balanced accuracy) is: wAcc=fwAcc({(y(i),y^(i))}i)=1YyY(i1(y(i)=yy(i)=y^(i))i1(y(i)=y))=1YyYAccy\text{wAcc}=f\subtext{wAcc}(\seq{(y^{(i)},\hat{y}^{(i)})}{i})=\frac{1}{|Y|} \sum_{y \in Y} \left( \frac{\sum_i \mathbf{1}(y^{(i)}=y \land y^{(i)}=\hat{y}^{(i)})}{\sum_i \mathbf{1}(y^{(i)}=y)} \right)=\frac{1}{|Y|} \sum_{y \in Y} \text{Acc}_y i.e., the (unweighted) average of the accuracy for each class. You can do the same with error, precision, recall, ...

yyy^\hat{y}
15 1 2 2
1 10 4 1
5 3 28 1
1 0 0 9

Acc=15+10+28+920+16+37+10=628374.7%\text{Acc} = \frac{15+10+28+9}{20+16+37+10} = \frac{62}{83} \approx 74.7\%

Acc=1520=75%\text{Acc}\subtext{\c{1}{⬤}} = \frac{15}{20} = 75\%
Acc=1016=62.5%\text{Acc}\subtext{\c{2}{⬤}} = \frac{10}{16} = 62.5\%
Acc=2837=75.7%\text{Acc}\subtext{\c{3}{⬤}} = \frac{28}{37} = 75.7\%
Acc=910=90%\text{Acc}\subtext{\c{4}{⬤}} = \frac{9}{10} = 90\%

wAcc=14(1520+1016+2837+910)=75.8%\text{wAcc} = \frac{1}{4} \left( \frac{15}{20}+\frac{10}{16}+\frac{28}{37}+\frac{9}{10} \right) = 75.8\%

wAcc\text{wAcc} overlooks class imbalance, Acc\text{Acc} does not; wAcc[0,1]\text{wAcc} \in [0,1]; the greater, the better

  • for binary classification, wAcc=12(TPR+TNR)\text{wAcc} = \frac{1}{2} (\text{TPR}+\text{TNR})
129 / 366

Errors in regression

Differently from classification, a prediction in regression may be more or less wrong:

  • classification: either y(i)=y^(i)y^{(i)}=\hat{y}^{(i)} (correct) or y(i)y^(i)y^{(i)}\ne\hat{y}^{(i)} (wrong)
  • regression:
    • y(i)=y^(i)y^{(i)}=\hat{y}^{(i)} (perfect);
    • y(i)+1=y^(i)y^{(i)}+1=\hat{y}^{(i)} is wrong
    • y(i)+100=y^(i)y^{(i)}+100=\hat{y}^{(i)} is much more wrong
    • ...

The error in regression measures how far the prediction y^(i)\hat{y}^{(i)} is from the true value y(i)y^{(i)}:

  • recall, we are in the context of behavior comparison, i.e., fcomp-respsf\subtext{comp-resps}
130 / 366

MAE, MSE, RMSE, MAPE

Name fcomp-resps({(y(i),y^(i))}i)f\subtext{comp-resps}(\seq{(y^{(i)},\hat{y}^{(i)})}{i})
Mean Absolute Error (MAE) MAE=1niy(i)y^(i)\text{MAE} = \frac{1}{n} \sum_i \abs{y^{(i)}-\hat{y}^{(i)}}
Mean Squared Error (MSE) MSE=1ni(y(i)y^(i))2\text{MSE} = \frac{1}{n} \sum_i (y^{(i)}-\hat{y}^{(i)})^2
Root Mean Squared Error (RMSE) RMSE=1ni(y(i)y^(i))2=MSE\text{RMSE} = \sqrt{\frac{1}{n} \sum_i (y^{(i)}-\hat{y}^{(i)})^2}=\sqrt{\text{MSE}}
Mean Absolute Percentage Error (MAPE) MAPE=1niy(i)y^(i)y(i)\text{MAPE} = \frac{1}{n} \sum_i \abs{\frac{y^{(i)}-\hat{y}^{(i)}}{y^{(i)}}}

Remarks:

  • for all:
    • the lower, the better
    • codomain is [0,+[[0, +\infin[ MAPE might be \infin
  • MAE and RMSE retain the unit of measure: e.g., yy is in meters, MAE is in meters
  • MAPE is scale-independent and dimensionless
  • MSE and RMSE are more influenced by observations with large errors
  • MAPE "does not work" if the true yy is 00
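A minimal Java sketch of the four indexes (hypothetical names):

import java.util.List;

// Minimal sketch of MAE, MSE, RMSE, MAPE on aligned lists of y and ŷ.
public class RegressionErrors {

  static double mae(List<Double> ys, List<Double> yHats) {
    double sum = 0;
    for (int i = 0; i < ys.size(); i++) {
      sum += Math.abs(ys.get(i) - yHats.get(i));
    }
    return sum / ys.size(); // same unit of measure as y
  }

  static double mse(List<Double> ys, List<Double> yHats) {
    double sum = 0;
    for (int i = 0; i < ys.size(); i++) {
      double d = ys.get(i) - yHats.get(i);
      sum += d * d; // large errors weigh more
    }
    return sum / ys.size();
  }

  static double rmse(List<Double> ys, List<Double> yHats) {
    return Math.sqrt(mse(ys, yHats));
  }

  static double mape(List<Double> ys, List<Double> yHats) {
    double sum = 0;
    for (int i = 0; i < ys.size(); i++) {
      sum += Math.abs((ys.get(i) - yHats.get(i)) / ys.get(i)); // inf if y = 0
    }
    return sum / ys.size();
  }
}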
131 / 366

Assessing learning techniques

132 / 366

Purpose of assessment

Premise:

  • an effective learning technique is a pair flearn,fpredictf'\subtext{learn},f'\subtext{predict} that learns a good model mm
    • flearnf'\subtext{learn} needs a dataset for producing mm
  • an effective model mm is one that has the same behavior as the real system ss
    • we measure this with fcomp-behaviorf\subtext{comp-behavior}, that internally uses a dataset

Goal:

  • we want a measure (a number!) of the effectiveness of flearn,fpredictf'\subtext{learn},f'\subtext{predict}

Sketch of solution:

  1. learn an mm with flearnf'\subtext{learn}
  2. measure the effectiveness Eff\text{Eff} of mm with fcomp-behaviorf\subtext{comp-behavior} (and one or more suitable fcomp-respsf\subtext{comp-resps})
  3. say that the effectiveness of the learning technique is Eff\text{Eff}

Eff\text{Eff} might be accuracy, TPR and TNR, MAE, error, ...

133 / 366

What data?

Sketch of solution:

  1. learn an mm with flearnf'\subtext{learn}
  2. measure the effectiveness Eff\text{Eff} of mm with fcomp-behaviorf\subtext{comp-behavior} (and one or more suitable fcomp-respsf\subtext{comp-resps})
  3. say that the effectiveness of the learning technique is Eff\text{Eff}

Both steps 1 and 2 need a dataset:

  • can we use the same DD?

In principle yes, in practice no:

  • many learning techniques attempt to learn a model mm that, by definition, perfectly models the learning set
  • you want to see if the learned model generalizes beyond the examples
134 / 366

Effectiveness of a learning technique

flearn-effect:LXY×P(X×Y)Rf\subtext{learn-effect}: \mathcal{L}_{X \to Y} \times \mathcal{P}^*(X \times Y) \to \mathbb{R} where LXY\mathcal{L}_{X \to Y} is the set of learning techniques:

  • LXY=FP(X×Y)FXY\mathcal{L}_{X \to Y}= \mathcal{F}_{\mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y}}
  • or LXY=FP(X×Y)M×FX×MY\mathcal{L}_{X \to Y} = \mathcal{F}_{\mathcal{P}^*(X \times Y) \to M} \times \mathcal{F}_{X \times M \to Y}
flearn,Df\subtext{learn}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}
or
flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}

Given a learning technique and a dataset, returns a number representing the effectiveness of the learning technique on that dataset.

For consistency, let's reshape model assessment case:

function predict-effect(fpredict,m,D)\text{predict-effect}(f'\subtext{predict}, m, D) {
{(y(i),y^(i))}iforeach(\seq{(y^{(i)},\hat{y}^{(i)})}{i} \gets \text{foreach}(
D,D,
both(,second,fpredict(first(),m))\text{both}(\cdot,\text{second},f'\subtext{predict}(\text{first}(\cdot),m))
))
veffectfcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets f\subtext{comp-resps}(\seq{(y^{(i)},\hat{y}^{(i)})}{i})
return veffectv\subtext{effect};
}

fpredict,m,Df'\subtext{predict}, m, Dfpredict-effectf\subtext{predict-effect}veffectv\subtext{effect}

We are just leaving the data collection out of predict-effect()\text{predict-effect}().

first()\text{first}() and second()\text{second}() take the first or second element of a pair.
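The same reshaping as a Python sketch (predict_fn, model, and metric are placeholders for a concrete fpredictf'\subtext{predict}, mm, and fcomp-respsf\subtext{comp-resps}):

def predict_effect(predict_fn, model, dataset, metric):
    # dataset: a collection of (x, y) pairs; metric: an f_comp-resps
    pairs = [(y, predict_fn(x, model)) for (x, y) in dataset]
    return metric(pairs)

def accuracy(pairs):  # an example metric
    return sum(y == y_hat for (y, y_hat) in pairs) / len(pairs)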

135 / 366

Same dataset

function learn-effect-same(flearn,fpredict,D)\text{learn-effect-same}(f'\subtext{learn},f'\subtext{predict}, D) {
mflearn(D)m \gets f'\subtext{learn}(D)
veffectpredict-effect(fpredict,m,D)v\subtext{effect} \gets \text{predict-effect}(f'\subtext{predict},m,D)
return veffectv\subtext{effect};
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}

The entire DD is used for learning the model and assessing it.

Effectiveness of assessment:

  • generalization is not assessed
    • for techniques that, by design, learn a model that perfectly models the learning data, learn-effect-same\text{learn-effect-same} gives perfect effectiveness, regardless of mm, regardless of DD
  • what if DD is lucky/unlucky? no robustness w.r.t. DD

Poor! 👎

Efficiency of assessment:

  • learning is executed just once

Good! 👍

136 / 366

Static train/test division

function learn-effect-static(flearn,fpredict,D,r)\text{learn-effect-static}(f'\subtext{learn},f'\subtext{predict}, D,r) {
Dlearnsubbag(D,r)D\subtext{learn} \gets \text{subbag}(D, r)
DtestDDlearnD\subtext{test} \gets D \setminus D\subtext{learn}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
veffectpredict-effect(fpredict,m,Dtest)v\subtext{effect} \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
return veffectv\subtext{effect};
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}rr

r[0,1]r \in [0,1] is a parameter

DD is split in DlearnD\subtext{learn} for learning and DtestD\subtext{test} for assessment: "split"="partitioned"; yet, since DD is a multiset, the same observation may occur in both (i.e., DlearnDtestD\subtext{learn} \cap D\subtext{test} might be \ne \emptyset)

  • DtestD\subtext{test} is called the test set
  • DlearnD\subtext{learn} and DtestD\subtext{test} do not overlap and DlearnD=r\frac{|D\subtext{learn}|}{|D|}=r; common values: r=80%r=80\%, r=70%r=70\%, ...
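A sketch with scikit-learn (the dataset and the learner are arbitrary choices; train_test_split does the split, with rr passed as train_size):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)  # r = 80%
m = DecisionTreeClassifier().fit(X_learn, y_learn)  # learning, executed once
v_effect = accuracy_score(y_test, m.predict(X_test))  # assessment on D_test
print(v_effect)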

Effectiveness of assessment:

  • generalization is assessed
  • what if the division of DD in DlearnD\subtext{learn} and DtestD\subtext{test} is lucky/unlucky? no robustness w.r.t. the division

Fair! \approx👍

Efficiency of assessment:

  • learning is executed just once

Good! 👍

137 / 366

Role of DtestD\subtext{test}

DtestD\subtext{test}, with respect to the model mm, is unseen data, because it has not been used for learning.

Assessing mm on unseen data answers the questions:

  • to which degree the model generalizes beyond examples?
  • does the model work well on new data?
  • how well will the ML system work in the future? on data that does not exist today

In practice DtestD\subtext{test} and DlearnD\subtext{learn} are obtained from a DD that is collected all at once:

  • DtestD\subtext{test} might represent future data only roughly
138 / 366

Assessment vs. reality

What if the model/ML system does not work well on actual unseen/new/future data? That is, what if the predictions are wrong in practice?

Assessment 👍 - Reality 👎

DD was not representative w.r.t. the real system:

  • low coverage
  • old, i.e., the system has changed

or some bug in the implementation...

Assessment 👎 - Reality 👎

DD is not informative w.r.t. the real system:

  • yy in DD does not depend on xx in DD
    • wrong features
    • too much noise in the features

or some bug in the implementation...

Assessment 👍 - Reality 👍

Nice! We did everything well!

or some bug in the implementation...

Assessment 👎 - Reality 👍

Sooooo lucky! 🍀🍀🍀

or some bug in the implementation...

you never know if there is some bug in the implementation...

139 / 366

Repeated random train/test division

function learn-effect-repeated(flearn,fpredict,D,r,k)\text{learn-effect-repeated}(f'\subtext{learn},f'\subtext{predict}, D,r,k) {
for j1,,kj \in 1,\dots,k {
Dlearnsubbag(D,r)D\subtext{learn} \gets \text{subbag}(D, r)
DtestDDlearnD\subtext{test} \gets D \setminus D\subtext{learn}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
vjpredict-effect(fpredict,m,Dtest)v_j \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
}
return 1kjvj\frac{1}{k}\sum_j v_j;
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}r,kr,k

r[0,1]r \in [0,1] and kN+k \in \mathbb{N}^+ are parameters

DD is split in DlearnD\subtext{learn} and DtestD\subtext{test} for kk times and the measures are averaged: subbag()\text{subbag}() has to be non-deterministic

  • common values: k=10k=10, k=5k=5, ...
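A sketch with scikit-learn, whose ShuffleSplit implements exactly this repeated random division (the learner and the dataset are, again, arbitrary choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
splitter = ShuffleSplit(n_splits=10, train_size=0.8, random_state=0)  # k=10, r=0.8
v = cross_val_score(DecisionTreeClassifier(), X, y, cv=splitter)  # one v_j per split
print(np.mean(v))  # the averaged effectiveness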

Effectiveness of assessment:

  • generalization is assessed
  • measures are repeated with different DlearnD\subtext{learn} and DtestD\subtext{test}: robustness w.r.t. data

Good! 👍

Efficiency of assessment:

  • learning is executed kk times: might be heavy

k\propto k 🫳

140 / 366

Cross-fold validation (CV)

function learn-effect-cv(flearn,fpredict,D,k)\text{learn-effect-cv}(f'\subtext{learn},f'\subtext{predict}, D, k) {
for j1,,kj \in 1,\dots,k {
Dtestfold(D,j,k)D\subtext{test} \gets \text{fold}(D, j, k)
DlearnDDtestD\subtext{learn} \gets D \setminus D\subtext{test}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
vjpredict-effect(fpredict,m,Dtest)v_j \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
}
return 1kjvj\frac{1}{k}\sum_j v_j;
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}kk

kN+k \in \mathbb{N}^+ is a parameter

Cross-fold validation is like learn-effect-repeated\text{learn-effect-repeated}, but the kk DtestD\subtext{test} are mutually disjoint (folds).
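With scikit-learn, KFold gives the kk mutually disjoint folds (a sketch; the learner is an arbitrary choice):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
v = cross_val_score(DecisionTreeClassifier(), X, y,
                    cv=KFold(n_splits=10, shuffle=True, random_state=0))  # k=10
print(np.mean(v))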

Effectiveness of assessment:

  • generalization is assessed
  • measures are repeated with different DlearnD\subtext{learn} and DtestD\subtext{test}: robustness w.r.t. data

Good! 👍

Efficiency of assessment:

  • learning is executed kk times: might be heavy

k\propto k 🫳

141 / 366

Leave-one-out CV (LOOCV)

Simply a CV where the number of folds kk is D|D|:

  • each DtestD\subtext{test} consists of just one observation
flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}
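The corresponding scikit-learn sketch, with LeaveOneOut as the splitter (i.e., k=Dk=|D|):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
v = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(np.mean(v))  # |D| learnings: may be slow on large datasets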

Effectiveness of assessment:

  • generalization is assessed
  • measures are repeated with different DlearnD\subtext{learn} and DtestD\subtext{test}: robustness w.r.t. data

Good! 👍

Efficiency of assessment:

  • learning is executed k=Dk=|D| times: might be heavy

Bad 👎

142 / 366

Visual summary

Same

Eff\rightarrow \text{Eff}

1 learning; D|D| predictions

Static random (r=0.8r=0.8)

Eff\rightarrow \text{Eff}

1 learning; D(1r)|D|(1-r) predictions

Repeated random (r=0.8r=0.8, k=4k=4)

Eff1\rightarrow \text{Eff}_1
Eff2\rightarrow \text{Eff}_2
Eff3\rightarrow \text{Eff}_3
Eff4\rightarrow \text{Eff}_4
} Eff\rightarrow \text{Eff}

kk learnings; D(1r)|D|(1-r) pred. after each, kD(1r)k|D|(1-r) pred.

CV (k=5k=5)

Eff1\rightarrow \text{Eff}_1
Eff2\rightarrow \text{Eff}_2
Eff3\rightarrow \text{Eff}_3
Eff4\rightarrow \text{Eff}_4
Eff5\rightarrow \text{Eff}_5
} Eff\rightarrow \text{Eff}

kk learnings; 1kD\frac{1}{k}|D| pred. after each, D|D| pred. tot.

LOOCV

Eff1\rightarrow \text{Eff}_1
Eff2\rightarrow \text{Eff}_2
...
EffD\rightarrow \text{Eff}_{|D|}
} Eff\rightarrow \text{Eff}

D|D| learnings; 11 pred. after each, D|D| pred. tot.

143 / 366

More than the average

Repeated random, CV, and LOOCV internally compute the model effectiveness for several models learned on (slightly) different datasets:

Eff1,Eff2,,EffkEff=1kjEffj\text{Eff}_1, \text{Eff}_2, \dots, \text{Eff}_k \rightarrow \text{Eff}=\c{2}{\frac{1}{k} \sum_j \text{Eff}_j}

function learn-effect-cv(flearn,fpredict,D,k)\text{learn-effect-cv}(f'\subtext{learn},f'\subtext{predict}, D, k) {
for (j1,,kj \in 1,\dots,k) {
Dtestfold(D,j,k)D\subtext{test} \gets \text{fold}(D, j, k)
DlearnDDtestD\subtext{learn} \gets D \setminus D\subtext{test}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
vjpredict-effect(fpredict,m,Dtest)v_j \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
}
return 1kjvj\frac{1}{k}\sum_j v_j;
}

We can compute both the mean and the standard deviation from (Effj)j(\text{Eff}_j)_j:

Effμ=1kjEffj\text{Eff}_\mu=\frac{1}{k} \sum_j \text{Eff}_j

Effσ=1kj(EffjEffμ)2\text{Eff}_\sigma=\sqrt{\frac{1}{k} \sum_j \left(\text{Eff}_j-\text{Eff}_\mu\right)^2}

  • Mean Effμ\text{Eff}_\mu: what's the learning technique effectiveness on average?
  • Standard deviation Effσ\text{Eff}_\sigma: how consistent is the learning technique w.r.t. different datasets?
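Continuing the CV sketch above, the per-fold scores give both indexes directly:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
effs = cross_val_score(DecisionTreeClassifier(), X, y,
                       cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(effs.mean(), effs.std())  # Eff_mu and Eff_sigma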
144 / 366

Comparison using many measures

Suppose you have assessed two learning techniques with 10-CV and AUC (with midpoints τ\tau):

  • for LT1: AUCμ=0.83\text{AUC}_\mu=0.83 and AUCσ=0.04\text{AUC}_\sigma=0.04
  • for LT2: AUCμ=0.75\text{AUC}_\mu=0.75 and AUCσ=0.03\text{AUC}_\sigma=0.03

What's the best learning technique?

Now, suppose that you instead find:

  • for LT1: AUCμ=0.81\text{AUC}_\mu=0.81 and AUCσ=0.12\text{AUC}_\sigma=0.12
  • for LT2: AUCμ=0.78\text{AUC}_\mu=0.78 and AUCσ=0.02\text{AUC}_\sigma=0.02

What's the best learning technique?

  • LT1 is better, on average, but less consistent
  • on actual, unseen data, LT1 might give a worse model than LT2

Can we really state that LT1 is better than LT2?

145 / 366

Comparison and statistics

Broader example:
suppose you meet 1010 guys from Udine and 1010 from Trieste and ask them how tall they are:

City Measures μ\mu σ\sigma
Udine 154,193,170,175,172,183,160,162,161,179154, 193, 170, 175, 172, 183, 160, 162, 161, 179 170.9170.9 12.0212.02
Trieste 167,166,180,175,168,167,173,181,169,173167, 166, 180, 175, 168, 167, 173, 181, 169, 173 171.9171.9 5.445.44

Questions:

  1. are these 1010 guys from Trieste taller than these 1010 guys from Udine?
  2. are guys from Trieste taller than guys from Udine?

Possible ways of answering:

  • laziest: yes and yes μTs>μUd\mu\subtext{Ts} > \mu\subtext{Ud} and you assume these 10+10 are representative
  • lazy: yes and I don't know μTs>μUd\mu\subtext{Ts} > \mu\subtext{Ud} but you don't assume representativeness
  • smart: yes and let's look at boxplot assume "these" means "these on average"
  • stats-geek: yes and let's do a statistical significance test assume "these" means "these on average"
146 / 366

Comparing with boxplot

Boxplot of Ts and Ud guys height

Questions:

  1. are these 1010 guys from Trieste taller than these 1010 guys from Udine?
  2. are guys from Trieste taller than guys from Udine?

Answers with the boxplot:

  1. yes, but just a bit
  2. prefer not to say
    • as an aside: people from Udine are much less consistent in height
147 / 366

Statistical significance test

Disclaimer: here, just a brief overview; go to statisticians for more details/theory

For us, a statistical significance test is a procedure that, given two samples {xa,i}i\seq{x_{a,i}}{i} and {xb,i}i\seq{x_{b,i}}{i} (i.e., collections of observations) of two random variables XaX_a and XbX_b and a set of hypotheses H0H_0 (the null hypothesis), returns a number p[0,1]p \in [0,1], called the pp-value.

{xa,i}i,{xb,i}i,H0\seq{x_{a,i}}{i}, \seq{x_{b,i}}{i}, H_0fstat-testf\subtext{stat-test}pp

The pp-value represents the probability that, by collecting other two samples from the same random variables and assuming that H0H_0 still holds, the new two samples are more unlikely than {xa,i}i,{xb,i}i\seq{x_{a,i}}{i}, \seq{x_{b,i}}{i}.

148 / 366

Example

H0H_0: (you assume all are true)

  • XaX_a is normally distributed
  • XbX_b is normally distributed
  • μa=E[Xa]=μb=E[Xb]\mu_a=E[X_a] = \mu_b=E[X_b] (our question, indeed)

Samples:

  • XaX_a sample: {1,1,2,2,3,3}\{1,1,2,2,3,3\}
  • XbX_b sample: {0,0,1,0,1,1}\{0,0,1,0,1,1\}

p=0.90p=0.90 means:

  • if you resample XaX_a, XbX_b, very likely you will find samples that are more unlikely, given H0H_0
  • so, these samples are indeed likely, given H0H_0
  • so, I can assume H0H_0 is true

p=0.01p=0.01 means:

  • if you resample XaX_a, XbX_b, very unlikely you will find samples that are more unlikely, given H0H_0
  • so, these samples are indeed unlikely, given H0H_0
  • so, I can think that H0H_0 is likely false I've been "very lucky" with these samples, if H0H_0 is true; or no luck if it's false
    • not necessarily the μa=μb\mu_a = \mu_b part; maybe the normality part
149 / 366

In practice

{xa,i}i,{xb,i}i,H0\seq{x_{a,i}}{i}, \seq{x_{b,i}}{i}, H_0fstat-testf\subtext{stat-test}pp

There exist several concrete statistical significance tests, e.g.:

  • Wilcoxon (in many versions)
  • Friedman (in many versions)

Usually, you aim at arguing that μa>μb\mu_a > \mu_b (one-tailed) or μaμb\mu_a \ne \mu_b (two-tailed):

  1. you choose one test based on the other parts of H0H_0
  2. you compute the pp-value
  3. you hope it is low
    • and compare it against a predefined threshold α\alpha, usually 0.050.05
    • with \ne, if p<αp<\alpha, you say that there is a statistically significant difference (between the mean values)
150 / 366

Trieste vs. Udine

> wilcox.test(h_ts, h_ud)
Wilcoxon rank sum test with continuity correction
data: h_ts and h_ud
W = 54.5, p-value = 0.7621
alternative hypothesis: true location shift is not equal to 0

H0H_0 \ni true location shift is equal to 0

p=0.7621>0.05p=0.7621 > 0.05: we cannot reject the null hypothesis
\Rightarrow people from Trieste are not taller than people from Udine or, at least, we cannot state this
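The same test in Python, as a sketch: SciPy exposes the Wilcoxon rank sum test as the Mann-Whitney U test (the p-value may differ slightly from R's, which applies a continuity correction):

from scipy.stats import mannwhitneyu

h_ud = [154, 193, 170, 175, 172, 183, 160, 162, 161, 179]
h_ts = [167, 166, 180, 175, 168, 167, 173, 181, 169, 173]
stat, p = mannwhitneyu(h_ts, h_ud, alternative="two-sided")
print(p)  # > 0.05: we cannot reject the null hypothesis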

More on statistical significance tests:

  • Joaquín Derrac et al. "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms". In: Swarm and Evolutionary Computation 1.1 (2011)
  • Colas, Cédric, Olivier Sigaud, and Pierre-Yves Oudeyer. "How many random seeds? statistical power analysis in deep reinforcement learning experiments." arXiv preprint arXiv:1806.08295 (2018).
  • Greenland, Sander, et al. "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European journal of epidemiology 31.4 (2016): 337-350.
151 / 366

Examples from research papers

152 / 366

Android malware detection¹ (1)

Results presentation for Android malware detection

  • binary classification
  • a few learning techniques
  • 10-CV
  • just effectiveness
    • μ\mu, σ\sigma for accuracy, FPR, FNR

Similar:
Canfora, Gerardo, et al. "Detecting android malware using sequences of system calls." Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. 2015.

  • one dataset, three variants of effectiveness
    • unseen run of known app
    • unseen app of known family
    • unseen app of unseen family
  1. Canfora, Gerardo, et al. "Acquiring and analyzing app metrics for effective mobile malware detection." Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics. 2016.
153 / 366

Twitter botnet detection¹

Results presentation for Twitter botnet detection

  • binary classification
  • a few learning techniques
  • a baseline
  • just effectiveness
  • MCC is the Matthews correlation coefficient
    • MCC=TP  TNFP  FN(TP+FP)(TP+FN)(TN+FP)(TN+FN)\text{MCC}=\frac{\text{TP} \; \text{TN} - \text{FP} \; \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}
  1. Mazza, Michele, et al. "Rtbust: Exploiting temporal patterns for botnet detection on twitter." Proceedings of the 10th ACM conference on web science. 2019.
154 / 366

Anomaly detection in cyber-physical systems¹

Results presentation for CPS anomaly detection

  • anomaly detection
    • binary classification with only negative examples in learning
  • many datasets
  • two methods
  • fevalsf\subtext{evals} is a measure of efficiency of learning
  • TPR, FPR, AUC for effectiveness
  1. Indri, Patrick, et al. "One-Shot Learning of Ensembles of Temporal Logic Formulas for Anomaly Detection in Cyber-Physical Systems." European Conference on Genetic Programming (Part of EvoStar). Springer, Cham, 2022.
155 / 366

AutoML approaches comparison¹

Results presentation for AutoML comparison

  • 6 approaches
  • 10 scenarios
  • box plots
    • accuracy
    • F1 for unbalanced case
  1. Truong, Anh, et al. "Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools." 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI). IEEE, 2019.
156 / 366

Assessing supervised ML

Brief recap

157 / 366

Assessing a model

Question: is the model modeling the real system?

Answer: compare responses on the same data and compute one or more performance indexes!

Model mm (or fpredictf\subtext{predict})

fpredict(,m)f'\subtext{predict}(\cdot, m)xxyy

Real system ss

ssxxyy

Binary classification

  • FPR and FNR
    • TNR and TPR
    • precision and recall
    • sensitivity and spec.
  • EER greater cost, lower efficiency
  • AUC greater cost, lower efficiency

Classification (w/ binary)

  • accuracy
  • error
  • weighted accuracy

Regression

  • MAE
  • MSE
  • RMSE
  • MAPE

Bounds for classification effectiveness:

  • random classifier (lower bound)
  • dummy classifier (better lower bound, baseline)
  • Bayes classifier (ideal upper bound)
158 / 366

Assessing a learning technique

Effectiveness of the single technique

Sketch: learn a model on DlearnD\subtext{learn}, assess the model on DtestD\subtext{test}; which learning/test division?

Same / Static rnd / Repeated rnd / CV / LOOCV / ...

Comparison between techniques

  • just compare one measure: Eff1\text{Eff}_1 vs. Eff2\text{Eff}_2
  • compare μ\mu of several measures: Effμ,1\text{Eff}_{\mu,1} vs. Effμ,2\text{Eff}_{\mu,2}
  • compare μ\mu and σ\sigma of several measures: Effμ,1,Effσ,1\text{Eff}_{\mu,1},\text{Eff}_{\sigma,1} vs. Effμ,2,Effσ,2\text{Eff}_{\mu,2}, \text{Eff}_{\sigma,2}
  • compare using boxplots
  • compare using a statistical significance test
159 / 366

Effectiveness and efficiency of assessment

Indexes¹

[chart: indexes placed by effectiveness (low to large) vs. efficiency (low to large) of assessment: Acc\text{Acc}/Err\text{Err}, FPR\text{FPR}+FNR\text{FNR}, EER\text{EER}, AUC\text{AUC}]

Learning/test division

[chart: learning/test divisions placed by effectiveness (low to large) vs. efficiency (low to large) of assessment: Same, Static rnd, CV/Repeated rnd, LOOCV]
  1. Mainly for binary classification
  2. + here means "use both"
160 / 366

Tree-based learning techniques

161 / 366

Once upon a time¹... there is an amusement park with a carousel and an attendant deciding who can ride and who cannot ride. The park owner wants to replace the attendant with a robotic gate.

The owner calls us as machine learning experts.

A carousel

  1. For almost all the learning techniques, we'll (i) see a toy, but "realistic", problem, (ii) try to learn a model by hand (i.e., human learning), and (iii) try to translate the manual procedure into an automatic one (i.e., machine learning).
162 / 366

Approaching the problem

  1. Should we use ML? \rightarrow yes
  2. Supervised vs. unsupervised \rightarrow supervised
  3. Define the problem statement:
    • define XX and YY
    • feature engineering
    • define a way for assessing solutions
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

XX and YY

  • xx is a person approaching the carousel
  • yy is can ride\text{can ride} or cannot ride\text{cannot ride} (binary class)

Features (chosen with domain expert):

  • person height (in cm)
  • person age (in years)

Hence:

  • X=Xheight×Xage=R+×R+X = X\subtext{height} \times X\subtext{age} = \mathbb{R}^+ \times \mathbb{R}^+
  • x=(xheight,xage)\vect{x}=(x\subtext{height}, x\subtext{age}) (p=2p=2 numeric independent variables)

We (the ML expert and the domain expert) decide to collect some data D={(x(i),y(i))}iD=\seq{(x^{(i)},y^{(i)})}{i} by observing the real system:

  • it'll come handy for both learning and assessment
163 / 366

Exploring the data

Carousel data

The data exploration suggests that using ML is not a terrible idea.

Assume we are computer scientists and we like (nested) if-then-else structures: can we manually build an if-then-else structure that allows us to make the decision?

Requirements (to keep it feasible manually):

  • each if condition should:
    • involve just one independent variable
    • consist of a threshold comparison
  • the decision has to be \c{1}{●} or \c{2}{●} (one of the two classes)

Strategy:

  • tell apart points of different colors
164 / 366

Building the if-then-else

Carousel data

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}

  • requirements are met
  • background color at position x=(xage,xheight)\vect{x}=(x\subtext{age},x\subtext{height}) is the color the code above will assign to that x\vect{x}, i.e., fpredict(x)f\subtext{predict}(\vect{x})
  • most of the examples fall in the correct colored region
    • maybe the else branch is too rough

Let's improve it!

165 / 366

Building the if-then-else

Carousel data

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
if xheight120x\subtext{height}\le 120 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}
}

  • requirements are met
  • almost all the examples fall in the correct colored region

Nice job!

166 / 366

The decision tree

This if-then-else nested structure can be represented as a tree:

xagex\subtext{age} vs. 1010\le>>xheightx\subtext{height} vs. 120120\le>>

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
if xheight120x\subtext{height}\le 120 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}
}

We call this a decision tree, since we use it inside an fpredictf\subtext{predict} for making a decision:

  • it's a binary tree, since nodes have exactly 0 or 2 children
  • non-terminal nodes (or branch nodes) hold a pair (independent variable, threshold)
  • terminal nodes (or leaf nodes) hold one value yYy \in Y
167 / 366

De-hard-coding fpredictf\subtext{predict}

Now: our human learned fpredictf\subtext{predict}

fpredictf\subtext{predict}xxyy

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
if xheight120x\subtext{height}\le 120 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}
}

Goal: an fpredictf'\subtext{predict} working on any tree

fpredictf'\subtext{predict}x,m\vect{x},myy

function predict(x,m)\text{predict}(\vect{x}, m) {
...
}

We human learned (i.e., manually designed) a function where the decision tree is hard-coded in the predict()\text{predict}() function in the form of an if-then-else structure:

  • can we pull the decision tree out of it and make predict()\text{predict}() a templated function?
168 / 366

Formalizing the decision tree

Scenario: classification with multivariate numerical features:

  • X=X1××XpX = X_1 \times \dots \times X_p, with each XiRX_i\subseteq\mathbb{R}
    • we write x=(x1,,xp)=(xi)i\vect{x} = (x_1,\dots,x_p)=(x_i)_i
  • YY, finite without ordering

The model tTp,Yt \in T_{p,Y} is a decision tree defined over X1××Xp,YX_1 \times \dots \times X_p,Y, i.e.:

  • each tt is a binary tree
  • each non-terminal node is labeled with a pair (j,τ)(j,\tau), with j{1,,p}j \in \{1,\dots,p\} and τR\tau \in \mathbb{R}
    • jj is the index of the independent variable
    • τ\tau is a threshold for comparison
  • each terminal node is labeled with a yYy \in Y
xagex\subtext{age} vs. 1010\le>>xheightx\subtext{height} vs. 120120\le>>
169 / 366

Compact representation of (binary) trees

We represent a tree tTLt \in T_L as: t=[l;t;t]t = \tree{\c{3}{l}}{\c{4}{t'}}{\c{4}{t''}} where t,tTL{}t', t'' \in T_L \cup \{\varnothing\} are the left and right children trees and lLl \in L is the label.

If the tree is a terminal node¹, it has no children (i.e., t=t=t'=t''=\varnothing) and we write: t=[l;;]=[l]t = \tree{l}{\varnothing}{\varnothing}=\treel{l}

For decision trees:

  • L=({1,,p}×R)YL= (\{1,\dots,p\} \times \mathbb{R}) \cup Y, that is, a label can be a pair (j,τ)(j,\tau) or a yy
  • if lYl \in Y, then t=t=t'=t''=\varnothing

We shorten T({1,,p}×R)YT_{(\{1,\dots,p\} \times \mathbb{R}) \cup Y} as Tp,YT_{p,Y}.

xagex\subtext{age} vs. 1010\le>>xheightx\subtext{height} vs. 120120\le>>

With:

  • X=Xage×Xheight=X1×X2X=X\subtext{age} \times X\subtext{height} = X_1 \times X_2
  • Y={,}Y=\set{\c{1}{●},\c{2}{●}}

This tree is: t=[(1,10);[];[(2,120);[];[]]]t = \tree{(1,10)}{\treel{\c{1}{●}}}{\tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}}

Would you be able to write a parser for this?

  1. Actually, node = tree, i.e., a node is a tree and a tree is a node!
170 / 366

Templated fpredictf'\subtext{predict}

function predict(x,t)\text{predict}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
ylabel-of(t)y \gets \text{label-of}(t)
return yy
} else { //hence tt is a branch node
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict(x,left-child-of(t))\text{predict}(\vect{x}, \text{left-child-of}(t)) //recursion
} else {
return predict(x,right-child-of(t))\text{predict}(\vect{x}, \text{right-child-of}(t)) //recursion
}
}
}

  • has-children(t)\text{has-children}(t) is true iff tt is not terminal
  • label-of(t)\text{label-of}(t) returns the label of tt
    • a yYy \in Y for terminal nodes
    • a (j,τ){1,,p}×R(j,\tau) \in \{1,\dots,p\} \times \mathbb{R} for non-terminal nodes
  • left-child-of(t)\text{left-child-of}(t) and right-child-of(t)\text{right-child-of}(t) return the left or right child of tt
    • that are other trees, in general

It's a recursive function that:

  • works with any tTp,Yt \in T_{p,Y} and any xRp\vect{x} \in \mathbb{R}^p
  • always returns a yYy \in Y
fpredictf'\subtext{predict}x,t\vect{x},tyy
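A Python sketch of this templated fpredictf'\subtext{predict}, with a tree as a nested tuple (label, left, right), 0-based feature indexes, and the two carousel classes written as strings (all naming choices are illustrative):

def predict(x, t):
    label, left, right = t
    if left is None:  # terminal node: the label is a class
        return label
    j, tau = label  # branch node: the label is (feature index, threshold)
    return predict(x, left) if x[j] <= tau else predict(x, right)

# the carousel tree, with x = (age, height)
t = ((0, 10),
     ("cannot ride", None, None),
     ((1, 120),
      ("cannot ride", None, None),
      ("can ride", None, None)))
print(predict((14, 155), t))  # follows the right branch twice: "can ride"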
171 / 366

fpredictf'\subtext{predict} application example

1st call: x=(14,155),t=[(1,10);[];[(2,120);[];[]]]\vect{x}=(14,155), t = \tree{(1,10)}{\treel{\c{1}{●}}}{\tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}}

¬has-children(t)=false\neg\text{has-children}(t)=\text{false}
(j,τ)=(1,10)(j,\tau)=(1,10)
x110=falsex_1 \le 10 = \text{false}
right-child-of(t)=[(2,120);[];[]]\text{right-child-of}(t)= \tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}

2nd call: x=(14,155),t=[(2,120);[];[]]\vect{x}=(14,155), t = \tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}

¬has-children(t)=false\neg\text{has-children}(t)=\text{false}
(j,τ)=(2,120)(j,\tau)=(2,120)
x2120=falsex_2 \le 120 = \text{false}
right-child-of(t)=[]\text{right-child-of}(t)= [\c{2}{●}]

3rd call: x=(14,155),t=[]\vect{x}=(14,155), t = \treel{\c{2}{●}}

¬has-children(t)=true\neg\text{has-children}(t)=\text{true}
y=y=\c{2}{●}, which is then returned up through the three nested calls

function predict(x,t)\text{predict}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
ylabel-of(t)y \gets \text{label-of}(t)
return yy
} else {
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict(x,left-child-of(t))\text{predict}(\vect{x}, \text{left-child-of}(t))
} else {
return predict(x,right-child-of(t))\text{predict}(\vect{x}, \text{right-child-of}(t))
}
}
}

172 / 366

Towards tree learning

We have our fpredict:Rp×Tp,YYf'\subtext{predict}: \mathbb{R}^p \times T_{p,Y} \to Y; for having a learning technique we miss only the learning function, i.e., flearn:P(Rp×Y)Tp,Yf'\subtext{learn}: \mathcal{P}^*(\mathbb{R}^p \times Y) \to T_{p,Y}:

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}tt

What we did manually (i.e., how we human learned):

  1. until we are satisfied
  2. put a vertical/horizontal line that well separates the data
  3. repeat from step 1 once for each of the two resulting regions

Let's rewrite it as (pseudo-)code!

173 / 366

Recursive binary splitting

function learn({(x(i),y(i))}i)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
if should-stop({y(i)}i)\text{should-stop}(\seq{y^{(i)}}{i}) then {
yarg maxyYi1(y(i)=y)y^\star \gets \argmax_{y \in Y} \sum_i \mathbf{1}(y^{(i)}=y) //yy^\star is the most frequent class
return node-from(y,,)\text{node-from}(y^\star,\varnothing,\varnothing)
} else { //hence tt is a branch node
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ),\c{3}{\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau})}, //recursion
learn({(x(i),y(i))}ixj(i)>τ)\c{3}{\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau})} //recursion
)
return tt
}
}

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}tt
  1. until we are satisfied
  2. put a vertical/horizontal line that well separates the data
  3. repeat step 1 once for each of the two resulting regions

{(x(i),y(i))}ixj(i)τ\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau} is the sub-multiset of {(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i} composed of pairs for which xjτx_j \le \tau

This flearnf'\subtext{learn} is called recursive binary splitting:

  • it's recursive
  • when recurses, splits the data in two parts (binary)
    • it's a top-down approach: starts from the big problem and makes it smaller (divide and conquer)
  • when stopping recursion, put a node with the most frequent class
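A compact Python sketch of the whole technique, with trees as nested tuples as in the earlier predict sketch (find_best_branch and the stopping criterion are the ones detailed in the next slides; here they are folded in so that the sketch is self-contained):

from collections import Counter

def error(ys):  # the error of the dummy classifier on ys
    return 1 - Counter(ys).most_common(1)[0][1] / len(ys)

def find_best_branch(pairs):
    best = None  # (summed error, j, tau)
    for j in range(len(pairs[0][0])):
        xs = sorted({x[j] for x, _ in pairs})
        for a, b in zip(xs, xs[1:]):  # candidate thresholds: the midpoints
            tau = (a + b) / 2
            le = [y for x, y in pairs if x[j] <= tau]
            gt = [y for x, y in pairs if x[j] > tau]
            e = error(le) + error(gt)
            if best is None or e < best[0]:
                best = (e, j, tau)
    return best  # None if the data cannot be split

def learn(pairs, n_min):  # pairs: a list of (x, y), with x a tuple of numbers
    ys = [y for _, y in pairs]
    branch = None
    if len(ys) > n_min and error(ys) > 0:
        branch = find_best_branch(pairs)
    if branch is None:  # should-stop: terminal node with the most frequent class
        return (Counter(ys).most_common(1)[0][0], None, None)
    _, j, tau = branch
    return ((j, tau),
            learn([(x, y) for (x, y) in pairs if x[j] <= tau], n_min),
            learn([(x, y) for (x, y) in pairs if x[j] > tau], n_min))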
174 / 366

Finding the best branch

Intuitively:

  • consider all variables (i.e., all jj) and all¹ threshold values
  • choose the pair (variable, threshold) that best separates the data
    • i.e., that results in the lowest rate of misclassified examples

In detail (and formally):

function find-best-branch({(x(i),y(i))}i)\text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(j,τ)arg minj,τ(error({y(i)}ixj(i)τ)+error({y(i)}ixj(i)>τ))(j^\star, \tau^\star) \gets \argmin_{j,\tau} \left(\text{error}(\c{1}{\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau}})+\text{error}(\c{1}{\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau}})\right)
return (j,τ)(j^\star, \tau^\star)
}

and

function error({y(i)}i)\text{error}(\seq{y^{(i)}}{i}) { //the error of the dummy classifier on {y(i)}i\seq{y^{(i)}}{i}
yarg maxyi1(y(i)=y)y^\star \gets \argmax_y \sum_i \mathbf{1}(y^{(i)}=y) //yy^\star is the most freq class
return 1ni1(y(i)y)\frac{1}{n} \sum_i \mathbf{1}(y^{(i)} \ne y^\star) //n={y(i)}in=|\seq{y^{(i)}}{i}|
}

Interpretation: if we split the data at this point (i.e., a (j,τ)(j, \tau) pair) and use one dummy classifier on each of the two sides, what would be the resulting error?

This approach is greedy, since it tries to obtain the maximum result (finding the branch), with the minimum effort (using just two dummy classifiers later on):

  • in practice, it makes this learning technique computationally light!
  1. you just need to consider, for each jj-th feature, the midpoints of (xj(i))i(x_j^{(i)})_i: at most nn of them
175 / 366

Deciding when to stop (recursion)

Intuitively:

  • if all the examples belong to the same class, stop
    • splitting would be pointless!
  • or, if the number of examples is very small, stop \approx what we did while human learning
    • no need to bother

In detail (and formally):

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then { //n={y(i)}in=|\seq{y^{(i)}}{i}|
return true\text{true};
}
if error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

Checking the first condition is, in general, cheaper than checking the second condition.

  • only {y(i)}i\seq{y^{(i)}}{i} is needed to decide whether to stop, {x(i)}i\seq{x^{(i)}}{i} is not used!
  • nminn\subtext{min} is a parameter of fshould-stopf\subtext{should-stop}
    • it represents the "very small" criterion
    • it propagates to flearnf'\subtext{learn}, which uses fshould-stopf\subtext{should-stop}
    • (also denoted as kmink\subtext{min})
  • since error()\text{error()} is the classification error done by the dummy classifier, it is =0=0 iff the most frequent class yy^\star is the only class in {y(i)}i\seq{y^{(i)}}{i}
176 / 366

flearnf'\subtext{learn} application example

1st call:
(j,τ)=(1,7)(j,\tau) = (1,7) [plot: candidate split points with the errors of the two dummy classifiers]

1st-l call:
(j,τ)=(1,2)(j,\tau) = (1,2) [plot of candidate split points]

1st-l-l call:
return []\treel{\c{1}{●}}

1st-l-r call:
(j,τ)=(1,4)(j,\tau) = (1,4) [plot of candidate split points]

1st-l-r-l call:
return []\treel{\c{2}{●}}

1st-l-r-r call:
return []\treel{\c{1}{●}}

return [(1,4);[];[]]\tree{(1,4)}{\treel{\c{2}{●}}}{\treel{\c{1}{●}}}
return [(1,2);[];[(1,4);[];[]]]\tree{(1,2)}{\treel{\c{1}{●}}}{\tree{(1,4)}{\treel{\c{2}{●}}}{\treel{\c{1}{●}}}}

1st-r call:
return []\treel{\c{3}{●}}

return [(1,7);[(1,2);[];[(1,4);[];[]]];[]]\tree{(1,7)}{\tree{(1,2)}{\treel{\c{1}{●}}}{\tree{(1,4)}{\treel{\c{2}{●}}}{\treel{\c{1}{●}}}}}{\treel{\c{3}{●}}}

Assume:

  • X=R1=RX=\mathbb{R}^1=\mathbb{R}, Y={,,}Y=\{\c{1}{●},\c{2}{●},\c{3}{●}\}
  • nmin=3n\subtext{min}=3

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
yarg maxyYi1(y(i)=y)y^\star \gets \argmax_{y \in Y} \sum_i \mathbf{1}(y^{(i)}=y)
return node-from(y,,)\text{node-from}(y^\star,\varnothing,\varnothing)
} else {
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}),
learn({(x(i),y(i))}ixj(i)>τ,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min})
)
return tt
}
}

Question: what's the accuracy of this tt on the learning set?

177 / 366

Alternatives for find-best-branch()\text{find-best-branch}()

function find-best-branch({(x(i),y(i))}i)\text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(j,τ)arg minj,τ(error({y(i)}ixj(i)τ)+error({y(i)}ixj(i)>τ))(j^\star, \tau^\star) \gets \argmin_{j,\tau} \left(\c{1}{\text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})}+\c{1}{\text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau})}\right)
return (j,τ)(j^\star, \tau^\star)
}

error({y(i)}i)\text{error}(\seq{y^{(i)}}{i}) is the error the dummy classifier would do on {y(i)}i\seq{y^{(i)}}{i}: error({y(i)}i)=1maxyFr ⁣(y,{y(i)}i)\c{1}{\text{error}(\seq{y^{(i)}}{i})}=1 - \max_y \freq{y, \seq{y^{(i)}}{i}}

Instead of error()\text{error}(), two other variants can be used:

  • Gini index: gini({y(i)}i)=yFr ⁣(y,{y(i)}i)(1Fr ⁣(y,{y(i)}i))\c{1}{\text{gini}(\seq{y^{(i)}}{i})}=\sum_y \freq{y, \seq{y^{(i)}}{i}} \left(1-\freq{y, \seq{y^{(i)}}{i}}\right)
  • Cross entropy: cross-entropy({y(i)}i)=yFr ⁣(y,{y(i)}i)logFr ⁣(y,{y(i)}i)\c{1}{\text{cross-entropy}(\seq{y^{(i)}}{i})}=-\sum_y \freq{y, \seq{y^{(i)}}{i}} \log \freq{y, \seq{y^{(i)}}{i}}

For all:

  • the lower, the better
  • they measure the node impurity, i.e., the amount ee of cases different from the most frequent one among the examples arrived at a certain node
fimpurityf\subtext{impurity}{y(i)}i\seq{y^{(i)}}{i}eR+e \in \mathbb{R}^+
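The three impurity measures, as a Python sketch over a multiset of labels (frequencies as in the formulas above):

import math
from collections import Counter

def frequencies(ys):
    return [c / len(ys) for c in Counter(ys).values()]

def error(ys):
    return 1 - max(frequencies(ys))

def gini(ys):
    return sum(f * (1 - f) for f in frequencies(ys))

def cross_entropy(ys):
    return -sum(f * math.log(f) for f in frequencies(ys))

ys = ["a", "a", "b"]
print(error(ys), gini(ys), cross_entropy(ys))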
178 / 366

Node impurity

function find-best-branch({(x(i),y(i))}i,fimpurity)\text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, \c{1}{f\subtext{impurity}}) {
(j,τ)arg minj,τ(fimpurity({y(i)}ixj(i)τ)+fimpurity({y(i)}ixj(i)>τ))(j^\star, \tau^\star) \gets \argmin_{j,\tau} \left(\c{1}{f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})}+\c{1}{f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau})}\right)
return (j,τ)(j^\star, \tau^\star)
}

The way to measure the node impurity might be a parameter of find-best-branch()\text{find-best-branch}(), but it has been found that Gini is better for learning trees than error.

Gini, error, cross-entropy vs. frequency of the most frequent class

Here, for binary classification:

  • on the xx-axis: the frequency f=Fr ⁣(pos,{y(i)}i)f=\freq{\text{pos}, \seq{y^{(i)}}{i}} of the positive class
    • f=0.5f=0.5 is the worst case
    • f=0f=0 and f=1f=1 are the best cases
  • on the yy-axis: the three impurity indexes

Gini and cross-entropy are smoother than the error.

179 / 366

Alternatives for should-stop()\text{should-stop}()

Original version: (data size)

  • too few examples or
  • no errors

nnminn \le n\subtext{min} or error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0

Alternative 1 (tree depth):

  • node depth greater than τd\tau_d or
  • no errors

requires propagating recursively the depth of the node being currently built

Alternative 2 (node impurity):

  • impurity lower than a τϵ\tau_\epsilon

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
if error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

Impact of the parameter:

  • the lower nminn\subtext{min}, the larger the tree
  • the greater τd\tau_d, the larger the tree
  • the lower τϵ\tau_\epsilon, the larger the tree

(for the same dataset, in general)

180 / 366

Tree learning with probability

Learning technique with probability:

  • flearn:P(X×Y)Mf'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M
  • fpredict:X×MPYf''\subtext{predict}: X \times M \to P_Y
xxmmfpredictf''\subtext{predict}pparg maxyY\argmax\sub{y \in Y}yy

For tree learning:

  • flearn:P(X1××Xp×Y)T({1,,p}×R)PYf'\subtext{learn}: \c{1}{\mathcal{P}^*(X_1 \times \dots \times X_p \times Y)} \to \c{2}{T_{(\{1,\dots,p\}\times\mathbb{R}) \cup P_Y}}
    • given a multivariate dataset, returns a tree in T({1,,p}×R)PYT_{(\{1,\dots,p\}\times\mathbb{R}) \cup P_Y}
  • fpredict:X1××Xp×T({1,,p}×R)PYPYf''\subtext{predict}: \c{1}{X_1 \times \dots \times X_p} \times \c{2}{T_{(\{1,\dots,p\}\times\mathbb{R}) \cup P_Y}} \to \c{3}{P_Y}
    • given a multivariate observation and a tree, returns a discrete probability distribution pPYp \in P_Y

Set of trees T({1,,p}×R)PYT_{\c{1}{(\{1,\dots,p\}\times\mathbb{R})} \cup \c{2}{P_Y}}:

  • L=({1,,p}×R)PYL=\c{1}{(\{1,\dots,p\}\times\mathbb{R})} \cup \c{2}{P_Y} is the set of node labels
  • ({1,,p}×R)\c{1}{(\{1,\dots,p\}\times\mathbb{R})} are branch node labels
  • PY\c{2}{P_Y} are terminal node labels
    • i.e., terminal nodes return discrete probability distributions
181 / 366

flearnf'\subtext{learn} with probability

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)
} else {
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}),
learn({(x(i),y(i))}ixj(i)>τ,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min})
)
return tt
}
}

  • yFr ⁣(y,{y(i)}i)y \mapsto \freq{y, \seq{y^{(i)}}{i}} is a way to specify the concrete function that, given a yYy \in Y returns its frequency Fr ⁣(y,{y(i)}i)[0,1]\freq{y, \seq{y^{(i)}}{i}} \in [0,1]
  • "pp \gets \dots" means "the variable¹ pp takes the value \dots" or "the variable pp becomes \dots"
  • hence, pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}} means "pp becomes the function that maps each yy to its frequency Fr ⁣(y,{y(i)}i)\freq{y, \seq{y^{(i)}}{i}} in {y(i)}i\seq{y^{(i)}}{i}"
  1. here, "variable" as a computer programming term

Before (without probability):

yarg maxyYi1(y(i)=y)y^\star \gets \argmax_{y \in Y} \sum_i \mathbf{1}(y^{(i)}=y)
return node-from(y,,)\text{node-from}(y^\star,\varnothing,\varnothing)

with {y(i)}i\seq{y^{(i)}}{i} being three \c{1}{●}, one \c{2}{●}, and one \c{3}{●},
returns []\treel{\c{1}{●}}

After (with probability):

pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)

with {y(i)}i\seq{y^{(i)}}{i} being three \c{1}{●}, one \c{2}{●}, and one \c{3}{●},
returns [(35,15,15)]\treel{(\c{1}{● \smaller{\frac{3}{5}}}, \c{2}{● \smaller{\frac{1}{5}}}, \c{3}{● \smaller{\frac{1}{5}}})}
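In Python, such a leaf label pp can be a plain dictionary of class frequencies (a sketch, matching the example above):

from collections import Counter

def leaf_probabilities(ys):
    return {y: c / len(ys) for y, c in Counter(ys).items()}

print(leaf_probabilities(["a", "a", "a", "b", "c"]))  # {'a': 0.6, 'b': 0.2, 'c': 0.2}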
182 / 366

fpredictf'\subtext{predict} with probability

fpredict:X×MYf'\subtext{predict}: X \times M \to Y

function predict(x,t)\text{predict}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
plabel-of(t)p \gets \text{label-of}(t)
yarg maxyYp(y)y^\star \gets \argmax_{y \in Y} p(y)
return yy^\star
} else {
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict(x,left-child-of(t))\text{predict}(\vect{x}, \text{left-child-of}(t))
} else {
return predict(x,right-child-of(t))\text{predict}(\vect{x}, \text{right-child-of}(t))
}
}
}

fpredict:X×MPYf''\subtext{predict}: X \times M \to P_Y

function predict-with-prob(x,t)\text{predict-with-prob}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
plabel-of(t)p \gets \text{label-of}(t)
return pp
} else {
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict-with-prob(x,left-child-of(t))\text{predict-with-prob}(\vect{x}, \text{left-child-of}(t))
} else {
return predict-with-prob(x,right-child-of(t))\text{predict-with-prob}(\vect{x}, \text{right-child-of}(t))
}
}
}

Usually, ML software libraries/tools provide a way to access both y^\hat{y} and pp, which are produced out of a single execution.
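For instance, with scikit-learn's decision trees (a sketch; min_samples_leaf plays, roughly, the role of nminn\subtext{min}):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
m = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)
print(m.predict(X[:1]))        # y_hat, i.e., the argmax of p
print(m.predict_proba(X[:1]))  # p, the class frequencies at the reached leaf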

183 / 366

flearnf'\subtext{learn} with probability application example

1st call:
(j,τ)=(1,7)(j,\tau) = (1,7) [plot of candidate split points]

1st-l call:
(j,τ)=(1,2)(j,\tau) = (1,2) [plot of candidate split points]

1st-l-l call:
return [(1)]\treel{(\c{1}{● \smaller{1}})}

1st-l-r call:
(j,τ)=(1,4)(j,\tau) = (1,4) [plot of candidate split points]

1st-l-r-l call:
return [(1)]\treel{(\c{2}{● \smaller{1}})}

1st-l-r-r call:
return [(23,13)]\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}

return [(1,4);[(1)];[(23,13)]]\tree{(1,4)}{\treel{(\c{2}{● \smaller{1}})}}{\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}}
return [(1,2);[(1)];[(1,4);[(1)];[(23,13)]]]\tree{(1,2)}{\treel{(\c{1}{● \smaller{1}})}}{\tree{(1,4)}{\treel{(\c{2}{● \smaller{1}})}}{\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}}}

1st-r call:
return [(1)]\treel{(\c{3}{● \smaller{1}})}

return [(1,7);[(1,2);[(1)];[(1,4);[(1)];[(23,13)]]];[(1)]]\tree{(1,7)}{\tree{(1,2)}{\treel{(\c{1}{● \smaller{1}})}}{\tree{(1,4)}{\treel{(\c{2}{● \smaller{1}})}}{\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}}}}{\treel{(\c{3}{● \smaller{1}})}}

Assume:

  • X=R1=RX=\mathbb{R}^1=\mathbb{R}, Y={,,}Y=\{\c{1}{●},\c{2}{●},\c{3}{●}\}
  • nmin=3n\subtext{min}=3

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)
} else {
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}),
learn({(x(i),y(i))}ixj(i)>τ,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min})
)
return tt
}
}

184 / 366

Let's use the learning technique

If we apply our flearnf'\subtext{learn} to the carousel dataset with nmin=1n\subtext{min}=1 we obtain:

Carousel data

xheightx\subtext{height} vs. 120120\lexagex\subtext{age} vs. 8.9548.954>>xagex\subtext{age} vs. 9.8879.887\le(1)(\c{1}{●\smaller{1}})>>xagex\subtext{age} vs. 9.0029.002\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})\le(1)(\c{2}{●\smaller{1}})>>xagex\subtext{age} vs. 9.499.49\le>>xagex\subtext{age} vs. 9.3069.306(1)(\c{1}{●\smaller{1}})\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})

Question: is this tree ok for you?

hint: recall the other way of assessing a model, w/o the behavior

185 / 366

Tree size

If we compare the tree (i.e., the model) against the attendant's reasoning (i.e., the real system), this tree appears too large!

We can do this, because:

  • trees are inherently inspectable
  • we know (actually, we have a rough idea about) how the real system works

The carousel

xheightx\subtext{height} vs. 120120\lexagex\subtext{age} vs. 8.9548.954>>xagex\subtext{age} vs. 9.8879.887\le(1)(\c{1}{●\smaller{1}})>>xagex\subtext{age} vs. 9.0029.002\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})\le(1)(\c{2}{●\smaller{1}})>>xagex\subtext{age} vs. 9.499.49\le>>xagex\subtext{age} vs. 9.3069.306(1)(\c{1}{●\smaller{1}})\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})
186 / 366

Model complexity

The tree was large because:

  • nminn\subtext{min} was 11, i.e., flearnf'\subtext{learn} had no bounds while learning the tree
  • and, the dataset made flearnf'\subtext{learn} exploit the low value of nminn\subtext{min}
    • i.e., the dataset required a large tree to be modeled completely

In general, almost every kind of model can have different degrees of model complexity.

  • for trees, captured by the size of the tree

Moreover, almost every learning technique has at least one parameter affecting the maximum complexity of the learnable models, often called flexibility:

  • a sort of availability of complexity
  • for trees learned with recursive binary splitting, nminn\subtext{min}

Usually, to obtain a complex model, you should have:

  • a learning technique with great flexibility
  • a dataset requiring flexibility
flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}mmflexibility
187 / 366

This tree complexity: motivation

Why is our tree too complex?

Because of these two points! ●● \rightarrow

What are they?

  • maybe the attendant was distracted
  • maybe they were two "Portoghesi" (Italian idiom for people who sneak in without paying)
  • maybe they were the attendant's kids
    • i.e., the real system is stochastic and we observed a case where the least probable outcome happened
  • maybe the owner wrongly wrote down two observations

More in general: there's some noise in the data!

Carousel data

188 / 366

Fitting the noise?

x \to s \to y \to (+ \text{noise}) \to y'

In practice, we often don't have a noise-free dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}, but have instead a dataset {(x(i),y(i))}i\seq{(x^{(i)},y'^{(i)})}{i} with some noise, i.e., we have the yy' instead of the yy:

  • errors in data collection
  • ss being stochastic and having produced unlikely behaviors

However, our goal is to model ss, not s+s+ noise!

189 / 366

Overfitting

When we have a noisy dataset (potentially always) and we allow for large complexity, by setting a flexibility parameter to a high value, the learning technique fits the noisy data {(x(i),y(i))}i\seq{(x^{(i)},y'^{(i)})}{i} instead of fitting the real system ss, that is, overfitting occurs.

Snake and elephant from Il Piccolo Principe Image from "Il piccolo principe"

Overfits = "fits too much", hence making apparent also those artifacts that are not part of the object being wrapped

  • the model: the snake skin
  • the real system: the snake body
  • the (exaggerated) artifact: the elephant...
190 / 366

Underfitting

When instead we do not allow for enough complexity to model a complex real system, by setting a flexibility parameter to low flexibility, the learning technique fits neither the data nor the system, that is, underfitting occurs.

T-rex in a cardboard box

Underfits = "doesn't fit enough", hence proper characteristics of the object being wrapped are not captured

  • the model: the cardboard box
  • the real system: the T-rex
  • the uncaptured characteristics: everything of the T-rex...
191 / 366

Overfitting/underfitting with trees

In flearnf'\subtext{learn}, nminn\subtext{min} represents the flexibility:

  • the greater nminn\subtext{min}, the lower the flexibility

Extreme values:

  • nmin=1n\subtext{min}=1 \rightarrow maximum flexibility
    • the tree will always be as large as it has to be to perfectly¹ model the dataset
  • nmin=+n\subtext{min}=+\infty \rightarrow minimum, i.e., no flexibility
    • the tree will be the smallest possible
  1. Always perfectly? Give a counterexample.
192 / 366

Carousel data

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)
} else {
...
}
}

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
...
return false\text{false}
}

The learned tree is a dummy classifier (with probability):

(59103,44103)(\c{1}{\text{●}\smaller{\frac{59}{103}}}, \c{2}{\text{●}\smaller{\frac{44}{103}}})

t=[(59103,44103)]t=\treel{(\c{1}{\text{●}\smaller{\frac{59}{103}}}, \c{2}{\text{●}\smaller{\frac{44}{103}}})}

tt does not attempt to model the dependency between xx and yy, because its complexity budget is completely exhausted by the single leaf node
193 / 366

Bias and variance

As an alternative name for underfitting, we say that a learning technique exhibits high bias:

  • because it tends to generate models that incorporate a bias towards some yy values, regardless of the xx, i.e., models that fail in capturing the xx-yy dependency
    • as extreme case, the dummy classifier completely disregards the xx
194 / 366

Bias and variance

As an alternative name for underfitting, we say that a learning technique exhibits high bias:

  • because it tends to generate models that incorporate a bias towards some yy values, regardless of the xx, i.e., models that fail in capturing the xx-yy dependency
    • as extreme case, the dummy classifier completely disregards the xx

As an alternative name for overfitting, we say that a learning technique exhibits high variance:

  • because, if we repeat the learning with different datasets coming from the same real system, we obtain different models; this is bad, because they should be the same, since they model the same system
194 / 366

Spotting underfitting/overfitting

In principle:

  1. observe the model
  2. observe the system
  3. compare their complexity:
    • if the model is too simple with respect to the system, that's underfitting
    • if the model is too complex with respect to the system, that's overfitting
195 / 366

Spotting underfitting/overfitting

In principle:

  1. observe the model
  2. observe the system
  3. compare their complexity:
    • if the model is too simple with respect to the system, that's underfitting
    • if the model is too complex with respect to the system, that's overfitting

In practice, this is often (i.e., almost always) unfeasible:

  • you don't know the system complexity
  • you cannot observe the system internals (or the system itself)
  • sometimes, you cannot observe the model internals
195 / 366

Spotting underfitting/overfitting with data

With too low flexibility (here with error):

  • the model cannot capture system characteristics that are also in the learning data
    • \Rightarrow both errors are high
  • increasing the flexibility decreases both errors

With too large flexibility:

  • the model captures also data artifacts (i.e., noise)
    • \Rightarrow learning error is low because noise is modeled and used to assess the model itself
    • \Rightarrow test error is large because the model describes characteristics that are not proper to the real system and hence not visible in data different from the learning data
  • increasing the flexibility decreases the learning error and increases the test error

Here, overfitting starts with flexibility 0.62\ge 0.62

  • not a real parameter...

Learning and test error vs. flexibility

Practical procedure (sketched in R below):

  1. consider several values of the flexibility parameter
  2. for each value of the flexibility parameter
    1. learn a model
    2. measure¹ its effectiveness² on the learning data
    3. measure¹ its effectiveness² on the test data
  1. with 80/20 static split, CV, ...
  2. with error, accuracy, AUC, ...
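
A minimal R sketch of this procedure (assumptions: the rpart package, where minsplit plays roughly the role of n\subtext{min} and cp=0 disables other pruning; iris data and an 80/20 static split, just for illustration):

library(rpart)

set.seed(1)
d <- iris[sample(nrow(iris)), ]           # shuffle the dataset
i <- 1:round(0.8 * nrow(d))               # 80/20 static split
err <- function(t, d) mean(predict(t, d, type = "class") != d$Species)
for (n.min in c(2, 5, 10, 25, 50)) {      # candidate flexibility values
  t <- rpart(Species ~ ., data = d[i, ],
             control = rpart.control(minsplit = n.min, cp = 0))
  cat(sprintf("n_min=%2d learning err=%.3f test err=%.3f\n",
              n.min, err(t, d[i, ]), err(t, d[-i, ])))
}

Plotting the two errors against the flexibility gives curves like those in the figure: the test error typically stops decreasing, and starts increasing, where overfitting begins.
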
196 / 366

How to choose the proper flexibility?

More in general, how to choose a good value for one or more parameters of the learning technique?

Assumption: "good" means "the one that corresponds to the greatest effectiveness".

From another point of view, we have kk slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:

  • that is, we do a comparison among learning techniques
197 / 366

How to choose the proper flexibility?

More in general, how to choose a good value for one or more parameters of the learning technique?

Assumption: "good" means "the one that corresponds to the greatest effectiveness".

From another point of view, we have kk slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:

  • that is, we do a comparison among learning techniques

In practice:

  • choose the kk candidate parameter values (e.g., nmin=1,2,3,,10n\subtext{min}=1,2,3,\dots,10)
  • choose a suitable effectiveness index (e.g., AUC, accuracy, ...)
  • choose a suitable learning/test division method (e.g., 10-fold CV)
  • for each of the kk values, measure the index, take the one corresponding to the best value
197 / 366

How to choose the proper flexibility?

More in general, how to choose a good value for one or more parameters of the learning technique?

Assumption: "good" means "the one that corresponds to the greatest effectiveness".

From another point of view, we have kk slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:

  • that is, we do a comparison among learning techniques

In practice:

  • choose the kk candidate parameter values (e.g., nmin=1,2,3,,10n\subtext{min}=1,2,3,\dots,10)
  • choose a suitable effectiveness index (e.g., AUC, accuracy, ...)
  • choose a suitable learning/test division method (e.g., 10-fold CV)
  • for each of the kk values, measure the index, take the one corresponding to the best value

This procedure applies to parameters in general, not just to those affecting flexibility;

  • and possibly to indexes related to efficiency, rather than just effectiveness
197 / 366

Hyperparameter tuning

Given a learning technique with hh parameters p1,,php_1,\dots,p_h, each pjp_j defined in its domain PjP_j, hyperparameter tuning is the task of finding the tuple p1,,php^\star_1,\dots,p^\star_h that corresponds to the best effectiveness of the learning technique.

[Diagram: f'\subtext{learn} takes the dataset \seq{(x^{(i)},y^{(i)})}{i} and the hyperparameters p_1,\dots,p_h, and outputs the model m]
198 / 366

Hyperparameter tuning

Given a learning technique with hh parameters p1,,php_1,\dots,p_h, each pjp_j defined in its domain PjP_j, hyperparameter tuning is the task of finding the tuple p1,,php^\star_1,\dots,p^\star_h that corresponds to the best effectiveness of the learning technique.

[Diagram: f'\subtext{learn} takes the dataset \seq{(x^{(i)},y^{(i)})}{i} and the hyperparameters p_1,\dots,p_h, and outputs the model m]

p_1,\dots,p_h are called hyperparameters, rather than just parameters, because in some communities and for some learning techniques the model itself is defined by one or more (often numerical) parameters;

  • this distinction does not fit the case of trees well

It's called tuning because we slightly change the hyperparameter values until we are happy with the results.

198 / 366

Hyperparameter tuning

Given a learning technique with hh parameters p1,,php_1,\dots,p_h, each pjp_j defined in its domain PjP_j, hyperparameter tuning is the task of finding the tuple p1,,php^\star_1,\dots,p^\star_h that corresponds to the best effectiveness of the learning technique.

[Diagram: f'\subtext{learn} takes the dataset \seq{(x^{(i)},y^{(i)})}{i} and the hyperparameters p_1,\dots,p_h, and outputs the model m]

p_1,\dots,p_h are called hyperparameters, rather than just parameters, because in some communities and for some learning techniques the model itself is defined by one or more (often numerical) parameters;

  • this distinction does not fit the case of trees well

It's called tuning because we slightly change the hyperparameter values until we are happy with the results.

Hyperparameter tuning is a form of optimization, since we are searching the space P_1 \times \dots \times P_h for the tuple giving the best, i.e., \approx optimal, effectiveness:

  • since it automates part of the design of an ML system, hyperparameter tuning may be considered a simple form of AutoML
198 / 366

A simple form of hyperparameter tuning (see the R sketch after the remarks):

  1. for each jj-th parameter, choose a small set of PjPjP'_j \subseteq P_j values
  2. choose a suitable effectiveness index
  3. choose a suitable learning/test division method
  4. consider all the tuples resulting from the cartesian product P1××PhP'_1 \times \dots \times P'_h (i.e., the grid)
  5. take the best hyperparameters p1,,php^\star_1,\dots,p^\star_h such that: (p1,,ph)=arg max(p1,,ph)P1××Phflearn-effect(flearn(,p1,,ph),fpredict,D)(p^\star_1,\dots,p^\star_h)=\argmax_{(p_1,\dots,p_h) \in P'_1 \times \dots \times P'_h} \c{1}{f\subtext{learn-effect}}(\c{2}{f'\subtext{learn}(\cdot,p_1,\dots,p_h),f'\subtext{predict}},D)

Remarks:

  • flearn-effectf\subtext{learn-effect} is the chosen assessment method measuring the chosen (step 2) effectiveness index with the chosen (step 3) learning/test division: it takes a learning technique and a dataset DD
    • flearn(,p1,,ph),fpredictf'\subtext{learn}(\cdot,p^\star_1,\dots,p^\star_h),f'\subtext{predict} is the learning technique; flearn(,p1,,ph)f'\subtext{learn}(\c{3}{\cdot},\c{4}{p_1,\dots,p_h}) is the learning function with fixed hyperparameters p1,,php_1,\dots,p_h and variable dataset \cdot
  • to be feasible, P1××PhP'_1 \times \dots \times P'_h must be small!
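
A hedged R sketch of this grid search (assumptions: rpart as learning technique, whose split parameter offers only gini and information, the latter being \approx cross-entropy; accuracy as effectiveness index; 10-fold CV as learning/test division; all names are illustrative):

library(rpart)

set.seed(1)
cv.accuracy <- function(d, n.min, split.name, k = 10) {   # f_learn-effect
  fold <- sample(rep(1:k, length.out = nrow(d)))
  mean(sapply(1:k, function(j) {
    t <- rpart(Species ~ ., data = d[fold != j, ],
               parms = list(split = split.name),
               control = rpart.control(minsplit = n.min, cp = 0))
    mean(predict(t, d[fold == j, ], type = "class") == d$Species[fold == j])
  }))
}

grid <- expand.grid(n.min = c(1, 2, 5, 10, 25),           # P'_1 x P'_2 (the grid)
                    split.name = c("gini", "information"),
                    stringsAsFactors = FALSE)
grid$acc <- mapply(cv.accuracy, n.min = grid$n.min,
                   split.name = grid$split.name, MoreArgs = list(d = iris))
grid[which.max(grid$acc), ]                               # (p*_1, p*_2)
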
199 / 366

Grid search with the trees

Consider the flearnf'\subtext{learn} for trees and these two hyperparameters:

  • nmin=p1N=P1n\subtext{min} = p_1 \in \mathbb{N} = P_1
  • pimpurity=p2{error,Gini,cross-entropy}p\subtext{impurity} = p_2 \in \{\text{error}, \text{Gini}, \text{cross-entropy}\}

Let's do hyperparameter tuning with grid search (assuming D=n=1000|D|=n=1000):

200 / 366

Grid search with the trees

Consider the flearnf'\subtext{learn} for trees and these two hyperparameters:

  • nmin=p1N=P1n\subtext{min} = p_1 \in \mathbb{N} = P_1
  • pimpurity=p2{error,Gini,cross-entropy}p\subtext{impurity} = p_2 \in \{\text{error}, \text{Gini}, \text{cross-entropy}\}

Let's do hyperparameter tuning with grid search (assuming D=n=1000|D|=n=1000):

  1. P1={1,2,5,10,25}P'_1=\{1,2,5,10,25\}¹ and P2=P2P'_2=P_2
  2. AUC (with midpoints)
  3. 10-fold CV
  4. grid size of 5×3=155 \times 3 = 15
  5. ...
  1. for each jj-th parameter, choose a small set of PjPjP'_j \subseteq P_j values
  2. choose a suitable effectiveness index
  3. choose a suitable learning/test division method
  4. consider all the tuples resulting from the cartesian product P1××PhP'_1 \times \dots \times P'_h (i.e., the grid)
  5. take the best hyperparameters p1,,php^\star_1,\dots,p^\star_h

Questions

  • how many times is flearnf'\subtext{learn} invoked? without considering recursive invocations
  • how many times is fpredictf''\subtext{predict} invoked?
  1. must be chosen considering the size nn of the dataset
200 / 366

Hyperparameter-free learning

Can't we just always do grid search for hyperparameter tuning?

Pros:

  • no need to manually choose the values of the parameters
  • hopefully chosen parameters are better than "default" values (if any) \rightarrow better effectiveness

Cons:

  • computationally expensive (\propto grid size) \rightarrow worse efficiency
  • depends on a dataset, must be checked for generalization ability
  • suitable "ranges" of values for each hyperparameter have still to be set manually
    • but default ranges are often ok
201 / 366

Hyperparameter-free learning

Can't we just always do grid search for hyperparameter tuning?

Pros:

  • no need to manually choose the values of the parameters
  • hopefully chosen parameters are better than "default" values (if any) \rightarrow better effectiveness

Cons:

  • computationally expensive (\propto grid size) \rightarrow worse efficiency
  • depends on a dataset, must be checked for generalization ability
  • suitable "ranges" of values for each hyperparameter have still to be set manually
    • but default ranges are often ok

If you do it, you can transform any learning tech. w/ params into a learning tech. w/o params:

[Diagrams: f'\subtext{learn} with the dataset and p_1,\dots,p_h giving m; the same f'\subtext{learn} wrapped in grid search, which takes the dataset and \seq{P'_j}{j}, finds \seq{p^\star_j}{j}, and outputs m]
201 / 366

Hyperparameter-free learning

[Diagram: grid search wraps f'\subtext{learn}: from the dataset and \seq{P'_j}{j} it finds \seq{p^\star_j}{j}, then f'\subtext{learn} gives the final m]

function learn-free(D)\text{learn-free}(D) {
flearn,fpredictf'\subtext{learn}, f'\subtext{predict} \gets \dots
P1,,PhP'_1,\dots,P'_h \gets \dots
flearn-effectf\subtext{learn-effect} \gets \dots
p1,,php^\star_1,\dots,p^\star_h \gets \varnothing
vmax,effectv_{\text{max},\text{effect}} \gets -\infty
foreach p1,,phP1××Php_1,\dots,p_h \in P'_1\times \dots\times P'_h {
veffectflearn-effect(flearn(,p1,,ph),fpredict,D)v\subtext{effect} \gets f\subtext{learn-effect}(f'\subtext{learn}(\cdot,p_1,\dots,p_h),f'\subtext{predict},D)
if veffectvmax,effectv\subtext{effect} \ge v_{\text{max},\text{effect}} then {
vmax,effectveffectv_{\text{max},\text{effect}} \gets v\subtext{effect}
p1,,php1,,php^\star_1,\dots,p^\star_h \gets p_1,\dots,p_h
}
}
return flearn(D,p1,,ph)f'\subtext{learn}(D,p^\star_1,\dots,p^\star_h)
}

  1. for each jj-th parameter, choose a small set of PjPjP'_j \subseteq P_j values
  2. choose a suitable effectiveness index
  3. choose a suitable learning/test division method
  4. consider all the tuples resulting from the cartesian product P1××PhP'_1 \times \dots \times P'_h
  5. take the best hyperparameters p1,,php^\star_1,\dots,p^\star_h
    • i.e., arg max\argmax
  6. learn a model on the full dataset with the best parameters found
202 / 366

Hyperparameter-free tree learning exercise

Consider the flearnf'\subtext{learn} for trees and these two hyperparameters:

  • nmin=p1N=P1n\subtext{min} = p_1 \in \mathbb{N} = P_1
  • pimpurity=p2{error,Gini,cross-entropy}p\subtext{impurity} = p_2 \in \{\text{error}, \text{Gini}, \text{cross-entropy}\}

Consider the improved, hyperparameter-free version of flearnf'\subtext{learn} called flearn-freef'\subtext{learn-free}:

  • with accuracy and 10-fold CV
  • with P1=10|P'_1|=10 and P2=P2=3|P'_2|=|P_2|=3

Suppose you want to compare it against the plain version (with nmin=10n\subtext{min}=10 and pimpurity=Ginip\subtext{impurity}=\text{Gini}):

  • with AUC (midpoints) and 10-fold CV
  • using a dataset D=n=1000|D|=n=1000.

Questions

  • what phases of the ML design process are we doing?
  • how many times is flearn-freef'\subtext{learn-free} invoked?
  • how many times is flearnf'\subtext{learn} invoked? without considering recursive invocations
  • how many times is fpredictf''\subtext{predict} invoked? assuming fpredictf''\subtext{predict} is invoked internally by fpredictf'\subtext{predict}
  • how many times is fpredictf'\subtext{predict} invoked?
203 / 366

Categorical independent variables and regression

204 / 366

Applicability of flearnf'\subtext{learn}

Up to now, the flearnf'\subtext{learn} for trees (i.e., recursive binary splitting) was defined¹ as: flearn:P(X1××Xp×Y)T({1,,p}×R)Yf'\subtext{learn}: \mathcal{P}^*(X_1 \times \dots \times X_p \times Y) \to T_{(\{1,\dots,p\}\times \mathbb{R}) \cup Y} with:

  • each XjRX_j \subseteq \mathbb{R}, i.e., with each independent variable being numerical
  • YY finite and without ordering, i.e., with the dependent variable being categorical

These constraints were needed because:

  • the branch nodes contain conditions in the form xjτx_j \le \tau, hence an order relation has to be defined in XjX_j; R\mathbb{R} meets this requirement
  • the leaf nodes contain a class label yy

Can we remove these constraints?

  1. here we have the version without probability; with the version with probability, the codomain of f'\subtext{learn} is T_{(\{1,\dots,p\}\times \mathbb{R}) \cup P_Y}
205 / 366

Trees on categorical independent variables

With numerical variables (xjRx_j \in \mathbb{R}):

With find-best-branch(), we find (the index j of) a variable x_j and a threshold value \tau that well separate the data, i.e., we split the data in:

  • observations such that xjτx_j \le \tau
  • observations such that xj>τx_j > \tau

No other cases exist: it's a binary split.

Example

xage[0,120]x\subtext{age} \in [0,120]

[Tree diagram: branch node x\subtext{age} vs. 10 with \le and > edges leading to further branch nodes x_{\dots} vs. \dots]

With categorical variables (xjXjx_j \in X_j):

With find-best-branch(), we find (the index j of) a variable x_j and a set of values X'_j \subset X_j that well separate the data, i.e., we split the data in:

  • observations such that xjXjx_j \in X'_j
  • observations such that xj∉Xjx_j \not\in X'_j

No other cases exist: it's a binary split.

Example

xcity{Ts,Ud,Ve,Pn,Go}x\subtext{city} \in \{\text{Ts},\text{Ud},\text{Ve},\text{Pn},\text{Go}\}

[Tree diagram: branch node x\subtext{city} vs. \{Ts, Ve\} with \in and \not\in edges leading to further branch nodes x_{\dots} vs. \dots]
206 / 366

Efficiency with categorical variables

For a given numerical variable xjRx_j \in \mathbb{R}, we choose τ\tau^\star such that: τ=arg minτR(fimpurity({y(i)}ixj(i)τ)+fimpurity({y(i)}ixj(i)>τ))\tau^\star = \argmin_{\c{1}{\tau \in \mathbb{R}}} \left(f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})+f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau})\right) In practice, we search the set of midpoints rather than the entire R\mathbb{R}: there are n1n-1 midpoints in a dataset with nn elements.

Even better, we can consider only the midpoints between consecutive values x_j^{(i_1)}, x_j^{(i_2)} for which the labels are different, i.e., y^{(i_1)} \ne y^{(i_2)}

For a given categorical variable xjXjx_j \in X_j, we choose XjXjX^\star_j \subset X_j such that: Xj=arg minXjP(Xj)(fimpurity({y(i)}ixj(i)Xj)+fimpurity({y(i)}ixj(i)∉Xj))X^\star_j = \argmin_{\c{1}{X'_j \in \mathcal{P}(X_j)}} \left(f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \in X'_j})+f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \not\in X'_j})\right) We search the set P(Xj)\mathcal{P}(X_j) of subsets (i.e., the powerset) of XjX_j, which has 2Xj2^{|X_j|} values.
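
A sketch of the powerset search for one categorical variable, with Gini as f\subtext{impurity} (the subsets are enumerated through the bits of an integer; it assumes every level of the factor occurs in the data; names are illustrative):

gini <- function(y) { f <- table(y) / length(y); sum(f * (1 - f)) }

best.subset <- function(x.j, y) {          # x.j: a factor; y: the labels
  ls <- levels(x.j)
  best <- list(subset = NULL, impurity = Inf)
  for (b in 1:(2^length(ls) - 2)) {        # skip the empty and the full subset
    s <- ls[as.logical(bitwAnd(b, 2^(seq_along(ls) - 1)))]
    v <- gini(y[x.j %in% s]) + gini(y[!(x.j %in% s)])
    if (v < best$impurity) best <- list(subset = s, impurity = v)
  }
  best
}

The loop body runs 2^{|X_j|}-2 times: feasible for variables with few levels, quickly unfeasible as |X_j| grows.
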

207 / 366

Trees with both kinds of variables

Assume a problem with X=X1××Xpnum×Xpnum+1××Xpnum+pcatX = \c{1}{X_1 \times \dots \times X_{p\subtext{num}}} \times \c{2}{X_{p\subtext{num}+1} \times \dots \times X_{p\subtext{num}+p\subtext{cat}}}, i.e.:

  • pnump\subtext{num} numerical variables
  • pcatp\subtext{cat} categorical variables

The labels of the tree nodes can be:

  • class labels yYy \in \c{3}{Y} or discrete probability distribution pPYp \in \c{3}{P_Y} (terminal nodes)
  • branch conditions {1,,pnum}×R\c{1}{\{1,\dots,p\subtext{num}\} \times \mathbb{R}} for numerical variables (non-terminal nodes)
  • branch conditions j=pnum+1j=pnum+pcat{j}×P(Xj)\c{2}{\bigcup_{j=p\subtext{num}+1}^{j=p\subtext{num}+p\subtext{cat}} \{j\} \times \mathcal{P}(X_j)} for categorical variables (non-terminal nodes)
    • i.e., each variable with its corresponding powerset of possible values

So the model is a tt \in:

  • T{1,,pnum}×R    j=pnum+1j=pnum+pcat{j}×P(Xj)    YT_{\c{1}{\{1,\dots,p\subtext{num}\} \times \mathbb{R}} \; \cup \; \c{2}{\bigcup_{j=p\subtext{num}+1}^{j=p\subtext{num}+p\subtext{cat}} \{j\} \times \mathcal{P}(X_j)} \; \cup \; \c{3}{Y}}, without probability
  • or T{1,,pnum}×R    j=pnum+1j=pnum+pcat{j}×P(Xj)    PYT_{\c{1}{\{1,\dots,p\subtext{num}\} \times \mathbb{R}} \; \cup \; \c{2}{\bigcup_{j=p\subtext{num}+1}^{j=p\subtext{num}+p\subtext{cat}} \{j\} \times \mathcal{P}(X_j)} \; \cup \; \c{3}{P_Y}}, with probability
208 / 366

Regression trees

Recursive binary splitting may be used for regression: the learned trees are called regression trees.

Required changes:

  • in flearnf'\subtext{learn}, when should-stop()\text{should-stop}() is met, "most frequent class label" does not make sense anymore
    • because we have numbers, not classes
  • in find-best-branch()\text{find-best-branch}(), minimizing the error()\text{error}() does not make sense anymore (same for gini()\text{gini}() and cross-entropy()\text{cross-entropy}())
    • because these indexes are for categorical values, not numbers
  • in should-stop()\text{should-stop}(), checking if error()=0\text{error}()=0 does not make sense anymore
    • because (classification) error is for categorical values, not numbers
209 / 366

Terminal node labels

In flearnf'\subtext{learn}, when should-stop()\text{should-stop}() is met, "most frequent class label" does not make sense anymore.

Solution: use the mean y\overline{y}.

Classification

The terminal node label is the most frequent class: y=arg maxyYFr ⁣(y,{y(i)}i)y^\star=\argmax_{y \in Y} \freq{y,\seq{y^{(i)}}{i}}

If you have to choose just one yy, yy^\star is the one that minimizes the classification error.

Regression

The terminal node label is the mean yy value: y=1niy(i)=yy^\star=\frac{1}{n} \sum_i y^{(i)}=\overline{y}

If you have to choose just one yy, yy^\star is the one that minimizes the MSE.

210 / 366

Terminal node labels

In flearnf'\subtext{learn}, when should-stop()\text{should-stop}() is met, "most frequent class label" does not make sense anymore.

Solution: use the mean y\overline{y}.

Classification

The terminal node label is the most frequent class: y=arg maxyYFr ⁣(y,{y(i)}i)y^\star=\argmax_{y \in Y} \freq{y,\seq{y^{(i)}}{i}}

If you have to choose just one yy, yy^\star is the one that minimizes the classification error.

Regression

The terminal node label is the mean yy value: y=1niy(i)=yy^\star=\frac{1}{n} \sum_i y^{(i)}=\overline{y}

If you have to choose just one yy, yy^\star is the one that minimizes the MSE.

Indeed, a dummy regressor that always predicts the mean value \overline{y} should be considered a baseline for regression, just like the dummy classifier is a baseline for classification (see the R lines below):

  • if you want to do a prediction without using the xx, then y\overline{y} is the best you can do (on the learning dataset)
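
The two labels, and the corresponding dummy baselines, in a few lines of base R (iris, just for illustration):

y.cls <- iris$Species
names(which.max(table(y.cls)))    # most frequent class: the dummy classifier label
                                  # (ties broken by order: iris is 50/50/50)
y.reg <- iris$Sepal.Length
mean(y.reg)                       # mean: the dummy regressor label
mean((y.reg - mean(y.reg))^2)     # its MSE on the learning data: the baseline
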
210 / 366

Finding the best branch

In find-best-branch()\text{find-best-branch}(), minimizing the error()\text{error}() does not make sense anymore (same for gini()\text{gini}() and cross-entropy()\text{cross-entropy}()).

Solution: use the residual sum of squares (RSS).

Classification

The branch is chosen for which the sum of the impurity on the two sides is the lowest: (j,τ)arg minj,τ(error({y(i)}ixj(i)τ)+error({y(i)}ixj(i)>τ))\c{1}{\begin{align*} (j^\star, \tau^\star) \gets \argmin_{j,\tau} ( &\text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})+\\ & \text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau}))\end{align*}} similarly, for categorical variables

Regression

The branch is chosen for which the sum of the RSS on the two sides is the lowest: (j,τ)arg minj,τ(RSS({y(i)}ixj(i)τ)+RSS({y(i)}ixj(i)>τ))\c{1}{\begin{align*} (j^\star, \tau^\star) \gets \argmin_{j,\tau} ( &\text{RSS}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})+\\ & \text{RSS}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau}))\end{align*}} where: RSS({y(i)}i)=i(y(i)y)2\text{RSS}(\seq{y^{(i)}}{i}) = \sum_i \left(y^{(i)}-\overline{y}\right)^2

similarly, for categorical variables; RSS()=nMSE()\text{RSS}(\cdot) = n \text{MSE}(\cdot)
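
A sketch of the regression version of find-best-branch() for one numerical variable (illustrative names: the midpoints are searched and the sum of the two RSS terms is minimized):

rss <- function(y) sum((y - mean(y))^2)

best.split <- function(x.j, y) {
  s <- sort(unique(x.j))
  taus <- (s[-1] + s[-length(s)]) / 2      # the (at most n-1) midpoints
  vs <- sapply(taus, function(tau) rss(y[x.j <= tau]) + rss(y[x.j > tau]))
  list(tau = taus[which.min(vs)], rss = min(vs))
}

best.split(iris$Petal.Length, iris$Sepal.Length)   # e.g., on two iris variables
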

211 / 366

Stopping criterion

In should-stop()\text{should-stop}(), checking if error()=0\text{error}()=0 does not make sense anymore.

Solution: just use RSS.

Classification

Stop if nnminn\le n\subtext{min} or error()=0\text{error}()=0.

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
if error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

Regression

Stop if nnminn\le n\subtext{min} or RSS()=0\text{RSS}()=0.

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
if RSS({y(i)}i)=0\text{RSS}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

In practice, the condition \text{RSS}()=0 holds far less frequently than the condition \text{error}()=0.

212 / 366

Visualizing the model

With few variables, p2p\le 2 for classification, p=1p=1 for regression, the model can be visualized.

Classification

Classifier on carousel

The colored regions are the model. The border(s) between regions with different colors (i.e., different decisions) is the decision boundary.

Regression

Regressor example

The line is the model.

Question: can you draw the tree for this model?

213 / 366

Overfitting with regression trees

Example of regression trees with different complexities

image from Fabio Daolio

Questions

  • what's the problem size (nn and pp)?
  • what's the model complexity?
  • how is the real system made?
214 / 366

Tree learning: brief recap

215 / 366

Summary

Applicability 👍👍

  • 👍 YY: both regression and classification (binary and multiclass)
  • 👍 XX: multivariate XX with both numerical and categorical variables
  • 👍 models give probability¹
  • 🫳³ learning technique has one single parameter

Efficiency 👍

  • 👍 in practice, pretty fast both in learning and prediction phase

Explainability/interpretability 👍👍👍

  • 👍 the models can be easily² visualized (global explainability)
  • 👍 the decisions can be analyzed (local explainability)
  • 👍 the learning technique is itself comprehensible
    • you should be able to implement it by yourself
  1. for classification; if nmin=1n\subtext{min}=1, it's always 100%100\%
  2. if they are small enough...
  3. 1 is better than >1>1, but worse than parameter-free, so 🫳
216 / 366

Summary

Applicability 👍👍

  • 👍 YY: both regression and classification (binary and multiclass)
  • 👍 XX: multivariate XX with both numerical and categorical variables
  • 👍 models give probability¹
  • 🫳³ learning technique has one single parameter

Efficiency 👍

  • 👍 in practice, pretty fast both in learning and prediction phase

Explainability/interpretability 👍👍👍

  • 👍 the models can be easily² visualized (global explainability)
  • 👍 the decisions can be analyzed (local explainability)
  • 👍 the learning technique is itself comprehensible
    • you should be able to implement it by yourself
  1. for classification; if nmin=1n\subtext{min}=1, it's always 100%100\%
  2. if they are small enough...
  3. 1 is better than >1>1, but worse than parameter-free, so 🫳

So, why are we not using trees for/in every ML system?

216 / 366

Decision tree effectiveness

Example of regression trees with different complexities image from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013

The effectiveness depends on the problem and may be limited by the fact that branch nodes consider one variable at a time.

The decision boundary of the model is hence constrained to be locally parallel to one of the axes:

  • may be a limitation or not, depending on the problem
  • makes find-best-branch()\text{find-best-branch()} computationally feasible
    • because the search space is small
    • because computing the error of the dummy classifier is fast (greedy)

There exist oblique decision trees, which should overcome this limitation.

217 / 366

Towards the first lab

Software for ML

218 / 366

Implementing ML systems

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement):
    • define XX and YY
    • define a way for assessing solutions
      • before designing!
      • applicable to any compatible ML solution
  4. Design the ML system
    • choose a learning technique
    • choose/design pre- and post-processing steps
  5. Implement the ML system
    • learning/prediction phases
    • obtain the data
  6. Assess the ML system

Actual execution of:

  • pre-processing
  • learning
  • prediction
  • assessment

is not made by hand, but by a computer that executes some software.

219 / 366

Software for ML

Nowadays, there are many options.

A few:

  • libraries for general purpose languages:
  • specialized software environments:
  • a software written from scratch

And many others.

How to choose an ML software?

Possible criteria:

  • platform constraints
  • degree of data pre/post-processing
  • production/prototype
  • documentation availability
  • community size
  • your previous familiarity/knowledge/skills
220 / 366

Interface

In general, the ML software provides an interface that models the key concepts of learning (flearnf'\subtext{learn}) and prediction (fpredictf'\subtext{predict}) phases and the one of the model.

Example (Java+SMILE):

DataFrame dataFrame = ...
RandomForest classifier = RandomForest.fit(Formula.lhs("label"), dataFrame);
Tuple observation = ...;
int predictedLabel = classifier.predict(observation);

Example (R):

library(randomForest)
d = ...
classifier = randomForest(label~., d)
newD = ...
newLabels = predict(classifier, newD)
221 / 366

A (very) brief Introduction to R

222 / 366

What is R?

R is:

  • a programming language
  • a software environment with a text-based interactive UI (a console)

RStudio is:

  • an IDE¹ built around R
  • also for making notebooks, like in Python
  1. integrated development environment

Some R resources:

  • language documentation
  • packages documentation
    • for all: Comprehensive R Archive Network (CRAN)
    • for "biggest" packages: their own site
  • help from online communities
223 / 366

RStudio appearance

RStudio appearance

224 / 366

RStudio appearance with a notebook

RStudio appearance with a notebook

225 / 366

An R notebook on Google Colab

Colab appearance with a notebook

226 / 366

Data types

There are some built-in data types.

Basic:

  • numeric
  • character (i.e., strings)
  • logical (i.e., Booleans)
  • factor (i.e., categorical)
  • function
  • formula

Composed:

  • vector
  • matrix
  • data frame
  • list

R is not strongly typed: there are (some) implicit conversions.

227 / 366

Data types

There are some built-in data types.

Basic:

  • numeric
  • character (i.e., strings)
  • logical (i.e., Booleans)
  • factor (i.e., categorical)
  • function
  • formula

Composed:

  • vector
  • matrix
  • data frame
  • list

R is not strongly typed: there are (some) implicit conversions.

A peculiar data type is formula:

  • it describes a dependency
  • literals specify dependent and independent variables, e.g.:
    • decision~age+height
    • Species~. (. means "every other variable")
227 / 366

Assigning values

> a=3
> a
[1] 3
> v=c(1,2,3)
> v
[1] 1 2 3
> d=as.data.frame(cbind(age=c(20,21,21)))
> d$gender=factor(c("m","m","f"))
> d
age gender
1 20 m
2 21 m
3 21 f
> levels(d$gender)
[1] "f" "m"
> dep=salary~degree.level+age
> dep
salary ~ degree.level + age
> f = function(x) {x+3}
> f(2)
[1] 5
  • a is a numeric
  • v is a vector of numeric
  • d is a data frame
  • dep is a formula
  • f is a function
  • cbind() stands for column bind (there's an rbind() too)
  • factor() makes a vector of character a vector of factors
  • levels() gives the possible values of a factor, i.e.:
    • d$gender is {x2(i)}i\seq{x_2^{(i)}}{i}
    • levels(d$gender) is X2X_2
228 / 366

Reading/writing data

There are many packages for reading weird file types.

Some built-in functions for reading/writing CSV files (and variants):

  • read.csv(), read.csv2(), read.table()
  • write.csv(), write.csv2(), write.table()

Some built-in functions for reading/writing data in an R-native format:

  • save()
  • load()
229 / 366

Basic exploration of data

With summary() (built-in)

> d=iris
> summary(d)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500

With skim() from skimr package

> skim(d)
── Data Summary ────────────────────────
Values
Name d
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
── Variable type: factor ────────────────────────────
skim_variable n_missing complete_rate ordered
1 Species 0 1 FALSE
n_unique top_counts
1 3 set: 50, ver: 50, vir: 50
── Variable type: numeric ───────────────────────────
skim_variable n_missing complete_rate mean sd
1 Sepal.Length 0 1 5.84 0.828
2 Sepal.Width 0 1 3.06 0.436
3 Petal.Length 0 1 3.76 1.77
4 Petal.Width 0 1 1.20 0.762
p0 p25 p50 p75 p100 hist
1 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃

Sizes with length(), dim(), nrow(), ncol(); names with names() (same as colnames()), rownames()

  • names change with names(d)[2:3] = c("cows", "dogs")

Here d is a multivariate dataset, but which variable is yy is not specified.

230 / 366

Selecting portions of data

On vectors:

> v=seq(1,2,by=0.25)
> v
[1] 1.00 1.25 1.50 1.75 2.00
> v[2]
[1] 1.25
> v[2:3]
[1] 1.25 1.50
> v[-2]
[1] 1.00 1.50 1.75 2.00
> v[c(1,2,4)]
[1] 1.00 1.25 1.75
> v[c(T,F,F,T)]
[1] 1.00 1.75 2.00
> v[v<1.6]
[1] 1.00 1.25 1.50
> v[which(v<1.6)]
[1] 1.00 1.25 1.50

On data frames:

> d
age gender
1 20 m
2 21 m
3 21 f
> d[1,2]
[1] m
Levels: f m
> d[,2]
[1] m m f
Levels: f m
> d[1,]
age gender
1 20 m
> d$age
[1] 20 21 21

Question: what is d[,c("age","age")]?

231 / 366

Like a pro with tidyverse

> iris %>% group_by(Species) %>%
summarize_at(vars(Sepal.Length,Sepal.Width),
list(mean=mean,sd=sd)) %>%
pivot_longer(-Species)
# A tibble: 12 × 3
Species name value
<fct> <chr> <dbl>
1 setosa Sepal.Length_mean 5.01
2 setosa Sepal.Width_mean 3.43
3 setosa Sepal.Length_sd 0.352
4 setosa Sepal.Width_sd 0.379
5 versicolor Sepal.Length_mean 5.94
6 versicolor Sepal.Width_mean 2.77
7 versicolor Sepal.Length_sd 0.516
8 versicolor Sepal.Width_sd 0.314
9 virginica Sepal.Length_mean 6.59
10 virginica Sepal.Width_mean 2.97
11 virginica Sepal.Length_sd 0.636
12 virginica Sepal.Width_sd 0.322

Useful for:

Very useful, indeed!

  1. The built-in function for plotting is plot(); since it is overloaded for many custom data types, you can always try feeding plot() with something and see what happens...
232 / 366

(Ready for the) first lab!

233 / 366

Lab 1: hardest variable in Iris

  1. consider the Iris dataset
  2. design and implement an ML-based procedure for answering this question:

what's the hardest variable to be predicted in the dataset?

Hints:

  • the Iris dataset is built-in in R: iris
  • there are (at least) two packages for tree learning with R
    • tree
    • rpart this might be a bit better
  • most packages for doing supervised learning have two functions for learning and prediction:
    • packageName() for learning (e.g., tree or rpart)
    • predict() for prediction
234 / 366

Tree bagging and Random Forest

235 / 366

The (bad) flexibility of trees

Consider this dataset obtained from a system:

A dataset with an outlier

Question: how would you "draw the system" behind this data?

If we learn a regression tree with low flexibility:

  • the model will not capture the system behavior
  • it will underfit the data and the system

If we learn a regression tree with high flexibility:

  • the model will likely better capture the system behavior, but...
  • it will also model some noise
  • it will overfit the data

It might be that there is no flexibility value that avoids both underfitting and overfitting.

236 / 366

The (bad) flexibility of trees

Consider this dataset obtained from a system:

A dataset with an outlier

Question: how would you "draw the system" behind this data?

If we learn a regression tree with low flexibility:

  • the model will not capture the system behavior
  • it will underfit the data and the system

If we learn a regression tree with high flexibility:

  • the model will likely better capture the system behavior, but...
  • it will also model some noise
  • it will overfit the data

It might be that there is no flexibility value that avoids both underfitting and overfitting.

What's that point at (80,22)(\approx 80, \approx 22)?

  • noise, or, from another point of view, a detail of the data, rather than of the system, that we don't want to model

What if we collect another dataset out of the same system?

236 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example
237 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example

VW Maggiolino

Model with low complexity:

"a moving object"

Model with high complexity:

"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"

237 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example

VW Maggiolino

Model with low complexity:

"a moving object"

Model with high complexity:

"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"

Ferrari Testarossa

"a moving object"

"a red-colored moving object with 4 wheels, 2 doors, side air intakes, a windshield, a small horse figure"

237 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example

VW Maggiolino

Model with low complexity:

"a moving object"

Model with high complexity:

"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"

Ferrari Testarossa

"a moving object"

"a red-colored moving object with 4 wheels, 2 doors, side air intakes, a windshield, a small horse figure"

Fiat 500

"a moving object"

"a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, a windshield, chromed fenders, sunroof"

237 / 366

Modeled details

| Low complexity | High complexity |
| --- | --- |
| "a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
| "a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
| "a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |

Low complexity: never gives enough details about the system

High complexity: always gives a fair amount of details about the system, but also about noise

238 / 366

Modeled details

| Low complexity | High complexity |
| --- | --- |
| "a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
| "a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
| "a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |

Low complexity: never gives enough details about the system

High complexity: always gives a fair amount of details about the system, but also about noise

What if we combine different models with high complexity?

  • "a [...] moving object with 4 wheels, 2 doors, [...], a windshield, [...]"
  • much more details about the system, no details about the noise
  • i.e., no underfitting 😁, no overfitting 😁
238 / 366

Modeled details

| Low complexity | High complexity |
| --- | --- |
| "a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
| "a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
| "a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |

Low complexity: never gives enough details about the system

High complexity: always gives a fair amount of details about the system, but also about noise

What if we combine different models with high complexity?

  • "a [...] moving object with 4 wheels, 2 doors, [...], a windshield, [...]"
  • much more details about the system, no details about the noise
  • i.e., no underfitting 😁, no overfitting 😁

When "learners" are common people, this idea is related with the wisdom of the crowds theorem, stating that "a collective opinion may be better than a single expert's opinion".

238 / 366

Wisdom of the crowds

"a collective opinion may be better than a single expert's opinion"

Yes, but only if:

  • we have many opinions
  • the opinions are independent
  • we have a way to aggregate them

239 / 366

Wisdom of the crowds

"a collective opinion may be better than a single expert's opinion"

Yes, but only if:

  • we have many opinions
  • the opinions are independent
  • we have a way to aggregate them

Can we realize a wisdom of the trees? (where (opinion, person) \leftrightarrow (prediction, tree))

  • we have many opinions
    • ok, just learn many trees
  • the opinions are independent
    • ... 🤔
  • we have a way to aggregate them
    • aggregate predictions of the trees:
      • classification: majority
      • regression: average
239 / 366

Independence of trees

A tree is the result of the execution of flearnf'\subtext{learn} on a learning set Dlearn={(x(i),y(i))}iD\subtext{learn} = \seq{(x^{(i)},y^{(i)})}{i}.

flearnf'\subtext{learn} is deterministic, thus:

  • if we apply flearnf'\subtext{learn} twice on the same learning set, we obtain two equal models
  • if we apply flearnf'\subtext{learn} mm times on the same dataset, we obtain mm equal models
  • no independence

In order to obtain different trees, we need to apply flearnf'\subtext{learn} on different learning sets!

But we have just one learning set... 🤔

Question: what's the learning set for human-learners?

240 / 366

Different learning sets

Goal: obtaining mm different datasets Dlearn,1,,Dlearn,mD_{\text{learn},1}, \dots, D_{\text{learn},m} from a dataset DlearnD\subtext{learn}

  • decently different from each other
  • all being decently representative of the underlying system (not much worse than D\subtext{learn})
241 / 366

Different learning sets

Goal: obtaining mm different datasets Dlearn,1,,Dlearn,mD_{\text{learn},1}, \dots, D_{\text{learn},m} from a dataset DlearnD\subtext{learn}

  • decently different from each other
  • all being decently representative of the underlying system (not much worse than D\subtext{learn})

Option 1: (CV-like)

  1. shuffle DlearnD\subtext{learn}
  2. split DlearnD\subtext{learn} in mm folds
  3. assign each Dlearn,jD_{\text{learn},j} to the jj-th fold

Requirements check:

  • 👍 the folds are in general different from each other
  • 👎 if mm is large, each Dlearn,jD_{\text{learn},j} is small, with size 1mDlearn\frac{1}{m} |D\subtext{learn}|, and is likely poorly representative of the system
241 / 366

Different learning sets

Goal: obtaining mm different datasets Dlearn,1,,Dlearn,mD_{\text{learn},1}, \dots, D_{\text{learn},m} from a dataset DlearnD\subtext{learn}

  • decently different from each other
  • all being decently representative of the underlying system (not much worse than D\subtext{learn})

Option 1: (CV-like)

  1. shuffle DlearnD\subtext{learn}
  2. split DlearnD\subtext{learn} in mm folds
  3. assign each Dlearn,jD_{\text{learn},j} to the jj-th fold

Requirements check:

  • 👍 the folds are in general different from each other
  • 👎 if mm is large, each Dlearn,jD_{\text{learn},j} is small, with size 1mDlearn\frac{1}{m} |D\subtext{learn}|, and is likely poorly representative of the system

Option 2: rand. sampling w/ repetitions

  1. for each j{1,,m}j \in \{1, \dots, m\}
    1. start with an empty Dlearn,jD_{\text{learn},j}
    2. repeat n=Dlearnn=|D\subtext{learn}| times
      1. pick a random el. of DlearnD\subtext{learn}
      2. add it to Dlearn,jD_{\text{learn},j}

Requirements check:

  • 👍 the folds are in general different from each other
  • 👍 regardless of m, each D_{\text{learn},j} is as large as D\subtext{learn}
    • you can freely choose mm, even mnm \ge n!
241 / 366

Sampling with repetition

On DlearnD\subtext{learn}:

  1. for each j{1,,m}j \in \{1, \dots, m\}
    1. start with an empty Dlearn,jD_{\text{learn},j}
    2. repeat n=Dlearnn=|D\subtext{learn}| times
      1. pick a random el. of DlearnD\subtext{learn}
      2. add it to Dlearn,jD_{\text{learn},j}

In general:

function sample-rep({x1,,xn})\text{sample-rep}(\{x\sub{1},\dots,x\sub{n}\}) {
XX' \gets \emptyset
while |X'| < n {
juniform({1,,n})j \gets \text{uniform}(\{1,\dots,n\})
XX{xj}X' \gets X' \cup \{x\sub{j}\}
}
return XX'
}

[Diagram: f\subtext{sample-rep} maps \{x_1,\dots,x_n\} to \{x_{j_1},\dots,x_{j_n}\}]

Remarks:

  • fsample-repf\subtext{sample-rep} is not deterministic!
    • if you execute twice it on the same input, you get different outputs
  • when you use sampling with repetition to estimate the distribution of a metric, rather than computing the metric itself on the entire collection, you are doing bootstrapping (see the R one-liner below)
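
In R, sample-rep is a one-liner, since the built-in sample() supports sampling with replacement (set.seed() just makes the non-determinism reproducible):

sample.rep <- function(x) x[sample(length(x), replace = TRUE)]

set.seed(42)
sample.rep(c("a", "b", "c", "d", "e"))   # a multiset of 5 elements, likely with duplicates
sample.rep(c("a", "b", "c", "d", "e"))   # a different multiset: not deterministic
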
242 / 366

Examples and probability

Not deterministic, thus:

  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{1}{●},\c{1}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{3}{●},\c{4}{●},\c{3}{●},\c{5}{●},\c{5}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{4}{●},\c{1}{●}\}
  • ...

recall: input and output are multisets

243 / 366

Examples and probability

Not deterministic, thus:

  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{1}{●},\c{1}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{3}{●},\c{4}{●},\c{3}{●},\c{5}{●},\c{5}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{4}{●},\c{1}{●}\}
  • ...

recall: input and output are multisets

Given an input with n elements and assuming uniqueness, an element has:

  • a probability of \left(1-\frac{1}{n}\right)^n of not occurring in the output
  • a probability of \binom{n}{1}\frac{1}{n}\left(1-\frac{1}{n}\right)^{n-1} of occurring in the output exactly once
  • a probability of \binom{n}{2}\left(\frac{1}{n}\right)^2\left(1-\frac{1}{n}\right)^{n-2} of occurring in the output exactly twice
  • ...
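
These probabilities can be checked numerically; for large n, the first one tends to e^{-1} \approx 0.368, i.e., each sampled dataset leaves out, on average, about 37% of the original elements:

n <- 100
(1 - 1/n)^n                                  # never occurs: ~0.366
exp(-1)                                      # the large-n limit: ~0.368
choose(n, 1) * (1/n) * (1 - 1/n)^(n - 1)     # occurs exactly once: ~0.370
choose(n, 2) * (1/n)^2 * (1 - 1/n)^(n - 2)   # occurs exactly twice: ~0.185
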
243 / 366

Towards wisdom of the trees

Can we realize a wisdom of the trees? (where (opinion, person) \leftrightarrow (prediction, tree))

  • we have many opinions
    • 👍 ok, just learn many trees
  • the opinions are independent
    • 👍 each tree is learned on a dataset obtained with sampling with repetition
  • we have a way to aggregate them
    • 👍 aggregate predictions of the trees:
      • classification: majority
      • regression: average

244 / 366

Towards wisdom of the trees

Can we realize a wisdom of the trees? (where (opinion, person) \leftrightarrow (prediction, tree))

  • we have many opinions
    • 👍 ok, just learn many trees
  • the opinions are independent
    • 👍 each tree is learned on a dataset obtained with sampling with repetition
  • we have a way to aggregate them
    • 👍 aggregate predictions of the trees:
      • classification: majority
      • regression: average

Ok, we can define a new learning technique that realizes this idea!

[Diagrams: f'\subtext{learn} takes \seq{(x^{(i)},y^{(i)})}{i} and outputs m; f'\subtext{predict} takes x and m and outputs y]

This technique is called tree bagging (from bootstrap aggregating).

244 / 366

Tree bagging: learning

[Diagram: f'\subtext{learn} takes \seq{(x^{(i)},y^{(i)})}{i} and n\subtext{tree}, and outputs the bag of trees \seq{t_j}{j}]

function learn({(x(i),y(i))}i,ntree)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i}, \c{1}{n\subtext{tree}}) {
TT' \gets \emptyset
while |T'| < n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
tlearnsingle({(x(ji),y(ji))}ji,1)t \gets \c{2}{\text{learn}\subtext{single}}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, \c{3}{1})
TT{t}T' \gets T' \cup \{t\}
}
return TT'
}

  • the model is a bag of trees
    • it can contain duplicates
  • ntreen\subtext{tree} is the number of trees in the bag
    • a parameter of the learning technique
  • learnsingle()\text{learn}\subtext{single}() is the flearnf'\subtext{learn} for learning a single tree (recursive binary splitting)
    • tree bagging is based on recursive binary splitting
  • learnsingle()\text{learn}\subtext{single}() is invoked with nmin=1n\subtext{min}=1, because we want each tree in the bag to give many details¹!

Recall: since one part of this flearnf'\subtext{learn} is not deterministic (namely, sample-rep()\text{sample-rep}()), the entire flearnf'\subtext{learn} is not deterministic!

  • not to be confused with a system not being deterministic
  • not to be confused with an fpredictf''\subtext{predict} that returns a probability
  1. this can be obtained also with a reasonably small nminn\subtext{min}, or with a reasonably large maximum tree depth
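
A minimal R sketch of this f'\subtext{learn} and of the majority-voting f'\subtext{predict} (assumptions: rpart as learn\subtext{single}; classification on iris; minsplit=2 with cp=0 approximates n\subtext{min}=1):

library(rpart)

bag.learn <- function(d, n.tree = 100) {
  lapply(1:n.tree, function(j) {
    d.j <- d[sample(nrow(d), replace = TRUE), ]             # sample-rep
    rpart(Species ~ ., data = d.j,
          control = rpart.control(minsplit = 2, cp = 0))    # max flexibility
  })
}

bag.predict <- function(d.new, trees) {
  votes <- sapply(trees, function(t) as.character(predict(t, d.new, type = "class")))
  votes <- matrix(votes, nrow = nrow(d.new))                # one row per observation
  apply(votes, 1, function(v) names(which.max(table(v))))   # majority voting
}

trees <- bag.learn(iris)
mean(bag.predict(iris, trees) == iris$Species)              # accuracy on the learning data
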
245 / 366

Tree bagging: prediction

[Diagram: f'\subtext{predict} takes x and the bag \seq{t_j}{j}, and outputs y]

Classification (decision trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return arg maxyYj1(y=predictsingle(x,tj))\argmax_{y \in Y} \sum_j \mathbf{1}(y=\c{1}{\text{predict}\subtext{single}}(x,t_j))
}

  • predictsingle()\text{predict}\subtext{single}() is the fpredictf'\subtext{predict} for the single tree
  • arg max\argmax is a majority voting:
    1. for each yy in YY, count the number j1(y=predictsingle(x,tj))\sum_j \mathbf{1}(y=\text{predict}\subtext{single}(x,t_j)) of trees in the bag predicting that yy (i.e., the votes for that yy)
    2. select the yy with the largest count (i.e., the majority of votes)
  • easily modifiable to an fpredictf''\subtext{predict} (with probability):
    • return p=y1{tj}jj1(y=predictsingle(x,tj))p = y \mapsto \frac{1}{|\seq{t_j}{j}|}\sum_j \mathbf{1}(y=\text{predict}\subtext{single}(x,t_j))

Regression (regression trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return 1{tj}jjpredictsingle(x,tj)\frac{1}{|\seq{t_j}{j}|} \sum_j \c{1}{\text{predict}\subtext{single}}(x,t_j)
}

  • simply returns the mean of the predictions of the tree in the bag
  • bonus: instead of getting just the mean, by also getting the standard deviation \sigma of the tree predictions we can have a measure of uncertainty of the prediction: the larger \sigma, the more uncertain the prediction, the lower the confidence
    • uncertainty/confidence is a basic form of local explainability, i.e., understanding the decisions of the model
    • uncertainty/confidence can be exploited in the active learning framework
246 / 366

Impact of the parameter ntreen\subtext{tree}

  • Is n\subtext{tree} a flexibility parameter?
  • Does n\subtext{tree} hence impact the complexity of the learned models, i.e., the tendency to overfit?
247 / 366

Impact of the parameter ntreen\subtext{tree}

  • Is n\subtext{tree} a flexibility parameter?
  • Does n\subtext{tree} hence impact the complexity of the learned models, i.e., the tendency to overfit?

Apparently yes:

  • because the larger ntreen\subtext{tree}, the larger the bag, the more complex the model
    • each tree has the "maximum" complexity, having been learned with n\subtext{min}=1

Apparently no:

  • because the larger ntreen\subtext{tree}, the larger the number of trees whose prediction is averaged (regression) or subjected to majority voting (classification), i.e., the stronger the smoothing of details

So what? 🤔

247 / 366

ntreen\subtext{tree}: bagging vs. single tree learning

"Experimentally", it turns out that:

  • with a reasonably large ntreen\subtext{tree}, bagging is better than single tree learning
    • "reasonably large" = tens or few hundreds
    • "better" = produces more effective models
  • if you further increase ntreen\subtext{tree}, there's no overfitting

Note that bagging with ntree=1n\subtext{tree}=1 is¹ single tree learning.

  1. Question: are they exactly the same?

248 / 366

ntreen\subtext{tree}: bagging vs. single tree learning

"Experimentally", it turns out that:

  • with a reasonably large ntreen\subtext{tree}, bagging is better than single tree learning
    • "reasonably large" = tens or few hundreds
    • "better" = produces more effective models
  • if you further increase ntreen\subtext{tree}, there's no overfitting

Note that bagging with ntree=1n\subtext{tree}=1 is¹ single tree learning.

  1. Question: are they exactly the same?

Question: can we hence set an arbitrarily large n\subtext{tree}?

248 / 366

ntreen\subtext{tree}: bagging vs. single tree learning

"Experimentally", it turns out that:

  • with a reasonably large ntreen\subtext{tree}, bagging is better than single tree learning
    • "reasonably large" = tens or few hundreds
    • "better" = produces more effective models
  • if you further increase ntreen\subtext{tree}, there's no overfitting

Note that bagging with ntree=1n\subtext{tree}=1 is¹ single tree learning.

  1. Question: are they exactly the same?

Question: can we hence set an arbitrarily large n\subtext{tree}?

No! Efficiency decreases linearly with n\subtext{tree}:

  • invoking predict\subtext{single}() n\subtext{tree} times takes, on average, n\subtext{tree} times the resources of invoking predict\subtext{single}() once, but...
  • ... the invocations may be done in parallel (to some degree)
    • time resource is consumed less
    • energy resource is not affected
248 / 366

Tree bagging applicability

Since it is based on the learning technique for single trees, bagging has the same applicability:

  • YY: both regression and classification (binary and multiclass)
  • XX: multivariate XX with both numerical and categorical variables
  • models give probability

249 / 366

Tree bagging applicability

Since it is based on the learning technique for single trees, bagging has the same applicability:

  • YY: both regression and classification (binary and multiclass)
  • XX: multivariate XX with both numerical and categorical variables
  • models give probability

Note that the idea behind tree bagging can be applied to any base learning technique:

  • the base technique is called weak learner
  • the resulting model is an ensemble, hence bagging is a form of ensemble learning
249 / 366
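
Since the weak learner can be any base technique, bagging itself can be written once, generically. A minimal sketch, assuming observations of some type O and the weak learner passed as a function (all names here are hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

class Bagging<O, M> {
  final Function<List<O>, M> weakLearner; // the base f'_learn
  final int nBags; // n_tree, when the weak learner learns trees
  final Random random = new Random();

  Bagging(Function<List<O>, M> weakLearner, int nBags) {
    this.weakLearner = weakLearner;
    this.nBags = nBags;
  }

  // learn the ensemble: n_bags models, each on a bootstrap sample
  List<M> learn(List<O> dataset) {
    List<M> ensemble = new ArrayList<>();
    for (int j = 0; j < nBags; j++) {
      List<O> sample = new ArrayList<>(); // sample-rep(): same size as D, with repetition
      for (int i = 0; i < dataset.size(); i++) {
        sample.add(dataset.get(random.nextInt(dataset.size())));
      }
      ensemble.add(weakLearner.apply(sample));
    }
    return ensemble;
  }
}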

Random Forest

250 / 366

Increasing independence

Wisdom of the trees:

  • many trees
  • trees are independent
  • tree predictions are aggregated

Tree independence is obtained by learning each tree on a (slightly) different dataset.

If some variables (strong predictors) are very useful for separating the observations, all the trees may still share a very similar structure, due to the way they are built.

Can we further increase tree independence?

Yes!

Idea: when learning each tree, remove some randomly chosen independent variables from the observations

Tree bagging improved with variable removal is a learning technique called Random Forest:

  • random because there are two sources of randomness, hence of independence
  • forest because it gives a bag of trees
251 / 366

Random Forest: learning

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}{tj}j\seq{t_j}{j}ntree,nvarsn\subtext{tree},n\subtext{vars}

function learn({(x(i),y(i))}i,ntree,nvars)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i},n\subtext{tree}, \c{1}{n\subtext{vars}}) {
TT' \gets \emptyset
while Tntree|T'| \le n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
{(x(ji),y(ji))}jiretain-vars({(x(ji),y(ji))}ji,nvars)\seq{(\c{4}{x^{\prime(j_i)}},y^{(j_i)})}{j_i} \gets \c{3}{\text{retain-vars}}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, \c{1}{n\subtext{vars}})
tlearnsingle({(x(ji),y(ji))}ji,1)t \gets \c{2}{\text{learn}\subtext{single}}(\seq{(\c{4}{x^{\prime(j_i)}},y^{(j_i)})}{j_i}, 1)
TT{t}T' \gets T' \cup \{t\}
}
return TT'
}

  • the model is a bag of ntreen\subtext{tree} trees, as in bagging
  • nvarsp\c{1}{n\subtext{vars}} \le p is the number of variables to be retained
    • a parameter of the learning technique
  • learnsingle()\text{learn}\subtext{single}() gets, at each iteration, a dataset DP(X×Y)D' \in \mathcal{P}^*(\c{4}{X'} \times Y)
    • X=X1××XpX=X_1 \times \dots \times X_p has all the pp vars
    • X=Xj1××Xjnvars\c{4}{X'}=X_{j_1} \times \dots \times X_{j_{n\subtext{vars}}} has only nvarsn\subtext{vars} variables, with each jk{1,,p}j_k \in \{1, \dots, p\} and jkjkj_{k'} \ne j_{k''} for all kkk' \ne k''
    • retain-vars()\text{retain-vars}() builds DD' (with XX' inside) from DD (with XX inside)

Two parts of this flearnf'\subtext{learn} are not deterministic (namely, sample-rep()\text{sample-rep}() and retain-vars()\text{retain-vars}()), hence the entire flearnf'\subtext{learn} is not deterministic!

252 / 366
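
A minimal sketch of the retain-vars()\text{retain-vars}() step, following the pseudocode above (one random choice of nvarsn\subtext{vars} indexes per tree, applied to all the observations of the bootstrap sample); observations are assumed to be plain double[] arrays of the pp variable values:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RetainVars {
  // keep only n_vars randomly chosen variables (the 2nd source of randomness)
  static List<double[]> retainVars(List<double[]> xs, int nVars) {
    int p = xs.get(0).length;
    List<Integer> indexes = new ArrayList<>();
    for (int j = 0; j < p; j++) {
      indexes.add(j);
    }
    Collections.shuffle(indexes);
    List<Integer> kept = indexes.subList(0, nVars); // the j_1, ..., j_nvars
    List<double[]> retained = new ArrayList<>();
    for (double[] x : xs) {
      double[] xPrime = new double[nVars]; // x' in X' = X_{j_1} x ... x X_{j_nvars}
      for (int k = 0; k < nVars; k++) {
        xPrime[k] = x[kept.get(k)];
      }
      retained.add(xPrime);
    }
    return retained;
  }
}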

Random Forest: prediction

fpredictf'\subtext{predict}x,{tj}jx,\seq{t_j}{j}yy

Classification (decision trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return arg maxyYj1(y=predictsingle(x,tj))\argmax_{y \in Y} \sum_j \mathbf{1}(y=\text{predict}\subtext{single}(x,t_j))
}

Regression (regression trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return 1{tj}jjpredictsingle(x,tj)\frac{1}{|\seq{t_j}{j}|} \sum_j \text{predict}\subtext{single}(x,t_j)
}

Exactly the same as for tree bagging

Question: some of the trees in the bag do not have all variables of xx: is this a problem?

No, the tree is still able to process an xx: it simply does not consider (i.e., does not use in branch nodes) some of its variable values;

  • the opposite case (a variable used in the tree, but without a value in xx) would be a problem we'll see
253 / 366
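
A minimal Java sketch of the classification fpredictf'\subtext{predict} above (majority voting); the single-tree fpredictf'\subtext{predict} is passed in as a function, since its actual form depends on the tree representation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class BagPredict {
  // classification: return the y collecting most votes among the trees
  static <T, Y> Y predict(double[] x, List<T> bag, BiFunction<double[], T, Y> predictSingle) {
    Map<Y, Integer> votes = new HashMap<>();
    for (T tree : bag) {
      votes.merge(predictSingle.apply(x, tree), 1, Integer::sum); // 1(y = predict_single(x, t_j))
    }
    return votes.entrySet().stream()
        .max(Map.Entry.comparingByValue()) // argmax over y of the vote counts
        .orElseThrow()
        .getKey();
  }
}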

Impact of the parameter nvarsn\subtext{vars}

  • Is nvarsn\subtext{vars} a flexibility parameter?
  • Does nvarsn\subtext{vars} hence impact on learned model complexity, i.e., on tendency to overfitting?

No, "experimentally", it turns out that:

  • nvarsn\subtext{vars} does not impact on tendency to overfitting
  • reasonably good default values exist:
    • nvars=pn\subtext{vars} = \left\lceil\sqrt{p}\right\rceil for classification
    • nvars=13pn\subtext{vars} = \left\lceil\frac{1}{3} p\right\rceil for regression

x\left\lceil x\right\rceil is ceil(x)\text{ceil}(x), i.e., rounding up to the nearest integer; x\left\lfloor x\right\rfloor is floor(x)\text{floor}(x), i.e., rounding down to the nearest integer

254 / 366

Random Forest parameters

Both ntreen\subtext{tree} and nvarsn\subtext{vars} do not impact on tendency to overfitting.

In practice, we can use the default values for both:

  • ntree=500n\subtext{tree} = 500
  • nvars=pn\subtext{vars} = \left\lceil\sqrt{p}\right\rceil or nvars=13pn\subtext{vars} = \left\lceil\frac{1}{3} p\right\rceil

\Rightarrow Random Forest is (almost) a (hyper)parameter-free learning technique!

However, "we can use the default values"

  • does not mean that default values are the best parameter values for any possible dataset/system more on this later
  • it means we'd better spend our efforts on designing other components of the ML system:
    • engineering better features
    • getting better data
    • building a better UI
    • ...
255 / 366

Visualizing Random Forest for regression

Example of bagging on regression

image from Fabio Daolio

How is this image built?

  1. set the real system as a f:xyf: x \to y
    • plot f(x)f(x) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]
  2. take a random set of points {x(i)}i\seq{x^{(i)}}{i} in [xmin,xmax][x\subtext{min},x\subtext{max}]
  3. compute the corresponding yy and perturb them with a noise: y(i)=f(x(i))+ϵy^{(i)}=f(x^{(i)})+\epsilon with ϵN(0,1)\epsilon \sim N(0,1)
  4. set the dataset as D={(x(i),y(i))}iD=\seq{(x^{(i)},y^{(i)})}{i}
    • plot each (x(i),y(i))(x^{(i)},y^{(i)}) in DD
  5. learn one single tree tt on DD
    • plot fpredict(x,t)f'\subtext{predict}(x,t) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]
  6. learn¹ a bag {tj}j\seq{t_j}{j} on DD
    • plot fpredict(x,{tj}j)f'\subtext{predict}(x,\seq{t_j}{j}) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]
    • tj{tj}j\forall t_j \in \seq{t_j}{j}, plot fpredict(x,tj)f'\subtext{predict}(x,t_j) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]

Finding: the bag nicely models the real system

  • question: why not at the extremes of the xx domain?
  • question: can you reproduce this for classification and p=2p=2?
  1. Question: bagging or Random Forest?
256 / 366

Out-of-bag trees

function learn({(x(i),y(i))}i,ntree,nvars)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i},n\subtext{tree}, n\subtext{vars}) {
TT' \gets \emptyset
while Tntree|T'| \le n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
{(x(ji),y(ji))}jiretain-vars({(x(ji),y(ji))}ji,nvars)\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i} \gets \text{retain-vars}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, n\subtext{vars})
tlearnsingle({(x(ji),y(ji))}ji,1)t \gets \c{2}{\text{learn}\subtext{single}(\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i}, 1)}
TT{t}T' \gets T' \cup \{t\}
}
return TT'
}

Toy example with D={,,,,}D=\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}

  • t1=learnsingle({,,,,},1)t_1 = \text{learn}\subtext{single}(\{\c{2}{●},\c{4}{●},\c{3}{●},\c{1}{●},\c{1}{●}\}, 1), \c{5}{●} not used
  • t2=learnsingle({,,,,},1)t_2 = \text{learn}\subtext{single}(\{\c{4}{●},\c{4}{●},\c{1}{●},\c{2}{●},\c{5}{●}\}, 1), \c{3}{●} not used
  • t3=learnsingle({,,,,},1)t_3 = \text{learn}\subtext{single}(\{\c{5}{●},\c{3}{●},\c{1}{●},\c{2}{●},\c{4}{●}\}, 1), all used
  • ...
  • tj=learnsingle({,,,,},1)t_j = \text{learn}\subtext{single}(\{\c{3}{●},\c{1}{●},\c{5}{●},\c{4}{●},\c{5}{●}\}, 1), \c{2}{●} not used
  • ...

For every tree, there are zero or more observations that have not been used for learning it.

From another point of view, for every ii-th observation (x(i),y(i))(x^{(i)},y^{(i)}), there are some trees which have been learned without that observation:

  • with ntreen\subtext{tree} trees in the bag, on average, 13ntree\frac{1}{3} n\subtext{tree} trees have been learned without the observation (this can be computed by playing a bit with probability); they are called out-of-bag trees
  • each observation is an unseen observation for its out-of-bag trees

\Rightarrow use unseen observations for computing an estimate of the test error (or accuracy, or another index) without needing a testing set: the OOB error

257 / 366

OOB error

Computing the OOB error during the learning:

  1. for each observation (x(i),y(i))(x^{(i)},y^{(i)})
    1. find the out-of-bag trees
    2. obtain their prediction y^(i)\hat{y}^{(i)} on the observation
  2. compute the error on the predictions (with an fcomp-respsf\subtext{comp-resps})

Remarks:

  • it is an estimate of the test error, but does not need a test dataset
    • still an estimate, not the real test error
  • it is¹ computed at learning time

Classification error vs. bag size (image from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013)

  1. Many libraries compute it only upon user's request.
258 / 366
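
A minimal sketch of the OOB error for classification, assuming that, during learning, the set of observation indexes that ended up in each tree's bootstrap sample has been recorded (here inBag), and that preds[j][i] is the prediction of the jj-th tree on the ii-th observation (both names are assumptions of this sketch):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OobError {
  static double oobErrorRate(int[] y, int[][] preds, List<Set<Integer>> inBag) {
    int wrong = 0;
    int counted = 0;
    for (int i = 0; i < y.length; i++) {
      Map<Integer, Integer> votes = new HashMap<>();
      for (int j = 0; j < preds.length; j++) {
        if (!inBag.get(j).contains(i)) { // the j-th tree is out-of-bag for i
          votes.merge(preds[j][i], 1, Integer::sum);
        }
      }
      if (votes.isEmpty()) {
        continue; // observation in every bag: cannot use it
      }
      int yHat = votes.entrySet().stream()
          .max(Map.Entry.comparingByValue()).orElseThrow().getKey();
      counted = counted + 1;
      if (yHat != y[i]) {
        wrong = wrong + 1;
      }
    }
    return (double) wrong / counted; // the estimate of the test error
  }
}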

Interpretability of the trees

Is this model interpretable (ntree=1n\subtext{tree}=1)?

Single tree

Is this model interpretable (ntree=100n\subtext{tree}=100)?

Forest

Interpretation of the model (i.e., global explainability) is feasible if the model can be visualized:

  • a single tree can be visualized (if it's small); 100100 trees cannot!

There exist other flavors of interpretability:

  • simulatability: the degree to which the working of the model can be reproduced by a human
  • decomposability: the degree to which the human can split the model in components and interpret them and their role
259 / 366

The role of the variables

[Tree diagram: a branch node testing xagex\subtext{age} vs. 1010 (\le / >>), whose child is a branch node testing xheightx\subtext{height} vs. 120120 (\le / >>)]

By looking at this tree, we can understand:

  • exactly what variables are used
  • exactly when they are used in the decision process
    • here, xagex\subtext{age} is used before xheightx\subtext{height}
  • exactly how, i.e., what they are compared against

In principle, this can be done also for a bag of trees, but it would not scale well... in human terms

Can we have a much coarser view on variables role that scales well to large ntreen\subtext{tree}?

Yes!

Idea (first option: mean RSS/Gini decrease): when learning

  1. for each tree, for each branch-node
    1. measure the RSS/Gini before the branch-node
    2. measure the RSS/Gini after the branch-node
    3. assign (by increment) the decrease to the branch-node variable
  2. build a ranking of variables based on the sum of decreases (the larger, the more important)
260 / 366

Variable importance by RSS/Gini decrease

function learn({(x(i),y(i))}i,ntree,nvars)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i},n\subtext{tree}, n\subtext{vars}) {
v0\vect{v} \gets \vect{0}
TT' \gets \emptyset
while Tntree|T'| \le n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
{(x(ji),y(ji))}jiretain-vars({(x(ji),y(ji))}ji,nvars)\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i} \gets \text{retain-vars}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, n\subtext{vars})
tlearnsingle({(x(ji),y(ji))}ji,1,v)t \gets \c{2}{\text{learn}\subtext{single}}(\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i}, 1, \c{1}{\vect{v}})
TT{t}T' \gets T' \cup \{t\}
}
return (T,v)(T', \c{1}{\vect{v}})
}

function learnsingle({(x(i),y(i))}i,nmin,v)\c{2}{\text{learn}\subtext{single}}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min},\c{1}{\vect{v}}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then { ... } else {
ebeforegini({y(i)}i)e\subtext{before} \gets \text{gini}(\seq{y^{(i)}}{i})
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
eaftergini({y(i)}ixj(i)τ)+gini({y(i)}ixj(i)>τ)e\subtext{after} \gets \text{gini}(\seq{y^{(i)}}{i}\big\rvert\sub{x^{(i)}\sub{j} \le \tau})+\text{gini}(\seq{y^{(i)}}{i}\big\rvert\sub{x^{(i)}\sub{j} > \tau})
vjvj+ebeforeeafterv\sub{j} \gets v\sub{j} + e\subtext{before}-e\subtext{after}
tnode-from((j,τ),t \gets \text{node-from}((j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin,v),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}, \c{1}{\vect{v}}),
learn({(x(i),y(i))}ixj(i)>τ,nmin,v)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min}, \c{1}{\vect{v}})
)
return tt
}
}

  1. for each tree, for each branch-node
    1. measure the RSS/Gini before the branch-node
    2. measure the RSS/Gini after the branch-node
    3. assign (by increment) the decrease to the branch-node variable
  2. build a ranking of variables based on the sum of decreases
  • v\vect{v} stores the Gini decrease for each variable
    • initially set to 0Rp\vect{0} \in \mathbb{R}^p
    • propagated to each call to learnsingle()\text{learn}\subtext{single}()
  • the error before is Gini computed on the local dataset (the one at the node) before dividing the data
  • the error after is Gini computed on the local dataset (the one at the node) after dividing the data

Example: gini({y(i)}i)=yFr ⁣(y,{y(i)}i)(1Fr ⁣(y,{y(i)}i))\text{gini}(\seq{y^{(i)}}{i})=\sum_y \freq{y, \seq{y^{(i)}}{i}} \left(1-\freq{y, \seq{y^{(i)}}{i}}\right)

{y(i)}i\seq{y^{(i)}}{i} (shown split as τ\le\tau / >τ>\tau) | Gini | Giniτ\rvert_{ \le \tau} | Gini>τ\rvert_{ > \tau} | Decrease
… / … | 0.5 | 0 | 0 | 0.5
… / … | 0.375 | 0 | 0 | 0.375
… / … | 0.375 | 0 | 0.333 | 0.042
… / … | 0.5 | 0.25 | 0.25 | 0

Question: is this xjx_j categorical or numerical?

261 / 366
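
A minimal sketch of the gini()\text{gini}() function above (labels encoded as ints), with a toy main reproducing a 0.50.5 \to 0 decrease as in the first row of the table; the label sequences in the main are made up for illustration:

import java.util.List;

class Gini {
  // gini({y_i}) = sum_y freq(y, {y_i}) * (1 - freq(y, {y_i}))
  static double gini(List<Integer> ys) {
    double g = 0;
    for (int y : ys.stream().distinct().toList()) {
      double freq = (double) ys.stream().filter(v -> v == y).count() / ys.size();
      g = g + freq * (1 - freq);
    }
    return g;
  }

  public static void main(String[] args) {
    List<Integer> before = List.of(0, 0, 1, 1); // gini = 0.5
    // a perfect split: e_after = gini(left) + gini(right) = 0 + 0
    double decrease = gini(before) - (gini(List.of(0, 0)) + gini(List.of(1, 1)));
    System.out.println(decrease); // 0.5
  }
}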

OOB-shuffling importance

It has been shown experimentally that RSS/Gini decrease is not effective as a measure of variable importance:

  • if there are categorical variables with many values \rightarrow many branches
  • because it tends to give more importance to numerical variables \rightarrow many branches
  • in general, because it works on learning data

Idea (second option, aka mean accuracy decrease): just after learning

  1. for each jj-th variable and each tree tt in the bag
    1. take the observations DtD_t not used for tt
    2. measure the accuracy of tt on DtD_t
    3. shuffle the jj-th variable in the observations, obtaining DtD'_t
    4. measure the accuracy of tt on DtD'_t
    5. assign (by increment) the decrease in accuracy to the jj-th variable
  2. build a ranking of variables based on the sum of decreases (the larger, the more important)

Rationale: if the decrease is low, it means that shuffling the variable has no effect, so the variable is not really important!

262 / 366
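
A minimal sketch of the per-tree accuracy decrease (classification), to be invoked, for each tree and each variable, on the out-of-bag observations of that tree; the tree's fpredictf'\subtext{predict} is passed in as a function:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class ShufflingImportance {
  // accuracy decrease for the j-th variable on one tree's OOB observations
  static double importance(double[][] xs, int[] ys, int j, Function<double[], Integer> predict) {
    double accBefore = accuracy(xs, ys, predict);
    List<Double> column = new ArrayList<>();
    for (double[] x : xs) {
      column.add(x[j]);
    }
    Collections.shuffle(column); // break the x_j <-> y association
    double[][] shuffled = new double[xs.length][];
    for (int i = 0; i < xs.length; i++) {
      shuffled[i] = xs[i].clone();
      shuffled[i][j] = column.get(i);
    }
    return accBefore - accuracy(shuffled, ys, predict); // low decrease = unimportant variable
  }

  static double accuracy(double[][] xs, int[] ys, Function<double[], Integer> predict) {
    int ok = 0;
    for (int i = 0; i < xs.length; i++) {
      if (predict.apply(xs[i]) == ys[i]) {
        ok = ok + 1;
      }
    }
    return (double) ok / xs.length;
  }
}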

Feature ablation for variable importance

There is also a further, more general variant, that works for any learning technique flearn,fpredictf'\subtext{learn}, f'\subtext{predict}:

Idea (third option: feature ablation):

  1. measure the effectiveness of flearn,fpredictf'\subtext{learn}, f'\subtext{predict} on the dataset DD
  2. for each jj-th variable xjx_j
    1. build a DD' by removing xjx_j from DD
    2. measure the effectiveness of flearn,fpredictf'\subtext{learn}, f'\subtext{predict} on the dataset DD'
    3. compute the jj-th variable importance as the decrease of effectiveness in DD' w.r.t. DD
  3. build a ranking of variables based on decreases of effectiveness (the larger, the more important)

This method is (a form of) feature ablation, since you remove variables/features and see what happens:

  • ablation [a-bley-shuhn]: gradually remove material from or erode (a surface or object) by melting, evaporation, frictional action, etc., or erode (material) in this way.
263 / 366
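
A minimal sketch of feature ablation; effectiveness is assumed to be a function that, given a dataset, learns and evaluates a model (e.g., with CV) with whatever learning technique, and returns an effectiveness index (the larger, the better):

import java.util.function.Function;

class FeatureAblation {
  static double[] importances(double[][] xs, Function<double[][], Double> effectiveness) {
    int p = xs[0].length;
    double onFull = effectiveness.apply(xs); // effectiveness on D
    double[] importance = new double[p];
    for (int j = 0; j < p; j++) {
      // effectiveness on D', i.e., D without the j-th variable
      importance[j] = onFull - effectiveness.apply(removeVar(xs, j));
    }
    return importance; // rank the variables by decreasing importance
  }

  static double[][] removeVar(double[][] xs, int j) {
    double[][] out = new double[xs.length][];
    for (int i = 0; i < xs.length; i++) {
      double[] x = new double[xs[i].length - 1];
      for (int k = 0, k2 = 0; k < xs[i].length; k++) {
        if (k != j) {
          x[k2++] = xs[i][k];
        }
      }
      out[i] = x;
    }
    return out;
  }
}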

Variable importance as basic interpretability

In summary, for variable importance, we have three options:

Option | Effectiveness | Efficiency | Applicability
Mean RSS/Gini decrease | 🤏¹ | 👍² | 🤏 only trees
Mean accuracy decrease | 👍 | 👍³ | 🤏 bagging
Feature ablation | 👍⁴ | 🤏 | 👍 universal
  1. not robust to many branches; on learning data
  2. during learning, for free
  3. during learning, almost free
  4. still not perfect: what about redundant variables?

Regardless of the method you use for computing the variable importance, a ranking of the variables according to their importance for having a good model is a basic form of interpretability, as it answers the question:

  • what does the model consider as important for doing predictions?

that should mean:

  • what parts of the system are important according to the model of the system? (global explainability)
264 / 366

Random Forest: summary

Applicability: same as trees 👍👍👍

  • 👍 YY: both regression and classification (binary and multiclass)
  • 👍 XX: multivariate XX with both numerical and categorical variables
  • 👍 models give probability¹
  • 👍 practically parameter-free

Efficiency 👍

  • 👍 in practice, pretty fast in learning and prediction phase (ntree×n\subtext{tree} \times slower than tree)

Explainability/interpretability 👍👍

  • 👍 the models give variable importance (basic global explainability)
  • 👍 the learning technique is itself comprehensible
    • you should be able to implement it by yourself

Unless¹ you really need to look at the tree, Random Forest is always better than the single tree:

  • much much better in effectiveness
  • not really worse in efficiency
  • worse in interpretability (but who cares? see 1)
265 / 366

Random Forest effectiveness

Some researchers did a large scale comparison of many supervised machine learning techniques:

  • Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?." The journal of machine learning research 15.1 (2014): 3133-3181.

Effectiveness of some supervised learning techniques

Delgado et al. abstract

We evaluate 179 classifiers arising from 17 families [...]
We use 121 data set [...]
The classifiers most likely to be the bests are the random forest [...]

According to practice, we just need Random Forest. But...

266 / 366

No free lunch theorem

Earlier, some researchers formulated the No Free Lunch theorem¹:

  • Wolpert, David H. "The lack of a priori distinctions between learning algorithms." Neural computation 8.7 (1996): 1341-1390.

Any two optimization algorithms¹ are equivalent when their performance is averaged across all possible problems²

  1. Wolpert's 1996 paper is about learning algorithms; a later paper by Wolpert (1997) extends the theorem to optimization algorithms and gives the theorem its name
  2. not an actual fragment of the paper, but a recap of the same authors in a later paper

According to theory, all learning techniques are the same.

  • if we consider all (theoretically all!) problems...
  • my advice: start with Random Forest, then see where to spend your time
267 / 366

Why "No Free Lunch"?

There are many restaurants, each offering all the food items on the menu: the price of a given item is, in general, different among restaurants.

Where should you go to eat?

If you just want to eat something, there is no restaurant where everything costs less.

  • 🤤 eater \leftrightarrow ML designer
  • 🏩 restaurant \leftrightarrow ML technique
  • 🥗 food \leftrightarrow ML problem
  • 💵 price \leftrightarrow effectiveness

But if you know what you want to eat, there's at least one restaurant where that thing has the lowest price.

  • Question: what does this mean in practice?
268 / 366

Support Vector Machines

269 / 366

Building on the weakness of the tree

Binary classification problem for SVM: just data

Dataset:

  • Y={,}Y=\{\c{1}{●},\c{2}{●}\}
  • X=R2X=\mathbb{R}^2

A single tree, here, would struggle in establishing a good decision boundary: many corners, many branch nodes.

By looking at the data, we see that a simple line would likely be a good decision boundary

  • recall: the decision boundary in classification is where the model changes the yy when xx crosses it

Can we draw that simple line?

270 / 366

Line as decision boundary

Binary classification problem for SVM: just data

Yes, we can! Here it is!

Despite its apparent simplicity, this "draw the line" operation implies:

  • we think that a line can be used to tell apart the \c{1}{●} and \c{2}{●} points
    • the line is a model
    • we know how to use a model
  • we executed some procedure for finding the line out of the data

Implicitly, we already defined MM, flearn:P(R2×Y)Mf'\subtext{learn}: \mathcal{P}^*(\mathbb{R}^2 \times Y) \to M, and fpredict:R2×MYf'\subtext{predict}: \mathbb{R}^2 \times M \to Y

  • i.e., we defined a new learning technique 🤗

We followed the same approach for trees: now we are more experienced and we can go faster in formalizing it.

271 / 366

Line as a model

Formally, a line-shaped decision boundary in X=R2X=\mathbb{R}^2 can be defined as x2=mx1+qx_2=m x_1 +q where mm is the slope and qq is the intercept.

Alternatively, as: β0+β1x1+β2x2=0\beta_0+\beta_1 x_1+\beta_2 x_2=0 (there are many triplets (β0,β1,β2)(\beta_0, \beta_1, \beta_2) defining the same line)

More in general, in X=RpX=\mathbb{R}^p, we can define a separating hyperplane as: β0+β1x1++βpxp=0\beta_0+\beta_1 x_1+\dots+\beta_p x_p=0 or, in vectorial form, with β,xRp\vect{\beta}, \vect{x} \in \mathbb{R}^p, as: β0+βx=0\beta_0+\vect{\beta}^\intercal\vect{x}=0

  • separating, because it can be used to separate the space in two parts
  • hyperplane, because we are in Rp\mathbb{R}^p (p=1p=1: threshold; p=2p=2: line; p=3p=3: plane; p>3p>3: hyperplane)
272 / 366

Using a separating hyperplane

Binary classification problem for SVM: just data

Intuitively:

  • if the point x\vect{x} is above the line, then y=y=\c{2}{●}
  • else, if the point x\vect{x} is below the line, then y=y=\c{1}{●}
  • else, if the point x\vect{x} is on the line, then 🤔

Formally:

  • x\vect{x} is on the line iff β0+β1x1+β2x2=0\beta_0+\beta_1 x_1+\beta_2 x_2 \c{3}{=} 0
  • x\vect{x} is above the line iff β0+β1x1+β2x2>0\beta_0+\beta_1 x_1+\beta_2 x_2 \c{3}{>} 0
  • x\vect{x} is below the line iff β0+β1x1+β2x2<0\beta_0+\beta_1 x_1+\beta_2 x_2 \c{3}{<} 0

Example: This particular line is: 2+1.1x1+x2=02+1.1 x_1 + x_2 = 0

For x=(10,10)\vect{x}=(10,10):

  • 2+1.1x1+x2=2+11+10=23>02+1.1 x_1 + x_2 = 2+11+10=23 \c{3}{>} 0
  • hence y=y=\c{2}{●} (above)

For x=(10,10)\vect{x}=(-10,-10):

  • 2+1.1x1+x2=21110=19<02+1.1 x_1 + x_2 = 2-11-10=-19 \c{3}{<} 0
  • hence y=y=\c{1}{●} (below)
273 / 366

fpredictf'\subtext{predict} with a separating hyperplane

fpredictf'\subtext{predict}x,(β0,β)\vect{x},(\beta_0,\vect{\beta})yy

function predict(x,(β0,β))\text{predict}(\vect{x}, \c{1}{(\beta_0, \vect{\beta})}) {
if β0+βx0\beta_0+\vect{\beta}^\intercal\vect{x} \c{2}{\ge} 0 then {
return Pos\text{Pos}
} else {
return Neg\text{Neg}
}
}

Assumptions:

  • Y={Pos,Neg}Y = \{\text{Pos},\text{Neg}\}
    • binary classification only!¹
  • X=RpX = \mathbb{R}^p
    • numerical independent variables only!²
  • (β0,β)(\beta_0, \vect{\beta}) is the model
  • y=Posy = \text{Pos} for both the >> and == cases
    • y=Negy = \text{Neg} for <<, i.e., otherwise
  • computationally very fast: just pp multiplications and sums
  1. we'll see later how to port this to the case of Y>2|Y| > 2
  2. we'll see later how to port this to the case of categorical variables
274 / 366

Separating hyperplane with probability

Intuitively, for β0+βx\beta_0+\vect{\beta}^\intercal\vect{x}

  • the greater (positive and large), the more satisfied the 0\ge 0 condition, hence the more positive
  • the smaller (negative and large), the more satisfied the <0< 0 condition, hence the more negative

function predict(x,(β0,β))\text{predict}(\vect{x}, (\beta_0, \vect{\beta})) {
if β0+βx0\c{3}{\beta_0+\vect{\beta}^\intercal\vect{x}} \ge 0 then {
return Pos\text{Pos}
} else {
return Neg\text{Neg}
}
}

Can we use this like a probability? Can we have an fpredictf''\subtext{predict} for the hyperplane?

  • recall the single tree: fpredict(x,t)=(35,25)f''\subtext{predict}(x,t)=(\c{1}{● \smaller{\frac{3}{5}}}, \c{2}{● \smaller{\frac{2}{5}}}) question: can we infer something about n=Dlearnn=|D\subtext{learn}| from this?
  • recall the bag (assume ntree=100n\subtext{tree}=100): fpredict(x,{tj}j)=(38100,62100)f''\subtext{predict}(x,\seq{t_j}{j})=(\c{1}{● \smaller{\frac{38}{100}}}, \c{2}{● \smaller{\frac{62}{100}}})

No! Because β0+βx\beta_0+\vect{\beta}^\intercal\vect{x} is not bounded in [0,1][0,1]

  • we can still use it as a measure of confidence: the smaller β0+βx|\beta_0+\vect{\beta}^\intercal\vect{x}|, the lower the confidence in the decision; in the extreme case β0+βx=0|\beta_0+\vect{\beta}^\intercal\vect{x}|=0 means no confidence, i.e., both y=Posy=\text{Pos} and y=Negy=\text{Neg} are ok

You may map the domain of β0+βx\beta_0+\vect{\beta}^\intercal\vect{x}, i.e., [,+][-\infty,+\infty] to [0,1][0,1] with, e.g., tanh\tanh: if x[,+]x \in [-\infty,+\infty], then 12+12tanh(x)[0,1]\frac{1}{2}+\frac{1}{2}\tanh(x) \in [0,1].
But this is not a common practice, because it still would not be a real probability.

275 / 366
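
A minimal Java sketch of this fpredictf'\subtext{predict} with the model (β0,β)(\beta_0,\vect{\beta}), plus the tanh\tanh-based confidence-like value just discussed (which, again, is not a probability):

class HyperplaneClassifier {
  // beta0 + beta^T x: just p multiplications and sums
  static double score(double[] x, double beta0, double[] beta) {
    double s = beta0;
    for (int j = 0; j < x.length; j++) {
      s = s + beta[j] * x[j];
    }
    return s;
  }

  // Pos iff beta0 + beta^T x >= 0
  static boolean predictPos(double[] x, double beta0, double[] beta) {
    return score(x, beta0, beta) >= 0;
  }

  // maps the unbounded score to [0,1]: confidence-like, not a probability
  static double confidence(double[] x, double beta0, double[] beta) {
    return 0.5 + 0.5 * Math.tanh(score(x, beta0, beta));
  }
}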

Learning the separating hyperplane

Binary classification problem for SVM: just data

How to choose the separating line?

First attempt:

Choose the one that:

  • perfectly separates the \c{1}{●} and \c{2}{●} points

🫣 this condition holds, in general, for infinitely many lines...

Second attempt:

Choose the one that:

  • perfectly separates the \c{1}{●} and \c{2}{●} points and
  • is the farthest from the closest points
277 / 366

The maximal margin classifier

Binary classification problem for SVM: just data

The hyperplane that

  • perfectly separates the Pos\text{Pos} and Neg\text{Neg} points and
  • is the farthest from the closest points

is called the maximal margin classifier (MMC).

Maximal margin classifier:

  • classifier, because it can be used for classifying points,
    • since it is a separating hyperplane that divides the space in two portions
  • maximal margin: because it is the one leaving the largest distance (margin) from the closest points
279 / 366

Support vectors

Binary classification problem for SVM: just data

Names:

  • the band between the two dashed lines (through the solid separating line) is the margin
  • the points lying on the edge of the margin are called support vectors
    • they support the band in its position, like nails 📍 with a wooden ruler 📏
    • they are points in Rp\mathbb{R}^p, hence vectors
    • here, two of one class and one of the other

If you move (not too much) any of the points which are not support vectors, the separating hyperplane stays the same!

280 / 366

Learning the maximal margin classifier

Intuitively:

Choose the one that:

  • perfectly separates the Pos\text{Pos} and Neg\text{Neg} points and
  • is the farthest from the closest points

Looks like an optimization problem:

  • "perfectly separates" \rightarrow constraint
  • "is the farthest" \rightarrow objective

Formally:

\begin{align*} \max_{\beta_0, \dots, \beta_p} & \; \c{4}{m} \\ \text{subject to} & \; \c{3}{\sum_{j=1}^{j=p} \beta_j^2 = \vect{\beta}^\intercal\vect{\beta} = 1} \\ & \; \c{3}{y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m} & \c{3}{\forall i \in \{1, \dots, n\}} \end{align*} that means:

  • find the largest mm, such that
  • every point x(i)\vect{x}^{(i)} is at a distance m\ge m from the hyperplane
  • and is on the proper side

Assume by convention that Pos+1\text{Pos} \leftrightarrow +1 and Neg1\text{Neg} \leftrightarrow -1, so y(i)()my^{(i)}(\dots) \ge m is like m\dots \ge m for positives and m\dots \le -m for negatives

  • β0,,βp\beta_0, \dots, \beta_p, that is the model (β0,β)(\beta_0, \vect{\beta}), is what we are looking for
  • mathematically, if j=1j=pβj2=1\sum_{j=1}^{j=p} \beta_j^2 = 1, then β0+βx\beta_0+\vect{\beta}^\intercal\vect{x} is the Euclidean distance of x\vect{x} from the hyperplane (with sign)
  • y(i)(β0+βx(i))my^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m is == for support vectors and >> for the other points
281 / 366

flearnf'\subtext{learn} for the maximal margin classifier

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}(β0,β)(\beta_0,\vect{\beta})

function learn({(x(i),y(i))}i)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(β0,β)solve((\beta_0,\vect{\beta}) \gets \c{1}{\text{solve}(}
maxβ0,,βpm,\max_{\beta_0,\dots,\beta_p} m,
ββ=1y(i)(β0+βx(i))m,i\vect{\beta}^\intercal\vect{\beta}= 1 \land y^{(i)}(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}) \ge m, \forall i
))
return (β0,β)(\beta_0,\vect{\beta})
}

  • solve()\text{solve}() is just a solver for numerical optimization problems which takes the objective and the constraints

In practice, this is an easy optimization problem and solving it is fast! for a computer

282 / 366

Maximal margin classifier learning

This learning technique is called maximal margin classifier learning.

Efficiency: 👍

  • 👍👍👍 very fast, both in learning and prediction

Applicability: 🫳

  • 🫳 just binary classification more on this later
  • 🫳 just numerical variables more on this later
  • 👍 parameter-free!

Effectiveness: 🤔

  • overfitting? well, no flexibility, so... 🤔
    • what's complexity here? the size of the model is always p+1p+1

function learn({(x(i),y(i))}i)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(β0,β)solve((\beta_0,\vect{\beta}) \gets \text{solve}(
maxβ0,,βpm,\max_{\beta_0,\dots,\beta_p} m,
ββ=1y(i)(β0+βx(i))m,i\vect{\beta}^\intercal\vect{\beta}= 1 \land y^{(i)}(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}) \ge m, \forall i
))
return (β0,β)(\beta_0,\vect{\beta})
}

function predict(x,(β0,β))\text{predict}(\vect{x}, (\beta_0, \vect{\beta})) {
if β0+βx0\beta_0+\vect{\beta}^\intercal\vect{x} \ge 0 then {
return Pos\text{Pos}
} else {
return Neg\text{Neg}
}
}

283 / 366

Maximal margin classifier: issue 1

Binary classification problem for SVM: just data

Support vectors:

  • they support the band in its position, like nails 📍 with a wooden ruler 📏
  • here, two of one class and one of the other

If you move (not too much) any of the points which are not support vectors, the separating hyperplane stays the same!

But, if you move a support vector, then the separating hyperplane moves!

  • i.e., for small changes of (some) observations (apply some noise to some x(i)\vect{x}^{(i)}), the model changes: looks like variance
284 / 366

Maximal margin classifier: issue 2

Binary classification problem for SVM: just data

Even worse, if you apply some noise¹ to some label y(i)y^{(i)}, it might be that a separating hyperplane does not exist at all! 😱

  • in practice, the solve()\text{solve}() function just halts and says "there's no solution for this optimization problem".

\Rightarrow Applicability: 👎👎👎

How did the tree cope with yy noise?

  • simply by tolerating² some wrong classifications also on the learning data

Can we make MMC tolerant too?

  1. noise to the yy: recall the carousel attendant's kids...
  2. if ntreen\subtext{tree} was large enough
285 / 366

Introducing tolerance (1st formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \\ & \; \sum_{i=1}^{i=n} \c{1}{\epsilon^{(i)}} \le \c{2}{c} \end{align*}

  • ϵ(1),,ϵ(n)\epsilon^{(1)},\dots,\epsilon^{(n)} are positive slack variables:
    • one for each observation
    • they act as tolerance w.r.t. the margin
      • ϵ(i)=0\epsilon^{(i)}=0 means x(i)\vect{x}^{(i)} has to be out of the margin, on correct side
      • ϵ(i)(0,1]\epsilon^{(i)} \in (0,1] means x(i)\vect{x}^{(i)} can be inside the margin, on the correct side
      • ϵ(i)>1\epsilon^{(i)} > 1 means x(i)\vect{x}^{(i)} can be on wrong side
  • cR+\c{2}{c} \in \mathbb{R}^+ (for cost), is a budget of tolerance, which is a parameter of the learning technique

This learning technique is called soft margin classifier (SMC, or support vector classifier), because, due to tolerance, the margin can be pushed.

It has one parameter, cc:

  • c=0c=0 corresponds to maximal margin classifier (no tolerance)
286 / 366

Role of the parameter cc (in 1st formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \\ & \; \sum_{i=1}^{i=n} \c{1}{\epsilon^{(i)}} \le \c{2}{c} \end{align*}

c=+c=+\infty \rightarrow infinite tolerance \rightarrow you can put the line wherever you want

  • from another point of view, you can move the points a lot and the line stays the same
  • hence the model is the same irrespective of learning data \Rightarrow high bias

c=0c=0 \rightarrow no tolerance \rightarrow any noise will change the model

  • hence high variance
  • even worse: if cc is too small, this is an \approx MMC
    • for a given dataset, there is a clearnablec\subtext{learnable} such that if c<clearnablec<c\subtext{learnable} no model is learnable 😱
287 / 366

Variable scale

The threshold clearnablec\subtext{learnable} for learnability depends:

  • on nn, for the summation i=1i=n\sum_{i=1}^{i=n}
  • on pp, because of β0+βx(i)\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)} the larger pp, the longer the summation, as βx(i)=j=1j=pβjxj\vect{\beta}^\intercal\vect{x}^{(i)}=\sum_{j=1}^{j=p} \beta_j x_j
  • on the actual scales of the variables

Actually, the margin mm of the MMC itself depends on the scales of variables!

Trivial dataset before scaling: D={(1,1,),(3,3,)}D = \{(1,1,\c{1}{●}), (3,3,\c{2}{●})\}, with margin m=12+12=2m=\sqrt{1^2+1^2}=\sqrt{2}

Scaled dataset (each xjx_j is ×12\times \frac{1}{2}): D={(0.5,0.5,),(1.5,1.5,)}D = \{(0.5,0.5,\c{1}{●}), (1.5,1.5,\c{2}{●})\}, with margin m=122+122=12m=\sqrt{\frac{1}{2^2}+\frac{1}{2^2}}=\frac{1}{\sqrt{2}}

288 / 366

Variable scale and hyperplane

Moreover, the coefficients βj\beta_j depend on the scales of the variables too!

Intuitively: if

  • xj[1.4,2.1]x_j \in [1.4, 2.1] (might be the height in meters)
  • and xj[20000,50000]x_{j'} \in [20000, 50000] (might be the annual income in €)

then βj\beta_j will be much different than βj\beta_{j'}, making the computation of β0+βx(i)\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)} (and hence the model) rather sensitive to noise.

Hence, when using MMC (or SMC, or SVM), you¹ should rescale the variables. Options:

  • min-max scaling: xj(i)=xj(i)minixj(i)maxixj(i)minixj(i)x^{\prime(i)}_j = \frac{x^{(i)}_j - \min_{i'} x^{(i')}_j}{\max_{i'} x^{(i')}_j - \min_{i'} x^{(i')}_j} where minixj(i)\min_{i'} x^{(i')}_j is the min of xjx_j in DD
  • standardization: xj(i)=1σj(xj(i)μj)x^{\prime(i)}_j = \frac{1}{\sigma_j} \left(x^{(i)}_j - \mu_j\right) where μj\mu_j and σj\sigma_j are the mean and standard deviation of xjx_j in DD

Standardization is, in general, preferred as it is more robust to outliers.

  1. In practice, most of the ML sw/libraries do it internally.
289 / 366
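
A minimal sketch of standardization, with the coefficients computed on the learning data only (computing them on the entire dataset would be cheating, as noted in the next slide):

class Standardizer {
  final double[] mu; // means, one per variable
  final double[] sigma; // standard deviations, one per variable

  // fit mu and sigma on the learning data
  Standardizer(double[][] xs) {
    int n = xs.length;
    int p = xs[0].length;
    mu = new double[p];
    sigma = new double[p];
    for (double[] x : xs) {
      for (int j = 0; j < p; j++) {
        mu[j] = mu[j] + x[j] / n;
      }
    }
    for (double[] x : xs) {
      for (int j = 0; j < p; j++) {
        sigma[j] = sigma[j] + (x[j] - mu[j]) * (x[j] - mu[j]) / n;
      }
    }
    for (int j = 0; j < p; j++) {
      sigma[j] = Math.sqrt(sigma[j]);
    }
  }

  // apply the same transformation in both learning and prediction
  double[] scale(double[] x) {
    double[] scaled = new double[x.length];
    for (int j = 0; j < x.length; j++) {
      scaled[j] = (x[j] - mu[j]) / sigma[j];
    }
    return scaled;
  }
}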

Scaling as part of the model

Since you have to do the scaling both in learning and prediction, the coefficients needed for scaling (i.e., min,max\min, \max or μ,σ\mu, \sigma) do belong to the model!

Learning with scaling: (here, standardization)

{(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}scaling{(x(i),y(i))}i\seq{(\vect{x}^{\prime(i)},y^{(i)})}{i}flearnf'\subtext{learn}mmjoin(m,μ,σ)(m,\vect{\mu},\vect{\sigma})(μ,σ)(\vect{\mu},\vect{\sigma})

(m,μ,σ)(m,\vect{\mu},\vect{\sigma}) is the model with scaling, with μ,σRp\vect{\mu},\vect{\sigma} \in \mathbb{R}^p. Here, join builds a tuple

Prediction with scaling:

x,(m,μ,σ)\vect{x},\c{2}{(m,\vect{\mu},\vect{\sigma})}splitx,μ,σ\vect{x},\vect{\mu},\vect{\sigma}scalex\vect{x}'joinx,m\vect{x}',mfpredictf'\subtext{predict}yymm

If you use the entire dataset (e.g., in CV, or in train/test static division) for computing μ,σ\vect{\mu},\vect{\sigma}, then you are cheating!
Question: can you write down the pseudocode of "scale"? And scaling? Are they the same?

290 / 366

Introducing tolerance (2nd formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m - \c{2}{c} \c{1}{\sum_{i=1}^{i=n} \epsilon^{(i)}} \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \end{align*}

  • ϵ(1),,ϵ(n)\epsilon^{(1)},\dots,\epsilon^{(n)} are again positive slack variables
  • their sum is unbounded, but it is accounted for negatively in the objective: basically, this is a sort-of bi-objective optimization:
    • maximize mm
    • minimize i=1i=nϵ(i)\sum_{i=1}^{i=n} \epsilon^{(i)}
  • cR+\c{2}{c} \in \mathbb{R}^+ is a weighting parameter setting the relative weight of the two objectives; it is a parameter of the learning technique

This is also the learning technique called soft margin classifier.

Most of the ML sw/libraries are based on this formulation.

The 1st one is often shown in books, e.g., in James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013

291 / 366

Role of the parameter cc (in 2nd formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m - \c{2}{c} \c{1}{\sum_{i=1}^{i=n} \epsilon^{(i)}} \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \end{align*}

c=0c = 0 \rightarrow no weight to i=1i=nϵ(i)\sum_{i=1}^{i=n} \epsilon^{(i)} \rightarrow points that are inside the margin cost zero

  • you can put the line wherever you want
  • hence, the model is the same irrespective of learning data \Rightarrow high bias

c=+c = +\infty \rightarrow infinite weight to i=1i=nϵ(i)\sum_{i=1}^{i=n} \epsilon^{(i)} \rightarrow points that are inside the margin cost a lot

  • max effort to put all points outside the margin
  • from another point of view, the margin is very sensitive to point positions \Rightarrow high variance
  • but still, with huge cost, a model can be learned!
292 / 366

SMC: sims and diffs of the two formulations

Similarities:

  • there is one learning parameter (called cc)
  • cc is a flexibility parameter

Differences:

  • cc extreme values:
    • c=+c=+\infty (1st) and c=0c=0 (2nd) for high bias
    • c=0c=0 (1st) and c=+c=+\infty (2nd) for high variance
  • learnability:
    • with the 2nd, you can always learn a model from any dataset DD
    • with the 1st, given a DD, there is a clearnable0c\subtext{learnable} \ge 0 such that if you set c<clearnablec < c\subtext{learnable} you cannot learn a model from DD
      • clearnable=0c\subtext{learnable}=0 if the data is linearly separable

In practice:

  • most of the ML sw/libraries are based on the 2nd formulation
  • you should find (e.g., with CV) a proper value for cc
293 / 366

Always learn...

A not linearly separable binary classification dataset

Yes, with the 2nd formulation, we can learn an SMC, but it will be a poor model:

  • simply, the decision boundary here is not a straight line
  • a line is naturally unable to model the system

More in general, not every binary classification problem can be solved with a hyperplane.

Can we learn non linear decision boundaries?

294 / 366

Beyond the hyperplane: disclaimer

Yes, we can!

But...

Disclaimer

There will be some harder mathematics. We are going to make it simple.

To simplify it, we'll walk riskily on the edge of correctness...

295 / 366

An alternative formulation for fpredictf'\subtext{predict}

First, let's give a name to the core computation of fpredictf'\subtext{predict}: f(x)=β0+βx=β0+j=1j=pβjxjf(\vect{x}) = \beta_0 + \vect{\beta}^\intercal \vect{x}=\beta_0 + \sum_{j=1}^{j=p} \beta_j x_j with f:RpRf: \mathbb{R}^p \to \mathbb{R}.

It turns out that this same ff can be written also as: f(x)=β0+i=1i=nα(i)x,x(i)f(\vect{x})=\beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} \left\langle \vect{x}, \vect{x}^{(i)} \right\rangle where x,x=xx=j=1j=pxjxj\left\langle \vect{x}, \vect{x}' \right\rangle = \vect{x}^\intercal \vect{x}' = \sum_{j=1}^{j=p} x_j x'_j is the inner product.

,:Rp×RpR\langle \cdot,\cdot \rangle: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} can also be defined on other sets than Rp\mathbb{R}^p, so it's not just xx\vect{x}^\intercal \vect{x}'...

Remarks:

  • there are p+1p+1 β\beta coeffs and nn α\alpha coeffs
    • in general, they are different in value
  • for the first formulation, during optimization you give the x(i)\vect{x}^{(i)} to solve()\text{solve}() and obtain the β\beta coeffs
    • once you fix {(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}, you completely define f(x)f(\vect{x})
  • same for the second formulation
    • once you fix {(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i} and the α\alpha coeffs, you completely define f(x)f(\vect{x})
    • the α\alpha coeffs are just needed to make the two functions the same
296 / 366

The support vectors and the α\alpha coeffs

Binary classification problem for SVM: just data

Given that:

β0+βx=f(x)=β0+i=1i=nα(i)xx(i)\beta_0 + \vect{\beta}^\intercal \vect{x} = f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} \vect{x}^\intercal \vect{x}^{(i)}

If you move¹ any point which is not a support vector, by definition f(x)f(\vect{x}) must stay the same:

  • so the β\beta coeffs must stay the same
  • so the α\alpha coeffs must stay the same

Hence, it follows that α(i)=0\alpha^{(i)}=0 for every x(i)\vect{x}^{(i)} which is not a support vector!

More in general each α(i)\alpha^{(i)} says what's the contribution of the corresponding x(i)\vect{x}^{(i)} when classifying x\vect{x}: 00 means no contribution.

From the point of view of the optimization, solve()\text{solve}() for the second formulation gives (β0,α)(\beta_0, \vect{\alpha}), with αRn\vect{\alpha} \in \mathbb{R}^n: this also says which are the support vectors. Similarly, the model is (β0,α)(\beta_0, \vect{\alpha}) instead of (β0,β)(\beta_0, \vect{\beta}).

  1. Without making it a support vector.
297 / 366

The kernel

Ok, but what about going beyond the hyperplane? We are almost there...

The second formulation may be generalized: f(x)=β0+i=1i=nα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} \c{2}{k\left(\vect{x}, \vect{x}^{(i)}\right)} where k:Rp×RpRk: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} is a kernel function.

The idea behind the kernel function is to:

  1. transform the original space X=RpX=\mathbb{R}^p into another space X=RqX'=\mathbb{R}^q, with possibly qpq \gg p, with a ϕ:XX\phi: X \to X', and then
  2. to compute the inner product in the destination space, i.e., k(x,x(i))=ϕ(x)ϕ(x(i))k(\vect{x}, \vect{x}^{(i)})= \phi(\vect{x})^\intercal \phi(\vect{x}^{(i)})

hoping that a hyperplane can separate the points in XX' better than in XX.

This thing is called the kernel trick. Understanding the math behind it is beyond the scope of this course. Understanding the way the optimization works with a kernel is beyond the scope of this course.

When you use a kernel, this technique is called Support Vector Machines (SVM) learning.

298 / 366

Common kernels

Linear kernel:

k(x,x)=xxk(\vect{x}, \vect{x}') = \vect{x}^\intercal \vect{x}'

  • the most efficient (computationally cheapest)

Polynomial kernel:

k(x,x)=(1+xx)dk(\vect{x}, \vect{x}') = (1+\vect{x}^\intercal \vect{x}')^d

  • dd, the degree of the kernel, is a parameter

Gaussian kernel:

k(x,x)=eγxx2k(\vect{x}, \vect{x}') = e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2}

  • xx2\lVert \vect{x} - \vect{x}' \rVert^2 is the squared Euclidean distance of x\vect{x} to x\vect{x}'
  • γ\gamma is a parameter
  • also called radial basis function (RBF), or just radial, kernel
  • the most widely used

f(x)=β0+i=1i=nα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} k\left(\vect{x}, \vect{x}^{(i)}\right)

Regardless of the kernel being used, each α(i)\alpha^{(i)} says what's the contribution of the corresponding x(i)\vect{x}^{(i)} when evaluating f(x)f(\vect{x}) inside fpredictf'\subtext{predict}.

299 / 366
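
A minimal sketch of the three kernels and of f(x)=β0+iα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_i \alpha^{(i)} k(\vect{x}, \vect{x}^{(i)}); the support vectors are simply the x(i)\vect{x}^{(i)} with a non-zero α(i)\alpha^{(i)}:

import java.util.function.ToDoubleBiFunction;

class Kernels {
  static double linear(double[] x1, double[] x2) {
    double s = 0; // the inner product <x1, x2>
    for (int j = 0; j < x1.length; j++) {
      s = s + x1[j] * x2[j];
    }
    return s;
  }

  static double polynomial(double[] x1, double[] x2, int d) {
    return Math.pow(1 + linear(x1, x2), d); // d is the degree
  }

  static double gaussian(double[] x1, double[] x2, double gamma) {
    double d2 = 0; // squared Euclidean distance
    for (int j = 0; j < x1.length; j++) {
      d2 = d2 + (x1[j] - x2[j]) * (x1[j] - x2[j]);
    }
    return Math.exp(-gamma * d2);
  }

  // f(x) = beta0 + sum_i alpha_i k(x, x_i); alpha_i = 0 for non support vectors
  static double f(double[] x, double beta0, double[] alphas, double[][] xs,
      ToDoubleBiFunction<double[], double[]> k) {
    double s = beta0;
    for (int i = 0; i < xs.length; i++) {
      s = s + alphas[i] * k.applyAsDouble(x, xs[i]);
    }
    return s;
  }
}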

Inside the Gaussian kernel (humbly, toy)

k(x,x)=eγxx2k(\vect{x}, \vect{x}') = e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2} and f(x)=β0+i=1i=nα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} k\left(\vect{x}, \vect{x}^{(i)}\right)

  • eγxx2[0,1]e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2} \in [0,1]; xx2\lVert \vect{x} - \vect{x}' \rVert^2 is the squared distance of x\vect{x} to x\vect{x}'
  • the larger γ\gamma, the faster eγxx2e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2} goes to 00 with distance

Let's consider a point x\vect{x} moving from (0,3.5)(0,3.5) to (6,3.5)(6,3.5):

  • think about its correct color, while moving
  • put it on the 3D plane, consider its 3 α\alpha, draw decision boundary

[Interactive figure: Gaussian kernel with γ=0.1\gamma=0.1, γ=1\gamma=1, γ=10\gamma=10; three support vectors in 2D; 3D canvas]

300 / 366

Intuitive interpretation Gaussian kernel

Intuitively, and very broadly speaking, the Gaussian kernel maps an x\vect{x} to the space where coordinates are the distances to relevant observations of the learning data.

In practice, the decision boundary can smoothly follow any path:

  • with some risk of overfitting

Drawing of an SVM decision boundary (image from Wikipedia)

301 / 366

SVM: summary

Efficiency 👍👍👍

  • 👍 very fast

Explainability/interpretability 🫳

  • 👎 few numbers, but hardly interpretable, no global explainability
    • knowing which points are the support vectors is better than nothing...
  • 😶 the learning technique is pure optimization
  • 👍 confidence may be used as basic form of local explainability

Effectiveness 👍👍

  • 👍 in general good with the Gaussian kernel
    • but complex interactions between cc and γ\gamma require choosing parameter values carefully

Applicability 🫳

  • 🫳 YY: only binary classifications
  • 🫳 XX: only numerical variables
  • 👍 models give a confidence
  • 🫳 with two parameters (cc and γ\gamma)
302 / 366

Improving applicability

303 / 366

XX, YY and applicability

Let X=X1××XpX=X_1 \times \dots \times X_p:

XjX_j | YY | RF | SVM
Numerical | Binary classification | ✅ | ✅
Categorical | Binary classification | ✅ | ❌
Numerical + Categorical | Binary classification | ✅ | ❌
Numerical | Multiclass classification | ✅ | ❌
Categorical | Multiclass classification | ✅ | ❌
Numerical + Categorical | Multiclass classification | ✅ | ❌
Numerical | Regression | ✅ | ❌
Categorical | Regression | ✅ | ❌
Numerical + Categorical | Regression | ✅ | ❌

Let's start by fixing SVM!

304 / 366

From categorical to numerical variables

Let xjx_j be categorical:

  • xjXj={xj,1,,xj,k}x_j \in X\sub{j} = \{x\sub{j,\c{1}{1}},\dots,x\sub{j,\c{1}{k}}\} (i.e., kk different values)

Then, we can replace it with kk numerical variables:

  • xh1Xh1={0,1}x_{h_1} \in X_{h_1} = \{0,1\}
  • ...
  • xhkXhk={0,1}x_{h_k} \in X_{h_k} = \{0,1\}

such that: i,k:xhk(i)=1(xj(i)=xj,k)\forall i, k: x^{(i)}_{h_k}=\mathbf{1}(x^{(i)}_j=x\sub{j,k})

This way of encoding one categorical variable with kk possible values to kk binary numerical variables is called one-hot encoding.

Each one of the resulting binary variables is a dummy variable.

A similar encoding can be applied when Xj=P(A)X_j=\mathcal{P}(A).

Example: (extended carousel)

Original features: age, height, city p=3p=3

  • X=R+×R+×{Ts,Ud,Ve,Pn,Go}X = \mathbb{R}^+ \times \mathbb{R}^+ \times \c{2}{\{\text{Ts},\text{Ud},\text{Ve},\text{Pn},\text{Go}\}}

Transformed features: p=7p=7

  • X=R+×R+×{0,1}5X' = \mathbb{R}^+ \times \mathbb{R}^+ \times \c{2}{\{0,1\}^5}

with:

  • xTs(i)=1(xcity(i)=Ts)x^{(i)}\subtext{Ts} = \mathbf{1}(x^{(i)}\subtext{city}=\text{Ts})
  • xUd(i)=1(xcity(i)=Ud)x^{(i)}\subtext{Ud} = \mathbf{1}(x^{(i)}\subtext{city}=\text{Ud})
  • ...

hence, e.g.:

  • (11,153,Ts)(11,153,1,0,0,0,0)(11,153,\c{2}{\text{Ts}}) \mapsto (11,153,\c{2}{1,0,0,0,0})
  • (79,151,Ud)(79,151,0,1,0,0,0)(79,151,\c{2}{\text{Ud}}) \mapsto (79,151,\c{2}{0,1,0,0,0})
305 / 366
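
A minimal sketch of one-hot encoding for one categorical variable (here with String values; the toy main reproduces the city example above):

import java.util.Arrays;
import java.util.List;

class OneHot {
  // one categorical variable with k possible values -> k dummy variables
  static double[] encode(String value, List<String> values) {
    double[] dummies = new double[values.size()];
    dummies[values.indexOf(value)] = 1; // x_{h_k} = 1(x_j = x_{j,k})
    return dummies;
  }

  public static void main(String[] args) {
    List<String> cities = List.of("Ts", "Ud", "Ve", "Pn", "Go");
    // Ts -> (1, 0, 0, 0, 0), as in (11, 153, Ts) -> (11, 153, 1, 0, 0, 0, 0)
    System.out.println(Arrays.toString(encode("Ts", cities)));
  }
}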

From binary to multiclass: one-vs-one

Let flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'\subtext{predict}} be a learning technique applicable to X,YbinaryX,Y\subtext{binary} where Ybinary={Pos,Neg}\c{3}{Y\subtext{binary}=\{\text{Pos},\text{Neg}\}} that produces models in MM, i.e., flearn:P(X×Ybinary)M\c{1}{f'\subtext{learn}}: \mathcal{P}^*(X \times \c{3}{Y\subtext{binary}}) \to M and fpredict:X×MYbinary\c{1}{f'\subtext{predict}}: X \times M \to \c{3}{Y\subtext{binary}}.

Let Y={y1,,yk}\c{2}{Y=\{y_1,\dots,y_k\}} a finite set with k>2k>2 values.

Consider a new learning technique flearn,ovo,fpredict,ovof'\subtext{learn,ovo},f'\subtext{predict,ovo}, based on flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'\subtext{predict}}, that:

In learning: flearn,ovo:P(X×Y)Mk(k1)2f'\subtext{learn,ovo}: \mathcal{P}^*(X \times \c{2}{Y}) \to M^{\frac{k(k-1)}{2}}

Given a DP(X×Y)D \in \mathcal{P}^*(X \times \c{2}{Y}):

  1. set M=\mathcal{M}=\emptyset
  2. for each pair of classes, i.e., pair (h1,h2){1,,k}(h_1,h_2) \in \{1,\dots,k\} such that h1<h2h\sub{1} < h\sub{2} k(k1)2=(k2)\frac{k(k-1)}{2}=\binom{k}{2} times
    1. builds DD' by taking only the observations in which y(i)=yh1\c{2}{y^{(i)}}=y_{h_1} or y(i)=yh2\c{2}{y^{(i)}}=y_{h_2}
    2. set each y(i)=Pos\c{3}{y'^{(i)}}=\text{Pos} if y(i)=yh1\c{2}{y^{(i)}}=y_{h_1}, or y(i)=Neg\c{3}{y'^{(i)}}=\text{Neg} otherwise
    3. learns a model mh1,h2m_{h_1,h_2} with flearnf'\subtext{learn}, puts it in M\mathcal{M}
  3. returns M\mathcal{M}

each mh1,h2m_{h_1,h_2} is a binary classification model learned on D<D|D'| < |D| obs.

In prediction: fpredict,ovo:X×Mk(k1)2Yf'\subtext{predict,ovo}: X \times M^{\frac{k(k-1)}{2}} \to \c{2}{Y}

Given an xXx \in X and a model MMk(k1)2\mathcal{M} \in M^{\frac{k(k-1)}{2}}:

  1. sets v=0Nk\vect{v}=\vect{0} \in \mathbb{N}^k
  2. for each mh1,h2Mm_{h_1,h_2} \in \mathcal{M} k(k1)2=(k2)\frac{k(k-1)}{2}=\binom{k}{2} times
    1. applies fpredictf'\subtext{predict} on xx with mh1,h2m_{h_1,h_2} and increments vh1v_{h_1} if the outcome is y=Pos\c{3}{y}=\text{Pos}, or vh2v_{h_2} otherwise
  3. returns yhy\sub{h^\star} with h=arg maxhvhh^\star=\argmax_{h} v_h

v\vect{v} counts the times a class has been predicted

can be extended for giving a probability

306 / 366

From binary to multiclass: one-vs-all

Let flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'''\subtext{predict}} be a learning technique with confidence/probability, i.e., flearn:P(X×Ybinary)M\c{1}{f'\subtext{learn}}: \mathcal{P}^*(X \times \c{3}{Y\subtext{binary}}) \to M and fpredict:X×MR\c{1}{f'''\subtext{predict}}: X \times M \to \mathbb{R}, with fpredict(x,m)\c{1}{f'''\subtext{predict}}(x,m) being the confidence that xx is Pos\text{Pos}. probability would be [0,1]\to [0,1]

Let Y={y1,,yk}\c{2}{Y=\{y_1,\dots,y_k\}} a finite set with k>2k>2 values.

Consider a new learning technique flearn,ova,fpredict,ovaf'\subtext{learn,ova},f'\subtext{predict,ova}, based on flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'''\subtext{predict}}, that:

In learning: flearn,ova:P(X×Y)Mkf'\subtext{learn,ova}: \mathcal{P}^*(X \times \c{2}{Y}) \to M^k

Given a DP(X×Y)D \in \mathcal{P}^*(X \times \c{2}{Y}):

  1. set M=\mathcal{M}=\emptyset
  2. for each class, i.e., h{1,,k}h \in \{1,\dots,k\} kk times
    1. builds DD' by setting each y(i)=Pos\c{3}{y'^{(i)}}=\text{Pos} if y(i)=yh\c{2}{y^{(i)}}=y_h, or y(i)=Neg\c{3}{y'^{(i)}}=\text{Neg} otherwise
    2. learns a model mhm_h with flearnf'\subtext{learn}, puts it in M\mathcal{M}
  3. returns M\mathcal{M}

each mhm_h is a binary classification model learned on D=D|D'|=|D| obs.

In prediction: fpredict,ova:X×MkYf'\subtext{predict,ova}: X \times M^k \to \c{2}{Y}

Given an xXx \in X and a model MMk\mathcal{M} \in M^k:

  1. sets v=0Rk\vect{v}=\vect{0} \in \mathbb{R}^k
  2. for each mhMm_h \in \mathcal{M} kk times
    1. applies fpredictf'''\subtext{predict} on xx with mhm_h and sets vhv_h to the outcome fpredict(x,mh)\c{1}{f'''\subtext{predict}}(x,m_h)
  3. returns yhy\sub{h^\star} with h=arg maxhvhh^\star=\argmax_{h} v_h

v\vect{v} holds the confidences for each class

can be extended for giving a probability
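
Both schemes are available, e.g., in scikit-learn as wrappers around any binary technique (a minimal sketch on the Iris dataset, with SVC as the base binary learner):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # k = 3 classes

# one-vs-one: learns k(k-1)/2 = 3 binary models, predicts by majority voting
ovo = OneVsOneClassifier(SVC()).fit(X, y)
# one-vs-all (one-vs-rest): learns k = 3 binary models, predicts the class
# whose model gives the largest confidence
ova = OneVsRestClassifier(SVC()).fit(X, y)
print(len(ovo.estimators_), len(ova.estimators_))  # 3 3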

307 / 366

XX, YY and applicability: \approx fixed!

Let X=X1××XpX=X_1 \times \dots \times X_p:

$X_j$ $Y$ RF SVM SVM+
Numerical Binary classification ✅ ✅ ✅
Categorical Binary classification ✅ ❌ ✅
Numerical + Categorical Binary classification ✅ ❌ ✅
Numerical Multiclass classification ✅ ❌ ✅
Categorical Multiclass classification ✅ ❌ ✅
Numerical + Categorical Multiclass classification ✅ ❌ ✅
Numerical Regression ✅ ❌ ❌³
Categorical Regression ✅ ❌ ❌³
Numerical + Categorical Regression ✅ ❌ ❌³

SVM+¹²: SVM + one-vs-one/one-vs-all + dummy variables

  1. Not a real name...
  2. In practice, most ML sw/libraries do everything transparently, and let you use SVM+ instead of SVM.
  3. For regression, SVR or other variants.
308 / 366

Missing values

In many practical, business cases, some variables for some observations might miss a value. Formally, $x_j \in X_j \cup \{\c{1}{\varnothing}\}$. here $\varnothing$ denotes a missing value, not the empty set $\emptyset$

Examples: (extended carousel)

  • X=R+×R+×{Ts,Ud,Ve,Pn,Go}X = \mathbb{R}^+ \times \mathbb{R}^+ \times \{\text{Ts},\text{Ud},\text{Ve},\text{Pn},\text{Go}\}
  • x=(15,,Ts)x=(15, \c{1}{\varnothing}, \text{Ts}) x=(15,,1,0,0,0,0)\vect{x}'=(15, \c{1}{\varnothing}, 1,0,0,0,0)
  • $x=(12, 155, \c{1}{\varnothing}) \mapsto \vect{x}'=(12, 155, \c{1}{0,0,0,0,0})$, actually not a problem!

Trees and SVM cannot work!

  • a tree cannot test xheightτx\subtext{height} \le \tau
  • the SMC/SVM cannot compute xx(i)\vect{x}^\intercal\vect{x}^{(i)}

Solutions:

  • drop the variable(s) with missing values (ok if many missing values) otherwise, not ok
  • fill with most common value or mean value
    • arg maxxj,kXji1(xj(i)=xj,k)\varnothing \gets \argmax_{x_{j,k} \in X_j} \sum_i \mathbf{1}(x^{(i)}_j = x_{j,k}) for categorical variables
    • 1i1(xj(i))i:xj(i)xj(i)\varnothing \gets \frac{1}{\sum_i \mathbf{1}(x^{(i)}_j \ne \varnothing)} \sum_{i: x^{(i)}_j \ne \varnothing} x^{(i)}_j for numerical variables
  • replace with a new class, only for categorical variable
  • ...
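
A sketch of the fill-with-mean/most-common solutions with scikit-learn's SimpleImputer (the toy values are made up):

import numpy as np
from sklearn.impute import SimpleImputer

x_num = np.array([[15.0], [np.nan], [12.0]])  # numerical, one value missing
print(SimpleImputer(strategy="mean").fit_transform(x_num).ravel())  # [15. 13.5 12.]

x_cat = np.array([["Ts"], [np.nan], ["Ts"], ["Ud"]], dtype=object)  # categorical
print(SimpleImputer(strategy="most_frequent").fit_transform(x_cat).ravel())
# or: replace with a new class
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(x_cat).ravel())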
309 / 366

Naive Bayes

310 / 366

Guess the gender¹

You are in the line 🚶🚶‍♂️🚶‍♀️🚶🚶🚶‍♀️🚶‍♂️🚶🚶‍♀️ at the cinema 🏪.

The ticket 🎟 of the person before you in the line falls on the ground.

The person has long hair.

Do you say "excuse me, sir" 🧔‍♀️ or "excuse me, madam" 👩?

  1. For clarity, let's assume there are two possible genders.

More formally:

  • X=XhairX=X\subtext{hair} might be Xhair={long,¬long}X\subtext{hair}=\set{\text{long},\neg\text{long}}, or a bigger set; not relevant here
  • Y={man,woman}Y=\{\text{man},\text{woman}\}
  • you are fpredictf\subtext{predict}
  • your life is flearnf\subtext{learn}
  • fpredict(long)=?f\subtext{predict}(\text{long}) = ?
311 / 366

Reason with probability

According to your life, you have collected some knowledge, that you can express as probabilities:

  • the probability that a random person is a man is the same as that of being a woman
    • Pr ⁣(a person is a man)=Pr ⁣(p=man)=0.5=Pr ⁣(p=woman)\prob{\text{a person is a man}}=\prob{p=\text{man}}=0.5=\prob{p=\text{woman}}
  • the probability that a man has long hair is low
    • Pr ⁣(h=longp=man)=0.04\prob{h=\text{long} \mid p=\text{man}}=0.04
  • the probability that a woman has long hair is higher
    • Pr ⁣(h=longp=woman)=0.5\prob{h=\text{long} \mid p=\text{woman}}=0.5

where Pr ⁣(AB)\prob{A \mid B} is the conditional probability, i.e., the probability that, given that the event BB occurred, the event AA occurs

312 / 366

Guessing the gender with probability

Do you say "excuse me, sir" 🧔‍♀️ or "excuse me, madam" 👩?

So, we want to know Pr ⁣(p=manh=long)\prob{p=\text{man} \mid h=\text{long}} and Pr ⁣(p=womanh=long)\prob{p=\text{woman} \mid h=\text{long}}, or maybe just if:

  • Pr ⁣(p=manh=long)>?Pr ⁣(p=womanh=long)\prob{p=\text{man} \mid h=\text{long}} \stackrel{?}{>} \prob{p=\text{woman} \mid h=\text{long}}

But we know $\prob{h=\text{long} \mid p=\text{man}}$, not $\prob{p=\text{man} \mid h=\text{long}}$...

In general, Pr ⁣(AB)Pr ⁣(BA)\prob{A \mid B} \ne \prob{B \mid A}.

  • Pr ⁣(win lotteryplay lottery)Pr ⁣(play lotterywin lottery)\prob{\text{win lottery} \mid \text{play lottery}} \ne \prob{\text{play lottery} \mid \text{win lottery}}
313 / 366

The Bayes rule

Pr ⁣(A)Pr ⁣(BA)=Pr ⁣(A,B)=Pr ⁣(B)Pr ⁣(AB)\prob{A} \prob{B \mid A}=\prob{A, B} = \prob{B} \prob{A \mid B}

where Pr ⁣(A,B)\prob{A,B} is the probability that both AA and BB occur.

Pr ⁣(BA)=Pr ⁣(B)Pr ⁣(AB)Pr ⁣(A)\prob{B \mid A}=\frac{\prob{B} \prob{A \mid B}}{\prob{A}}

What we know:

  • Pr ⁣(man)=0.5\prob{\text{man}}=0.5
  • Pr ⁣(woman)=0.5\prob{\text{woman}}=0.5
  • Pr ⁣(longman)=0.04\prob{\text{long} \mid \text{man}}=0.04
  • Pr ⁣(longwoman)=0.5\prob{\text{long} \mid \text{woman}}=0.5

What we compute:

  • Pr ⁣(manlong)=Pr ⁣(man)Pr ⁣(longman)Pr ⁣(long)=0.50.04Pr ⁣(long)=0.02Pr ⁣(long)\prob{\text{man} \mid \text{long}} = \frac{\prob{\text{man}} \prob{\text{long} \mid \text{man}}}{\prob{\text{long}}}=\frac{0.5 \cdot 0.04}{\prob{\text{long}}}=\frac{0.02}{\prob{\text{long}}}
  • Pr ⁣(womanlong)=Pr ⁣(woman)Pr ⁣(longwoman)Pr ⁣(long)=0.50.5Pr ⁣(long)=0.25Pr ⁣(long)\prob{\text{woman} \mid \text{long}} = \frac{\prob{\text{woman}} \prob{\text{long} \mid \text{woman}}}{\prob{\text{long}}}=\frac{0.5 \cdot 0.5}{\prob{\text{long}}}=\frac{0.25}{\prob{\text{long}}}
  • 0.02Pr ⁣(long)<0.25Pr ⁣(long)\frac{0.02}{\prob{\text{long}}} < \frac{0.25}{\prob{\text{long}}} \Rightarrow 👩 \Rightarrow "excuse me, madam"

We do not really need to know Pr ⁣(long)\prob{\text{long}}!

but it could be computed, in some cases
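
As a quick check, the same computation in Python (just the numbers above; $\prob{\text{long}}$ cancels out in the comparison):

p_man, p_woman = 0.5, 0.5
p_long_given_man, p_long_given_woman = 0.04, 0.5

post_man = p_man * p_long_given_man        # 0.02, unnormalized posterior
post_woman = p_woman * p_long_given_woman  # 0.25, unnormalized posterior
print("madam" if post_woman > post_man else "sir")  # madam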

314 / 366

Guess the gender II

You are in the line 🚶🚶‍♂️🚶‍♀️🚶🚶🚶‍♀️🚶‍♂️🚶🚶‍♀️ at the stadium 🏟.

The ticket 🎟 of the person before you in the line falls on the ground.

The person has long hair.

What we know:

  • Pr ⁣(man at 🏟)=0.98\prob{\text{man at 🏟}}=\c{1}{0.98}
  • Pr ⁣(woman at 🏟)=0.02\prob{\text{woman at 🏟}}=\c{1}{0.02}
  • Pr ⁣(longman)=0.04\prob{\text{long} \mid \text{man}}=0.04
  • Pr ⁣(longwoman)=0.5\prob{\text{long} \mid \text{woman}}=0.5

What we compute:

  • Pr ⁣(manlong)=Pr ⁣(man)Pr ⁣(longman)Pr ⁣(long)=0.980.04Pr ⁣(long)=0.0392Pr ⁣(long)\prob{\text{man} \mid \text{long}} = \frac{\prob{\text{man}} \prob{\text{long} \mid \text{man}}}{\prob{\text{long}}}=\frac{\c{1}{0.98} \cdot 0.04}{\prob{\text{long}}}=\frac{\c{1}{0.0392}}{\prob{\text{long}}}
  • Pr ⁣(womanlong)=Pr ⁣(woman)Pr ⁣(longwoman)Pr ⁣(long)=0.020.5Pr ⁣(long)=0.01Pr ⁣(long)\prob{\text{woman} \mid \text{long}} = \frac{\prob{\text{woman}} \prob{\text{long} \mid \text{woman}}}{\prob{\text{long}}}=\frac{\c{1}{0.02} \cdot 0.5}{\prob{\text{long}}}=\frac{\c{1}{0.01}}{\prob{\text{long}}}
  • 0.0392Pr ⁣(long)>0.01Pr ⁣(long)\frac{\c{1}{0.0392}}{\prob{\text{long}}} > \frac{\c{1}{0.01}}{\prob{\text{long}}} \Rightarrow 🧔 \Rightarrow "excuse me, sir"

Different natural probability of a person at the stadium being a man!

315 / 366

Prior, posterior, evidence

Pr ⁣(eventevidence)=Pr ⁣(event)Pr ⁣(evidenceevent)Pr ⁣(evidence)\c{2}{\prob{\text{event} \mid \text{evidence}}}=\c{1}{\prob{\text{event}}}\c{3}{\frac{\prob{\text{evidence} \mid \text{event}}}{\prob{\text{evidence}}}}

  • prior: the natural probability of the event
    • what we know in advance
  • posterior: the probability of the event, given some evidence
    • what we want to know
  • the factor $\frac{\prob{\text{evidence} \mid \text{event}}}{\prob{\text{evidence}}}$: a correction we apply to the prior knowing the evidence
316 / 366

Bayes for supervised ML

Assume classification with categorical indep. vars:

  • X=X1××XpX = X_1 \times \dots \times X_p
    • with Xj={xj,1,,xj,hj}X_j=\{x_{j,1}, \dots, x_{j,h_j}\}
  • Y={y1,,yk}Y = \{y_1, \dots, y_k\}

Pr ⁣(eventevidence)=Pr ⁣(event)Pr ⁣(evidenceevent)Pr ⁣(evidence)\c{2}{\prob{\text{event} \mid \text{evidence}}}=\c{1}{\prob{\text{event}}}\c{3}{\frac{\prob{\text{evidence} \mid \text{event}}}{\prob{\text{evidence}}}}

  • event: yy is one specific class, y=ymy=y_m
  • evidence: xx is one specific observation, x=(x1,l1,,xp,lp)x=(x_{1,l_1},\dots,x_{p,l_p})

Hence: Pr ⁣(y=ymx=(x1,l1,,xp,lp))=Pr ⁣(y=ym)Pr ⁣(x=(x1,l1,,xp,lp)y=ym)Pr ⁣(x=(x1,l1,,xp,lp))\c{2}{\prob{y=y_m \mid x=(x_{1,l_1},\dots,x_{p,l_p})}}=\c{1}{\prob{y=y_m}}\c{3}{\frac{\prob{x=(x_{1,l_1},\dots,x_{p,l_p}) \mid y=y_m}}{\prob{x=(x_{1,l_1},\dots,x_{p,l_p})}}} or, more briefly: p(ymx1,l1,,xp,lp)=p(ym)p(x1,l1,,xp,lpym)p(x1,l1,,xp,lp)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}

317 / 366

Required knowledge

p(ymx1,l1,,xp,lp)=p(ym)p(x1,l1,,xp,lpym)p(x1,l1,,xp,lp)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}

What do we need for predicting yy from a xx?

  1. compute p(ymx1,l1,,xp,lp)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)} for each ymy_m
    • hence, each p(ym)\c{1}{p(y_m)} and each p(x1,l1,,xp,lpym)\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}
    • no need to compute p(x1,l1,,xp,lp)\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)} for the comparison
  2. take the yy with the largest value

Where to find them?

💡: in the learning data DD!

  • each p(ym)\c{1}{p(y_m)}: just count the observations in DD with y=ymy=y_m
  • each p(x1,l1,,xp,lpym)\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}: just count the obs. in DD with y=ymy=y_m and x=(x1,l1,,xp,lp)x=\left(x_{1,l_1},\dots,x_{p,l_p}\right)
    • what if the count is 00? 🤔 not that unlikely...
    • how many combinations should I store? kj=1j=phjk \prod_{j=1}^{j=p} h_j
318 / 366

Independent independent¹ variables

Let's make the naive hypothesis that the independent variables are independent¹ of each other: $\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1} \mid y_m, \dots, x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}$ becomes: the rewriting $p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right) = p\left(x_{1,l_1} \mid y_m, \dots, x_{p,l_p} \mid y_m\right)$ always holds, even without independence

p(ymx1,l1,,xp,lp)=p(ym)p(x1,l1,,xp,lp)p(x1,l1ym)p(xp,lpym)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\frac{\c{1}{p(y_m)}}{\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}} \c{3}{p\left(x_{1,l_1} \mid y_m\right)} \dots \c{3}{p\left(x_{p,l_p} \mid y_m\right)}

Where to find them?

💡: in the learning data DD!

  • each p(ym)\c{1}{p(y_m)}: just count the observations in DD with y=ymy=y_m
  • each $\c{3}{p\left(x_{j,l_j} \mid y_m\right)}$: just count the obs. in $D$ with $y=y_m$ and $x_j=x_{j,l_j}$
    • what if the count is 00? unlikely, but possible
    • how many combinations should I store? j=1j=pkhj \sum_{j=1}^{j=p}k h_j
  1. The first "independent" refers to xjx_j and yy; the second "independent" refers to xjx_j and xjx_{j'}.
319 / 366

Naive Bayes

The technique based on the independence hypothesis is called Naive Bayes:

  • based on the Bayes rule
  • with a naive independence hypothesis

Learning:

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}p\vect{p}

function learn({(x(i),y(i))}i=1i=n)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i=1}^{i=n}) {
p\vect{p} \gets \emptyset
for m{1,,Y}m \in \{1, \dots, |Y|\} { //Y=k|Y|=k
pm1ni1(y(i)=ym)\c{1}{p_m} \gets \frac{1}{n} \sum_i \mathbf{1}(y^{(i)}=y_m)
for j{1,,p}j \in \{1, \dots, p\} {
for l{1,,Xj}l \in \{1, \dots, |X_j|\} { //Xj=hj|X\sub{j}|=h\sub{j}
pm,j,li1(y(i)=ymxj(i)=xj,l)i1(y(i)=ym)\c{3}{p_{m,j,l}} \gets \frac{\sum_i \mathbf{1}(y^{(i)}=y_m \land x_j^{(i)}=x_{j,l})}{\sum_i \mathbf{1}(y^{(i)}=y_m)}
}
}
}
return p\vect{p}
}

The model p\vect{p} is some data structure holding k+j=1j=pkhjk+\sum\sub{j=1}^{j=p}k h\sub{j} numbers, i.e., p[0,1]k+j=1j=pkhj\vect{p} \in [0,1]^{k+\sum\sub{j=1}^{j=p}k h\sub{j}}.

Prediction:

fpredictf'\subtext{predict}x,px,\vect{p}yy

function predict(x,p)\text{predict}(x,\vect{p}) { //x=(xl1,,xlp)x=(x\sub{l\sub{1}},\dots,x\sub{l\sub{p}})
marg maxm{1,,Y}pmj=1j=ppm,j,ljm^\star \gets \argmax_{m \in \{1,\dots,|Y|\}} \c{1}{p_m} \prod_{j=1}^{j=p} \c{3}{p_{m,j,l_j}}
return $y_{m^\star}$
}

Or, with probability:

function predict-with-prob(x,p)\text{predict-with-prob}(x,\vect{p}) {
return ympmj=1j=ppm,j,ljm=1m=Ypmj=1j=ppm,j,ljy_m \mapsto \frac{\c{1}{p_m} \prod_{j=1}^{j=p} \c{3}{p_{m,j,l_j}}}{\sum_{m'=1}^{m'=|Y|} \c{1}{p_{m'}} \prod_{j=1}^{j=p} \c{3}{p_{m',j,l_j}}}
}
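
In scikit-learn, CategoricalNB implements this counting for categorical variables (a minimal sketch; the tiny dataset is made up, and the alpha smoothing term is one way to address the count-is-0 issue):

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [["long", "Ts"], ["long", "Ud"], ["short", "Ts"], ["short", "Ud"]]
y = ["woman", "woman", "man", "woman"]

enc = OrdinalEncoder()  # CategoricalNB expects integer-coded categories
X = enc.fit_transform(X_raw)
model = CategoricalNB(alpha=1.0).fit(X, y)  # alpha: Laplace smoothing

x = enc.transform([["long", "Ts"]])
print(model.predict(x), model.predict_proba(x))  # predicted class + probabilities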

320 / 366

Naive Bayes: summary

Efficiency 👍👍👍

  • 👍 very very fast
    • in particular with very large datasets, in both nn and pp

Explainability/interpretability 👍👍

  • 👍 the model is a bunch of probabilities!
  • 👍 the technique is very simple

Effectiveness 🫳

  • 🫳 not so good
    • the more false the independence hypothesis is for the system, the less effective

Applicability 🫳

  • 🫳 YY: classification
  • 🫳 XX: only categorical variables but can be extended to the numerical case
  • 👍 models give probability
  • 👍 no hyperparameters
  • 👍 works natively with missing values
    • just remove the missing $j$ from $\prod_{j=1}^{j=p} \c{3}{p_{m,j,l_j}}$
321 / 366

k-Nearest Neighbors (kNN)

322 / 366

Guess the province

Maps of FVG economy

Given a point on the map, guess its province.

  • e.g., province of the most northern pig 🐖?
  • e.g., province of the most eastern fish 🐟?

More formally:

  • X=R2X= \mathbb{R}^2, i.e., the coordinates on the map
  • Y={Ts,Ud,Pn,Go}Y=\{\text{Ts},\text{Ud},\text{Pn},\text{Go}\}
  • you are fpredictf\subtext{predict}
  • flearnf\subtext{learn} is looking at the map¹
    • in particular, the position of the 4 chief towns
  1. Let's pretend we do not know the real boundaries of the (former) provinces...
323 / 366

The closest chief town

Tentative explanation of your reasoning, given a point xx on the map:

  1. look at the closest chief town
  2. say that the province of $x$ is the one of the closest chief town

More generally, in prediction, given a learning set $D$:

  1. find the kk closest observations in DD (the nearest neighbors)
  2. say that yy is the most frequent (if classification) or the mean (if regression) of the kk closest observations

This is the k-Nearest Neighbors learning technique!

324 / 366

k-Nearest Neighbors

Learning:

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}({(x(i),y(i))}i,k,d)(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})k,d\c{2}{k},\c{3}{d}

function learn({(x(i),y(i))}i,k,d)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i}, \c{2}{k},\c{3}{d}) {
return ({(x(i),y(i))}i,k,d)(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})
}

flearnf'\subtext{learn} does nothing!

The model is the dataset DD

  • and¹ the number of neighbors kN\c{2}{k} \in \mathbb{N}
  • and¹ the distance² d:X×XR\c{3}{d}: X \times X \to \mathbb{R}

kk and dd are parameters!

  1. They are used by fpredictf'\subtext{predict}, not here, but we put them into the model just to not make the signature of fpredictf'\subtext{predict} dirty; ML sw/libraries do the same.
  2. A (dis)similarity measure is enough.

Prediction:

fpredictf'\subtext{predict}x,({(x(i),y(i))}i,k,d)x,(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})yy

function predict(x,({(x(i),y(i))}i,k,d))\text{predict}(x,(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})) {

s0\vect{s} \gets \vect{0} //0Rn\vect{0} \in \mathbb{R}^n
for i{1,,n}i \in \{1,\dots,n\} {
sid(x,x(i))s_i \gets \c{3}{d}(x,x^{(i)})
}
II \gets \emptyset //the neighborhood
while $|I| < \c{2}{k}$ {
II{arg mini{1,,n}Isi}I \gets I \cup \{\argmin_{i \in \{1,\dots,n\} \setminus I} s_i\}
}
return arg maxyYiI1(y(i)=y)\argmax_{y \in Y} \sum_{i \in I} \mathbf{1}(y^{(i)}=y) //most frequent
}

Alternatives:

  • for regression, return 1kiIy(i)\frac{1}{\c{2}{k}}\sum_{i \in I} y^{(i)}
  • with probability, return y1kiI1(y(i)=y)y \mapsto \frac{1}{\c{2}{k}}\sum_{i \in I} \mathbf{1}(y^{(i)}=y)
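
A minimal sketch with scikit-learn (fit essentially just stores the dataset, consistently with $f'\subtext{learn}$ doing nothing):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# the two parameters: k (n_neighbors) and d (here the Minkowski distance
# with p = 2, i.e., the Euclidean distance, which is the default)
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2).fit(X, y)
print(knn.predict(X[:1]))        # most frequent class in the neighborhood
print(knn.predict_proba(X[:1]))  # the "with probability" variant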
325 / 366

The distance

By using a proper distance d:X×XRd: X \times X \to \mathbb{R}, kNN can be used on any XX! (applicability 👍👍👍)

Common cases: there is a large literature on distances

  • for vectorial spaces, i.e., X=RpX=\mathbb{R}^p
    • \ell-norms: with \ell being a parameter, d(x,x)=x,x=jxjxjd(\vect{x},\vect{x}')=\lVert \vect{x},\vect{x}' \rVert_\ell=\sqrt[\ell]{\sum_j |x_j-x'_j|^\ell}
      • Euclidean with =2\ell=2
      • Manhattan with =1\ell=1
    • cosine distance: $d(\vect{x},\vect{x}')=1-\frac{\vect{x}^\intercal\vect{x}'}{\lVert \vect{x} \rVert \lVert \vect{x}' \rVert}$ $\lVert \cdot \rVert$ is just $\lVert \cdot \rVert_2$
      • disregards the individual scales of the points
    • many others
  • for fixed-length sequences of symbols in an alphabet AA, i.e., X=AlX=A^l
    • Hamming distance: d(x,x)=k=1k=l1(xkxk)d(x,x')=\sum_{k=1}^{k=l} \mathbf{1}(x_k \ne x'_k)
    • edit distance (many variants)
  • for variable-length sequences of symbols in an alphabet AA, i.e., X=AX=A^*
    • edit distance or Hamming with some adjustments
  • for sets, i.e., X=P(A)X=\mathcal{P}(A)
    • Jaccard distance: d(x,x)=1xxxxd(x,x')=1-\frac{|x \cap x'|}{|x \cup x'|}
  • and combinations of these ones!

Choose one that helps to capture the dependency of yy on xx!
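
A few of these distances as plain Python sketches:

import numpy as np

def ell_norm(x1, x2, l=2):  # ℓ-norm distance: l=2 Euclidean, l=1 Manhattan
    return np.sum(np.abs(x1 - x2) ** l) ** (1 / l)

def hamming(s1, s2):  # fixed-length sequences of symbols
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def jaccard(a, b):  # sets
    return 1 - len(a & b) / len(a | b)

print(ell_norm(np.array([0, 0]), np.array([3, 4])))  # 5.0
print(hamming("karolin", "kathrin"))                 # 3
print(jaccard({"🐖", "🐟"}, {"🐟", "🐄"}))             # 0.666...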

326 / 366

Role of the kk parameter

kNN decision boundaries with two k values

Error vs k in kNN

images from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013

Yes, it is a flexibility parameter: link with the Bayes classifier!

  • the larger the kk the more global the estimate of p(yx)p(y \mid x); the smaller, the more local
  • if k=nk=n then p(yx)p(y \mid x) does not actually use xx, the neighborhood is the entire DD \rightarrow high bias
  • if k=1k=1 then p(yx)p(y \mid x) depends on just one point, little noise can change the output \rightarrow high variance
327 / 366

kNN: summary

Efficiency 🫳

  • 🫳 struggles with large nn in prediction
  • 👍 no actual learning phase

Explainability/interpretability 👍

  • 👍 the neighborhood is itself a local explanation of the decision

Effectiveness 🫳

  • 🫳 not particularly good, in practice
    • depends on kk

Applicability 👍

  • 👍 YY: regression and both classifications
  • 👍 XX: everything, if you have a proper distance dd
    • but tricky with mixed numerical/categorical cases
  • 👍 models give probability
  • 🫳 two parameters ($d$ and $k$), one impacting the bias-variance trade-off
328 / 366

Lab 2¹: comparison of ML techniques

Consider the DataCo Smart Supply Chain for Big Data Analysis dataset

  • given the objective of classifying if an order is marked as late delivery, design and implement an ML procedure which answers the question: what is the best classification technique?

  • given the objective of predicting the sales of each order, design and implement an ML procedure which answers the question: what is the best regression technique?

consider the ML techniques seen during the lectures

Hints:

  • the dataset is really big (~180k rows): use this to your advantage!
  • in Python, the pandas library is the most popular for dataset manipulations and explorations
  • about ML algorithms, you can find all the ones you need for this lab in the scikit-learn library

1: designed by Gaia Saveri, tutor A.Y. 2023/2024

329 / 366

Unsupervised learning

Clustering

330 / 366

Back to the origin

Machine Learning is the science of getting computers to learn without being explicitly programmed.

\downarrow

Supervised (Machine) Learning is the science of getting computers to learn f:XYf: X \to Y from examples autonomously.

\downarrow

Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.

331 / 366

Unsupervised learning definition

Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.

What's a pattern?

  • pattern [ˈpat(ə)n]: a model or design used as a guide in needlework and other crafts

In practice:

  • we assume that the system that generates the data follows some scheme (the pattern)
  • we do not know the pattern
  • we want to discover the pattern from a dataset
332 / 366

Supervised vs. unsupervised

Supervised (Machine) Learning is the science of getting computers to learn f:XYf: X \to Y from examples autonomously.

Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.

Key differences

In supervised learning:

  • $y$ is a property of $x$
  • one example is a pair $(x,y)$
  • what we learn from a dataset can be applied to other $x$

In unsupervised learning:

  • the pattern is a property of the system $s$
  • the example is the dataset $\mathcal{P}^*(X)$
  • what we learn from the dataset is not, in general, usable on another dataset
    • hence, "find patterns from data" is fairer than "learn patterns from data"
333 / 366

Pattern?

In most of the cases, the pattern one is looking for is grouping:

  • i.e., we assume the system generates data that is grouped, but we do not know what the groups are

This form of unsupervised learning is called clustering:

  • given a dataset, find the clusters
    • cluster [kluhs-ter]: a group of things or persons close together
    • "close together" \rightarrow there is some implicit notion of distance (or similarity)

Meme unsupervised learning vs. clustering

334 / 366

Clustering, more formally

Given a dataset DP(X)D \in \mathcal{P}^*(X), find a partitioning {D1,,Dk}\{D_1, \dots, D_k\} of DD such that the elements in each DiD_i are "close together".

  • each DiD_i is a cluster

Is this a formal and complete definition? No!

  • what does it mean "close together"?
    • we need a distance/(dis)similarity metric $d: X \times X \to \R^+$, but it's not an input of the problem (it's not in the "given" part)
  • second, how close? what elements?
    • intuitively, we want any two elements of the same cluster to be closer to each other than any two elements of different clusters
  • third: where does kk (the number of clusters) come from? like dd, it's not an input of the problem

In practice:

  • dd is dictated by XX and is reasonable
    • that is, you first shape $X$ (feature engineering), then select a reasonable $d$ for that $X$
  • kk is unknown
    • mostly suggested/bounded by the context
    • picked within the reasonable range
335 / 366

Clustering as optimization

In principle, clustering looks like a (biobjective) optimization problem (given $D \in \mathcal{P}^*(X)$, $k \in \{1,\dots,|D|\}$, and $d: X \times X \to \mathbb{R}^+$):

maxD1,,Dk  (i,i:iixDi,xDid(x,x))(ix,xDid(x,x))subject to  D1Dk=DDiDi=i,i{1,,k} \begin{align*} \max_{D_1, \dots, D_k} & \; \left(\c{4}{\sum _{i,i': i\ne i'}\sum_{x \in D_i, x' \in D_{i'}} d(x,x')}\right) - \left(\c{2}{\sum_i \sum_{x, x' \in D_i} d(x,x')}\right) \\ \text{subject to} & \; \begin{array}{ll} \c{3}{D_1 \cup \dots \cup D_k = D} \\ \c{3}{D_i \cap D_{i'} = \emptyset} & \c{3}{\forall i,i' \in \{1, \dots, k\}} \end{array} \end{align*}

For any k,dk,d, there exists (at least) one optimal solution. In principle, to find it you can just try all the partitions and measure the distance.

In the objective:

  • the first term: maximize the distance between any two $x,x'$ when they belong to different clusters
  • the second term: minimize (i.e., maximize with $-$) the distance between any two $x,x'$ when they belong to the same cluster
  • the constraints: clusters have to form a partition

In practice:

  • you don't know $k$
  • trying all partitions is unfeasible

Here $D$ and each $D_i$ are bags, not sets. A partition on a bag is better defined if you define a bag as a $m: A \to \mathbb{N}$, where $A$ is a set and $m(a)$ is the multiplicity of $a \in A$ in the bag. However, for clustering we can reason on sets, because in practice identical observations should always end up in the same cluster.
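
A sketch of the objective, computed for candidate partitions of a small $D \subset \mathbb{R}$ (here each unordered pair is counted once, which only rescales the double sums above and does not change the comparison):

import itertools

def objective(clusters, d):
    inter = sum(d(x, x2) for D1, D2 in itertools.combinations(clusters, 2)
                for x in D1 for x2 in D2)                    # separation
    intra = sum(d(x, x2) for D in clusters
                for x, x2 in itertools.combinations(D, 2))   # dispersion
    return inter - intra

d = lambda x, x2: abs(x - x2)
print(objective([[1, 2, 3], [6, 7, 9]], d))  # 38: a "natural" partition
print(objective([[1, 2, 6], [3, 7, 9]], d))  # 14: a worse partition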

336 / 366

Assessing clustering

If you assume to know kk and dd, a clustering method:

  • is effective on a DD if it produces the optimal partition
    • or, the closer the produced partition to the optimal one, the more effective
  • is efficient if it does so using few resources (i.e., quickly)

But in practice you don't know kk...

Can we just optimize also kk? That is, can we solve the optimization problem for every kk and take the best?

maxk,D1,,Dk  (i,i:iixDi,xDid(x,x))(ix,xDid(x,x))subject to  D1Dk=DDiDi=i,i{1,,k} \begin{align*} \max_{k, D_1, \dots, D_k} & \; \left(\c{4}{\sum _{i,i': i\ne i'}\sum_{x \in D_i, x' \in D_{i'}} d(x,x')}\right) - \left(\c{2}{\sum_i \sum_{x, x' \in D_i} d(x,x')}\right) \\ \text{subject to} & \; \begin{array}{ll} \c{3}{D_1 \cup \dots \cup D_k = D} \\ \c{3}{D_i \cap D_{i'} = \emptyset} & \c{3}{\forall i,i' \in \{1, \dots, k\}} \end{array} \end{align*}

If you also optimize kk, then the optimal solution is the one with k=Dk=|D|...

Can we just optimize also kk? No! It's pointless.

Extreme cases:

  • k=1k=1, no clustering, just D1=DD_1=D
    • i,i=0\c{4}{\sum\sub{i,i'}\sum}=0, i=dall\c{2}{\sum\sub{i}\sum}=d\subtext{all} is large, hence the objective is large negative
  • k=Dk=|D|, each observation is a cluster
    • i,i=dall\c{4}{\sum\sub{i,i'}\sum}=d\subtext{all} is large, i=0\c{2}{\sum\sub{i}\sum}=0, hence the objective is large positive
  • in between, always increasing
337 / 366

Assessing clustering in practice

How do you evaluate a partitioning of DD in practice?

  • you inspect it manually
  • you insert the clustering inside the larger information processing system it belongs to and measure some other index e.g., how rich 💰💰💰 you become with this, rather than that, clustering technique?
    • a form of extrinsic evaluation: you look at the result in a larger context
  • you measure some performance indexes devised for clustering
    • a form of intrinsic evaluation: you look at the result alone

Question: is manual inspection intrinsic or extrinsic?

338 / 366

Clustering performance indexes

There are many of them; most are based on the idea of measuring separateness or density of clustering.

Silhouette index: it considers, for each observation, the average distance to the observations in the same cluster and the min distance to the observations in other clusters: sˉ({Di}i=1i=k)=1iDixiDidout(x,{Di}i)din(x,{Di}i)max(dout(x,{Di}i),din(x,{Di}i))\bar{s}(\seq{D_i}{i=1}^{i=k})=\frac{1}{\left|\bigcup_i D_i\right|}\sum_{x \in \bigcup_i D_i}\frac{\c{1}{d\subtext{out}(x,\seq{D_i}{i})}-\c{2}{d\subtext{in}(x,\seq{D_i}{i})}}{\max\left(\c{1}{d\subtext{out}(x,\seq{D_i}{i})},\c{2}{d\subtext{in}(x,\seq{D_i}{i})}\right)} where:

dout(x,{Di}i)=minDi∌xminxDid(x,x)\c{1}{d\subtext{out}(x,\seq{D_i}{i})}=\min_{D_i \not\ni x} \min_{x' \in D_i} d(x, x')

$\c{2}{d\subtext{in}(x,\seq{D_i}{i})}=\frac{1}{|D_i \ni x|-1}\sum_{x' \in D_i \ni x, x' \ne x} d(x, x')$

sˉ()[1,1]\bar{s}(\cdot) \in [-1,1]: the larger (closer to 11), the better (i.e., the more separated the clusters).

A similar index is the Dunn index.
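
With scikit-learn (a sketch; note that silhouette_score uses the classic definition, where $d\subtext{out}$ is an average over the closest other cluster rather than a min, but the idea is the same):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)  # y is ignored: clustering is unsupervised

for k in (2, 3, 4, 5):  # compare partitions with different k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))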

339 / 366

Silhouette in practice

Example of Silhouette plot with 3 clusters

sˉ({Di}i)=0.78\bar{s}(\seq{D_i}{i})=0.78

Questions:

  • XX?
  • kk?
  • dd?
340 / 366

Silhouette in practice

Example of Silhouette plot with 4 clusters

sˉ({Di}i)=0.74\bar{s}(\seq{D_i}{i})=0.74

In practice:

  • the greater kk, the lower sˉ()\bar{s}(\cdot)
  • you choose the kk where there is a knee (or elbow)
341 / 366

Hierarchical clustering

342 / 366

Hierarchical clustering

Hierarchical clustering is an iterative method that exists in two versions (agglomerative and divisive). For both:

  • at each $j$-th iteration, there exists one partition $D_1, \dots, D_{k_j}$
  • at most two clusters differ between partitions at subsequent iterations
  • you don't set $k$

That is, partitions are refined by merging (in agglomerative hierarchical clustering) or by division (in divisive hierarchical clustering).

Moreover, since the partition is refined over iterations, a hierarchy among clusters is established:

  • that is, this clustering method gives something more than a simple partition

We'll see just the agglomerative version.

343 / 366

Agglomerative hierarchical clustering

function cluster({x(i)}i=1i=n)\text{cluster}(\seq{x^{(i)}}{i=1}^{i=n}) {
j0j \gets 0
Dj{{x(1)},,{x(n)}}\c{1}{\mathcal{D}_j} \gets \{\{x^{(1)}\},\dots,\{x^{(n)}\}\}
while Dj>1|\c{1}{\mathcal{D}_j}|>1 {
$(i^\star,i^{\prime\star}) \gets \argmin_{i,i' \in \{1,\dots,|\mathcal{D}_j|\} \land i \ne i'} \c{2}{d\subtext{cluster}}(D_{j,i},D_{j,i'})$
Dj+1DjDj,iDj,iDj,iDj,i\c{1}{\mathcal{D}_{j+1}} \gets \c{1}{\mathcal{D}_j} \oplus D_{j,i^\star} \cup D_{j,i^{\prime\star}} \ominus D_{j,i^\star} \ominus D_{j,i^{\prime\star}}
jj+1j \gets j+1
}
return Dj\c{1}{\mathcal{D}_j}
}

  • Dj={Dj,1,,Dj,kj}\c{1}{\mathcal{D}_j}=\{D_{j,1},\dots,D_{j,k_j}\} is the partition at the jj-th iteration
  • dcluster:P(X)×P(X)R+\c{2}{d\subtext{cluster}}: \mathcal{P}^\ast(X) \times \mathcal{P}^\ast(X) \to \mathbb{R}^+ is a (dis)similarity metric defined over sets of observations
    • it's a parameter of the technique
  • DD\mathcal{D} \oplus D adds DD to D\mathcal{D}
  • DD\mathcal{D} \ominus D removes DD from D\mathcal{D}

At each iteration:

  1. consider the current clusters in D\c{1}{\mathcal{D}}
  2. find the closest ones Di,DiD_{i^\star},D_{i^{\prime\star}}
  3. build the next iteration clusters by
    • copying all the existing but DiD_{i^\star} and DiD_{i^{\prime\star}}
    • adding DiDiD_{i^\star} \cup D_{i^{\prime\star}}
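
With scipy, agglomerative hierarchical clustering is a couple of calls (a sketch on the $\mathbb{R}^1$ example of the next slide):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

D = np.array([1, 2, 3, 6, 7, 9, 11, 12, 15, 18], dtype=float).reshape(-1, 1)

Z = linkage(D, method="single")  # single linkage; "complete", "average",
                                 # and "centroid" are also available
print(fcluster(Z, t=3, criterion="maxclust"))  # cut the hierarchy at k=3
# scipy.cluster.hierarchy.dendrogram(Z) plots the hierarchy (needs matplotlib)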
344 / 366

Cluster distances

There exist a few options for dcluster:P(X)×P(X)R+d\subtext{cluster}: \mathcal{P}^\ast(X) \times \mathcal{P}^\ast(X) \to \mathbb{R}^+. All are based on a (dis)similarity metric dd defined over observations, i.e., d:X×XR+d: X \times X \to \mathbb{R}^+.

  • Single linkage (nearest):
dcluster(D,D)=minxD,xDd(x,x)d\subtext{cluster}(D,D')= \min_{x \in D, x' \in D'} d(x,x')

  • Complete linkage (farthest):
dcluster(D,D)=maxxD,xDd(x,x)d\subtext{cluster}(D,D')= \max_{x \in D, x' \in D'} d(x,x')

  • Average linkage:
dcluster(D,D)=1DDxD,xDd(x,x)d\subtext{cluster}(D,D')= \frac{1}{|D| |D'|}\sum_{x \in D, x' \in D'} d(x,x')

  • Centroid: (only if X=RpX=\mathbb{R}^p)
dcluster(D,D)=d(c(D),c(D))d\subtext{cluster}(D,D')= d(c(D),c(D'))

where c(D)=xˉ=1DxDxc(D)=\bar{\vect{x}}=\frac{1}{|D|}\sum\sub{\vect{x} \in D} \vect{x} and xˉ\bar{\vect{x}} is the centroid of DD.

Question: what's the efficiency of the 4 dclusterd\subtext{cluster}?

345 / 366

Example in R1\mathbb{R}^1

Input: D={1,2,3,6,7,9,11,12,15,18}D=\{1,2,3,6,7,9,11,12,15,18\}

Execution¹:

jj Dj\mathcal{D}_j
0 {1},{2},{3},{6},{7},{9},{11},{12},{15},{18}\c{1}{\{1\}}, \c{1}{\{2\}}, \{3\}, \{6\}, \{7\}, \{9\}, \{11\}, \{12\}, \{15\}, \{18\}
1 {1,2},{3},{6},{7},{9},{11},{12},{15},{18}\c{1}{\{1, 2\}}, \c{1}{\{3\}}, \{6\}, \{7\}, \{9\}, \{11\}, \{12\}, \{15\}, \{18\}
2 {1,2,3},{6},{7},{9},{11},{12},{15},{18}\{1, 2,3\}, \c{1}{\{6\}}, \c{1}{\{7\}}, \{9\}, \{11\}, \{12\}, \{15\}, \{18\}
3 {1,2,3},{6,7},{9},{11},{12},{15},{18}\{1, 2,3\}, \{6,7\}, \{9\}, \c{1}{\{11\}}, \c{1}{\{12\}}, \{15\}, \{18\}
4 {1,2,3},{6,7},{9},{11,12},{15},{18}\{1, 2,3\}, \c{1}{\{6,7\}}, \c{1}{\{9\}}, \{11,12\}, \{15\}, \{18\}
5 {1,2,3},{6,7,9},{11,12},{15},{18}\{1, 2,3\}, \c{1}{\{6,7,9\}}, \c{1}{\{11,12\}}, \{15\}, \{18\}
6 {1,2,3},{6,7,9,11,12},{15},{18}\c{1}{\{1, 2,3\}}, \c{1}{\{6,7,9,11,12\}}, \{15\}, \{18\}
7 {1,2,3,6,7,9,11,12},{15},{18}\c{1}{\{1, 2,3,6,7,9,11,12\}}, \c{1}{\{15\}}, \{18\}
8 {1,2,3,6,7,9,11,12,15},{18}\c{1}{\{1, 2,3,6,7,9,11,12,15\}}, \c{1}{\{18\}}
9 {1,2,3,6,7,9,11,12,15,18}\{1, 2,3,6,7,9,11,12,15, 18\}

function cluster({xi}i=1i=n)\text{cluster}(\seq{x_i}{i=1}^{i=n}) {
j0j \gets 0
Dj{{x1},,{xn}}\mathcal{D}_j \gets \{\{x_1\},\dots,\{x_n\}\}
while Dj>1|\mathcal{D}_j|>1 {
$(i^\star,i^{\prime\star}) \gets \c{2}{\argmin}_{i,i' \in \{1,\dots,|\mathcal{D}_j|\} \land i \ne i'} d\subtext{cluster}(D_{j,i},D_{j,i'})$
Dj+1DjDj,iDj,iDj,iDj,i\mathcal{D}_{j+1} \gets \mathcal{D}_j \oplus D_{j,i^\star} \cup D_{j,i^{\prime\star}} \ominus D_{j,i^\star} \ominus D_{j,i^{\prime\star}}
jj+1j \gets j+1
}
return Dj\mathcal{D}_j
}

Assume single linkage:

  • dcluster(D,D)=minxD,xDd(x,x)d\subtext{cluster}(D,D')= \min_{x \in D, x' \in D'} d(x,x')

The output, i.e., the partition of DD, is D9\mathcal{D}_9: the hierarchy is the entire sequence D9,,D0\mathcal{D}_9,\dots,\mathcal{D}_0.

  1. We assume that, in case of tie, the first one is selected by arg min\argmin, i.e., the pair i,ii,i' for which i+ii+i' is the lowest.
346 / 366

Example in R2\mathbb{R}^2

Clustering toy problem: data

Clustering toy problem: distance matrix

Clustering toy problem: dendrogram

The hierarchy {Dj}j\seq{\mathcal{D}_j}{j}, not just the partition Dn1\mathcal{D}_{n-1}, can be visualized in the form of a dendrogram where:

  • each node is a DDD' \subseteq D
  • the root node is DD
  • each node $D'$ has two children $D'_1, D'_2$ that have been merged when forming $D'$
  • the height of each node is the distance dclusterd\subtext{cluster} of its two children

Question: what dclusterd\subtext{cluster} is being used here?

347 / 366

Hierarchical clustering on Iris

Dendrogram on Iris

  • yy is ignored while doing the clustering
    • but used for coloring the dendrogram

By looking at the dendrogram, one can choose an appropriate kk, or simply look at the dendrogram as the pattern.

348 / 366

Partitional clustering

k-means

349 / 366

Refining the partition

Consider the optimization problem behind clustering and the following heuristic¹ for solving it:

  1. start with a random partition {Dh}h\seq{D_h}{h}
  2. until {Dh}h\seq{D_h}{h} is good enough
    1. refine {Dh}h\seq{D_h}{h}
  3. return {Dh}h\seq{D_h}{h}
  1. heuristic [hyoo-ris-tik]: a trial-and-error method of problem solving used when an algorithmic exact approach is impractical.

  • Good?
    • the clusters are well separated
  • Good enough?
    • the partition cannot be further improved
    • or some computational budget has been consumed
350 / 366

k-means clustering

function cluster({x(i)}i=1i=n,k)\text{cluster}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, k) {
for h{1,,k}h \in \{1,\dots,k\} { //set initial centroids
μhx(U({1,,n}))\c{1}{\vect{\mu}_h} \gets \vect{x}^{(\sim U(\{1,\dots,n\}))}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D} \gets \c{2}{\text{assign}}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \c{1}{\seq{\vect{\mu}_h}{h=1}^{h=k}})
while ¬should-stop()\neg\text{should-stop()} {
for h{1,,k}h \in \{1,\dots,k\} { //recompute centroids
μh1DhxDhx\vect{\mu}_h \gets \frac{1}{|D_h|} \sum_{\vect{x} \in D_h} \vect{x}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D}' \gets \c{2}{\text{assign}}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \c{1}{\seq{\vect{\mu}_h}{h=1}^{h=k}})
if D=D\mathcal{D}'=\mathcal{D} {
break
}
DD\mathcal{D} \gets \mathcal{D}'
}
return D\mathcal{D}
}

function assign({x(i)}i=1i=n,{μh}h=1h=k)\c{2}{\text{assign}}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \c{1}{\seq{\vect{\mu}_h}{h=1}^{h=k}}) {
D{,,}\mathcal{D} \gets \{\emptyset,\dots,\emptyset\} //kk empty sets
for i{1,,n}i \in \{1,\dots,n\} {
h=arg minh{1,,k}d(x(i),μh)h^\star = \argmin_{h \in \{1,\dots,k\}} d(\vect{x}^{(i)},\c{1}{\vect{\mu}_h})
DhDh{x(i)}D_{h^\star} \gets D_{h^\star} \cup \{\vect{x}^{(i)}\} //assign to the closest centroid
}
return D\mathcal{D}
}

  • X=RpX = \mathbb{R}^p
    • otherwise you cannot compute the mean as μh1DhxDhx\c{1}{\vect{\mu}_h} \gets \frac{1}{|D_h|} \sum_{\vect{x} \in D_h} \vect{x}
  • μ1,,μk\vect{\mu}_1,\dots,\vect{\mu}_k are the means of the clusters and act as centroids
    • there are kk means!
    • randomly chosen at the first iteration
  • assign()\text{assign()} assigns observations, i.e., points, to closest centroids
  • when there's no change in the partition, the loop stops
    • $\text{should-stop()}$ may employ additional stopping criteria, e.g.:
      • number of iterations
      • distance traveled by the centroids
  • this technique is not deterministic, due to the initial random assignment
    • U({1,,n})\sim U(\{1,\dots,n\}) without repetition
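
A minimal sketch with scikit-learn on the $\mathbb{R}^1$ example of the next slide (n_init repeats the non-deterministic random initialization several times and keeps the best run):

import numpy as np
from sklearn.cluster import KMeans

D = np.array([1, 2, 3, 6, 7, 9, 11, 12, 15, 18], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(D)
print(km.labels_)           # cluster index of each observation
print(km.cluster_centers_)  # the k means, acting as centroids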
351 / 366

Example in R1\mathbb{R}^1

Input: D={1,2,3,6,7,9,11,12,15,18}D=\{1,2,3,6,7,9,11,12,15,18\}, k=3k=3

Execution (one initial random assignment):

Dj\mathcal{D}_j μ1\vect{\mu}_1 μ2\vect{\mu}_2 μ3\vect{\mu}_3
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{1}{6},\c{2}{7},\c{2}{9},\c{2}{11},\c{2}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 11\c{2}{11} 15\c{4}{15}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{1}{6},\c{2}{7},\c{2}{9},\c{2}{11},\c{2}{12},\c{4}{15},\c{4}{18}\} 3\c{1}{3} 9.8\c{2}{9.8} 16.5\c{4}{16.5}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{1}{6},\c{2}{7},\c{2}{9},\c{2}{11},\c{2}{12},\c{4}{15},\c{4}{18}\} 3\c{1}{3} 9.8\c{2}{9.8} 16.5\c{4}{16.5}

Execution (another initial random assignment):

Dj\mathcal{D}_j μ1\vect{\mu}_1 μ2\vect{\mu}_2 μ3\vect{\mu}_3
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{2}{2},\c{4}{3},\c{4}{6},\c{4}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 2\c{2}{2} 3\c{4}{3}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{2}{2},\c{2}{3},\c{4}{6},\c{4}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 2\c{2}{2} 10.1\c{4}{10.1}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{2}{2},\c{2}{3},\c{2}{6},\c{4}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 2.5\c{2}{2.5} 11.1\c{4}{11.1}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{2}{3},\c{2}{6},\c{2}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 3.7\c{2}{3.7} 12\c{4}{12}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{2}{6},\c{2}{7},\c{2}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1.5\c{1}{1.5} 5.3\c{2}{5.3} 13\c{4}{13}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{2}{6},\c{2}{7},\c{2}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 2\c{1}{2} 7.3\c{2}{7.3} 14\c{4}{14}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{2}{6},\c{2}{7},\c{2}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 2\c{1}{2} 7.3\c{2}{7.3} 14\c{4}{14}

Question: what's the best clustering? can we answer this question?

function cluster({x(i)}i=1i=n,k)\text{cluster}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, k) {
for h{1,,k}h \in \{1,\dots,k\} {
μhx(U({1,,n}))\vect{\mu}_h \gets \vect{x}^{(\sim U(\{1,\dots,n\}))}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D} \gets \text{assign}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \seq{\vect{\mu}_h}{h=1}^{h=k})
while ¬should-stop()\neg\text{should-stop()} {
for h{1,,k}h \in \{1,\dots,k\} {
μh1DhxDhx\vect{\mu}_h \gets \frac{1}{|D_h|} \sum_{\vect{x} \in D_h} \vect{x}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D}' \gets \text{assign}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \seq{\vect{\mu}_h}{h=1}^{h=k})
if D=D\mathcal{D}'=\mathcal{D} {
break
}
DD\mathcal{D} \gets \mathcal{D}'
}
return D\mathcal{D}
}

352 / 366

Example in R2\mathbb{R}^2

Example of k-means in R^2

Given two points μ1,μ2\vect{\mu}_1,\vect{\mu}_2, the line which

  • is orthogonal to the segment μ1μ2undefined\overlinesegment{\vect{\mu}_1\vect{\mu}_2} and
  • goes through its midpoint

divides the space in points closer to μ1\vect{\mu}_1 and those closer to μ2\vect{\mu}_2.

Image from Wikipedia

353 / 366

Applying ML to text

354 / 366

What's text?

Formally, a piece of text is a variable-length sequence of symbols belonging to an alphabet $A$. Hence: $x \in A^*$ where $A$ is usually (in modern times) UTF-16, so it may include emojis:

  • there are thousands of them: 🤩🦴🐁...

A dataset $X \in \mathcal{P}^\ast(A^\ast)$ of texts, possibly with labels, is called a corpus. A single text $x^{(i)}$ is called a document.

However, what we usually mean by text is natural language, where the sequence of characters is a noisy container of underlying information:

  • given a document xx, the actual meaning of xx may depend on other documents
  • given a portion xxx' \sqsubset x of a document xx, its meaning may be different if put in another document xx''

Natural language is by nature ambiguous!

355 / 366

Examples of text+ML problems

  • Given a brand (e.g., Illy, Fiat, Dell, U.S. Triestina Calcio, ...), build a system that tells whether people are talking positively or negatively about the brand on Twitter (or Mastodon).

  • Given a corpus of letters to/from soldiers fighting during WW1, what are the topics they talk about?

  • Given a scientific paper p1p_1, what's the relevance of the citation of another paper p2p_2 referenced in p1p_1?

356 / 366

Sentiment analysis

A relevant class of problems is the one in which the goal is to gain insights about the sentiments an author was feeling while authoring a document xx. This is called sentiment analysis.

Usually, this problem is cast as a form of supervised learning, where YY contains sentiments.

Variants:

  • Y={Pos,Neg}Y =\{\text{Pos},\text{Neg}\}
  • Y=[1,1]Y =[-1,1]
  • Y=[1,1]10Y=[-1,1]^{10}
    • one for each of anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive (see the Syuzhet package)
  • ...

In every case, we can¹ apply classic ML (supervised and unsupervised) techniques if we pre-process text to obtain multivariate observations, possibly in $\mathbb{R}^p$, i.e., we want a $f\subtext{text-to-vect}: A^* \to \mathbb{R}^p$:

xAx \in A^*xRp\vect{x}' \in \mathbb{R}^pftext-to-vectf\subtext{text-to-vect}
  1. Actually, we have to, with the only exception of hierarchical clustering for which we might directly work on text with a suitable d()d().
357 / 366

Bag-of-words

Bag-of-words (BOW) is a ftext-to-vectf\subtext{text-to-vect} based on the idea of associating one numerical variable with each word in a predefined dictionary.

xAx \in A^*xRW\vect{x}' \in \mathbb{R}^{|W|}fBOWf\subtext{BOW}WW

In practice, given the dictionary (i.e., set of words WP(A)W \in \mathcal{P}(A^*)) and given a document xx:

  1. tokenize xx in a multiset T=ftokenize(x)T=f\subtext{tokenize}(x) of tokens (words)
  2. for each $t \in T$, set $x'_t$ to the multiplicity $m(t,T)$ of $t$ in $T$, i.e., to the number of occurrences of the word $t$ in $x$

The outcome is a xRW\vect{x}' \in \mathbb{R}^{|W|}.

An alternative version is to consider frequencies instead of occurrences:

  • i.e., $x'_t=\frac{m(t,T)}{|T|}$
  • useful if the documents have very different lengths but the length itself is not relevant information
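
A BOW sketch with scikit-learn's CountVectorizer (which also tokenizes and lowercases):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Banana is my favorite fruit", "I like banana, banana, banana"]

vectorizer = CountVectorizer()        # fit learns the dictionary W...
X = vectorizer.fit_transform(corpus)  # ...transform maps documents to R^|W|
print(vectorizer.get_feature_names_out())  # the dictionary W
print(X.toarray())                    # multiplicities, one row per document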
358 / 366

Common text pre-processing steps

Because of tokenization, BOW considers slightly different sequences of characters as different words, and hence as different features. Usually, this is not good.

In practice, you often do some basic pre-processing steps:

  • case conversion: everything to lowercase (language independent)
    • x=Banana is my favorite fruitx=banana is my favorite fruitx=\text{Banana is my favorite fruit} \mapsto x'=\text{banana is my favorite fruit}
    • $x=\text{I like banana} \mapsto x'=\text{i like banana}$
  • removal of punctuation (language independent)
  • stemming: each word is replaced with its morphological root (language dependent)
    • x=I liked eating bananasx=I lik eat bananax=\text{I liked eating bananas} \mapsto x'=\text{I lik eat banana}
    • x=andammo tristemente rassegnatix=andar triste rassegnatx=\text{andammo tristemente rassegnati} \mapsto x'=\text{andar triste rassegnat}
  • removal of stop-words (language dependent)
    • stop words are very common words (articles, some prepositions, ...)

Each of these steps is a $f\subtext{pre-proc}: A^\ast \to A^\ast$:

xAx \in A^\astxAx' \in A^\astfpre-procf\subtext{pre-proc}
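
A minimal $f\subtext{pre-proc}$ sketch for the two language-independent steps (stemming and stop-word removal would need a language-dependent resource, e.g., from the nltk library):

import re

def pre_process(x: str) -> str:
    x = x.lower()                   # case conversion
    x = re.sub(r"[^\w\s]", " ", x)  # removal of punctuation
    return re.sub(r"\s+", " ", x).strip()

print(pre_process("I just saw Alice!!!"))  # i just saw alice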
359 / 366

Counter examples

The 4 common pre-processing steps are not always appropriate. It depends on whether they help model the $y$-$x$ dependency.

Sentiment analysis and punctuation:

  • I just saw Alice\text{I just saw Alice}
  • I just saw Alice!!!\text{I just saw Alice!!!}
  • I just saw Alice!!! 🥰😍💘\text{I just saw Alice!!! 🥰😍💘}

Music genre preferences and case: a bit forced...

  • I like the Take That and I hate The Who.\text{I like the Take That and I hate The Who.}
  • Who likes to take that song of Hate? Me!\text{Who likes to take that song of Hate? Me!}

Education level and stemming:

  • se fossi stato malato, me ne sarei stato a casa\text{se fossi stato malato, me ne sarei stato a casa}
  • se ero malato, me ne stavo a casa\text{se ero malato, me ne stavo a casa}
360 / 366

tf-idf

BOW tends to overweigh words which are very frequent, but not relevant (similarly to stop-words) and underweigh words that are relevant, but rare.

Solution: use tf-idf instead of occurrences or frequencies. tf-idf is the product of the term frequency (i.e., the frequency of a word in a document) and the inverse document frequency, i.e., the inverse of the frequency, in the corpus, of the documents containing that term.

Given the dictionary WW, the corpus XX, and a document xx:

  1. tokenize xx in a multiset TT of tokens (words)
  2. for each tTt \in T, set xt=ftf(t,x)fidf(t,X)x'_t=\c{1}{f\subtext{tf}(t, x)} \c{2}{f\subtext{idf}(t, X)}

where:

  • ftf(t,x)=m(t,T)Tf\subtext{tf}(t, x)=\frac{m(t,T)}{|T|}
  • fidf(t,X)=logXxX1(tftokenize(x))f\subtext{idf}(t, X)=\log \frac{|X|}{\sum_{x \in X} \mathbf{1}(t \in f\subtext{tokenize}(x))}

The more common a word in the corpus, the greater its tf but the lower its idf ($0$ if it occurs in every document). The more specific a word to a document, the larger its tf and its idf.

tf-idf corresponds to a ftf-idf-learn:P(A)P(A)f\subtext{tf-idf-learn}: \mathcal{P}^\ast(A^\ast) \to \mathcal{P}^\ast(A^\ast), which is just the identity¹, and a ftf-idf-apply:A×P(A)RWf\subtext{tf-idf-apply}: A^\ast \times \mathcal{P}^\ast(A^\ast) \to \mathbb{R}^{|W|}:

XXXXftf-idf-learnf\subtext{tf-idf-learn}
x,Xx,Xx\vect{x}'ftf-idf-applyf\subtext{tf-idf-apply}WW
  1. or, more verbosely and more formally: ftf-idf-learn:P(A)FA[0,1]2f\subtext{tf-idf-learn}: \mathcal{P}^\ast(A^\ast) \to \mathcal{F}_{A^\ast \to [0,1]^2}, because it returns a mapping between words and two frequencies (tf and idf).
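
With scikit-learn's TfidfVectorizer (a sketch; note that scikit-learn uses a smoothed variant of the idf formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the beer was good", "the beer was not good",
          "the pub was too noisy"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# words occurring in every document ("the", "was") get the lowest weights
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))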
361 / 366

Reducing the dimensionality

With BOW, $p=|W|$, which might be very large.

Common approaches:

  • use a very small dictionary, tailored to the specific case
  • learn a small dictionary (W=k|W|=k) on the learning data
    • you have a fBOW-top-learn:P(A)P(A)f\subtext{BOW-top-learn}: \mathcal{P}^\ast(A^\ast) \to \mathcal{P}(A^\ast) and a fBOW-top-apply:A×P(A)Rkf\subtext{BOW-top-apply}: A^\ast \times \mathcal{P}(A^\ast) \to \mathbb{R}^k
    • in learning
      • use fBOW-top-learn(X)=Wf\subtext{BOW-top-learn}(X)=W to build the dictionary WW from the corpus XX, then
      • transform the corpus in a XP(Rk)X' \in \mathcal{P}^\ast(\mathbb{R}^k) using fBOW-top-apply(x(i),W)=x(i)f\subtext{BOW-top-apply}(x^{(i)}, W)=\vect{x}^{\prime(i)} on each xx
    • in prediction, use fBOW-top-apply(x,W)=xf\subtext{BOW-top-apply}(x, W)=\vect{x}'
    • WW is often set as "the most frequent kk words" (but remove stop-words!)
  • use tf-idf and get kk most important words
XXWWfBOW-top-learnf\subtext{BOW-top-learn}kk
x,Wx,Wx\vect{x}'fBOW-top-applyf\subtext{BOW-top-apply}

The order of words in WW does matter, so it's W(A)W \in (A^\ast)^\ast, rather than WP(A)W \in \mathcal{P}(A^\ast).
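
With CountVectorizer, the learn-a-small-dictionary approach takes one parameter (a sketch, with $k=3$ and English stop-word removal):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the beer was good", "the beer was not good",
          "the pub was too noisy"]

# keep only the k most frequent words, after removing stop-words
vectorizer = CountVectorizer(max_features=3, stop_words="english")
X = vectorizer.fit_transform(corpus)       # learn W, then apply it
print(vectorizer.get_feature_names_out())  # the learned dictionary W, |W| = k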

362 / 366

Considering ordering

Both BOW and tf-idf ignore word ordering. But ordering is fundamental in natural language.

Example: (sentiment classification for restaurant reviews)

  • The beer was good and the pub was not too noisy.\text{The beer was good and the pub was not too noisy.}
  • The beer was not good and the pub was too noisy.\text{The beer was not good and the pub was too noisy.}

Most common solutions:

  • ngrams
  • part of speech (POS) tagging
363 / 366

ngrams

Instead of considering word frequencies (or occurrences, or tf-idf), consider the frequencies of short sequences of up to $n$ words (tokens, or characters in general), i.e., of ngrams.

Example: (with n=3n=3 and aggressive stop-word removal)

  • The beer was good and the pub was not too noisy.\text{The beer was good and the pub was not too noisy.}
    • xbeer,good=1x_{\text{beer},\text{good}}=1, xpub,not,noisy=1x_{\text{pub},\text{not},\text{noisy}}=1
  • The beer was not good and the pub was too noisy.\text{The beer was not good and the pub was too noisy.}
    • xbeer,not,good=1x_{\text{beer},\text{not},\text{good}}=1, xpub,too,noisy=1x_{\text{pub},\text{too},\text{noisy}}=1

Since pp may become very very large, dimensionality reduction becomes very important.
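
With CountVectorizer, ngrams again take one parameter (a sketch with $n=3$):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The beer was good and the pub was not too noisy.",
        "The beer was not good and the pub was too noisy."]

# ngram_range=(1, 3): count all 1-, 2-, and 3-grams of words; the two
# reviews now yield different features ("beer was good" vs. "beer was not")
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)
print(X.shape)  # p grows quickly with n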

364 / 366

Part-of-speech tagging (very briefly)

A technique, belonging to Natural Language Processing methods, that assigns a grammatical role to each word in a document. Roles can then be used to augment the text-to-vect transformation.

POS example

365 / 366

Lab 3: sport vs. politics

Build a system that:

  1. every day collects a large set of random tweets and groups them in tweets about politics and about sport
  2. for each of the two groups, shows the main topics of discussion

The system uses a dashboard to show its findings. You don't need to build the dashboard here, but imagining it and its usage can facilitate the design of the system.

Hints:

  • the hardest part is collecting the data for designing/building the system
  • interesting R packages
    • tm for doing text mining (tokenization, punctuation, stop-words, stemming, ...)
    • other supervised learning: e1071, randomForest
    • clustering: kmeans, hclust
366 / 366
