
(Introduction to) Machine Learning and Evolutionary Robotics

456MI, 470SM

Eric Medvet

A.Y. 2024/2025

1 / 366

Lecturer

Eric Medvet

Research interests:

  • evolutionary computation
  • embodied artificial intelligence
  • machine learning applications

Labs:

2 / 366

Computer Engineering (ING-INF/05) group

Sylvio Barbon Jr.
Fondamenti di informatica
Progettazione del software e dei sistemi informativi
meta learning, applied ML, process mining

Alberto Bartoli
Reti di calcolatori
Computer networks 2 and introduction to cybersecurity
security, applied ML, evolutionary computation

Andrea De Lorenzo
Basi di dati
Programmazione web
security, applied AI&ML, information retrieval, GP

Eric Medvet
Programmazione avanzata
Introduction to machine learning and evolutionary robotics
evolutionary computation, embodied AI, applied ML

Laura Nenzi
Cyber-physical systems
Introduction to Artificial Intelligence
formal methods, runtime verification

Martino Trevisan
Reti di calcolatori
Sistemi operativi
Architetture dei sistemi digitali
network measurements, data privacy, big data

3 / 366

Structure of the course

1st part (6 CFUs, 48 hours): for all of IN23, IN19, SM38, SM36, SM34, SM23, SM28, SM13, and SM64

2nd part (3 CFUs, 24 hours): just for IN23 and IN19

  • what is evolutionary computation?
  • significant applications in robotics

Focus on methodology:

  • how to design, build, and evaluate an ML (or EC) system?
4 / 366

Materials

Teacher slides:

  • available on the course web page
  • might be updated during the course

Notebooks for the lab activity:

  • available on the course web page
  • please, to fully enjoy lab activities, do not look at notebooks in advance

Textbooks:

  • 1st part: James, Gareth, et al.; An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013. Available in the UniTs library.
  • 2nd part: De Jong, Kenneth A.; Evolutionary Computation: A Unified Approach. MIT Press, 2006.

Disclaimer: the overlap between the textbooks and the course material is very partial!

5 / 366

How to attend lectures

Depending on your learning style and habits, you might want to take notes to augment the slide content.

6 / 366

Visual syntax

This is an important concept.

This is a very important key concept, like a definition.

Sometimes there is something that is marginally important: an aside, like this.

There will be scientific papers or books to be referred to, like this book: James, Gareth, et al.; An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.

External resources (e.g., videos, software tools, ...) will be linked directly.

The palette is color-blind safe.

Pseudo-code for describing algorithms in an abstract way:

function factorial(n) {
  p \gets 1
  while n > 1 {
    p \gets n p
    n \gets n - 1
  }
  return p
}

Code in a concrete programming language:

public static String sayHello(String name) {
return "Hello %s".formatted(name);
}
7 / 366

Lab activities and how to attend

Focus on methodology:

  • how to design, build, and evaluate an ML (or EC) system?

Practice (in designing, building, evaluating) is fundamental!

You'll practice doing lab activities:

  • \approx 15 hours in the 1st part
  • in classroom
    • the teacher is there and always available
    • the teacher actively monitors your progress
    • ... but you can do the activities also at home
  • "solution" shown at the end
    • solution = one way of doing design, build, evaluate
  • agnostic w.r.t. concrete tools used
    • teacher is more familiar with R
    • tutor is more familiar with Python
  • suitable to be done in small group (2–4 students)
8 / 366

Lecture times

Where:

  • Room H, building C1, Piazzale Europa Campus
  • Room 3B, building H3, Piazzale Europa Campus
  • Room 2A, building D, Piazzale Europa Campus

When:

  • Monday, 16.00–19.00, H, C1 \rightarrow 16.00–18.30
  • Tuesday, 11.00–13.00, 3B, H3 \rightarrow 11.00–12.30
  • Wednesday, 10.00–12.00, 2A, D \rightarrow 10.00–11.30
9 / 366

Tutor

Michel El Saliby

Role of the tutor:

  • assisting students during lab activities, together with the teacher
  • first point-of-contact for course-related questions by students
    • the teacher is always available
10 / 366

Exam

The exam may be done in two ways:

  1. project and written test
  2. written test only

The written test consists of a few (\approx 6) questions, some with medium-length answers, some with short answers, to be completed in 1 hour.

The project consists of the design, development, and assessment of an ML system dealing with one "problem" chosen among a few options (examples).

  • the student delivers a description, not the software
  • the description is evaluated for clarity, technical soundness, (amount of) results
  • may be done in a group (you are encouraged to form groups!)

The grade is the average of written test and project grades:

  • both must be \ge 18
  • parts can be repeated
  • honors (lode) if and only if both parts are \ge 30 and one is > 30
11 / 366

You?

12 / 366

Basic concepts

13 / 366

What is Machine Learning?

Machine Learning is the science of getting computers to learn without being explicitly programmed.

A few considerations:

  • defining a field of science is hard: science evolves, its boundaries change
  • ML "comes" from many communities' (statistics, computer science, ...) efforts: this (the use of computers) is just one point of view
  • it captures just some "parts" of ML: we'll see

Let's analyze it in detail:

  • is the science: what's science?
  • getting computers: who is doing that?
  • to learn: to learn what? this appears to be the key point!
  • without being explicitly programmed: who is not doing that?
14 / 366

An example: spam detection

GMail screenshot with spam folder

15 / 366

Spam detection: under the hood

What the user sees:

  • unwanted emails (spam) are in a separate place (the spam folder)
  • sometimes some spam email is not put in the spam folder
  • sometimes some not-spam email is put in the spam folder

What the web-based email system (a computer) does:

  • whenever an email arrives, it decides whether the email is spam or not
  • if spam, it moves the email to the spam folder; otherwise, it leaves it in the main place

In brief: a computer is making a decision about an email

16 / 366

Making a decision

Let's be formal:

y = f(x)

  • x: the entity about which the decision has to be made (the email)
  • y: the decision (spam or not-spam)
  • f(\cdot): some procedure that, given an x, results in a decision y

y = f(x) is a formal notation capturing the idea that y is obtained from x by applying f to it.
But it says nothing about the nature of x and y.

f: X \to Y

  • X: the set of all x, the domain of f (all the possible emails)
  • Y: the set of all y, the codomain of f (Y = \{\text{spam}, \text{not-spam}\})

Neither notation says how f works internally.

17 / 366
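To make the notation concrete, here is a minimal sketch (not from the slides) of a hand-written f: X \to Y for spam detection, with X = the set of all strings and Y = \{\text{spam}, \text{not-spam}\} encoded as a boolean; the keyword list is purely illustrative:

import java.util.List;

public class SpamPredicate {

  // purely illustrative keyword list: a hand-crafted notion of "spam"
  static final List<String> KEYWORDS = List.of("lottery", "prize", "winner");

  // f: X -> Y, with X = strings (emails) and Y = {spam, not-spam} as a boolean
  public static boolean isSpam(String email) {
    String lower = email.toLowerCase();
    return KEYWORDS.stream().anyMatch(lower::contains);
  }

  public static void main(String[] args) {
    System.out.println(isSpam("You are the lottery winner!")); // true
    System.out.println(isSpam("Meeting at 10am"));             // false
  }
}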

x and y names

  • x is an observation
    • something that can be observed, precisely because a decision has to be made about it
  • y is the response (for a given x)
    • if you feed the decision system with an x, the system responds with a y

Alternative names:

  • x is an/the input, y is an/the output
    • f as an information processing system
  • x is a data point
    • assuming it carries some data about the underlying entity
  • x is an instance
    • instance [ˈɪnst(ə)ns]: an example or single occurrence of something

Names are used interchangeably; some communities tend to prefer some names.

18 / 366

to learn what?

Machine Learning is the science of getting computers to learn without being explicitly programmed.

  • is the science: what's science?
  • getting computers: who is doing that?
  • to learn: to learn what?
    • how to make a decision y about an observation x. That is: f: X \to Y
  • without being explicitly programmed: who is not doing that?

New version:

Machine Learning is the science of getting computers to learn f: X \to Y without being explicitly programmed.

we want the computer to learn f and use it, not just learn it

19 / 366

Prediction

f is often denoted as f\subtext{predict} since, given an x, it predicts a y

  • when used in practice, i.e., in the prediction phase, f\subtext{predict} produces a guess \hat{y} about an unknown, real y
20 / 366

ff for a computer

Computers execute instructions grouped in programs and expressed according to some language.
f is the mathematical, abstract notation for a computer program that, when executed on an input x \in X, outputs a y \in Y.

Mathematical notation:

X = \mathbb{R}^2 \qquad Y = \mathbb{R} \qquad f: \mathbb{R}^2 \to \mathbb{R} \qquad f(\vect{x}) = f((x_1, x_2)) = \left\lvert\frac{x_1 - x_2}{x_1}\right\rvert

\vect{x} is a notation for vectors or, more broadly, for sequences of homogeneous elements, used in place of \vec{x}

Computer language:

public double f(double[] xs) {
  return Math.abs((xs[0] - xs[1]) / xs[0]);
}

Most (not all) typed languages make the connection clear:

  • double[] is X, i.e., \mathbb{R}^2 (actually \mathbb{R}^p, with p \ge 1)
  • double is Y, i.e., \mathbb{R}
  • xs is \vect{x}
  • f is f: types correspond!
  • there is no explicit counterpart for y

21 / 366

Further point of view

Abstract definition (\approx the signature):

  • just domain and codomain, not how the function works

f: \mathbb{R}^2 \to \mathbb{R}

double f(double[] xs)

[diagram: \vect{x} \in \mathbb{R}^2 \to f \to y \in \mathbb{R}]

Concrete definition (\approx signature and code):

  • domain, codomain, and how the function works

f: \mathbb{R}^2 \to \mathbb{R} \qquad y = f(\vect{x}) = f((x_1, x_2)) = x_1 + x_2

double f(double[] xs) {
  return xs[0] + xs[1];
}

[diagram: \vect{x} \in \mathbb{R}^2 \to x_1 + x_2 \to y \in \mathbb{R}]
22 / 366

Writing f

Usually, computer programs are written by humans, but here:

Machine Learning is the science of getting computers to learn f\subtext{predict}: X \to Y without being explicitly programmed.

without being explicitly programmed means that f\subtext{predict} is not written by a human!

It appears verbose; let's get rid of it.

New version:

Machine Learning is the science of getting computers to learn f\subtext{predict}: X \to Y autonomously.

23 / 366

Finding/writing a program

Alice (computer science instructor) to Bob (student):
"Please, write a program that, given a string, returns the number of vowel occurrences in the string"

Alternative version:
"Please, find a program that, given a string, returns the number of vowel occurrences in the string"

"Find" suggests Bob to apprach the task in two steps:

  1. consider the universe of all the possible programs
  2. choose the one (or ones) that does what expected

In ff terms:

  1. consider FXY={f,f:XY}\mathcal{F}_{X \to Y} = \{f, f: X \to Y\}
  2. choose one fFXYf \in \mathcal{F}_{X \to Y} that does what expected
24 / 366

Desired behavior of ff

  1. consider \mathcal{F}_{X \to Y} = \{f, f: X \to Y\}
  2. choose one f \in \mathcal{F}_{X \to Y} that does what is expected

Step 2 is fundamental in practice

  • "find a program that, given a string, returns a number" wouldn't make sense alone!

... but it is hard to formalize further in general.

There has to be some supervision facilitating the search for a good f.

25 / 366

Supervised learning

When the supervision is in the form of some examples (observation \rightarrow response) and the learned f\subtext{predict} should process them correctly.

  • example: "if I give you this observation x, you should predict this response y"

New version:

Supervised (Machine) Learning is the science of getting computers to learn f\subtext{predict}: X \to Y from examples autonomously.

In unsupervised learning there is no supervision, i.e., there are no examples:

  • nevertheless, there is some implicit expectation about how to process x
  • we'll discuss unsupervised learning later
  • we'll discuss unsupervised learning later
26 / 366

Examples

Formally, the examples available for learning f\subtext{predict} are pairs (x, y).

A dataset compatible with X and Y is a bag of pairs (x, y): D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{i=n}, with \forall i: x^{(i)} \in X, y^{(i)} \in Y and |D| = n.

Or, more briefly, D = \{(x^{(i)}, y^{(i)})\}_i. examples are also denoted by (x_i, y_i), depending on the community

A bag (D should rather be called a databag...):

  • can have duplicates (bag \ne set)
  • does not imply any order among its elements (bag \ne sequence)

In most algorithms, and in their program counterparts, datasets are actually processed sequentially, though.

27 / 366

Learning set

A learning set is a dataset that is used for learning an f\subtext{predict}.

  • may be denoted by D\subtext{learn}, or L, or T (for training set)

The learning set has to be consistent with the domain and codomain of the function f\subtext{predict} to be learned:

  • if f\subtext{predict} \in \mathcal{F}_{X \to Y}, then D\subtext{learn} \in \mathcal{P}^*(X \times Y)
    • X \times Y is the Cartesian product of X and Y, i.e., the set of all possible (x, y) pairs
    • \mathcal{P}(A) is the powerset of A, i.e., the set of all the possible subsets of A
    • \mathcal{P}^*(A) is a custom notation for the powerset with duplicates, i.e., the set of all the possible multisets of A
28 / 366
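As a concrete (Java) counterpart of these definitions, here is a minimal sketch, not from the slides, of a dataset as a bag of (x, y) pairs; the names Example, DatasetDemo, and dLearn are made up for illustration:

import java.util.List;

// an example is a pair (x, y); a List can hold duplicates and we simply
// ignore its order, matching the bag semantics of D in P*(X x Y)
record Example<X, Y>(X x, Y y) {}

class DatasetDemo {
  public static void main(String[] args) {
    // a D_learn compatible with X = strings (emails) and Y = booleans (spam?)
    List<Example<String, Boolean>> dLearn = List.of(
        new Example<>("You are the lottery winner!", true),
        new Example<>("Meeting at 10am", false),
        new Example<>("Meeting at 10am", false) // duplicates are allowed in a bag
    );
    System.out.println(dLearn.size()); // |D| = n = 3
  }
}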

Learning technique

Supervised (Machine) Learning is the science of getting computers to learn f\subtext{predict}: X \to Y from examples autonomously.

In brief: given a D\subtext{learn} \in \mathcal{P}^*(X \times Y), learn an f\subtext{predict} \in \mathcal{F}_{X \to Y}.

A supervised learning technique is a way for learning an f\subtext{predict} \in \mathcal{F}_{X \to Y} given a D\subtext{learn} \in \mathcal{P}^*(X \times Y).

f\subtext{learn}: \mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y} \qquad f\subtext{predict} = f\subtext{learn}(D\subtext{learn})

[diagram: D\subtext{learn} \to f\subtext{learn} \to f\subtext{predict}]

  • learning phase: when f\subtext{learn} is applied to obtain f\subtext{predict} from D\subtext{learn}
  • prediction phase: when f\subtext{predict} is applied to obtain a y from an x
29 / 366

Learning techniques

A supervised learning technique is a way for learning an f\subtext{predict} \in \mathcal{F}_{X \to Y} given a D\subtext{learn} \in \mathcal{P}^*(X \times Y).

Why isn't a single learning technique enough? Why are there many of them?

They differ in:

  • applicability with respect to X and/or Y
    • e.g., some require X = \mathbb{R}^p, some require Y = \mathbb{R}
  • efficiency with respect to |D\subtext{learn}|
    • e.g., some are really fast in producing f\subtext{predict} (\mathcal{O}(|D\subtext{learn}|^{\approx 0})), some are slow (\mathcal{O}(|D\subtext{learn}|^2))
  • effectiveness in terms of the quality of the learned f\subtext{predict}
  • attributes of the learned f\subtext{predict}
    • nature/type of f\subtext{predict} (a formula, a text, a tree...)
    • interpretability of f\subtext{predict}
30 / 366

Who?

Supervised (Machine) Learning is the science of getting computers to learn f\subtext{predict}: X \to Y from examples autonomously.

getting computers: who is doing that?

  • the user of a learning technique, who is likely the designer/developer of an ML system

is the science: what's science?

  • there's not only the user: someone designs/develops learning techniques

New version:

Supervised (Machine) Learning is about designing and applying supervised learning techniques.

31 / 366

Learning as optimization

A supervised learning technique f\subtext{learn}: \mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y} can be seen as a form of optimization:

  1. consider \mathcal{F}_{X \to Y} = \{f, f: X \to Y\}
  2. find the one f\subtext{predict} \in \mathcal{F}_{X \to Y} that works best on D\subtext{learn}

Could we use a general optimization technique?
In principle, yes, but:

  • X (and maybe Y) might be infinite (e.g., X = \mathbb{R}^p)
  • X \times Y is "more" infinite
  • \mathcal{F}_{X \to Y} is "hugely more" infinite

Practical solution: reduce the size of \mathcal{F}_{X \to Y} by considering only the f of some given nature:

  • e.g., for X = Y = \mathbb{R}, consider \mathcal{F}'_{\mathbb{R} \to \mathbb{R}} = \{f: f(x) = ax + b \text{ with } a, b \in \mathbb{R}\}
  • e.g., for x a UTF-8 string and y a Boolean, consider only the f expressible as regular expressions
32 / 366

Templating f

Often a learning technique works on a reduced \mathcal{F}'_{X \to Y} which is based on a template f':

  • most parts of f' are defined; some parts are undefined, variable
  • f' can be used for prediction only if the undefined parts are defined

E.g., for X = Y = \mathbb{R}, f'(x) = ax + b (see the sketch after this slide):

  • you need concrete values for a, b in order to apply f to an x, i.e., to obtain a response y out of an x
  • this is univariate linear regression: we'll expand
    • univariate because X has one dimension
    • regression because Y = \mathbb{R}
    • linear because of the template
33 / 366
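A minimal sketch (not from the slides) of this template in Java: the structure ax + b is fixed, while the pair (a, b) is the undefined, variable part that a learning technique would have to fill in; all names are made up for illustration:

// the undefined part of the template: one m in M = R^2
record LinearModel(double a, double b) {}

class TemplateDemo {

  // f'_predict: X x M -> Y; usable only once (a, b) have concrete values
  static double predict(double x, LinearModel m) {
    return m.a() * x + m.b();
  }

  public static void main(String[] args) {
    LinearModel m = new LinearModel(2.0, 1.0);
    System.out.println(predict(3.0, m)); // y = 2 * 3 + 1 = 7
  }
}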

Model

We can make the undefined part of the template explicit: f\subtext{predict}(x) = f'\subtext{predict}(x, m), where m \in M is the undefined part.

  • e.g., f'\subtext{predict}(x, a, b) = ax + b and M = \mathbb{R}^2

Note that f'\subtext{predict} is fixed for a given learning technique and defines the reduced \mathcal{F}'_{X \to Y} \subset \mathcal{F}_{X \to Y} where the learning will look for an f\subtext{predict}.

Given a template f'\subtext{predict}, m defines an f\subtext{predict} that can be used to predict a y from an x.
That is, m is a model of how y depends on x.

34 / 366

Learning a model

For techniques based on a template, f\subtext{learn} actually searches just \mathcal{F}'_{X \to Y}, hence M, for an f\subtext{predict}.

General case:

f\subtext{learn}: \mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y} \qquad f\subtext{predict}: X \to Y

The learning technique is defined by f\subtext{learn}.

[diagram: D\subtext{learn} \to f\subtext{learn} \to f\subtext{predict}]

With template:

f'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M \qquad f'\subtext{predict}: X \times M \to Y

The learning technique is defined by the pair f'\subtext{learn}, f'\subtext{predict} (see the sketch after this slide).

[diagram: D\subtext{learn} \to f'\subtext{learn} \to m; \; x, m \to f'\subtext{predict} \to y]
35 / 366
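The pair f'\subtext{learn}, f'\subtext{predict} can be rendered as a pair of methods; this is a minimal sketch under the assumption of the Example record introduced earlier, with made-up names:

import java.util.List;

// a templated supervised learning technique as the pair f'_learn, f'_predict
interface LearningTechnique<X, Y, M> {
  M learn(List<Example<X, Y>> dLearn); // f'_learn: P*(X x Y) -> M
  Y predict(X x, M m);                 // f'_predict: X x M -> Y
}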

Examples of templated ff

Problem: price of a flat from its surface

X = \mathbb{R}^+, Y = \mathbb{R}^+
\mathcal{F}_{\mathbb{R}^+ \to \mathbb{R}^+} = \{\dots, x^2, 3, \pi\frac{x^3 + 5x}{0.1 + x}, \dots\}

Learning technique: linear regression

f'\subtext{predict}(x, a, b) = ax + b
M = \mathbb{R} \times \mathbb{R} = \{(a, b): a \in \mathbb{R} \land b \in \mathbb{R}\}
\mathcal{F}'_{\mathbb{R}^+ \to \mathbb{R}^+} = \{\dots, x + 1, 3, \pi x + 5, \dots\}

Problem: classify an email as spam/not-spam

X = A^*, Y = \{\text{spam}, \neg\text{spam}\}, with A = UTF-8 and A^* = \bigcup_{i=0}^{i=\infty} A^i
\mathcal{F}_{A^* \to Y} = \{\dots\} (all predicates on UTF-8 strings)

Learning technique: regex-based flagging (see the sketch after this slide)

f'\subtext{predict}(x, r) = \begin{cases} \text{spam} & \text{if } x \text{ matches } r \newline \neg\text{spam} & \text{otherwise} \end{cases}
M = regexes = \{\dots, ca.++, [a-z]+a.+, \dots\}
\mathcal{F}'_{A^* \to Y} = \{\dots, f'\subtext{predict}(\cdot, [a-z]+a.+), \dots\}

Choosing the learning technique means choosing one \mathcal{F}'_{X \to Y}!

36 / 366
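A minimal sketch (not from the slides) of the regex-based flagging technique in Java; whether "matches" means a full match or containment is a design choice, here containment:

import java.util.regex.Pattern;

class RegexFlagging {

  // f'_predict(x, r): spam if x matches the regex r, not-spam otherwise
  static String predict(String x, Pattern r) {
    return r.matcher(x).find() ? "spam" : "¬spam";
  }

  public static void main(String[] args) {
    Pattern r = Pattern.compile("[a-z]+a.+"); // one m in M = regexes (from the slide)
    System.out.println(predict("buy viagra now", r)); // spam
    System.out.println(predict("Hi!", r));            // ¬spam
  }
}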

Alternative views/terminology

The model m is learned on a dataset D.

  • m is learned from the examples in D

The model m is trained on a dataset D.

  • m is trained to correctly work on the examples in D

The model m is fitted on a dataset D.

  • m is adjusted until it works well on the examples in D

Formally, a model is one specific m \in M that has been found upon learning.
However, "model" is often used to denote a generic (e.g., still untrained/unfitted) artifact.

  • "fit the model": a model exists before fitting (e.g., before the learning phase)
  • "learn a model": the model is the outcome of the learning phase
37 / 366

Common cases and terminology

Supervised learning techniques may be categorized depending on the kind of X, Y, M they deal with.

With respect to Y, the most important cases are:

  • Y is a finite set without intrinsic ordering \rightarrow classification
    • y is said to be a categorical (or nominal) variable
    • if |Y| = 2 \rightarrow binary classification
      otherwise \rightarrow multiclass classification
  • Y = \mathbb{R} (or Y \subseteq \mathbb{R}) \rightarrow regression
    • y is said to be a numerical variable

With respect to X, common cases:

  • X = X_1 \times \dots \times X_p, with each X_i being \mathbb{R} or a finite unordered set (each x is a p-sized tuple)
    • X is multivariate and each x_i is either numerical or categorical
  • X is the set of all strings \rightarrow text mining (we'll see)
38 / 366

Variables terminology

In the common case of a multivariate X = X_1 \times \dots \times X_p:

  • each x_i is called an independent variable
    • or feature, since it is a feature of an x \in X
    • or attribute, since it is an attribute of an x \in X
    • or predictor, since it hopefully helps predicting a y
  • y is called the dependent variable, since it is hoped to depend on x
    • or response variable

Given a dataset D with |D| = n examples defined over X, Y:

D = \begin{pmatrix} x_1^{(1)} & \dots & x_j^{(1)} & \dots & x_p^{(1)} & y^{(1)} \newline \dots & \dots & \dots & \dots & \dots & \dots \newline x_1^{(i)} & \dots & x_j^{(i)} & \dots & x_p^{(i)} & y^{(i)} \newline \dots & \dots & \dots & \dots & \dots & \dots \newline x_1^{(n)} & \dots & x_j^{(n)} & \dots & x_p^{(n)} & y^{(n)} \end{pmatrix}

  • x^{(i)} is the i-th observation
  • \{x_j^{(i)}\}_i are the values of the j-th feature
  • x_j^{(i)} is the value of the j-th feature for the i-th observation (recall: order does not matter in D)
  • y^{(i)} is the response for the i-th observation
    • if y is categorical \rightarrow class label
39 / 366

Size of the "problem"

The common notation for the size of a multivariate dataset (i.e., a dataset with a multivariate X = X_1 \times \dots \times X_p) is:

  • n: the number of observations
  • p: the number of (independent) variables

On the assumption that a dataset D implicitly defines the problem (since it bounds X and Y and hence \mathcal{F}_{X \to Y}), n and p also describe the size of the problem.

40 / 366

What (sup. learning techniques) we will see

A family of learning techniques (tree-based) for:

  • multivariate X = X_1 \times \dots \times X_p, each variable being categorical or numerical
  • classification (binary or multiclass) and regression

A family of learning techniques (SVM) for:

  • X = \mathbb{R}^p
  • binary classification

A learning technique (kNN) for:

  • any X with a similarity metric (including X = \mathbb{R}^p)
  • classification (binary or multiclass) and regression

A learning technique (naive Bayes) for:

  • multivariate X = X_1 \times \dots \times X_p, each variable being categorical (with a mention of the hybrid case)
  • classification (binary or multiclass)
41 / 366

... and...

What if none of the above learning techniques fits the problem (X, Y) at hand?

We'll see:

  • a method for applying techniques suitable for X = \mathbb{R}^p to problems where a multivariate X includes categorical variables
  • a few methods for applying techniques suitable for X = \mathbb{R}^p to problems where X = strings
  • two methods for applying techniques suitable for binary classification (|Y| = 2) to multiclass classification problems (|Y| \ge 2)

What about the other kinds of problems?

42 / 366

ML system

An information processing system in which there is:

  • a supervised learning technique (i.e., a pair f'\subtext{learn}, f'\subtext{predict})
  • other components operating on X or Y
    • pre-processing, if "before" the learning technique, i.e., X \to X'
    • post-processing, if "after" the learning technique, i.e., Y' \to Y
43 / 366

ML system example: Twitter profiling

Goal: given a tweet, determine age range and gender of the author

  • problem 1: X = A^{280}, A = UTF-16, Y = \{\text{0--16}, \text{17--29}, \text{30--49}, \text{50--}\}
  • problem 2: X = A^{280}, A = UTF-16, Y = \{\text{M}, \text{F}\} (or broader)

One possible ML system for this problem:

  • f\subtext{text-to-num}: A^{280} \to [0,1]^{50} (chosen among a few options, maybe adjusted)
  • f\subtext{foreach}: X^* \times \mathcal{F}_{X \to Y} \to Y^* (given an f: X \to Y and a sequence \{x_i\}_i, apply f to each x_i)
  • f'_{\text{learn},1}, f'_{\text{predict},1} and f'_{\text{learn},2}, f'_{\text{predict},2} (two learning techniques suitable for classification)

Learning phase:

D'\subtext{learn} = f\subtext{foreach}(D\subtext{learn}, f\subtext{text-to-num}) (just the x part)
m\subtext{age} = f'_{\text{learn},1}(D'\subtext{learn})
m\subtext{gender} = f'_{\text{learn},2}(D'\subtext{learn})

Prediction phase:

x' = f\subtext{text-to-num}(x)
y\subtext{age} = f'_{\text{predict},1}(x', m\subtext{age})
y\subtext{gender} = f'_{\text{predict},2}(x', m\subtext{gender})

44 / 366

Designing an ML system

  • Who chooses the learning technique(s)?
    • And its parameter values?
  • Who chooses/designs the pre- and post-processing components?
    • And their parameter values?

The designer of the ML system, that is, you¹!


  1. Can those choices be made automatically? "Yes", it's called Auto-ML
45 / 366

Phases of design of an ML system

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement):
    • define XX and YY
    • define a way for assessing solutions
      • before designing!
      • applicable to any compatible ML solution
  4. Design the ML system
    • choose a learning technique
    • choose/design pre- and post-processing steps
  5. Implement the ML system
    • learning/prediction phases
    • obtain the data
  6. Assess the ML system

Steps 4–6 are usually iterated many times

Skills of the ML practitioner/designer:

  • knowing the main ML techniques
  • knowing common pre- and post-processing techniques
  • knowing the main (comparative) assessment techniques
  • implementing them in production
  • motivating all choices

Skills of the ML researcher:

  • (as above, and)
    • implementing them as prototypes
  • designing new ML/pre-/post-processing/assessment techniques
  • formally/experimentally motivating them

Experience, practice, knowledge!

46 / 366

Should I use Machine Learning?

Recall: we need an f\subtext{predict}: X \to Y to make a decision y about an x

Reasons for running f\subtext{predict} on a machine:

  • y has to be computed very quickly
    • a human couldn't keep the pace
  • y has to be computed in a dangerous context
    • or a human is simply not available
  • the value of y is very low
  • it is believed that a human would be biased in deciding y

Even if f\subtext{predict} is run on a machine, f\subtext{predict} might still be designed by a human.

  • human "learning", not machine learning

Reasons for running f\subtext{learn} on a machine, i.e., for obtaining f\subtext{predict} through learning:

  • humans cannot design a reasonable f\subtext{predict}
  • a human-made f\subtext{predict} is too costly/slow
  • a human-made f\subtext{predict} is not good
    • does not make good decisions

Factors:

  • efficiency
  • effectiveness
  • human dignity (cost)
47 / 366

Domain knowledge and data exploration

Reasons for running f\subtext{learn} on a machine:

  • humans cannot design a reasonable f\subtext{predict}: yes or no?
  • a human-made f\subtext{predict} is too costly/slow: yes or no?
  • a human-made f\subtext{predict} is not good: yes or no?

Answering these questions requires knowledge of the domain

  • (necessary, not sufficient)
  • better/more with exploration of the data
    • which data?
    • how to explore it? \rightarrow data visualization
48 / 366

How to choose components?

Component:

  • learning technique
  • pre- or post-processing technique
  • dataset
  • assessment technique

Factors: beyond applicability, which is a yes/no matter

  • effectiveness
    • the component works well (experimental assessment, evaluation metrics and methods)
  • efficiency
    • using the component consumes low resources
  • interpretability
    • the working of the component and/or its outcomes is understandable
  • familiarity
    • the designer does little effort for using the component: e.g., already knows the software tool, good parameter values, ...
  • technological constraints
49 / 366

Example of Iris

Once upon a time, there were Alice, a ML expert, and Bob, an amateur botanist...

Why a story?

  • we need a concrete case in order to practice the phases of ML design (steps 1–3)
  • those steps cannot be practiced on an abstract case
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
50 / 366

Iris species

Once upon a time, there were Alice, a ML expert, and Bob, an amateur botanist.

Bob liked to collect Iris flowers and to sort them properly in his collection boxes at home. He organized the collected flowers depending on their species.

[photos: Iris setosa, Iris versicolor, Iris virginica]

51 / 366

Bob's need

Alice: What's up, Bob?
Bob: I'd like to put new flowers into proper boxes.
Well... I'm not an expert of flowers. Can't you do it by yourself?
No, actually I cannot. But I heard you now master the art of machine learning...
Mmmmhhh... I see that you already have flowers in boxes. How did you sort them? Why ML now?
Well, I used to go to a professional botanist, who was able to tell me the species of each Iris flower I collected. I don't want to bother her anymore and her lab is far from here and it takes time to get there and the fuel is getting more and more pricey... 🦖
Ok, I understand. So you think ML can be helpful here. Let's see...

Some information about the context up to here (Alice's thoughts 💭):

  • problem timings: no real hurry to make a decision
  • scale of the problem: how many flowers would Bob collect per unit of time?
  • cost of the solution: Bob is basically trying to replace a free service with another free service...
  • expected quality of the solution: how picky will Bob be?

No car accidents to be avoided (timing), no billions of emails to be analyzed (scale), no big business process involved (cost), no loan decisions to be made (quality).

52 / 366

Tackling the Iris problem: phase 1 - ML?

Reasons for running f\subtext{predict} on a machine:

  • 👎 y has to be computed very quickly
  • 👎 y has to be computed in a dangerous context
  • 🤏 the value of y is very low
  • 🤌 a human would be biased in deciding y

Reasons for learning f\subtext{predict} on a machine:

  • 👍 humans cannot design a reasonable f\subtext{predict}
  • 🤌 a human-made f\subtext{predict} is too costly/slow
  • 🤏 a human-made f\subtext{predict} is not good

👍: yes!; 👎: no!; 🤏: maybe a bit; 🤌: who knows...

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Outcome: ok, let's use ML!

53 / 366

Phase 2 - supervised vs. unsupervised

Do we have examples at hand?

Yes, Bob already collected some flowers and organized them in boxes. For each of them, there's a species label that has been assigned by an expert (the professional botanist). We assume those labels are correctly assigned.

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Outcome: it's supervised learning!

54 / 366

Phase 3 - problem statement

In natural language: given an Iris flower, assign a species

Formally:
X = \{x: x \text{ is an Iris flower}\}
Y = \{\text{setosa}, \text{versicolor}, \text{virginica}\}

Issues with this X: 🤔

  • is that a useful definition? that is: can it be used for judging the membership of an object in X? 🌸 \overset{?}{\in} X
  • is an x \in X processable by a machine? recall that in later phases:
    • we want to take an f\subtext{learn} that is able to learn an f\subtext{predict}: X \to Y and use f\subtext{learn} on a machine
    • we want to use the learned f\subtext{predict} on a machine
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
55 / 366

Phase 3 - shaping X

We cannot just take another X, because the problem is "given an Iris flower, assign a species". But we can introduce some pre-processing¹ steps that transform an x \in X into an x' \in X', with X' being better, more suitable, for later steps.

That is, we can design an f\subtext{pre-proc}: X \to X' and an X'!

Requirements:

  • designing and applying f\subtext{pre-proc} should have an acceptable cost
  • an x' = f\subtext{pre-proc}(x) should retain the information of x that is useful for obtaining a y
  • X' should be compatible with one or more learning techniques (see)
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
  1. If x \in X is not digital, we consider f\subtext{pre-proc} to be applied outside the ML system, hence its definition is part of the problem statement; otherwise, if x \in X is natively digital, then each f\subtext{pre-proc} can be considered part of the ML system, and its definition is done in phase 4.
56 / 366

Phase 3 - feature engineering

Since most learning techniques are designed to work on a multivariate X, we are going to design an f\subtext{pre-proc}: X \to X' = X'_1 \times \dots \times X'_p. That is, we are going to define the features and the way to compute them out of an x.

This step is called feature engineering and is in practice a key step in the design of an ML system, often more important than the choice of the learning technique:

  • because of the key requirement of retaining the information contained in x
  • because it is often done before collecting the dataset, which may be a costly, hardly repeatable operation
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
57 / 366

Phase 3 - feature engineering for Iris

Some options:

Function f\subtext{pre-proc}            Set X'                           Cost    Info¹  Comp.²
x' is a textual description of x        strings                          🫰🫰    🫳     👍
x' is a digital picture of x            [0,1]^{512 \times 512 \times 3}  🫰      🫳     👍³
x' is "the" DNA of x                    \{A, C, G, T\}^*                 🫰🫰🫰  👍     👍
x' is some measurements of x            \mathbb{R}^p                     🫰      🫳     👍
  1. Info retain: 👍: large, i.e., good; 🫳: medium; 👎: small, i.e., bad.
  2. Compatibility: 👍: large, i.e., good; 🫳: medium; 👎: small, i.e., bad.
  3. Not if Alice just attends this course...

The actual decision should be taken by Alice and Bob together, based on domain knowledge of the latter and ML knowledge of the former.

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Requirements for f\subtext{pre-proc}: X \to X':

  • proper cost
  • retaining information
  • compatibility
58 / 366

Phase 3 - flower to vector

Assume the choice "x' is some measurements of x", namely 4 measurements; then f\subtext{pre-proc}: X \to \mathbb{R}^4 and f\subtext{pre-proc}(x) = \vect{x}' = (x'_1, x'_2, x'_3, x'_4), with:

  • x'_1 being the¹ sepal length of x in cm
  • x'_2 being the sepal width of x in cm
  • x'_3 being the petal length of x in cm
  • x'_4 being the petal width of x in cm

[image: Iris sepal and petal measurements]

x'_1  x'_2  x'_3  x'_4  y
5.1   3.5   1.4   0.2   setosa
7.0   3.2   4.7   1.4   versicolor
6.3   3.3   6.0   2.5   virginica

  1. Which one? It has to be decided! E.g., the longest one, the mean value, ...
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Requirements for f\subtext{pre-proc}: X \to X':

  • proper cost
  • retaining information
  • compatibility
59 / 366
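A minimal Java sketch (not from the slides) of the outcome of this f\subtext{pre-proc}: each flower becomes a vector \vect{x}' \in \mathbb{R}^4, paired in the learning set with its species label y; the names IrisExample and IrisDemo are made up for illustration:

// x' = (x'_1, x'_2, x'_3, x'_4) in R^4, plus the response y (the species)
record IrisExample(double sepalLength, double sepalWidth,
                   double petalLength, double petalWidth, String species) {}

class IrisDemo {
  public static void main(String[] args) {
    // the first row of the table above
    IrisExample e = new IrisExample(5.1, 3.5, 1.4, 0.2, "setosa");
    System.out.println(e);
  }
}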

Phases 1 and 3 - explore the data

Alice's thoughts 💭: Is it true that we cannot design a reasonable f\subtext{predict}? Are we retaining information?

Let's look at the data!

  • which data? Bob, give me your samples and let's measure them
  • what to look at?
    1. mean values per species and feature
    2. boxplots of values per species and feature
    3. pairwise (with respect to feature) scatterplots of observations

How does Alice choose these 3 approaches, in this order?

  • experience
  • nature of X' (here \mathbb{R}^4)
  • knowledge of basic plots and their cost
  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
60 / 366

Phases 1 and 3 - data mean values

Mean values per species and feature:

iris %>% group_by(Species) %>% summarise_all(mean)
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33
3 virginica          6.59        2.97         5.55       2.03

Findings: setosa looks more different

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
61 / 366

Phases 1 and 3 - boxplots

Boxplots of values per species and feature:

iris %>% pivot_longer(cols=!Species) %>%
  ggplot(aes(x=name, y=value, color=Species)) + geom_boxplot()

[plot: Iris boxplots]

Findings: overlap between versicolor and virginica, for all features

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system
62 / 366

Phases 1 and 3 - pairwise scatterplots

Pairwise scatterplots of observations:

ggpairs(iris, columns=1:4, aes(color=Species, alpha=0.5),
        upper=list(continuous="points"))

[plot: Iris pairwise scatterplots]

Findings: overlap!

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement)
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Questions:

  • cannot design an f\subtext{predict}?
  • retaining information?

Outcome:
Yes, let's use ML!

63 / 366

Phase 3 - solution assessment

Problem statement:

  • define XX and YY
  • define a way for assessing solutions ❌

How?

  • next part
64 / 366

The true story of the Iris dataset

Anderson, Edgar. "The species problem in Iris." Annals of the Missouri Botanical Garden 23.3 (1936): 457-509.

Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179-188.

1936!!!

65 / 366

Basic concepts

Brief recap

66 / 366

Refining a definition of ML

Machine Learning is the science of getting computers to learn without being explicitly programmed.
\downarrow
Machine Learning is the science of getting computers to learn f: X \to Y without being explicitly programmed.
\downarrow
Machine Learning is the science of getting computers to learn f: X \to Y autonomously.
\downarrow
Supervised (Machine) Learning is the science of getting computers to learn f: X \to Y from examples autonomously.
\downarrow
Supervised (Machine) Learning is about designing and applying supervised learning techniques. A supervised learning technique is a way for learning an f\subtext{predict} \in \mathcal{F}_{X \to Y} given a D\subtext{learn} \in \mathcal{P}^*(X \times Y).

67 / 366

Key terms

  • each x \in X is an observation, input, data point, or instance
  • each y \in Y is a response or output
  • D = \{(x^{(i)}, y^{(i)})\}_i is a dataset compatible with X and Y; a learning set if used for learning
  • the learning phase is when f\subtext{learn} is being applied
  • the prediction phase is when f\subtext{predict} is being applied
  • a model is the variable part m of a templated f\subtext{predict}(x) = f'\subtext{predict}(x, m)
  • if Y is finite and without ordering, it's a classification problem
    • if |Y| = 2, it's binary classification
    • if |Y| > 2, it's multiclass classification
  • if Y = \mathbb{R}, it's a regression problem
  • if X = X_1 \times \dots \times X_p is multivariate, each x_i is an independent variable, feature, attribute, or predictor
  • y is the dependent variable or response variable
    • in classification, y is the class label
68 / 366

Supervised learning technique

A learning technique is defined by the pair f'\subtext{learn}, f'\subtext{predict}:

f'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M \qquad f'\subtext{predict}: X \times M \to Y

[diagram: D\subtext{learn} \to f'\subtext{learn} \to m; \; x, m \to f'\subtext{predict} \to y]

Supervised (Machine) Learning is about designing and applying supervised learning techniques. A supervised learning technique is defined by:

  • a way f'\subtext{learn} for learning a model m \in M given a D\subtext{learn} \in \mathcal{P}^*(X \times Y);
  • a way f'\subtext{predict} for computing a response y given an observation x and a model m.
69 / 366

Phases of design of an ML system

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement):
    • define XX and YY
    • feature engineering
    • define a way for assessing solutions
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

Arguments for f\subtext{predict} on a machine:

  • computing y quickly
  • dangerous context
  • low value of y
  • avoiding human bias

Arguments for f\subtext{learn} on a machine:

  • cannot build f\subtext{predict} manually
  • cost of building f\subtext{predict} manually
  • quality of a manually built f\subtext{predict}

Requirements for f\subtext{pre-proc}: X \to X':

  • proper cost
  • retaining information
  • compatibility
70 / 366

Assessing supervised ML

71 / 366

What to assess?

Subject of the assessment:

  • an ML system (all components)
  • a supervised learning technique (f\subtext{learn} and f\subtext{predict})
  • a model (m used in an f'\subtext{predict})
72 / 366

Axes of assessment

Assume something is assessed with respect to a given goal:

  • Effectiveness: to which degree is the goal achieved?
    • goal poorly achieved \rightarrow low effectiveness 😢
    • goal completely achieved \rightarrow high effectiveness 😁
  • Efficiency: how many resources are consumed for achieving the goal (to some degree)?
    • large amount of resources \rightarrow low efficiency 😢
    • small amount of resources \rightarrow high efficiency 😁
  • Interpretability (or explainability): to which degree is the way the goal is achieved (or not achieved) explainable?
    • poorly explainable \rightarrow low interpretability 😢
    • fully explainable \rightarrow high interpretability 😁
73 / 366

Purposes of assessment

Given an axis a of assessment:

  • absolute assessment: does something meet the expectation in terms of a?
    • is a model effective enough?
    • is a learning technique explainable enough?
    • is an ML system efficient enough?
  • comparison: is one thing better than another thing in terms of a?
    • is model m_1 more effective than model m_2? (maybe obtained with the same technique and different parameters)
    • is this learning technique more efficient than that learning technique?

"enough" represents some expectation, some minimum degree of a to be reached.

If the outcome of assessment is a quantity (i.e., a number) with a monotonic semantics:

  • comparison corresponds to checking for > or <
  • absolute assessment corresponds to:
    • establishing a threshold and
    • checking for > or <

We want assessment to produce a number!

74 / 366

Effectiveness and subject

A ML system can be seen as a composite learning technique. It has two running modes: one in which it tunes itself, one in which it makes decisions. ML system goals are:

  • tuning properly (i.e., such that, after tuning it makes good decisions)
  • making good decisions

A supervised learning technique is a pair flearn,fpredictf\subtext{learn},f\subtext{predict}. Its goals are:

  • learning (with flearnf\subtext{learn}) a good fpredictf\subtext{predict}, i.e., an fpredictf\subtext{predict} that makes good decisions
  • making good decisions

A model has one goal:

  • making good decisions (when used in an fpredictf'\subtext{predict})

Eventually, effectiveness is about making good decisions!

  • Ideally, we want to measure effectiveness with numbers.
75 / 366

Model vs. real system

How to measure if an fpredictf'\subtext{predict} is making good decisions?

Recall: fpredictf\subtext{predict}, possibly through fpredictf'\subtext{predict} and a model mm, models the dependency of yy on xx.

Key underlying assumption: yy depends on xx. That is, there exists some real system s:XYs: X \to Y that, given an xx, produces a yy based on xx, i.e., sFXYs \in \mathcal{F}_{X \to Y}:

  • given a flat xx, an economic system determines the price yy of xx on the real estate market
  • given two basketball teams about to play a match xx, a sporting event determines the outcome yy of xx

Or, there exists in reality some system s1:YXs^{-1}: Y \to X that, given a yy, produces an xx based on yy:

  • given a seed of an Iris flower of a given species yy, nature eventually develops it into an Iris flower xx

Model mm (or fpredictf\subtext{predict})

fpredict(,m)f'\subtext{predict}(\cdot, m)xxyy

Real system ss

ssxxyy

A templated fpredict:X×MYf'\subtext{predict}: X \times M \to Y with a fixed model mm is an fpredict:XYf\subtext{predict}: X \to Y.

76 / 366

Comparing mm and ss

Model mm (or fpredictf\subtext{predict})

fpredict(,m)f'\subtext{predict}(\cdot, m)xxyy

Real system ss

ssxxyy

How to see if the model mm is modeling the system ss well?

Direct comparison:

  1. "open" ss and look inside
  2. "open" mm and look inside
  3. compare internals of ss and mm

Issues:

  • in practice, ss can rarely/hardly be opened
  • mm might be hard to open

Comparison of behaviors:

  1. collect some examples of the behavior of ss
  2. feed mm with examples
  3. compare responses of ss and mm

Ideally, we want the comparison (step 3) outcome to be a number.

77 / 366

Comparing behaviors

fcomp-behavior:FXY×FXYRf\subtext{comp-behavior}: \mathcal{F}_{X \to Y} \times \mathcal{F}_{X \to Y} \to \mathbb{R}

fcomp-behaviorf\subtext{comp-behavior}fpredict,sf\subtext{predict},sveffectv\subtext{effect}

Or, to highlight the presence of a model in a templated fpredictf\subtext{predict}:

fcomp-behavior:FX×MY×M×FXYRf\subtext{comp-behavior}: \mathcal{F}_{X \times M \to Y} \times M \times \mathcal{F}_{X \to Y} \to \mathbb{R}

fcomp-behaviorf\subtext{comp-behavior}fpredict,m,sf'\subtext{predict},m,sveffectv\subtext{effect}

In both cases:

function comp-behavior(fpredict,s)\text{comp-behavior}(f\subtext{predict}, s) {
{x(i)}icollect()\{x^{(i)}\}_i \gets \text{collect}()
{y(i)}iforeach({x(i)}i,s)\{y^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, s)
{y^(i)}iforeach({x(i)}i,fpredict)\{\hat{y}^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, f\subtext{predict})
veffectcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets \text{comp-resps}(\{(y^{(i)},\hat{y}^{(i)})\}_i)
return veffectv\subtext{effect};
}

  1. collect some examples of the behavior of ss
  2. feed mm with examples
  3. compare responses of ss and mm

More correctly, {(y(i),y^(i))}iforeach({x(i)}i,both(,s,fpredict))\seq{(y^{(i)},\hat{y}^{(i)})}{i} \gets \text{foreach}(\seq{x^{(i)}}{i}, \text{both}(\cdot, s, f\subtext{predict})) with fboth:X×FXY2Y2f\subtext{both}: X \times \mathcal{F}^2_{X \to Y} \to Y^2 and fboth(x,f1,f2)=(f1(x),f2(x))f\subtext{both}(x, f_1, f_2) = (f_1(x),f_2(x)).
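To make the pseudo-code concrete, here is a minimal Java sketch (all names are hypothetical): the abstract parts, fcollectf\subtext{collect} and fcomp-respsf\subtext{comp-resps}, are passed in as parameters.

import java.util.List;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;

// Minimal sketch of comp-behavior; all names are hypothetical.
// fPredict plays the role of f_predict, s of the (observable) real system;
// xs stands for the output of collect(), compResps for f_comp-resps.
public class CompBehavior {

  record Pair<A, B>(A first, B second) {}

  static <X, Y> double compBehavior(
      Function<X, Y> fPredict,
      Function<X, Y> s,
      List<X> xs,
      ToDoubleFunction<List<Pair<Y, Y>>> compResps
  ) {
    List<Pair<Y, Y>> pairs = xs.stream()
        .map(x -> new Pair<>(s.apply(x), fPredict.apply(x))) // (y^(i), ŷ^(i))
        .toList();
    return compResps.applyAsDouble(pairs); // v_effect
  }
}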

78 / 366

Remarks on fcomp-behaviorf\subtext{comp-behavior}

function comp-behavior(fpredict,s)\text{comp-behavior}(f\subtext{predict}, s) {
{x(i)}icollect()\{x^{(i)}\}_i \gets \text{collect}()
{y(i)}iforeach({x(i)}i,s)\{y^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, s)
{y^(i)}iforeach({x(i)}i,fpredict)\{\hat{y}^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, f\subtext{predict})
veffectcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets \text{comp-resps}(\{(y^{(i)},\hat{y}^{(i)})\}_i)
return veffectv\subtext{effect};
}

fcomp-behaviorf\subtext{comp-behavior}fpredict,sf\subtext{predict},sveffectv\subtext{effect}
  1. collect examples of ss behavior
  2. feed mm with examples
  3. compare responses of ss and mm
  • it's a partially abstract function: fcollectf\subtext{collect} and fcomp-respsf\subtext{comp-resps} are abstract (i.e., not given here)
  • we may reason about effectiveness and efficiency of fcomp-behaviorf\subtext{comp-behavior}, but both depend on concrete fcollectf\subtext{collect} and fcomp-respsf\subtext{comp-resps}
    • effectiveness: to which degree fcomp-behaviorf\subtext{comp-behavior} measures if mm behaves like ss?
    • efficiency: how much resources are consumed to apply fcomp-behaviorf\subtext{comp-behavior}?

We'll see many concrete options for fcomp-respsf\subtext{comp-resps}

fcollectf\subtext{collect} is instead hard to define, but it's more important than fcomp-respsf\subtext{comp-resps}

  • working with good data is important!
79 / 366

The importance of fcollectf\subtext{collect} in assessment

  • How many observations to collect? (data size) nn in {(x(i))}i=1i=ncollect()\{(x^{(i)})\}_{i=1}^{i=n} \gets \text{collect}()
  • Which observations to collect? (data coverage)

Goal: the behavior {(x(i),y(i))}i=1i=n\{(x^{(i)},y^{(i)})\}_{i=1}^{i=n} has to be representative of the real system ss

  • the larger nn, the more representative
  • the better the coverage of XX, the more representative

Concerning size nn:

  • small nn, poor effectiveness 👎, great efficiency 👍
  • large nn, great effectiveness 👍, poor efficiency 👎

Concerning coverage of XX

  • poor coverage, poor effectiveness 👎
  • good coverage, good effectiveness 👍

Focus on coverage, rather than size, because it has no drawbacks!

80 / 366

Comparing responses with fcomp-respsf\subtext{comp-resps}

Formally:

fcomp-resps:P(Y2)Rf\subtext{comp-resps}: \mathcal{P}^*(Y^2) \to \mathbb{R}

fcomp-respsf\subtext{comp-resps}{(y(i),y^(i))}i\{(y^{(i)},\hat{y}^{(i)})\}_iveffectv\subtext{effect}

function comp-behavior(fpredict,s)\text{comp-behavior}(f\subtext{predict}, s) {
{x(i)}icollect()\{x^{(i)}\}_i \gets \text{collect}()
{y(i)}iforeach({x(i)}i,s)\{y^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, s)
{y^(i)}iforeach({x(i)}i,fpredict)\{\hat{y}^{(i)}\}_i \gets \text{foreach}(\{x^{(i)}\}_i, f\subtext{predict})
veffectcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets \text{comp-resps}(\{(y^{(i)},\hat{y}^{(i)})\}_i)
return veffectv\subtext{effect};
}

where {(y(i),y^(i))}iP(Y2)\{(y^{(i)},\hat{y}^{(i)})\}_i \in \mathcal{P}^*(Y^2) is a multiset of pairs of yy.

Depends only on YY, not on XX!

We'll see a few options for the main cases:

  • classification
    • all (i.e., agnostic with respect to Y|Y|): error, accuracy
    • binary: FPR and FNR (and variants), EER, AUC
    • multiclass: weighted accuracy
  • regression: MAE, MSE, RMSE, MAPE

All of these are performance indexes.

81 / 366

Assessing models

Classification

82 / 366

Classification error

Recall: in classification YY is a finite set with no ordering

Classification error: more verbosely: classification error rate ferr({(y(i),y^(i))}i=1i=n)=1ni=1i=n1(y(i)y^(i))f\subtext{err}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n})=\frac{1}{n}\sum_{i=1}^{i=n}\mathbf{1}(y^{(i)}\ne \hat{y}^{(i)}) where 1:{false,true}{0,1}\mathbf{1}: \{\text{false},\text{true}\} \to \{0,1\} is the indicator function: 1(b)={1if b=true0otherwise\mathbf{1}(b) = \begin{cases} 1 &\text{if } b = \text{true}\\ 0 &\text{otherwise} \end{cases}

  • ferrf\subtext{err} is a concrete instance of fcomp-respsf\subtext{comp-resps}
  • the codomain of ferrf\subtext{err} is [0,1][0,1]: [0,1]R[0,1] \subseteq{\mathbb{R}}, so it can be a concrete instance
    • 00 means no errors, it's good 👍
    • 11 means all errors, it's bad 👎
  • in general, numbers in [0,1][0,1] can be expressed as percentages in [0,100][0,100]: xx is the same as 100x%100 x\%
83 / 366

Classification accuracy

Classification accuracy: facc({(y(i),y^(i))}i=1i=n)=1ni=1i=n1(y(i)=y^(i))f\subtext{acc}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n})=\frac{1}{n}\sum_{i=1}^{i=n}\mathbf{1}(y^{(i)} \c{3}{=} \hat{y}^{(i)})

Clearly, facc({(y(i),y^(i))}i=1i=n)=1ferr({(y(i),y^(i))}i=1i=n)f\subtext{acc}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n})=1-f\subtext{err}(\{(y^{(i)},\hat{y}^{(i)})\}_{i=1}^{i=n}).

The codomain of faccf\subtext{acc} is also [0,1][0,1]:

  • 11 means no errors, it's good 👍
  • 00 means all errors, it's bad 👎

For accuracy, the greater, the better.
For error, the lower, the better.

In principle, the only requirement concerning YY for both ferrf\subtext{err} and faccf\subtext{acc} is that there is an equivalence relation on YY, i.e., that == is defined over YY. However, in practice YY is a finite set without ordering.
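As a minimal concrete sketch (Java, hypothetical names), both indexes need nothing more than equals() on YY:

import java.util.List;

// Minimal sketch: f_err and f_acc as concrete instances of f_comp-resps.
public class ErrorAndAccuracy {

  static <Y> double error(List<Y> ys, List<Y> yHats) {
    int wrong = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (!ys.get(i).equals(yHats.get(i))) wrong++; // indicator 1(y != ŷ)
    }
    return (double) wrong / ys.size();
  }

  static <Y> double accuracy(List<Y> ys, List<Y> yHats) {
    return 1d - error(ys, yHats); // f_acc = 1 - f_err
  }

  public static void main(String[] args) {
    System.out.println(error(
        List.of("spam", "spam", "not-spam"),
        List.of("spam", "not-spam", "not-spam"))); // 0.333...
  }
}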

84 / 366

Bounds for accuracy (and error)

In principle, accuracy is in [0,1][0,1].

Recall that faccf\subtext{acc} is part of an fcomp-behaviorf\subtext{comp-behavior} that should measure how well a model mm models a real system ss.
What are the ideal extreme cases in practice?

  • mm is ss, so it perfectly models ss
  • mm is random, does not model any dependency of yy on xx

From another point of view, what would be the accuracy of a:

  • model that perfectly models the system?
  • random model?
85 / 366

The random classifier (lower bound)

The random classifier¹ is an XYX \to Y function doing:

frnd(x)=yi with iU({1,,Y})f\subtext{rnd}(x) = y_i \text{ with } i \sim U(\{1,\dots,|Y|\})

where iU(A)i \sim U(A) means choosing an item of AA with uniform probability.

Here A={1,,Y}A=\{1,\dots,|Y|\}, hence frnd(x)f\subtext{rnd}(x) gives a random yy, without using xx, i.e., no dependency.

Considering all possible multisets of responses P(Y)\mathcal{P}^*(Y), the accuracy of the random classifier is, on average, 1Y\frac{1}{|Y|}.

  1. classifier is a shorthand for:
    • a model for doing classification, i.e., an fpredictf'\subtext{predict} with categorical YY
    • a supervised learning technique for classification, i.e., a pair flearn,fpredictf'\subtext{learn}, f'\subtext{predict} with categorical YY
86 / 366

Dummy classifier (better lower bound)

Given one specific multiset of responses {y(i)}i\{y^{(i)}\}_i, the dummy classifier is the one that always predicts the most frequent class in {y(i)}i\{y^{(i)}\}_i: fdummy,{y(i)}i(x)=arg maxyY1ni=1i=n1(y=y(i))=arg maxyYFr ⁣(y,{y(i)}i)f_{\text{dummy},\{y^{(i)}\}_i}(x) = \argmax_{y \in Y} \frac{1}{n} \sum_{i=1}^{i=n} \mathbf{1}(y=y^{(i)})=\argmax_{y \in Y} \freq{y, \{y^{(i)}\}_i} On the {y(i)}i\{y^{(i)}\}_i on which it is built, the accuracy of the dummy classifier is maxyYFr ⁣(y,{y(i)}i)\max_{y \in Y} \freq{y, \{y^{(i)}\}_i}.

Recall: we use faccf\subtext{acc} on one specific {y(i)}i\{y^{(i)}\}_i.

Like the random classifier, the dummy classifier does not use xx.

Dummy

dummy [duhm-ee]: a representation of a human figure, as for displaying clothes in store windows

Looks like a human, but does nothing!
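A minimal Java sketch of the two baselines (hypothetical names); both ignore xx, and the dummy one just memorizes the most frequent class at learning time:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Minimal sketch of the random and dummy baseline classifiers.
public class Baselines {

  // f_rnd: ignores x, picks y_i with i ~ U({1,...,|Y|})
  static <Y> Y predictRandom(List<Y> classes, Random random) {
    return classes.get(random.nextInt(classes.size()));
  }

  // dummy "learning": find the most frequent class in {y^(i)}_i
  static <Y> Y learnDummy(List<Y> ys) {
    Map<Y, Integer> counts = new HashMap<>();
    ys.forEach(y -> counts.merge(y, 1, Integer::sum));
    return counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .orElseThrow()
        .getKey(); // argmax_y Fr(y, {y^(i)}_i)
  }

  public static void main(String[] args) {
    List<String> ys = List.of("tails", "tails", "heads", "tails");
    System.out.println(learnDummy(ys)); // tails; accuracy on ys: 3/4 = 75%
  }
}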

87 / 366

Random/dummy classifier: examples

Case: coin tossing, Y={heads,tails}Y=\{\c{1}{\text{heads}},\c{2}{\text{tails}}\}

Random on average (with frndf\subtext{rnd}):

(examples of response multisets {y(i)}i\seq{y^{(i)}}{i} and random predictions {y^(i)}i\seq{\hat{y}^{(i)}}{i}, with accuracies 50%50\%, 25%25\%, ..., 100%100\%, 0%0\%)
Average accuracy = 50%50\%

Dummy on {y(i)}i=⬤⬤⬤⬤\seq{y^{(i)}}{i}=\htmlClass{col2 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}
(with fdummy,⬤⬤⬤⬤f_{\text{dummy},\htmlClass{col2 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}}):

facc(⬤⬤⬤⬤,⬤⬤⬤⬤)=75%f\subtext{acc}(\htmlClass{col2 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}},\htmlClass{col2 st}{\text{⬤⬤⬤⬤}}) = 75\%

Case: Iris, Y={setosa,versicolor,virginica}Y=\{\c{1}{\text{setosa}},\c{2}{\text{versicolor}},\c{3}{\text{virginica}}\}

Random on average (with frndf\subtext{rnd}):

(examples of response multisets {y(i)}i\seq{y^{(i)}}{i} and random predictions {y^(i)}i\seq{\hat{y}^{(i)}}{i}, with accuracies 17%\approx 17\%, 50%50\%, ..., 100%100\%, 0%0\%)
Average accuracy 33%\approx 33\%

Dummy on {y(i)}i=⬤⬤⬤⬤\seq{y^{(i)}}{i}=\htmlClass{col3 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}
(with fdummy,⬤⬤⬤⬤f_{\text{dummy},\htmlClass{col3 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}}}):

facc(⬤⬤⬤⬤,⬤⬤⬤⬤)=50%f\subtext{acc}(\htmlClass{col3 st}{\text{⬤⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤⬤⬤⬤}}) = 50\%

Here facc(⬤⬤⬤,⬤⬤⬤)f\subtext{acc}(\htmlClass{col3 st}{\text{⬤}}\htmlClass{col1 st}{\text{⬤}}\htmlClass{col2 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤⬤⬤}}) stands for facc({(,),(,),(,)})f\subtext{acc}(\{(\htmlClass{col3 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤}}),(\htmlClass{col1 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤}}),(\htmlClass{col2 st}{\text{⬤}},\htmlClass{col3 st}{\text{⬤}})\}).

88 / 366

The perfect classifier (upper bound)

A classifier that works exactly as ss:

fperfect(x)=s(x)f\subtext{perfect}(x) = s(x)

If ss is deterministic, the accuracy of fperfect(x)f\subtext{perfect}(x) is 100% on every {x(i)}i\seq{x^{(i)}}{i}, by definition.

Are real systems deterministic in practice?

  • the system that determines whether an email is spam or not-spam
  • Iris species (where nature is an s1s^{-1}...)
  • a bank employee who decides whether or not to grant a loan
  • the real estate market forming the price of a flat (Y=R+Y=\mathbb{R}^+)
89 / 366

The Bayes classifier (better upper bound)

A non-deterministic system (i.e., a stochastic or random system) is one that, given the same xx, may output different values of yy.

The Bayes classifier is an ideal model of a real system that is not deterministic:

fBayes(x)=arg maxyYPr ⁣(s(x)=yx)f\subtext{Bayes}(x) = \argmax_{y \in Y} \prob{s(x)=y \mid x}

where Pr ⁣(s(x)=yx)\prob{s(x)=y \mid x} is the probability that ss gives yy for xx.

Key facts:

  • on a given {x(i)}i\seq{x^{(i)}}{i} the accuracy of the Bayes classifier is 100%\le 100\% (it may be lower than 100%)
  • on P(X)\mathcal{P}^*(X), i.e., on all possible multisets of observations xx, the Bayes classifier is the optimal classifier, i.e., no other classifier can score a better accuracy it can be proven, not here!

In practice:

  • the Bayes classifier is an ideal classifier: "building" it requires knowing how ss works, which is infeasible in practice
  • intuitively, the more random the system, the lower the accuracy of the Bayes classifier
90 / 366

The Bayes classifier: example

The real system ss is the professor deciding if a student will pass or fail the exam of Introduction to ML. The professor just looks at the student's degree course to decide ❗ fake! and is a bit stochastic.

X={IN19,IN20,SM34,SM35,SM64}X =\{\text{IN19},\text{IN20},\text{SM34},\text{SM35},\text{SM64}\}
Y={fail,pass}Y = \{\text{fail},\text{pass}\}

The probability according to which the professor "reasons" is completely known:

fail\text{fail} pass\text{pass}
IN19\text{IN19} 20%20\% 80%80\%
IN20\text{IN20} 15%15\% 85%85\%
SM34\text{SM34} 60%60\% 40%40\%
SM35\text{SM35} 80%80\% 20%20\%
SM64\text{SM64} 20%20\% 80%80\%
❗ these are fake numbers!

Pr ⁣(s(x)=yx)={20%if x=IN19y=fail80%if x=IN19y=pass15%if x=IN20y=fail80%if x=SM64y=pass\prob{s(x)=y \mid x}=\begin{cases} 20\% &\text{if } x=\text{IN19} \land y=\text{fail} \\ 80\% &\text{if } x=\text{IN19} \land y=\text{pass} \\ 15\% &\text{if } x=\text{IN20} \land y=\text{fail} \\ \dots \\ 80\% &\text{if } x=\text{SM64} \land y=\text{pass} \end{cases} the table is a compact form for this probability

fBayes(x)={passif x=IN19passif x=IN20failif x=SM34failif x=SM35passif x=SM64f\subtext{Bayes}(x) = \begin{cases} \text{pass} &\text{if } x=\text{IN19} \\ \text{pass} &\text{if } x=\text{IN20} \\ \text{fail} &\text{if } x=\text{SM34} \\ \text{fail} &\text{if } x=\text{SM35} \\ \text{pass} &\text{if } x=\text{SM64} \end{cases} built using the definition fBayes(x)=arg maxyYPr ⁣(s(x)=yx)f\subtext{Bayes}(x) = \argmax_{y \in Y} \prob{s(x)=y \mid x}
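A minimal Java sketch of fBayesf\subtext{Bayes} (hypothetical names), with the table above encoded as a nested map from xx to the distribution over YY:

import java.util.Map;

// Minimal sketch: f_Bayes as an argmax over a fully known Pr(s(x)=y | x).
public class BayesClassifier {

  static String fBayes(String x, Map<String, Map<String, Double>> pr) {
    return pr.get(x).entrySet().stream()
        .max(Map.Entry.comparingByValue()) // argmax_y Pr(s(x)=y | x)
        .orElseThrow()
        .getKey();
  }

  public static void main(String[] args) {
    Map<String, Map<String, Double>> pr = Map.of(
        "IN19", Map.of("fail", 0.20, "pass", 0.80),
        "SM34", Map.of("fail", 0.60, "pass", 0.40));
    System.out.println(fBayes("SM34", pr)); // fail
  }
}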

Questions

  • what's the accuracy of fBayesf\subtext{Bayes}? What's the model for the Bayes classifier? What's MM?
  • what's the accuracy of fdummyf\subtext{dummy}? And of frndf\subtext{rnd}?
91 / 366

Classification accuracy bounds

Lower Upper
By definition 00 11
Bounds, all data 1Y\frac{1}{\lvert Y\rvert} 11
Better bounds, with one {x(i)}i\seq{x^{(i)}}{i} maxyYFr ⁣(y,{s(x(i))}i)\max_{y \in Y} \freq{y, \{s(x^{(i)})\}_i} 1\le 1

If {x(i)}i\seq{x^{(i)}}{i} is collected properly, it is representative of the behavior of the real system (together with the corresponding {s(x(i))}i\seq{s(x^{(i)})}{i}), hence the third case is the most relevant one:

facc()[maxyYFr ⁣(y,{s(x(i))}i),1ϵ]f\subtext{acc}(\cdot) \in [\max_{y \in Y} \freq{y, \{s(x^{(i)})\}_i}, 1 - \epsilon] ϵ>0\epsilon > 0 is actually unknown

In practice, use the random classifier as a baseline and

  • do not cry 😭 for a missed 100%100\%
  • do not be too happy 🥳 just because you score >0%> 0\%
92 / 366

All data

All data means all the theoretically possible datasets, i.e., for just yy, P(Y)\mathcal{P}^*(Y).

  • on average in P(Y)\mathcal{P}^*(Y), the frequency of each yiYy_i \in Y is 1Y\frac{1}{|Y|}

In practice not all possible datasets are equally probable.

  • often, the frequencies fif_i of yiy_i are known (at least an approximation of them).
  • in these cases, the (approximate) lower bound (that of the dummy classifier built with these frequencies) is: maxifi\max_i f_i

Example: for spam, xx is an email, i.e., a string of text, yy is spam\text{spam} or ¬spam\neg\text{spam}:

  • are we interested in measuring the accuracy of a spam filter on all possible strings (theory)?
  • or are we more interested in knowing its accuracy for actual emails (practice)?
93 / 366

Building the dummy classifier

Consider the dummy classifier as a supervised learning technique:

  • in learning phase: compute frequencies/probability of classes concrete
  • in prediction phase: choose the most frequent class concrete

Hence, formally:

  • a model mMm \in M is: these are alternative options
    1. the class frequencies f=(f1,,fY)\c{2}{\vect{f} = (f_1,\dots,f_{|Y|})}, with M=FY={f[0,1]Y:f1=1}M=F_Y=\{\vect{f} \in [0,1]^{|Y|}: \lVert \vect{f} \rVert_1=1\}

      x1\lVert \vect{x} \rVert_1 is the 1-norm of a vector x=(x1,,xp)\vect{x}=(x_1,\dots,x_p) with x1\lVert \vect{x} \rVert_1 =ixi=\sum_i x_i

    2. a discrete probability distribution pp over YY, with M=PY={p:Y[0,1] s.t. yYp(y)=1}M=P_Y=\{p: Y \to [0,1] \text{ s.t. } \sum_{y' \in Y} p(y')=1\} s.t.\text{s.t.} stands for "such that"
    3. the yy part {y(i)}i\seq{y^{(i)}}{i} of a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}, with M=P(Y)M=\mathcal{P}^*(Y)
    4. just the most frequent class yy^\star, with M=YM=Y
  • flearn:P(X×Y)Mf'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M abstract
  • fpredict:X×MYf'\subtext{predict}: X \times M \to Y abstract
94 / 366

Building the dummy classifier (options 1 and 2)

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}m\c{2}{m}
fpredictf'\subtext{predict}x,mx, \c{2}{m}yy

Option 1: the model mm is a vector of frequencies: assume Y={y1,y2,}Y=\{y_1, y_2, \dots\}

flearn({(x(i),y(i))}i)=f=(Fr ⁣(yj,{y(i)}i))jf'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{\vect{f}} = \left(\freq{y_j, \seq{y^{(i)}}{i}}\right)_j

fpredict(x,f)=yif'\subtext{predict}(x,\c{2}{\vect{f}})=y_i with i=arg maxifii = \argmax_i f_i

Option 2: the model mm is a discrete probability distribution: here flearnf'\subtext{learn} is a function that returns a function

flearn({(x(i),y(i))}i)=p:p(y)=Fr ⁣(y,{y(i)}i)f'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{p}: p(y)= \freq{y, \seq{y^{(i)}}{i}}

fpredict(x,p)=arg maxyYp(y)f'\subtext{predict}(x,\c{2}{p})=\argmax_{y \in Y} \c{2}{p}(y)

95 / 366

Building the dummy classifier (options 3 and 4)

Option 3: the model mm is simply the learning dataset: just the yy part of it

flearn({(x(i),y(i))}i)={y(i)}if'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{\seq{y^{(i)}}{i}}

fpredict(x,{y(i)}i)=arg maxyYFr ⁣(y,{y(i)}i)f'\subtext{predict}(x,\seq{y^{(i)}}{i})=\argmax_{y \in Y} \freq{y,\c{2}{\seq{y^{(i)}}{i}}}

Option 4: the model mm is the most frequent class yy^\star:

flearn({(x(i),y(i))}i)=y=arg maxyYFr ⁣(y,{y(i)}i)f'\subtext{learn}(\seq{(x^{(i)},y^{(i)})}{i}) = \c{2}{y^\star}=\argmax_{y \in Y} \freq{y,\seq{y^{(i)}}{i}}

fpredict(x,y)=yf'\subtext{predict}(x,y^\star)=\c{2}{y^\star}

For all options, works with:

  • any XX (xx never appears in flearnf'\subtext{learn} and fpredictf'\subtext{predict} bodies)
  • finite YY (categorical yy)

Are they different? How?

They differ in efficiency, are equal in effectiveness:

  • effectiveness as supervised learning techniques, same by definition
  • efficiency, always high, but: just an implementation matter
    • more or less memory for storing the model mm
    • computational effort more in the learning or prediction phase
96 / 366

Assessing models

Binary classification

97 / 366

Binary classification

Binary classification is a very common scenario.

  • assessment is particularly important
  • there are many indexes

Examples:

  • spam detection
  • decide whether there is a dog in a picture
  • clinical test (more properly: diagnostic test)
98 / 366

Example: diagnostic test

Suppose there is an (ML-based) diagnostic test for a given disease dd. just to give it a name, without calling bad luck...

You are told that the accuracy of the test is 99.8%99.8\%.

Is this a good test or not?

In "formal" terms, the test is an fpredict:XYf\subtext{predict}: X \to Y with:

  • X={X=\{🧑‍🦰, 👱, 🙍, 🙎, }\dots\} the set of persons¹
  • Y={has the disease d,does not have the disease d}Y=\{\text{has the disease } d, \text{does not have the disease } d\}

Since Y=2|Y|=2 this is a binary classification problem.

1: or, from another point of view, X={X =\{🧑‍🦰, 👱, 🙍, 🙎, }×T\dots\} \times T, with TT being the time, because you test a person at a given time tt, and the outcome might be different from the test outcome for the same person at a later tt'.

99 / 366

The rare disease

Suppose dd is a rare¹ disease which affects 2\approx 2 people every 10001000 and let the accuracy be again 99.8%99.8\%.

Is this a good test or not?

  1. the definition of rare for a disease varies from country to country, based on the prevalence, with thresholds ranging from 1 in 1538 (Brazil) to 1 in 100000 (Peru).

Consider a trivial test that always says "you don't have the disease dd": its accuracy would be 99.8%99.8\%:

  • on 10001000 persons, the trivial test would make correct decisions on 998998 cases
  • is our test good if it works like the trivial test?

    The trivial test is actually the dummy classifier built knowing that the prevalence is 0.2%0.2\%.


100 / 366

The fallacy of the accuracy

99.8%99.8\% was soooo nice, but the test was actually just always saying one yy.

The accuracy alone was not able to capture such a gross error.

  • Can we spot this trivially wrong behavior?
  • From another point of view, can we check how badly the classifier behaves for each class yy?

Yes, also because we are in binary classification and there are only 2=Y2=|Y| possible values for yy (i.e., 2 classes).

There are performance indexes designed with exactly this aim.

101 / 366

Positives and negatives

First, let's give a standard name to the two possible yy values:

  • positive (one case, denoted with pos\text{pos})
  • negative (the other case, denoted with neg\text{neg})

How to associate positive/negative with actual YY elements?

  • e.g., spam,¬spam\text{spam}, \neg\text{spam}
  • e.g., has the disease d,does not have the disease d\text{has the disease } d, \text{does not have the disease } d

Common practice:

  • associate positive with the rarest case
  • otherwise, if no rarest case exists or is known, clearly state what's your positive
102 / 366

FPR and FNR

Goal: measuring the error on each of the two classes in binary classification.

The False Positive Rate (FPR) is the rate of negatives that are wrongly¹ classified as positives: fFPR({(y(i),y^(i))}i)=i1(y(i)=negy(i)y^(i))i1(y(i)=neg)f\subtext{FPR}(\{(y^{(i)},\hat{y}^{(i)})\}_i)=\frac{\sum_i\mathbf{1}(\c{1}{y^{(i)}=\text{neg}} \land \c{2}{y^{(i)} \ne \hat{y}^{(i)}})}{\sum_i\mathbf{1}(\c{1}{y^{(i)}=\text{neg}})}

The False Negative Rate (FNR) is the rate of positives that are wrongly classified as negatives: fFNR({(y(i),y^(i))}i)=i1(y(i)=posy(i)y^(i))i1(y(i)=pos)f\subtext{FNR}(\{(y^{(i)},\hat{y}^{(i)})\}_i)=\frac{\sum_i\mathbf{1}(\c{3}{y^{(i)}=\text{pos}} \land \c{2}{y^{(i)} \ne \hat{y}^{(i)}})}{\sum_i\mathbf{1}(\c{3}{y^{(i)}=\text{pos}})}

For both:

  • the codomain is [0,1][0,1] may be 00\frac{0}{0}, i.e., NaN, if no negatives (FPR) or positives (FNR) in the data
  • the lower, the better (like the error)
  • each one is formally an fcomp-respsf\subtext{comp-resps} considering just a part {(y(i),y^(i))}i\seq{(y^{(i)},\hat{y}^{(i)})}{i}
  1. wrongly \rightarrow falsely \rightarrow false
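A minimal Java sketch of the two definitions above (hypothetical names), encoding pos\text{pos} as true and neg\text{neg} as false:

import java.util.List;

// Minimal sketch of FPR and FNR; true encodes pos, false encodes neg.
public class Rates {

  // FPR = FP / N: rate of negatives wrongly classified as positives
  static double fpr(List<Boolean> ys, List<Boolean> yHats) {
    int fp = 0, n = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (!ys.get(i)) {
        n++;
        if (yHats.get(i)) fp++;
      }
    }
    return (double) fp / n; // NaN (0/0) if there are no negatives
  }

  // FNR = FN / P: rate of positives wrongly classified as negatives
  static double fnr(List<Boolean> ys, List<Boolean> yHats) {
    int fn = 0, p = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (ys.get(i)) {
        p++;
        if (!yHats.get(i)) fn++;
      }
    }
    return (double) fn / p; // NaN (0/0) if there are no positives
  }
}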
103 / 366

More comfortable notation

FPR=FPN\text{FPR}=\frac{\text{FP}}{\text{N}}

FNR=FNP\text{FNR}=\frac{\text{FN}}{\text{P}}

Assuming that:

  • there is a {(y(i),y^(i))}i\seq{(y^{(i)},\hat{y}^{(i)})}{i}, even if it's not written
  • FP\text{FP} is the number of false positives; FN\text{FN} is the number of false negatives
    • you need both y(i)y^{(i)} and y^(i)\hat{y}^{(i)} for counting them
    • negative/positive is for y^(i)\hat{y}^{(i)}; false is for y(i)y^{(i)}, but considering y^(i)\hat{y}^{(i)}
  • P\text{P} is the number of positives and N\text{N} is the number of negatives
    • you need only y(i)y^{(i)} for counting them
104 / 366

FPR, FNR for the trivial test

Suppose dd is a rare¹ disease which affects 2\approx 2 persons every 10001000 and consider a trivial test that always says "you don't have the disease dd"

  • on 10001000 persons, the trivial test would make correct decisions on 998998 cases 😁 Acc=99.8%\text{Acc} = 99.8\%
  • on the 998998 negative persons, the trivial test does not make any wrong prediction 😁 FPR=FPN=0998=0%\text{FPR}=\frac{\text{FP}}{\text{N}} = \frac{0}{998} = 0 \%
  • on the 22 positive persons, the trivial test makes only wrong predictions 🙁 FNR=FNP=22=100%\text{FNR}=\frac{\text{FN}}{\text{P}} = \frac{2}{2} = 100 \%

Acc\text{Acc} is the more comfortable notation for the accuracy; Err\text{Err} for the error.

105 / 366

Accuracy or FPR, FNR?

When to use accuracy? When to use FPR and FNR?

tl;dr¹: use FPR and FNR in binary classification!

In decreasing order of informativeness (i.e., effectiveness of the assessment of effectiveness) and of verbosity:

  • give accuracy, FPR, FNR, frequencies of classes² in YY, possibly other indexes we'll see later
  • give accuracy, FPR, FNR, frequencies of classes
  • FPR, FNR, frequencies of classes
  • FPR, FNR
  • accuracy, frequencies of classes
  • accuracy

Accuracy alone in binary classification is evil! 👿

Just FPR, or just FNR is evil too, but also weird.

  1. too long; didn't read
  2. you need to show them just once, if using the "natural" distribution
106 / 366

The many relatives of FPR, FNR: TPR, TNR

Binary classification and its assessment are so practically relevant that there exist many other "synonyms" of FPR and FNR.

True Positive Rate (TPR), positives correctly classified as positives: TPR=TPP=1FNR\text{TPR}=\frac{\text{TP}}{\text{P}}=1-\text{FNR}

True Negative Rate (TNR), negatives correctly classified as negatives: TNR=TNN=1FPR\text{TNR}=\frac{\text{TN}}{\text{N}}=1-\text{FPR}

For both, the greater, the better (like accuracy); codomain is [0,1][0,1].

Relation with accuracy and error:

Err=FP+FNN+P=P  FNR+N  FPRP+N\text{Err} =\frac{\text{FP}+\text{FN}}{\text{N}+\text{P}} =\frac{\text{P} \; \text{FNR}+\text{N} \; \text{FPR}}{\text{P}+\text{N}}

Acc=1Err=TP+TNN+P=P  TPR+N  TNRP+N\text{Acc} =1-\text{Err} =\frac{\text{TP}+\text{TN}}{\text{N}+\text{P}} =\frac{\text{P} \; \text{TPR}+\text{N} \; \text{TNR}}{\text{P}+\text{N}}

107 / 366

On balanced data

In classification (binary and multiclass), a dataset is balanced, with respect to the response variable yy, if the frequency of each value of yy is roughly the same.

For a balanced dataset in binary classification, P=N\text{P}=\text{N}, hence:

  • the error rate is the average of FPR and FNR Err=FP+FNN+P=P  FNR+N  FPRP+N=N(FNR+FPR)N+N=12(FNR+FPR)\text{Err} =\frac{\text{FP}+\text{FN}}{\text{N}+\text{P}}=\frac{\text{P} \; \text{FNR}+\text{N} \; \text{FPR}}{\text{P}+\text{N}} =\frac{\text{N} (\text{FNR} + \text{FPR})}{\text{N}+\text{N}} =\frac{1}{2} (\text{FNR} + \text{FPR})
  • the accuracy is the average of TPR and TNR Acc=TP+TNN+P=P  TPR+N  TNRP+N=N(TPR+TNR)N+N=12(TNR+TPR)\text{Acc} =\frac{\text{TP}+\text{TN}}{\text{N}+\text{P}} =\frac{\text{P} \; \text{TPR}+\text{N} \; \text{TNR}}{\text{P}+\text{N}} =\frac{\text{N} (\text{TPR}+\text{TNR})}{\text{N}+\text{N}} =\frac{1}{2} (\text{TNR} + \text{TPR})

The more unbalanced a dataset, the farther the error (accuracy) from the average of FPR and FNR (TPR and TNR), the more misleading 👿 giving error (accuracy) only!

108 / 366

Precision and recall

Precision: Prec=TPTP+FP\text{Prec}=\frac{\text{TP}}{\text{TP}+\text{FP}} may be 00\frac{0}{0}, i.e., NaN, if the classifier never says positive

Recall: Rec=TPP=TPR\text{Rec}=\frac{\text{TP}}{\text{P}}=\text{TPR}

F-measure: or F1, F1-score, F-score F-measure=2PrecRecPrec+Rec\text{F-measure}=2\frac{\text{Prec} \cdot \text{Rec}}{\text{Prec}+\text{Rec}} harmonic mean of precision and recall

They come from the information retrieval scenario:

  • imagine a set of documents DD (e.g., the web)
  • imagine a query qq with an ideal subset DDD^\star \subseteq D as response (relevant documents)
  • the search engine retrieves a subset DDD' \subseteq D of documents (retrieved documents)
  • retrieving a document as binary classification: is dDd \in D relevant or not? relevant = positive

Precision: how many retrieved documents are actually relevant? Prec=DDD=DDDD+DD=TPTP+FP\text{Prec}=\frac{|D' \cap D^\star|}{|D'|}=\frac{\c{1}{|D' \cap D^\star|}}{\c{1}{|D' \cap D^\star|}+\c{2}{|D' \setminus D^\star|}}=\frac{\c{1}{\text{TP}}}{\c{1}{\text{TP}}+\c{2}{\text{FP}}}

Recall: how many of the relevant documents are actually retrieved? Rec=DDD=TPP\text{Rec}=\frac{\c{1}{|D' \cap D^\star|}}{\c{3}{|D^\star|}}=\frac{\c{1}{\text{TP}}}{\c{3}{\text{P}}}

The greater, the better (like accuracy); precision [0,1]\in [0,1] \cup NaN, recall [0,1]\in [0,1], F-measure [0,1]\in [0,1].
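A minimal Java sketch of the three indexes from the raw counts (hypothetical names):

// Minimal sketch: precision, recall, and F-measure from TP, FP, P counts.
public class PrecisionRecall {

  static double precision(int tp, int fp) {
    return (double) tp / (tp + fp); // NaN if the classifier never says pos
  }

  static double recall(int tp, int p) {
    return (double) tp / p; // = TPR
  }

  static double fMeasure(int tp, int fp, int p) {
    double prec = precision(tp, fp);
    double rec = recall(tp, p);
    return 2 * prec * rec / (prec + rec); // harmonic mean
  }

  public static void main(String[] args) {
    // e.g., TP = 5, FP = 1, P = 6: Prec = Rec = F-measure ≈ 0.83
    System.out.println(fMeasure(5, 1, 6));
  }
}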

109 / 366

Sensitivity and specificity (and more)

Sensitivity: Sensitivity=TPP=TPR\text{Sensitivity}=\frac{\text{TP}}{\text{P}}=\text{TPR}

Specificity: Specificity=TNN=TNR\text{Specificity}=\frac{\text{TN}}{\text{N}}=\text{TNR}

The greater, the better (like accuracy); both in [0,1][0,1].

Other similar indexes:

  • Type I error for FPR
  • Type II error for FNR

For both, the lower, the better (like error).

110 / 366

Which terminology?

Rule of thumb¹ (in binary classification)

  • precision and recall, if in an information retrieval scenario
    • refer to the act of retrieving
  • sensitivity and specificity, if working with a diagnostic test
    • refer to the quality of the test
  • FPR and FNR, otherwise
    • refer to the name of the class

No good reasons imho for using Type I and Type II error:

  • what do they refer to?
  • is there a Type III? 🤔 (No!)
  1. rule of thumb [ˌruːl əv ˈθʌm]: a broadly accurate guide or principle, based on practice rather than theory
111 / 366

Comparison with FPR and FNR

Suppose you have two models and you compute them on the same data:

  • model m1m_1 with its fpredictf'\subtext{predict} scores FPR=6%\text{FPR}=6\% and FNR=4%\text{FNR}=4\%
  • model m2m_2 with its fpredictf'\subtext{predict} scores FPR=10%\text{FPR}=10\% and FNR=1%\text{FNR}=1\%

Which one is the best?

In general, it depends on:

  • the cost of the error, possibly different between FPs and FNs
  • the number of positives or negatives
112 / 366

Cost of the error

Assumptions:

  • once fpredictf\subtext{predict} outputs a yy, some action is taken
    • otherwise, taking a decision yy is pointless
  • if the action is wrong, there is some cost to be paid with respect to the correct action (the other one, in binary classification) assume the correct decision has 00 cost
    • otherwise, attempting to take the correct decision is pointless

Given P+N\text{P}+\text{N} observations, the overall cost cc is: c=cFP  FPR  N+cFN  FNR  Pc = c\subtext{FP} \; \text{FPR} \; \text{N} + c\subtext{FN} \; \text{FNR} \; \text{P} with cFPc\subtext{FP} and cFNc\subtext{FN} the cost of FPs and FNs.

If you know cFPc\subtext{FP}, cFNc\subtext{FN}, N\text{N}, and P\text{P}: (the costs cFPc\subtext{FP}, cFNc\subtext{FN} should come from domain knowledge)

  • you can compute cc (and compare the cost for two models)
  • find a good trade-off for FPR\text{FPR} and FNR\text{FNR} more later
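Worked example (with assumed numbers): take the two models of the previous slide, m1m_1 with FPR=6%\text{FPR}=6\%, FNR=4%\text{FNR}=4\% and m2m_2 with FPR=10%\text{FPR}=10\%, FNR=1%\text{FNR}=1\%, and suppose (assumptions, not given above) balanced data with N=P=500\text{N}=\text{P}=500, cFP=1c\subtext{FP}=1, and cFN=2c\subtext{FN}=2. Then c1=10.06500+20.04500=30+40=70c_1 = 1 \cdot 0.06 \cdot 500 + 2 \cdot 0.04 \cdot 500 = 30+40 = 70 and c2=10.10500+20.01500=50+10=60c_2 = 1 \cdot 0.10 \cdot 500 + 2 \cdot 0.01 \cdot 500 = 50+10 = 60: despite its larger FPR, m2m_2 is preferable here.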
113 / 366

Balancing FPR and FNR

Given a model (not a learning technique), can we "tune" it to prefer avoiding FPs rather than FNs (or vice versa)?

  • e.g., can we make a diagnostic test more sensitive to positives (i.e., prefer avoiding FNs) during a pandemic wave?

Yes! It turns out that for many learning techniques (for classification), the fpredictf'\subtext{predict} internally computes a discrete probability distribution over YY before actually returning one yy.

114 / 366

Model with probability

Formally:

fpredict:X×MPYf''\subtext{predict}: X \times M \to P_{Y} fpredict(x,m)=pf''\subtext{predict}(x, m) = p

fpredict:X×MYf'\subtext{predict}: X \times M \to Y fpredict(x,m)=arg maxyY(fpredict(x,m))(y)=arg maxyYp(y)f'\subtext{predict}(x, m)= \argmax\sub{y \in Y} (f''\subtext{predict}(x, m))(y) = \argmax\sub{y \in Y} p(y)

where PYP_Y is the set of discrete probability distributions over YY.

Example: for spam detection, given an mm and an email xx, fpredict(x,m)f'\subtext{predict}(x, m) might return: p(y)={80%if y=spam20%if y=¬spamp(y)= \begin{cases} 80\% &\text{if } y=\text{spam} \\ 20\% &\text{if } y=\neg\text{spam} \end{cases} For another email, it might return a 30%/70%, instead of an 80%/20%.

115 / 366

Learning technique with probability

A supervised learning technique with probability (for classification) is defined by:

  • an flearn:P(X×Y)Mf'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M, for learning a model from a dataset
  • an fpredict:X×MPYf''\subtext{predict}: X \times M \to P_{Y}, for giving a probability distribution from an observation and a model

For all the techniques of this kind, fpredict:X×MYf'\subtext{predict}: X \times M \to Y and fpredictf\subtext{predict} are always the same: concrete

  • fpredict(x,m)=arg maxyY(fpredict(x,m))(y)f'\subtext{predict}(x, m)= \argmax\sub{y \in Y} (f''\subtext{predict}(x, m))(y)
  • fpredict(x)=fpredict(x,m)f\subtext{predict}(x) = f'\subtext{predict}(x, m)
xxmmfpredictf''\subtext{predict}pparg maxyY\argmax\sub{y \in Y}yy

"internally computes" \rightarrow pp is indeed available internally, but can be obtained from outside

  • in practice, software tools allow using both fpredictf'\subtext{predict} and fpredictf''\subtext{predict}
116 / 366

Probability and binary classification

In binary classification, with Y={pos,neg}Y=\{\text{pos},\text{neg}\}, pPYp \in P_Y has always this form: p(y)={pposif y=pos1pposif y=negp(y)= \begin{cases} p\subtext{pos} &\text{if } y=\text{pos} \\ 1-p\subtext{pos} &\text{if } y=\text{neg} \end{cases} with ppos[0,1]p\subtext{pos} \in [0,1].

Hence, prediction can be seen as:

fpredict:X×M[0,1]f'''\subtext{predict}: X \times M \to [0,1] fpredict(x,m)=pposf'''\subtext{predict}(x,m)=p\subtext{pos}

fpredict:X×MYf'\subtext{predict}: X \times M \to Y fpredict(x,m)={posif ppos0.5negotherwisef'\subtext{predict}(x,m)= \begin{cases} \text{pos} &\text{if } p\subtext{pos} \ge 0.5 \\ \text{neg} &\text{otherwise} \end{cases}

xxmmfpredictf'''\subtext{predict}pposp\subtext{pos}0.5\ge 0.5yy
117 / 366

Probability and confidence

p(y)={pposif y=pos1pposif y=negp(y)= \begin{cases} p\subtext{pos} &\text{if } y=\text{pos} \\ 1-p\subtext{pos} &\text{if } y=\text{neg} \end{cases}

The closer pposp\subtext{pos} to 0.50.5, the lower the confidence of the model in its decision:

  • ppos=0.51p\subtext{pos}=0.51 means "I think it's a positive, but I'm not sure"
  • ppos=0.49p\subtext{pos}=0.49 means "I think it's a negative, but I'm not sure"
  • ppos=0.98p\subtext{pos}=0.98 means "I'm rather sure it's a positive!"

We may measure the confidence in the binary decision as: conf(x,m)=ppos0.50.5=fpredict(x,m)0.50.5\text{conf}(x,m)=\frac{\abs{p\subtext{pos}-0.5}}{0.5}=\frac{\abs{f'''\subtext{predict}(x,m)-0.5}}{0.5}

conf[0,1]\text{conf} \in [0,1]: the greater, the more confident.

118 / 366

Changing the threshold

If we replace the fixed 0.50.5 threshold with a parameter τ\tau, we obtain a new function:

fpredictτ:X×[0,1]Yf^\tau\subtext{predict}: X \times [0,1] \to Y fpredictτ(x,τ)={posif fpredict(x,m)τnegotherwisef^\tau\subtext{predict}(x,\tau)= \begin{cases} \text{pos} &\text{if } f'''\subtext{predict}(x,m) \ge \tau \\ \text{neg} &\text{otherwise} \end{cases}

xxτ\taummfpredictf'''\subtext{predict}pposp\subtext{pos}τ\ge \tauyy

Note that:

  • for using fpredictτf^\tau\subtext{predict} on an xx, you need a concrete value for τ\tau
    • fpredict(x)=fpredictτ(x,0.5)f\subtext{predict}(x)=f^\tau\subtext{predict}(x, 0.5), i.e., 0.50.5 is the default value for τ\tau in fpredictf\subtext{predict}
  • like for fpredictf\subtext{predict}, the model is inside fpredictτf^\tau\subtext{predict}
  • you can obtain several predictions for the same observation xx by varying τ\tau

Example: if we want our diagnostic test to be more sensitive to positives, we lower τ\tau without changing the model!
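A minimal Java sketch of fpredictτf^\tau\subtext{predict} (hypothetical names), built on top of an fpredictf'''\subtext{predict} that returns pposp\subtext{pos}:

import java.util.function.BiFunction;

// Minimal sketch of f^tau_predict: same model m, tunable threshold tau.
public class ThresholdedClassifier<X, M> {

  private final BiFunction<X, M, Double> fPredictPPos; // f'''_predict
  private final M model;

  ThresholdedClassifier(BiFunction<X, M, Double> fPredictPPos, M model) {
    this.fPredictPPos = fPredictPPos;
    this.model = model;
  }

  boolean predict(X x, double tau) { // true = pos, false = neg
    return fPredictPPos.apply(x, model) >= tau;
  }

  boolean predict(X x) { // f_predict: the default threshold is tau = 0.5
    return predict(x, 0.5);
  }
}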

119 / 366

Threshold τ\tau vs. FPR, FNR

Given the same mm and the same {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}:

  • the greater τ\tau, the less frequent y=posy=\text{pos}, the lower FPR\text{FPR}, the greater FNR\text{FNR}
  • the lower τ\tau, the more frequent y=posy=\text{pos}, the greater FPR\text{FPR}, the lower FNR\text{FNR}

Example:

Example of tau vs. FPR and FNR

  • for the default threshold τ=0.5\tau=0.5, FPR20%\text{FPR}\approx 20\%, FNR15%\text{FNR}\approx 15\%
  • if you want to be more sensitive to positives, set, e.g., τ=0.25\tau=0.25, so there will be a lower FNR13%\text{FNR} \approx 13\%
  • if you know the cost of an FN is \approx double the cost of an FP and the data is balanced, then you should set τ0.12\tau\approx 0.12
why FNR=0%\text{FNR}=0\% for τ=0\tau=0 but FPR>0%\text{FPR}>0\% for τ=1\tau=1?
120 / 366

Equal Error Rate

For a model mm and a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}, the Equal Error Rate (EER) is the value of FPR (and FNR) for the τ=τEER\tau=\tau\subtext{EER} value for which FPR=FNR\text{FPR}=\text{FNR}.

For EER: the lower, the better (like error); codomain is [0,1][0,1] in practice [0,0.5][0,0.5]

Example of EER

  • for τ=0.65\tau=0.65 (vertical dashed line), FPR=FNR\text{FPR}=\text{FNR}
  • EER19%\text{EER}\approx 19\% (horizontal solid line)
121 / 366

The ROC curve

For a model mm and a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i} and a sequence (τi)i(\tau_i)_i, the Receiver operating characteristic¹ (ROC) curve is the plot of TPR\text{TPR} (=1FNR= 1-\text{FNR}) vs. FPR\text{FPR} for the different values of τ(τi)i\tau \in (\tau_i)_i.

Example of a ROC curve

  • red line: ROC curve
    • each point lies at (FPR,TPR)(\text{FPR},\text{TPR}) for a given τ\tau
  • solid black line: points for which FPR=FNR\text{FPR}=\text{FNR}
    • the xx-coord of the intersection with the red line is EER\text{EER}
    • point at top-left (FPR=FNR=0\text{FPR}=\text{FNR}=0) is the perfect classifier
  • the intersection of dashed and solid black lines is at FPR=FNR=0.5\text{FPR}=\text{FNR}=0.5
    • it is the random classifier
  • points on the dashed line are random classifiers with τ0.5\tau \ne 0.5
    • the ROC curve of a healthy classifier should never lie to the right of the dashed line!
  1. The name comes from its usage as a graphical tool for assessing radar stations during WW2.
122 / 366

Area Under the Curve (AUC)

For a model mm and a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i} and a sequence (τi)i(\tau_i)_i, the Area Under the Curve (AUC) is the area under the ROC curve.

For AUC: the greater, the better (like accuracy); codomain is [0,1][0,1] in practice [0.5,1][0.5,1]

Example of AUC

  • for the random classifier, AUC=0.5\text{AUC}=0.5
  • for the ideal classifier, AUC=1\text{AUC}=1
123 / 366

How to choose τ\tau values?

For computing both EER\text{EER} and AUC\text{AUC}, you need to compute FPR\text{FPR} and FNR\text{FNR} for many values of τ\tau.

Ingredients:

  • fpredictτf^\tau\subtext{predict}
    • i.e., fpredictf'''\subtext{predict} and a model mm
  • a dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}
  • a sequence (τi)i(\tau_i)_i of threshold values
xxτ\taummfpredictf'''\subtext{predict}pposp\subtext{pos}τ\ge \tauyy

How to choose (τi)i(\tau_i)_i? recall: τ[0,1]\tau \in [0,1]; by convention, you always take also τ=0\tau=0 and τ=1\tau=1

  • evenly spaced in [0,1][0,1] at n+1n+1 points: (τi)i=(in)i=0i=n(\tau_i)_i=(\frac{i}{n})_{i=0}^{i=n}
  • evenly spaced in [τmin,τmax][\tau\subtext{min},\tau\subtext{max}]: (τi)i=(τmin+in(τmaxτmin))i=0i=n(\tau_i)_i=(\tau\subtext{min}+\frac{i}{n}(\tau\subtext{max}-\tau\subtext{min}))_{i=0}^{i=n}
    • with τmin=minifpredict(x(i),m)\tau\subtext{min}=\min_i f'''\subtext{predict}(x^{(i)},m) and τmax=maxifpredict(x(i),m)\tau\subtext{max}=\max_i f'''\subtext{predict}(x^{(i)},m)
  • taking midpoints of (ppos(i))i(p\subtext{pos}^{(i)})_i i.e., sorted {ppos(i)}i\seq{p\subtext{pos}^{(i)}}{i}
    • with ppos(i)=fpredict(x(i),m)p\subtext{pos}^{(i)}=f'''\subtext{predict}(x^{(i)},m)
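A minimal Java sketch tying the ingredients together (hypothetical names): thresholds at midpoints plus 0 and 1, one (FPR,TPR)(\text{FPR},\text{TPR}) point per τ\tau, and AUC by the trapezoidal rule:

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: ROC points from midpoint thresholds, AUC by trapezoids.
// pPos.get(i) is f'''_predict(x^(i), m); ys.get(i) is true for pos.
public class RocSketch {

  static List<double[]> rocPoints(List<Double> pPos, List<Boolean> ys) {
    List<Double> taus = new ArrayList<>(List.of(0d, 1d)); // always take 0 and 1
    List<Double> sorted = pPos.stream().sorted().toList();
    for (int i = 0; i < sorted.size() - 1; i++) {
      taus.add((sorted.get(i) + sorted.get(i + 1)) / 2); // midpoints
    }
    List<double[]> points = new ArrayList<>(); // one (FPR, TPR) per tau
    for (double tau : taus) {
      int tp = 0, fp = 0, p = 0, n = 0;
      for (int i = 0; i < ys.size(); i++) {
        boolean yHat = pPos.get(i) >= tau; // f^tau_predict
        if (ys.get(i)) { p++; if (yHat) tp++; }
        else { n++; if (yHat) fp++; }
      }
      points.add(new double[] {(double) fp / n, (double) tp / p});
    }
    return points;
  }

  static double auc(List<double[]> points) {
    List<double[]> sorted = new ArrayList<>(points);
    sorted.sort((a, b) -> Double.compare(a[0], b[0])); // sort by FPR
    double area = 0;
    for (int i = 0; i < sorted.size() - 1; i++) {
      area += (sorted.get(i + 1)[0] - sorted.get(i)[0])
          * (sorted.get(i)[1] + sorted.get(i + 1)[1]) / 2; // trapezoid
    }
    return area;
  }
}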
124 / 366

Example: τ\tau and its values

Y={pos,neg}Y=\{\c{1}{\text{pos}},\c{2}{\text{neg}}\}

y(i)y^{(i)} ppos(i)p\subtext{pos}^{(i)} y^(i)\hat{y}^{(i)} out¹
pos 0.49 neg FN
neg 0.29 neg TN
pos 0.63 pos TP
pos 0.51 pos TP
pos 0.52 pos TP
neg 0.47 neg TN
pos 0.94 pos TP
pos 0.75 pos TP
neg 0.53 pos FP
neg 0.45 neg TN
  1. with τ=0.5\tau=0.5
τ\tau FPR\text{FPR} FNR\text{FNR}
0.50.5 14=25%\frac{1}{4}=25\% 1617%\frac{1}{6}\approx 17\%
0.40.4 34=75%\frac{3}{4}=75\% 06=0%\frac{0}{6}=0\%
0.60.6 04=0%\frac{0}{4}=0\% 36=50%\frac{3}{6}=50\%

(τi)i(\tau_i)_i evenly spaced in [0,1][0,1] 9+2 values \rightarrow raw: 7 of 11 values give different rates

(τi)i(\tau_i)_i evenly spaced in [τmin,τmax]=[0.29,0.94][\tau\subtext{min},\tau\subtext{max}]=[0.29,0.94] 9+2 values \rightarrow better, but still 7 of 11 values give different rates

(τi)i(\tau_i)_i at midpoints 9+2 values \rightarrow optimal: 11 of 11 values give different rates
125 / 366

Cost of errors, index, and τ\tau

If you know the cost of error (cFPc\subtext{FP} and cFNc\subtext{FN}) and the class frequencies:

  • choose a proper τ\tau and measure FPR\text{FPR}, FNR\text{FNR}, cc

If you don't know the cost of error and you know the classifier will work at a fixed τ\tau:

  • measure FPR\text{FPR}, FNR\text{FNR} for τ=0.5\tau=0.5
  • measure EER\text{EER}

If you don't know the cost of error and don't know at which τ\tau the classifier will work:

  • measure FPR\text{FPR}, FNR\text{FNR} for τ=0.5\tau=0.5
  • measure AUC\text{AUC}

If you can afford, i.e., you have time/space:

  • measure "everything"
126 / 366

Confusion matrix

Given a multiset {(y(i),y^(i))}i\seq{(y^{(i)},\hat{y}^{(i)})}{i} of pairs, the confusion matrix has:

  • one row for each possible value yy of YY, associated with y(i)y^{(i)} (true labels)
  • one column for each possible value y^\hat{y} of YY, associated with y^(i)\hat{y}^{(i)} (predicted labels)
  • the number of pairs for which y^(i)=y^\hat{y}^{(i)}=\hat{y} and y(i)=yy^{(i)}=y in the cell

Y={pos,neg}Y=\{\c{1}{\text{pos}},\c{2}{\text{neg}}\}

y(i)y^{(i)} ppos(i)p\subtext{pos}^{(i)} y^(i)\hat{y}^{(i)} out
pos 0.49 neg FN
neg 0.29 neg TN
pos 0.63 pos TP
pos 0.51 pos TP
pos 0.52 pos TP
neg 0.47 neg TN
pos 0.94 pos TP
pos 0.75 pos TP
neg 0.53 pos FP
neg 0.45 neg TN

For this case:

yyy^\hat{y} pos\text{pos} neg\text{neg}
pos\text{pos} 5 1
neg\text{neg} 1 3

For binary classification:

yyy^\hat{y} pos\text{pos} neg\text{neg}
pos\text{pos} TP\text{TP} FN\text{FN}
neg\text{neg} FP\text{FP} TN\text{TN}

In general, being c\vect{c} the confusion matrix, it holds that:

  • the accuracy is the ratio between the sum of the diagonal and the sum of the matrix: Acc=diag(c)1c1\text{Acc} = \frac{\lVert \text{diag}(\vect{c}) \rVert_1}{\lVert \vect{c} \rVert_1}
  • TPR is the ratio of cpos,posc_{\text{pos},\text{pos}} on the sum of the first row, i.e., the row for which y=posy=\text{pos}
  • TNR is the ratio of cneg,negc_{\text{neg},\text{neg}} on the sum of the second row, i.e., the row for which y=negy=\text{neg}
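A minimal Java sketch (hypothetical names) building the confusion matrix as nested counts indexed by (y,y^)(y,\hat{y}):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch: confusion matrix as counts, rows = true y, columns = ŷ.
public class ConfusionMatrixSketch {

  static <Y> Map<Y, Map<Y, Integer>> build(List<Y> ys, List<Y> yHats) {
    Map<Y, Map<Y, Integer>> matrix = new HashMap<>();
    for (int i = 0; i < ys.size(); i++) {
      matrix
          .computeIfAbsent(ys.get(i), y -> new HashMap<>()) // row: true label
          .merge(yHats.get(i), 1, Integer::sum); // column: predicted label
    }
    return matrix;
  }

  public static void main(String[] args) {
    List<String> ys = List.of("pos", "pos", "neg", "neg");
    List<String> yHats = List.of("pos", "neg", "neg", "neg");
    // e.g., {pos={pos=1, neg=1}, neg={neg=2}} (iteration order may vary)
    System.out.println(build(ys, yHats));
  }
}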
127 / 366

Multiclass classification and regression

128 / 366

Weighted accuracy for multiclass classification

Besides accuracy and error, for unbalanced datasets, the weighted accuracy (or balanced accuracy) is: wAcc=fwAcc({(y(i),y^(i))}i)=1YyY(i1(y(i)=yy(i)=y^(i))i1(y(i)=y))=1YyYAccy\text{wAcc}=f\subtext{wAcc}(\seq{(y^{(i)},\hat{y}^{(i)})}{i})=\frac{1}{|Y|} \sum_{y \in Y} \left( \frac{\sum_i \mathbf{1}(y^{(i)}=y \land y^{(i)}=\hat{y}^{(i)})}{\sum_i \mathbf{1}(y^{(i)}=y)} \right)=\frac{1}{|Y|} \sum_{y \in Y} \text{Acc}_y i.e., the (unweighted) average of the accuracy for each class. You can do the same with error, precision, recall, ...

yyy^\hat{y}
15 1 2 2
1 10 4 1
5 3 28 1
1 0 0 9

Acc=15+10+28+920+16+37+10=628374.7%\text{Acc} = \frac{15+10+28+9}{20+16+37+10} = \frac{62}{83} \approx 74.7\%

Acc=1520=75%\text{Acc}\subtext{\c{1}{⬤}} = \frac{15}{20} = 75\%
Acc=1016=62.5%\text{Acc}\subtext{\c{2}{⬤}} = \frac{10}{16} = 62.5\%
Acc=2837=75.7%\text{Acc}\subtext{\c{3}{⬤}} = \frac{28}{37} = 75.7\%
Acc=910=90%\text{Acc}\subtext{\c{4}{⬤}} = \frac{9}{10} = 90\%

wAcc=14(1520+1016+2837+910)=75.8%\text{wAcc} = \frac{1}{4} \left( \frac{15}{20}+\frac{10}{16}+\frac{28}{37}+\frac{9}{10} \right) = 75.8\%

wAcc\text{wAcc} overlooks class imbalance, Acc\text{Acc} does not; wAcc[0,1]\text{wAcc} \in [0,1]; the greater, the better

  • for binary classification, wAcc=12(TPR+TNR)\text{wAcc} = \frac{1}{2} (\text{TPR}+\text{TNR})
129 / 366

Errors in regression

Differently from classification, a prediction in regression may be more or less wrong:

  • classification: either y(i)=y^(i)y^{(i)}=\hat{y}^{(i)} (correct) or y(i)y^(i)y^{(i)}\ne\hat{y}^{(i)} (wrong)
  • regression:
    • y(i)=y^(i)y^{(i)}=\hat{y}^{(i)} (perfect);
    • y(i)+1=y^(i)y^{(i)}+1=\hat{y}^{(i)} is wrong
    • y(i)+100=y^(i)y^{(i)}+100=\hat{y}^{(i)} is much more wrong
    • ...

The error in regression measures how far the prediction y^(i)\hat{y}^{(i)} is from the true value y(i)y^{(i)}:

  • recall, we are in the context of behavior comparison, i.e., fcomp-respsf\subtext{comp-resps}
130 / 366

MAE, MSE, RMSE, MAPE

Name fcomp-resps({(y(i),y^(i))}i)f\subtext{comp-resps}(\seq{(y^{(i)},\hat{y}^{(i)})}{i})
Mean Absolute Error (MAE) MAE=1niy(i)y^(i)\text{MAE} = \frac{1}{n} \sum_i \abs{y^{(i)}-\hat{y}^{(i)}}
Mean Squared Error (MSE) MSE=1ni(y(i)y^(i))2\text{MSE} = \frac{1}{n} \sum_i (y^{(i)}-\hat{y}^{(i)})^2
Root Mean Squared Error (RMSE) RMSE=1ni(y(i)y^(i))2=MSE\text{RMSE} = \sqrt{\frac{1}{n} \sum_i (y^{(i)}-\hat{y}^{(i)})^2}=\sqrt{\text{MSE}}
Mean Absolute Percentage Error (MAPE) MAPE=1niy(i)y^(i)y(i)\text{MAPE} = \frac{1}{n} \sum_i \abs{\frac{y^{(i)}-\hat{y}^{(i)}}{y^{(i)}}}

Remarks:

  • for all:
    • the lower, the better
    • codomain is [0,+[[0, +\infin[ MAPE might be \infin
  • MAE and RMSE retain the unit of measure: e.g., yy is in meters, MAE is in meters
  • MAPE is scale-independent and dimensionless
  • MSE and RMSE are more influenced by observations with large errors
  • MAPE "does not work" if the true yy is 00
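A minimal Java sketch of the four indexes (hypothetical names):

import java.util.List;

// Minimal sketch of MAE, MSE, RMSE, MAPE on aligned lists of y and ŷ.
public class RegressionErrors {

  static double mae(List<Double> ys, List<Double> yHats) {
    double sum = 0;
    for (int i = 0; i < ys.size(); i++) {
      sum += Math.abs(ys.get(i) - yHats.get(i));
    }
    return sum / ys.size(); // same unit of measure as y
  }

  static double mse(List<Double> ys, List<Double> yHats) {
    double sum = 0;
    for (int i = 0; i < ys.size(); i++) {
      double d = ys.get(i) - yHats.get(i);
      sum += d * d; // large errors weigh more
    }
    return sum / ys.size();
  }

  static double rmse(List<Double> ys, List<Double> yHats) {
    return Math.sqrt(mse(ys, yHats));
  }

  static double mape(List<Double> ys, List<Double> yHats) {
    double sum = 0;
    for (int i = 0; i < ys.size(); i++) {
      sum += Math.abs((ys.get(i) - yHats.get(i)) / ys.get(i)); // inf if y = 0
    }
    return sum / ys.size();
  }
}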
131 / 366

Assessing learning techniques

132 / 366

Purpose of assessment

Premise:

  • an effective learning technique is a pair flearn,fpredictf'\subtext{learn},f'\subtext{predict} that learns a good model mm
    • flearnf'\subtext{learn} needs a dataset for producing mm
  • an effective model mm is one that has the same behavior as the real system ss
    • we measure this with fcomp-behaviorf\subtext{comp-behavior}, that internally uses a dataset

Goal:

  • we want a measure (a number!) of the effectiveness of flearn,fpredictf'\subtext{learn},f'\subtext{predict}

Sketch of solution:

  1. learn an mm with flearnf'\subtext{learn}
  2. measure the effectiveness Eff\text{Eff} of mm with fcomp-behaviorf\subtext{comp-behavior} (and one or more suitable fcomp-respsf\subtext{comp-resps})
  3. say that the effectiveness of the learning technique is Eff\text{Eff}

Eff\text{Eff} might be accuracy, TPR and TNR, MAE, error, ...

133 / 366

What data?

Sketch of solution:

  1. learn an mm with flearnf'\subtext{learn}
  2. measure the effectiveness Eff\text{Eff} of mm with fcomp-behaviorf\subtext{comp-behavior} (and one or more suitable fcomp-respsf\subtext{comp-resps})
  3. say that the effectiveness of the learning technique is Eff\text{Eff}

Both steps 1 and 2 need a dataset:

  • can we use the same DD?

In principle yes, in practice no:

  • many learning techniques attempt to learn a model mm that, by definition, perfectly models the learning set
  • you want to see if the learned model generalizes beyond the examples
134 / 366

Effectiveness of a learning technique

flearn-effect:LXY×P(X×Y)Rf\subtext{learn-effect}: \mathcal{L}_{X \to Y} \times \mathcal{P}^*(X \times Y) \to \mathbb{R} where LXY\mathcal{L}_{X \to Y} is the set of learning techniques:

  • LXY=FP(X×Y)FXY\mathcal{L}_{X \to Y}= \mathcal{F}_{\mathcal{P}^*(X \times Y) \to \mathcal{F}_{X \to Y}}
  • or LXY=FP(X×Y)M×FX×MY\mathcal{L}_{X \to Y} = \mathcal{F}_{\mathcal{P}^*(X \times Y) \to M} \times \mathcal{F}_{X \times M \to Y}
flearn,Df\subtext{learn}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}
or
flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}

Given a learning technique and a dataset, returns a number representing the effectiveness of the learning technique on that dataset.

For consistency, let's reshape model assessment case:

function predict-effect(fpredict,m,D)\text{predict-effect}(f'\subtext{predict}, m, D) {
{(y(i),y^(i))}iforeach(\seq{(y^{(i)},\hat{y}^{(i)})}{i} \gets \text{foreach}(
D,D,
both(,second,fpredict(first(),m))\text{both}(\cdot,\text{second},f'\subtext{predict}(\text{first}(\cdot),m))
))
veffectfcomp-resps({(y(i),y^(i))}i)v\subtext{effect} \gets f\subtext{comp-resps}(\seq{(y^{(i)},\hat{y}^{(i)})}{i})
return veffectv\subtext{effect};
}

fpredict,m,Df'\subtext{predict}, m, Dfpredict-effectf\subtext{predict-effect}veffectv\subtext{effect}

We are just leaving the data collection out of predict-effect()\text{predict-effect}().

first()\text{first}() and second()\text{second}() take the first or second element of a pair.
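The same reshaping as a Python sketch (predict_fn, model, and metric are placeholders for a concrete fpredictf'\subtext{predict}, mm, and fcomp-respsf\subtext{comp-resps}):

def predict_effect(predict_fn, model, dataset, metric):
    # dataset: a collection of (x, y) pairs; metric: an f_comp-resps
    pairs = [(y, predict_fn(x, model)) for (x, y) in dataset]
    return metric(pairs)

def accuracy(pairs):  # an example metric
    return sum(y == y_hat for (y, y_hat) in pairs) / len(pairs)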

135 / 366

Same dataset

function learn-effect-same(flearn,fpredict,D)\text{learn-effect-same}(f'\subtext{learn},f'\subtext{predict}, D) {
mflearn(D)m \gets f'\subtext{learn}(D)
veffectpredict-effect(fpredict,m,D)v\subtext{effect} \gets \text{predict-effect}(f'\subtext{predict},m,D)
return veffectv\subtext{effect};
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}

The entire DD is used for learning the model and assessing it.

Effectiveness of assessment:

  • generalization is not assessed
    • for techniques that, by design, learn a model that perfectly models the learning data, learn-effect-same\text{learn-effect-same} gives perfect effectiveness, regardless of mm, regardless of DD
  • what if DD is lucky/unlucky? no robustness w.r.t. DD

Poor! 👎

Efficiency of assessment:

  • learning is executed just once

Good! 👍

136 / 366

Static train/test division

function learn-effect-static(flearn,fpredict,D,r)\text{learn-effect-static}(f'\subtext{learn},f'\subtext{predict}, D,r) {
Dlearnsubbag(D,r)D\subtext{learn} \gets \text{subbag}(D, r)
DtestDDlearnD\subtext{test} \gets D \setminus D\subtext{learn}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
veffectpredict-effect(fpredict,m,Dtest)v\subtext{effect} \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
return veffectv\subtext{effect};
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}rr

r[0,1]r \in [0,1] is a parameter

DD is split in DlearnD\subtext{learn} for learning and DtestD\subtext{test} for assessment: "split"="partitioned"; yet, since DD is a multiset, the same observation may occur in both (i.e., DlearnDtestD\subtext{learn} \cap D\subtext{test} might be \ne \emptyset)

  • DtestD\subtext{test} is called the test set
  • DlearnD\subtext{learn} and DtestD\subtext{test} do not overlap and DlearnD=r\frac{|D\subtext{learn}|}{|D|}=r; common values: r=80%r=80\%, r=70%r=70\%, ...
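A sketch with scikit-learn (the dataset and the learner are arbitrary choices; train_test_split does the split, with rr passed as train_size):

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_learn, X_test, y_learn, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42)  # r = 80%
m = DecisionTreeClassifier().fit(X_learn, y_learn)  # learning, executed once
v_effect = accuracy_score(y_test, m.predict(X_test))  # assessment on D_test
print(v_effect)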

Effectiveness of assessment:

  • generalization is assessed
  • what if the division of DD in DlearnD\subtext{learn} and DtestD\subtext{test} is lucky/unlucky? no robustness w.r.t. the division

Fair! \approx👍

Efficiency of assessment:

  • learning is executed just once

Good! 👍

137 / 366

Role of DtestD\subtext{test}

DtestD\subtext{test}, with respect to the model mm, is unseen data, because it has not been used for learning.

Assessing mm on unseen data answers the questions:

  • to which degree the model generalizes beyond examples?
  • does the model work well on new data?
  • how well will the ML system work in the future? on data that does not exist today

In practice DtestD\subtext{test} and DlearnD\subtext{learn} are obtained from a DD that is collected all at once:

  • DtestD\subtext{test} might represent future data only roughly
138 / 366

Assessment vs. reality

What if the model/ML system does not work well on actual unseen/new/future data? That is, what if the predictions are wrong in practice?

Assessment 👍 - Reality 👎

DD was not representative w.r.t. the real system:

  • low coverage
  • old, i.e., the system has changed

or some bug in the implementation...

Assessment 👎 - Reality 👎

DD is not informative w.r.t. the real system:

  • yy in DD does not depend on xx in DD
    • wrong features
    • too much noise in the features

or some bug in the implementation...

Assessment 👍 - Reality 👍

Nice! We did everything well!

or some bug in the implementation...

Assessment 👎 - Reality 👍

Sooooo lucky! 🍀🍀🍀

or some bug in the implementation...

you never know if there is some bug in the implementation...

139 / 366

Repeated random train/test division

function learn-effect-repeated(flearn,fpredict,D,r,k)\text{learn-effect-repeated}(f'\subtext{learn},f'\subtext{predict}, D,r,k) {
for j1,,kj \in 1,\dots,k {
Dlearnsubbag(D,r)D\subtext{learn} \gets \text{subbag}(D, r)
DtestDDlearnD\subtext{test} \gets D \setminus D\subtext{learn}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
vjpredict-effect(fpredict,m,Dtest)v_j \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
}
return 1kjvj\frac{1}{k}\sum_j v_j;
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}r,kr,k

r[0,1]r \in [0,1] and kN+k \in \mathbb{N}^+ are parameters

DD is split in DlearnD\subtext{learn} and DtestD\subtext{test} for kk times and the measures are averaged: subbag()\text{subbag}() has to be non-deterministic

  • common values: k=10k=10, k=5k=5, ...
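A sketch with scikit-learn, whose ShuffleSplit implements exactly this repeated random division (the learner and the dataset are, again, arbitrary choices):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
splitter = ShuffleSplit(n_splits=10, train_size=0.8, random_state=0)  # k=10, r=0.8
v = cross_val_score(DecisionTreeClassifier(), X, y, cv=splitter)  # one v_j per split
print(np.mean(v))  # the averaged effectiveness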

Effectiveness of assessment:

  • generalization is assessed
  • measures are repeated with different DlearnD\subtext{learn} and DtestD\subtext{test}: robustness w.r.t. data

Good! 👍

Efficiency of assessment:

  • learning is executed kk times: might be heavy

k\propto k 🫳

140 / 366

Cross-fold validation (CV)

function learn-effect-cv(flearn,fpredict,D,k)\text{learn-effect-cv}(f'\subtext{learn},f'\subtext{predict}, D, k) {
for j1,,kj \in 1,\dots,k {
Dtestfold(D,j,k)D\subtext{test} \gets \text{fold}(D, j, k)
DlearnDDtestD\subtext{learn} \gets D \setminus D\subtext{test}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
vjpredict-effect(fpredict,m,Dtest)v_j \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
}
return 1kjvj\frac{1}{k}\sum_j v_j;
}

flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}kk

kN+k \in \mathbb{N}^+ is a parameter

Cross-fold validation is like learn-effect-repeated\text{learn-effect-repeated}, but the kk DtestD\subtext{test} are mutually disjoint (folds).
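With scikit-learn, KFold gives the kk mutually disjoint folds (a sketch; the learner is an arbitrary choice):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
v = cross_val_score(DecisionTreeClassifier(), X, y,
                    cv=KFold(n_splits=10, shuffle=True, random_state=0))  # k=10
print(np.mean(v))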

Effectiveness of assessment:

  • generalization is assessed
  • measures are repeated with different DlearnD\subtext{learn} and DtestD\subtext{test}: robustness w.r.t. data

Good! 👍

Efficiency of assessment:

  • learning is executed kk times: might be heavy

k\propto k 🫳

141 / 366

Leave-one-out CV (LOOCV)

Simply a CV where the number of folds kk is D|D|:

  • each DtestD\subtext{test} consists of just one observation
flearn,fpredict,Df'\subtext{learn}, f'\subtext{predict}, Dflearn-effectf\subtext{learn-effect}veffectv\subtext{effect}
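The corresponding scikit-learn sketch, with LeaveOneOut as the splitter (i.e., k=Dk=|D|):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
v = cross_val_score(DecisionTreeClassifier(), X, y, cv=LeaveOneOut())
print(np.mean(v))  # |D| learnings: may be slow on large datasets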

Effectiveness of assessment:

  • generalization is assessed
  • measures are repeated with different DlearnD\subtext{learn} and DtestD\subtext{test}: robustness w.r.t. data

Good! 👍

Efficiency of assessment:

  • learning is executed k=Dk=|D| times: might be heavy

Bad 👎

142 / 366

Visual summary

Same

Eff\rightarrow \text{Eff}

1 learning; D|D| predictions

Static random (r=0.8r=0.8)

Eff\rightarrow \text{Eff}

1 learning; D(1r)|D|(1-r) predictions

Repeated random (r=0.8r=0.8, k=4k=4)

Eff1\rightarrow \text{Eff}_1
Eff2\rightarrow \text{Eff}_2
Eff3\rightarrow \text{Eff}_3
Eff4\rightarrow \text{Eff}_4
} Eff\rightarrow \text{Eff}

kk learnings; D(1r)|D|(1-r) pred. after each, kD(1r)k|D|(1-r) pred.

CV (k=5k=5)

Eff1\rightarrow \text{Eff}_1
Eff2\rightarrow \text{Eff}_2
Eff3\rightarrow \text{Eff}_3
Eff4\rightarrow \text{Eff}_4
Eff5\rightarrow \text{Eff}_5
} Eff\rightarrow \text{Eff}

kk learnings; 1kD\frac{1}{k}|D| pred. after each, D|D| pred. tot.

LOOCV

Eff1\rightarrow \text{Eff}_1
Eff2\rightarrow \text{Eff}_2
...
EffD\rightarrow \text{Eff}_{|D|}
} Eff\rightarrow \text{Eff}

D|D| learnings; 11 pred. after each, D|D| pred. tot.

143 / 366

More than the average

Repeated random, CV, and LOOCV internally compute the model effectiveness for several models learned on (slightly) different datasets:

Eff1,Eff2,,EffkEff=1kjEffj\text{Eff}_1, \text{Eff}_2, \dots, \text{Eff}_k \rightarrow \text{Eff}=\c{2}{\frac{1}{k} \sum_j \text{Eff}_j}

function learn-effect-cv(flearn,fpredict,D,k)\text{learn-effect-cv}(f'\subtext{learn},f'\subtext{predict}, D, k) {
for (j1,,kj \in 1,\dots,k) {
Dtestfold(D,j,k)D\subtext{test} \gets \text{fold}(D, j, k)
DlearnDDtestD\subtext{learn} \gets D \setminus D\subtext{test}
mflearn(Dlearn)m \gets f'\subtext{learn}(D\subtext{learn})
vjpredict-effect(fpredict,m,Dtest)v_j \gets \text{predict-effect}(f'\subtext{predict},m,D\subtext{test})
}
return 1kjvj\frac{1}{k}\sum_j v_j;
}

We can compute both the mean and the standard deviation from (Effj)j(\text{Eff}_j)_j:

Effμ=1kjEffj\text{Eff}_\mu=\frac{1}{k} \sum_j \text{Eff}_j

Effσ=1kj(EffjEffμ)2\text{Eff}_\sigma=\sqrt{\frac{1}{k} \sum_j \left(\text{Eff}_j-\text{Eff}_\mu\right)^2}

  • Mean Effμ\text{Eff}_\mu: what's the learning technique effectiveness on average?
  • Standard deviation Effσ\text{Eff}_\sigma: how consistent is the learning technique w.r.t. different datasets?
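Continuing the CV sketch above, the per-fold scores give both indexes directly:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
effs = cross_val_score(DecisionTreeClassifier(), X, y,
                       cv=KFold(n_splits=10, shuffle=True, random_state=0))
print(effs.mean(), effs.std())  # Eff_mu and Eff_sigma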
144 / 366

Comparison using many measures

Suppose you have assessed two learning techniques with 10-CV and AUC (with midpoints τ\tau):

  • for LT1: AUCμ=0.83\text{AUC}_\mu=0.83 and AUCσ=0.04\text{AUC}_\sigma=0.04
  • for LT2: AUCμ=0.75\text{AUC}_\mu=0.75 and AUCσ=0.03\text{AUC}_\sigma=0.03

What's the best learning technique?

Now, suppose that you instead find:

  • for LT1: AUCμ=0.81\text{AUC}_\mu=0.81 and AUCσ=0.12\text{AUC}_\sigma=0.12
  • for LT2: AUCμ=0.78\text{AUC}_\mu=0.78 and AUCσ=0.02\text{AUC}_\sigma=0.02

What's the best learning technique?

  • LT1 is better, on average, but less consistent
  • on actual, unseen data, LT1 might give a worse model than LT2

Can we really state that LT1 is better than LT2?

145 / 366

Comparison and statistics

Broader example:
suppose you meet 1010 guys from Udine and 1010 from Trieste and ask them how tall they are:

City Measures μ\mu σ\sigma
Udine 154,193,170,175,172,183,160,162,161,179154, 193, 170, 175, 172, 183, 160, 162, 161, 179 170.9170.9 12.0212.02
Trieste 167,166,180,175,168,167,173,181,169,173167, 166, 180, 175, 168, 167, 173, 181, 169, 173 171.9171.9 5.445.44

Questions:

  1. are these 1010 guys from Trieste taller than these 1010 guys from Udine?
  2. are guys from Trieste taller than guys from Udine?

Possible ways of answering:

  • laziest: yes and yes μTs>μUd\mu\subtext{Ts} > \mu\subtext{Ud} and you assume these 10+10 are representative
  • lazy: yes and I don't know μTs>μUd\mu\subtext{Ts} > \mu\subtext{Ud} but you don't assume representativeness
  • smart: yes and let's look at boxplot assume "these" means "these on average"
  • stats-geek: yes and let's do a statistical significance test assume "these" means "these on average"
146 / 366

Comparing with boxplot

Boxplot of Ts and Ud guys height

Questions:

  1. are these 1010 guys from Trieste taller than these 1010 guys from Udine?
  2. are guys from Trieste taller than guys from Udine?

Answers with the boxplot:

  1. yes, but just a bit
  2. prefer not to say
    • as an aside: people from Udine are much less consistent in height
147 / 366

Statistical significance test

Disclaimer: here, just a brief overview; go to statisticians for more details/theory

For us, a statistical significance test is a procedure that, given two samples {xa,i}i\seq{x_{a,i}}{i} and {xb,i}i\seq{x_{b,i}}{i} (i.e., collections of observations) of two random variables XaX_a and XbX_b and a set of hypotheses H0H_0 (the null hypothesis), returns a number p[0,1]p \in [0,1], called the pp-value.

{xa,i}i,{xb,i}i,H0\seq{x_{a,i}}{i}, \seq{x_{b,i}}{i}, H_0fstat-testf\subtext{stat-test}pp

The pp-value represents the probability that, by collecting other two samples from the same random variables and assuming that H0H_0 still holds, the new two samples are more unlikely than {xa,i}i,{xb,i}i\seq{x_{a,i}}{i}, \seq{x_{b,i}}{i}.

148 / 366

Example

H0H_0: (you assume all are true)

  • XaX_a is normally distributed
  • XbX_b is normally distributed
  • μa=E[Xa]=μb=E[Xb]\mu_a=E[X_a] = \mu_b=E[X_b] (our question, indeed)

Samples:

  • XaX_a sample: {1,1,2,2,3,3}\{1,1,2,2,3,3\}
  • XbX_b sample: {0,0,1,0,1,1}\{0,0,1,0,1,1\}

p=0.90p=0.90 means:

  • if you resample XaX_a, XbX_b, very likely you will find samples that are more unlikely, given H0H_0
  • so, these samples are indeed likely, given H0H_0
  • so, I can assume H0H_0 is true

p=0.01p=0.01 means:

  • if you resample XaX_a, XbX_b, very unlikely you will find samples that are more unlikely, given H0H_0
  • so, these samples are indeed unlikely, given H0H_0
  • so, I can think that H0H_0 is likely false I've been "very lucky" with these samples, if H0H_0 is true; or no luck if it's false
    • not necessarily the μa=μb\mu_a = \mu_b part; maybe the normality part
149 / 366

In practice

{xa,i}i,{xb,i}i,H0\seq{x_{a,i}}{i}, \seq{x_{b,i}}{i}, H_0fstat-testf\subtext{stat-test}pp

There exist several concrete statistical significance tests, e.g.:

  • Wilcoxon (in many versions)
  • Friedman (in many versions)

Usually, you aim at arguing that μa>μb\mu_a > \mu_b (one-tailed) or μaμb\mu_a \ne \mu_b (two-tailed):

  1. you choose one test based on the other parts of H0H_0
  2. you compute the pp-value
  3. you hope it is low
    • and compare it against a predefined threshold α\alpha, usually 0.050.05
    • with \ne, if p<αp<\alpha, you say that there is a statistically significant difference (between the mean values)
150 / 366

Trieste vs. Udine

> wilcox.test(h_ts, h_ud)
Wilcoxon rank sum test with continuity correction
data: h_ts and h_ud
W = 54.5, p-value = 0.7621
alternative hypothesis: true location shift is not equal to 0

H0H_0 \ni true location shift is equal to 0

p=0.7621>0.05p=0.7621 > 0.05: we cannot reject the null hypothesis
\Rightarrow people from Trieste are not taller than people from Udine or, at least, we cannot state this
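The same test in Python, as a sketch: SciPy exposes the Wilcoxon rank sum test as the Mann-Whitney U test (the p-value may differ slightly from R's, which applies a continuity correction):

from scipy.stats import mannwhitneyu

h_ud = [154, 193, 170, 175, 172, 183, 160, 162, 161, 179]
h_ts = [167, 166, 180, 175, 168, 167, 173, 181, 169, 173]
stat, p = mannwhitneyu(h_ts, h_ud, alternative="two-sided")
print(p)  # > 0.05: we cannot reject the null hypothesis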

More on statistical significance tests:

  • Joaquín Derrac et al. "A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms". In: Swarm and Evolutionary Computation 1.1 (2011)
  • Colas, Cédric, Olivier Sigaud, and Pierre-Yves Oudeyer. "How many random seeds? statistical power analysis in deep reinforcement learning experiments." arXiv preprint arXiv:1806.08295 (2018).
  • Greenland, Sander, et al. "Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations." European journal of epidemiology 31.4 (2016): 337-350.
151 / 366

Examples from research papers

152 / 366

Android malware detection¹ (1)

Results presentation for Android malware detection

  • binary classification
  • a few learning techniques
  • 10-CV
  • just effectiveness
    • μ\mu, σ\sigma for accuracy, FPR, FNR

Similar:
Canfora, Gerardo, et al. "Detecting android malware using sequences of system calls." Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. 2015.

  • one dataset, three variants of effectiveness
    • unseen run of known app
    • unseen app of known family
    • unseen app of unseen family
  1. Canfora, Gerardo, et al. "Acquiring and analyzing app metrics for effective mobile malware detection." Proceedings of the 2016 ACM on International Workshop on Security And Privacy Analytics. 2016.
153 / 366

Twitter botnet detection¹

Results presentation for Twitter botnet detection

  • binary classification
  • a few learning techniques
  • a baseline
  • just effectiveness
  • MCC is the Matthews correlation coefficient
    • MCC=TP  TNFP  FN(TP+FP)(TP+FN)(TN+FP)(TN+FN)\text{MCC}=\frac{\text{TP} \; \text{TN} - \text{FP} \; \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}}
  1. Mazza, Michele, et al. "Rtbust: Exploiting temporal patterns for botnet detection on twitter." Proceedings of the 10th ACM conference on web science. 2019.
154 / 366

Anomaly detection in cyber-physical systems¹

Results presentation for CPS anomaly detection

  • anomaly detection
    • binary classification with only negative examples in learning
  • many datasets
  • two methods
  • fevalsf\subtext{evals} is a measure of efficiency of learning
  • TPR, FPR, AUC for effectiveness
  1. Indri, Patrick, et al. "One-Shot Learning of Ensembles of Temporal Logic Formulas for Anomaly Detection in Cyber-Physical Systems." European Conference on Genetic Programming (Part of EvoStar). Springer, Cham, 2022.
155 / 366

AutoML approaches comparison¹

Results presentation for AutoML comparison

  • 6 approaches
  • 10 scenarios
  • box plots
    • accuracy
    • F1 for unbalanced case
  1. Truong, Anh, et al. "Towards automated machine learning: Evaluation and comparison of AutoML approaches and tools." 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI). IEEE, 2019.
156 / 366

Assessing supervised ML

Brief recap

157 / 366

Assessing a model

Question: is the model modeling the real system?

Answer: compare responses on the same data and compute one or more performance indexes!

Model mm (or fpredictf\subtext{predict})

fpredict(,m)f'\subtext{predict}(\cdot, m)xxyy

Real system ss

ssxxyy

Binary classification

  • FPR and FNR
    • TNR and TPR
    • precision and recall
    • sensitivity and spec.
  • EER greater cost, lower efficiency
  • AUC greater cost, lower efficiency

Classification (w/ binary)

  • accuracy
  • error
  • weighted accuracy

Regression

  • MAE
  • MSE
  • RMSE
  • MAPE

Bounds for classification effectiveness:

  • random classifier (lower bound)
  • dummy classifier (better lower bound, baseline)
  • Bayes classifier (ideal upper bound)
158 / 366

Assessing a learning technique

Effectiveness of the single technique

Sketch: learn a model on DlearnD\subtext{learn}, assess the model on DtestD\subtext{test}; which learning/test division?

Same / Static rnd / Repeated rnd / CV / LOOCV / ...

Comparison between techniques

  • just compare one measure: Eff1\text{Eff}_1 vs. Eff2\text{Eff}_2
  • compare μ\mu of several measures: Effμ,1\text{Eff}_{\mu,1} vs. Effμ,2\text{Eff}_{\mu,2}
  • compare μ\mu and σ\sigma of several measures: Effμ,1,Effσ,1\text{Eff}_{\mu,1},\text{Eff}_{\sigma,1} vs. Effμ,2,Effσ,2\text{Eff}_{\mu,2}, \text{Eff}_{\sigma,2}
  • compare using boxplots
  • compare using a statistical significance test
159 / 366

Effectiveness and efficiency of assessment

Indexes¹

[chart: indexes placed by effectiveness (low to large) vs. efficiency (low to large) of assessment: Acc\text{Acc}/Err\text{Err}, FPR\text{FPR}+FNR\text{FNR}, EER\text{EER}, AUC\text{AUC}]

Learning/test division

[chart: learning/test divisions placed by effectiveness (low to large) vs. efficiency (low to large) of assessment: Same, Static rnd, CV/Repeated rnd, LOOCV]
  1. Mainly for binary classification
  2. + here means "use both"
160 / 366

Tree-based learning techniques

161 / 366

Once upon a time¹... there is an amusement park with a carousel and an attendant deciding who can ride and who cannot ride. The park owner wants to replace the attendant with a robotic gate.

The owner calls us as machine learning experts.

A carousel

  1. For almost all the learning techniques, we'll (i) see a toy, but "realistic", problem, (ii) try to learn a model by hand (i.e., human learning), and (iii) try to translate the manual procedure into an automatic one (i.e., machine learning).
162 / 366

Approaching the problem

  1. Should we use ML? \rightarrow yes
  2. Supervised vs. unsupervised \rightarrow supervised
  3. Define the problem statement:
    • define XX and YY
    • feature engineering
    • define a way for assessing solutions
  4. Design the ML system
  5. Implement the ML system
  6. Assess the ML system

XX and YY

  • xx is a person approaching the carousel
  • yy is can ride\text{can ride} or cannot ride\text{cannot ride} (binary class)

Features (chosen with domain expert):

  • person height (in cm)
  • person age (in years)

Hence:

  • X=Xheight×Xage=R+×R+X = X\subtext{height} \times X\subtext{age} = \mathbb{R}^+ \times \mathbb{R}^+
  • x=(xheight,xage)\vect{x}=(x\subtext{height}, x\subtext{age}) (p=2p=2 numeric independent variables)

We (the ML expert and the domain expert) decide to collect some data D={(x(i),y(i))}iD=\seq{(x^{(i)},y^{(i)})}{i} by observing the real system:

  • it'll come handy for both learning and assessment
163 / 366

Exploring the data

Carousel data

The data exploration suggests that using ML is not a terrible idea.

Assume we are computer scientists and we like (nested) if-then-else structures: can we manually build an if-then-else structure that allows us to make the decision?

Requirements (to keep it feasible manually):

  • each if condition should:
    • involve just one independent variable
    • consist of a threshold comparison
  • the decision has to be \c{1}{●} or \c{2}{●} (one of the two classes)

Strategy:

  • tell apart points of different colors
164 / 366

Building the if-then-else

Carousel data

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}

  • requirements are met
  • background color at position x=(xage,xheight)\vect{x}=(x\subtext{age},x\subtext{height}) is the color the code above will assign to that x\vect{x}, i.e., fpredict(x)f\subtext{predict}(\vect{x})
  • most of the examples fall in the correct colored region
    • maybe the else branch is too rough

Let's improve it!

165 / 366

Building the if-then-else

Carousel data

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
if xheight120x\subtext{height}\le 120 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}
}

  • requirements are met
  • almost all the examples fall in the correct colored region

Nice job!

166 / 366

The decision tree

This if-then-else nested structure can be represented as a tree:

xagex\subtext{age} vs. 1010\le>>xheightx\subtext{height} vs. 120120\le>>

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
if xheight120x\subtext{height}\le 120 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}
}

We call this a decision tree, since we use it inside an fpredictf\subtext{predict} for making a decision:

  • it's a binary tree, since nodes have exactly 0 or 2 children
  • non-terminal nodes (or branch nodes) hold a pair (independent variable, threshold)
  • terminal nodes (or leaf nodes) hold one value yYy \in Y
167 / 366

De-hard-coding fpredictf\subtext{predict}

Now: our human learned fpredictf\subtext{predict}

fpredictf\subtext{predict}xxyy

function predict(x)\text{predict}(\vect{x}) {
if xage10x\subtext{age}\le 10 then {
return \c{1}{●}
} else {
if xheight120x\subtext{height}\le 120 then {
return \c{1}{●}
} else {
return \c{2}{●}
}
}
}

Goal: an fpredictf'\subtext{predict} working on any tree

fpredictf'\subtext{predict}x,m\vect{x},myy

function predict(x,m)\text{predict}(\vect{x}, m) {
...
}

We human learned (i.e., manually designed) a function where the decision tree is hard-coded in the predict()\text{predict}() function in the form of an if-then-else structure:

  • can we pull the decision tree out of it and make predict()\text{predict}() a templated function?
168 / 366

Formalizing the decision tree

Scenario: classification with multivariate numerical features:

  • X=X1××XpX = X_1 \times \dots \times X_p, with each XiRX_i\subseteq\mathbb{R}
    • we write x=(x1,,xp)=(xi)i\vect{x} = (x_1,\dots,x_p)=(x_i)_i
  • YY, finite without ordering

The model tTp,Yt \in T_{p,Y} is a decision tree defined over X1××Xp,YX_1 \times \dots \times X_p,Y, i.e.:

  • each tt is a binary tree
  • each non-terminal node is labeled with a pair (j,τ)(j,\tau), with j{1,,p}j \in \{1,\dots,p\} and τR\tau \in \mathbb{R}
    • jj is the index of the independent variable
    • τ\tau is a threshold for comparison
  • each terminal node is labeled with a yYy \in Y
xagex\subtext{age} vs. 1010\le>>xheightx\subtext{height} vs. 120120\le>>
169 / 366

Compact representation of (binary) trees

We represent a tree tTLt \in T_L as: t=[l;t;t]t = \tree{\c{3}{l}}{\c{4}{t'}}{\c{4}{t''}} where t,tTL{}t', t'' \in T_L \cup \{\varnothing\} are the left and right children trees and lLl \in L is the label.

If the tree is a terminal node¹, it has no children (i.e., t=t=t'=t''=\varnothing) and we write: t=[l;;]=[l]t = \tree{l}{\varnothing}{\varnothing}=\treel{l}

For decision trees:

  • L=({1,,p}×R)YL= (\{1,\dots,p\} \times \mathbb{R}) \cup Y, that is, a label can be a pair (j,τ)(j,\tau) or a yy
  • if lYl \in Y, then t=t=t'=t''=\varnothing

We shorten T({1,,p}×R)YT_{(\{1,\dots,p\} \times \mathbb{R}) \cup Y} as Tp,YT_{p,Y}.

xagex\subtext{age} vs. 1010\le>>xheightx\subtext{height} vs. 120120\le>>

With:

  • X=Xage×Xheight=X1×X2X=X\subtext{age} \times X\subtext{height} = X_1 \times X_2
  • Y={,}Y=\set{\c{1}{●},\c{2}{●}}

This tree is: t=[(1,10);[];[(2,120);[];[]]]t = \tree{(1,10)}{\treel{\c{1}{●}}}{\tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}}

Would you be able to write a parser for this?

  1. Actually, node = tree, i.e., a node is a tree and a tree is a node!
170 / 366

Templated fpredictf'\subtext{predict}

function predict(x,t)\text{predict}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
ylabel-of(t)y \gets \text{label-of}(t)
return yy
} else { //hence tt is a branch node
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict(x,left-child-of(t))\text{predict}(\vect{x}, \text{left-child-of}(t)) //recursion
} else {
return predict(x,right-child-of(t))\text{predict}(\vect{x}, \text{right-child-of}(t)) //recursion
}
}
}

  • has-children(t)\text{has-children}(t) is true iff tt is not terminal
  • label-of(t)\text{label-of}(t) returns the label of tt
    • a yYy \in Y for terminal nodes
    • a (j,τ){1,,p}×R(j,\tau) \in \{1,\dots,p\} \times \mathbb{R} for non-terminal nodes
  • left-child-of(t)\text{left-child-of}(t) and right-child-of(t)\text{right-child-of}(t) return the left or right child of tt
    • that are other trees, in general

It's a recursive function that:

  • works with any tTp,Yt \in T_{p,Y} and any xRp\vect{x} \in \mathbb{R}^p
  • always returns a yYy \in Y
fpredictf'\subtext{predict}x,t\vect{x},tyy
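A Python sketch of this templated fpredictf'\subtext{predict}, with a tree as a nested tuple (label, left, right), 0-based feature indexes, and the two carousel classes written as strings (all naming choices are illustrative):

def predict(x, t):
    label, left, right = t
    if left is None:  # terminal node: the label is a class
        return label
    j, tau = label  # branch node: the label is (feature index, threshold)
    return predict(x, left) if x[j] <= tau else predict(x, right)

# the carousel tree, with x = (age, height)
t = ((0, 10),
     ("cannot ride", None, None),
     ((1, 120),
      ("cannot ride", None, None),
      ("can ride", None, None)))
print(predict((14, 155), t))  # follows the right branch twice: "can ride"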
171 / 366

fpredictf'\subtext{predict} application example

1st call: x=(14,155),t=[(1,10);[];[(2,120);[];[]]]\vect{x}=(14,155), t = \tree{(1,10)}{\treel{\c{1}{●}}}{\tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}}

¬has-children(t)=false\neg\text{has-children}(t)=\text{false}
(j,τ)=(1,10)(j,\tau)=(1,10)
x110=falsex_1 \le 10 = \text{false}
right-child-of(t)=[(2,120);[];[]]\text{right-child-of}(t)= \tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}

2nd call: x=(14,155),t=[(2,120);[];[]]\vect{x}=(14,155), t = \tree{(2,120)}{\treel{\c{1}{●}}}{\treel{\c{2}{●}}}

¬has-children(t)=false\neg\text{has-children}(t)=\text{false}
(j,τ)=(2,120)(j,\tau)=(2,120)
x2120=falsex_2 \le 120 = \text{false}
right-child-of(t)=[]\text{right-child-of}(t)= [\c{2}{●}]

3rd call: x=(14,155),t=[]\vect{x}=(14,155), t = \treel{\c{2}{●}}

¬has-children(t)=true\neg\text{has-children}(t)=\text{true}
y=y=\c{2}{●}, which is then returned up through the three nested calls

function predict(x,t)\text{predict}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
ylabel-of(t)y \gets \text{label-of}(t)
return yy
} else {
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict(x,left-child-of(t))\text{predict}(\vect{x}, \text{left-child-of}(t))
} else {
return predict(x,right-child-of(t))\text{predict}(\vect{x}, \text{right-child-of}(t))
}
}
}

172 / 366

Towards tree learning

We have our fpredict:Rp×Tp,YYf'\subtext{predict}: \mathbb{R}^p \times T_{p,Y} \to Y; for having a learning technique we miss only the learning function, i.e., flearn:P(Rp×Y)Tp,Yf'\subtext{learn}: \mathcal{P}^*(\mathbb{R}^p \times Y) \to T_{p,Y}:

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}tt

What we did manually (i.e., how we human learned):

  1. until we are satisfied
  2. put a vertical/horizontal line that well separates the data
  3. repeat from step 1 once for each of the two resulting regions

Let's rewrite it as (pseudo-)code!

173 / 366

Recursive binary splitting

function learn({(x(i),y(i))}i)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
if should-stop({y(i)}i)\text{should-stop}(\seq{y^{(i)}}{i}) then {
yarg maxyYi1(y(i)=y)y^\star \gets \argmax_{y \in Y} \sum_i \mathbf{1}(y^{(i)}=y) //yy^\star is the most frequent class
return node-from(y,,)\text{node-from}(y^\star,\varnothing,\varnothing)
} else { //hence tt is a branch node
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ),\c{3}{\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau})}, //recursion
learn({(x(i),y(i))}ixj(i)>τ)\c{3}{\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau})} //recursion
)
return tt
}
}

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}tt
  1. until we are satisfied
  2. put a vertical/horizontal line that well separates the data
  3. repeat step 1 once for each of the two resulting regions

{(x(i),y(i))}ixj(i)τ\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau} is the sub-multiset of {(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i} composed of pairs for which xjτx_j \le \tau

This flearnf'\subtext{learn} is called recursive binary splitting:

  • it's recursive
  • when recurses, splits the data in two parts (binary)
    • it's a top-down approach: starts from the big problem and makes it smaller (divide and conquer)
  • when stopping recursion, put a node with the most frequent class
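A compact Python sketch of the whole technique, with trees as nested tuples as in the earlier predict sketch (find_best_branch and the stopping criterion are the ones detailed in the next slides; here they are folded in so that the sketch is self-contained):

from collections import Counter

def error(ys):  # the error of the dummy classifier on ys
    return 1 - Counter(ys).most_common(1)[0][1] / len(ys)

def find_best_branch(pairs):
    best = None  # (summed error, j, tau)
    for j in range(len(pairs[0][0])):
        xs = sorted({x[j] for x, _ in pairs})
        for a, b in zip(xs, xs[1:]):  # candidate thresholds: the midpoints
            tau = (a + b) / 2
            le = [y for x, y in pairs if x[j] <= tau]
            gt = [y for x, y in pairs if x[j] > tau]
            e = error(le) + error(gt)
            if best is None or e < best[0]:
                best = (e, j, tau)
    return best  # None if the data cannot be split

def learn(pairs, n_min):  # pairs: a list of (x, y), with x a tuple of numbers
    ys = [y for _, y in pairs]
    branch = None
    if len(ys) > n_min and error(ys) > 0:
        branch = find_best_branch(pairs)
    if branch is None:  # should-stop: terminal node with the most frequent class
        return (Counter(ys).most_common(1)[0][0], None, None)
    _, j, tau = branch
    return ((j, tau),
            learn([(x, y) for (x, y) in pairs if x[j] <= tau], n_min),
            learn([(x, y) for (x, y) in pairs if x[j] > tau], n_min))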
174 / 366

Finding the best branch

Intuitively:

  • consider all variables (i.e., all jj) and all¹ threshold values
  • choose the pair (variable, threshold) that best separates the data
    • i.e., that results in the lowest rate of misclassified examples

In detail (and formally):

function find-best-branch({(x(i),y(i))}i)\text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(j,τ)arg minj,τ(error({y(i)}ixj(i)τ)+error({y(i)}ixj(i)>τ))(j^\star, \tau^\star) \gets \argmin_{j,\tau} \left(\text{error}(\c{1}{\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau}})+\text{error}(\c{1}{\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau}})\right)
return (j,τ)(j^\star, \tau^\star)
}

and

function error({y(i)}i)\text{error}(\seq{y^{(i)}}{i}) { //the error of the dummy classifier on {y(i)}i\seq{y^{(i)}}{i}
yarg maxyi1(y(i)=y)y^\star \gets \argmax_y \sum_i \mathbf{1}(y^{(i)}=y) //yy^\star is the most freq class
return 1ni1(y(i)y)\frac{1}{n} \sum_i \mathbf{1}(y^{(i)} \ne y^\star) //n={y(i)}in=|\seq{y^{(i)}}{i}|
}

Interpretation: if we split the data at this point (i.e., a (j,τ)(j, \tau) pair) and use one dummy classifier on each of the two sides, what would be the resulting error?

This approach is greedy, since it tries to obtain the maximum result (finding the branch), with the minimum effort (using just two dummy classifiers later on):

  • in practice, it makes this learning technique computationally light!
  1. you just need to consider, for each jj-th feature, the midpoints of (xj(i))i(x_j^{(i)})_i: at most nn of them
175 / 366

Deciding when to stop (recursion)

Intuitively:

  • if all the examples belong to the same class, stop
    • splitting would be pointless!
  • or, if the number of examples is very small, stop \approx what we did while human learning
    • no need to bother

In detail (and formally):

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then { //n={y(i)}in=|\seq{y^{(i)}}{i}|
return true\text{true};
}
if error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

Checking the first condition is, in general, cheaper than checking the second condition.

  • only {y(i)}i\seq{y^{(i)}}{i} is needed to decide whether to stop, {x(i)}i\seq{x^{(i)}}{i} is not used!
  • nminn\subtext{min} is a parameter of fshould-stopf\subtext{should-stop}
    • it represents the "very small" criterion
    • it propagates to flearnf'\subtext{learn}, which uses fshould-stopf\subtext{should-stop}
    • (also denoted as kmink\subtext{min})
  • since error()\text{error()} is the classification error done by the dummy classifier, it is =0=0 iff the most frequent class yy^\star is the only class in {y(i)}i\seq{y^{(i)}}{i}
176 / 366

flearnf'\subtext{learn} application example

1st call:
(j,τ)=(1,7)(j,\tau) = (1,7) [plot: candidate split points with the errors of the two dummy classifiers]

1st-l call:
(j,τ)=(1,2)(j,\tau) = (1,2) [plot of candidate split points]

1st-l-l call:
return []\treel{\c{1}{●}}

1st-l-r call:
(j,τ)=(1,4)(j,\tau) = (1,4) [plot of candidate split points]

1st-l-r-l call:
return []\treel{\c{2}{●}}

1st-l-r-r call:
return []\treel{\c{1}{●}}

return [(1,4);[];[]]\tree{(1,4)}{\treel{\c{2}{●}}}{\treel{\c{1}{●}}}
return [(1,2);[];[(1,4);[];[]]]\tree{(1,2)}{\treel{\c{1}{●}}}{\tree{(1,4)}{\treel{\c{2}{●}}}{\treel{\c{1}{●}}}}

1st-r call:
return []\treel{\c{3}{●}}

return [(1,7);[(1,2);[];[(1,4);[];[]]];[]]\tree{(1,7)}{\tree{(1,2)}{\treel{\c{1}{●}}}{\tree{(1,4)}{\treel{\c{2}{●}}}{\treel{\c{1}{●}}}}}{\treel{\c{3}{●}}}

Assume:

  • X=R1=RX=\mathbb{R}^1=\mathbb{R}, Y={,,}Y=\{\c{1}{●},\c{2}{●},\c{3}{●}\}
  • nmin=3n\subtext{min}=3

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
yarg maxyYi1(y(i)=y)y^\star \gets \argmax_{y \in Y} \sum_i \mathbf{1}(y^{(i)}=y)
return node-from(y,,)\text{node-from}(y^\star,\varnothing,\varnothing)
} else {
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}),
learn({(x(i),y(i))}ixj(i)>τ,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min})
)
return tt
}
}

Question: what's the accuracy of this tt on the learning set?

177 / 366

Alternatives for find-best-branch()\text{find-best-branch}()

function find-best-branch({(x(i),y(i))}i)\text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(j,τ)arg minj,τ(error({y(i)}ixj(i)τ)+error({y(i)}ixj(i)>τ))(j^\star, \tau^\star) \gets \argmin_{j,\tau} \left(\c{1}{\text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})}+\c{1}{\text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau})}\right)
return (j,τ)(j^\star, \tau^\star)
}

error({y(i)}i)\text{error}(\seq{y^{(i)}}{i}) is the error the dummy classifier would do on {y(i)}i\seq{y^{(i)}}{i}: error({y(i)}i)=1maxyFr ⁣(y,{y(i)}i)\c{1}{\text{error}(\seq{y^{(i)}}{i})}=1 - \max_y \freq{y, \seq{y^{(i)}}{i}}

Instead of error()\text{error}(), two other variants can be used:

  • Gini index: gini({y(i)}i)=yFr ⁣(y,{y(i)}i)(1Fr ⁣(y,{y(i)}i))\c{1}{\text{gini}(\seq{y^{(i)}}{i})}=\sum_y \freq{y, \seq{y^{(i)}}{i}} \left(1-\freq{y, \seq{y^{(i)}}{i}}\right)
  • Cross entropy: cross-entropy({y(i)}i)=yFr ⁣(y,{y(i)}i)logFr ⁣(y,{y(i)}i)\c{1}{\text{cross-entropy}(\seq{y^{(i)}}{i})}=-\sum_y \freq{y, \seq{y^{(i)}}{i}} \log \freq{y, \seq{y^{(i)}}{i}}

For all:

  • the lower, the better
  • they measure the node impurity, i.e., the amount ee of cases different from the most frequent one among the examples arrived at a certain node
fimpurityf\subtext{impurity}{y(i)}i\seq{y^{(i)}}{i}eR+e \in \mathbb{R}^+
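The three impurity measures, as a Python sketch over a multiset of labels (frequencies as in the formulas above):

import math
from collections import Counter

def frequencies(ys):
    return [c / len(ys) for c in Counter(ys).values()]

def error(ys):
    return 1 - max(frequencies(ys))

def gini(ys):
    return sum(f * (1 - f) for f in frequencies(ys))

def cross_entropy(ys):
    return -sum(f * math.log(f) for f in frequencies(ys))

ys = ["a", "a", "b"]
print(error(ys), gini(ys), cross_entropy(ys))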
178 / 366

Node impurity

function find-best-branch({(x(i),y(i))}i,fimpurity)\text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, \c{1}{f\subtext{impurity}}) {
(j,τ)arg minj,τ(fimpurity({y(i)}ixj(i)τ)+fimpurity({y(i)}ixj(i)>τ))(j^\star, \tau^\star) \gets \argmin_{j,\tau} \left(\c{1}{f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})}+\c{1}{f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau})}\right)
return (j,τ)(j^\star, \tau^\star)
}

The way to measure the node impurity might be a parameter of find-best-branch()\text{find-best-branch}(), but it has been found that Gini is better for learning trees than error.

Gini, error, cross-entropy vs. frequency of the most frequent class

Here, for binary classification:

  • on the xx-axis: the frequency f=Fr ⁣(pos,{y(i)}i)f=\freq{\text{pos}, \seq{y^{(i)}}{i}} of the positive class
    • f=0.5f=0.5 is the worst case
    • f=0f=0 and f=1f=1 are the best cases
  • on the yy-axis: the three impurity indexes

Gini and cross-entropy are smoother than the error.

179 / 366

Alternatives for should-stop()\text{should-stop}()

Original version: (data size)

  • too few examples or
  • no errors

nnminn \le n\subtext{min} or error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0

Alternative 1 (tree depth):

  • node depth greater than τd\tau_d or
  • no errors

requires propagating recursively the depth of the node being currently built

Alternative 2 (node impurity):

  • impurity lower than a τϵ\tau_\epsilon

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
if error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

Impact of the parameter:

  • the lower nminn\subtext{min}, the larger the tree
  • the greater τd\tau_d, the larger the tree
  • the lower τϵ\tau_\epsilon, the larger the tree

(for the same dataset, in general)

180 / 366

Tree learning with probability

Learning technique with probability:

  • flearn:P(X×Y)Mf'\subtext{learn}: \mathcal{P}^*(X \times Y) \to M
  • fpredict:X×MPYf''\subtext{predict}: X \times M \to P_Y
xxmmfpredictf''\subtext{predict}pparg maxyY\argmax\sub{y \in Y}yy

For tree learning:

  • flearn:P(X1××Xp×Y)T({1,,p}×R)PYf'\subtext{learn}: \c{1}{\mathcal{P}^*(X_1 \times \dots \times X_p \times Y)} \to \c{2}{T_{(\{1,\dots,p\}\times\mathbb{R}) \cup P_Y}}
    • given a multivariate dataset, returns a tree in T({1,,p}×R)PYT_{(\{1,\dots,p\}\times\mathbb{R}) \cup P_Y}
  • fpredict:X1××Xp×T({1,,p}×R)PYPYf''\subtext{predict}: \c{1}{X_1 \times \dots \times X_p} \times \c{2}{T_{(\{1,\dots,p\}\times\mathbb{R}) \cup P_Y}} \to \c{3}{P_Y}
    • given a multivariate observation and a tree, returns a discrete probability distribution pPYp \in P_Y

Set of trees T({1,,p}×R)PYT_{\c{1}{(\{1,\dots,p\}\times\mathbb{R})} \cup \c{2}{P_Y}}:

  • L=({1,,p}×R)PYL=\c{1}{(\{1,\dots,p\}\times\mathbb{R})} \cup \c{2}{P_Y} is the set of node labels
  • ({1,,p}×R)\c{1}{(\{1,\dots,p\}\times\mathbb{R})} are branch node labels
  • PY\c{2}{P_Y} are terminal node labels
    • i.e., terminal nodes return discrete probability distributions
181 / 366

flearnf'\subtext{learn} with probability

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)
} else {
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}),
learn({(x(i),y(i))}ixj(i)>τ,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min})
)
return tt
}
}

  • yFr ⁣(y,{y(i)}i)y \mapsto \freq{y, \seq{y^{(i)}}{i}} is a way to specify the concrete function that, given a yYy \in Y returns its frequency Fr ⁣(y,{y(i)}i)[0,1]\freq{y, \seq{y^{(i)}}{i}} \in [0,1]
  • "pp \gets \dots" means "the variable¹ pp takes the value \dots" or "the variable pp becomes \dots"
  • hence, pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}} means "pp becomes the function that maps each yy to its frequency Fr ⁣(y,{y(i)}i)\freq{y, \seq{y^{(i)}}{i}} in {y(i)}i\seq{y^{(i)}}{i}"
  1. here, "variable" as a computer programming term

Before (without probability):

yarg maxyYi1(y(i)=y)y^\star \gets \argmax_{y \in Y} \sum_i \mathbf{1}(y^{(i)}=y)
return node-from(y,,)\text{node-from}(y^\star,\varnothing,\varnothing)

with {y(i)}i\seq{y^{(i)}}{i} being three \c{1}{●}, one \c{2}{●}, and one \c{3}{●},
returns []\treel{\c{1}{●}}

After (with probability):

pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)

with {y(i)}i\seq{y^{(i)}}{i} being three \c{1}{●}, one \c{2}{●}, and one \c{3}{●},
returns [(35,15,15)]\treel{(\c{1}{● \smaller{\frac{3}{5}}}, \c{2}{● \smaller{\frac{1}{5}}}, \c{3}{● \smaller{\frac{1}{5}}})}
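In Python, such a leaf label pp can be a plain dictionary of class frequencies (a sketch, matching the example above):

from collections import Counter

def leaf_probabilities(ys):
    return {y: c / len(ys) for y, c in Counter(ys).items()}

print(leaf_probabilities(["a", "a", "a", "b", "c"]))  # {'a': 0.6, 'b': 0.2, 'c': 0.2}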
182 / 366

fpredictf'\subtext{predict} with probability

fpredict:X×MYf'\subtext{predict}: X \times M \to Y

function predict(x,t)\text{predict}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
plabel-of(t)p \gets \text{label-of}(t)
yarg maxyYp(y)y^\star \gets \argmax_{y \in Y} p(y)
return yy^\star
} else {
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict(x,left-child-of(t))\text{predict}(\vect{x}, \text{left-child-of}(t))
} else {
return predict(x,right-child-of(t))\text{predict}(\vect{x}, \text{right-child-of}(t))
}
}
}

fpredict:X×MPYf''\subtext{predict}: X \times M \to P_Y

function predict-with-prob(x,t)\text{predict-with-prob}(\vect{x}, t) {
if ¬has-children(t)\neg\text{has-children}(t) then {
plabel-of(t)p \gets \text{label-of}(t)
return pp
} else {
(j,τ)label-of(t)(j, \tau) \gets \text{label-of}(t)
if xjτx_j \le \tau then {
return predict-with-prob(x,left-child-of(t))\text{predict-with-prob}(\vect{x}, \text{left-child-of}(t))
} else {
return predict-with-prob(x,right-child-of(t))\text{predict-with-prob}(\vect{x}, \text{right-child-of}(t))
}
}
}

Usually, ML software libraries/tools provide a way to access both y^\hat{y} and pp, which are produced out of a single execution.
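For instance, with scikit-learn's decision trees (a sketch; min_samples_leaf plays, roughly, the role of nminn\subtext{min}):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
m = DecisionTreeClassifier(min_samples_leaf=5).fit(X, y)
print(m.predict(X[:1]))        # y_hat, i.e., the argmax of p
print(m.predict_proba(X[:1]))  # p, the class frequencies at the reached leaf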

183 / 366

flearnf'\subtext{learn} with probability application example

1st call:
(j,τ)=(1,7)(j,\tau) = (1,7) [plot of candidate split points]

1st-l call:
(j,τ)=(1,2)(j,\tau) = (1,2) [plot of candidate split points]

1st-l-l call:
return [(1)]\treel{(\c{1}{● \smaller{1}})}

1st-l-r call:
(j,τ)=(1,4)(j,\tau) = (1,4) [plot of candidate split points]

1st-l-r-l call:
return [(1)]\treel{(\c{2}{● \smaller{1}})}

1st-l-r-r call:
return [(23,13)]\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}

return [(1,4);[(1)];[(23,13)]]\tree{(1,4)}{\treel{(\c{2}{● \smaller{1}})}}{\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}}
return [(1,2);[(1)];[(1,4);[(1)];[(23,13)]]]\tree{(1,2)}{\treel{(\c{1}{● \smaller{1}})}}{\tree{(1,4)}{\treel{(\c{2}{● \smaller{1}})}}{\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}}}

1st-r call:
return [(1)]\treel{(\c{3}{● \smaller{1}})}

return [(1,7);[(1,2);[(1)];[(1,4);[(1)];[(23,13)]]];[(1)]]\tree{(1,7)}{\tree{(1,2)}{\treel{(\c{1}{● \smaller{1}})}}{\tree{(1,4)}{\treel{(\c{2}{● \smaller{1}})}}{\treel{(\c{1}{● \smaller{\frac{2}{3}}}, \c{2}{● \smaller{\frac{1}{3}}})}}}}{\treel{(\c{3}{● \smaller{1}})}}

Assume:

  • X=R1=RX=\mathbb{R}^1=\mathbb{R}, Y={,,}Y=\{\c{1}{●},\c{2}{●},\c{3}{●}\}
  • nmin=3n\subtext{min}=3

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)
} else {
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
tnode-from(t \gets \text{node-from}(
(j,τ),(j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}),
learn({(x(i),y(i))}ixj(i)>τ,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min})
)
return tt
}
}

184 / 366

Let's use the learning technique

If we apply our flearnf'\subtext{learn} to the carousel dataset with nmin=1n\subtext{min}=1 we obtain:

Carousel data

xheightx\subtext{height} vs. 120120\lexagex\subtext{age} vs. 8.9548.954>>xagex\subtext{age} vs. 9.8879.887\le(1)(\c{1}{●\smaller{1}})>>xagex\subtext{age} vs. 9.0029.002\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})\le(1)(\c{2}{●\smaller{1}})>>xagex\subtext{age} vs. 9.499.49\le>>xagex\subtext{age} vs. 9.3069.306(1)(\c{1}{●\smaller{1}})\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})

Question: is this tree ok for you?

hint: recall the other way of assessing a model, w/o the behavior

185 / 366

Tree size

If we compare the tree (i.e., the model) against the attendant's reasoning (i.e., the real system), this tree appears too large!

We can do this, because:

  • trees are inherently inspectable
  • we know (actually, we have a rough idea about) how the real system works

The carousel

xheightx\subtext{height} vs. 120120\lexagex\subtext{age} vs. 8.9548.954>>xagex\subtext{age} vs. 9.8879.887\le(1)(\c{1}{●\smaller{1}})>>xagex\subtext{age} vs. 9.0029.002\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})\le(1)(\c{2}{●\smaller{1}})>>xagex\subtext{age} vs. 9.499.49\le>>xagex\subtext{age} vs. 9.3069.306(1)(\c{1}{●\smaller{1}})\le(1)(\c{1}{●\smaller{1}})>>(1)(\c{2}{●\smaller{1}})
186 / 366

Model complexity

The tree was large because:

  • nminn\subtext{min} was 11, i.e., flearnf'\subtext{learn} had no bounds while learning the tree
  • and, the dataset made flearnf'\subtext{learn} exploit the low value of nminn\subtext{min}
    • i.e., the dataset required a large tree to be modeled completely

In general, almost every kind of model can have different degrees of model complexity.

  • for trees, captured by the size of the tree

Moreover, almost every learning technique has at least one parameter affecting the maximum complexity of the learnable models, often called flexibility:

  • a sort of availability of complexity
  • for trees learned with recursive binary splitting, nminn\subtext{min}

Usually, to obtain a complex model, you should have:

  • a learning technique with great flexibility
  • a dataset requiring flexibility
flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}mmflexibility
187 / 366

This tree complexity: motivation

Why is our tree too complex?

Because of these two points! ●● \rightarrow

What are they?

  • maybe the attendant was distracted
  • maybe they were two "Portoghesi" (Italian idiom for people who sneak in without paying)
  • maybe they were the attendant's kids
    • i.e., the real system is stochastic and we observed a case where the least probable outcome happened
  • maybe the owner wrongly wrote down two observations

More in general: there's some noise in the data!

Carousel data

188 / 366

Fitting the noise?

x \to s \to y \to (+ \text{noise}) \to y'

In practice, we often don't have a noise-free dataset {(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}, but have instead a dataset {(x(i),y(i))}i\seq{(x^{(i)},y'^{(i)})}{i} with some noise, i.e., we have the yy' instead of the yy:

  • errors in data collection
  • ss being stochastic and having produced unlikely behaviors

However, our goal is to model ss, not s+s+ noise!

189 / 366

Overfitting

When we have a noisy dataset (potentially always) and we allow for large complexity, by setting a flexibility parameter to a high value, the learning technique fits the noisy data {(x(i),y(i))}i\seq{(x^{(i)},y'^{(i)})}{i} instead of fitting the real system ss, that is, overfitting occurs.

Snake and elephant from Il Piccolo Principe Image from "Il piccolo principe"

Overfits = "fits too much", hence making apparent also those artifacts that are not part of the object being wrapped

  • the model: the snake skin
  • the real system: the snake body
  • the (exaggerated) artifact: the elephant...
190 / 366

Underfitting

When instead we do not allow for enough complexity to model a complex real system, by setting a flexibility parameter to low flexibility, the learning technique fits neither the data nor the system, that is, underfitting occurs.

T-rex in a cardboard box

Underfits = "doesn't fit enough", hence proper characteristics of the object being wrapped are not captured

  • the model: the cardboard box
  • the real system: the T-rex
  • the uncaptured characteristics: everything of the T-rex...
191 / 366

Overfitting/underfitting with trees

In flearnf'\subtext{learn}, nminn\subtext{min} represents the flexibility:

  • the greater nminn\subtext{min}, the lower the flexibility

Extreme values:

  • nmin=1n\subtext{min}=1 \rightarrow maximum flexibility
    • the tree will always be as large as it has to be to perfectly¹ model the dataset
  • nmin=+n\subtext{min}=+\infty \rightarrow minimum, i.e., no flexibility
    • the tree will be the smallest possible
  1. Always perfectly? Give a counterexample.
192 / 366

Carousel data

function learn({(x(i),y(i))}i,nmin)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then {
pyFr ⁣(y,{y(i)}i)p \gets y \mapsto \freq{y, \seq{y^{(i)}}{i}}
return node-from(p,,)\text{node-from}(p,\varnothing,\varnothing)
} else {
...
}
}

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
...
return false\text{false}
}

The learned tree is a dummy classifier (with probability):

(59103,44103)(\c{1}{\text{●}\smaller{\frac{59}{103}}}, \c{2}{\text{●}\smaller{\frac{44}{103}}})

t=[(59103,44103)]t=\treel{(\c{1}{\text{●}\smaller{\frac{59}{103}}}, \c{2}{\text{●}\smaller{\frac{44}{103}}})}

tt does not attempt to model the dependency between xx and yy, because its complexity budget is completely exhausted by the single leaf node
193 / 366

Bias and variance

As an alternative name for underfitting, we say that a learning technique exhibits high bias:

  • because it tends to generate models that incorporate a bias towards some yy values, regardless of the xx, i.e., models that fail in capturing the xx-yy dependency
    • as extreme case, the dummy classifier completely disregards the xx
194 / 366

Bias and variance

As an alternative name for underfitting, we say that a learning technique exhibits high bias:

  • because it tends to generate models that incorporate a bias towards some yy values, regardless of the xx, i.e., models that fail in capturing the xx-yy dependency
    • as extreme case, the dummy classifier completely disregards the xx

As an alternative name for overfitting, we say that a learning technique exhibits high variance:

  • because, if we repeat the learning with different datasets coming from the same real system, we obtain different models; this is bad, because they should be the same, since they model the same system
194 / 366

Spotting underfitting/overfitting

In principle:

  1. observe the model
  2. observe the system
  3. compare their complexity:
    • if the model is too simple with respect to the system, that's underfitting
    • if the model is too complex with respect to the system, that's overfitting
195 / 366

Spotting underfitting/overfitting

In principle:

  1. observe the model
  2. observe the system
  3. compare their complexity:
    • if the model is too simple with respect to the system, that's underfitting
    • if the model is too complex with respect to the system, that's overfitting

In practice, this is often (i.e., almost always) unfeasible:

  • you don't know the system complexity
  • you cannot observe the system internals (or the system itself)
  • sometimes, you cannot observe the model internals
195 / 366

Spotting underfitting/overfitting with data

With too low flexibility (here with error):

  • the model cannot capture system characteristics that are also in the learning data
    • \Rightarrow both errors are high
  • increasing the flexibility decreases both errors

With too large flexibility:

  • the model captures also data artifacts (i.e., noise)
    • \Rightarrow learning error is low because noise is modeled and used to assess the model itself
    • \Rightarrow test error is large because the model describes characteristics that are not proper to the real system and hence not visible in data different from the learning data
  • increasing the flexibility decreases the learning error and increases the test error

Here, overfitting starts with flexibility 0.62\ge 0.62

  • not a real parameter...

Learning and test error vs. flexibility

Practical procedure (sketched in R below):

  1. consider several values of the flexibility parameter
  2. for each value of the flexibility parameter
    1. learn a model
    2. measure¹ its effectiveness² on the learning data
    3. measure¹ its effectiveness² on the test data
  1. with 80/20 static split, CV, ...
  2. with error, accuracy, AUC, ...
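
A minimal R sketch of this procedure (assumptions: the rpart package, where minsplit plays roughly the role of n\subtext{min} and cp=0 disables other pruning; iris data and an 80/20 static split, just for illustration):

library(rpart)

set.seed(1)
d <- iris[sample(nrow(iris)), ]           # shuffle the dataset
i <- 1:round(0.8 * nrow(d))               # 80/20 static split
err <- function(t, d) mean(predict(t, d, type = "class") != d$Species)
for (n.min in c(2, 5, 10, 25, 50)) {      # candidate flexibility values
  t <- rpart(Species ~ ., data = d[i, ],
             control = rpart.control(minsplit = n.min, cp = 0))
  cat(sprintf("n_min=%2d learning err=%.3f test err=%.3f\n",
              n.min, err(t, d[i, ]), err(t, d[-i, ])))
}

Plotting the two errors against the flexibility gives curves like those in the figure: the test error typically stops decreasing, and starts increasing, where overfitting begins.
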
196 / 366

How to choose the proper flexibility?

More in general, how to choose a good value for one or more parameters of the learning technique?

Assumption: "good" means "the one that corresponds to the greatest effectiveness".

From another point of view, we have kk slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:

  • that is, we do a comparison among learning techniques
197 / 366

How to choose the proper flexibility?

More in general, how to choose a good value for one or more parameters of the learning technique?

Assumption: "good" means "the one that corresponds to the greatest effectiveness".

From another point of view, we have kk slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:

  • that is, we do a comparison among learning techniques

In practice:

  • choose the kk candidate parameter values (e.g., nmin=1,2,3,,10n\subtext{min}=1,2,3,\dots,10)
  • choose a suitable effectiveness index (e.g., AUC, accuracy, ...)
  • choose a suitable learning/test division method (e.g., 10-fold CV)
  • for each of the kk values, measure the index, take the one corresponding to the best value
197 / 366

How to choose the proper flexibility?

More in general, how to choose a good value for one or more parameters of the learning technique?

Assumption: "good" means "the one that corresponds to the greatest effectiveness".

From another point of view, we have kk slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:

  • that is, we do a comparison among learning techniques

In practice:

  • choose the kk candidate parameter values (e.g., nmin=1,2,3,,10n\subtext{min}=1,2,3,\dots,10)
  • choose a suitable effectiveness index (e.g., AUC, accuracy, ...)
  • choose a suitable learning/test division method (e.g., 10-fold CV)
  • for each of the kk values, measure the index, take the one corresponding to the best value

This procedure applies to parameters in general, not just to those affecting flexibility;

  • and possibly to indexes related to efficiency, rather than just effectiveness
197 / 366

Hyperparameter tuning

Given a learning technique with hh parameters p1,,php_1,\dots,p_h, each pjp_j defined in its domain PjP_j, hyperparameter tuning is the task of finding the tuple p1,,php^\star_1,\dots,p^\star_h that corresponds to the best effectiveness of the learning technique.

[Diagram: f'\subtext{learn} takes the dataset \seq{(x^{(i)},y^{(i)})}{i} and the hyperparameters p_1,\dots,p_h, and outputs the model m]
198 / 366

Hyperparameter tuning

Given a learning technique with hh parameters p1,,php_1,\dots,p_h, each pjp_j defined in its domain PjP_j, hyperparameter tuning is the task of finding the tuple p1,,php^\star_1,\dots,p^\star_h that corresponds to the best effectiveness of the learning technique.

[Diagram: f'\subtext{learn} takes the dataset \seq{(x^{(i)},y^{(i)})}{i} and the hyperparameters p_1,\dots,p_h, and outputs the model m]

p_1,\dots,p_h are called hyperparameters, rather than just parameters, because in some communities and for some learning techniques the model itself is defined by one or more (often numerical) parameters;

  • this distinction does not fit the case of trees well

It's called tuning because we slightly change the hyperparameter values until we are happy with the results.

198 / 366

Hyperparameter tuning

Given a learning technique with hh parameters p1,,php_1,\dots,p_h, each pjp_j defined in its domain PjP_j, hyperparameter tuning is the task of finding the tuple p1,,php^\star_1,\dots,p^\star_h that corresponds to the best effectiveness of the learning technique.

[Diagram: f'\subtext{learn} takes the dataset \seq{(x^{(i)},y^{(i)})}{i} and the hyperparameters p_1,\dots,p_h, and outputs the model m]

p_1,\dots,p_h are called hyperparameters, rather than just parameters, because in some communities and for some learning techniques the model itself is defined by one or more (often numerical) parameters;

  • this distinction does not fit the case of trees well

It's called tuning because we slightly change the hyperparameter values until we are happy with the results.

Hyperparameter tuning is a form of optimization, since we are searching the space P_1 \times \dots \times P_h for the tuple giving the best, i.e., \approx optimal, effectiveness:

  • since it automates part of the design of an ML system, hyperparameter tuning may be considered a simple form of AutoML
198 / 366

A simple form of hyperparameter tuning (see the R sketch after the remarks):

  1. for each jj-th parameter, choose a small set of PjPjP'_j \subseteq P_j values
  2. choose a suitable effectiveness index
  3. choose a suitable learning/test division method
  4. consider all the tuples resulting from the cartesian product P1××PhP'_1 \times \dots \times P'_h (i.e., the grid)
  5. take the best hyperparameters p1,,php^\star_1,\dots,p^\star_h such that: (p1,,ph)=arg max(p1,,ph)P1××Phflearn-effect(flearn(,p1,,ph),fpredict,D)(p^\star_1,\dots,p^\star_h)=\argmax_{(p_1,\dots,p_h) \in P'_1 \times \dots \times P'_h} \c{1}{f\subtext{learn-effect}}(\c{2}{f'\subtext{learn}(\cdot,p_1,\dots,p_h),f'\subtext{predict}},D)

Remarks:

  • flearn-effectf\subtext{learn-effect} is the chosen assessment method measuring the chosen (step 2) effectiveness index with the chosen (step 3) learning/test division: it takes a learning technique and a dataset DD
    • flearn(,p1,,ph),fpredictf'\subtext{learn}(\cdot,p^\star_1,\dots,p^\star_h),f'\subtext{predict} is the learning technique; flearn(,p1,,ph)f'\subtext{learn}(\c{3}{\cdot},\c{4}{p_1,\dots,p_h}) is the learning function with fixed hyperparameters p1,,php_1,\dots,p_h and variable dataset \cdot
  • to be feasible, P1××PhP'_1 \times \dots \times P'_h must be small!
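
A hedged R sketch of this grid search (assumptions: rpart as learning technique, whose split parameter offers only gini and information, the latter being \approx cross-entropy; accuracy as effectiveness index; 10-fold CV as learning/test division; all names are illustrative):

library(rpart)

set.seed(1)
cv.accuracy <- function(d, n.min, split.name, k = 10) {   # f_learn-effect
  fold <- sample(rep(1:k, length.out = nrow(d)))
  mean(sapply(1:k, function(j) {
    t <- rpart(Species ~ ., data = d[fold != j, ],
               parms = list(split = split.name),
               control = rpart.control(minsplit = n.min, cp = 0))
    mean(predict(t, d[fold == j, ], type = "class") == d$Species[fold == j])
  }))
}

grid <- expand.grid(n.min = c(1, 2, 5, 10, 25),           # P'_1 x P'_2 (the grid)
                    split.name = c("gini", "information"),
                    stringsAsFactors = FALSE)
grid$acc <- mapply(cv.accuracy, n.min = grid$n.min,
                   split.name = grid$split.name, MoreArgs = list(d = iris))
grid[which.max(grid$acc), ]                               # (p*_1, p*_2)
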
199 / 366

Grid search with the trees

Consider the flearnf'\subtext{learn} for trees and these two hyperparameters:

  • nmin=p1N=P1n\subtext{min} = p_1 \in \mathbb{N} = P_1
  • pimpurity=p2{error,Gini,cross-entropy}p\subtext{impurity} = p_2 \in \{\text{error}, \text{Gini}, \text{cross-entropy}\}

Let's do hyperparameter tuning with grid search (assuming D=n=1000|D|=n=1000):

200 / 366

Grid search with the trees

Consider the flearnf'\subtext{learn} for trees and these two hyperparameters:

  • nmin=p1N=P1n\subtext{min} = p_1 \in \mathbb{N} = P_1
  • pimpurity=p2{error,Gini,cross-entropy}p\subtext{impurity} = p_2 \in \{\text{error}, \text{Gini}, \text{cross-entropy}\}

Let's do hyperparameter tuning with grid search (assuming D=n=1000|D|=n=1000):

  1. P1={1,2,5,10,25}P'_1=\{1,2,5,10,25\}¹ and P2=P2P'_2=P_2
  2. AUC (with midpoints)
  3. 10-fold CV
  4. grid size of 5×3=155 \times 3 = 15
  5. ...
  1. for each jj-th parameter, choose a small set of PjPjP'_j \subseteq P_j values
  2. choose a suitable effectiveness index
  3. choose a suitable learning/test division method
  4. consider all the tuples resulting from the cartesian product P1××PhP'_1 \times \dots \times P'_h (i.e., the grid)
  5. take the best hyperparameters p1,,php^\star_1,\dots,p^\star_h

Questions

  • how many times is flearnf'\subtext{learn} invoked? without considering recursive invocations
  • how many times is fpredictf''\subtext{predict} invoked?
  1. must be chosen considering the size nn of the dataset
200 / 366

Hyperparameter-free learning

Can't we just always do grid search for hyperparameter tuning?

Pros:

  • no need to manually choose the values of the parameters
  • hopefully chosen parameters are better than "default" values (if any) \rightarrow better effectiveness

Cons:

  • computationally expensive (\propto grid size) \rightarrow worse efficiency
  • depends on a dataset, must be checked for generalization ability
  • suitable "ranges" of values for each hyperparameter have still to be set manually
    • but default ranges are often ok
201 / 366

Hyperparameter-free learning

Can't we just always do grid search for hyperparameter tuning?

Pros:

  • no need to manually choose the values of the parameters
  • hopefully chosen parameters are better than "default" values (if any) \rightarrow better effectiveness

Cons:

  • computationally expensive (\propto grid size) \rightarrow worse efficiency
  • depends on a dataset, must be checked for generalization ability
  • suitable "ranges" of values for each hyperparameter have still to be set manually
    • but default ranges are often ok

If you do it, you can transform any learning tech. w/ params into a learning tech. w/o params:

[Diagrams: f'\subtext{learn} with the dataset and p_1,\dots,p_h giving m; the same f'\subtext{learn} wrapped in grid search, which takes the dataset and \seq{P'_j}{j}, finds \seq{p^\star_j}{j}, and outputs m]
201 / 366

Hyperparameter-free learning

[Diagram: grid search wraps f'\subtext{learn}: from the dataset and \seq{P'_j}{j} it finds \seq{p^\star_j}{j}, then f'\subtext{learn} gives the final m]

function learn-free(D)\text{learn-free}(D) {
flearn,fpredictf'\subtext{learn}, f'\subtext{predict} \gets \dots
P1,,PhP'_1,\dots,P'_h \gets \dots
flearn-effectf\subtext{learn-effect} \gets \dots
p1,,php^\star_1,\dots,p^\star_h \gets \varnothing
vmax,effectv_{\text{max},\text{effect}} \gets -\infty
foreach p1,,phP1××Php_1,\dots,p_h \in P'_1\times \dots\times P'_h {
veffectflearn-effect(flearn(,p1,,ph),fpredict,D)v\subtext{effect} \gets f\subtext{learn-effect}(f'\subtext{learn}(\cdot,p_1,\dots,p_h),f'\subtext{predict},D)
if veffectvmax,effectv\subtext{effect} \ge v_{\text{max},\text{effect}} then {
vmax,effectveffectv_{\text{max},\text{effect}} \gets v\subtext{effect}
p1,,php1,,php^\star_1,\dots,p^\star_h \gets p_1,\dots,p_h
}
}
return flearn(D,p1,,ph)f'\subtext{learn}(D,p^\star_1,\dots,p^\star_h)
}

  1. for each jj-th parameter, choose a small set of PjPjP'_j \subseteq P_j values
  2. choose a suitable effectiveness index
  3. choose a suitable learning/test division method
  4. consider all the tuples resulting from the cartesian product P1××PhP'_1 \times \dots \times P'_h
  5. take the best hyperparameters p1,,php^\star_1,\dots,p^\star_h
    • i.e., arg max\argmax
  6. learn a model on the full dataset with the best parameters found
202 / 366

Hyperparameter-free tree learning exercise

Consider the flearnf'\subtext{learn} for trees and these two hyperparameters:

  • nmin=p1N=P1n\subtext{min} = p_1 \in \mathbb{N} = P_1
  • pimpurity=p2{error,Gini,cross-entropy}p\subtext{impurity} = p_2 \in \{\text{error}, \text{Gini}, \text{cross-entropy}\}

Consider the improved, hyperparameter-free version of flearnf'\subtext{learn} called flearn-freef'\subtext{learn-free}:

  • with accuracy and 10-fold CV
  • with P1=10|P'_1|=10 and P2=P2=3|P'_2|=|P_2|=3

Suppose you want to compare it against the plain version (with nmin=10n\subtext{min}=10 and pimpurity=Ginip\subtext{impurity}=\text{Gini}):

  • with AUC (midpoints) and 10-fold CV
  • using a dataset D=n=1000|D|=n=1000.

Questions

  • what phases of the ML design process are we doing?
  • how many times is flearn-freef'\subtext{learn-free} invoked?
  • how many times is flearnf'\subtext{learn} invoked? without considering recursive invocations
  • how many times is fpredictf''\subtext{predict} invoked? assuming fpredictf''\subtext{predict} is invoked internally by fpredictf'\subtext{predict}
  • how many times is fpredictf'\subtext{predict} invoked?
203 / 366

Categorical independent variables and regression

204 / 366

Applicability of flearnf'\subtext{learn}

Up to now, the flearnf'\subtext{learn} for trees (i.e., recursive binary splitting) was defined¹ as: flearn:P(X1××Xp×Y)T({1,,p}×R)Yf'\subtext{learn}: \mathcal{P}^*(X_1 \times \dots \times X_p \times Y) \to T_{(\{1,\dots,p\}\times \mathbb{R}) \cup Y} with:

  • each XjRX_j \subseteq \mathbb{R}, i.e., with each independent variable being numerical
  • YY finite and without ordering, i.e., with the dependent variable being categorical

These constraints were needed because:

  • the branch nodes contain conditions in the form xjτx_j \le \tau, hence an order relation has to be defined in XjX_j; R\mathbb{R} meets this requirement
  • the leaf nodes contain a class label yy

Can we remove these constraints?

  1. here we have the version without probability; with the version with probability, the codomain of f'\subtext{learn} is T_{(\{1,\dots,p\}\times \mathbb{R}) \cup P_Y}
205 / 366

Trees on categorical independent variables

With numerical variables (xjRx_j \in \mathbb{R}):

With find-best-branch(), we find (the index j of) a variable x_j and a threshold value \tau that well separate the data, i.e., we split the data in:

  • observations such that xjτx_j \le \tau
  • observations such that xj>τx_j > \tau

No other cases exist: it's a binary split.

Example

xage[0,120]x\subtext{age} \in [0,120]

[Tree diagram: branch node x\subtext{age} vs. 10 with \le and > edges leading to further branch nodes x_{\dots} vs. \dots]

With categorical variables (xjXjx_j \in X_j):

With find-best-branch(), we find (the index j of) a variable x_j and a set of values X'_j \subset X_j that well separate the data, i.e., we split the data in:

  • observations such that xjXjx_j \in X'_j
  • observations such that xj∉Xjx_j \not\in X'_j

No other cases exist: it's a binary split.

Example

xcity{Ts,Ud,Ve,Pn,Go}x\subtext{city} \in \{\text{Ts},\text{Ud},\text{Ve},\text{Pn},\text{Go}\}

[Tree diagram: branch node x\subtext{city} vs. \{Ts, Ve\} with \in and \not\in edges leading to further branch nodes x_{\dots} vs. \dots]
206 / 366

Efficiency with categorical variables

For a given numerical variable xjRx_j \in \mathbb{R}, we choose τ\tau^\star such that: τ=arg minτR(fimpurity({y(i)}ixj(i)τ)+fimpurity({y(i)}ixj(i)>τ))\tau^\star = \argmin_{\c{1}{\tau \in \mathbb{R}}} \left(f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})+f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau})\right) In practice, we search the set of midpoints rather than the entire R\mathbb{R}: there are n1n-1 midpoints in a dataset with nn elements.

Even better, we can consider only the midpoints between consecutive values x_j^{(i_1)}, x_j^{(i_2)} for which the labels are different, i.e., y^{(i_1)} \ne y^{(i_2)}

For a given categorical variable xjXjx_j \in X_j, we choose XjXjX^\star_j \subset X_j such that: Xj=arg minXjP(Xj)(fimpurity({y(i)}ixj(i)Xj)+fimpurity({y(i)}ixj(i)∉Xj))X^\star_j = \argmin_{\c{1}{X'_j \in \mathcal{P}(X_j)}} \left(f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \in X'_j})+f\subtext{impurity}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \not\in X'_j})\right) We search the set P(Xj)\mathcal{P}(X_j) of subsets (i.e., the powerset) of XjX_j, which has 2Xj2^{|X_j|} values.
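
A sketch of the powerset search for one categorical variable, with Gini as f\subtext{impurity} (the subsets are enumerated through the bits of an integer; it assumes every level of the factor occurs in the data; names are illustrative):

gini <- function(y) { f <- table(y) / length(y); sum(f * (1 - f)) }

best.subset <- function(x.j, y) {          # x.j: a factor; y: the labels
  ls <- levels(x.j)
  best <- list(subset = NULL, impurity = Inf)
  for (b in 1:(2^length(ls) - 2)) {        # skip the empty and the full subset
    s <- ls[as.logical(bitwAnd(b, 2^(seq_along(ls) - 1)))]
    v <- gini(y[x.j %in% s]) + gini(y[!(x.j %in% s)])
    if (v < best$impurity) best <- list(subset = s, impurity = v)
  }
  best
}

The loop body runs 2^{|X_j|}-2 times: feasible for variables with few levels, quickly unfeasible as |X_j| grows.
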

207 / 366

Trees with both kinds of variables

Assume a problem with X=X1××Xpnum×Xpnum+1××Xpnum+pcatX = \c{1}{X_1 \times \dots \times X_{p\subtext{num}}} \times \c{2}{X_{p\subtext{num}+1} \times \dots \times X_{p\subtext{num}+p\subtext{cat}}}, i.e.:

  • pnump\subtext{num} numerical variables
  • pcatp\subtext{cat} categorical variables

The labels of the tree nodes can be:

  • class labels yYy \in \c{3}{Y} or discrete probability distribution pPYp \in \c{3}{P_Y} (terminal nodes)
  • branch conditions {1,,pnum}×R\c{1}{\{1,\dots,p\subtext{num}\} \times \mathbb{R}} for numerical variables (non-terminal nodes)
  • branch conditions j=pnum+1j=pnum+pcat{j}×P(Xj)\c{2}{\bigcup_{j=p\subtext{num}+1}^{j=p\subtext{num}+p\subtext{cat}} \{j\} \times \mathcal{P}(X_j)} for categorical variables (non-terminal nodes)
    • i.e., each variable with its corresponding powerset of possible values

So the model is a tt \in:

  • T{1,,pnum}×R    j=pnum+1j=pnum+pcat{j}×P(Xj)    YT_{\c{1}{\{1,\dots,p\subtext{num}\} \times \mathbb{R}} \; \cup \; \c{2}{\bigcup_{j=p\subtext{num}+1}^{j=p\subtext{num}+p\subtext{cat}} \{j\} \times \mathcal{P}(X_j)} \; \cup \; \c{3}{Y}}, without probability
  • or T{1,,pnum}×R    j=pnum+1j=pnum+pcat{j}×P(Xj)    PYT_{\c{1}{\{1,\dots,p\subtext{num}\} \times \mathbb{R}} \; \cup \; \c{2}{\bigcup_{j=p\subtext{num}+1}^{j=p\subtext{num}+p\subtext{cat}} \{j\} \times \mathcal{P}(X_j)} \; \cup \; \c{3}{P_Y}}, with probability
208 / 366

Regression trees

Recursive binary splitting may be used for regression: the learned trees are called regression trees.

Required changes:

  • in flearnf'\subtext{learn}, when should-stop()\text{should-stop}() is met, "most frequent class label" does not make sense anymore
    • because we have numbers, not classes
  • in find-best-branch()\text{find-best-branch}(), minimizing the error()\text{error}() does not make sense anymore (same for gini()\text{gini}() and cross-entropy()\text{cross-entropy}())
    • because these indexes are for categorical values, not numbers
  • in should-stop()\text{should-stop}(), checking if error()=0\text{error}()=0 does not make sense anymore
    • because (classification) error is for categorical values, not numbers
209 / 366

Terminal node labels

In flearnf'\subtext{learn}, when should-stop()\text{should-stop}() is met, "most frequent class label" does not make sense anymore.

Solution: use the mean y\overline{y}.

Classification

The terminal node label is the most frequent class: y=arg maxyYFr ⁣(y,{y(i)}i)y^\star=\argmax_{y \in Y} \freq{y,\seq{y^{(i)}}{i}}

If you have to choose just one yy, yy^\star is the one that minimizes the classification error.

Regression

The terminal node label is the mean yy value: y=1niy(i)=yy^\star=\frac{1}{n} \sum_i y^{(i)}=\overline{y}

If you have to choose just one yy, yy^\star is the one that minimizes the MSE.

210 / 366

Terminal node labels

In flearnf'\subtext{learn}, when should-stop()\text{should-stop}() is met, "most frequent class label" does not make sense anymore.

Solution: use the mean y\overline{y}.

Classification

The terminal node label is the most frequent class: y=arg maxyYFr ⁣(y,{y(i)}i)y^\star=\argmax_{y \in Y} \freq{y,\seq{y^{(i)}}{i}}

If you have to choose just one yy, yy^\star is the one that minimizes the classification error.

Regression

The terminal node label is the mean yy value: y=1niy(i)=yy^\star=\frac{1}{n} \sum_i y^{(i)}=\overline{y}

If you have to choose just one yy, yy^\star is the one that minimizes the MSE.

Indeed, a dummy regressor that always predicts the mean value \overline{y} should be considered a baseline for regression, just like the dummy classifier is a baseline for classification (see the R lines below):

  • if you want to do a prediction without using the xx, then y\overline{y} is the best you can do (on the learning dataset)
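
The two labels, and the corresponding dummy baselines, in a few lines of base R (iris, just for illustration):

y.cls <- iris$Species
names(which.max(table(y.cls)))    # most frequent class: the dummy classifier label
                                  # (ties broken by order: iris is 50/50/50)
y.reg <- iris$Sepal.Length
mean(y.reg)                       # mean: the dummy regressor label
mean((y.reg - mean(y.reg))^2)     # its MSE on the learning data: the baseline
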
210 / 366

Finding the best branch

In find-best-branch()\text{find-best-branch}(), minimizing the error()\text{error}() does not make sense anymore (same for gini()\text{gini}() and cross-entropy()\text{cross-entropy}()).

Solution: use the residual sum of squares (RSS).

Classification

The branch is chosen for which the sum of the impurity on the two sides is the lowest: (j,τ)arg minj,τ(error({y(i)}ixj(i)τ)+error({y(i)}ixj(i)>τ))\c{1}{\begin{align*} (j^\star, \tau^\star) \gets \argmin_{j,\tau} ( &\text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})+\\ & \text{error}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau}))\end{align*}} similarly, for categorical variables

Regression

The branch is chosen for which the sum of the RSS on the two sides is the lowest: (j,τ)arg minj,τ(RSS({y(i)}ixj(i)τ)+RSS({y(i)}ixj(i)>τ))\c{1}{\begin{align*} (j^\star, \tau^\star) \gets \argmin_{j,\tau} ( &\text{RSS}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j \le \tau})+\\ & \text{RSS}(\seq{y^{(i)}}{i}\big\rvert_{x^{(i)}_j > \tau}))\end{align*}} where: RSS({y(i)}i)=i(y(i)y)2\text{RSS}(\seq{y^{(i)}}{i}) = \sum_i \left(y^{(i)}-\overline{y}\right)^2

similarly, for categorical variables; RSS()=nMSE()\text{RSS}(\cdot) = n \text{MSE}(\cdot)
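
A sketch of the regression version of find-best-branch() for one numerical variable (illustrative names: the midpoints are searched and the sum of the two RSS terms is minimized):

rss <- function(y) sum((y - mean(y))^2)

best.split <- function(x.j, y) {
  s <- sort(unique(x.j))
  taus <- (s[-1] + s[-length(s)]) / 2      # the (at most n-1) midpoints
  vs <- sapply(taus, function(tau) rss(y[x.j <= tau]) + rss(y[x.j > tau]))
  list(tau = taus[which.min(vs)], rss = min(vs))
}

best.split(iris$Petal.Length, iris$Sepal.Length)   # e.g., on two iris variables
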

211 / 366

Stopping criterion

In should-stop()\text{should-stop}(), checking if error()=0\text{error}()=0 does not make sense anymore.

Solution: just use RSS.

Classification

Stop if nnminn\le n\subtext{min} or error()=0\text{error}()=0.

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
if error({y(i)}i)=0\text{error}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

Regression

Stop if nnminn\le n\subtext{min} or RSS()=0\text{RSS}()=0.

function should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) {
if nnminn \le n\subtext{min} then {
return true\text{true};
}
if RSS({y(i)}i)=0\text{RSS}(\seq{y^{(i)}}{i})=0 then {
return true\text{true};
}
return false\text{false}
}

In practice, the condition \text{RSS}()=0 holds far less frequently than the condition \text{error}()=0.

212 / 366

Visualizing the model

With few variables, p2p\le 2 for classification, p=1p=1 for regression, the model can be visualized.

Classification

Classifier on carousel

The colored regions are the model. The border(s) between regions with different colors (i.e., different decisions) is the decision boundary.

Regression

Regressor example

The line is the model.

Question: can you draw the tree for this model?

213 / 366

Overfitting with regression trees

Example of regression trees with different complexities

image from Fabio Daolio

Questions

  • what's the problem size (nn and pp)?
  • what's the model complexity?
  • how is the real system made?
214 / 366

Tree learning: brief recap

215 / 366

Summary

Applicability 👍👍

  • 👍 YY: both regression and classification (binary and multiclass)
  • 👍 XX: multivariate XX with both numerical and categorical variables
  • 👍 models give probability¹
  • 🫳³ learning technique has one single parameter

Efficiency 👍

  • 👍 in practice, pretty fast both in learning and prediction phase

Explainability/interpretability 👍👍👍

  • 👍 the models can be easily² visualized (global explainability)
  • 👍 the decisions can be analyzed (local explainability)
  • 👍 the learning technique is itself comprehensible
    • you should be able to implement it by yourself
  1. for classification; if nmin=1n\subtext{min}=1, it's always 100%100\%
  2. if they are small enough...
  3. 1 is better than >1>1, but worse than parameter-free, so 🫳
216 / 366

Summary

Applicability 👍👍

  • 👍 YY: both regression and classification (binary and multiclass)
  • 👍 XX: multivariate XX with both numerical and categorical variables
  • 👍 models give probability¹
  • 🫳³ learning technique has one single parameter

Efficiency 👍

  • 👍 in practice, pretty fast both in learning and prediction phase

Explainability/interpretability 👍👍👍

  • 👍 the models can be easily² visualized (global explainability)
  • 👍 the decisions can be analyzed (local explainability)
  • 👍 the learning technique is itself comprehensible
    • you should be able to implement it by yourself
  1. for classification; if nmin=1n\subtext{min}=1, it's always 100%100\%
  2. if they are small enough...
  3. 1 is better than >1>1, but worse than parameter-free, so 🫳

So, why are we not using trees for/in every ML system?

216 / 366

Decision tree effectiveness

Example of regression trees with different complexities image from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013

The effectiveness depends on the problem and may be limited by the fact that branch nodes consider one variable at a time.

The decision boundary of the model is hence constrained to be locally parallel to one of the axes:

  • may be a limitation or not, depending on the problem
  • makes find-best-branch()\text{find-best-branch()} computationally feasible
    • because the search space is small
    • because computing the error of the dummy classifier is fast (greedy)

There exist oblique decision trees, which should overcome this limitation.

217 / 366

Towards the first lab

Software for ML

218 / 366

Implementing ML systems

  1. Decide: should I use ML?
  2. Decide: supervised vs. unsupervised
  3. Define the problem (problem statement):
    • define XX and YY
    • define a way for assessing solutions
      • before designing!
      • applicable to any compatible ML solution
  4. Design the ML system
    • choose a learning technique
    • choose/design pre- and post-processing steps
  5. Implement the ML system
    • learning/prediction phases
    • obtain the data
  6. Assess the ML system

Actual execution of:

  • pre-processing
  • learning
  • prediction
  • assessment

is not made by hand, but by a computer that executes some software.

219 / 366

Software for ML

Nowadays, there are many options.

A few:

  • libraries for general purpose languages:
  • specialized software environments:
  • a software written from scratch

And many others.

How to choose an ML software?

Possible criteria:

  • platform constraints
  • degree of data pre/post-processing
  • production/prototype
  • documentation availability
  • community size
  • your previous familiarity/knowledge/skills
220 / 366

Interface

In general, the ML software provides an interface that models the key concepts of learning (flearnf'\subtext{learn}) and prediction (fpredictf'\subtext{predict}) phases and the one of the model.

Example (Java+SMILE):

DataFrame dataFrame = ...
RandomForest classifier = RandomForest.fit(Formula.lhs("label"), dataFrame);
Tuple observation = ...;
int predictedLabel = classifier.predict(observation);

Example (R):

library(randomForest)
d = ...
classifier = randomForest(label~., d)
newD = ...
newLabels = predict(classifier, newD)
221 / 366

A (very) brief Introduction to R

222 / 366

What is R?

R is:

  • a programming language
  • a software environment with a text-based interactive UI (a console)

RStudio is:

  • an IDE¹ built around R
  • also for making notebooks, like in Python
  1. integrated development environment

Some R resources:

  • language documentation
  • packages documentation
    • for all: Comprehensive R Archive Network (CRAN)
    • for "biggest" packages: their own site
  • help from online communities
223 / 366

RStudio appearance

RStudio appearance

224 / 366

RStudio appearance with a notebook

RStudio appearance with a notebook

225 / 366

An R notebook on Google Colab

Colab appearance with a notebook

226 / 366

Data types

There are some built-in data types.

Basic:

  • numeric
  • character (i.e., strings)
  • logical (i.e., Booleans)
  • factor (i.e., categorical)
  • function
  • formula

Composed:

  • vector
  • matrix
  • data frame
  • list

R is not strongly typed: there are (some) implicit conversions.

227 / 366

Data types

There are some built-in data types.

Basic:

  • numeric
  • character (i.e., strings)
  • logical (i.e., Booleans)
  • factor (i.e., categorical)
  • function
  • formula

Composed:

  • vector
  • matrix
  • data frame
  • list

R is not strongly typed: there are (some) implicit conversions.

A peculiar data type is formula:

  • it describes a dependency
  • literals specify dependent and independent variables, e.g.:
    • decision~age+height
    • Species~. (. means "every other variable")
227 / 366

Assigning values

> a=3
> a
[1] 3
> v=c(1,2,3)
> v
[1] 1 2 3
> d=as.data.frame(cbind(age=c(20,21,21)))
> d$gender=factor(c("m","m","f"))
> d
age gender
1 20 m
2 21 m
3 21 f
> levels(d$gender)
[1] "f" "m"
> dep=salary~degree.level+age
> dep
salary ~ degree.level + age
> f = function(x) {x+3}
> f(2)
[1] 5
  • a is a numeric
  • v is a vector of numeric
  • d is a data frame
  • dep is a formula
  • f is a function
  • cbind() stands for column bind (there's an rbind() too)
  • factor() makes a vector of character a vector of factors
  • levels() gives the possible values of a factor, i.e.:
    • d$gender is {x2(i)}i\seq{x_2^{(i)}}{i}
    • levels(d$gender) is X2X_2
228 / 366

Reading/writing data

There are many packages for reading weird file types.

Some built-in functions for reading/writing CSV files (and variants):

  • read.csv(), read.csv2(), read.table()
  • write.csv(), write.csv2(), write.table()

Some built-in functions for reading/writing data in an R-native format:

  • save()
  • load()
229 / 366

Basic exploration of data

With summary() (built-in)

> d=iris
> summary(d)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500

With skim() from skimr package

> skim(d)
── Data Summary ────────────────────────
Values
Name d
Number of rows 150
Number of columns 5
_______________________
Column type frequency:
factor 1
numeric 4
________________________
Group variables None
── Variable type: factor ────────────────────────────
skim_variable n_missing complete_rate ordered
1 Species 0 1 FALSE
n_unique top_counts
1 3 set: 50, ver: 50, vir: 50
── Variable type: numeric ───────────────────────────
skim_variable n_missing complete_rate mean sd
1 Sepal.Length 0 1 5.84 0.828
2 Sepal.Width 0 1 3.06 0.436
3 Petal.Length 0 1 3.76 1.77
4 Petal.Width 0 1 1.20 0.762
p0 p25 p50 p75 p100 hist
1 4.3 5.1 5.8 6.4 7.9 ▆▇▇▅▂
2 2 2.8 3 3.3 4.4 ▁▆▇▂▁
3 1 1.6 4.35 5.1 6.9 ▇▁▆▇▂
4 0.1 0.3 1.3 1.8 2.5 ▇▁▇▅▃

Sizes with length(), dim(), nrow(), ncol(); names with names() (same as colnames()), rownames()

  • names change with names(d)[2:3] = c("cows", "dogs")

Here d is a multivariate dataset, but which variable is yy is not specified.

230 / 366

Selecting portions of data

On vectors:

> v=seq(1,2,by=0.25)
> v
[1] 1.00 1.25 1.50 1.75 2.00
> v[2]
[1] 1.25
> v[2:3]
[1] 1.25 1.50
> v[-2]
[1] 1.00 1.50 1.75 2.00
> v[c(1,2,4)]
[1] 1.00 1.25 1.75
> v[c(T,F,F,T)]
[1] 1.00 1.75 2.00
> v[v<1.6]
[1] 1.00 1.25 1.50
> v[which(v<1.6)]
[1] 1.00 1.25 1.50

On data frames:

> d
age gender
1 20 m
2 21 m
3 21 f
> d[1,2]
[1] m
Levels: f m
> d[,2]
[1] m m f
Levels: f m
> d[1,]
age gender
1 20 m
> d$age
[1] 20 21 21

Question: what is d[,c("age","age")]?

231 / 366

Like a pro with tidyverse

> iris %>% group_by(Species) %>%
summarize_at(vars(Sepal.Length,Sepal.Width),
list(mean=mean,sd=sd)) %>%
pivot_longer(-Species)
# A tibble: 12 × 3
Species name value
<fct> <chr> <dbl>
1 setosa Sepal.Length_mean 5.01
2 setosa Sepal.Width_mean 3.43
3 setosa Sepal.Length_sd 0.352
4 setosa Sepal.Width_sd 0.379
5 versicolor Sepal.Length_mean 5.94
6 versicolor Sepal.Width_mean 2.77
7 versicolor Sepal.Length_sd 0.516
8 versicolor Sepal.Width_sd 0.314
9 virginica Sepal.Length_mean 6.59
10 virginica Sepal.Width_mean 2.97
11 virginica Sepal.Length_sd 0.636
12 virginica Sepal.Width_sd 0.322

Useful for:

Very useful, indeed!

  1. The built-in function for plotting is plot(); since it is overloaded for many custom data types, you can always try feeding plot() with something and see what happens...
232 / 366

(Ready for the) first lab!

233 / 366

Lab 1: hardest variable in Iris

  1. consider the Iris dataset
  2. design and implement an ML-based procedure for answering this question:

what's the hardest variable to be predicted in the dataset?

Hints:

  • the Iris dataset is built-in in R: iris
  • there are (at least) two packages for tree learning with R
    • tree
    • rpart this might be a bit better
  • most packages for doing supervised learning have two functions for learning and prediction:
    • packageName() for learning (e.g., tree or rpart)
    • predict() for prediction
234 / 366

Tree bagging and Random Forest

235 / 366

The (bad) flexibility of trees

Consider this dataset obtained from a system:

A dataset with an outlier

Question: how would you "draw the system" behind this data?

If we learn a regression tree with low flexibility:

  • the model will not capture the system behavior
  • it will underfit the data and the system

If we learn a regression tree with high flexibility:

  • the model will likely better capture the system behavior, but...
  • it will also model some noise
  • it will overfit the data

It might be that there is no flexibility value that avoids both underfitting and overfitting.

236 / 366

The (bad) flexibility of trees

Consider this dataset obtained from a system:

A dataset with an outlier

Question: how would you "draw the system" behind this data?

If we learn a regression tree with low flexibility:

  • the model will not capture the system behavior
  • it will underfit the data and the system

If we learn a regression tree with high flexibility:

  • the model will likely better capture the system behavior, but...
  • it will also model some noise
  • it will overfit the data

It might be that there is no flexibility value that avoids both underfitting and overfitting.

What's that point at (80,22)(\approx 80, \approx 22)?

  • noise, or, from another point of view, a detail of the data, rather than of the system, that we don't want to model

What if we collect another dataset out of the same system?

236 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example
237 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example

VW Maggiolino

Model with low complexity:

"a moving object"

Model with high complexity:

"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"

237 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example

VW Maggiolino

Model with low complexity:

"a moving object"

Model with high complexity:

"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"

Ferrari Testarossa

"a moving object"

"a red-colored moving object with 4 wheels, 2 doors, side air intakes, a windshield, a small horse figure"

237 / 366

Human-learning and cars

  • Model: a description in natural language
  • Learning technique: human giving a description
  • Flexibility: number of characters available for the model
  • Problem instance: learning a model of (the concept) of car from (one) example

VW Maggiolino

Model with low complexity:

"a moving object"

Model with high complexity:

"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"

Ferrari Testarossa

"a moving object"

"a red-colored moving object with 4 wheels, 2 doors, side air intakes, a windshield, a small horse figure"

Fiat 500

"a moving object"

"a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, a windshield, chromed fenders, sunroof"

237 / 366

Modeled details

| Low complexity | High complexity |
| --- | --- |
| "a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
| "a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
| "a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |

Low complexity: never gives enough details about the system

High complexity: always gives a fair amount of details about the system, but also about noise

238 / 366

Modeled details

| Low complexity | High complexity |
| --- | --- |
| "a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
| "a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
| "a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |

Low complexity: never gives enough details about the system

High complexity: always gives a fair amount of details about the system, but also about noise

What if we combine different models with high complexity?

  • "a [...] moving object with 4 wheels, 2 doors, [...], a windshield, [...]"
  • much more details about the system, no details about the noise
  • i.e., no underfitting 😁, no overfitting 😁
238 / 366

Modeled details

| Low complexity | High complexity |
| --- | --- |
| "a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
| "a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
| "a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |

Low complexity: never gives enough details about the system

High complexity: always gives a fair amount of details about the system, but also about noise

What if we combine different models with high complexity?

  • "a [...] moving object with 4 wheels, 2 doors, [...], a windshield, [...]"
  • much more details about the system, no details about the noise
  • i.e., no underfitting 😁, no overfitting 😁

When "learners" are common people, this idea is related with the wisdom of the crowds theorem, stating that "a collective opinion may be better than a single expert's opinion".

238 / 366

Wisdom of the crowds

"a collective opinion may be better than a single expert's opinion"

Yes, but only if:

  • we have many opinions
  • the opinions are independent
  • we have a way to aggregate them

239 / 366

Wisdom of the crowds

"a collective opinion may be better than a single expert's opinion"

Yes, but only if:

  • we have many opinions
  • the opinions are independent
  • we have a way to aggregate them

Can we realize a wisdom of the trees? (where (opinion, person) \leftrightarrow (prediction, tree))

  • we have many opinions
    • ok, just learn many trees
  • the opinions are independent
    • ... 🤔
  • we have a way to aggregate them
    • aggregate predictions of the trees:
      • classification: majority
      • regression: average
239 / 366

Independence of trees

A tree is the result of the execution of flearnf'\subtext{learn} on a learning set Dlearn={(x(i),y(i))}iD\subtext{learn} = \seq{(x^{(i)},y^{(i)})}{i}.

flearnf'\subtext{learn} is deterministic, thus:

  • if we apply flearnf'\subtext{learn} twice on the same learning set, we obtain two equal models
  • if we apply flearnf'\subtext{learn} mm times on the same dataset, we obtain mm equal models
  • no independence

In order to obtain different trees, we need to apply flearnf'\subtext{learn} on different learning sets!

But we have just one learning set... 🤔

Question: what's the learning set for human-learners?

240 / 366

Different learning sets

Goal: obtaining mm different datasets Dlearn,1,,Dlearn,mD_{\text{learn},1}, \dots, D_{\text{learn},m} from a dataset DlearnD\subtext{learn}

  • decently different from each other
  • all being decently representative of the underlying system (not much worse than D\subtext{learn})
241 / 366

Different learning sets

Goal: obtaining mm different datasets Dlearn,1,,Dlearn,mD_{\text{learn},1}, \dots, D_{\text{learn},m} from a dataset DlearnD\subtext{learn}

  • decently different from each other
  • all being decently representative of the underlying system (not much worse than D\subtext{learn})

Option 1: (CV-like)

  1. shuffle DlearnD\subtext{learn}
  2. split DlearnD\subtext{learn} in mm folds
  3. assign each Dlearn,jD_{\text{learn},j} to the jj-th fold

Requirements check:

  • 👍 the folds are in general different from each other
  • 👎 if mm is large, each Dlearn,jD_{\text{learn},j} is small, with size 1mDlearn\frac{1}{m} |D\subtext{learn}|, and is likely poorly representative of the system
241 / 366

Different learning sets

Goal: obtaining mm different datasets Dlearn,1,,Dlearn,mD_{\text{learn},1}, \dots, D_{\text{learn},m} from a dataset DlearnD\subtext{learn}

  • decently different from each other
  • all being decently representative of the underlying system (not much worse than D\subtext{learn})

Option 1: (CV-like)

  1. shuffle DlearnD\subtext{learn}
  2. split DlearnD\subtext{learn} in mm folds
  3. assign each Dlearn,jD_{\text{learn},j} to the jj-th fold

Requirements check:

  • 👍 the folds are in general different from each other
  • 👎 if mm is large, each Dlearn,jD_{\text{learn},j} is small, with size 1mDlearn\frac{1}{m} |D\subtext{learn}|, and is likely poorly representative of the system

Option 2: rand. sampling w/ repetitions

  1. for each j{1,,m}j \in \{1, \dots, m\}
    1. start with an empty Dlearn,jD_{\text{learn},j}
    2. repeat n=Dlearnn=|D\subtext{learn}| times
      1. pick a random el. of DlearnD\subtext{learn}
      2. add it to Dlearn,jD_{\text{learn},j}

Requirements check:

  • 👍 the folds are in general different from each other
  • 👍 regardless of m, each D_{\text{learn},j} is as large as D\subtext{learn}
    • you can freely choose mm, even mnm \ge n!
241 / 366

Sampling with repetition

On DlearnD\subtext{learn}:

  1. for each j{1,,m}j \in \{1, \dots, m\}
    1. start with an empty Dlearn,jD_{\text{learn},j}
    2. repeat n=Dlearnn=|D\subtext{learn}| times
      1. pick a random el. of DlearnD\subtext{learn}
      2. add it to Dlearn,jD_{\text{learn},j}

In general:

function sample-rep({x1,,xn})\text{sample-rep}(\{x\sub{1},\dots,x\sub{n}\}) {
XX' \gets \emptyset
while |X'| < n {
juniform({1,,n})j \gets \text{uniform}(\{1,\dots,n\})
XX{xj}X' \gets X' \cup \{x\sub{j}\}
}
return XX'
}

[Diagram: f\subtext{sample-rep} maps \{x_1,\dots,x_n\} to \{x_{j_1},\dots,x_{j_n}\}]

Remarks:

  • fsample-repf\subtext{sample-rep} is not deterministic!
    • if you execute twice it on the same input, you get different outputs
  • when you use sampling with repetition to estimate the distribution of a metric, rather than computing the metric itself on the entire collection, you are doing bootstrapping (see the R one-liner below)
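
In R, sample-rep is a one-liner, since the built-in sample() supports sampling with replacement (set.seed() just makes the non-determinism reproducible):

sample.rep <- function(x) x[sample(length(x), replace = TRUE)]

set.seed(42)
sample.rep(c("a", "b", "c", "d", "e"))   # a multiset of 5 elements, likely with duplicates
sample.rep(c("a", "b", "c", "d", "e"))   # a different multiset: not deterministic
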
242 / 366

Examples and probability

Not deterministic, thus:

  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{1}{●},\c{1}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{3}{●},\c{4}{●},\c{3}{●},\c{5}{●},\c{5}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{4}{●},\c{1}{●}\}
  • ...

recall: input and output are multisets

243 / 366

Examples and probability

Not deterministic, thus:

  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{1}{●},\c{1}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{3}{●},\c{4}{●},\c{3}{●},\c{5}{●},\c{5}{●}\}
  • one invocation: fsample-rep({,,,,}){,,,,}f\subtext{sample-rep}(\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}) \rightarrow \{\c{2}{●},\c{4}{●},\c{3}{●},\c{4}{●},\c{1}{●}\}
  • ...

recall: input and output are multisets

Given an input with n elements and assuming uniqueness, an element has:

  • a probability of \left(1-\frac{1}{n}\right)^n of not occurring in the output
  • a probability of \binom{n}{1}\frac{1}{n}\left(1-\frac{1}{n}\right)^{n-1} of occurring in the output exactly once
  • a probability of \binom{n}{2}\left(\frac{1}{n}\right)^2\left(1-\frac{1}{n}\right)^{n-2} of occurring in the output exactly twice
  • ...
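
These probabilities can be checked numerically; for large n, the first one tends to e^{-1} \approx 0.368, i.e., each sampled dataset leaves out, on average, about 37% of the original elements:

n <- 100
(1 - 1/n)^n                                  # never occurs: ~0.366
exp(-1)                                      # the large-n limit: ~0.368
choose(n, 1) * (1/n) * (1 - 1/n)^(n - 1)     # occurs exactly once: ~0.370
choose(n, 2) * (1/n)^2 * (1 - 1/n)^(n - 2)   # occurs exactly twice: ~0.185
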
243 / 366

Towards wisdom of the trees

Can we realize a wisdom of the trees? (where (opinion, person) \leftrightarrow (prediction, tree))

  • we have many opinions
    • 👍 ok, just learn many trees
  • the opinions are independent
    • 👍 each tree is learned on a dataset obtained with sampling with repetition
  • we have a way to aggregate them
    • 👍 aggregate predictions of the trees:
      • classification: majority
      • regression: average

244 / 366

Towards wisdom of the trees

Can we realize a wisdom of the trees? (where (opinion, person) \leftrightarrow (prediction, tree))

  • we have many opinions
    • 👍 ok, just learn many trees
  • the opinions are independent
    • 👍 each tree is learned on a dataset obtained with sampling with repetition
  • we have a way to aggregate them
    • 👍 aggregate predictions of the trees:
      • classification: majority
      • regression: average

Ok, we can define a new learning technique that realizes this idea!

[Diagrams: f'\subtext{learn} takes \seq{(x^{(i)},y^{(i)})}{i} and outputs m; f'\subtext{predict} takes x and m and outputs y]

This technique is called tree bagging (from bootstrap aggregating).

244 / 366

Tree bagging: learning

[Diagram: f'\subtext{learn} takes \seq{(x^{(i)},y^{(i)})}{i} and n\subtext{tree}, and outputs the bag of trees \seq{t_j}{j}]

function learn({(x(i),y(i))}i,ntree)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i}, \c{1}{n\subtext{tree}}) {
TT' \gets \emptyset
while |T'| < n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
tlearnsingle({(x(ji),y(ji))}ji,1)t \gets \c{2}{\text{learn}\subtext{single}}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, \c{3}{1})
TT{t}T' \gets T' \cup \{t\}
}
return TT'
}

  • the model is a bag of trees
    • it can contain duplicates
  • ntreen\subtext{tree} is the number of trees in the bag
    • a parameter of the learning technique
  • learnsingle()\text{learn}\subtext{single}() is the flearnf'\subtext{learn} for learning a single tree (recursive binary splitting)
    • tree bagging is based on recursive binary splitting
  • learnsingle()\text{learn}\subtext{single}() is invoked with nmin=1n\subtext{min}=1, because we want each tree in the bag to give many details¹!

Recall: since one part of this flearnf'\subtext{learn} is not deterministic (namely, sample-rep()\text{sample-rep}()), the entire flearnf'\subtext{learn} is not deterministic!

  • not to be confused with a system not being deterministic
  • not to be confused with an fpredictf''\subtext{predict} that returns a probability
  1. this can be obtained also with a reasonably small nminn\subtext{min}, or with a reasonably large maximum tree depth
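
A minimal R sketch of this f'\subtext{learn} and of the majority-voting f'\subtext{predict} (assumptions: rpart as learn\subtext{single}; classification on iris; minsplit=2 with cp=0 approximates n\subtext{min}=1):

library(rpart)

bag.learn <- function(d, n.tree = 100) {
  lapply(1:n.tree, function(j) {
    d.j <- d[sample(nrow(d), replace = TRUE), ]             # sample-rep
    rpart(Species ~ ., data = d.j,
          control = rpart.control(minsplit = 2, cp = 0))    # max flexibility
  })
}

bag.predict <- function(d.new, trees) {
  votes <- sapply(trees, function(t) as.character(predict(t, d.new, type = "class")))
  votes <- matrix(votes, nrow = nrow(d.new))                # one row per observation
  apply(votes, 1, function(v) names(which.max(table(v))))   # majority voting
}

trees <- bag.learn(iris)
mean(bag.predict(iris, trees) == iris$Species)              # accuracy on the learning data
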
245 / 366

Tree bagging: prediction

[Diagram: f'\subtext{predict} takes x and the bag \seq{t_j}{j}, and outputs y]

Classification (decision trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return arg maxyYj1(y=predictsingle(x,tj))\argmax_{y \in Y} \sum_j \mathbf{1}(y=\c{1}{\text{predict}\subtext{single}}(x,t_j))
}

  • predictsingle()\text{predict}\subtext{single}() is the fpredictf'\subtext{predict} for the single tree
  • arg max\argmax is a majority voting:
    1. for each yy in YY, count the number j1(y=predictsingle(x,tj))\sum_j \mathbf{1}(y=\text{predict}\subtext{single}(x,t_j)) of trees in the bag predicting that yy (i.e., the votes for that yy)
    2. select the yy with the largest count (i.e., the majority of votes)
  • easily modifiable to an fpredictf''\subtext{predict} (with probability):
    • return p=y1{tj}jj1(y=predictsingle(x,tj))p = y \mapsto \frac{1}{|\seq{t_j}{j}|}\sum_j \mathbf{1}(y=\text{predict}\subtext{single}(x,t_j))

Regression (regression trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return 1{tj}jjpredictsingle(x,tj)\frac{1}{|\seq{t_j}{j}|} \sum_j \c{1}{\text{predict}\subtext{single}}(x,t_j)
}

  • simply returns the mean of the predictions of the tree in the bag
  • bonus: instead of getting just the mean, by also getting the standard deviation \sigma of the tree predictions we can have a measure of uncertainty of the prediction: the larger \sigma, the more uncertain the prediction, the lower the confidence
    • uncertainty/confidence is a basic form of local explainability, i.e., understanding the decisions of the model
    • uncertainty/confidence can be exploited in the active learning framework
246 / 366

Impact of the parameter ntreen\subtext{tree}

  • Is n\subtext{tree} a flexibility parameter?
  • Does n\subtext{tree} hence impact the complexity of the learned models, i.e., the tendency to overfit?
247 / 366

Impact of the parameter ntreen\subtext{tree}

  • Is n\subtext{tree} a flexibility parameter?
  • Does n\subtext{tree} hence impact the complexity of the learned models, i.e., the tendency to overfit?

Apparently yes:

  • because the larger ntreen\subtext{tree}, the larger the bag, the more complex the model
    • each tree has the "maximum" complexity, having been learned with n\subtext{min}=1

Apparently no:

  • because the larger ntreen\subtext{tree}, the larger the number of trees whose prediction is averaged (regression) or subjected to majority voting (classification), i.e., the stronger the smoothing of details

So what? 🤔

247 / 366

ntreen\subtext{tree}: bagging vs. single tree learning

"Experimentally", it turns out that:

  • with a reasonably large ntreen\subtext{tree}, bagging is better than single tree learning
    • "reasonably large" = tens or few hundreds
    • "better" = produces more effective models
  • if you further increase ntreen\subtext{tree}, there's no overfitting

Note that bagging with ntree=1n\subtext{tree}=1 is¹ single tree learning.

  1. Question: are they exactly the same?

248 / 366

ntreen\subtext{tree}: bagging vs. single tree learning

"Experimentally", it turns out that:

  • with a reasonably large ntreen\subtext{tree}, bagging is better than single tree learning
    • "reasonably large" = tens or few hundreds
    • "better" = produces more effective models
  • if you further increase ntreen\subtext{tree}, there's no overfitting

Note that bagging with ntree=1n\subtext{tree}=1 is¹ single tree learning.

  1. Question: are they exactly the same?

Question: can we hence set an arbitrarily large n\subtext{tree}?

248 / 366

ntreen\subtext{tree}: bagging vs. single tree learning

"Experimentally", it turns out that:

  • with a reasonably large ntreen\subtext{tree}, bagging is better than single tree learning
    • "reasonably large" = tens or few hundreds
    • "better" = produces more effective models
  • if you further increase ntreen\subtext{tree}, there's no overfitting

Note that bagging with ntree=1n\subtext{tree}=1 is¹ single tree learning.

  1. Question: are they exactly the same?

Question: can we hence set an arbitrarily large n\subtext{tree}?

No! Efficiency decreases linearly with n\subtext{tree}:

  • invoking predict\subtext{single}() n\subtext{tree} times takes, on average, n\subtext{tree} times the resources of invoking predict\subtext{single}() once, but...
  • ... the invocations may be done in parallel (to some degree)
    • time resource is consumed less
    • energy resource is not affected
248 / 366

Tree bagging applicability

Since it is based on the learning technique for single trees, bagging has the same applicability:

  • YY: both regression and classification (binary and multiclass)
  • XX: multivariate XX with both numerical and categorical variables
  • models give probability

249 / 366

Tree bagging applicability

Since it is based on the learning technique for single trees, bagging has the same applicability:

  • YY: both regression and classification (binary and multiclass)
  • XX: multivariate XX with both numerical and categorical variables
  • models give probability

Note that the idea behind tree bagging can be applied to any base learning technique:

  • the base technique is called weak learner
  • the resulting model is an ensemble, hence bagging is a form of ensemble learning
249 / 366
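
Since the weak learner can be any base technique, bagging itself can be written once, generically. A minimal sketch, assuming observations of some type O and the weak learner passed as a function (all names here are hypothetical):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;

class Bagging<O, M> {
  final Function<List<O>, M> weakLearner; // the base f'_learn
  final int nBags; // n_tree, when the weak learner learns trees
  final Random random = new Random();

  Bagging(Function<List<O>, M> weakLearner, int nBags) {
    this.weakLearner = weakLearner;
    this.nBags = nBags;
  }

  // learn the ensemble: n_bags models, each on a bootstrap sample
  List<M> learn(List<O> dataset) {
    List<M> ensemble = new ArrayList<>();
    for (int j = 0; j < nBags; j++) {
      List<O> sample = new ArrayList<>(); // sample-rep(): same size as D, with repetition
      for (int i = 0; i < dataset.size(); i++) {
        sample.add(dataset.get(random.nextInt(dataset.size())));
      }
      ensemble.add(weakLearner.apply(sample));
    }
    return ensemble;
  }
}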

Random Forest

250 / 366

Increasing independence

Wisdom of the trees:

  • many trees
  • trees are independent
  • tree predictions are aggregated

Tree independence is obtained by learning each tree on a (slightly) different dataset.

If some variables (strong predictors) are very useful for separating the observations, all the trees may still share a very similar structure, due to the way they are built.

Can we further increase tree independence?

Yes!

Idea: when learning each tree, remove some randomly chosen independent variables from the observations

Tree bagging improved with variable removal is a learning technique called Random Forest:

  • random because there are two sources of randomness, hence of independence
  • forest because it gives a bag of trees
251 / 366

Random Forest: learning

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}{tj}j\seq{t_j}{j}ntree,nvarsn\subtext{tree},n\subtext{vars}

function learn({(x(i),y(i))}i,ntree,nvars)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i},n\subtext{tree}, \c{1}{n\subtext{vars}}) {
TT' \gets \emptyset
while Tntree|T'| \le n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
{(x(ji),y(ji))}jiretain-vars({(x(ji),y(ji))}ji,nvars)\seq{(\c{4}{x^{\prime(j_i)}},y^{(j_i)})}{j_i} \gets \c{3}{\text{retain-vars}}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, \c{1}{n\subtext{vars}})
tlearnsingle({(x(ji),y(ji))}ji,1)t \gets \c{2}{\text{learn}\subtext{single}}(\seq{(\c{4}{x^{\prime(j_i)}},y^{(j_i)})}{j_i}, 1)
TT{t}T' \gets T' \cup \{t\}
}
return TT'
}

  • the model is a bag of ntreen\subtext{tree} trees, as in bagging
  • nvarsp\c{1}{n\subtext{vars}} \le p is the number of variables to be retained
    • a parameter of the learning technique
  • learnsingle()\text{learn}\subtext{single}() gets, at each iteration, a dataset DP(X×Y)D' \in \mathcal{P}^*(\c{4}{X'} \times Y)
    • X=X1××XpX=X_1 \times \dots \times X_p has all the pp vars
    • X=Xj1××Xjnvars\c{4}{X'}=X_{j_1} \times \dots \times X_{j_{n\subtext{vars}}} has only nvarsn\subtext{vars} variables, with each jk{1,,p}j_k \in \{1, \dots, p\} and jkjkj_{k'} \ne j_{k''} for all kkk' \ne k''
    • retain-vars()\text{retain-vars}() builds DD' (with XX' inside) from DD (with XX inside)

Two parts of this flearnf'\subtext{learn} are not deterministic (namely, sample-rep()\text{sample-rep}() and retain-vars()\text{retain-vars}()), hence the entire flearnf'\subtext{learn} is not deterministic!

252 / 366
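
A minimal sketch of the retain-vars()\text{retain-vars}() step, following the pseudocode above (one random choice of nvarsn\subtext{vars} indexes per tree, applied to all the observations of the bootstrap sample); observations are assumed to be plain double[] arrays of the pp variable values:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class RetainVars {
  // keep only n_vars randomly chosen variables (the 2nd source of randomness)
  static List<double[]> retainVars(List<double[]> xs, int nVars) {
    int p = xs.get(0).length;
    List<Integer> indexes = new ArrayList<>();
    for (int j = 0; j < p; j++) {
      indexes.add(j);
    }
    Collections.shuffle(indexes);
    List<Integer> kept = indexes.subList(0, nVars); // the j_1, ..., j_nvars
    List<double[]> retained = new ArrayList<>();
    for (double[] x : xs) {
      double[] xPrime = new double[nVars]; // x' in X' = X_{j_1} x ... x X_{j_nvars}
      for (int k = 0; k < nVars; k++) {
        xPrime[k] = x[kept.get(k)];
      }
      retained.add(xPrime);
    }
    return retained;
  }
}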

Random Forest: prediction

fpredictf'\subtext{predict}x,{tj}jx,\seq{t_j}{j}yy

Classification (decision trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return arg maxyYj1(y=predictsingle(x,tj))\argmax_{y \in Y} \sum_j \mathbf{1}(y=\text{predict}\subtext{single}(x,t_j))
}

Regression (regression trees)

function predict(x,{tj}j)\text{predict}(x, \seq{t_j}{j}) {
return 1{tj}jjpredictsingle(x,tj)\frac{1}{|\seq{t_j}{j}|} \sum_j \text{predict}\subtext{single}(x,t_j)
}

Exactly the same as for tree bagging

Question: some of the trees in the bag do not have all variables of xx: is this a problem?

No, the tree is still able to process an xx: it simply does not consider (i.e., does not use in branch nodes) some of its variable values;

  • the opposite case (a variable used in the tree, but without a value in xx) would be a problem we'll see
253 / 366
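
A minimal Java sketch of the classification fpredictf'\subtext{predict} above (majority voting); the single-tree fpredictf'\subtext{predict} is passed in as a function, since its actual form depends on the tree representation:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

class BagPredict {
  // classification: return the y collecting most votes among the trees
  static <T, Y> Y predict(double[] x, List<T> bag, BiFunction<double[], T, Y> predictSingle) {
    Map<Y, Integer> votes = new HashMap<>();
    for (T tree : bag) {
      votes.merge(predictSingle.apply(x, tree), 1, Integer::sum); // 1(y = predict_single(x, t_j))
    }
    return votes.entrySet().stream()
        .max(Map.Entry.comparingByValue()) // argmax over y of the vote counts
        .orElseThrow()
        .getKey();
  }
}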

Impact of the parameter nvarsn\subtext{vars}

  • Is nvarsn\subtext{vars} a flexibility parameter?
  • Does nvarsn\subtext{vars} hence impact on learned model complexity, i.e., on tendency to overfitting?

No, "experimentally", it turns out that:

  • nvarsn\subtext{vars} does not impact on tendency to overfitting
  • reasonably good default values exist:
    • nvars=pn\subtext{vars} = \left\lceil\sqrt{p}\right\rceil for classification
    • nvars=13pn\subtext{vars} = \left\lceil\frac{1}{3} p\right\rceil for regression

x\left\lceil x\right\rceil is ceil(x)\text{ceil}(x), i.e., rounding up to the nearest integer; x\left\lfloor x\right\rfloor is floor(x)\text{floor}(x), i.e., rounding down to the nearest integer

254 / 366

Random Forest parameters

Both ntreen\subtext{tree} and nvarsn\subtext{vars} do not impact on tendency to overfitting.

In practice, we can use the default values for both:

  • ntree=500n\subtext{tree} = 500
  • nvars=pn\subtext{vars} = \left\lceil\sqrt{p}\right\rceil or nvars=13pn\subtext{vars} = \left\lceil\frac{1}{3} p\right\rceil

\Rightarrow Random Forest is (almost) a (hyper)parameter-free learning technique!

However, "we can use the default values"

  • does not mean that default values are the best parameter values for any possible dataset/system more on this later
  • it means we'd better spend our efforts on designing other components of the ML system:
    • engineering better features
    • getting better data
    • building a better UI
    • ...
255 / 366

Visualizing Random Forest for regression

Example of bagging on regression

image from Fabio Daolio

How is this image built?

  1. set the real system as a f:xyf: x \to y
    • plot f(x)f(x) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]
  2. take a random set of points {x(i)}i\seq{x^{(i)}}{i} in [xmin,xmax][x\subtext{min},x\subtext{max}]
  3. compute the corresponding yy and perturb them with a noise: y(i)=f(x(i))+ϵy^{(i)}=f(x^{(i)})+\epsilon with ϵN(0,1)\epsilon \sim N(0,1)
  4. set the dataset as D={(x(i),y(i))}iD=\seq{(x^{(i)},y^{(i)})}{i}
    • plot each (x(i),y(i))(x^{(i)},y^{(i)}) in DD
  5. learn one single tree tt on DD
    • plot fpredict(x,t)f'\subtext{predict}(x,t) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]
  6. learn¹ a bag {tj}j\seq{t_j}{j} on DD
    • plot fpredict(x,{tj}j)f'\subtext{predict}(x,\seq{t_j}{j}) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]
    • tj{tj}j\forall t_j \in \seq{t_j}{j}, plot fpredict(x,tj)f'\subtext{predict}(x,t_j) for x[xmin,xmax]x \in [x\subtext{min},x\subtext{max}]

Finding: the bag nicely models the real system

  • question: why not at the extremes of the xx domain?
  • question: can you reproduce this for classification and p=2p=2?
  1. Question: bagging or Random Forest?
256 / 366

Out-of-bag trees

function learn({(x(i),y(i))}i,ntree,nvars)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i},n\subtext{tree}, n\subtext{vars}) {
TT' \gets \emptyset
while Tntree|T'| \le n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
{(x(ji),y(ji))}jiretain-vars({(x(ji),y(ji))}ji,nvars)\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i} \gets \text{retain-vars}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, n\subtext{vars})
tlearnsingle({(x(ji),y(ji))}ji,1)t \gets \c{2}{\text{learn}\subtext{single}(\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i}, 1)}
TT{t}T' \gets T' \cup \{t\}
}
return TT'
}

Toy example with D={,,,,}D=\{\c{1}{●},\c{2}{●},\c{3}{●},\c{4}{●},\c{5}{●}\}

  • t1=learnsingle({,,,,},1)t_1 = \text{learn}\subtext{single}(\{\c{2}{●},\c{4}{●},\c{3}{●},\c{1}{●},\c{1}{●}\}, 1), \c{5}{●} not used
  • t2=learnsingle({,,,,},1)t_2 = \text{learn}\subtext{single}(\{\c{4}{●},\c{4}{●},\c{1}{●},\c{2}{●},\c{5}{●}\}, 1), \c{3}{●} not used
  • t3=learnsingle({,,,,},1)t_3 = \text{learn}\subtext{single}(\{\c{5}{●},\c{3}{●},\c{1}{●},\c{2}{●},\c{4}{●}\}, 1), all used
  • ...
  • tj=learnsingle({,,,,},1)t_j = \text{learn}\subtext{single}(\{\c{3}{●},\c{1}{●},\c{5}{●},\c{4}{●},\c{5}{●}\}, 1), \c{2}{●} not used
  • ...

For every tree, there are zero or more observations that have not been used for learning it.

From another point of view, for every ii-th observation (x(i),y(i))(x^{(i)},y^{(i)}), there are some trees which have been learned without that observation:

  • with ntreen\subtext{tree} trees in the bag, on average, 13ntree\frac{1}{3} n\subtext{tree} trees have been learned without the observation (this can be computed by playing a bit with probability); they are called out-of-bag trees
  • each observation is an unseen observation for its out-of-bag trees

\Rightarrow use unseen observations for computing an estimate of the test error (or accuracy, or another index) without needing a testing set: the OOB error

257 / 366

OOB error

Computing the OOB error during the learning:

  1. for each observation (x(i),y(i))(x^{(i)},y^{(i)})
    1. find the out-of-bag trees
    2. obtain their prediction y^(i)\hat{y}^{(i)} on the observation
  2. compute the error on the predictions (with an fcomp-respsf\subtext{comp-resps})

Remarks:

  • it is an estimate of the test error, but does not need a test dataset
    • still an estimate, not the real test error
  • it is¹ computed at learning time

Classification error vs. bag size (image from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013)

  1. Many libraries compute it only upon user's request.
258 / 366
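
A minimal sketch of the OOB error for classification, assuming that, during learning, the set of observation indexes that ended up in each tree's bootstrap sample has been recorded (here inBag), and that preds[j][i] is the prediction of the jj-th tree on the ii-th observation (both names are assumptions of this sketch):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class OobError {
  static double oobErrorRate(int[] y, int[][] preds, List<Set<Integer>> inBag) {
    int wrong = 0;
    int counted = 0;
    for (int i = 0; i < y.length; i++) {
      Map<Integer, Integer> votes = new HashMap<>();
      for (int j = 0; j < preds.length; j++) {
        if (!inBag.get(j).contains(i)) { // the j-th tree is out-of-bag for i
          votes.merge(preds[j][i], 1, Integer::sum);
        }
      }
      if (votes.isEmpty()) {
        continue; // observation in every bag: cannot use it
      }
      int yHat = votes.entrySet().stream()
          .max(Map.Entry.comparingByValue()).orElseThrow().getKey();
      counted = counted + 1;
      if (yHat != y[i]) {
        wrong = wrong + 1;
      }
    }
    return (double) wrong / counted; // the estimate of the test error
  }
}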

Interpretability of the trees

Is this model interpretable (ntree=1n\subtext{tree}=1)?

Single tree

Is this model interpretable (ntree=100n\subtext{tree}=100)?

Forest

Interpretation of the model (i.e., global explainability) is feasible if the model can be visualized:

  • a single tree can be visualized (if it's small); 100100 trees cannot!

There exist other flavors of interpretability:

  • simulatability: the degree to which the working of the model can be reproduced by a human
  • decomposability: the degree to which the human can split the model in components and interpret them and their role
259 / 366

The role of the variables

[Tree diagram: a branch node testing xagex\subtext{age} vs. 1010 (\le / >>), whose child is a branch node testing xheightx\subtext{height} vs. 120120 (\le / >>)]

By looking at this tree, we can understand:

  • exactly what variables are used
  • exactly when they are used in the decision process
    • here, xagex\subtext{age} is used before xheightx\subtext{height}
  • exactly how, i.e., what they are compared against

In principle, this can be done also for a bag of trees, but it would not scale well... in human terms

Can we have a much coarser view on variables role that scales well to large ntreen\subtext{tree}?

Yes!

Idea (first option: mean RSS/Gini decrease): when learning

  1. for each tree, for each branch-node
    1. measure the RSS/Gini before the branch-node
    2. measure the RSS/Gini after the branch-node
    3. assign (by increment) the decrease to the branch-node variable
  2. build a ranking of variables based on the sum of decreases (the larger, the more important)
260 / 366

Variable importance by RSS/Gini decrease

function learn({(x(i),y(i))}i,ntree,nvars)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i},n\subtext{tree}, n\subtext{vars}) {
v0\vect{v} \gets \vect{0}
TT' \gets \emptyset
while Tntree|T'| \le n\subtext{tree} {
{(x(ji),y(ji))}jisample-rep({(x(i),y(i))}i)\seq{(x^{(j_i)},y^{(j_i)})}{j_i} \gets \text{sample-rep}(\seq{(x^{(i)},y^{(i)})}{i})
{(x(ji),y(ji))}jiretain-vars({(x(ji),y(ji))}ji,nvars)\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i} \gets \text{retain-vars}(\seq{(x^{(j_i)},y^{(j_i)})}{j_i}, n\subtext{vars})
tlearnsingle({(x(ji),y(ji))}ji,1,v)t \gets \c{2}{\text{learn}\subtext{single}}(\seq{(x^{\prime(j_i)},y^{(j_i)})}{j_i}, 1, \c{1}{\vect{v}})
TT{t}T' \gets T' \cup \{t\}
}
return (T,v)(T', \c{1}{\vect{v}})
}

function learnsingle({(x(i),y(i))}i,nmin,v)\c{2}{\text{learn}\subtext{single}}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}, n\subtext{min},\c{1}{\vect{v}}) {
if should-stop({y(i)}i,nmin)\text{should-stop}(\seq{y^{(i)}}{i}, n\subtext{min}) then { ... } else {
ebeforegini({y(i)}i)e\subtext{before} \gets \text{gini}(\seq{y^{(i)}}{i})
(j,τ)find-best-branch({(x(i),y(i))}i)(j, \tau) \gets \text{find-best-branch}(\seq{(\vect{x}^{(i)},y^{(i)})}{i})
eaftergini({y(i)}ixj(i)τ)+gini({y(i)}ixj(i)>τ)e\subtext{after} \gets \text{gini}(\seq{y^{(i)}}{i}\big\rvert\sub{x^{(i)}\sub{j} \le \tau})+\text{gini}(\seq{y^{(i)}}{i}\big\rvert\sub{x^{(i)}\sub{j} > \tau})
vjvj+ebeforeeafterv\sub{j} \gets v\sub{j} + e\subtext{before}-e\subtext{after}
tnode-from((j,τ),t \gets \text{node-from}((j,\tau),
learn({(x(i),y(i))}ixj(i)τ,nmin,v),\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j \le \tau}, n\subtext{min}, \c{1}{\vect{v}}),
learn({(x(i),y(i))}ixj(i)>τ,nmin,v)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}\big\rvert_{x^{(i)}_j > \tau}, n\subtext{min}, \c{1}{\vect{v}})
)
return tt
}
}

  1. for each tree, for each branch-node
    1. measure the RSS/Gini before the branch-node
    2. measure the RSS/Gini after the branch-node
    3. assign (by increment) the decrease to the branch-node variable
  2. build a ranking of variables based on the sum of decreases
  • v\vect{v} stores the Gini decrease for each variable
    • initially set to 0Rp\vect{0} \in \mathbb{R}^p
    • propagated to each call to learnsingle()\text{learn}\subtext{single}()
  • the error before is Gini computed on the local dataset (the one at the node) before dividing the data
  • the error after is Gini computed on the local dataset (the one at the node) after dividing the data

Example: gini({y(i)}i)=yFr ⁣(y,{y(i)}i)(1Fr ⁣(y,{y(i)}i))\text{gini}(\seq{y^{(i)}}{i})=\sum_y \freq{y, \seq{y^{(i)}}{i}} \left(1-\freq{y, \seq{y^{(i)}}{i}}\right)

{y(i)}i\seq{y^{(i)}}{i} (shown split as τ\le\tau / >τ>\tau) | Gini | Giniτ\rvert_{ \le \tau} | Gini>τ\rvert_{ > \tau} | Decrease
… / … | 0.5 | 0 | 0 | 0.5
… / … | 0.375 | 0 | 0 | 0.375
… / … | 0.375 | 0 | 0.333 | 0.042
… / … | 0.5 | 0.25 | 0.25 | 0

Question: is this xjx_j categorical or numerical?

261 / 366
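
A minimal sketch of the gini()\text{gini}() function above (labels encoded as ints), with a toy main reproducing a 0.50.5 \to 0 decrease as in the first row of the table; the label sequences in the main are made up for illustration:

import java.util.List;

class Gini {
  // gini({y_i}) = sum_y freq(y, {y_i}) * (1 - freq(y, {y_i}))
  static double gini(List<Integer> ys) {
    double g = 0;
    for (int y : ys.stream().distinct().toList()) {
      double freq = (double) ys.stream().filter(v -> v == y).count() / ys.size();
      g = g + freq * (1 - freq);
    }
    return g;
  }

  public static void main(String[] args) {
    List<Integer> before = List.of(0, 0, 1, 1); // gini = 0.5
    // a perfect split: e_after = gini(left) + gini(right) = 0 + 0
    double decrease = gini(before) - (gini(List.of(0, 0)) + gini(List.of(1, 1)));
    System.out.println(decrease); // 0.5
  }
}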

OOB-shuffling importance

It has been shown experimentally that RSS/Gini decrease is not effective as a measure of variable importance:

  • if there are categorical variables with many values \rightarrow many branches
  • because it tends to give more importance to numerical variables \rightarrow many branches
  • in general, because it works on learning data

Idea (second option, aka mean accuracy decrease): just after learning

  1. for each jj-th variable and each tree tt in the bag
    1. take the observations DtD_t not used for tt
    2. measure the accuracy of tt on DtD_t
    3. shuffle the jj-th variable in the observations, obtaining DtD'_t
    4. measure the accuracy of tt on DtD'_t
    5. assign (by increment) the decrease in accuracy to the jj-th variable
  2. build a ranking of variables based on the sum of decreases (the larger, the more important)

Rationale: if the decrease is low, it means that shuffling the variable has no effect, so the variable is not really important!

262 / 366
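
A minimal sketch of the per-tree accuracy decrease (classification), to be invoked, for each tree and each variable, on the out-of-bag observations of that tree; the tree's fpredictf'\subtext{predict} is passed in as a function:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

class ShufflingImportance {
  // accuracy decrease for the j-th variable on one tree's OOB observations
  static double importance(double[][] xs, int[] ys, int j, Function<double[], Integer> predict) {
    double accBefore = accuracy(xs, ys, predict);
    List<Double> column = new ArrayList<>();
    for (double[] x : xs) {
      column.add(x[j]);
    }
    Collections.shuffle(column); // break the x_j <-> y association
    double[][] shuffled = new double[xs.length][];
    for (int i = 0; i < xs.length; i++) {
      shuffled[i] = xs[i].clone();
      shuffled[i][j] = column.get(i);
    }
    return accBefore - accuracy(shuffled, ys, predict); // low decrease = unimportant variable
  }

  static double accuracy(double[][] xs, int[] ys, Function<double[], Integer> predict) {
    int ok = 0;
    for (int i = 0; i < xs.length; i++) {
      if (predict.apply(xs[i]) == ys[i]) {
        ok = ok + 1;
      }
    }
    return (double) ok / xs.length;
  }
}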

Feature ablation for variable importance

There is also a further, more general variant, that works for any learning technique flearn,fpredictf'\subtext{learn}, f'\subtext{predict}:

Idea (third option: feature ablation):

  1. measure the effectiveness of flearn,fpredictf'\subtext{learn}, f'\subtext{predict} on the dataset DD
  2. for each jj-th variable xjx_j
    1. build a DD' by removing xjx_j from DD
    2. measure the effectiveness of flearn,fpredictf'\subtext{learn}, f'\subtext{predict} on the dataset DD'
    3. compute the jj-th variable importance as the decrease of effectiveness in DD' w.r.t. DD
  3. build a ranking of variables based on decreases of effectiveness (the larger, the more important)

This method is (a form of) feature ablation, since you remove variables/features and see what happens:

  • ablation [a-bley-shuhn]: gradually remove material from or erode (a surface or object) by melting, evaporation, frictional action, etc., or erode (material) in this way.
263 / 366
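
A minimal sketch of feature ablation; effectiveness is assumed to be a function that, given a dataset, learns and evaluates a model (e.g., with CV) with whatever learning technique, and returns an effectiveness index (the larger, the better):

import java.util.function.Function;

class FeatureAblation {
  static double[] importances(double[][] xs, Function<double[][], Double> effectiveness) {
    int p = xs[0].length;
    double onFull = effectiveness.apply(xs); // effectiveness on D
    double[] importance = new double[p];
    for (int j = 0; j < p; j++) {
      // effectiveness on D', i.e., D without the j-th variable
      importance[j] = onFull - effectiveness.apply(removeVar(xs, j));
    }
    return importance; // rank the variables by decreasing importance
  }

  static double[][] removeVar(double[][] xs, int j) {
    double[][] out = new double[xs.length][];
    for (int i = 0; i < xs.length; i++) {
      double[] x = new double[xs[i].length - 1];
      for (int k = 0, k2 = 0; k < xs[i].length; k++) {
        if (k != j) {
          x[k2++] = xs[i][k];
        }
      }
      out[i] = x;
    }
    return out;
  }
}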

Variable importance as basic interpretability

In summary, for variable importance, we have three options:

Option | Effectiveness | Efficiency | Applicability
Mean RSS/Gini decrease | 🤏¹ | 👍² | 🤏 only trees
Mean accuracy decrease | 👍 | 👍³ | 🤏 bagging
Feature ablation | 👍⁴ | 🤏 | 👍 universal
  1. not robust to many branches; on learning data
  2. during learning, for free
  3. during learning, almost free
  4. still not perfect: what about redundant variables?

Regardless of the method you use for computing the variable importance, a ranking of the variables according to their importance for having a good model is a basic form of interpretability, as it answers the question:

  • what does the model consider as important for doing predictions?

that should mean:

  • what parts of the system are important according to the model of the system? (global explainability)
264 / 366

Random Forest: summary

Applicability: same as trees 👍👍👍

  • 👍 YY: both regression and classification (binary and multiclass)
  • 👍 XX: multivariate XX with both numerical and categorical variables
  • 👍 models give probability¹
  • 👍 practically parameter-free

Efficiency 👍

  • 👍 in practice, pretty fast in learning and prediction phase (ntree×n\subtext{tree} \times slower than tree)

Explainability/interpretability 👍👍

  • 👍 the models give variable importance (basic global explainability)
  • 👍 the learning technique is itself comprehensible
    • you should be able to implement it by yourself

Unless¹ you really need to look at the tree, Random Forest is always better than the single tree:

  • much much better in effectiveness
  • not really worse in efficiency
  • worse in interpretability (but who cares? see 1)
265 / 366

Random Forest effectiveness

Some researchers did a large scale comparison of many supervised machine learning techniques:

  • Fernández-Delgado, Manuel, et al. "Do we need hundreds of classifiers to solve real world classification problems?." The journal of machine learning research 15.1 (2014): 3133-3181.

Effectiveness of some supervised learning techniques

Delgado et al. abstract

We evaluate 179 classifiers arising from 17 families [...]
We use 121 data set [...]
The classifiers most likely to be the bests are the random forest [...]

According to practice, we just need Random Forest. But...

266 / 366

No free lunch theorem

Earlier, some researchers formulated the No Free Lunch theorem¹:

  • Wolpert, David H. "The lack of a priori distinctions between learning algorithms." Neural computation 8.7 (1996): 1341-1390.

Any two optimization algorithms¹ are equivalent when their performance is averaged across all possible problems²

  1. Wolpert's 1996 paper is about learning algorithms; a later paper by Wolpert (1997) extends the theorem to optimization algorithms and gives the theorem its name
  2. not an actual fragment of the paper, but a recap of the same authors in a later paper

According to theory, all learning techniques are the same.

  • if we consider all (theoretically all!) problems...
  • my advice: start with Random Forest, then see where to spend your time
267 / 366

Why "No Free Lunch"?

There are many restaurants, each offering all the food items on the menu: the price of a given item is, in general, different among restaurants.

Where should you go to eat?

If you just want to eat something, there is no restaurant where everything costs less.

  • 🤤 eater \leftrightarrow ML designer
  • 🏩 restaurant \leftrightarrow ML technique
  • 🥗 food \leftrightarrow ML problem
  • 💵 price \leftrightarrow effectiveness

But if you know what you want to eat, there's at least one restaurant where that thing has the lowest price.

  • Question: what does this mean in practice?
268 / 366

Support Vector Machines

269 / 366

Building on the weakness of the tree

Binary classification problem for SVM: just data

Dataset:

  • Y={,}Y=\{\c{1}{●},\c{2}{●}\}
  • X=R2X=\mathbb{R}^2

A single tree, here, would struggle in establishing a good decision boundary: many corners, many branch nodes.

By looking at the data, we see that a simple line would likely be a good decision boundary

  • recall: the decision boundary in classification is where the model changes the yy when xx crosses it

Can we draw that simple line?

270 / 366

Line as decision boundary

Binary classification problem for SVM: just data

Yes, we can! Here it is!

Despite its apparent simplicity, this "draw the line" operation implies:

  • we think that a line can be used to tell apart the \c{1}{●} and \c{2}{●} points
    • the line is a model
    • we know how to use a model
  • we executed some procedure for finding the line out of the data

Implicitly, we already defined MM, flearn:P(R2×Y)Mf'\subtext{learn}: \mathcal{P}^*(\mathbb{R}^2 \times Y) \to M, and fpredict:R2×MYf'\subtext{predict}: \mathbb{R}^2 \times M \to Y

  • i.e., we defined a new learning technique 🤗

We followed the same approach for trees: now we are more experienced and we can go faster in formalizing it.

271 / 366

Line as a model

Formally, a line-shaped decision boundary in X=R2X=\mathbb{R}^2 can be defined as x2=mx1+qx_2=m x_1 +q where mm is the slope and qq is the intercept.

Alternatively, as: β0+β1x1+β2x2=0\beta_0+\beta_1 x_1+\beta_2 x_2=0 (there are many triplets (β0,β1,β2)(\beta_0, \beta_1, \beta_2) defining the same line)

More in general, in X=RpX=\mathbb{R}^p, we can define a separating hyperplane as: β0+β1x1++βpxp=0\beta_0+\beta_1 x_1+\dots+\beta_p x_p=0 or, in vectorial form, with β,xRp\vect{\beta}, \vect{x} \in \mathbb{R}^p, as: β0+βx=0\beta_0+\vect{\beta}^\intercal\vect{x}=0

  • separating, because it can be used to separate the space in two parts
  • hyperplane, because we are in Rp\mathbb{R}^p (p=1p=1: threshold; p=2p=2: line; p=3p=3: plane; p>3p>3: hyperplane)
272 / 366

Using a separating hyperplane

Binary classification problem for SVM: just data

Intuitively:

  • if the point x\vect{x} is above the line, then y=y=\c{2}{●}
  • else, if the point x\vect{x} is below the line, then y=y=\c{1}{●}
  • else, if the point x\vect{x} is on the line, then 🤔

Formally:

  • x\vect{x} is on the line iff β0+β1x1+β2x2=0\beta_0+\beta_1 x_1+\beta_2 x_2 \c{3}{=} 0
  • x\vect{x} is above the line iff β0+β1x1+β2x2>0\beta_0+\beta_1 x_1+\beta_2 x_2 \c{3}{>} 0
  • x\vect{x} is below the line iff β0+β1x1+β2x2<0\beta_0+\beta_1 x_1+\beta_2 x_2 \c{3}{<} 0

Example: This particular line is: 2+1.1x1+x2=02+1.1 x_1 + x_2 = 0

For x=(10,10)\vect{x}=(10,10):

  • 2+1.1x1+x2=2+11+10=23>02+1.1 x_1 + x_2 = 2+11+10=23 \c{3}{>} 0
  • hence y=y=\c{2}{●} (above)

For x=(10,10)\vect{x}=(-10,-10):

  • 2+1.1x1+x2=21110=19<02+1.1 x_1 + x_2 = 2-11-10=-19 \c{3}{<} 0
  • hence y=y=\c{1}{●} (below)
273 / 366

fpredictf'\subtext{predict} with a separating hyperplane

fpredictf'\subtext{predict}x,(β0,β)\vect{x},(\beta_0,\vect{\beta})yy

function predict(x,(β0,β))\text{predict}(\vect{x}, \c{1}{(\beta_0, \vect{\beta})}) {
if β0+βx0\beta_0+\vect{\beta}^\intercal\vect{x} \c{2}{\ge} 0 then {
return Pos\text{Pos}
} else {
return Neg\text{Neg}
}
}

Assumptions:

  • Y={Pos,Neg}Y = \{\text{Pos},\text{Neg}\}
    • binary classification only!¹
  • X=RpX = \mathbb{R}^p
    • numerical independent variables only!²
  • (β0,β)(\beta_0, \vect{\beta}) is the model
  • y=Posy = \text{Pos} for both the >> and == cases
    • y=Negy = \text{Neg} for <<, i.e., otherwise
  • computationally very fast: just pp multiplications and sums
  1. we'll see later how to port this to the case of Y>2|Y| > 2
  2. we'll see later how to port this to the case of categorical variables
274 / 366

Separating hyperplane with probability

Intuitively, for β0+βx\beta_0+\vect{\beta}^\intercal\vect{x}

  • the greater (positive and large), the more satisfied the 0\ge 0 condition, hence the more positive
  • the smaller (negative and large), the more satisfied the <0< 0 condition, hence the more negative

function predict(x,(β0,β))\text{predict}(\vect{x}, (\beta_0, \vect{\beta})) {
if β0+βx0\c{3}{\beta_0+\vect{\beta}^\intercal\vect{x}} \ge 0 then {
return Pos\text{Pos}
} else {
return Neg\text{Neg}
}
}

Can we use this like a probability? Can we have an fpredictf''\subtext{predict} for the hyperplane?

  • recall the single tree: fpredict(x,t)=(35,25)f''\subtext{predict}(x,t)=(\c{1}{● \smaller{\frac{3}{5}}}, \c{2}{● \smaller{\frac{2}{5}}}) question: can we infer something about n=Dlearnn=|D\subtext{learn}| from this?
  • recall the bag (assume ntree=100n\subtext{tree}=100): fpredict(x,{tj}j)=(38100,62100)f''\subtext{predict}(x,\seq{t_j}{j})=(\c{1}{● \smaller{\frac{38}{100}}}, \c{2}{● \smaller{\frac{62}{100}}})

No! Because β0+βx\beta_0+\vect{\beta}^\intercal\vect{x} is not bounded in [0,1][0,1]

  • we can still use it as a measure of confidence: the smaller β0+βx|\beta_0+\vect{\beta}^\intercal\vect{x}|, the lower the confidence in the decision; in the extreme case β0+βx=0|\beta_0+\vect{\beta}^\intercal\vect{x}|=0 means no confidence, i.e., both y=Posy=\text{Pos} and y=Negy=\text{Neg} are ok

You may map the domain of β0+βx\beta_0+\vect{\beta}^\intercal\vect{x}, i.e., [,+][-\infty,+\infty] to [0,1][0,1] with, e.g., tanh\tanh: if x[,+]x \in [-\infty,+\infty], then 12+12tanh(x)[0,1]\frac{1}{2}+\frac{1}{2}\tanh(x) \in [0,1].
But this is not a common practice, because it still would not be a real probability.

275 / 366
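
A minimal Java sketch of this fpredictf'\subtext{predict} with the model (β0,β)(\beta_0,\vect{\beta}), plus the tanh\tanh-based confidence-like value just discussed (which, again, is not a probability):

class HyperplaneClassifier {
  // beta0 + beta^T x: just p multiplications and sums
  static double score(double[] x, double beta0, double[] beta) {
    double s = beta0;
    for (int j = 0; j < x.length; j++) {
      s = s + beta[j] * x[j];
    }
    return s;
  }

  // Pos iff beta0 + beta^T x >= 0
  static boolean predictPos(double[] x, double beta0, double[] beta) {
    return score(x, beta0, beta) >= 0;
  }

  // maps the unbounded score to [0,1]: confidence-like, not a probability
  static double confidence(double[] x, double beta0, double[] beta) {
    return 0.5 + 0.5 * Math.tanh(score(x, beta0, beta));
  }
}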

Learning the separating hyperplane

Binary classification problem for SVM: just data

How to choose the separating line?

First attempt:

Choose the one that:

  • perfectly separates the \c{1}{●} and \c{2}{●} points

🫣 this condition holds, in general, for infinitely many lines...

Second attempt:

Choose the one that:

  • perfectly separates the \c{1}{●} and \c{2}{●} points and
  • is the farthest from the closest points
277 / 366

The maximal margin classifier

Binary classification problem for SVM: just data

The hyperplane that

  • perfectly separates the Pos\text{Pos} and Neg\text{Neg} points and
  • is the farthest from the closest points

is called the maximal margin classifier (MMC).

Maximal margin classifier:

  • classifier, because it can be used for classifying points,
    • since it is a separating hyperplane that divides the space in two portions
  • maximal margin: because it is the one leaving the largest distance (margin) from the closest points
279 / 366

Support vectors

Binary classification problem for SVM: just data

Names:

  • the band between the two dashed lines (through the solid separating line) is the margin
  • the points lying on the edge of the margin are called support vectors
    • they support the band in its position, like nails 📍 with a wooden ruler 📏
    • they are points in Rp\mathbb{R}^p, hence vectors
    • here, two of one class and one of the other

If you move (not too much) any of the points which are not support vectors, the separating hyperplane stays the same!

280 / 366

Learning the maximal margin classifier

Intuitively:

Choose the one that:

  • perfectly separates the Pos\text{Pos} and Neg\text{Neg} points and
  • is the farthest from the closest points

Looks like an optimization problem:

  • "perfectly separates" \rightarrow constraint
  • "is the farthest" \rightarrow objective

Formally:

\begin{align*} \max_{\beta_0, \dots, \beta_p} & \; \c{4}{m} \\ \text{subject to} & \; \c{3}{\sum_{j=1}^{j=p} \beta_j^2 = \vect{\beta}^\intercal\vect{\beta} = 1} \\ & \; \c{3}{y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m} & \c{3}{\forall i \in \{1, \dots, n\}} \end{align*} that means:

  • find the largest mm, such that
  • every point x(i)\vect{x}^{(i)} is at a distance m\ge m from the hyperplane
  • and is on the proper side

Assume by convention that Pos+1\text{Pos} \leftrightarrow +1 and Neg1\text{Neg} \leftrightarrow -1, so y(i)()my^{(i)}(\dots) \ge m is like m\dots \ge m for positives and m\dots \le -m for negatives

  • β0,,βp\beta_0, \dots, \beta_p, that is the model (β0,β)(\beta_0, \vect{\beta}), is what we are looking for
  • mathematically, if j=1j=pβj2=1\sum_{j=1}^{j=p} \beta_j^2 = 1, then β0+βx\beta_0+\vect{\beta}^\intercal\vect{x} is the Euclidean distance of x\vect{x} from the hyperplane (with sign)
  • y(i)(β0+βx(i))my^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m is == for support vectors and >> for the other points
281 / 366

flearnf'\subtext{learn} for the maximal margin classifier

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}(β0,β)(\beta_0,\vect{\beta})

function learn({(x(i),y(i))}i)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(β0,β)solve((\beta_0,\vect{\beta}) \gets \c{1}{\text{solve}(}
maxβ0,,βpm,\max_{\beta_0,\dots,\beta_p} m,
ββ=1y(i)(β0+βx(i))m,i\vect{\beta}^\intercal\vect{\beta}= 1 \land y^{(i)}(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}) \ge m, \forall i
))
return (β0,β)(\beta_0,\vect{\beta})
}

  • solve()\text{solve}() is just a solver for numerical optimization problems which takes the objective and the constraints

In practice, this is an easy optimization problem and solving it is fast! for a computer

282 / 366

Maximal margin classifier learning

This learning technique is called maximal margin classifier learning.

Efficiency: 👍

  • 👍👍👍 very fast, both in learning and prediction

Applicability: 🫳

  • 🫳 just binary classification more on this later
  • 🫳 just numerical variables more on this later
  • 👍 parameter-free!

Effectiveness: 🤔

  • overfitting? well, no flexibility, so... 🤔
    • what's complexity here? the size of the model is always p+1p+1

function learn({(x(i),y(i))}i)\text{learn}(\seq{(\vect{x}^{(i)},y^{(i)})}{i}) {
(β0,β)solve((\beta_0,\vect{\beta}) \gets \text{solve}(
maxβ0,,βpm,\max_{\beta_0,\dots,\beta_p} m,
ββ=1y(i)(β0+βx(i))m,i\vect{\beta}^\intercal\vect{\beta}= 1 \land y^{(i)}(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}) \ge m, \forall i
))
return (β0,β)(\beta_0,\vect{\beta})
}

function predict(x,(β0,β))\text{predict}(\vect{x}, (\beta_0, \vect{\beta})) {
if β0+βx0\beta_0+\vect{\beta}^\intercal\vect{x} \ge 0 then {
return Pos\text{Pos}
} else {
return Neg\text{Neg}
}
}

283 / 366

Maximal margin classifier: issue 1

Binary classification problem for SVM: just data

Support vectors:

  • they support the band in its position, like nails 📍 with a wooden ruler 📏
  • here, two of one class and one of the other

If you move (not too much) any of the points which are not support vectors, the separating hyperplane stays the same!

But, if you move a support vector, then the separating hyperplane moves!

  • i.e., for small changes of (some) observations (apply some noise to some x(i)\vect{x}^{(i)}), the model changes: looks like variance
284 / 366

Maximal margin classifier: issue 2

Binary classification problem for SVM: just data

Even worse, if you apply some noise¹ to some label y(i)y^{(i)}, it might be that a separating hyperplane does not exist at all! 😱

  • in practice, the solve()\text{solve}() function just halts and says "there's no solution for this optimization problem".

\Rightarrow Applicability: 👎👎👎

How did the tree cope with yy noise?

  • simply by tolerating² some wrong classifications also on the learning data

Can we make MMC tolerant too?

  1. noise to the yy: recall the carousel attendant's kids...
  2. if ntreen\subtext{tree} was large enough
285 / 366

Introducing tolerance (1st formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \\ & \; \sum_{i=1}^{i=n} \c{1}{\epsilon^{(i)}} \le \c{2}{c} \end{align*}

  • ϵ(1),,ϵ(n)\epsilon^{(1)},\dots,\epsilon^{(n)} are positive slack variables:
    • one for each observation
    • they act as tolerance w.r.t. the margin
      • ϵ(i)=0\epsilon^{(i)}=0 means x(i)\vect{x}^{(i)} has to be out of the margin, on correct side
      • ϵ(i)(0,1]\epsilon^{(i)} \in (0,1] means x(i)\vect{x}^{(i)} can be inside the margin, on the correct side
      • ϵ(i)>1\epsilon^{(i)} > 1 means x(i)\vect{x}^{(i)} can be on wrong side
  • cR+\c{2}{c} \in \mathbb{R}^+ (for cost), is a budget of tolerance, which is a parameter of the learning technique

This learning technique is called soft margin classifier (SMC, or support vector classifier), because, due to tolerance, the margin can be pushed.

It has one parameter, cc:

  • c=0c=0 corresponds to maximal margin classifier (no tolerance)
286 / 366

Role of the parameter cc (in 1st formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \\ & \; \sum_{i=1}^{i=n} \c{1}{\epsilon^{(i)}} \le \c{2}{c} \end{align*}

c=+c=+\infty \rightarrow infinite tolerance \rightarrow you can put the line wherever you want

  • from another point of view, you can move the points a lot and the line stays the same
  • hence the model is the same irrespective of learning data \Rightarrow high bias

c=0c=0 \rightarrow no tolerance \rightarrow any noise will change the model

  • hence high variance
  • even worse: if cc is too small, this is an \approx MMC
    • for a given dataset, there is a clearnablec\subtext{learnable} such that if c<clearnablec<c\subtext{learnable} no model is learnable 😱
287 / 366

Variable scale

The threshold clearnablec\subtext{learnable} for learnability depends:

  • on nn, for the summation i=1i=n\sum_{i=1}^{i=n}
  • on pp, because of β0+βx(i)\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)} the larger pp, the longer the summation, as βx(i)=j=1j=pβjxj\vect{\beta}^\intercal\vect{x}^{(i)}=\sum_{j=1}^{j=p} \beta_j x_j
  • on the actual scales of the variables

Actually, the margin mm of the MMC itself depends on the scales of variables!

Trivial dataset before scaling: D={(1,1,),(3,3,)}D = \{(1,1,\c{1}{●}), (3,3,\c{2}{●})\}, with margin m=12+12=2m=\sqrt{1^2+1^2}=\sqrt{2}

Scaled dataset (each xjx_j is ×12\times \frac{1}{2}): D={(0.5,0.5,),(1.5,1.5,)}D = \{(0.5,0.5,\c{1}{●}), (1.5,1.5,\c{2}{●})\}, with margin m=122+122=12m=\sqrt{\frac{1}{2^2}+\frac{1}{2^2}}=\frac{1}{\sqrt{2}}

288 / 366

Variable scale and hyperplane

Moreover, the coefficients βj\beta_j depend on the scales of the variables too!

Intuitively: if

  • xj[1.4,2.1]x_j \in [1.4, 2.1] (might be the height in meters)
  • and xj[20000,50000]x_{j'} \in [20000, 50000] (might be the annual income in €)

then βj\beta_j will be much different than βj\beta_{j'}, making the computation of β0+βx(i)\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)} (and hence the model) rather sensitive to noise.

Hence, when using MMC (or SMC, or SVM), you¹ should rescale the variables. Options:

  • min-max scaling: xj(i)=xj(i)minixj(i)maxixj(i)minixj(i)x^{\prime(i)}_j = \frac{x^{(i)}_j - \min_{i'} x^{(i')}_j}{\max_{i'} x^{(i')}_j - \min_{i'} x^{(i')}_j} where minixj(i)\min_{i'} x^{(i')}_j is the min of xjx_j in DD
  • standardization: xj(i)=1σj(xj(i)μj)x^{\prime(i)}_j = \frac{1}{\sigma_j} \left(x^{(i)}_j - \mu_j\right) where μj\mu_j and σj\sigma_j are the mean and standard deviation of xjx_j in DD

Standardization is, in general, preferred as it is more robust to outliers.

  1. In practice, most of the ML sw/libraries do it internally.
289 / 366
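
A minimal sketch of standardization, with the coefficients computed on the learning data only (computing them on the entire dataset would be cheating, as noted in the next slide):

class Standardizer {
  final double[] mu; // means, one per variable
  final double[] sigma; // standard deviations, one per variable

  // fit mu and sigma on the learning data
  Standardizer(double[][] xs) {
    int n = xs.length;
    int p = xs[0].length;
    mu = new double[p];
    sigma = new double[p];
    for (double[] x : xs) {
      for (int j = 0; j < p; j++) {
        mu[j] = mu[j] + x[j] / n;
      }
    }
    for (double[] x : xs) {
      for (int j = 0; j < p; j++) {
        sigma[j] = sigma[j] + (x[j] - mu[j]) * (x[j] - mu[j]) / n;
      }
    }
    for (int j = 0; j < p; j++) {
      sigma[j] = Math.sqrt(sigma[j]);
    }
  }

  // apply the same transformation in both learning and prediction
  double[] scale(double[] x) {
    double[] scaled = new double[x.length];
    for (int j = 0; j < x.length; j++) {
      scaled[j] = (x[j] - mu[j]) / sigma[j];
    }
    return scaled;
  }
}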

Scaling as part of the model

Since you have to do the scaling both in learning and prediction, the coefficients needed for scaling (i.e., min,max\min, \max or μ,σ\mu, \sigma) do belong to the model!

Learning with scaling: (here, standardization)

{(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}scaling{(x(i),y(i))}i\seq{(\vect{x}^{\prime(i)},y^{(i)})}{i}flearnf'\subtext{learn}mmjoin(m,μ,σ)(m,\vect{\mu},\vect{\sigma})(μ,σ)(\vect{\mu},\vect{\sigma})

(m,μ,σ)(m,\vect{\mu},\vect{\sigma}) is the model with scaling, with μ,σRp\vect{\mu},\vect{\sigma} \in \mathbb{R}^p. Here, join builds a tuple

Prediction with scaling:

x,(m,μ,σ)\vect{x},\c{2}{(m,\vect{\mu},\vect{\sigma})}splitx,μ,σ\vect{x},\vect{\mu},\vect{\sigma}scalex\vect{x}'joinx,m\vect{x}',mfpredictf'\subtext{predict}yymm

If you use the entire dataset (e.g., in CV, or in train/test static division) for computing μ,σ\vect{\mu},\vect{\sigma}, then you are cheating!
Question: can you write down the pseudocode of "scale"? And scaling? Are they the same?

290 / 366

Introducing tolerance (2nd formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m - \c{2}{c} \c{1}{\sum_{i=1}^{i=n} \epsilon^{(i)}} \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \end{align*}

  • ϵ(1),,ϵ(n)\epsilon^{(1)},\dots,\epsilon^{(n)} are again positive slack variables
  • their sum is unbounded, but it is accounted for negatively in the objective: basically, this is a sort-of bi-objective optimization:
    • maximize mm
    • minimize i=1i=nϵ(i)\sum_{i=1}^{i=n} \epsilon^{(i)}
  • cR+\c{2}{c} \in \mathbb{R}^+ is a weighting parameter setting the relative weight of the two objectives; it is a parameter of the learning technique

This is also the learning technique called soft margin classifier.

Most of the ML sw/libraries are based on this formulation.

The 1st one is often shown in books, e.g., in James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013

291 / 366

Role of the parameter cc (in 2nd formulation)

\begin{align*} \max_{\beta_0, \dots, \beta_p,\c{1}{\epsilon^{(1)},\dots,\epsilon^{(n)}}} & \; m - \c{2}{c} \c{1}{\sum_{i=1}^{i=n} \epsilon^{(i)}} \\ \text{subject to} & \; \vect{\beta}^\intercal\vect{\beta} = 1 \\ & \; y^{(i)}\left(\beta_0+\vect{\beta}^\intercal\vect{x}^{(i)}\right) \ge m\c{1}{(1-\epsilon^{(i)})} & \forall i \in \{1, \dots, n\} \\ & \; \c{1}{\epsilon^{(i)}} \ge 0 & \forall i \in \{1, \dots, n\} \end{align*}

c=0c = 0 \rightarrow no weight to i=1i=nϵ(i)\sum_{i=1}^{i=n} \epsilon^{(i)} \rightarrow points that are inside the margin cost zero

  • you can put the line wherever you want
  • hence, the model is the same irrespective of learning data \Rightarrow high bias

c=+c = +\infty \rightarrow infinite weight to i=1i=nϵ(i)\sum_{i=1}^{i=n} \epsilon^{(i)} \rightarrow points that are inside the margin cost a lot

  • max effort to put all points outside the margin
  • from another point of view, the margin is very sensitive to point positions \Rightarrow high variance
  • but still, with huge cost, a model can be learned!
292 / 366

SMC: sims and diffs of the two formulations

Similarities:

  • there is one learning parameter (called cc)
  • cc is a flexibility parameter

Differences:

  • cc extreme values:
    • c=+c=+\infty (1st) and c=0c=0 (2nd) for high bias
    • c=0c=0 (1st) and c=+c=+\infty (2nd) for high variance
  • learnability:
    • with the 2nd, you can always learn a model from any dataset DD
    • with the 1st, given a DD, there is a clearnable0c\subtext{learnable} \ge 0 such that if you set c<clearnablec < c\subtext{learnable} you cannot learn a model from DD
      • clearnable=0c\subtext{learnable}=0 if the data is linearly separable

In practice:

  • most of the ML sw/libraries are based on the 2nd formulation
  • you should find (e.g., with CV) a proper value for cc
293 / 366

Always learn...

A not linearly separable binary classification dataset

Yes, with the 2nd formulation, we can learn an SMC, but it will be a poor model:

  • simply, the decision boundary here is not a straight line
  • a line is naturally unable to model the system

More in general, not every binary classification problem can be solved with a hyperplane.

Can we learn non linear decision boundaries?

294 / 366

Beyond the hyperplane: disclaimer

Yes, we can!

But...

Disclaimer

There will be some harder mathematics. We are going to make it simple.

To simplify it, we'll walk riskily on the edge of correctness...

295 / 366

An alternative formulation for fpredictf'\subtext{predict}

First, let's give a name to the core computation of fpredictf'\subtext{predict}: f(x)=β0+βx=β0+j=1j=pβjxjf(\vect{x}) = \beta_0 + \vect{\beta}^\intercal \vect{x}=\beta_0 + \sum_{j=1}^{j=p} \beta_j x_j with f:RpRf: \mathbb{R}^p \to \mathbb{R}.

It turns out that this same ff can be written also as: f(x)=β0+i=1i=nα(i)x,x(i)f(\vect{x})=\beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} \left\langle \vect{x}, \vect{x}^{(i)} \right\rangle where x,x=xx=j=1j=pxjxj\left\langle \vect{x}, \vect{x}' \right\rangle = \vect{x}^\intercal \vect{x}' = \sum_{j=1}^{j=p} x_j x'_j is the inner product.

,:Rp×RpR\langle \cdot,\cdot \rangle: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} can also be defined on other sets than Rp\mathbb{R}^p, so it's not just xx\vect{x}^\intercal \vect{x}'...

Remarks:

  • there are p+1p+1 β\beta coeffs and nn α\alpha coeffs
    • in general, they are different in value
  • for the first formulation, during optimization you give the x(i)\vect{x}^{(i)} to solve()\text{solve}() and obtain the β\beta coeffs
    • once you fix {(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i}, you completely define f(x)f(\vect{x})
  • same for the second formulation
    • once you fix {(x(i),y(i))}i\seq{(\vect{x}^{(i)},y^{(i)})}{i} and the α\alpha coeffs, you completely define f(x)f(\vect{x})
    • the α\alpha coeffs are just needed to make the two functions the same
296 / 366

The support vectors and the α\alpha coeffs

Binary classification problem for SVM: just data

Given that:

β0+βx=f(x)=β0+i=1i=nα(i)xx(i)\beta_0 + \vect{\beta}^\intercal \vect{x} = f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} \vect{x}^\intercal \vect{x}^{(i)}

If you move¹ any point which is not a support vector, by definition f(x)f(\vect{x}) must stay the same:

  • so the β\beta coeffs must stay the same
  • so the α\alpha coeffs must stay the same

Hence, it follows that α(i)=0\alpha^{(i)}=0 for every x(i)\vect{x}^{(i)} which is not a support vector!

More in general each α(i)\alpha^{(i)} says what's the contribution of the corresponding x(i)\vect{x}^{(i)} when classifying x\vect{x}: 00 means no contribution.

From the point of view of the optimization, solve()\text{solve}() for the second formulation gives (β0,α)(\beta_0, \vect{\alpha}), with αRn\vect{\alpha} \in \mathbb{R}^n: this also says which are the support vectors. Similarly, the model is (β0,α)(\beta_0, \vect{\alpha}) instead of (β0,β)(\beta_0, \vect{\beta}).

  1. Without making it a support vector.
297 / 366

The kernel

Ok, but what about going beyond the hyperplane? We are almost there...

The second formulation may be generalized: f(x)=β0+i=1i=nα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} \c{2}{k\left(\vect{x}, \vect{x}^{(i)}\right)} where k:Rp×RpRk: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R} is a kernel function.

The idea behind the kernel function is to:

  1. transform the original space X=RpX=\mathbb{R}^p into another space X=RqX'=\mathbb{R}^q, with possibly qpq \gg p, with a ϕ:XX\phi: X \to X', and then
  2. to compute the inner product in the destination space, i.e., k(x,x(i))=ϕ(x)ϕ(x(i))k(\vect{x}, \vect{x}^{(i)})= \phi(\vect{x})^\intercal \phi(\vect{x}^{(i)})

hoping that a hyperplane can separate the points in XX' better than in XX.

This thing is called the kernel trick. Understanding the math behind it is beyond the scope of this course. Understanding the way the optimization works with a kernel is beyond the scope of this course.

When you use a kernel, this technique is called Support Vector Machines (SVM) learning.

298 / 366

Common kernels

Linear kernel:

k(x,x)=xxk(\vect{x}, \vect{x}') = \vect{x}^\intercal \vect{x}'

  • the most efficient (computationally cheapest)

Polynomial kernel:

k(x,x)=(1+xx)dk(\vect{x}, \vect{x}') = (1+\vect{x}^\intercal \vect{x}')^d

  • dd, the degree of the kernel, is a parameter

Gaussian kernel:

k(x,x)=eγxx2k(\vect{x}, \vect{x}') = e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2}

  • xx2\lVert \vect{x} - \vect{x}' \rVert^2 is the squared Euclidean distance of x\vect{x} to x\vect{x}'
  • γ\gamma is a parameter
  • also called radial basis function (RBF), or just radial, kernel
  • the most widely used

f(x)=β0+i=1i=nα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} k\left(\vect{x}, \vect{x}^{(i)}\right)

Regardless of the kernel being used, each α(i)\alpha^{(i)} says what's the contribution of the corresponding x(i)\vect{x}^{(i)} when evaluating f(x)f(\vect{x}) inside fpredictf'\subtext{predict}.

299 / 366
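
A minimal sketch of the three kernels and of f(x)=β0+iα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_i \alpha^{(i)} k(\vect{x}, \vect{x}^{(i)}); the support vectors are simply the x(i)\vect{x}^{(i)} with a non-zero α(i)\alpha^{(i)}:

import java.util.function.ToDoubleBiFunction;

class Kernels {
  static double linear(double[] x1, double[] x2) {
    double s = 0; // the inner product <x1, x2>
    for (int j = 0; j < x1.length; j++) {
      s = s + x1[j] * x2[j];
    }
    return s;
  }

  static double polynomial(double[] x1, double[] x2, int d) {
    return Math.pow(1 + linear(x1, x2), d); // d is the degree
  }

  static double gaussian(double[] x1, double[] x2, double gamma) {
    double d2 = 0; // squared Euclidean distance
    for (int j = 0; j < x1.length; j++) {
      d2 = d2 + (x1[j] - x2[j]) * (x1[j] - x2[j]);
    }
    return Math.exp(-gamma * d2);
  }

  // f(x) = beta0 + sum_i alpha_i k(x, x_i); alpha_i = 0 for non support vectors
  static double f(double[] x, double beta0, double[] alphas, double[][] xs,
      ToDoubleBiFunction<double[], double[]> k) {
    double s = beta0;
    for (int i = 0; i < xs.length; i++) {
      s = s + alphas[i] * k.applyAsDouble(x, xs[i]);
    }
    return s;
  }
}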

Inside the Gaussian kernel (humbly, toy)

k(x,x)=eγxx2k(\vect{x}, \vect{x}') = e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2} and f(x)=β0+i=1i=nα(i)k(x,x(i))f(\vect{x}) = \beta_0 + \sum_{i=1}^{i=n} \alpha^{(i)} k\left(\vect{x}, \vect{x}^{(i)}\right)

  • eγxx2[0,1]e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2} \in [0,1]; xx2\lVert \vect{x} - \vect{x}' \rVert^2 is the squared distance of x\vect{x} to x\vect{x}'
  • the larger γ\gamma, the faster eγxx2e^{-\gamma \lVert \vect{x} - \vect{x}' \rVert^2} goes to 00 with distance

Let's consider a point x\vect{x} moving from (0,3.5)(0,3.5) to (6,3.5)(6,3.5):

  • think about its correct color, while moving
  • put it on the 3D plane, consider its 3 α\alpha, draw decision boundary

[Interactive figure: Gaussian kernel with γ=0.1\gamma=0.1, γ=1\gamma=1, γ=10\gamma=10; three support vectors in 2D; 3D canvas]

300 / 366

Intuitive interpretation Gaussian kernel

Intuitively, and very broadly speaking, the Gaussian kernel maps an x\vect{x} to the space where coordinates are the distances to relevant observations of the learning data.

In practice, the decision boundary can smoothly follow any path:

  • with some risk of overfitting

Drawing of an SVM decision boundary (image from Wikipedia)

301 / 366

SVM: summary

Efficiency 👍👍👍

  • 👍 very fast

Explainability/interpretability 🫳

  • 👎 few numbers, but hardly interpretable, no global explainability
    • knowing which points are the support vectors is better than nothing...
  • 😶 the learning technique is pure optimization
  • 👍 confidence may be used as basic form of local explainability

Effectiveness 👍👍

  • 👍 in general good with the Gaussian kernel
    • but complex interactions between cc and γ\gamma require choosing parameter values carefully

Applicability 🫳

  • 🫳 YY: only binary classifications
  • 🫳 XX: only numerical variables
  • 👍 models give a confidence
  • 🫳 with two parameters (cc and γ\gamma)
302 / 366

Improving applicability

303 / 366

XX, YY and applicability

Let X=X1××XpX=X_1 \times \dots \times X_p:

XjX_j | YY | RF | SVM
Numerical | Binary classification | ✅ | ✅
Categorical | Binary classification | ✅ | ❌
Numerical + Categorical | Binary classification | ✅ | ❌
Numerical | Multiclass classification | ✅ | ❌
Categorical | Multiclass classification | ✅ | ❌
Numerical + Categorical | Multiclass classification | ✅ | ❌
Numerical | Regression | ✅ | ❌
Categorical | Regression | ✅ | ❌
Numerical + Categorical | Regression | ✅ | ❌

Let's start by fixing SVM!

304 / 366

From categorical to numerical variables

Let xjx_j be categorical:

  • xjXj={xj,1,,xj,k}x_j \in X\sub{j} = \{x\sub{j,\c{1}{1}},\dots,x\sub{j,\c{1}{k}}\} (i.e., kk different values)

Then, we can replace it with kk numerical variables:

  • xh1Xh1={0,1}x_{h_1} \in X_{h_1} = \{0,1\}
  • ...
  • xhkXhk={0,1}x_{h_k} \in X_{h_k} = \{0,1\}

such that: i,k:xhk(i)=1(xj(i)=xj,k)\forall i, k: x^{(i)}_{h_k}=\mathbf{1}(x^{(i)}_j=x\sub{j,k})

This way of encoding one categorical variable with kk possible values to kk binary numerical variables is called one-hot encoding.

Each one of the resulting binary variables is a dummy variable.

A similar encoding can be applied when Xj=P(A)X_j=\mathcal{P}(A).

Example: (extended carousel)

Original features: age, height, city p=3p=3

  • X=R+×R+×{Ts,Ud,Ve,Pn,Go}X = \mathbb{R}^+ \times \mathbb{R}^+ \times \c{2}{\{\text{Ts},\text{Ud},\text{Ve},\text{Pn},\text{Go}\}}

Transformed features: p=7p=7

  • X=R+×R+×{0,1}5X' = \mathbb{R}^+ \times \mathbb{R}^+ \times \c{2}{\{0,1\}^5}

with:

  • xTs(i)=1(xcity(i)=Ts)x^{(i)}\subtext{Ts} = \mathbf{1}(x^{(i)}\subtext{city}=\text{Ts})
  • xUd(i)=1(xcity(i)=Ud)x^{(i)}\subtext{Ud} = \mathbf{1}(x^{(i)}\subtext{city}=\text{Ud})
  • ...

hence, e.g.:

  • (11,153,Ts)(11,153,1,0,0,0,0)(11,153,\c{2}{\text{Ts}}) \mapsto (11,153,\c{2}{1,0,0,0,0})
  • (79,151,Ud)(79,151,0,1,0,0,0)(79,151,\c{2}{\text{Ud}}) \mapsto (79,151,\c{2}{0,1,0,0,0})
305 / 366
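
A minimal sketch of one-hot encoding for one categorical variable (here with String values; the toy main reproduces the city example above):

import java.util.Arrays;
import java.util.List;

class OneHot {
  // one categorical variable with k possible values -> k dummy variables
  static double[] encode(String value, List<String> values) {
    double[] dummies = new double[values.size()];
    dummies[values.indexOf(value)] = 1; // x_{h_k} = 1(x_j = x_{j,k})
    return dummies;
  }

  public static void main(String[] args) {
    List<String> cities = List.of("Ts", "Ud", "Ve", "Pn", "Go");
    // Ts -> (1, 0, 0, 0, 0), as in (11, 153, Ts) -> (11, 153, 1, 0, 0, 0, 0)
    System.out.println(Arrays.toString(encode("Ts", cities)));
  }
}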

From binary to multiclass: one-vs-one

Let flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'\subtext{predict}} be a learning technique applicable to X,YbinaryX,Y\subtext{binary} where Ybinary={Pos,Neg}\c{3}{Y\subtext{binary}=\{\text{Pos},\text{Neg}\}} that produces models in MM, i.e., flearn:P(X×Ybinary)M\c{1}{f'\subtext{learn}}: \mathcal{P}^*(X \times \c{3}{Y\subtext{binary}}) \to M and fpredict:X×MYbinary\c{1}{f'\subtext{predict}}: X \times M \to \c{3}{Y\subtext{binary}}.

Let Y={y1,,yk}\c{2}{Y=\{y_1,\dots,y_k\}} a finite set with k>2k>2 values.

Consider a new learning technique flearn,ovo,fpredict,ovof'\subtext{learn,ovo},f'\subtext{predict,ovo}, based on flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'\subtext{predict}}, that:

In learning: flearn,ovo:P(X×Y)Mk(k1)2f'\subtext{learn,ovo}: \mathcal{P}^*(X \times \c{2}{Y}) \to M^{\frac{k(k-1)}{2}}

Given a DP(X×Y)D \in \mathcal{P}^*(X \times \c{2}{Y}):

  1. set M=\mathcal{M}=\emptyset
  2. for each pair of classes, i.e., pair (h1,h2){1,,k}(h_1,h_2) \in \{1,\dots,k\} such that h1<h2h\sub{1} < h\sub{2} k(k1)2=(k2)\frac{k(k-1)}{2}=\binom{k}{2} times
    1. builds DD' by taking only the observations in which y(i)=yh1\c{2}{y^{(i)}}=y_{h_1} or y(i)=yh2\c{2}{y^{(i)}}=y_{h_2}
    2. set each y(i)=Pos\c{3}{y'^{(i)}}=\text{Pos} if y(i)=yh1\c{2}{y^{(i)}}=y_{h_1}, or y(i)=Neg\c{3}{y'^{(i)}}=\text{Neg} otherwise
    3. learns a model mh1,h2m_{h_1,h_2} with flearnf'\subtext{learn}, puts it in M\mathcal{M}
  3. returns M\mathcal{M}

each mh1,h2m_{h_1,h_2} is a binary classification model learned on D<D|D'| < |D| obs.

In prediction: fpredict,ovo:X×Mk(k1)2Yf'\subtext{predict,ovo}: X \times M^{\frac{k(k-1)}{2}} \to \c{2}{Y}

Given an xXx \in X and a model MMk(k1)2\mathcal{M} \in M^{\frac{k(k-1)}{2}}:

  1. sets v=0Nk\vect{v}=\vect{0} \in \mathbb{N}^k
  2. for each mh1,h2Mm_{h_1,h_2} \in \mathcal{M} k(k1)2=(k2)\frac{k(k-1)}{2}=\binom{k}{2} times
    1. applies fpredictf'\subtext{predict} on xx with mh1,h2m_{h_1,h_2} and increments vh1v_{h_1} if the outcome is y=Pos\c{3}{y}=\text{Pos}, or vh2v_{h_2} otherwise
  3. returns yhy\sub{h^\star} with h=arg maxhvhh^\star=\argmax_{h} v_h

v\vect{v} counts the times a class has been predicted

can be extended for giving a probability

306 / 366

From binary to multiclass: one-vs-all

Let flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'''\subtext{predict}} be a learning technique with confidence/probability, i.e., flearn:P(X×Ybinary)M\c{1}{f'\subtext{learn}}: \mathcal{P}^*(X \times \c{3}{Y\subtext{binary}}) \to M and fpredict:X×MR\c{1}{f'''\subtext{predict}}: X \times M \to \mathbb{R}, with fpredict(x,m)\c{1}{f'''\subtext{predict}}(x,m) being the confidence that xx is Pos\text{Pos}. probability would be [0,1]\to [0,1]

Let Y={y1,,yk}\c{2}{Y=\{y_1,\dots,y_k\}} a finite set with k>2k>2 values.

Consider a new learning technique flearn,ova,fpredict,ovaf'\subtext{learn,ova},f'\subtext{predict,ova}, based on flearn,fpredict\c{1}{f'\subtext{learn}},\c{1}{f'''\subtext{predict}}, that:

In learning: flearn,ova:P(X×Y)Mkf'\subtext{learn,ova}: \mathcal{P}^*(X \times \c{2}{Y}) \to M^k

Given a DP(X×Y)D \in \mathcal{P}^*(X \times \c{2}{Y}):

  1. set M=\mathcal{M}=\emptyset
  2. for each class, i.e., h{1,,k}h \in \{1,\dots,k\} kk times
    1. builds DD' by setting each y(i)=Pos\c{3}{y'^{(i)}}=\text{Pos} if y(i)=yh\c{2}{y^{(i)}}=y_h, or y(i)=Neg\c{3}{y'^{(i)}}=\text{Neg} otherwise
    2. learns a model mhm_h with flearnf'\subtext{learn}, puts it in M\mathcal{M}
  3. returns M\mathcal{M}

each mhm_h is a binary classification model learned on D=D|D'|=|D| obs.

In prediction: fpredict,ova:X×MkYf'\subtext{predict,ova}: X \times M^k \to \c{2}{Y}

Given an xXx \in X and a model MMk\mathcal{M} \in M^k:

  1. sets v=0Rk\vect{v}=\vect{0} \in \mathbb{R}^k
  2. for each mhMm_h \in \mathcal{M} kk times
    1. applies fpredictf'''\subtext{predict} on xx with mhm_h and sets vhv_h to the outcome fpredict(x,mh)\c{1}{f'''\subtext{predict}}(x,m_h)
  3. returns yhy\sub{h^\star} with h=arg maxhvhh^\star=\argmax_{h} v_h

v\vect{v} holds the confidences for each class

can be extended for giving a probability
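
Both schemes are available, e.g., in scikit-learn as wrappers around any binary technique (a minimal sketch on the Iris dataset, with SVC as the base binary learner):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # k = 3 classes

# one-vs-one: learns k(k-1)/2 = 3 binary models, predicts by majority voting
ovo = OneVsOneClassifier(SVC()).fit(X, y)
# one-vs-all (one-vs-rest): learns k = 3 binary models, predicts the class
# whose model gives the largest confidence
ova = OneVsRestClassifier(SVC()).fit(X, y)
print(len(ovo.estimators_), len(ova.estimators_))  # 3 3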

307 / 366

XX, YY and applicability: \approx fixed!

Let X=X1××XpX=X_1 \times \dots \times X_p:

$X_j$ $Y$ RF SVM SVM+
Numerical Binary classification ✅ ✅ ✅
Categorical Binary classification ✅ ❌ ✅
Numerical + Categorical Binary classification ✅ ❌ ✅
Numerical Multiclass classification ✅ ❌ ✅
Categorical Multiclass classification ✅ ❌ ✅
Numerical + Categorical Multiclass classification ✅ ❌ ✅
Numerical Regression ✅ ❌ ❌³
Categorical Regression ✅ ❌ ❌³
Numerical + Categorical Regression ✅ ❌ ❌³

SVM+¹²: SVM + one-vs-one/one-vs-all + dummy variables

  1. Not a real name...
  2. In practice, most ML sw/libraries do everything transparently, and let you use SVM+ instead of SVM.
  3. For regression, SVR or other variants.
308 / 366

Missing values

In many practical, business cases, some variables for some observations might miss a value. Formally, $x_j \in X_j \cup \{\c{1}{\varnothing}\}$. here $\varnothing$ denotes a missing value, not the empty set $\emptyset$

Examples: (extended carousel)

  • X=R+×R+×{Ts,Ud,Ve,Pn,Go}X = \mathbb{R}^+ \times \mathbb{R}^+ \times \{\text{Ts},\text{Ud},\text{Ve},\text{Pn},\text{Go}\}
  • x=(15,,Ts)x=(15, \c{1}{\varnothing}, \text{Ts}) x=(15,,1,0,0,0,0)\vect{x}'=(15, \c{1}{\varnothing}, 1,0,0,0,0)
  • $x=(12, 155, \c{1}{\varnothing}) \mapsto \vect{x}'=(12, 155, \c{1}{0,0,0,0,0})$, actually not a problem!

Trees and SVM cannot work!

  • a tree cannot test xheightτx\subtext{height} \le \tau
  • the SMC/SVM cannot compute xx(i)\vect{x}^\intercal\vect{x}^{(i)}

Solutions:

  • drop the variable(s) with missing values (ok if many missing values) otherwise, not ok
  • fill with most common value or mean value
    • arg maxxj,kXji1(xj(i)=xj,k)\varnothing \gets \argmax_{x_{j,k} \in X_j} \sum_i \mathbf{1}(x^{(i)}_j = x_{j,k}) for categorical variables
    • 1i1(xj(i))i:xj(i)xj(i)\varnothing \gets \frac{1}{\sum_i \mathbf{1}(x^{(i)}_j \ne \varnothing)} \sum_{i: x^{(i)}_j \ne \varnothing} x^{(i)}_j for numerical variables
  • replace with a new class, only for categorical variable
  • ...
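
A sketch of the fill-with-mean/most-common solutions with scikit-learn's SimpleImputer (the toy values are made up):

import numpy as np
from sklearn.impute import SimpleImputer

x_num = np.array([[15.0], [np.nan], [12.0]])  # numerical, one value missing
print(SimpleImputer(strategy="mean").fit_transform(x_num).ravel())  # [15. 13.5 12.]

x_cat = np.array([["Ts"], [np.nan], ["Ts"], ["Ud"]], dtype=object)  # categorical
print(SimpleImputer(strategy="most_frequent").fit_transform(x_cat).ravel())
# or: replace with a new class
print(SimpleImputer(strategy="constant", fill_value="missing").fit_transform(x_cat).ravel())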
309 / 366

Naive Bayes

310 / 366

Guess the gender¹

You are in the line 🚶🚶‍♂️🚶‍♀️🚶🚶🚶‍♀️🚶‍♂️🚶🚶‍♀️ at the cinema 🏪.

The ticket 🎟 of the person before you in the line falls on the ground.

The person has long hair.

Do you say "excuse me, sir" 🧔‍♀️ or "excuse me, madam" 👩?

  1. For clarity, let's assume there are two possible genders.

More formally:

  • X=XhairX=X\subtext{hair} might be Xhair={long,¬long}X\subtext{hair}=\set{\text{long},\neg\text{long}}, or a bigger set; not relevant here
  • Y={man,woman}Y=\{\text{man},\text{woman}\}
  • you are fpredictf\subtext{predict}
  • your life is flearnf\subtext{learn}
  • fpredict(long)=?f\subtext{predict}(\text{long}) = ?
311 / 366

Reason with probability

According to your life, you have collected some knowledge, that you can express as probabilities:

  • the probability that a random person is a man is the same as that of being a woman
    • Pr ⁣(a person is a man)=Pr ⁣(p=man)=0.5=Pr ⁣(p=woman)\prob{\text{a person is a man}}=\prob{p=\text{man}}=0.5=\prob{p=\text{woman}}
  • the probability that a man has long hair is low
    • Pr ⁣(h=longp=man)=0.04\prob{h=\text{long} \mid p=\text{man}}=0.04
  • the probability that a woman has long hair is higher
    • Pr ⁣(h=longp=woman)=0.5\prob{h=\text{long} \mid p=\text{woman}}=0.5

where Pr ⁣(AB)\prob{A \mid B} is the conditional probability, i.e., the probability that, given that the event BB occurred, the event AA occurs

312 / 366

Guessing the gender with probability

Do you say "excuse me, sir" 🧔‍♀️ or "excuse me, madam" 👩?

So, we want to know Pr ⁣(p=manh=long)\prob{p=\text{man} \mid h=\text{long}} and Pr ⁣(p=womanh=long)\prob{p=\text{woman} \mid h=\text{long}}, or maybe just if:

  • Pr ⁣(p=manh=long)>?Pr ⁣(p=womanh=long)\prob{p=\text{man} \mid h=\text{long}} \stackrel{?}{>} \prob{p=\text{woman} \mid h=\text{long}}

But we know $\prob{h=\text{long} \mid p=\text{man}}$, not $\prob{p=\text{man} \mid h=\text{long}}$...

In general, Pr ⁣(AB)Pr ⁣(BA)\prob{A \mid B} \ne \prob{B \mid A}.

  • Pr ⁣(win lotteryplay lottery)Pr ⁣(play lotterywin lottery)\prob{\text{win lottery} \mid \text{play lottery}} \ne \prob{\text{play lottery} \mid \text{win lottery}}
313 / 366

The Bayes rule

Pr ⁣(A)Pr ⁣(BA)=Pr ⁣(A,B)=Pr ⁣(B)Pr ⁣(AB)\prob{A} \prob{B \mid A}=\prob{A, B} = \prob{B} \prob{A \mid B}

where Pr ⁣(A,B)\prob{A,B} is the probability that both AA and BB occur.

Pr ⁣(BA)=Pr ⁣(B)Pr ⁣(AB)Pr ⁣(A)\prob{B \mid A}=\frac{\prob{B} \prob{A \mid B}}{\prob{A}}

What we know:

  • Pr ⁣(man)=0.5\prob{\text{man}}=0.5
  • Pr ⁣(woman)=0.5\prob{\text{woman}}=0.5
  • Pr ⁣(longman)=0.04\prob{\text{long} \mid \text{man}}=0.04
  • Pr ⁣(longwoman)=0.5\prob{\text{long} \mid \text{woman}}=0.5

What we compute:

  • Pr ⁣(manlong)=Pr ⁣(man)Pr ⁣(longman)Pr ⁣(long)=0.50.04Pr ⁣(long)=0.02Pr ⁣(long)\prob{\text{man} \mid \text{long}} = \frac{\prob{\text{man}} \prob{\text{long} \mid \text{man}}}{\prob{\text{long}}}=\frac{0.5 \cdot 0.04}{\prob{\text{long}}}=\frac{0.02}{\prob{\text{long}}}
  • Pr ⁣(womanlong)=Pr ⁣(woman)Pr ⁣(longwoman)Pr ⁣(long)=0.50.5Pr ⁣(long)=0.25Pr ⁣(long)\prob{\text{woman} \mid \text{long}} = \frac{\prob{\text{woman}} \prob{\text{long} \mid \text{woman}}}{\prob{\text{long}}}=\frac{0.5 \cdot 0.5}{\prob{\text{long}}}=\frac{0.25}{\prob{\text{long}}}
  • 0.02Pr ⁣(long)<0.25Pr ⁣(long)\frac{0.02}{\prob{\text{long}}} < \frac{0.25}{\prob{\text{long}}} \Rightarrow 👩 \Rightarrow "excuse me, madam"

We do not really need to know Pr ⁣(long)\prob{\text{long}}!

but it could be computed, in some cases
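
As a quick check, the same computation in Python (just the numbers above; $\prob{\text{long}}$ cancels out in the comparison):

p_man, p_woman = 0.5, 0.5
p_long_given_man, p_long_given_woman = 0.04, 0.5

post_man = p_man * p_long_given_man        # 0.02, unnormalized posterior
post_woman = p_woman * p_long_given_woman  # 0.25, unnormalized posterior
print("madam" if post_woman > post_man else "sir")  # madam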

314 / 366

Guess the gender II

You are in the line 🚶🚶‍♂️🚶‍♀️🚶🚶🚶‍♀️🚶‍♂️🚶🚶‍♀️ at the stadium 🏟.

The ticket 🎟 of the person before you in the line falls on the ground.

The person has long hair.

What we know:

  • Pr ⁣(man at 🏟)=0.98\prob{\text{man at 🏟}}=\c{1}{0.98}
  • Pr ⁣(woman at 🏟)=0.02\prob{\text{woman at 🏟}}=\c{1}{0.02}
  • Pr ⁣(longman)=0.04\prob{\text{long} \mid \text{man}}=0.04
  • Pr ⁣(longwoman)=0.5\prob{\text{long} \mid \text{woman}}=0.5

What we compute:

  • Pr ⁣(manlong)=Pr ⁣(man)Pr ⁣(longman)Pr ⁣(long)=0.980.04Pr ⁣(long)=0.0392Pr ⁣(long)\prob{\text{man} \mid \text{long}} = \frac{\prob{\text{man}} \prob{\text{long} \mid \text{man}}}{\prob{\text{long}}}=\frac{\c{1}{0.98} \cdot 0.04}{\prob{\text{long}}}=\frac{\c{1}{0.0392}}{\prob{\text{long}}}
  • Pr ⁣(womanlong)=Pr ⁣(woman)Pr ⁣(longwoman)Pr ⁣(long)=0.020.5Pr ⁣(long)=0.01Pr ⁣(long)\prob{\text{woman} \mid \text{long}} = \frac{\prob{\text{woman}} \prob{\text{long} \mid \text{woman}}}{\prob{\text{long}}}=\frac{\c{1}{0.02} \cdot 0.5}{\prob{\text{long}}}=\frac{\c{1}{0.01}}{\prob{\text{long}}}
  • 0.0392Pr ⁣(long)>0.01Pr ⁣(long)\frac{\c{1}{0.0392}}{\prob{\text{long}}} > \frac{\c{1}{0.01}}{\prob{\text{long}}} \Rightarrow 🧔 \Rightarrow "excuse me, sir"

Different natural probability of a person at the stadium being a man!

315 / 366

Prior, posterior, evidence

Pr ⁣(eventevidence)=Pr ⁣(event)Pr ⁣(evidenceevent)Pr ⁣(evidence)\c{2}{\prob{\text{event} \mid \text{evidence}}}=\c{1}{\prob{\text{event}}}\c{3}{\frac{\prob{\text{evidence} \mid \text{event}}}{\prob{\text{evidence}}}}

  • prior: the natural probability of the event
    • what we know in advance
  • posterior: the probability of the event, given some evidence
    • what we want to know
  • the factor $\frac{\prob{\text{evidence} \mid \text{event}}}{\prob{\text{evidence}}}$: a correction we apply to the prior knowing the evidence
316 / 366

Bayes for supervised ML

Assume classification with categorical indep. vars:

  • X=X1××XpX = X_1 \times \dots \times X_p
    • with Xj={xj,1,,xj,hj}X_j=\{x_{j,1}, \dots, x_{j,h_j}\}
  • Y={y1,,yk}Y = \{y_1, \dots, y_k\}

Pr ⁣(eventevidence)=Pr ⁣(event)Pr ⁣(evidenceevent)Pr ⁣(evidence)\c{2}{\prob{\text{event} \mid \text{evidence}}}=\c{1}{\prob{\text{event}}}\c{3}{\frac{\prob{\text{evidence} \mid \text{event}}}{\prob{\text{evidence}}}}

  • event: yy is one specific class, y=ymy=y_m
  • evidence: xx is one specific observation, x=(x1,l1,,xp,lp)x=(x_{1,l_1},\dots,x_{p,l_p})

Hence: Pr ⁣(y=ymx=(x1,l1,,xp,lp))=Pr ⁣(y=ym)Pr ⁣(x=(x1,l1,,xp,lp)y=ym)Pr ⁣(x=(x1,l1,,xp,lp))\c{2}{\prob{y=y_m \mid x=(x_{1,l_1},\dots,x_{p,l_p})}}=\c{1}{\prob{y=y_m}}\c{3}{\frac{\prob{x=(x_{1,l_1},\dots,x_{p,l_p}) \mid y=y_m}}{\prob{x=(x_{1,l_1},\dots,x_{p,l_p})}}} or, more briefly: p(ymx1,l1,,xp,lp)=p(ym)p(x1,l1,,xp,lpym)p(x1,l1,,xp,lp)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}

317 / 366

Required knowledge

p(ymx1,l1,,xp,lp)=p(ym)p(x1,l1,,xp,lpym)p(x1,l1,,xp,lp)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}

What do we need for predicting yy from a xx?

  1. compute p(ymx1,l1,,xp,lp)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)} for each ymy_m
    • hence, each p(ym)\c{1}{p(y_m)} and each p(x1,l1,,xp,lpym)\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}
    • no need to compute p(x1,l1,,xp,lp)\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)} for the comparison
  2. take the yy with the largest value

Where to find them?

💡: in the learning data DD!

  • each p(ym)\c{1}{p(y_m)}: just count the observations in DD with y=ymy=y_m
  • each p(x1,l1,,xp,lpym)\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}: just count the obs. in DD with y=ymy=y_m and x=(x1,l1,,xp,lp)x=\left(x_{1,l_1},\dots,x_{p,l_p}\right)
    • what if the count is 00? 🤔 not that unlikely...
    • how many combinations should I store? kj=1j=phjk \prod_{j=1}^{j=p} h_j
318 / 366

Independent independent¹ variables

Let's make the naive hypothesis that the independent variables are independent¹ of each other: $\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}=\c{1}{p(y_m)}\c{3}{\frac{p\left(x_{1,l_1} \mid y_m, \dots, x_{p,l_p} \mid y_m\right)}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}}$ becomes: the rewriting $p\left(x_{1,l_1},\dots,x_{p,l_p} \mid y_m\right) = p\left(x_{1,l_1} \mid y_m, \dots, x_{p,l_p} \mid y_m\right)$ always holds, even without independence

p(ymx1,l1,,xp,lp)=p(ym)p(x1,l1,,xp,lp)p(x1,l1ym)p(xp,lpym)\c{2}{p\left(y_m \mid x_{1,l_1},\dots,x_{p,l_p}\right)}=\frac{\c{1}{p(y_m)}}{\c{3}{p\left(x_{1,l_1},\dots,x_{p,l_p}\right)}} \c{3}{p\left(x_{1,l_1} \mid y_m\right)} \dots \c{3}{p\left(x_{p,l_p} \mid y_m\right)}

Where to find them?

💡: in the learning data DD!

  • each p(ym)\c{1}{p(y_m)}: just count the observations in DD with y=ymy=y_m
  • each $\c{3}{p\left(x_{j,l_j} \mid y_m\right)}$: just count the obs. in $D$ with $y=y_m$ and $x_j=x_{j,l_j}$
    • what if the count is 00? unlikely, but possible
    • how many combinations should I store? j=1j=pkhj \sum_{j=1}^{j=p}k h_j
  1. The first "independent" refers to xjx_j and yy; the second "independent" refers to xjx_j and xjx_{j'}.
319 / 366

Naive Bayes

The technique based on the independence hypothesis is called Naive Bayes:

  • based on the Bayes rule
  • with a naive independence hypothesis

Learning:

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}p\vect{p}

function learn({(x(i),y(i))}i=1i=n)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i=1}^{i=n}) {
p\vect{p} \gets \emptyset
for m{1,,Y}m \in \{1, \dots, |Y|\} { //Y=k|Y|=k
pm1ni1(y(i)=ym)\c{1}{p_m} \gets \frac{1}{n} \sum_i \mathbf{1}(y^{(i)}=y_m)
for j{1,,p}j \in \{1, \dots, p\} {
for l{1,,Xj}l \in \{1, \dots, |X_j|\} { //Xj=hj|X\sub{j}|=h\sub{j}
pm,j,li1(y(i)=ymxj(i)=xj,l)i1(y(i)=ym)\c{3}{p_{m,j,l}} \gets \frac{\sum_i \mathbf{1}(y^{(i)}=y_m \land x_j^{(i)}=x_{j,l})}{\sum_i \mathbf{1}(y^{(i)}=y_m)}
}
}
}
return p\vect{p}
}

The model p\vect{p} is some data structure holding k+j=1j=pkhjk+\sum\sub{j=1}^{j=p}k h\sub{j} numbers, i.e., p[0,1]k+j=1j=pkhj\vect{p} \in [0,1]^{k+\sum\sub{j=1}^{j=p}k h\sub{j}}.

Prediction:

fpredictf'\subtext{predict}x,px,\vect{p}yy

function predict(x,p)\text{predict}(x,\vect{p}) { //x=(xl1,,xlp)x=(x\sub{l\sub{1}},\dots,x\sub{l\sub{p}})
marg maxm{1,,Y}pmj=1j=ppm,j,ljm^\star \gets \argmax_{m \in \{1,\dots,|Y|\}} \c{1}{p_m} \prod_{j=1}^{j=p} \c{3}{p_{m,j,l_j}}
return $y_{m^\star}$
}

Or, with probability:

function predict-with-prob(x,p)\text{predict-with-prob}(x,\vect{p}) {
return ympmj=1j=ppm,j,ljm=1m=Ypmj=1j=ppm,j,ljy_m \mapsto \frac{\c{1}{p_m} \prod_{j=1}^{j=p} \c{3}{p_{m,j,l_j}}}{\sum_{m'=1}^{m'=|Y|} \c{1}{p_{m'}} \prod_{j=1}^{j=p} \c{3}{p_{m',j,l_j}}}
}
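
In scikit-learn, CategoricalNB implements this counting for categorical variables (a minimal sketch; the tiny dataset is made up, and the alpha smoothing term is one way to address the count-is-0 issue):

from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

X_raw = [["long", "Ts"], ["long", "Ud"], ["short", "Ts"], ["short", "Ud"]]
y = ["woman", "woman", "man", "woman"]

enc = OrdinalEncoder()  # CategoricalNB expects integer-coded categories
X = enc.fit_transform(X_raw)
model = CategoricalNB(alpha=1.0).fit(X, y)  # alpha: Laplace smoothing

x = enc.transform([["long", "Ts"]])
print(model.predict(x), model.predict_proba(x))  # predicted class + probabilities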

320 / 366

Naive Bayes: summary

Efficiency 👍👍👍

  • 👍 very very fast
    • in particular with very large datasets, in both nn and pp

Explainability/interpretability 👍👍

  • 👍 the model is a bunch of probabilities!
  • 👍 the technique is very simple

Effectiveness 🫳

  • 🫳 not so good
    • the more false the independence hypothesis is for the system, the less effective

Applicability 🫳

  • 🫳 YY: classification
  • 🫳 XX: only categorical variables but can be extended to the numerical case
  • 👍 models give probability
  • 👍 no hyperparameters
  • 👍 works natively with missing values
    • just remove the missing $j$ from $\prod_{j=1}^{j=p} \c{3}{p_{m,j,l_j}}$
321 / 366

k-Nearest Neighbors (kNN)

322 / 366

Guess the province

Maps of FVG economy

Given a point on the map, guess its province.

  • e.g., province of the most northern pig 🐖?
  • e.g., province of the most eastern fish 🐟?

More formally:

  • X=R2X= \mathbb{R}^2, i.e., the coordinates on the map
  • Y={Ts,Ud,Pn,Go}Y=\{\text{Ts},\text{Ud},\text{Pn},\text{Go}\}
  • you are fpredictf\subtext{predict}
  • flearnf\subtext{learn} is looking at the map¹
    • in particular, the position of the 4 chief towns
  1. Let's pretend we do not know the real boundaries of the (former) provinces...
323 / 366

The closest chief town

Tentative explanation of your reasoning, given a point xx on the map:

  1. look at the closest chief town
  2. say that the province of $x$ is the one of the closest chief town

More generally, in prediction, given a learning set $D$:

  1. find the kk closest observations in DD (the nearest neighbors)
  2. say that yy is the most frequent (if classification) or the mean (if regression) of the kk closest observations

This is the k-Nearest Neighbors learning technique!

324 / 366

k-Nearest Neighbors

Learning:

flearnf'\subtext{learn}{(x(i),y(i))}i\seq{(x^{(i)},y^{(i)})}{i}({(x(i),y(i))}i,k,d)(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})k,d\c{2}{k},\c{3}{d}

function learn({(x(i),y(i))}i,k,d)\text{learn}(\seq{(x^{(i)},y^{(i)})}{i}, \c{2}{k},\c{3}{d}) {
return ({(x(i),y(i))}i,k,d)(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})
}

flearnf'\subtext{learn} does nothing!

The model is the dataset DD

  • and¹ the number of neighbors kN\c{2}{k} \in \mathbb{N}
  • and¹ the distance² d:X×XR\c{3}{d}: X \times X \to \mathbb{R}

kk and dd are parameters!

  1. They are used by fpredictf'\subtext{predict}, not here, but we put them into the model just to not make the signature of fpredictf'\subtext{predict} dirty; ML sw/libraries do the same.
  2. A (dis)similarity measure is enough.

Prediction:

fpredictf'\subtext{predict}x,({(x(i),y(i))}i,k,d)x,(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})yy

function predict(x,({(x(i),y(i))}i,k,d))\text{predict}(x,(\seq{(x^{(i)},y^{(i)})}{i},\c{2}{k},\c{3}{d})) {

s0\vect{s} \gets \vect{0} //0Rn\vect{0} \in \mathbb{R}^n
for i{1,,n}i \in \{1,\dots,n\} {
sid(x,x(i))s_i \gets \c{3}{d}(x,x^{(i)})
}
II \gets \emptyset //the neighborhood
while $|I| < \c{2}{k}$ {
II{arg mini{1,,n}Isi}I \gets I \cup \{\argmin_{i \in \{1,\dots,n\} \setminus I} s_i\}
}
return arg maxyYiI1(y(i)=y)\argmax_{y \in Y} \sum_{i \in I} \mathbf{1}(y^{(i)}=y) //most frequent
}

Alternatives:

  • for regression, return 1kiIy(i)\frac{1}{\c{2}{k}}\sum_{i \in I} y^{(i)}
  • with probability, return y1kiI1(y(i)=y)y \mapsto \frac{1}{\c{2}{k}}\sum_{i \in I} \mathbf{1}(y^{(i)}=y)
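
A minimal sketch with scikit-learn (fit essentially just stores the dataset, consistently with $f'\subtext{learn}$ doing nothing):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# the two parameters: k (n_neighbors) and d (here the Minkowski distance
# with p = 2, i.e., the Euclidean distance, which is the default)
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2).fit(X, y)
print(knn.predict(X[:1]))        # most frequent class in the neighborhood
print(knn.predict_proba(X[:1]))  # the "with probability" variant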
325 / 366

The distance

By using a proper distance d:X×XRd: X \times X \to \mathbb{R}, kNN can be used on any XX! (applicability 👍👍👍)

Common cases: there is a large literature on distances

  • for vectorial spaces, i.e., X=RpX=\mathbb{R}^p
    • \ell-norms: with \ell being a parameter, d(x,x)=x,x=jxjxjd(\vect{x},\vect{x}')=\lVert \vect{x},\vect{x}' \rVert_\ell=\sqrt[\ell]{\sum_j |x_j-x'_j|^\ell}
      • Euclidean with =2\ell=2
      • Manhattan with =1\ell=1
    • cosine distance: $d(\vect{x},\vect{x}')=1-\frac{\vect{x}^\intercal\vect{x}'}{\lVert \vect{x} \rVert \lVert \vect{x}' \rVert}$ $\lVert \cdot \rVert$ is just $\lVert \cdot \rVert_2$
      • disregards the individual scales of the points
    • many others
  • for fixed-length sequences of symbols in an alphabet AA, i.e., X=AlX=A^l
    • Hamming distance: d(x,x)=k=1k=l1(xkxk)d(x,x')=\sum_{k=1}^{k=l} \mathbf{1}(x_k \ne x'_k)
    • edit distance (many variants)
  • for variable-length sequences of symbols in an alphabet AA, i.e., X=AX=A^*
    • edit distance or Hamming with some adjustments
  • for sets, i.e., X=P(A)X=\mathcal{P}(A)
    • Jaccard distance: d(x,x)=1xxxxd(x,x')=1-\frac{|x \cap x'|}{|x \cup x'|}
  • and combinations of these ones!

Choose one that helps to capture the dependency of yy on xx!
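
A few of these distances as plain Python sketches:

import numpy as np

def ell_norm(x1, x2, l=2):  # ℓ-norm distance: l=2 Euclidean, l=1 Manhattan
    return np.sum(np.abs(x1 - x2) ** l) ** (1 / l)

def hamming(s1, s2):  # fixed-length sequences of symbols
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def jaccard(a, b):  # sets
    return 1 - len(a & b) / len(a | b)

print(ell_norm(np.array([0, 0]), np.array([3, 4])))  # 5.0
print(hamming("karolin", "kathrin"))                 # 3
print(jaccard({"🐖", "🐟"}, {"🐟", "🐄"}))             # 0.666...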

326 / 366

Role of the kk parameter

kNN decision boundaries with two k values

Error vs k in kNN

images from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013

Yes, it is a flexibility parameter: link with the Bayes classifier!

  • the larger the kk the more global the estimate of p(yx)p(y \mid x); the smaller, the more local
  • if k=nk=n then p(yx)p(y \mid x) does not actually use xx, the neighborhood is the entire DD \rightarrow high bias
  • if k=1k=1 then p(yx)p(y \mid x) depends on just one point, little noise can change the output \rightarrow high variance
327 / 366

kNN: summary

Efficiency 🫳

  • 🫳 struggles with large nn in prediction
  • 👍 no actual learning phase

Explainability/interpretability 👍

  • 👍 the neighborhood is itself a local explanation of the decision

Effectiveness 🫳

  • 🫳 not particularly good, in practice
    • depends on kk

Applicability 👍

  • 👍 YY: regression and both classifications
  • 👍 XX: everything, if you have a proper distance dd
    • but tricky with mixed numerical/categorical cases
  • 👍 models give probability
  • 🫳 two parameters ($d$ and $k$), one impacting the bias-variance trade-off
328 / 366

Lab 2¹: comparison of ML techniques

Consider the DataCo Smart Supply Chain for Big Data Analysis dataset

  • given the objective of classifying if an order is marked as late delivery, design and implement an ML procedure which answers the question: what is the best classification technique?

  • given the objective of predicting the sales of each order, design and implement an ML procedure which answers the question: what is the best regression technique?

consider the ML techniques seen during the lectures

Hints:

  • the dataset is really big (~180k rows): use this to your advantage!
  • in Python, the pandas library is the most popular for dataset manipulations and explorations
  • about ML algorithms, you can find all the ones you need for this lab in the scikit-learn library

1: designed by Gaia Saveri, tutor A.Y. 2023/2024

329 / 366

Unsupervised learning

Clustering

330 / 366

Back to the origin

Machine Learning is the science of getting computers to learn without being explicitly programmed.

\downarrow

Supervised (Machine) Learning is the science of getting computers to learn f:XYf: X \to Y from examples autonomously.

\downarrow

Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.

331 / 366

Unsupervised learning definition

Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.

What's a pattern?

  • pattern [ˈpat(ə)n]: a model or design used as a guide in needlework and other crafts

In practice:

  • we assume that the system that generates the data follows some scheme (the pattern)
  • we do not know the pattern
  • we want to discover the pattern from a dataset
332 / 366

Supervised vs. unsupervised

Supervised (Machine) Learning is the science of getting computers to learn f:XYf: X \to Y from examples autonomously.

Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.

Key differences

In supervised learning:

  • $y$ is a property of $x$
  • one example is a pair $(x,y)$
  • what we learn from a dataset can be applied to other $x$

In unsupervised learning:

  • the pattern is a property of the system $s$
  • the example is the dataset $\mathcal{P}^*(X)$
  • what we learn from the dataset is not, in general, usable on another dataset
    • hence, "find patterns from data" is fairer than "learn patterns from data"
333 / 366

Pattern?

In most of the cases, the pattern one is looking for is grouping:

  • i.e., we assume the system generates data that is grouped, but we do not know what the groups are

This form of unsupervised learning is called clustering:

  • given a dataset, find the clusters
    • cluster [kluhs-ter]: a group of things or persons close together
    • "close together" \rightarrow there is some implicit notion of distance (or similarity)

Meme unsupervised learning vs. clustering

334 / 366

Clustering, more formally

Given a dataset DP(X)D \in \mathcal{P}^*(X), find a partitioning {D1,,Dk}\{D_1, \dots, D_k\} of DD such that the elements in each DiD_i are "close together".

  • each DiD_i is a cluster

Is this a formal and complete definition? No!

  • what does it mean "close together"?
    • we need a distance/(dis)similarity metric $d: X \times X \to \R^+$, but it's not an input of the problem (it's not in the "given" part)
  • second, how close? what elements?
    • intuitively, we want any two elements of the same cluster to be closer to each other than any two elements of different clusters
  • third: where does kk (the number of clusters) come from? like dd, it's not an input of the problem

In practice:

  • dd is dictated by XX and is reasonable
    • that is, you first shape $X$ (feature engineering), then select a reasonable $d$ for that $X$
  • kk is unknown
    • mostly suggested/bounded by the context
    • picked within the reasonable range
335 / 366

Clustering as optimization

In principle, clustering looks like a (biobjective) optimization problem (given $D \in \mathcal{P}^*(X)$, $k \in \{1,\dots,|D|\}$, and $d: X \times X \to \mathbb{R}^+$):

maxD1,,Dk  (i,i:iixDi,xDid(x,x))(ix,xDid(x,x))subject to  D1Dk=DDiDi=i,i{1,,k} \begin{align*} \max_{D_1, \dots, D_k} & \; \left(\c{4}{\sum _{i,i': i\ne i'}\sum_{x \in D_i, x' \in D_{i'}} d(x,x')}\right) - \left(\c{2}{\sum_i \sum_{x, x' \in D_i} d(x,x')}\right) \\ \text{subject to} & \; \begin{array}{ll} \c{3}{D_1 \cup \dots \cup D_k = D} \\ \c{3}{D_i \cap D_{i'} = \emptyset} & \c{3}{\forall i,i' \in \{1, \dots, k\}} \end{array} \end{align*}

For any k,dk,d, there exists (at least) one optimal solution. In principle, to find it you can just try all the partitions and measure the distance.

In the objective:

  • the first term: maximize the distance between any two $x,x'$ when they belong to different clusters
  • the second term: minimize (i.e., maximize with $-$) the distance between any two $x,x'$ when they belong to the same cluster
  • the constraints: clusters have to form a partition

In practice:

  • you don't know $k$
  • trying all partitions is unfeasible

Here $D$ and each $D_i$ are bags, not sets. A partition on a bag is better defined if you define a bag as a $m: A \to \mathbb{N}$, where $A$ is a set and $m(a)$ is the multiplicity of $a \in A$ in the bag. However, for clustering we can reason on sets, because in practice identical observations should always end up in the same cluster.
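
A sketch of the objective, computed for candidate partitions of a small $D \subset \mathbb{R}$ (here each unordered pair is counted once, which only rescales the double sums above and does not change the comparison):

import itertools

def objective(clusters, d):
    inter = sum(d(x, x2) for D1, D2 in itertools.combinations(clusters, 2)
                for x in D1 for x2 in D2)                    # separation
    intra = sum(d(x, x2) for D in clusters
                for x, x2 in itertools.combinations(D, 2))   # dispersion
    return inter - intra

d = lambda x, x2: abs(x - x2)
print(objective([[1, 2, 3], [6, 7, 9]], d))  # 38: a "natural" partition
print(objective([[1, 2, 6], [3, 7, 9]], d))  # 14: a worse partition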

336 / 366

Assessing clustering

If you assume to know kk and dd, a clustering method:

  • is effective on a DD if it produces the optimal partition
    • or, the closer the produced partition to the optimal one, the more effective
  • is efficient if it does so using few resources (i.e., quickly)

But in practice you don't know kk...

Can we just optimize also kk? That is, can we solve the optimization problem for every kk and take the best?

maxk,D1,,Dk  (i,i:iixDi,xDid(x,x))(ix,xDid(x,x))subject to  D1Dk=DDiDi=i,i{1,,k} \begin{align*} \max_{k, D_1, \dots, D_k} & \; \left(\c{4}{\sum _{i,i': i\ne i'}\sum_{x \in D_i, x' \in D_{i'}} d(x,x')}\right) - \left(\c{2}{\sum_i \sum_{x, x' \in D_i} d(x,x')}\right) \\ \text{subject to} & \; \begin{array}{ll} \c{3}{D_1 \cup \dots \cup D_k = D} \\ \c{3}{D_i \cap D_{i'} = \emptyset} & \c{3}{\forall i,i' \in \{1, \dots, k\}} \end{array} \end{align*}

If you also optimize kk, then the optimal solution is the one with k=Dk=|D|...

Can we just optimize also kk? No! It's pointless.

Extreme cases:

  • k=1k=1, no clustering, just D1=DD_1=D
    • i,i=0\c{4}{\sum\sub{i,i'}\sum}=0, i=dall\c{2}{\sum\sub{i}\sum}=d\subtext{all} is large, hence the objective is large negative
  • k=Dk=|D|, each observation is a cluster
    • i,i=dall\c{4}{\sum\sub{i,i'}\sum}=d\subtext{all} is large, i=0\c{2}{\sum\sub{i}\sum}=0, hence the objective is large positive
  • in between, always increasing
337 / 366

Assessing clustering in practice

How do you evaluate a partitioning of DD in practice?

  • you inspect it manually
  • you insert the clustering inside the larger information processing system it belongs to and measure some other index e.g., how rich 💰💰💰 you become with this, rather than that, clustering technique?
    • a form of extrinsic evaluation: you look at the result in a larger context
  • you measure some performance indexes devised for clustering
    • a form of intrinsic evaluation: you look at the result alone

Question: is manual inspection intrinsic or extrinsic?

338 / 366

Clustering performance indexes

There are many of them; most are based on the idea of measuring separateness or density of clustering.

Silhouette index: it considers, for each observation, the average distance to the observations in the same cluster and the min distance to the observations in other clusters: sˉ({Di}i=1i=k)=1iDixiDidout(x,{Di}i)din(x,{Di}i)max(dout(x,{Di}i),din(x,{Di}i))\bar{s}(\seq{D_i}{i=1}^{i=k})=\frac{1}{\left|\bigcup_i D_i\right|}\sum_{x \in \bigcup_i D_i}\frac{\c{1}{d\subtext{out}(x,\seq{D_i}{i})}-\c{2}{d\subtext{in}(x,\seq{D_i}{i})}}{\max\left(\c{1}{d\subtext{out}(x,\seq{D_i}{i})},\c{2}{d\subtext{in}(x,\seq{D_i}{i})}\right)} where:

dout(x,{Di}i)=minDi∌xminxDid(x,x)\c{1}{d\subtext{out}(x,\seq{D_i}{i})}=\min_{D_i \not\ni x} \min_{x' \in D_i} d(x, x')

$\c{2}{d\subtext{in}(x,\seq{D_i}{i})}=\frac{1}{|D_i \ni x|-1}\sum_{x' \in D_i \ni x, x' \ne x} d(x, x')$

sˉ()[1,1]\bar{s}(\cdot) \in [-1,1]: the larger (closer to 11), the better (i.e., the more separated the clusters).

A similar index is the Dunn index.
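
With scikit-learn (a sketch; note that silhouette_score uses the classic definition, where $d\subtext{out}$ is an average over the closest other cluster rather than a min, but the idea is the same):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X, _ = load_iris(return_X_y=True)  # y is ignored: clustering is unsupervised

for k in (2, 3, 4, 5):  # compare partitions with different k
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))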

339 / 366

Silhouette in practice

Example of Silhouette plot with 3 clusters

sˉ({Di}i)=0.78\bar{s}(\seq{D_i}{i})=0.78

Questions:

  • XX?
  • kk?
  • dd?
340 / 366

Silhouette in practice

Example of Silhouette plot with 4 clusters

sˉ({Di}i)=0.74\bar{s}(\seq{D_i}{i})=0.74

In practice:

  • the greater kk, the lower sˉ()\bar{s}(\cdot)
  • you choose the kk where there is a knee (or elbow)
341 / 366

Hierarchical clustering

342 / 366

Hierarchical clustering

Hierarchical clustering is an iterative method that exists in two versions (agglomerative and divisive). For both:

  • at each $j$-th iteration, there exists one partition $D_1, \dots, D_{k_j}$
  • at most two clusters differ between partitions at subsequent iterations
  • you don't set $k$

That is, partitions are refined by merging (in agglomerative hierarchical clustering) or by division (in divisive hierarchical clustering).

Moreover, since the partition is refined over iterations, a hierarchy among clusters is established:

  • that is, this clustering method gives something more than a simple partition

We'll see just the agglomerative version.

343 / 366

Agglomerative hierarchical clustering

function cluster({x(i)}i=1i=n)\text{cluster}(\seq{x^{(i)}}{i=1}^{i=n}) {
j0j \gets 0
Dj{{x(1)},,{x(n)}}\c{1}{\mathcal{D}_j} \gets \{\{x^{(1)}\},\dots,\{x^{(n)}\}\}
while Dj>1|\c{1}{\mathcal{D}_j}|>1 {
$(i^\star,i^{\prime\star}) \gets \argmin_{i,i' \in \{1,\dots,|\mathcal{D}_j|\} \land i \ne i'} \c{2}{d\subtext{cluster}}(D_{j,i},D_{j,i'})$
Dj+1DjDj,iDj,iDj,iDj,i\c{1}{\mathcal{D}_{j+1}} \gets \c{1}{\mathcal{D}_j} \oplus D_{j,i^\star} \cup D_{j,i^{\prime\star}} \ominus D_{j,i^\star} \ominus D_{j,i^{\prime\star}}
jj+1j \gets j+1
}
return Dj\c{1}{\mathcal{D}_j}
}

  • Dj={Dj,1,,Dj,kj}\c{1}{\mathcal{D}_j}=\{D_{j,1},\dots,D_{j,k_j}\} is the partition at the jj-th iteration
  • dcluster:P(X)×P(X)R+\c{2}{d\subtext{cluster}}: \mathcal{P}^\ast(X) \times \mathcal{P}^\ast(X) \to \mathbb{R}^+ is a (dis)similarity metric defined over sets of observations
    • it's a parameter of the technique
  • DD\mathcal{D} \oplus D adds DD to D\mathcal{D}
  • DD\mathcal{D} \ominus D removes DD from D\mathcal{D}

At each iteration:

  1. consider the current clusters in D\c{1}{\mathcal{D}}
  2. find the closest ones Di,DiD_{i^\star},D_{i^{\prime\star}}
  3. build the next iteration clusters by
    • copying all the existing but DiD_{i^\star} and DiD_{i^{\prime\star}}
    • adding DiDiD_{i^\star} \cup D_{i^{\prime\star}}
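
With scipy, agglomerative hierarchical clustering is a couple of calls (a sketch on the $\mathbb{R}^1$ example of the next slide):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

D = np.array([1, 2, 3, 6, 7, 9, 11, 12, 15, 18], dtype=float).reshape(-1, 1)

Z = linkage(D, method="single")  # single linkage; "complete", "average",
                                 # and "centroid" are also available
print(fcluster(Z, t=3, criterion="maxclust"))  # cut the hierarchy at k=3
# scipy.cluster.hierarchy.dendrogram(Z) plots the hierarchy (needs matplotlib)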
344 / 366

Cluster distances

There exist a few options for dcluster:P(X)×P(X)R+d\subtext{cluster}: \mathcal{P}^\ast(X) \times \mathcal{P}^\ast(X) \to \mathbb{R}^+. All are based on a (dis)similarity metric dd defined over observations, i.e., d:X×XR+d: X \times X \to \mathbb{R}^+.

  • Single linkage (nearest):
dcluster(D,D)=minxD,xDd(x,x)d\subtext{cluster}(D,D')= \min_{x \in D, x' \in D'} d(x,x')

  • Complete linkage (farthest):
dcluster(D,D)=maxxD,xDd(x,x)d\subtext{cluster}(D,D')= \max_{x \in D, x' \in D'} d(x,x')

  • Average linkage:
dcluster(D,D)=1DDxD,xDd(x,x)d\subtext{cluster}(D,D')= \frac{1}{|D| |D'|}\sum_{x \in D, x' \in D'} d(x,x')

  • Centroid: (only if X=RpX=\mathbb{R}^p)
dcluster(D,D)=d(c(D),c(D))d\subtext{cluster}(D,D')= d(c(D),c(D'))

where c(D)=xˉ=1DxDxc(D)=\bar{\vect{x}}=\frac{1}{|D|}\sum\sub{\vect{x} \in D} \vect{x} and xˉ\bar{\vect{x}} is the centroid of DD.

Question: what's the efficiency of the 4 dclusterd\subtext{cluster}?

345 / 366

Example in R1\mathbb{R}^1

Input: D={1,2,3,6,7,9,11,12,15,18}D=\{1,2,3,6,7,9,11,12,15,18\}

Execution¹:

jj Dj\mathcal{D}_j
0 {1},{2},{3},{6},{7},{9},{11},{12},{15},{18}\c{1}{\{1\}}, \c{1}{\{2\}}, \{3\}, \{6\}, \{7\}, \{9\}, \{11\}, \{12\}, \{15\}, \{18\}
1 {1,2},{3},{6},{7},{9},{11},{12},{15},{18}\c{1}{\{1, 2\}}, \c{1}{\{3\}}, \{6\}, \{7\}, \{9\}, \{11\}, \{12\}, \{15\}, \{18\}
2 {1,2,3},{6},{7},{9},{11},{12},{15},{18}\{1, 2,3\}, \c{1}{\{6\}}, \c{1}{\{7\}}, \{9\}, \{11\}, \{12\}, \{15\}, \{18\}
3 {1,2,3},{6,7},{9},{11},{12},{15},{18}\{1, 2,3\}, \{6,7\}, \{9\}, \c{1}{\{11\}}, \c{1}{\{12\}}, \{15\}, \{18\}
4 {1,2,3},{6,7},{9},{11,12},{15},{18}\{1, 2,3\}, \c{1}{\{6,7\}}, \c{1}{\{9\}}, \{11,12\}, \{15\}, \{18\}
5 {1,2,3},{6,7,9},{11,12},{15},{18}\{1, 2,3\}, \c{1}{\{6,7,9\}}, \c{1}{\{11,12\}}, \{15\}, \{18\}
6 {1,2,3},{6,7,9,11,12},{15},{18}\c{1}{\{1, 2,3\}}, \c{1}{\{6,7,9,11,12\}}, \{15\}, \{18\}
7 {1,2,3,6,7,9,11,12},{15},{18}\c{1}{\{1, 2,3,6,7,9,11,12\}}, \c{1}{\{15\}}, \{18\}
8 {1,2,3,6,7,9,11,12,15},{18}\c{1}{\{1, 2,3,6,7,9,11,12,15\}}, \c{1}{\{18\}}
9 {1,2,3,6,7,9,11,12,15,18}\{1, 2,3,6,7,9,11,12,15, 18\}

function cluster({xi}i=1i=n)\text{cluster}(\seq{x_i}{i=1}^{i=n}) {
j0j \gets 0
Dj{{x1},,{xn}}\mathcal{D}_j \gets \{\{x_1\},\dots,\{x_n\}\}
while Dj>1|\mathcal{D}_j|>1 {
$(i^\star,i^{\prime\star}) \gets \c{2}{\argmin}_{i,i' \in \{1,\dots,|\mathcal{D}_j|\} \land i \ne i'} d\subtext{cluster}(D_{j,i},D_{j,i'})$
Dj+1DjDj,iDj,iDj,iDj,i\mathcal{D}_{j+1} \gets \mathcal{D}_j \oplus D_{j,i^\star} \cup D_{j,i^{\prime\star}} \ominus D_{j,i^\star} \ominus D_{j,i^{\prime\star}}
jj+1j \gets j+1
}
return Dj\mathcal{D}_j
}

Assume single linkage:

  • dcluster(D,D)=minxD,xDd(x,x)d\subtext{cluster}(D,D')= \min_{x \in D, x' \in D'} d(x,x')

The output, i.e., the partition of DD, is D9\mathcal{D}_9: the hierarchy is the entire sequence D9,,D0\mathcal{D}_9,\dots,\mathcal{D}_0.

  1. We assume that, in case of tie, the first one is selected by arg min\argmin, i.e., the pair i,ii,i' for which i+ii+i' is the lowest.
346 / 366

Example in R2\mathbb{R}^2

Clustering toy problem: data

Clustering toy problem: distance matrix

Clustering toy problem: dendrogram

The hierarchy {Dj}j\seq{\mathcal{D}_j}{j}, not just the partition Dn1\mathcal{D}_{n-1}, can be visualized in the form of a dendrogram where:

  • each node is a DDD' \subseteq D
  • the root node is DD
  • each node $D'$ has two children $D'_1, D'_2$ that have been merged when forming $D'$
  • the height of each node is the distance dclusterd\subtext{cluster} of its two children

Question: what dclusterd\subtext{cluster} is being used here?

347 / 366

Hierarchical clustering on Iris

Dendrogram on Iris

  • yy is ignored while doing the clustering
    • but used for coloring the dendrogram

By looking at the dendrogram, one can choose an appropriate kk, or simply look at the dendrogram as the pattern.

348 / 366

Partitional clustering

k-means

349 / 366

Refining the partition

Consider the optimization problem behind clustering and the following heuristic¹ for solving it:

  1. start with a random partition {Dh}h\seq{D_h}{h}
  2. until {Dh}h\seq{D_h}{h} is good enough
    1. refine {Dh}h\seq{D_h}{h}
  3. return {Dh}h\seq{D_h}{h}
  1. heuristic [hyoo-ris-tik]: a trial-and-error method of problem solving used when an algorithmic exact approach is impractical.

  • Good?
    • the clusters are well separated
  • Good enough?
    • the partition cannot be further improved
    • or some computational budget has been consumed
350 / 366

k-means clustering

function cluster({x(i)}i=1i=n,k)\text{cluster}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, k) {
for h{1,,k}h \in \{1,\dots,k\} { //set initial centroids
μhx(U({1,,n}))\c{1}{\vect{\mu}_h} \gets \vect{x}^{(\sim U(\{1,\dots,n\}))}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D} \gets \c{2}{\text{assign}}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \c{1}{\seq{\vect{\mu}_h}{h=1}^{h=k}})
while ¬should-stop()\neg\text{should-stop()} {
for h{1,,k}h \in \{1,\dots,k\} { //recompute centroids
μh1DhxDhx\vect{\mu}_h \gets \frac{1}{|D_h|} \sum_{\vect{x} \in D_h} \vect{x}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D}' \gets \c{2}{\text{assign}}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \c{1}{\seq{\vect{\mu}_h}{h=1}^{h=k}})
if D=D\mathcal{D}'=\mathcal{D} {
break
}
DD\mathcal{D} \gets \mathcal{D}'
}
return D\mathcal{D}
}

function assign({x(i)}i=1i=n,{μh}h=1h=k)\c{2}{\text{assign}}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \c{1}{\seq{\vect{\mu}_h}{h=1}^{h=k}}) {
D{,,}\mathcal{D} \gets \{\emptyset,\dots,\emptyset\} //kk empty sets
for i{1,,n}i \in \{1,\dots,n\} {
h=arg minh{1,,k}d(x(i),μh)h^\star = \argmin_{h \in \{1,\dots,k\}} d(\vect{x}^{(i)},\c{1}{\vect{\mu}_h})
DhDh{x(i)}D_{h^\star} \gets D_{h^\star} \cup \{\vect{x}^{(i)}\} //assign to the closest centroid
}
return D\mathcal{D}
}

  • X=RpX = \mathbb{R}^p
    • otherwise you cannot compute the mean as μh1DhxDhx\c{1}{\vect{\mu}_h} \gets \frac{1}{|D_h|} \sum_{\vect{x} \in D_h} \vect{x}
  • μ1,,μk\vect{\mu}_1,\dots,\vect{\mu}_k are the means of the clusters and act as centroids
    • there are kk means!
    • randomly chosen at the first iteration
  • assign()\text{assign()} assigns observations, i.e., points, to closest centroids
  • when there's no change in the partition, the loop stops
    • $\text{should-stop()}$ may employ additional stopping criteria, e.g.:
      • number of iterations
      • distance traveled by the centroids
  • this technique is not deterministic, due to the initial random assignment
    • U({1,,n})\sim U(\{1,\dots,n\}) without repetition
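
A minimal sketch with scikit-learn on the $\mathbb{R}^1$ example of the next slide (n_init repeats the non-deterministic random initialization several times and keeps the best run):

import numpy as np
from sklearn.cluster import KMeans

D = np.array([1, 2, 3, 6, 7, 9, 11, 12, 15, 18], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(D)
print(km.labels_)           # cluster index of each observation
print(km.cluster_centers_)  # the k means, acting as centroids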
351 / 366

Example in R1\mathbb{R}^1

Input: D={1,2,3,6,7,9,11,12,15,18}D=\{1,2,3,6,7,9,11,12,15,18\}, k=3k=3

Execution (one initial random assignment):

Dj\mathcal{D}_j μ1\vect{\mu}_1 μ2\vect{\mu}_2 μ3\vect{\mu}_3
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{1}{6},\c{2}{7},\c{2}{9},\c{2}{11},\c{2}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 11\c{2}{11} 15\c{4}{15}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{1}{6},\c{2}{7},\c{2}{9},\c{2}{11},\c{2}{12},\c{4}{15},\c{4}{18}\} 3\c{1}{3} 9.8\c{2}{9.8} 16.5\c{4}{16.5}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{1}{6},\c{2}{7},\c{2}{9},\c{2}{11},\c{2}{12},\c{4}{15},\c{4}{18}\} 3\c{1}{3} 9.8\c{2}{9.8} 16.5\c{4}{16.5}

Execution (another initial random assignment):

Dj\mathcal{D}_j μ1\vect{\mu}_1 μ2\vect{\mu}_2 μ3\vect{\mu}_3
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{2}{2},\c{4}{3},\c{4}{6},\c{4}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 2\c{2}{2} 3\c{4}{3}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{2}{2},\c{2}{3},\c{4}{6},\c{4}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 2\c{2}{2} 10.1\c{4}{10.1}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{2}{2},\c{2}{3},\c{2}{6},\c{4}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 2.5\c{2}{2.5} 11.1\c{4}{11.1}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{2}{3},\c{2}{6},\c{2}{7},\c{4}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1\c{1}{1} 3.7\c{2}{3.7} 12\c{4}{12}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{2}{6},\c{2}{7},\c{2}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 1.5\c{1}{1.5} 5.3\c{2}{5.3} 13\c{4}{13}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{2}{6},\c{2}{7},\c{2}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 2\c{1}{2} 7.3\c{2}{7.3} 14\c{4}{14}
{1,2,3,6,7,9,11,12,15,18}\{\c{1}{1},\c{1}{2},\c{1}{3},\c{2}{6},\c{2}{7},\c{2}{9},\c{4}{11},\c{4}{12},\c{4}{15},\c{4}{18}\} 2\c{1}{2} 7.3\c{2}{7.3} 14\c{4}{14}

Question: what's the best clustering? can we answer this question?

function cluster({x(i)}i=1i=n,k)\text{cluster}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, k) {
for h{1,,k}h \in \{1,\dots,k\} {
μhx(U({1,,n}))\vect{\mu}_h \gets \vect{x}^{(\sim U(\{1,\dots,n\}))}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D} \gets \text{assign}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \seq{\vect{\mu}_h}{h=1}^{h=k})
while ¬should-stop()\neg\text{should-stop()} {
for h{1,,k}h \in \{1,\dots,k\} {
μh1DhxDhx\vect{\mu}_h \gets \frac{1}{|D_h|} \sum_{\vect{x} \in D_h} \vect{x}
}
Dassign({x(i)}i=1i=n,{μh}h=1h=k)\mathcal{D}' \gets \text{assign}(\seq{\vect{x}^{(i)}}{i=1}^{i=n}, \seq{\vect{\mu}_h}{h=1}^{h=k})
if D=D\mathcal{D}'=\mathcal{D} {
break
}
DD\mathcal{D} \gets \mathcal{D}'
}
return D\mathcal{D}
}

352 / 366

Example in R2\mathbb{R}^2

Example of k-means in R^2

Given two points μ1,μ2\vect{\mu}_1,\vect{\mu}_2, the line which

  • is orthogonal to the segment μ1μ2undefined\overlinesegment{\vect{\mu}_1\vect{\mu}_2} and
  • goes through its midpoint

divides the space in points closer to μ1\vect{\mu}_1 and those closer to μ2\vect{\mu}_2.

Image from Wikipedia

353 / 366

Applying ML to text

354 / 366

What's text?

Formally, a piece of text is a variable-length sequence of symbols belonging to an alphabet $A$. Hence: $x \in A^*$ where $A$ is usually (in modern times) UTF-16, so it may include emojis:

  • there are thousands of them: 🤩🦴🐁...

A dataset $X \in \mathcal{P}^\ast(A^\ast)$ of texts, possibly with labels, is called a corpus. A single text $x^{(i)}$ is called a document.

However, what we usually mean by text is natural language, where the sequence of characters is a noisy container of underlying information:

  • given a document xx, the actual meaning of xx may depend on other documents
  • given a portion xxx' \sqsubset x of a document xx, its meaning may be different if put in another document xx''

Natural language is by nature ambiguous!

355 / 366

Examples of text+ML problems

  • Given a brand (e.g., Illy, Fiat, Dell, U.S. Triestina Calcio, ...), build a system that tells whether people are talking positively or negatively about the brand on Twitter (or Mastodon).

  • Given a corpus of letters to/from soldiers fighting during WW1, what are the topics they talk about?

  • Given a scientific paper p1p_1, what's the relevance of the citation of another paper p2p_2 referenced in p1p_1?

356 / 366

Sentiment analysis

A relevant class of problems is the one in which the goal is to gain insights about the sentiments an author was feeling while authoring a document xx. This is called sentiment analysis.

Usually, this problem is cast as a form of supervised learning, where YY contains sentiments.

Variants:

  • Y={Pos,Neg}Y =\{\text{Pos},\text{Neg}\}
  • Y=[1,1]Y =[-1,1]
  • Y=[1,1]10Y=[-1,1]^{10}
    • one for each of anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive (see the Syuzhet package)
  • ...

In every case, we can¹ apply classic ML (supervised and unsupervised) techniques if we pre-process text to obtain multivariate observations, possibly in $\mathbb{R}^p$, i.e., we want a $f\subtext{text-to-vect}: A^* \to \mathbb{R}^p$:

xAx \in A^*xRp\vect{x}' \in \mathbb{R}^pftext-to-vectf\subtext{text-to-vect}
  1. Actually, we have to, with the only exception of hierarchical clustering for which we might directly work on text with a suitable d()d().
357 / 366

Bag-of-words

Bag-of-words (BOW) is a ftext-to-vectf\subtext{text-to-vect} based on the idea of associating one numerical variable with each word in a predefined dictionary.

xAx \in A^*xRW\vect{x}' \in \mathbb{R}^{|W|}fBOWf\subtext{BOW}WW

In practice, given the dictionary (i.e., set of words WP(A)W \in \mathcal{P}(A^*)) and given a document xx:

  1. tokenize xx in a multiset T=ftokenize(x)T=f\subtext{tokenize}(x) of tokens (words)
  2. for each $t \in T$, set $x'_t$ to the multiplicity $m(t,T)$ of $t$ in $T$, i.e., to the number of occurrences of the word $t$ in $x$

The outcome is a xRW\vect{x}' \in \mathbb{R}^{|W|}.

An alternative version is to consider frequencies instead of occurrences:

  • i.e., $x'_t=\frac{m(t,T)}{|T|}$
  • useful if the documents have very different lengths but the length itself is not relevant information
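
A BOW sketch with scikit-learn's CountVectorizer (which also tokenizes and lowercases):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Banana is my favorite fruit", "I like banana, banana, banana"]

vectorizer = CountVectorizer()        # fit learns the dictionary W...
X = vectorizer.fit_transform(corpus)  # ...transform maps documents to R^|W|
print(vectorizer.get_feature_names_out())  # the dictionary W
print(X.toarray())                    # multiplicities, one row per document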
358 / 366

Common text pre-processing steps

Because of tokenization, BOW considers slightly different sequences of characters as different words, and hence as different features. Usually, this is not good.

In practice, you often do some basic pre-processing steps:

  • case conversion: everything to lowercase (language independent)
    • x=Banana is my favorite fruitx=banana is my favorite fruitx=\text{Banana is my favorite fruit} \mapsto x'=\text{banana is my favorite fruit}
    • $x=\text{I like banana} \mapsto x'=\text{i like banana}$
  • removal of punctuation (language independent)
  • stemming: each word is replaced with its morphological root (language dependent)
    • x=I liked eating bananasx=I lik eat bananax=\text{I liked eating bananas} \mapsto x'=\text{I lik eat banana}
    • x=andammo tristemente rassegnatix=andar triste rassegnatx=\text{andammo tristemente rassegnati} \mapsto x'=\text{andar triste rassegnat}
  • removal of stop-words (language dependent)
    • stop words are very common words (articles, some prepositions, ...)

Each of these steps is a $f\subtext{pre-proc}: A^\ast \to A^\ast$:

xAx \in A^\astxAx' \in A^\astfpre-procf\subtext{pre-proc}
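
A minimal $f\subtext{pre-proc}$ sketch for the two language-independent steps (stemming and stop-word removal would need a language-dependent resource, e.g., from the nltk library):

import re

def pre_process(x: str) -> str:
    x = x.lower()                   # case conversion
    x = re.sub(r"[^\w\s]", " ", x)  # removal of punctuation
    return re.sub(r"\s+", " ", x).strip()

print(pre_process("I just saw Alice!!!"))  # i just saw alice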
359 / 366

Counter examples

The 4 common pre-processing steps are not always appropriate. It depends on whether they help model the $y$-$x$ dependency.

Sentiment analysis and punctuation:

  • I just saw Alice\text{I just saw Alice}
  • I just saw Alice!!!\text{I just saw Alice!!!}
  • I just saw Alice!!! 🥰😍💘\text{I just saw Alice!!! 🥰😍💘}

Music genre preferences and case: a bit forced...

  • I like the Take That and I hate The Who.\text{I like the Take That and I hate The Who.}
  • Who likes to take that song of Hate? Me!\text{Who likes to take that song of Hate? Me!}

Education level and stemming:

  • se fossi stato malato, me ne sarei stato a casa\text{se fossi stato malato, me ne sarei stato a casa}
  • se ero malato, me ne stavo a casa\text{se ero malato, me ne stavo a casa}
360 / 366

tf-idf

BOW tends to overweigh words which are very frequent, but not relevant (similarly to stop-words) and underweigh words that are relevant, but rare.

Solution: use tf-idf instead of occurrences or frequencies. tf-idf is the product of the term frequency (i.e., the frequency of a word in a document) and the inverse document frequency, i.e., the inverse of the frequency, in the corpus, of the documents containing that term.

Given the dictionary WW, the corpus XX, and a document xx:

  1. tokenize xx in a multiset TT of tokens (words)
  2. for each tTt \in T, set xt=ftf(t,x)fidf(t,X)x'_t=\c{1}{f\subtext{tf}(t, x)} \c{2}{f\subtext{idf}(t, X)}

where:

  • ftf(t,x)=m(t,T)Tf\subtext{tf}(t, x)=\frac{m(t,T)}{|T|}
  • fidf(t,X)=logXxX1(tftokenize(x))f\subtext{idf}(t, X)=\log \frac{|X|}{\sum_{x \in X} \mathbf{1}(t \in f\subtext{tokenize}(x))}

The more common a word in the corpus, the greater its tf but the lower its idf ($0$ if it occurs in every document). The more specific a word to a document, the larger its tf and its idf.

tf-idf corresponds to a ftf-idf-learn:P(A)P(A)f\subtext{tf-idf-learn}: \mathcal{P}^\ast(A^\ast) \to \mathcal{P}^\ast(A^\ast), which is just the identity¹, and a ftf-idf-apply:A×P(A)RWf\subtext{tf-idf-apply}: A^\ast \times \mathcal{P}^\ast(A^\ast) \to \mathbb{R}^{|W|}:

XXXXftf-idf-learnf\subtext{tf-idf-learn}
x,Xx,Xx\vect{x}'ftf-idf-applyf\subtext{tf-idf-apply}WW
  1. or, more verbosely and more formally: ftf-idf-learn:P(A)FA[0,1]2f\subtext{tf-idf-learn}: \mathcal{P}^\ast(A^\ast) \to \mathcal{F}_{A^\ast \to [0,1]^2}, because it returns a mapping between words and two frequencies (tf and idf).
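
With scikit-learn's TfidfVectorizer (a sketch; note that scikit-learn uses a smoothed variant of the idf formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the beer was good", "the beer was not good",
          "the pub was too noisy"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# words occurring in every document ("the", "was") get the lowest weights
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))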
361 / 366

Reducing the dimensionality

With BOW, $p=|W|$, which might be very large.

Common approaches:

  • use a very small dictionary, tailored to the specific case
  • learn a small dictionary (W=k|W|=k) on the learning data
    • you have a fBOW-top-learn:P(A)P(A)f\subtext{BOW-top-learn}: \mathcal{P}^\ast(A^\ast) \to \mathcal{P}(A^\ast) and a fBOW-top-apply:A×P(A)Rkf\subtext{BOW-top-apply}: A^\ast \times \mathcal{P}(A^\ast) \to \mathbb{R}^k
    • in learning
      • use fBOW-top-learn(X)=Wf\subtext{BOW-top-learn}(X)=W to build the dictionary WW from the corpus XX, then
      • transform the corpus in a XP(Rk)X' \in \mathcal{P}^\ast(\mathbb{R}^k) using fBOW-top-apply(x(i),W)=x(i)f\subtext{BOW-top-apply}(x^{(i)}, W)=\vect{x}^{\prime(i)} on each xx
    • in prediction, use fBOW-top-apply(x,W)=xf\subtext{BOW-top-apply}(x, W)=\vect{x}'
    • WW is often set as "the most frequent kk words" (but remove stop-words!)
  • use tf-idf and get kk most important words
XXWWfBOW-top-learnf\subtext{BOW-top-learn}kk
x,Wx,Wx\vect{x}'fBOW-top-applyf\subtext{BOW-top-apply}

The order of words in WW does matter, so it's W(A)W \in (A^\ast)^\ast, rather than WP(A)W \in \mathcal{P}(A^\ast).
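
With CountVectorizer, the learn-a-small-dictionary approach takes one parameter (a sketch, with $k=3$ and English stop-word removal):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the beer was good", "the beer was not good",
          "the pub was too noisy"]

# keep only the k most frequent words, after removing stop-words
vectorizer = CountVectorizer(max_features=3, stop_words="english")
X = vectorizer.fit_transform(corpus)       # learn W, then apply it
print(vectorizer.get_feature_names_out())  # the learned dictionary W, |W| = k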

362 / 366

Considering ordering

Both BOW and tf-idf ignore word ordering. But ordering is fundamental in natural language.

Example: (sentiment classification for restaurant reviews)

  • The beer was good and the pub was not too noisy.\text{The beer was good and the pub was not too noisy.}
  • The beer was not good and the pub was too noisy.\text{The beer was not good and the pub was too noisy.}

Most common solutions:

  • ngrams
  • part of speech (POS) tagging
363 / 366

ngrams

Instead of considering word frequencies (or occurrences, or tf-idf), consider the frequencies of short sequences of up to $n$ words (tokens, or characters in general), i.e., of ngrams.

Example: (with n=3n=3 and aggressive stop-word removal)

  • The beer was good and the pub was not too noisy.\text{The beer was good and the pub was not too noisy.}
    • xbeer,good=1x_{\text{beer},\text{good}}=1, xpub,not,noisy=1x_{\text{pub},\text{not},\text{noisy}}=1
  • The beer was not good and the pub was too noisy.\text{The beer was not good and the pub was too noisy.}
    • xbeer,not,good=1x_{\text{beer},\text{not},\text{good}}=1, xpub,too,noisy=1x_{\text{pub},\text{too},\text{noisy}}=1

Since pp may become very very large, dimensionality reduction becomes very important.
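
With CountVectorizer, ngrams again take one parameter (a sketch with $n=3$):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The beer was good and the pub was not too noisy.",
        "The beer was not good and the pub was too noisy."]

# ngram_range=(1, 3): count all 1-, 2-, and 3-grams of words; the two
# reviews now yield different features ("beer was good" vs. "beer was not")
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)
print(X.shape)  # p grows quickly with n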

364 / 366

Part-of-speech tagging (very briefly)

A technique, belonging to Natural Language Processing methods, that assigns a grammatical role to each word in a document. Roles can then be used to augment the text-to-vect transformation.

POS example

365 / 366

Lab 3: sport vs. politics

Build a system that:

  1. every day collects a large set of random tweets and groups them in tweets about politics and about sport
  2. for each of the two groups, shows the main topics of discussion

The system uses a dashboard to show its findings. You don't need to build the dashboard here, but imagining it and its usage can facilitate the design of the system.

Hints:

  • the hardest part is collecting the data for designing/building the system
  • interesting R packages
    • tm for doing text mining (tokenization, punctuation, stop-words, stemming, ...)
    • other supervised learning: e1071, randomForest
    • clustering: kmeans, hclust
366 / 366
