A.Y. 2024/2025
Eric Medvet
| Name | Labs | Research interests |
|---|---|---|
| Sylvio Barbon Jr. | Fondamenti di informatica; Progettazione del software e dei sistemi informativi | meta learning, applied ML, process mining |
| Alberto Bartoli | Reti di calcolatori; Computer networks 2 and introduction to cybersecurity | security, applied ML, evolutionary computation |
| Andrea De Lorenzo | Basi di dati; Programmazione web | security, applied AI&ML, information retrieval, GP |
| Eric Medvet | Programmazione avanzata; Introduction to machine learning and evolutionary robotics | evolutionary computation, embodied AI, applied ML |
| Laura Nenzi | Cyber-physical systems; Introduction to Artificial Intelligence | formal methods, runtime verification |
| Martino Trevisan | Reti di calcolatori; Sistemi operativi; Architetture dei sistemi digitali | network measurements, data privacy, big data |
1st part (6 CFUs, 48 hours): for all: IN23, IN19, SM38, SM36, SM34, SM23, SM28, SM13, and SM64
2nd part (3 CFUs, 24 hours): just for IN23 and IN19
Focus on methodology:
Teacher slides:
Notebooks for the lab activity:
Textbooks:
Disclaimer: the textbooks overlap with the course material only partially!
Depending on your learning style and habits, you might want to take notes to augment the slide content.
This is an important concept.
This is a very important key concept, like a definition.
Sometimes there is something that is marginally important: it is an aside, like this.
There will be scientific papers or books to refer to, like this book: James, Gareth, et al.; An Introduction to Statistical Learning. Vol. 112. New York: Springer, 2013.
External resources (e.g., videos, software tools, ...) will be linked directly.
Palette is color-blind safe: ⬤⬤⬤⬤⬤
Pseudo-code for describing algorithms in an abstract way:
function factorial(n) {
  p ← 1
  while n > 1 {
    p ← p · n
    n ← n − 1
  }
  return p;
}
Code in a concrete programming language:
public static String sayHello(String name) {
  return "Hello %s".formatted(name);
}
Focus on methodology:
Practice (in designing, building, evaluating) is fundamental!
You'll practice doing lab activities:
Where:
When:
Michel El Saliby
Role of the tutor:
The exam may be done in two ways:
The written test consists of a few (≈6) questions, some with medium-length answers, some with short answers, to be answered in 1 h.
The project consists of the design, development, and assessment of an ML system dealing with one "problem" chosen among a few options (examples).
The grade is the average of the written test and project grades:
Machine Learning is the science of getting computers to learn without being explicitly programmed.
A few considerations:
Let's analyze it in detail:
What the user sees:
What the web-based email system (a computer) does:
In brief: a computer is making a decision about an email
Let's be formal:
y=f(x)
$y=f(x)$ is a formal notation capturing the idea that $y$ is obtained from $x$ by applying $f$ to it.
But it says nothing about the nature of $x$ and $y$.
$f: X \to Y$
Neither notation says how $f$ works internally.
Alternative names:
Names are used interchangeably; some communities tend to prefer some names.
Machine Learning is the science of getting computers to learn without being explicitly programmed.
New version:
Machine Learning is the science of getting computers to learn f:X→Y without being explicitly programmed.
we want the computer to learn f and use it, not just learn it
$f$ is often denoted as $f_\text{predict}$ since, given an $x$, it predicts a $y$
Computers execute instructions grouped in programs and expressed according to some language.
f is the mathematical, abstract notation for a computer program that, when executed on an input x∈X , outputs a y∈Y.
Mathematical notation:
$X=\mathbb{R}^2$, $Y=\mathbb{R}$, $f:\mathbb{R}^2\to\mathbb{R}$, $f(x)=f((x_1,x_2))=\left|\frac{x_1-x_2}{x_1}\right|$
$\mathbf{x}$ is a notation for vectors or, more broadly, for sequences of homogeneous elements, in place of $x$
Computer language:
public double f(double[] xs) {
  return Math.abs((xs[0] - xs[1]) / xs[0]);
}
Most languages (not all: typed ones) make the connection clear:
`double[]` is $X$, i.e., $\mathbb{R}^2$ (actually $\mathbb{R}^p$, with $p \ge 1$)
`double` is $Y$, i.e., $\mathbb{R}$
`xs` is $x$
`f` is $f$: types correspond!
there is no explicit counterpart for $y$
Abstract definition (≈ the signature):
f:R2→R
double f(double[] xs)
Concrete definition (≈ signature and code):
$f:\mathbb{R}^2\to\mathbb{R}$, $y=f(x)=f((x_1,x_2))=x_1+x_2$
public double f(double[] xs) {
  return xs[0] + xs[1];
}
Usually, computer programs are written by humans, but here:
Machine Learning is the science of getting computers to learn fpredict:X→Y without being explicitly programmed.
without being explicitly programmed means that fpredict is not written by a human!
It appears verbose, let's get rid of it.
New version:
Machine Learning is the science of getting computers to learn fpredict:X→Y autonomously.
Alice (computer science instructor) to Bob (student):
"Please, write a program that, given a string, returns the number of vowel occurrences in the string"
Alternative version:
"Please, find a program that, given a string, returns the number of vowel occurrences in the string"
"Find" suggests Bob to apprach the task in two steps:
In f terms:
Step 2 is fundamental in practice...
... but it is hard to formalize further in general.
There has to be some supervision facilitating the search for a good f.
The supervision is in the form of some examples (observation → response), and the learned fpredict should process them correctly.
New version:
Supervised (Machine) Learning is the science of getting computers to learn fpredict:X→Y from examples autonomously.
In unsupervised learning there is no supervision, there are no examples:
Formally, the examples available for learning $f_\text{predict}$ are pairs $(x,y)$.
A dataset compatible with $X$ and $Y$ is a bag of pairs $(x,y)$: $D=\{(x^{(i)},y^{(i)})\}_{i=1}^{i=n}$ with $\forall i: x^{(i)}\in X, y^{(i)}\in Y$ and $|D|=n$.
Or, more briefly, $D=\{(x^{(i)},y^{(i)})\}_i$. Examples are also denoted by $(x_i,y_i)$, depending on the community.
A bag ($D$ should be called a databag...):
In most algorithms, and in their program counterparts, datasets are actually processed sequentially, though.
A learning set is a dataset that is used for learning an fpredict.
The learning set has to be consistent with the domain and codomain of the function fpredict to be learned:
Supervised (Machine) Learning is the science of getting computers to learn fpredict:X→Y from examples autonomously.
In brief: given a $D_\text{learn} \in \mathcal{P}^*(X\times Y)$, learn an $f_\text{predict} \in \mathcal{F}_{X\to Y}$.
A supervised learning technique is a way for learning an $f_\text{predict} \in \mathcal{F}_{X\to Y}$ given a $D_\text{learn} \in \mathcal{P}^*(X\times Y)$.
$f_\text{learn}: \mathcal{P}^*(X\times Y) \to \mathcal{F}_{X\to Y}$, with $f_\text{predict} = f_\text{learn}(D_\text{learn})$
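To make these signatures concrete, here is a minimal sketch in Java (the names `Example` and `Learner` are ours, not from any library): a learning technique is just a function from a dataset to a predictor.

```java
import java.util.List;
import java.util.function.Function;

// a labeled example (x, y)
record Example<X, Y>(X x, Y y) {}

// f_learn: P*(X x Y) -> F_{X -> Y}: a dataset goes in, a predictor f_predict comes out
interface Learner<X, Y> {
  Function<X, Y> learn(List<Example<X, Y>> dataset);
}
```

The learned $f_\text{predict}$ is then just the returned `Function<X, Y>`, to be applied to new $x$ values.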
Why isn't a single learning technique enough? Why are there many of them?
They differ in:
Supervised (Machine) Learning is the science of getting computers to learn $f_\text{predict}: X \to Y$ from examples autonomously.
getting computers: who is doing that?
is the science: what's science?
New version:
Supervised (Machine) Learning is about designing and applying supervised learning techniques.
A supervised learning technique $f_\text{learn}: \mathcal{P}^*(X\times Y) \to \mathcal{F}_{X\to Y}$ can be seen as a form of optimization:
Could we use a general optimization technique?
In principle, yes, but:
Practical solution: reduce the size of $\mathcal{F}_{X\to Y}$ by considering only $f$ of some given nature:
Often a learning technique works on a reduced $\mathcal{F}'_{X\to Y}$ which is based on a template $f'$:
E.g., for $X=Y=\mathbb{R}$, $f'(x)=ax+b$:
We can make explicit the undefined part of the template: $f_\text{predict}(x) = f'_\text{predict}(x, m)$, where $m \in M$ is the undefined part.
Note that $f'_\text{predict}$ is fixed for a given learning technique and defines the reduced $\mathcal{F}'_{X\to Y} \subset \mathcal{F}_{X\to Y}$ where the learning will look for an $f_\text{predict}$.
Given a template fpredict′, m defines an fpredict that can be used to predict a y from an x.
That is, m is a model of how y depends on x.
For techniques based on a template, $f_\text{learn}$ actually searches just $\mathcal{F}'_{X\to Y}$, hence $M$, to find an $f_\text{predict}$.
General case: $f_\text{learn}: \mathcal{P}^*(X\times Y) \to \mathcal{F}_{X\to Y}$ and $f_\text{predict}: X \to Y$
The learning technique is defined by $f_\text{learn}$.
With template: $f'_\text{learn}: \mathcal{P}^*(X\times Y) \to M$ and $f'_\text{predict}: X \times M \to Y$
The learning technique is defined by $f'_\text{learn}, f'_\text{predict}$.
Problem: price of a flat from its surface. $X=\mathbb{R}^+$, $Y=\mathbb{R}^+$, $\mathcal{F}_{\mathbb{R}^+\to\mathbb{R}^+} = \left\{\dots, x^2, 3, \pi\frac{x^3+5x}{0.1+x}, \dots\right\}$
Learning technique: linear regression. $f'_\text{predict}(x,a,b) = ax+b$, $M = \mathbb{R}\times\mathbb{R} = \{(a,b): a\in\mathbb{R} \wedge b\in\mathbb{R}\}$, $\mathcal{F}'_{\mathbb{R}^+\to\mathbb{R}^+} = \{\dots, x+1, 3, \pi x+5, \dots\}$
Problem: classify email as spam/not-spam
$X=A^*$, $Y=\{\text{spam}, \neg\text{spam}\}$, $A =$ the set of UTF-8 characters
$\mathcal{F}_{A^*\to Y} = \{\dots\}$ (all predicates on UTF-8 strings)
Learning technique: regex-based flagging
$f'_\text{predict}(x,r) = \begin{cases}\text{spam} & \text{if } x \text{ matches } r \\ \neg\text{spam} & \text{otherwise}\end{cases}$
$M = \text{regexes} = \{\dots, \texttt{ca.++}, \texttt{[a-z]+a.+}, \dots\}$
$\mathcal{F}'_{A^*\to Y} = \{\dots, f'_\text{predict}(\cdot, \texttt{[a-z]+a.+}), \dots\}$
Choosing the learning technique means choosing one $\mathcal{F}'_{X\to Y}$!
$A^* = \bigcup_{i=0}^{i=\infty} A^i$
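As a sketch of this technique in Java (class labels and the example regex are placeholders of ours), the model $m$ is just a compiled regex:

```java
import java.util.regex.Pattern;

public class RegexSpamClassifier {

  private final Pattern model; // the model m: a regex

  public RegexSpamClassifier(String regex) {
    this.model = Pattern.compile(regex);
  }

  // f'_predict(x, r): spam if x matches r, not spam otherwise
  public String predict(String x) {
    return model.matcher(x).find() ? "spam" : "not-spam";
  }

  public static void main(String[] args) {
    RegexSpamClassifier c = new RegexSpamClassifier("[a-z]+a.+"); // one m in M
    System.out.println(c.predict("cheap viagra now")); // -> spam
  }
}
```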
The model m is learned on a dataset D.
The model m is trained on a dataset D.
The model m is fitted on a dataset D.
Formally, a model is one specific m∈M that has been found upon learning.
However, often "model" is used to denote a generic (e.g., still untrained/unfitted) artifact.
Supervised learning techniques may be categorized depending on the kind of X,Y,M they deal with:
With respect to Y, most important cases:
With respect to X, common cases:
In the common case of a multivariate $X = X_1 \times \dots \times X_p$:
Given a dataset $D$ with $|D| = n$ examples defined over $X, Y$:
$D = \begin{pmatrix} x_1^{(1)} & \dots & x_j^{(1)} & \dots & x_p^{(1)} & y^{(1)} \\ \vdots & & \vdots & & \vdots & \vdots \\ x_1^{(i)} & \dots & x_j^{(i)} & \dots & x_p^{(i)} & y^{(i)} \\ \vdots & & \vdots & & \vdots & \vdots \\ x_1^{(n)} & \dots & x_j^{(n)} & \dots & x_p^{(n)} & y^{(n)} \end{pmatrix}$
The common notation for the size of a multivariate dataset (i.e., a dataset with a multivariate X=X1×⋯×Xp) is:
On the assumption that a dataset D implicitly defines the problem (since it bounds X and Y and hence FX→Y), n and p also describe the size of the problem.
A family of learning techniques (tree-based) for:
A family of learning techniques (SVM) for:
A learning technique (kNN) for:
A learning technique (naive Bayes) for:
What if none of the above learning techniques fits the problem (X,Y) at hand?
We'll see:
What about the other kinds of problems?
An information processing system in which there is:
Goal: given a tweet, determine age range and gender of the author
One possible ML system for this problem:
Learning phase:
$D'_\text{learn} = f_\text{foreach}(D_\text{learn}, f_\text{text-to-num})$ (just the $x$ part)
$m_\text{age} = f'_{\text{learn},1}(D'_\text{learn})$
$m_\text{gender} = f'_{\text{learn},2}(D'_\text{learn})$
Prediction phase:
$x' = f_\text{text-to-num}(x)$
$y_\text{age} = f'_{\text{predict},1}(x', m_\text{age})$
$y_\text{gender} = f'_{\text{predict},2}(x', m_\text{gender})$
The designer of the ML system, that is, you¹!
Steps 4–6 are usually iterated many times
Skills of the ML practitioner/designer:
Skills of the ML researcher:
Experience, practice, knowledge!
Recall: we need an fpredict:X→Y to make a decision y about an x
Reasons for running fpredict on a machine:
If fpredict is run on a machine, fpredict might still be designed by a human.
Reasons for running flearn on a machine, i.e., to obtain fpredict through learning:
Factors:
Reasons for running flearn on a machine:
Answering these questions requires the knowledge of the domain
Component:
Factors: beyond applicability, which is a yes/no matter
Once upon a time, there were Alice, a ML expert, and Bob, an amateur botanist...
Why a story?
Once upon a time, there were Alice, a ML expert, and Bob, an amateur botanist.
Bob liked to collect Iris flowers and to sort them properly in his collection boxes at home. He organized collected flowers depending on their species.
Iris setosa Iris virginica Iris versicolor
Alice: What's up, Bob?
Bob: I'd like to put new flowers into proper boxes.
Well... I'm not an expert of flowers. Can't you do it by yourself?
No, actually I cannot. But I heard you now master the art of machine learning...
Mmmmhhh... I see that you already have flowers in boxes. How did you sort them? Why ML now?
Well, I used to go to a professional botanist, who was able to tell me the species of each Iris flower I collected. I don't want to bother her anymore; moreover, her lab is far from here, it takes time to get there, and fuel is getting more and more pricey... 🦖
Ok, I understand. So you think ML can be helpful here. Let's see...
Some information about the context up to here (Alice's thoughts 💭):
No car accidents to be avoided (timing), no billions of emails to be analyzed (scale), no big business process involved (cost), no loan decision to be made (quality).
Reasons for running fpredict on a machine:
Reasons for learning fpredict on a machine:
👍: yes!; 👎: no!; 🤏: maybe a bit; 🤌: who knows...
Outcome: ok, let's use ML!
Do we have examples at hand?
Yes, Bob already collected some flowers and organized them in boxes. For each of them, there's a species label that has been assigned by an expert (the professional botanist). We assume those labels are correctly assigned.
Outcome: it's supervised learning!
In natural language: given an Iris flower, assign a species
Formally:
X={x:x is an Iris flower}
Y={setosa,versicolor,virginica}
Issues with this X: 🤔
We cannot just take another $X$, because the problem is "given an Iris flower, assign a species". But we can introduce some pre-processing¹ steps that transform an $x \in X$ into an $x' \in X'$, with $X'$ being better, i.e., more suitable, for the later steps.
That is, we can design an $f_\text{pre-proc}: X \to X'$ and an $X'$!
Requirements:
Since most learning techniques are designed to work on a multivariate $X$, we are going to design an $f_\text{pre-proc}: X \to X' = X'_1 \times \dots \times X'_p$. That is, we are going to define the features and the way to compute them out of an $x$.
This step is called feature engineering and is in practice a key step in the design of an ML system, often more important than the choice of the learning technique:
Some options:
| Function $f_\text{pre-proc}$ | Set $X'$ | Cost | Info¹ | Comp.² |
|---|---|---|---|---|
| $x'$ is a textual desc. of $x$ | strings | 🫰🫰 | 🫳 | 👍 |
| $x'$ is a digital pic. of $x$ | $[0,1]^{512\times512\times3}$ | 🫰 | 🫳 | 👍³ |
| $x'$ is "the" DNA of $x$ | $\{A,C,G,T\}^*$ | 🫰🫰🫰 | 👍 | 👍 |
| $x'$ is some measurements of $x$ | $\mathbb{R}^p$ | 🫰 | 🫳 | 👍 |
The actual decision should be taken by Alice and Bob together, based on domain knowledge of the latter and ML knowledge of the former.
Requirements for fpre-proc:X→X′:
Assume we choose "$x'$ is some measurements of $x$", namely 4 measurements; then $f_\text{pre-proc}: X \to \mathbb{R}^4$ and $f_\text{pre-proc}(x) = x' = (x'_1, x'_2, x'_3, x'_4)$ with:
| $x'_1$ | $x'_2$ | $x'_3$ | $x'_4$ | $y$ |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |
Requirements for fpre-proc:X→X′:
Alice's thoughts 💭: Is it true that we cannot design a reasonable fpredict? Are we retaining information?
Let's look at the data!
How does Alice choose these 3 approaches, in this order?
Mean values for species and feature
iris %>% group_by(Species) %>% summarise_all(mean)
# A tibble: 3 × 5
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
  <fct>             <dbl>       <dbl>        <dbl>       <dbl>
1 setosa             5.01        3.43         1.46       0.246
2 versicolor         5.94        2.77         4.26       1.33
3 virginica          6.59        2.97         5.55       2.03
Findings: setosa looks more different
Boxplots of values for species and feature
iris %>% pivot_longer(cols=!Species) %>% ggplot(aes(x=name, y=value, color=Species)) + geom_boxplot()
Findings: overlap between versicolor and virginica, for all features
Pairwise scatterplots of observations
ggpairs(iris, columns=1:4, aes(color=Species, alpha=0.5), upper=list(continuous="points"))
Findings: overlap!
Questions:
Outcome:
Yes, let's use ML!
Problem statement:
How?
Anderson, Edgar. "The species problem in Iris." Annals of the Missouri Botanical Garden 23.3 (1936): 457–509.
Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics 7.2 (1936): 179–188.
1936!!!
Machine Learning is the science of getting computers to learn without being explicitly programmed.
↓
Machine Learning is the science of getting computers to learn $f: X \to Y$ without being explicitly programmed.
↓
Machine Learning is the science of getting computers to learn $f: X \to Y$ autonomously.
↓
Supervised (Machine) Learning is the science of getting computers to learn $f: X \to Y$ from examples autonomously.
↓
Supervised (Machine) Learning is about designing and applying supervised learning techniques. A supervised learning technique is a way for learning an $f_\text{predict} \in \mathcal{F}_{X\to Y}$ given a $D_\text{learn} \in \mathcal{P}^*(X\times Y)$.
A learning technique is defined by $f'_\text{learn}, f'_\text{predict}$:
$f'_\text{learn}: \mathcal{P}^*(X\times Y) \to M$ and $f'_\text{predict}: X \times M \to Y$
Supervised (Machine) Learning is about designing and applying supervised learning techniques. A supervised learning technique is defined by:
Arguments for fpredict on machine:
Arguments for flearn on machine:
Requirements for fpre-proc:X→X′:
Subject of the assessment:
Assume something is assessed with respect to a given goal:
Given an axis a of assessment:
"enough" represents some expectation, some minimum degree of a to be reached.
If the outcome of assessment is a quantity (i.e., a number) with a monotonic semantics:
We want assessment to produce a number!
A ML system can be seen as a composite learning technique. It has two running modes: one in which it tunes itself, one in which it makes decisions. ML system goals are:
A supervised learning technique is a pair flearn,fpredict. Its goals are:
A model has one goal:
Eventually, effectiveness is about making good decisions!
How to measure whether an $f'_\text{predict}$ is making good decisions?
Recall: fpredict, possibly through fpredict′ and a model m, models the dependency of y on x.
Key underlying assumption: $y$ depends on $x$. That is, there exists some real system $s: X \to Y$ that, given an $x$, produces a $y$ based on $x$; that is, $s \in \mathcal{F}_{X\to Y}$:
Or, there exists in reality some system $s^{-1}: Y \to X$ that, given a $y$, produces an $x$ based on $y$:
[Diagram: the model $m$ (or $f_\text{predict}$) mimics the real system $s$]
A templated $f'_\text{predict}: X \times M \to Y$ with a fixed model $m$ is an $f_\text{predict}: X \to Y$.
How to see if the model m is modeling the system s well?
Direct comparison:
Issues:
Comparison of behaviors:
Ideally, we want the comparison (step 3) outcome to be a number.
$f_\text{comp-behavior}: \mathcal{F}_{X\to Y} \times \mathcal{F}_{X\to Y} \to \mathbb{R}$
Or, to highlight the presence of a model in a templated $f'_\text{predict}$:
$f_\text{comp-behavior}: \mathcal{F}_{X\times M\to Y} \times M \times \mathcal{F}_{X\to Y} \to \mathbb{R}$
In both cases:
function comp-behavior(fpredict, s) {
  {x(i)}i ← collect()
  {y(i)}i ← foreach({x(i)}i, s)
  {ŷ(i)}i ← foreach({x(i)}i, fpredict)
  veffect ← comp-resps({(y(i), ŷ(i))}i)
  return veffect;
}
More correctly, $\{(y^{(i)}, \hat{y}^{(i)})\}_i \leftarrow \text{foreach}(\{x^{(i)}\}_i, \text{both}(\cdot, s, f_\text{predict}))$, with $f_\text{both}: X \times \mathcal{F}_{X\to Y}^2 \to Y^2$ and $f_\text{both}(x, f_1, f_2) = (f_1(x), f_2(x))$.
We'll see many concrete options for fcomp-resps
fcollect is instead hard to define, but it's more important than fcomp-resps
Goal: the behavior $\{(x^{(i)}, y^{(i)})\}_{i=1}^{i=n}$ has to be representative of the real system $s$
Concerning size n:
Concerning coverage of X
Focus on coverage, rather than size, because it has no drawbacks!
Formally:
$f_\text{comp-resps}: \mathcal{P}^*(Y^2) \to \mathbb{R}$
(see comp-behavior() above) where $\{(y^{(i)}, \hat{y}^{(i)})\}_i \in \mathcal{P}^*(Y^2)$ is a multiset of pairs of $y$.
It depends only on $Y$, not on $X$!
We'll see a few options (performance indexes) for the main cases:
Recall: in classification, $Y$ is a finite set with no ordering.
Classification error (more verbosely: classification error rate): $f_\text{err}(\{(y^{(i)}, \hat{y}^{(i)})\}_{i=1}^{i=n}) = \frac{1}{n} \sum_{i=1}^{i=n} \mathbf{1}(y^{(i)} \ne \hat{y}^{(i)})$, where $\mathbf{1}: \{\text{false}, \text{true}\} \to \{0, 1\}$ is the indicator function: $\mathbf{1}(b) = \begin{cases} 1 & \text{if } b = \text{true} \\ 0 & \text{otherwise} \end{cases}$
Classification accuracy: $f_\text{acc}(\{(y^{(i)}, \hat{y}^{(i)})\}_{i=1}^{i=n}) = \frac{1}{n} \sum_{i=1}^{i=n} \mathbf{1}(y^{(i)} = \hat{y}^{(i)})$
Clearly, $f_\text{acc}(\{(y^{(i)}, \hat{y}^{(i)})\}_{i=1}^{i=n}) = 1 - f_\text{err}(\{(y^{(i)}, \hat{y}^{(i)})\}_{i=1}^{i=n})$.
The codomain of $f_\text{acc}$ is also $[0,1]$:
For accuracy: the greater, the better.
For error: the lower, the better.
In principle, the only requirement concerning $Y$ for both $f_\text{acc}$ and $f_\text{err}$ is that an equivalence relation, i.e., $=$, is defined over $Y$. However, in practice, $Y$ is a finite set without ordering.
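A minimal Java sketch of $f_\text{err}$ and $f_\text{acc}$ over paired responses (the list-based representation is our choice):

```java
import java.util.List;

public class ClassificationIndexes {

  // f_err: the fraction of pairs with y != y-hat
  static <Y> double error(List<Y> ys, List<Y> yHats) {
    int wrong = 0;
    for (int i = 0; i < ys.size(); i++) {
      if (!ys.get(i).equals(yHats.get(i))) {
        wrong = wrong + 1;
      }
    }
    return (double) wrong / (double) ys.size();
  }

  // f_acc = 1 - f_err
  static <Y> double accuracy(List<Y> ys, List<Y> yHats) {
    return 1d - error(ys, yHats);
  }
}
```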
In principle, accuracy is in [0,1].
Recall that $f_\text{acc}$ is part of an $f_\text{comp-behavior}$ that should measure how well a model $m$ models a real system $s$.
What are the ideal extreme cases in practice:
From another point of view, what would be the accuracy of a:
The random classifier¹ is an $X \to Y$ function doing:
$f_\text{rnd}(x) = y_i$ with $i \sim U(\{1, \dots, |Y|\})$
where $i \sim U(A)$ means choosing an item of $A$ with uniform probability.
Here $A = \{1, \dots, |Y|\}$, hence $f(x)$ gives a random $y$, without using $x$, i.e., no dependency.
Considering all possible multisets of responses $\mathcal{P}^*(Y)$, the accuracy of the random classifier is, on average, $\frac{1}{|Y|}$.
Given one specific multiset of responses $\{y^{(i)}\}_i$, the dummy classifier is the one that always predicts the most frequent class in $\{y^{(i)}\}_i$: $f_{\text{dummy},\{y^{(i)}\}_i}(x) = \arg\max_{y\in Y} \frac{1}{n} \sum_{i=1}^{i=n} \mathbf{1}(y = y^{(i)}) = \arg\max_{y\in Y} \text{Fr}(y, \{y^{(i)}\}_i)$. On the $\{y^{(i)}\}_i$ on which it is built, the accuracy of the dummy classifier is $\max_{y\in Y} \text{Fr}(y, \{y^{(i)}\}_i)$.
Recall: we use facc on one specific {y(i)}i.
Like the random classifier, the dummy classifier does not use x.
dummy [duhm-ee]: a representation of a human figure, as for displaying clothes in store windows
Looks like a human, but does nothing!
Case: coin tossing, Y={heads,tails}
Random on average (with frnd):
| $\{y^{(i)}\}_i$ | $\{\hat{y}^{(i)}\}_i$ | $f_\text{acc}()$ |
|---|---|---|
| ⬤⬤⬤⬤⬤⬤ | ⬤⬤⬤⬤⬤⬤ | 50% |
| ⬤⬤⬤⬤ | ⬤⬤⬤⬤ | 25% |
| ... | ... | ... |
| ⬤ | ⬤ | 100% |
| ⬤ | ⬤ | 0% |

Dummy on $\{y^{(i)}\}_i$ = ⬤⬤⬤⬤ (with $f_{\text{dummy},⬤⬤⬤⬤}$):
$f_\text{acc}$(⬤⬤⬤⬤, ⬤⬤⬤⬤) = 75%
Case: Iris, Y={setosa,versicolor,virginica}
Random on average (with frnd):
| $\{y^{(i)}\}_i$ | $\{\hat{y}^{(i)}\}_i$ | $f_\text{acc}()$ |
|---|---|---|
| ⬤⬤⬤⬤⬤⬤ | ⬤⬤⬤⬤⬤⬤ | ≈17% |
| ⬤⬤⬤⬤ | ⬤⬤⬤⬤ | 50% |
| ... | ... | ... |
| ⬤ | ⬤ | 100% |
| ⬤ | ⬤ | 0% |

Dummy on $\{y^{(i)}\}_i$ = ⬤⬤⬤⬤ (with $f_{\text{dummy},⬤⬤⬤⬤}$):
$f_\text{acc}$(⬤⬤⬤⬤, ⬤⬤⬤⬤) = 50%
Here $f_\text{acc}$(⬤⬤⬤, ⬤⬤⬤) stands for $f_\text{acc}(\{(⬤,⬤), (⬤,⬤), (⬤,⬤)\})$.
A classifier that works exactly as s:
fperfect(x)=s(x)
If $s$ is deterministic, the accuracy of $f_\text{perfect}$ is 100% on every $\{x^{(i)}\}_i$, by definition.
Are real systems deterministic in practice?
A non deterministic system (i.e., a stochastic or random system) is one that given the same x may output different y.
The Bayes classifier is an ideal model of a real system that is not deterministic:
$f_\text{Bayes}(x) = \arg\max_{y\in Y} \Pr(s(x) = y \mid x)$
where $\Pr(s(x) = y \mid x)$ is the probability that $s$ gives $y$ for $x$.
Key facts:
In practice:
The real system $s$ is the professor deciding if a student will pass or fail the exam of Introduction to ML. The professor just looks at the student's degree course to decide (❗ fake!) and is a bit stochastic.
$X = \{\text{IN19}, \text{IN20}, \text{SM34}, \text{SM35}, \text{SM64}\}$
$Y = \{\text{fail}, \text{pass}\}$
The probability according to which the professor "reasons" is completely known:
| | fail | pass |
|---|---|---|
| IN19 | 20% | 80% |
| IN20 | 15% | 85% |
| SM34 | 60% | 40% |
| SM35 | 80% | 20% |
| SM64 | 20% | 80% |
$\Pr(s(x)=y \mid x) = \begin{cases} 20\% & \text{if } x=\text{IN19} \wedge y=\text{fail} \\ 80\% & \text{if } x=\text{IN19} \wedge y=\text{pass} \\ 15\% & \text{if } x=\text{IN20} \wedge y=\text{fail} \\ \dots \\ 80\% & \text{if } x=\text{SM64} \wedge y=\text{pass} \end{cases}$ (the table is a compact form for this probability)
$f_\text{Bayes}(x) = \begin{cases} \text{pass} & \text{if } x=\text{IN19} \\ \text{pass} & \text{if } x=\text{IN20} \\ \text{fail} & \text{if } x=\text{SM34} \\ \text{fail} & \text{if } x=\text{SM35} \\ \text{pass} & \text{if } x=\text{SM64} \end{cases}$ built using the definition $f_\text{Bayes}(x) = \arg\max_{y\in Y} \Pr(s(x)=y \mid x)$
Questions
| | Lower | Upper |
|---|---|---|
| By definition | 0 | 1 |
| Bounds, all data | $\frac{1}{\lvert Y \rvert}$ | 1 |
| Better bounds, with one $\{x^{(i)}\}_i$ | $\max_{y\in Y} \text{Fr}(y, \{s(x^{(i)})\}_i)$ | $\le 1$ |
If {x(i)}i is collected properly, it is representative of the behavior of the real system (together with the corresponding {s(x(i))}i), hence the third case is the most relevant one:
$f_\text{acc}(\cdot) \in [\max_{y\in Y} \text{Fr}(y, \{s(x^{(i)})\}_i),\ 1 - \epsilon]$, where $\epsilon > 0$ is actually unknown
In practice, use the random classifier as a baseline and
All data means all the theoretically possible datasets, i.e., for just y, P∗(Y).
In practice not all possible datasets are equally probable.
Example: for spam, x is an email, i.e., a string of text, y is spam or ¬spam:
Consider the random classifier as a supervised learning technique:
Hence, formally:
$\|\mathbf{x}\|_1$ is the 1-norm of a vector $\mathbf{x} = (x_1, \dots, x_p)$, with $\|\mathbf{x}\|_1 = \sum_i |x_i|$
Option 1: the model $m$ is a vector of frequencies (assume $Y = \{y_1, y_2, \dots\}$):
$f'_\text{learn}(\{(x^{(i)}, y^{(i)})\}_i) = \mathbf{f} = (\text{Fr}(y_j, \{y^{(i)}\}_i))_j$
$f'_\text{predict}(x, \mathbf{f}) = y_i$ with $i = \arg\max_i f_i$
Option 2: the model $m$ is a discrete probability distribution (here $f'_\text{learn}$ is a function that returns a function):
$f'_\text{learn}(\{(x^{(i)}, y^{(i)})\}_i) = p: p(y) = \text{Fr}(y, \{y^{(i)}\}_i)$
$f'_\text{predict}(x, p) = \arg\max_{y\in Y} p(y)$
Option 3: the model $m$ is simply the learning dataset (just the $y$ part of it):
$f'_\text{learn}(\{(x^{(i)}, y^{(i)})\}_i) = \{y^{(i)}\}_i$
$f'_\text{predict}(x, \{y^{(i)}\}_i) = \arg\max_{y\in Y} \text{Fr}(y, \{y^{(i)}\}_i)$
Option 4: the model $m$ is the most frequent class $y^\star$:
$f'_\text{learn}(\{(x^{(i)}, y^{(i)})\}_i) = y^\star = \arg\max_{y\in Y} \text{Fr}(y, \{y^{(i)}\}_i)$
$f'_\text{predict}(x, y^\star) = y^\star$
All four options work with any $X$ and any $Y$.
Are they different? How?
They differ in efficiency and are equal in effectiveness:
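A sketch of option 4 in Java (the most compact model: just the most frequent class $y^\star$ seen at learning time; all names are ours):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DummyClassifier<X, Y> {

  private final Y yStar; // the model m = y*, the most frequent class

  // f'_learn: y* = argmax_y Fr(y, {y_i}); only the y part of the dataset is used
  public DummyClassifier(List<Y> ys) {
    Map<Y, Integer> counts = new HashMap<>();
    for (Y y : ys) {
      counts.merge(y, 1, Integer::sum);
    }
    this.yStar = counts.entrySet().stream()
        .max(Map.Entry.comparingByValue())
        .orElseThrow()
        .getKey();
  }

  // f'_predict(x, y*) = y*, regardless of x
  public Y predict(X x) {
    return yStar;
  }
}
```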
Binary classification is a very common scenario.
Examples:
Suppose there is a (ML-based) diagnostic test for a given disease d. just to give a name to it without calling bad luck...
You are told that the accuracy of the test is 99.8%.
Is this a good test or not?
In "formal" terms, the test is an fpredict:X→Y with:
Since ∣Y∣=2 this is a binary classification problem.
1: or, from another point of view, X={🧑🦰,👱,🙍,👱,🙎,…}×T, with T being the time, because you test a person at a given time t, and the outcome might be different from the test outcome for the same person at a later t′.
Suppose d is a rare¹ disease which affects ≈2 people every 1000 and let the accuracy be again 99.8%.
Is this a good test or not?
Consider a trivial test that always says "you don't have the disease d": its accuracy would be 99.8%:
The trivial test is actually the dummy classifier built knowing that the prevalence is 0.2%.
99.8% looked soooo nice, but the test was actually always saying the same $y$.
The accuracy alone was not able to capture such a gross error.
Yes, also because we are in binary classification and there are only 2=∣Y∣ possible values for y (i.e., 2 classes).
There are performance indexes designed right with this aim.
First, let's give a standard name to the two possible y values:
How to associate positive/negative with actual Y elements?
Common practice:
Goal: measuring the error on each of the two classes in binary classification.
The False Positive Rate (FPR) is the rate of negatives that are wrongly¹ classified as positives: $f_\text{FPR}(\{(y^{(i)}, \hat{y}^{(i)})\}_i) = \frac{\sum_i \mathbf{1}(y^{(i)} = \text{neg} \wedge y^{(i)} \ne \hat{y}^{(i)})}{\sum_i \mathbf{1}(y^{(i)} = \text{neg})}$
The False Negative Rate (FNR) is the rate of positives that are wrongly classified as negatives: $f_\text{FNR}(\{(y^{(i)}, \hat{y}^{(i)})\}_i) = \frac{\sum_i \mathbf{1}(y^{(i)} = \text{pos} \wedge y^{(i)} \ne \hat{y}^{(i)})}{\sum_i \mathbf{1}(y^{(i)} = \text{pos})}$
For both:
may be NaN, if there are no negatives (FPR) or no positives (FNR) in the data
$\text{FPR} = \frac{\text{FP}}{N}$
$\text{FNR} = \frac{\text{FN}}{P}$
Assuming that:
Suppose $d$ is a rare¹ disease which affects ≈2 people every 1000, and consider a trivial test that always says "you don't have the disease $d$"
Acc is a more convenient notation for the accuracy; Err for the error.
When to use accuracy? When to use FPR and FNR?
tl;dr¹: use FPR and FNR in binary classification!
In decreasing order of informativeness (i.e., of effectiveness of the assessment of effectiveness) and decreasing order of verbosity:
Accuracy alone in binary classification is evil! 👿
Just FPR, or just FNR is evil too, but also weird.
Binary classification and its assessment are so practically relevant that there exist many other "synonyms" of FPR and FNR.
True Positive Rate (TPR): positives correctly classified as positives: $\text{TPR} = \frac{\text{TP}}{P} = 1 - \text{FNR}$
True Negative Rate (TNR): negatives correctly classified as negatives: $\text{TNR} = \frac{\text{TN}}{N} = 1 - \text{FPR}$
For both: the greater, the better (like accuracy); the codomain is $[0,1]$.
Relation with accuracy and error:
$\text{Err} = \frac{\text{FP} + \text{FN}}{N + P} = \frac{P \cdot \text{FNR} + N \cdot \text{FPR}}{P + N}$
$\text{Acc} = 1 - \text{Err} = \frac{\text{TP} + \text{TN}}{N + P} = \frac{P \cdot \text{TPR} + N \cdot \text{TNR}}{P + N}$
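These identities are easy to check in code; a sketch computing the four counts and the derived indexes (our convention: boolean true = pos):

```java
import java.util.List;

public class BinaryRates {

  // FPR, FNR, and accuracy from paired boolean responses (true = pos)
  static double[] rates(List<Boolean> ys, List<Boolean> yHats) {
    int tp = 0, tn = 0, fp = 0, fn = 0;
    for (int i = 0; i < ys.size(); i++) {
      boolean y = ys.get(i);
      boolean yHat = yHats.get(i);
      if (y && yHat) tp++;
      else if (!y && !yHat) tn++;
      else if (!y && yHat) fp++;
      else fn++;
    }
    double fpr = (double) fp / (double) (fp + tn); // FP / N
    double fnr = (double) fn / (double) (fn + tp); // FN / P
    double acc = (double) (tp + tn) / (double) ys.size();
    return new double[] {fpr, fnr, acc};
  }
}
```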
In classification (binary and multiclass), a dataset is balanced, with respect to the response variable y, if the frequency of each value of y is roughly the same.
For a balanced dataset in binary classification, P=N, hence:
The more unbalanced a dataset, the farther the error (accuracy) from the average of FPR and FNR (TPR and TNR), and the more misleading 👿 it is to give the error (accuracy) only!
Precision: $\text{Prec} = \frac{\text{TP}}{\text{TP} + \text{FP}}$; may be $\frac{0}{0}$, i.e., NaN, if the classifier never says positive
Recall: $\text{Rec} = \frac{\text{TP}}{P} = \text{TPR}$
F-measure (or F1, F1-score, F-score): $\text{F-measure} = 2 \frac{\text{Prec} \cdot \text{Rec}}{\text{Prec} + \text{Rec}}$, the harmonic mean of precision and recall
They come from the information retrieval scenario:
Precision: how many retrieved documents are actually relevant? $\text{Prec} = \frac{|D' \cap D^\star|}{|D'|} = \frac{|D' \cap D^\star|}{|D' \cap D^\star| + |D' \setminus D^\star|} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
Recall: how many of the relevant documents are actually retrieved? $\text{Rec} = \frac{|D' \cap D^\star|}{|D^\star|} = \frac{\text{TP}}{P}$
The greater, the better (like accuracy); precision $\in [0,1] \cup \{\text{NaN}\}$, recall $\in [0,1]$, F-measure $\in [0,1]$.
Sensitivity: $\text{Sensitivity} = \frac{\text{TP}}{P} = \text{TPR}$
Specificity: $\text{Specificity} = \frac{\text{TN}}{N} = \text{TNR}$
The greater, the better (like accuracy); both in $[0,1]$.
Other similar indexes:
For both, the lower, the better (like error).
Rule of thumb¹ (in binary classification)
No good reasons, imho, for using Type I and Type II errors:
Suppose you have two models and you assess them on the same data:
Which one is the best?
In general, it depends on:
Assumptions:
Given $P + N$ observations, the overall cost $c$ is: $c = c_\text{FP} \cdot \text{FPR} \cdot N + c_\text{FN} \cdot \text{FNR} \cdot P$, with $c_\text{FP}$ and $c_\text{FN}$ the costs of FPs and FNs.
If you know cFP, cFN, N, and P: (the costs cFP, cFN should come from domain knowledge)
Given a model (not a learning technique), can we "tune" it to prefer avoiding FPs rather than FNs (or vice versa)?
Yes! It turns out that for many learning techniques (for classification), the fpredict′ internally computes a discrete probability distribution over Y before actually returning one y.
Formally:
$f''_\text{predict}: X \times M \to \mathcal{P}_Y$, with $f''_\text{predict}(x, m) = p$
$f'_\text{predict}: X \times M \to Y$, with $f'_\text{predict}(x, m) = \arg\max_{y\in Y} (f''_\text{predict}(x, m))(y) = \arg\max_{y\in Y} p(y)$
Example: for spam detection, given an $m$ and an email $x$, $f''_\text{predict}(x, m)$ might return: $p(y) = \begin{cases} 80\% & \text{if } y = \text{spam} \\ 20\% & \text{if } y = \neg\text{spam} \end{cases}$. For another email, it might return a 30%/70%, instead of an 80%/20%.
A supervised learning technique with probability (for classification) is defined by:
For all the techniques of this kind, $f'_\text{predict}: X \times M \to Y$ and $f_\text{predict}$ are always the same.
"internally computes" → p is indeed available internally, but can be obtained from outside
In binary classification, with $Y = \{\text{pos}, \text{neg}\}$, $p \in \mathcal{P}_Y$ always has this form: $p(y) = \begin{cases} p_\text{pos} & \text{if } y = \text{pos} \\ 1 - p_\text{pos} & \text{if } y = \text{neg} \end{cases}$ with $p_\text{pos} \in [0,1]$.
Hence, prediction can be seen as:
$f'''_\text{predict}: X \times M \to [0,1]$, with $f'''_\text{predict}(x, m) = p_\text{pos}$
$f'_\text{predict}: X \times M \to Y$, with $f'_\text{predict}(x, m) = \begin{cases} \text{pos} & \text{if } p_\text{pos} \ge 0.5 \\ \text{neg} & \text{otherwise} \end{cases}$
The closer ppos to 0.5, the lower the confidence of the model in its decision:
We may measure the confidence in the binary decision as: $\text{conf}(x, m) = \frac{|p_\text{pos} - 0.5|}{0.5} = \frac{|f'''_\text{predict}(x, m) - 0.5|}{0.5}$
$\text{conf} \in [0,1]$: the greater, the more confident.
If we replace the fixed 0.5 threshold with a parameter $\tau$, we obtain a new function:
$f^\tau_\text{predict}: X \times [0,1] \to Y$, with $f^\tau_\text{predict}(x, \tau) = \begin{cases} \text{pos} & \text{if } f'''_\text{predict}(x, m) \ge \tau \\ \text{neg} & \text{otherwise} \end{cases}$
Note that:
Example: if we want our diagnostic test to be more sensitive to positives, we lower $\tau$ without changing the model!
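A sketch of $f^\tau_\text{predict}$ and of the confidence measure in Java; the model is abstracted as a function $x \mapsto p_\text{pos}$ (a hypothetical stand-in for $f'''_\text{predict}(\cdot, m)$):

```java
import java.util.function.ToDoubleFunction;

public class ThresholdedClassifier<X> {

  private final ToDoubleFunction<X> pPos; // x -> p_pos, i.e., f'''_predict(., m)
  private final double tau;

  public ThresholdedClassifier(ToDoubleFunction<X> pPos, double tau) {
    this.pPos = pPos;
    this.tau = tau;
  }

  // f^tau_predict: pos (true) iff p_pos >= tau
  public boolean predict(X x) {
    return pPos.applyAsDouble(x) >= tau;
  }

  // conf(x, m) = |p_pos - 0.5| / 0.5
  public double confidence(X x) {
    return Math.abs(pPos.applyAsDouble(x) - 0.5) / 0.5;
  }
}
```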
Given the same m and the same {(x(i),y(i))}i:
Example:
For a model $m$ and a dataset $\{(x^{(i)}, y^{(i)})\}_i$, the Equal Error Rate (EER) is the value of FPR (and FNR) at the threshold $\tau = \tau_\text{EER}$ for which $\text{FPR} = \text{FNR}$.
For EER: the lower, the better (like error); the codomain is $[0,1]$ (in practice $[0, 0.5]$).
For a model $m$, a dataset $\{(x^{(i)}, y^{(i)})\}_i$, and a sequence $(\tau_i)_i$, the Receiver Operating Characteristic¹ (ROC) curve is the plot of TPR ($= 1 - \text{FNR}$) vs. FPR for the different values of $\tau \in (\tau_i)_i$.
For a model $m$, a dataset $\{(x^{(i)}, y^{(i)})\}_i$, and a sequence $(\tau_i)_i$, the Area Under the Curve (AUC) is the area under the ROC curve.
For AUC: the greater, the better (like accuracy); the codomain is $[0,1]$ (in practice $[0.5, 1]$).
For computing both EER and AUC, you need to compute FPR and FNR for many values of τ.
Ingredients:
How to choose $(\tau_i)_i$? (recall: $\tau \in [0,1]$; by convention, you always also take $\tau = 0$ and $\tau = 1$)
$Y = \{\text{pos}, \text{neg}\}$

| $y^{(i)}$ | $p^{(i)}_\text{pos}$ | $\hat{y}^{(i)}$ | out¹ |
|---|---|---|---|
| ⬤ | 0.49 | ⬤ | FN |
| ⬤ | 0.29 | ⬤ | TN |
| ⬤ | 0.63 | ⬤ | TP |
| ⬤ | 0.51 | ⬤ | TP |
| ⬤ | 0.52 | ⬤ | TP |
| ⬤ | 0.47 | ⬤ | TN |
| ⬤ | 0.94 | ⬤ | TP |
| ⬤ | 0.75 | ⬤ | TP |
| ⬤ | 0.53 | ⬤ | FP |
| ⬤ | 0.45 | ⬤ | TN |

$(\tau_i)_i$ evenly spaced in $[0,1]$, 9+2 values → raw: 7 out of 11 different values
$(\tau_i)_i$ evenly spaced in $[0.29, 0.84]$, 9+2 values → better, but still 7 out of 11 different values
$(\tau_i)_i$ at midpoints, 9+2 values → optimal: 11 out of 11 different values
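A sketch in Java of the midpoint strategy and of the AUC via the trapezoidal rule over the resulting ROC points (assumptions of ours: boolean labels with true = pos; at least one positive and one negative in the data):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RocAuc {

  // thresholds at midpoints between consecutive sorted p_pos values, plus 0 and 1
  static List<Double> midpointTaus(double[] pPos) {
    double[] sorted = pPos.clone();
    Arrays.sort(sorted);
    List<Double> taus = new ArrayList<>(List.of(0d));
    for (int i = 0; i < sorted.length - 1; i++) {
      taus.add((sorted[i] + sorted[i + 1]) / 2d);
    }
    taus.add(1d);
    return taus;
  }

  // one ROC point (FPR, TPR) for a given tau
  static double[] rocPoint(boolean[] ys, double[] pPos, double tau) {
    int tp = 0, fp = 0, p = 0, n = 0;
    for (int i = 0; i < ys.length; i++) {
      boolean yHat = pPos[i] >= tau;
      if (ys[i]) { p++; if (yHat) tp++; }
      else { n++; if (yHat) fp++; }
    }
    return new double[] {(double) fp / n, (double) tp / p};
  }

  // AUC: trapezoidal rule over the ROC points, sorted by FPR
  static double auc(boolean[] ys, double[] pPos) {
    List<double[]> points = new ArrayList<>();
    for (double tau : midpointTaus(pPos)) {
      points.add(rocPoint(ys, pPos, tau));
    }
    points.sort((a, b) -> Double.compare(a[0], b[0]));
    double area = 0d;
    for (int i = 0; i < points.size() - 1; i++) {
      double[] a = points.get(i);
      double[] b = points.get(i + 1);
      area += (b[0] - a[0]) * (a[1] + b[1]) / 2d;
    }
    return area;
  }
}
```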
If you know the cost of error (cFP and cFN) and the class frequencies:
If you don't know the cost of error and you know the classifier will work at a fixed τ:
If you don't know the cost of error and don't know at which τ the classifier will work:
If you can afford it, i.e., you have time/space:
Given a multiset {(y(i),y^(i))}i of pairs, the confusion matrix has:
$Y = \{\text{pos}, \text{neg}\}$

| $y^{(i)}$ | $p^{(i)}_\text{pos}$ | $\hat{y}^{(i)}$ | out |
|---|---|---|---|
| ⬤ | 0.49 | ⬤ | FN |
| ⬤ | 0.29 | ⬤ | TN |
| ⬤ | 0.63 | ⬤ | TP |
| ⬤ | 0.51 | ⬤ | TP |
| ⬤ | 0.52 | ⬤ | TP |
| ⬤ | 0.47 | ⬤ | TN |
| ⬤ | 0.94 | ⬤ | TP |
| ⬤ | 0.75 | ⬤ | TP |
| ⬤ | 0.53 | ⬤ | FP |
| ⬤ | 0.45 | ⬤ | TN |
For this case:
| $y \backslash \hat{y}$ | ⬤ | ⬤ |
|---|---|---|
| ⬤ | 5 | 1 |
| ⬤ | 1 | 3 |
For binary classification:
| $y \backslash \hat{y}$ | pos | neg |
|---|---|---|
| pos | TP | FN |
| neg | FP | TN |
In general, with $c$ being the confusion matrix, it holds that:
Besides accuracy and error, for unbalanced datasets, the weighted accuracy (or balanced accuracy) is: $\text{wAcc} = f_\text{wAcc}(\{(y^{(i)}, \hat{y}^{(i)})\}_i) = \frac{1}{|Y|} \sum_{y\in Y} \left( \frac{\sum_i \mathbf{1}(y^{(i)} = y \wedge y^{(i)} = \hat{y}^{(i)})}{\sum_i \mathbf{1}(y^{(i)} = y)} \right) = \frac{1}{|Y|} \sum_{y\in Y} \text{Acc}_y$, i.e., the (unweighted) average of the accuracy for each class. You can do the same with error, precision, recall, ...
| $y \backslash \hat{y}$ | ⬤ | ⬤ | ⬤ | ⬤ |
|---|---|---|---|---|
| ⬤ | 15 | 1 | 2 | 2 |
| ⬤ | 1 | 10 | 4 | 1 |
| ⬤ | 5 | 3 | 28 | 1 |
| ⬤ | 1 | 0 | 0 | 9 |

$\text{Acc} = \frac{15+10+28+9}{20+16+38+10} = \frac{62}{84} = 73.8\%$
$\text{Acc}_⬤ = \frac{15}{20} = 75\%$
$\text{Acc}_⬤ = \frac{10}{16} = 62.5\%$
$\text{Acc}_⬤ = \frac{28}{37} = 75.7\%$
$\text{Acc}_⬤ = \frac{9}{10} = 90\%$
$\text{wAcc} = \frac{1}{4}\left(\frac{15}{20} + \frac{10}{16} + \frac{28}{37} + \frac{9}{10}\right) = 75.8\%$
wAcc is insensitive to class imbalance, Acc is not; $\text{wAcc} \in [0,1]$; the greater, the better
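A sketch of wAcc in Java, computed directly from paired responses as the unweighted mean of the per-class accuracies:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedAccuracy {

  static <Y> double wAcc(List<Y> ys, List<Y> yHats) {
    Map<Y, Integer> total = new HashMap<>();   // per-class counts of y
    Map<Y, Integer> correct = new HashMap<>(); // per-class counts of y == y-hat
    for (int i = 0; i < ys.size(); i++) {
      total.merge(ys.get(i), 1, Integer::sum);
      if (ys.get(i).equals(yHats.get(i))) {
        correct.merge(ys.get(i), 1, Integer::sum);
      }
    }
    // unweighted average of Acc_y over the classes present in {y_i}
    return total.keySet().stream()
        .mapToDouble(y -> correct.getOrDefault(y, 0) / (double) total.get(y))
        .average()
        .orElse(Double.NaN);
  }
}
```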
Differently from classification, a prediction in regression may be more or less wrong:
The error in regression measures how far the prediction $\hat{y}^{(i)}$ is from the true value $y^{(i)}$:
| Name | $f_\text{comp-resps}(\{(y^{(i)}, \hat{y}^{(i)})\}_i)$ |
|---|---|
| Mean Absolute Error (MAE) | $\text{MAE} = \frac{1}{n} \sum_i \lvert y^{(i)} - \hat{y}^{(i)} \rvert$ |
| Mean Squared Error (MSE) | $\text{MSE} = \frac{1}{n} \sum_i (y^{(i)} - \hat{y}^{(i)})^2$ |
| Root Mean Squared Error (RMSE) | $\text{RMSE} = \sqrt{\frac{1}{n} \sum_i (y^{(i)} - \hat{y}^{(i)})^2} = \sqrt{\text{MSE}}$ |
| Mean Absolute Percentage Error (MAPE) | $\text{MAPE} = \frac{1}{n} \sum_i \left\lvert \frac{y^{(i)} - \hat{y}^{(i)}}{y^{(i)}} \right\rvert$ |
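The four indexes are straightforward to compute; a sketch in Java (arrays hold $y^{(i)}$ and $\hat{y}^{(i)}$ aligned by index):

```java
public class RegressionIndexes {

  static double mae(double[] ys, double[] yHats) {
    double sum = 0d;
    for (int i = 0; i < ys.length; i++) {
      sum += Math.abs(ys[i] - yHats[i]);
    }
    return sum / ys.length;
  }

  static double mse(double[] ys, double[] yHats) {
    double sum = 0d;
    for (int i = 0; i < ys.length; i++) {
      sum += (ys[i] - yHats[i]) * (ys[i] - yHats[i]);
    }
    return sum / ys.length;
  }

  static double rmse(double[] ys, double[] yHats) {
    return Math.sqrt(mse(ys, yHats));
  }

  static double mape(double[] ys, double[] yHats) {
    double sum = 0d;
    for (int i = 0; i < ys.length; i++) {
      sum += Math.abs((ys[i] - yHats[i]) / ys[i]);
    }
    return sum / ys.length;
  }
}
```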
Remarks:
Premise:
Goal:
Sketch of solution:
Eff might be accuracy, TPR and TNR, MAE, error, ...
Sketch of solution:
Both steps 1 and 2 need a dataset:
In principle yes, in practice no:
$f_\text{learn-effect}: \mathcal{L}_{X\to Y} \times \mathcal{P}^*(X\times Y) \to \mathbb{R}$, where $\mathcal{L}_{X\to Y}$ is the set of learning techniques:
Given a learning technique and a dataset, it returns a number representing the effectiveness of the learning technique on that dataset.
For consistency, let's reshape model assessment case:
function predict-effect(fpredict′, m, D) {
  {(y(i), ŷ(i))}i ← foreach(D, both(⋅, second, fpredict′(first(⋅), m)))
  veffect ← comp-resps({(y(i), ŷ(i))}i)
  return veffect;
}
We are just leaving the data collection out of predict-effect().
first() and second() take the first or second element of a pair.
function learn-effect-same(flearn′, fpredict′, D) {
  m ← flearn′(D)
  veffect ← predict-effect(fpredict′, m, D)
  return veffect;
}
The entire D is used for learning the model and assessing it.
Effectiveness of assessment:
Poor! 👎
Efficiency of assessment:
Good! 👍
function learn-effect-static(flearn′, fpredict′, D, r) {
  Dlearn ← subbag(D, r)
  Dtest ← D ∖ Dlearn
  m ← flearn′(Dlearn)
  veffect ← predict-effect(fpredict′, m, Dtest)
  return veffect;
}
$r \in [0,1]$ is a parameter
$D$ is split into a $D_\text{learn}$ for learning and a $D_\text{test}$ for assessment ("split" = "partitioned", but $D_\text{learn} \cap D_\text{test}$ might be $\ne \emptyset$, since $D$ is a bag and the same pair may occur in both)
Effectiveness of assessment:
Fair! ≈👍
Efficiency of assessment:
Good! 👍
$D_\text{test}$, with respect to the model $m$, is unseen data, because it has not been used for learning.
Assessing $m$ on unseen data answers the questions:
In practice Dtest and Dlearn are obtained from a D that is collected all at once:
What if the model/ML system does not work well on actual unseen/new/future data? That is, what if the predictions are wrong in practice?
Assessment 👍 - Reality 👎
D was not representative w.r.t. the real system:
or some bug in the implementation...
Assessment 👎 - Reality 👎
D is not informative w.r.t. the real system:
or some bug in the implementation...
Assessment 👍 - Reality 👍
Nice! We did everything well!
or some bug in the implementation...
Assessment 👎 - Reality 👍
Sooooo lucky! 🍀🍀🍀
or some bug in the implementation...
you never know if there is some bug in the implementation...
function learn-effect-repeated(flearn′, fpredict′, D, r, k) {
  for j ∈ 1, …, k {
    Dlearn ← subbag(D, r)
    Dtest ← D ∖ Dlearn
    m ← flearn′(Dlearn)
    vj ← predict-effect(fpredict′, m, Dtest)
  }
  return $\frac{1}{k} \sum_j v_j$;
}
$r \in [0,1]$ and $k \in \mathbb{N}^+$ are parameters
$D$ is split into $D_\text{learn}$ and $D_\text{test}$ $k$ times and the measures are averaged: subbag() has to be non-deterministic
Effectiveness of assessment:
Good! 👍
Efficiency of assessment:
∝k 🫳
function learn-effect-cv(flearn′, fpredict′, D, k) {
  for j ∈ 1, …, k {
    Dtest ← fold(D, j, k)
    Dlearn ← D ∖ Dtest
    m ← flearn′(Dlearn)
    vj ← predict-effect(fpredict′, m, Dtest)
  }
  return $\frac{1}{k} \sum_j v_j$;
}
$k \in \mathbb{N}^+$ is a parameter
Cross-validation (CV) is like learn-effect-repeated, but the $k$ $D_\text{test}$ are mutually disjoint (folds).
Effectiveness of assessment:
Good! 👍
Efficiency of assessment:
∝k 🫳
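A sketch of learn-effect-cv in Java; `Learner` and `Example` are hypothetical types matching the signatures used above, folds are taken by index modulo $k$, and `effectiveness` stands for any $f_\text{comp-resps}$ (e.g., accuracy):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

public class CrossValidation {

  record Example<X, Y>(X x, Y y) {}

  interface Learner<X, Y> {
    Function<X, Y> learn(List<Example<X, Y>> dataset);
  }

  // k-fold CV: fold j is the test set, the rest is the learning set; effects averaged
  static <X, Y> double learnEffectCv(
      Learner<X, Y> learner,
      List<Example<X, Y>> d,
      int k,
      BiFunction<List<Y>, List<Y>, Double> effectiveness) {
    double sum = 0d;
    for (int j = 0; j < k; j++) {
      List<Example<X, Y>> test = new ArrayList<>();
      List<Example<X, Y>> learn = new ArrayList<>();
      for (int i = 0; i < d.size(); i++) {
        (i % k == j ? test : learn).add(d.get(i)); // disjoint folds by index
      }
      Function<X, Y> predictor = learner.learn(learn);
      List<Y> ys = test.stream().map(Example::y).toList();
      List<Y> yHats = test.stream().map(e -> predictor.apply(e.x())).toList();
      sum += effectiveness.apply(ys, yHats);
    }
    return sum / k;
  }
}
```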
Simply a CV where the number of folds k is ∣D∣:
Effectiveness of assessment:
Good! 👍
Efficiency of assessment:
Bad 👎
Summary of the learning/test divisions (for the repeated strategies, the per-repetition $\text{Eff}_j$ values are aggregated into a single Eff):

| Strategy | Learnings | Predictions |
|---|---|---|
| Same | 1 | $\lvert D \rvert$ |
| Static random ($r=0.8$) | 1 | $\lvert D \rvert (1-r)$ |
| Repeated random ($r=0.8$, $k=4$) | $k$ | $\lvert D \rvert (1-r)$ after each, $k \lvert D \rvert (1-r)$ in total |
| CV ($k=5$) | $k$ | $\frac{1}{k} \lvert D \rvert$ after each, $\lvert D \rvert$ in total |
| LOOCV | $\lvert D \rvert$ | 1 after each, $\lvert D \rvert$ in total |
Repeated random, CV, and LOOCV internally compute the model effectiveness for several models learned on (slightly) different datasets:
$\text{Eff}_1, \text{Eff}_2, \dots, \text{Eff}_k \to \text{Eff} = \frac{1}{k} \sum_j \text{Eff}_j$
We can compute both the mean and the standard deviation from (Effi)i:
$\text{Eff}_\mu = \frac{1}{k} \sum_j \text{Eff}_j$
$\text{Eff}_\sigma = \sqrt{\frac{1}{k} \sum_j (\text{Eff}_j - \text{Eff}_\mu)^2}$
Suppose you have assessed two learning techniques with 10-CV and AUC (with midpoints τ):
What's the best learning technique?
Now, suppose that you instead find:
What's the best learning technique?
Can we really state that LT1 is better than LT2?
Broader example:
suppose you meet 10 guys from Udine and 10 from Trieste and ask them how tall they are:
| City | Measures (cm) | $\mu$ | $\sigma$ |
|---|---|---|---|
| Udine | 154, 193, 170, 175, 172, 183, 160, 162, 161, 179 | 170.9 | 12.02 |
| Trieste | 167, 166, 180, 175, 168, 167, 173, 181, 169, 173 | 171.9 | 5.44 |
Questions:
Possible ways of answering:
Questions:
Answers with the boxplot:
Disclaimer: here, just a brief overview; go to statisticians for more details/theory
For us, a statistical significance test is a procedure that, given two samples $\{x_{a,i}\}_i$ and $\{x_{b,i}\}_i$ (i.e., collections of observations) of two random variables $X_a$ and $X_b$, and a set of hypotheses $H_0$ (the null hypothesis), returns a number $p \in [0,1]$, called the p-value.
The p-value represents the probability that, collecting two new samples from the same random variables and assuming that $H_0$ still holds, the new samples are more unlikely than $\{x_{a,i}\}_i, \{x_{b,i}\}_i$.
H0: (you assume all are true)
Samples:
p=0.90 means:
p=0.01 means:
There exist several concrete statistical significance tests, e.g.:
Usually, you aim at "arguing" that $\mu_a > \mu_b$ (one-tailed) or $\mu_a \ne \mu_b$ (two-tailed, inequality):
> wilcox.test(h_ts, h_ud)

	Wilcoxon rank sum test with continuity correction

data:  h_ts and h_ud
W = 54.5, p-value = 0.7621
alternative hypothesis: true location shift is not equal to 0
$H_0 \ni$ "true location shift is equal to 0"
$p = 0.7621 > 0.05$: we cannot reject the null hypothesis
⇒ people from Trieste are not taller than people from Udine or, at least, we cannot state that they are
More on statistical significance tests:
Similar:
Canfora, Gerardo, et al. "Detecting android malware using sequences of system calls." Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. 2015.
Question: is the model modeling the real system?
Answer: compare responses on the same data and compute one or more performance indexes!
[Diagram: the model $m$ (or $f_\text{predict}$) vs. the real system $s$]
Binary classification
Classification (w/ binary)
Regression
Bounds for classification effectiveness:
Effectiveness of the single technique
Sketch: learn a model on Dlearn, assess the model on Dtest; which learning/test division?
Same
Static rnd
Repeated rnd
CV
LOOCV
...
Comparison between techniques
Indexes¹
Learning/test division
Once upon a time¹... there was an amusement park with a carousel and an attendant deciding who can ride and who cannot. The park owner wants to replace the attendant with a robotic gate.
The owner calls us as machine learning experts.
X and Y
Features (chosen with domain expert):
Hence:
We (the ML expert and the domain expert) decide to collect some data D={(x(i),y(i))}i by observing the real system:
The data exploration suggests that using ML is not a terrible idea.
Assume we are computer scientists and we like (nested) if-then-else structures:
can we manually build an if-then-else structure that allows us to make a decision?
Requirements (to keep it feasible manually):
each if condition should:
Strategy:
function predict(x) {
  if xage ≤ 10 then {
    return ●
  } else {
    return ●
  }
}
The else branch is too rough. Let's improve it!
function predict(x) {
  if xage ≤ 10 then {
    return ●
  } else {
    if xheight ≤ 120 then {
      return ●
    } else {
      return ●
    }
  }
}
Nice job!
This nested if-then-else structure can be represented as a tree:
[Figure: the decision tree corresponding to the predict() above]
We call this a decision tree, since we use it inside an $f_\text{predict}$ for making a decision:
Now: our human-learned $f_\text{predict}$ is the predict() above.
Goal: an fpredict′ working on any tree
function predict(x,m) {
...
}
We humans learned (i.e., manually designed) a function where the decision tree is hard-coded in the predict() function in the form of an if-then-else structure:
Scenario: classification with multivariate numerical features:
The model $t \in T_{p,Y}$ is a decision tree defined over $X_1 \times \dots \times X_p, Y$, i.e.:
We represent a tree $t \in T_L$ as $t = [l; t'; t'']$, where $t', t'' \in T_L \cup \{\emptyset\}$ are the left and right children trees and $l \in L$ is the label.
If the tree is a terminal node¹, it has no children (i.e., $t' = t'' = \emptyset$) and we write: $t = [l; \emptyset; \emptyset] = [l]$
For decision trees:
We shorten $T_{(\{1,\dots,p\}\times\mathbb{R}) \cup Y}$ as $T_{p,Y}$.
With:
This tree is: t=[(1,10);[●];[(2,120);[●];[●]]]
Would you be able to write a parser for this?
function predict(x, t) {
  if ¬has-children(t) then {
    y ← label-of(t)
    return y
  } else { // hence t is a branch node
    (j, τ) ← label-of(t)
    if xj ≤ τ then {
      return predict(x, left-child-of(t)) // recursion
    } else {
      return predict(x, right-child-of(t)) // recursion
    }
  }
}
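A sketch of this $f'_\text{predict}$ in Java; `Node` is a hypothetical record encoding $t = [l; t'; t'']$, with a $(j, \tau)$ label at branch nodes and a class label at terminal nodes (the go/no-go labels in the example are our guess):

```java
public class TreePredictor {

  // t = [l; t'; t'']: label is (j, tau) at branch nodes, a class at terminal nodes
  record Node(int j, double tau, String label, Node left, Node right) {
    boolean isTerminal() {
      return left == null && right == null;
    }
  }

  // recursive descent: go left if x_j <= tau, right otherwise
  static String predict(double[] x, Node t) {
    if (t.isTerminal()) {
      return t.label();
    }
    return x[t.j() - 1] <= t.tau() // j is 1-based, as in the slides
        ? predict(x, t.left())
        : predict(x, t.right());
  }

  public static void main(String[] args) {
    // the carousel tree [(1,10); [.]; [(2,120); [.]; [.]]]; leaf labels hypothetical
    Node t = new Node(1, 10, null,
        new Node(0, 0, "no-go", null, null),
        new Node(2, 120, null,
            new Node(0, 0, "no-go", null, null),
            new Node(0, 0, "go", null, null)));
    System.out.println(predict(new double[] {14, 155}, t)); // -> go
  }
}
```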
It's a recursive function that:
1st call: x = (14, 155), t = [(1,10); [●]; [(2,120); [●]; [●]]]
¬has-children(t) = false
(j, τ) = (1, 10)
x1 ≤ 10 = false
right-child-of(t) = [(2,120); [●]; [●]]
2nd call: x = (14, 155), t = [(2,120); [●]; [●]]
¬has-children(t) = false
(j, τ) = (2, 120)
x2 ≤ 120 = false
right-child-of(t) = [●]
3rd call: x = (14, 155), t = [●]
¬has-children(t) = true
y = ● → return, return, return
We have our $f'_\text{predict}: \mathbb{R}^p \times T_{p,Y} \to Y$; to have a learning technique, we miss only the learning function, i.e., $f'_\text{learn}: \mathcal{P}^*(\mathbb{R}^p \times Y) \to T_{p,Y}$:
What we did manually (i.e., how we human learned):
Let's rewrite it as (pseudo-)code!
function learn({(x(i),y(i))}i) {
  if should-stop({y(i)}i) then {
    y⋆ ← argmaxy∈Y ∑i 1(y(i)=y) // y⋆ is the most frequent class
    return node-from(y⋆, ∅, ∅)
  } else { // hence the node is a branch node
    (j,τ) ← find-best-branch({(x(i),y(i))}i)
    t ← node-from(
      (j,τ),
      learn({(x(i),y(i))}i: xj(i)≤τ), // recursion
      learn({(x(i),y(i))}i: xj(i)>τ) // recursion
    )
    return t
  }
}
$\{(x^{(i)}, y^{(i)})\}_{i: x_j^{(i)} \le \tau}$ is the sub-multiset of $\{(x^{(i)}, y^{(i)})\}_i$ composed of the pairs for which $x_j \le \tau$
This flearn′ is called recursive binary splitting:
Intuitively:
In detail (and formally):
function find-best-branch({(x(i),y(i))}i) {
  (j⋆,τ⋆) ← argminj,τ (error({y(i)}i: xj(i)≤τ) + error({y(i)}i: xj(i)>τ))
  return (j⋆,τ⋆)
}
function error({y(i)}i) { // the error of the dummy classifier on {y(i)}i
  y⋆ ← argmaxy ∑i 1(y(i)=y) // y⋆ is the most frequent class
  return $\frac{1}{n}$ ∑i 1(y(i)≠y⋆) // n = |{y(i)}i|
}
Interpretation: if we split the data at this point (i.e., at a given $(j, \tau)$ pair) and use one dummy classifier on each of the two sides, what would be the resulting error?
This approach is greedy, since it tries to obtain the maximum result (finding the branch), with the minimum effort (using just two dummy classifiers later on):
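A sketch of find-best-branch() and error() in Java; as a simplification of ours, the candidate thresholds for each feature are the observed values of that feature:

```java
import java.util.ArrayList;
import java.util.List;

public class BestBranch {

  record Branch(int j, double tau) {}

  // error({y_i}): the error the dummy classifier would make, 1 - max_y Fr(y)
  static double error(List<String> ys) {
    if (ys.isEmpty()) {
      return 0d;
    }
    long max = 0;
    for (String y : ys.stream().distinct().toList()) {
      max = Math.max(max, ys.stream().filter(y::equals).count());
    }
    return 1d - (double) max / ys.size();
  }

  // (j*, tau*) = argmin_{j,tau} (error({y_i: x_j <= tau}) + error({y_i: x_j > tau}))
  static Branch findBestBranch(List<double[]> xs, List<String> ys) {
    Branch best = null;
    double bestError = Double.POSITIVE_INFINITY;
    int p = xs.get(0).length;
    for (int j = 0; j < p; j++) {
      for (double[] x : xs) { // candidate taus: the observed values of x_j
        double tau = x[j];
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (int i = 0; i < xs.size(); i++) {
          (xs.get(i)[j] <= tau ? left : right).add(ys.get(i));
        }
        double e = error(left) + error(right);
        if (e < bestError) {
          bestError = e;
          best = new Branch(j + 1, tau); // 1-based j, as in the slides
        }
      }
    }
    return best;
  }
}
```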
Intuitively:
In detail (and formally):
function should-stop({y(i)}i, nmin) {
  if n ≤ nmin then { // n = |{y(i)}i|
    return true;
  }
  if error({y(i)}i) = 0 then {
    return true;
  }
  return false;
}
Checking the first condition is, in general, cheaper than checking the second condition.
1st call:
(j,τ)=(1,7)
1st-l call:
(j,τ)=(1,2)
1st-l-l call:
return [●]
1st-l-r call:
(j,τ)=(1,4)
1st-l-r-l call:
return [●]
1st-l-r-r call:
return [●]
return [(1,4);[●];[●]]
return [(1,2);[●];[(1,4);[●];[●]]]
1st-r call:
return [●]
return [(1,7);[(1,2);[●];[(1,4);[●];[●]]];[●]]
Assume:
function learn({(x(i),y(i))}i, nmin) {
  if should-stop({y(i)}i, nmin) then {
    y⋆ ← argmaxy∈Y ∑i 1(y(i)=y)
    return node-from(y⋆, ∅, ∅)
  } else {
    (j,τ) ← find-best-branch({(x(i),y(i))}i)
    t ← node-from(
      (j,τ),
      learn({(x(i),y(i))}i: xj(i)≤τ, nmin),
      learn({(x(i),y(i))}i: xj(i)>τ, nmin)
    )
    return t
  }
}
Question: what's the accuracy of this t on the learning set?
function find-best-branch({(x(i),y(i))}i) {
  (j⋆,τ⋆) ← argminj,τ (error({y(i)}i: xj(i)≤τ) + error({y(i)}i: xj(i)>τ))
  return (j⋆,τ⋆)
}
error({y(i)}i) is the error the dummy classifier would make on {y(i)}i: $\text{error}(\{y^{(i)}\}_i) = 1 - \max_y \text{Fr}(y, \{y^{(i)}\}_i)$
Instead of error(), two other variants can be used:
For all:
function find-best-branch({(x(i),y(i))}i, fimpurity) {
  (j⋆,τ⋆) ← argminj,τ (fimpurity({y(i)}i: xj(i)≤τ) + fimpurity({y(i)}i: xj(i)>τ))
  return (j⋆,τ⋆)
}
The way to measure the node impurity can thus be a parameter of find-best-branch(); it has been found that Gini is better than error for learning trees.
Here, for binary classification:
Gini and cross-entropy are smoother than the error.
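A sketch of the three impurity measures in the binary case, as functions of the frequency $f$ of one of the two classes (the natural logarithm and the $0 \log 0 = 0$ convention are our choices):

```java
public class Impurity {

  // error impurity: 1 - max_y Fr(y) = min(f, 1 - f) in the binary case
  static double error(double f) {
    return Math.min(f, 1d - f);
  }

  // Gini impurity: sum_y Fr(y) (1 - Fr(y)) = 2 f (1 - f) in the binary case
  static double gini(double f) {
    return 2d * f * (1d - f);
  }

  // cross-entropy: -sum_y Fr(y) log Fr(y), with the 0 log 0 = 0 convention
  static double crossEntropy(double f) {
    if (f == 0d || f == 1d) {
      return 0d;
    }
    return -f * Math.log(f) - (1d - f) * Math.log(1d - f);
  }
}
```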
Original version (data size):
$n \le n_\text{min}$ or $\text{error}(\{y^{(i)}\}_i) = 0$
Alternative 1 (tree depth):
requires recursively propagating the depth of the node being currently built
Alternative 2 (node impurity):
Impact of the parameter:
(for the same dataset, in general)
Learning technique with probability:
For tree learning:
Set of trees $T_{(\{1,\dots,p\}\times\mathbb{R}) \cup \mathcal{P}_Y}$:
function learn({(x(i),y(i))}i, nmin) {
  if should-stop({y(i)}i, nmin) then {
    p ← y ↦ Fr(y, {y(i)}i)
    return node-from(p, ∅, ∅)
  } else {
    (j,τ) ← find-best-branch({(x(i),y(i))}i)
    t ← node-from(
      (j,τ),
      learn({(x(i),y(i))}i: xj(i)≤τ, nmin),
      learn({(x(i),y(i))}i: xj(i)>τ, nmin)
    )
    return t
  }
}
Before (without probability):
  y⋆ ← argmaxy∈Y ∑i 1(y(i)=y)
  return node-from(y⋆, ∅, ∅)
After (with probability):
  p ← y ↦ Fr(y, {y(i)}i)
  return node-from(p, ∅, ∅)
$f'_\text{predict}: X \times M \to Y$
function predict(x, t) {
  if ¬has-children(t) then {
    p ← label-of(t)
    y⋆ ← argmaxy∈Y p(y)
    return y⋆
  } else {
    (j,τ) ← label-of(t)
    if xj ≤ τ then {
      return predict(x, left-child-of(t))
    } else {
      return predict(x, right-child-of(t))
    }
  }
}
$f''_\text{predict}: X \times M \to \mathcal{P}_Y$
function predict-with-prob(x, t) {
  if ¬has-children(t) then {
    p ← label-of(t)
    return p
  } else {
    (j,τ) ← label-of(t)
    if xj ≤ τ then {
      return predict-with-prob(x, left-child-of(t))
    } else {
      return predict-with-prob(x, right-child-of(t))
    }
  }
}
Usually, ML software libraries/tools provide ways to access both $\hat{y}$ and $p$, which are produced out of a single execution.
1st call:
(j,τ) = (1, 7)
1st-l call:
(j,τ) = (1, 2)
1st-l-l call:
return [(● 1)]
1st-l-r call:
(j,τ) = (1, 4)
1st-l-r-l call:
return [(● 1)]
1st-l-r-r call:
return [(● 2/3, ● 1/3)]
return [(1,4); [(● 1)]; [(● 2/3, ● 1/3)]]
return [(1,2); [(● 1)]; [(1,4); [(● 1)]; [(● 2/3, ● 1/3)]]]
1st-r call:
return [(● 1)]
return [(1,7); [(1,2); [(● 1)]; [(1,4); [(● 1)]; [(● 2/3, ● 1/3)]]]; [(● 1)]]
Assume:
If we apply our flearn′ to the carousel dataset with nmin=1 we obtain:
Question: is this tree ok for you?
hint: recall the other way of assessing a model, w/o the behavior
If we compare the tree (i.e., the model) against the attendant's reasoning (i.e., the real system), this tree appears too large!
We can do this, because:
The tree was large because:
The tree was large because:
In general, almost every kind of model can have different degrees of model complexity.
Moreover, almost every learning technique has at least one parameter affecting the maximum complexity of the learnable models, often called flexibility:
Usually, to obtain a complex model, you should have:
Why is our tree too complex?
Because of these two points (●●)!
What are they?
More in general: there's some noise in the data!
In practice, we often don't have a noise-free dataset {(x(i),y(i))}i, but have instead a dataset {(x(i),y′(i))}i with some noise, i.e., we have the y′ instead of the y:
However, our goal is to model s, not s+ noise!
When we have a noisy dataset (potentially always) and we allow for large complexity, by setting a flexibility parameter to a high flexibility, the learning technique fits the noisy data {(x(i),y′(i))}i instead of fitting the real system s, that is, overfitting occurs.
Image from "Il piccolo principe" (The Little Prince)
Overfits = "fits too much", hence making apparent also those artifacts that are not part of the object being wrapped
When instead we do not allow for enough complexity to model a complex real system, by setting a flexibility parameter to low flexibility, the learning technique fits neither the data nor the system, that is, underfitting occurs.
Underfits = "doesn't fit enough", hence proper characteristics of the object being wrapped are not captured
In flearn′, nmin represents the flexibility:
Extreme values:
function learn({(x(i),y(i))}i,nmin) {
if should-stop({y(i)}i,nmin) then {
p←y↦Fr(y,{y(i)}i)
return node-from(p,∅,∅)
} else {
...
}
}
function should-stop({y(i)}i,nmin) {
if n≤nmin then {
return true;
}
...
return false
}
The learned tree is a dummy classifier (with probability):
t = [(●:59/103, ●:44/103)]
As an alternative name for underfitting, we say that a learning technique exhibits high bias:
As an alternative name for overfitting, we say that a learning technique exhibits high variance:
In principle:
In practice, this is often (i.e., almost always) unfeasible:
With too low flexibility (here with error):
With too large flexibility:
Here, overfitting starts with flexibility ≥0.62
Practical procedure:
More in general, how to choose a good value for one or more parameters of the learning technique?
Assumption: "good" means "the one that corresponds to the greatest effectiveness".
From another point of view, we have k slightly different (i.e., they differ only in the value of the parameter) learning techniques and we have to choose one:
In practice:
This procedure applies to parameters in general, not just to those affecting flexibility;
Given a learning technique with h parameters p1,…,ph, each pj defined in its domain Pj, hyperparameter tuning is the task of finding the tuple p1⋆,…,ph⋆ that corresponds to the best effectiveness of the learning technique.
p1,…,ph are called hyperparameters, rather than just parameter, because in some communities and for some learning technique, the model is defined by one or more parameters (often numerical);
It's called tuning because we slightly change the hyperparameter values until we are happy with the results.
Hyperparameter tuning is a form of optimization, since we are searching the space P1×⋯×Ph for the tuple giving the best, i.e., ≈ optimal, effectiveness:
A simple form of hyperparameter tuning:
Remarks:
Consider the flearn′ for trees and these two hyperparameters:
Let's do hyperparameter tuning with grid search (assuming ∣D∣=n=1000):
Questions
Can't we just always do grid search for doing hyperparameter tuning?
Pros:
Cons:
If you do it, you can transform any learning technique w/ params into a learning technique w/o params:
function learn-free(D) {
flearn′,fpredict′←…
P1′,…,Ph′←…
flearn-effect←…
p1⋆,…,ph⋆←∅
vmax,effect←−∞
foreach p1,…,ph∈P1′×⋯×Ph′ {
veffect←flearn-effect(flearn′(⋅,p1,…,ph),fpredict′,D)
if veffect≥vmax,effect then {
vmax,effect←veffect
p1⋆,…,ph⋆←p1,…,ph
}
}
return flearn′(D,p1⋆,…,ph⋆)
}
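As a concrete counterpart of learn-free, a minimal R sketch of grid search over a single flexibility hyperparameter (here rpart's minsplit, assessed with a simple hold-out split on iris; the grid and the split are our arbitrary choices):

library(rpart)

d <- iris[sample(nrow(iris)), ]                       # shuffled dataset D
learn <- d[1:100, ]; test <- d[101:150, ]
best.p <- NA; best.acc <- -Inf
for (p in c(2, 5, 10, 20, 50)) {                      # the grid P'
  m <- rpart(Species ~ ., learn, control = rpart.control(minsplit = p))
  acc <- mean(predict(m, test, type = "class") == test$Species)
  if (acc > best.acc) { best.acc <- acc; best.p <- p }
}
final.model <- rpart(Species ~ ., d, control = rpart.control(minsplit = best.p))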
Consider the flearn′ for trees and these two hyperparameters:
Consider the improved, hyperparameter-free version of flearn′ called flearn-free′:
Suppose you want to compare it against the plain version (with nmin=10 and pimpurity=Gini):
Questions
Up to now, the flearn′ for trees (i.e., recursive binary splitting) was defined¹ as: flearn′: P∗(X1×⋯×Xp×Y) → T_(({1,…,p}×R)∪Y) with:
These constraints were needed because:
Can we remove these constraints?
With numerical variables (xj∈R):
With find-best-branch(), we find (the index j of) a variable xj and a threshold value τ that well separates the data, i.e., we split the data in:
No other cases exist: it's a binary split.
Example
xage∈[0,120]
With categorical variables (xj∈Xj):
With find-best-branch(), we find (the index j of) a variable xj and a set of values Xj′⊂Xj that well separates the data, i.e., we split the data in:
No other cases exist: it's a binary split.
Example
xcity∈{Ts,Ud,Ve,Pn,Go}
For a given numerical variable xj∈R, we choose τ⋆ such that: τ⋆ = argmin_{τ∈R} ( fimpurity({y^(i)}_{i: x_j^(i)≤τ}) + fimpurity({y^(i)}_{i: x_j^(i)>τ}) ). In practice, we search the set of midpoints rather than the entire R: there are n−1 midpoints in a dataset with n elements.
Even better, we can consider only the midpoints between consecutive values x_j^(i1), x_j^(i2) for which the labels are different, i.e., y^(i1) ≠ y^(i2).
For a given categorical variable xj∈Xj, we choose Xj⋆⊂Xj such that: Xj⋆ = argmin_{Xj′∈P(Xj)} ( fimpurity({y^(i)}_{i: x_j^(i)∈Xj′}) + fimpurity({y^(i)}_{i: x_j^(i)∉Xj′}) ). We search the set P(Xj) of subsets (i.e., the powerset) of Xj, which has 2^∣Xj∣ values.
Assume a problem with X = X_1×⋯×X_{p_num} × X_{p_num+1}×⋯×X_{p_num+p_cat}, i.e.:
The labels of the tree nodes can be:
So the model is a t∈:
Recursive binary splitting may be used for regression: the learned trees are called regression trees.
Required changes:
In flearn′, when should-stop() is met, "most frequent class label" does not make sense anymore.
Solution: use the mean y.
Classification
The terminal node label is the most frequent class: y⋆=y∈YargmaxFr(y,{y(i)}i)
If you have to choose just one y, y⋆ is the one that minimizes the classification error.
Regression
The terminal node label is the mean y value: y⋆ = (1/n) ∑_i y^(i) = ȳ
If you have to choose just one y, y⋆ is the one that minimizes the MSE.
Indeed, a dummy regressor always predicting the mean value ȳ should be considered a baseline for regression, like the dummy classifier is a baseline for classification:
In find-best-branch(), minimizing the error() does not make sense anymore (same for gini() and cross-entropy()).
Solution: use the residual sum of squares (RSS).
Classification
The branch is chosen for which the sum of the impurity on the two sides is the lowest: (j⋆,τ⋆) ← argmin_{j,τ} ( error({y^(i)}_{i: x_j^(i)≤τ}) + error({y^(i)}_{i: x_j^(i)>τ}) ); similarly, for categorical variables
Regression
The branch is chosen for which the sum of the RSS on the two sides is the lowest: (j⋆,τ⋆) ← argmin_{j,τ} ( RSS({y^(i)}_{i: x_j^(i)≤τ}) + RSS({y^(i)}_{i: x_j^(i)>τ}) ) where: RSS({y^(i)}_i) = ∑_i (y^(i) − ȳ)²
similarly, for categorical variables; RSS(⋅) = n·MSE(⋅)
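A minimal R sketch (helper names are our own) of the RSS-based search of the best threshold τ⋆ for a single numerical variable, scanning the midpoints:

rss <- function(y) sum((y - mean(y))^2)

best.threshold <- function(x, y) {
  s <- sort(x)
  taus <- (s[-1] + s[-length(s)]) / 2                  # the n-1 midpoints
  costs <- sapply(taus, function(tau) rss(y[x <= tau]) + rss(y[x > tau]))
  taus[which.min(costs)]
}

best.threshold(mtcars$wt, mtcars$mpg)                  # e.g., on a built-in dataset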
In should-stop(), checking if error()=0 does not make sense anymore.
Solution: just use RSS.
Classification
Stop if n≤nmin or error()=0.
function should-stop({y(i)}i,nmin) {
if n≤nmin then {
return true;
}
if error({y(i)}i)=0 then {
return true;
}
return false
}
Regression
Stop if n≤nmin or RSS()=0.
function should-stop({y(i)}i,nmin) {
if n≤nmin then {
return true;
}
if RSS({y(i)}i)=0 then {
return true;
}
return false
}
In practice, the condition RSS()=0 holds far less frequently than the condition error()=0.
With few variables, p≤2 for classification, p=1 for regression, the model can be visualized.
Classification
The colored regions are the model. The border(s) between regions with different colors (i.e., different decisions) is the decision boundary.
Regression
The line is the model.
Question: can you draw the tree for this model?
image from Fabio Daolio
Questions
Applicability 👍👍
Efficiency 👍
Explainability/interpretability 👍👍👍
So, why are we not using trees for/in every ML system?
image from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013
The effectiveness depends on the problem and may be limited by the fact that branch nodes consider one variable at once.
The decision boundary of the model is hence constrained to be locally parallel to one of the axes:
There exist oblique decision trees, which should overcome this limitation.
Actual execution of:
is not made by hand, but by a computer that executes some software.
Nowadays, there are many options.
A few:
And many others.
How to choose an ML software?
Possible criteria:
In general, the ML software provides an interface that models the key concepts of learning (flearn′) and prediction (fpredict′) phases and the one of the model.
Example (Java+SMILE):
DataFrame dataFrame = ...
RandomForest classifier = RandomForest.fit(Formula.lhs("label"), dataFrame);
Tuple observation = ...;
int predictedLabel = classifier.predict(observation);
Example (R):
d = ...
classifier = randomForest(label~., d)
newD = ...
newLabels = predict(classifier, newD)
R is:
RStudio is:
Some R resources:
(r is the most popular tag)
There are some built-in data types.
Basic:
Composed:
R is not strongly typed: there are (some) implicit conversions.
A peculiar data type is formula:
decision~age+height
Species~.
. means "every other variable"

> a=3
> a
[1] 3
> v=c(1,2,3)
> v
[1] 1 2 3
> d=as.data.frame(cbind(age=c(20,21,21)))
> d$gender=factor(c("m","m","f"))
> d
  age gender
1  20      m
2  21      m
3  21      f
> levels(d$gender)
[1] "f" "m"
> dep=salary~degree.level+age
> dep
salary ~ degree.level + age
> f = function(x) {x+3}
> f(2)
[1] 5
a is a numeric
v is a vector of numeric
d is a data frame
dep is a formula
f is a function
cbind() stands for column bind (there's an rbind() too)
factor() makes a vector of character a vector of factors
levels() gives the possible values of a factor, i.e.: d$gender is {x_2^(i)}_i, levels(d$gender) is X_2

There are many packages for reading weird file types.
Some built-in functions for reading/writing CSV files (and variants):
read.csv(), read.csv2(), read.table()
write.csv(), write.csv2(), write.table()
Some built-in functions for reading/writing data in an R-native format:
save()
load()
With summary() (built-in):
> d=iris
> summary(d)
  Sepal.Length    Sepal.Width     Petal.Length
 Min.   :4.300   Min.   :2.000   Min.   :1.000
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600
 Median :5.800   Median :3.000   Median :4.350
 Mean   :5.843   Mean   :3.057   Mean   :3.758
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100
 Max.   :7.900   Max.   :4.400   Max.   :6.900
  Petal.Width          Species
 Min.   :0.100   setosa    :50
 1st Qu.:0.300   versicolor:50
 Median :1.300   virginica :50
 Mean   :1.199
 3rd Qu.:1.800
 Max.   :2.500
With skim() from the skimr package:

> skim(d)
── Data Summary ────────────────────────
                           Values
Name                       d
Number of rows             150
Number of columns          5
_______________________
Column type frequency:
  factor                   1
  numeric                  4
________________________
Group variables            None

── Variable type: factor ────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts
1 Species               0             1 FALSE          3 set: 50, ver: 50, vir: 50

── Variable type: numeric ───────────────────────────
  skim_variable n_missing complete_rate  mean    sd  p0 p25  p50 p75 p100 hist
1 Sepal.Length          0             1  5.84 0.828 4.3 5.1 5.8  6.4  7.9 ▆▇▇▅▂
2 Sepal.Width           0             1  3.06 0.436 2   2.8 3    3.3  4.4 ▁▆▇▂▁
3 Petal.Length          0             1  3.76 1.77  1   1.6 4.35 5.1  6.9 ▇▁▆▇▂
4 Petal.Width           0             1  1.20 0.762 0.1 0.3 1.3  1.8  2.5 ▇▁▇▅▃
Sizes with length(), dim(), nrow(), ncol(); names with names() (same as colnames()), rownames():

names(d)[2:3] = c("cows", "dogs")

Here d is a multivariate dataset, but which variable is y is not specified.
On vectors:
> v=seq(1,2,by=0.25)
> v
[1] 1.00 1.25 1.50 1.75 2.00
> v[2]
[1] 1.25
> v[2:3]
[1] 1.25 1.50
> v[-2]
[1] 1.00 1.50 1.75 2.00
> v[c(1,2,4)]
[1] 1.00 1.25 1.75
> v[c(T,F,F,T)]
[1] 1.00 1.75 2.00
> v[v<1.6]
[1] 1.00 1.25 1.50
> v[which(v<1.6)]
[1] 1.00 1.25 1.50
On data frames:
> d
  age gender
1  20      m
2  21      m
3  21      f
> d[1,2]
[1] m
Levels: f m
> d[,2]
[1] m m f
Levels: f m
> d[1,]
  age gender
1  20      m
> d$age
[1] 20 21 21
Question: what is d[,c("age","age")]?
tidyverse
> iris %>%
+   group_by(Species) %>%
+   summarize_at(vars(Sepal.Length, Sepal.Width), list(mean=mean, sd=sd)) %>%
+   pivot_longer(-Species)
# A tibble: 12 × 3
   Species    name              value
   <fct>      <chr>             <dbl>
 1 setosa     Sepal.Length_mean 5.01
 2 setosa     Sepal.Width_mean  3.43
 3 setosa     Sepal.Length_sd   0.352
 4 setosa     Sepal.Width_sd    0.379
 5 versicolor Sepal.Length_mean 5.94
 6 versicolor Sepal.Width_mean  2.77
 7 versicolor Sepal.Length_sd   0.516
 8 versicolor Sepal.Width_sd    0.314
 9 virginica  Sepal.Length_mean 6.59
10 virginica  Sepal.Width_mean  2.97
11 virginica  Sepal.Length_sd   0.636
12 virginica  Sepal.Width_sd    0.322
Useful for:
dplyr (cheatsheet)
ggplot2 (cheatsheet)
tidyr (cheatsheet)
readr (cheatsheet)
Very useful, indeed!
plot(); since it is overloaded for many custom data types, you can always try feeding plot() with something and see what happens...

What's the hardest variable to be predicted in the dataset?
Hints:
iris (the dataset)
tree or rpart (the latter might be a bit better) for learning
predict() for prediction

Consider this dataset obtained from a system:
Question: how would you "draw the system" behind this data?
If we learn a regression tree with low flexibility:
If we learn a regression tree with high flexibility:
It might be that there is no flexibility value for which we have no underfitting nor overfitting.
What's that point at (≈80,≈22)?
What if we collect another dataset out of the same system?
Model with low complexity:
"a moving object"
Model with high complexity:
"a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, a windshield, curved rear enclosing engine"
Low complexity | High complexity |
---|---|
"a moving object" | "a blue-colored moving object with 4 wheels, 2 doors, chromed fenders, curved rear enclosing engine" |
"a moving object" | "a red-colored moving object with 4 wheels, 2 doors, side air intakes, and a small horse figure" |
"a moving object" | "a small red-colored moving object with 4 wheels, 2 doors, a white stripe on the front, chromed fenders, sunroof" |
Low complexity: never gives enough details about the system
High complexity: always gives a fair amount of details about the system, but also about noise
What if we combine different models with high complexity?
When "learners" are common people, this idea is related with the wisdom of the crowds theorem, stating that "a collective opinion may be better than a single expert's opinion".
"a collective opinion may be better than a single expert's opinion"
Yes, but only if:
"a collective opinion may be better than a single expert's opinion"
Yes, but only if:
Can we realize a wisdom of the trees? (where (opinion, person) ↔ (prediction, tree))
A tree is the result of the execution of flearn′ on a learning set Dlearn={(x(i),y(i))}i.
flearn′ is deterministic, thus:
In order to obtain different trees, we need to apply flearn′ on different learning sets!
But we have just one learning set... 🤔
Question: what's the learning set for human-learners?
Goal: obtaining m different datasets Dlearn,1,…,Dlearn,m from a dataset Dlearn
Option 1: (CV-like)
Requirements check:
Option 2: rand. sampling w/ repetitions
Requirements check:
On Dlearn:
In general:
function sample-rep({x1,…,xn}) {
X′←∅
while ∣X′∣<n {
j←uniform({1,…,n})
X′←X′∪{xj}
}
return X′
}
Remarks:
Not deterministic, thus:
recall: input and output are multisets
Given an input with n elements and assuming uniqueness, an element has a probability (1−1/n)^n ≈ e^(−1) ≈ 0.37 of not being included in the output:
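This can be checked empirically in R (n is our arbitrary choice):

n <- 1000
idx <- sample(n, n, replace = TRUE)   # sample-rep on the indices 1..n
mean(!(1:n %in% idx))                 # fraction not sampled: ~0.37 ≈ e^-1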
Can we realize a wisdom of the trees? (where (opinion, person) ↔ (prediction, tree))
Ok, we can define a new learning technique that realizes this idea!
This technique is called tree bagging.
function learn({(x(i),y(i))}i,ntree) {
T′←∅
while ∣T′∣<ntree {
{(x(ji),y(ji))}ji←sample-rep({(x(i),y(i))}i)
t←learnsingle({(x(ji),y(ji))}ji,1)
T′←T′∪{t}
}
return T′
}
Recall: since one part of this flearn′ is not deterministic (namely, sample-rep()), the entire flearn′ is not deterministic!
Classification (decision trees)
function predict(x,{tj}j) {
return argmaxy∈Y∑j1(y=predictsingle(x,tj))
}
Regression (regression trees)
function predict(x,{tj}j) {
return (1/∣{tj}j∣) ∑_j predictsingle(x,tj)
}
Apparently yes:
Apparently no:
So what? 🤔
"Experimentally", it turns out that:
Note that bagging with ntree=1 is¹ single tree learning.
"Experimentally", it turns out that:
Note that bagging with ntree=1 is¹ single tree learning.
Question: can we hence set an arbitrarily large ntree?
"Experimentally", it turns out that:
Note that bagging with ntree=1 is¹ single tree learning.
Question: can we hence set an arbitrarly large ntree?
No! Efficiency linearly decreases with ntree:
Since it is based on the learning technique for single trees, bagging has the same applicability:
Note that the idea behind tree bagging can be applied to any base learning technique:
Wisdom of the trees:
Tree independence is obtained by learning them on (slightly) different datasets.
If there are variables (strong predictors) which are very useful for separating the observations, all trees may still share a very similar structure, due to the way they are built.
Can we further increase tree independence?
Yes!
Idea: when learning each tree, remove some randomly chosen independent variables from the observations
Tree bagging improved with variables removal is a learning technique called Random Forest:
function learn({(x(i),y(i))}i,ntree,nvars) {
T′←∅
while ∣T′∣<ntree {
{(x(ji),y(ji))}ji←sample-rep({(x(i),y(i))}i)
{(x′(ji),y(ji))}ji←retain-vars({(x(ji),y(ji))}ji,nvars)
t←learnsingle({(x′(ji),y(ji))}ji,1)
T′←T′∪{t}
}
return T′
}
Two parts of this flearn′ are not deterministic (namely, sample-rep() and retain-vars()), hence the entire flearn′ is not deterministic!
Classification (decision trees)
function predict(x,{tj}j) {
return argmaxy∈Y∑j1(y=predictsingle(x,tj))
}
Regression (regression trees)
function predict(x,{tj}j) {
return (1/∣{tj}j∣) ∑_j predictsingle(x,tj)
}
Exactly the same as for tree bagging
Question: some of the trees in the bag do not have all variables of x: is this a problem?
No, the tree is still able to process an x, but will not consider (i.e., not use in branch nodes) some of its variable values;
No, "experimentally", it turns out that:
⌈x⌉ is ceil(x), i.e., rounding to closest larger integer; ⌊x⌋ is floor(x), i.e., rounding to closest smaller integer
Both ntree and nvars do not impact the tendency to overfit.
In practice, we can use the default values for both:
⇒ Random Forest is (almost) a (hyper)parameter-free learning technique!
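For instance, with the randomForest R package, a minimal sketch relying on the defaults (ntree = 500 and, for classification, mtry = ⌊√p⌋, which plays the role of nvars):

library(randomForest)

rf <- randomForest(Species ~ ., iris)   # defaults: ntree = 500, mtry = floor(sqrt(p))
rf                                      # printing shows the OOB error estimate
predict(rf, iris[c(1, 51, 101), ])      # y-hat for three observations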
However, "we can use the default values"
image from Fabio Daolio
How is this image built?
Finding: the bag nicely models the real system
function learn({(x(i),y(i))}i,ntree,nvars) {
T′←∅
while ∣T′∣<ntree {
{(x(ji),y(ji))}ji←sample-rep({(x(i),y(i))}i)
{(x′(ji),y(ji))}ji←retain-vars({(x(ji),y(ji))}ji,nvars)
t←learnsingle({(x′(ji),y(ji))}ji,1)
T′←T′∪{t}
}
return T′
}
Toy example with D={●,●,●,●,●}
For every tree, there are zero or more observations that have not been used for learning it.
From another point of view, for every i-th observation (x(i),y(i)), there are some trees which have been learned without that observation:
⇒ use unseen observations for measuring an estimate of the error (or accuracy, or another index) on the testing set (the OOB error)
Computing the OOB error during the learning:
Remarks:
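With randomForest, the OOB estimate computed during learning is directly available on the fitted object (a sketch reusing the rf object fitted above):

rf$err.rate[rf$ntree, "OOB"]   # OOB error estimate after the last tree
plot(rf)                       # OOB (and per-class) error vs. number of trees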
image from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013
Is this model interpretable (ntree=1)?
Is this model interpretable (ntree=100)?
Interpretation of the model (i.e., global explainability) is feasible if the model can be visualized:
There exist other flavors of interpretability:
By looking at this tree, we can understand:
In principle, this can be done also for a bag of trees, but it would not scale well... in human terms
Can we have a much coarser view on variables role that scales well to large ntree?
Yes!
Idea (first option: mean RSS/Gini decrease): when learning
function learn({(x(i),y(i))}i,ntree,nvars) {
v←0
T′←∅
while ∣T′∣<ntree {
{(x(ji),y(ji))}ji←sample-rep({(x(i),y(i))}i)
{(x′(ji),y(ji))}ji←retain-vars({(x(ji),y(ji))}ji,nvars)
t←learnsingle({(x′(ji),y(ji))}ji,1,v)
T′←T′∪{t}
}
return (T′,v)
}
function learnsingle({(x(i),y(i))}i,nmin,v) {
if should-stop({y(i)}i,nmin) then { ... } else {
ebefore←gini({y(i)}i)
(j,τ)←find-best-branch({(x(i),y(i))}i)
eafter ← gini({y^(i)}_{i: x_j^(i)≤τ}) + gini({y^(i)}_{i: x_j^(i)>τ})
vj←vj+ebefore−eafter
t←node-from((j,τ),
learnsingle({(x^(i),y^(i))}_{i: x_j^(i)≤τ}, nmin, v),
learnsingle({(x^(i),y^(i))}_{i: x_j^(i)>τ}, nmin, v)
)
return t
}
}
Example: gini({y(i)}i)=∑yFr(y,{y(i)}i)(1−Fr(y,{y(i)}i))
{y(i)}i | Gini | Gini∣≤τ | Gini∣>τ | Decrease |
---|---|---|---|---|
●●●●/●●●● | 0.5 | 0 | 0 | 0.5 |
●●●●●●/●● | 0.375 | 0 | 0 | 0.375 |
●●●●●/●●● | 0.375 | 0 | 0.333 | 0.042 |
●●●●/●●●● | 0.5 | 0.25 | 0.25 | 0 |
Question: is this xj categorical or numerical?
It has been shown experimentally that RSS/Gini decrease is not effective as a variable importance measure: variables with many possible values allow many candidate splits (→ many branches) and hence tend to accumulate spuriously large decreases.
Idea (second option: mean accuracy decrease): just after learning
Rationale: if the decrease is low, it means that shuffling the variable has no effect, so the variable is not really important!
There is also a further, more general variant, that works for any learning technique flearn′,fpredict′:
Idea (third option: feature ablation):
This method is (a form of) feature ablation, since you remove variables/features and see what happens:
In summary, for variable importance, we have three options:
Option | Effectiveness | Efficiency | Applicability |
---|---|---|---|
Mean RSS/Gini decrease | 🤏1 | 👍2 | 🤏 only trees |
Mean accuracy decrease | 👍 | 👍3 | 🤏 bagging |
Feature ablation | 👍4 | 🤏 | 👍 universal |
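With randomForest, the first two options are computed for you if importance = TRUE is set (a minimal sketch):

library(randomForest)

rf <- randomForest(Species ~ ., iris, importance = TRUE)
importance(rf)   # MeanDecreaseAccuracy and MeanDecreaseGini columns
varImpPlot(rf)   # plots the two rankings side by side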
Regardless of the method you use for computing the variable importance, a ranking of the variables according to their importance for having a good model is a basic form of interpretability, as it answers the question:
that should mean:
Applicability: same as trees 👍👍👍
Efficiency 👍
Explainability/interpretability 👍👍
Unless¹ you really need to look at the tree, Random Forest is always better than a single tree:
Some researchers did a large scale comparison of many supervised machine learning techniques:
We evaluate 179 classifiers arising from 17 families [...]
We use 121 data sets [...]
The classifiers most likely to be the bests are the random forest [...]
According to practice, we just need Random Forest. But...
Earlier, some researchers formulated the No Free Lunch theorem¹:
Any two optimization algorithms¹ are equivalent when their performance is averaged across all possible problems²
According to theory, all learning techniques are the same.
There are many restaurants, each with all food items on the list: food price is in general different among restaurants.
Where should you go to eat?
If you just want to eat something, there is no restaurant where everything costs less.
But if you know what you want to eat, there's at least one restaurant where that thing has the lowest price.
Dataset:
A single tree, here, would struggle in establishing a good decision boundary: many corners, many branch nodes.
By looking at the data, we see that a simple line would likely be a good decision boundary
Can we draw that simple line?
Yes, we can! Here it is!
Despite its apparent simplicity, this "draw the line" operation implies:
Implicitly, we already defined M, flearn′: P∗(R²×Y)→M, and fpredict′: R²×M→Y
We followed the same approach for trees: now we are more experienced and we can go faster in formalizing it.
Formally, a line-shaped decision boundary in X=R2 can be defined as x2=mx1+q where m is the slope and q is the intercept.
Alternatively, as: β0+β1x1+β2x2=0; note that there are many triplets (β0,β1,β2) defining the same line.
More in general, in X=Rp, we can define a separating hyperplane as: β0+β1x1+⋯+βpxp=0 or, in vectorial form, with β,x∈Rp: β0+β⊺x=0
Intuitively:
Formally:
Example: This particular line is: 2+1.1x1+x2=0
For x=(10,10):
For x=(−10,−10):
function predict(x,(β0,β)) {
if β0+β⊺x≥0 then {
return Pos
} else {
return Neg
}
}
Assumptions:
Intuitively, β0+β⊺x says on which side of the hyperplane x lies (and how far from it):
function predict(x,(β0,β)) {
if β0+β⊺x≥0 then {
return Pos
} else {
return Neg
}
}
Can we use this like a probability? Can we have an fpredict′′ for the hyperplane?
No! Because β0+β⊺x is not bounded in [0,1]
You may map the domain of β0+β⊺x, i.e., (−∞,+∞), to [0,1] with, e.g., tanh: if v∈(−∞,+∞), then ½+½tanh(v)∈[0,1].
But this is not a common practice, because it still would not be a real probability.
How to choose the separating line?
First attempt:
Choose the one that:
🫣 this condition holds, in general, for infinitely many lines...
Second attempt:
Choose the one that:
The hyperplane that separates the two classes with the greatest margin m is called the maximal margin classifier (MMC).
Maximal margin classifier:
Names:
If you move (not too much) any of the points which are not support vectors, the separating hyperplane stays the same!
Intuitively:
Choose the one that:
Looks like an optimization problem:
Formally:
max_{β0,…,βp} m
subject to ∑_{j=1}^{j=p} βj² = β⊺β = 1
and y^(i)(β0+β⊺x^(i)) ≥ m, ∀i∈{1,…,n}
that means:
Assume by convention that Pos↔+1 and Neg↔−1, so y(i)(…)≥m is like ⋯≥m for positives and ⋯≤−m for negatives
function learn({(x(i),y(i))}i) {
(β0,β)←solve(
maxβ0,…,βpm,
β⊺β=1∧y(i)(β0+β⊺x(i))≥m,∀i
)
return (β0,β)
}
In practice, this is an easy optimization problem and solving it is fast! for a computer
This learning technique is called maximal margin classifier learning.
Efficiency: 👍
Applicability: 🫳
Effectiveness: 🤔
function learn({(x(i),y(i))}i) {
(β0,β)←solve(
maxβ0,…,βpm,
β⊺β=1∧y(i)(β0+β⊺x(i))≥m,∀i
)
return (β0,β)
}
function predict(x,(β0,β)) {
if β0+β⊺x≥0 then {
return Pos
} else {
return Neg
}
}
Support vectors:
If you move (not too much) any of the points which are not support vectors, the separating hyperplane stays the same!
But, if you move a support vector, then the separating hyperplane moves!
Even worse, if you apply some noise¹ to some label y(i), it might be that a separating hyperplane does not exist at all! 😱
⇒ Applicability: 👎👎👎
How did the tree cope with y noise?
Can we make MMC tolerant too?
max_{β0,…,βp,ϵ^(1),…,ϵ^(n)} m
subject to β⊺β = 1
y^(i)(β0+β⊺x^(i)) ≥ m(1−ϵ^(i)), ∀i∈{1,…,n}
ϵ^(i) ≥ 0, ∀i∈{1,…,n}
∑_{i=1}^{i=n} ϵ^(i) ≤ c
This learning technique is called soft margin classifier (SMC, or support vector classifier), because, due to tolerance, the margin can be pushed.
It has one parameter, c:
max_{β0,…,βp,ϵ^(1),…,ϵ^(n)} m
subject to β⊺β = 1
y^(i)(β0+β⊺x^(i)) ≥ m(1−ϵ^(i)), ∀i∈{1,…,n}
ϵ^(i) ≥ 0, ∀i∈{1,…,n}
∑_{i=1}^{i=n} ϵ^(i) ≤ c
c=+∞ → infinite tolerance → you can put the line wherever you want
c=0 → no tolerance → any noise will change the model
The threshold c_learnable for learnability depends:
Actually, the margin m of the MMC itself depends on the scales of variables!
Original dataset
Scaled dataset: each xj is ×½
D={
(1,1,●),
(3,3,●)
}
D={
(0.5,0.5,●),
(1.5,1.5,●)
}
m = √(1²+1²) = √2
m = √((½)²+(½)²) = √2/2
Moreover, the coefficients βj depend on the scales of the variables too!
Intuitively: if xj and xj′ have very different scales, then βj will be very different from βj′, making the computation of β0+β⊺x^(i) (and hence the model) rather sensitive to noise.
Hence, when using MMC (or SMC, or SVM), you¹ should rescale the variables. Options:
Standardization is, in general, preferred as it is more robust to outliers.
Since you have to do the scaling both in learning and prediction, the coefficients needed for scaling (i.e., min,max or μ,σ) do belong to the model!
Learning with scaling: (here, standardization)
(m,μ,σ) is the model with scaling, with μ,σ∈Rp. Here, join builds a tuple
Prediction with scaling:
If you use the entire dataset (e.g., in CV, or in train/test static division) for computing μ,σ, then you are cheating!
Question: can you write down the pseudocode of "scale"? And scaling? Are they the same?
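A possible answer in R, using the built-in scale(): μ and σ are computed on the learning data only and then reused, as part of the model, on the new data (the split is our arbitrary choice):

train.x <- as.matrix(iris[1:100, 1:4])
test.x <- as.matrix(iris[101:150, 1:4])
scaled.train <- scale(train.x)               # computes and applies mu, sigma
mu <- attr(scaled.train, "scaled:center")    # mu, sigma belong to the model
sigma <- attr(scaled.train, "scaled:scale")
scaled.test <- scale(test.x, center = mu, scale = sigma)  # reuses train mu, sigma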
max_{β0,…,βp,ϵ^(1),…,ϵ^(n)} ( m − c ∑_{i=1}^{i=n} ϵ^(i) )
subject to β⊺β = 1
y^(i)(β0+β⊺x^(i)) ≥ m(1−ϵ^(i)), ∀i∈{1,…,n}
ϵ^(i) ≥ 0, ∀i∈{1,…,n}
This is also the learning technique called soft margin classifier.
Most of the ML sw/libraries are based on this formulation.
The 1st one is often shown in books, e.g., in James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013
max_{β0,…,βp,ϵ^(1),…,ϵ^(n)} ( m − c ∑_{i=1}^{i=n} ϵ^(i) )
subject to β⊺β = 1
y^(i)(β0+β⊺x^(i)) ≥ m(1−ϵ^(i)), ∀i∈{1,…,n}
ϵ^(i) ≥ 0, ∀i∈{1,…,n}
c=0 → no weight to ∑_{i=1}^{i=n} ϵ^(i) → points that are inside the margin cost zero
c=+∞ → infinite weight to ∑_{i=1}^{i=n} ϵ^(i) → points that are inside the margin cost a lot
Similarities:
Differences:
In practice:
Yes, with the 2nd formulation, we can learn an SMC, but it will be a poor model:
More in general, not every binary classification problem can be solved with a hyperplane.
Can we learn non linear decision boundaries?
Yes, we can!
But...
❗ Disclaimer ❗
There will be some harder mathematics. We are going to make it simple.
To simplify it, we'll walk riskily on the edge of correctness...
First, let's give a name to the core computation of fpredict′: f(x) = β0 + β⊺x = β0 + ∑_{j=1}^{j=p} βj·xj, with f: Rp→R.
It turns out that this same f can be written also as: f(x) = β0 + ∑_{i=1}^{i=n} α^(i) ⟨x,x^(i)⟩, where ⟨x,x′⟩ = x⊺x′ = ∑_{j=1}^{j=p} xj·xj′ is the inner product.
⟨⋅,⋅⟩:Rp×Rp→R can also be defined on other sets than Rp, so it's not just x⊺x′...
Remarks:
Given that:
β0 + β⊺x = f(x) = β0 + ∑_{i=1}^{i=n} α^(i) x⊺x^(i)
If you move¹ any point which is not a support vector, by definition f(x) must stay the same:
Hence, it follows that α(i)=0 for every x(i) which is not a support vector!
More in general, each α^(i) says what the contribution of the corresponding x^(i) is when classifying x: 0 means no contribution.
From the point of view of the optimization, solve() for the second formulation gives (β0,α), with α∈Rn: this also says which are the support vectors. Similarly, the model is (β0,α) instead of (β0,β).
Ok, but what about going beyond the hyperplane? We are almost there...
The second formulation may be generalized: f(x) = β0 + ∑_{i=1}^{i=n} α^(i) k(x,x^(i)), where k: Rp×Rp→R is a kernel function.
The idea behind the kernel function is to:
hoping that a hyperplane can separate the points in X′ better than in X.
This thing is called the kernel trick. Understanding the math behind it is beyond the scope of this course. Understanding the way the optimization works with a kernel is beyond the scope of this course.
When you use a kernel, this technique is called Support Vector Machines (SVM) learning.
Linear kernel:
k(x,x′)=x⊺x′
Polynomial kernel:
k(x,x′) = (1+x⊺x′)^d
Gaussian kernel:
k(x,x′) = e^(−γ∥x−x′∥²)
f(x) = β0 + ∑_{i=1}^{i=n} α^(i) k(x,x^(i))
Regardless of the kernel being used, each α^(i) says what the contribution of the corresponding x^(i) is when evaluating f(x) inside fpredict′.
k(x,x′) = e^(−γ∥x−x′∥²) and f(x) = β0 + ∑_{i=1}^{i=n} α^(i) k(x,x^(i))
Let's consider a point x moving from (0,3.5) to (6,3.5):
Legend: γ=0.1, γ=1, γ=10
Intuitively, and very broadly speaking, the Gaussian kernel maps an x to the space where coordinates are the distances to relevant observations of the learning data.
In practice, the decision boundary can smoothly follow any path:
Image from Wikipedia
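With the e1071 R package, a minimal sketch of SVM learning with the Gaussian (radial) kernel; gamma is γ, and cost plays the role of the c of the second formulation (values are our arbitrary choices):

library(e1071)

d <- iris[iris$Species != "setosa", ]      # a binary problem
d$Species <- factor(d$Species)
m <- svm(Species ~ Petal.Length + Petal.Width, d,
         kernel = "radial", gamma = 1, cost = 1)
plot(m, d, Petal.Width ~ Petal.Length)     # shows the (non-linear) decision boundary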
Efficiency 👍👍👍
Explainability/interpretability 🫳
Effectiveness 👍👍
Applicability 🫳
Let X=X1×⋯×Xp:
Xj | Y | RF | SVM |
---|---|---|---|
Numerical | Binary classification | ✅ | ✅ |
Categorical | Binary classification | ✅ | ❌ |
Numerical + Categorical | Binary classification | ✅ | ❌ |
Numerical | Multiclass classification | ✅ | ❌ |
Categorical | Multiclass classification | ✅ | ❌ |
Numerical + Categorical | Multiclass classification | ✅ | ❌ |
Numerical | Regression | ✅ | ❌ |
Categorical | Regression | ✅ | ❌ |
Numerical + Categorical | Regression | ✅ | ❌ |
Let's start by fixing SVM!
Let xj be categorical:
Then, we can replace it with k numerical variables:
such that: ∀i,k: x_{h_k}^(i) = 1(x_j^(i) = x_{j,k})
This way of encoding one categorical variable with k possible values to k binary numerical variables is called one-hot encoding.
Each one of the resulting binary variables is a dummy variable.
A similar encoding can be applied when Xj=P(A).
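In R, one-hot encoding can be obtained with the built-in model.matrix() (a minimal sketch; note that with an intercept R would instead drop one level per factor):

d <- data.frame(age = c(20, 31, 25), city = factor(c("Ts", "Ud", "Ts")))
model.matrix(~ age + city - 1, d)   # "- 1": no intercept, one dummy per city value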
Example: (extended carousel)
Original features: age, height, city p=3
Transformed features: p=7
with:
hence, e.g.:
Let flearn′,fpredict′ be a learning technique applicable to X,Ybinary where Ybinary={Pos,Neg} that produces models in M, i.e., flearn′:P∗(X×Ybinary)→M and fpredict′:X×M→Ybinary.
Let Y={y1,…,yk} be a finite set with k>2 values.
Consider a new learning technique flearn,ovo′,fpredict,ovo′, based on flearn′,fpredict′, that:
In learning: flearn,ovo′: P∗(X×Y) → M^(k(k−1)/2)
Given a D∈P∗(X×Y):
each mh1,h2 is a binary classification model learned on ∣D′∣<∣D∣ obs.
In prediction: fpredict,ovo′: X×M^(k(k−1)/2) → Y
Given an x∈X and a model M∈M^(k(k−1)/2):
v counts the times a class has been predicted
can be extended for giving a probability
Let flearn′,fpredict′′′ be a learning technique with confidence/probability, i.e., flearn′:P∗(X×Ybinary)→M and fpredict′′′:X×M→R, with fpredict′′′(x,m) being the confidence that x is Pos. probability would be →[0,1]
Let Y={y1,…,yk} be a finite set with k>2 values.
Consider a new learning technique flearn,ova′,fpredict,ova′, based on flearn′,fpredict′′′, that:
In learning: flearn,ova′:P∗(X×Y)→Mk
Given a D∈P∗(X×Y):
each mh is a binary classification model learned on ∣D′∣=∣D∣ obs.
In prediction: fpredict,ova′:X×Mk→Y
Given an x∈X and a model M∈Mk:
v holds the confidences for each class
can be extended for giving a probability
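A minimal sketch of one-vs-all in R, built on top of a binary learner with probability (here e1071's svm with probability = TRUE; the function names and the y column name are our own):

library(e1071)

ova.learn <- function(d) {   # d has the class in column y
  classes <- levels(d$y)
  models <- lapply(classes, function(cl) {
    d2 <- d
    d2$y <- factor(ifelse(d$y == cl, "Pos", "Neg"))
    svm(y ~ ., d2, probability = TRUE)   # binary model for cl vs. rest
  })
  names(models) <- classes
  models
}

ova.predict <- function(models, x) {     # pick the class with the largest Pr(Pos)
  conf <- sapply(models, function(m) {
    p <- predict(m, x, probability = TRUE)
    attr(p, "probabilities")[1, "Pos"]
  })
  names(which.max(conf))
}

d <- iris; names(d)[5] <- "y"
models <- ova.learn(d)
ova.predict(models, d[101, ])   # predicts the class of one observation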
Let X=X1×⋯×Xp:
Xj | Y | RF | SVM | SVM+ |
---|---|---|---|---|
Numerical | Binary classification | ✅ | ✅ | ✅ |
Categorical | Binary classification | ✅ | ❌ | ✅ |
Numerical + Categorical | Binary classification | ✅ | ❌ | ✅ |
Numerical | Multiclass classification | ✅ | ❌ | ✅ |
Categorical | Multiclass classification | ✅ | ❌ | ✅ |
Numerical + Categorical | Multiclass classification | ✅ | ❌ | ✅ |
Numerical | Regression | ✅ | ❌ | ❌³ |
Categorical | Regression | ✅ | ❌ | ❌³ |
Numerical + Categorical | Regression | ✅ | ❌ | ❌³ |
SVM+¹²: SVM + one-vs-one/one-vs-all + dummy variables
In many practical, business cases, some variables for some observations might miss a value. Formally, xj∈Xj∪{∅}. ∅ is the empty set
Examples: (extended carousel)
Trees and SVM cannot work!
Solutions:
You are in the line 🚶🚶♂️🚶♀️🚶🚶🚶♀️🚶♂️🚶🚶♀️ at the cinema 🏪.
The ticket 🎟 of the person before you in the line falls on the ground.
The person has long hair.
Do you say "excuse me, sir" 🧔♀️ or "excuse me, madam" 👩?
More formally:
According to your life, you have collected some knowledge, that you can express as probabilities:
where Pr(A∣B) is the conditional probability, i.e., the probability that, given that the event B occurred, the event A occurs
Do you say "excuse me, sir" 🧔♀️ or "excuse me, madam" 👩?
So, we want to know Pr(p=man∣h=long) and Pr(p=woman∣h=long), or maybe just if:
But we know Pr(h=long∣p=man), not Pr(p=man∣h=long)...
In general, Pr(A∣B) ≠ Pr(B∣A).
Pr(A)Pr(B∣A)=Pr(A,B)=Pr(B)Pr(A∣B)
where Pr(A,B) is the probability that both A and B occur.
Pr(B∣A) = Pr(B)·Pr(A∣B) / Pr(A)
What we know:
What we compute:
We do not really need to know Pr(long)!
but it could be computed, in some cases
You are in the line 🚶🚶♂️🚶♀️🚶🚶🚶♀️🚶♂️🚶🚶♀️ at the stadium 🏟.
The ticket 🎟 of the person before you in the line falls on the ground.
The person has long hair.
What we know:
What we compute:
Different natural probability of a person at the stadium being a man!
Pr(event∣evidence) = Pr(event) · Pr(evidence∣event) / Pr(evidence)
Assume classification with categorical indep. vars:
Hence: Pr(y=y_m ∣ x=(x_{1,l1},…,x_{p,lp})) = Pr(y=y_m) · Pr(x=(x_{1,l1},…,x_{p,lp}) ∣ y=y_m) / Pr(x=(x_{1,l1},…,x_{p,lp})) or, more briefly: p(y_m∣x_{1,l1},…,x_{p,lp}) = p(y_m) · p(x_{1,l1},…,x_{p,lp}∣y_m) / p(x_{1,l1},…,x_{p,lp})
p(y_m∣x_{1,l1},…,x_{p,lp}) = p(y_m) · p(x_{1,l1},…,x_{p,lp}∣y_m) / p(x_{1,l1},…,x_{p,lp})
What do we need for predicting y from a x?
Where to find them?
💡: in the learning data D!
Let's make the naive hypothesis that the independent variables are independent¹ of each other given the class: then p(x_{1,l1},…,x_{p,lp}∣y_m) = p(x_{1,l1}∣y_m) ⋯ p(x_{p,lp}∣y_m), and p(y_m∣x_{1,l1},…,x_{p,lp}) = p(y_m) · p(x_{1,l1},…,x_{p,lp}∣y_m) / p(x_{1,l1},…,x_{p,lp}) becomes:
p(y_m∣x_{1,l1},…,x_{p,lp}) = p(y_m) · p(x_{1,l1}∣y_m) ⋯ p(x_{p,lp}∣y_m) / p(x_{1,l1},…,x_{p,lp})
Where to find them?
💡: in the learning data D!
The technique based on the independency hypothesis is called Naive Bayes:
Learning:
function learn({(x(i),y(i))}i=1i=n) {
p←∅
for m∈{1,…,∣Y∣} { //∣Y∣=k
p_m ← (1/n) ∑_i 1(y^(i)=y_m)
for j∈{1,…,p} {
for l∈{1,…,∣Xj∣} { //∣Xj∣=hj
p_{m,j,l} ← ∑_i 1(y^(i)=y_m ∧ x_j^(i)=x_{j,l}) / ∑_i 1(y^(i)=y_m)
}
}
}
return p
}
The model p is some data structure holding k + ∑_{j=1}^{j=p} k·h_j numbers, i.e., p ∈ [0,1]^(k+∑_j k·h_j).
Prediction:
function predict(x,p) { //x=(xl1,…,xlp)
m⋆ ← argmax_{m∈{1,…,∣Y∣}} p_m ∏_{j=1}^{j=p} p_{m,j,l_j}
return y_{m⋆}
}
Or, with probability:
function predict-with-prob(x,p) {
return y_m ↦ ( p_m ∏_{j=1}^{j=p} p_{m,j,l_j} ) / ( ∑_{m′=1}^{m′=∣Y∣} p_{m′} ∏_{j=1}^{j=p} p_{m′,j,l_j} )
}
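In R, this pair of functions is provided by e1071 (a minimal sketch; note that e1071's naiveBayes also accepts numerical variables, for which it uses Gaussian densities instead of the categorical probabilities above):

library(e1071)

m <- naiveBayes(Species ~ ., iris)     # estimates the p_m and p_{m,j,l}
predict(m, iris[1, ])                  # y-hat (predict)
predict(m, iris[1, ], type = "raw")    # the probabilities (predict-with-prob)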
Efficiency 👍👍👍
Explainability/interpretability 👍👍
Effectiveness 🫳
Applicability 🫳
Given a point on the map, guess its province.
More formally:
Tentative explanation of your reasoning, given a point x on the map:
More in general, in prediction, given a learning set D:
This is the k-Nearest Neighbors learning technique!
Learning:
function learn({(x(i),y(i))}i,k,d) {
return ({(x(i),y(i))}i,k,d)
}
flearn′ does nothing!
The model is the dataset D
k and d are parameters!
Prediction:
function predict(x,({(x(i),y(i))}i,k,d)) {
s←0 //0∈Rn
for i∈{1,…,n} {
si←d(x,x(i))
}
I←∅ //the neighborhood
while ∣I∣≤k {
I←I∪{argmini∈{1,…,n}∖Isi}
}
return argmaxy∈Y∑i∈I1(y(i)=y) //most frequent
}
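In R, kNN prediction is provided by the class package (a sketch; the learning/test split and k are our arbitrary choices; d is the Euclidean distance):

library(class)

learn.i <- c(1:40, 51:90, 101:140)          # an arbitrary learning/test split
test.i <- setdiff(1:150, learn.i)
y.hat <- knn(iris[learn.i, 1:4], iris[test.i, 1:4], iris$Species[learn.i], k = 5)
mean(y.hat == iris$Species[test.i])         # accuracy on the test set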
Alternatives:
By using a proper distance d:X×X→R, kNN can be used on any X! (applicability 👍👍👍)
Common cases: there is a large literature on distances
Choose one that helps to capture the dependency of y on x!
images from James, Gareth, et al.; An introduction to statistical learning. Vol. 112. New York: springer, 2013
Yes, it is a flexibility parameter: link with the Bayes classifier!
Efficiency 🫳
Explainability/interpretability 👍
Effectiveness 🫳
Applicability 👍
Consider the DataCo Smart Supply Chain for Big Data Analysis dataset
given the objective of classifying if an order is marked as late delivery, design and implement an ML procedure which answers the question: what is the best classification technique?
given the objective of predicting the sales of each order, design and implement an ML procedure which answers the question: what is the best regression technique?
consider the ML techniques seen during the lectures
Hints:
1: designed by Gaia Saveri, tutor A.Y. 2023/2024
Machine Learning is the science of getting computers to learn without being explicitly programmed.
↓
Supervised (Machine) Learning is the science of getting computers to learn f:X→Y from examples autonomously.
↓
Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.
What's a pattern?
In practice:
Supervised (Machine) Learning is the science of getting computers to learn f:X→Y from examples autonomously.
Unsupervised (Machine) Learning is the science of getting computers to learn patterns from data autonomously.
Key differences
In most of the cases, the pattern one is looking for is grouping:
This form of unsupervised learning is called clustering:
Given a dataset D∈P∗(X), find a partitioning {D1,…,Dk} of D such that the elements in each Di are "close together".
Is this a formal and complete definition? No!
In practice:
In principle, clustering looks like a (biobjective) optimization problem (given D∈P∗(X), k∈{1,…,∣D∣}, and d:X×X→R+):
max_{D1,…,Dk} ( ∑_{i,i′: i≠i′} ∑_{x∈Di, x′∈Di′} d(x,x′) − ∑_i ∑_{x,x′∈Di} d(x,x′) )
subject to D1∪⋯∪Dk = D
Di∩Di′ = ∅, ∀i,i′∈{1,…,k}, i≠i′
For any k,d, there exists (at least) one optimal solution. In principle, to find it you can just try all the partitions and measure the distance.
In practice:
Here D and each Di are bags, not sets. A partition on a bag is better defined if you define a bag as a m:A→N, where A is a set and m(a) is the multiplicity of a∈A in the bag. However, for clustering we can reason on sets, because in practice pairs x,x should always end up being in the same cluster.
If you assume to know k and d, a clustering method:
But in practice you don't know k...
Can we just optimize also k? That is, can we solve the optimization problem for every k and take the best?
max_{k,D1,…,Dk} ( ∑_{i,i′: i≠i′} ∑_{x∈Di, x′∈Di′} d(x,x′) − ∑_i ∑_{x,x′∈Di} d(x,x′) )
subject to D1∪⋯∪Dk = D
Di∩Di′ = ∅, ∀i,i′∈{1,…,k}, i≠i′
If you also optimize k, then the optimal solution is the one with k=∣D∣...
Can we just optimize also k? No! It's pointless.
Extreme cases:
How do you evaluate a partitioning of D in practice?
Question: is manual inspection intrinsic or extrinsic?
There are many of them; most are based on the idea of measuring separateness or density of clustering.
Silhouette index: it considers, for each observation, the average distance to the observations in the same cluster and the min distance to the observations in other clusters: s̄({Di}_{i=1}^{i=k}) = (1/∣⋃i Di∣) ∑_{x∈⋃i Di} ( d_out(x,{Di}i) − d_in(x,{Di}i) ) / max( d_out(x,{Di}i), d_in(x,{Di}i) ) where:
d_out(x,{Di}i) = min_{Di: x∉Di} min_{x′∈Di} d(x,x′)
d_in(x,{Di}i) = (1/(∣Di∋x∣−1)) ∑_{x′∈Di∋x, x′≠x} d(x,x′), with Di∋x denoting the cluster containing x
s̄(⋅)∈[−1,1]: the larger (closer to 1), the better (i.e., the more separated the clusters).
A similar index is the Dunn index.
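In R, the silhouette index is provided by the cluster package (a sketch; the partition to be evaluated here comes from kmeans, chosen arbitrarily):

library(cluster)

km <- kmeans(iris[, 1:4], centers = 3)
sil <- silhouette(km$cluster, dist(iris[, 1:4]))
mean(sil[, "sil_width"])   # the s-bar index of the partition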
s̄({Di}i) = 0.78
Questions:
s̄({Di}i) = 0.74
In practice:
Hierarchical clustering is an iterative method that exists in two versions (agglomerative and divisive). For both:
That is, partitions are refined by merging (in agglomerative hierarchical clustering) or by division (in divisive hierarchical clustering).
Moreover, since the partition is refined over iterations, a hierarchy among clusters is established:
We'll see just the agglomerative version.
function cluster({x(i)}i=1i=n) {
j←0
Dj←{{x(1)},…,{x(n)}}
while ∣Dj∣>1 {
(i⋆,i′⋆) ← argmin_{i,i′∈{1,…,∣D∣} ∧ i≠i′} dcluster(Dj,i, Dj,i′)
Dj+1 ← Dj ⊕ (Dj,i⋆ ∪ Dj,i′⋆) ⊖ Dj,i⋆ ⊖ Dj,i′⋆
j←j+1
}
return Dj
}
At each iteration:
There exist a few options for dcluster:P∗(X)×P∗(X)→R+. All are based on a (dis)similarity metric d defined over observations, i.e., d:X×X→R+.
where c(D) = x̄ = (1/∣D∣) ∑_{x∈D} x and x̄ is the centroid of D.
Question: what's the efficiency of the 4 dcluster?
Input: D={1,2,3,6,7,9,11,12,15,18}
Execution¹:
j | Dj |
---|---|
0 | {1},{2},{3},{6},{7},{9},{11},{12},{15},{18} |
1 | {1,2},{3},{6},{7},{9},{11},{12},{15},{18} |
2 | {1,2,3},{6},{7},{9},{11},{12},{15},{18} |
3 | {1,2,3},{6,7},{9},{11},{12},{15},{18} |
4 | {1,2,3},{6,7},{9},{11,12},{15},{18} |
5 | {1,2,3},{6,7,9},{11,12},{15},{18} |
6 | {1,2,3},{6,7,9,11,12},{15},{18} |
7 | {1,2,3,6,7,9,11,12},{15},{18} |
8 | {1,2,3,6,7,9,11,12,15},{18} |
9 | {1,2,3,6,7,9,11,12,15,18} |
function cluster({xi}i=1i=n) {
j←0
Dj←{{x1},…,{xn}}
while ∣Dj∣>1 {
(i⋆,i′⋆) ← argmin_{i,i′∈{1,…,∣D∣} ∧ i≠i′} dcluster(Dj,i, Dj,i′)
Dj+1 ← Dj ⊕ (Dj,i⋆ ∪ Dj,i′⋆) ⊖ Dj,i⋆ ⊖ Dj,i′⋆
j←j+1
}
return Dj
}
Assume single linkage:
The output, i.e., the partition of D, is D9: the hierarchy is the entire sequence D9,…,D0.
The hierarchy {Dj}j, not just the partition Dn−1, can be visualized in the form of a dendrogram where:
Question: what dcluster is being used here?
By looking at the dendrogram, one can choose an appropriate k, or simply look at the dendrogram as the pattern.
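The example above can be reproduced with the built-in hclust() (a minimal sketch):

x <- c(1, 2, 3, 6, 7, 9, 11, 12, 15, 18)
hc <- hclust(dist(x), method = "single")   # single linkage
plot(hc)                                   # the dendrogram
cutree(hc, k = 3)                          # cut the hierarchy into k clusters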
Consider the optimization problem behind clustering and the following heuristic¹ for solving it:
function cluster({x(i)}i=1i=n,k) {
for h∈{1,…,k} { //set initial centroids
μh ← x^(i), with i ∼ U({1,…,n})
}
D←assign({x(i)}i=1i=n,{μh}h=1h=k)
while ¬should-stop() {
for h∈{1,…,k} { //recompute centroids
μh ← (1/∣Dh∣) ∑_{x∈Dh} x
}
D′←assign({x(i)}i=1i=n,{μh}h=1h=k)
if D′=D {
break
}
D←D′
}
return D
}
function assign({x(i)}i=1i=n,{μh}h=1h=k) {
D←{∅,…,∅} //k empty sets
for i∈{1,…,n} {
h⋆ ← argmin_{h∈{1,…,k}} d(x^(i), μh)
Dh⋆←Dh⋆∪{x(i)} //assign to the closest centroid
}
return D
}
Input: D={1,2,3,6,7,9,11,12,15,18}, k=3
Execution (one initial random assignment):
Dj | μ1 | μ2 | μ3 |
---|---|---|---|
{1,2,3,6,7,9,11,12,15,18} | 1 | 11 | 15 |
{1,2,3,6,7,9,11,12,15,18} | 3 | 9.8 | 16.5 |
{1,2,3,6,7,9,11,12,15,18} | 3 | 9.8 | 16.5 |
Execution (another initial random assignment):
Dj | μ1 | μ2 | μ3 |
---|---|---|---|
{1,2,3,6,7,9,11,12,15,18} | 1 | 2 | 3 |
{1,2,3,6,7,9,11,12,15,18} | 1 | 2 | 10.1 |
{1,2,3,6,7,9,11,12,15,18} | 1 | 2.5 | 11.1 |
{1,2,3,6,7,9,11,12,15,18} | 1 | 3.7 | 12 |
{1,2,3,6,7,9,11,12,15,18} | 1.5 | 5.3 | 13 |
{1,2,3,6,7,9,11,12,15,18} | 2 | 7.3 | 14 |
{1,2,3,6,7,9,11,12,15,18} | 2 | 7.3 | 14 |
Question: what's the best clustering? Can we answer this question?
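In R, this heuristic is the built-in kmeans(); the dependency on the initial random assignment can be mitigated by repeating the run and keeping the best result (a sketch on the same example input; nstart is our choice):

x <- matrix(c(1, 2, 3, 6, 7, 9, 11, 12, 15, 18), ncol = 1)
km <- kmeans(x, centers = 3, nstart = 10)  # nstart: repeated random initializations
km$cluster                                 # the partition
km$centers                                 # the centroids mu_h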
function cluster({x(i)}i=1i=n,k) {
for h∈{1,…,k} {
μh ← x^(i), with i ∼ U({1,…,n})
}
D←assign({x(i)}i=1i=n,{μh}h=1h=k)
while ¬should-stop() {
for h∈{1,…,k} {
μh ← (1/∣Dh∣) ∑_{x∈Dh} x
}
D′←assign({x(i)}i=1i=n,{μh}h=1h=k)
if D′=D {
break
}
D←D′
}
return D
}
Given two points μ1, μ2, the line which divides the space into points closer to μ1 and points closer to μ2.
Image from Wikipedia
Formally, a piece of text is a variable-length sequence of symbols belonging to an alphabet A. Hence: x∈A∗, where A is usually (in modern times) UTF-16, so it may include emojis:
A dataset X∈P∗(A∗) of texts, possibly with labels, is called corpus. A single text x(i) is called document.
However, what we usually mean with text is natural language, where the sequence of characters is a noisy container of an underlying information:
Natural language is by nature ambiguous!
Given a brand (e.g., Illy, Fiat, Dell, U.S. Triestina Calcio, ...), build a system that tells whether people are talking positively or negatively about the brand on Twitter (or Mastodon).
Given a corpus of letters to/from soldiers fighting during WW1, what are the topics they talk about?
Given a scientific paper p1, what's the relevance of the citation of another paper p2 referenced in p1?
A relevant class of problems is the one in which the goal is to gain insights about the sentiments an author was feeling while authoring a document x. This is called sentiment analysis.
Usually, this problem is cast as a form of supervised learning, where Y contains sentiments.
Variants:
In every case, we can¹ apply classic ML (supervised and unsupervised) techniques if we pre-process text to obtain multivariate observations, possibly in Rp, i.e., we want an ftext-to-vect: A∗→Rp:
Bag-of-words (BOW) is a ftext-to-vect based on the idea of associating one numerical variable with each word in a predefined dictionary.
In practice, given the dictionary (i.e., set of words W∈P(A∗)) and given a document x:
The outcome is a x′∈R∣W∣.
An alternative version is to consider the frequencies instead of the occurrences:
BOW considers slightly different sequences of characters as different words, and hence as different features, because of tokenization. Usually, this is not good.
In practice, you often do some basic pre-processing steps:
Each of these steps is an fpre-proc: A∗→A∗:
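With the tm R package, the four steps look like this (a minimal sketch on two made-up documents):

library(tm)

corpus <- VCorpus(VectorSource(c("I told you! Bad, bad pizza!",
                                 "Great pizza, great place.")))
corpus <- tm_map(corpus, content_transformer(tolower))   # case removal
corpus <- tm_map(corpus, removePunctuation)              # punctuation removal
corpus <- tm_map(corpus, removeWords, stopwords("en"))   # stop-word removal
corpus <- tm_map(corpus, stemDocument)                   # stemming
as.matrix(DocumentTermMatrix(corpus))                    # BOW occurrences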
The 4 common pre-processing steps are not always appropriate. It depends on whether they help modeling the y-x dependency.
Sentiment analysis and punctuation:
Music genre preferences and case: a bit forced...
Instruction level and stemming:
BOW tends to overweigh words which are very frequent, but not relevant (similarly to stop-words) and underweigh words that are relevant, but rare.
Solution: use tf-idf instead of occurrences or frequencies. tf-idf is the product of the term frequency (i.e., the frequency of a word in a document) and the inverse document frequency, which decreases with the frequency, in the corpus, of documents containing that term.
Given the dictionary W, the corpus X, and a document x:
where:
The more common a word in the corpus, the lower its idf (0 if it occurs in every document). The more specific a word to a document, the larger its tf and its idf.
tf-idf corresponds to a ftf-idf-learn:P∗(A∗)→P∗(A∗), which is just the identity¹, and a ftf-idf-apply:A∗×P∗(A∗)→R∣W∣:
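With tm, tf-idf is obtained by changing the weighting of the document-term matrix (a minimal sketch on a made-up corpus):

library(tm)

corpus <- VCorpus(VectorSource(c("good pizza", "bad pizza", "good pasta")))
dtm <- DocumentTermMatrix(corpus,
                          control = list(weighting = weightTfIdf))
as.matrix(dtm)   # terms occurring in many documents get lower weights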
With BOW, p=∣W∣, which might be very large.
Common approaches:
The order of words in W does matter, so it's W∈(A∗)∗, rather than W∈P(A∗).
Both BOW and tf-idf ignore word ordering. But ordering is fundamental in natural language.
Example: (sentiment classification for restaurant reviews)
Most common solutions:
Instead of considering word frequencies (or occurrences, or tf-idf), consider the frequencies of short sequences of up-to n words (tokens, or characters in general), i.e., of ngrams.
Example: (with n=3 and aggressive stop-word removal)
Since p may become very very large, dimensionality reduction becomes very important.
A technique belonging to Natural Language Processing methods that assigns a grammatical role to each word in a document. Roles can then be used to augment the text-to-num transformation.
Build a system that:
The system uses a dashboard to show its findings. you don't need to build the dashboard here, but imagining it and its usage can facilitate the design of the system
Hints:
tm for doing text mining (tokenization, punctuation, stop-words, stemming, ...)
e1071, randomForest for supervised learning
kmeans(), hclust() for clustering