Lab 2: most effective ML algorithm¶
import io
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Get the data¶
The dataset is available online; you can find information about its content and download it from this page: https://data.mendeley.com/datasets/8gx2fvg2k6/5
The go-to library for reading, writing, and managing datasets is pandas; in this cell we will use it to read a file stored as .csv, using the function read_csv.
This will return a pandas DataFrame, which is the data structure used by this library to store tabular datasets.
If you use Google Colab, there are two ways in which you can read an external file, based on where you want to store it:
local file (i.e. stored on your machine):
import pandas as pd
from google.colab import files
uploaded = files.upload()
data = pd.read_csv(io.BytesIO(uploaded['file-name.csv']), encoding='latin-1')
file stored on Google Drive (i.e. remote storage):
import pandas as pd
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/MyDrive
data = pd.read_csv('file-name.csv')
If instead you are using your own machine to carry out the lab activity, just:
import pandas as pd
data = pd.read_csv('file-name.csv')
from google.colab import drive
drive.mount('/content/drive/')
%cd /content/drive/MyDrive/
(many of the errors raised by the function read_csv can be fixed by changing the encoding type... look carefully at the error message!)
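For instance, a minimal sketch of falling back to a different encoding (the file name is a placeholder):
import pandas as pd
# pandas defaults to utf-8; fall back to latin-1 if decoding fails
try:
    data = pd.read_csv('file-name.csv')
except UnicodeDecodeError:
    data = pd.read_csv('file-name.csv', encoding='latin-1')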
import pandas as pd
data = pd.read_csv('DataCoSupplyChainDataset.csv', encoding='latin-1')
Get info about the data¶
The function info(), applied to a DataFrame, gives basic information about the dataset: its shape, its column names with their types and number of missing entries, and the memory it requires.
print('Basic Information About the dataset: ')
data.info()
Take a subset of the data¶
From the previous cell, we see that our dataset has $180519$ rows... a lot!
Since the goal of this lab activity is not that of building a deployable model for the real world, but just that of wrapping up the content of the course, we reduce the number of rows to avoid enormous execution times.
Caveat: even if your machine / Colab session is endowed with GPU acceleration, it is not automatically exploited!
There are many ways to reduce the size of a dataset; in what follows we will just select the ten most represented countries and three departments of our choice (and we will also drop a couple of columns).
# build a full-name column from first and last name
data['Customer Full Name'] = data['Customer Fname'].astype(str) + data['Customer Lname'].astype(str)
# drop columns that are uninformative for our tasks
data = data.drop(['Customer Email','Product Status','Customer Password','Customer Street','Customer Fname','Customer Lname',
                  'Latitude','Longitude','Product Description','Product Image','Order Zipcode','shipping date (DateOrders)'], axis=1)
# keep only the ten most represented countries
most_frequent_countries = data['Order Country'].value_counts()[:10].keys()
data = data[data['Order Country'].isin(most_frequent_countries)]
# keep only three departments of our choice
selected_categories = ['Outdoors', 'Fitness', 'Technology']
data = data[data['Department Name'].isin(selected_categories)]
Have a glance at the top rows of the dataset with the function head:
data.head()
Check if and how many cells of the dataset are empty (missing values):
data.isnull().sum()
Since our dataset is big and has only one missing value (in the column Customer Zipcode), we decide to drop the entire row (i.e. datapoint) corresponding to the missing data:
data = data.dropna()
A note on categorical features¶
When applying the method info, you might have noticed that several columns are of type object: this means that they are categorical attributes.
In principle this is totally fine; however, in what follows we will use the library scikit-learn for building our classifiers and regressors (arguably the most used library among Python ML practitioners), and most of its machine learning models are designed to work with numerical data. Therefore, you often need to preprocess your categorical features before feeding them into these models.
The following are the most popular (~ effective) ways:
- One-Hot Encoding: convert categorical variables into a binary matrix (0s and 1s), where each category becomes a new column with a binary value (1 or 0). OneHotEncoder in scikit-learn can be used for this purpose.
- Label Encoding: convert each category into a unique integer. LabelEncoder in scikit-learn can be used for this purpose.
The following observations hold:
- Label Encoding is more memory-efficient than One-Hot Encoding, especially when dealing with a large number of categories: it represents categories with integers, which require less memory.
- Label Encoding is sensitive to model assumptions: some machine learning algorithms may interpret the numeric labels as ordinal information, leading to biased results. This can affect the performance of models that assume features are continuous and have a meaningful order.
- One-Hot Encoding can lead to a high-dimensional dataset, especially when dealing with categorical features with a large number of unique values. This can make the dataset sparse and might pose challenges for some models.
It's often a good idea to experiment with both encoding methods and choose the one that works best for your particular use case.
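As a minimal sketch of the difference between the two (on a made-up toy column, not part of our dataset):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

toy = pd.DataFrame({'Shipping Mode': ['Standard', 'First Class', 'Standard', 'Same Day']})
# Label Encoding: one integer per category, the single column is preserved
print(LabelEncoder().fit_transform(toy['Shipping Mode']))  # [2 0 2 1]
# One-Hot Encoding: one binary column per category
# (the parameter is named sparse instead of sparse_output in scikit-learn < 1.2)
onehot = OneHotEncoder(sparse_output=False).fit_transform(toy[['Shipping Mode']])
print(onehot.shape)  # (4, 3): three categories become three columns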
We restate that the goal of this lab is not to build the perfect model, hence we decide to use label encoding to save memory and computational time.
A note on the choice of the techniques¶
This lab activity is meant to recap the content of the course seen so far, as well as to show a possible way to implement the studied algorithms without re-inventing the wheel.
Many possible solutions are right, and no single one is perfect: here we show several possibilities for tuning and training several models. Of course, in real-world scenarios (and likely for the final project), they should be combined based on your needs and/or background knowledge.
Classification¶
Given the above dataset, we solve the following two classification problems:
- binary classification: predict whether an order will be marked as late delivery;
- multiclass classification: predict the Delivery Status of each point in the dataset.
Binary Classification¶
# import preprocessing algorithms
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# import classification algorithms
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
# import evaluation algorithms
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, f1_score, precision_recall_curve, roc_curve, auc
class_data = data.copy()
Let's begin by preparing the dataset: from the column Delivery Status, we get a binary label by setting to $1$ the cells containing the status Late delivery, and to $0$ all the others.
We use the library numpy for this operation:
# create target column
# np.where (condition, value_if_true, value_if_false)
class_data['late_delivery'] = np.where(class_data['Delivery Status'] == 'Late delivery', 1, 0)
class_data.drop(['Delivery Status'], axis=1, inplace=True)
class_data['late_delivery'].value_counts()
Encode categorical columns:
categorical_columns = class_data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
# apply refits the encoder on each column, so every column gets its own integer mapping
class_data[list(categorical_columns)] = class_data[list(categorical_columns)].apply(label_encoder.fit_transform)
# class_data.dtypes
Apply some pre-processing to make the classification interesting (i.e. non-trivial), by dropping columns which are highly correlated with our target:
class_data.info()
sns.heatmap(class_data[['late_delivery', 'Late_delivery_risk']].corr(), annot=True)
plt.tight_layout()
plt.show()
class_data.drop(['Late_delivery_risk'], axis=1, inplace=True)
Explicitly separate features and target columns:
x_class = class_data.loc[:, class_data.columns != 'late_delivery']
y_class = class_data['late_delivery']
x_class.dtypes
Split the data statically into train and test sets using the dedicated scikit-learn function train_test_split; here we decided to keep $20$% of the data for testing:
x_class_train, x_class_test, y_class_train, y_class_test = train_test_split(x_class, y_class, test_size=0.2, random_state=26)
Normalize the data to scale and standardize the features of your dataset so that they have a consistent range.
Common methods for normalization include:
- Min-Max Scaling: scales the data to a specific range, typically [0, 1].
- Standardization: transforms the data to have a mean of 0 and a standard deviation of 1.
Here we use the latter, via the scikit-learn class StandardScaler:
std_scaler = StandardScaler()
x_class_train = std_scaler.fit_transform(x_class_train)
# transform the test set with the statistics learned on the training set
x_class_test = std_scaler.transform(x_class_test)
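For reference, here is a tiny sketch of the min-max alternative (not used below) on made-up values:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

toy = np.array([[1.0], [5.0], [9.0]])  # made-up feature column
print(MinMaxScaler().fit_transform(toy))  # [[0. ] [0.5] [1. ]]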
Comparing binary classification algorithms¶
We will use the following metrics to compare the results of several binary classification algorithms (a tiny illustration on made-up scores follows the list):
- Receiver Operating Characteristic (ROC) curve: assesses and visualizes the trade-off between the true positive rate and the false positive rate as the discrimination threshold varies. It will be displayed with its Area Under the Curve (AUC): a higher AUC indicates better model performance, with a perfect classifier having an AUC of 1 and a random classifier an AUC of 0.5. scikit-learn provides the function roc_curve, which takes as input the vector of ground-truth classes and the probability estimates of the positive class; the latter can be obtained via the method predict_proba of each classifier model (taking care to select the column corresponding to the positive class). roc_curve gives as output the false positive rate and the true positive rate, which can then be fed to the function auc to get the AUC.
- Precision-Recall (PR) curve: another graphical tool used to assess the performance of a binary classification model, which focuses on the trade-off between precision and recall at different classification thresholds. It will be displayed with its Area Under the Curve (AUC-PR): a higher AUC-PR indicates better model performance, with a perfect classifier having an AUC-PR of 1. scikit-learn provides the function precision_recall_curve, taking the same inputs as roc_curve. precision_recall_curve gives as output the precision and the recall, which can then be fed to the function auc to get the AUC-PR.
- Accuracy: the ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances in the dataset. In scikit-learn it can be computed with the function accuracy_score, taking as input the true target and the predicted one.
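To make these functions concrete, here is a tiny end-to-end illustration on made-up scores (the numbers are invented for the example):
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # hypothetical positive-class probabilities
fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr))  # ROC AUC: 0.75
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))  # AUC-PR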
def compare_binary_classification(names, prec_rec_values, prec_rec_auc, roc_values, roc_auc, acc_values):
    sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
    cmap = sns.color_palette('Paired')[1::2]
    fig, axs = plt.subplots(1, 3, figsize=(12, 5))
    for i in range(len(names)):
        axs[0].plot(roc_values[i][0], roc_values[i][1], label='{} AUC {:.3f}'.format(names[i], roc_auc[i]), c=cmap[i])
        axs[1].plot(prec_rec_values[i][1], prec_rec_values[i][0], label='{} AUC-PR {:.3f}'.format(names[i], prec_rec_auc[i]), c=cmap[i])
    axs[0].legend(loc='lower right')
    axs[0].set_xlabel('FPR')
    axs[0].set_ylabel('TPR')
    axs[0].set_title('ROC Curve')
    axs[1].legend(loc='lower left')
    axs[1].set_xlabel('Recall')
    axs[1].set_ylabel('Precision')
    axs[1].set_title('PR Curve')
    bars = axs[2].bar(x=np.arange(len(names)), height=acc_values, color=cmap)
    axs[2].set_xticks(np.arange(len(names)))
    axs[2].set_xticklabels(names)
    for bar in bars:
        yval = bar.get_height()
        # annotate each bar explicitly on the third axes
        axs[2].text(bar.get_x(), yval + .005, '{:.4f}'.format(yval))
    axs[2].set_ylabel('Accuracy')
    axs[2].set_title('Accuracy Bar Plots')
    plt.tight_layout()
    plt.show()
# useful containers
model_name_class, prec_rec_class, pr_auc_class, fpr_tpr_class, roc_auc_class, acc_class = [[] for _ in range(6)]
A note on sklearn models¶
Most scikit-learn models (both for regression and classification) share the same structure: after instantiating a model, the method fit takes as input the training dataset and the training target, and trains the model. The fitted model then produces predictions with the method predict, which takes as input the test dataset.
We will use the following binary classification algorithms:
- Random Forest (RF): RandomForestClassifier
# random forest classifier
model_name_class.append('RF')
rf_classifier = RandomForestClassifier()
rf_classifier.fit(x_class_train, y_class_train)
y_rf_class = rf_classifier.predict(x_class_test)
rf_class_accuracy = accuracy_score(y_rf_class, y_class_test)
acc_class.append(rf_class_accuracy)
rf_probs = rf_classifier.predict_proba(x_class_test)[:, 1]
rf_precision, rf_recall, _ = precision_recall_curve(y_class_test, rf_probs)
rf_auc = auc(rf_recall, rf_precision)
prec_rec_class.append([rf_precision, rf_recall])
pr_auc_class.append(rf_auc)
rf_fpr, rf_tpr, _ = roc_curve(y_class_test, rf_probs)
rf_roc_auc = auc(rf_fpr, rf_tpr)
fpr_tpr_class.append([rf_fpr, rf_tpr])
roc_auc_class.append(rf_roc_auc)
- Support Vector Machines (SVM): SVC
# support vector machine
model_name_class.append('SVM')
sv_classifier = SVC(probability=True)
sv_classifier.fit(x_class_train, y_class_train)
y_sv_class = sv_classifier.predict(x_class_test)
sv_class_accuracy = accuracy_score(y_sv_class, y_class_test)
acc_class.append(sv_class_accuracy)
sv_probs = sv_classifier.predict_proba(x_class_test)[:, 1]
sv_precision, sv_recall, _ = precision_recall_curve(y_class_test, sv_probs)
sv_auc = auc(sv_recall, sv_precision)
prec_rec_class.append([sv_precision, sv_recall])
pr_auc_class.append(sv_auc)
sv_fpr, sv_tpr, _ = roc_curve(y_class_test, sv_probs)
fpr_tpr_class.append([sv_fpr, sv_tpr])
sv_roc_auc = auc(sv_fpr, sv_tpr)
roc_auc_class.append(sv_roc_auc)
- Naive Bayes (NB): BernoulliNB (the Bernoulli Naive Bayes algorithm is a variant of Naive Bayes specifically designed for binary data, where each feature represents the presence or absence of a particular characteristic)
# naive bayes
model_name_class.append('NB')
nb_classifier = BernoulliNB()
nb_classifier.fit(x_class_train, y_class_train)
y_nb_class = nb_classifier.predict(x_class_test)
nb_class_accuracy = accuracy_score(y_nb_class, y_class_test)
acc_class.append(nb_class_accuracy)
nb_probs = nb_classifier.predict_proba(x_class_test)[:, 1]
nb_precision, nb_recall, _ = precision_recall_curve(y_class_test, nb_probs)
nb_auc = auc(nb_recall, nb_precision)
prec_rec_class.append([nb_precision, nb_recall])
pr_auc_class.append(nb_auc)
nb_fpr, nb_tpr, _ = roc_curve(y_class_test, nb_probs)
nb_roc_auc = auc(nb_fpr, nb_tpr)
fpr_tpr_class.append([nb_fpr, nb_tpr])
roc_auc_class.append(nb_roc_auc)
- K-Nearest Neighbors (KNN): KNeighborsClassifier
# k-nearest neighbors
model_name_class.append('KNN')
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(x_class_train, y_class_train)
y_knn_class = knn_classifier.predict(x_class_test)
knn_class_accuracy = accuracy_score(y_knn_class, y_class_test)
acc_class.append(knn_class_accuracy)
knn_probs = knn_classifier.predict_proba(x_class_test)[:, 1]
knn_precision, knn_recall, _ = precision_recall_curve(y_class_test, knn_probs)
knn_auc = auc(knn_recall, knn_precision)
prec_rec_class.append([knn_precision, knn_recall])
pr_auc_class.append(knn_auc)
knn_fpr, knn_tpr, _ = roc_curve(y_class_test, knn_probs)
knn_roc_auc = auc(knn_fpr, knn_tpr)
fpr_tpr_class.append([knn_fpr, knn_tpr])
roc_auc_class.append(knn_roc_auc)
- Dummy Classifier (DUMMY): DummyClassifier (as a baseline)
# dummy classifier
model_name_class.append('DUMMY')
dummy_classifier = DummyClassifier()
dummy_classifier.fit(x_class_train, y_class_train)
y_dummy_class = dummy_classifier.predict(x_class_test)
dummy_class_accuracy = accuracy_score(y_dummy_class, y_class_test)
acc_class.append(dummy_class_accuracy)
dummy_probs = dummy_classifier.predict_proba(x_class_test)[:, 1]
dummy_precision, dummy_recall, _ = precision_recall_curve(y_class_test, dummy_probs)
dummy_auc = auc(dummy_recall, dummy_precision)
prec_rec_class.append([dummy_precision, dummy_recall])
pr_auc_class.append(dummy_auc)
dummy_fpr, dummy_tpr, _ = roc_curve(y_class_test, dummy_probs)
dummy_roc_auc = auc(dummy_fpr, dummy_tpr)
fpr_tpr_class.append([dummy_fpr, dummy_tpr])
roc_auc_class.append(dummy_roc_auc)
Finally compare the collected results:
compare_binary_classification(model_name_class, prec_rec_class, pr_auc_class, fpr_tpr_class, roc_auc_class, acc_class)
Multi-class classification¶
Hyperparameter grid search¶
In the following experiments, we will show how to perform hyperparameter tuning using grid search on salient parameters of each algorithm. We do so by specifying the hyperparameters to tune and a range of values or discrete choices for each hyperparameter; this creates a grid of hyperparameter combinations to explore.
In particular, we will plot the process of hyperparameter tuning with error bars: for each hyperparameter configuration, we repeat the experiment $n$ times and show the mean and the standard deviation of the results (in this case, in terms of accuracy).
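As a side note, scikit-learn can automate such a search with GridSearchCV; below is a minimal sketch of what it would look like for the random forest grid used in this section (we instead loop manually, so that we can collect the statistics needed for the error bars):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100, 150, 200, 250]}  # same grid as below
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='accuracy')
# after grid.fit(x_train, y_train), inspect grid.best_params_ and grid.best_score_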
# import evaluation metric
from sklearn.metrics import ConfusionMatrixDisplay
# import classification algorithms
from sklearn.naive_bayes import MultinomialNB
Again we pre-process the data by dividing the features and the target (preserving the label encoding performed at the previous point):
x_mclass = x_class.copy()
y_mclass = pd.Series(label_encoder.fit_transform(data['Delivery Status'].copy()))
class_names = label_encoder.classes_
first_space = [s.find(' ') for s in class_names]
class_names = [s[:first_space[i]] + '\n' + s[first_space[i]: ] for i,s in enumerate(class_names)]
y_mclass.value_counts()
Train and test split:
x_mclass_train, x_mclass_test, y_mclass_train, y_mclass_test = train_test_split(x_mclass, y_mclass, test_size=0.2, random_state=26)
x_mclass_train = std_scaler.fit_transform(x_mclass_train)
# again, transform the test set with the training-set statistics
x_mclass_test = std_scaler.transform(x_mclass_test)
We will use the following metrics to compare the results of several multiclass classification algorithms:
- Accuracy (as before);
- Confusion Matrix: a table that describes the performance of a classification model on a set of test data for which the true class labels are known. In the scikit-learn convention, each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class. The confusion matrix is particularly useful for understanding the types of errors made by the classifier. It is computed and plotted by scikit-learn with ConfusionMatrixDisplay, which takes as input the trained classifier, the test features, the test targets, and optionally the class labels.
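A tiny illustration of this convention on made-up labels:
from sklearn.metrics import confusion_matrix

# rows = actual class, columns = predicted class
print(confusion_matrix([0, 1, 2, 2], [0, 2, 2, 2]))
# [[1 0 0]
#  [0 0 1]
#  [0 0 2]]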
def compare_multiclass_classification(names, classifiers, colormaps, x_test, y_test, acc_values, class_labels):
    fig, axs = plt.subplots(len(names)+1, 1, figsize=(7, 20))
    bars = axs[0].bar(x=np.arange(len(names)), height=acc_values, color=[c(0.5) for c in colormaps])
    axs[0].set_xticks(np.arange(len(names)))
    axs[0].set_xticklabels(names)
    for bar in bars:
        yval = bar.get_height()
        axs[0].text(bar.get_x() + 0.25, yval + .005, '{:.4f}'.format(yval))
    axs[0].set_ylabel('Accuracy')
    axs[0].set_title('Accuracy Bar Plots')
    for i in range(1, len(names)+1):
        # use the function arguments, not the global variables
        ConfusionMatrixDisplay.from_estimator(classifiers[i-1], x_test, y_test,
                                              display_labels=class_labels, cmap=colormaps[i-1], normalize='true', colorbar=False, ax=axs[i])
        if i != len(names):
            axs[i].set_xticklabels([])
            axs[i].set_xlabel('')
        axs[i].set_title('{} Confusion Matrix'.format(names[i-1]))
    plt.tight_layout()
    plt.show()
# containers for storing results
model_name_mclass, acc_mclass, classifier_mclass, cmaps = [[] for _ in range(4)]
We will use the following multiclass classification algorithms:
- Random Forest (RF): RandomForestClassifier, tuning the number of trees that form the ensemble (n_estimators)
# random forest
model_name_mclass.append('RF')
n_estimators = [50, 100, 150, 200, 250]
rf_acc_mean, rf_acc_std = [[], []]
for n_est in n_estimators:
    current_acc = []
    for _ in range(10):
        rf_mclassifier = RandomForestClassifier(n_estimators=n_est)
        rf_mclassifier.fit(x_mclass_train, y_mclass_train)
        y_rf_mclass = rf_mclassifier.predict(x_mclass_test)
        rf_mclass_accuracy = accuracy_score(y_rf_mclass, y_mclass_test)
        current_acc.append(rf_mclass_accuracy)
    current_acc = np.array(current_acc)
    rf_acc_mean.append(np.mean(current_acc))
    rf_acc_std.append(np.std(current_acc))
rf_acc_mean = np.array(rf_acc_mean)
rf_acc_std = np.array(rf_acc_std)
rf_best_idx = np.argmax(rf_acc_mean)
rf_mclassifier = RandomForestClassifier(n_estimators=n_estimators[rf_best_idx])
rf_mclassifier.fit(x_mclass_train, y_mclass_train)
y_rf_mclass = rf_mclassifier.predict(x_mclass_test)
rf_mclass_accuracy = accuracy_score(y_rf_mclass, y_mclass_test)
acc_mclass.append(rf_mclass_accuracy)
classifier_mclass.append(rf_mclassifier)
cmaps.append(plt.cm.Blues)
plt.errorbar(x=np.arange(len(n_estimators)), y=rf_acc_mean, yerr=rf_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Number of Estimators')
plt.xticks(np.arange(len(n_estimators)), labels=n_estimators)
plt.tight_layout()
plt.show()
- Support Vector Machine (SVM): SVC, tuning the type of kernel used (kernel)
# support vector machine
model_name_mclass.append('SVM')
kernel_type = ['linear', 'poly', 'rbf', 'sigmoid']
sv_acc_mean, sv_acc_std = [[], []]
for k_type in kernel_type:
    current_acc = []
    for _ in range(10):
        sv_mclassifier = SVC(kernel=k_type)
        sv_mclassifier.fit(x_mclass_train, y_mclass_train)
        y_sv_mclass = sv_mclassifier.predict(x_mclass_test)
        sv_mclass_accuracy = accuracy_score(y_sv_mclass, y_mclass_test)
        current_acc.append(sv_mclass_accuracy)
    current_acc = np.array(current_acc)
    sv_acc_mean.append(np.mean(current_acc))
    sv_acc_std.append(np.std(current_acc))
sv_acc_mean = np.array(sv_acc_mean)
sv_acc_std = np.array(sv_acc_std)
sv_best_idx = np.argmax(sv_acc_mean)
sv_mclassifier = SVC(kernel=kernel_type[sv_best_idx])
sv_mclassifier.fit(x_mclass_train, y_mclass_train)
y_sv_mclass = sv_mclassifier.predict(x_mclass_test)
sv_mclass_accuracy = accuracy_score(y_sv_mclass, y_mclass_test)
acc_mclass.append(sv_mclass_accuracy)
classifier_mclass.append(sv_mclassifier)
cmaps.append(plt.cm.Greens)
plt.errorbar(x=np.arange(len(kernel_type)), y=sv_acc_mean, yerr=sv_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Kernel Type')
plt.xticks(np.arange(len(kernel_type)), labels=kernel_type)
plt.tight_layout()
plt.show()
- Naive Bayes (NB): either BernoulliNB or GaussianNB (the latter modelling the likelihood of each class as a Gaussian distribution)
# naive bayes
model_name_mclass.append('NB')
bayes_type = [GaussianNB(), BernoulliNB()]
nb_acc_mean, nb_acc_std = [[], []]
for nb_type in bayes_type:
    current_acc = []
    for _ in range(10):
        nb_mclassifier = nb_type
        nb_mclassifier.fit(x_mclass_train, y_mclass_train)
        y_nb_mclass = nb_mclassifier.predict(x_mclass_test)
        nb_mclass_accuracy = accuracy_score(y_nb_mclass, y_mclass_test)
        current_acc.append(nb_mclass_accuracy)
    current_acc = np.array(current_acc)
    nb_acc_mean.append(np.mean(current_acc))
    nb_acc_std.append(np.std(current_acc))
nb_acc_mean = np.array(nb_acc_mean)
nb_acc_std = np.array(nb_acc_std)
nb_best_idx = np.argmax(nb_acc_mean)
nb_mclassifier = bayes_type[nb_best_idx]
nb_mclassifier.fit(x_mclass_train, y_mclass_train)
y_nb_mclass = nb_mclassifier.predict(x_mclass_test)
nb_mclass_accuracy = accuracy_score(y_nb_mclass, y_mclass_test)
acc_mclass.append(nb_mclass_accuracy)
classifier_mclass.append(nb_mclassifier)
cmaps.append(plt.cm.Reds)
plt.errorbar(x=np.arange(len(bayes_type)), y=nb_acc_mean, yerr=nb_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Likelihood Distribution')
plt.xticks(np.arange(len(bayes_type)), labels=['Gaussian', 'Bernoulli'])
plt.tight_layout()
plt.show()
- K-Nearest Neighbors (KNN): KNeighborsClassifier, tuning the number of neighbors to retrieve (n_neighbors)
# k-nearest neighbors majority voting
model_name_mclass.append('KNN')
n_neighs = [5, 10, 20, 25, 50]
knn_acc_mean, knn_acc_std = [[], []]
for num_neigh in n_neighs:
    current_acc = []
    for _ in range(10):
        knn_mclassifier = KNeighborsClassifier(n_neighbors=num_neigh)
        knn_mclassifier.fit(x_mclass_train, y_mclass_train)
        y_knn_mclass = knn_mclassifier.predict(x_mclass_test)
        knn_mclass_accuracy = accuracy_score(y_knn_mclass, y_mclass_test)
        current_acc.append(knn_mclass_accuracy)
    current_acc = np.array(current_acc)
    knn_acc_mean.append(np.mean(current_acc))
    knn_acc_std.append(np.std(current_acc))
knn_acc_mean = np.array(knn_acc_mean)
knn_acc_std = np.array(knn_acc_std)
knn_best_idx = np.argmax(knn_acc_mean)
knn_mclassifier = KNeighborsClassifier(n_neighbors=n_neighs[knn_best_idx])
knn_mclassifier.fit(x_mclass_train, y_mclass_train)
y_knn_mclass = knn_mclassifier.predict(x_mclass_test)
knn_mclass_accuracy = accuracy_score(y_knn_mclass, y_mclass_test)
acc_mclass.append(knn_mclass_accuracy)
classifier_mclass.append(knn_mclassifier)
cmaps.append(plt.cm.Oranges)
plt.errorbar(x=np.arange(len(n_neighs)), y=knn_acc_mean, yerr=knn_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors')
plt.xticks(np.arange(len(n_neighs)), labels=n_neighs)
plt.tight_layout()
plt.show()
- Dummy Classifier (DUMMY): DummyClassifier (as a baseline)
# dummy classifier
model_name_mclass.append('DUMMY')
dummy_mclassifier = DummyClassifier()
dummy_mclassifier.fit(x_mclass_train, y_mclass_train)
y_dummy_mclass = dummy_mclassifier.predict(x_mclass_test)
dummy_mclass_accuracy = accuracy_score(y_dummy_mclass, y_mclass_test)
acc_mclass.append(dummy_mclass_accuracy)
classifier_mclass.append(dummy_mclassifier)
cmaps.append(plt.cm.Purples)
Finally we compare the results:
compare_multiclass_classification(model_name_mclass, classifier_mclass, cmaps, x_mclass_test,
y_mclass_test, acc_mclass, class_names)
Regression¶
Given the original dataset, we now compare regression models on the task of predicting the amount of Sales.
K-Fold Cross Validation¶
In these experiments, instead of doing a static train/test split, we will evaluate the performance of the regression algorithms via k-fold cross-validation.
To do so, we will leverage the class KFold of scikit-learn, which takes as input the number of folds into which we want to divide the dataset and a boolean value indicating whether or not we want to shuffle the data prior to creating the chunks.
Calling split then provides train/test indices to split the data into train and test sets.
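A tiny illustration on made-up data of the index arrays that split yields:
import numpy as np
from sklearn.model_selection import KFold

toy_x = np.arange(10).reshape(5, 2)  # five made-up datapoints
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(toy_x):
    print(train_idx, test_idx)  # e.g. [0 1 3 4] [2]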
# import validation algorithm
from sklearn.model_selection import KFold
# import regression algorithms
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.dummy import DummyRegressor
# import evaluation algorithm
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
Pre-process and split into features and target:
reg_data = data.copy()
categorical_columns = reg_data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
reg_data[list(categorical_columns)] = reg_data[list(categorical_columns)].apply(label_encoder.fit_transform)
x_reg = reg_data.loc[:, reg_data.columns != 'Sales']
y_reg = reg_data['Sales']
We will use the following metrics to compare the results of several regression algorithms:
- Mean Absolute Error (MAE): mean_absolute_error in scikit-learn;
- Mean Squared Error (MSE): mean_squared_error in scikit-learn;
- Mean Absolute Percentage Error (MAPE): mean_absolute_percentage_error in scikit-learn.
All these functions take as input the ground-truth target vector and the predicted values.
For each of them, we will show a boxplot summarizing the results of the cross-validation procedure.
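A tiny illustration of the three metrics on made-up values:
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

y_true = [3.0, 5.0, 2.5]  # hypothetical targets
y_pred = [2.5, 5.0, 4.0]  # hypothetical predictions
print(mean_absolute_error(y_true, y_pred))             # 0.666...
print(mean_squared_error(y_true, y_pred))              # 0.833...
print(mean_absolute_percentage_error(y_true, y_pred))  # ~0.256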
# dataframe for storing results in a convenient way for showing boxplots later
reg_metrics = pd.DataFrame(columns=['Model', 'MAE', 'MSE', 'MAPE'])
We will use the following regression algorithms:
- Random Forest (RF): RandomForestRegressor
# random forest
rf_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in rf_kfolds.split(x_reg):
    rf_regressor = RandomForestRegressor()
    rf_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
    y_rf_reg = rf_regressor.predict(x_reg.iloc[test_idx])
    current_rf_mae = mean_absolute_error(y_reg.iloc[test_idx], y_rf_reg)
    current_rf_mse = mean_squared_error(y_reg.iloc[test_idx], y_rf_reg)
    current_rf_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_rf_reg)
    reg_metrics.loc[len(reg_metrics)] = ['RF', current_rf_mae, current_rf_mse, current_rf_mape]
- Support Vector Machines (SVM): SVR
# support vector machine
sv_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in sv_kfolds.split(x_reg):
    sv_regressor = SVR()
    sv_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
    y_sv_reg = sv_regressor.predict(x_reg.iloc[test_idx])
    current_sv_mae = mean_absolute_error(y_reg.iloc[test_idx], y_sv_reg)
    current_sv_mse = mean_squared_error(y_reg.iloc[test_idx], y_sv_reg)
    current_sv_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_sv_reg)
    reg_metrics.loc[len(reg_metrics)] = ['SVM', current_sv_mae, current_sv_mse, current_sv_mape]
- K-Nearest Neighbors (KNN): KNeighborsRegressor
# k-nearest neighbors
knn_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in knn_kfolds.split(x_reg):
    knn_regressor = KNeighborsRegressor()
    knn_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
    y_knn_reg = knn_regressor.predict(x_reg.iloc[test_idx])
    current_knn_mae = mean_absolute_error(y_reg.iloc[test_idx], y_knn_reg)
    current_knn_mse = mean_squared_error(y_reg.iloc[test_idx], y_knn_reg)
    current_knn_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_knn_reg)
    reg_metrics.loc[len(reg_metrics)] = ['KNN', current_knn_mae, current_knn_mse, current_knn_mape]
- Dummy Regressor (DUMMY): DummyRegressor (as a baseline)
# dummy regressor
dummy_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in dummy_kfolds.split(x_reg):
    dummy_regressor = DummyRegressor()
    dummy_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
    y_dummy_reg = dummy_regressor.predict(x_reg.iloc[test_idx])
    current_dummy_mae = mean_absolute_error(y_reg.iloc[test_idx], y_dummy_reg)
    current_dummy_mse = mean_squared_error(y_reg.iloc[test_idx], y_dummy_reg)
    current_dummy_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_dummy_reg)
    reg_metrics.loc[len(reg_metrics)] = ['DUMMY', current_dummy_mae, current_dummy_mse, current_dummy_mape]
Finally we compare the results:
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
sns.boxplot(data=reg_metrics, x="MAE", y="Model", ax=axs[0], palette=sns.color_palette('Paired')[1::2])
sns.boxplot(data=reg_metrics, x="MSE", y="Model", ax=axs[1], palette=sns.color_palette('Paired')[1::2])
axs[1].set_yticklabels('')
axs[1].set_ylabel('')
sns.boxplot(data=reg_metrics, x="MAPE", y="Model", ax=axs[2], palette=sns.color_palette('Paired')[1::2])
axs[2].set_yticklabels('')
axs[2].set_ylabel('')
plt.tight_layout()
plt.show()