Lab 2: most effective ML algorithm

In [1]:
import io
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

Get the data

The dataset is online, you can find information about its content and download it from the page: https://data.mendeley.com/datasets/8gx2fvg2k6/5

The go-to library for reading, writing and managing datasets is pandas. In this cell we use it to read a file stored as .csv, using the function read_csv.

This returns a pandas DataFrame, the data structure used by this library to store tabular datasets.

If you use Google Colab, there are two ways in which you can read an external file, based on where you want to store it:

  1. local file (i.e. stored on your machine):

    import pandas as pd
    from google.colab import files
    uploaded = files.upload()
    data = pd.read_csv(io.BytesIO(uploaded['file-name.csv']), encoding='latin-1')
  2. file stored on Google Drive (i.e. remote storage):

    import pandas as pd
    from google.colab import drive
    drive.mount('/content/drive/')
    %cd /content/drive/MyDrive
    data = pd.read_csv('file-name.csv')

If you are instead using your own machine to carry out the lab activity, just:

import pandas as pd
data = pd.read_csv('file-name.csv')
In [2]:
from google.colab import drive
drive.mount('/content/drive/')

%cd /content/drive/MyDrive/
Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
/content/drive/MyDrive

(many of the errors raised by the function read_csv can be fixed by changing the encoding type... look carefully at the error message!)

In [3]:
import pandas as pd
data = pd.read_csv('DataCoSupplyChainDataset.csv', encoding='latin-1')

Get info about the data

The method info(), applied to a DataFrame, gives basic information about the dataset: its shape, its columns' names with their type and non-null counts (which reveal missing entries), and the memory it requires.

In [4]:
print('Basic Information About the dataset: ')
data.info()
Basic Information About the dataset: 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Type                           180519 non-null  object 
 1   Days for shipping (real)       180519 non-null  int64  
 2   Days for shipment (scheduled)  180519 non-null  int64  
 3   Benefit per order              180519 non-null  float64
 4   Sales per customer             180519 non-null  float64
 5   Delivery Status                180519 non-null  object 
 6   Late_delivery_risk             180519 non-null  int64  
 7   Category Id                    180519 non-null  int64  
 8   Category Name                  180519 non-null  object 
 9   Customer City                  180519 non-null  object 
 10  Customer Country               180519 non-null  object 
 11  Customer Email                 180519 non-null  object 
 12  Customer Fname                 180519 non-null  object 
 13  Customer Id                    180519 non-null  int64  
 14  Customer Lname                 180511 non-null  object 
 15  Customer Password              180519 non-null  object 
 16  Customer Segment               180519 non-null  object 
 17  Customer State                 180519 non-null  object 
 18  Customer Street                180519 non-null  object 
 19  Customer Zipcode               180516 non-null  float64
 20  Department Id                  180519 non-null  int64  
 21  Department Name                180519 non-null  object 
 22  Latitude                       180519 non-null  float64
 23  Longitude                      180519 non-null  float64
 24  Market                         180519 non-null  object 
 25  Order City                     180519 non-null  object 
 26  Order Country                  180519 non-null  object 
 27  Order Customer Id              180519 non-null  int64  
 28  order date (DateOrders)        180519 non-null  object 
 29  Order Id                       180519 non-null  int64  
 30  Order Item Cardprod Id         180519 non-null  int64  
 31  Order Item Discount            180519 non-null  float64
 32  Order Item Discount Rate       180519 non-null  float64
 33  Order Item Id                  180519 non-null  int64  
 34  Order Item Product Price       180519 non-null  float64
 35  Order Item Profit Ratio        180519 non-null  float64
 36  Order Item Quantity            180519 non-null  int64  
 37  Sales                          180519 non-null  float64
 38  Order Item Total               180519 non-null  float64
 39  Order Profit Per Order         180519 non-null  float64
 40  Order Region                   180519 non-null  object 
 41  Order State                    180519 non-null  object 
 42  Order Status                   180519 non-null  object 
 43  Order Zipcode                  24840 non-null   float64
 44  Product Card Id                180519 non-null  int64  
 45  Product Category Id            180519 non-null  int64  
 46  Product Description            0 non-null       float64
 47  Product Image                  180519 non-null  object 
 48  Product Name                   180519 non-null  object 
 49  Product Price                  180519 non-null  float64
 50  Product Status                 180519 non-null  int64  
 51  shipping date (DateOrders)     180519 non-null  object 
 52  Shipping Mode                  180519 non-null  object 
dtypes: float64(15), int64(14), object(24)
memory usage: 73.0+ MB

Take a subset of the data

From the previous cell, we see that our dataset has $180519$ rows... a lot!

Since the goal of this lab activity is not to build a deployable model for the real world, but simply to wrap up the content of the course, we reduce the number of rows to avoid enormous execution times.

Caveat: even if your machine / Colab session is endowed with GPU acceleration, it is not automatically exploited!

There are many ways to reduce the size of a dataset; in what follows we will simply select the ten most represented countries and three departments of our choice (and we will also drop a couple of columns).
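Another common option, not used here, is plain random subsampling; a minimal sketch (the sample size is purely illustrative):

    # alternative, not used below: keep a uniform random sample of the rows
    data_small = data.sample(n=10000, random_state=0)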

In [5]:
data['Customer Full Name'] = data['Customer Fname'].astype(str) + data['Customer Lname'].astype(str)
data = data.drop(['Customer Email','Product Status','Customer Password','Customer Street','Customer Fname','Customer Lname',
           'Latitude','Longitude','Product Description','Product Image','Order Zipcode','shipping date (DateOrders)'], axis=1)
In [6]:
most_frequent_countries = data['Order Country'].value_counts()[:10].keys()
data = data[data['Order Country'].isin(most_frequent_countries)]
In [7]:
selected_departments = ['Outdoors', 'Fitness', 'Technology']
data = data[data['Department Name'].isin(selected_departments)]

Have a glance at the top rows of the dataset with the function head:

In [8]:
data.head()
Out[8]:
Type Days for shipping (real) Days for shipment (scheduled) Benefit per order Sales per customer Delivery Status Late_delivery_risk Category Id Category Name Customer City ... Order Profit Per Order Order Region Order State Order Status Product Card Id Product Category Id Product Name Product Price Shipping Mode Customer Full Name
1 TRANSFER 5 4 -249.089996 311.359985 Late delivery 1 73 Sporting Goods Caguas ... -249.089996 South Asia Rajastán PENDING 1360 73 Smart watch 327.75 Standard Class IreneLuna
2 CASH 4 4 -247.779999 309.720001 Shipping on time 0 73 Sporting Goods San Jose ... -247.779999 South Asia Rajastán CLOSED 1360 73 Smart watch 327.75 Standard Class GillianMaldonado
3 DEBIT 3 4 22.860001 304.809998 Advance shipping 0 73 Sporting Goods Los Angeles ... 22.860001 Oceania Queensland COMPLETE 1360 73 Smart watch 327.75 Standard Class TanaTate
4 PAYMENT 2 4 134.210007 298.250000 Advance shipping 0 73 Sporting Goods Caguas ... 134.210007 Oceania Queensland PENDING_PAYMENT 1360 73 Smart watch 327.75 Standard Class OrliHendricks
5 TRANSFER 6 4 18.580000 294.980011 Shipping canceled 0 73 Sporting Goods Tonawanda ... 18.580000 Oceania Queensland CANCELED 1360 73 Smart watch 327.75 Standard Class KimberlyFlowers

5 rows × 42 columns

Check if and how many cells of the dataset are empty (missing values):

In [9]:
data.isnull().sum()
Out[9]:
Type                             0
Days for shipping (real)         0
Days for shipment (scheduled)    0
Benefit per order                0
Sales per customer               0
Delivery Status                  0
Late_delivery_risk               0
Category Id                      0
Category Name                    0
Customer City                    0
Customer Country                 0
Customer Id                      0
Customer Segment                 0
Customer State                   0
Customer Zipcode                 1
Department Id                    0
Department Name                  0
Market                           0
Order City                       0
Order Country                    0
Order Customer Id                0
order date (DateOrders)          0
Order Id                         0
Order Item Cardprod Id           0
Order Item Discount              0
Order Item Discount Rate         0
Order Item Id                    0
Order Item Product Price         0
Order Item Profit Ratio          0
Order Item Quantity              0
Sales                            0
Order Item Total                 0
Order Profit Per Order           0
Order Region                     0
Order State                      0
Order Status                     0
Product Card Id                  0
Product Category Id              0
Product Name                     0
Product Price                    0
Shipping Mode                    0
Customer Full Name               0
dtype: int64

Since our dataset is large and has only one missing value (in the column Customer Zipcode), we drop the entire row (i.e. data point) containing it:

In [10]:
data = data.dropna()

A note on categorical features

When applying the method info, you might have noticed that several columns are of type object: this means that they are categorical attributes.

In principle this is totally fine; however, in what follows we will use the library scikit-learn to build our classifiers and regressors (arguably the most used library among Python ML practitioners), and most of its machine learning models are designed to work with numerical data. Therefore, you often need to preprocess your categorical features before feeding them into these models.

The following are the most popular (and usually most effective) approaches:

  • One-Hot Encoding: convert categorical variables into a binary matrix (0s and 1s) where each category becomes a new column with a binary value (1 or 0). OneHotEncoder in scikit-learn can be used for this purpose.
  • Label Encoding: convert each category into a unique integer. LabelEncoder in scikit-learn can be used for this purpose.

The following observations hold:

  • Label Encoding is more memory-efficient compared to One-Hot Encoding, especially when dealing with a large number of categories. It represents categories with integers, requiring less memory.

  • Label Encoding is sensitive to model assumptions: some machine learning algorithms may interpret the numeric labels as ordinal information, leading to biased results. This can affect the performance of models that assume features are continuous and have a meaningful order.

  • One-Hot Encoding can lead to a high-dimensional dataset, especially when dealing with categorical features with a large number of unique values. This can make the dataset sparse and might pose challenges for some models.

It's often a good idea to experiment with both encoding methods and choose the one that works best for your particular use case.

We restate that the goal of this lab is not to build the perfect model, hence we decide to use label encoding to save memory and computational time.
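As a minimal sketch of the two encodings on a toy column (the data below is made up for illustration):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    toy = pd.DataFrame({'Market': ['Europe', 'LATAM', 'Europe', 'Pacific Asia']})
    # label encoding: one integer per category (assigned in alphabetical order)
    print(LabelEncoder().fit_transform(toy['Market']))               # [0 1 0 2]
    # one-hot encoding: one binary column per category
    print(OneHotEncoder().fit_transform(toy[['Market']]).toarray())  # 4x3 binary matrix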

A note on the choice of the techniques

This lab activity is meant to recap the content of the course seen so far, and to show a possible way to implement the studied algorithms without re-inventing the wheel.

Many solutions are valid and no single one is perfect: here we show several possibilities for tuning and training models. Of course, in real-world scenarios (and likely for the final project), they should be combined based on your needs and / or background knowledge.

Classification

Given the above dataset, we solve the two following classification problems:

  1. binary classification: predict whether an order will be marked as late delivery;

  2. multiclass classification: predict the Delivery Status of each point in the dataset.

Binary Classification

In [11]:
# import preprocessing algorithms
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
# import classification algorithms
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
# import evaluation algorithms
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, f1_score, precision_recall_curve, roc_curve, auc
In [31]:
class_data = data.copy()

Let's begin by preparing the dataset: from the column Delivery Status, we get a binary label by setting to $1$ the cells containing the status Late delivery, and to $0$ all the others.

We use the library numpy to do this operation:

In [32]:
# create target column
# np.where (condition, value_if_true, value_if_false)
class_data['late_delivery'] = np.where(class_data['Delivery Status'] == 'Late delivery', 1, 0)
class_data.drop(['Delivery Status'], axis=1, inplace=True)
class_data['late_delivery'].value_counts()
Out[32]:
1    4356
0    3509
Name: late_delivery, dtype: int64

Encode categorical columns:

In [33]:
categorical_columns = class_data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
class_data[list(categorical_columns)] = class_data[list(categorical_columns)].apply(label_encoder.fit_transform)
# class_data.dtypes

Apply some pre-processing to make the classification interesting (i.e. non-trivial), by dropping columns that are highly correlated with our target:

In [34]:
class_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 7865 entries, 1 to 179586
Data columns (total 42 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Type                           7865 non-null   int64  
 1   Days for shipping (real)       7865 non-null   int64  
 2   Days for shipment (scheduled)  7865 non-null   int64  
 3   Benefit per order              7865 non-null   float64
 4   Sales per customer             7865 non-null   float64
 5   Late_delivery_risk             7865 non-null   int64  
 6   Category Id                    7865 non-null   int64  
 7   Category Name                  7865 non-null   int64  
 8   Customer City                  7865 non-null   int64  
 9   Customer Country               7865 non-null   int64  
 10  Customer Id                    7865 non-null   int64  
 11  Customer Segment               7865 non-null   int64  
 12  Customer State                 7865 non-null   int64  
 13  Customer Zipcode               7865 non-null   float64
 14  Department Id                  7865 non-null   int64  
 15  Department Name                7865 non-null   int64  
 16  Market                         7865 non-null   int64  
 17  Order City                     7865 non-null   int64  
 18  Order Country                  7865 non-null   int64  
 19  Order Customer Id              7865 non-null   int64  
 20  order date (DateOrders)        7865 non-null   int64  
 21  Order Id                       7865 non-null   int64  
 22  Order Item Cardprod Id         7865 non-null   int64  
 23  Order Item Discount            7865 non-null   float64
 24  Order Item Discount Rate       7865 non-null   float64
 25  Order Item Id                  7865 non-null   int64  
 26  Order Item Product Price       7865 non-null   float64
 27  Order Item Profit Ratio        7865 non-null   float64
 28  Order Item Quantity            7865 non-null   int64  
 29  Sales                          7865 non-null   float64
 30  Order Item Total               7865 non-null   float64
 31  Order Profit Per Order         7865 non-null   float64
 32  Order Region                   7865 non-null   int64  
 33  Order State                    7865 non-null   int64  
 34  Order Status                   7865 non-null   int64  
 35  Product Card Id                7865 non-null   int64  
 36  Product Category Id            7865 non-null   int64  
 37  Product Name                   7865 non-null   int64  
 38  Product Price                  7865 non-null   float64
 39  Shipping Mode                  7865 non-null   int64  
 40  Customer Full Name             7865 non-null   int64  
 41  late_delivery                  7865 non-null   int64  
dtypes: float64(11), int64(31)
memory usage: 2.6 MB
In [35]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(class_data[['late_delivery', 'Late_delivery_risk']].corr(), annot=True)
plt.tight_layout()
plt.show()
In [36]:
class_data.drop(['Late_delivery_risk'], axis=1, inplace=True)

Explicitly separate features and target columns:

In [37]:
x_class = class_data.loc[:, class_data.columns != 'late_delivery']
y_class = class_data['late_delivery']
x_class.dtypes
Out[37]:
Type                               int64
Days for shipping (real)           int64
Days for shipment (scheduled)      int64
Benefit per order                float64
Sales per customer               float64
Category Id                        int64
Category Name                      int64
Customer City                      int64
Customer Country                   int64
Customer Id                        int64
Customer Segment                   int64
Customer State                     int64
Customer Zipcode                 float64
Department Id                      int64
Department Name                    int64
Market                             int64
Order City                         int64
Order Country                      int64
Order Customer Id                  int64
order date (DateOrders)            int64
Order Id                           int64
Order Item Cardprod Id             int64
Order Item Discount              float64
Order Item Discount Rate         float64
Order Item Id                      int64
Order Item Product Price         float64
Order Item Profit Ratio          float64
Order Item Quantity                int64
Sales                            float64
Order Item Total                 float64
Order Profit Per Order           float64
Order Region                       int64
Order State                        int64
Order Status                       int64
Product Card Id                    int64
Product Category Id                int64
Product Name                       int64
Product Price                    float64
Shipping Mode                      int64
Customer Full Name                 int64
dtype: object

Split the data statically into train and test sets using the dedicated scikit-learn function train_test_split; here we decided to keep $20$% of the data for testing:

In [38]:
x_class_train, x_class_test, y_class_train, y_class_test = train_test_split(x_class, y_class, test_size=0.2, random_state=26)

Normalize the data, i.e. scale the features of your dataset so that they have a consistent range.

Common methods for normalization include:

  • Min-Max Scaling: scales the data to a specific range, typically [0, 1].

  • Standardization: transforms the data to have a mean of 0 and a standard deviation of 1.

Here we use the latter, via the scikit-learn class StandardScaler:

In [39]:
std_scaler = StandardScaler()
# fit the scaler on the training set only, then apply the same transformation to the test set
x_class_train = std_scaler.fit_transform(x_class_train)
x_class_test = std_scaler.transform(x_class_test)
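Had we chosen min-max scaling instead, the code would be analogous (a sketch only, not used in what follows):

    from sklearn.preprocessing import MinMaxScaler
    minmax_scaler = MinMaxScaler()                               # scales each feature to [0, 1]
    x_train_scaled = minmax_scaler.fit_transform(x_class_train)  # fit on the training set only...
    x_test_scaled = minmax_scaler.transform(x_class_test)        # ...then apply to the test set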

Comparing binary classification algorithms

We will use the following metrics to compare the results of several binary classification algorithms:

  • Receiver Operating Characteristic (ROC) curve, to assess and visualize the trade-off between the true positive rate and the false positive rate as the discrimination threshold varies. It will be displayed with its Area Under the Curve (AUC): higher AUC indicates better model performance, with a perfect classifier having an AUC of 1, and a random classifier having an AUC of 0.5.

    scikit-learn provides the function roc_curve, which takes as input the vector of ground-truth classes and the probability estimates of the positive class; the latter can be obtained via the method predict_proba of each classifier (taking care to select the column corresponding to the positive class). roc_curve gives as output the false positive rates and the true positive rates, which can then be fed to the function auc to get the AUC.

  • Precision-Recall (PR) curve, another graphical tool used to assess the performance of a binary classification model, that focuses on the trade-off between precision and recall at different classification thresholds. It will be displayed with its Area Under the Curve (AUC-PR): a higher AUC-PR indicates better model performance, with a perfect classifier having an AUC-PR of 1.

    scikit-learn provides the function precision_recall_curve, taking the same inputs as roc_curve. precision_recall_curve gives as output the precision and the recall, which can then be fed to the function auc, to get the AUC-PR.

  • Accuracy: ratio of correctly predicted instances (both true positives and true negatives) to the total number of instances in the dataset.

    In scikit-learn it can be computed with the function accuracy_score, taking as input the true target and the predicted one.
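Since these computations repeat for every model, here is a minimal helper sketching how the scikit-learn pieces compose for a generic fitted binary classifier (the cells below inline the same logic model by model):

    from sklearn.metrics import accuracy_score, precision_recall_curve, roc_curve, auc

    def binary_classification_metrics(fitted_clf, x_test, y_test):
      # accuracy from hard predictions
      accuracy = accuracy_score(y_test, fitted_clf.predict(x_test))
      # probability estimates of the positive class (second column of predict_proba)
      probs = fitted_clf.predict_proba(x_test)[:, 1]
      fpr, tpr, _ = roc_curve(y_test, probs)
      precision, recall, _ = precision_recall_curve(y_test, probs)
      return accuracy, (fpr, tpr, auc(fpr, tpr)), (precision, recall, auc(recall, precision))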

In [40]:
def compare_binary_classification(names, prec_rec_values, prec_rec_auc, roc_values, roc_auc, acc_values):
  sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})
  cmap = sns.color_palette('Paired')[1::2]
  fig, axs = plt.subplots(1, 3, figsize=(12, 5))
  for i in range(len(names)):
    axs[0].plot(roc_values[i][0], roc_values[i][1], label='{} AUC {:.3f}'.format(names[i], roc_auc[i]), c=cmap[i])
    axs[1].plot(prec_rec_values[i][1], prec_rec_values[i][0], label='{} AUC-PR {:.3f}'.format(names[i], prec_rec_auc[i]), c=cmap[i])
  axs[0].legend(loc='lower right')
  axs[0].set_xlabel('FPR')
  axs[0].set_ylabel('TPR')
  axs[0].set_title('ROC Curve')
  axs[1].legend(loc='lower left')
  axs[1].set_xlabel('Recall')
  axs[1].set_ylabel('Precision')
  axs[1].set_title('PR Curve')
  bars = axs[2].bar(x=np.arange(len(names)), height=acc_values, color=cmap)
  axs[2].set_xticks(np.arange(len(names)))
  axs[2].set_xticklabels(names)
  for bar in bars:
    yval = bar.get_height()
    axs[2].text(bar.get_x(), yval + .005, '{:.4f}'.format(yval))
  axs[2].set_ylabel('Accuracy')
  axs[2].set_title('Accuracy Bar Plots')
  plt.tight_layout()
  plt.show()
In [41]:
# useful containers
model_name_class, prec_rec_class, pr_auc_class, fpr_tpr_class, roc_auc_class, acc_class  = [[] for _ in range(6)]

A note on sklearn models

Most scikit-learn models (both for regression and classification) share the same structure: after having instantiated a model, the method fit takes as input the training dataset and the training target, and trains the model. The trained model then produces predictions with the method predict, which takes as input the test dataset.
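As a minimal illustration of this shared interface (any estimator would do; here we use the dummy baseline):

    from sklearn.dummy import DummyClassifier
    model = DummyClassifier()                  # 1. instantiate
    model.fit(x_class_train, y_class_train)    # 2. train on the training split
    y_pred = model.predict(x_class_test)       # 3. predict on the test split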

We will use the following binary classification algorithms:

  1. Random Forest (RF): RandomForestClassifier
In [42]:
# random forest classifier
model_name_class.append('RF')
rf_classifier = RandomForestClassifier()
rf_classifier.fit(x_class_train, y_class_train)
y_rf_class = rf_classifier.predict(x_class_test)
rf_class_accuracy = accuracy_score(y_rf_class, y_class_test)
acc_class.append(rf_class_accuracy)
rf_probs = rf_classifier.predict_proba(x_class_test)[:, 1]
rf_precision, rf_recall, _ = precision_recall_curve(y_class_test, rf_probs)
rf_auc = auc(rf_recall, rf_precision)
prec_rec_class.append([rf_precision, rf_recall])
pr_auc_class.append(rf_auc)
rf_fpr, rf_tpr, _ = roc_curve(y_class_test, rf_probs)
rf_roc_auc = auc(rf_fpr, rf_tpr)
fpr_tpr_class.append([rf_fpr, rf_tpr])
roc_auc_class.append(rf_roc_auc)
  2. Support Vector Machines (SVM): SVC
In [43]:
# support vector machine
model_name_class.append('SVM')
sv_classifier = SVC(probability=True)
sv_classifier.fit(x_class_train, y_class_train)
y_sv_class = sv_classifier.predict(x_class_test)
sv_class_accuracy = accuracy_score(y_sv_class, y_class_test)
acc_class.append(sv_class_accuracy)
sv_probs = sv_classifier.predict_proba(x_class_test)[:, 1]
sv_precision, sv_recall, _ = precision_recall_curve(y_class_test, sv_probs)
sv_auc = auc(sv_recall, sv_precision)
prec_rec_class.append([sv_precision, sv_recall])
pr_auc_class.append(sv_auc)
sv_fpr, sv_tpr, _ = roc_curve(y_class_test, sv_probs)
fpr_tpr_class.append([sv_fpr, sv_tpr])
sv_roc_auc = auc(sv_fpr, sv_tpr)
roc_auc_class.append(sv_roc_auc)
  3. Naive Bayes (NB): BernoulliNB

    (the Bernoulli Naive Bayes algorithm is a variant of the Naive Bayes algorithm that is specifically designed for binary data, where features represent either presence or absence of a particular characteristic)

In [44]:
# naive bayes
model_name_class.append('NB')
nb_classifier = BernoulliNB()
nb_classifier.fit(x_class_train, y_class_train)
y_nb_class = nb_classifier.predict(x_class_test)
nb_class_accuracy = accuracy_score(y_nb_class, y_class_test)
acc_class.append(nb_class_accuracy)
nb_probs = nb_classifier.predict_proba(x_class_test)[:, 1]
nb_precision, nb_recall, _ = precision_recall_curve(y_class_test, nb_probs)
nb_auc = auc(nb_recall, nb_precision)
prec_rec_class.append([nb_precision, nb_recall])
pr_auc_class.append(nb_auc)
nb_fpr, nb_tpr, _ = roc_curve(y_class_test, nb_probs)
nb_roc_auc = auc(nb_fpr, nb_tpr)
fpr_tpr_class.append([nb_fpr, nb_tpr])
roc_auc_class.append(nb_roc_auc)
  4. K-Nearest Neighbors (KNN): KNeighborsClassifier
In [45]:
# k-nearest neighbors
model_name_class.append('KNN')
knn_classifier = KNeighborsClassifier()
knn_classifier.fit(x_class_train, y_class_train)
y_knn_class = knn_classifier.predict(x_class_test)
knn_class_accuracy = accuracy_score(y_knn_class, y_class_test)
acc_class.append(knn_class_accuracy)
knn_probs = knn_classifier.predict_proba(x_class_test)[:, 1]
knn_precision, knn_recall, _ = precision_recall_curve(y_class_test, knn_probs)
knn_auc = auc(knn_recall, knn_precision)
prec_rec_class.append([knn_precision, knn_recall])
pr_auc_class.append(knn_auc)
knn_fpr, knn_tpr, _ = roc_curve(y_class_test, knn_probs)
knn_roc_auc = auc(knn_fpr, knn_tpr)
fpr_tpr_class.append([knn_fpr, knn_tpr])
roc_auc_class.append(knn_roc_auc)
  5. Dummy Classifier (DUMMY): DummyClassifier

    (as a baseline)

In [46]:
# dummy classifier
model_name_class.append('DUMMY')
dummy_classifier = DummyClassifier()
dummy_classifier.fit(x_class_train, y_class_train)
y_dummy_class = dummy_classifier.predict(x_class_test)
dummy_class_accuracy = accuracy_score(y_dummy_class, y_class_test)
acc_class.append(dummy_class_accuracy)
dummy_probs = dummy_classifier.predict_proba(x_class_test)[:, 1]
dummy_precision, dummy_recall, _ = precision_recall_curve(y_class_test, dummy_probs)
dummy_auc = auc(dummy_recall, dummy_precision)
prec_rec_class.append([dummy_precision, dummy_recall])
pr_auc_class.append(dummy_auc)
dummy_fpr, dummy_tpr, _ = roc_curve(y_class_test, dummy_probs)
dummy_roc_auc = auc(dummy_fpr, dummy_tpr)
fpr_tpr_class.append([dummy_fpr, dummy_tpr])
roc_auc_class.append(dummy_roc_auc)

Finally compare the collected results:

In [47]:
compare_binary_classification(model_name_class, prec_rec_class, pr_auc_class, fpr_tpr_class, roc_auc_class, acc_class)

Multi-class classification

In the following experiments, we will show how to perform hyperparameter tuning using grid search on salient parameters of each algorithm. We do so by specifying the hyperparameters to tune and a range of values or discrete choices for each one: this creates a grid of hyperparameter combinations to explore.

In particular, we will plot the process of hyperparameter tuning with error bars: for each hyperparameter configuration, we repeat the experiment $n$ times and show the mean and the standard deviation of the results (here, in terms of accuracy).
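All the tuning cells below follow the same pattern; a generic sketch of it, assuming the train/test split defined further down (scikit-learn's GridSearchCV automates a cross-validated version of this search):

    import numpy as np
    from sklearn.metrics import accuracy_score

    def repeated_accuracy(make_model, values, n_repeats=10):
      # for each hyperparameter value, train n_repeats fresh models and collect accuracy mean / std
      means, stds = [], []
      for v in values:
        accs = []
        for _ in range(n_repeats):
          model = make_model(v)
          model.fit(x_mclass_train, y_mclass_train)
          accs.append(accuracy_score(y_mclass_test, model.predict(x_mclass_test)))
        means.append(np.mean(accs))
        stds.append(np.std(accs))
      return np.array(means), np.array(stds)

    # e.g.: repeated_accuracy(lambda n: RandomForestClassifier(n_estimators=n), [50, 100, 150])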

In [48]:
# import evaluation metric
from sklearn.metrics import ConfusionMatrixDisplay
# import classification algorithms
from sklearn.naive_bayes import MultinomialNB

Again we pre-process the data by dividing the features and the target (preserving the label encoding performed at the previous point):

In [49]:
x_mclass = x_class.copy()
y_mclass = pd.Series(label_encoder.fit_transform(data['Delivery Status'].copy()))
class_names = label_encoder.classes_
first_space = [s.find(' ') for s in class_names]
class_names = [s[:first_space[i]] + '\n' + s[first_space[i]: ] for i,s in enumerate(class_names)]
y_mclass.value_counts()
Out[49]:
1    4356
0    1834
3    1312
2     363
dtype: int64

Train and test split:

In [51]:
x_mclass_train, x_mclass_test, y_mclass_train, y_mclass_test = train_test_split(x_mclass, y_mclass, test_size=0.2, random_state=26)
x_mclass_train = std_scaler.fit_transform(x_mclass_train)
x_mclass_test = std_scaler.transform(x_mclass_test)

We will use the following metrics to compare the results of several multiclass classification algorithms:

  • Accuracy (as before)

  • Confusion Matrix: a table that describes the performance of a classification model on a set of test data where true class labels are known. Each row of the matrix represents the instances in a predicted class, while each column represents the instances in an actual class. The confusion matrix is particularly useful for understanding the types of errors made by the classifier.

    Computed and plotted by scikit-learn via ConfusionMatrixDisplay.from_estimator, which takes as input the trained classifier, the test features, the test targets and, optionally, the class labels.

In [52]:
def compare_multiclass_classification(names, classifiers, colormaps, x_test, y_test, acc_values, class_labels):
  fig, axs = plt.subplots(len(names)+1, 1, figsize=(7, 20))
  bars = axs[0].bar(x=np.arange(len(names)), height=acc_values, color=[c(0.5) for c in colormaps])
  axs[0].set_xticks(np.arange(len(names)))
  axs[0].set_xticklabels(names)
  for bar in bars:
    yval = bar.get_height()
    axs[0].text(bar.get_x() + 0.25, yval + .005, '{:.4f}'.format(yval))
  axs[0].set_ylabel('Accuracy')
  axs[0].set_title('Accuracy Bar Plots')
  for i in range(1, len(names)+1):
      ConfusionMatrixDisplay.from_estimator(classifiers[i-1], x_test, y_test,
                                            display_labels=class_labels, cmap=colormaps[i-1], normalize='true', colorbar=False, ax=axs[i])
      if i != len(names):
        axs[i].set_xticklabels([])
        axs[i].set_xlabel('')
      axs[i].set_title('{} Confusion Matrix'.format(names[i-1]))
  plt.tight_layout()
  plt.show()
In [53]:
# containers for storing results
model_name_mclass, acc_mclass, classifier_mclass, cmaps = [[] for _ in range(4)]

We will use the following multiclass classification algorithms:

  1. Random Forest (RF): RandomForestClassifier, tuning the number of trees that form the ensemble (n_estimators)
In [54]:
# random forest
model_name_mclass.append('RF')
n_estimators = [50, 100, 150, 200, 250]
rf_acc_mean, rf_acc_std = [[], []]
for n_est in n_estimators:
  current_acc = []
  for _ in range(10):
    rf_mclassifier = RandomForestClassifier(n_estimators=n_est)
    rf_mclassifier.fit(x_mclass_train, y_mclass_train)
    y_rf_mclass = rf_mclassifier.predict(x_mclass_test)
    rf_mclass_accuracy = accuracy_score(y_rf_mclass, y_mclass_test)
    current_acc.append(rf_mclass_accuracy)
  current_acc = np.array(current_acc)
  rf_acc_mean.append(np.mean(current_acc))
  rf_acc_std.append(np.std(current_acc))

rf_acc_mean = np.array(rf_acc_mean)
rf_acc_std = np.array(rf_acc_std)
rf_best_idx = np.argmax(rf_acc_mean)
rf_mclassifier = RandomForestClassifier(n_estimators=n_estimators[rf_best_idx])
rf_mclassifier.fit(x_mclass_train, y_mclass_train)
y_rf_mclass = rf_mclassifier.predict(x_mclass_test)
rf_mclass_accuracy = accuracy_score(y_rf_mclass, y_mclass_test)
acc_mclass.append(rf_mclass_accuracy)
classifier_mclass.append(rf_mclassifier)
cmaps.append(plt.cm.Blues)

plt.errorbar(x=np.arange(len(n_estimators)), y=rf_acc_mean, yerr=rf_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Number of Estimators')
plt.xticks(np.arange(len(n_estimators)), labels=n_estimators)
plt.tight_layout()
plt.show()
  2. Support Vector Machine (SVM): SVC, tuning the type of kernel used (kernel)
In [55]:
# support vector machine
model_name_mclass.append('SVM')
kernel_type = ['linear', 'poly', 'rbf', 'sigmoid']
sv_acc_mean, sv_acc_std = [[], []]
for k_type in kernel_type:
  current_acc = []
  for _ in range(10):
    sv_mclassifier = SVC(kernel=k_type)
    sv_mclassifier.fit(x_mclass_train, y_mclass_train)
    y_sv_mclass = sv_mclassifier.predict(x_mclass_test)
    sv_mclass_accuracy = accuracy_score(y_sv_mclass, y_mclass_test)
    current_acc.append(sv_mclass_accuracy)
  current_acc = np.array(current_acc)
  sv_acc_mean.append(np.mean(current_acc))
  sv_acc_std.append(np.std(current_acc))

sv_acc_mean = np.array(sv_acc_mean)
sv_acc_std = np.array(sv_acc_std)
sv_best_idx = np.argmax(sv_acc_mean)
sv_mclassifier = SVC(kernel=kernel_type[sv_best_idx])
sv_mclassifier.fit(x_mclass_train, y_mclass_train)
y_sv_mclass = sv_mclassifier.predict(x_mclass_test)
sv_mclass_accuracy = accuracy_score(y_sv_mclass, y_mclass_test)
acc_mclass.append(sv_mclass_accuracy)
classifier_mclass.append(sv_mclassifier)
cmaps.append(plt.cm.Greens)

plt.errorbar(x=np.arange(len(kernel_type)), y=sv_acc_mean, yerr=sv_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Kernel Type')
plt.xticks(np.arange(len(kernel_type)), labels=kernel_type)
plt.tight_layout()
plt.show()
  3. Naive Bayes (NB): either GaussianNB (modelling the likelihood of each class as a Gaussian distribution) or BernoulliNB
In [56]:
# naive bayes
model_name_mclass.append('NB')
bayes_type = [GaussianNB(), BernoulliNB()]
nb_acc_mean, nb_acc_std = [[], []]
for nb_type in bayes_type:
  current_acc = []
  for _ in range(10):
    nb_mclassifier = nb_type
    nb_mclassifier.fit(x_mclass_train, y_mclass_train)
    y_nb_mclass = nb_mclassifier.predict(x_mclass_test)
    nb_mclass_accuracy = accuracy_score(y_nb_mclass, y_mclass_test)
    current_acc.append(nb_mclass_accuracy)
  current_acc = np.array(current_acc)
  nb_acc_mean.append(np.mean(current_acc))
  nb_acc_std.append(np.std(current_acc))

nb_acc_mean = np.array(nb_acc_mean)
nb_acc_std = np.array(nb_acc_std)
nb_best_idx = np.argmax(nb_acc_mean)
nb_mclassifier = bayes_type[nb_best_idx]
nb_mclassifier.fit(x_mclass_train, y_mclass_train)
y_nb_mclass = nb_mclassifier.predict(x_mclass_test)
nb_mclass_accuracy = accuracy_score(y_nb_mclass, y_mclass_test)
acc_mclass.append(nb_mclass_accuracy)
classifier_mclass.append(nb_mclassifier)
cmaps.append(plt.cm.Reds)

plt.errorbar(x=np.arange(len(bayes_type)), y=nb_acc_mean, yerr=nb_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Likelihood Distribution')
plt.xticks(np.arange(len(bayes_type)), labels=['Gaussian', 'Bernoulli'])
plt.tight_layout()
plt.show()
  4. K-Nearest Neighbors (KNN): KNeighborsClassifier, tuning the number of neighbors to retrieve (n_neighbors)
In [57]:
# k-nearest neighbors majority voting
model_name_mclass.append('KNN')
n_neighs = [5, 10, 20, 25, 50]
knn_acc_mean, knn_acc_std = [[], []]
for num_neigh in n_neighs:
  current_acc = []
  for _ in range(10):
    knn_mclassifier = KNeighborsClassifier(n_neighbors=num_neigh)
    knn_mclassifier.fit(x_mclass_train, y_mclass_train)
    y_knn_mclass = knn_mclassifier.predict(x_mclass_test)
    knn_mclass_accuracy = accuracy_score(y_knn_mclass, y_mclass_test)
    current_acc.append(knn_mclass_accuracy)
  current_acc = np.array(current_acc)
  knn_acc_mean.append(np.mean(current_acc))
  knn_acc_std.append(np.std(current_acc))

knn_acc_mean = np.array(knn_acc_mean)
knn_acc_std = np.array(knn_acc_std)
knn_best_idx = np.argmax(knn_acc_mean)
knn_mclassifier = KNeighborsClassifier(n_neighbors=n_neighs[knn_best_idx])
knn_mclassifier.fit(x_mclass_train, y_mclass_train)
y_knn_mclass = knn_mclassifier.predict(x_mclass_test)
knn_mclass_accuracy = accuracy_score(y_knn_mclass, y_mclass_test)
acc_mclass.append(knn_mclass_accuracy)
classifier_mclass.append(knn_mclassifier)
cmaps.append(plt.cm.Oranges)

plt.errorbar(x=np.arange(len(n_neighs)), y=knn_acc_mean, yerr=knn_acc_std, marker='o', linestyle='-')
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors')
plt.xticks(np.arange(len(n_neighs)), labels=n_neighs)
plt.tight_layout()
plt.show()
  5. Dummy Classifier (DUMMY): DummyClassifier

    (as a baseline)

In [58]:
# dummy classifier
model_name_mclass.append('DUMMY')
dummy_mclassifier = DummyClassifier()
dummy_mclassifier.fit(x_mclass_train, y_mclass_train)
y_dummy_mclass = dummy_mclassifier.predict(x_mclass_test)
dummy_mclass_accuracy = accuracy_score(y_dummy_mclass, y_mclass_test)
acc_mclass.append(dummy_mclass_accuracy)
classifier_mclass.append(dummy_mclassifier)
cmaps.append(plt.cm.Purples)

Finally we compare the results:

In [59]:
compare_multiclass_classification(model_name_mclass, classifier_mclass, cmaps, x_mclass_test,
                                  y_mclass_test, acc_mclass, class_names)

Regression

Given the original dataset, we now compare regression models on the task of predicting the amount of Sales.

K-Fold Cross Validation

In these experiments, instead of doing a static train and test split, we will evaluate the performance of the regression algorithms via k-fold cross-validation.

To do so, we will leverage the scikit-learn class KFold, which takes as input the number of folds into which we want to divide the dataset and a boolean indicating whether or not to shuffle the data prior to creating the chunks.

Calling its method split provides the train/test indices to split the data into train/test sets.
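As a tiny illustration of the interface, on ten fake data points:

    import numpy as np
    from sklearn.model_selection import KFold

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in kf.split(np.arange(10).reshape(-1, 1)):
      print(train_idx, test_idx)   # 8 training indices and 2 test indices per fold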

In [60]:
# import validation algorithm
from sklearn.model_selection import KFold
# import regression algorithms
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.dummy import DummyRegressor
# import evaluation algorithm
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

Pre-process and split into features and target:

In [61]:
reg_data = data.copy()
categorical_columns = reg_data.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
reg_data[list(categorical_columns)] = reg_data[list(categorical_columns)].apply(label_encoder.fit_transform)
In [62]:
x_reg = reg_data.loc[:, reg_data.columns != 'Sales']
y_reg = reg_data['Sales']

We will use the following metrics to compare the results of several regression algorithms:

  • Mean Absolute Error (MAE): mean_absolute_error in scikit-learn;

  • Mean Squared Error (MSE): mean_squared_error in scikit-learn;

  • Mean Absolute Percentage Error (MAPE): mean_absolute_percentage_error in scikit-learn.

All these functions take as input the ground-truth target vector and the predicted values.

For each of them, we will show a boxplot summarizing the results of the cross-validation procedure.
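As a quick sanity check of the three metrics on hand-made vectors (the values are made up):

    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

    y_true = np.array([100.0, 200.0, 300.0])
    y_pred = np.array([110.0, 190.0, 330.0])
    print(mean_absolute_error(y_true, y_pred))             # (10 + 10 + 30) / 3 = 16.67
    print(mean_squared_error(y_true, y_pred))              # (100 + 100 + 900) / 3 = 366.67
    print(mean_absolute_percentage_error(y_true, y_pred))  # (0.10 + 0.05 + 0.10) / 3 = 0.0833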

In [63]:
# dataframe for storing results in a convenient way for showing boxplots later
reg_metrics = pd.DataFrame(columns=['Model', 'MAE', 'MSE', 'MAPE'])

We will use the following regression algorithms:

  1. Random Forest (RF): RandomForestRegressor
In [64]:
# random forest
rf_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in rf_kfolds.split(x_reg):
  rf_regressor = RandomForestRegressor()
  rf_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
  y_rf_reg = rf_regressor.predict(x_reg.iloc[test_idx])
  current_rf_mae = mean_absolute_error(y_reg.iloc[test_idx], y_rf_reg)
  current_rf_mse = mean_squared_error(y_reg.iloc[test_idx], y_rf_reg)
  current_rf_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_rf_reg)
  reg_metrics.loc[len(reg_metrics)] = ['RF', current_rf_mae, current_rf_mse, current_rf_mape]
  2. Support Vector Machines (SVM): SVR
In [65]:
# support vector machine
sv_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in sv_kfolds.split(x_reg):
  sv_regressor = SVR()
  sv_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
  y_sv_reg = sv_regressor.predict(x_reg.iloc[test_idx])
  current_sv_mae = mean_absolute_error(y_reg.iloc[test_idx], y_sv_reg)
  current_sv_mse = mean_squared_error(y_reg.iloc[test_idx], y_sv_reg)
  current_sv_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_sv_reg)
  reg_metrics.loc[len(reg_metrics)] = ['SVM', current_sv_mae, current_sv_mse, current_sv_mape]
  3. K-Nearest Neighbors (KNN): KNeighborsRegressor
In [66]:
# k-nearest neighbors
knn_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in knn_kfolds.split(x_reg):
  knn_regressor = KNeighborsRegressor()
  knn_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
  y_knn_reg = knn_regressor.predict(x_reg.iloc[test_idx])
  current_knn_mae = mean_absolute_error(y_reg.iloc[test_idx], y_knn_reg)
  current_knn_mse = mean_squared_error(y_reg.iloc[test_idx], y_knn_reg)
  current_knn_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_knn_reg)
  reg_metrics.loc[len(reg_metrics)] = ['KNN', current_knn_mae, current_knn_mse, current_knn_mape]
  4. Dummy Regressor (DUMMY): DummyRegressor

    (as a baseline)

In [67]:
# dummy regressor
dummy_kfolds = KFold(n_splits=10, shuffle=True)
for train_idx, test_idx in dummy_kfolds.split(x_reg):
  dummy_regressor = DummyRegressor()
  dummy_regressor.fit(x_reg.iloc[train_idx], y_reg.iloc[train_idx])
  y_dummy_reg = dummy_regressor.predict(x_reg.iloc[test_idx])
  current_dummy_mae = mean_absolute_error(y_reg.iloc[test_idx], y_dummy_reg)
  current_dummy_mse = mean_squared_error(y_reg.iloc[test_idx], y_dummy_reg)
  current_dummy_mape = mean_absolute_percentage_error(y_reg.iloc[test_idx], y_dummy_reg)
  reg_metrics.loc[len(reg_metrics)] = ['DUMMY', current_dummy_mae, current_dummy_mse, current_dummy_mape]

Finally we compare the results:

In [68]:
fig, axs = plt.subplots(1, 3, figsize=(12, 5))
sns.boxplot(data=reg_metrics, x="MAE", y="Model", ax=axs[0], palette=sns.color_palette('Paired')[1::2])
sns.boxplot(data=reg_metrics, x="MSE", y="Model", ax=axs[1], palette=sns.color_palette('Paired')[1::2])
axs[1].set_yticklabels('')
axs[1].set_ylabel('')
sns.boxplot(data=reg_metrics, x="MAPE", y="Model", ax=axs[2], palette=sns.color_palette('Paired')[1::2])
axs[2].set_yticklabels('')
axs[2].set_ylabel('')
plt.tight_layout()
plt.show()