Intro to Hyperparameter tuning with Wine classification dataset.

In this post, we will learn about basic Hyperparameter tuning using GridSearch. Since I want to focus solely on this concept I will skip other pre-processing and feature selection techniques which are also important steps in a Machine Learning pipeline.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine

data = load_wine()
wine_data = pd.DataFrame(data.data, columns=data.feature_names)
wine_target = pd.DataFrame(data.target, columns=['wine_class'])

Here's a truncated glimpse of the dataframe

We will split the data into train and test sets so that we can do all of our EDA on the training set and keep the test set untouched.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(wine_data, wine_target, test_size=0.2, random_state=42)

Let's quickly generate summary statistics to get an overview of our data and find if there are any null values as well.

print(X_train.describe())
print(X_train.info())

         alcohol  malic_acid  ...  od280/od315_of_diluted_wines      proline
count  142.000000  142.000000  ...                    142.000000   142.000000
mean    12.979085    2.373521  ...                      2.592817   734.894366
std      0.820116    1.143934  ...                      0.722141   302.323595
min     11.030000    0.890000  ...                      1.270000   278.000000
25%     12.332500    1.615000  ...                      1.837500   502.500000
50%     13.010000    1.875000  ...                      2.775000   660.000000
75%     13.677500    3.135000  ...                      3.170000   932.750000
max     14.830000    5.800000  ...                      4.000000  1547.000000

[8 rows x 13 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 142 entries, 158 to 102
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       142 non-null    float64
 1   malic_acid                    142 non-null    float64
 2   ash                           142 non-null    float64
 3   alcalinity_of_ash             142 non-null    float64
 4   magnesium                     142 non-null    float64
 5   total_phenols                 142 non-null    float64
 6   flavanoids                    142 non-null    float64
 7   nonflavanoid_phenols          142 non-null    float64
 8   proanthocyanins               142 non-null    float64
 9   color_intensity               142 non-null    float64
 10  hue                           142 non-null    float64
 11  od280/od315_of_diluted_wines  142 non-null    float64
 12  proline                       142 non-null    float64

It looks like we are good to go as there are no null values or incompatible data types

Hyperparameter Search

Hyperparameters are features that are intrinsic to the model. For instance, in the case of KNeighbors classifiers number of neighbors and in the case of Decision tree classifier, number of trees is the hyperparameter. These are things that we cannot learn from the data and are unique to the particular model that we would use. Hyperparameters are also found to affect model performance and the way to know what is the best hyperparameters requires experimentation. The process of finding the best value of hyperparameters is called tuning. GridSearch and RandomSearch are common techniques for hyperparameter tuning. For this article, we will look at a very basic type of GridSearch where will only tune 1 parameter the n_neighbbors.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

k_values = [1,2,3,4,5,6,7]
mse_values = []

for k in k_values:
  knn = KNeighborsClassifier(n_neighbors=k, algorithm='brute')
  knn.fit(X_train, y_train)
  y_pred = knn.predict(X_test)
  mse_values.append(accuracy_score(y_test,y_pred))

print(mse_values)

Output:

[0.7777777777777778, 0.7222222222222222, 0.8055555555555556, 0.75, 0.7222222222222222, 0.7222222222222222, 0.6944444444444444]

Looks like, k=3 performed best for this dataset and higher values than this decreased model accuracy, we can safely choose this. In practice, you will want to use more robust Hyperparameter tuning techniques such as GridSearch and RandomSearch where you can combine lots of different combinations of hyperparameters and evaluate model performance in one go, as we will see in the next post. Stay tuned friends.