Table of Contents: 

  1. Introduction
  2. Types of Hyperparameter Tuning
    1. Manual Search
    2. Grid Search CV
    3. Randomized Search CV
  3. Implementation of Randomized Search CV
  4. Conclusion

 

Introduction

  1. Ever wondered which parameter values to choose as inputs for a model?
  2. What should the maximum depth of a decision tree be?
  3. How many trees should a random forest have?
  4. How many layers should a neural network have, and how many neurons in each layer?
  5. What should the learning rate of a model be?

These questions, and many more like them, can be answered by Hyperparameter Tuning.

So now the question arises: what is a hyperparameter? Basically, it is a parameter whose value is set before training begins and which controls how the model learns; unlike the model's internal parameters, it is not learned from the data. We don't know in advance which values will give the best result, so we tune them. The process of finding the combination of hyperparameters that minimizes a predefined loss function and maximizes the accuracy of the model on the given data is known as Hyperparameter Tuning.
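To make the distinction concrete, here is a minimal sketch (using the iris toy dataset purely for illustration): max_depth and min_samples_leaf below are hyperparameters that we choose before training, while the split thresholds inside the fitted tree are ordinary parameters learned from the data.

    # Hyperparameters are set by us before training; parameters are learned during fit()
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # max_depth and min_samples_leaf are hyperparameters, chosen by us
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
    tree.fit(X, y)  # the split rules learned here are the model's parameters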

 

Types of Hyperparameter Tuning

We can tune a model using the following methods:

  1. Manual Search
  2. Grid Search CV
  3. Randomized Search CV

 

Manual Search Hyperparameter Tuning

As the name indicates, Manual Search tuning is the process of tuning the hyperparameters by hand, through trial and error and past experience.

This process is repeated until we find the hyperparameters that give the best model accuracy, as sketched below.
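A rough sketch of manual tuning (the dataset, model and candidate depths here are arbitrary placeholders): fit the model for a handful of hand-picked values and keep the one with the best validation score.

    # Illustrative manual search: try a few max_depth values by hand and compare scores
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    X, y = load_diabetes(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    for depth in [2, 4, 6, 8]:  # candidate values picked from intuition / experience
        model = DecisionTreeRegressor(max_depth=depth, random_state=0)
        model.fit(X_tr, y_tr)
        print(depth, model.score(X_val, y_val))  # keep the depth with the best score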

 

Grid Search CV

In Grid Search CV, a model is built for every combination of the given hyperparameter values, each model is evaluated, and the combination that gives the highest accuracy is returned.

In this technique every set of parameters is considered and the model accuracy is noted. The set of parameters that gives the highest accuracy is used to build the final model.
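A minimal usage sketch, assuming scikit-learn's GridSearchCV and an arbitrary toy dataset and parameter grid, might look like this:

    # Illustrative GridSearchCV: one model per parameter combination per CV fold
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    param_grid = {'n_estimators': [50, 100],
                  'max_depth': [5, 10, None]}

    grid = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
    grid.fit(X_train, y_train)  # 2 x 3 combinations x 3 folds = 18 CV fits
    print(grid.best_params_, grid.best_score_)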

Read more about Grid Search in Python here.

 

Randomized Search CV

In Randomized Search CV, hyperparameter combinations are sampled at random to find the best configuration for building the model.

Because it evaluates only a random subset of the combinations, it is faster than Grid Search CV, but it may settle for a slightly less accurate combination than an exhaustive grid search would find.
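To see why evaluating only a random subset is cheaper, the sketch below uses scikit-learn's ParameterGrid and ParameterSampler to count candidates in an example search space (the space itself is arbitrary): an exhaustive grid would fit 40 models, while a randomized search with n_iter=10 fits only 10.

    # Full grid vs a random sample of the same search space
    from sklearn.model_selection import ParameterGrid, ParameterSampler

    space = {'n_estimators': [100, 150, 200, 300],
             'max_depth': [10, 20, 40, 50, None],
             'max_features': ['sqrt', 'log2']}

    print(len(ParameterGrid(space)))   # 4 * 5 * 2 = 40 candidate combinations
    sampled = list(ParameterSampler(space, n_iter=10, random_state=42))
    print(len(sampled))                # only 10 candidates are actually evaluated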

 

Implementation of Randomized Search CV

 

    # Importing Libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import warnings
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.tree import DecisionTreeRegressor
    from xgboost import XGBRegressor
    from sklearn.model_selection import RandomizedSearchCV
    from scipy.stats import randint
    from sklearn.metrics import accuracy_score , r2_score
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler
    warnings.filterwarnings('ignore')

    # Importing Dataset
    # Download the red and white wine quality datasets from the UCI Machine Learning Repository
    red = pd.read_csv('winequality-red.csv', sep = ';')
    white = pd.read_csv('winequality-white.csv',sep=';')

    red.head(5)

   fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
0            7.4              0.70         0.00  ...       0.56      9.4        5
1            7.8              0.88         0.00  ...       0.68      9.8        5
2            7.8              0.76         0.04  ...       0.65      9.8        5
3           11.2              0.28         0.56  ...       0.58      9.8        6
4            7.4              0.70         0.00  ...       0.56      9.4        5

    white.head()

   fixed acidity  volatile acidity  citric acid  ...  sulphates  alcohol  quality
0            7.0              0.27         0.36  ...       0.45      8.8        6
1            6.3              0.30         0.34  ...       0.49      9.5        6
2            8.1              0.28         0.40  ...       0.44     10.1        6
3            7.2              0.23         0.32  ...       0.40      9.9        6
4            7.2              0.23         0.32  ...       0.40      9.9        6

    # Combining both data
    red['Type'] = 'Red'
    white['Type'] = 'White'

    df = pd.concat([red,white], axis=0)
    df.head(5)

   fixed acidity  volatile acidity  citric acid  ...  alcohol  quality  Type
0            7.4              0.70         0.00  ...      9.4        5     1
1            7.8              0.88         0.00  ...      9.8        5     1
2            7.8              0.76         0.04  ...      9.8        5     1
3           11.2              0.28         0.56  ...      9.8        6     1
4            7.4              0.70         0.00  ...      9.4        5     1

    df.describe()

       fixed acidity  volatile acidity  ...      quality         Type
count    6497.000000       6497.000000  ...  6497.000000  6497.000000
mean        7.215307          0.339666  ...     5.818378     0.246114
std         1.296434          0.164636  ...     0.873255     0.430779
min         3.800000          0.080000  ...     3.000000     0.000000
25%         6.400000          0.230000  ...     5.000000     0.000000
50%         7.000000          0.290000  ...     6.000000     0.000000
75%         7.700000          0.400000  ...     6.000000     0.000000
max        15.900000          1.580000  ...     9.000000     1.000000

    df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6497 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  Type                  6497 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 710.6 KB

    # check missing values
    df.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
Type                    0
dtype: int64

    df['Type'].value_counts()

white    4898
red      1599
Name: Type, dtype: int64

    df['quality'].value_counts()

6    2836
5    2138
7    1079
4     216
8     193
3      30
9       5
Name: quality, dtype: int64

As we can see, the quality classes are imbalanced; we will have to handle this later.

    # correlation between features
    corr = df.corr(method='spearman')
    plt.figure(figsize=(20,10))
    sns.heatmap(corr, annot=True)


Figure 1: Correlation plot for selecting best variables

 

From the heatmap in Figure 1, it is clear that the features most correlated with quality are alcohol, density, free sulfur dioxide, chlorides and citric acid.
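If you prefer to read the ranking off programmatically rather than from the heatmap, the same information can be pulled out of the corr matrix computed above (what counts as "most correlated" remains a judgment call):

    # Rank features by the absolute strength of their correlation with quality
    print(corr['quality'].drop('quality').abs().sort_values(ascending=False))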

    # Selecting the features and the target variable
    X = df[['alcohol', 'density', 'free sulfur dioxide', 'chlorides','citric acid']]
    y = df['quality']

    # Balancing the dataset with SMOTE
    # k_neighbors=4 because the rarest class (quality 9) has only 5 samples,
    # so at most 4 neighbours are available when generating synthetic points
    oversample = SMOTE(k_neighbors=4)
    # transform the dataset
    X, y = oversample.fit_resample(X, y)

    # Split into train and test sets
    X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.25, random_state=42)

    # Feature Scaling
    scaler = StandardScaler()
    # Fit the scaler on the training data only, to avoid leaking test-set statistics
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Build the Logistic Regression model
    lr = LogisticRegression()
    lr.fit(X_train,y_train)
    lr_pred = lr.predict(X_test)
    r2_score(y_test, lr_pred)

0.46038027394992365

    # Build the Random Forest model
    rf = RandomForestRegressor()
    rf.fit(X_train,y_train)
    rf_pred = rf.predict(X_test)
    r2_score(y_test, rf_pred)

0.916805568147207

    # Build the Decision Tree model
    dt = DecisionTreeRegressor()
    dt.fit(X_train,y_train)
    pred_dt = dt.predict(X_test)
    r2_score(y_test, pred_dt)

0.8533160326226503

    # Build the XGBoost model
    xgb = XGBRegressor()
    xgb.fit(X_train,y_train)
    xg_pred = xgb.predict(X_test)
    r2_score(y_test, xg_pred)

0.8705397134271582

The R² score of the Random Forest Regressor is the highest, at 91.68%. Now, let's tune its hyperparameters and see how that affects the score.

    # Hyperparameter Tuning: define the search space for the Random Forest
    rs = {'n_estimators': [100,150,200,300],                 # number of trees
          'max_features': ['auto', 'sqrt'],                  # features considered at each split
          'max_depth': [10, 20, 40, 50, 60, 80, 100, None],  # maximum depth of each tree
          'min_samples_leaf': randint(1,8),                  # sampled uniformly from 1 to 7
          'bootstrap': [True,False]}                         # sample rows with or without replacement

    random_rf = RandomizedSearchCV(estimator=rf, param_distributions=rs, n_iter=100,
                                   cv=3, verbose=2, random_state=42, n_jobs=-1)
    # Fit the random search model
    random_rf.fit(X_train, y_train)

RandomizedSearchCV(cv=3, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 20, 40, 50, 60, 80,
                                                      100, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': <scipy.stats._distn_infrastructure.rv_frozen object at 0x...>,
                                        'n_estimators': [100, 150, 200, 300]},
                   random_state=42, verbose=2)

    random_rf.best_params_

{'bootstrap': False,
 'max_depth': 50,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'n_estimators': 300}

    random_rf.best_estimator_

RandomForestRegressor(bootstrap=False, max_depth=50, max_features='sqrt', n_estimators=300)

    # Rebuild the Random Forest with the best hyperparameters found above
    rf1 = RandomForestRegressor(bootstrap=False, max_depth=50, max_features='sqrt',
                                min_samples_leaf=1, n_estimators=300)
    
    rf1.fit(X_train,y_train)

RandomForestRegressor(bootstrap=False, max_depth=50, max_features='sqrt',
                      n_estimators=300)

    rf_pr = rf1.predict(X_test)
    r2_score(y_test, rf_pr)

0.9322745287584387
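As a side note, because RandomizedSearchCV uses refit=True by default, random_rf has already been refit on the best parameters, so an equivalent shortcut (results may differ slightly from rf1 above, since no random_state is fixed for the forest) is to predict with the search object directly:

    # RandomizedSearchCV refits the best estimator by default (refit=True),
    # so predictions can be made straight from the fitted search object
    rf_best_pred = random_rf.predict(X_test)
    r2_score(y_test, rf_best_pred)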

Conclusion

With the help of hyperparameter tuning, the model's R² score increased by about 2 percentage points, from 91.68% to 93.22%.

Randomized Search CV is faster than Grid Search CV. So, for large datasets we can use Randomized Search CV instead of Grid Search, but if the highest possible accuracy is needed, then we should go with Grid Search.

 

 

About the Authors:

Sachin Kumar Gupta

Sachin is a Mechanical Engineer and a data science enthusiast. He loves to find trends in data and extract useful information from it. He has executed projects on Machine Learning and Deep Learning using Python.

 

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore. He has completed his MBA from the University of Pune and a Bachelor of Science (Statistics) from the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and Business Owner. He has more than 18 years' experience in the field of Analytics and has worked as an Analytics SME in domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.