Table of Contents

  1. Inspiration
  2. Dataset Description
  3. Installing Dependency Packages in Python
  4. Loading Dataset
  5. Import Required Libraries
  6. Explore Dataset
  7. Build Model using Pycaret
  8. Evaluate Model
  9. Stack Model
  10. Save Model for Reuse
  11. Conclusion

 

 

Inspiration

The objective of this project is to predict how long a car on a production line will take to pass the testing phase. Knowing the predicted time helps reduce the time cars spend on the test bench. This leads to higher production efficiency through faster testing, without compromising standards.

This project builds that predictive capability using machine learning and various algorithms to help the manufacturing industry increase production. It is a regression problem. In this project we will explore advanced techniques of the low-code PyCaret library, such as

  1. Dimensionality reduction with PCA (Principal Component Analysis)
  2. Transforming the target
  3. Handling low-variance features
  4. Combining rare levels
  5. Stacking models

 

Dataset Description

  1. The dataset contains masked features, each representing a custom feature in a Mercedes car, and each row represents a separate car. For example, a variable could be 4WD, added air suspension, or a head-up display.
  2. The target column is 'y', which represents the time taken (in seconds) by the car to pass testing.
  3. For validation purposes the test data is available on Kaggle. You can get a test score by submitting test predictions on the Kaggle website where this competition was held. For ease of use, however, we are going to use only the train data from Kaggle and simulate the test data from it. We have kept a copy of the train data in the Loading Dataset section.

 

Installing Dependency Packages in Python

We install packages in Python by using "pip install" followed by the package name. For example, write "!pip install pycaret" and run that cell when you are using a Jupyter notebook.

If you are using an editor, you have to modify the script accordingly. For instance, if you are using the Anaconda distribution you can use Navigator to install the package, or you can open the Anaconda prompt and type "pip install pycaret". It is better to create a new environment in Anaconda Navigator or the command prompt with Python 3.6, so that you avoid disturbing the existing dependencies built for your other projects. A one-shot install sketch is given after the package list below.

Required packages for this project are as below:

  1. Pandas – to perform dataframe operations
  2. Numpy – to perform array operations
  3. Matplotlib – to plot charts
  4. Seaborn – to plot charts
  5. Missingno – to analyse missing values
  6. Pycaret – to build and save the model
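
If you want to set everything up in one go, below is a minimal install sketch; the PyCaret version pin (2.3.0) is an assumption taken from the version check later in this write-up, and the conda environment name is just an example.

    # Optionally create a fresh environment first (from the Anaconda prompt):
    #   conda create -n pycaret_env python=3.6
    #   conda activate pycaret_env

    # From a Jupyter notebook cell (drop the leading "!" in a terminal)
    !pip install pandas numpy matplotlib seaborn missingno pycaret==2.3.0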

 

Loading Dataset

Click on this link to download the data set

 

Import Required Libraries

    # Lets import required libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    import warnings
    warnings.filterwarnings('ignore')

    %matplotlib inline

    # reading the dataset
    df = pd.read_csv('mercedes-benz-production-optimization-data.csv.csv')

Explore Dataset

As our problem is a regression problem, we will first analyse the target variable. We will then have a look at the dataset features and see how they help to predict the target.

    df.info()

RangeIndex: 4209 entries, 0 to 4208
Columns: 378 entries, ID to X385
dtypes: float64(1), int64(369), object(8)
memory usage: 12.1+ MB

 

    # the data has lots of features, so set options to view all columns
    pd.options.display.max_columns = 400

    df.head() # output not displayed in this write-up; in a notebook you may get a scrolling tab, or an editor may show the first 5 records

    # Visualize the target variable 
    plt.scatter(range(len(df)), np.sort(df.y.values), alpha=0.3)


Figure 1 : Scatter plot of target variable

    # Distribution of target variable
    df[['y']].boxplot(figsize=(10,8))


Figure 2 : Visualize distribution of target variable using boxplot

    # Have a look at the histogram to check for skewness
    df[['y']].hist(figsize=(10,8),bins=70)


Figure 3 : Histogram of target variable

 

From the above we conclude that:

  1. We have 4209 rows and 378 columns: 1 float column, which is the target variable; 369 int columns, mostly 0/1 indicator values which are effectively categorical; and 8 object (categorical) columns with multiple categories.
  2. Our target variable 'y' does not follow a normal distribution. Transforming the target towards a normal distribution is the right approach in this case (a small Box-Cox sketch follows Figure 4 below).
  3. We have detected one datapoint lying at around 250 seconds. Even though it is a valid value, it is far away from all other datapoints, so we treat it as an outlier and remove that row. If we don't, it will distort our regression analysis.
    # remove instance/record with extreme value in target variable
    df = df[df.y < 225]

    # Visually check the target variable 
    plt.scatter(range(len(df)), np.sort(df.y.values), alpha=0.3)


Figure 4 : Scatter plot of dependent variable without outlier data
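
PyCaret will apply the target transformation for us later through the transform_target option, but as a quick, hedged illustration of point 2 above, the sketch below applies a Box-Cox transform to the target and plots the result; scipy is assumed to be available and is otherwise not used in this project.

    # Illustration only: PyCaret does this internally when transform_target=True
    from scipy import stats

    # Box-Cox needs strictly positive values, which holds for the test times in seconds
    y_boxcox, lmbda = stats.boxcox(df['y'])
    pd.Series(y_boxcox).hist(figsize=(10,8), bins=70)
    print('Estimated Box-Cox lambda:', lmbda)

The transformed histogram should look much closer to a bell shape than Figure 3.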

We now know that we have 378 columns in our dataset. Having a large number of features may not help in getting a generalised model; in the data science world this situation is known as the curse of dimensionality. To build a generalised model we should use dimensionality reduction techniques, and we can also drop features with low variance, which reduces the number of features that don't help the regression. Handling low-variance features is very easy with PyCaret: we just have to set an option to true. If you want to learn how to handle low-variance features with pandas, the code is shared at the end of this project.
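
As a side exploration, and only to build intuition, the sketch below runs a plain linear PCA from scikit-learn on the 0/1 feature columns to see how many components would retain most of the variance; the 95% threshold is an arbitrary choice for illustration, and PyCaret's own PCA inside setup is what the project actually uses.

    # Illustration only: linear PCA on the binary feature columns
    from sklearn.decomposition import PCA

    binary_cols = df.select_dtypes(include='int64').columns.drop('ID')
    pca = PCA(n_components=0.95)   # keep enough components to explain 95% of the variance
    reduced = pca.fit_transform(df[binary_cols])
    print('Original features:', len(binary_cols), '-> components kept:', reduced.shape[1])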

 

Let's Build the Model using PyCaret

PyCaret is a very simple, low-code library, though you should still know your dataset and what kind of treatment it requires. In the dataset exploration above we found some interesting facts about what we should do in preprocessing. PyCaret will build a pipeline for all the options you turn on. As an ML engineer you need to know what to do with the dataset to get the desired output, otherwise it becomes a garbage-in, garbage-out situation. As our problem is regression, we import the regression module from PyCaret as below.

    from pycaret.regression import *
    import pycaret

    # Let's verify our PyCaret version, as only the same version will reproduce the same results
    pycaret.__version__
    

'2.3.0'

Generally we split the dataset into three parts, for example in 70:20:10 proportion, for training, testing and validation. Remember that the train-test split is also known as the hold-out method. The validation split is the part that stays unknown to the model during its building and testing phases; it is used to check whether the metrics on validation data match the metrics we got while building and testing. With PyCaret, the hold-out (train-test) split is done automatically by setup, where we can also specify the proportion if we want to change the default of 70% train and 30% test. We first set aside 10% of the data for validation as below.

    train_data = df.sample(frac=0.9, random_state=5)
    validation_data = df.drop(train_data.index)

    train_data.reset_index(drop=True, inplace=True)
    validation_data.reset_index(drop=True, inplace=True)

    print('Data for Modeling: ' + str(train_data.shape))
    print('Unseen Data For Predictions ' + str(validation_data.shape))

Data for Modeling: (3787, 378)
Unseen Data For Predictions (421, 378)

 

A very important part of the PyCaret library is its setup. We will define the setup with utmost care, because what and how we feed it decides what we get.

    reg = setup(data = train_data, target = 'y', train_size=0.8,
                ignore_features=['ID'], session_id=20,
                normalize=True, pca=True, pca_method='kernel', 
                transform_target=True, ignore_low_variance = True, 
                combine_rare_levels = True, remove_outliers=True)

Note:

Here we specified the train size as 80%, which means the test size will be the remaining 20% of the data passed to setup. If we don't specify the train size, it defaults to 70%.

  1. We have listed the 'ID' feature in ignore_features because ID is unique for each datapoint, so it will not add value in regression analysis.
  2. session_id is just like a random seed in sklearn; it is used to control reproducibility so that every time we run this code we get the same result.
  3. Setting normalize to true will use the z-score method to standardise the data. Standardisation is important so that features are on a comparable scale, and it can reduce calculation time for some algorithms such as linear regression.
  4. Setting pca to True will use the dimensionality-reduction technique PCA, and setting pca_method to 'kernel' will use a PCA transformation with the RBF kernel. With PyCaret we have three options for pca_method, i.e. linear, kernel and incremental (a replacement for 'linear' when the dataset is too large).
  5. Setting transform_target to true will transform the target using the Box-Cox method, which helps to build a generalised regression model. Prediction results have usually improved after using this transformation.
  6. ignore_low_variance will ignore low-variance features while building the model. This helps to reduce the dimensions and complexity of the model by removing features that don't add much to its predictive capability.
  7. Setting combine_rare_levels to true will combine levels that appear very rarely in a column into a single level, which reduces the number of category levels. This in turn helps to reduce the complexity of the model (a small pandas illustration of this idea follows the list).
  8. Setting remove_outliers to true will remove datapoints that lie far away from the bulk (about 95%) of the data. There is no need to mention missing-value imputation, as PyCaret handles missing values by default by imputing them automatically.
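
To make point 7 concrete, here is a rough pandas illustration of the combine_rare_levels idea on one of the categorical columns (X0); PyCaret does this internally during setup, and the 1% threshold and the 'rare_combined' label below are arbitrary choices for the example, not PyCaret's exact defaults.

    # Illustration only: collapse rarely occurring levels of a categorical column into one level
    col = 'X0'
    freq = df[col].value_counts(normalize=True)
    rare_levels = list(freq[freq < 0.01].index)
    df_demo = df.copy()
    df_demo[col] = df_demo[col].replace(rare_levels, 'rare_combined')
    print('Levels before:', df[col].nunique(), ' after:', df_demo[col].nunique())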
    # Lets verify available models list for regression
    models()


Figure 5 : Pycaret Model

Some models may not be shown in the output because, in the latest PyCaret versions, you have to install models such as catboost separately. Once installed they will appear in the models list; otherwise only the default models are used. Also note that if you are using Google Colab, Binder or a notebook you will see this output directly; otherwise you will have to inspect the returned object manually.
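
If you want those extra estimators to appear in the models() output, installing them separately should be enough; below is a minimal sketch for catboost, and your environment may need other optional packages as well.

    # Optional: install extra estimators so they show up in models() (notebook cell)
    !pip install catboost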

The compare_models() function will compare all the above models that have turbo set to true; if we want to exclude time-consuming models we can do so by passing a list of models to the exclude parameter.

    cm = compare_models(exclude = ['catboost','xgboost','lightgbm'], n_select=3)

 


Figure 6 : Top 3 performing models saved

We have mentioned n_select=3 in the compare_models call and assigned the result to the cm variable. This saves the top 3 performing models as a list in cm, which we can use for further operations.

    # Lets verify what cm variable contains
    cm

[PowerTransformedTargetRegressor(alpha_1=1e-06, alpha_2=1e-06, alpha_init=None,
                                 compute_score=False, copy_X=True,
                                 fit_intercept=True, lambda_1=1e-06,
                                 lambda_2=1e-06, lambda_init=None, n_iter=300,
                                 normalize=False,
                                 power_transformer_method='box-cox',
                                 power_transformer_standardize=True,
                                 regressor=BayesianRidge(alpha_1=1e-06,
                                                         alpha_2=1e-06,
                                                         alpha_init=None,
                                                         compute_score=False,
                                                         copy_X=True,
                                                         fit_intercept=True,
                                                         lambda_1=1e-06,
                                                         lambda_2=1e-06,
                                                         lambda_init=None,
                                                         n_iter=300,
                                                         normalize=False,
                                                         tol=0.001,
                                                         verbose=False),
                                 tol=0.001, verbose=False),
 PowerTransformedTargetRegressor(copy_X=True, fit_intercept=True, n_jobs=-1,
                                 normalize=False,
                                 power_transformer_method='box-cox',
                                 power_transformer_standardize=True,
                                 regressor=LinearRegression(copy_X=True,
                                                            fit_intercept=True,
                                                            n_jobs=-1,
                                                            normalize=False)),
 PowerTransformedTargetRegressor(copy_X=True, eps=2.220446049250313e-16,
                                 fit_intercept=True, fit_path=True, jitter=None,
                                 n_nonzero_coefs=500, normalize=True,
                                 power_transformer_method='box-cox',
                                 power_transformer_standardize=True,
                                 precompute='auto', random_state=20,
                                 regressor=Lars(copy_X=True,
                                                eps=2.220446049250313e-16,
                                                fit_intercept=True,
                                                fit_path=True, jitter=None,
                                                n_nonzero_coefs=500,
                                                normalize=True,
                                                precompute='auto',
                                                random_state=20, verbose=False),
                                 verbose=False)]

We can use the create_model() function when we are sure which model we want to use and don't want to spend time comparing all models.

    # For instance if we are sure to use svm
    svm_model = create_model('svm')


Figure 7 : SVM Model Performance Parameters (if using editors like Spyder, you may not get this output)

    predict_model(cm[0])
    # cm[0] is the first model (Bayesian ridge regression) in the cm list; by indexing we use that model for prediction.
    # In the same way we can access the second model, i.e. linear regression, by using cm[1]


Figure 8 : Prediction on Test Data

 

Evaluate Model

    evaluate_model(cm[0])


Figure 9 : Evaluate residuals


Figure 10 : Evaluate Prediction Errors

 

  1. We predicted on the test data that was split off during setup. We can now compare metrics: if there is not much difference between the train metrics and the test (prediction) metrics, then our model is not overfitted, and after evaluating the plots we can use the same model to predict the target for new data.
  2. With the evaluate_model() function we can evaluate the model's residuals plot, error plot, feature importance, etc. This helps to decide whether or not to choose the model for deployment (a plot_model() sketch follows this list).
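
If evaluate_model()'s interactive widget does not render in your environment (for instance in an editor like Spyder), the same plots can be drawn one at a time with PyCaret's plot_model() function; a minimal sketch is below.

    # Non-interactive alternative to evaluate_model(): draw individual diagnostic plots
    plot_model(cm[0], plot='residuals')   # residuals plot
    plot_model(cm[0], plot='error')       # prediction error plot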

 

Stack Model

Stacking of models is also possible with PyCaret in a very easy way. First of all we should know what stacking means. Stacking is a method in which several top-performing models are combined into a single model: the base models make predictions, and a meta-model is trained on those predictions to produce the final output. Let's see how to stack the models.
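
The stacking call itself is not shown in the original write-up, so the line below is a plausible reconstruction rather than the author's exact code: it assumes the top three models stored in cm are the base estimators and that the svm_model created earlier is used as the meta-model, which is consistent with the SVR final_estimator visible in the saved pipeline further down.

    # Reconstruction (assumption): stack the top 3 models, with the SVM model as the meta-learner
    stacked = stack_models(cm, meta_model=svm_model)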


Figure 11 : Stacked Model Metrics

 

We can see a big difference in the MAE metric after stacking the models. Individually we were getting an MAE of approximately 5.3; with the stacked model it has reduced to approximately 4.9.

We will now predict the testing time on the production line for the hold-out test data.

    # Lets check the test score
    predict_model(stacked)



Figure 12 : Stacked Model Predictions

We now predict the target variable, i.e. the testing time on the production line, for new unseen data simply by passing the stacked model and the validation data to the predict_model() function. That's it.

    new_prediction = predict_model(stacked,validation_data)
    new_prediction


Figure 13 : Validation Data Output

    # Lets verify metrics of prediction on validation dataset.

    from pycaret.utils import check_metric
    newpredMAE = check_metric(new_prediction.y, new_prediction.Label, 'MAE')
    newpredMSE = check_metric(new_prediction.y, new_prediction.Label, 'MSE')
    newpredR2 = check_metric(new_prediction.y, new_prediction.Label, 'R2')

    print('New Prediction Metrics is as below: \n MAE = {}, \n MSE = {},\n R2 = {}'.format(newpredMAE,newpredMSE,newpredR2))

New Prediction Metrics is as below:

MAE = 5.131,

MSE = 68.0641,

R2 = 0.5641

Our validation score is close to the test score, so we can say this is a generalised model.

 

Save Model for Reuse

    save_model(stacked, 'stacked_model_time_prediction_mercedes_benze')

Transformation Pipeline and Model Succesfully Saved

(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=False,
                                       features_todrop=['ID'], id_columns=[],
                                       ml_usecase='regression',
                                       numerical_features=[], target='y',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strategy=...
                                                                                                eps=2.220446049250313e-16,
                                                                                                fit_intercept=True,
                                                                                                fit_path=True,
                                                                                                jitter=None,
                                                                                                n_nonzero_coefs=500,
                                                                                                normalize=True,
                                                                                                precompute='auto',
                                                                                                random_state=20,
                                                                                                verbose=False))],
                                                                              final_estimator=SVR(C=1.0,
                                                                                                  cache_size=200,
                                                                                                  coef0=0.0,
                                                                                                  degree=3,
                                                                                                  epsilon=0.1,
                                                                                                  gamma='scale',
                                                                                                  kernel='rbf',
                                                                                                  max_iter=-1,
                                                                                                  shrinking=True,
                                                                                                  tol=0.001,
                                                                                                  verbose=False),
                                                                              n_jobs=-1,
                                                                              passthrough=True,
                                                                              verbose=0),
                                                  verbose=0)]],
          verbose=False),
 'stacked_model_time_prediction_mercedes_benze.pkl')

    # Lets load the transformation pipeline and Model saved earlier for use
    stack = load_model('stacked_model_time_prediction_mercedes_benze')

With this we have finished using PyCaret to predict the "Testing time on Production Line" for the Mercedes-Benz dataset. Don't stop here: although we set ignore_low_variance = True in the PyCaret setup, ideally we would remove the zero-variance features ourselves. Why don't you redo this whole process after making sure that the zero-variance variables are removed? To help you in that direction, we have compiled a small recipe, half cooked though, so that you can explore and learn more.

First, find the zero-variance features by calculating the variance of each feature, so that we can drop the features with zero variance. Note that we use the full dataset here, as we have to treat the whole dataset before any split.

    feature_data_var_df = pd.DataFrame(df.var(axis=0))
    feature_data_var_df[feature_data_var_df[0]==0]

    # lets define the list of columns to drop (those with zero variance)
    cols_to_drop = feature_data_var_df[feature_data_var_df[0]==0].index.tolist()

    # lets drop the columns not required from drop list
    df = df.drop(cols_to_drop,axis=1)
    print(df.shape)

Conclusion

We have built this project using advanced concepts such as dimensionality reduction with PCA, handling of low-variance features, combining rare category levels and stacking of models, all in a very easy way with the help of PyCaret. If we had to do this with sklearn alone, we would need a lot of code to build the pipelines. We found that stacking models helps to improve the metrics. Also, our validation score matches the test score quite well, which is an indication that we have built a good generalised model. This is a real problem and a real approach, covered end to end in this project. Please do like and comment, and if you find any issues in this project, please let us know. Also, if you are just getting started with PyCaret, you should read "Loan Approval Case Study in Python: A step by step guide on how to create a loan approval model in python using Pycaret".

 

 

About the Author:

Ganesh Kumar

 

Ganeshkumar Patel is a Mechanical Engineer who applies data science to process improvement. He loves to play with data and algorithms. In his spare time he likes to pen down his knowledge as a blogger.

 

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore; he has completed his MBA from the University of Pune and a Bachelor of Science (Statistics) from the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and Business Owner. He has more than 18 years' experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.