To understand Lasso Regression, we first need to understand regularization.

Regularization is a technique used to avoid overfitting by adding a penalty to the model's loss function. The penalty reduces the model's variance, so the model is likely to predict better on unseen test data.

In simple terms, it shrinks the model's parameters and simplifies the model, which reduces overfitting and typically gives better predictions on test data.

For example, suppose we train a model on a training set that has only two points. A straight line fits them perfectly, so the residuals are zero. But when we evaluate the model on the test set, the error is high, which means the model has overfitted and we have to correct for it.

It is often observed that a model reaches 100 percent accuracy on the training data, but its accuracy on the test data is substantially lower.
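
The two-point example above can be sketched with NumPy. The data values here are hypothetical and chosen only to make the effect visible: the line fits the training points exactly but misses the held-out points.

```python
import numpy as np

# Two training points: a straight line passes through both exactly.
x_train = np.array([1.0, 2.0])
y_train = np.array([1.0, 3.0])

# Hypothetical held-out points that follow a flatter trend than the fitted line.
x_test = np.array([3.0, 4.0, 5.0])
y_test = np.array([3.4, 4.6, 5.5])

# Ordinary least squares fit of a straight line (slope and intercept).
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def mse(x, y):
    """Mean squared error of the fitted line on the points (x, y)."""
    return float(np.mean((y - (slope * x + intercept)) ** 2))

train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)
print(f"train MSE: {train_mse:.4f}")
print(f"test MSE:  {test_mse:.4f}")
```

The training error is essentially zero while the test error is large, which is the overfitting pattern that regularization tries to reduce.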

## Types of Penalties

Regularization works by shrinking the model's coefficients toward zero. In simple words, it shrinks the slope of the fitted line to find a fit that generalizes better. There are two types of penalties, described below.

### L1 regularization

It adds a penalty proportional to the absolute value (modulus) of each coefficient. In this process, some coefficients can become exactly zero, which eliminates the corresponding features from the model.

### L2 regularization

It adds a penalty proportional to the square of each coefficient. This shrinks all coefficients toward zero to find a better fit for the model, but it generally does not set any of them to exactly zero.
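
As a small illustration of the two penalties described above, the sketch below computes both for one coefficient vector. The vector `coef` and the strength `alpha` are hypothetical values, not taken from the model trained later in this article.

```python
import numpy as np

# Hypothetical coefficient vector and penalty strength, for illustration only.
coef = np.array([3.0, -0.5, 0.0, 2.0])
alpha = 0.1

# L1 penalty: alpha times the sum of absolute values of the coefficients.
l1_penalty = alpha * np.sum(np.abs(coef))

# L2 penalty: alpha times the sum of squared coefficients.
l2_penalty = alpha * np.sum(coef ** 2)

print("L1 penalty:", l1_penalty)
print("L2 penalty:", l2_penalty)
```

Because the L2 penalty squares each coefficient, large coefficients dominate it, while the L1 penalty grows linearly with each coefficient's size.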

## Regularization Technique

There are two main regularization techniques:

1. Ridge Regression
2. Lasso Regression

They differ in how they penalize the coefficients: Ridge uses the L2 penalty and Lasso uses the L1 penalty. In this article, we will learn about the Lasso regularization technique.
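
To make the difference between the two techniques concrete, here is a minimal sketch on synthetic data, assuming scikit-learn is installed. The data, the random seed, and the `alpha` values are arbitrary choices for illustration and are not tuned.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data: 10 features, but only the first two influence the target.
X = rng.normal(size=(200, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Penalty strengths are illustrative assumptions, not tuned values.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# Count how many coefficients each model set to exactly zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("coefficients set to zero by Ridge:", n_zero_ridge)
print("coefficients set to zero by Lasso:", n_zero_lasso)
```

On data like this, Ridge keeps every coefficient nonzero while Lasso removes uninformative features entirely, which is why Lasso is often used for feature selection.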

## Lasso Regression

“LASSO” stands for Least Absolute Shrinkage and Selection Operator. This model uses shrinkage: adding a penalty shrinks the coefficient estimates toward zero, and the coefficients of features that contribute little are shrunk to exactly zero.

Lasso uses the L1 regularization penalty. This type of regression is well suited to data with high levels of multicollinearity, or to cases where we want to automate parts of model selection, such as feature selection or parameter elimination.
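
The shrinkage described above can be illustrated with the soft-thresholding operator. Under the simplifying assumption of orthonormal features, the Lasso solution is obtained by applying this operator to the unpenalized coefficients. The function name `soft_threshold` and the numbers below are hypothetical and only serve as a sketch.

```python
import numpy as np

def soft_threshold(coef, threshold):
    """Move each coefficient toward zero by `threshold`; smaller ones become zero."""
    return np.sign(coef) * np.maximum(np.abs(coef) - threshold, 0.0)

# Hypothetical unpenalized coefficients, for illustration only.
coef = np.array([2.5, -0.3, 0.8, -1.6])
shrunk = soft_threshold(coef, threshold=1.0)
print(shrunk)
```

The two coefficients smaller than the threshold in absolute value become zero, and the two larger ones move one unit closer to zero while keeping their sign.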

We can represent the Lasso loss function mathematically as follows.

Figure 1: Mathematical formulation of the LASSO loss function
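
A standard way to write the Lasso objective is given here as a sketch. The symbols are assumptions introduced for this formula: $y_i$ is the target, $x_{ij}$ the value of feature $j$ for sample $i$, $\beta_j$ the coefficients, and $\lambda \ge 0$ the penalty strength.

$$
L(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\,\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

The first term is the usual sum of squared residuals and the second term is the L1 penalty. In scikit-learn's `Lasso`, the `alpha` parameter plays the role of $\lambda$, and the squared-error term is scaled by $1/(2n)$.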

## Example Of Lasso Regression

```
import datetime  # used later to compute each vehicle's age

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score as ac  # R² score, aliased as `ac`
```

Download the Vehicle Price dataset, then load it:

```
price_data = pd.read_csv('vehicle_price.csv')
price_data.head()

Unnamed: 0                              Name  ...  New_Price  Price
0           0            Maruti Wagon R LXI CNG  ...        NaN   1.75
1           1  Hyundai Creta 1.6 CRDi SX Option  ...        NaN  12.50
2           2                      Honda Jazz V  ...  8.61 Lakh   4.50
3           3                 Maruti Ertiga VDI  ...        NaN   6.00
4           4   Audi A4 New 2.0 TDI Multitronic  ...        NaN  17.74

```
```
price_data.info()

RangeIndex: 6019 entries, 0 to 6018
Data columns (total 14 columns):
#   Column             Non-Null Count  Dtype
---  ------             --------------  -----
0   Unnamed: 0         6019 non-null   int64
1   Name               6019 non-null   object
2   Location           6019 non-null   object
3   Year               6019 non-null   int64
4   Kilometers_Driven  6019 non-null   int64
5   Fuel_Type          6019 non-null   object
6   Transmission       6019 non-null   object
7   Owner_Type         6019 non-null   object
8   Mileage            6017 non-null   object
9   Engine             5983 non-null   object
10  Power              5983 non-null   object
11  Seats              5977 non-null   float64
12  New_Price          824 non-null    object
13  Price              6019 non-null   float64
dtypes: float64(2), int64(3), object(9)
memory usage: 658.5+ KB

```
```
price_data.isnull().sum()

Unnamed: 0              0
Name                    0
Location                0
Year                    0
Kilometers_Driven       0
Fuel_Type               0
Transmission            0
Owner_Type              0
Mileage                 2
Engine                 36
Power                  36
Seats                  42
New_Price            5195
Price                   0
dtype: int64

```
```
# Drop the unnamed index, New_Price and Location columns
price_data = price_data.drop(['Unnamed: 0', 'New_Price','Location'], axis=1)
price_data = price_data.dropna()
price_data = price_data.reset_index(drop=True)

price_data['Fuel_Type'].value_counts()

Diesel    3195
Petrol    2714
CNG         56
LPG         10
Name: Fuel_Type, dtype: int64

price_data['Transmission'].value_counts()

Manual       4266
Automatic    1709
Name: Transmission, dtype: int64

price_data['Owner_Type'].value_counts()

First             4903
Second             953
Third              111
Fourth & Above       8
Name: Owner_Type, dtype: int64

```
```
# Split some columns to create new features
train_df = price_data.copy()
name = train_df['Name'].str.split(" ", n=2, expand=True)
train_df['Company'] = name[0]  # first word of the name
train_df['Model'] = name[1]    # second word of the name

train_df['Mileage'] = train_df['Mileage'].str.split(" ", n=1, expand = True).get(0)
train_df['Engine'] = train_df['Engine'].str.split(" ", n=1, expand = True).get(0)
train_df['Power'] = train_df['Power'].str.split(" ", n=1, expand = True).get(0)

train_df = train_df.drop(['Name'], axis = 1)
train_df['Mileage'] = train_df['Mileage'].astype(float)
train_df['Engine'] = train_df['Engine'].astype(int)
train_df.replace("null", np.nan, inplace = True)

train_df = train_df.dropna()
train_df = train_df.reset_index(drop=True)
train_df['Power'] = train_df['Power'].astype(float)

train_df['Company'].value_counts()

Maruti           1175
Hyundai          1058
Honda             600
Toyota            394
Mercedes-Benz     316
Volkswagen        314
Ford              294
Mahindra          268
BMW               262
Audi              235
Tata              183
Skoda             172
Renault           145
Chevrolet         120
Nissan             89
Land               57
Jaguar             40
Mitsubishi         27
Mini               26
Fiat               23
Volvo              21
Porsche            16
Jeep               15
Datsun             13
Force               3
ISUZU               2
Bentley             1
Isuzu               1
Lamborghini         1
Name: Company, dtype: int64

train_df['Company'] = train_df['Company'].replace('ISUZU', 'Isuzu')

```
```
# Handle rare categories: levels covering at most 1% of rows become 'Rare'
cat_features = [feature for feature in train_df.columns if train_df[feature].dtype == 'O']

for feature in cat_features:
    temp = train_df.groupby(feature)['Price'].count() / len(train_df)
    temp_df = temp[temp > 0.01].index
    train_df[feature] = np.where(train_df[feature].isin(temp_df), train_df[feature], 'Rare')

```
```
train_df['Company'].value_counts()

Maruti           1175
Hyundai          1058
Honda             600
Toyota            394
Mercedes-Benz     316
Volkswagen        314
Ford              294
Mahindra          268
BMW               262
Rare              247
Audi              235
Tata              183
Skoda             172
Renault           145
Chevrolet         120
Nissan             89
Name: Company, dtype: int64

train_df.info()

RangeIndex: 5872 entries, 0 to 5871
Data columns (total 12 columns):
#   Column             Non-Null Count  Dtype
---  ------             --------------  -----
0   Year               5872 non-null   int64
1   Kilometers_Driven  5872 non-null   int64
2   Fuel_Type          5872 non-null   object
3   Transmission       5872 non-null   object
4   Owner_Type         5872 non-null   object
5   Mileage            5872 non-null   float64
6   Engine             5872 non-null   int32
7   Power              5872 non-null   float64
8   Seats              5872 non-null   float64
9   Price              5872 non-null   float64
10  Company            5872 non-null   object
11  Model              5872 non-null   object
dtypes: float64(4), int32(1), int64(2), object(5)
memory usage: 527.7+ KB

train_df['Seats'] = train_df['Seats'].astype(int)
```
```
# Encoding categorical data
columns = ['Fuel_Type', 'Transmission', 'Owner_Type', 'Company', 'Model']

def categorical_ohe(multicolumns):
    """One-hot encode the given columns and place them before train_df's columns."""
    dummies = []
    for fields in multicolumns:
        print(fields)
        # dtype=int keeps the 0/1 encoding shown in the outputs below
        dummies.append(pd.get_dummies(train_df[fields], dtype=int))
    # Dummy columns first, then the original columns (dropped in a later step)
    return pd.concat(dummies + [train_df], axis=1)

final_df = categorical_ohe(columns)
# Several features produce a column named 'Rare'; only the first one is kept
final_df = final_df.loc[:, ~final_df.columns.duplicated()]

now = datetime.datetime.now()
final_df['Year'] = final_df['Year'].apply(lambda x : now.year - x)

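# numeric_only=True (pandas 1.5+) skips the text columns still present in final_df
corr = final_df.corr(numeric_only=True)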
corr

Diesel    Petrol      Rare  ...     Power     Seats     Price
Diesel     1.000000 -0.977947 -0.113891  ...  0.292420  0.309581  0.321035
Petrol    -0.977947  1.000000 -0.096114  ... -0.272662 -0.303177 -0.309363
Rare      -0.113891 -0.096114  1.000000  ... -0.096618 -0.033244 -0.058408
Automatic  0.139557 -0.125612 -0.067592  ...  0.644688 -0.074554  0.585623
Manual    -0.139557  0.125612  0.067592  ... -0.644688  0.074554 -0.585623
...       ...       ...  ...       ...       ...       ...
Mileage    0.097562 -0.130056  0.153696  ... -0.538844 -0.331576 -0.341652
Engine     0.430151 -0.410837 -0.095742  ...  0.866301  0.401116  0.658047
Power      0.292420 -0.272662 -0.096618  ...  1.000000  0.101460  0.772843
Seats      0.309581 -0.303177 -0.033244  ...  0.101460  1.000000  0.055547
Price      0.321035 -0.309363 -0.058408  ...  0.772843  0.055547  1.000000

corr[corr['Price'] > 0.4]

Diesel    Petrol      Rare  ...     Power     Seats     Price
Automatic  0.139557 -0.125612 -0.067592  ...  0.644688 -0.074554  0.585623
Engine     0.430151 -0.410837 -0.095742  ...  0.866301  0.401116  0.658047
Power      0.292420 -0.272662 -0.096618  ...  1.000000  0.101460  0.772843
Price      0.321035 -0.309363 -0.058408  ...  0.772843  0.055547  1.000000

```
```
# Drop the original text columns now that they are one-hot encoded
df = final_df.drop(columns, axis=1)

X = df.drop(['Price'], axis=1)
y = df['Price']
X.head()

Diesel  Petrol  Rare  Automatic  ...  Mileage  Engine   Power  Seats
0       0       0     1          0  ...    26.60     998   58.16      5
1       1       0     0          0  ...    19.67    1582  126.20      5
2       0       1     0          0  ...    18.20    1199   88.70      5
3       1       0     0          0  ...    20.77    1248   88.76      7
4       1       0     0          1  ...    15.20    1968  140.80      5

```
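```
# Splitting the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Feature scaling
# Note: the scaler is fitted on the full X here; fitting it on X_train only
# is the usual way to keep test-set statistics out of training.
scaler = StandardScaler()
scaler.fit(X)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)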

best_alpha = 0.00099
regr = Lasso(alpha=best_alpha, max_iter=50000)
regr.fit(X_train,y_train)

y_pred = regr.predict(X_test)

# R² score on the test set (r2_score was imported as `ac`)
ac(y_test, y_pred)
0.7761524710799422
```
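
The value of `best_alpha` above is hard-coded. One way such a value could be chosen is cross-validation with scikit-learn's `LassoCV`. The sketch below uses synthetic data as a stand-in for the vehicle dataset; the data, the alpha grid, and the variable names are assumptions for illustration, not the procedure used to obtain the value above.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 8 features, 3 of them informative.
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Candidate penalty strengths on a log scale; the grid itself is an assumption.
alphas = np.logspace(-4, 1, 30)
search = LassoCV(alphas=alphas, cv=5, max_iter=50000).fit(X_train_s, y_train)

best_alpha = search.alpha_                # alpha with the lowest mean CV error
test_r2 = search.score(X_test_s, y_test)  # R² on the held-out split
print("selected alpha:", best_alpha)
print("test R^2:", test_r2)
```

`LassoCV` refits the model on the whole training split with the selected alpha, so `search` can be used for prediction directly.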

## Conclusion

In this article, we learned about regularization, the types of penalty it uses, and the two main regularization techniques. We then looked briefly at Lasso Regression and how to implement it. Our model reached an R² score of about 0.78 on the test set, which can likely be improved with hyperparameter tuning. If you are not familiar with hyperparameter tuning, please refer to the previous article on that topic.