How To Implement Linear Regression From Scratch In Python.

Table of Content

Linear Regression
What is Linear Regression used for ?
Why is linear regression a Supervised Machine Learning Algorithm ?
What are the types of linear regression?
1. Simple Linear Regression
2. Multiple Linear Regression
How do you calculate parameters of simple linear regression?
Scatter plot
General trend of a linear regression line
Case study (University admission prediction)
What is R Square and how do you interpret R Squared in regression analysis?
Implementing Linear Regression from scratch in Python

Linear Regression

It is a type of predictive modelling technique based on statistics, created to predict the value of a specific entity based on historical data. It is a supervised machine learning algorithm. It is used to predict the value of a continuous variable also known as target variable or response variable based on one or more than one predictor (also known as features or independent variables).

What is linear regression used for?

Linear regression can be used in different sectors viz. in real estate sector for the valuation of a property, in the retail sector for predicting monthly sales and the price of goods, for estimating the salary of an employee, in the educational sector for predicting the %marks of a student in the final exam based on his previous performance, etc. Financial forecasting is a classic application of regression that uses related information to predict the future value of entities like revenues, expenses, exchange rates, and capital costs.

Why is linear regression a supervised machine learning algorithm ?

Linear regression is a supervised machine learning technique in which the system is trained with the target variable for identifying the trend before it can predict the outcome with unknown feature variable.

For example, consider the diagram below, in which a training model is constructed consisting of a single feature variable which is GRE score, and a target variable which is the % chance of getting admission into the University.

Once the machine is trained with the specified model, the system can predict the target variable (% change of getting admission) based on the test dataset containing unlabelled GRE score as specified in the diagram below.

linear regression on university admission data set

Figure 1 : Linear Model predicting using Features and Target variable on Test Data

What are the types of linear regression?

There are primarily two types of linear regression, simple linear regression and multiple linear regression.

Simple Linear Regression

In simple linear regression a single feature variable or predictor determines the value of target variable or the outcome.

The equation is:

simple linear regression equation and explanation of terms

Multiple Linear Regression

In Multiple linear regression, multiple feature variables are involved in determining the outcome or the value of target variable.

The equation is as follows: -

multiple linear regression equation

How do you calculate parameters of simple linear regression?

Linear regression models the linear relationship between two variables using a simple regression line which is a straight line. The two variables here are the independent variable, which is the cause and the dependant variable also known as the output, target or the response variable which depicts the effect.

For example, let us consider two variables which are linearly related. We need to find a linear function that predicts the response value(y) with the help of its feature variable or independent variable(x).

The simple linear regression equation with one dependent or feature variable and one independent or response variable is defined by the formula

simple linear regression equation explanation

We have a sample data set in which value of response y against every feature x is tabulated as in Figure 2:

x as feature vector

y as response vector

sample data for simple linear regression Figure 2: Table of sample data

Scatter plot

A scatter chart helps to visualize the response against every feature and we are trying to draw a line draw covering most of the data points. We can also call this as a regression line.

fitting a straight line through the data

Figure 3: Scatter Plot and Straight Linear Line Passing through data

General trend of a linear regression line

A regression line is a straight line depicting the linear relationship between the two variables and is represented by a regression coefficient( ß) which is also the slope of the line, and the intercept which is also labelled as constant and is the expected mean value of y when all x=0.

There can be 3 different scenarios:

1. When there is no change in the value of y with respect to change in the value of x, ß is 0.

2. With increase in value of x, the value of y increases (x is directly proportional to y), ß is +ve.

3. With increase in value of x, y decreases (x is inversely proportional to y), ß is -ve.

trends of a linear regression line Figure 4 : Trend of Linear Regression Line

Case study (University admission prediction)

The University Admission Prediction dataset contains several parameters which are considered important during the application for Masters Programs. The data is available on kaggle. We have kept a copy of the University Admission Prediction dataset here.

For Simple linear regression let us consider the “GRE Score” (out of 340) as the feature which influences the “chance of admit” which ranges from 0 to 1.

Step1: Plotting a scatter chart

Plot a scatter chart to analyse the relationship between the variables. Linear regression is possible only if a linear relationship between the two variables exists. In scatter chart the points should fall along a line and not be like a blob. We have taken only the first 5 records here for analysis.


GRE score	Chance of admit
337	0.92
324	0.76
316	0.72
322	0.8
314	0.65

Figure 5 : Snapshot of top 5 Records in University Admission Dataset

scatter plot of university admission data Figure 6 : Scatterplot of top 5 Records in University Admission Dataset along with linear line

Step 2: Calculating the Residuals

Now that we know there is a linear relationship between x and y, the system looks for a best fit regression line which minimise the residuals. Residuals are deviations of the actual data points from the regression line. The best fit line will have minimum residuals.

The regression line is sometimes called the "line of best fit" or the "best fit line". Since it "best fits" the data, the line passes through the mean of x and mean of y called the centroid.

GRE score ( X )	Chance of Admit ( Y )	Deviation GRE score (Xi-mean(Xi))	Deviation Chance of Admit (Yi-mean(Yi))
337	0.92	14.4	0.15
324	0.76	1.4	-0.01
316	0.72	-6.6	-0.05
322	0.8	-0.6	0.03
314	0.65	-8.6	-0.12
Mean (Xi)	Mean (Yi)
322.6	0.77

Figure 7 : Manual Computations for the Linear Regression Line

Figure 8 : Graph showing Computation of residuals in Linear Regression

Step 3: Calculating the Slope

Calculate the slope of the line using the following equation: -

Formula for calculating the slope of the regression line

Manual Computations for the Linear Regression Line

Figure 9 : Table of University Admission Dataset beta compuation

ß = 3.49/327.2

= 0.010666259

As per the equation

y = ß*x +c

By substituting mean(x) and mean(y) in the equation, we get the value of the intercept (c)

c = 0.77 - 0.010666259 * 322.6

= - 2.670935208

Final equation:

Y = 0.010666259x - 2.670935208

regression line on university admission data shown with slope and constant value

Figure 10 : Regression line in University Admission Dataset along with the slope represenation

Step 4 : Calculating the estimated value of y (chance of admit) using test dataset

GRE Score	Chance of admit
320	?
325	?

Y₁= 0.010666259 * 320 - 2.670935208

Y₁= 0.742267672

Y₂= 0.010666259 * 325 - 2.670935208

Y₂= 0.795598967

What is R Square and how do you interpret R Squared in regression analysis?

R Square is also known as coefficient of determination and is a measure of the goodness of fit of the regression line to existing data points. The value of R Square ranges from 0 to 1.

If R Square is 1 then all the data points fall perfectly on the Regression line.

If R Square is 0, the regression line is horizontal. The predictor p does not account for any variation in the target variable y.

If R Square is between 0 and 1, the predictor p accounts for R2 *100 percent variation in the target variable y.

R square is represented by the following formula

Formula for r square value of the regression line

Where,

y_iis the target or response variable in the test dataset and p_i is the predicted response variable. R - square is calculated in the training dataset to understand the accuracy of the model.

Implementing Linear Regression from scratch in Python

import pandas as pd

import the dataset

df_admit=pd.read_csv("c:/csv/admission_predict.csv")

Extracting the first 5 records

df1=df_admit.head(5)

df1

snapshot of top 5 records of university admission data in python.

Selecting the GRE Score and the Chance of admit column

df1=df1.iloc[:,[1,7]]

df1

university admission data in python subset with only 2 columns

setting GRE score as X vector

df1_x=df1.iloc[:,[0]]

df1_x

feature for linear regression in university admission data set

converting object to numeric column

x=df1_x.iloc[:,0]

x=pd.to_numeric(x)

setting Chance of admit as Y vector

df1_y=df1.iloc[:,[1]]

df1_y

target for linear regression in university admission data set.

converting object to numeric column

y=df1_y.iloc[:,0]

import numpy as np

calculating the mean of x and y vectors

df1_x_mean=np.mean(df1_x)

df1_x_mean

df1_y_mean=np.mean(df1_y)

df1_y_mean

calculating the deviation in x

deviation_x=[]

for i in x:

deviation_x.append(i-df1_x_mean)

deviation_x

calculating the deviation in y

deviation_y=[]

for i in y:

deviation_y.append(i-df1_y_mean)

deviation_y

calculating the product of deviations

product_deviation=np.array(deviation_x)*np.array(deviation_y) 

product_deviation

Sum_product_deviation=np.sum(product_deviation) 
Sum_product_deviation

calculating the square of deviation of x and y

Sq_x_deviation=np.array(deviation_x)**2 

Sq_x_deviation

Sum_Sq_x_deviation=np.sum(Sq_x_deviation) 
Sum_Sq_x_deviation

calculating the regression coefficient, the slope of regression line

Regression_coefficient=Sum_product_deviation/Sum_Sq_x_deviation

Regression_coefficient

calculating the intercept

intercept=np.float(df1_y_mean)-(Regression_coefficient*np.float(df1_x_mean)) 

intercept

calculating the predicted value of y from test data

predicted_y1=Regression_coefficient*test_data[0]+intercept 

predicted_y1 


predicted_y2=Regression_coefficient*test_data[1]+intercept 
predicted_y2

Implementing linear regression with SKLearn library

from sklearn.linear_model import LinearRegression

model1=LinearRegression()

#training the model with X as feature and Y as target

model1=model1.fit(df1_x,df1_y)

printing the regression coefficient or the slope of the regression line

print(model1.coef_)

printing the intercept

print(model1.intercept_)

creating the test dataset

x_test=pd.DataFrame({"GRE Score":[320,325]})

Predicting the value of Y (Chance of admit)

p=model1.predict(x_test)

Calculating R2_score

from sklearn.metrics import r2_score

The target variable of the test dataset is unknown so we need to use the training dataset

p_tr=model1.predict(df1_x)

r2_score(df1_y,p_tr)

We have done this implemenation on the top 5 records of the dataset. As a next step you should implement the same on the complete data set.

About the Author:

Indrani Sen

Indrani Sen is an Academician, Freelance Machine learning and coding instructor, and Ph.D. research scholar in the University of Mumbai. She has more than 15 years of experience in Teaching Computer Science and IT in various leading colleges and the University of Mumbai. As a machine learning trainer, she has worked with various clients like Tata Consultancy Services, Great Lakes Institute of Management, Regenesys Business school, etc. to name a few.

How to implement Linear regression from scratch in Python

Modal

Are you sure to delete this information ?