Introduction

Stock market price prediction sounds fascinating, but it is equally difficult. In this article, we will show you how to write a Python program that predicts stock prices using a machine learning algorithm called Linear Regression. We will work with historical data for Apple Inc. (AAPL), covering the stock price from 2015-05-27 to 2020-05-22.

 

Table of Contents

  1. Data Preprocessing
  2. Splitting Dataset
  3. Model Building (Linear Regression)
  4. Predictions and Model Evaluation
  5. Predicted vs Actual Prices
  6. Conclusion

 

Let’s see how to predict stock prices using machine learning and the Python programming language. We will start by importing all the necessary Python libraries for this task:

    # Importing libraries
    import math
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

Data Preprocessing

We will be using Apple Inc. stock price data for this project: a historical dataset from 27th May 2015 to 22nd May 2020. A copy of the data used is kept over here; click on the Apple Stock Download data link to save the CSV file to your disk.

    df = pd.read_csv('AAPL.csv')

We will have a look at the dataset using df.head(), which shows the first 5 entries of the dataset.

    pd.set_option('display.max_columns', None)
    df.head()

   Unnamed: 0 symbol                       date    close     high     low    open    volume    adjClose     adjHigh      adjLow     adjOpen adjVolume  divCash  splitFactor
0           0   AAPL  2015-05-27 00:00:00+00:00  132.045  132.260  130.05  130.34  45833246  121.682558  121.880685  119.844118  120.111360  45833246      0.0          1.0
1           1   AAPL  2015-05-28 00:00:00+00:00  131.780  131.950  131.10  131.86  30733309  121.438354  121.595013  120.811718  121.512076  30733309      0.0          1.0
2           2   AAPL  2015-05-29 00:00:00+00:00  130.280  131.450  129.90  131.23  50884452  120.056069  121.134251  119.705890  120.931516  50884452      0.0          1.0
3           3   AAPL  2015-06-01 00:00:00+00:00  130.535  131.390  130.05  131.20  32112797  120.291057  121.078960  119.844118  120.903870  32112797      0.0          1.0
4           4   AAPL  2015-06-02 00:00:00+00:00  129.960  130.655  129.32  129.86  33667627  119.761181  120.401640  119.171406  119.669029  33667627      0.0          1.0

[5 rows x 15 columns]

    # Closing Price
    df1 = df['close']
    df['close'].plot()


Figure 1 : Apple Stock Market Data Visualization

 

Scaling Data

Before we begin model fitting, let's normalize the data. Scaling keeps all values on a comparable 0-1 range and is a standard preprocessing step for this kind of model. Note that df1 is currently a one-dimensional series, but MinMaxScaler works on 2D NumPy arrays, not on vectors. So we first convert df1 to a 2D array using np.array(df1).reshape(-1,1) and then apply the scaling.

    df1 = np.array(df1)
    df1 = df1.reshape(-1,1)

    scaler = MinMaxScaler(feature_range=(0,1))
    df1 = scaler.fit_transform(df1)
    print(df1)

[[0.17607447]
 [0.17495567]
 [0.16862282]
 ...
 [0.96635143]
 [0.9563033 ]
 [0.96491598]]
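
The same fitted scaler will be reused later to map scaled values and model predictions back to dollar prices. As a quick sanity check (a minimal sketch reusing the df1 and scaler variables above), inverse-transforming the first few scaled values should approximately recover the original closing prices:

    # Sanity check: inverse_transform maps scaled values back to price units
    print(scaler.inverse_transform(df1[:3]))
    # Expected to be close to the first three closing prices: 132.045, 131.78, 130.28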


Splitting Data into Training and Testing Set

    df1.shape
(1258, 1)

In this analysis we will split the dataset into a 65% training set and a 35% testing set. Because this is a time series, we split chronologically rather than shuffling the rows.

    # Splitting the dataset into train and test sets (chronological split)
    training_size = int(len(df1)*0.65)
    test_size = len(df1) - training_size
    train_data, test_data = df1[0:training_size, :], df1[training_size:len(df1), :]
    train_data.shape
(817, 1)

    test_data.shape
(441, 1)

    training_size, test_size
(817, 441)

    train_data[:10]
array([[0.17607447],
       [0.17495567],
       [0.16862282],
       [0.1696994 ],
       [0.16727181],
       [0.16794731],
       [0.16473866],
       [0.16174111],
       [0.1581525 ],
       [0.15654817]])
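
Since train_test_split was imported earlier, the same chronological split can also be written with it; the key point for a time series is to pass shuffle=False so the order of observations is preserved. A minimal alternative sketch (the *_alt names are only for illustration):

    # Equivalent chronological split using scikit-learn; no shuffling for time series
    train_data_alt, test_data_alt = train_test_split(df1, test_size=0.35, shuffle=False)
    print(train_data_alt.shape, test_data_alt.shape)
    # This should reproduce the same 817 / 441 split as the manual slicing above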

Converting the Array of Values into a Dataset Matrix

Now we will write a function that prepares the dataset so that it can easily be fed into the Linear Regression model.

Windowing Dataset

For better performance on any univariate time series, it helps to apply a sliding window to the dataset. The concept is simple: we convert the series into several overlapping input/output windows. The picture below gives an idea of how this works.


Figure 2 : Specimen Sliding Window Approach on Normalized Traffic Flow Data

Figure 2 shows a window size of 2. We will use a window size that works well for this dataset, but you can try any number you want; the window size is a hyperparameter that needs to be tuned. A toy example after the function below shows exactly what the windowing produces.

    def create_dataset(dataset, time_step=1):
        # Build overlapping windows: each row of dataX holds `time_step`
        # consecutive values, and dataY holds the value that follows them
        dataX, dataY = [], []
        for i in range(len(dataset) - time_step - 1):
            a = dataset[i:(i + time_step), 0]
            dataX.append(a)
            dataY.append(dataset[i + time_step, 0])
        return np.array(dataX), np.array(dataY)
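
To make the windowing concrete, here is a small, hypothetical check of create_dataset on a toy array with time_step=2 (the values are for illustration only):

    # Toy example: with a window of 2, each row of X holds two consecutive
    # values and y holds the value that immediately follows them
    toy = np.array([[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]])
    X_toy, y_toy = create_dataset(toy, time_step=2)
    print(X_toy)   # rows: [0.1 0.2], [0.2 0.3], [0.3 0.4]
    print(y_toy)   # [0.3 0.4 0.5]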

Let's choose a window size of 100 for now and apply the windowing to the training and testing data:

    time_step = 100
    X_train, y_train = create_dataset(train_data, time_step)
    X_test, y_test = create_dataset(test_data, time_step)
    train_data.shape, test_data.shape
((817, 1), (441, 1))

    # A total of 817 + 441 = 1258 points
    # x-axis positions 1 to 817 for the training series
    trainplot = np.arange(1,818)
    # x-axis positions 818 to 1258 for the testing series
    testplot = np.arange(818,1259)

    # Plotting Train and Test Data
    plt.figure(figsize=(12,8))
    plt.plot(trainplot,scaler.inverse_transform(train_data)[:,0], 'green', label='Train data')
    plt.plot(testplot, scaler.inverse_transform(test_data)[:,0],'blue', label='Test data')
    plt.legend()
    plt.title('Train and Test Data')
    plt.show()


Figure 3 : Apple Stock Market Data Visualization Train and Test Series

 

Model Building (Linear Regression)

Now it's time to build our model: LinearRegression.

    model = LinearRegression()
    model.fit(X_train, y_train)

LinearRegression()
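
Under this windowed formulation, each training row contains the previous 100 scaled closing prices and the target is the next one, so the fitted model should have 100 coefficients plus an intercept. A quick way to confirm this (a minimal sketch, not part of the original workflow):

    # One coefficient per lagged price in the window, plus an intercept
    print(model.coef_.shape)   # expected: (100,)
    print(model.intercept_)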

Predictions and Model Evaluation

Predictions on the testing set: let's see how our model performs on the test set.

    predictions = model.predict(X_test)
    print("Predicted Value", predictions[0])
    print("Expected Value", y_test[0])

Predicted Value 0.26591241262096627
Expected Value 0.2727349489149709

    pred_df = pd.DataFrame(predictions)
    pred_df['TrueValues'] = y_test

    new_pred_df = pred_df.rename(columns={0: 'Predictions'})
    new_pred_df.head()
   
   Predictions  TrueValues
0     0.265912    0.272735
1     0.267869    0.276619
2     0.289373    0.280672
3     0.286837    0.265811
4     0.264365    0.268429

 

Plot Predicted vs Actual Prices of Test Series

    plt.figure(figsize=(12,8))
    sns.lineplot(data=new_pred_df)
    plt.title("Predictions Vs True Values on Testing Set")



Figure 4: Plot of Predicted vs Actual Apple Stock Test Data

    print("model Accuracy on training data:",model.score(X_train, y_train))

model Accuracy on training data: 0.9970342320018716

    # Model accuracy on Testing data
    print("model Accuracy is on training data:",model.score(X_test, y_test))

model Accuracy on testing data: 0.9847722212152704

    # Let's do the prediction and check performance metrics
    train_predict = model.predict(X_train)
    test_predict = model.predict(X_test)
    train_predict = train_predict.reshape(-1, 1)
    test_predict = test_predict.reshape(-1, 1)

    # Transform back to original form
    train_predict = scaler.inverse_transform(train_predict)
    test_predict = scaler.inverse_transform(test_predict)

    # Calculate RMSE performance metrics
    math.sqrt(mean_squared_error(y_train,train_predict))
142.1363100026703

    # Test Data RMSE
    math.sqrt(mean_squared_error(y_test,test_predict))
238.13157949250507
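
One thing to keep in mind when reading the RMSE values above: y_train and y_test are still on the 0-1 MinMax scale, while train_predict and test_predict have been inverse-transformed back to dollars, so the two series are not in the same units. A sketch of computing the RMSE consistently in price units, reusing the variables above, would be:

    # Bring the targets back to price units before comparing them with the
    # inverse-transformed predictions, so the RMSE is expressed in dollars
    y_train_inv = scaler.inverse_transform(y_train.reshape(-1, 1))
    y_test_inv = scaler.inverse_transform(y_test.reshape(-1, 1))
    print("Train RMSE:", math.sqrt(mean_squared_error(y_train_inv, train_predict)))
    print("Test RMSE:", math.sqrt(mean_squared_error(y_test_inv, test_predict)))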

Conclusion

Our model performed well at predicting the Apple stock price using Linear Regression, and this entire code stack can be reused for other stock price prediction tasks. The prediction is only short-term, though: we would not recommend using this model for medium- to long-term forecast horizons, as its performance degrades over longer periods. That is not because the linear model is bad, but because stock markets are highly volatile. As a follow-up, read through this implementation of stock price prediction using LSTM.

 

 

About the Authors:

Tushar Patil

Tushar is pursuing his B.Tech degree from Dr. J J Magdum College of Engineering, Jaysingpur. He is a student who is passionate about Machine Learning and Data Science, and he provides consulting and product development services in the field.

 

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore; he has completed his MBA from the University of Pune and a Bachelor of Science (Statistics) from the University of Pune. He is a Certified Data Scientist by EMC. Mohan is a learner and has been enriching his experience throughout his career by exposing himself to several opportunities in the capacity of an Advisor, Consultant and Business Owner. He has more than 18 years of experience in the field of Analytics and has worked as an Analytics SME on domains ranging from IT, Banking, Construction, Real Estate, Automobile and Component Manufacturing to Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.