Table of Contents

  1. Introduction
  2. About Sentiment Analysis and How It Works
  3. Import Necessary Libraries
  4. Importing the Dataset
  5. Data Preprocessing / Text Cleaning
  6. Plotting WordCloud
  7. Visualizing the Cleaned Dataset
  8. Building the LSTM Model
  9. Conclusion

Introduction

The stock market is one of the most sensitive markets: its overall direction depends heavily on the sentiment of investors, and shifts in that sentiment can change the trend of the market. Other factors also drive the trend, and one of them is everyday news.

Have you ever wondered what impact everyday news has on stock market trends?

The volatile nature of the stock market gives everyone an equal opportunity to earn money, but it also carries the risk of losing it. If we can predict the trend or state of the market, we can make a profit or at least minimize our losses.

This is where A.I. plays an important role: using sentiment analysis, it can help predict the market trend or the movement of a particular stock, i.e., whether it will go up or down.


About Sentiment Analysis and How It Works

Sentiment analysis is the process of analyzing people's sentiment from sources where they freely express their feelings and opinions, such as social media and stock-market-related blogs. Such sources influence other people to make decisions accordingly.

It uses Natural Language Processing (NLP) to divide sentiment into three categories: positive, negative, and neutral. If the sentiment is positive, the stock price may increase; if it is negative, the price may decrease; and if it is neutral, the price may neither increase nor decrease.
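
For intuition, here is a quick, hedged illustration of three-way sentiment scoring using NLTK's VADER analyzer. This is only to show the idea; it is not the LSTM model we build later, and the example headlines are made up:

    # A hedged sketch of three-way sentiment scoring with NLTK's VADER analyzer;
    # the headlines below are hypothetical.
    import nltk
    nltk.download('vader_lexicon')
    from nltk.sentiment import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    for headline in ["Company beats quarterly earnings estimates",
                     "Regulator fines company over accounting fraud",
                     "Company to hold its annual meeting on Friday"]:
        score = sia.polarity_scores(headline)['compound']
        # a common convention: compound >= 0.05 is positive, <= -0.05 is negative
        label = 'positive' if score >= 0.05 else 'negative' if score <= -0.05 else 'neutral'
        print(f"{label:8s} ({score:+.2f}) {headline}")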

If you want to know more about NLP and how it works, you can read my previous article on it here.

In this article, we are going to perform sentiment analysis on stock news using an LSTM model.


Import Necessary Libraries

    # Importing Libraries
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import plotly.express as px

    # Import packages for TensorFlow deployment
    import tensorflow as tf
    from tensorflow.keras.models import Model,Sequential
    from tensorflow.keras.preprocessing.text import one_hot,Tokenizer
    from tensorflow.keras.layers import Dense, Flatten, Embedding, Input, LSTM, Conv1D, MaxPool1D, Bidirectional, Dropout
    from tensorflow.keras.utils import to_categorical
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Import NLP And Gensim
    # Gensim is an open-source library used for unsupervised topic modeling and NLP
    from wordcloud import WordCloud, STOPWORDS
    import nltk
    import re
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize, sent_tokenize
    import gensim
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS

Importing the Dataset

We have stock news data collected from posts by various Twitter handles covering stock-trend news. You can download the file in CSV format by clicking this link: download data. Save it in your working directory and access it using the code below.

    stock = pd.read_csv("stock_data.csv")
    # Display all columns
    pd.set_option('display.max_columns', None)
    stock

                                                   Text  Sentiment
0     Kickers on my watchlist XIDE TIT SOQ PNK CPW B...          1
1     user: AAP MOVIE. 55% return for the FEA/GEED i...          1
2     user I'd be afraid to short AMZN - they are lo...          1
3                                     MNTA Over 12.00            1
4                                      OI  Over 21.37            1
                                                    ...        ...
5786  Industry body CII said #discoms are likely to ...         -1
5787  #Gold prices slip below Rs 46,000 as #investor...         -1
5788  Workers at Bajaj Auto have agreed to a 10% wag...          1
5789  #Sharemarket LIVE: Sensex off day’s high, up 6...          1
5790  #Sensex, #Nifty climb off day's highs, still u...          1

[5791 rows x 2 columns]

    stock.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5791 entries, 0 to 5790
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Text       5791 non-null   object
 1   Sentiment  5791 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 90.6+ KB

    stock["Sentiment"].value_counts()

 1    3685
-1    2106
Name: Sentiment, dtype: int64

As we can see, a value of 1 indicates positive sentiment and -1 indicates negative sentiment. For simplicity, we replace -1 with 0 so that the labels form a standard binary target for the model.

    stock["Sentiment"] = stock["Sentiment"].replace(-1,0)

Data Preprocessing / Text Cleaning

Since the data is taken from Twitter, we have to remove punctuation from the text, such as hashtag symbols, handles, and other punctuation marks. We also have to remove stopwords, which play no role in predicting the sentiment of the text.

    # Removing Punctuation From Text
    import string
    string.punctuation

    def remove_punct(message):
      Test_punc_removed = [ char for char in message if char not in string.punctuation ]
      Test_join = ''.join(Test_punc_removed)
      return Test_join
    
    stock['Text Without Punct'] = stock['Text'].apply(remove_punct)
    stock

                                                   Text  Sentiment   Text Without Punct
0     Kickers on my watchlist XIDE TIT SOQ PNK CPW B...          1   Kickers on my watchlist XIDE TIT SOQ PNK CPW B...
1     user: AAP MOVIE. 55% return for the FEA/GEED i...          1   user AAP MOVIE 55 return for the FEAGEED indic...
2     user I'd be afraid to short AMZN - they are lo...          1   user Id be afraid to short AMZN  they are look...
3                                     MNTA Over 12.00            1                                    MNTA Over 1200  
4                                      OI  Over 21.37            1                                     OI  Over 2137  
                                                    ...        ...                                                 ...
5786  Industry body CII said #discoms are likely to ...          0   Industry body CII said discoms are likely to s...
5787  #Gold prices slip below Rs 46,000 as #investor...          0   Gold prices slip below Rs 46000 as investors b...
5788  Workers at Bajaj Auto have agreed to a 10% wag...          1   Workers at Bajaj Auto have agreed to a 10 wage...
5789  #Sharemarket LIVE: Sensex off day’s high, up 6...          1   Sharemarket LIVE Sensex off day’s high up 600 ...
5790  #Sensex, #Nifty climb off day's highs, still u...          1   Sensex Nifty climb off days highs still up 2 K...

[5791 rows x 3 columns]

    # Removing Stopwords
    nltk.download('stopwords')
    
    stop_words = stopwords.words('english')
    stop_words.extend(['from', 'subject', 're', 'edu', 'use','will','aap','co','day','user','stock','today','week','year'])
    
    def preprocess(text):
        result = []
        for token in gensim.utils.simple_preprocess(text):
            if len(token) >= 3 and token not in stop_words:
                result.append(token)
                
        return result
    
    stock['Text Without Punc & Stopwords'] = stock['Text Without Punct'].apply(preprocess)
    stock


                                                   Text  Sentiment                                  Text Without Punct                       Text Without Punc & Stopwords
0     Kickers on my watchlist XIDE TIT SOQ PNK CPW B...          1   Kickers on my watchlist XIDE TIT SOQ PNK CPW B...   [kickers, watchlist, xide, tit, soq, pnk, cpw,...
1     user: AAP MOVIE. 55% return for the FEA/GEED i...          1   user AAP MOVIE 55 return for the FEAGEED indic...   [movie, return, feageed, indicator, trades, aw...
2     user I'd be afraid to short AMZN - they are lo...          1   user Id be afraid to short AMZN  they are look...   [afraid, short, amzn, looking, like, nearmonop...
3                                     MNTA Over 12.00            1                                    MNTA Over 1200                                              [mnta]  
4                                      OI  Over 21.37            1                                     OI  Over 2137                                                   [] 
                                                    ...        ...                                                 ...   
5786  Industry body CII said #discoms are likely to ...          0   Industry body CII said discoms are likely to s...   [industry, body, cii, said, discoms, likely, s...
5787  #Gold prices slip below Rs 46,000 as #investor...          0   Gold prices slip below Rs 46000 as investors b...   [gold, prices, slip, investors, book, profits,...
5788  Workers at Bajaj Auto have agreed to a 10% wag...          1   Workers at Bajaj Auto have agreed to a 10 wage...   [workers, bajaj, auto, agreed, wage, cut, peri...
5789  #Sharemarket LIVE: Sensex off day’s high, up 6...          1   Sharemarket LIVE Sensex off day’s high up 600 ...   [sharemarket, live, sensex, high, points, nift...
5790  #Sensex, #Nifty climb off day's highs, still u...          1   Sensex Nifty climb off days highs still up 2 K...   [sensex, nifty, climb, days, highs, still, key...

[5791 rows x 4 columns]

Plotting WordCloud 

A word cloud is a visualization technique for text data in which the size of each word reflects its importance or its frequency of occurrence.

    # join the words into a string
    stock['Text Without Punc & Stopwords Joined'] = stock['Text Without Punc & Stopwords'].apply(lambda x: " ".join(x))
    stock

                                                   Text  Sentiment                                  Text Without Punct                       Text Without Punc & Stopwords                Text Without Punc & Stopwords Joined
0     Kickers on my watchlist XIDE TIT SOQ PNK CPW B...          1   Kickers on my watchlist XIDE TIT SOQ PNK CPW B...   [kickers, watchlist, xide, tit, soq, pnk, cpw,...   kickers watchlist xide tit soq pnk cpw bpz tra...
1     user: AAP MOVIE. 55% return for the FEA/GEED i...          1   user AAP MOVIE 55 return for the FEAGEED indic...   [movie, return, feageed, indicator, trades, aw...       movie return feageed indicator trades awesome
2     user I'd be afraid to short AMZN - they are lo...          1   user Id be afraid to short AMZN  they are look...   [afraid, short, amzn, looking, like, nearmonop...   afraid short amzn looking like nearmonopoly eb...
3                                     MNTA Over 12.00            1                                    MNTA Over 1200                                              [mnta]                                                  mnta
4                                      OI  Over 21.37            1                                     OI  Over 2137                                                   []                                                     
                                                    ...        ...                                                 ...                                                 ...   
5786  Industry body CII said #discoms are likely to ...          0   Industry body CII said discoms are likely to s...   [industry, body, cii, said, discoms, likely, s...   industry body cii said discoms likely suffer n...
5787  #Gold prices slip below Rs 46,000 as #investor...          0   Gold prices slip below Rs 46000 as investors b...   [gold, prices, slip, investors, book, profits,...   gold prices slip investors book profits amid c...
5788  Workers at Bajaj Auto have agreed to a 10% wag...          1   Workers at Bajaj Auto have agreed to a 10 wage...   [workers, bajaj, auto, agreed, wage, cut, peri...   workers bajaj auto agreed wage cut period apri...
5789  #Sharemarket LIVE: Sensex off day’s high, up 6...          1   Sharemarket LIVE Sensex off day’s high up 600 ...   [sharemarket, live, sensex, high, points, nift...   sharemarket live sensex high points nifty test...
5790  #Sensex, #Nifty climb off day's highs, still u...          1   Sensex Nifty climb off days highs still up 2 K...   [sensex, nifty, climb, days, highs, still, key...   sensex nifty climb days highs still key factor...

[5791 rows x 5 columns]

    # Most common words used in Positive Sentiment
    plt.figure(figsize = (20,20))
    wc = WordCloud(max_words = 1000 , width = 1600 , height = 800).generate(" ".join(stock[stock['Sentiment'] == 1]['Text Without Punc & Stopwords Joined']))
    plt.imshow(wc, interpolation = 'bilinear');


Figure 1 :  Common Positive Words

In the word cloud above, we can see the words that carry positive sentiment, i.e., words indicating that the stock may move up.


    # Most common words used in Negative Sentiment
    plt.figure(figsize = (20,20))
    wc1 = WordCloud(max_words = 1000 , width = 1600 , height = 800).generate(" ".join(stock[stock['Sentiment'] == 0]['Text Without Punc & Stopwords Joined']))
    plt.imshow(wc1, interpolation = 'bilinear');


Figure 2 :  Common Negative Words

In this word cloud, we can see the words that carry negative sentiment, i.e., words indicating that the stock may move down.


Visualizing the Cleaned Dataset

In this part, we will look at word frequencies and the length of each tweet, and then assign each word a numerical value.

    nltk.download('punkt')

    # word_tokenize is used to break up a string into words
    print(stock['Text Without Punc & Stopwords Joined'][0])

kickers watchlist xide tit soq pnk cpw bpz trade method method see prev posts

    print(nltk.word_tokenize(stock['Text Without Punc & Stopwords Joined'][0]))

['kickers', 'watchlist', 'xide', 'tit', 'soq', 'pnk', 'cpw', 'bpz', 'trade', 'method', 'method', 'see', 'prev', 'posts']

    # Obtain the maximum length of data in the document
    # This will be later used when word embeddings are generated
    maxlen = -1
    for doc in stock['Text Without Punc & Stopwords Joined']:
        tokens = nltk.word_tokenize(doc)
        if(maxlen < len(tokens)):
            maxlen = len(tokens)
    print("The maximum number of words in any document is:", maxlen)

The maximum number of words in any document is: 20
    
    tweets_length = [ len(nltk.word_tokenize(x)) for x in stock['Text Without Punc & Stopwords Joined'] ]
    # Plot the distribution of the number of words per tweet
    # plt.hist returns arrays and patches, not a figure, so call plt.show() directly
    plt.hist(x = tweets_length, bins = 100)
    plt.show()


Figure 3 :  Plot the distribution for the number of words in a text

    # PREPARE THE DATA BY TOKENIZING AND PADDING
    # Obtain the total words present in the dataset
    list_of_words = []
    for i in stock['Text Without Punc & Stopwords']:
        for j in i:
            list_of_words.append(j)
    
    list_of_words[0:5]
['kickers', 'watchlist', 'xide', 'tit', 'soq']

    list_of_words[45095:45098]
['factors', 'driving', 'dstreet']

    # Obtain the total number of unique words
    total_words = len(list(set(list_of_words)))
    total_words
    
9268

Now we will split the data into training and test sets before assigning each word a particular number; fitting the tokenizer only on the training data keeps test-set vocabulary from leaking into the model.

    # split the data into test and train 
    X = stock['Text Without Punc & Stopwords']
    y = stock['Sentiment']
    
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)
    
    X_train.shape
(5211,)    

    X_test.shape
(580,)
    
    y.value_counts()
1    3685
0    2106
Name: Sentiment, dtype: int64
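
One note on the split above: the classes are imbalanced (3685 positive vs. 2106 negative) and the split is purely random. If you want the train and test sets to preserve the class proportions, a stratified split is one option; a hedged alternative, not what the original code does:

    # A hedged alternative: stratified split preserving class proportions;
    # random_state is set only to make the example reproducible.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size = 0.1, stratify = y, random_state = 42)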

The process of assigning a number to each word in a text is known as tokenizing.

    
    # Create a tokenizer to tokenize the words and create sequences of tokenized words
    tokenizers = Tokenizer(num_words = total_words)
    tokenizers.fit_on_texts(X_train)    

    # Training data
    train_seq = tokenizers.texts_to_sequences(X_train)
    
    # Testing data
    test_seq = tokenizers.texts_to_sequences(X_test)
    
    train_seq[0:2]
[[1289, 37, 91], [1884, 1, 39, 30, 3, 3616, 3617]]
    
    print("The encoding for document\n", X_train[1:2],"\n is: ", train_seq[1])

The encoding for document
 2    [afraid, short, amzn, looking, like, nearmonop...
Name: Text Without Punc & Stopwords, dtype: object 
 is:  [1884, 1, 39, 30, 3, 3616, 3617]

    # Add padding to training and testing
    # maxlen = 29 gives some headroom above the longest cleaned document (20 words)
    padded_train = pad_sequences(train_seq, maxlen = 29)
    padded_test = pad_sequences(test_seq, maxlen = 29)
    
    for i, doc in enumerate(padded_train[:3]):
         print("The padded encoding for document:", i+1," is:", doc)

The padded encoding for document: 1  is: 
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 1289   37    91]
The padded encoding for document: 2  is: 
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 1884    1   39   30    3 3616  3617]
The padded encoding for document: 3  is: 
[   0    0    0    0    0    0    0    0    0    0    0    0    0   64   84  485  173 3618  985 1115 1885 1290  155 1886 1291 1887   15 3619 7]

    # Convert the data to categorical 2D representation
    y_train_cat = to_categorical(y_train,2)
    y_test_cat = to_categorical(y_test,2)

Building LSTM model

    # Sequential Model
    model = Sequential()
    
    # embedding layer
    model.add(Embedding(total_words, output_dim = 512))
    
    # LSTM layer (a plain LSTM; Bidirectional is imported above but not used here)
    model.add(LSTM(256))
    
    # Dense layers
    model.add(Dense(128, activation = 'relu'))
    model.add(Dropout(0.3))
    model.add(Dense(2,activation = 'softmax'))
    model.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['acc'])
    model.summary()    

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, None, 512)         4745216   
_________________________________________________________________
lstm (LSTM)                  (None, 256)               787456    
_________________________________________________________________
dense (Dense)                (None, 128)               32896     
_________________________________________________________________
dropout (Dropout)            (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258       
=================================================================
Total params: 5,565,826
Trainable params: 5,565,826
Non-trainable params: 0
_________________________________________________________________

    # train the model
    model.fit(padded_train, y_train_cat, batch_size = 32, validation_split = 0.2, epochs = 3)

Epoch 1/3
131/131 [==============================] - 37s 237ms/step - loss: 0.6101 - acc: 0.6570 - val_loss: 0.5213 - val_acc: 0.7411
Epoch 2/3
131/131 [==============================] - 31s 236ms/step - loss: 0.2690 - acc: 0.9030 - val_loss: 0.5403 - val_acc: 0.7488
Epoch 3/3
131/131 [==============================] - 30s 230ms/step - loss: 0.1174 - acc: 0.9605 - val_loss: 0.7665 - val_acc: 0.7440
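
Notice that the training loss keeps falling while the validation loss rises after the first epoch, a classic sign of overfitting. One common remedy is early stopping; below is a hedged sketch using Keras's built-in callback (not part of the original run above):

    # A hedged sketch: stop when validation loss stops improving and restore
    # the best weights seen so far; epochs = 20 is just an illustrative upper bound.
    from tensorflow.keras.callbacks import EarlyStopping

    early_stop = EarlyStopping(monitor = 'val_loss', patience = 2,
                               restore_best_weights = True)
    model.fit(padded_train, y_train_cat, batch_size = 32,
              validation_split = 0.2, epochs = 20, callbacks = [early_stop])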

    # make prediction
    pred = model.predict(padded_test)
    
    # convert predicted probabilities to class labels
    prediction = []
    for i in pred:
      prediction.append(np.argmax(i))
      
    # list containing original values
    original = []
    for i in y_test_cat:
      original.append(np.argmax(i))
    
    # accuracy score on test data
    from sklearn.metrics import accuracy_score
    
    accuracy = accuracy_score(original, prediction)
    accuracy

0.7603448275862069

    # Plot the confusion matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(original, prediction)
    sns.heatmap(cm, annot = True) 


Figure 4 :  Plot of Confusion Matrix in Stock News Analysis
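
To close the loop, here is a hedged sketch of how the trained model could score a brand-new headline, reusing the remove_punct and preprocess functions and the fitted tokenizers object defined earlier (the headline itself is made up):

    # A hedged end-to-end inference sketch; the headline is hypothetical.
    headline = "Company shares surge after record quarterly profits"
    tokens = preprocess(remove_punct(headline))      # clean and tokenize
    seq = tokenizers.texts_to_sequences([tokens])    # map words to integers
    padded = pad_sequences(seq, maxlen = 29)         # same padding as training
    probs = model.predict(padded)[0]
    print("positive" if np.argmax(probs) == 1 else "negative", probs)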


Conclusion

We achieved an accuracy of approximately 76% with the LSTM model. We can further increase the accuracy by finding the best hyperparameters for this model, which can be done through hyperparameter tuning; you can read more about it here.
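
As a hedged illustration of what such tuning could look like, the sketch below uses the keras-tuner library, which is not used anywhere above and would need to be installed separately (pip install keras-tuner); the search space is illustrative only:

    # A hedged hyperparameter-tuning sketch with keras-tuner;
    # the search space below is illustrative, not a recommendation.
    import keras_tuner as kt

    def build_model(hp):
        m = Sequential()
        m.add(Embedding(total_words, output_dim = hp.Choice('embed_dim', [128, 256, 512])))
        m.add(LSTM(hp.Int('lstm_units', 64, 256, step = 64)))
        m.add(Dense(128, activation = 'relu'))
        m.add(Dropout(hp.Float('dropout', 0.2, 0.5, step = 0.1)))
        m.add(Dense(2, activation = 'softmax'))
        m.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['acc'])
        return m

    tuner = kt.RandomSearch(build_model,
                            objective = kt.Objective('val_acc', direction = 'max'),
                            max_trials = 5)
    tuner.search(padded_train, y_train_cat, validation_split = 0.2, epochs = 3)
    best_model = tuner.get_best_models(num_models = 1)[0]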

Other methods can also be used to increase the accuracy of the model. You can read the article on How to perform Sentiment Analysis in Python.


About the Author:


Sachin Kumar Gupta

Sachin is a Mechanical Engineer and a data science enthusiast. He loves to find trends in data and extract useful information from them. He has executed projects on Machine Learning and Deep Learning using Python.