Table of Contents

  1. Introduction to NLP
  2. Importing Necessary Libraries and Dataset
  3. Removing Noise from Dataset
  4. Model Preparation
  5. Checking Accuracy of Model
  6. Conclusion


Introduction to NLP

A large share of the data generated today is unstructured and must be processed before it yields insights. Examples include news articles, social media posts, and product reviews on e-commerce sites.

Analyzing this kind of data and extracting insights from it falls under the field of NLP.

NLP stands for Natural Language Processing, the branch of AI that studies how machines interact with human language. Examples of NLP applications are chatbots, spell-checkers, language translators, and sentiment analysis.

In this article we work with a dataset of sentences, each labelled with a positive or negative sentiment.


Importing Necessary Libraries and Dataset

We are going to use the NLTK package in Python for all NLP tasks in this article. If you have not installed it yet, install it first by running pip install nltk in your command prompt. The code below imports the required packages and loads the datasets.

    # import required packages
    import pandas as pd
    import numpy as np
    import re
    import nltk
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import confusion_matrix, accuracy_score

    nltk.download('stopwords')
[nltk_data] Downloading package stopwords to C:\Users\Imurgence\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    nltk.download('wordnet')
[nltk_data] Downloading package wordnet to C:\Users\Imurgence\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\
    # Download the three data sets before running the code below:
    # amazon reviews data set
    # imdb review data set
    # yelp reviews data set
    # Importing Dataset
    am = pd.read_excel('amazon.xlsx', header=None)
    im = pd.read_excel('imdb.xlsx',header=None)
    ye = pd.read_excel('yelp.xlsx',header=None)

    print("Shape of \namazon : {} \nimdb : {} \nyelp : {}".format(am.shape,im.shape,ye.shape))
Shape of 
amazon : (1000, 2) 
imdb : (748, 2) 
yelp : (1000, 2)

    # Combining the datasets
    df = pd.concat([am,im,ye],axis=0, ignore_index=True)
    df.columns = ['text','sentiment']

                                                text  sentiment
0  So there is no way for me to plug it in here i...          0
1                        Good case, Excellent value.          1
2                             Great for the jawbone.          1
3  Tied to charger for conversations lasting more...          0
4                                  The mic is great.          1

    df.info()

RangeIndex: 2748 entries, 0 to 2747
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   text       2744 non-null   object
 1   sentiment  2748 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 43.1+ KB

Removing Noise from Dataset

In this step we remove noise from the dataset. Noise is any part of the text that adds no meaning or information useful for predicting the sentiment. The most common words in a language are called stop words; examples are “is”, “am”, “are”, and “a”. They are generally irrelevant when processing language, unless a specific use case warrants their inclusion.

Let’s see what the NLTK English stop word list contains:

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

The list also contains the word “not”, which is essential for sentiment analysis because it reverses the meaning of a review. So we first have to remove this word from the stop words.

    stopword = stopwords.words('english')
    stopword.remove('not')
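
To see why keeping “not” matters, here is a minimal sketch that filters one sentence twice. The short stop word list and the sample sentence are made up for illustration; the article itself uses NLTK's full English list.

```python
# Hypothetical mini stop word list; the article uses NLTK's full English list.
stop_words = ["i", "was", "the", "is", "not", "with", "it"]

def drop_stop_words(sentence, stop_list):
    """Lowercase, split on whitespace, and drop every word found in stop_list."""
    return [w for w in sentence.lower().split() if w not in stop_list]

sentence = "The battery is not good"

# With "not" still in the list, the negation disappears.
with_not = drop_stop_words(sentence, stop_words)

# Remove "not" from a copy of the list so the negation survives.
kept = [w for w in stop_words if w != "not"]
without_not = drop_stop_words(sentence, kept)

print(with_not)     # ['battery', 'good']
print(without_not)  # ['battery', 'not', 'good']
```

Without the negation, a negative review would look the same as a positive one to the model.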

Now let’s build a list that contains only the words that are essential for building the model.

    corpus = []
    wordnet = WordNetLemmatizer()
    for i in range(0, len(df)):
        # Keep letters only, lowercase, and split into words
        review = re.sub('[^a-zA-Z]',' ', str(df['text'][i]))
        review = review.lower()
        review = review.split()
        # Lemmatize every word that is not a stop word
        review = [wordnet.lemmatize(word) for word in review if word not in stopword]
        review = " ".join(review)
        corpus.append(review)
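
The regex, lowercase, and split steps of the loop can be tried on a single sentence. The sketch below leaves out stop word removal and lemmatization so that it runs without NLTK; the helper name clean is an assumption made for this example. It also shows a side effect worth knowing: the column summary above reports 2744 non-null text values out of 2748 rows, and str() turns a missing value into the literal word nan.

```python
import re

def clean(text):
    """Apply the regex, lowercase, and split steps from the loop above."""
    # Every character that is not a letter (punctuation, digits) becomes a space
    letters_only = re.sub('[^a-zA-Z]', ' ', str(text))
    return letters_only.lower().split()

print(clean("Good case, Excellent value."))
# ['good', 'case', 'excellent', 'value']

# A missing value (NaN) is converted to the string 'nan' by str()
print(clean(float('nan')))
# ['nan']
```

If that is not wanted, one option is to drop or fill the missing rows before building the corpus.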

Let’s have a look at what our cleaned word list looks like:


['way plug u unless go converter',
 'good case excellent value',
 'great jawbone',
 'tied charger conversation lasting minute major problem']


['think food flavor texture lacking',
 'appetite instantly gone',
 'overall not impressed would not go back',
 'whole experience underwhelming think go ninja sushi next time']

Model Preparation

There are several ways to turn text into numeric features for sentiment analysis, such as Bag of Words, TF-IDF, and Word2Vec. Word2Vec is typically used when a large dataset is available, while Bag of Words and TF-IDF also work for small datasets. In this article we use TF-IDF features together with a Naive Bayes classifier.
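
To make the TF-IDF weighting concrete, here is a small pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the TfidfVectorizer defaults. The function name tfidf and the toy documents are illustrative only.

```python
import math

# Toy tokenized corpus (hypothetical, not taken from the datasets)
docs = [["good", "case"], ["good", "value"], ["bad", "case"]]

def tfidf(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    n = len(docs)
    vocab = sorted({w for d in docs for w in d})
    # Document frequency: in how many documents each word occurs
    df = {w: sum(w in d for d in docs) for w in vocab}
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        raw = [d.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf(docs)
print(vocab)  # ['bad', 'case', 'good', 'value']
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document the rarer word "value" gets a higher weight than "good", which appears in two of the three documents.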

    # Creating TF-IDF Model
    cv = TfidfVectorizer()
    X = cv.fit_transform(corpus).toarray()
    y = df.iloc[:,-1]

    # Splitting Dataset into Train and Test data
    X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state=42)

    # Training Model with Naive Bayes
    classifier = MultinomialNB()
    classifier.fit(X_train,y_train)

    y_pred = classifier.predict(X_test)

Checking Accuracy of Model

    cm = confusion_matrix(y_test,y_pred)
    acc = accuracy_score(y_test,y_pred)
    print(" cm :\n {} \nacc : {}".format(cm,acc)) 

cm :
 [[225  66]
 [ 35 224]] 
acc : 0.8163636363636364
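
The accuracy can be recomputed by hand from the confusion matrix. The sketch below copies the four counts printed above and relies on scikit-learn's convention that rows are true labels and columns are predicted labels, with label 0 (negative) first. Precision and recall for the positive class are added as an illustration; they are not part of the original output.

```python
# Counts copied from the confusion matrix printed above
cm = [[225, 66],
      [35, 224]]

# Rows are true labels, columns are predicted labels
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of correct predictions
precision = tp / (tp + fp)     # predicted positives that are truly positive
recall = tp / (tp + fn)        # true positives that were found

print(total)                # 550
print(round(accuracy, 4))   # 0.8164
print(round(precision, 4))  # 0.7724
print(round(recall, 4))     # 0.8649
```

The 550 test sentences are the 20 % of the 2748 rows held out by train_test_split.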

We got an accuracy of about 81.6 %, which is a pretty good score for such a simple model. Our three datasets contain product, movie, and location reviews, so building a separate model for each dataset might give a better accuracy score. Let’s check the model on a few sentences:

    def review(text):
        # Vectorize a single sentence with the fitted TF-IDF model and predict its label
        df1 = [text]
        X1 = cv.transform(df1).toarray()
        prediction = classifier.predict(X1)
        return prediction[0]

    # Create text labels to be returned
    outlabel=["Negative Sentiment","Positive Sentiment"]

    outlabel[int(review("This product is awesome"))]
'Positive Sentiment'

    outlabel[int(review("Location is not good"))]
'Negative Sentiment'

    outlabel[int(review("Movie was boring"))]
'Negative Sentiment'

    outlabel[int(review("Great ambience"))]
'Positive Sentiment'

    outlabel[int(review("Location is not good but moview was amazing"))]
'Positive Sentiment'

    outlabel[int(review("moview was amazing but Location is not good"))]
'Positive Sentiment'
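
The last two sentences get the same label regardless of clause order. A likely reason is that TF-IDF with its default single-word features only records which words occur and how often, not where they occur. The sketch below illustrates this with a plain word count; the helper name bag_of_words is an assumption made for this example.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase words, ignoring their position in the sentence."""
    return Counter(text.lower().split())

a = bag_of_words("Location is not good but moview was amazing")
b = bag_of_words("moview was amazing but Location is not good")

# Both orderings produce identical word counts
print(a == b)  # True
```

Because the feature vectors are identical, the classifier cannot tell the two sentences apart.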


Conclusion

This article built a basic sentiment analysis model using the NLTK library. First we imported the necessary libraries and datasets, then we removed noise from the data. Finally, we built a model that assigns reviews of products, movies, and places to a sentiment, and checked that it works by giving it a few sample inputs.



About the Authors:

Sachin Kumar Gupta

Sachin is a Mechanical Engineer and data science enthusiast. He loves to find trends in data and extract useful information from them. He has executed projects on Machine Learning and Deep Learning using Python.


Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore. He completed his MBA and his Bachelor of Science (Statistics) at the University of Pune, and he is a Certified Data Scientist by EMC. Mohan is a learner who has enriched his experience throughout his career by taking on roles as an Advisor, Consultant, and Business Owner. He has more than 18 years of experience in the field of Analytics and has worked as an Analytics SME in domains including IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing, and Retail. His functional scope covers Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.