Table of Contents

  1. What is a Recommender System?
  2. Introduction to Content-Based Recommender System
  3. Advantages and Disadvantages of Content-Based Recommender System
  4. Creating a Content-Based Recommender System
    1. Importing Necessary Libraries
    2. Reading and Viewing the data
    3. Retaining Relevant Data
    4. Converting the Overview of Movies into Vectors
    5. Vector Comparison using Sigmoid Kernel
    6. Reverse Mapping Movie’s Name with Index
    7. Function for returning similar movies
    8. Output
  5. Future Scope of this project

 

 

 

What is a Recommender System?

A recommender system is an information filtering system that predicts which items a user is likely to prefer, based either on the user’s past selections or on the attributes of the items the user has interacted with. These systems address information overload by surfacing only the content most relevant to each user.

Recommender systems are now part of daily life: from shopping in an online store to finding a new series on Netflix, they are deployed almost everywhere.

 

Introduction to Content-Based Recommender System

Although there are various types of recommender systems, each with its own way of generating recommendations, our focus here is the Content-Based Recommender System.

Now, what is a Content-Based Recommender System?

A recommender system that bases its recommendations on item similarity is known as a Content-Based Recommender System. In broader terms, it recommends products similar to those the user has already liked or viewed.

Have you ever wondered why you are recommended sci-fi movies after watching Interstellar? You guessed it: a content-based recommender system is at work. Interstellar is a sci-fi movie, so other sci-fi movies are considered similar to it.
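
The idea of recommending items that resemble one the user already liked can be sketched in a few lines. The catalogue and genre tags below are hypothetical, and Jaccard similarity on tag sets is used only for illustration; it is not the method used later in this article.

```python
# Hypothetical catalogue: each movie is described by a set of genre tags.
catalogue = {
    "Interstellar": {"sci-fi", "space", "drama"},
    "Gravity": {"sci-fi", "space", "thriller"},
    "The Martian": {"sci-fi", "space", "drama", "survival"},
    "Notting Hill": {"romance", "comedy"},
}

def jaccard(a, b):
    """Share of tags two items have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

def recommend(liked, k=2):
    """Return the k catalogue items whose tags overlap most with `liked`."""
    scores = {
        title: jaccard(catalogue[liked], tags)
        for title, tags in catalogue.items()
        if title != liked
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("Interstellar"))
```

In this toy catalogue the two space-themed sci-fi titles rank above the romantic comedy, which shares no tags with Interstellar.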

 

Advantages and Disadvantages of Content-Based Recommender System

Let’s start with the advantages:

  1. It mitigates the cold-start problem, i.e. even if the database contains no preference data from other users, it can still show recommendations based on item descriptions.
  2. It easily adjusts its recommendations as the user’s preferences change.
  3. It does not rely on similarity between users, so no profile data is shared across users and privacy is maintained.

These are some disadvantages of content-based systems:

  1. Because the recommendations depend on item similarity, a rich description of each item must be supplied to the system.
  2. Content overspecialization can also occur: only content similar to what is already in the user’s list is recommended, so the user is rarely shown anything new or different.

 

Creating a Content-Based Recommender System

Importing Necessary Libraries

    # Importing the Necessary Libraries
    import numpy as np
    import pandas as pd 
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import sigmoid_kernel

Reading and Viewing the data

    # Reading and viewing the data (download the two TMDB CSV files for this case study first)
    credits = pd.read_csv('tmdb_5000_credits.csv')
    movies = pd.read_csv('tmdb_5000_movies.csv')
    pd.set_option('display.max_columns', None)

    credits.head()

   movie_id                                     title                                               cast                                               crew
0     19995                                    Avatar  [{"cast_id": 242, "character": "Jake Sully", "...  [{"credit_id": "52fe48009251416c750aca23", "de...  
1       285  Pirates of the Caribbean: At World's End  [{"cast_id": 4, "character": "Captain Jack Spa...  [{"credit_id": "52fe4232c3a36847f800b579", "de...   
2    206647                                   Spectre  [{"cast_id": 1, "character": "James Bond", "cr...  [{"credit_id": "54805967c3a36829b5002c41", "de...   
3     49026                     The Dark Knight Rises  [{"cast_id": 2, "character": "Bruce Wayne / Ba...  [{"credit_id": "52fe4781c3a36847f81398c3", "de...   
4     49529                               John Carter  [{"cast_id": 5, "character": "John Carter", "c...  [{"credit_id": "52fe479ac3a36847f813eaa3", "de...   
    
    movies.head()
[Figure: Content Filtering in Movies Data Set Preview]

Retaining Relevant Data

    # Dropping the unwanted parts and merging the data
    credits['id'] = credits['movie_id']
    credits.drop('movie_id',axis=1,inplace=True)

    df = movies.merge(credits,on='id')
    df.drop(['homepage', 'title_x', 'title_y', 'status','production_countries'],axis=1,inplace=True)
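
The `title_x` and `title_y` columns dropped above are a by-product of the merge: both tables carry a `title` column, so pandas appends suffixes to tell them apart. The sketch below illustrates this on two hypothetical two-row tables (the values are made up), and uses `rename` as one possible alternative to copying and dropping `movie_id`.

```python
import pandas as pd

# Hypothetical two-row stand-ins for the movies and credits tables.
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"], "overview": ["x", "y"]})
credits = pd.DataFrame({"movie_id": [1, 2], "title": ["A", "B"], "cast": ["c1", "c2"]})

# rename() returns a new frame, so the original credits table is left untouched.
merged = movies.merge(credits.rename(columns={"movie_id": "id"}), on="id")

# The shared "title" column shows up twice, with the default _x / _y suffixes.
print(merged.columns.tolist())
```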

Converting the Overview of Movies into Vectors

    df['overview'][0]
'In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but becomes torn between following orders and protecting an alien civilization.'

    # Creating an object of Tfidf vector class and fitting it on the overview of movies
    tfidf = TfidfVectorizer(stop_words='english',analyzer='word',min_df=3,strip_accents='unicode'
                         ,ngram_range=(1,3),token_pattern=r'\w{1,}',max_features=None)

    # Assign the result back; in-place fillna on a selected column is deprecated in recent pandas
    df['overview'] = df['overview'].fillna('')
    vector = tfidf.fit_transform(df['overview'])
    vector

<4803x10417 sparse matrix of type '<class 'numpy.float64'>'
 with 127220 stored elements in Compressed Sparse Row format>
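
To give a feel for what the numbers in this matrix mean, here is a minimal plain-Python sketch of TF-IDF weighting. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn applies by default; the three-document corpus is hypothetical, and the sketch ignores the stop-word removal, n-grams and `min_df` filter configured above.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for the movie overviews.
docs = [
    "marine alien moon mission",
    "pirate captain ship mission",
    "alien ship invasion",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df_counts = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """TF-IDF weights for one document, L2-normalised."""
    tf = Counter(tokens)
    weights = {
        t: count * (math.log((1 + n_docs) / (1 + df_counts[t])) + 1)
        for t, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in only one document ("marine", "moon") receive a higher weight than terms shared across documents ("alien", "mission"), which is what lets distinctive words dominate the similarity later on.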

Vector Comparison using Sigmoid Kernel

    # Comparison between each and every pair of movie vectors
    sigmoid = sigmoid_kernel(vector,vector)

    sigmoid

array([[0.76163447, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
        0.76159416],
       [0.76159416, 0.76163447, 0.76159416, ..., 0.76159513, 0.76159416,
        0.76159416],
       [0.76159416, 0.76159416, 0.76163447, ..., 0.76159486, 0.76159416,
        0.76159455],
       ...,
       [0.76159416, 0.76159513, 0.76159486, ..., 0.76163447, 0.76159483,
        0.76159473],
       [0.76159416, 0.76159416, 0.76159416, ..., 0.76159483, 0.76163447,
        0.76159461],
       [0.76159416, 0.76159416, 0.76159455, ..., 0.76159473, 0.76159461,
        0.76163447]])
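
The values above all sit very close to 0.7616, which can look odd at first. A likely explanation is that `sigmoid_kernel` computes `tanh(gamma * <x, y> + coef0)` and, when called as above, falls back to its defaults `gamma = 1 / n_features` and `coef0 = 1`. The sketch below reproduces that formula in plain Python; it assumes the TF-IDF rows are L2-normalised (the vectorizer's default), so a vector's dot product with itself is 1.

```python
import math

n_features = 10417          # number of columns in the TF-IDF matrix above
gamma = 1.0 / n_features    # scikit-learn's default when gamma is None
coef0 = 1.0                 # scikit-learn's default

def sigmoid_score(dot_product):
    """Sigmoid kernel value for a given dot product of two vectors."""
    return math.tanh(gamma * dot_product + coef0)

no_overlap = sigmoid_score(0.0)   # two overviews with no shared terms
with_itself = sigmoid_score(1.0)  # an L2-normalised vector dotted with itself

print(f"{no_overlap:.8f}")
print(f"{with_itself:.8f}")
```

The two printed numbers should come out close to the off-diagonal and diagonal entries of the matrix. Because `tanh` is monotonic, ranking movies by this score gives the same order as ranking them by the raw dot product; the kernel only compresses the scale.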

Reverse Mapping Movie’s Name with Index

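    # Reverse mapping from movie title to row index.
    # Note: drop_duplicates() acts on the values (the row indices), not on the titles, so repeated titles are kept.
    index = pd.Series(df.index,index=df['original_title']).drop_duplicates()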
    index

original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

    index['Avatar']
0
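
The lookup returns a single integer only when the title is unique. If two movies share a title, the same lookup returns a Series, which would not work as a row position later on. The sketch below shows one possible way to keep only the first occurrence of each title; the four-title list is hypothetical, with "Batman" repeated on purpose.

```python
import pandas as pd

# Hypothetical toy titles; "Batman" appears twice on purpose.
titles = pd.Series(["Avatar", "Batman", "Spectre", "Batman"])
index = pd.Series(titles.index, index=titles)

# With a repeated label, the lookup yields a Series of positions, not one integer.
print(len(index["Batman"]))

# Keep only the first row for each title so every lookup is a single position.
unique_index = index[~index.index.duplicated(keep="first")]
print(unique_index["Batman"])
```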

    sigmoid[0]
array([0.76163447, 0.76159416, 0.76159416, ..., 0.76159416, 0.76159416,
       0.76159416])

    # Sigmoid score of every movie vector against Avatar's vector, paired with its row index
    list(enumerate(sigmoid[index['Avatar']]))

Function for returning similar movies

    # Function that returns the 10 movies most similar to the given movie
    def content(title,sigmoid=sigmoid):
        # Row position of the requested title
        position = index[title]
        # (row index, score) pairs sorted from most to least similar
        score = sorted(list(enumerate(sigmoid[position])),key=lambda x:x[1],reverse=True)
        # Skip the first entry, which is expected to be the movie itself
        indices = score[1:11]
        movie_indices = [i[0] for i in indices]
        # Top 10 most similar movies
        return df['original_title'].iloc[movie_indices]
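
The slice `score[1:11]` relies on the queried movie sorting first, which may not hold if another movie ties with it (for example, one with an identical overview). As an alternative, the sketch below excludes the query position explicitly and uses `heapq.nlargest` to avoid sorting the whole row. The helper name `top_k_similar` and the sample scores are hypothetical.

```python
import heapq

def top_k_similar(scores, position, k=10):
    """Indices of the k highest scores, skipping the query item itself."""
    candidates = ((i, s) for i, s in enumerate(scores) if i != position)
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return [i for i, _ in best]

# Hypothetical similarity row for item 0; item 3 ties with the query's own score.
row = [0.9, 0.2, 0.7, 0.9, 0.4]
print(top_k_similar(row, position=0, k=3))
```

The returned positions could then be passed to `df['original_title'].iloc[...]` in the same way as `movie_indices` in the function above.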

Output

    content('Avatar')

1341                Obitaemyy Ostrov
634                       The Matrix
3604                       Apollo 18
2130                    The American
775                        Supernova
529                 Tears of the Sun
151                          Beowulf
311     The Adventures of Pluto Nash
847                         Semi-Pro
942                 The Book of Life
Name: original_title, dtype: object

Future Scope of this project

This is a first-generation recommender system; the industry now employs more complex recommender systems that use neural networks for prediction. Still, working through this project should give you a basic intuition for what content-based recommender systems are, and you can embed a system like this in a web or Android application.

 

 

About the Authors:

Utkarsh Bahukhandi

Utkarsh Bahukhandi is a B.Tech undergraduate at Maharaja Agrasen Institute of Technology. He is a data science enthusiast who explores challenging projects in ML and DS niches such as Natural Language Processing and Computer Vision.

 

Mohan Rai

Mohan Rai is an alumnus of IIM Bangalore. He completed his MBA and his Bachelor of Science (Statistics) at the University of Pune, and he is a Certified Data Scientist by EMC. Mohan is a learner who has enriched his experience throughout his career by taking on opportunities as an Advisor, Consultant and Business Owner. He has more than 18 years of experience in the field of Analytics and has worked as an Analytics SME in domains ranging from IT, Banking, Construction, Real Estate, Automobile, Component Manufacturing and Retail. His functional scope covers areas including Training, Research, Sales, Market Research, Sales Planning, and Market Strategy.