Measuring Text Similarity with Movie Plot Data

Python · Natural Language Processing · Unsupervised Machine Learning

In this post, I analyze the text of movie plot summaries to measure how similar the movies are to one another, using TF-IDF and clustering.

(7 min read)

Tarid Wongvorachan (University of Alberta) https://www.ualberta.ca
2021-12-29

Introduction

To begin, we load a dataset of 100 movies whose plot summaries are drawn from both Wikipedia and IMDb.
# Import modules
import numpy as np
import pandas as pd
import nltk

# Set seed for reproducibility
np.random.seed(5)

# Read in IMDb and Wikipedia movie data (both in the same file)
movies_df = pd.read_csv("movies.csv")

print("Number of movies loaded: %s " % (len(movies_df)))

# Display the data
Number of movies loaded: 100 
Show code
movies_df
    rank  ...                                          imdb_plot
0      0  ...  In late summer 1945, guests are gathered for t...
1      1  ...  In 1947, Andy Dufresne (Tim Robbins), a banker...
2      2  ...  The relocation of Polish Jews from surrounding...
3      3  ...  The film opens in 1964, where an older and fat...
4      4  ...  In the early years of World War II, December 1...
..   ...  ...                                                ...
95    95  ...  Shortly after moving to Los Angeles with his p...
96    96  ...  L.B. "Jeff" Jeffries (James Stewart) recuperat...
97    97  ...  Sights of Vienna, Austria, flash across the sc...
98    98  ...  At the end of an ordinary work day, advertisin...
99    99  ...                                                NaN

[100 rows x 5 columns]

Combine Wikipedia and IMDb plot summaries

# Combine wiki_plot and imdb_plot into a single column
movies_df["plot"] = movies_df["wiki_plot"].astype(str) + "\n" + \
                    movies_df["imdb_plot"].astype(str)
                    
movies_df.head()
   rank  ...                                               plot
0     0  ...  On the day of his only daughter's wedding, Vit...
1     1  ...  In 1947, banker Andy Dufresne is convicted of ...
2     2  ...  In 1939, the Germans move Polish Jews into the...
3     3  ...  In a brief scene in 1964, an aging, overweight...
4     4  ...  It is early December 1941. American expatriate...

[5 rows x 6 columns]

Tokenization

# The NLTK tokenizers require the "punkt" models; download them once if needed
nltk.download("punkt", quiet=True)

# Tokenize a paragraph into sentences and store in sent_tokenized
sent_tokenized = nltk.sent_tokenize("""
                        Today (May 19, 2016) is his only daughter's wedding. 
                        Vito Corleone is the Godfather.
                        """)

# Word-tokenize the first sentence from sent_tokenized, save as words_tokenized
words_tokenized = nltk.word_tokenize(sent_tokenized[0])

# Remove tokens that do not contain any letters from words_tokenized
import re

filtered = [word for word in words_tokenized if re.search('[a-zA-Z]', word)]

# Display filtered words to observe words after tokenization
filtered
['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
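Note that the parentheses, the comma, the date numbers, and the final period were dropped, because the regular expression keeps only tokens that contain at least one letter.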

Stemming

# Import the SnowballStemmer to perform stemming
from nltk.stem.snowball import SnowballStemmer

# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")

# Print filtered to observe words without stemming
print("Without stemming: ", filtered)

# Stem the words from filtered and store in stemmed_words
Without stemming:  ['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
Show code
stemmed_words = [stemmer.stem(t) for t in filtered]

# Print the stemmed_words to observe words after stemming
print("After stemming:   ", stemmed_words)
After stemming:    ['today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed']
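The stems are not always dictionary words ('only' becomes 'onli', 'wedding' becomes 'wed'), but that is fine: what matters is that different inflections of the same word now map to the same token.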

Tokenization and Stemming together

# Define a function to perform both stemming and tokenization
def tokenize_and_stem(text):
    
    # Tokenize by sentence, then by word
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    
    # Filter out raw tokens to remove noise
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    
    # Stem the filtered_tokens
    stems = [stemmer.stem(t) for t in filtered_tokens]
    
    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)
['today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed']
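This helper will be passed to the TF-IDF vectorizer below as its tokenizer.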

Create TF-IDF Vectorizer

# Import TfidfVectorizer to create TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Instantiate a TfidfVectorizer that ignores terms appearing in more than 80%
# or fewer than 20% of the plots, removes English stop words, applies our
# custom tokenizer, and builds unigram, bigram, and trigram features
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem,
                                 ngram_range=(1,3))

Fit and transform the TF-IDF Vectorizer

# Fit and transform the tfidf_vectorizer with the "plot" of each movie
# to create a vector representation of the plot summaries
tfidf_matrix = tfidf_vectorizer.fit_transform(movies_df["plot"])
C:\Users\tarid\AppData\Roaming\Python\Python38\site-packages\sklearn\feature_extraction\text.py:383: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
  warnings.warn('Your stop_words may be inconsistent with '
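(This warning simply notes that our custom tokenizer stems some of the built-in stop words, e.g. 'only' becomes 'onli', so a few stemmed stop-word variants may survive in the vocabulary; it is harmless for this analysis.)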
print(tfidf_matrix.shape)
(100, 564)
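As a quick sanity check, we can peek at the learned features. This is a minimal sketch: vocabulary_ is the vectorizer's fitted mapping from each stemmed n-gram to its column index, so sorting its keys lists the features alphabetically.

# Display ten of the 564 stemmed n-gram features (sketch)
print(sorted(tfidf_vectorizer.vocabulary_)[:10])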

Import K-Means and create clusters

# Import k-means to perform clustering
from sklearn.cluster import KMeans

# Create a KMeans object with 5 clusters and save as km
km = KMeans(n_clusters=5)

# Fit the k-means object with tfidf_matrix
km.fit(tfidf_matrix)
KMeans(n_clusters=5)
clusters = km.labels_.tolist()

# Create a column cluster to denote the generated cluster for each movie
movies_df["cluster"] = clusters

# Display number of films per cluster (clusters from 0 to 4)
movies_df['cluster'].value_counts() 
3    31
1    27
0    22
4    13
2     7
Name: cluster, dtype: int64
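To interpret what each cluster captured, one common follow-up (a sketch, not part of the original analysis) is to rank each centroid's TF-IDF weights and print the top stemmed terms per cluster:

# Sketch: top five stemmed terms closest to each cluster centroid
terms = sorted(tfidf_vectorizer.vocabulary_, key=tfidf_vectorizer.vocabulary_.get)
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(5):
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print("Cluster %d: %s" % (i, ", ".join(top_terms)))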
import seaborn as sns
import matplotlib.pyplot as plt

# Convert the cluster list into a DataFrame for plotting
clusters_df = pd.DataFrame(clusters, columns=['cluster_group'])

sns.set_theme(style="whitegrid")
sns.catplot(x="cluster_group", kind="count", data=clusters_df)

Calculate similarity distance

# Import cosine_similarity to calculate similarity of movie plots
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the similarity distance
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
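With the distance matrix in hand, we can already answer simple queries. A minimal sketch, assuming the title column contains "The Godfather" (the first plot shown above suggests it does):

# Sketch: find the plot most similar to a given movie
idx = movies_df.index[movies_df["title"] == "The Godfather"][0]
dist = similarity_distance[idx].copy()
dist[idx] = np.inf  # exclude the movie itself
print("Closest match:", movies_df.loc[dist.argmin(), "title"])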

Create the linkage matrix and plot the dendrogram

# Import modules necessary to plot the dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Create the mergings matrix; linkage expects a condensed distance matrix,
# so convert the square similarity_distance first (checks=False tolerates
# tiny floating-point values on the diagonal)
mergings = linkage(squareform(similarity_distance, checks=False), method='complete')

# Plot the dendrogram, using the movie titles as leaf labels
dendrogram_ = dendrogram(mergings,
               labels=list(movies_df["title"]),
               leaf_rotation=90,
               leaf_font_size=16,
)

# Adjust the plot
fig = plt.gcf()
_ = [lbl.set_color('r') for lbl in plt.gca().get_xmajorticklabels()]
fig.set_size_inches(120, 50)

# Show the plotted dendrogram
plt.grid(False)
plt.show()
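Finally, the hierarchy can be cut into a flat clustering to compare against the K-Means labels. A minimal sketch using SciPy's fcluster (an extension of the original analysis):

# Sketch: cut the dendrogram into 5 flat clusters and cross-tabulate with K-Means
from scipy.cluster.hierarchy import fcluster

hier_clusters = fcluster(mergings, t=5, criterion='maxclust')
print(pd.crosstab(movies_df["cluster"], hier_clusters))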

Concluding remark

By representing each plot summary as a TF-IDF vector, we grouped the 100 movies into five K-Means clusters and visualized their pairwise similarity as a dendrogram, where movies with similar plots sit on neighboring branches.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Wongvorachan (2021, Dec. 29). Tarid Wongvorachan: Measuring Text Similarity with Movie Plot Data. Retrieved from https://taridwong.github.io/posts/2021-12-29-movie-similarity/

BibTeX citation

@misc{wongvorachan2021measuring,
  author = {Wongvorachan, Tarid},
  title = {Tarid Wongvorachan: Measuring Text Similarity with Movie Plot Data},
  url = {https://taridwong.github.io/posts/2021-12-29-movie-similarity/},
  year = {2021}
}