A comparative case study on stemming vs lemmatization using IMDb movie reviews, focusing on NLP preprocessing and vocabulary analysis.

ssrishtix/IMDB-Sentiment

IMDB-Sentiment

IMDb Movie Review - NLP Preprocessing

Overview

This project is a case study comparing two text preprocessing techniques:

Stemming (using PorterStemmer)

Lemmatization (using WordNetLemmatizer)

The goal is to observe how each technique affects vocabulary size, text clarity, and information retention.
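To make the difference between the two techniques concrete, here is a minimal sketch that applies NLTK's PorterStemmer and WordNetLemmatizer to a few words. The helper names stem_words and lemmatize_words are hypothetical and are not taken from the project's notebook. The lemmatizer needs the WordNet corpus, which is assumed to have been fetched once with nltk.download("wordnet").

```python
def stem_words(words):
    """Return the Porter stem of each word.

    NLTK is imported inside the function so that this module can still be
    imported on a machine where NLTK is not installed.
    """
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]


def lemmatize_words(words, pos="n"):
    """Return the WordNet lemma of each word.

    Assumption: the WordNet corpus has already been downloaded with
    nltk.download("wordnet"). `pos` is the part of speech passed to the
    lemmatizer ("n" for noun, "v" for verb); NLTK treats words as nouns
    when no part of speech is given.
    """
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word, pos=pos) for word in words]
```

For example, stem_words(["studies", "movies"]) gives truncated stems such as "studi" and "movi", while lemmatize_words(["studies", "movies"]) gives the dictionary forms "study" and "movie". That is the trade-off the case study examines: stems are shorter and merge more forms, lemmas stay readable.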

Steps Performed

Cleaned the text: lowercasing, punctuation removal, digit removal, etc.

Removed custom stopwords.

Created two versions of the reviews: one stemmed, one lemmatized.

Analyzed and visualized the results using bar plots and word clouds.
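The cleaning and stopword steps listed above could look roughly like the sketch below. It uses only the standard library. The stopword set is a hypothetical placeholder, since the project's actual custom list is not shown here, and clean_review is an illustrative name rather than the notebook's own function.

```python
import re
import string

# Hypothetical custom stopword list; replace with the project's real one.
CUSTOM_STOPWORDS = {"the", "a", "an", "and", "is", "it", "this", "of", "movie", "film"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_review(text, stopwords=CUSTOM_STOPWORDS):
    """Lowercase a review, strip punctuation and digits, drop stopwords.

    Returns a list of tokens split on whitespace.
    """
    text = text.lower()
    text = text.translate(_PUNCT_TABLE)   # punctuation removal
    text = re.sub(r"\d+", "", text)       # digit removal
    return [token for token in text.split() if token not in stopwords]


if __name__ == "__main__":
    print(clean_review("This movie is GREAT!!! Loved it 10 times."))
```

The returned token lists can then be passed to a stemmer or a lemmatizer to build the two versions of each review.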

Results

Stemmed Vocabulary Size: ~22,000 words

Lemmatized Vocabulary Size: ~26,000 words

Conclusion: Lemmatization preserved semantic meaning better and retained a richer vocabulary than stemming.
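Vocabulary size here means the number of distinct tokens left after normalization. The sketch below shows one way to measure it, assuming each review is already a list of tokens. The function name vocabulary_size and the toy stem table are hypothetical and only illustrate why merging word forms shrinks the count; they do not reproduce the figures above.

```python
def vocabulary_size(token_lists, normalize=lambda token: token):
    """Count distinct tokens across all reviews after applying `normalize`.

    `normalize` is any function mapping a token to its normalized form,
    for example a stemmer's stem method or a lemmatizer's lemmatize method.
    """
    return len({normalize(token) for tokens in token_lists for token in tokens})


if __name__ == "__main__":
    reviews = [["loved", "loving", "movies"], ["love", "movie"]]

    # Toy lookup table standing in for a real stemmer (illustration only).
    toy_stems = {"loved": "love", "loving": "love", "movies": "movi", "movie": "movi"}

    print(vocabulary_size(reviews))                                     # raw tokens
    print(vocabulary_size(reviews, lambda t: toy_stems.get(t, t)))      # after merging forms
```

With NLTK installed, passing PorterStemmer().stem or WordNetLemmatizer().lemmatize as normalize would give the stemmed and lemmatized counts for the same token lists.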

Technologies

Python

Pandas

NLTK

Matplotlib

Seaborn

WordCloud
