SMU MSDS — Natural Language Processing
Eight progressive homework assignments covering foundational and applied NLP techniques. Dataset sources include Project Gutenberg classic literature and 150 live-scraped IMDb movie reviews across three genres.
Implemented a lexical diversity scoring function (unique words / total words) and applied it to Project Gutenberg texts at different grade levels. Compared vocabulary richness across elementary, middle school, and college-level texts using matplotlib visualizations.
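A minimal sketch of the scoring function described above. The regex tokenizer and the name `lexical_diversity` are assumptions for illustration; the assignment's own tokenization may differ.

```python
import re


def lexical_diversity(text: str) -> float:
    """Return unique words / total words, ignoring case and punctuation."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0  # avoid dividing by zero on empty input
    return len(set(words)) / len(words)


sample = "The cat saw the other cat"
score = lexical_diversity(sample)  # 4 unique words out of 6 total
print(f"{score:.3f}")
```

Because the ratio tends to fall as a text gets longer, comparing equal-sized samples of each book is one way to keep scores comparable across grade levels.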
Tokenization with NLTK word_tokenize, stopword filtering, frequency distributions, and n-gram analysis. Built text cleaning pipelines that remove punctuation and normalize case before analysis.
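The pipeline can be sketched roughly as follows. To keep the example runnable without downloading NLTK data, it swaps `word_tokenize` for NLTK's `RegexpTokenizer` and uses a tiny hand-written stopword set in place of the NLTK stopword corpus; both substitutions are assumptions, not the assignment's code.

```python
from nltk import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

# Stand-in for nltk.corpus.stopwords.words("english"), which needs a download.
STOPWORDS = {"the", "a", "and", "of", "to", "in", "it", "was"}


def clean_tokens(text: str) -> list[str]:
    """Lowercase, drop punctuation, and remove stopwords."""
    tokenizer = RegexpTokenizer(r"[a-z']+")  # keeps letters and apostrophes only
    return [tok for tok in tokenizer.tokenize(text.lower()) if tok not in STOPWORDS]


text = "It was the best of times, it was the worst of times."
tokens = clean_tokens(text)
freq = FreqDist(tokens)            # word -> count
bigrams = list(ngrams(tokens, 2))  # adjacent token pairs

print(freq.most_common(2))
print(bigrams)
```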
Scraped 150 IMDb reviews of movies in three genres (titles include Baby Driver and Fast & Furious) using BeautifulSoup and requests. Applied NLTK POS tagging and a RegexpParser chunk grammar to extract noun phrases and identify genre-specific vocabulary patterns.
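A sketch of the noun-phrase chunking step. The chunk grammar and the sentence are illustrative assumptions, and the tokens are tagged by hand so the example runs without the NLTK tagger model; in practice the tags would come from `nltk.pos_tag`.

```python
from nltk import RegexpParser

# Assumed grammar: optional determiner, any adjectives, one or more nouns.
GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"

# Hand-tagged tokens standing in for nltk.pos_tag(word_tokenize(review)).
tagged = [
    ("The", "DT"), ("fast", "JJ"), ("car", "NN"), ("chases", "VBZ"),
    ("a", "DT"), ("getaway", "NN"), ("driver", "NN"),
]

parser = RegexpParser(GRAMMAR)
tree = parser.parse(tagged)

# Collect the text of every NP subtree.
phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(phrases)
```

Counting the extracted phrases per genre is one way to surface vocabulary that is specific to a genre.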
Collected 24 Amazon book titles (machine learning category) and search engine queries. Vectorized with TfidfVectorizer and CountVectorizer, then computed pairwise cosine similarity to find the most similar documents. Because both vectorizers are bag-of-words models, the scores reflect shared vocabulary rather than deeper semantic similarity.
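The vectorize-then-compare step might look like the sketch below. The four titles are made-up placeholders, not the 24 titles collected for the assignment, and default `TfidfVectorizer` settings are assumed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder titles for illustration only.
titles = [
    "Hands-On Machine Learning with Python",
    "Machine Learning with Python Cookbook",
    "Deep Learning for Computer Vision",
    "Introduction to Statistical Inference",
]

tfidf = TfidfVectorizer().fit_transform(titles)  # rows: titles, columns: terms
sim = cosine_similarity(tfidf)                   # dense n x n similarity matrix
np.fill_diagonal(sim, 0.0)                       # ignore self-similarity

# Indices of the highest-scoring pair.
i, j = (int(k) for k in np.unravel_index(sim.argmax(), sim.shape))
print(f"Most similar: {titles[i]!r} and {titles[j]!r} ({sim[i, j]:.2f})")
```

Swapping `TfidfVectorizer` for `CountVectorizer` leaves the rest of the code unchanged, which makes it easy to compare the two weightings.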
Applied NLTK named entity chunking, compared PorterStemmer vs. WordNetLemmatizer on the same corpus, and evaluated how each normalization strategy affects downstream similarity scores.
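One way to measure the effect of normalization on similarity is to plug a word-level normalizer into the vectorizer and compare scores. The helper `similarity_with` and the two sentences are hypothetical. Only `PorterStemmer` is run here because it needs no data download; `WordNetLemmatizer().lemmatize` could be passed the same way once the WordNet corpus is installed.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_with(normalize, doc_a: str, doc_b: str) -> float:
    """Cosine similarity of two documents after normalizing every token."""
    def analyzer(text):
        return [normalize(tok) for tok in re.findall(r"[a-z]+", text.lower())]

    matrix = TfidfVectorizer(analyzer=analyzer).fit_transform([doc_a, doc_b])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])


doc_a = "the drivers were racing"
doc_b = "a driver races"

raw = similarity_with(lambda tok: tok, doc_a, doc_b)       # no normalization
stemmed = similarity_with(PorterStemmer().stem, doc_a, doc_b)
print(f"raw={raw:.2f} stemmed={stemmed:.2f}")
```

Without normalization the two sentences share no tokens, so any similarity in the stemmed run comes entirely from collapsing inflected forms.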
Used VADER (Valence Aware Dictionary and sEntiment Reasoner) to score all 150 IMDb reviews. Aggregated compound sentiment scores by movie and genre with pandas, and visualized the distributions with matplotlib and seaborn.
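A sketch of the score-then-aggregate flow. The column names, genre and movie labels, and compound values in the demo frame are placeholders rather than results from the assignment; `add_compound_scores` shows how real scores would be produced but is not called here because it requires the `vader_lexicon` download.

```python
import pandas as pd


def add_compound_scores(df: pd.DataFrame, text_col: str = "review") -> pd.DataFrame:
    """Add a VADER compound score column (requires nltk.download('vader_lexicon'))."""
    # Imported lazily so the aggregation demo below runs without NLTK data.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    out = df.copy()
    out["compound"] = out[text_col].map(lambda t: analyzer.polarity_scores(t)["compound"])
    return out


def summarize(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Mean, median, and count of compound scores per group."""
    return df.groupby(by)["compound"].agg(["mean", "median", "count"])


# Placeholder rows with invented compound values, for illustration only.
scored = pd.DataFrame({
    "genre": ["action", "action", "comedy"],
    "movie": ["movie_a", "movie_b", "movie_c"],
    "compound": [0.5, 0.7, -0.2],
})

by_genre = summarize(scored, "genre")
print(by_genre)
```

Calling `summarize(scored, "movie")` gives the per-movie view from the same helper.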
| Category | Tools |
|---|---|
| Core NLP | NLTK, VADER, RegexpParser |
| Vectorization | TfidfVectorizer, CountVectorizer |
| Data | pandas, numpy |
| Web Scraping | BeautifulSoup, requests |
| Visualization | matplotlib, seaborn |
| Similarity | scikit-learn (cosine_similarity) |