BERT Sentiment Classifier

Hybrid sentiment pipeline - BERT embeddings with KMeans clustering and a fine-tuned discriminative head.

Why it matters

Two ways to attack the same NLP problem - unsupervised structure-finding vs supervised fine-tuning - measured on the same evaluation harness.

Python · Transformers · PyTorch · scikit-learn

What it does

A text-classification pipeline that attacks sentiment analysis two ways on the same data: an unsupervised approach using KMeans over BERT embeddings, and a supervised approach fine-tuning BERT directly. Both run through the same evaluation harness so the comparison is on equal footing.

Where it applies

  • Text-classification problems where labels are scarce - the unsupervised path lets you cluster first and label second.
  • A teaching example for the gap between "find structure" and "fit a label" - useful when stakeholders aren't sure whether supervision is needed yet.
  • A reusable preprocessing and evaluation scaffold for further NLP work: lowercasing, special-character cleanup, stopword removal, lemmatisation, with imbalanced-class diagnostics built in.
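The preprocessing steps named in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `preprocess` function, the tiny `STOPWORDS` set, and the pluggable `lemmatise` callable are all assumptions made for the example, and the character filter assumes English, ASCII-only text.

```python
import re
from typing import Callable, Iterable

# Tiny illustrative stopword set (an assumption, not the project's list).
# A fuller list, such as the English one shipped with NLTK, would replace it.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "it", "this", "was", "of", "to"})


def preprocess(
    text: str,
    stopwords: Iterable[str] = STOPWORDS,
    lemmatise: Callable[[str], str] = lambda token: token,
) -> str:
    """Lowercase, strip special characters, drop stopwords, lemmatise tokens.

    `lemmatise` defaults to the identity function so the sketch runs with
    no external data; pass a real lemmatiser to get actual lemmas.
    """
    stopword_set = set(stopwords)
    lowered = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace
    # with a space (this assumes English text).
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    tokens = [tok for tok in cleaned.split() if tok not in stopword_set]
    return " ".join(lemmatise(tok) for tok in tokens)


print(preprocess("This movie was GREAT!!! Loved it :)"))
# prints "movie great loved"
```

Since NLTK is in the stack, one plausible choice for `lemmatise` is the `lemmatize` method of `nltk.stem.WordNetLemmatizer`, which needs the WordNet corpus downloaded first.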

How it works (high level)

Preprocessing standardises the text, which then feeds two parallel paths. The unsupervised path encodes documents with BERT, runs KMeans on the embeddings, and maps each cluster to a sentiment by majority vote against a small labelled subset. The supervised path fine-tunes BERT on the same train/validation/test split. Both paths report the same metrics (accuracy, precision, recall, F1, confusion matrix) and write a CSV of test predictions for downstream review.
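The majority-vote step is the least obvious part of the unsupervised path, so here is a minimal sketch of it. It assumes cluster assignments already exist (for example from KMeans over BERT embeddings) and that unlabelled documents carry `None` as their label. The function name `map_clusters_to_labels` and the `"unknown"` fallback are hypothetical choices for this example, not names taken from the project.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Optional, Sequence


def map_clusters_to_labels(
    cluster_ids: Sequence[int],
    labels: Sequence[Optional[Hashable]],
    default: Hashable = "unknown",
) -> Dict[int, Hashable]:
    """Assign each cluster the most common label among its labelled members.

    `labels[i]` is the known sentiment of document i, or None if unlabelled.
    Clusters with no labelled members fall back to `default`.
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cluster_id][label] += 1

    mapping = {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}
    for cluster_id in set(cluster_ids):
        mapping.setdefault(cluster_id, default)
    return mapping


# Toy data: seven documents in three clusters, four of them labelled.
cluster_ids = [0, 0, 0, 1, 1, 1, 2]
labels = ["pos", None, "pos", "neg", "neg", "pos", None]

mapping = map_clusters_to_labels(cluster_ids, labels)
predictions = [mapping[cid] for cid in cluster_ids]
print(mapping)
print(predictions)
```

In this toy run cluster 0 maps to "pos", cluster 1 maps to "neg" by two votes to one, and cluster 2 has no labelled members, so it gets the fallback. A real pipeline would need a policy for tied votes, which this sketch does not address.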

Outcome

A like-for-like comparison of the two approaches on a single evaluation harness, with trained models and cluster mappings exported for reuse.

Stack

Python · Transformers · PyTorch · scikit-learn · NLTK · Gensim.