Yelp ML Platform.

Personal project

Built an end-to-end machine learning platform on the full 7-million-review Yelp dataset that powers two services: a business recommendation engine and a sentiment classifier, served through one REST API. The work spans large-scale data processing, model training, API serving, containerization, and automated testing.

Links

  • GitHub

Stack

  • PySpark
  • FastAPI
  • MLflow
  • Docker
  • NLTK
  • NumPy
  • Pytest

Figure: System architecture.

Overview

A production-style machine learning platform built on the full Yelp Open Dataset. It takes raw review data all the way to a served API, with two models: one that recommends businesses and one that scores the sentiment of review text.

Approach

  • Large-scale ETL. A Spark pipeline converts the full seven-million-review dataset from JSON to Parquet at about 462,000 rows per second on a single eight-core node.
  • Two models. A Spark ALS collaborative-filtering recommender and a TF-IDF plus logistic-regression sentiment classifier, tracked and versioned with MLflow.
  • Fast serving. The sentiment model is exported to a Spark-free NumPy inference path with verified prediction parity, served through a FastAPI service.
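The Spark-free inference path described above can be illustrated with a minimal sketch. It is not the project's actual export: it assumes a vocabulary-based TF-IDF with raw term counts, a simple regex tokenizer, and a softmax over one weight row per class. The names `VOCAB`, `IDF`, `WEIGHTS`, `BIAS`, `LABELS`, and all numeric values are hypothetical toy artifacts standing in for whatever the trained model would export.

```python
import re

import numpy as np

# Hypothetical exported artifacts (toy values); a real export would load these from disk.
VOCAB = {"great": 0, "terrible": 1, "food": 2, "service": 3}
IDF = np.array([1.2, 1.5, 0.4, 0.5])
# One weight row per class: row 0 = negative, row 1 = positive.
WEIGHTS = np.array([
    [-1.0, 2.0, 0.0, 0.1],
    [2.0, -1.5, 0.1, 0.0],
])
BIAS = np.array([0.0, 0.1])
LABELS = ["negative", "positive"]


def tfidf_vector(text: str) -> np.ndarray:
    """Build a dense TF-IDF vector: raw term counts multiplied by IDF."""
    counts = np.zeros(len(VOCAB))
    for token in re.findall(r"[a-z']+", text.lower()):
        idx = VOCAB.get(token)
        if idx is not None:  # out-of-vocabulary tokens are ignored
            counts[idx] += 1.0
    return counts * IDF


def predict_proba(text: str) -> np.ndarray:
    """Return class probabilities from a linear model followed by softmax."""
    logits = WEIGHTS @ tfidf_vector(text) + BIAS
    logits = logits - logits.max()  # shift for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()


def predict(text: str) -> str:
    """Return the label with the highest probability."""
    return LABELS[int(np.argmax(predict_proba(text)))]


def parity_rate(texts, reference_labels) -> float:
    """Fraction of texts where this path agrees with a reference model's labels."""
    matches = [predict(t) == ref for t, ref in zip(texts, reference_labels)]
    return sum(matches) / len(matches)
```

A parity check in this style would feed the same held-out texts through both the training model and the exported path, then compare labels; `parity_rate` returning 1.0 corresponds to 100 percent agreement on that sample.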

Results

The recommender reached a Recall@10 of 5.5 percent, 6.2 times that of a most-popular baseline, and a bias-augmented variant reached an RMSE of 1.17. The sentiment classifier reached 86.3 percent accuracy and a macro-F1 of 0.73 with class weighting. The exported inference path runs at a p99 latency of 0.11 ms with 100 percent prediction parity to the training model.
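For readers unfamiliar with the two headline metrics, here is a small sketch of how Recall@K and macro-F1 are commonly computed. The exact definitions used in the project are not stated, so this assumes per-user recall (hits in the top K divided by held-out items, averaged over users with at least one held-out item) and an unweighted mean of per-class F1. The function names and sample data are illustrative only.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(
    recommended: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean per-user recall over users with at least one held-out item."""
    scores = []
    for user, held_out in relevant.items():
        if not held_out:
            continue  # skip users with nothing to recover
        top_k = set(recommended.get(user, [])[:k])
        scores.append(len(top_k & held_out) / len(held_out))
    return sum(scores) / len(scores) if scores else 0.0


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data: user "a" recovers 1 of 2 held-out businesses, user "b" recovers 1 of 1.
recs = {"a": ["b1", "b2", "b3"], "b": ["b4"]}
held_out = {"a": {"b2", "b9"}, "b": {"b4"}}
print(recall_at_k(recs, held_out, k=10))  # 0.75
```

Because macro-F1 weights each class equally regardless of size, it can sit well below accuracy when a minority class is harder to predict, which is one plausible reading of the gap between 86.3 percent accuracy and a macro-F1 of 0.73.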

Engineering

The platform uses PySpark for ETL and model training, FastAPI for serving, MLflow for experiment tracking, and Docker Compose for orchestration, with a Pytest suite and GitHub Actions CI. Every headline number is backed by committed, provenance-stamped benchmark outputs.