Yelp ML Platform · Rushir Bhavsar

Overview

A production-style machine learning platform built on the full Yelp Open Dataset. It takes raw review data all the way to a served API, with two models: one that recommends businesses and one that scores the sentiment of review text.

Approach

Large-scale ETL. A Spark pipeline converts the full seven-million-review dataset from JSON to Parquet at about 462,000 rows per second on a single eight-core node.
Two models. A Spark ALS collaborative-filtering recommender and a TF-IDF plus logistic-regression sentiment classifier, tracked and versioned with MLflow.
Fast serving. The sentiment model is exported to a Spark-free NumPy inference path, verified for prediction parity with the trained model, and served through FastAPI.
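
To make the Spark-free inference idea concrete, here is a minimal sketch of TF-IDF plus logistic-regression scoring in plain NumPy. It is not the project's actual export: the vocabulary, IDF weights, coefficients, label set, and tokenizer below are all hypothetical placeholders, and a real export would load the fitted values from the trained pipeline.

```python
import re

import numpy as np

# Hypothetical exported artifacts (assumed for illustration only).
VOCAB = {"great": 0, "terrible": 1, "food": 2, "service": 3}
IDF = np.array([1.2, 1.5, 0.4, 0.6])         # one IDF weight per vocabulary term
COEF = np.array([[-2.0, 2.5, 0.0, 0.1],      # row 0 scores "negative"
                 [2.0, -2.5, 0.0, -0.1]])    # row 1 scores "positive"
INTERCEPT = np.array([0.0, 0.0])
LABELS = ["negative", "positive"]            # assumed label set


def tfidf_vector(text):
    """Count in-vocabulary tokens, then scale each count by its IDF weight."""
    counts = np.zeros(len(VOCAB))
    for token in re.findall(r"[a-z']+", text.lower()):
        idx = VOCAB.get(token)
        if idx is not None:
            counts[idx] += 1.0
    return counts * IDF


def predict_proba(text):
    """Softmax over the linear scores, shifted by the max for numerical stability."""
    logits = COEF @ tfidf_vector(text) + INTERCEPT
    logits -= logits.max()
    exp = np.exp(logits)
    return exp / exp.sum()


def predict(text):
    """Return the label with the highest probability."""
    return LABELS[int(np.argmax(predict_proba(text)))]
```

A parity check in this style would run the same inputs through both the trained pipeline and the exported path and compare the outputs, for example with `np.allclose` on the probabilities.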

Results

The recommender reached a Recall@10 of 5.5 percent, 6.2 times that of a most-popular baseline; a bias-augmented variant reached an RMSE of 1.17. With class weighting, the sentiment classifier reached 86.3 percent accuracy and a macro-F1 of 0.73. The exported inference path has a p99 latency of 0.11 ms and 100 percent prediction parity with the training model.
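
For readers unfamiliar with the metrics, the sketch below shows one common way to compute Recall@K and macro-F1 in plain Python. The exact definitions used in the project are not stated here, so treat these as assumptions: Recall@K is taken as the per-user fraction of held-out items that appear in the top K, averaged over users, and macro-F1 as the unweighted mean of per-class F1.

```python
def recall_at_k(recommended, relevant, k=10):
    """Mean over users of |top-k ∩ held-out| / |held-out|.

    recommended: dict of user -> ranked list of item ids
    relevant:    dict of user -> set of held-out item ids
    Users with no held-out items are skipped.
    """
    scores = []
    for user, held_out in relevant.items():
        if not held_out:
            continue
        top_k = recommended.get(user, [])[:k]
        hits = len(set(top_k) & set(held_out))
        scores.append(hits / len(held_out))
    return sum(scores) / len(scores) if scores else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because macro-F1 weights every class equally, it can sit well below accuracy when classes are imbalanced, which is one plausible reading of the gap between the two sentiment numbers above.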

Engineering

PySpark for ETL and model training, FastAPI for serving, MLflow for experiment tracking, and Docker Compose for orchestration, with a Pytest suite and GitHub Actions CI. Every headline number is backed by committed, provenance-stamped benchmark outputs.
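
As an illustration of how a tail-latency figure such as the p99 in Results could be produced, here is a minimal benchmark harness in plain Python. It is a sketch, not the project's benchmark code: the function names, warmup count, run count, and nearest-rank percentile method are all assumptions.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(fn, payload, warmup=100, runs=10_000):
    """Time repeated calls to fn(payload) and report latency percentiles in milliseconds."""
    for _ in range(warmup):          # warm caches before measuring
        fn(payload)
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

Writing the returned dictionary to a committed file alongside details such as the commit hash and hardware would be one way to keep a reported number traceable.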