Samhita.

Data pipeline @sushrutalgs.ai

Built a Python pipeline that turns full-length surgical-textbook PDFs into clean, structured, machine-readable data, processing 220 chapters into a searchable knowledge base of sections, figures, and tables enriched with AI-generated descriptions. It produces a versioned, hash-verified export that the search platform loads into its graph and vector databases.

Source is private; sushrutalgs.ai is a live product. Happy to walk through the code or grant read access on request.

Links

  • Live product
  • Request repo access

Stack

  • Python
  • Pydantic
  • BioLORD
  • Cloudflare R2

[Figure: System architecture.]

Overview

sushrutalgs.ai answers surgical-exam questions with citations traced back to standard textbooks. Samhita is the data layer that makes that possible: it converts three full surgical textbooks from raw PDF into clean, structured, machine-readable knowledge that the retrieval backend can load.

Approach

  • Deterministic parsing. An Adobe-JSON parser with a six-phase recovery pipeline extracts sections, figures, and tables, fixing documented edge cases such as a bug that silently dropped 968 table elements.
  • Knowledge graph. The content is assembled into a 71,621-node, 130,057-edge graph spanning chapters, sections, figures, tables, and cross-references.
  • Embeddings and taxonomy. Each unit is embedded with BioLORD, a medical-domain model, and tagged against a 17-domain taxonomy for retrieval.
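
The graph-assembly step above can be sketched roughly as follows. This is a minimal illustration, not the production code: the `Graph` class, `build_chapter_graph`, the input dictionary shape, and the relation names `contains` and `references` are all assumptions made for the example. The idea it shows is a two-pass build, where structural nodes are created first and cross-references are linked only once every possible target exists.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Tiny in-memory graph (hypothetical): node id -> kind, plus typed edges."""
    nodes: dict = field(default_factory=dict)
    edges: set = field(default_factory=set)

    def add_node(self, node_id, kind):
        self.nodes[node_id] = kind

    def add_edge(self, src, dst, relation):
        # Refuse dangling edges so the graph never points at a missing node.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError(f"unresolved endpoint: {src} -> {dst}")
        self.edges.add((src, dst, relation))


def build_chapter_graph(chapter):
    """Build a graph for one chapter and report unresolved cross-references."""
    graph = Graph()
    graph.add_node(chapter["id"], "chapter")

    # Pass 1: structural nodes and containment edges.
    for section in chapter["sections"]:
        graph.add_node(section["id"], "section")
        graph.add_edge(chapter["id"], section["id"], "contains")
        for figure_id in section.get("figures", []):
            graph.add_node(figure_id, "figure")
            graph.add_edge(section["id"], figure_id, "contains")

    # Pass 2: cross-references, linked only when the target exists.
    unresolved = []
    for section in chapter["sections"]:
        for target in section.get("refs", []):
            if target in graph.nodes:
                graph.add_edge(section["id"], target, "references")
            else:
                unresolved.append((section["id"], target))
    return graph, unresolved
```

Collecting unresolved references in a list, rather than raising on the first one, makes it possible to report a resolved-versus-unresolved count per chapter, which is one plausible way to arrive at a figure like the cross-reference total quoted in Results.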

Results

The pipeline processed 220 chapters (5,941 pages) end to end, producing 52,871 embeddings and 5,987 resolved cross-references. All 220 exported chapter packages passed structural validation, and a clean rebuild reproduced every count.

Engineering

Built in Python with Pydantic, PyMuPDF, and the Adobe PDF Services and Anthropic Claude APIs. Exports are immutable and content-hashed, published to Cloudflare R2 with manifest drift detection, and gated by a CI workflow with mocked tests.
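
As a rough sketch of how content-hashed exports and manifest drift detection can work, the snippet below hashes each chapter package over a canonical JSON encoding and compares two manifests. The function names, the package shape, and the choice of SHA-256 are assumptions for illustration; the actual export format is not shown here, and the upload to Cloudflare R2 is omitted.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Hash a package over canonical JSON so equal content hashes equally."""
    # Sorted keys and fixed separators remove key-order and whitespace noise.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(packages: dict) -> dict:
    """Map each package name (hypothetical, e.g. 'ch001') to its content hash."""
    return {name: content_hash(body) for name, body in sorted(packages.items())}


def detect_drift(published: dict, rebuilt: dict) -> dict:
    """Compare a published manifest with a rebuilt one and list differences."""
    shared = published.keys() & rebuilt.keys()
    return {
        "added": sorted(rebuilt.keys() - published.keys()),
        "removed": sorted(published.keys() - rebuilt.keys()),
        "changed": sorted(k for k in shared if published[k] != rebuilt[k]),
    }
```

Under this scheme a clean rebuild that reproduces every package yields an empty drift report, while any edited, missing, or extra package shows up by name.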