Samhita · Rushir Bhavsar

Overview

sushrutalgs.ai answers surgical-exam questions with citations traced back to standard textbooks. Samhita is the data layer that makes that possible: it converts three full surgical textbooks from raw PDF into clean, structured, machine-readable knowledge that the retrieval backend can load.

Approach

Deterministic parsing. An Adobe-JSON parser with a six-phase recovery pipeline extracts sections, figures, and tables, and fixes documented edge cases, including a bug that silently dropped 968 table elements.
Knowledge graph. The content is assembled into a 71,621-node, 130,057-edge graph spanning chapters, sections, figures, tables, and cross-references.
Embeddings and taxonomy. Each unit is embedded with BioLORD, a medical-domain model, and tagged against a 17-domain taxonomy for retrieval.
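
As a rough illustration of the knowledge-graph step, the sketch below stores typed nodes and typed edges and treats a cross-reference as resolved only when its target exists. It is not Samhita's actual schema: the class names, the ID scheme ("ch01/sec01"), and the relation names ("contains", "refers_to") are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    node_id: str  # hypothetical ID scheme, e.g. "ch01/sec01"
    kind: str     # "chapter", "section", "figure", or "table"
    title: str


@dataclass
class Graph:
    nodes: dict[str, Node] = field(default_factory=dict)
    # Each edge is (source_id, relation, target_id).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, source: str, relation: str, target: str) -> bool:
        """Add an edge only if both endpoints exist; report whether it resolved."""
        if source not in self.nodes or target not in self.nodes:
            return False
        self.edges.append((source, relation, target))
        return True


graph = Graph()
graph.add_node(Node("ch01", "chapter", "Example chapter"))
graph.add_node(Node("ch01/sec01", "section", "Example section"))
graph.add_node(Node("ch01/fig01", "figure", "Example figure"))

# Structural edges: a chapter contains its sections and figures.
graph.add_edge("ch01", "contains", "ch01/sec01")
graph.add_edge("ch01", "contains", "ch01/fig01")

# Cross-references: the first target exists, the second does not.
resolved = graph.add_edge("ch01/sec01", "refers_to", "ch01/fig01")
unresolved = graph.add_edge("ch01/sec01", "refers_to", "ch01/tab99")
```

Rejecting edges whose target is missing, rather than creating placeholder nodes, is one way to make a "resolved cross-references" count meaningful: every edge in the graph points at real content.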

Results

220 chapters and 5,941 pages were processed end to end, yielding 52,871 embeddings and 5,987 resolved cross-references. All 220 exported chapter packages passed structural validation, and a clean rebuild reproduced every count.

Engineering

Built in Python with Pydantic, PyMuPDF, and the Adobe PDF Services and Anthropic Claude APIs. Exports are immutable and content-hashed, published to Cloudflare R2 with manifest drift detection, and gated by a CI workflow with mocked tests.
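
Content hashing and manifest drift detection can be sketched with the standard library alone. The example below is a minimal illustration under stated assumptions, not the project's export code: the function names, the package layout, and the choice of SHA-256 over canonical JSON are all hypothetical, and the upload to Cloudflare R2 is left out.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(packages: dict[str, dict]) -> dict[str, str]:
    """Map each package name to the hash of its content."""
    return {name: content_hash(body) for name, body in packages.items()}


def detect_drift(published: dict[str, str], rebuilt: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and list what was added, removed, or changed."""
    return {
        "added": sorted(rebuilt.keys() - published.keys()),
        "removed": sorted(published.keys() - rebuilt.keys()),
        "changed": sorted(
            name
            for name in published.keys() & rebuilt.keys()
            if published[name] != rebuilt[name]
        ),
    }


# Hypothetical chapter packages; real packages would carry far more fields.
packages = {
    "chapter-001": {"title": "Example chapter", "sections": 12},
    "chapter-002": {"title": "Another chapter", "sections": 8},
}
published = build_manifest(packages)

# A clean rebuild of identical content yields an identical manifest.
clean = detect_drift(published, build_manifest(packages))

# Changing one package changes only that package's hash.
packages["chapter-002"]["sections"] = 9
drifted = detect_drift(published, build_manifest(packages))
```

Hashing a canonical encoding rather than the raw file bytes means that key order and whitespace do not affect the hash, so a rebuild that produces the same content also produces the same manifest.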