Data pipeline @sushrutalgs.ai
Built a Python pipeline that turns full-length surgical-textbook PDFs into clean, structured, machine-readable data, processing 220 chapters into a searchable knowledge base of sections, figures, and tables enriched with AI-generated descriptions. It produces a versioned, hash-verified export that the search platform loads into its graph and vector databases.
Source is private; sushrutalgs.ai is a live product. Happy to walk through the code or grant read access on request.
Stack
System architecture.
Overview
sushrutalgs.ai answers surgical-exam questions with citations traced back to standard textbooks. Samhita is the data layer that makes that possible: it converts three full surgical textbooks from raw PDF into clean, structured, machine-readable knowledge that the retrieval backend can load.
Approach
Results
The pipeline processed 220 chapters and 5,941 pages end to end, producing 52,871 embeddings and 5,987 resolved cross-references. All 220 exported chapter packages passed structural validation, and a clean rebuild reproduced every count.
Engineering
Built in Python with Pydantic, PyMuPDF, and the Adobe PDF Services and Anthropic Claude APIs. Exports are immutable and content-hashed, published to Cloudflare R2 with manifest drift detection, and gated by a CI workflow with mocked tests.
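As a rough illustration of content hashing and manifest drift detection, here is a minimal standard-library sketch. It is not the project's code: the function names, the manifest layout (file name mapped to SHA-256 digest), and the example file names are all assumptions made for this example, and the real pipeline's R2 upload and manifest format are not shown.

```python
import hashlib
import json


def sha256_bytes(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Map each export file name to the SHA-256 of its content (hypothetical layout)."""
    return {name: sha256_bytes(data) for name, data in sorted(files.items())}


def manifest_digest(manifest: dict[str, str]) -> str:
    """Hash a canonical JSON form of the manifest so an export gets one stable ID."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_bytes(canonical.encode("utf-8"))


def detect_drift(expected: dict[str, str], actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare two manifests and report missing, unexpected, and changed files."""
    shared = expected.keys() & actual.keys()
    return {
        "missing": sorted(expected.keys() - actual.keys()),
        "unexpected": sorted(actual.keys() - expected.keys()),
        "changed": sorted(n for n in shared if expected[n] != actual[n]),
    }


# Hypothetical file names and contents, for illustration only.
local = build_manifest({
    "chapter_001.json": b'{"sections": []}',
    "chapter_002.json": b'{"sections": [1]}',
})
remote = build_manifest({
    "chapter_001.json": b'{"sections": []}',
    "chapter_002.json": b'{"sections": [2]}',
    "extra.json": b"{}",
})
drift = detect_drift(local, remote)
print(drift)
```

In this sketch, any non-empty list in the drift report would mean the published copy no longer matches the local export, which is the kind of condition a CI gate could fail on.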