Precision medicine is a learning loop. A patient’s genome or sample is sequenced; analysts sift through potential variants; clinicians and scientists weigh multiple strands of evidence - population frequency, segregation and de novo status, functional data, case series, disease mechanism, and phenotype fit. A classification is issued against formal criteria and, as new evidence emerges, the call may be revisited. The quality and speed of that loop depend on a reliable substrate: a variant database that consolidates what the literature and public resources already know, keeps that knowledge current, and preserves a traceable path back to the exact sentence, table, or figure that supports each entry. When the substrate is strong, reports are timely, classifications are consistent, regulatory responses are confident, and analytics and ML models have inputs they can trust. When it’s weak or missing, teams slow down, disagree more often, and struggle to defend decisions.
This is not an abstract problem. Rare diseases alone touch an estimated 3.5–5.9% of the global population - hundreds of millions of people whose results must be interpreted against a shifting corpus of evidence. Meanwhile, public scaffolds expand and evolve. ClinVar recently formalized three distinct classification tracks - germline pathogenicity, somatic oncogenicity, and somatic clinical impact - and by June 30, 2025 had logged 5.43 million submissions across ~3.59 million variants. That growth is good for transparency and reproducibility; it also raises the operational bar for local databases to stay aligned and explainable. (PubMed, clinvarminer.genetics.utah.edu)
A credible database is not a folder of PDFs or a spreadsheet of notes. It is a governed, searchable asset that encodes the who/what/where of each finding. Genes and transcripts are normalized to HGNC; variants carry HGVS nomenclature and stable genomic coordinates; diseases and phenotypes map to MONDO and HPO; and every value is anchored to verbatim evidence with section, page, and persistent identifiers (PMIDs/DOIs). The same record also carries external context - ClinVar classifications, DECIPHER structural observations, gnomAD and other global population frequencies - so reviewers can reconcile local calls with the public record in one place. This structure turns practical questions into simple queries: What is known about GENE X in Disease Y? Which variants have segregation or functional support? What changed since the last release? How do those changes intersect with population frequency and prior assertions? (Decipher Genomics, PMC, gnomAD)
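When the records are structured this way, those questions really do become one-liners. As a rough sketch - assuming the database has been exported to a flat table, and with file names, column names, and identifiers below as illustrations rather than a fixed schema:

```python
import pandas as pd

# Illustrative export of the variant database; column names are assumptions.
variants = pd.read_csv("disease_x_variants_v1.0.csv")

# "What is known about GENE X in Disease Y?" becomes a filter on normalized fields.
known = variants[
    (variants["gene_hgnc"] == "GENEX")                 # HGNC-normalized symbol (placeholder)
    & (variants["disease_mondo"] == "MONDO:0012345")   # placeholder MONDO identifier
]

# "Which variants have segregation or functional support?"
supported = known[known["segregation"].notna() | known["functional_assay"].notna()]

# Every row keeps its receipts: persistent identifier, page, section, verbatim snippet.
print(supported[["variant_hgvs", "pmid_or_doi", "page_section", "snippet"]])
```

The point is less the syntax than the fact that the provenance columns travel with every answer.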
Consider two everyday settings. A hospital laboratory focuses on rare genetic disorders. A diagnostics company runs NGS panels or exomes at scale. In both, the first interpretive move after sequencing is only one part of the loop: compare the patient’s variants with what has been reported, and then place each finding into a coherent evidence narrative. That narrative goes well beyond “database lookup.” Analysts must compile literature evidence, align it to the disease mechanism, check phenotype fit, weigh population data, evaluate segregation or de novo status, consider functional assays and computational predictions, and then apply ACMG/AMP criteria (or disease/gene-specific specifications) to reach a defensible classification. In pharmacogenomics, the loop also includes translating genotype into prescribing guidance using CPIC evidence levels and recommendation strengths. (Nature, PMC, CPIC)
The time burden is substantial and well documented. Peer-reviewed studies place per-variant evidence work in the tens of minutes: around 40 minutes on average (with a 10–120 minute range) in expert programs, and ~60 minutes for difficult cases that require narrative justification. Multiply that by thousands of variants per year across multiple indications and the operational reality becomes obvious: turnaround drifts, audit questions pile up, and parallel teams quietly re-curate the same ground. (PMC)
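A back-of-envelope calculation makes the scale concrete; only the per-variant minutes come from the studies above, and the annual volume is an assumption for illustration:

```python
# Back-of-envelope workload estimate; the volume figure is illustrative.
minutes_per_variant = 40      # published average for expert programs
variants_per_year = 5_000     # assumed annual volume for a mid-sized lab
hours_per_year = minutes_per_variant * variants_per_year / 60
fte_years = hours_per_year / 1_800   # assuming ~1,800 working hours per FTE-year

print(f"{hours_per_year:,.0f} hours/year ≈ {fte_years:.1f} FTE-years of evidence work")
# -> 3,333 hours/year ≈ 1.9 FTE-years
```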
Imagine opening your laptop on Monday, selecting a disease area, and pulling every relevant paper, supplement, and case report from the last five years into one workspace in minutes. You sketch a simple schema - gene, variant (HGVS), phenotype, study type, “assay mentioned,” “segregation,” “ancestry context,” “page/section/snippet.” The system begins to populate a draft disease database. Each row carries its own receipts: the sentence that supports it, the page it came from, the figure or table it references. Conflicts appear as tracked deltas, not surprises. By Friday you cut v1.0, a versioned, searchable disease database. Next month you refresh and ship v1.1 with clean release notes.
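That sketch is small enough to write down as code. A minimal version of the schema - with field names as assumptions rather than a fixed standard - might look like:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VariantRecord:
    gene_hgnc: str                   # HGNC-normalized gene symbol
    variant_hgvs: str                # variant in HGVS nomenclature
    phenotype_hpo: str               # phenotype term(s), e.g. HPO identifiers
    study_type: str                  # case report, case series, cohort, functional study...
    assay_mentioned: Optional[str]   # functional assay, if one is described
    segregation: Optional[str]       # segregation / de novo status, if reported
    ancestry_context: Optional[str]  # reported ancestry or population context
    pmid_or_doi: str                 # persistent identifier for the source
    page_section: str                # page and section the value came from
    snippet: str                     # verbatim sentence, table, or figure reference
```

Adding a column mid-run (say, "evidence type") is a one-line change here, which is why mid-run schema edits can stay painless.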
That one-week rhythm changes the work. Sign-out is faster because evidence is already where it needs to be. Reviews stop being scavenger hunts. And curation finally compounds: each release stands on the shoulders of the last.
For two years of my PhD, I did the opposite of that ideal. Ten of us spent our days building a single disease database by hand. We hunted PubMed IDs first, distributed PDFs, copied tables, reconciled gene and variant names, argued over legacy nomenclature, and tried to track what changed between versions. We made progress, but slowly. The work was necessary, yet the mechanics - searching, extracting, curating, normalizing, proving provenance - consumed far more time than the judgment calls that actually required experts.
That experience is why this “what if” matters to me. I know exactly which steps soak up hours, and which deserve human attention.
Last week, I tried a small, focused experiment: build a credible, literature-backed variant database for a neurogenetic indication - fast. I started with six PDFs around intellectual disability and sketched the schema I’d wanted back in grad school: normalized gene and variant fields; disease/phenotype; study and evidence type; “assay mentioned,” “segregation,” “ancestry context”; and full provenance (PMID/DOI, page, section, figure/table, verbatim snippet).
Within minutes of ingestion I had the first draft variant database as structured rows with provenance. Gene and variant recognition held up on spot-check. Mid-run schema edits were painless (adding “evidence type,” for example). I wasn’t spending my time hunting anymore. I was adjudicating evidence and shaping a database.
Behind that experiment was Polly Xtract, Elucidata’s AI capability for biomedical document understanding. It is not a standalone tool; it powers our curation services and enterprise deployments. The goal is simple: compress the mechanical steps - gathering documents from mixed sources, reading narrative text and tables, normalizing to ontologies, reconciling conflicts, and attaching receipts to every field - so experts can focus on judgment. Xtract writes the output to a CSV file that can then be fed into Polly Atlas with lineage and versions, so a disease database is not a spreadsheet on someone’s desktop; it is a governed asset. From there, the Python SDK makes the data truly usable: joins to ClinVar, DECIPHER, and gnomAD; quick ACMG signal tallies; QC dashboards; and boring-but-vital diffs between v1.0 and v1.1. (NCBI, Decipher Genomics, gnomAD)
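The SDK specifics vary by deployment, but the shape of that downstream step is easy to sketch with plain pandas, assuming the Xtract CSV and small exports of the public resources are on hand (file and column names below are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names; real deployments go through the Polly SDK.
local = pd.read_csv("disease_x_variants_v1.1.csv")
clinvar = pd.read_csv("clinvar_subset.csv")     # e.g. variant_hgvs, clinical_significance
gnomad = pd.read_csv("gnomad_frequencies.csv")  # e.g. variant_hgvs, allele_frequency

# Reconcile local calls with the public record in one joined view.
merged = (
    local
    .merge(clinvar, on="variant_hgvs", how="left")
    .merge(gnomad, on="variant_hgvs", how="left")
)

# A quick signal tally: common variants that already carry benign public assertions.
flagged = merged[
    (merged["allele_frequency"] > 0.01)
    & (merged["clinical_significance"].str.contains("enign", na=False))
]
print(f"{len(flagged)} variants to re-review against BA1/BS1-style frequency criteria")
```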
The headline isn’t “AI reads PDFs.” It’s that the hours we used to spend searching, copying, and reformatting collapse into a repeatable step - and the hours that require expertise are protected.
Picture a rare-disease lab on a busy week. Sequencing finishes. Analysts surface candidate variants. Instead of fanning out across Google Scholar and old Slack threads, they open the disease database. Every entry already links to the sentence that supports it, the page it came from, and any figure or table that matters. If two publications disagree, the discrepancy is visible with context, not hidden in someone’s notes. Curators spend their time on the judgments that matter - segregation strength, phenotype fit, mechanism - not on re-locating the same case report for the fifth time. When a sign-out pathologist asks, “show me the sentence,” it takes seconds, not days.
Operations change too. Release management stops being ad hoc. You publish Disease X Variant DB v1.0, then v1.1 a month later with a clear changelog: new case series added; one ClinVar reclassification ingested; two corrections promoted from review. QA and Regulatory get what they need for audits. Downstream pipelines - reporting templates, biomarker matrices, ML feature stores - pull from a single, explainable source of truth.
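Those release notes do not have to be assembled by hand. A minimal diff between two exports - again with hypothetical file names, and assuming one row per variant - is only a few lines:

```python
import pandas as pd

# Hypothetical release exports; assumes one row per variant (unique variant_hgvs).
v10 = pd.read_csv("disease_x_variants_v1.0.csv").set_index("variant_hgvs")
v11 = pd.read_csv("disease_x_variants_v1.1.csv").set_index("variant_hgvs")

added   = v11.index.difference(v10.index)
removed = v10.index.difference(v11.index)
shared  = v11.index.intersection(v10.index)

# Any changed field counts as modified (including newly added columns).
changed = [v for v in shared if not v11.loc[v].equals(v10.loc[v])]

print(f"v1.1 changelog: {len(added)} added, {len(removed)} removed, {len(changed)} modified")
```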
That first intellectual-disability pilot was deliberately small, but the pattern generalizes. Pick an indication. Define the schema around how your team actually reasons (not just what a PDF happens to contain). Ingest the literature. Let the system draft the extraction and normalization. Use experts to resolve conflicts and shape the release. Repeat on a schedule.
The payoff isn’t only speed. It is consistency across teams, cleaner handoffs to QA and Regulatory, and the confidence to expand into adjacent indications without starting from zero. If you have ever tied up a ten-person team for months to build a single disease database - as we did - you will feel the difference.
If fast-tracking clinical reporting is the goal, start by building the variant database in minutes with Xtract - then let Polly turn that AI-ready structure into harmonized insights and releases.