From PDFs to a Variant Database: How Elucidata’s Polly Xtract Turns Genetic Evidence into Decisions

Precision medicine is a learning loop. A patient’s genome or sample is sequenced; analysts sift through potential variants; clinicians and scientists weigh multiple strands of evidence - population frequency, segregation and de novo status, functional data, case series, disease mechanism, and phenotype fit. A classification is issued against formal criteria and, as new evidence emerges, the call may be revisited. The quality and speed of that loop depend on a dependable substrate: a variant database that consolidates what the literature and public resources already know, keeps that knowledge current, and preserves a traceable path back to the exact sentence, table, or figure that supports each entry. When the substrate is strong, reports are timely, classifications are consistent, regulatory responses are confident, and analytics and ML models have inputs they can trust. When it’s weak or missing, teams slow down, disagree more often, and struggle to defend decisions.

This is not an abstract problem. Rare diseases alone touch an estimated 3.5–5.9% of the global population - hundreds of millions of people whose results must be interpreted against a shifting corpus of evidence. Meanwhile, public scaffolds expand and evolve. ClinVar recently formalized three distinct classification tracks - germline pathogenicity, somatic oncogenicity, and somatic clinical impact - and by June 30, 2025, had logged 5.43 million submissions across ~3.59 million variants. That growth is good for transparency and reproducibility; it also raises the operational bar for local databases to stay aligned and explainable. (PubMed, clinvarminer.genetics.utah.edu)

What a modern variant database really is

A credible database is not a folder of PDFs or a spreadsheet of notes. It is a governed, searchable asset that encodes the who/what/where of each finding. Genes and transcripts are normalized to HGNC; variants carry HGVS nomenclature and stable genomic coordinates; diseases and phenotypes map to MONDO and HPO; and every value is anchored to verbatim evidence with section, page, and persistent identifiers (PMIDs/DOIs). The same record also carries external context - ClinVar classifications, DECIPHER structural observations, gnomAD and other global population frequencies - so reviewers can reconcile local calls with the public record in one place. This structure turns practical questions into simple queries: What is known about GENE X in Disease Y? Which variants have segregation or functional support? What changed since the last release? How do those changes intersect with population frequency and prior assertions? (Decipher Genomics, PMC, gnomAD)
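
To make that concrete, here is one way such a record could be modeled in Python. This is a minimal sketch of our own, not a prescribed Polly schema: the field names and example identifiers are illustrative, and a production model would carry genomic coordinates and richer assertion metadata.

```python
from dataclasses import dataclass, field

@dataclass
class VariantRecord:
    # Identity, normalized to the ontologies named above.
    hgnc_gene: str                      # HGNC ID or approved symbol
    transcript: str                     # transcript the HGVS description is on
    hgvs_c: str                         # HGVS cDNA-level variant description
    mondo_disease: str                  # MONDO disease CURIE
    hpo_phenotypes: list[str] = field(default_factory=list)  # e.g. ["HP:0001249"]
    segregation: bool = False           # segregation evidence reported?
    # Provenance: every value is anchored to verbatim evidence.
    pmid: str = ""
    doi: str = ""
    section: str = ""
    page: str = ""
    snippet: str = ""                   # the exact supporting sentence
    # External context carried on the same record.
    clinvar_classification: str = ""
    gnomad_af: float | None = None

def segregation_support(db: list[VariantRecord], gene: str) -> list[VariantRecord]:
    """'Which variants in GENE X have segregation support?' as a one-liner."""
    return [r for r in db if r.hgnc_gene == gene and r.segregation]
```

With records shaped like this, the practical questions above really are one-line filters and joins rather than archaeology.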

Why building and maintaining it strains teams

Consider two everyday settings. A hospital laboratory focuses on rare genetic disorders. A diagnostics company runs NGS panels or exomes at scale. In both, the first interpretive move after sequencing is only one part of the loop: compare the patient’s variants with what has been reported, and then place each finding into a coherent evidence narrative. That narrative goes well beyond “database lookup.” Analysts must compile literature evidence, align it to the disease mechanism, check phenotype fit, weigh population data, evaluate segregation or de novo status, consider functional assays and computational predictions, and then apply ACMG/AMP criteria (or disease/gene-specific specifications) to reach a defensible classification. In pharmacogenomics, the loop also includes translating genotype into prescribing guidance using CPIC evidence levels and recommendation strengths. (Nature, PMC, CPIC)
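
To show what the final combining step looks like mechanically, here is a deliberately reduced sketch: it tallies evidence codes by strength and encodes just two of the “Pathogenic” combining rules from Richards et al. (2015). A real implementation needs the full rule set, the benign side, and conflict handling, and the call always rests with the expert.

```python
from collections import Counter

def acmg_tally(criteria: list[str]) -> Counter:
    """Count evidence codes by strength prefix (PVS, PS, PM, PP, BA, BS, BP)."""
    tally = Counter()
    for code in criteria:                # e.g. ["PVS1", "PS3", "PM2"]
        tally[code.rstrip("0123456789")] += 1
    return tally

def simplified_call(criteria: list[str]) -> str:
    """Two of the pathogenic combining rules only; everything else is
    deferred to expert review rather than guessed at."""
    t = acmg_tally(criteria)
    if t["PVS"] >= 1 and (t["PS"] >= 1 or t["PM"] >= 2):
        return "Pathogenic"              # PVS1 plus >=1 strong or >=2 moderate
    if t["PS"] >= 2:
        return "Pathogenic"              # >=2 strong criteria
    return "Needs expert review"

print(simplified_call(["PVS1", "PS3"]))  # -> Pathogenic
```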

The time burden is substantial and well documented. Peer-reviewed studies place per-variant evidence work in the tens of minutes: around 40 minutes on average (with a 10–120 minute range) in expert programs, and ~60 minutes for difficult cases that require narrative justification. Multiply that by thousands of variants per year across multiple indications and the operational reality becomes obvious: at roughly 40 minutes each, 3,000 variants is about 2,000 hours of evidence work, close to a full analyst-year. Turnaround drifts, audit questions pile up, and parallel teams quietly re-curate the same ground. (PMC)

What if curation were easy?

Imagine opening your laptop on Monday, selecting a disease area, and pulling every relevant paper, supplement, and case report from the last five years into one workspace in minutes. You sketch a simple schema - gene, variant (HGVS), phenotype, study type, “assay mentioned,” “segregation,” “ancestry context,” “page/section/snippet.” The system begins to populate a draft disease database. Each row carries its own receipts: the sentence that supports it, the page it came from, the figure or table it references. Conflicts appear as tracked deltas, not surprises. By Friday you cut v1.0, a versioned, searchable disease database. Next month you refresh and ship v1.1 with clean release notes.
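
The “receipts” discipline is easy to enforce mechanically. A few lines of pandas - file and column names here are hypothetical, matching the sketched schema - can gate a release on complete provenance:

```python
import pandas as pd

# Hypothetical draft export with the columns sketched above.
draft = pd.read_csv("disease_db_v1.0_draft.csv")

# Receipts check: no row ships without a PMID, location, and verbatim snippet.
REQUIRED_PROVENANCE = ["pmid", "page", "section", "snippet"]
missing = draft[draft[REQUIRED_PROVENANCE].isna().any(axis=1)]
print(f"{len(missing)} of {len(draft)} rows need provenance before v1.0 ships")
```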

That one-week rhythm changes the work. Sign-out is faster because evidence is already where it needs to be. Reviews stop being scavenger hunts. And curation finally compounds: each release stands on the shoulders of the last.

A quick story from the trenches

For two years of my PhD, I did the opposite of that ideal. Ten of us spent our days building a single disease database by hand. We hunted PubMed IDs first, distributed PDFs, copied tables, reconciled gene and variant names, argued over legacy nomenclature, and tried to track what changed between versions. We made progress, but slowly. The work was necessary, yet the mechanics - searching, extracting, curating, normalizing, proving provenance - consumed far more time than the judgment calls that actually required experts.

That experience is why this “what if” matters to me. I know exactly which steps soak up hours, and which deserve human attention.

From what-if to working practice

Last week, I tried a small, focused experiment: build a credible, literature-backed variant database for a neurogenetic indication - fast. I started with six PDFs around intellectual disability and sketched the schema I’d wanted back in grad school: normalized gene and variant fields; disease/phenotype; study and evidence type; “assay mentioned,” “segregation,” “ancestry context”; and full provenance (PMID/DOI, page, section, figure/table, verbatim snippet).

Within minutes of ingestion I had the first draft variant database as structured rows with provenance. Gene and variant recognition held up on spot-check. Mid-run schema edits were painless (adding “evidence type,” for example). I wasn’t spending my time hunting anymore. I was adjudicating evidence and shaping a database.

Where Polly Xtract fits - quietly powering the flow

Behind that experiment was Polly Xtract, Elucidata’s AI capability for biomedical document understanding. It is not a standalone tool; it powers our curation services and enterprise deployments. The goal is simple: compress the mechanical steps - gathering documents from mixed sources, reading narrative text and tables, normalizing to ontologies, reconciling conflicts, and attaching receipts to every field - so experts can focus on judgment. Xtract writes the output to a CSV file that can then be fed into Polly Atlas with lineage and versions, so a disease database is not a spreadsheet on someone’s desktop; it is a governed asset. From there, the Python SDK makes the data truly usable: joins to ClinVar, DECIPHER, and gnomAD; quick ACMG signal tallies; QC dashboards; and boring-but-vital diffs between v1.0 and v1.1. (NCBI, Decipher Genomics, gnomAD)
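
For flavor, here is the kind of downstream join this enables, written as a plain pandas sketch. The file and column names are hypothetical, and this is the shape of the operation rather than the Polly SDK’s actual API:

```python
import pandas as pd

# Hypothetical flat exports keyed on a shared variant identifier.
local = pd.read_csv("disease_db_v1.0.csv")     # Xtract-derived rows, local_call
gnomad = pd.read_csv("gnomad_af.csv")          # variant_id, allele_freq
clinvar = pd.read_csv("clinvar_class.csv")     # variant_id, clinvar_class

merged = (
    local
    .merge(gnomad, on="variant_id", how="left")
    .merge(clinvar, on="variant_id", how="left")
)

# Surface local calls that disagree with the public record for review.
conflicts = merged[
    merged["clinvar_class"].notna()
    & (merged["local_call"] != merged["clinvar_class"])
]
print(f"{len(conflicts)} variants need reconciliation against ClinVar")
```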

The headline isn’t “AI reads PDFs.” It’s that the hours we used to spend searching, copying, and reformatting collapse into a repeatable step - and the hours that require expertise are protected.

Why this matters to diagnostics

Picture a rare-disease lab on a busy week. Sequencing finishes. Analysts surface candidate variants. Instead of fanning out across Google Scholar and old Slack threads, they open the disease database. Every entry already links to the sentence that supports it, the page it came from, and any figure or table that matters. If two publications disagree, the discrepancy is visible with context, not hidden in someone’s notes. Curators spend their time on the judgments that matter - segregation strength, phenotype fit, mechanism - not on re-locating the same case report for the fifth time. When a sign-out pathologist asks, “show me the sentence,” it takes seconds, not days.

Operations change too. Release management stops being ad hoc. You publish Disease X Variant DB v1.0, then v1.1 a month later with a clear changelog: new case series added; one ClinVar reclassification ingested; two corrections promoted from review. QA and Regulatory get what they need for audits. Downstream pipelines - reporting templates, biomarker matrices, ML feature stores - pull from a single, explainable source of truth.
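
That changelog can be computed rather than hand-written. A sketch, assuming both releases share the same columns and a stable variant_id:

```python
import pandas as pd

v10 = pd.read_csv("disease_db_v1.0.csv").set_index("variant_id")
v11 = pd.read_csv("disease_db_v1.1.csv").set_index("variant_id")

added = v11.index.difference(v10.index)     # e.g. a new case series
removed = v10.index.difference(v11.index)   # retracted or merged entries
shared = v11.index.intersection(v10.index)

# Field-level changes on shared variants, e.g. a ClinVar reclassification.
changed = v10.loc[shared].compare(v11.loc[shared])

print(f"v1.1 changelog: +{len(added)} added, -{len(removed)} removed, "
      f"{changed.index.nunique()} modified")
```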

Scaling the lesson

That first intellectual-disability pilot was deliberately small, but the pattern generalizes. Pick an indication. Define the schema around how your team actually reasons (not just what a PDF happens to contain). Ingest the literature. Let the system draft the extraction and normalization. Use experts to resolve conflicts and shape the release. Repeat on a schedule.

The payoff isn’t only speed. It is consistency across teams, cleaner handoffs to QA and Regulatory, and the confidence to expand into adjacent indications without starting from zero. If you have ever tied up a ten-person team for months to build a single disease database, as we did, you will feel the difference.

If fast-tracking clinical reporting is the goal, start by making the variant database in minutes with Xtract - then let Polly turn that AI-ready structure into harmonized insights and releases.

References

  • Nguengang Wakap S, et al. “Estimating cumulative point prevalence of rare diseases.” Eur J Hum Genet (2020). Evidence-based global prevalence 3.5–5.9%. (PubMed)
  • Richards S, et al. “Standards and guidelines for the interpretation of sequence variants (ACMG/AMP).” Genet Med (2015). Foundational criteria for germline interpretation. (Nature)
  • Harrison SM, et al. “Overview of specifications to the ACMG/AMP guidelines.” Hum Mutat (2019). ClinGen disease/gene-specific specifications. (PMC)
  • CPIC. “Guidelines” and “Levels of Evidence.” Evidence grading and prescribing recommendations in pharmacogenomics. (cpicpgx.org)
  • Landrum MJ, et al. “ClinVar: updates to support classifications of both germline and somatic variants.” Nucleic Acids Res (2025). Three classification tracks; >3 million variants. (Oxford Academic)
  • ClinVar Miner (University of Utah). Snapshot June 30, 2025: 5,432,546 submissions; 3,591,080 variants. (ClinVar Miner)
  • DECIPHER—Mapping the Clinical Genome. Structural variation and phenotype platform. (Sanger Institute)
  • Broad Institute. gnomAD v4.1 release notes and changelog (2024). Population frequency updates and features. (gnomAD)
  • Patel MJ, et al. “Disease-specific ACMG/AMP guidelines improve sequence variant interpretation for hearing loss.” Genet Med (2021). Average curation time ~40 minutes (10–120). (PMC)
  • Li L, et al. “Tracking updates in clinical databases increases efficiency for clinical variant interpretation.” Genet Med Open (2024). Effort distribution; “hard” variants approach ~60 minutes. (gimopen.org)
