Every year, tens of thousands of biomedical studies are published. Each holds a small piece of the larger puzzle - a new gene interaction, a novel biomarker, an unexpected drug response. Yet, despite this explosion of information, most of it remains trapped in silos.
What if all these fragments could talk to each other? What if the sum of biomedical data could actually act as one connected brain?
That’s the promise of the knowledge graph (KG) - a data model that doesn’t just store information, but connects it. In a KG, knowledge is represented as a network of entities and relationships: a gene “codes” for a protein, a drug “treats” a disease, a disease is “associated” with a phenotype.
Unlike rigid, tabular databases, KGs reflect the true complexity of biology - where everything is interlinked.
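To make the entity-and-relationship idea concrete, here is a minimal sketch of a graph stored as (subject, relation, object) triples. The entity names, relation labels, and the `neighbors` helper are illustrative assumptions, not the schema of any particular KG.

```python
from collections import defaultdict

# Toy triples (subject, relation, object); names are illustrative only.
triples = [
    ("TNF_gene", "codes_for", "TNF_protein"),
    ("adalimumab", "targets", "TNF_protein"),
    ("adalimumab", "treats", "rheumatoid_arthritis"),
    ("rheumatoid_arthritis", "associated_with", "joint_pain"),
]

# Index outgoing edges by subject so lookups do not scan every triple.
outgoing = defaultdict(list)
for subject, relation, obj in triples:
    outgoing[subject].append((relation, obj))


def neighbors(entity, relation=None):
    """Return objects linked from `entity`, optionally filtered by relation."""
    return [o for r, o in outgoing.get(entity, []) if relation is None or r == relation]


print(neighbors("adalimumab"))            # every outgoing edge
print(neighbors("adalimumab", "treats"))  # only the 'treats' edges
```

Because each fact is just an edge, adding a new kind of relationship means adding new triples rather than redesigning a table.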
Over the years, several biomedical KGs have tried to capture this interconnectedness. Early disease-centric graphs like the Human Disease Network (HDN) and Human Symptom–Disease Network (HSDN) demonstrated how linking diseases by shared features could yield new insights. Later, SPOKE (Scalable Precision Medicine Open Knowledge Engine) integrated diverse biomedical databases to focus on diseases, while GARD (Genetic and Rare Diseases Information Center) centered on rare diseases.
Each was a step forward, but all shared a common limitation: fragmentation and narrow scope. None could bring together the full spectrum of biomedical knowledge across scales, modalities, and ontologies.
That gap led to the creation of PrimeKG - a more ambitious, integrated effort.
PrimeKG emerged as a “multimodal knowledge graph for precision medicine,” integrating 20 high-quality resources, biorepositories, and ontologies into a unified network. Its mission was to offer a holistic view of disease biology, breaking through the fragmentation that had long plagued biomedical research.
PrimeKG’s scale was unprecedented. By consolidating data from sources like DisGeNET, DrugBank, Mayo Clinic, and Orphanet, it brought together 17,080 diseases and over 4 million relationships spanning ten biological layers - from genes and proteins to drugs, anatomy, and pathways. This expanded disease coverage by one to two orders of magnitude compared to predecessors like SPOKE and HSDN.
Another innovation lay in its disease representation. Biomedical databases often use conflicting ontologies, making consistent linking difficult. PrimeKG standardized all disease nodes to the MONDO Disease Ontology, mapping disparate vocabularies onto a shared foundation.
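The standardization step can be pictured as a lookup from source-specific disease identifiers to a shared ontology identifier. The sketch below assumes a hypothetical cross-reference table; every identifier in it (`SRC_A:001`, `MONDO:EXAMPLE_1`, and so on) is a placeholder, and the `harmonize` function is not PrimeKG's actual code.

```python
# Hypothetical cross-reference table: source-specific ID -> shared ontology ID.
# All identifiers here are placeholders, not real ontology terms.
XREF = {
    "SRC_A:001": "MONDO:EXAMPLE_1",
    "SRC_B:D42": "MONDO:EXAMPLE_1",  # same disease under another vocabulary
    "SRC_B:D77": "MONDO:EXAMPLE_2",
}


def harmonize(edges, xref):
    """Rewrite the disease ID in each edge to the shared ontology ID.

    Edges whose disease ID has no mapping are returned separately so they
    can be reviewed instead of being silently dropped.
    """
    mapped, unmapped = [], []
    for disease_id, relation, target in edges:
        if disease_id in xref:
            mapped.append((xref[disease_id], relation, target))
        else:
            unmapped.append((disease_id, relation, target))
    return mapped, unmapped


edges = [
    ("SRC_A:001", "associated_with", "GENE_X"),
    ("SRC_B:D42", "associated_with", "GENE_Y"),
    ("SRC_C:9", "associated_with", "GENE_Z"),
]
mapped, unmapped = harmonize(edges, XREF)
print(mapped)
print(unmapped)
```

After mapping, the first two edges attach to the same disease node even though they came from different vocabularies, which is the point of a shared foundation.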
To resolve disease ambiguity, where subtypes or synonyms muddle definitions, PrimeKG used semi-automated grouping with string matching and BERT-based embeddings, clustering 22,205 disease concepts into 17,080 clinically meaningful groups. This was a significant advance toward reducing manual curation while improving consistency.
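As a rough illustration of the string-matching half of that grouping, the sketch below clusters disease names whose normalized spellings are nearly identical. It uses Python's standard `difflib` and leaves out the embedding step entirely; the threshold, the greedy strategy, and the example names are assumptions made for illustration, not PrimeKG's actual procedure.

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and collapse punctuation/whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def group_names(names, threshold=0.85):
    """Greedy grouping: a name joins the first group whose representative
    (the first member's normalized spelling) is similar enough, otherwise
    it starts a new group."""
    groups = []  # list of (normalized representative, [original names])
    for name in names:
        norm = normalize(name)
        for representative, members in groups:
            if SequenceMatcher(None, representative, norm).ratio() >= threshold:
                members.append(name)
                break
        else:
            groups.append((norm, [name]))
    return [members for _, members in groups]


names = [
    "Ehlers-Danlos syndrome, classic type",
    "Ehlers Danlos Syndrome Classic Type",
    "Ehlers-Danlos syndrome classical type",
    "Marfan syndrome",
]
for group in group_names(names):
    print(group)
```

String similarity alone misses synonyms that are spelled differently, which is presumably where embeddings and manual review earn their place.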
PrimeKG also enriched its structure with new relationship types such as indications, contraindications, and off-label uses, allowing for deeper insights into drug–disease interactions. It even blended structured graph data with textual context from clinical repositories like Mayo Clinic and Orphanet - an important early move toward multimodality.
For all its innovation, PrimeKG had one crucial shortcoming - it was static.
The public version, built using data available up to June 2021, represented a snapshot of biomedical knowledge frozen in time.
In a field that evolves daily, static graphs age quickly. New drugs get approved, new biomarkers are discovered, and disease definitions are constantly revised. A resource that can’t evolve with science soon loses its relevance.
PrimeKG’s technical validation proved its potential - it could even predict repurposing opportunities for drugs approved after its cutoff date. But it couldn’t contain any of those new discoveries itself.
This static nature created real problems. For a scientist exploring a new target or repurposing opportunity, it meant incomplete and even misleading results.
In the AI era, where model accuracy depends on data freshness, a static KG becomes more of a historical archive than a discovery engine.
Recognizing this gap, Elucidata built PollyKG - a next-generation, dynamic knowledge graph designed for the age of AI-driven biomedical research.
PollyKG isn’t just an update to existing graphs like PrimeKG. It’s a paradigm shift - from a static snapshot to a living, continuously evolving ecosystem.
PollyKG systematically ingests and harmonizes data from an ever-expanding set of public sources - integrating new drugs, genetic associations, disease relationships, and pathway discoveries as they appear.
This ensures that scientists are always working with the most current, trustworthy dataset, eliminating the lag between discovery and availability.
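PollyKG's ingestion pipeline is not described here, so the following is only a sketch of the general idea behind a continuously updated graph: each new batch of edges is merged into the existing set, and every edge records when it was first and last seen. The `merge_batch` function, the field names, and the toy edges are all hypothetical.

```python
from datetime import date


def merge_batch(graph, batch, seen_on):
    """Upsert a batch of (subject, relation, object) edges into `graph`.

    `graph` maps each edge to a dict with 'first_seen' and 'last_seen' dates.
    Returns the number of edges that were new in this batch.
    """
    added = 0
    for edge in batch:
        if edge in graph:
            graph[edge]["last_seen"] = seen_on  # already known: refresh only
        else:
            graph[edge] = {"first_seen": seen_on, "last_seen": seen_on}
            added += 1
    return added


graph = {}
first = merge_batch(graph, [("drug_a", "treats", "disease_x")], date(2021, 6, 1))
second = merge_batch(
    graph,
    [("drug_a", "treats", "disease_x"), ("drug_b", "treats", "disease_x")],
    date(2024, 1, 15),
)
print(first, second, len(graph))
```

Running the same batch twice changes nothing except the `last_seen` date, so the merge can be repeated safely as sources are re-polled.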
PollyKG allows organizations to merge proprietary data - from assays, high-throughput screens, or clinical trials - with public biomedical knowledge in a secure and scalable way.
This contextualizes in-house findings within the broader biomedical landscape, helping researchers generate more grounded hypotheses and reduce redundant experiments.
Navigating millions of nodes and relationships can be overwhelming. PollyKG democratizes access by allowing users to ask questions in plain English - like, “Which drugs target proteins involved in inflammatory pathways?”
The system converts these natural language prompts into graph queries, returning results that are interpretable and actionable - no coding required.
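How PollyKG translates a question into a query is not specified here, but the example question above corresponds to a simple two-hop traversal: find proteins that participate in inflammatory pathways, then find drugs that target those proteins. The sketch below shows that traversal over toy triples; the relation names, entities, and the `drugs_for_pathway` function are assumptions for illustration.

```python
def drugs_for_pathway(triples, keyword):
    """Two-hop lookup: pathway keyword -> proteins -> drugs."""
    # Hop 1: proteins that participate in a pathway whose name contains the keyword.
    proteins = {
        s for s, r, o in triples
        if r == "participates_in" and keyword in o.lower()
    }
    # Hop 2: drugs that target any of those proteins.
    return sorted({s for s, r, o in triples if r == "targets" and o in proteins})


# Toy data with placeholder names.
toy_triples = [
    ("drug_a", "targets", "protein_1"),
    ("drug_b", "targets", "protein_2"),
    ("drug_c", "targets", "protein_3"),
    ("protein_1", "participates_in", "Inflammatory response pathway"),
    ("protein_2", "participates_in", "Lipid metabolism pathway"),
    ("protein_3", "participates_in", "Inflammatory signaling pathway"),
]

print(drugs_for_pathway(toy_triples, "inflammatory"))  # ['drug_a', 'drug_c']
```

A natural-language layer would need to map the user's wording onto the right relation names and keyword before running a traversal like this one.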
PollyKG spans a wide range of data types: genomic, clinical, proteomic, and even single-cell datasets are all unified within one connected graph. It also extends beyond human data, harmonizing knowledge across 50+ species and offering translational context for animal models and comparative studies.
PrimeKG marked an important milestone - one of the first major attempts to unify biomedical data at this scale.
But PollyKG represents what comes next: a living system that learns, grows, and evolves alongside the science it serves.
In an age where, by some estimates, biomedical knowledge doubles roughly every 18 months, only such dynamic systems can keep research ahead of the curve.
For data scientists, PollyKG means fewer blind spots in model training.
For translational researchers, it means faster hypothesis generation.
For pharma R&D, it means accelerating the journey from data to discovery - with confidence in every connection.
PrimeKG organized biomedical data.
PollyKG lets it evolve.
By combining continuous updates, private data integration, multimodal context, and natural language accessibility, PollyKG transforms the knowledge graph from a static artifact into a living engine of discovery.
In the digital era of precision medicine, where insights are only as current as the data behind them, PollyKG is what keeps science in motion.