
In the drug discovery cycle, identifying a link between a gene and a disease is the first step toward a drug target breakthrough. To build these hypotheses, most computational biology teams rely on foundational databases like Open Targets, a standard resource that scans millions of abstracts for "co-occurrences": instances where a gene and a disease are simply mentioned in the same sentence.
But this is exactly where the discovery process hits a wall: the system fails to explain the nature of the association. Co-occurrence is not the same as a biological mechanism; just because two entities share a sentence does not mean they actually interact. Yet in conventional target discovery systems, a weak text overlap and a strong causal mechanism look identical. When every text match is treated as a legitimate link, researchers get buried under false leads, and the true signal is drowned out by noise. The burden of interpretation shifts to manually reviewing hundreds of sentences before making multi-million-dollar screening decisions.
To truly accelerate drug discovery, we need to move beyond flat associations and extract the actual biological "why." We solve this by co-building a mechanistically rich knowledge graph that transforms raw text overlaps into confidence-graded, causal biological relationships at scale.
Platforms like Open Targets aggregate evidence from the literature, but they represent these associations at the same flat level of abstraction:
Disease ↔ Target
"This gene is the primary driver of this disease" and "this disease has no known association with this gene" are treated identically if the entities simply appear in the same sentence. There is no directionality, no capture of functional effect, and no statistical weighting.
Consider how this impacts target selection with two examples:
Imagine screening for targets related to rheumatoid arthritis. A standard query pulls up a paper containing this sentence:
"The TNFRSF1A R92Q mutation is frequent in rheumatoid arthritis but shows no evidence for association or linkage with the disease."
A co-occurrence-based system still registers this as a positive gene-disease association, even though the sentence explicitly rules one out. Polly, by contrast, processes the full text, including methods and supplementary data, to confirm that no causal pathway exists.
Now compare that to a sentence describing PCSK9:
"Gain-of-function PCSK9 mutations are causative of familial hypercholesterolemia... whereas loss-of-function PCSK9 mutations are associated with very low LDL-C levels and protection against CAD."
Here, the underlying mechanism is explicit and directional: PCSK9 induces lysosomal degradation of the LDL receptor in the liver → reduces LDL-C clearance → drives atherosclerotic plaque formation → CVD.
But conventional association-based systems flatten both examples into the same type of edge.
The Polly Knowledge Graph transforms unstructured text into a dynamically updating biological operating system, linking 31 million nodes through 60 million relationships that carry real biological meaning.
It goes beyond Open Targets and enriches it in three ways: every association gains directionality, a functional effect, and a statistical confidence grade.
So instead of a flat Disease ↔ Target edge,
the graph captures:
Disease -[relationship type · functional effect · confidence grade]-> Target
Mechanistic context that previously existed only inside free text becomes computationally queryable.
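To make that concrete, here is a minimal sketch of what a qualified edge might carry, using the two examples above. The field names and confidence values are illustrative assumptions, not Polly's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualifiedEdge:
    """One mechanistically qualified Target-Disease edge (illustrative schema)."""
    target: str
    disease: str
    relationship: str                  # e.g. "causal", "protective", "co-occurrence"
    direction: Optional[str]           # e.g. "target -> disease"; None if unknown
    functional_effect: Optional[str]   # e.g. "gain-of-function", "loss-of-function"
    confidence: float                  # statistical weighting, 0.0-1.0 (illustrative)
    evidence: str                      # the sentence the edge was extracted from

# The two examples above, as a qualified graph would represent them:
flat_tnfrsf1a = QualifiedEdge(
    "TNFRSF1A", "rheumatoid arthritis",
    relationship="co-occurrence", direction=None,
    functional_effect=None, confidence=0.1,   # illustrative value
    evidence="...shows no evidence for association or linkage with the disease.")

qualified_pcsk9 = QualifiedEdge(
    "PCSK9", "familial hypercholesterolemia",
    relationship="causal", direction="target -> disease",
    functional_effect="gain-of-function", confidence=0.9,   # illustrative value
    evidence="Gain-of-function PCSK9 mutations are causative of...")
```

Once the relationship type, direction, effect, and confidence are explicit fields rather than buried prose, they can be filtered, ranked, and scored like any other structured data.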
This capability is delivered through a robust architecture. It starts with the Base KG, a high-quality foundation built from curated public data with mechanistically rich edges already applied. From there, Accelerators are layered on top, integrating proprietary in-house data and custom scoring frameworks to reflect a team's specific therapeutic hypotheses.
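As a rough sketch of that layering idea, reusing the hypothetical QualifiedEdge above: an Accelerator can be thought of as a function that re-scores Base KG edges against a team's proprietary evidence. The names and blend weights here are assumptions for illustration.

```python
from typing import Callable, Iterable

def apply_accelerator(
    base_edges: Iterable[QualifiedEdge],
    inhouse_evidence: dict[tuple[str, str], float],
    scoring_fn: Callable[[QualifiedEdge, float], float],
) -> list[QualifiedEdge]:
    """Layer a team-specific Accelerator over Base KG edges (illustrative).

    inhouse_evidence maps (target, disease) pairs to a proprietary score;
    scoring_fn blends that score with the public confidence grade.
    """
    rescored = []
    for edge in base_edges:
        prior = inhouse_evidence.get((edge.target, edge.disease), 0.0)
        edge.confidence = scoring_fn(edge, prior)
        rescored.append(edge)
    return rescored

# Example custom scoring framework: weight the public grade,
# boosted by in-house experimental evidence (weights are arbitrary).
blend = lambda edge, prior: min(1.0, 0.7 * edge.confidence + 0.3 * prior)
```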
This mechanistic qualification fundamentally changes how researchers prioritize and validate targets. When a knowledge graph can distinguish a mere co-occurrence from a causal link, a gain-of-function mutation, or a protective effect, it becomes far more actionable and reliable.
For a translational medicine team, this delivers fewer false leads to review manually, mechanism-level context behind every candidate, and confidence grades that can stand behind multi-million-dollar screening decisions, as sketched below.
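Using the illustrative QualifiedEdge objects from earlier, the difference shows up directly in how a shortlist gets built: co-occurrence-only edges drop out, and causal, confidence-graded mechanisms rise to the top.

```python
def prioritize_targets(edges, disease, min_confidence=0.7):
    """Rank candidate targets for a disease by mechanistic evidence (illustrative)."""
    candidates = [
        e for e in edges
        if e.disease == disease
        and e.relationship in ("causal", "protective")  # skip bare co-occurrences
        and e.confidence >= min_confidence
    ]
    return sorted(candidates, key=lambda e: e.confidence, reverse=True)

shortlist = prioritize_targets(
    [qualified_pcsk9, flat_tnfrsf1a],
    disease="familial hypercholesterolemia")
# The TNFRSF1A co-occurrence edge never reaches the screening shortlist;
# the causal, gain-of-function PCSK9 edge does, with its confidence attached.
```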
It is one thing to manually parse a single sentence; it is an entirely different challenge to process the 21 million text records in Open Targets or the 8 million freely available full-text papers (resulting in roughly 3 billion sentences).
If you attempt to extract the 15+ entity-pair types across hundreds of millions of sentences using frontier large language models (like GPT or Claude), the costs are prohibitive, easily reaching upwards of $500K just for abstracts.
Our solution is a shift to fine-tuned BERT-family models, each trained for a distinct entity type (e.g., one model for proteins, another for anatomy), so we can process the entire accessible corpus at a fraction of the cost while preserving extraction quality.
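A minimal sketch of that setup using the Hugging Face transformers pipeline; the per-entity model checkpoints below are hypothetical placeholders, not our actual model names.

```python
from transformers import pipeline

# Hypothetical per-entity checkpoints: one compact BERT-family model per
# entity type, instead of one expensive frontier LLM for everything.
MODELS = {
    "protein": "your-org/bert-ner-protein",   # placeholder model id
    "disease": "your-org/bert-ner-disease",   # placeholder model id
    "anatomy": "your-org/bert-ner-anatomy",   # placeholder model id
}

extractors = {
    entity: pipeline("token-classification", model=ckpt,
                     aggregation_strategy="simple")
    for entity, ckpt in MODELS.items()
}

def extract_entities(sentence: str) -> dict[str, list[str]]:
    """Run every entity-specific extractor over one sentence."""
    return {
        entity: [span["word"] for span in ner(sentence)]
        for entity, ner in extractors.items()
    }

hits = extract_entities(
    "Gain-of-function PCSK9 mutations are causative of familial hypercholesterolemia.")
# e.g. {"protein": ["PCSK9"], "disease": ["familial hypercholesterolemia"], "anatomy": []}
```

Because each model is small and specialized, the corpus can be sharded and processed in batch on commodity GPUs, which is what makes sentence-level extraction over billions of sentences economically feasible.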