
In the drug discovery cycle, identifying a link between a gene and a disease is the first step toward a drug target breakthrough. To build these hypotheses, most computational biology teams rely on foundational databases like Open Targets, a standard resource that scans millions of abstracts for "co-occurrences": instances where a gene and a disease are simply mentioned in the same sentence.
But this is exactly where the discovery process hits a wall: the system fails to explain the nature of the association. Co-occurrence is not the same as a biological mechanism; just because two entities share a sentence does not mean they actually interact. Yet in conventional target discovery systems, a weak text overlap and a strong causal mechanism look identical. When every text match is treated as a legitimate link, researchers get buried under false leads, and the true signal is drowned out by noise. The burden of interpretation shifts to manually reviewing hundreds of sentences before making multi-million-dollar screening decisions.
To truly accelerate drug discovery, we need to move beyond flat associations and extract the actual biological "why." We solve this by co-building a mechanistically rich knowledge graph that transforms raw text overlaps into confidence-graded, causal biological relationships at scale.
Platforms like Open Targets aggregate evidence from the literature, but they represent these associations at the same flat level of abstraction:
Disease ↔ Target
"This gene is the primary driver of this disease" and "this disease has no known association with this gene" are treated identically if the entities simply appear in the same sentence. There is no directionality, no capture of functional effect, and no statistical weighting.
Consider how this impacts target selection with two examples:
Imagine screening for targets related to rheumatoid arthritis. A standard query pulls up a paper containing this sentence:
"The TNFRSF1A R92Q mutation is frequent in rheumatoid arthritis but shows no evidence for association or linkage with the disease."
A co-occurrence-based system still registers this as a positive gene-disease association, even though the sentence explicitly rules one out. Polly, by contrast, processes the full text, including methods and supplementary data, to confirm that no causal pathway exists.
Now compare that to a sentence describing PCSK9:
"Gain-of-function PCSK9 mutations are causative of familial hypercholesterolemia... whereas loss-of-function PCSK9 mutations are associated with very low LDL-C levels and protection against CAD."
Here, the underlying mechanism is explicit and directional: PCSK9 induces lysosomal degradation of the LDL receptor in the liver → reduces LDL-C clearance → drives atherosclerotic plaque formation → CVD.
But conventional association-based systems flatten both examples into the same type of edge.
The Polly Knowledge Graph transforms unstructured text into a dynamically updating biological operating system, linking 31 million nodes through 60 million relationships that carry real biological meaning.
It goes beyond Open Targets and enriches it in three ways: every association gains directionality, a functional effect, and a statistical confidence grade.
So instead of a flat Disease ↔ Target edge,
the graph captures:
Disease -[relationship type · functional effect · confidence grade]-> Target
Mechanistic context that previously existed only inside free text becomes computationally queryable.
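To make that concrete, here is a minimal sketch of what a qualified edge might carry, using the two examples above. The field names and confidence values are illustrative assumptions, not Polly's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QualifiedEdge:
    """One mechanistically qualified Target-Disease edge (illustrative schema)."""
    target: str
    disease: str
    relationship: str                  # e.g. "causal", "protective", "co-occurrence"
    direction: Optional[str]           # e.g. "target -> disease"; None if unknown
    functional_effect: Optional[str]   # e.g. "gain-of-function", "loss-of-function"
    confidence: float                  # statistical weighting, 0.0-1.0 (illustrative)
    evidence: str                      # the sentence the edge was extracted from

# The two examples above, as a qualified graph would represent them:
flat_tnfrsf1a = QualifiedEdge(
    "TNFRSF1A", "rheumatoid arthritis",
    relationship="co-occurrence", direction=None,
    functional_effect=None, confidence=0.1,   # illustrative value
    evidence="...shows no evidence for association or linkage with the disease.")

qualified_pcsk9 = QualifiedEdge(
    "PCSK9", "familial hypercholesterolemia",
    relationship="causal", direction="target -> disease",
    functional_effect="gain-of-function", confidence=0.9,   # illustrative value
    evidence="Gain-of-function PCSK9 mutations are causative of...")
```

Once the relationship type, direction, effect, and confidence are explicit fields rather than buried prose, they can be filtered, ranked, and scored like any other structured data.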
This capability is delivered through a robust architecture. It starts with the Base KG, a high-quality foundation built from curated public data with mechanistically rich edges already applied. From there, Accelerators are layered on top, integrating proprietary in-house data and custom scoring frameworks to reflect a team's specific therapeutic hypotheses.
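As a rough sketch of that layering idea, reusing the hypothetical QualifiedEdge above: an Accelerator can be thought of as a function that re-scores Base KG edges against a team's proprietary evidence. The names and blend weights here are assumptions for illustration.

```python
from typing import Callable, Iterable

def apply_accelerator(
    base_edges: Iterable[QualifiedEdge],
    inhouse_evidence: dict[tuple[str, str], float],
    scoring_fn: Callable[[QualifiedEdge, float], float],
) -> list[QualifiedEdge]:
    """Layer a team-specific Accelerator over Base KG edges (illustrative).

    inhouse_evidence maps (target, disease) pairs to a proprietary score;
    scoring_fn blends that score with the public confidence grade.
    """
    rescored = []
    for edge in base_edges:
        prior = inhouse_evidence.get((edge.target, edge.disease), 0.0)
        edge.confidence = scoring_fn(edge, prior)
        rescored.append(edge)
    return rescored

# Example custom scoring framework: weight the public grade,
# boosted by in-house experimental evidence (weights are arbitrary).
blend = lambda edge, prior: min(1.0, 0.7 * edge.confidence + 0.3 * prior)
```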
This mechanistic qualification fundamentally changes how researchers prioritize and validate targets. When a knowledge graph can distinguish a mere co-occurrence from a causal link, a gain-of-function mutation, or a protective effect, it becomes far more actionable and reliable.
For a translational medicine team, this delivers fewer false leads to review manually, mechanism-level context behind every candidate, and confidence grades that can stand behind multi-million-dollar screening decisions, as sketched below.
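Using the illustrative QualifiedEdge objects from earlier, the difference shows up directly in how a shortlist gets built: co-occurrence-only edges drop out, and causal, confidence-graded mechanisms rise to the top.

```python
def prioritize_targets(edges, disease, min_confidence=0.7):
    """Rank candidate targets for a disease by mechanistic evidence (illustrative)."""
    candidates = [
        e for e in edges
        if e.disease == disease
        and e.relationship in ("causal", "protective")  # skip bare co-occurrences
        and e.confidence >= min_confidence
    ]
    return sorted(candidates, key=lambda e: e.confidence, reverse=True)

shortlist = prioritize_targets(
    [qualified_pcsk9, flat_tnfrsf1a],
    disease="familial hypercholesterolemia")
# The TNFRSF1A co-occurrence edge never reaches the screening shortlist;
# the causal, gain-of-function PCSK9 edge does, with its confidence attached.
```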
It is one thing to manually parse a single sentence; it is an entirely different challenge to process the 21 million text records in Open Targets or the 8 million freely available full-text papers (resulting in roughly 3 billion sentences).
If you attempt to extract the 15+ entity-pair types across hundreds of millions of sentences using frontier large language models (like GPT or Claude), the costs are prohibitive, easily reaching upwards of $500K just for abstracts.
Our solution is a shift to fine-tuned BERT-family models, each trained for a distinct entity type (e.g., one model for proteins, another for anatomy), so we can process the entire accessible corpus at a fraction of the cost while preserving extraction quality.
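A minimal sketch of that setup using the Hugging Face transformers pipeline; the per-entity model checkpoints below are hypothetical placeholders, not our actual model names.

```python
from transformers import pipeline

# Hypothetical per-entity checkpoints: one compact BERT-family model per
# entity type, instead of one expensive frontier LLM for everything.
MODELS = {
    "protein": "your-org/bert-ner-protein",   # placeholder model id
    "disease": "your-org/bert-ner-disease",   # placeholder model id
    "anatomy": "your-org/bert-ner-anatomy",   # placeholder model id
}

extractors = {
    entity: pipeline("token-classification", model=ckpt,
                     aggregation_strategy="simple")
    for entity, ckpt in MODELS.items()
}

def extract_entities(sentence: str) -> dict[str, list[str]]:
    """Run every entity-specific extractor over one sentence."""
    return {
        entity: [span["word"] for span in ner(sentence)]
        for entity, ner in extractors.items()
    }

hits = extract_entities(
    "Gain-of-function PCSK9 mutations are causative of familial hypercholesterolemia.")
# e.g. {"protein": ["PCSK9"], "disease": ["familial hypercholesterolemia"], "anatomy": []}
```

Because each model is small and specialized, the corpus can be sharded and processed in batch on commodity GPUs, which is what makes sentence-level extraction over billions of sentences economically feasible.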