Designing Scoring Frameworks for Knowledge Graphs

Knowledge graph scoring is what turns a connected dataset into a decision tool. When teams first adopt a biomedical knowledge graph, the early thrill is connectivity - omics files, phenotype tables, drug–target lists, and clinical summaries finally speak the same language. But the very next challenge appears: too much is connected.

Take a common autoimmune indication. Curated databases like OMIM already list 500+ genes associated with rheumatoid arthritis, and broader association sources add orders of magnitude more signals. If you just project those links into a graph, you’ll get long, alphabetized lists that all “look” equally plausible. Without scoring, scientists still face the same dilemma: where do we begin?

Scoring is the layer that turns a graph from a static map of relationships into a decision-making tool. It encodes your priorities, whether it is novelty vs. validation or repurposing vs. new discovery, and translates them into ranked outputs that scientists can act on.

Take the example of drug repurposing. Here, the most valuable hits are drugs that impact non-target genes already linked to other indications. If the goal is entirely new targets, the priorities flip; now the emphasis is on novelty, under-explored genes, and sparse prior evidence. The same dataset, two very different shortlists.

Polly KG was designed with this premise: every connection should be both visible and rankable - and the way we rank should faithfully reflect the problem you’re solving.

Base vs. Biased Scoring in Knowledge Graphs

There isn’t a universal “best” score. What counts as “high value” depends on the scientific goal.

Polly KG uses a two-tier scheme:

Base scores are neutral, evidence-driven measures. They capture transparent signals such as the clinical trial phase of a drug, the number of drugs already associated with a gene, or whether that gene is considered tractable. For instance, a gene linked only to preclinical compounds ranks higher than one tied to marketed drugs, because it represents a greater opportunity for discovery. Similarly, a gene with fewer drug associations scores higher than one already saturated with treatments, helping teams focus on areas with more potential impact.
Biased scores build on base scores by tilting neutral signals toward a specific context. For example, if the focus is rheumatoid arthritis, disease-specific evidence is weighted more heavily, so those connections rise to the top. The base scoring remains intact but is adjusted to reflect the priorities of the user.

A “before/after” to make it concrete

Base-only (novelty-leaning):

GENE_A - preclinical, few drugs
TNF - approved, many drugs
IL6 - in trials

Biased to rheumatoid arthritis:

IL6- strong disease-specific evidence
GENE_A - still novel
TNF - less relevant to the indication

Same graph, same evidence - different lens, different shortlist.

Every score is a scientific choice.

Scores aren’t just math; they’re judgment, expressed numerically.

When novelty is the aim, you up-weight signals that mark under-explored biology: preclinical over approved, fewer known drugs per gene over many, gently down-weight well-trodden targets. That inevitably pushes some highly drugged, repurposable genes down the list. These are not right-or-wrong decisions; they are trade-offs that should be deliberate and aligned with the problem at hand.

Flip the program to repurposing, and the logic flips with it. You privilege well-characterized targets, multiple independent lines of evidence, and tractability histories. A gene might rise precisely because human evidence and clinical tooling are deep (and potentially portable), even though it wouldn’t qualify as “novel.”

The deeper lesson is that scoring must always be anchored in context. There is no single formula that works for every program. A discovery team and a translational team will want different things from the same graph. Polly KG makes that flexibility possible.

Why Multi-Modal Evidence Strengthens Scores

Another critical principle is that no single data type should dominate. A GWAS hit by itself is rarely enough to justify attention, just as a gene expression spike in one dataset might mean little without supporting evidence.

Scoring in Polly KG is designed to compound across modalities - clinical trials, gene expression, genetic variants, phenotypes - so that no decision rests on a lone signal.

Take the case of a gene–drug connection. Its score may combine the trial phase, number of drugs already linked, and tractability as baseline evidence. On top of that, it may be reinforced by expression data showing upregulation in disease tissue, or by genetic evidence from GWAS studies. The result is not a binary yes/no association but a layered ranking that reflects the breadth and depth of evidence.

This multi-modal compounding is what transforms a biomedical knowledge graph from an information store into a drug discovery knowledge graph - a decision engine that scientists can trust.

Case study - Cross-species target discovery

A biopharma team needed to surface human targets across multiple indications using cross-species evidence (human, mouse, and NHP/dog). We enriched the base Polly KG with the team’s internal datasets and selected public resources, harmonized across species, and enabled natural-language querying so scientists could ask questions directly. In collaboration with the biology leads, we designed a multiparameter scoring framework (base + disease-biased) aligned to their criteria. Within weeks, users ran hundreds of queries and prioritized a short list that moved into wet-lab validation - illustrating how a tailored scoring layer turns broad connectivity into focused action.

Usability beats maximalism

A graph is only useful if people can query it in seconds. At one point, in our experience, modeling sample-level connections pushed a graph from ~16 GB to ~27 GB; traversals that used to finish in seconds started timing out. The fix wasn’t bigger machines; it was smarter aggregation (e.g., by tissue or species) and filtering out low-confidence edges that didn’t help the use case. Scientists kept the details they needed, and the speed they expected.

This is a scoring lesson, too: the same instinct that prunes structure should prune evidence. High-noise modalities should contribute less until they earn their keep.

Case study: Advancing target ID and validation in AML

An oncology program required a comprehensive, multi-omics view to reduce downstream risk and accelerate decisions. We assembled an atlas of several thousand indication-specific human samples from multiple public sources, harmonized them, and fed the data into a multi-modal knowledge graph. A custom scoring framework ranked differentiation-oriented targets for experimental follow-up, while public cancer-cell-line data guided validation choices. Over six months, the team validated multiple targets, achieved a several-fold acceleration in target identification, and advanced a candidate to the next development stage - evidence that a scoring layer can compress time from data to decision.

Make it governable - and transparent.

Scoring rules should evolve with the science without surprising downstream users. The way to do that is simple: version your scoring schema (e.g., scoring_v0.3), keep a one-line note on what changed and why, and run a quick “rank-delta” check before and after any material update. If three of the top 20 move, look at the reasons; if the shifts reflect your intent (say, a tighter GWAS threshold or a stronger disease bias), proceed with scientist sign-off. Add basic guardrails - minimum evidence floors - so no single modality can hijack the list.

Equally important, keep the system explainable. If a gene sits at #1, everyone should know why. In Polly KG, base components (trial phase, drug count, tractability) are visible, and disease biases are explicit (“this list is RA-weighted”). That clarity lets discovery and translational teams negotiate the weights, not the data - and it’s what builds trust in the list.

Where scoring goes next

Validation as evidence. When bench results arrive - say, a perturbation screen that consistently shifts a fibrosis marker - those readouts should land on edges as first-class signals. In ranking, “validated in our hands” should outrun “promising in silico.”

AI-assisted thresholds, human-audited. Machine learning can suggest cut-points (where a p-value or effect-size threshold best separates signal from noise) or propose cross-modal weightings. Domain experts should own those decisions after inspecting suggestions against known biology and program risk.

Topology-aware features and link prediction. Graph-native signals - path length, node degree, evidence density - are natural next layers. They can surface indirect but compelling routes (e.g., short, high-confidence paths connecting a drug class to an unexpected phenotype) and feed a principled link-prediction module.

From maps to compasses

Most first-time graph programs stumble the same way: graphs too big to be interactive, opaque or rigid scoring, one-size-fits-all formulas that force teams to compromise. A better pattern:

Keep the graph queryable (aggregate where it doesn’t hurt science).

Keep the scores transparent (so users can argue weights, not the data).

Let intent drive bias (base + biased scores, not endless forks).

Compound across modalities (confidence lives in convergence).

Do that, and your knowledge graph stops being a static atlas of associations and becomes a compass - one that points your team to the next experiment, not just the next edge.

Scoring doesn’t just organize your graph - it tells you where to go next.

‍