The Data-Centric Mandate: Why the Hero of AI-Driven Drug Discovery is Data, Not the Model

AI investment and the Eroom’s Law problem

Over the past decade, biopharmaceutical companies and investors have poured billions of dollars into artificial intelligence tools for drug discovery. Venture funding alone exceeded $20 billion in the five years leading up to 2025. This surge of interest reflects a hope that machine learning models can reverse Eroom’s Law: the observation, documented in a 2012 Nature Reviews Drug Discovery analysis of regulatory filings, that the number of new drugs approved per billion dollars of R&D spending has halved roughly every nine years since 1950, despite advances in science and technology.
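The halving rate compounds quickly. A back-of-the-envelope sketch (assuming a strict nine-year halving from an arbitrary 1950 baseline; the function name and baseline are illustrative, not from the original analysis) shows roughly a hundredfold productivity decline over sixty years:

```python
def drugs_per_billion(year, baseline=1.0, start=1950, halving_years=9):
    # Eroom's Law as simple exponential decay: output halves every 9 years.
    return baseline * 0.5 ** ((year - start) / halving_years)

# Ratio of 1950 productivity to 2010 productivity:
decline = drugs_per_billion(1950) / drugs_per_billion(2010)
# 2 ** (60 / 9) ≈ 101, i.e. on the order of 100x fewer approvals per billion dollars
```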

Despite this AI-driven investment boom, overall R&D productivity has not yet improved. Pharmaceutical companies must now ask: What if we’ve been optimizing the wrong variable?

Pattern matching vs. novel biology: The real challenge of AI in life sciences

In the technology world, AI thrives on pattern matching. Netflix discovered that people who liked Kevin Spacey and political thrillers would probably enjoy House of Cards. Spotify uses listening data from people with tastes similar to yours to recommend songs you’ll likely enjoy.

In biology, model innovation has accelerated rapidly. Transcriptformer and other tools push boundaries in single-cell analysis, and AlphaFold has delivered breakthroughs in protein structure prediction. These models are exciting and often work well.

But drug discovery remains a bet. A biotech startup might believe a bispecific antibody will outperform standard therapy based on limited evidence or a scientific hunch. To turn that hypothesis into a therapeutic product, the company must make a series of additional high-stakes bets:

  • Pinpointing the biological pathway and the right target
  • Demonstrating the candidate is effective in the laboratory
  • Showing the lead is safe in humans
  • Proving that it is efficacious in clinical settings
  • Anticipating risks and planning to avoid or manage adverse events
  • Preparing for real-world adoption

In this reality, success hinges not on whether the model is a transformer or a graph neural network but on whether the data behind each decision is context-rich, clean and grounded in biology.

When computational predictions meet clinical reality

AI researchers have built increasingly sophisticated architectures. Graph neural networks predict molecular properties, transformers analyze protein structures and generative models design new compounds. These models perform well in internal tests, yet prospective validation often fails. Hit rates drop, predicted properties diverge from experimental measurements and molecules prioritized by algorithms fail in later stages.

Real-world examples illustrate this disconnect. In October 2023, Exscientia discontinued its A2A receptor antagonist EXS‑21546 because the company concluded that achieving the prolonged high level of target coverage necessary for a therapeutic effect would be difficult based on peer data. BenevolentAI’s topical pan‑Trk inhibitor BEN‑2293 met its primary safety endpoint in Phase IIa, but the candidate did not significantly improve itch or inflammation across the broader patient population. These outcomes highlight how hard it is to translate computational predictions into clinical success, and suggest that the central obstacle is data quality rather than the AI methodology itself.

The Out‑of‑Distribution problem: Biology doesn’t behave like tech

Classical machine learning assumes that training and test data come from the same distribution. Drug discovery repeatedly violates this assumption because it intentionally seeks novelty: new molecules, targets and patient responses. Hidden technical variability further shifts the distribution. Different laboratories use different reagents, protocols and instruments; sequencing platforms evolve; and even within the same institution, batch effects from sample preparation can dominate the true biological signal.

In single-cell RNA sequencing, when researchers combine datasets from multiple labs, cells sometimes cluster by the laboratory or sequencing platform rather than by cell type. Studies note that batch effects can account for large fractions of variance in multi-source datasets, sometimes matching or exceeding the biological signal. Models trained on such data risk learning technical artifacts instead of biology. Targets nominated from these analyses might represent experimental noise rather than genuine disease mechanisms.
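How badly batch effects dominate can be checked with a simple variance decomposition: the fraction of a measurement's total variance explained by batch membership (a one-way ANOVA R²). A minimal sketch in plain Python, with synthetic values standing in for real expression data:

```python
def batch_variance_fraction(values, batches):
    """Fraction of total variance explained by batch membership
    (between-batch sum of squares / total sum of squares)."""
    grand = sum(values) / len(values)
    total_ss = sum((v - grand) ** 2 for v in values)
    groups = {}
    for v, b in zip(values, batches):
        groups.setdefault(b, []).append(v)
    between_ss = sum(
        len(g) * ((sum(g) / len(g)) - grand) ** 2 for g in groups.values()
    )
    return between_ss / total_ss

# Synthetic example: two labs measure the same gene, but lab B's
# platform adds a constant offset that swamps the biology.
lab_a = [1.0, 1.2, 0.9, 1.1]
lab_b = [5.0, 5.3, 4.8, 5.1]          # same biology, shifted baseline
frac = batch_variance_fraction(lab_a + lab_b, ["A"] * 4 + ["B"] * 4)
# frac is close to 1.0: almost all variance here is the batch, not biology
```

A model fit to the raw values in this toy case would mostly learn which lab a sample came from.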

The Data‑Centric paradigm: A fundamental reorientation

Andrew Ng and other AI leaders advocate a shift toward data‑centric AI. Ng observes that in the traditional workflow, researchers hold the data fixed and iterate on the model. In data‑centric AI, they instead hold the model constant and iteratively improve the quality and consistency of the data. He emphasizes that consistency of data is paramount and that now that models have advanced to a certain point, “we have to make the data work as well”. Data‑centric AI calls for systematic engineering of datasets to improve quality, representativeness and annotation richness rather than chasing ever more complex architectures.

In drug discovery, where generating each data point is costly and time‑consuming, the marginal return on improving data quality often outweighs gains from algorithmic sophistication. When you have hundreds of annotated compounds instead of millions of images, each data point must be high-quality and context-rich.
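The data-centric loop can be sketched with a toy example: hold a deliberately simple model fixed (here a one-feature nearest-centroid classifier, a hypothetical stand-in, not a real discovery model) and iterate on the training data instead, removing points whose labels the fitted model contradicts, as a crude proxy for sending them back for re-annotation:

```python
def fit_centroids(X, y):
    # "Model held fixed": a one-feature nearest-centroid classifier.
    cents = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        cents[label] = sum(pts) / len(pts)
    return cents

def predict(cents, x):
    return min(cents, key=lambda label: abs(cents[label] - x))

def drop_inconsistent(X, y):
    # One data-centric iteration: flag training points whose label the
    # fitted model contradicts, and drop them (a proxy for re-annotation).
    cents = fit_centroids(X, y)
    kept = [(x, l) for x, l in zip(X, y) if predict(cents, x) == l]
    return [x for x, _ in kept], [l for _, l in kept]

# Two assay-readout clusters; the point at x=0.0 carries a flipped label.
X = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
y = [1,   0,   0,   1,    1,    1]
X_clean, y_clean = drop_inconsistent(X, y)
# The mislabeled point is removed; the labels left behind are consistent.
```

The model never changes between iterations; only the dataset does, which is the point of the exercise.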

Regulators are echoing this shift. The FDA’s Good Machine Learning Practice principles state that clinical study participants and datasets should be representative of the intended patient population and that training datasets should be independent of test sets. While the FDA does not use the term “data‑centric AI,” its guidelines underscore the importance of dataset quality, provenance and representativeness in building trustworthy AI systems.

What data‑centric AI requires: Three essential capabilities

Implementing data‑centric AI in drug discovery requires three core capabilities:

  1. Harmonization. Researchers must unify datasets generated under varied technical conditions. In single-cell biology, methods like Harmony or Seurat’s integration workflow align cell populations across laboratories while preserving genuine biological differences. Effective harmonization requires balancing removal of technical variation against preservation of real biology, and it demands domain expertise to validate results.
  2. Curation and annotation. Datasets need rich metadata capturing ontology mappings, quality control, provenance and experimental context. Without detailed annotations such as patient cohort, disease stage or treatment condition, datasets cannot support predictive modeling or meaningful biological discovery.
  3. Out‑of‑Distribution‑aware evaluation. Models should be evaluated under realistic conditions rather than using random train–test splits that artificially inflate performance. Techniques like temporal splits, leave‑one‑batch‑out validation and scaffold splits in chemistry ensure that test sets are truly distinct and reveal model weaknesses before expensive prospective experiments. Emerging methods like UMAP‑based clustering splits create even more challenging out‑of‑distribution evaluations for virtual screening.
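Of these evaluation schemes, leave-one-batch-out validation is the easiest to retrofit onto an existing pipeline: each batch takes a turn as the held-out test set, so the model is always scored on a batch it never saw during training. A minimal sketch, assuming a per-sample list of batch identifiers is available:

```python
def leave_one_batch_out(batches):
    """Yield (train_idx, test_idx) pairs, holding out one batch at a time.

    `batches` is a per-sample list of batch identifiers, e.g. the lab
    or sequencing run each sample came from.
    """
    for held_out in sorted(set(batches), key=str):
        test_idx = [i for i, b in enumerate(batches) if b == held_out]
        train_idx = [i for i, b in enumerate(batches) if b != held_out]
        yield train_idx, test_idx

# Usage: six samples from three sequencing runs -> three splits,
# each testing on a run the model never trained on.
batches = ["run1", "run1", "run2", "run2", "run3", "run3"]
splits = list(leave_one_batch_out(batches))
```

This is the same grouping idea behind scikit-learn's `LeaveOneGroupOut`; a random split over the same samples would leak each run into both train and test.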

By focusing on harmonization, curation and OOD‑aware evaluation, organizations can create AI‑ready assets that capture real biology instead of technical noise.
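The harmonization idea above can be illustrated at its most naive as per-batch centering, which removes each batch's baseline offset for a feature. This deliberately simple sketch is far cruder than Harmony or Seurat, which operate in a shared embedding and work to preserve genuine biological structure:

```python
def center_per_batch(values, batches):
    """Subtract each batch's mean so a constant platform offset vanishes.

    A naive sketch only: real integration tools also guard against
    erasing true biological differences between the batches.
    """
    grouped = {}
    for v, b in zip(values, batches):
        grouped.setdefault(b, []).append(v)
    means = {b: sum(vs) / len(vs) for b, vs in grouped.items()}
    return [v - means[b] for v, b in zip(values, batches)]

# Lab B's platform shifts the same underlying signal by roughly +4.
values  = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]
batches = ["A", "A", "A", "B", "B", "B"]
centered = center_per_batch(values, batches)
# After centering, both labs' values sit on a common baseline.
```

The danger, and the reason harmonization needs domain expertise, is that if labs A and B had measured genuinely different biology, this transform would erase that difference too.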

The path forward: Infrastructure over iteration

Evidence across the industry suggests that improving data quality leads to larger performance gains than fine‑tuning algorithms when data are scarce or heterogeneous. AI leaders like Andrew Ng argue that many models are effectively “solved problems,” meaning that further gains require better data. Regulatory mandates are formalizing data‑quality requirements, and high‑profile failures tied to unvetted datasets are reinforcing the message. For biopharma decision‑makers, the challenge is cultural and infrastructural: they must prioritize data engineering, invest in metadata and curation infrastructure and build teams that integrate domain expertise into data preparation.

The opportunity is significant. Organizations that build robust data‑centric capabilities can generate reproducible insights and move candidates through the pipeline more efficiently. Competitors who focus solely on algorithmic innovation may continue to chase mirages built on noisy data.

In the end, the hero of AI‑driven drug discovery is not a magical model that finds signal in chaos. It is the careful, systematic work of transforming heterogeneous, noisy data into harmonized, curated and honestly evaluated assets. That foundational investment is what will ultimately enable AI to deliver on its promise in medicine.
