September 8, 2025
6 min read

The Contrarian Approach That’s Making Biomedical AI Practical

Abhishek Jha
Co-Founder & CEO, Elucidata

Hi, I’m AJ, a dad, a husband, and a scientist turned entrepreneur. In this newsletter, we will look at how AI-native organizations are built, and what it truly takes to go from AI-naive to AI-native.

Five years ago, we stood at a whiteboard, thinking we could solve the AI-readiness problem for biomedical data. Our mandate was bold: we wanted every dataset to be AI-ready, for every use case, for every stakeholder. Over the next few years, as our ambition met the reality of “real” biomedical data, we struggled. Today, in hindsight, I can say we were wrong in our approach. What we learned, and what I want to share in this newsletter, is the hard-earned path to doing it right.

A mile wide and an inch deep!

The Whiteboard Moment

There’s a moment that plays on loop in my head.

Five years ago, we were standing at a whiteboard, mapping out how we’d make R&D data usable for AI. Our ambition was breathtaking: bring together all the data (text, tables, images) and harmonize it into a single data atlas. One data atlas to feed all downstream applications.

I remember thinking: if we do this right, we solve the biggest problem in life sciences AI.

We didn’t!

We built something that looked impressive: an enterprise-wide data atlas. But we quickly ran into the same issues many others have since: rigid schemas, curation overheads, high storage costs, and, worst of all, data that couldn’t talk to downstream tools without manual patchwork.

The ROI just wasn’t there.

An inch wide and a mile deep

As I started talking more about our challenges with other thought leaders in this space, a realization emerged that was both comforting and troubling: we were not alone.

Whether it’s pharma’s internal platforms or public–private partnerships, many ambitious AI initiatives have quietly hit the same wall.

The truth is, broad harmonization sounds noble, but in practice it isn’t: it’s human-intensive, slow, and rarely pays off.

For example, take the UK Biobank: a monumental public resource, but one that took over a decade and hundreds of millions of pounds to centralize data across the NHS, and it still struggles to support different use cases. In a nutshell, it solves an availability problem by creating an amazing resource, but the usability problem (aka the AI-readiness problem) remains unsolved. This isn’t a failure of intent. It’s a reminder of just how costly and brittle “boiling the ocean” approaches can be.

In fact, we draw a lot of inspiration from Novartis’s industry-defining Data42 initiative. I draw energy from one of Vas Narasimhan’s quotes: “AI – but first, data.”

In the era of data lakes and data fabrics, and drawing on our experience with other large pharma and biotech companies, we have strongly pivoted to a contrarian approach to the data curation problem: take only a focused, relevant corpus of data (for example, for a particular therapeutic area, a mechanism of action, or a pre-decided, purpose-driven collection of legacy clinical trial data). This delivers a much faster and better ROI than the expansive data-lake approach of harmonizing all the data for every stakeholder.

This approach has served our customers very well and has come to define how we tackle the AI-readiness problem.
This isn’t just our story; it’s the story of an industry still struggling to turn fragmented data into usable, AI-ready assets. Many are still trying to wrangle all their data into enterprise lakes, only to find themselves stuck with more complexity than clarity!

From Boiling the Ocean to Building Data Products

Instead of trying to boil the ocean, we flipped the question. What’s the business case? What’s the downstream decision we’re trying to make? And only then: What data do we need for that?

We call the outcome a “data product,” a secure, structured, compliant corpus tailored to a specific use case. We’ve built over 200 of them since, each tied to real workflows: validating a target, manufacturing a vector, assembling a cohort.
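
To make the “data product” idea concrete, here is a minimal sketch of what such a specification could look like in code. It is illustrative only: the DataProduct class, its fields, and the example values are my assumptions for this sketch, not Elucidata’s actual schema.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DataProduct:
        """One purpose-built corpus, tied to a single downstream decision.

        Illustrative sketch: field names are assumptions, not a real schema.
        """
        use_case: str           # the decision this corpus serves, e.g. target validation
        therapeutic_area: str   # deliberately narrow scope: an inch wide, a mile deep
        sources: tuple[str, ...]  # only the datasets this decision actually needs
        schema_version: str     # versioned so downstream tools stay compatible
        access_policy: str      # security and compliance are part of the product

    cohort_product = DataProduct(
        use_case="assemble an ICI-treated oncology cohort",
        therapeutic_area="oncology",
        sources=("internal_trials", "public_omics", "real_world_evidence"),
        schema_version="1.0",
        access_policy="role-based, audit-logged",
    )

The point of the sketch is the constraint: every field forces you to name the decision, the scope, and the compliance posture up front, which is exactly the day-one alignment described next.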

I know this sounds like a narrower approach. It is. But that’s precisely why it works. It’s deeper. It’s practical. And it forces alignment from day one between science, AI, and ROI.

Why Data-Centric AI is Crucial in Biomedicine

In some sense, what we embraced is what people today call data-centric AI.

For years, the default thinking was: more data will trump bad data.
But in biomedical research, clean data trumps more data.

Models trained on biased or poorly curated datasets don’t just underperform; they actively mislead. An oncology model trained on limited trial cohorts can fail spectacularly on real-world patients. A toxicity predictor built on incomplete preclinical data can derail an otherwise promising program.

That’s why in biomedicine, the marginal value of better data is often higher than the marginal value of more data.

This field doesn’t suffer from a shortage of algorithms. It suffers from a scarcity of data that is usable, representative, and trustworthy.

In our world, doing the hard, unglamorous work of structuring messy trial protocols, normalizing assay results, and harmonizing multi-modal datasets isn’t optional — it’s the foundation for any credible AI output downstream.
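
To give a flavor of that unglamorous work, here is a minimal sketch of one such curation step: normalizing assay potency values reported in mixed units onto a single scale. The records, field names, and unit table are illustrative assumptions, not a real pipeline.

    # One small curation step: harmonize IC50 values reported in mixed units
    # (nM, uM, mM) onto a single nanomolar scale. Illustrative sketch only.
    UNIT_TO_NM = {"nM": 1.0, "uM": 1_000.0, "mM": 1_000_000.0}

    def normalize_ic50(record: dict) -> dict:
        """Return a copy of an assay record with its IC50 expressed in nM."""
        unit = record["unit"]
        if unit not in UNIT_TO_NM:
            # Surface bad data loudly; silently dropping it is how bias creeps in.
            raise ValueError(f"Unrecognized unit: {unit!r}")
        return {**record, "ic50_nM": record["ic50"] * UNIT_TO_NM[unit], "unit": "nM"}

    raw = [
        {"compound": "CMP-001", "ic50": 0.25, "unit": "uM"},
        {"compound": "CMP-002", "ic50": 180.0, "unit": "nM"},
    ]
    clean = [normalize_ic50(r) for r in raw]  # now comparable: 250 nM and 180 nM

Multiply a few thousand such small, boring fixes across protocols, assays, and modalities, and you get the difference between a model that misleads and one that can be trusted.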

In the process, adoption came naturally, because what we delivered wasn’t abstract infrastructure but something immediately tied to our customers’ real-world workflows.

From AI-Naive to AI-Native

I’ve come to think of this as a contrarian position, but maybe it won’t be for long.

The industry is learning.

Slowly, painfully.

The pendulum is swinging away from “integrate everything” to “optimize for decisions.” From data accumulation to data usability.

And honestly? That shift excites me.

As a friend put it in a recent conversation: “While the world was tormented and swayed and grilled by AI, none of that was or will be possible without grinding through messy data; it’s a truth that’s not told or heard enough.”

This is how we help organizations move from AI-naive to AI-native.

It’s not glamorous. But it’s honest. And it’s working.

Until next time,
AJ
Abhishek Jha
CEO – Elucidata

Scoops from Elucidata
Product Update: Introducing Polly Xtract
Unlock structured insights from complex protocols in seconds, no manual curation required.

Whitepaper: Co-build to Accelerate Pipeline Decisions: TMI Framework for Comprehensive Therapeutic Evaluation

Customer Success Story: Elucidata Delivers 99% Accurate Oncology Metadata Curation for ICI Therapy Research

Trending Blog: How AI Frees and Refocuses the Scientific Mind

Recent Webinar: Agentic AI Approach to PK/PD Intel in Clinical Trials
Follow Us

We are now available on Scientist.com and Science Exchange. You can engage with us directly on these platforms for data-centric AI collaborations.

Copyright © 2025 Elucidata Corporation. All rights reserved.

625 Massachusetts Ave 2nd Floor
Cambridge, MA 02139
USA

Our mailing address is:
info@elucidata.io