November 25, 2025
6 Mins read

Ignore the Outliers at your own risk! - From the CEO's desk

Abhishek Jha
Co-Founder & CEO, Elucidata

I'm AJ, a dad, a husband, and a scientist turned entrepreneur. Every month, I write this newsletter as a space to step outside the barrage of hype and take a clear-eyed look at the realities and lessons in entrepreneurship, technology, data, and discovery, especially where they intersect with life sciences.

Let's dive into this month's newsletter!

Pattern Matching. Not.

In 2010, I finally stepped into the “real world” with a job at Agios Pharmaceuticals. One tiny but special aspect of being at Agios was the opportunity to interact with one of the founders, Lew Cantley. I often saw him in meetings, thinking out loud and sharing his insights. It was beautiful to witness.

Over the years, he would also swing by my desk and ask a few questions, which was always a big fanboy moment for me. Every time. You can learn more about Lew from his own wiki page, but I want to draw from something he would often repeat. I'll loosely paraphrase: any scientific observation that did not make sense would always intrigue him. He would argue that such observations were often the starting point for discovering something valuable and novel.

Lew was, in the language of machine learning, always excited about “out-of-distribution” observations: the rare, the unusual, the unexpected that defied previously observed patterns.

That mindset of treating an observation that does not fit the pattern as a potential signal, not noise, is the foundation not just of impactful science, but also of powerful machine learning for many valuable problems.
So what exactly is "out-of-distribution" data?

A few simple observations.

For convenience, though imperfectly, we can divide the evolution of AI into two phases: the pre-LLM and post-LLM eras. The pre-LLM world was dominated primarily by supervised models. Every single model (the good ones and the bad ones) relied on training and test datasets.

The default assumption for all supervised models is that the training and test datasets are drawn from the same distribution, i.e., that the data is Independent and Identically Distributed (IID) (Krizhevsky et al., 2012; He et al., 2015). However, once you are done responding to the third referee and publish the model, it enters the wild real world, where test samples can be distinct from what the model was trained on: out-of-distribution (OOD) data.

In practice, the inability of many models to handle OOD can span the spectrum from funny to fatal.

You are searching for a rare indie documentary on Netflix; you don't find it on the first page, so you shrug and move on. However, the consequences of your self-driving car's AI model failing on such problems can be profound. The same goes for AI applications in healthcare, life sciences, finance, and manufacturing, to name a few.

Closer to home, almost every major biotech discovery hinges on that rare but valuable observation: essentially, a class of OOD observations, as Lew repeatedly pointed out to me, drawing from his own prolific career.

The resistant patient.
The unexpected drug response.
The unexplained trial result.

Once you rule out the possibility that these observations are artifacts, they're the signals that lead to rethinking disease, identifying new mechanisms, and launching the next generation of innovation.

Traditional AI will not work for OOD!

Conventional supervised learning techniques cannot be straightforwardly applied to resolve OOD generalization, because the fundamental assumption underpinning them, IID data, postulates that the training and test datasets originate from the same distribution.

However, this assumption is systematically violated in OOD generalization scenarios due to inevitable distributional shifts, rendering classical learning theory inadequate.
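To make this concrete, here is a toy sketch (assuming Python with NumPy and scikit-learn; the Gaussian toy data and the drift are entirely illustrative) of a classifier that scores well on IID test data but fails, confidently, once the distribution shifts:

```python
# Toy sketch: a model that honors the IID assumption at test time,
# then breaks confidently under distribution shift. All data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Train on two well-separated Gaussian classes (the "in-distribution" world).
X_train = np.vstack([rng.normal(0.0, 1.0, (500, 2)),
                     rng.normal(3.0, 1.0, (500, 2))])
y_train = np.array([0] * 500 + [1] * 500)
model = LogisticRegression().fit(X_train, y_train)

# IID test set: same distribution, so accuracy holds up.
X_iid = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
                   rng.normal(3.0, 1.0, (200, 2))])
y_iid = np.array([0] * 200 + [1] * 200)
print("IID accuracy:", model.score(X_iid, y_iid))    # high, ~0.98

# OOD test set: class 0's features have drifted into class 1's region.
X_ood = rng.normal(3.0, 1.0, (200, 2))
y_ood = np.zeros(200, dtype=int)
print("OOD accuracy:", model.score(X_ood, y_ood))    # near 0
print("Mean OOD confidence:",
      model.predict_proba(X_ood).max(axis=1).mean()) # still high
```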

Scale to the rescue. Welcome to the post-LLM world!

What if you could build a model on such a large and diverse corpus of data that the likelihood of encountering out-of-distribution (OOD) test data is low?

Let's add one more thing to the wish list: this large, diverse dataset doesn't need to be labeled the way supervised learning models require.

A slew of LLMs, built as foundation models, has made this wish a reality, with varying degrees of success.

And for good reason: the foundation models that have captured the public imagination like few other technological advances in recent years are exactly that, pre-trained on huge corpora of unlabeled data.
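The trick is self-supervision: the "labels" are simply the next tokens of the unlabeled text itself, so no annotation is needed. Here is a deliberately tiny sketch of the idea, with a bigram counter standing in for an actual LLM (an illustrative simplification, not how LLMs are built):

```python
# Self-supervision in miniature: the next token is the label, so raw
# unlabeled text supervises itself. A bigram counter stands in for an LLM.
from collections import Counter, defaultdict

corpus = "the cell divides and the cell grows".split()  # unlabeled text
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1  # supervision comes from the data itself

def predict_next(word: str):
    """Return the most likely next word seen in the corpus, if any."""
    return bigrams[word].most_common(1)[0][0] if word in bigrams else None

print(predict_next("the"))  # -> "cell"
```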

But …

… problems remain. In some ways they are worse: OOD inputs are now being predicted with a high degree of confidence, even when the predictions are wrong.

A problem discussed widely: hallucinations.

Besides, for more domain-specific tasks you would need to fine-tune LLMs, which again requires supervised fine-tuning methods. Overall, you have a better starting point, because you have access to a pretty powerful, though flawed, pre-trained model, but the fundamental problem of handling OOD remains unsolved.
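For illustration, here is a minimal supervised fine-tuning sketch using the Hugging Face transformers Trainer. The base model (gpt2), the file domain_corpus.jsonl, its "text" field, and all hyperparameters are placeholder assumptions:

```python
# Minimal supervised fine-tuning (SFT) sketch with Hugging Face transformers.
# Model, data file, and hyperparameters below are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical domain corpus: one JSON object per line with a "text" field.
dataset = load_dataset("json", data_files="domain_corpus.jsonl")["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()
```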

Handling OOD is at the bleeding edge of contemporary AI research and development, with very high-value implications in many industries, including but not limited to drug discovery.

Let us dig a little deeper into OOD!

The term out-of-distribution (OOD) detection started floating around back in 2016. Since then, it’s become one of those buzzwords that quietly turned into a full-blown research trend. People have tried just about everything (classification tricks, density models, distance metrics), all chasing the same goal: figuring out when an AI model sees something it’s never seen before.

What’s funny is that there is a bunch of related fields (anomaly detection, novelty detection, open set recognition, outlier detection), all trying to solve a similar problem, just with slightly different rules.

Each community has built its own vocabulary and benchmarks, often without much cross-talk. The end result? A tangle of overlapping ideas that makes the whole space feel more confusing than it needs to be.
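To anchor one thread of that tangle: among the earliest and simplest baselines from that 2016-era work is scoring an input by its maximum softmax probability (MSP) and flagging low-confidence inputs as possibly OOD. A minimal sketch, with illustrative logits and an illustrative threshold:

```python
# Maximum-softmax-probability (MSP) baseline for OOD detection: a flat,
# low-confidence output suggests the input may be out-of-distribution.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_score(logits: np.ndarray) -> np.ndarray:
    """OOD score: the model's peak class probability per input."""
    return softmax(logits).max(axis=-1)

def flag_ood(logits: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """True where confidence is too low to trust as in-distribution."""
    return msp_score(logits) < threshold

# A familiar input yields a peaked distribution; an unfamiliar one is flat.
logits = np.array([[6.0, 0.5, 0.2],    # in-distribution-looking
                   [1.1, 1.0, 0.9]])   # OOD-looking
print(msp_score(logits))  # ~[0.99, 0.37]
print(flag_ood(logits))   # [False  True]
```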

It feels like we’re at a point where the AI community could really use a map, a way to connect these threads into one clear framework.

The pieces are all there, but they need to be pulled together.

Data-centric AI is the way out.
Make Data the hero, not the sidekick.

There’s a comforting myth that “more data equals better models.”

But real progress in the last decade came from getting the right data.

Let me share an example. A little background first. CommonCrawl is a non-profit that crawls the entire web and maintains an open-source repository for anyone to use. In the last 18 years, it has crawled 300 billion pages and adds 3-5 billion new pages every month.

As you can guess, this is a very valuable resource for AI researchers. It is part of the training dataset for all the popular foundation models you have heard of recently and has been cited in over 10,000 research papers.

All this background was to tell you about a new paper that creates a resource called DCLM by applying some very basic quality filters to the entire corpus of 300 billion web pages.

Projects like DataComp show the same trend: they sifted through 240 trillion tokens and found that models trained on the top 1.4% (the cleanest, best-tailored data) outperformed models fed with “noisy” mass data.
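To make "basic quality filters" tangible, here is a toy sketch in that spirit. Real DCLM/DataComp pipelines are far richer (deduplication, language identification, model-based classifiers), and every rule and threshold below is an illustrative assumption:

```python
# Toy heuristic quality filter in the spirit of web-corpus curation.
# All rules and thresholds are illustrative, not those of DCLM/DataComp.
def looks_clean(doc: str) -> bool:
    words = doc.split()
    if len(words) < 50:                        # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:     # highly repetitive boilerplate
        return False
    alpha = sum(ch.isalpha() for ch in doc) / max(len(doc), 1)
    if alpha < 0.6:                            # mostly markup or symbols
        return False
    return True

# corpus would hold raw web documents, e.g., pages pulled from CommonCrawl.
corpus: list[str] = []
filtered = [doc for doc in corpus if looks_clean(doc)]
```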

The core tenet of data-centric AI is to shift the focus to data (the hero, not the sidekick), in contrast to the model-centric approach.

The idea is that you can get more bang for your buck, i.e., more improvement in your model's performance, by focusing on the data rather than the model. This is acutely true for OOD problems, where you often do not have a large corpus to begin with but are tasked with predicting some very valuable OOD observations.
Having high-quality data is a key element of the data-centric AI framework.

However, there are other complementary approaches that can help with OOD problems:
  • Physics-based rules (e.g., AlphaFold + structural data from rcsb.org)
  • Federated learning to train on diverse, relevant data (see the sketch after this list)
  • Synthetic data for diversity challenges
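As promised above, here is a minimal sketch of federated averaging (FedAvg), the canonical federated-learning step: each site trains on its own private data, and only model weights leave the site to be averaged. The linear model, learning rate, and synthetic data are illustrative assumptions:

```python
# Minimal FedAvg sketch: local training at each site, then a weighted
# average of the weights. Model, data, and hyperparameters are illustrative.
import numpy as np

def local_update(w, X, y, lr=0.01, epochs=10):
    """A few steps of local least-squares gradient descent at one site."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg_round(w_global, sites):
    """One round: broadcast weights, train locally, average by site size."""
    updates = [local_update(w_global.copy(), X, y) for X, y in sites]
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    return np.average(updates, axis=0, weights=sizes)

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])

# Two sites whose covariates are distributed differently (diverse data).
sites = []
for shift in (0.0, 3.0):
    X = rng.normal(shift, 1.0, (100, 2))
    sites.append((X, X @ w_true + rng.normal(0.0, 0.1, 100)))

w = np.zeros(2)
for _ in range(50):
    w = fedavg_round(w, sites)
print("recovered weights:", w)  # approaches w_true without pooling raw data
```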

Parting thoughts...

Traditional AI approaches will continue to struggle to deliver value, particularly when it comes to real-world problems where you run into OOD observations.

It is time to embrace the data-centric AI approach and deliver on the promise of AI in enterprises.

The data is rarely center stage, but it is always the bottleneck.

Data is the hero, not the sidekick.

Until next time,
AJ

Abhishek Jha
CEO – Elucidata

Scoops from Elucidata

We are now available on Scientist.com and Science Exchange; you can engage with us directly on these platforms for data-centric AI collaborations.

Copyright ©2025 Elucidata Corporation, All rights reserved.

Workbar Cambridge, 130 Bishop Allen Dr 5th floor, Cambridge, MA 02139, United States

Our mailing address is:
info@elucidata.io