A Data-Centric Approach to Solving Real-World Problems in Life Sciences

Molecular studies have created a vast resource of useful data. Precision medicine and targeted drug discovery are more data-intensive than they have ever been. Manually mining this data for useful insights is no longer practical, so machine learning (ML) models are now used to both find and analyze datasets.

In the last decade, ML models have matured considerably for many problems. Even for relatively complex tasks, one need not start from scratch. Taking a pre-existing model and tuning it is often faster and cheaper, and it often gives better insights. But even robust pre-existing models need relevant data to give useful results: they must be tuned on the 'right data'.

The data needed is still hard to find. Even when found, it is often poorly curated and therefore of low quality. Though high-throughput experiments churn out data at a very fast pace, the data loses its value because of missing labels, wrong annotations, and storage that makes it hard to retrieve. It's high time we stopped looking (just) at the models and started iterating on the data: add the missing labels and tags, annotate each dataset properly, curate it under consistent fields, follow a specific ontology, add relevant metadata fields, and make the data types homogeneous.

In short, improve the quality of data.
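
As a rough illustration of what iterating on the data can look like, the sketch below audits a few dataset records for missing metadata before any model is trained. The field names (disease, tissue, cell_type), the dataset IDs, and the records themselves are hypothetical and are not taken from any particular platform or schema.

```python
# Hypothetical required metadata fields; a real curation schema will differ.
REQUIRED_FIELDS = ("disease", "tissue", "cell_type")


def audit_metadata(records):
    """Return {dataset_id: [missing fields]} for records with gaps."""
    report = {}
    for record in records:
        missing = [
            field
            for field in REQUIRED_FIELDS
            # Treat absent keys, None, and blank strings as missing.
            if not str(record.get(field) or "").strip()
        ]
        if missing:
            report[record["id"]] = missing
    return report


# Made-up records for illustration only.
records = [
    {"id": "DS1", "disease": "breast cancer", "tissue": "breast", "cell_type": "epithelial"},
    {"id": "DS2", "disease": "breast cancer", "tissue": "", "cell_type": None},
    {"id": "DS3", "tissue": "blood", "cell_type": "monocyte"},
]

gaps = audit_metadata(records)
print(gaps)
```

A report like this gives curators a concrete list of what to fix in each dataset, which is the kind of feedback loop a data-centric workflow relies on.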

Data-Centric Approach

Data-centric AI advocates iterating on the data to improve outcomes while (largely) keeping the model fixed. The argument is that tuning the data is often much easier than creating a model from scratch.

This is especially important in biomedical research, where datasets are small, often fewer than 1,000 samples. Compare this to internet companies, which generate millions of labels every day for spam filters, or to domains like business analytics, where a dataset can run to millions of data points (aka samples).

Researchers tried to create ML models for imaging-based Covid-19 diagnosis during the pandemic. The need was dire, and the promise huge. But many of the proposed models were found to be unsuitable for clinical translation because they were trained on low-quality or biased data¹. Most research problems in target discovery operate on far less data than imaging-based diagnosis.

The good news is that results can improve dramatically when the input data is highly curated and relevant. This is an opportunity for scientists. One does not always have to go looking for new samples or find new datasets.

The following real-life examples illustrate that 'less can be more' when data quality is high:

Identification of Differentiation Targets

A Cambridge-based, oncology-focused company aims to cure cancer through differentiation therapy. They approached Elucidata to use its ML expertise to identify differentiation targets in acute myeloid leukemia (AML).

They had several pain points:

1. Finding datasets relevant to AML was difficult because of varied ontology usage (such as Acute Myeloid leukemia, AML, Myeloid leukemia) in the articles.

2. The number of relevant datasets was very small.

Still, the researchers were able to use the available data to find relevant targets and even fast-track target identification to 2-3 months, significantly shorter than the average of 1-2 years.

This was possible for two reasons. First, the datasets were highly curated and followed a fixed ontology, so we could retrieve them faster. Second, because the training data was of high quality, even a standard model gave outstanding results and allowed rapid progress in target identification.
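
To make the ontology point concrete, here is a minimal sketch of how mapping free-text disease labels to one canonical term makes retrieval straightforward. The synonym table simply mirrors the three spellings named above, and the dataset IDs and records are invented for illustration; a real pipeline would draw on a curated ontology rather than a hand-written dictionary, and this is not a description of how Polly does it.

```python
# Hypothetical synonym table mirroring the spellings mentioned in the text.
# A production system would use a curated disease ontology instead.
SYNONYMS = {
    "acute myeloid leukemia": "acute myeloid leukemia",
    "aml": "acute myeloid leukemia",
    "myeloid leukemia": "acute myeloid leukemia",
}


def normalize_disease(label):
    """Map a free-text label to its canonical term, or None if unknown."""
    # Lowercase and collapse repeated whitespace before the lookup.
    key = " ".join(label.lower().split())
    return SYNONYMS.get(key)


def find_datasets(datasets, canonical_term):
    """Return IDs of datasets whose disease label maps to canonical_term."""
    return [
        d["id"]
        for d in datasets
        if normalize_disease(d["disease"]) == canonical_term
    ]


# Made-up records for illustration only.
datasets = [
    {"id": "DS1", "disease": "AML"},
    {"id": "DS2", "disease": "Acute Myeloid leukemia"},
    {"id": "DS3", "disease": "Breast cancer"},
    {"id": "DS4", "disease": "Myeloid  leukemia"},
]

aml_ids = find_datasets(datasets, "acute myeloid leukemia")
print(aml_ids)
```

Without the normalization step, a search for any single spelling would miss the other datasets, which is exactly the first pain point listed above.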

Identification of Regulatory Switches

Another success story comes from a collaboration between an early-stage pharma company and Elucidata. They approached us to study the effects of gene perturbation on cell fate and to identify regulatory switches in cell fate reprogramming. We identified two targets (which they later validated experimentally) that could regulate cell fate reprogramming. This was achieved in a relatively short span of six months!

They had 50-100 datasets that they wanted to work with. We curated these datasets, ensured that they were of high quality, and helped the company train a pre-existing model on this highly curated data. We achieved very good results in record time, and experimental data corroborated them. These outcomes are not a miracle. They are the result of careful data cleaning that lets the model perform to its fullest potential.

Now, let's look at a slightly different case:

Search for Relevant Datasets

Through a careful understanding of the data and PollyBERT, a model we developed iteratively with a data-centric approach, we have achieved human-level accuracy in tasks such as identifying cell type, cell line, tissue, and disease from biomedical abstracts available in public repositories like GEO.

This is a huge milestone, as we can substantially automate and streamline the search for datasets with minimal human intervention! Think of how many hours you can save if you can filter your dataset search by cell type, cell line, tissue, disease, drug, etc., and get relevant datasets within seconds! That is the power of curated data and a data-centric approach.

Relevance of Data-Centric Approach in the Life Sciences Domain

If you were to build an ML model for spam filtering in your inbox, you would have the advantage of billions of data points. Life sciences data, by contrast, is hard to find and highly variable. Real-world tasks include predicting rare events, such as which cell lines will respond to a drug or which patients will respond to a treatment. You will have only a few thousand data points, or maybe even fewer, to train a model. Though generating more data is important, it is often expensive, time-consuming, and sometimes not possible (such as in the case of rare disorders).

This is where the data-centric approach becomes very important. It essentially argues that the quality of data matters more than the quantity! In a data-centric approach, the choice of training data takes into account not only the domain-specific vocabulary but also factors such as proper labels, fixed ontologies, the recency of the research, the rationale for choosing a dataset, and the number of data points available. The model also receives timely feedback about data drift and concept drift. Hence it becomes possible to obtain high performance with fewer, highly curated datasets. At Elucidata, we have a team of curators and domain experts who collaborate closely with ML experts and engineers to ensure that the relevance and quality of the data used in our data-centric models meet our high standards.

Contact us if you want to take advantage of our data-centric platform, Polly, to find and analyze relevant datasets or learn more about using our 1.5 million curated datasets to train your models.