Model Improvement Giving Marginal Results? Look at the Data!
Data Science & Machine Learning

Deepti Das
January 19, 2022

“Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” — Clive Humby, 2006

Data Centricity to increase the accuracy of ML predictions: A paradigm shift

The traditional approach to machine learning (ML) was to curate the data to a machine-readable level, train a model, and then fine-tune the model to improve the accuracy of the results. Andrew Ng, a familiar name among ML enthusiasts, spearheaded the data-centric AI movement in 2021, stressing the need to shift from a model-centric approach to a data-centric one. According to him, data centricity is a mindset as much as it is a technical architecture. It acknowledges data’s valuable and versatile role in the ML pipeline. In contrast to the model-centric approach, a data-centric architecture is one where data exists independently of a singular application and can empower a broad range of stakeholders. This allows for greater opportunities in accelerating digital transformation: data can be more versatile, integrative, and available to those who need it.

Model centricity to Data centricity

Though data processing is of paramount importance in machine learning, it is often treated as a preliminary step to be completed before working on the ML algorithm. The focus is mostly on making the available data more machine-actionable. As a result, hundreds of hours are wasted tuning a model built on low-quality data. That is often why a model’s accuracy falls well short of expectations, and it has nothing to do with how well the model was tuned.
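To see why tuning cannot compensate for low-quality data, here is a minimal sketch using scikit-learn and synthetic data (the dataset and the 30% corruption rate are illustrative assumptions, not from the original post): the same model is trained once on corrupted labels and once on clean ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a curated biomedical dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Corrupt 30% of the training labels to mimic low-quality annotation.
rng = np.random.default_rng(0)
noisy = y_tr.copy()
flip = rng.choice(len(noisy), size=int(0.3 * len(noisy)), replace=False)
noisy[flip] = 1 - noisy[flip]

# Identical model family either way: only the training data differs.
model = LogisticRegression(max_iter=1000)
acc_noisy = model.fit(X_tr, noisy).score(X_te, y_te)
acc_clean = model.fit(X_tr, y_tr).score(X_te, y_te)
print(f"noisy labels: {acc_noisy:.2f}, clean labels: {acc_clean:.2f}")
```

No amount of hyperparameter tuning on the noisy copy changes the fact that the labels themselves limit what the model can learn; fixing the data is the higher-leverage move.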

“With the increasing power and availability of machine learning models, gains from model improvements have become marginal.” — Hazy Research, Stanford

Therefore, an improvement in current data practices is of paramount importance in building reliable machine learning products.

So, how can a data-centric approach improve predictive outcomes?

Once it sinks in that we need to focus on the data as much as, or perhaps more than, on the model, the question becomes what a data-centric approach involves. In this blog, we will use biomedical data as the domain where ML is applied.

Data centric approach

In a data-centric approach, the data is viewed through the lens of a domain expert as well as that of the ML expert. The ML expert takes the curated data, already in machine-readable/actionable form, uses it as training data, and trains the model to perform a specific function. The model-centric approach then improves the accuracy of predictions by improving the algorithm, whereas the data-centric approach improves the training data, iterating on data quality from the domain expert’s point of view, to increase the accuracy of results obtained by the same model. But what if our highly annotated dataset does not account for real-world variance?

The answer lies in how data curation is perceived.

Data curation is the work of organizing and managing a collection of datasets to meet the needs and interests of a specific group of people. It means different things to people from different domains. For an ML engineer, curated data equates to relevant data arranged in a specific, machine-readable/actionable format. For a domain expert, however, the quality of curated data hinges on very different aspects of the data (discussed below).

If the data used to train an algorithm is biased, it can lead to inaccurate predictions even if the model is highly advanced. This is the gap that a data-centric approach bridges. For example, when trying to predict a disease based on a particular gene expression, even if the training dataset has accurately annotated data from 10 females and 90 males, iterating on the model cannot ensure better performance. However, if we iterate on the data and correct this gender bias, we will probably get a better outcome.
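One simple way to iterate on such data is to rebalance the cohort before training. Here is a minimal sketch in pandas, assuming a hypothetical 90/10 male/female table like the one above (the column names and upsampling strategy are illustrative; in practice one might instead collect more data or reweight samples):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical cohort: 90 male and 10 female samples with a gene-expression value.
df = pd.DataFrame({
    "sex": ["M"] * 90 + ["F"] * 10,
    "expression": rng.normal(loc=1.0, scale=0.3, size=100),
})

def rebalance_by_group(frame, group_col, random_state=0):
    """Upsample every group to the size of the largest group."""
    target = frame[group_col].value_counts().max()
    parts = [
        grp.sample(n=target, replace=True, random_state=random_state)
        for _, grp in frame.groupby(group_col)
    ]
    return pd.concat(parts, ignore_index=True)

balanced = rebalance_by_group(df, "sex")
print(balanced["sex"].value_counts().to_dict())  # both groups now equal in size
```

The model is untouched; only the training data changes, which is exactly the data-centric iteration described above.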

A data-centric approach invokes critical thinking about data curation in terms of:

  1. Breadth of curation: Depending on the type of study, the relevant curation fields need to be worked on to optimize predictive outcomes. There is no one-size-fits-all solution.
  2. Noise in data: Background noise in the data could be a major obstacle for machine learning. For example, an untargeted metabolomics study is limited by the degree of confidence in the identification of detected signals. If a large proportion of the signals are non-reproducible noise in the dataset, such as adducts, contaminants, and artifacts, the accuracy of prediction will be lower.
  3. Number of data points: Most ML problems call for more and more data points to achieve higher statistical power, and a tradeoff is usually made against the time, effort, and expense of collecting more data. A data-centric approach calls for thinking critically about such tradeoffs. Moreover, some data, such as data on rare diseases, is much harder to find. In such cases it becomes vital to ensure that the collected data is representative of real-world heterogeneity.
  4. Missing data points and/or labels: Labels may be missing entirely, metadata collection may be inefficient, or the data extraction/engineering itself may be inefficient.
  5. Rationale for choosing a particular dataset/cohort: As mentioned earlier, a bias in the training data in terms of age, gender, ethnicity, etc. can adversely affect the predictive outcome of a model.
  6. Recency of the data: With the rapid growth of technology and the improvement of precision instruments, some experimental results become obsolete. Such results, even if properly annotated, can cause a dip in accuracy.
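Several of these checks can be made concrete as a quick audit over a dataset’s metadata. Here is a small sketch in pandas; the table and its column names (dataset_id, disease, sex, year) are hypothetical, not a real curation schema:

```python
import pandas as pd

# Hypothetical metadata table for a handful of biomedical datasets.
meta = pd.DataFrame({
    "dataset_id": ["d1", "d2", "d3", "d3", "d4"],
    "disease":    ["asthma", None, "asthma", "asthma", "copd"],
    "sex":        ["F", "M", "M", "M", "M"],
    "year":       [2009, 2021, 2020, 2020, 2019],
})

def curation_report(frame, label_col, group_col, year_col, min_year=2015):
    """Summarize the curation concerns listed above as simple counts."""
    return {
        "missing_labels": int(frame[label_col].isna().sum()),   # item 4
        "duplicates": int(frame.duplicated().sum()),            # data hygiene
        "group_counts": frame[group_col].value_counts().to_dict(),  # item 5
        "stale_records": int((frame[year_col] < min_year).sum()),   # item 6
    }

report = curation_report(meta, "disease", "sex", "year")
print(report)
```

A report like this does not fix anything by itself, but it turns the checklist above into numbers a curator can act on before any model is trained.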

At Elucidata, we believe that data curation is the core of the data-centric approach

We have an expert team of bioinformaticians, biologists, data curators, data scientists, and ML engineers who look at biomedical data holistically, help our clients with their specific research needs, and build custom curation pipelines to fast-track their research. With the understanding that high-quality, machine-actionable data is central to biomedical research, Polly, our cloud platform, hosts more than 1 million machine-actionable biomedical datasets, with the number growing every quarter. In biomedical research and drug discovery, every second counts. Armed with scientific expertise and over 1 million ML-ready datasets, we are ready to partner with stakeholders from the biomedical industry to help them reach their goals faster.

Elucidata is well equipped to help you accelerate your biomedical research! To learn more about our resources and services, write to us at info@elucidata.io

