Patterns in data are the fundamental building blocks for making predictions using machine learning (ML). In a Utopian world, good ML models enable an improved understanding of scientific phenomena. ML-based methods have been used to explore the inherent predictability of phenomena, especially for social outcomes, disease outbreaks, etc. Also, models with higher predictive accuracy can aid in the research and development of better diagnostic tools. It all sounds perfect, so what’s the catch?
The increased adaptation of ML in science brings a fair share of concerns about the reproducibility of the predicted results. Although ML-based science is still in its infant stage, technological advancements are experiencing rapid growth. But that may not necessarily be a good thing.
In a literature survey of 20 research papers that used ML methods, the authors found that 17 fields had errors that directly affected 329 papers, with wildly overoptimistic conclusions in a few cases.
An ML-based research finding can be deemed reproducible if you can run your model on certain datasets and obtain similar results repeatedly on a particular project. This process encompasses design, reporting, data analysis, and interpretation. It has been observed that slight data changes, different software environments or versions, and numerous other small variations can adversely affect model accuracy. Another issue is data leakage, where information from outside the training dataset is used to create the model. This additional information can allow the model to learn or know something that it otherwise would not know and generally leads to inflated estimates of model performance.
But all is not lost! There are steps that can be taken at both the code level and the data level to improve this and to bring prediction accuracy to near human levels, closer to the Utopian world.
Consistency, transparency, and collaboration are the key building blocks for a reliable and reproducible ML application. Following are a few steps that can be taken to achieve this:
1. Ensuring access to high-quality curated data: This is one of the most important areas in improving ML prediction accuracy and reproducibility. Issues with the quality of the dataset could affect the results of ML-based science. Data quality issues include not addressing missing values in the data, the small size of datasets compared to the number of predictors, and noise in the data. All this adversely affects the model training and, ultimately, its performance.
2. A clean separation between training and test dataset: If the training dataset is not separated from the
test dataset during all pre-processing, modeling, and evaluation steps, the model will have access to information in the test set before its performance is evaluated. This will set the model accuracy higher than the real value, and the results will not hold with a different test set.
3. Ensuring the use of legitimate features for the model to train on: If the model has access to features that should not be legitimately available for use in the modeling exercise, this could result in leakage. E.g. A recent study found that a research paper had included 'the use of anti-hypertensive drugs’ as a feature for predicting hypertension. However, this feature would not be available to the model when predicting the health outcome of a new patient. Including this feature made the prediction a trivial task and would not serve well in training the model. Hence, researchers must be mindful in choosing the features suitable for a modeling task and justify their choice using domain expertise.
4. Ensuring transparency and proper documentation: One crucial step in creating reproducible projects is ensuring that documentation starts on day one. The documentation process should describe/ explain the choices made, as well as all the important details needed to successfully execute the project. It should also track proposed hypotheses, experiments, and outcomes. Publishing raw code should be mandated for the ML model to be reproducible. If there’s a leak, it can be accurately identified and corrected before it can be used for analysis and deriving inferences.
You'd want to be able to run your ML models and still get the same results in the future. Someone else would also want to be able to do the same. Reproducibility opens up opportunities for collaborations, baseline evaluations, and newer experiments. Let us ensure reproducibility and support the future 'you' and many other researchers who build on your research!
Also Read: Elucidata’s unique data quality assurance program to know how we ensure the quality of its 1.5M datasets.
Get the latest insights on Biomolecular data and ML