Data Science & Machine Learning

Designing Data-centric ML Projects for Biomolecular Data

Shefali Lathwal, Kushal Shah
June 15, 2022

Introduction to Biomolecular Data

Over the last two decades, advances in engineering and sequencing technology have greatly increased the generation and availability of biological data. However, extracting useful insights from these data will require a major upgrade in our methods of data analysis. Biological systems are so complex that drawing actionable insights requires integrating biological and chemical data of different types and from different sources.

The biomolecular data within the scope of our discussion typically include the following:

  1. Measurements made through high-throughput omics technologies such as genomics, epigenomics, transcriptomics, metabolomics, proteomics, lipidomics, etc.
  2. Molecular measurements made through other high or low throughput assays in laboratories such as ELISA, genetic tests, gene or protein expression panels, etc.
  3. Structural data of biomolecules (such as DNA and proteins) and of chemicals (such as small- and large-molecule drugs).
  4. Drug metabolism and pharmacokinetic data.

These biomolecular data are routinely gathered and used alongside other types of data such as patient clinical records and imaging throughout life sciences, pre-clinical and clinical research.

Conventionally, these data have been analysed using various kinds of statistical algorithms, which are largely limited to fitting known functions to the data. Machine Learning (ML) takes this concept to a whole new level by allowing researchers to fit unknown functions to data in order to generalise the results to new data that the model has never seen before.
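The contrast between fitting a known function and fitting an unknown one can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic, the "known function" is a straight line, and k-nearest-neighbour regression stands in for an ML model that assumes no functional form. Both models are judged on data held out from fitting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the true relationship (a sine curve) is treated as unknown.
x = rng.uniform(0.0, 2.0 * np.pi, size=300)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)
x_train, y_train = x[:200], y[:200]
x_test, y_test = x[200:], y[200:]      # data the models never see during fitting

# "Known function": assume a straight line and estimate its two coefficients.
slope, intercept = np.polyfit(x_train, y_train, deg=1)
line_pred = slope * x_test + intercept

# "Unknown function": k-nearest-neighbour regression assumes no functional form.
def knn_predict(x_tr, y_tr, x_new, k=5):
    dists = np.abs(x_new[:, None] - x_tr[None, :])   # pairwise distances
    nearest = np.argsort(dists, axis=1)[:, :k]       # indices of the k closest points
    return y_tr[nearest].mean(axis=1)                # average their targets

knn_pred = knn_predict(x_train, y_train, x_test)

def mse(pred, truth):
    """Mean squared error on held-out data."""
    return float(np.mean((pred - truth) ** 2))

line_mse = mse(line_pred, y_test)
knn_mse = mse(knn_pred, y_test)
print(f"straight line test MSE: {line_mse:.3f}")
print(f"k-NN test MSE:          {knn_mse:.3f}")
```

On data like these, the flexible model should show the lower held-out error because the assumed straight line is the wrong functional form. When the functional form is known and correct, the simpler fit is usually the better choice.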

The Necessity and Adoption of Machine Learning (ML)

The use of Machine Learning in the pharmaceutical and biotech industries is growing, as evidenced by the increasing number of collaborations between big pharmaceutical companies and artificial intelligence companies across their pre-clinical and clinical programs, as shown in the figure below. The number of these partnerships has been increasing exponentially over the last 5 years, indicating increased trust in and adoption of AI tools in the industry, as well as recognition of the need for collaboration in the space.

Just last year, a biotech company, Evotec, announced that it was taking an anti-cancer drug to phase I clinical trials. The drug was developed with Exscientia, a company that uses Artificial Intelligence (AI) for the discovery of small molecule-based drugs. Several diagnostic tests that have been tested in trials or are under investigation use ML techniques to identify novel biomarkers. At Elucidata, we have used Machine Learning on multi-omics data to find multiple drug targets for Acute Myeloid Leukemia that have been validated experimentally.

Four-Fold Utility of ML Models for Biomolecular Analysis

For analysing biomolecular data, ML models prove to be very handy in four ways:

  1. Handling high-dimensional data
    Biomolecular data suffer from the curse of dimensionality: the number of features (genes, proteins, etc.) in a dataset is often much larger than the number of samples, which can severely degrade model performance. ML offers various dimensionality reduction techniques that bring the data down to a manageable size. Principal Component Analysis (PCA) and its more complex variants are very useful tools for this.
  2. Ability to fit unknown predictive functions to the data
    ML models allow a wide variety of unknown predictive functions to be fitted to the data just by a change of parameters, without changing the underlying model. This is especially true for Artificial Neural Networks, which can in principle approximate a very wide class of functions if a sufficient amount of data is available. Deep Learning techniques like auto-encoders go beyond conventional Machine Learning and can even extract abstract features from the given data. Auto-encoders were recently used in a paper published in Clinical Cancer Research to extract abstract features for predicting patient survival probabilities from multi-omics data.
  3. Generalising predictions on unknown data
    ML models help in generalising the model predictions and getting good results even for data that the model has not been trained on. This is the concept of generalisation in ML as compared to pure optimisation in conventional statistical techniques. In optimisation, the goal is to find the best model fit to the training data, whereas in generalisation a certain amount of degradation in training accuracy is allowed to enable better performance over new data. This is especially important in biology since datasets taken from different sources often have different distributions and properties, and so a model optimised for one dataset is unlikely to work for another one.
  4. Integrating diverse data types
    Machine Learning can be performed on a variety of unstructured and structured data types (including images, text, audio, video and tabular data). It enables the integration not just of different types of biomolecular data, but also of biomolecular data with text and imaging data, which is not possible with traditional statistical methods.
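As a minimal sketch of the dimensionality reduction described in point 1, PCA can be computed with a singular value decomposition. The matrix below is synthetic (20 samples, 500 features, driven by two hidden factors), and the helper name `pca_reduce` is our own for illustration, not a library function.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project samples (rows of X) onto the top principal components."""
    X_centered = X - X.mean(axis=0)                    # centre each feature
    # Thin SVD stays cheap when features far outnumber samples.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]    # sample coordinates
    explained = (S ** 2) / np.sum(S ** 2)              # variance fraction per component
    return scores, Vt[:n_components], explained[:n_components]

rng = np.random.default_rng(0)
n_samples, n_features = 20, 500                        # many more features than samples

# Synthetic "expression" matrix: two hidden factors drive all features, plus noise.
latent = rng.normal(size=(n_samples, 2))
loadings = rng.normal(size=(2, n_features))
X = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_features))

scores, components, explained = pca_reduce(X, n_components=5)
print(scores.shape)          # samples x retained components
print(explained.round(3))    # fraction of variance captured by each component
```

Because the synthetic data have only two underlying factors, the first two components should capture nearly all of the variance. Real omics data rarely separate this cleanly, so the number of components to keep is a judgment call.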

For the above reasons, Machine Learning is fast becoming an indispensable tool for the analysis of biomolecular data. However, it has also been a cautionary tale. A recent review of Machine Learning studies using COVID imaging data found that most had low-quality input data, biases in the data, or problems with the ML methodology, and so produced many models that were not reliable enough to apply in clinical settings. Another excellent review paper by Google explains that in many real-world applications outside of internet companies, where the available datasets are often small, the quality of data is more important than the quantity. Not paying attention to the data can lead not just to lost time and sunk cost, but to adverse real-world impact as well. Therefore, ML should not be thought of as a magic wand. Running a successful ML project requires you to rigorously assess your data as well as your methodology. The new paradigm of data-centric ML [1] is especially relevant here, since it enables accurate predictions with relatively small amounts of training data, as is often the case with applications using biomolecular data.

Bringing ML into Your Research

If you decide to start using Machine Learning in your research, we recommend a data-centric approach. This approach is most useful when the available dataset is relatively small and when systematic improvements in data quality, informed by domain knowledge, can considerably improve model performance.

In order to execute a data-centric ML project, we recommend the following steps:

  1. Study design
    Start with a clearly defined problem statement. Evaluate whether Machine Learning can help with this problem. If you are dealing with high-dimensional data, many different types of data, or data with high variability and unknown underlying structure, the answer will likely be yes. Also, define what ML approaches would be most suitable for this problem, i.e. Supervised Learning, Unsupervised Learning or a combination of both. In each case, clearly define your inputs, outcomes and the metrics you will use to evaluate the models.
  2. Select an initial data cohort
    Once your problem is well-defined, define inclusion criteria for the data that you will use and perform an audit of the available data. The inclusion criteria may include criteria on experiment design, availability of metadata, the type of measurements, quality of measurements etc. After this step, you should end up with an initial data cohort that will form the basis of your project. It is important to note that creating this data cohort should not be thought of as a one-time activity. As you gain more information throughout the project, be willing to modify your selection criteria and change your data cohort.
  3. Work on the data
    Once you have your initial data cohort, you need to do a deep dive into the data. If you have been working with biomolecular data, some of these steps will already be familiar to you. Ask whether all the samples are suitable for inclusion in your analysis. If the data come from multiple sources and studies, ask whether they can be combined: are they studying similar populations, have they been processed in a similar way, and what pre-processing would be needed before combining them? Finally, decide which data would be most appropriate for external validation, and how you should split the data for training, development, and validation.
  4. Work on the data some more
    Determine what pre-processing steps (scaling, normalisation, dimensionality reduction, feature selection, feature transformation etc.) are required to prepare the data for training.
  5. Train the model, perform error analysis, tweak the data
    Start with the simplest possible model and iterate on the training data while evaluating the performance using the chosen metric. Perform error analysis to understand where the model is going wrong and instead of jumping to a different model to improve performance, ask whether there are gaps in your data and how you can modify the data to improve performance.
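The loop in steps 3 to 5 can be sketched on synthetic data. All specifics below are assumptions for illustration: three hypothetical studies, one of which carries a constant batch offset, a nearest-centroid classifier as the "simplest possible model", and per-study mean centring as a deliberately crude stand-in for batch correction. Centring like this assumes the class balance is similar across studies.

```python
import numpy as np

def make_study(rng, n_per_class, offset):
    """Synthetic study: class 1 is shifted on 3 features; `offset` mimics a batch effect."""
    y = np.repeat([0, 1], n_per_class)
    X = rng.normal(scale=0.5, size=(2 * n_per_class, 10))
    X[y == 1, :3] += 2.0
    return X + offset, y

def nearest_centroid_predict(X_train, y_train, X_test):
    """Simple baseline: assign each test sample to the closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

def leave_one_study_out(studies):
    """Hold out each study in turn and report accuracy per held-out study."""
    results = {}
    for held_out in studies:
        X_tr = np.vstack([X for name, (X, y) in studies.items() if name != held_out])
        y_tr = np.concatenate([y for name, (X, y) in studies.items() if name != held_out])
        X_te, y_te = studies[held_out]
        pred = nearest_centroid_predict(X_tr, y_tr, X_te)
        results[held_out] = float((pred == y_te).mean())
    return results

rng = np.random.default_rng(0)
studies = {
    "study_A": make_study(rng, 20, offset=0.0),
    "study_B": make_study(rng, 20, offset=0.0),
    "study_C": make_study(rng, 20, offset=5.0),   # processed differently from A and B
}

# Error analysis: break performance down by data source, not just overall.
raw_scores = leave_one_study_out(studies)

# Data fix, not a model change: centre each study on its own feature means.
centred = {name: (X - X.mean(axis=0), y) for name, (X, y) in studies.items()}
centred_scores = leave_one_study_out(centred)

for name in studies:
    print(name, "raw:", raw_scores[name], "centred:", centred_scores[name])
```

Splitting by study rather than by sample keeps the held-out data independent of the training sources, which is closer to external validation. In this toy setting the per-study breakdown exposes the batch offset, and fixing the data improves the unchanged model.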

The Future of ML with Biomolecular Data

Mathematical modelling of biology and making predictions about the behaviour of biological systems have been hard due to the extremely high level of complexity and large variability in system behaviour. This is unlike non-living physical systems, where at least the equations governing the system behaviour are usually well known. This is where Machine Learning can make a huge difference! Machine Learning is not a magic wand, but it has a lot of predictive power if used judiciously with a data-centric approach. ML algorithms can learn patterns in the data that may be too complex for the human mind to decipher. However, collecting and manually labelling the large amount of biomolecular data required to train the usual ML algorithms is both expensive and time-consuming. Therefore, for most practical problems, scientists have to find ways to work with a relatively small amount of data, roughly a few hundred to a thousand data points. Even in other ML domains like language and image processing, we have seen the necessity of a data-centric approach despite the availability of large pre-trained models that enable predictions with relatively small amounts of task-specific training data. Hence, we expect research groups that follow a data-centric approach for ML applications in biology to far outshine groups that do not. This will also usher in a new data-driven approach to science!

[1] These data-centric techniques focus on systematically improving the data to boost the performance of established prediction models. In addition to general statistical methods of improving data, domain knowledge integration is also an essential part of data improvement when working with biomolecular data.
