Data Science & Machine Learning

The Power of Automated Curation for Bulk RNA-seq Data

Kriti Srivastava, Deepthi Das
December 27, 2022

Bulk RNA-seq is a powerful sequencing approach for identifying and characterizing gene expression levels, differential gene expression, alternative splicing events, non-coding RNA molecules, and gene fusion events, and for improving gene annotation. Even though public repositories like GEO, ENA, etc., house enormous volumes of sequencing data, efficient data reuse is still a problem. Unstructured dataset information, inconsistent quality control and data processing, and inconsistent probe-gene mappings are all significant issues. As a result, thorough curation and data processing are required and should be mandated before downstream use.

A well-defined curation process must be followed for consuming, storing, documenting, and sharing a resource. As the publicly available RNA-seq data is likely to be in the range of tens of terabytes or more, it is important to automate parts of the curation process to scale up data processing and make it available for downstream usage faster.

“Technology, through automation and artificial intelligence, is definitely one of the most disruptive sources.”—Alain Dehaze


In this blog, we take you through the advantages of an accurate automated curation process and how it can be achieved.

Why Do We Need Automated Curation?

The following are some advantages that an automated curation process offers:

1. Accuracy: The ML models are trained on manually curated data that is carefully examined before training. Because the training data is thoroughly reviewed, the automated annotations the models produce are accurate.

2. Speed: Automated curation by ML models is far faster than manual curation.
For instance, manually curating 10,000 datasets across six different fields would take 30 curators about one week, whereas automatically curating the same 10,000 datasets requires only 1.5 to 2 days (see the throughput sketch after this list).

3. Scalability: The ML model is scalable and can run on any number of datasets, and additional curation requests can be made at any time.

4. Consistency: Manual curators must adhere to a set of rules while curating, and these rules may be interpreted differently depending on each curator's knowledge. Automating the process makes curation more consistent: the ML models apply the same established rules uniformly every time.
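As a rough back-of-the-envelope check on the speed figures in point 2 above, here is a minimal Python sketch; the per-curator rate is derived from the quoted numbers, not independently measured.

```python
# Back-of-the-envelope throughput comparison using the figures quoted above
# (10,000 datasets, 6 fields, 30 curators, ~1 week manual vs. ~2 days automated).

datasets = 10_000
fields_per_dataset = 6
annotations = datasets * fields_per_dataset  # 60,000 annotations in total

curators = 30
manual_days = 7
manual_rate = annotations / (curators * manual_days)  # ~286 annotations/curator/day

auto_days = 2
auto_rate = annotations / auto_days  # ~30,000 annotations/day for the pipeline

print(f"Manual: ~{manual_rate:.0f} annotations per curator per day")
print(f"Automated: ~{auto_rate:,.0f} annotations/day, ~{manual_days / auto_days:.1f}x faster end to end")
```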

Automated Curation of Metadata

Let’s take a look at how the curation process can be automated accurately through an MLOps platform. Here, we take the example of Elucidata’s Polly platform. Polly is a data-centric MLOps platform for biomedical data that provides access to FAIR (Findable, Accessible, Interoperable, and Reusable) multi-omics data from public and proprietary sources. It is built on top of NLP-based AI models and generates harmonized metadata annotations with scientific context at an accuracy matching that of human experts. Polly’s curation infrastructure, PollyBERT, enriches the metadata we access from various data sources. The model is trained on ~17 billion words and has ~660 million parameters.

On Polly, an ML-based curation pipeline uses the principle of named entity normalization to identify metadata for datasets. This helps recognize biological terms written in plain language within the appropriate dataset context. We then attach a standard ontology term to each biological entity that our model recognizes. As a result, two datasets with comparable biological references, published by Authors A and B in different research articles on the same subject, will share related ontology terms that allow them to be grouped together on Polly. Furthermore, disease, tissue, and other ontologies remain consistent regardless of the source you use, be it LINCS, GTEx, or GEO, making pertinent datasets instantly searchable.
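To illustrate the normalization step, here is a minimal dictionary-based sketch; the synonym table and the helper function are illustrative stand-ins for Polly's NLP-based models, not their actual implementation.

```python
# Minimal sketch of named entity normalization: map free-text biological terms
# to standard ontology identifiers. The synonym table is illustrative, not
# Polly's actual vocabulary.

SYNONYM_TO_ONTOLOGY = {
    "nsclc": ("MONDO:0005233", "non-small cell lung carcinoma"),
    "non-small cell lung cancer": ("MONDO:0005233", "non-small cell lung carcinoma"),
    "lung": ("UBERON:0002048", "lung"),
    "nhbe": ("CVCL_XXXX", "NHBE cell line"),  # hypothetical Cellosaurus-style ID
}

def normalize_entity(term: str):
    """Return (ontology_id, canonical_label) for a recognized term, if known."""
    return SYNONYM_TO_ONTOLOGY.get(term.strip().lower())

# Two authors describing the same disease in different words map to one ID,
# so their datasets can be grouped together.
print(normalize_entity("NSCLC"))                       # ('MONDO:0005233', ...)
print(normalize_entity("non-small cell lung cancer"))  # same ontology ID
```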

Our curation model goes one step further: every sample in a dataset is tagged with multiple ontology terms that can be used to search for those samples.

For example, a sample annotated as Normal, Lung, and NHBE cell line in any Polly dataset would be searchable with any or all three ontology terms.

Users could potentially build a global ‘normal lung’ dataset from multiple uniformly processed studies to use as a control. They could then compare it to lung cell lines treated with Rapamycin across multiple datasets on Polly, once again searching with sample-level ontologies.
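As a sketch of how such a cross-study query might look, assuming sample-level tags are exported to a table (the column names, dataset IDs, and values below are hypothetical, not Polly's actual schema):

```python
import pandas as pd

# Hypothetical sample-level annotation table; on Polly these tags would come
# from the automated curation pipeline.
samples = pd.DataFrame({
    "dataset_id": ["GSE0001", "GSE0001", "GSE0002", "GSE0003"],
    "sample_id":  ["S1", "S2", "S3", "S4"],
    "disease":    ["Normal", "Normal", "Normal", "Lung Adenocarcinoma"],
    "tissue":     ["Lung", "Lung", "Lung", "Lung"],
    "cell_line":  ["NHBE", "NHBE", "NHBE", "A549"],
    "drug":       [None, None, "Rapamycin", "Rapamycin"],
})

# Build a global "normal lung" control set spanning multiple studies.
control = samples[(samples.disease == "Normal") & (samples.tissue == "Lung")]

# Find Rapamycin-treated lung samples to compare against the control set.
treated = samples[(samples.tissue == "Lung") & (samples.drug == "Rapamycin")]

print(control[["dataset_id", "sample_id"]])
print(treated[["dataset_id", "sample_id"]])
```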

We use NLP models to extract keywords from the available metadata (abstract, experiment design, sample metadata tables) and normalize these keywords to specific ontology systems. GEOTron is one such curation pipeline that uses NLP models in combination with rule-based approaches to generate metadata tags for all GEO datasets.
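Below is a minimal sketch of the keyword extraction step using a simple phrase matcher; the production pipelines use trained NLP models (e.g., PollyBERT), so this rule-based stand-in only illustrates the extract-then-normalize flow, and the term lists are illustrative.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Lightweight stand-in for the NLP extraction step: scan free-text metadata
# (abstract, experiment design) for known biomedical terms. Production models
# learn these entities rather than matching fixed lists.
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("TISSUE", [nlp.make_doc(t) for t in ["lung", "liver", "bronchial epithelium"]])
matcher.add("DRUG", [nlp.make_doc(t) for t in ["rapamycin", "cisplatin"]])

abstract = "NHBE cells from bronchial epithelium were treated with Rapamycin."
doc = nlp.make_doc(abstract)

for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]
    print(label, "->", doc[start:end].text)
# TISSUE -> bronchial epithelium
# DRUG -> Rapamycin
```

Each extracted keyword would then be passed through the normalization step sketched earlier to attach a standard ontology ID.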

Figure: A comparison of Polly's model accuracy with that of other state-of-the-art models and human experts.

On Polly, each dataset is available with six standard harmonized metadata fields, each mapped to a specific biomedical ontology: Organism, Disease, Tissue, Cell Line, Cell Type, and Drug. These fields are also mapped at the sample level, and additional metadata fields are available at both the dataset and sample level.
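A minimal sketch of what such a harmonized record might look like; the fields follow the six standard fields above, but the class itself is illustrative, not Polly's actual data model.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative container for the six standard harmonized fields described above.
@dataclass
class HarmonizedMetadata:
    organism: Optional[str] = None   # e.g., NCBI Taxonomy term
    disease: Optional[str] = None    # e.g., MONDO/MeSH term
    tissue: Optional[str] = None     # e.g., UBERON term
    cell_line: Optional[str] = None  # e.g., Cellosaurus term
    cell_type: Optional[str] = None  # e.g., Cell Ontology term
    drug: Optional[str] = None       # e.g., ChEBI/DrugBank term

# The same shape applies at both the dataset and the sample level.
dataset_level = HarmonizedMetadata(organism="Homo sapiens", tissue="lung")
sample_level = HarmonizedMetadata(organism="Homo sapiens", tissue="lung",
                                  cell_line="NHBE", drug="Rapamycin")
```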

Figure: A sneak peek of the bulk RNA-seq data on Polly.

Automated Curation of Custom Metadata

For each research question, users need information from different perspectives, and the standard metadata fields covering the most common search terms (cell type, disease, tissue, etc.) might not be enough to make the relevant datasets findable. This is where the curation of custom metadata fields holds great value. Automating this process is a bit more challenging but is possible, as described below.

In order to automate curation, the content that will be saved in the custom field, as well as the input and output for the curation model, must be defined first. For example, for cancer stage, we begin by selecting our ontologies (TNM, Number Stage, and Cancer Grades). Example metadata, gathered after choosing the ontologies, is defined as the model's input, and tags from the three ontologies as its output.

After that, it's time to look for patterns in the metadata so that regular expressions can be built to extract the necessary data. For instance, if a metadata field labeled cancer stage or tumor stage contains T, N, or M followed by an integer, we can write a rule that extracts that portion and labels it as a cancer stage (see the sketch below). We then generate many such rules and aggregate them into a model for automated curation. The model is evaluated against manually curated data to gauge its performance before it is put to actual use.
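Here is a minimal sketch of one such rule for the TNM example, together with the evaluation step against manually curated labels; the field names, the pattern, and the toy data are all assumptions for this sketch.

```python
import re

# Illustrative rule: a cancer-stage-like field containing T, N, or M followed
# by an integer (e.g., "T2 N0 M0") is treated as a TNM cancer stage label.
TNM_PATTERN = re.compile(r"\b([TNM])\s?(\d[a-c]?)\b", re.IGNORECASE)
STAGE_FIELDS = {"cancer stage", "tumor stage"}

def extract_cancer_stage(field_name: str, value: str):
    """Apply the rule only to fields labeled as a cancer/tumor stage."""
    if field_name.strip().lower() not in STAGE_FIELDS:
        return None
    matches = TNM_PATTERN.findall(value)
    return ["".join(m).upper() for m in matches] or None

print(extract_cancer_stage("tumor stage", "T2 N0 M0"))  # ['T2', 'N0', 'M0']

# Evaluate the aggregated rules against manually curated labels before use.
manual = {"s1": ["T2", "N0", "M0"], "s2": None}
predicted = {"s1": extract_cancer_stage("tumor stage", "T2 N0 M0"),
             "s2": extract_cancer_stage("treatment", "Rapamycin 10 nM")}
accuracy = sum(manual[k] == predicted[k] for k in manual) / len(manual)
print(f"Agreement with manual curation: {accuracy:.0%}")
```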

The quantity of both public and private multi-omics datasets has grown at an unprecedented rate over the past two decades. Our curation efforts are just a starting point for helping the community adopt a data-first strategy for AI/ML approaches to drug development.
