Faster Insights on Omics Data Signatures with Polly Discover

Yogesh Lakhotia, Omnya Mohamed Izzeldin
February 12, 2024
What are the upregulated and downregulated genes in response to a treatment?
Are there specific gene signatures associated with disease subtypes or stages?
How are signalling pathways affected by genetic mutations?
How does your in-house data compare with the publicly available data?

These are fundamental questions when researching gene expression data to identify candidate genes and biomarkers associated with diseases. However, addressing these questions using public databases is highly non-trivial. Data quality and variability remain persistent concerns due to variations in experimental protocols, sample sizes, and platform differences. These factors introduce noise and bias, akin to searching for a needle in a haystack when attempting to find and extract meaningful gene signatures from the available data.

Challenges While Exploring Public Bulk RNA-seq Data

Bioinformaticians face several challenges when exploring publicly available bulk RNA-seq data. These challenges arise from the complexity and volume of the data, as well as the need to ensure data quality and extract meaningful biological insights. A few notable roadblocks include:

  • Data Heterogeneity: Publicly available RNA-seq data often come from different laboratories, platforms, and experimental conditions. This heterogeneity makes it difficult to compare and integrate datasets effectively.
  • Inconsistency in Data Quality and Preprocessing: For instance, GEO (Gene Expression Omnibus) includes a multitude of gene expression profiles from various experiments, platforms, and sources. Of these, only 2.9% of the records (or studies in layman’s terms) have been curated retrospectively. As a result, researchers must apply rigorous quality control measures and preprocessing steps to make the data suitable for analysis.
  • Lack of Transparency: Inadequate documentation and clarity in data processing and analysis pipelines pose challenges to the interpretation, optimization, and comparability of RNA-seq data across studies, potentially undermining its reliability and utility in scientific research.

Our Solution: Polly

Elucidata's data harmonization platform, Polly, tackles the challenges of data heterogeneity in open-source databases by integrating and standardizing diverse datasets. Polly ensures data quality through rigorous preprocessing and provides transparent documentation of the analysis pipelines, enabling researchers to derive reliable insights efficiently.

How does Polly make Data ML-ready?
  • Metadata Harmonization and Data Standardization: Polly's harmonization engine standardizes and harmonizes metadata on samples and experimental conditions.
  • Stringent Quality Checks in Data Ingestion and Processing: Rigorous quality checks during the data ingestion and processing stages identify and rectify errors or anomalies.
  • Customizable Processing: Data processing pipelines can be tailored to meet the unique requirements of different research projects and applications.
  • Ensuring Transparency in the End-to-End Process: The steps, parameters, and methods applied to the process are documented, which facilitates understanding and reproducibility of analyses.
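
As a rough illustration of what metadata harmonization involves, the sketch below maps free-text disease annotations onto a small controlled vocabulary. The synonym table, the record fields, and the `harmonize_sample` helper are all hypothetical; this is not Polly's harmonization engine, only a minimal example of the idea.

```python
# Hypothetical synonym table. A real pipeline would more likely map labels
# to ontology terms; these entries are assumptions for illustration only.
DISEASE_SYNONYMS = {
    "uc": "ulcerative colitis",
    "ulcerative colitis": "ulcerative colitis",
    "colitis, ulcerative": "ulcerative colitis",
    "cd": "crohn's disease",
    "crohn disease": "crohn's disease",
    "crohn's disease": "crohn's disease",
    "healthy": "normal",
    "control": "normal",
    "normal": "normal",
}


def harmonize_sample(raw):
    """Return a copy of a sample record with a standardized 'disease' field."""
    label = raw.get("disease", "").strip().lower()
    harmonized = dict(raw)
    # Unknown labels are flagged for review rather than guessed.
    harmonized["disease"] = DISEASE_SYNONYMS.get(label, "unmapped")
    return harmonized


# Made-up sample records with inconsistent annotations.
samples = [
    {"id": "S1", "disease": "UC "},
    {"id": "S2", "disease": "Healthy"},
    {"id": "S3", "disease": "IBD-U"},
]
harmonized = [harmonize_sample(s) for s in samples]
for record in harmonized:
    print(record["id"], record["disease"])
```

Flagging unmapped labels instead of guessing keeps the curation step auditable, which is in line with the transparency goal described above.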

These high-quality datasets form a solid foundation for extracting relevant molecular signatures. For further exploration and analysis of these signatures, the platform also provides Polly Discover.

What is Polly Discover?

Polly Discover is an analysis module on the platform, to help users extract, find, and explore biologically important signatures from relevant curated datasets, as well as comparisons (of cohorts) within datasets. The module provides interactive visualizations that facilitate the interpretation of expression results. Users can enhance these results by incorporating existing knowledge bases and integrating them into meta-analysis methods, machine learning applications, and other tools. For those seeking more advanced visualizations, the data can be streamed to tools like Spotfire using APIs.

Polly Discover - Key Features

  • High-quality metadata curation customized to research needs. Human-readable comparison names are grouped into appropriate categories to ease findability.
  • Full control over the data processing pipelines used. Ensure all data is comparable with in-house findings.
  • 360-degree findability journeys (based on genes, pathways, and other metadata fields) to search across public and in-house data.
  • Fast turnaround times / predictable delivery timelines with tech-enabled processes.
  • Discover robust and consistent gene expression signatures across various comparisons.
  • Integrate with other open-source knowledge bases seamlessly to enrich signatures.
Polly Discover Workflow

Use Case: Finding the Gene Signatures Associated with Ulcerative Colitis in a Few Clicks.

A researcher studying ulcerative colitis aimed to identify specific gene signatures linked to the disease. By comparing their in-house bulk RNA-seq data with publicly available information, they sought to validate their findings and pinpoint potential targets with greater confidence.

To begin, datasets from sources such as GEO and ArrayExpress were audited to find all ulcerative colitis-related datasets, which were then stored in an Atlas. Both public and in-house data were processed using the same pipeline, enabling users to generate and compare insights from both seamlessly.

With Polly Discover,

  • The datasets were deeply curated with the Polly Harmonization Engine to make the following key fields available to users: disease, tissue, drug, cell line, cell type, mouse/rat strain, experimental factors, comparison types, etc. This curation enabled users to find relevant curated datasets within minutes.
  • Each dataset was carefully curated to identify relevant groups and suitable comparisons. For instance, within the GSE112057 dataset, comparisons included Crohn’s Disease vs. normal, Crohn’s Disease vs. colitis, and Polyarticular Arthritis vs. colitis, among others. Using DESeq2, differentially expressed genes and enriched pathways from MSigDB for each of these comparisons are already precomputed and stored in Polly’s Atlas. This streamlined approach makes it convenient and efficient to identify gene signatures and grasp the functional significance of these differentially expressed genes.
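
To make the idea of an enriched pathway concrete, the sketch below runs a simple over-representation test using a hypergeometric tail probability. This is an illustrative stand-in under stated assumptions: it is not DESeq2, and the source does not say which enrichment method Polly uses. The gene counts are invented toy numbers.

```python
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): chance of seeing at least k pathway genes among n
    differentially expressed genes, given K pathway genes in a background
    of N genes."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers (assumptions, not results from any dataset):
N = 1000  # genes in the background
K = 20    # genes annotated to the pathway
n = 50    # differentially expressed genes
k = 5     # differentially expressed genes that fall in the pathway

p = hypergeom_pvalue(k, K, n, N)
print(f"over-representation p-value: {p:.4g}")
```

With these toy numbers only about one pathway gene would be expected by chance, so observing five gives a small p-value. In practice the p-values would also be corrected for the number of pathways tested.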

In this case study, we picked 5 datasets in which ulcerative colitis samples are compared with normal samples. Here’s how one dataset can be explored with Polly Discover on Polly:

Metadata Curation on Polly

A curated comparison makes it possible to identify genes known to be biologically relevant to ulcerative colitis. In this dataset, the comparison covers 55 control samples and 43 perturbation samples and yields 837 upregulated genes.

Visualize curated comparisons within the dataset

The differentially expressed genes in the dataset can be analyzed further with a volcano plot, which shows each gene's log fold change against its p-value. The gene list can be downloaded and compared with in-house proprietary bulk RNA-seq data for validation.
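
The logic behind a volcano plot and the follow-up comparison can be sketched as follows. Each gene is placed at (log2 fold change, -log10 p-value) and labeled up, down, or not significant. The cutoffs, gene names, and values are assumptions chosen for illustration; they are not taken from Polly or from any dataset in this article.

```python
import math

LFC_CUTOFF = 1.0  # assumed |log2 fold change| threshold
P_CUTOFF = 0.05   # assumed p-value threshold


def volcano_point(log2fc, pvalue):
    """Return (x, y, status) for one gene on a volcano plot."""
    y = -math.log10(pvalue)
    if pvalue <= P_CUTOFF and log2fc >= LFC_CUTOFF:
        status = "up"
    elif pvalue <= P_CUTOFF and log2fc <= -LFC_CUTOFF:
        status = "down"
    else:
        status = "ns"  # not significant
    return log2fc, y, status


# Made-up public results: gene -> (log2 fold change, p-value).
public_results = {
    "GENE_A": (2.3, 1e-8),
    "GENE_B": (-1.7, 1e-4),
    "GENE_C": (0.2, 0.6),
    "GENE_D": (1.5, 0.2),
}
points = {gene: volcano_point(*vals) for gene, vals in public_results.items()}

# Compare the public upregulated list with a hypothetical in-house list.
public_up = {gene for gene, (_, _, status) in points.items() if status == "up"}
in_house_up = {"GENE_A", "GENE_D"}
validated = public_up & in_house_up
print(sorted(validated))
```

Genes that are upregulated in both the public comparison and the in-house list are the ones that carry over as validated candidates.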

Volcano Plot

In-house findings can be validated more robustly by cross-comparing log fold change (logFC) values across the 5 datasets. Consistent patterns of gene expression change across datasets help researchers identify more reliable gene signatures associated with ulcerative colitis.
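
A minimal sketch of this cross-dataset comparison is shown below, assuming each dataset is available as a mapping from gene to logFC. The dataset names, gene names, values, and the logFC cutoff are hypothetical, and three toy datasets stand in for the five used in the case study.

```python
def consistent_upregulated(datasets, lfc_cutoff=1.0):
    """Return genes whose logFC meets the cutoff in every dataset."""
    gene_sets = [
        {gene for gene, lfc in results.items() if lfc >= lfc_cutoff}
        for results in datasets.values()
    ]
    return set.intersection(*gene_sets) if gene_sets else set()


# Made-up logFC values per dataset (assumptions for illustration).
datasets = {
    "dataset_1": {"GENE_A": 2.1, "GENE_B": 1.4, "GENE_C": 0.3},
    "dataset_2": {"GENE_A": 1.8, "GENE_B": 0.9, "GENE_C": 1.2},
    "dataset_3": {"GENE_A": 2.6, "GENE_B": 1.7},
}

signature = consistent_upregulated(datasets)
print(sorted(signature))
```

A gene missing from a dataset is treated as not upregulated there, which is a conservative choice; a looser rule, such as upregulated in most datasets, would be a reasonable alternative.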

Upregulated genes across the datasets

Notably, all genes consistently demonstrate similar expression patterns across the various studies.

Upregulated genes across 5 datasets for the comparison 'Ulcerative Colitis vs. Normal'.

This approach adds strength to the results by demonstrating the consistency of gene expression patterns across diverse studies conducted by different groups, even in the presence of heterogeneity in experimental conditions, data sources, and time points regarding Ulcerative Colitis.

With Polly Discover, identifying the genes common to all curated datasets takes only about a minute. Further analysis can be done using open-source tools like GOProfiler, NetworkAnalyst, Cytoscape, etc.

Downstream steps and tools:

  • Functional relevance of these gene sets (GOProfiler): pathways impacted by the set of consistently upregulated genes.
  • Drug repurposing (NetworkAnalyst): drugs that can be used for a given gene target.
  • Gene signaling regulation (NetworkAnalyst).

Using DisGeNET, the researchers identified the genes most commonly mutated in individuals with ulcerative colitis, namely NOD2, ATG16L1, IL23R, ABCB1, TNFSF15, STAT3, NR1I2, and TLR4. Their objective was to explore where these genes are differentially expressed across biological conditions. With Polly Discover, they found 99 distinct comparisons across biological conditions in which these genes were differentially expressed.
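
Conceptually, a gene set search of this kind checks each stored comparison for overlap with the query genes. The sketch below uses the gene list named above, but the comparison index, its names, and the `search_comparisons` helper are hypothetical; it does not reflect Polly Discover's actual API or data.

```python
QUERY_GENES = {
    "NOD2", "ATG16L1", "IL23R", "ABCB1", "TNFSF15", "STAT3", "NR1I2", "TLR4",
}

# Hypothetical index: comparison name -> differentially expressed genes.
comparison_index = {
    "comparison_1": {"STAT3", "GENE_X"},
    "comparison_2": {"GENE_Y"},
    "comparison_3": {"NOD2", "TLR4", "GENE_Z"},
}


def search_comparisons(query, index):
    """Return comparisons with at least one query gene, plus the matches."""
    return {
        name: sorted(query & de_genes)
        for name, de_genes in index.items()
        if query & de_genes
    }


hits = search_comparisons(QUERY_GENES, comparison_index)
for name, genes in hits.items():
    print(name, genes)
```

Because differential expression is precomputed per comparison, a search like this reduces to set lookups, which is why it can return results quickly.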

A gene set search on Polly Discover

Impact

1. By using Polly Discover, the researcher was able to validate the in-house findings of their study on ulcerative colitis, saving 70% of the time required by traditional methods.

2. The researcher efficiently identified gene signatures and enriched pathways associated with the disease, enhancing their understanding of ulcerative colitis.

3. With a few clicks, the researcher identified 99 distinct comparisons across biological conditions showing differential expression of the key genes most commonly mutated in individuals with ulcerative colitis.

Conclusion

Polly Discover on Elucidata's Polly simplifies the complexities of transcriptomics data analysis, providing researchers with a one-stop solution. By addressing challenges in publicly available RNA-seq data, Polly Discover ensures high-quality, harmonized data for efficient exploration.

The use case explored here, finding genes associated with ulcerative colitis, illustrates how Polly Discover works in practice. Through Polly's harmonization engine, researchers can compare in-house bulk RNA-seq data with public data, ensuring high confidence in target identification. The platform's curated datasets, comparisons, and precomputed gene signatures streamline the process, offering efficient data exploration.

Connect with us or reach out to us at info@elucidata.io to learn more.
