What are the upregulated and downregulated genes in response to a treatment?
Are there specific gene signatures associated with disease subtypes or stages?
How are signalling pathways affected by genetic mutations?
How does your in-house data compare with the publicly available data?
These are fundamental questions when researching gene expression data to identify candidate genes and biomarkers associated with diseases. However, addressing these questions using public databases is highly non-trivial. Data quality and variability remain persistent concerns due to variations in experimental protocols, sample sizes, and platform differences. These factors introduce noise and bias, akin to searching for a needle in a haystack when attempting to find and extract meaningful gene signatures from the available data.
Bioinformaticians face several challenges when exploring publicly available bulk RNA-seq data. These challenges arise from the complexity and volume of the data, as well as the need to ensure data quality and extract meaningful biological insights. A few notable roadblocks include:
Elucidata's data harmonization platform, Polly, tackles the challenges of data heterogeneity in open-source databases by integrating and standardizing diverse datasets. Polly ensures data quality through rigorous preprocessing and provides transparent documentation of the analysis pipelines, enabling researchers to derive reliable insights efficiently.
These high-quality datasets form a solid foundation for extracting relevant molecular signatures. For further exploration and analysis of these signatures, the platform also provides Polly Discover.
Polly Discover is an analysis module on the platform that helps users find, extract, and explore biologically important signatures from relevant curated datasets, as well as from comparisons (of cohorts) within datasets. The module provides interactive visualizations that facilitate the interpretation of expression results. Users can enhance these results by incorporating existing knowledge bases and integrating them into meta-analysis methods, machine learning applications, and other tools. For those seeking more advanced visualizations, the data can be streamed to tools like Spotfire using APIs.
A researcher studying ulcerative colitis aimed to identify specific gene signatures linked to the disease. By comparing their in-house bulk RNA-seq data with publicly available information, they sought to validate their findings and pinpoint potential targets with greater confidence.
To begin, datasets from sources such as GEO and ArrayExpress were audited to find all ulcerative colitis-related datasets, which were then stored in an Atlas. Both public and in-house data were processed using the same pipeline, enabling users to generate and compare insights from the two seamlessly.
With Polly Discover,
In this case study, we picked 5 datasets in which ulcerative colitis samples are compared with normal samples. Here’s how one dataset can be explored with Polly Discover on Polly:
A curated comparison study helps identify genes that are known to be biologically relevant to ulcerative colitis. In this dataset, there are 55 control samples and 43 perturbation samples, with 837 upregulated genes.
The differentially expressed genes in the dataset can be analyzed further with a volcano plot, which shows each gene's log fold change against its p-value. The gene list can be downloaded and compared with in-house proprietary bulk RNA-seq data for validation.
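As a minimal sketch of the filtering behind a volcano plot, the snippet below labels genes as up, down, or not significant and then intersects the public upregulated set with an in-house list. The gene names, values, and cutoffs (`fc_cutoff`, `p_cutoff`) are hypothetical placeholders, and the code uses plain Python rather than any Polly API.

```python
import math


def classify_gene(log_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Label a gene the way a volcano plot typically colors it."""
    if p_value < p_cutoff and log_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log_fc <= -fc_cutoff:
        return "down"
    return "ns"  # not significant at the chosen cutoffs


def volcano_coordinates(log_fc, p_value):
    """Return the (x, y) point for a gene: logFC against -log10(p-value)."""
    return log_fc, -math.log10(p_value)


# Hypothetical rows (gene, logFC, p-value); placeholders, not real results.
public_results = [
    ("GENE_A", 2.4, 0.0001),
    ("GENE_B", -1.8, 0.003),
    ("GENE_C", 0.3, 0.40),
    ("GENE_D", 1.5, 0.20),
]
# Hypothetical in-house upregulated genes.
in_house_up = {"GENE_A", "GENE_D"}

labels = {gene: classify_gene(fc, p) for gene, fc, p in public_results}
public_up = {gene for gene, label in labels.items() if label == "up"}

# Genes upregulated in both the public comparison and the in-house data.
validated = sorted(in_house_up & public_up)
print(labels)
print(validated)
```

In practice the cutoffs would be chosen to match the thresholds used for the in-house analysis, so that both gene lists are comparable.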
In-house findings can be validated more robustly by cross-comparing log fold change (logFC) values across the 5 datasets. Genes whose expression changes in a consistent direction across datasets form more reliable gene signatures associated with ulcerative colitis.
Notably, all genes consistently demonstrate similar expression patterns across the various studies.
This approach strengthens the results by demonstrating that gene expression patterns in ulcerative colitis are consistent across studies conducted by different groups, despite heterogeneity in experimental conditions, data sources, and time points.
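One simple way to express this cross-dataset check is to keep only genes that appear in every dataset and whose logFC has the same sign in all of them. The sketch below assumes each dataset has been exported as a gene-to-logFC mapping; the dataset names, gene names, and values are hypothetical, and sign agreement is one possible consistency criterion rather than the method used on the platform.

```python
def direction(log_fc):
    """Return +1 for upregulated, -1 for downregulated, 0 for no change."""
    return (log_fc > 0) - (log_fc < 0)


def consistent_genes(log_fc_by_dataset):
    """Map each gene found in every dataset to 'up' or 'down' when its
    logFC has the same nonzero sign in all of them."""
    tables = list(log_fc_by_dataset.values())
    if not tables:
        return {}
    # Genes present in every dataset.
    shared = set(tables[0]).intersection(*tables[1:])
    result = {}
    for gene in sorted(shared):
        signs = {direction(table[gene]) for table in tables}
        if signs == {1}:
            result[gene] = "up"
        elif signs == {-1}:
            result[gene] = "down"
    return result


# Hypothetical logFC values for three datasets; placeholders, not real results.
log_fc_by_dataset = {
    "dataset_1": {"GENE_A": 2.1, "GENE_B": -1.4, "GENE_C": 0.8},
    "dataset_2": {"GENE_A": 1.7, "GENE_B": -0.9, "GENE_C": -0.5},
    "dataset_3": {"GENE_A": 2.9, "GENE_B": -2.2, "GENE_D": 1.1},
}

signature = consistent_genes(log_fc_by_dataset)
print(signature)
```

Requiring presence in every dataset is deliberately strict; a looser rule, such as agreement in a majority of datasets, would trade reliability for a larger signature.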
With Polly Discover, identifying common genes across all curated datasets is a quick task. Further analysis can be done using open-source tools like GOProfiler, NetworkAnalyst, Cytoscape, etc.
Using DisGeNET, researchers identified the genes most commonly mutated in individuals with ulcerative colitis, namely NOD2, ATG16L1, IL23R, ABCB1, TNFSF15, STAT3, NR1I2, and TLR4. Their objective was to find where these genes are differentially expressed across various biological conditions. With Polly Discover, they could search for and discover 99 distinct comparisons across biological conditions in which these genes were differentially expressed.
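The idea of searching comparisons for a gene panel can be sketched as a filter over comparison records. The gene panel below is the one named above; the comparison records, their values, and the cutoffs are hypothetical placeholders, and the record layout is an assumption for illustration, not the platform's data model.

```python
# Gene panel taken from the text above.
PANEL = ["NOD2", "ATG16L1", "IL23R", "ABCB1", "TNFSF15", "STAT3", "NR1I2", "TLR4"]


def find_comparisons(comparisons, panel, p_cutoff=0.05, fc_cutoff=1.0):
    """Return {comparison id: panel genes that are differentially expressed}.

    Each comparison is assumed to hold a 'results' mapping of
    gene -> (logFC, p-value).
    """
    hits = {}
    for comparison in comparisons:
        results = comparison["results"]
        de_genes = sorted(
            gene
            for gene in panel
            if gene in results
            and results[gene][1] < p_cutoff
            and abs(results[gene][0]) >= fc_cutoff
        )
        if de_genes:
            hits[comparison["id"]] = de_genes
    return hits


# Hypothetical comparison records; values are placeholders, not real results.
comparisons = [
    {"id": "comparison_1", "results": {"NOD2": (1.6, 0.001), "STAT3": (0.4, 0.01)}},
    {"id": "comparison_2", "results": {"TLR4": (-1.2, 0.20)}},
    {"id": "comparison_3", "results": {"IL23R": (-2.0, 0.002), "TLR4": (1.3, 0.03)}},
]

hits = find_comparisons(comparisons, PANEL)
print(hits)
```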
1. By using Polly Discover, the researcher was able to validate the in-house findings of their study on ulcerative colitis, saving 70% of the time required by traditional methods.
2. The researcher efficiently identified gene signatures and enriched pathways associated with the disease, enhancing their understanding of ulcerative colitis.
3. With a few clicks, researchers swiftly identified 99 distinct comparisons across biological conditions showing the varied expression of key genes that are predominantly mutated in individuals with ulcerative colitis.
Polly Discover on Elucidata's Polly simplifies the complexities of transcriptomics data analysis, providing researchers with a one-stop solution. By addressing challenges in publicly available RNA-seq data, Polly Discover ensures high-quality, harmonized data for efficient exploration.
The use-case of Polly Discover is exemplified in a scenario involving the exploration of genes associated with ulcerative colitis. Through Polly's harmonizing engine, researchers can compare in-house bulk RNA-seq data with public data, ensuring high confidence in target identification. The platform's curated datasets, comparisons, and precomputed gene signatures streamline the process, offering efficient data exploration.