Polly's Bulk OmixAtlases: The Effortless Dataset Discovery and Retrieval Platform

April 20, 2023

The variety and volume of data being produced by biological research hold tremendous potential for reuse and drug discovery, but the data is scattered across multiple, disparate sources and lacks standardization. Thus, the availability of data does not equate to easy usability. Researchers and data scientists find it laborious to obtain accurate, high-quality RNA-seq data from public repositories.

Are you also looking for a one-click solution for RNA-seq data discovery and retrieval? Look no further. This blog discusses Elucidata’s Polly, a biomedical data platform that helps you do exactly that. Polly is a data-centric MLOps platform that provides access to FAIR (Findable, Accessible, Interoperable, and Reusable) multi-omics data from public and proprietary sources.

Public Sources for Gene Expression Data:

Many public repositories host high-throughput gene expression data submitted by the research community, along with other forms of high-throughput functional genomics data, such as studies that examine genome methylation, chromatin structure, and genome–protein interactions.

Some of these sources are listed below:

  • Gene Expression Omnibus (GEO) by NCBI is a commonly used source for gene expression data, including RNA-seq data. It has a collection of datasets from a wide range of studies and organisms.
  • Sequence Read Archive (SRA) is a repository for high-throughput sequencing data, including RNA-seq data, and is maintained by NCBI.
  • ArrayExpress is a repository for gene expression data, including RNA-seq data, and is maintained by the European Bioinformatics Institute (EBI).
  • Genomic Data Commons (GDC) portal by NCI is cancer-specific and hosts data from a wide variety of cancer types.
  • European Nucleotide Archive (ENA) is a repository for raw and processed sequencing data, including RNA-seq data, and is maintained by the EBI.

What is GEO?

GEO is the most widely used repository for finding RNA-seq data because of the sheer volume of data it holds. In this section, we discuss the challenges associated with finding data on GEO and with using the data itself.

Challenges Faced While Using GEO:

1. Findability: Challenges Faced While Searching for Datasets on GEO

The metadata on GEO does not follow a particular ontology, so the same concept may be described with different terms across studies. To improve search results, you often need to search for the synonyms, acronyms, and abbreviations of your keyword of interest as well.
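
As a rough illustration of the synonym problem, the sketch below combines several spellings of one concept into a single boolean search string. The function name `build_geo_query`, the example terms, and the optional `[Organism]` field tag are assumptions made for illustration; check the GEO search documentation for the field tags it actually supports before relying on them.

```python
def build_geo_query(terms, organism=None):
    """Combine synonyms and abbreviations into one OR-joined search string.

    Hypothetical helper for illustration; it only builds text and does not
    contact GEO.
    """
    # Drop blanks and case-insensitive duplicates while keeping input order.
    seen = set()
    cleaned = []
    for term in terms:
        stripped = term.strip()
        if stripped and stripped.lower() not in seen:
            seen.add(stripped.lower())
            cleaned.append(stripped)
    if not cleaned:
        raise ValueError("at least one search term is required")

    # Quote each term so multi-word phrases are matched as phrases.
    query = "(" + " OR ".join(f'"{t}"' for t in cleaned) + ")"

    # Assumed field tag: restrict results to one organism if requested.
    if organism:
        query += f' AND "{organism}"[Organism]'
    return query


# Example terms chosen by hand; a real list would come from the researcher.
synonyms = [
    "non-alcoholic fatty liver disease",
    "nonalcoholic fatty liver disease",
    "NAFLD",
]
print(build_geo_query(synonyms, organism="Homo sapiens"))
```

The resulting string can be pasted into the GEO search box. The point is that the researcher has to assemble the synonym list manually, which is the burden that curated, ontology-backed metadata is meant to remove.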

2. Usability: Pain Points Associated with Data on GEO

  1. Lack of annotation: GEO data is often poorly annotated, making it difficult to interpret and compare results across different studies. The variety of file formats and data structures in use also makes it time-consuming to extract and analyze the data.
  2. Data inconsistency and lack of data quality: Data on GEO is generated by many different laboratories using different methods and platforms, which can result in variable data quality and inconsistencies between datasets.

Polly’s OmixAtlas aims to address these issues by ensuring that the metadata from different data types and across different data sources are curated and harmonized.

OmixAtlas: Find and Use GEO Datasets Readily on Polly

OmixAtlas is the data warehouse on Polly that provides access to a large number of curated RNA sequencing studies. It is a collection of millions of datasets from public, proprietary, and licensed sources that have been curated, harmonized, and made ready for downstream machine learning and analytical applications. There are essentially two data types: bulk RNA-seq data and single-cell RNA-seq data, grouped under the Bulk RNA-seq OmixAtlas and the Single-cell OmixAtlas, respectively.

Bulk RNA-seq OmixAtlas on Polly

The Bulk RNA-seq OmixAtlas is a curated data resource that enables researchers to analyze gene expression across various cell types and tissues.

  • Biomedical data in one place: Bulk OmixAtlas provides access to over 40,700 datasets.
  • Frequent updates: The data on Bulk OmixAtlas is updated frequently to keep pace and stay in sync with the source repositories wherever applicable.
  • ML-ready data made available through curation: Data from diverse sources have been structured, and the metadata is harmonized with controlled vocabularies and ontologies. Curated metadata makes datasets searchable and findable.
  • Integrability: OmixAtlas allows the querying of data through point-and-click solutions. Code-based advanced access enables users to access data on Polly or outside of Polly on any platform of choice.
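
To make the idea of harmonizing metadata with a controlled vocabulary concrete, here is a minimal sketch that maps free-text tissue labels to one canonical term each. The vocabulary, the sample labels, and the function name `harmonize_tissue` are invented for illustration and do not represent Polly's actual curation pipeline, which is not described in this post.

```python
# Hypothetical controlled vocabulary: canonical term -> known free-text variants.
TISSUE_VOCAB = {
    "liver": {"liver", "liver tissue", "hepatic tissue"},
    "lung": {"lung", "lungs", "pulmonary tissue"},
}

# Invert the vocabulary so each variant points to its canonical term.
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in TISSUE_VOCAB.items()
    for variant in variants
}


def harmonize_tissue(raw_label):
    """Return the canonical tissue term, or None if the label is unknown."""
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw_label.lower().split())
    return VARIANT_TO_CANONICAL.get(key)


# Labels as they might appear in sample metadata from different labs.
raw_labels = ["Liver", "hepatic  tissue", "LUNGS", "kidney"]
for label in raw_labels:
    print(f"{label!r} -> {harmonize_tissue(label)!r}")
```

Once every sample carries the same canonical term, a single search for that term retrieves all matching datasets, regardless of how each submitting lab originally described the tissue. Unknown labels return None here so that they can be flagged for manual review.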

Downstream Usage of Datasets from Bulk RNA-seq OmixAtlas: Explore Datasets Visually Using Phantasus

Go one step further and readily visualize the curated GEO datasets using Phantasus.

Phantasus is a user-friendly web application for interactive gene expression analysis. It simplifies data analysis by offering a seamless approach, from loading, normalizing, and filtering the data to performing differential gene expression and downstream analysis.

The highly curated datasets on Polly allow seamless integration of the Phantasus app, and data can be analyzed readily without the need for preprocessing. Any dataset can be opened on this application on Polly, and a corresponding heatmap will appear.

Polly hosts the world’s largest collection of highly curated, ML-ready bulk and single-cell RNA-seq data. Our curation pipelines, high-quality and accurately annotated data, standard workflows, and scientific expertise are used by industry and academia across the globe to accelerate their drug discovery process. Reach out to us to learn more about how to accelerate your research!

