Polly vs Recount3 - Comparing Findability for RNA-seq Data

Jayashree
March 7, 2023

RNA sequencing (RNA-seq) is a method for investigating the transcriptome that has progressed significantly since its introduction, becoming a leading approach in transcriptome profiling. RNA-seq data is now used in many areas of research and disease treatment. However, the findability, usability, quality, and reliability of this data have long been problematic for researchers and data scientists.

Though this is a very niche space, multiple platforms are being developed to make RNA-seq data available, and they do so with varying degrees of efficiency. Here we compare two such platforms operating in a similar space.

In this blog, we compare Elucidata’s MLOps platform, Polly, and the online resource Recount3 as sources of uniformly processed and annotated RNA-seq data.

What is Polly?

Polly is a data-centric MLOps platform that hosts FAIR (Findable, Accessible, Interoperable, and Reusable) multi-omics data from public and proprietary sources. Specific ETL pipelines called Connectors handle data ingestion and harmonization. Polly’s curation infrastructure is built on a specialized BERT model, PollyBERT, that assists with metadata annotation.

What is Recount3?

Recount3 is an online resource of uniformly processed RNA-seq data. It provides RNA-seq gene, exon, and exon-exon junction counts, as well as coverage bigWig files, for 8,679 human and 10,088 mouse studies. It is the third generation of the ReCount project and part of recount.bio. The raw sequencing data is processed with the Monorail system, which generates the coverage bigWig files and the recount-unified text files. Furthermore, snapcount enables query-based access to the recount3 and recount2 data.

Polly vs Recount3: A Comparison

Let us dive deeper into understanding how these platforms work with the help of a few examples.

1. Querying Efficiency

Querying at the GUI level for transcriptomics datasets on neurodegenerative diseases in humans.

| PARAMETERS | POLLY | RECOUNT3 |
| --- | --- | --- |
| # of datasets found | ~7.5k | 42 |
| Time taken | Seconds | Seconds |

Pros (Polly):
  • An elaborate set of filters improves dataset findability.
  • The search automatically expands to include diseases related to the term "neurodegenerative disease", such as Alzheimer's disease (AD), Huntington's disease, and Parkinson's disease.

Pros (Recount3):
  • All datasets have associated data files that are ready to download.
  • A single search returns results from all 3 of its sources.
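
To make the idea of automatic search expansion concrete, here is a minimal sketch in Python. It is not Polly's implementation: the disease hierarchy, the dataset records, and the function names are all hypothetical, chosen only to show how a parent term can be expanded to its related diseases before matching.

```python
# Toy disease hierarchy (illustrative only, not Polly's actual ontology).
CHILDREN = {
    "neurodegenerative disease": [
        "alzheimer's disease",
        "huntington's disease",
        "parkinson's disease",
    ],
    "parkinson's disease": ["juvenile parkinson's disease"],
}

# Hypothetical dataset records with a curated 'disease' field.
DATASETS = [
    {"id": "DS1", "disease": "alzheimer's disease"},
    {"id": "DS2", "disease": "breast cancer"},
    {"id": "DS3", "disease": "juvenile parkinson's disease"},
    {"id": "DS4", "disease": "neurodegenerative disease"},
]


def expand_term(term, children=CHILDREN):
    """Return the term together with all of its descendants in the hierarchy."""
    expanded, stack = set(), [term.lower()]
    while stack:
        current = stack.pop()
        if current not in expanded:
            expanded.add(current)
            stack.extend(children.get(current, []))
    return expanded


def search(term, datasets=DATASETS):
    """Return IDs of datasets annotated with the term or any related disease."""
    terms = expand_term(term)
    return [d["id"] for d in datasets if d["disease"] in terms]


if __name__ == "__main__":
    print(search("Neurodegenerative disease"))
```

A search for the parent term picks up datasets annotated with any of its child diseases, which is why a single query can return far more hits than an exact string match would.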


Querying programmatically for Alzheimer's disease datasets with normal and patient samples.

| PARAMETERS | POLLY | RECOUNT3 |
| --- | --- | --- |
| # of datasets found | 10 | 42 |
| Time taken | <5 minutes | >45 minutes |

Cons (Polly): -

Cons (Recount3):
  • Recount3 does not offer programmatic querying capabilities across datasets.
  • To identify relevant datasets for the query above, the user needs to use the study explorer and read through the dataset descriptions to find datasets with both normal and disease samples.
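
The kind of programmatic filter this query needs can be sketched as follows. This is an illustration under assumptions, not the Polly-python API: the sample records, field names, and the `datasets_with_case_and_control` helper are hypothetical. It only shows why standardized sample-level metadata turns "datasets with both normal and patient samples" into a few lines of code instead of a manual read-through.

```python
from collections import defaultdict

# Hypothetical sample-level metadata; field names and values are illustrative.
SAMPLES = [
    {"dataset": "DS1", "disease": "alzheimer's disease"},
    {"dataset": "DS1", "disease": "normal"},
    {"dataset": "DS2", "disease": "alzheimer's disease"},
    {"dataset": "DS3", "disease": "normal"},
    {"dataset": "DS3", "disease": "parkinson's disease"},
]


def datasets_with_case_and_control(samples, disease, control="normal"):
    """Return IDs of datasets containing both disease and control samples."""
    labels = defaultdict(set)
    for sample in samples:
        labels[sample["dataset"]].add(sample["disease"])
    required = {disease, control}
    return sorted(ds for ds, seen in labels.items() if required <= seen)


if __name__ == "__main__":
    print(datasets_with_case_and_control(SAMPLES, "alzheimer's disease"))
```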

2. How Easy Is It to Find Relevant Data?

  • The Polly GUI and Polly-python both support keyword searches through metadata fields that follow a standard ontology across all datasets. They also allow free-text searches.
  • Recount3 does not allow searching datasets programmatically using keyword searches. There is no metadata curation, hence the metadata ontologies differ across datasets, making it difficult to find similar datasets using a single keyword.
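
The effect of differing vocabularies can be shown with a small sketch. The synonym table, dataset labels, and function names below are hypothetical and stand in for a real curation step. The point is only that a single keyword misses datasets when labels are left as submitted, and finds them once labels are mapped to one standard term.

```python
# Hypothetical synonym table mapping submitter labels to one standard term.
SYNONYMS = {
    "alzheimers": "alzheimer's disease",
    "ad": "alzheimer's disease",
    "alzheimer disease": "alzheimer's disease",
}

# Hypothetical uncurated labels, as different submitters might write them.
RAW_LABELS = {
    "DS1": "Alzheimers",
    "DS2": "AD",
    "DS3": "Alzheimer disease",
    "DS4": "Parkinson's disease",
}


def normalize(label):
    """Map a free-text label to its standard term (or itself if unknown)."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def raw_search(keyword, labels=RAW_LABELS):
    """Case-insensitive substring match against the labels as submitted."""
    return sorted(ds for ds, label in labels.items()
                  if keyword.lower() in label.lower())


def curated_search(keyword, labels=RAW_LABELS):
    """Match after normalizing both the keyword and the labels."""
    target = normalize(keyword)
    return sorted(ds for ds, label in labels.items()
                  if normalize(label) == target)


if __name__ == "__main__":
    print("raw:", raw_search("Alzheimer's disease"))
    print("curated:", curated_search("Alzheimer's disease"))
```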

3. How Easy Is It to Access the Data?

  • The Polly platform and Polly-python are proprietary software that require a license, so the data hosted on the platform is also accessible only with a license.
  • On Recount3, the data and the R package are open source, so no authorization is required to access the data.

4. How Easy Is It to Integrate the Data with Other Data and Interoperate with Applications or Workflows for Analysis, Storage, and Processing?

  • On Polly, all datasets are processed through the same kallisto-based pipeline, and the dataset metadata follow a standard ontology, allowing easy comparison of datasets. They can be easily used with downstream analysis packages.
  • On Recount3, all datasets are processed through the same Monorail pipeline and can be easily used with downstream analysis packages. However, the lack of standard ontology is a big problem since you might miss out on data just because of the difference in vocabulary.
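
As a rough illustration of why uniform processing helps integration, the sketch below combines two expression matrices on their shared gene identifiers. The data layout (a dict of gene to per-sample counts), the gene and sample names, and the `merge_on_shared_genes` helper are all assumptions for this example. Neither platform is claimed to expose data in this form.

```python
def merge_on_shared_genes(study_a, study_b):
    """Combine two {gene: {sample: count}} matrices, keeping shared genes only."""
    shared = sorted(set(study_a) & set(study_b))
    return {gene: {**study_a[gene], **study_b[gene]} for gene in shared}


# Hypothetical count matrices from two studies run through the same pipeline,
# so gene identifiers are directly comparable.
STUDY_A = {
    "GENE_1": {"A1": 10, "A2": 12},
    "GENE_2": {"A1": 0, "A2": 3},
}
STUDY_B = {
    "GENE_1": {"B1": 7},
    "GENE_3": {"B1": 5},
}

if __name__ == "__main__":
    print(merge_on_shared_genes(STUDY_A, STUDY_B))
```

When datasets come from different pipelines or annotations, this simple join is not enough, since the identifiers and quantification units may not line up.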

Data FAIRness Comparison    

Ratings out of 5

| PARAMETERS | POLLY | RECOUNT3 |
| --- | --- | --- |
| Findability | ⭐️⭐️⭐️⭐️⭐️ | ⭐️⭐️ |
| Accessibility | ⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️⭐️ |
| Interoperability | ⭐️⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️ |
| Reusability | ⭐️⭐️⭐️⭐️ | ⭐️⭐️⭐️⭐️ |

Comparing Data Availability and Usability  

  • Comparing the volume, variety, and sources of the data
  • Comparing dataset processing



Datatypes
  • Polly: Bulk RNA-seq, scRNA-seq
  • Recount3: Microarray, bulk RNA-seq, scRNA-seq (Smart-seq platform only)

Data sources
  • Polly: GEO, SCP, Human Cell Atlas, Single Cell Expression Atlas, Tabula Sapiens, HTAN, Zenodo, COVID-19 Cell Atlas
  • Recount3: SRA, GTEx, TCGA

Data volume
  • Polly: 735,914 RNA-seq samples across all organisms; 41,638 datasets from human and mouse studies; new datasets are added regularly.
  • Recount3: 750,000 human and mouse RNA-seq samples; 18,767 datasets from human and mouse studies; new datasets are not added regularly.

Organisms
  • Polly: Human, mouse, rat, primates, and other organisms
  • Recount3: Human and mouse only

Data processing
  • Polly: All bulk RNA-seq datasets with available raw data are processed through a kallisto-based pipeline, following published best practices for RNA-seq. Datasets without available raw data are ingested as is from the source, with metadata standardization.
  • Recount3: All datasets are processed through its distributed processing system, Monorail. Monorail uses STAR and related tools to summarize expression at the gene and exon levels (annotation-dependent), to detect and report exon-exon splice junctions, and to summarize coverage along the genome as a bigWig file.

While both platforms are great sources of processed RNA-seq data, it helps to take a closer look at how each would serve particular users. Researchers and scientists need to keep up with emerging data without spending a lot of time finding the relevant datasets. Metadata backed by standard ontologies is therefore preferable, because it enables better search and findability. We hope this blog helps users make an informed choice between these platforms.

If you are spending time scouring datasets just to find the relevant ones for downstream analysis, now is the time to reach out. Connect with us to learn more about how to accelerate your research.

FAQs

What are the key benefits of using Polly for gene target prioritization in patient stratification?

  • Data-Driven Target Selection: Polly integrates multi-omics data to identify key genes relevant to patient subgroups.
  • Accelerated Drug Discovery: The platform prioritizes targets based on disease associations and biomarker relevance, expediting the discovery and validation process.
  • Improved Reproducibility: Harmonized datasets ensure reliable and reproducible findings for target validation.

How does Polly help in training classifier models for patient stratification?

Polly provides pre-processed, harmonized datasets that enable AI/ML model training for patient classification. It supports feature selection, dimensionality reduction, and validation workflows to build robust predictive models for precision medicine applications.

How does Polly assist in defining genetic signatures for different stages of cell differentiation?

Polly analyzes both single-cell and bulk multi-omics data to identify stage-specific genetic markers. By applying machine learning algorithms to detect patterns in gene expression, Polly helps researchers map lineage differentiation and gain insights into disease progression.

What is the process of creating a disease-specific atlas using Polly’s harmonization engine?

Polly builds disease-specific atlases by:

  1. Aggregating multi-omics datasets from curated sources.
  2. Harmonizing data using standardized ontologies.
  3. Annotating datasets with clinical metadata.
  4. Structuring the information into disease-specific cohorts for targeted biomarker and therapeutic research.

How does Polly integrate multiple data types for more reliable patient stratification?

Polly integrates genomics, transcriptomics, proteomics, and clinical data into a unified, multi-dimensional view of patient populations. This helps researchers uncover complex biological relationships and enhances predictive modeling for patient subgroups.

Can Polly handle data quality issues and unstructured data from public repositories?

Yes, Polly automatically processes raw, unstructured data from public sources, addressing missing values, batch effects, and inconsistencies. Its machine learning–driven pipelines filter out noise and standardize data, ensuring higher-quality datasets for seamless analysis.

How does Polly harmonize multi-omic datasets to improve the quality of patient stratification?

Polly's harmonization engine normalizes, processes, and integrates diverse datasets using standard ontologies and metadata frameworks. This ensures consistency, removes batch effects, and enhances the reliability of downstream analyses for precise patient classification.

How does Elucidata's Polly help in overcoming the challenges of patient stratification?

Polly streamlines patient stratification by:

  • Harmonizing and Integrating Multi-omics Data: Polly standardizes data across different sources, making it analysis-ready.
  • Curating High-quality Datasets: The platform ensures datasets are clean, structured, and well-annotated, thereby improving the reliability of downstream analyses.
  • Enabling AI-driven Insights: Polly applies machine learning models to uncover patterns and classify patients effectively.
  • Ensuring Reproducibility and Scalability: Automated pipelines and version-controlled workflows allow for efficient scaling to large datasets while maintaining detailed records of each analysis step, making it easier to reproduce or modify results.

What challenges do researchers face when performing patient stratification using multi-omics data?

Researchers encounter several challenges, including:

  • Data Heterogeneity: Multi-omics data come from different platforms, making integration complex.
  • Data Quality Issues: Public datasets often contain missing values, noise, or inconsistencies.
  • Computational Complexity: Large-scale multi-omics data require significant computational power and expertise to process.
  • Interpretability: Even with powerful analytical methods, extracting clear and meaningful biological insights from high-dimensional data remains a significant challenge.

What is patient stratification, and why is it important for precision medicine?

Patient stratification is the process of categorizing patients into subgroups based on genetic, molecular, or clinical characteristics. This approach is crucial for precision medicine because it identifies which patient populations are most likely to respond to specific treatments, thereby improving therapeutic outcomes and reducing the risk of adverse effects.

What are the key advantages of using Polly for transcriptome profiling and biomarker identification?

Polly provides access to a curated repository of RNA-seq datasets that are consistently processed and enriched with metadata. This harmonization allows researchers to efficiently search for datasets with similar transcriptional profiles, facilitating transcriptome profiling and biomarker identification.

What methodologies does Polly use to identify synergistic drug combinations?

Polly utilizes signature reversal and multivariate gene expression signatures to predict potential drug combinations. By analyzing publicly available transcriptomics data and drug signatures, Polly can identify drugs or compounds that may have therapeutic effects by reversing disease signatures.
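
The idea of signature reversal can be sketched with a deliberately simple score. This is an assumption-laden illustration, not Polly's method: it uses a sign-based score (the fraction of shared genes whose direction of change under a drug is opposite to the disease), and the gene names, drug names, and values are hypothetical.

```python
def reversal_score(disease, drug):
    """Fraction of shared, non-zero genes that the drug moves in the
    opposite direction to the disease signature (1.0 = full reversal)."""
    shared = [g for g in disease
              if g in drug and disease[g] != 0 and drug[g] != 0]
    if not shared:
        return 0.0
    opposite = sum(1 for g in shared if disease[g] * drug[g] < 0)
    return opposite / len(shared)


# Hypothetical log fold changes: positive = up-regulated, negative = down.
DISEASE = {"G1": 2.0, "G2": -1.5, "G3": 1.0, "G4": -0.5}
DRUGS = {
    "drug_x": {"G1": -1.0, "G2": 1.2, "G3": -0.3, "G4": 0.4},
    "drug_y": {"G1": 1.0, "G2": -1.0, "G3": -0.5, "G4": -0.2},
}


def rank_drugs(disease, drugs):
    """Order drug names from strongest to weakest reversal of the signature."""
    return sorted(drugs, key=lambda d: reversal_score(disease, drugs[d]),
                  reverse=True)


if __name__ == "__main__":
    for name in rank_drugs(DISEASE, DRUGS):
        print(name, reversal_score(DISEASE, DRUGS[name]))
```

Practical approaches typically weight genes by effect size and test significance. The sign-based score here is the simplest version of the concept.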

How does Polly rank datasets similar to a gene signature query?

Polly ranks similar datasets using cosine similarity scores, which measure how closely a dataset's transcriptional profile matches the query signature. This helps researchers quickly find relevant datasets for further analysis and validation.
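
A minimal sketch of ranking by cosine similarity is shown below. The signature format (a dict of gene to score), the dataset names, and the choice to treat genes missing from one profile as zero are assumptions made for this example, not details of Polly's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two {gene: value} profiles.
    Genes missing from one profile are treated as 0 (an assumption)."""
    genes = set(u) | set(v)
    dot = sum(u.get(g, 0.0) * v.get(g, 0.0) for g in genes)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_datasets(query, profiles):
    """Return dataset IDs ordered from most to least similar to the query."""
    return sorted(profiles,
                  key=lambda ds: cosine_similarity(query, profiles[ds]),
                  reverse=True)


# Hypothetical query signature and dataset profiles.
QUERY = {"G1": 1.0, "G2": -1.0, "G3": 0.5}
PROFILES = {
    "DS1": {"G1": 2.0, "G2": -2.0, "G3": 1.0},    # same direction as the query
    "DS2": {"G1": -1.0, "G2": 1.0, "G3": -0.5},   # opposite direction
    "DS3": {"G1": 1.0, "G2": 1.0, "G3": 0.0},     # unrelated pattern
}

if __name__ == "__main__":
    for ds in rank_datasets(QUERY, PROFILES):
        print(ds, round(cosine_similarity(QUERY, PROFILES[ds]), 3))
```

Because cosine similarity depends only on the direction of the profile, a dataset with the same pattern at a different magnitude still scores highly.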

What steps are involved in creating a query gene signature on Polly?

Researchers define the biological process of interest, select a dataset, preprocess the data, identify differentially expressed genes, and validate the signature. Polly’s platform streamlines this process with expert support and ML-ready datasets.

How does Polly's RNA-Seq Atlas simplify gene signature analysis?

Polly's RNA-Seq Atlas addresses challenges in extracting associated signatures from public databases by providing a curated resource of RNA-seq datasets collected from the Gene Expression Omnibus (GEO). This richly curated resource helps researchers to find datasets with similar transcriptional profiles to their gene sets of interest.

What is gene signature comparison, and why is it important in drug discovery?

Gene signature comparison analyzes gene expression patterns to identify disease-related signatures. It helps researchers find drugs that can reverse disease signatures, aiding in therapeutic discoveries.