Single-cell RNA sequencing (scRNA-seq) technology has come a long way since its first successful implementation in 2009. It has provided unprecedented insights into biological processes and improved our understanding at the cellular level. Over the last decade, a variety of scRNA-seq technologies have been developed, revolutionizing sample collection, single-cell capture, barcoded reverse transcription, cDNA amplification, library preparation, sequencing, and bioinformatics analysis.
As more scRNA-seq methods have appeared, the volume of data generated has grown in proportion. The tools and techniques for analyzing single-cell RNA sequencing data, however, have not advanced at the same pace.
In this blog, we’ll take a look at some of the tools and formats that are available for single-cell RNA sequencing and single-cell omics data, why H5ad is a preferred format, and why the Python-Anndata-H5ad ecosystem is widely adopted.
Currently, the Gene Cluster Text (GCT) file format (“.gct”) from the Broad Institute is one of the most widely used formats for storing processed gene expression data and metadata. However, GCT is not well suited to higher-dimensional data such as scRNA-seq: the matrices are sparse (most entries are zero) and the sample count (the number of cells captured) is much higher, which makes a dense text format impractical for single-cell omics data. To address this, research groups have tried to solve the on-disk data storage problem for single-cell data with a few formats. Some of the most commonly used formats are:
The format that comes closest to wide adoption as a persistent on-disk storage format is H5ad. H5ad is based on the standard H5 format, that is, the Hierarchical Data Format (HDF5), which stores large amounts of data as multidimensional arrays. The H5 format is primarily used for scientific data that needs to be well organized for quick retrieval and analysis. A host of interactive Python tools can process, analyze, and visualize data in the H5ad format (Scanpy, MUON, STREAM, and others), and this plays a major role in the wide adoption of the format. Once data is stored as H5ad, the anndata data structure and the Python-Anndata-H5ad ecosystem are used to support and consume it downstream. Now, we’ll look at the features of the anndata data structure and the Python-Anndata-H5ad ecosystem one by one.
Anndata is a Python package for handling annotated data matrices in memory and on disk, positioned between pandas and xarray. It is a reasonably popular data structure with good community adoption: at the time of writing, anndata has about 2M downloads in total and 51K downloads/month, 345 GitHub stars, and 1K dependent repositories.
There are multiple tools for analysis and visualization in Python that rely on the anndata structure:
The Python-Anndata-H5ad ecosystem now comes into the picture as it supports an extensive set of tools that can be used to process data in an H5ad format downstream.
Efficient storage, querying, and interaction are important aspects to consider while choosing an environment to be used for storing data. The Python-Anndata-H5ad ecosystem provides:
Overall, the Python-Anndata-H5ad ecosystem provides efficient storage and consumption of single-cell omics data and supports further development in the domain by acting as a base system.
Considering the advantages and features of using anndata, H5ad, and Python, single-cell RNA-seq data on Polly is also based on the Python-Anndata-H5ad ecosystem. The Single-cell OmixAtlas on Polly contains single-cell RNA seq datasets with harmonized metadata, standardized and normalized data using consistent pipelines, expert-annotated cell types, and standard ontologies for reliable results that empower scientists to achieve their research goals.
Even though the Python-Anndata-H5ad ecosystem appears to be the ideal choice for single-cell RNA-seq data, users with specific needs also require tools that make H5ad data usable downstream in other languages and ecosystems. Apart from Python, there have been interesting developments in analysis and visualization of single-cell omics data in other languages as well. Many of these are in R, however, and the lack of complete support for H5ad/Anndata in R makes them difficult to use with H5ad data. For example, there is an anndata implementation in R through reticulate, but it often suffers from reading issues.
Nonetheless, it is possible to convert H5ad data into formats suitable for other ecosystems. Since H5ad is based on the HDF5 format, the data can be read in most languages to get at the individual data slots. In R in particular, the rhdf5 and hdf5r packages can be used to interact with the content of an H5ad file and work with the data further. To encourage and support R usage, the scientific community has developed many converters, and these preexisting converter libraries can be used on top of Polly’s Single-cell OmixAtlas, such as:
The preference for analysis tools/software varies across research groups and depends largely on familiarity with the working environments in different languages and other factors we have listed in this blog.
Researchers spend a great deal of time scouring large volumes of single-cell RNA sequencing data that are semi-structured or unstructured, with incomplete annotation and metadata. There is an urgent need for tech solutions that take care of cleaning, annotating, and versioning the data, all in one place. Elucidata’s Polly has the world’s largest collection of highly curated, ML-ready single-cell and bulk RNA-seq data. Single-cell data on Polly uses a consistent H5ad format, and bulk RNA-seq data uses a consistent GCT format.
If you have requirements for extensively curated biomedical research data, reach out to us to learn more!