
Single-Cell RNA-seq Data Analysis Beyond H5ad

Ayush Praveen, Trisha Dhawan
March 7, 2023

Single-cell RNA-sequencing (scRNA-seq) technology has come a long way since its first successful implementation in 2009. This technology has provided unprecedented insights into biological processes and improved our understanding at the cellular level. In the last decade, various scRNA-seq technologies have been developed, revolutionizing sample collection, single-cell capture, barcoded reverse transcription, cDNA amplification, library preparation, and sequencing, and streamlining bioinformatics analysis.

As more scRNA-seq methods have emerged, the volume of data generated has grown in proportion. However, the analysis tools and techniques for single-cell RNA sequencing data have not kept pace.

In this blog, we’ll take a look at some of the tools and formats that are available for single-cell RNA sequencing and single-cell omics data, why H5ad is a preferred format, and why the Python-Anndata-H5ad ecosystem is widely adopted.

Single-cell Data Storage: Currently Used Formats

Currently, the Gene Cluster Text (GCT) file format, or “.gct”, from the Broad Institute is one of the most common formats for storing processed gene expression data and metadata. However, the GCT format is not well suited for storing high-dimensional data such as scRNA-seq. The sparse nature of the data and the high sample count (number of cells captured) make GCT an unsuitable format for single-cell omics data. To address this, research groups have tried to solve the on-disk data storage problem for single-cell data with a few formats. Some of the most commonly used formats are:

  1. h5Seurat: developed by Paul Hoffman from the Satija Lab for storing a Seurat object on disk as a file that can be read as an S4 object.
  2. RDS: A serialization format in R that can store any R object. RDS is used to store Seurat and SingleCellExperiment objects in R.
  3. Loom: An hdf5-based file format with I/O support in R and Python. It can also be read as an S4 object in R.
  4. H5ad: An hdf5-based file format developed by Theislab with extensive support in Python.
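To make the sparsity argument against dense formats such as GCT concrete, here is a back-of-the-envelope sketch. The dataset size (100,000 cells by 20,000 genes), the 10% non-zero fraction, and the 4-byte value and index widths are all illustrative assumptions, not measurements from any real dataset.

```python
# Rough storage estimate: dense matrix vs. compressed sparse row (CSR) layout.
# All numbers below are hypothetical and chosen only for illustration.
n_cells = 100_000        # assumed number of cells (rows)
n_genes = 20_000         # assumed number of genes (columns)
nonzero_percent = 10     # assumed share of non-zero entries
bytes_per_value = 4      # float32 expression value
bytes_per_index = 4      # int32 column index / row pointer

# A dense matrix stores every entry, zero or not.
dense_bytes = n_cells * n_genes * bytes_per_value

# CSR stores one value and one column index per non-zero entry,
# plus one row pointer per row (and one extra at the end).
n_nonzero = n_cells * n_genes * nonzero_percent // 100
sparse_bytes = (
    n_nonzero * (bytes_per_value + bytes_per_index)
    + (n_cells + 1) * bytes_per_index
)

print(f"dense : {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")
print(f"ratio : {dense_bytes / sparse_bytes:.1f}x")
```

Under these assumptions the dense layout needs about 8 GB against roughly 1.6 GB for CSR, and the gap widens as the data gets sparser. A text format like GCT would typically take more space than the binary dense estimate, since every value is written out as characters.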

The format that comes closest to wide adoption, owing to being a persistent on-disk storage format, is H5ad. H5ad is based on the standard h5 format, a Hierarchical Data Format (HDF) used to store large amounts of data in the form of multidimensional arrays. The H5 format is primarily used to store scientific data organized for quick retrieval and analysis. A host of interactive Python tools, including Scanpy, MUON, and Stream, can process, analyze, and visualize data in the H5ad format, and this plays a major role in the format’s wide adoption. Further, to consume data downstream once it is stored in H5ad, the anndata data structure and the Python-Anndata-H5ad ecosystem are used. We’ll now look at the features of the anndata data structure and the Python-Anndata-H5ad ecosystem one by one.

Why Anndata?

Anndata, a Python package for handling annotated data matrices in memory and on disk, positioned between pandas and xarray, is a reasonably popular data structure with good community adoption. At the time of writing, anndata has about 2M downloads in total and 51K downloads/month, 345 GitHub stars, and 1K dependent repositories.

There are multiple tools for analysis and visualization in Python that rely on the anndata structure:

  1. CellOracle: for inferring Gene Regulatory Networks (GRNs) and performing in silico gene perturbations to simulate cell fate changes.
  2. Stream: for trajectory analysis of scRNA-seq data.
  3. MUON: a multi-modal single-cell analysis toolkit built on anndata.
  4. CellxGene: a state-of-the-art visualization application by CZI for scRNA-seq data.
  5. SquidPy: for spatial omics analysis and visualization.

This is where the Python-Anndata-H5ad ecosystem comes into the picture: it supports an extensive set of tools that can process data in H5ad format downstream.

Why Python-Anndata-H5ad Ecosystem for Single-cell RNA-seq Data?

Efficient storage, querying, and interaction are important aspects to consider while choosing an environment to be used for storing data. The Python-Anndata-H5ad ecosystem provides:
  • persistent and standard on-disk format.
  • storage for large volumes of data in the form of multidimensional arrays, based on the standard hdf5 system created to support scientific data storage.
  • support for scaling up, as Anndata provides efficient operations with low memory consumption and reduced runtime overhead.
  • support for storing data in sparse matrix format, out-of-core conversion between dense and sparse matrices, lazy and in-place subsetting, per-element operations for lower memory usage, and slicing, dicing, merging, and concatenating in the style of pandas DataFrames and NumPy arrays.

Overall, the Python-Anndata-H5ad ecosystem provides efficient storage and consumption of single-cell omics data and supports further development in the domain by acting as a base system.

Considering the advantages and features of using anndata, H5ad, and Python, single-cell RNA-seq data on Polly is also based on the Python-Anndata-H5ad ecosystem. The Single-cell OmixAtlas on Polly contains single-cell RNA seq datasets with harmonized metadata, standardized and normalized data using consistent pipelines, expert-annotated cell types, and standard ontologies for reliable results that empower scientists to achieve their research goals.

Can Single-cell RNA Sequencing Data in H5ad Format be Used in Other Ecosystems?

Even though the Python-Anndata-H5ad ecosystem appears to be the ideal choice for single-cell RNA-seq data, users with specific needs, or with tools that are available only in other languages, also need ways to use H5ad data downstream in those ecosystems. Apart from Python, there have been interesting developments in analysis and visualization for single-cell omics data in other languages as well. However, many of these are in R, and the lack of complete support for H5ad/Anndata in R makes them difficult to use with H5ad data. For example, although there is an anndata implementation in R through reticulate, it often suffers from file-reading issues.

Nonetheless, H5ad data can be converted into formats suitable for use in other ecosystems. Since H5ad is based on the hdf5 format, the file can be read from most languages to retrieve individual data slots. In R, rhdf5 and hdf5r in particular can be used to interact with H5ad file contents and work with the data further. To encourage and support R usage, the scientific community has developed many converters, which can be used on top of Polly’s Single-cell OmixAtlas. These preexisting converter libraries include:

  1. sceasy: Interconversion between anndata, Loom, Seurat, and SingleCellExperiment objects.
  2. seurat-disk: Interconversion between H5ad and h5Seurat formats.
  3. anndata2ri: RPy2-based converter for interconversion between anndata (Python) and SingleCellExperiment (R).
  4. zellkonverter: Extensive conversion between anndata and SingleCellExperiment objects.

The preference for analysis tools/software varies across research groups and depends largely on familiarity with the working environments in different languages and other factors we have listed in this blog.

Is There A One-Stop Shop Solution?

Researchers work tirelessly, scouring large volumes of single-cell RNA sequencing data that are either semi-structured or unstructured, with incomplete annotation and metadata. There is an urgent need for tech solutions that take care of the cleaning, annotating, and versioning of the data, all in one place. Elucidata’s Polly has the world’s largest collection of highly curated ML-ready single-cell and bulk RNA-seq data. For single-cell data on Polly, a consistent H5ad format is used, and for bulk RNA-seq data, a consistent GCT format is used.

If you have requirements for extensively curated biomedical research data, reach out to us to learn more!
