
Improvements in molecular profiling technologies such as next-generation sequencing now allow the characterization of individual cells [1]. This has led to the discovery of biological phenomena that were previously obscured when measuring bulk samples or populations. The mechanisms being uncovered can have major implications for our understanding of biology, health, and disease, and for therapeutic development.

While the potential for discovery with single-cell data is high, measurements at the single-cell level have two fundamental issues that should be addressed when exploring the generated data:

**High Cell-to-cell Variability:** The assumption that cells of a similar type are homogeneous is not correct; there is strong cell-to-cell variability of expression. While variability due to technical limitations, such as batch effects and sensitivity limits, should be minimized, there is also intrinsic variability of expression between cells. Several studies have concluded that the variability between cells, even of a similar nature, should be appreciated because it has important implications for cellular function [2][3]. However, the methods currently available for single-cell data are not adept at dealing with this variability. Additionally, there are applications of the data in which the variability of expression is not explored.

**Low Sequencing Depth per Cell:** Most single-cell expression methods sequence a single pool of reads, each tagged with a cell barcode that identifies the cell it came from. The pool is demultiplexed later to reveal the expression of individual cells. While the sample-level sequencing depth is high, the per-cell sequencing depth is still limited. The low per-cell sensitivity and high dropout rates often cause issues for methods designed for single-cell analysis.

As mentioned above, cell-to-cell variability is an intrinsic biological property required for biological function. However, the limited ability of single-cell methods to deal with it reduces confidence in calculations for applications where the variability itself is not being studied (differential expression, extraction of biomarkers, etc.). Similarly, low sequencing depth per cell strongly affects the results. Combined with the fact that most single-cell methods treat cells as independent samples, this leads to artificially small p-values (inflated significance) and unreliable effect sizes.

Multiple studies have highlighted that differential expression methods designed for single cells are not optimized to deal with high cell-to-cell variability, fail to account for high dropout rates, and incorrectly treat cells as independent samples. However, methods that rely on a pseudo-bulk representation of cellular expression stand out in terms of performance [4][5][6][7].

Pseudo-bulk analysis in single-cell RNA sequencing involves aggregating gene expression data from clusters of similar cells, creating a representative "pseudo-bulk" sample. This enables more computationally efficient analysis at a bulk level while retaining insights into cellular heterogeneity and functional characteristics within distinct cell populations.

While pseudo-bulk workflows evolved primarily to calculate differential expression of single-cell data reliably using bulk differential expression methods, the pseudo-bulk expression data itself can be used for many other applications. Rather than treating differential expression as the end point, the calculated pseudo-bulk expression can be:

- Normalized using both within and across sample normalization methods for further use.
- Used with other statistical methods such as linear regression or even advanced ML methods.
- Used for applications independent of differential expression like biomarker discovery etc.

Multiple workflows exist, and there is some understanding of which workflows suit which applications. In this blog, we'll explore some of these workflows and establish some good practices for pseudo-bulk approaches when analyzing single-cell data.

Pseudo-bulks are only as good a representation of the actual biological phenomena as their composition allows. Pseudo-bulks built from few cells, or with few total genes or reads, can negatively affect the insights drawn from the data. As a rule of thumb, it is generally considered good practice to remove pseudo-bulks with low cell counts or low total genes or reads expressed. There are no well-defined absolute cutoffs, but a pseudo-bulk should have at least a few thousand reads and 50-100 cells.
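The rule of thumb above can be sketched as a simple filter. This is a minimal illustration in plain Python, not a reference implementation: the function name `filter_pseudobulks`, the dictionary layout, and the default thresholds (50 cells, 2,000 total counts) are assumptions chosen to match the ranges mentioned in the text, and should be tuned per dataset.

```python
def filter_pseudobulks(profiles, n_cells, min_cells=50, min_counts=2000):
    """Keep pseudo-bulks that pass both a cell-count and a total-count cutoff.

    profiles: {(sample, cell_type): {gene: summed_count}}  (hypothetical layout)
    n_cells:  {(sample, cell_type): number of cells aggregated}
    The default thresholds are illustrative assumptions, not fixed standards.
    """
    kept = {}
    for key, counts in profiles.items():
        total_counts = sum(counts.values())
        if n_cells.get(key, 0) >= min_cells and total_counts >= min_counts:
            kept[key] = counts
    return kept


# Toy example with made-up numbers.
profiles = {
    ("S1", "T cell"): {"CD3E": 1500, "GAPDH": 4000},
    ("S1", "B cell"): {"CD79A": 40, "GAPDH": 300},    # too few cells and reads
    ("S2", "T cell"): {"CD3E": 900, "GAPDH": 2500},   # enough reads, too few cells
}
n_cells = {("S1", "T cell"): 320, ("S1", "B cell"): 12, ("S2", "T cell"): 45}

kept = filter_pseudobulks(profiles, n_cells)
print(sorted(kept))
```

Requiring both conditions is a conservative choice: a pseudo-bulk with many reads but only a handful of cells can still be dominated by a few outlier cells.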

Pseudo-bulks are expression profiles of individual cell types within individual samples. The samples might be from the same or different conditions (for example, normal vs. tumor). Comparing cell-type-specific bulk profiles can help identify cell-type-specific differences in mechanism between the conditions. The pseudo-bulk calculation can be approached as an aggregation operation over the data, and can be done via:

- Mean normalized expression

- Sum of counts

The mean normalization strategy averages single-cell normalized expression values across each pseudo-bulk sample. The sum of counts strategy sums the raw counts of each gene across a pseudo-bulk sample, and requires a bulk-specific normalization to be applied before use. Murphy et al. [8] evaluated the performance of pseudo-bulk methods against pseudo-replication-based methods and found that both the mean normalization and the sum of counts pseudo-bulk approaches outperform pseudo-replication.
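The two aggregation strategies described above can be sketched as follows. This is a dependency-free illustration under stated assumptions: the function names `pseudobulk` and `log_normalize`, the tuple-based input layout, and the scale factor of 10,000 are hypothetical choices for this example. In practice the same grouping is usually done on a cell-by-gene matrix with a dataframe or AnnData-style library.

```python
import math
from collections import defaultdict


def log_normalize(counts, scale=1e4):
    """Per-cell log library-size normalization: log1p(count / total * scale)."""
    total = sum(counts.values())
    return {g: math.log1p(c / total * scale) for g, c in counts.items()}


def pseudobulk(cells, how="sum"):
    """Aggregate cells into one profile per (sample, cell_type).

    cells: iterable of (sample, cell_type, {gene: value}) tuples.
    how:   "sum" for raw counts, "mean" for already-normalized values.
    Genes missing from a cell are treated as zero.
    """
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    totals = defaultdict(lambda: defaultdict(float))
    n_cells = defaultdict(int)
    for sample, cell_type, expr in cells:
        key = (sample, cell_type)
        n_cells[key] += 1
        gene_totals = totals[key]  # created even if the cell has no counts
        for gene, value in expr.items():
            gene_totals[gene] += value
    profiles = {}
    for key, gene_totals in totals.items():
        if how == "mean":
            profiles[key] = {g: v / n_cells[key] for g, v in gene_totals.items()}
        else:
            profiles[key] = dict(gene_totals)
    return profiles, dict(n_cells)


# Toy raw counts: (sample, cell type, per-gene counts).
raw_cells = [
    ("S1", "T cell", {"CD3E": 3, "GAPDH": 10}),
    ("S1", "T cell", {"CD3E": 1, "GAPDH": 6}),
    ("S1", "B cell", {"CD79A": 4, "GAPDH": 8}),
    ("S2", "T cell", {"CD3E": 2, "GAPDH": 5}),
]

# Sum of counts: aggregate raw counts, normalize afterwards at the bulk level.
sum_profiles, n_cells = pseudobulk(raw_cells, how="sum")

# Mean of normalized expression: normalize each cell first, then average.
norm_cells = [(s, ct, log_normalize(expr)) for s, ct, expr in raw_cells]
mean_profiles, _ = pseudobulk(norm_cells, how="mean")

print(sum_profiles[("S1", "T cell")])
print(mean_profiles[("S1", "T cell")])
```

The function also returns the number of cells behind each pseudo-bulk, since that count is needed later for quality filtering.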

Although Murphy et al. found the mean strategy to perform better than sum of counts, they reasoned that this was due to the missing normalization step in their sum of counts approach. Since the mean strategy can't account for intra-individual variability, they reasoned that a properly normalized sum of counts approach would perform better than the mean normalization strategy. Crowell et al. [9] found that when the sum of counts approach is accompanied by a normalization step, it outperforms the mean of normalized values. Junttila et al. [11] compared different methodologies for differential expression analysis. They showed that pseudo-bulk methods perform better than the other methods, specifically in terms of specificity and precision, and also highlighted that the sum of counts approach did better than the mean of normalized expression approach, except that the latter had slightly better reproducibility.

Other studies [10] report similar findings regarding the sum of counts approach. One practical consideration when applying either methodology is the type of expression values available for a study. Not all studies provide raw counts, and in that case using the mean of normalized expression to calculate pseudo-bulks is still better than using other approaches.

While the normalization usually applied in the mean of normalized expression strategy is a log library-size adjustment, different bulk RNA-seq normalization methods can be used with the sum of counts approach. The options include Median of Ratios (DESeq2's normalization method), Trimmed Mean of M-values (TMM), voom, Counts Per Million (CPM), Variance Stabilising Transformation (VST), Regularised Log (rlog), and others. Junttila et al. [11] compared Median of Ratios, TMM, and voom and found them to perform similarly on sum of counts expression data. More comprehensive comparisons are still needed, including rlog and VST, which generally perform well on bulk RNA-seq data.
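Two of the simpler options listed above can be sketched in plain Python. This is an illustration of the underlying ideas only: `cpm` and `median_of_ratios_size_factors` are hypothetical helper names, and the second function follows the general median-of-ratios idea (each sample's median ratio to a per-gene geometric mean) rather than reproducing DESeq2's implementation. For real analyses, the established packages should be used.

```python
import math
import statistics


def cpm(counts, log=False, prior=1.0):
    """Counts per million for one pseudo-bulk; optionally log2(CPM + prior)."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("pseudo-bulk has no counts")
    out = {g: c / total * 1e6 for g, c in counts.items()}
    if log:
        out = {g: math.log2(v + prior) for g, v in out.items()}
    return out


def median_of_ratios_size_factors(profiles):
    """Size factor per pseudo-bulk via the median-of-ratios idea.

    profiles: {name: {gene: summed_count}}
    Only genes with a non-zero count in every pseudo-bulk are used.
    """
    names = list(profiles)
    shared = set.intersection(*(set(p) for p in profiles.values()))
    usable = [g for g in shared if all(profiles[n][g] > 0 for n in names)]
    if not usable:
        raise ValueError("no gene is expressed in every pseudo-bulk")
    # Log of the per-gene geometric mean across pseudo-bulks.
    log_geo_mean = {
        g: sum(math.log(profiles[n][g]) for n in names) / len(names)
        for g in usable
    }
    factors = {}
    for n in names:
        log_ratios = [math.log(profiles[n][g]) - log_geo_mean[g] for g in usable]
        factors[n] = math.exp(statistics.median(log_ratios))
    return factors


# Toy pseudo-bulks: B has exactly twice the depth of A.
bulk = {
    "A": {"g1": 10, "g2": 20, "g3": 30},
    "B": {"g1": 20, "g2": 40, "g3": 60},
}
factors = median_of_ratios_size_factors(bulk)
normalized = {
    n: {g: c / factors[n] for g, c in counts.items()} for n, counts in bulk.items()
}
print(factors)
print(cpm(bulk["A"]))
```

In this toy case the two pseudo-bulks differ only in depth, so dividing by the size factors makes their profiles identical, which is the behaviour a depth normalization is expected to have.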

Pseudo-bulk analysis offers several advantages:

- **Computational Efficiency:** Pseudo-bulk analysis reduces the computational burden associated with analyzing large scRNA-seq datasets.
- **Biological Interpretation:** It allows gene expression patterns to be interpreted at a higher, more interpretable level than individual cells.
- **More Options for Computational Methods:** Bulk-specific methods can be used on pseudo-bulk data, providing a wide range of options to choose from.
- **Integration with Bulk Data:** Pseudo-bulk data can be more easily integrated with traditional bulk RNA-seq data for a comprehensive analysis.

By calculating pseudo-bulks from single-cell expression, many of the issues that single-cell-specific methods cannot currently address can be avoided.

Pseudo-bulk approaches can avoid the statistical issues faced by current single-cell-specific methods, and they have many applications beyond differential expression. We hope that this blog will serve as a guide to workflows for pseudo-bulk calculations.

We apply these best practices in the pseudo-bulk workflows on our data harmonization platform, Polly, to solve use cases ranging from finding differences between cell states to extracting biomarkers.

Contact us at info@elucidata.io or request a demo here to learn more.

**References**