Single Cell Differential Expression: Sham or Incorrect Practices?

Ayush Praveen
October 30, 2023

The phrase that follows is harsh, but it is one I have heard routinely from scientists during three years of working with single cell expression data: single cell differential expression is a sham! That is a strong charge against a method that is instrumental even for the basic step of cell type identification, yet at times I have felt the same way. My rational mind, however, prompted me to see the problem as either a data-method misfit or a lack of understanding of how the methods should be used.

Before discussing single cell differential expression (DE) methods, I would first like to describe the issues that underlie this feeling:

  1. The p-values never seem to be non-significant: Even when the differences between two cell populations are minimal, the p-values remain very low after FDR adjustment and are almost never non-significant.
  2. Many false positives among high fold change genes: Comparing a cell type across two conditions often yields a long list of differentially expressed genes, yet in many cases most of those genes show no strong association with the conditions when checked by expression or correlation.

Figure: Top differentially expressed genes (partial list), calculated using MAST, after dividing CD8+ T cells into two groups based on high and low expression of the gene of interest in NSCLC tumors. The single cell data was taken from GEO dataset GSE99254, submitted by Guo et al, Nature Medicine, 2018.

Figure: Correlation coefficients of the differentially expressed genes identified in the previous figure with the gene of interest in CD8+ T cells in NSCLC tumors. The single cell data was taken from GEO dataset GSE99254, submitted by Guo et al, Nature Medicine, 2018.

As the conflicting results in the figures above show, the genes identified as differentially expressed (by splitting CD8+ T cells into two groups based on high and low expression of a gene of interest) did not show a strong Pearson correlation with each other or with the gene of interest. This is one of several such anomalous cases that I have observed while working with single cell DE.
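
The mismatch between a significant high/low split and a weak correlation can be reproduced on synthetic data. The sketch below is a minimal, dependency-free simulation, not the MAST analysis behind the figures. The expression values, the assumed correlation of 0.1, the cell count, and all function names are illustrative assumptions, and the p-value uses a normal approximation that is only reasonable for large groups.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def welch_z_pvalue(g1, g2):
    """Two-sided p-value for a difference in means (normal approximation)."""
    def mean_var(g):
        m = sum(g) / len(g)
        return m, sum((v - m) ** 2 for v in g) / (len(g) - 1)

    m1, v1 = mean_var(g1)
    m2, v2 = mean_var(g2)
    z = (m1 - m2) / math.sqrt(v1 / len(g1) + v2 / len(g2))
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(0)
n_cells = 10_000
true_r = 0.1  # assumed weak association between the two genes

# Hypothetical normalized expression of the gene of interest and a second gene.
gene_of_interest = [rng.gauss(0, 1) for _ in range(n_cells)]
other_gene = [
    true_r * a + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
    for a in gene_of_interest
]

# Split cells into "high" and "low" groups at the median of the gene of interest.
median = sorted(gene_of_interest)[n_cells // 2]
high = [b for a, b in zip(gene_of_interest, other_gene) if a >= median]
low = [b for a, b in zip(gene_of_interest, other_gene) if a < median]

r = pearson(gene_of_interest, other_gene)
p = welch_z_pvalue(high, low)
print(f"Pearson r = {r:.3f}, high-vs-low p-value = {p:.2e}")
```

With thousands of cells per group, the second gene is flagged as clearly different between the two groups even though it explains only about 1% of the variance of the gene of interest.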

Have others faced similar issues with single cell differential expression? Multiple studies (Dal Molin et al, Frontiers in Genetics, 2017; Wang et al, BMC Bioinformatics, 2019; Das et al, Genes, 2021; Squair et al, Nature Communications, 2021) have highlighted that the performance of single cell DE methods is strongly data dependent. Additionally, the agreement between popular DE methods can vary widely, as shown by Wang et al, BMC Bioinformatics, 2019.

Figure: Number of DE genes shared by each pair of methods among their top 1000 genes in real data, from Wang et al, BMC Bioinformatics, 2019 (Figure 5).

Figure: Number of DE genes shared by each pair of methods at adjusted p-value < 0.05 in real data, from Wang et al, BMC Bioinformatics, 2019 (Figure 6).

From these results, it is evident that there is very low concordance between different methods.
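
The kind of pairwise comparison summarized in those figures can be sketched in a few lines. This is an illustrative sketch only: the method names, the rankings, and the helper `pairwise_common_genes` are hypothetical and do not reproduce the data or code of Wang et al.

```python
from itertools import combinations


def pairwise_common_genes(ranked, top_n):
    """Count genes shared by the top_n lists of every pair of methods."""
    top = {method: set(genes[:top_n]) for method, genes in ranked.items()}
    return {
        (a, b): len(top[a] & top[b])
        for a, b in combinations(sorted(top), 2)
    }


# Hypothetical rankings, most significant gene first (made up for illustration).
ranked_genes = {
    "method_A": ["CD8A", "GZMB", "PDCD1", "LAG3", "IFNG"],
    "method_B": ["GZMB", "CD8A", "TOX", "HAVCR2", "PDCD1"],
    "method_C": ["TOX", "CTLA4", "CD8A", "TIGIT", "ENTPD1"],
}

overlap = pairwise_common_genes(ranked_genes, top_n=3)
for pair, shared in overlap.items():
    print(pair, shared)
```

The same function works with a list of genes passing an adjusted p-value cutoff instead of a fixed top-N list, as long as `top_n` is set to at least the length of the longest list.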

As I started exploring methods, their comparative analysis, and correctness, I realized that there are essential inherent properties of single cell data that give rise to challenges in getting accurate differential expression results:

  1. High sparsity: An easily recognized characteristic of single cell data is the high proportion of zeros in the expression matrix. These zeros can represent a true absence of expression, or expression that was missed for technical reasons such as the sensitivity of the instrument or the half-life of mRNA.
  2. Multimodality: With bulk sequencing data, it is comparatively easy to classify a gene as differentially expressed or not, because the expression distribution of a gene is usually unimodal (one clear peak). In single cell data, cellular heterogeneity and the presence of different cell types make gene expression mostly multimodal.
  3. Data size: While the ability to profile millions of cells has greatly increased our understanding of biological systems, the large number of observations (cells in this case) brings its own data science challenges: with enough cells, even a negligible difference can reach statistical significance.
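
The effect of data size on p-values can be made concrete with a small calculation. The sketch below assumes a simple two-sample z-test with equal group sizes and unit variance, which is a simplification and not the model used by any particular single cell DE tool. The effect size of 0.05 standard deviations and the function name `two_sided_p` are illustrative assumptions.

```python
import math


def two_sided_p(effect_sd, cells_per_group):
    """p-value of a two-sample z-test when the observed mean difference
    equals effect_sd standard deviations and both groups have equal size."""
    z = effect_sd / math.sqrt(2 / cells_per_group)
    return math.erfc(z / math.sqrt(2))


effect = 0.05  # assumed difference: 5% of one standard deviation
p_values = {n: two_sided_p(effect, n) for n in (50, 500, 5_000, 50_000)}
for n, p in p_values.items():
    print(f"{n:>6} cells per group -> p = {p:.3g}")
```

The difference between the groups is identical in every row; only the number of cells changes, and that alone moves the result from clearly non-significant to extremely significant.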

Squair et al, Nature Communications, 2021 looked for ways to confront the high number of false positives in single cell differential expression. They reasoned that datasets in which the same population of purified cells was profiled by both bulk and single cell sequencing can be used to understand the discrepancies. They found 18 such datasets in the public domain and treated them as gold standard datasets for comparison. Their main findings were:

Pseudo-Bulk Based Methods Perform Better

Pseudo-bulk methods aggregate the expression of the cells in each group within a biological replicate, which reduces the overall zero inflation. Murphy et al, Nature Communications, 2022 also showed that pseudo-bulk methods tend to perform better than other methods for single cell DE.
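
The aggregation step itself is simple. Below is a minimal sketch that sums raw counts over all cells sharing a replicate and a cell type; grouping by this pair, the toy counts, the donor labels, and the function names are all assumptions made for illustration. In practice the resulting pseudo-bulk profiles would then be passed to a bulk DE tool.

```python
from collections import defaultdict

# Hypothetical per-cell raw counts: (replicate, cell_type, {gene: count})
cells = [
    ("donor1", "CD8_T", {"GZMB": 0, "IFNG": 2}),
    ("donor1", "CD8_T", {"GZMB": 3, "IFNG": 0}),
    ("donor1", "CD8_T", {"GZMB": 0, "IFNG": 0}),
    ("donor2", "CD8_T", {"GZMB": 1, "IFNG": 0}),
    ("donor2", "CD8_T", {"GZMB": 0, "IFNG": 4}),
    ("donor2", "CD8_T", {"GZMB": 0, "IFNG": 0}),
]


def pseudobulk(cells):
    """Sum raw counts over all cells sharing a (replicate, cell type) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for replicate, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(replicate, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in totals.items()}


def zero_fraction(profiles):
    """Fraction of gene-level entries equal to zero across profiles."""
    values = [v for profile in profiles for v in profile.values()]
    return sum(v == 0 for v in values) / len(values)


bulk = pseudobulk(cells)
print(bulk)
print("zeros per cell:", zero_fraction([c[2] for c in cells]))
print("zeros per pseudo-bulk sample:", zero_fraction(bulk.values()))
```

In this toy example two thirds of the per-cell entries are zero, while none of the aggregated entries are, which is the reduction in zero inflation described above.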

Figure: Area under the concordance curve (AUCC) for fourteen DE methods in the eighteen ground-truth datasets, from Squair et al, Nature Communications, 2021 (Figure 1C).

However, one can also argue that the pseudo-bulk methods are only comparatively better than the others and still perform poorly in absolute terms, as is evident from the low AUCC of even the pseudo-bulk based methods.
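
For readers unfamiliar with the metric, the sketch below shows one way to compute an AUCC. It assumes the commonly used definition: for each list size from 1 to k, count the genes shared by the two top-ranked lists, sum those counts, and divide by the maximum possible sum k(k+1)/2. The gene names, the value of k, and the function name `aucc` are illustrative assumptions, and details may differ from the implementation used by Squair et al.

```python
def aucc(ranking_a, ranking_b, k):
    """Area under the concordance curve for the top-k genes of two rankings,
    scaled so that identical rankings score 1.0."""
    seen_a, seen_b = set(), set()
    area = 0
    for i in range(k):
        seen_a.add(ranking_a[i])
        seen_b.add(ranking_b[i])
        area += len(seen_a & seen_b)  # genes shared by the two top-(i+1) lists
    return area / (k * (k + 1) / 2)


bulk_rank = ["g1", "g2", "g3", "g4"]          # hypothetical bulk ranking
single_cell_rank = ["g1", "g3", "g2", "g4"]   # hypothetical single cell ranking
print(aucc(bulk_rank, single_cell_rank, k=4))  # 0.9
```

A value of 1 means the two rankings agree at every list size, and a value near 0 means the top-ranked genes barely overlap.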

DE Methods Are Biased Towards Highly Expressed Genes

Comparing methods that do not aggregate cells with the pseudo-bulk based methods, the authors found that the false positives reported by the former are usually highly expressed genes. Thus, even when the actual difference between two groups of cells is minimal for a highly expressed gene, it can still be falsely identified as differentially expressed. Conversely, the false negatives missed by non-aggregation methods are usually lowly expressed genes. To test whether aggregation plays a role, the authors skipped the aggregation step in the pseudo-bulk methods and treated each cell as a sample when computing differential expression. This produced more false positives, again among highly expressed genes, which validated their hypothesis.

DE Analysis Should Account for Biological Differences Between Replicates

Single cell DE allows us to compare groups of cells across biological replicates and conditions. However, if the groups are formed without regard to the sample and replicate from which each cell originates, even pseudo-bulk based methods perform worse. When the authors mixed cells from different replicates into different groups, the accuracy of the pseudo-bulk based methods was lost.
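
A toy calculation illustrates why replicate structure matters. In the sketch below, two conditions differ only by ordinary replicate-to-replicate variation. The replicate means, the within-replicate spread, and the helper names are hypothetical, and Welch's t statistic stands in for the more elaborate models used by real DE tools.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic for the difference in means of two samples."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(y) - mean(x)) / se


# Hypothetical replicate-level mean expression of one gene; the two conditions
# differ only by ordinary replicate-to-replicate variation.
control_means = [10.0, 10.6, 9.7]
treated_means = [10.3, 9.9, 10.7]

# Deterministic within-replicate spread: 1,000 cells per replicate.
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0] * 200


def cells_from(replicate_means):
    """Expand replicate means into per-cell values, ignoring replicate labels."""
    return [m + o for m in replicate_means for o in offsets]


t_cells = welch_t(cells_from(control_means), cells_from(treated_means))
t_replicates = welch_t(control_means, treated_means)
print(f"cells as samples:      t = {t_cells:.2f} (n = 3000 per condition)")
print(f"replicates as samples: t = {t_replicates:.2f} (n = 3 per condition)")
```

Treating every cell as an independent sample gives a test statistic far above conventional significance thresholds, while the same data summarized per replicate gives one well below them. The cells within a replicate are not independent observations, so ignoring the replicate overstates the evidence.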

Ending Notes

From all the studies I read, it is evident that no single method fits all datasets well. Even the best methods involve tradeoffs for some sets of genes. Based on these studies, the best practices that can currently be adopted for single cell DE are:

  1. Use pseudo-bulk based methods for single cell DE rather than model-based or other non-aggregation based methods.
  2. Compare cells or cell types while accounting for the biological replicate from which they originate.
  3. Validate the differentially expressed genes with the highest fold change by checking their expression and correlation in the different groups.
  4. When identifying markers or signatures for individual groups of cells, validate the DE results with statistical or machine learning methods that are specifically optimized for large sample sizes and zero-inflated data.

More than 100 DE methods are currently available for single cell data, but only a handful of them are used in most studies. The inherent properties of single cell data require us to look at the statistical basis of these methods in a new light. Doing so will help us repurpose existing methods or identify new ones that compute single cell differential expression accurately.

P.S.: This blog was originally part of a blog series on DecodeBox, written by Ayush Praveen, who is a Bioinformatics Scientist at Elucidata.

References

  1. Wang, T., Li, B., Nelson, C.E. et al. Comparative analysis of differential gene expression analysis tools for single-cell RNA sequencing data. BMC Bioinformatics 20, 40 (2019).
  2. Squair, J.W., Gautier, M., Kathe, C. et al. Confronting false discoveries in single-cell differential expression. Nat Commun 12, 5692 (2021).
  3. Murphy, A.E., Skene, N.G. A balanced measure shows superior performance of pseudobulk methods in single-cell RNA-sequencing analysis. Nat Commun 13, 7851 (2022).
  4. Dal Molin, A., Baruzzo, G., Di Camillo, B. Single-cell RNA-sequencing: assessment of differential expression analysis methods. Front Genet 8, 62 (2017).
  5. Das, S., Rai, A., Merchant, M.L., Cave, M.C., Rai, S.N. A comprehensive survey of statistical approaches for differential expression analysis in single-cell RNA sequencing studies. Genes 12, 1947 (2021).
