Bulk RNA Sequencing: A Comparison of the Most Popular Tools and Pipelines

The association between gene expression dynamics and biological function has always been a subject of great interest in biology. With advances in technology, it has become possible to manipulate nucleic acids to quantify gene expression and interpret its biological significance.

Bulk RNA sequencing has changed the way researchers approach these questions. It provides qualitative and quantitative analyses of the entire transcriptome of the targeted cells, tissues, or organisms. The resulting data has a wide range of applications: for example, it can be used to study differential gene expression in healthy versus cancerous cells, or for immune profiling. Analyses of this kind have helped accelerate drug discovery.

However, as RNA sequencing has grown in importance, the number of tools and techniques available to analyze the raw reads has grown with it. Users are more puzzled than ever about the best way to analyze their sequencing data. In this blog, we have compiled a list of the tools commonly used at each step and compared them.

Different Steps in the RNA Sequencing Analysis Pipeline

The steps for analyzing RNAseq data are as follows:

Step 1 - Trimming/Quality Check

Trimming the raw data is crucial for eliminating adaptor sequences and poor-quality nucleotides, which increases the proportion of reads that map. Trimming also makes downstream analysis more reliable while reducing computational requirements. However, it should be done cautiously, with a carefully chosen trim length, to prevent unwanted changes in gene expression estimates and transcriptome assembly. The most sought-after trimming tools are

A whole host of tools is available for trimming; however, various studies show that there is no generic answer to 'what is the best trimming tool'. The choice of tool depends on the type of dataset, the downstream analysis, and the user-defined parameters that need to be taken into consideration. For example, setting the main quality threshold too high could reduce the size of the surviving dataset, while setting it too low may render the trimming exercise futile.
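To make the threshold trade-off concrete, here is a minimal Python sketch of 3'-end quality trimming. It is not the algorithm of any particular trimming tool: the function names, the Phred+33 encoding, the minimum-length cutoff, and the example read are all assumptions chosen for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20, min_length=5):
    """Drop bases from the 3' end until one meets min_quality.

    Returns the trimmed (sequence, quality) pair, or None when the
    surviving read is shorter than min_length and should be discarded.
    """
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    if end < min_length:
        return None
    return sequence[:end], quality_string[:end]


# Hypothetical read: 'I' is Phred 40, '5' is Phred 20, '#' is Phred 2.
read_seq = "ACGTACGTAC"
read_qual = "IIIIII55##"

for threshold in (20, 30, 41):
    print(threshold, trim_3prime(read_seq, read_qual, min_quality=threshold))
```

With this toy read, a threshold of 20 keeps eight bases, a threshold of 30 keeps six, and a threshold of 41 discards the read entirely, which is the "too high" failure mode described above.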

Step 2 - Alignment

Aligning the reads to a reference genome/transcriptome is the second step in the RNAseq pipeline. The tools popular for alignment are-

BWA had the highest alignment rate (the percentage of sequenced reads that were successfully mapped to the reference genome) and the most coverage among all the tools. HISAT2 was the fastest aligner. STAR and HISAT2 performed slightly better at aligning the unmapped reads.
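The alignment rate defined above is a simple ratio. The sketch below shows the calculation in Python; the function name and the read counts are hypothetical and are not results from any of the cited comparisons.

```python
def alignment_rate(mapped_reads, total_reads):
    """Percentage of sequenced reads that mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= mapped_reads <= total_reads:
        raise ValueError("mapped_reads must be between 0 and total_reads")
    return 100.0 * mapped_reads / total_reads


# Hypothetical (mapped, total) read counts, for illustration only.
runs = {"aligner_a": (950_000, 1_000_000), "aligner_b": (900_000, 1_000_000)}
for name, (mapped, total) in runs.items():
    print(f"{name}: {alignment_rate(mapped, total):.1f}%")
```

In practice, aligners typically report these numbers in their own summary logs, so a helper like this is mainly useful when comparing several tools side by side.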

Step 3 - Counting/Quantification

After the reads have been mapped, they are assigned to a gene or transcript in a process called counting/quantification. This step produces the per-gene or per-transcript read counts that are later used to compare case versus control. The most commonly used tools are-

When the tools were compared, Cufflinks and RSEM ranked at the top, followed by HTSeq and StringTie-based pipelines.
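As a rough illustration of what "assigning reads to a gene" means, the following Python sketch counts reads by coordinate overlap. It is a deliberately simplified rule, not the method of any tool named above; the gene coordinates, the alignments, and the `__no_feature` / `__ambiguous` labels are assumptions made for this example.

```python
from collections import Counter


def count_reads(genes, alignments):
    """Assign each aligned read to the single gene it overlaps.

    genes: dict of gene_id -> (chrom, start, end), 1-based inclusive.
    alignments: iterable of (chrom, start, end) for mapped reads.
    Reads overlapping no gene or more than one gene are tallied under
    '__no_feature' and '__ambiguous' instead of being counted for a gene.
    """
    counts = Counter({gene_id: 0 for gene_id in genes})
    for chrom, start, end in alignments:
        hits = [
            gene_id
            for gene_id, (g_chrom, g_start, g_end) in genes.items()
            if g_chrom == chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return dict(counts)


# Hypothetical annotation and alignments.
genes = {"geneA": ("chr1", 100, 500), "geneB": ("chr1", 450, 900)}
alignments = [
    ("chr1", 120, 170),  # inside geneA only
    ("chr1", 460, 510),  # overlaps geneA and geneB
    ("chr1", 600, 650),  # inside geneB only
    ("chr2", 100, 150),  # no annotated gene on this chromosome
]
print(count_reads(genes, alignments))
```

Real counting tools also handle strandedness, spliced reads, and multi-mapping reads, and they differ in how they resolve ambiguous reads, which is one reason their counts can disagree.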

Step 4 - Normalization

After the counting step, the quantified transcripts undergo normalization to remove sequencing bias. Each normalization technique yields a different gene expression measure: Fragments Per Kilobase of transcript per Million mapped reads (FPKM), Transcripts Per Million (TPM), Trimmed Mean of M-values (TMM, from edgeR), Relative Log Expression (RLE, from DESeq2), upper quartile (UQ), coverage (cov), estimated counts (est_counts), and effective counts (eff_counts). Researchers evaluated various normalization methods and found that pipelines using TMM performed best, followed by RLE, TPM, and FPKM.
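The two length-based measures, FPKM and TPM, can be computed directly from counts and transcript lengths. The Python sketch below uses the standard definitions; the gene names, counts, and lengths are made up for illustration. TMM and RLE estimate per-sample scaling factors across samples and are more involved, so they are not shown here.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments.

    counts: dict of gene -> fragment count for one sample.
    lengths: dict of gene -> transcript length in bases.
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rates.values())
    return {g: rates[g] / scale * 1e6 for g in counts}


# Hypothetical counts and transcript lengths for a single sample.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

print("FPKM:", fpkm(counts, lengths))
print("TPM: ", tpm(counts, lengths))
```

In this example geneA and geneB receive the same FPKM and the same TPM, because geneB has three times the reads but is also three times as long. TPM values always sum to one million within a sample, which makes them easier to compare across samples than FPKM.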

Pseudoalignment

Apart from the steps described above, there is an alternative pipeline called quantification by pseudoalignment, in which all three steps (alignment, counting, and normalization) are performed in a single step. Commonly used pseudoaligners are Kallisto, Salmon, and Sailfish. When compared, all three methods showed similar performance in terms of precision and accuracy.
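The core idea can be sketched in a few lines of Python: instead of computing a base-by-base alignment, look up which transcripts contain every k-mer of a read. This is a toy illustration only, not the implementation used by any of the tools above; the transcript sequences, the read sequences, and the choice of k = 5 are assumptions.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(transcripts, k):
    """Map each k-mer to the set of transcript IDs that contain it."""
    index = {}
    for transcript_id, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(transcript_id)
    return index


def compatible_transcripts(read, index, k):
    """Intersect the transcript sets of every k-mer in the read.

    Returns an empty set if any k-mer is missing from the index
    or if the read is shorter than k.
    """
    compatible = None
    for kmer in kmers(read, k):
        hits = index.get(kmer, set())
        compatible = set(hits) if compatible is None else compatible & hits
        if not compatible:
            return set()
    return compatible or set()


# Hypothetical transcripts sharing a common prefix.
transcripts = {"t1": "ACGTACGGTCA", "t2": "ACGTACGTTAG"}
index = build_index(transcripts, k=5)

print(compatible_transcripts("ACGTACG", index, k=5))  # shared prefix
print(compatible_transcripts("TACGGTC", index, k=5))  # specific to t1
print(compatible_transcripts("GGGGGGG", index, k=5))  # matches nothing
```

Production pseudoaligners rely on far more efficient index structures and add a statistical step to apportion reads that are compatible with several transcripts, which is how they arrive at normalized abundance estimates in one pass.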

Differential Expression

The final step in the RNAseq analysis pipeline is comparing the normalized transcript counts in case versus control to identify differentially expressed genes. Since this is the most crucial step, several tools and techniques have been developed for differential expression (DE) analysis.

When these tools were compared for detection ability, Cuffdiff reported the fewest differentially expressed genes while SAMseq reported the most. When compared for accuracy, limma trend, limma voom, and baySeq turned out to be the most accurate. Overall, across 16 different parameters, baySeq turned out to be the best tool for analysis, followed by edgeR, limma trend, and limma voom.
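At its simplest, the case-versus-control comparison is a log2 fold change between group means. The Python sketch below shows only that effect-size calculation; the DE tools named above additionally fit statistical models and test for significance, which this sketch does not attempt. The function names, the pseudocount of 1, the cutoff of 1.0, and the expression values are assumptions for illustration.

```python
import math
from statistics import mean


def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """log2 ratio of mean case to mean control expression.

    The pseudocount (an assumption of this sketch) avoids division by zero
    and log of zero for genes with no expression in one group.
    """
    case_mean = mean(case_values) + pseudocount
    control_mean = mean(control_values) + pseudocount
    return math.log2(case_mean / control_mean)


def flag_changed_genes(expression, min_abs_log2fc=1.0):
    """Return {gene: log2FC} for genes whose |log2FC| meets the cutoff.

    expression: dict of gene -> (case_values, control_values),
    holding normalized expression values per sample.
    """
    flagged = {}
    for gene, (case_values, control_values) in expression.items():
        lfc = log2_fold_change(case_values, control_values)
        if abs(lfc) >= min_abs_log2fc:
            flagged[gene] = lfc
    return flagged


# Hypothetical normalized values, three samples per group.
expression = {
    "geneA": ([30, 34, 29], [7, 8, 6]),    # higher in case
    "geneB": ([10, 11, 9], [10, 9, 11]),   # unchanged
    "geneC": ([0, 0, 0], [14, 16, 15]),    # lower in case
}
print(flag_changed_genes(expression))
```

A fold-change cutoff alone ignores variability between replicates, which is why the number of genes reported can differ so much between tools that model that variability differently.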

Overall Pipeline Comparison

Six RNAseq pipelines were compared-

Most tools and pipelines available are comparable with each other and produce similar results. Researchers can choose tools based on the available computing resources, the research objectives, and the gene expression values. It would be wise to utilize multiple procedures/pipelines to obtain the most reliable fold change and the number of differentially expressed genes.
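One simple way to combine several pipelines is to keep only the genes that a minimum number of them agree on. The Python sketch below illustrates this; the pipeline labels, gene names, and the `min_support` parameter are hypothetical, and majority voting is just one possible way to define a consensus.

```python
from collections import Counter


def consensus_genes(calls_by_pipeline, min_support=2):
    """Genes called differentially expressed by at least min_support pipelines.

    calls_by_pipeline: dict of pipeline name -> iterable of gene IDs.
    Duplicate gene IDs within one pipeline are counted once.
    """
    support = Counter(
        gene for genes in calls_by_pipeline.values() for gene in set(genes)
    )
    return sorted(gene for gene, n in support.items() if n >= min_support)


# Hypothetical DE calls from three pipelines.
calls = {
    "pipeline_1": ["gene1", "gene2", "gene3"],
    "pipeline_2": ["gene2", "gene3", "gene4"],
    "pipeline_3": ["gene3", "gene5"],
}
print(consensus_genes(calls, min_support=2))
print(consensus_genes(calls, min_support=3))
```

Raising `min_support` trades sensitivity for confidence: fewer genes survive, but each one is supported by more independent analyses.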

Bulk RNA-Sequencing Data on Polly

Polly is a data-centric MLOps platform that hosts OmixAtlas, a collection of several million datasets of carefully curated biomolecular data. Each dataset ingested into Polly OmixAtlas is curated to make it analysis-ready. Curation also makes datasets easier to query, find, and search.

The Bulk RNA sequencing OmixAtlas on Polly provides curated datasets with metadata fields such as cell type, cell line, disease, drug, tissue, and organism. All ingested datasets go through two steps:

  1. Data engineering - This is done to transform the data into one consistent data schema.
  2. Metadata Harmonization - This is done to tag each sample and dataset with terms from a uniform ontology.

Reach out to us to learn more about how to accelerate your research!

References:

  • Borozan, I., Watt, S. N., & Ferretti, V. (2013). Evaluation of alignment algorithms for discovery and identification of pathogens using RNA-Seq. PloS one, 8(10), e76935.
  • Corchete, L. A., Rojas, E. A., Alonso-López, D., De Las Rivas, J., Gutiérrez, N. C., & Burguillo, F. J. (2020). Systematic comparison and assessment of RNA-seq procedures for gene expression quantitative analysis. Scientific reports, 10(1), 19737.
  • Del Fabbro, C., Scalabrin, S., Morgante, M., & Giorgi, F. M. (2013). An extensive evaluation of read trimming effects on Illumina NGS data analysis. PloS one, 8(12), e85024.
  • Grant, G. R., Farkas, M. H., Pizarro, A. D., Lahens, N. F., Schug, J., Brunk, B. P., Stoeckert, C. J., Hogenesch, J. B., & Pierce, E. A. (2011). Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM). Bioinformatics (Oxford, England), 27(18), 2518–2528.
  • Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome biology, 14(4), R36.
  • Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature methods, 12(4), 357–360.
  • Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), 357–359.
  • Liu, X., Zhao, J., Xue, L., Zhao, T., Ding, W., Han, Y., & Ye, H. (2022). A comparison of transcriptome analysis methods with reference genome. BMC genomics, 23(1), 232.
  • Musich, R., Cadle-Davidson, L., & Osier, M. V. (2021). Comparison of Short-Read Sequence Aligners Indicates Strengths and Weaknesses for Biologists to Consider. Frontiers in plant science, 12, 657240.
