RNA-seq expression estimates need not take longer than a cup of coffee
The quantification of gene or isoform abundance is a fundamental step in many transcriptome analysis tasks, such as determining differential expression between biological samples. Yet estimating isoform abundance from a large set of RNA-seq reads remains computationally intensive, owing in large part to the need for read mapping. To address this problem directly, we developed Sailfish, a software tool that implements a novel, alignment-free algorithm for estimating isoform abundances directly from a set of reference sequences and RNA-seq reads. Rather than working at the read level, Sailfish treats the k-mer as the fundamental unit of transcript coverage. This lightweight alternative allows Sailfish to dispense with many of the complexities of read mapping while remaining robust to sequencing errors. By replacing read mapping with intelligent k-mer indexing and counting, Sailfish is able to quantify isoform abundance orders of magnitude faster than existing tools. For example, Sailfish processes a set of 150 million reads in about 15 minutes, whereas existing tools take over 6 hours.
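As a rough illustration of k-mer indexing and counting, the following minimal Python sketch indexes the k-mers of a few toy reference transcripts and then counts only those read k-mers that appear in the index. It is not Sailfish's implementation: the function names (`kmers`, `build_index`, `count_read_kmers`), the toy sequences, and the choice of k = 4 are assumptions made for this example, and details such as strand handling and memory-efficient hashing are omitted.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq (nothing if seq is shorter than k)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(transcripts, k):
    """Map each reference k-mer to the set of transcript names containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return dict(index)


def count_read_kmers(reads, index, k):
    """Count read k-mers, keeping only those present in the reference index.

    A k-mer that overlaps a sequencing error will usually be absent from the
    index and is simply skipped, so no alignment step is needed.
    """
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in index:
                counts[km] += 1
    return counts


# Hypothetical toy data, for illustration only.
K = 4
transcripts = {"t1": "ACGTACGGA", "t2": "ACGTTTGCA"}
reads = ["ACGTAC", "ACGTTT", "ACGAAC"]  # the last read matches no reference k-mer

index = build_index(transcripts, K)
counts = count_read_kmers(reads, index, K)
```

In this toy setting the k-mer `ACGT` occurs in both transcripts, so its count is ambiguous and must be apportioned between them by a later inference step; k-mers unique to one transcript can be attributed directly.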
Sailfish achieves this speed without sacrificing accuracy. It implements an efficient, accelerated expectation-maximization algorithm for quantifying isoform abundance that produces high-quality results and can correct numerous types of systematic bias known to occur in RNA-seq experiments. In the paper, we demonstrate that, on both real and synthetic data, Sailfish is as accurate as existing read-mapping-based tools such as eXpress and Cufflinks.
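To make the expectation-maximization idea concrete, here is a minimal Python sketch of a plain (not accelerated) EM loop that apportions k-mer counts among the transcripts containing each k-mer. It is an illustration under simplifying assumptions, not the algorithm as implemented in Sailfish: each k-mer is assumed to occur at most once per transcript, no bias correction is applied, and the function name `em_abundance` and all input data are hypothetical.

```python
def em_abundance(counts, index, n_kmers, iterations=200):
    """Estimate the fraction of observed k-mers originating from each transcript.

    counts:  k-mer -> observed count
    index:   k-mer -> set of transcripts containing that k-mer
    n_kmers: transcript -> number of k-mers in that transcript
    """
    names = sorted(n_kmers)
    theta = {t: 1.0 / len(names) for t in names}  # uniform starting point
    total = float(sum(counts.values()))

    for _ in range(iterations):
        # E-step: split each k-mer's count among its transcripts in proportion
        # to the current abundance per k-mer position, theta[t] / n_kmers[t].
        assigned = {t: 0.0 for t in names}
        for km, c in counts.items():
            weights = {t: theta[t] / n_kmers[t] for t in index[km]}
            norm = sum(weights.values())
            if norm == 0.0:
                continue
            for t, w in weights.items():
                assigned[t] += c * w / norm
        # M-step: re-estimate abundances from the apportioned counts.
        theta = {t: assigned[t] / total for t in names}
    return theta


# Hypothetical toy input: one shared k-mer and one unique k-mer per transcript.
toy_index = {"AAAA": {"t1", "t2"}, "CCCC": {"t1"}, "GGGG": {"t2"}}
toy_counts = {"AAAA": 30, "CCCC": 30, "GGGG": 10}
toy_n_kmers = {"t1": 2, "t2": 2}

theta = em_abundance(toy_counts, toy_index, toy_n_kmers)
```

With these toy counts the unique k-mers favor `t1` three to one, so the shared k-mer's count is split in the same ratio and the estimates settle near 0.75 for `t1` and 0.25 for `t2`.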