With the development of next-generation sequencing (NGS), also called second-generation sequencing, genetic variation can be studied by sequencing the whole genome of an individual or by re-sequencing a target region. Single nucleotide polymorphisms (SNPs) are the most common form of genetic variation. SNP detection identifies both novel polymorphic sites and known polymorphic alleles in the target region.
Many SNP detection tools work on NGS data, and SOAPsnp is one of the most widely used. It accounts for the quality of the sequencing data as well as alignment and experimental errors, and uses a Bayesian model to calculate a quality score for each base. These quality scores serve as the criterion for consensus sequence calling. Combined with the prior probabilities of known dbSNP alleles, this gives a low error rate even for low-depth sequencing.
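The idea of Bayesian consensus calling can be illustrated with a minimal sketch. This is not SOAPsnp's actual model: it assumes a flat prior over the ten diploid genotypes, independent reads, and errors spread evenly over the three other bases, and it ignores the alignment-error and dbSNP-prior terms mentioned above. All function names (`genotype_posteriors`, `call_consensus`, and so on) are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, ...
GENOTYPES = ["".join(g) for g in combinations_with_replacement(BASES, 2)]


def phred_to_error(q):
    """Convert a Phred quality score to an error probability.

    Capped at 0.75 so that a Q0 base is simply uninformative.
    """
    return min(10.0 ** (-q / 10.0), 0.75)


def base_likelihood(observed, allele, q):
    """P(observed base | true allele), assuming errors hit the 3 other bases equally."""
    e = phred_to_error(q)
    return 1.0 - e if observed == allele else e / 3.0


def genotype_posteriors(pileup, prior=None):
    """Posterior over genotypes for one site.

    pileup: list of (base, phred_quality) pairs covering the site.
    prior:  optional dict genotype -> probability (flat if omitted, an assumption).
    """
    if prior is None:
        prior = {g: 1.0 / len(GENOTYPES) for g in GENOTYPES}
    log_post = {}
    for g in GENOTYPES:
        lp = math.log(prior[g])
        for base, q in pileup:
            # A read samples either allele of the genotype with probability 1/2.
            lik = 0.5 * base_likelihood(base, g[0], q) \
                + 0.5 * base_likelihood(base, g[1], q)
            lp += math.log(lik)
        log_post[g] = lp
    # Normalize in log space for numerical stability.
    m = max(log_post.values())
    total = sum(math.exp(v - m) for v in log_post.values())
    return {g: math.exp(v - m) / total for g, v in log_post.items()}


def call_consensus(pileup, prior=None):
    """Return the most probable genotype and a Phred-scaled quality capped at 99."""
    post = genotype_posteriors(pileup, prior)
    best = max(post, key=post.get)
    p_wrong = max(1.0 - post[best], 1e-10)
    quality = min(99, int(round(-10.0 * math.log10(p_wrong))))
    return best, quality
```

For example, `call_consensus([("A", 30)] * 5 + [("G", 30)] * 5)` selects the heterozygous genotype `AG`. Passing a non-flat `prior` is where known-allele information such as dbSNP frequencies could enter in this simplified setting.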
GSNP is a GPU implementation of SOAPsnp. It uses one GPU thread to process each independent site, and optimizes the program in two ways: (1) it uses a sparse data structure to store the aligned bases, reducing memory overhead; (2) it applies customized compression algorithms to reduce I/O overhead.
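The sparse-storage idea can be sketched as follows. This is an illustration only, not GSNP's actual layout: it assumes that each site's observations are indexed by base, quality score, position on the read, and strand, with the dimension sizes below chosen for illustration. A dense table would reserve one counter for every combination, while a sparse representation keeps only the combinations that actually occur. The helper names are hypothetical, and the per-site loop in `process_sites` stands in for the one-thread-per-site mapping on the GPU.

```python
from collections import Counter

# Assumed dimensions of the per-site observation space (illustrative values).
N_BASES, N_QUALS, N_COORDS, N_STRANDS = 4, 64, 256, 2
DENSE_CELLS = N_BASES * N_QUALS * N_COORDS * N_STRANDS  # counters per site if stored densely


def pack(base, qual, coord, strand):
    """Pack one observation (all fields as small integers) into a single integer key."""
    return ((base * N_QUALS + qual) * N_COORDS + coord) * N_STRANDS + strand


def unpack(key):
    """Inverse of pack(): recover (base, qual, coord, strand) from a key."""
    key, strand = divmod(key, N_STRANDS)
    key, coord = divmod(key, N_COORDS)
    base, qual = divmod(key, N_QUALS)
    return base, qual, coord, strand


def sparse_site(observations):
    """Store only the observed combinations as sorted (key, count) pairs."""
    counts = Counter(pack(*obs) for obs in observations)
    return sorted(counts.items())


def process_sites(sites, per_site_fn):
    """Sites are independent, so each call here could map to one GPU thread."""
    return [per_site_fn(site) for site in sites]
```

Under these assumed dimensions, a site covered by 30 reads needs at most 30 `(key, count)` pairs instead of `DENSE_CELLS` counters, which is the kind of memory saving the sparse structure is aiming for.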
With these optimizations, GSNP achieves a speedup of more than 40X over SOAPsnp. SOAPsnp originally needed 3 days to process a whole human genome, whereas GSNP completes the same work in only 2 hours.
- Validated 4/5/2018