Identifying new variants and calling known variants while rapidly aligning sequence data

P.M. VanRaden1, D.M. Bickhart1, and J.R. O’Connell2

1Animal Genomics and Improvement Laboratory, Agricultural Research Service, United States Department of Agriculture, Beltsville, Maryland, 20705-2350, USA
2University of Maryland Baltimore, Baltimore, Maryland 21201, USA

ABSTRACT

Background: Whole-genome sequencing studies can identify causative mutations for subsequent use in genomic evaluations, but sequence alignment and variant identification are lengthy and sometimes inaccurate processes. BWA, GATK, and SAMtools, which separate the alignment and calling steps, were compared with Findmap, which calls known variants during alignment to improve both the speed and the accuracy of sequence data processing.

Methods: Alignment with Findmap uses a hash table that includes both the reference map and the known alternate alleles. While aligning reads, Findmap reads the previous variant list, calls variant alleles, and sums the allele counts for each DNA source. Potential new single nucleotide variants (SNVs) and indel alleles are output for summary by the variant identification program Findvar. Strategies were tested using a cattle, human, or completely random reference map and simulated or actual data. Most tests simulated 10 bulls, each with 10× simulated sequence reads containing 39 million sequence variants from the 1000 Bull Genomes Project.
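The idea of indexing known alternate alleles alongside the reference can be illustrated with a minimal Python sketch. It is not Findmap's implementation: the k-mer length, the function names `build_index` and `align_and_call`, the restriction to SNVs, and the exact-match rule are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

K = 8  # hypothetical k-mer length, chosen only for this toy example


def build_index(reference, variants, k=K):
    """Map each k-mer to the set of reference start positions where it occurs.

    `variants` maps a 0-based position to a known alternate base (SNVs only).
    K-mers carrying an alternate allele are indexed at the same start position
    as the reference k-mer they replace.
    """
    index = defaultdict(set)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].add(start)
    for pos, alt in variants.items():
        alt_seq = reference[:pos] + alt + reference[pos + 1:]
        first = max(0, pos - k + 1)
        last = min(pos, len(reference) - k)
        for start in range(first, last + 1):
            index[alt_seq[start:start + k]].add(start)
    return index


def align_and_call(read, reference, variants, index, counts, k=K):
    """Place a read using its first k-mer and count alleles at known sites.

    Returns the start position, or None if no candidate matches. A base must
    equal either the reference base or the known alternate base at its site.
    """
    for start in sorted(index.get(read[:k], ())):
        if start + len(read) > len(reference):
            continue
        calls = []
        for offset, base in enumerate(read):
            pos = start + offset
            if base == reference[pos]:
                if pos in variants:
                    calls.append((pos, "ref"))
            elif variants.get(pos) == base:
                calls.append((pos, "alt"))
            else:
                break  # mismatch that is not a known variant
        else:
            for pos, allele in calls:
                counts[pos][allele] += 1
            return start
    return None


reference = "ACGTTGCAAGCTTAGGCTAACCGGTTAC"
variants = {10: "T"}  # hypothetical known SNV: C -> T at position 10
index = build_index(reference, variants)
counts = defaultdict(Counter)
for read in ("CAAGTTTAGGCT", "CAAGCTTAGGCT"):
    print(read, align_and_call(read, reference, variants, index, counts))
print(dict(counts[10]))
```

In this sketch a read carrying the alternate allele is placed directly, because its seed k-mer is already in the index, and the allele count is updated in the same pass as the alignment.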

Results: With 10 processors, clock times to process 100× data were 106 hours for BWA, 25 hours for GATK, and 11 hours for SAMtools, but only 3 hours for Findmap and 1 hour for Findvar. Memory required by BWA was 4.6 GB/processor, whereas Findmap required 46 GB that could be shared by 10 or more processors. Findmap correctly mapped 92.9% of reads (compared with 90.5% for BWA) and had high accuracy of calling alleles for known variants. For new variants, Findvar found 99.8% of SNVs, 79% of insertions, and 67% of deletions; GATK found 99.4%, 95%, and 90%; and SAMtools found 99.8%, 12%, and 16%, respectively. False positives for SNVs, insertions, and deletions, as percentages of true variants, were 10%, 0.4%, and 0.3% from Findvar; 12%, 8.4%, and 2.9% from GATK; and 37%, 1.3%, and 0.4% from SAMtools, respectively.
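The clock times and memory figures above can be combined with a few lines of arithmetic. The sketch below assumes that each conventional pipeline runs its alignment and calling steps one after the other, so that their clock times add; the pipeline groupings and variable names are illustrative and not taken from the study.

```python
# Clock times in hours for 100x data on 10 processors, as reported above.
hours = {"BWA": 106, "GATK": 25, "SAMtools": 11, "Findmap": 3, "Findvar": 1}

# Assumed groupings: alignment step followed by a calling step.
pipelines = {
    "BWA + GATK": ("BWA", "GATK"),
    "BWA + SAMtools": ("BWA", "SAMtools"),
    "Findmap + Findvar": ("Findmap", "Findvar"),
}

# Total clock time per pipeline, assuming the steps run sequentially.
totals = {name: sum(hours[step] for step in steps)
          for name, steps in pipelines.items()}

# Ratio of each pipeline's total to the Findmap + Findvar total.
baseline = totals["Findmap + Findvar"]
ratios = {name: total / baseline for name, total in totals.items()}

# Memory: BWA needs 4.6 GB per processor; Findmap shares one 46 GB table.
processors = 10
bwa_total_memory_gb = 4.6 * processors
findmap_total_memory_gb = 46.0

for name in pipelines:
    print(f"{name}: {totals[name]} h, {ratios[name]:.2f}x the Findmap + Findvar time")
print(f"BWA memory on {processors} processors: {bwa_total_memory_gb:.1f} GB")
print(f"Findmap shared memory: {findmap_total_memory_gb:.1f} GB")
```

Under these assumptions the totals are 131, 117, and 4 hours, and total memory use at 10 processors is about the same for BWA and Findmap.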

Conclusions: The advantages of Findmap and Findvar are processing that is 10 to 30 times faster than with current open-source software, more precise alignment, more useful data summaries, more compact output, and fewer intermediate steps. Calling known variants and identifying new variants during alignment enables more efficient and accurate sequence-based genotyping.