||Align sequence reads to reference map, call previous variants, and identify new variants
||Version 2.1 programs, example and test outputs, and executables
(released October 1, 2018)
- Findmap alignment and Findvar variant calling series
- Programs mapsim.f90, storemap.f90, map2seq.f90, findmap.f90, findvar.f90, leftmost.f90, and depth2vcf.c are included. Skip program mapsim when using actual map and variant files; skip map2seq when using actual fastq reads. Programs leftmost and depth2vcf are optional for converting to other formats.
- To save space, the zip download does not include the actual reference genome, variant list, or fastq sequence read files. Download findmapV2.1.zip onto a computer with the Unix operating system. Type unzip findmapV2.1.zip, and hit enter.
- After unzipping the package, type runsim.script to run the program series script and generate an example simulated reference map, a simulated variant file, simulated fastq files, and other program input and output files. Type runmap.script to run the alignment and variant calling programs.
- The programs will execute in the Test_Output directory using the options files there. The options files give more detail on available options and recommended values. The output files in Test_Output should match those in Example_Output fairly closely.
Each new project should be run in a new directory because standard file names are used.
- For quick testing, the mapsim.options and map2seq.options are set to just 1 chromosome, and you must then reset them to 30 chromosome pairs for cattle (or other numbers to match your reference.fa file). Similarly, maxhash is set to 87654321 and maxdup to 10 million in storemap.options to reduce memory in initial testing, but values of 987654321 and 300 million are recommended for a 3-Gbase genome.
- For markersim.f90, the optional input file genetic.cor is in the subdirectory Example_Output.
- Do not include all of the unmapped contigs in the fasta map file as separate chromosomes because those would increase memory for little gain. From the human reference hg38.fa, we used 25 chromosomes (including 22 autosomes plus X, Y, and mitochondrial DNA) and reformatted the file using fasta.sas
- The variant list from the 1,000 Bull Genomes Project (Daetwyler et al., 2014) should be concatenated across chromosomes and then reformatted to variants.prior using program variant.sas. The main reason to reformat is that
indel notation and locations used in findmap differ from vcf notation. The variant.sas program can also reformat the 00-common_all.vcf human variant
- Option newvar should be 0 when processing actual data so that all known variants will be used. In mapsim.options, newvar can be set to 5 (or other
odd number) to exclude every 5th variant and demonstrate detection of those "new" variants. Program testvar.sas then tests the accuracy of variant calls, both for previously known and newly discovered variants.
- Program map2seq simulates genotypes for the variant list in groups of 4, with every 4th variant homozygous alternate allele, every 2nd variant homozygous reference, and the 1st and 3rd variants heterozygous or optionally can read the file genotypes.true instead to declare the variant genotypes for each DNA source.
||Standard fasta format for reference genome:
> as 1st character in line for each new chromosome
50-byte lines of ACGT (or N for unknown bases), or acgt for repeated sections
All programs in this series treat lower and uppercase as the same because storemap identifies, counts, stores, and links repeated k-mers to each other while hashing the reference map
||Lists all previously known SNPs and indels
Insertions are reported 1 base to the left of the 1st base where they differ from the reference genome, reading left to right
Deletions are reported at their detected location, not 1 base to the left
Use variants.sas to reformat the 1,000 bull genomes variant file
Format: chr# location vartype (SNP, INS, or DEL) variant# length alternate_allele
||List of DNA source names such as source1, source2, etc., along with numeric IDs
|Standard fastq format for paired end reads, with reads 1 and 2 of each pair at the same position in 2 separate files for each DNA source
||Program control file with user-defined options
||Hash table, etc., for reference map
||Unformatted map for faster input
||Number of ref and alt alleles, 1 row/variant
Format: variant# chr# var_location ref# alt#
||Format: ID# chip# #SNPs
Read counts for A and B alleles stored in 1-byte hexadecimal format (input format for imputation program findhap4; VanRaden et al., 2015).
||Alignments, errors, and known variant locations for segments where paired end locations differ by <fraglen
Format: segment# pair# direction chr# segment_location num_alts num_errs (var_locations var_type) (err_locations err_base)
||Same format, but for segments where paired end locations do not match
||Locations and properties of new indels detected (those not already in variants.txt)
Format: segment# pair# direction chr# seg_location indel_size indel_location bases (inserted or deleted)
||Summary of new SNPs including read depth and number of alternate alleles found
Locations can have >1 row if differing alternate alleles are observed
Format: chr# SNP_location read_depth num_alt ref_allele alt_allele
||Summary of new indels including read depth and number of alternate alleles found
Locations can have >1 row if differing alternate alleles are observed
Format: chr# indel_location read_depth num_alt ref_allele alt_allele
||Combined list of prior and new variants in same format as variants.prior
||VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. Calling known variants and identifying new variants while rapidly aligning sequence data. J. Dairy Sci. (submitted). 2018. (abstract)
||VanRaden, P.M., and D.M. Bickhart. Fast single-pass alignment and variant calling using sequencing data. Plant Anim. Genome XXIV Conf., San Diego, CA, Jan. 9–13, W161. (presentation slides)
||VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. Identifying and calling insertions, deletions, and single-base mutations efficiently from sequence data. J. Dairy Sci. 99(E-Suppl. 1):140(abstr. 0302). 2016. (presentation slides)
||VanRaden, P.M., C. Sun, and J.R. O'Connell. Fast imputation using medium or low-coverage sequence data. BMC Genet. 16(82). (presentation slides)
||Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. Van Tassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genet. 46:858–865.
||Program findmap.f90 and the others in this package are free, public domain, and were developed with U.S. taxpayer funding. Accurate results are not guaranteed. Please report any bugs to Paul.VanRaden@ars.usda.gov. You may modify, improve, use, and redistribute the code to anyone for any purpose. Or, you can ask Paul to make changes that could improve our use and your use of this software.
Paul M. VanRaden