AIP Software: findmap.f90

findmap.f90: Align sequence reads to a reference map, call previously known variants, and identify new variants

Downloads

Version 1 programs, example outputs, and executables (released July 19, 2016; last updated July 28, 2016)

  • Programs mapsim.f90, storemap.f90, map2seq.f90, findmap.f90, and findvar.f90 are included. Skip program mapsim when using actual map and variant files; skip map2seq when using actual fastq reads.
  • To save space, the zip download does not include the actual reference genome, variant list, or fastq sequence read files.
  • After unzipping the package, type runmap to run the program series script and generate an example simulated reference map, a simulated variant file, simulated fastq files, and other program input and output files.
  • For quick testing, mapsim.options and map2seq.options are set to just 1 chromosome; reset them to 30 chromosome pairs for cattle (or another number to match your reference.fa file) before processing a full genome. Similarly, maxhash is set to 87654321 and maxdup to 10 million in storemap.options to reduce memory in initial testing, but values of 987654321 and 300 million are recommended for a 3-Gbase genome.
  • Do not include all of the unmapped contigs in the fasta map file as separate chromosomes because those would increase memory for little gain. From the human reference hg38.fa, we used 25 chromosomes (including 22 autosomes plus X, Y, and mitochondrial DNA) and reformatted the file using fasta.sas (included).
  • The variant list from the 1,000 bull genomes project (Daetwyler et al., 2014) should be concatenated across chromosomes and then reformatted to variants.prior using program variant.sas. A main reason to reformat is that the indel notation and locations used in findmap differ from the vcf notation. The variant.sas program can also reformat the 00-common_all.vcf human variant file.
  • Option newvar should be 0 when processing actual data so that all known variants will be used. In storemap.options and findmap.options, newvar can be set to 5 (or other odd number) to exclude every 5th variant and demonstrate detection of those "new" variants. Program testvar.sas then tests the accuracy of variant calls, both for previously known and newly discovered variants.
  • Program map2seq simulates genotypes for the variant list in groups of 4: within each group, the 1st and 3rd variants are heterozygous, the 2nd is homozygous for the reference allele, and the 4th is homozygous for the alternate allele. Future versions could read a genotype file instead.
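
The genotype cycle used by map2seq and the newvar holdout described in the list above can be sketched in a few lines of Python. This is an illustration only, not code from the package: the function names, the 1-based variant numbering, and the rule that "every 5th variant" means variant numbers divisible by newvar are all assumptions.

```python
# Illustrative sketch only. Function names, 1-based numbering, and the
# "divisible by newvar" holdout rule are assumptions, not taken from
# map2seq.f90, storemap.f90, or findmap.f90.

def simulated_genotype(variant_number: int) -> str:
    """Genotype class by position within a group of 4 variants."""
    position = (variant_number - 1) % 4 + 1  # 1..4 within the group
    if position == 4:
        return "hom_alt"  # 4th variant: homozygous alternate
    if position == 2:
        return "hom_ref"  # 2nd variant: homozygous reference
    return "het"          # 1st and 3rd variants: heterozygous


def is_held_out(variant_number: int, newvar: int) -> bool:
    """True if the variant would be hidden and treated as 'new' (assumed rule)."""
    return newvar > 0 and variant_number % newvar == 0


if __name__ == "__main__":
    for n in range(1, 21):
        print(n, simulated_genotype(n), is_held_out(n, 5))
```

Under these assumptions, the variants held out with newvar = 5 (numbers 5, 10, 15, 20) fall on different positions of the 4-variant cycle, so all three genotype classes appear among the "new" variants. That may be one reason an odd value is suggested, though the page does not say so.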

Inputs

reference.fa: Standard fasta format for reference genome
  • > as 1st character in line for each new chromosome
  • 50-byte lines of ACGT (or N for unknown bases), or acgt for repeated sections
  • All programs in this series treat lowercase and uppercase as the same because storemap identifies, counts, stores, and links repeated k-mers to each other while hashing the reference map
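
As a rough illustration of the reference.fa layout described above, the following Python sketch reads a fasta file into one string per chromosome and folds lowercase bases to uppercase. The function name and the choice to keep only the first word of each header line are assumptions made for the example; the package's own Fortran reader may work differently.

```python
import io
from typing import Dict, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return {chromosome_name: sequence}, with all bases uppercased."""
    chromosomes: Dict[str, str] = {}
    name = None
    parts = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                chromosomes[name] = "".join(parts)
            fields = line[1:].split()
            name = fields[0] if fields else ""  # first word of the header (assumption)
            parts = []
        else:
            parts.append(line.upper())  # acgt (repeats) treated the same as ACGT
    if name is not None:
        chromosomes[name] = "".join(parts)
    return chromosomes


if __name__ == "__main__":
    example = io.StringIO(">chr1\nACGTacgt\nNNac\n>chr2\nggcc\n")
    print(read_fasta(example))
```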
variants.prior: Lists all previously known SNPs and indels
  • Insertions are reported 1 base to the left of the 1st base where they differ from the reference genome, reading left to right
  • Deletions are reported at their detected location, not 1 base to the left
  • Use variant.sas to reformat the 1,000 bull genomes variant file
  • Format: chr# location vartype (SNP, INS, or DEL) variant# length alternate_allele
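
A record in the variants.prior layout listed above could be read along the following lines. This is a sketch under stated assumptions: fields are taken to be whitespace-delimited, the alternate allele is allowed to be absent (for example, for a deletion), and the class name, function name, and sample values are made up for illustration.

```python
from typing import NamedTuple


class PriorVariant(NamedTuple):
    chrom: str            # chr# kept as text; numeric coding is not assumed
    location: int
    vartype: str          # SNP, INS, or DEL
    variant_number: int
    length: int
    alt_allele: str       # may be empty if the file omits it (assumption)


def parse_prior_line(line: str) -> PriorVariant:
    """Parse one whitespace-delimited variants.prior record (assumed layout)."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    chrom, location, vartype, number, length = fields[:5]
    if vartype not in ("SNP", "INS", "DEL"):
        raise ValueError(f"unknown variant type: {vartype}")
    alt = fields[5] if len(fields) > 5 else ""
    return PriorVariant(chrom, int(location), vartype, int(number), int(length), alt)


if __name__ == "__main__":
    # Made-up example record, not taken from any real variant file.
    print(parse_prior_line("1 12345 SNP 1 1 T"))
```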
fastq.filelist: List of DNA source names such as source1, source2, etc., along with numeric IDs
source1.1.fq, source1.2.fq, source2.1.fq, source2.2.fq, etc.: Standard fastq format for paired-end reads, with reads 1 and 2 of each pair at the same position in 2 separate files for each DNA source
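
Because reads 1 and 2 of each pair sit at the same position in their two files, the files can be walked in lockstep. The Python sketch below shows that idea for standard 4-line fastq records; the function names are hypothetical, and it is not a description of how findmap itself reads the files.

```python
import io
from itertools import islice, zip_longest
from typing import Iterator, TextIO, Tuple

Read = Tuple[str, str, str]  # (name, bases, qualities)


def read_fastq(handle: TextIO) -> Iterator[Read]:
    """Yield reads from a fastq stream, 4 lines per record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # end of file
        if (len(record) < 4 or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError("malformed fastq record")
        yield record[0][1:], record[1], record[3]


def read_pairs(handle1: TextIO, handle2: TextIO) -> Iterator[Tuple[Read, Read]]:
    """Yield (read 1, read 2) for each pair, matched by position in the files."""
    for read1, read2 in zip_longest(read_fastq(handle1), read_fastq(handle2)):
        if read1 is None or read2 is None:
            raise ValueError("the two files hold different numbers of reads")
        yield read1, read2


if __name__ == "__main__":
    f1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n")
    f2 = io.StringIO("@r1/2\nTTAA\n+\nIIII\n")
    for pair in read_pairs(f1, f2):
        print(pair)
```

With real data, the two handles would be opened on files such as source1.1.fq and source1.2.fq.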
*.options: Program control file with user-defined options

Outputs

storemap.unf: Hash table, etc., for reference map
reference.unf: Unformatted map for faster input
variant.readdepth: Number of ref and alt alleles, 1 row/variant
  • Format: variant# chr# var_location ref# alt#
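
One simple use of variant.readdepth is to compute the fraction of reads carrying the alternate allele at each variant. The sketch below assumes whitespace-delimited rows in the order given above; the function name and the sample rows are made up.

```python
from typing import Optional


def alt_fraction(line: str) -> Optional[float]:
    """Fraction of reads with the alternate allele, or None if depth is 0."""
    variant_number, chrom, location, ref_count, alt_count = line.split()
    ref_n, alt_n = int(ref_count), int(alt_count)
    depth = ref_n + alt_n
    if depth == 0:
        return None  # no reads covered this variant
    return alt_n / depth


if __name__ == "__main__":
    # Made-up rows: variant# chr# var_location ref# alt#
    for row in ("7 1 12345 6 2", "8 1 20000 0 0"):
        print(row, "->", alt_fraction(row))
```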
individual.readdepth: Read counts for A and B alleles stored in 1-byte hexadecimal format (input format for imputation program findhap4; VanRaden et al., 2015)
  • Format: ID# chip# #SNPs
segments.found: Alignments, errors, and known variant locations for segments where paired-end locations differ by <fraglen
  • Format: segment# pair# direction chr# segment_location num_alts num_errs (var_locations var_type) (err_locations err_base)
segments.lost: Same format as segments.found, but for segments where paired-end locations do not match
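
The split between segments.found and segments.lost can be pictured as a distance test on the two ends of a pair. The Python sketch below is only a guess at the rule's general shape: the requirement that both ends map to the same chromosome, the function name, and the example coordinates are assumptions, and unmapped ends are not handled.

```python
def classify_pair(chrom1: str, loc1: int, chrom2: str, loc2: int, fraglen: int) -> str:
    """Return 'found' if the two ends map within fraglen of each other, else 'lost'."""
    # Assumption: ends on different chromosomes never count as matching.
    if chrom1 == chrom2 and abs(loc1 - loc2) < fraglen:
        return "found"
    return "lost"


if __name__ == "__main__":
    print(classify_pair("1", 1000, "1", 1400, fraglen=500))  # ends 400 bases apart
    print(classify_pair("1", 1000, "1", 9000, fraglen=500))  # ends 8000 bases apart
    print(classify_pair("1", 1000, "2", 1100, fraglen=500))  # different chromosomes
```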
segments.newindels: Locations and properties of new indels detected (those not already in variants.prior)
  • Format: segment# pair# direction chr# seg_location indel_size indel_location bases (inserted or deleted)
SNPs.new: Summary of new SNPs including read depth and number of alternate alleles found
  • Locations can have >1 row if differing alternate alleles are observed
  • Format: chr# SNP_location read_depth num_alt ref_allele alt_allele
indels.new: Summary of new indels including read depth and number of alternate alleles found
  • Locations can have >1 row if differing alternate alleles are observed
  • Format: chr# indel_location read_depth num_alt ref_allele alt_allele
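
Since a location can appear on more than one row in SNPs.new or indels.new, downstream code may want to collect rows by location before using them. The following Python sketch does that under the assumption that every row has exactly the 6 whitespace-delimited fields listed above; the function name and sample rows are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Location = Tuple[str, int]            # (chr#, location)
AltRecord = Tuple[str, str, int, int]  # (ref_allele, alt_allele, num_alt, read_depth)


def group_by_location(lines: Iterable[str]) -> Dict[Location, List[AltRecord]]:
    """Collect all rows that share a chromosome and location."""
    grouped: Dict[Location, List[AltRecord]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
        chrom, location, depth, num_alt, ref, alt = fields
        grouped[(chrom, int(location))].append((ref, alt, int(num_alt), int(depth)))
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows: chr# SNP_location read_depth num_alt ref_allele alt_allele
    rows = ["1 5000 12 5 A G", "1 5000 12 2 A T", "2 800 9 9 C T"]
    for location, records in group_by_location(rows).items():
        print(location, records)
```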
variants.all: Combined list of prior and new variants in same format as variants.prior

References

VanRaden, P.M., and D.M. Bickhart. 2016. Fast single-pass alignment and variant calling using sequencing data. Plant Anim. Genome XXIV Conf., San Diego, CA, Jan. 9–13, W161.
VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. 2016. Identifying and calling insertions, deletions, and single-base mutations efficiently from sequence data. J. Dairy Sci. 99(E-Suppl. 1):140(abstr. 0302).
VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. 2016. Identifying new variants and calling known variants while rapidly aligning sequence data. BMC Genet. (submitted).
VanRaden, P.M., C. Sun, and J.R. O'Connell. 2015. Fast imputation using medium or low-coverage sequence data. BMC Genet. 16:82.
Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. Van Tassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. 2014. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genet. 46:858–865.

License

Program findmap.f90 and the others in this package are free, public domain, and were developed with U.S. taxpayer funding. Accurate results are not guaranteed. Please report any bugs to Paul.VanRaden@ars.usda.gov. You may modify, improve, use, and redistribute the code to anyone for any purpose. Or, you can ask Paul to make changes that could improve our use and your use of this software.


Paul M. VanRaden
Animal Genomics and Improvement Laboratory
Agricultural Research Service, USDA

Last Modified: 01/12/2017