Makefile description¶
The following sections review the options available in the BAM pipeline makefiles. As described in the Pipeline usage section, a default makefile may be generated using the paleomix bam_pipeline makefile command. For clarity, the location of options in subsections are specified by concatenating the names using :: as a separator. Thus, in the following (simplified example), the UseSeed option (line 13) would be referred to as Options :: Aligners :: BWA :: UseSeed:
1Options:
2
3 # Settings for aligners supported by the pipeline
4 Aligners:
5 # Choice of aligner software to use, either "BWA" or "Bowtie2"
6 Program: BWA
7
8 # Settings for mappings performed using BWA
9 BWA:
10 # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
11 # Postmortem damage localizes to the seed region, which BWA expects to
12 # have few errors (sets "-l"). See http://pmid.us/22574660
13 UseSeed: yes
Specifying command-line options¶
For several programs it is possible to directly specify command-line options; this is accomplished in one of 3 ways; firstly, for command-line options that take a single value, this is accomplished simply by specifying the option and value as any other option. For example, if we wish to supply the option --mm 5 to AdapterRemoval, then we would list it as --mm: 5 (all other options omitted for brevity):
AdapterRemoval:
--mm: 5
For options that do not take any values, such as the AdapterRemoval --trimns (enabling the trimming of Ns in the reads), these are specified either as --trimmns: `, with the value left blank, or as `--trimns: yes. The following are therefore equivalent:
AdapterRemoval:
--trimns: # Method 1
--trimns: yes # Method 2
In some cases the BAM pipeline will enable features by default, but still allow these to be overridden. In those cases, the feature can be disabled by setting the value to no (without quotes), as shown here:
AdapterRemoval:
--trimns: no
If you need to provide the text "yes" or "no" as the value for an option, it is necessary to put these in quotes:
--my-option: "yes"
--my-option: "no"
In some cases it is possible or even necessary to specify an option multiple times. Due to the way YAML works, this is not possible to do so directly. Instead, the pipeline allows multiple instances of the same option by providing these as a list:
--my-option:
- "yes"
- "no"
- "maybe"
The above will be translated into calling the program in question with the options --my-option yes --my-option no --my-option maybe.
Options¶
By default, the Options section of the makefile contains the following:
2# Default options.
3# Can also be specific for a set of samples, libraries, and lanes,
4# by including the "Options" hierarchy at the same level as those
5# samples, libraries, or lanes below.
6Options:
7 # Sequencing platform, see SAM/BAM reference for valid values
8 Platform: Illumina
9 # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
10 # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
11 # specify 'Solexa', to handle reads on the Solexa scale. This is
12 # used during adapter-trimming and sequence alignment
13 QualityOffset: 33
14
15 # Settings for trimming of reads, see AdapterRemoval man-page
16 AdapterRemoval:
17 # Use AdapterRemoval version 2 or version 3
18 Version: 2
19
20 # Set and uncomment to override defaults adapter sequences
21# --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
22# --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
23 # Some BAM pipeline defaults differ from AR defaults;
24 # To override, change these value(s):
25 --minlength: 25
26 # Extra features enabled by default; change 'yes' to 'no' to disable
27 --collapse: yes
28 --trimns: yes
29 --trimqualities: yes
30 # Increase the maximum Phred allowed for input FASTQs, as well as for merged bases
31 # when using --collapse (default = 41). This is needed for some modern FASTQs.
32# --qualitymax: 42
33
34 # Settings for aligners supported by the pipeline
35 Aligners:
36 # Choice of aligner software to use, either "BWA" or "Bowtie2"
37 Program: BWA
38
39 # Settings for mappings performed using BWA
40 BWA:
41 # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
42 # for a description of each algorithm (defaults to 'backtrack')
43 Algorithm: backtrack
44 # Filter aligned reads with a mapping quality (Phred) below this value
45 MinQuality: 0
46 # Filter reads that did not map to the reference sequence
47 FilterUnmappedReads: yes
48 # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
49 # Postmortem damage localizes to the seed region, which BWA expects to
50 # have few errors (sets "-l"). See http://pmid.us/22574660
51 UseSeed: yes
52 # Additional command-line options may be specified below. For 'backtrack' these
53 # are applied to the "bwa aln". See Bowtie2 for more examples.
54# -n: 0.04
55
56 # Settings for mappings performed using Bowtie2
57 Bowtie2:
58 # Filter aligned reads with a mapping quality (Phred) below this value
59 MinQuality: 0
60 # Filter reads that did not map to the reference sequence
61 FilterUnmappedReads: yes
62 # Examples of how to add additional command-line options
63# --trim5: 5
64# --trim3: 5
65 # Note that the colon is required, even if no value is specified
66 --very-sensitive:
67 # Example of how to specify multiple values for an option
68# --rg:
69# - CN:SequencingCenterNameHere
70# - DS:DescriptionOfReadGroup
71
72 # Command-line options for mapDamage; use long-form options(--length not -l):
73 mapDamage:
74 # By default, the pipeline will downsample the input to 100k hits
75 # when running mapDamage; remove to use all hits
76 --downsample: 100000
77
78 # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
79 # possible read-types reflect the output of AdapterRemoval
80 ExcludeReads:
81 # Exclude single-end reads (yes / no)?
82 Single: no
83 # Exclude non-collapsed paired-end reads (yes / no)?
84 Paired: no
85 # Exclude paired-end reads for which the mate was discarded (yes / no)?
86 Singleton: no
87 # Exclude overlapping paired-ended reads collapsed into a single sequence
88 # by AdapterRemoval (yes / no)?
89 Collapsed: no
90 # Like 'Collapsed', but only for collapsed reads truncated due to the
91 # presence of ambiguous or low quality bases at read termini (yes / no).
92 CollapsedTruncated: no
93
94 # Optional steps to perform during processing.
95 Features:
96 # If set to 'filter', PCR duplicates are removed from the output files; if set to
97 # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the
98 # output files; if set to 'no', the reads are assumed to not have been amplified.
99 PCRDuplicates: filter
100 # Set to 'no' to disable mapDamage; set to 'plots' to build basic mapDamage plots;
101 # set to 'model' to build plots and postmortem damage models; and set to 'rescale'
102 # to build plots, models, and BAMs with rescaled quality scores. All analyses are
103 # carried out per library.
104 mapDamage: plot
105 # Generate coverage information for the final BAM and for each 'RegionsOfInterest'
106 # specified in 'Prefixes' (yes / no).
107 Coverage: yes
108 # Generate histograms of number of sites with a given read-depth, from 0 to 200,
109 # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no).
110 Depths: yes
111 # Generate summary table for each target (yes / no)
112 Summary: yes
General Options¶
- Options :: Platform
7 # Sequencing platform, see SAM/BAM reference for valid values 8 Platform: Illumina
The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The SAM/BAM specification the valid platforms, which currently include
CAPILLARY,HELICOS,ILLUMINA,IONTORRENT,LS454,ONT,PACBIO, andSOLID.- Options :: QualityOffset
9 # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+) 10 # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to 11 # specify 'Solexa', to handle reads on the Solexa scale. This is 12 # used during adapter-trimming and sequence alignment 13 QualityOffset: 33
The QualityOffset option refers to the starting ASCII value used to encode Phred quality-scores in user-provided FASTQ files, with the possible values of 33, 64, and
Solexa. For most modern data, this will be 33, corresponding to ASCII characters in the range!toJ. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range@toh, and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from;toh. For a visual representation of this, refer to the Wikipedia article linked above.
Adapter Trimming¶
The "AdapterRemoval" subsection allows for options that are applied when AdapterRemoval is applied to the FASTQ reads supplied by the user. For a more detailed description of command-line options, please refer to the AdapterRemoval documentation. A few important particularly options are described here:
- Options :: AdapterRemoval :: --Version
17 # Use AdapterRemoval version 2 or version 3 18 Version: 2
Use either AdapterRemoval version 2 or version 3. AdapterRemoval version 3 is recommended for new projects.
- Options :: AdapterRemoval :: --adapter1 and Options :: AdapterRemoval :: --adapter2
20 # Set and uncomment to override defaults adapter sequences 21# --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG 22# --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination. Thus, the sequence provided for
--adapter1is expected to be found in the mate 1 reads, and the sequence specified for--adapter2is expected to be found in the mate 2 reads. In both cases, these should be specified as in the orientation that appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. When using AdapterRemoval v3, appropriate adapters will automatically be selected, if not manually specified. Please refer to the AdapterRemoval documentation for more information.Note
As of version AdapterRemoval 2.1, it is possible to use multiple threads to speed up trimming of adapter sequences. This is accomplished not by setting the
--threadscommand-line option in the makefile, but by supplying the--adapterremoval-max-threadsoption to the BAM pipeline itself:$ paleomix bam run makefile.yaml --adapterremoval-max-threads 2
Options :: AdapterRemoval :: --minlength
25 --minlength: 25The minimum length required after read merging, adapter trimming, and base-quality quality trimming; resulting reads shorter than this length are discarded, and thereby excluded from further analyses by the pipeline. A value of at least 25 bp is recommended to cut down on the rate of spurious alignments; if possible, a value of 30 bp may be used to greatly reduce the fraction of spurious alignments, with smaller gains for greater minimums [Schubert2012].
Warning
The default value used by PALEOMIX for
--minlength(25 bp) differs from the default value for AdapterRemoval (15 bp). Thus, if a minimum length of 15 bp is desired, it is necessarily to explicitly state so in the makefile, simply commenting out this command-line argument is not sufficient.
- Options :: AdapterRemoval :: --collapse
27 --collapse: yes
If enabled, AdapterRemoval will attempt to combine overlapping paired-end reads into a single (potentially longer) sequence. This has at least two advantages, namely that longer reads allow for less ambiguous alignments against the target reference genome, and that the fidelity of the overlapping region (potentially the entire read) is improved by selecting the highest quality base when discrepancies are observed. The names of reads thus merged are prefixed with either
M_orMT_, with the latter marking reads that have been trimmed from the 5' or 3' termini following collapse, and which therefore do not represent the full insert. To disable this behavior, set the option tono(without quotes):--collapse: yes # Option enabled --collapse: no # Option disabled
Note
This option may be combined with the
ExcludeReadsoption (see below), to either eliminate or select for short inserts, depending on the expectations from the experiment. I.e. for ancient samples, where the most inserts should be short enough to allow collapsing (< 2x read read - 11, by default), excluding paired (uncollapsed) and singleton reads may help reduce the fraction of exogenous DNA mapped.- Options :: AdapterRemoval :: --trimns
28 --trimns: yes
If set to
yes(without quotes), AdapterRemoval will trim uncalled bases ('N') from the 5' and 3' end of the reads. Trimming will stop at the first called base ('A', 'C', 'G', or 'T'). If both--trimnsand--trimqualitiesare enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' end of the reads. To disable, set the option tono(without quotes):--trimns: yes # Option enabled --trimns: no # Option disabled
Note
This option has no effect when using AdapterRemoval v3.
- Options :: AdapterRemoval :: --trimqualities
29 --trimqualities: yes
If set to
yes(without quotes), AdapterRemoval will trim low-quality bases from the 5' and 3' end of the reads. Trimming will stop at the first base which is greater than the (Phred encoded) minimum quality score specified using the command-line option--minquality. This value defaults to 2. If both--trimnsand--trimqualitiesare enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' end of the reads. To disable, set the option tono(without quotes):--trimqualities: yes # Option enabled --trimqualities: no # Option disabled
Note
This option has no effect when using AdapterRemoval v3.
- Options :: AdapterRemoval :: --qualitymax
30 # Increase the maximum Phred allowed for input FASTQs, as well as for merged bases 31 # when using --collapse (default = 41). This is needed for some modern FASTQs. 32# --qualitymax: 42
When using AdapterRemoval v2, this increases the maximum allowed Phred score for input data, as well as the maximum allowed Phred score for (merged) output. For AdapterRemoval v2, this option is only in effect if --merge-strategy additive is used.
Short read aligners¶
This section allows selection between supported short read aligners (currently BWA [Li2009a] and Bowtie2 [Langmead2012]), as well as setting options for these, individually:
34 # Settings for aligners supported by the pipeline
35 Aligners:
36 # Choice of aligner software to use, either "BWA" or "Bowtie2"
37 Program: BWA
To select a mapping program, set the Program option appropriately:
Program: BWA # Using BWA to map reads
Program: Bowtie2 # Using Bowtie2 to map reads
Short read aligners - BWA¶
The following options are applied only when running the BWA short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.
39 # Settings for mappings performed using BWA 40 BWA: 41 # One of "backtrack", "bwasw", or "mem"; see the BWA documentation 42 # for a description of each algorithm (defaults to 'backtrack') 43 Algorithm: backtrack 44 # Filter aligned reads with a mapping quality (Phred) below this value 45 MinQuality: 0 46 # Filter reads that did not map to the reference sequence 47 FilterUnmappedReads: yes 48 # May be disabled ("no") for aDNA alignments with the 'aln' algorithm. 49 # Postmortem damage localizes to the seed region, which BWA expects to 50 # have few errors (sets "-l"). See http://pmid.us/22574660 51 UseSeed: yes 52 # Additional command-line options may be specified below. For 'backtrack' these 53 # are applied to the "bwa aln". See Bowtie2 for more examples. 54# -n: 0.04
- Options :: Aligners :: BWA :: Algorithm
41 # One of "backtrack", "bwasw", or "mem"; see the BWA documentation 42 # for a description of each algorithm (defaults to 'backtrack') 43 Algorithm: backtrackThe mapping algorithm to use; options are
backtrack(corresponding tobwa aln),bwasw, andmem. Additional command-line options may be specified for these. Algorithms are selected as follows:Algorithm: backtrack # 'Backtrack' algorithm, using the command 'bwa aln' Algorithm: bwasw # 'SW' algorithm for long queries, using the command 'bwa bwasw' Algorithm: mem # 'mem' algorithm, using the command 'bwa mem'Warning
Alignment algorithms
bwaswandmemcurrently cannot be used with input data that is encoded using QualityOffset 64 orSolexa. This is a limitation of PALEOMIX, and will be resolved in future versions. In the meantime, this can be circumvented by converting FASTQ reads to the standard quality-offset 33, using for example seqtk.- Options :: Aligners :: BWA :: MinQuality
45 MinQuality: 0 46 # Filter reads that did not map to the reference sequenceSpecifies the minimum mapping quality of alignments produced by BWA. Any aligned read with a quality score below this value are removed during the mapping process. Note that while unmapped read have a quality of zero, these are not excluded by a non-zero
MinQualityvalue. To filter unmapped reads, use the optionFilterUnmappedReads(see below). To set this option, replace the default value with a desired minimum:MinQuality: 0 # Keep all hits MinQuality: 25 # Keep only hits where mapping-quality >= 25- Options :: Aligners :: BWA :: FilterUnmappedReads
46 # Filter reads that did not map to the reference sequence 47 FilterUnmappedReads: yesSpecifies whether unmapped reads (reads not aligned to a target sequence) are to be retained in the resulting BAM files. If set to
yes(without quotes), all unmapped reads are discarded during the mapping process, while setting the option tono(without quotes) retains these reads in the BAM. By convention, paired reads in which one mate is unmapped are assigned the same chromosome and position, while no chromosome / position are assigned to unmapped single-end reads. To change this setting, replace the value with eitheryesorno(without quotes):FilterUnmappedReads: yes # Remove unmapped reads during alignment FilterUnmappedReads: no # Keep unmapped readsOptions :: Aligners :: BWA :: *
Additional command-line options may be specified for the selected alignment algorithm, as described in the "Specifying command-line options" section above. See also the examples listed for Bowtie2 below. Note that for the
backtrackalgorithm, it is only possible to specify options for thebwa alncall.
Short read aligners - Bowtie2¶
The following options are applied only when running the Bowtie2 short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.
56 # Settings for mappings performed using Bowtie2 57 Bowtie2: 58 # Filter aligned reads with a mapping quality (Phred) below this value 59 MinQuality: 0 60 # Filter reads that did not map to the reference sequence 61 FilterUnmappedReads: yes 62 # Examples of how to add additional command-line options 63# --trim5: 5 64# --trim3: 5 65 # Note that the colon is required, even if no value is specified 66 --very-sensitive: 67 # Example of how to specify multiple values for an option 68# --rg: 69# - CN:SequencingCenterNameHere 70# - DS:DescriptionOfReadGroup
- Options :: Aligners :: Bowtie2 :: MinQuality
58 # Filter aligned reads with a mapping quality (Phred) below this value 59 MinQuality: 0See
Options :: Aligners :: BWA :: MinQualityabove.- Options :: Aligners :: Bowtie2 :: FilterUnmappedReads
60 # Filter reads that did not map to the reference sequence 61 FilterUnmappedReads: yesSee
Options :: Aligners :: BWA :: FilterUnmappedReadsabove.Options :: Aligners :: BWA :: *
Additional command-line options may be specified for Bowtie2, as described in the "Specifying command-line options" section above. Please refer to the Bowtie2 documentation for more information about available command-line options.
mapDamage options¶
72 # Command-line options for mapDamage; use long-form options(--length not -l): 73 mapDamage: 74 # By default, the pipeline will downsample the input to 100k hits 75 # when running mapDamage; remove to use all hits 76 --downsample: 100000This subsection is used to specify options for mapDamage2.0, when plotting postmortem DNA damage, when building models of the postmortem damage, and when rescaling quality scores to account for this damage. In order to enable plotting, modeling, or rescaling of quality scores, please see the
mapDamageoption in theFeaturessection below.Note
It may be worthwhile to tweak mapDamage parameters before building a model of postmortem DNA damage; this may be accomplished by running the pipeline without rescaling, running with the
mapDamagefeature set toplot(with or without quotes), inspecting the plots generated per-library, and then tweaking parameters as appropriate, before settingmapDamagetomodel(with or without quotes).Should you wish to change the modeling and rescaling parameters, after having already run the pipeline with rescaling enabled, simply remove the mapDamage files generated for the relevant libraries (see the File structure section).
Warning
Rescaling requires a certain minimum number of C>T and G>A substitutions, before it is possible to construct a model of postmortem DNA damage. If mapDamage fails with an error indicating that "DNA damage levels are too low", then it is necessary to disable rescaling for that library to continue.
- Options :: mapDamage :: --downsample
74 # By default, the pipeline will downsample the input to 100k hits 75 # when running mapDamage; remove to use all hits 76 --downsample: 100000
By default, the BAM pipeline only samples 100k reads for use in constructing mapDamage plots; in our experience, this is sufficient for accurate plots and models. If no downsampling is to be done, this value can set to 0 to disable this features:
--downsample: 100000 # Sample 100 thousand reads --downsample: 1000000 # Sample 1 million reads --downsample: 0 # No downsampling
Options :: mapDamage :: *
Additional command-line options may be supplied to mapDamage, just like the
--downsampleparameter, as described in the "Specifying command-line options" section above. These are used during plotting and rescaling (if enabled).
Excluding read-types¶
78 # Set to 'yes' exclude a type of trimmed reads from alignment / analysis; 79 # possible read-types reflect the output of AdapterRemoval 80 ExcludeReads: 81 # Exclude single-end reads (yes / no)? 82 Single: no 83 # Exclude non-collapsed paired-end reads (yes / no)? 84 Paired: no 85 # Exclude paired-end reads for which the mate was discarded (yes / no)? 86 Singleton: no 87 # Exclude overlapping paired-ended reads collapsed into a single sequence 88 # by AdapterRemoval (yes / no)? 89 Collapsed: no 90 # Like 'Collapsed', but only for collapsed reads truncated due to the 91 # presence of ambiguous or low quality bases at read termini (yes / no). 92 CollapsedTruncated: noDuring the adapter-trimming and read-merging step, AdapterRemoval will generate a selection of different read types. This option allows certain read-types to be excluded from further analyses. In particular, it may be useful to exclude non-collapsed (paired and singleton) reads when processing (ancient) DNA for which only short inserts is expected, since this may help exclude exogenous DNA. The following read types are currently recognized:
- Single
Single-end reads; these are the (trimmed) reads generated from supplying single-end FASTQ files to the pipeline.
- Paired
Paired-end reads; these are the (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only the subset of paired reads for which both mates were retained, and which were not merged into a single read (if
--collapseis set for AdapterRemoval).- Singleton
Paired-end reads; these are (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only those reads in which one of the two mates were discarded due to either the
--maxns, the--minlength, or the--maxlengthoptions supplied to AdapterRemoval. Consequently, these reads are mapped and PCR-duplicate filtered in single-end mode.- Collapsed
Paired-end reads, for which the sequences overlapped, and which were consequently merged by AdapterRemoval into a single sequence (enabled by the --collapse command-line option). These sequences are expected to represent the complete insert, and while they are mapped in single-end mode, PCR duplicate filtering is carried out in a manner that treats these as paired reads. Note that all collapsed reads are tagged by prefixing the read name with
M_.- CollapsedTruncated
Paired-end reads (like Collapsed), which were trimmed due to the
--trimqualitiesor the--trimnscommand-line options supplied to AdapterRemoval. Consequently, and as these sequences represent the entire insert, these reads are mapped and PCR-duplicate filtered in single-end mode. Note that all collapsed, truncated reads are tagged by prefixing the read name withMT_. AdapterRemoval v2 only.To enable / disable exclusion of a read type, set the value for the appropriate type to
yesorno(without quotes):Singleton: no # Singleton reads are NOT excluded Singleton: yes # Singleton reads are excluded
Optional features¶
94 # Optional steps to perform during processing. 95 Features: 96 # If set to 'filter', PCR duplicates are removed from the output files; if set to 97 # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the 98 # output files; if set to 'no', the reads are assumed to not have been amplified. 99 PCRDuplicates: filter 100 # Set to 'no' to disable mapDamage; set to 'plots' to build basic mapDamage plots; 101 # set to 'model' to build plots and postmortem damage models; and set to 'rescale' 102 # to build plots, models, and BAMs with rescaled quality scores. All analyses are 103 # carried out per library. 104 mapDamage: plot 105 # Generate coverage information for the final BAM and for each 'RegionsOfInterest' 106 # specified in 'Prefixes' (yes / no). 107 Coverage: yes 108 # Generate histograms of number of sites with a given read-depth, from 0 to 200, 109 # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no). 110 Depths: yes 111 # Generate summary table for each target (yes / no) 112 Summary: yesThis section lists several optional features, in particular determining which BAM files and which summary statistics are generated when running the pipeline. Currently, the following options are available:
- PCRDuplicates
This option determines how the BAM pipeline handles PCR duplicates following the mapping of trimmed reads. At present, 3 possible options are available. The first option is
filter, which corresponds to running Picard MarkDuplicates andpaleomix rmdup_collapsedon the input files, and removing any read determined to be a PCR duplicate; the second optionmarkfunctions like thefilteroption, except that reads are not removed from the output, but instead the read flag is marked using the 0x400 bit (see the SAM/BAM specification for more information), in order to allow down-stream tools to identify these as duplicates. The final option isno(without quotes), in which case no PCR duplicate detection / filtering is carried out on the aligned reads, useful for data generated using amplification free sequencing.- mapDamage
The
mapDamageoption accepts four possible values:no,plot,model, andrescale. By default, value (plot), will cause mapDamage to be run in order to generate simple plots of the postmortem DNA damage rates, as well as base composition plots, and more. If set tomodel, mapDamage will firstly generate the plots described forplot, but also construct models of DNA damage parameters, as described in [Jonsson2013]. Note that a minimum amount of DNA damage is required to be present in order to build these models. If the option is set torescale, both plots and models will be constructed using mapDamage, and in addition, the quality scores of bases will be down-graded based on how likely they are to represent postmortem DNA damage (see above).- Coverage
If enabled, a table summarizing the number of hits, the number of aligned bases, bases inserted, and bases deleted, as well as the mean coverage, is generated for each reference sequence, stratified by sample, library, and contig.
- Depths
If enabled, a table containing a histogram of the depth of coverage, ranging from 0 to 200, is generated for each reference sequence, stratified by sample, library, and contig. These files may further be used by the Phylogenetic pipeline, in order to automatically select a maximum read depth during SNP calling (see the Pipeline usage section for more information).
- Summary
If enabled, a single summary table will be generated per target, containing information about the number of reads processed, hits and fraction of PCR duplicates (per prefix and per library), and much more.
For a description of where files are placed, refer to the File structure section. It is possible to run the BAM pipeline without any of these options enabled, and this may be useful in certain cases (if only the statistics or per-library BAMs are needed). To enable / disable a features, set the value for that feature to
yesorno(without quotes):Summary: no # Do NOT generate a per-target summary table Summary: yes # Generate a per-target summary table
Prefixes¶
115# Map of prefixes by name, each having a Path, which specifies the location of the
116# BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
117Prefixes:
118 # Replace 'NAME_OF_PREFIX' with name of the prefix; this name is used in summary
119 # statistics and as part of output filenames.
120 NAME_OF_PREFIX:
121 # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the references
122 # against which reads are to be mapped. Using the same name as filename is strongly
123 # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
124 Path: PATH_TO_PREFIX.fasta
125
126 # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a .bed file
127 # listing extra regions for which coverage / depth statistics should be calculated;
128 # if no names are specified for the BED records, results are named after the
129 # chromosome / contig. Replace 'NAME' with the desired name for these regions.
130# RegionsOfInterest:
131# NAME: PATH_TO_BEDFILE
Reference genomes used for mapping are specified by listing these (one or more) in the Prefixes section. Each reference genome is associated with a name (used in summary statistics and as part of the resulting filenames), and the path to a FASTA file which contains the reference genome. Several other options are also available, but only the name and the Path value are required, as shown here for several examples:
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Name of the prefix; is used as part of the output filenames
MyPrefix1:
# Path to FASTA file containing reference genome; must end with '.fasta'
Path: /path/to/genomes/file_1.fasta
MyPrefix2:
Path: /path/to/genomes/file_2.fasta
MyPrefix3:
Path: /path/to/genomes/AE008922_1.fasta
Each sample in the makefile is mapped against each prefix, and BAM files are generated according to the enabled Features (see above). In addition to the path, it is possible to specify RegionsOfInterest, which are described below.
Regions of interest¶
It is possible to specify one or more "regions of interest" for a particular reference genome. Doing so results in the production of coverage and depth tables being generated for those regions (if these features are enabled, see above), as well as additional information in the summary table (if enabled).
Such regions are specified using a BED file containing one or more regions; in particular, the first three columns (name, 0-based start coordinate, and 1-based end coordinate) are required, with the 4th column (the name) being optional. Strand information (the 6th column) is not used, but must still be valid according to the BED format.
Statistics are merged by the names specified in the BED file, or by the contig names if no names were specified. Thus, it is important to insure that names are unique if individual statistics are desired for every region.
Specifying regions of interest is accomplished by providing a name and a path for each set of regions of interest under the RegionOfInterest section for a given prefix:
# Produce additional coverage / depth statistics for a set of
# regions defined in a BED file; if no names are specified for the
# BED records, results are named after the chromosome / contig.
RegionsOfInterest:
MyRegions: /path/to/my_regions.bed
MyOtherRegions: /path/to/my_other_regions.bed
The following is a simple example of such a BED file, for an alignment against the rCRS (NC_012920.1):
NC_012920_1 3306 4262 region_a
NC_012920_1 4469 5510 region_b
NC_012920_1 5903 7442 region_a
In this case, the resulting tables will contain information about two different regions, namely region_a (2495 bp, resulting from merging the two individual regions specified), and region_b (1041 bp). The order of lines in this file does not matter.
Adding multiple prefixes¶
In cases where it is necessary to map samples against a large number of reference genomes, it may become impractical to add these to the makefile by hand. It is therefore possible to specify the location of the reference genomes via a path containing wild-cards, and letting the BAM pipeline collect these automatically. For the following example, we assume that we have a folder at /path/to/genomes, which contains our reference genomes:
$ ls /path/to/genomes
AE000516_2.fasta
AE004091_2.fasta
AE008922_1.fasta
AE008923_1.fasta
To automatically add these four reference genomes to the makefile, we would add a prefix as follows:
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Name of the prefix; is used as part of the output filenames
MyGenomes*:
# Path to .fasta file containing a set of reference sequences.
Path: /path/to/genomes/*.fasta
There are two components to this, namely the name of the pseudo-prefix which must end with a star (*), and the path which may contain one or more wild-cards. If the prefix name does not end with a star, the BAM pipeline will simply treat the path as a regular path. In this particular case, the BAM pipeline will perform the equivalent of ls /path/to/genomes/*.fasta, and then add each file it has located using the filename without extensions as the name of the prefix. In other words, the above is equivalent to the following:
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Name of the prefix; is used as part of the output filenames
AE000516_2:
Path: /path/to/genomes/AE000516_2.fasta
AE004091_2:
Path: /path/to/genomes/AE004091_2.fasta
AE008922_1:
Path: /path/to/genomes/AE008922_1.fasta
AE008923_1:
Path: /path/to/genomes/AE008923_1.fasta
Note
The name provided for the pseudo-prefix (here MyGenomes) is not used by the pipeline, and can instead be used to document the nature of the files being included.
Warning
Just like regular prefixes, it is required that the filename of the reference genome ends with .fasta. However, the pipeline will attempt to add any file found using the provided path with wildcards, and care should therefore be taken to avoid including non-FASTA files. For example, if the path /path/to/genomes/* was used instead of /path/to/genomes/*.fasta, this would cause the pipeline to abort due to the inclusion of (for example) non-FASTA index files generated at this location by the pipeline itself.
Targets¶
134# Mapping targets are specified using the following structure. Replace 'NAME_OF_TARGET'
135# with the desired prefix for filenames.
136NAME_OF_TARGET:
137 # Replace 'NAME_OF_SAMPLE' with the name of this sample.
138 NAME_OF_SAMPLE:
139 # Replace 'NAME_OF_LIBRARY' with the name of this sample.
140 NAME_OF_LIBRARY:
141 # Replace 'NAME_OF_LANE' with the lane name (e.g. the barcode) and replace
142 # 'PATH_WITH_WILDCARDS' with the path to the FASTQ files to be trimmed and mapped
143 # for this lane (may include wildcards).
144 NAME_OF_LANE: PATH_WITH_WILDCARDS
In the BAM pipeline, the term Target is used to refer not to a particular sample (though in typical usage a target includes just one sample), but rather one or more samples to processed together to generate a BAM file per prefix (see above). A sample included in a target may likewise contain one or more libraries, for each of which one or more sets of FASTQ reads are specified.
The following simplified example, derived from the makefile constructed as part of Pipeline usage section exemplifies this:
1# Target name; all output files uses this name as a prefix
2MyFilename:
3 # Sample name; used to tag data for segregation in downstream analyses
4 MySample:
5 # Library name; used to tag data for segregation in downstream analyses
6 TGCTCA:
7 # Lane / run names and paths to FASTQ files
8 Lane_1: data/TGCTCA_L1_*.fastq.gz
9 Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz
- Target name
The first top section of this target (line 2,
MyFilename) constitute the target name. This name is used as part of summary statistics and, more importantly, determined the first part of name of files generated as part of the processing of data specified for this target. Thus, in this example all files and folders generated during the processing of this target will start withMyFilename; for example, the summary table normally generated from running the pipeline will be placed in a file namedMyFilename.summary.- Sample name
The subsections listed in the
Targetsection (line 4,MySample) constitute the (biological) samples included in this target; in the vast majority of analyses, you will have only a single sample per target, and in that case it is considered good practice to use the same name for both the target and the sample. A single target can, however, contain any number of samples, the data for which are tagged according to the names given in the makefile, using the SAM/BAM readgroup (RG) tags.- Library name
The subsections listed in the
Samplesection (line 6,TGCTCA) constitute the sequencing libraries constructed during the extraction and library building for the current sample. For modern samples, there is typically only a single library per sample, but more complex sequencing projects (modern and ancient) may involve any number of libraries constructed from one or more extracts. It is very important that libraries be listed correctly (see below).Warning
Note that the BAM pipeline imposes the restriction that each library name specified for a target must be unique, even if these are located in to different samples. This restriction may be removed in future versions of PALEOMIX.
- Lane name
The subsections of each library are used to specify the names of individual lanes. Each sample may include any number of lanes and each lane may include any number of FASTQ files. Paths listed here may include wildcards (
*) or the special value{Pair}, which is used to indicate that the lane is paired end. During runtime,{Pair}is replaced with1and2, before searching using wildcards, and the pipeline expects the resulting filenames to correspond to mate 1 and mate 2 reads respectively.
In addition to these target (sub)sections, it is possible to specify Options for individual targets, samples, and libraries, similarly to how this is done globally at the top of the makefile. This is described below.
Warning
It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping together lanes will result either in the loss of sequences which are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these. Furthermore, mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by incorrect specification of libraries.
Including already trimmed reads¶
In some cases it is useful to include FASTQ reads that has already been trimmed for adapter sequences. While this is not recommended in general, as it may introduce systematic bias if some data has been processed differently than the remaining FASTQ reads, the BAM pipeline makes it simple to incorporate both raw and trimmed FASTQ reads, and to ensure that these integrate in the pipeline.
To include already trimmed reads, these are specified as values belonging to a lane, using the same names for read-types as in the ExcludeReads option (see above). The following minimal example demonstrates this:
1MyFilename:
2 MySample:
3 ACGATA:
4 # Regular lane, containing reads that are not already trimmed
5 Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
6
7 # Lane containing pre-trimmed reads of each type
8 Lane_2:
9 # Single-end reads
10 Single: /path/to/single_end_reads.fastq.gz
11
12 # Paired-end reads where one mate has been discarded
13 Singleton: /path/to/singleton_reads.fastq.gz
14
15 # Paired end reads; note that the {Pair} key is required,
16 # just like with raw, paired-end reads
17 Paired: /path/to/paired_end_{Pair}.fastq.gz
18
19 # Paired-end reads merged into a single sequence
20 Collapsed: /path/to/collapsed.fastq.gz
21
22 # Paired-end reads merged into a single sequence, and then truncated
23 CollapsedTruncated: /path/to/collapsed_truncated.fastq.gz
The above examples show how each type of reads are to be listed, but it is not necessary to specify more than a single type of pre-trimmed reads in the makefile.
Note
Including already trimmed reads currently result in the absence of some summary statistics in the .summary file, namely the number of raw reads, as well as trimming statistics, since the BAM pipeline currently relies on AdapterRemoval to collect these statistics.
Overriding global settings¶
In addition to the main Options section, it is possible to override options at a Target, Sample, and Library level. This allows, for example, that different adapter sequences be specified for each library generated for a sample, or options that should only be applied to a particular sample among several included in a makefile. The following demonstration uses the makefile constructed as part of Pipeline usage section as the base:
1MyFilename:
2 # These options apply to all samples with this filename
3 Options:
4 # In this example, we override the default adapter sequences
5 AdapterRemoval:
6 --adapter1: AGATCGGAAGAGC
7 --adapter2: AGATCGGAAGAGC
8
9 MySample:
10 # These options apply to libraries 'ACGATA', 'GCTCTG', and 'TGCTCA'
11 Options:
12 # In this example, we assume that FASTQ files for our libraries
13 # include Phred quality scores encoded with offset 64.
14 QualityOffset: 64
15
16 ACGATA:
17 Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
18
19 GCTCTG:
20 # These options apply to 'Lane_1' in the 'GCTCTG' library
21 Options:
22 # It is possible to override options we have previously overridden
23 QualityOffset: 33
24
25 Lane_1: data/GCTCTG_L1_*.fastq.gz
26
27 TGCTCA:
28 Lane_1: data/TGCTCA_L1_*.fastq.gz
29 Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz
In this example, we have overwritten options at 3 places:
The first place (lines 2 - 7) will be applied to all samples, libraries, and lanes in this target, unless subsequently overridden. In this example, we have set a new pair of adapter sequences, which we wish to use for these data.
The second place (lines 10 - 14) are applied to the sample
MySamplethat we have included in this target, and consequently applies to all libraries specified for this sample (ACGATA,GCTCTG, andTGCTCA). In most cases you will only have a single sample, and so it will not make a difference whether or not you override options for the entire target (e.g. lines 3 - 8), or just for that sample (e.g. lines 11-15).Finally, the third place (lines 20 - 23) demonstrate how options can be overridden for a particular library. In this example, we have chosen to override an option (for this library only!) we previously overrode for that sample (the
QualityOffsetoption).
Note
It currently not possible to override options for a single lane, it is only possible to override options for all lanes in a library.
Warning
Only the mapDamage and the PCRDuplicates features can be overridden for individual targets, samples, or libraries. Over features can be overridden per target, but not per sample or per library.