Makefile description¶
Contents
The following sections reviews the options available in the BAM pipeline makefiles. As described in the Pipeline usage section, a default makefile may be generated using the 'paleomix bam_pipeline mkfile' command. For clariety, the location of options in subsections are specified by concatenating the names using '::' as a seperator. Thus, in the following (simplified example), the 'UseSeed' option (line 13) would be referred to as 'Options :: Aligners :: BWA :: UseSeed':
1 2 3 4 5 6 7 8 9 10 11 12 13 | Options:
# Settings for aligners supported by the pipeline
Aligners:
# Choice of aligner software to use, either "BWA" or "Bowtie2"
Program: BWA
# Settings for mappings performed using BWA
BWA:
# May be disabled ("no") for aDNA alignments, as post-mortem damage
# localizes to the seed region, which BWA expects to have few
# errors (sets "-l"). See http://pmid.us/22574660
UseSeed: yes
|
Specifying command-line options¶
For several programs it is possible to directly specify command-line options; this is accomplished in one of 3 ways; firstly, for command-line options that take a single value, this is accomplished simply by specifying the option and value as any other option. For example, if we wish to supply the option --mm 5 to AdapterRemoval, then we would list it as "--mm: 5" (all other options omitted for brevity):
AdapterRemoval:
--mm: 5
For options that do not take any values, such as the AdapterRemoval '--trimns' (enabling the trimming of Ns in the reads), these are specified either as "--trimmns: ", with the value left blank, or as "--trimns: yes". The following are therefore equivalent:
AdapterRemoval:
--trimns: # Method 1
--trimns: yes # Method 2
In some cases the BAM pipeline will enable features by default, but still allow these to be overridden. In those cases, the feature can be disabled by setting the value to 'no' (without quotes), as shown here:
AdapterRemoval:
--trimns: no
If you need to provide the text "yes" or "no" as the value for an option, it is nessesary to put these in quotes:
--my-option: "yes"
--my-option: "no"
In some cases it is possible or even nessesary to specify an option multiple times. Due to the way YAML works, this is not possible to do so directly. Instead, the pipeline allows multiple instances of the same option by providing these as a list:
--my-option:
- "yes"
- "no"
- "maybe"
The above will be translated into calling the program in question with the options "--my-option yes --my-option no --my-option maybe".
Options section¶
By default, the 'Options' section of the makefile contains the following:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 | # Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specific globally.
Options:
# Sequencing platform, see SAM/BAM reference for valid values
Platform: Illumina
# Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
# or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
# specify 'Solexa', to handle reads on the Solexa scale. This is
# used during adapter-trimming and sequence alignment
QualityOffset: 33
# Split a lane into multiple entries, one for each (pair of) file(s)
# found using the search-string specified for a given lane. Each
# lane is named by adding a number to the end of the given barcode.
SplitLanesByFilenames: yes
# Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
CompressionFormat: bz2
# Settings for trimming of reads, see AdapterRemoval man-page
AdapterRemoval:
# Adapter sequences, set and uncomment to override defaults
# --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
# --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
# Some BAM pipeline defaults differ from AR defaults;
# To override, change these value(s):
--mm: 3
--minlength: 25
# Extra features enabled by default; change 'yes' to 'no' to disable
--collapse: yes
--trimns: yes
--trimqualities: yes
# Settings for aligners supported by the pipeline
Aligners:
# Choice of aligner software to use, either "BWA" or "Bowtie2"
Program: BWA
# Settings for mappings performed using BWA
BWA:
# One of "backtrack", "bwasw", or "mem"; see the BWA documentation
# for a description of each algorithm (defaults to 'backtrack')
Algorithm: backtrack
# Filter aligned reads with a mapping quality (Phred) below this value
MinQuality: 0
# Filter reads that did not map to the reference sequence
FilterUnmappedReads: yes
# Should be disabled ("no") for aDNA alignments, as post-mortem damage
# localizes to the seed region, which BWA expects to have few
# errors (sets "-l"). See http://pmid.us/22574660
UseSeed: yes
# Additional command-line options may be specified for the "aln"
# call(s), as described below for Bowtie2 below.
# Settings for mappings performed using Bowtie2
Bowtie2:
# Filter aligned reads with a mapping quality (Phred) below this value
MinQuality: 0
# Filter reads that did not map to the reference sequence
FilterUnmappedReads: yes
# Examples of how to add additional command-line options
# --trim5: 5
# --trim3: 5
# Note that the colon is required, even if no value is specified
--very-sensitive:
# Example of how to specify multiple values for an option
# --rg:
# - CN:SequencingCenterNameHere
# - DS:DescriptionOfReadGroup
# Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
# removed from the output files; if set to 'mark', PCR duplicates are
# flagged with bit 0x400, and not removed from the output files; if set to
# 'no', the reads are assumed to not have been amplified. Collapsed reads
# are filtered using the command 'paleomix rmdup_duplicates', while "normal"
# reads are filtered using Picard MarkDuplicates.
PCRDuplicates: filter
# Carry out quality base re-scaling of libraries using mapDamage
# This will be done using the options set for mapDamage below
RescaleQualities: no
# Command-line options for mapDamage; note that the long-form
# options are expected; --length, not -l, etc. Uncomment the
# "mapDamage" line adding command-line options below.
mapDamage:
# By default, the pipeline will downsample the input to 100k hits
# when running mapDamage; remove to use all hits
--downsample: 100000
# Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
# possible read-types reflect the output of AdapterRemoval
ExcludeReads:
Single: no # Single-ended reads / Orphaned paired-ended reads
Paired: no # Paired ended reads
Singleton: no # Paired reads for which the mate was discarded
Collapsed: no # Overlapping paired-ended reads collapsed into a
# single sequence by AdapterRemoval
CollapsedTruncated: no # Like 'Collapsed', except that the reads
# truncated due to the presence ambigious
# bases or low quality bases at read termini.
# Optional steps to perform during processing
Features:
RawBAM: no # Generate BAM from the raw libraries (no indel realignment)
# Location: {Destination}/{Target}.{Genome}.bam
RealignedBAM: yes # Generate indel-realigned BAM using the GATK Indel realigner
# Location: {Destination}/{Target}.{Genome}.realigned.bam
mapDamage: yes # Generate mapDamage plot for each (unrealigned) library
# Location: {Destination}/{Target}.{Genome}.mapDamage/{Library}/
Coverage: yes # Generate coverage information for the raw BAM (wo/ indel realignment)
# Location: {Destination}/{Target}.{Genome}.coverage
Depths: yes # Generate histogram of number of sites with a given read-depth
# Location: {Destination}/{Target}.{Genome}.depths
Summary: yes # Generate summary table for each target
# Location: {Destination}/{Target}.summary
DuplicateHist: no # Generate histogram of PCR duplicates, for use with PreSeq
# Location: {Destination}/{Target}.{Genome}.duphist/{Library}/
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Uncomment and replace 'NAME_OF_PREFIX' with name of the prefix; this name
# is used in summary statistics and as part of output filenames.
# NAME_OF_PREFIX:
# Uncomment and replace 'PATH_TO_PREFIX' with the path to .fasta file
# containing the references against which reads are to be mapped.
# Path: PATH_TO_PREFIX
# (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
# .bed file listing extra regions for which coverage / depth statistics
# should be calculated; if no names are specified for the BED records,
# results are named after the chromosome / contig. Change 'NAME' to the
# name to be used in summary statistics and output filenames.
# RegionsOfInterest:
# NAME: PATH_TO_BEDFILE
# Mapping targets are specified using the following structure. Uncomment and
# replace 'NAME_OF_TARGET' with the desired prefix for filenames.
#NAME_OF_TARGET:
# Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
# NAME_OF_SAMPLE:
# Uncomment and replace 'NAME_OF_LIBRARY' with the name of this sample.
# NAME_OF_LIBRARY:
# Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
# and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
# to be trimmed and mapped for this lane (may include wildcards).
# NAME_OF_LANE: PATH_WITH_WILDCARDS
|
Options: General¶
- Options :: Platform
7 8
# Sequencing platform, see SAM/BAM reference for valid values Platform: Illumina
The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The SAM/BAM specification the valid platforms, which currently include 'CAPILLARY', 'HELICOS', 'ILLUMINA', 'IONTORRENT', 'LS454', 'ONT', 'PACBIO', and 'SOLID'.
- Options :: QualityOffset
9 10 11 12 13 14
# Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+) # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to # specify 'Solexa', to handle reads on the Solexa scale. This is # used during adapter-trimming and sequence alignment QualityOffset: 33 # Split a lane into multiple entries, one for each (pair of) file(s)
The QualityOffset option refers to the starting ASCII value used to encode Phred quality-scores in user-provided FASTQ files, with the possible values of 33, 64, and 'Solexa'. For most modern data, this will be 33, corresponding to ASCII characters in the range '!' to 'J'. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range '@' to 'h', and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ';' to 'h'. For a visual representation of this, refer to the Wikipedia article linked above.
- Options :: SplitLanesByFilenames
14 15 16 17
# Split a lane into multiple entries, one for each (pair of) file(s) # found using the search-string specified for a given lane. Each # lane is named by adding a number to the end of the given barcode. SplitLanesByFilenames: yes
This option influences how the BAM pipeline handles lanes that include multiple files. By default (corresponding to a value of 'yes'), the pipeline will process individual files in parallel, potentially allowing for greater throughput. If set to 'no', all files in a lane are merged during processing, resulting in a single set of trimmed reads per lane. The only effect of this option on the final result is a greater number of read-groups specified in the final BAM files. See the File structure section for more details on how this is handled.
Warning
This option is deprecated, and will be removed in future versions of PALEOMIX.
- Options :: CompressionFormat
18 19
# Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2 CompressionFormat: bz2
This option determines which type of compression is carried out on trimmed FASTQ reads; if set to 'gz', reads are gzip compressed, and if set to 'bz2', reads are compressed using bzip2. This option has no effect on the final results, but may be used to trade off space (gz) for some additional runtime (bz2).
Warning
This option is deprecated, and may be removed in future versions of PALEOMIX.
- Options :: PCRDuplicates
72 73 74 75 76 77 78
# Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are # removed from the output files; if set to 'mark', PCR duplicates are # flagged with bit 0x400, and not removed from the output files; if set to # 'no', the reads are assumed to not have been amplified. Collapsed reads # are filtered using the command 'paleomix rmdup_duplicates', while "normal" # reads are filtered using Picard MarkDuplicates. PCRDuplicates: filter
This option determines how the BAM pipeline handles PCR duplicates following the mapping of trimmed reads. At present, 3 possible options are available. The first option is 'filter', which corresponds to running Picard MarkDuplicates and 'paleomix rmdup_collapsed' on the input files, and removing any read determined to be a PCR duplicate; the second option 'mark' functions like the 'filter' option, except that reads are not removed from the output, but instead the read flag is marked using the 0x400 bit (see the SAM/BAM specification for more information), in order to allow down-stream tools to identify these as duplicates. The final option is 'no' (without quotes), in which case no PCR duplicate detection / filtering is carried out on the aligned reads, useful for data generated using amplification free sequencing.
Options: Adapter Trimming¶
The "AdapterRemoval" subsection allows for options that are applied when AdapterRemoval is applied to the FASTQ reads supplied by the user. For a more detailed description of command-line options, please refer to the AdapterRemoval documentation. A few important particularly options are described here:
- Options :: AdapterRemoval :: --adapter1 and Options :: AdapterRemoval :: --adapter2
23 24 25
# Adapter sequences, set and uncomment to override defaults # --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG # --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination. Thus, the sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified as in the orientation that appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. Please refer to the AdapterRemoval documentation for more information.
Note
As of version AdapterRemoval 2.1, it is possible to use multiple threads to speed up trimming of adapter sequences. This is accomplished not by setting the --threads command-line option in the makefile, but by supplying the --adapterremoval-max-threads option to the BAM pipeline itself:
$ paleomix bam_pipeline run makefile.yaml --adapterremoval-max-threads 2
Warning
Older versions of PALEOMIX may use the --pcr1 and --pcr2 options instead of --adapter1 and --adapter2; for new projects, using --adapter1 and --adapter2 is strongly recommended, due to the simpler schematics (described above). If your project uses the --pcr1 and --pcr2 options, then refer to the AdapterRemoval documentation information for how to proceed!
- Options :: AdapterRemoval :: --mm
28
--mm: 3
Sets the fraction of mismatches allowed when aligning reads / adapter sequences. If the specified value (MM) is greater than 1, this is calculated as 1 / MM, otherwise the value is used directly. To set, replace the default value as desired:
--mm: 3 # Maximum mismatch rate of 1 / 3 --mm: 5 # Maximum mismatch rate of 1 / 5 --mm: 0.2 # Maximum mismatch rate of 1 / 5
Options :: AdapterRemoval :: --minlength
The minimum length required after read merging, adapter trimming, and base-quality quality trimming; resulting reads shorter than this length are discarded, and thereby excluded from further analyses by the pipeline. A value of at least 25 bp is recommended to cut down on the rate of spurious alignments; if possible, a value of 30 bp may be used to greatly reduce the fraction of spurious alignments, with smaller gains for greater minimums [Schubert2012].
Warning
The default value used by PALEOMIX for '--minlength' (25 bp) differs from the default value for AdapterRemoval (15 bp). Thus, if a minimum length of 15 bp is desired, it is nessesarily to explicitly state so in the makefile, simply commenting out this command-line argument is not sufficient.
- Options :: AdapterRemoval :: --collapse
31
--collapse: yes
If enabled, AdapterRemoval will attempt to combine overlapping paired-end reads into a single (potentially longer) sequence. This has at least two advantages, namely that longer reads allow for less ambigious alignments against the target reference genome, and that the fidelity of the overlapping region (potentially the entire read) is improved by selecting the highest quality base when discrepancies are observed. The names of reads thus merged are prefixed with either 'M_' or 'MT_', with the latter marking reads that have been trimmed from the 5' or 3' termini following collapse, and which therefore do not represent the full insert. To disable this behavior, set the option to 'no' (without quotes):
--trimns: yes # Option enabled --trimns: no # Option disabled
Note
This option may be combined with the 'ExcludeReads' option (see below), to either elimate or select for short inserts, depending on the exceptations from the experiement. I.e. for ancient samples, where the most inserts should be short enough to allow collapsing (< 2x read read - 11, by default), excluding paired (uncollapsed) and singleton reads may help reduce the fraction of exogenous DNA mapped.
- Options :: AdapterRemoval :: --trimns
32
--trimns: yes
If set to 'yes' (without quotes), AdapterRemoval will trim uncalled bases ('N') from the 5' and 3' end of the reads. Trimming will stop at the first called base ('A', 'C', 'G', or 'T'). If both --trimns and --trimqualities are enabled, then consequtive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' end of the reads. To disable, set the option to 'no' (without quotes):
--trimns: yes # Option enabled --trimns: no # Option disabled
- Options :: AdapterRemoval :: --trimqualities
33
--trimqualities: yes
If set to 'yes' (without quotes), AdapterRemoval will trim low-quality bases from the 5' and 3' end of the reads. Trimming will stop at the first base which is greater than the (Phred encoded) minimum quality score specified using the command-line option --minquality. This value defaults to 2. If both --trimns and --trimqualities are enabled, then consequtive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' end of the reads. To disable, set the option to 'no' (without quotes):
--trimqualities: yes # Option enabled --trimqualities: no # Option disabled
Options: Short read aligners¶
This section allow selection between supported short read aligners (currently BWA [Li2009a] and Bowtie2 [Langmead2012]), as well as setting options for these, individually:
35 36 37 38 | # Settings for aligners supported by the pipeline
Aligners:
# Choice of aligner software to use, either "BWA" or "Bowtie2"
Program: BWA
|
To select a mapping program, set the 'Program' option appropriately:
Program: BWA # Using BWA to map reads
Program: Bowtie2 # Using Bowtie2 to map reads
Options: Short read aligners - BWA¶
The following options are applied only when running the BWA short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 # Settings for mappings performed using BWA BWA: # One of "backtrack", "bwasw", or "mem"; see the BWA documentation # for a description of each algorithm (defaults to 'backtrack') Algorithm: backtrack # Filter aligned reads with a mapping quality (Phred) below this value MinQuality: 0 # Filter reads that did not map to the reference sequence FilterUnmappedReads: yes # Should be disabled ("no") for aDNA alignments, as post-mortem damage # localizes to the seed region, which BWA expects to have few # errors (sets "-l"). See http://pmid.us/22574660 UseSeed: yes # Additional command-line options may be specified for the "aln" # call(s), as described below for Bowtie2 below.
- Options :: Aligners :: BWA :: Algorithm
42 43 44 # One of "backtrack", "bwasw", or "mem"; see the BWA documentation # for a description of each algorithm (defaults to 'backtrack') Algorithm: backtrackThe mapping algorithm to use; options are 'backtrack' (corresponding to 'bwa aln'), 'bwasw', and 'mem'. Additional command-line options may be specified for these. Algorithms are selected as follows:
Algorithm: backtrack # 'Backtrack' algorithm, using the command 'bwa aln' Algorithm: bwasw # 'SW' algorithm for long queries, using the command 'bwa bwasw' Algorithm: mem # 'mem' algorithm, using the command 'bwa mem'Warning
Alignment algorithms 'bwasw' and 'mem' currently cannot be used with input data that is encoded using QualityOffset 64 or 'Solexa'. This is a limitation of PALEOMIX, and will be resolved in future versions. In the mean time, this can be circumvented by converting FASTQ reads to the standard quality-offset 33, using for example seqtk.
- Options :: Aligners :: BWA :: MinQuality
45 46 # Filter aligned reads with a mapping quality (Phred) below this value MinQuality: 0Specifies the minimum mapping quality of alignments produced by BWA. Any aligned read with a quality score below this value are removed during the mapping process. Note that while unmapped read have a quality of zero, these are not excluded by a non-zero 'MinQuality' value. To filter unmapped reads, use the option 'FilterUnmappedReads' (see below). To set this option, replace the default value with a desired minimum:
MinQuality: 0 # Keep all hits MinQuality: 25 # Keep only hits where mapping-quality >= 25- Options :: Aligners :: BWA :: FilterUnmappedReads
47 48 # Filter reads that did not map to the reference sequence FilterUnmappedReads: yesSpecifies wether or not unmapped reads (reads not aligned to a target sequence) are to be retained in the resulting BAM files. If set to 'yes' (without quotes), all unmapped reads are discarded during the mapping process, while setting the option to 'no' (without quotes) retains these reads in the BAM. By convention, paired reads in which one mate is unmapped are assigned the same chromosome and position, while no chromosome / position are assigned to unmapped single-end reads. To change this setting, replace the value with either 'yes' or 'no' (without quotes):
FilterUnmappedReads: yes # Remove unmapped reads during alignment FilterUnmappedReads: no # Keep unmapped readsOptions :: Aligners :: BWA :: *
Additional command-line options may be specified for the selected alignment algorithm, as described in the "Specifying command-line options" section above. See also the examples listed for Bowtie2 below. Note that for the 'backtrack' algorithm, it is only possible to specify options for the 'bwa aln' call.
Options: Short read aligners - Bowtie2¶
The following options are applied only when running the Bowtie2 short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 # Settings for mappings performed using Bowtie2 Bowtie2: # Filter aligned reads with a mapping quality (Phred) below this value MinQuality: 0 # Filter reads that did not map to the reference sequence FilterUnmappedReads: yes # Examples of how to add additional command-line options # --trim5: 5 # --trim3: 5 # Note that the colon is required, even if no value is specified --very-sensitive: # Example of how to specify multiple values for an option # --rg: # - CN:SequencingCenterNameHere # - DS:DescriptionOfReadGroup
- Options :: Aligners :: Bowtie2 :: MinQuality
58 59 # Filter aligned reads with a mapping quality (Phred) below this value MinQuality: 0See 'Options :: Aligners :: BWA :: MinQuality' above.
- Options :: Aligners :: Bowtie2 :: FilterUnmappedReads
60 61 # Filter reads that did not map to the reference sequence FilterUnmappedReads: yesSee 'Options :: Aligners :: BWA :: FilterUnmappedReads' above.
Options :: Aligners :: BWA :: *
Additional command-line options may be specified for Bowtie2, as described in the "Specifying command-line options" section above. Please refer to the Bowtie2 documentation for more information about available command-line options.
Options: mapDamage plots and rescaling¶
80 81 82 83 84 85 86 87 88 89 90 # Carry out quality base re-scaling of libraries using mapDamage # This will be done using the options set for mapDamage below RescaleQualities: no # Command-line options for mapDamage; note that the long-form # options are expected; --length, not -l, etc. Uncomment the # "mapDamage" line adding command-line options below. mapDamage: # By default, the pipeline will downsample the input to 100k hits # when running mapDamage; remove to use all hits --downsample: 100000
- Options :: RescaleQualities
80 81 82
# Carry out quality base re-scaling of libraries using mapDamage # This will be done using the options set for mapDamage below RescaleQualities: no
If enabled, mapDamage will be used to build a model of post-mortem DNA damage per library, which is then applied to the per-library BAMs in order to recalibrate quality scores according to the probability that a C>T or G>A substitution is caused by post-mortem DNA damage [Jonsson2013]. To enable, change the value to 'yes' (without quotes):
RescaleQualities: no # Rescaling disabled RescaleQualities: yes # Rescaling enabled
Note
It may be worthwhile to tweak mapDamage parameters before building a model of post-mortem DNA damage; this may be accomplished by running the pipeline without rescaling, running with the 'mapDamage' feature set to 'yes' (without quotes), inspecting the plots generated per-library, and then tweaking parameters as appropriate, before setting 'RescaleQualities' to yes.
Disabling the construction of the final BAMs may be accomplished by setting the features 'RawBam' and 'RealignedBAM' to 'no' (without quotes) in the 'Features' section (see below), and then setting the desired option to yes again after enabling rescaling and adding the desired options to the mapDamage section.
Should you wish to change the modeling and rescaling parameters, after having already run the pipeline with RescaleQualities enabled, simply remove the mapDamage files generated for the relevant libraries (see the File structure section).
Warning
Rescaling requires a certain minimum number of C>T and G>A substituions, before it is possible to construct a model of post-mortem DNA damage. If mapDamage fails with an error indicating that "DNA damage levels are too low", then it is nessesary to disable rescaling for that library to continue.
- Options :: mapDamage :: --downsample
88 89 90
# By default, the pipeline will downsample the input to 100k hits # when running mapDamage; remove to use all hits --downsample: 100000
By default the BAM pipeline only samples 100k reads for use in constructing mapDamage plots; in our experience, this is sufficent for accurate plots and models. If no downsampling is to be done, this value can set to 0 to disable this features:
--downsample: 100000 # Sample 100 thousand reads --downsample: 1000000 # Sample 1 million reads --downsample: 0 # No downsampling
- Options :: mapDamage :: *
80 81 82
# Carry out quality base re-scaling of libraries using mapDamage # This will be done using the options set for mapDamage below RescaleQualities: no
Additional command-line options may be supplied to mapDamage, just like the '--downsample' parameter, as described in the "Specifying command-line options" section above. These are used during plotting and rescaling (if enabled).
Options: Excluding read-types¶
92 93 94 95 96 97 98 99 100 101 102 # Set to 'yes' exclude a type of trimmed reads from alignment / analysis; # possible read-types reflect the output of AdapterRemoval ExcludeReads: Single: no # Single-ended reads / Orphaned paired-ended reads Paired: no # Paired ended reads Singleton: no # Paired reads for which the mate was discarded Collapsed: no # Overlapping paired-ended reads collapsed into a # single sequence by AdapterRemoval CollapsedTruncated: no # Like 'Collapsed', except that the reads # truncated due to the presence ambigious # bases or low quality bases at read termini.During the adapter-trimming and read-merging step, AdapterRemoval will generate a selection of different read types. This option allows certain read-types to be excluded from further analyses. In particular, it may be useful to exclude non-collapsed (paired and singleton) reads when processing (ancient) DNA for which only short inserts is expected, since this may help exclude exogeneous DNA. The following read types are currently recognized:
- Single
- Single-end reads; these are the (trimmed) reads generated from supplying single-end FASTQ files to the pipeline.
- Paired
- Paired-end reads; these are the (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only the subset of paired reads for which both mates were retained, and which were not merged into a single read (if --collapse is set for AdapterRemoval).
- Singleton
- Paired-end reads; these are (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only those reads in which one of the two mates were discarded due to either the '--maxns', the '--minlength', or the '--maxlength' options supplied to AdapterRemoval. Consequently, these reads are mapped and PCR-duplicate filtered in single-end mode.
- Collapsed
- Paired-end reads, for which the sequences overlapped, and which were consequently merged by AdapterRemoval into a single sequence (enabled by the --collapse command-line option). These sequences are expected to represent the complete insert, and while they are mapped in single-end mode, PCR duplicate filtering is carried out in a manner that treats these as paired reads. Note that all collapsed reads are tagged by prefixing the read name with 'M_'.
- CollapsedTruncated
- Paired-end reads (like Collapsed), which were trimmed due to the '--trimqualities' or the '--trimns' command-line options supplied to AdapterRemoval. Consequently, and as these sequences represent the entire insert, these reads are mapped and PCR-duplicate filtered in single-end mode. Note that all collapsed, truncated reads are tagged by prefixing the read name with 'MT_'.
To enable / disable exclusion of a read type, set the value for the appropriate type to 'yes' or 'no' (without quotes):
Singleton: no # Singleton reads are NOT excluded Singleton: yes # Singleton reads are excluded
Options: Optional features¶
104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 # Optional steps to perform during processing Features: RawBAM: no # Generate BAM from the raw libraries (no indel realignment) # Location: {Destination}/{Target}.{Genome}.bam RealignedBAM: yes # Generate indel-realigned BAM using the GATK Indel realigner # Location: {Destination}/{Target}.{Genome}.realigned.bam mapDamage: yes # Generate mapDamage plot for each (unrealigned) library # Location: {Destination}/{Target}.{Genome}.mapDamage/{Library}/ Coverage: yes # Generate coverage information for the raw BAM (wo/ indel realignment) # Location: {Destination}/{Target}.{Genome}.coverage Depths: yes # Generate histogram of number of sites with a given read-depth # Location: {Destination}/{Target}.{Genome}.depths Summary: yes # Generate summary table for each target # Location: {Destination}/{Target}.summary DuplicateHist: no # Generate histogram of PCR duplicates, for use with PreSeqThis section lists several optional features, in particular determining which BAM files and which summary statistics are generated when running the pipeline. Currently, the following options are available:
- RawBAM
- If enabled, the pipeline will generate a final BAM, which is NOT processed using the GATK Indel Realigner (see below), following all other processing steps.
- RealignedBAM
- If enabled, the pipeline will generate a final BAM, which is processed using the GATK Indel Realigner [McKenna2010], in order to improve the alignment near indels, by performing a multiple sequence alignment in regions containing putative indels.
- mapDamage
- If enabled, the pipeline will generate mapDamage plots for each reference genome and for each library. Note that these will be generated even if the option is set to 'no', if the 'RescaleQualities' option is enabled (see above).
- Coverage
- If enabled, a table summarizing the number of hits, the number of aligned bases, bases inserted, and bases deleted, as well as the mean coverage, is generated for each reference sequence, stratified by sample, library, and contig.
- Depths
- If enabled, a table containing a histogram of the depth of coverage, ranging from 0 to 200, is generated for each reference sequence, stratified by sample, library, and contig. These files may further be used by the Phylogenetic pipeline, in order to automatically select a maximum read depth during SNP calling (see the Pipeline usage section for more information).
- Summary
- If enabled, a single summary table will be generated per target, containing information about the number of reads processed, hits and fraction of PCR duplicates (per prefix and per library), and much more.
- DuplicateHist
- If enabled, a histogram of the estimated number of PCR duplicates observed per DNA fragment is generated per library. This may be used with the 'preseq' program in order to estimate the (remaining) complexity of a given library, and thereby direct future sequencing efforts [Daley2013].
For a description of where files are placed, refer to the File structure section. It is possible to run the BAM pipeline without any of these options enabled, and this may be useful in certain cases (if only the statistics or per-library BAMs are needed). To enable / disable a features, set the value for that feature to 'yes' or 'no' (without quotes):
Summary: no # Do NOT generate a per-target summary table Summary: yes # Generate a per-target summary table
Prefixes section¶
104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 | # Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Uncomment and replace 'NAME_OF_PREFIX' with name of the prefix; this name
# is used in summary statistics and as part of output filenames.
# NAME_OF_PREFIX:
# Uncomment and replace 'PATH_TO_PREFIX' with the path to .fasta file
# containing the references against which reads are to be mapped.
# Path: PATH_TO_PREFIX
# (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
# .bed file listing extra regions for which coverage / depth statistics
# should be calculated; if no names are specified for the BED records,
# results are named after the chromosome / contig. Change 'NAME' to the
# name to be used in summary statistics and output filenames.
# RegionsOfInterest:
# NAME: PATH_TO_BEDFILE
|
Reference genomes used for mapping are specified by listing these (one or more) in the 'Prefixes' section. Each reference genome is assosiated with a name (used in summary statistics and as part of the resulting filenames), and the path to a FASTA file which contains the reference genome. Several other options are also available, but only the name and the 'Path' value are required, as shown here for several examples:
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Name of the prefix; is used as part of the output filenames
MyPrefix1:
# Path to FASTA file containing reference genome; must end with '.fasta'
Path: /path/to/genomes/file_1.fasta
MyPrefix2:
Path: /path/to/genomes/file_2.fasta
MyPrefix3:
Path: /path/to/genomes/AE008922_1.fasta
Each sample in the makefile is mapped against each prefix, and BAM files are generated according to the enabled 'Features' (see above). In addition to the path, two other options are available per prefix, namely the 'Label' and 'RegionsOfInterest', which are described below.
Warning
FASTA files used in the BAM pipeline must be named with a .fasta file extension. Furthermore, if alignments are to be carried out against the human nuclear genome, chromosomes MUST be ordered by their number for GATK to work! See the GATK FAQ for more information.
Regions of interest¶
It is possible to specify one or more "regions of interest" for a particular reference genome. Doing so results in the production of coverage and depth tables being generated for those regions (if these features are enabled, see above), as well as additional information in the summary table (if enabled, see above).
Such regions are specified using a BED file containing one or more regions; in particular, the first three columns (name, 0-based start coordinate, and 1-based end coordinate) are required, with the 4th column (the name) being optional. Strand information (the 6th column) is not used, but must still be valid according to the BED format.
If these regions are named, statistics are merged by these names (esstentially treating them as pseudo contigs), while regions are merged by contig. Thus, it is important to insure that names are unique if statistics are desired for very single region, individually.
Specifying regions of interest is accomplished by providing a name and a path for each set of regions of interst under the 'RegionOfInterest' section for a given prefix:
# Produce additional coverage / depth statistics for a set of
# regions defined in a BED file; if no names are specified for the
# BED records, results are named after the chromosome / contig.
RegionsOfInterest:
MyRegions: /path/to/my_regions.bed
MyOtherRegions: /path/to/my_other_regions.bed
The following is a simple example of such a BED file, for an alignment against the rCRS (NC_012920.1):
NC_012920_1 3306 4262 region_a
NC_012920_1 4469 5510 region_b
NC_012920_1 5903 7442 region_a
In this case, the resulting tables will contain information about two different regions, namely region_a (2495 bp, resulting from merging the two individual regions specified), and region_b (1041 bp). The order of lines in this file does not matter.
Adding multiple prefixes¶
In cases where it is nessesary to map samples against a large number of reference genomes, it may become impractical to add these to the makefile by hand. To allow such use-cases, it is possible to specify the location of the reference genomes via a path containing wild-cards, and letting the BAM pipeline collect these automatically. For the following example, we assume that we have a folder '/path/to/genomes', which contains our reference genomes:
$ ls /path/to/genomes
AE000516_2.fasta
AE004091_2.fasta
AE008922_1.fasta
AE008923_1.fasta
To automatically add these (4) reference genomes to the makefile, we would add a prefix as follows:
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Name of the prefix; is used as part of the output filenames
MyGenomes*:
# Path to .fasta file containg a set of reference sequences.
Path: /path/to/genomes/*.fasta
There are two components to this, namely the name of the pseudo-prefix which must end with a star (*), and the path which may contain one or more wild-cards. If the prefix name does not end with a star, the BAM pipeline will simply treat the path as a regular path. In this particular case, the BAM pipeline will perform the equivalent of 'ls /path/to/genomes/*.fasta', and then add each file it has located using the filename without extensions as the name of the prefix. In other words, the above is equivalent to the following:
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
# Name of the prefix; is used as part of the output filenames
AE000516_2:
Path: /path/to/genomes/AE000516_2.fasta
AE004091_2:
Path: /path/to/genomes/AE004091_2.fasta
AE008922_1:
Path: /path/to/genomes/AE008922_1.fasta
AE008923_1:
Path: /path/to/genomes/AE008923_1.fasta
A makefile including such prefixes is executed as any other makefile.
Note
The name provided for the pseudo-prefix (here 'MyGenomes') is not used by the pipeline, and can instead be used to document the nature of the files being included.
Warning
Just like regular prefixes, it is required that the filename of the reference genome ends with '.fasta'. However, the pipeline will attempt to add any file found using the provided path with wildcards, and care should therefore be taken to avoid including non-FASTA files. For example, if the path '/path/to/genomes/*' was used instead of '/path/to/genomes/*.fasta', this would cause the pipeline to abort due to the inclusion of (for example) non-FASTA index files generated at this location by the pipeline itself.
Prefix labels¶
Prefixes:
# Uncomment and replace 'NAME_OF_PREFIX' with name of the prefix; this name
# is used in summary statistics and as part of output filenames.
# NAME_OF_PREFIX:
# ...
# (Optional) Uncomment and replace 'LABEL' with one of 'nuclear',
# 'mitochondrial', 'chloroplast', 'plasmid', 'bacterial', or 'viral'.
# Label: LABEL
The label option for prefixes allow a prefix to be classified according to one of several categories, currently including 'nuclear', 'mitochondrial', 'chloroplast', 'plasmid', 'bacterial', and 'viral'. This is only used when generating the .summary files (if the 'Summary' feature is enabled), in which the label is used instead of the prefix name, and the results for prefixes with the same label are combined.
Warning
Labels are deprecated, and will either be removed in future versions of PALEOMIX, or significantly changed.
Targets section¶
104 105 106 107 108 109 110 111 112 113 114 | # Mapping targets are specified using the following structure. Uncomment and
# replace 'NAME_OF_TARGET' with the desired prefix for filenames.
#NAME_OF_TARGET:
# Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
# NAME_OF_SAMPLE:
# Uncomment and replace 'NAME_OF_LIBRARY' with the name of this sample.
# NAME_OF_LIBRARY:
# Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
# and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
# to be trimmed and mapped for this lane (may include wildcards).
# NAME_OF_LANE: PATH_WITH_WILDCARDS
|
In the BAM pipeline, the term 'Target' is used to refer not to a particular sample (though in typical usage a target includes just one sample), but rather one or more samples to processed together to generate a BAM file per prefix (see above). A sample included in a target may likewise contain one or more libraries, for each of which one or more sets of FASTQ reads are specified.
The following simplified example, derived from the makefile constructed as part of Pipeline usage section exemplifies this:
1 2 3 4 5 6 7 8 9 | # Target name; all output files uses this name as a prefix
MyFilename:
# Sample name; used to tag data for segregation in downstream analyses
MySample:
# Library name; used to tag data for segregation in downstream analyses
TGCTCA:
# Lane / run names and paths to FASTQ files
Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz
|
- Target name
- The first top section of this target (line 1, 'MyFilename') constitute the target name. This name is used as part of summary statistics and, more importantly, determined the first part of name of files generated as part of the processing of data specified for this target. Thus, in this example all files and folders generated during the processing of this target will start with 'MyFilename'; for example, the summary table normally generated from running the pipeline will be placed in a file named 'MyFilename.summary'.
- Sample name
- The subsections listed in the 'Target' section (line 2, 'MySample') consitute the (biological) samples included in this target; in the vast majority of analyses, you will have only a single sample per target, and in that case it is considered good practice to use the same name for both the target and the sample. A single target can, however, contain any number of samples, the data for which are tagged according to the names given in the makefile, using the SAM/BAM readgroup ('RG') tags.
- Library name
The subsections listed in the 'Sample' section (line 3, 'TGCTCA') constitute the sequencing libraries constructed during the extraction and library building for the current sample. For modern samples, there is typically only a single library per sample, but more complex sequencing projects (modern and ancient) may involve any number of libraries constructed from one or more extracts. It is very important that libraries be listed correctly (see below).
Warning
Note that the BAM pipeline imposes the restriction that each library name specified for a target must be unique, even if these are located in to different samples. This restriction may be removed in future versions of PALEOMIX.
- Lane name
- The subsections of each library are used to specify the names of invidual
In addition to these target (sub)sections, it is possibel to specify 'Options' for individual targets, samples, and libraries, similarly to how this is done globally at the top of the makefile. This is described below.
Warning
It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping together lanes will result either in the loss of sequences which are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these. Furthermore, GATK and mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by incorrect specification of libraries.
Including already trimmed reads¶
In some cases it is useful to include FASTQ reads that has already been trimmed for adapter sequences. While this is not recommended in general, as it may introduce systematic bias if some data has been processed differently than the remaining FASTQ reads, the BAM pipeline makes it simple to incorporate both 'raw' and trimmed FASTQ reads, and to ensure that these integrate in the pipeline.
To include already trimmed reads, these are specified as values belonging to a lane, using the same names for read-types as in the 'ExcludeReads' option (see above). The following minimal example demonstrates this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | MyFilename:
MySample:
ACGATA:
# Regular lane, containing reads that are not already trimmed
Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz
# Lane containing pre-trimmed reads of each type
Lane_2:
# Single-end reads
Single: /path/to/single_end_reads.fastq.gz
# Paired-end reads where one mate has been discarded
Singleton: /path/to/singleton_reads.fastq.gz
# Paired end reads; note that the {Pair} key is required,
# just like with raw, paired-end reads
Paired: /path/to/paired_end_{Pair}.fastq.gz
# Paired-end reads merged into a single sequence
Collapsed: /path/to/collapsed.fastq.gz
# Paired-end reads merged into a single sequence, and then truncated
CollapsedTruncated: /path/to/collapsed_truncated.fastq.gz
|
The above examples show how each type of reads are to be listed, but it is not nessesary to specify more than a single type of pre-trimmed reads in the makefile.
Note
Including already trimmed reads currently result in the absence of some summary statistics in the .summary file, namely the number of raw reads, as well as trimming statistics, since the BAM pipeline currently relies on AdapterRemoval to collect these statistics.
Overriding global settings¶
In addition to the 'Options' section included, by default, at the beginning of every makefile, it is possible to specify / override options at a Target, Sample, and Library level. This allows, for example, that different adapter sequences be specified for each library generated for a sample, or options that should only be applied to a particular sample among several included in a makefile. The following demonstration uses the makefile constructed as part of Pipeline usage section as the base:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 | MyFilename:
# These options apply to all samples with this filename
Options:
# In this example, we override the default adapter sequences
AdapterRemoval:
--adapter1: AGATCGGAAGAGC
--adapter2: AGATCGGAAGAGC
MySample:
# These options apply to libraries 'ACGATA', 'GCTCTG', and 'TGCTCA'
Options:
# In this example, we assume that FASTQ files for our libraries
# include Phred quality scores encoded with offset 64.
QualityOffset: 64
ACGATA:
Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz
GCTCTG:
# These options apply to 'Lane_1' in the 'GCTCTG' library
Options:
# It is possible to override options we have previously overriden
QualityOffset: 33
Lane_1: 000_data/GCTCTG_L1_*.fastq.gz
TGCTCA:
Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz
|
In this example, we have overwritten options at 3 places:
- The first place (lines 2 - 7) will be applied to all samples, libraries, and lanes in this target, unless subsequently overridden. In this example, we have set a new pair of adapter sequences, which we wish to use for these data.
- The second place (lines 10 - 14) are applied to the sample 'MySample' that we have included in this target, and consequently applies to all libraries specified for this sample ('ACGATA', 'GCTCTG', and 'TGCTCA'). In most cases you will only have a single sample, and so it will not make a difference whether or not you override options for the entire target (e.g. lines 3 - 8), or just for that sample (e.g. lines 11-15).
- Finally, the third place (lines 20 - 23) demonstrate how options can be overridden for a particular library. In this example, we have chosen to override an option (for this library only!) we previously overrode for that sample (the 'QualityOffset' option).
Note
It currently not possible to override options for a single lane, it is only possible to override options for all lanes in a library.
Warning
It is currently not possible to set the 'Features' except in the global 'Options' section at the top of the Makefile; this limitation will be removed in future versions of PALEOMIX.