Makefile description

The following sections reviews the options available in the BAM pipeline makefiles. As described in the Pipeline usage section, a default makefile may be generated using the 'paleomix bam_pipeline makefile' command. For clarity, the location of options in subsections are specified by concatenating the names using '::' as a separator. Thus, in the following (simplified example), the 'UseSeed' option (line 13) would be referred to as 'Options :: Aligners :: BWA :: UseSeed':

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
Options:

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
      # Post-mortem damage localizes to the seed region, which BWA expects to
      # have few errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes

Specifying command-line options

For several programs it is possible to directly specify command-line options; this is accomplished in one of 3 ways; firstly, for command-line options that take a single value, this is accomplished simply by specifying the option and value as any other option. For example, if we wish to supply the option --mm 5 to AdapterRemoval, then we would list it as "--mm: 5" (all other options omitted for brevity):

AdapterRemoval:
  --mm: 5

For options that do not take any values, such as the AdapterRemoval --trimns (enabling the trimming of Ns in the reads), these are specified either as "--trimmns: ", with the value left blank, or as "--trimns: yes". The following are therefore equivalent:

AdapterRemoval:
  --trimns:      # Method 1
  --trimns: yes  # Method 2

In some cases the BAM pipeline will enable features by default, but still allow these to be overridden. In those cases, the feature can be disabled by setting the value to 'no' (without quotes), as shown here:

AdapterRemoval:
  --trimns: no

If you need to provide the text "yes" or "no" as the value for an option, it is necessary to put these in quotes:

--my-option: "yes"
--my-option: "no"

In some cases it is possible or even necessary to specify an option multiple times. Due to the way YAML works, this is not possible to do so directly. Instead, the pipeline allows multiple instances of the same option by providing these as a list:

--my-option:
    - "yes"
    - "no"
    - "maybe"

The above will be translated into calling the program in question with the options "--my-option yes --my-option no --my-option maybe".

Options

By default, the 'Options' section of the makefile contains the following:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
     # Set and uncomment to override defaults adapter sequences
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
     # Some BAM pipeline defaults differ from AR defaults;
     # To override, change these value(s):
     --mm: 3
     --minlength: 25
     # Extra features enabled by default; change 'yes' to 'no' to disable
     --collapse: yes
     --trimns: yes
     --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
      # Post-mortem damage localizes to the seed region, which BWA expects to
      # have few errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes
      # Additional command-line options may be specified below. For 'backtrack' these
      # are applied to the "bwa aln". See Bowtie2 for more examples.
#      -n: 0.04

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Command-line options for mapDamage; use long-form options(--length not -l):
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

  # Optional steps to perform during processing.
  Features:
    # If set to 'filter', PCR duplicates are removed from the output files; if set to
    # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the
    # output files; if set to 'no', the reads are assumed to not have been amplified.
    PCRDuplicates: filter
    # Set to 'no' to disable mapDamage; set to 'plots' to build basic mapDamage plots;
    # set to 'model' to build plots and post-mortem damage models; and set to 'rescale'
    # to build plots, models, and BAMs with rescaled quality scores. All analyses are
    # carried out per library.
    mapDamage: plot
    # Generate coverage information for the final BAM and for each 'RegionsOfInterest'
    # specified in 'Prefixes' (yes / no).
    Coverage: yes
    # Generate histograms of number of sites with a given read-depth, from 0 to 200,
    # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no).
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes

General Options

Options :: Platform
7
8
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina

The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The SAM/BAM specification the valid platforms, which currently include 'CAPILLARY', 'HELICOS', 'ILLUMINA', 'IONTORRENT', 'LS454', 'ONT', 'PACBIO', and 'SOLID'.

Options :: QualityOffset
 9
10
11
12
13
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33

The QualityOffset option refers to the starting ASCII value used to encode Phred quality-scores in user-provided FASTQ files, with the possible values of 33, 64, and 'Solexa'. For most modern data, this will be 33, corresponding to ASCII characters in the range '!' to 'J'. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range '@' to 'h', and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ';' to 'h'. For a visual representation of this, refer to the Wikipedia article linked above.

Adapter Trimming

The "AdapterRemoval" subsection allows for options that are applied when AdapterRemoval is applied to the FASTQ reads supplied by the user. For a more detailed description of command-line options, please refer to the AdapterRemoval documentation. A few important particularly options are described here:

Options :: AdapterRemoval :: --adapter1 and Options :: AdapterRemoval :: --adapter2
17
18
19
     # Set and uncomment to override defaults adapter sequences
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination. Thus, the sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified as in the orientation that appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. Please refer to the AdapterRemoval documentation for more information.

Note

As of version AdapterRemoval 2.1, it is possible to use multiple threads to speed up trimming of adapter sequences. This is accomplished not by setting the --threads command-line option in the makefile, but by supplying the --adapterremoval-max-threads option to the BAM pipeline itself:

$ paleomix bam run makefile.yaml --adapterremoval-max-threads 2
Options :: AdapterRemoval :: --mm
22
     --mm: 3

Sets the fraction of mismatches allowed when aligning reads / adapter sequences. If the specified value (MM) is greater than 1, this is calculated as 1 / MM, otherwise the value is used directly. To set, replace the default value as desired:

--mm: 3    # Maximum mismatch rate of 1 / 3
--mm: 5    # Maximum mismatch rate of 1 / 5
--mm: 0.2  # Maximum mismatch rate of 1 / 5

Options :: AdapterRemoval :: --minlength

The minimum length required after read merging, adapter trimming, and base-quality quality trimming; resulting reads shorter than this length are discarded, and thereby excluded from further analyses by the pipeline. A value of at least 25 bp is recommended to cut down on the rate of spurious alignments; if possible, a value of 30 bp may be used to greatly reduce the fraction of spurious alignments, with smaller gains for greater minimums [Schubert2012].

Warning

The default value used by PALEOMIX for --minlength (25 bp) differs from the default value for AdapterRemoval (15 bp). Thus, if a minimum length of 15 bp is desired, it is nessesarily to explicitly state so in the makefile, simply commenting out this command-line argument is not sufficient.

Options :: AdapterRemoval :: --collapse
25
     --collapse: yes

If enabled, AdapterRemoval will attempt to combine overlapping paired-end reads into a single (potentially longer) sequence. This has at least two advantages, namely that longer reads allow for less ambiguous alignments against the target reference genome, and that the fidelity of the overlapping region (potentially the entire read) is improved by selecting the highest quality base when discrepancies are observed. The names of reads thus merged are prefixed with either 'M_' or 'MT_', with the latter marking reads that have been trimmed from the 5' or 3' termini following collapse, and which therefore do not represent the full insert. To disable this behavior, set the option to 'no' (without quotes):

--collapse: yes  # Option enabled
--collapse: no   # Option disabled

Note

This option may be combined with the 'ExcludeReads' option (see below), to either eliminate or select for short inserts, depending on the expectations from the experiment. I.e. for ancient samples, where the most inserts should be short enough to allow collapsing (< 2x read read - 11, by default), excluding paired (uncollapsed) and singleton reads may help reduce the fraction of exogenous DNA mapped.

Options :: AdapterRemoval :: --trimns
26
     --trimns: yes

If set to 'yes' (without quotes), AdapterRemoval will trim uncalled bases ('N') from the 5' and 3' end of the reads. Trimming will stop at the first called base ('A', 'C', 'G', or 'T'). If both --trimns and --trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' end of the reads. To disable, set the option to 'no' (without quotes):

--trimns: yes  # Option enabled
--trimns: no   # Option disabled
Options :: AdapterRemoval :: --trimqualities
27
     --trimqualities: yes

If set to 'yes' (without quotes), AdapterRemoval will trim low-quality bases from the 5' and 3' end of the reads. Trimming will stop at the first base which is greater than the (Phred encoded) minimum quality score specified using the command-line option --minquality. This value defaults to 2. If both --trimns and --trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' end of the reads. To disable, set the option to 'no' (without quotes):

--trimqualities: yes  # Option enabled
--trimqualities: no   # Option disabled

Short read aligners

This section allow selection between supported short read aligners (currently BWA [Li2009a] and Bowtie2 [Langmead2012]), as well as setting options for these, individually:

29
30
31
32
  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

To select a mapping program, set the 'Program' option appropriately:

Program: BWA      # Using BWA to map reads
Program: Bowtie2  # Using Bowtie2 to map reads

Short read aligners - BWA

The following options are applied only when running the BWA short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.

35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
      # Post-mortem damage localizes to the seed region, which BWA expects to
      # have few errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes
      # Additional command-line options may be specified below. For 'backtrack' these
      # are applied to the "bwa aln". See Bowtie2 for more examples.
#      -n: 0.04
Options :: Aligners :: BWA :: Algorithm
26
27
28
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack

The mapping algorithm to use; options are 'backtrack' (corresponding to 'bwa aln'), 'bwasw', and 'mem'. Additional command-line options may be specified for these. Algorithms are selected as follows:

Algorithm: backtrack  # 'Backtrack' algorithm, using the command 'bwa aln'
Algorithm: bwasw      # 'SW' algorithm for long queries, using the command 'bwa bwasw'
Algorithm: mem        # 'mem' algorithm, using the command 'bwa mem'

Warning

Alignment algorithms 'bwasw' and 'mem' currently cannot be used with input data that is encoded using QualityOffset 64 or 'Solexa'. This is a limitation of PALEOMIX, and will be resolved in future versions. In the mean time, this can be circumvented by converting FASTQ reads to the standard quality-offset 33, using for example seqtk.

Options :: Aligners :: BWA :: MinQuality
39
40
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0

Specifies the minimum mapping quality of alignments produced by BWA. Any aligned read with a quality score below this value are removed during the mapping process. Note that while unmapped read have a quality of zero, these are not excluded by a non-zero 'MinQuality' value. To filter unmapped reads, use the option 'FilterUnmappedReads' (see below). To set this option, replace the default value with a desired minimum:

MinQuality: 0   # Keep all hits
MinQuality: 25  # Keep only hits where mapping-quality >= 25
Options :: Aligners :: BWA :: FilterUnmappedReads
41
42
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes

Specifies wether or not unmapped reads (reads not aligned to a target sequence) are to be retained in the resulting BAM files. If set to 'yes' (without quotes), all unmapped reads are discarded during the mapping process, while setting the option to 'no' (without quotes) retains these reads in the BAM. By convention, paired reads in which one mate is unmapped are assigned the same chromosome and position, while no chromosome / position are assigned to unmapped single-end reads. To change this setting, replace the value with either 'yes' or 'no' (without quotes):

FilterUnmappedReads: yes  # Remove unmapped reads during alignment
FilterUnmappedReads: no   # Keep unmapped reads

Options :: Aligners :: BWA :: *

Additional command-line options may be specified for the selected alignment algorithm, as described in the "Specifying command-line options" section above. See also the examples listed for Bowtie2 below. Note that for the 'backtrack' algorithm, it is only possible to specify options for the 'bwa aln' call.

Short read aligners - Bowtie2

The following options are applied only when running the Bowtie2 short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.

51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup
Options :: Aligners :: Bowtie2 :: MinQuality
53
54
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0

See 'Options :: Aligners :: BWA :: MinQuality' above.

Options :: Aligners :: Bowtie2 :: FilterUnmappedReads
55
56
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes

See 'Options :: Aligners :: BWA :: FilterUnmappedReads' above.

Options :: Aligners :: BWA :: *

Additional command-line options may be specified for Bowtie2, as described in the "Specifying command-line options" section above. Please refer to the Bowtie2 documentation for more information about available command-line options.

mapDamage options

67
68
69
70
71
  # Command-line options for mapDamage; use long-form options(--length not -l):
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

This subsection is used to specify options for mapDamage2.0, when plotting post-mortem DNA damage, when building models of the post-mortem damage, and when rescaling quality scores to account for this damage. In order to enable plotting, modeling, or rescaling of quality scores, please see the 'mapDamage' option in the 'Features' section below.

Note

It may be worthwhile to tweak mapDamage parameters before building a model of post-mortem DNA damage; this may be accomplished by running the pipeline without rescaling, running with the 'mapDamage' feature set to 'plot' (with or without quotes), inspecting the plots generated per-library, and then tweaking parameters as appropriate, before setting 'mapDamage' to 'model' (with or without quotes).

Should you wish to change the modeling and rescaling parameters, after having already run the pipeline with rescaling enabled, simply remove the mapDamage files generated for the relevant libraries (see the File structure section).

Warning

Rescaling requires a certain minimum number of C>T and G>A substitutions, before it is possible to construct a model of post-mortem DNA damage. If mapDamage fails with an error indicating that "DNA damage levels are too low", then it is necessary to disable rescaling for that library to continue.

Options :: mapDamage :: --downsample
69
70
71
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

By default the BAM pipeline only samples 100k reads for use in constructing mapDamage plots; in our experience, this is sufficient for accurate plots and models. If no downsampling is to be done, this value can set to 0 to disable this features:

--downsample: 100000   # Sample 100 thousand reads
--downsample: 1000000  # Sample 1 million reads
--downsample: 0        # No downsampling

Options :: mapDamage :: *

Additional command-line options may be supplied to mapDamage, just like the --downsample parameter, as described in the "Specifying command-line options" section above. These are used during plotting and rescaling (if enabled).

Excluding read-types

73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
  # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

During the adapter-trimming and read-merging step, AdapterRemoval will generate a selection of different read types. This option allows certain read-types to be excluded from further analyses. In particular, it may be useful to exclude non-collapsed (paired and singleton) reads when processing (ancient) DNA for which only short inserts is expected, since this may help exclude exogenous DNA. The following read types are currently recognized:

Single
Single-end reads; these are the (trimmed) reads generated from supplying single-end FASTQ files to the pipeline.
Paired
Paired-end reads; these are the (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only the subset of paired reads for which both mates were retained, and which were not merged into a single read (if --collapse is set for AdapterRemoval).
Singleton
Paired-end reads; these are (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only those reads in which one of the two mates were discarded due to either the --maxns, the --minlength, or the --maxlength options supplied to AdapterRemoval. Consequently, these reads are mapped and PCR-duplicate filtered in single-end mode.
Collapsed
Paired-end reads, for which the sequences overlapped, and which were consequently merged by AdapterRemoval into a single sequence (enabled by the --collapse command-line option). These sequences are expected to represent the complete insert, and while they are mapped in single-end mode, PCR duplicate filtering is carried out in a manner that treats these as paired reads. Note that all collapsed reads are tagged by prefixing the read name with 'M_'.
CollapsedTruncated
Paired-end reads (like Collapsed), which were trimmed due to the --trimqualities or the --trimns command-line options supplied to AdapterRemoval. Consequently, and as these sequences represent the entire insert, these reads are mapped and PCR-duplicate filtered in single-end mode. Note that all collapsed, truncated reads are tagged by prefixing the read name with 'MT_'.

To enable / disable exclusion of a read type, set the value for the appropriate type to 'yes' or 'no' (without quotes):

Singleton: no   # Singleton reads are NOT excluded
Singleton: yes  # Singleton reads are excluded

Optional features

104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
  # Optional steps to perform during processing.
  Features:
    # If set to 'filter', PCR duplicates are removed from the output files; if set to
    # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the
    # output files; if set to 'no', the reads are assumed to not have been amplified.
    PCRDuplicates: filter
    # Set to 'no' to disable mapDamage; set to 'plots' to build basic mapDamage plots;
    # set to 'model' to build plots and post-mortem damage models; and set to 'rescale'
    # to build plots, models, and BAMs with rescaled quality scores. All analyses are
    # carried out per library.
    mapDamage: plot
    # Generate coverage information for the final BAM and for each 'RegionsOfInterest'
    # specified in 'Prefixes' (yes / no).
    Coverage: yes
    # Generate histograms of number of sites with a given read-depth, from 0 to 200,
    # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no).
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes

This section lists several optional features, in particular determining which BAM files and which summary statistics are generated when running the pipeline. Currently, the following options are available:

PCRDuplicates
This option determines how the BAM pipeline handles PCR duplicates following the mapping of trimmed reads. At present, 3 possible options are available. The first option is 'filter', which corresponds to running Picard MarkDuplicates and 'paleomix rmdup_collapsed' on the input files, and removing any read determined to be a PCR duplicate; the second option 'mark' functions like the 'filter' option, except that reads are not removed from the output, but instead the read flag is marked using the 0x400 bit (see the SAM/BAM specification for more information), in order to allow down-stream tools to identify these as duplicates. The final option is 'no' (without quotes), in which case no PCR duplicate detection / filtering is carried out on the aligned reads, useful for data generated using amplification free sequencing.
mapDamage
The 'mapDamage' option accepts four possible values: 'no', 'plot', 'model', and 'rescale'. By default value ('plot'), will cause mapDamage to be run in order to generate simple plots of the post-mortem DNA damage rates, as well as base composition plots, and more. If set to 'model', mapDamage will firstly generate the plots described for 'plot', but also construct models of DNA damage parameters, as described in [Jonsson2013]. Note that a minimum amount of DNA damage is required to be present in order to build these models. If the option is set to 'rescale', both plots and models will be constructed using mapDamage, and in addition, the quality scores of bases will be down-graded based on how likely they are to represent post-mortem DNA damage (see above).
Coverage
If enabled, a table summarizing the number of hits, the number of aligned bases, bases inserted, and bases deleted, as well as the mean coverage, is generated for each reference sequence, stratified by sample, library, and contig.
Depths
If enabled, a table containing a histogram of the depth of coverage, ranging from 0 to 200, is generated for each reference sequence, stratified by sample, library, and contig. These files may further be used by the Phylogenetic pipeline, in order to automatically select a maximum read depth during SNP calling (see the Pipeline usage section for more information).
Summary
If enabled, a single summary table will be generated per target, containing information about the number of reads processed, hits and fraction of PCR duplicates (per prefix and per library), and much more.

For a description of where files are placed, refer to the File structure section. It is possible to run the BAM pipeline without any of these options enabled, and this may be useful in certain cases (if only the statistics or per-library BAMs are needed). To enable / disable a features, set the value for that feature to 'yes' or 'no' (without quotes):

Summary: no   # Do NOT generate a per-target summary table
Summary: yes  # Generate a per-target summary table

Prefixes

110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
# Map of prefixes by name, each having a Path, which specifies the location of the
# BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name is used in summary
  # statistics and as part of output filenames.
  NAME_OF_PREFIX:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the references
    # against which reads are to be mapped. Using the same name as filename is strongly
    # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
    Path: PATH_TO_PREFIX.fasta

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a .bed file
    # listing extra regions for which coverage / depth statistics should be calculated;
    # if no names are specified for the BED records, results are named after the
    # chromosome / contig. Replace 'NAME' with the desired name for these regions.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE

Reference genomes used for mapping are specified by listing these (one or more) in the 'Prefixes' section. Each reference genome is associated with a name (used in summary statistics and as part of the resulting filenames), and the path to a FASTA file which contains the reference genome. Several other options are also available, but only the name and the 'Path' value are required, as shown here for several examples:

#  Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  MyPrefix1:
    # Path to FASTA file containing reference genome; must end with '.fasta'
    Path: /path/to/genomes/file_1.fasta
  MyPrefix2:
    Path: /path/to/genomes/file_2.fasta
  MyPrefix3:
    Path: /path/to/genomes/AE008922_1.fasta

Each sample in the makefile is mapped against each prefix, and BAM files are generated according to the enabled 'Features' (see above). In addition to the path, it is possible to specify RegionsOfInterest, which are described below.

Regions of interest

It is possible to specify one or more "regions of interest" for a particular reference genome. Doing so results in the production of coverage and depth tables being generated for those regions (if these features are enabled, see above), as well as additional information in the summary table (if enabled).

Such regions are specified using a BED file containing one or more regions; in particular, the first three columns (name, 0-based start coordinate, and 1-based end coordinate) are required, with the 4th column (the name) being optional. Strand information (the 6th column) is not used, but must still be valid according to the BED format.

Statistics are merged by the names specified in the BED file, or by the contig names if no names were specified. Thus, it is important to insure that names are unique if individual statistics are desired for every region.

Specifying regions of interest is accomplished by providing a name and a path for each set of regions of interest under the RegionOfInterest section for a given prefix:

# Produce additional coverage / depth statistics for a set of
# regions defined in a BED file; if no names are specified for the
# BED records, results are named after the chromosome / contig.
RegionsOfInterest:
  MyRegions: /path/to/my_regions.bed
  MyOtherRegions: /path/to/my_other_regions.bed

The following is a simple example of such a BED file, for an alignment against the rCRS (NC_012920.1):

NC_012920_1 3306    4262    region_a
NC_012920_1 4469    5510    region_b
NC_012920_1 5903    7442    region_a

In this case, the resulting tables will contain information about two different regions, namely region_a (2495 bp, resulting from merging the two individual regions specified), and region_b (1041 bp). The order of lines in this file does not matter.

Adding multiple prefixes

In cases where it is necessary to map samples against a large number of reference genomes, it may become impractical to add these to the makefile by hand. It is therefore possible to specify the location of the reference genomes via a path containing wild-cards, and letting the BAM pipeline collect these automatically. For the following example, we assume that we have a folder at '/path/to/genomes', which contains our reference genomes:

$ ls /path/to/genomes
AE000516_2.fasta
AE004091_2.fasta
AE008922_1.fasta
AE008923_1.fasta

To automatically add these four reference genomes to the makefile, we would add a prefix as follows:

#  Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  MyGenomes*:
    # Path to .fasta file containing a set of reference sequences.
    Path: /path/to/genomes/*.fasta

There are two components to this, namely the name of the pseudo-prefix which must end with a star (*), and the path which may contain one or more wild-cards. If the prefix name does not end with a star, the BAM pipeline will simply treat the path as a regular path. In this particular case, the BAM pipeline will perform the equivalent of 'ls /path/to/genomes/*.fasta', and then add each file it has located using the filename without extensions as the name of the prefix. In other words, the above is equivalent to the following:

#  Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  AE000516_2:
    Path: /path/to/genomes/AE000516_2.fasta
  AE004091_2:
    Path: /path/to/genomes/AE004091_2.fasta
  AE008922_1:
    Path: /path/to/genomes/AE008922_1.fasta
  AE008923_1:
    Path: /path/to/genomes/AE008923_1.fasta

Note

The name provided for the pseudo-prefix (here 'MyGenomes') is not used by the pipeline, and can instead be used to document the nature of the files being included.

Warning

Just like regular prefixes, it is required that the filename of the reference genome ends with '.fasta'. However, the pipeline will attempt to add any file found using the provided path with wildcards, and care should therefore be taken to avoid including non-FASTA files. For example, if the path '/path/to/genomes/*' was used instead of '/path/to/genomes/*.fasta', this would cause the pipeline to abort due to the inclusion of (for example) non-FASTA index files generated at this location by the pipeline itself.

Targets

129
130
131
132
133
134
135
136
137
138
139
# Mapping targets are specified using the following structure. Replace 'NAME_OF_TARGET'
# with the desired prefix for filenames.
NAME_OF_TARGET:
  # Replace 'NAME_OF_SAMPLE' with the name of this sample.
  NAME_OF_SAMPLE:
    # Replace 'NAME_OF_LIBRARY' with the name of this sample.
    NAME_OF_LIBRARY:
       # Replace 'NAME_OF_LANE' with the lane name (e.g. the barcode) and replace
       # 'PATH_WITH_WILDCARDS' with the path to the FASTQ files to be trimmed and mapped
       # for this lane (may include wildcards).
      NAME_OF_LANE: PATH_WITH_WILDCARDS

In the BAM pipeline, the term 'Target' is used to refer not to a particular sample (though in typical usage a target includes just one sample), but rather one or more samples to processed together to generate a BAM file per prefix (see above). A sample included in a target may likewise contain one or more libraries, for each of which one or more sets of FASTQ reads are specified.

The following simplified example, derived from the makefile constructed as part of Pipeline usage section exemplifies this:

1
2
3
4
5
6
7
8
9
# Target name; all output files uses this name as a prefix
MyFilename:
  # Sample name; used to tag data for segregation in downstream analyses
  MySample:
    # Library name; used to tag data for segregation in downstream analyses
    TGCTCA:
      # Lane / run names and paths to FASTQ files
      Lane_1: data/TGCTCA_L1_*.fastq.gz
      Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz
Target name
The first top section of this target (line 2, 'MyFilename') constitute the target name. This name is used as part of summary statistics and, more importantly, determined the first part of name of files generated as part of the processing of data specified for this target. Thus, in this example all files and folders generated during the processing of this target will start with 'MyFilename'; for example, the summary table normally generated from running the pipeline will be placed in a file named 'MyFilename.summary'.
Sample name
The subsections listed in the 'Target' section (line 4, 'MySample') constitute the (biological) samples included in this target; in the vast majority of analyses, you will have only a single sample per target, and in that case it is considered good practice to use the same name for both the target and the sample. A single target can, however, contain any number of samples, the data for which are tagged according to the names given in the makefile, using the SAM/BAM readgroup ('RG') tags.
Library name

The subsections listed in the 'Sample' section (line 6, 'TGCTCA') constitute the sequencing libraries constructed during the extraction and library building for the current sample. For modern samples, there is typically only a single library per sample, but more complex sequencing projects (modern and ancient) may involve any number of libraries constructed from one or more extracts. It is very important that libraries be listed correctly (see below).

Warning

Note that the BAM pipeline imposes the restriction that each library name specified for a target must be unique, even if these are located in to different samples. This restriction may be removed in future versions of PALEOMIX.

Lane name
The subsections of each library are used to specify the names of individual lanes. Each sample may include any number of lanes and each lane may include any number of FASTQ files. Paths listed here may include wildcards (*) or the special value {Pair}, which is used to indicate that the lane is paired end. During runtime, {Pair} is replaced with 1 and 2, before searching using wildcards, and the pipeline expects the resulting filenames to correspond to mate 1 and mate 2 reads respectively.

In addition to these target (sub)sections, it is possible to specify 'Options' for individual targets, samples, and libraries, similarly to how this is done globally at the top of the makefile. This is described below.

Warning

It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping together lanes will result either in the loss of sequences which are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these. Furthermore, mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by incorrect specification of libraries.

Including already trimmed reads

In some cases it is useful to include FASTQ reads that has already been trimmed for adapter sequences. While this is not recommended in general, as it may introduce systematic bias if some data has been processed differently than the remaining FASTQ reads, the BAM pipeline makes it simple to incorporate both 'raw' and trimmed FASTQ reads, and to ensure that these integrate in the pipeline.

To include already trimmed reads, these are specified as values belonging to a lane, using the same names for read-types as in the 'ExcludeReads' option (see above). The following minimal example demonstrates this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
MyFilename:
  MySample:
    ACGATA:
      # Regular lane, containing reads that are not already trimmed
      Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz

      # Lane containing pre-trimmed reads of each type
      Lane_2:
        # Single-end reads
        Single: /path/to/single_end_reads.fastq.gz

        # Paired-end reads where one mate has been discarded
        Singleton: /path/to/singleton_reads.fastq.gz

        # Paired end reads; note that the {Pair} key is required,
        # just like with raw, paired-end reads
        Paired: /path/to/paired_end_{Pair}.fastq.gz

        # Paired-end reads merged into a single sequence
        Collapsed: /path/to/collapsed.fastq.gz

        # Paired-end reads merged into a single sequence, and then truncated
        CollapsedTruncated: /path/to/collapsed_truncated.fastq.gz

The above examples show how each type of reads are to be listed, but it is not necessary to specify more than a single type of pre-trimmed reads in the makefile.

Note

Including already trimmed reads currently result in the absence of some summary statistics in the .summary file, namely the number of raw reads, as well as trimming statistics, since the BAM pipeline currently relies on AdapterRemoval to collect these statistics.

Overriding global settings

In addition to the main Options section, it is possible to override options at a Target, Sample, and Library level. This allows, for example, that different adapter sequences be specified for each library generated for a sample, or options that should only be applied to a particular sample among several included in a makefile. The following demonstration uses the makefile constructed as part of Pipeline usage section as the base:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
MyFilename:
  # These options apply to all samples with this filename
  Options:
    # In this example, we override the default adapter sequences
    AdapterRemoval:
      --adapter1: AGATCGGAAGAGC
      --adapter2: AGATCGGAAGAGC

  MySample:
    # These options apply to libraries 'ACGATA', 'GCTCTG', and 'TGCTCA'
    Options:
       # In this example, we assume that FASTQ files for our libraries
       # include Phred quality scores encoded with offset 64.
       QualityOffset: 64

    ACGATA:
      Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz

    GCTCTG:
      # These options apply to 'Lane_1' in the 'GCTCTG' library
      Options:
        # It is possible to override options we have previously overridden
        QualityOffset: 33

      Lane_1: data/GCTCTG_L1_*.fastq.gz

    TGCTCA:
      Lane_1: data/TGCTCA_L1_*.fastq.gz
      Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

In this example, we have overwritten options at 3 places:

  • The first place (lines 2 - 7) will be applied to all samples, libraries, and lanes in this target, unless subsequently overridden. In this example, we have set a new pair of adapter sequences, which we wish to use for these data.
  • The second place (lines 10 - 14) are applied to the sample 'MySample' that we have included in this target, and consequently applies to all libraries specified for this sample ('ACGATA', 'GCTCTG', and 'TGCTCA'). In most cases you will only have a single sample, and so it will not make a difference whether or not you override options for the entire target (e.g. lines 3 - 8), or just for that sample (e.g. lines 11-15).
  • Finally, the third place (lines 20 - 23) demonstrate how options can be overridden for a particular library. In this example, we have chosen to override an option (for this library only!) we previously overrode for that sample (the 'QualityOffset' option).

Note

It currently not possible to override options for a single lane, it is only possible to override options for all lanes in a library.

Warning

Only the mapDamage and the PCRDuplicates features can be overridden for individual targets, samples, or libraries. Over features can be overridden per target, but not per sample or per library.