Makefile description

The following sections reviews the options available in the BAM pipeline makefiles. As described in the Pipeline usage section, a default makefile may be generated using the ‘paleomix bam_pipeline makefile’ command. For clarity, the location of options in subsections are specified by concatenating the names using ‘::’ as a separator. Thus, in the following (simplified example), the ‘UseSeed’ option (line 13) would be referred to as ‘Options :: Aligners :: BWA :: UseSeed’:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
Options:

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # May be disabled ("no") for aDNA alignments, as post-mortem damage
      # localizes to the seed region, which BWA expects to have few
      # errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes

Specifying command-line options

For several programs it is possible to directly specify command-line options; this is accomplished in one of 3 ways; firstly, for command-line options that take a single value, this is accomplished simply by specifying the option and value as any other option. For example, if we wish to supply the option –mm 5 to AdapterRemoval, then we would list it as “–mm: 5” (all other options omitted for brevity):

AdapterRemoval:
  --mm: 5

For options that do not take any values, such as the AdapterRemoval ‘–trimns’ (enabling the trimming of Ns in the reads), these are specified either as “–trimmns: “, with the value left blank, or as “–trimns: yes”. The following are therefore equivalent:

AdapterRemoval:
  --trimns:      # Method 1
  --trimns: yes  # Method 2

In some cases the BAM pipeline will enable features by default, but still allow these to be overridden. In those cases, the feature can be disabled by setting the value to ‘no’ (without quotes), as shown here:

AdapterRemoval:
  --trimns: no

If you need to provide the text “yes” or “no” as the value for an option, it is necessary to put these in quotes:

--my-option: "yes"
--my-option: "no"

In some cases it is possible or even necessary to specify an option multiple times. Due to the way YAML works, this is not possible to do so directly. Instead, the pipeline allows multiple instances of the same option by providing these as a list:

--my-option:
    - "yes"
    - "no"
    - "maybe"

The above will be translated into calling the program in question with the options “–my-option yes –my-option no –my-option maybe”.

Options section

By default, the ‘Options’ section of the makefile contains the following:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specific globally.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: bz2

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
     # Adapter sequences, set and uncomment to override defaults
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
     # Some BAM pipeline defaults differ from AR defaults;
     # To override, change these value(s):
     --mm: 3
     --minlength: 25
     # Extra features enabled by default; change 'yes' to 'no' to disable
     --collapse: yes
     --trimns: yes
     --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments, as post-mortem damage
      # localizes to the seed region, which BWA expects to have few
      # errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes
      # Additional command-line options may be specified for the "aln"
      # call(s), as described below for Bowtie2 below.

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_duplicates', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

  # Optional steps to perform during processing.
  Features:
    # Generate BAM from the alignments without indel realignment (yes / no)
    RawBAM: no
    # Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
    RealignedBAM: yes
    # To disable mapDamage, write 'no'; to generate basic mapDamage plots,
    # write 'plot'; to build post-mortem damage models, write 'model',
    # and to produce rescaled BAMs, write 'rescale'. The 'model' option
    # includes the 'plot' output, and the 'rescale' option includes both
    # 'plot' and 'model' results. All analyses are carried out per library.
    mapDamage: plot
    # Generate coverage information for the raw BAM (wo/ indel realignment).
    # If one or more 'RegionsOfInterest' have been specified for a prefix,
    # additional coverage files are generated for each alignment (yes / no)
    Coverage: yes
    # Generate histogram of number of sites with a given read-depth, from 0
    # to 200. If one or more 'RegionsOfInterest' have been specified for a
    # prefix, additional histograms are generated for each alignment (yes / no)
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes
    # Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
    DuplicateHist: no


# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name
  # is used in summary statistics and as part of output filenames.
  NAME_OF_PREFIX:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: PATH_TO_PREFIX

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
    # .bed file listing extra regions for which coverage / depth statistics
    # should be calculated; if no names are specified for the BED records,
    # results are named after the chromosome / contig. Change 'NAME' to the
    # name to be used in summary statistics and output filenames.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE


# Mapping targets are specified using the following structure. Uncomment and
# replace 'NAME_OF_TARGET' with the desired prefix for filenames.
#NAME_OF_TARGET:
   #  Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
#  NAME_OF_SAMPLE:
     #  Uncomment and replace 'NAME_OF_LIBRARY' with the name of this sample.
#    NAME_OF_LIBRARY:
       # Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
       # and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
       # to be trimmed and mapped for this lane (may include wildcards).
#      NAME_OF_LANE: PATH_WITH_WILDCARDS

Options: General

Options :: Platform
7
8
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina

The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The SAM/BAM specification the valid platforms, which currently include ‘CAPILLARY’, ‘HELICOS’, ‘ILLUMINA’, ‘IONTORRENT’, ‘LS454’, ‘ONT’, ‘PACBIO’, and ‘SOLID’.

Options :: QualityOffset
 9
10
11
12
13
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33

The QualityOffset option refers to the starting ASCII value used to encode Phred quality-scores in user-provided FASTQ files, with the possible values of 33, 64, and ‘Solexa’. For most modern data, this will be 33, corresponding to ASCII characters in the range ‘!’ to ‘J’. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range ‘@’ to ‘h’, and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ‘;’ to ‘h’. For a visual representation of this, refer to the Wikipedia article linked above.

Options :: SplitLanesByFilenames
14
15
16
17
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes

This option influences how the BAM pipeline handles lanes that include multiple files. By default (corresponding to a value of ‘yes’), the pipeline will process individual files in parallel, potentially allowing for greater throughput. If set to ‘no’, all files in a lane are merged during processing, resulting in a single set of trimmed reads per lane. The only effect of this option on the final result is a greater number of read-groups specified in the final BAM files. See the File structure section for more details on how this is handled.

Warning

This option is deprecated, and will be removed in future versions of PALEOMIX.

Options :: CompressionFormat
18
19
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: bz2

This option determines which type of compression is carried out on trimmed FASTQ reads; if set to ‘gz’, reads are gzip compressed, and if set to ‘bz2’, reads are compressed using bzip2. This option has no effect on the final results, but may be used to trade off space (gz) for some additional runtime (bz2).

Warning

This option is deprecated, and may be removed in future versions of PALEOMIX.

Options :: PCRDuplicates
72
73
74
75
76
77
78
  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_duplicates', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

This option determines how the BAM pipeline handles PCR duplicates following the mapping of trimmed reads. At present, 3 possible options are available. The first option is ‘filter’, which corresponds to running Picard MarkDuplicates and ‘paleomix rmdup_collapsed’ on the input files, and removing any read determined to be a PCR duplicate; the second option ‘mark’ functions like the ‘filter’ option, except that reads are not removed from the output, but instead the read flag is marked using the 0x400 bit (see the SAM/BAM specification for more information), in order to allow down-stream tools to identify these as duplicates. The final option is ‘no’ (without quotes), in which case no PCR duplicate detection / filtering is carried out on the aligned reads, useful for data generated using amplification free sequencing.

Options: Adapter Trimming

The “AdapterRemoval” subsection allows for options that are applied when AdapterRemoval is applied to the FASTQ reads supplied by the user. For a more detailed description of command-line options, please refer to the AdapterRemoval documentation. A few important particularly options are described here:

Options :: AdapterRemoval :: –adapter1 and Options :: AdapterRemoval :: –adapter2
23
24
25
     # Adapter sequences, set and uncomment to override defaults
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT

These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination. Thus, the sequence provided for –adapter1 is expected to be found in the mate 1 reads, and the sequence specified for –adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified as in the orientation that appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. Please refer to the AdapterRemoval documentation for more information.

Note

As of version AdapterRemoval 2.1, it is possible to use multiple threads to speed up trimming of adapter sequences. This is accomplished not by setting the –threads command-line option in the makefile, but by supplying the –adapterremoval-max-threads option to the BAM pipeline itself:

$ paleomix bam_pipeline run makefile.yaml --adapterremoval-max-threads 2

Warning

Older versions of PALEOMIX may use the –pcr1 and –pcr2 options instead of –adapter1 and –adapter2; for new projects, using –adapter1 and –adapter2 is strongly recommended, due to the simpler schematics (described above). If your project uses the –pcr1 and –pcr2 options, then refer to the AdapterRemoval documentation information for how to proceed!

Options :: AdapterRemoval :: –mm
28
     --mm: 3

Sets the fraction of mismatches allowed when aligning reads / adapter sequences. If the specified value (MM) is greater than 1, this is calculated as 1 / MM, otherwise the value is used directly. To set, replace the default value as desired:

--mm: 3    # Maximum mismatch rate of 1 / 3
--mm: 5    # Maximum mismatch rate of 1 / 5
--mm: 0.2  # Maximum mismatch rate of 1 / 5

Options :: AdapterRemoval :: –minlength

The minimum length required after read merging, adapter trimming, and base-quality quality trimming; resulting reads shorter than this length are discarded, and thereby excluded from further analyses by the pipeline. A value of at least 25 bp is recommended to cut down on the rate of spurious alignments; if possible, a value of 30 bp may be used to greatly reduce the fraction of spurious alignments, with smaller gains for greater minimums [Schubert2012].

Warning

The default value used by PALEOMIX for ‘–minlength’ (25 bp) differs from the default value for AdapterRemoval (15 bp). Thus, if a minimum length of 15 bp is desired, it is nessesarily to explicitly state so in the makefile, simply commenting out this command-line argument is not sufficient.

Options :: AdapterRemoval :: –collapse
31
     --collapse: yes

If enabled, AdapterRemoval will attempt to combine overlapping paired-end reads into a single (potentially longer) sequence. This has at least two advantages, namely that longer reads allow for less ambiguous alignments against the target reference genome, and that the fidelity of the overlapping region (potentially the entire read) is improved by selecting the highest quality base when discrepancies are observed. The names of reads thus merged are prefixed with either ‘M_’ or ‘MT_’, with the latter marking reads that have been trimmed from the 5’ or 3’ termini following collapse, and which therefore do not represent the full insert. To disable this behavior, set the option to ‘no’ (without quotes):

--collapse: yes  # Option enabled
--collapse: no   # Option disabled

Note

This option may be combined with the ‘ExcludeReads’ option (see below), to either eliminate or select for short inserts, depending on the expectations from the experiment. I.e. for ancient samples, where the most inserts should be short enough to allow collapsing (< 2x read read - 11, by default), excluding paired (uncollapsed) and singleton reads may help reduce the fraction of exogenous DNA mapped.

Options :: AdapterRemoval :: –trimns
32
     --trimns: yes

If set to ‘yes’ (without quotes), AdapterRemoval will trim uncalled bases (‘N’) from the 5’ and 3’ end of the reads. Trimming will stop at the first called base (‘A’, ‘C’, ‘G’, or ‘T’). If both –trimns and –trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5’ and 3’ end of the reads. To disable, set the option to ‘no’ (without quotes):

--trimns: yes  # Option enabled
--trimns: no   # Option disabled
Options :: AdapterRemoval :: –trimqualities
33
     --trimqualities: yes

If set to ‘yes’ (without quotes), AdapterRemoval will trim low-quality bases from the 5’ and 3’ end of the reads. Trimming will stop at the first base which is greater than the (Phred encoded) minimum quality score specified using the command-line option –minquality. This value defaults to 2. If both –trimns and –trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5’ and 3’ end of the reads. To disable, set the option to ‘no’ (without quotes):

--trimqualities: yes  # Option enabled
--trimqualities: no   # Option disabled

Options: Short read aligners

This section allow selection between supported short read aligners (currently BWA [Li2009a] and Bowtie2 [Langmead2012]), as well as setting options for these, individually:

35
36
37
38
  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

To select a mapping program, set the ‘Program’ option appropriately:

Program: BWA      # Using BWA to map reads
Program: Bowtie2  # Using Bowtie2 to map reads

Options: Short read aligners - BWA

The following options are applied only when running the BWA short read aligner; see the section “Options: Short read aligners” above for how to select this aligner.

40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments, as post-mortem damage
      # localizes to the seed region, which BWA expects to have few
      # errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes
      # Additional command-line options may be specified for the "aln"
      # call(s), as described below for Bowtie2 below.
Options :: Aligners :: BWA :: Algorithm
42
43
44
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack

The mapping algorithm to use; options are ‘backtrack’ (corresponding to ‘bwa aln’), ‘bwasw’, and ‘mem’. Additional command-line options may be specified for these. Algorithms are selected as follows:

Algorithm: backtrack  # 'Backtrack' algorithm, using the command 'bwa aln'
Algorithm: bwasw      # 'SW' algorithm for long queries, using the command 'bwa bwasw'
Algorithm: mem        # 'mem' algorithm, using the command 'bwa mem'

Warning

Alignment algorithms ‘bwasw’ and ‘mem’ currently cannot be used with input data that is encoded using QualityOffset 64 or ‘Solexa’. This is a limitation of PALEOMIX, and will be resolved in future versions. In the mean time, this can be circumvented by converting FASTQ reads to the standard quality-offset 33, using for example seqtk.

Options :: Aligners :: BWA :: MinQuality
45
46
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0

Specifies the minimum mapping quality of alignments produced by BWA. Any aligned read with a quality score below this value are removed during the mapping process. Note that while unmapped read have a quality of zero, these are not excluded by a non-zero ‘MinQuality’ value. To filter unmapped reads, use the option ‘FilterUnmappedReads’ (see below). To set this option, replace the default value with a desired minimum:

MinQuality: 0   # Keep all hits
MinQuality: 25  # Keep only hits where mapping-quality >= 25
Options :: Aligners :: BWA :: FilterUnmappedReads
47
48
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes

Specifies wether or not unmapped reads (reads not aligned to a target sequence) are to be retained in the resulting BAM files. If set to ‘yes’ (without quotes), all unmapped reads are discarded during the mapping process, while setting the option to ‘no’ (without quotes) retains these reads in the BAM. By convention, paired reads in which one mate is unmapped are assigned the same chromosome and position, while no chromosome / position are assigned to unmapped single-end reads. To change this setting, replace the value with either ‘yes’ or ‘no’ (without quotes):

FilterUnmappedReads: yes  # Remove unmapped reads during alignment
FilterUnmappedReads: no   # Keep unmapped reads

Options :: Aligners :: BWA :: *

Additional command-line options may be specified for the selected alignment algorithm, as described in the “Specifying command-line options” section above. See also the examples listed for Bowtie2 below. Note that for the ‘backtrack’ algorithm, it is only possible to specify options for the ‘bwa aln’ call.

Options: Short read aligners - Bowtie2

The following options are applied only when running the Bowtie2 short read aligner; see the section “Options: Short read aligners” above for how to select this aligner.

56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup
Options :: Aligners :: Bowtie2 :: MinQuality
58
59
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0

See ‘Options :: Aligners :: BWA :: MinQuality’ above.

Options :: Aligners :: Bowtie2 :: FilterUnmappedReads
60
61
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes

See ‘Options :: Aligners :: BWA :: FilterUnmappedReads’ above.

Options :: Aligners :: BWA :: *

Additional command-line options may be specified for Bowtie2, as described in the “Specifying command-line options” section above. Please refer to the Bowtie2 documentation for more information about available command-line options.

Options: mapDamage plots and rescaling

80
81
82
83
84
85
86
  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

This subsection is used to specify options for mapDamage2.0, when plotting post-mortem DNA damage, when building models of the post-mortem damage, and when rescaling quality scores to account for this damage. In order to enable plotting, modeling, or rescaling of quality scores, please see the ‘mapDamage’ option in the ‘Features’ section below.

Note

It may be worthwhile to tweak mapDamage parameters before building a model of post-mortem DNA damage; this may be accomplished by running the pipeline without rescaling, running with the ‘mapDamage’ feature set to ‘plot’ (with or without quotes), inspecting the plots generated per-library, and then tweaking parameters as appropriate, before setting ‘RescaleQualities’ to ‘model’ (with or without quotes).

Disabling the construction of the final BAMs may be accomplished by setting the features ‘RawBam’ and ‘RealignedBAM’ to ‘no’ (without quotes) in the ‘Features’ section (see below), and then setting the desired option to yes again after enabling rescaling and adding the desired options to the mapDamage section.

Should you wish to change the modeling and rescaling parameters, after having already run the pipeline with RescaleQualities enabled, simply remove the mapDamage files generated for the relevant libraries (see the File structure section).

Warning

Rescaling requires a certain minimum number of C>T and G>A substitutions, before it is possible to construct a model of post-mortem DNA damage. If mapDamage fails with an error indicating that “DNA damage levels are too low”, then it is necessary to disable rescaling for that library to continue.

Options :: mapDamage :: –downsample
84
85
86
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

By default the BAM pipeline only samples 100k reads for use in constructing mapDamage plots; in our experience, this is sufficient for accurate plots and models. If no downsampling is to be done, this value can set to 0 to disable this features:

--downsample: 100000   # Sample 100 thousand reads
--downsample: 1000000  # Sample 1 million reads
--downsample: 0        # No downsampling

Options :: mapDamage :: *

Additional command-line options may be supplied to mapDamage, just like the ‘–downsample’ parameter, as described in the “Specifying command-line options” section above. These are used during plotting and rescaling (if enabled).

Options: Excluding read-types

 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
  # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

During the adapter-trimming and read-merging step, AdapterRemoval will generate a selection of different read types. This option allows certain read-types to be excluded from further analyses. In particular, it may be useful to exclude non-collapsed (paired and singleton) reads when processing (ancient) DNA for which only short inserts is expected, since this may help exclude exogenous DNA. The following read types are currently recognized:

Single
Single-end reads; these are the (trimmed) reads generated from supplying single-end FASTQ files to the pipeline.
Paired
Paired-end reads; these are the (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only the subset of paired reads for which both mates were retained, and which were not merged into a single read (if –collapse is set for AdapterRemoval).
Singleton
Paired-end reads; these are (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only those reads in which one of the two mates were discarded due to either the ‘–maxns’, the ‘–minlength’, or the ‘–maxlength’ options supplied to AdapterRemoval. Consequently, these reads are mapped and PCR-duplicate filtered in single-end mode.
Collapsed
Paired-end reads, for which the sequences overlapped, and which were consequently merged by AdapterRemoval into a single sequence (enabled by the –collapse command-line option). These sequences are expected to represent the complete insert, and while they are mapped in single-end mode, PCR duplicate filtering is carried out in a manner that treats these as paired reads. Note that all collapsed reads are tagged by prefixing the read name with ‘M_’.
CollapsedTruncated
Paired-end reads (like Collapsed), which were trimmed due to the ‘–trimqualities’ or the ‘–trimns’ command-line options supplied to AdapterRemoval. Consequently, and as these sequences represent the entire insert, these reads are mapped and PCR-duplicate filtered in single-end mode. Note that all collapsed, truncated reads are tagged by prefixing the read name with ‘MT_’.

To enable / disable exclusion of a read type, set the value for the appropriate type to ‘yes’ or ‘no’ (without quotes):

Singleton: no   # Singleton reads are NOT excluded
Singleton: yes  # Singleton reads are excluded

Options: Optional features

104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
  # Optional steps to perform during processing.
  Features:
    # Generate BAM from the alignments without indel realignment (yes / no)
    RawBAM: no
    # Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
    RealignedBAM: yes
    # To disable mapDamage, write 'no'; to generate basic mapDamage plots,
    # write 'plot'; to build post-mortem damage models, write 'model',
    # and to produce rescaled BAMs, write 'rescale'. The 'model' option
    # includes the 'plot' output, and the 'rescale' option includes both
    # 'plot' and 'model' results. All analyses are carried out per library.
    mapDamage: plot
    # Generate coverage information for the raw BAM (wo/ indel realignment).
    # If one or more 'RegionsOfInterest' have been specified for a prefix,
    # additional coverage files are generated for each alignment (yes / no)
    Coverage: yes
    # Generate histogram of number of sites with a given read-depth, from 0
    # to 200. If one or more 'RegionsOfInterest' have been specified for a
    # prefix, additional histograms are generated for each alignment (yes / no)
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes
    # Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
    DuplicateHist: no

This section lists several optional features, in particular determining which BAM files and which summary statistics are generated when running the pipeline. Currently, the following options are available:

RawBAM
If enabled, the pipeline will generate a final BAM, which is NOT processed using the GATK Indel Realigner (see below), following all other processing steps.
RealignedBAM
If enabled, the pipeline will generate a final BAM, which is processed using the GATK Indel Realigner [McKenna2010], in order to improve the alignment near indels, by performing a multiple sequence alignment in regions containing putative indels.
mapDamage
The ‘mapDamage’ option accepts four possible values: ‘no’, ‘plot’, ‘model’, and ‘rescale’. By default value (‘plot’), will cause mapDamage to be run in order to generate simple plots of the post-mortem DNA damage rates, as well as base composition plots, and more. If set to ‘model’, mapDamage will firstly generate the plots described for ‘plot’, but also construct models of DNA damage parameters, as described in [Jonsson2013]. Note that a minimum amount of DNA damage is required to be present in order to build these models. If the option is set to ‘rescale’, both plots and models will be constructed using mapDamage, and in addition, the quality scores of bases will be down-graded based on how likely they are to represent post-mortem DNA damage (see above).
Coverage
If enabled, a table summarizing the number of hits, the number of aligned bases, bases inserted, and bases deleted, as well as the mean coverage, is generated for each reference sequence, stratified by sample, library, and contig.
Depths
If enabled, a table containing a histogram of the depth of coverage, ranging from 0 to 200, is generated for each reference sequence, stratified by sample, library, and contig. These files may further be used by the Phylogenetic pipeline, in order to automatically select a maximum read depth during SNP calling (see the Pipeline usage section for more information).
Summary
If enabled, a single summary table will be generated per target, containing information about the number of reads processed, hits and fraction of PCR duplicates (per prefix and per library), and much more.
DuplicateHist
If enabled, a histogram of the estimated number of PCR duplicates observed per DNA fragment is generated per library. This may be used with the ‘preseq’ program in order to estimate the (remaining) complexity of a given library, and thereby direct future sequencing efforts [Daley2013].

For a description of where files are placed, refer to the File structure section. It is possible to run the BAM pipeline without any of these options enabled, and this may be useful in certain cases (if only the statistics or per-library BAMs are needed). To enable / disable a features, set the value for that feature to ‘yes’ or ‘no’ (without quotes):

Summary: no   # Do NOT generate a per-target summary table
Summary: yes  # Generate a per-target summary table

Prefixes section

129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name
  # is used in summary statistics and as part of output filenames.
  NAME_OF_PREFIX:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: PATH_TO_PREFIX

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
    # .bed file listing extra regions for which coverage / depth statistics
    # should be calculated; if no names are specified for the BED records,
    # results are named after the chromosome / contig. Change 'NAME' to the
    # name to be used in summary statistics and output filenames.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE

Reference genomes used for mapping are specified by listing these (one or more) in the ‘Prefixes’ section. Each reference genome is associated with a name (used in summary statistics and as part of the resulting filenames), and the path to a FASTA file which contains the reference genome. Several other options are also available, but only the name and the ‘Path’ value are required, as shown here for several examples:

#  Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  MyPrefix1:
    # Path to FASTA file containing reference genome; must end with '.fasta'
    Path: /path/to/genomes/file_1.fasta
  MyPrefix2:
    Path: /path/to/genomes/file_2.fasta
  MyPrefix3:
    Path: /path/to/genomes/AE008922_1.fasta

Each sample in the makefile is mapped against each prefix, and BAM files are generated according to the enabled ‘Features’ (see above). In addition to the path, two other options are available per prefix, namely the ‘Label’ and ‘RegionsOfInterest’, which are described below.

Warning

FASTA files used in the BAM pipeline must be named with a .fasta file extension. Furthermore, if alignments are to be carried out against the human nuclear genome, chromosomes MUST be ordered by their number for GATK to work! See the GATK FAQ for more information.

Regions of interest

It is possible to specify one or more “regions of interest” for a particular reference genome. Doing so results in the production of coverage and depth tables being generated for those regions (if these features are enabled, see above), as well as additional information in the summary table (if enabled, see above).

Such regions are specified using a BED file containing one or more regions; in particular, the first three columns (name, 0-based start coordinate, and 1-based end coordinate) are required, with the 4th column (the name) being optional. Strand information (the 6th column) is not used, but must still be valid according to the BED format.

If these regions are named, statistics are merged by these names (essentially treating them as pseudo contigs), while regions are merged by contig. Thus, it is important to insure that names are unique if statistics are desired for very single region, individually.

Specifying regions of interest is accomplished by providing a name and a path for each set of regions of interest under the ‘RegionOfInterest’ section for a given prefix:

# Produce additional coverage / depth statistics for a set of
# regions defined in a BED file; if no names are specified for the
# BED records, results are named after the chromosome / contig.
RegionsOfInterest:
  MyRegions: /path/to/my_regions.bed
  MyOtherRegions: /path/to/my_other_regions.bed

The following is a simple example of such a BED file, for an alignment against the rCRS (NC_012920.1):

NC_012920_1 3306    4262    region_a
NC_012920_1 4469    5510    region_b
NC_012920_1 5903    7442    region_a

In this case, the resulting tables will contain information about two different regions, namely region_a (2495 bp, resulting from merging the two individual regions specified), and region_b (1041 bp). The order of lines in this file does not matter.

Adding multiple prefixes

In cases where it is necessary to map samples against a large number of reference genomes, it may become impractical to add these to the makefile by hand. To allow such use-cases, it is possible to specify the location of the reference genomes via a path containing wild-cards, and letting the BAM pipeline collect these automatically. For the following example, we assume that we have a folder ‘/path/to/genomes’, which contains our reference genomes:

$ ls /path/to/genomes
AE000516_2.fasta
AE004091_2.fasta
AE008922_1.fasta
AE008923_1.fasta

To automatically add these (4) reference genomes to the makefile, we would add a prefix as follows:

#  Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  MyGenomes*:
    # Path to .fasta file containing a set of reference sequences.
    Path: /path/to/genomes/*.fasta

There are two components to this, namely the name of the pseudo-prefix which must end with a star (*), and the path which may contain one or more wild-cards. If the prefix name does not end with a star, the BAM pipeline will simply treat the path as a regular path. In this particular case, the BAM pipeline will perform the equivalent of ‘ls /path/to/genomes/*.fasta’, and then add each file it has located using the filename without extensions as the name of the prefix. In other words, the above is equivalent to the following:

#  Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, and optional label, and an option
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  AE000516_2:
    Path: /path/to/genomes/AE000516_2.fasta
  AE004091_2:
    Path: /path/to/genomes/AE004091_2.fasta
  AE008922_1:
    Path: /path/to/genomes/AE008922_1.fasta
  AE008923_1:
    Path: /path/to/genomes/AE008923_1.fasta

A makefile including such prefixes is executed as any other makefile.

Note

The name provided for the pseudo-prefix (here ‘MyGenomes’) is not used by the pipeline, and can instead be used to document the nature of the files being included.

Warning

Just like regular prefixes, it is required that the filename of the reference genome ends with ‘.fasta’. However, the pipeline will attempt to add any file found using the provided path with wildcards, and care should therefore be taken to avoid including non-FASTA files. For example, if the path ‘/path/to/genomes/*’ was used instead of ‘/path/to/genomes/*.fasta’, this would cause the pipeline to abort due to the inclusion of (for example) non-FASTA index files generated at this location by the pipeline itself.

Prefix labels

Prefixes:
  # Uncomment and replace 'NAME_OF_PREFIX' with name of the prefix; this name
  # is used in summary statistics and as part of output filenames.
#  NAME_OF_PREFIX:
    # ...

    # (Optional) Uncomment and replace 'LABEL' with one of 'nuclear',
    # 'mitochondrial', 'chloroplast', 'plasmid', 'bacterial', or 'viral'.
#    Label: LABEL

The label option for prefixes allow a prefix to be classified according to one of several categories, currently including ‘nuclear’, ‘mitochondrial’, ‘chloroplast’, ‘plasmid’, ‘bacterial’, and ‘viral’. This is only used when generating the .summary files (if the ‘Summary’ feature is enabled), in which the label is used instead of the prefix name, and the results for prefixes with the same label are combined.

Warning

Labels are deprecated, and will either be removed in future versions of PALEOMIX, or significantly changed.

Targets section

152
153
154
155
156
157
158
159
160
161
162
# Mapping targets are specified using the following structure. Uncomment and
# replace 'NAME_OF_TARGET' with the desired prefix for filenames.
#NAME_OF_TARGET:
   #  Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
#  NAME_OF_SAMPLE:
     #  Uncomment and replace 'NAME_OF_LIBRARY' with the name of this sample.
#    NAME_OF_LIBRARY:
       # Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
       # and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
       # to be trimmed and mapped for this lane (may include wildcards).
#      NAME_OF_LANE: PATH_WITH_WILDCARDS

In the BAM pipeline, the term ‘Target’ is used to refer not to a particular sample (though in typical usage a target includes just one sample), but rather one or more samples to processed together to generate a BAM file per prefix (see above). A sample included in a target may likewise contain one or more libraries, for each of which one or more sets of FASTQ reads are specified.

The following simplified example, derived from the makefile constructed as part of Pipeline usage section exemplifies this:

1
2
3
4
5
6
7
8
9
# Target name; all output files uses this name as a prefix
MyFilename:
  # Sample name; used to tag data for segregation in downstream analyses
  MySample:
    # Library name; used to tag data for segregation in downstream analyses
    TGCTCA:
      # Lane / run names and paths to FASTQ files
      Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
      Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz
Target name
The first top section of this target (line 1, ‘MyFilename’) constitute the target name. This name is used as part of summary statistics and, more importantly, determined the first part of name of files generated as part of the processing of data specified for this target. Thus, in this example all files and folders generated during the processing of this target will start with ‘MyFilename’; for example, the summary table normally generated from running the pipeline will be placed in a file named ‘MyFilename.summary’.
Sample name
The subsections listed in the ‘Target’ section (line 2, ‘MySample’) constitute the (biological) samples included in this target; in the vast majority of analyses, you will have only a single sample per target, and in that case it is considered good practice to use the same name for both the target and the sample. A single target can, however, contain any number of samples, the data for which are tagged according to the names given in the makefile, using the SAM/BAM readgroup (‘RG’) tags.
Library name

The subsections listed in the ‘Sample’ section (line 3, ‘TGCTCA’) constitute the sequencing libraries constructed during the extraction and library building for the current sample. For modern samples, there is typically only a single library per sample, but more complex sequencing projects (modern and ancient) may involve any number of libraries constructed from one or more extracts. It is very important that libraries be listed correctly (see below).

Warning

Note that the BAM pipeline imposes the restriction that each library name specified for a target must be unique, even if these are located in to different samples. This restriction may be removed in future versions of PALEOMIX.

Lane name
The subsections of each library are used to specify the names of individual

In addition to these target (sub)sections, it is possible to specify ‘Options’ for individual targets, samples, and libraries, similarly to how this is done globally at the top of the makefile. This is described below.

Warning

It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping together lanes will result either in the loss of sequences which are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these. Furthermore, GATK and mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by incorrect specification of libraries.

Including already trimmed reads

In some cases it is useful to include FASTQ reads that has already been trimmed for adapter sequences. While this is not recommended in general, as it may introduce systematic bias if some data has been processed differently than the remaining FASTQ reads, the BAM pipeline makes it simple to incorporate both ‘raw’ and trimmed FASTQ reads, and to ensure that these integrate in the pipeline.

To include already trimmed reads, these are specified as values belonging to a lane, using the same names for read-types as in the ‘ExcludeReads’ option (see above). The following minimal example demonstrates this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
MyFilename:
  MySample:
    ACGATA:
      # Regular lane, containing reads that are not already trimmed
      Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

      # Lane containing pre-trimmed reads of each type
      Lane_2:
        # Single-end reads
        Single: /path/to/single_end_reads.fastq.gz

        # Paired-end reads where one mate has been discarded
        Singleton: /path/to/singleton_reads.fastq.gz

        # Paired end reads; note that the {Pair} key is required,
        # just like with raw, paired-end reads
        Paired: /path/to/paired_end_{Pair}.fastq.gz

        # Paired-end reads merged into a single sequence
        Collapsed: /path/to/collapsed.fastq.gz

        # Paired-end reads merged into a single sequence, and then truncated
        CollapsedTruncated: /path/to/collapsed_truncated.fastq.gz

The above examples show how each type of reads are to be listed, but it is not necessary to specify more than a single type of pre-trimmed reads in the makefile.

Note

Including already trimmed reads currently result in the absence of some summary statistics in the .summary file, namely the number of raw reads, as well as trimming statistics, since the BAM pipeline currently relies on AdapterRemoval to collect these statistics.

Overriding global settings

In addition to the ‘Options’ section included, by default, at the beginning of every makefile, it is possible to specify / override options at a Target, Sample, and Library level. This allows, for example, that different adapter sequences be specified for each library generated for a sample, or options that should only be applied to a particular sample among several included in a makefile. The following demonstration uses the makefile constructed as part of Pipeline usage section as the base:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
MyFilename:
  # These options apply to all samples with this filename
  Options:
    # In this example, we override the default adapter sequences
    AdapterRemoval:
      --adapter1: AGATCGGAAGAGC
      --adapter2: AGATCGGAAGAGC

  MySample:
    # These options apply to libraries 'ACGATA', 'GCTCTG', and 'TGCTCA'
    Options:
       # In this example, we assume that FASTQ files for our libraries
       # include Phred quality scores encoded with offset 64.
       QualityOffset: 64

    ACGATA:
      Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

    GCTCTG:
      # These options apply to 'Lane_1' in the 'GCTCTG' library
      Options:
        # It is possible to override options we have previously overridden
        QualityOffset: 33

      Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

    TGCTCA:
      Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
      Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

In this example, we have overwritten options at 3 places:

  • The first place (lines 2 - 7) will be applied to all samples, libraries, and lanes in this target, unless subsequently overridden. In this example, we have set a new pair of adapter sequences, which we wish to use for these data.
  • The second place (lines 10 - 14) are applied to the sample ‘MySample’ that we have included in this target, and consequently applies to all libraries specified for this sample (‘ACGATA’, ‘GCTCTG’, and ‘TGCTCA’). In most cases you will only have a single sample, and so it will not make a difference whether or not you override options for the entire target (e.g. lines 3 - 8), or just for that sample (e.g. lines 11-15).
  • Finally, the third place (lines 20 - 23) demonstrate how options can be overridden for a particular library. In this example, we have chosen to override an option (for this library only!) we previously overrode for that sample (the ‘QualityOffset’ option).

Note

It currently not possible to override options for a single lane, it is only possible to override options for all lanes in a library.

Warning

It is currently not possible to set the ‘Features’ except in the global ‘Options’ section at the top of the Makefile; this limitation will be removed in future versions of PALEOMIX.