Pipeline usage

The following describes, step by step, the process of setting up a project for mapping FASTQ reads against a reference sequence using the BAM pipeline. For a detailed description of the configuration file (makefile) used by the BAM pipeline, please refer to the section Makefile description, and for a detailed description of the files generated by the pipeline, please refer to the section File structure.

The BAM pipeline is invoked using the paleomix command, which offers access to the pipelines and to other tools included with PALEOMIX (see section Other tools). For the purpose of these instructions, we will make use of a tiny FASTQ data set included with the PALEOMIX pipeline, consisting of synthetic FASTQ reads simulated against the human mitochondrial genome. To follow along, first create a local copy of the example data set:

$ paleomix bam example .

This will create a folder named bam_pipeline in the current folder, which contains the example FASTQ reads and a 'makefile' showcasing various features of the BAM pipeline (makefile.yaml). We will make use of a subset of the data, but we will not make use of this makefile. The data we will use consists of 3 simulated ancient DNA libraries (independent amplifications), for which either one or two lanes have been simulated:

Library   Lane   Type   Files
-------   ----   ----   -------------------------
ACGATA    1      PE     data/ACGATA_L1_*.fastq.gz
GCTCTG    1      SE     data/GCTCTG_L1_*.fastq.gz
TGCTCA    1      SE     data/TGCTCA_L1_*.fastq.gz
TGCTCA    2      PE     data/TGCTCA_L2_*.fastq.gz

Warning

The BAM pipeline relies largely on the existence of final and intermediate files to detect whether a given analytical step has already been carried out. Changes made to a makefile after the pipeline has been run (even if not run to completion) may therefore not cause the analytical steps affected by these changes to be re-run. If changes are to be made at such a point, it is typically necessary to manually remove the affected intermediate files before running the pipeline again. See the section File structure for more information about the layout of files generated by the pipeline.

Creating a makefile

As described in the Introduction, the BAM pipeline operates based on 'makefiles', which serve to specify the location and structure of input data (samples, libraries, lanes, etc.), which tasks are to be run, and which settings are to be used. The makefiles are written in the human-readable YAML format, and may be edited using any regular text editor.

For a brief introduction to the YAML format, please refer to the YAML usage in PALEOMIX section, and for a detailed description of the BAM Pipeline makefile, please refer to section Makefile description.

To start a new project, we must first generate a makefile template using the following command, which for the purpose of this tutorial we place in the example folder:

$ cd bam_pipeline/
$ paleomix bam makefile > makefile.yaml

Once you open the resulting file (makefile.yaml) in your text editor of choice, you will find that BAM pipeline makefiles are split into three major sections, representing 1) the default options; 2) the reference genomes against which reads are to be mapped; and 3) the input files for the samples which are to be processed.

In a typical project, we will need to review the default options, add one or more reference genomes which we wish to target, and list the input data to be processed.

Default options

The makefile starts with an Options section, which is applied to every set of input files in the makefile unless explicitly overridden for a given sample (this is described in the Makefile description section). For the most part, the default values should be suitable for any given project, but special attention should be paid to the following options (double colons are used to separate subsections):

Options :: Platform

The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The SAM/BAM specification lists the valid platforms, which currently include CAPILLARY, HELICOS, ILLUMINA, IONTORRENT, LS454, ONT, PACBIO, and SOLID.

Options :: QualityOffset

The QualityOffset option refers to the starting ASCII value used to encode Phred quality-scores in user-provided FASTQ files, with the possible values of 33, 64, and Solexa. For most modern data, this will be 33, corresponding to ASCII characters in the range ! to J. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range @ to h, and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ; to h. For a visual representation of this, refer to the Phred quality-scores page.
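When in doubt, the offset can often be inferred from the range of ASCII codes observed in a file. The following minimal Python sketch (not part of PALEOMIX; names and heuristics are illustrative) applies the ranges described above:

```python
def guess_quality_offset(quality_strings):
    """Guess the FASTQ quality encoding from observed quality characters.

    Uses the ASCII ranges described above: offset 33 spans '!' (33) to 'J' (74),
    offset 64 spans '@' (64) to 'h' (104), and Solexa spans ';' (59) to 'h'.
    Returns one of "33", "64", "Solexa", or "ambiguous".
    """
    codes = [ord(char) for line in quality_strings for char in line]
    lowest, highest = min(codes), max(codes)
    if lowest < 59:       # below ';': only the offset-33 range reaches here
        return "33"
    if highest > 74:      # above 'J': beyond the offset-33 range
        # characters ';' to '?' (59-63) occur only in Solexa-encoded data;
        # note that offset-64 data with uniformly high scores cannot be
        # distinguished from Solexa data without scores below zero.
        return "Solexa" if lowest < 64 else "64"
    return "ambiguous"    # all observed codes fit more than one encoding

print(guess_quality_offset(["IIIIHHGG!"]))  # prints 33
print(guess_quality_offset(["hhhhggff@"]))  # prints 64
```

Scanning only a few thousand reads is usually sufficient to resolve the ambiguity in practice.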

Warning

By default, the adapter trimming software used by PALEOMIX expects quality scores no greater than 41, corresponding to the ASCII character J when encoded using offset 33. If the input data contains quality scores greater than this value, it is necessary to specify the maximum value using the --qualitymax command-line option. See below.
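For example, quality scores of up to 50 could be permitted by adding the option to the AdapterRemoval section of the makefile (the value 50 here is purely illustrative; use the actual maximum found in your data):

```yaml
Options:
  AdapterRemoval:
    # Allow Phred quality scores of up to 50 in the input FASTQ files
    --qualitymax: 50
```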

Warning

Presently, quality-offsets other than 33 are not supported when using the BWA mem or the BWA bwasw algorithms. To use these algorithms with quality-offset 64 data, it is therefore necessary to first convert these data to offset 33. This can be accomplished using the seqtk tool.
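The seqtk invocation is described in the seqtk documentation; the conversion itself simply shifts every quality character down by 31 (64 - 33), as this minimal Python sketch (not part of PALEOMIX; function name is illustrative) shows:

```python
def shift_quality_offset(qualities, old_offset=64, new_offset=33):
    """Re-encode a FASTQ quality string from one Phred offset to another."""
    delta = old_offset - new_offset  # 31 when going from offset 64 to 33
    shifted = []
    for char in qualities:
        code = ord(char) - delta
        if code < new_offset:
            raise ValueError(f"quality character {char!r} below offset {old_offset}")
        shifted.append(chr(code))
    return "".join(shifted)

# '@' (score 0 at offset 64) becomes '!' (score 0 at offset 33)
print(shift_quality_offset("hgf@"))  # prints IHG!
```

Note that this applies only to Phred+64 data; Solexa scores use a different scale and require a table-based conversion rather than a simple shift.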

Options :: AdapterRemoval :: --adapter1 and Options :: AdapterRemoval :: --adapter2

These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination using AdapterRemoval. Thus, the sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified in the orientation in which they appear in these files (i.e. it should be possible to grep the files for these sequences, assuming that the reads are long enough, and treating Ns as wildcards).
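For example, one can search the reads for a prefix of the default mate 1 adapter (stopping before the Ns, so no wildcards are needed). The read below is fabricated for illustration; the commented-out line shows how the same check could be run on real gzipped files from the example data set:

```shell
# Count lines containing a prefix of the default mate 1 adapter sequence.
# On real data one might instead run, e.g.:
#   zcat data/ACGATA_L1_R1_*.fastq.gz | grep -c 'AGATCGGAAGAGCACACGTC'
printf '@read1\nACGTACGTAGATCGGAAGAGCACACGTCTGAA\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' \
    | grep -c 'AGATCGGAAGAGCACACGTC'
```

A non-trivial number of matching reads suggests that the configured adapter sequence does occur in the data.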

Warning

It is very important that the correct adapter sequences are used. Please refer to the AdapterRemoval documentation for more information and for help identifying the adapters for paired-end reads.

Aligners :: Program

The short read alignment program used to map the (trimmed) reads to the reference genome. Currently, users may choose between BWA and Bowtie2, with additional options available for each program.

Aligners :: * :: MinQuality

The minimum mapping quality of hits to retain during the mapping process. If this option is set to a non-zero value, any hits with a mapping quality below this value are removed from the resulting BAM file (this option does not apply to unmapped reads). If the final BAM should contain all reads in the input files, this option must be set to 0, and the FilterUnmappedReads option set to no.
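Since mapping qualities are Phred-scaled, a given cutoff corresponds directly to a probability that the reported mapping position is wrong; for example, a cutoff of 30 corresponds to an error probability of 0.001. A minimal sketch of the conversion:

```python
def mapq_to_error_probability(mapq):
    """Convert a Phred-scaled mapping quality to the probability that the
    reported mapping position is wrong: P = 10^(-MAPQ / 10)."""
    return 10.0 ** (-mapq / 10.0)

print(mapq_to_error_probability(30))  # → 0.001
print(mapq_to_error_probability(20))  # → 0.01
```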

Aligners :: BWA :: UseSeed

Enable/disable the use of a seed region when mapping reads using the BWA backtrack alignment algorithm (the default). Disabling this option may yield some improvements in the alignment of highly damaged ancient DNA, at the cost of significantly increasing the running time. As such, this option is not recommended for modern samples [Schubert2012].

For the purpose of the example project, we need only change a few options. Since the reads were simulated using a Phred score offset of 33, there is no need to change the QualityOffset option, and since the simulated adapter sequences match the adapters that AdapterRemoval searches for by default, we do not need to set either --adapter1 or --adapter2. We will use the default mapping program (BWA) and algorithm (backtrack), but change the minimum mapping quality to 30 (corresponding to an error probability of 0.001). This is accomplished by locating the Aligners section of the makefile and changing the MinQuality value from 0 to 30 (line 40):

29  # Settings for aligners supported by the pipeline
30  Aligners:
31    # Choice of aligner software to use, either "BWA" or "Bowtie2"
32    Program: BWA
33
34    # Settings for mappings performed using BWA
35    BWA:
36      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
37      # for a description of each algorithm (defaults to 'backtrack')
38      Algorithm: backtrack
39      # Filter aligned reads with a mapping quality (Phred) below this value
40      MinQuality: 30
41      # Filter reads that did not map to the reference sequence
42      FilterUnmappedReads: yes
43      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
44      # Post-mortem damage localizes to the seed region, which BWA expects to
45      # have few errors (sets "-l"). See http://pmid.us/22574660
46      UseSeed: yes

Since the data we will be mapping represents (simulated) ancient DNA, we will furthermore set the UseSeed option to no (line 46), in order to recover a small additional amount of alignments during mapping (see [Schubert2012]):

29  # Settings for aligners supported by the pipeline
30  Aligners:
31    # Choice of aligner software to use, either "BWA" or "Bowtie2"
32    Program: BWA
33
34    # Settings for mappings performed using BWA
35    BWA:
36      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
37      # for a description of each algorithm (defaults to 'backtrack')
38      Algorithm: backtrack
39      # Filter aligned reads with a mapping quality (Phred) below this value
40      MinQuality: 30
41      # Filter reads that did not map to the reference sequence
42      FilterUnmappedReads: yes
43      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
44      # Post-mortem damage localizes to the seed region, which BWA expects to
45      # have few errors (sets "-l"). See http://pmid.us/22574660
46      UseSeed: no

Once this is done, we can proceed to specify the location of the reference genome(s) that we wish to map our reads against.

Reference genomes (prefixes)

Mapping is carried out using one or more reference genomes (or other sequences) in the form of FASTA files, which are indexed for use in read mapping (automatically, by the pipeline) using either the bwa index or bowtie2-build command. Since the sequence alignment indexes are generated at the location of these files, reference genomes are also referred to as "prefixes" in this documentation. In other words, using BWA as an example, the PALEOMIX pipeline will generate an index (prefix) of the reference genome using a command corresponding to the following:

$ bwa index prefixes/my_genome.fasta

In addition to the BWA / Bowtie2 index, several other related files are generated automatically, including a FASTA index file (.fai), which are required for various operations of the pipeline. These are likewise placed in the same folder as the reference FASTA file. For a more detailed description, please refer to the File structure section.

Warning

Since the pipeline automatically carries out indexing of the FASTA files, it requires write access to the folder containing them. If this is not possible, one may simply create a local folder containing symbolic links to the original FASTA file(s), and point the makefile to this location. All automatically generated files will then be placed in this location.
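As a sketch, assuming the reference resides in a hypothetical read-only location /shared/genomes, a writable prefix folder could be set up as follows, after which the makefile's Path would point at my_prefixes/my_genome.fasta:

```shell
# Create a writable folder and symlink the read-only reference into it;
# the pipeline will then place the generated index files in this folder.
mkdir -p my_prefixes
ln -sf /shared/genomes/my_genome.fasta my_prefixes/my_genome.fasta
```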

Specifying which FASTA files to align sequences against is accomplished by listing these in the Prefixes section of the makefile. For example, assuming that we had a FASTA file named my_genome.fasta located in the my_prefixes folder, the following might be used:

Prefixes:
  my_genome:
    Path: my_prefixes/my_genome.fasta

The name of the prefix (here my_genome) is used to name the resulting files and appears in various tables generated by the pipeline. Typical names include hg19, EquCab20, and other standard abbreviations for reference genomes, accession numbers, and the like. Multiple prefixes can be specified, but each name MUST be unique:

Prefixes:
  my_genome:
    Path: my_prefixes/my_genome.fasta
  my_other_genome:
    Path: my_prefixes/my_other_genome.fasta

In the case of this example project, we will be mapping our data against the revised Cambridge Reference Sequence (rCRS) for the human mitochondrial genome, which is included in the example folder under prefixes, as a file named rCRS.fasta. To add it to the makefile, locate the Prefixes section below the Options section, and update it as described above (lines 115 and 119):

110  # Map of prefixes by name, each having a Path, which specifies the location of the
111  # BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
112  Prefixes:
113    # Replace 'NAME_OF_PREFIX' with name of the prefix; this name is used in summary
114    # statistics and as part of output filenames.
115    rCRS:
116      # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the references
117      # against which reads are to be mapped. Using the same name as filename is strongly
118      # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
119      Path: prefixes/rCRS.fasta

Once this is done, we may specify the input data that we want the pipeline to process.

Specifying read data

A single makefile may be used to process one or more samples, generating one or more BAM files and supplementary statistics. In this project we will only deal with a single sample, which we accomplish by creating our own section at the end of the makefile. The first step is to determine the name used for the files generated by the BAM pipeline. Specifically, we will specify a name which is prefixed to all output generated for our sample (here named MyFilename), by adding the following line to the end of the makefile:

129  # You can also add comments like these to document your experiment
130  MyFilename:

This first name, or grouping, is referred to as the target, and typically corresponds to the name of the sample being processed, though any name may do. The actual sample name is specified next (it is possible, but uncommon, for a single target to contain multiple samples); it is used in tables of summary statistics, and is recorded in the resulting BAM files. This is accomplished by adding another line below the target name:

129  # You can also add comments like these to document your experiment
130  MyFilename:
131    MySample:

Similarly, we need to specify the name of each library in our data set. By convention, we often use the index used to construct the library as the library name (which allows for easy identification), but any name may be used for a library, provided that it is unique to that sample. As described near the start of this document, we are dealing with 3 libraries:

Library   Lane   Type   Files
-------   ----   ----   -------------------------
ACGATA    1      PE     data/ACGATA_L1_*.fastq.gz
GCTCTG    1      SE     data/GCTCTG_L1_*.fastq.gz
TGCTCA    1      SE     data/TGCTCA_L1_*.fastq.gz
TGCTCA    2      PE     data/TGCTCA_L2_*.fastq.gz

It is important to specify the libraries correctly, since the pipeline not only uses this information for summary statistics and records it in the resulting BAM files, but also carries out filtering of PCR duplicates (and other analyses) on a per-library basis. Wrongly grouping data together may therefore result in the loss of useful alignments wrongly identified as PCR duplicates or, conversely, in the inclusion of reads that should have been filtered as PCR duplicates. The library names are added below the name of the sample (MySample), in a similar manner to the sample itself:

129  # You can also add comments like these to document your experiment
130  MyFilename:
131    MySample:
132      ACGATA:
133
134      GCTCTG:
135
136      TGCTCA:

The final step involves specifying the location of the raw FASTQ reads to be processed for each library, and consists of specifying one or more "lanes" of reads, each of which must be given a unique name. For single-end reads, this is accomplished simply by providing a path (with optional wildcards) to the location of the file(s). For example, for lane 1 of library GCTCTG, the files are located at data/GCTCTG_L1_*.fastq.gz:

$ ls data/GCTCTG_L1_*.fastq.gz
data/GCTCTG_L1_R1_01.fastq.gz
data/GCTCTG_L1_R1_02.fastq.gz
data/GCTCTG_L1_R1_03.fastq.gz

We simply specify these paths for each of the single-end lanes, here named using the lane number (as above, this name is used to tag the data in the resulting BAM file):

129  # You can also add comments like these to document your experiment
130  MyFilename:
131    MySample:
132      ACGATA:
133
134      GCTCTG:
135        Lane_1: data/GCTCTG_L1_*.fastq.gz
136
137      TGCTCA:
138        Lane_1: data/TGCTCA_L1_*.fastq.gz

Specifying the location of paired-end data is slightly more complex, since the pipeline needs to be able to locate both files in a pair. This is accomplished by making the assumption that paired-end files are numbered as either mate 1 or mate 2, as shown here for 4 pairs of files with the common _R1 and _R2 labels:

$ ls data/ACGATA_L1_*.fastq.gz
data/ACGATA_L1_R1_01.fastq.gz
data/ACGATA_L1_R1_02.fastq.gz
data/ACGATA_L1_R1_03.fastq.gz
data/ACGATA_L1_R1_04.fastq.gz
data/ACGATA_L1_R2_01.fastq.gz
data/ACGATA_L1_R2_02.fastq.gz
data/ACGATA_L1_R2_03.fastq.gz
data/ACGATA_L1_R2_04.fastq.gz

Knowing that the files contain a number specifying which file in a pair they correspond to, we can then construct a path that includes the keyword {Pair} in place of that number. For the above example, that path would therefore be data/ACGATA_L1_R{Pair}_*.fastq.gz (corresponding to data/ACGATA_L1_R[12]_*.fastq.gz):

129  # You can also add comments like these to document your experiment
130  MyFilename:
131    MySample:
132      ACGATA:
133        Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
134
135      GCTCTG:
136        Lane_1: data/GCTCTG_L1_*.fastq.gz
137
138      TGCTCA:
139        Lane_1: data/TGCTCA_L1_*.fastq.gz
140        Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz
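The {Pair} substitution can be thought of as globbing twice, once per mate. The following Python sketch (an illustration of the idea, not the pipeline's actual implementation) mirrors that behavior:

```python
import glob

def expand_pair_path(path):
    """Expand a path containing the {Pair} keyword into sorted lists of
    mate 1 and mate 2 files, mirroring how paired-end lanes are specified."""
    if "{Pair}" not in path:
        raise ValueError("paired-end paths must contain the {Pair} keyword")
    mate_1 = sorted(glob.glob(path.replace("{Pair}", "1")))
    mate_2 = sorted(glob.glob(path.replace("{Pair}", "2")))
    if len(mate_1) != len(mate_2):
        raise ValueError("unequal number of mate 1 and mate 2 files")
    return mate_1, mate_2

# e.g. expand_pair_path("data/ACGATA_L1_R{Pair}_*.fastq.gz")
```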

Note

Note that while the paths given here are relative to the location where the pipeline is run, it is also possible to provide absolute paths, should the files be located in an entirely different location.

Note

At the time of writing, the PALEOMIX pipeline supports uncompressed, gzipped, and bzipped FASTQ reads. It is not necessary to use any particular file extension for these, as the compression method (if any) is detected automatically.
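Automatic detection of the compression scheme is typically based on the first few (magic) bytes of a file rather than its extension; a minimal sketch of that approach (not the pipeline's actual code):

```python
def detect_compression(path):
    """Classify a file as 'gzip', 'bzip2', or 'raw' by its magic bytes."""
    with open(path, "rb") as handle:
        magic = handle.read(3)
    if magic[:2] == b"\x1f\x8b":   # gzip magic number
        return "gzip"
    if magic == b"BZh":            # bzip2 magic number
        return "bzip2"
    return "raw"                   # assume uncompressed FASTQ
```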

The final makefile

Once we've completed the steps described above, the resulting makefile should look as follows:

  1  # -*- mode: Yaml; -*-
  2  # Default options.
  3  # Can also be specific for a set of samples, libraries, and lanes,
  4  # by including the "Options" hierarchy at the same level as those
  5  # samples, libraries, or lanes below.
  6  Options:
  7    # Sequencing platform, see SAM/BAM reference for valid values
  8    Platform: Illumina
  9    # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
 10    # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
 11    # specify 'Solexa', to handle reads on the Solexa scale. This is
 12    # used during adapter-trimming and sequence alignment
 13    QualityOffset: 33
 14
 15    # Settings for trimming of reads, see AdapterRemoval man-page
 16    AdapterRemoval:
 17      # Set and uncomment to override defaults adapter sequences
 18  #     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
 19  #     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
 20      # Some BAM pipeline defaults differ from AR defaults;
 21      # To override, change these value(s):
 22      --mm: 3
 23      --minlength: 25
 24      # Extra features enabled by default; change 'yes' to 'no' to disable
 25      --collapse: yes
 26      --trimns: yes
 27      --trimqualities: yes
 28
 29    # Settings for aligners supported by the pipeline
 30    Aligners:
 31      # Choice of aligner software to use, either "BWA" or "Bowtie2"
 32      Program: BWA
 33
 34      # Settings for mappings performed using BWA
 35      BWA:
 36        # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
 37        # for a description of each algorithm (defaults to 'backtrack')
 38        Algorithm: backtrack
 39        # Filter aligned reads with a mapping quality (Phred) below this value
 40        MinQuality: 30
 41        # Filter reads that did not map to the reference sequence
 42        FilterUnmappedReads: yes
 43        # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
 44        # Post-mortem damage localizes to the seed region, which BWA expects to
 45        # have few errors (sets "-l"). See http://pmid.us/22574660
 46        UseSeed: no
 47        # Additional command-line options may be specified below. For 'backtrack' these
 48        # are applied to the "bwa aln". See Bowtie2 for more examples.
 49  #      -n: 0.04
 50
 51      # Settings for mappings performed using Bowtie2
 52      Bowtie2:
 53        # Filter aligned reads with a mapping quality (Phred) below this value
 54        MinQuality: 0
 55        # Filter reads that did not map to the reference sequence
 56        FilterUnmappedReads: yes
 57        # Examples of how to add additional command-line options
 58  #      --trim5: 5
 59  #      --trim3: 5
 60        # Note that the colon is required, even if no value is specified
 61        --very-sensitive:
 62        # Example of how to specify multiple values for an option
 63  #      --rg:
 64  #        - CN:SequencingCenterNameHere
 65  #        - DS:DescriptionOfReadGroup
 66
 67    # Command-line options for mapDamage; use long-form options(--length not -l):
 68    mapDamage:
 69      # By default, the pipeline will downsample the input to 100k hits
 70      # when running mapDamage; remove to use all hits
 71      --downsample: 100000
 72
 73    # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
 74    # possible read-types reflect the output of AdapterRemoval
 75    ExcludeReads:
 76      # Exclude single-end reads (yes / no)?
 77      Single: no
 78      # Exclude non-collapsed paired-end reads (yes / no)?
 79      Paired: no
 80      # Exclude paired-end reads for which the mate was discarded (yes / no)?
 81      Singleton: no
 82      # Exclude overlapping paired-ended reads collapsed into a single sequence
 83      # by AdapterRemoval (yes / no)?
 84      Collapsed: no
 85      # Like 'Collapsed', but only for collapsed reads truncated due to the
 86      # presence of ambiguous or low quality bases at read termini (yes / no).
 87      CollapsedTruncated: no
 88
 89    # Optional steps to perform during processing.
 90    Features:
 91      # If set to 'filter', PCR duplicates are removed from the output files; if set to
 92      # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the
 93      # output files; if set to 'no', the reads are assumed to not have been amplified.
 94      PCRDuplicates: filter
 95      # Set to 'no' to disable mapDamage; set to 'plots' to build basic mapDamage plots;
 96      # set to 'model' to build plots and post-mortem damage models; and set to 'rescale'
 97      # to build plots, models, and BAMs with rescaled quality scores. All analyses are
 98      # carried out per library.
 99      mapDamage: plot
100      # Generate coverage information for the final BAM and for each 'RegionsOfInterest'
101      # specified in 'Prefixes' (yes / no).
102      Coverage: yes
103      # Generate histograms of number of sites with a given read-depth, from 0 to 200,
104      # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no).
105      Depths: yes
106      # Generate summary table for each target (yes / no)
107      Summary: yes
108
109
110  # Map of prefixes by name, each having a Path, which specifies the location of the
111  # BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
112  Prefixes:
113    # Replace 'NAME_OF_PREFIX' with name of the prefix; this name is used in summary
114    # statistics and as part of output filenames.
115    rCRS:
116      # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the references
117      # against which reads are to be mapped. Using the same name as filename is strongly
118      # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
119      Path: prefixes/rCRS.fasta
120
121      # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a .bed file
122      # listing extra regions for which coverage / depth statistics should be calculated;
123      # if no names are specified for the BED records, results are named after the
124      # chromosome / contig. Replace 'NAME' with the desired name for these regions.
125  #    RegionsOfInterest:
126  #      NAME: PATH_TO_BEDFILE
127
128
129  # You can also add comments like these to document your experiment
130  MyFilename:
131    MySample:
132      ACGATA:
133        Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
134
135      GCTCTG:
136        Lane_1: data/GCTCTG_L1_*.fastq.gz
137
138      TGCTCA:
139        Lane_1: data/TGCTCA_L1_*.fastq.gz
140        Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

With this makefile in hand, the pipeline may be executed using the following command:

$ paleomix bam run makefile.yaml

The pipeline will run as many simultaneous processes as there are cores in the current system; this behavior may be changed using the --max-threads command-line option. Use the --help command-line option to view additional options available when running the pipeline. By default, output files are placed in the same folder as the makefile, but this behavior may be changed using the --destination command-line option. For this project, these files include the following:

$ ls -d MyFilename*
MyFilename
MyFilename.rCRS.coverage
MyFilename.rCRS.depths
MyFilename.rCRS.mapDamage
MyFilename.summary

The files include a table of the average coverage, a histogram of the per-site coverage (depths), a folder containing one set of mapDamage plots per library, the final BAM file and its index (the .bai file), and a table summarizing the entire analysis. For a more detailed description of the files generated by the pipeline, please refer to the File structure section; should problems occur during the execution of the pipeline, please verify that the makefile is filled out correctly as described above, and refer to the Troubleshooting the BAM Pipeline section.

Note

The first item, MyFilename, is a folder containing intermediate files generated while running the pipeline; it is required due to the many steps involved in a typical analysis, and allows the pipeline to resume should the process be interrupted. This folder will typically take up 3-4x the disk space used by the final BAM file(s), and may safely be removed once the pipeline has run to completion, in order to reduce disk usage.