Pipeline usage

The following describes, step by step, the process of setting up a project for mapping FASTQ reads against a reference sequence using the BAM pipeline. For a detailed description of the configuration file (makefile) used by the BAM pipeline, please refer to the section Makefile description, and for a detailed description of the files generated by the pipeline, please refer to the section File structure.

The BAM pipeline is invoked using either the ‘paleomix’ command, which offers access to all tools included with PALEOMIX (see section Other tools), or using the (deprecated) alias ‘bam_pipeline’. All commands in the following may therefore take either of these (equivalent) forms:

$ paleomix bam_pipeline [...]
$ bam_pipeline [...]

For the purpose of these instructions, we will make use of a tiny FASTQ data set included with the PALEOMIX pipeline, consisting of synthetic FASTQ reads simulated against the human mitochondrial genome. To follow along, first create a local copy of the BAM pipeline example data:

$ paleomix bam_pipeline example .

This will create a folder named ‘bam_pipeline’ in the current folder, which contains the example FASTQ reads and a makefile showcasing various features of the BAM pipeline (‘000_makefile.yaml’). We will make use of a subset of the data, but not of this makefile. The data we will use consists of 3 simulated ancient DNA libraries (independent amplifications), for which either one or two lanes have been simulated:

Library  Lane  Type  Files
ACGATA   1     PE    000_data/ACGATA_L1_*.fastq.gz
GCTCTG   1     SE    000_data/GCTCTG_L1_*.fastq.gz
TGCTCA   1     SE    000_data/TGCTCA_L1_*.fastq.gz
TGCTCA   2     PE    000_data/TGCTCA_L2_*.fastq.gz

Warning

The BAM pipeline relies largely on the existence of final and intermediate files to detect whether a given analytical step has been carried out. Changes made to a makefile after the pipeline has already been run (even if it was not run to completion) may therefore not cause the analytical steps affected by these changes to be re-run. If changes are to be made at such a point, it is typically necessary to manually remove the affected intermediate files before running the pipeline again. See the section File structure for more information about the layout of files generated by the pipeline.

Creating a makefile

As described in the Introduction, the BAM pipeline operates based on ‘makefiles’, which serve to specify the location and structure of input data (samples, libraries, lanes, etc.), which tasks are to be run, and which settings are to be used. The makefiles are written in the human-readable YAML format, and may be edited using any regular text editor.

For a brief introduction to the YAML format, please refer to the YAML usage in PALEOMIX section, and for a detailed description of the BAM Pipeline makefile, please refer to section Makefile description.

To start a new project, we must first generate a makefile template using the following command, which for the purpose of this tutorial we place in the example folder:

$ cd bam_pipeline/
$ paleomix bam_pipeline mkfile > makefile.yaml

Once you open the resulting file (‘makefile.yaml’) in your text editor of choice, you will find that BAM pipeline makefiles are split into 3 major sections, representing 1) the default options used for processing the data; 2) the reference genomes against which reads are to be mapped; and 3) the sets of input files for one or more samples which are to be processed.

In a typical project, we will need to review the default options, add one or more reference genomes which we wish to target, and list the input data to be processed.

Default options

The makefile starts with an “Options” section, which is applied to every set of input-files in the makefile unless explicitly overridden for a given sample (this is described in the Makefile description section). For the most part, the default values should be suitable for a given project, but special attention should be paid to the following options (colons indicate subsections):

Options:Platform

The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The SAM/BAM specification lists the valid platforms, which currently include ‘CAPILLARY’, ‘HELICOS’, ‘ILLUMINA’, ‘IONTORRENT’, ‘LS454’, ‘ONT’, ‘PACBIO’, and ‘SOLID’.

Options:QualityOffset

The QualityOffset option refers to the starting ASCII value used to encode Phred quality-scores in user-provided FASTQ files, with possible values of 33, 64, and ‘Solexa’. For most modern data, this will be 33, corresponding to ASCII characters in the range ‘!’ to ‘J’. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range ‘@’ to ‘h’, and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ‘;’ to ‘h’. For a visual representation of these ranges, refer to the Wikipedia article on the FASTQ format.
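
If in doubt about the encoding of a given file, a quick (if informal) check is to inspect the quality strings directly and compare the characters against the ranges listed above. The following sketch prints the quality strings of the first two records of one of the example files (any FASTQ file may be substituted):

$ gunzip -c 000_data/ACGATA_L1_R1_01.fastq.gz | head -n 8 | awk 'NR % 4 == 0'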

Warning

By default, the adapter trimming software used by PALEOMIX expects quality-scores no higher than 41, corresponding to the ASCII character ‘J’ when encoded using offset 33. If the input data contains quality-scores greater than this value, it is necessary to specify the maximum value using the ‘--qualitymax’ command-line option; see below.
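
For example, assuming that the input data contained quality-scores of up to 50, the option might be added to the AdapterRemoval section of the makefile roughly as follows (the value 50 is purely illustrative):

  AdapterRemoval:
     # Hypothetical example: accept quality-scores of up to 50
     --qualitymax: 50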

Warning

Presently, quality-offsets other than 33 are not supported when using the BWA ‘mem’ or the BWA ‘bwasw’ algorithms. To use these algorithms with quality-offset 64 data, it is therefore necessary to first convert these data to offset 33. This can be accomplished using the seqtk tool.
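
A minimal sketch of such a conversion, assuming a gzipped offset-64 file named ‘reads_offset64.fastq.gz’ (both filenames are placeholders):

$ seqtk seq -Q64 -V reads_offset64.fastq.gz | gzip > reads_offset33.fastq.gz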

Options:AdapterRemoval:--adapter1 and Options:AdapterRemoval:--adapter2

These two options are used to specify the adapter sequences used by AdapterRemoval to identify and trim reads that contain adapter contamination. The sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, the sequences should be specified in the orientation in which they appear in these files (i.e. it should be possible to grep the files for them, assuming that the reads are long enough, and treating Ns as wildcards). It is very important that these be specified correctly; please refer to the AdapterRemoval documentation for more information.
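
As a quick sanity check, one can count the reads that carry the expected adapter sequence; in the following sketch, the file name and the adapter fragment (a prefix of the default AdapterRemoval --adapter1 sequence) are used purely for illustration:

$ gunzip -c 000_data/ACGATA_L1_R1_01.fastq.gz | grep -c "AGATCGGAAGAGCACACGTC"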

Aligners:Program

The short-read alignment program used to map the (trimmed) reads to the reference genome. Currently, users may choose between ‘BWA’ and ‘Bowtie2’, with additional options available for each program.

Aligners:BWA:MinQuality and Aligners:Bowtie2:MinQuality

The minimum mapping quality of hits to retain during the mapping process. If this option is set to a non-zero value, any hits with a mapping quality below this value are removed from the resulting BAM file (this option does not apply to unmapped reads). If the final BAM should contain all reads in the input files, this option must be set to 0, and the ‘FilterUnmappedReads’ option set to ‘no’.

Aligners:BWA:UseSeed

Enable/disable the use of a seed region when mapping reads using the BWA ‘backtrack’ alignment algorithm (the default). Disabling this option may yield some improvements in the alignment of highly damaged ancient DNA, at the cost of significantly increasing the running time. As such, this option is not recommended for modern samples [Schubert2012].

For the purpose of the example project, we need only change a few options. Since the reads were simulated using a Phred score offset of 33, there is no need to change the ‘QualityOffset’ option, and since the simulated adapter sequences match the adapters that AdapterRemoval searches for by default, we do not need to set either ‘--adapter1’ or ‘--adapter2’. We will use the default mapping program (BWA) and algorithm (‘backtrack’), but change the minimum mapping quality to 30 (corresponding to an error probability of 0.001). This is accomplished by locating the ‘Aligners’ section of the makefile and changing the ‘MinQuality’ value from 0 to 30 (line 12 of the excerpt below):

# Settings for aligners supported by the pipeline
Aligners:
  # Choice of aligner software to use, either "BWA" or "Bowtie2"
  Program: BWA

  # Settings for mappings performed using BWA
  BWA:
    # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
    # for a description of each algorithm (defaults to 'backtrack')
    Algorithm: backtrack
    # Filter aligned reads with a mapping quality (Phred) below this value
    MinQuality: 30
    # Filter reads that did not map to the reference sequence
    FilterUnmappedReads: yes
    # Should be disabled ("no") for aDNA alignments, as post-mortem damage
    # localizes to the seed region, which BWA expects to have few
    # errors (sets "-l"). See http://pmid.us/22574660
    UseSeed: yes

Since the data we will be mapping represents (simulated) ancient DNA, we will furthermore set the UseSeed option to ‘no’ (line 18 of the excerpt), in order to recover a small number of additional alignments during mapping (cf. [Schubert2012]):

# Settings for aligners supported by the pipeline
Aligners:
  # Choice of aligner software to use, either "BWA" or "Bowtie2"
  Program: BWA

  # Settings for mappings performed using BWA
  BWA:
    # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
    # for a description of each algorithm (defaults to 'backtrack')
    Algorithm: backtrack
    # Filter aligned reads with a mapping quality (Phred) below this value
    MinQuality: 30
    # Filter reads that did not map to the reference sequence
    FilterUnmappedReads: yes
    # Should be disabled ("no") for aDNA alignments, as post-mortem damage
    # localizes to the seed region, which BWA expects to have few
    # errors (sets "-l"). See http://pmid.us/22574660
    UseSeed: no

Once this is done, we can proceed to specify the location of the reference genome(s) that we wish to map our reads against.

Reference genomes (prefixes)

Mapping is carried out using one or more reference genomes (or other sequences) in the form of FASTA files, which are indexed for use in read mapping (automatically, by the pipeline) using either the “bwa index” or “bowtie2-build” command. Since the sequence alignment indexes are generated at the location of these files, reference genomes are also referred to as “prefixes” in the documentation. Using BWA as an example, the PALEOMIX pipeline will generate an index (prefix) of the reference genome using a command corresponding to the following:

$ bwa index prefixes/my_genome.fa
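
If Bowtie2 is selected instead, the index would be built using a command roughly corresponding to the following (the exact invocation used by the pipeline may differ):

$ bowtie2-build prefixes/my_genome.fa prefixes/my_genome.fa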

In addition to the BWA / Bowtie2 index, several other related files are also generated automatically, including a FASTA index file (.fai) and a sequence dictionary (.dict), which are required for various operations of the pipeline. These are similarly placed in the same folder as the reference FASTA file. For a more detailed description, please refer to the File structure section.

Warning

Since the pipeline automatically carries out indexing of the FASTA files, it requires write-access to the folder containing them. If this is not possible, one may simply create a local folder containing symbolic links to the original FASTA file(s), and point the makefile to this location. All automatically generated files will then be placed in this folder.
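
A minimal sketch of this workaround, assuming that the original FASTA file resides in a read-only location (both paths are placeholders):

$ mkdir my_prefixes
$ ln -s /shared/references/my_genome.fasta my_prefixes/my_genome.fasta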

Specifying which FASTA file(s) to align sequences against is accomplished by listing these in the “Prefixes” section of the makefile. For example, assuming that we have a FASTA file named “my_genome.fasta” located in the folder “my_prefixes”, the following might be used:

Prefixes:
  my_genome:
    Path: my_prefixes/my_genome.fasta

The name of the prefix (here ‘my_genome’) is used in the names of the resulting files and in the various tables generated by the pipeline. Typical names include ‘hg19’, ‘EquCab20’, and other standard abbreviations for reference genomes, accession numbers, and the like. Multiple prefixes can be specified, but each name MUST be unique:

Prefixes:
  my_genome:
    Path: my_prefixes/my_genome.fasta
  my_other_genome:
    Path: my_prefixes/my_other_genome.fasta

Warning

FASTA files used in the BAM pipeline must be named with a .fasta file extension. Furthermore, if alignments are to be carried out against the human nuclear genome, chromosomes MUST be ordered by their number for GATK to work! See the GATK FAQ for more information.

In the case of this example project, we will be mapping our data against the revised Cambridge Reference Sequence (rCRS) for the human mitochondrial genome, which is included in the examples folder under ‘000_prefixes’, as a file named ‘rCRS.fasta’. To add it to the makefile, locate the ‘Prefixes’ section below the ‘Options’ section, and update it as described above (lines 6 and 8 of the excerpt below):

# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, an optional label, and an optional
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  rCRS:
    # Path to .fasta file containing a set of reference sequences.
    Path: 000_prefixes/rCRS.fasta

Once this is done, we may specify the input data that we wish the pipeline to process for us.

Specifying read data

A single makefile may be used to process one or more samples, generating one or more BAM files and supplementary statistics. In this project we will only deal with a single sample, which we accomplish by creating our own section at the end of the makefile. The first step is to determine a name for the output generated by the BAM pipeline. Specifically, we specify a name that is prefixed to all output generated for our sample (here ‘MyFilename’), by adding the following lines to the end of the makefile:

# You can also add comments like these to document your experiment
MyFilename:

This first name, or grouping, is referred to as the target, and typically corresponds to the name of the sample being processed, though any name may do. The actual sample name is specified next (it is possible, but uncommon, for a single target to contain multiple samples), and is both used in tables of summary statistics and recorded in the resulting BAM files. This is accomplished by adding another line below the target name:

# You can also add comments like these to document your experiment
MyFilename:
  MySample:

Similarly, we need to specify the name of each library in our dataset. By convention, the index used to construct a library is often used as the library name (which allows for easy identification), but any name may be used, provided that it is unique within that sample. As described near the start of this document, we are dealing with 3 libraries:

Library  Lane  Type  Files
ACGATA   1     PE    000_data/ACGATA_L1_*.fastq.gz
GCTCTG   1     SE    000_data/GCTCTG_L1_*.fastq.gz
TGCTCA   1     SE    000_data/TGCTCA_L1_*.fastq.gz
TGCTCA   2     PE    000_data/TGCTCA_L2_*.fastq.gz

It is important to specify the libraries correctly, since the pipeline will not only use this information for summary statistics and record it in the resulting BAM files, but will also carry out filtering of PCR duplicates (and other analyses) on a per-library basis. Incorrectly grouping the data may therefore result in the loss of useful alignments wrongly identified as PCR duplicates, or, conversely, in the inclusion of reads that should have been filtered as PCR duplicates. The library names are added below the name of the sample (‘MySample’), in a similar manner to the sample itself:

# You can also add comments like these to document your experiment
MyFilename:
  MySample:
    ACGATA:

    GCTCTG:

    TGCTCA:

The final step involves specifying the location of the raw FASTQ reads to be processed for each library, and consists of specifying one or more “lanes” of reads, each of which must be given a unique name. For single-end reads, this is accomplished simply by providing a path (with optional wildcards) to the location of the file(s). For example, for lane 1 of the library GCTCTG, the files are located at 000_data/GCTCTG_L1_*.fastq.gz:

$ ls 000_data/GCTCTG_L1_*.fastq.gz
000_data/GCTCTG_L1_R1_01.fastq.gz
000_data/GCTCTG_L1_R1_02.fastq.gz
000_data/GCTCTG_L1_R1_03.fastq.gz

We simply specify these paths for each of the single-end lanes, here using the lane number to name them (as with the names above, the lane name is used to tag the data in the resulting BAM file):

# You can also add comments like these to document your experiment
MyFilename:
  MySample:
    ACGATA:

    GCTCTG:
      Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

    TGCTCA:
      Lane_1: 000_data/TGCTCA_L1_*.fastq.gz

Specifying the location of paired-end data is slightly more complex, since the pipeline needs to be able to locate both files in a pair. This is accomplished by making the assumption that paired-end files are numbered as either mate 1 or mate 2, as shown here for 4 pairs of files with the common _R1 and _R2 labels:

$ ls 000_data/ACGATA_L1_*.fastq.gz
000_data/ACGATA_L1_R1_01.fastq.gz
000_data/ACGATA_L1_R1_02.fastq.gz
000_data/ACGATA_L1_R1_03.fastq.gz
000_data/ACGATA_L1_R1_04.fastq.gz
000_data/ACGATA_L1_R2_01.fastq.gz
000_data/ACGATA_L1_R2_02.fastq.gz
000_data/ACGATA_L1_R2_03.fastq.gz
000_data/ACGATA_L1_R2_04.fastq.gz

Knowing that each filename contains a number specifying which mate in a pair it corresponds to, we can construct a path that includes the keyword ‘{Pair}’ in place of that number. For the above example, that path would be ‘000_data/ACGATA_L1_R{Pair}_*.fastq.gz’ (corresponding to ‘000_data/ACGATA_L1_R[12]_*.fastq.gz’):

# You can also add comments like these to document your experiment
MyFilename:
  MySample:
    ACGATA:
      Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

    GCTCTG:
      Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

    TGCTCA:
      Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
      Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

Note

Note that while the paths given here are relative to the directory in which the pipeline is run, it is also possible to provide absolute paths, should the files be located elsewhere.
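
For instance, a lane could equally well point to files outside the project folder using an absolute path (the path shown is merely an illustration):

    GCTCTG:
      Lane_1: /home/username/data/GCTCTG_L1_*.fastq.gz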

Note

At the time of writing, the PALEOMIX pipeline supports uncompressed, gzipped, and bzipped FASTQ reads. It is not necessary to use any particular file extension for these, as the compression method (if any) is detected automatically.

The final makefile

Once we’ve completed the steps described above, the resulting makefile should look like the following, shown here in full with the modifications we’ve made incorporated:

# -*- mode: Yaml; -*-
# Timestamp: 2016-02-04T10:53:59.906883
#
# Default options.
# Can also be specified for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specified globally.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: bz2

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
     # Adapter sequences, set and uncomment to override defaults
#     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
     # Some BAM pipeline defaults differ from AR defaults;
     # To override, change these value(s):
     --mm: 3
     --minlength: 25
     # Extra features enabled by default; change 'yes' to 'no' to disable
     --collapse: yes
     --trimns: yes
     --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 30
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Should be disabled ("no") for aDNA alignments, as post-mortem damage
      # localizes to the seed region, which BWA expects to have few
      # errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: no
      # Additional command-line options may be specified for the "aln"
      # call(s), as described for Bowtie2 below.

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_collapsed', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

  # Carry out quality base re-scaling of libraries using mapDamage
  # This will be done using the options set for mapDamage below
  RescaleQualities: no

  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' to exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    Single: no              # Single-ended reads / Orphaned paired-ended reads
    Paired: no              # Paired ended reads
    Singleton: no           # Paired reads for which the mate was discarded
    Collapsed: no           # Overlapping paired-ended reads collapsed into a
                            # single sequence by AdapterRemoval
    CollapsedTruncated: no  # Like 'Collapsed', except that the reads were
                            # truncated due to the presence of ambiguous
                            # or low-quality bases at the read termini.

  # Optional steps to perform during processing
  Features:
    RawBAM: no          # Generate BAM from the raw libraries (no indel realignment)
                        #   Location: {Destination}/{Target}.{Genome}.bam
    RealignedBAM: yes   # Generate indel-realigned BAM using the GATK Indel realigner
                        #   Location: {Destination}/{Target}.{Genome}.realigned.bam
    mapDamage: yes      # Generate mapDamage plot for each (unrealigned) library
                        #   Location: {Destination}/{Target}.{Genome}.mapDamage/{Library}/
    Coverage: yes       # Generate coverage information for the raw BAM (wo/ indel realignment)
                        #   Location: {Destination}/{Target}.{Genome}.coverage
    Depths: yes         # Generate histogram of number of sites with a given read-depth
                        #   Location: {Destination}/{Target}.{Genome}.depths
    Summary: yes        # Generate summary table for each target
                        #   Location: {Destination}/{Target}.summary
    DuplicateHist: no   # Generate histogram of PCR duplicates, for use with PreSeq
                        #   Location: {Destination}/{Target}.{Genome}.duphist/{Library}/


# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, an optional label, and an optional
# set of regions for which additional statistics are produced.
Prefixes:
  # Name of the prefix; is used as part of the output filenames
  rCRS:
    # Path to .fasta file containing a set of reference sequences.
    Path: 000_prefixes/rCRS.fasta

    # Label for prefix: One of nuclear, mitochondrial, chloroplast,
    # plasmid, bacterial, or viral. Is used in the .summary files.
#    Label: ...

    # Produce additional coverage / depth statistics for a set of
    # regions defined in a BED file; if no names are specified for the
    # BED records, results are named after the chromosome / contig.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE


# You can also add comments like these to document your experiment
MyFilename:
  MySample:
    ACGATA:
      Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

    GCTCTG:
      Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

    TGCTCA:
      Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
      Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

With this makefile in hand, the pipeline may be executed using the following command:

$ paleomix bam_pipeline run makefile.yaml
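
Additional command-line options, described below, may be appended to this command; for example, a run limited to 4 simultaneous processes and writing its output to a separate ‘results’ folder might look as follows (both values are arbitrary examples):

$ paleomix bam_pipeline run makefile.yaml --max-threads 4 --destination results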

The pipeline will run as many simultaneous processes as there are cores in the current system, but this behavior may be changed using the ‘--max-threads’ command-line option. Use the ‘--help’ command-line option to view additional options available when running the pipeline. By default, output files are placed in the same folder as the makefile, but this behavior may be changed using the ‘--destination’ command-line option. For this project, these files include the following:

$ ls -d MyFilename*
MyFilename
MyFilename.rCRS.coverage
MyFilename.rCRS.depths
MyFilename.rCRS.mapDamage
MyFilename.rCRS.realigned.bai
MyFilename.rCRS.realigned.bam
MyFilename.summary

The files include a table of the average coverages, a histogram of the per-site coverages (depths), a folder containing one set of mapDamage plots per library, the final BAM file and its index (the .bai file), and a table summarizing the entire analysis. For a more detailed description of the files generated by the pipeline, please refer to the File structure section; should problems occur during the execution of the pipeline, please verify that the makefile is filled out correctly as described above, and refer to the Troubleshooting the BAM Pipeline section.

Note

The first item, ‘MyFilename’, is a folder containing intermediate files generated while running the pipeline; these are required due to the many steps involved in a typical analysis, and also allow the pipeline to resume should the process be interrupted. This folder will typically take up 3-4x the disk-space used by the final BAM file(s), and can safely be removed once the pipeline has run to completion, in order to reduce disk usage.
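
Once the pipeline has completed and the final BAM files have been verified, the intermediate files could, for example, be removed as follows:

$ rm -r MyFilename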