Overview of analytical steps
During a typical analysis, the BAM pipeline will proceed through the following steps for each sample:
Initial steps
Each prefix (reference sequences in FASTA format) is indexed using samtools faidx and using the short-read aligner configured for the current project.
Preprocessing of reads
Adapter sequences, low-quality bases, and ambiguous bases are trimmed; overlapping paired-end reads are merged; and short reads are filtered using AdapterRemoval [Schubert2016].
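To make the trimming and length-filtering idea concrete, here is a minimal Python sketch. It is a simplified illustration only and not AdapterRemoval's actual algorithm; the function name trim_read and the thresholds min_quality and min_length are hypothetical values chosen for the example, and adapter removal and read merging are not shown.

```python
def trim_read(seq, quals, min_quality=2, min_length=25):
    """Trim ambiguous (N) and low-quality bases from both ends of a read.

    Hypothetical helper for illustration; the thresholds are assumptions.
    Returns the trimmed (sequence, qualities) pair, or None if the
    trimmed read is shorter than min_length and should be filtered.
    """
    def is_bad(i):
        # A base is trimmed if it is ambiguous or at/below the quality cutoff
        return seq[i] in "Nn" or quals[i] <= min_quality

    start, end = 0, len(seq)
    # Walk inwards from the 5' end
    while start < end and is_bad(start):
        start += 1
    # Walk inwards from the 3' end
    while end > start and is_bad(end - 1):
        end -= 1

    if end - start < min_length:
        return None
    return seq[start:end], quals[start:end]


if __name__ == "__main__":
    read = "NNACGTACGTN"
    quals = [2, 2, 30, 30, 30, 30, 30, 30, 30, 30, 2]
    print(trim_read(read, quals, min_length=4))
```

Only the read ends are trimmed in this sketch, so a low-quality base in the middle of a read is left untouched.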
Mapping of reads
- Processed reads resulting from the adapter-trimming / read-collapsing step above are mapped using the chosen short-read aligner (BWA or Bowtie2). The resulting BAMs are tagged using the information specified in the makefile (sample, library, lane, etc.).
- The records of the resulting BAM are updated using samtools fixmate to ensure that PE reads contain the correct information about the mate read.
- The BAM is sorted using samtools sort, indexed using samtools index, and validated using Picard ValidateSamFile.
- Finally, the records are updated using samtools calmd to ensure consistent reporting of the number of mismatches relative to the reference genome (BAM tag 'NM').
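As a rough illustration of the post-mapping steps listed above, the following Python sketch assembles the corresponding command lines without running them. The helper name, the file names, and the use of a picard wrapper script are assumptions made for the example; the options the pipeline actually passes to each tool may differ.

```python
import shlex


def mapping_postprocess_commands(raw_bam, reference, out_prefix):
    """Build (argv, stdout_file) pairs for the post-mapping steps.

    Hypothetical helper: file naming is an assumption for illustration.
    stdout_file is None unless the tool writes its result to standard output.
    """
    fixmate_bam = f"{out_prefix}.fixmate.bam"
    sorted_bam = f"{out_prefix}.sorted.bam"
    calmd_bam = f"{out_prefix}.calmd.bam"

    return [
        # Update mate information for paired-end reads
        (["samtools", "fixmate", raw_bam, fixmate_bam], None),
        # Coordinate-sort, then index the sorted BAM
        (["samtools", "sort", "-o", sorted_bam, fixmate_bam], None),
        (["samtools", "index", sorted_bam], None),
        # Validate the BAM (assumes a 'picard' wrapper script is on the PATH)
        (["picard", "ValidateSamFile", f"I={sorted_bam}"], None),
        # Recalculate NM/MD tags; calmd -b writes BAM to standard output
        (["samtools", "calmd", "-b", sorted_bam, reference], calmd_bam),
    ]


if __name__ == "__main__":
    for argv, stdout_file in mapping_postprocess_commands(
        "sample.bam", "reference.fasta", "sample"
    ):
        line = shlex.join(argv)
        if stdout_file is not None:
            line += " > " + shlex.quote(stdout_file)
        print(line)
```

Keeping each command as an argument list rather than a single string avoids shell-quoting problems if the commands are later executed, for example with subprocess.run.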
Filtering of duplicates, recalculation (rescaling) of quality scores, and validation
- If enabled, PCR duplicates are filtered using Picard MarkDuplicates for SE and PE reads and using paleomix rmdup_collapsed for collapsed reads (see the Other tools section). PCR duplicate filtering is carried out per library.
- If mapDamage-based rescaling of quality scores is enabled, quality scores of bases that are potentially the result of post-mortem DNA damage are recalculated using a damage model built using mapDamage2.0 [Jonsson2013].
- The resulting BAMs are indexed and validated using Picard ValidateSamFile. Mapped reads at each position of the alignments are compared using the query name, sequence, and qualities. If a match is found, it is assumed to represent a duplication of input data (see Troubleshooting the BAM Pipeline).
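The check for duplicated input data described above can be sketched in a few lines of Python. This is an illustrative approximation and not the pipeline's own implementation; the record layout (contig, position, query name, sequence, qualities) and the function name are assumptions for the example.

```python
from collections import defaultdict


def find_duplicated_input(records):
    """Report reads that share position, query name, sequence, and qualities.

    records: iterable of (contig, position, query_name, sequence, qualities)
    tuples (a hypothetical, simplified stand-in for BAM records).
    Returns a list of (contig, position, query_name) for each repeated read.
    """
    seen = defaultdict(set)  # (contig, position) -> {(name, seq, qual), ...}
    duplicates = []
    for contig, pos, name, seq, qual in records:
        key = (name, seq, qual)
        if key in seen[(contig, pos)]:
            # Identical name, sequence, and qualities at the same position:
            # most likely the same input data was included twice.
            duplicates.append((contig, pos, name))
        else:
            seen[(contig, pos)].add(key)
    return duplicates


if __name__ == "__main__":
    example = [
        ("chr1", 100, "read_1", "ACGT", "IIII"),
        ("chr1", 100, "read_2", "ACGT", "IIII"),  # same position, other read
        ("chr1", 100, "read_1", "ACGT", "IIII"),  # exact repeat of read_1
    ]
    print(find_duplicated_input(example))
```

Two distinct reads that merely start at the same position are not flagged; only records that agree on all three fields count as a match.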
Generation of final BAMs
Each BAM in the previous step is merged into a final BAM file.
Statistics
- If the Summary feature is enabled, a single summary table is generated for each target. This table summarizes the input data in terms of the raw number of reads, the number of reads following filtering / collapsing, the fraction of reads mapped to each prefix, the fraction of reads filtered as duplicates, and more.
- Coverage statistics and depth histograms are calculated for the intermediate and final BAM files using paleomix coverage and paleomix depths, if enabled. Statistics are calculated genome-wide and for any regions of interest specified by the user.
- If mapDamage is enabled, mapDamage plots are generated; if modeling or rescaling is enabled, a model of the post-mortem DNA damage is also generated.
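To show what coverage statistics and a depth histogram amount to, the following Python sketch computes them from a list of per-position depths. It is a simplified illustration under assumed inputs and does not reproduce the calculations or output format of paleomix coverage or paleomix depths; the function name and the max_depth cap are hypothetical.

```python
from collections import Counter


def depth_statistics(depths, max_depth=200):
    """Summarize per-position read depths for one region.

    depths: list of non-negative integers, one per reference position.
    max_depth: hypothetical cap; deeper positions share the last bin.
    """
    n = len(depths)
    if n == 0:
        return {"mean_depth": 0.0, "fraction_covered": 0.0, "histogram": {}}

    # Histogram: number of positions observed at each depth (capped)
    histogram = Counter(min(d, max_depth) for d in depths)

    return {
        "mean_depth": sum(depths) / n,
        "fraction_covered": sum(1 for d in depths if d > 0) / n,
        "histogram": dict(sorted(histogram.items())),
    }


if __name__ == "__main__":
    print(depth_statistics([0, 1, 1, 2, 4]))
```

The same function can be applied to the whole genome or to the positions of a single region of interest, mirroring the genome-wide and per-region statistics mentioned above.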