The BAM pipeline uses a simple file structure that corresponds to the structure of targets in the makefile. A folder is created for each target in the makefile (here "ExampleProject"). This folder contains a folder for the processed FASTQ reads, and a folder for each prefix, as well as some additional files used in certain analytical steps (see below):
Processed reads
Each of these folders contains a directory structure that corresponds to that of the makefiles. In addition, a folder named after each input FASTQ file or pair of FASTQ files is created:
ExampleProject/
  reads/
    Synthetic_Sample_1/
      ACGATA/
        Lane_1/
          ACGATA_L1_Rx_01.fastq.gz/
          ACGATA_L1_Rx_02.fastq.gz/
          ACGATA_L1_Rx_03.fastq.gz/
          ACGATA_L1_Rx_04.fastq.gz/
          [...]
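The nesting above (target, "reads", sample, library, lane, input file) can be expressed as a small path helper. This is only a sketch: the function name `reads_folder` and its parameter names are hypothetical and are not part of the pipeline itself.

```python
from pathlib import PurePosixPath


def reads_folder(target, sample, library, lane, fastq_name):
    """Build the folder path holding the processed reads for one input file.

    Hypothetical helper; it only mirrors the nesting shown in the tree above.
    """
    return PurePosixPath(target, "reads", sample, library, lane, fastq_name)


print(reads_folder("ExampleProject", "Synthetic_Sample_1", "ACGATA",
                   "Lane_1", "ACGATA_L1_Rx_01.fastq.gz"))
```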
The lane folders contain the output of AdapterRemoval, with most filenames corresponding to the read-types listed in the makefile under the option "ExcludeReads":
$ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
reads.settings                 # Settings / statistics file generated by AdapterRemoval
reads.discarded.gz             # Low-quality or short reads
reads.truncated.gz             # Single-ended reads following adapter-removal
reads.collapsed.gz             # Paired-ended reads collapsed into single reads
reads.collapsed.truncated.gz   # Collapsed reads trimmed at either terminus
reads.pair1.truncated.gz       # The first mate read of paired reads
reads.pair2.truncated.gz       # The second mate read of paired reads
reads.singleton.truncated.gz   # Paired-ended reads for which one mate was discarded
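As a rough illustration, the listing above can be turned into a lookup table used to report which AdapterRemoval outputs are present in a given lane folder. The names `ADAPTERREMOVAL_OUTPUTS` and `describe_lane_folder` are hypothetical; the filenames and descriptions are copied from the listing.

```python
from pathlib import Path

# Filenames and descriptions taken from the listing above.
ADAPTERREMOVAL_OUTPUTS = {
    "reads.settings": "Settings / statistics file generated by AdapterRemoval",
    "reads.discarded.gz": "Low-quality or short reads",
    "reads.truncated.gz": "Single-ended reads following adapter-removal",
    "reads.collapsed.gz": "Paired-ended reads collapsed into single reads",
    "reads.collapsed.truncated.gz": "Collapsed reads trimmed at either terminus",
    "reads.pair1.truncated.gz": "The first mate read of paired reads",
    "reads.pair2.truncated.gz": "The second mate read of paired reads",
    "reads.singleton.truncated.gz": "Paired-ended reads for which one mate was discarded",
}


def describe_lane_folder(folder):
    """Return {filename: description} for the known outputs found in `folder`.

    Hypothetical helper; files not named in the table are ignored.
    """
    folder = Path(folder)
    return {
        name: description
        for name, description in ADAPTERREMOVAL_OUTPUTS.items()
        if (folder / name).exists()
    }
```

Which files are actually present depends on the lane type and on the read-types excluded in the makefile, so an incomplete result is not necessarily an error.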
If the reads were pre-trimmed (as is the case for Lane_2 of the library ACGATA), then a single file is generated to signal that the reads have been validated (attempting to detect invalid quality scores and/or file formats):
$ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/
reads.pretrimmed.validated
The .validated file is an empty file marking the successful validation of pre-trimmed reads. If validation wrongly rejects valid reads (a false positive), creating this file by hand for the lane in question bypasses the validation step.
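Creating the marker by hand amounts to writing an empty file with the name shown above. The sketch below assumes the marker belongs directly in the lane folder, as in the listing; the helper name `mark_pretrimmed_validated` is hypothetical.

```python
from pathlib import Path


def mark_pretrimmed_validated(lane_folder):
    """Create the empty marker file that signals validated pre-trimmed reads.

    Hypothetical helper; the marker name is taken from the listing above.
    """
    marker = Path(lane_folder) / "reads.pretrimmed.validated"
    # Assumption: the lane folder may not exist yet, so create it if needed.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()  # leaves an existing marker untouched apart from its mtime
    return marker
```

For the example project this would be called with `ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2`. Since this skips the check entirely, it is only appropriate when the reads are known to be valid.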
Mapped reads
The file structure used for mapped reads is similar to that described for the trimmed reads, but includes a larger number of files. Using lane "Lane_1" of library "ACGATA" as an example, the following files are created in each folder for that lane, with one set of files per read type (collapsed, collapsedtruncated, paired, and single), depending on the lane type (SE or PE):
$ ls ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
collapsed.bam         # The mapped reads in BAM format
collapsed.bam.bai     # Index file used for accessing the .bam file
collapsed.coverage    # Coverage statistics
collapsed.sai         # Intermediate alignment file generated by the BWA backtrack algorithm
collapsed.validated   # Log-file from Picard ValidateSamFile marking that the .bam file has been validated
[...]
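Because each filename starts with the read type, the contents of such a folder can be grouped by splitting each name at its first dot. The following is a minimal sketch; `group_by_read_type` is a hypothetical name, and the example input is limited to the files listed above.

```python
from collections import defaultdict


def group_by_read_type(filenames):
    """Group filenames by the read type preceding the first dot.

    Hypothetical helper; returns {read_type: [remaining suffix, ...]}.
    """
    groups = defaultdict(list)
    for name in filenames:
        read_type, _, suffix = name.partition(".")
        groups[read_type].append(suffix)
    return dict(groups)


listing = [
    "collapsed.bam",
    "collapsed.bam.bai",
    "collapsed.coverage",
    "collapsed.sai",
    "collapsed.validated",
]
print(group_by_read_type(listing))
```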
For each library, two sets of files are created in the folder corresponding to the sample; these correspond to the two ways in which duplicates are filtered, with one method for "normal" reads (paired and single-ended reads), and one method for "collapsed" reads (taking advantage of the fact that both external coordinates of the mapping are informative). Note, however, that "collapsedtruncated" reads are included among the normal reads, as at least one of their external coordinates is unreliable. Thus, the following files are observed:
ExampleProject/
  rCRS/
    Synthetic_Sample_1/
      ACGATA.duplications_checked
      ACGATA.rmdup.*.bam
      ACGATA.rmdup.*.bam.bai
      ACGATA.rmdup.*.coverage
      ACGATA.rmdup.*.validated
With the exception of the "duplications_checked" file, these correspond to the files created in the lane folder. The "duplications_checked" file marks the successful completion of a validation step that attempts to detect data duplication caused by including the same reads / files multiple times (not to be confused with PCR duplicates).
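The duplicate-filtered BAM files for a library can be located with the same wildcard pattern used in the listing. This is a sketch only: `rmdup_bams` is a hypothetical helper, and it assumes the library name contains no glob metacharacters.

```python
from pathlib import Path


def rmdup_bams(sample_folder, library):
    """Return the sorted names of '<library>.rmdup.*.bam' files in a sample folder.

    Hypothetical helper. The pattern must match the whole filename, so the
    '.bam.bai' index files are not included.
    """
    pattern = f"{library}.rmdup.*.bam"
    return sorted(path.name for path in Path(sample_folder).glob(pattern))
```

For the example project, the sample folder would be `ExampleProject/rCRS/Synthetic_Sample_1` and the library `ACGATA`.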
If rescaling is enabled, a set of files is created for each library, containing the BAM file generated using the mapDamage2.0 quality rescaling functionality, but otherwise corresponding to the files described above:
ExampleProject/
  rCRS/
    Synthetic_Sample_1/
      ACGATA.rescaled.bam
      ACGATA.rescaled.bam.bai
      ACGATA.rescaled.coverage
      ACGATA.rescaled.validated
Finally, the resulting BAMs for each library (rescaled or not) are merged and validated. This results in the creation of the following files in the target folder:
ExampleProject/
  rCRS.validated             # Signifies that the final BAM has been validated
  rCRS.duplications_checked  # Similar to above, but catches duplicates across samples / libraries