File structure

This section describes the file structure that results from executing the BAM pipeline example project (see Example projects and data-sets), whose makefile defines the following target:

ExampleProject:
  Synthetic_Sample_1:
    ACGATA:
      Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
      Lane_2:
        Singleton: data/ACGATA_L2/reads.singleton.truncated.gz
        Collapsed: data/ACGATA_L2/reads.collapsed.gz
        CollapsedTruncated: data/ACGATA_L2/reads.collapsed.truncated.gz

    GCTCTG:
      Lane_1: data/GCTCTG_L1_R1_*.fastq.gz

    TGCTCA:
      Options:
        BWA:
          MinQuality: 30

      Lane_1: data/TGCTCA_L1_R1_*.fastq.gz
      Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

Once executed, this example is expected to generate the following result files, depending on which options are enabled:

  • ExampleProject.rCRS.bam
  • ExampleProject.rCRS.bam.bai
  • ExampleProject.rCRS.coverage
  • ExampleProject.rCRS.depths
  • ExampleProject.rCRS.mapDamage/
  • ExampleProject.summary

As well as a folder containing intermediate results:

  • ExampleProject/
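The listing above suggests a naming pattern of target name, then prefix name, then extension. The following minimal Python sketch builds that list for a given target and prefix. The function name expected_results is hypothetical and not part of PALEOMIX, and which of these files actually exist depends on the options enabled in the makefile.

```python
def expected_results(target, prefix):
    """Return the top-level result paths listed above for one target and prefix.

    Hypothetical helper for illustration; not part of PALEOMIX.
    """
    stem = f"{target}.{prefix}"
    per_prefix = [
        f"{stem}.bam",
        f"{stem}.bam.bai",
        f"{stem}.coverage",
        f"{stem}.depths",
        f"{stem}.mapDamage/",
    ]
    # The summary and the intermediate-results folder are named after the target only
    return per_prefix + [f"{target}.summary", f"{target}/"]


for path in expected_results("ExampleProject", "rCRS"):
    print(path)
```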

Warning

Please be aware that the internal file structure of PALEOMIX may change between major revisions (e.g. v1.1 to v1.2), but is not expected to change between minor revisions (e.g. v1.1.1 to v1.1.2). Consequently, if you wish to re-run an old project with the PALEOMIX pipeline, it is recommended either to use the same version of PALEOMIX or to remove the folder containing intermediate files before starting (see below), so that analyses are re-run from scratch.
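As a rough illustration of the rule above, the following Python sketch compares the first two components of two version strings. It assumes versions are written as in the examples in this warning (such as v1.1.2); the function name is hypothetical and PALEOMIX itself is not known to expose such a check.

```python
def same_file_structure_expected(old_version, new_version):
    """Return True if two versions differ only in their minor revision.

    Hypothetical helper: 'v1.1.1' and 'v1.1.2' share the first two
    components, whereas 'v1.1' and 'v1.2' do not.
    """
    def leading_components(version):
        parts = version.lstrip("v").split(".")
        return tuple(int(part) for part in parts[:2])

    return leading_components(old_version) == leading_components(new_version)


# If this prints False, remove the intermediate folder before re-running
print(same_file_structure_expected("v1.1", "v1.2"))
```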

Primary results

These files are the main results generated by the PALEOMIX pipeline:

ExampleProject.rCRS.bam and ExampleProject.rCRS.bam.bai

Final BAM file and its index file (.bai), created using "samtools index". If rescaling has been enabled, this BAM will contain reads processed by mapDamage.

ExampleProject.rCRS.mapDamage/

Per-library analyses generated using mapDamage2.0. If rescaling or modeling is enabled, this folder also includes the model files generated for each library. See the mapDamage2.0 documentation for a description of these files.

ExampleProject.rCRS.coverage

Coverage statistics generated using the 'paleomix coverage' command. These include per sample, per library, and per contig / chromosome breakdowns.

ExampleProject.rCRS.depths

Depth histogram generated using the 'paleomix depths' command. As with the coverage, this information is broken down by sample, library, and contig / chromosome.

ExampleProject.summary

A summary table, which is created for each target if enabled in the makefile. This table contains a summary of the project, including the number / types of reads processed, average coverage, and other statistics broken down by prefix, sample, and library.

Warning

Some statistics will be missing if pre-trimmed reads are included in the makefile, since PALEOMIX relies on the output from the adapter trimming software to collect these values.

Intermediate results

The BAM pipeline uses a simple file structure that corresponds to the structure of targets in the makefile. A folder is created for each target in the makefile (here "ExampleProject"). This folder contains a folder for the processed FASTQ reads, and a folder for each prefix, as well as some additional files used in certain analytical steps (see below):

$ ls ExampleProject/
reads/
rCRS/
[...]

Processed reads

Each of these folders contains a directory structure that corresponds to that of the makefile. In addition, a named folder is generated for each input FASTQ file or pair of FASTQ files:

ExampleProject/
  reads/
    Synthetic_Sample_1/
      ACGATA/
        Lane_1/
          ACGATA_L1_Rx_01.fastq.gz/
          ACGATA_L1_Rx_02.fastq.gz/
          ACGATA_L1_Rx_03.fastq.gz/
          ACGATA_L1_Rx_04.fastq.gz/
[...]
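The nesting shown above (target, then "reads" or a prefix name, then sample, library, lane, and finally a folder per input file) can be sketched in Python as follows. The helper lane_folder is hypothetical and only mirrors the layout in the listing; it does not derive the per-file folder name from the makefile.

```python
from pathlib import PurePosixPath


def lane_folder(target, subfolder, sample, library, lane, name=None):
    """Build the path of a lane folder, or of a per-file folder inside it.

    Hypothetical helper: 'subfolder' is either "reads" or a prefix name
    such as "rCRS"; 'name' is the folder created for one input file or
    pair of files, taken as-is from the listing.
    """
    path = PurePosixPath(target, subfolder, sample, library, lane)
    return path / name if name else path


print(lane_folder("ExampleProject", "reads", "Synthetic_Sample_1",
                  "ACGATA", "Lane_1", "ACGATA_L1_Rx_01.fastq.gz"))
```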

The lane folders contain the output of AdapterRemoval, with most filenames corresponding to the read-types listed in the makefile under the option "ExcludeReads":

$ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
reads.settings  # Settings / statistics file generated by AdapterRemoval
reads.discarded.gz  # Low-quality or short reads
reads.truncated.gz  # Single-ended reads following adapter-removal
reads.collapsed.gz  # Paired-ended reads collapsed into single reads
reads.collapsed.truncated.gz  # Collapsed reads trimmed at either terminus
reads.pair1.truncated.gz  # The first mate read of paired reads
reads.pair2.truncated.gz  # The second mate read of paired reads
reads.singleton.truncated.gz  # Paired-ended reads for which one mate was discarded
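The correspondence between read types and the files listed above can be written down as a small lookup table. In this Python sketch the Singleton, Collapsed and CollapsedTruncated entries follow the keys used for the pre-trimmed Lane_2 in the example makefile; the "Single" and "Paired" keys are assumptions about how the remaining files are named under "ExcludeReads". The function name is hypothetical.

```python
# Read type -> files holding reads of that type (Single/Paired keys are assumed)
READ_TYPE_FILES = {
    "Single": ["reads.truncated.gz"],
    "Paired": ["reads.pair1.truncated.gz", "reads.pair2.truncated.gz"],
    "Singleton": ["reads.singleton.truncated.gz"],
    "Collapsed": ["reads.collapsed.gz"],
    "CollapsedTruncated": ["reads.collapsed.truncated.gz"],
}


def read_types_present(filenames):
    """Return the read types for which every expected file is present."""
    present = set(filenames)
    return [
        read_type
        for read_type, files in READ_TYPE_FILES.items()
        if all(name in present for name in files)
    ]


# Example: a single-end lane, given the names found in its folder
print(read_types_present(
    ["reads.settings", "reads.discarded.gz", "reads.truncated.gz"]
))
```

The list of names would typically come from os.listdir() on one of the per-file folders.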

If the reads were pre-trimmed (as is the case for Lane_2 of the library ACGATA), then a single file is generated to signal that the reads have been validated (attempting to detect invalid quality scores and/or file formats):

$ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/
reads.pretrimmed.validated

The .validated file is an empty file marking the successful validation of pre-trimmed reads. If the validation fails with a false positive, creating this file for the lane in question allows one to bypass the validation step.
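Creating the marker by hand amounts to creating an empty file with the name shown above. A minimal Python sketch follows; the helper name is hypothetical, the demonstration runs inside a temporary directory rather than a real project, and this should only be done when the validation failure is known to be a false positive.

```python
import tempfile
from pathlib import Path


def mark_pretrimmed_validated(lane_folder):
    """Create the empty marker file for one lane of pre-trimmed reads.

    Hypothetical helper; the file name is taken from the listing above.
    """
    marker = Path(lane_folder) / "reads.pretrimmed.validated"
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return marker


# Demonstration in a throw-away directory
with tempfile.TemporaryDirectory() as root:
    lane = Path(root, "ExampleProject", "reads",
                "Synthetic_Sample_1", "ACGATA", "Lane_2")
    marker = mark_pretrimmed_validated(lane)
    print(marker.name, marker.stat().st_size)
```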

Mapped reads

The file-structure used for mapped reads is similar to that described for the trimmed reads, but includes a larger number of files. Using lane "Lane_1" of library "ACGATA" as an example, the following files are created in each folder for that lane, with one set of files per read type (collapsed, collapsedtruncated, paired, and single), depending on the lane type (SE or PE):

$ ls ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
collapsed.bam  # The mapped reads in BAM format
collapsed.bam.bai  # Index file used for accessing the .bam file
collapsed.coverage  # Coverage statistics
collapsed.sai  # Intermediate alignment file generated by the BWA backtrack algorithm
collapsed.validated  # Log-file from Picard ValidateSamFile marking that the .bam file has been validated
[...]

For each library, two sets of files are created in the folder corresponding to the sample; these correspond to the way in which duplicates are filtered, with one method for "normal" reads (paired and single-ended reads), and one method for "collapsed" reads (taking advantage of the fact that both external coordinates of the mapping are informative). Note, however, that "collapsedtruncated" reads are included among normal reads, as at least one of the external coordinates is unreliable for these. Thus, the following files are observed:

ExampleProject/
  rCRS/
    Synthetic_Sample_1/
      ACGATA.duplications_checked
      ACGATA.rmdup.*.bam
      ACGATA.rmdup.*.bam.bai
      ACGATA.rmdup.*.coverage
      ACGATA.rmdup.*.validated

With the exception of the "duplications_checked" file, these correspond to the files created in the lane folder. The "duplications_checked" file marks the successful completion of a validation step that attempts to detect data duplication due to the inclusion of the same reads / files multiple times (not to be confused with PCR duplicates).
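The assignment of read types to the two duplicate-filtering methods described above can be summarised in a short Python sketch. The function name and the group labels "normal" and "collapsed" are illustrative only; they are not the elided part of the ACGATA.rmdup.* filenames.

```python
def duplicate_filter_group(read_type):
    """Return which duplicate-filtering method applies to a read type.

    Hypothetical helper. Only fully collapsed reads have two reliable
    external coordinates; collapsedtruncated reads are therefore handled
    together with paired and single-ended reads.
    """
    groups = {
        "collapsed": "collapsed",
        "collapsedtruncated": "normal",
        "paired": "normal",
        "single": "normal",
    }
    return groups[read_type.lower()]


for read_type in ("collapsed", "collapsedtruncated", "paired", "single"):
    print(read_type, "->", duplicate_filter_group(read_type))
```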

If rescaling is enabled, a set of files is created for each library, containing the BAM file generated using the mapDamage2.0 quality rescaling functionality, but otherwise corresponding to the files described above:

ExampleProject/
  rCRS/
    Synthetic_Sample_1/
      ACGATA.rescaled.bam
      ACGATA.rescaled.bam.bai
      ACGATA.rescaled.coverage
      ACGATA.rescaled.validated

Finally, the resulting BAMs for each library (rescaled or not) are merged and validated. This results in the creation of the following files in the target folder:

ExampleProject/
  rCRS.validated  # Signifies that the final BAM has been validated
  rCRS.duplications_checked  # Similar to above, but catches duplicates across samples / libraries
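Since these marker files signal that the final steps have completed, their presence can be checked with a few lines of Python. The sketch below is a hypothetical helper, not a PALEOMIX command; it assumes the two file names shown above and is demonstrated on a temporary directory.

```python
import tempfile
from pathlib import Path

# Suffixes of the marker files listed above
FINAL_MARKERS = ("validated", "duplications_checked")


def missing_final_markers(target_folder, prefix):
    """Return the names of final marker files not found in the target folder."""
    folder = Path(target_folder)
    return [
        f"{prefix}.{suffix}"
        for suffix in FINAL_MARKERS
        if not (folder / f"{prefix}.{suffix}").exists()
    ]


# Demonstration: only one of the two markers has been created
with tempfile.TemporaryDirectory() as root:
    target = Path(root, "ExampleProject")
    target.mkdir()
    (target / "rCRS.validated").touch()
    print(missing_final_markers(target, "rCRS"))
```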