File structure

This section describes the file structure that results from executing the BAM pipeline example project (see Example projects and data-sets), whose samples, libraries, and lanes are defined as follows:

ExampleProject: # Target name
  Synthetic_Sample_1: # Sample name
    ACGATA: # Library 1
      Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz
      Lane_2:
        Singleton: 000_data/ACGATA_L2/reads.singleton.truncated.gz
        Collapsed: 000_data/ACGATA_L2/reads.collapsed.gz
        CollapsedTruncated: 000_data/ACGATA_L2/reads.collapsed.truncated.gz

    GCTCTG: # Library 2
      Lane_1: 000_data/GCTCTG_L1_R1_*.fastq.gz
      Lane_2:
        rCRS: 000_data/GCTCTG_L2.bam

    TGCTCA: # Library 3
      Options:
        SplitLanesByFilenames: no

      Lane_1: 000_data/TGCTCA_L1_R1_*.fastq.gz
      Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

Once executed, this example is expected to generate the following result files, depending on which options are enabled:

  • ExampleProject.rCRS.bam
  • ExampleProject.rCRS.bai
  • ExampleProject.rCRS.realigned.bam
  • ExampleProject.rCRS.realigned.bai
  • ExampleProject.rCRS.coverage
  • ExampleProject.rCRS.depths
  • ExampleProject.rCRS.duphist
  • ExampleProject.rCRS.mapDamage
  • ExampleProject.summary

As well as a single folder containing intermediate results:

  • ExampleProject/
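
The naming scheme of these result files can be sketched in Python. This is only an illustration, not part of PALEOMIX: the function name `expected_result_files` is hypothetical, and the suffixes are copied from the list above. Which of these files are actually created depends on the options enabled in the makefile.

```python
def expected_result_files(target, prefixes):
    """Return the result filenames listed above for one target.

    Hypothetical helper for illustration; not part of PALEOMIX.
    """
    # Per-prefix suffixes, copied from the list of result files above.
    per_prefix = (
        "bam", "bai", "realigned.bam", "realigned.bai",
        "coverage", "depths", "duphist", "mapDamage",
    )
    names = [
        f"{target}.{prefix}.{suffix}"
        for prefix in prefixes
        for suffix in per_prefix
    ]
    # The summary table is created once per target, not once per prefix.
    names.append(f"{target}.summary")
    return names


for name in expected_result_files("ExampleProject", ["rCRS"]):
    print(name)
```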

Warning

Please be aware that the internal file structure of PALEOMIX may change between major revisions (e.g. v1.1 to v1.2), but is not expected to change between minor revisions (e.g. v1.1.1 to v1.1.2). Consequently, if you wish to re-run an old project with the PALEOMIX pipeline, it is recommended to either use the same version of PALEOMIX, or to remove the folder containing intermediate files before starting (see below), in order to ensure that analyses are re-run from scratch.
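
Removing the intermediate folder can be done by hand. The following Python sketch shows one way to script it, on the assumption that the folder is named after the target as described in this section. The helper `remove_intermediate_folder` is hypothetical and not part of PALEOMIX, and the demonstration runs in a temporary directory so that no real data is touched.

```python
import shutil
import tempfile
from pathlib import Path


def remove_intermediate_folder(target, root="."):
    """Delete the intermediate-results folder of a target, if present.

    Hypothetical helper, not part of PALEOMIX. Only the folder named
    after the target is removed; result files stored next to it (such
    as "<target>.summary") are left untouched.
    """
    folder = Path(root) / target
    if folder.is_dir():
        shutil.rmtree(folder)
        return True
    return False


# Demonstration on a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "ExampleProject" / "rCRS").mkdir(parents=True)
    (Path(root) / "ExampleProject.summary").touch()
    print(remove_intermediate_folder("ExampleProject", root))
    print(sorted(path.name for path in Path(root).iterdir()))
```

Deleting the folder is irreversible, so it is worth double-checking the path before running something like this on a real project.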

Primary results

These files are the main results generated by the PALEOMIX pipeline:

ExampleProject.rCRS.bam and ExampleProject.rCRS.bai

Final BAM file, which has not been realigned using the GATK Indel Realigner, and its index file (.bai), created using “samtools index”. If rescaling has been enabled, this BAM will contain reads processed by mapDamage.

ExampleProject.rCRS.realigned.bam and ExampleProject.rCRS.realigned.bai

BAM file realigned using the GATK Indel Realigner, and its index file (.bai), created using “samtools index”. If rescaling has been enabled, this BAM will contain reads processed by mapDamage.

ExampleProject.rCRS.mapDamage/

Per-library analyses generated using mapDamage2.0. If rescaling is enabled, these folders also include the model files generated for each library. See the mapDamage2.0 documentation for a description of these files.

ExampleProject.rCRS.coverage

Coverage statistics generated using the ‘paleomix coverage’ command. These include per sample, per library and per contig / chromosome breakdowns.

ExampleProject.rCRS.depths

Depth histogram generated using the ‘paleomix depths’ command. As with the coverage, this information is broken down by sample, library, and contig / chromosome.

ExampleProject.rCRS.duphist

Per-library histograms of PCR duplicates, for use with preseq [Daley2013] to estimate the remaining molecular complexity of these libraries. Please refer to the original PALEOMIX publication [Schubert2014] for more information.

ExampleProject.summary

A summary table, which is created for each target if enabled in the makefile. This table contains a summary of the project, including the number and types of reads processed, average coverage, and other statistics broken down by prefix, sample, and library.

Warning

Some statistics will be missing if pre-trimmed reads are included in the makefile, since PALEOMIX relies on the output from the adapter trimming software to collect these values.

Intermediate results

Internally, the BAM pipeline uses a simple file structure which corresponds to the visual structure of the makefile. For each target (in this case “ExampleProject”), a folder of the same name is created in the directory in which the makefile is executed. This folder contains one folder holding the trimmed / collapsed reads and one folder for each prefix (in this case, only “rCRS”), as well as some additional files used in certain analytical steps (see below):

$ ls ExampleProject/
reads/
rCRS/
[...]

Trimmed reads

Each of these folders in turn contains a directory structure that corresponds to the names of the samples, libraries, and lanes, shown here for Lane_1 in library ACGATA. If the option “SplitLanesByFilenames” is enabled (as shown here), several numbered folders may be created for each lane, using a 3-digit postfix:

ExampleProject/
  reads/
    Synthetic_Sample_1/
      ACGATA/
        Lane_1_001/
        Lane_1_002/
        Lane_1_003/
[...]
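
The layout shown above can be expressed as a short Python sketch. The helper `trimmed_lane_folders` is hypothetical and not part of PALEOMIX. It assumes that numbering starts at 001, as in the listing, and that the number of numbered folders (`n_splits`) is already known to the caller.

```python
from pathlib import PurePosixPath


def trimmed_lane_folders(target, sample, library, lane, n_splits=0):
    """Return the folders holding the trimmed reads of one lane.

    Hypothetical helper, not part of PALEOMIX. `n_splits` is the number
    of numbered folders created for the lane; 0 means that the lane is
    kept as a single, unnumbered folder.
    """
    base = PurePosixPath(target) / "reads" / sample / library
    if n_splits == 0:
        return [base / lane]
    # Numbered folders use a 3-digit, zero-padded suffix: _001, _002, ...
    return [base / f"{lane}_{i:03d}" for i in range(1, n_splits + 1)]


for folder in trimmed_lane_folders(
    "ExampleProject", "Synthetic_Sample_1", "ACGATA", "Lane_1", n_splits=3
):
    print(folder)
```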

The lane folders contain the output of AdapterRemoval, with most filenames corresponding to the read types listed in the makefile under the option “ExcludeReads”:

$ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_1_001/
reads.settings  # Settings / statistics file generated by AdapterRemoval
reads.discarded.bz2  # Low-quality or short reads
reads.truncated.bz2  # Single-end reads following adapter removal
reads.collapsed.bz2  # Paired-end reads collapsed into single reads
reads.collapsed.truncated.bz2  # Collapsed reads trimmed at either terminus
reads.pair1.truncated.bz2  # The first mate read of paired reads
reads.pair2.truncated.bz2  # The second mate read of paired reads
reads.singleton.truncated.bz2  # Paired-end reads for which one mate was discarded
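
When inspecting a lane folder programmatically, the listing above can serve as a lookup table. The following Python sketch is an illustration only: the mapping `ADAPTERREMOVAL_OUTPUTS` and the function `describe_lane_files` are hypothetical names, and the descriptions are taken from the listing rather than from AdapterRemoval itself.

```python
# Descriptions copied from the listing above; the mapping itself is a
# hypothetical convenience and not part of PALEOMIX or AdapterRemoval.
ADAPTERREMOVAL_OUTPUTS = {
    "reads.settings": "settings / statistics file",
    "reads.discarded.bz2": "low-quality or short reads",
    "reads.truncated.bz2": "single-end reads following adapter removal",
    "reads.collapsed.bz2": "paired-end reads collapsed into single reads",
    "reads.collapsed.truncated.bz2": "collapsed reads trimmed at either terminus",
    "reads.pair1.truncated.bz2": "first mate read of paired reads",
    "reads.pair2.truncated.bz2": "second mate read of paired reads",
    "reads.singleton.truncated.bz2": "paired-end reads with one mate discarded",
}


def describe_lane_files(filenames):
    """Split filenames into (known outputs with descriptions, other names)."""
    known = {
        name: ADAPTERREMOVAL_OUTPUTS[name]
        for name in filenames
        if name in ADAPTERREMOVAL_OUTPUTS
    }
    other = sorted(name for name in filenames if name not in ADAPTERREMOVAL_OUTPUTS)
    return known, other


known, other = describe_lane_files(
    ["reads.settings", "reads.collapsed.bz2", "notes.txt"]
)
for name, description in known.items():
    print(f"{name}: {description}")
print("not in the listing:", other)
```

The filenames would typically come from `os.listdir` on a lane folder. They are passed in as a plain list here so that the sketch does not depend on any files existing.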

If the reads were pre-trimmed (as is the case for Lane_2 of the library ACGATA), then a single file is generated to signal that the reads have been validated (attempting to detect invalid quality scores and/or file formats):

$ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/
reads.pretrimmed.validated

The .validated file is an empty file marking the successful validation of pre-trimmed reads. If the validation fails with a false positive, creating this file for the lane in question allows one to bypass the validation step.
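
Creating the marker file by hand could look like the following Python sketch. The helper `mark_pretrimmed_validated` is hypothetical and not part of PALEOMIX; the marker filename is the one shown in the listing above, and the demonstration uses a temporary directory in place of a real project.

```python
import tempfile
from pathlib import Path


def mark_pretrimmed_validated(lane_folder):
    """Create the empty marker file for a lane of pre-trimmed reads.

    Hypothetical helper, not part of PALEOMIX. The marker name is taken
    from the listing above.
    """
    lane_folder = Path(lane_folder)
    lane_folder.mkdir(parents=True, exist_ok=True)
    marker = lane_folder / "reads.pretrimmed.validated"
    marker.touch()  # creates an empty file if none exists yet
    return marker


# Demonstration in a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as root:
    lane = Path(root) / "ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2"
    marker = mark_pretrimmed_validated(lane)
    print(marker.name, marker.stat().st_size)
```

Since this skips the validation entirely, it is best reserved for cases in which the reported problem is known to be a false positive.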

Mapped reads (BAM format)

The file structure used for mapped reads is similar to that described for the trimmed reads, but includes a larger number of files. Using “Lane_1” of library “ACGATA” as an example, the following files are created in each folder for that lane, with one set of files per read type (collapsed, collapsedtruncated, paired, and single), depending on the lane type (SE or PE):

$ ls ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1_001/
collapsed.bai  # Index file used for accessing the .bam file
collapsed.bam  # The mapped reads in BAM format
collapsed.coverage  # Coverage statistics
collapsed.validated  # Log-file from Picard ValidateSamFile marking that the .bam file has been validated
[...]

For each library, two sets of files are created in the folder corresponding to the sample. These correspond to the two ways in which duplicates are filtered: one method for “normal” reads (paired and single-ended reads), and one method for “collapsed” reads (taking advantage of the fact that both external coordinates of the mapping are informative). Note, however, that “collapsedtruncated” reads are included among the normal reads, as at least one of their external coordinates is unreliable. Thus, the following files are observed:

ExampleProject/
  rCRS/
    Synthetic_Sample_1/
      ACGATA.duplications_checked
      ACGATA.rmdup.*.bai
      ACGATA.rmdup.*.bam
      ACGATA.rmdup.*.coverage
      ACGATA.rmdup.*.validated

With the exception of the “duplications_checked” file, these correspond to the files created in the lane folder. The “duplications_checked” file marks the successful completion of a validation step that attempts to detect data duplication caused by the inclusion of the same reads / files multiple times (not PCR duplicates!).
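
The difference between data duplication and PCR duplicates can be illustrated with a small Python sketch. This is not PALEOMIX's implementation. It assumes, for the sake of illustration, that a read included twice shows up as records that are identical in every field, including the read name, whereas PCR duplicates carry different read names. The function `find_repeated_records` and the record layout are hypothetical.

```python
from collections import Counter


def find_repeated_records(records):
    """Return the records that occur more than once, in sorted order.

    Each record is a (name, position, sequence, qualities) tuple. The
    choice of fields is an assumption made for this sketch.
    """
    counts = Counter(records)
    return sorted(record for record, n in counts.items() if n > 1)


records = [
    ("read_1", 100, "ACGT", "IIII"),
    # Same position and sequence but a different name, as might be seen
    # for a PCR duplicate; this is not reported.
    ("read_2", 100, "ACGT", "IIII"),
    # Exact repeat of the first record, as expected if the same file
    # was included twice; this is reported.
    ("read_1", 100, "ACGT", "IIII"),
]
print(find_repeated_records(records))
```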

If rescaling is enabled, a set of files is created for each library, containing the BAM file generated using the mapDamage2.0 quality rescaling functionality, but otherwise corresponding to the files described above:

ExampleProject/
  rCRS/
    Synthetic_Sample_1/
      ACGATA.rescaled.bai
      ACGATA.rescaled.bam
      ACGATA.rescaled.coverage
      ACGATA.rescaled.validated

Finally, the resulting BAMs for each library (rescaled or not) are merged, optionally realigned using the GATK Indel Realigner, and validated. This results in the creation of the following files in the target folder:

ExampleProject/
  rCRS.validated  # Signifies that the "raw" BAM has been validated
  rCRS.realigned.validated  # Signifies that the realigned BAM has been validated
  rCRS.intervals   # Intervals selected by the GATK IndelRealigner during training
  rCRS.duplications_checked  # Similar to above, but catches duplicates across samples / libraries
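
Since these marker files signal that individual steps have completed, checking for them gives a rough view of how far a target has progressed. The following Python sketch assumes the filenames shown in the listing above; the helper `target_markers` is hypothetical and not part of PALEOMIX, and the demonstration uses a temporary directory in place of a real project.

```python
import tempfile
from pathlib import Path


def target_markers(target_folder, prefix):
    """Report which marker files exist in a target folder.

    Hypothetical helper, not part of PALEOMIX; the marker names are
    taken from the listing above.
    """
    folder = Path(target_folder)
    names = (
        f"{prefix}.validated",
        f"{prefix}.realigned.validated",
        f"{prefix}.duplications_checked",
    )
    return {name: (folder / name).is_file() for name in names}


# Demonstration in a throwaway directory rather than a real project.
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "ExampleProject"
    target.mkdir()
    (target / "rCRS.validated").touch()
    for name, present in target_markers(target, "rCRS").items():
        print(f"{name}: {'present' if present else 'missing'}")
```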