.. highlight:: Yaml
.. _bam_filestructure:

File structure
==============

This section describes the file structure that results from running the BAM pipeline example project (see :ref:`examples`). The targets of that project are defined in its makefile as follows::

    ExampleProject:
      Synthetic_Sample_1:
        ACGATA:
          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
          Lane_2:
            Singleton: data/ACGATA_L2/reads.singleton.truncated.gz
            Collapsed: data/ACGATA_L2/reads.collapsed.gz
            CollapsedTruncated: data/ACGATA_L2/reads.collapsed.truncated.gz

        GCTCTG:
          Lane_1: data/GCTCTG_L1_R1_*.fastq.gz

        TGCTCA:
          Options:
            BWA:
              MinQuality: 30

          Lane_1: data/TGCTCA_L1_R1_*.fastq.gz
          Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

Once executed, this example is expected to generate the following result files,
depending on which options are enabled:

* ExampleProject.rCRS.bam
* ExampleProject.rCRS.bam.bai
* ExampleProject.rCRS.coverage
* ExampleProject.rCRS.depths
* ExampleProject.rCRS.mapDamage/
* ExampleProject.summary

As well as a folder containing intermediate results:

* ExampleProject/


.. warning::
    The internal file structure of PALEOMIX may change between major revisions (e.g. v1.1 to v1.2), but is not expected to change between minor revisions (e.g. v1.1.1 to v1.1.2). Consequently, if you wish to re-run an old project with the PALEOMIX pipeline, it is recommended either to use the same version of PALEOMIX or to remove the folder containing intermediate files before starting (see below), so that analyses are re-run from scratch.
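
The clean-up step mentioned in the warning can be sketched in Python as shown below. This is a minimal illustration, not part of PALEOMIX: the function name ``remove_intermediate_results`` is hypothetical, and the sketch assumes that the intermediate folder is named after the target and is located next to the primary result files.

.. code-block:: python

    import shutil
    from pathlib import Path


    def remove_intermediate_results(project_root, target):
        """Delete the folder of intermediate results for a target, if present.

        Hypothetical helper; assumes the folder is '<project_root>/<target>/'.
        Returns True if a folder was removed and False otherwise.
        """
        folder = Path(project_root) / target
        if folder.is_dir():
            shutil.rmtree(folder)
            return True
        return False

Calling ``remove_intermediate_results(".", "ExampleProject")`` would remove only the ``ExampleProject/`` folder, leaving primary results such as ``ExampleProject.rCRS.bam`` in place.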


Primary results
---------------

These files are the main results generated by the PALEOMIX pipeline:

**ExampleProject.rCRS.bam** and **ExampleProject.rCRS.bam.bai**

    Final BAM file and its index file (.bai), created using "samtools index". If rescaling has been enabled, this BAM will contain reads processed by mapDamage.

**ExampleProject.rCRS.mapDamage/**

    Per-library analyses generated using mapDamage2.0. If rescaling or modeling is enabled, this folder also includes the model files generated for each library. See the `mapDamage2.0 documentation`_ for a description of these files.

**ExampleProject.rCRS.coverage**

    Coverage statistics generated using the 'paleomix coverage' command. These include per sample, per library, and per contig / chromosome breakdowns.

**ExampleProject.rCRS.depths**

    Depth histogram generated using the 'paleomix depths' command. As with the coverage, this information is broken down by sample, library, and contig / chromosome.

**ExampleProject.summary**

    A summary table, which is created for each target if enabled in the makefile. This table contains a summary of the project, including the number / types of reads processed, average coverage, and other statistics broken down by prefix, sample, and library.

.. warning::
    Some statistics will be missing if pre-trimmed reads are included in the makefile, since PALEOMIX relies on the output from the adapter trimming software to collect these values.
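
As an illustration of the naming scheme used for the primary results, the following Python sketch builds the expected file names for a target and prefix, and reports which of them are absent from a folder. The function names are hypothetical and not part of PALEOMIX. Since the set of files depends on which options are enabled, a missing file does not necessarily indicate a failure.

.. code-block:: python

    from pathlib import Path


    def expected_results(target, prefix):
        """Return the primary result names listed above for a target / prefix."""
        return [
            f"{target}.{prefix}.bam",
            f"{target}.{prefix}.bam.bai",
            f"{target}.{prefix}.coverage",
            f"{target}.{prefix}.depths",
            f"{target}.{prefix}.mapDamage",  # a folder rather than a file
            f"{target}.summary",  # one summary per target, not per prefix
        ]


    def missing_results(root, target, prefix):
        """Return the expected result names that do not exist in 'root'."""
        root = Path(root)
        return [
            name
            for name in expected_results(target, prefix)
            if not (root / name).exists()
        ]

For the example project, ``expected_results("ExampleProject", "rCRS")`` yields the six names listed at the start of this section.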


Intermediate results
--------------------

The BAM pipeline uses a simple file structure that corresponds to the structure of targets in the makefile. A folder is created for each target in the makefile (here "ExampleProject"). This folder contains a folder for the processed FASTQ reads, and a folder for each prefix, as well as some additional files used in certain analytical steps (see below):

.. code-block:: bash

    $ ls ExampleProject/
    reads/
    rCRS/
    [...]


Processed reads
^^^^^^^^^^^^^^^

The "reads" folder contains a directory structure that corresponds to that of the makefile. In addition, a named folder is generated for each input FASTQ file or pair of FASTQ files:

.. code-block:: bash

    ExampleProject/
      reads/
        Synthetic_Sample_1/
          ACGATA/
            Lane_1/
              ACGATA_L1_Rx_01.fastq.gz/
              ACGATA_L1_Rx_02.fastq.gz/
              ACGATA_L1_Rx_03.fastq.gz/
              ACGATA_L1_Rx_04.fastq.gz/
    [...]
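
Comparing the makefile path ``data/ACGATA_L1_R{Pair}_*.fastq.gz`` with the folder names above suggests that each folder is named after the input file, with the mate number in place of ``{Pair}`` replaced by ``x``. The Python sketch below reproduces that naming under this assumption. It is inferred from the listing rather than taken from the PALEOMIX source, the function name is hypothetical, and returning the filename unchanged for paths without ``{Pair}`` is likewise an assumption.

.. code-block:: python

    import os
    import re


    def lane_folder_name(template, filename):
        """Return the folder name for a FASTQ file, or None if it does not match.

        'template' is a makefile path that may contain '{Pair}' and '*'.
        """
        pattern = ""
        for part in re.split(r"(\{Pair\}|\*)", os.path.basename(template)):
            if part == "{Pair}":
                pattern += "(?P<pair>[12])"  # mate number of the file
            elif part == "*":
                pattern += ".*"  # shell-style wildcard
            else:
                pattern += re.escape(part)  # literal text

        name = os.path.basename(filename)
        match = re.fullmatch(pattern, name)
        if match is None:
            return None
        if "pair" not in match.groupdict():
            return name  # no '{Pair}' in the template (assumption)

        start, end = match.span("pair")
        return name[:start] + "x" + name[end:]

Under this assumption, both mate files of a pair map to the same folder name, consistent with one folder being created per pair of FASTQ files.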

The folders created for each lane contain the output of AdapterRemoval, with most filenames corresponding to the read types listed in the makefile under the option "ExcludeReads":

.. code-block:: bash

    $ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
    reads.settings  # Settings / statistics file generated by AdapterRemoval
    reads.discarded.gz  # Low-quality or short reads
    reads.truncated.gz  # Single-ended reads following adapter-removal
    reads.collapsed.gz  # Paired-ended reads collapsed into single reads
    reads.collapsed.truncated.gz  # Collapsed reads trimmed at either terminus
    reads.pair1.truncated.gz  # The first mate read of paired reads
    reads.pair2.truncated.gz  # The second mate read of paired reads
    reads.singleton.truncated.gz  # Paired-ended reads for which one mate was discarded


If the reads were pre-trimmed (as is the case for Lane_2 of the library ACGATA), then a single file is generated to signal that the reads have been validated (attempting to detect invalid quality scores and/or file formats):

.. code-block:: bash

    $ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/
    reads.pretrimmed.validated

The .validated file is an empty file marking the successful validation of pre-trimmed reads. If the validation fails with a false positive, creating this file for the lane in question allows one to bypass the validation step.
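
A minimal Python sketch of creating such a marker file by hand is shown below. The function name is hypothetical and not part of PALEOMIX, and the path layout is taken from the listing above. This should only be done after confirming that the validation failure really is a false positive.

.. code-block:: python

    from pathlib import Path


    def mark_pretrimmed_validated(target_dir, sample, library, lane):
        """Create the empty marker file for a lane of pre-trimmed reads.

        Hypothetical helper; returns the path of the marker file.
        """
        marker = (
            Path(target_dir)
            / "reads"
            / sample
            / library
            / lane
            / "reads.pretrimmed.validated"
        )
        # Create the lane folder if the pipeline has not yet done so
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
        return marker

For the example project, ``mark_pretrimmed_validated("ExampleProject", "Synthetic_Sample_1", "ACGATA", "Lane_2")`` would create the file shown in the listing above.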


Mapped reads
^^^^^^^^^^^^

The file structure used for mapped reads is similar to that described for the trimmed reads, but includes a larger number of files. Using lane "Lane_1" of library "ACGATA" as an example, the following files are created in each folder for that lane, with one set of files for each type of reads (collapsed, collapsedtruncated, paired, and single), depending on the lane type (SE or PE):

.. code-block:: bash

    $ ls ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
    collapsed.bam  # The mapped reads in BAM format
    collapsed.bam.bai  # Index file used for accessing the .bam file
    collapsed.coverage  # Coverage statistics
    collapsed.sai  # Intermediate alignment file generated by the BWA backtrack algorithm
    collapsed.validated  # Log-file from Picard ValidateSamFile, marking that the .bam file has been validated
    [...]

For each library, two sets of files are created in the folder corresponding to the sample. These correspond to the two ways in which duplicates are filtered: one method for "normal" reads (paired and single-ended reads), and one method for "collapsed" reads (taking advantage of the fact that both external coordinates of the mapping are informative). Note, however, that "collapsedtruncated" reads are included among the normal reads, as at least one of their external coordinates is unreliable. Thus, the following files are observed:

.. code-block:: bash

    ExampleProject/
      rCRS/
        Synthetic_Sample_1/
          ACGATA.duplications_checked
          ACGATA.rmdup.*.bam
          ACGATA.rmdup.*.bam.bai
          ACGATA.rmdup.*.coverage
          ACGATA.rmdup.*.validated

With the exception of the "duplications_checked" file, these correspond to the files created in the lane folder. The "duplications_checked" file marks the successful completion of a validation step that attempts to detect data duplication caused by the inclusion of the same reads / files multiple times (not to be confused with PCR duplicates).
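
The assignment of read types to the two duplicate-filtering methods described above can be summarized with the following Python sketch. The names ``DUPLICATE_FILTER`` and ``group_by_duplicate_filter`` are hypothetical and serve only to illustrate the grouping; they do not reflect how PALEOMIX implements it.

.. code-block:: python

    # Read type -> duplicate-filtering method, as described in the text above
    DUPLICATE_FILTER = {
        "collapsed": "collapsed",  # both external coordinates are informative
        "collapsedtruncated": "normal",  # at least one coordinate is unreliable
        "paired": "normal",
        "single": "normal",
    }


    def group_by_duplicate_filter(read_types):
        """Group read types by the method used to filter their duplicates."""
        groups = {}
        for read_type in read_types:
            method = DUPLICATE_FILTER[read_type]
            groups.setdefault(method, []).append(read_type)
        return groups

For a paired-end lane yielding collapsed, collapsedtruncated, and paired reads, only the collapsed reads end up in the "collapsed" group, while the other two types are filtered together as normal reads.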

If rescaling is enabled, a set of files is created for each library, containing the BAM file generated using the mapDamage2.0 quality rescaling functionality, but otherwise corresponding to the files described above:

.. code-block:: bash

    ExampleProject/
      rCRS/
        Synthetic_Sample_1/
          ACGATA.rescaled.bam
          ACGATA.rescaled.bam.bai
          ACGATA.rescaled.coverage
          ACGATA.rescaled.validated

Finally, the resulting BAMs for each library (rescaled or not) are merged and validated. This results in the creation of the following files in the target folder:

.. code-block:: bash

    ExampleProject/
      rCRS.validated  # Signifies that the final BAM has been validated
      rCRS.duplications_checked  # Similar to above, but catches duplicates across samples / libraries


.. _mapDamage2.0 documentation: http://ginolhac.github.io/mapDamage/\#a7
.. _preseq: http://smithlabresearch.org/software/preseq/