6 Inputs

The only input to the first module is a file called samples.manifest. This user-created file associates each FASTQ file with a path and ID, and allows the pipeline to automatically merge files if necessary. After creating the samples.manifest file, one must add the --input argument to the approriate execution script for the first module, and its value should be the path to the directory containing samples.manifest. For example, if the full path is /users/neagles/samples.manifest, the full argument in the execution script should be --input "/users/neagles".

The second module requires the same samples.manifest file and an additional file called rules.txt. The latter specifies where relevant logs and the samples.manifest file are located. Similarly as in the first module, one must add the --input argument to the approriate execution script for the second module, but its value should be the path to the directory containing rules.txt. Please note that running the first module automatically creates a rules.txt file in the same directory as the input samples.manifest, so one can use the same --input argument for both the first and second module when running them in series.

6.1 The samples.manifest File

6.1.1 What samples.manifest should look like

Each line in samples.manifest should have the following format:

  • For a set of unpaired reads <PATH TO FASTQ FILE>(tab)<optional MD5>(tab)<sample label/id>
  • For paired-end sets of reads <PATH TO FASTQ 1>(tab)<optional MD5 1>(tab)<PATH TO FASTQ 2>(tab)<optional MD5 2>(tab)<sample label/id>

A line of paired-end reads could look like this:

WGBS_sample1_read1.fastq 0 WGBS_sample1_read2.fastq 0 sample1

  • The MD5(s) on each line are for compatibility with a conventional samples.manifest structure, and are not explicitly checked in the pipeline (you may simply use 0s as in the above example).
  • Paths must be long/full.
  • If you have a single sample split across multiple files, you can signal for the pipeline to merge these files by repeating the sample label/id on each line of files to merge.
  • A samples.manifest file cannot include both single-end and paired-end reads; separate pipeline runs should be performed for each of these read types.

This is an example of a samples.manifest file for some paired-end samples. Note how the first sample “dm3” is split across more than one pair of files, and is to be merged:

/scratch/dm3_file1_1.fastq  0   /scratch/dm3_file1_2.fastq  0   dm3
/scratch/dm3_file2_1.fastq  0   /scratch/dm3_file2_2.fastq  0   dm3
/scratch/sample_01_1.fastq.gz   0   /scratch/sample_01_2.fastq.gz   0   sample_01
/scratch/sample_02_1.fastq.gz   0   /scratch/sample_02_2.fastq.gz   0   sample_02

6.1.2 More details regarding inputs

  • Input FASTQ files can have the following file extensions: .fastq, .fq, .fastq.gz, .fq.gz. All FASTQ files associated with the same sample ID must use the same extenstion.
  • FASTQ files must not contain “.” characters before the typical extension (e.g. sample.1.fastq), since some internal functions rely on splitting file names by “.”.

6.1.3 Creating a manifest file

In a common scenario, you may have a large number of FASTQ files in a single directory, for a given experiment. How can the samples.manifest file be constructed in this case? While the method you use is a matter of preference, we find it straightforward to write a small R script to generate the manifest.

Suppose we have 3 paired-end samples, consisting of a total of 6 FASTQ files:

/data/fastq/SAMPLE1_L001_R1_001.fastq.gz
/data/fastq/SAMPLE1_L001_R2_001.fastq.gz
/data/fastq/SAMPLE2_L002_R1_001.fastq.gz
/data/fastq/SAMPLE2_L002_R2_001.fastq.gz
/data/fastq/SAMPLE3_L003_R1_001.fastq.gz
/data/fastq/SAMPLE3_L003_R2_001.fastq.gz

The following script can generate the manifest appropriate for this experiment:

#  If needed, install the 'jaffelab' GitHub-based package, which includes a
#  useful function for string manipulation
remotes::install_github("LieberInstitute/jaffelab")

library("jaffelab")

fastq_dir <- "/data/fastq"

#  We can take advantage of the uniform file naming convention to get the paths
#  of each mate in the pair, for every sample. Here we use a somewhat
#  complicated regular expression to match file names (to be sure we are
#  matching precisely the files we think we're matching), but this can be kept
#  simple if preferred.
r1 <- list.files(fastq_dir, ".*_L00._R1_001\\.fastq\\.gz", full.names = TRUE)
r2 <- list.files(fastq_dir, ".*_L00._R2_001\\.fastq\\.gz", full.names = TRUE)

#  We can form a unique ID for each sample by taking the portion of the path to
#  the first read preceding the lane and mate identifiers. The function 'ss' is
#  a vectorized form of 'strsplit', handy for this task
ids <- ss(basename(r1), "_L00")

#  Sanity check: there should be the same number of first reads as second reads
stopifnot(length(R1) == length(R2))

#  Prepare the existing sample information into the expected format (for now,
#  as a character vector where each element will be a line in
#  'samples.manifest'). We will simply use zeros for the optional MD5 sums.
manifest <- paste(r1, 0, r2, 0, ids, sep = "\t")

#  Write the manifest to a file (in this case, in the current working
#  directory)
writeLines(manifest, con = "samples.manifest")

6.2 The rules.txt File

This input file to the second module is automatically produced when running the first module, and placed in the same directory as the input samples.manifest. The below documentation exists primarily to describe how to manually produce rules.txt if the first and second modules are run on different systems, or if the steps corresponding to the first module are performed outside of BiocMAP.

6.2.1 What rules.txt should look like

# An example 'rules.txt' file
manifest = /users/nick/samples.manifest
sam = /users/nick/Arioc/[id]/sams/[id].cfu.sam
arioc_log = /users/nick/Arioc/[id]/log/AriocP.[id].log
xmc_log = /users/nick/Arioc/[id]/log/[id].cfu.XMC.log
trim_report = /users/nick/trim_galore/[id]/[id].fastq.gz_trimming_report.txt
fastqc_log_last = /users/nick/FastQC/[id]_trimmed_summary.txt
fastqc_log_first = /users/nick/FastQC/[id]_untrimmed_summary.txt
  • This file consists of several lines of key-value pairs, where keys and values are separated by an equals sign (“=”) and optionally spaces.
  • Lines without “=” are ignored, and can be used as comments. The above example starts such lines with “#” for clarity.
  • The required keys to include in a valid rules.txt file include “manifest”, “sam”, “arioc_log”, and “trim_report”. The associated values for these keys are the paths to the samples.manifest file, the filtered/deduplicated alignment SAMs, verbose output logs from Arioc alignment, and output logs from TrimGalore!, respectively. Note that users who plan to use MethylDackel for methylation extraction (the default!) should have coordinate-sorted and indexed BAM files ready, which are passed to the “sam” key using a glob expression. Otherwise, just one SAM file is expected per sample. Note this example for the former case:
# Suppose we have 4 files across 2 samples:
#    /users/nick/alignments/id1_sorted.bam and /users/nick/alignments/id1_sorted.bam.bai
#    /users/nick/alignments/id2_sorted.bam and /users/nick/alignments/id2_sorted.bam.bai
# We can point to these files using the 'sam' key in 'rules.txt':
sam = /users/nick/alignments/[id]_sorted.bam*
  • Optional keys accepted in a rules.txt file include “xmc_log”, “fastqc_log_first”, and “fastqc_log_last”. Associated values are paths to logs from XMC (a utility that comes with the Arioc software), *summary.txt logs from FastQC, and logs from the latest or only run of FastQC (see the below point about optional keys), respectively. The optional “fastqc_log_first” key can be used when the user runs FastQC twice (for example, before and after trimming) for at least one sample. In this case, “fastqc_log_first” refers to the pre-trimming run of FastQC. Typically, methylation extraction is done in the second module, and the “xmc_log” option is only included for compatibility with a previous WGBS processing approach. Information from these logs will be included in the R data frame output from the second module, if these logs are present and specified in rules.txt.
  • Since exact paths are different between samples, including “[id]” in the value field in a line of rules.txt indicates that the path for any particular sample can be found by replacing “[id]” with its sample ID. Otherwise, paths are interpreted literally.
  • For paired-end experiments, there will be two summary logs from FastQC. To correctly describe this in rules.txt, a glob expression may be used. As an example, suppose we have FastQC summary logs for a sample called sample1, given by the paths /some_dir/fastqc/sample1_1_summary.txt and /some_dir/fastqc/sample1_2_summary.txt. The following line in rules.txt would be appropriate:
fastqc_log_last = /some_dir/fastqc/[id]_[12]_summary.txt

Be careful with forming glob expressions! If we had a 10-sample experiment with sample names sample1 through sample10 but used a simpler asterisk “*” in place of the alternation “_[12]” in the above example, we would get an error!

#  This is wrong! When finding files for "sample1", we'd also match files for
#  "sample10"!
fastqc_log_last = /some_dir/fastqc/[id]*_summary.txt
  • The --input [dir] argument to the nextflow command in run_second_half_*.sh scripts specifies the directory containing the rules.txt file, and is a required argument.