6 Manifest and Inputs

Inputs to SPEAQeasy are specified by a single file named samples.manifest. The samples.manifest file associates each FASTQ file with a path and ID, and allows the pipeline to automatically merge files if necessary.

6.1 What the Manifest Should Look Like

Each line in samples.manifest should have the following format:

For a set of unpaired reads: <PATH TO FASTQ FILE>(tab)<optional MD5>(tab)<sample label/id>
For paired-end sets of reads: <PATH TO FASTQ 1>(tab)<optional MD5 1>(tab)<PATH TO FASTQ 2>(tab)<optional MD5 2>(tab)<sample label/id>

A line of paired-end reads could look like this:

RNA_sample1_read1.fastq 0 RNA_sample1_read2.fastq 0 sample1

The MD5(s) on each line are for compatibility with a conventional samples.manifest structure, and are not explicitly checked in the pipeline (you may simply use 0s as in the above example).
Paths must be full (long) paths rather than paths relative to the working directory.
If you have a single sample split across multiple files, you can signal for the pipeline to merge these files by repeating the sample label/id on each line of files to merge.
A samples.manifest file cannot include both single-end and paired-end reads; separate pipeline runs should be performed for each of these read types.

This is an example of a samples.manifest file for some paired-end samples. Note how the first sample “dm3” is split across more than one pair of files, and is to be merged:

/scratch/dm3_file1_1.fastq  0   /scratch/dm3_file1_2.fastq  0   dm3
/scratch/dm3_file2_1.fastq  0   /scratch/dm3_file2_2.fastq  0   dm3
/scratch/sample_01_1.fastq.gz   0   /scratch/sample_01_2.fastq.gz   0   sample_01
/scratch/sample_02_1.fastq.gz   0   /scratch/sample_02_2.fastq.gz   0   sample_02
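
The format rules above can be checked before launching the pipeline. The following is a minimal sketch in base R, not part of SPEAQeasy: the function name `summarize_manifest` is hypothetical, and it assumes that "full" paths are absolute Unix-style paths beginning with "/".

```r
#  Hypothetical helper (not part of SPEAQeasy): check the structure of
#  manifest lines and count how many lines belong to each sample ID
summarize_manifest <- function(lines) {
    fields <- strsplit(lines, "\t", fixed = TRUE)
    n_fields <- lengths(fields)

    #  Unpaired lines have 3 tab-separated fields; paired-end lines have 5.
    #  Mixing the two in one manifest is not allowed
    stopifnot(all(n_fields %in% c(3, 5)), length(unique(n_fields)) == 1)
    paired <- n_fields[1] == 5

    #  Assumption: "full" paths are absolute Unix-style paths
    path_cols <- if (paired) c(1, 3) else 1
    paths <- unlist(lapply(fields, function(f) f[path_cols]))
    stopifnot(all(startsWith(paths, "/")))

    #  The sample ID is the last field; IDs appearing on more than one line
    #  mark samples whose files will be merged
    ids <- vapply(fields, function(f) f[length(f)], character(1))
    list(paired = paired, lines_per_sample = lengths(split(ids, ids)))
}

#  The example manifest from above, written with explicit tab characters
manifest_lines <- c(
    "/scratch/dm3_file1_1.fastq\t0\t/scratch/dm3_file1_2.fastq\t0\tdm3",
    "/scratch/dm3_file2_1.fastq\t0\t/scratch/dm3_file2_2.fastq\t0\tdm3",
    "/scratch/sample_01_1.fastq.gz\t0\t/scratch/sample_01_2.fastq.gz\t0\tsample_01",
    "/scratch/sample_02_1.fastq.gz\t0\t/scratch/sample_02_2.fastq.gz\t0\tsample_02"
)
info <- summarize_manifest(manifest_lines)
```

For this example, `info$paired` is TRUE and `info$lines_per_sample` reports two lines for dm3 (the sample to be merged) and one line each for sample_01 and sample_02. For a real file, the lines could be obtained with `readLines("samples.manifest")`.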

6.1.1 More details regarding inputs

Input FASTQ files can have the following file extensions: .fastq, .fq, .fastq.gz, .fq.gz. All FASTQ files associated with the same sample ID must use the same extension.
FASTQ file names must not contain “.” characters before the typical extension (e.g. sample.1.fastq is not allowed), since some internal functions rely on splitting file names by “.”.
Base filenames must be distinct (e.g. including both /dir/one/name.fastq and /dir/two/name.fastq at different points in the manifest is not supported).
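
These naming constraints can be screened for in advance. The sketch below uses base R only; the function name `find_name_problems` and the wording of its messages are hypothetical, and it is not how SPEAQeasy itself validates inputs.

```r
#  Hypothetical helper: report which of the naming rules above are violated
#  by a set of FASTQ paths and their sample IDs
find_name_problems <- function(paths, ids) {
    base <- basename(paths)

    #  Accepted extensions: .fastq, .fq, .fastq.gz, .fq.gz
    ext_regex <- "\\.(fastq|fq)(\\.gz)?$"
    stem <- sub(ext_regex, "", base)
    ext <- substring(base, nchar(stem) + 1)

    problems <- character(0)
    if (any(ext == "")) {
        problems <- c(problems, "unsupported extension")
    }
    #  No "." may remain once the extension is removed
    if (any(grepl(".", stem, fixed = TRUE))) {
        problems <- c(problems, "extra '.' before extension")
    }
    #  Base file names must be distinct, regardless of directory
    if (anyDuplicated(base) > 0) {
        problems <- c(problems, "duplicated base filename")
    }
    #  All files for one sample ID must share a single extension
    if (any(tapply(ext, ids, function(e) length(unique(e)) > 1))) {
        problems <- c(problems, "mixed extensions within a sample")
    }
    problems
}

ok <- find_name_problems(
    c("/scratch/dm3_file1_1.fastq", "/scratch/dm3_file2_1.fastq"),
    c("dm3", "dm3")
)
bad <- find_name_problems(
    c("/dir/one/name.fastq", "/dir/two/name.fastq", "/dir/one/sample.1.fq.gz"),
    c("a", "b", "c")
)
```

An empty result, as for `ok`, means none of the four checks fired. The `bad` call trips the extra-dot and duplicated-filename checks.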

6.2 Creating a manifest file

In a common scenario, you may have a large number of FASTQ files in a single directory, for a given experiment. How can the samples.manifest file be constructed in this case? While the method you use is a matter of preference, we find it straightforward to write a small R script to generate the manifest.

Suppose we have 3 paired-end samples, consisting of a total of 6 FASTQ files:

/data/fastq/SAMPLE1_L001_R1_001.fastq.gz
/data/fastq/SAMPLE1_L001_R2_001.fastq.gz
/data/fastq/SAMPLE2_L002_R1_001.fastq.gz
/data/fastq/SAMPLE2_L002_R2_001.fastq.gz
/data/fastq/SAMPLE3_L003_R1_001.fastq.gz
/data/fastq/SAMPLE3_L003_R2_001.fastq.gz

The following script can generate the manifest appropriate for this experiment:

#  If needed, install the 'jaffelab' GitHub-based package, which includes a
#  useful function for string manipulation (this requires the 'remotes'
#  package)
if (!requireNamespace("jaffelab", quietly = TRUE)) {
    remotes::install_github("LieberInstitute/jaffelab")
}

library("jaffelab")

fastq_dir <- "/data/fastq"

#  We can take advantage of the uniform file naming convention to get the paths
#  of each mate in the pair, for every sample. Here we use a somewhat
#  complicated regular expression to match file names (to be sure we are
#  matching precisely the files we think we're matching), but this can be kept
#  simple if preferred.
r1 <- list.files(fastq_dir, ".*_L00[0-9]_R1_001\\.fastq\\.gz$", full.names = TRUE)
r2 <- list.files(fastq_dir, ".*_L00[0-9]_R2_001\\.fastq\\.gz$", full.names = TRUE)

#  We can form a unique ID for each sample by taking the portion of the path to
#  the first read preceding the lane and mate identifiers. The function 'ss' is
#  a vectorized form of 'strsplit', handy for this task
ids <- ss(basename(r1), "_L00")

#  Sanity check: there should be the same number of first reads as second reads
stopifnot(length(r1) == length(r2))

#  Prepare the existing sample information into the format expected by
#  SPEAQeasy (for now, as a character vector where each element will be a line
#  in 'samples.manifest'). We will simply use zeros for the optional MD5 sums.
manifest <- paste(r1, 0, r2, 0, ids, sep = "\t")

#  Write the manifest to a file (in this case, in the current working
#  directory)
writeLines(manifest, con = "samples.manifest")
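
If installing 'jaffelab' is not convenient, the same IDs can be derived with base R. The sketch below is an alternative under stated assumptions, not the method used above: it hard-codes the six example paths rather than calling list.files, and it adds a stricter check that each first-read file is paired with the matching second-read file.

```r
#  The example paths from above, written out so the sketch does not depend
#  on files existing on disk
r1 <- c(
    "/data/fastq/SAMPLE1_L001_R1_001.fastq.gz",
    "/data/fastq/SAMPLE2_L002_R1_001.fastq.gz",
    "/data/fastq/SAMPLE3_L003_R1_001.fastq.gz"
)
r2 <- c(
    "/data/fastq/SAMPLE1_L001_R2_001.fastq.gz",
    "/data/fastq/SAMPLE2_L002_R2_001.fastq.gz",
    "/data/fastq/SAMPLE3_L003_R2_001.fastq.gz"
)

#  Stricter sanity check: swapping the mate identifier in each first-read
#  path should give exactly the corresponding second-read path
stopifnot(identical(sub("_R1_001", "_R2_001", r1, fixed = TRUE), r2))

#  Base-R alternative to 'ss': drop everything from the lane identifier
#  onward to obtain the sample ID
ids <- sub("_L00.*$", "", basename(r1))

#  Same manifest layout as before, with zeros for the optional MD5 sums
manifest <- paste(r1, 0, r2, 0, ids, sep = "\t")
```

With these inputs, `ids` is "SAMPLE1", "SAMPLE2", "SAMPLE3". If one sample were sequenced across several lanes, its ID would appear on several lines, which is how the manifest signals that those files should be merged.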