6 Manifest and Inputs

Inputs to SPEAQeasy are specified by a single file named samples.manifest. The samples.manifest file associates each FASTQ file with a path and ID, and allows the pipeline to automatically merge files if necessary.

6.1 What the Manifest Should Look Like

Each line in samples.manifest should have the following format:

  • For a set of unpaired reads: <PATH TO FASTQ FILE>(tab)<optional MD5>(tab)<sample label/id>
  • For paired-end sets of reads: <PATH TO FASTQ 1>(tab)<optional MD5 1>(tab)<PATH TO FASTQ 2>(tab)<optional MD5 2>(tab)<sample label/id>

A line of paired-end reads could look like this:

RNA_sample1_read1.fastq 0 RNA_sample1_read2.fastq 0 sample1

  • The MD5(s) on each line are for compatibility with a conventional samples.manifest structure, and are not explicitly checked in the pipeline (you may simply use 0s as in the above example).
  • Paths must be full (absolute) paths.
  • If you have a single sample split across multiple files, you can tell the pipeline to merge these files by using the same sample label/id on every line that lists files to be merged.
  • A samples.manifest file cannot include both single-end and paired-end reads; separate pipeline runs should be performed for each of these read types.

This is an example of a samples.manifest file for some paired-end samples. Note how the first sample “dm3” is split across more than one pair of files, and is to be merged:

/scratch/dm3_file1_1.fastq  0   /scratch/dm3_file1_2.fastq  0   dm3
/scratch/dm3_file2_1.fastq  0   /scratch/dm3_file2_2.fastq  0   dm3
/scratch/sample_01_1.fastq.gz   0   /scratch/sample_01_2.fastq.gz   0   sample_01
/scratch/sample_02_1.fastq.gz   0   /scratch/sample_02_2.fastq.gz   0   sample_02
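
The format rules above can be checked before starting a run. The following is a minimal sketch in base R, not part of SPEAQeasy: the helper name check_manifest_lines is hypothetical, and it assumes that each line is tab-separated with exactly 3 fields (single-end) or 5 fields (paired-end).

```r
#  Hypothetical helper (not part of SPEAQeasy): check the number of
#  tab-separated fields on each manifest line and report which sample IDs
#  are repeated (and would therefore be merged)
check_manifest_lines <- function(lines) {
    if (length(lines) == 0) {
        stop("The manifest is empty")
    }

    fields <- strsplit(lines, "\t", fixed = TRUE)
    n_fields <- lengths(fields)

    #  3 fields = single-end line, 5 fields = paired-end line
    if (!all(n_fields %in% c(3, 5))) {
        stop("Each line must have 3 (single-end) or 5 (paired-end) fields")
    }
    if (length(unique(n_fields)) != 1) {
        stop("Single-end and paired-end lines cannot be mixed")
    }

    #  The sample ID is always the last field on a line
    ids <- vapply(fields, function(x) x[length(x)], character(1))

    list(
        paired = n_fields[1] == 5,
        #  IDs that appear on more than one line mark files to be merged
        merged_ids = unique(ids[duplicated(ids)])
    )
}

#  The example manifest from above, with explicit tab characters
example <- c(
    "/scratch/dm3_file1_1.fastq\t0\t/scratch/dm3_file1_2.fastq\t0\tdm3",
    "/scratch/dm3_file2_1.fastq\t0\t/scratch/dm3_file2_2.fastq\t0\tdm3",
    "/scratch/sample_01_1.fastq.gz\t0\t/scratch/sample_01_2.fastq.gz\t0\tsample_01",
    "/scratch/sample_02_1.fastq.gz\t0\t/scratch/sample_02_2.fastq.gz\t0\tsample_02"
)
result <- check_manifest_lines(example)
```

For a manifest on disk, the same function could be called as check_manifest_lines(readLines("samples.manifest")).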

6.1.1 More details regarding inputs

  • Input FASTQ files can have the following file extensions: .fastq, .fq, .fastq.gz, .fq.gz. All FASTQ files associated with the same sample ID must use the same extension.
  • FASTQ file names must not contain “.” characters before the typical extension (e.g. sample.1.fastq is not allowed), since some internal functions rely on splitting file names by “.”.
  • Base filenames must be distinct (e.g. including both /dir/one/name.fastq and /dir/two/name.fastq at different points in the manifest is not supported).
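
These naming rules can also be checked ahead of time. The sketch below is an illustration in base R rather than code from the pipeline: the function name check_fastq_names and the wording of its messages are hypothetical. It takes one path per FASTQ file together with the sample ID that file belongs to, and returns a character vector describing any violations (empty if none are found).

```r
#  Hypothetical helper (not part of SPEAQeasy): report violations of the
#  file naming rules listed above
check_fastq_names <- function(paths, ids) {
    stopifnot(length(paths) == length(ids))

    #  Matches the four supported extensions
    ext_regex <- "\\.(fastq|fq)(\\.gz)?$"
    base <- basename(paths)

    #  Rule 1: only the supported extensions are allowed
    bad_ext <- !grepl(ext_regex, base)

    #  Rule 2: no "." in the file name once the extension is removed
    stem <- sub(ext_regex, "", base)
    has_dot <- !bad_ext & grepl(".", stem, fixed = TRUE)

    #  Rule 3: base filenames must be distinct across the whole manifest
    dup <- unique(base[duplicated(base)])

    #  Rule 4: all files for a given sample ID share one extension
    ext <- substring(base, nchar(stem) + 1)
    n_ext <- vapply(
        split(ext, ids), function(x) length(unique(x)), integer(1)
    )
    mixed <- names(n_ext)[n_ext > 1]

    c(
        sprintf("Unsupported extension: %s", base[bad_ext]),
        sprintf("'.' before the extension: %s", base[has_dot]),
        sprintf("Duplicate base filename: %s", dup),
        sprintf("Mixed extensions for sample ID: %s", mixed)
    )
}

#  Made-up inputs that break each rule once
problems <- check_fastq_names(
    paths = c(
        "/dir/one/name.fastq", "/dir/two/name.fastq", "/data/sample.1.fastq",
        "/data/a_1.fq.gz", "/data/a_2.fastq.gz", "/data/b.txt"
    ),
    ids = c("x", "y", "z", "a", "a", "b")
)
print(problems)
```

For paired-end data, both mates of every pair should be passed in, with each sample ID repeated to match.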

6.2 Creating a manifest file

In a common scenario, you may have a large number of FASTQ files in a single directory, for a given experiment. How can the samples.manifest file be constructed in this case? While the method you use is a matter of preference, we find it straightforward to write a small R script to generate the manifest.

Suppose we have 3 paired-end samples, consisting of a total of 6 FASTQ files:

/data/fastq/SAMPLE1_L001_R1_001.fastq.gz
/data/fastq/SAMPLE1_L001_R2_001.fastq.gz
/data/fastq/SAMPLE2_L002_R1_001.fastq.gz
/data/fastq/SAMPLE2_L002_R2_001.fastq.gz
/data/fastq/SAMPLE3_L003_R1_001.fastq.gz
/data/fastq/SAMPLE3_L003_R2_001.fastq.gz

The following script can generate the manifest appropriate for this experiment:

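#  If needed, install the 'jaffelab' GitHub-based package, which includes a
#  useful function for string manipulation (this step requires the 'remotes'
#  package)
if (!requireNamespace("jaffelab", quietly = TRUE)) {
    remotes::install_github("LieberInstitute/jaffelab")
}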

library("jaffelab")

fastq_dir <- "/data/fastq"

#  We can take advantage of the uniform file naming convention to get the paths
#  of each mate in the pair, for every sample. Here we use a somewhat
#  complicated regular expression to match file names (to be sure we are
#  matching precisely the files we think we're matching), but this can be kept
#  simple if preferred.
r1 <- list.files(fastq_dir, ".*_L00._R1_001\\.fastq\\.gz", full.names = TRUE)
r2 <- list.files(fastq_dir, ".*_L00._R2_001\\.fastq\\.gz", full.names = TRUE)

#  We can form a unique ID for each sample by taking the portion of the path to
#  the first read preceding the lane and mate identifiers. The function 'ss' is
#  a vectorized form of 'strsplit', handy for this task
ids <- ss(basename(r1), "_L00")

#  Sanity check: there should be the same number of first reads as second reads
stopifnot(length(r1) == length(r2))

#  Prepare the existing sample information into the format expected by
#  SPEAQeasy (for now, as a character vector where each element will be a line
#  in 'samples.manifest'). We will simply use zeros for the optional MD5 sums.
manifest <- paste(r1, 0, r2, 0, ids, sep = "\t")

#  Write the manifest to a file (in this case, in the current working
#  directory)
writeLines(manifest, con = "samples.manifest")
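
If installing 'jaffelab' is not convenient, the sample IDs can probably be derived with base R alone. The sketch below is an alternative under stated assumptions, not the method used above: it hard-codes the six example paths instead of calling list.files, replaces 'ss' with 'sub', and adds a check (our own addition) that each second-read path is the first-read path with "_R1_001" swapped for "_R2_001".

```r
#  The example files from above, hard-coded so the sketch runs anywhere
r1 <- c(
    "/data/fastq/SAMPLE1_L001_R1_001.fastq.gz",
    "/data/fastq/SAMPLE2_L002_R1_001.fastq.gz",
    "/data/fastq/SAMPLE3_L003_R1_001.fastq.gz"
)
r2 <- c(
    "/data/fastq/SAMPLE1_L001_R2_001.fastq.gz",
    "/data/fastq/SAMPLE2_L002_R2_001.fastq.gz",
    "/data/fastq/SAMPLE3_L003_R2_001.fastq.gz"
)

#  Drop everything from the lane identifier onward to form the sample ID
ids <- sub("_L00.*$", "", basename(r1))

#  Sanity check: the two mates of a pair should differ only in the read
#  identifier, so that 'r1' and 'r2' line up element by element
stopifnot(identical(sub("_R1_001", "_R2_001", r1, fixed = TRUE), r2))

#  Same manifest line format as before, with zeros for the MD5 sums
manifest <- paste(r1, 0, r2, 0, ids, sep = "\t")
```

The resulting 'manifest' vector can be written with writeLines exactly as in the script above.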