6 Manifest and Inputs
Inputs to SPEAQeasy are specified by a single file named samples.manifest
. The samples.manifest
file associates each FASTQ file with a path and ID, and allows the pipeline to automatically merge files if necessary.
6.1 What the Manifest Should Look Like
Each line in samples.manifest
should have the following format:
- For a set of unpaired reads
<PATH TO FASTQ FILE>(tab)<optional MD5>(tab)<sample label/id>
- For paired-end sets of reads
<PATH TO FASTQ 1>(tab)<optional MD5 1>(tab)<PATH TO FASTQ 2>(tab)<optional MD5 2>(tab)<sample label/id>
A line of paired-end reads could look like this:
RNA_sample1_read1.fastq 0 RNA_sample1_read2.fastq 0 sample1
- The MD5(s) on each line are for compatibility with a conventional samples.manifest structure, and are not explicitly checked in the pipeline (you may simply use 0s as in the above example).
- Paths must be long/full.
- If you have a single sample split across multiple files, you can signal for the pipeline to merge these files by repeating the sample label/id on each line of files to merge.
- A
samples.manifest
file cannot include both single-end and paired-end reads; separate pipeline runs should be performed for each of these read types.
This is an example of a samples.manifest
file for some paired-end samples. Note how the first sample “dm3” is split across more than one pair of files, and is to be merged:
/scratch/dm3_file1_1.fastq 0 /scratch/dm3_file1_2.fastq 0 dm3
/scratch/dm3_file2_1.fastq 0 /scratch/dm3_file2_2.fastq 0 dm3
/scratch/sample_01_1.fastq.gz 0 /scratch/sample_01_2.fastq.gz 0 sample_01
/scratch/sample_02_1.fastq.gz 0 /scratch/sample_02_2.fastq.gz 0 sample_02
6.1.1 More details regarding inputs
- Input FASTQ files can have the following file extensions:
.fastq
,.fq
,.fastq.gz
,.fq.gz
. All FASTQ files associated with the same sample ID must use the same extenstion. - FASTQ files must not contain “.” characters before the typical extension (e.g. sample.1.fastq), since some internal functions rely on splitting file names by “.”.
- Base filenames must be distinct (e.g. including both
/dir/one/name.fastq
and/dir/two/name.fastq
at different points in the manifest is not supported)
6.2 Creating a manifest file
In a common scenario, you may have a large number of FASTQ files in a single directory, for a given experiment. How can the samples.manifest
file be constructed in this case? While the method you use is a matter of preference, we find it straightforward to write a small R script to generate the manifest.
Suppose we have 3 paired-end samples, consisting of a total of 6 FASTQ files:
/data/fastq/SAMPLE1_L001_R1_001.fastq.gz
/data/fastq/SAMPLE1_L001_R2_001.fastq.gz
/data/fastq/SAMPLE2_L002_R1_001.fastq.gz
/data/fastq/SAMPLE2_L002_R2_001.fastq.gz
/data/fastq/SAMPLE3_L003_R1_001.fastq.gz
/data/fastq/SAMPLE3_L003_R2_001.fastq.gz
The following script can generate the manifest appropriate for this experiment:
# If needed, install the 'jaffelab' GitHub-based package, which includes a
# useful function for string manipulation
remotes::install_github("LieberInstitute/jaffelab")
library("jaffelab")
fastq_dir <- "/data/fastq"
# We can take advantage of the uniform file naming convention to get the paths
# of each mate in the pair, for every sample. Here we use a somewhat
# complicated regular expression to match file names (to be sure we are
# matching precisely the files we think we're matching), but this can be kept
# simple if preferred.
r1 <- list.files(fastq_dir, ".*_L00._R1_001\\.fastq\\.gz", full.names = TRUE)
r2 <- list.files(fastq_dir, ".*_L00._R2_001\\.fastq\\.gz", full.names = TRUE)
# We can form a unique ID for each sample by taking the portion of the path to
# the first read preceding the lane and mate identifiers. The function 'ss' is
# a vectorized form of 'strsplit', handy for this task
ids <- ss(basename(r1), "_L00")
# Sanity check: there should be the same number of first reads as second reads
stopifnot(length(R1) == length(R2))
# Prepare the existing sample information into the format expected by
# SPEAQeasy (for now, as a character vector where each element will be a line
# in 'samples.manifest'). We will simply use zeros for the optional MD5 sums.
manifest <- paste(r1, 0, r2, 0, ids, sep = "\t")
# Write the manifest to a file (in this case, in the current working
# directory)
writeLines(manifest, con = "samples.manifest")