Inputs to SPEAQeasy are specified by a single file named
samples.manifest file associates each FASTQ file with a path and ID, and allows the pipeline to automatically merge files if necessary.
Each line in
samples.manifest should have the following format:
- For a set of unpaired reads
<PATH TO FASTQ FILE>(tab)<optional MD5>(tab)<sample label/id>
- For paired-end sets of reads
<PATH TO FASTQ 1>(tab)<optional MD5 1>(tab)<PATH TO FASTQ 2>(tab)<optional MD5 2>(tab)<sample label/id>
A line of paired-end reads could look like this:
RNA_sample1_read1.fastq 0 RNA_sample1_read2.fastq 0 sample1
- The MD5(s) on each line are for compatibility with a conventional samples.manifest structure, and are not explicitly checked in the pipeline (you may simply use 0s as in the above example).
- Paths must be long/full.
- If you have a single sample split across multiple files, you can signal for the pipeline to merge these files by repeating the sample label/id on each line of files to merge.
samples.manifestfile cannot include both single-end and paired-end reads; separate pipeline runs should be performed for each of these read types.
This is an example of a
samples.manifest file for some paired-end samples. Note how the first sample “dm3” is split across more than one pair of files, and is to be merged:
- Input FASTQ files can have the following file extensions:
.fq.gz. All FASTQ files associated with the same sample ID must use the same extenstion.
- FASTQ files must not contain “.” characters before the typical extension (e.g. sample.1.fastq), since some internal functions rely on splitting file names by “.”.
- Base filenames must be distinct (e.g. including both
/dir/two/name.fastqat different points in the manifest is not supported)
In a common scenario, you may have a large number of FASTQ files in a single directory, for a given experiment. How can the
samples.manifest file be constructed in this case? While the method you use is a matter of preference, we find it straightforward to write a small R script to generate the manifest.
Suppose we have 3 paired-end samples, consisting of a total of 6 FASTQ files:
/data/fastq/SAMPLE1_L001_R1_001.fastq.gz /data/fastq/SAMPLE1_L001_R2_001.fastq.gz /data/fastq/SAMPLE2_L002_R1_001.fastq.gz /data/fastq/SAMPLE2_L002_R2_001.fastq.gz /data/fastq/SAMPLE3_L003_R1_001.fastq.gz /data/fastq/SAMPLE3_L003_R2_001.fastq.gz
The following script can generate the manifest appropriate for this experiment:
# If needed, install the 'jaffelab' GitHub-based package, which includes a # useful function for string manipulation remotes::install_github("LieberInstitute/jaffelab") library("jaffelab") fastq_dir <- "/data/fastq" # We can take advantage of the uniform file naming convention to get the paths # of each mate in the pair, for every sample. Here we use a somewhat # complicated regular expression to match file names (to be sure we are # matching precisely the files we think we're matching), but this can be kept # simple if preferred. r1 <- list.files(fastq_dir, ".*_L00._R1_001\\.fastq\\.gz", full.names = TRUE) r2 <- list.files(fastq_dir, ".*_L00._R2_001\\.fastq\\.gz", full.names = TRUE) # We can form a unique ID for each sample by taking the portion of the path to # the first read preceding the lane and mate identifiers. The function 'ss' is # a vectorized form of 'strsplit', handy for this task ids <- ss(basename(r1), "_L00") # Sanity check: there should be the same number of first reads as second reads stopifnot(length(R1) == length(R2)) # Prepare the existing sample information into the format expected by # SPEAQeasy (for now, as a character vector where each element will be a line # in 'samples.manifest'). We will simply use zeros for the optional MD5 sums. manifest <- paste(r1, 0, r2, 0, ids, sep = "\t") # Write the manifest to a file (in this case, in the current working # directory) writeLines(manifest, con = "samples.manifest")