SPEAQeasy is designed to be highly customizable, yet need no configuration from a user wishing to rely on sensible default settings. Most configuration, including software settings, hardware resources such as memory and CPU use, and more, can be done in a single file determined here. Please note that the common, “major” options are specified in your main script, and not in the configuration file. The config is intended to hold the “finer details”, which are documented below.
In many cases, a user has access to a computing cluster on which they intend to run SPEAQeasy. If your cluster is SLURM or SGE-based, the pipeline is pre-configured with options you may be used to specifying (such as disk usage, time for a job to run, etc). However, these are straightforward to modify, should the need arise. Common settings are described in detail below; a more comprehensive list of settings from Nextflow can be found here.
The maximum allowed run time for a process, or step in the pipeline, can be specified. This may be necessary for users who are charged based on run time for their jobs.
The simplest change you may wish to make is to relax time constraints for all processes. The setting for this is here:
While the syntax is not strict, some examples of properly specifying the option are time = '5m', time = '2h', and time = '1d' (minutes, hours, and days, respectively).
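As a minimal sketch of what a pipeline-wide limit might look like, the fragment below sets the Nextflow time directive inside the process scope of a config file. The value '2h' is purely illustrative, and any other settings already present in your own process section are omitted here.

```groovy
// Illustrative fragment only; other settings in the "process" section are omitted.
process {
    // Applies to every process unless a more specific setting overrides it
    time = '2h'
}
```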
Time restrictions may also be specified for individual workflow steps, using the same syntax. Suppose you wish to request 30 minutes for a given sample to be trimmed (note process names here). In the “process” section of your config, find the section labelled “withName: Trimming” as in the example here:
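A sketch of such a section, assuming the 30-minute request described above, could look as follows. Any other settings your config already lists under this selector (such as CPU or memory requests) are left out here and should be kept as they are.

```groovy
process {
    // Selector matching only the process named "Trimming"
    withName: Trimming {
        // Overrides the pipeline-wide time limit for this step alone
        time = '30m'
    }
}
```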
In some cases, you may find it simpler to directly specify options accepted by your cluster. For example, SGE users with a default limit on the maximum file size they may write might specify the following:
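One way to pass such an option is through the Nextflow clusterOptions directive, which forwards a literal string to the scheduler at job submission. The sketch below assumes an SGE cluster where the h_fsize resource governs the maximum file size; the 100G value is an arbitrary placeholder and should be replaced with whatever your site permits.

```groovy
process {
    // Passed verbatim to the scheduler when each job is submitted.
    // The resource name and value are assumptions about your cluster.
    clusterOptions = '-l h_fsize=100G'
}
```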
As with the time option, this can be specified per-process or for all processes. Any option recognized by your cluster may be used.
A number of variables in your config file exist to control choices about annotation to use, options passed to software tools, and more. These need not necessarily be changed, but allow more precise control over the pipeline if desired. Values for these variables may be changed in the “params” section of your config (near the top). A full descriptive list is provided below.
- gencode_version_human: the GENCODE release to use for default (non-custom) annotation, when “hg38” or “hg19” references are used. A string, such as “32”.
- gencode_version_mouse: the GENCODE release to use for default (non-custom) annotation, when “mm10” reference is used. A string, such as “M23”.
- ensembl_version_rat: the Ensembl version to use for default (non-custom) annotation, when “rat” reference is used. A string, such as “98”. Note that Rnor_6.0 is used for ensembl_version_rat < 105, and mRatBN7.2 is used for releases after and including 105.
- anno_build: controls which sequences are used in analysis with default (non-custom) annotation. “main” indicates only canonical reference sequences; “primary” includes additional scaffolds.
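To illustrate, the annotation variables above might appear in the “params” section roughly as sketched below. The version strings are the example values quoted in the list, not recommendations, and the other entries of a real “params” section are omitted.

```groovy
params {
    // Example values taken from the descriptions above
    gencode_version_human = "32"
    gencode_version_mouse = "M23"
    ensembl_version_rat = "98"    // below 105, so Rnor_6.0 would be used

    // "main": canonical reference sequences only; "primary": adds scaffolds
    anno_build = "main"
}
```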
Command-line parameters can be directly passed to most software tools used in the pipeline, through configuration. Note that SPEAQeasy internally passes certain command-line options to some software tools, in a way that cannot be overridden in the configuration file. This is done for options which control parallelization, file naming, or other functions which must either be fixed or are configurable through other means (e.g. number of threads is implicitly controlled by the cpus variable in the QualityUntrimmed process, rather than directly via the -t argument to fastqc). Open your particular config file determined here for details of which options are fixed vs. which can be specified through configuration. A list of variables which control command-line arguments is as follows:
- bam2wig_args: a string of optional arguments to pass to bam2wig.py from RSeQC, when producing coverage wig files from BAM alignments.
- bcftools_args: a string of optional arguments to pass to bcftools call when calling variants (applicable for human samples only).
- fastqc_args: a string of optional arguments to pass to fastqc when performing quality checking before (and if applicable, after) trimming.
- feat_counts_gene_args and feat_counts_exon_args: strings of optional arguments passed to FeatureCounts when counting genes and exons, respectively.
- hisat2_args: a string of optional arguments to pass to
- kallisto_len_mean and kallisto_len_sd: the mean and standard deviation for fragment length, required by kallisto quant when performing pseudoalignment. It is recommended you directly modify these variables rather than individually modifying kallisto_quant_ercc_single_args. These variables are also used when performing our pseudo-alignment-based approach to inferring strandness with single-end samples.
- kallisto_quant_single_args and kallisto_quant_paired_args: a string of arguments to pass to kallisto quant when performing pseudo-alignment to the reference transcriptome, for single-end and paired-end samples, respectively.
- kallisto_quant_ercc_single_args and kallisto_quant_ercc_paired_args: a string of optional arguments to pass to kallisto quant for performing pseudo-alignment to the list of synthetic ERCC-defined transcripts, for applicable experiments.
- kallisto_index_args: a string of optional arguments to pass to kallisto index when first preparing kallisto for a given reference transcriptome.
- salmon_index_args: a string of optional arguments to pass to salmon index when first preparing salmon for a given reference transcriptome. Note that Kallisto is used instead of Salmon, by default (see the
- salmon_quant_args: a string of optional arguments to pass to salmon quant when quantifying transcript abundances with Salmon.
- samtools_args: a string of optional arguments to pass to samtools mpileup before calling variants on human samples.
- star_args: a string of optional arguments to pass to STAR when aligning samples to the chosen reference genome.
- trim_adapter_args_single, trim_adapter_args_paired: these are passed to the ILLUMINACLIP settings for Trimmomatic (after the adapter) for analysis with single and paired-end reads, respectively. Leave empty (i.e. "") to not perform adapter trimming.
- trim_quality_args: literal string arguments used for quality trimming. Leave empty (i.e. "") to not perform quality trimming. An example value could be "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75".
- regtools_args: a string of optional arguments to pass to regtools junctions extract when quantifying junctions.
- wigToBigWig_args: a string of optional arguments to pass to wigToBigWig when producing coverage BigWig files from their wig counterparts.
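The fragment below sketches how a few of the variables listed above might be set, and how the FastQC thread count would be changed through the cpus setting of the QualityUntrimmed process rather than through fastqc_args. The specific values are illustrative assumptions: "--quiet" is a standard FastQC flag chosen only as an example, the cpus value is arbitrary, and the trimming string is the example value given in the list.

```groovy
params {
    // Example quality-trimming string from the description above
    trim_quality_args = "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75"

    // Empty string: adapter trimming is skipped for single-end reads
    trim_adapter_args_single = ""

    // Extra FastQC arguments; do not set -t here, since threads are fixed
    // internally from the process's cpus setting
    fastqc_args = "--quiet"
}

process {
    withName: QualityUntrimmed {
        // Controls the number of threads FastQC receives (illustrative value)
        cpus = 2
    }
}
```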
A couple of configuration variables control functionality specific to SPEAQeasy (i.e. not external software tools called by SPEAQeasy). These are documented below.
- num_reads_infer_strand: The number of lines in each FASTQ file to subset when determining strandness for a sample.
- wiggletools_batch_size: The maximum number of files to handle simultaneously when producing wig coverage files. We have encountered issues with wiggletools starting too many threads (one is used per sample handled), and so this can be lowered if similar issues occur for other users in large experiments.
- variants_merge_batch_size: The maximum number of files to merge at once when merging variant calls. A sufficiently low value may prevent issues with bcftools merge hitting system limits on number of open file handles with large experiments.
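As a rough sketch, these pipeline-specific variables could be lowered for a large experiment as shown below. All three numbers are hypothetical placeholders rather than the pipeline's defaults; consult your config file for the values it actually ships with.

```groovy
params {
    // Number of FASTQ lines to subset per file. FASTQ records span four
    // lines, so a multiple of 4 keeps records whole (placeholder value).
    num_reads_infer_strand = 2000000

    // Fewer files per batch means fewer wiggletools threads at once
    wiggletools_batch_size = 10

    // Keep well below the system limit on open file handles
    variants_merge_batch_size = 500
}
```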