8 Configuration
SPEAQeasy is designed to be highly customizable, yet need no configuration from a user wishing to rely on sensible default settings. Most configuration, including software settings, hardware resources such as memory and CPU use, and more, can be done in a single file determined here. Please note that the common, “major” options are specified in your main script, and not in the configuration file. The config is intended to hold the “finer details”, which are documented below.
8.1 Specifying Options for your Cluster
In many cases, a user has access to a computing cluster which they intend to run SPEAQeasy on. If your cluster is SLURM or SGE-based, the pipeline is pre-configured with options you may be used to specifying (such as disk usage, time for a job to run, etc). However, these are straightforward to modify, should there be a need/desire. Common settings are described in detail below; however, a more comprehensive list of settings from nextflow can be found here.
8.1.1 Time
The maximum allowed run time for a process, or step in the pipeline, can be specified. This may be necessary for users who are charged based on run time for their jobs.
8.1.1.1 Default for all processes
The simplest change you may wish to make is to relax time constraints for all processes. The setting for this is here:
executor {
    name = 'sge'
    queueSize = 40
    submitRateLimit = '1 sec'
    exitReadTimeout = '30 min'
}
process {
    time = 10.hour // this can be adjusted as needed
    errorStrategy = { task.exitStatus == 140 ? 'retry' : 'terminate' }
    maxRetries = 1
    // ... per-process settings follow ...
}
While the syntax is not strict, some examples for properly specifying the option are time = '5m', time = '2h', and time = '1d' (minutes, hours, and days, respectively).
8.1.1.2 Specify per-process
Time restrictions may also be specified for individual workflow steps, using the same syntax. Suppose you wish to request 30 minutes for a given sample to be trimmed (note process names here). In the “process” section of your config, find the section labelled “withName: Trimming” and set its time option, e.g. time = '30m'.
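A minimal sketch of such a per-process override is given here. The time value follows the 30-minute scenario described above; the placeholder comment stands in for whatever other settings (such as cpus or memory) your config already lists for this process, which can be left as they are.

```groovy
process {
    withName: Trimming {
        // ... keep the existing settings for this process ...
        time = '30m' // request 30 minutes for each trimming task
    }
}
```

A time value set inside a withName block takes precedence over the default time set for all processes.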
8.1.2 Cluster-specific options
In some cases, you may find it simpler to directly specify options accepted by your cluster. For example, SGE users with a default limit on the maximum file size they may write might specify the following:
withName: PairedEndHisat {
    cpus = 4
    memory = '20.GB'
    penv = 'local'
    clusterOptions = '-l h_fsize=500G' // This is SGE-specific syntax
}
As with the time option, this can be specified per-process or for all processes. Any option recognized by your cluster may be used.
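To apply a cluster option to all processes rather than a single one, the same setting can be placed directly in the "process" section. The sketch below reuses the SGE file-size limit from the example above purely as an illustration; substitute whatever option your scheduler expects.

```groovy
process {
    // Applies to every process, unless a withName block sets its own clusterOptions
    clusterOptions = '-l h_fsize=500G' // SGE-specific syntax, value reused from the example above
}
```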
8.2 SPEAQeasy Parameters
A number of variables in your config file exist to control choices about annotation to use, options passed to software tools, and more. These need not necessarily be changed, but allow more precise control over the pipeline if desired. Values for these variables may be changed in the “params” section of your config (near the top). A full descriptive list is provided below.
8.2.1 Annotation settings
- gencode_version_human: the GENCODE release to use for default (non-custom) annotation, when “hg38” or “hg19” references are used. A string, such as “32”.
- gencode_version_mouse: the GENCODE release to use for default (non-custom) annotation, when “mm10” reference is used. A string, such as “M23”.
- ensembl_version_rat: the Ensembl version to use for default (non-custom) annotation, when “rat” reference is used. A string, such as “98”. Note that Rnor_6.0 is used for ensembl_version_rat < 105, and mRatBN7.2 is used for releases after and including 105.
- anno_build: controls which sequences are used in analysis with default (non-custom) annotation. “main” indicates only canonical reference sequences; “primary” includes additional scaffolds.
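As a sketch, the annotation settings above might appear in the “params” section of a config as follows. The values are simply the examples quoted in the list; they are not necessarily the defaults shipped with your copy of the config.

```groovy
params {
    gencode_version_human = "32"  // used with "hg38" or "hg19"
    gencode_version_mouse = "M23" // used with "mm10"
    ensembl_version_rat = "98"    // releases below 105 imply Rnor_6.0
    anno_build = "main"           // or "primary" to include additional scaffolds
}
```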
8.2.2 Software settings
Command-line parameters can be directly passed to most software tools used in the pipeline, through configuration. Note that SPEAQeasy internally passes certain command-line options to some software tools, in a way that cannot be overridden in the configuration file. This is done for options which control parallelization, file naming, or other functions which must either be fixed or are configurable through other means (e.g. number of threads is implicitly controlled by the cpus variable in the QualityUntrimmed process, rather than directly via the -t argument to fastqc). Open your particular config file determined here for details of which options are fixed vs. which can be specified through configuration. A list of variables which control command-line arguments is as follows:
- bam2wig_args: a string of optional arguments to pass to bam2wig.py from RSeQC, when producing coverage wig files from BAM alignments.
- bcftools_args: a string of optional arguments to pass to bcftools call when calling variants (applicable for human samples only).
- fastqc_args: a string of optional arguments to pass to fastqc when performing quality checking before (and if applicable, after) trimming.
- feat_counts_gene_args and feat_counts_exon_args: strings of optional arguments passed to FeatureCounts when counting genes and exons, respectively.
- hisat2_args: a string of optional arguments to pass to hisat2-align.
- kallisto_len_mean and kallisto_len_sd: the mean and standard deviation for fragment length, required by kallisto quant when performing pseudoalignment. It is recommended you directly modify these variables rather than individually modifying kallisto_quant_single_args and kallisto_quant_ercc_single_args. These variables are also used when performing our pseudo-alignment-based approach to inferring strandness with single-end samples.
- kallisto_quant_single_args and kallisto_quant_paired_args: a string of arguments to pass to kallisto quant when performing pseudo-alignment to the reference transcriptome, for single-end and paired-end samples, respectively.
- kallisto_quant_ercc_single_args and kallisto_quant_ercc_paired_args: a string of optional arguments to pass to kallisto quant for performing pseudo-alignment to the list of synthetic ERCC-defined transcripts, for applicable experiments.
- kallisto_index_args: a string of optional arguments to pass to kallisto index when first preparing kallisto for a given reference transcriptome.
- salmon_index_args: a string of optional arguments to pass to salmon index when first preparing salmon for a given reference transcriptome. Note that Kallisto is used instead of Salmon, by default (see the --use_salmon option).
- salmon_quant_args: a string of optional arguments to pass to salmon quant when quantifying transcript abundances with Salmon.
- samtools_args: a string of optional arguments to pass to samtools mpileup before calling variants on human samples.
- star_args: a string of optional arguments to pass to STAR when aligning samples to the chosen reference genome.
- trim_adapter_args_single, trim_adapter_args_paired: these are passed to the ILLUMINACLIP settings for trimmomatic (after the adapter) for analysis with single and paired-end reads, respectively. Leave empty (i.e. "") to not perform adapter trimming.
- trim_quality_args: literal string arguments used for quality trimming. Leave empty (i.e. "") to not perform quality trimming. An example value could be "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75".
- regtools_args: a string of optional arguments to pass to regtools junctions extract when quantifying junctions.
- wigToBigWig_args: a string of optional arguments to pass to wigToBigWig when producing coverage BigWig files from their wig counterparts.
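The fragment below sketches how a few of these variables might be set in the “params” section. The quality-trimming string is the example quoted in the list above, and the empty adapter strings use the documented way of disabling adapter trimming. The kallisto fragment-length numbers are illustrative assumptions only, not recommendations; choose values appropriate to your library preparation.

```groovy
params {
    // Quality trimming, using the example string from the list above
    trim_quality_args = "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:75"

    // Empty strings: do not perform adapter trimming
    trim_adapter_args_single = ""
    trim_adapter_args_paired = ""

    // Fragment length mean and standard deviation for kallisto quant
    // (assumed, illustrative values)
    kallisto_len_mean = 200
    kallisto_len_sd = 30
}
```

Setting the two kallisto_len variables here, rather than editing the kallisto_quant argument strings individually, follows the recommendation given in the list.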
8.2.3 Miscellaneous settings
A few configuration variables control functionality specific to SPEAQeasy (i.e. not external software tools called by SPEAQeasy). These are documented below.
- num_reads_infer_strand: The number of lines in each FASTQ file to subset when determining strandness for a sample.
- wiggletools_batch_size: The maximum number of files to handle simultaneously when producing wig coverage files. We have encountered issues with wiggletools starting too many threads (one is used per sample handled), and so this can be lowered if similar issues occur for other users in large experiments.
- variants_merge_batch_size: The maximum number of files to merge at once when merging variant calls. A sufficiently low value may prevent issues with bcftools merge hitting system limits on number of open file handles with large experiments.
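For a large experiment that runs into the thread or file-handle problems described above, lowering the batch sizes might look like the following sketch. All three numbers are hypothetical placeholders chosen for illustration; check the values already present in your config before changing them.

```groovy
params {
    num_reads_infer_strand = 2000000 // hypothetical value: FASTQ lines to subset per file
    wiggletools_batch_size = 10      // hypothetical value: lower if wiggletools starts too many threads
    variants_merge_batch_size = 500  // hypothetical value: lower if bcftools merge hits open-file limits
}
```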