8 Configuration

BiocMAP is designed to be highly customizable, yet need no configuration from a user wishing to rely on sensible default settings. Most configuration, including software settings, hardware resources such as memory and CPU use, and more, can be done in a single file determined here. Please note that the common, “major” options are specified in your main script, and not in the configuration file. The config is intended to hold the “finer details”, which are documented below.

While memory usage for most processes in BiocMAP is independent of the number of samples in the experiment, the FormBsseqObjects and MergeBsseqObjects processes in the second module scale roughly linearly with dataset size. In our experiments, 120GB of memory is sufficient for the MergeBsseqObjects process for 4 human samples, and 200GB is sufficient for as many as 600 human samples. As a rule of thumb, allow 1-2TB of disk space per human sample. BiocMAP execution time is nearly linear in the number of samples, taking roughly 1 day per 4 human samples (about 5 months for ~600 samples– while lengthy, an analysis of this scale is prohibitively long with CPU-based pipelines to the point of likely being infeasible).

8.1 Specifying Options for your Cluster

In many cases, a user has access to a computing cluster which they intend to run BiocMAP on. If your cluster is SLURM or SGE-based, the pipeline is pre-configured with options you may be used to specifying (such as disk usage, time for a job to run, etc). However, these are straightforward to modify, should there be a need/desire. Common settings are described in detail below; however, a more comprehensive list of settings from nextflow can be found here.

8.1.1 Time

The maximum allowed run time for a process, or step in the pipeline, can be specified. This may be necessary for users who are charged based on run time for their jobs.

8.1.1.1 Default for all processes

The simplest change you may wish to make is to relax time constraints for all processes. The setting for this is here:

executor {
    name = 'sge'
    queueSize = 40
    submitRateLimit = '1 sec'
    exitReadTimeout = '40 min'
}

process {
    cache = 'lenient'
    
    time = 10.hour  // this can be adjusted as needed

While the syntax is not strict, some examples for properly specifying the option are time = '5m', time = '2h', and time = '1d' (minutes, hours, and days, respectively).

8.1.1.2 Specify per-process

Time restrictions may also be specified for individual workflow steps. This can be done with the same syntax- suppose you wish to request 30 minutes for a given sample to be trimmed (note process names here). In the “process” section of your config, find the section labelled “withName: Trimming” as in the example here:

withName: Trimming {
    cpus = 1
    memory = 5.GB
    time = '30m' // specify a maximum run time of 30 minutes
}

8.1.2 Cluster-specific options

In some cases, you may find it simpler to directly specify options accepted by your cluster. For example, SGE users with a default limit on the maximum file size they may directly adjust the “h_fsize” variable as below:

withName: FilterAlignments {
    cpus = 2
    penv = 'local'
    memory = 16.GB
    clusterOptions = '-l h_fsize=800G'
}

As with the time option, this can be specified per-process or for all processes. Any option recognized by your cluster may be used.

8.2 BiocMAP settings

A number of variables in your config file exist to control choices about annotation to use, options passed to software tools, and more. These need not necessarily be changed, but allow more precise control over the pipeline if desired. Values for these variables may be changed in the “params” section of your config (near the top). A full descriptive list is provided below.

8.2.1 Annotation settings

  • gencode_version_human: the GENCODE release to use for default (non-custom) annotation, when “hg38” or “hg19” references are used. A string, such as “32”.
  • gencode_version_mouse: the GENCODE release to use for default (non-custom) annotation, when “mm10” reference is used. A string, such as “M23”.
  • anno_build: controls which sequences are used in analysis with default (non-custom) annotation. “main” indicates only canonical reference sequences; “primary” includes additional scaffolds.

8.2.2 Arioc settings

Below, we document how the names of settings accepted by BiocMAP correspond to variables in Arioc configuration. For documentation on those Arioc configuration variables, see here.

  • arioc_queue: when running BiocMAP on an SGE/SLURM-based cluster, it is assumed GPU resources are available via a “queue” on the cluster. This variable must be set by the user to the queue name, in order to properly run BiocMAP.
  • batch_size: this is passed to the “batchSize” attribute of the “AriocU” or “AriocP” element of the config for AriocU or AriocP.
  • gapped_seed: this is passed to the “seed” attribute of the “gapped” element of the config for AriocU or AriocP.
  • nongapped_seed: this is passed to the “seed” attribute of the “nongapped” of the config for AriocU or AriocP.
  • gapped_args: these arguments in string form are passed literally to the “gapped” element of the config for AriocU or AriocP.
  • nongapped_args: these arguments in string form are passed literally to the “nongapped” element of the config for AriocU or AriocP.
  • x_args: these arguments in string form are passed literally to the “X” element of the config for AriocU or AriocP.
  • max_gpus: an integer, specifying how many GPUs to use, per sample, during alignment. Generally, it is recommended to keep this value at 1 unless there are more GPUs available than samples in your dataset.
  • manually_set_gpu: (true or false) a boolean specifying whether to have BiocMAP search for an available GPU and set CUDA_VISIBLE_DEVICES “manually” when performing alignment. Generally, this should be set to false on computing clusters managed by a job scheduler like SLURM or SGE, since typically a proper job scheduler handles these details on behalf of the user. By contrast, on a local machine (where typically there is no job scheduler), this variable should be set to true to ensure two samples don’t get aligned simultaneously on the same GPU (which often results in crashing due to insufficient memory).
  • gpu_perc_usage_cutoff: a percentage, specified as an integer, of GPU utilization (as measured by nvidia-smi) below which a GPU is considered “unoccupied” and thus available to use during alignment with Arioc. This setting only applies if manually_set_gpu is true.

8.2.3 Miscellaneous settings

  • kallisto_single_args: literal arguments to pass to ‘kallisto quant’ for single-end samples, as part of the process for estimating bisulfite conversion efficiency when using the ‘–with_lambda’ option.