2 Setup Details

2.1 Requirements

BiocMAP is designed for execution on Linux, and requires the following:

  • Java 8 or later
  • Access to NVIDIA GPU(s) during installation, locally or via a computing cluster, with recent stable NVIDIA video drivers and CUDA runtime. BiocMAP may also be installed without these, but only the second module will function in this case
  • (recommended) docker, singularity, or Anaconda/Miniconda

If installing the pipeline locally (see installation), the following are also required:

  • Python 3 (tested with 3.7.3), with pip
  • R (tested with R 3.6-4.3)
  • GNU make

If installing the pipeline for use with docker (see installation), the NVIDIA container toolkit is also required for use of the first module. A CUDA runtime >= 10.1 is required for docker/singularity users.

Additionally, installation via the “local” or “conda” modes (see below) requires a C compiler, such as GCC. An up-to-date version of gcc is often required to ensure R packages install properly.
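
As a quick sanity check before installing, you can verify which of these prerequisites are already present. The commands below are only a sketch, and not every line applies to every installation mode; a missing command simply indicates what still needs to be installed or loaded.

java -version                        #  Java 8 or later (Nextflow recommends 11-18)
nvidia-smi                           #  NVIDIA driver / GPU visibility (first module only)
nvcc --version                       #  CUDA toolkit, if present
python3 --version && pip --version   #  "local" installation only
R --version                          #  "local" installation only
make --version && gcc --version      #  GNU make and a C compiler ("local"/"conda" modes)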

2.2 Installation

BiocMAP makes use of a number of additional software tools. The user is provided four installation “modes” to manage these dependencies automatically: “docker”, “singularity”, “conda”, or “local”, listed here in descending order of recommendation.

  • Docker: The recommended option is to manage software with docker, if it is available. From within the repository, perform the one-time setup by running bash install_software.sh "docker". This installs nextflow and sets up some test files. When running BiocMAP, the required docker images are automatically pulled if not already present, and components of the pipeline run within the associated containers. A full list of the images that are used is here. If root permissions are needed to run docker, one can instruct the installation script to use sudo in front of any docker commands by running bash install_software.sh "docker" "sudo". Finally, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.

  • Singularity: If singularity is available, a user may run bash install_software.sh "singularity" to install BiocMAP. This installs nextflow and sets up some test files. When running BiocMAP, the required docker images are automatically pulled if not already present, and components of the pipeline run within the associated containers using singularity. A full list of the images that are used is here. Next, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.

  • Conda: If conda is available (through Anaconda or Miniconda), a user can run bash install_software.sh "conda" to fully install BiocMAP. This creates a conda environment within which the required software is locally installed, and sets up some test files. Note that this is a one-time procedure even on a shared machine (new users automatically make use of the installed conda environment). Finally, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.

  • Local install: Installation is performed by running bash install_software.sh "local" from within the repository. This installs nextflow, several bioinformatics tools, R and its packages, and sets up some test files. A full list of software used is here. The script install_software.sh builds each software tool from source, and hence relies on some common utilities that are often pre-installed on unix-like systems:

    • A C/C++ compiler, such as GCC or Clang
    • The GNU make utility
    • The makeinfo utility
    • git, for downloading some tools from their GitHub repositories
    • The unzip utility

    Please note that this installation method is experimental, and can be more error-prone than installation via the “docker”, “singularity”, or “conda” modes. Finally, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module (see the sketch after this list).
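
Each mode above ends with the same cluster-related step: setting arioc_queue in your configuration file for the first module. As a hedged sketch (the file name conf/first_half_slurm.config and the value "gpu" below are placeholders; use the config file matching your environment and the name of the GPU queue on your cluster), the setting is a parameter naming the queue that provides GPU access:

// Excerpt sketch: conf/first_half_slurm.config (or the config for your environment)
params {
    // Name of the cluster queue/partition with GPU access, used by the Arioc processes
    arioc_queue = "gpu"
}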

Note: users at the JHPCE cluster do not need to worry about managing software via the above methods (required software is automatically available through modules). Simply run bash install_software.sh "jhpce" to install any missing R packages and set up some test files. Next, make sure you have the following lines added to your ~/.bashrc file:

if [[ $HOSTNAME == compute-* ]]; then
    module use /jhpce/shared/jhpce/modulefiles/libd
fi

2.2.1 Troubleshooting

Some users may encounter errors during the installation process, particularly when installing software locally. We provide a list below of the most common installation-related issues.

BiocMAP has been tested on:

  • CentOS 7 (Linux)
  • Ubuntu 18.04 (Linux)

2.2.1.1 CUDA runtime is not installed

BiocMAP aligns samples to a reference genome using Arioc, GPU-based software built with CUDA, so the CUDA toolkit must be installed. During installation, if you encounter an error message like this:

g++ -std=c++14 -c -Wall -Wno-unknown-pragmas -O3 -m64 -I /include -o CudaCommon/ThrustSpecializations.o CudaCommon/ThrustSpecializations.cpp
In file included from CudaCommon/ThrustSpecializations.cpp:11:0:
CudaCommon/stdafx.h:26:60: fatal error: cuda_runtime.h: No such file or directory
 #include <cuda_runtime.h>               // CUDA runtime API
                                                            ^
compilation terminated.

it’s possible that the CUDA toolkit is not installed (and should be). On a computing cluster, it’s also possible that CUDA-related software must be loaded, or is only available on a particular queue (one associated with GPU resources). In the former case, check your cluster’s documentation or contact tech support to see if there is a proper way to load the CUDA toolkit. For example, if your cluster uses Lmod environment modules, there might be a command like module load cuda that should be run before the BiocMAP installation script. If this works, you’ll need to adjust your configuration file for the first module following the advice here. In the latter case, try running the installation script from the queue containing the GPU(s).
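
On a cluster with Lmod environment modules, the check-and-load step might look like the sketch below (module names vary by cluster, and "cuda" here is only a placeholder; substitute the installation mode you are actually using for "local"):

module avail cuda                   #  list any CUDA modules your cluster provides
module load cuda                    #  load one before running the installer
nvcc --version                      #  confirm the CUDA compiler is now on the PATH
bash install_software.sh "local"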

2.2.1.2 g++ compilation errors

For the “conda” or “local” installation methods, g++ is used to compile Arioc. If your gcc/g++ version is too old, you may encounter errors during installation, most likely at the step that compiles Arioc. Here is an example of an error message that could occur:

g++: error: unrecognized command line option '-std=c++14'

On a local machine, consider installing a newer gcc and g++ (though please note that versions later than 7 cannot compile Arioc!). On a cluster, similar to the advice here, consult your cluster’s documentation or contact tech support to see if there is a way to load a particular version of gcc. If your cluster uses Lmod environment modules, there might be a command like module load gcc that should be run before the BiocMAP installation script. We have successfully used gcc 5.5.0. If this works, you’ll need to adjust your configuration file for the first module following the advice here.
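
As a short sketch of the check-and-load pattern (the module name gcc/5.5.0 is only an example; use whatever your cluster provides within the version range noted above):

g++ --version                       #  Arioc needs C++14 support, but not a g++ newer than version 7
module load gcc/5.5.0               #  example module name; varies by cluster
bash install_software.sh "local"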

2.2.1.3 Using Lmod modules with Arioc

If you encounter issues during installation via the “conda” or “local” modes on a computing cluster (in particular, see the g++ compilation errors and CUDA-related errors above), loading an Lmod environment module before running install_software.sh may provide a solution, if your cluster offers such modules. After a successful installation performed while using a module, it will also be necessary to instruct BiocMAP to load this module whenever it performs alignment-related steps.

To do this, locate your configuration file for the first module. As an example, if you needed a module called ‘cuda/10.0’ to perform installation, you can add the line module = 'cuda/10.0' to the processes EncodeReference, EncodeReads, and AlignReads. Here is what the modified configuration would look like for the EncodeReference process for SLURM users:

withName: EncodeReference {
    cpus = 1
    memory = 80.GB
    queue = params.arioc_queue
    module = 'cuda/10.0'
}

To load two modules, such as gcc/5.5.0 and cuda/10.0, the syntax looks like: module = 'gcc/5.5.0:cuda/10.0'.

2.2.1.4 Singularity Issues

When installing BiocMAP with Singularity (i.e. bash install_software.sh singularity), quite a bit of memory is sometimes required to build the Singularity images from their Docker counterparts, which are hosted on Docker Hub. Memory-related error messages can vary widely, but an example looks like this:

INFO:   Creating SIF file...
FATAL:   While making image from oci registry: while building SIF from layers: While running mksquashfs: exit status 1: FATAL ERROR:Failed to create thread

Requesting more memory and re-running the installation typically resolves such issues.
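
The exact way to request more memory depends on your scheduler. On a SLURM cluster, for example, one hedged option is to open a larger-memory interactive session and re-run the installer (the 16G figure below is only a guess; adjust as needed):

srun --mem=16G --pty bash
bash install_software.sh "singularity"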

2.2.1.5 Java Issues

With any installation method, the process may fail if Java is not installed or is too outdated (e.g. older than version 11). In this case, installing a recent version of Java (Nextflow recommends versions 11 through 18) should solve the issue.

Here are some potential pieces of BiocMAP error messages that suggest Java is too outdated or improperly installed:

NOTE: Nextflow is not tested with Java 1.8.0_262 -- It's recommended the use of version 11 up to 18
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/eclipse/jgit/api/errors/GitAPIException has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
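
To see which Java your shell (and therefore Nextflow) will pick up, check the version directly; if it is too old and you lack root access, one hedged option is to install a newer Java through conda (this assumes the conda-forge channel provides a suitable openjdk package):

java -version                            #  "1.8.0_xxx" means Java 8 (too old); 11-18 is the recommended range
conda install -c conda-forge openjdk=17  #  hedged, root-free option for a newer Java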

2.3 Run the Pipeline

Which “main” script is used to run the pipeline depends on the environment in which you will run it.

2.3.1 Run in a SLURM environment/cluster

  • (Optional) Adjust configuration: hardware resource usage, software versioning, and cluster option choices are specified in conf/first_half_slurm.config and conf/second_half_slurm.config.
  • Modify the main script and run: the main scripts are run_first_half_slurm.sh and run_second_half_slurm.sh. Each script may be submitted to the cluster as a job (e.g. sbatch run_first_half_slurm.sh). See the full list of command-line options for other details about modifying the script for your use-case. To run the complete workflow, it is recommended to first submit run_first_half_slurm.sh, then monitor the output log run_first_half_slurm.log so that run_second_half_slurm.sh may be submitted when the log indicates the first half has completed.
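
Concretely, the submit-and-monitor pattern might look like the sketch below (the exact wording of the completion message may differ; watch for the log reporting that the first half has finished):

sbatch run_first_half_slurm.sh
tail -f run_first_half_slurm.log    #  wait until the log indicates the first half completed
sbatch run_second_half_slurm.sh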

See here for Nextflow’s documentation regarding SLURM environments.

2.3.2 Run on a Sun Grid Engine (SGE) cluster

  • (Optional) Adjust configuration: hardware resource usage, software versioning, and cluster option choices are specified in conf/first_half_sge.config and conf/second_half_sge.config.
  • Modify the main script and run: the main scripts are run_first_half_sge.sh and run_second_half_sge.sh. Each script may be submitted to the cluster as a job (e.g. qsub run_first_half_sge.sh). See the full list of command-line options for other details about modifying the script for your use-case. To run the complete workflow, it is recommended to first submit run_first_half_sge.sh, then monitor the output log run_first_half_sge.log so that run_second_half_sge.sh may be submitted when the log indicates the first half has completed.

See here for additional information on Nextflow for SGE environments.

2.3.3 Run locally

  • (Optional) Adjust configuration: hardware resource usage and other configurables are located in conf/first_half_local.config and conf/second_half_local.config.
  • Modify the main script and run: the main scripts are run_first_half_local.sh and run_second_half_local.sh. After configuring options for your use-case (See the full list of command-line options), each script may be run interactively (e.g. bash run_first_half_local.sh).
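
Because a local run executes directly in your terminal, it can be convenient to keep it alive after logging out; nohup is one hedged option (screen or tmux work just as well, and the console log file name below is arbitrary):

nohup bash run_first_half_local.sh > first_half_console.log 2>&1 &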

2.3.4 Run on the JHPCE cluster

  • (Optional) Adjust configuration: the default configuration, with thoroughly tested hardware resource specifications, is described within conf/first_half_jhpce.config and conf/second_half_jhpce.config.
  • Modify the main script and run: the “main” scripts are run_first_half_jhpce.sh and run_second_half_jhpce.sh. Each script may be submitted to the cluster as a job (e.g. qsub run_first_half_jhpce.sh). See the full list of command-line options for other details about modifying the script for your use-case. To run the complete workflow, it is recommended to first submit run_first_half_jhpce.sh, then monitor the output log run_first_half_jhpce.log so that run_second_half_jhpce.sh may be submitted when the log indicates the first half has completed.

2.3.5 Example main script

Below is a full example of a typical main script for the first module, modified from the run_first_half_jhpce.sh script. At the top are some cluster-specific options, recognized by SGE, the grid scheduler at the JHPCE cluster. These are optional, and you may add similar options as appropriate for your scheduler if you plan to use BiocMAP on a computing cluster.

After the main command, nextflow first_half.nf, each command option can be described line by line (the full script follows this list):

  • --sample "paired": input samples are paired-end
  • --reference "hg38": these are human samples, to be aligned to the hg38 reference genome
  • --input "/users/neagles/wgbs_test": /users/neagles/wgbs_test is a directory that contains the samples.manifest file, describing the samples.
  • --output "/users/neagles/wgbs_test/out": /users/neagles/wgbs_test/out is the directory (which possibly exists already) where pipeline outputs should be placed.
  • -w "/fastscratch/myscratch/neagles/nextflow_work": this is a nextflow-specific command option (note the single dash), telling BiocMAP that temporary files for the pipeline run can be placed under /fastscratch/myscratch/neagles/nextflow_work.
  • --trim_mode "force": this optional argument instructs BiocMAP to trim all samples. Note there are alternative trimming options.
  • -profile first_half_jhpce: this line, which should typically always be included, tells BiocMAP to use conf/first_half_jhpce.config as the configuration file applicable to this pipeline run. That file describes hardware resource usage and more detailed pipeline settings for runs on the JHPCE cluster.

#!/bin/bash
#$ -l bluejay,mem_free=25G,h_vmem=25G,h_fsize=800G
#$ -o ./run_first_half_jhpce.log
#$ -e ./run_first_half_jhpce.log
#$ -cwd

module load nextflow
export _JAVA_OPTIONS="-Xms8g -Xmx10g"

nextflow first_half.nf \
    --sample "paired" \
    --reference "hg38" \
    --input "/users/neagles/wgbs_test" \
    --output "/users/neagles/wgbs_test/out" \
    -w "/fastscratch/myscratch/neagles/nextflow_work" \
    --trim_mode "force" \
    -profile first_half_jhpce

2.4 Sharing the pipeline with many users

A single installation of BiocMAP can be shared among potentially many users. New users can simply copy the appropriate “main” script (determined above) to a different desired directory, and modify the contents as appropriate for the particular experiment. Similarly, a single user can copy the “main” script and modify the copy whenever there is a new experiment or set of samples to process, reusing a single installation of BiocMAP arbitrarily many times.

Note: It is recommended to use a unique working directory with the -w option for each experiment. This ensures:

  • BiocMAP resumes from the correct point, if ever stopped while multiple users are running the pipeline
  • Deleting the work directory (which can take a large amount of disk space) does not affect BiocMAP execution for other users or other experiments
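
For example, a new user of a shared installation on a SLURM cluster might set up an experiment like the sketch below (all paths are placeholders; only the copied script is edited, never the shared installation itself):

#  Copy the appropriate "main" script out of the shared BiocMAP installation
cp /path/to/shared/BiocMAP/run_first_half_slurm.sh ~/my_experiment/
#  Edit the copy: point --input and --output at this experiment's directories,
#  and give -w a unique work directory such as "/scratch/$USER/my_experiment_work"
#  Then submit the edited copy
sbatch ~/my_experiment/run_first_half_slurm.sh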

2.4.1 Customizing execution for each user

By default, all users will share the same configuration. This likely suffices for many use cases, but alternatively new configuration files can be created. Below we will walk through an example where a new user of a SLURM-based cluster wishes to use an existing BiocMAP installation to run the first module, but wants a personal configuration file to specify different annotation settings.

  1. Copy the existing configuration to a new file

#  Verify we are in the BiocMAP repository
pwd

#  Create the new configuration file
cp conf/first_half_slurm.config conf/my_first_half_slurm.config

  2. Modify the new file as desired

Below we will change the GENCODE release to the older release 25, for human, via the gencode_version_human variable.

params {    
    //----------------------------------------------------
    //  Annotation-related settings
    //----------------------------------------------------
    
    gencode_version_human = "25" // originally was "34"!
    gencode_version_mouse = "M23"
    anno_build = "main" // main or primary (main is canonical seqs only)

See configuration for details on customizing BiocMAP settings.

  3. Add the new file as a “profile”

This involves adding some code to nextflow.config, as shown below.

profiles {
    // Here we've named the new profile "my_new_config", and pointed it to the
    // file "conf/my_first_half_slurm.config".
    my_new_config {
        includeConfig 'conf/my_first_half_slurm.config'
    }

    // ...any profiles already defined in nextflow.config remain unchanged...
}

  4. Reference the new profile in the “main” script

Recall that new users should copy the “main” script and modify the copy as appropriate. In this case, we open a copy of the original run_first_half_slurm.sh:

#  At the nextflow command, we change the '-profile' argument at the bottom
$ORIG_DIR/Software/bin/nextflow $ORIG_DIR/first_half.nf \
    --sample "paired" \
    --reference "hg38" \
    -profile my_new_config # this was changed from "-profile first_half_slurm"!