2 Setup Details
2.1 Requirements
BiocMAP is designed for execution on Linux, and requires the following:
- Java 8 or later
- Access to NVIDIA GPU(s) during installation, locally or via a computing cluster, with recent stable NVIDIA video drivers and a CUDA runtime. BiocMAP may also be installed without these, but only the second module will function in that case.
- (recommended) docker, singularity, or Anaconda/Miniconda
If installing the pipeline locally (see installation), the following are also required:
- Python 3 (tested with 3.7.3), with pip
- R (tested with R 3.6-4.3)
- GNU make
If installing the pipeline for use with docker (see installation), the NVIDIA container toolkit is also required for use of the first module. A CUDA runtime >= 10.1 is required for docker/singularity users.
Additionally, installation via the “local” or “conda” modes (see below) requires a C compiler, such as GCC. An up-to-date version of gcc is often needed to ensure R packages install properly.
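To quickly verify these prerequisites from your shell before installing, a check like the following sketch can help (the exact commands available depend on your system, and the GPU check is only relevant to the first module):

java -version        # Java 8 or later
nvidia-smi           # NVIDIA drivers working and GPU(s) visible (first module only)
python3 --version && python3 -m pip --version
R --version          # tested with R 3.6-4.3
gcc --version        # C compiler, needed for the "local" and "conda" modes
make --version       # GNU make, needed for a local install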
2.2 Installation
BiocMAP makes use of a number of additional software tools. The user is provided four installation “modes” to automatically manage these dependencies: “docker”, “singularity”, “conda”, or “local” (listed in descending order of recommendation).
- Docker: The recommended option is to manage software with docker, if it is available. From within the repository, perform the one-time setup by running bash install_software.sh "docker". This installs nextflow and sets up some test files. When running BiocMAP, the required docker images are automatically pulled if not already present, and components of the pipeline run within the associated containers. A full list of the images that are used is here. If root permissions are needed to run docker, one can instruct the installation script to use sudo in front of any docker commands by running bash install_software.sh "docker" "sudo". Finally, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.
- Singularity: If singularity is available, a user may run bash install_software.sh "singularity" to install BiocMAP. This installs nextflow and sets up some test files. When running BiocMAP, the required docker images are automatically pulled if not already present, and components of the pipeline run within the associated containers using singularity. A full list of the images that are used is here. Next, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.
- Conda: If conda is available (through Anaconda or Miniconda), a user can run bash install_software.sh "conda" to fully install BiocMAP. This creates a conda environment within which the required software is locally installed, and sets up some test files. Note that this is a one-time procedure even on a shared machine (new users automatically make use of the installed conda environment). Finally, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.
- Local install: Installation is performed by running bash install_software.sh "local" from within the repository. This installs nextflow, several bioinformatics tools, R and packages, and sets up some test files. A full list of software used is here. The script install_software.sh builds each software tool from source, and hence relies on some common utilities which are often pre-installed in many unix-like systems:
  - A C/C++ compiler, such as GCC or Clang
  - The GNU make utility
  - The makeinfo utility
  - git, for downloading some software from their GitHub repositories
  - The unzip utility

  Please note that this installation method is experimental, and can be more error-prone than installation via the “docker”, “singularity”, or “conda” modes. Finally, if using BiocMAP on a cluster, set the arioc_queue Arioc setting in your config file for the first module.
Note: users at the JHPCE cluster do not need to worry about managing software via the above methods (required software is automatically available through modules). Simply run bash install_software.sh "jhpce" to install any missing R packages and set up some test files. Next, make sure you have the following lines added to your ~/.bashrc file:
2.2.1 Troubleshooting
Some users may encounter errors during the installation process, particularly when installing software locally. We provide a list below of the most common installation-related issues.
BiocMAP has been tested on:
- CentOS 7 (Linux)
- Ubuntu 18.04 (Linux)
2.2.1.1 CUDA runtime is not installed
BiocMAP aligns samples to a reference genome using Arioc, GPU-based software built with CUDA, so the CUDA toolkit must be installed. If you encounter an error message like this during installation:
g++ -std=c++14 -c -Wall -Wno-unknown-pragmas -O3 -m64 -I /include -o CudaCommon/ThrustSpecializations.o CudaCommon/ThrustSpecializations.cpp
In file included from CudaCommon/ThrustSpecializations.cpp:11:0:
CudaCommon/stdafx.h:26:60: fatal error: cuda_runtime.h: No such file or directory
#include <cuda_runtime.h> // CUDA runtime API
^
compilation terminated.
it’s possible that the CUDA toolkit is not installed (and should be). On a computing cluster, it’s also possible CUDA-related software must be loaded, or is only available on a particular queue (associated with GPU resources). In the former case, check documentation or contact tech support to see if there is a proper way to load the CUDA toolkit for your cluster. For example, if your cluster uses Lmod environment modules, there might be a command like module load cuda
that should be run before the BiocMAP installation script. If this works, you’ll need to adjust your configuration file for the first module following the advice here. In the latter case, try running the installation script from the queue containing the GPU(s).
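As a rough sketch (whether Lmod is available and the exact module name are assumptions about your cluster), the diagnosis and fix might look like:

nvcc --version                      # prints a CUDA version if the toolkit is installed
module avail cuda                   # on Lmod clusters, lists any CUDA modules (names vary by site)
module load cuda                    # load one before installing (adjust name/version as needed)
bash install_software.sh "local"    # then re-run the installation script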
2.2.1.2 g++ compilation errors
For “conda” or “local” installation methods, g++ is used to compile Arioc. If your gcc version is too old, you may encounter errors during the installation process, likely during the step that compiles Arioc. Here is an example of an error message that could occur:
g++: error: unrecognized command line option '-std=c++14'
On a local machine, consider installing a newer gcc and g++ (though please note that versions later than 7 cannot compile Arioc!). On a cluster, similar to the advice here, consult your cluster’s documentation or contact tech support to see if there is a way to load a particular version of gcc. If your cluster uses Lmod environment modules, there might be a command like module load gcc that should be run before the BiocMAP installation script. We have successfully used gcc 5.5.0. If this works, you’ll need to adjust your configuration file for the first module following the advice here.
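As a sketch, checking the compiler and switching to a suitable version on an Lmod cluster might look like this (the module names and versions shown are site-specific assumptions):

g++ --version                       # Arioc needs -std=c++14 support, but versions later than 7 will not work
module avail gcc                    # list any available gcc modules
module load gcc/5.5.0               # a version that has worked for us; adjust to what your site offers
bash install_software.sh "local"    # or "conda"; re-run the installation afterwards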
2.2.1.3 Using Lmod modules with Arioc
For users encountering specific issues during installation via the “conda” or “local” modes on a computing cluster (in particular, see g++ compilation errors and CUDA-related errors), loading an Lmod environment module before installation with install_software.sh might provide a solution, if this is an option. After successful installation with install_software.sh while using a module, it will also be necessary to instruct BiocMAP to load this module whenever it performs alignment-related steps.
To do this, locate your configuration file for the first module. As an example, if you needed a module called ‘cuda/10.0’ to perform installation, you can add the line module = 'cuda/10.0' to the EncodeReference, EncodeReads, and AlignReads processes. Here is what the modified configuration would look like for the EncodeReference process for SLURM users:
withName: EncodeReference {
    cpus = 1
    memory = 80.GB
    queue = params.arioc_queue
    module = 'cuda/10.0'
}
To load two modules, such as gcc/5.5.0 and cuda/10.0, the syntax looks like: module = 'gcc/5.5.0:cuda/10.0'.
2.2.1.4 Singularity Issues
When installing BiocMAP with Singularity (i.e. bash install_software.sh "singularity"), quite a bit of memory is sometimes required to build the Singularity images from their Docker counterparts, hosted on Docker Hub. Memory-related error messages can vary widely, but an example looks like this:
INFO: Creating SIF file...
FATAL: While making image from oci registry: while building SIF from layers: While running mksquashfs: exit status 1: FATAL ERROR:Failed to create thread
Requesting more memory and re-running the installation should resolve such issues.
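For example, on a SLURM cluster the reinstallation might be run with extra memory, as in the sketch below (the 16 GB figure and the scheduler syntax are assumptions to adjust for your site):

srun --mem=16G bash install_software.sh "singularity"    # SLURM
# On SGE, an interactive session such as: qrsh -l mem_free=16G,h_vmem=16G
# followed by re-running the installation script, achieves the same.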
2.2.1.5 Java Issues
With any installation method, the process may fail if Java is not installed or is too outdated (e.g. older than version 11). In this case, installing a recent version of Java (Nextflow recommends versions 11 through 18) should resolve the issue.
Here are some potential pieces of BiocMAP error messages that suggest Java is too outdated or improperly installed:
Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: org/eclipse/jgit/api/errors/GitAPIException has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
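To check which Java version Nextflow will see, and to address it if needed, a sketch like the following may help (the module name is a site-specific assumption):

java -version    # should report a version Nextflow supports, ideally 11 through 18
# If it is too old, install or load a newer Java before re-running install_software.sh,
# e.g. "module load java" on some clusters, or via your package manager locally.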
2.3 Run the Pipeline
The “main” script used to run the pipeline depends on the environment in which you will run it.
2.3.1 Run in a SLURM environment/cluster
- (Optional) Adjust configuration: hardware resource usage, software versioning, and cluster option choices are specified in conf/first_half_slurm.config and conf/second_half_slurm.config.
- Modify the main script and run: the main scripts are run_first_half_slurm.sh and run_second_half_slurm.sh. Each script may be submitted to the cluster as a job (e.g. sbatch run_first_half_slurm.sh). See the full list of command-line options for other details about modifying the script for your use-case. To run the complete workflow, it is recommended to first submit run_first_half_slurm.sh, then monitor the output log run_first_half_slurm.log so that run_second_half_slurm.sh may be submitted when the log indicates the first half has completed (see the sketch below).
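Concretely, a complete SLURM run might look like the sketch below, executed from within the BiocMAP repository (exact scheduler behavior varies by site):

sbatch run_first_half_slurm.sh       # submit the first module
tail -f run_first_half_slurm.log     # watch the log until it indicates the first half has completed
sbatch run_second_half_slurm.sh      # then submit the second module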
See here for Nextflow’s documentation regarding SLURM environments.
2.3.2 Run on a Sun Grid Engine (SGE) cluster
- (Optional) Adjust configuration: hardware resource usage, software versioning, and cluster option choices are specified in conf/first_half_sge.config and conf/second_half_sge.config.
- Modify the main script and run: the main scripts are run_first_half_sge.sh and run_second_half_sge.sh. Each script may be submitted to the cluster as a job (e.g. qsub run_first_half_sge.sh). See the full list of command-line options for other details about modifying the script for your use-case. To run the complete workflow, it is recommended to first submit run_first_half_sge.sh, then monitor the output log run_first_half_sge.log so that run_second_half_sge.sh may be submitted when the log indicates the first half has completed.
See here for additional information on nextflow for SGE environments.
2.3.3 Run locally
- (Optional) Adjust configuration: hardware resource usage and other configurables are located in conf/first_half_local.config and conf/second_half_local.config.
- Modify the main script and run: the main scripts are run_first_half_local.sh and run_second_half_local.sh. After configuring options for your use-case (see the full list of command-line options), each script may be run interactively (e.g. bash run_first_half_local.sh); see the sketch below.
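For long local runs, one option is to launch the script in the background and capture its output, as in this sketch (the log file name here is only an illustration):

nohup bash run_first_half_local.sh > run_first_half_local.log 2>&1 &
tail -f run_first_half_local.log     # monitor progress; run the second-half script once this completes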
2.3.4 Run on the JHPCE cluster
- (Optional) Adjust configuration: a default configuration with thoroughly tested hardware resource specifications is provided in conf/first_half_jhpce.config and conf/second_half_jhpce.config.
- Modify the main script and run: the “main” scripts are run_first_half_jhpce.sh and run_second_half_jhpce.sh. Each script may be submitted to the cluster as a job (e.g. qsub run_first_half_jhpce.sh). See the full list of command-line options for other details about modifying the script for your use-case. To run the complete workflow, it is recommended to first submit run_first_half_jhpce.sh, then monitor the output log run_first_half_jhpce.log so that run_second_half_jhpce.sh may be submitted when the log indicates the first half has completed.
2.3.5 Example main script
Below is a full example of a typical main script for the first module, modified from the run_first_half_jhpce.sh script. At the top are some cluster-specific options, recognized by SGE, the grid scheduler at the JHPCE cluster. These are optional, and you may add analogous options if you plan to use BiocMAP on a computing cluster.
After the main command, nextflow first_half.nf, each command option can be described line by line:
- --sample "paired": input samples are paired-end.
- --reference "hg38": these are human samples, to be aligned to the hg38 reference genome.
- --input "/users/neagles/wgbs_test": /users/neagles/wgbs_test is a directory that contains the samples.manifest file, describing the samples.
- --output "/users/neagles/wgbs_test/out": /users/neagles/wgbs_test/out is the directory (which possibly exists already) where pipeline outputs should be placed.
- -w "/fastscratch/myscratch/neagles/nextflow_work": this is a nextflow-specific command option (note the single dash), telling BiocMAP that temporary files for the pipeline run can be placed under /fastscratch/myscratch/neagles/nextflow_work.
- --trim_mode "force": this optional argument instructs BiocMAP to trim all samples. Note there are alternative options.
- -profile first_half_jhpce: this line, which should typically always be included, tells BiocMAP to use conf/first_half_jhpce.config as the configuration file applicable to this pipeline run; hardware resource usage and more detailed pipeline settings for this run on the JHPCE cluster are described there.
#!/bin/bash
#$ -l bluejay,mem_free=25G,h_vmem=25G,h_fsize=800G
#$ -o ./run_first_half_jhpce.log
#$ -e ./run_first_half_jhpce.log
#$ -cwd
module load nextflow
export _JAVA_OPTIONS="-Xms8g -Xmx10g"
nextflow first_half.nf \
--sample "paired" \
--reference "hg38" \
--input "/users/neagles/wgbs_test" \
--output "/users/neagles/wgbs_test/out" \
-w "/fastscratch/myscratch/neagles/nextflow_work" \
--trim_mode "force" \
-profile first_half_jhpce