2 Setup Details

2.1 Requirements

SPEAQeasy requires the following to be installed:

  • Java 8 or later
  • Singularity or Docker (recommended)
  • Python 3 and pip (only for local installation, which is not recommended)

If Java is not installed, you can install it on Linux with apt install default-jre, or with a different package manager as appropriate for your distribution. The above installations are typically performed by a system administrator.
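
One quick way to verify that the core requirements are available (the exact version output varies by system):

#  Check that Java 8 or later and a container engine are installed
java -version
singularity --version   # if using singularity
docker --version        # if using docker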

SPEAQeasy has been tested on Linux, but it is designed to run on any of a number of POSIX-compliant systems, including macOS and FreeBSD.

2.2 Installation

SPEAQeasy makes use of a number of additional software tools. Users are provided three options for automatically managing these dependencies; the corresponding one-time setup commands are recapped in a single block below.

  • Docker: The recommended option is to manage software with docker, if it is available. From within the repository, perform the one-time setup by running bash install_software.sh "docker". This installs nextflow, pulls docker images containing required software, and sets up some test files. When running SPEAQeasy, components of the pipeline run within the associated containers. A full list of the images that are used is here. If root permissions are needed to run docker, one can instruct the installation script to use sudo in front of any docker commands by running bash install_software.sh "docker" "sudo".

  • Singularity: An alternative recommended option is to manage software with singularity, if it is installed. From within the repository, perform the one-time setup by running bash install_software.sh "singularity". In practice this configures SPEAQeasy to use singularity to run the required software as originally packaged into docker images. A full list of the images that are used is here.

  • Local install: The alternative is to locally install all dependencies. This option is only officially supported on Linux; it is currently experimental and recommended against on other platforms. Installation is done by running bash install_software.sh "local" from within the repository. This installs nextflow, several bioinformatics tools, R and packages, and sets up some test files. A full list of software used is here. The script install_software.sh builds each software tool from source, and hence relies on some common utilities which are often pre-installed in many unix-like systems:

    • A C/C++ compiler, such as GCC or Clang
    • The GNU make utility
    • The makeinfo utility
    • git, for downloading some software from their GitHub repositories
    • The unzip utility

Note: users at the JHPCE cluster do not need to worry about managing software via the above methods (required software is automatically available through modules). Simply run bash install_software.sh "jhpce" to install any missing R packages and set up some test files.
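
For reference, the one-time setup commands described above are:

#  Run exactly one of the following from within the SPEAQeasy repository
bash install_software.sh "docker"        # docker ("sudo" may be added as a second argument)
bash install_software.sh "singularity"   # singularity
bash install_software.sh "local"         # local install (Linux only)
bash install_software.sh "jhpce"         # JHPCE cluster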

2.2.1 Troubleshooting

Some users may encounter errors during the installation process, particularly when installing software locally. Below we list the most common installation-related issues. Please note that “local” installation is only officially supported on Linux, and Mac users should install in “docker” mode! The solutions below for macOS are experimental and not yet complete.

SPEAQeasy has been tested on:

  • CentOS 7 (Linux)
  • Ubuntu 20.04.2 LTS (Linux)

2.2.1.1 Singularity cannot create thread

On some high-performance computing clusters, singularity tends to use an unexpectedly large amount of memory during installation (when converting from the Docker images we host). This can lead to error messages like the following when performing a singularity-based installation (i.e. bash install_software.sh "singularity"):

Storing signatures
INFO:    Creating SIF file...
FATAL:   While making image from oci registry: while building SIF from layers: While running mksquashfs: exit status 1: FATAL ERROR:Failed to create thread

Sometimes, requesting an unusually large amount of memory (e.g. more than 64GB) while performing the installation is an effective workaround. If not, it may be appropriate to talk to your system administrator about configuring Singularity for a computing-cluster setting (i.e. editing singularity.conf appropriately).
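
As a sketch of the first workaround on a SLURM-based cluster (the scheduler and the exact memory request are assumptions here; adapt them to your system):

#  Request extra memory for the installation step only (values are illustrative)
srun --mem=64G bash install_software.sh "singularity"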

2.2.1.2 Required utilities are missing

This is a particularly common issue for macOS users, or for Linux users trying to get SPEAQeasy running on a local machine (like a laptop). In either case, we will assume the user has root privileges for the solutions suggested below.

macOS users may be missing utilities required for the SPEAQeasy installation. One solution on Mac is to install Homebrew and then install the required utilities through brew:

#  Install brew, if you don't already have it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

#  Install most required utilities
brew install autoconf automake make gcc zlib bzip2 xz pcre openssl texinfo llvm libomp

#  Install openJDK8 (Java)
brew install --cask homebrew/cask-versions/adoptopenjdk8

It is also recommended that Linux users install some basic dependencies if local installation fails for any reason:

#  On Debian or Ubuntu:
sudo apt install autoconf automake make gcc zlib1g-dev libbz2-dev liblzma-dev libpcre3-dev libcurl4-openssl-dev texinfo texlive-base default-jre default-jdk

#  On RedHat or CentOS:
sudo yum install autoconf automake make gcc zlib-devel bzip2 bzip2-devel xz-devel pcre-devel curl-devel texi2html texinfo java-1.8.0-openjdk java-1.8.0-openjdk-devel

2.2.1.3 Docker permissions issues

Users managing dependencies with docker might encounter error messages if docker is not properly configured:

docker: Got permission denied while trying to connect to the Docker daemon socket at unix:///var/run/docker.sock: Post http://%2Fvar%2Frun%2Fdocker.sock/v1.40/containers/create: dial unix /var/run/docker.sock: connect: permission denied.
See 'docker run --help'.

On a computing cluster, a system administrator is responsible for correctly configuring docker so that such errors do not occur. We provide a brief guide here for users who want to set up docker on a local machine.
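
On a personal Linux machine, one common approach is to add your user to the docker group so that root is not needed for every command. This is general Docker administration rather than anything SPEAQeasy-specific, and it requires root privileges plus logging out and back in:

#  General Docker setup (not SPEAQeasy-specific)
sudo groupadd docker            # the group may already exist
sudo usermod -aG docker $USER   # takes effect after logging out and back in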

2.3 Run the Pipeline

The “main” script used to run the pipeline depends on the environment in which you will run it.

2.3.1 Run in a SLURM environment/cluster

  • (Optional) Adjust configuration: hardware resource usage, software versioning, and cluster option choices are specified in conf/slurm.config (an example tweak appears at the end of this section).
  • Modify the main script and run: the main script is run_pipeline_slurm.sh. Submit as a job to your cluster with sbatch run_pipeline_slurm.sh. See the full list of command-line options for other details about modifying the script for your use-case.

See here for Nextflow’s documentation regarding SLURM environments.
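
As an example of the configuration tweak mentioned above, resources for an individual step can be adjusted in conf/slurm.config with a process block like the following. The process name here is hypothetical; check conf/slurm.config for the names actually used:

process {
    withName: AlignHisat {   // hypothetical process name
        cpus = 4
        memory = 16.GB
    }
}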

2.3.2 Run on a Sun Grid Engine (SGE) cluster

  • (Optional) Adjust configuration: hardware resource usage, software versioning, and cluster option choices are specified in conf/sge.config.
  • Modify the main script and run: the main script is run_pipeline_sge.sh. Submit as a job to your cluster with qsub run_pipeline_sge.sh. See the full list of command-line options for other details about modifying the script for your use-case.

See here for additional information on Nextflow in SGE environments.

2.3.3 Run locally

  • (Optional) Adjust configuration: hardware resource usage and other configurables are located in conf/local.config. Note that defaults assume access to 8 CPUs and 16GB of RAM.
  • Modify the main script and run: the main script is run_pipeline_local.sh. After configuring options for your use-case (see the full list of command-line options), simply run it on the command line with bash run_pipeline_local.sh.
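
Since a local run can take a long time, it may be convenient (though not required) to run the script in the background so it continues after the terminal is closed, for example:

#  Optional: run in the background and capture output in a log file
nohup bash run_pipeline_local.sh > SPEAQeasy_local_run.log 2>&1 &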

2.3.4 Run on the JHPCE cluster

  • (Optional) Adjust configuration: the default configuration, with thoroughly tested hardware resource specifications, is described within conf/jhpce.config. Other settings, such as annotation release/version, can also be tweaked via this file.
  • Modify the main script and run: the “main” script is run_pipeline_jhpce.sh. The pipeline run is submitted as a job to the cluster by executing sbatch run_pipeline_jhpce.sh. See the full list of command-line options for other details about modifying the script for your use-case.

2.3.5 Example main script

Below is a full example of a typical main script, modified from the run_pipeline_jhpce.sh script. At the top are some cluster-specific options, recognized by SLURM, the grid scheduler at the JHPCE cluster. These are optional, and you may consider adding similar options appropriate for your scheduler if you plan to use SPEAQeasy on a computing cluster.

After the main call, nextflow $ORIG_DIR/main.nf, each command option is described below, line by line:

  • --sample "paired": input samples are paired-end
  • --reference "mm10": these are mouse samples, to be aligned to the mm10 genome
  • --strand "reverse": the user expects the samples to be reverse-stranded, which SPEAQeasy will verify
  • --ercc: the samples have ERCC spike-ins, which the pipeline should quantify as a QC measure.
  • --trim_mode "skip": trimming is not to be performed on any samples
  • --experiment "mouse_brain": the main pipeline outputs should be labelled with the experiment name “mouse_brain”
  • --input "/users/neagles/RNA_input": /users/neagles/RNA_input is a directory that contains the samples.manifest file, describing the samples.
  • -profile jhpce: configuration of hardware resource usage, and more detailed pipeline settings, is described at conf/jhpce.config, since this is a run using the JHPCE cluster
  • -w "/scratch/nextflow_runs": this is a nextflow-specific command option (note the single dash), telling SPEAQeasy that temporary files for the pipeline run can be placed under /scratch/nextflow_runs
  • --output "/users/neagles/RNA_output": SPEAQeasy output files should be placed under /users/neagles/RNA_output

#!/bin/bash

#SBATCH -q shared
#SBATCH --mem=40G
#SBATCH --job-name=SPEAQeasy
#SBATCH -o ./SPEAQeasy_output.log
#SBATCH -e ./SPEAQeasy_output.log

#  After running 'install_software.sh', this should point to the directory
#  where SPEAQeasy was installed, and not say "$PWD"
ORIG_DIR=/dcl01/lieber/ajaffe/Nick/SPEAQeasy

module load nextflow/20.01.0
export _JAVA_OPTIONS="-Xms8g -Xmx10g"

nextflow $ORIG_DIR/main.nf \
    --sample "paired" \
    --reference "mm10" \
    --strand "reverse" \
    --ercc \
    --trim_mode "skip" \
    --experiment "mouse_brain" \
    --input "/users/neagles/RNA_input" \
    -profile jhpce \
    -w "/scratch/nextflow_runs" \
    --output "/users/neagles/RNA_output"
    
#   Log successful runs on non-test data in a central location. Please adjust
#   the log path here if it is changed at the top!
bash $ORIG_DIR/scripts/track_runs.sh $PWD/SPEAQeasy_output.log

#  Produces a report for each sample tracing the pipeline steps
#  performed (can be helpful for debugging).
#
#  Note that the reports are generated from the output log produced in the above
#  section, and so if you rename the log, you must also replace the filename
#  in the bash call below.
echo "Generating per-sample logs for debugging..."
bash $ORIG_DIR/scripts/generate_logs.sh $PWD/SPEAQeasy_output.log

2.3.6 Advanced info regarding installation

  • If you are installing software to run the pipeline locally, all dependencies are installed into [repo directory]/Software/, and [repo directory]/conf/command_paths_long.config is configured to show nextflow the default installation locations of each software tool. Thus, this config file can be tweaked to manually point to different paths, if need be (though this shouldn’t be necessary).
  • Nextflow supports the use of Lmod modules to conveniently point the pipeline to the bioinformatics software it needs. If you wish to use neither docker nor a local installation of the many dependencies, and you already have Lmod modules on your cluster, this is another option. In the appropriate config file (as determined in step 3 in the section you choose below), you can include a module specification line in the associated process (such as module = 'hisat2/2.2.1' for BuildHisatIndex), as configured in conf/jhpce.config; a sketch follows this list. In most cases this will be more work to fully configure, so running the pipeline with docker or locally installing software is generally recommended instead. See nextflow modules for some more information.
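
A minimal sketch of that module-based approach, using the HISAT2 example above (the process/withName wrapper shown here is an assumption; compare with how modules are declared in conf/jhpce.config):

process {
    withName: BuildHisatIndex {
        // Point this step at an Lmod module instead of a container or local install
        module = 'hisat2/2.2.1'
    }
}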

2.4 Sharing the pipeline with many users

A single installation of SPEAQeasy can be shared among potentially many users. New users can simply copy the appropriate “main” script (determined above) to a different desired directory, and modify the contents as appropriate for the particular experiment. Similarly, a single user can copy the “main” script and modify the copy whenever there is a new experiment or set of samples to process, reusing a single installation of SPEAQeasy arbitrarily many times.

Note: It is recommended to use a unique working directory with the -w option for each experiment (see the sketch after this list). This ensures:

  • SPEAQeasy resumes from the correct point, if ever stopped while multiple users are running the pipeline
  • Deleting the work directory (which can take a large amount of disk space) does not affect SPEAQeasy execution for other users or other experiments
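
For example (all paths here are hypothetical), two experiments sharing one installation might separate their work directories like this:

#  Hypothetical paths; the other required options (--sample, --reference, etc.)
#  are omitted for brevity. Each experiment gets its own work directory.
nextflow $ORIG_DIR/main.nf [...] -w "/scratch/$USER/work_experiment1"
nextflow $ORIG_DIR/main.nf [...] -w "/scratch/$USER/work_experiment2"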

New users who wish to include the --coverage option must also install RSeQC themselves:

python3 -m pip install --user RSeQC==3.0.1

2.4.1 Customizing execution for each user

By default, all users will share the same configuration. This likely suffices for many use cases, but alternatively new configuration files can be created. Below we will walk through an example where a new user of a SLURM-based cluster wishes to use an existing SPEAQeasy installation, but wants a personal configuration file to specify different annotation settings.

  1. Copy the existing configuration to a new file
#  Verify we are in the SPEAQeasy repository
pwd

#  Create the new configuration file
cp conf/slurm.config conf/my_new.config

  2. Modify the new file as desired

Below we will change the GENCODE release to the older release 25, for human, via the gencode_version_human variable.

executor = 'slurm'

params {
  gencode_version_human = "25" // originally was "32"!
  gencode_version_mouse = "M25"
  ensembl_version_rat = "98"
  anno_build = "main" // main or primary (main is canonical seqs only)

See configuration for details on customizing SPEAQeasy settings.

  3. Add the new file as a “profile”

This involves adding some code to nextflow.config, as shown below.

profiles {
  // Here we've named the new profile "my_new_config", and pointed it to the
  // file "conf/my_new.config". Note that docker/ singularity users should have
  // the 2nd line "includeConfig 'conf/command_paths_short.config'" instead!
  my_new_config {
    includeConfig 'conf/my_new.config'
    includeConfig 'conf/command_paths_long.config'
  }
  
  // This configuration had already existed
  local {
    includeConfig 'conf/local.config'
    includeConfig 'conf/command_paths_long.config'
  }

  // ...other pre-existing profiles...
}

  4. Reference the new profile in the “main” script

Recall that new users should copy the “main” script and modify the copy as appropriate. In this case, we open a copy of the original run_pipeline_slurm.sh:

#  At the nextflow command, we change the '-profile' argument at the bottom
$ORIG_DIR/Software/nextflow main.nf \
    --sample "single" \
    --reference "hg19" \
    --strand "unstranded" \
    --small_test \
    --annotation "$ORIG_DIR/Annotation" \
    -with-report execution_reports/pipeline_report.html \
    -profile my_new_config # this was changed from "-profile slurm"!