11 Glossary

Cluster: shorthand for “performance computing cluster”- a network of many machines, which a member may utilize to interactively run or submit computational jobs. Most data analysis for genomics data is performed on a cluster, which collectively has far more memory and computing power than say, an individual’s laptop or desktop machine.

Docker: Docker is a tool for creating, sharing, and running containers. Containers package up software and dependencies, so that a piece of software runs identically on any machine (capable of running docker containers). The idea is to avoid difficult-to-predict differences in software and operating system versions between machines, which often affect how a single program behaves.

Docker image: An image describes the exact environment (operating system, installed programs, etc) needed to run a piece of software. Containers are the processes which run from a specified Docker image. For example, the libddocker/rseqc:3.0.1 docker image exists to run RSeQC 3.0.1- thus the image contains a particular version of Python and associated libraries, ensuring consistent execution on a different machine.

ERCC: External RNA Controls Consortium, a group who developed a set of controls for sources of variability in RNA-seq expression data attributed to platform, starting material quality, and other experimental covariates. ERCC spike-ins are polyadenylated transcripts which can be added at known concentration to each sample. Quantifying these spike-ins during analysis can help calibrate RNA-seq results and adjust for technical confounders. SPEAQeasy provides the --ercc option for experiments using ERCC spike-ins.

FASTA: a file format for storing nucleotide sequences in plain-text

Features: genomic sites of interest- in the case of SPEAQeasy, features include genes, exons, exon-exon junctions, and transcripts. These features are described using GenomicRanges objects as part of the main RangedSummarizedExperiment outputs from SPEAQeasy.

GTF: a file format for storing sequence ranges and associated annotation/information

MD5 sum: a number (often displayed as a hexadecimal string) commonly used as a checksum to verify the integrity of files (such as after transferring a large file to a new filesystem). The samples.manifest file optionally can include MD5 sums for each FASTQ file, though SPEAQeasy does not verify file integrity.

Module: in the context of SPEAQeasy and its documentation, a module refers to an Lmod environment module. Nextflow can be very simply configured to use modules to manage software, but this typically involves significantly more work unless the modules already exist on the system where SPEAQeasy will be run. Modules are “units of software” which can be loaded with a simple command, and potentially shared by many users.

Nextflow: the workflow management language SPEAQeasy runs on, which organizes pieces of the pipeline into one coherent workflow that runs largely the same way for users of many different computer systems.

Bookdown: an open-source R package we use to help generate and organize this documentation website, written in R markdown.

RNA-seq: short for RNA sequencing- a high-throughput method for quantifying transcripts present in a sample or samples. Samples of mRNA are converted into cDNA libraries, at which point alignment to a reference genome and other analysis steps can proceed. RNA-seq is commonly used to quantify gene expression, though other information about the transciptome can be captured with RNA-seq (such as alternative splicing events).

Stranded/strandness: the orientation of RNA reads during sequencing. Depending on the sequencing protocol, some reads may be oriented in the direction of the ‘sense’ strand of corresponding cDNA, or potentially some in the ‘antisense’ direction. Strand-specific protocols are described as either “forward” or “reverse”; protocols which do not have a specific read orientation are called “unstranded” within SPEAQeasy.