4 Pipeline Overview
Diagram representing the “conceptual” workflow traversed by BiocMAP. Here some nextflow processes
are grouped together for simplicity; the exact processes traversed are enumerated below. The red box indicates the FASTQ files are inputs to the pipeline; green coloring denotes major output files from the pipeline; the remaining boxes represent computational steps. Yellow-colored steps are optional or not always performed; for example, preparing a particular set of annotation files occurs once and uses a cache for further runs. Finally, blue-colored steps are ordinary processes which occur on every pipeline execution.
4.1 Preparation Steps
The following processes in the pipeline are done only once for a given configuration, and are skipped on all subsequent runs:
4.1.1 Downloading and preparing required annotation files
- PullReference: when using default annotation, this process pulls the genome fasta from GENCODE, possibly subsets to canonical sequences (see annotation builds, and saves to the directory specified by
--annotation
. This is the file against which FASTQ reads are aligned, and methylation is extracted. - PrepareReference: when run as part of the first module: split the single FASTA file into individual files (for handling with Arioc) and write the configuration file later used by Arioc to encode these reference files. When run as part of the second module: runs
bismark_genome_preparation
with the--hisat
option, in preparation for possible methylation extraction usingBismark
(though note thatMethylDackel
is used instead by default!). - EncodeReference: run
AriocE
to encode reference files for use with theArioc
aligner. This process requires a GPU. - PrepareLambda when using the
--with_lambda
option as part of the second module, this process downloads the lambda bacteriophage genome and creates kallisto indices required for later estimation of bisulfite conversion efficiency.
4.2 Main Workflow Steps: First Module
- PreprocessInputs: Merge FASTQ files where specified in
samples.manifest
and create conveniently named soft links of each FASTQ file for internal use in BiocMAP. - FastQC_Untrimmed: FASTQ files are ran through FastQC as a preliminary quality control measure. By default, the presence of “adapter content” determined in this process decides whether a given sample is trimmed in the next step. See the
--trim_mode
command option for modifying this default behavior. - Trimming: See “FastQC_Untrimmed” above- FASTQ inputs are trimmed with
Trim Galore!
either based on adapter content from FastQC, or differently based on the--trim_mode
[command option]. FastQC is also run as a confirmation step after trimming. For samples that are trimmed, post-trimming FastQC metrics are collected and later gathered into an R data frame; for all other samples, these metrics are collected from the FastQC run prior to any trimming. - WriteAriocConfigs: Generate configurations file for use in
AriocE
, followed byAriocU
orAriocP
, using configuration settings specified in the appropriate config. Note two configuration files are created for each sample. - EncodeReads: Run
AriocE
to encode each sample in preparation for the alignment step with Arioc. - AlignReads: Align each sample to the reference genome using Arioc, creating at least one SAM file.
- FilterAlignments: Filter, sort, compress, and index alignments from Arioc. First, primary alignments with a
MAPQ
of at least 5 are kept, and duplicate alignments are removed usingsamblaster
. The result is coordinate-sorted, compressed into BAM format, and this BAM file is indexed. - MakeRules: A
rules.txt
file is created in the same directory assamples.manifest
so that the second module can be easily run by the user after completion of the first module.
4.3 Main Workflow Steps: Second Module
- PreprocessInputs: parse
rules.txt
andsamples.manifest
, the inputs to the second module, so that all relevant FASTQ files and logs may be handled correctly in later steps. - LambdaPseudo: when the
--with_lambda
option is specified, pseudoalignment to the lambda bacteriophage transcriptome is performed. Both the (overwhelmingly unmethylated) original transcriptome and then an “in-silico bisulfite-converted” version of the transcriptome are used in series. Successful map rates are used to estimate the entire experiment’s bisulfite conversion rate, a metric which is included in the final output R data frame. - BME: when using the
--use_bme
option, methylation extraction is performed usingbismark_methylation_extraction
. - Bismark2Bedgraph: when using the
--use_bme
option, thebismark2bedgraph
utility from Bismark is run to generate intermediatebedGraph
files, which are later used to generate text-based “cytosine reports”. - Coverage2Cytosine: when using the
--use_bme
option, thecoverage2cytosine
utility from Bismark created text-based “cytosine reports”, which are later read into R to produce the finalbsseq
objects. - MethylationExtraction: this process is run by default in place of the above Bismark-related processes, and performs methylation extraction with
MethylDackel
, ultimately generating “cytosine reports” comparable to those produced by thecoverage2cytosine
Bismark utility. - ParseReports: logs from FastQC, any trimming, alignment, methylation extraction, and, if applicable, bisulfite conversion estimation, are parsed to extract key metrics into an R data frame. This data frame is produced as both a standalone
.rda
file, and included as part of thecolData
slots in each of the outputbsseq
objects. - FormBsseqObjects:
bsseq
R objects are formed, one for each canonical chromosome and all samples, from information provided in the “cytosine reports” generated in either theMethylationExtraction
orCoverage2Cytosine
process. These are typically considered intermediary files, as the next process merges the manybsseq
objects from this step into just a pair of final outputs, but for very large experiments, the objects from this step may be more reasonable to work with (to reduce memory requirements). - MergeBsseqObjects: the final process in the pipeline, which combines all previous
bsseq
objects into just two objects- the first containing all cytosines in “CpG” context in the genome, and the other containing those in “CpH” context. ThecolData
slot in each of these objects contain the metrics extracted in theParseReports
process, some of which might be useful covariates in potential downstream statistical analyses. See outputs for more information.