4 Pipeline Overview
Diagram representing the “conceptual” workflow traversed by BiocMAP. Here some nextflow processes are grouped together for simplicity; the exact processes traversed are enumerated below. The red box indicates the FASTQ files are inputs to the pipeline; green coloring denotes major output files from the pipeline; the remaining boxes represent computational steps. Yellow-colored steps are optional or not always performed; for example, preparing a particular set of annotation files occurs once and uses a cache for further runs. Finally, blue-colored steps are ordinary processes which occur on every pipeline execution.
4.1 Preparation Steps
The following processes in the pipeline are done only once for a given configuration, and are skipped on all subsequent runs:
4.1.1 Downloading and preparing required annotation files
- PullReference: when using default annotation, this process pulls the genome fasta from GENCODE, possibly subsets to canonical sequences (see annotation builds, and saves to the directory specified by
--annotation. This is the file against which FASTQ reads are aligned, and methylation is extracted. - PrepareReference: when run as part of the first module: split the single FASTA file into individual files (for handling with Arioc) and write the configuration file later used by Arioc to encode these reference files. When run as part of the second module: runs
bismark_genome_preparationwith the--hisatoption, in preparation for possible methylation extraction usingBismark(though note thatMethylDackelis used instead by default!). - EncodeReference: run
AriocEto encode reference files for use with theAriocaligner. This process requires a GPU. - PrepareLambda when using the
--with_lambdaoption as part of the second module, this process downloads the lambda bacteriophage genome and creates kallisto indices required for later estimation of bisulfite conversion efficiency.
4.2 Main Workflow Steps: First Module
- PreprocessInputs: Merge FASTQ files where specified in
samples.manifestand create conveniently named soft links of each FASTQ file for internal use in BiocMAP. - FastQC_Untrimmed: FASTQ files are ran through FastQC as a preliminary quality control measure. By default, the presence of “adapter content” determined in this process decides whether a given sample is trimmed in the next step. See the
--trim_modecommand option for modifying this default behavior. - Trimming: See “FastQC_Untrimmed” above- FASTQ inputs are trimmed with
Trim Galore!either based on adapter content from FastQC, or differently based on the--trim_mode[command option]. FastQC is also run as a confirmation step after trimming. For samples that are trimmed, post-trimming FastQC metrics are collected and later gathered into an R data frame; for all other samples, these metrics are collected from the FastQC run prior to any trimming. - WriteAriocConfigs: Generate configurations file for use in
AriocE, followed byAriocUorAriocP, using configuration settings specified in the appropriate config. Note two configuration files are created for each sample. - EncodeReads: Run
AriocEto encode each sample in preparation for the alignment step with Arioc. - AlignReads: Align each sample to the reference genome using Arioc, creating at least one SAM file.
- FilterAlignments: Filter, sort, compress, and index alignments from Arioc. First, primary alignments with a
MAPQof at least 5 are kept, and duplicate alignments are removed usingsamblaster. The result is coordinate-sorted, compressed into BAM format, and this BAM file is indexed. - MakeRules: A
rules.txtfile is created in the same directory assamples.manifestso that the second module can be easily run by the user after completion of the first module.
4.3 Main Workflow Steps: Second Module
- PreprocessInputs: parse
rules.txtandsamples.manifest, the inputs to the second module, so that all relevant FASTQ files and logs may be handled correctly in later steps. - LambdaPseudo: when the
--with_lambdaoption is specified, pseudoalignment to the lambda bacteriophage transcriptome is performed. Both the (overwhelmingly unmethylated) original transcriptome and then an “in-silico bisulfite-converted” version of the transcriptome are used in series. Successful map rates are used to estimate the entire experiment’s bisulfite conversion rate, a metric which is included in the final output R data frame. - BME: when using the
--use_bmeoption, methylation extraction is performed usingbismark_methylation_extraction. - Bismark2Bedgraph: when using the
--use_bmeoption, thebismark2bedgraphutility from Bismark is run to generate intermediatebedGraphfiles, which are later used to generate text-based “cytosine reports”. - Coverage2Cytosine: when using the
--use_bmeoption, thecoverage2cytosineutility from Bismark created text-based “cytosine reports”, which are later read into R to produce the finalbsseqobjects. - MethylationExtraction: this process is run by default in place of the above Bismark-related processes, and performs methylation extraction with
MethylDackel, ultimately generating “cytosine reports” comparable to those produced by thecoverage2cytosineBismark utility. - ParseReports: logs from FastQC, any trimming, alignment, methylation extraction, and, if applicable, bisulfite conversion estimation, are parsed to extract key metrics into an R data frame. This data frame is produced as both a standalone
.rdafile, and included as part of thecolDataslots in each of the outputbsseqobjects. - FormBsseqObjects:
bsseqR objects are formed, one for each canonical chromosome and all samples, from information provided in the “cytosine reports” generated in either theMethylationExtractionorCoverage2Cytosineprocess. These are typically considered intermediary files, as the next process merges the manybsseqobjects from this step into just a pair of final outputs, but for very large experiments, the objects from this step may be more reasonable to work with (to reduce memory requirements). - MergeBsseqObjects: the final process in the pipeline, which combines all previous
bsseqobjects into just two objects- the first containing all cytosines in “CpG” context in the genome, and the other containing those in “CpH” context. ThecolDataslot in each of these objects contain the metrics extracted in theParseReportsprocess, some of which might be useful covariates in potential downstream statistical analyses. See outputs for more information.