5 Annotation

BiocMAP can be run with hg38, hg19, or mm10 references. The pipeline has a default, automated process for pulling and building annotation-related files, but the user can instead provide their own annotation. Both options are documented below. The example annotation files listed here are the ones used by the default configuration when the hg38 reference is selected.

  • A genome assembly FASTA: the reference genome to align reads to, like the file here (but unzipped).
  • Gene annotation GTF: containing transcript data, like the file here (but unzipped).
  • The lambda transcriptome: for experiments utilizing spike-ins of the lambda bacteriophage genome, the transcriptome provided here is used (but unzipped).

5.1 Default Annotation

BiocMAP uses annotation files provided by GENCODE.

5.1.1 Choosing a release

With genome annotation constantly being updated, the user may want to use a particular GENCODE release. The configuration variables gencode_version_human and gencode_version_mouse control which GENCODE release is used for the human and mouse genomes, respectively.

params {
    //----------------------------------------------------
    //  Annotation-related settings
    //----------------------------------------------------

    gencode_version_human = "34"
    gencode_version_mouse = "M23"
    anno_build = "main" // main or primary (main is canonical seqs only)
}
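
To make the role of the two version variables concrete, the following is a minimal Python sketch of how a release string could map onto a GENCODE download location. It is not BiocMAP's own code: the function name gencode_fasta_url is hypothetical, and the URL pattern is an assumption based on the public GENCODE FTP layout for primary assembly FASTA files. Only hg38 and mm10 are covered, to keep the sketch short.

```python
# Illustrative only: not part of BiocMAP. The URL layout below is an
# assumption based on the public GENCODE FTP site.
GENCODE_BASE = "https://ftp.ebi.ac.uk/pub/databases/gencode"


def gencode_fasta_url(reference, gencode_version_human="34",
                      gencode_version_mouse="M23"):
    """Return the assumed URL of the primary assembly FASTA for a reference.

    `reference` is "hg38" or "mm10"; the version arguments mirror the
    configuration variables of the same names.
    """
    if reference == "hg38":
        return (f"{GENCODE_BASE}/Gencode_human/"
                f"release_{gencode_version_human}/"
                "GRCh38.primary_assembly.genome.fa.gz")
    if reference == "mm10":
        # mm10 corresponds to GRCm38; newer mouse releases that use a
        # different assembly are outside the scope of this sketch.
        return (f"{GENCODE_BASE}/Gencode_mouse/"
                f"release_{gencode_version_mouse}/"
                "GRCm38.primary_assembly.genome.fa.gz")
    raise ValueError(f"reference not handled in this sketch: {reference}")


human_url = gencode_fasta_url("hg38", gencode_version_human="34")
mouse_url = gencode_fasta_url("mm10", gencode_version_mouse="M23")
```

Changing gencode_version_human from "34" to another release only changes the release_ component of the path, which is why a single configuration variable is enough to pin the annotation version.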

5.1.2 Choosing a “build”

Depending on your analysis, you may wish to consider only the reference chromosomes (for humans, the 25 sequences “chr1” through “chrM”) for alignment and methylation extraction. For a given release and reference, BiocMAP lets you choose between two annotation “builds”, called “main” and “primary” (following the naming convention used by GENCODE).

  • The “main” build consists of only the canonical “reference” sequences for each species.
  • The “primary” build consists of the canonical “reference” sequences plus additional scaffolds, as found in a genome primary assembly FASTA from GENCODE.

To make this selection for your pipeline run, set the variable anno_build in your configuration file.
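
The difference between the two builds can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BiocMAP's implementation: the function name keep_main_sequences is hypothetical, and the sketch assumes that canonical sequences in a GENCODE primary assembly FASTA have names beginning with "chr" while the additional scaffolds do not.

```python
from typing import Iterable, Iterator


def keep_main_sequences(fasta_lines: Iterable[str]) -> Iterator[str]:
    """Yield only the FASTA records whose sequence name starts with "chr".

    Assumption: canonical sequences are named chr1 ... chrM and scaffolds
    carry other names (for example accession-style identifiers).
    """
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token
            # after the ">" of the header line.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name.startswith("chr")
        if keep:
            yield line


# A tiny in-memory "primary" FASTA: two canonical sequences and one scaffold.
primary = [
    ">chr1 1\n", "ACGT\n",
    ">GL000008.2 GL000008.2\n", "TTTT\n",
    ">chrM MT\n", "GGCC\n",
]
main = list(keep_main_sequences(primary))
```

Here main retains the chr1 and chrM records and drops the scaffold, which mirrors the relationship between the “primary” and “main” builds described above.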

5.2 Custom Annotation

You may wish to provide a genome FASTA (the reference genome to align reads to), such as the file here, in place of the automatically managed GENCODE files described in the above section.

You must also add the --custom_anno [label] argument to your execution script to indicate that you are using custom annotation files. The “label” is a string that will be included in the names of files generated from the annotation files you provide. This allows you to use many different custom annotations, each identified by a unique and informative name of your choosing. The label can be anything except an empty string (which internally signifies that custom annotation is not in use).