9 Software

PopTop makes use of several external bioinformatics software tools. These tools are all public and readily available at the links below.

9.1 Software Versions

Here is the full list of software used by this pipeline:

Software Version Command used by the pipeline
bcftools 1.13 bcftools
htslib 1.13.2 tabix
nextflow >=0.27.0 (tested with 20.01.0) nextflow
R 4.1.2 Rscript
plink 1.90b6.6 plink
perl 5.24.4 perl

9.1.1 Prerequisites

  • familiarity with basic glob expressions
  • familiarity with basic shell scripting (if-statements, variables, and the mv command)
  • knowledge about what output files are created from both the original and replacement software tool
  • basic knowledge of the PopTop workflow
  • the replacement software tool should be on the PATH

9.1.2 Modifying main.nf

9.1.2.1 Replace the relevant command

We will begin by locating the relevant process in the file main.nf. While process names are intended to be intuitive, this step may require some inference. We are modifying the way PopTop data clenses files, so the process of interest is called “CleanPlink”, and we locate this around line 70 (the exact number may change).

process CleanPlink {

Next, we try to locate the command that runs the current software tool, plink. Typically, there will only be a single command (possibly formatted to be spread across multiple lines for clarity). We look for this command in the section of the shell block of the process, enclosed in triple single-quotes. The below “template” version of the CleanPlink process may serve as a guide for where to look.

process CleanPlink {
    publishDir "${params.output}/logs", mode:'copy', pattern:'clean_plink.log'

    input:
        //file plink_beds
        file plink_raw_out_clean
        file hapmap from file("${workflow.projectDir}/test_small/hapmap3r2_CEU.CHB.JPT.YRI.no-at-cg-snps.txt")
        //file plink_out

    output:
        file "clean_plink.log"
        file "plinkFiles_byChr/*.chr*.{fam,bim,bed}" into filtered_plinks_out

    shell:
        '''
        mkdir plinkFiles
        name=$(echo LIBD_*_raw.bed | rev | cut -d "." -f 2- | rev)
        
        # STEP 3: filter per GWAS standards
        plink --bfile $name --make-bed --out plinkFiles/$name
        plink --bfile $name --maf !{params.maf_min} --geno !{params.max_missing} --hwe !{params.hwe_min} --make-bed --out plinkFiles/$name
        # STEP 4: check sex
        plink --bfile $name --check-sex --out plinkFiles/$name
        # STEP 5-6: write out problematic sexes
        grep PROBLEM plinkFiles/$name.sexcheck > plinkFiles/$name.sexprobs
        # STEP 7: elevated missingness rates
        plink --bfile $name --missing --out plinkFiles/$name
        # STEP 8: check elevated het rates
        plink --bfile $name --het --out plinkFiles/$name
        ## 11 samples were elevated about 10%, keep in for now
        # STEP 11: LD indepdent SNPs
        plink --bfile $name --indep-pairwise !{params.indep_pairwise} --out plinkFiles/$name
        # STEP 12: cluster LD indepdent SNPs
        plink --bfile $name --extract plinkFiles/$name.prune.in --genome --out plinkFiles/$name
        # STEP 13: check IBD
        #perl ../run-IBD-QC.pl plinkFiles/$name
        #mv fail-IBD-QC.txt plinkFiles/
        ### STOP HERE FOR IMPUTATION
        # split into chunks
        mkdir -p plinkFiles_byChr
        for chr in $(seq 1 26); do
            plink --bfile $name --chr $chr --make-bed --out plinkFiles_byChr/$name.chr$chr ;
        done
        cp .command.log clean_plink.log
        '''
}

In the CleanPlink process, there is a quite a bit of bash code used to conditionally trim samples depending on parameters from the configuration file. We can keep most of this bash code exactly how it is and simply alter the configuration file to our desired specifciations. From above, the command to run the plink filtering looks like this in the configuration file:

params {
    // ------------------------------------------------------------------------
    // Software paths for tools without modules
    // ------------------------------------------------------------------------
    
    shapeit = "/dcl01/lieber/RNAseq/Datasets/BrainGenotyping_2018/Software/shapeit.v2.904.2.6.32-696.18.7.el6.x86_64/bin/shapeit"
    
    // ------------------------------------------------------------------------
    // Settings for "CleanPlink" process
    // ------------------------------------------------------------------------
    
    // Minimum cutoff minor allele frequency used to filter ('--maf' option)
    maf_min = 0.005
    
    // Maximum missing call frequency used to filter ('--geno' option)
    max_missing = 0.05
    
    // Minimum Hardy-Weinberg equilibrium exact test p-value threshold ('--hwe'
    // option)
    hwe_min = 0.00001
    
    // Value for the '--indep-pairwise' argument to plink
    indep_pairwise = "50 5 0.2"
}