9 Software
PopTop makes use of several external bioinformatics software tools. These tools are all public and readily available at the links below.
9.1 Software Versions
Here is the full list of software used by this pipeline:
Software | Version | Command used by the pipeline |
---|---|---|
bcftools | 1.13 | bcftools |
htslib | 1.13.2 | tabix |
nextflow | >=0.27.0 (tested with 20.01.0) | nextflow |
R | 4.1.2 | Rscript |
plink | 1.90b6.6 | plink |
perl | 5.24.4 | perl |
9.1.1 Prerequisites
- familiarity with basic glob expressions
- familiarity with basic shell scripting (if-statements, variables, and the
mv
command) - knowledge about what output files are created from both the original and replacement software tool
- basic knowledge of the PopTop workflow
- the replacement software tool should be on the PATH
9.1.2 Modifying main.nf
9.1.2.1 Replace the relevant command
We will begin by locating the relevant process in the file main.nf
. While process names are intended to be intuitive, this step may require some inference. We are modifying the way PopTop data clenses files, so the process of interest is called “CleanPlink”, and we locate this around line 70 (the exact number may change).
Next, we try to locate the command that runs the current software tool, plink. Typically, there will only be a single command (possibly formatted to be spread across multiple lines for clarity). We look for this command in the section of the shell
block of the process, enclosed in triple single-quotes. The below “template” version of the CleanPlink
process may serve as a guide for where to look.
process CleanPlink {
publishDir "${params.output}/logs", mode:'copy', pattern:'clean_plink.log'
input:
//file plink_beds
file plink_raw_out_clean
file hapmap from file("${workflow.projectDir}/test_small/hapmap3r2_CEU.CHB.JPT.YRI.no-at-cg-snps.txt")
//file plink_out
output:
file "clean_plink.log"
file "plinkFiles_byChr/*.chr*.{fam,bim,bed}" into filtered_plinks_out
shell:
'''
mkdir plinkFiles
name=$(echo LIBD_*_raw.bed | rev | cut -d "." -f 2- | rev)
# STEP 3: filter per GWAS standards
plink --bfile $name --make-bed --out plinkFiles/$name
plink --bfile $name --maf !{params.maf_min} --geno !{params.max_missing} --hwe !{params.hwe_min} --make-bed --out plinkFiles/$name
# STEP 4: check sex
plink --bfile $name --check-sex --out plinkFiles/$name
# STEP 5-6: write out problematic sexes
grep PROBLEM plinkFiles/$name.sexcheck > plinkFiles/$name.sexprobs
# STEP 7: elevated missingness rates
plink --bfile $name --missing --out plinkFiles/$name
# STEP 8: check elevated het rates
plink --bfile $name --het --out plinkFiles/$name
## 11 samples were elevated about 10%, keep in for now
# STEP 11: LD indepdent SNPs
plink --bfile $name --indep-pairwise !{params.indep_pairwise} --out plinkFiles/$name
# STEP 12: cluster LD indepdent SNPs
plink --bfile $name --extract plinkFiles/$name.prune.in --genome --out plinkFiles/$name
# STEP 13: check IBD
#perl ../run-IBD-QC.pl plinkFiles/$name
#mv fail-IBD-QC.txt plinkFiles/
### STOP HERE FOR IMPUTATION
# split into chunks
mkdir -p plinkFiles_byChr
for chr in $(seq 1 26); do
plink --bfile $name --chr $chr --make-bed --out plinkFiles_byChr/$name.chr$chr ;
done
cp .command.log clean_plink.log
'''
}
In the CleanPlink
process, there is a quite a bit of bash code used to conditionally trim samples depending on parameters from the configuration file. We can keep most of this bash code exactly how it is and simply alter the configuration file to our desired specifciations. From above, the command to run the plink filtering looks like this in the configuration file:
params {
// ------------------------------------------------------------------------
// Software paths for tools without modules
// ------------------------------------------------------------------------
shapeit = "/dcl01/lieber/RNAseq/Datasets/BrainGenotyping_2018/Software/shapeit.v2.904.2.6.32-696.18.7.el6.x86_64/bin/shapeit"
// ------------------------------------------------------------------------
// Settings for "CleanPlink" process
// ------------------------------------------------------------------------
// Minimum cutoff minor allele frequency used to filter ('--maf' option)
maf_min = 0.005
// Maximum missing call frequency used to filter ('--geno' option)
max_missing = 0.05
// Minimum Hardy-Weinberg equilibrium exact test p-value threshold ('--hwe'
// option)
hwe_min = 0.00001
// Value for the '--indep-pairwise' argument to plink
indep_pairwise = "50 5 0.2"
}