If you want to reproduce this document, you will need the following R packages installed. It’s best if you install R 3.5 from CRAN and maybe even RStudio.
## Install R 3.5
install.packages('BiocManager')
BiocManager::install(c('recount', 'downloader', 'devtools', 'DT'))
First we load the R packages.
library('recount')
library('downloader')
library('devtools')
library('DT')
Next we load the prediction data from Shannon Ellis which is stored in a data.frame (a table) called PredictedPhenotypes
.
## Load the predictions
predfile <- 'PredictedPhenotypes_v0.0.03.rda'
url <- paste0("https://github.com/leekgroup/recount-website/blob/master/predictions/",
predfile, "?raw=true")
destfile <- file.path(tempdir(), predfile)
downloader::download(url, destfile = destfile, mode = "wb")
load(destfile)
## Explore the data a bit
head(PredictedPhenotypes)
## sample_id dataset reported_sex predicted_sex accuracy_sex
## 1 SRR660824 gtex male male 0.9981128
## 2 SRR2166176 gtex male male 0.9981128
## 3 SRR606939 gtex female female 0.9981128
## 4 SRR2167642 gtex male male 0.9981128
## 5 SRR2165473 gtex male male 0.9981128
## 6 SRR603858 gtex male male 0.9981128
## reported_samplesource predicted_samplesource accuracy_samplesource
## 1 tissue tissue 0.961732
## 2 tissue tissue 0.961732
## 3 tissue tissue 0.961732
## 4 tissue tissue 0.961732
## 5 tissue tissue 0.961732
## 6 tissue cell_line 0.961732
## reported_tissue predicted_tissue accuracy_tissue
## 1 Lung Lung 0.9505137
## 2 Brain Brain 0.9505137
## 3 Heart Heart 0.9505137
## 4 Brain Brain 0.9505137
## 5 Skin Skin 0.9505137
## 6 Lung Lung 0.9505137
## reported_sequencingstrategy predicted_sequencingstrategy
## 1 PAIRED PAIRED
## 2 PAIRED PAIRED
## 3 PAIRED PAIRED
## 4 PAIRED PAIRED
## 5 PAIRED PAIRED
## 6 PAIRED PAIRED
## accuracy_sequencingstrategy
## 1 0.9992661
## 2 0.9992661
## 3 0.9992661
## 4 0.9992661
## 5 0.9992661
## 6 0.9992661
dim(PredictedPhenotypes)
## [1] 70479 14
Now we can identify which of the 70k+ samples are predicted to be from the brain.
brain <- tolower(PredictedPhenotypes$predicted_tissue) == 'brain'
length(brain)
## [1] 70479
table(brain)
## brain
## FALSE TRUE
## 60314 6199
Using the all_metadata()
function from recount
we download the metadata for the roughly 50k samples from SRA.
meta <- all_metadata()
## 2017-07-06 15:40:15 downloading the metadata to /var/folders/cx/n9s558kx6fb7jf5z_pgszgb80000gn/T//RtmpEc4716/metadata_clean_sra.Rdata
dim(meta)
## [1] 50099 21
We can match the predicted brain samples to the SRA metadata using the run
identifier. Since the metadata doesn’t include the roughly 20k samples from GTEx and TCGA, not all predicted brain samples match. That is ok because we want to focus on the SRA data for which we don’t have proper metadata.
map <- match(PredictedPhenotypes$sample_id[brain], meta$run)
map <- map[!is.na(map)]
## Number of predicted brain samples in SRA
length(map)
## [1] 4601
## Number of corresponding studies in SRA with at least 1 predicted brain sample
length(unique(meta$project[map]))
## [1] 363
Now we can get the project ids and check what percent of the samples for a given project have predicted brain samples.
proj <- meta$project[map]
proj.n <- table(proj)
## Get the total number of samples in the projects
proj.all <- table(meta$project)
map.proj <- match(names(proj.n), names(proj.all))
proj.all <- proj.all[map.proj]
## Drop projects with less than 4 samples
table(proj.all >= 4)
##
## FALSE TRUE
## 37 326
proj.n <- proj.n[proj.all >= 4]
proj.all <- proj.all[proj.all >= 4]
## Calculate the percent of predicted brain samples
proj.perc <- as.vector(round(proj.n / proj.all * 100, 2))
## Explore that distribution
summary(proj.perc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.15 5.88 18.30 33.79 59.58 100.00
boxplot(proj.perc, main = 'Percent of predicted brain samples')
To reduce false positives brain predictions, we will focus on studies that are mostly brain samples.
cutoffs <- c(70, 80, 90)
cutoffs.pass <- sapply(cutoffs, function(cut) { table(proj.perc > cut) })
colnames(cutoffs.pass) <- cutoffs
cutoffs.pass
## 70 80 90
## FALSE 259 267 275
## TRUE 67 59 51
We can then put all the information in a single table, export it to a text file, and explore it interactively in this document.
proj.df <- data.frame(brain_predicted_n = as.vector(proj.n), project_n = as.vector(proj.all),
brain_percent = proj.perc, project_id = names(proj.n))
proj.df <- proj.df[order(proj.df$brain_percent, decreasing = TRUE), ]
write.table(proj.df, file = 'projects_list.txt', sep = '\t', col.names = TRUE, quote = FALSE, row.names = FALSE)
datatable(proj.df,
options = list(pagingType='full_numbers', pageLength=25, scrollX='100%'),
escape = FALSE, rownames = FALSE)
Lets focus on the projects where at least 70 percent of the samples are predicted to come from the brain. That would include project SRP025982
which from https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP025982 is mostly “universal human brain”.
## Reproducibility information
print('Reproducibility information:')
## [1] "Reproducibility information:"
Sys.time()
## [1] "2017-07-06 15:40:18 EDT"
proc.time()
## user system elapsed
## 11.375 0.756 16.828
options(width = 120)
session_info()
## Session info ----------------------------------------------------------------------------------------------------------
## setting value
## version R version 3.4.0 (2017-04-21)
## system x86_64, darwin15.6.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## date 2017-07-06
## Packages --------------------------------------------------------------------------------------------------------------
## package * version date source
## acepack 1.4.1 2016-10-29 CRAN (R 3.4.0)
## AnnotationDbi 1.38.1 2017-06-01 Bioconductor
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.0 2017-04-21 local
## base64enc 0.1-3 2015-07-28 CRAN (R 3.4.0)
## Biobase * 2.36.2 2017-05-04 Bioconductor
## BiocGenerics * 0.22.0 2017-04-25 Bioconductor
## BiocParallel 1.10.1 2017-05-03 Bioconductor
## biomaRt 2.32.1 2017-06-09 Bioconductor
## Biostrings 2.44.1 2017-06-01 Bioconductor
## bit 1.1-12 2014-04-09 CRAN (R 3.4.0)
## bit64 0.9-7 2017-05-08 CRAN (R 3.4.0)
## bitops 1.0-6 2013-08-17 CRAN (R 3.4.0)
## blob 1.1.0 2017-06-17 CRAN (R 3.4.0)
## BSgenome 1.44.0 2017-04-25 Bioconductor
## bumphunter 1.17.2 2017-05-09 Bioconductor
## checkmate 1.8.2 2016-11-02 CRAN (R 3.4.0)
## cluster 2.0.6 2017-03-10 CRAN (R 3.4.0)
## codetools 0.2-15 2016-10-05 CRAN (R 3.4.0)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.0)
## compiler 3.4.0 2017-04-21 local
## data.table 1.10.4 2017-02-01 CRAN (R 3.4.0)
## datasets * 3.4.0 2017-04-21 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.0)
## DelayedArray * 0.2.7 2017-06-03 Bioconductor
## derfinder 1.10.4 2017-05-11 Bioconductor
## derfinderHelper 1.10.0 2017-04-25 Bioconductor
## devtools * 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
## doRNG 1.6.6 2017-04-10 CRAN (R 3.4.0)
## downloader * 0.4 2015-07-09 CRAN (R 3.4.0)
## DT * 0.2 2016-08-09 CRAN (R 3.4.0)
## evaluate 0.10 2016-10-11 CRAN (R 3.4.0)
## foreach 1.4.3 2015-10-13 CRAN (R 3.4.0)
## foreign 0.8-69 2017-06-21 CRAN (R 3.4.0)
## Formula 1.2-1 2015-04-07 CRAN (R 3.4.0)
## GenomeInfoDb * 1.12.2 2017-06-09 Bioconductor
## GenomeInfoDbData 0.99.0 2017-02-14 Bioconductor
## GenomicAlignments 1.12.1 2017-05-12 Bioconductor
## GenomicFeatures 1.28.3 2017-06-09 Bioconductor
## GenomicFiles 1.12.0 2017-04-26 Bioconductor
## GenomicRanges * 1.28.3 2017-05-25 Bioconductor
## GEOquery 2.42.0 2017-04-25 Bioconductor
## ggplot2 2.2.1 2016-12-30 CRAN (R 3.4.0)
## graphics * 3.4.0 2017-04-21 local
## grDevices * 3.4.0 2017-04-21 local
## grid 3.4.0 2017-04-21 local
## gridExtra 2.2.1 2016-02-29 CRAN (R 3.4.0)
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.0)
## Hmisc 4.0-3 2017-05-02 CRAN (R 3.4.0)
## htmlTable 1.9 2017-01-26 CRAN (R 3.4.0)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## htmlwidgets 0.8 2016-11-09 CRAN (R 3.4.0)
## httr 1.2.1 2016-07-03 CRAN (R 3.4.0)
## IRanges * 2.10.2 2017-05-25 Bioconductor
## iterators 1.0.8 2015-10-13 CRAN (R 3.4.0)
## jsonlite 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.16 2017-05-18 CRAN (R 3.4.0)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## latticeExtra 0.6-28 2016-02-09 CRAN (R 3.4.0)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.4.0)
## locfit 1.5-9.1 2013-04-20 CRAN (R 3.4.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## Matrix 1.2-10 2017-04-28 CRAN (R 3.4.0)
## matrixStats * 0.52.2 2017-04-14 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.0 2017-04-21 local
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.0)
## nnet 7.3-12 2016-02-02 CRAN (R 3.4.0)
## parallel * 3.4.0 2017-04-21 local
## pkgmaker 0.22 2014-05-14 CRAN (R 3.4.0)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.0)
## qvalue 2.8.0 2017-04-25 Bioconductor
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.4.0)
## Rcpp 0.12.11 2017-05-22 CRAN (R 3.4.0)
## RCurl 1.95-4.8 2016-03-01 CRAN (R 3.4.0)
## recount * 1.2.1 2017-05-06 Bioconductor
## registry 0.3 2015-07-08 CRAN (R 3.4.0)
## rentrez 1.1.0 2017-06-01 CRAN (R 3.4.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.0)
## rlang 0.1.1 2017-05-18 CRAN (R 3.4.0)
## rmarkdown 1.6 2017-06-15 CRAN (R 3.4.0)
## rngtools 1.2.4 2014-03-06 CRAN (R 3.4.0)
## rpart 4.1-11 2017-03-13 CRAN (R 3.4.0)
## rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
## Rsamtools 1.28.0 2017-04-25 Bioconductor
## RSQLite 2.0 2017-06-19 CRAN (R 3.4.0)
## rtracklayer 1.36.3 2017-05-25 Bioconductor
## S4Vectors * 0.14.3 2017-06-03 Bioconductor
## scales 0.4.1 2016-11-09 CRAN (R 3.4.0)
## splines 3.4.0 2017-04-21 local
## stats * 3.4.0 2017-04-21 local
## stats4 * 3.4.0 2017-04-21 local
## stringi 1.1.5 2017-04-07 CRAN (R 3.4.0)
## stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
## SummarizedExperiment * 1.6.3 2017-05-29 Bioconductor
## survival 2.41-3 2017-04-04 CRAN (R 3.4.0)
## tibble 1.3.3 2017-05-28 CRAN (R 3.4.0)
## tools 3.4.0 2017-04-21 local
## utils * 3.4.0 2017-04-21 local
## VariantAnnotation 1.22.2 2017-06-22 Bioconductor
## withr 1.0.2 2016-06-20 CRAN (R 3.4.0)
## XML 3.98-1.9 2017-06-19 CRAN (R 3.4.0)
## xtable 1.8-2 2016-02-05 CRAN (R 3.4.0)
## XVector 0.16.0 2017-04-25 Bioconductor
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
## zlibbioc 1.22.0 2017-04-25 Bioconductor