This document contains the code that creates the recount_brain version 2 table by merging recount-brain version 1 with the metadata in GTEx and TCGA. Dustin Sokolowski created the recount_brain_v2 analysis with supervision from Michael D Wilson. Leonardo Collado-Torres edited this document.
Two categories of files are loaded here. First, the metadata from recount is downloaded: recount_brain with the add_metadata() function, and TCGA and GTEx with the all_metadata() function. Second, some additional information about GTEx samples is added. Specifically, sample age, sex, and Hardy death classification are taken from gtex_pheno.csv. Information regarding sample fixing and sample freezing is found in gtex_sampinfo.csv. These two files can be downloaded from https://github.com/LieberInstitute/recount-brain/tree/master/cross_studies_metadata/GTEx_extra. They are csv files adapted from the links documented in the code.
library('recount')
# below are the files required to combine datasets
#GTEx & TCGA metadata from recount
recount_brain <- add_metadata(source = "recount_brain_v1")
## 2020-11-13 16:24:16 downloading the recount_brain metadata to /tmp/RtmpK9pZcs/recount_brain_v1.Rdata
## Loading objects:
## recount_brain
GTEx <- recount::all_metadata("gtex")
## 2020-11-13 16:24:17 downloading the metadata to /tmp/RtmpK9pZcs/metadata_clean_gtex.Rdata
tcga <- recount::all_metadata("tcga")
## 2020-11-13 16:24:18 downloading the metadata to /tmp/RtmpK9pZcs/metadata_clean_tcga.Rdata
# Read txt file downloaded from:
# "https://storage.googleapis.com/gtex_analysis_v7/annotations/GTEx_v7_Annotations_SubjectPhenotypesDS.txt"
# this dataset was also converted into a csv before being loaded into R
gtex_pheno <- read.csv("https://raw.githubusercontent.com/LieberInstitute/recount-brain/master/cross_studies_metadata/GTEx_extra/gtex_pheno.csv",
header = T, as.is = T)
# Remaining phenotype information that may be useful for GTEx metadata
# https://storage.googleapis.com/gtex_analysis_v7/annotations/GTEx_v7_Annotations_SampleAttributesDS.txt
# This dataset was converted into a csv before being loaded into R
gtexSampinfo <- read.csv("https://raw.githubusercontent.com/LieberInstitute/recount-brain/master/cross_studies_metadata/GTEx_extra/gtex_sampinfo.csv",
header = T, as.is = T)
# Supplementary table 1 to re-order columns
notes <- read.csv("https://raw.githubusercontent.com/LieberInstitute/recount-brain/master/SupplementaryTable1.csv",
header= T, as.is = T)
# Generic function to convert a factor into a character vector
tochr <- function(x) return(as.character(levels(x))[x])
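As a quick illustration of what tochr() does, the sketch below applies it to a small made-up factor (the grade values are hypothetical, not taken from the data). Because the helper indexes the factor's levels, it only behaves as intended on factors; on a plain character vector there are no levels to index and every value comes back as NA.

```r
# Helper from the chunk above, repeated so this sketch is self-contained
tochr <- function(x) return(as.character(levels(x))[x])

# Hypothetical factor of histological grades, including a missing value
grade <- factor(c("G3", "G2", NA, "G3"))

# Character labels, with NA preserved
grade_chr <- tochr(grade)
grade_chr

# For comparison: the underlying integer codes, not the labels
as.integer(grade)

# On a character vector there are no levels, so every value becomes NA
chr_result <- tochr(c("G2", "G3"))
all(is.na(chr_result))
```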
The code chunk below extracts brain samples from GTEx using the smts column. Next, the sample IDs are adjusted so that the GTEx metadata from recount can be easily merged with the GTEx metadata from the phenotype and sample files. Finally, these files are merged.
GTEX_brain <- GTEx[GTEx$smts == "Brain",] # brain samples in gtex
# Change the sample id of the "sampid" column to the first 9 or 10 characters so that the GTEX_brain and gtex_pheno columns can be merged
s <- substr(GTEX_brain$sampid, 1,10)
s1 <- c()
for(i in s) {
last <- substr(i, nchar(i), nchar(i))
if(last == "-") {
s1 <- c(s1, substr(i, 1, nchar(i)-1))
} else {
s1 <- c(s1, i)
}
}
GTEX_brain$SUBJID <- s1
# merge GTEX_brain and gtex_pheno
GTEx_brain_merge <- merge(GTEX_brain, gtex_pheno, by = "SUBJID")
#GTEx brain samples
gtexSampinfo_brain <- intersect(GTEx_brain_merge$sampid, gtexSampinfo$SAMPID)
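The subject ID extraction above can also be written without a loop. The following is a minimal sketch on invented sample IDs that only mimic the GTEx naming pattern; the toy tables samples and pheno are assumptions made for illustration.

```r
# Invented sample IDs in the GTEx style (illustrative only)
sampid <- c("GTEX-AB12-0011-R10A-SM-XXXXX", "GTEX-1ABCD-0226-SM-YYYYY")

# Keep the first 10 characters, then drop a trailing "-" if present,
# which leaves the 9 or 10 character subject ID
subjid <- sub("-$", "", substr(sampid, 1, 10))
subjid

# Toy phenotype table keyed by subject ID (values are invented)
pheno <- data.frame(SUBJID = c("GTEX-AB12", "GTEX-1ABCD"),
                    SEX = c(1, 2), stringsAsFactors = FALSE)
samples <- data.frame(sampid = sampid, SUBJID = subjid,
                      stringsAsFactors = FALSE)

# Merge sample-level and subject-level information on the shared key
merged <- merge(samples, pheno, by = "SUBJID")
merged
```

Vectorising avoids growing a vector inside a loop, which matters little for a few thousand samples but keeps the intent (strip a trailing dash) visible in one line.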
The code below processes the columns related to tissue location. Specifically, the tissue location is taken from smtsd, and Brodmann areas are extracted from the same column. Finally, GTEx putamen samples are assigned to the right hemisphere; all other sampled regions are treated as bilateral.
#Tissue location
tissue_1_gtex <- substr(GTEx_brain_merge$smtsd, 9,
nchar(GTEx_brain_merge$smtsd))
# Brodmann areas
broadman_gtex <- c()
for(i in 1:nrow(GTEx_brain_merge)) {
if(GTEx_brain_merge$smtsd[i] %in% "Brain - Anterior cingulate cortex (BA24)") {
broadman_gtex[i] <- 24
next
}
if(GTEx_brain_merge$smtsd[i] %in% "Brain - Frontal Cortex (BA9)") {
broadman_gtex[i] <- 9
next
}
broadman_gtex[i] <- NA
}
# mapping putamen to right hemisphere
hemisphere_gtex <- c()
for(i in GTEx_brain_merge$smtsd) {
if(i == "Brain - Putamen (basal ganglia)") {
hemisphere_gtex <- c(hemisphere_gtex, "right")
} else {
hemisphere_gtex <- c(hemisphere_gtex, "bilateral")
}
}
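The same three derived columns can be sketched with vectorised string functions. The example below uses a short hand-written vector of smtsd-style labels; the regular expression for Brodmann areas is an assumption that generalises the two hard-coded cases above.

```r
# A few smtsd-style labels (hand-written for illustration)
smtsd <- c("Brain - Frontal Cortex (BA9)",
           "Brain - Putamen (basal ganglia)",
           "Brain - Anterior cingulate cortex (BA24)",
           "Brain - Cortex")

# Tissue name: drop the leading "Brain - " prefix
tissue <- sub("^Brain - ", "", smtsd)

# Brodmann area: the number inside "(BA<number>)", NA when absent
has_ba <- grepl("\\(BA[0-9]+\\)", smtsd)
brodmann <- rep(NA_integer_, length(smtsd))
brodmann[has_ba] <- as.integer(
    sub(".*\\(BA([0-9]+)\\).*", "\\1", smtsd[has_ba]))

# Hemisphere: putamen samples are labelled right, everything else bilateral
hemisphere <- ifelse(smtsd == "Brain - Putamen (basal ganglia)",
                     "right", "bilateral")

data.frame(tissue, brodmann, hemisphere)
```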
The code below handles age, sex, and disease. In terms of age, every brain sample comes from a donor older than 20, so the developmental stage is always adult. Disease is organized by the Hardy scale: fast natural deaths and ventilator cases make it difficult to determine disease status and are labeled “either”, violent/fast deaths are treated as controls, and “disease” refers to individuals who were previously ill. Finally, recount_brain_v2 only uses public data; more in-depth information is private.
#developmental stage
development <- "adult"
# DTHHRDY explanation
stat <- c("ventilator", "violent_fast", "fast_natural", "ill_unexpected", "ill_expected")
#GTEx disease mapping
gtex_disease <- c()
gtex_disease_status <- c()
for(i in GTEx_brain_merge$DTHHRDY) {
gtex_disease <- c(gtex_disease, stat[i+1])
if(i == 1) {
gtex_disease_status <- c(gtex_disease_status, "Control")
}
if(i == 2 | i == 0) {
gtex_disease_status <- c(gtex_disease_status, "either")
}
if(i == 3 | i == 4) {
gtex_disease_status <- c(gtex_disease_status, "Disease")
}
}
# mapping sex to character
sex_character <- c("male", "female")
sex <- c()
for(i in GTEx_brain_merge$SEX) sex <- c(sex, sex_character[i])
#RNA isolation type
technique <- paste0("RNA Seq, ", GTEx_brain_merge$smnabtcht)
gtex_sampleInfo_brain_merged <- merge(GTEx_brain_merge, gtexSampinfo,
by.x = "sampid", by.y = "SAMPID", all = F)
rownames(gtex_sampleInfo_brain_merged) <- gtex_sampleInfo_brain_merged$sampid
rownames(gtexSampinfo) <- gtexSampinfo$SAMPID
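The Hardy scale mapping can be read as a pair of lookup tables indexed by the code plus one. The sketch below reproduces the labels used above on a made-up vector of codes; the NA entry is included only to show that a lookup propagates a missing code as NA.

```r
# Lookup tables for Hardy scale codes 0 to 4 (position = code + 1)
hardy_label  <- c("ventilator", "violent_fast", "fast_natural",
                  "ill_unexpected", "ill_expected")
hardy_status <- c("either", "Control", "either", "Disease", "Disease")

# Made-up DTHHRDY codes, one of each plus a missing value
dthhrdy <- c(0, 1, 2, 3, 4, NA)

disease <- hardy_label[dthhrdy + 1]
status  <- hardy_status[dthhrdy + 1]
data.frame(dthhrdy, disease, status)

# Sex is coded 1 = male, 2 = female in the chunk above
sex <- c("male", "female")[c(1, 2, 2)]
sex
```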
Sample isolation, fixed, and frozen samples. These data were originally acquired from the pheno_sampid column. Time after isolation was taken from the SMTSISCH column and is currently used as a rough proxy for post-mortem interval. Samples with SMTSPAX > 0 were fixed, and samples with SMTSISCH < 0 were frozen. Otherwise, the preparation is labeled as unclear.
gtexSampinfo$SMTSPAX[is.na(gtexSampinfo$SMTSPAX)] <- 0
gtexSampinfo$SMTSISCH[is.na(gtexSampinfo$SMTSISCH)] <- 0
isoTime <- rep(NA, nrow(GTEx_brain_merge))
names(isoTime) <- GTEx_brain_merge$sampid
# Look up the SMTSISCH value of each brain sample by its sample ID
for(i in names(isoTime)) {
isoTime[i] <- gtexSampinfo[i, "SMTSISCH"]
}
prep <- c()
for(i in 1:nrow(gtexSampinfo)) {
if(gtexSampinfo$SMTSPAX[i] > 0) {
prep[i] <- "fixed"
next
}
if(gtexSampinfo$SMTSISCH[i] < 0) {
prep[i] <- "frozen"
next
}
prep[i] <- "unclear"
}
names(prep) <- rownames(gtexSampinfo)
prep_use <- rep(0, nrow(GTEx_brain_merge))
names(prep_use) <- GTEx_brain_merge$sampid
for(i in names(prep_use)){
prep_use[i] <- prep[i]
}
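The preparation rule described above (fixed if SMTSPAX > 0, else frozen if SMTSISCH < 0, else unclear) can be sketched with nested ifelse() calls. The toy table below is invented; only the column names come from the code above.

```r
# Invented sample attributes; NA stands for a missing measurement
sampinfo <- data.frame(SAMPID   = c("S1", "S2", "S3", "S4"),
                       SMTSPAX  = c(120, NA, 0, NA),
                       SMTSISCH = c(300, -15, 45, NA),
                       stringsAsFactors = FALSE)

# Missing values are treated as 0, as in the chunk above
pax  <- ifelse(is.na(sampinfo$SMTSPAX), 0, sampinfo$SMTSPAX)
isch <- ifelse(is.na(sampinfo$SMTSISCH), 0, sampinfo$SMTSISCH)

# Apply the rule in order: fixed first, then frozen, otherwise unclear
prep <- ifelse(pax > 0, "fixed",
               ifelse(isch < 0, "frozen", "unclear"))
names(prep) <- sampinfo$SAMPID

# Named lookup by sample ID, in any order
prep[c("S3", "S1")]
```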
Combine the metadata in the column order of recount_brain_v1. Some columns (e.g. age units, sample location, public availability…) were the same across all samples, so these columns are filled with a single constant value.
GTEX_combn <- cbind(GTEx_brain_merge$AGE, "Years", technique, GTEx_brain_merge$avg_read_length,
NA, NA, "Laboratory, Data Analysis and Coordinating Center (LDACC)",
broadman_gtex, NA,
"Laboratory, Data Analysis and Coordinating Center (LDACC)", NA,
NA, "Public", "Adult", gtex_disease, gtex_disease_status,
GTEx_brain_merge$experiment,
hemisphere_gtex, NA, "Illumina TruSeq RNA sequencing", NA, "paired",
"cDNA", paste0("transcriptomic - ", GTEx_brain_merge$smcenter),
GTEx_brain_merge$smnabtchd, NA, NA, "Homo sapiens", NA, "Illumina",
unlist(isoTime),
"mins", unlist(prep_use), TRUE, "not public", GTEx_brain_merge$smnabtch,
GTEx_brain_merge$smrin,
GTEx_brain_merge$run, NA, GTEx_brain_merge$smts, sex,
GTEx_brain_merge$sample, GTEx_brain_merge$project, tissue_1_gtex,
NA, NA, NA, "Postmortem")
The code below makes the adjustments to TCGA. This code chunk extracts the brain (i.e. Lower Grade Glioma and Glioblastoma) samples from TCGA.
# filter for brain samples
tcga_brain_nums <- which(tcga$gdc_cases.project.project_id %in%
c("TCGA-LGG", "TCGA-GBM"))
cd_brain.ol <- tcga[tcga_brain_nums,]
RNA-seq file information. Average read length was calculated using the formula below:
\[ avgReadLength = auc / (mappedReadCount \times numberEnds) \]
That is, if the RNA-seq was paired-end, the average read length was halved from
\[ auc / mappedReadCount \]
File size in megabytes is the file size in bytes divided by one million:
\[ mbytes = bytes / 1{,}000{,}000 \]
# Average read length. Note: as written, auc / mapped_read_count is multiplied by 2 for paired-end samples
TCGA_readlength <- cd_brain.ol$auc / cd_brain.ol$mapped_read_count * ifelse(cd_brain.ol$paired_end, 2, 1)
#file size in megabytes
mb_tcga <- cd_brain.ol$gdc_file_size / 1e6
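To make the written formula concrete, here is a small worked example with invented numbers. The function name avg_read_length is hypothetical. Note that the chunk above applies the paired-end factor as a multiplier (auc / mapped_read_count * 2), which is not the same operation as the written formula; this sketch follows the written formula only.

```r
# Written formula: auc / (mappedReadCount * numberEnds)
avg_read_length <- function(auc, mapped_read_count, paired_end) {
    number_ends <- ifelse(paired_end, 2, 1)
    auc / (mapped_read_count * number_ends)
}

# Invented values: same coverage and read count, paired vs single end
auc    <- c(2e9, 2e9)
mapped <- c(1e7, 1e7)
paired <- c(TRUE, FALSE)

read_len <- avg_read_length(auc, mapped, paired)
read_len

# File size in megabytes: bytes divided by one million
bytes  <- c(2500000, 1e9)
mbytes <- bytes / 1e6
mbytes
```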
Age at diagnosis is used for age instead of age at treatment or death. That information can still be acquired from the TCGA metadata by merging it with toupper(cd_brain.ol$gdc_file_id), which is the identifier that maps to the row names of the TCGA count data. The youngest individual in the TCGA_brain dataset is 14, so the developmental stage is split into adolescent and adult.
Disease information comes from gdc_cases.samples.sample_type. Disease status is binary, based on whether the tissue was diseased or normal (5 samples). The tumour cDNA versus cDNA selection label is split the same way.
#age at diagnosis
age_at_diag <- cd_brain.ol$cgc_case_age_at_diagnosis
#age normalized for developmental stage
development_tcga <- c()
for(i in 1:length(cd_brain.ol$cgc_case_age_at_diagnosis)) {
if(is.na(age_at_diag[i])) {
development_tcga[i] <- NA
next
}
if(age_at_diag[i] < 20) {
development_tcga[i] <- "Adolescent"
next
}
development_tcga[i] <- "Adult"
}
#cDNA type and solid tissue normal
disease_status_tcga <- c()
selection_tcga <- c()
for(i in 1:length(cd_brain.ol$gdc_cases.samples.sample_type)) {
if(cd_brain.ol$gdc_cases.samples.sample_type[i] == "Solid Tissue Normal") {
disease_status_tcga[i] <- "Control"
selection_tcga[i] <- "cDNA"
} else {
disease_status_tcga[i] <- "Disease"
selection_tcga[i] <- "ctDNA"
}
}
# Histological grade, data is changed to match recount brain
neoP <- tochr(cd_brain.ol$xml_neoplasm_histologic_grade)
neoP[is.na(neoP)] <- "0"
grade_adjust <- c()
for(i in 1:length(neoP)) {
if(neoP[i] == "G2") {
grade_adjust[i] <- "Grade II"
next
}
if(neoP[i] == "G3") {
grade_adjust[i] <- "Grade III"
next
}
grade_adjust[i] <- NA
}
path <- cd_brain.ol$xml_ldh1_mutation_found
pathology_comp <- c()
for(i in 1:length(path)) {
if(is.na(path[i])) {
pathology_comp[i] <- NA
next
}
if(path[i] == "YES") {
pathology_comp[i] <- "+ IDH1 Mutation"
next
}
if(path[i] == "NO") {
pathology_comp[i] <- "- IDH1 Mutation"
next
}
pathology_comp[i] <- path[i]
}
table(pathology_comp)
## pathology_comp
## - IDH1 Mutation + IDH1 Mutation
## 34 91
# LGG or GBM
cancer_type <- substr(x = cd_brain.ol$gdc_cases.project.project_id, 6,
nchar(cd_brain.ol$gdc_cases.project.project_id))
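The age, grade, and cancer type steps above are all simple recodings. The sketch below shows vectorised equivalents on invented values; the grade_map name is an assumption for illustration.

```r
# Invented ages at diagnosis, including a missing value
age_at_diag <- c(14, 35, NA, 19.9, 20)

# Under 20 is adolescent, otherwise adult; NA stays NA
development <- ifelse(age_at_diag < 20, "Adolescent", "Adult")
development

# Named lookup for histological grade; unmatched or missing codes give NA
grade_map <- c(G2 = "Grade II", G3 = "Grade III")
grade <- unname(grade_map[c("G2", "G3", "G4", NA)])
grade

# Cancer type: drop the "TCGA-" prefix from the project ID
project_id  <- c("TCGA-LGG", "TCGA-GBM")
cancer_type <- sub("^TCGA-", "", project_id)
cancer_type
```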
When combining the TCGA data into the recount_brain_v1 format, some columns are constant (e.g. all sequencing data was paired-end), so the paired-end sequencing column is simply “paired”.
TCGA_combn <- cbind(cd_brain.ol$cgc_case_age_at_diagnosis, "Years", "RNA_seq", TCGA_readlength,
cd_brain.ol$cgc_case_id, cd_brain.ol$xml_patient_id,
cd_brain.ol$gdc_cases.tissue_source_site.name,
NA, NA, cd_brain.ol$gdc_center.name, grade_adjust,
cd_brain.ol$gdc_cases.samples.sample_type,
cd_brain.ol$gdc_metadata_files.access.analysis, development_tcga,
cd_brain.ol$gdc_cases.samples.sample_type, disease_status_tcga,
cd_brain.ol$gdc_metadata_files.file_id.experiment, NA, NA,
cd_brain.ol$gdc_platform, toupper(cd_brain.ol$gdc_file_id), "paired",
selection_tcga, "TRANSCRIPTOMIC", cd_brain.ol$cgc_file_upload_date, NA,
cd_brain.ol$gdc_file_size / 1e6, "Homo sapiens", pathology_comp,
"Illumina", NA, NA, "frozen soon after surgery", "TRUE",
cd_brain.ol$gdc_cases.demographic.race,
cd_brain.ol$cgc_file_published_date, NA, NA, NA,"Brain",
cd_brain.ol$gdc_cases.demographic.gender, NA,NA,cancer_type,NA,NA,
tochr(cd_brain.ol$xml_histological_type), "Biopsy")
rownames(cd_brain.ol) <- toupper(cd_brain.ol$gdc_file_id)
The drug information in the cgc_drug_therapy_drug_name column contains multiple typos and ambiguous drug names. The script below adjusts these drug names for consistency. drug_info_T indicates whether drug information is present. drug_therapy_type distinguishes between chemo, radiation, etc. Finally, the 260/280 ratio is the TCGA proxy for RNA quality. Some older cancer cohorts (e.g. OV) have RIN; however, LGG and GBM moved over to 260/280.
dN <- toupper(cd_brain.ol$cgc_drug_therapy_drug_name)
drugName <- c()
# fixed typos in TCGA drugs
for(i in 1:length(dN)) {
if(is.na(dN[i])) {
drugName[i] <- NA
next
}
if(dN[i] %in% c("TEMOZOLAMIDE", "TEMOZOLOMIDE")) {
drugName[i] <- "TEMOZOLOMIDE"
next
}
if(dN[i] %in% c("TEMADOR","TEMODAR", "TEMODAR (ESCALATION)", "METRONOMIC TEMODAR")) {
drugName[i] <- "TEMODAR"
next
}
if(dN[i] %in% c("LOMUSTINE (CCNU)","LOMUSTINE", "LOMUSTIN")) {
drugName[i] <- "LOMUSTINE"
next
}
if(dN[i] %in% c("ISOTRETINOIN","ISOTRECTINOIN (ACCCUTANE)")) {
drugName[i] <- "ISOTRETINOIN"
next
}
if(dN[i] %in% c("I 131 81C6","I131-81C6")) {
drugName[i] <- "I-131-81C6"
next
}
if(dN[i] %in% c("HYDROXYUREA","HYDROYUREA")) {
drugName[i] <- "HYDROXYUREA"
next
}
if(dN[i] %in% c("GLIADEL WAFER","GLIADEL WAFER (BCNU)", "GLIADEL")) {
drugName[i] <- "GLIADEL"
next
}
if(dN[i] %in% c("DEXAMETHASONE","DEXMETHASONE")) {
drugName[i] <- "DEXAMETHASONE"
next
}
if(dN[i] %in% c("CPT11","CPT-11")) {
drugName[i] <- "CPT11"
next
}
if(dN[i] %in% c("CARMUSTINE", "CARMUSTIN", "CARMUSTINE (BCNU)", "CARMUSTINE BCNU")) {
drugName[i] <- "CARMUSTINE"
next
}
if(dN[i] %in% c("BEVACIZUMAB","BEVACIZUMAB OR PLACEBO RTOG 0825")) {
drugName[i] <- "BEVACIZUMAB"
next
}
if(dN[i] %in% c("BCNU","BCNU (CARMUSTINE)")) {
drugName[i] <- "BCNU"
next
}
drugName[i] <- dN[i]
}
drug_info_T <- cd_brain.ol$xml_has_drugs_information
drug_therapy_type <- cd_brain.ol$cgc_drug_therapy_pharmaceutical_therapy_type
T_260_280 <- cd_brain.ol$gdc_cases.samples.portions.analytes.a260_a280_ratio
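The chain of if statements above amounts to a lookup from misspelled names to canonical names. A minimal sketch of that idea follows, using a handful of the spelling variants listed above; the function harmonise_drug, the drug_map vector, and the input vector are hypothetical.

```r
# Variant spelling (name) mapped to canonical spelling (value)
drug_map <- c("TEMOZOLAMIDE" = "TEMOZOLOMIDE",
              "TEMADOR"      = "TEMODAR",
              "LOMUSTIN"     = "LOMUSTINE",
              "CPT-11"       = "CPT11")

harmonise_drug <- function(x, map) {
    x <- toupper(x)                      # compare in upper case
    hit <- x %in% names(map)             # NA never matches, so it is kept
    x[hit] <- unname(map[x[hit]])        # replace known variants
    x                                    # unknown names pass through
}

# Invented input with mixed case, a missing value, and an unmapped name
raw <- c("Temozolamide", "temador", NA, "CPT-11", "Drug X")
clean <- harmonise_drug(raw, drug_map)
clean
```

Keeping the variants in one named vector makes it easy to review or extend the list without touching the control flow.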
The code below readjusts the column order of recount_brain_v1. This is done by ensuring that the order of columns in recount_brain matches TCGA, GTEx, and the recount website. All of the column names are then matched.
recount_brain_reorder = recount_brain[,gsub(' ','', notes$Variable[1:48])]
colnames(TCGA_combn) <- colnames(GTEX_combn) <- colnames(recount_brain_reorder)
The code below cleans up the colData related to combining the three datasets and makes a consistent identifier. Study is the name of the SRA study, TCGA, or GTEX. The _full columns are TCGA columns padded with NA to the correct number of rows for recount_brain and GTEx. Finally, these columns are combined and the dataset is saved.
Study <- sub("\\..*","", rownames(recount_brain) )
Study_full <- c(Study, rep("TCGA", nrow(TCGA_combn)),
rep("GTEX", nrow(GTEX_combn)))
Dataset <- c(rep("recount_brain_v1",length(Study)),
rep("TCGA", nrow(TCGA_combn)), rep("GTEX", nrow(GTEX_combn)))
drugName_full <- c(rep(NA, length(Study)), drugName, rep(NA, nrow(GTEX_combn) ))
drug_info_full <- c(rep(NA, length(Study)), drug_info_T,
rep(NA, nrow(GTEX_combn) ))
drug_type_full <- c(rep(NA, length(Study)), drug_therapy_type,
rep(NA, nrow(GTEX_combn) ))
full_260_280<- c(rep(NA, length(Study)), T_260_280, rep(NA, nrow(GTEX_combn) ))
count_file_identifier <- c(recount_brain$run_s, rownames(cd_brain.ol),
GTEx_brain_merge$run)
brain_meta <- rbind(recount_brain_reorder, TCGA_combn, GTEX_combn )
metadata_complete <- cbind(brain_meta, Study_full, drugName_full,
drug_info_full, drug_type_full, full_260_280, count_file_identifier, Dataset)
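The padding pattern above can be seen on a toy example: stack the three tables in a fixed order, then build each dataset-specific column as NA for the other datasets. All values below are invented, and the two-column layout is a simplification of the real 48 columns.

```r
# Toy version 1 table (data frame) and toy TCGA / GTEx matrices
v1   <- data.frame(age = c("30", "20-29"), sex = c("male", "female"),
                   stringsAsFactors = FALSE)
tcga <- cbind(c("45", "52", "61"), "female")
gtex <- cbind("60-69", "male")

# Give the matrices the same column names, then stack in a fixed order
colnames(tcga) <- colnames(gtex) <- colnames(v1)
meta <- rbind(v1, tcga, gtex)

# A TCGA-only column, padded with NA for the other two datasets
drug <- c("TEMODAR", NA, "CPT11")
meta$drugName_full <- c(rep(NA, nrow(v1)), drug, rep(NA, nrow(gtex)))

# Dataset label built in the same row order
meta$Dataset <- c(rep("recount_brain_v1", nrow(v1)),
                  rep("TCGA", nrow(tcga)),
                  rep("GTEX", nrow(gtex)))
table(meta$Dataset)
```

The approach relies on every vector being assembled in the same order as the rbind() call, so the order of the three blocks should not be changed in one place only.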
The code below adjusts some of the major columns within the dataset to account for different datasets using slightly different names. For example, if you filter for “Primary”, you get all primary tumors instead of just the recount_brain_v1 primary tumors.
#Tissue site 1 adjust
tsite1 <- c()
ts <- metadata_complete$tissue_site_1
for(i in 1:nrow(metadata_complete)) {
if(ts[i] %in% c("Caudate (basal ganglia)", "Caudate")) {
tsite1[i] <- "Caudate"
next
}
if(ts[i] %in% c("Frontal Cortex", "Frontal Cortex (BA9)")) {
tsite1[i] <- "Frontal Cortex"
next
}
if(ts[i] %in% c("Nucleus accumbens", "Nucleus accumbens (basal ganglia)")) {
tsite1[i] <- "Nucleus accumbens"
next
}
if(ts[i] %in% c("Putamen", "Putamen (basal ganglia)")) {
tsite1[i] <- "Putamen"
next
}
tsite1[i] <- ts[i]
}
metadata_complete$tissue_site_1 <- tsite1
# Adjusting the disease information so that tumour information is consistent
dis <- c() # Note: for Alzheimer's disease and Parkinson's disease there was a minor error with the encoding of the apostrophe. You will likely need to adjust these entries manually
for(i in 1:length(metadata_complete$disease)) {
if(metadata_complete$disease[i] %in% c("Brain tumor", "Tumor")) {
dis[i] <- "brain tumor unspecified"
next
}
dis[i] <- metadata_complete$disease[i]
}
metadata_complete$disease <- dis
clinStage2 <- c()
for(i in 1:length(metadata_complete$clinical_stage_2)) {
if(is.na(metadata_complete$clinical_stage_2[i])) {
clinStage2[i] <- NA
next
}
if(metadata_complete$clinical_stage_2[i] %in% c("Primary Tumor")) {
clinStage2[i] <- "Primary"
next
}
if(metadata_complete$clinical_stage_2[i] %in% c("Recurrent Tumor")) {
clinStage2[i] <- "Recurrent"
next
}
clinStage2[i] <- metadata_complete$clinical_stage_2[i]
}
metadata_complete$clinical_stage_2 <- clinStage2
# Fixing capitalization in consent
metadata_complete$consent_s <- toupper(metadata_complete$consent_s)
race_adjusted <- toupper(metadata_complete$race)
for(i in 1:length(race_adjusted)) {
if(race_adjusted[i] %in% "BLACK OR AFRICAN AMERICAN") {
race_adjusted[i] <- "BLACK"
}
}
metadata_complete$race <- race_adjusted
#Information on sample origin: iPSC consistency
origin <- metadata_complete$sample_origin
for(i in 1:length(origin)) {
if(origin[i] %in% "iPSCs") {
origin[i] <- "iPSC"
}
}
metadata_complete$sample_origin <- origin
# making sure that oligodendroglioma/oligodendrogliomas are not treated as different tumor types
t_type <- metadata_complete$tumor_type
for(i in 1:length(t_type)) {
if(t_type[i] %in% "Anaplastic Oligodendrogliomas") {
t_type[i] <- "Anaplastic Oligodendroglioma"
next
}
}
metadata_complete$tumor_type <- t_type
# Converting run_s to also contain the identifier. This allows recount_brain_v2 to be accessed via the "add_metadata()" function
metadata_complete$run_s <- metadata_complete$count_file_identifier
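The recoding loops above test values with %in% rather than ==. One practical reason, sketched below on invented values, is that %in% returns FALSE for missing values while == returns NA, and an NA condition would stop an if statement with an error.

```r
# Invented clinical stage labels with a missing value
stage <- c("Primary Tumor", NA, "Recurrent Tumor", "Secondary")

# == propagates NA, %in% does not
eq_test <- stage == "Primary Tumor"
in_test <- stage %in% "Primary Tumor"
eq_test
in_test

# Vectorised recoding that leaves NA and unmatched labels untouched
stage_clean <- ifelse(stage %in% "Primary Tumor", "Primary",
                      ifelse(stage %in% "Recurrent Tumor", "Recurrent",
                             stage))
stage_clean
```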
recount_brain_v2
The final code chunk checks the dimensions of recount_brain_v2, saves it as an Rdata object and a csv file, checks the md5sums of the saved files, and lists the variables.
# The completed metadata is then saved
recount_brain <- metadata_complete
dim(recount_brain)
## [1] 6547 55
## For compatibility with add_metadata()
recount_brain$run_s <- as.character(recount_brain$run_s)
## Re-cast some vars
recount_brain$count_file_identifier <- as.character(recount_brain$count_file_identifier)
recount_brain$drug_info_full <- recount_brain$drug_info_full == 'YES'
recount_brain$rin <- as.numeric(recount_brain$rin)
recount_brain$pmi <- as.numeric(recount_brain$pmi)
recount_brain$avgspotlen_l <- as.numeric(recount_brain$avgspotlen_l)
recount_brain$insertsize_l <- as.numeric(recount_brain$insertsize_l)
recount_brain$mbases_l <- as.integer(recount_brain$mbases_l)
recount_brain$mbytes_l <- as.numeric(recount_brain$mbytes_l)
recount_brain$brodmann_area <- as.integer(recount_brain$brodmann_area)
recount_brain$present_in_recount <- as.logical(recount_brain$present_in_recount)
## Simplify age by turning ranges such as 20-29 to mean(c(20, 29))
mean_age <- function(x) {
mean(as.integer(strsplit(x, '-')[[1]]))
}
age <- as.numeric(recount_brain$age)
## Warning: NAs introduced by coercion
age[grepl('-', recount_brain$age)] <- sapply(
recount_brain$age[grepl('-', recount_brain$age)], mean_age)
recount_brain$age <- age
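The age simplification can be tried on a small invented vector. The sketch repeats mean_age() so that it runs on its own; the warning from as.numeric() is expected for the range strings and is suppressed here.

```r
# Same helper as above: midpoint of a range such as "20-29"
mean_age <- function(x) {
    mean(as.integer(strsplit(x, '-')[[1]]))
}

# Invented ages: a plain number, two ranges, and a missing value
age_raw <- c("45", "20-29", NA, "60-69")

# Plain numbers convert directly; ranges become NA at this step
age <- suppressWarnings(as.numeric(age_raw))

# Fill in the ranges with their midpoints
is_range <- grepl('-', age_raw)
age[is_range] <- vapply(age_raw[is_range], mean_age, numeric(1),
                        USE.NAMES = FALSE)
age
```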
## Between version 1 and 2, these are the columns that change types
r <- add_metadata(source = 'recount_brain_v1')
## 2020-11-13 16:24:28 downloading the recount_brain metadata to /tmp/RtmpK9pZcs/recount_brain_v1.Rdata
## Loading objects:
## recount_brain
x <- sapply(r, class) == sapply(recount_brain[, colnames(r)], class)
sapply(recount_brain[, colnames(r)], class)[!x]
## avgspotlen_l insertsize_l mbytes_l
## "numeric" "numeric" "numeric"
sapply(r, class)[!x]
## avgspotlen_l insertsize_l mbytes_l
## "integer" "integer" "integer"
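The class comparison above can be reproduced on a toy pair of data frames. In this sketch, old and new are invented stand-ins for the version 1 and version 2 tables, and only one column is re-cast.

```r
# Toy "version 1" table
old <- data.frame(avgspotlen_l = c(100L, 150L),
                  rin = c(7.1, 8.2),
                  run_s = c("A", "B"),
                  stringsAsFactors = FALSE)

# Toy "version 2" table: one integer column re-cast to numeric
new <- old
new$avgspotlen_l <- as.numeric(new$avgspotlen_l)

# Compare column classes, matching columns by name
same <- sapply(old, class) == sapply(new[, colnames(old)], class)
changed <- names(same)[!same]

# Side-by-side report of the columns whose class changed
data.frame(column = changed,
           old = sapply(old, class)[changed],
           new = sapply(new, class)[changed],
           row.names = NULL)
```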
## Save the data
save(recount_brain, file = 'recount_brain_v2_noOntology.Rdata')
write.csv(recount_brain, file = 'recount_brain_v2_noOntology.csv', quote = TRUE,
row.names = FALSE)
## Check md5sum for the resulting files
sapply(dir(pattern = 'recount_brain_v2'), tools::md5sum)
## recount_brain_v2_noOntology.csv.recount_brain_v2_noOntology.csv
## "e7855403fac9dc4d6345908c1e5da5a7"
## recount_brain_v2_noOntology.Rdata.recount_brain_v2_noOntology.Rdata
## "aa95cc6a34b77b9062e2f77da0cac286"
## recount_brain_v2.csv.recount_brain_v2.csv
## "2ab643a4ce55d731c637456ff50ef36b"
## recount_brain_v2.Rdata.recount_brain_v2.Rdata
## "0cc562916ced9f2bf4fb2b9a1a446121"
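An md5sum check like the one above can be sketched on temporary files. The example writes the same invented table twice and confirms that the checksums agree. Passing the vector of file paths straight to tools::md5sum() returns one name per file, which avoids the doubled names produced by wrapping the call in sapply().

```r
# Work in a fresh temporary directory
tmp_dir <- tempfile("md5_demo_")
dir.create(tmp_dir)

# Invented table written to two separate csv files
toy <- data.frame(run_s = c("A", "B"), age = c(45, 24.5))
csv_a <- file.path(tmp_dir, "toy_a.csv")
csv_b <- file.path(tmp_dir, "toy_b.csv")
write.csv(toy, file = csv_a, quote = TRUE, row.names = FALSE)
write.csv(toy, file = csv_b, quote = TRUE, row.names = FALSE)

# md5sum() is vectorised over file paths
sums <- tools::md5sum(dir(tmp_dir, full.names = TRUE))
sums

# Identical content gives identical checksums
same_content <- sums[[1]] == sums[[2]]
same_content
```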
## List of all variables
colnames(recount_brain)
## [1] "age" "age_units" "assay_type_s" "avgspotlen_l"
## [5] "bioproject_s" "biosample_s" "brain_bank" "brodmann_area"
## [9] "cell_line" "center_name_s" "clinical_stage_1" "clinical_stage_2"
## [13] "consent_s" "development" "disease" "disease_status"
## [17] "experiment_s" "hemisphere" "insertsize_l" "instrument_s"
## [21] "library_name_s" "librarylayout_s" "libraryselection_s" "librarysource_s"
## [25] "loaddate_s" "mbases_l" "mbytes_l" "organism_s"
## [29] "pathology" "platform_s" "pmi" "pmi_units"
## [33] "preparation" "present_in_recount" "race" "releasedate_s"
## [37] "rin" "run_s" "sample_name_s" "sample_origin"
## [41] "sex" "sra_sample_s" "sra_study_s" "tissue_site_1"
## [45] "tissue_site_2" "tissue_site_3" "tumor_type" "viability"
## [49] "Study_full" "drugName_full" "drug_info_full" "drug_type_full"
## [53] "full_260_280" "count_file_identifier" "Dataset"
recount_brain_v2
Below are some summary statistics on the merged dataset, shown as pivot tables of selected columns split by the major dataset.
#Sex
table(recount_brain$sex, recount_brain$Dataset)
##
## GTEX recount_brain_v1 TCGA
## female 442 259 298
## male 967 695 402
## pooled 0 2938 0
#Development
table(recount_brain$development, recount_brain$Dataset)
##
## GTEX recount_brain_v1 TCGA
## Adolescent 0 35 4
## Adult 1409 963 696
## Child 0 58 0
## Fetus 0 38 0
## Infant 0 47 0
#Tumor type
table(recount_brain$tumor_type, recount_brain$Dataset)
##
## GTEX recount_brain_v1 TCGA
## Anaplastic Astrocytomas 0 24 0
## Anaplastic Oligodendroastrocytomas 0 36 0
## Anaplastic Oligodendroglioma 0 19 0
## Astrocytoma 0 63 196
## Glioblastoma 0 206 0
## Glioblastoma Multiforme (GBM) 0 0 1
## normal 0 8 0
## Oligoastrocytoma 0 9 135
## Oligodendroastrocytoma 0 37 0
## Oligodendroglioma 0 49 200
## Treated primary GBM 0 0 1
## Untreated primary (de novo) GBM 0 0 167
# Clinical stage 2
table(recount_brain$clinical_stage_2, recount_brain$Dataset)
##
## GTEX recount_brain_v1 TCGA
## Familial 0 16 0
## Grade IV 0 7 0
## Primary 0 64 671
## Recurrent 0 61 31
## Secondary 0 21 0
## Solid Tissue Normal 0 0 5
## Sporadic 0 20 0
# tissue_site 1
table(recount_brain$tissue_site_1, recount_brain$Dataset)
##
## GTEX recount_brain_v1 TCGA
## Amygdala 81 0 0
## Anterior cingulate cortex (BA24) 99 0 0
## Brainstem 0 2 0
## Caudate 134 5 0
## Cerebellar Hemisphere 118 0 0
## Cerebellum 145 29 0
## Cerebral cortex 0 638 0
## Corpus callosum 0 13 0
## Cortex 132 0 0
## Dura mater 0 1 0
## Frontal Cortex 120 27 0
## GBM 0 0 175
## Hippocampus 103 25 0
## Hypothalamus 104 0 0
## LGG 0 0 532
## Lumbar spinal cord 0 41 0
## Mixed 0 6 0
## Nucleus accumbens 123 1 0
## Putamen 103 6 0
## Spinal cord (cervical c-1) 76 0 0
## Substantia nigra 71 1 0
## Whole brain 0 2 0
# present in recount
table(recount_brain$present_in_recount, recount_brain$Dataset)
##
## GTEX recount_brain_v1 TCGA
## FALSE 0 1217 0
## TRUE 1409 3214 707
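When reading these pivot tables, keep in mind that table() drops missing values by default, so column totals can be smaller than the number of samples in a dataset (for example, the TCGA sex counts above sum to 700 while 707 TCGA samples are present). The sketch below shows the difference on invented data using the useNA argument.

```r
# Invented metadata with missing sex values
toy <- data.frame(sex = c("male", "female", NA, "male", NA),
                  Dataset = c("GTEX", "TCGA", "TCGA", "GTEX",
                              "recount_brain_v1"),
                  stringsAsFactors = FALSE)

# Default: rows with NA in sex are silently left out
tab_default <- table(toy$sex, toy$Dataset)
tab_default

# useNA = "ifany" adds a row for the missing values
tab_with_na <- table(toy$sex, toy$Dataset, useNA = "ifany")
tab_with_na

# Totals differ by the number of rows with a missing value
n_default <- sum(tab_default)
n_with_na <- sum(tab_with_na)
```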
Full summary:
summary(recount_brain)
## age age_units assay_type_s avgspotlen_l bioproject_s biosample_s
## Min. : 1.00 Length:6547 Length:6547 Min. : 27.00 Length:6547 Length:6547
## 1st Qu.: 40.00 Class :character Class :character 1st Qu.: 95.57 Class :character Class :character
## Median : 54.50 Mode :character Mode :character Median : 152.00 Mode :character Mode :character
## Mean : 50.77 Mean : 157.56
## 3rd Qu.: 64.50 3rd Qu.: 200.00
## Max. :106.00 Max. :2017.00
## NA's :3446
## brain_bank brodmann_area cell_line center_name_s clinical_stage_1 clinical_stage_2
## Length:6547 Min. : 4.00 Length:6547 Length:6547 Length:6547 Length:6547
## Class :character 1st Qu.: 9.00 Class :character Class :character Class :character Class :character
## Mode :character Median : 9.00 Mode :character Mode :character Mode :character Mode :character
## Mean :14.61
## 3rd Qu.:24.00
## Max. :46.00
## NA's :6003
## consent_s development disease disease_status experiment_s hemisphere
## Length:6547 Length:6547 Length:6547 Length:6547 Length:6547 Length:6547
## Class :character Class :character Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## insertsize_l instrument_s library_name_s librarylayout_s libraryselection_s librarysource_s
## Min. : 0.000 Length:6547 Length:6547 Length:6547 Length:6547 Length:6547
## 1st Qu.: 0.000 Class :character Class :character Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character Mode :character Mode :character
## Mean : 3.124
## 3rd Qu.: 0.000
## Max. :245.000
## NA's :2116
## loaddate_s mbases_l mbytes_l organism_s pathology platform_s
## Length:6547 Min. : 0.0 Min. : 0 Length:6547 Length:6547 Length:6547
## Class :character 1st Qu.: 787.5 1st Qu.: 640 Class :character Class :character Class :character
## Mode :character Median : 1542.0 Median : 1272 Mode :character Mode :character Mode :character
## Mean : 2872.6 Mean : 2488
## 3rd Qu.: 2660.0 3rd Qu.: 3226
## Max. :52310.0 Max. :35161
## NA's :2116 NA's :1409
## pmi pmi_units preparation present_in_recount race releasedate_s
## Min. : 0.0 Length:6547 Length:6547 Mode :logical Length:6547 Length:6547
## 1st Qu.: 0.0 Class :character Class :character FALSE:1217 Class :character Class :character
## Median : 6.0 Mode :character Mode :character TRUE :5330 Mode :character Mode :character
## Mean : 152.1
## 3rd Qu.: 21.0
## Max. :1442.0
## NA's :5988
## rin run_s sample_name_s sample_origin sex sra_sample_s
## Min. :1.500 Length:6547 Length:6547 Length:6547 Length:6547 Length:6547
## 1st Qu.:6.500 Class :character Class :character Class :character Class :character Class :character
## Median :7.100 Mode :character Mode :character Mode :character Mode :character Mode :character
## Mean :7.209
## 3rd Qu.:7.900
## Max. :9.800
## NA's :4828
## sra_study_s tissue_site_1 tissue_site_2 tissue_site_3 tumor_type viability
## Length:6547 Length:6547 Length:6547 Length:6547 Length:6547 Length:6547
## Class :character Class :character Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Study_full drugName_full drug_info_full drug_type_full full_260_280 count_file_identifier
## Length:6547 Length:6547 Mode :logical Length:6547 Min. :1.500 Length:6547
## Class :character Class :character FALSE:278 Class :character 1st Qu.:1.800 Class :character
## Mode :character Mode :character TRUE :422 Mode :character Median :1.810 Mode :character
## NA's :5847 Mean :1.835
## 3rd Qu.:1.880
## Max. :2.270
## NA's :5972
## Dataset
## Length:6547
## Class :character
## Mode :character
##
##
##
##
This document was made possible thanks to:
Code for creating this document
## Create the vignette
library('rmarkdown')
system.time(render('cross_studies_metadata.Rmd', 'BiocStyle::html_document'))
Reproducibility information for this document.
## Reproducibility info
proc.time()
## user system elapsed
## 33.407 3.364 51.681
message(Sys.time())
## 2020-11-13 16:24:28
options(width = 120)
session_info()
## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.2 Patched (2020-06-24 r78746)
## os CentOS Linux 7 (Core)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz US/Eastern
## date 2020-11-13
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date lib source
## AnnotationDbi 1.50.3 2020-07-25 [2] Bioconductor
## askpass 1.1 2019-01-13 [2] CRAN (R 4.0.0)
## assertthat 0.2.1 2019-03-21 [2] CRAN (R 4.0.0)
## backports 1.2.0 2020-11-02 [1] CRAN (R 4.0.2)
## base64enc 0.1-3 2015-07-28 [2] CRAN (R 4.0.0)
## bibtex 0.4.2.3 2020-09-19 [2] CRAN (R 4.0.2)
## Biobase * 2.48.0 2020-04-27 [2] Bioconductor
## BiocFileCache 1.12.1 2020-08-04 [2] Bioconductor
## BiocGenerics * 0.34.0 2020-04-27 [2] Bioconductor
## BiocManager 1.30.10 2019-11-16 [2] CRAN (R 4.0.0)
## BiocParallel 1.22.0 2020-04-27 [2] Bioconductor
## BiocStyle * 2.16.1 2020-09-25 [1] Bioconductor
## biomaRt 2.44.4 2020-10-13 [2] Bioconductor
## Biostrings 2.56.0 2020-04-27 [2] Bioconductor
## bit 4.0.4 2020-08-04 [2] CRAN (R 4.0.2)
## bit64 4.0.5 2020-08-30 [2] CRAN (R 4.0.2)
## bitops 1.0-6 2013-08-17 [2] CRAN (R 4.0.0)
## blob 1.2.1 2020-01-20 [2] CRAN (R 4.0.0)
## bookdown 0.21 2020-10-13 [1] CRAN (R 4.0.2)
## BSgenome 1.56.0 2020-04-27 [2] Bioconductor
## bumphunter 1.30.0 2020-04-27 [2] Bioconductor
## callr 3.5.1 2020-10-13 [2] CRAN (R 4.0.2)
## checkmate 2.0.0 2020-02-06 [2] CRAN (R 4.0.0)
## cli 2.1.0 2020-10-12 [2] CRAN (R 4.0.2)
## cluster 2.1.0 2019-06-19 [3] CRAN (R 4.0.2)
## codetools 0.2-16 2018-12-24 [3] CRAN (R 4.0.2)
## colorspace 1.4-1 2019-03-18 [2] CRAN (R 4.0.0)
## crayon 1.3.4 2017-09-16 [2] CRAN (R 4.0.0)
## curl 4.3 2019-12-02 [2] CRAN (R 4.0.0)
## data.table 1.13.2 2020-10-19 [2] CRAN (R 4.0.2)
## DBI 1.1.0 2019-12-15 [2] CRAN (R 4.0.0)
## dbplyr 2.0.0 2020-11-03 [1] CRAN (R 4.0.2)
## DelayedArray * 0.14.1 2020-07-14 [2] Bioconductor
## derfinder 1.22.0 2020-04-27 [2] Bioconductor
## derfinderHelper 1.22.0 2020-04-27 [2] Bioconductor
## desc 1.2.0 2018-05-01 [2] CRAN (R 4.0.0)
## devtools * 2.3.2 2020-09-18 [2] CRAN (R 4.0.2)
## digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
## doRNG 1.8.2 2020-01-27 [2] CRAN (R 4.0.0)
## downloader 0.4 2015-07-09 [2] CRAN (R 4.0.0)
## dplyr 1.0.2 2020-08-18 [2] CRAN (R 4.0.2)
## ellipsis 0.3.1 2020-05-15 [2] CRAN (R 4.0.0)
## evaluate 0.14 2019-05-28 [2] CRAN (R 4.0.0)
## fansi 0.4.1 2020-01-08 [2] CRAN (R 4.0.0)
## foreach 1.5.1 2020-10-15 [2] CRAN (R 4.0.2)
## foreign 0.8-80 2020-05-24 [3] CRAN (R 4.0.2)
## Formula 1.2-4 2020-10-16 [2] CRAN (R 4.0.2)
## fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
## generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
## GenomeInfoDb * 1.24.2 2020-06-15 [2] Bioconductor
## GenomeInfoDbData 1.2.3 2020-05-18 [2] Bioconductor
## GenomicAlignments 1.24.0 2020-04-27 [2] Bioconductor
## GenomicFeatures 1.40.1 2020-07-08 [2] Bioconductor
## GenomicFiles 1.24.0 2020-04-27 [2] Bioconductor
## GenomicRanges * 1.40.0 2020-04-27 [2] Bioconductor
## GEOquery 2.56.0 2020-04-27 [2] Bioconductor
## ggplot2 3.3.2 2020-06-19 [2] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
## gridExtra 2.3 2017-09-09 [2] CRAN (R 4.0.0)
## gtable 0.3.0 2019-03-25 [2] CRAN (R 4.0.0)
## Hmisc 4.4-1 2020-08-10 [2] CRAN (R 4.0.2)
## hms 0.5.3 2020-01-08 [2] CRAN (R 4.0.0)
## htmlTable 2.1.0 2020-09-16 [2] CRAN (R 4.0.2)
## htmltools 0.5.0 2020-06-16 [2] CRAN (R 4.0.2)
## htmlwidgets 1.5.2 2020-10-03 [2] CRAN (R 4.0.2)
## httr 1.4.2 2020-07-20 [2] CRAN (R 4.0.2)
## IRanges * 2.22.2 2020-05-21 [2] Bioconductor
## iterators 1.0.13 2020-10-15 [2] CRAN (R 4.0.2)
## jpeg 0.1-8.1 2019-10-24 [2] CRAN (R 4.0.0)
## jsonlite 1.7.1 2020-09-07 [2] CRAN (R 4.0.2)
## knitcitations * 1.0.10 2019-09-15 [1] CRAN (R 4.0.2)
## knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
## lattice 0.20-41 2020-04-02 [3] CRAN (R 4.0.2)
## latticeExtra 0.6-29 2019-12-19 [2] CRAN (R 4.0.0)
## lifecycle 0.2.0 2020-03-06 [2] CRAN (R 4.0.0)
## limma 3.44.3 2020-06-12 [2] Bioconductor
## locfit 1.5-9.4 2020-03-25 [2] CRAN (R 4.0.0)
## lubridate 1.7.9 2020-06-08 [1] CRAN (R 4.0.0)
## magick 2.5.2 2020-11-10 [1] CRAN (R 4.0.2)
## magrittr 1.5 2014-11-22 [2] CRAN (R 4.0.0)
## Matrix 1.2-18 2019-11-27 [3] CRAN (R 4.0.2)
## matrixStats * 0.57.0 2020-09-25 [2] CRAN (R 4.0.2)
## memoise 1.1.0 2017-04-21 [2] CRAN (R 4.0.0)
## munsell 0.5.0 2018-06-12 [2] CRAN (R 4.0.0)
## nnet 7.3-14 2020-04-26 [3] CRAN (R 4.0.2)
## openssl 1.4.3 2020-09-18 [2] CRAN (R 4.0.2)
## pillar 1.4.6 2020-07-10 [2] CRAN (R 4.0.2)
## pkgbuild 1.1.0 2020-07-13 [2] CRAN (R 4.0.2)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.0.0)
## pkgload 1.1.0 2020-05-29 [2] CRAN (R 4.0.2)
## plyr 1.8.6 2020-03-03 [2] CRAN (R 4.0.0)
## png 0.1-7 2013-12-03 [2] CRAN (R 4.0.0)
## prettyunits 1.1.1 2020-01-24 [2] CRAN (R 4.0.0)
## processx 3.4.4 2020-09-03 [2] CRAN (R 4.0.2)
## progress 1.2.2 2019-05-16 [2] CRAN (R 4.0.0)
## ps 1.4.0 2020-10-07 [2] CRAN (R 4.0.2)
## purrr 0.3.4 2020-04-17 [2] CRAN (R 4.0.0)
## qvalue 2.20.0 2020-04-27 [2] Bioconductor
## R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.2)
## rappdirs 0.3.1 2016-03-28 [2] CRAN (R 4.0.0)
## RColorBrewer 1.1-2 2014-12-07 [2] CRAN (R 4.0.0)
## Rcpp 1.0.5 2020-07-06 [2] CRAN (R 4.0.2)
## RCurl 1.98-1.2 2020-04-18 [2] CRAN (R 4.0.0)
## readr 1.4.0 2020-10-05 [2] CRAN (R 4.0.2)
## recount * 1.14.0 2020-04-27 [2] Bioconductor
## RefManageR 1.2.12 2019-04-03 [1] CRAN (R 4.0.2)
## remotes 2.2.0 2020-07-21 [2] CRAN (R 4.0.2)
## rentrez 1.2.2 2019-05-02 [2] CRAN (R 4.0.0)
## reshape2 1.4.4 2020-04-09 [2] CRAN (R 4.0.0)
## rlang 0.4.8 2020-10-08 [1] CRAN (R 4.0.2)
## rmarkdown * 2.5 2020-10-21 [1] CRAN (R 4.0.2)
## rngtools 1.5 2020-01-23 [2] CRAN (R 4.0.0)
## rpart 4.1-15 2019-04-12 [3] CRAN (R 4.0.2)
## rprojroot 1.3-2 2018-01-03 [2] CRAN (R 4.0.0)
## Rsamtools 2.4.0 2020-04-27 [2] Bioconductor
## RSQLite 2.2.1 2020-09-30 [2] CRAN (R 4.0.2)
## rstudioapi 0.11 2020-02-07 [2] CRAN (R 4.0.0)
## rtracklayer 1.48.0 2020-04-27 [2] Bioconductor
## S4Vectors * 0.26.1 2020-05-16 [2] Bioconductor
## scales 1.1.1 2020-05-11 [2] CRAN (R 4.0.0)
## sessioninfo 1.1.1 2018-11-05 [2] CRAN (R 4.0.0)
## stringi 1.5.3 2020-09-09 [2] CRAN (R 4.0.2)
## stringr 1.4.0 2019-02-10 [2] CRAN (R 4.0.0)
## SummarizedExperiment * 1.18.2 2020-07-09 [2] Bioconductor
## survival 3.2-3 2020-06-13 [3] CRAN (R 4.0.2)
## testthat 3.0.0 2020-10-31 [1] CRAN (R 4.0.2)
## tibble 3.0.4 2020-10-12 [2] CRAN (R 4.0.2)
## tidyr 1.1.2 2020-08-27 [2] CRAN (R 4.0.2)
## tidyselect 1.1.0 2020-05-11 [2] CRAN (R 4.0.0)
## usethis * 1.6.3 2020-09-17 [2] CRAN (R 4.0.2)
## VariantAnnotation 1.34.0 2020-04-27 [2] Bioconductor
## vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
## withr 2.3.0 2020-09-22 [2] CRAN (R 4.0.2)
## xfun 0.19 2020-10-30 [1] CRAN (R 4.0.2)
## XML 3.99-0.5 2020-07-23 [2] CRAN (R 4.0.2)
## xml2 1.3.2 2020-04-23 [2] CRAN (R 4.0.0)
## XVector 0.28.0 2020-04-27 [2] Bioconductor
## yaml 2.2.1 2020-02-01 [2] CRAN (R 4.0.0)
## zlibbioc 1.34.0 2020-04-27 [2] Bioconductor
##
## [1] /users/neagles/R/4.0
## [2] /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0/R/4.0/lib64/R/site-library
## [3] /jhpce/shared/jhpce/core/conda/miniconda3-4.6.14/envs/svnR-4.0/R/4.0/lib64/R/library
This document was generated using BiocStyle (Oleś, Morgan, and Huber, 2020) with knitr (Xie, 2014) and rmarkdown (Allaire, Xie, McPherson, Luraschi, et al., 2020) running behind the scenes.
Citations were made with knitcitations (Boettiger, 2019), and the bibliographic file is available here.
[1] J. Allaire, Y. Xie, J. McPherson, J. Luraschi, et al. rmarkdown: Dynamic Documents for R. R package version 2.5. 2020. <URL: https://github.com/rstudio/rmarkdown>.
[2] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.10. 2019. <URL: https://CRAN.R-project.org/package=knitcitations>.
[3] L. Collado-Torres, A. Nellore, K. Kammers, S. E. Ellis, et al. “Reproducible RNA-seq analysis using recount2”. In: Nature Biotechnology (2017). DOI: 10.1038/nbt.3838. <URL: http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html>.
[4] A. Oleś, M. Morgan, and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.16.1. 2020. <URL: https://github.com/Bioconductor/BiocStyle>.
[5] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2020. <URL: https://www.R-project.org/>.
[6] H. Wickham, J. Hester, and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 2.3.2. 2020. <URL: https://CRAN.R-project.org/package=devtools>.
[7] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. <URL: http://www.crcpress.com/product/isbn/9781466561595>.