This function returns a data.frame() with the samples that are available from recount3. Note that a specific sample might be available from a given data_source and from none or many collections.
available_samples(
    organism = c("human", "mouse"),
    recount3_url = getOption("recount3_url", "http://duffel.rail.bio/recount3"),
    bfc = recount3_cache(),
    verbose = getOption("recount3_verbose", TRUE),
    available_homes = project_homes(organism = organism, recount3_url = recount3_url)
)
organism: A character(1) specifying which organism you want to download data from. Supported options are "human" or "mouse".
recount3_url: A character(1) specifying the home URL for recount3 or a local directory where you have mirrored recount3. Defaults to the load balancer http://duffel.rail.bio/recount3, but can also be https://recount-opendata.s3.amazonaws.com/recount3/release (from https://registry.opendata.aws/recount/) or the SciServer datascope from IDIES at JHU, https://sciserver.org/public-data/recount3/data. You can set the R option recount3_url (for example in your .Rprofile) if you have a favorite mirror.
bfc: A BiocFileCache-class object where the files will be cached, typically created by recount3_cache().
verbose: A logical(1) indicating whether to show messages with updates.
available_homes: A character() vector with the available project homes for the given recount3_url. If you use a non-standard recount3_url, you will likely need to manually specify the valid values for available_homes.
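As a minimal sketch of the recount3_url option described above: the usage signature resolves its default through getOption("recount3_url", ...), so setting the option once should change which mirror later calls pick up. The example uses only base R and one of the mirror URLs listed above; the variable names default_url, old_opts, and resolved_url are illustrative, and nothing here contacts a server.

```r
## Default used when the option is not set (copied from the usage above)
default_url <- "http://duffel.rail.bio/recount3"

## Clear the option, keeping the previous value so it can be restored later
old_opts <- options(recount3_url = NULL)
getOption("recount3_url", default_url)

## Point to a preferred mirror, for example from your .Rprofile
options(
    recount3_url = "https://recount-opendata.s3.amazonaws.com/recount3/release"
)
resolved_url <- getOption("recount3_url", default_url)
resolved_url

## Restore the previous setting
options(old_opts)
```

Restoring the saved options at the end keeps the example from changing the mirror used by the rest of the session.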
A data.frame() with the sample ID used by the original source of the data (external_id), the project ID (project), the organism, the file_source from where the data was accessed, the date the sample was processed (date_processed) in YYYY-MM-DD format, the recount3 project home location (project_home), and the project_type that differentiates between data_sources and compilations.
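To illustrate how the columns described above might be used, here is a small sketch on a hand-built stand-in table. The object toy_samples is hypothetical: its rows copy values from the example output on this page rather than coming from a live call, and it assumes date_processed is stored as text.

```r
## Hand-built stand-in for the output of available_samples(); the values are
## copied from the example output on this page, not downloaded
toy_samples <- data.frame(
    external_id = c("SRR5579327", "SRR5579328", "SRR5579329"),
    project = "SRP107565",
    organism = "human",
    file_source = "sra",
    date_processed = "2019-10-01",
    project_home = "data_sources/sra",
    project_type = "data_sources",
    stringsAsFactors = FALSE
)

## date_processed is in YYYY-MM-DD format, so as.Date() parses it directly
## (assumption: the column is character; as.Date() also accepts Date input)
toy_samples$date_processed <- as.Date(toy_samples$date_processed)

## Number of samples per project
samples_per_project <- table(toy_samples$project)
samples_per_project

## One row per project with its organism and project home
project_info <- unique(toy_samples[, c("project", "organism", "project_home")])
project_info
```

The same calls should apply unchanged to the full table, since they rely only on the column names listed above.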
## Find all the human samples available from recount3
human_samples <- available_samples()
#> 2023-05-07 00:10:16.352644 caching file sra.recount_project.MD.gz.
#> 2023-05-07 00:10:16.658074 caching file gtex.recount_project.MD.gz.
#> 2023-05-07 00:10:17.013503 caching file tcga.recount_project.MD.gz.
dim(human_samples)
#> [1] 347005      7
head(human_samples)
#>   external_id   project organism file_source date_processed     project_home
#> 1  SRR5579327 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 2  SRR5579328 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 3  SRR5579329 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 4  SRR5579330 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 5  SRR5579331 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 6  SRR5579332 SRP107565    human         sra     2019-10-01 data_sources/sra
#>   project_type
#> 1 data_sources
#> 2 data_sources
#> 3 data_sources
#> 4 data_sources
#> 5 data_sources
#> 6 data_sources
## How many are from a data source vs a compilation?
table(human_samples$project_type, useNA = "ifany")
#>
#> data_sources
#>       347005
## What are the unique file sources?
table(
    human_samples$file_source[human_samples$project_type == "data_sources"]
)
#>
#>   gtex    sra   tcga
#>  19214 316443  11348
## Find all the mouse samples available from recount3
mouse_samples <- available_samples("mouse")
#> 2023-05-07 00:10:19.163851 caching file sra.recount_project.MD.gz.
dim(mouse_samples)
#> [1] 416859      7
head(mouse_samples)
#>   external_id   project organism file_source date_processed     project_home
#> 1  SRR8249198 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 2  SRR8249199 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 3  SRR8249200 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 4  SRR8249201 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 5  SRR8249202 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 6  SRR8249205 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#>   project_type
#> 1 data_sources
#> 2 data_sources
#> 3 data_sources
#> 4 data_sources
#> 5 data_sources
#> 6 data_sources
## How many are from a data source vs a compilation?
table(mouse_samples$project_type, useNA = "ifany")
#>
#> data_sources
#>       416859