This function returns a data.frame() with the samples that are available from recount3. Note that a specific sample might be available from a given data_source and from zero, one, or many collections.

available_samples(
  organism = c("human", "mouse"),
  recount3_url = getOption("recount3_url", "http://duffel.rail.bio/recount3"),
  bfc = recount3_cache(),
  verbose = getOption("recount3_verbose", TRUE),
  available_homes = project_homes(organism = organism, recount3_url = recount3_url)
)

Arguments

organism

A character(1) specifying which organism you want to download data from. Supported options are "human" or "mouse".

recount3_url

A character(1) specifying the home URL for recount3 or a local directory where you have mirrored recount3. Defaults to the load balancer http://duffel.rail.bio/recount3. Alternatives are https://recount-opendata.s3.amazonaws.com/recount3/release (listed at https://registry.opendata.aws/recount/) and the SciServer datascope from IDIES at JHU, https://sciserver.org/public-data/recount3/data. If you have a preferred mirror, you can set the R option recount3_url, for example in your .Rprofile.
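
Setting the option once avoids passing recount3_url to every call. The following is a minimal sketch using only base R; it uses the AWS mirror URL mentioned above, downloads nothing, and restores the previous setting at the end. The variable names old_opts and resolved_url are illustrative only.

```r
## Remember any existing setting so it can be restored afterwards
old_opts <- options(
    recount3_url = "https://recount-opendata.s3.amazonaws.com/recount3/release"
)

## Same lookup as the default value of the recount3_url argument:
## use the option if it is set, otherwise fall back to the load balancer
resolved_url <- getOption("recount3_url", "http://duffel.rail.bio/recount3")
print(resolved_url)

## Restore the previous setting
options(old_opts)
```

To make the choice persistent, the same options() call could be placed in your .Rprofile.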

bfc

A BiocFileCache-class object in which the files will be cached, typically created by recount3_cache().

verbose

A logical(1) indicating whether to show messages with updates.

available_homes

A character() vector with the available project homes for the given recount3_url. If you use a non-standard recount3_url, you will likely need to manually specify the valid values for available_homes.
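
Project homes are relative paths such as data_sources/sra, as seen in the project_home column of the example output on this page. The sketch below uses base R only and assumes that the gtex and tcga homes follow the same pattern (inferred from the file_source values in that output). It shows how such a string relates to the project_type and file_source columns; it is an observation about the output, not a description of how recount3 parses these values internally.

```r
## Project homes following the pattern seen in the example output
## (the gtex and tcga entries are assumed to follow the sra pattern)
homes <- c("data_sources/sra", "data_sources/gtex", "data_sources/tcga")

## The leading directory matches project_type and the last path
## component matches file_source in the example output
home_parts <- data.frame(
    project_home = homes,
    project_type = dirname(homes),
    file_source = basename(homes)
)
print(home_parts)
```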

Value

A data.frame() with the sample ID used by the original source of the data (external_id), the project ID (project), the organism (organism), the source from which the data was accessed (file_source), the date the sample was processed in YYYY-MM-DD format (date_processed), the recount3 project home location (project_home), and the project type (project_type), which differentiates between data_sources and compilations (collections).
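
To make the structure concrete, here is a small sketch that builds a toy data.frame() with the same seven columns, using values copied from the example output on this page. The object toy_samples is a hypothetical stand-in and not the result of calling available_samples(). The as.Date() step is only needed if date_processed is stored as character, which is an assumption here.

```r
## Toy stand-in with the seven documented columns; values are copied
## from the first rows of the example output on this page
toy_samples <- data.frame(
    external_id = c("SRR5579327", "SRR5579328", "SRR8249198"),
    project = c("SRP107565", "SRP107565", "SRP170963"),
    organism = c("human", "human", "mouse"),
    file_source = "sra",
    date_processed = c("2019-10-01", "2019-10-01", "2020-01-01"),
    project_home = "data_sources/sra",
    project_type = "data_sources"
)

## Check that the columns match the ones described above
expected_cols <- c(
    "external_id", "project", "organism", "file_source",
    "date_processed", "project_home", "project_type"
)
stopifnot(identical(colnames(toy_samples), expected_cols))

## YYYY-MM-DD strings convert directly with as.Date()
toy_samples$date_processed <- as.Date(toy_samples$date_processed)
print(range(toy_samples$date_processed))
```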

Examples


## Find all the human samples available from recount3
human_samples <- available_samples()
#> 2023-05-07 00:10:16.352644 caching file sra.recount_project.MD.gz.
#> 2023-05-07 00:10:16.658074 caching file gtex.recount_project.MD.gz.
#> 2023-05-07 00:10:17.013503 caching file tcga.recount_project.MD.gz.
dim(human_samples)
#> [1] 347005      7
head(human_samples)
#>   external_id   project organism file_source date_processed     project_home
#> 1  SRR5579327 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 2  SRR5579328 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 3  SRR5579329 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 4  SRR5579330 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 5  SRR5579331 SRP107565    human         sra     2019-10-01 data_sources/sra
#> 6  SRR5579332 SRP107565    human         sra     2019-10-01 data_sources/sra
#>   project_type
#> 1 data_sources
#> 2 data_sources
#> 3 data_sources
#> 4 data_sources
#> 5 data_sources
#> 6 data_sources

## How many are from a data source vs a compilation?
table(human_samples$project_type, useNA = "ifany")
#> 
#> data_sources 
#>       347005 

## What are the unique file sources?
table(
    human_samples$file_source[human_samples$project_type == "data_sources"]
)
#> 
#>   gtex    sra   tcga 
#>  19214 316443  11348 

## Find all the mouse samples available from recount3
mouse_samples <- available_samples("mouse")
#> 2023-05-07 00:10:19.163851 caching file sra.recount_project.MD.gz.
dim(mouse_samples)
#> [1] 416859      7
head(mouse_samples)
#>   external_id   project organism file_source date_processed     project_home
#> 1  SRR8249198 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 2  SRR8249199 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 3  SRR8249200 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 4  SRR8249201 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 5  SRR8249202 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#> 6  SRR8249205 SRP170963    mouse         sra     2020-01-01 data_sources/sra
#>   project_type
#> 1 data_sources
#> 2 data_sources
#> 3 data_sources
#> 4 data_sources
#> 5 data_sources
#> 6 data_sources

## How many are from a data source vs a compilation?
table(mouse_samples$project_type, useNA = "ifany")
#> 
#> data_sources 
#>       416859
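
A common follow-up is to combine the human and mouse tables and count samples per project. The sketch below illustrates this with two toy data frames (toy_human and toy_mouse, hypothetical stand-ins for human_samples and mouse_samples with only three of the seven columns) so that it runs without downloading anything. The same rbind() and table() calls should apply to the real objects, assuming both have identical columns.

```r
## Toy stand-ins; IDs are copied from the example output above
toy_human <- data.frame(
    external_id = c("SRR5579327", "SRR5579328", "SRR5579329"),
    project = "SRP107565",
    organism = "human"
)
toy_mouse <- data.frame(
    external_id = c("SRR8249198", "SRR8249199"),
    project = "SRP170963",
    organism = "mouse"
)

## Stack the two tables; this requires identical column names
all_samples <- rbind(toy_human, toy_mouse)

## Number of samples per project, largest projects first
samples_per_project <- sort(table(all_samples$project), decreasing = TRUE)
print(samples_per_project)

## Number of samples per organism
print(table(all_samples$organism))
```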