region_info
Returns the annotation of a single candidate probe sequence or a
list of sequences and, if requested, writes the results to a .csv file.
region_info(
REGION,
CSV = TRUE,
SEQ = TRUE,
OUTDIR = tempdir(),
CODING_ONLY = FALSE
)
Either a single hg19 genomic sequence given as the chromosome, the
start-end range, and optionally the strand, separated by colons (e.g.,
'chr20:10199446-10288068:+'), or a character vector of such sequences to be
annotated. Must be character. The chromosome name must be preceded by 'chr'.
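The expected input format can be checked before calling the function. The following is a minimal sketch in base R; `parse_region()` is a hypothetical helper that is not part of this package, and treating a missing strand as '*' is an assumption rather than documented behavior.

```r
# Hypothetical helper (not part of the documented function): split
# 'chr:start-end[:strand]' strings into their components with base R only.
parse_region <- function(region) {
  stopifnot(is.character(region))
  # chromosome prefixed by 'chr', numeric start and end, optional strand
  pattern <- "^(chr[^:]+):([0-9]+)-([0-9]+)(:([+*-]))?$"
  ok <- grepl(pattern, region)
  if (!all(ok)) {
    stop("Malformed region(s): ", paste(region[!ok], collapse = ", "))
  }
  strand <- sub(pattern, "\\5", region)
  strand[strand == ""] <- "*" # assumption: a missing strand means unstranded
  data.frame(
    seqnames = sub(pattern, "\\1", region),
    start = as.integer(sub(pattern, "\\2", region)),
    end = as.integer(sub(pattern, "\\3", region)),
    strand = strand,
    stringsAsFactors = FALSE
  )
}

parsed <- parse_region(c("chr20:10199446-10288068:+", "chr18:74690788-74692427"))
```

A string without the 'chr' prefix, such as "20:1-2", makes this helper stop with an error, which mirrors the requirement stated above.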
A logical(1) value indicating whether the results should be exported
to a .csv file.
A `logical(1)` value indicating whether the base sequence should be returned.
If a .csv file is to be exported, this parameter indicates the path where the file should be saved. By default the file will be saved in a temporary directory.
A logical vector of length 1 specifying whether to
subset the Annotated Genes to only the coding genes. That is, whether to
subset the genes by whether they have a non-NA CSS value. The Annotated
Genes are downloaded with GenomicState::GenomicStateHub().
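The effect of subsetting on a non-NA CSS value can be illustrated on a toy table. This is only a sketch: the data frame below is invented for illustration (the gene names and coordinates are not real annotation data), and the actual Annotated Genes object may be stored in a different class.

```r
# Toy stand-in for the annotated genes table; all values are invented
# for illustration and are not real annotation data.
genes <- data.frame(
  Gene = c("geneA", "geneB", "geneC"),
  CSS = c(1200L, NA, 5300L),
  stringsAsFactors = FALSE
)

# Assumed equivalent of CODING_ONLY = TRUE: keep rows with a non-NA CSS.
coding_genes <- genes[!is.na(genes$CSS), , drop = FALSE]
```

Here `coding_genes` keeps only the rows for which a CSS is recorded, so a region can end up matched to a different (coding) gene than with the full table.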
This function annotates all input sequences using
bumphunter::matchGenes(). It returns a data frame where each
row is a genomic sequence specified in REGION. The columns
c('seqnames', 'start', 'end', 'width', 'strand') list the chromosome,
range, sequence length, and strand of the REGION. The columns c('name',
'annotation', 'description', 'region', 'distance', 'subregion',
'insideDistance', 'exonnumber', 'nexons', 'UTR', 'geneL', 'codingL',
'Geneid', 'subjectHits') are described in the
bumphunter::matchGenes() documentation.
If SEQ=TRUE, a column 'Sequence' will be included. This is recommended for sending the probe sequence to be synthesized.
If CSV=TRUE, a .csv file called region_info.csv will be saved to a
temporary directory unless otherwise specified in OUTDIR.
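To locate and reload the exported file, something like the following may work. It is a sketch under two assumptions: that the file is written to `file.path(OUTDIR, "region_info.csv")`, and that it can be read back with `read.csv()`. A small stand-in data frame, holding only the coordinate columns from the first example, is written here in place of an actual call to the function.

```r
# Assumption: the exported file lives at file.path(OUTDIR, "region_info.csv").
outdir <- tempdir()
csv_path <- file.path(outdir, "region_info.csv")

# Stand-in for the data frame returned by region_info(); only the
# coordinate columns are included.
result <- data.frame(
  seqnames = "chr20",
  start = 10286777L,
  end = 10288069L,
  width = 1293L,
  strand = "+",
  stringsAsFactors = FALSE
)
write.csv(result, csv_path, row.names = FALSE)

# Reload the file and confirm the coordinates survived the round trip.
reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
```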
x <- region_info("chr20:10286777-10288069:+", CSV = FALSE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
head(x)
#> seqnames start end width strand name
#> 1 chr20 10286777 10288069 1293 + SNAP25
#> annotation
#> 1 NM_001322902 NM_001322903 NM_001322904 NM_001322905 NM_001322906 NM_001322907 NM_001322908 NM_001322909 NM_001322910 NM_003081 NM_130811 NP_001309831 NP_001309832 NP_001309833 NP_001309834 NP_001309835 NP_001309836 NP_001309837 NP_001309838 NP_001309839 NP_003072 NP_570824 XM_005260808 XM_017028021 XM_017028022 XP_005260865 XP_016883510 XP_016883511
#> description region distance subregion insideDistance
#> 1 overlaps 3' overlaps 3' 87299 overlaps exon downstream 0
#> exonnumber nexons UTR geneL codingL Geneid
#> 1 10 10 overlaps 3'UTR 88588 30705 ENSG00000132639
#> Sequence
#> 1 GCTGATTCCAACAAAACCAGAATTGATGAGGCCAACCAACGTGCAACAAAGATGCTGGGAAGTGGTTAAGTGTGCCCACCCGTGTTCTCCTCCAAATGCTGTCGGGCAAGATAGCTCCTTCATGCTTTTCTCATGGTATTATCTAGTAGGTCTGCACACATAACACACATCAGTCCACCCCCATTGTGAATGTTGTCCTGTGTCATCTGTCAGCTTCCCAACAATACTTTGTGTCTTTTGTTCTCTCTTGGTCTCTTTCTTTCCAAAGGTTGTACATAGTGGTCATTTGGTGGCTCTAACTCCTTGAGGTCTTGAGTTTCATTTTTCATTTTCTCTCCTCGGTGGCATTTGCTGAATAACAACAATTTAGGAATGCTCAATGTGCTGTTGATTCTTTCAATCCACAGTATTGTTCTTGTAAAACTGTGACATTCCACAGAGTTACTGCCACGGTCCTTTGAGTGTCAGGCTCTGAATCTCTCAAAATGTGCCGTCTTTGGTTCCTCATGGCTGTTATCTGTCTTTATGATTTCATGATTAGACAATGTGGAATTACATAACAGGCATTGCACTAAAAGTGATGTGATTTATGCATTTATGCATGAGAACTAAATAGATTTTTAGATTCCTACTTAAACAAAAACTTTCCATGACAGTAGCATACTGATGAGACAACACACACACACACAAAACAACAGCAACAACAACAGAACAACAACAAAGCATGCTCAGTATTGAGACACTGTCAAGATTAAGTTATACCAGCAAAAGTGCAGTAGTGTCACTTTTTTCCTGTCAATATATAGAGACTTCTAAATCATAATCATCCTTTTTTAAAAAAAAGAATTTTAAAAAAGATGGATTTGACACACTCACCATTTAATCATTTCCAGCAAAATATATGTTTGGCTGAAATTATGTCAAATGGATGTAATATAGGGTTTGTTTGCTGCTTTTGATGGCTATGTTTTGGAGAGAGCAATCTTGCTGTGAAACAGTGTGGATGTAAATTTTATAAGGCTGACTCTTACTAACCACCATTTCCCCTGTGGTTTGTTATCAGTACAATTCTTTGTTGCTTAATCTAGAGCTATGCACACCAAATTGCTGAGATGTTTAGTAGCTGATAAAGAAACCTTTTAAAAAAATAATATAAATGAATGAAATATAAACTGTGAGATAAATATCATTATAGCATGTAATATTAAATTCCTCCTGTCTCCTCTGTCAGTTTGTGAAGTGATTGACATTTTGTAGCTAGTTTAAAATTATTAAAAATTATAGACTCCAGAT
## You can easily transform this data.frame to a GRanges object
GenomicRanges::GRanges(x)
#> GRanges object with 1 range and 14 metadata columns:
#> seqnames ranges strand | name annotation
#> <Rle> <IRanges> <Rle> | <AsIs> <AsIs>
#> [1] chr20 10286777-10288069 + | SNAP25 NM_001322902 NM_0013..
#> description region distance subregion insideDistance
#> <factor> <factor> <numeric> <factor> <numeric>
#> [1] overlaps 3' overlaps 3' 87299 overlaps exon downstream 0
#> exonnumber nexons UTR geneL codingL Geneid
#> <numeric> <integer> <factor> <numeric> <numeric> <character>
#> [1] 10 10 overlaps 3'UTR 88588 30705 ENSG00000132639
#> Sequence
#> <character>
#> [1] GCTGATTCCAACAAAACCAG..
#> -------
#> seqinfo: 1 sequence from an unspecified genome; no seqlengths
y <- region_info(
c(
"chr20:10286777-10288069:+",
"chr18:74690788-74692427:-",
"chr19:49932861-49933829:-"
),
CSV = FALSE, SEQ = FALSE
)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
head(y)
#> seqnames start end width strand name
#> 1 chr20 10286777 10288069 1293 + SNAP25
#> 2 chr18 74690788 74692427 1640 - MBP
#> 3 chr19 49932861 49933829 969 - SLC17A7
#> annotation
#> 1 NM_001322902 NM_001322903 NM_001322904 NM_001322905 NM_001322906 NM_001322907 NM_001322908 NM_001322909 NM_001322910 NM_003081 NM_130811 NP_001309831 NP_001309832 NP_001309833 NP_001309834 NP_001309835 NP_001309836 NP_001309837 NP_001309838 NP_001309839 NP_003072 NP_570824 XM_005260808 XM_017028021 XM_017028022 XP_005260865 XP_016883510 XP_016883511
#> 2 NM_001025081 NM_001025090 NM_001025092 NM_001025094 NM_001025098 NM_001025100 NM_001025101 NM_002385 NP_001020252 NP_001020261 NP_001020263 NP_001020271 NP_001020272 NP_002376 XM_017025778 XM_017025780 XM_024451185 XM_024451186 XM_024451187 XM_024451188 XM_024451189 XP_016881267 XP_016881269 XP_024306953 XP_024306954 XP_024306955 XP_024306956 XP_024306957 XR_001753201 XR_001753202
#> 3 NM_020309 NP_064705
#> description region distance subregion insideDistance
#> 1 overlaps 3' overlaps 3' 87299 overlaps exon downstream 0
#> 2 inside exon inside 153212 inside exon 0
#> 3 inside exon inside 11788 inside exon 0
#> exonnumber nexons UTR geneL codingL Geneid
#> 1 10 10 overlaps 3'UTR 88588 30705 ENSG00000132639
#> 2 17 17 overlaps 3'UTR 154856 125322 ENSG00000197971
#> 3 13 13 overlaps 3'UTR 12959 10860 ENSG00000104888
candidates <- c(
"chr20:10286777-10288069:+",
"chr18:74690788-74692427:-",
"chr19:49932861-49933829:-"
)
region_info(candidates, CSV = FALSE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
#> seqnames start end width strand name
#> 1 chr20 10286777 10288069 1293 + SNAP25
#> 2 chr18 74690788 74692427 1640 - MBP
#> 3 chr19 49932861 49933829 969 - SLC17A7
#> annotation
#> 1 NM_001322902 NM_001322903 NM_001322904 NM_001322905 NM_001322906 NM_001322907 NM_001322908 NM_001322909 NM_001322910 NM_003081 NM_130811 NP_001309831 NP_001309832 NP_001309833 NP_001309834 NP_001309835 NP_001309836 NP_001309837 NP_001309838 NP_001309839 NP_003072 NP_570824 XM_005260808 XM_017028021 XM_017028022 XP_005260865 XP_016883510 XP_016883511
#> 2 NM_001025081 NM_001025090 NM_001025092 NM_001025094 NM_001025098 NM_001025100 NM_001025101 NM_002385 NP_001020252 NP_001020261 NP_001020263 NP_001020271 NP_001020272 NP_002376 XM_017025778 XM_017025780 XM_024451185 XM_024451186 XM_024451187 XM_024451188 XM_024451189 XP_016881267 XP_016881269 XP_024306953 XP_024306954 XP_024306955 XP_024306956 XP_024306957 XR_001753201 XR_001753202
#> 3 NM_020309 NP_064705
#> description region distance subregion insideDistance
#> 1 overlaps 3' overlaps 3' 87299 overlaps exon downstream 0
#> 2 inside exon inside 153212 inside exon 0
#> 3 inside exon inside 11788 inside exon 0
#> exonnumber nexons UTR geneL codingL Geneid
#> 1 10 10 overlaps 3'UTR 88588 30705 ENSG00000132639
#> 2 17 17 overlaps 3'UTR 154856 125322 ENSG00000197971
#> 3 13 13 overlaps 3'UTR 12959 10860 ENSG00000104888
#> Sequence
#> 1 GCTGATTCCAACAAAACCAGAATTGATGAGGCCAACCAACGTGCAACAAAGATGCTGGGAAGTGGTTAAGTGTGCCCACCCGTGTTCTCCTCCAAATGCTGTCGGGCAAGATAGCTCCTTCATGCTTTTCTCATGGTATTATCTAGTAGGTCTGCACACATAACACACATCAGTCCACCCCCATTGTGAATGTTGTCCTGTGTCATCTGTCAGCTTCCCAACAATACTTTGTGTCTTTTGTTCTCTCTTGGTCTCTTTCTTTCCAAAGGTTGTACATAGTGGTCATTTGGTGGCTCTAACTCCTTGAGGTCTTGAGTTTCATTTTTCATTTTCTCTCCTCGGTGGCATTTGCTGAATAACAACAATTTAGGAATGCTCAATGTGCTGTTGATTCTTTCAATCCACAGTATTGTTCTTGTAAAACTGTGACATTCCACAGAGTTACTGCCACGGTCCTTTGAGTGTCAGGCTCTGAATCTCTCAAAATGTGCCGTCTTTGGTTCCTCATGGCTGTTATCTGTCTTTATGATTTCATGATTAGACAATGTGGAATTACATAACAGGCATTGCACTAAAAGTGATGTGATTTATGCATTTATGCATGAGAACTAAATAGATTTTTAGATTCCTACTTAAACAAAAACTTTCCATGACAGTAGCATACTGATGAGACAACACACACACACACAAAACAACAGCAACAACAACAGAACAACAACAAAGCATGCTCAGTATTGAGACACTGTCAAGATTAAGTTATACCAGCAAAAGTGCAGTAGTGTCACTTTTTTCCTGTCAATATATAGAGACTTCTAAATCATAATCATCCTTTTTTAAAAAAAAGAATTTTAAAAAAGATGGATTTGACACACTCACCATTTAATCATTTCCAGCAAAATATATGTTTGGCTGAAATTATGTCAAATGGATGTAATATAGGGTTTGTTTGCTGCTTTTGATGGCTATGTTTTGGAGAGAGCAATCTTGCTGTGAAACAGTGTGGATGTAAATTTTATAAGGCTGACTCTTACTAACCACCATTTCCCCTGTGGTTTGTTATCAGTACAATTCTTTGTTGCTTAATCTAGAGCTATGCACACCAAATTGCTGAGATGTTTAGTAGCTGATAAAGAAACCTTTTAAAAAAATAATATAAATGAATGAAATATAAACTGTGAGATAAATATCATTATAGCATGTAATATTAAATTCCTCCTGTCTCCTCTGTCAGTTTGTGAAGTGATTGACATTTTGTAGCTAGTTTAAAATTATTAAAAATTATAGACTCCAGAT
#> 2 GGAGGAAGAGATAGTCGCTCTGGATCACCCATGGCTAGACGCTGAAAACCCACCTGGTTCCGGAATCCTGTCCTCAGCTTCTTAATATAACTGCCTTAAAACTTTAATCCCACTTGCCCCTGTTACCTAATTAGAGCAGATGACCCCTCCCCTAATGCCTGCGGAGTTGTGCACGTAGTAGGGTCAGGCCACGGCAGCCTACCGGCAATTTCCGGCCAACAGTTAAATGAGAACATGAAAACAGAAAACGGTTAAAACTGTCCCTTTCTGTGTGAAGATCACGTTCCTTCCCCCGCAATGTGCCCCCAGACGCACGTGGGTCTTCAGGGGGCCAGGTGCACAGACGTCCCTCCACGTTCACCCCTCCACCCTTGGACTTTCTTTTCGCCGTGGCTGCGGCACCCTTGCGCTTTTGCTGGTCACTGCCATGGAGGCACACAGCTGCAGAGACAGAGAGGACGTGGGCGGCAGAGAGGACTGTTGACATCCAAGCTTCCTTTGTTTTTTTTTCCTGTCCTTCTCTCACCTCCTAAAGTAGACTTCATTTTTCCTAACAGGATTAGACAGTCAAGGAGTGGCTTACTACATGTGGGAGCTTTTGGTATGTGACATGCGGGCTGGGCAGCTGTTAGAGTCCAACGTGGGGCAGCACAGAGAGGGGGCCACCTCCCCAGGCCGTGGCTGCCCACACACCCCAATTAGCTGAATTCGCGTGTGGCAGAGGGAGGAAAAGGAGGCAAACGTGGGCTGGGCAATGGCCTCACATAGGAAACAGGGTCTTCCTGGAGATTTGGTGATGGAGATGTCAAGCAGGTGGCCTCTGGACGTCACCGTTGCCCTGCATGGTGGCCCCAGAGCAGCCTCTATGAACAACCTCGTTTCCAAACCACAGCCCACAGCCGGAGAGTCCAGGAAGACTTGCGCACTCAGAGCAGAAGGGTAGGAGTCCTCTAGACAGCCTCGCAGCCGCGCCAGTCGCCCATAGACACTGGCTGTGACCGGGCGTGCTGGCAGCGGCAGTGCACAGTGGCCAGCACTAACCCTCCCTGAGAAGATAACCGGCTCATTCACTTCCTCCCAGAAGACGCGTGGTAGCGAGTAGGCACAGGCGTGCACCTGCTCCCGAATTACTCACCGAGACACACGGGCTGAGCAGACGGCCCCGTGGATGGAGACAAAGAGCTCTTCTGACCATATCCTTCTTAACACCCGCTGGCATCTCCTTTCGCGCCTCCCTCCCTAACCTACTGACCCACCTTTTGATTTTAGCGCACCTGTGATTGATAGGCCTTCCAAAGAGTCCCACGCTGGCATCACCCTCCCCGAGGACGGAGATGAGGAGTAGTCAGCGTGATGCCAAAACGCGTCTTCTTAATCCAATTCTAATTCTGAATGTTTCGTGTGGGCTTAATACCATGTCTATTAATATATAGCCTCGATGATGAGAGAGTTACAAAGAACAAAACTCCAGACACAAACCTCCAAATTTTTCAGCAGAAGCACTCTGCGTCGCTGAGCTGAGGTCGGCTCTGCGATCCATACGTGGCCGCACCCACACAGCACGTGCTGTGACGATGGCTGAACGGAAAGTGTACACTGTTCCTGAATATTGAAATAAAACAATAAACTTTTAATGGTAT
#> 3 ACACACAGCACATTTCAGCCCCCCAGGCCCCCACCCCCTGTCCGGGACTACTGACCATGTGCCTCCCACTGAATGGCAGTTTCCAGGACCTCCATTCCACTCATCTCTGGCCTGAGTGACAGTGTCAAGGAACCCTGCTCCTCTCTGTCCTGCCTCAGGCCTAAGAAGCACTCTCCCTTGTTCCCAGTGCTGTCAAATCCTCTTTCCTTCCCAATTGCCTCTCAGGGGTAGTGAAGCTGCAGACTGACAGTTTCAAGGATACCCAAATTCCCCTAAAGGTTCCCTCTCCACCCGTTCTGCCTCAGTGGTTTCAAATCTCTCCTTTCAGGGCTTTATTTGAATGGACAGTTCGACCTCTTACTCTCTCTTGTGGTTTTGAGGCACCCACACCCCCCGCTTTCCTTTATCTCCAGGGACTCTCAGGCTAACCTTTGAGATCACTCAGCTCCCATCTCCTTTCAGAAAAATTCAAGGTCCTCCTCTAGAAGTTTCAAATCTCTCCCAACTCTGTTCTGCATCTTCCAGATTGGTTTAACCAATTACTCGTCCCCGCCATTCCAGGGATTGATTCTCACCAGCGTTTCTGATGGAAAATGGCGGTTTCAAGTCCCCGATTCCGTGCCCACTTCACATCTCCCCTACCAGCAGATTCTGCGAAAGCACCAAATTTCTCAAGACCCTCTTCTCCCTAGCTTAGCATAATGTCTGGGGAAACAACCAAAATCGCAATTTTAACAATATGCCTCTCTACCCCCGTGCACTTTTTCTGACATGGTTTTCAGGTCTAAATAGTGGCTGCTCCAGTCCATGAACTCAAAGGTTTGAAGCTACCACCATTGAACTCCCCCATGGTGGTTTCATGATGCCCCCTCCCCAATTCCTCGCACTTTATTCTCCTGGGTGGTTTCGAACTACCCTGTTTCTCAGTGGCCATTTGTTGTGTCCCTCAGGGGCTTAATGACTCAAAAT
## Explore the effect of changing CODING_ONLY
## Check how the "distance", "name", "Geneid" among other values change
region_info("chr10:135379301-135379311:+", CSV = FALSE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
#> seqnames start end width strand name annotation description region
#> 1 chr10 135379301 135379311 11 + <NA> <NA> inside exon inside
#> distance subregion insideDistance exonnumber nexons
#> 1 0 inside exon 0 1 5
#> UTR geneL codingL Geneid Sequence
#> 1 inside transcription region 63710 NA ENSG00000288107 GCATGTGCGCT
region_info("chr10:135379301-135379311:+", CSV = FALSE, CODING_ONLY = TRUE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
#> seqnames start end width strand name annotation
#> 1 chr10 135379301 135379311 11 + CYP2E1 NM_000773 NP_000764
#> description region distance subregion insideDistance exonnumber nexons
#> 1 downstream downstream 45391 <NA> NA NA 12
#> UTR geneL codingL Geneid Sequence
#> 1 <NA> 40814 11568 ENSG00000130649 GCATGTGCGCT
if (FALSE) {
region_info(candidates, OUTDIR = "/path/to/directory/")
region_info("chr20:10286777-10288069:+", OUTDIR = "/path/to/directory")
}