region_info returns the annotation of a single potential probe sequence or a list of sequences and, if specified, writes the results to a .csv file.

region_info(
  REGION,
  CSV = TRUE,
  SEQ = TRUE,
  OUTDIR = tempdir(),
  CODING_ONLY = FALSE
)

Arguments

REGION

Either a single hg19 genomic sequence given as chromosome:start-end, optionally followed by a colon and the strand (e.g., 'chr20:10199446-10288068:+'), or a character vector of such sequences to be annotated. Must be character. The chromosome must be preceded by 'chr'.
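The REGION format described above can be checked before calling region_info. The following is a minimal sketch in base R, not code from the package: parse_region is a hypothetical helper, and falling back to "*" when no strand is given is an assumption of this sketch rather than documented behaviour of region_info.

```r
# Hypothetical helper (not part of the package): split REGION strings of the
# form 'chr<name>:<start>-<end>[:<strand>]' into their components.
parse_region <- function(region) {
    stopifnot(is.character(region), length(region) >= 1)
    parts <- strsplit(region, ":", fixed = TRUE)
    out <- lapply(parts, function(p) {
        # Require 2 or 3 colon-separated fields and the 'chr' prefix
        if (!(length(p) %in% c(2, 3)) || !startsWith(p[1], "chr")) {
            stop(
                "Expected 'chr<name>:<start>-<end>[:<strand>]', got: ",
                paste(p, collapse = ":")
            )
        }
        coords <- suppressWarnings(
            as.integer(strsplit(p[2], "-", fixed = TRUE)[[1]])
        )
        if (length(coords) != 2 || anyNA(coords)) {
            stop("Could not read start and end from: ", p[2])
        }
        data.frame(
            seqnames = p[1],
            start = coords[1],
            end = coords[2],
            # Assumption of this sketch: '*' when the strand is omitted
            strand = if (length(p) == 3) p[3] else "*",
            stringsAsFactors = FALSE
        )
    })
    do.call(rbind, out)
}

parsed <- parse_region(c(
    "chr20:10199446-10288068:+",
    "chr18:74690788-74692427"
))
parsed$end - parsed$start + 1L # width of each region
```

A string without the 'chr' prefix, such as "20:1-2", stops with an error in this sketch.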

CSV

A logical(1) value indicating if the results should be exported in a .csv file.

SEQ

A logical(1) value indicating if the base sequence should be returned.

OUTDIR

If a .csv file is to be exported, this parameter indicates the path where the file should be saved. By default the file will be saved in a temporary directory.

CODING_ONLY

A logical vector of length 1 specifying whether to subset the Annotated Genes to only the coding genes. That is, whether to subset the genes by whether they have a non-NA CSS value. The Annotated Genes are downloaded with GenomicState::GenomicStateHub().
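The subsetting described for CODING_ONLY can be pictured with a toy table. This is only a sketch of the idea, assuming the gene annotation carries a CSS column that is NA for non-coding genes. The data frame and its values below are made up for illustration and do not come from GenomicState.

```r
# Toy stand-in for the annotated genes; all names and values are invented.
genes <- data.frame(
    Gene = c("geneA", "geneB", "geneC"),
    CSS = c(1200L, NA, 5300L), # NA marks a gene without a coding region
    stringsAsFactors = FALSE
)

CODING_ONLY <- TRUE

# Keep only the rows with a non-NA CSS value when CODING_ONLY is TRUE
if (CODING_ONLY) {
    genes <- genes[!is.na(genes$CSS), , drop = FALSE]
}
genes$Gene
```

With the filter applied, a region is matched only against coding genes, which is why values such as 'name', 'distance', and 'Geneid' can differ between the two settings.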

Value

This function annotates all input sequences using bumphunter::matchGenes(). It returns a data frame in which each row is a genomic sequence specified in REGION. The columns c('seqnames', 'start', 'end', 'width', 'strand') list the chromosome, range, sequence length, and strand of the REGION. The columns c('name', 'annotation', 'description', 'region', 'distance', 'subregion', 'insideDistance', 'exonnumber', 'nexons', 'UTR', 'geneL', 'codingL', 'Geneid', 'subjectHits') are described in the bumphunter::matchGenes() documentation.

If SEQ=TRUE, a column 'Sequence' will be included. Including it is recommended when the probe sequence will be sent for synthesis.

If CSV=TRUE, a .csv file called region_info.csv will be saved to a temporary directory unless otherwise specified in OUTDIR.
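Since the file name is fixed, the exported results can be located with file.path(OUTDIR, "region_info.csv"). The sketch below assumes that location and uses a hypothetical output directory. So that it runs without the package, a small stand-in table (coordinates taken from the first example) is written in place of the real region_info() call, which is shown commented out.

```r
# Hypothetical output directory inside the session's temporary directory
out_dir <- file.path(tempdir(), "probe_annotation")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# With the package installed, the annotation step would be:
# region_info("chr20:10286777-10288069:+", OUTDIR = out_dir)

# Stand-in for that call: write a small table under the documented file name
csv_path <- file.path(out_dir, "region_info.csv")
write.csv(
    data.frame(seqnames = "chr20", start = 10286777L, end = 10288069L),
    csv_path,
    row.names = FALSE
)

# Read the results back for downstream use
res <- read.csv(csv_path, stringsAsFactors = FALSE)
res$end - res$start + 1L # region width
```

Passing an explicit OUTDIR keeps the file after the R session ends, whereas tempdir() is removed when the session closes.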

Author

Amanda J Price

Examples

x <- region_info("chr20:10286777-10288069:+", CSV = FALSE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
head(x)
#>   seqnames    start      end width strand   name
#> 1    chr20 10286777 10288069  1293      + SNAP25
#>                                                                                                                                                                                                                                                                                                                                                        annotation
#> 1 NM_001322902 NM_001322903 NM_001322904 NM_001322905 NM_001322906 NM_001322907 NM_001322908 NM_001322909 NM_001322910 NM_003081 NM_130811 NP_001309831 NP_001309832 NP_001309833 NP_001309834 NP_001309835 NP_001309836 NP_001309837 NP_001309838 NP_001309839 NP_003072 NP_570824 XM_005260808 XM_017028021 XM_017028022 XP_005260865 XP_016883510 XP_016883511
#>   description      region distance                subregion insideDistance
#> 1 overlaps 3' overlaps 3'    87299 overlaps exon downstream              0
#>   exonnumber nexons            UTR geneL codingL          Geneid
#> 1         10     10 overlaps 3'UTR 88588   30705 ENSG00000132639
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Sequence
#> 1 GCTGATTCCAACAAAACCAGAATTGATGAGGCCAACCAACGTGCAACAAAGATGCTGGGAAGTGGTTAAGTGTGCCCACCCGTGTTCTCCTCCAAATGCTGTCGGGCAAGATAGCTCCTTCATGCTTTTCTCATGGTATTATCTAGTAGGTCTGCACACATAACACACATCAGTCCACCCCCATTGTGAATGTTGTCCTGTGTCATCTGTCAGCTTCCCAACAATACTTTGTGTCTTTTGTTCTCTCTTGGTCTCTTTCTTTCCAAAGGTTGTACATAGTGGTCATTTGGTGGCTCTAACTCCTTGAGGTCTTGAGTTTCATTTTTCATTTTCTCTCCTCGGTGGCATTTGCTGAATAACAACAATTTAGGAATGCTCAATGTGCTGTTGATTCTTTCAATCCACAGTATTGTTCTTGTAAAACTGTGACATTCCACAGAGTTACTGCCACGGTCCTTTGAGTGTCAGGCTCTGAATCTCTCAAAATGTGCCGTCTTTGGTTCCTCATGGCTGTTATCTGTCTTTATGATTTCATGATTAGACAATGTGGAATTACATAACAGGCATTGCACTAAAAGTGATGTGATTTATGCATTTATGCATGAGAACTAAATAGATTTTTAGATTCCTACTTAAACAAAAACTTTCCATGACAGTAGCATACTGATGAGACAACACACACACACACAAAACAACAGCAACAACAACAGAACAACAACAAAGCATGCTCAGTATTGAGACACTGTCAAGATTAAGTTATACCAGCAAAAGTGCAGTAGTGTCACTTTTTTCCTGTCAATATATAGAGACTTCTAAATCATAATCATCCTTTTTTAAAAAAAAGAATTTTAAAAAAGATGGATTTGACACACTCACCATTTAATCATTTCCAGCAAAATATATGTTTGGCTGAAATTATGTCAAATGGATGTAATATAGGGTTTGTTTGCTGCTTTTGATGGCTATGTTTTGGAGAGAGCAATCTTGCTGTGAAACAGTGTGGATGTAAATTTTATAAGGCTGACTCTTACTAACCACCATTTCCCCTGTGGTTTGTTATCAGTACAATTCTTTGTTGCTTAATCTAGAGCTATGCACACCAAATTGCTGAGATGTTTAGTAGCTGATAAAGAAACCTTTTAAAAAAATAATATAAATGAATGAAATATAAACTGTGAGATAAATATCATTATAGCATGTAATATTAAATTCCTCCTGTCTCCTCTGTCAGTTTGTGAAGTGATTGACATTTTGTAGCTAGTTTAAAATTATTAAAAATTATAGACTCCAGAT

## You can easily transform this data.frame to a GRanges object
GenomicRanges::GRanges(x)
#> GRanges object with 1 range and 14 metadata columns:
#>       seqnames            ranges strand |   name             annotation
#>          <Rle>         <IRanges>  <Rle> | <AsIs>                 <AsIs>
#>   [1]    chr20 10286777-10288069      + | SNAP25 NM_001322902 NM_0013..
#>       description      region  distance                subregion insideDistance
#>          <factor>    <factor> <numeric>                 <factor>      <numeric>
#>   [1] overlaps 3' overlaps 3'     87299 overlaps exon downstream              0
#>       exonnumber    nexons            UTR     geneL   codingL          Geneid
#>        <numeric> <integer>       <factor> <numeric> <numeric>     <character>
#>   [1]         10        10 overlaps 3'UTR     88588     30705 ENSG00000132639
#>                     Sequence
#>                  <character>
#>   [1] GCTGATTCCAACAAAACCAG..
#>   -------
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths

y <- region_info(
    c(
        "chr20:10286777-10288069:+",
        "chr18:74690788-74692427:-",
        "chr19:49932861-49933829:-"
    ),
    CSV = FALSE, SEQ = FALSE
)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
head(y)
#>   seqnames    start      end width strand    name
#> 1    chr20 10286777 10288069  1293      +  SNAP25
#> 2    chr18 74690788 74692427  1640      -     MBP
#> 3    chr19 49932861 49933829   969      - SLC17A7
#>                                                                                                                                                                                                                                                                                                                                                                                        annotation
#> 1                                 NM_001322902 NM_001322903 NM_001322904 NM_001322905 NM_001322906 NM_001322907 NM_001322908 NM_001322909 NM_001322910 NM_003081 NM_130811 NP_001309831 NP_001309832 NP_001309833 NP_001309834 NP_001309835 NP_001309836 NP_001309837 NP_001309838 NP_001309839 NP_003072 NP_570824 XM_005260808 XM_017028021 XM_017028022 XP_005260865 XP_016883510 XP_016883511
#> 2 NM_001025081 NM_001025090 NM_001025092 NM_001025094 NM_001025098 NM_001025100 NM_001025101 NM_002385 NP_001020252 NP_001020261 NP_001020263 NP_001020271 NP_001020272 NP_002376 XM_017025778 XM_017025780 XM_024451185 XM_024451186 XM_024451187 XM_024451188 XM_024451189 XP_016881267 XP_016881269 XP_024306953 XP_024306954 XP_024306955 XP_024306956 XP_024306957 XR_001753201 XR_001753202
#> 3                                                                                                                                                                                                                                                                                                                                                                             NM_020309 NP_064705
#>   description      region distance                subregion insideDistance
#> 1 overlaps 3' overlaps 3'    87299 overlaps exon downstream              0
#> 2 inside exon      inside   153212              inside exon              0
#> 3 inside exon      inside    11788              inside exon              0
#>   exonnumber nexons            UTR  geneL codingL          Geneid
#> 1         10     10 overlaps 3'UTR  88588   30705 ENSG00000132639
#> 2         17     17 overlaps 3'UTR 154856  125322 ENSG00000197971
#> 3         13     13 overlaps 3'UTR  12959   10860 ENSG00000104888

candidates <- c(
    "chr20:10286777-10288069:+",
    "chr18:74690788-74692427:-",
    "chr19:49932861-49933829:-"
)
region_info(candidates, CSV = FALSE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
#>   seqnames    start      end width strand    name
#> 1    chr20 10286777 10288069  1293      +  SNAP25
#> 2    chr18 74690788 74692427  1640      -     MBP
#> 3    chr19 49932861 49933829   969      - SLC17A7
#>                                                                                                                                                                                                                                                                                                                                                                                        annotation
#> 1                                 NM_001322902 NM_001322903 NM_001322904 NM_001322905 NM_001322906 NM_001322907 NM_001322908 NM_001322909 NM_001322910 NM_003081 NM_130811 NP_001309831 NP_001309832 NP_001309833 NP_001309834 NP_001309835 NP_001309836 NP_001309837 NP_001309838 NP_001309839 NP_003072 NP_570824 XM_005260808 XM_017028021 XM_017028022 XP_005260865 XP_016883510 XP_016883511
#> 2 NM_001025081 NM_001025090 NM_001025092 NM_001025094 NM_001025098 NM_001025100 NM_001025101 NM_002385 NP_001020252 NP_001020261 NP_001020263 NP_001020271 NP_001020272 NP_002376 XM_017025778 XM_017025780 XM_024451185 XM_024451186 XM_024451187 XM_024451188 XM_024451189 XP_016881267 XP_016881269 XP_024306953 XP_024306954 XP_024306955 XP_024306956 XP_024306957 XR_001753201 XR_001753202
#> 3                                                                                                                                                                                                                                                                                                                                                                             NM_020309 NP_064705
#>   description      region distance                subregion insideDistance
#> 1 overlaps 3' overlaps 3'    87299 overlaps exon downstream              0
#> 2 inside exon      inside   153212              inside exon              0
#> 3 inside exon      inside    11788              inside exon              0
#>   exonnumber nexons            UTR  geneL codingL          Geneid
#> 1         10     10 overlaps 3'UTR  88588   30705 ENSG00000132639
#> 2         17     17 overlaps 3'UTR 154856  125322 ENSG00000197971
#> 3         13     13 overlaps 3'UTR  12959   10860 ENSG00000104888
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Sequence
#> 1                                                                                                                                                                                                                                                                                                                                                            GCTGATTCCAACAAAACCAGAATTGATGAGGCCAACCAACGTGCAACAAAGATGCTGGGAAGTGGTTAAGTGTGCCCACCCGTGTTCTCCTCCAAATGCTGTCGGGCAAGATAGCTCCTTCATGCTTTTCTCATGGTATTATCTAGTAGGTCTGCACACATAACACACATCAGTCCACCCCCATTGTGAATGTTGTCCTGTGTCATCTGTCAGCTTCCCAACAATACTTTGTGTCTTTTGTTCTCTCTTGGTCTCTTTCTTTCCAAAGGTTGTACATAGTGGTCATTTGGTGGCTCTAACTCCTTGAGGTCTTGAGTTTCATTTTTCATTTTCTCTCCTCGGTGGCATTTGCTGAATAACAACAATTTAGGAATGCTCAATGTGCTGTTGATTCTTTCAATCCACAGTATTGTTCTTGTAAAACTGTGACATTCCACAGAGTTACTGCCACGGTCCTTTGAGTGTCAGGCTCTGAATCTCTCAAAATGTGCCGTCTTTGGTTCCTCATGGCTGTTATCTGTCTTTATGATTTCATGATTAGACAATGTGGAATTACATAACAGGCATTGCACTAAAAGTGATGTGATTTATGCATTTATGCATGAGAACTAAATAGATTTTTAGATTCCTACTTAAACAAAAACTTTCCATGACAGTAGCATACTGATGAGACAACACACACACACACAAAACAACAGCAACAACAACAGAACAACAACAAAGCATGCTCAGTATTGAGACACTGTCAAGATTAAGTTATACCAGCAAAAGTGCAGTAGTGTCACTTTTTTCCTGTCAATATATAGAGACTTCTAAATCATAATCATCCTTTTTTAAAAAAAAGAATTTTAAAAAAGATGGATTTGACACACTCACCATTTAATCATTTCCAGCAAAATATATGTTTGGCTGAAATTATGTCAAATGGATGTAATATAGGGTTTGTTTGCTGCTTTTGATGGCTATGTTTTGGAGAGAGCAATCTTGCTGTGAAACAGTGTGGATGTAAATTTTATAAGGCTGACTCTTACTAACCACCATTTCCCCTGTGGTTTGTTATCAGTACAATTCTTTGTTGCTTAATCTAGAGCTATGCACACCAAATTGCTGAGATGTTTAGTAGCTGATAAAGAAACCTTTTAAAAAAATAATATAAATGAATGAAATATAAACTGTGAGATAAATATCATTATAGCATGTAATATTAAATTCCTCCTGTCTCCTCTGTCAGTTTGTGAAGTGATTGACATTTTGTAGCTAGTTTAAAATTATTAAAAATTATAGACTCCAGAT
#> 2 GGAGGAAGAGATAGTCGCTCTGGATCACCCATGGCTAGACGCTGAAAACCCACCTGGTTCCGGAATCCTGTCCTCAGCTTCTTAATATAACTGCCTTAAAACTTTAATCCCACTTGCCCCTGTTACCTAATTAGAGCAGATGACCCCTCCCCTAATGCCTGCGGAGTTGTGCACGTAGTAGGGTCAGGCCACGGCAGCCTACCGGCAATTTCCGGCCAACAGTTAAATGAGAACATGAAAACAGAAAACGGTTAAAACTGTCCCTTTCTGTGTGAAGATCACGTTCCTTCCCCCGCAATGTGCCCCCAGACGCACGTGGGTCTTCAGGGGGCCAGGTGCACAGACGTCCCTCCACGTTCACCCCTCCACCCTTGGACTTTCTTTTCGCCGTGGCTGCGGCACCCTTGCGCTTTTGCTGGTCACTGCCATGGAGGCACACAGCTGCAGAGACAGAGAGGACGTGGGCGGCAGAGAGGACTGTTGACATCCAAGCTTCCTTTGTTTTTTTTTCCTGTCCTTCTCTCACCTCCTAAAGTAGACTTCATTTTTCCTAACAGGATTAGACAGTCAAGGAGTGGCTTACTACATGTGGGAGCTTTTGGTATGTGACATGCGGGCTGGGCAGCTGTTAGAGTCCAACGTGGGGCAGCACAGAGAGGGGGCCACCTCCCCAGGCCGTGGCTGCCCACACACCCCAATTAGCTGAATTCGCGTGTGGCAGAGGGAGGAAAAGGAGGCAAACGTGGGCTGGGCAATGGCCTCACATAGGAAACAGGGTCTTCCTGGAGATTTGGTGATGGAGATGTCAAGCAGGTGGCCTCTGGACGTCACCGTTGCCCTGCATGGTGGCCCCAGAGCAGCCTCTATGAACAACCTCGTTTCCAAACCACAGCCCACAGCCGGAGAGTCCAGGAAGACTTGCGCACTCAGAGCAGAAGGGTAGGAGTCCTCTAGACAGCCTCGCAGCCGCGCCAGTCGCCCATAGACACTGGCTGTGACCGGGCGTGCTGGCAGCGGCAGTGCACAGTGGCCAGCACTAACCCTCCCTGAGAAGATAACCGGCTCATTCACTTCCTCCCAGAAGACGCGTGGTAGCGAGTAGGCACAGGCGTGCACCTGCTCCCGAATTACTCACCGAGACACACGGGCTGAGCAGACGGCCCCGTGGATGGAGACAAAGAGCTCTTCTGACCATATCCTTCTTAACACCCGCTGGCATCTCCTTTCGCGCCTCCCTCCCTAACCTACTGACCCACCTTTTGATTTTAGCGCACCTGTGATTGATAGGCCTTCCAAAGAGTCCCACGCTGGCATCACCCTCCCCGAGGACGGAGATGAGGAGTAGTCAGCGTGATGCCAAAACGCGTCTTCTTAATCCAATTCTAATTCTGAATGTTTCGTGTGGGCTTAATACCATGTCTATTAATATATAGCCTCGATGATGAGAGAGTTACAAAGAACAAAACTCCAGACACAAACCTCCAAATTTTTCAGCAGAAGCACTCTGCGTCGCTGAGCTGAGGTCGGCTCTGCGATCCATACGTGGCCGCACCCACACAGCACGTGCTGTGACGATGGCTGAACGGAAAGTGTACACTGTTCCTGAATATTGAAATAAAACAATAAACTTTTAATGGTAT
#> 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                ACACACAGCACATTTCAGCCCCCCAGGCCCCCACCCCCTGTCCGGGACTACTGACCATGTGCCTCCCACTGAATGGCAGTTTCCAGGACCTCCATTCCACTCATCTCTGGCCTGAGTGACAGTGTCAAGGAACCCTGCTCCTCTCTGTCCTGCCTCAGGCCTAAGAAGCACTCTCCCTTGTTCCCAGTGCTGTCAAATCCTCTTTCCTTCCCAATTGCCTCTCAGGGGTAGTGAAGCTGCAGACTGACAGTTTCAAGGATACCCAAATTCCCCTAAAGGTTCCCTCTCCACCCGTTCTGCCTCAGTGGTTTCAAATCTCTCCTTTCAGGGCTTTATTTGAATGGACAGTTCGACCTCTTACTCTCTCTTGTGGTTTTGAGGCACCCACACCCCCCGCTTTCCTTTATCTCCAGGGACTCTCAGGCTAACCTTTGAGATCACTCAGCTCCCATCTCCTTTCAGAAAAATTCAAGGTCCTCCTCTAGAAGTTTCAAATCTCTCCCAACTCTGTTCTGCATCTTCCAGATTGGTTTAACCAATTACTCGTCCCCGCCATTCCAGGGATTGATTCTCACCAGCGTTTCTGATGGAAAATGGCGGTTTCAAGTCCCCGATTCCGTGCCCACTTCACATCTCCCCTACCAGCAGATTCTGCGAAAGCACCAAATTTCTCAAGACCCTCTTCTCCCTAGCTTAGCATAATGTCTGGGGAAACAACCAAAATCGCAATTTTAACAATATGCCTCTCTACCCCCGTGCACTTTTTCTGACATGGTTTTCAGGTCTAAATAGTGGCTGCTCCAGTCCATGAACTCAAAGGTTTGAAGCTACCACCATTGAACTCCCCCATGGTGGTTTCATGATGCCCCCTCCCCAATTCCTCGCACTTTATTCTCCTGGGTGGTTTCGAACTACCCTGTTTCTCAGTGGCCATTTGTTGTGTCCCTCAGGGGCTTAATGACTCAAAAT

## Explore the effect of changing CODING_ONLY
## Check how "distance", "name", and "Geneid", among other values, change
region_info("chr10:135379301-135379311:+", CSV = FALSE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
#>   seqnames     start       end width strand name annotation description region
#> 1    chr10 135379301 135379311    11      + <NA>       <NA> inside exon inside
#>   distance   subregion insideDistance exonnumber nexons
#> 1        0 inside exon              0          1      5
#>                           UTR geneL codingL          Geneid    Sequence
#> 1 inside transcription region 63710      NA ENSG00000288107 GCATGTGCGCT
region_info("chr10:135379301-135379311:+", CSV = FALSE, CODING_ONLY = TRUE)
#> loading from cache
#> Completed! If CSV=TRUE, check for region_info.csv in the temporary
#> directory (i.e. tempdir()) unless otherwise specified in OUTDIR.
#>   seqnames     start       end width strand   name          annotation
#> 1    chr10 135379301 135379311    11      + CYP2E1 NM_000773 NP_000764
#>   description     region distance subregion insideDistance exonnumber nexons
#> 1  downstream downstream    45391      <NA>             NA         NA     12
#>    UTR geneL codingL          Geneid    Sequence
#> 1 <NA> 40814   11568 ENSG00000130649 GCATGTGCGCT
## Not run: these calls write region_info.csv to the directory given in OUTDIR
if (FALSE) {
    region_info(candidates, OUTDIR = "/path/to/directory/")

    region_info("chr20:10286777-10288069:+", OUTDIR = "/path/to/directory")
}