Download parts of NCBI’s GenBank to a local folder and build a simple SQL-like database from them. Use the ‘get’ tools (the gb_*_get() functions) to query the database by accession ID. Wrappers for rentrez functions are also provided, so sequences that are not available locally can be searched for online through Entrez.

See the detailed tutorials for more information.

Introduction

Vous entrez, vous rentrez et, maintenant, vous … restez! (You enter, you re-enter and, now, you … stay!)

Downloading sequences and sequence information from GenBank and related NCBI taxonomic databases is often performed via the NCBI API, Entrez. Entrez, however, limits the number of requests a user can make, so downloading large amounts of sequence data in this way can be inefficient. In scripts that make many Entrez calls, downloading may take days, weeks or even months.
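
To give a feel for the scale involved, the sketch below estimates a lower bound on the time needed for a given number of Entrez requests. The rate limits used (3 requests per second without an API key, 10 with one) reflect NCBI's published guidance at the time of writing and are an assumption here, not something set by restez. `entrez_days()` is a hypothetical helper, and the estimate ignores network latency and server response time, so real runs will be slower.

```r
# Hypothetical helper: minimum number of days needed to send `n_requests`
# Entrez requests at a fixed rate limit (requests per second).
# Assumed limits: 3 per second without an API key, 10 per second with one.
entrez_days <- function(n_requests, per_second = 3) {
  seconds <- n_requests / per_second
  seconds / (60 * 60 * 24)
}

# One million single-record requests
without_key <- entrez_days(1e6)                  # just under 4 days
with_key <- entrez_days(1e6, per_second = 10)    # a little over 1 day
print(round(c(without_key = without_key, with_key = with_key), 2))
```

Batching several IDs into one request reduces the request count, but the per-request limit still applies.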

This package aims to make sequence retrieval more efficient by allowing a user to download large sections of the GenBank database to their local machine and query this local database either through package-specific functions or Entrez wrappers. This process is more efficient because GenBank downloads are made via NCBI’s FTP using compressed sequence files. With a good internet connection and a middle-of-the-road computer, a database comprising 20 GB of sequence information can be generated in less than 10 minutes.

Installation

You can install restez from GitHub with:

# install.packages("devtools")
devtools::install_github("AntonelliLab/restez")

Quick Examples

For full details of the package’s functions, and for guides to downloading, constructing and querying a database, see the detailed tutorials.

Setup

# Warning: running these examples may take a few minutes
library(restez)
#> -------------
#> restez v0.1.0
#> -------------
#> Remember to restez_path_set() and, then, restez_connect()
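# choose a location to store GenBank files
# (`rstz_pth` is not defined in this snippet: set it first to the path of a
# folder of your choice)
restez_path_set(rstz_pth)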
# Run the download function
db_download()
# connect, ensure safe disconnect after finishing
restez_connect()
#> Remember to run `restez_disconnect()`
# after download, create the local database
db_create()
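
The setup comments above ask for a safe disconnect after finishing. One way to guarantee this in a script is base R's `on.exit()`, which runs even if the work in between fails. The following is a minimal sketch only: `with_connection()` is a hypothetical helper, not part of restez, and the toy connect and disconnect functions stand in for a real database so the example runs on its own.

```r
# Hypothetical helper: open a connection, run `task`, and always disconnect,
# even if `task` throws an error.
with_connection <- function(connect, disconnect, task) {
  connect()
  on.exit(disconnect(), add = TRUE)
  task()
}

# Toy stand-ins so the sketch runs without a database
state <- new.env()
state$open <- FALSE
toy_connect <- function() state$open <- TRUE
toy_disconnect <- function() state$open <- FALSE

# The task sees an open connection; afterwards it is closed again
result <- with_connection(toy_connect, toy_disconnect, function() state$open)
```

With restez, the same pattern could presumably be called as `with_connection(restez_connect, restez_disconnect, function() db_create())`, using the functions shown in the setup above.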

Query

# get a random accession ID from the database
id <- sample(list_db_ids(), 1)
#> Warning in list_db_ids(): Number of ids returned was limited to [100].
#> Set `n=NULL` to return all ids.
# you can extract:
# sequences
seq <- gb_sequence_get(id)[[1]]
str(seq)
#>  chr "GATCCTGCCGGAAGCTCGACGAGGTCGATATTGTAGTTGCCAGCACTCGTGCGATTCTTGCCACCACGTCCTTCGAGATAGAAGGTTCGGTCCTCGTTGCTGGCAAGTATC"| __truncated__
# definitions
def <- gb_definition_get(id)[[1]]
print(def)
#> [1] "Unidentified clone B8 DNA sequence from ocean beach sand"
# organisms
org <- gb_organism_get(id)[[1]]
print(org)
#> [1] "unidentified"
# or whole records
rec <- gb_record_get(id)[[1]]
cat(rec)
#> LOCUS       AF298111                 705 bp    DNA     linear   UNA 23-NOV-2000
#> DEFINITION  Unidentified clone B8 DNA sequence from ocean beach sand.
#> ACCESSION   AF298111
#> VERSION     AF298111.1
#> KEYWORDS    .
#> SOURCE      unidentified
#>   ORGANISM  unidentified
#>             unclassified sequences.
#> REFERENCE   1  (bases 1 to 705)
#>   AUTHORS   Naviaux,R.K.
#>   TITLE     Sand DNA: a multigenomic library on the beach
#>   JOURNAL   Unpublished
#> REFERENCE   2  (bases 1 to 705)
#>   AUTHORS   Naviaux,R.K.
#>   TITLE     Direct Submission
#>   JOURNAL   Submitted (21-AUG-2000) Medicine, University of California, San
#>             Diego School of Medicine, 200 West Arbor Drive, San Diego, CA
#>             92103-8467, USA
#> FEATURES             Location/Qualifiers
#>      source          1..705
#>                      /organism="unidentified"
#>                      /mol_type="genomic DNA"
#>                      /db_xref="taxon:32644"
#>                      /clone="B8"
#>                      /note="anonymous environmental sample sequence from ocean
#>                      beach sand"
#> ORIGIN      
#>         1 gatcctgccg gaagctcgac gaggtcgata ttgtagttgc cagcactcgt gcgattcttg
#>        61 ccaccacgtc cttcgagata gaaggttcgg tcctcgttgc tggcaagtat cgtgaccata
#>       121 gcgtccttgc tccggttctc acgggtaaag aaatctgcga gtgcatcccc gagctcgggc
#>       181 ggctccatgc cgtcaaagtc gtagccggga acggccacct gaaaatcact agaaatcagc
#>       241 ctctctttgc tgactccgtc cacaagggtc agataggcgt cgaagtcggc cgtgtgccct
#>       301 cgcatgacgg cagctacccg cgtgtttgcg ggaacgtcga acttgacgat caccgtcgcc
#>       361 gccagcacct cagccttgct tggagtcggg gagccggaca agcctaggct acggatcgaa
#>       421 ccactgatgc tctgccccac ctgcattccc acgatggccg agctgtcgag caagtcatcc
#>       481 gagtcgagga gatcatcgtc cggagctgtg ccgcagccca tcgccagagc agaaaattgg
#>       541 cactatggaa gtacagcgca tgccttcttt atgagcacnc gnatgccacg ggctacnctn
#>       601 tgttttcgca gcttacacnc ttcatttgcg ctgaagcggg caggttggca ncctttgggt
#>       661 aacataccca ctagttcgag gccgcttttt agttgcgagc tcgac
#> //
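
The definition and sequence returned by the 'get' functions can also be recombined by hand, for example to write a FASTA entry. A minimal sketch in base R, assuming a hypothetical helper `to_fasta()` (not part of restez) and a line width of 70 characters; the id, definition and sequence below are made-up toy values.

```r
# Hypothetical helper: format one sequence as a FASTA entry, wrapping the
# sequence at `width` characters per line.
to_fasta <- function(id, definition, sequence, width = 70) {
  stopifnot(nchar(sequence) > 0, width > 0)
  n <- nchar(sequence)
  starts <- seq(1, n, by = width)
  ends <- pmin(starts + width - 1, n)
  chunks <- substring(sequence, starts, ends)
  paste0(">", id, " ", definition, "\n", paste(chunks, collapse = "\n"), "\n")
}

# Toy values for illustration only
toy_seq <- strrep("ACGT", 20)  # 80 bases
fasta <- to_fasta("TOY0001.1", "Toy sequence for illustration", toy_seq)
cat(fasta)
```

In practice the three arguments would come from `id`, `gb_definition_get(id)[[1]]` and `gb_sequence_get(id)[[1]]`.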

Entrez wrappers

# use the entrez_* wrappers to access GB data
res <- entrez_fetch(db = 'nucleotide', id = id, rettype = 'fasta')
cat(res)
#> >AF298111.1 Unidentified clone B8 DNA sequence from ocean beach sand
#> GATCCTGCCGGAAGCTCGACGAGGTCGATATTGTAGTTGCCAGCACTCGTGCGATTCTTGCCACCACGTC
#> CTTCGAGATAGAAGGTTCGGTCCTCGTTGCTGGCAAGTATCGTGACCATAGCGTCCTTGCTCCGGTTCTC
#> ACGGGTAAAGAAATCTGCGAGTGCATCCCCGAGCTCGGGCGGCTCCATGCCGTCAAAGTCGTAGCCGGGA
#> ACGGCCACCTGAAAATCACTAGAAATCAGCCTCTCTTTGCTGACTCCGTCCACAAGGGTCAGATAGGCGT
#> CGAAGTCGGCCGTGTGCCCTCGCATGACGGCAGCTACCCGCGTGTTTGCGGGAACGTCGAACTTGACGAT
#> CACCGTCGCCGCCAGCACCTCAGCCTTGCTTGGAGTCGGGGAGCCGGACAAGCCTAGGCTACGGATCGAA
#> CCACTGATGCTCTGCCCCACCTGCATTCCCACGATGGCCGAGCTGTCGAGCAAGTCATCCGAGTCGAGGA
#> GATCATCGTCCGGAGCTGTGCCGCAGCCCATCGCCAGAGCAGAAAATTGGCACTATGGAAGTACAGCGCA
#> TGCCTTCTTTATGAGCACNCGNATGCCACGGGCTACNCTNTGTTTTCGCAGCTTACACNCTTCATTTGCG
#> CTGAAGCGGGCAGGTTGGCANCCTTTGGGTAACATACCCACTAGTTCGAGGCCGCTTTTTAGTTGCGAGC
#> TCGAC
# if the id is not in the local database
# these wrappers will search online via the rentrez package
res <- entrez_fetch(db = 'nucleotide', id = c('S71333.1', id),
                    rettype = 'fasta')
#> [1] id(s) are unavailable locally, searching online.
cat(res)
#> >AF298111.1 Unidentified clone B8 DNA sequence from ocean beach sand
#> GATCCTGCCGGAAGCTCGACGAGGTCGATATTGTAGTTGCCAGCACTCGTGCGATTCTTGCCACCACGTC
#> CTTCGAGATAGAAGGTTCGGTCCTCGTTGCTGGCAAGTATCGTGACCATAGCGTCCTTGCTCCGGTTCTC
#> ACGGGTAAAGAAATCTGCGAGTGCATCCCCGAGCTCGGGCGGCTCCATGCCGTCAAAGTCGTAGCCGGGA
#> ACGGCCACCTGAAAATCACTAGAAATCAGCCTCTCTTTGCTGACTCCGTCCACAAGGGTCAGATAGGCGT
#> CGAAGTCGGCCGTGTGCCCTCGCATGACGGCAGCTACCCGCGTGTTTGCGGGAACGTCGAACTTGACGAT
#> CACCGTCGCCGCCAGCACCTCAGCCTTGCTTGGAGTCGGGGAGCCGGACAAGCCTAGGCTACGGATCGAA
#> CCACTGATGCTCTGCCCCACCTGCATTCCCACGATGGCCGAGCTGTCGAGCAAGTCATCCGAGTCGAGGA
#> GATCATCGTCCGGAGCTGTGCCGCAGCCCATCGCCAGAGCAGAAAATTGGCACTATGGAAGTACAGCGCA
#> TGCCTTCTTTATGAGCACNCGNATGCCACGGGCTACNCTNTGTTTTCGCAGCTTACACNCTTCATTTGCG
#> CTGAAGCGGGCAGGTTGGCANCCTTTGGGTAACATACCCACTAGTTCGAGGCCGCTTTTTAGTTGCGAGC
#> TCGAC
#> 
#> >S71333.1 alpha 1,3 galactosyltransferase [New World monkeys, mermoset lymphoid cell line B95.8, mRNA Partial, 1131 nt]
#> ATGAATGTCAAAGGAAAAGTAATTCTGTCGATGCTGGTTGTCTCAACTGTGATTGTTGTGTTTTGGGAAT
#> ATATCAACAGCCCAGAAGGCTCTTTCTTGTGGATATATCACTCAAAGAACCCAGAAGTTGATGACAGCAG
#> TGCTCAGAAGGACTGGTGGTTTCCTGGCTGGTTTAACAATGGGATCCACAATTATCAACAAGAGGAAGAA
#> GACACAGACAAAGAAAAAGGAAGAGAGGAGGAACAAAAAAAGGAAGATGACACAACAGAGCTTCGGCTAT
#> GGGACTGGTTTAATCCAAAGAAACGCCCAGAGGTTATGACAGTGACCCAATGGAAGGCGCCGGTTGTGTG
#> GGAAGGCACTTACAACAAAGCCATCCTAGAAAATTATTATGCCAAACAGAAAATTACCGTGGGGTTGACG
#> GTTTTTGCTATTGGAAGATATATTGAGCATTACTTGGAGGAGTTCGTAACATCTGCTAATAGGTACTTCA
#> TGGTCGGCCACAAAGTCATATTTTATGTCATGGTGGATGATGTCTCCAAGGCGCCGTTTATAGAGCTGGG
#> TCCTCTGCGTTCCTTCAAAGTGTTTGAGGTCAAGCCAGAGAAGAGGTGGCAAGACATCAGCATGATGCGT
#> ATGAAGACCATCGGGGAGCACATCTTGGCCCACATCCAACACGAGGTTGACTTCCTCTTCTGCATGGATG
#> TGGACCAGGTCTTCCAAGACCATTTTGGGGTAGAGACCCTGGGCCAGTCGGTGGCTCAGCTACAGGCCTG
#> GTGGTACAAGGCAGATCCTGATGACTTTACCTATGAGAGGCGGAAAGAGTCGGCAGCATATATTCCATTT
#> GGCCAGGGGGATTTTTATTACCATGCAGCCATTTTTGGAGGAACACCGATTCAGGTTCTCAACATCACCC
#> AGGAGTGCTTTAAGGGAATCCTCCTGGACAAGAAAAATGACATAGAAGCCGAGTGGCATGATGAAAGCCA
#> CCTAAACAAGTATTTCCTTCTCAACAAACCCTCTAAAATCTTATCTCCAGAATACTGCTGGGATTATCAT
#> ATAGGCCTGCCTTCAGATATTAAAACTGTCAAGCTATCATGGCAAACAAAAGAGTATAATTTGGTTAGAA
#> AGAATGTCTGA
restez_disconnect()
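
The local-first behaviour of the wrappers can be pictured as a lookup with a fallback. The sketch below only illustrates that general pattern and is not restez's actual implementation: `fetch_with_fallback()`, `local_lookup()` and `remote_lookup()` are hypothetical names, and a small named vector stands in for both the local database and the online service.

```r
# Hypothetical illustration of a local-first lookup with an online fallback.
# `local_lookup` and `remote_lookup` each take a character vector of ids and
# return a character vector of records named by id.
fetch_with_fallback <- function(ids, local_lookup, remote_lookup) {
  local <- local_lookup(ids)
  missing <- setdiff(ids, names(local))
  remote <- character(0)
  if (length(missing) > 0) {
    message(length(missing), " id(s) are unavailable locally, searching online.")
    remote <- remote_lookup(missing)
  }
  # Local hits first, then anything that had to be fetched remotely
  c(local, remote)
}

# Toy stand-ins for the local database and the online service
local_store <- c(A1 = ">A1 toy local record\nACGT\n")
local_lookup <- function(ids) local_store[intersect(ids, names(local_store))]
remote_lookup <- function(ids) {
  stats::setNames(paste0(">", ids, " toy remote record\nTTTT\n"), ids)
}

res <- fetch_with_fallback(c("B2", "A1"), local_lookup, remote_lookup)
cat(res, sep = "\n")
```

Only the ids missing from the local store are passed to the remote lookup, which keeps the number of online requests to a minimum.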

Contributing

Want to contribute? Check the contributing page.

Version

Pre-release version 0 for review.

Licence

MIT

References

Benson, D. A., Karsch-Mizrachi, I., Clark, K., Lipman, D. J., Ostell, J., & Sayers, E. W. (2012). GenBank. Nucleic Acids Research, 40(Database issue), D48–D53. http://doi.org/10.1093/nar/gkr1202

Winter, D. J. (2017). rentrez: An R package for the NCBI eUtils API. PeerJ Preprints, 5, e3179v2. https://doi.org/10.7287/peerj.preprints.3179v2

Maintainer

Dom Bennett