GenBank (Genetic Sequence Data Bank) is a
rapidly growing international repository of known genetic sequences
from a variety of organisms. Its use is central to modern biology and
bioinformatics. This chapter shows you how to write Perl programs to extract
information from GenBank files and libraries. Exercises include
looking for patterns; creating special libraries; and parsing the
flat-file format to extract the DNA, annotation, and features. You
will learn how to make a DBM database to create your own rapid-access
lookups on selected data in a GenBank library.
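
As a rough preview of the DBM idea, the sketch below ties a Perl hash
to an on-disk DBM file and stores the byte offset of each record under
its accession number, so that a record can later be fetched without
rescanning the library. The sub names (index_library, fetch_record),
the choice of SDBM_File, and the file names passed in are assumptions
made for illustration, not the chapter's own code.

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;    # a DBM implementation bundled with Perl

# Hypothetical helper: store the byte offset of every record in a GenBank
# library, keyed by the first accession number on its ACCESSION line.
sub index_library {
    my ($library, $dbmfile) = @_;
    tie(my %offset, 'SDBM_File', $dbmfile, O_RDWR | O_CREAT, 0644)
        or die "Cannot open DBM file $dbmfile: $!";
    open(my $fh, '<', $library) or die "Cannot open $library: $!";
    local $/ = "//\n";          # each record ends with a line holding //
    my $start = tell($fh);      # offset where the next record begins
    while (my $record = <$fh>) {
        if ($record =~ /^ACCESSION\s+(\S+)/m) {
            $offset{$1} = $start;
        }
        $start = tell($fh);
    }
    close($fh);
    untie %offset;
    return;
}

# Hypothetical helper: jump straight to one record using the stored offset.
sub fetch_record {
    my ($library, $dbmfile, $accession) = @_;
    tie(my %offset, 'SDBM_File', $dbmfile, O_RDONLY, 0644)
        or die "Cannot open DBM file $dbmfile: $!";
    my $pos = $offset{$accession};
    untie %offset;
    return unless defined $pos;     # accession not in the index
    open(my $fh, '<', $library) or die "Cannot open $library: $!";
    seek($fh, $pos, 0);             # 0 means "from the start of the file"
    local $/ = "//\n";
    my $record = <$fh>;
    close($fh);
    return $record;
}
```

Storing offsets rather than whole records keeps the DBM file small;
the library file itself remains the single copy of the data.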

Perl is a great tool for dealing with GenBank files. It enables you
to extract and use any of the detailed data in the sequence and in
the annotation, such as in the FEATURES table and elsewhere. When I
first started using Perl, I wrote a program that searched GenBank for
all sequence records annotated as being located on human chromosome
22. I found many genes for which that information was so deeply buried
within the annotation that the major gene mapping database, Genome
Database (GDB), hadn't included them in its chromosome map. I
think you'll discover the same feeling of power over the
information when you start applying Perl to GenBank files.
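
A search like the chromosome 22 example might be sketched as follows.
It assumes the chromosome is recorded in a /chromosome="..." qualifier
in the FEATURES table; real records can bury this information
elsewhere in the annotation, so treat it only as a starting point. The
sub name records_on_chromosome and the two abbreviated records are
invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the accession numbers of all records read
# from $fh whose annotation has a /chromosome="$chromosome" qualifier.
sub records_on_chromosome {
    my ($fh, $chromosome) = @_;
    my @hits;
    local $/ = "//\n";          # read one GenBank record at a time
    while (my $record = <$fh>) {
        next unless $record =~ m{/chromosome="\Q$chromosome\E"};
        my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
        push @hits, defined $accession ? $accession : 'unknown';
    }
    return @hits;
}

# Two abbreviated, invented records stand in for a real library file.
my $library = <<'END';
LOCUS       TEST1
ACCESSION   Z00001
FEATURES             Location/Qualifiers
     source          1..8
                     /organism="Homo sapiens"
                     /chromosome="22"
ORIGIN
        1 acgtacgt
//
LOCUS       TEST2
ACCESSION   Z00002
FEATURES             Location/Qualifiers
     source          1..8
                     /organism="Homo sapiens"
                     /chromosome="2"
ORIGIN
        1 ttgaccaa
//
END

# An in-memory filehandle lets the sketch run without a library on disk.
open(my $fh, '<', \$library) or die "Cannot open in-memory library: $!";
my @found = records_on_chromosome($fh, '22');
close($fh);
print "Records on chromosome 22: @found\n";
```

Because the closing quote is part of the pattern, a search for
chromosome 2 does not also pick up records on chromosome 22.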

Most biologists are familiar with GenBank. Researchers can perform a
search, such as a BLAST search on some query sequence, and collect a
set of GenBank files of related sequences as a result. Because the
GenBank records are maintained by the individual scientists who
discovered the sequences, if you find some new sequence of interest,
you can publish it in GenBank.

GenBank files have a great deal of information in them in addition to
sequence data, including identifiers such as accession numbers and
gene names, phylogenetic classification, and references to published
literature. A GenBank file may also include a detailed FEATURES table
that summarizes facts about the sequence, such as the location of the
regulatory regions, the protein translation, and exons and introns.
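
To make these pieces concrete, here is a minimal sketch that pulls a
few of them out of a single record with regular expressions. The sub
name parse_record and the abbreviated record are invented for
illustration; real records have many more fields, and continuation
lines (such as a multi-line DEFINITION) are not handled here.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GenBank record into a few named parts.
sub parse_record {
    my ($record) = @_;
    my %part;
    # Everything up to and including the ORIGIN line is annotation; the
    # sequence follows and runs to the // terminator.
    my ($annotation, $sequence) =
        $record =~ /^(.*?^ORIGIN[^\n]*\n)(.*)^\/\//ms;
    return unless defined $annotation;    # not a complete record
    ($part{accession})  = $annotation =~ /^ACCESSION\s+(\S+)/m;
    ($part{definition}) = $annotation =~ /^DEFINITION\s+(.*)$/m;
    ($part{features})   = $annotation =~ /^(FEATURES.*?)^ORIGIN/ms;
    $sequence =~ s/[\s\d]//g;    # drop position numbers and whitespace
    $part{dna} = $sequence;
    return %part;
}

# An abbreviated, invented record for demonstration.
my $record = <<'END';
LOCUS       TEST1                     12 bp    DNA
DEFINITION  Invented example record.
ACCESSION   Z00001
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="Homo sapiens"
ORIGIN
        1 acgtacgtac gt
//
END

my %info = parse_record($record);
print "$info{accession}: ", length($info{dna}), " bases\n";
```

Run as written, this prints "Z00001: 12 bases". Returning a hash keeps
the parts addressable by name, which is convenient when later code
needs only one of them.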

GenBank is sometimes referred to as a databank or data store, which is
different from a database. Databases typically have a relational
structure imposed upon the data, including associated indices and links
and a query language. GenBank, in comparison, is a flat file, that is,
an ASCII text file that is easily readable by humans.

From its humble beginnings, GenBank has grown rapidly, and the
flat-file format has shown signs of strain during that growth. With a
quickly advancing body of knowledge, especially one that's
growing as quickly as genetic data, it's difficult for the
design of a databank to keep up. Several reworkings of GenBank have
been done, but the flat-file format, in all its frustrating glory,
still remains.

Due to a certain flexibility in the content of some sections of a
GenBank record, extracting the information you're looking for
can be tricky. This flexibility is good, in that it allows you to put
what you think is most important into the data's annotation.
It's bad because that same flexibility makes it harder to write
programs that find and extract the desired annotations. As a result,
the trend has been towards more structure in the annotations.

Since Perl's data structures and its use of regular expressions
make it a good tool for manipulating flat files, Perl is especially
well-suited to deal with GenBank data. Using these features in Perl
and building on the skills you've developed from previous
chapters, you can write programs to access the accumulated genetic
knowledge of the scientific community in GenBank.

Since this is a beginning book that requires no programming
experience, you should not expect to find the most finished,
multipurpose software here. Instead you'll find a solid
introduction to parsing and building fast lookup tables for GenBank
files. If you've never done so, I strongly recommend you explore the
National Center for Biotechnology Information (NCBI) at the National
Institutes of Health (NIH). While you're at it, stop by the European
Bioinformatics Institute (EBI) at http://www.ebi.ac.uk and the
bioinformatics arm of the European Molecular Biology Laboratory (EMBL)
at http://www.embl-heidelberg.de/. These are large, heavily funded
governmental bioinformatics powerhouses, and they have
(and distribute) a great deal of state-of-the-art bioinformatics