Week One: Databases and Online Resources

Databases

Databases are curated sources of relevant information. Given the explosion of data in the life sciences, the management, structure and interrelationships between databases have become fundamental issues in the field.
As with any storage system, the key is not the ability to store information (which, given the cost of computer memory, is virtually unlimited) but the ease with which relevant information can be retrieved (and irrelevant information omitted).

Three of the main databases we will be working with are collections of sequence and other kinds of molecular information (such as structure), and they act to centralize information for molecular research. They are constantly being updated and, as with all databases, they are only as good as 1) the data being entered and 2) the quality and consistency of the curation.

So let's check them out:

NCBI
http://www.ncbi.nlm.nih.gov/

EBI
http://www.ebi.ac.uk/

DDBJ
http://www.ddbj.nig.ac.jp/

Here, in contrast, is a database that acts as an aggregator of other databases: http://www.eol.org/, which is the next level of organization above resources like NCBI.

UNDERSTANDING NCBI

To understand the structure of NCBI, start with the site map: http://www.ncbi.nlm.nih.gov/Sitemap/index.html

A more detailed entry point might be the "About" page:

http://www.ncbi.nlm.nih.gov/About/index.html

Here you’ll learn about the objectives of the database and a little bit of its history.

I expect you to read the section called Science Primer, which explains the basics behind each area.

http://www.ncbi.nlm.nih.gov/About/primer/index.html

In particular, pay attention to the sections on "Bioinformatics", "SNP", "Microarrays" and "Phylogenetics".

How do people use this?

NCBI can be used in countless ways, given the number of resources it compiles. Usually people are trying to get information on a molecule, an organism, or a biological question, and there are multiple ways of getting to each.

1. Let’s start with a molecule.

Suppose you’re interested in a paper that came out recently in PNAS (http://www.pnas.org/content/107/45/19496.full.pdf+html) in which the authors have identified a potential receptor for two now-extinct primate retroviruses.

You could look for this receptor in a number of ways:
by name: copper transport protein
by amino acid sequence: MGMNHMGMNHMGMNHMDHMDHMDHMDNNSTMPPHHHPTTASHSHGGGDMPMTFYFGFKRVELLFYGLVINTPGEMAGAFVAVFLLAMFYEGLKIAREGLLRKSQVSIRYNSMPVPGPNGTILMETHKTVGQQMLS
by nucleotide sequence: agacggcgga gcttgacctg ggaagacttt ttgctgactc tcatcttttt ctggaaaact
or by accession number: HQ290320
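
If you would rather script a lookup than type it into the web form, here is a minimal sketch. It assumes NCBI's E-utilities web service, which this handout does not otherwise use, and it assumes HQ290320 is a nucleotide record. The helper names search_url and fetch_url are made up for illustration. The code only builds the request URLs; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities service (an assumption: the handout
# itself only uses the web interface).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def search_url(db, term, retmax=20):
    """Build an ESearch URL: find record IDs in `db` that match `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def fetch_url(db, accession, rettype="fasta"):
    """Build an EFetch URL: download one record by its accession number."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


# Search by name (many possible hits) versus fetch by accession (one record).
by_name = search_url("protein", "copper transport protein")
by_accession = fetch_url("nuccore", "HQ290320")  # assumed nucleotide record
print(by_name)
print(by_accession)
```

Notice that the name search asks "which records mention these words?", while the accession fetch asks for exactly one record.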

Each of these methods comes up with a different answer to a slightly different question.

Q1: What are the questions?

Q2: Why are we retrieving things that come from different organisms?

Q3: Why are we retrieving different proteins?

Let’s go back to NCBI’s homepage. Paste the accession number into the search box and search all databases. Write down the results you get. Explore the other resources that return hits for this accession number, and write down the accession numbers for homologues of this gene found in a plant and in an insect.

Let's now do the same but searching against a different part of NCBI: the Conserved Domain Database (CDD). Do we expect different results? Why?

Now we are in NCBI's Conserved Domain Database, a protein annotation resource, which consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. If you want to know more about the Conserved Domain Database (or any other), you’ll find a help link on almost every page; alternatively, you can go back to the About NCBI page. The CDD is a part of NCBI Structure. Explore.

Q4: Tell me what other kinds of resources are in NCBI Structure.

2. Now, say we’re interested in an organism.

The paper in question used the hamster as the model organism. Let’s type in the genus Cricetulus and see what comes up in the databases. Notice that overall there are many hits for this organism. We know that hits are sometimes not accurate, so let’s figure out whether all those accessions are for our hamster.

Go into taxonomy. See who’s there. Is this our hamster? Go into the taxonomy record.

Q5: How many species of Cricetulus are there? Does this matter?

So return to All databases and type in the full species name (Cricetulus griseus).

Now you can go into the protein database. Add a Boolean operator (AND) and find the copper transport receptor sequence. Open any of the hits, and write down the accession numbers for both the nucleotide and the amino acid sequence. Also go into the CDD and try to visualize the cytochrome B structure.
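
To see what the Boolean search is doing, here is a small sketch of how such a query string can be put together. The [Organism] field tag is standard Entrez search syntax; the function name entrez_query is hypothetical and used only for illustration.

```python
def entrez_query(organism, *keywords):
    """Join an organism filter and free-text keywords with Boolean AND."""
    # Quote the organism so both words are matched together as one name.
    parts = [f'"{organism}"[Organism]']
    # Parenthesize each keyword phrase so AND applies to it as a unit.
    parts += [f"({kw})" for kw in keywords]
    return " AND ".join(parts)


query = entrez_query("Cricetulus griseus", "copper transport")
print(query)  # "Cricetulus griseus"[Organism] AND (copper transport)
```

You can paste the printed string straight into the protein database search box. Restricting by the full species name is what keeps hits from the other Cricetulus species out of the results.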

3. Genomic Resources.

NCBI has a Genomic Biology central database as well. There are multiple things you can do from here. Most importantly, this is a centralized location to find genomic resources for basically all genomes that have been sequenced.

A few interesting resources are the Organism Specific features such as the Map Viewer. Choose an organism and go into the map view of its genome!

Let’s go into the human genome web page. An enormous amount of bioinformatics work has been done on the human genome. On this page, you can browse specific chromosomes, see SNP maps and read the latest literature related to genomic research.

Go into the chromosome viewer. Choose a chromosome. You’ll see a list of genes, and you can click on specific areas of the chromosome to see which genes are there. Try to find your favorite human gene. Or go into the mitochondrial chromosome, and find the human cytochrome B!

Other online resources for analysis:

For the course, we’ll use the San Diego Supercomputer Center Biology Workbench. This way, we can run our analyses much faster on their supercomputer, without having to worry about installing a lot of software and making it work. So go into:

WORKBENCH

This is the new workbench, which can perform several tasks such as BLAST and also has an interface to the phylogenetic programs in the PHYLIP package, a pioneering phylogenetic inference package by Joe Felsenstein that includes many of the most important and widely used methods.

Workbench

Today we’ll set up an account and upload some sequences into it.

Click "Use the Workbench Now" and create an account.

The Help and Tutorials are really good, so I would recommend spending 15 minutes going through them. You can always go back and look for things you don’t know how to do.

To keep things organized, the workbench works with folders. This way, you can keep the data and tasks for each project separate! Once you’re set up, create a new folder; let’s call it Lab1.
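
Before uploading, it helps to have your sequences in a clean text format. The sketch below assumes the upload accepts FASTA (check the Workbench help pages to confirm). It takes a sequence copied with spaces, as in the nucleotide fragment from earlier, and turns it into a FASTA record. The function name to_fasta and the record name example_fragment are made up for illustration.

```python
def to_fasta(name, sequence, width=60):
    """Return a FASTA record: a '>' header line, then the wrapped sequence."""
    # Remove the spaces and line breaks that come along when copying
    # a sequence from a database record, and use upper case throughout.
    cleaned = "".join(sequence.split()).upper()
    # Break the sequence into lines of at most `width` characters.
    lines = [cleaned[i:i + width] for i in range(0, len(cleaned), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


fragment = "agacggcgga gcttgacctg ggaagacttt ttgctgactc tcatcttttt ctggaaaact"
record = to_fasta("example_fragment", fragment)
print(record)
```

Save the printed text in a plain text file and you have something you can upload into your Lab1 folder.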

Don't forget to visit the Exercises page!

Get ready for BLASTing and aligning next week!