Week Two: Blast away

BLAST (Basic Local Alignment Search Tool)

BLAST is an algorithm developed by Altschul et al. (1990); alternatively, you can read the Wikipedia article. It compares a query sequence against a database, searches for similarities, and then scores these similarities according to their statistical significance.

Most often you will use BLAST through a site such as NCBI, which implements it so that you can enter your sequence and search for matches against their databases. The search has a number of customizable parameters, and people use it for a variety of purposes, such as predicting the location and function of proteins in a newly sequenced genome.

You can check NCBI's handbook about other possibilities for BLAST.

There are a number of different ways to BLAST a sequence. If you have the nucleotide sequence and want to search a nucleotide database, this is called a nucleotide BLAST, and the most common algorithm used for this type of search is blastn.

Q1. Go to the BLAST homepage, then list and explain all the different types of searches that can be performed. Why would you want to perform these different kinds of searches?

A paper in Nature Genetics (http://www.nature.com/ng/journal/v41/n2/full/ng.297.html) describes how histone blocks may signal the state of differentiation in a developing cell. Histone H3 is one of the proteins that form the nucleosome, around which DNA wraps to form chromatin. If you want to find this molecule, you can either search all databases or search a specific database chosen from the pull-down menu by the search field.
Type in Histone H3 and see what comes up looking at all databases.

Take a look at some of these results. Why is so much information coming up? How might we go about reducing the set we retrieved?

Scroll down and find the entry numbered NM_005324. Go to the Nucleotide database and click on the accession for that sequence.

In the upper left corner there is a pull-down menu that says Display. Choose FASTA from this menu. The sequence should now be displayed in this format:

>names and numbers

This is the closest we have to a universal format. Most programs will accept this, and you can easily copy and paste. The BIG downside is that this format does not accept any kind of annotation.
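Because the format is so simple (a header line starting with ">" followed by one or more lines of sequence), it is easy to read with a few lines of code. The following is only a minimal sketch in Python, not the parser used by NCBI or Workbench; the function name parse_fasta and the example record are made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only; this is not a real database entry.
example = """>example_id hypothetical description
ATGGCTCGTA
CTAAGCAGAC
"""
records = parse_fasta(example)
```

Notice that everything the format knows about the sequence has to fit on the header line, which is why annotation is lost.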

So go ahead, highlight the whole thing and copy it. Now go into your Workbench, make a new folder named Lab 2 and upload this sequence into it.

Now go into the tab at the top that says toolkit. Choose BlastN. Give your job a name, select the sequence you just uploaded as input, and go into parameters.

The first thing to do in parameters is to set the database you want to match your query against. This is very important because it directly determines the kind of results you get back: if you expect to find human sequences, don't BLAST against the Plants database! Searching against the mammalian database might be an interesting thing to try.

There are many other parameters that can be set. You're welcome to figure out by yourself what they mean, but for now we'll just run with the default parameters. Go ahead and save and run the task.

Now to see what happened, click on view status and then on the HTML file, which will open in your browser.

Read the output.

First, the output shows your parameters, the kind of run you made, which database you searched and your original query. When you have hundreds of BLAST results saved on your computer, this information helps.
Then, you have a list of matches, sorted somehow.

Q2. How is this list sorted? What do the different columns mean? Why is the E-value so critical? Look into Altschul's paper for a detailed explanation.
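As a starting point for thinking about Q2, here is a rough sketch of the standard Karlin-Altschul relationship between a raw alignment score S, the query length m, the database length n and the expected number of chance hits, E = K * m * n * exp(-lambda * S). The values passed for lam and k below are placeholders chosen for illustration, not the values BLAST would use for any particular scoring scheme, and real BLAST works with adjusted (effective) lengths rather than the raw ones.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score to a bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Placeholder parameters (assumptions for illustration only).
LAM, K = 1.3, 0.7

small_db = e_value(50, query_len=100, db_len=1_000_000, lam=LAM, k=K)
large_db = e_value(50, query_len=100, db_len=2_000_000, lam=LAM, k=K)
```

With everything else held fixed, doubling the database size doubles the E-value, and each additional unit of raw score shrinks it exponentially.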

Now scroll down past the list and you'll find that, one by one, the accessions listed above are matched up against your query sequence. This brings us to our next topic:


The alignments you see are pairwise alignments. They show two sequences matched one against the other, and the middle line shows where the sequences agree and where they disagree. At the top of each alignment, summary measures such as overall similarity are listed.
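To make the middle line concrete, the sketch below rebuilds it for two already-aligned nucleotide sequences and computes a percent identity from it. This is only an illustration of how to read the output, not how BLAST itself produces alignments; the two sequences are made up, and the function names are hypothetical.

```python
def match_line(top, bottom):
    """Return the middle line: '|' where the two aligned rows agree, ' ' elsewhere."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "|" if a == b and a != "-" else " " for a, b in zip(top, bottom)
    )


def percent_identity(top, bottom):
    """Identical positions as a percentage of the alignment length (gaps included)."""
    line = match_line(top, bottom)
    if not line:
        raise ValueError("alignment is empty")
    return 100.0 * line.count("|") / len(line)


# Made-up aligned pair; '-' marks a gap.
query = "ATGC-TAGGA"
sbjct = "ATGCATCGGA"

middle = match_line(query, sbjct)
identity = percent_identity(query, sbjct)

print(query)
print(middle)
print(sbjct)
```

Here 8 of the 10 alignment columns are identical, so the identity is 80%.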

Q3. Are all alignments of the complete molecule? Why?

Now, choose another four sequences that are complete histone sequences, write down their accession numbers, and retrieve them from NCBI using the method of your choice.

Alternative methods:

  • You can retrieve them manually from NCBI and copy and paste them like we did before
  • You can have Workbench do this automatically for you
  • The fastest: repeat this search in BLASTn as implemented by NCBI. Paste your sequence there, choose the nucleotide collection database (nr/nt), and for program choose blastn.

Take this opportunity to familiarize yourself with the NCBI BLAST interface, which is quite clickable. Then you can go into each entry that you like and copy it to Workbench.
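If you end up retrieving many accessions, the manual option can also be scripted. The sketch below uses NCBI's E-utilities efetch service to download a nucleotide record in FASTA format. The helper names build_efetch_url and fetch_fasta are made up for this example, the code needs a network connection to actually fetch anything, and NCBI's usage guidelines (request rates, API keys) are not handled here.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Build the efetch URL that returns one record as plain-text FASTA."""
    params = {
        "db": db,            # nuccore is the Nucleotide database
        "id": accession,     # e.g. an accession number from your BLAST hits
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, db="nuccore"):
    """Download the record and return it as a string (requires network access)."""
    with urlopen(build_efetch_url(accession, db)) as response:
        return response.read().decode("utf-8")


url = build_efetch_url("NM_005324")
```

Calling fetch_fasta("NM_005324") should return the same FASTA text you copied by hand earlier, ready to paste or upload into Workbench.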

After uploading them to Workbench, we want to align all those sequences. Note that this is no longer two sequences being matched one against the other, but many sequences that need to be aligned together. This is a much more complicated problem and poses various issues, both theoretical and computational. We will now use a multiple alignment tool called Clustal, which is widely used in molecular evolution studies. So go into your tools and choose Clustal, set up the alignment and run it. Once again, you're encouraged to learn about the parameters, but we'll use the defaults for now.

Go check your output files! You should have a few different formats that will show you what the alignment is and a simple distance tree that was used to construct the alignment.

Make sure you understand this alignment!

There are many things that can be understood by eyeballing an alignment. You will notice conserved regions, highly variable regions, insertions, deletions and most importantly, whether you have a weird sequence.
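Eyeballing can be backed up with a quick column-by-column summary. The sketch below marks each column of a multiple alignment as fully conserved ("*"), containing a gap ("-"), or variable (" "), similar in spirit to the conservation line in Clustal output. The four aligned sequences are made up for illustration, and column_summary is a hypothetical helper, not part of Clustal.

```python
def column_summary(aligned):
    """Summarize each alignment column: '*' conserved, '-' has a gap, ' ' variable."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        if "-" in column:
            marks.append("-")   # at least one sequence has an insertion/deletion here
        elif len(set(column)) == 1:
            marks.append("*")   # every sequence has the same residue
        else:
            marks.append(" ")   # substitution in at least one sequence
    return "".join(marks)


# Made-up alignment; '-' marks a gap.
alignment = [
    "ATGGCA-TTA",
    "ATGGCAGTTA",
    "ATGACA-TTA",
    "ATGGCAGTCA",
]

summary = column_summary(alignment)

for row in alignment:
    print(row)
print(summary)
```

A sequence that breaks many otherwise conserved columns, or that introduces gaps no other sequence has, is a good candidate for the "weird sequence" mentioned above.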

Q4. For instance, you should NOT have a 2 nucleotide deletion in a protein coding sequence. Why?

Substitution Matrices

Now, let's use a smaller alignment to look at substitution matrices. Let's look at a small protein called platelet factor 4 (Pf4), which is important in the cascade of reactions in blood coagulation. It is a short protein of 70 aa, so it is pretty easy to handle for our purposes.

Note: You're welcome to use any protein you'd like for this exercise!

Let's look at these four:

It is now time for us to switch to Workbench (http://www.ngbw.org). You should have created an account already.
Import and align the proteins above using Clustal W. Take a look at the alignment.

From such an alignment, you can make your own substitution matrix. Just count how many times each type of substitution occurs in the aligned columns (for nucleotide sequences, each transition and transversion), and figure out their relative frequencies. Easy!
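For nucleotide sequences, the counting can be sketched as follows. A transition is a change within the purines (A and G) or within the pyrimidines (C and T); any other change is a transversion. The function count_ts_tv and the two aligned sequences are made up for illustration, and the sketch simply skips gaps and ambiguous bases, which is one possible choice rather than a standard.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned nucleotide sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Skip identical positions, gaps and ambiguous bases.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    return transitions, transversions


# Made-up aligned pair; '-' marks a gap.
ts, tv = count_ts_tv("ATGGCA-TAC", "ACGACGTTGA")
ratio = ts / tv if tv else float("inf")
```

For this pair there are 4 transitions and 1 transversion. With more than two sequences, you would repeat the count over each pair (or over each column) and then convert the totals to relative frequencies.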

Q5. What was your transition/transversion ratio? Would you use this in a molecular evolution study of mammals? What about of bacteria?

Don't forget the Exercises!