Week Three

Multiple Sequence Alignments

Today, we explore the mysteries of alignments. As we have mentioned before, alignments are statements of homology: when we (or the computer) places a nucleotide or a residue beneath another, we are making a claim that those two nucleotides or residues are homologous.
Virtually all of the tools that we use in bioinformatics depend on alignments. The quality of the alignments you supply, in turn, determine the subsequent quality and reliability of your results. There is an old acronym in computer sciences: GIGO (Garbage IN, Garbage OUT) which most certainly applies to alignments.

Consequently, it is well worth our time to do these thoughtfully, and to learn how to manipulate available parameters in order to obtain optimized alignments.

Q1. Why do we have to align sequences before analyzing them?

The Biological System

For the next few labs, we'll be looking at an interesting HIV dataset, that we've put together by exploring the HIV Databases. This is a phenomenal online resource. You should learn what it is about by going to their FAQ. If you feel like exploring their search interface, try finding the HIV-1 reference sequence K03455.

Probably one of their most striking features is the availability of an ever updated geographic distribution of sequences in the planet where you can click and go into each region to find out the subtype distribution for specific localities. Throughout this lab you should go back and compare the results you have with this map.

We are going to look at a small dataset made of 30 HIV genomes, each sequenced from a single patient, plus three chimpanzee SIV sequences as outgroups. This assortment combines different localities as well as different subtypes.
The data consists of complete genome sequences for single patients: each sequence is named, in FASTA format, as follows:

Which means this is patient number 1, virus subtype is A1, locality is Tanzania, accession number is AC324556

We have selected accessions that include the entire genome, which enables us to compare the evolution of different genomic regions of the same strain. Initially we we'll not look at the entire genome. Instead, we will focus on two regions: the pol region and the env region.

Q2. Find out what the env and pol regions encode. Do you expect differences in substitution patterns or rates of evolution for these genes? Why?
Hint: A map of HIV genome is located here.

In order to come up with this dataset, we manually selected sequences in the database, downloaded them, corrected their format, aligned them, corrected the alignment visually and cropped the dataset. This is a long and laborious process which is not worth spending our precious time on, and brings only frustration. You will have plenty of opportunities later in the lab to do this yourself. For now, I am providing you with the datasets in the appropriate format below. :)

The data

These files contain just the sequences, there are separate files for the whole genome, and each of the two regions. These are not aligned.

Unaligned Files

HIV dataset Whole Genome
HIV dataset Env region
HIV dataset Pol region

Start by installing the programs Se-Al (freely available here and SeaView (freely available here)

Open or downloading HIV_Env_noalg in TEXT or in some other TEXT editor. You'll see a bunch of fasta sequences one below the other, choose all by pressing apple+A, and then copy it all by pressing apple+C.

Now launch Se-Al and open a new window.
You'll get an untitled window that has two blank spaces divided by a bar. Paste the sequences into the right hand blank space, by clicking on it (borders should go blueish when selected) and press command+V. Your sequences should go in there nicely, you'll see the names on the left hand window and the sequences on the right hand window.

Save this and name it appropriately (HIV_env_notaligned.nex is a good name). By doing this you'll save in NEXUS format, a pretty standard format for Bioinformatics.
After saving, export the file in FASTA format, by clicking File > Export.

Repeat the whole process for the Pol dataset.

Before we move on, turn on the Block Colors feature under the Alignment menu.

Q3. Are there any clear patterns in the data? Why is it hard to reach any conclusions about the evolutionary forces shaping HIV based on the data as you now see it?

Multiple Sequence Alignment

Now let's move on to aligning these. Start with the Pol dataset:


  • Upload it into SWAMI (don't forget to properly tag the as nucleic acid, sequence, FASTA format) IF SWAMI IS STILL MEDITATING, YOU CAN DO THIS AT http://align.genome.jp/
  • The safest way is to open the file in a new tab, copy and paste it into workbench, just like we did before for Se Al. It should work by uploading a FASTA or NEXUS formatted file as well.
  • Create a new task and align it using Clustal W. (you might also want to try doing the alignments in SEAVIEW, just to get a feel for what it can do)

When it is done, there is a format issue we need to address before you can look at the newly aligned sequences in Se Al. Go into the outfile, and notice that this is not in FASTA format, but in CLUSTAL format. We need to correct the format.

Fixing the format:

  • Hit the button that says Save to Current Folder
  • Create a new task
  • Choose the outfile as input data
  • Choose ReadSeq as the program for your task
  • Go into parameters and tell the program that the input is in CLUSTAL and you want the output as FASTA
  • Once it is done, go into the outfile, see that it is in FASTA format, copy the whole thing and paste it into the right hand part of\ an untitled Se Al window.

Name this file appropriately (HIV_Pol_aligned is a good name) and visualize it. Have some patterns that were previously hidden now become apparent?

Repeat with the Env dataset.

Q4. Comparing both aligned datasets, can you point to differences in the patterns exhibited by the two regions? What about between sequences, or groups of sequences?
Think about evolutionary forces that would explain the patterns you observe, and try to formulate your observations in terms of hypothesis which could be tested.

Protein Alignment

Now onto aligning these datasets based on their coding seqs.

There are multiple ways of transforming your dataset into an Amino Acid dataset. You could for instance, fetch the accession numbers all over again, this time downloading their translated sequences. But since we have it all in our hands, let's just translate what we have.

Start by looking at the Pol aligned file you created. Open it in Seaview. In this program you can easily go from DNA to Amino Acids to Codons (toggle the right box under PROPS).

Now look at your sequences. Does it make sense? What are all those question marks and asterisks?

When you did a nucleotide alignment, the program didn't care about triplets when it went about inserting gaps. Therefore, some of the gaps inserted by Clustal are in the middle of codons. Can you see the problem now?

You should manually fix the whole alignment, by going back and forth between amino acid and nucleotide. A lot of people align manually and it is a good idea to always check your alignment very carefully, it is likely that you'll do a better job than the machine.

So get your aligned Pol file, open in SEAVIEW, translate it to protein. Now let's go through a checklist of things to make sure your alignment is in good shape:

  • Are there question marks now? If there are question marks, check the nucleotide sequence to see if that is a weird gap or if it is just that the sequencer wasn't able to assign one of the four letters to the site. If the sequence is in doubt, it will assign a different letter (such as R) which stands for which ones are possible letters for that site (such as T or C). In this case, we'll have to work with that lack of information anyway. In case there is nothing there, it means that the alignment program included a gap in the middle of a codon. Check for this by using the Codon view, and move single or double nucleotides back into the triplet.
  • Also look for weird things such as a stop codon (usually represented by an asterisk *) in the middle of the sequence. That usually happens because your sequence is in the wrong reading frame. If all sequences are wrong, try switching the reading frame under the menu Sequences. If only one of the sequences in in bad shape, try fixing it by selecting it and shifting the reading frame. There are only three possible reading frames, so if you can't get it right after trying all three, it won't go.
  • If you can't fix the stop codons by shifting the reading frame, there is definitely something wrong with the sequence. We know it is an active virus that was sequenced from a person that had HIV, therefore this virus is fully functional and there is nothing wrong with it's polymerase. Try to find what is wrong by determining at which point does the sequence start to have stop codons, highlight the area, switch into Codon viewing mode, and find by comparison where is the nucleotide missing. What probably happened here is that the sequencer skipped one letter, so you have to insert a gap to fix that. Switch to amino acid and see if it is fixed. Repeating this should fix your alignment.

Once you're certain that all is fixed, and you believe your alignment, let's save it:

Saving the Amino Acid alignment

  • Make sure it is in Amino Acid mode
  • Hit Save As
  • Choose FASTA format
  • Name it appropriatelly (HIV_Pol_alg_AA.fas is a good name)

Now you have saved your protein alignment, it will be useful later.

Make sure to keep a copy of your alignments, you'll need them!

Look at your alignment and re-evaluate the hypothesis you previously thought of.

Q5. Think about the hypothesis you formulated. Try to come up with a way to test them, what are the possible outcomes and how to interpret them.

Repeat for the Env dataset.

Aligned Files (Just in case Workbench doesn't work or you get too frustrated trying):
HIV Env Aligned!
HIV Pol Aligned!