Week Four

NOTE: As of this morning, Swami seems to be back in business. That said, an important part of bioinformatics involves the ability to perform the relevant analyses on more than one platform. Sites like SWAMI are great, but they are tools. Like any good mechanic, you will have your favorite set of tools, but you should be able to get your car running even if you break down without your favorite tools in the trunk.
Two other sites you should know are Phylemon, which is the web-based Spanish equivalent of SWAMI: http://phylemon.bioinfo.cipf.es/
and MEGA, powerful software which you should download onto your own machine, available at http://www.megasoftware.net/ (to install in on your macs by dragging into Applications folder, it will ask you to authenticate. Use the following username: cscadmin and the password : Fx37Te+)

First step, then, is to get signed up and/or download these two portals!

Measures of Description and Divergence (Part 1)

Quantifying and understanding the divergence (or it's counterpart, similarity) between sequences lies at the heart of virtually every analysis we will carry out. Divergence captures many of the fundamental processes we have been discussing: time, chance, selection.
A number of different approaches, each with its built-in set of assumptions, can be used to quantify sequence divergence. We will spend the next couple of weeks exploring this question.

You have been warned: the quality of your estimates of divergence is critically dependent on the quality of your alignments. Furthermore, many of the analyses of divergence will treat coding sequences separately from non coding sequences, and will also distinguish between first, second and third codon positions. We thus not only need a clean alignment, but (in the case of coding regions) one that is in frame. This means that there should be no gaps that aren't multiples of 3, and no stop codons in the middle of the sequence. There are problematic sequences in our alignment, as there are in all alignments, and we need to tidy them up:

  • Make sure you did not break any codons when inserting gaps last week.
  • If there are single gaps in your alignment, chances are you have a sequence that has an extra nucleotide, caused by sequencing error. Identify it > and get rid of it.
  • If there are stop codons in your alignment, chances are that the sequencer made a degenerate or incorrect call (other letters\ non-ATCG). Check the same position in other sequences and make the call yourself. We can't go back to the trace data, so we'll just make a conservative call and assume no change happened at that position. If you can't make the call, get rid of the amino-acid for that sequence and include three gaps.
  • It's a good idea to delete the final stop codon for all sequences. It is not going to make any difference for any analysis, and some programs don't like stop codons at all. Just get rid of it.

If you are unsure about your alignments, use the alignment we provided for the Pol region. Compare your alignment to the one I provide, and see if you agree with how I went about it. Remember that manual alignment, in addition to being tedious, involves judgement calls— which you may or may not agree with.
As long as you can justify your approach, there is no problem. Try the Env alignment on your own and I'll post it later in the week so you can compare.

HIV_Pol codon aligned
HIV_Pol codon aligned (in FASTA format for MEGA)

Always remember that you can go back and forth between formats by exporting in Se Al.

Composition and Distance Methods

We'll start by calculating some basic descriptive statistics about the individual sequences, as well as distances between sequences.
We might, for instance, be interested in the overall nucleotide composition of these sequences, or in the extent of the codon bias exhibited by these sequences. Those sort of descriptive stats are available under the MODELS tab.

Q1. Why would we be interested in these sorts of descriptive statistics about the sequences?

We are also interested in computing the divergence between sequences. This can as simple as counting how many sites are different between two sequences, and then dividing that number by the total amount of sites. By doing this, you get a pairwise difference. If you do this for all possible pairs, you get a matrix with all possible pairwise distances.

I will walk you through the procedure for generating a matrix of pairwise differences in MEGA, but you should be able to do these analyses in any of the platforms I list above.

To do this in MEGA,
1) you are going to want to go under the DATA tab, and open the alignment (in FASTA format— it should have the .fas suffix)
2) it will ask you if you want to open this file for analysis or alignment. Since it is already aligned, pick ANALYSIS (but you can also try realigning if you wish)
3) it asks you a series of questions that you should be able to answer (yes, these are nucleotide alignments, protein coding sequences, using standard code)
4) once the data are loaded, you can then go under the DISTANCE tab, and begin the analyses. I want you to pay particular attention to the parameters that you, as the investigator, can set (highlighted in yellow). As has been true all along, I do not want you going "yeah, whatever" to the default settings. They matter— or the can matter— and they cannot possibly be optimal for every analysis that you will be carrying out. Plug in cortex prior to hitting the ENTER key. I want you to pay particular attention to the MODEL/METHOD pulldown menu.

The program allows you to simply count the number of differences ("no. of differences") and to calculate a normalized pairwise distance ("p distance'). With these settings, what you see is what you get. You are counting.
But in many cases, counting is not enough, because we can assume that there have been sequence changes that we do not see, but can infer.

Q2. Under what circumstances can you imagine that we would want to infer/estimate changes between sequences, rather than just count?

We will explore the effects of this estimation using the simplest model: Jukes-Cantor. JK is a one parameter model. This means that the distance matrix we'll retrieve does NOT reflect the raw distances for each pairwise comparison, but the CORRECTED distances, under the specified model.

Q3. How are the uncorrected and corrected distances different?

The resulting table shows all the possible pairwise comparisons. There is not a lot you can get out of this table at a glance, but it does give you a chance to identify either outlying sequences, or very similar sequences. The matrix will, however, enable some further analyses.
You will have a choice on how to look at the matrix, and you can save it in a variety of formats.

One of the uses for the matrix is as the raw material for the construction of a kind of tree. Specifically, we will be constructing distance trees (using the distances as inputs), which create a topology based on the degree of sequence similarity: similar sequences are grouped together.

You do this (in MEGA) by going under the PHYLOGENY tab, and selecting the UPGMA option. What we are doing now is constructing a tree based on that distance matrix. This is a fairly straightforward method of tree-construction and shouldn't take long.
DO the same using the NJ approach

Q4. What are some of the assumptions underlying the UPGMA and NJ approaches to tree construction? Should we trust these trees? Why or why not? Under what circumstances might a tree based on similarity lead us astray?

Take a good look at the trees you have obtained. Compare the outputs of the two distance methods.

Q5. How the sequences grouped? Do you see them grouping by country, subtype, geographic regions?

Now let's repeat the exercise for Amino Acids. Open up the file in Se Al, translate it, export as FASTA (if you are doing this in MEGA).

If that doesn't work, use this HIV codon AA

Once again, pay attention to the parameters that you can set as the user, and think about your selections.

Q4. Are there significant differences in topology, ie, were many of the groupings different? What about the relative distance between nodes? What are the advantages and disadvantages of using nucleotides or AAs to construct your tree?

Finally, I'd like you to do the same analyses as above using the data for the HIV env gene we prepared last week. Once again, do the analysis at the nucleotide and at the protein levels. Once again, compare the resulting trees. Finally, compare the trees you derived from the env and pol datasets.

Q5. Are the trees you are obtaining from the two datasets — taken from the same HIV isolates— telling you a different story? Why, or why not?

Don't forget about the Exercises for next week!