Week Five

The footprint of selection

Today we will continue working on the evolution of HIV, using both datasets that you have previously aligned and new datasets that we will generate for this experiment. Our goal is to understand how selective pressures are shaping the evolution of the HIV genome.
One hypothesis we might wish to explore concerns the Pol region, which should be highly conserved because it codes for crucial enzymes. The Env region, in contrast, might be under different selective pressures, given that it encodes the parts of the virus that are readily visible to the host immune system.

While these are reasonable hypotheses, our task is to show that evidence for the action of natural selection can be retrieved from the comparative analysis of these sequences.

Q1. Can you restate our expectations regarding selection? Which region should be evolving neutrally, under purifying selection, or under positive selection? Why?

We will begin by calculating dn and ds for the sequences concerned, and then use that information to begin tracking the evidence for selection.

Calculating dn/ds ratios for sequences

We will use the alignments you previously created for the Pol region (my version is below in case you get desperate).
Because we are going to be looking at synonymous and nonsynonymous rates, you need to be sure that the alignments are in perfect shape. Specifically, that means:
1) Alignments are good, and preserve the reading frame of the sequences: no one- or two-nucleotide gaps!
2) They begin with an actual codon (may be the first ATG codon, although sometimes not) and end with the last amino-acid-encoding codon. REMOVE THE STOP CODON FROM YOUR ALIGNMENT: IT MAKES MOST OF THE PROGRAMS QUIT.
3) They are in FASTA format
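
If you want to check the three requirements above before uploading anything, here is a minimal Python sketch. The function names (read_fasta, check_codon_alignment) are made up for this handout, and the sketch assumes the standard genetic code and "-" as the gap character; it is a quick sanity check, not a replacement for looking at your alignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def read_fasta(text):
    """Parse FASTA-formatted text into a dict of name -> uppercase sequence."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def check_codon_alignment(records):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if len({len(seq) for seq in records.values()}) > 1:
        problems.append("sequences differ in length")
    for name, seq in records.items():
        if len(seq) % 3 != 0:
            problems.append(f"{name}: length {len(seq)} is not a multiple of 3")
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in STOP_CODONS:
                problems.append(f"{name}: stop codon {codon} at codon {i // 3 + 1}")
            elif "-" in codon and codon != "---":
                # a gap that does not cover a whole codon breaks the frame
                problems.append(f"{name}: partial gap {codon} at codon {i // 3 + 1}")
    return problems
```

To use it, read your alignment file into a string, pass it to read_fasta, and print whatever check_codon_alignment returns.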


We will start with the Nei-Gojobori (1986) method. This method estimates the number of nonsynonymous substitutions per nonsynonymous site (dn) and the number of synonymous substitutions per synonymous site (ds); the dn/ds ratio compares the two. The output is thus a matrix of pairwise values (much like we saw for distances in the previous lab). More elaborate tests that apply models of substitution to this estimation procedure exist, and we will take a brief look at those as well. You need a good reason to apply more elaborate models to your data; simply counting is a good place to start.
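
To make the "simply counting" idea concrete, here is a rough Python sketch of a Nei-Gojobori-style calculation for one pair of aligned sequences. It is my own illustration, not the code that MEGA or yn00 runs. Assumptions: the standard genetic code, changes to stop codons are ignored when counting sites and pathways, codons containing gaps or ambiguous bases are skipped, and a Jukes-Cantor correction is applied at the end. All function names are made up.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid ("*" marks a stop codon)
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Number of synonymous sites in one codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        syn = total = 0
        for b in BASES:
            mutant = codon[:pos] + b + codon[pos + 1:]
            if b == codon[pos] or CODE[mutant] == "*":
                continue  # skip the codon itself and changes to stop codons
            total += 1
            syn += CODE[mutant] == CODE[codon]
        if total:
            s += syn / total
    return s


def count_diffs(c1, c2):
    """Average (synonymous, nonsynonymous) differences over all pathways."""
    diff = [i for i in range(3) if c1[i] != c2[i]]
    sd = nd = 0.0
    paths = 0
    for order in permutations(diff):
        cur, s, n, ok = c1, 0, 0, True
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":  # discard pathways that pass through a stop
                ok = False
                break
            if CODE[nxt] == CODE[cur]:
                s += 1
            else:
                n += 1
            cur = nxt
        if ok:
            paths, sd, nd = paths + 1, sd + s, nd + n
    return (sd / paths, nd / paths) if paths and diff else (0.0, 0.0)


def nei_gojobori(seq1, seq2):
    """Return (dn, ds) for two aligned, in-frame coding sequences."""
    S = Sd = Nd = 0.0
    codons = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 not in CODE or c2 not in CODE or "*" in (CODE[c1], CODE[c2]):
            continue  # skip gaps, ambiguous bases, and stop codons
        codons += 1
        S += (syn_sites(c1) + syn_sites(c2)) / 2
        s, n = count_diffs(c1, c2)
        Sd, Nd = Sd + s, Nd + n
    N = 3 * codons - S
    p_s, p_n = Sd / S, Nd / N  # proportions of differences per site

    def jc(p):
        # Jukes-Cantor correction; undefined once p reaches 0.75
        return -0.75 * log(1 - 4 * p / 3)

    return jc(p_n), jc(p_s)
```

Calling nei_gojobori on two rows of your alignment gives one cell of the pairwise matrix; dividing the first value by the second gives the ratio for that pair. Very short or very divergent sequences can make the correction fail, which is itself worth thinking about for Q2.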
The analyses can be conducted either in MEGA (go under the SELECTION tab, and choose "CODON BASED Z TEST") or in Phylemon (sign in and follow this path: Evolutionary tests > Adaptation tests > yn00 (version PAML 4a)).

When you do this in MEGA, you will have the option to use either the observed differences or a correction (Jukes-Cantor, etc.). You also have an option allowing you to do this analysis for the entire set of sequences (average), or as a set of pairwise analyses. You can also specify the hypothesis being tested (neutrality, positive selection, or purifying selection).
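
For orientation, the statistic behind a codon-based Z test can be written as below. This is my summary of the general form, assuming the variances are estimated by bootstrapping codons; check the MEGA documentation for exactly what your version does.

```latex
% Null hypothesis: d_N = d_S (strict neutrality)
Z = \frac{d_N - d_S}{\sqrt{\operatorname{Var}(d_S) + \operatorname{Var}(d_N)}}
% Alternatives: d_N \neq d_S (departure from neutrality),
%               d_N > d_S   (positive selection),
%               d_N < d_S   (purifying selection)
```

The hypothesis you choose in the menu determines which alternative the P-value refers to, so decide which one you are testing before you run the analysis.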

Q2. How is the dn/ds ratio obtained in the Nei-Gojobori method? How do you decide whether or not to use a correction? How do you decide whether to do this analysis as a set of pairwise comparisons or as an overall average?

Let's start with the Pol dataset. Upload your sequences in an appropriate file format, and choose the correct genetic code.

NOTE: If this thing finds a single stop codon, it may abort the run. Go back and fix it!

If you think about it, this is an analysis that is still doing some averaging, even if you selected the pairwise comparison option. Under certain circumstances, averaging may be what you want. In other cases, averaging may lead you to lose valuable information.

Q3. What are the advantages of an analysis of selection that averages across the entire sequence? What are some potential limitations?

MEGA allows you to carry out this analysis on a codon-by-codon basis as well, using the HyPhy option.

Run your Pol alignment through the codon-by-codon analysis. Do you reach different conclusions? Why, or why not?

A piece of advice: when you compare the output, particularly for pairwise comparisons, it may be easier to pick one particular sequence and check whether there is an overall pattern across all of its pairwise comparisons. Try this for a few sequences and see whether a pattern emerges.
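
If your pairwise output is large, a few lines of Python can pull out every comparison involving one focal sequence. This is only a sketch: the function name, the sequence names, and the numbers are all invented for illustration, and it assumes you have already copied the pairwise dn and ds values into a dictionary.

```python
def focal_summary(pairwise, focal):
    """Collect dn/ds for every pairwise comparison that involves `focal`."""
    ratios = {}
    for (a, b), (dn, ds) in pairwise.items():
        if focal in (a, b) and ds > 0:  # skip pairs where the ratio is undefined
            other = b if a == focal else a
            ratios[other] = dn / ds
    return ratios


# Made-up values, purely for illustration: (dn, ds) for each pair
pairwise = {
    ("seqA", "seqB"): (0.02, 0.10),
    ("seqA", "seqC"): (0.03, 0.12),
    ("seqB", "seqC"): (0.15, 0.10),
}

for name, ratio in sorted(focal_summary(pairwise, "seqA").items()):
    print(f"seqA vs {name}: dn/ds = {ratio:.2f}")
```

Repeating this for a few focal sequences makes it easier to see whether one sequence stands out consistently or whether the pattern is shared by all of them.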

We now are going to do something a bit more ambitious, taking advantage of the really interesting and well-curated dataset of HIV sequences [http://www.hiv.lanl.gov/content/index]. What I would like us to do is to create aligned datasets for two or more regions of the HIV genome, from the SAME individuals, to try to look for the different footprint of selection.
To do this, you need to go to the SEQUENCES section, and look at the ALIGNMENT menu [http://www.hiv.lanl.gov/content/sequence/NEWALIGN/align.html]. We will work this section together on the board, but they have done a good deal of the hard work for us.

Once you have your alignments, and are satisfied with them (and can understand them), go ahead and carry out the relevant analyses for detecting selection. As always, think before you hit the ol' enter button. I also want you to try to formulate the hypotheses you are testing in advance, so that they are framed in a way that can be answered.

Q4. How would you interpret your results, in terms of the hypothesis you set out to test? How would you report the signature of positive/neutral/purifying selection in these sequences if you were writing a paper about this?

Finally, if we have time, I'd like us to look at these sequences in one additional way:

This analysis [http://services.cbu.uib.no/tools/kaks] tries to take into account the phylogenetic structure of the tree, and thus examines the pattern of selection along particular branches of the tree, testing, in effect, the possibility that the role of selection may have changed in the course of evolution.
Go to the website, and choose the options with care. You may, at least initially, want to use the Pol alignment that is supplied with this lab.

Here is the env dataset that may be showing positive selection: