Week Six

Phylogenetic Reconstruction

Phylogenetic reconstruction is a complex and crucial topic in molecular evolution and bioinformatics. Every biological object has a history, and the reconstruction of that history lies at the heart of what evolutionary biology is about. Perhaps less obviously, the interpretation of molecular data depends— implicity or explicitly— on a phylogenetic hypothesis. Think about what a blast search is really about: you are looking for similar sequences, and using that similarity to inform you about the identity and function of your query sequence. That logic involves an implicit evolutionary hypothesis.

You will note that phylogenetic reconstruction is, in a formal sense, about the generation of a phylogenetic hypothesis: a provisional claim about the material world that can be tested, rejected and revised. In all but a few cases, we do not know the true evolutionary history of the objects under study. We are thus reconstructing it, and using a series of methodological, statistical and philosophical criteria to determine the robustness of our hypothesis. An interesting aside: when we do know the evolutionary history— as in the case of viral or bacterial experimental evolution — we find that existing methods of phylogenetic reconstruction do a reassuring job of getting it right. Except when they fail spectacularly.

Today we will be concentrating on the following:
Constructing phylogenetic trees
Interpreting phylogenetic trees
Understanding the assumptions, limitations and applications of the most widely used methods (Distance, Maximum Parsimony, and Maximum Likelihood)

An ever increasing number of approaches for phylogenetic reconstruction now exist in the literature, but for our purposes, we want to subdivide them into two broad categories—algorithmic methods and tree searching methods.
As these names suggest, the first approach uses an algorithm to construct (usually) a single tree from the data. Algorithmic methods are computationally very efficient, but rely on a series of assumptions that can often be profoundly inaccurate. The most common algorithmic methods are UPGMA and Neighbor-Joining (NJ). These methods are also known as "distance-based" methods because they reduce your data to a single number that reflects the similarity or difference among the sequences, and then uses those numbers to construct the tree.
Tree-searching approaches usually construct multiple trees from the data, and then sort through these trees, based on some criterion (generically referred to as an optimality criterion) to choose among them. They are computationally less efficient, can sometimes take a while, but they can often allow the user to make fewer assumptions (or to make assumptions based on the data). In addition, they frequently provide some sort of statistical measure or confidence interval that enables you to evaluate the robustness of the resulting tree(s).These methods are known as character-based methods because they will treat each position of your alignment as a separate and independent character or data point as it goes about phylogenetic reconstruction.

We have two platforms today on which we can build trees: Mega, or Phylemon. I will leave the choice up to you- Mega does things elegantly and fast, but can be twitchy about formats; Phylemon is good about handing the output of one program to another, but may have fewer options when it comes to saving threes in pretty form. Eventually, you want to be able to use both.

As far as the data to be used for today, you have three choices:

1) you can use the HIV data that we have used in the past (Env and Pol). They are available below:

Env subset nucleotide
Pol subset nucleotide

Env subset AA
Pol subset AA

2) you can use your venom data, or some set of sequences related to the venom that you worked on

3) you can follow the example in the handout I will give you, taken from Barry Hall's book Phylogenetic Trees Made Easy


We will start by doing distance trees, because they are fast and commonly used. Note that they are commonly used because they are fast, and not necessarily because they lead to the right result. When you do a blast search, for instance, the tree that often comes back along with the sequences is a NJ tree.
One important qualification about NJ trees— they should really not be used when the average pairwise distance (JK distance) is >1. You can calculate that in MEGA under the Distance menu. , we should start by making sure we understand the distance trees. Rob spent a good amount of time explaining those, so:

Q1. Why do you imagine that we should not use distance methods when pairwise distance is >1?

Pay attention to some of the parameters that you can modify as we run the program. They should begin to make sense to you, and we will discuss them further in class.

Make sure you save the trees that you are producing. A bit later on, we may want to compare them


Parsimony, sometimes referred to as Minimum Evolution, is a tree-based method of reconstruction that selects the best tree using the principle of parsimony. That principle, Ockham's razor, is a methodological/philosophical criterion for selecting among multiple trees that are all compatible with your data. It is NOT a statement about how evolution works. Simply put, parsmony argues that in the presence of multiple possibilities, and whn no additional data can be brought to bear to assist in your decision, you should opt for the simplest explanation. In this case, "simplest" means the fewest evolutionary events across the entire tree. The principle of parsimony is a method for choosing among trees, not a method of constructing trees itself.

We have already stated that the amount of possible trees is gargantuan, once we start looking at many taxa. For instance, with our 34 taxa dataset of HIV, there are many more trees (about 1050) to be built than we would want to (or can) spend computing time on. As a rule, tree reconstruction programs go through the following steps:

Build trees compatible with the data
Choose among tree using some optimality criterion (usually one where you minimize or maximize some measurable quantity)

Ideally, we would construct all possible trees from the data, and then pick the best. But as we stated, once you have more than 10 taxa, you cannot run an exhaustive search. Instead, we use shortcuts, or "heuristics" in fancy talk: methods that start building a tree, and then try to identify "dead-ends" and abandon them. While these heuristics make the job possible, they can sometimes steer you away from the "best tree", and leave you trapped in the forest of possibilities. Wow, that sounds scary.

Even with heuristics, you may often end up with multiple trees with identical optimality scores. How then do we decide which is the "best tree"? Once again, we will be making a methodological choice: perhaps we want to show only those features that are present in all of the equivalent trees, and collapse all other branches into polytomies (this is called the "Strict Consensus" method). Or perhaps you want to show any resolved node that occurs more than 50% of the time (that is called the "Majority Rule Consensus" method). We will try these and others, but you will ultimately have to make and defend your choices.

Finally, we will be using a very elegant approach to estimating the confidence we can have about a node on our tree. This approach is called bootstrapping.

Q2. What exactly is meant by bootstrapping? Why would we use it to estimate the reliability of a branch?

So let's go ahead and start looking at the program that is implemented in Phylemon: Phylip. These kinds of analyses usually take a long time, so we'll start by reducing our dataset to fewer taxa, we'll look at all subtypes we had for Uganda, Tanzania and one of the SIV sequences. These are the files in Phylip format:

Env subset nucleotide
Pol subset nucleotide

Go into Phylemon, under parsimony, go into DNApars, which is the algorithm implementation for using nucleotide data. Let's start by using the Env dataset.

Now, you know how to enter your data. Then let's look at each of the option, because they will tell you how the program works:

  • Search for best tree: you want this. This means that the program will explore tree space in it's own way. The program has an algorithm that will build trees, either by adding groups one by one, or by using some kind of shortcut to estimate trees — sometimes this could even be a NJ or UPGMA algorithm. In any case, it will make trees, and then assess which are the best ones, by using the Principle of Parsimony. Alternatively, you could build trees using your own algorithm, or someone else's, and provide the program with those trees. Then, among those, the program will apply the parsimony principle. If you are sure that you have a really fast algorithm, this might be a good choice.
  • Search Option: this is where you tell the program how to look for the tree. You might want to tell it to built all possible trees (usually referred to as an exhaustive search), or you might want to use some kind of sampling strategy (Heuristic, BandB, etc). Since we already know that the amount of possible trees can be astronomical, you probably want to use a less thorough kind of search. The less thorough kinds of searches relies on somehow leading the tree building towards the best answer. This is done by multiple ways (TBR, Ratchet, SPR). For those of you interested in the computational part, this is the area that mostly welcomes innovation. There are several really cool ways of sampling trees, and of avoiding "local optima", that is, a place in tree space that the algorithm keeps sampling from, but it is not where the best tree will be found.
  • Number of trees to save: In this case, you won't have so many best trees. So 100 should be fine. The more you want the computer to save, the more it has to compare the new trees, the longer it takes. You will want to increase the number in more complicated searches.
  • Randomize: Randomizing takes part of the bias away and diminishes the possibility of getting stuck in local optima. I always choose yes and in this case, jumbling ten times should suffice. The more you jumble, the more work the computer has, the longer it takes.
  • Outgroup: We have an outgroup, it is the SIV, species 13.
  • We do not have to use threshold parsimony or tranversion parsimony (although it might be interesting to use tranversion parsimony, since you can get the Tv/Ts ratio from your data.
  • This time we will not use multiple datasets and the sequences are not interleaved.

Now take a look at the outcome. you can either look at the file or use ETE, whatever pleases you. Unfortunately, Phylip does not give you any measures such as CI, which would be useful, but there are other ways of getting those if we need. But it gives you the one tree in this case. Getting one tree out is not usual, and if we had more, we would look at a consensus.

Q3. What are the relationships in the resulting tree? Are sequences grouping by country or sub-type? Is this different from what you saw in the NJ trees?

Do the same for the Pol dataset.

Q4. Are the trees the same? What does this mean? You should think about gene genealogies and organismal phylogenies here. Do you expect all genes from the same organisms to reveal the same branching history? Why or why not?

If you really want to have more information about how these species split apart, you could use all evidence you have by putting together the two regions back to back in an alignment, and then performing a parsimony analysis on this. Alternatively, you could get both trees and make a consensus tree.

Repeat the exercise using protein datasets, for this you need to use the program Protpars:

Env subset AA
Pol subset AA

Q5. Are these congruent with the nucleotide trees? What do you think is the most appropriate kind of dataset for reconstructing phylogenies? Can you come up with some pros and cons of using nucleotides or amino-acids?

The last thing we want to do is assess the confidence we have in these trees. To do this, we use bootstrap, which means subsampling our dataset and repeating everything we did many times. Since you don't want to keep pressing Run 100 times, there is an easy way of doing this.

In Phylemon, go into Resampling, and click Seqboot. Paste your sequences there, choose "molecular sequences", "bootstrap", 100 replicates and don't forget that the input sequences are not interleaved. Run and it should come up fast.

Now you have many bootstrapped datasets, one behind the other. You might want to save this with a good name such as "Env_subset_boot_nt". Phylemon also allows you to continue the analysis from here.

So go ahead, choose DNApars or Protpars, and repeat all four analysis (Env and Pol, both for AA and nt), make sure this time you tell the program that there are multiple datasets and the appropriate number of datasets. This will take 100 times longer than it took for the other analysis, obviously.

The output is, as expected, 100 hundred analyses one behind the other. Note that for some datasets the program actually found more than on tree, as opposed to what we had found. So run a consensus, program CONSENSE using the majority rule. This means that the program will only consider real those grouping that have shown up in more than a certain percentage of all trees. You're welcome to try other types of consensus as well. Then use the ETE to visualize the final tree.

Q6. Do the bootstrapvalues help in the interpretation of results? How? What evolutionary inferences can you make from these results?

Joe Felsenstein keeps an updated list of phylogeny reconstruction programs at his website.