Phylogenetic Reconstruction

# Maximum Likelihood

Maximum Likelihood is regarded as the most general method of phylogenetic reconstruction, as proven by several simulations and a long literature on successful phylogenies recovered. As with anything else, there are some drawbacks to this method, but so far it is probably the best we have.

Part of the appeal of this approach (and a related one, Bayesian Reconstruction) is that we appear to be using any and all information at our disposal to arrive at a correct tree. In particular, ML is a model-driven approach: you are picking the tree that where the data are "most likely" given the particular model you have selected.

## Fitting the Model to the Data

Maximum Likelihood consists of finding the best model that describes your data.

There is a bit of terminology here and also some important definitions. A model usually consists of an equation made of terms that describe *variables* in your data, and the terms themselves are usually called *parameters*. In the toss coin example, your model is that there is a probability that heads will turn out (Ph), and of course a complementary probability that tails will turn out (1-Ph). Maximum Likelihood consists of finding the best value for that Ph parameter, given your data.

So, in essence, your model never changes. What changes is the value of that parameter. So ML will try to estimate the best values for the parameters that are in your model. If you wanted to include the probability that say, the coin will stop on its edge and neither heads or tails will turn up, that is a different model, with a third parameter to be estimated. ML can't estimate a parameter that is not included in the model!

## Molecular Evolution models

So how do we come up with models for molecular evolution? You are familiar with some models already. For instance, thinking that there is a difference between the probability of transitions and transversions is assuming a model with two different parameters. In truth, all molecular models are a variation of different rates for each of the 12 possible changes:

A | C | G | T | |

A | X | |||

C | X | |||

G | X | |||

T | X |

So in the case we want to count Transitions and Transversions differently, you essentially have two parameters

A | C | G | T | |

A | X | $\beta$ | $\alpha$ | $\beta$ |

C | $\beta$ | X | $\beta$ | $\alpha$ |

G | $\alpha$ | $\beta$ | X | $\beta$ |

T | $\beta$ | $\alpha$ | $\beta$ | X |

$\alpha$ = transitions

$\beta$ = transversions

This is equivalent to a model called Kimura 2 Parameter (K2P). The Jukes-Cantor model, assumes all these parameters are the same (only alphas in the table) so in essence there would be only one parameter to estimate. The most general model possible is called the General Time Reversible model (GTR) where all 6 parameters are allowed to vary, and the direction of change does not matter. Finally, if you like the idea that time is irreversible, and the direction of change matters, then there are 12 parameters to be estimated.

Some models will correct for the fact that the frequency of bases is unequal, therefore the probability of substitutions is affected by the frequency of each base, which makes a lot of sense. It is analogous to thinking that you need to compare the rate of synonymous substitutions per syn site, instead of the raw amount of substitutions. In order to achieve this, a modifier is attached to the appropriate parameter, accounting for empirical base frequencies.

**Q1. With this in mind, it might seem that you should always use the most general model, with 12 parameters. Can you think of reasons why this is generally not recommended in statistical analysis, and specifically, why would this be a problem in phylogenetic reconstruction?**

How then do we choose the "best" model? In this case, by looking at the data. We can estimate the most appropriate model by testing multiple models in a likelihood ratio test framework, which is a statistical method to determine which model best fits the data. If you are interested in learning more about this, a higher level statistical course is recommended.

*This brings up the strongest argument against Maximum Likelihood:* you are looking at your data to estimate parameters, and then using estimated parameters to look at your data. Some people regard this as a tautological system of thought, which means circular logic: the conclusion is equivalent to the premise. There is definitely some truth to that statement. The other big argument against ML is the extensive use of probabilistic thinking, when a phylogeny is a historical event. There is also some truth to that statement

Nonetheless, let's go ahead and estimate the best model for our data.

We'll use the same dataset we used for MP:

Env subset nucleotide

Pol subset nucleotide

Env subset AA

Pol subset AA

So let's move forward and start using ML. Go into Phylemon, and under Maximum Likelihood, let's use Joe Felsenstein's DNAml which is a part of the Phylip package. Start with the Env nucleotide dataset, which you already know how to use. We are going to keep the parameters as they are for now. Just remember that your sequences are not interleaved. Run the job and take a look at the tree. This was pretty fast because we are running a quite rough analysis, we'll do a better job in a minute.

(another site you may want to check out is here)

Once you have the output, we can go ahead and select our best model. All we wanted here was the ML tree, so that we can input it in the next step. You could also use a NJ tree, but let's be consistent.

Under Evolutionary Tests, Go to Model Selection, and select Jmod in order to run a model selection for our data.

You can do this by selecting files from server, by inputting your files manually or by uploading saved text files. It seems to work better when you copy and paste.

Before examining the output, it might be worth looking at the help file, available as a pdf.

Take a look at the output. You will see that the program tried to fit all different possible models of evolution to the alignment and tree we supplied, and it uses both the LRT and the AIC to score which is the model that best fits your data. It is known that trying to find the best model using these scores is somewhat misleading - in my experience, these methods always find the GTR+G+I the best model.

Posada and Crandall have come up with the notion of hierarchical model testing, which begins with the simplest model, and then proceeds in a stepwise fashion to make it more complicated, until no further gains in likelihood are obtained. This usually results in a better estimate of what is really happening. You can clearly see what the program did by going step by step. On the bottom, you'll have the estimated parameters for each substitution in a table just like mine.

Also, this will give you a slew of other parameters: the proportion of invariant sites (P); the parameters for each different rate class and the alpha that describes the gamma distribution in your data. Now you can go back to DNAml and include all this data there! Since PHYLIP is a bit clunky, we actually won't be able to include all this information there, but some of it will definitely go. Also, PHYLYP only implements one model of substitution - F81. So go back into DNAml, and insert your newly found parameters.

**Q2. Do you get a better tree?**

For a more satisfying feeling, you can use more advanced ML programs: PhyML and TreePuzzle.

Enter the model and parameters you obtained previously, and see if the resulting tree has changed.

//The only thing you need to know about Tree-Puzzle is that it uses a different method to estimate the likelihood, that is based on quartet puzzling approximation. This is a way to score the trees we have constructed, and then choose the best one. I will only list here the things you want to change, in order:

- Under DNA options, choose GTR. We will not use the GTR, but this program allows us to input any of the six rates, so we will end up defining our model next.
- Input the model rates from the ModelTest output file
- Under Rate Heterogeneity, choose Mixed (1 invariable + Gamma) and insert the respective values for the Gamma rate parameter (alpha), insert 4 as the number of rate categories and copythe fraction of invariant sites. Hit run.//

**Q3. How is this tree compared to the initial tree obtained in ML? What about to the tree obtained via MP?**