Lucy Ye

Reading Questions

Introduction to molecular evolution

In the Duffy paper, viral mutation rates are affected by two kinds of factors: those characteristic of the structure, function and operation of different viruses (the fidelity of DNA or RNA replication and the ‘quality-checking’ capabilities of different types of viruses, mutation rates that are directly related to genome size, and the method of viral replication), and variation caused by environmental or chemical influences (spontaneous deamination and the influence of a particular host species).
One big obstacle in the estimation of mutation rates is the existence of prominent outliers in each method of estimation. For example, the retrovirus SFV has an anomalously low mutation rate uncharacteristic of most retroviruses. These examples show that many influences (for example, differences in viral generation time) shape mutation rates and challenge the generalizations drawn about them. One potential limitation is the belief that RNA viruses mutate at rates close to their error threshold, which lends support to the idea of a relationship between mutation rates and limited genome size.
The error threshold is the theoretical limit to mutation rates in viruses, or in other words, the number of mutations that can occur before the population becomes extinct. The maximum mutation rate that still ensures survivability, then, must fall under this error threshold. This holds especially true for RNA viruses, as they are generally believed to have high mutation rates.

The molecular clock

The idea of the molecular clock comes from the assumption that a relatively constant rate of molecular change, across species, drives evolution. It is also consistent with the neutral model (or neutral theory), which suggests that the rate of neutral mutations (those not influenced by selection) can be regarded as the actual rate of mutation in organisms. It also assumes, to some extent, a steady increase of molecular change exclusive of other influencing factors.
Data in support of a molecular clock is limited. Early on, the clock was considered a unique tool because it was able to reconcile the protein evolutionary rates over a vast array of morphologically diverse species. The emergence of gene sequencing backed, to some extent, the molecular clock’s provision that the genetic distance between two species is proportional to the amount of time elapsed since their initial divergence. However, the accuracy of the clock and of these estimates has been greatly questioned, especially since the clock’s estimates of major events in evolution have conflicted with findings from the fossil record.
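To make the distance-to-time idea concrete, here is a minimal sketch assuming a strict clock. The function name and the numbers are hypothetical, chosen only for illustration; they do not come from the readings. The factor of 2 reflects that both lineages accumulate changes after the split.

```python
def divergence_time(distance, rate_per_lineage):
    """Estimate time since divergence under a strict molecular clock.

    distance: substitutions per site between the two sequences
    rate_per_lineage: substitutions per site per year along ONE lineage
    Both lineages accumulate changes, so distance = 2 * rate * time.
    """
    if rate_per_lineage <= 0:
        raise ValueError("rate must be positive")
    return distance / (2.0 * rate_per_lineage)


# Hypothetical values, for illustration only.
years = divergence_time(distance=0.02, rate_per_lineage=1e-9)
print(f"Estimated divergence: {years:.3g} years ago")
```

If the rate varies within or between lineages, this single-rate calculation no longer holds, which is where the objections to a strict clock come in.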
Some of the main objections to a molecular clock theory point to problems in accuracy. A strict molecular clock would assume a steady “tick rate” (substitution rate) both within a lineage and between lineages. In reality, the substitution rate within a lineage can be uneven and the substitution rates between lineages are variable. Extra variation is injected by a huge variety of internal processes that could affect the number of substitutions per unit time, mainly by see-sawing between the influences of selection and drift. This can also occur at several levels (at sites in genes, in specific genes or across whole genomes). These factors will greatly decrease the accuracy of the estimates of a ‘strict’ molecular clock, leading to a ‘sloppy’ or ‘relaxed’ approach to molecular clock estimates.
In addition, the molecular clock has difficulty accounting for external, environmental factors that also shape evolution. For instance, a particular molecular clock must be calibrated based on information often derived from the fossil record. This introduces error from poor sampling by region, time period, or taxon, and other inaccuracies rooted in uncertainties in the fossils themselves. Geological shift is another factor that cannot really be accounted for, and sample dates often cannot be determined for bacteria and viruses.
One way of testing the validity of the molecular clock is the chi-square test developed by Fitch in 1976. In this model, one can test the null hypothesis of the molecular clock by looking at the number of sequence differences between, for example, species A, B, and C. The null hypothesis would be something like a = b, which can be tested so that you would reject or fail to reject it at a given significance level.
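As a rough illustration of testing a = b, here is a sketch of a one-degree-of-freedom chi-square comparison, in the style of a relative rate test with C as the outgroup. This is an assumed simplification and not necessarily the exact formulation in the reading: `unique_to_a` and `unique_to_b` are hypothetical counts of sites where only A, or only B, differs from the other two species.

```python
import math


def relative_rate_test(unique_to_a, unique_to_b):
    """Chi-square test (1 d.f.) of equal change along lineages A and B.

    Under the clock (null hypothesis a = b), changes unique to A and
    unique to B are expected in equal numbers.
    Returns (chi_square, p_value).
    """
    total = unique_to_a + unique_to_b
    if total == 0:
        raise ValueError("no informative sites to compare")
    chi_square = (unique_to_a - unique_to_b) ** 2 / total
    # For 1 degree of freedom, the upper-tail probability of a
    # chi-square value x equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical counts, for illustration only.
chi_square, p_value = relative_rate_test(unique_to_a=30, unique_to_b=10)
verdict = "reject" if p_value < 0.05 else "fail to reject"
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}: {verdict} the clock")
```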

Detecting Selection at the Molecular Level

1) Can the level of inter-specific divergence in a protein be used as evidence for (or against) the role of selection in shaping the protein? Why, or why not?
Yes, because inter-species divergence involves the quantitative comparison of nonsynonymous to synonymous divergence in order to determine whether selection is working positively or negatively.
2) Can the level of intra-specific polymorphism in a protein be used as evidence for (or against) the role of selection in shaping the protein? Why, or why not?
No, because intra-specific polymorphism by definition provides several allelic “options” at each locus. A high level of polymorphism prescribes greater occurrence of several lower-frequency (less than 99%) alleles, providing less certainty that selection is occurring in a necessarily positive, negative, purifying, directional etc. fashion. For example, in the FoxP2 paper, there is an excess of derived (non-ancestral) alleles at high frequency, and two mutations underlie the divergence of FoxP2 function between humans and the great apes. Yet we can’t be sure whether this is an effect of positive selection or a ‘relaxation of the constraints’.
3) How is selection revealed in HIV env? In FoxP2? What form does selection appear to be taking in each of these cases?
In HIV env, selection is revealed to operate opportunistically (based on specific time periods and amino acid sites) and positively, at rapid rates, allowing the virus to escape the immune system of a host after infection. In FoxP2, divergence between humans and the great apes occurred at merely two points of mutation. The authors consider two hypotheses: that more low-frequency alleles would be observed than expected, or that more derived (non-ancestral) alleles at high frequency would be observed than expected under neutrality. They conclude that, coupled with the factor of human population growth, the second hypothesis best explains the evolution of FoxP2 in humans. In this case, selection appears to have fixed this advantageous mutation pretty quickly in humans, leading me to believe it takes the form of positive selection.
4) Has our perception of the importance of natural selection changed as we shift from single-locus to genome-wide analyses? Why, or why not?
Definitely yes. I think that from the single-locus perspective, the neutral theory is more persuasive; yet from a genome-wide perspective, there is more evidence that, in the big picture, evolution at functionally important genes/proteins “trends” towards one side of selection or another rather than maintaining even a ‘nearly neutral’ process of evolution.

Evolution of Color Vision genes

This week’s readings explored the extraordinarily flexible group of genes that regulate vision systems in vertebrates. The expression of the opsin genes that control color vision is highly tissue-specific (they are restricted to photoreceptor cells in the retina, pineal complex, and brain), which makes them a good model to examine the evolution of color vision (Yokoyama et al, 1995). More importantly, they have helped elucidate the initial steps to loss of gene function, bettered our understanding of how pseudogenes become nonfunctional, and provided experimental support for the statistical evidence of positive selection at work, with some surprising results. Yokoyama et al 1995 analyzed three opsin genes in three Astyanax fasciatus populations and found that high levels of mutation occurred prior to nonfunctionality in these genes, and that opsin function was not completely lost even in the blind cavefish. In the 2008 article, Yokoyama et al compared statistical tests of positive selection with an experimental model based on reconstructing rhodopsin sequences from ancestral vertebrate species. They found that looking only for parallel amino acid changes could over- or under-estimate the actual probability of functional changes, as candidate amino acid changes did not greatly impact dim-light vision in fish, and different amino acid replacements were able to produce similar functional changes. Essentially, study of vision genes in vertebrates revealed that loss of function, or the “creation” of a pseudogene, is a process of greater ambiguity than can be determined simply with statistical analyses, and is less likely to be influenced by positive selection than was previously believed.

Experimentals (Week 2)

As Hillis et al demonstrated in their 1992 study with their experimental model of the bacteriophage T7, experimental models of phylogeny have allowed us to test the accuracy and capabilities of several commonly used phylogeny inference methods. Bull et al 1997 used replicate lineages of bacteriophage φX174 that demonstrated high levels of convergent evolution to suggest that 1) common methods of phylogenetic reconstruction will fail to correctly guess the evolutionary history and 2) these methods may inaccurately calculate statistical significance, as the true tree was rejected as being an inferior fit to the data. This is indeed evident from the results of the experiment. However, the authors noted that the rates of convergence in these lineages were extraordinary, and pertained to a very specific set of properties that do not occur in nature. Convergent evolution does occur to some degree in nature, and it is important to note that inference methods seem incapable of inferring it. Still, due to the highly controlled settings of the experiment, I am uncertain how significant the study is to the actual use of these methods in evolutionary inference, as convergent evolution occurs, in most cases, at a much lower rate in nature. Sanson et al 2002 provided support for the maximum likelihood method in the case of neutrally evolving lineages, while Paterson et al 2010 tested the rate of evolution in two independently evolving and co-evolving bacterium & phage models. In sum, it seems that several unique, known environmental factors (convergence in Bull et al's case and coevolution in Paterson et al's case) already complicate the ability of inference methods to correctly guess evolutionary history, and that there are many more that we probably don't even know about. In some of the cases, it seemed that the inference methods were able to correctly guess the topologies of the true trees, but got messed up as they got closer to the common ancestor.
Especially in experimental models, where lineages are generated quickly under similar selective environments, the direct impact is seen most strongly in the first divergences, and its effects dissipate in the younger branches of the tree. We must question the usefulness of these commonly used inference methods if they can only provide correct topologies but not the entirety of the true histories that we seek. (Something about how only a part of the truth can be a lie? Oh boy!)

Critique 1 - Crandall et al

I believe that some of Crandall et al’s critiques are warranted, to some degree. Conventional approaches to reconstructing evolutionary events do have limitations based on preconceptions that do not always apply, and we likely underestimate the occurrence of some evolutionary events because of this. Yet the nature of HIV evolution is so different from that of anything else in some respects that it is easy to see where conventional methods are doomed to fail. Overall, I find Crandall’s critiques more persuasive than others (such as Bull et al), given the statistically strong data and relatively “uncontrolled” experimental setup. However, I am unwilling to completely disregard these conventional approaches, as their efficacy may still hold some value in other organisms and populations.
The lack of correlation between viral load, genetic diversity, and population size in the study is an example of this. Whether by replenishing viral loads via latent reservoirs or by rapid fixation of adaptive changes in the depleted viral population, HIV is well known to be able to produce significant adaptations in populations under great selective pressure. If we apply conventional approaches that use viral load at a given time as an indicator of the underlying genetic diversity, we would clearly get inconclusive or no results from the data in this study. In this case, we can see from both the biology and the data that criticism of this method may not be warranted: HIV evolutionary rates deviate so far from the norm at all population sizes that we cannot conclude that population size is a poor indicator of genetic diversity in all populations, or a poor ME method in general.
The common approach of using the dn/ds ratio to detect the action of selection is also contested in this study. Despite clear indication that the HIV sequences were under positive Darwinian selection, the estimated dn/ds ratio fell below the designated value of one or higher that would indicate positive selection in all but two patients. The authors believe that this is due to oversampling of synonymous substitutions via the pairwise comparison feature of the Nei & Gojobori method of estimating dn/ds, and that methods that take other parameters into consideration (like the Maximum Likelihood method) would fare better. This critique is appropriate considering the greater power of ML methods, but does cast some due uncertainty on the unquestioned use of the 1:1 dn/ds ratio rule.
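For reference, the ratio itself can be sketched as below, assuming the counts of sites and differences are already in hand (counting them across codons is the harder part of a Nei & Gojobori style analysis and is not shown). The function names and input numbers are hypothetical; the Jukes-Cantor correction is the standard one for multiple hits at a site.

```python
import math


def jukes_cantor(p):
    """Correct a proportion of differing sites for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return dn/ds from pre-counted differences and sites.

    Ratio > 1 is read as positive selection, < 1 as purifying
    selection, and about 1 as neutral evolution.
    """
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    if ds == 0:
        raise ValueError("ds is zero; ratio is undefined")
    return dn / ds


# Hypothetical counts, for illustration only.
ratio = dn_ds(nonsyn_diffs=5, nonsyn_sites=200, syn_diffs=20, syn_sites=100)
print(f"dn/ds = {ratio:.2f}")
```

Because the ratio is averaged over all sites in the pair, a few strongly selected positions can be swamped by many constrained ones, which is consistent with the low ratios reported despite clear positive selection.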
Crandall et al also legitimately criticize the assumed relative 'rarity' of parallel and convergent evolution, and this is supported by multiple examples and the strong statistical significance of this study. Despite biases due to the unique evolutionary behavior of HIV, the experimental setup seems as bias-free as possible. It uses HIV patients, in whom changes in the virus can only be monitored to some extent (in comparison to the study by Bull et al (1997), which imposed several variables upon the viral lineages generated). Crandall et al's study applied one source of selective pressure (aka extensive drug therapy) on patients, and merely sequenced the HIV protease gene at different time points. The selective pressure of drug therapy could be argued to have a direct correlation with the parallel evolution that we are seeing. The depleted viral populations are forced to quickly fix adaptations that confer Indinavir resistance in order to survive, and this theory is supported by the fact that the identical amino acid changes all fall within the list of eleven positions associated with Indinavir resistance. Yet because they sampled from multiple clones in strains unique from patient to patient, I am not convinced that this explanation holds strong. Such high levels of parallel evolution support the argument that common reconstruction methods could be greatly underestimating the occurrence of so-called "rare" events in nature, or even in a clinical setting, which may be no less important.

Critique 2 - Nielsen et al

The inconsistencies in different approaches of detecting evidence of positive selection in humans can definitely seem like a solely human genetic diversity problem, as there are too many confounding factors involved in human genomic evolution to be able to make such overarching claims in most cases, especially if results from different approaches are inconsistent. Particularly in the case of the FoxP2 gene, which has shown strong intuitive evidence of having been positively selected in humans and might have influenced development of speech, it would be erroneously simple to point to the defects or loss of function in language skills that result from FoxP2 mutations, despite the signs that the truth is likely much more complex (FoxP2 also has a role in lung development, is almost ubiquitously expressed, and studies present inconsistencies between methods of detecting selection). This is just one example of how Nielsen et al paint the picture of increasing doubt as to the conclusions we can draw from human whole-genome analyses.
The FoxP2 case is a good representation of these issues because, like a lot of other genes, it is easy to draw overarching (and therefore erroneous or inconclusive) conclusions about its evolution in humans, due to a combination of genetic evidence and inferences about human history. Yet it is quite clear that there are many caveats to examining the entire human genome and trying to discern important evolutionary events. The vast amount of data on human variation might prove to be a hindrance instead of a help in our search for 'marker' events in human evolutionary history. Additionally, I think it is important to point out that along with the advantages of knowing the anthropological history of humans, like the LCT gene's selective “sweep” for lactase persistence around the same time dairy farming was established in Europe, such knowledge also creates a large bias towards specific conclusions. FoxP2 studies, such as Enard et al 2002, have some relevance here, as the authors seemed to overstate the significance of FoxP2 as ‘the first gene relevant to the human ability to develop language’, while understating the underlying complexities regarding the functions of FoxP2.
What stood out to me in the Nielsen et al paper was the controversy of piecing together evidence of selection out of several biological and statistical approaches that, independently, are effectively inconclusive and, taken together, are also inconsistent. For example, much of today’s research, especially in developmental biology and related fields, employs genetic or cell ablation experiments to study their subsequent loss-of-function effects. Nielsen et al note that such studies should not be taken as wholly conclusive in elucidating all functions and purposes of specific genes, and especially not in the field of evolutionary biology, despite increasing demand for functional studies in order to verify the action of selection. FoxP2 mutations are associated with deficiencies in language skills and craniofacial muscles, but may also have effects on other functions that are not phenotypically expressed; thus, the FoxP2 gene cannot be conclusively linked with the human-specific characteristic of the development of speech. I think that these kinds of problems are not unique to human evolution; instead, I believe that we are both more and less capable of examining and identifying selection events in human evolution than in the evolution of other species, due to the sort of doubts and limitations that the vast amount of information and bias now available to us create.

Lab 1

Week One Q's


1) Cholera toxin (CTX) is a key virulence factor in cholera pathogenesis. The cholera toxin B-chain protein is secreted to bind to Ganglioside GM1 on the surface of the host’s cells.
2) CtxB’s role as the secreted factor that actually binds to host cells and allows CTX to be carried into the cell is an important indicator that it would be a good vaccine target. If a cholera vaccine were capable of targeting this key mode of entry/attachment of the toxin, it would be extremely effective in preventing virulence. Therefore, preventing the binding of CtxB to GM1 would be a very simple and effective point to target a vaccine.
3) AA sequence of cholera toxin B: (retrieved from accession number page)
4) Nucleotide sequence of cholera toxin B: (retrieved from accession number page)
5) I searched for ctxB on the CDD database and found PBP2_LTTR substrate, the substrate binding domain of LysR-type transcriptional regulators (LTTRs), a member of the type 2 periplasmic binding fold protein superfamily. It appears that this protein has functions, such as the synthesis of virulence factors, toxin production, attachment and secretion, that could possibly have preceded CtxB evolutionarily.
6) E. coli heat-labile enterotoxin (Etx) is a close homologue of CtxB due to its similar structure and function as a mucosal adjuvant. I found this by searching for conserved domains from the accession number page of the CtxB protein. Using a simple literature search, I also found that Myelin Basic Protein has homologous peptide domains to both CtxB and cholera toxin A.

7) In an article comparing the E. coli enterotoxin B subunit (EtxB) to CtxB, D. Miller et al (2001) indicate that despite their homology the two are not equivalent as prospective vaccine adjuvants because of striking differences in their immunomodulatory properties. According to Miller et al, comparisons between EtxB and CtxB had not previously been made because of close structural similarities (80% sequence identity), but closer comparison of their relative adjuvant and immunostimulatory activities indicated that structural similarity in this case does not imply similarity in function. Thus, examination and comparison of homologs to CtxB has indeed shed light on the function of CtxB. I believe that such comparisons are not conclusive about whether CtxB is a good candidate for a vaccine target, but potentially provide ‘relative’ information as to its fitness as a vaccine target.

Lab 2

Week 2 Q's


Lab 3

Week 3 Q's


1. The TERT sequence should be analyzed at the protein level because it functions within telomerase as the enzyme that uses TERC to add a six-nucleotide repeating sequence to the 3’ end of telomeres. Therefore, analysis of a TERT sequence alignment will reveal more about its evolution across species from a functional perspective than will analysis at the DNA sequence level, unlike TERC, which is functionally a template RNA and will never be translated (and is therefore better analyzed at the DNA level).
2. I made sure that the weight matrix was set to ClustalW (for DNA). I did not change the parameters for the two alignments: the homologous sequences in both were so different among species to begin with that I did not think adjusting the gap penalties would really make a difference, so I let the computer make its best estimate.
3. From analysis of the alignments it seems that TERT evolves much faster than TERC. Just from eyeballing, one can tell that the number of conserved sequences in TERC is much higher than TERT, in which I did not see even one position that was fully conserved across the species sampled. It should also be noted that finding full nucleotide sequences for TERT was more difficult than finding full sequences of TERC, and some of the sequences included may be of questionable origin (or only partially sequenced as of yet). This observation coincides with the function of both components; as the template, it is appropriate that TERC is much more highly conserved across species due to its key role in telomerase. TERT, the enzyme, is much more susceptible to mutations and likely has more regions of lesser importance on which mutation has acted, across the species we have sampled.
4. In order to identify functional domains, I looked at the domains that were consistently conserved between species. We assume that without these conserved domains, key functions of TERT and TERC would not be possible.

TERC alignment: positions 145-153, 203-226, 443-454, in reference to the human sequence.
TERT alignment: uncertain. I did not find any domain that was fully conserved throughout. Might need help on this: even though it evolves faster, there should be some conserved domains, but I don’t see any at all.
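Rather than eyeballing, the fully conserved stretches could be found with a short script. This is a minimal sketch assuming the alignment has been exported as equal-length strings with "-" for gaps; the function name and the toy sequences are hypothetical. Positions are alignment columns (1-based), so they would still need to be mapped back to the human sequence.

```python
def conserved_runs(aligned_seqs, min_length=1):
    """Find runs of alignment columns that are identical in every sequence.

    aligned_seqs: list of equal-length aligned strings, '-' marks a gap.
    Returns a list of (start, end) column ranges, 1-based and inclusive.
    Columns containing any gap are treated as not conserved.
    """
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must all be the same aligned length")

    runs = []
    run_start = None
    for col in range(length):
        residues = {seq[col].upper() for seq in aligned_seqs}
        conserved = len(residues) == 1 and "-" not in residues
        if conserved and run_start is None:
            run_start = col
        elif not conserved and run_start is not None:
            if col - run_start >= min_length:
                runs.append((run_start + 1, col))
            run_start = None
    # Close a run that extends to the end of the alignment.
    if run_start is not None and length - run_start >= min_length:
        runs.append((run_start + 1, length))
    return runs


# Toy alignment, for illustration only.
toy = ["ACGT-ACGG", "ACGTTACGA", "ACCT-ACGC"]
print(conserved_runs(toy))
print(conserved_runs(toy, min_length=2))
```

For a fast-evolving protein like TERT, requiring identity in every species may be too strict; relaxing the rule (for example, allowing one mismatch per column) would be a natural next step, though that is not shown here.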