Ilene Magpiong

Lab 1

E-mailed lab report to Rob directly
In lab report format, because I finished this before I knew of the question format on the wiki.
(Updated format for Lab 2.)

Lab 2

Question 1

Find 6 homologous sequences (you can choose a protein of interest, or you can use the histones we used last time), and download them both as nucleotide sequences and as amino acid sequences.

I decided to investigate phosphofructokinase (PFK). PFK is an important enzyme in glycolysis: using ATP and Mg++, it phosphorylates fructose 6-phosphate to form fructose 1,6-bisphosphate. I chose this particular protein simply because I like it and thought it would be interesting to compare alignments of PFK1 across various species.

Sequence 1: gi|72161441|ref|YP_289098.1 341 aa bacteria - Thermobifida fusca YX
Sequence 2: gi|269127253|ref|YP_003300623.1 341 aa bacteria - Thermomonospora curvata DSM 43183
Sequence 3: gi|297562062|ref|YP_003681036.1 341 aa bacteria - Nocardiopsis dassonvillei DSM 43111
Sequence 4: gi|302868984|ref|YP_003837621.1 350 aa bacteria - Micromonospora aurantiaca ATCC 27029
Sequence 5: gi|271964260|ref|YP_003338456.1 341 aa bacteria - Streptosporangium roseum DSM 43021
Sequence 6: gi|296269305|ref|YP_003651937.1 341 aa bacteria - Thermobispora bispora DSM 43833

Question 2

Using clustal W, align the amino-acid sequences. Think a bit about what the parameters that you can control might mean, and try the alignment using at least two different parameter sets. Evaluate the resulting alignments, and discuss briefly why your parameter setting did/did not make a difference to the resulting alignment.

Using ClustalW_N, the six homologous protein sequences for the aforementioned phosphofructokinase proteins were aligned. The first alignment used the default settings with the slow pairwise alignment; the second alignment changed the -quicktree setting to fast and the ktuple size to 2; and the third alignment added the -nopgap function. All alignments were set to do full multiple alignments.

All three alignments were exactly the same despite the parameter changes. The -quicktree parameter switches the pairwise alignment step from the slow method to the fast method, but it made no noticeable difference in how long the alignments took. K-tuple size is the length of the window of exactly matching residues used to find matching fragments between sequences. Decreasing this number slows down the process but makes the results more sensitive and accurate. Conversely, increasing the number speeds up the alignment, because exact matches become less likely as the window gets larger. Even with the fast setting, changing the k-tuple size did not change how quickly the output file was returned. The final change was the -nopgap option, which switches off the residue-specific gap penalties. These are amino-acid-specific penalties that modify the gap opening penalty at each position in the sequence: the modification factors are pre-determined for specific residues, and larger gap penalties decrease the likelihood of having an adjacent gap.
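To make the k-tuple idea concrete, here is a minimal Python sketch that counts exact k-length word matches between two sequences. It is only an illustration of why a larger k gives fewer candidate matches. It is not ClustalW's actual fast pairwise routine, which also considers diagonals and window sizes. The function name `ktuple_matches` and the two peptide fragments are made up for this example and are not the PFK sequences.

```python
def ktuple_matches(seq_a, seq_b, k):
    """Count (i, j) pairs where the k-length words starting at
    seq_a[i] and seq_b[j] are identical."""
    # Index every k-length word of seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - k + 1):
        index.setdefault(seq_b[j:j + k], []).append(j)
    # Look up each k-length word of seq_a in that index.
    matches = 0
    for i in range(len(seq_a) - k + 1):
        matches += len(index.get(seq_a[i:i + k], []))
    return matches


# Toy peptide fragments, invented for illustration only.
a = "MRIGVLTGGGDCPGLNAVIR"
b = "MKIGVLTSGGDAPGMNAAIR"

for k in (1, 2, 3):
    print("k =", k, "exact word matches:", ktuple_matches(a, b, k))
```

The count can only stay the same or drop as k grows, because every match of length k+1 also contains a match of length k. For sequences as similar as the six used here, plenty of exact words survive at either setting, which fits with the alignments not changing.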

Despite the changes in these parameters, the alignment results were exactly the same. Perhaps, in this case, the amino acid sequences are too short for the change in parameters to make a difference. There are also quite a few conserved regions: because every sequence comes from a bacterium, the sequences may already share many similarities.

Question 3

Take the 2 most closely related sequences (most similar), and do a pairwise alignment using ALIGN of those two sequences (amino-acid sequences). Does the pairwise alignment match the alignment of the two sequences you got using CLUSTAL? Why, or why not?

Sequence 3: gi|297562062|ref|YP_003681036.1 341 aa bacteria - Nocardiopsis dassonvillei DSM 43111
Sequence 6: gi|296269305|ref|YP_003651937.1 341 aa bacteria - Thermobispora bispora DSM 43833

Results of comparing two closest matches:
Comparison 1 (with all six sequences): Sequences (3:6) Aligned. Score: 89.1496
Comparison 2 (with sequence3 and 6 only): Sequences (3:6) Aligned. Score: 88.8856

Sequences 3 and 6, from Nocardiopsis dassonvillei and Thermobispora bispora, respectively, were the closest matches. When all six sequences were aligned together, the alignment of sequences 3 and 6 produced a score of 89.15; when the two were aligned without the other four sequences, the score returned was 88.8856. The bioinformatics tools we typically use for alignments, located on Swami, were down, so I switched to the tools offered at GenomeNet. The slight difference in score is most likely due to the change in systems. Even without the four additional sequences, the alignment appeared to be the same, suggesting that including the other sequences had no impact on the quality of the results. This also suggests that all six sequences are very closely related and that a large portion of the sequence is conserved among these bacteria.
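The scores above can be read as percent identities, assuming the program counts identical positions and divides by the number of positions compared, skipping gaps. That is my understanding of how ClustalW reports pairwise scores, but treat it as an assumption. The sketch below shows the calculation on a made-up aligned pair; `percent_identity` is a hypothetical helper, not part of any of the tools used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the positions where
    neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gap positions are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy aligned pair, invented for illustration: 4 identical out of 5 compared.
print(percent_identity("MKV-LT", "MRVALT"))

# Arithmetic check (an inference, not something the output states):
# 304 identical positions out of 341 would give the first reported score.
print(round(100.0 * 304 / 341, 4))
```

Under that assumption, a shift of a single residue in a 341 aa alignment changes the score by about 0.29, so a small difference between two runs corresponds to a very small difference in the alignments.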

Question 4

Now take the DNA sequence data for the two most closely related sequences, and align them using ALIGN. Does the resulting alignment appear to match the one you obtain when you align using amino acid sequences? Under what circumstances might those two alignments differ?

ALIGN results:
Sequences (1:2) Aligned. Score: 84.7801

Aligning the Nocardiopsis dassonvillei and Thermobispora bispora DNA sequences resulted in another close match, with a score of 84.78. I attribute the slight difference in scoring to the difference in size between the protein sequences and the DNA sequences. The protein sequence was very short and specific to phosphofructokinase, whereas the DNA sequences were longer and account for more of each organism's genome. If you were to compare the whole genomes of the two organisms, the sequences would have a higher probability of not matching as closely. The two alignments might also differ because changes do not always carry over between DNA and protein: processing such as splicing can alter the final protein sequence, and, conversely, some changes in the DNA sequence may not show up in the protein.
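The last point, that a DNA change may not show in the protein, can be illustrated with a small Python sketch. The two DNA fragments below are invented for this example and are not taken from the PFK genes. The codon table is a tiny subset of the standard genetic code covering only the codons used here, and the helper names are hypothetical.

```python
# Subset of the standard genetic code, only the codons used below.
CODONS = {
    "ATG": "M",
    "AAA": "K", "AAG": "K",
    "GGC": "G", "GGT": "G",
    "CTG": "L", "CTC": "L",
    "GAC": "D", "GAT": "D",
}


def translate(dna):
    """Translate a DNA string codon by codon (length must be a multiple of 3)."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Percent of identical positions between two equal-length strings."""
    same = sum(x == y for x, y in zip(a, b))
    return 100.0 * same / len(a)


# Invented fragments that differ only at third codon positions.
dna_1 = "ATGAAAGGCCTGGAC"
dna_2 = "ATGAAGGGTCTCGAT"

print("DNA identity:    ", identity(dna_1, dna_2))
print("Protein identity:", identity(translate(dna_1), translate(dna_2)))
```

Both fragments translate to the same peptide even though the DNA differs at several positions, so a DNA alignment can score lower than the protein alignment of the same gene. That is the direction seen here (84.78 for DNA against roughly 89 for protein), although this sketch does not prove that is the cause in this case.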

Question 5

Finally, run a pairwise alignment of the two most similar sequences using LALIGN. Does the output match the results you obtained in (3)? Under what circumstances would you expect these alignments to differ dramatically?

LALIGN results:
Sequences (1:2) Aligned. Score: 81.5
Waterman-Eggert score: 1876; 446.3 bits; E(1) < 5.3e-130
81.5% identity (93.5% similar) in 341 aa overlap (1-341:1-341)

Using the LALIGN program, the alignment of the Nocardiopsis dassonvillei and Thermobispora bispora amino acid sequences resulted in a score of 81.5. LALIGN finds non-overlapping local alignments (regions of sequence similarity) between two sequences, reports their scores, and returns as many alignments as you specify. The number of alignments to report was set to 10 in this experiment. The alignment output is exactly the same as before, but the scoring is different. I attribute the identical alignments to the size of the sequences, and the difference in scoring to the different features the programs look at. Locality plays a key role in LALIGN and thus plays a role in the scoring. I'd imagine that there would be greater differences between the results of the two pairwise alignment programs with much larger data sets. In a larger data set there is more room for local similarity to matter, and thus more room for differences.
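To show how locality can change the result, here is a simplified Python sketch of global scoring (Needleman-Wunsch style) next to local scoring (Smith-Waterman style). It assumes a plain match/mismatch score and a linear gap penalty, which is not how ALIGN or LALIGN actually score: they use substitution matrices and their own gap settings. The function names and the toy sequences are made up for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score when all of a must be aligned against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score over any pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Similar end to end: both methods agree.
print(global_score("ACGTACGT", "ACGTACGT"), local_score("ACGTACGT", "ACGTACGT"))

# Only a shared region, with unrelated flanks: the scores diverge.
x = "WWWWACGTACGT"
y = "ACGTACGTYYYY"
print(global_score(x, y), local_score(x, y))
```

When two sequences are similar across their full length, as the 1-341:1-341 overlap reported above indicates, the local and global answers coincide. They would be expected to diverge when the sequences share only one region, because the global method must also pay for the unrelated flanks.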