Introduction to Bioinformatics

Lab 5: Multiple alignment & Phylogeny

Nov 04, 2009.

First, download and install ClustalW, ClustalX, TreeView and PHYLIP on your computer.


Section I: Multiple alignment exercise

1. Carboxypeptidases

Carboxypeptidases are enzymes that cleave amino acids from the carboxyl terminus of peptides. They are secreted by the pancreas to aid digestion.

The file carboxypep.fasta contains 18 carboxypeptidase sequences from humans, cows, rats, pigs, etc. Align the sequences using ClustalX.

  • How well conserved are the sequences?
  • Are there any sequences that seem to be outliers (more distantly related)?
  • Does the Neighbour-Joining tree support your notion of outlier? Try removing the outliers using the Edit menu in ClustalX, and then redo the alignment.

2. Beta-lactamases

Beta-lactamases are enzymes (produced by bacteria) that cleave antibiotics of the so-called beta-lactam family, which includes penicillin, thus making the bacteria resistant to those drugs.

The sequences of 12 beta-lactamases can be found in file bcII-all.fasta. Align them in ClustalX.

Choose Alignment | Output Format Options | Output Order | Input to prevent ClustalX from rearranging the sequences.

  • There are a few sites at which an amino acid is conserved in all twelve sequences. How many? (Look at the Column Score Profile, or look for a star above the alignment).

  • The twelve proteins fall into three families, B1, B2, and B3. The B1 family consists of the first six proteins (in the original file order), the B2 family consists of the next two proteins, and the B3 family consists of the last four proteins.

  • Check that this is also what ClustalX inferred by saving and then viewing the Neighbour-Joining tree. Print out the alignment produced by ClustalX and compare it to the alignment shown in the paper by Galleni et al. (available only on paper). Reordering the sequences to match the Galleni paper will make this task much easier.
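The count of fully conserved sites asked for above can also be checked by hand. The following is a minimal Python sketch, assuming the aligned sequences are already available as equal-length strings with "-" for gaps; the function name and the toy sequences are made up for illustration and are not taken from bcII-all.fasta.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where every sequence has the same residue."""
    lengths = {len(seq) for seq in aligned}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for index, column in enumerate(zip(*aligned)):
        residues = set(column)
        # A column counts only if it holds one residue type and no gap.
        if len(residues) == 1 and "-" not in residues:
            hits.append(index)
    return hits


# Toy alignment (hypothetical sequences).
toy = ["MKT-HLA", "MKSAHLG", "MRT-HLA"]
print(conserved_columns(toy))       # [0, 4, 5]
print(len(conserved_columns(toy)))  # 3
```

These columns correspond to the positions that ClustalX marks with a star.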

Section II: PHYLIP exercise

Introduction

The aim of this exercise is to gain a basic understanding of phylogenetic trees. In systematic biology and evolutionary biology this is, of course, essential. But phylogenetic trees are also important tools in ecology and medicine (as we will see here).

Inference Methods

The procedure for constructing a tree in today's lab is described below.

  • Search in SRS, Entrez or BLAST and retrieve the sequences of interest. They must be homologous; otherwise they do not share a common evolutionary history.
  • Make a FASTA file of the sequences.
  • Align the sequences with a suitable program. ClustalW is a frequently used program for multiple alignment, and it is also available on the web.
  • Pairwise distances are calculated with the program PROTDIST from the PHYLIP package. The distance between two sequences is inversely related to their similarity: if two sequences are very similar, the distance is small.
  • Bootstrapping is a method for estimating the support for a particular branch in a tree. This is how it is done in today's exercise:
    • Positions (columns) in the alignment are resampled at random, with replacement, until the sample contains as many positions as the original alignment. A distance matrix is calculated from this resampled alignment.
    • Repeating this, for example, 100 times gives 100 distance matrices.
  • The distance matrix is used to construct a tree showing the relationships between the taxa. In our case this is done with the Neighbor-Joining method. If bootstrapping is used, every distance matrix gives rise to a new tree (a "replicate"). In the consensus tree, each branch (or, more correctly, each node) is assigned a number showing how many of the replicate trees contain that particular branch. Values below 50% are of no use. As a rule of thumb, branches with values over 80% may be considered supported.
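The resampling step can be illustrated with a short Python sketch. It assumes a toy alignment of made-up sequences, and it uses the simple proportion of differing sites (the p-distance) as a stand-in for the distance; PROTDIST applies an amino acid substitution model, which this sketch does not attempt to reproduce. All function names are hypothetical.

```python
import random


def bootstrap_alignment(aligned, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(aligned[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[i] for i in picks) for seq in aligned]


def p_distance(a, b):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Square matrix of pairwise p-distances."""
    n = len(aligned)
    return [[p_distance(aligned[i], aligned[j]) for j in range(n)] for i in range(n)]


# Toy alignment (hypothetical sequences).
toy = ["MKTAHLAG", "MKSAHLGG", "MRTCHIAG"]
rng = random.Random(1)  # fixed odd seed so that the run is repeatable

# One distance matrix per bootstrap replicate.
matrices = [distance_matrix(bootstrap_alignment(toy, rng)) for _ in range(100)]
print(len(matrices))               # 100
print(distance_matrix(toy)[0][1])  # 0.25 (2 of 8 sites differ)
```

Each of the 100 matrices would then be passed to a tree-building method, which is the role played by neighbor in this lab.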

1. In-class exercise

  • Download the FASTA file cyc.fas.
  • Use ClustalW to align the sequences in this file.
  • Make sure that the output format is PHYLIP.
  • The run will take about a minute. Save the alignment.
  • Now use your alignment to produce bootstrap replicates of the alignment. Open seqboot, enter your alignment file and type in a random seed (an odd number). The seed initializes the random number generator used for the bootstrap resampling. Choose 100 bootstrap replicates and click on submit. Save the output to a file.
  • Next, calculate distance matrices for your 100 bootstrap replicates. Open protdist, enter the bootstrap alignment file you just saved and save the resulting file.
  • Then produce neighbor-joining trees for each of the 100 replicates. Open neighbor and enter the file with the distance matrices. Save the tree file for the next step.
  • To summarize the bootstrap trees, make a consensus tree. Open consense and enter the tree file.
  • View the output files and outtrees with drawgram.
  • If you want to improve the graphics, you can use the drawtree program.

Note: You should view each of the output and outtree files in a text editor such as Notepad.
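When you open the alignment in a text editor, it helps to know what a PHYLIP file looks like: the first line gives the number of sequences and the number of alignment positions, and each sequence name occupies exactly 10 characters. The Python sketch below writes a small alignment in this layout. The function name and the toy records are made up for illustration, and the sketch writes each sequence on a single line rather than in the interleaved blocks that an alignment program may produce.

```python
def to_phylip(records):
    """Format (name, aligned_sequence) pairs as a sequential PHYLIP alignment."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    # Header: number of sequences, then number of alignment positions.
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Names are cut or padded to exactly 10 characters.
        lines.append(f"{name[:10]:<10}{seq}")
    return "\n".join(lines) + "\n"


# Toy records (hypothetical names and sequences).
toy = [("seqA", "MKT-HLA"), ("seqB_long_name", "MKSAHLG")]
print(to_phylip(toy), end="")
# 2 7
# seqA      MKT-HLA
# seqB_long_MKSAHLG
```

Note that long names are truncated, so names that differ only after the tenth character become indistinguishable.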


Section III: Take-home exercise

Evolution of HIV and SIV

There are four different subspecies of the African green monkey, Cercopithecus aethiops. They inhabit different, but partially overlapping, areas south of the Sahara. A lentivirus, which is a type of retrovirus, has been isolated from all of these subspecies. The lentivirus is called SIV, simian immunodeficiency virus, which is a misleading name because it has not been shown to cause immunodeficiency in its natural hosts. The lentiviruses are subspecies-specific, which has led to the conclusion that the virus is of ancient origin.

SIV resembles HIV in many respects, and the two probably have a common ancestor. Their RNA genomes are about 10 kb in length. A large fraction of the RNA is taken up by the gag, pol and env genes and the long terminal repeats. In addition, there are five or six shorter genes, some of which are unique to lentiviruses.

HIV and SIV are extraordinarily diverse from a genetic point of view. Even within a single infected individual, the viruses may differ at the RNA level. Such a population of differing viruses is called a quasispecies.

HIV is divided into two types, HIV-1 and HIV-2. These are further subdivided into subtypes. Globally circulating strains of HIV are known to be highly recombinogenic, but in humans recombination can only occur between viruses that are replicating within the same cell. Coinfection has been shown to occur, but it is very rare. In this lab you will examine the evolution of HIV and SIV.


Exercise

The three files env.txt, gag.txt and pol.txt contain the protein sequences from the isolates in the list below.

No.  Isolate     Accession no.  Subtype/animal
1    HIV-1ELI    K03454         D
2    HIV-1LAI    K02013         B
3    HIV-1MAL    K03456         Unclassified
4    HIV-1NDK    M27323         D
5    HIV-2D205   X61240         B
6    HIV-2ROD    M15390         A
7    HIV-2ST     M31113         A
8    HIV-2UCI    L07625         B
9    SIVmac      M19499         macaque
10   SIVcpz      X52154         chimpanzee
11   SIVagm      M58410         African green monkey
12   SIVman      X14307         mangabey

Perform a phylogenetic analysis of the proteins. Use neighbor joining with 100 bootstrap replicates as shown above.

Also run the same analysis, but skip seqboot. This will give you the real neighbor joining tree (the tree based on the real data, as opposed to the bootstrap trees which are based on random sampling from the real data). From the consensus tree, you can see the support for the nodes on the real tree.
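The support value for a node is simply the percentage of bootstrap trees that contain the same group of taxa. The Python sketch below illustrates the counting, assuming each tree has already been reduced to the set of clades it contains (with all trees rooted on the same outgroup). The taxon labels A to D, the toy trees and the function name are made up for illustration; consense does this bookkeeping for you.

```python
def clade_support(real_clades, replicate_trees):
    """Percentage of replicate trees that contain each clade of the real tree."""
    total = len(replicate_trees)
    support = {}
    for clade in real_clades:
        hits = sum(1 for tree in replicate_trees if clade in tree)
        support[clade] = 100.0 * hits / total
    return support


# Hypothetical clades; each tree is a set of clades (frozensets of taxon labels).
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
ABC = frozenset({"A", "B", "C"})
AC = frozenset({"A", "C"})
BD = frozenset({"B", "D"})

real_tree = [AB, CD]
replicates = [{AB, CD}, {AB, ABC}, {AC, BD}, {AB, CD}]

for clade, percent in clade_support(real_tree, replicates).items():
    print(sorted(clade), percent)
# ['A', 'B'] 75.0
# ['C', 'D'] 50.0
```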

All the trees are unrooted. That means that they do not say anything about the direction of evolution. Take that into account when you compare the trees. To simplify the comparison, use the same outgroup in all the trees. To find out which number to type in the outgroup box, count the sequences from the top of the alignment. (The numbers depend on the alignment, so the same taxon could have a different number in the alignments for the different proteins.) Use your biological knowledge to select an appropriate outgroup.
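Counting the sequences from the top of the alignment can be done with a few lines of Python. This is a minimal sketch, assuming the alignment is in PHYLIP format with one line per sequence in the first block (interleaved, or sequential without wrapped lines); the function name and the labels in the toy alignment are made up for illustration.

```python
def numbered_names(phylip_text):
    """Return (number, name) pairs, numbered from 1 in alignment order."""
    lines = [line for line in phylip_text.splitlines() if line.strip()]
    n_taxa = int(lines[0].split()[0])  # first header field: number of sequences
    first_block = lines[1:1 + n_taxa]
    # The name occupies the first 10 characters of each line in the first block.
    return [(i, line[:10].strip()) for i, line in enumerate(first_block, start=1)]


# Toy interleaved alignment (hypothetical labels and sequences).
toy = """3 12
HIV1_ELI  MKTAHL
HIV2_ROD  MKSAHL
SIVagm    MRTCHI

AGGTTA
GGGTTA
AGCTTA
"""

for number, name in numbered_names(toy):
    print(number, name)
# 1 HIV1_ELI
# 2 HIV2_ROD
# 3 SIVagm
```

Run it on each of the three alignments separately, since the order of the sequences may differ between them.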

If you want, you can also run protpars to search for the best tree under the maximum parsimony criterion and compare the results.

Please answer the following questions:

  1. What do the trees show with regard to the HIV and SIV relationships?
  2. Why do SIVs cluster with both HIV-1 and HIV-2?
  3. What do the bootstrap values tell you about the trees, and how do they influence your interpretation?


updated on Aug. 12, 2007 by Wu