Introduction to Bioinformatics

Lab 7: Computational Gene Finding

Woo Mao-Ying, 10/25/2007

Today's Topics

(1) NCBI ORF Finder; (2) Genscan; (3) GeneMark.HMM; (4) Other Gene Finding Programs.

1. Use of ORFs for prokaryotic gene prediction

Roughly annotate this non-eukaryotic genome and determine the nature of the organism using NCBI's "Open Reading Frame Finder":
- Go to NCBI ORF Finder window.
- Paste sequence into window.
- Select 'OrfFind'.
- On the result page change sensitivity from 100 to 300. Select 'Redraw'.
- Locate the exact start and end points of an ORF by clicking it. Record the nucleotide positions for each ORF.
- Identify nature of gene for any ORF by clicking on it. Then find and select 'Blast' above the ORF display.
- Wait for your search results to be displayed. If this takes too long, write down the 'Request ID', start a Blast search for another ORF, and return later to retrieve your results.
- View the Blast results for several ORFs and determine what the genes are and the nature of the organism.
Does this map confirm your suspicion? Which genes do the ORFs you found before represent? Which genes did you miss?
Re-run the NCBI ORF Finder, but this time set the sensitivity on the output page to 100 and then to 50. Select 'Redraw' each time and try to detect ORFs for the genes that are on the map but that you did not see before.
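
To make the idea behind the steps above concrete, here is a minimal sketch of a six-frame ORF scan. It is not the NCBI ORF Finder's own code. The function names and the min_len parameter are hypothetical, and treating min_len as the counterpart of the threshold chosen on the result page is an assumption.

```python
# Toy six-frame ORF scan for illustration; NOT the NCBI ORF Finder implementation.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_len):
    """Yield (start, end) 0-based half-open ATG...stop ORFs in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG after the previous stop
            elif start is not None and codon in STOPS:
                end = i + 3                    # include the stop codon
                if end - start >= min_len:
                    yield start, end
                start = None


def find_orfs(seq, min_len=100):
    """Return (strand, first, last) tuples in 1-based forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    hits = [("+", s + 1, e) for s, e in orfs_in_strand(seq, min_len)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_len):
        hits.append(("-", n - e + 1, n - s))   # map back to forward coordinates
    return sorted(hits, key=lambda h: h[1])


if __name__ == "__main__":
    # Short made-up sequence, used only to show the call.
    print(find_orfs("CCATGAAATTTTAGGG", min_len=9))
```

Lowering min_len reports more, shorter ORFs, which is the trade-off explored when the threshold is changed and the map is redrawn.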

2. GENSCAN

A. Developed by Chris Burge (Currently at MIT)
B. Eukaryotic Gene Prediction
C. Model: Statistical (Hidden Markov Model)
D. Web GenScan Service (http://genes.mit.edu/GENSCAN.html)

Limit: one million base pairs (1 Mbp) in length.
Three different types of organisms: Vertebrate, Arabidopsis, Maize
It predicts Genes/Exons
For sequences longer than 1 Mbp, use the local standalone version.
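
Before pasting a sequence into the web form, it can help to check its length against the limit. The sketch below assumes the input is FASTA text (or a bare sequence); the function names and the GENSCAN_WEB_LIMIT constant are hypothetical names for this illustration.

```python
GENSCAN_WEB_LIMIT = 1_000_000  # base pairs, the web server limit stated above


def read_fasta_sequence(text):
    """Return the sequence of the first record in FASTA text (header optional)."""
    parts = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header or parts:
                break                  # a second record starts here; stop reading
            seen_header = True
            continue
        parts.append(line)
    return "".join(parts)


def fits_web_server(text, limit=GENSCAN_WEB_LIMIT):
    """True if the first sequence in the text is within the length limit."""
    return len(read_fasta_sequence(text)) <= limit


if __name__ == "__main__":
    example = ">demo\nACGTACGT\nACGT\n"   # made-up record
    print(len(read_fasta_sequence(example)), fits_web_server(example))
```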

E. Example

Let's use GenScan to predict genes in the following 100 kbp Arabidopsis genomic sequence: Arabidopsis genomic sequence

Submit to the GenScan web server.
Select Arabidopsis.
Type in a sequence name (optional).
Select "predicted CDS and peptide".
Paste in your DNA sequence or upload your file.
Select "Run GenScan".
GenScan Output File HTML, PDF View
Answer the following questions:
- How many genes have you found in this piece of DNA?
- How many exons does the predicted gene#10 have?
- What protein corresponds to the predicted gene#14?
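
The first two questions can also be answered by script once the text output is saved. The sketch below assumes that each predicted feature appears on its own line, that the line starts with a "gene.feature" ID such as 10.02, and that exon types are coded Init, Intr, Term and Sngl. Check these assumptions against your own output. The SAMPLE text is made up for illustration and is not real GenScan output.

```python
import re
from collections import defaultdict

# Assumed exon type codes; other feature types (e.g. promoters, poly-A) are skipped.
EXON_TYPES = {"Init", "Intr", "Term", "Sngl"}
# Assumed line layout: gene.feature  type  strand  begin  end ...
FEATURE_LINE = re.compile(r"^\s*(\d+)\.(\d+)\s+(\w+)\s+([+-])\s+(\d+)\s+(\d+)")


def exons_per_gene(output_text):
    """Map gene number -> list of (type, strand, begin, end) for exon lines only."""
    genes = defaultdict(list)
    for line in output_text.splitlines():
        match = FEATURE_LINE.match(line)
        if not match:
            continue                      # header, separator, or unrelated line
        gene, _, ftype, strand, begin, end = match.groups()
        if ftype in EXON_TYPES:
            genes[int(gene)].append((ftype, strand, int(begin), int(end)))
    return dict(genes)


# Made-up lines in the assumed layout (NOT real prediction results).
SAMPLE = """
Gn.Ex Type S .Begin ...End .Len
----- ---- - ------ ------ ----
 1.01 Init +    101    250  150
 1.02 Term +    400    520  121
 1.03 PlyA +    600    605    6
 2.01 Sngl -   2000   1100  901
"""

if __name__ == "__main__":
    genes = exons_per_gene(SAMPLE)
    print("genes:", len(genes))
    for number, exons in sorted(genes.items()):
        print("gene", number, "has", len(exons), "exon(s)")
```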

Another example: Homo sapiens chromosome 18, positions 58,941,000 to 59,140,000.

3. GeneMark.HMM

GeneMark is a family of gene prediction programs developed at Georgia Institute of Technology.

Sequence Type	Gene Prediction Program
Gene Prediction in Bacteria, Archaea and Metagenomes	parallel combination of GeneMark-P and GeneMark.hmm-P, Heuristic models, GeneMarkS
Gene Prediction in Eukaryotes	parallel combination of GeneMark-E and GeneMark.hmm-E, GeneMark.hmm-ES
Gene Prediction in Viruses, Phages and Plasmids	the Heuristic approach or the self-training program GeneMarkS
Gene Prediction in EST and cDNA	GeneMark-E

The statistical model of genomic sequence organization employed in the GeneMark.hmm algorithm is an HMM with duration, also called a hidden semi-Markov model (HSMM). The HSMM architecture consists of hidden states for initial, internal and terminal exons, introns, intergenic regions and single-exon genes. It also includes hidden states for the start site (initiation site), the stop site (termination site), and the donor and acceptor splice sites. The site states emit nucleotide sequences of fixed length modeled by positional (inhomogeneous) Markov chains. The length and parameters of these models depend on the site type and are determined from sets of sequences of verified sites of that type. Note that the models for sequences emitted by splice site states are also intron phase-dependent. The protein-coding states (initial, internal and terminal exons and single-exon genes) emit nucleotide sequences modeled by three-periodic inhomogeneous Markov chains. The parameters of these models are tied and are estimated from sets of annotated protein-coding sequences. The order of the Markov chains, up to the 5th order, is chosen depending on the total length of the training sequence. ...

Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm.
The hidden states are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.
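
The three-periodic coding model described above can be illustrated with a much simpler toy version. The sketch below uses order 0 (base frequencies that depend only on codon position) rather than the higher-order chains GeneMark.hmm uses. The uniform background, the pseudocount, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

BASES = "ACGT"


def train_three_periodic(coding_seqs, pseudocount=1.0):
    """Estimate P(base | codon position) from in-frame coding sequences."""
    counts = [Counter() for _ in range(3)]
    for seq in coding_seqs:
        for i, base in enumerate(seq.upper()):
            if base in BASES:
                counts[i % 3][base] += 1       # i % 3 is the codon position
    model = []
    for position_counts in counts:
        total = sum(position_counts[b] for b in BASES) + 4 * pseudocount
        model.append({b: (position_counts[b] + pseudocount) / total for b in BASES})
    return model


def coding_log_odds(seq, model, frame=0):
    """Log-odds of seq under the periodic model versus a uniform background.

    frame is the index in seq of the first base of a codon.
    """
    score = 0.0
    for i, base in enumerate(seq.upper()):
        if base in BASES:
            score += math.log(model[(i - frame) % 3][base] / 0.25)
    return score


if __name__ == "__main__":
    # Tiny made-up training set; real models need far more annotated sequence.
    model = train_three_periodic(["ATGGCTGCA", "ATGGCAGCT"])
    for frame in range(3):
        print(frame, round(coding_log_odds("ATGGCT", model, frame), 3))
```

A sequence scored in its true reading frame should tend to score higher than in a shifted frame, which is the signal the protein-coding states exploit.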

B. Accuracy comparison

C. Exercise

Let's use GeneMark.hmm to predict genes in the following 100 kbp Arabidopsis genomic sequence: Arabidopsis genomic sequence

Submit to the GeneMark web server.
In the "Sequence Text" window, paste in the edited nucleotide sequence or upload the sequence file.
For output options, select the following: Generate PDF graphics (screen), Print GeneMark 2.4 predictions in addition to GeneMark.hmm predictions, and Translate predicted genes into protein.
Click on "Start GeneMark".
GeneMark.hmm Output File HTML, PDF View
Answer the following questions:
- How many exons are in the unknown sequence?
- What are the start and end points for each exon?
- Do the two gene finding programs (GenScan and GeneMark.hmm) agree on the above answers or are there discrepancies?
- What other elements could you identify with these programs? Look for Poly A sites, GC content, etc.
- Can you translate the sequence into a protein? What is the length of the protein sequence?
- What else can you say about the putative protein sequence? (Molecular weight, other characteristic attributes, matches in a database, structure)
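
For the GC content and translation questions, a small script can cross-check what the web tools report. The sketch below assumes the standard genetic code and a coding sequence whose exons have already been joined in frame; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
CODON_BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(CODON_BASES, repeat=3), AMINO_ACIDS)
}


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def translate(cds):
    """Translate a coding sequence from its first base up to the first stop codon.

    Codons containing ambiguous bases are translated as 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    cds = "ATGGCTTGGTAA"                 # made-up example sequence
    protein = translate(cds)
    print(protein, len(protein), gc_content(cds))
```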

4. Other Gene Finding Programs

GeneID http://www1.imim.es/geneid.html
GRAIL http://compbio.ornl.gov/tools/index.shtml
GRAIL-EXP http://compbio.ornl.gov/grailexp
MZEF http://rulai.cshl.org/software/index1.htm
PROCRUSTES http://hto-13.usc.edu/software/procrustes
HMMgene http://www.cbs.dtu.dk/services/ (eukaryotes and prokaryotes)
BCM Gene Finder http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
FGENEH http://www.softberry.com/berry.phtml

Modified Oct 06, 2007 by Woo