Introduction to Bioinformatics
Woo Mao-Ying, 10/25/2007
Today's Topics
(1) NCBI ORF Finder; (2) GenScan; (3) GeneMark.hmm; (4) Other Gene Finding Programs.
1. Use of ORFs for prokaryotic gene prediction
- (Roughly) annotate this non-eukaryotic genome and determine the nature of the organism by using NCBI's "Open Reading Frame Finder".
- Go to the NCBI ORF Finder window.
- Paste sequence into window.
- Select 'OrfFind'.
- On the result page, change the sensitivity from 100 to 300. Select 'Redraw'.
- Locate the exact start and end points of each ORF by clicking it, and record its nucleotide positions.
- Identify the nature of the gene for any ORF by clicking on it, then find and select 'Blast' above the ORF display.
- Wait for your search results to be displayed. If this takes too long, write down the 'Request ID', start a Blast search for another ORF, and return later to retrieve your results.
- View the Blast results for several ORFs and determine what the genes are and the nature of the organism.
- Does this map confirm your suspicion? Which genes do the ORFs you found before represent? Which genes did you miss?
- Re-run the NCBI ORF Finder, but this time set the sensitivity on the output page to 100 and then to 50. Select 'Redraw' and try to detect ORFs for the genes that are on the map but that you did not see before.
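To make the idea behind the steps above concrete, here is a minimal Python sketch of a six-frame ORF scan. It is not NCBI's implementation: it assumes that an ORF runs from an ATG to the next in-frame stop codon under the standard genetic code, and it assumes that the 'sensitivity' setting acts like a minimum ORF length in nucleotides. The names `find_orfs` and `min_len` are made up for this illustration.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumption)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 100) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for ATG..stop ORFs of at least min_len nt.

    Returns (strand, frame, start, end) tuples. start and end are 1-based
    positions on the scanned strand; end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None                      # index of the current open ATG
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame + 1, start + 1, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence; replace with the genome used in the exercise.
    for orf in find_orfs("CCATGAAATAGCC", min_len=9):
        print(orf)
```

Lowering `min_len` reports more (and shorter) ORFs, which is presumably why genes missed at a stricter setting can show up at a more permissive one.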
2. GenScan
A. Developed by Chris Burge (currently at MIT)
B. Eukaryotic Gene Prediction
C. Model: Statistical (Hidden Markov Model)
D. Web GenScan Service (http://genes.mit.edu/GENSCAN.html)
- Limit: one million base pairs (1 Mbp) in length.
- Three different types of organisms: Vertebrate, Arabidopsis, Maize
- It predicts genes and exons.
- For sequences longer than 1 Mbp, you should use the local standalone version.
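Before submitting, it can help to check whether a sequence is under the web limit stated above. The following is a small sketch that assumes the input is a single-record FASTA string; the helper names `fasta_length` and `fits_web_server` are hypothetical and are not part of GenScan.

```python
GENSCAN_WEB_LIMIT = 1_000_000  # 1 Mbp, the web limit quoted in this handout


def fasta_length(text: str) -> int:
    """Count residues in a single-record FASTA string.

    Header lines (starting with '>') and surrounding whitespace are ignored.
    """
    return sum(
        len(line.strip())
        for line in text.splitlines()
        if not line.startswith(">")
    )


def fits_web_server(text: str, limit: int = GENSCAN_WEB_LIMIT) -> bool:
    """Return True if the sequence is short enough for the web service."""
    return fasta_length(text) <= limit


if __name__ == "__main__":
    record = ">example\nACGTACGT\nACGT\n"
    print(fasta_length(record), fits_web_server(record))
```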
E. Example
Let's use GenScan to predict genes in the following 100 kbp Arabidopsis genomic sequence.
- Submit to the GenScan web server.
- Select Arabidopsis.
- Type in a sequence name (optional).
- Select "predicted CDS and peptide".
- Paste in your DNA sequence or upload your file.
- Select "Run GenScan".
- GenScan output file: HTML, PDF view
- Answer the following questions:
- How many genes have you found in this piece of DNA?
- How many exons does predicted gene #10 have?
- What protein corresponds to predicted gene #14?
Another example: Homo sapiens chromosome 18, positions 58,941,000 to 59,140,000.
3. GeneMark.hmm
GeneMark is a family of gene prediction programs developed at the Georgia Institute of Technology.
The statistical model of genomic sequence organization employed in the GeneMark.hmm algorithm is a HMM with duration or a hidden semi-Markov model (HSMM). The HSMM architecture consists of hidden states for initial, internal and terminal exons, introns, intergenic regions and single exon genes. It also includes hidden states for start site (initiation site), stop site (termination site), and donor and acceptor splice sites. The site states emit nucleotide sequences of fixed length modeled by positional (inhomogeneous) Markov chains. The length and parameters of these models are site type-dependent and determined from the sets of sequences of verified sites of a given type. Note that the models for sequences emitted by splice site states are also intron phase-dependent. The protein-coding states (initial, internal, terminal exons and single exon gene) emit nucleotide sequences modeled by the three-periodic inhomogeneous Markov chains. Parameters of these models are chosen to be tied and are estimated from the sets of annotated protein-coding sequences. Orders of the Markov chains, up to the 5th order, are chosen depending on the total length of the training sequence. ...
Figure 1. Hidden Markov model of a prokaryotic nucleotide sequence used in the GeneMark.hmm algorithm. The hidden states are represented as ovals in the figure, and arrows correspond to allowed transitions between the states.
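The "three-periodic inhomogeneous Markov chain" used for the protein-coding states can be illustrated with a toy example. The sketch below is only an illustration and is not GeneMark.hmm code: it uses a first-order chain (the description above mentions orders up to the 5th), a simple pseudocount, and assumes sequences contain only A, C, G and T. The function names are invented for this handout.

```python
import math
from typing import Dict, List, Tuple

ALPHABET = "ACGT"
Model = Dict[Tuple[int, str], Dict[str, float]]


def train_three_periodic(cds_list: List[str], pseudocount: float = 1.0) -> Model:
    """Estimate P(next | previous, codon position of next) from in-frame CDSs.

    Each training sequence is assumed to start at codon position 0.
    """
    counts = {
        (pos, prev): {n: pseudocount for n in ALPHABET}
        for pos in range(3)
        for prev in ALPHABET
    }
    for cds in cds_list:
        cds = cds.upper()
        for i in range(1, len(cds)):
            prev, nxt = cds[i - 1], cds[i]
            if prev in ALPHABET and nxt in ALPHABET:
                counts[(i % 3, prev)][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    return {
        key: {n: c / sum(row.values()) for n, c in row.items()}
        for key, row in counts.items()
    }


def log_likelihood(seq: str, model: Model, frame: int = 0) -> float:
    """Log-probability of seq[1:] given seq[0].

    `frame` is the codon position (0, 1 or 2) assumed for seq[0].
    Input must contain only A, C, G, T (assumption of this sketch).
    """
    seq = seq.upper()
    total = 0.0
    for i in range(1, len(seq)):
        pos = (i + frame) % 3
        total += math.log(model[(pos, seq[i - 1])][seq[i]])
    return total


if __name__ == "__main__":
    model = train_three_periodic(["GCTGCTGCTGCTGCT"] * 5)
    for frame in range(3):
        print(frame, round(log_likelihood("GCTGCTGCT", model, frame), 3))
```

Because the transition probabilities depend on the codon position, the same sequence scores differently in each frame, and that frame-dependent signal is what lets a coding model separate the correct reading frame from the other two.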
C. Exercise
Let's use GeneMark.hmm to predict genes in the following 100 kbp Arabidopsis genomic sequence.
- Submit to the GeneMark web server.
- In the "Sequence Text" window, paste in the edited nucleotide sequence or upload the sequence file.
- For output options, select the following: Generate PDF graphics (screen), Print GeneMark 2.4 predictions in addition to GeneMark.hmm predictions, and Translate predicted genes into protein.
- Click on "Start GeneMark".
- GeneMark.hmm output file: HTML, PDF view
- Answer the following questions:
- How many exons are in the unknown sequence?
- What are the start and end points for each exon?
- Do the two gene finding programs (GenScan and GeneMark.hmm) agree on the above answers or are there discrepancies?
- What other elements could you identify with these programs? Look for Poly A sites, GC content, etc.
- Can you translate the sequence into a protein? What is the length of the protein sequence?
- What else can you say about the putative protein sequence? (Molecular weight, other characteristic attributes, matches in a database, structure)
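For the translation and GC content questions, a short Python sketch may be useful as a cross-check on the web server output. It assumes the standard genetic code and a coding sequence that has already been spliced together and starts in frame; `translate` and `gc_content` are names chosen for this example, not functions of GenScan or GeneMark.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon.

    Codons containing characters other than A, C, G, T become 'X'.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    cds = "ATGGCCTAA"  # toy example; use the predicted CDS from the exercise
    protein = translate(cds)
    print(protein, len(protein), gc_content(cds))
```

The length of the returned string is the protein length in amino acids, which can be compared with the peptide reported by each gene finder.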
4. Other Gene Finding Programs
- GeneID http://www1.imim.es/geneid.html
- GRAIL http://compbio.ornl.gov/tools/index.shtml
- GRAIL-EXP http://compbio.ornl.gov/grailexp
- MZEF http://rulai.cshl.org/software/index1.htm
- PROCRUSTES http://hto-13.usc.edu/software/procrustes
- HMMgene http://www.cbs.dtu.dk/services/ (eukaryotes and prokaryotes)
- BCM Gene Finder http://searchlauncher.bcm.tmc.edu/seq-search/gene-search.html
- FGENEH http://www.softberry.com/berry.phtml
Modified Oct 06, 2007 by Woo