Challenges in Genome Annotation and Data Integration.

The identification and annotation of protein-encoding genes is one of the primary goals of whole-genome sequencing projects. Computational methods are widely used for the initial identification and enumeration of genes, but they have proven inadequate for correctly identifying and characterizing genes and transcripts: they generally have problems with features such as 5’ and 3’ untranslated regions and with discerning alternatively spliced products. Expert review (curation) of computational gene predictions and of the evidence underlying them can greatly improve the overall quality of genome annotation. In particular, EST and cDNA sequences provide experimental validation of predictions when interpreted by trained molecular biologists. The annotation process also exposes a number of issues surrounding data management and integration, including, but not limited to, data updates, data quality control, and data sharing. Some solutions to these problems will be discussed.

-------------------------------------------------------------------------------

RICHARD J. MURAL
Celera Genomics
45 West Gude Drive
Rockville, MD 20878, USA.
Email: richard.mural@celera.com.