|
The number of genes officially recognized in the human genome, 20,000 to 25,000 protein-coding genes, is amazingly low, barely above that of the nematode C. elegans. That is in part because annotation of the genome tends to be conservative and oversimplified: typically, only conserved coding genes with introns and long CDSs are validated. At the extreme, the latest human consensus CDS collection (CCDS) today encompasses only 14,795 CDSs from 13,142 genes. But most of the human-specific or intronless genes are excluded from those collections.
The goal of the AceView project is to give a more realistic image of the structure and organisation of the genes of higher organisms. Since we try to glimpse into the unknown, we do not use ab initio predictions or advanced statistical models. Rather, we try to construct, for each gene, the minimal set of mRNA models compatible with all the available genomic and cDNA experimental sequences. In practice, we succeed in aligning 4,200,000 public human mRNAs and ESTs to the human genome, one by one. We refine the exon boundaries by co-alignment, and cluster the sequences into genome-based transcripts. Transcripts that contact each other are then considered alternative variants of a single “gene”.
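The final clustering step can be sketched as follows. This is only a minimal illustration, not the AceView implementation: it assumes that "contact" means overlapping genomic spans on the same chromosome and strand, and the `Transcript` class and `cluster_into_genes` function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record for a genome-based transcript model.
    name: str
    chrom: str
    strand: str  # "+" or "-"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive


def cluster_into_genes(transcripts):
    """Group transcripts whose spans overlap on the same chromosome and strand.

    ASSUMPTION: span overlap stands in for "transcripts contacting each other";
    the real criterion may be stricter (for example, shared exonic sequence).
    """
    genes = []
    current, current_key, current_end = [], None, None
    ordered = sorted(transcripts, key=lambda t: (t.chrom, t.strand, t.start, t.end))
    for t in ordered:
        key = (t.chrom, t.strand)
        if current and key == current_key and t.start <= current_end:
            # Touches the running cluster: same gene, extend its span.
            current.append(t)
            current_end = max(current_end, t.end)
        else:
            # No contact: close the previous gene and start a new one.
            if current:
                genes.append(current)
            current, current_key, current_end = [t], key, t.end
    if current:
        genes.append(current)
    return genes


example = [
    Transcript("a", "chr1", "+", 100, 500),
    Transcript("b", "chr1", "+", 400, 900),
    Transcript("c", "chr1", "+", 1000, 1200),
    Transcript("d", "chr1", "-", 450, 600),
]
gene_names = [[t.name for t in gene] for gene in cluster_into_genes(example)]
```

In this toy input, `a` and `b` overlap and form one gene, `c` stands alone, and `d` is kept separate because it lies on the opposite strand.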
The resulting view is rich and complex. We find over 40,000 genes putatively encoding more than 100 amino acids, and another 11,000 genes with introns but no clear protein-coding potential. Some of those may be partial; some are antisense to protein-coding genes and could play a role in regulation; some are clearly non-protein-coding; a few may be artefacts.
Why are our results different? The main strength of AceView is its central use of the high-quality genome sequence and its original EST-to-genome alignment algorithms, which have been fine-tuned during hand annotation of 202,000 worm cDNA sequencing traces. AceView does not mask repeats and does not pre-cluster the mRNAs and ESTs against each other. It carefully handles the anomalies: we label structural defects such as mosaics and rearrangements in 5% of the cDNA clones, flag the spurious poly(A) in clones internally primed in A-rich regions, fix the strands, spot the vectors and so on. All this allows us to use more cDNA data and to unambiguously map and cluster 72% of the ESTs and 95% of the GenBank mRNAs into genes.
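As an illustration of one of these checks, the sketch below flags a possible internal priming event by testing whether the genomic sequence just downstream of a cDNA's 3' alignment end is A-rich. The function name `is_a_rich`, the 20-base window, and the 75% threshold are assumptions chosen for the example, not values taken from AceView.

```python
def is_a_rich(downstream_genomic: str, window: int = 20, min_fraction: float = 0.75) -> bool:
    """Return True if the first `window` genomic bases are mostly adenines.

    ASSUMPTION: `downstream_genomic` is the genome sequence immediately after
    the last aligned base of the cDNA, read on the transcript strand. A poly(A)
    tail followed by an A-rich genomic stretch suggests the tail may be
    templated by the genome (internal priming) rather than a true 3' end.
    """
    seq = downstream_genomic[:window].upper()
    if len(seq) < window:
        # Not enough sequence to judge; do not flag.
        return False
    return seq.count("A") / window >= min_fraction


flag_rich = is_a_rich("A" * 18 + "GC")   # 18 of 20 bases are A
flag_mixed = is_a_rich("ACGT" * 5)       # 5 of 20 bases are A
flag_short = is_a_rich("AAAA")           # fewer than 20 bases available
```

A clone flagged this way would keep its alignment, but its poly(A) would not be trusted as evidence of a genuine polyadenylation site.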
Organization of the genes is not straightforward, and may provide human genes with interesting classical genetic behaviors. Most of all, there are loads of alternatively spliced variants: the 33,000 genes with introns have on average 5 mRNA variants and 4 protein isoforms per gene. This is much more prevalent than in the nematode at comparable cDNA coverage. Alternative variants differ in their promoter, in their last exon, or in their splicing pattern. This phenomenon, including the lack of splicing of some introns (which usually leads to truncated proteins), is likely to be regulated and tissue-specific.
We also see genes encoding sets of proteins, some of which have no amino acid in common: these should display interallelic complementation and behave as typical complex loci. In addition, there are many genes overlapping in cis, reminiscent of operons, as well as genes in antisense, close repetitions, transcribed retroposons and a few misleading errors in the genome.
When we try to annotate the transcripts, one of the remaining mysteries is determining what exactly is translated into proteins. Completeness is not always guaranteed, but even when it is, we do not know which ORF is translated, what initiates translation, which codon is used, or which signals are involved. We see frequent examples of possible uORFs, non-ATG starts, and dicistronic transcripts encoding multiple putative proteins in a single mRNA. Although we know little about the usage of leaky starts, leaky stops (selenocysteine, pyrrolysine, nonsense suppressors or other stop-breakers), translational frameshifts, RNA editing, internal ribosome entry sites, or control of the polyadenylation site, it appears that these translation-control mechanisms are highly regulated, by external cues and/or by modifier genes.
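To make the ambiguity concrete, here is a minimal sketch of a naive ORF scan over the three forward frames of an mRNA. It assumes the simplest possible rules (ATG start, standard stop codons) and so ignores exactly the mechanisms listed above, such as non-ATG starts, leaky stops and frameshifts; the function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(mrna: str, min_codons: int = 2):
    """List ATG-initiated ORFs as (start, end, frame) tuples.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    ASSUMPTION: only the first ATG after the previous stop in each frame is
    reported, and only the standard stop codons terminate an ORF.
    """
    seq = mrna.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count coding codons, excluding the stop itself.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return sorted(orfs)


# Toy mRNA with a short upstream ORF followed by a longer ORF in another frame.
toy_orfs = find_orfs("CCATGAAATAGGATGCCCGGGTTTTGA")
```

On this toy sequence the scan reports a two-codon upstream ORF at positions 2 to 11 and a longer ORF at positions 12 to 27; nothing in the sequence alone says which of the two, if either, is actually translated.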
Unfortunately, we do not have enough information on proteins. Mass spectrometry is promising, but only 10 to 20% of the results can be recognized in human proteome studies. We suspect that this might be due in part to the severe incompleteness of the current catalog of proteins, and believe that the creation of a global mass spectrometry database could greatly help us understand the rules of translation.
We hope to hear your ideas on how to better our view and annotation of the genes!
|