"A Comprehensive Whole Genome Bacterial Phylogeny Using
Correlated Peptide Motifs Defined in a High Dimensional Vector Space"

Abstract:
--------
Motivation: As whole genome sequences continue to expand in number and
complexity, effective methods for comparing and categorizing both genes
and species represented within extremely large datasets are required.
Current methods have generally utilized incomplete (and likely insufficient)
subsets of the available data even as additional data becomes available at
a rapid rate.  In collaboration with Professor Gary Stuart at Indiana
State University, we have developed an accurate and efficient method for
producing robust gene and species phylogenies using very large whole genome
protein datasets.  This method relies on multidimensional protein vector
definitions supplied by the singular value decomposition (SVD) of
large sparse data matrices in which each protein is uniquely represented as
a vector of overlapping tetrapeptide frequencies.  Each of the 134,155
proteins from 53 complete prokaryotic genomes and one mitochondrion was
represented in a definition space constructed from the 571 largest singular
triplets.
Quantitative pairwise estimates of species similarity were obtained
by summing the protein vectors to form species vectors, then determining
the cosines of the angles between species vectors.  Evolutionary trees
were produced from the distance matrices obtained following the conversion
of these vector-derived similarity measures into evolutionary distance
measures.  Although many accepted prokaryotic relationships were confirmed
in these trees, several novel relationships were also noted.  In addition,
we provide evidence that each of the SVD-derived basis vectors represents
a particular conserved protein motif composed of sets of correlated peptides.
Each "copep" motif is precisely defined as a particular linear combination
of all 160,000 possible tetrapeptides.  This analysis represents the most
detailed simultaneous comparison of prokaryotic genes and species available
to date.
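
The pipeline described above (tetrapeptide frequency vectors, a truncated
SVD, summed species vectors, and cosines between them) can be sketched
roughly as follows.  This is only an illustrative toy, not the authors'
implementation: the sequences, the species labels, the use of NumPy's dense
SVD, and k = 3 are all assumptions made for the example (the abstract uses
sparse matrices and the 571 largest singular triplets), and the conversion
from similarity to evolutionary distance is omitted because the abstract
does not specify it.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20**4 = 160,000 possible tetrapeptides, each mapped to a row index.
TETRAPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAPEPTIDES)}


def tetrapeptide_counts(seq):
    """Count overlapping tetrapeptides in one protein sequence."""
    v = np.zeros(len(INDEX))
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])
        if j is not None:  # skip windows with non-standard residues
            v[j] += 1
    return v


# Hypothetical toy data: (species label, protein sequence).
proteins = [
    ("A", "MKTAYIAKQRQISFVK"), ("A", "MSDNELLKKAVEG"),
    ("B", "MKTAYIAKQRQLSFVK"), ("B", "MSDNELLKRAVEG"),
    ("C", "GGWPHHCCYYWWPPGG"), ("C", "CCWWHHPPYYGGCC"),
]

# Peptide-by-protein matrix: one column per protein.
A = np.column_stack([tetrapeptide_counts(seq) for _, seq in proteins])

# Truncated SVD: keep the k largest singular triplets (toy value of k).
k = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
protein_vectors = Vt[:k].T * s[:k]  # one k-dimensional row per protein

# Sum the protein vectors of each species to form species vectors.
species = sorted({label for label, _ in proteins})
species_vectors = np.array([
    sum(protein_vectors[i] for i, (label, _) in enumerate(proteins)
        if label == name)
    for name in species
])

# Cosines of the angles between species vectors.
norms = np.linalg.norm(species_vectors, axis=1)
sim = (species_vectors @ species_vectors.T) / np.outer(norms, norms)

print(species)
print(np.round(sim, 3))
```

In this toy input, species A and B differ by a single residue in each
protein, while species C shares no tetrapeptides with either, so the A-B
cosine should come out higher than the A-C cosine.  Each column of U here
corresponds to one basis vector, that is, a linear combination of all
160,000 tetrapeptides.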

-------------------------------------------------------------------
Michael W. Berry                     Department of Computer Science
berry@cs.utk.edu                     University of Tennessee
OFF:(865) 974-3838                   203 Claxton Complex
FAX:(865) 974-4404                   1122 Volunteer Blvd.
URL:http://www.cs.utk.edu/~berry/    Knoxville, TN  37996-3450