|
More flexible methods are needed to functionally classify genes and to
facilitate the discovery of new gene relationships
from genomic data. We previously developed an automated method called
Semantic Gene Organizer© (SGO), which utilizes Latent Semantic Indexing
(LSI) of titles and abstracts in MEDLINE citations to extract gene-to-gene
and gene-to-keyword relationships (Homayouni et al., 2005).
In the current study, we explored the utility of SGO for analysis of
microarray data. Using the Affymetrix U74A GeneChip platform, we previously
identified 116 IFN-β-stimulated genes (ISGs) in mouse embryonic fibroblasts
(MEFs) (Pfeffer et al., 2004). For each gene, we constructed a gene-abstract
document containing all titles and abstracts for MEDLINE citations
cross-referenced in Entrez Gene. A gene-by-document (sparse) matrix was then
constructed whereby each weighted nonzero element defines the importance of
a document (column) as a referent for the corresponding gene (row). A
truncated singular value decomposition (SVD) was used to factor the sparse
matrix into lower-rank (orthogonal) matrix factors. The column vectors of
these factors are used to represent both genes and gene-documents in the
same lower-dimensional subspace for query matching and ranking purposes.
Relationships between genes were determined by the cosine of the vector
angle between gene documents. A self-similarity matrix was then constructed
and used to generate a hierarchical tree using the Fitch-Margoliash least
squares method available via PHYLIP (University of Washington). The
clustering results were evaluated manually and annotated using Gene Ontology
(GO) functional classifications. We found several prominent clusters of
conceptually related genes that corresponded to MHC class I receptor
activity (P=2.18E-14) and GTPase activity (P=1.58E-09). This result indicated that
SGO clustering was consistent with GO classifications.
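
The SVD and cosine steps described above can be illustrated with a minimal sketch. It is not the SGO implementation: the toy matrix, the function name `lsi_gene_similarity`, and the choice of k are assumptions made for illustration, and a dense NumPy SVD stands in for the sparse truncated SVD used in practice. The tree-building step (Fitch-Margoliash via PHYLIP) is not reproduced here.

```python
import numpy as np

def lsi_gene_similarity(A, k):
    """Return gene-by-gene cosine similarities in a rank-k subspace.

    A: (n_genes, n_columns) weighted matrix with one row per gene
       (hypothetical toy input, not real MEDLINE-derived weights).
    k: number of singular factors to keep.
    """
    # Dense SVD, truncated by hand; a sparse solver would be used at scale.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    genes_k = U[:, :k] * s[:k]            # gene coordinates in k dimensions
    norms = np.linalg.norm(genes_k, axis=1, keepdims=True)
    norms[norms == 0] = 1.0               # avoid dividing by zero for empty rows
    unit = genes_k / norms
    return unit @ unit.T                  # self-similarity (cosine) matrix

# Toy matrix: genes 0 and 1 share columns, genes 2 and 3 share other columns.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
])
S = lsi_gene_similarity(A, k=2)
```

In this toy case the two gene pairs that share columns end up close together in the reduced space, while genes from different pairs remain nearly orthogonal. A matrix like `S` is the kind of input a distance-based tree method would consume after conversion to distances.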
A unique feature of SGO is its ability to rank genes based on the conceptual
relationship to any user-defined keyword query. We evaluated SGO precision
by performing keyword queries using GO index terms. An SGO query with “MHC
class I receptor activity” or “GTPase activity” produced an average
precision of 74% and 73%, respectively. We used the keyword query feature of
SGO to: (1) Expand GO classifications. For instance, in addition to
identifying MHC class I receptor genes, SGO ranked genes that are
functionally associated with the processing of class I MHC peptides, such as
Psme2 and Psmb9; (2) Identify genes that are associated with terms not
indexed in GO. For instance, we queried the set of 116 ISGs for genes
conceptually associated with biological functions of interferon such as
antiviral or anti-proliferative activity; and (3) Identify candidate gene
regulatory components associated with the microarray data. Importantly, this
method allows identification of transcription factors in the network that
may be activated by expression-independent mechanisms such as nuclear
transport, phosphorylation, or protein-protein interaction events.
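
The keyword-query ranking and its evaluation can be sketched as follows. The abstract does not state how average precision was computed, so the definition below (the mean of the precision values at each rank where a relevant gene appears, divided by the number of relevant genes) is an assumption based on common information-retrieval usage. The gene names, vectors, and both function names are hypothetical placeholders, not data from the study.

```python
import numpy as np

def rank_genes(gene_vectors, query_vector, names):
    """Order gene names by cosine similarity to a query vector (highest first).

    Assumes every gene vector and the query vector are nonzero.
    """
    G = np.asarray(gene_vectors, dtype=float)
    q = np.asarray(query_vector, dtype=float)
    scores = (G @ q) / (np.linalg.norm(G, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)
    return [names[i] for i in order]

def average_precision(ranked_genes, relevant_genes):
    """Average of precision@rank taken at each relevant gene in the ranking."""
    relevant = set(relevant_genes)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, gene in enumerate(ranked_genes, start=1):
        if gene in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

# Hypothetical genes already projected into a 2-dimensional subspace.
names = ["geneA", "geneB", "geneC", "geneD"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
ranking = rank_genes(vectors, [1.0, 0.0], names)
ap = average_precision(ranking, {"geneA", "geneD"})
```

Here the relevant set plays the role of the genes annotated with a given GO term, and the ranking plays the role of the SGO query result for that term used as a keyword query.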
In summary, we have demonstrated that LSI-based methods provide a flexible
data-mining tool to functionally classify genes based on the biomedical
literature and provide a complementary tool to GO-based classification
methods. Moreover, we demonstrate here that LSI-based methods provide a
unique tool to identify potential regulatory elements associated with gene
expression data sets.
|