Literature Based Functional Analysis of Microarray Data

More flexible methods are needed to functionally classify and facilitate discovery of new gene relationships from genomic data. We previously developed an automated method called Semantic Gene Organizer© (SGO), which utilizes Latent Semantic Indexing (LSI) of titles and abstracts in MEDLINE citations to extract gene-to-gene and gene-to-keyword relationships (Homayouni et al., 2005).

In the current study, we explored the utility of SGO for analysis of microarray data. Using the Affymetrix U74A GeneChip platform, we previously identified 116 IFN-β stimulated genes (ISGs) in mouse embryonic fibroblasts (MEFs) (Pfeffer et al., 2004). For each gene, we constructed a gene-abstract document containing all titles and abstracts for MEDLINE citations cross-referenced in Entrez Gene. A gene-by-document (sparse) matrix was then constructed whereby each weighted nonzero element defines the importance of a document (column) as a referent for the corresponding gene (row). A truncated singular value decomposition (SVD) was used to factor the sparse matrix into lower-rank (orthogonal) matrix factors. The column vectors of these factors are used to represent both genes and gene-documents in the same lower-dimensional subspace for query matching and ranking purposes. Relationships between genes were determined by the cosine of the vector angle between gene documents. A self-similarity matrix was then constructed and used to generate a hierarchical tree using the Fitch-Margoliash least squares method available via PHYLIP (University of Washington). The clustering results were evaluated manually and annotated using Gene Ontology (GO) functional classifications. We found several prominent clusters of conceptually related genes that corresponded to MHC class I receptor activity (P=2.18E-14) and GTPase activity (P=1.58E-09). This result indicated that SGO clustering was consistent with GO classifications.
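
The factorization and similarity steps described above can be illustrated with a minimal sketch. It is not the SGO implementation: it assumes a small dense toy matrix in place of the sparse weighted matrix, uses NumPy's full SVD followed by truncation, and scales the retained coordinates by the singular values, which is one common convention that the abstract does not specify. The function name `gene_similarity`, the toy weights, and the gene labels are all hypothetical.

```python
import numpy as np

def gene_similarity(matrix, rank):
    """Cosine similarity between rows (genes) after projection onto
    the top `rank` singular factors of `matrix`."""
    # Thin SVD: matrix = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Keep only the leading `rank` factors (the truncation step).
    # Scaling by the singular values is an assumed convention.
    coords = U[:, :rank] * s[:rank]
    # Normalize each gene vector so that dot products become cosines.
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = coords / norms
    return unit @ unit.T     # gene-by-gene self-similarity matrix

# Hypothetical weights: rows are genes, columns are documents.
genes = ["geneA", "geneB", "geneC"]
weights = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
])

sim = gene_similarity(weights, rank=2)
```

In this toy case geneA and geneB share documents and end up with a cosine near 1, while geneC shares none with them and ends up near 0. A self-similarity matrix of this kind could then be converted to distances and passed to a tree-building program.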

A unique feature of SGO is its ability to rank genes based on their conceptual relationship to any user-defined keyword query. We evaluated SGO precision by performing keyword queries using GO index terms. An SGO query with “MHC class I receptor activity” or “GTPase activity” produced an average precision of 74% and 73%, respectively. We used the keyword query feature of SGO to: (1) Expand GO classifications. For instance, in addition to identifying MHC class I receptor genes, SGO ranked genes that are functionally associated with the processing of class I MHC peptides, such as Psme2 and Psmb9; (2) Identify genes that are associated with terms not indexed in GO. For instance, we queried the 116 ISG set for genes conceptually associated with biological functions of interferon such as antiviral or anti-proliferative activity; and (3) Identify candidate gene regulatory components associated with the microarray data. Importantly, this method allows identification of transcription factors in the network that may be activated by expression-independent mechanisms such as nuclear transport, phosphorylation, or protein-protein interaction events.
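
For readers unfamiliar with the metric, the following sketch shows one standard way to compute average precision for a ranked gene list. The abstract does not state which variant was used, so the definition here (mean of the precision values at each rank where a relevant gene appears, divided by the number of relevant genes) is an assumption, and the function name and gene identifiers are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of relevant items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, gene in enumerate(ranked, start=1):
        if gene in relevant:
            hits += 1
            # Precision at this rank: relevant genes seen so far / rank.
            total += hits / position
    # Relevant genes that never appear in the ranking contribute zero.
    return total / len(relevant)

# Hypothetical query result and hypothetical GO-annotated gene set.
ranked_genes = ["g1", "g2", "g3", "g4"]
go_annotated = {"g1", "g3"}

ap = average_precision(ranked_genes, go_annotated)
```

Here the relevant genes sit at ranks 1 and 3, giving precisions of 1 and 2/3 and an average precision of 5/6.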

In summary, we have demonstrated that LSI-based methods provide a flexible data-mining tool to functionally classify genes based on the biomedical literature and provide a complementary tool to GO-based classification methods. Moreover, we demonstrate here that LSI-based methods provide a unique tool to identify potential regulatory elements associated with gene expression data sets.

-------------------------------------------------------------------------------

Lai Wei1,2, Kevin Heinrich3, Lijing Xu1, Michael Berry3,
Lawrence M. Pfeffer2, and Ramin Homayouni1

Departments of 1Neurology, and of 2Pathology and Laboratory Medicine,
University of Tennessee Health Science Center, Memphis, TN 38103; Department of 3Computer Science, University of Tennessee, Knoxville, TN 37996