Phylogenetic Networks with Constrained and Unconstrained Recombination

A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not tree-like. With the growth of genomic data, much of which does not fit ideal tree models, and the increasing appreciation of the genomic role of such phenomena asrecombination, recurrent and back mutation, horizontal gene transfer, gene conversion, and mobile genetic elements, there is greater need to understand the algorithmics and combinatorics of phylogenetic networks. 

Wang et al. studied the problem of constructing a phylogenetic network for a set of n binary sequences derived from a known ancestral sequence, when each site in the sequence can change state at most once in the network, and recombination between sequences is allowed. They showed that the problem of finding a phylogenetic network that minimizes the number of recombination events is NP-hard, but gave a polynomial-time algorithm (O(nm + n^4)-time, for n sequences of length m each) that was intended to determine whether the sequences could be derived on a phylogenetic network in which the recombination cycles are node disjoint. We call such a network a ``galled-tree". The paper by Wang et al. is seminal in defining this problem and asserting that it has an efficient solution. Unfortunately, the algorithm given there is incomplete and only constitutes a sufficient, but not a necessary, test for the existence of a galled-tree for the data.

In this talk we do several things. By more deeply analyzing the combinatorial constraints on galled-trees, we obtain an algorithm that runs in O(nm + n^3)-time and is guaranteed to be both a necessary and sufficient test for the existence of a galled-tree for the data. We show how to relax the assumption that we know the ancestral sequence. We show that when there is a galled-tree, the algorithm constructs a ''reduced" galled-tree that minimizes the number of mutations occurring on recombination cycles. We prove that the algorithm produces a reduced galled-tree that minmizes the number of recombinations needed for the data, over all possible
phylogenetic networks for the data, even if multiple crossovers, or different ancestral sequences are allowed. 
Hence, when a galled-tree exists, the problem of minimizing the number of recombinations can be solved
efficiently. The effect is that the galled-tree is a phylogenetic network that explains the input sequences with the ``littlest deviation" from a true tree model. We show that the reduced galled-tree is ``nearly-unique", but when it is not unique, the algorithm also obtains a count of the number of galled-trees that exist for the input data, and can create these in linear time for each one, starting from the canonical galled-tree. 

Finally, we consider phylogenetic networks where the recombination networks are not constrained in any way. We discuss new, efficiently computed lower bounds on the number of recombination events needed.

Joint work with Satish Eddhu, Chuck Langley, and Dean Hickerson, Yun Song, Yufeng Wu


---------------------------------------------------------------------------------------------------------------

Dan Gusfield, Professor 
Department of Computer Science
University of California, Davis
Davis, California 95616
gusfield@cs.ucdavis.edu
http://www.cs.ucdavis.edu/people/faculty/gusfield.html

 

Dan Gusfield - Biosketch, January 2005

My background is in Combinatorial Optimization, and various applications of Combinatorial Optimization. I received my Ph.D. in 1980 from UC Berkeley, working with Richard Karp, and was an Assistant Professor at Yale University.  I moved to UC Davis in July 1986. In the last 15 years I have mostly addressed problems in Computational Biology and Bioinformatics. I first addressed questions about building evolutionary trees, and then problems in molecular sequence analysis. I presently focus mostly on problems of population-scale genomics, of which haplotype inference is one example. My main funding for
computational biology and bioinformatics came initially from the Department of Energy Human Genome Project through the Lawrence Berkeley Labs Human Genome Center, then directly from DOE, but more recently has been funded by the NSF. My book, ``Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology" has helped define the intersection of computer science and bioinformatics. I serve on the editorial board of the Journal of Computational Biology, and wrote the scientific section of the proposal to create the IEEE/ACM Transaction on Computational Biology and Bioinformatics. I have served on many NSF and DOE research proposal panels on computer science, bioinformatics and
computational biology, and on numerous conference program committees. I co-chaired the program committee for the 1994 conference on Combinatorial Pattern Matching; I co-organized the 1995 Dagstuhl Conference on Bioinformatics at the
Dagstuhl Center in Germany; I co-chaired the program committee of the 2002 Workshop on Algorithms for Bioinformatics held in Rome, and was the Program Chair of the 2004 RECOMB (Research on Computational Molecular Biology) conference. At UCD I was chair of the Computer Science Department for four years, and wrote the bioinformatics section (one of
three) of the Genomics/Bioinformatics initiative proposal that resulted in the creation of the UCD Genomics Center. I developed and continue to teach an undergraduate course on the Theory and Practice of Bioinformatics taught mostly to biology students, and a graduate course on Sequence Analysis and Computational Biology, taught to Computer Science students. I have over eighty technical publications and have given numerous invited lectures and tutorials, including keynote, plenary, special or distinguished lectures. In August 2003, the paper by myself, S. Eddhu and C. Langley received the best-paper award at the IEEE CSB conference on bioinformatics held at Stanford. In October 2004 I was named as the
founding Editor-in-Chief of the IEEE/ACM Transactions on Computational Biology and Bioinformatics (www.computer.org/tcbb/).