Monday, October 06, 2008

Jeepers it's been a while since I was on this.

It's official: the reticulate cophylogeny problem is NP-complete. Phew, I was worried about that. Thanks for this go to Ran Libeskind-Hadas, who did most of the work.

I've been thinking about multi-host parasites. These are a pain.
How to deal with them? The same PhD student (Rob) tells me that host specificity is really mucky and varies hugely among systems. It's so awful we don't know how best to represent it, let alone solve it. I was hoping that we could just have "composite locations" on the host phylogeny, but unless we restrict exactly which host lineages they can and can't include, there's the potential to explode the problem space even more, by something like O(4^n) for n taxa on the host tree, which is yucky.

I've also been rewriting TreeMap in Java. It does stuff already, but I haven't got it doing actual maps yet, only jungles. TM3 will be awesome. It already has useful things like "untangle": when you have lots of edge crossings in your tanglegram and you can't untangle them by hand, press the button and hey presto :-) It's just a heuristic, but it's cool.
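
For the curious, here's roughly what "untangling" means. This is a toy sketch in Python, not TreeMap's actual heuristic (that one is in Java and isn't shown here): it assumes trees written as nested pairs of tip names, keeps the host order fixed, and greedily flips the children of parasite nodes whenever a flip reduces the number of crossing association lines. All the function names are made up for illustration.

```python
from itertools import combinations


def leaves(tree):
    """Left-to-right tip order of a tree written as nested 2-tuples."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def count_crossings(host_order, parasite_order, links):
    """Count pairs of association lines that cross in the tanglegram."""
    h_pos = {name: i for i, name in enumerate(host_order)}
    p_pos = {name: i for i, name in enumerate(parasite_order)}
    crossings = 0
    for (h1, p1), (h2, p2) in combinations(links, 2):
        # Two lines cross when their endpoints are in opposite order.
        if (h_pos[h1] - h_pos[h2]) * (p_pos[p1] - p_pos[p2]) < 0:
            crossings += 1
    return crossings


def internal_paths(tree, path=()):
    """Paths (tuples of 0/1 child indices) to every internal node."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return ([path]
            + internal_paths(left, path + (0,))
            + internal_paths(right, path + (1,)))


def flip_at(tree, path):
    """Return a copy of the tree with the children swapped at one node."""
    left, right = tree
    if not path:
        return (right, left)
    if path[0] == 0:
        return (flip_at(left, path[1:]), right)
    return (left, flip_at(right, path[1:]))


def untangle(parasite_tree, host_order, links):
    """Greedy hill-climb: accept any single flip that lowers the count."""
    best = count_crossings(host_order, leaves(parasite_tree), links)
    while True:
        for path in internal_paths(parasite_tree):
            candidate = flip_at(parasite_tree, path)
            score = count_crossings(host_order, leaves(candidate), links)
            if score < best:
                parasite_tree, best = candidate, score
                break  # paths may have changed, so start over
        else:
            return parasite_tree, best  # no flip helps: local optimum


host_order = ["A", "B", "C"]
links = [("A", "a"), ("B", "b"), ("C", "c")]
tidy_tree, remaining = untangle(("a", ("c", "b")), host_order, links)
```

Being greedy, it can get stuck in a local optimum and leave crossings that a cleverer search would remove, which is about what you'd expect from a heuristic.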


Saturday, July 08, 2006

I was first introduced to the cophylogeny problem by Rod Page at the University of Glasgow in about 1995, and I've been interested in it ever since. I wrote my "Jungle" solution into TreeMap, which was, I think, an improvement, but basically cophylogeny mapping is horribly difficult and computationally complex, so it takes forever. The statistical analysis is also horrible. More on that later, perhaps.
Naively, I thought at the time of devising the jungle solution that most people testing cophylogeny would be looking at species that were, well, pretty well connected, like the initial gopher-louse study, which was basically so good it's spoiled me. It turns out that most people seem to want to do cophylogeny mapping between huge trees which appear to have no relation to each other, so they may as well be random. This is a total pain to map.
The cophylogeny problem is basically an attempt to infer ancient relationships between taxa that presently have an ecological link. The traditional domains of the problem are host and parasite or pathogen systems, the gene tree / species tree problem, and biogeography. Essentially, in all of these there's an independent phylogeny (call it H) and a dependent phylogeny (P), and some known associations between the extant species of each: this parasite species infects that host species, etc. The task is to determine which (ancestral) host lineages the ancestral parasites were found on. Simple enough in principle, but it gets ugly, as there are four kinds of coevolutionary event that are recoverable from this kind of problem statement (Fredrik Ronquist worked this out but I forget which paper), and no way to tell how much each should "cost."
Cophylogeny mapping works by positing associations of the internal nodes in P to locations in H, which may be nodes or branches, and reading off the evolutionary processes that must have gone on to produce the observed relationships between the trees. Mapping is the most intuitive and, I think, the best method, but it's terrifically slow and I need to rewrite the internals of TreeMap (preferably as a command-line application) to work a load faster.
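
To make "positing associations of internal nodes in P to locations in H" a bit more concrete, here's a toy sketch in Python. It is emphatically not the jungle algorithm: it's the simplest placement I can think of, putting each ancestral parasite on the most recent common ancestor of wherever its two descendants sit, which is how the gene tree / species tree version is usually done. It assumes one host per parasite tip, only places parasites on host nodes (not branches), and only tells codivergence from duplication, ignoring host switches and losses entirely. The tree encoding and all names are made up for the example.

```python
def ancestors(node, parent):
    """The node itself followed by all its ancestors up to the root."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def lca(a, b, parent):
    """Most recent common ancestor of two host nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def map_parasites(p_tree, tip_host, host_parent, events):
    """Place a parasite (sub)tree on the host tree, bottom-up.

    p_tree is a tip name or a pair of subtrees; returns the host node the
    subtree's root is placed on and appends (host_node, event_kind) to
    events for every internal parasite node.
    """
    if isinstance(p_tree, str):
        return tip_host[p_tree]  # assumption: exactly one host per tip
    left, right = p_tree
    loc_left = map_parasites(left, tip_host, host_parent, events)
    loc_right = map_parasites(right, tip_host, host_parent, events)
    loc = lca(loc_left, loc_right, host_parent)
    # If a child sits on the same host node as its parent, the parasite
    # must have split without the host splitting: call it a duplication.
    kind = "duplication" if loc in (loc_left, loc_right) else "codivergence"
    events.append((loc, kind))
    return loc


# Host tree ((A,B)X,C)R, written as a child -> parent table.
host_parent = {"A": "X", "B": "X", "X": "R", "C": "R"}
tip_host = {"a": "A", "b": "B", "c": "C"}

congruent = []
map_parasites((("a", "b"), "c"), tip_host, host_parent, congruent)

incongruent = []
map_parasites((("a", "c"), "b"), tip_host, host_parent, incongruent)
```

When the parasite tree mirrors the host tree you get two codivergences; swap b and c in the parasite tree and the root becomes a duplication sitting on the host root. The real problem is harder because a host switch might explain the same mismatch more cheaply, and that's exactly where the event costs come in.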
I'll post thoughts and updates here if I can, and hope to talk about what I'm doing in terms of analysis and such too.