Saturday, July 08, 2006

I was first introduced to the cophylogeny problem by Rod Page at the University of Glasgow in about 1995, and I've been interested in it ever since. I wrote my "Jungle" solution into TreeMap, which was, I think, an improvement, but basically cophylogeny mapping is horribly difficult and computationally complex, so it takes forever. The statistical analysis is also horrible. More on that later, perhaps.
Naively, I thought at the time of devising the jungle solution that most people testing cophylogeny would be looking at species that were, well, pretty well connected, like the initial gopher-louse study, which was basically so good it's spoiled me. It turns out that most people seem to want to do cophylogeny mapping between huge trees which appear to have no relation to each other, so they may as well be random. This is a total pain to map.
The cophylogeny problem is basically an attempt to infer ancient relationships between taxa that presently have an ecological link. The traditional domains of the problem are host and parasite or pathogen systems, the gene tree / species tree problem, and biogeography. Essentially, in all of these there's an independent phylogeny (call it H) and a dependent phylogeny (P), and some known associations between the extant species of each: this parasite species infects that host species, etc. The task is to determine which (ancestral) host lineages the ancestral parasites were found on.
Simple enough in principle, but it gets ugly, as there are four kinds of coevolutionary event that are recoverable from this kind of problem statement (Fredrik Ronquist worked this out but I forget which paper), and no way to tell how much each should "cost." Cophylogeny mapping works by positing associations of the internal nodes in P to locations in H, which may be nodes or branches, and reading off the evolutionary processes that must have gone on to produce the observed relationships between the trees. Mapping is the most intuitive and, I think, the best method, but it's terrifically slow and I need to rewrite the internals of TreeMap (preferably as a command-line application) to make it work a load faster.
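To make the idea of "positing associations of the internal nodes in P to locations in H" a bit more concrete, here is a minimal Python sketch of about the simplest mapping imaginable: each ancestral parasite is placed at the most recent common ancestor of the hosts its descendants were placed on. This is emphatically not the Jungle algorithm and not how TreeMap works; it only maps to nodes (not branches), it cannot represent a parasite jumping between unrelated host lineages, and it assumes each parasite tip has exactly one host. The tree encoding, the toy data and all the function names are made up for illustration.

```python
# Host tree H as a child -> parent map (the root maps to None).
# Parasite tree P as a parent -> list-of-children map.
# All names and data here are hypothetical toy values.
H_PARENT = {"h1": "hA", "h2": "hA", "hA": "hRoot", "h3": "hRoot", "hRoot": None}
P_CHILDREN = {"pRoot": ["pA", "p3"], "pA": ["p1", "p2"]}

# Known associations between extant species: parasite tip -> host tip.
ASSOC = {"p1": "h1", "p2": "h2", "p3": "h3"}


def ancestors(node, parent):
    """Return the path from node up to the root, starting with node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def lca(a, b, parent):
    """Most recent common ancestor of a and b in the tree given by parent."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def map_parasites(p_node, p_children, assoc, h_parent, mapping=None):
    """Post-order walk of P, assigning each P node a node in H."""
    if mapping is None:
        mapping = {}
    if p_node not in p_children:
        # A parasite tip sits on its known extant host.
        mapping[p_node] = assoc[p_node]
        return mapping
    for child in p_children[p_node]:
        map_parasites(child, p_children, assoc, h_parent, mapping)
    # An internal parasite node goes on the common ancestor of its
    # children's locations (simplifying assumption, see text above).
    locations = [mapping[child] for child in p_children[p_node]]
    location = locations[0]
    for other in locations[1:]:
        location = lca(location, other, h_parent)
    mapping[p_node] = location
    return mapping


mapping = map_parasites("pRoot", P_CHILDREN, ASSOC, H_PARENT)
for parasite in sorted(mapping):
    print(parasite, "->", mapping[parasite])
```

With these toy trees the two phylogenies match perfectly, so every ancestral parasite lands on the corresponding ancestral host. Real data is rarely this kind, which is where the different event types and their costs come in.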
I'll post thoughts and updates here if I can, and hope to talk about what I'm doing in terms of analysis and such too.