My project started off with the idea of making the species (host) and gene (guest) tree mappings used by evolutionary biologists more interoperable between software platforms. At first, we thought we might kick off a new standard, but it turned out that we would instead build a template on an existing set of standards (primarily NeXML).
In comparative biology, software seems to follow the precedent of its predecessors when it comes to data formats. Though XML has been the de facto standard for data marshaling (serialization into a standard format) for many years, it’s only in the past couple of years that XML has made it into phylogenetics research, in the form of NeXML and PhyloXML.
When I started the project, the first decision I had to deal with was which XML standard to use. Having next to no knowledge of either evolutionary biology or the new XML standards, I had to lean heavily on my fabulous mentor, James Estill, to fill in the gaps and forge ahead with that decision. Jamie had chosen NeXML for its flat structure, whereas PhyloXML has a more traditional nested structure that didn’t seem to lend itself to straightforward processing and readability. NeXML also supports referencing an ontology via namespaces, and has robust Perl library support in Bio::Phylo and BioPerl. The authors of both XML standards are involved in working groups at NESCent, where we all work, so we’d have great access to support either way.
Jamie set up an initial wiki page for our new NeXML template, and after some discussions with NeXML author and Bio::Phylo lead Rutger Vos, we settled on a final template form. This form puts reconciliation info about trees in the top-level Trees element, and the mappings of guest nodes onto host nodes or edges are attached to the guest nodes. All information is stored in meta tags, which Bio::Phylo can extract handily. That takes care of the XML side of things, and the small amount of code needed for parsing seems to back up the argument that a flat mapping is simpler to handle than a nested one. Parsing is done using Bio::Phylo, which has stronger NeXML support than plain BioPerl.
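To make the flat layout concrete, here is a rough sketch of how such a document could be read. It is not the project's code: it uses Python's standard library rather than Bio::Phylo, the XML fragment is heavily simplified, and the `rec:host_tree` and `rec:host_node` property names are invented for illustration rather than taken from the actual template.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified NeXML-like fragment. The "rec:" property
# names are assumptions made for this sketch, not the real template.
SAMPLE = """\
<nex:nexml xmlns:nex="http://www.nexml.org/2009" xmlns:rec="http://example.org/rec#">
  <trees id="guest_trees">
    <meta property="rec:host_tree" content="species_tree_1"/>
    <tree id="gene_tree_1">
      <node id="g1"><meta property="rec:host_node" content="s1"/></node>
      <node id="g2"><meta property="rec:host_node" content="s2"/></node>
      <node id="g3"/>
      <edge id="e1" source="g1" target="g2"/>
    </tree>
  </trees>
</nex:nexml>
"""


def read_mappings(xml_text):
    """Return (reconciliation metadata, guest-node -> host-node mapping)."""
    root = ET.fromstring(xml_text)
    trees = root.find("trees")
    # Reconciliation-level annotations sit on the top-level <trees> element.
    tree_meta = {m.get("property"): m.get("content")
                 for m in trees.findall("meta")}
    # Guest-to-host mappings are attached to the individual guest nodes.
    mapping = {}
    for node in trees.iter("node"):
        for m in node.findall("meta"):
            if m.get("property") == "rec:host_node":
                mapping[node.get("id")] = m.get("content")
    return tree_meta, mapping


tree_meta, mapping = read_mappings(SAMPLE)
```

Because every annotation is a meta tag directly under the element it describes, the reader needs only one pass over the nodes and no recursion, which is the practical appeal of the flat form.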
With the XML side mostly wrapped up, the question of legacy formats comes next. Before XML came around, NHX (and its variants) was a popular way to serialize richly annotated trees, for lack of better options. Variations like PRIME were created to support extra attributes (like duplication events) that weren’t supported by the plain Newick (New Hampshire) format. Two of the three software packages (PrimeGSR/TV and TreeBest) use this NHX variant, so for the foreseeable future it’s important to support input and output for NHX variants. The best way to do this is to build on the existing NHX support in BioPerl. Jamie had done some processing on PRIME, but that code isn’t easy or desirable to maintain, so a more standard approach would be preferred for ongoing support.
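For readers who haven’t seen NHX, annotations ride along in `[&&NHX:key=value:...]` comments after each node. The sketch below pulls them out with a regular expression. It is only an illustration in Python, not the BioPerl-based approach described above: the sample tree and gene names are made up, only the standard `S` (species) and `D` (duplication) tags are shown, and unnamed nodes or values containing colons are not handled.

```python
import re

# Matches "name", an optional ":branch_length", then an NHX comment block.
NHX_TAG = re.compile(r"([^\s(),:;\[\]]+)(?::[0-9.eE+-]+)?\[&&NHX:([^\]]*)\]")


def nhx_annotations(newick):
    """Map each named node to a dict of its NHX key/value annotations."""
    result = {}
    for name, body in NHX_TAG.findall(newick):
        # Fields inside the comment are colon-separated key=value pairs.
        result[name] = dict(
            field.split("=", 1) for field in body.split(":") if "=" in field
        )
    return result


# Made-up example tree; S = species, D = duplication flag.
SAMPLE = (
    "((geneA:0.1[&&NHX:S=rice],geneB:0.2[&&NHX:S=maize])"
    "n1:0.05[&&NHX:D=Y:S=grass],geneC:0.3[&&NHX:S=grape])root[&&NHX:D=N];"
)

annotations = nhx_annotations(SAMPLE)
```

A regex like this ignores the tree topology entirely, which is one reason a real parser is the better foundation for ongoing support.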
With XML and NHX IO taken care of, the next issue is working with the existing iPlant database, which has been populated via Jamie’s PRIME parsing code and TreeBest PRIME output with 2500 gene family mappings to 6 plant species. I have the original 2500 PRIME output files and access to the database, so there is enough data to compare a new parsing method to the original method employed for the initial import.
If the module were to be used only for direct IO, then the internal object wouldn’t matter very much. The tree could be a generic, no-nonsense linked-list structure, or it could be an augmented BioPerl or Bio::Phylo::Forest::Tree object, with all the accoutrements. However, there is a final concern, which is the use of the new Perl object in the iPlant infrastructure. The only use case of this kind so far is the tree viewer, a Java web application which renders the trees and uses the iPlant TR database to decorate them to indicate relationships and mappings. I am planning to speak with one of the Java developers today to get a better sense of how, if at all, my code might be helpful this summer. It is possible that the existing JSON support in the TR database backend written by Jamie would better support the Java viewer in the short term.
Ahead of my call today with Todd Vision and Jamie, some important questions are:
- Will the new parser be used strictly for IO?
- What are the use cases and requirements for anything beyond simple IO?
- Given those, what specific architecture requirements would there be for the Perl object (based on viewer requirements or other potential use cases)?
My intuition is that simple IO is the primary objective for this summer, but I am keeping an open mind. One way the Java viewer might benefit from the new code is by being able to load reconciliations directly from NHX or NeXML files, rather than requiring a fully functional TR backend and database (Naim mentioned this).
Use cases for simple IO are:
- Load NeXML- and NHX-based reconciliations into the database
- Output NeXML reconciliations from the database
There are a number of things that the Perl parser module might supply to a user, if the use cases supported the need:
- Individual reconciliations and trees requested by ID/species/gene name
- Cursor-based retrieval of records from a large set of mappings
- Deep hierarchical access to the trees and mappings (as in standard complex tree objects)
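As a rough illustration of the cursor-based idea, the sketch below streams mapping records in batches instead of loading them all at once. It uses Python’s built-in `sqlite3` with an in-memory database purely as a stand-in; the `mapping` table and its `guest_node` and `host_node` columns are hypothetical and do not reflect the iPlant TR schema.

```python
import sqlite3


def iter_mappings(conn, batch_size=100):
    """Yield (guest_node, host_node) rows, fetching batch_size rows at a time."""
    cur = conn.cursor()
    cur.execute("SELECT guest_node, host_node FROM mapping ORDER BY guest_node")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


def demo_connection():
    """Build a tiny in-memory database with a hypothetical mapping table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mapping (guest_node TEXT, host_node TEXT)")
    conn.executemany(
        "INSERT INTO mapping VALUES (?, ?)",
        [("g1", "s1"), ("g2", "s2"), ("g3", "s1")],
    )
    return conn


records = list(iter_mappings(demo_connection(), batch_size=2))
```

The caller sees a plain iterator, so the batch size can be tuned for a large set of mappings without changing any calling code.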
The parser is in its early stages. So far I have looked at parsing NeXML and NHX, but I haven’t started on iPlant database support yet, because it is somewhat tied to the viewer problem: the existing database retrieval code was written for the viewer. My next objective, if simple IO is the priority, is to release a version that fully supports NeXML, then NHX, and then the iPlant DB, though not necessarily in that order. One demonstration of this code would be converting between TreeBest output and NeXML output.



