GSOC at 30,000 feet

My project started off with the idea of making the species (host) and gene (guest) tree mappings used by evolutionary biologists more interoperable between software platforms. At first, we thought we might kick off a new standardization effort, but it turned out we were just going to build a template on an existing set of standards (NeXML, primarily).

In comparative biology, software seems to follow the precedent of its predecessors when it comes to data formats. Though XML has been the de facto standard for data marshaling (serialization into a standard format) nearly everywhere for many years, it’s only in the past couple of years that XML has made it into phylogenetics research, in the form of NeXML and PhyloXML.

When I started the project, the first decision I had to deal with was which XML standard to use. Having near-nil knowledge of both evolutionary biology and the new XML standards, I had to lean heavily on my fabulous mentor, James Estill, to fill in the gaps and forge ahead with that decision. Jamie had chosen NeXML for its flat structure, whereas PhyloXML had a more traditional nested structure that didn’t seem to lend itself to straightforward processing and readability. NeXML also offers referencing an ontology via namespaces, and has robust Perl library support in Bio::Phylo and BioPerl. The authors of both XML standards are involved in working groups at NESCent, where we all work, so we’d have great access to support either way.

Jamie set up an initial wiki page for our new NeXML template, and after some discussions with standard author and Bio::Phylo lead Rutger Vos, we settled on a final template form. This form puts reconciliation info about trees in the top-level Trees element, and the mappings of guest nodes onto host nodes or edges are attached to the guest nodes. All information is carried in meta tags, which Bio::Phylo can extract handily. That takes care of the XML side of things, and the small amount of code needed for parsing seems to back up the argument that a flat mapping is simpler to handle than a nested one. Parsing is done using Bio::Phylo, which has stronger NeXML support than plain BioPerl.
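To make the flat layout concrete, here is a minimal sketch in Python (standard library only, not Bio::Phylo). It assumes a simplified, NeXML-like snippet: namespaces are left out, and the "rec:*" property names are made up for illustration rather than taken from our template. It only shows how annotations on the Trees element and on guest nodes can be read with very little code.

```python
import xml.etree.ElementTree as ET

# Simplified, NeXML-like snippet. Namespaces are omitted and the "rec:*"
# property names are hypothetical, not taken from the actual template.
SNIPPET = """
<nexml>
  <trees id="trees1">
    <meta property="rec:host_tree" content="species_tree_1"/>
    <tree id="gene_tree_1">
      <node id="g1">
        <meta property="rec:host_node" content="s1"/>
        <meta property="rec:event" content="speciation"/>
      </node>
      <node id="g2">
        <meta property="rec:host_node" content="s2"/>
        <meta property="rec:event" content="duplication"/>
      </node>
      <node id="g3"/>
    </tree>
  </trees>
</nexml>
"""


def read_meta(element):
    """Collect the direct <meta> children of an element as a dict."""
    return {m.get("property"): m.get("content") for m in element.findall("meta")}


def read_reconciliation(xml_text):
    """Return (trees-level annotations, {guest node id: annotations})."""
    root = ET.fromstring(xml_text)
    trees = root.find("trees")
    tree_info = read_meta(trees)
    node_maps = {}
    for node in trees.iter("node"):
        annotations = read_meta(node)
        if annotations:  # skip guest nodes that carry no mapping
            node_maps[node.get("id")] = annotations
    return tree_info, node_maps


tree_info, node_maps = read_reconciliation(SNIPPET)
```

Because every mapping hangs directly off the guest node it describes, the reader never has to walk a nested tree structure to recover it.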

With the XML side mostly wrapped up, the question of legacy formats comes next. NHX (and variants) was, before XML came around, a popular way to serialize richly annotated trees, for lack of better options. Variations like PRIME were created to support extra attributes (like duplication events) that weren’t supported by plain Newick (NHX). Two of the three software packages (PrimeGSR/TV and TreeBest) use this NHX variant, so for the foreseeable future it’s important to support input and output for NHX variants. The best way to do this is to build on existing NHX support in BioPerl. Jamie had done some processing on PRIME, but that code isn’t easy or desirable to maintain, so a more standard approach would be preferred for ongoing support.
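For readers who haven’t seen NHX, the annotations live in bracketed comments of the form [&&NHX:key=value:key=value] attached to nodes of a Newick string. The Python sketch below only pulls those tag blocks out with a regular expression. It does not parse the tree topology, and it makes no attempt to handle PRIME-specific extensions. The example tree and its species labels are invented.

```python
import re

# Matches one NHX comment block such as [&&NHX:D=Y:S=sp1].
NHX_BLOCK = re.compile(r"\[&&NHX:([^\]]*)\]")


def nhx_annotations(newick):
    """Return one dict of key/value tags per NHX block, in order of appearance."""
    result = []
    for body in NHX_BLOCK.findall(newick):
        tags = {}
        for field in body.split(":"):
            if "=" in field:
                key, value = field.split("=", 1)
                tags[key] = value
        result.append(tags)
    return result


# Invented example: two leaves and their parent, which is flagged as a duplication (D=Y).
EXAMPLE = "(a:0.1[&&NHX:S=sp1],b:0.2[&&NHX:S=sp2])c:0.05[&&NHX:D=Y:S=anc];"
annotations = nhx_annotations(EXAMPLE)
```

A real importer would keep each tag block tied to its node, which is exactly the part an existing NHX parser such as the one in BioPerl already handles.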

With XML and NHX IO taken care of, the next issue is working with the existing iPlant database, which has been populated via Jamie’s PRIME parsing code and TreeBest PRIME output with 2500 gene family mappings to 6 plant species. I have the original 2500 PRIME output files, and access to the database, so there is enough data to compare a new parsing method to the original method employed for the initial import.

If the module were to be used only for direct IO, then the internal object wouldn’t matter very much. The tree could be a generic, no-nonsense linked list structure, or it could be an augmented BioPerl or Bio::Phylo::Forest::Tree object, with all the accoutrements. However, there is a final concern, which is use of the new Perl object in the iPlant infrastructure. The only use case of this kind so far is the tree viewer, a Java web application which renders the trees and uses the iPlant TR database to decorate the trees to indicate relationships and mappings. I am planning to speak with one of the Java developers today to get a better sense of how, if at all, my code might be helpful this summer. It is possible that the existing JSON support in the TR database backend written by Jamie would better support the Java viewer in the short term.

Ahead of my call today with Todd Vision and Jamie, some important questions are:

  • Will the new parser be used strictly for IO?
  • What are the use cases and requirements for anything beyond simple IO?
  • Given those, what specific architecture requirements would there be for the Perl object (based on viewer requirements, or other potential use cases)?

My intuition is that simple IO is the primary objective for this summer, but I am keeping an open mind. One area where the Java viewer might benefit from the new code is to be able to load reconciliations directly from NHX or NeXML files, rather than requiring a fully functional TR backend and database (Naim mentioned this).

Use cases for simple IO are:

  • Load NeXML and NHX based reconciliations into the database
  • Output NeXML reconciliations from the database

There are a number of things that the Perl parser module might supply to a user, if use cases supported the need:

  • Individual reconciliations and trees requested by ID/species/gene name
  • Cursor based retrieval of records from a large set of mappings
  • Support deep hierarchical access to the trees and mappings (as in standard complex tree objects)
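As a rough illustration of the first two items, here is a Python sketch of what lookup by ID and cursor-style retrieval could look like. The class and method names are hypothetical, and the in-memory dictionary is only a stand-in for whatever backs the real module (files or the iPlant database).

```python
from itertools import islice


class ReconciliationSet:
    """Hypothetical in-memory stand-in for a large set of mappings."""

    def __init__(self, records):
        # records: iterable of dicts that each carry at least an "id" key
        self._by_id = {record["id"]: record for record in records}

    def get(self, rec_id):
        """Fetch one reconciliation by ID, or None if it is unknown."""
        return self._by_id.get(rec_id)

    def cursor(self, batch_size=100):
        """Yield records in fixed-size batches instead of all at once."""
        iterator = iter(self._by_id.values())
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                return
            yield batch


# Invented records, just to exercise the interface.
store = ReconciliationSet({"id": i, "gene": f"gene{i}"} for i in range(5))
batch_sizes = [len(batch) for batch in store.cursor(batch_size=2)]
```

The point of the cursor is that a caller working through thousands of gene families never has to hold all of them in memory at once.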

The parser is in its early stages. So far I have looked at parsing NeXML and NHX, but haven’t started on iPlant database support yet, since that is tied into the viewer problem: the existing database retrieval code was written for the viewer. My next objective, if simple IO is the priority, is to release a version that supports NeXML fully, and then NHX, and then iPlant DB, but not necessarily in that order. One way to demonstrate this code would be to go between TreeBest output and NeXML output, for example, as mentioned above.

 

Posted in Uncategorized | Tagged | Leave a comment

Representing gene tree reconciliation maps in XML

The first goal of my GSOC project is to define a viable XML representation of gene tree reconciliations so that I can then implement support for this new “standard” in both a standard library and the iPlant (umbrella organization) software that NESCent researchers have been developing for phyloinformatics. Today I presented my colleagues at NESCent with a proposal for an implementation based on NeXML. My mentor, James Estill, did the legwork in choosing between NeXML and PhyloXML, and in defining the mapping paradigm. My contribution has been to turn the XML outline into a working XML file that works with the BioPerl and Bio::Phylo NeXML parser, and to begin rallying feedback.

The basic idea of our proposal is to introduce an element into the NeXML schema that encapsulates mappings between species (host) tree edges and gene (guest) tree nodes. This allows for gene duplication and speciation events, and potentially horizontal gene transfer and other events as defined by some ontology. Mapped edges defined with the same node endpoint represent a speciation, while edges defined by two nodes represent a duplication. Using just a single node could indicate horizontal transfer.
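One way to read the rule above is as a small classifier over the host nodes that a guest node is mapped to. The Python sketch below is my own interpretation, not part of the proposal: it assumes a mapping is given as a tuple of one or two host node IDs, and the function name and return labels are invented for illustration.

```python
def classify_mapping(host_nodes):
    """Guess the event type from the host nodes a guest node is mapped to.

    host_nodes: tuple of one or two host (species) tree node IDs.
    """
    if len(host_nodes) == 1:
        # Only a single node given: could indicate horizontal transfer.
        return "possible horizontal transfer"
    if len(host_nodes) != 2:
        raise ValueError("expected one or two host node IDs")
    start, end = host_nodes
    if start == end:
        # Edge defined with the same node as both endpoints: speciation.
        return "speciation"
    # Edge defined by two different nodes: duplication.
    return "duplication"


events = [classify_mapping(m) for m in [("s1", "s1"), ("s1", "s2"), ("s3",)]]
```

In the real template the event type could instead be stated explicitly through an ontology term, in which case no inference like this would be needed.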

You can find the proposal here. Please post a comment on my blog or email me if you happen to be in evolutionary biology and have thoughts about GTR!

Since tree reconciliation is a phylogenetic technique that affects many researchers, tools, and platforms, our next goal is to obtain feedback from the NeXML maintainer, Rutger Vos, the PhyloXML maintainer, Chris Zmasek (both of whom have participated in the evoinfo working group at NESCent), my other GSOC mentors, and the larger research community at NESCent and beyond. Once we know we’re on the right path, have a plan to get developers on board, and have a functional open source proof of concept via the iPlant platform and BioPerl, we’ll be able to consider the project successful.

My focus this week is on figuring out BioPerl/Bio::Phylo tree functionality to identify which parser/library to focus on when implementing the new XML support. There’s also the question of how the schema will be updated, or whether NeXML is ultimately even the best technology to build on. Thankfully I have direct access to those who maintain phyloinformatics standards via NESCent.

Beyond the immediate XML implementation needs, I’m also, as time permits, looking at the iPlant tree reconciliation schema and the import scripts that work on data from the pipeline. I’ll be looking at primeGSR and TreeBeST, and getting a basic understanding of their reconciliation features. I’ve had to do a bit of molecular biology and genetics review in order to keep up, which is great, because I’m going to be neck-deep in this stuff once classes start, so I’ve got a great head start.


2011 NYC GSOC meetup at Google NYC HQ

This afternoon I met up with some fellow GSOC’ers and a couple of cool Googlers for a tour of the main Google NYC HQ. The building, with one of the largest footprints in Manhattan at 111 Eighth Avenue, is the former Port Authority building, and takes up a city block. The building is absolutely amazing, and features multiple themed cafeterias with unlimited free gourmet grub, scooters, a Lego gallery, a computer museum, a puzzle trail, an inter-floor ladder, and lots of fun fantasy-inspired details like doors that have doors in them. The computer museum had many of my old favorites including a NeXT station, an SGI Iris, a Symbolics Lisp machine, a C64 and many many more. Google seems to be expanding beyond HQ, which means they’ve either filled or plan to fill the building, I guess. Sounds like business is good! We got to check out the Google Docs work area, which involved lots of fun workspaces and themed “apartment” meeting rooms. Hopefully none of this is a trade secret… Thanks Google!


Google, NESCent, and my summer of code.

This summer will be my first time employed as a working scientist. I may be stretching the definition of “scientist”, since the software I’m working on is actually a support technology for people doing real research, but I hope to see what research is all about and check out evolutionary and synthetic biology this summer. Primarily, I’ll be working with NESCent, an evolutionary science research group, and will be sponsored by Google via Google Summer of Code. Specifically, I’ll be making it easier for evolutionary biologists to manage the data generated from mapping species and gene trees. Cool huh? It’s quite an honor to be chosen for a GSOC project, and I can think of no better way to launch myself towards science! My mentors, James Estill and Dr. Todd Vision, are very sharp, friendly, and supportive, and I’m excited about their research, though I don’t yet know much about it.

Let me share some background on how I came to do what I’m doing. A couple of years ago, when I felt really burnt out on corporate IT and web work, I found myself looking for a new direction. It didn’t take me long to realize that I wanted to work on the intersection of biology and technology. It occurred to me that bioinformatics was a merging of my current experience, and indeed an expansion of my computer science skills, along with some serious scientific training. It took me a cross-country adventure and some soul searching, and now I’m a CS/bioinformatics student at Hunter College, and I’m loving it. I was looking for some kind of computational biology gig, and Dr. Qiu, who runs the bioinformatics lab at Hunter, helped me submit a last-minute application for this GSOC project. Dr. Qiu is absolutely fantastic, and I’m looking forward to helping him with his BioPerl-related work in the Fall, assuming he can use a hand.

While GSOC is my primary project, I’m also part of the NYC iGEM software team, working part time as an advisor, basically in my personal free time. The team is led by Genspace co-founder Russell Durrett, and he’s put together a great team. The project involves some serious synthetic biology, comparing over a dozen strains of radiation-resistant D. radiodurans bacteria and trying to transplant that durability into other microorganisms. The other goal is to build a robust asset database for biological samples and reagents. The project is hosted at the Mason Lab at the Weill Cornell Institute for Computational Biomedicine, located just a few minutes from Hunter campus.

With just two last minute assignments to finish before I’m free to work on my summer projects, I feel like I can see the finish line. We’ll have weekly status reports, and with my first one due on Tuesday, I’m hoping to get a little bit of a jump start over this holiday weekend.

It’s going to be a really, really educational summer, and I’m thrilled to be more involved with science. Here’s wishing all 2011 GSOC students a fruitful and exciting summer of code!


Look ma, no calculator! PEMDAS osx dashboard widget.

Update on 4/28: Mike from Donkey Engineering got back to me about a result copy shortcut: “Apple basically only allows you to put stuff on the clipboard when command+C is pressed, but I have a few ideas on how to work around that so it fits into your workflow.” Looking forward to the update, Mike!

My favorite workflow when I’m doing lots of chemistry problems is to use Better Touch Tool so that a three finger tap brings up the OSX dashboard, and within dashboard, I use PEMDAS widget to do fancy equations. No physical calculator needed! Also on my dashboard is a periodic table in a web clip widget so those lovely masses and valence electrons are at my fingertips. A widget that displays an image file would be better, since the web clip widget depends on internet access, but that’s usually not a problem.

The only thing that would make doing calculation-heavy work without an actual physical calculator (like my beloved but chunky TI-89) even faster would be a hotkey in PEMDAS for copying the result. I filled out the support form on the Donkey Engineering web site asking about copying the result, and Mike responded with a fixed version supporting copy and paste! You have to click the result to copy it, so I still hope they implement a second hotkey just for copying the result, but it’s a leap forward in efficiency. Big thumbs up to Donkey Engineering for responding to my support request so quickly! I strongly suggest you check out PEMDAS in widget, app, or iPhone form.

There are some app launchers like Quicksilver that support calculator functions, but if you need a more robust, fully scientific calculator, the PEMDAS widget is really the best option for fast workflow if you’re doing dozens of calculation-heavy problems with multiple steps. When I don’t need a lot of scientific functions, sometimes I use Perl one-liners. If you think of something even faster, let me know. I wonder how other people churn through problems on other platforms/apps. Am I crazy for avoiding my lonely TI-89?

  • The Native Inhabitant:

    Web and IT pro turned novice scientist. Currently studying computer science and bioinformatics at Hunter College.

    Here be: dragons, bio + engineering + medicine + ethics, vegan eats and fashion, music and words, gadgets and software, photography, design, DIY/maker/hacker culture, NYC, running/fitness, cyborg anthropology, et cetera.

    dp at danielpacker dot org