Genomic data is rather beautiful in that the stirrings and mysteries of life can be glimpsed within it. I’ve been dabbling in making functional genomic art — mostly with spiral representations of bacterial DNA. Here’s a rendering of Borrelia burgdorferi plasmid cp26. We’re studying Borrelia in the lab.
My first experiments (here and here) are simplistic, but pleasant, I think. My initial goal was to squeeze a nice big chunk of DNA into a small space. To that end, I more or less succeeded with the spiral approach. It turns out (unsurprisingly) that the circular visualization of genomic data isn’t new, though I don’t know if anyone has used my particular approach. Visualization is an exceedingly fun thing to poke at and learn, and I look forward to coming up with more ways of making functional bio-art.
On the practical side of things, Dr. Qiu and my colleagues at the evolutionary bioinformatics lab have been enlightening me about the professional needs of biologists in the realm of visualization. Some of the things I’ve been advised to look into are: six-frame translation, SNPs, GC base percentages, amino acid frequency, synteny (gene order), genetic drift, real-time simulation visualization, and phylogeny. That’s quite a laundry list. One must be careful when asking for ideas and suggestions in the lab.
Six-frame translation refers to the six ways to read a DNA sequence: three on the forward strand and three on the reverse (complementary) strand. There are three in each direction because DNA is read in codons of three bases each, so a reading can begin at the first, second, or third base. It’s impossible to know which base a gene starts on without trying all three offsets in each direction. So, in each of the six reading frames, you look for long runs of codons that are not interrupted by a stop codon (an open reading frame, or ORF). Those could quite possibly be genes, and finding a gene is like striking gold.
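To make the six frames concrete, here is a minimal Python sketch of the idea. It is not the C++ code behind the renderings, and it makes simplifying assumptions: an ORF is taken to be any stop-free run of codons (no start codon is required), the stop codons are TAA, TAG, and TGA, and reverse-frame coordinates are reported on the reverse-complement string rather than mapped back to the original. The function names and the `min_codons` parameter are made up for illustration. Differences in exactly these conventions are one possible reason two ORF tools can disagree.

```python
# Stop codons assumed for this sketch.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def open_stretches(seq, offset):
    """Yield (start, end) indices of stop-free codon runs in one frame.

    `offset` is 0, 1, or 2; `end` is exclusive, and stop codons themselves
    are not part of any run.
    """
    run_start = offset
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            if i > run_start:
                yield (run_start, i)
            run_start = i + 3
    # Close the final run at the last complete codon in this frame.
    end = offset + 3 * ((len(seq) - offset) // 3)
    if end > run_start:
        yield (run_start, end)


def six_frame_orfs(seq, min_codons=30):
    """Collect (frame, start, end) for long open stretches in all six frames.

    Frames are labelled +1..+3 (forward) and -1..-3 (reverse complement).
    """
    seq = seq.upper()
    strands = (("+", seq), ("-", reverse_complement(seq)))
    found = []
    for sign, strand in strands:
        for offset in range(3):
            for start, end in open_stretches(strand, offset):
                if (end - start) // 3 >= min_codons:
                    found.append((f"{sign}{offset + 1}", start, end))
    return found
```

Calling `six_frame_orfs(sequence, min_codons=100)` on a plasmid sequence string would return only the longer candidates; lowering `min_codons` trades fewer misses for more noise.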
Here’s my rendering of reading frame 1 on the same Borrelia plasmid. I’m nearly 100% sure that I got something wrong, because when I checked NCBI’s ORF Finder, I got different results. This is a demonstration of the concept for a tool for finding potential genes, like ORF Finder. The color of each segment is scaled exponentially with its size, and a segment gets a blue border once it reaches a maximum threshold. In other words, the red areas with a blue border are most likely to be genes. It’s basically a heat map for genes. Well, so goes the theory, in my limited understanding. I’m hoping to fix this up so that I get results consistent with NCBI for all reading frames.
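The coloring rule can be sketched in a few lines of Python. This is only an illustration of "exponential scaling plus a threshold border", not the actual rendering code: the function name, the default `threshold` of 300 bases, the `steepness` constant, and the black-to-red ramp are all assumptions chosen for the example.

```python
import math


def segment_style(length, threshold=300, steepness=4.0):
    """Map a segment length to an (r, g, b) color and a border flag.

    Heat grows exponentially from 0 at length 0 to 1 at `threshold`, so
    short segments stay dark and only long ones turn bright red.
    """
    t = min(length / threshold, 1.0)  # clamp to the 0..1 range
    heat = math.expm1(steepness * t) / math.expm1(steepness)
    red = round(255 * heat)
    has_border = length >= threshold  # border marks the likeliest genes
    return (red, 0, 0), has_border
```

With these made-up defaults, a segment half the threshold length comes out much darker than half-bright red, which is what makes the long segments stand out.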
I’m rendering these in 100% homegrown, convoluted OpenGL/C++ code. Ultimately, I suspect that any general-purpose genome visualization tool I might end up with will want to live on the web, perhaps utilizing WebGL. Even if this turns out to be a command-line tool, it could live on the web as an image generator. For the moment, I’m enjoying developing natively on my laptop as an idea scratchpad. Before I go implementing any more features, I’ll make sure the sequence data is coming through accurately so that the ORFs match NCBI. Then I’ll tackle six-frame alignments, etc. At that point, hopefully I’ll have something worth sharing!
Any thoughts on genome visualization? Suggestions? Please leave a comment.