Monday, May 23, 2016
Peacocks, DNA, and the Pancake Problems
by Jonathan Kujawa
(this is the sequel to last month's 3QD essay on the Pancake Problems)
I frequently come across a rafter of wild turkeys on bike rides through the countryside near my home. This particular group is recognizable thanks to having a peahen as an honorary member. Just this morning I was treated to a startling surprise: the peahen was busily herding a brood of chicks! I would have thought peacocks and turkeys were too distantly related to successfully breed. Apparently nobody told the peahen. I haven't seen any other peacocks in the neighborhood, so it would seem that she is more than friends with one of her turkey buddies. According to the internet, peacock/turkey hybrids (turcocks? peakeys?) are a thing which can happen.
Going by looks and their natural geographic ranges, my wrong guess was that peacocks and turkeys should be pretty distant on the tree of life. In the not-too-distant past, classification of species depended on such observational data.
Nowadays we can dig directly into the DNA to look for answers about relatedness. In the past decade it became possible to sequence the entire DNA of an organism. Not only that, but it's become fast and cheap. In fifteen years we've gone from the Human Genome Project taking thirteen years and $2.7 billion dollars to sequence the human genome to now being able to do it in days for $1,000. The progress in this field puts Moore's Law to shame.
It's one thing to have the data, it's another to put it to use. To deal with the flood of information pouring out of DNA sequencers an entirely new field called computational molecular biology has sprung up. It's a wonderful combination of biology, mathematics, and computer science.
A good example of this is turnips. Looking at them in the garden you might guess that they are more closely related to radishes than cabbage. In the 1980s Jeffrey Palmer and his collaborators looked at the mitochondrial genomes of turnips and cabbage and found that the genes they contained were nearly identical. What was different was the order of those genes . The random mutations which occurred over the years didn't change the genes themselves, only their position in the DNA.
Even better, Palmer and company saw that the kind of rearrangements which occur are of specific kind. When a mutation occurs, what happens is that a segment of DNA consisting of some number of genes is snipped out, flipped around, and put back in, now in reverse order. For example, if genes were the numbers one through five, a typical sequence of mutations might look like:
Here at each step the segment of genes to be snipped out and reversed is indicated with an underline. Because each mutation reverses the order of some of the genes, folks call it a reversal.
At some point in the distant past a primordial turbage had two offspring, one which eventually became a turnip and one a cabbage. In the present day we can sequence their DNA, compare them for differences, and count how many reversals are required to rearrange the genes of a cabbage into the order found in a turnip. If we happen to know that such a mutation occurs roughly once every 10,000 years and there are 64 reversals required to take you from a cabbage to a turnip, then the mythical turbage lived approximately 320,000 years ago .
Even if you don't know that these mutations occur every 10,000 years, you can still do relative comparisons. If a cabbage and a turnip has fewer reversals between them then does a radish and a turnip, then we can safely say the common ancestor of the cabbage and turnip occurred in the more recent past . The number of reversals gives us a measure of how far back in time the common ancestors lived. You can assemble this information into what is known as a phylogenetic tree. Each split in the tree corresponds to a common ancestor and the lengths of branches correspond to genetic distance.
Here is a phylogenetic tree borrowed from a great article by Darren Naish. It is based on a 2010 paper by Bonilla, Bruan, and Kimball in which they do exactly the sort of analysis we're talking about.
It turns out that this phenomenon of mutations by reversals is not just for turnips. For example, the genes in the human X chromosome hasn't changed much the past 125 million years except for rearrangements by these reversals. We can use this method to compare people, or people to gorillas, or any others who share this chromosome.
This all sounds great. Sequence the genes, count the number of reversals, and profit! But there is a problem at step 2. The X chromosome has something like 2000 genes. So we are talking about rearranging 2000 different genes.
Faithful 3QD readers should have a sense of dread (cue ominous music). From past experience we know that there are
ways of rearranging a deck of 52 playing cards (that's about 8 x 1064). Now this isn't quite a Brobdingnagian number, but it is large enough that the order of a well mixed deck of cards is nearly certain to never have occurred before.
The number of ways to rearrange the two thousand genes in the X chromosome is
2000! ≈ 3 x 105735.
Even the fastest of supercomputers will be useless in any sort of brute force search through all the possibilities. In fact, the situation is so much worse than that. According to Bremermann's Limit, if you were to turn the Earth itself into a perfectly efficient computer and let it run for as long as the Earth has existed, you could only process something like 1095 bits. That's laughably small compared to the problem at hand. Even the mice of the Hitchhiker's Guide would be stumped by a brute force search.
Mathematicians, on the other hand, deal with infinities on a daily basis and are naive enough to not be cowed by such large numbers. So a biologist calls up their friendly neighborhood mathematician and explains the problem. The mathematician chuckles and says, "Pancakes!". In the small chance the mathematician hasn't gone around the bend once and for all, the biologist asks "What the heck are you talking about?".
The mathematician tells the tale of Harry Dweighter's Pancake Problem. Every time you use a spatula to flip part of a pancake stack, you are exactly doing a certain kind of reversal! A sequence of pancakes are taken out, their order is reversed, and they are put back onto the stack. For pancakes the only sequences being reversed are those begin at the top of the stack and end somewhere in the middle, whereas in DNA the start and end of the sequence can happen anywhere. But the principle is the same.
In fact, this is just one corner of a rich field of mathematics known as Geometric Group Theory. My own department has several people who work in this area. One way to describe this field is to imagine you have a group of allowed states and a list of allowed moves which take you from one state to another. The number of moves required to take you from one state to another gives you a notion of distance (fewer moves means they are closer, more moves means farther). Even for the same group of allowed states, different collections of legal moves can give strikingly different possibilities. This gives you a rich world of mathematical problems and a wide array of tools from geometry, counting, etc., with which to tackle them. Lots of beautiful math is the result.
Your group of allowed states might be all the possible ways to arrange checkers on a checkerboard. The allowed moves are the legal moves in checkers. Two arrangements of checkers are then considered close if it only takes a few legal moves to go from one to another. Or the groups of allowed states might be the possible configurations of a Rubik's cube with cube moves as your legal moves. Or your group of allowed states might be all possible stacks of pancakes and your allowed moves are those given by spatula flips. Or you group of allowed states is the 2000! rearrangements of the X chromosome and the allowed moves are reversals. Once you start looking, you see geometric group theory everywhere!
For the biologist this is great news. The arrangement of genes in the turnip is one state, the arrangement of genes in the cabbage is another, and the legal moves are the reversals. All we want to know is the distance (in the geometric group theory sense) between them. Unfortunately, in 1997 a computer scientist named Caprara proved that the problem of computing reversal distance in the group of rearrangements is NP-hard. This means it is at least as hard as NP problems and, if you believe P ≠ NP (and most people do), that means it's very hard indeed! It seems we are at dead end.
Remarkably, in 1995 and 1996 Bafna, Hannnhalli, and Pevzner gave a polynomial time algorithm for calculating reversal distances in biology. Variations on their algorithm are now part of the standard computer toolkit for biologists interested in making phylogenetic trees. How did they manage it? First, note the date! Since they were working before Caprara's result, Bafna and company weren't burdened by the fact that a polynomial time algorithm for an NP-hard problem is surely impossible.
Since P ≠ NP is still one of the most famous open problems in mathematics and is worth a cool million to whomever solves it, we're clearly missing something. What's going on? Remember that each gene is actually a string of C's, T's, G's, and A's in the DNA. This means that it has a direction. The string ACTGACCG and GCCAGTCA encode the same gene even though the order is reversed. It makes no difference biologically. When we think about it carefully we realize that not only does a reversal flip the order of the genes, but within each gene the DNA itself is reversed. That is, gene reversal is not the Pancake Problem, it's the Burnt Pancake Problem! When counting reversals we not only can look at the order of the pancakes/genes, but also if they are burnt side down/in order, or not. These extra tidbits of information are just enough to turn the problem from intractable to solvable!
Real world applications aside, there is a rich world of mathematics here. Given a group of states and a list of allowed moves, there is often a state which is special in some way (if it's checkers this would be the arrangement of checkers at the beginning of the game, if it's a stack of pancakes it would be the stack of pancakes which are all in order). Geometric group theorists call the length of a state the minimum number of legal moves it takes to go from this preferred state to the one in question.
This turns out to be an interesting number. For example, the functions f(n) and g(n) from the Pancake Problem and Burnt Pancake Problem are exactly asking for the number of moves required in the worst case to get to the preferred state; that is, they are computing the length of the longest element.
A few years ago an undergraduate student, Rhyker Benavidez, was working with me and became interested in the length of states where the allowed moves are the reversals of pancakes and biology. Specifically, he noticed in computer calculations that if you count the number of states of length 1, length 2, length 3, etc., that there are intriguing patterns to these numbers. Based on these computer calculations he conjectured that in certain cases they are given by polynomials. This is much nicer than we have any reason to expect! However, there is a precedence for this kind of result. If you consider the group of all orderings of the numbers 1, 2, 3, ..., n and your legal moves are that you can swap any two adjacent numbers (these moves are called simple transpositions), then it is a classical result that number of states is of a fixed length is given by an explicit polynomial (where the number n is the variable).
Rhyker continued to work on his conjectures after graduation and I'm proud to say that he has verified several of them! They'll hopefully be part of a research paper on the subject someday. Having done our time on the Pancake Problems I look forward to when riches pour in like they did for Bill Gates and David Cohen :-). Rhyker is graduating with a Masters of Divinity from Harvard and will be studying Biomedical Engineering at Johns Hopkins this fall. And a week from Friday he and Brandon Ranallo will be married. Congratulations Rhyker!
Thanks go to Brian Hayes. His article "Sorting out the Genome" in the Sept/Oct 2007 issue of American Scientist is where I first learned about reversal distance (you can find it and many more interesting articles here). I study groups for a living and was excited beyond words to learn that there were interesting biology problems in my area of math. In particular, the image of a sequence of reversals example given above is from his article.
 If you remember from your high school biology class, DNA comes in a long strand of four amino acids: T's, C's, G's, and A's. A gene is a segment of DNA which encodes a functional RNA or protein and is the "atom" when considering questions of genetics. Since I'm a mathematician who never took high school biology, it's safe to say that this is a gross oversimplification. But it's good enough for us.
 Why 320,000 instead of 640,000? If we think about it for a minute we realize that once they split into two genetic lines, each had their fair share of mutations and of the 64 in total we counted, 32 occurred in the cabbage and 32 in the turnip. Also, strictly speaking this is only a lower bound on how long ago they had a common ancestor. In principle it is possible for a mutation to cause a reversal and a later mutation to undo it. We have no way of knowing if two reversals happened but cancelled each other. In principal there could be unknown mutations which have occurred. This is extremely unlikely, but it could happen.
 This of course supposes that the rate of mutation is the same for each. We'd need to look at the biology to see what is reasonable to assume here. Same goes for hybrids which might have happened in the intervening years and other confounding worries. But, again, we'll leave it to the biologists to worry about that.
 Image from the University of Bristol Maths Department.
Posted by Jon Kujawa at 12:55 AM | Permalink