A couple weeks ago, I was talking to someone about the research I did in
grad school. Briefly, my research involved computer-aided design of
antibodies. While describing my research, I realized that the things I worked on using a relatively powerful computer and
high-end software a mere 5-10 years ago could now be accomplished on a cheap home computer, most of it using free software.
I've often thought about going back to grad school, but this realization pointed me to an interesting possibility. Why not try some computational research on my own, self-funded, to see if I could get a research paper published?
The first thing I thought to try was an idea I'd had at the end of my time in grad school. One of the big mysteries left in biology is
how proteins fold. If a protein simply sampled every shape it could possibly fold into, one at a time, it's said that it would take longer than the
age of the universe for a typical protein to fold. But within a cell, proteins fold into their proper shape within milliseconds or, at the longest, hours. And the shape of a protein is important: it's what determines what the protein does in the cell, and what proteins do is what determines what living things do. Another important note on all this is that the
sequence of amino acids in the protein determines the final shape of the protein.
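For a rough sense of where the "longer than the age of the universe" claim comes from, here is a back-of-the-envelope sketch in Python. Every number in it is an assumption picked for illustration and none comes from a measurement: a 100-amino-acid protein, 3 possible shapes per amino acid, and 10^13 shapes tried per second. The function name is made up for this example.

```python
# Back-of-the-envelope estimate of an exhaustive shape search.
# All inputs are illustrative assumptions, not measured values.

SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # commonly quoted approximate figure


def exhaustive_search_years(n_residues, shapes_per_residue, samples_per_second):
    """Years needed to try every combination of per-residue shapes once."""
    total_shapes = shapes_per_residue ** n_residues
    return total_shapes / samples_per_second / SECONDS_PER_YEAR


# Assumed: 100 residues, 3 shapes each, 1e13 shapes sampled per second.
years = exhaustive_search_years(100, 3, 1e13)
print(f"Exhaustive search: about {years:.1e} years")
print("Longer than the age of the universe?", years > AGE_OF_UNIVERSE_YEARS)
```

Changing the assumed numbers moves the answer by many orders of magnitude, but for any protein of realistic size the total stays far beyond the age of the universe, which is the point of the argument.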
One thing that's likely to happen during protein folding is that certain parts of the protein quickly fold, and this quick folding brings other parts closer together, which then interact to fold the protein into its final shape. So, I thought it would be interesting to search through the known protein structures, and see if any short sequences of
amino acids (say, combinations of 3 amino acids) tend to have a standard shape in these known proteins. With 20 possible amino acids, and 3 positions, that comes out to
20 * 20 * 20 = 8000 possible 3-amino-acid combinations. It's a lot of things to check, but that's what computers are good at.
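To make the counting concrete, here is a minimal Python sketch of the bookkeeping side of the idea. It lists all 8000 combinations and counts how often each one shows up in a set of protein sequences. The two sequences are made-up toy strings, and the function names are mine. It only looks at sequence; comparing the shapes of each combination would also need 3D coordinates from the known structures, which this sketch doesn't attempt.

```python
from collections import Counter
from itertools import product

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def all_tripeptides():
    """Return every possible 3-amino-acid combination as a string."""
    return ["".join(combo) for combo in product(AMINO_ACIDS, repeat=3)]


def count_tripeptides(sequences):
    """Count each overlapping 3-residue window across the sequences.

    Windows containing anything other than the 20 standard letters
    (for example 'X' for an unknown residue) are skipped.
    """
    valid = set(AMINO_ACIDS)
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 2):
            window = seq[i:i + 3]
            if set(window) <= valid:
                counts[window] += 1
    return counts


# Hypothetical toy sequences, not real proteins.
example_sequences = ["MKTAYIAKQR", "AYIAKQ"]

tripeptides = all_tripeptides()
counts = count_tripeptides(example_sequences)
print(len(tripeptides), "possible combinations")
print(counts.most_common(3))
```

With real data, the next step would be to collect the backbone geometry for every occurrence of a given combination and check whether those geometries cluster together.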
So, before starting on this project, I decided to check
PubMed to see what had been done since I looked into this in 2000 or so. And, of course, in December of 2002, somebody published
a paper about exactly what I was planning to do.
So, that one was out (well, mostly; I still think I might give it a try, but just as an exercise in programming and to test the
replicability of their data). While I was at PubMed, I decided to poke around and see if any other ideas fell out.
So then I started
thinking about evolution, and in particular about the
human genome. The human genome contains about 3 billion
DNA base pairs. New DNA gets into the human genome through two routes, as far as we know: duplication of existing DNA (through things like copying errors or
transposons), or incorporation of viral DNA into the genome (similar to what's called
lysogeny in bacteria). By tracing the relationships between the genes within the genome, using the same
techniques we use to trace relationships between genes in different organisms, it should be possible to trace the evolutionary history of every gene in the human genome (with the possible exceptions of the virally introduced genes and genes that diverged too long ago for us to recognize their relationship). I thought that would be an interesting thing to try.
So, of course, did several other people. For example, this mob worked together to
map human chromosome 18. Others have done similar things on other parts of the human genome. Not only had people beaten me to the punch again, but the job was way bigger than my computer is likely to be able to handle.
So... I'm still not sure exactly what I'm going to look into. I still plan to do this, but I've decided the first step is to read a bit more to catch up. If you know of any computational biology work that it might be interesting to look into, feel free to tell me about it in the comments.