©Innovations
EECS and statistics professor Yun Song studies computing problems related to the human genome.

There are about six billion base pairs in the human genome, and our family tree includes about six billion living humans. For other species, these numbers are also enormous. So, although DNA sequencing begins in a laboratory, it requires research-level computer science and statistics to crunch the resulting mass of data and make sense of the results, which have applications ranging from medicine and biology to anthropology and history. As EECS and statistics professor Yun Song remarks, "Just 15 years ago, it was very difficult for population genetics researchers to run their computationally intensive analyses on desktop computers. It's thanks to relatively recent improvements in computers and algorithms that these problems have become tractable."

Song has been working on two computation-heavy problems: how to assemble an individual organism's genome and how to trace the genealogies of individual DNA sequences back in time. The first problem arises because DNA sequencing begins with generating many millions of "short-read" pieces from a DNA sample. For a human-sized genome, it takes the state-of-the-art Illumina Genome Analyzer system several days of constant processing to jigsaw-puzzle all of these pieces together. If that sounds like a long time, recall that it took more than 10 years and hundreds of millions of dollars to sequence the first human genome, which was completed in 2003.

The second problem, tracing genealogies, also relies on computing, since exhuming graves to obtain ancestor DNA is not an option. As with sequencing, this problem generates an explosion of possibilities that defies brute-force searches by even the fastest computers. "You know that DNA sequences in a group of individuals are related, but you don't know exactly how," Song explains. "So you need to consider all possible evolutionary histories. For 100 random people, this number is very large."
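
To get a feel for the scale Song describes, consider a deliberately simplified count: the number of distinct rooted, binary family trees that could relate n labeled sequences, ignoring branch lengths and recombination. That number is the double factorial (2n-3)!!, a standard result in combinatorics. The short Python sketch below is only an illustration under those assumptions, not the model Song's group uses, and the function name is made up for this example.

```python
def rooted_tree_topologies(n: int) -> int:
    """Count rooted binary trees on n labeled tips: (2n-3)!! = 1 * 3 * 5 * ... * (2n-3).

    Simplifying assumption: one tree per history, no recombination,
    no branch lengths. Real genealogies with recombination are more numerous.
    """
    if n < 1:
        raise ValueError("need at least one sequence")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for odd in range(3, 2 * n - 2, 2):
        count *= odd
    return count


if __name__ == "__main__":
    for n in (3, 5, 10, 100):
        total = rooted_tree_topologies(n)
        print(f"{n} sequences: a {len(str(total))}-digit number of possible trees")
```

Even under these simplifications, the count for 100 sequences exceeds 10^180, which is why no computer can simply enumerate every history and pick the best one.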

Of course, these two problems are intertwined. You can sequence short reads of DNA far more efficiently if you're smart about how the genome behaves as it recombines through the generations: how different parts of it move and change, and which combinations, though mathematically possible, are biologically unlikely. To this end, Song and his team have designed a reconfigurable supercomputer using Field Programmable Gate Array chips (FPGAs). The system assembles DNA pieces in parallel by mapping them against a reference genome, using algorithms based on flexible models of evolution. This makes genome sequencing more like assembling a jigsaw puzzle with a recognizable picture, rather than a puzzle that's all white.
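
The idea of mapping short reads against a reference genome can be sketched in a few lines of Python. This is a toy "seed and check" approach chosen for illustration: index every k-letter substring of the reference, use the index to find candidate positions for a read, then accept positions where the read differs from the reference by at most a few letters. The function names, the tiny reference string and the mismatch threshold are all assumptions for this example; the article does not describe the actual algorithms or evolutionary models running on the FPGA system.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_read(read: str, reference: str, index: dict, k: int,
             max_mismatches: int = 1) -> list:
    """Return sorted reference positions where the read fits with few mismatches.

    Hypothetical helper: handles substitutions only, not insertions or deletions.
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)


if __name__ == "__main__":
    reference = "ACGTTGCATGACCGTA"  # made-up reference sequence
    k = 4
    index = build_index(reference, k)
    for read in ("TGCATGAC", "TGCTTGAC", "GGGGGGGG"):
        print(read, map_read(read, reference, index, k))
```

Because each read is looked up independently of the others, this kind of work divides naturally across many processing units, which is what makes a parallel design attractive.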

Together with Professor Charles H. Langley and other collaborators at UC Davis, Song recently put together a proposal for the National Institutes of Health (NIH) to resequence 1,000 Drosophila (fruit fly) genomes. The project would parallel the "1000 Human Genomes" project, which launched earlier this year to sequence genomes from people around the world.

For understanding human genetics, working with Drosophila has many advantages: they share basic anatomical systems with humans, such as circulatory, nervous, and digestive systems, but they're small and inexpensive to maintain; they reproduce quickly; their genome is smaller and better understood; and activists don't cause trouble for people who tinker with them. All of this allows researchers to perform functional genomics experiments that can't be performed with human subjects, for example, adding or subtracting specific genes to determine what they actually do in real life, apart from any speculation.

As Song explains, the continuing rapid progress in sequencing genomes and understanding genetics will revolutionize medicine. Within a few years, he predicts, the cost of genetic sequencing for individuals will drop below $1,000, a widely anticipated milestone. As the price continues to drop, sequencing will become a standard medical procedure. The information in patients' genomes, combined with medical records, family histories and research data, will make medicine far more individualized and effective and will uncover countless new and useful correlations. Preventive medicine will improve, benefiting everyone, as testing reveals genetic propensities toward disease long before any symptoms appear. But security and policy around the genome data will be an important issue. As Song warns, "If people are refused medical insurance because of their genome, that would be terrible."

In addition to writing the medical history of the future, population genetics might also literally rewrite history, by showing how human variation has emerged in response to evolutionary pressures in different parts of the world. As Song explains, "This work will help us understand a lot more about population history and structure and how diet, migration and other cultural and environmental factors have contributed to the tapestry of human variation."