A new study finds up to 250 regions where the reference genome sequenced over 13 years may be missing information.

Human Genome 2003
©DAVID MARCHAL
STILL UNDER CONSTRUCTION?: A new study finds that there could be hundreds of genes missing from the "complete" human genome that was assembled in 2003.


If you ask the scientists at the National Human Genome Research Institute (NHGRI) when the Human Genome Project wrapped up, they'll tell you it was finished in 2003. However, a new study indicates that the composite reference genome, cobbled together from parts of the genetic codes of multiple people, is definitely a work in progress.

The completed genome was to serve as a model of the genetic makeup of a typical human that researchers could use as a reference to detect genetic flaws and defects in people with certain disorders. But new research published last week in Nature shows that the current model may be faulty - and that there may actually be yet-to-be-uncovered genes missing from it.

A study of genetic variations in eight individuals turned up more than 250 regions throughout the genome that researchers believe may contain hundreds of new genes. It also determined that the reference genome may be completely wrong or contain rare alleles (versions of genes), says study coauthor Evan E. Eichler, associate professor of genome sciences at the University of Washington in Seattle.

"The reference genome having a rare allele means it is not exactly presenting the majority of people, which is how most people think of a reference genome," says Michael Snyder, a Yale University biologist who was not involved in the study.

Eichler believes that the findings gleaned from these eight genomes - and 17 others that he plans to analyze - could help fill in gaps in the reference genome, which would make the sequence more helpful in the study of complex genetic disorders such as heart disease, diabetes and schizophrenia. "There's a saying that goes, 'It's the sequence, stupid,'" Eichler says. "Once you get the sequence to a high quality, you can go after association studies and go after diseases."

Eichler and his team set out to pinpoint areas of the genome where structural changes occur by comparing the codes of several people. These variations can affect thousands to millions of letters, or nucleotides (the building blocks of DNA), in the genetic code; the human genome contains 3 billion letters. The alterations can take the form of so-called copy number variations (in which several genes are either deleted or duplicated, changing the number of copies of a gene a person carries from the usual one copy from each parent) or inversions, in which a segment of the code is reversed. These mutations can arise when a child's genome is being made (by cutting and pasting the parents' codes together) or through errors in repairing DNA damage, which is typically caused by environmental factors such as ultraviolet rays and smoke inhalation.
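
To make the two kinds of alteration concrete, here is a toy sketch in Python. It is an illustration only, not the researchers' method: the sequences are made up, the function names `count_copies` and `invert_segment` are hypothetical, and an inversion is modeled as a plain reversal of letters, as described above, which ignores the fact that DNA has two complementary strands.

```python
def count_copies(sequence: str, gene: str) -> int:
    """Count non-overlapping copies of a 'gene' string in a sequence."""
    return sequence.count(gene)


def invert_segment(sequence: str, start: int, end: int) -> str:
    """Return the sequence with the letters in [start, end) reversed.

    Simplification: real inversions also swap each letter for its
    complement; this toy version only reverses the order.
    """
    return sequence[:start] + sequence[start:end][::-1] + sequence[end:]


# Made-up example data (not real genomic sequence).
gene = "GATTACA"
typical = "CC" + gene + "TT" + gene + "GG"   # two copies
duplicated = typical + gene                   # one extra copy
deleted = typical.replace(gene, "", 1)        # one copy removed

print(count_copies(typical, gene), count_copies(duplicated, gene),
      count_copies(deleted, gene))
print(invert_segment("AAACGTTT", 3, 6))
```

A copy number variation shows up here as a count other than two, and an inversion as a stretch of letters that reads backward relative to the original.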

The researchers took DNA samples from the blood of eight individuals: four Africans, two Asians and two Europeans. They randomly broke each person's code into a million fragments and then attempted to match the ends of the segments to regions on the reference genome. If they could not find a match, the team designated the matchless segment as a site of a structural change.
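
The matching step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the reference and fragments are invented strings, the function name `flag_unmatched_fragments` and the `end_len` parameter are hypothetical, and a "match" is taken to mean an exact occurrence of a fragment's end in the reference.

```python
def flag_unmatched_fragments(reference: str, fragments: list[str],
                             end_len: int = 4) -> list[str]:
    """Return fragments with at least one end that is absent from the reference.

    Only the first and last `end_len` letters of each fragment are
    compared, mirroring the idea of matching fragment ends rather than
    whole fragments.
    """
    flagged = []
    for fragment in fragments:
        left_end = fragment[:end_len]
        right_end = fragment[-end_len:]
        if left_end not in reference or right_end not in reference:
            flagged.append(fragment)
    return flagged


# Made-up example data (not real genomic sequence).
reference = "ACGTTGCAAGCTTAGGCTAACCGT"
fragments = [
    "ACGTTGCAAGCT",  # both ends occur in the reference
    "TTAGGCTAACCG",  # both ends occur in the reference
    "GCAAGCTTGGGG",  # right end "GGGG" does not occur in the reference
]

candidates = flag_unmatched_fragments(reference, fragments)
print(candidates)
```

In this toy run only the third fragment is flagged as a possible site of structural change, because one of its ends has no counterpart in the reference string.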

In total, the researchers identified 1,695 structural variations, 800 of which had not been previously reported. Fifty percent of the regions affected by these mutations showed up in more than one of the people studied. Forty percent of the 525 regions found to be missing from the reference genome were due to copy number variations, which means that a crop of yet-to-be-discovered genes may be hiding within them.

"I'm almost certain he's found new genes," says Jonathan Sebat, a geneticist at Cold Spring Harbor Laboratory on Long Island, N.Y. "We've never seen any [locations] where the reference sequence has zero copies of a gene."

Eichler says that his team is currently sequencing the segments of the volunteers' genomes containing the missing information. "There are clearly things that look like they could be genes in there," he says.

He notes that many structural variations in our genomes occur in 400 unstable regions of the code. "A lot of these variations are biased to specific regions that include genes that are important to adaptation," he says. "These are genes that have changed very radically within humans or are [relatively] new genes that are not found outside of humans." He calls these areas "crucibles of evolution," in which new nucleotide combinations have been tried out and mostly discarded except in "very rare" instances in which they created advantageous traits and "a new gene was minted."

Once these structural variations are characterized as deletions, duplications or inversions, they can be added to other efforts like the International HapMap Project, an attempt to catalog differences of a single nucleotide within genes among people of different ethnicities. Earlier this year, an international consortium, including the NHGRI, announced a plan to sequence 1,000 genomes that will, among other things, help refine the data in the reference genome.

If the reference genome can be amended to represent the most common set of genes (and both small and large variations can be catalogued), Eichler says scientists will be able to quickly pinpoint alleles found in those with a particular illness, such as diabetes, and compare them with the reference genome to determine if they are "normal" or flawed. "By properly characterizing normal genomes," he says, "we'll be able to identify disease-causing variants very easily."