Poking a Pet Peeve - 3 Quarks Daily

I'm not even going to bother sourcing the quote that initially spurred this column; we've all seen a hundred similar claims in press releases, news articles, blogs and so on:

It is an established fact that 98 percent of the DNA, or the code of life, is exactly the same between humans and chimpanzees. So the key to what it means to be human resides in that other 2 percent.

Argh. This meme, or trope, or whatever you want to call it, drives me crazy. Here's why:

Individual human genomes vary by about 0.08% at the single-nucleotide level, whereas human and chimpanzee genomes differ by about 1-1.5% at the same level. This is misleading, though, because single-nucleotide comparison means aligning comparable sequences base-by-base and counting the differences. In order to line up the two sequences in the first place, you have to introduce gaps into each sequence to allow for insertions and deletions. Like this:

actgccggctaac-----gtaccTgtcaactggcatgcatgcaagtacc
actgccggcGaacggtccgtacccgtcaac--gcatgAatgcaagtacc

In this made-up example, three bases out of fifty are different (6%) but the gaps account for a further 7 bases' worth of difference (14%). Do this with enough regions of each genome to get a representative sample and you can estimate the degree of sequence identity between the two genomes. Of the optimally-aligned sections of our genomes, we share about 98.5-99% with chimps, but taking the gaps into account produces a rather lower figure of about 95%, something Roy Britten showed in 2002.
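For anyone who wants to check the counting, here is a minimal sketch in Python that tallies mismatched and gapped columns in the toy alignment above. The function name `alignment_stats` and the variable names are my own, purely for illustration, and the comparison ignores upper/lower case (the capitals in the example only flag the substitutions). As printed here the two rows come to 49 columns, so the percentages land a shade above the round figures quoted.

```python
def alignment_stats(row_1: str, row_2: str, gap: str = "-") -> dict:
    """Count mismatched and gapped columns in a pairwise alignment."""
    if len(row_1) != len(row_2):
        raise ValueError("aligned rows must have the same length")

    columns = len(row_1)
    # A gap column is one where either row holds the gap character.
    gap_columns = sum(1 for a, b in zip(row_1, row_2) if gap in (a, b))
    # A mismatch is a column with two real bases that differ (case-insensitive).
    mismatches = sum(
        1
        for a, b in zip(row_1, row_2)
        if gap not in (a, b) and a.lower() != b.lower()
    )
    return {
        "columns": columns,
        "mismatches": mismatches,
        "gap_columns": gap_columns,
        "pct_mismatch": 100 * mismatches / columns,
        "pct_gap": 100 * gap_columns / columns,
    }


# The two rows of the made-up alignment, copied as printed.
row_1 = "actgccggctaac-----gtaccTgtcaactggcatgcatgcaagtacc"
row_2 = "actgccggcGaacggtccgtacccgtcaac--gcatgAatgcaagtacc"

stats = alignment_stats(row_1, row_2)
print(stats)
```

Counting only the mismatches gives the "single-nucleotide" style of figure; adding the gap columns gives the lower, Britten-style identity.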

What both figures overlook, and tend to obscure, are differences in the organization of large sections of the genetic information: duplications, inversions, recombinations between and within chromosomes, insertions of retroviral sequences, species-specific genes and so on. There are a number of methods that allow us to measure such differences, but at the submicroscopic level¹ one of the newest and perhaps most powerful is representational oligonucleotide microarray analysis (ROMA). What ROMA does (there's a good explanatory paper here) is to compare reduced-complexity representations of two genomes. The average resolution is one probe every 35 kb. The authors say that 10-15 kb is feasible, but the coarser comparison may be more interesting, at least initially, because it shows the “big picture”, like zooming out on a map. (There is a tradeoff, of course: the coarser the resolution, the more small variants are missed, and earlier lower-resolution studies found far fewer polymorphisms.)

When researchers at Cold Spring Harbor Laboratory (CSHL) used ROMA to investigate the differences between tumor and normal cells, they included a normal-normal control to establish lower limits of variability (the Science paper is here). What they found was that the genomes of normal individuals vary not just at the level of the individual nucleotide or even gene, but also on a much larger (though still submicroscopic) scale, with deletions and duplications from 100,000 b to 1 Mb (b = base, or more accurately base pair, a single “rung” on the familiar twisted rope ladder image of DNA).

(As an aside: how big is 100 kb – 1 Mb? The entire human genome is about 3000 Mb, and contains somewhere between 18,000 and 30,000 genes (estimates vary, but the newer the estimate the lower the number seems to be). For simplicity, say the “average gene” is about 100 kb (but note that this is a bit misleading, since a typical gene contains only a few hundred to several thousand bases of coding sequence, which may be spread out across hundreds of kb but is more usually contained within, say, a few tens of kb). So, 100-1000 kb is easily big enough to encompass a whole gene, or even quite a few entire genes. Indeed, the CSHL researchers found variation in some 70 genes, including the gene which causes Cohen syndrome and genes known to be involved in neurodevelopment, leukaemia, drug resistance in breast cancer and body weight regulation.)

The team compared twenty individual genomes and found 76 unique CNVs (copy number variants; the authors call them CNPs, for copy number polymorphisms). The average CNV was 465 kb (median 222 kb) and individuals differed from each other by an average of 11 CNPs, but the authors provide multiple reasons to expect the observed CNPs to represent only a subset of the total, which they estimate to be 226 CNPs covering 44 Mb. In fact, more recent studies have discovered a total of 1237 CNVs covering more than 140 Mb. The authors of the linked review caution that most of these have not been validated by alternative methods or by discovery in multiple unrelated individuals, so the final number is likely to be considerably lower. A slightly earlier review describes 563 “apparently unique” human CNVs.
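To put those megabase totals on the same footing as the percentages everyone likes to quote, here is a back-of-envelope conversion. It is only a rough sketch: it assumes the round 3000 Mb genome size mentioned above, and the helper name `percent_of_genome` is mine.

```python
GENOME_MB = 3000  # approximate human genome size, as used in this post


def percent_of_genome(megabases: float, genome_mb: float = GENOME_MB) -> float:
    """Express a stretch of DNA (in Mb) as a percentage of the whole genome."""
    return 100 * megabases / genome_mb


# Figures quoted in the text: the CSHL team's estimated total, and the
# larger total from more recent (largely unvalidated) studies.
for label, mb in [("estimated CNP total", 44), ("more recent CNV total", 140)]:
    print(f"{label}: {mb} Mb is about {percent_of_genome(mb):.1f}% of the genome")
```

Even the conservative estimate is more than ten times the 0.08% single-nucleotide figure for variation between individual humans.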

So what happens if you make a similar² comparison between chimpanzees and humans? Perry et al. used array comparative genomic hybridization (array CGH, a technique closely related to ROMA) to compare the genomes of 20 unrelated chimpanzees, and found 355 CNVs; the same array (which covers about 12% of the human reference genome) had been used in an earlier study of 55 unrelated human genomes, which found 255 CNVs. Of these, 74 CNVs were found in the same regions of the two genomes, and many of these CNVs were frequent in both species, indicating that certain regions may be particularly susceptible to this kind of variation.

In their paper describing the draft chimpanzee genome, Mikkelsen et al. provide a number of points of comparison with the human genome. In addition to providing the best available figure for single-nucleotide differences (1.23%), they estimate that insertions and deletions result in a difference of about 90 Mb, or 3%, between the two genomes. Earlier, Newman et al. used a bioinformatics approach to compare sequence data from the (at the time, unfinished) chimp genome with the human genome reference sequence. They found insertions/deletions amounting to about a 5% difference between human and chimp genomes, but they also found 174 submicroscopic sequence inversions spanning more than 450 Mb.

It turns out that such inversions, sections of DNA whose orientation along the chromosome is reversed between the genomes being compared, are surprisingly frequent in humans and chimpanzees. Feuk et al. compared the current chimp draft with the human reference genome sequence and found 1,576 putative regions of inverted orientation, covering more than 154 Mb. Of the 23 of these inversions that were experimentally validated, three were polymorphic in humans. Similarly, Szamalek et al. made gene order comparisons between a set of 11,518 human and chimp genes and found 71 inversions; of the 5 validated inversions (spanning about 11.5 Mb and containing a total of 103 genes), three were polymorphic in the chimpanzee and one in humans.

These studies strongly suggest that submicroscopic differences are an important source of genomic variation in, and between, humans and chimpanzees. Moreover, a large number of genes have been shown to be affected by submicroscopic changes, and CNVs and small-scale inversions have been associated with a wide variety of biological functions (see here and here for reviews). For instance, CNVs are estimated to affect over 3,000 human- or chimpanzee-specific genes, and known human CNVs include genes involved in drug detoxification (glutathione-S-transferase, cytochrome P450s), immune response and inflammation (leukocyte immunoglobulin-like receptor, defensins), surface antigens (melanoma antigen gene, rhesus blood group gene families) and variation in drug responses and disease resistance/susceptibility.
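To give a feel for what a gene-order comparison involves, here is a toy sketch. I should stress that this is not the method used in any of the studies above: the function `find_inverted_runs`, the gene labels and the orderings are all invented for illustration. It simply looks for runs of genes that appear in exactly reversed order in one genome relative to the other, and it ignores strand orientation and genes missing from either list.

```python
def find_inverted_runs(order_a, order_b, min_genes=2):
    """Return runs of genes in order_b whose order is the exact reverse of order_a."""
    position_in_a = {gene: i for i, gene in enumerate(order_a)}
    # Only genes present in both lists can be compared.
    ranks = [(g, position_in_a[g]) for g in order_b if g in position_in_a]

    runs, current = [], []
    for gene, rank in ranks:
        # Extend the run while positions in genome A step down by exactly one.
        if current and rank == current[-1][1] - 1:
            current.append((gene, rank))
        else:
            if len(current) >= min_genes:
                runs.append([g for g, _ in current])
            current = [(gene, rank)]
    if len(current) >= min_genes:
        runs.append([g for g, _ in current])
    return runs


# Hypothetical gene orders: g3-g4-g5 is flipped in the second genome.
genome_a = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
genome_b = ["g1", "g2", "g5", "g4", "g3", "g6", "g7", "g8"]

inverted = find_inverted_runs(genome_a, genome_b)
print(inverted)
```

Real comparisons have to cope with missing or duplicated genes, assembly errors and inversions too small to contain more than one gene, which is part of why only a handful of the putative inversions in these studies were experimentally validated.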

I hope all this makes it clear that human and chimpanzee genomes are not “98% identical”, except at the relatively uninformative level of single-nucleotide comparisons. Indeed, the most meaningful differences between the two genomes are likely structural in nature, and cannot be neatly summed up as a percentage difference of any kind. I'm not even going to start on “what it means to be human” — that's a lifetime's worth of philosophy and molecular evolution. All I want to do here is to make the case that, whatever it is that differentiates us from chimpanzees, it is not to be found in that infamous 2%.

———-
¹ There are also much larger-scale differences, visible under a microscope, between the two genomes. These have been well characterized, and include the fusion of two ancestral chromosomes (analogous to chr. 12 and 13 in chimpanzees) to form human chromosome 2, extensive sequence duplications in chimpanzees relative to humans, and nine large pericentric inversions.

² I have not found a ROMA-based comparison between humans and chimpanzees, but all of the studies described focused on differences at roughly the same scale on which ROMA operates.

This work is licensed under a Creative Commons Attribution 3.0 License.