Evolving Codes and Novel Genes

The genetic code, the mapping from each sequence of three nucleotides to an amino acid, is shared widely across the tree of life. Explore Evolution seizes upon some small variations in that mapping to claim that common ancestry must be wrong.

Contrary to claims in Explore Evolution, the genetic code is still considered to be universal, in that the known exceptions are very minor variations on the same basic code found in all organisms. Long before the discovery of variant codes, it was known that the genetic code can in fact change, and how it can change, based on laboratory findings in bacteria and yeast mutants. We also have a good understanding of the kind of mutations and selective forces that can allow the genetic code to evolve new variants.

We also have a growing understanding of so-called ORFans, coding DNA sequences which don't seem homologous to genes in any other lineage. In the account of Explore Evolution, these are insuperable challenges to evolution, but this misrepresents the current state of knowledge. The more of these apparently unique sections of coding DNA we study, and the more genomes we sequence, the fewer of these ORFans actually seem unique. Many of what were thought to be ORFans actually have families, and Explore Evolution misinforms students by treating these sequences as unsolvable problems.

The Genetic Code

Explore Evolution wrongly states that biologists originally maintained that the genetic code is absolutely universal (invariant); that this absolute universality was considered evidence for common descent; that this would be a reasonable inference because changing the code would be invariably lethal ("not survivable"); and finally, that the claim of universality fell apart in the 1980s with the discovery of variant genetic codes. Thus, the authors claim, the genetic code is not universal, the inference of common descent is in question, and life must have "multiple separate origins." They cite physicist Hubert Yockey to justify the claim: "Some scientists think this is a possibility, saying that the evidence may point to a polyphyletic view of the history of life." (p. 59)

There are many problems with this argument, which is based on misunderstanding and misrepresentation of the available knowledge and of the scientific record.

First, contrary to the key assertion, scientists have been aware of natural genetic code mutants since at least the 1960s, and the actual molecular mechanism of some of these mutations (such as "suppressors of amber") was elucidated in both bacteria and yeast (Goodman HM, Abelson J, Landy A, Brenner S, Smith JD. (1968) "Amber suppression: a nucleotide change in the anticodon of a tyrosine transfer RNA." Nature 217:1019-24; Capecchi MR, Hughes SH, Wahl GM. (1975) "Yeast super-suppressors are altered tRNAs capable of translating a nonsense codon in vitro." Cell 6:269-77.) Amber suppressor mutations change the read-out of certain codons from STOP to an amino acid by altering the structure of one of the transfer RNAs. This tRNA recognizes the codons in messenger RNAs and allows the addition of the correct amino acid during protein synthesis. These mutants showed how new variant genetic codes can evolve, and what kind of selective pressures can favor such changes (in this case, the need for reversion of point mutations which introduce deleterious STOP codons in critical genes). Therefore, it was recognized fairly early that the genetic code did not need to be absolutely invariant to be fundamentally shared between all organisms ("universal"). Already in 1966, Francis Crick stated in his Croonian lecture:

The best evidence to date [for the universality of the code] is probably the excellent agreement between the code deduced for E. coli and the mutagenic data … derived from tobacco plants or human beings. There is thus little doubt that the genetic code is similar in most organisms. Whether there are any organisms which use a slightly modified version of the code remains to be seen.
Crick, FH (1967) The Croonian Lecture. "The Genetic Code." Proc R Soc Lond B Biol Sci. 167:331-47
Code Variants: from Knight RD, Freeland SJ, Landweber LF. "Rewiring the keyboard: evolvability of the genetic code." Nat Rev Genet. 2001; 2:49-58

Second, the small number of organisms with variant genetic codes and the limited extent of the changes (involving a few codons at most) strongly support the view that these represent new variations of the "standard," universal code, as opposed to independently originated codes. Moreover, the known code variants themselves offer in many cases evidence for common descent, being shared by related organisms according to the established phylogenetic hierarchy, as shown in the figure below, from Knight RD, Freeland SJ, Landweber LF. (2001) "Rewiring the keyboard: evolvability of the genetic code." Nat Rev Genet. 2:49-58, which contains a thorough discussion on the phylogenetic distribution and mechanisms of genetic code variation.

In particular, the authors of Explore Evolution mention the specific example of organisms in which 2 of the 3 "stop codons" have been reassigned to encode for amino acids (change "a" in the figure above). They argue:
It's very hard to see how an organism could have survived a transformation from the standard code to this one. Changing to this new code would cause the cell to produce useless strings of extra amino acids when it should have stopped protein production.
Explore Evolution, p. 58

However, the mechanisms that underlie this particular kind of code change in some organisms are known, and they undermine the authors' argument. The studies have been performed mainly in Ciliates, a group of unicellular eukaryotes belonging to the Protozoans, which include Tetrahymena, the organism specifically cited by the authors. Notably, Ciliates have a peculiar genomic organization, with hundreds of very small chromosomes, often containing a single gene, organized in two distinct nuclei; moreover, their genes tend to have unusually short sequences past their termination codons. This is important because it means that mutations that suppress termination (such as that mentioned by the authors) are less likely to generate very long amino acid stretches past the normal protein end, and hence to cause deleterious phenotypes. Consistent with this possibility, Ciliates comprise the majority of organisms with alternative genetic codes containing termination suppressors (Knight RD, Freeland SJ, Landweber LF. (2001) "Rewiring the keyboard: evolvability of the genetic code." Nat Rev Genet. 2:49-58; Lozupone CA, Knight RD, Landweber LF. (2001) "The molecular basis of genetic code change in ciliates." Curr Biol. 11:65-74).

The particular genetic code mentioned by the authors, in which the UAG and UAA codons are used to encode the amino acid glutamine instead of STOP, results from two sets of changes. The first involves a reassignment of the transfer RNA for glutamine to recognize UAG and UAA. Interestingly, this kind of change can occur through intermediates with only partial effects ("wobble") (Schultz DW, Yarus M. (1994) "Transfer RNA mutation and the malleability of the genetic code." J. Mol. Biol. 235(5):1377-80). The second set of changes affects eRF1, one of the proteins involved in recognizing STOP codons (Lozupone CA, Knight RD, Landweber LF. (2001) "The molecular basis of genetic code change in ciliates." Curr Biol. 11:65-74). Because of this, it is not "very hard" at all, and in fact quite possible, to envision gradual evolution of this new genetic code through intermediates in which the codon interpretation is ambiguous: "hybrid codes," in a sense.
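
The effect of this reassignment can be illustrated with a minimal sketch. It assumes the standard codon table and a variant in which only UAA and UAG are changed to glutamine (Q), which matches what NCBI lists as translation table 6, the ciliate nuclear code. The function names and the short mRNA are hypothetical, chosen only for illustration. The sketch shows that, under the variant code, translation simply continues to the next in-frame UGA, so a short downstream sequence yields only a short extension.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = STOP).
STANDARD_AAS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)


def build_code(amino_acids):
    """Map each of the 64 codons to a one-letter amino acid (or '*')."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, amino_acids))


STANDARD = build_code(STANDARD_AAS)
# Variant discussed in the text: UAA and UAG read as glutamine, UGA still STOP.
CILIATE = dict(STANDARD, UAA="Q", UAG="Q")


def translate(mrna, code):
    """Translate from the first base, stopping at the first STOP codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = code[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical message: AUG GCU UAA GGU UGA
mrna = "AUGGCUUAAGGUUGA"
print(translate(mrna, STANDARD))  # MA
print(translate(mrna, CILIATE))   # MAQG
```

In this toy message the variant code adds only two residues before the next UGA, which is the situation the short post-termination sequences of Ciliate genes would tend to produce.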

To justify the claim that some scientists see variation in the genetic code as evidence against a single tree of life, the authors quote a sentence from a 1992 book by Hubert Yockey, a physicist with an interest in information theory of biological systems. Interestingly, the quote in question is absent from the latest edition of the book. More importantly, in the current edition, Yockey approvingly quotes Francis Crick's suggestion that all extant life forms descended from a small interbreeding population (i.e. common descent), a passage in which Crick once again anticipates organisms with variant codes:

Crick (1981), in one of those marvelous intuitions that have led him to so many discoveries, and without the mathematical argument above, has proposed: “What the code suggests is that life, at some stage, went through at least one bottleneck, a small interbreeding population from which all subsequent life has descended… Nevertheless, one is mildly surprised that several versions of the code did not emerge, and the fact that the mitochondrial codes are slightly different from the rest supports this.”
Yockey H, (2005) Information Theory, Evolution, and The Origin of Life. Cambridge University Press, 2005, p.102

Here, the authors of Explore Evolution make the fundamental mistake of conflating (either on purpose or due to a misunderstanding of the underlying issues) two very different questions: common descent, whether extant organisms can trace their ancestry to a single population at some point in the past, and abiogenesis, whether life originated only once. Quite obviously the two issues are distinct. It is entirely possible that life originated more than once, but that early life forms were so promiscuous in sharing genetic material that they constituted, from a genetic standpoint, a single population from which all later organisms evolved. Essentially no scientist at this point objects to the possibility of the latter proposition, although some disagree as to the pattern in which various lineages arose from the original population. These differences may affect the extent to which linear, vertical descent lineages can be unequivocally identified when analyzing the deepest phylogenetic relationships (such as the separation of life into the Domains of Eubacteria, Archaea, and Eukarya), but do not change the view that all organisms ultimately are phylogenetically related.

ORFans

Explore Evolution claims:

[Molecular biologists] have been surprised to learn that a large number of genes code for proteins whose function we don't understand yet. They call these ORFan genes.
Explore Evolution, p. 60

This is not the definition of ORFans. ORFans are "open reading frames" (sections of a chromosome that begin with a start codon, continue through a stretch of nucleotide triplets, and end at a stop codon) that do not match a known coding DNA sequence in other species. There is no guarantee that these sections even code for a protein, let alone that they have any function. More importantly, these sequences merely have no currently recognized relatives (Siew N, Fischer D. (2003) "Analysis of singleton ORFans in fully sequenced microbial genomes." Proteins. 53:241-51). Function is not a consideration in defining ORFans. Some of these proteins with no known relatives do have recognized functions (e.g. bacterial virulence factor staphostatin B (1nycA)).
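
To make the structural definition concrete, here is a minimal sketch of an open reading frame scan. It makes simplifying assumptions: only ATG counts as a start codon, only the three forward frames are scanned, and the name find_orfs, its min_codons parameter, and the example sequence are hypothetical. Real gene-prediction tools are considerably more careful. Note that the scan finds ORFs only; whether one is an ORFan depends on a separate comparison against sequences from other species.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...STOP stretch.

    Scans the three forward reading frames; 'end' is exclusive and the
    returned sequence includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end, dna[start:end]))
                start = None
    return orfs


# Hypothetical sequence with one ORF in the third frame: ATG AAA TTT TAG
print(find_orfs("CCATGAAATTTTAGGG"))
# [(2, 14, 'ATGAAATTTTAG')]
```

Because any sufficiently long random sequence will contain some ATG...STOP stretches by chance, a scan like this finds candidates, not confirmed genes.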

In contrast, we do have many genes that are in recognizable gene families, but whose functions are not clear from their sequence alone. For example, alpha-beta barrel family proteins have a wide variety of functions, and it is difficult to deduce the function of a member from simple inspection. The incorrect definition given in Explore Evolution artificially inflates the purported number of ORFans.

According to evolutionary theory, new genes arise from old genes by mutation … . New genes should resemble the older "ancestor genes." However, these newly discovered genes do not match any sequence that codes from a known protein.
Explore Evolution, p. 61

Most ORFans have relatives found for them rather rapidly as new genomes are sequenced. With the larger databases available now, old ORFans are finding relatives (e.g. in 2004 hypothetical protein Apc1120 was an ORFan; several relatives have since turned up) and fewer new ORFans are being found. Also, we know that proteins can be generated de novo, so not all proteins must be traced back to older ancestor genes.

Thus, there are two claims here:

  1. A substantial number of ORFans have no similarity to other sequences, and
  2. Common descent assumes that all (or a very high proportion) of current proteins originated with the Last Universal Common Ancestor.

The first claim is deeply misleading and the second is wrong.

Explore Evolution gives the impression that there are many genes with no relation to any other genes (especially by selectively quoting from older papers). In fact, while many putative genes in a newly sequenced organism may initially appear to be unrelated to any then-known gene, relatives are usually found rather rapidly. When H. influenzae was first sequenced, 64% of its Open Reading Frames (ORFs, putative genes) were ORFans; as of 2003, only 5.2% were. When Mycoplasma genitalium was first sequenced, roughly 30% of its predicted genes were ORFans, and now all have homologues in other lineages.
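
The logic of this decline can be sketched with a toy calculation. Everything in it is hypothetical: the family labels, the genome contents, and the function name orfan_fraction are invented for illustration and are not data from the studies cited. Real analyses detect relatives by sequence or structural similarity, not by matching labels. The point is only that an ORFan is defined relative to what has been sequenced so far, so the fraction can only stay the same or fall as the comparison set grows.

```python
def orfan_fraction(query_families, database_genomes):
    """Fraction of the query genome's families absent from every database genome."""
    seen = set().union(*database_genomes)
    orfans = [family for family in query_families if family not in seen]
    return len(orfans) / len(query_families)


# Hypothetical gene families in a newly sequenced genome.
QUERY = {"famA", "famB", "famC", "famD"}

# Hypothetical genomes, listed in the order they are sequenced.
OTHER_GENOMES = [
    {"famA", "famX"},
    {"famB", "famC"},
    {"famY"},
]

for n in range(len(OTHER_GENOMES) + 1):
    fraction = orfan_fraction(QUERY, OTHER_GENOMES[:n])
    print(f"{n} comparison genomes: {fraction:.0%} ORFans")
```

In this toy case the last family never finds a match, which leaves room for the other explanations discussed in this section, such as artefacts, undetected divergent relatives, and de novo origin.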

Explore Evolution quotes the brief review N. Siew, D. Fischer. 2003 "Twenty Thousand ORFan microbial protein families for the biologist?" Structure 11:7-9.

If proteins in different organisms have descended from common ancestral proteins by duplication and adaptive variation, why is it that so many today show no similarity to each other? Why is it that we do not find today any of the necessary “intermediate sequences” that must have given rise to these ORFans?
Explore Evolution, p. 62

This citation ignores the following sentences from that paper:

Regardless of their origin, ORFans may be of two types. Some ORFans may correspond to newly evolved (through a yet unknown mechanism) or to unique descendants of ancient proteins, with unique functions and three-dimensional (3D) structures not currently observed in other families. Alternatively, ORFans may correspond to highly diverse members of known protein families, but with functions and/or 3D structures similar to proteins already known.

As well as the prescient observation:

More sensitive computational methods, such as fold recognition or sequence-to-profile comparisons, may succeed in assigning some ORFans to known families, and thus, their roles and functions may be gained.

This is what has turned out to be the case. By ignoring work in this area since 2003 (including papers from Siew and Fischer published after this mini-review, such as Siew N, Fischer D. (2003) Proteins. 53:241-51), Explore Evolution gives a highly distorted picture of our current understanding of ORFans.

ORFans versus Genome Number: The proportion of ORFans in the genome, as compared to the total number of sequenced genes. As we increase the number of genes sequenced, the percentage of ORFans falls. As of 2003, only 5% of long ORFans (ORFs that are unlikely to be simple sequencing artefacts) were unaccounted for. Figure 1C from Siew N, Fischer D. (2003) "Analysis of singleton ORFans in fully sequenced microbial genomes." Proteins. 53:241-51.

In an inquiry-based class, a teacher might ask the students to suggest reasons why some putative genes appear to be ORFans. Once students had generated that list, the teacher could encourage them to formulate testable hypotheses and even to test those hypotheses. Instead of guiding students and teachers along that path, Explore Evolution encourages students simply to surrender in the face of the unexplained, a decidedly inquiry-averse approach. Some of the reasons scientists have offered for genes to remain ORFans include:

  1. Some ORFans may be artefacts: Many ORFans are very short, 100-150 codons long. It is likely that many of these represent database or annotation errors. Also, in any genome, one would expect some ORFs to arise by chance. Fukuchi S and Nishikawa K. ("Estimation of the number of authentic orphan genes in bacterial genomes." DNA Res. 2004 Aug 31;11(4):219-31, 311-313.) closely examined sequences and estimated that about half of all short ORFans are sequencing or other errors.
  2. Some ORFans may have relatives, but we haven't sampled enough genomes yet. As of 2003, when most of the ORFan comparisons were done, something like 60 complete bacterial genomes had been sequenced. Note the diagram above, with the continuing fall of ORFans as more genomes are sequenced. By 2006 the percentage of ORFans fell by a further 5% (Marsden RL, et al., "Comprehensive genome analysis of 203 genomes provides structural genomics with new insights into protein family space." Nucleic Acids Res. 2006 34:1066-80). More genomes have been sequenced since then, but there are many, many more bacteria that are not yet sequenced, and they will have genomes quite divergent from the human pathogens that form the majority of current sequences. This will be especially important because a gene acquired by horizontal transfer from a distantly related bacterium that has not been sequenced will look like an ORFan (until that distantly related bacterium is sequenced). A recent paper shows that many E. coli ORFans are the result of horizontal gene transfer from bacteriophages (Daubin and Ochman, 2004; "Bacterial genomes as new gene homes: the genealogy of ORFans in E. coli". Genome Res. (6):1036-42.). Bacteriophages are viruses, which is why they didn't turn up in bacterial database comparisons.
  3. Some ORFans may have relatives, but our tools aren't good enough to detect these relatives yet. Rapidly evolving proteins, especially small proteins, can have their evolutionary history obscured by multiple substitutions during their evolution. More sensitive techniques, usually based on structural recognition, are needed to find the relatives of these proteins. For example, using improved fold recognition software and a large database of fold family structures, Siew et al. have found that in Bacillus sp., some related ORFans are members of the alpha/beta hydrolase superfamily, and most likely derive from the haloperoxidases (N. Siew, H. K. Saini and D. Fischer. (2005) "A Putative Novel Alpha/Beta Hydrolase Family in Bacillus." FEBS Letters, 579:3175-82.).

    So most ORFans have been accounted for, and as we study more genomes with better tools we will resolve the status of many more. In an inquiry-based approach, students could recheck Escherichia coli ORFans from 2003, and would find that the vast majority now have resolved relatives. Indeed, if some of the non-artefactual ORFans are due to horizontal transfer from bacteriophages, as recent experiments suggest (Daubin and Ochman, 2004), then they may prove to be a valuable tool in understanding the phylogeny of bacteria, in the same way that families of LINEs, SINEs, and pseudogenes have been. Far from being a threat to common descent, the nested hierarchies of singleton, lineage-specific, and family-specific ORFans are the patterns you would expect from common descent.
  4. Some ORFans may be de novo generated proteins. We fully expect a modest proportion of new genes to be generated de novo during evolution. We even have examples of proteins that are so generated. The most famous of these is the nylonase gene, which allows bacteria to metabolise the artificial polymer nylon. This was produced by a mutation in a piece of non-coding DNA which generated a new protein-coding sequence (Okada H, et al., (1983) "Evolutionary adaptation of plasmid-encoded enzymes for degrading nylon oligomers." Nature. 306(5939):203-6.). The sperm-specific dynein intermediate chain gene (Sdic) was generated by a fusion mutation between two genes (so strictly speaking it falls under the gene duplication rubric), but the coding region of the new Sdic gene is generated from the non-coding intronic regions, so protein homology studies would have a hard time identifying it (Nurminsky DI, et al., (1998) "Selective sweep of a newly evolved sperm-specific gene in Drosophila." Nature. 396(6711):572-5). Formation of new genes poses no problem for evolutionary biology or common descent, as we do not demand that all, or the vast majority, of genes originate in the Last Universal Common Ancestor. Furthermore, we are quite able to trace common ancestry even when some genes are generated de novo, as this does not disturb the trees generated from other genes.