What’s ENCODE’d in Your Genome Isn’t A Simple Collection of Genes

By Kevin E. Noonan —

While Craig Venter is trying to synthesize a "minimal genetic complement" bacterium (see "Patent Life (Really)"),
a consortium of 35 research groups from 80 research centers is
attacking the problem from the other end of the phylogenetic tree:
what is needed (minimally) to encode a human being? Known as the ENCODE (ENCyclopedia Of DNA Elements) Project group, the consortium operates under the auspices, and with the financial support, of the National Human Genome Research Institute (NHGRI). The consortium's latest results, formally announced with the publication of a synthesizing article in Nature and the concomitant publication of 83 separate supporting papers in the journal Genome Research,
further distinguish the structure and complexity of
the mammalian genome from those of the more efficiently designed
bacterial genome.

In its report ("Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project," Nature
447:799-816), the ENCODE group presents results showing that a majority
of the DNA sequences studied are transcribed into RNA, including both
gene sequences and sequences understood to be non-coding – "junk" DNA –
and that these primary transcripts are overlapping, i.e., they
start and stop at a more diverse array of sites than previously
appreciated. Overlapping genes were detected in 224 loci, and 180 of
these contained at least one exon from an upstream gene. As the
authors explain, "[i]nstead of the traditional view that many genes
have one or more alternative transcripts that code for alternative
proteins, our data suggest that a given gene may both encode multiple
protein products and produce other transcripts that include sequences
from both strands and from neighbouring loci (often without encoding a
different protein)." An illustrative example is a fusion transcript
consisting of "at least" three coding exons from the ATP5O gene and two coding exons from the DONSON gene, expressed in small intestine.

An important caveat to these results is that they are limited to a
review of but 1% of the human genome (44 genomic region targets, 30
million basepairs). Within the studied sequences the authors did not
find the kind of transcriptional distinctions between coding and
non-coding DNA that the traditional view would predict. The authors
found more than ten times as many transcriptional "start" sites
associated with regulatory sequences as there are genes known to reside
in the target loci. Moreover, about 50% of the time "gene" sequences did
not appear to be any more evolutionarily conserved than non-coding DNA,
suggesting that evolution is operating not at the level of "conserved"
genes but on interspersed elements, whose regulation-in-context (i.e.,
within the surrounding non-coding DNA) is also subject to
evolutionary pressures. Alternatively, this type of element-selection
evolution could result in a plurality of alternative elements making up
any particular portion of an encoded protein, and thus represent a
"warehouse for natural selection."

These results have interesting consequences for protecting expressed
sequence tag (EST) sequences, since if confirmed, they strike at the
underlying rationales for the utility of such sequences. Using the
traditional paradigm of discrete "gene" sequences being the templates
for transcription, the existence of an EST in a tissue, particularly
the differential expression of a particular EST in a tissue, was
assumed to be significant and reflect a cell, tissue, or organ-specific
gene expression event. If, on the other hand, there is a more general
level of transcription, the assumption of utility with which ESTs have
been imbued is at best highly questionable.

These results also reinforce the message from sequencing the human genome by the Human Genome Project
(HGP) at the turn of the century that we are at the beginning, not the
end, of the road towards understanding how the decoded sequence
information is organized and used by the cell. This report, like
others arising directly from the HGP, indicates that mammalian genomes
are much more complex than bacterial genomes and depend more upon
assortment, shuffling, and RNA tailoring (splicing, etc.) for mediating
gene expression. Indeed, the concepts of gene transcription elucidated
over the past 40 years in lower organisms are likely to be seriously
inadequate for understanding mammalian cell biology. As a consequence,
a paradigm shift is underway to accommodate the realities of
mammalian cell biology revealed by reports such as the ENCODE report, and to adapt
our thinking about mammalian gene expression and genome structure to
conform to our DNA and not the other way around.

ENCODE data can be accessed here.
