From the Genome to the Proteome
Cells are the fundamental working units of every living system. All the
instructions needed to direct their activities are contained within
the chemical DNA (deoxyribonucleic acid).
DNA from all organisms is made up of the same chemical and physical
components. The DNA sequence is the particular side-by-side arrangement
of bases along the DNA strand (e.g., ATTCCGGA). This order spells
out the exact instructions required to create a particular organism
with its own unique traits.
A genome is an organism’s complete set of DNA. Genomes vary widely in size:
the smallest known genome for a free-living organism (a bacterium)
contains about 600,000 DNA base pairs, while human and mouse genomes
have some 3 billion. Except for mature red blood cells, all human
cells contain a complete genome.
DNA in the human genome
is arranged into 24 distinct chromosomes--physically
separate molecules that range in length from about 50 million to
250 million base pairs. A few types of major chromosomal abnormalities,
including missing or extra copies or gross breaks and rejoinings
(translocations), can be detected by microscopic examination. Most
changes in DNA, however, are more subtle and require a closer analysis
of the DNA molecule to find perhaps single-base differences.
Each chromosome contains
many genes, the basic physical
and functional units of heredity. Genes are specific sequences of
bases that encode instructions on how to make proteins. Genes comprise
only about 2% of the human genome; the remainder consists of noncoding
regions, whose functions may include providing chromosomal structural
integrity and regulating where, when, and in what quantity proteins
are made. The human genome is estimated to contain 20,000-25,000 genes.
Although genes get a
lot of attention, it’s the proteins
that perform most life functions and even make up the majority of
cellular structures. Proteins are large, complex molecules made
up of smaller subunits called amino acids. Chemical properties that
distinguish the 20 different amino acids cause the protein chains
to fold up into specific three-dimensional structures that define
their particular functions in the cell.
The constellation of
all proteins in a cell is called its proteome.
Unlike the relatively unchanging genome, the dynamic proteome changes
from minute to minute in response to tens of thousands of intra-
and extracellular environmental signals. A protein’s chemistry and
behavior are specified by the gene sequence and by the number and
identities of other proteins made in the same cell at the same time
and with which it associates and reacts. Studies to explore protein
structure and activities, known as proteomics, will be the focus
of much research for decades to come and will help elucidate the
molecular basis of health and disease.
How is genome sequencing done?
- Chromosomes, which range in size from 50 million to 250 million bases,
must first be broken into much shorter pieces (subcloning step).
- Each short piece is used as a template to generate a set of fragments
that differ in length from each other by a single base that will be
identified in a later step (template preparation and sequencing
reaction steps).
- The fragments in a set are separated by gel electrophoresis (separation
step). New fluorescent dyes allow separation of all four fragments in a
single lane on the gel.
- The final base at the end of each fragment is identified (base-calling
step). This process recreates the original sequence of As, Ts, Cs,
and Gs for each short piece generated in the first step.
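The template-preparation, separation, and base-calling steps above can be sketched in a few lines of Python. This is a toy model under simplifying assumptions, not a description of real sequencer software: each fragment is reduced to its length and the identity of its final, dye-labeled base, and the names `make_fragments` and `call_bases` are hypothetical.

```python
import random

def make_fragments(piece):
    """Simulate the sequencing reaction for one short piece.

    Returns one (length, final_base) pair per position, so the
    fragments differ in length from each other by a single base.
    """
    return [(i + 1, base) for i, base in enumerate(piece)]

def call_bases(fragments):
    """Order fragments by length (the job of electrophoresis) and
    read off the final base of each one (base calling)."""
    ordered = sorted(fragments, key=lambda frag: frag[0])
    return "".join(base for _, base in ordered)

fragments = make_fragments("ATTCCGGA")
random.shuffle(fragments)          # the reaction yields an unordered mixture
recovered = call_bases(fragments)  # sorting by length restores the order
```

Sorting by length stands in for the gel: once the fragments are ordered from shortest to longest, the sequence of final bases is the sequence of the original piece.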
Current electrophoresis limits are about 500 to 700 bases sequenced
per read. Automated sequencers analyze the resulting electropherograms
and the output is a four-color chromatogram showing peaks that represent
each of the four DNA bases.
After the bases are "read," computers are used to assemble the short
sequences (in blocks of about 500 bases each, called the read length)
into long continuous stretches that are analyzed for errors, gene-coding
regions, and other characteristics.
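As a rough illustration of how short reads can be joined into longer stretches, the sketch below repeatedly merges the two reads that share the longest overlap. It is a simplified greedy approach chosen for brevity, not the assembly software used by the Human Genome Project; the function names, the toy reads, and the `min_len` threshold are assumptions, and real reads contain errors and repeats that this sketch ignores.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair of reads until no overlaps remain."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to join; several separate stretches remain
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

contigs = greedy_assemble(["ATTCCGGA", "CGGATTAC", "TTACGGT"])
```

With these three toy reads, each pair of neighbors overlaps by four bases, so the reads collapse into a single continuous stretch.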
Finished sequence is submitted to major public sequence databases,
such as GenBank. Human Genome Project sequence
data are thus made freely available to anyone around the world.
What We've Learned So Far
What Does the Draft Human Genome Sequence Tell Us?
By the Numbers
- The human genome contains 3164.7 million chemical nucleotide bases
(A, C, T, and G).
- The average gene consists of 3000 bases, but sizes vary greatly,
with the largest known human gene being dystrophin at 2.4 million bases.
- The total number of genes is estimated at 30,000, much lower
than previous estimates of 80,000 to 140,000 that had been based on
extrapolations from gene-rich areas as opposed to a composite of gene-rich
and gene-poor areas.
- Almost all (99.9%) nucleotide bases are exactly the same in all people.
- The functions are unknown for over 50% of discovered genes.
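The figures above lend themselves to some quick arithmetic. The sketch below uses only the numbers quoted in this list; the variable names are illustrative, and integer arithmetic is used to avoid rounding surprises.

```python
GENOME_BASES = 3_164_700_000   # 3164.7 million bases
SHARED_PER_THOUSAND = 999      # 99.9% of bases are the same in all people

# Bases that can differ between two people: the remaining 0.1%.
differing_bases = GENOME_BASES * (1000 - SHARED_PER_THOUSAND) // 1000

AVERAGE_GENE_BASES = 3_000
DYSTROPHIN_BASES = 2_400_000

# How many times longer dystrophin is than the average gene.
size_ratio = DYSTROPHIN_BASES // AVERAGE_GENE_BASES
```

On these figures, about 3.2 million bases can differ from person to person, and dystrophin is 800 times the length of an average gene.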
The Wheat from the Chaff
- Less than 2% of the genome codes for proteins.
- Repeated sequences that do not code for proteins ("junk DNA") make
up at least 50% of the human genome.
- Repetitive sequences are thought to have no direct functions, but
they shed light on chromosome structure and dynamics. Over time, these
repeats reshape the genome by rearranging it, creating entirely new
genes, and modifying and reshuffling existing genes.
- During the past 50 million years, a dramatic decrease seems to have
occurred in the rate of accumulation of repeats in the human genome.
How It's Arranged
- The human genome's gene-dense "urban centers" are predominantly composed
of the DNA building blocks G and C.
- In contrast, the gene-poor "deserts" are rich in the DNA building
blocks A and T. GC- and AT-rich regions usually can be seen through
a microscope as light and dark bands on chromosomes.
- Genes appear to be concentrated in random areas along the genome,
with vast expanses of noncoding DNA between.
- Stretches of up to 30,000 C and G bases repeating over and over often
occur adjacent to gene-rich areas, forming a barrier between the genes
and the "junk DNA." These CpG islands are believed to help regulate
gene activity.
- Chromosome 1 has the most genes (2968), and the Y chromosome has
the fewest (231).
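The contrast between GC-rich "urban centers" and AT-rich "deserts" can be made concrete by measuring the fraction of G and C bases in successive windows along a sequence. The following is a minimal sketch with hypothetical function names and a made-up 16-base sequence; real analyses use much larger windows.

```python
def gc_fraction(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window):
    """GC fraction of each consecutive, non-overlapping window of `seq`."""
    return [gc_fraction(seq[i:i + window]) for i in range(0, len(seq), window)]

# First half is pure G/C, second half pure A/T.
profile = windowed_gc("GGCCGGCCAATTAATT", 8)
```

Here `profile` comes out as `[1.0, 0.0]`: a GC-rich window followed by an AT-rich one.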
How the Human Compares with Other Organisms
- Unlike the human's seemingly random distribution of gene-rich areas,
many other organisms' genomes are more uniform, with genes evenly spaced
throughout.
- Humans have on average three times as many kinds of proteins as the
fly or worm because of mRNA transcript "alternative splicing" and chemical
modifications to the proteins. This process can yield different protein
products from the same gene.
- Humans share most of the same protein families with worms, flies,
and plants, but the number of gene family members has expanded in humans,
especially in proteins involved in development and immunity.
- The human genome has a much greater portion (50%) of repeat sequences
than the mustard weed (11%), the worm (7%), and the fly (3%).
- Although humans appear to have stopped accumulating repeated DNA
over 50 million years ago, there seems to be no such decline in rodents.
This may account for some of the fundamental differences between hominids
and rodents, although gene estimates are similar in these species. Scientists
have proposed many theories to explain evolutionary contrasts between
humans and other organisms, including those of life span, litter sizes,
inbreeding, and genetic drift.
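To see how alternative splicing lets one gene yield several protein products, the sketch below enumerates the transcripts obtained by keeping or skipping each optional exon. It is a deliberately simplified model: the exon labels and the function name `splice_variants` are hypothetical, and in a real cell splicing is regulated, so not every combination is necessarily produced.

```python
from itertools import product

def splice_variants(exons, optional):
    """List the transcripts formed by keeping or skipping optional exons.

    exons    -- exon labels in their order along the gene
    optional -- indices of exons that may be skipped
    """
    optional = sorted(optional)
    variants = set()
    for keep_flags in product([True, False], repeat=len(optional)):
        skipped = {i for i, keep in zip(optional, keep_flags) if not keep}
        kept = [e for i, e in enumerate(exons) if i not in skipped]
        variants.add("-".join(kept))
    return sorted(variants)

# A gene with four exons, the middle two of which can be skipped.
variants = splice_variants(["E1", "E2", "E3", "E4"], optional={1, 2})
```

Two optional exons give four distinct transcripts from a single gene, which is the kind of multiplication the list above attributes to alternative splicing.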
Variations and Mutations
- Scientists have identified about 1.4 million locations where single-base
DNA differences (SNPs) occur in humans. This information promises to
revolutionize the processes of finding chromosomal locations for disease-associated
sequences and tracing human history.
- The ratio of germline (sperm or egg cell) mutations is 2:1 in males
vs females. Researchers point to several reasons for the higher mutation
rate in the male germline, including the greater number of cell divisions
required for sperm formation than for eggs.
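A single-base difference (SNP) is simply a position at which two otherwise matching sequences carry different bases. The sketch below compares two sequences that are assumed to be already aligned and of equal length; the function name `find_snps` and the example sequences are illustrative only.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every single-base difference.

    Positions are 0-based. Both sequences must already be aligned.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, a, b)
        for pos, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]

snps = find_snps("ATTCCGGA", "ATTCTGGA")
```

For these two eight-base sequences the only difference is at position 4, where a C in the first sequence is a T in the second.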
Applications, Future Challenges
Deriving meaningful knowledge from the DNA sequence will define research
through the coming decades to inform our understanding of biological systems.
This enormous task will require the expertise and creativity of tens of
thousands of scientists from varied disciplines in both the public and private
sectors.
The draft sequence already is having an impact on finding genes associated
with disease. A number of genes have been pinpointed and associated with
cancer, muscle disease, deafness, and blindness. Additionally, finding
the DNA sequences underlying such common diseases as cardiovascular disease,
diabetes, arthritis, and cancers is being aided by the human variation
maps (SNPs) generated in the HGP in cooperation with the private sector.
These genes and SNPs provide focused targets for the development of effective
new therapies.
One of the greatest impacts of having the sequence may well be in enabling
an entirely new approach to biological research. In the past, researchers
studied one or a few genes at a time. With whole-genome sequences and
new high-throughput technologies, they can approach questions systematically
and on a grand scale. They can study all the genes in a genome, for example,
or all the transcripts in a particular tissue or organ or tumor, or how
tens of thousands of genes and proteins work together in interconnected
networks to orchestrate the chemistry of life.
The Next Step: Functional Genomics
The words of Winston Churchill, spoken in 1942 after 3 years of war,
capture well the HGP era: "Now this is not the end. It is not even the
beginning of the end. But it is, perhaps, the end of the beginning."
The avalanche of genome data grows daily. The new challenge will be to
use this vast reservoir of data to explore how DNA and proteins work with
each other and the environment to create complex, dynamic living systems.
Systematic studies of function on a grand scale (functional genomics) will
be the focus of biological explorations in this century and beyond. These
explorations will encompass studies in transcriptomics, proteomics, structural
genomics, new experimental methodologies, and comparative genomics.
- Transcriptomics involves large-scale analysis of messenger
RNAs transcribed from active genes to follow when, where, and under
what conditions genes are expressed.
- Studying protein expression and function--or proteomics--can
bring researchers closer to what's actually happening in the cell than
gene-expression studies. This capability has applications to drug design.
- Structural genomics initiatives are being launched worldwide
to generate the 3-D structures of one or more proteins from each protein
family, thus offering clues to function and biological targets for drug
design.
- Experimental methods for understanding the function of DNA sequences
and the proteins they encode include knockout studies to inactivate
genes in living organisms and monitor any changes that could reveal
the function of those genes.
- Comparative genomics, the side-by-side analysis of DNA sequence patterns
of humans and well-studied model organisms, has become one of the
most powerful strategies for identifying human genes and interpreting
their function.