DOE Genomes

Human Genome Project Information

The Science Behind the Human Genome Project
Basic Genetics, Genome Draft Sequence, and Post-Genome Science

From the Genome to the Proteome

Cells are the fundamental working units of every living system. All the instructions needed to direct their activities are contained within the chemical DNA (deoxyribonucleic acid).

DNA from all organisms is made up of the same chemical and physical components. The DNA sequence is the particular side-by-side arrangement of bases along the DNA strand (e.g., ATTCCGGA). This order spells out the exact instructions required to create a particular organism with its own unique traits.

The genome is an organism’s complete set of DNA. Genomes vary widely in size: the smallest known genome for a free-living organism (a bacterium) contains about 600,000 DNA base pairs, while human and mouse genomes have some 3 billion. Except for mature red blood cells, all human cells contain a complete genome.

DNA in the human genome is arranged into 24 distinct chromosomes--physically separate molecules that range in length from about 50 million to 250 million base pairs. A few types of major chromosomal abnormalities, including missing or extra copies or gross breaks and rejoinings (translocations), can be detected by microscopic examination. Most changes in DNA, however, are more subtle and require a closer analysis of the DNA molecule to find perhaps single-base differences.

Each chromosome contains many genes, the basic physical and functional units of heredity. Genes are specific sequences of bases that encode instructions on how to make proteins. Genes comprise only about 2% of the human genome; the remainder consists of noncoding regions, whose functions may include providing chromosomal structural integrity and regulating where, when, and in what quantity proteins are made. The human genome is estimated to contain 20,000-25,000 genes.

Although genes get a lot of attention, it’s the proteins that perform most life functions and even make up the majority of cellular structures. Proteins are large, complex molecules made up of smaller subunits called amino acids. Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.
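
As a rough illustration of how a sequence of bases can encode a protein, the sketch below translates a short DNA string into amino acids using the standard genetic code, in which each group of three bases (a codon) specifies one amino acid. The codon mechanism is not spelled out in the text above; the function name `translate` and the example sequence are invented for illustration only.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a coding-strand DNA string codon by codon until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon: end of the protein chain
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical 24-base sequence, not taken from any real gene.
print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```

The 64 possible codons map onto the 20 amino acids mentioned above plus stop signals, so several codons can specify the same amino acid.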

The constellation of all proteins in a cell is called its proteome. Unlike the relatively unchanging genome, the dynamic proteome changes from minute to minute in response to tens of thousands of intra- and extracellular environmental signals. A protein’s chemistry and behavior are specified by the gene sequence and by the number and identities of other proteins made in the same cell at the same time and with which it associates and reacts. Studies to explore protein structure and activities, known as proteomics, will be the focus of much research for decades to come and will help elucidate the molecular basis of health and disease.

How is genome sequencing done?

  • Chromosomes, which range in size from 50 million to 250 million bases, must first be broken into much shorter pieces (subcloning step).

  • Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step (template preparation and sequencing reaction steps).

  • The fragments in a set are separated by gel electrophoresis (separation step).

    New fluorescent dyes, one color for each base, allow all four sets of fragments to be separated in a single lane on the gel.

  • The final base at the end of each fragment is identified (base-calling step). This process recreates the original sequence of As, Ts, Cs, and Gs for each short piece generated in the first step.

    Electrophoresis currently limits each read to about 500 to 700 bases. Automated sequencers analyze the resulting electropherograms, and the output is a four-color chromatogram showing peaks that represent each of the four DNA bases.

    After the bases are "read," computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics.

    Researchers then put considerable effort into "finishing" the raw sequence produced by the automated sequencers.

    Finished sequence is submitted to major public sequence databases, such as GenBank. Human Genome Project sequence data are thus made freely available to anyone around the world.
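
The assembly step described above, in which computers join short reads into long continuous stretches, can be sketched with a simple greedy overlap merge. This is only a toy under stated assumptions: real reads are about 500 bases rather than 10, and real assemblers must also cope with sequencing errors, repeats, and reads from either strand. The function names `overlap` and `assemble` and the example reads are hypothetical and do not represent the software used by the Human Genome Project.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that matches a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Greedily merge the pair of sequences with the largest overlap until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # nothing left to merge
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Three hypothetical overlapping reads from one short stretch of DNA.
reads = ["ATGGCGTGCA", "GTGCAATCCG", "ATCCGTTAGC"]
print(assemble(reads))
```

Each merge keeps the shared bases only once, so the three 10-base reads collapse into a single 20-base stretch.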

For more on genome sequencing, see the Sequencing Fact Sheet.

What We've Learned So Far

What Does the Draft Human Genome Sequence Tell Us?

By the Numbers

  • The human genome contains 3164.7 million chemical nucleotide bases (A, C, T, and G).
  • The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.
  • The draft sequence put the total number of genes at about 30,000, much lower than previous estimates of 80,000 to 140,000, which had been extrapolated from gene-rich areas rather than from a composite of gene-rich and gene-poor areas. (The estimate has since been refined to 20,000-25,000.)
  • Almost all (99.9%) nucleotide bases are exactly the same in all people.
  • The functions are unknown for over 50% of discovered genes.
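
To put two of these figures in perspective, the short calculation below works out how many bases the remaining 0.1% represents and how dystrophin compares with an average gene. It uses only the numbers quoted in the list above; the variable names are arbitrary.

```python
GENOME_BASES = 3164.7e6    # bases in the human genome
SHARED_FRACTION = 0.999    # fraction of bases identical in all people
AVG_GENE_BASES = 3000      # average gene size in bases
DYSTROPHIN_BASES = 2.4e6   # largest known human gene

# Bases that fall outside the 99.9% shared by everyone.
differing = GENOME_BASES * (1 - SHARED_FRACTION)
print(f"Bases that can differ between people: about {differing / 1e6:.1f} million")

# Size of dystrophin relative to an average gene.
ratio = DYSTROPHIN_BASES / AVG_GENE_BASES
print(f"Dystrophin is about {ratio:.0f} times the average gene size")
```

So a 0.1% difference still amounts to roughly 3 million base positions.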

The Wheat from the Chaff

  • Less than 2% of the genome codes for proteins.
  • Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome.
  • Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes.
  • During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.

How It's Arranged

  • The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C.
  • In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes.
  • Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.
  • Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity.
  • Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
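
The contrast between GC-rich and AT-rich regions can be made concrete by measuring the fraction of G and C bases in consecutive windows along a sequence. The sketch below does this for a made-up 30-base string; the window size, the 0.5 threshold, and the function names are assumptions chosen for illustration, and real analyses use far larger windows.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_windows(seq: str, window: int) -> list[tuple[int, float]]:
    """GC fraction of consecutive non-overlapping windows, as (start index, fraction)."""
    return [(start, gc_fraction(seq[start:start + window]))
            for start in range(0, len(seq), window)]


# Hypothetical sequence: a GC-rich stretch, an AT-rich stretch, then a mostly GC stretch.
toy = "GCGCGGCCGC" + "ATATTAATAT" + "GGCCGCGCAT"
for start, fraction in gc_windows(toy, 10):
    label = "GC-rich" if fraction > 0.5 else "AT-rich"
    print(f"{start:>3}  {fraction:.2f}  {label}")
```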

How the Human Compares with Other Organisms

  • Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout.
  • Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. These processes can yield different protein products from the same gene.
  • Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity.
  • The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%).
  • Although the accumulation of repeated DNA in humans appears to have slowed dramatically over the past 50 million years, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.

Variations and Mutations

  • Scientists have identified about 1.4 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history.
  • Germline (sperm or egg cell) mutations occur about twice as often in males as in females, a ratio of 2:1. Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.
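
A single-base difference of the kind described above can be found by comparing two sequences position by position. The sketch below assumes the two sequences are already aligned and of equal length, which real data rarely are without an alignment step; the function name `find_snps` and both example sequences are hypothetical.

```python
def find_snps(ref: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) wherever two aligned sequences differ."""
    if len(ref) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [(pos, a, b)
            for pos, (a, b) in enumerate(zip(ref.upper(), sample.upper()), start=1)
            if a != b]


# Two made-up sequences that differ at a single base.
reference = "ATTCCGGATTACG"
individual = "ATTCCGGACTACG"
print(find_snps(reference, individual))
```

Here the only difference is at position 9, where the reference has T and the second sequence has C.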

Applications, Future Challenges

Deriving meaningful knowledge from the DNA sequence will define research for decades to come and inform our understanding of biological systems. This enormous task will require the expertise and creativity of tens of thousands of scientists from varied disciplines in both the public and private sectors worldwide.

The draft sequence already is having an impact on finding genes associated with disease. A number of genes have been pinpointed and associated with breast cancer, muscle disease, deafness, and blindness. Additionally, finding the DNA sequences underlying such common diseases as cardiovascular disease, diabetes, arthritis, and cancers is being aided by the human variation maps (SNPs) generated in the HGP in cooperation with the private sector. These genes and SNPs provide focused targets for the development of effective new therapies.

One of the greatest impacts of having the sequence may well be in enabling an entirely new approach to biological research. In the past, researchers studied one or a few genes at a time. With whole-genome sequences and new high-throughput technologies, they can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example, or all the transcripts in a particular tissue or organ or tumor, or how tens of thousands of genes and proteins work together in interconnected networks to orchestrate the chemistry of life. 

The Next Step: Functional Genomics

The words of Winston Churchill, spoken in 1942 after 3 years of war, capture well the HGP era: "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning."

The avalanche of genome data grows daily. The new challenge will be to use this vast reservoir of data to explore how DNA and proteins work with each other and the environment to create complex, dynamic living systems. Systematic studies of function on a grand scale--functional genomics--will be the focus of biological explorations in this century and beyond. These explorations will encompass studies in transcriptomics, proteomics, structural genomics, new experimental methodologies, and comparative genomics.

  • Transcriptomics involves large-scale analysis of messenger RNAs transcribed from active genes to follow when, where, and under what conditions genes are expressed.
  • Studying protein expression and function--or proteomics--can bring researchers closer to what's actually happening in the cell than gene-expression studies. This capability has applications to drug design.
  • Structural genomics initiatives are being launched worldwide to generate the 3-D structures of one or more proteins from each protein family, thus offering clues to function and biological targets for drug design.
  • Experimental methods for understanding the function of DNA sequences and the proteins they encode include knockout studies to inactivate genes in living organisms and monitor any changes that could reveal their functions.
  • Comparative genomics—analyzing DNA sequence patterns of humans and well-studied model organisms side-by-side—has become one of the most powerful strategies for identifying human genes and interpreting their function.

Last modified: Wednesday, March 26, 2008

Document Use and Credits
Publications and webpages on this site were created by the U.S. Department of Energy Genome Program's Biological and Environmental Research Information System (BERIS). Permission to use these documents is not needed, but please credit the U.S. Department of Energy Genome Programs and provide the website. All other materials were provided by third parties and not created by the U.S. Department of Energy. You must contact the person listed in the citation before using those documents.

Site sponsored by the U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Program