An Open Letter To The Mouse Genetics Community
From
Francis S. Collins, M.D., Ph.D.
National Human Genome Research Institute, NIH
e-mail fc23a@nih.gov
On Behalf of The Public Mouse Sequencing Project

The public mouse sequencing effort is robust and proceeding rapidly, but the queries we have received recently suggest considerable confusion about the state of the effort and the location of the data. We clearly have not been effective enough in communicating our plans and achievements to the wider mouse community. I would therefore like to let you know the current status of our effort.

The goals for our program were designed to develop the critical resources that were defined by the community:

The ultimate goals are a robust physical map and a high quality, finished sequence of the mouse (C57BL/6J strain).
In addition, useful intermediate products will be produced as we proceed, but in a way that will not slow the completion of the map and sequence at all.

The Washington University, Whitehead and Sanger Centers, as well as others, are aggressively working toward these goals. Let me briefly summarize the current results and upcoming products:

BAC Map. The BAC physical map of the mouse is in very good shape and is nearly complete. Over 300,000 BAC clones, drawn in approximately equal numbers from the RPCI23 (female) and RPCI24 (male) libraries, have been fingerprinted and assembled into fewer than 4300 contigs. The fingerprint data and an initial assembly into 6500 contigs are available at http://www.bcgsc.bc.ca/projects/mouse_mapping/. The more recent assembly will be posted by early August at the Wash U web site (http://genome.wustl.edu/gsc/mouse). A "User's Guide" explaining how to use these resources is currently being written and will be posted on the Wash U site as soon as it is completed. In addition, the Wash U group is attempting to link the BAC map both to human syntenic regions and to the mouse RH map, so that it will be possible to look up clones for particular regions.

Initial WGS Reads. An initial set of whole genome shotgun data, comprising over 17 million reads or about 2.5-3X coverage, has been generated. The data are available from the Trace Archive at the NCBI (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi) and the Ensembl Trace Server (http://trace.ensembl.org). The mouse reads have been compared to the human genome, and homologous reads have been laid out along the human draft sequence. They can be viewed at http://genome.cse.ucsc.edu and http://www.ensembl.org.

The mouse reads are immediately useful for both human and mouse genetics. For example, for mouse positional cloning, one can take a mouse gene in a region, look up the syntenic regions of the human genome on the UCSC or Ensembl web sites, and find all of the mouse reads for the conserved segments. One can also retrieve the mate pairs from the trace archive to improve coverage. The mate pairs are a convenient distance apart for PCR recovery of the segment. Starting with a mouse cDNA, one can search the trace repository for matches and again retrieve the mate pairs. The hits also give a good idea of related family members in the mouse genome. Here too, the read pair information can be used to recover a suitable region for generating the segments needed to construct the vectors for mouse knockouts. There are already several examples of mouse genes that have been positionally cloned using the available public information.

Assembly of Initial WGS Reads. The 2.5-3X coverage in WGS reads has been subjected to an initial assembly using assembly software developed in the public domain. This assembly is currently being assessed and will be posted on the Whitehead web site by the end of August. Based on preliminary analysis, the contigs are still relatively small (several kb) and not yet anchored. However, they will be aligned against the human genome and will be useful for experimental work as described above. We will also be matching the contigs to the BAC end sequences, and thereby expect to anchor the BAC contigs to the mouse RH and genetic maps.

Additional WGS Reads. Based on community feedback on the utility of the initial WGS reads, we will expand the data set to 5-6X coverage, using double-ended plasmid reads. We expect these data to be generated by December 2001. The larger data set will be assembled to give larger contigs and much larger scaffolds. This assembly should suffice for most positional cloning and knockout experiments.
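
As a rough illustration of why the step from 2.5-3X to 5-6X coverage matters, the sketch below uses the standard idealized Poisson model of shotgun sequencing, in which a base has probability e^(-c) of being missed by every read at coverage c. This is not the project's own analysis: it assumes reads land uniformly at random and ignores repeats, cloning bias and read quality, and the function name fraction_covered is hypothetical.

```python
import math


def fraction_covered(coverage: float) -> float:
    """Expected fraction of bases hit by at least one read.

    Idealized Poisson model (an assumption, not the project's method):
    reads are placed uniformly at random, so a given base is missed by
    every read with probability exp(-coverage).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    # Coverage levels mentioned in the letter: 2.5-3X now, 5-6X planned.
    for c in (2.5, 3.0, 5.0, 6.0):
        print(f"{c:.1f}X coverage: {fraction_covered(c):.2%} of bases "
              "expected in at least one read")
```

Under this simplified model the fraction of bases missed entirely falls from several percent at 2.5-3X to well under one percent at 5-6X, which is consistent with the expectation of larger contigs from the expanded data set.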

BAC-based Sequencing of the Genome. We will then take a tiling path of BAC clones across the genome (currently being developed from the BAC physical map) and subject the entire set to standard shotgun sequencing; the results will be combined with the WGS reads. The merged data set will provide ordered, anchored, full shotgun coverage of the mouse genome. It will be the substrate for completing the finished, high quality mouse sequence that is the ultimate goal. We anticipate that deep shotgun coverage (leaving relatively few gaps) will be achieved in about 2 years, with the exact time frame depending on the precise amount of funding available. (The sequencing capacity is in place, inasmuch as the deep shotgun phase for the human genome is essentially done.) We will then turn to finishing the sequence and closing the remaining gaps.

Sequencing of Individual BACs. NHGRI continues to provide the opportunity for investigators to obtain priority sequencing of individual BACs (or small contigs) of particularly high biological interest (http://www.nih.gov/science/models/bacsequencing/). In fact, this program is under-subscribed and has adequate capacity to accept many additional requests.

I hope that this information is useful to you and would appreciate advice from any members of this listserve as to how to make it more widely known in the mammalian genetics community. Please feel free to distribute this summary yourselves as you see fit.

To ensure better communication and good responsiveness to the community, we will shortly establish a web site containing information and pointers for the mouse community about the available resources, which we will update regularly.

Finally, let me re-emphasize that the HGP is deeply committed to generating these data as rapidly as possible and to serving the mouse community as well as possible.