About the NCBI RefSeqGene Project

RefSeqGene, a subset of NCBI's Reference Sequence (RefSeq) project, defines genomic sequences to be used as reference standards for well-characterized genes. These sequences, labeled with the keyword RefSeqGene in NCBI's nucleotide database, serve as a stable foundation for reporting mutations, for establishing conventions for numbering exons and introns, and for defining the coordinates of other variations. RefSeq mRNA and protein sequences have long been used for this purpose, but have the obvious weakness of not providing explicit coordinates for flanking or intronic sequence. RefSeq chromosome sequences do provide explicit coordinates regardless of their relationship to any gene annotation, but have awkwardly large coordinate values that change whenever the sequence is updated because of a re-assembly.

Sequences of the RefSeqGene project counter both of these drawbacks by providing a more stable, gene-specific genomic sequence for each gene, including upstream and downstream flanking regions. If modifications must be made to any RefSeqGene sequence, it will be versioned and tools will be provided to facilitate conversion of coordinates. The RefSeqGene sequences are aligned to reference chromosomes, and because they are re-aligned when an assembly is updated, both current and previous chromosome coordinates are available. The Clinical Remap tool makes that conversion easy.
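The relationship between gene-specific and chromosome coordinates can be illustrated with a minimal Python sketch. It assumes the simplest possible case: a RefSeqGene sequence aligned to a chromosome without gaps, described only by a start, an end, and a strand. The names Placement, gene_to_chrom, and chrom_to_gene, and all coordinate values, are hypothetical and chosen for illustration; this is not the Clinical Remap tool or any NCBI interface, and real alignments may contain gaps or mismatches that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """Hypothetical gapless alignment of a gene-specific sequence to a chromosome."""
    chrom_start: int  # 1-based, inclusive
    chrom_end: int    # 1-based, inclusive
    strand: str       # "+" or "-"

    def __post_init__(self):
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        if self.chrom_end < self.chrom_start:
            raise ValueError("chrom_end must not be less than chrom_start")

    @property
    def length(self) -> int:
        return self.chrom_end - self.chrom_start + 1


def gene_to_chrom(placement: Placement, gene_pos: int) -> int:
    """Convert a 1-based gene-specific position to a chromosome position."""
    if not 1 <= gene_pos <= placement.length:
        raise ValueError("gene position lies outside the aligned sequence")
    if placement.strand == "+":
        return placement.chrom_start + gene_pos - 1
    # On the minus strand, gene position 1 corresponds to the chromosome end.
    return placement.chrom_end - gene_pos + 1


def chrom_to_gene(placement: Placement, chrom_pos: int) -> int:
    """Convert a chromosome position back to a 1-based gene-specific position."""
    if not placement.chrom_start <= chrom_pos <= placement.chrom_end:
        raise ValueError("chromosome position lies outside the aligned span")
    if placement.strand == "+":
        return chrom_pos - placement.chrom_start + 1
    return placement.chrom_end - chrom_pos + 1


# Hypothetical placements of the same gene-specific sequence on two assemblies.
previous_assembly = Placement(chrom_start=1_000_001, chrom_end=1_050_000, strand="+")
current_assembly = Placement(chrom_start=1_200_501, chrom_end=1_250_500, strand="+")

# The gene-specific coordinate stays fixed; only the chromosome coordinate moves.
variant_gene_pos = 5001
previous_chrom_pos = gene_to_chrom(previous_assembly, variant_gene_pos)
current_chrom_pos = gene_to_chrom(current_assembly, variant_gene_pos)
```

In this sketch a variant reported at gene-specific position 5001 keeps that coordinate across both assemblies, while its chromosome coordinate differs between them, which is the stability argument made above.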

The RefSeqGene project is an active member of the Locus Reference Genomic project.

The RefSeqGene project gratefully acknowledges the leadership and interest of Dr. M.L. Gulley and the Molecular Pathology Resource Committee of the College of American Pathologists.

See also M.L. Gulley et al., Clinical laboratory reports in molecular pathology.

Sequence Selection

Sequences in the RefSeqGene set are well-supported and, to the extent possible, represent a prevalent, 'standard' allele.

Criterion 1. Well-supported

The default implementation of 'well-supported genomic sequence' is the sequence from the public reference assembly. The rationale for this definition is the quality of the genomic product. If the current public reference assembly is not well-supported for a given gene, an alternate sequence will be selected, in consultation with gene-specific experts as available. When feasible, RefSeqGene sequences will be derived from a single clone, based on the assumption that no sequence errors were introduced in cloning, and that a single insert represents an example of a naturally occurring haplotype.

Another aspect of 'well-supported' is the placement of exons and coding regions on the RefSeqGene. The mRNAs used to define the exons are selected in consultation with locus-specific database curators and other domain experts. Almost all have also been reviewed by the Consensus CDS (CCDS) project, so the placement of splice junctions and of the initiation and termination codons is well-supported.

Criterion 2. Standard allele

The default implementation of 'standard allele' is the sequence from the public reference assembly. If, however, there is published evidence, evidence from locus-specific databases, or evidence from clinical testers that the sequence in the reference assembly is not standard, the RefSeqGene sequence will be constructed from an alternate source sequence, or locally modified.

Last updated: Sat, 2011-09-10 10:05