NCBI Reference Sequence (RefSeq)

RefSeq Production Processes

RefSeq records are derived from publicly available sequence data; varying levels of validation, additional annotation, and manual curation are applied to the RefSeq record. NCBI Reference Sequences are provided through the separate processes described below.

This page provides a brief overview of the RefSeq production processes. Also see:
NCBI Handbook, RefSeq chapter
NCBI Handbook, Genome Annotation chapter
Genome Annotation Pipeline

Collaboration

For some organisms, the annotated RefSeq records are provided by collaborating groups. Depending on the organism, collaborations may be established at the whole-genome level, or smaller collaborations may be established for gene families.

Whole-genome collaborations include records for Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. When such a collaboration is established, the primary sequence-level review is carried out by the collaborating group. Processing of annotated genome data submitted by collaborators is semi-automated: the collaborating group provides the data, and NCBI validates it to detect obvious errors (e.g., an annotated CDS location that cannot encode the provided protein) and to apply the annotation more uniformly. NCBI processing may integrate additional information such as nomenclature or other descriptive data. NCBI staff do not carry out additional manual curation of these records. NCBI may update a record to correct a general format problem, but otherwise these records are updated only when the collaborating group provides an update. If errors are reported, NCBI staff relay that information to the collaborating group.
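
The kind of check mentioned above (an annotated CDS that cannot encode the provided protein) can be illustrated with a minimal sketch. This is not NCBI's validation code. The function names are hypothetical, and the sketch assumes a plus-strand, unspliced CDS, 0-based half-open coordinates, and the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence; unknown codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )


def cds_encodes_protein(genomic, start, end, protein):
    """Return True if genomic[start:end] translates to the given protein.

    Assumption: a single plus-strand interval with no introns.
    A trailing stop codon is ignored on both sides of the comparison.
    """
    try:
        translated = translate(genomic[start:end])
    except ValueError:
        return False
    return translated.rstrip("*") == protein.rstrip("*")


# Made-up example sequence, for illustration only.
genomic = "CCATGGCCTAAGG"
print(cds_encodes_protein(genomic, 2, 11, "MA"))  # correct frame
print(cds_encodes_protein(genomic, 3, 12, "MA"))  # shifted by one base
```

A real check would also need to handle spliced CDS features, the minus strand, alternative genetic codes, and exceptions such as selenoproteins.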

RefSeq records that are supplied by collaboration indicate the submitting group on the record, as a direct submission Reference citation, in the COMMENT block, or both. The RefSeq status (e.g., REVIEWED) is either indicated by the collaborating group or inferred from the supplied annotation.

Genome Assembly & Annotation Pipeline

NCBI provides annotation for some assembled genomic sequence data, including human, mouse, rat, honey bee, chicken, and chimpanzee, among others. This pipeline is automated and the data is refreshed periodically. The model RefSeq records produced by this pipeline have a distinguishing accession prefix (XM, XR, XP), are derived from the genomic sequence, have varying levels of transcript or protein homology support, and are not subject to further manual curation.
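
Because model records are distinguished by accession prefix, they can be separated from other records with a simple string test. The sketch below is illustrative only: the function name is hypothetical, the accession layout (two-letter prefix, underscore, digits, optional version) is an assumption about the common RefSeq format, and the molecule types in the comments follow general RefSeq accession conventions rather than this page.

```python
import re

# Prefixes named above for model records produced by the annotation pipeline.
MODEL_PREFIXES = {
    "XM",  # model transcript (mRNA)
    "XR",  # model non-coding RNA
    "XP",  # model protein
}

# Assumed layout: PREFIX_digits with an optional .version suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def is_model_accession(accession):
    """Return True if the accession carries a model-record prefix."""
    match = ACCESSION_RE.match(accession.strip())
    return bool(match) and match.group(1) in MODEL_PREFIXES


# Placeholder accessions, not real records.
for acc in ("XM_000000.1", "XP_000000", "NM_000000.1"):
    print(acc, is_model_accession(acc))
```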

Also see:
NCBI Handbook, Annotation chapter
Genome Assembly & Annotation Build Pipeline
Gnomon description

Curation by NCBI Staff

RefSeq transcript and protein records for a subset of organisms, primarily mammals, are curated by NCBI staff. Curation is an ongoing process and some records have not been reviewed yet; the curation status is indicated on the RefSeq record in the COMMENT block. Some records representing genomic regions (accession prefix NG_) are provided specifically to support more comprehensive genome-level annotation. The curated RefSeq records are created via a process that includes automated computational methods, collaboration, and manual data review by NCBI staff. This process is further described in the NCBI Handbook, RefSeq chapter.

A combined approach uses both collaborator-supplied sequence information and automated BLAST analysis to provide an initial RefSeq record. Records are subject to validation to correct annotation errors and to provide annotation in a more consistent format. Descriptive information, including Official Nomenclature and additional citations, is applied to the records. These initial records have a PROVISIONAL, PREDICTED, or INFERRED status.

Additional manual curation is applied to this set of RefSeq records to provide the optimal sequence record and to fix sequence errors, including mis-association with a locus (as might occur for closely related gene families), chimeric sequences, vector or linker contamination, and apparent sequencing errors. Both the nucleotide and the protein sequence record may change as a result. Sequence-level review is carried out primarily by NCBI staff, but some records are provided via collaboration. Records that have undergone this sequence-level review have a VALIDATED status. Additional annotation, a summary description, and other functional information may be applied, as available, during the sequence review process; records that have received this additional annotation have a REVIEWED status.
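
Since the curation status is carried in the COMMENT block, it can be read programmatically. The following is a minimal sketch, not an official parser. It assumes the status appears in the COMMENT text as a status keyword followed by "REFSEQ" (for example, "REVIEWED REFSEQ: ..."), and the function name is hypothetical.

```python
import re

# Status values named in the text above.
KNOWN_STATUSES = ("REVIEWED", "VALIDATED", "PROVISIONAL", "PREDICTED", "INFERRED")

# Assumed pattern: "<STATUS> REFSEQ" somewhere in the COMMENT text.
STATUS_RE = re.compile(r"\b(" + "|".join(KNOWN_STATUSES) + r")\s+REFSEQ\b")


def refseq_status(comment_text):
    """Return the status keyword found in a COMMENT block, or None."""
    match = STATUS_RE.search(comment_text)
    return match.group(1) if match else None


# Illustrative COMMENT text, not copied from a real record.
comment = "REVIEWED REFSEQ: This record has been curated by NCBI staff."
print(refseq_status(comment))
```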

The process flow includes the following steps:

Initial Automatic Processing:
- Automatic processing and FTP downloads from collaborating groups provide an initial definition of the gene and sequence associations
- Validation and QA evaluations check for data conflicts and data completeness
- If the data pass the QA phase, a RefSeq record is provided automatically. The initial RefSeq record will have a status of INFERRED, PREDICTED, or PROVISIONAL and may include enhanced feature annotation including:
  - Publications
  - Names, Symbols, Aliases
  - GeneID number
  - cross-references to other databases
  - Map information
Curation Processing (QA failures and other genes):
- Gather available data
- Review gene-to-sequence associations: data conflicts are resolved through NCBI staff review, working with collaborating databases; this review process is critical for accurately representing closely related genes.
- Curation may provide further enhancements to the RefSeq transcript and/or protein records including:
  - Sequence information
    - remove vector, linker contamination
    - extend UTR
    - represent the optimal sequence by correcting sequencing errors or choosing which polymorphic variant to represent, as identified in published reports, by in-house sequence analysis, or through personal communication
    - represent splice variant records when there is sufficient unambiguous data available
  - Annotation information:
    - Add publications
    - Add a summary description about the gene and protein function
    - Add a description of transcript variants
    - Add feature annotations such as mature protein products, poly-adenylation signals and sites
    - ensure correct representation of atypical biology such as selenoproteins, ribosomal slippage, or non-AUG translation initiation sites.
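
The two-branch flow above (automatic release on passing QA, otherwise curation) can be sketched as a simple routing step. This is an illustration under stated assumptions, not NCBI's pipeline: the Candidate class, its fields, and the QA rule (at least one associated sequence and no recorded conflicts) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """Hypothetical gene definition produced by initial automatic processing."""
    gene_id: int
    sequences: list = field(default_factory=list)  # associated accessions
    conflicts: list = field(default_factory=list)  # data conflicts found by QA


def passes_qa(candidate):
    """Assumed QA rule: data are complete and free of conflicts."""
    return bool(candidate.sequences) and not candidate.conflicts


def route(candidates):
    """Split candidates into automatic release and the curation queue."""
    released, curation_queue = [], []
    for candidate in candidates:
        if passes_qa(candidate):
            released.append(candidate)
        else:
            curation_queue.append(candidate)
    return released, curation_queue


# Placeholder data for illustration.
batch = [
    Candidate(1, sequences=["ACC_A"]),
    Candidate(2, sequences=["ACC_B"], conflicts=["maps to two loci"]),
    Candidate(3),
]
released, curation_queue = route(batch)
print([c.gene_id for c in released], [c.gene_id for c in curation_queue])
```

The sketch does not assign INFERRED, PREDICTED, or PROVISIONAL, because the criteria for choosing among these statuses are not described here.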

Multiple collaborations support this process.

Since there is a strong manual curation component in this pipeline, input from the research community is especially welcome to further improve the quality of this dataset. The RefSeq records generated by this pipeline are used as a reagent in the genome assembly & annotation pipeline (see above).

Also see:
NCBI Handbook, RefSeq chapter

Entrez Genomes Pipeline

The Entrez Genomes pipeline provides RefSeq records for bacteria, viruses, organelles, plasmids, and other organisms including Saccharomyces cerevisiae, Arabidopsis thaliana, Plasmodium falciparum, and Leishmania major. Drosophila melanogaster and Caenorhabditis elegans RefSeq records are provided through collaboration, processing in the Entrez Genomes pipeline, and processing in the NCBI curation-supported pipeline. Records for additional species are added to the collection as sufficient sequence data becomes available. This pipeline relies on automatic computation, collaboration, and in-house expert analysis to provide records at several levels of curation. These RefSeq records undergo an initial automated validation process before being released. The validation step checks for data errors and provides consistent feature annotation. If more than one genomic sequence is available for a genome, one is selected for use as the RefSeq standard. This selection takes into account various factors, including level of annotation, strain information, and community input.

Also see:
Microbial Genomes
Organelles
Viral Genomes
Plant Genomes

Last updated March 8, 2012
