Annotation Information

Information on NCBI genome annotation methods.

I. Overview

As genome sequence data become available for an organism, NCBI staff work to provide the data as a reference sequence (RefSeq) for display in the Map Viewer. We have developed several protocols to reach this goal and rely heavily on collaboration with genome-specific research groups whenever possible. NCBI provides various levels of computation, analysis, and curation as needed per organism. For instance, most genomes are assembled by an external group and annotated via the NCBI annotation pipeline (e.g. rat, bee, fly). For the human and mouse genomes, NCBI computes the assembly in collaboration with the international sequencing consortium; NCBI and other external groups independently provide annotation on the assembled genome. For other genomes, such as Drosophila melanogaster, the NCBI RefSeqs represent the assembly and annotation as provided by the fly sequencing consortium.

NCBI provides reference sequence (RefSeq) records that represent assemblies of genomic sequence data and the corresponding RNA and protein sequences. For externally provided assemblies, there is no guarantee that the genomic RefSeq will correspond exactly to the submitted assembly, because contaminants are screened out of the RefSeq version (see the Assembly page for more information). The NCBI annotation pipeline annotates the genomic RefSeq data with features such as genes, RNAs, proteins, variation (SNPs), STS markers, and FISH-mapped clones. All sequences (genomic, RNAs, proteins) are available for customized BLAST searches. BLAST results, as well as the sequence features, can be displayed in NCBI's Map Viewer.

II. Feature Annotation

The annotation process identifies sequence features on the contigs such as variation, sequence tagged sites, FISH-mapped clone regions, transcript alignments, known and predicted genes, and gene models. This stage provides contig, RNA, and protein records with added feature annotation. In addition, organism-specific features, such as Gene Trap clones for mouse, are also annotated.

Clone Features

Human FISH-mapped clones (4, 5) are annotated on the human genome by aligning their sequence tags on the contigs using MegaBLAST (1) and e-PCR (3) analysis. Sequence tags are in the form of either GenBank Accession numbers from the draft or finished clone insert sequence, GenBank Accession numbers of BAC-end sequences, or STS markers determined by PCR and hybridization experiments.

Currently, we annotate human clones that have been mapped by fluorescence in situ hybridization (FISH) by the BAC Resource Consortium. These data provide a means to determine the correspondence between the sequence and the cytogenetic coordinate systems.

In addition, clones that have been end sequenced (6) are annotated by aligning their BAC-end sequences to the assembly using MegaBLAST (1).

STS Features

Electronic PCR (ePCR) (3) is used to place STS primer pairs, stored in UniSTS, on the contigs by looking for consistency between the determined product size and the reported size.
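The size-consistency check described above can be illustrated with a minimal sketch. It is not the e-PCR program itself: it assumes exact primer matches, a single strand, and a hypothetical `margin` parameter for the allowed difference between the observed and reported product sizes. All function and parameter names are invented for illustration.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_all(haystack: str, needle: str) -> List[int]:
    """Return the start index of every (possibly overlapping) exact match."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def place_sts(contig: str, forward: str, reverse: str,
              reported_size: int, margin: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) spans, 0-based and end-exclusive, where both
    primers match and the product size is within `margin` of the reported size.

    Assumption: exact matching only; the real tool is more tolerant.
    """
    contig = contig.upper()
    fwd = forward.upper()
    # The reverse primer anneals to the opposite strand, so on the contig
    # strand we look for its reverse complement.
    rev_site = reverse_complement(reverse.upper())
    hits = []
    for f in find_all(contig, fwd):
        for r in find_all(contig, rev_site):
            end = r + len(rev_site)
            observed_size = end - f
            # The reverse site must lie downstream of the forward primer.
            if r >= f + len(fwd) and abs(observed_size - reported_size) <= margin:
                hits.append((f, end))
    return hits
```

For example, with a forward primer found at position 2 and the reverse site ending at position 34, the observed product size is 32; the marker is placed only if the reported size is within the chosen margin of that value.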

Variation

Variations in dbSNP are mapped to the Genome Assembly by BLAST homology. Hits are recorded as high confidence if 95% of the flanking sequence is returned in the alignment with 0-6 mismatches. If no high confidence hits are observed, hits are recorded as low confidence if 75% of the flanking sequence is returned in the alignment with < 3% mismatches.
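The two-tier rule above can be written out as a small sketch. It is an illustration only, under two assumptions that the text does not state: the 95% and 75% figures are read as minimums, and the 3% mismatch limit is taken relative to the aligned length. The `Hit` record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    contig: str
    flank_length: int     # length of the SNP flanking sequence
    aligned_length: int   # flanking bases returned in the alignment
    mismatches: int


def is_high_confidence(hit: Hit) -> bool:
    # At least 95% of the flank aligned, with 0-6 mismatches.
    return hit.aligned_length * 100 >= 95 * hit.flank_length and hit.mismatches <= 6


def is_low_confidence(hit: Hit) -> bool:
    # At least 75% of the flank aligned, with fewer than 3% mismatches
    # (assumption: measured against the aligned length).
    return (hit.aligned_length * 100 >= 75 * hit.flank_length
            and hit.mismatches * 100 < 3 * hit.aligned_length)


def classify_hits(hits: List[Hit]) -> Dict[str, List[Hit]]:
    """Record low-confidence hits only when no high-confidence hit exists."""
    high = [h for h in hits if is_high_confidence(h)]
    if high:
        return {"high": high, "low": []}
    return {"high": [], "low": [h for h in hits if is_low_confidence(h)]}
```

Integer arithmetic is used for the percentage comparisons so that boundary cases (exactly 95% or 75%) are not affected by floating-point rounding.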

Variation annotation in the Map Viewer reports overall mapping quality as the number of chromosomes hit, the number of contigs hit, and the total number of hits to the genome. SNPs with ambiguous map positions are annotated with a warning when the Variation map is the master map. Complete mapping information is available from both the dbSNP web site and FTP site.

Gene, Transcript and Protein features

Genes are annotated using both (i) RefSeq transcript alignments and (ii) Gnomon prediction in those regions not covered by RefSeq alignments. The annotation includes coding transcripts, pseudogenes, and non-coding transcripts, which are represented as "misc_RNA" features.

RefSeq transcript alignments:

A first set of known genes (and their corresponding transcripts and proteins) is identified by aligning reference sequences (RefSeq) to the assembled genomic sequence using SPLIGN (7) and assembling the hits according to limited constraints and heuristics regarding exon structure. Transcript models are reconstructed by attempting to settle disagreements between individual sequence alignments without using an a priori model (such as codon usage, initiation, or polyA signals). Although such a model is not used, information generated during a build (including predictions from Gnomon) is used to improve the RefSeqs themselves.

Alternate RefSeq models derived from the available sequence data are grouped under the same gene when they share one or more exons on the same strand.
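The grouping rule above amounts to clustering models that are linked, directly or through other models, by a shared exon. A minimal sketch follows, assuming that "sharing an exon" means identical start and end coordinates on the same strand; the pipeline's actual criterion is not given in the text, and the data layout and names here are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]             # (start, end) on the contig
Model = Tuple[str, List[Exon]]     # (strand, exons)


def group_models_into_genes(models: Dict[str, Model]) -> List[Set[str]]:
    """Group model names so that models sharing an exon on the same
    strand (directly or transitively) end up under one gene."""
    parent = {name: name for name in models}

    def find(name: str) -> str:
        # Follow parent links to the group representative, with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a: str, b: str) -> None:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_model_with_exon: Dict[Tuple[str, Exon], str] = {}
    for name, (strand, exons) in models.items():
        for exon in exons:
            key = (strand, exon)
            if key in first_model_with_exon:
                union(first_model_with_exon[key], name)
            else:
                first_model_with_exon[key] = name

    genes: Dict[str, Set[str]] = {}
    for name in models:
        genes.setdefault(find(name), set()).add(name)
    return list(genes.values())
```

In this sketch, a model on the opposite strand is never merged with the others even when its exon coordinates coincide, and two models with no exon in common are still grouped if a third model shares an exon with each of them.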

If the defining RefSeq sequence aligns to more than one location on the genome, the best alignment is selected and annotated on the contig. If two alignments are of equal quality, both are annotated. Genes (and corresponding transcript and protein features) are annotated on the contig if the defining transcript alignment has >=95% identity and the aligned region covers >=50% of the transcript length, or at least 1000 bases.
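A minimal sketch of these placement rules is given below. It reads the threshold as "identity >= 95% and (coverage >= 50% or aligned length >= 1000 bases)", which is one plausible interpretation of the sentence above. The text does not say how alignment quality is compared, so the ranking used here (identity first, then aligned bases) is purely an assumption, as are all names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Placement:
    contig: str
    identity: float          # percent identity of the alignment
    aligned_bases: int       # transcript bases covered by the alignment
    transcript_length: int


def passes_threshold(p: Placement) -> bool:
    covers_half = 2 * p.aligned_bases >= p.transcript_length
    return p.identity >= 95.0 and (covers_half or p.aligned_bases >= 1000)


def quality(p: Placement) -> Tuple[float, int]:
    # Hypothetical ranking; the actual metric is not described in the text.
    return (p.identity, p.aligned_bases)


def select_placements(placements: List[Placement]) -> List[Placement]:
    """Keep the best passing placement, or all of them if tied for best."""
    candidates = [p for p in placements if passes_threshold(p)]
    if not candidates:
        return []
    best = max(quality(p) for p in candidates)
    return [p for p in candidates if quality(p) == best]
```

Under this reading, a 5000-base transcript with 1200 aligned bases at 97% identity passes through the 1000-base clause even though it covers less than half of the transcript.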

Gnomon prediction:

Once the RefSeqs are placed on the genome, the remainder of the supporting information includes other mRNAs, ESTs, and information on protein homologies generated from comparisons of translated regions.

Additional GenBank mRNAs and ESTs are aligned to the assembled genomic sequence using SPLIGN (7) and, together with the RefSeq alignments, are chained to merge alignments based on shared splice sites. A set of optimal self-consistent, non-overlapping transcript alignments is chosen from each regional cluster of these chained transcript alignments, using metrics of coding propensity, splice score, and protein alignments via BLASTX against filtered NR proteins (those with CDD hits or hits in distant organisms).

Transcript models are generated via a Hidden Markov Model (HMM) using transcript alignment constraints and protein hit information if available. The model allows nonconsensus splices existing in the transcript alignment, makes deletions/insertions in the sequence to compensate for the frameshifts found in the protein alignments, and suppresses stop codons found in "exons" of protein alignments. Note that models generated with frameshifts and suppressed stop codons are strong candidates for pseudogenes.

The HMM will continue through regions without constraint information and create ab initio models. These are aligned via BLASTP against filtered NR proteins and an optimal self-consistent set of protein hits is chosen based on total score. The HMM is re-run with constraints based on these protein hits. This produces the final set of Gnomon gene models; the contig annotation includes only the subset not overlapping the RefSeq-based models.
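The final filtering step, keeping only the Gnomon models that do not overlap RefSeq-based models, can be sketched as a simple interval test. This is an illustration under assumptions: overlap is tested on the full genomic span of each model with inclusive coordinates, and strand is ignored; the text does not specify either detail. The tuple layout and names are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[str, int, int]   # (model name, start, end), inclusive coordinates


def spans_overlap(a: Span, b: Span) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a[1] <= b[2] and b[1] <= a[2]


def gnomon_models_to_annotate(gnomon: List[Span], refseq: List[Span]) -> List[Span]:
    """Return the Gnomon models whose span overlaps no RefSeq-based model."""
    return [g for g in gnomon
            if not any(spans_overlap(g, r) for r in refseq)]
```

This pairwise comparison is quadratic in the number of models; for a whole contig, sorting both lists by start coordinate and sweeping once would be the usual refinement.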

Repeat Features

We use the program RepeatMasker (2) to screen genome sequences and identify interspersed repeats. Repeat libraries are defined by GIRI and are included as part of the RepeatMasker distribution.

III. Products

The NCBI Genome Annotation project provides sequences and resource support via Entrez Gene, Map Viewer and anonymous FTP.

Sequence Data

A comprehensive set of RefSeq records is provided on the FTP site. Multiple mRNA and protein RefSeqs are provided for genes when the supporting RefSeq, GenBank mRNA, and EST data support alternative splicing. Transcripts are also instantiated for some non-protein-coding genes. These records represent transcribed pseudogenes.

See the RefSeq documentation for a complete list of accession prefixes. Accessions that begin with the prefix XM_ (mRNA), XR_ (non-coding transcript), and XP_ (protein) are model reference sequences produced by NCBI's Genome Annotation project. These records represent the transcripts and proteins that are annotated on the NCBI Contigs (prefix NT_ or NW_), which may have been generated from incomplete data. Because the XM_, XR_, and XP_ accessions reflect the current state of NCBI's assembly of the genomic sequence, they may be different from GenBank submissions for mRNAs and/or the curated RefSeq records. These differences may reflect real sequence variation (polymorphism), errors in GenBank accessions used as sources for not yet reviewed (provisional) RefSeq records, or errors or gaps in the available genomic sequence. These sequences should be used with caution, after comparing any XM_ or XP_ accession to other available sequence information (check BLink, Entrez Gene, or related sequences).
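As a small illustration of the prefixes named above, the sketch below labels an accession by its prefix and flags model records. It covers only the five prefixes mentioned in this section, not the full list in the RefSeq documentation; the function names and the example accession in the usage note are made up.

```python
# Prefixes described in this section only; see the RefSeq documentation
# for the complete list.
MODEL_PREFIXES = {
    "XM_": "model mRNA",
    "XR_": "model non-coding transcript",
    "XP_": "model protein",
}
CONTIG_PREFIXES = {
    "NT_": "NCBI contig",
    "NW_": "NCBI contig",
}


def describe_accession(accession: str) -> str:
    """Return a short label for an accession based on its 3-character prefix."""
    prefix = accession.strip().upper()[:3]
    if prefix in MODEL_PREFIXES:
        return MODEL_PREFIXES[prefix]
    if prefix in CONTIG_PREFIXES:
        return CONTIG_PREFIXES[prefix]
    return "other (see the RefSeq documentation)"


def is_model_record(accession: str) -> bool:
    """True for XM_, XR_, and XP_ records produced by genome annotation."""
    return accession.strip().upper()[:3] in MODEL_PREFIXES
```

For a made-up accession such as "XM_000000.1", `describe_accession` returns "model mRNA" and `is_model_record` returns True, signalling that the sequence should be compared with other available data before use.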

Resource Support

dbSNP provides information about sequence variation including map location, alleles, frequency data, genotype data, and functional data. Report pages include links to Entrez Gene, UniSTS, GenBank, PubMed, the NCBI Map Viewer, and external submitter web sites.

Organism guide pages provide a central point of access to information about sequencing progress, NCBI resources, NIH resources, and meeting and press releases. See the Genomic Biology Page for a list of available pages.

Entrez Gene includes report pages for all genes defined by the genome annotation process. Every effort is made to provide associations to known genes; additional pages anchored on an Interim ID are provided for new genes or those that cannot be unambiguously associated with a known gene.

Map Viewer presents a graphical view of the available genome sequence data as well as non-sequence map data such as cytogenetic, genetic, physical, or radiation hybrid maps (the type and number of maps available vary by organism). The Map Viewer provides a robust query interface and interactive displays. Additional information on using the resource and on organism-specific maps including those for human and mouse is available. The Map Viewer displays may also include links to view supporting evidence (Evidence Viewer, Model Maker).

Entrez Graphical Sequence View provides a graphical overview of the GenBank Flat File plus a section of the sequence data. Annotated features are indicated on both the graphic and the sequence. The interface provides both zoom and scroll capability. This view is available for all sequence records by selecting "Graphics" from the "Display" menu; links to this graphical sequence viewer are provided from Entrez Gene and the Map Viewer (look for "sv" links). This display is particularly useful for viewing genes and other features annotated on the contigs.

RefSeq provides a non-redundant database of sequences including genomic, transcript, and protein. RefSeq transcripts are used as a reagent for genome annotation.

UniSTS provides STS marker reports on primer sequences, product size, mapping information, GenBank and RefSeq records that contain the primer sequences (as determined by Electronic PCR), and provides links to relevant resources.

IV. Data access

RefSeq contigs, model transcripts, and model proteins are fully integrated into the main NCBI resources. Thus, they can be retrieved with Entrez queries, accessed via a customized BLAST page, and are included in protein "BLink" pages, which display the results of pre-computed BLAST searches.

BLAST

The organism-specific BLAST pages (for example, Human, Mouse) provide an interface to BLAST an Accession number or FASTA-formatted sequence against the assembled genomic sequence data, as well as the RefSeq transcripts and proteins annotated on the genome.

Entrez Retrieval

The RefSeq contigs, transcripts, and proteins are retrievable with standard Entrez queries such as an Accession number, gene symbol, or protein name. You can also use the Limits settings, or make use of Entrez "properties" restrictions to further restrict the query.
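One way to script such a query is through NCBI's E-utilities, which this page does not itself describe. The sketch below only builds an ESearch URL and sends no request. The field tags and the `srcdb_refseq[Properties]` restriction are given as assumptions about Entrez query syntax and should be checked against the RefSeq query tips; the function name and default values are hypothetical.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def refseq_search_url(query: str, organism: str,
                      db: str = "nuccore", retmax: int = 20) -> str:
    """Build an ESearch URL that restricts `query` to RefSeq records
    for one organism.

    Assumptions: the [Organism] field tag and the srcdb_refseq[Properties]
    restriction are used as they are commonly written in Entrez queries.
    """
    term = f"{query} AND {organism}[Organism] AND srcdb_refseq[Properties]"
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)
```

A call such as `refseq_search_url("BRCA1[Gene Name]", "Homo sapiens")` yields a URL that can be opened in a browser or fetched with any HTTP client; the same term can also be pasted directly into the Entrez search box.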

See the RefSeq web site for Entrez query tips.

FTP

The genomes FTP site holds data generated by analysis and/or additional processing at NCBI. This site includes sequences generated by the genome build and annotation effort (contig, transcript and protein sequences) as well as Map Viewer data files. Please see the provided README files for further information.

V. References

1. MegaBLAST: Zhang Z, et al. A greedy algorithm for aligning DNA sequences. J Comput Biol. 2000 Feb-Apr;7(1-2):203-14.
2. RepeatMasker: Smit AFA, Green P.
3. ePCR: Schuler GD. Sequence mapping by electronic PCR. Genome Res. 1997 May;7(5):541-50.
4. FISH-mapped clones: The BAC Resource Consortium. Integration of cytogenetic landmarks into the draft sequence of the human genome. Nature. 2001 Feb 15;409(6822):953-8.
5. FISH-mapped clones: Kirsch IR, et al. A systematic, high-resolution linkage of the cytogenetic and physical maps of the human genome. Nat Genet. 2000 Apr;24(4):339-40.
6. Mouse BAC end sequences: Zhao S, et al. Mouse BAC ends quality assessment and sequence analyses. Genome Res. 2001 Oct;11(10):1736-45.
7. Splign: Kapustin Y, et al. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct. 2008 May 21;3:20.
