Proteins

Databases

A collection of genomics, functional genomics, and genetics studies and links to their resulting datasets. This resource describes project scope, material, and objectives and provides a mechanism to retrieve datasets that are often difficult to find due to inconsistent annotation, multiple independent submissions, and the varied nature of diverse data types which are often stored in different databases.

Database that groups biomedical literature, small molecules, and sequence data in terms of biological relationships.

A collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database.

HIV-1, Human Protein Interaction Database

A database of known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data.

A collection of related protein sequences (clusters), consisting of Reference Sequence proteins encoded by complete prokaryotic and organelle plasmids and genomes. The database provides easy access to annotation information, publications, domains, structures, external links, and analysis tools.

A database that includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

A collection of curated, non-redundant genomic DNA, transcript (RNA), and protein sequences produced by NCBI. RefSeqs provide a stable reference for genome annotation, gene identification and characterization, mutation and polymorphism analysis, expression studies, and comparative analyses. The RefSeq collection is accessed through the Nucleotide and Protein databases.

Downloads

BLAST executables for local use are provided for Solaris, LINUX, Windows, and MacOSX systems. See the README file in the ftp directory for more information. Pre-formatted databases for BLAST nucleotide, protein, and translated searches also are available for downloading under the db subdirectory.

FTP: BLAST Databases

Sequence databases for use with the stand-alone BLAST programs. The files in this directory are pre-formatted databases that are ready to use with BLAST.

FTP: CDD

This site provides full data records for CDD, along with individual Position Specific Scoring Matrices (PSSMs), mFASTA sequences and annotation data for each conserved domain. See the README file for full details.

FTP: FASTA BLAST Databases

Sequence databases in FASTA format for use with the stand-alone BLAST programs. These databases must be formatted using formatdb before they can be used with BLAST.

FTP: GenPept

The protein sequences corresponding to the translations of coding sequences (CDS) in GenBank are collected for each GenBank release..Please see the README file in the directory for more information.

FTP: Protein Clusters

This site contains data from the Protein Clusters database arranged by release date. See the README files for more information.

FTP: RefSeq

This site contains all nucleotide and protein sequence records in the Reference Sequence (RefSeq) collection. The ""release"" directory contains the most current release of the complete collection, while data for selected organisms (such as human, mouse and rat) are available in separate directories. Data are available in FASTA and flat file formats. See the README file for details.

Submissions

BioProject Submission

An online form that provides an interface for researchers, consortia and organizations to register their BioProjects. This serves as the starting point for the submission of genomic and genetic data for the study. The data does not need to be submitted at the time of BioProject registration.

Tools

A link option on protein records that displays the results of a pre-computed BLAST search of that protein against all other protein sequences at NCBI.

Finds regions of local similarity between biological sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as to help identify members of gene families.

Batch Entrez

Allows you to retrieve records from many Entrez databases by uploading a file of GI or accession numbers from the Nucleotide or Protein databases, or a file of unique identifiers from other Entrez databases. Search results can be saved in various formats directly to a local file on your computer.

COBALT

COBALT is a protein multiple sequence alignment tool that finds a collection of pairwise constraints derived from conserved domain database, protein motif database, and sequence similarity, using RPS-BLAST, BLASTP, and PHI-BLAST.

A stand-alone application for viewing 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and UNIX and can be configured to receive data from most popular web browsers. Cn3D simultaneously displays structure, sequence, and alignment, and has powerful annotation and alignment editing features.

Concise Microbial Protein BLAST

A specialized BLAST service in which the queried database consists of all proteins from complete microbial (prokaryotic) genomes. NCBI has precalculated clusters of similar proteins at the genus-level and one representative is chosen from each cluster in order to reduce the dataset, thereby reducing search time and providing a broader taxonomic view.

Identifies the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (Reverse Position-Specific BLAST) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD).

Tools that provide access to data within NCBI's Entrez system outside of the regular web query interface. They provide a method of automating Entrez tasks within software applications. Each utility performs a specialized retrieval task, and can be used simply by writing a specially formatted URL.

Open Mass Spectrometry Search Algorithm (OMSSA) Search

An efficient search engine for identifying MS/MS peptide spectra by searching libraries of known protein sequences. OMSSA scores significant hits with a probability score developed using classical hypothesis testing, the same statistical method used in BLAST.

A utility for computing alignment of proteins to genomic nucleotide sequence. It is based on a variation of the Needleman Wunsch global alignment algorithm and specifically accounts for introns and splice signals. Due to this algorithm, ProSplign is accurate in determining splice sites and tolerant to sequencing errors.

Sequence Viewer

Provides a configurable graphical display of a nucleotide or protein sequence and features that have been annotated on that sequence. In addition to use on NCBI sequence database pages, this viewer is available as an embeddable webpage component. Detailed documentation including an API Reference guide is available for developers wishing to embed the viewer in their own pages.