NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Gene Help [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2005-.

Bookshelf ID: NBK3841

Gene Help: Integrated Access to Genes of Genomes in the Reference Sequence Collection

Garth Brown, Craig Wallin, Tatiana Tatusova, Kim Pruitt, and Donna Maglott.

Created: September 13, 2006; Last Update: September 25, 2012.

Introduction

Gene supplies gene-specific connections in the nexus of map, sequence, expression, structure, function, citation, and homology data. Unique identifiers are assigned to genes with defining sequences, genes with known map positions, and genes inferred from phenotypic information. These gene identifiers are used throughout NCBI's databases and tracked through updates of annotation. Gene includes genomes represented by NCBI Reference Sequences (or RefSeqs) and is integrated for indexing, query, and retrieval through NCBI's Entrez and E-Utilities systems.

Quick Start

Gene is accessed like any other Entrez database, namely by

  • querying on any word,

  • restricting the query term to a certain field, or

  • applying filters or properties.

Here are some representative queries:

Find genes by...                         Search text
free text                                human muscular dystrophy
partial name and multiple species        transporter[title] AND ("Drosophila melanogaster"[orgn] OR "Mus musculus"[orgn])
chromosome and symbol                    (II[chr] OR 2[chr]) AND adh*[sym]
associated sequence accession number     M11313[accn]
gene name (symbol)                       BRCA1[sym]
publication (PubMed ID)                  11331580[PMID]
Gene Ontology (GO) terms or identifiers  "cell adhesion"[GO]
                                         10030[GO]
genes with variants of medical interest  gene_snp_clin[filter]
chromosome and species                   Y[CHR] AND human[ORGN]
Enzyme Commission (EC) numbers           1.9.3.1[EC]

When you look at the URLs that underlie these links, you will see that they are constructed by combining ‘http://www.ncbi.nlm.nih.gov/gene/?term=’ with a query term qualified by field names (in square brackets).
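As a minimal sketch of that construction, the snippet below appends a percent-encoded query term to the prefix quoted above. The helper name gene_query_url is hypothetical, and percent-encoding the term is an assumption based on ordinary URL rules rather than anything stated in this document.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Prefix quoted in the text above.
GENE_SEARCH_PREFIX = "http://www.ncbi.nlm.nih.gov/gene/?term="


def gene_query_url(term: str) -> str:
    """Append a percent-encoded query term to the Gene search prefix.

    Hypothetical helper for illustration; it is not part of any NCBI API.
    """
    return GENE_SEARCH_PREFIX + quote(term, safe="")


for term in ("BRCA1[sym]", "(II[chr] OR 2[chr]) AND adh*[sym]"):
    url = gene_query_url(term)
    # Decoding the query string should recover the original term unchanged.
    assert parse_qs(urlsplit(url).query)["term"] == [term]
    print(url)
```

Encoding the square brackets and spaces keeps the field qualifiers intact when the URL is passed through tools that do not tolerate raw special characters.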

How Data Are Maintained

New Records

Records are added to Gene if any of the following conditions is met:

  • A RefSeq is created for a completely sequenced genome and that record contains annotated genes. In the case of RNA viruses with polyprotein precursors, annotated proteins may be treated as equivalent to a “gene”.

  • A recognized genome-specific database provides information about genes (preferably with defining sequence) or mapped phenotypes.

  • The NCBI Genome Annotation Pipeline reports model genes.

  • A model organism is scheduled for sequencing, and representative sequences are identified to characterize known genes.

The minimum set of data necessary for a gene record, therefore, is: a unique identifier, or GeneID, assigned by NCBI; a preferred symbol; and either defining sequence information, map information, or official nomenclature from an authority list.

Gene records are not created for genomes that are incompletely represented by whole genome shotgun (WGS) assemblies. In terms of RefSeq accessions, this means that genes annotated on accessions of the pattern NZ_ABCD12345678 are not submitted to Gene. Although not all existing records have been removed, loci defined by repetitive elements, endogenous retroviruses not named by nomenclature authorities, and loci identified by single transcripts with no other supporting data also are not in scope for Gene.
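The accession pattern mentioned above can be checked mechanically. The sketch below reads NZ_ABCD12345678 literally as NZ_, four uppercase letters, and eight digits, with an optional version suffix; that reading, the regular expression, and the helper name is_wgs_refseq are all assumptions, and other accession lengths are not covered.

```python
import re

# Assumption: "NZ_ABCD12345678" means NZ_ + 4 uppercase letters + 8 digits,
# optionally followed by a ".version" suffix such as ".1".
WGS_REFSEQ_PATTERN = re.compile(r"^NZ_[A-Z]{4}\d{8}(\.\d+)?$")


def is_wgs_refseq(accession: str) -> bool:
    """Return True if the accession matches the assumed WGS RefSeq pattern."""
    return WGS_REFSEQ_PATTERN.match(accession) is not None


for accession in ("NZ_ABCD12345678", "NZ_ABCD12345678.1", "NC_000001"):
    print(accession, is_wgs_refseq(accession))
```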

Numbering system

A unique GeneID is assigned to each new record. Gene currently uses two number generators: one assigns values in the range 7,000,000 – 99,999,999 and the other assigns values > 100,000,000. Thus the sequence of GeneIDs is expected to have gaps.

Updates

Records are updated when new information is received. For some genomes, this may occur when a genome is re-annotated and the corresponding RefSeqs are updated. For other genomes, this may occur when any information attached to a single gene record is altered. Updates are processed daily.

Some components of the Gene record are updated automatically from other resources. Table 1 summarizes these data elements, their sources, and the update frequency. For example, GeneRIFs are processed independently of the Gene record. Most GeneRIFs are provided by the staff of the National Library of Medicine's Index Section and are integrated weekly. Those are available with the first update to Gene of the week. Public users are also invited to submit GeneRIFs, via the 'New GeneRIF' link in the Bibliography section of a Gene report.

Table 1. Data sources for Gene.

When any change is made to a record, the modification date is changed. This includes changes in GeneRIFs. The modification date, therefore, is the later of any update to Gene or supplemental information.
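In other words, the modification date behaves like a maximum over all update dates. A minimal sketch, assuming the dates are available as Python date objects (the function and parameter names are hypothetical):

```python
from datetime import date
from typing import Iterable


def modification_date(gene_update: date, supplemental_updates: Iterable[date]) -> date:
    """Return the later of the Gene update and any supplemental update (e.g. GeneRIFs)."""
    return max([gene_update, *supplemental_updates])


# A GeneRIF added after the last Gene update moves the modification date forward.
print(modification_date(date(2012, 9, 20), [date(2012, 9, 25)]))
```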

About two days are required for an update to be reflected in all reports from Gene. In some cases, the full report may be more up-to-date than the ftp site, because the ftp files are regenerated after a re-index of the database, and this process may lag a day behind the update to the database itself.

Suppressed Records

Gene will suppress a record for several reasons:

  • Review by NCBI staff and/or collaborators indicates that a record is no longer supported or in scope for Gene. An explanation for the suppression is provided by RefSeq staff.

  • Review by NCBI staff and/or collaborators indicates that the original record defined only part of what is now understood to be the functional gene unit. In that event, one record is made secondary to another, and the URL to the current record is provided.

  • The molecular basis for a Gene record that was previously only a mapped phenotype is discovered, and there was already a record for the causative locus or loci. The record for the mapped phenotype is made secondary to one of the causative loci and added to the phenotype section of all.

By default, all records, i.e., current and suppressed, are retrieved by a query submitted with no restrictions. You can, however, restrict your results to current records. For example,

  • click on Current only under the Filter your results: tool at the upper right of any query result page,

  • check the box Current Records in the Include Only section of the Limits page, or

  • qualify your query with the phrase “AND alive[property]”

Query Tips provides additional details.
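For scripted queries, the third option amounts to appending the qualifier to the search text. The sketch below wraps the original term in parentheses first; the parentheses are a defensive assumption so that a term containing OR keeps its meaning, and the helper name current_only is hypothetical.

```python
def current_only(term: str) -> str:
    """Restrict a Gene query to current records by adding the alive property."""
    # Parentheses keep an "OR" inside the original term from binding
    # more loosely than the appended "AND".
    return f"({term}) AND alive[property]"


print(current_only("COL1A1[sym]"))
print(current_only("II[chr] OR 2[chr]"))
```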

How content is selected

The content of a Gene record depends on availability of information and curatorial decisions. If you have suggestions about types of information that should be included in general, or for a specific record, please let us know by using our update form. More details about maintenance of certain types of information are provided in Gene’s FAQ.

How Data Are Displayed (Display Settings/Format)

NCBI's Entrez system supports multiple display options for each of its databases. The options available can be browsed by clicking on Display Settings (Figure 1). The options depend on whether you are viewing a set of results, or just one record. In the former case, with the Summary and Brief formats, Display Settings also provides choices for controlling the number of items to display, and their order. Note that additional customization of display formats and filtering options is possible by configuring your MyNCBI preferences.

Figure 1. Display Settings. There are two types of options to configure your display: (1) from a query result (top) or (2) from a record-specific display (bottom). The latter has no need to offer controls on the number of records to display or how to (more...)

Gene provides the following categories of formats:

Short Reports

Summary

When you process a query, the results are displayed in the Summary format (or “docsum”) as shown in Figure 2. You can see that this is the Summary format by noting the word Summary at the top of the results section to the right of Display Settings (Figure 2).

Figure 2. A representative Summary report from Gene sorted by Gene Weight and filtered for Current Only. This resulted from a query (note the text in the box under the search option in the query bar) for records having COL1A1 as the symbol (gene). This (more...)

In the Summary format, each result is numbered, and a check box is provided at the left of the record. The check box enables you to select which records in the retrieval set you want to review in another format, according to your selection in the Display Settings box. If none is checked, all are displayed in the selected format.

The text of the summary includes the preferred gene symbol, the complete name if available, the binomial name (genus species) (in brackets), other symbols and names, other designations, the genomic location (Chromosome, Location), the location on a genomic RefSeq for the reference assembly, the Mendelian Inheritance in Man (MIM) number for the gene (human only), and the GeneID (Figure 2B). When there is a symbol designated as official, it will be labeled as such. The order of precedence for displaying a symbol as preferred is

1. Official symbol

2. Locus tag

3. First symbol in the set of aliases
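The order of precedence above can be sketched as a simple fallback chain. The function name and its arguments are hypothetical; they only model the three sources named in the list.

```python
from typing import Optional, Sequence


def preferred_symbol(
    official: Optional[str],
    locus_tag: Optional[str],
    aliases: Sequence[str],
) -> Optional[str]:
    """Pick the symbol to display: official symbol, then locus tag, then first alias."""
    if official:
        return official
    if locus_tag:
        return locus_tag
    # Fall back to the first alias, or None if no symbol is known at all.
    return aliases[0] if aliases else None


print(preferred_symbol(None, "b0001", ["thrL"]))
```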

If the gene is on a named plasmid, then the plasmid name is given as the location.

Recent Activity displays your recent database searches and document views in all Entrez databases (i.e. not only Gene).

Brief

The functions allowed from the Brief display are similar to those described from the Summary display. The purpose of the Brief option is to support a more compact result set while providing enough information (the preferred symbol, 20 characters of the full name, and the GeneID) for you to select records to display a Full Report. The Brief format is provided within the rich Entrez web environment; Brief (text) is just that, a text display with no URLs.

UI List

This displays only the unique identifiers (UIs) or GeneIDs for the records retrieved by your query (without the functions supported by Entrez).

Sort by

When the Summary or Brief options are selected, the Display Settings menu also allows you to reorder the results. The options are:

  • Relevance (the current default). Relevance is calculated from Gene's assessment of what fields are the most important by which to find search results. For example, Gene assigns more value to results if they match a term in the 'Gene Name' (symbol) field than to a match in free text such as the RefSeq or GeneRIF summary. Thus if your query is the single term 'cat', then records with symbols of 'cat' will be sorted ahead of records with the term cat only elsewhere in the record.

  • Gene Weight. Gene Weight is calculated from multiple lines of evidence geared toward evaluating how well a gene has been characterized. These lines include:

    1. Informative Gene-PubMed links. Informativeness is inversely proportional to the number of Gene records connected to a PubMed record.

    2. Informative symbols or full names. A gene with a symbol constructed as LOC+GeneID is weighted less, for example, than a gene with the symbol 'ABCA1'. A gene with a description that starts with the word 'hypothetical' is weighted less than one with a description that starts with 'cystic fibrosis'.

    3. Inclusion in HomoloGene or Protein Clusters. Genes (or their products) that are known to be conserved are weighted more highly.

    4. Inclusion in OMIM or Books.

  • Name
    Results are sorted alphabetically (case insensitive) by the symbol of the gene.

  • Chromosome. Choosing this option causes the records to be sorted in this order:

    1. Alphabetically by organism name

    2. Numerically by chromosome

    3. Numerically by the start position on the chromosome.

    For example, suppose that the search results include genes for Homo sapiens (human) and Mus musculus (mouse). The human genes will all appear before those for mouse. Within the set of human genes in the results, those that are placed on chromosome 1 will appear first, followed by those placed on chromosome 2, and so on. Finally, within a chromosome, genes will be sorted according to their start positions on the chromosome. Genes that are not placed on a chromosome will appear at the end of the results. Genes that are placed on multiple chromosomes will be sorted according to the first such chromosome.
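The Chromosome ordering described above can be approximated with a composite sort key. This is a sketch under stated assumptions, not Gene's actual implementation: the GeneHit record is hypothetical, unplaced genes are put at the very end of the whole result set (a literal reading of the text), and non-numeric chromosome names such as X or Y are placed after numeric ones, which the text does not specify.

```python
from typing import NamedTuple, Optional, Tuple


class GeneHit(NamedTuple):
    symbol: str
    organism: str
    chromosome: Optional[str]  # None when the gene is not placed on a chromosome
    start: Optional[int]       # start position on the chromosome, if placed


def chromosome_sort_key(hit: GeneHit) -> Tuple:
    """Sort by placed/unplaced, organism name, chromosome, then start position."""
    unplaced = hit.chromosome is None
    if unplaced:
        chrom_rank = (2, 0, "")
    elif hit.chromosome.isdigit():
        chrom_rank = (0, int(hit.chromosome), "")   # numeric chromosomes in numeric order
    else:
        chrom_rank = (1, 0, hit.chromosome)         # assumption: X, Y, etc. after numbers
    return (unplaced, hit.organism.lower(), chrom_rank, hit.start or 0)


hits = [
    GeneHit("b", "Mus musculus", "1", 100),
    GeneHit("a", "Homo sapiens", "2", 50),
    GeneHit("c", "Homo sapiens", "1", 500),
    GeneHit("d", "Homo sapiens", "1", 20),
    GeneHit("e", "Homo sapiens", None, None),
]
ordered = sorted(hits, key=chromosome_sort_key)
print([hit.symbol for hit in ordered])
```

Genes placed on multiple chromosomes are not modeled here; per the text, they would be keyed on the first such chromosome.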

Subset of data content

Gene Table

The Gene Table display represents the gene structure as annotated on the indicated genomic RefSeq. The default report is based on the reference assembly, but the selection menu in the top box (Figure 3A) allows you to generate reports from other RefSeq genomic sequences.

Figure 3. Gene Table display. Use the Genomic Sequence and Coordinate System options in section A to select the sequence and the numbering system you want to use to generate your report. The embedded graphic (B) allows you to see what other elements are (more...)

The table reports the intron/exon organization of each transcript, and, if an mRNA, the region of each exon that contains coding sequence. It does this in two ways:

  • graphically, by repeating the display included in the Full Report

  • in a table, by reporting the position of any exon or coding region, and reporting the length of exons, coding regions, and introns

The Gene Table display supports retrieval of gene-related sequence, as summarized in Table 2.

Table 2. Access to Gene-specific sequence information from Gene.

Please note that Gene Table is not supported when the gene has not (yet) been annotated on any of NCBI's Genomic RefSeqs.

The sequence being retrieved is from the indicated genomic sequence, not the RNA. This means that the length of any non-aligning nucleotides, including a poly(A) tail or vector sequence, is not included in the GeneTable report.

Unaligned tails can be displayed graphically in the Sequence Viewer; follow the Open Full View link on Gene’s Full Report, click Configure on the right side of the Graphical Panel, and add RefSeq Alignments from the Alignments Track tab. Unaligned tails are displayed as boxes with the number of aligned bases shown above. Note that RefSeq transcripts with perfect alignments (excluding poly(A) tail) are NOT displayed in the RefSeq Alignment track. More information on how features are rendered in the Sequence Viewer is available from the Graphical View Legend section of the Sequence Viewer Help document.

When following a link from GeneTable to the sequence-specific nucleotide or protein record, use the Display Settings options there to generate the format you prefer (e.g. GenBank).

Because Gene Table reflects the annotation on the current genomic sequence, for bulk access you may prefer to use the seq_gene.md.gz file in the species-specific mapview subdirectory. These files are available for genomes that can be viewed in Map Viewer. For example:

Please note that RefSeq may update annotation on sequences representing a genome less frequently than updates to gene-specific RefSeqs. This means that if the version of a RefSeq RNA has changed, or if the number of transcript variants has changed, the GeneTable display will be out of date with respect to the Reference Sequences section of the full Gene report. Please check also the Reference Sequences section of the Gene record to determine whether updates have occurred (new versions and/or more variants and/or suppression resulting from review).

Please see Table 2 for a summary of how to access gene-specific sequence information via Gene.

GeneRIF

The GeneRIF display for a Gene can be accessed by a URL constructed as:

http://www.ncbi.nlm.nih.gov/gene/GENEID/?report=generif, where a GeneID replaces GENEID.

Example for GeneID 1059: http://www.ncbi.nlm.nih.gov/gene/1059/?report=generif
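A minimal sketch of the same substitution in code follows. The URL template is the one given above; the helper name generif_url and the check for a positive integer are assumptions.

```python
def generif_url(gene_id: int) -> str:
    """Build the GeneRIF report URL for a GeneID, following the template above."""
    if gene_id <= 0:
        raise ValueError("GeneID must be a positive integer")
    return f"http://www.ncbi.nlm.nih.gov/gene/{gene_id}/?report=generif"


print(generif_url(1059))
```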

This display lists the text of the GeneRIF (which anchors a link to PubMed), the title of the paper, and the authors.

The PubMed (GeneRIF) display provides a listing of all the PubMed uids that are associated with GeneRIFs AND interaction data for a GeneID. Thus the count of GeneRIFs displayed for a gene may differ from the number of results in PubMed when the PubMed (GeneRIF) link is used.

Full Reports

All of the content that Gene provides is defined by the ASN.1 file. The Full Report display is an HTML transformation of that ASN.1 and includes navigation tools (Table of contents and Related information), discovery elements, diagrams, and text. Some gene-specific information is not maintained in Gene but is maintained in more specialized databases such as BioSystems, GEO, HomoloGene, UniGene, and Probe. Access to the additional information maintained in other resources within NCBI or external to NCBI is provided by the listings under Related information (on the right beneath the Table of contents) and by other HTML anchors within the page.

The Full Report display is divided into the gray Search bar (explained in Query tips), navigation and discovery functions at the right, and content elements divided by horizontal separators that display or hide that subsection.

Navigation/Discovery column

The menu at the right of the Gene report supports navigation to multiple sites of interest (Figure 2). In some display formats, the menu can be expanded and compressed by clicking on the down or up arrows, respectively. More details about each submenu follow.

Table of Contents

The Table of Contents lists the subcategories of information available for a gene. Clicking on the name of the category takes you to that portion of the gene record. The arrow pointing up on the bar separating subcategories will return you to the top of the page, should you want to make a different selection from the menu. The arrows at the left of the bar at the top of the section allow you to open or close the display of that section.

Related information

This section indicates other Entrez databases (or report types) that reference Gene. Each line anchors a link to gene-specific data in those databases/reports (Figure 4).

Figure 4. Representative Related information section. The names of these links indicate both the name of the target NCBI database and, in many cases, a subset of records or displays at that target. Details about some of these links are provided in this (more...)

General information

This section enumerates resources that may help you find and understand the information in Gene. The Help link goes to the default help document. The default help document is also accessed by the question marks in the horizontal section separators.

Related sites

This section provides links to home pages of a subset of Entrez databases likely of interest to users of Gene.

Feedback

This section enumerates several sites where you can comment on or add data to Gene and/or RefSeq.

Subscription

This section provides links to forms where you can subscribe to a mailing list to receive announcements about updates to Gene, Map Viewer, and RefSeq.

Recent activity

This section displays your recent database searches and document views. You can click on any to return to the results of that query or that document.

Content elements

Title

The section immediately below Display Settings/Send to (Figure 5) provides the preferred symbol and descriptive name in bold font, followed by the italicized binomial in brackets. If there is a recognized authority for the gene nomenclature of a species, then that authority is the source for these values.

Figure 5. Representative title and summary sections of a Full Report.

The second line of this section contains the NCBI GeneID and the last date a record was changed. The date is in the format day-month-year. Change is defined as any modification to the content of the record, including ancillary changes such as the URL for a displayed link. If a record was merged or discontinued, that information is provided in a third line.

Summary

The Summary section of the Full Report display (Figure 5) may include several categories of information, namely:

Official Symbol and Name: Nomenclature provided by the named external authority.

Primary source: Identifier and link to the major resource outside of NCBI that provided information about this gene. For some taxa, this resource may be the nomenclature authority; in other taxa it may be the group that defines genes and submits annotation to public sequence databases.

Locus tag for the record is in the next line. Locus tag corresponds to the systematic feature qualifier used by the international sequence collaboration (INSDC, DDBJ/EMBL/GenBank) and can be assigned by sequence submitters as a unique, systematic gene descriptor. When such a value is not available from submitted sequence, the identifier from a collaborating model organism database is used. Locus tag is often used to anchor a link to a database other than Gene. Locus tag is also used as the preferred symbol if an official symbol has not been identified for a gene.

See related: A listing of other identifiers for this gene, provided as database name/value pairs.

Gene type: Possible values are tRNA, rRNA, snRNA, scRNA, snoRNA, miscRNA, ncRNA, protein coding, pseudo, other, and unknown. These are indexed as properties of a gene.

RefSeq status: Any of the set of status descriptions defined by RefSeq.

Organism: The binomial, and strain when appropriate, with a link to the NCBI Taxonomy database.

Lineage: Binomial and lineage from the Taxonomy database.

Also known as: Unofficial symbols and descriptions that have been used for this gene and its products. If there is no official symbol, and no locus_tag, the symbol at the top of the display is repeated in this section. These names are integrated from several sources, including model organism databases, annotation on sequence records, and interactive curation from the published literature.

Annotation information: Information about annotation oddities for a gene on the reference assembly. May be a report from NCBI’s annotation pipeline, or a comment written by a RefSeq curator to explain how a gene is (or is not) represented in NCBI’s annotation. Not provided if the RefSeq group does not provide annotation for a genome, or if there are no problems in the annotation.

Summary: Descriptive text about the gene, its cellular localization, its function, and its effect on phenotype. Records with a summary section can be retrieved by use of the property has_summary (Table 3).

Table 3. Other properties in Gene (excluding those related to genetype, rnatype, and source).

Genomic Context

The Genomic context (Figure 6) section reports the location of the gene on the chromosome in non-sequence coordinates and the strain and genotype information of the source sequence. The right hand side of this section (not shown in Figure 6) includes a link to Map Viewer, providing the same display as that generated from the Map Viewer link in the Related information menu.

Figure 6. Genomic context and Genomic regions, transcripts, and products sections. These sections provide diagrams of the gene and its neighbors, the gene’s intron/exon organization, and the RefSeqs that are used to represent RNA products and (more...)

If the gene has been included in a genomic annotation, the section also diagrams neighboring genes and indicates their orientations. If the name of a gene is too long to use for a label, truncation is indicated by an ellipsis (...). The gene being shown on the diagram is in maroon. All other diagrams and labels anchor links to specific gene pages, supporting quick navigation to review neighboring genes by clicking in the area of the symbol/arrow.

The diagram shows the gene’s placement on any and all chromosomes of the reference assembly, if the gene is annotated there. Otherwise, the diagram will show another genomic placement, in this order of precedence: reference contig; reference genomic region (NG); alternate chromosome; contig of an alternate assembly. The location information for all placements will be provided in the ASN.1 of the record and in the Reference Sequences Section. If a gene has not been included in the current version of the annotated genome provided in NCBI RefSeqs, the Genomic context section will not include a diagram but will report the map location.

Genomic Regions, Transcripts, and Products

This portion of the Full Report (Figure 6) is provided when a gene has been annotated on a genomic RefSeq, in other words, when the intron/exon/coding region information, or the position of a pseudogene, is available in some genomic coordinate system. The display in this section is generated from the NCBI Sequence Viewer, the same software that drives the Graphics sequence display option available from the sequence databases, and provides some of the navigation features. A legend describes how annotated features are rendered in this display, and a link in the top right hand corner of the sequence panel provides complete Help documentation. Several YouTube videos, available here, provide additional instruction on the use of the Sequence Viewer.

You can use the Genomic regions, transcripts, and products section to:

  • view the intron/exon/coding region organization of a gene and its RNA product(s), or the placement of a pseudogene, on a genomic RefSeq

  • identify the RefSeqs that correspond to any RNA or protein product and see an overview of the exons they represent

  • alter the zoom level of the display (see Changing the zoom level in the display)

  • move upstream and downstream in the sequence being displayed (see Move upstream and downstream)

  • navigate to a full display of the genomic context via the link Go to nucleotide Graphics

  • navigate to the genomic sequence of the gene in FASTA format

  • navigate to the genomic sequence of the gene in GenBank format

  • change the display of the genomic sequence on which the gene is annotated. The default display is the chromosome of the reference assembly; for some taxa there are alternate assemblies. For human, the RefSeqGene can also be selected.

Each position of a gene product, when represented by a RefSeq RNA (accession NM_000000/NM_000000000 or XM_000000/XM_000000000 for mRNA, NR_000000 or XR_000000 for non-protein coding RNAs) and/or protein (NP_000000/NP_000000000 or XP_000000/XP_000000000), is provided relative to the genomic accession on which it is annotated. For some species, including human and other vertebrates, the genomic RefSeqs are updated independently of the annotated product RNAs, with the latter being updated more frequently. This means that several kinds of discrepancies between the diagram and the current RefSeq RNAs may result.

  • The diagram may be labeled with an mRNA accession (for a predicted transcript) of the format XM_123456 or XM_123456789, yet display of that accession from Entrez Nucleotide indicates that this accession is no longer primary. That means that a curated mRNA (accession of the format NM_123456 or NM_123456789) has been generated to replace the previous model accession. This new "NM" accession will be reported in the Reference Sequences section of Gene.

  • The diagram may be labeled with curated RNA accession numbers (of the format NM_123456 or NM_123456789 or NR_123456) different from those listed in the RefSeq section. This will result if curation after the submission of the annotated genome identified more transcript variants, which therefore are listed only in the Reference Sequence section but not in the diagram. It will also result if curation after submission of the annotated genome identified an error in the annotated product, and the accession for that product was suppressed. In that case, the Genomic regions, transcripts and products section will indicate a transcript not listed in the RefSeq section of the Gene report.
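When comparing the diagram against the Reference Sequences section, it can help to classify accessions by prefix. The sketch below uses the prefixes named above (NM_/XM_ for mRNA, NR_/XR_ for non-protein-coding RNA, NP_/XP_ for protein). Treating every X-prefixed accession as a model and every N-prefixed accession as curated generalizes the XM_/NM_ description in the text; accepting six or nine digits for all prefixes and an optional version suffix are further assumptions, as is the helper name.

```python
import re
from typing import Optional, Tuple

# Assumption: every prefix may carry 6 or 9 digits and an optional ".version".
ACCESSION_PATTERN = re.compile(r"^(NM|XM|NR|XR|NP|XP)_(\d{6}|\d{9})(\.\d+)?$")

MOLECULE_BY_PREFIX = {
    "NM": "mRNA",
    "XM": "mRNA",
    "NR": "non-coding RNA",
    "XR": "non-coding RNA",
    "NP": "protein",
    "XP": "protein",
}


def classify_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule type, 'curated' or 'model'), or None if not a product RefSeq."""
    match = ACCESSION_PATTERN.match(accession)
    if match is None:
        return None
    prefix = match.group(1)
    status = "model" if prefix.startswith("X") else "curated"
    return MOLECULE_BY_PREFIX[prefix], status


for accession in ("NM_123456", "XM_123456789", "NR_123456", "NC_000001"):
    print(accession, classify_refseq(accession))
```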

Changing the zoom level in the display

1. Select a subsequence to display, and display it

  a. Left click in the white section with the coordinates and ruler, and drag to select your region of interest.

  b. Right click, select zoom on range, and the display will refresh to provide the region of interest.

2. Use the in/out zoom functions. Right click, and select either zoom in or zoom out. The display will refresh, and change the region displayed by a factor of 2.

Move upstream and downstream

A single left click anywhere in the display other than the ruler section, followed by a drag, results in a shift to display upstream and downstream sequence.

Bibliography

The Bibliography section (Figure 7) may have two components:

Figure 7. Representative bibliography section with displays of citations and GeneRIFs. If the number of citations exceeded either 5 (PubMed) or (10) GeneRIFs, the first 5 or 10 would be displayed, along with the total count with a link to the display (more...)

A. An embedded display of a subset of PubMed citations.

B. An embedded display of a subset of GeneRIFs.

The approach in both components is to display a limited number of records within the full display (5 for PubMed, 10 for GeneRIF), provide a count of the total records available, and support links to a display of all records. The GeneRIFs component also provides a link to submit a new GeneRIF for the gene, or to submit a request to the RefSeq curators to review information in the record.
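The "limited preview plus total count" approach can be sketched as follows. The limits (5 and 10) come from the text; the function name, the component keys, and the use of plain lists for records are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# Limits stated in the text: 5 PubMed citations, 10 GeneRIFs.
PREVIEW_LIMITS = {"pubmed": 5, "generif": 10}


def bibliography_preview(records: Sequence[str], component: str) -> Tuple[List[str], int]:
    """Return the records shown inline and the total count used for the 'all records' link."""
    limit = PREVIEW_LIMITS[component]
    return list(records[:limit]), len(records)


shown, total = bibliography_preview([f"PMID {n}" for n in range(1, 8)], "pubmed")
print(f"showing {len(shown)} of {total}")
```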

What is a GeneRIF?

A GeneRIF is a concise phrase describing a function or functions of a gene, with the PubMed citation supporting that assertion. The majority of GeneRIFs have been provided by a collaboration between the NLM's Index Section and NCBI. There is no constraint on the number of independent submissions of GeneRIFs per PubMed id, although those from non-NLM sources are reviewed by RefSeq staff. The GeneRIF homepage provides more information about the project, including how general users can make submissions. If more than one GeneRIF for a gene has the same text but a different citation, the link to PubMed (icon at the left) will result in a display of all citations.

Each species has a GeneID with the symbol NEWENTRY. When staff of the NLM Index Section cannot identify the gene to which a publication belongs, the GeneRIF is connected to the 'NEWENTRY' record, which is a placeholder for all the 'unconnected' GeneRIFs for a species. The GeneRIF text remains associated with that GeneID until a RefSeq curator can identify or create the specific gene or genes to which the submission should be connected.

The full display of GeneRIFs for a gene can be generated at any time by selecting GeneRIF as the format from Display Settings.

Phenotype

This section reports the effect of the gene on phenotype, especially disease. For human genes (Figure 8), the first row links to the Phenotype-Genotype Integrator (PheGenI, pronounced FEE-GEE-NEE), a web portal providing a tabular display of genome-wide association study results relating the gene and/or its expression to a phenotype. PheGenI includes links to Genotype-Tissue Expression (GTEx) results and viewers to display the relationships among genetic variants at the nucleotide level.

Figure 8. Representative phenotype section in the Full Report display. This section reports the effect of a gene on phenotype, particularly disease, when known. For some human diseases, links to the Phenotype-Genotype Integrator and GeneReviews are available. (more...)

Named phenotypes are provided in subsequent rows. Each phenotype row may be expanded, providing links to more information as available. In the case of human disease, this may include links to OMIM, NHGRI Catalog of Genome-Wide Association Studies, PubMed, and GeneReviews. You can view the full GeneReview, or open a display of the Summary data only.

Interactions

There are two major subcategories of information reported as Interactions: HIV-1 interactions and general interactions.

HIV-1 protein interactions

The HIV-1, Human Protein Interaction Database is funded by the Division of Acquired Immunodeficiency Syndrome (DAIDS) of the National Institute of Allergy and Infectious Diseases (NIAID). As the title indicates, this project focuses on the human proteins that have been shown to interact with proteins from HIV-1. The format of this section is different for the human and HIV-1 gene reports. For human, the display consists of:

  • the HIV-1 protein, which anchors a link to Gene for that gene product

  • a concise description of the interaction

  • links to papers in PubMed that support the described interaction

For HIV-1, the display is subdivided by peptide name and includes:

  • a key word categorizing the interaction

  • the full name of the human gene, which anchors a link to that record

  • links to papers in PubMed that support the described interaction

Please note that there are separate reports from this section that are available for download, both from the HIV-1, Human Protein Interaction Database homepage and the GeneRIF subdirectory of the Gene FTP site.

Interactions

Interactions in this general section are reported as pairs. The report will always include, in the first column, the product of the gene that is part of the interaction. Depending on the type of interaction, the rest of the display may report:

  • the other interactant, anchoring a link to more information

  • the gene name of the other interactant, anchoring a link to that record in Gene

  • the complex to which the interactant(s) belongs

  • the source of these data, anchoring a link to the record at that source

  • a concise description of the interaction

  • links to papers in PubMed that support the described interaction

Alleles

This section reports the general characteristics of alleles that have been described for a gene and provides links to more detailed information. This function is not available for all species; the current set is for mouse and is being developed from information supplied by Mouse Genome Informatics.

General Gene Information

This section includes several subcategories of information, including:

GeneOntology (GO): The specific GO terms are listed by source of the information, category, term, evidence information, and links to supporting publications. Each GO term supports a link to the AmiGO browser. Abbreviations in the Evidence column indicate the level of support for assigning a GO term to a gene. Explanations for these abbreviations are provided by the Gene Ontology website.

Gene does not alter the associations provided by a model organism database, nor does Gene recapitulate the directed acyclic graph structure provided by GO. Thus, Gene does not support retrieval of all genes associated with a specific GO term based on that term's parent. If you identify a GO term that is inappropriate for a gene, please contact the model organism database directly. ftp.ncbi.nlm.nih.gov/gene/DATA/go_process.xml documents the authorities Gene uses to connect GO terms to GeneIDs.

Homology: A partial listing, with links, of orthologs in other species. Other views of homology data are available from TaxPlot and the HomoloGene link in the Related information menu.

Genotypes: Links to various reports from dbSNP about allele frequencies in one or more populations, all variations for a gene, or disease-associated variations.

Markers: An enumeration of the markers that are related to this gene. The relationship is reported based on direct reports, e-PCR using mRNA templates, or e-PCR-based localization on the genome within a region beginning 2 kb upstream of the gene and ending 0.5 kb downstream. Links are provided to the NCBI UniSTS database.

Pathways: A description of pathways that include this gene with links to more information about that pathway.

Readthrough: Information about genes that are sometimes transcribed with others. More information about readthrough transcription and how these events are represented in Gene is provided in a FAQ.

Related gene/pseudogenes: If a gene, provides a link to view the records of pseudogenes related to the functional gene. If a pseudogene, provides a link to the functional gene.

Relationships: This section reports some of the public sequences that were used to support the prediction of the indicated RefSeq model. Thus this section is used only for those genomes for which NCBI calculates annotation, and only for those genes where there is not a supporting curated RefSeq. Sequence accessions reported in this section may differ from those for which alignments are displayed from Evidence Viewer for this gene.
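As a rough illustration of the e-PCR localization window described under Markers above, the following Python sketch computes the genomic interval that begins 2 kb upstream of a gene and ends 0.5 kb downstream. The function name, the use of 1-based inclusive coordinates, and the strand-aware reading of "upstream" are assumptions made for illustration; they are not taken from NCBI's implementation.

```python
def marker_window(gene_start, gene_end, strand, upstream=2000, downstream=500):
    """Return (start, end) of the region around a gene used to relate markers.

    Assumes 1-based inclusive coordinates with gene_start <= gene_end, and
    interprets 'upstream' relative to the gene's strand (an assumption).
    """
    if gene_start > gene_end:
        raise ValueError("gene_start must not exceed gene_end")
    if strand == "+":
        start, end = gene_start - upstream, gene_end + downstream
    elif strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        start, end = gene_start - downstream, gene_end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp to the first base of the sequence.
    return max(1, start), end


print(marker_window(10000, 15000, "+"))  # (8000, 15500)
print(marker_window(10000, 15000, "-"))  # (9500, 17000)
```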

General Protein Information

This section applies only to genes that encode proteins. It reports the name or names that have been assigned to proteins encoded by the gene and provides other descriptive text. The names are as annotated on the RefSeq protein, when that protein is available. The sources of these names include model organism databases, annotation on public sequence databases, and curation by RefSeq staff.

NCBI Reference Sequences (RefSeqs)

This section describes the gene-specific NCBI reference sequences (RefSeqs) that have been established for this gene. In addition to enumerating the accession numbers and providing links to the appropriate Entrez sequence database, this section may also include descriptions of each transcript variant, accession numbers of the public sequences used to support any transcript, links to matching related Ensembl and VEGA transcripts and proteins, and a listing of computed domains in an encoded protein. The text provided in this section therefore supports retrieving gene records based on descriptions of conserved domains.

The Reference Sequence group uses several approaches in maintaining information. These can be broadly categorized as:

  1. RefSeqs maintained independently of Annotated Genomes (Figure 9). RefSeq RNA and protein sequences are updated continuously, independently of any comprehensive reannotation of a genome. Because these reference sequences are curated independently of the genome annotation cycle, their versions may not match the RefSeq versions in the current genome build. You can identify updates by comparing versions in this section to versions in the Genomic regions, transcripts, and products section.

  2. RefSeqs of Annotated Genomes (Figure 10). This section reports genomic RefSeqs from all assemblies on which this gene is annotated, such as RefSeqs for chromosomes and scaffolds (contigs) from both reference and alternate assemblies. The position and strand of the gene feature are provided (offset 1). GenBank, FASTA, and Nucleotide graphics anchor links to the sequence in the given formats. Model RNAs and proteins are also reported here.

  3. Genome Annotation. RefSeq RNA and protein sequences are provided only through the process of genome/chromosome annotation.

  4. Suppressed Reference Sequence(s). Accession numbers listed in this section were suppressed for the cited reason(s). Suppressed RefSeqs do not appear in BLAST databases, related sequence links, or BLAST links (BLink) but may still be retrieved from the Nucleotide or Protein databases, and by clicking on the hyperlinked accession.version.

Figure 9. Representative NCBI Reference Sequences (RefSeq) section in the Full Report display. This section includes two subsections: RefSeqs maintained independently of Annotated Genomes (this figure), and RefSeqs of Annotated Genomes (Figure 10).

Figure 10. Representative subsection RefSeqs of Annotated Genomes in the NCBI Reference Sequences (RefSeq) section of a Full Report display. This subsection follows RefSeqs maintained independently of Annotated Genomes (Figure 9).

Related Sequences

This section has two subsections, one in which the nucleotide sequence is primary and one for protein sequences only (UniProtKB). It contains sequence accessions that are related to the gene and provides links to the appropriate sequence record in Entrez Nucleotide, Entrez Protein, or UniProtKB. It is not intended to be a comprehensive list of all sequences related to any gene; such sequences can be found more directly by using BLAST to query sequence databases or by using pre-calculated reports of related sequences via Entrez Nucleotide, Entrez Protein, or BLink. The sequence accessions in this section are provided in a tab-delimited format in the gene2accession.gz file in the DATA directory of the Gene FTP site.

Depending on the genome of the gene being reported, the sequences included may or may not be restricted to the same subspecies or strain.

Gene purposely lists protein accession numbers on records being represented as not protein-coding. The intent is to make the connection between sequence annotation and Gene's current representation of the type of gene. For example, a nomenclature group may call a gene protein-coding, or UniProt may create a sequence record for a protein based on an open reading frame, but RefSeq staff may judge the evidence to be weak based on lack of cross-species homology or experimental support. Gene will report the protein sequences derived from the locus, but will represent the gene as not protein-coding, consistent with the RefSeq curation decision. Records of this type are reviewed periodically as new evidence is made available.

Users with evidence indicating that the Gene record should be reviewed are encouraged to contact the RefSeq staff.

Accessions are reported as related sequences based on several criteria:

  • mRNAs with unique best placement on a genome coinciding with an annotated Gene

  • cDNA/cDNA sequence relatedness (calculated based on criteria of identity, length of overlap to known accessions, and coverage of the novel accession)

  • submissions from model organism databases or nomenclature authorities

  • identification of proteins with identical sequences

  • curation by RefSeq staff

  • annotated GeneIDs from the ORFeome Collaboration or Celera

Gene LinkOut

LinkOut provides easy access to relevant online resources outside of the Entrez system. These connections, and their groupings, are maintained by the external database.

ASN.1

The ASN.1 display provides gene records structured according to the Entrezgene specification. An XML transformation of the ASN.1 is also available. Detailed information about the specification is provided in the Tips for Programmers section.

XML

Any record or selected set of records can be displayed in XML format. The XML is generated automatically from the ASN.1 record that is used to support the display, with the names of the tags defined by the ASN.1 specification. Detailed information about the specification is provided in the Tips for Programmers section.

Query Tips: How to submit detailed queries, edit your query, filter your results, and more…

Gene uses the functions common to other NCBI databases, including all functions of the Entrez indexing and query engine. This section therefore summarizes only how to use these tools in the context of the Gene database. Entrez Help and PubMed Help provide general information on how to save searches and how to use Limits, the Clipboard, History, and Advanced Search.

Each Entrez database provides a query bar where you can select a database to interrogate, and enter a search term or terms. If a simple query is not powerful enough, there are additional search interfaces described as Limits and Advanced search.

Using Limits

Introduction to Limits

The Limits page (Figure 11) allows you to set the context for making queries to Gene. It is accessed by clicking on Limits in the gray query bar (Figure 2A).

Figure 11. Representative Limits Page. Shown is an example of a Limits page used to find current records in Gene for Homo sapiens that are related to ataxia and are associated with reviewed RefSeq records but not pseudogenes.

If you want to bookmark the Limits page, use this URL:

http://www.ncbi.nlm.nih.gov/gene/limits/

Limits is designed to make it easier to execute certain queries by checking boxes, rather than by writing out the text of an Entrez query. It is particularly useful if you want to retrieve genes only:

  • within a defined range on a chromosome

  • with a value found in a single known field

  • from a particular cellular source or RefSeq representation (Excludes/Includes)

  • represented by a particular type of RefSeq (Limit by RefSeq Status)

  • from a taxonomic group (Limit by Taxonomy)

Once you have set Limits, that setting remains through multiple queries, unless you remove the setting. A yellow banner appears below the query bar when Limits is turned on. You can turn off Limits at any time by clicking on remove.

The Exclude section (Figure 11) enables you to prevent certain types of genes from being included in your result set. Each check box is independent, so if you want to prevent the retrieval of genes encoded by mitochondria and by plastids, check both boxes. The NEWENTRY option refers to the GeneID used to support submission of GeneRIFs, by species, for a gene not currently in Gene.

You can use the Includes option to have only certain types of genes in your result set. These are defined as:

  • Genomic: Genes encoded by chromosomes or the major genomic macromolecule for the taxon.

  • Mitochondria: Genes encoded by mitochondria.

  • Plasmids: Genes encoded by plasmids.

  • Plastids: Genes encoded by plastids.

  • RefSeqs: Genes for which RefSeqs exist.

  • NEWENTRY: GeneIDs used to support submission of GeneRIFs, by species, for a gene not currently in Gene.

Additional limits are:

  • Limit by RefSeq Status : To retrieve genes based on the type of RefSeq used to represent the gene. How these types are established is documented here.

  • Limit by Taxonomy: To restrict your query to one or more organisms.

The additional limits are treated as Includes and are hierarchical. For example, to limit your results to genes in invertebrate genomes, check Invertebrates. Multiple selections are combined with the Boolean operator OR, so if you want to retrieve genes from either Danio rerio or Xenopus sp., check Danio rerio, Xenopus laevis, and Xenopus tropicalis.

Field restriction in the Limits page is an option that allows you to retrieve records only when your query term exists in the selected field. These fields (Table 4) are the same as those described in more detail in the Advanced Search section.

Table 4. Filter sets (partial). Complete documentation for all Entrez filters is here.

The Limits page also allows you to restrict queries by date. The options are:

  • Creation Date: Retrieve records created within the range entered, or according to pre-selected ranges in the pull-down menu.

  • Last Modification: Retrieve records modified within the range entered, or according to preselected ranges in the pull-down menu.

Examples Using Limits

A. To retrieve non-mitochondrially encoded NADH dehydrogenases from human, mouse, or rat, use the Limits form to:

  1. Enter nadh dehydrogenase in the query box (case does not matter).

  2. Select Gene/Protein name from the All fields drop-down menu.

  3. Check Mitochondrion under Exclude.

  4. Check Homo sapiens, Mus musculus, Rattus norvegicus.

  5. Click Search at the bottom of the page.

B. To retrieve E. coli genes related to tryptophan, use the Limits form to:

  1. Check Escherichia coli.

  2. Enter tryptophan in the query box.

  3. Click Search at the bottom of the page.

C. To retrieve current human genes located on chromosome 1 between base positions 1 and 500000:

  1. Select Homo sapiens from Limit by Chromosomal Region and enter the values for Chromosome and From/To.

  2. Select Current Records under Includes.

  3. Click Search at the bottom of the page.
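The Limits selections in examples A and B can also be approximated as ordinary Entrez query text. The Python sketch below composes such text from a few of the same choices. It is a sketch under stated assumptions: the function limits_query is hypothetical, the [title] and [orgn] tags are borrowed from the Quick Start examples (that [title] corresponds to Gene/Protein name is assumed), and the property name source_mitochondrion is an assumed spelling that should be checked against the fields table before use. The chromosomal range in example C is not covered.

```python
def limits_query(text="", field="", organisms=(), include_props=(), exclude_props=()):
    """Compose Entrez query text that mirrors a set of Limits selections.

    Hypothetical helper: the property names passed in are not validated.
    """
    parts = []
    words = text.split()
    if field:
        # Tag every word so the field restriction is unambiguous.
        parts.extend(f"{word}[{field}]" for word in words)
    elif words:
        parts.append(" ".join(words))
    if organisms:
        # Taxonomy check boxes are combined with OR.
        names = " OR ".join(f'"{name}"[orgn]' for name in organisms)
        parts.append(f"({names})")
    # Includes narrow the result set, so they are joined with AND.
    parts.extend(f"{prop}[property]" for prop in include_props)
    query = " AND ".join(parts)
    # Excludes are removed from the result set with NOT.
    for prop in exclude_props:
        query += f" NOT {prop}[property]"
    return query.strip()


# Example A (the property name is an assumption, see the note above).
print(limits_query(
    "nadh dehydrogenase",
    field="title",
    organisms=["Homo sapiens", "Mus musculus", "Rattus norvegicus"],
    exclude_props=["source_mitochondrion"],
))

# Example B.
print(limits_query("tryptophan", organisms=["Escherichia coli"]))
```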

Table 5 summarizes the fields used to categorize information in Gene records. The table also provides examples of how to use these entities effectively to retrieve records. The table is alphabetized by the values in the Field Name menu.

Table 5. Fields used to categorize information in Gene records.

Finding subsets of your results (the ‘Filter your results’ option)

When you are reviewing a query result in HTML format (not text), the Filter your results option allows you to display a subset of your result set. A default set is provided, but you can also customize your filters via My NCBI. One filter you may find particularly useful is ‘Current Only’ which removes discontinued or replaced records from the result set. This is equivalent to submitting a query that contains the expression ‘AND alive[property]’.
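For scripted queries, the same restriction can be applied by appending the expression quoted above. The following is a minimal Python sketch; the helper name current_only is hypothetical, and wrapping the original query in parentheses is a defensive choice of this example, not a documented requirement.

```python
def current_only(query):
    """Restrict a Gene query to current records using alive[property]."""
    query = query.strip()
    if not query:
        return "alive[property]"
    # Parentheses keep any OR terms in the original query grouped together.
    return f"({query}) AND alive[property]"


print(current_only("BRCA1[sym]"))  # (BRCA1[sym]) AND alive[property]
```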

Words Excluded From Queries

Common, but uninformative, words and terms (also known as stopwords) are automatically eliminated from searches. However, a search term that is a stopword will be included if the term is explicitly qualified by a field name. For example, if you want to search for the term was, you could use:

  • was [All Fields]

Enclosing the term in double quotes would have the same effect.

A list of stopwords used in Gene is in Table 7.

Table 7. Stopwords.

Constructing Powerful Queries

Constructing queries based on free text, filters, and properties can be quite powerful in retrieving records of interest from Gene. Table 8 summarizes some of these approaches by describing:

  • Scope: The intent of a query.

  • Query: How to construct a query that meets that intent.

  • Notes: How usage of Gene to retrieve these data may compare to other gene-related resources, namely HomoloGene, Map Viewer, or UniGene.

Table 8. Constructing queries.

Although these examples use field restriction (see Table 5 for the comprehensive list of fields used to index the information in Gene records), free text can also be submitted. Gene then weights the retrievals based on the field in which a result was found. For example, if your query matches a gene symbol in one record and arbitrary text in another, the record where the match is on the symbol will be displayed before the other in the results. Thus Gene controls the default order in which results are returned by evaluating what fields are more critical to matching your query. This default sorting order is termed 'relevance'.

Tips for Programmers

The Gene Data Model and DTD

The data model for Gene is documented in the Entrezgene specification. It combines several definitions used by other NCBI databases, such as seqfeat, but also establishes definitions specific to Gene. Of special note is the Gene-commentary, which is used to represent many descriptors of genes. Each Gene-commentary is defined by type and supports specific representation of such elements as sequence database accession numbers (accession, version), citations (refs), external or internal resources defining the data (source), and position information. Heading, label, and text are used for general data, with the choice influenced by display in the Gene viewers.

The DTD for Gene is available from NCBI's DTD directory and is called NCBI Gene.dtd.

Entrez Programming Utilities and Gene

The full power of Entrez Programming Utilities (e-Utils) can be used to extract information from Gene programmatically. The basic strategy is to identify the query that will return the desired records and then submit that query via ESearch. The GeneIDs identified by that search can then be submitted to another function, such as ESummary or EFetch. Examples for Gene are provided on the FAQ page. The FTP site contains sample Perl scripts that use ESearch and ESummary.
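The ESearch-then-ESummary strategy described above can be sketched in Python as follows. This is an illustration, not one of the scripts distributed on the FTP site: the helper names are hypothetical, the https base URL is an assumption about the current E-Utilities host, and the sample response is a trimmed, hand-written stand-in for real ESearch output. No network request is made; fetching the URLs is left to the caller.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Assumed base URL for the E-Utilities endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=20):
    """Build an ESearch URL that queries the Gene database."""
    params = {"db": "gene", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def parse_id_list(esearch_xml):
    """Return the GeneIDs listed in an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [elem.text for elem in root.findall("./IdList/Id")]


def esummary_url(gene_ids):
    """Build an ESummary URL for the given GeneIDs."""
    params = {"db": "gene", "id": ",".join(gene_ids), "retmode": "xml"}
    # Keep commas unescaped so the ID list stays readable.
    return f"{EUTILS_BASE}/esummary.fcgi?{urllib.parse.urlencode(params, safe=',')}"


# Hand-written sample standing in for a real ESearch response.
sample = "<eSearchResult><Count>1</Count><IdList><Id>672</Id></IdList></eSearchResult>"
ids = parse_id_list(sample)
print(esearch_url("BRCA1[sym] AND human[orgn]"))
print(esummary_url(ids))
```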

Extracting Gene Summaries and other information from Gene’s Document Summary

The Summary text provided via Gene and on RefSeq records can be extracted by taking advantage of the following:

  • the text of the Summary is included in the Document Summary (docsum) from Gene.

  • genes with Summary text can be identified by the has_summary property.

In other words:

  1. Use ESearch to find all GeneIDs with the has_summary property.

  2. Use ESummary to retrieve the Summary text (e.g., http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672&retmode=xml).

  3. Extract the string in the Summary tag.
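The extraction step might look like the Python sketch below. The exact layout of the docsum XML is an assumption here: the function accepts either a layout in which the text sits in an Item element whose Name attribute is Summary, or a layout with a dedicated Summary element inside a DocumentSummary. Check the layout against an actual ESummary response before relying on it. The sample documents and their text are hand-written placeholders, not real Gene summaries.

```python
import xml.etree.ElementTree as ET


def extract_summaries(esummary_xml):
    """Map each GeneID to its Summary text from an ESummary XML response."""
    root = ET.fromstring(esummary_xml)
    summaries = {}
    # Assumed layout 1: <DocSum><Id>..</Id><Item Name="Summary">..</Item></DocSum>
    for docsum in root.iter("DocSum"):
        gene_id = docsum.findtext("Id")
        for item in docsum.iter("Item"):
            if item.get("Name") == "Summary":
                summaries[gene_id] = item.text or ""
    # Assumed layout 2: <DocumentSummary uid=".."><Summary>..</Summary></DocumentSummary>
    for docsum in root.iter("DocumentSummary"):
        summary = docsum.find("Summary")
        if summary is not None:
            summaries[docsum.get("uid")] = summary.text or ""
    return summaries


# Placeholder response; the summary text is invented for illustration.
sample = (
    "<eSummaryResult><DocSum><Id>672</Id>"
    '<Item Name="Name" Type="String">BRCA1</Item>'
    '<Item Name="Summary" Type="String">placeholder summary text</Item>'
    "</DocSum></eSummaryResult>"
)
print(extract_summaries(sample))
```

Genes without Summary text are simply absent from the returned dictionary, which is why step 1 restricts the search to records with the has_summary property.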

Table 9 lists the name attributes available from Gene’s docsum. This information may be extracted from Gene in a similar manner.

Table 9. The Name attributes of Gene’s Document Summary (docsum).

Gene FTP Site

The FTP site for Gene (README) has three major subdirectories: DATA, GeneRIF, and Tools.

DATA

DATA contains files that provide key attributes of genes, including:

  • all associated accession numbers, including RefSeqs (gene2accession.gz)

  • associated RefSeq accession numbers (gene2refseq.gz)

  • citations (gene2pubmed.gz)

  • nomenclature, ID, and map data (gene_info.gz)

  • genes that are no longer current (gene_history.gz)

  • MIM numbers (mim2gene)

  • UniGene clusters (gene2unigene)

  • GO terms (gene2go.gz)

  • relationships to other genes (gene_group.gz)

  • matching Ensembl annotation (gene2ensembl.gz)

  • matching VEGA annotation (gene2vega.gz)

  • relationship to UniProtKB proteins (gene_refseq_uniprotkb_collab.gz)

Details of the construction of these files are reported in the (README) file.

DATA also contains the ASN_BINARY subdirectory. This path contains a comprehensive extraction from Gene (All_Data.gz), several subsets categorized by source (Organelles, Plasmids), and subdirectories grouped broadly by taxonomy. Records of genes from species that are requested frequently are also provided in species-specific files, for example these mammals. The format of these extractions is compressed binary ASN.1. The program gene2xml is available to convert these files to XML or ASN.1 text.

The GENE_INFO subdirectory of DATA provides subsets of the gene_info file grouped broadly by taxonomy. This directory structure mirrors that of the ASN_BINARY path. Thus if you want the type of information provided in gene_info, but do not want to have to process the complete text, you can use one of the files in the appropriate subdirectory, for example these plants.

GeneRIF

GeneRIF contains files that provide supplemental information about gene functions, either from the GeneRIF pipeline (generifs_basic.gz) or the HIV-1, Human Protein Interaction Database (hiv_interactions.gz). The tab-delimited files are not subdivided by species of interest. All files except the file reporting GeneID/PubMedID relationships (gene2pubmed.gz) have a column with the ID from the NCBI Taxonomy database to facilitate the extraction of a subset of the data from the file by species.
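Because these files are not subdivided by species, a common first step is to filter them by Taxonomy ID. The Python sketch below streams a gzip-compressed, tab-delimited file and keeps only the rows for one taxon. It assumes that the Taxonomy ID is in the first column and that header lines begin with '#'; both should be confirmed against the README for the file in question. The function names are hypothetical, and the sample rows are illustrative rather than copied from a real file.

```python
import gzip
import io


def rows_for_taxon(lines, tax_id):
    """Yield tab-split rows whose first column equals tax_id."""
    wanted = str(tax_id)
    for line in lines:
        if line.startswith("#"):
            # Skip header or comment lines.
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == wanted:
            yield fields


def rows_from_gz(path_or_file, tax_id):
    """Stream matching rows from a .gz file without unpacking it to disk."""
    with gzip.open(path_or_file, "rt", encoding="utf-8") as handle:
        yield from rows_for_taxon(handle, tax_id)


# Illustrative rows only (columns: Taxonomy ID, GeneID, symbol).
sample = "#tax_id\tGeneID\tSymbol\n9606\t672\tBRCA1\n10090\t0\tplaceholder\n"
print(list(rows_for_taxon(io.StringIO(sample), 9606)))
```

With a downloaded file, a call such as rows_from_gz("gene_info.gz", 9606) would yield the human rows one at a time, which avoids loading the complete file into memory.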

Tools

Gene_tools provides or points to programs and scripts to mine data from Gene. Of particular interest is gene2xml, which can be used to convert the binary ASN.1 in the ASN_BINARY directory to XML or to ASN.1 in text format (README).

Connecting Users of Gene to Your Website

Gene can serve as a gateway to information on your website served from your local database. Users of Gene will discover your website if you participate in our LinkOut system and become a LinkOut provider. Any Entrez database will support LinkOut. LinkOut Help’s Information for Other Resource Providers explains the details of this opportunity.

There are many benefits to becoming a LinkOut provider. If you want access to your database to be apparent from Gene, you can control the description of your resource, the update cycle, and the icon to anchor links to your site. In other words, you do not have to wait for NCBI staff to go to your site to obtain and process information and match to Gene records. You know your site best—you can identify which records are related to Gene records and provide the most accurate and informative URL to connect that Gene record to your site. If you already provide LinkOuts to other Entrez databases, such as Nucleotide or Protein, you do not have to re-register as a provider; you need only notify LinkOut staff and start to submit a new resource file.

With the implementation of My NCBI, it is even more advantageous to become a LinkOut provider. One of the options registered users of My NCBI can select is to display the icons for any LinkOut provider at the top of a record. The presence of your familiar logo would invite users of Gene to go to your site.

Connecting your site to Gene

URLs can be constructed to query Gene, or to display a specific record if you know the GeneID. For example, if your site maintains the identifiers (GeneID) used by Gene, you can construct a link from your site to Gene by combining this base

http://www.ncbi.nlm.nih.gov/gene/

with the GeneID. For example, to link to GeneID 1, use this URL:

http://www.ncbi.nlm.nih.gov/gene/1

URLs that query Gene are constructed by adding ?term=[search term] to the base URL.

For example, to find records in Gene containing the phrase ‘immunoglobulin domain’, use this URL:

http://www.ncbi.nlm.nih.gov/gene/?term=immunoglobulin_domain

More examples of queries are provided on Gene’s Home page, and general rules for building URLs to query Entrez databases are provided in the Creating a Web Link to the Entrez Databases chapter of this book. The valid display options are also documented in that chapter.
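A site that stores GeneIDs or query terms could generate both kinds of link with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and percent-encoding the search term with urllib.parse.quote is a choice made for this example (the URL shown above joins the two words with an underscore instead).

```python
from urllib.parse import quote

# Base URL taken from the examples above.
GENE_BASE = "http://www.ncbi.nlm.nih.gov/gene/"


def gene_record_url(gene_id):
    """Return the URL that displays the record for one GeneID."""
    gene_id = int(gene_id)
    if gene_id <= 0:
        raise ValueError("GeneID must be a positive integer")
    return f"{GENE_BASE}{gene_id}"


def gene_query_url(term):
    """Return the URL that runs a query against Gene."""
    # Percent-encode spaces, brackets, and quotes in the search term.
    return f"{GENE_BASE}?term={quote(term)}"


print(gene_record_url(1))
print(gene_query_url("BRCA1[sym]"))
```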

Historical Information about LocusLink

This version of Gene's help document removed detailed information about LocusLink. If you have any questions about the history of LocusLink, please use this form.
