Whole Genome Shotgun Submissions

DDBJ/EMBL/GenBank accepts both complete and incomplete genomes. Whole Genome Shotgun (WGS) sequencing projects are incomplete genomes or incomplete chromosomes that are being sequenced by a whole genome shotgun strategy. WGS projects may be annotated, but annotation is not required. The pieces of a WGS project are the contigs (sequences assembled from overlapping reads), and they do not include any gaps. An AGP file can be submitted to indicate how the contig sequences are assembled into scaffolds (contig sequences separated by gaps) and/or chromosomes. The gap-free contig sequences are required as the basic units of every WGS project.
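To illustrate how an AGP file relates gap-free contigs to a scaffold, here is a minimal sketch. It assumes the nine-column, tab-delimited layout of the AGP 2.0 specification; the function name, the fixed gap length, and the gap type and linkage evidence values ("scaffold", "yes", "paired-ends") are illustrative assumptions, not values this page prescribes. Check the current AGP specification before relying on it.

```python
from typing import List, Tuple


def scaffold_agp_lines(
    scaffold_id: str,
    contigs: List[Tuple[str, int]],  # (contig identifier, contig length in bp)
    gap_len: int = 100,              # ASSUMPTION: every gap has this length
) -> List[str]:
    """Return AGP-style lines placing contigs end to end with gaps between."""
    lines = []
    pos = 1    # 1-based start coordinate on the scaffold
    part = 1   # running part number within the scaffold
    for i, (contig_id, length) in enumerate(contigs):
        if i > 0:
            # Gap line between consecutive contigs (component type "N").
            end = pos + gap_len - 1
            lines.append("\t".join([
                scaffold_id, str(pos), str(end), str(part),
                "N", str(gap_len), "scaffold", "yes", "paired-ends",
            ]))
            pos = end + 1
            part += 1
        # Component line for a WGS contig (component type "W"), plus strand.
        end = pos + length - 1
        lines.append("\t".join([
            scaffold_id, str(pos), str(end), str(part),
            "W", contig_id, "1", str(length), "+",
        ]))
        pos = end + 1
        part += 1
    return lines


if __name__ == "__main__":
    # Hypothetical scaffold and contig names, for illustration only.
    for line in scaffold_agp_lines("scaffold1", [("contig1", 500), ("contig2", 300)]):
        print(line)
```

The contigs themselves stay gap-free; the gaps exist only as lines in the AGP file.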

WGS projects without annotation require at least two weeks to be processed. Projects with annotation require at least one month for processing. Please submit your project with enough lead time.

See the submission instructions. General information about WGS projects is below. If you have a large annotated genome, we recommend sending us a test file via GenomesMacroSend so that any problems can be found before you generate the entire project.

See the list of WGS projects.

Introduction

Each WGS project is assigned a stable 4-letter WGS accession prefix, which does not change as the project is updated. In addition to the WGS accession prefix, the contig identifiers have a version number corresponding to a particular WGS project update. Finally, each individual contig within the assembly is assigned a unique accession number prefixed by the WGS accession prefix and version number. For instance, if a WGS project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (The last six digits of this ID identify each individual contig.) When additional sequencing is done and the genome is reassembled, the contigs are submitted as the 02 version of the WGS project. No linkage or relationship is expected between the old and new contigs, and the new contigs are given new accession numbers beginning with XXXX02000001. The 01 contigs are suppressed when the 02 contigs are released.
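The accession layout described above (4-letter prefix, 2-digit version, 6-digit contig number) can be sketched as a small parser and formatter. This is only an illustration of that layout; the class and function names are hypothetical, and accession formats other than the one described here are not handled.

```python
import re
from typing import NamedTuple


class WgsAccession(NamedTuple):
    prefix: str   # stable 4-letter project prefix
    version: int  # 2-digit assembly version (00 on the project-level accession)
    contig: int   # 6-digit contig number (000000 on non-contig accessions)


# Layout taken from the text: 4 letters + 2 digits + 6 digits.
_WGS_PATTERN = re.compile(r"([A-Z]{4})(\d{2})(\d{6})")


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split an accession such as XXXX01000001 into its three parts."""
    match = _WGS_PATTERN.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a 4+2+6 WGS accession: {accession!r}")
    return WgsAccession(match.group(1), int(match.group(2)), int(match.group(3)))


def format_wgs_accession(prefix: str, version: int, contig: int) -> str:
    """Build an accession from its parts, zero-padding version and contig."""
    return f"{prefix}{version:02d}{contig:06d}"


if __name__ == "__main__":
    print(parse_wgs_accession("XXXX01000001"))
    # First contig of the second assembly version of the same project:
    print(format_wgs_accession("XXXX", 2, 1))
```

Because no relationship is expected between old and new contigs, the contig number alone should not be used to match records across versions.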

The nucleotide data from all WGS projects have gone into the BLAST wgs database since the fall of 2011. Proteins from most WGS projects go into the BLAST nr database. Proteins from environmental projects are present in either the BLAST nr or env_nr database, depending on whether the sequence has been assigned to a particular organism (nr) or the organism is not yet known (env_nr).

See the Metagenome Submission Guide for information about how to submit the various elements of a metagenome project.

WGS Project Info

  • Submit complete organellar and viral genomes as regular GenBank records by emailing the submissions to GenBank Submissions.
  • Complete, annotated genomes should be submitted to GenBank as complete genomes. The most common complete genomes are bacteria and archaea. Complete genomes are defined for GenBank as gap-free sequences that are annotated. Therefore, the sequence does not contain Ns that represent gaps. For information about complete genomes, see the bacterial genome submission guidelines.
  • Complete genomes that lack annotation are processed as WGS projects. When annotation is added, the complete genome is given a new accession number and the WGS accession number is made secondary, so that Entrez searches for either number will retrieve the complete annotated genome.
    • NCBI has a publicly available Prokaryotic Genomes Annotation Pipeline. This pipeline generates files that are ready for submission to GenBank, although the submitter is welcome to edit them before submission.
  • Include specific source information, such as strain or isolate name, country where the sample was collected, specimen voucher, sex, and any other relevant information. See the tbl2asn page for information on how to include source qualifiers in a submission.
  • Submit only contigs that are longer than 199 bp.
  • Remove any Ns that are at the end of a contig.
  • Split a sequence into separate contigs at any Ns that represent gaps. The only Ns allowed in contigs are internal Ns that represent ambiguous bases.
  • Include the quality scores, when possible.
  • Annotation can be included on the WGS contigs or on the scaffold or chromosome CON records that are generated from the information in the AGP file, whichever is most appropriate for the project. Annotation that is submitted on a WGS contig will be displayed in Entrez on the scaffold or chromosome that includes that contig. Similarly, if a scaffold has annotation and is a component of a chromosome CON record, then its annotation will be displayed in Entrez on the chromosome. However, annotation that is submitted on a scaffold or chromosome CON record is not displayed on the underlying components. Contact NCBI for information about annotating scaffolds or chromosomes.
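The contig rules in the list above (split at gap Ns, remove terminal Ns, keep only contigs longer than 199 bp) can be sketched as follows. Telling gap Ns apart from ambiguous-base Ns depends on the assembler; the run-length threshold used here (runs of 10 or more Ns are treated as gaps) is purely an assumption for illustration, as are the function and constant names.

```python
import re
from typing import List

MIN_CONTIG_LEN = 200  # the list above asks for contigs longer than 199 bp
GAP_RUN_LEN = 10      # ASSUMPTION: a run of >= 10 Ns is a gap, shorter runs are ambiguous bases


def split_into_contigs(
    seq: str,
    gap_run_len: int = GAP_RUN_LEN,
    min_len: int = MIN_CONTIG_LEN,
) -> List[str]:
    """Split a gapped sequence into gap-free contigs that satisfy the rules."""
    # Build a pattern such as [Nn]{10,} that matches one gap run.
    gap_pattern = re.compile(rf"[Nn]{{{gap_run_len},}}")
    contigs = []
    for piece in gap_pattern.split(seq):
        # Remove Ns left at either end; internal ambiguous Ns are kept.
        trimmed = piece.strip("Nn")
        if len(trimmed) >= min_len:
            contigs.append(trimmed)
    return contigs


if __name__ == "__main__":
    # Synthetic sequence: 250 bp, a 20-N gap, then 253 bp with 3 internal Ns.
    example = "A" * 250 + "N" * 20 + "C" * 100 + "NNN" + "G" * 150 + "NN"
    for contig in split_into_contigs(example):
        print(len(contig))
```

If the assembler records gap positions explicitly, splitting at those positions is preferable to any run-length heuristic.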

The table below shows three examples of WGS projects that have both contigs and scaffolds. One is unannotated and the others have annotation on either the contigs or the scaffolds. When the contigs are annotated, that annotation is displayed on the corresponding scaffold in Entrez. Annotated records are shown in the GenBank (Full) view. The accession number of each WGS project is included in the table:

Annotated Contigs | Annotated Scaffolds | No Annotation
ACZS00000000      | ABXC00000000        | AAGU00000000
WGS contig        | WGS contig          | WGS contig
Scaffold CON      | Scaffold CON        | Scaffold CON

Updating a WGS Project

If the same version of a WGS project is being updated (for example, to add annotation), then the SeqIDs must be identical to those in the original submission and the accession numbers must be included in the update, for both nucleotides and proteins. The correct format of the identifiers in such an update is:

gnl|WGS:XXXX|SeqID|gb|XXXX01xxxxxx

where XXXX is the accession prefix and XXXX01xxxxxx is the contig's accession number. We recommend that you send a test file to NCBI with details of your plans before generating a complicated update.
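As a minimal sketch, the identifier format above can be assembled from its three parts like this. The function name and the example SeqID "contig01" are hypothetical; the checks only guard against obvious mistakes and do not replace NCBI validation.

```python
def wgs_update_seqid(prefix: str, seq_id: str, accession: str) -> str:
    """Build gnl|WGS:<prefix>|<SeqID>|gb|<accession> for a same-version update."""
    if "|" in seq_id:
        # A vertical bar inside the SeqID would corrupt the delimited format.
        raise ValueError(f"SeqID must not contain '|': {seq_id!r}")
    if not accession.startswith(prefix):
        # The contig accession is expected to begin with the project prefix.
        raise ValueError(f"accession {accession!r} does not start with {prefix!r}")
    return f"gnl|WGS:{prefix}|{seq_id}|gb|{accession}"


if __name__ == "__main__":
    # "contig01" is a made-up SeqID; use the SeqID from the original submission.
    print(wgs_update_seqid("XXXX", "contig01", "XXXX01000001"))
```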
 

If you need additional assistance in preparing WGS submissions, please contact genomes@ncbi.nlm.nih.gov.

Last updated: 2012-08-27T09:47:39-04:00