How to Submit WGS Genomes

Introduction

DDBJ/EMBL/GenBank accepts both complete and incomplete genomes. Whole Genome Shotgun (WGS) sequencing projects are incomplete genomes or incomplete chromosomes that are being sequenced by a whole genome shotgun strategy. See the WGS project info section for details on what is and is not suitable for submission as a WGS project.

The pieces of a WGS project are the contigs (overlapping reads), and they do not include any gaps. An AGP file can be submitted to indicate how the contig sequences are assembled together into scaffolds (contig sequences separated by gaps) and/or chromosomes. We must have the contig sequences without gaps as the basic units for all WGS projects.

WGS projects may be annotated, but annotation is not required. NCBI has a publicly available Prokaryotic Genomes Annotation Pipeline (PGAAP). This pipeline generates files that are ready for submission to GenBank, although the submitter is welcome to edit them before submission to GenBank. Submit to PGAAP first, and then submit to GenBank. See below for details about PGAAP submissions to GenBank.

Information about the requirements for more complex assemblies, such as those with PARs or alternate loci, is in the Assembly Submission pages.

WGS genomes without annotation, or with PGAAP annotation, require at least two weeks to be processed. Genomes with other annotation require at least one month for processing. Please submit your project with enough lead time.

Below are detailed instructions for preparing a simple WGS submission.

Requirements

  1. Register the genome sequencing project with the BioProject DB, if you have not already done this.  Do not register a duplicate BioProject for the same genome.
  2. Make the genome assembly data files.
  3. Run the commandline program tbl2asn to generate the .sqn file for submission and the validation and discrepancy report files.
  4. Fix problems that are indicated in the .val and discrep files. Failure to do this will cause serious delays in processing.
  5. If you have higher-level assembly information, scaffolds and/or chromosomes, then generate an AGP file to build those objects from the wgs-contigs.
  6. Generate Assembly information files if your assembly contains complete chromosomes, plasmids or linkage groups (including the mitochondrion), or if it assembles them from wgs contigs.
  7. Submit via the new Genomes (WGS) submission portal.
  8. Submit the unassembled sequence reads to SRA, the Sequence Read Archive. For questions about submitting, contact the SRA HelpDesk via the "Write to the Help Desk" link at the bottom of the SRA page. Be sure to include in your submissions and correspondence the BioProject ID that was assigned when you registered your project with the BioProject database.

  • What happens after submission
  • Submitting PGAAP-annotated genomes
  • Submission Details:

    Data files:

    The specifics of the file formats are presented in the tbl2asn page.

    • generate a template file with submitter and publication information.
    • put the contig sequences into fasta format. These files have the suffix .fsa. Each sequence has a definition line beginning with a '>' followed by a unique identifier, eg contig001, contig002, etc. Note that this unique identifier appears in the DEFINITION line in the flatfile view of the record. Please use concise names that do not include length or coverage information. See some example files.
      • These sequences are the contigs made from overlapping reads. The only allowed Ns are internal Ns that represent ambiguous bases. Remove any Ns that are at the end of a contig. Ns that represent gaps are not allowed, so split any sequence containing runs of Ns that represent gaps into individual contigs at those runs. This is true even for sequences that were assembled by assemblers that connect mate-pairs. That higher-level assembly information can be presented in an AGP file. The .fsa files can have up to 10,000 sequences per file. Larger submissions need to be split into multiple files.
      • Submit only contigs longer than 199 nt (that is, 200 nt or more).
      • The source information can be included in the defline of each contig or in the tbl2asn commandline. It is included in the same format, in either place. At a minimum the source information is the organism and the relevant strain, breed, cultivar or isolate, if one exists for the sequenced organism. Note that the organism for microbial genomes (prokaryotes and fungi) still includes the strain, eg Escherichia coli BCE032-DM-B or Saccharomyces cerevisiae FL100, so the strain name appears in the organism and also in the strain, as seen in the example below. Bacterial (and other) isolates should also include the isolation source, eg the host or environmental isolation_source, when known. Bacterial annotated assemblies should also include the genetic code, [gcode=11], to reduce the number of false errors.

        Human pathogens should include:

        • host, eg Homo sapiens or spinach or Bos taurus
        • isolation source. This should be non-identifiable information, eg "stool sample from patient with fever" or "blood"
        • collection_date
        • country where the sample was isolated
        • latitude/longitude, in decimal coordinate system, where the sample was collected
        Here is an example of the source information for an annotated bacterial submission:

        [organism=Clostridium difficile ABDC] [strain=ABDC] [host=Homo sapiens] [country=Canada: Montreal] [collection_date=2008] [isolation-source=stool sample] [note=isolated from an outbreak in Montreal] [gcode=11]

      • Contigs that are part of a plasmid, or an organellar chromosome, or specific nuclear chromosomes need to have that information included in the fasta definition line, in these formats:
        • [plasmid-name=pBR322]
        • [plasmid-name=unnamed] (when the plasmid name is not known; however, be sure that each plasmid has a unique name)
        • [location=mitochondrion]
        • [location=chloroplast]
        • [chromosome=2] (this is usually included only for smaller genomes when the chromosome assignment is certain; don't do this for large, eg mammalian, genomes when reassembly may result in a contig's assignment to a different chromosome)
    • quality scores of the sequences. These files correspond to and have the same basenames as the .fsa files, but have the suffix .qvl. The quality scores are optional, but desired.
    • annotation files, if appropriate. These correspond to and have the same basenames as the .fsa files, but have the suffix .tbl. The .tbl files have a 5-column tab-delimited format, as described in the annotation instruction pages.

      Be sure to read the annotation requirements in the appropriate annotation guidelines.

      You may be interested to know that NCBI has a publicly available Prokaryotic Genomes Annotation Pipeline (PGAAP). This pipeline generates files that are ready for submission to GenBank, although the submitter is welcome to edit them before submission to GenBank. Submit to PGAAP first, and then submit to GenBank. See below for details about PGAAP submissions to GenBank.
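    The contig rules above (no gap Ns, no Ns at contig ends, only contigs longer than 199 nt) can be sketched in Python as follows. This is only an illustrative sketch, not an NCBI tool. In particular, the threshold that decides when a run of Ns counts as a gap rather than as ambiguous bases (`GAP_RUN_LEN`) is an assumption made for this example; choose a value that matches how your assembler writes gaps. The function names are hypothetical.

```python
import re

MIN_CONTIG_LEN = 200  # the instructions ask for contigs >199 nt
GAP_RUN_LEN = 10      # ASSUMPTION: runs of 10 or more Ns are treated as gaps


def split_into_contigs(seq, gap_run_len=GAP_RUN_LEN, min_len=MIN_CONTIG_LEN):
    """Split a scaffold-like sequence into gap-free contigs."""
    gap = re.compile(r"[Nn]{%d,}" % gap_run_len)
    contigs = []
    for piece in gap.split(seq):
        piece = piece.strip("Nn")  # remove Ns left at either end of a contig
        if len(piece) >= min_len:  # drop contigs shorter than 200 nt
            contigs.append(piece)
    return contigs


def to_fasta(contigs, prefix="contig"):
    """Format contigs as FASTA records named contig001, contig002, ..."""
    lines = []
    for i, seq in enumerate(contigs, start=1):
        lines.append(">%s%03d" % (prefix, i))
        lines.extend(seq[j:j + 60] for j in range(0, len(seq), 60))
    return "\n".join(lines) + "\n"
```

    Short internal runs of Ns (below the assumed threshold) are kept as ambiguous bases. The gaps removed here are the information that would then be recorded in an AGP file.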

    Run tbl2asn:

    Put all the files in the same directory, and run tbl2asn (version 19.6 or higher). The basic commandline is:

    • tbl2asn -p path_to_files -t template -M n -Z discrep
    • if you put the template file in a different directory, then include the full path to that file
    • if you want to use a particular file, then use -i file_name, instead of -p path_to_files
    • including "-Z discrep" runs the discrepancy report, which looks for inconsistencies or other suspicious problems, so it is most appropriate for annotated submissions.
    • You can include the source information in the definition line of each contig, as described in the source info above. Alternatively, all of the common source information can be included with -j in the tbl2asn commandline. Note that the organism for microbial genomes (prokaryotes and fungi) still includes the strain, eg Escherichia coli BCE032-DM-B or Saccharomyces cerevisiae FL100, so the strain name appears in the organism and also in the strain, as seen in the example below. In addition, if the submission is an annotated prokaryotic genome, then include the genetic code with -j in the commandline:
      • tbl2asn -p path_to_files -t template -M n -Z discrep -j "[organism=Clostridium difficile ABDC] [strain=ABDC] [host=Homo sapiens] [country=Canada: Montreal] [collection_date=2008] [isolation-source=stool sample] [note=isolated from an outbreak in Montreal] [gcode=11]"
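    If you script your submissions, the commandline above can be assembled programmatically. The following Python sketch uses only the tbl2asn options mentioned on this page (-p, -t, -M n, -Z discrep, -j). The helper names, the directory name `genome_dir` and the template name `template.sbt` are hypothetical placeholders chosen for this example.

```python
import shlex


def source_modifiers(mods):
    """Render (name, value) pairs as '[name=value] [name=value] ...'."""
    return " ".join("[%s=%s]" % (name, value) for name, value in mods)


def tbl2asn_command(path_to_files, template, mods=None):
    """Return the tbl2asn argument list, using only the flags described above."""
    cmd = ["tbl2asn", "-p", path_to_files, "-t", template,
           "-M", "n", "-Z", "discrep"]
    if mods:
        # Common source information goes into a single -j argument.
        cmd += ["-j", source_modifiers(mods)]
    return cmd


mods = [
    ("organism", "Clostridium difficile ABDC"),
    ("strain", "ABDC"),
    ("gcode", "11"),
]
cmd = tbl2asn_command("genome_dir", "template.sbt", mods)
# Print a shell-quoted form of the command for review before running it.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

    Keeping the command as a list means it could be passed directly to `subprocess.run(cmd)`, assuming tbl2asn is installed and on your PATH, without any manual quoting of the bracketed source string.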

    Check the Output of the Validation and Discrepancy Report and Fix Problems

    • Check the errorsummary.val file for the number, severity and type of errors that are present in the .val files. All Errors and Rejects need to be fixed (as of March 2012, and v19.6 or higher). The presence of errors will slow processing.  Contact genomes@ncbi.nlm.nih.gov with any questions about the validation output.  During processing there may be some questions about other aspects of the submission.
    • Check the file named 'discrep' for the results of the discrepancy report. Categories prefaced with FATAL are always unacceptable and must be fixed (FATAL tags were added in January 2012). Some of the categories are informational. Reports that are not flagged as fatal need to be evaluated to determine if they represent annotation artifacts that need to be corrected or if they are acceptable due to the biology of the genome. See the discrepancy report examples and explanations for guidance. Write to genomes@ncbi.nlm.nih.gov and send the discrep file with questions about this report.
    • Make any necessary fixes to the input .fsa and/or .tbl files and run tbl2asn again. Or make the necessary fixes directly to the .sqn file by opening it in Sequin and editing the features there.
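    As a quick first pass over a large discrepancy report, you could list the lines tagged FATAL. The Python sketch below assumes only what is stated above, namely that unacceptable categories are prefaced with FATAL; the exact layout of the discrep file is not described here, so treat the line-based matching as an assumption. The function name and the category names in the sample text are invented placeholders, not real report categories.

```python
def fatal_lines(discrep_text):
    """Return the report lines that begin with the FATAL tag."""
    return [
        line.strip()
        for line in discrep_text.splitlines()
        if line.lstrip().startswith("FATAL")
    ]


# Invented sample text, for illustration only.
sample = """\
FATAL: EXAMPLE_CATEGORY_A: 3 features have a problem
EXAMPLE_CATEGORY_B: 12 features are informational
  FATAL: EXAMPLE_CATEGORY_C: 1 sequence has a problem
"""
for line in fatal_lines(sample):
    print(line)
```

    This does not replace reading the whole report: categories that are not flagged FATAL still need to be evaluated as described above.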

    AGP file

    AGP files provide the ordering and orientation information to construct supercontigs or scaffolds from contigs, or to construct chromosomes from scaffolds and/or contigs.  The AGP file defines these genome assemblies, so include all wgs-contigs that are considered to be part of the genome in the AGP file.

    See this page for the AGP format.

    Some specific requests are:

    • Encode the type of object in the object names in column 1 like this:
      • scaffold01, scaffold02, etc for scaffolds
      • chr (for bacteria) OR chr1, chr2, etc for eukaryotic chromosomes
      • the plasmid names for plasmids (eg pBR322). If the name is not known, then use 'unnamed'.
      • MT for the mitochondrial genome. MT_scaf01, MT_scaf02, etc for mitochondrial scaffolds.
    • Use "100" as the length and U as the component-type for gaps of unknown size, as that is the GenBank convention. These will appear as gap(unk100) in the flatfile view of the GenBank record.
    • Use the same contig identifiers in column 6 (the component-id) that you used in the .fsa files. If the components have already been assigned accession numbers, then you need to use the accession.version numbers as the component identifiers; do not use just the accession number.
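    To illustrate these requests, the Python sketch below writes the AGP rows for one scaffold built from ordered contigs, using U and a length of 100 for each gap of unknown size. It is a sketch under stated assumptions: the rows follow the AGP 2.0 column layout, component type W marks a WGS contig, and the gap type (`scaffold`), linkage (`yes`) and linkage evidence (`paired-ends`) are example values that you would replace with whatever fits your assembly. The function name is hypothetical.

```python
def scaffold_agp_lines(scaffold_name, contigs, gap_len=100):
    """Build tab-delimited AGP rows for one scaffold.

    contigs is an ordered list of (contig_id, length, orientation) tuples.
    """
    rows = []
    pos = 1   # next free 1-based coordinate on the scaffold
    part = 1  # AGP part number within the scaffold
    for i, (contig_id, length, orientation) in enumerate(contigs):
        if i > 0:
            # Gap of unknown size between consecutive contigs.
            # ASSUMPTION: gap type, linkage and evidence are example values.
            gap_end = pos + gap_len - 1
            rows.append([scaffold_name, pos, gap_end, part, "U", gap_len,
                         "scaffold", "yes", "paired-ends"])
            pos = gap_end + 1
            part += 1
        end = pos + length - 1
        rows.append([scaffold_name, pos, end, part, "W", contig_id,
                     1, length, orientation])
        pos = end + 1
        part += 1
    return ["\t".join(str(value) for value in row) for row in rows]


for line in scaffold_agp_lines("scaffold01",
                               [("contig001", 500, "+"), ("contig002", 300, "-")]):
    print(line)
```

    The contig identifiers in column 6 must be the same ones used in the .fsa files, or the accession.version numbers if the contigs already have accessions.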

    A standalone commandline program, agp_validate, is available by anonymous FTP so that you can validate the AGP file yourself. The -help option details the arguments and commandline format. The simplest invocation, which just checks the format, is: agp_validate *.agp >& agp_val

    Plan for adopting v2.0 of the AGP specifications

    NCBI is switching from version 1.1 of the AGP specification to the new version, version 2.0, because the latter can convey valuable information on the nature of the evidence linking sequences on either side of a gap. Files in AGP2.0 will be accepted beginning Feb. 10, 2012 and will be required beginning July 1, 2012. See the full announcement.

    Assembly Information Files

    Additional files are needed if the genome assembly includes chromosomes (or plasmids or linkage-groups) or unlocalized scaffolds (scaffolds known to be part of a particular chromosome/plasmid/linkage-group, but whose location is not known).

    Contigs only (no AGP file), but some contigs are the complete chromosome or plasmid: Submit .sqn file(s) from tbl2asn plus a 4-column comma-separated chr/plasmid description file of ContigID,chromosome/plasmid name,Location,Type. The columns of the file are:

    • Column 1 is the contig name that is the id from the fasta file.
    • Column 2 is the official name of the chromosome, plasmid or linkage group. For example, 1, 2, 3 or I, II, III for chromosomes; pBR322 for plasmids; or LG1, LG2 for linkage groups. Use 'chr' for the prokaryotic chromosome name and 'MT' for the mitochondrial chromosome name when there is a single chromosome.
    • Column 3 is the subcellular location of the genomic molecule. Use 'na' for prokaryotes, and either 'nuclear' or the appropriate organelle, eg 'mitochondrion', for eukaryotes.
    • Column 4 is the type of genomic molecule and has a controlled vocabulary: chromosome, plasmid, linkage group.
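    A minimal Python sketch of writing this 4-column description file is shown below. The column order and the controlled vocabulary for column 4 come from the list above; the function name and the example rows (a prokaryotic assembly with one chromosome and one plasmid) are hypothetical.

```python
import csv
import io

# Controlled vocabulary for column 4, as listed above.
ALLOWED_TYPES = {"chromosome", "plasmid", "linkage group"}


def chromosome_description(rows):
    """Render (contig_id, name, location, type) rows as 4-column CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for contig_id, name, location, mol_type in rows:
        if mol_type not in ALLOWED_TYPES:
            raise ValueError("unexpected type: %r" % mol_type)
        writer.writerow([contig_id, name, location, mol_type])
    return out.getvalue()


# Hypothetical prokaryotic example: location is 'na', chromosome is named 'chr'.
text = chromosome_description([
    ("contig001", "chr", "na", "chromosome"),
    ("contig002", "pBR322", "na", "plasmid"),
])
print(text, end="")
```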

    AGP file and chromosome/plasmid/linkage-group information: Submit .sqn file(s) from tbl2asn plus the following, depending upon the contents of the genome assembly submission:

    • Separate AGP files for:
      • unplaced scaffolds. These are scaffolds that have no chromosome information.
      • unlocalized scaffolds (also known as 'random'). These have a chromosome assignment but the location on the chromosome is unknown. For example, if there were several scaffolds that were mitochondrial but they were not assembled into the mitochondrial chromosome, then these would be unlocalized scaffolds.
      • chromosomes, plasmids and/or linkage groups.
    • Chromosome Types file when the assembly includes chromosome, plasmid and/or linkage group information. This is a 3-column comma-separated table with these values in the columns:
      • Column 1 is the official name of the chromosome/plasmid/linkage-group, eg 1, 2, 3 or I, II, III for chromosomes; pBR322 for plasmids; or LG1, LG2 for linkage groups. Use 'MT' for the mitochondrial chromosome name and 'chr' for the prokaryotic chromosome name when there is only a single chromosome. If the plasmid has not been named, then use 'unnamed'; however, be sure that each plasmid has a unique name. The value in column 1 must match the value in column 2 of the AGP Roles files, below.
      • Column 2 is the subcellular location of the genomic molecule. Use 'na' for prokaryotes and either 'nuclear' or the appropriate organelle, eg 'mitochondrion', for eukaryotes.
      • Column 3 is the type of genomic molecule and has a controlled vocabulary: chromosome, plasmid, linkage group.
    • AGP Roles, 2-column comma-separated file for unlocalized scaffolds:
      • Column 1 is the name of the object in column 1 of the corresponding AGP file.
      • Column 2 is the chromosome/plasmid/linkage-group of that object, using the name from column 1 of the Chromosome Types table. Therefore, use 'MT' for the mitochondrial chromosome name and 'chr' for the prokaryotic chromosome name when there is only a single chromosome.
    • AGP Roles, 2-column comma-separated file for chromosomes/plasmids/linkage-groups:
      • Column 1 is the name of the object in column 1 of the corresponding AGP file.
      • Column 2 is the chromosome/plasmid/linkage-group of that object, using the name from column 1 of the Chromosome Types table. Therefore, use 'MT' for the mitochondrial chromosome name and 'chr' for the prokaryotic chromosome name when there is only a single chromosome.
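    Because column 2 of each AGP Roles file must use a name from column 1 of the Chromosome Types file, a small consistency check before uploading can be helpful. The Python sketch below is one way to do that; the function names and the sample file contents are hypothetical, and it checks only this one cross-reference, not the rest of the format.

```python
import csv
import io


def read_csv_rows(text):
    """Parse comma-separated text into rows, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


def unmatched_roles(chromosome_types_text, agp_roles_text):
    """Return AGP Roles rows whose column 2 is not a Chromosome Types name."""
    known = {row[0].strip() for row in read_csv_rows(chromosome_types_text)}
    return [row for row in read_csv_rows(agp_roles_text)
            if row[1].strip() not in known]


# Hypothetical file contents, for illustration only.
chromosome_types = "1,nuclear,chromosome\nMT,mitochondrion,chromosome\n"
agp_roles = "MT_scaf01,MT\nscaffold07,2\n"
for row in unmatched_roles(chromosome_types, agp_roles):
    print("no matching Chromosome Types entry for:", ",".join(row))
```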

  • Submit your files

    NEW: Upload the .sqn and .agp files and chromosome/plasmid information files to GenBank in the Genomes (WGS) submission portal. If the files in a submission total more than 2 GB, use the Chrome, Opera or Safari browsers, which do not limit the upload size. You can also compress large files to reduce their size for the upload.

    You will be asked to provide additional information, including:
    • BioProjectID from BioProject DB
    • Release date: After Processing OR a specific date
    • Whether you expect to annotate (or did annotate) this version of the genome assembly
    • Whether this is the final version of the genome. If you don't anticipate updating the assembly, then choose "yes".
    • Whether this genome is part of a multi-isolate study.
    • Whether this is a de novo assembly. If it was not, then you'll include the accession.version and/or the assembly name of the genome assembly that was used as the reference guide.
    • Assembly metadata:
      • Assembly Name: a short name suitable for display eg, LoxAfr_3.0 for a Loxodonta africana assembly, version 3.0
      • Assembly Method and version (or date the program was run): eg, Newbler v. 2.3 OR Celera Assembly v. May 2010
      • Genome Coverage: eg, 12x
      • Sequencing Technologies: eg, ABI 3730; 454 GS-FLX Titanium; Illumina GAIIx

    The submission will be given a 'SUB' temporary identifier which you can use in correspondence before an accession number is assigned to the genome submission.

    What happens next:

    Once we receive your genome submission, a member of our staff will conduct an initial review of it. If there are problems with the initial submission, the submitted files will be marked in the submission portal as "Processed-error" and you will receive an email with details of the problems. The problems, including those described in the Fix problems section, could be:

    • any Error-level errors and some Warning-level errors from the validation. You would see these in the .val file(s) generated by tbl2asn.
    • any FATAL or problem categories from the discrepancy report (in the discrep file from tbl2asn)
    • any sequence contamination in the .sqn files
    • bad format of the AGP file or inconsistencies between the AGP and .sqn files.

    Once you have made the fixes, log back into the Genomes (WGS) submission portal, retrieve that submission by its 'SUB' identifier and click "Resubmit" on that submission. You'll be back in the original submission and will need to delete the files that are marked as having errors, and then upload new files in their place. This resubmission will be a replacement of the original and will have a new 'SUB' temporary identifier.

    If the initial review by a member of our staff finds no significant issues with your submission or resubmission, you will be issued an accession number for the genome by email. After your submission is assigned an accession number, it undergoes a thorough review by our staff. This review is critical because we are striving to present genome annotation in an accurate and consistent manner so that database users can make maximum use of the data. If we encounter problems during this review, we will contact you by email.

    Submission statuses in the submission portal:

    • Queued: the submission is waiting for initial review
    • Processed-error: some or all files need to be fixed and resubmitted (with a different name)
    • Processing: the submission has passed the initial review and will be processed by NCBI staff. We will contact you during processing if the submission has issues that require additional information.

    If you elected to hold your genome until a particular date (or publication, whichever is first), we ask that you provide us with the expected publication date and also notify us in a timely manner of the upcoming publication and the relevant citation details. This will allow us to coordinate the release of your genome with the appearance of the paper. Please provide at least two weeks' notice of any upcoming publication.

  • Submitting PGAAP-annotated genomes

    You submit to PGAAP first, before you prepare your GenBank WGS submission. Note that PGAAP submissions require a different file format than WGS submissions. See the detailed PGAAP instructions here, http://www.ncbi.nlm.nih.gov/genomes/static/Annotation_pipeline_README.txt

    PGAAP will provide you with the annotated contig files which will be in the correct format for a WGS submission. You can review the annotation and make any changes that you think are necessary. If you do not make any changes, then you just need to send a request to genomes@ncbi.nlm.nih.gov to submit your PGAAP-annotated genome to GenBank, and include the directory name for the PGAAP-generated files. We will get the files from there and process them.

    If you do make changes to the annotation, then you will need to generate a .sqn file as described in the Run tbl2asn section, and then submit via the new Genomes (WGS) submission portal. Be sure to include in the comment box of the submission that the annotation is from PGAAP with your modifications.

    NOTE: In correspondence about using the PGAAP resource, include "PGAAP" in the subject line to be sure your messages are seen by the correct group.

    Last updated: 2012-08-27T09:10:55-04:00