NCBI

FAQ

  1. Can I submit annotation as a GenBank flatfile?
  2. I want all of the WGS contigs in my assembly available to users. Should I put singleton WGS contigs into the AGP?
  3. Can I submit an assembly and have it held back until I publish my paper?
  4. I'm using second generation sequencing technology. Can I still submit an assembly?
  5. My assembly was made with Velvet or Abyss. Do I need to split the sequences at the Ns that the assembler inserted?
  6. What should I use for the gap sizes?
  7. I concatenated the sequences into the correct order with Ns between each sequence and annotated this pseudomolecule. Can I submit this annotated pseudomolecule?
  8. I concatenated the sequences in a random order with Ns between each sequence and annotated the pseudomolecule. Can I submit the annotated pseudomolecule?
  9. Can I annotate across gaps?
  10. Do I need to submit my genome assembly with annotation?
  11. Does NCBI have an annotation pipeline that can be used to annotate my assembly?
  12. If I do have my own annotation, in what format should I provide this data?
  13. My genome assembly has contigs and scaffolds. Should I submit the annotation on the contigs or the scaffolds?
1. Can I submit annotation as a GenBank flatfile?
No, we cannot accept annotation as a GenBank, EMBL or DDBJ flat file. To submit annotation, follow the instructions given for Prokaryotic annotation or Eukaryotic annotation.
2. I want all of the WGS contigs in my assembly available to users. Should I put singleton WGS contigs into the AGP?
The AGP file defines the assembly, so typically we do want all of the WGS contigs in the AGP file, especially if there is annotation. However, contigs that are not considered to be part of the assembly, perhaps because they are degenerate or duplicates, should not be included in the AGP file. In addition, for large assemblies with hundreds or thousands of singleton WGS contigs, we would prefer you omit singleton WGS contigs <200 bp from the AGP file.
3. Can I submit an assembly and have it held back until I publish my paper?
Yes, you may submit your assembly and have it held until publication.
4. I'm using second generation sequencing technology. Can I still submit an assembly?
Yes, you may submit assemblies made with second generation sequencing technology. The process is analogous to using Sanger technology. The primary second generation reads should be submitted to the Sequence Read Archive. The reads should be assembled into WGS contigs and submitted as described in the submission instructions. These WGS contigs can then be assembled into higher order molecules, which are submitted using an AGP file, as described in the submission instructions.
5. My assembly was made with Velvet or Abyss. Do I need to split the sequences at the Ns that the assembler inserted?
Yes, the only Ns allowed in the wgs-contigs are ambiguous bases within a sequence. Any Ns that represent gaps need to be removed by splitting the sequence into two contigs at each gap. The sequence can be rebuilt by providing an AGP file to assemble scaffolds. Use 10 bp as the smallest gap size.
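The splitting step can be sketched in Python as follows. This is only an illustration: the function name split_at_gaps is hypothetical, and the rule that a run of at least 10 Ns is a gap (while shorter runs are ambiguous bases) is an assumption borrowed from the minimum gap size above, not an NCBI rule. You need to know how your assembler marks gaps. The sketch also does not trim short runs of Ns left at the ends of a contig.

```python
import re

def split_at_gaps(seq, min_gap_run=10):
    """Split an assembler scaffold into contigs at runs of N.

    Assumption (not an NCBI rule): a run of at least `min_gap_run` Ns
    is a gap; shorter runs are kept as ambiguous bases.
    Returns (start, end, contig) tuples, 1-based inclusive, giving the
    position of each contig on the original sequence.
    """
    pieces = []
    pos = 0  # 0-based index where the current contig begins
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap_run, seq):
        if gap.start() > pos:  # skip empty pieces (leading or adjacent gaps)
            pieces.append((pos + 1, gap.start(), seq[pos:gap.start()]))
        pos = gap.end()
    if pos < len(seq):  # sequence after the last gap
        pieces.append((pos + 1, len(seq), seq[pos:]))
    return pieces
```

The start and end positions returned for each contig are the values an AGP file would need to place the contigs back into the scaffold.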
6. What should I use for the gap sizes?
If you have estimates of the gap sizes, then use those values for the gaps. We prefer that you use 10 as the minimum gap size. If you do not know the gap size, then use 100 as the value and the 'U' in column five of the AGP file.
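As a rough illustration of this rule, the Python sketch below picks the values for columns five and six of an AGP gap line. The function name agp_gap_fields is hypothetical. The 'N' code for a gap of estimated size comes from the AGP specification rather than from this FAQ, and raising a smaller estimate to 10 is one reading of the stated preference, so treat both as assumptions.

```python
def agp_gap_fields(estimated_size=None, min_gap=10, unknown_gap=100):
    """Return (component_type, gap_length) for AGP columns 5 and 6.

    estimated_size: gap size estimate in bp, or None if unknown.
    """
    if estimated_size is None:
        # Unknown size: type 'U' with the conventional length of 100.
        return ("U", unknown_gap)
    # Estimated size: type 'N'; assumption: raise estimates below the
    # preferred minimum up to that minimum.
    return ("N", max(estimated_size, min_gap))
```

For example, agp_gap_fields(250) gives ("N", 250) and agp_gap_fields() gives ("U", 100).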
7. I concatenated the sequences into the correct order with the Ns between each sequence and annotated the pseudomolecule. Can I submit this annotated pseudomolecule?
Yes, but you will need to submit 3 sets of files. The basis of the submission is the contigs: sequences built from overlapping reads, with no gaps. Therefore, you need to split the pseudomolecule back down into contigs and submit those as the pieces of a wgs project. Your submission will have 3 sets of files:
  • contigs: sequences constructed from the overlapping reads, with no terminal Ns or Ns that represent gaps.
  • AGP file: this provides the instructions for assembling the contigs into higher order molecules (scaffolds or chromosomes). Each gap defined in the AGP should have the same length as the corresponding run of Ns in your pseudomolecule.
  • The .sqn and .tbl files of the annotated pseudomolecule.
We will process the contigs as the pieces of the wgs project. We will make scaffolds/chromosome sequences from the AGP file and we will use the .sqn and .tbl files to put the annotation onto the scaffold/chromosome. Please let us know whether the pseudomolecule represents a chromosome sequence or a subchromosomal scaffold.
8. I concatenated the sequences in a random order with Ns between each sequence and annotated this pseudomolecule. Can I submit the annotated pseudomolecule?
The basis of the submission is the contigs (ungapped sequences constructed from overlapping reads). Therefore, you need to split the pseudomolecule into the contig sequences and submit those as the pieces of a wgs project. Since the annotated sequence does not correspond to a biological molecule, you need to map the annotation down to the contig level. You can use the offset in the .tbl file to avoid recalculating if desired, as shown here.
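If you prefer to recalculate the coordinates rather than use the offset, the mapping can be sketched in Python as below. The function name to_contig_coords and the placements list (contig identifier plus its 1-based start and end on the pseudomolecule) are hypothetical, and the sketch assumes single-interval features. Features that fall in a gap or span more than one contig are returned as None so they can be reviewed by hand.

```python
from bisect import bisect_right

def to_contig_coords(feature_start, feature_end, placements):
    """Map a feature from pseudomolecule to contig coordinates.

    placements: list of (contig_id, start, end), sorted by start,
    1-based inclusive positions on the pseudomolecule.
    Minus-strand features (start > end) keep their orientation.
    Returns (contig_id, start, end) or None if the feature does not
    lie entirely within one contig.
    """
    lo, hi = sorted((feature_start, feature_end))
    starts = [p[1] for p in placements]
    i = bisect_right(starts, lo) - 1  # last contig starting at or before lo
    if i < 0:
        return None
    contig_id, c_start, c_end = placements[i]
    if hi > c_end:  # runs into a gap or into the next contig
        return None
    offset = c_start - 1
    return (contig_id, feature_start - offset, feature_end - offset)
```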
9. Can I annotate across gaps?
Annotation is allowed to cross gaps of estimated size, but not those of unknown sizes. However, we discourage annotation across gaps unless there is evidence that the translation on the other side of the gap is in the correct frame. In addition, if >50% of the translation is Xs (i.e. in the gap) then the CDS should be made partial at the gap, or split into two partial CDSs, as described for genes split across two contigs, depending upon the confidence of the translation on both sides of the gap.
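The 50% check is simple to automate. The Python sketch below is a minimal illustration with a hypothetical function name, mostly_gap, and it assumes the translation is given as a one-letter protein string in which X marks residues that fall in the gap. It only flags the CDS; deciding whether to make it partial or split it still depends on your confidence in the translation on each side.

```python
def mostly_gap(protein, threshold=0.5):
    """Return True if more than `threshold` of the translation is X."""
    protein = protein.rstrip("*")  # ignore a trailing stop symbol
    if not protein:
        return False
    fraction_x = protein.upper().count("X") / len(protein)
    return fraction_x > threshold
```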
10. Do I need to submit my genome assembly with annotation?
No, you can submit the genome without any annotation. Note, though, that complete bacterial genomes without annotation will be processed as WGS projects.
11. Does NCBI have an annotation pipeline that can be used to annotate my assembly?
NCBI does have annotation pipelines. NCBI can annotate complete or incomplete prokaryotic genomes using our genome annotation pipeline for prokaryotes, PGAAP. Send a request to genomes@ncbi.nlm.nih.gov for PGAAP annotation, and follow the instructions on the PGAAP page for creating the necessary files. After the genome is annotated, you can submit it to GenBank by sending a request to genomes@ncbi.nlm.nih.gov. If you have not modified the files after PGAAP annotation, then you just need to provide the path to the PGAAP files for GenBank submission.
12. If I do have my own annotation, in what format should I provide this data?
Annotation must be in the 5-column feature table format described on the tbl2asn page and in the Eukaryotic and Prokaryotic annotation instructions. We do not have a converter from other formats.
The 5-column feature table is saved as a file with the suffix .tbl, and that file is used in conjunction with the template, fasta, and optional quality score files to create the annotated genome file for submission to GenBank, as described on the tbl2asn page. The .sqn file(s) that is the output of running tbl2asn and the .tbl file (for eukaryotes) are submitted to GenBank.
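For orientation, the Python sketch below writes a small table in the 5-column layout: a ">Feature" line naming the sequence, feature lines with start, stop and feature key in the first three columns, and qualifier lines with the qualifier and its value in columns four and five. The function name feature_table and the identifiers in the usage example are hypothetical. The sketch handles single-interval features only, so check the annotation instructions for the complete format.

```python
def feature_table(seq_id, features):
    """Build the text of a 5-column feature table for one sequence.

    features: list of (start, stop, key, qualifiers), where qualifiers
    is a list of (name, value) pairs. Columns are tab-separated.
    """
    lines = [">Feature %s" % seq_id]
    for start, stop, key, qualifiers in features:
        # Feature line: columns 1-3 hold start, stop and feature key.
        lines.append("%d\t%d\t%s" % (start, stop, key))
        for name, value in qualifiers:
            # Qualifier line: columns 1-3 empty, then name and value.
            lines.append("\t\t\t%s\t%s" % (name, value))
    return "\n".join(lines) + "\n"

# Hypothetical example: one gene and its CDS on a contig named contig1.
table = feature_table("contig1", [
    (1, 300, "gene", [("locus_tag", "ABC_0001")]),
    (1, 300, "CDS", [("product", "hypothetical protein")]),
])
```

Writing the returned text to a file with the suffix .tbl gives the input that is used together with the template and fasta files.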
13. My genome assembly has contigs and scaffolds. Should I submit the annotation on the contigs or the scaffolds?
Large genomes with thousands of contigs and hundreds or thousands of scaffolds should be annotated at the scaffold level. Small genomes, e.g. bacteria, can be annotated at either level. However, processing of those small genomes will be quicker if the annotation is on the contigs.

October 15, 2010