Sequin for Database Submissions and Updates:
A Quick Guide


Introduction

Sequin is a stand-alone software tool developed by the National Center for Biotechnology Information (NCBI) for submitting and updating sequences to the GenBank, EMBL, and DDBJ databases. Sequin has the capacity to handle long sequences and sets of sequences (segmented entries, as well as population, phylogenetic, and mutation studies). It also allows sequence editing and updating, and provides complex annotation capabilities. In addition, Sequin contains a number of built-in validation functions for enhanced quality assurance.

This overview is intended to provide a quick guide to Sequin's capabilities, including automatic annotation of coding regions, the graphical viewer, quality control features, and editing features. We suggest that you read this entire document before beginning your Sequin submission. More detailed instructions on these and other functions can be found in Sequin's on-screen Help file, also available on the World-Wide Web from the Sequin homepage at:

http://www.ncbi.nlm.nih.gov/Sequin/

Email help is also available from info@ncbi.nlm.nih.gov

Before You Begin

Preparing Nucleotide and Amino Acid Data

Sequin normally expects to read sequence files in FASTA format. Note that most sequence analysis software packages include FASTA or "raw" as one of the available output formats. Population studies, phylogenetic studies, mutation studies, and environmental samples may be entered in either FASTA format, or in PHYLIP, NEXUS, MACAW, or FASTA+GAP formats if you are submitting an alignment.

See http://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp#FASTAFormatforNucleotideSequences for detailed examples of each of the various input data formats.

Prepare your sequence data files using a text editor, and save in ASCII text format (plain text). If your nucleotide sequence encodes one or more protein products, Sequin expects two files, one for the nucleotides and one for the proteins.

Definition Lines

FASTA format is simply the raw sequence preceded by a definition line. The definition line begins with a > sign and is followed immediately by a name for the sequence (your own local identification code, or sequence ID) and a title. During the submission process, indexing staff at the database to which you are submitting will change your sequence ID to an Accession number. You can embed other important information in the title, and Sequin uses this information to construct a record. Specifically, you can enter organism and strain or clone information in the nucleotide definition line and gene and protein information in the protein definition line using name-value pairs surrounded by square brackets. Example: [organism=Drosophila melanogaster] [strain=Oregon R]

Some modifier names have restricted values or formats.

The following modifiers should use only TRUE or FALSE. Example: [transgenic=TRUE].

This is the list of the remaining modifier names that you can include in your definition lines for nucleotide files:

  • acronym
  • anamorph
  • authority
  • bio-material
  • biotype
  • biovar
  • breed
  • cell-line
  • cell-type
  • chemovar
  • chromosome
  • clone
  • clone-lib
  • collected-by
  • common
  • country
  • cultivar
  • culture-collection
  • dev-stage
  • ecotype
  • endogenous-virus-name
  • forma
  • forma-specialis
  • fwd-pcr-primer-name
  • fwd-pcr-primer-seq
  • genotype
  • group
  • haplotype
  • identified-by
  • isolate
  • isolation-source
  • lab-host
  • lat-lon
  • map
  • metagenome-source
  • metagenomic
  • note
  • pathovar
  • plasmid-name
  • plastid-name
  • pop-variant
  • rev-pcr-primer-name
  • rev-pcr-primer-seq
  • segment
  • serogroup
  • serotype
  • serovar
  • sex
  • specific-host
  • specimen-voucher
  • strain
  • sub-species
  • subclone
  • subgroup
  • substrain
  • subtype
  • synonym
  • teleomorph
  • tissue-lib
  • tissue-type
  • type
  • variety

Example: [strain=BALB/c]

Some population studies are a mixture of integrated provirus and excised virion. These can be indicated by molecule and location qualifiers, e.g., [molecule=dna] [location=proviral] or [molecule=rna] [location=virion]. You can also embed [moltype=genomic] or [moltype=mRNA] to indicate from what source the molecule was isolated. If you're unsure of which modifier to use, use [note=...], and database staff will determine the appropriate modifier to use.

The modifier names that you can include in your definition lines for protein files include gene, protein, and prot_desc. They are used as follows:

A coding region feature will be created on the nucleotide sequence indicating where the protein sequence is encoded. If you specify "gene" in the protein sequence definition line, a gene that covers the coding region will be created with a locus specified by the value of "gene".

The product name for the coding region will be the "protein" value specified in the protein sequence definition line, if supplied. The product description for the coding region will be the "prot_desc" value specified in the protein sequence definition line, if supplied.

Note that the [ and ] brackets actually appear in the text. (Brackets are sometimes used in computer documentation to denote optional text. This convention is not followed here.) The bracketed information will be removed from the definition line for each sequence. Sequin can also calculate a new definition line by computing on features in the annotated record (see "Generating the Definition Line").
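The name-value convention described above can be illustrated with a short Python sketch. This is not Sequin's own parser; the function name parse_definition_line and the regular expression are assumptions made for illustration. It separates a definition line into the sequence ID, the bracketed modifiers, and the title that remains once the bracketed text is removed.

```python
import re

# Matches one [name=value] pair; the brackets are literal characters.
MODIFIER_RE = re.compile(r"\[([^\[\]=]+)=([^\[\]]*)\]")


def parse_definition_line(line):
    """Split a FASTA definition line into (seq_id, modifiers, title).

    Hypothetical helper for illustration only, not part of Sequin.
    """
    if not line.startswith(">"):
        raise ValueError("definition line must start with '>'")
    body = line[1:].strip()
    if body.startswith("["):
        # No sequence ID before the first modifier.
        seq_id, rest = "", body
    else:
        # The sequence ID is the first whitespace-delimited token after '>'.
        parts = body.split(None, 1)
        seq_id = parts[0] if parts else ""
        rest = parts[1] if len(parts) > 1 else ""
    modifiers = {
        m.group(1).strip(): m.group(2).strip()
        for m in MODIFIER_RE.finditer(rest)
    }
    # Whatever is left after removing the bracketed pairs is the title.
    title = " ".join(MODIFIER_RE.sub(" ", rest).split())
    return seq_id, modifiers, title
```

For a line such as ">ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]", the sketch returns the ID "ABC-1", three modifiers, and an empty title.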

The ability to embed this information in the definition line is provided as a convenience to the submitter. If these annotations are not present, they can be entered in subsequent forms. Sequin is designed to use this information, and that provided in the initial forms, to build a properly structured record. In many cases, the final submission can be completely prepared from these data, so that no additional manual annotation is necessary once the record is displayed.

It is much easier to produce the final submission if you let Sequin work for you in this manner.

In this example we show alternative splicing, where a single gene produces multiple messenger RNAs that encode two similar but distinct protein products. Examples for the definition lines for the nucleotide and protein files are shown here:

Nucleotide Sequence:

>eIF4E [organism=Drosophila melanogaster] [strain=Oregon R] Drosophila ...
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCA ...

Protein Sequences:

>4E-I [gene=eIF4E] [protein=eukaryotic initiation factor 4E-I]
MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGN ...
>4E-II [gene=eIF4E] [protein=eukaryotic initiation factor 4E-II]
MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPAGNTATTTAPAGDD ...

Also, please note that there must be a line break (carriage return) between the definition line and the first line of sequence. Some word processors will break a single line onto two lines without actually adding a carriage return. (This feature is known as "word wrapping".) If you are unsure whether there is a carriage return, you can either set up your word processor so it shows invisible characters like carriage returns, or view the file in a text editor that does not create artificial line breaks. The definition line itself must not have a line break within it, because the second line would then be misinterpreted as the beginning of the sequence data. The actual sequence is usually broken every 50 to 80 characters, but this is not necessary for Sequin to be able to read it.
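As a rough pre-flight check for the line-break problems described above, a file can be scanned before it is imported. The Python sketch below is an assumption-laden illustration, not a Sequin feature: the function name check_fasta and the set of characters accepted as sequence data are our own choices, and it only handles plain FASTA files (no segmented-set brackets or gap lines).

```python
import re

# Assumed set of characters allowed on a sequence line (letters, '*', '-').
SEQ_LINE_RE = re.compile(r"^[A-Za-z*\-]+$")


def check_fasta(text):
    """Return a list of (line_number, message) problems found in FASTA text."""
    problems = []
    pending_defline = None  # line number of a definition line awaiting sequence
    seen_defline = False
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith(">"):
            if pending_defline is not None:
                problems.append((pending_defline, "definition line has no sequence data"))
            pending_defline = number
            seen_defline = True
        elif not seen_defline:
            problems.append((number, "sequence data before any definition line"))
        elif SEQ_LINE_RE.match(stripped):
            pending_defline = None
        else:
            # Spaces or brackets here usually mean the definition line was split.
            problems.append((number, "not a sequence line; definition line may be broken"))
    if pending_defline is not None:
        problems.append((pending_defline, "definition line has no sequence data"))
    return problems
```

A definition line that breaks in the middle of a single word made only of letters would still pass this check, so viewing the file in a plain text editor remains the more reliable test.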

FASTA Format

There are three types of sequences that may be represented using the FASTA format: single contiguous sequences, segmented sequences, and gapped sequences.

Single Sequence

This is the definition line followed by the sequence data. A sample single sequence file is shown here:

>ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCATTGA
TGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT

Segmented Nucleotide Sequences

A segmented nucleotide entry is an earlier method for capturing a set of non-contiguous sequences that has a defined order and orientation. For example, a genomic DNA segmented set could include encoding exons along with fragments of their flanking introns. An example of an mRNA segmented pair of records would be the 5' and 3' ends of an mRNA, where the middle region has not been sequenced. To import nucleotides in a segmented set, each individual sequence must be in FASTA format with an appropriate definition line, and all sequences should be in the same file. Organism information should only be included in the definition line for the first segment. Notice that there is a square open bracket on a line by itself before the first segment and a square close bracket on a line by itself after the last segment. These square brackets are required if you are importing multiple segmented sequences, but may be omitted if you are importing a file that contains all of the segments and using the "segmented sequence" format. Sequin will also generate an additional sequence to represent the combination of the segments, and that sequence will have a distinct sequence ID. A sample segmented sequence file is shown here:

[
>m_gagei_seg1 [organism=Mansonia gagei] Mansonia gagei NADH dehydrogenase ...
ATGGAGCATACATATCAATATTCATGGATCATACCGTTTGTGCCACTTCCAATTCCTATTTTAATAGGAA
TTGGACTCCTACTTTTTCCGACGGCAACAAAAAATCTTCGTCGTATGTGGGCTCTTCCCAATATTTTATT
GTTAAGTATAGTTATGATTTTTTCGGTCGATCTGTCCATTCAGCAAATAAATAAAAGTTCTATCTATCAA
TATGTATGGTCTTGGACCATCAATAATGATTTTTCTTTCGAGTTTGGCTACTTTATTGATTCGCTTACCT
AGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGGT
TTTGGTTCACACGACCCGCTGCGGCAAACGCCTGTCAAAAAGCATTTGTAACTAATCGGATAGGCGATTT
TGGTTTATTATTAGGAATCTTAGGTTTTTATTGGATAACGGGAAGTTTCGAATTTCAAGATTTGTTCGAA
ATATTTAATAACTTGATTTATAATAATGAGGTTCAGTTTTTATTTGTTACTTTATGTGCCTCTTTATTA
>m_gagei_seg2
GGTATAATAACAGTATTATTAGGGGCTACTTTAGCTCTTGC
TCAAAAAGATATTAAGAGGGGTTTAGCCTATTCTACAATGTCCCAACTGGGTTATATGATGTTAGCTCTA
GGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCGAAGGCATTGTTGTTTT
TAGGATCCGGATCCGTTATTCATTCCATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGCCAGAATAT
GGTTTTTATGGGCGGTTTAAGAAAGCATGTGCCAATTACACAAATTGCTTTTTTAGTGGGTACACTTTCT
CTTTGTGGTATTCCACCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAGTGACAGCTGGTTGT
>m_gagei_seg3
TCAATAAAACTATGGGGTAAAGAAGAACAAAAAATAATTAACAGAAATTTTCGTTTATCTCCTTTATTAA
TATTAACGATGAATAATAATGAGAAGCCATATAGAATTGGTGATAATGTAAAAAAAGGGGCTCTTATTAC
TATTACGAGTTTTGGCTACAAGAAGGCTTTTTCTTATCCTCATGAATCGGATAATACTATGCTATTTCCT
ATGCTTATATTGGCTCTATTTACTTTTTTTGTTGGAGCCATAGCAATTCCTTTTAATCAAGAAGGACTAC
ATTTGGATATATTATCCAAATTATTAACTCCATCTATAAATCTTTTACATCAAAATTCAAATGATTTTGA
GGATTGGTATCAATTTTTAACAAATGCAACTCTTTCAGTGAGTATAGCCTGTTTCGGAATATTTACAGCA
TTCCTTTTATATAAGCCTTTTTATTCATCTTTACAAAATTTGAACTTACTAAATTTATTTTCGAAAGGGG
GTCCTAAAAGAATTTTTTTGGATAAAATAATATACTTGATATACGATTGGTCATATAATCGTGGTTACAT
AGATACGTTTTATTCAGTATCCTTAACAAAAGGTATAAGAGGATTGGCCGAACTAACTCATTTTTTTGAT
AGGCGAGTAATCGATGGAATTACAAATGGAGTACGCATCACAAGTTTTTTTATAGGCGAAGGTATCAAAT
ATT
]

Gapped Sequences

A gapped sequence represents a newer method for describing non-contiguous sequences, but only requires a single sequence identifier. A gap is represented by a line that starts with >? and is immediately followed by either a length (for gaps of known length) or "unk100" for gaps of unknown length. For example, the line ">?200" marks a gap with a known length of 200. The next sequence segment continues on the next line, with no separate definition line or identifier. The difference between a gapped sequence and a segmented sequence is that the gapped sequence uses a single identifier and can specify known-length gaps. Gapped sequences are preferred over segmented sequences. A sample gapped sequence file is shown here:

>m_gagei [organism=Mansonia gagei] Mansonia gagei NADH dehydrogenase ...
ATGGAGCATACATATCAATATTCATGGATCATACCGTTTGTGCCACTTCCAATTCCTATTTTAATAGGAA
TTGGACTCCTACTTTTTCCGACGGCAACAAAAAATCTTCGTCGTATGTGGGCTCTTCCCAATATTTTATT
GTTAAGTATAGTTATGATTTTTTCGGTCGATCTGTCCATTCAGCAAATAAATAAAAGTTCTATCTATCAA
TATGTATGGTCTTGGACCATCAATAATGATTTTTCTTTCGAGTTTGGCTACTTTATTGATTCGCTTACCT
AGTTCGAATTTGATACAAATTTATATTTTTTGGGAATTAGTTGGAATGTGTTCTTATCTATTAATAGGGT
TTTGGTTCACACGACCCGCTGCGGCAAACGCCTGTCAAAAAGCATTTGTAACTAATCGGATAGGCGATTT
TGGTTTATTATTAGGAATCTTAGGTTTTTATTGGATAACGGGAAGTTTCGAATTTCAAGATTTGTTCGAA
ATATTTAATAACTTGATTTATAATAATGAGGTTCAGTTTTTATTTGTTACTTTATGTGCCTCTTTATTA
>?200
GGTATAATAACAGTATTATTAGGGGCTACTTTAGCTCTTGC
TCAAAAAGATATTAAGAGGGGTTTAGCCTATTCTACAATGTCCCAACTGGGTTATATGATGTTAGCTCTA
GGTATGGGGTCTTATCGAGCCGCTTTATTTCATTTGATTACTCATGCTTATTCGAAGGCATTGTTGTTTT
TAGGATCCGGATCCGTTATTCATTCCATGGAAGCTATTGTTGGATATTCTCCAGATAAAAGCCAGAATAT
GGTTTTTATGGGCGGTTTAAGAAAGCATGTGCCAATTACACAAATTGCTTTTTTAGTGGGTACACTTTCT
CTTTGTGGTATTCCACCCCTTGCTTGTTTTTGGTCCAAAGATGAAATTCTTAGTGACAGCTGGTTGT
>?unk100
TCAATAAAACTATGGGGTAAAGAAGAACAAAAAATAATTAACAGAAATTTTCGTTTATCTCCTTTATTAA
TATTAACGATGAATAATAATGAGAAGCCATATAGAATTGGTGATAATGTAAAAAAAGGGGCTCTTATTAC
TATTACGAGTTTTGGCTACAAGAAGGCTTTTTCTTATCCTCATGAATCGGATAATACTATGCTATTTCCT
ATGCTTATATTGGCTCTATTTACTTTTTTTGTTGGAGCCATAGCAATTCCTTTTAATCAAGAAGGACTAC
ATTTGGATATATTATCCAAATTATTAACTCCATCTATAAATCTTTTACATCAAAATTCAAATGATTTTGA
GGATTGGTATCAATTTTTAACAAATGCAACTCTTTCAGTGAGTATAGCCTGTTTCGGAATATTTACAGCA
TTCCTTTTATATAAGCCTTTTTATTCATCTTTACAAAATTTGAACTTACTAAATTTATTTTCGAAAGGGG
GTCCTAAAAGAATTTTTTTGGATAAAATAATATACTTGATATACGATTGGTCATATAATCGTGGTTACAT
AGATACGTTTTATTCAGTATCCTTAACAAAAGGTATAAGAGGATTGGCCGAACTAACTCATTTTTTTGAT
AGGCGAGTAATCGATGGAATTACAAATGGAGTACGCATCACAAGTTTTTTTATAGGCGAAGGTATCAAAT
ATT
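To make the structure of the sample above explicit, here is a minimal Python sketch that splits one gapped FASTA record into its sequence stretches and gaps. The function name parse_gapped_fasta and the tuple layout are assumptions for illustration; Sequin's own reader may behave differently.

```python
def parse_gapped_fasta(text):
    """Parse one gapped FASTA record into (definition_line, parts).

    parts is a list of ("seq", residues) and ("gap", length) tuples;
    a length of None marks a gap of unknown length (written ">?unk100").
    """
    definition = None
    parts = []
    current = []  # sequence lines collected since the last gap
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">?"):
            # A gap line closes the sequence stretch collected so far.
            if current:
                parts.append(("seq", "".join(current)))
                current = []
            token = line[2:]
            parts.append(("gap", None if token.startswith("unk") else int(token)))
        elif line.startswith(">"):
            if definition is not None:
                raise ValueError("only one definition line is expected")
            definition = line[1:]
        else:
            current.append(line)
    if current:
        parts.append(("seq", "".join(current)))
    return definition, parts
```

Applied to the sample file, this would yield three sequence stretches separated by one gap of length 200 and one gap of unknown length.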

Alignment Formats

Once you have created your alignment file, be sure to note the characters used to indicate ambiguous bases, bases that match the master sequence, and gaps in the alignment. Be aware that some alignment formats use different characters to indicate gaps used to pad sequences at the beginning, middle, and end of the alignment. You will be able to specify these characters separately before importing the alignment file.

FASTA+GAP

>ABC-1 [organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
---ATTGCGTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCAT
TGATGCACCTGGACACAGAGATTTCATCAAGAACATGATCACTGGTACTT
>ABC-2 [organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
GATATTGCTTTATGGAAATTCGAAACTGCCAAATACTATGTCACCATCAT
TGATGCACCTGGACACAGAAATTTCATCAAGAACATGATCACTGGTACTT
>ABC-3 [organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
---ATTGCTTTATGGAAATTCGAAACTGCCAAATACTATGTTA-------
TGATGCACCTGGACACAGAGATTTCATCAAAAACATGATCACTGGTACTT
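Because every sequence in a FASTA+GAP file is padded with gap characters, all entries should have the same length. The Python sketch below checks this before import; the function names read_fasta_gap and alignment_length are hypothetical, and the check is our own suggestion rather than something Sequin requires you to run.

```python
def read_fasta_gap(text):
    """Return a list of (definition_line, aligned_sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif not records:
            raise ValueError("sequence data before any definition line")
        else:
            # Gap characters are kept; they are part of the alignment.
            records[-1][1] += line
    return [(definition, seq) for definition, seq in records]


def alignment_length(records):
    """Return the common aligned length, or raise if the lengths differ."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("aligned sequences differ in length: %s" % sorted(lengths))
    return lengths.pop()
```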

PHYLIP

      3  100
ABC-1      ---ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC-2      GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC-3      ---ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------

           TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
           TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
           TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT

>[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
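The interleaved layout shown above can be read back into whole sequences with a short Python sketch. It assumes that names contain no spaces and are separated from the sequence by whitespace, as in this sample (strict PHYLIP uses a fixed ten-character name field), and it stops at the trailing definition lines. The function name parse_interleaved_phylip is hypothetical.

```python
def parse_interleaved_phylip(text):
    """Return {name: sequence} for an interleaved PHYLIP alignment."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header line: number of sequences, then number of aligned positions.
    ntax, nchar = (int(tok) for tok in lines[0].split())
    names, seqs = [], []
    row = 0
    for line in lines[1:]:
        if line.lstrip().startswith(">"):
            break  # trailing definition lines, not alignment data
        tokens = line.split()
        if len(names) < ntax:
            # First block: the name comes first, then sequence chunks.
            names.append(tokens[0])
            seqs.append("".join(tokens[1:]))
        else:
            # Later blocks: sequence chunks only, in the same order.
            seqs[row % ntax] += "".join(tokens)
            row += 1
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError("%s has %d positions, expected %d" % (name, len(seq), nchar))
    return dict(zip(names, seqs))
```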

NEXUS Interleaved

#NEXUS

begin data;
        dimensions  ntax=3 nchar=100;
        format datatype=dna  missing=? gap=-  interleave ;
        matrix

[     1                                                   50]
ABC_1 ???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC_2 GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC_3 ???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------

[     51                                                 100]
ABC_1 TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_2 TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_3 TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
;
END;

begin ncbi;
sequin
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
;
end;

NEXUS Contiguous

#NEXUS

begin data;
        dimensions  ntax=3 nchar=100;
        format datatype=dna  missing=? gap=-  ;
        matrix

ABC_1   
???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_2  
GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_3  
???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------
TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
;
END;

begin ncbi;
sequin
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
>[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
;
end;

Sets of Segmented Sequences

If the sequences in a phylogenetic study are really segmented (e.g., exons 2 and 3 of a gene without intron 2), the individual segments from a single organism can be grouped within square brackets. Subsequent segments are detected by the presence of a FASTA definition line. For example:

[
>Qruex2 [organism=Quercus rubra]
CGAAAACCTGCACAGCAGAAACGACTCGCAAACTAGTAATAACTGACGGAGGACGGAGGG ...
>Qruex3
CATCATTGCCCCCCATCCTTTGGTTTGGTTGGGTTGGAAGTTCACCTCCCATATGTGCCC ...
]
[
>Qsuex2 [organism=Quercus suber]
CAAACCTACACAGCAGAACGACTCGAGAACTGGTGACAGTTGAGGAGGGCAAGCACCTTG ...
>Qsuex3
CATCGTTGCCCCCCTTCTTTGGTTTGGTTGGGTTGGAAGTTGGCCTTCCATATGTGCCCT ...
]
...

FASTA+GAP format can also use this convention for encoding sets of aligned segmented sequences.
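The grouping convention can be sketched in Python as follows. This is only an illustration of the bracket rules described in this section, with a hypothetical function name (parse_segmented_sets); it expects every record to sit inside a "[" ... "]" pair and does not handle a literal "..." continuation line.

```python
def parse_segmented_sets(text):
    """Return a list of sets; each set is a list of (definition_line, sequence)."""
    sets = []
    current_set = None  # None while outside a bracketed set
    defline = None
    residues = []

    def flush():
        # Store the segment collected so far, if any.
        nonlocal defline, residues
        if defline is not None:
            current_set.append((defline, "".join(residues)))
        defline, residues = None, []

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "[":
            if current_set is not None:
                raise ValueError("nested '[' is not expected")
            current_set = []
        elif line == "]":
            if current_set is None:
                raise ValueError("']' without a matching '['")
            flush()
            sets.append(current_set)
            current_set = None
        elif current_set is None:
            raise ValueError("record outside of a bracketed set")
        elif line.startswith(">"):
            # A new definition line starts the next segment.
            flush()
            defline = line[1:]
        else:
            residues.append(line)
    if current_set is not None:
        raise ValueError("missing closing ']'")
    return sets
```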

Creating a Submission

The sequence data we will use for this example is the genomic sequence of the Drosophila melanogaster eukaryotic initiation factors 4E-I and 4E-II (GenBank Accession number U54469).

Basic Sequin Organization

Sequin is organized into a series of forms for entering submitting authors, entering organism and sequences, entering information such as strain, gene, and protein names, viewing the complete submission, and editing and annotating the submission. The goal is to go quickly from raw sequence data to an assembled record that can be viewed, edited, and submitted to your database of choice.

Advance through the pages that make up each form by clicking on labeled folder tabs or the Next Page button. After the basic information forms have been completed and the sequence data imported, Sequin provides a complete view of your submission, in your choice of text or graphic format. At this point, any of the information fields can be easily modified by double-clicking on any area of the record, and additional biological annotations can be entered by selecting from a menu.

Sequin has an on-screen Help file that is opened automatically when you start the program. Because it is context sensitive, the Help text will change and follow your steps as you progress through the program. A "Find" function is also provided.

Welcome to Sequin Form

Once you have finished preparing the sequence files, you are ready to start the Sequin program. Sequin's first window asks you to indicate the database to which the sequence will be submitted and prompts you to start a new project or continue with an existing one. Once you choose a database, Sequin will remember it in subsequent sessions. In general, each sequence submission should be entered as a separate project. However, segmented DNA sequences, gapped sequences, population studies, phylogenetic studies, and mutation studies should be submitted together as one project. This feature also eliminates the need to save Sequin information templates for each sequence.

To begin creating your submission, click the Start New Submission button.

Submitting Authors Form

The pages in the Submitting Authors form ask you to provide the release date, a working title, names and contact information of submitting authors, and affiliation information. To create a personal template for use in future submissions, use the File->Export menu item after completing each page of this form.

Submission Page

The Submission page asks for a tentative title for a manuscript describing the sequence and will initially mark the manuscript as being unpublished. When the article is published, the database staff will update the sequence record with the new citation. This page also lets you indicate that a record should be held confidential by the database until a specified date, although the preferred policy is to release the record immediately into the public databases.

Contact Page

The Contact page asks for the name, phone number, and email address of the person responsible for making the submission. Database staff members will contact this person if there are any questions about the record.

The Sfx (suffix) popup is used to enter personal name suffixes (e.g., Jr., Sr., or III), not a person's academic degrees (e.g., MD or PhD). Also, it is not necessary to type periods after initials.

Authors Page

In the Authors page, enter the names of the people who should get scientific credit for the sequence presented in this record. These will become the authors for the initial (unpublished) manuscript.

Authors are entered in a spreadsheet. As soon as anything is typed in the last row, a new (blank) row is added below it. Use the tab key to move between fields. Tabbing from the last column automatically moves to the First Name column in the next row.

Affiliation Page

The Affiliation page asks for the institutional affiliation of the primary author.

Sequence Format Form

Format Form

With Sequin, the actual sequence data are imported from an outside data file. So before you begin, prepare your sequence data files using a text editor, perhaps one associated with your laboratory sequence analysis software (see "Before You Begin").

Submission Type

If you have sequence data from a single source, choose from one of the following submission types:

See Before You Begin if you have questions about how to format your files or about the differences between these formats.

If you have a set of single sequences, segmented sequences, or gapped sequences or a mixture of these types of sequences, you will need to choose one of the following submission types:

Sequence Data Format

If you have chosen Single Sequence, Segmented Sequence, Gapped Sequence, or Batch Submission for the submission type, you will only be able to select FASTA (no alignment).

If you have chosen one of the other submission types, you may import the sequences in FASTA format, or you may choose to import the sequences using an alignment file by selecting Alignment (FASTA+GAP, NEXUS, PHYLIP, etc.). See Alignment Formats for an explanation of the available formats for alignment files.

Submission Category

Choose Original Submission if you have directly sequenced the nucleotide sequence in your laboratory.

Choose Third Party Annotation if you have downloaded or assembled sequence from GenBank and modified it with your own annotations. See http://www.ncbi.nih.gov/Genbank/TPA.html for more information about Third Party Annotation rules.

Organism and Sequences Form

The Organism and Sequences form has been enhanced with a number of Assistants that allow entry or editing of sequence and source information.

Nucleotide Page

The Nucleotide page will have one of three appearances, based on whether you have chosen to import a single sequence, a set of sequences, or an alignment.

Importing Nucleotide FASTA for a Single Sequence

Single Sequence Page

To import a single sequence, click on Import Nucleotide FASTA and enter the name of the file that contains your FASTA sequence. See Before You Begin for information on how to format your FASTA file. In addition to importing from a file, sequences can also be read by pasting from the computer's "clipboard" using the Edit->Paste menu item or by using the Add/Modify Sequences button.

Importing Nucleotide FASTA for a Sequence Set

Sequence Set Page

To import a set of sequences, click on Import Nucleotide FASTA and enter the name of the file that contains some or all of your FASTA sequences. See Before You Begin for information on how to format your FASTA file. You may click on Import Additional Nucleotide FASTA to import additional files if your sequences are in more than one file. In addition to importing from a file, sequences can also be read by pasting from the computer's "clipboard" using the Edit->Paste menu item or by using the Add/Modify Sequences button.

If you would like to create an alignment for your set of sequences, check Create Alignment on this page.

Importing an Alignment

See Before You Begin for information on how to format your alignment file. Before importing your alignment, choose which characters in the alignment file represent gaps, ambiguous or unknown nucleotides, and "matches".

Some data files distinguish between gaps at the beginning, in the middle, and at the end of a sequence. These characters can be entered separately if needed, or you may specify the same character for all three kinds of gaps if appropriate.

Ambiguous/Unknown characters represent nucleotides that are present in the sequence but were not sequenced. Usually this is "N". Match characters are characters in a sequence other than the first that match the character at that alignment position in the first sequence. When match characters are used, usually they are specified as ".", but when match characters are not used, "." is frequently used as a gap character, so the ":" is supplied instead as a default.

You may specify more than one character for each of these categories. When you have filled out the character information, click on Import Nucleotide Alignment and enter the name of your alignment file.
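To show what a match character means in practice, the following Python sketch replaces each match character with the residue at the same position in the first sequence. It is an illustration under our own assumptions (the function name expand_match_characters is hypothetical, and ":" is used as the default only because it is the default mentioned above); it is not a description of Sequin's import code.

```python
def expand_match_characters(sequences, match_chars=":"):
    """Replace match characters in every sequence after the first with the
    residue found at the same alignment position in the first sequence."""
    if not sequences:
        return []
    first = sequences[0]
    expanded = [first]
    for seq in sequences[1:]:
        if len(seq) != len(first):
            raise ValueError("aligned sequences must have equal length")
        expanded.append(
            "".join(ref if ch in match_chars else ch for ch, ref in zip(seq, first))
        )
    return expanded
```

More than one match character can be supplied by passing a longer string, for example match_chars=".:".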

After Importing Files

When the sequence file or alignment file import is complete, a box will appear showing the number of nucleotide segments imported, the total length in nucleotides of the sequences entered, and the sequence ID(s) you designated. The actual sequence data are not shown. If any of this information is missing or incorrect, check the file containing the sequence data for proper FASTA format, click on the Clear Sequences button, then reimport the sequence(s).

If the imported nucleotide sequence or sequences or alignment have any problems, such as colliding local identifiers in a set or mismatched brackets in the definition line, an Assistant dialog appears to help correct the problems. Severe problems must be fixed before you can continue with the Sequin submission.

Organism Page

The second page of the Organism and Sequences form requests information regarding the scientific name of the organism from which the sequence was derived, if it was not already encoded in the nucleotide FASTA file. There are Assistants for manually adding organism name information or adding source qualifiers.

Sequin has extracted the organism and strain names from the FASTA definition line in this example, eliminating the need to manually enter information in the Organism page.

Proteins Page

If your sequence or sequences encode one or more proteins, you can enter the sequences of the protein products in this page. To import the amino acid sequences, click on the Proteins folder tab and click on the Import Protein FASTA button. You may import more than one file by clicking the button again after importing the first file. See Before You Begin for information on how to format your protein files.

Proteins Example

In this example, we imported two protein sequences. These are the alternative splice products of the same gene. Both protein sequences were in the same data file, but each had its own definition line.

Sequin has extracted the gene and protein names from the FASTA definition lines, and will use these to construct the initial sequence record.

Annotation Page

The Annotation page allows you to add an rRNA or CDS feature to the entire length of all sequences in the set. In addition, you can add a title to any sequences that didn't obtain them from a FASTA definition line. It is much easier to add these in bulk at this step than to add individual rRNA or CDS features to each sequence after the record is constructed.

It is customary in a nucleotide record to format titles for sequences containing coding region features in the following way:

Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.

The choice of "mRNA" or "gene" depends upon the molecule type (use "mRNA" for mRNA or cDNA, and "gene" for genomic DNA). Use "partial" for incomplete features. The proper organism name in a phylogenetic study can be added to the beginning of each title automatically by checking the Prefix title with organism name box.
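The customary title pattern can be written out as a small Python helper. This is a sketch only; the function name format_cds_title and its parameters are hypothetical, and it covers just the mRNA/cDNA and genomic DNA cases named above.

```python
def format_cds_title(organism, protein, gene, molecule, complete=True):
    """Build a title of the form
    'Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.'
    """
    if molecule in ("mRNA", "cDNA"):
        kind = "mRNA"
    elif molecule == "genomic DNA":
        kind = "gene"
    else:
        raise ValueError("unsupported molecule type: %r" % molecule)
    # Use "partial" for incomplete features.
    completeness = "complete" if complete else "partial"
    return "%s %s (%s) %s, %s cds." % (organism, protein, gene, kind, completeness)
```

For the example used in this guide, calling it with "Drosophila melanogaster", "eukaryotic initiation factor 4E-I", "eIF4E", and "genomic DNA" gives "Drosophila melanogaster eukaryotic initiation factor 4E-I (eIF4E) gene, complete cds."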

However, for records containing CDS, rRNA, or tRNA features, Sequin can generate the definition line automatically by computing on the features (see "Generating the Definition Line").

More complex situations, such as a population study of HIV sequences, can include multiple CDS features in each sequence. In this case, do not use the Annotation page to create features. (You can still use it for a common title, however.) After the initial submission has been created, you would manually annotate features onto one of the sequences. If you are submitting an alignment, or if you are submitting a set of sequences and you have checked Create Alignment on the Nucleotide page, you will be able to use feature propagation to annotate the same features at the equivalent aligned locations on the remaining sequences.

Viewing Your Submission

GenBank View

After you have completed importing the data files, Sequin will display your full submission information in the GenBank format (or EMBL format if you chose EMBL as the database for submission in the first form).

GenBank Format

On the basis of the information provided in your DNA and amino acid sequence files, any coding regions will be automatically identified and annotated for you. The figure shows only the top portion of the GenBank record, but you can see the first of two coding region (CDS) features. The vertical bar to the left of the paragraph indicates that the CDS has been selected by clicking with the computer's mouse.

You may now make changes to the coding region, publication, source, and other features in the record by double clicking on the appropriate paragraphs in the GenBank display format. You may also use the Annotate->Generate Definition Line menu item to compute a definition line for the annotated features in the record.

Graphical View

Graphic Format

To get a graphical view, change the Format popup menu from GenBank to Graphic. Reviewing your submission in Graphic format allows you to visually confirm expected location of exons, introns, and other features in multiple interval coding regions. The Graphic view in our eukaryotic initiation factor example illustrates how the coding region intervals for the two protein products are spatially related to each other.

The File->Duplicate View menu item will launch a second viewer on the record. The display format on each viewer can be independently set, allowing you to see a graphical view and a GenBank text report simultaneously. This is useful for getting an overall view of the features and seeing the details of annotation.

Sequence View

Sequence Format

Sequence view is a static version of the sequence and alignment editor. It shows the actual nucleotide sequence, with feature intervals annotated directly on the sequence. Protein translations of CDS features are also shown, as are all features shown in the graphical view.

Editing and Annotating Your Submission

At this point, Sequin could process your entry based on what you have entered so far, and you could send it to your nucleotide database of choice (as set in the initial form). However, to optimize the usefulness of your entry for the scientific community, you may want to provide additional information to indicate biologically significant regions of the sequence. But first, save the entry so that if you make any unwanted changes during the editing process you can revert to the original copy.

Additional information may be in the form of Descriptors or Features. Descriptors are annotations that apply to an entire sequence or set of sequences. They are used to avoid repeating the same information within a record. Features are annotations that apply to a specific sequence interval.

Sequin provides two methods to modify your entry: (1) to edit existing information, double click on the text or graphic area you want to modify, and Sequin will display forms requesting needed information; or (2) to add new information, use the Annotate menu and select from the list of available annotations.

Sequence Editor

Additional sequence data can be added using Sequin's sequence editor, which can be launched using the Edit->Edit Sequence menu item. Sequin will automatically adjust feature intervals as you edit the sequence. Prior to Sequin, it was usually easier to reannotate everything from scratch when the sequence changed. But an even easier way to update sequences is described in the following section.

Updating the Sequence

Sequin can also read in a replacement sequence, or an overlapping sequence extension, and perform the alignment and feature propagation calculations necessary to adjust feature intervals, even though the individual editing operations were not done with the sequence editor.

The Edit->Update Sequence submenu has several choices. These are for use by the original submitter of a record.

You can read a FASTA file or raw sequence file. This can be a replacement sequence, or it can overlap the original sequence at the 5' or 3' end. After Sequin aligns the two sequences, and you select optional parameters, the sequence in your record is updated, with all feature intervals adjusted properly.

You can also update with an existing sequence record that contains features. This can be obtained from a file, or retrieved from Entrez via an Accession number. The latter choice requires the network-aware version of Sequin. Once it gets the new record, Sequin aligns the two sequences as before. This is typically used either to merge two records that overlap, or to copy features from database records onto a new large contig.

Update Sequence Form

The first panel shows how the two sequences align to each other. In this case, it is a 5' extension of the existing sequence. 400 bases are new, 70 bases overlap the old sequence, and there are 30 bases of vector on the new sequence that do not align to the old sequence and will be trimmed off.

The second panel shows details of the 70-base aligned region. There is one single-base gap in each sequence. The total number of sequence letters plus gap characters is the alignment length, 71 in this example. (This number was shown between the sequence figures in the first panel.) Mismatched bases are indicated by vertical red lines between the two sequences.

The third panel shows the actual sequence letters in the aligned region. Clicking on a gap or mismatch in the second panel scrolls to the appropriate place in this panel.
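The relationship between sequence letters, gap characters, and alignment length can be illustrated with a short Python sketch. This is not Sequin's code: the function name and the two short aligned strings are made up for illustration, with "-" assumed as the gap character.

```python
def alignment_summary(aligned_a, aligned_b, gap="-"):
    """Summarize a pairwise alignment given as two equal-length gapped strings.

    Hypothetical helper for illustration only.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # A mismatch is a column where both sequences have a letter and they differ.
    mismatches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a != b and gap not in (a, b)
    )
    return {
        "alignment_length": len(aligned_a),  # letters plus gap characters
        "gaps_a": aligned_a.count(gap),
        "gaps_b": aligned_b.count(gap),
        "mismatches": mismatches,
    }


# Made-up example: 9 letters plus 1 gap in each sequence gives 10 columns.
print(alignment_summary("ACGT-ACGTA", "ACCTTAC-TA"))
```

Each sequence here contributes nine letters and one gap character, so the alignment length is ten, in the same way that the aligned region in the form is one column longer than the 70 overlapping bases.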

Before pressing Update Sequence, you can set optional parameters. The alignment relationship is calculated by Sequin, but in some cases you may want to replace or patch rather than extend the existing sequence.

Generating the Definition Line

The Annotate->Generate Definition Line menu item can make the appropriate titles once the record has been annotated with features. The general format for sequences containing coding region features is:

Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.

Exceptional cases, where this automatic function is unable to generate a reasonable definition line, will be edited by the database staff to conform to the style conventions.

The new definition line will replace any previous title, including that originally on the FASTA definition line.
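The general format shown above can be sketched as a small Python function. This only illustrates the pattern; it is not Sequin's actual logic, and the function name and parameters are hypothetical.

```python
def definition_line(organism, protein, gene_symbol, molecule="gene", complete=True):
    """Assemble a title following the general pattern:
    Genus species protein name (gene symbol) mRNA/gene, complete/partial cds.

    Hypothetical helper for illustration; it handles only a single
    coding region and none of the exceptional cases.
    """
    if molecule not in ("mRNA", "gene"):
        raise ValueError("molecule must be 'mRNA' or 'gene'")
    completeness = "complete" if complete else "partial"
    return f"{organism} {protein} ({gene_symbol}) {molecule}, {completeness} cds."


print(definition_line("Drosophila melanogaster",
                      "eukaryotic initiation factor 4E", "eIF4E"))
```

The title Sequin generates for a real record may differ, for example when a record contains more than one coding region.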

Record Validation

Once you are satisfied that you have entered all the relevant information, save your file! Then select the Search->Validate menu item. You will either receive a message that the validation test succeeded or see a screen listing the validation errors and warnings. Just double click on an error item to launch the appropriate editor for making corrections. The validator includes checks for such things as missing organism information, incorrect coding region lengths, internal stop codons in coding regions, inconsistent genetic codes, mismatched amino acids, and non-consensus splice sites.

Record Validator Form
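To give a feel for what a coding region length check involves, here is a minimal Python sketch. It is not the Sequin validator, and the function names are hypothetical. It assumes a complete coding region that ends in a stop codon, and uses the intervals of the two eIF4E coding regions from the example in this guide.

```python
def cds_length(intervals):
    """Total bases covered by 1-based, inclusive (start, stop) intervals."""
    return sum(stop - start + 1 for start, stop in intervals)


def expected_protein_length(intervals):
    """Amino acid count implied by a complete CDS ending in a stop codon.

    Returns None when the CDS length is not a multiple of three, which is
    the kind of problem a length check would report.
    """
    length = cds_length(intervals)
    if length % 3 != 0:
        return None
    return length // 3 - 1  # the last codon is the stop codon


cds_4e_ii = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]
cds_4e_i = [(1402, 1458), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]

for name, cds in (("4E-II", cds_4e_ii), ("4E-I", cds_4e_i)):
    print(name, cds_length(cds), expected_protein_length(cds))
```

Comparing the expected amino acid count with the length of the protein sequence you supplied is a quick way to spot a wrong interval before running the full validator.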

Submitting the Entry

When the entry is properly formatted and error-free, click the Done button or select the File->Prepare Submission menu item. You will be prompted to save your entry and email it to the database you selected. The address for GenBank is gb-sub@ncbi.nlm.nih.gov. The address for EMBL is datasubs@ebi.ac.uk. The address for DDBJ is ddbjsub@ddbj.nig.ac.jp.

Advanced Topics

Feature Editor Design

Sequin uses a common structure for all feature editor forms, with (usually) three top-level folder tabs. One folder tab page is specific to the given feature type (biological source and publications have more). The Properties and Location pages are common to all features. Some of these pages may have subpages, accessed by a secondary set of smaller folder tabs. This organization allows editors for complex data structures to fit in a reasonably small window size. The most important information in a given section is always presented in the first subpage.

Coding Region Page

The coding region editor is perhaps the most complicated form in Sequin. Within the Coding Region page, the Product subpage lets you predict the coding region intervals from the protein sequence or translate the protein sequence from the location. (Importing a protein sequence from a file will also interpret the [gene=...] and [protein=...] definition line information and automatically attempt to predict the coding region intervals.) It also displays the genetic code used for translation and the reading frame. (Please note that there are currently 17 different genetic codes present in Sequin. For more information on these, see http://www.ncbi.nlm.nih.gov/Taxonomy/.)

The Protein subpage lets you set the name (or, if not known, a description) of the protein product. The Exceptions subpage allows you to indicate translation exceptions to the normal genetic code, such as insertion of selenocysteine, suppression of terminator codons by a suppressor tRNA, or completion of a stop codon by poly-adenylation of an mRNA.

Additional annotation on the protein product might include a leader peptide, transmembrane regions, disulfide bonds, or binding sites. These can be added after setting the Target Sequence popup on the sequence viewer to the desired protein sequence. You can also launch a duplicate view, already targeted to the appropriate protein, from the Protein subpage.

Properties Page

All features have a number of fields in common. The Partial box will be checked if the 5' partial or 3' partial boxes on the Location page were selected. Exception means that the sequence of the protein product doesn't match the translation of the DNA sequence because of some known biological reason (e.g., RNA editing). The Evidence popup has been superseded by the Evidence subpage.

In addition, nucleotide features (other than genes themselves) can reference a gene feature. This is frequently done by overlap. (The overlapping gene will show up on the feature as a /gene qualifier in GenBank format.) Extension of the feature location will automatically extend the gene that is selected in the editor. In rare cases, you may want to set a gene by cross-reference.

The Comment subpage allows text to be associated with a feature. In GenBank format, this appears as a /note qualifier. The Citations subpage attaches citations to the feature. (The citations should first be added to the record using items in the Annotate->Publication submenu, whereupon they will appear in the REFERENCE section.) For example, an article that justifies a non-obvious or controversial biological conclusion would be cited here. In GenBank format, for example, if the publication is listed as Reference 2, the feature citation appears as /citation=[2]. The Cross-Refs subpage holds cross-references to other databases. The contents of this subpage may only be changed by the GenBank, EMBL, or DDBJ database staff. The Evidence subpage has experiment and inference qualifier fields. The experiment qualifier must include details of the experiment used to justify the annotation.

Location Page

All features are required to have a location, i.e., one or more intervals on a sequence coordinate. The Location page provides a spreadsheet for entering and editing this information. An arbitrary number of lines can be entered. In this coding region example, the intervals correspond to the exons. For an mRNA, the intervals would be the exons and UTRs. The 5' Partial and 3' Partial check boxes will show up as < or > in front of a feature coordinate in the GenBank flatfile, indicating partial locations.

The GenBank flatfile view of this location would be:

join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)

If the 5' Partial or 3' Partial boxes were checked, < and > symbols would appear at the appropriate end of the join statement:

join(<201..224,1550..1920,1986..2085,2317..2404,2466..>2629)
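The way intervals and the partial check boxes combine into a flatfile location can be sketched in Python as follows. This is an illustration for plus-strand features only, not Sequin's implementation; the function name and parameters are hypothetical.

```python
def format_location(intervals, partial5=False, partial3=False):
    """Build a GenBank-style location string for a plus-strand feature.

    intervals: list of 1-based, inclusive (start, stop) pairs in order.
    partial5 / partial3: mimic the 5' Partial and 3' Partial check boxes.
    Hypothetical helper for illustration only.
    """
    if not intervals:
        raise ValueError("a feature needs at least one interval")
    last = len(intervals) - 1
    pieces = []
    for i, (start, stop) in enumerate(intervals):
        # "<" marks a 5' partial start, ">" marks a 3' partial stop.
        left = f"<{start}" if partial5 and i == 0 else str(start)
        right = f">{stop}" if partial3 and i == last else str(stop)
        pieces.append(f"{left}..{right}")
    if len(pieces) == 1:
        return pieces[0]
    return "join(" + ",".join(pieces) + ")"


exons = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]
print(format_location(exons))
print(format_location(exons, partial5=True, partial3=True))
```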

If the sequence was reverse complemented (based on a length of 2881 nucleotides), the Strand popups would all indicate Minus, and the join statement for the resulting feature location would be as follows:

complement(join(253..416,478..565,797..896,962..1332,2658..2681))
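The reverse-complemented coordinates follow from a simple rule: on a sequence of length L, position p maps to L - p + 1, and each interval's start and stop trade places. A minimal Python sketch of that arithmetic (function names are hypothetical, and this is not Sequin's own code):

```python
def reverse_complement_intervals(intervals, seq_length):
    """Map 1-based, inclusive intervals onto the reverse-complemented sequence.

    Position p becomes seq_length - p + 1, so each (start, stop) pair becomes
    (seq_length - stop + 1, seq_length - start + 1).
    """
    flipped = [(seq_length - stop + 1, seq_length - start + 1)
               for start, stop in intervals]
    return sorted(flipped)  # ascending order, as written in the flatfile


def minus_strand_location(intervals):
    """Format ascending intervals as a minus-strand flatfile location."""
    body = ",".join(f"{start}..{stop}" for start, stop in intervals)
    if len(intervals) > 1:
        body = f"join({body})"
    return f"complement({body})"


exons = [(201, 224), (1550, 1920), (1986, 2085), (2317, 2404), (2466, 2629)]
flipped = reverse_complement_intervals(exons, 2881)
print(minus_strand_location(flipped))
```

For example, the last exon 2466..2629 becomes 253..416, because 2881 - 2629 + 1 = 253 and 2881 - 2466 + 1 = 416.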

NCBI DeskTop

NCBI DeskTop Window

The NCBI DeskTop is a window that directly displays the internal structure of the record being viewed in Sequin. It can be understood as a Venn diagram.

As with other views on a record, the DeskTop indicates selected items and lets you select items by clicking.

In this example, Sequin was given the genomic nucleotide and protein sequences for Drosophila melanogaster eukaryotic initiation factor 4E. It then determined the coding region intervals and built an initial structure. The organism (BioSource descriptor) is attached to the nuc-prot set and thus applies to both the nucleotide and protein sequences.

Additional Information

The Sequin homepage http://www.ncbi.nlm.nih.gov/Sequin/ has a Frequently Asked Questions section and more detailed instructions on using the capabilities of network-aware Sequin.

Reference

Network Configuration

Network Configuration Form

When first downloaded, Sequin runs in stand-alone mode, without access to the network. However, the program can also be configured to exchange information with the NCBI (GenBank) over the Internet. The network-aware mode of Sequin is identical to the stand-alone mode, but it contains some additional useful options.

Sequin can only function in its network-aware mode if the computer on which it resides has a direct Internet connection. Electronic mail access to the Internet is insufficient. In general, if you can install and use a WWW browser on your system, you should be able to install and use network-aware Sequin. Check with your system administrator or Internet provider if you are uncertain as to whether you have direct Internet connectivity.

To launch the configuration form, select Net Configure under the Misc menu, from either the initial Welcome to Sequin form or from a viewer on an existing sequence record.

If you are not behind a firewall, set the Connection control to Normal. If you also have a Domain Name Server (DNS) available, you can now simply press Accept.

If DNS is not available, uncheck the Domain Name Server box. If you are behind a firewall, set the Connection control to Firewall. The HTTP Proxy Server box then becomes active. If you also use a proxy server, type in its address. (If you have access to DNS, it will be of the form www.myproxy.myuniversity.edu. If you do not have DNS, you should use the numerical IP address of the form 127.45.23.6.) Once you type something in the HTTP Proxy Server box, the Port box becomes active and can be filled in or changed as appropriate. (By default the Non-transparent Proxy Server box is empty, indicating a CERN-like proxy.) Ask your network administrator for advice on the proper settings to use.

If you are in the United States, the default Timeout of 30 seconds should suffice. From other countries with a slow Internet connection to the U.S., you can set the timeout to as much as 5 minutes.

Finally, you will need to quit and restart Sequin for the network-aware settings to take effect.

If you are behind a firewall, it must be configured correctly to access NCBI services. Your network administrator may have done this already. If not, please have them contact NCBI for further instructions on setting up firewalls to work with NCBI services.

The following section is intended for network administrators:

Using NCBI services from behind a security firewall requires opening ports in your firewall. Please consult http://www.ncbi.nlm.nih.gov/IEB/ToolBox/NETWORK/firewall.html for the list of current hosts and ports that have the firewall daemon configured.

If your firewall is not transparent, the firewall port number should be mapped to the same port number on the external host.

Note: Old NCBI clients used different application configuration settings and ports than listed above. If you need to support such clients, which are becoming obsolete, please contact info@ncbi.nlm.nih.gov for further information.

Feature Table Format

Sequin can now annotate features by reading in a tab-delimited table. This is most often used by genome centers that store feature interval information in relational databases or spreadsheets. For most submitters, it is usually better to supply protein sequences in FASTA format with gene and protein names embedded in the definition line.

The feature table specifies the location and type of each feature, and Sequin processes the feature intervals and translates any CDSs. The table is read in the record viewer (after the sequence has been imported) using the File->Open menu item. The table must follow a defined format. The first line starts with >Feature, a space, and then the Sequence ID of the sequence you are annotating. In the example below, the Sequence ID is eIF4E, and the lcl| prefix marks it as a local identifier.

The table is composed of five columns: start, stop, feature key, qualifier key, and qualifier value. The columns are separated by tabs. The first row for any given feature has start, stop, and feature key. Additional feature intervals just have start and stop. The qualifiers follow on lines starting with three tabs.

For example, a table that looks like this:

>Features lcl|eIF4E
80      2881    gene
                        gene     eIF4E

201     224     CDS
1550    1920
1986    2085
2317    2404
2466    2629
                        product  eukaryotic initiation factor 4E-II

1402    1458    CDS
1550    1920
1986    2085
2317    2404
2466    2629
                        product  eukaryotic initiation factor 4E-I
                        note     encoded by two messenger RNAs

80      224     mRNA
1550    1920
1986    2085
2317    2404
2466    2881
                        product  eukaryotic initiation factor 4E-II

80      224     mRNA
892     1458
1550    1920
1986    2085
2317    2404
2466    2881
                        product  eukaryotic initiation factor 4E-I

80      224     mRNA
1129    1458
1550    1920
1986    2085
2317    2404
2466    2881
                        product  eukaryotic initiation factor 4E-I

will result in a GenBank flatfile that contains this:

     mRNA            join(80..224,1129..1458,1550..1920,1986..2085,2317..2404,
                     2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-I"
     mRNA            join(80..224,892..1458,1550..1920,1986..2085,2317..2404,
                     2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-I"
     mRNA            join(80..224,1550..1920,1986..2085,2317..2404,2466..2881)
                     /gene="eIF4E"
                     /product="eukaryotic initiation factor 4E-II"
     gene            80..2881
                     /gene="eIF4E"
     CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
                     /gene="eIF4E"
                     /codon_start=1
                     /product="eukaryotic initiation factor 4E-II"
                     /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETG
                     EPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTV
                     EDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDN
                     LWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDAL
                     RLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"
     CDS             join(1402..1458,1550..1920,1986..2085,2317..2404,
                     2466..2629)
                     /gene="eIF4E"
                     /note="encoded by two messenger RNAs"
                     /codon_start=1
                     /product="eukaryotic initiation factor 4E-I"
                     /translation="MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKD
                     VKPKEDPQETGEPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWED
                     MQNEITSFDTVEDFWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVIT
                     LNKSSKTDLDNLWLDVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAA
                     LEIGHKLRDALRLGRNNSLQYQLHKDTMVKQGSNVKSIYTL"

Note that if the gene feature spans the intervals of the CDS and mRNA features for that gene, you don't need to include gene "qualifiers" in those features, because they will be picked up by overlap.
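If you generate feature tables from a database or spreadsheet, it can help to read a table back and check its structure before giving it to Sequin. The following Python sketch parses the five-column layout described above. It is not Sequin's parser: the function name and the returned dictionary layout are hypothetical, and it assumes real tab characters between columns.

```python
def parse_feature_table(text):
    """Parse a five-column, tab-delimited feature table.

    Returns (sequence_id, features), where each feature is a dict with
    "key", "intervals" and "qualifiers". Hypothetical helper for illustration.
    """
    seq_id = None
    features = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # blank lines between features are ignored
        if raw.startswith(">Feature"):
            seq_id = raw.split()[1]
            continue
        cols = raw.split("\t")
        cols += [""] * (5 - len(cols))  # pad short rows to five columns
        start, stop, key, qual_key, qual_value = (c.strip() for c in cols[:5])
        if start and stop:
            interval = (int(start), int(stop))
            if key:  # first row of a new feature
                features.append({"key": key, "intervals": [interval],
                                 "qualifiers": []})
            elif features:  # additional interval of the current feature
                features[-1]["intervals"].append(interval)
            else:
                raise ValueError("interval row before any feature key")
        elif qual_key:
            if not features:
                raise ValueError("qualifier row before any feature")
            features[-1]["qualifiers"].append((qual_key, qual_value))
    return seq_id, features


table = (
    ">Features lcl|eIF4E\n"
    "80\t2881\tgene\n"
    "\t\t\tgene\teIF4E\n"
    "201\t224\tCDS\n"
    "1550\t1920\n"
    "1986\t2085\n"
    "2317\t2404\n"
    "2466\t2629\n"
    "\t\t\tproduct\teukaryotic initiation factor 4E-II\n"
)

seq_id, features = parse_feature_table(table)
print(seq_id)
for feature in features:
    print(feature["key"], feature["intervals"], feature["qualifiers"])
```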

Features that are on the complementary strand are indicated by reversing the interval locations. For example, the table:

>Features lcl|dna2
5284    5202    tRNA
                        product  tRNA-Glu

will result in a GenBank flatfile containing:

     tRNA            complement(5202..5284)
                     /product="tRNA-Glu"
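The strand convention for a single interval can be expressed in a few lines of Python. This is only a sketch of the rule described above, with a hypothetical function name; it does not cover multi-interval or partial features.

```python
def interval_to_location(start, stop):
    """Convert one feature table interval to a flatfile location.

    A start greater than the stop signals the complementary strand.
    Hypothetical helper for illustration only.
    """
    if start > stop:
        return f"complement({stop}..{start})"
    return f"{start}..{stop}"


print(interval_to_location(5284, 5202))
print(interval_to_location(80, 2881))
```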

More instructions on using the feature table format for submitting large genomic records are available at
http://www.ncbi.nlm.nih.gov/Sequin/table.html.