Transcriptome Shotgun Assembly Sequence Database

What is the Transcriptome Shotgun Assembly (TSA) Database?

TSA is an archive of sequences computationally assembled from primary data such as ESTs, traces, and reads from Next Generation Sequencing technologies. The overlapping sequence reads from a complete transcriptome are assembled into transcripts by computational methods rather than by traditional cloning and sequencing of cloned cDNAs. The primary sequence data used in the assemblies must have been experimentally determined by the same submitter. TSA sequence records differ from EST and GenBank records because the assemblies have no physical counterparts.

How Do TSA Sequence Records Differ from Other GenBank/EMBL/DDBJ Records?

The display of a TSA sequence is similar to other International Nucleotide Sequence Database Collaboration (INSDC) records, but includes the following:

  • The label 'TSA:' at the beginning of each Definition Line.
  • DBLINK
    • BioProject
    • BioSample (optional)
    • Sequence Read Archive
  • Keywords:  TSA; Transcriptome Shotgun Assembly
  • Assembly data
  • Comment describing the assembly if from a multi-step process.

Other Features and References are similar to those displayed in regular GenBank/EMBL/DDBJ records.

An example of a TSA submission is GAAAA00000000.

TSA sequence records are shared by all three INSDC databases and can be found using typical search methods in Entrez Nucleotide and Entrez Protein.

General Information

Nucleotide sequences must conform to the following standards:

  • Submitted sequences must be assembled from data experimentally determined by the submitter.
  • Sequences must be screened for vector contamination, and any vector/linker sequence must be removed. This includes the removal of NextGen sequencing primers.
  • Sequences cannot be shorter than 200 bp.
  • Sequences should contain no more than 10% n's and no run of more than 14 consecutive n's.
  • If the submission is a single-step, unannotated assembly and the output is one or more BAM files, submit these to SRA as a TSA project.
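
The length and n-content standards above lend themselves to an automated pre-check. The following is a minimal sketch, not an official validator: the function name check_tsa_sequence and its parameters are hypothetical, and it covers only the three numeric rules (vector and primer screening needs a separate tool).

```python
import re


def check_tsa_sequence(seq, min_len=200, max_n_fraction=0.10, max_n_run=14):
    """Return a list of problems found in one nucleotide sequence.

    Hypothetical helper; the default thresholds mirror the standards
    listed above (200 bp minimum, at most 10% n's, at most 14 n's in a row).
    """
    problems = []
    seq = seq.strip()

    # Rule: sequences cannot be shorter than 200 bp.
    if len(seq) < min_len:
        problems.append(f"shorter than {min_len} bp")

    # Rule: no more than 10% n's overall (case-insensitive).
    n_count = seq.lower().count("n")
    if seq and n_count / len(seq) > max_n_fraction:
        problems.append(f"more than {max_n_fraction:.0%} n's")

    # Rule: no run of more than 14 consecutive n's.
    runs = (len(m.group()) for m in re.finditer(r"n+", seq, flags=re.IGNORECASE))
    if max(runs, default=0) > max_n_run:
        problems.append(f"run of more than {max_n_run} consecutive n's")

    return problems
```

An empty list means the sequence passed these three checks only; it says nothing about contamination or about the other requirements on this page.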

Requirements:

  • Register your project in the BioProject database as a Transcriptome Shotgun Assembly project.
  • Raw reads should be submitted to SRA and the SRA run accession(s) (SRR) provided.  Do not include SRA and SRX accession numbers.
  • Assembly Data structured comment. Please see Creating the Structured Comment Table.
  • Description of the assembly process if a multi-step assembly was performed.
  • The library information for the primary data should be annotated on the Source Feature.  The information can also be submitted to BioSample and the BioSample accession(s) provided.
  • If annotation is provided, the product names should follow the UniProt-Protein Naming Guidelines.
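
Because only run accessions (SRR) should be provided, it can help to filter a list of accessions before building the submission. This is a rough sketch under an assumption: the pattern "SRR followed by digits" is an illustrative guess at the run accession format, not a definition taken from SRA documentation. The function name and the accessions in the usage note are made up.

```python
import re

# Assumed format for a run accession: the prefix SRR followed by digits.
RUN_ACCESSION = re.compile(r"SRR\d+")


def split_run_accessions(accessions):
    """Separate run accessions (SRR) from anything else, e.g. SRA or SRX.

    Hypothetical helper. Returns (runs, rejected), preserving input order.
    """
    runs = []
    rejected = []
    for accession in accessions:
        if RUN_ACCESSION.fullmatch(accession.strip()):
            runs.append(accession.strip())
        else:
            rejected.append(accession.strip())
    return runs, rejected
```

For example, split_run_accessions(["SRR000001", "SRX000002"]) keeps the first value and rejects the second; both accessions here are placeholders, not real records.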

Creating the submission file

Submission Process:

  • The submission file can be generated using Sequin or tbl2asn.
  • The Sequin file(s) should be submitted using the TSA submission portal. Select the TSA option on the submission form.

Submission Tools:

Sequin

  • Select "Use a Submission Wizard":TSA
  • There are dialogs to enter:
    • Assembly data
    • Assembly description
    • BioProject
    • SRR accession
    • BioSample
  • The wizard is not for large sets of sequences.
  • A single submission should not consist of multiple BioProjects.

tbl2asn

  • tbl2asn reads a template.sbt along with the sequence and table files, and outputs ASN.1 for submission to TSA.

FASTA defline components:

  • [moltype=mRNA]
  • [tech=TSA]
  • [bioproject=PRJNAXXXX1]
  • [SRA=SRRXXXXX1]
  • [biosample=SAMNXXXXXXX1]

For multiple accessions, separate the values with commas. For example:

  [SRA=SRRXXXXXX1,SRRXXXXXX2]
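
The defline components above can be assembled programmatically when preparing many FASTA records. The sketch below is illustrative only: the function name build_defline is hypothetical, and it assumes the modifiers follow the sequence identifier on the same line, separated by spaces. The accessions are the placeholders used above, not real records.

```python
def build_defline(seq_id, bioproject, sra_runs, biosamples=()):
    """Build a FASTA defline carrying the TSA modifiers listed above.

    Hypothetical helper. Multiple SRA or BioSample accessions are joined
    with commas, as described in the text.
    """
    modifiers = [
        ("moltype", "mRNA"),
        ("tech", "TSA"),
        ("bioproject", bioproject),
        ("SRA", ",".join(sra_runs)),
    ]
    # BioSample is optional, so only add it when accessions are supplied.
    if biosamples:
        modifiers.append(("biosample", ",".join(biosamples)))

    tags = " ".join(f"[{key}={value}]" for key, value in modifiers)
    return f">{seq_id} {tags}"


# Placeholder accessions copied from the examples above.
print(build_defline("contig1", "PRJNAXXXX1", ["SRRXXXXXX1", "SRRXXXXXX2"]))
```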

tbl2asn command line arguments:

  • -w assembly.cmt: Import assembly data. See Creating the Structured Comment Table for more information.
  • -Y: Import assembly comment.
  • -M t: Include the standard validator and additional TSA checks.

Sample command line:

tbl2asn -t template.sbt -p. -a s -w assembly.cmt -Y comment -M t
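
When tbl2asn is driven from a script, the sample command can be built as an argument list. This is a sketch only: the function name tbl2asn_args and its parameter names are hypothetical, the flags and default values are copied from the sample command line, and the path is passed as a separate argument ("-p ."), which is assumed to be equivalent to "-p.".

```python
import shlex


def tbl2asn_args(template="template.sbt", path=".",
                 structured_comment="assembly.cmt", assembly_comment="comment"):
    """Return the tbl2asn argument list from the sample command line.

    Hypothetical helper; it only builds the arguments and does not run
    tbl2asn, so it works without the program installed.
    """
    return [
        "tbl2asn",
        "-t", template,            # submission template
        "-p", path,                # directory holding sequence and table files
        "-a", "s",                 # value taken unchanged from the sample command
        "-w", structured_comment,  # import assembly data
        "-Y", assembly_comment,    # import assembly comment
        "-M", "t",                 # standard validator plus TSA checks
    ]


# Show the command as it would be typed in a shell.
print(shlex.join(tbl2asn_args()))
```

To execute the command, the list could be passed to subprocess.run, assuming tbl2asn is installed and on the PATH.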

Creating the Structured Comment Table

The structured comment table is a single tab-delimited table of tag-value pairs that are applied to all of the sequences in your submission. For TSA records, the Assembly Method (with version and/or year if available) and Sequencing Technology must be included. Coverage and Assembly Name are optional.

If you are using tbl2asn, generate the table to import using the Structured Comment page.

  • If you choose the save option, the table will automatically be saved as assembly.cmt. If you are saving multiple tables with different options, you will need to change the file name for each structured comment.
  • If you use the open option, the table is generated in the browser window and must be copied and saved.

An example table:

StructuredCommentPrefix    Assembly
Assembly Method            Newbler 2.0
Coverage                   220x
Sequencing Technology      454; Solexa
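
For scripted workflows, a table like the one above can also be written directly as tab-delimited text. The following is a minimal sketch, not a replacement for the Structured Comment page: the function name is hypothetical, the values are copied from the example table, and the exact spelling of the optional "Assembly Name" tag is an assumption.

```python
def structured_comment_table(prefix, assembly_method, sequencing_technology,
                             coverage=None, assembly_name=None):
    """Return the structured comment as tab-delimited tag-value lines.

    Hypothetical helper. Assembly Method and Sequencing Technology are
    required for TSA records; Coverage and Assembly Name are optional.
    """
    if not assembly_method or not sequencing_technology:
        raise ValueError("Assembly Method and Sequencing Technology are required")

    rows = [
        ("StructuredCommentPrefix", prefix),
        ("Assembly Method", assembly_method),
    ]
    if assembly_name:
        rows.append(("Assembly Name", assembly_name))  # tag spelling assumed
    if coverage:
        rows.append(("Coverage", coverage))
    rows.append(("Sequencing Technology", sequencing_technology))

    # One tag-value pair per line, separated by a single tab.
    return "".join(f"{tag}\t{value}\n" for tag, value in rows)


# Values copied from the example table above.
table = structured_comment_table("Assembly", "Newbler 2.0", "454; Solexa", coverage="220x")
print(table, end="")
```

The returned string can then be saved as assembly.cmt and passed to tbl2asn with the -w argument.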

Should not be submitted to TSA

  • Assemblies from sequences not directly sequenced by the submitter.
  • Clone-based assemblies. These should be submitted to GenBank.
  • Sequences assembled by inserting Ns to represent the gaps.
  • A single assembly from multiple organisms.

Last updated: 2012-08-29T11:08:57-04:00