TSA Frequently Asked Questions

What type of transcriptome assembly should be submitted to SRA?

Single-step, unannotated assemblies in BAM format should be submitted to SRA as a Transcriptome Project.

Can I submit single-pass reads to TSA?

Single-pass reads may be submitted as part of a TSA project provided they are in the minority and add information to the transcriptome project.

Can I submit assemblies with Ns inserted to estimate the length of the transcript?

TSA only accepts assemblies of overlapping reads or transcripts. We do not take sequences that have been assembled by inserting Ns, even for paired-end reads. The only Ns allowed are internal Ns that represent ambiguous bases, and these must meet two requirements: Ns must make up less than 10% of the total sequence, and the number of Ns in a row must be less than 15. Remove any Ns at the ends of an assembly. Ns that represent gaps are not allowed, so any sequence containing runs of Ns that represent gaps must be split at those runs into separate contigs. This applies even to sequences produced by assemblers that connect mate-pairs.
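
The trimming and splitting described above can be sketched in Python. This is only an illustration, not an NCBI tool: the function name trim_and_split is hypothetical, and the sketch assumes that any run of 15 or more Ns is a gap. Whether a given run really is a gap or a stretch of ambiguous bases is something only the submitter can decide.

```python
import re

MAX_N_RUN = 14  # assumption: runs of 15 or more Ns are treated as gaps


def trim_and_split(seq, max_n_run=MAX_N_RUN):
    """Strip terminal Ns, then split the sequence at every long N run."""
    trimmed = seq.strip("Nn")
    # Matches each maximal run of Ns that is longer than max_n_run.
    gap = re.compile(r"[Nn]{%d,}" % (max_n_run + 1))
    # Drop empty pieces so only real contigs are returned.
    return [piece for piece in gap.split(trimmed) if piece]


contigs = trim_and_split("NNACGT" + "N" * 20 + "TTNGG" + "NN")
print(contigs)
```

Shorter internal runs, such as the single N in the second piece, are left in place. Each resulting piece would still have to meet the other TSA requirements on its own.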

Is annotation required?

Annotation is not required. However, if annotation is included, the product names should follow the UniProt-Protein Naming Guidelines.

Can I submit an assembly of EST/SRA/trace archive data generated by another group?

No. All submitted assemblies must be derived from primary data generated by the same group.

Where should clonally derived sequences be submitted?

These sequences should be submitted to GenBank. Only sequences computationally assembled by a program such as CAP3 should be submitted to TSA.

Why the size limit of 200 base pairs?

There is a size limit for TSA for two reasons. First, with the increased length of the sequences generated by the new sequencing technologies, a sequence that is an assembly would be expected to be longer than the shortest sequence generated by these machines. Second, a minimum of 200 base pairs is required to obtain acceptable coding region prediction results.
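
As a rough illustration of applying this limit before submission, the following Python sketch separates sequences that meet the 200 base pair minimum from those that do not. The function name split_by_length and the dictionary of names to sequences are assumptions made for the example, not part of any TSA tool.

```python
MIN_LENGTH = 200  # minimum length in base pairs, as stated above


def split_by_length(records, min_length=MIN_LENGTH):
    """Split a {name: sequence} mapping into (kept, too_short) dictionaries."""
    kept, too_short = {}, {}
    for name, seq in records.items():
        target = kept if len(seq) >= min_length else too_short
        target[name] = seq
    return kept, too_short


records = {"Contig1": "A" * 250, "Contig2": "A" * 150}
kept, too_short = split_by_length(records)
print(sorted(kept), sorted(too_short))
```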

Is there a limit on the number of Ns allowed in an assembled sequence?

The number of Ns in a row must be less than 15. The assembled sequence cannot have more than 10% Ns. The reason for these limits is that an extended string of Ns in sequence from the new sequencing technologies usually indicates a bad read.
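
A minimal Python sketch of checking both limits is shown here. The function name n_limit_problems is hypothetical, and the sketch assumes that a sequence with exactly 10% Ns is acceptable, following the wording "cannot have more than 10%".

```python
import re


def n_limit_problems(seq, max_run=14, max_fraction=0.10):
    """Return a list of messages describing which N limits a sequence breaks."""
    if not seq:
        return ["empty sequence"]
    runs = re.findall(r"[Nn]+", seq)
    longest = max((len(run) for run in runs), default=0)
    fraction = sum(len(run) for run in runs) / len(seq)
    problems = []
    if longest > max_run:
        problems.append(
            f"longest N run is {longest}; it must be less than {max_run + 1}"
        )
    if fraction > max_fraction:
        # Assumption: exactly 10% Ns passes, anything above fails.
        problems.append(
            f"{fraction:.1%} of bases are N; the limit is {max_fraction:.0%}"
        )
    return problems


print(n_limit_problems("A" * 100 + "N" * 15 + "A" * 100))
```

An empty list means the sequence passes both checks. A sequence can fail one check without failing the other, so both messages are reported independently.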

Should I use Sequin or tbl2asn?

The Sequin TSA wizard should be used for small submissions obtained from a single project or sample. Sequin should not be used for large batches of sequences, because the time needed to import the FASTA file and generate the .sqn file increases as more sequences are imported. Significant delays have been observed when importing 8,000 sequences. We recommend using tbl2asn for any large data set.

Why are my Contig names changed from ContigXXXX?

Contig names are the primary way of differentiating your TSA sequences. We encourage submitters to use unique contig names, for example by adding a suffix, when possible. If not, we will add a suffix for you.
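
One way to add such a suffix yourself is sketched in Python here. The function name add_suffix, the underscore separator, and the example suffix "leafA" are all assumptions made for illustration. TSA does not prescribe a particular suffix format in this FAQ.

```python
def add_suffix(names, suffix):
    """Append the same suffix to every contig name and confirm uniqueness."""
    renamed = [f"{name}_{suffix}" for name in names]
    if len(set(renamed)) != len(renamed):
        raise ValueError("duplicate contig names remain after adding the suffix")
    return renamed


print(add_suffix(["Contig1", "Contig2"], "leafA"))
```

Using a different suffix for each sample or assembly keeps generic names such as Contig1 distinct across submissions.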

Last updated: 2012-04-23T10:27:28-04:00