The NCBI Influenza Virus Sequence Database | |
Nucleotide or
protein sequences can be searched by adding a comma or space separated list of GenBank accession numbers or uploading a text file containing such a list under the "Get sequences
by accession" section. The sequences can be added to the "Query builder" or shown directly by clicking the "Add query" or "Show results" buttons.
To search the database using other terms, first decide whether you would like to search for protein sequences, their coding regions, or nucleotide sequences, by
checking the radio buttons to the left of the sequence type names.
The "Search for keyword" section allows users to search for sequences by 1) a word or phrase in virus strain names (e.g. New York); 2) a
pattern in nucleotide or protein sequences (e.g. AGCGAAAGCAGGGGT or RSKV); and 3) drug-resistance mutations in protein sequences (e.g. S31N or H274Y). A list of mutations
annotated in the database can be found here.
In the "Define search set" section, select one or multiple names (by holding the Ctrl or Shift key) each from the lists provided, and/or fill in the boxes. The fields are virus type (e.g. Influenza virus A, B
or C), Host (e.g. Human or Avian), Country/Region (e.g. Australia or Asia) or Year (or a range of years) viruses
were isolated, Segment (1 through 8) or protein name
(e.g. PB1-F2 or M1), Subtypes (e.g. H3N2 or H5), and a range of the lengths of the sequences.
You can limit your search results to full-length sequences by checking the appropriate
boxes. "Full-length only" applies to sequences that have complete coding regions including start and stop codons; they are labelled as "c" (for complete) in the database query result. "Full-length plus"
applies to all "Full-length only" sequences, plus those missing only start and/or stop codons, which are labelled as "nc" (for nearly complete) in the database query result. Partial sequences are labelled as "p" in the database query result.
Month and day can be added in addition to year. Please note that
not all sequences have month and day available. Therefore sequences with only year as collection date will not be included
in a search if a month of the corresponding year is entered in the query. For example, a search for sequences from
2006/05 to 2008/11 will retrieve those with month in collection date for 2006 and 2008, but not those with only 2006 or 2008
as collection date (because they could be from 2006/04 or 2008/12). However, all sequences from 2007, with or without month in the collection date, will be included in such
a query. Check the boxes next to "Month" or "Day" under "Collection date must contain" if one wants to retrieve only sequences with month or day in the collection date.
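The date-range behaviour described above can be modelled as an interval test: a record is kept only if every month it could have come from lies inside the query range. The following Python sketch illustrates that reading; the function names are hypothetical and the database's actual implementation is not documented here.

```python
def month_bounds(date_text):
    """Return the earliest and latest (year, month) that a date can mean.

    Accepts 'YYYY', 'YYYY/MM' or 'YYYY/MM/DD'; the day, if any, is ignored.
    """
    parts = [int(p) for p in date_text.split("/")]
    year = parts[0]
    if len(parts) > 1:
        return (year, parts[1]), (year, parts[1])
    # A year-only date could fall in any month of that year.
    return (year, 1), (year, 12)


def in_query_range(collection_date, start, end):
    """True if all months the record could come from lie within [start, end]."""
    earliest, latest = month_bounds(collection_date)
    range_start = month_bounds(start)[0]
    range_end = month_bounds(end)[1]
    return range_start <= earliest and latest <= range_end


# Query from 2006/05 to 2008/11, as in the example above.
for date in ["2006", "2006/06", "2007", "2008", "2008/12"]:
    print(date, in_query_range(date, "2006/05", "2008/11"))
```

Under this model a year-only 2007 record is kept because all of 2007 lies inside the range, while year-only 2006 and 2008 records are dropped.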
Released date is the date when a sequence first appeared in GenBank.
Check boxes next to the segment/protein names under "Required segments" to retrieve sequences defined in the "Segment/Protein" field when all of the selected segments of the same
virus isolate exist in the database. Check the "Full-length only" box in this section if the required segments must be full-length.
From a drop-down menu next to "Pandemic (H1N1) viruses" (also known as the swine flu outbreak), you can include, exclude
or retrieve only these sequences in your search results. Newly released sequences can be retrieved from the database by defining the GenBank release
date. For example, A(H1N1)pdm09 virus sequences released in GenBank between June 30 and July 6, 2010 can be
retrieved using this database query.
From a drop-down menu next to "The FLU project", you can include, exclude
or retrieve only these sequences in your search results. Sequences from the FLU project are those submitted to GenBank through a streamlined GenBank submission
pipeline.
These are mostly from large-scale flu genome sequencing projects, which usually provide complete genomes, detailed source information and
high-quality annotations. Currently, the major contributors are the NIAID Influenza Genome Sequencing Project, the St. Jude Influenza Genome Project, the Centers for Disease Control and
Prevention, Centers of Excellence for Influenza Research and Surveillance (CEIRS), and the University of Hong Kong.
Sequences of reassortments or lab strains (those flagged as "LAB" in the country field) are excluded in the search by default,
and the drop-down menu next to "Lab strains" can be used should you want to include or retrieve only those sequences.
From a drop-down menu next to "Vaccine strains", you can include, exclude
or retrieve only sequences of WHO recommended vaccine strains in your search results.
From a drop-down menu next to "Lineage defining strains", you can include, exclude
or retrieve only sequences of prototype viruses of well defined lineages/clades. Currently, this includes those for Influenza
B viruses (Victoria and Yamagata), and the H5N1 and H9N2 subtypes of Influenza A viruses.
By checking the box next to "Collapse identical sequences", all groups of identical sequences in a dataset will be represented by the oldest
sequence in the group. This will reduce the number of sequences in some cases by keeping only unique sequences in a dataset.
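If you download an uncollapsed dataset and want to reproduce a similar reduction locally, the idea can be sketched in Python as follows. This is an illustration only: the record layout (accession, date, sequence) and the use of an ISO-style date string to decide which sequence is "oldest" are assumptions, and the accessions are invented.

```python
def collapse_identical(records):
    """Keep one representative per unique sequence.

    records: iterable of (accession, date, sequence) tuples, where date is
    an ISO-style string such as '2009-06-05' so that string order is
    chronological (an assumption of this sketch).
    Returns (accession, date, sequence, group_size) tuples.
    """
    groups = {}
    for accession, date, sequence in records:
        groups.setdefault(sequence.upper(), []).append((date, accession))

    collapsed = []
    for sequence, members in groups.items():
        oldest_date, oldest_accession = min(members)
        collapsed.append((oldest_accession, oldest_date, sequence, len(members)))
    return collapsed


# Invented records for illustration.
records = [
    ("X00001", "2009-06-05", "ATGC"),
    ("X00002", "2008-01-01", "ATGC"),
    ("X00003", "2010-01-01", "ATGG"),
]
for row in collapse_identical(records):
    print(row)
```

The last element of each tuple plays the role of the "#" column shown in the query results when identical sequences are collapsed.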
After clicking the "Add Query" button, the query you selected and the number of resulting sequences will be shown in "Query
Builder". If "any" is selected in "Virus Species" and/or "Segment", a warning message will be shown and the "Multiple sequence
alignment" and "Tree building" functionalities will not be allowed in the subsequent steps when the resulting dataset contains sequences from
different virus species and/or different segments. A sample query page can be found here.
Multiple queries can be built by repeating the above steps. When a different "Virus Species" and/or "Segment" is selected in the new
query, the same warning message described above will be shown and the "Multiple sequence
alignment" and "Tree building" functionalities will not be allowed in the subsequent steps, if the resulting dataset contains sequences from
different virus species and/or different segments. When a different sequence type (i.e. Protein sequence, Coding region or
Nucleotide sequence) is selected for the new query, a pop-up window will ask whether you indeed would like to start a new query with a
new sequence type (which will clear the current "Query Builder"), or you want to continue with the current sequence type by going back to the current query
builder. This is to prevent mixing different sequence types in the same "Query Builder" (e.g.
protein sequences with nucleotide sequences). Queries in any combination from the "Query Builder" can be selected to get
sequences from the database.
Sequences found by the selected queries will be shown in a separate window once you click the "Show results" button. By default, the sequences are ordered by the virus
names. They can be reordered by up to
three fields sequentially, by holding the Ctrl or Shift key while clicking on field headers. A sample resulting page can be found
here.
Sequences of interest can be
selected by checking the boxes to the left of accession numbers. When "Collapse identical sequences" is selected in query, the numbers of identical sequences in the collapsed
groups are shown in the column "#". These groups can be expanded by clicking the numbers, and sequences within the groups can be selected to be included in the dataset as well.
The corresponding protein, coding region or nucleotide sequences of the
selected sequences can be downloaded by selecting the appropriate name in the "Download results" drop-down menu. To meet the need
of different users, the definition line of the FASTA sequences in the downloaded files can be customized by clicking "Customize FASTA defline". The default defline is in the format of ">{accession} {strain} {year}/{month}/{day}
{segname}" (e.g. >ADA83577 A/Argentina/HNRG13/2009 2009/06/05 PB2), but you are able to add any fields by clicking the ones listed, or remove any
by deleting them from the Defline editing box. A space is inserted between fields by default, but it can be replaced with other characters by
typing in the editing box. When the "Remember changes" box is checked, the defline format you defined will be remembered and used in all subsequent downloads, until it is reset or cookies are deleted in the browser.
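The default defline format happens to be a valid Python format string, so the effect of customizing it can be previewed with a few lines of code. The sketch below is not part of the web tool; the dictionary keys simply mirror the field names in the default format, and the custom template is one arbitrary example.

```python
DEFAULT_DEFLINE = ">{accession} {strain} {year}/{month}/{day} {segname}"


def format_defline(record, template=DEFAULT_DEFLINE):
    """Fill a defline template with the fields of one sequence record."""
    return template.format(**record)


record = {
    "accession": "ADA83577",
    "strain": "A/Argentina/HNRG13/2009",
    "year": "2009",
    "month": "06",
    "day": "05",
    "segname": "PB2",
}

print(format_defline(record))
# A custom format that drops the date and uses '|' instead of spaces.
print(format_defline(record, ">{accession}|{strain}|{segname}"))
```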
A
list of GenBank accession numbers for selected protein or nucleotide sequences, and a table of the search result in XML, CSV or
tab-delimited format can also be downloaded from the "Download results" menu.
Further sequence analysis of the selected sequences can be performed by clicking the "Do multiple alignment" or "Build a tree" button,
if they are allowed (i.e. no mixing of species and/or segments in the dataset). Your own sequences (of the same sequence
type, in FASTA format) can be added to the selected sequences for analysis by clicking the "Add your own sequences" button. The file
of added sequences cannot be larger than 128 KB. |
Genome Set | |
The Influenza Virus Genome Set Tool displays nucleotide
sequences obtained from the NCBI Influenza Virus Sequence Database ordered by genome
segments for each virus. All segments of the same virus are grouped together in the same background color, alternating in light blue and
white. Genomes of the same virus isolate but sequenced in different labs are identified in the database, and are grouped separately based on
the sequence submitters. This tool is a convenient way to check the completeness of genome segments for viruses of interest.
Database searches can be performed in the same way as described above. By default, this tool only gets
viruses with a complete set of segments in full-length (or in full-length plus if the check box
next to "Complete plus" is selected). To
get all viruses with any number of sequences, select the radio button next to "All" in the "Show results" box. The results are shown in descending order of the number of
segments the viruses have.
|
Alignment | |
Multiple alignments of nucleotide or protein sequences from the NCBI Influenza Virus Sequence
Database and/or user's input file can be obtained, using the MUSCLE program. Start the
alignment by selecting the Alignment button in the top horizontal bar. This
will open a database query interface similar to the one described above. Please follow the instructions
for database query and be sure to select sequences from the same segment of the genome, and preferably of similar
sizes.
At most 1,000 sequences can be included in an alignment. For datasets larger than 1,000 sequences,
we recommend downloading the sequences using the download tool of the database and running the multiple sequence alignment with a
locally installed program (e.g. MUSCLE).
After sequences of interest are selected from the database and/or added from an input file, click the "Do multiple alignment" button to get the alignment.
The consensus sequence is displayed at the top of the alignment; residues identical to the consensus are shown as dots and gaps are shown as
dashes. In the coding region alignment, non-synonymous changes (in triplets) are highlighted in a different background color. The alignment can be navigated horizontally either by typing the position you would like the sequences to start from in the text box
after "Go to position" and clicking "Go", or by moving the scroll bar at the bottom of the alignment. When a sequence in the alignment is clicked, a small window pops up. The GenBank record
for the sequence can be opened by clicking the accession number in the pop-up window. The sequence can also be selected for BLAST 2 Sequences (click the "BLAST 2 seq." button after two different sequences are
selected from the alignment). When the "Select for anchor" option is chosen in the pop-up window, the consensus sequence is replaced by
the selected sequence. When the anchor sequence is clicked, a small window with options pops up: the anchor can be reset to the consensus sequence, and the anchor/consensus sequence can be displayed for copying. The multiple alignment file in FASTA format can
be downloaded by selecting "Download alignment". A printer-friendly version of the alignment can be obtained by clicking the "Print-friendly
version" button. If desired, click the "Build a tree" button to build a tree from the aligned sequences.
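The dot-and-dash display described above can be imitated on a downloaded alignment. The Python sketch below assumes a simple majority-rule consensus, which may differ from the rule the viewer actually uses, and the function names are hypothetical. Passing any aligned sequence as the anchor mimics the "Select for anchor" option.

```python
from collections import Counter


def consensus(aligned):
    """Most frequent character in each column of equal-length aligned rows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*aligned))


def mask_identical(aligned, anchor):
    """Show residues that match the anchor as '.', leaving gaps as '-'."""
    rows = []
    for seq in aligned:
        rows.append(
            "".join(
                "." if c == a and c != "-" else c
                for c, a in zip(seq, anchor)
            )
        )
    return rows


aligned = ["ATGCA", "ATGCT", "ACG-A"]
anchor = consensus(aligned)
print(anchor)
for row in mask_identical(aligned, anchor):
    print(row)
```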
|
Clustering and phylogenetic analysis | |
Scope
The interactive tool DatasetExplorer is part of the NCBI Influenza Virus Resource and provides an easy way to perform preliminary analysis
on nucleotide and protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file. Datasets are visually represented
using phylogenetic/clustering trees. Users can select the algorithm used to build a tree as well as the
similarity criterion.
|
Overview of the Methodology
First of all, start the tool by clicking the "Tree" button in the top
horizontal bar. Sequences are acquired from the NCBI Influenza Virus Sequence Database or
uploaded by a user as described
above. After a dataset has been selected, the sequences are aligned using a multiple alignment algorithm, in order to identify common regions in the sequences and establish correspondence between sequence columns (we perform multiple protein alignment, while alignment of the nucleotide sequences for the coding regions is induced by the protein alignment). Distances between sequences are calculated based on their dissimilarity in a selected region on the alignment, and analysis is performed. We offer visualization based on phylogenetic and clustering tree methods: the classical neighbor-joining method and
agglomerative hierarchical clustering methods.
Alignment of protein sequences is performed using the protein multiple alignment tool
MUSCLE. We offer different distance measures for calculating pairwise distances between sequences. In particular, we use some of the distances implemented in the PHYLIP package, as well as the mPAM weight matrix.
|
Sequence Alignment
The tool performs multiple protein alignments using the MUSCLE program and creates nucleotide alignment of the corresponding coding regions from protein alignment by using codon-amino acid correspondence.
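The codon-amino acid correspondence mentioned above can be illustrated with a short Python sketch: each residue in the aligned protein consumes one codon from the coding region, and each protein gap becomes three nucleotide gaps. This is a generic illustration of the technique, not the tool's own code, and the function name is hypothetical.

```python
def thread_codons(aligned_protein, cds):
    """Build a gapped coding-region row from a gapped protein row.

    aligned_protein: protein sequence with '-' for alignment gaps.
    cds: ungapped coding region; codons beyond the protein length
    (for example a trailing stop codon) are ignored.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for c in aligned_protein if c != "-")
    if residues > len(codons):
        raise ValueError("coding region is shorter than the protein")

    out = []
    next_codon = 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")  # one protein gap spans three nucleotides
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)


print(thread_codons("M-KV", "ATGAAAGTT"))
```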
After sequences are obtained from the NCBI Influenza Virus Sequence Database and/or a user's input file, click the "Build a tree" button in
the database query results page to start the process. This will bring up a window with a graphic view of the multiple sequence alignment.
Sequence Region Selection
The graphic view of the multiple alignments of sequences selected from the previous step is displayed. The black and red colors in
the graphics represent the presence and absence of amino acid residues at the corresponding positions. The positions in the longest sequence of the selected
set for the first and last amino acid of each sequence are shown. A histogram showing the total
number of amino acid residues at each position is displayed at the top of the page. The program automatically selects the sequence region to be analyzed so that the majority of the sequences in the set will
be included. The sequence region can also be defined by users by first selecting all sequences in the set, and then entering the start and end
positions in the boxes provided. When clicking the "Select sequences" button, the region from sequences that have complete coverage between the two positions will be selected, and sequences excluded
from the selection will be highlighted with a background color in the graphic view.
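The selection rule can be sketched in Python as follows. The sketch assumes that "complete coverage" means a sequence's first residue lies at or before the start position and its last residue at or after the end position, with both positions counted as 1-based alignment columns; the tool's exact rule and coordinate system may differ, and the names are hypothetical.

```python
def residue_span(aligned_seq):
    """Return 1-based columns of the first and last non-gap characters."""
    cols = [i for i, c in enumerate(aligned_seq, start=1) if c != "-"]
    return (cols[0], cols[-1]) if cols else None


def select_region(alignment, start, end):
    """Split an alignment into rows that span [start, end] and rows that do not.

    alignment: dict mapping sequence name to its gapped sequence.
    Returns (selected, excluded): selected maps names to the trimmed region,
    excluded lists the names without complete coverage.
    """
    selected, excluded = {}, []
    for name, seq in alignment.items():
        span = residue_span(seq)
        if span and span[0] <= start and span[1] >= end:
            selected[name] = seq[start - 1:end]
        else:
            excluded.append(name)
    return selected, excluded


alignment = {"s1": "MKVLAA", "s2": "--VLAA", "s3": "MKVL--"}
print(select_region(alignment, 2, 4))
```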
Phylogenetic/Clustering Tree
A clustering or phylogenetic tree can be built by selecting one of the clustering
algorithms and a distance calculating method from the list, and clicking the "Next step" button.
Sequences of interest can be highlighted in the tree, and they can be selected or deselected using the check boxes
to the right of each sequence.
Distance methods approximating minimum evolution
Method | Description
Neighbor-Joining | At each step, the pair of nodes with the smallest value of $D_{ij} - b_i - b_j$ is chosen, where $D_{ij}$ is the distance between nodes $i$ and $j$, $n$ is the current number of nodes, and $b_i = \sum_{k=1}^{n} D_{ik} / (n-2)$. The distance between the new node $u$ and each remaining node $k$ is defined as $D_{uk} = (D_{ik} + D_{jk} - D_{ij}) / 2$. Branch lengths are defined as $v_{ui} = (D_{ij} + b_i - b_j) / 2$ and $v_{uj} = (D_{ij} + b_j - b_i) / 2$ (negative lengths are truncated to zero).
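The neighbor-joining steps in the table can be followed in a compact Python sketch. It applies the same selection criterion, distance update and branch-length formulas, but it is a teaching illustration rather than the tool's implementation; the function name, the `frozenset`-keyed distance table and the `node0`, `node1`, ... labels for internal nodes are all choices made for this example.

```python
import itertools


def neighbor_joining(names, distances):
    """Join nodes pairwise until two remain.

    names: list of leaf labels.
    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (child, parent, branch_length) edges; the final edge
    connects the last two remaining nodes.
    """
    d = dict(distances)  # copy so the caller's table is not modified
    nodes = list(names)
    edges = []
    created = 0

    def dist(a, b):
        return d[frozenset((a, b))]

    while len(nodes) > 2:
        n = len(nodes)
        # b_i = sum_k D_ik / (n - 2)
        b = {i: sum(dist(i, k) for k in nodes if k != i) / (n - 2) for i in nodes}
        # Choose the pair with the smallest D_ij - b_i - b_j.
        i, j = min(
            itertools.combinations(nodes, 2),
            key=lambda pair: dist(*pair) - b[pair[0]] - b[pair[1]],
        )
        u = f"node{created}"
        created += 1
        d_ij = dist(i, j)
        # Branch lengths, with negative values truncated to zero.
        edges.append((i, u, max(0.0, (d_ij + b[i] - b[j]) / 2)))
        edges.append((j, u, max(0.0, (d_ij + b[j] - b[i]) / 2)))
        # D_uk = (D_ik + D_jk - D_ij) / 2 for every remaining node k.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (dist(i, k) + dist(j, k) - d_ij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]

    last_a, last_b = nodes
    edges.append((last_a, last_b, max(0.0, dist(last_a, last_b))))
    return edges


raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
table = {frozenset(pair): value for pair, value in raw.items()}
for edge in neighbor_joining(list("abcde"), table):
    print(edge)
```

When several pairs tie on the criterion, this sketch takes the first one encountered; other implementations may break ties differently.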
Agglomerative hierarchical clustering methods
Method | Alternative name | Distance between clusters defined as
Average Linkage | UPGMA | Average distance between pairs of objects, one in one cluster, one in another
Complete Linkage | Furthest Neighbor | Maximum distance between pairs of objects, one in one cluster, one in another
Single Linkage | Nearest Neighbor | Minimum distance between pairs of objects, one in one cluster, one in another
|
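The three linkage definitions differ only in how the pairwise distances between two clusters are summarized. A minimal Python sketch, using a hypothetical `cluster_distance` helper and numbers on a line as stand-ins for sequences:

```python
def cluster_distance(cluster_a, cluster_b, dist, linkage="average"):
    """Distance between two clusters under the chosen linkage.

    dist is any function returning the distance between two objects.
    """
    pairs = [dist(x, y) for x in cluster_a for y in cluster_b]
    if linkage == "average":   # UPGMA
        return sum(pairs) / len(pairs)
    if linkage == "complete":  # furthest neighbor
        return max(pairs)
    if linkage == "single":    # nearest neighbor
        return min(pairs)
    raise ValueError(f"unknown linkage: {linkage}")


def absolute_difference(x, y):
    return abs(x - y)


for linkage in ("average", "complete", "single"):
    print(linkage, cluster_distance([0, 1], [4, 6], absolute_difference, linkage))
```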
Protein and Nucleotide Distances
We offer different distance measures for calculating pairwise nucleotide and protein sequence distances, such as those based on
the Felsenstein F84 distance and the Hamming distance for nucleotide sequences; the Dayhoff PAM matrix, the JTT matrix model, the PMB model
and Kimura's approximation for protein sequences, as implemented in the
PHYLIP
package; as well as the mPAM weight matrix for protein sequences.
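Of the measures listed, the Hamming distance is the simplest: it counts the aligned positions at which two sequences differ. The Python sketch below skips columns where either sequence has a gap; whether the tool treats gaps this way, or normalizes the count by length, is not stated here, so both are assumptions of the example.

```python
def hamming_distance(seq_a, seq_b, ignore_gaps=True):
    """Count mismatching columns between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if ignore_gaps and "-" in (a, b):
            continue  # skip columns where either sequence has a gap
        if a != b:
            mismatches += 1
    return mismatches


print(hamming_distance("ATGCA", "ATGTT"))
print(hamming_distance("AT-CA", "ATGTA"))
```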
|
Tree Modification
An adaptive approach is used to visualize the tree in an aggregated
form adapted to the user's screen, allowing users to interactively refine or
aggregate visualization of different parts of the tree (see a paper for details). A branch
on the tree can be selected by clicking the root node, and the resolution of the selected branch can be changed by moving along the scale
bar. The GenBank accession numbers of amino acid sequences in the selected branch of a tree can be exported by clicking the "Download
accessions" button under the
scale bar.
Sequences on the tree can be searched by the fields in the database, and the resulting sequences or groups will be highlighted in green color.
|
Tree Export
The complete tree can be exported in the Newick
format by clicking the "Download full tree" button. The downloaded tree can be
displayed by many tree-viewing programs.
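For readers unfamiliar with Newick, the format writes each internal node as a parenthesized, comma-separated list of its children, optionally followed by a label and a branch length after a colon, and ends the tree with a semicolon. The Python sketch below serializes a small invented tree; the tuple layout is an assumption of this example and says nothing about how the tool stores its trees.

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple to Newick text.

    branch_length may be None (typically for the root); children is a list
    of nodes in the same tuple form, empty for leaves.
    """
    name, length, children = node
    text = ""
    if children:
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += name
    if length is not None:
        text += f":{length:g}"
    return text


# Invented four-leaf tree for illustration.
tree = ("", None, [
    ("A", 0.1, []),
    ("B", 0.2, []),
    ("", 0.05, [("C", 0.3, []), ("D", 0.4, [])]),
])
print(to_newick(tree) + ";")
```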
|
Sequence annotation | |
The Influenza Virus Sequence Annotation Tool is a web application for
user-provided Influenza A virus, Influenza B virus and Influenza C virus sequences. It can predict protein sequences encoded by a flu sequence and produce a feature table that can be used for sequence submission to GenBank, as well as
a GenBank flat file.
The type/segment/subtype of an input influenza sequence is first determined by BLAST, and then aligned against a corresponding sample protein set with a "Protein to nucleotide alignment tool" (ProSplign). The translated product from the best alignment
to the sample protein sequence is used as the predicted protein encoded by the input sequence.
Type/segment/subtype identification
An input sequence is searched by BLAST against a specialized influenza sequences database to determine the virus type (A, B or C), segment (1
through 8) and subtype for the hemagglutinin and neuraminidase segments of Influenza A virus. The database contains one reference sequence for each
virus segment and each subtype of the hemagglutinin and neuraminidase (available here). The top hit in the BLAST result is used to determine the virus
type/segment/subtype of the input sequence.
Sample protein sequences
Representatives of published protein and mature peptide sequences for each virus segment and different subtypes for the hemagglutinin and
neuraminidase segments of Influenza A virus are maintained on the server side (available in the PROTEIN-A, PROTEIN-B and PROTEIN-C directories located
here). For the segments that encode proteins with large variations in amino acid sequences and
mature peptide cleavage sites, more than one protein could be chosen to be included. For example, this collection currently has 16 different
protein samples for hemagglutinin of Influenza A virus. Based on the segment and subtype determined by the BLAST result, a subset of sample protein
sequences is selected and aligned against the input sequence.
Protein to nucleotide alignment
A special global protein-to-nucleotide alignment tool, ProSplign, was designed to accurately annotate spliced genes and mature peptides of influenza viruses.
ProSplign also handles input sequences with insertions and/or deletions which may cause a frame shift in the coding region.
Interpreting alignment result and creating outputs
A successful protein-to-nucleotide alignment should pass the following criteria:
1) The input sequence should start with a correct start codon (or span the beginning of input sequence in case of partial 5' end)
2) The input sequence should end with one of the stop codons (or span the end of input sequence in case of partial 3' end)
3) The input sequence should have no frameshifts or internal stop codons
4) The number of exon(s) must be correct (2 for the second protein of segments 7 and 8 of Influenza A virus and segment 8 of Influenza B virus,
1 exon for all other segments/proteins)
If an alignment passes all four criteria above, the tool adopts the translated protein from the alignment as the protein prediction. Positions of
the start, stop, splice sites (if present) and mature peptide are taken from the alignment. If an alignment fails any of the criteria, the tool
continues by aligning the next sample protein from the reference subset. If none of the sample proteins produces an acceptable
alignment, the best aligned sample protein (with the highest alignment score) is used to generate an error report.
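The first three criteria can be approximated on an unspliced coding region without any alignment at all, which is useful as a quick sanity check before submitting a sequence. The Python sketch below is a simplification under stated assumptions: it does not use ProSplign, it treats a length that is not a multiple of three as a possible frameshift, it assumes ATG as the start codon, and it ignores the exon-count criterion. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_coding_region(cds, partial_5=False, partial_3=False):
    """Return a list of problems found; an empty list means none were found.

    partial_5 / partial_3 relax the start / stop codon checks for sequences
    that are incomplete at the corresponding end.
    """
    problems = []
    cds = cds.upper()
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of three")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    if not partial_5 and (not codons or codons[0] != "ATG"):
        problems.append("missing start codon")
    if not partial_3 and (not codons or codons[-1] not in STOP_CODONS):
        problems.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems


print(check_coding_region("ATGAAATAA"))
print(check_coding_region("ATGTAAAAATAA"))
```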
The first output of a successful annotation is a feature table, which is a five-column, tab-delimited
table of feature locations and qualifiers. The tool also creates the ASN.1, XML and GenBank formatted views of the same annotation, using
the following NCBI developed utilities: tbl2asn and asn2xml.
Drug resistance prediction
The most common signature mutations that might confer drug resistance on the virus can also be detected and reported by this tool. Such
mutations include L26F (e.g. CY009837), V27A (e.g.
DQ186974), A30T (e.g. EU263348), S31N
(e.g. DQ107508) and G34E (e.g. L25818) in the M2 protein, H274Y (e.g. DQ250165) and N294S (e.g. EF222322) in the N1 subtype of
neuraminidase, and R292K (e.g. AY643089) and
E119G/D/A/V (e.g. EU429720) in the N2 subtype of neuraminidase.
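A mutation name such as S31N encodes the original residue, the position and the substituted residue. Checking a protein for a single substitution of this form can be sketched in Python as below. The sketch assumes the position is a plain 1-based index into the sequence supplied; published mutation names may follow a reference numbering scheme, in which case the sequence must first be mapped to that numbering. The function name and the toy sequences are invented for illustration.

```python
import re


def has_mutation(protein, mutation):
    """True if the protein carries the substituted residue at the position.

    mutation: a single substitution such as 'S31N' (original residue,
    1-based position, new residue).
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if not match:
        raise ValueError(f"not a single-substitution name: {mutation}")
    position = int(match.group(2))
    new_residue = match.group(3)
    if position < 1 or position > len(protein):
        return False
    return protein[position - 1] == new_residue


# Toy sequences: 30 alanines followed by the residue of interest.
print(has_mutation("A" * 30 + "N" + "A" * 10, "S31N"))
print(has_mutation("A" * 30 + "S" + "A" * 10, "S31N"))
```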
Other mutation detection
The signature mutation, E627K, in the PB2 protein (e.g. AY651719) that might confer high virulence of influenza viruses
will be detected and reported.
Instructions
To use the tool, simply add one or multiple nucleotide sequences in FASTA format into the sequence box. Sequences can also be
imported from a file by clicking the "Browse" button. After the "Annotate FASTA" button is clicked, feature tables separated by a line of
equal signs for each input sequence are shown in a separate window. A message showing the predicted segment, and subtype for the hemagglutinin and neuraminidase
segments will also be displayed. Warning messages will be shown along with the feature table if the input sequence does not have a start/stop codon or
contains ambiguous bases. If frameshifts are found in the coding regions, or a stop codon is introduced within the coding region because of a
mutation, no feature table will be produced and an error message will be shown instead, indicating the nature (insertion, deletion or mutation), the
length and the location of the error. Other output formats
(GenBank flat file, ASN.1, XML, protein FASTA and alignment) can be selected and shown in the browser or saved to files.
This annotation tool uses published influenza protein sequences as training sets. There are chances that it will not work as expected for
some new sequence variations. Please report such cases to us so we can improve this tool.
How to cite the annotation tool
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Research. 2007 Jul 1;35(Web Server issue):W280-4.
|
FTP | |
Data in the NCBI Influenza Virus Sequence Database are available through ftp. The ftp directory contains the following
files and the corresponding compressed versions, which are updated
every day:
genomeset.dat - Table with supplementary genomeset data
influenza_na.dat - Table with supplementary nucleotide data
influenza_aa.dat - Table with supplementary protein data
influenza.dat - Table with nucleotide, protein and coding regions IDs
influenza.fna - FASTA nucleotide
influenza.cds - FASTA coding regions
influenza.faa - FASTA protein
The genomeset.dat contains information for sequences of viruses with a complete set of segments in full-length (or nearly
full-length). Those of the same virus are grouped together and separated by an empty line from those of other viruses.
The genomeset.dat, influenza_na.dat and influenza_aa.dat files are tab-delimited tables which have the following fields:
GenBank accession number, Host, Genome segment number, Subtype, Country, Year, Sequence length, Virus name, Age, Gender. The influenza_na.dat and
influenza_aa.dat files have an additional field in the last column to indicate if a sequence is full-length.
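Given the field order listed above, a downloaded influenza_na.dat or influenza_aa.dat file can be read with Python's standard `csv` module. The sketch below is illustrative only: the short field names are labels chosen for this example, the sample row is invented, and it assumes the file has no header line and marks completeness with a short code in the last column.

```python
import csv
import io

# Labels chosen for this example, in the documented column order.
FIELDS = [
    "accession", "host", "segment", "subtype", "country", "year",
    "length", "virus_name", "age", "gender", "completeness",
]


def read_table(handle):
    """Yield one dict per non-empty row of a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip empty separator lines
        yield dict(zip(FIELDS, row))


# Invented row for illustration; real files should be opened with open().
sample = io.StringIO(
    "XX000001\tHuman\t4\tH3N2\tUSA\t2007\t1701\t"
    "Influenza A virus (A/Example/1/2007(H3N2))\t\t\tc\n"
)
rows = list(read_table(sample))
print(rows[0]["accession"], rows[0]["subtype"], rows[0]["completeness"])
```

Because genomeset.dat lacks the last column, `zip` simply leaves the `completeness` key out of the resulting dictionaries for that file.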
The influenza.dat file is a tab-delimited table which has the following fields:
GenBank accession number for nucleotide, GenBank accession number for protein, Identifier for protein coding region
A directory named "updates" contains daily updates for all of the above listed files in subdirectories for each date.
A directory named "ANNOTATION" contains reference sequences used in
the Influenza Virus Sequence Annotation Tool. The file blastDB.fasta has one representative sequence for each type/segment/subtype of
influenza viruses A, B and C, and it is used to build a specialized BLAST database for the determination of type/segment/subtype of input influenza
virus sequences. The PROTEIN-A, PROTEIN-B and PROTEIN-C subdirectories each contain sample protein and mature peptide sequences used to annotate
user-provided sequences.
|