Help, Influenza virus resource

Influenza Virus Resource presents data obtained from the NIAID Influenza Genome Sequencing Project as well as from GenBank, combined with tools for flu sequence analysis and annotation. In addition, it provides links to other resources that contain flu sequences, publications and general information about flu viruses.

Help Document

The NCBI Influenza Virus Sequence Database
Genome Set
Alignment
Clustering and phylogenetic analysis
Sequence annotation
FTP

The NCBI Influenza Virus Sequence Database

Nucleotide or protein sequences can be searched by adding a comma or space separated list of GenBank accession numbers or uploading a text file containing such a list under the "Get sequences by accession" section. The sequences can be added to the "Query builder" or shown directly by clicking the "Add query" or "Show results" buttons.
To search the database using other terms, first decide whether you would like to search for protein sequences, their coding regions, or nucleotide sequences, by checking the radio buttons to the left of the sequence type names.
The "Search for keyword" section allows users to search for sequences by 1) a word or phrase in virus strain names (e.g. New York); 2) a pattern in nucleotide or protein sequences (e.g. AGCGAAAGCAGGGGT or RSKV); and 3) drug-resistance mutations in protein sequences (e.g. S31N or H274Y). A list of mutations annotated in the database can be found here.
In the "Define search set" section, select one or more names from each of the lists provided (hold the Ctrl or Shift key to select several), and/or fill in the boxes. The fields are virus type (e.g. Influenza virus A, B or C), Host (e.g. Human or Avian), Country/Region (e.g. Australia or Asia), Year (or a range of years) in which viruses were isolated, Segment (1 through 8) or protein name (e.g. PB1-F2 or M1), Subtype (e.g. H3N2 or H5), and a range of sequence lengths.
You can limit your search results to full-length sequences by checking the appropriate boxes. "Full-length only" applies to sequences that have complete coding regions including start and stop codons; they are labelled "c" (for complete) in the database query result. "Full-length plus" applies to all "Full-length only" sequences, plus those missing only start and/or stop codons, which are labelled "nc" (for nearly complete) in the database query result. Partial sequences are labelled "p" in the database query result.
Month and day can be added in addition to year. Please note that not all sequences have month and day available. Therefore sequences with only year as collection date will not be included in a search if a month of the corresponding year is entered in the query. For example, a search for sequences from 2006/05 to 2008/11 will retrieve those with month in collection date for 2006 and 2008, but not those with only 2006 or 2008 as collection date (because they could be from 2006/04 or 2008/12). However, all sequences from 2007, with or without month in the collection date, will be included in such a query. Check the boxes next to "Month" or "Day" under "Collection date must contain" if one wants to retrieve only sequences with month or day in the collection date.
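The matching rule described above can be read as an interval test: a collection date that gives only a year (or only a year and month) covers a span of days, and a sequence is retrieved only when that whole span falls inside the queried range. The following is a minimal sketch of that reading, not the database's actual implementation; the function names `date_bounds` and `in_range` are hypothetical.

```python
import calendar
from datetime import date

def date_bounds(text):
    """Return the (earliest, latest) day covered by 'YYYY', 'YYYY/MM' or 'YYYY/MM/DD'."""
    parts = [int(p) for p in text.split("/")]
    year = parts[0]
    if len(parts) == 1:
        return date(year, 1, 1), date(year, 12, 31)
    month = parts[1]
    if len(parts) == 2:
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    exact = date(year, month, parts[2])
    return exact, exact

def in_range(collection_date, query_from, query_to):
    """True only if every day the collection date could mean lies in the query range."""
    earliest, latest = date_bounds(collection_date)
    start = date_bounds(query_from)[0]
    end = date_bounds(query_to)[1]
    return start <= earliest and latest <= end
```

With the query 2006/05 to 2008/11 from the example above, `in_range("2007", "2006/05", "2008/11")` is true, while a record dated only "2006" or only "2008" is rejected because part of that year lies outside the range.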
Released date is the date when a sequence first appeared in GenBank.
Check boxes next to the segment/protein names under "Required segments" to retrieve sequences defined in the "Segment/Protein" field when all of the selected segments of the same virus isolate exist in the database. Check the "Full-length only" box in this section if the required segments must be full-length.
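As an illustration of the "Required segments" filter, the sketch below keeps only records from isolates for which every required segment is present. It assumes records are simple (strain, segment number, accession) tuples held in a list; that layout, the function name and the sample values in the usage note are hypothetical, not the database's schema.

```python
from collections import defaultdict

def isolates_with_required_segments(records, required):
    """Keep records whose isolate has all segments listed in `required`.

    records: list of (strain, segment_number, accession) tuples.
    """
    segments_by_strain = defaultdict(set)
    for strain, segment, _accession in records:
        segments_by_strain[strain].add(segment)
    required = set(required)
    # An isolate qualifies when the required set is a subset of its segments.
    return [r for r in records if required <= segments_by_strain[r[0]]]
```

For example, requiring segments 4 and 6 drops an isolate for which only segment 4 is present, even though that segment on its own matches the "Segment/Protein" field.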
From a drop-down menu next to "Pandemic (H1N1) viruses" (also known as the swine flu outbreak), you can include, exclude or retrieve only these sequences in your search results. Newly released sequences can be retrieved from the database by defining the GenBank release date. For example, A(H1N1)pdm09 virus sequences released in GenBank between June 30 and July 6, 2010 can be retrieved using this database query.
From a drop-down menu next to "The FLU project", you can include, exclude or retrieve only these sequences in your search results. Sequences from the FLU project are those submitted to GenBank through a streamlined GenBank submission pipeline. These are mostly from large-scale flu genome sequencing projects, which usually provide complete genomes, detailed source information and high-quality annotations. Currently, the major contributors are the NIAID Influenza Genome Sequencing Project, the St. Jude Influenza Genome Project, the Centers for Disease Control and Prevention, Centers of Excellence for Influenza Research and Surveillance (CEIRS), and the University of Hong Kong.
Sequences of reassortants or lab strains (those flagged as "LAB" in the country field) are excluded from the search by default; use the drop-down menu next to "Lab strains" if you want to include them or retrieve only those sequences.
From a drop-down menu next to "Vaccine strains", you can include, exclude or retrieve only sequences of WHO recommended vaccine strains in your search results.
From a drop-down menu next to "Lineage defining strains", you can include, exclude or retrieve only sequences of prototype viruses of well defined lineages/clades. Currently, this includes those for Influenza B viruses (Victoria and Yamagata), and the H5N1 and H9N2 subtypes of Influenza A viruses.
By checking the box next to "Collapse identical sequences", all groups of identical sequences in a dataset will be represented by the oldest sequence in the group. This will reduce the number of sequences in some cases by keeping only unique sequences in a dataset.
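A minimal sketch of collapsing identical sequences is shown below. It assumes that "oldest" means the earliest collection year and that records are (accession, year, sequence) tuples; both are assumptions made for illustration, and the function name is hypothetical.

```python
def collapse_identical(records):
    """Represent each group of identical sequences by its oldest member.

    records: list of (accession, collection_year, sequence) tuples.
    Returns (accession, year, sequence, group_size) tuples, one per unique sequence.
    """
    groups = {}
    for accession, year, sequence in records:
        groups.setdefault(sequence, []).append((year, accession))
    collapsed = []
    for sequence, members in groups.items():
        year, accession = min(members)  # earliest year; ties broken by accession
        collapsed.append((accession, year, sequence, len(members)))
    return collapsed
```

The group size returned here plays the role of the "#" column in the query result, which shows how many identical sequences each representative stands for.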
After clicking the "Add Query" button, the query you selected and the number of resulting sequences will be shown in "Query Builder". If "any" is selected in "Virus Species" and/or "Segment", a warning message will be shown, and the "Multiple sequence alignment" and "Tree building" functionalities will not be allowed in the subsequent steps when the resulting dataset contains sequences from different virus species and/or different segments. A sample query page can be found here.
Multiple queries can be built by repeating the above steps. When a different "Virus Species" and/or "Segment" is selected in the new query, the same warning message described above will be shown, and the "Multiple sequence alignment" and "Tree building" functionalities will not be allowed in the subsequent steps if the resulting dataset contains sequences from different virus species and/or different segments. When a different sequence type (i.e. Protein sequence, Coding region or Nucleotide sequence) is selected for the new query, a pop-up window will ask whether you would like to start a new query with the new sequence type (which will clear the current "Query Builder") or continue with the current sequence type by going back to the current query builder. This prevents mixing different sequence types in the same "Query Builder" (e.g. protein sequences with nucleotide sequences). Queries in any combination from the "Query Builder" can be selected to get sequences from the database.
Sequences found by the selected queries will be shown in a separate window once you click the "Show results" button. By default, the sequences are ordered by virus name. They can be reordered by up to three fields sequentially, by holding the Ctrl or Shift key while clicking on field headers. A sample results page can be found here.
Sequences of interest can be selected by checking the boxes to the left of accession numbers. When "Collapse identical sequences" is selected in query, the numbers of identical sequences in the collapsed groups are shown in the column "#". These groups can be expanded by clicking the numbers, and sequences within the groups can be selected to be included in the dataset as well.
The corresponding protein, coding region or nucleotide sequences of the selected sequences can be downloaded by selecting the appropriate name in the "Download results" drop-down menu. To meet the need of different users, the definition line of the FASTA sequences in the downloaded files can be customized by clicking "Customize FASTA defline". The default defline is in the format of ">{accession} {strain} {year}/{month}/{day} {segname}" (e.g. >ADA83577 A/Argentina/HNRG13/2009 2009/06/05 PB2), but you are able to add any fields by clicking the ones listed, or remove any by deleting them from the Defline editing box. A space is inserted between fields by default, but it can be replaced with other characters by typing in the editing box. When the "Remember changes" box is checked, the defline format you defined will be remembered and used in all subsequent downloads, until it is reset or cookies are deleted in the browser. A list of GenBank accession numbers for selected protein or nucleotide sequences, and a table of the search result in XML, CSV or tab-delimited format can also be downloaded from the "Download results" menu.
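The default defline format maps naturally onto a string template. The sketch below reuses the field names from the default format shown above and fills them with Python's `str.format`; the function name `make_defline` and the dictionary layout are assumptions for illustration, not part of the download tool.

```python
DEFAULT_TEMPLATE = ">{accession} {strain} {year}/{month}/{day} {segname}"

def make_defline(record, template=DEFAULT_TEMPLATE):
    """Build a FASTA definition line by filling the template fields from a record."""
    return template.format(**record)

# Values taken from the example defline in the text above.
example = {
    "accession": "ADA83577",
    "strain": "A/Argentina/HNRG13/2009",
    "year": "2009",
    "month": "06",
    "day": "05",
    "segname": "PB2",
}
```

Calling `make_defline(example)` reproduces the example defline, and passing a different template such as `">{strain}|{segname}"` mimics removing fields and replacing the default space separator.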
Further analysis of the selected sequences can be performed by clicking the "Do multiple alignment" or "Build a tree" button, if these are allowed (i.e. the dataset does not mix species and/or segments). Your own sequences (of the same sequence type, in FASTA format) can be added to the selected sequences for analysis by clicking the "Add your own sequences" button. The file of added sequences cannot exceed 128 KB in size.

Genome Set

The Influenza Virus Genome Set Tool displays nucleotide sequences obtained from the NCBI Influenza Virus Sequence Database ordered by genome segments for each virus. All segments of the same virus are grouped together in the same background color, alternating in light blue and white. Genomes of the same virus isolate but sequenced in different labs are identified in the database, and are grouped separately based on the sequence submitters. This tool is a convenient way to check the completeness of genome segments for viruses of interest.
Database searches can be performed as described above. By default, this tool retrieves only viruses with a complete set of full-length segments (or full-length plus, if the check box next to "Complete plus" is selected). To get all viruses with any number of sequences, select the radio button next to "All" in the "Show results" box. The results are shown in descending order of the number of segments each virus has.

Alignment

Multiple alignments of nucleotide or protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file can be obtained using the MUSCLE program. Start the alignment by selecting the Alignment button in the top horizontal bar. This will open a database query interface similar to the one described above. Please follow the instructions for database queries and be sure to select sequences from the same segment of the genome, preferably of similar sizes.
At most 1,000 sequences can be included in an alignment. For datasets larger than 1,000 sequences, it is recommended to download the sequences using the download tool of the database and run the multiple sequence alignment using a locally installed program (e.g. MUSCLE).
After sequences of interest are selected from the database and/or added from an input file, click the "Do multiple alignment" button to get the alignment. The consensus sequence is displayed at the top of the alignment; residues identical to the consensus are shown as dots, and gaps are shown as dashes. In the coding region alignment, non-synonymous changes (in triplets) are highlighted in a different background color. The alignment can be navigated horizontally either by typing the position you would like the sequences to start from in the text box after "Go to position" and clicking "Go", or by moving the scroll bar at the bottom of the alignment.
Clicking a sequence in the alignment opens a small pop-up window. The GenBank record for the sequence can be opened by clicking the accession number in the pop-up window. The sequence can also be selected for BLAST 2 Sequences (click the "BLAST 2 seq." button after two different sequences are selected from the alignment). Choosing the "Select for anchor" option in the pop-up window replaces the consensus sequence with the selected sequence. Clicking the anchor sequence opens a small window with options: the anchor sequence can be reset to the consensus sequence, and the anchor/consensus sequence can be displayed for copying.
The multiple alignment file in FASTA format can be downloaded by selecting "Download alignment". A printer-friendly version of the alignment can be obtained by clicking the "Print-friendly version" button. If desired, click the "Build a tree" button to build a tree from the aligned sequences.
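The dot-and-dash display can be illustrated with a short sketch. It assumes a simple majority rule for the consensus (the rule the tool actually uses is not stated here) and treats an optional anchor sequence as a replacement for the consensus; the function names are hypothetical.

```python
from collections import Counter

def consensus(aligned):
    """Most common character in each alignment column (simple majority rule)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned))

def dot_view(aligned, anchor=None):
    """Show residues identical to the reference as dots; gaps stay as dashes."""
    reference = anchor if anchor is not None else consensus(aligned)
    return [
        "".join("." if a == r and a != "-" else a for a, r in zip(sequence, reference))
        for sequence in aligned
    ]
```

For the toy alignment `["ACGT", "ACGA", "AC-T"]` the consensus is `ACGT`, and the view becomes `....`, `...A` and `..-.`.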

Clustering and phylogenetic analysis

Scope
The interactive tool DatasetExplorer is part of the NCBI Influenza Virus Resource and provides an easy way to perform preliminary analysis on nucleotide and protein sequences from the NCBI Influenza Virus Sequence Database and/or a user's input file. Datasets are visually represented using phylogenetic/clustering trees. Users can select the algorithm used for building a tree as well as the similarity criterion.

Overview of the Methodology
Start the tool by clicking the "Tree" button in the top horizontal bar. Sequences are acquired from the NCBI Influenza Virus Sequence Database or uploaded by a user as described above. After a dataset has been selected, the sequences are aligned using a multiple alignment algorithm, in order to identify common regions in the sequences and establish correspondence between sequence columns (we perform multiple protein alignment, while alignment of the nucleotide sequences for the coding regions is induced by the protein alignment). Distances between sequences are calculated based on their dissimilarity in a selected region of the alignment, and analysis is performed. We offer visualization based on phylogenetic and clustering tree methods: the classical neighbor-joining method and agglomerative hierarchical clustering methods.

Alignment of protein sequences is performed using the protein multiple alignment tool MUSCLE. We offer different distance measures for calculating pairwise distances between sequences. In particular, we use some of the distances implemented in the PHYLIP package, as well as the mPAM weight matrix.

Sequence Alignment
The tool performs multiple protein alignments using the MUSCLE program and creates the nucleotide alignment of the corresponding coding regions from the protein alignment by using codon-amino acid correspondence.
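Deriving a coding-region alignment from a protein alignment can be sketched as follows: each aligned residue is replaced by its codon, and each gap by three dashes. This is a minimal illustration of codon-amino acid correspondence, not the tool's code; the function name is hypothetical and the sketch assumes the coding sequence starts in frame with the first residue.

```python
def codon_alignment(aligned_protein, coding_sequence):
    """Thread codons onto an aligned protein; each gap becomes '---'."""
    codons = [coding_sequence[i:i + 3] for i in range(0, len(coding_sequence), 3)]
    residues = sum(1 for ch in aligned_protein if ch != "-")
    if residues > len(codons):
        raise ValueError("coding sequence is too short for the protein")
    next_codon = iter(codons)
    # A trailing stop codon, if present, is simply left unused.
    return "".join("---" if ch == "-" else next(next_codon) for ch in aligned_protein)
```

For example, the aligned protein `M-K` with coding sequence `ATGAAA` yields `ATG---AAA`, so columns in the nucleotide alignment stay in register with the protein alignment.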

After sequences are obtained from the NCBI Influenza Virus Sequence Database and/or a user's input file, click the "Build a tree" button in the database query results page to start the process. This opens a window with a graphic view of the multiple sequence alignment.

Sequence Region Selection
The graphic view of the multiple alignment of the sequences selected in the previous step is displayed. The black and red colors in the graphics represent the presence and absence of amino acid residues at the corresponding positions. For each sequence, the positions of its first and last amino acids relative to the longest sequence in the selected set are shown. A histogram showing the total number of amino acid residues at each position is displayed at the top of the page. The program automatically selects the sequence region to be analyzed so that the majority of the sequences in the set will be included. Users can also define the sequence region by first selecting all sequences in the set and then entering the start and end positions in the boxes provided. When the "Select sequences" button is clicked, the region is selected from sequences that have complete coverage between the two positions, and sequences excluded from the selection are highlighted with a background color in the graphic view.
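Region selection can be sketched as a coverage test on the aligned sequences. The sketch assumes that "complete coverage" means the first residue of a sequence lies at or before the start position and its last residue at or after the end position, with 1-based inclusive alignment columns; this interpretation and the function names are assumptions.

```python
def covers(sequence, start, end):
    """True if the sequence has residues at or before `start` and at or after `end`."""
    columns = [i for i, ch in enumerate(sequence, start=1) if ch != "-"]
    return bool(columns) and columns[0] <= start and columns[-1] >= end

def select_region(aligned, start, end):
    """Split aligned sequences into the selected region slices and the excluded names.

    aligned: dict mapping a sequence name to its aligned (gapped) sequence.
    """
    selected = {}
    excluded = []
    for name, sequence in aligned.items():
        if covers(sequence, start, end):
            selected[name] = sequence[start - 1:end]
        else:
            excluded.append(name)
    return selected, excluded
```

With `{"a": "MKTLV", "b": "--TLV", "c": "MKT--"}` and the region 2 to 4, only `a` is selected; `b` starts too late and `c` ends too early, so both would be highlighted as excluded.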

Phylogenetic/Clustering Tree
A clustering or phylogenetic tree can be built by selecting one of the clustering algorithms and a distance calculation method from the list, and clicking the "Next step" button.

Sequences of interest can be highlighted in the tree, and they can be selected or deselected using the check boxes to the right of each sequence.

Distance methods approximating minimum evolution

Neighbor-Joining - At each step, the pair of nodes $i$, $j$ with the smallest value of $D_{ij} - b_i - b_j$ is chosen, where $D_{ij}$ is the distance between nodes $i$ and $j$, $n$ is the current number of nodes, and $b_i = \sum_{k} D_{ik} / (n-2)$. The distance between the new node $u$ and each remaining node $k$ is defined as $D_{uk} = (D_{ik} + D_{jk} - D_{ij})/2$. Branch lengths are defined as $v_{ui} = (D_{ij} + b_i - b_j)/2$ and $v_{uj} = (D_{ij} + b_j - b_i)/2$ (negative lengths are truncated to zero).
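One neighbor-joining step, as defined above, can be sketched in a few lines. The sketch takes b_i to be the sum of distances from node i to all other nodes, divided by n - 2, and uses a plain list-of-lists distance matrix; the function names are hypothetical and this is an illustration of the formulas, not the tool's implementation.

```python
def nj_pick_pair(D):
    """Return (i, j, b) where the pair (i, j) minimizes D[i][j] - b[i] - b[j].

    D is a symmetric distance matrix (list of lists) with at least 3 nodes.
    """
    n = len(D)
    b = [sum(row) / (n - 2) for row in D]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            score = D[i][j] - b[i] - b[j]
            if best is None or score < best[0]:
                best = (score, i, j)
    return best[1], best[2], b

def nj_join(D, i, j, b):
    """Branch lengths to the new node u and distances from u to the remaining nodes."""
    v_ui = max(0.0, (D[i][j] + b[i] - b[j]) / 2)  # negative lengths truncated to zero
    v_uj = max(0.0, (D[i][j] + b[j] - b[i]) / 2)
    d_uk = {k: (D[i][k] + D[j][k] - D[i][j]) / 2
            for k in range(len(D)) if k not in (i, j)}
    return v_ui, v_uj, d_uk
```

Repeating these two calls on the reduced matrix, with the joined pair replaced by the new node, builds the full tree.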

Agglomerative hierarchical clustering methods

Method (alternative name) - distance between clusters defined as:
Average Linkage (UPGMA) - Average distance between pairs of objects, one in one cluster, one in another
Complete Linkage (Furthest Neighbor) - Maximum distance between pairs of objects, one in one cluster, one in another
Single Linkage (Nearest Neighbor) - Minimum distance between pairs of objects, one in one cluster, one in another
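The three linkage rules differ only in how the pairwise distances between two clusters are summarized. A minimal sketch, assuming a list-of-lists distance matrix and clusters given as lists of row indices (the function name and the `method` labels are hypothetical):

```python
def cluster_distance(D, cluster_a, cluster_b, method="average"):
    """Distance between two clusters under average, complete or single linkage."""
    pair_distances = [D[i][j] for i in cluster_a for j in cluster_b]
    if method == "average":   # UPGMA
        return sum(pair_distances) / len(pair_distances)
    if method == "complete":  # furthest neighbor
        return max(pair_distances)
    if method == "single":    # nearest neighbor
        return min(pair_distances)
    raise ValueError(f"unknown linkage method: {method}")
```

Agglomerative clustering repeatedly merges the two clusters with the smallest such distance until a single cluster remains.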

Protein and Nucleotide Distances
We offer different distance measures for calculating nucleotide and protein pairwise sequence distances: the Felsenstein F84 distance and the Hamming distance for nucleotide sequences; the Dayhoff PAM matrix, the JTT matrix model, the PMB model, and Kimura's approximation for protein sequences, as implemented in the PHYLIP package; and the mPAM weight matrix for protein sequences.

Tree Modification
An adaptive approach is used to visualize the tree in an aggregated form adapted to the user's screen, allowing users to interactively refine or aggregate the visualization of different parts of the tree (see a paper for details). A branch of the tree can be selected by clicking its root node, and the resolution of the selected branch can be changed by moving along the scale bar. The GenBank accession numbers of the amino acid sequences in the selected branch of a tree can be exported by clicking the "Download accessions" button under the scale bar. Sequences on the tree can be searched by the fields in the database, and the resulting sequences or groups will be highlighted in green.

Tree Export
The complete tree can be exported in the Newick format by clicking the "Download full tree" button. The downloaded tree can be displayed by many tree-viewing programs.

Sequence annotation

The Influenza Virus Sequence Annotation Tool is a web application for user-provided Influenza A virus, Influenza B virus and Influenza C virus sequences. It can predict protein sequences encoded by a flu sequence and produce a feature table that can be used for sequence submission to GenBank, as well as a GenBank flat file.
The type/segment/subtype of an input influenza sequence is first determined by BLAST, and the sequence is then aligned against a corresponding sample protein set with a "Protein to nucleotide alignment tool" (ProSplign). The translated product from the best alignment to the sample protein sequence is used as the predicted protein encoded by the input sequence.
Type/segment/subtype identification
An input sequence is searched by BLAST against a specialized influenza sequences database to determine the virus type (A, B or C), segment (1 through 8) and subtype for the hemagglutinin and neuraminidase segments of Influenza A virus. The database contains one reference sequence for each virus segment and each subtype of the hemagglutinin and neuraminidase (available here). The top hit in the BLAST result is used to determine the virus type/segment/subtype of the input sequence.
Sample protein sequences
Representatives of published protein and mature peptide sequences for each virus segment and different subtypes for the hemagglutinin and neuraminidase segments of Influenza A virus are maintained on the server side (available in the PROTEIN-A, PROTEIN-B and PROTEIN-C directories located here). For the segments that encode proteins with large variations in amino acid sequences and mature peptide cleavage sites, more than one protein could be chosen to be included. For example, this collection currently has 16 different protein samples for hemagglutinin of Influenza A virus. Based on the segment and subtype determined by the BLAST result, a subset of sample protein sequences is selected and aligned against the input sequence.
Protein to nucleotide alignment
A special global protein-to-nucleotide alignment tool, ProSplign, was designed to accurately annotate spliced genes and mature peptides of influenza viruses. ProSplign also handles input sequences with insertions and/or deletions which may cause a frame shift in the coding region.
Interpreting alignment result and creating outputs
A successful protein-to-nucleotide alignment should pass the following criteria:
1) The input sequence should start with a correct start codon (or span the beginning of input sequence in case of partial 5' end)
2) The input sequence should end with one of the stop codons (or span the end of input sequence in case of partial 3' end)
3) The input sequence should have no frameshifts or internal stop codons
4) The number of exon(s) must be correct (2 for the second protein of segments 7 and 8 of Influenza A virus and segment 8 of Influenza B virus, 1 exon for all other segments/proteins)
If an alignment passes all four criteria above, the tool adopts the translated protein from the alignment as the protein prediction. Positions of the start, stop, splice sites (if present) and mature peptide are taken from the alignment. If an alignment fails any of the criteria, the tool continues by aligning the next sample protein from the reference subset. If none of the sample proteins produces an alignment that passes, the best aligned sample protein (with the highest alignment score) will be used to generate an error report.
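The first three criteria can be illustrated on an already extracted, single-exon coding sequence. The sketch below is a deliberate simplification: it checks a complete nucleotide coding region directly rather than a ProSplign alignment, it does not handle partial ends or splicing, and the function name and messages are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(cds):
    """Return a list of problems found in a complete, unspliced coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3 (possible frameshift)")
    # Split into whole codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("missing start codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing stop codon")
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        problems.append("internal stop codon")
    return problems
```

An empty list corresponds to a sequence that passes these simplified checks; any entry would lead either to a warning or to an error report in the real tool, depending on the criterion.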
The first output of a successful annotation is a feature table, which is a five-column, tab-delimited table of feature locations and qualifiers. The tool also creates the ASN.1, XML and GenBank formatted views of the same annotation, using the following NCBI developed utilities: tbl2asn and asn2xml.
Drug resistance prediction
The most common signature mutations that might confer drug resistance can also be detected and reported by this tool. Such mutations include L26F (e.g. CY009837), V27A (e.g. DQ186974), A30T (e.g. EU263348), S31N (e.g. DQ107508) and G34E (e.g. L25818) in the M2 protein, H274Y (e.g. DQ250165) and N294S (e.g. EF222322) in the N1 subtype of neuraminidase, and R292K (e.g. AY643089) and E119G/D/A/V (e.g. EU429720) in the N2 subtype of neuraminidase.
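A mutation name such as S31N encodes the wild-type residue, the position and the mutant residue. The sketch below checks whether a protein carries the mutant residue at that position. It assumes the position indexes directly into the supplied sequence (1-based), which will not hold where a mutation is named in a different numbering scheme, and it does not handle multi-variant names such as E119G/D/A/V; the function name is hypothetical.

```python
import re

def has_mutation(protein, mutation):
    """True if `protein` has the mutant residue named by e.g. 'S31N' at that position."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mutation)
    if match is None:
        raise ValueError(f"unsupported mutation name: {mutation}")
    position = int(match.group(2))
    mutant = match.group(3)
    # Positions are 1-based; a sequence shorter than the position cannot match.
    return len(protein) >= position and protein[position - 1] == mutant
```

A real implementation would first align the input protein to a reference so that positions are comparable across sequences of different lengths.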
Other mutation detection
The signature mutation, E627K, in the PB2 protein (e.g. AY651719) that might confer high virulence of influenza viruses will be detected and reported.
Instructions
To use the tool, add one or more nucleotide sequences in FASTA format to the sequence box. Sequences can also be imported from a file by clicking the "Browse" button. After the "Annotate FASTA" button is clicked, the feature tables for the input sequences, separated by lines of equal signs, are shown in a separate window. A message showing the predicted segment, and the subtype for the hemagglutinin and neuraminidase segments, will also be displayed. Warning messages will be shown along with the feature table if the input sequence does not have a start/stop codon or contains ambiguous bases. If frameshifts are found in the coding regions, or a mutation introduces a stop codon within the coding region, no feature table will be produced and an error message will be shown instead, indicating the nature (insertion, deletion or mutation), the length and the location of the error. Other output formats (GenBank flat file, ASN.1, XML, protein FASTA and alignment) can be selected and shown in the browser or saved to files.
This annotation tool uses published influenza protein sequences as training sets. There are chances that it will not work as expected for some new sequence variations. Please report such cases to us so we can improve this tool.
How to cite the annotation tool
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Research. 2007 Jul 1;35(Web Server issue):W280-4.

FTP

Data in the NCBI Influenza Virus Sequence Database are available through FTP. The FTP directory contains the following files, together with corresponding compressed versions, which are updated every day:

genomeset.dat - Table with supplementary genomeset data
influenza_na.dat - Table with supplementary nucleotide data
influenza_aa.dat - Table with supplementary protein data
influenza.dat - Table with nucleotide, protein and coding regions IDs
influenza.fna - FASTA nucleotide
influenza.cds - FASTA coding regions
influenza.faa - FASTA protein

The genomeset.dat contains information for sequences of viruses with a complete set of segments in full-length (or nearly full-length). Those of the same virus are grouped together and separated by an empty line from those of other viruses.
The genomeset.dat, influenza_na.dat and influenza_aa.dat files are tab-delimited tables which have the following fields:
GenBank accession number, Host, Genome segment number, Subtype, Country, Year, Sequence length, Virus name, Age, Gender. The influenza_na.dat and influenza_aa.dat files have an additional field in the last column to indicate if a sequence is full-length.
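Reading one of these tab-delimited tables can be sketched with Python's standard `csv` module. The column keys below are shorthand chosen for this example and follow the field order listed above; the assumption that the files have no header row, and the sample line in the usage note, are hypothetical.

```python
import csv
import io

FIELDS = ["accession", "host", "segment", "subtype", "country", "year",
          "length", "name", "age", "gender", "completeness"]

def read_influenza_table(handle):
    """Parse a tab-delimited table into dicts, skipping empty separator lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # zip() stops at the shorter side, so genomeset.dat rows without the
    # final full-length column simply lack the "completeness" key.
    return [dict(zip(FIELDS, row)) for row in reader if row]

# Made-up sample line for illustration only; not a real database record.
sample = io.StringIO("XX000001\tHuman\t4\tH3N2\tUSA\t2007\t1701\tInfluenza A virus (example)\t\t\tc\n")
rows = read_influenza_table(sample)
```

The same function can be applied to an opened influenza_na.dat or influenza_aa.dat file; for genomeset.dat, the empty lines that separate viruses are skipped, so grouping by virus would need a separate pass.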
The influenza.dat file is a tab-delimited table which has the following fields:
GenBank accession number for nucleotide, GenBank accession number for protein, Identifier for protein coding region.
A directory named "updates" contains daily updates for all of the above listed files in subdirectories for each date.
A directory named "ANNOTATION" contains reference sequences used in the Influenza Virus Sequence Annotation Tool. The file blastDB.fasta has one representative sequence for each type/segment/subtype of influenza viruses A, B and C, and it is used to build a specialized BLAST database for the determination of type/segment/subtype of input influenza virus sequences. The PROTEIN-A, PROTEIN-B and PROTEIN-C subdirectories each contains sample protein and mature peptide sequences used to annotate user-provided sequences.
