New and Updated Genome Resources at the National Center for Biotechnology Information (NCBI)
An NCBI Workshop at the International Plant and Animal Genome XX Conference, Town and Country Hotel, San Diego, CA
Date: January 17, 2012
Time: 3:50 PM-6:00 PM
Room: Sunset Room
Speakers
Kim D. Pruitt
Primary Data Submission Portal Slides
The availability of DNA and RNA sequence data in archival databases provides critical support to many facets of ongoing research, analysis, and tool and resource development. Submissions of high-throughput datasets, such as RNA-seq or genomic short reads for variation discovery, or of datasets that result from analysis and interpretation of primary data, such as whole-genome assemblies, are frequently associated with highly relevant ancillary files (assembly AGP files, BAM alignment files) or with metadata for the sample or project. NCBI accepts a wide variety of sequence data types plus ancillary files and metadata. The expanded scope of submitted information and the increased complexity of these submissions present challenges to the existing submission infrastructure. To address this, NCBI is developing a Submission Portal that will streamline submission of experimental datasets and associated metadata. The presentation will review which data types should be submitted to which NCBI database, describe current submission routes for traditional and next-generation data, including submissions to the Sequence Read Archive (SRA), and summarize the current status and future plans for the NCBI Submission Portal. Acknowledgement: The work presented reflects the efforts of numerous NCBI archival database support staff.
Tatiana Tatusova
BioProject, Genome, and Assembly Databases Slides
In the last few years, next-generation sequencing technology has had a dramatic impact on genomic data in terms of both quantity and complexity. Several new resources have been created at NCBI to handle genome-associated data: BioProject, Genome, and Genome Assembly. The BioProject database tracks metadata about research initiatives. The Genome resource has been completely redesigned to represent new and complex data types. Major improvements include a more natural organization at the level of the organism for prokaryotic, eukaryotic, and viral genomes. Reports now include information about the availability of nuclear or primary genomes as well as organelles and plasmids. The Genome Assembly resource provides detailed information about each assembly, including statistical reports and history tracking as assemblies are updated over time. All genome-associated resources are cross-linked, allowing easy navigation through the various types of information. Existing and newly developed tools provide new ways to analyze and visualize complex genomic data and their relationships.
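The abstract does not describe programmatic access, so the following is only a minimal sketch of how the cross-linked resources might be queried, assuming NCBI's public E-utilities service and the Entrez database identifiers "bioproject", "genome", and "assembly". The helper names esearch_url and elink_url, the example search term, and the example UID are hypothetical. The code only builds request URLs and makes no network calls; database availability may have changed since the workshop.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that searches one Entrez database for a term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def elink_url(dbfrom, db, uid):
    """Build an ELink URL that follows cross-links from one database to another."""
    query = urlencode({"dbfrom": dbfrom, "db": db, "id": uid})
    return f"{EUTILS_BASE}/elink.fcgi?{query}"


if __name__ == "__main__":
    # Assumed database identifiers for the three resources named in the abstract.
    for db in ("bioproject", "genome", "assembly"):
        print(esearch_url(db, "Zea mays[Organism]"))
    # Hypothetical UID, used only to show the shape of a cross-link request.
    print(elink_url("bioproject", "assembly", "12345"))
```

Fetching any of these URLs would return record identifiers for the search, or linked record identifiers for the cross-link request.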
Françoise Thibaud-Nissen
The Eukaryotic Genome Annotation Pipeline Slides
The NCBI eukaryotic annotation pipeline provides content for various NCBI resources, including sequence and BLAST databases, Gene, and the Map Viewer genome browser. In recent years, the pipeline has been modernized to run efficiently with minimal human involvement. In the first 10 months of 2011 alone, the genomes of 22 organisms were annotated. The pipeline uses a modular framework for the execution of all annotation tasks, from fetching raw and curated data from public repositories (sequence and Assembly databases), to aligning sequences and predicting genes, to submitting the accessioned annotation products to public databases. Core components of the pipeline are alignment programs (Splign and ProSplign) and an HMM-based prediction program (Gnomon), all developed at NCBI. Important features of the pipeline include its flexibility and speed, the tracking of gene loci from one annotation to the next, the ability to annotate multiple assemblies for the same organism in coordination, the different weights given to curated and non-curated evidence, and the production of models that compensate for assembly issues. We will describe the annotation pipeline dataflow and inputs, including how we use 454 RNA sequence data, and ongoing development efforts on the use of shorter RNA-seq data and on quality assessment. We will present the NCBI priorities and interests regarding annotation and describe how the integration of annotation and RefSeq curation provides a current, maintained, quality annotation product.
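The modular dataflow described above (fetch, align, predict, submit) can be illustrated with a toy sketch. This is not the NCBI pipeline: the stage functions, the Evidence and Context classes, the weights, and the score threshold are all hypothetical stand-ins, and none of the code reflects how Splign, ProSplign, or Gnomon work. It only shows the general idea of chaining independent stages and weighting curated evidence above non-curated evidence.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical weights: curated evidence counts more than non-curated evidence.
CURATED_WEIGHT = 2.0
NON_CURATED_WEIGHT = 1.0


@dataclass
class Evidence:
    name: str
    curated: bool
    score: float  # hypothetical alignment quality in the range 0..1


@dataclass
class Context:
    assembly: str
    evidence: List[Evidence] = field(default_factory=list)
    models: Dict[str, float] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


def fetch(ctx: Context) -> Context:
    # Stand-in for fetching raw and curated data from public repositories.
    ctx.evidence = [
        Evidence("curated_mRNA_1", curated=True, score=0.9),
        Evidence("est_1", curated=False, score=0.9),
        Evidence("est_2", curated=False, score=0.2),
    ]
    ctx.log.append("fetch")
    return ctx


def align(ctx: Context) -> Context:
    # Stand-in for alignment: keep only evidence above a hypothetical threshold.
    ctx.evidence = [ev for ev in ctx.evidence if ev.score >= 0.5]
    ctx.log.append("align")
    return ctx


def predict(ctx: Context) -> Context:
    # Stand-in for gene prediction: weight each piece of evidence by its source.
    for ev in ctx.evidence:
        weight = CURATED_WEIGHT if ev.curated else NON_CURATED_WEIGHT
        ctx.models[ev.name] = ev.score * weight
    ctx.log.append("predict")
    return ctx


def submit(ctx: Context) -> Context:
    # Stand-in for submitting annotation products to public databases.
    ctx.log.append("submit")
    return ctx


STAGES: List[Callable[[Context], Context]] = [fetch, align, predict, submit]


def run_pipeline(assembly: str, stages: List[Callable[[Context], Context]]) -> Context:
    """Run each stage in order, passing the shared context along."""
    ctx = Context(assembly)
    for stage in stages:
        ctx = stage(ctx)
    return ctx


if __name__ == "__main__":
    result = run_pipeline("demo_assembly", STAGES)
    print(result.log)
    print(result.models)
```

Because each stage only reads and updates the shared context, a stage can be replaced or rerun without touching the others, which is one plausible reading of what a modular framework buys in flexibility.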
Deanna M. Church
Connecting the Lab to the Genome: CloneDB Slides
The ever-expanding availability of genome sequences from a variety of organisms has transformed the way researchers can approach biological questions. However, there is still a need to associate genomic sequence with physical reagents that can be used to perform experiments. NCBI has developed a resource to facilitate this association for one important reagent, namely clones. It has historically been difficult for researchers to take full advantage of the wealth of information associated with clone types, as data and metadata have been stored in disparate databases. The NCBI Clone DB provides integrated information for both vector-based and cell-based clones, including sequence data, map positions, gene content, and distributor information. Data is available for hundreds of genomic and cell-based clones and libraries from a diverse range of animal and plant taxa. Library browsers permit viewing of high-level information, while library-specific pages contain detailed information on library construction and distributors. The database can be searched by clone name, accession, map position, and features such as genes. Newly developed display pages for individual clones provide an integrated view of the data stored for each clone and link to relevant NCBI resources. Data from Clone DB is currently displayed in the NCBI Map Viewer, which permits the simultaneous viewing of annotation tracks based on different coordinate systems, and in Clone Finder, a tool that permits location-based searches for clones in a graphical interface. We will present examples of clone data and demonstrate the features of the interfaces and associated tools.
Kim D. Pruitt, Tatiana Tatusova, and Karen Clark
Annual Report of Genome Sequencing Projects Slides
This presentation will review the current number, annual growth, status, and data availability of eukaryotic genome sequencing projects.