NCBI

AGP Specification v2.0

Introduction:

What it is: Describes the assembly of a larger sequence object from smaller objects. The large object can be a contig, a scaffold (supercontig), or a chromosome. Each line (row) of the AGP file describes a different piece of the object, and has the column entries defined below. Extended comments follow.

What it is not: neither a description of how sequence reads were assembled, nor a description of the alignments between components used to construct a larger object. Not all of the information in proprietary assembly files can be represented in the AGP format. It is also not for recording the spans of features like repeats or genes.

Changes from v1.1 to v2.0:

This version supersedes version 1.1 of the AGP file specification. The changes are:

Definitions:

Contig:
a non-redundant sequence formed by joining, based on sequence overlap, one or more smaller sequences. The smaller sequences are typically sequences that have been submitted to the International Sequence Database Collaboration (GenBank/EMBL/DDBJ). There should be no gaps in a sequence contig (although there may be short runs of Ns due to ambiguous base calls).
Scaffold (supercontig):
a non-redundant sequence formed by joining one or more sequence contigs. The distinction is that no sequence overlap is required to construct the larger sequence. Additional information, such as clone end analysis, can support the relationship. There can be, and typically there are, gaps in a scaffold.
Gap:
a subregion within an object where there is no known sequence. Generally represented as a series of the letter ‘N’.
Component:
a sequence used to construct a larger sequence.

File Format:

One feature of the AGP file is that column definitions change depending on whether the line is a component line or a gap line. There is a single column definition up to column 5, then each column will have two definitions, depending on the value in column 5.
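The way the meaning of columns 6 to 9 switches on column 5 can be sketched in a few lines of Python. This is a minimal illustration, not a reference parser: it assumes tab-delimited fields and exactly nine columns per line, the names `ComponentLine`, `GapLine` and `parse_agp_line` are hypothetical, and the sample identifiers used with it (such as `ctg1.1`) are invented.

```python
from typing import NamedTuple, Union

# Column 5 values that mark a gap line; anything else is a component line.
GAP_COMPONENT_TYPES = {"N", "U"}


class ComponentLine(NamedTuple):
    obj: str             # column 1
    object_beg: int      # column 2
    object_end: int      # column 3
    part_number: int     # column 4
    component_type: str  # column 5
    component_id: str    # column 6a
    component_beg: int   # column 7a
    component_end: int   # column 8a
    orientation: str     # column 9a


class GapLine(NamedTuple):
    obj: str               # column 1
    object_beg: int        # column 2
    object_end: int        # column 3
    part_number: int       # column 4
    component_type: str    # column 5
    gap_length: int        # column 6b
    gap_type: str          # column 7b
    linkage: str           # column 8b
    linkage_evidence: str  # column 9b


def parse_agp_line(line: str) -> Union[ComponentLine, GapLine]:
    """Split one AGP line and interpret columns 6-9 according to column 5."""
    fields = line.rstrip("\n").split("\t")  # assumption: tab-delimited
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    obj, beg, end, part, ctype = fields[:5]
    common = (obj, int(beg), int(end), int(part), ctype)
    if ctype in GAP_COMPONENT_TYPES:
        return GapLine(*common, int(fields[5]), fields[6], fields[7], fields[8])
    return ComponentLine(*common, fields[5], int(fields[6]), int(fields[7]), fields[8])
```

For example, `parse_agp_line("chr1\t1001\t1100\t2\tU\t100\tscaffold\tyes\tpaired-ends")` returns a `GapLine`, while a line whose fifth column is `W` returns a `ComponentLine`. Using two record types keeps each column's a/b meaning explicit rather than overloading one field name.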

AGP File Format
column content description
1 object This is the identifier for the object being assembled. This can be a chromosome, scaffold or contig. If an accession.version identifier is not used to describe the object, the naming convention is to precede chromosome numbers with ‘chr’ (e.g. chr1) and linkage group numbers with ‘LG’ (e.g. LG3). Contigs or scaffolds may have any identifier that is unique within the assembly.
2 object_beg The starting coordinate of the component/gap on the object in column 1. This is the location in the object’s coordinate system, not the component’s.
3 object_end The ending coordinate of the component/gap on the object in column 1. This is the location in the object’s coordinate system, not the component’s.
4 part_number The line count for the components/gaps that make up the object described in column 1.
5 component_type The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are:
A
Active Finishing
D
Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword).
F
Finished HTG (phase3)
G
Whole Genome Finishing
O
Other sequence (typically means no HTG keyword)
P
Pre Draft
W
WGS contig
N
gap with specified size
U
gap of unknown size, defaulting to 100 bases.
6a component_id If column 5 not equal to N or U: This is a unique identifier for the sequence component contributing to the object described in column 1. Ideally this will be a valid accession.version identifier as assigned by GenBank/EMBL/DDBJ. If the sequence has not yet been submitted to a public repository, a local identifier should be used.
6b gap_length If column 5 equal to N or U: This column represents the length of the gap.
N type gaps can be of any length. A length of 100 must be used for all U type gaps.
7a component_beg If column 5 not equal to N or U: This column specifies the beginning of the part of the component sequence that contributes to the object in column 1 (in component coordinates).
7b gap_type

If column 5 equal to N or U: This column specifies the gap type.

Accepted values:

scaffold:
a gap between two sequence contigs in a scaffold (superscaffold or ultra-scaffold).
contig:
an unspanned gap between two sequence contigs.
centromere:
a gap inserted for the centromere.
short_arm:
a gap inserted at the start of an acrocentric chromosome.
heterochromatin:
a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere).
telomere:
a gap inserted for the telomere.
repeat:
an unresolvable repeat.
8a component_end If column 5 not equal to N or U: This column specifies the end of the part of the component that contributes to the object in column 1 (in component coordinates).
8b linkage

If column 5 equal to N or U: This column indicates if there is evidence of linkage between the adjacent lines.

Values:

  • yes
  • no
9a orientation

If column 5 not equal to N or U: This column specifies the orientation of the component relative to the object in column 1.

Values:

+
plus
-
minus
?
unknown
0 (zero)
unknown (deprecated)
na
irrelevant

By default, components with unknown orientation (?, 0 or na) are treated as if they had + orientation.

9b linkage_evidence If column 5 equal to N or U: This specifies the type of evidence used to assert linkage (as indicated in column 8b). Accepted values:
na
used when no linkage is being asserted (column 8b is ‘no’)
paired-ends
paired sequences from the two ends of a DNA fragment.
align_genus
alignment to a reference genome within the same genus.
align_xgenus
alignment to a reference genome within another genus.
align_trnscpt
alignment to a transcript from the same species.
within_clone
sequence on both sides of the gap is derived from the same clone, but the gap is not spanned by paired-ends. The adjacent sequence contigs have unknown order and orientation.
clone_contig
linkage is provided by a clone contig in the tiling path (TPF). For example, a gap where there is a known clone, but there is not yet sequence for that clone.
map
linkage asserted using a non-sequence based map such as RH, linkage, fingerprint or optical.
strobe
strobe sequencing (PacBio).
unspecified
used only when converting old AGPs that lack a field for linkage evidence into the new format.

If there are multiple lines of evidence to support linkage, all can be listed using a ‘;’ delimiter (e.g. paired-ends;align_xgenus).
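A check of column 9b against the accepted values, including the ‘;’ delimiter, might look like the sketch below. The function name `check_linkage_evidence` is hypothetical. The rule that a linkage of ‘yes’ should not be paired with ‘na’ is one reading of the definition of ‘na’ above rather than an explicit statement in this specification.

```python
from typing import List

# Accepted column 9b values, as listed in the column definitions above.
LINKAGE_EVIDENCE = frozenset({
    "na", "paired-ends", "align_genus", "align_xgenus", "align_trnscpt",
    "within_clone", "clone_contig", "map", "strobe", "unspecified",
})


def check_linkage_evidence(linkage: str, evidence: str) -> List[str]:
    """Return a list of problems found in column 9b; an empty list means none."""
    problems = []
    terms = evidence.split(";")  # multiple lines of evidence are ';'-delimited
    for term in terms:
        if term not in LINKAGE_EVIDENCE:
            problems.append(f"unknown linkage evidence: {term!r}")
    # 'na' is used when no linkage is asserted (column 8b is 'no').
    if linkage == "no" and terms != ["na"]:
        problems.append("linkage 'no' expects linkage evidence 'na'")
    # Assumption: asserting linkage requires at least one real evidence type.
    if linkage == "yes" and "na" in terms:
        problems.append("linkage 'yes' should not use linkage evidence 'na'")
    return problems
```

Returning a list of messages instead of raising on the first problem lets a caller report every issue on a line at once.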

Extended comments:

Describing breaks and continuity:

Information about continuity is provided by the combination of gap_type (column 7b) and linkage (column 8b), which together describe how the object is built. The first version of this specification did not define how to use these columns, so their use has diverged. Below is a proposal for how this information should be encoded.

Gap_type Linkage Interpretation and description
Within-scaffold gaps: sequences on either side of the gap are in a single scaffold.
scaffold yes Do not break scaffold
There is evidence linking sequence contigs on both sides of the gap.
repeat yes Do not break scaffold
If an unresolvable repeat unit is spanned by linkage evidence, the linkage will be ‘yes’.
Scaffold-breaking gaps: sequences on either side of the gap are in separate scaffolds.
contig no Break scaffold
A contig gap indicates there is no evidence to link the adjacent sequence contigs.
repeat no Break scaffold
If an unresolvable repeat unit is not spanned by linkage evidence, the linkage will be ‘no’.
centromere/ short_arm/ heterochromatin/ telomere no Break scaffold
Gaps with these biological types are used for laying out scaffolds along a chromosome.
Invalid gap/linkage combinations
contig yes Invalid
If there is evidence of linkage between the adjacent sequence contigs, the gap type should be scaffold.
scaffold no Invalid
If there is no evidence of linkage between the adjacent sequence contigs, the gap type should be contig.
centromere/ short_arm/ heterochromatin/ telomere yes Invalid
It is invalid to use these biological types within a scaffold.
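The table above can be expressed as a small lookup. The following is a sketch only: the function name `interpret_gap` and the returned strings are hypothetical labels chosen for illustration, while the rules themselves are taken directly from the table.

```python
# Gap types used for laying out scaffolds along a chromosome.
BIOLOGICAL_GAP_TYPES = {"centromere", "short_arm", "heterochromatin", "telomere"}


def interpret_gap(gap_type: str, linkage: str) -> str:
    """Map a (gap_type, linkage) pair to 'keep', 'break' or 'invalid'.

    'keep' means do not break the scaffold; 'break' means the sequences on
    either side of the gap are in separate scaffolds.
    """
    if linkage not in ("yes", "no"):
        raise ValueError(f"linkage must be 'yes' or 'no', got {linkage!r}")
    linked = linkage == "yes"
    if gap_type == "scaffold":
        return "keep" if linked else "invalid"   # unlinked should be 'contig'
    if gap_type == "repeat":
        return "keep" if linked else "break"     # both combinations are valid
    if gap_type == "contig":
        return "invalid" if linked else "break"  # linked should be 'scaffold'
    if gap_type in BIOLOGICAL_GAP_TYPES:
        return "invalid" if linked else "break"  # never valid within a scaffold
    raise ValueError(f"unknown gap type: {gap_type!r}")
```

For instance, `interpret_gap("repeat", "no")` gives `"break"`, whereas `interpret_gap("contig", "yes")` gives `"invalid"`.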

Describing scaffolds with unknown orientation:

Scaffolds can sometimes be positioned along a chromosome or linkage group without there being sufficient data to orient the scaffold. Such placed but unoriented scaffolds can be indicated in an AGP that specifies how a chromosome or linkage group is assembled from scaffolds by using ‘?’ in the orientation column (9a) (see the example “chromosome from scaffolds”). It is not appropriate to use an orientation of ‘?’ in an AGP that specifies how a chromosome is assembled from components, except for any components that are not scaffolded to other components (singletons). Using an orientation of ‘?’ for all the components in a multi-component scaffold is misleading because it implies that each component lies at the position indicated but could be in either orientation. Depending on the orientation of the scaffold, however, the components in an unoriented multi-component scaffold either lie at the indicated position in the ‘+’ orientation (the default) or at a different position in the ‘-’ orientation. The preferred method to indicate that scaffolds have been placed but their orientation is unknown is to provide two AGP files, the first that builds scaffolds from components and the second that builds chromosomes from scaffolds. The unknown orientation of a scaffold would be indicated in the chromosome-from-scaffold AGP file with a ‘?’.
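One way to detect the misleading case in a chromosome-from-components AGP is sketched below. It rests on assumptions: the rows of a single object are supplied in file order, already reduced to simple tuples, and a gap with linkage ‘no’ is taken as the boundary between scaffolds, following the scaffold-breaking rules above. The function name `misleading_unknown_orientations` and the tuple layout are hypothetical, and only ‘?’ is checked, not the deprecated ‘0’.

```python
from typing import List, Sequence, Tuple


def misleading_unknown_orientations(rows: Sequence[Tuple[str, ...]]) -> List[str]:
    """Return ids of components marked '?' inside multi-component scaffolds.

    Each row is either ("component", component_id, orientation) or
    ("gap", linkage). Singleton components may legitimately use '?'.
    """
    flagged: List[str] = []
    current: List[Tuple[str, str]] = []  # components of the scaffold in progress

    def close_scaffold() -> None:
        # Only scaffolds with more than one component are a problem.
        if len(current) > 1:
            flagged.extend(cid for cid, orient in current if orient == "?")
        current.clear()

    for row in rows:
        if row[0] == "gap":
            if row[1] == "no":  # scaffold-breaking gap
                close_scaffold()
        else:
            current.append((row[1], row[2]))
    close_scaffold()  # the last scaffold has no trailing gap
    return flagged
```

Given two linked components followed by a scaffold-breaking gap and a singleton, all marked ‘?’, only the two linked components would be reported.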

Validation:

File structure needs to be validated in the following ways:

File content needs to be validated in the following ways:

Examples:

December 27, 2011