Library of Congress Web Archives: Technical Information


More information about current national and international partnerships and about web capture efforts can be found at


The Web sites were harvested by the Internet Archive. The harvesting depth varies according to the specifications of the curator. Information about the technical environment and tools used for harvesting web sites is available at

Search Component and Record Contents

Archived Web sites were cataloged using the Metadata Object Description Schema (MODS). Keyword, title, and subject metadata were extracted from the archived Web sites to create preliminary MODS records. Catalogers then reviewed and/or enhanced these records, assigning controlled subjects from the Library of Congress Subject Headings (LCSH) or the Thesaurus for Graphic Materials (TGM). A Lucene search interface was developed to search the MODS records both within and across the archived collections.
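
The idea of searching record metadata within one collection or across all of them can be sketched as follows. This is a toy inverted index in plain Python, not the Lucene interface described above; the record IDs, collection names, and field names are hypothetical stand-ins for MODS title and subject data.

```python
from collections import defaultdict

# Hypothetical, simplified stand-ins for MODS records (title and subjects only).
RECORDS = [
    {"id": "rec1", "collection": "Collection A",
     "title": "Campaign Web Site",
     "subjects": ["Elections", "Political campaigns"]},
    {"id": "rec2", "collection": "Collection B",
     "title": "Relief Agency Home Page",
     "subjects": ["Disaster relief"]},
    {"id": "rec3", "collection": "Collection B",
     "title": "Elections Commentary Weblog",
     "subjects": ["Elections", "Blogs"]},
]


def build_index(records):
    """Map each lowercase token in a record's title/subjects to record IDs."""
    index = defaultdict(set)
    for rec in records:
        text = " ".join([rec["title"], *rec["subjects"]])
        for token in text.lower().split():
            index[token].add(rec["id"])
    return index


def search(index, records, query, collection=None):
    """Return sorted IDs of records matching every query token.

    If `collection` is given, restrict hits to that collection;
    otherwise search across all collections.
    """
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    by_id = {rec["id"]: rec for rec in records}
    return sorted(
        rid for rid in hits
        if collection is None or by_id[rid]["collection"] == collection
    )
```

Calling search(index, RECORDS, "elections") looks across both collections, while passing collection="Collection B" narrows the same query to one collection.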


In addition, a MARC record for each collection is available in the Library of Congress Online Catalog so that the collection can be found along with other Library materials in the catalog.

Metadata included in collection-level records in the Library of Congress Online Catalog:

245    $a Collection title $h [electronic resource].
520    $a General description of the collection content, the number of Web sites, and the date range
             when Web sites were captured
6XX    $a Collection-level subject heading (usually several 6XX fields)
856    $a (link to the collection Overview page)
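
As a rough illustration of the field layout above, the following sketch renders a collection-level record as display lines. The function name, the sample values, and the choice of 650 as the specific 6XX field are assumptions made for the example; it produces the textual layout shown here, not an actual MARC file.

```python
def collection_record_lines(title, description, subjects, overview_url):
    """Render the collection-level fields listed above as display lines."""
    lines = [
        f"245    $a {title} $h [electronic resource].",
        f"520    $a {description}",
    ]
    # One subject field per heading; 650 is used here as an example 6XX field.
    lines += [f"650    $a {heading}" for heading in subjects]
    lines.append(f"856    $a {overview_url}")
    return lines


if __name__ == "__main__":
    # All values below are placeholders, not real catalog data.
    for line in collection_record_lines(
        "Sample Collection",
        "Web sites captured for illustration.",
        ["Topic one", "Topic two"],
        "https://example.org/overview",
    ):
        print(line)
```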

Web site level:

MODS data included in record for each archived Web site:

<titleInfo><title> - Title extracted by system from HTML title tag (when available) and reviewed by cataloger, otherwise supplied by cataloger
<titleInfo type="alternative"><title> - Alternative Title supplied by cataloger if different and useful.

<name type="personal"><namePart> - Name of Web site creator in inverted order; supplied by cataloger
<name type="corporate"><namePart> - Corporate Name of Web site creator; supplied by cataloger

<typeOfResource> - "text"; supplied by system

<genre> - "Web site"; supplied by system

ORIGIN INFO (A single site may have multiple captures--the first and last dates of capture are recorded)
    <dateCaptured encoding="iso8601" point="start">
            - Date of first capture of site; extracted by system from site
    <dateCaptured encoding="iso8601" point="end">
            - Date of last capture of site; extracted by system from site
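
A minimal sketch of how the first and last capture dates might be derived from a list of capture times and written as the two dateCaptured elements. The originInfo wrapper follows the MODS schema; the function name and the date-only ISO 8601 form (YYYY-MM-DD) are assumptions, since the exact date format used in the records is not stated here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def origin_info(capture_times):
    """Build an originInfo element holding the first and last capture dates."""
    first, last = min(capture_times), max(capture_times)
    origin = ET.Element("originInfo")
    for point, moment in (("start", first), ("end", last)):
        el = ET.SubElement(origin, "dateCaptured",
                           encoding="iso8601", point=point)
        # Assumption: date-only ISO 8601 text, e.g. 2004-07-01.
        el.text = moment.date().isoformat()
    return origin


if __name__ == "__main__":
    # Placeholder capture times for one site with three captures.
    captures = [datetime(2004, 9, 3), datetime(2004, 7, 1),
                datetime(2004, 11, 30)]
    print(ET.tostring(origin_info(captures), encoding="unicode"))
```

Only the earliest and latest captures are kept, which matches the note that a site with multiple captures records just the first and last dates.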

LANGUAGE (languageTerm repeated for languages as needed)
    <languageTerm authority="iso639-2b" type="code"> - 3 letter code supplied by cataloger

PHYSICAL DESCRIPTION (internetMediaType repeated for types as needed)
<internetMediaType> - MIME type; supplied by system

<abstract> - Extracted by the system from the META name="description" tag in archived Web site (when available); reviewed and/or edited by cataloger

<note type="system details">
- A note that records the URL of the Web site at the time of capture; supplied by system

SUBJECT (Subject repeated for subject headings and keywords as needed)
<subject authority="lcsh"> - Collection-level and Web site specific (item-level) LCSH headings; supplied
    by cataloger (Collection-level headings are the same as in the collection-level record in the LC Online Catalog)
<subject authority="lctgm"> - Collection-level and Web site specific (item-level) TGM headings; supplied
    by cataloger (Collection-level headings are the same as in the collection-level record in the LC Online Catalog)
<subject authority="local"> - Subjects assigned by cataloger
<subject authority="keyword"> - Subject keywords extracted from META name="keywords" tag in archived
    Web site (when available); reviewed, augmented, and/or edited by cataloger
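
The four subject authorities can be separated when reading a record, for example to distinguish controlled headings from extracted keywords. The sketch below assumes the records use the standard MODS version 3 namespace and that each subject wraps its terms in child elements such as topic; the sample record is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Invented sample fragment; real records carry many more elements.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <subject authority="lcsh"><topic>Elections</topic></subject>
  <subject authority="keyword"><topic>campaign</topic></subject>
  <subject authority="keyword"><topic>candidates</topic></subject>
</mods>"""


def subjects_by_authority(mods_xml):
    """Group subject headings by the value of their authority attribute."""
    root = ET.fromstring(mods_xml)
    grouped = defaultdict(list)
    for subject in root.findall("mods:subject", MODS_NS):
        authority = subject.get("authority", "unspecified")
        # Join the text of each child term into a single heading string.
        heading = "--".join(
            child.text.strip() for child in subject if child.text
        )
        grouped[authority].append(heading)
    return dict(grouped)


if __name__ == "__main__":
    print(subjects_by_authority(SAMPLE))
```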

RELATED ITEM (Contains the collection title and the persistent ID for the collection)
<relatedItem type="host">
     <titleInfo><title> - Collection Title; supplied by system
     <location><url> - Persistent ID for the collection, e.g.,
                        that resolves to the collection Overview page; supplied by system

IDENTIFIER (Contains the Resource ID for the Web site for single sites and for the resource page for a site with multiple captures)
<identifier> - Resolvable persistent identifier for the archived Web site at the Library of Congress; supplied by system

<location><url usage="primary display">
- Resolvable persistent identifier for archived Web site; supplied by system

<accessCondition> - Rights/permissions information; supplied by system

<recordInfo>
     <recordCreationDate encoding="iso8601"> - Record creation date; supplied by system
     <recordIdentifier source="dlc"> - Identifier for the MODS record; supplied by system
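
To show how several of the system-supplied elements listed above fit together, here is a minimal sketch that assembles a skeletal Web site record. The function name and all argument values are hypothetical, the recordInfo wrapper follows the MODS schema, and namespaces and most elements are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_site_record(title, site_url, record_id):
    """Assemble a skeletal MODS-style record for one archived Web site."""
    mods = ET.Element("mods")

    title_info = ET.SubElement(mods, "titleInfo")
    ET.SubElement(title_info, "title").text = title

    # Fixed values that the field list says the system supplies.
    ET.SubElement(mods, "typeOfResource").text = "text"
    ET.SubElement(mods, "genre").text = "Web site"

    # URL of the Web site at the time of capture.
    note = ET.SubElement(mods, "note", type="system details")
    note.text = site_url

    record_info = ET.SubElement(mods, "recordInfo")
    identifier = ET.SubElement(record_info, "recordIdentifier", source="dlc")
    identifier.text = record_id
    return mods


if __name__ == "__main__":
    # Placeholder values, not taken from any real record.
    record = build_site_record("Sample Site", "http://example.org/", "sample-0001")
    print(ET.tostring(record, encoding="unicode"))
```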

  August 5, 2011