Web Archiving: Technical Information

Tools

The Library and its partners are developing a common set of web archiving tools in four areas: selection and permissions; acquisition; storage and maintenance; and access.

The Library is using open-source software as the foundation for some of these tools.

  • Selection and Permissions: The Library has developed and implemented the DigiBoard, a tool that allows nominators to select websites to be archived. The tool also streamlines the nomination, permissions, and quality review processes. Future development is planned to support other processes.
  • Acquisition: The Library uses the Heritrix web crawler; a simplified sketch of the capture step follows this list.
  • Access: The Library provides access to its archived content with the Wayback tool; a sketch of the timestamp lookup behind that kind of replay also follows.
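
Heritrix itself is a Java application driven by its own crawl configuration files, so a full crawl setup is beyond a short example. As a rough illustration of what acquisition produces, the Python sketch below fetches one page and writes it to a WARC file, the container format Heritrix emits. It is a minimal sketch, not the Library's actual pipeline; it assumes the third-party requests and warcio packages, and the target URL and output file name are placeholders.

    from io import BytesIO

    import requests
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    URL = "http://example.com/"  # placeholder target

    # Fetch the page, keeping the server's HTTP response headers.
    resp = requests.get(URL, timeout=30)

    with open("capture.warc.gz", "wb") as output:  # placeholder file name
        writer = WARCWriter(output, gzip=True)

        # Preserve the HTTP headers exactly as the server delivered them.
        http_headers = StatusAndHeaders(
            "{} {}".format(resp.status_code, resp.reason),
            list(resp.headers.items()),
            protocol="HTTP/1.1",
        )

        # The record carries its own capture metadata (WARC-Date,
        # WARC-Target-URI), so nothing depends on file names or
        # filesystem timestamps.
        record = writer.create_warc_record(
            URL,
            "response",
            payload=BytesIO(resp.content),
            http_headers=http_headers,
        )
        writer.write_record(record)

Writing the headers verbatim into the record, rather than normalizing them, is what keeps the stored copy exactly as delivered.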
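
Wayback-style replay resolves a requested URL and timestamp against an index of captures and serves the closest match. The sketch below is a hypothetical, in-memory version of that lookup, invented for illustration; it is not Wayback's actual code, and the index contents are made up.

    from bisect import bisect_left

    # Hypothetical CDX-style index: sorted capture timestamps
    # (YYYYMMDDhhmmss) for each archived URL.
    CAPTURES = {
        "http://example.com/": [
            "20080115093000",
            "20090602110000",
            "20100320154500",
        ],
    }

    def closest_capture(url, timestamp):
        """Return the capture timestamp closest to the requested one."""
        stamps = CAPTURES.get(url)
        if not stamps:
            return None
        i = bisect_left(stamps, timestamp)  # fixed-width stamps sort correctly
        # Compare the neighbors on either side of the insertion point.
        candidates = stamps[max(i - 1, 0):i + 1]
        return min(candidates, key=lambda s: abs(int(s) - int(timestamp)))

    # A request for mid-2009 resolves to the June 2009 capture.
    print(closest_capture("http://example.com/", "20090701000000"))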

The technical environment for our acquisition, access, and storage is based on open platforms such as Linux.

Goals for the Web Archiving Process

The Library's goals for capturing, storing, and preserving web content during the archiving process are to:

  • Retrieve all code, images, documents, and other files essential to reproduce the site as completely as possible.
  • Capture and preserve all technical metadata from the web server (e.g. HTTP headers) and the crawler (e.g. context of capture, date and time stamp, and crawl conditions). Date information is especially important for distinguishing repeated crawl versions of the same site (see the readback sketch after this list).
  • Store the content exactly as delivered. HTML and other code are always left intact; modifications needed for access are never applied to the stored copy.
  • Maintain platform and file system independence. We do not use file system features such as naming or timestamps to record technical metadata.
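
Because every WARC record carries its own WARC-Date and WARC-Target-URI headers, repeated crawls of the same site can be told apart without consulting file names or filesystem timestamps. A minimal readback sketch, again assuming the warcio package and the placeholder file written above:

    from warcio.archiveiterator import ArchiveIterator

    # List every capture in the archive, keyed by the WARC-Date header
    # rather than by any file system metadata.
    with open("capture.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                uri = record.rec_headers.get_header("WARC-Target-URI")
                date = record.rec_headers.get_header("WARC-Date")
                print(date, uri)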