The First American West: The Ohio River Valley 1750-1820


Building the Digital Collection

The University of Chicago Library, with the Filson Historical Society of Louisville, Kentucky, received an award in the 1998/99 round of the Library of Congress/Ameritech National Digital Library Competition. During the course of the project, approximately 18,000 images were scanned. This represents a total of 272 books and pamphlets, 63 broadsides and other ephemera, 404 manuscripts, 17 maps, and 28 prints. All materials were scanned by project staff at the University of Chicago and the Filson Historical Society. Selected texts were transcribed and marked up in SGML. Bibliographic records and the images and text of the digital reproductions were transmitted for mounting at the Library of Congress.

For more detail on different aspects of building this digital collection, see the sections below.


Digitization: Creating Images

A variety of flatbed and overhead scanners were used, depending on the characteristics of the original materials. Almost all items were scanned directly from the originals, although photographic intermediaries (4" x 5" transparencies) were used for the Audubon plates and for several other oversized maps and manuscript materials that could not easily fit on a flatbed scanner.

The original items selected for this project varied greatly in size, original format, and quality. Scanning specifications were chosen to meet the requirements of each different type of original. The overall intent of the scanning was to capture materials at a resolution that would promote detailed on-screen study of the scans. In general, printed books and pamphlets were scanned in 8-bit grayscale. All manuscripts and broadsides were scanned in 24-bit color, and maps and prints with color information in the original were also scanned in color. Every effort was made to preserve the look of the original documents. Therefore, the digital images may show discolorations, folds, tears, faded inks, and bleedthrough of the ink. All images are best viewed on a monitor that is set to "True Color" or "16777216 Colors" and at a display resolution of 1024 x 768 pixels or greater.
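
To give a sense of what these specifications mean for storage, the uncompressed size of a scan follows directly from the page dimensions, the resolution, and the bit depth. The sketch below is illustrative only: the 6 x 9 inch page is an assumed example, not a measurement from the collection, and the function name is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Approximate size of an uncompressed raster scan, ignoring file headers."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: a 6 x 9 inch page (not a figure from the project).
gray_300 = uncompressed_bytes(6, 9, 300, 8)    # 8-bit grayscale at 300 dpi
color_600 = uncompressed_bytes(6, 9, 600, 24)  # 24-bit color at 600 dpi

print(gray_300, color_600)
```

Under these assumptions, the same page takes twelve times as much space in 24-bit color at 600 dpi as in 8-bit grayscale at 300 dpi: four times from doubling the resolution in both directions, and three times from the extra color channels.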

Below, find further details of scanning methods and specifications used for images for the different forms of material:

Images for Printed Books and Pamphlets
Master files At Chicago, bound books and pamphlets were scanned using both Epson Expression 1640XL flatbed scanners and a Minolta PS 3000 with a grayscale board. Books from the General Collection and pamphlets that opened flat were scanned on the flatbed scanners at 8-bit grayscale at 300 dpi or 400 dpi (depending on the quality and size of the printed text) using Lasersoft Silverfast 5 and saved as uncompressed TIFF files. Image files were cropped at the page edge and edited using Adobe Photoshop 5.5.

In order to minimize stress to fragile bindings during scanning, Special Collections bound books were scanned face-up on the Minolta PS 3000 at 8-bit grayscale 400 dpi and saved as high-quality JPEG files using ISE-Scan software. The option to save these files as uncompressed TIFF files in grayscale was not available. It was somewhat difficult to get good-quality scans of some of the texts scanned on the Minolta, due to faint ink, waviness of the paper in the original, or tight bindings, all of which caused a blurry scan and necessitated re-scanning. Image files were cropped at the page edge and edited in Adobe Photoshop.

At the Filson Historical Society, all books and pamphlets were scanned on an Epson Expression 836XL flatbed scanner at 8-bit grayscale 300 dpi, except for color plates and title pages from rare or significant texts, which were scanned in 24-bit color. All files were saved as uncompressed TIFF files and edited in Adobe Photoshop.

Derivative files Derivative versions were created using Equilibrium DeBabelizer Pro 4.5. Two JPEG versions of each scan were created for viewing the books and pamphlets online. A screen-sized reference version, scaled down to between 500 and 640 pixels in width, provides sufficient legibility without excessive load time. The exact pixel width chosen depended on the size of the book and the legibility of the text. An unscaled, more detailed version allows for close inspection of detail.
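
The scaling step for the reference version can be sketched as a simple proportional calculation. This is a minimal illustration, not the DeBabelizer batch settings used by the project: the function name is hypothetical, and the rule of never enlarging an image that is already narrower than the target is an assumption.

```python
def reference_size(master_w, master_h, target_w):
    """Return (width, height) scaled to target_w, preserving aspect ratio.

    Assumption: images already narrower than the target are left unscaled.
    """
    if master_w <= target_w:
        return master_w, master_h
    scale = target_w / master_w
    return target_w, round(master_h * scale)


# Assumed example: a master image of 1800 x 2700 pixels reduced to 600 wide.
print(reference_size(1800, 2700, 600))
```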


Images for Manuscripts and Broadsides
Master files At Chicago, all manuscripts and broadsides were scanned on the Epson Expression 1640XL flatbed scanners in 24-bit color using Lasersoft Silverfast 5. Broadsides were scanned at 400 dpi and manuscripts were scanned at 600 dpi in order to capture the detail and characteristics of the ink, the handwriting, and the texture of the paper. Images were saved as uncompressed TIFF files and edited in Adobe Photoshop 5.5. A slight border was left around all manuscript and broadside scans so that the whole artifact can be viewed.

At the Filson Historical Society, all manuscripts and broadsides were scanned on an Epson Expression 836XL flatbed scanner in 24-bit color 300 dpi and saved as uncompressed TIFF files.

Derivative files Three JPEG versions of each scan have been created for online viewing of the manuscripts and broadsides. For convenience while searching for relevant items, a relatively small "page-turner" version permits scanning of an item without excessive load time. A scaled yet larger version provides enhanced legibility and an unscaled, high-resolution version supports close inspection of detail. Images were reduced in resolution, sharpened, and adjusted in tonal value if necessary using Adobe Photoshop's batch feature. The larger images often require horizontal scrolling to view the entire scan. In addition, a small GIF thumbnail of the first image of each item has been created for the "gallery view" offered by the Library of Congress as an alternative view for search results.

Images for Maps
Master files All maps were scanned on Epson Expression flatbed scanners in color or grayscale. Maps were scanned at 300 dpi and saved as uncompressed TIFF files. In cases where the map was slightly too large for the scanner bed, it was scanned in two sections and then stitched together in Adobe Photoshop.
Derivative files Maps were delivered to LC in SID format, produced with MrSID (Multi-Resolution Seamless Image Database), a wavelet-based image compression program, so that users can zoom in on details. Files were compressed at an average ratio of 22:1. A small GIF file was also generated for use as the initial thumbnail display of the item.

Images for Prints
Master files All prints were scanned in 24-bit color at 600 dpi in order to capture fine line detail in the originals, and the transparencies of prints were scanned in color at 1200 dpi using an Epson Expression 1640XL flatbed scanner with a transparency adapter. Images were saved as uncompressed TIFF files and edited in Adobe Photoshop 5.5.
Derivative files Four versions of each print have been created for online viewing. A small GIF thumbnail is used for the "gallery view" offered by the Library of Congress as an alternative view for search results. A larger GIF thumbnail is displayed with bibliographic information. Two JPEG versions, one scaled and one unscaled, support full-screen viewing and close inspection of detail. Images were reduced in resolution, sharpened, and adjusted in tonal value if necessary using Adobe Photoshop's batch feature.

Each master image was put through a quality control process in which staff checked for tonal fidelity, cropping, orientation, skew, and missing pages. In order to facilitate management of the files and to record administrative metadata for each file, a project database was constructed using Microsoft Access. All master image files for this project were burned to CD and two copies of each CD were produced. In addition, all files were archived onto tape using a robotic tape archiving and storage system at the University of Chicago.

Before the creation of derivative files, each item scanned for the project was assigned a unique identifier that indicated whether the item was from Chicago or Filson; whether it was a book/pamphlet, manuscript, broadside, print, or map; and whether it was a multi-page or a single page item. These identifiers also served as image directory names for the derivative files. Each derivative image file was named consistently within each directory. The files were then transferred using FTP to the Library of Congress for mounting as part of American Memory.
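
An identifier of this kind can be modeled as a small structured value that also yields the directory and file names. The sketch below is hypothetical throughout: the project's actual identifier syntax is not given here, so the codes, the four-digit serial number, and the file naming pattern are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical codes; the real identifier syntax is not documented in this text.
INSTITUTIONS = {"chicago": "c", "filson": "f"}
FORMATS = {"book": "bk", "manuscript": "ms", "broadside": "bs",
           "print": "pr", "map": "mp"}


@dataclass(frozen=True)
class ItemId:
    institution: str   # "chicago" or "filson"
    fmt: str           # one of the keys of FORMATS
    multi_page: bool   # True for multi-page items
    serial: int        # assumed running number within the project

    def directory(self) -> str:
        """Identifier string, also used as the derivative image directory name."""
        pages = "m" if self.multi_page else "s"
        return f"{INSTITUTIONS[self.institution]}{FORMATS[self.fmt]}{pages}{self.serial:04d}"

    def file_name(self, page: int, ext: str = "jpg") -> str:
        """Consistent file name for one page image inside the item directory."""
        return f"{self.directory()}/{page:04d}.{ext}"


example = ItemId("filson", "manuscript", True, 12)
print(example.directory(), example.file_name(3))
```

Keeping the three facts (institution, format, page count) in the identifier itself means a file path alone is enough to tell what kind of item it belongs to.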


Digitization: Transcriptions

A total of 6,225 pages were transcribed (representing 31 printed books and articles and 74 manuscript items), and both the full text and the page images are presented online via a page-turning interface developed at LC.

An outside vendor, Pacific Data, was selected to perform the transcription of printed materials and encoding. Transcriptions were converted at an accuracy rate of 99.5% and encoded with Standard Generalized Markup Language according to the American Memory DTD. The resulting SGML texts were then inspected by the Library of Congress for compliance with American Memory requirements.

Seventy-four manuscripts from the holdings of the University of Chicago and the Filson Historical Society have been transcribed for The First American West: The Ohio River Valley, 1750-1820. These documents are representative of the diverse subject matter presented in the institutions' joint digital collection. Transcribers at both institutions applied a single set of editorial principles to provide teachers, students, and researchers an accurate rendering of the original documents. Pacific Data added the SGML markup to these transcriptions.

Editing principles for transcription of manuscripts:

  1. Original capitalization, punctuation, spelling, slips of the pen, and syntax are preserved.
  2. Abbreviations, contractions, and ampersands (& and &c.) are retained as written.
  3. Proper names, signatures and marks, and numbers are maintained.
  4. Long s is transcribed as short s.
  5. U and v are transcribed according to appropriate usage.
  6. Tailed p is extended to per, pro, or pre depending on the text.
  7. Thorns (ye yt ym) are transcribed the, that, or them.
  8. Catch words are omitted.
  9. Interlineations are included in the body of text at the point specified by the manuscript's writer.
  10. Superscripts and subscripts have been lowered or raised to the text line.
  11. The number of multiple signatures [56 names] and the presence of seals [seal] have been indicated within brackets.
  12. The appearance of words struck through is retained or is indicated by brackets [one word struck through].
  13. Illegible text is identified by brackets [three words illegible].
  14. Interpolation of missing or damaged text is followed by a question mark (?); if an interpolation is not supplied the appearance of the original is indicated in brackets [words missing].
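
Two of these principles, 4 and 7, are mechanical enough to express as text substitutions. The sketch below is only an illustration of those rules, not the procedure the transcribers followed: the function name is hypothetical, it assumes raised letters have already been lowered (principle 10), and it handles lowercase forms only. A human transcriber would still need to tell the abbreviation "ye" from the pronoun "ye", which a simple substitution cannot do.

```python
import re

# Thorn abbreviations from principle 7 (lowercase only, an assumption).
THORNS = {"ye": "the", "yt": "that", "ym": "them"}


def normalize(text: str) -> str:
    """Apply principles 4 and 7; everything else is left as written."""
    text = text.replace("\u017f", "s")  # principle 4: long s becomes short s
    return re.sub(r"\b(?:ye|yt|ym)\b", lambda m: THORNS[m.group(0)], text)


sample = "ye Congre\u017fs met & yt was all"
print(normalize(sample))
```

Note that the ampersand passes through unchanged, consistent with principle 2.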


Intellectual Access to the Collection

A local Microsoft Access database was constructed based on the Dublin Core element set. The Dublin Core format was chosen as the metadata standard for two reasons. Firstly, for more than 70 percent of the selected materials no MARC records were available. Secondly, not all records were to be created by catalogers. When MARC records were available for bibliographic items they were used as a source for input to the Access database. The database consisted of 13 Dublin Core elements with one additional local element (34 database fields in total to accommodate repeating elements). LCSH and the Art and Architecture Thesaurus were used for subject terms. The records were created in accordance with the University of Chicago's local guidelines for use of Dublin Core.
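
One common way to hold repeating elements in a flat database table is to give each element a fixed number of numbered columns. The sketch below illustrates that idea under stated assumptions; it is not the project's actual table design. The element names, the number of slots per element, and the sample record are all hypothetical.

```python
def flatten(record, slots):
    """Spread repeating elements over numbered columns, padding with ""."""
    row = {}
    for element, count in slots.items():
        values = record.get(element, [])
        if len(values) > count:
            raise ValueError(f"too many values for {element!r}")
        for i in range(count):
            row[f"{element}{i + 1}"] = values[i] if i < len(values) else ""
    return row


# Hypothetical layout: one title, up to two creators, up to three subjects.
SLOTS = {"title": 1, "creator": 2, "subject": 3}

# Made-up sample record for illustration only.
record = {"title": ["Example letter"], "subject": ["Term A", "Term B"]}
row = flatten(record, SLOTS)
print(row)
```

With this layout, three elements occupy six columns, which shows how a small element set can grow to a larger number of database fields.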

For most materials, records describe individual items. Manuscripts and broadsides were cataloged individually, as were maps and most plates. In a few cases, where numerous plates were selected from a single book or other bibliographic unit, the selections were cataloged under the original bibliographic title with notes describing what was scanned. Whenever selections were made from extensive bibliographic materials (i.e. books and codices) notes were added to the records to indicate that the entire unit was not scanned.

The Filson Historical Society and the University of Chicago created descriptions for their own materials locally. The Filson Historical Society sent its completed descriptions to the University of Chicago, where they were entered into the Access database.


Integration into American Memory at the Library of Congress

The University of Chicago delivered image files, transcribed text marked up in the American Memory DTD, and bibliographic records in a Microsoft Access database to the Library of Congress. Naming conventions, appropriate file formats, and acceptable image sizes had been determined jointly. A mapping of the Dublin Core elements to the appropriate record display fields was also coordinated with LC. Each descriptive record linked to the corresponding digitized object using an identifier which referred to a directory so that objects with multiple files could be accommodated. Extensive coordination between digitization and record creation activities was needed to ensure that records and images matched up.

At the Library of Congress, the marked up texts were parsed, indexed, and reviewed extensively. The full text and the MARC records are indexed using Aurora (formerly InQuery), the search engine used for the American Memory service. Transcribed letters and other short items are indexed as units; books are indexed by chapter and automated tables of contents derived for navigation. All non-pictorial items are presented through the American Memory page-turning interface. Maps are presented through the "zoom" view used for maps digitized by the Library of Congress.

