The Stars and Stripes, 1918-1919: Building the Digital Collection

American Memory | The Stars and Stripes, 1918-1919

Building the Digital Collection

In 1920, the A.E.F. Publishing Association in Minneapolis produced a bound volume containing facsimile reproductions of every page of the World War I edition of The Stars and Stripes. The Library of Congress owns two bound copies of the 1920 facsimile edition and a master negative microfilm copy of that edition. The microfilm was produced from the bound facsimile edition that the Library acquired from Camp Sherman in Chillicothe, Ohio. The majority of the images in the online collection were produced from this microfilm, as part of an experiment by the Serial and Government Publications Division to determine techniques for expanding access to microfilmed newspapers.

Scanning

OCLC Digital & Preservation Resources (Bethlehem, PA) created a print negative from the master negative microfilm produced by the Library of Congress. The print negative was scanned on a SunRise ProScan IV scanner running SunRise's proprietary ScanFlo 3.01 software, and the digital output was captured as a 1-bit, 300 dpi image file with lossless ITU Group IV compression.

To determine the optimal settings, test images were captured at the beginning, the middle, and the end of the reel, using three different contrast and threshold settings. The images were then processed through PipeX software, which ranked the output for each setting. The highest scoring setting was selected to capture the entire reel of film.
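The selection step can be sketched in a few lines of Python. This is only an illustration: the source does not say how PipeX scored images or how scores from the three reel positions were combined, so the setting names, the numeric scores, and the use of a simple mean are all assumptions.

```python
from statistics import mean

def pick_best_setting(scores):
    """Return the setting with the highest mean score across reel positions.

    `scores` maps a setting name to a dict of {reel_position: score}.
    Averaging is an assumption; the source only says the settings were ranked.
    """
    return max(scores, key=lambda setting: mean(scores[setting].values()))

# Hypothetical scores for three contrast/threshold settings, each tested
# at the beginning, middle, and end of the reel.
example_scores = {
    "setting_a": {"start": 71, "middle": 68, "end": 70},
    "setting_b": {"start": 80, "middle": 77, "end": 83},
    "setting_c": {"start": 75, "middle": 79, "end": 74},
}

best = pick_best_setting(example_scores)
print(best)
```

Sampling three positions on the reel guards against choosing a setting that suits only one stretch of film, since exposure and density can vary along a reel.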

The bound volume originally used to produce the microfilm copy had about eighty torn pages, so the microfilmed images of those pages were incomplete. To correct this, a second copy of the 1920 volume, acquired by the Library in the 1990s, was sent to OCLC Digital & Preservation Resources so that the technicians could scan better examples of those pages. These images, created directly from paper, were scanned with a Zeutschel 7000A0 using OmniScan 9.0. The Zeutschel output matched the specifications of the SunRise output: a 1-bit, 300 dpi image file with ITU Group IV compression.

Image Processing

Programmers from the Library of Congress's Information Technology Services (ITS), in consultation with the Library's Office of Strategic Initiatives (OSI), developed an online viewer for the Stars and Stripes digital collection. Because the layout of a newspaper page is important, the viewer needed to display the overall layout of a page alongside a readable, magnified portion of that page. A viewer meeting these requirements was already in use for the digital map collections made available online by the Library of Congress's Geography and Map Division. That viewer was built on wavelet-based image-compression software called Multi-Resolution Seamless Image Database, or "MrSID." Although MrSID is a "lossy" compressor, it integrates multiple resolutions of an image in a single file, which lets users zoom in to view increasing levels of detail. ITS programmers adapted the Library's map viewer for the newspaper format, adding page-turning capabilities.
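The idea of holding several resolutions of one image can be illustrated with a generic resolution pyramid, in which each level halves the dimensions of the one before it. The sketch below is not a description of MrSID's internal file layout, which the source does not give; the page dimensions and the stopping threshold are hypothetical.

```python
def pyramid_levels(width, height, max_side=300):
    """List (width, height) for each level of a halving resolution pyramid.

    Starts at full resolution and halves both sides (rounding up) until the
    longer side is no larger than `max_side`. Generic illustration only.
    """
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    levels = [(width, height)]
    while max(levels[-1]) > max_side:
        w, h = levels[-1]
        levels.append(((w + 1) // 2, (h + 1) // 2))
    return levels

# Hypothetical page scan of 5400 x 7200 pixels (18 x 24 inches at 300 dpi).
levels = pyramid_levels(5400, 7200)
for w, h in levels:
    print(w, h)
```

A viewer built on such a structure can serve a coarse level for the whole-page view and a finer level for the magnified window without storing separate files for each.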

For use with the viewer, the original bitonal TIFF files of The Stars and Stripes were batch-processed with MrSID software, which converted them to grayscale and compressed them at a ratio of 22:1. A single small Graphics Interchange Format (GIF) file was created for each page image, to be used as a navigator image. The GIFs were created in a batch process with Image Alchemy software. First the originals were blurred, then the image height was reduced to 300 pixels while retaining the aspect ratio, then the images were sharpened, and finally they were saved as GIF files according to the GIF89a specification.
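The arithmetic behind these two derivatives can be worked through as follows. This is a sketch under stated assumptions: the page dimensions are hypothetical, and the 22:1 ratio is taken here to be relative to the uncompressed 8-bit grayscale raster, which the source does not state explicitly.

```python
def navigator_size(width_px, height_px, target_height=300):
    """Scale (width, height) so the height equals target_height, keeping aspect ratio."""
    if width_px <= 0 or height_px <= 0:
        raise ValueError("dimensions must be positive")
    scale = target_height / height_px
    return max(1, round(width_px * scale)), target_height

def estimated_compressed_bytes(width_px, height_px, ratio=22, bytes_per_pixel=1):
    """Estimate file size after compressing an 8-bit raster at ratio:1.

    Assumes the ratio is measured against the uncompressed grayscale raster.
    """
    return width_px * height_px * bytes_per_pixel / ratio

# Hypothetical master scan: 5400 x 7200 pixels (18 x 24 inches at 300 dpi).
nav_w, nav_h = navigator_size(5400, 7200)
zoom_bytes = estimated_compressed_bytes(5400, 7200)
print(nav_w, nav_h, round(zoom_bytes))
```

Blurring before the reduction and sharpening after it, as the batch process did, is a common way to limit aliasing when a bitonal page is shrunk this far; the sketch covers only the sizing step.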

A database was developed to drive the calendar-browse mode and provide issue information. Each issue of the newspaper was given one record in the database, with its title, date, and enumeration included to allow citation information to be displayed with each page. The viewer uses file-name information to drive its page-turning function within each issue.
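A minimal sketch of these two pieces follows. The source does not give the database schema or the file-naming convention, so the record fields beyond title, date, and enumeration, the `YYYYMMDD_page.sid` naming scheme, and all function names are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class IssueRecord:
    """One record per issue, holding the fields shown as citation information."""
    title: str
    date: str
    enumeration: str

    def citation(self):
        return f"{self.title}, {self.date} ({self.enumeration})"

# Hypothetical naming scheme: <issue date YYYYMMDD>_<page number>.sid
PAGE_FILE = re.compile(r"^(?P<issue>\d{8})_(?P<page>\d+)\.sid$")

def pages_by_issue(filenames):
    """Group page files by issue, ordered by numeric page number."""
    issues = {}
    for name in filenames:
        m = PAGE_FILE.match(name)
        if m is None:
            continue  # skip files that do not follow the assumed scheme
        issues.setdefault(m["issue"], []).append((int(m["page"]), name))
    return {issue: [n for _, n in sorted(pages)] for issue, pages in issues.items()}

def turn_page(pages, current, step):
    """Return the file `step` pages away from `current`, or None at either end."""
    i = pages.index(current) + step
    return pages[i] if 0 <= i < len(pages) else None

# Illustrative data only.
record = IssueRecord("The Stars and Stripes", "February 8, 1918", "Vol. 1, No. 1")
files = ["19180208_2.sid", "19180208_10.sid", "19180208_1.sid", "19180215_1.sid"]
issues = pages_by_issue(files)
pages = issues["19180208"]
print(record.citation())
print(turn_page(pages, "19180208_2.sid", 1))
```

Deriving page order from file names keeps the database small, at one record per issue rather than one per page, at the cost of depending on a strict naming convention.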

Image Specifications

Master Image
Image resolution: 300 dpi
Tonal resolution: 1 bit (bitonal)
File format: TIFF
Compression: ITU Group IV

Zoom Window Image
Image resolution: 300 dpi
Tonal resolution: 8 bit (grayscale)
File format: MrSID
Compression: Wavelet

Navigator Image
Image size: 300 pixels on the long side
Tonal resolution: 8 bit (grayscale)
File format: GIF89a
Compression: Lempel-Ziv-Welch (LZW)

Text Processing

To permit text searching, the images were processed with PrimeOCR optical-character-recognition software. The settings enabled automatic zoning, which recognizes columns and ignores pictorial elements that contain no text. Each character was analyzed by six OCR algorithms, and PrimeOCR's Lexical Check was employed. Output was in PrimeOCR's proprietary ".pro" format, which includes the text recognized in each image along with each character's location and confidence level. One file was created for each page image. The separate files were then merged in a batch process into one large file so that the text could be searched more efficiently. A keyword search in American Memory, or a search within the Stars and Stripes collection, first accesses the textual information. Once a specific record is selected for display, the character-location information is used to highlight the location of the best match on the page.
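The two-stage lookup described above, text first and location second, can be sketched as follows. The ".pro" format is proprietary and is not reproduced here: the `OcrWord` structure is hypothetical, it stores locations per word rather than per character for brevity, and treating the "best match" as the highest-confidence occurrence is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OcrWord:
    page: str          # page image identifier
    text: str          # recognized word
    box: tuple         # (left, top, right, bottom) in pixels
    confidence: float  # 0.0 to 1.0 in this sketch

def merge_pages(per_page_words):
    """Merge per-page word lists into one index keyed by lowercase word."""
    index = {}
    for words in per_page_words:
        for w in words:
            index.setdefault(w.text.lower(), []).append(w)
    return index

def pages_with(index, keyword):
    """Stage 1: find which pages contain the keyword."""
    return sorted({w.page for w in index.get(keyword.lower(), [])})

def best_match(index, keyword, page):
    """Stage 2: return the highest-confidence occurrence on a page, or None."""
    hits = [w for w in index.get(keyword.lower(), []) if w.page == page]
    return max(hits, key=lambda w: w.confidence, default=None)

# Illustrative data only.
page1 = [OcrWord("p1", "Armistice", (10, 10, 90, 30), 0.6),
         OcrWord("p1", "armistice", (100, 200, 190, 220), 0.9)]
page2 = [OcrWord("p2", "Armistice", (40, 50, 120, 70), 0.95)]
index = merge_pages([page1, page2])

print(pages_with(index, "ARMISTICE"))
print(best_match(index, "armistice", "p1").box)
```

The returned box is what a viewer would need in order to draw a highlight over the page image once a record has been selected.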
