Building the Digital Collection

Image Scanning

Nine of the Zora Neale Hurston plays were digitized on an i2S Digibook scanner in the Information Technology Services Digital Scan Center at the Library of Congress. The remaining play, a bound volume, was captured with a Phase One camera. The plays were digitized as 300 dpi grayscale images and stored with JPEG compression in the JPEG File Interchange Format (JFIF). GIF images were also created for convenient access through the page-turner feature.

The digital images reflect the original physical condition of the plays. A few pages have tears or tiny dog-ears at the corners, and some pages of formerly bound plays have punch holes along the left margin. The plays are in relatively good condition, however, and required minimal conservation treatment before scanning. The Digital Scan Center staff took great care in handling the materials; the one bound play was placed in a book cradle and scanned from overhead.

Creating the Searchable Text

After Library of Congress staff approved the digitized images, searchable text was prepared in-house using proprietary Optical Character Recognition (OCR) software. The software generally performed with a 10-to-15 percent error rate, depending on such factors as image quality and the quality of the original typewritten text. The resulting raw, or "dirty," OCR was corrected manually during the quality review phase of production to improve its accuracy. The corrected text is used to enhance collection searching and is encoded in Standard Generalized Markup Language (SGML) according to the American Memory Document Type Definition (DTD).
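
To make the idea of an OCR error rate concrete, the following is a minimal sketch of one common way to measure it: the character error rate, computed as the edit distance between the raw OCR and the corrected text, divided by the length of the corrected text. This is an illustration only. The page does not say how the 10-to-15 percent figure was measured, and the function names and sample strings below are hypothetical, not taken from the plays or from the Library's software.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb, or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(raw_ocr: str, corrected: str) -> float:
    """Edit distance from raw OCR to corrected text, as a fraction of
    the corrected text's length. (Assumed metric, not the Library's.)"""
    if not corrected:
        raise ValueError("corrected text must not be empty")
    return edit_distance(raw_ocr, corrected) / len(corrected)


if __name__ == "__main__":
    # Made-up sample strings for illustration only.
    raw = "tbe quick hrown fox"
    fixed = "the quick brown fox"
    print(f"{character_error_rate(raw, fixed):.1%}")
```

Under this metric, two wrong characters in a 19-character line give a rate of about 10.5 percent, which shows how quickly errors at this level add up to a text that needs manual correction before it can support reliable searching.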

The Zora Neale Hurston Plays at the Library of Congress
