We are all pretty familiar with the process of scanning texts to produce page images and converting them, via optical character recognition, into full text for indexing and searching. But electronic texts have a far older pedigree.
Text digitization in the cultural heritage sector started in earnest in 1971, when the first Project Gutenberg text — the United States Declaration of Independence — was keyed into a file on a mainframe at the University of Illinois. The Thesaurus Linguae Graecae began in 1972. The Oxford Text Archive was founded in 1976. The ARTFL Project was founded at the University of Chicago in 1982. The Perseus Digital Library started its development in 1985. The Text Encoding Initiative started in 1987. The Women Writers Project started at Brown University in 1988. The University of Michigan’s UMLibText project was started in 1989. The Center for Electronic Texts in the Humanities was established jointly by Princeton University and Rutgers University in 1991. Sweden’s Project Runeberg went online in 1992. The University of Virginia EText Center was also founded in 1992. These projects focused on keyed-in text structured with markup — ASCII or SGML at the time, transitioning to HTML and later to XML.
Mosaic, the first widely adopted graphical web browser, was released in November 1993. The web was the “killer app” for digitized cultural heritage materials.
Do you want to see some real digitized text history? Check out this archived list of electronic text centers from 1994. It’s an international Who’s Who of digital humanities. And it’s a wonderful piece of computing history in and of itself, with its gopher servers and VAX machines and USENET groups and anonymous FTP sites.
Preservation has long been a goal of text digitization. The phrase “Preservation Reformatting” is known to all of us, and digitization is part of many institutional preservation strategies, especially for brittle books. Yale University Library and Cornell University Library undertook test projects to digitize text materials and produce preservation microfilm from the digital files. Yale’s Project Open Book started in 1991, and the Cornell demonstration project formally started in 1993. Making of America launched in 1995. The first round of Library of Congress Ameritech digitization for the National Digital Library was in 1996. The National Archives and Records Administration’s Electronic Access pilot project started in 1997.
I’m skipping over much of the last 15 years of text digitization, where the work has shifted from the bespoke to mass digitization — the era of the Open Content Alliance, the Universal Digital Library/Million Book Project, and Google Book Search, among others. The Library of Congress is still participating in the development of standards for page imaging for textual materials as part of its Federal Agencies Digitization Guidelines Initiative work.
And I’m not going to enter into the debate over page images versus marked-up keyed text versus OCR, which has existed since the earliest days of text digitization. There are ardent points of view on the pros and cons of each.
There is, of course, the matter of preserving the output of these text digitization projects. There is a very thorough report worth reading on that topic: Preservation of Digitized Books and Other Digital Content Held by Cultural Heritage Organizations, written jointly by Cornell University and Portico in 2011.