Over the past two months I’ve shared information with you about the copyright records and the plans, challenges, and vision for preserving them in digital form and making them widely available online. Today’s post is a brief update on recent progress.
First, I’m happy to report that digitization of the records is continuing. In the two months since this blog was started, nearly 2.2 million more catalog cards have been scanned, and the images inspected and placed in archival storage. This brings the total to more than 16.6 million cards completed, which is more than a third of the entire catalog.
Progress has also been made in scanning and making available the Catalog of Copyright Entries: 36 more volumes have been processed since late December, bringing the total to 456 of the 660 volumes. We’re nearing 70% completion, with some registrations from as far back as 1928 and with all classes of works included.
The volumes scanned so far are available at the link below the CCE page shown to the left.
We also have a small group of subject matter experts and information technologists studying how the records can best be indexed and made available to users. A test database is being used to assess the efficacy of different approaches to indexing and displaying these older records. Integration with the post-1977 records is a goal, but accomplishing that is not without its challenges. Records since 1978 have been collected in a database containing multiple indexes for titles, authors, claimants, and other related index terms. This granularity allows flexible searching by title only, by name only, or by combinations of different indexes. The card catalog, by contrast, consists largely of a single alphabetical index within each time period, with names, titles, and other index terms interfiled. The plan is to capture all of the text on a card and to programmatically parse as much of the data as possible, placing it in the appropriate fields in the data record. We are studying the data patterns found in the cards to see whether specific types of index terms can be distinguished based on their position on the card or on unique characters, such as the copyright notice symbol ©. For the most part, however, the data on the registration cards is not labeled, making it difficult and perhaps impossible to programmatically distinguish all names and titles.
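To make the parsing idea concrete, here is a minimal sketch in Python of the kind of heuristic described above. It is an illustration under stated assumptions, not the approach the team has settled on: the names `CardRecord` and `parse_card`, the field names, and the sample card text are all hypothetical. The sketch assumes the first line of a card is its index heading (a position cue) and that a line containing © carries the claimant and date (a character cue). Anything it cannot classify is kept as unparsed text rather than guessed at.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: a four-digit year in the 1800s or 1900s, not embedded in a longer number.
YEAR_RE = re.compile(r"(?<!\d)(1[89]\d{2})(?!\d)")


@dataclass
class CardRecord:
    """Hypothetical data record for one scanned catalog card."""
    full_text: str                      # all text captured from the card
    heading: Optional[str] = None       # index term the card is filed under
    notice: Optional[str] = None        # text following the © symbol
    year: Optional[int] = None          # year found in the notice text, if any
    unparsed: List[str] = field(default_factory=list)  # lines left unclassified


def parse_card(text: str) -> CardRecord:
    """Split card text into fields using position and the © symbol as cues."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    record = CardRecord(full_text=text)
    for i, line in enumerate(lines):
        if "©" in line and record.notice is None:
            # Character cue: keep whatever follows the first © on this line.
            record.notice = line.split("©", 1)[1].strip()
            match = YEAR_RE.search(record.notice)
            if match:
                record.year = int(match.group(1))
        elif i == 0:
            # Position cue (assumption): the top line is the filing heading.
            record.heading = line
        else:
            # Unlabeled text: could be a title, a name, or something else.
            record.unparsed.append(line)
    return record


# Invented sample text, for illustration only.
sample = "SMITH, JOHN\nThe example title.\n© John Smith; 12Mar1950; A12345."
card = parse_card(sample)
print(card.heading, card.year, card.unparsed)
```

Note that the title line ends up in `unparsed`: nothing on the card labels it as a title, which is exactly the difficulty with interfiled, unlabeled index terms.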
As we work through these challenges, I’ll keep you posted on our findings and continue to seek your input.