Data capture and indexing of the pre-1978 Copyright records will, by all accounts, be a challenging task. But some volunteer work recently done by Copyright staff may facilitate capturing the data and organizing it for effective searching. Focusing on the 1971 to 1977 time period within the Copyright Card Catalog, they have identified over 30 characteristics and patterns in the free-form textual data that should make it easier to convert the cards into searchable online records.
The Copyright Card Catalog is considered the most up-to-date index to pre-1978 copyright registrations. It has been updated over time to reflect corrections and changes, sometimes with handwritten annotations and sometimes with new cards, and so it is also considered the best source of information for building an online searchable index. The part of the catalog for registered works is divided into six time periods, the most recent being 1971 to 1977. There are 7.8 million cards for this time period, representing 2.8 million registrations, arranged in a single alphabetical index of names and titles. Each card has a heading, either a name or a title, under which the card is filed. For most registrations the heading is followed by a text paragraph that starts with the title of the work and includes the author and claimant names, the registration number assigned, and the effective date of registration, as well as other facts pertinent to the registration. For renewal registrations, the original registration number and date are also included just after the copyright notice symbol.
The information in the registration cards is neither tagged nor divided into specific fields. But the patterns found in the cards will make it possible to compare a card heading with the data strings in the text paragraph and, from that comparison, determine whether the heading is a title, an author name or a claimant name.
Here’s a sample of the key patterns and characteristics found:
- 99.6% of the cards had the title of the work at the beginning of the text paragraph;
- 76.6% of the cards had the registration number at the end of the text paragraph;
- 94.6% of the cards contained a copyright notice symbol © or Ⓟ;
- Approximately 79% of the cards had the claimant name right after the © or Ⓟ;
- Approximately 7% of the cards were for renewal registrations;
- Author names often have “markers” such as the word “by” or the letters “w” or “m” for words or music indicating their role in the work;
- Registration numbers can be recognized by their distinct class prefix and by their location on the card image;
- The date of registration has a consistent format and almost always appears just before the registration number.
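To make the idea concrete, here is a minimal Python sketch of how patterns like these might drive parsing and heading comparison. Everything specific in it is an assumption made for illustration: the sample card text is invented, the date and registration number formats are guesses at a plausible layout, and the function and field names are hypothetical. It is not the Office's workflow, and real cards will need many more rules.

```python
import re

# Assumed formats, for illustration only.
MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_RE = re.compile(rf"\b(\d{{1,2}})\s?({MONTHS})\.?\s?(\d{{2,4}})\b")  # e.g. 12Mar73
REGNUM_RE = re.compile(r"\b([A-Z]{1,3})\s?(\d{3,7})\b")  # class prefix + digits
AUTHOR_MARKER_RE = re.compile(r"\s(by|w & m|w|m)\s+")  # "by", words, music
TRIM = " .;,"


def parse_card(text):
    """Split a free-form card paragraph into rough fields using positional cues."""
    fields = dict.fromkeys(("title", "author", "claimant", "date", "regnum"))
    notice = re.search(r"[©Ⓟ]", text)
    head = text[: notice.start()] if notice else text
    tail = text[notice.end():] if notice else text

    # The title opens the paragraph; an author marker, if present, ends it.
    marker = AUTHOR_MARKER_RE.search(head)
    if marker:
        fields["title"] = head[: marker.start()].strip(TRIM)
        fields["author"] = head[marker.end():].strip(TRIM)
    else:
        fields["title"] = head.strip(TRIM)

    # The claimant usually follows the notice symbol directly.
    if notice:
        fields["claimant"] = tail.split(";")[0].strip(TRIM)

    date = DATE_RE.search(tail)
    if date:
        fields["date"] = date.group(0)

    # The registration number is usually last, so keep the final match.
    regnums = REGNUM_RE.findall(tail)
    if regnums:
        fields["regnum"] = "".join(regnums[-1])
    return fields


def tokens(value):
    """Lowercased, sorted word list, so 'Doe, Jane Q.' equals 'Jane Q. Doe'."""
    return sorted(re.findall(r"[a-z0-9]+", (value or "").lower()))


def matching_roles(heading, fields):
    """Return every role (title, author, claimant) whose text matches the heading."""
    wanted = tokens(heading)
    return [role for role in ("title", "author", "claimant")
            if wanted and wanted == tokens(fields[role])]


# Invented sample text, not a real card.
SAMPLE = "Moonlight on the river; w & m Jane Q. Doe. © Jane Q. Doe; 12Mar73; EU123456."
fields = parse_card(SAMPLE)
print(fields)
print(matching_roles("MOONLIGHT ON THE RIVER", fields))
print(matching_roles("Doe, Jane Q.", fields))
```

A heading can legitimately match more than one role, as when the author is also the claimant, so the comparison returns a list rather than a single answer. Cards that match nothing would be natural candidates for human review.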
The patterns and characteristics identified will be valuable input for the development of workflows and tasks used for data capture through crowdsourcing, a process that looks more and more feasible for Copyright data. Our research shows that workflows can be designed to capture and verify the data through keyboarding, or they can incorporate OCR followed by cleansing and parsing by the crowd. The data can be indexed in the appropriate fields, and any of the information in the text paragraph can be made accessible through keyword searching.
Good progress has been made on the preservation front of the project with more than 22 million cards digitized and all volumes of the published Catalog of Copyright Entries scanned and available online through the Internet Archive. Now we’re ready to move ahead on making the records more accessible online. The Office will soon issue a Request for Quotation containing sufficient details about the cards, and the patterns and characteristics found, to obtain cost estimates for capturing the data contained in the 1971 to 1977 catalog cards. I’ll keep you posted on progress and as always your input is most welcome and appreciated.
December 23, 2012 at 5:06 pm
I hope I can volunteer in a work-at-home capacity, using my computer, for maybe 2 to 4 hours per day, 3 days per week.
Thank you in advance for considering this.
How about if I can get the OCR data side by side with a card scan or picture of the card, to verify that the data has been OCR’d correctly?
January 3, 2013 at 11:02 am
Steven B. Tuttle’s remark suggests that the OCR data appear side-by-side on the screen with an image of the card that had been scanned to produce the data. Might I suggest a better idea?
Why not have an imaging program function which automatically expands the space between lines of text on the card so that it becomes large enough to fit the OCR text? The OCR program knows where one line of text has reached bottom and where the next line has its top-most position. Having the OCR text directly above the image makes comparisons much easier, allowing difficult-to-spot differences to stand out because any letter or number in the middle of a line is positioned adjacent to its supposed counterpart.
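The spacing idea in the comment above can be sketched as a small layout calculation. This is only an illustration under assumptions: the OCR engine is assumed to report each line's top and bottom pixel rows, the function name `interleave_layout` and its output keys are hypothetical, and the actual image cutting and text drawing are left out.

```python
def interleave_layout(line_boxes, text_height):
    """Compute where each OCR text line and each original line strip should go.

    line_boxes: (top, bottom) pixel rows of each detected line, top to bottom.
    text_height: height in pixels of the gap opened above every line.
    """
    layout = []
    shift = 0  # total height of the gaps inserted so far
    for top, bottom in line_boxes:
        gap_top = top + shift  # the OCR text is drawn starting here
        shift += text_height   # everything from this line down moves lower
        layout.append({
            "ocr_text_y": gap_top,
            "image_top": top + shift,
            "image_bottom": bottom + shift,
        })
    return layout


# Three lines, each 20 pixels tall, 10 pixels apart on the original scan.
for row in interleave_layout([(10, 30), (40, 60), (70, 90)], text_height=20):
    print(row)
```

Each original line keeps its height and simply moves down by the total height of the gaps opened above it, so the OCR text always sits directly over the characters it was derived from.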
January 3, 2013 at 11:23 am
The blog post mentions the planned use of automated procedures to parse the data derived from the catalog cards, and suggests that some percentage of the parses will be misidentified by the computer program because some cards have data which is positioned or formatted differently than is normal for the cards. Some human checking of the results will be necessary.
The human examination of the decisions made by the computer program as to where to parse and how to assign data fields can be made easier by having the program place a bullet (of a type never seen on the catalog cards) where it determined to parse, and by color-coding the data-field decisions. Thus, for example, the data identified by the program as being the registration number might get a light-yellow background. The date might get a light-green background. (The date is almost fool-proof to identify, because it will have one of twelve abbreviations for a month sandwiched between two groups of numbers.) The title might get an orange underline. In this way, a person might most easily spot an error made by a computer.
A computer program may well be written with enough intricacy that a paragraph block which begins “MARY STEVENS, M. D. W. B. Pictures, Inc.” would be recognized as consisting of a title which ends on “M. D.” (a common abbreviation in English) and an author name which begins with “W. B.” (some human authors give their first and middle names as initials and spell out only the surname). However, some decisions will be made incorrectly by a program because human judgment can ferret out practices too uncommon to be thought of during the programming phase. This is where having different color underlinings or backgrounds would assist the person tasked with identifying poor decisions by the computer as to where to parse and assign text into data fields.
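The bullet-and-color idea from this comment could be prototyped along the following lines. This is a hedged sketch, not a tested design: the date rule follows the description above (a month abbreviation between two groups of digits), while the registration number pattern, the HTML output, the diamond marker and the sample text are all assumptions made for illustration.

```python
import html
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# A month abbreviation sandwiched between two groups of digits, e.g. 12Mar73.
DATE_RE = re.compile(rf"\b\d{{1,2}}\s?(?:{MONTHS})\.?\s?\d{{2,4}}\b")
# Assumed shape of a registration number: short class prefix plus digits.
REGNUM_RE = re.compile(r"\b[A-Z]{1,3}\d{3,7}\b")

STYLES = {"date": "background: lightgreen", "regnum": "background: lightyellow"}
BULLET = "\u25C6"  # a diamond, assumed never to occur on the cards


def highlight(text):
    """Return HTML with a bullet and a colored background at each parse decision."""
    spans = []
    for name, pattern in (("date", DATE_RE), ("regnum", REGNUM_RE)):
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), name))
    spans.sort()

    out, pos = [], 0
    for start, end, name in spans:
        if start < pos:  # skip a match that overlaps one already emitted
            continue
        out.append(html.escape(text[pos:start]))
        out.append(f'{BULLET}<span style="{STYLES[name]}">'
                   f"{html.escape(text[start:end])}</span>")
        pos = end
    out.append(html.escape(text[pos:]))
    return "".join(out)


# Invented sample text, not a real card.
print(highlight("© Jane Q. Doe; 12Mar73; EU123456."))
```

A reviewer scanning the output sees at a glance which strings the program took for the date and the registration number, and a card with no colored spans stands out as one the program could not parse.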