Open Source Tool Speeds Up Web Archive Scoping

This is a guest post from Kathleen Kenney, Digitization Specialist, Digital Information Management Program, State Library of North Carolina.

The State Library of North Carolina, in collaboration with the North Carolina State Archives, has been archiving North Carolina state agency web sites since 2005 and social media since 2009. In that time, we have crawled over 82 million documents and archived 6 terabytes of data.

With each quarterly crawl, we capture over 5,000 hosts, many of which are out of scope. As a government agency, it is our responsibility to ensure that we are not archiving any sites that are inappropriate due to their content or copyright status, or simply because they are not related to North Carolina government. Managing the crawl budget is also a priority. Performing a crawl analysis allows us to prevent out-of-scope hosts from being crawled, and at the same time it often surfaces new seeds that should be actively captured. A seed is any URL that we want to capture in our crawl.

In the past, a crawl analysis involved downloading the crawl reports, importing the data into a spreadsheet, and comparing the current list to previously reviewed URLs in order to delete any duplicates. After de-duping, the team would manually review over 3,000 URLs in order to constrain any out-of-scope web sites from the next crawl. An onerous task, but necessary nonetheless.

We developed a new open source tool, Constraint Analysis, to streamline this process dramatically. The tool works in conjunction with the Archive-It web archiving service. Now, instead of using a spreadsheet to manage the crawl reports and remove duplicates, we use a series of automated database queries. And rather than manually reviewing each remaining URL, we upload a .txt file of the URLs into the tool. The tool sends a request to a free third-party screen scraper service, which generates a .png image of each home page. The URL Listings page then displays the URLs and home page images, 100 per page.
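To make the workflow concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not the Constraint Analysis code itself: the database and file names are invented, and the screenshot endpoint is a hypothetical stand-in for the third-party service the tool actually calls.

    import sqlite3
    import urllib.parse
    import urllib.request

    # Hypothetical table of URLs reviewed in earlier crawl analyses.
    db = sqlite3.connect("crawl_analysis.db")
    db.execute("CREATE TABLE IF NOT EXISTS reviewed_urls (url TEXT PRIMARY KEY)")

    # Load the hosts reported by the current crawl (one URL per line).
    with open("current_crawl_hosts.txt") as report:
        current_hosts = {line.strip() for line in report if line.strip()}

    # Keep only hosts that have never been reviewed before.
    reviewed = {row[0] for row in db.execute("SELECT url FROM reviewed_urls")}
    new_hosts = sorted(current_hosts - reviewed)

    # Ask a (hypothetical) screenshot service for a .png of each home page.
    SCREENSHOT_SERVICE = "http://screenshots.example.com/render?url="

    for host in new_hosts:
        image_url = SCREENSHOT_SERVICE + urllib.parse.quote(host, safe="")
        png_name = urllib.parse.quote(host, safe="") + ".png"
        with urllib.request.urlopen(image_url) as response, open(png_name, "wb") as out:
            out.write(response.read())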

Screen shot of the Constraint Analysis tool

We are able to quickly scan the images, scrolling through 100 at a time and pausing only to click through to a URL when necessary, change a web site’s status from its default of “constrain,” flag a site as a possible seed, or shorten a URL so it is constrained at the root. Once all of the pages of images are reviewed, a .txt file of the data is exported to allow for easy updating.
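The exported data might look something like the following. This is only a hypothetical Python illustration of the output format, with invented statuses and URLs, not the tool's actual export code.

    # Hypothetical review decisions keyed by URL: "constrain" (the default),
    # "seed" for a site worth capturing, or a shortened URL to constrain at the root.
    decisions = {
        "http://unrelated.example.com/": "constrain",
        "http://newboard.example.gov/": "seed",
        "http://agency.example.gov/blog/post-123/": "http://agency.example.gov/",
    }

    # Write the reviewed data to a .txt file for updating the crawl scope.
    with open("constraint_analysis_export.txt", "w") as export:
        for url, status in sorted(decisions.items()):
            export.write(url + "\t" + status + "\n")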

A modest change to the process and a simple tool reduced the time required for a crawl analysis from 16 hours to 4 hours. Beyond crawl analysis, the Constraint Analysis tool can also be modified for other quality assurance tasks, such as analyzing seed lists or reviewing sites blocked by a robots.txt file. The tool is free and available for download from GitHub.
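As one sketch of such an adaptation, the following Python snippet checks which URLs from a review list a crawler would be blocked from fetching by robots.txt. It is a hedged illustration, not part of the Constraint Analysis tool; the file name and user-agent string are placeholders.

    import urllib.parse
    import urllib.robotparser

    def blocked_by_robots(url, user_agent="example-crawler"):
        """Return True if the site's robots.txt disallows this URL for the crawler."""
        parts = urllib.parse.urlparse(url)
        robots_url = parts.scheme + "://" + parts.netloc + "/robots.txt"
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(robots_url)
        parser.read()  # fetch and parse the site's robots.txt
        return not parser.can_fetch(user_agent, url)

    with open("urls_to_review.txt") as urls:
        for url in (line.strip() for line in urls if line.strip()):
            if blocked_by_robots(url):
                print("blocked by robots.txt:", url)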

7 Comments

  1. Ed Summers
    October 31, 2011 at 7:31 am

    That’s a creative use of the wimg webpage screenshot service to do something extremely practical. It’s awesome to see the State Library of North Carolina putting their tools out on GitHub.

    A tool like wkhtmltopdf allows you to convert a webpage into an image file using WebKit, which is the open source browser framework used by Safari on OS X. Something like wkhtmltopdf should make it possible to generate the images as part of the Constraint Analysis application, if more control over the image generation is needed.
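
    A minimal Python sketch of that idea, assuming the wkhtmltoimage companion binary (shipped with the wkhtmltopdf project) is installed and on the PATH; the URL and file name are just examples:

        import subprocess

        def screenshot(url, output_png):
            # Render the page to a .png using WebKit via wkhtmltoimage.
            # Assumes the wkhtmltoimage binary is installed and on the PATH.
            subprocess.check_call(["wkhtmltoimage", url, output_png])

        screenshot("http://www.example.com/", "example.png")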

  2. Dean Farrell
    November 2, 2011 at 1:26 pm

    Thanks for the nice words, Ed. We looked at wkhtmltopdf, but our ISP didn’t want to add it to their setup. We now have a dedicated server, so I’ve been thinking of adding it. There’s also a nice wrapper on GitHub for using wkhtmltopdf with PHP (https://github.com/knplabs/snappy).

  3. Raffaele Messuti
    November 8, 2011 at 12:59 pm

    Some other tools for the screenshotting job:

    - http://www.paulhammond.org/webkit2png/
    - http://khtml2png.sourceforge.net/

    The one I like best is PhantomJS (http://www.phantomjs.org/):

    - https://github.com/ariya/phantomjs/blob/master/examples/rasterize.js

  4. Dean Farrell
    December 13, 2011 at 1:26 pm

    Thanks for the link to PhantomJS. The Wimg service has instituted a limit of 3 concurrent screenshot requests from the same IP, severely limiting the usefulness of the current approach. I’ve been testing wkhtmltopdf, but it’s not without its issues, at least on Windows.

  5. Maarten Brinkerink
    April 2, 2012 at 9:08 am

    Dear Bill,

    Could you please tell me what license this tool is available under?

    Best,

    Maarten Brinkerink

  6. Kathleen Kenney
    April 2, 2012 at 1:33 pm

    Hi, Maarten. All of the tools created by the State Library of North Carolina are in the public domain. Here is our rights statement, http://digital.ncdcr.gov/u?/p249901coll22,63754. Thank you for your interest in our Constraint Analysis tool.

  7. Dean Farrell
    April 13, 2012 at 3:19 pm

    The tool has been moved to a new location on GitHub. The new link is: https://github.com/SLNC-DIMP/Constraint-Analysis

    Dean
