There and Back Again: Scoping a Web Archiving Project Around the Hobbit

The following is a guest post from Sarah Weissman, a second-year student in the MLS program at the University of Maryland’s iSchool.

This past semester, as part of the course Information Access in the Humanities, my classmates and I studied current trends in humanities scholarship. Under the guidance of Kari Kraus, we learned about the availability of digital resources for humanities scholars and the use of these resources by researchers in humanities fields. For our final projects, we worked in groups to create humanities-based web collections, an act of curation similar to what many libraries and archives around the world are now undertaking. These student-developed collections are publicly available through the University of Maryland iSchool on the Archive-It website.

Each group of students was responsible for choosing a collection topic, appraising and selecting web sites to be included in the collection, and working with the Archive-It tool to collect the chosen websites. My group chose the then-upcoming release of The Hobbit movie as the topic of our collection, foremost because of our shared interest in archiving multimedia content, and also because the movie was a highly anticipated event of some cultural significance, both as a popular film series and in the larger context of adaptations of the literary works of J.R.R. Tolkien. Our collection aimed to capture the digital artifacts associated with the movie, such as trailers, teaser videos and artwork published by the movie’s creators in preparation for the release, as well as commentary published by its creators, actors and fans. To this end we selected both official and unofficial/fan sites for our collection, and focused the majority of our collection on social media (including social network sites like Twitter and Google+, as well as web forums and blogs). The timing of the release was also an important factor in our choice, since the design of social media websites and the limitations of crawling software often mean having to capture these pages virtually in real time.

As social media sites grow in popularity and become more and more important in our day-to-day lives, users increasingly face related concerns about privacy, security and content ownership. At the same time, this added relevance makes social media content an important target for archiving. Unfortunately, archiving social media content can be challenging. Reasons for this include restrictive robots.txt files (files that provide access-restriction guidelines which web crawlers follow on the honor system), inaccessibility of password-protected content, inaccessibility of content for reasons of monetization, and the fear or threat of litigation from content and/or website owners. This post aims to document some of the technical difficulties in archiving social media sites, in particular the challenges we faced during our student project in archiving web discussion boards, but first I will give a brief background on web archiving with Archive-It and on what scoping web collections means.
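
For reference, a robots.txt file is just a plain-text file published at the root of a website. A hypothetical example (the site and paths here are invented purely for illustration) might look like this:

    # http://www.example.com/robots.txt (hypothetical)
    User-agent: *        # these rules apply to all crawlers
    Disallow: /forum/    # please stay out of the discussion forums
    Disallow: /private/  # and out of members-only areas

Compliance is voluntary: a crawler that chooses to ignore the file can still fetch those pages, which is why these directives are described as being enforced on the honor system.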

When creating web collections, scoping is an important concept. In essence, scoping is making choices about what will and won’t be included in the final collection. Topic and website choice are both important factors in scoping, as is the availability of resources, including any limitations in time or data storage, or a dollar budget for archiving-tool subscription costs or hosting fees for online collections. The Archive-It web crawling application, a subscription-based tool, lets users limit scope by time (both frequency and duration of crawls) and by data (number of documents collected), and also gives users finer control of scope through seed specification and scoping rules.

A seed is a URL given to Archive-It’s web crawler (the open source Heritrix crawler) as a starting point for web collection. The seed can be the top level of a site, such as “www.loc.gov”, or a document or directory below that level, such as “blogs.loc.gov/digitalpreservation/”.

To give you some idea what this looks like, here is the list of seeds for our project:

  • http://thehobbit.com/
  • http://thehobbitblog.com
  • http://www.mckellen.com/cinema/hobbit-movie
  • http://twitter.com/TheHobbitMovie
  • https://plus.google.com/116428360629190654024
  • http://the-hobbit-movie.com/
  • http://www.council-of-elrond.com/forums/forumdisplay.php?f=53
  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17

How a seed is specified can have drastic implications for collection scope. For example, "www.loc.gov/" with a trailing slash means something different to the web crawler than "www.loc.gov". Specifying a URL with or without a subdomain also makes a difference in crawling (e.g. "loc.gov" vs. "www.loc.gov" vs. "blogs.loc.gov"). Scoping rules in Archive-It can be used to both expand and limit scope. They can be specified in several ways, including as URL prefixes, regular expressions or SURT rules. Once a seed set is specified, Archive-It users can schedule crawls on one or more of the seeds, either as repeated crawls with a specified frequency or as one-time crawls started on demand. Archive-It’s web crawler works by storing a copy of the seed page, then following URLs from that page to find more pages to archive. By default it archives embedded content (such as images, video, JavaScript and stylesheets) and web pages that fall under the lowest-level directory in the seed URL. To further aid in scoping, Archive-It lets users run test crawls that mimic the behavior of a real web crawl but only record the URLs that would have been crawled, not web page content.
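
To make these options concrete, here is a rough sketch of what the three kinds of rules might look like for a hypothetical seed of "blogs.loc.gov/digitalpreservation/" (treat the exact syntax as illustrative rather than as copied from the Archive-It interface):

    URL prefix:           http://blogs.loc.gov/digitalpreservation/
    Regular expression:   ^https?://blogs\.loc\.gov/digitalpreservation/.*
    SURT prefix:          http://(gov,loc,blogs,)/digitalpreservation/

All three describe roughly the same slice of the web; the SURT form simply writes the host name with its components reversed, so that related hosts sort together.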

Despite the utilities available to the Archive-It user, scoping of web collections can be quite challenging. As an example, below is a snapshot of the top results from our first production crawl, a one-time crawl of 3 days on all 8 of our seeds:

[Image: Report of URLs from Archive-It]

The report lists the number of URLs collected for each host (which, you will recall, includes the hosts for our seed sites as well as hosts for embedded content). By default the report is sorted by URLs, but the two hosts at the top of the list on the left, www.council-of-elrond.com and newboards.theonering.net, are also at the top when the list is sorted by data and by queued documents, finishing with about 500k and 4.7 million queued docs respectively. This is surprising, given that all other hosts in the crawl finished with no queued documents, but, as we learned during this project, it’s hard to guess how much data any one seed will bring in. Nevertheless, 4.7 million URLs is a lot of URLs, about 67 times more documents than we crawled in total during our 3-day crawl, and thus far more than we could hope to crawl in a semester. Even if we could collect that many documents, assuming the data scaled in proportion, we would be looking at about half a terabyte of data for just one host.

A little detective work helps here. Looking more closely at theonering.net, we see that the specific forum we were trying to crawl reports having 163,995 posts in 5,356 threads. Even at one URL per post and per thread, that comes to well under 200,000 pages, which means we have simultaneously crawled too few URLs to cover the forum and queued far too many documents. So what is going on?

In order to understand what is happening, it helps to have a basic idea of how a web-based forum is constructed. Much like a wiki, a web forum is not stored as a collection of static web pages; it is a database-driven, dynamically generated website. Posts live in a backing data store and are displayed to website users through scripts. (Note that a web crawler does not preserve the underlying architecture of the website it is crawling, only the web pages that are loaded when it visits a URL.) For example, take a look at a typical URL from theonering.net below:

http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?post=528337;sb=post_time;so=DESC;forum_view=forum_view_collapsed;guest=53819654

If you are unfamiliar with dynamically generated web content, what we see here is a URL that points to a script, "http://newboards.theonering.net/forum/gforum/perl/gforum.cgi". Everything after the "?" is an argument to gforum.cgi. We can guess that "post=528337" means load post number 528337, "sb=post_time;so=DESC" means sort by post time in descending order, "forum_view=forum_view_collapsed" means display the rest of the posts in the forum in a collapsed style, and "guest=53819654" means that we are browsing the forum as a guest with a particular session ID.
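
A quick way to see this structure is to split the URL on the "?" and then split the query string on its ";" separators. Here is a minimal Perl sketch (illustrative only, not part of our actual workflow) that pulls the example URL apart and prints each argument name and value:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The example URL from above.
    my $url = 'http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?'
            . 'post=528337;sb=post_time;so=DESC;forum_view=forum_view_collapsed;guest=53819654';

    # Everything before the "?" is the script; everything after is its arguments.
    my ($script, $query) = split /\?/, $url, 2;
    print "script: $script\n";

    # gforum.cgi separates its arguments with ";" rather than the more common "&".
    for my $pair (split /;/, $query) {
        my ($name, $value) = split /=/, $pair, 2;
        print "  $name = $value\n";
    }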

As mentioned above, Heritrix is designed to scope to the directory level of your seed. This means that anything not falling under “http://newboards.theonering.net/forum/gforum/perl/” is considered out of scope. Unfortunately, because of the dynamic nature of the site, everything falls under the same directory and only the trailing arguments change.
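
In other words, the crawler's basic notion of scope boils down to a prefix test on the URL. The deliberately simplified sketch below (embedded content such as images is handled separately, as noted above) shows why a link to forum #7 looks every bit as in-scope as one to forum #17:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Scope, at its simplest: does the URL start with the seed's directory?
    my $scope_prefix = 'http://newboards.theonering.net/forum/gforum/perl/';

    for my $url (
        'http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17',  # the forum we want
        'http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=7',   # a forum we do not
    ) {
        my $verdict = (index($url, $scope_prefix) == 0) ? 'in scope' : 'out of scope';
        print "$verdict: $url\n";
    }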

Here’s how the original page looks in a web browser.

[Image: Screenshot of theonering.net]

Just looking at the page, one can identify some potential problems for the web crawler. One is the "Forum Nav" frame that appears on every page. In addition, other navigational links appear above the forum text. This means that starting with the seed:

http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=17

the crawler gives us not only everything in forum #17 (Movie Discussion: The Hobbit), but also everything else linked to in the side nav:

  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=7
  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=9
  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=8
  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=10
  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=11
  • http://newboards.theonering.net/forum/gforum/perl/gforum.cgi?forum=15

Luckily, as mentioned before, Archive-It gives us a way to limit crawl scope using scoping rules. Here are the scoping rules we used for our crawl:

[Image: Scoping rules we used for our crawl]

They might be a little hard to read, but the gist is that we blocked all URLs that matched any of the forums in the side nav. (Regex-savvy readers will probably note that these rules could be more concisely expressed as a single regular expression. This is just how they evolved over time.)
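
Since the screenshot is hard to read, here is a hypothetical single-regex version of the same idea (an illustration reconstructed after the fact, not the exact rules from our crawl):

    # block everything in the side-nav forums (7 through 11 and 15), but not forum 17
    ^http://newboards\.theonering\.net/forum/gforum/perl/gforum\.cgi\?forum=(7|8|9|10|11|15)(;.*)?$

Because the alternation only matches those specific forum numbers, a URL for forum=17 falls through and stays in scope.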

So why do we have at least 4 million extra documents in our queue?

To answer this question we have to dive into the data. Archive-It lets you download lists of crawled, queued or blocked URLs as text files that you can load into the data analysis tool of your choice. Since I needed to sort URLs and match them against patterns for this analysis, my tool of choice was Perl, a scripting language that is well suited to text manipulation.

  1. Did we get the scoping rules wrong? No. Only one URL in 4.7 million came from a side-nav forum (and that is probably because I left a trailing ";" in the regex rules).
  2. Did we miss some forums with our scoping rules? Yes, but that only accounts for about 11k of the crawled or queued URLs.
  3. Did we collect 4 million images? No. Only about 200 of the URLs link to forum post attachments, which resolve to images, and no .jpg, .png, or .gif files are in the list.

So what is really going on? Remember that session ID ("guest=53819654")? When Archive-It crawls the web, it uses numerous threads, each of which gets assigned a different ID by the forum. These IDs do not change the content of the web page, but they do change the URLs. This means that URLs are not unique, and the same document might get queued multiple times under different URLs. How big a problem is this? Well, of the documents crawled and queued for theonering, there are 203,526 unique documents out of 4,813,987. That’s just 4%. (If you look at just the crawled documents it’s a little better: 32,512 unique out of 71,839, or 45%.)
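
Counting unique documents is a matter of stripping the session ID out of each URL before comparing them. A minimal Perl sketch of that kind of analysis (illustrative, not the actual script I used) might look like:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read a list of URLs (one per line, e.g. exported from Archive-It) on standard input,
    # strip the guest session ID, and count how many distinct documents remain.
    my %seen;
    my $total = 0;

    while (my $url = <STDIN>) {
        chomp $url;
        next unless length $url;
        $total++;
        (my $canonical = $url) =~ s/guest=\d+;?//g;   # drop the session ID
        $seen{$canonical}++;
    }

    printf "%d unique documents out of %d URLs (%.0f%%)\n",
        scalar(keys %seen), $total, $total ? 100 * keys(%seen) / $total : 0;

Stripping the session ID like this is only safe because, as noted above, the guest ID does not change the content of the page.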

Now that we’ve discovered the problem, can we fix it with more scoping rules? Maybe… but probably not. Below is the list of crawled or queued URLs sorted by number of repeats (showing only the argument portion of each URL, with the session ID masked as nnnnn):

  • do=login;guest=nnnnn       22579
  • guest=nnnnn        22578
  • do=search;guest=nnnnn    22578
  • do=whos_online;guest=nnnnn         22578
  • guest=nnnnn;category=3   19148
  • do=search;search_forum=forum_17;guest=nnnnn       18675
  • forum=17;guest=nnnnn      14952
  • username=DanielLB;guest=nnnnn   11147
  • do=message;username=DanielLB;guest=nnnnn           9958
  • username=Shelob%27sAppetite;guest=nnnnn              8898
  • username=Captain%20Salt;guest=nnnnn       8329
  • do=message;username=Shelob%27sAppetite;guest=nnnnn      7764
  • username=Carne;guest=nnnnn         7174
  • username=dormouse;guest=nnnnn  7151
  • do=message;username=Captain%20Salt;guest=nnnnn                6923
  • username=Ataahua;guest=nnnnn    6846
  • username=Silverlode;guest=nnnnn  6500
  • username=Estel78;guest=nnnnn       6458
  • username=dave_lf;guest=nnnnn      6267
  • do=message;username=Carne;guest=nnnnn 6234
  • do=message;username=dormouse;guest=nnnnn          6206
  • username=Bombadil;guest=nnnnn   6073
  • username=Ardam%EDr%EB;guest=nnnnn    5855
  • username=Faenoriel;guest=nnnnn   5811
  • username=Voronw%EB_the_Faithful;guest=nnnnn     5756
  • username=Kangi%20Ska;guest=nnnnn           5737

We see that the most repeated URLs correspond to the navigation links at the top of the page, which are found on every page and which weren’t blocked by our scoping rules. These could generally be excluded with additional rules, except, of course, for the URLs that point to the forum we want to crawl. Additionally, we could block all user pages. Unfortunately this list has a looooong tail, and includes many URLs that link directly to posts that we want to crawl. The longer we crawl, the more links we visit and the more (potentially) different session IDs we generate, which means that the crawl might never finish. Adding more scoping rules may limit the redundant URL queuing enough to get all the URLs we want to capture, but this does not solve the problem in general, and there is an obvious trade-off with data usage. More scoping rules may also lead to slower crawling, since the underlying software must test each URL against more patterns in order to make scoping determinations.
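
To give a flavor of what "more scoping rules" would mean in practice, hypothetical exclusions for the navigation and user-profile links at the top of the table might look like the following (again, an illustration rather than rules we actually deployed):

    # block the per-page navigation actions (login, search, who's online)
    ^http://newboards\.theonering\.net/forum/gforum/perl/gforum\.cgi\?do=(login|search|whos_online)
    # block user profile and private-message pages
    ^http://newboards\.theonering\.net/forum/gforum/perl/gforum\.cgi\?(do=message;)?username=
    # block the category listing pages
    ^http://newboards\.theonering\.net/forum/gforum/perl/gforum\.cgi\?guest=\d+;category=

Rules like these would knock out the most-repeated entries in the table while leaving forum=17 and its post URLs alone, but they do nothing about the long tail of session-ID variants on the pages we actually want.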

Unfortunately, we ran out of time for our project before coming up with a solution to this problem. So what might we have done? One option would have been to work with the web archiving service developers to improve their system. For example, more advanced regular expression rules that allow for both pattern matching and substitution might have let us replace the IDs generated by the website with a fixed value. The addition of a user option that limits the number of threads might have let us crawl without generating multiple guest IDs. For certain types of sites it just might be more effective to work directly with website owners to develop local mirrors or to archive content in other formats, rather than relying exclusively on general-purpose web-archiving services.

As a student, creating a real web-archiving collection has been a great opportunity, both to get hands-on experience with the tools that librarians and archivists are using on the job, and to witness first-hand the challenges of digital preservation. My unexpected journey through the world of web forums has taught me that the problem of web archiving is by no means solved. Librarians and archivists have an opportunity to lead the way in creating the necessary technology and policy so that social media and other at-risk web content can be preserved.
