2012 September | The Signal: Digital Preservation

The following is a guest post by Abbie Grotke, Web Archiving Team Lead.

You might imagine that with the web being in its twenties everyone would know exactly what a website is. But you’d be surprised – those of us in the web archiving business spend quite a bit of time pondering what makes up a particular organization or person’s website. We’re often challenged to determine what domains constitute any given site we are trying to archive.

Members of "Puppies for a Stronger Government" gathering to design their website? No, just an image from the Prints and Photographs collection, Library of Congress, LC-USZ62-60676.

The Library of Congress must send notices or permission requests to site owners for most content that we collect. In some cases, content that we collect might have links to other content that either we don’t want or whose owner we didn’t contact. Knowing the boundaries of a website is an important part of our process.

When we archive websites, our Library recommending officers select a seed URL (the starting point for the crawler) and enter it into a curator tool that is used to manage our web archiving workflow. The seed URL is also the URL that we catalog so that researchers can access archived content. A seed can be any URL on the web, but typically it is the home page at a site’s main domain.

Let me provide a close-to-home example. Everyone knows the Library’s URL is www.loc.gov. If we set our crawler to archive that domain, it will follow links to subdomains, pages, and other content, such as this blog post.

But what about copyright.gov or digitalpreservation.gov? Or our newly launched beta.congress.gov?

All of those domains are also part of the Library of Congress website. Our Web Archiving Team has a term for these, and for things like an organization’s or candidate’s social media accounts: we call them “intellectually part of the site.” Humans recognize that they are related, but the machines, the crawling tools, cannot. We have to give special instructions to the crawlers to get that content. We do that by “scoping” them so the crawler knows what links it should follow for that site.
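To make the idea of scoping a little more concrete, here is a minimal sketch in Python. It is not how our crawler is actually configured; the function names, the sample links, and the rule "same domain or any subdomain" are all assumptions made for illustration. It simply shows how a list of scoped domains changes which links a crawler would be allowed to follow.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(url, scope_domains):
    """True if the URL's host equals, or is a subdomain of, a scoped domain."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in scope_domains)


# Hypothetical scope lists: the seed's own domain, and the same list after
# adding domains judged "intellectually part of the site".
seed_only = {"loc.gov"}
scoped = seed_only | {"copyright.gov", "digitalpreservation.gov", "congress.gov"}

# Illustrative links a crawler might discover on a page.
links = [
    "http://blogs.loc.gov/",
    "http://www.copyright.gov/",
    "http://beta.congress.gov/",
    "http://www.example.com/",
]

followed_before = [u for u in links if in_scope(u, seed_only)]
followed_after = [u for u in links if in_scope(u, scoped)]
```

With only the seed's domain in the list, just the blogs.loc.gov link would be followed; after scoping, the copyright.gov and beta.congress.gov links would be followed as well, while the unrelated example.com link stays out.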

Our Recommending Officers (ROs), who select content for the archives, are constantly challenged to identify which seed URLs to archive out of the billions of pages out there. After a seed URL is nominated, our team, often in consultation with the RO, scopes it to identify what is intellectually part of the site. We often go back as sites change or evolve to re-evaluate the scoping.

In our election archives, we find a lot of people and organizations buy up new web domains to support whatever hot topic is in the news (domains are so cheap, why not!). Here’s an (albeit silly) example to help illustrate this issue:

Let’s pretend that there’s a site that’s been identified for archiving that was created by an organization called Puppies for a Stronger Government, and they are at http://www.puppiesforastrongergovernment.com (not a real URL, at least not as of this writing). This site has social media accounts that we’ll want to archive, but they’ve also just announced that a candidate (a cat, of all things!) is running on a platform that they don’t agree with. So they quickly buy up a few domains to get their message out to a variety of audiences: http://kittensdontunderstandpuppies.com and http://virginiansfordogparks.com and http://WeAreDefinitelyCuter.com. They put links to these new domains on their website, and use the same design and contact information.

We’re obviously not really archiving puppy and kitten debates, as much as I might like to pretend we are. But this does illustrate the type of thing we see with content we archive.

Are these new domains intellectually part of the original site? Or should we treat them as new sites entirely? Sometimes, particularly if there aren’t design clues or obvious contact information, it is hard to tell. We can archive them all; that’s not a problem. But how we handle them (as a seed or a scope) affects how comprehensively we might collect something, and whether or not we catalog them for access. If we treat something as a “seed,” it gets more weight in our workflow process; a scope is just an added bonus.
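Here is a rough sketch, again in Python, of the two ways the puppy domains could be handled. The `SeedRecord` class and its field names are hypothetical and are not taken from our curator tool. The sketch models only the cataloging difference described above, not how deeply each URL is crawled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SeedRecord:
    """Hypothetical record: one cataloged seed plus its uncataloged scopes."""
    seed_url: str
    scopes: List[str] = field(default_factory=list)


def cataloged_urls(records):
    """Only seeds are cataloged for access; scopes are collected but not listed."""
    return [r.seed_url for r in records]


def crawled_urls(records):
    """Everything the crawler is told to collect: seeds and their scopes."""
    return [u for r in records for u in [r.seed_url, *r.scopes]]


main = "http://www.puppiesforastrongergovernment.com"
extras = [
    "http://kittensdontunderstandpuppies.com",
    "http://virginiansfordogparks.com",
    "http://WeAreDefinitelyCuter.com",
]

# Option A: the new domains are scopes of the original seed.
as_scopes = [SeedRecord(main, extras)]

# Option B: each new domain is nominated as its own seed.
as_seeds = [SeedRecord(main)] + [SeedRecord(u) for u in extras]
```

Both options hand the same four URLs to the crawler, but Option A yields one catalog entry and Option B yields four.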

And while there is generally an assumption that a website = a domain name, domain names do change. Most would agree that the same website content and organization at a new domain name is still the same “website.”

So, next time you’re asked “what’s your website?” how will you respond?

9/28/12: Typographical errors fixed.
