Archiving the “Intellectual” Components of a Website

The following is a guest post by Abbie Grotke, Web Archiving Team Lead.

You might imagine that with the web being in its twenties everyone would know exactly what a website is. But you’d be surprised – those of us in the web archiving business spend quite a bit of time pondering what makes up a particular organization or person’s website. We’re often challenged to determine what domains constitute any given site we are trying to archive.

Members of "Puppies for a Stronger Government" gathering to design their website? No, just an image from Prints and Photographs collection, Library of Congress LC-USZ62-60676

The Library of Congress must send notices or permission requests to site owners for most content that we collect.  In some cases, content that we collect might have links to other content that either we don’t want or whose owner we didn’t contact.  Knowing the boundaries of a website is an important part of our process.

When we archive websites, our Library recommending officers select a seed URL and enter it into a curator tool that is used to manage our web archiving workflow. The seed URL is the starting point for the crawler, and it is also the URL that we catalog so that researchers can access archived content.  Seed URLs can be any URL on the web, but typically they are the home page at a site's main domain.

Let me provide a close-to-home example. Everyone knows the Library’s URL is www.loc.gov. If we set our crawler to archive that domain, it will follow links to subdomains and pages and content, such as this blog post.

But what about copyright.gov or digitalpreservation.gov? Or our newly launched beta.congress.gov?

All of those domains also equal the Library of Congress website. Our Web Archiving Team has a term for these, and things like an organization or candidate’s social media accounts. We call them “intellectually part of the site.” Humans recognize that they are related, but the machines, the crawling tools, cannot. We have to give special instructions to the crawlers to get that content. We do that by “scoping” them so the crawler knows what links it should follow for that site.
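To make the idea of scoping concrete, here is a minimal sketch in Python of the kind of decision a crawler has to make for each link it finds. It is an illustration only, not the Library's actual crawler configuration: the function names `host_of` and `in_scope`, the `extra_scopes` parameter, and the rule of matching on hostnames are all assumptions made for this example.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a full URL, without a leading 'www.'."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def in_scope(link, seed, extra_scopes=()):
    """Decide whether a crawler should follow `link` for a given seed.

    A link is in scope if its host is the seed's host, a subdomain of it,
    or the host (or a subdomain) of one of the hand-added scope URLs.
    """
    link_host = host_of(link)
    if not link_host:
        return False  # relative or malformed URLs are not handled in this sketch
    allowed = [host_of(seed)] + [host_of(s) for s in extra_scopes]
    return any(link_host == h or link_host.endswith("." + h) for h in allowed)


seed = "http://www.loc.gov/"

# A subdomain of the seed is followed without any extra instructions.
print(in_scope("http://blogs.loc.gov/digitalpreservation/", seed))

# A related site on a different domain is skipped unless a person scopes it in.
print(in_scope("http://www.copyright.gov/", seed))
print(in_scope("http://www.copyright.gov/", seed,
               extra_scopes=["http://www.copyright.gov/"]))
```

With only the seed, the first check passes and the second fails; once copyright.gov is added as a scope, the third passes. The machine can only compare hostnames, so a person has to supply the knowledge that the two domains belong together.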

Our Recommending Officers, who are selecting content for the archives, are constantly challenged in identifying what seed URLs to archive out of the billions of pages out there. After a seed URL is nominated, our team, often in consultation with the RO, scopes it to identify what is intellectually part of the site. We often go back as sites change or evolve, to re-evaluate the scoping.

List of related URLs

In our election archives, we find a lot of people and organizations buy up new web domains to support whatever hot topic is in the news (domains are so cheap, why not!). Here’s an (albeit silly) example to help illustrate this issue:

Let’s pretend that there’s a site that’s been identified for archiving that was created by an organization called Puppies for a Stronger Government, and they are at http://www.puppiesforastrongergovernment.com (not a real URL–at least not as of this writing). This site has social media accounts that we’ll want to archive, but they’ve also just announced that a candidate (a cat, of all things!) is running on a platform that they don’t agree with. So quickly they buy up a few domains to get their message out to a variety of audiences: http://kittensdontunderstandpuppies.com and http://virginiansfordogparks.com and http://WeAreDefinitelyCuter.com. They put links to these new domains on their website, and use the same design and contact information.

We’re obviously not really archiving puppy and kitten debates, as much as I might like to pretend we are. But this does illustrate the type of thing we see with content we archive.

Are these new domains intellectually part of the original site? Or should we treat them as new sites entirely? Sometimes, particularly if there aren’t design clues or obvious contact information, it is hard to tell. We can archive them all; that’s not a problem – but how we handle them (as a seed or a scope) affects how comprehensively we might collect something, and whether or not we catalog them for access. If we treat something as a “seed” it gets more weight in our workflow process; a scope is just an added bonus.

And while there is generally an assumption that a website = a domain name, domain names do change. Most would agree that the same website content and organization at a new domain name is still the same “website.”

So, next time you’re asked “what’s your website?” how will you respond?

9/28/12: Typographical errors fixed.

2 Comments

  1. Sharad Shah
    September 28, 2012 at 3:22 pm

    Reflecting on these digichival/archigital dilemmas, I try to fight off the impending aneurysm.

    That said, I think taking a genealogical approach makes sense (i.e. create a family tree). You have the parent (original) website, its *official* spawn, and illegitimate children who “borrow” content from the parent and each other.

    Treating each site as independent makes sense (but finding ways to point out the relationship between sites is helpful for archival and research purposes)–unless they are mirror/clone sites, in which case the content is not different–only the URLs. (And in those cases, it should be treated as single person with multiple identities.)

    Also, puppies are way cooler than kittens.

  2. Hank the Cat
    September 28, 2012 at 7:45 pm

    While Puppies for a Stronger Government is a fictitious group dreamed up by the author, I want to let these puppies know that I’m more than willing to sit across the table from them and hear what they have to say with regards to government restructuring and the implications of objectifying beauty as a measurable, scientific standard where one can claim superiority over another.

    Phantom puppy groups aside, I can see where this could be quite an issue with the archival process of the internet. Once elected, I plan to help the LOC streamline and better determine a process and scope for easier implementation to archiving the endless web – especially all those cat videos.

    Sincerely,
    Hank the Cat
