Update on the Twitter Archive at the Library of Congress

(The following is a guest post from the Library’s Director of Communications, Gayle Osterberg.)

An element of our mission at the Library of Congress is to collect the story of America and to acquire collections that will have research value. So when the Library had the opportunity to acquire an archive from the popular social media service Twitter, we decided this was a collection that should be here.

In April 2010, the Library and Twitter signed an agreement providing the Library with all public tweets from the company’s inception in 2006 through the date of the agreement. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms.

The Library’s first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date.

This month, all those objectives will be completed. We now have an archive of approximately 170 billion tweets and growing. The volume of tweets the Library receives each day has grown from 140 million beginning in February 2011 to nearly half a billion tweets each day as of October 2012.

The Library’s focus now is on addressing the significant technology challenges to making the archive accessible to researchers in a comprehensive, useful way. These efforts are ongoing and a priority for the Library.

Twitter is a new kind of collection for the Library of Congress but an important one to its mission. As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries.

Although the Library has been building and stabilizing the archive and has not yet offered researchers access, we have nevertheless received approximately 400 inquiries from researchers all over the world. The broad topics of interest they have expressed range from patterns in the rise of citizen journalism and elected officials’ communications to tracking vaccination rates and predicting stock market activity.

Attached is a white paper [PDF] that summarizes the Library’s work to date and outlines present-day progress and challenges.

 

8 Comments

  1. Rick
    January 4, 2013 at 1:43 pm

    When and how will we be able to access all the tweets?

  2. Greg Robie
    January 5, 2013 at 7:04 am

    Though this is not my insight, is it applicable to this project of archiving the Tweets of Twitter (& in #haiku form)?

    Everybody talks
    & nobody is list’nen
    4 not feel’N heard

    cc 2012 greg robie

  3. Greg Robie
    January 5, 2013 at 7:26 am

    And that should have been 2013, not 2012 . . . which is corrected in the Tweet I tweeted of it. ;)

  4. Pete
    January 7, 2013 at 6:23 pm

    I’m sorry, but this is ridiculous. Of all the worthless things to archive, we are spending money on this?

    Only 13% of Americans have Twitter accounts, compared with Facebook holding 70% of Americans. So not only do you have a tiny minority, but the data itself is of extremely poor quality.

    At least with, say, Facebook, people are able to “write sentences”, and people publish articles, poems, etc. What was the last story you read entirely on Twitter?

  5. Erin Allen
    January 8, 2013 at 9:16 am

    Thanks, Rick. Technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. The Library is working to develop a basic level of access that can be implemented while archival access technologies catch up. These efforts are ongoing and a priority for the Library, but we cannot provide an estimated timeframe at this point. When technologically feasible, access to the archive would be offered to researchers on site.

  6. Michael Manoochehri
    January 8, 2013 at 4:40 pm

    Hi Gayle and Erin:

    I am lead for Google’s Developer Relations efforts around our Data products. I have been following this effort and would like to talk to you about some ideas I have for actually exposing this data to the public. I think we can get the query times down quite a bit further than 24 hours :-)

    - Michael Manoochehri

  7. Helmut Schwarzer
    January 8, 2013 at 4:47 pm

    To preserve and enshrine a Mount Everest of largely mindless babble by Mr. & Ms Everyman strikes me as utterly risible.

  8. Kate Bowers
    January 9, 2013 at 2:00 pm

    To those wondering about why we should keep such “low quality” communications, please consider that the Library of Congress originally rejected an archival copy of Walt Disney’s Bambi because the film was deemed “saccharine and derivative.” In short: archival value and perceived value are not the same thing.
