Because the Library of Congress is archiving every tweet ever: poop.
— Victor Luckerson (@VLuck) April 15, 2010
…That’s a message I tweeted back in April 2010, when it was announced that the Library of Congress was planning to archive every publicly available tweet ever posted on the social network. Now, almost three years later, the Library’s Twitter archive is beginning to take shape, and there are clues as to how researchers might put all of our 140-character witticisms to use.
When the Library initially took on the Twitter archive in 2010, it was already a daunting 21 billion tweets filled with words, hashtags, geolocation info, and other metadata. Today the Library has access to more than 170 billion tweets, or about 85 terabytes of data. With about half a billion tweets now flowing into the archive daily, the biggest immediate challenge is finding a way to make all this information coherent and usable.
“One of the things that makes this collection a little bit different is the velocity with which it’s growing,” says Gayle Osterberg, director of communications for the Library of Congress. “The computing capacity to search for an item or a series of items across billions and billions of tweets isn’t cost-effective at the present time for a public institution.”
Osterberg says the costs associated with the project, in terms of developing the infrastructure to house the tweets, are in the low tens of thousands of dollars. The tweets were a gift from Twitter, and are being transferred to the Library through a separate company, Gnip, at no cost. Each day tweets are automatically pulled in from Gnip, organized chronologically and scanned to ensure they’re not corrupted. Then the data are stored on two separate tapes, which are housed in different parts of the Library for security reasons.
The Library has mostly figured out how to organize the archive, but usability remains a challenge. A simple query of just the 2006-2010 tweets currently takes about 24 hours. Increasing search speeds to a reasonable level would require purchasing hundreds of servers, which the Library says is financially unfeasible right now. There’s no timetable for when the tweets might become accessible to researchers.
“The goal would be to be able to answer whatever query a researcher might have here at the Library in our reading room,” Osterberg says. “The balance is making the access both meaningful and cost-effective for the Library.”
While you can’t yet make a trip to Washington, D.C. and casually peruse all the world’s tweets, the technology to do exactly that is readily available, at a price. Gnip, the organization feeding the tweets to the Library, is a social media data company that has exclusive access to the Twitter “firehose,” the never-ending, comprehensive stream of all of our tweets. Companies such as IBM pay for Gnip’s services, which also include access to posts from other social networks like Facebook and Tumblr. The company also works with academics and public policy experts, the type of people likely to make use of a free, government-sponsored Twitter archive when it comes to fruition.
Through Gnip, researchers have already made extensive use of much of the Twitter archive. Sherry Emery, a senior scientist at the Institute for Health Research and Policy at the University of Illinois at Chicago, analyzes tweets about smoking to understand the role of media in influencing the habit. When the Centers for Disease Control and Prevention launched a graphic anti-smoking ad campaign last spring, Emery and her team were able to analyze every public tweet about smoking and understand how people were reacting to the commercials.
“We can’t use Twitter to look at whether they actually quit smoking,” Emery says. “But we can really get a better understanding of whether people embraced the message of the ad or not, which is an important intermediate step to making changes in their behavior.”
Even with the tools available to quickly search the Twitter archive, making sense of such a huge dataset can be a challenge. Emery’s group has amassed more than 50 million tweets about smoking since December 2011. “A big part of what we do is just cleaning the data to make sure that the tweets that we’re looking at are about smoking tobacco and not about smoking weed or smoking ribs or smoking hot girls,” she says.
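That cleaning step, separating tweets about smoking tobacco from tweets about smoking ribs or “smoking hot” anything, is at heart a filtering problem. A minimal sketch of such a keyword filter might look like the following; the hint lists and sample tweets here are illustrative, not drawn from Emery’s actual dataset or methods.

```python
# Minimal sketch of a keyword filter that keeps likely tobacco-related
# tweets and drops common false positives. All terms are illustrative.

TOBACCO_HINTS = {"cigarette", "cigarettes", "tobacco", "quit", "nicotine"}
FALSE_POSITIVE_HINTS = {"ribs", "brisket", "weed", "hot"}

def looks_like_tobacco_tweet(text: str) -> bool:
    """Keep a tweet only if it mentions a tobacco term and no red-flag term."""
    words = set(text.lower().replace("!", "").replace(".", "").split())
    if words & FALSE_POSITIVE_HINTS:
        return False
    return bool(words & TOBACCO_HINTS)

tweets = [
    "Trying to quit smoking for good this year",
    "Smoking ribs all day for the cookout",
    "That new tobacco ad from the CDC is intense",
    "smoking hot girls at the beach lol",
]
kept = [t for t in tweets if looks_like_tobacco_tweet(t)]
```

A production pipeline would be far more sophisticated, but the principle is the same: most of the work is deciding which mentions of a word actually belong in the dataset.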
Using computer software to assess human emotion in tweets is also tricky. For example, someone tweeting “This is scary!” about the CDC commercial featuring former smokers with artificial voice boxes might seem like a negative reaction to a computer program. In fact, it’s the desired effect for an ad aimed at curbing smoking. Human coders have to feed the computer between 500 and 1,000 sample tweets to help it properly understand how to organize responses to a research question.
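The workflow described above, where human coders hand-label a few hundred sample tweets and software generalizes from them, is a standard supervised-classification setup. A toy sketch of the idea follows; the labels, example tweets, and simple word-overlap scoring are all hypothetical stand-ins for the far larger training sets and models a real study would use.

```python
# Toy sketch of learning from human-labeled tweets, in the spirit of the
# 500-1,000 hand-coded examples described above. Everything here is
# illustrative: real systems use proper statistical models.
from collections import Counter

# Hypothetical hand-labeled training tweets:
# "embraced" = the tweeter took the ad's message to heart.
training = [
    ("this ad is scary i need to quit", "embraced"),
    ("that commercial made me want to quit smoking", "embraced"),
    ("so scary but so true time to quit", "embraced"),
    ("these ads are annoying stop showing them", "rejected"),
    ("ugh another preachy anti smoking ad", "rejected"),
]

# Count how often each word appears under each label.
word_counts = {"embraced": Counter(), "rejected": Counter()}
for text, label in training:
    word_counts[label].update(text.split())

def classify(text: str) -> str:
    """Pick the label whose training vocabulary best overlaps the tweet."""
    scores = {
        label: sum(counts[word] for word in text.split())
        for label, counts in word_counts.items()
    }
    return max(scores, key=scores.get)
```

The payoff of this setup is exactly the “scary” example: because human coders labeled fearful-sounding tweets as embracing the ad’s message, the software learns that fear words signal the campaign working, not failing.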
Other uses for tweets have also emerged. Daniel Hodd, a business student at Fordham University, is studying the conversation surrounding 50 stocks on Twitter to see if investor sentiment correlates with stock price. Chris Cantey, a master’s student studying cartography at the University of Wisconsin, has already used the more limited Twitter API system to geographically map last month’s flu outbreak. He’s now using the full firehose to analyze how responses to Hurricane Sandy unfolded in real time.
All the researchers agree that Twitter is a powerful tool for sociological study. Soon, if the Library of Congress can make its database fully functional, it’ll also be an easily accessible one. And one day, long after we’ve all sent our final snarky tweet, our messages will live on.
“Social media gives even those among us who don’t have the time to pick up a pen every day an opportunity to be recorders of and witnesses to history,” says Osterberg. “Those perspectives will be incredibly valuable to researchers and authors and policy makers down the road who want to understand the times we’re living in today.”