The curious case of the closed access data set in the open access journal


Published on November 14, 2013 by Dr. Randal S. Olson




Earlier this year, I ran across a news article that got me really excited in the science-nerdy kind of way. The article talked about how we could measure how happy the people in each U.S. state are just by looking at geotagged tweets. It even linked to a swanky web app that the researchers had put together showing the "average happiness" of the U.S. since 2009. I have a penchant for playing around with social network data, so I was ecstatic when I saw that the authors had published the corresponding article in PLoS ONE (two articles, in fact).

That means I could easily get my hands on the raw data set, right?

Wrong.

I asked the authors for the raw data, and a week later they replied that they couldn't share the raw Twitter data. They're absolutely right, of course. It says so right there in the Twitter API usage terms.

Twitter has made it clear time and again that they don't want Twitter content being stored outside of Twitter, and they especially don't want people sharing that Twitter data if it is stored external to Twitter. Basically: If you want Twitter data, you have to go to them and access it through the Twitter API yourself. Here are a couple of clauses in the Twitter API usage terms that make Twitter data difficult to use for research:

You shall not use Twitter Content or other data collected from end users to create or maintain a separate status update or social network database or service.

You will not attempt or encourage others to use or access the Twitter API to aggregate, cache (except as part of a Tweet), or store place and other geographic location information contained in Twitter Content.

The problem is: That restriction directly contradicts PLoS ONE's rules about sharing data. In fact, it's bolded right there on the web site:

PLoS ONE will not consider a study if the conclusions depend solely on the analysis of proprietary data.

PLoS ONE's stance on proprietary data makes sense. After all, one of the major reasons PLoS was founded was to make research easily accessible and reproducible -- and that entails sharing the raw data underlying every study.

So, what can be done about this curious case of the closed access data set in the open access journal?

Does this mean researchers using Twitter data can't publish in open access journals?

Does this render Twitter an unviable platform for studying social networks, if the ultimate goal is to publish the study open access?

I don't really have any answers, and the folks at PLoS ONE have been pondering it since June.

Any thoughts?

Update (11/17/2013) -- possible solution?

After a brief email conversation with Jonathan Eisen (partially shown here in the comments), we came up with a couple of possible solutions:

Jonathan Eisen
... I think the only way I would ponder allowing something to be published would be if the full workflow for ALL analyses of said data was released so that at least people could examine and try to use the workflow themselves. If not, I don't like it.

Randy Olson
The authors explained their method in the paper. However, a critical component of replicating the study is accessing Twitter to get the tweets from 2011 that they actually used for the study, which by now is extremely difficult if not impossible. (By default, the Twitter API only accesses recent tweets.) IMO, that makes the study irreproducible.

Some possible solutions given Twitter's data sharing restrictions:

1) If researchers could denote which tweets they used in a study (e.g. with a list of tweet IDs) and Twitter allowed the mining of specific tweet IDs, then the study would be semi-reproducible. The person replicating the experiment would still have to mine all 10,000,000 tweets, which is a significant burden, but at least it would be possible to access the same data used in the study again.

2) If Twitter could allow researchers to register a set of tweets in Twitter with a key name (e.g., "geo-happiness-plos-one-2013-tweets"), then researchers reproducing the study could contact Twitter and ask for the set of tweets with that key name. That of course places a burden on Twitter to organize tweets in a certain way, which I doubt they will do (unless there's $$$ in it).
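
To make option 1 a bit more concrete, here is a rough Python sketch of what re-collecting a study's tweets from a published ID list might look like. It is only an illustration under a few assumptions: `fetch_batch` is a hypothetical stand-in for whatever lookup-by-ID call Twitter would offer (it is not a real library function), and the batch size of 100 IDs per request is an assumed limit, not something taken from the paper or the API terms.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def batched(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` tweet IDs."""
    batch: List[int] = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def rehydrate(
    ids: Iterable[int],
    fetch_batch: Callable[[List[int]], Dict[int, str]],
    batch_size: int = 100,  # assumed per-request limit
) -> Tuple[Dict[int, str], List[int]]:
    """Re-collect tweets by ID.

    `fetch_batch` is a hypothetical callable that takes a list of IDs and
    returns {tweet_id: text} for the tweets that still exist.
    Returns the recovered tweets and the IDs that could not be recovered
    (e.g. deleted tweets), so the gap can be reported in a replication.
    """
    tweets: Dict[int, str] = {}
    missing: List[int] = []
    for batch in batched(ids, batch_size):
        found = fetch_batch(batch)
        tweets.update(found)
        missing.extend(i for i in batch if i not in found)
    return tweets, missing


def requests_needed(n_ids: int, batch_size: int = 100) -> int:
    """Number of lookup requests needed for `n_ids` tweets (ceiling division)."""
    return -(-n_ids // batch_size)
```

Under those assumptions, `requests_needed(10_000_000)` comes out to 100,000 lookup requests, which gives a feel for the burden on whoever replicates the study. Keeping track of the `missing` IDs matters too: any tweet deleted since the original study simply can't be recovered, which is part of why I call this only semi-reproducible.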

What do you think?