The curious case of the closed access data set in the open access journal

Earlier this year, I ran across a news article that got me really excited in a science-nerdy kind of way. The article described how we could measure how happy the people in each U.S. state are just by looking at geotagged tweets. It even linked to a swanky web app the researchers had put together showing the “average happiness” of the U.S. since 2009. I have a penchant for playing around with social network data, so I was ecstatic to see that the authors had published the corresponding article in PLoS ONE (two articles, in fact).

That means I could easily get my hands on the raw data set, right?

Wrong.

I received a response to my raw data request a week later saying that they couldn’t share the raw Twitter data. They’re absolutely right, of course. It says it right there in the Twitter API usage terms.

Twitter has made it clear time and again that they don’t want Twitter content stored outside of Twitter, and they especially don’t want people sharing Twitter data that is stored externally. Basically: if you want Twitter data, you have to go to them and access it through the Twitter API yourself. Here are a couple of clauses in the Twitter API usage terms that make it difficult to use for research:

You shall not use Twitter Content or other data collected from end users to create or maintain a separate status update or social network database or service.

You will not attempt or encourage others to use or access the Twitter API to aggregate, cache (except as part of a Tweet), or store place and other geographic location information contained in Twitter Content.

The problem is: That restriction directly contradicts PLoS ONE’s rules about sharing data. In fact, it’s bolded right there on the web site:

PLoS ONE will not consider a study if the conclusions depend solely on the analysis of proprietary data.

PLoS ONE’s stance on proprietary data makes sense. After all, one of the major reasons PLoS was founded was to make research easily accessible and reproducible — and that entails sharing the raw data underlying every study.

So, what can be done about this curious case of the closed access data set in the open access journal?

Does this mean researchers using Twitter data can’t publish in open access journals?

Does this make Twitter a nonviable platform for studying social networks, if the ultimate goal is to publish the study open access?

I don’t really have any answers, and the folks at PLoS ONE have been pondering it since June.

Any thoughts?

Update (11/17/2013) — possible solution?

After a brief email conversation with Jonathan Eisen (partially shown here in the comments), we reached a couple of possible solutions:

Jonathan Eisen
… I think the only way I would ponder allowing something to be published would be if the full workflow for ALL analyses of said data was released so that at least people could examine and try to use the workflow themselves. If not, I don’t like it.

Randy Olson
The authors explained their method in the paper. However, a critical component of replicating the study is accessing Twitter to get the tweets from 2011 that they actually used for the study, which by now is extremely difficult if not impossible. (By default, the Twitter API only accesses recent tweets.) IMO, that makes the study irreproducible.

Some possible solutions given Twitter’s data sharing restrictions:

1) If researchers could denote which tweets they used in a study (e.g., with a list of tweet IDs) and Twitter allowed the mining of specific tweet IDs, then the study would be semi-reproducible. The person replicating the experiment would still have to mine all 10,000,000 tweets, which is a significant burden, but at least it would be possible to access the same data used in the study again.

2) If Twitter could allow researchers to register a set of tweets in Twitter with a key name (e.g., “geo-happiness-plos-one-2013-tweets”), then researchers reproducing the study could contact Twitter and ask for the set of tweets with that key name. That of course places a burden on Twitter to organize tweets in a certain way, which I doubt they will do (unless there’s $$$ in it).

What do you think?

Dr. Randy Olson is a Senior Data Scientist at the University of Pennsylvania, where he develops state-of-the-art machine learning algorithms with a focus on biomedical applications.

Posted in open science, philosophy
  • I’ve published in PLoS One, and my paper had *some* data I can’t share because it has protected health information. Open data needs to consider consent.

While Twitter is public, consent to have your tweets published as part of a research study seems like a different thing.

    • +1. I think it’s perfectly fine to exclude personally identifying information, especially in the case of health-related data. However, that doesn’t prevent the sharing of the data in an anonymized format. Just assign each person a random ID # and that should be fine to publish, right?

  • I think that there are cases where privacy issues might limit what data one should / would be allowed to reshare. This is quite common in certain social sciences for example. And some medical sciences. And it makes sense that some pieces of information should be held back. But then how can one make an analysis of such information “science” and reproducible in some way? I think a solution would be to require full publication of an entire workflow relating to how the information was gathered and processed to allow as best as possible others to assess what one did. And one would / should have some process by which others can access the information to reanalyze it even if it is kept “closed”. I am pretty sure there are systems for this in the social sciences …

It’s simple, really: either PLoS ONE needs to change their policy, or the paper should be retracted and republished in a closed access journal.

The latter was my initial reaction as well. However, that means that all studies based on restricted data can’t be published in open access journals like PLoS ONE. Do we really want to force these authors to send their papers to closed access journals, where the entire study is behind a paywall?

      • I think this gets at what we mean by reproducibility. If you’re looking to redo the analysis to see if the analytical technique or code works in your hands, that’s one thing. If you’re looking to see if their result about inferring happiness is more generally true, then you’d want to re-run the analysis in a set of tweets collected today to see if you can make the same determinations.

        I won’t argue against open data and I agree it seems to clash with the journal’s policy, but I do think the second case is a more meaningful way to replicate the result.

  • Great post Randy. This is definitely an important issue and it goes far beyond Twitter data. Even a lot of really great core ecological datasets have restrictions on access and redistributing. For example, eBird, the Christmas Bird Count data collected by the Audubon Society, and the North American Butterfly Count data collected by the North American Butterfly Association, are all not available for public download and all have non-redistribution clauses if you negotiate for access to this data. This is of course a huge issue for reproducibility and one that we should work on improving through dialogue with the data providers, but should we really not use these amazing datasets to do science in otherwise open ways? It’s a hard question.

Our solution to this has been along the lines of what Jonathan Eisen proposes, but with the addition of using as much open data as possible in addition to the closed data. The same code runs on all of the datasets, so the analysis of the open data is fully replicable with raw data provided. We then add the first reasonable intermediate data product that doesn’t violate the non-redistribution clause for the closed data, so that analysis from that point on can be replicated for the closed data. See, e.g., https://github.com/weecology/white-etal-2012-ecology. This is far from perfect, but it’s the best middle ground I’ve found for working with cool closed data.
