Why posts get removed from /r/DataIsBeautiful

I’ve been a moderator of /r/DataIsBeautiful — one of the largest online communities dedicated to data analysis and visualization — for the past 2 1/2 years. During that time, I’ve reviewed thousands of data visualizations created by amateurs and professionals alike.

(For those not in the know: Moderators on Reddit volunteer their time to help run the subreddits. They remove spam, enforce posting rules, and handle various other tasks to keep the subreddits on-topic and spam-free.)

Although moderating on Reddit is often a thankless job, my experience on /r/DataIsBeautiful has given me a unique perspective on the data visualization world.

For this post, I thought it would be a fun exercise to visualize how /r/DataIsBeautiful’s posting rules have evolved over time. After all, it only seems appropriate that a /r/DataIsBeautiful moderator would analyze and visualize their own community, right?

/r/DataIsBeautiful currently has five core posting rules (1-5), and three experimental rules (6-8):

[Figure: /r/DataIsBeautiful posting rules]

These posting rules try to provide objective criteria for what makes an appropriate /r/DataIsBeautiful post, and generally make sure that:

  1. The post actually contains a data visualization
  2. The post gives appropriate credit to the data visualization’s creator
  3. The post isn’t obviously and maliciously misleading

Evolution of /r/DataIsBeautiful’s posting rules

I’ve always been curious about the relative importance of these rules over time, so I analyzed the Reddit comment cache on BigQuery, parsed out the official moderator comments that /r/DataIsBeautiful moderators made when removing posts, and binned them by month.
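For readers who want to try something similar, here is a minimal sketch of the binning step in Python. It is not the code I ran: it assumes the comments have already been exported as dicts with `author`, `body`, and `created_utc` keys, and that a removal comment cites the broken rule as "rule N". The `RULE_PATTERN` regex, the `bin_removals_by_month` helper, and the moderator set are all hypothetical names made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Hypothetical pattern: assumes removal comments cite the broken rule as "rule N".
RULE_PATTERN = re.compile(r"\brule\s+(\d+)\b", re.IGNORECASE)


def bin_removals_by_month(comments, moderators):
    """Count removal comments per (YYYY-MM, rule number).

    `comments` is an iterable of dicts with `author`, `body`, and
    `created_utc` (Unix seconds) keys; `moderators` is a set of usernames.
    """
    counts = Counter()
    for comment in comments:
        if comment["author"] not in moderators:
            continue  # skip comments that were not written by a moderator
        match = RULE_PATTERN.search(comment["body"])
        if match is None:
            continue  # moderator comment that does not cite a rule
        month = datetime.fromtimestamp(
            comment["created_utc"], tz=timezone.utc
        ).strftime("%Y-%m")
        counts[(month, int(match.group(1)))] += 1
    return counts
```

Only the first rule number found in each comment is counted, which is a simplification; a comment that cites several rules would need a `findall` instead.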

I was able to analyze the comments between January 2013 and January 2016, which provides a unique perspective on the subreddit before and after it became a default subreddit in early 2014.

[Figure: fraction of /r/DataIsBeautiful post removals by removal reason, per month]

By far, the two biggest reasons for posts being removed from /r/DataIsBeautiful are:

  • failing to link to something that includes a data visualization, and
  • failing to properly credit the original author of the visualization

Today, those two reasons constitute roughly 50% and 40% of all post removals, respectively.
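Percentages like these come from dividing each reason's monthly count by that month's total removals. A rough sketch of that step, assuming the counts are already stored in a dict keyed by `(month, reason)` (the `monthly_fractions` helper and that data layout are my own illustration, not a description of the exact code behind the chart):

```python
from collections import defaultdict


def monthly_fractions(counts):
    """Convert {(month, reason): count} into {month: {reason: fraction}}."""
    # First pass: total removals per month.
    totals = defaultdict(int)
    for (month, _reason), n in counts.items():
        totals[month] += n

    # Second pass: each reason's share of its month's total.
    fractions = defaultdict(dict)
    for (month, reason), n in counts.items():
        fractions[month][reason] = n / totals[month]
    return dict(fractions)
```

Normalizing within each month keeps the chart comparable across time, since the raw number of removals changes as the subreddit grows.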

Interestingly, ever since /r/DataIsBeautiful became a default subreddit, it has been receiving more and more posts that don’t even include data visualizations. I believe this trend highlights the importance of an active moderation team as a community grows: As /r/DataIsBeautiful’s subscriber numbers climbed from the hundreds of thousands into the millions, the moderators were there to help the community stay on track and share relevant content.

You’ll also notice that there was a stint in 2014 where the moderation team required that all posts linking directly to images must link to PNGs (due to text quality issues with JPEGs), but that rule was replaced when we enacted the rule requiring that all posts link directly to the original source. Since only Original Content creators can post direct links to images now, we decided that it was best to allow them to decide how they wanted to share their content.

You may also notice the lack of the “no political posts except for Thursdays” rule, which was introduced in February 2016. Since February 2016’s comment data is not yet available, I’ll have to leave that analysis for a future post.

Post removal counts

For completeness, I’ve also included the raw counts for each post removal reason below.

[Figure: number of /r/DataIsBeautiful post removals by removal reason, per month]

You’ll likely notice the spike in “original source” removals in late 2014, which was due to the /r/DataIsBeautiful mod team more strictly enforcing direct links to the original source. The community took a couple of months to get used to the rule, but removal counts eventually returned to normal.

For the most part, posting rules 3-8 deal with edge cases: posters confusing infographics with data visualizations, sensationalized post titles, or reposts of links that were already shared recently. While these rules are nonetheless important, this analysis has shown us that just two of the eight posting rules do most of the work of keeping /r/DataIsBeautiful on track.

We’ll be sure to continually review and revise these posting rules as the /r/DataIsBeautiful community grows. If you have ideas on how we can improve the community — through posting rules or otherwise — please reach out to us by modmail.

Dr. Randy Olson is a Senior Data Scientist at the University of Pennsylvania, where he develops state-of-the-art machine learning algorithms with a focus on biomedical applications.

Posted in data visualization, reddit
  • Mario Lurig

    your legend says “Visualation”… you forgot the iz on both of your images.

    • You haven’t heard of visualations? They’re all the rage now, Mario!

      Hehe… thanks. I’ve fixed the typos.

  • Andrew Gelman

    Randy:
    I think it would be better if the order of the colors in the legend were the same as the order of the curves on the graph: that is, orange on top, then blue. That could make it easier to follow.