Revisiting the vaccine visualizations

Last year, the vaccination debate was all the rage again. “Pro-vaxxers” were loudly proclaiming that everyone should get vaccinated and discussing the science behind it, and “anti-vaxxers” were casting their doubts and still refusing to get vaccinated for personal reasons. Around that time, The Wall Street Journal released a brilliant series of heat maps showing infection rates for various diseases over time, broken down by state. These heat maps easily demonstrated one of the most important facts in the vaccination debate: Time and time again, vaccines work.

wsj-polio-dataviz

Today, I would like to revisit the WSJ’s heat maps through the lens of a data visualization practitioner. In particular, I would like to show how these heat maps could be improved by reviewing some basic rules of data visualization and trying out some other methods for displaying the data. Below, I’m going to walk through four major criticisms and show how addressing each one could improve the original work.

For the curious, I’ve released my notebook with the Python code used to generate the new visualizations.

Categorical color palettes should not be used to display continuous values

Perhaps one of the most glaring issues with the original WSJ heat maps was their use of a custom categorical color palette to display the infection rates. The palette runs through most of the colors of the rainbow at seemingly random intervals. It’s possible that they calculated the quantiles to determine the ranges for the color bins (as they should!), but that wasn’t indicated in their methodology.
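For readers unfamiliar with quantile binning, here is a minimal sketch of the idea using only the Python standard library. The function names and the example rates are made up for illustration; this is not the WSJ’s method (which wasn’t published) and not necessarily what my notebook does.

```python
import bisect
import statistics


def quantile_bin_edges(rates, n_bins=5):
    """Return the n_bins - 1 cut points that split `rates` into equal-count bins."""
    return statistics.quantiles(rates, n=n_bins, method="inclusive")


def assign_bin(rate, edges):
    """Return the 0-based index of the bin that `rate` falls into."""
    return bisect.bisect_right(edges, rate)


# Made-up infection rates (cases per 100k people), for illustration only.
example_rates = [0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
edges = quantile_bin_edges(example_rates, n_bins=4)
bins = [assign_bin(rate, edges) for rate in example_rates]
```

With quantile-based edges, each color bin covers roughly the same number of data points, so no single color dominates the chart just because the data is skewed.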

In any case, it’s rarely a good idea to use multiple colors to display a single continuous variable. Here, all we want to do is use color to show the infection rates for each year. If we use more than one color, our readers have to constantly refer back to the legend to figure out what each color means, which is an unnecessary cognitive strain on our reader. Instead, we should use a single-color sequential palette, where lighter shades indicate lower values and darker shades indicate higher values. I’ve reworked the Polio heat map to do just that below.

polio-cases-heatmap-sequential-colormap
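To make the idea of a single-color sequential palette concrete, here is a rough sketch that maps a value to a shade between white and one dark color. The function name, the default dark blue, and the plain linear interpolation in RGB are all assumptions for illustration. Established palettes in plotting libraries are tuned more carefully for perception, so treat this as a demonstration of the principle rather than a palette to ship.

```python
def sequential_color(value, vmin, vmax, dark=(8, 48, 107)):
    """Map `value` to a hex color between white (at vmin) and `dark` (at vmax)."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    # Position of the value in [0, 1], clamped so out-of-range values stay valid.
    t = min(max((value - vmin) / (vmax - vmin), 0.0), 1.0)
    # Blend each RGB channel from 255 (white) toward the dark color.
    channels = [round(255 + (c - 255) * t) for c in dark]
    return "#{:02x}{:02x}{:02x}".format(*channels)


lowest = sequential_color(0, 0, 100)     # lightest shade
highest = sequential_color(100, 0, 100)  # darkest shade
```

Because only the lightness changes, a reader can rank any two cells by eye without consulting the legend.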

One exception to this “rule,” of course, is diverging color palettes. If there is a clear divide in our continuous variable, for example if we’re displaying gains and losses for a company, then it could be appropriate to use a diverging color palette with one color to represent gains (values >= $0) and another to represent losses (values < $0).

Just for fun, I recreated the same chart above for Measles so we can compare it to the originals on WSJ.

measles-cases-heatmap-sequential-colormap

Multi-hue color palettes should take color blindness into account

Color blindness is probably one of the most-overlooked issues in data visualization, and the WSJ heat maps are a great example. I ran the WSJ heat map above through a color blindness simulator for red-green color blindness — the most common form of color blindness — and below is the result.

wsj-polio-dataviz-deuteranopia

Disastrous! Much of the color gradient is lost in some yellow/grey abyss, and the dark purple colors represent low values whereas the lighter yellow and dark grey colors represent higher values. This color palette survives better than most and the main message is still (mostly) communicated, but the WSJ color palette is certainly far from ideal here.

For comparison, I ran my rework from above through the same red-green color blindness simulator. As we can see, the simple sequential color palette is practically unaffected by this form of color blindness. Problem solved!

polio-cases-heatmap-sequential-colormap-deuteranopia

The main lesson here is that we should always run our color palette through a color blindness simulator before committing to it. Roughly 5% of our audience will experience our data visualizations through that lens.
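A simulator is the real test, but a quick numeric sanity check can catch bad palettes early. The sketch below computes the standard sRGB relative luminance of each color and checks whether lightness moves in one direction along the palette. The function names and the two example palettes are assumptions for illustration, and monotonic lightness is only a rough proxy for color blindness safety, not a replacement for running a simulator.

```python
def relative_luminance(hex_color):
    """Relative luminance (0 = black, 1 = white) of an sRGB color like '#rrggbb'."""
    hex_color = hex_color.lstrip("#")
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4)]
    # Undo the sRGB gamma curve before weighting the channels.
    linear = [
        c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
        for c in channels
    ]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def lightness_is_monotonic(palette):
    """True if luminance strictly increases or strictly decreases along the palette."""
    lum = [relative_luminance(color) for color in palette]
    steps = [b - a for a, b in zip(lum, lum[1:])]
    return all(s > 0 for s in steps) or all(s < 0 for s in steps)


# Illustrative palettes: a light-to-dark blue ramp and a crude rainbow.
blues = ["#f7fbff", "#9ecae1", "#4292c6", "#08306b"]
rainbow = ["#ff0000", "#ffff00", "#00ff00", "#0000ff"]
```

Here `lightness_is_monotonic(blues)` holds while `lightness_is_monotonic(rainbow)` does not, which matches the intuition that a palette ordered by lightness still reads correctly when hue information is lost.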

Color can’t display specific values very well

One of the major drawbacks of heat maps is that they rely on color to communicate the specific values in each cell. While it’s not always important to display a precise value, there can sometimes be important trends hiding in these small differences. For that reason, I reworked the Polio heat map into a simple line chart below, where each light line is a state and the dark line is the median value across all the states for each year.

polio-cases-line-chart-raw
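The dark median line in a chart like this can be computed in a few lines. This is a minimal sketch with made-up numbers and a hypothetical `yearly_median` helper. It assumes every state has a value for every year, in the same year order, with no missing data.

```python
import statistics


def yearly_median(rates_by_state):
    """Return the median rate across states for each year position."""
    series = list(rates_by_state.values())
    # zip(*series) regroups the per-state lists into per-year tuples.
    return [statistics.median(year_values) for year_values in zip(*series)]


# Made-up rates for three placeholder states over three years.
example = {
    "A": [10.0, 20.0, 2.0],
    "B": [30.0, 40.0, 1.0],
    "C": [20.0, 90.0, 0.0],
}
medians = yearly_median(example)  # [20.0, 40.0, 1.0]
```

Using the median rather than the mean keeps a single extreme state, like "C" in the second year here, from dragging the summary line upward.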

The above chart isn’t too useful, and the data is too messy to make much sense of the state-by-state trends. However, the decline in infection rates after the introduction of the vaccine is abundantly clear even in this case.

No post of mine is complete without small multiples, so let’s give that a try. Below, each state has its own chart, and all 50 states (+ D.C.) are put on the same time axis.

polio-cases-small-multiples

Each line tells its own story, and these are stories that were masked in the heat maps. Small multiples allow us to see specific state-by-state trends. For example, Polio outbreaks were already on the decline in South Dakota even before the introduction of the Polio vaccine. Meanwhile, Polio outbreaks were at their worst in New Hampshire just prior to the introduction of the Polio vaccine, which made short work of Polio immediately thereafter.

We should always ask ourselves when designing data visualizations: Do we care about the broader story, or the smaller stories? In this case we could go either way, but the direction we go depends on the story we want to tell.

Sometimes you can show too much data

Another fair criticism of all the data visualizations shown so far is that they show too much data. After all, the main message of the WSJ heat maps was simple: When introduced to human populations, vaccines work. There’s no need to show the state-by-state trends then; in fact, we may be overwhelming our reader by providing too much data that doesn’t get right to the point. For example, what happened with Polio in Utah, with the infection rate more than doubling after the introduction of the Polio vaccine? Or what about South Dakota, where Polio seems to have been mostly eliminated even before the vaccines were made available?

These outliers are distractions from the overall trend. We can overcome these distractions by applying a simple statistical analysis to the data and showing the overall trend with confidence bounds. Below, I’ve done just that by plotting the median Polio infection rate across all states (dark line) with bootstrapped 95% confidence intervals (shaded area).

polio-cases-line-chart-statistics
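For anyone curious how a bootstrapped confidence interval for a median works, here is a minimal sketch of the percentile bootstrap using only the standard library. The function name, the number of resamples, and the example rates are assumptions for illustration. Plotting libraries that offer bootstrapped intervals may differ in the details, so this shows the general idea rather than the exact computation behind the chart.

```python
import random
import statistics


def bootstrap_median_ci(values, n_boot=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    tail = (1.0 - confidence) / 2.0
    lower = medians[int(tail * (n_boot - 1))]
    upper = medians[int((1.0 - tail) * (n_boot - 1))]
    return lower, upper


# Made-up rates across states for a single year, for illustration only.
rates = [3.1, 4.7, 5.0, 6.2, 8.9, 12.4, 15.0, 22.3, 40.1]
low, high = bootstrap_median_ci(rates)
```

Repeating this for every year and shading the region between `low` and `high` gives a band like the one in the chart above.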

By summarizing the data with some basic statistics, we’ve removed the distractions and gotten straight to the point: Overall in the U.S., Polio outbreaks were on the rise from the 1940s onward. Right at the introduction of the Polio vaccine in 1955, we immediately saw a decline in Polio outbreaks until it was practically eliminated in the 1960s.

Again, we should always consider our story when designing data visualizations. If we have one clear story that we want to communicate, we should consider reducing the amount of data we show to the point that we can effectively — and honestly — communicate our story. There’s no point in confusing our reader with unnecessary details, unless those details contain an important caveat.

An aside

At face value, these charts only demonstrate correlations: When vaccinations were introduced to the population, the prevalence of infectious disease decreased shortly thereafter. I believe it’s important to point out here that even though I want to focus on data visualization techniques in this post, the science behind vaccination is not up for debate, and these charts are in fact demonstrating a proven causal relationship. Please don’t waste your time typing out “correlation != causation” in the comments.

Conclusions

To wrap up, these are the lessons we’ve drawn from revisiting the popularized vaccine visualizations:

  1. Use sequential color schemes when presenting continuous values
  2. Consider color blindness before committing to a color scheme
  3. When presenting specific values is important, don’t use color to represent those values
  4. Only show enough data to effectively and honestly tell your story

If you liked what you saw in this post and want to learn more, check out my Python data visualization video course that I made in collaboration with O’Reilly. In just one hour, I will cover these topics and much more, which will provide you with a strong starting point for your career in data visualization.

Dr. Randy Olson is a Senior Data Scientist at the University of Pennsylvania, where he develops state-of-the-art machine learning algorithms with a focus on biomedical applications.

Posted in data visualization, python, tutorial
  • Jim Fowler

    Randal, awesome graphics. Has anyone discussed the ramp-up of polio prior to the vaccine? The measles plot is more what I would expect: a higher background rate before, a lower rate after. But what was going on with the polio rate before?

    • I believe the sharp rise of Polio outbreaks is what drove the rapid development and introduction of the Polio vaccine. From my readings, it was a rapidly growing crisis at the time. Perhaps others who know more about the history of Polio can chime in.

  • Great post Randy!

    The original visualization is a very interesting case in color. It is perhaps the best example of a qualitative color scheme applied incorrectly yet coming _so_ close to working. The little blue blip seems arbitrary, and then it transitions into an almost typical linear gradient that by itself isn’t too egregious. But the combination of those factors was an odd choice in my opinion.

    Having several hues makes it interesting, but as you state it falls flat for colorblind readers. A multi-hue perceptual color scheme would be good middle ground—perhaps something like viridis or magma.

    I also wonder if there is a way to combine the utility of the heatmap to show patterns, with the precision of lines? I was able to do something similar here: http://earthobservatory.nasa.gov/IOTD/view.php?id=85703 The line is helpful for precise counts, but the heatmap reveals patterns not apparent in the line. Neither method would work best on its own, but together they perform well.

    Even as a ‘super fan’ of small multiples, I don’t think they work here. The original succeeds as a news graphic because the alignment of the dates allows the line signifying the vaccine introduction to appear almost like a wall. Readers can interpret that almost immediately. It sends a visual message that doesn’t depend on reading the legend or understanding the text. Some precision and micropatterns are lost, but those are outweighed by the intuitiveness (in my opinion).

    • Thanks Josh! Always appreciate your insights.

      > Having several hues makes it interesting, but as you state it falls flat for colorblind readers. A multi-hue perceptual color scheme would be good middle ground, perhaps something like viridis or magma.

      I agree that multi-color palettes make things interesting, but don’t you think they take away from interpretability for the average reader too? Unless we can remember their color sequence by heart, we’re going to be constantly jumping back and forth between the heat map and legend to find out what the colors mean.

      > I also wonder if there is a way to combine the utility of the heatmap to show patterns, with the precision of lines?

      Showing both a line chart and heat map — as you did — is certainly the most straightforward way to pull that off. 🙂

      That’s a tough one in this case. As I showed, lines can’t effectively show all of the data here because of their significant overlap, unless we use small multiples. Even then, there are so many categories that the small multiples aren’t particularly effective here.

      The chart that shows the median + 95% CIs seems to be the best of both worlds to me, although that method hides the individual state trends in favor of highlighting the overall trend.

      > Even as a ‘super fan’ of small multiples, I don’t think they work here.

      I agree with you in this case. As I mentioned above, I actually think the last chart I made that summarizes the trend with statistics communicates the story most clearly, but of course it’s not quite as pretty as the heat map. Always a tradeoff.

      • > I agree that multi-color palettes make things interesting, but don’t you think they take away from interpretability for the average reader too?

        More often than not they do. But the original in this case leans toward a more reasonable (mis)use of color than the typical rainbow/Jet palette. A true rainbow palette would be a disaster for this!

        As a news graphic, I see the original as something for readers to glean quickly without the expectation that they’ll be peeking at the legend for precise values. It succeeds at that, but not as well as it could with a more appropriate color scheme.

        The problem with the original palette isn’t so much the number of hues (though I disagree with the use of blue). From values ≥ about 10, it only has two hues (yellow, red) much like a ColorBrewer multi hue palette. But unlike a ColorBrewer palette, it does not vary in saturation as values increase. That is a bigger problem, as greater increases in saturation would make that portion of the palette colorblind friendly and more perceptually accurate for those with typical vision.

  • Tom Clift

    I think there is an issue with the line chart. In the heat map, the “vaccine introduced” line goes through the middle of the year, indicating the place to look for the data to change. In the line chart, we see the peak drop off, then a “vaccine introduced” line. But how many data points are there between the peak and the vertical line? In this case it seems the vertical line belongs 1 pixel after 1954, rather than right on 1955, or we are left asking why rates were sharply dropping just before introduction.

    • That’s an interesting point, Tom. Presumably the infection rates dropped the year the vaccine was introduced because the vaccine was introduced early enough in the year to have a significant impact. I can see how that would look somewhat confusing to the viewer, though!

  • Koba Khitalishvili

    Insightful.

  • JohnG

    One small comment: I would have liked to have seen a “population of the US” line in the final graph. I wonder if the visual surge in cases from 1940 to 1950 was based on a population boom, better reporting or some other aspect that is not reflected here

    • All of the values reported in this data set are rates per 100k people, so population booms are already accounted for in the analysis.

  • Care to show the prevalence of autism since the 1960’s?

  • Marina

    Hi Randy! The times palette reads as a transitional sequential palette to me rather than a true categorical; there is a shift from light to dark in the colors rather than the palette not expressing magnitude. (And I will admit that the shift isn’t pure; it could be argued that the darkest blue isn’t lighter than the lightest yellow.) I think the transitional sequential adds value because the burst of yellow brings interest to the area from 1945 to 1955. It tells the story of a distinct uptick during that period. And for myself — not colorblind — it’s easier to read the story of what happened before and after that period by matching colors on either side of the visualization. Thank you for posting these. I am creating the colorblind charting palettes for my corporation and may present what you’re saying here to the group for discussion.

  • Before the polio vaccine, Doctors used to routinely call any childhood paralysis polio. It played well on insurance forms. Here’s an analysis of the Detroit polio epidemic of 1958. http://jama.jamanetwork.com/article.aspx?articleid=327642 They had a big epidemic, but when they went in and examined the cases, it turned out that less than 1/3 of the patients even had polio virus, and whether it is what was causing their problem is of course even then unclear. Maybe they would have beat it easy without some other factor (eg DDT).

  • Mario Lurig

    I searched far and wide for something as simple as ColorOracle. Thanks for the link!

  • Susan Kling Finston

    “Pro-vaxxers” ? There are not two sides to every argument. How about “MDs, Immunologists and Microbiologists” ?

  • Kaleb Holten

    Logged in with the sole purpose of saying thank you. I am color blind and as someone who enjoys looking at data I often run across visualizations that I simply can’t read well. I really appreciate you mentioning that as a frequent problem.