Major League Baseball home run leaders, 1871-2016

Earlier this week, a Reddit user shared a fascinating animated data visualization showing the MLB career home run leaders over the past 145 years. I found it especially interesting because it is one of the few animated visualizations I’ve seen that effectively tells a story a static visualization couldn’t tell. I’ve recreated the user’s visualization with Python below.

[Animation: MLB home run records]


We see the early battle for the home run throne in the late 1800s, then a period of stagnation in the early 1900s. As one Redditor explained:

The reason why the records remain fairly stagnant from 1903-1920 is because there was a Dead-ball era in the MLB. During the “Dead Ball Era,” ballparks had ridiculously large dimensions, balls were used until they weren’t useful anymore, and many pitchers “doctored” the ball by spitting on it or covering it in tobacco.

Shortly thereafter, we see why Babe Ruth was such a sensation in the 1920s, as his record skyrocketed toward the 714 home runs he hit over his career. Ruth held that record until the 1970s, when Hank Aaron dethroned him and held onto the record for more than 30 years. There’s a lot more to discover in this animated visualization, but I’ll leave that as an exercise for the reader.

The downside of this visualization

Of course, the most obvious criticism of this visualization is that each line tracks a rank (1st through 10th) rather than a person, so the player behind a line changes over time, which can be disorienting to some viewers. It’s important to remember that this visualization is meant to show the evolution of home run records over time, and not necessarily the career of any particular individual.

One possible way to overcome this shortcoming is to take a cue from xkcd and assign each player their own line:

[Image: xkcd-style chart of dominant players]

However, we would have to limit the number of players we visualize at once, and could probably show only one or two dominant players during each time period.

Furthermore, since career home run records only go up over time, we would quickly see the 400+ home run range filled with several players’ records. Perhaps an xkcd-like version can be made for yearly home runs of dominant players, but I’ll leave that as an exercise for the future.
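As a rough sketch of the one-line-per-player idea, the wide leaderboard could be reshaped so that each player gets their own column of career totals. The helper name per_player_totals and the toy table are hypothetical and made up for illustration; only the player-01/rank-01 column naming scheme comes from the data file used in the code at the end of this post.

```python
import pandas as pd


def per_player_totals(leaders, n_ranks=10):
    """Reshape a wide leaderboard (player-XX / rank-XX columns, indexed by
    year) into a table with one column per player."""
    pieces = []
    for i in range(1, n_ranks + 1):
        suffix = str(i).zfill(2)
        pieces.append(pd.DataFrame({
            'year': leaders.index,
            'player': leaders['player-' + suffix].values,
            'homeruns': leaders['rank-' + suffix].values,
        }))

    # Stack all ranks into one long table, dropping empty leaderboard slots
    long_form = pd.concat(pieces, ignore_index=True).dropna()

    # One row per year, one column per player; 'max' only matters if two
    # players on the same leaderboard happen to share a name
    return long_form.pivot_table(index='year', columns='player',
                                 values='homeruns', aggfunc='max')


# Made-up numbers, NOT real statistics
toy = pd.DataFrame({
    'year': [1920, 1921, 1922],
    'player-01': ['Player A', 'Player B', 'Player B'],
    'rank-01': [100, 130, 160],
    'player-02': ['Player B', 'Player A', 'Player A'],
    'rank-02': [90, 110, 115],
}).set_index('year')

by_player = per_player_totals(toy, n_ranks=2)
print(by_player)
```

Calling by_player.plot() would then draw one line per player. Note that a player only has values for the years they appear on the leaderboard, so each line would start when the player enters the top 10 and break off if they drop out.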

How do I remake this visualization?

Below is the Python code that I used to generate the animated visualization. Once you’ve generated all of the individual frames, you’ll have to stitch them together with a program such as ffmpeg or Camtasia.

import matplotlib.pyplot as plt
import pandas as pd

# This is my custom matplotlib style -- feel free to reuse it
plt.style.use('https://gist.githubusercontent.com/rhiever/d0a7332fe0beebfdc3d5/raw/223d70799b48131d5ce2723cd5784f39d7a3a653/tableau10.mplstyle')

mlb_data = pd.read_csv('http://www.randalolson.com/wp-content/uploads/mlb-home-run-leaders-static.csv', sep='\t')
mlb_data.set_index('year', inplace=True)

# For every year (except the first 5)...
for year in mlb_data.index.unique()[5:]:
    # Each year gets its own figure
    plt.figure(figsize=(6, 9))

    # Subset the data to only the data leading up to the current year
    subset = mlb_data.loc[mlb_data.index <= year]

    # Plot each home run record line separately
    for i in range(1, 11):
        player_name = mlb_data.loc[mlb_data.index == year, 'player-{}'.format(str(i).zfill(2))].values[0]
        player_homeruns = int(mlb_data.loc[mlb_data.index == year, 'rank-{}'.format(str(i).zfill(2))].values[0])

        player_label = '{}. {} ({})'.format(i, player_name, player_homeruns)
        subset['rank-{}'.format(str(i).zfill(2))].plot(label=player_label)

    # We don't need an x-axis label -- it's obvious that it's years
    plt.xlabel('')

    # matplotlib has an annoying tendency to represent years in scientific notation
    # This line disables that
    plt.gca().get_xaxis().get_major_formatter().set_useOffset(False)

    # Give some space on the bottom so the x- and y-axis ticks don't overlap
    plt.ylim(ymin=-1)

    # Place the legend to the right of the figure
    plt.legend(fontsize=14, bbox_to_anchor=(1.65, 0.8), title='Top 10', frameon=False)

    plt.grid(False, axis='x')
    plt.title(year)
    plt.ylabel('Home Runs')

    # You can also save it as a .pdf here
    plt.savefig('{}.png'.format(year))

    # Close the figure so open figures don't pile up in memory across years
    plt.close()
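For the stitching step, here is a minimal sketch of how the ffmpeg call could be assembled from Python. It assumes the frames are named by consecutive years (1876.png, 1877.png, and so on, which is what the loop above produces if the data starts in 1871 and has no gaps) and that ffmpeg is installed with the libx264 encoder. The function name, output filename, and frame rate are placeholders I made up, not part of the original workflow.

```python
import shlex


def build_ffmpeg_command(first_year, output='mlb-home-runs.mp4', fps=5):
    """Return an ffmpeg argument list that joins <year>.png frames into a video."""
    return [
        'ffmpeg',
        '-framerate', str(fps),            # input frames shown per second
        '-start_number', str(first_year),  # number in the first frame's filename
        '-i', '%d.png',                    # frames named 1876.png, 1877.png, ...
        # libx264 with yuv420p needs even pixel dimensions, so round down
        '-vf', 'scale=trunc(iw/2)*2:trunc(ih/2)*2',
        '-c:v', 'libx264',
        '-pix_fmt', 'yuv420p',             # widely supported pixel format
        output,
    ]


command = build_ffmpeg_command(first_year=1876)
print(shlex.join(command))

# To actually run it from the folder containing the frames:
# import subprocess
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with the parentheses in the scale filter.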

Dr. Randy Olson is a Senior Data Scientist at the University of Pennsylvania, where he develops state-of-the-art machine learning algorithms with a focus on biomedical applications.
