Python 2.7 still reigns supreme in pip installs

The Python 2 vs. Python 3 divide has long been a thorn in the Python community’s side. On one hand, Python package developers face the challenge of supporting two incompatible versions of Python, which is time that could be better spent improving the package. On the other hand, many Python users are reluctant to upgrade from Python 2 to 3 because of the time commitment such an upgrade entails. The Python Software Foundation’s official stance on the matter is:

Python 2.x is legacy, Python 3.x is the present and future of the language

(Upgrading Python 2 code to Python 3 isn’t that bad, by the way.)

I like to check in on the Python community’s transition from 2 to 3 every once in a while, so I figured it was about time for another check. Conveniently, Juan Pablo posted a preliminary analysis yesterday looking at the evolution of pip package downloads by Python version over the past couple months. Below, I will delve into that data a bit more to see what insights we can draw.

Overall Python downloads

If we look at all pip installs in July and August 2016, Python 2.7 still accounts for roughly 90% of them. There are some interesting day-to-day fluctuations that presumably show fewer users installing Python packages on the weekend, but overall there are about 10,000,000 packages installed on Python 2.7 distributions every day.

[Figure: pip downloads by Python version, all packages]

Of course, the statistics above capture all pip installs from all Python packages. What about the Scientific Python (SciPy) stack, which most of my readers are probably concerned about?

SciPy stack downloads

To look at Python usage across the SciPy stack, I used the same query as for the plot above, except that I limited the search to the following packages:

  • NumPy
  • SciPy
  • matplotlib
  • pandas
  • SymPy
  • IPython
  • Jupyter
  • nose
  • scikit-learn
  • scikit-image
  • Seaborn
  • Bokeh

(I added the last two per my personal opinion; the others are referenced on the SciPy page.)
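As a rough illustration of how the package restriction can be expressed, here is a minimal Python sketch that builds the project filter from the list above. The names `SCIPY_STACK` and `project_filter` are hypothetical helpers, not part of the original analysis. The sketch also assumes exact name matches, whereas the published query uses `LIKE 'scikit-%'`, which would match any project whose name starts with `scikit-`.

```python
# Packages listed in the post, lowercased to match LOWER(file.project).
SCIPY_STACK = [
    "numpy", "scipy", "matplotlib", "pandas", "sympy", "ipython",
    "jupyter", "nose", "scikit-learn", "scikit-image", "seaborn", "bokeh",
]


def project_filter(packages):
    """Build a SQL condition that matches exactly the given project names."""
    conditions = ["LOWER(file.project) = '{}'".format(name) for name in packages]
    return "(" + "\n  OR ".join(conditions) + ")"


clause = project_filter(SCIPY_STACK)
print(clause)
```

Generating the clause from a list keeps the package names in one place, so adding or dropping a package does not require editing the query by hand.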

[Figure: pip downloads by Python version, SciPy stack packages]

Much to my dismay, even in the SciPy stack pip is used to install an order of magnitude more packages on Python 2.7 distributions than Python 3 (roughly 80% of all installs), with little sign of slowing down. At this point, we have to wonder: will the scientific Python community be ready when support for Python 2 is officially dropped in 2020?

Breakdown by packages in the SciPy stack

I was also curious to see the Python usage statistics broken down by the various scientific Python packages, so that’s what I’ve plotted below. In general, it seems that it’s important for the scientific Python packages to primarily support Python 2.7, 3.4, and 3.5, with a handful of packages even having a large Python 2.6 user base.

[Figures: pip downloads by Python version for each package: Bokeh, IPython, Jupyter, matplotlib, nose, NumPy, pandas, scikit-image, scikit-learn, SciPy, Seaborn, SymPy]

Conclusions

Python 2 still seems to be the most-used version of Python by far, at least in terms of packages installed via pip. With Python 2’s end of life in the near future, it’s time for the Python community to start having a serious conversation about how to smooth the transition from Python 2 to 3.

Aside from 2to3 for automatic code translation, most major packages providing Python 3 support, and several guides focused on porting Python code from 2 to 3, what are we missing? If you’re still using Python 2, what would convince you to make the switch?

Data source

I queried the PSF downloads table on BigQuery using the following query.

SELECT
  CONCAT( DATE(timestamp), '_', REGEXP_EXTRACT(details.python, r'^([2-3]\.[0-9])\.') ) AS date_python,
  COUNT(details.python) AS downloads
FROM (TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP('2016-06-01'), TIMESTAMP('2016-09-01')))
WHERE
  LOWER(details.installer.name) LIKE 'pip'
  AND (LOWER(file.project) LIKE 'numpy'
    OR LOWER(file.project) LIKE 'scipy'
    OR LOWER(file.project) LIKE 'matplotlib'
    OR LOWER(file.project) LIKE 'pandas'
    OR LOWER(file.project) LIKE 'sympy'
    OR LOWER(file.project) LIKE 'ipython'
    OR LOWER(file.project) LIKE 'jupyter'
    OR LOWER(file.project) LIKE 'nose'
    OR LOWER(file.project) LIKE 'scikit-%'
    OR LOWER(file.project) LIKE 'seaborn'
    OR LOWER(file.project) LIKE 'bokeh')
GROUP BY
  date_python
ORDER BY
  date_python

This query was modified from Juan Pablo’s earlier query.
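To turn the query output into the percentages quoted in the post, the rows need to be grouped by day and normalized. Below is a minimal sketch in Python 3 of one way to do that. It assumes the results have been exported as `(date_python, downloads)` pairs. The function name `daily_shares` is hypothetical, and the sample counts are made-up placeholders rather than real PyPI figures.

```python
from collections import defaultdict

# Hypothetical rows in the shape the query returns: (date_python, downloads).
# The counts are made-up placeholders, not real PyPI figures.
rows = [
    ("2016-07-01_2.7", 900),
    ("2016-07-01_3.4", 40),
    ("2016-07-01_3.5", 60),
    ("2016-07-02_2.7", 450),
    ("2016-07-02_3.5", 50),
]


def daily_shares(rows):
    """Return {date: {python_version: fraction of that day's downloads}}."""
    counts = defaultdict(dict)
    for date_python, downloads in rows:
        # The query joins date and version with '_', e.g. '2016-07-01_2.7'.
        date, version = date_python.split("_", 1)
        counts[date][version] = counts[date].get(version, 0) + downloads

    shares = {}
    for date, by_version in counts.items():
        total = sum(by_version.values())
        shares[date] = {v: n / total for v, n in by_version.items()}
    return shares


shares = daily_shares(rows)
for date in sorted(shares):
    print(date, {v: round(s, 2) for v, s in sorted(shares[date].items())})
```

Normalizing per day rather than over the whole period keeps the weekday and weekend fluctuations in total volume from distorting the version shares.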

Dr. Randy Olson is a Senior Data Scientist at the University of Pennsylvania, where he develops state-of-the-art machine learning algorithms with a focus on biomedical applications.

Posted in data visualization, python
  • Charles Chen

    As far as I know, more people are using the Anaconda distribution due to its out of the box convenience. Maybe the statistics would be more interesting if it included the number from Anaconda as well.

    • Juanlu001

      I agree, providing conda stats would be a nice addition to the article.

    • I would love to include download statistics from conda, but as far as I know those statistics aren’t posted publicly. Let’s hope someone on the conda team at Continuum runs across this blog post. 🙂

  • Tim Allen

    Could a large percentage of these be accounted for by OS installs? Red Hat / CentOS / Fedora are shipping with either 2.6.6 or 2.7.5 as their default system Python versions. Many of the data scientists I know install the scientific packages at the system root level. This is also a fairly common practice for packages like virtualenv, virtualenvwrapper, Pygments, and many of the most popular system utility packages. I doubt there is a way to distinguish installs via the system-installed ‘pip’ in these data, but I’d wager dollars to donuts that when OSes start shipping with Python 3 as the default, rather than Python 2, we’ll see these numbers flip. Even Ubuntu 16 still ships with Python 2.7.x as the system default, and most users still don’t use virtualenvs.

    • “when OS’s start shipping with Python 3 as the default, rather than Python 2, we’ll see these numbers flip.”

      I sure hope so!

      This is why I always push for folks to install the Anaconda Python distribution: conda makes it a piece of cake to run Python 3 as your base install and revert over to a Python 2 install if you run into some wonky code that only runs on Python 2.

      • Matthias Bussonnier

        It’s also interesting to look at the distribution of pip versions. You can basically spot a wide majority of users who haven’t even upgraded pip, and those who are on a normal update schedule with pip 8.x. Among the latter the picture is different, with 2 and 3 often being much closer.

  • Mark Lawrence

    If this https://nothingbutsnark.svbtle.com/python-3-support-on-pypi is anything to go by I’d guess that these figures can be taken with a pinch of salt.

  • waltercool

    This is a stupid chain: once libraries start supporting 3+, the migration from 2 will be faster for devs. Many times I have tried to use a library only to find that it was not available for Python 3, or that it did not officially support the 3.x codebase.

    I think we should start pushing, just as Java did to keep devs from staying on 1.5 or 1.6: push great features into new Python versions, but bring the big frameworks along. Right now I would name NumPy, Flask, Django, CherryPy, matplotlib, rpy, and a few other statistical ones. Python has a lot of strengths there, and the slow migration of libraries or frameworks sometimes holds up the move.

    Also, it would be OK to just stop supporting Python 2.7. That said, I would ask the Python 3 devs to stop breaking APIs with minor changes. I like API updates, but you can push devs by force to use a different way to do some params under PEP updates.

  • Matthew Parker

    Slightly off topic, but what happened towards the end of August that caused the spike in jupyter/matplotlib/numpy etc. downloads, but only on 3.5? Was there a new release of something?

  • Andy

    I’m late to the party, but how did you get the numbers? The ones from PyPI directly are crazily inflated. As far as I know, that’s due to caches and proxies hitting it over and over again; see http://stackoverflow.com/questions/38102317/why-pypi-doesnt-show-download-stats-anymore

    Did you get them from somewhere else or are these just the “highly inaccurate” ones?