Python 2.7 still reigns supreme in pip installs


Published on September 04, 2016 by Dr. Randal S. Olson

ipython pandas python

4 min READ


The Python 2 vs. Python 3 divide has long been a thorn in the Python community's side. On one hand, Python package developers face the challenge of supporting two incompatible versions of Python, which is time that could be better spent improving the package. On the other hand, many Python users are reluctant to upgrade from Python 2 to 3 because of the time commitment such an upgrade entails. The Python Software Foundation's official stance on the matter is:

Python 2.x is legacy, Python 3.x is the present and future of the language

(Upgrading Python 2 code to Python 3 isn't that bad, by the way.)

I like to check in on the Python community's transition from 2 to 3 every once in a while, so I figured it was about time for another check. Conveniently, Juan Pablo posted a preliminary analysis yesterday looking at the evolution of pip package downloads by Python version over the past couple months. Below, I will delve into that data a bit more to see what insights we can draw.

Overall Python downloads

If we look at all pip installs in July and August '16, Python 2.7 still comprises roughly 90% of all pip installs. There are some interesting day-to-day fluctuations that presumably show fewer users installing Python packages on the weekend, but overall there are about 10,000,000 packages installed on Python 2.7 distributions every day.

python-pip-downloads

Of course, the statistics above capture all pip installs from all Python packages. What about the Scientific Python (SciPy) stack, which most of my readers are probably concerned about?

SciPy stack downloads

To look at Python usage across the SciPy stack, I used the same query as above except I limited the search to the following packages:

  • NumPy
  • SciPy
  • matplotlib
  • pandas
  • SymPy
  • IPython
  • Jupyter
  • nose
  • scikit-learn
  • scikit-image
  • Seaborn
  • Bokeh

(I added the last two per my personal opinion; the others are referenced on the SciPy page.)

python-scipy-stack-pip-downloads

Much to my dismay, even in the SciPy stack pip is used to install an order of magnitude more packages on Python 2.7 distributions than Python 3 (roughly 80% of all installs), with little sign of slowing down. At this point, we have to wonder: will Python's scientific Python community be ready when support for Python 2 is officially dropped in 2020?

Breakdown by packages in the SciPy stack

I was also curious to see the Python usage statistics broken down by the various scientific Python packages, so that's what I've plotted below. In general, it seems that it's important for the scientific Python packages to primarily support Python 2.7, 3.4, and 3.5, with a handful of packages even having a large Python 2.6 user base.

python-pip-package-bokeh-downloads

python-pip-package-ipython-downloads

python-pip-package-jupyter-downloads

python-pip-package-matplotlib-downloads

python-pip-package-nose-downloads

python-pip-package-numpy-downloads

python-pip-package-pandas-downloads

python-pip-package-scikit-image-downloads

python-pip-package-scikit-learn-downloads

python-pip-package-scipy-downloads

python-pip-package-seaborn-downloads

python-pip-package-sympy-downloads

Conclusions

Python 2 still seems to be the most-used version of Python by far, at least in terms of packages installed via pip. With Python 2's end of life in the near future, it's time for the Python community to start having a serious conversation about how to smooth the transition from Python 2 to 3.

Aside from 2to3 for automatic code translation, most major packages providing Python 3 support, and several guides focused on porting Python code from 2 to 3, what are we missing? If you're still using Python 2, what would convince you to make the switch?

Data source

I queried the PSF downloads table on BigQuery using the following query.

SELECT
  CONCAT( DATE(timestamp), '_', REGEXP_EXTRACT(details.python, r'^([2-3].[0-9]).') ) AS date_python,
  COUNT(details.python) AS downloads
FROM (TABLE_DATE_RANGE([the-psf:pypi.downloads], TIMESTAMP('2016-06-01'), TIMESTAMP('2016-09-01')))
WHERE
  LOWER(details.installer.name) LIKE 'pip'
  AND (LOWER(file.project) LIKE 'numpy'
    OR LOWER(file.project) LIKE 'scipy'
    OR LOWER(file.project) LIKE 'matplotlib'
    OR LOWER(file.project) LIKE 'pandas'
    OR LOWER(file.project) LIKE 'sympy'
    OR LOWER(file.project) LIKE 'ipython'
    OR LOWER(file.project) LIKE 'jupyter'
    OR LOWER(file.project) LIKE 'nose'
    OR LOWER(file.project) LIKE 'scikit-%'
    OR LOWER(file.project) LIKE 'seaborn'
    OR LOWER(file.project) LIKE 'bokeh')
GROUP BY
  date_python
ORDER BY
  date_python

This query was modified from Juan Pablo's earlier query.