Introducing TPOT, the Data Science Assistant

Some of you might have been wondering what the heck I’ve been up to for the past few months. I haven’t been posting much on my blog lately, and I haven’t been working on important problems like solving Where’s Waldo? and optimizing road trips around the world. (I promise: I’ll get back to fun posts like that soon!) Instead, I’ve been working on something far geekier, and I’m excited to finally have something to show for it.

Over the summer, I started a new postdoctoral research position funded by the NIH at the University of Pennsylvania Computational Genetics Lab. During my first month there, I started looking for big problems in the field of data science to take on. Science (especially computer science) is often too incremental, and if I was going to stay in academia, I wanted to tackle a big problem. It was around that time that I started thinking about the process of machine learning and how we could let machines solve problems themselves rather than needing input from humans.

You see, machine learning is transforming the world as we know it. Google’s search engine was massively improved by machine learning, as were Gmail’s spam filters. Voice assistants like Siri — as silly as they can be — use machine learning to translate your voice into something the computer can understand. Stock market investors make millions every day using machine learning to predict when to buy and sell. And the list goes on and on…

Ever wonder how Facebook always knows who you are in your photos? They use machine learning.

The problem with machine learning is that building an effective model can require a ton of human input. Humans have to figure out the right way to transform the data before feeding it to the machine learning model. Then they have to pick the right machine learning model that will learn from the data best, and then there’s a whole bunch of model parameters to tweak that can make the difference between a dud and a Nostradamus-like model. Building these pipelines — i.e., sequences of steps that turn the raw data into a predictive model — can easily take weeks of tinkering depending on the difficulty of the problem. This is obviously a huge issue when machine learning is supposed to allow machines to learn on their own.

An example machine learning pipeline, and what parts of the pipeline TPOT automates.
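
To make “pipeline” concrete, here’s a small hand-built example using scikit-learn (which TPOT is built on). The operators and parameter values here are only illustrative, not anything TPOT specifically recommends:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# One hand-designed pipeline: scale the features, keep the 20 most
# informative ones, then classify. TPOT explores thousands of
# combinations like this, including each step's parameters.
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=20)),
    ('classify', RandomForestClassifier(n_estimators=100)),
])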

Thus, the Tree-based Pipeline Optimization Tool (TPOT) was born. TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. Think of TPOT as your “Data Science Assistant”: TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines, then recommending the pipelines that work best for your data.

An example TPOT pipeline with two copies of the data set entering the pipeline.
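
Genetic programming, in a deliberately simplified sketch, looks something like the loop below. TPOT’s actual implementation (built on the DEAP library) is considerably more involved; the function names here are just placeholders:

import random

# Placeholder sketch of an evolutionary search over pipelines.
# evaluate/crossover/mutate stand in for TPOT's real operators.
def evolve(population, generations, evaluate, crossover, mutate):
    for _ in range(generations):
        # Score each candidate pipeline (e.g., by cross-validated accuracy)
        ranked = sorted(population, key=evaluate, reverse=True)
        parents = ranked[:len(ranked) // 2]  # keep the best performers
        offspring = []
        while len(offspring) < len(population):
            mom, dad = random.sample(parents, 2)
            # Recombine two good pipelines, then randomly tweak the result
            offspring.append(mutate(crossover(mom, dad)))
        population = offspring
    return max(population, key=evaluate)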

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. As an added bonus, TPOT is built on top of scikit-learn, so all of the code it generates should look familiar… if you’re familiar with scikit-learn, anyway.

TPOT is still under active development and in its early stages, but it’s worked very well on the classification problems I’ve applied it to so far.

Check out the TPOT GitHub repository to see the latest goings-on. I’ll be working on TPOT and pushing the boundaries of machine learning pipeline optimization for the majority of my postdoc.

An example using TPOT

I wanted to make TPOT versatile, so it can be used on the command line or via Python scripts. You can look up the detailed usage instructions on the GitHub repository if you’re interested.
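
For instance, a command-line run looks roughly like this (the flags follow the current documentation and may differ between versions; mnist.csv is a placeholder for your own comma-separated data file with a “class” column):

tpot mnist.csv -is , -target class -g 5 -v 2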

For this post, I’ve provided a basic example of how you can use TPOT to build a pipeline that classifies hand-written digits in the classic MNIST data set.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits data set and hold out 25% of it for testing
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Let TPOT search for a good pipeline, then score it on the held-out data
tpot = TPOTClassifier(generations=5, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
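
And, as mentioned above, you can ask TPOT to hand you the code for its best pipeline as a standalone script (the filename here is arbitrary):

# Write the best pipeline TPOT found out as plain scikit-learn code
tpot.export('tpot_mnist_pipeline.py')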

After 10 or so minutes, TPOT will discover a pipeline that achieves roughly 98% accuracy. In this case, it will probably find that random forest and k-nearest-neighbors classifiers do very well on MNIST with only a little bit of tuning. If you give TPOT even more time by setting the “generations” parameter to a higher number, it may find even better pipelines.

“TPOT sounds cool! How can I get involved?”

TPOT is an open source project, and I’m happy to have you join our efforts to build the best tool possible. If you want to contribute some code, check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

tl;dr in image format

Justin Kiggins had a great summary of TPOT when I first tweeted about it:

Anyway, that’s what I’ve been up to lately. I’m looking forward to presenting TPOT at several research conferences in the coming months, and I’d really like to see what the machine learning community thinks about pipeline automation. In the meantime, give TPOT a try and let me know what you think.

Dr. Randy Olson is a Senior Data Scientist at the University of Pennsylvania, where he develops state-of-the-art machine learning algorithms with a focus on biomedical applications.

Posted in machine learning, python, research
  • Michael Markieta

    How could the community best help you? Would you like us to submit unit tests using common open source data sets?

    • Great question! There are several ways that the community can help with the TPOT project.

      Perhaps the most basic way to help is to give TPOT a try for your normal workflow and let me know how it works for you. What worked well? What didn’t work well? What new features do you think would help? I have my way of doing things, but I’d like to design this tool to be useful for everyone.

      Beyond that, I’d love to hear about TPOT’s performance on some data sets beyond the limited set I’ve looked at. I’d especially love to see comparisons to hand-designed pipelines, or pipelines designed by other tools.

      For those interested in contributing, we’re still working hard on integrating more ML models, more feature selectors, more feature constructors, etc. We really want TPOT to implement as many pipeline operators as reasonably possible, and that takes a lot of work. So the more PRs we get, the better!
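
      Roughly speaking, each pipeline operator pairs a scikit-learn class with the parameter values the optimizer is allowed to choose from, along these lines (a simplified sketch, not TPOT’s exact internal format):

      # Sketch only: an operator maps an estimator to its candidate parameters
      operator_space = {
          'sklearn.ensemble.RandomForestClassifier': {
              'n_estimators': [10, 50, 100],
              'max_features': ['sqrt', 'log2'],
          },
          'sklearn.feature_selection.SelectKBest': {
              'k': range(1, 51),
          },
      }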

      • Michael Markieta

        Plenty to do! Would you be able to write a short blurb about how you integrate all of the ML functions as pipelines? I think this would give others a way to see how “you like to do it” rather than just interpreting the code.

  • Nate Juboor

    Intro Linux user here; not sure why I am getting this error on the tpot install. All other installs went well.

    Downloading/unpacking xgboost (from tpot)
    Could not find a version that satisfies the requirement xgboost (from tpot) (from versions: 0.4a12, 0.4a13, 0.4a14, 0.4a15, 0.4a18, 0.4a19, 0.4a20, 0.4a21, 0.4a22, 0.4a23, 0.4a24, 0.4a25, 0.4a26, 0.4a27, 0.4a28, 0.4a29, 0.4a30)
    Cleaning up…
    No distributions matching the version for xgboost (from tpot)
    Storing debug log for failure in /home/david/.pip/pip.log
    _____
    Also, Dr. Olson, I have a few questions/suggestions regarding practicality in a project I am working on (using tpot). I am a student near your lab; what is the best way to get in contact with you?

    • Hi Nate,

      Try just running `pip install xgboost` and see if that works, then install tpot again. Let me know how that goes. I can’t reproduce your error on my end.

      In regards to getting in touch, please email me: http://www.randalolson.com/contact/

      • Nate Juboor

        Results from ‘sudo pip install xgboost’:

        [email protected]:~$ sudo pip install xgboost
        Downloading/unpacking xgboost
        Could not find a version that satisfies the requirement xgboost (from versions: 0.4a12, 0.4a13, 0.4a14, 0.4a15, 0.4a18, 0.4a19, 0.4a20, 0.4a21, 0.4a22, 0.4a23, 0.4a24, 0.4a25, 0.4a26, 0.4a27, 0.4a28, 0.4a29, 0.4a30)
        Cleaning up…
        No distributions matching the version for xgboost
        Storing debug log for failure in /home/david/.pip/pip.log

      • Nate Juboor

        Dr. Olson,

        I use TPOT all the time now for quick direction when I’m stuck in a rut on a difficult problem.

        Two questions:

        1) CUDA support?
        2) Keras integration?

        –DNJ

        • Glad to hear you’re finding TPOT useful! I don’t think CUDA support is in the works, but the next 0.7 release will have multiprocessing support via joblib. We’ve worked on a demo that integrates Keras, but it’s still in the early prototyping phase and isn’t usable at the moment.
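
          Once that lands, turning on parallel pipeline evaluation should be a single parameter, something like this (the parameter name follows the scikit-learn/joblib convention and may change before release):

          # Assumed interface: n_jobs=-1 would use every available CPU core
          tpot = TPOTClassifier(generations=5, n_jobs=-1, verbosity=2)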

  • jt

    Hi Randy,

    Thanks for building TPOT! I used to use something similar called Eureqa (http://www.nutonian.com/products/eureqa/). Right now I’m working on an unbalanced classification problem, so I wanted to suggest that TPOT incorporate alternative class weighting schemes or scoring functions. The default accuracy score in TPOT means that it’s really not suited to my use case (roughly 3% of my examples are in the positive class)

    • Hi jt,

      It seems your comment got cut off early. In any case, we allow TPOT to use custom scoring functions (http://rhiever.github.io/tpot/examples/Custom_Scoring_Functions/) to guide the optimization process. By default, TPOT uses balanced accuracy, which accounts for class imbalances. Perhaps that helps overcome your issue?
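
      If balanced accuracy isn’t quite what you need, passing a custom scorer looks something like this (a sketch; check the linked docs for the exact signature your TPOT version expects):

      from sklearn.metrics import make_scorer, f1_score
      from tpot import TPOTClassifier

      # Optimize pipelines for F1 instead of accuracy, which can help
      # when only ~3% of examples are in the positive class
      tpot = TPOTClassifier(generations=5, scoring=make_scorer(f1_score), verbosity=2)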

      • jt

        Totally! I’ll give it a shot today once my current feature selection pass is done. I’ve been running forward feature selection + parameter grid search on this problem for the past 18 hours.

  • Bill White

    Do I need the latest pip?

    Downloading/unpacking pip>=8.1.0 (from pypandoc)
    Downloading pip-8.1.2-py2.py3-none-any.whl (1.2MB): 1.2MB downloaded
    Cleaning up...
    Exception:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
        status = self.run(options, args)
      File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 290, in run
        requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
      File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1260, in prepare_files
        )[0]
    IndexError: list index out of range

    Storing debug log for failure in /root/.pip/pip.log

    • Hey Bill! I don’t think you need the latest pip. Typically we recommend installing TPOT on top of an Anaconda Python distribution [1], since otherwise installing some of the dependencies (especially numpy) can be a pain.

      If you keep having issues with the install, please file an issue [2] and let us know your Python install details etc.

      [1] https://www.continuum.io/downloads
      [2] https://github.com/rhiever/tpot/issues/new

      • Bill White

        Hmm, I’m a little leery of installing a second Python, but I suppose it can live alongside the OS version okay (with Anaconda ahead in the PATH)?

        • Yep, that’s right. I’ve been running Anaconda on top of the OS Python for ages and it works great. IIRC the Anaconda installer automatically adds the Anaconda install to the PATH.

          • Bill White

            Yay! Anaconda worked like a charm. Now to try some stuff. Thanks dude!

          • Bill White

            How long should the example take to run? It’s been running for over 5 minutes. Am I impatient?

            [email protected]:~/analysis/TPOT_tests$ python
            Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
            [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
            Type "help", "copyright", "credits" or "license" for more information.
            Anaconda is brought to you by Continuum Analytics.
            Please check out: http://continuum.io/thanks and https://anaconda.org
            >>> from tpot import TPOT
            >>> from sklearn.datasets import load_digits
            >>> from sklearn.cross_validation import train_test_split
            >>> digits = load_digits()
            >>> X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
            … train_size=0.75)
            >>> tpot = TPOT(generations=5)
            >>> tpot.fit(X_train, y_train)

            • If you change this line:

              tpot = TPOT(generations=5)

              to

              tpot = TPOT(generations=5, verbosity=2)

              it’ll show a progress bar for you. That way you’ll at least get a sense of how it’s progressing. Generally, TPOT will take a while because it’s running k-fold cross-validation on every pipeline on the full training data set.

              • Bill White

                Nice, I can see it working now! I am anxious to give this a try with some of our data sets. Thanks again!

              • Bill White

                Sorry, one more thing. Not sure if this is useful, but I got this error at the end of the example:

                [email protected]:~/analysis/TPOT_tests$ python example.py
                Generation 1 - Current best internal CV score: 0.990575903376
                Generation 2 - Current best internal CV score: 0.990582851858
                Generation 3 - Current best internal CV score: 0.991425080097
                Generation 4 - Current best internal CV score: 0.991425080097
                Generation 5 - Current best internal CV score: 0.991821054216

                Best pipeline: ExtraTreesClassifier(input_matrix, 90, 0.19, 0.33000000000000002)
                Traceback (most recent call last):
                  File "example.py", line 12, in <module>
                    tpot.score(X_train, y_train, X_test, y_test)
                TypeError: score() takes exactly 3 arguments (5 given)

  • D L von Kleeck

    I see you have done some work on a demo that integrates Keras. SUPER!!! ANNs are universal estimators, and except for the “black box” thing, they are the way to go. Having a GA identify the “best” architecture and parameters is sorely needed.
    Let me know about your progress.

    Doc vK

  • shubham jain

    Hello sir. Can you please tell me how TPOT uses a genetic algorithm for optimization?