Introducing TPOT, the Data Science Assistant
Some of you might have been wondering what the heck I’ve been up to for the past few months. I haven’t been posting much on my blog lately, and I haven’t been working on important problems like solving Where’s Waldo? and optimizing road trips around the world. (I promise: I’ll get back to fun posts like that soon!) Instead, I’ve been working on something far geekier, and I’m excited to finally have something to show for it.
Over the summer, I started a new postdoctoral research position funded by the NIH at the University of Pennsylvania Computational Genetics Lab. During my first month there, I started looking for big problems in the field of data science to take on. Science (especially computer science) is often too incremental, and if I was going to stay in academia, I wanted to tackle a big problem. It was around that time that I started thinking about the process of machine learning and how we could let machines solve problems themselves rather than needing input from humans.
You see, machine learning is transforming the world as we know it. Google's search engine was massively improved by machine learning, as were Gmail's spam filters. Voice assistants like Siri — as silly as they can be — use machine learning to translate your voice into something the computer can understand. Stock market investors make millions every day using machine learning to predict when to buy and sell. And the list goes on and on…
The problem with machine learning is that building an effective model can require a ton of human input. Humans have to figure out the right way to transform the data before feeding it to the machine learning model. Then they have to pick the right machine learning model that will learn from the data best, and then there’s a whole bunch of model parameters to tweak that can make the difference between a dud and a Nostradamus-like model. Building these pipelines — i.e., sequences of steps that turn the raw data into a predictive model — can easily take weeks of tinkering depending on the difficulty of the problem. This is obviously a huge issue when machine learning is supposed to allow machines to learn on their own.
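To make that concrete, here's what a hand-built pipeline looks like in scikit-learn. Everything below (the scaler, the PCA step, the model, the parameter grid) is a guess I made for illustration; a human would normally iterate on each of these choices for hours or days:

# Every choice in this hand-built pipeline is a human guess to be tuned.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

pipeline = Pipeline([
    ('scale', StandardScaler()),          # guess: how to transform the data
    ('reduce', PCA()),                    # guess: whether to reduce the features
    ('model', RandomForestClassifier()),  # guess: which model to use
])

# guess: which parameter values are even worth trying
param_grid = {
    'reduce__n_components': [16, 32, 64],
    'model__n_estimators': [50, 100],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(digits.data, digits.target)
print(search.best_params_, search.best_score_)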
Thus, the Tree-based Pipeline Optimization Tool (TPOT) was born. TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming. Think of TPOT as your “Data Science Assistant”: TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines, then recommending the pipelines that work best for your data.
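If you're curious what "genetic programming over pipelines" means in practice, here's a deliberately simplified sketch of the core loop. This is not TPOT's actual code (that lives in the GitHub repo), and real genetic programming also mutates and recombines the winners between generations rather than just sampling at random:

# Simplified illustration of searching over random pipelines -- NOT TPOT's code.
import random
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

digits = load_digits()

def random_pipeline():
    # Randomly assemble preprocessing steps and a model with random settings
    steps = []
    if random.random() < 0.5:
        steps.append(StandardScaler())
    if random.random() < 0.5:
        steps.append(PCA(n_components=random.choice([16, 32, 64])))
    steps.append(random.choice([
        RandomForestClassifier(n_estimators=random.choice([10, 50, 100])),
        KNeighborsClassifier(n_neighbors=random.choice([1, 3, 5])),
    ]))
    return make_pipeline(*steps)

# Evaluate each candidate by cross-validation and keep the best one found
best_score, best_pipe = -1.0, None
for _ in range(20):
    pipe = random_pipeline()
    score = cross_val_score(pipe, digits.data, digits.target, cv=3).mean()
    if score > best_score:
        best_score, best_pipe = score, pipe
print(best_score, best_pipe)

TPOT layers real evolutionary operators and a much richer set of pipeline building blocks on top of this basic search-and-score idea.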
Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there. As an added bonus, TPOT is built on top of scikit-learn, so all of the code it generates should look familiar… if you’re familiar with scikit-learn, anyway.
TPOT is still under active development and in its early stages, but it’s worked very well on the classification problems I’ve applied it to so far.
Check out the TPOT GitHub repository to see the latest goings-on. I'll be working on TPOT and pushing the boundaries of machine learning pipeline optimization for the majority of my postdoc.
An example using TPOT
I wanted to make TPOT versatile, so it can be used on the command line or via Python scripts. You can look up the detailed usage instructions on the GitHub repository if you’re interested.
For this post, I’ve provided a basic example of how you can use TPOT to build a pipeline that classifies hand-written digits in the classic MNIST data set.
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
# this lived in sklearn.cross_validation before scikit-learn 0.18
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
After 10 or so minutes, TPOT will discover a pipeline that achieves roughly 98% accuracy. In this case, TPOT will probably discover that random forest and k-nearest-neighbor classifiers do very well on MNIST with only a little bit of tuning. If you give TPOT even more time by setting the "generations" parameter to a higher number, it may find even better pipelines.
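And to keep the winning pipeline around, you can ask TPOT to write it out as a standalone script with its export method (the filename here is just an example):

tpot.export('tpot_mnist_pipeline.py')

The exported file is ordinary scikit-learn code that reconstructs the best pipeline, so you can edit and reuse it like any other script.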
“TPOT sounds cool! How can I get involved?”
TPOT is an open source project, and I’m happy to have you join our efforts to build the best tool possible. If you want to contribute some code, check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.
tl;dr in image format
Justin Kiggins had a great summary of TPOT when I first tweeted about it:
@randal_olson pic.twitter.com/ds5iTbA2oF
— Justin Kiggins (@neuromusic) November 13, 2015
Anyway, that’s what I’ve been up to lately. I’m looking forward to presenting TPOT at several research conferences in the coming months, and I’d really like to see what the machine learning community thinks about pipeline automation. In the meantime, give TPOT a try and let me know what you think.
How would you best benefit from the community? Would you like us to submit unit tests using common open-source data sets?
Great question! There are several ways that the community can help with the TPOT project.
Perhaps the most basic way to help is to give TPOT a try for your normal workflow and let me know how it works for you. What worked well? What didn’t work well? What new features do you think would help? I have my way of doing things, but I’d like to design this tool to be useful for everyone.
Beyond that, I’d love to hear about TPOT’s performance on some data sets beyond the limited set I’ve looked at. I’d especially love to see comparisons to hand-designed pipelines, or pipelines designed by other tools.
For those interested in contributing, we’re still working hard on integrating more ML models, more feature selectors, more feature constructors, etc. We really want TPOT to implement as many pipeline operators as reasonably possible, and that takes a lot of work. So the more PRs we get, the better!
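If you want a rough picture of what those operators compose into, here's a plain scikit-learn pipeline chaining a feature selector, a feature constructor, and a classifier. The specific choices are mine for illustration, not TPOT output, but TPOT's operators wrap building blocks of exactly these kinds:

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

digits = load_digits()

pipeline = make_pipeline(
    SelectKBest(f_classif, k=32),              # feature selector
    PolynomialFeatures(degree=2),              # feature constructor
    RandomForestClassifier(n_estimators=100),  # classifier
)
pipeline.fit(digits.data, digits.target)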
Plenty to do! Would you be able to write a short blurb about how you integrate all of the ML functions as pipelines? I think this would give others a way to see how "you like to do it" rather than just interpreting the code.
Beginner Linux user here; not sure why I'm getting this error on the TPOT install. All the other installs went well.
Downloading/unpacking xgboost (from tpot)
Could not find a version that satisfies the requirement xgboost (from tpot) (from versions: 0.4a12, 0.4a13, 0.4a14, 0.4a15, 0.4a18, 0.4a19, 0.4a20, 0.4a21, 0.4a22, 0.4a23, 0.4a24, 0.4a25, 0.4a26, 0.4a27, 0.4a28, 0.4a29, 0.4a30)
Cleaning up…
No distributions matching the version for xgboost (from tpot)
Storing debug log for failure in /home/david/.pip/pip.log
_____
Also, Dr. Olson, I have a few questions/suggestions regarding practicality in a project I am working on (using TPOT). I am a student close to your lab; what is the best way to get in contact with you?
Hi Nate,
Try just running `pip install xgboost` and see if that works, then install tpot again. Let me know how that goes. I can’t reproduce your error on my end.
To get in touch, please use my contact form: http://www.randalolson.com/contact/
Results from ‘sudo pip install xgboost’:
$ sudo pip install xgboost
Downloading/unpacking xgboost
Could not find a version that satisfies the requirement xgboost (from versions: 0.4a12, 0.4a13, 0.4a14, 0.4a15, 0.4a18, 0.4a19, 0.4a20, 0.4a21, 0.4a22, 0.4a23, 0.4a24, 0.4a25, 0.4a26, 0.4a27, 0.4a28, 0.4a29, 0.4a30)
Cleaning up…
No distributions matching the version for xgboost
Storing debug log for failure in /home/david/.pip/pip.log
Dr. Olson,
I use TPOT all the time now for quick direction when I’m stuck in a rut on a difficult problem.
Two questions:
1) CUDA support?
2) Keras integration?
–DNJ
Glad to hear you're finding TPOT useful! I don't think CUDA support is in the works, but the next 0.7 release will have multiprocessing support via joblib. We've worked on a demo that integrates Keras, but it's still in the early prototyping phase and not usable at the moment.
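Assuming the planned interface, parallel pipeline evaluation in that release should be a one-parameter change, with n_jobs=-1 meaning "use all available cores":

from tpot import TPOTClassifier

# Planned for the 0.7 release: evaluate pipelines in parallel across CPU cores
tpot = TPOTClassifier(generations=5, verbosity=2, n_jobs=-1)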
Hi Randy,
Thanks for building TPOT! I used to use something similar called Eureqa (http://www.nutonian.com/products/eureqa/). Right now I'm working on an unbalanced classification problem, so I wanted to suggest that TPOT incorporate alternative class weighting schemes or scoring functions. The accuracy score default on TPOT means that it's really not suited to my use case (roughly 3% of my examples are in the positive class).
Hi jt,
It seems your comment got cut off early. In any case, we allow TPOT to use custom scoring functions (http://rhiever.github.io/tpot/examples/Custom_Scoring_Functions/) to guide the optimization process. By default, TPOT uses balanced accuracy, which accounts for class imbalances. Perhaps that helps overcome your issue?
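For instance, here's a sketch of pointing TPOT at a scorer better suited to rare positives; the F1 choice is just one option among many:

from sklearn.metrics import f1_score, make_scorer
from tpot import TPOTClassifier

# Optimize pipelines for F1 on the positive class instead of plain accuracy
tpot = TPOTClassifier(generations=5, scoring=make_scorer(f1_score), verbosity=2)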
Totally! I’ll give it a shot today once my current feature selection pass is done. I’ve been running forward feature selection + parameter grid search on this problem for the past 18 hours so far.
Do I need the latest pip?
Downloading/unpacking pip>=8.1.0 (from pypandoc)
Downloading pip-8.1.2-py2.py3-none-any.whl (1.2MB): 1.2MB downloaded
Cleaning up…
Exception:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 122, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 290, in run
requirement_set.prepare_files(finder, force_root_egg_info=self.bundle, bundle=self.bundle)
File "/usr/lib/python2.7/dist-packages/pip/req.py", line 1260, in prepare_files
)[0]
IndexError: list index out of range
Storing debug log for failure in /root/.pip/pip.log
Hey Bill! I don't think you need the latest pip. Typically we recommend installing TPOT on top of an Anaconda Python distribution [1], since otherwise installing some of the libraries (especially numpy) can be a pain in Python.
If you keep having issues with the install, please file an issue [2] and let us know your Python install details etc.
[1] https://www.continuum.io/downloads
[2] https://github.com/rhiever/tpot/issues/new
Hmm, I'm a little leery of installing a second Python, but I suppose it can live alongside the OS version okay (Anaconda ahead in the PATH)?
Yep, that’s right. I’ve been running Anaconda on top of the OS Python for ages and it works great. IIRC the Anaconda installer automatically adds the Anaconda install to the PATH.
Yay! Anaconda worked like a charm. Now to try some stuff. Thanks dude!
How long should the example take to run? It’s been running for over 5 minutes. Am I impatient?
$ python
Python 2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org
>>> from tpot import TPOT
>>> from sklearn.datasets import load_digits
>>> from sklearn.cross_validation import train_test_split
>>> digits = load_digits()
>>> X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
… train_size=0.75)
>>> tpot = TPOT(generations=5)
>>> tpot.fit(X_train, y_train)
If you change this line:
tpot = TPOT(generations=5)
to
tpot = TPOT(generations=5, verbosity=2)
It’ll show a progress bar for you. That way you’ll at least get a sense of how it’s progressing. Generally, TPOT will take a while because it’s running k-fold cross-validation on every pipeline on the full training data set.
Nice, I can see it working now! I am anxious to give this a try with some of our data sets. Thanks again!
Sorry, one more thing. Not sure if this is useful, but I got this error at the end of the example:
$ python example.py
Generation 1 - Current best internal CV score: 0.990575903376
Generation 2 - Current best internal CV score: 0.990582851858
Generation 3 - Current best internal CV score: 0.991425080097
Generation 4 - Current best internal CV score: 0.991425080097
Generation 5 - Current best internal CV score: 0.991821054216
Best pipeline: ExtraTreesClassifier(input_matrix, 90, 0.19, 0.33000000000000002)
Traceback (most recent call last):
File "example.py", line 12, in <module>
tpot.score(X_train, y_train, X_test, y_test)
TypeError: score() takes exactly 3 arguments (5 given)
tpot.score() only takes two arguments: X_test and y_test. Are there some old docs I missed that still list it as needing all four?
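In other words, the call should just be:

print(tpot.score(X_test, y_test))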
I am using the example here: http://www.randalolson.com/2015/11/15/introducing-tpot-the-data-science-assistant/
Duh, that is this page! So the example above.
Oops, this page is out of date! I’ll update it. The latest docs are here: http://rhiever.github.io/tpot/using/
Hee, maintenance is a b!
I see you have done some work on a demo that integrates Keras. SUPER!!! ANNs are universal estimators, and except for the "black box" thing they are the way to go. Having a GA identify the "best" architecture and parameters is sorely needed.
Let me know on your progress.
Doc vK
Hello sir. Can you please tell me how TPOT uses genetic algorithms for optimization?