Can AutoML beat humans on Kaggle?
Automated Machine Learning (AutoML) is poised to make a transformative impact on data science in 2017. At the University of Pennsylvania, we’ve been working hard to develop TPOT, a state-of-the-art open source AutoML tool that optimizes machine learning pipelines for supervised learning problems.
Now we’d like to see what you can do with TPOT.
Over the next couple of months, we’re going to challenge you to apply TPOT to any data science problem you find interesting on Kaggle. If your entry ranks in the top 25% of the leaderboard on a Kaggle problem, we want to see how TPOT helped you accomplish that.
At the end of the competition, the TPOT team will review all entries, rank them, and award (monetary!) prizes to the top 3 entries: $500, $250, and $100, respectively. We’ll also post a write-up highlighting the best entries after the competition.
Entries will be judged on the rank they achieved on the Kaggle problem as well as on the technical write-up provided with the entry.
Email your entries to firstname.lastname@example.org
Entries are due on August 7, 2017, and winners will be announced the following week.
Getting started with TPOT
If you’re new to TPOT and AutoML, you’re in luck! We’ve written extensive documentation describing how to use TPOT, including a User Guide and API documentation. To give you a starting point, below is a basic example applying TPOT to scikit-learn’s handwritten digits dataset (a small MNIST-like dataset).
from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
This Python code will fit TPOT on the training data, score it on the testing data, then export the optimized pipeline as Python code to the file “tpot_mnist_pipeline.py”.
Importantly, the machine learning algorithms that TPOT uses can be heavily customized. Essentially, TPOT can optimize pipelines for any machine learning algorithm that follows the scikit-learn interface. You can read more about customizing TPOT’s configuration in the User Guide.
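As a rough illustration of that customization, here is a minimal sketch of a restricted search space. It assumes a TPOT release whose TPOTClassifier accepts a config_dict argument, where each key is the import path of a scikit-learn-style estimator and each value maps hyperparameter names to candidate values. The particular estimators and value ranges, and the names custom_config and build_tpot, are illustrative choices of ours, not recommendations from the TPOT team.

```python
# Hypothetical search space (illustrative choices, not TPOT defaults).
# Keys: import paths of estimators that follow the scikit-learn interface.
# Values: hyperparameter name -> list of candidate values TPOT may try.
custom_config = {
    "sklearn.naive_bayes.GaussianNB": {},
    "sklearn.tree.DecisionTreeClassifier": {
        "criterion": ["gini", "entropy"],
        "max_depth": list(range(1, 11)),
    },
    "sklearn.neighbors.KNeighborsClassifier": {
        "n_neighbors": list(range(1, 26)),
        "weights": ["uniform", "distance"],
    },
}


def build_tpot(config):
    """Create a TPOTClassifier restricted to the operators in `config`.

    Assumes the installed TPOT version supports the `config_dict` argument.
    """
    # Imported lazily so the dictionary can be inspected without TPOT installed.
    from tpot import TPOTClassifier

    return TPOTClassifier(
        generations=5,
        population_size=50,
        verbosity=2,
        config_dict=config,
    )
```

Calling build_tpot(custom_config) and then fit(X_train, y_train) on the result should work the same way as in the basic example, with the search limited to the three listed estimators.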
Getting started with Kaggle
If you’re new to Kaggle, its website provides an extensive list of currently running competitions. One of the classic Kaggle problems is the Titanic challenge, where you’re tasked with using machine learning to predict who survived the Titanic disaster.
We have a Jupyter Notebook tutorial that walks you through a basic approach to solving Kaggle’s Titanic challenge using TPOT.
Additional competition notes
- Every entry must have an open source license so we can share and discuss everyone’s entries.
- We recommend using Jupyter Notebooks for the entries, but any format is acceptable as long as it has the code and write-up.
- An entry must rank in the top 25% of the leaderboard on August 7 to count toward the competition.
- Kaggle tutorial problems, such as the Titanic problem, will not count toward the competition.
- We’ve created a public GitHub repository for anyone who wants to collaborate on entries for the competition.
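If you want a quick sanity check on the top 25% requirement in the notes above, the arithmetic can be sketched as follows. The helper name in_top_quarter is hypothetical, and the boundary rule (a rank counts if rank divided by the number of teams is at most 0.25) is our assumption; the notes do not say how ties or rounding are handled.

```python
def in_top_quarter(rank, total_teams):
    """Return True if a 1-based leaderboard rank falls in the top 25%.

    Assumption: a rank qualifies when rank / total_teams <= 0.25.
    """
    if total_teams <= 0 or not 1 <= rank <= total_teams:
        raise ValueError("rank must be between 1 and total_teams")
    # Integer comparison avoids floating-point rounding at the boundary.
    return rank * 4 <= total_teams


print(in_top_quarter(250, 1000))  # rank 250 of 1000 teams
print(in_top_quarter(251, 1000))  # rank 251 of 1000 teams
```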