UPD (April 20, 2016): Scikit Flow has been merged into TensorFlow since version 0.8 and is now called TensorFlow Learn or tf.learn.
Google released a machine learning framework called TensorFlow and it’s taking the world by storm: 10k+ stars on GitHub, a lot of publicity, and general excitement among AI researchers.
But how do you use it for the kind of regular problem a Data Scientist may have? (If you are an AI researcher, we will build up to interesting problems over time.)
Why do I care?
A reasonable question: why, as a Data Scientist who already has a number of tools in your toolbox (R, Scikit Learn, etc.), should you care about yet another framework?
The answer has two parts:
- The Deep Learning part of TensorFlow allows you to stack a number of different models and transformations in one model and learn them all together. You can handle text, images, and regular categorical and continuous variables inside one model with ease. It’s also easy to do multi-target and multi-loss training at the same time, pre-training, and a host of other ML techniques that would be either hard or impossible in conventional setups.
- The pipelining part of TensorFlow will grow into a powerful way of processing data. In the future, data processing and machine learning will all be done in one framework, and TensorFlow pushes in that direction.
Simple model for Titanic dataset
Let’s start with a simple example: take the Titanic dataset from Kaggle. First, install the required libraries:
pip install numpy scipy scikit-learn pandas
# For Ubuntu:
pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0-cp27-none-linux_x86_64.whl
# For Mac:
pip install https://storage.googleapis.com/tensorflow/mac/tensorflow-0.8.0-py2-none-any.whl
You can get the dataset and the code from http://github.com/ilblackdragon/tf_examples
Take a quick look at the data (use IPython or an IPython notebook for ease of interactive exploration):
>>> import pandas
>>> data = pandas.read_csv('data/titanic_train.csv')
>>> data.columns
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')
>>> data[:1]
   PassengerId  Survived  Pclass                     Name   Sex  Age  SibSp
0            1         0       3  Braund, Mr. Owen Harris  male   22      1
   Parch     Ticket  Fare Cabin Embarked
0      0  A/5 21171  7.25   NaN        S
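Before modelling, it helps to check which columns contain missing values, since the next step fills them with zeros. The following is a minimal sketch that assumes a tiny hand-made frame in place of the real CSV: the first row uses the values shown above, while the other two rows and the variable names `toy`, `numeric_cols`, `missing` and `filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data/titanic_train.csv: row 0 uses the values shown
# above, rows 1 and 2 are invented for illustration only.
toy = pd.DataFrame({
    'Survived': [0, 1, 1],
    'Age': [22.0, np.nan, 35.0],
    'SibSp': [1, 0, 0],
    'Fare': [7.25, 8.05, np.nan],
})

numeric_cols = ['Age', 'SibSp', 'Fare']

# Count missing values per column.
missing = toy[numeric_cols].isnull().sum()
print(missing)

# Replace missing values with zeros, as fillna(0) does in the post.
filled = toy[numeric_cols].fillna(0)
print(filled)
```

Filling with zero is the simplest option; it treats an unknown age as age 0, which is crude but keeps the example short.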
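Let’s test how well we can predict the Survived class from the numeric variables using Scikit Learn:
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
>>> from sklearn.metrics import accuracy_score
>>> y, X = data['Survived'], data[['Age', 'SibSp', 'Fare']].fillna(0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> lr = LogisticRegression()
>>> lr.fit(X_train, y_train)
>>> print(accuracy_score(y_test, lr.predict(X_test)))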
We separate the dataset into features and target, fill the missing values with zeros, hold out 20% of the rows for testing, and fit a logistic regression on the rest. Accuracy on the held-out rows gives us a rough measure of model quality (a single split is not a thorough evaluation, but it is enough for a simple comparison).
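To make the difference between training accuracy and held-out accuracy concrete, here is a minimal sketch of the same workflow. It assumes synthetic data rather than the real Titanic file: the feature values and the target rule are made up, and `max_iter=1000` is just a convenient setting so the solver converges on unscaled features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (NOT the real data).
rng = np.random.RandomState(42)
n = 500
X = pd.DataFrame({
    'Age': rng.uniform(1, 80, n),
    'SibSp': rng.randint(0, 4, n),
    'Fare': rng.exponential(30, n),
})
# Knock out some ages to mimic missing values, then fill with zeros.
X.loc[rng.rand(n) < 0.2, 'Age'] = np.nan
X = X.fillna(0)
# Made-up target loosely tied to Fare so there is something to learn.
y = (X['Fare'] + rng.normal(0, 20, n) > 30).astype(int)

# Hold out 20% of the rows; the model never sees them during fit().
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

train_acc = accuracy_score(y_train, lr.predict(X_train))
test_acc = accuracy_score(y_test, lr.predict(X_test))
print('train accuracy: %.3f' % train_acc)
print('held-out accuracy: %.3f' % test_acc)
```

The held-out number is the one to compare between models, because the training number can look good simply because the model has already seen those rows.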
Now using tf.learn (previously Scikit Flow):
>>> from tensorflow.contrib import learn
>>> import random
>>> random.seed(42) # to sample data the same way
>>> classifier = learn.LinearClassifier(n_classes=2,
>>> classifier.fit(X_train, y_train, batch_size=128, steps=500)
>>> print(accuracy_score(y_test, classifier.predict(X_test)))
Congratulations, you just built your first TensorFlow model!
TF.Learn (previously Scikit Flow)
TF.Learn is a library that wraps many of TensorFlow’s new APIs in the nice and familiar Scikit Learn API.
TensorFlow is all about building and executing a graph. This is a very powerful concept, but it is also cumbersome to start with.
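The build-then-execute idea can be illustrated without TensorFlow at all. The sketch below is a toy in plain Python and not TensorFlow's API: the `Node` class and the `placeholder` and `run` functions are hypothetical names made up for illustration. Building the expression only records operations; nothing is computed until `run` is called with input values.

```python
class Node(object):
    """A deferred operation: stores what to compute, not the result."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op          # 'placeholder', 'add' or 'mul'
        self.inputs = inputs  # upstream Node objects
        self.name = name      # only used by placeholders

    def __add__(self, other):
        return Node('add', (self, other))

    def __mul__(self, other):
        return Node('mul', (self, other))


def placeholder(name):
    """An input whose value is supplied only at execution time."""
    return Node('placeholder', name=name)


def run(node, feed):
    """Evaluate the graph ending at `node`, using values from `feed`."""
    if node.op == 'placeholder':
        return feed[node.name]
    left, right = [run(n, feed) for n in node.inputs]
    if node.op == 'add':
        return left + right
    if node.op == 'mul':
        return left * right
    raise ValueError('unknown op: %s' % node.op)


# Step 1: build the graph. No arithmetic happens here.
x = placeholder('x')
w = placeholder('w')
b = placeholder('b')
y = x * w + b

# Step 2: execute it, possibly many times with different inputs.
first = run(y, {'x': 2.0, 'w': 3.0, 'b': 1.0})
second = run(y, {'x': 10.0, 'w': 0.5, 'b': -1.0})
print(first)
print(second)
```

Separating the two steps is what makes this style cumbersome at first: `y` is a description of a computation, not a number, until it is run.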
Looking under the hood of TF.Learn, we just used three parts:
- layers: a set of advanced TensorFlow functions that make it easy to build complex graphs, from fully connected layers, convolutions, and batch norm to losses and optimization.
- graph_actions: a set of tools for training, evaluating, and running inference on TensorFlow graphs.
- Estimator: packages everything in a class that follows the Scikit Learn interface and provides a way to easily build and train custom TensorFlow models. Subclasses of Estimator such as LinearClassifier, LinearRegressor, and DNNClassifier are pre-packaged models, similar to Scikit Learn’s LogisticRegression, that can be used in one line.
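To make the Estimator idea concrete, here is a minimal sketch of a class that follows the Scikit Learn `fit`/`predict` convention and trains a linear classifier with mini-batch gradient descent. It is written in plain NumPy and is not how TF.Learn is implemented; the class name `TinyLinearClassifier` and its `learning_rate` and `seed` parameters are hypothetical, while `batch_size` and `steps` mirror the arguments passed to `fit` earlier. The demo data is synthetic.

```python
import numpy as np


class TinyLinearClassifier(object):
    """Binary logistic regression with a Scikit Learn style interface."""

    def __init__(self, learning_rate=0.1, seed=42):
        self.learning_rate = learning_rate
        self.seed = seed

    def fit(self, X, y, batch_size=128, steps=500):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        rng = np.random.RandomState(self.seed)
        self.w_ = np.zeros(X.shape[1])
        self.b_ = 0.0
        for _ in range(steps):
            # Sample a mini-batch of rows (with replacement).
            idx = rng.randint(0, X.shape[0], size=batch_size)
            xb, yb = X[idx], y[idx]
            # Forward pass: predicted probability of class 1.
            p = 1.0 / (1.0 + np.exp(-(xb.dot(self.w_) + self.b_)))
            # Gradient of the mean log loss with respect to w and b.
            err = p - yb
            self.w_ -= self.learning_rate * xb.T.dot(err) / batch_size
            self.b_ -= self.learning_rate * err.mean()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Class 1 when the linear score is positive (probability > 0.5).
        return (X.dot(self.w_) + self.b_ > 0).astype(int)


# Synthetic, linearly separable data (made up for illustration).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (X_demo[:, 0] - X_demo[:, 1] > 0).astype(int)

clf = TinyLinearClassifier().fit(X_demo, y_demo, batch_size=128, steps=500)
demo_accuracy = (clf.predict(X_demo) == y_demo).mean()
print('training accuracy: %.3f' % demo_accuracy)
```

Because the class exposes only `fit` and `predict`, it can be swapped in wherever a Scikit Learn classifier is expected, which is the convenience the Estimator wrapper is aiming for.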
Even as you get more familiar with TensorFlow, pieces of TF.Learn will remain useful (such as graph_actions, layers, and a host of other ops and tools). See future posts for examples of handling categorical variables, text, and images.