The list of data science and machine learning platforms is growing fast as the newly emerging market of Machine Learning as a Service (MLaaS) develops, making it difficult for businesses to decide which solution best fits their needs. The Gartner Magic Quadrant for Data Science and Machine-Learning Platforms provides an excellent evaluation of 16 vendors to help businesses identify the right platform as a function of their needs. However, the study covers neither Google’s nor Amazon’s machine learning services.
Therefore, I have examined some of the services offered by the two tech giants to understand their utility as end-to-end data science and machine learning platforms. The explored services are Google Cloud Datalab, Google Cloud ML Engine, and Amazon SageMaker. In this article, I present a discussion of their similarities and differences and explain why I believe Amazon provides the better end-to-end data science and machine learning platform.
A Notebook Environment
Notebooks are an incredibly powerful tool for interactively developing and presenting data science projects. A notebook integrates code along with its output into a single document that combines visualizations, text, mathematical equations, and other rich media. Notebooks are probably the most popular environment for developing machine learning models among data scientists. Demonstrating a good understanding of data science needs, both Google Cloud Datalab and Amazon SageMaker offer a fully managed cloud notebook environment.
Both cloud platforms offer easy deployment of a notebook. In Datalab, all you have to do to launch an existing notebook instance is to open your terminal and type:
datalab connect <datalab_instance_name>
When the connection to the notebook instance succeeds, you can connect to the notebook through your browser at http://localhost:8081/.
The Datalab environment is quite similar to a Jupyter notebook, but with a different UI.
SageMaker, on the other hand, provides one-click notebook deployment through your SageMaker console, and it looks exactly like a Jupyter notebook, because it is.
Notebook Kernels and Pre-loaded Libraries
SageMaker Jupyter notebooks come pre-configured with kernels supporting PySpark, SparkR, MXNet, TensorFlow, PyTorch, and Chainer. On the other hand, Datalab comes with only TensorFlow pre-installed. Both notebooks support Python 2 and Python 3.
In any case, new packages can be installed with conda or pip. You can execute shell commands directly in a Datalab or SageMaker notebook cell by prefixing them with an exclamation mark (!), for example:
!pip install <package_name>
Distributed Training
Distributed training of machine learning algorithms is one of the most powerful methods for exploiting very large datasets. The principle is simple: divide the data among N computing nodes so that your learning algorithm trains (almost) N times faster.
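To make the data-parallel idea concrete, here is a toy sketch in plain Python. It is purely illustrative and not tied to either platform's API: each "worker" computes a partial gradient on its shard of the data, and summing the partial results recovers the full-batch gradient.

```python
# Toy illustration of data-parallel training. All names here are
# illustrative; this is not SageMaker's or ML Engine's API.

def shard(data, n_workers):
    """Split the dataset into n_workers (nearly) equal shards."""
    k, m = divmod(len(data), n_workers)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n_workers)]

def partial_gradient(w, shard_data):
    """Gradient of the squared error 0.5*(w*x - y)^2, summed over a shard."""
    return sum((w * x - y) * x for x, y in shard_data)

def distributed_gradient(w, data, n_workers):
    """Sum the per-shard gradients -- equivalent to one full pass."""
    return sum(partial_gradient(w, s) for s in shard(data, n_workers))

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
single = partial_gradient(0.5, data)                      # one node
multi = distributed_gradient(0.5, data, n_workers=2)      # two "nodes"
```

Because the gradient decomposes into a sum over examples, the two results are identical; in practice the speedup comes from computing the per-shard sums in parallel on separate machines.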
SageMaker simplifies training your models on a set of distributed compute engines. For its built-in algorithms, you only have to specify the size and number of instances you want. SageMaker automatically releases these compute resources as soon as training ends. SageMaker also gives you the ability to execute distributed training jobs for your custom algorithms.
In contrast, Datalab does not offer distributed training; it trains models on the Datalab instance itself. Google Cloud ML Engine, a separate service often used in conjunction with Datalab, does allow distributed training of TensorFlow models.
Model Deployment
SageMaker provides model hosting services for model deployment. It creates an HTTPS endpoint where your machine learning model is available to provide inferences, and it gives the user rich model management capabilities. Here is a list of the model management features I find most useful:
- You can deploy multiple variants of a model to the same HTTPS endpoint. This is useful for testing variations of a model in production. (Google also supports multiple versions of a model, but without control over the traffic split.)
- You can configure a deployed model to scale elastically with the workload: automatic scaling dynamically adjusts the number of instances provisioned for a deployed model in response to changes in demand. (Google offers comparable scaling for batch predictions.)
- You can modify an endpoint without taking models that are already deployed into production out of service. For example, you can add new model variants, update the ML Compute instance configurations of existing model variants, or change the distribution of traffic among model variants. (Google supports similar updates, except for traffic control.)
- It is possible to deploy your own models. You can package your own algorithms for use with SageMaker, regardless of programming language or framework. You package the algorithm and inference code in Docker images, and use the images to train a model and deploy it.
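To build intuition for the traffic-distribution feature, here is a toy sketch in plain Python of weighted routing between two model variants behind a single endpoint. The variant names and the routing logic are invented for illustration; this is not how SageMaker is implemented internally.

```python
import random
from collections import Counter

# Toy model of weighted traffic splitting between two model variants
# behind one endpoint (illustrative only).

def route(variants, weights, n_requests, seed=0):
    """Assign each incoming request to a variant with the given weights."""
    rng = random.Random(seed)
    return Counter(rng.choices(variants, weights=weights, k=n_requests))

# Send 90% of traffic to the current model, 10% to a candidate.
counts = route(["variant-a", "variant-b"], [0.9, 0.1], 10_000)
```

Shifting the weights gradually (e.g. 90/10, then 50/50, then 0/100) is the usual way to promote a new variant while watching its live performance.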
Cloud ML Engine allows hosting trained TensorFlow, scikit-learn, and XGBoost models in the cloud behind a REST API. It provides two methods of serving predictions:
- Online prediction is a service optimized to run your data through hosted models with as little latency as possible.
- Batch prediction is a service best used when you don’t need your predictions in real-time, or when you have a large number of instances to get predictions for. When using batch prediction, the prediction service automatically scales compute resources depending on the size of the batch.
At the time of writing, it is not possible to deploy a custom model class that merely follows the scikit-learn API. Therefore, the only way to deploy your own algorithm on Cloud ML Engine is to write it in TensorFlow.
Automatic Hyperparameter Tuning
Hyperparameters are the parameters of a model that describe the training process itself and are set before the learning process begins. For example, the hyperparameters of a neural network could be the learning rate, the number of hidden layers, the number of neurons in each layer, the activation functions, etc.
The most common strategies used by data scientists for finding good values of the hyperparameters are:
- Manual trial and error. The data scientist guesses good settings of the hyperparameters based on their experience and evaluates them by cross-validation.
- Grid search. An exhaustive search through a manually specified subset of the learning algorithm's hyperparameter space.
- Random sampling. Replaces the exhaustive enumeration of all combinations with randomly selected ones.
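The difference between the last two strategies can be sketched in a few lines of plain Python. The hyperparameter names, ranges, and the "score" function below are invented for illustration; in reality each call to score would be an expensive training-plus-cross-validation run.

```python
import itertools
import random

def score(lr, n_hidden):
    """Stand-in for an expensive cross-validation run.

    Pretend the sweet spot is lr=0.1 with 64 hidden units."""
    return -((lr - 0.1) ** 2 + ((n_hidden - 64) / 64) ** 2)

lrs = [0.001, 0.01, 0.1, 1.0]
hidden = [16, 32, 64, 128]

# Grid search: every combination is evaluated (4 x 4 = 16 runs here).
grid_best = max(itertools.product(lrs, hidden), key=lambda p: score(*p))

# Random sampling: a fixed budget of 8 draws from the same ranges.
rng = random.Random(0)
samples = [(rng.choice(lrs), rng.choice(hidden)) for _ in range(8)]
rand_best = max(samples, key=lambda p: score(*p))
```

Grid search is guaranteed to visit the best grid point but costs one evaluation per combination; random sampling caps the budget but may miss the optimum. Bayesian optimization, discussed next, tries to get the best of both.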
The problem with these methods is that they require a large number of evaluations to find a good setting of the hyperparameters. These evaluations are very expensive as they involve training and cross-validating the learning algorithm.
SageMaker and Google Cloud ML Engine offer automatic hyperparameter tuning based on a technique called Bayesian optimization. In practice, Bayesian optimization has been shown to obtain better results in fewer evaluations compared to grid search and random search, due to its ability to reason about the quality of experiments before they are run.
In SageMaker, this feature works not only for the built-in algorithms but also for bring-your-own training jobs in Docker. Google Cloud ML Engine, on the other hand, only provides this feature for TensorFlow models.
The following table summarizes the differences and similarities between Amazon’s and Google’s cloud machine learning services.