Building a great ML platform using JupyterHub, SageMaker, and Spark on AWS

Pedro Neme
4 min read · Jan 6, 2019


A scalable, powerful, interactive, multi-user-ready platform for analytics and machine learning, 100% in the cloud.

In the last year, one of the challenges I faced while working on a growing and changing data science team was building a platform that would let us solve some of the key difficulties we were battling at the time. Some of these were the following:

  • We were struggling to work with large (really large) volumes of data stored on S3.
  • We had no simple way to train models on GPUs other than an EC2 GPU instance on AWS, which is not very cost-effective if you leave it running and has to be started and stopped manually.
  • We had no easy, multi-user-ready, and stable way to connect to the team's EMR cluster for interactive Spark analytics.
  • We had no simple, standard way to connect to the company's official database engines (Teradata, Presto, Hive) from Python via JDBC connectors.

Of course, since we use a lot of Python and some R, adopting Jupyter notebooks was the first decision we made.

Building around Jupyter and a number of AWS services, we managed to put together a platform that solved most of our problems.

Analytics and ML platform centered on JupyterHub and AWS services
  • Using JupyterHub installed on a large EC2 instance has been great for our multi-user needs: users can install their own kernels and libraries, choose between JupyterLab and classic Jupyter, and, most importantly, work in a secure and stable environment every day. The installation on EC2 is not that difficult if you have a little experience with Linux and with basic settings such as EC2 security groups. A great installation guide, and the one I followed, is the official https://jupyterlab.readthedocs.io (if you don't need a multi-user solution, you can use AWS SageMaker notebooks directly).
  • Using AWS S3 lets you scale your storage effortlessly and gives the team secure, fast storage for all types of data.
  • SageMaker has given us a serverless ML training service with no limit on the number or scale of the jobs we can run. If you don't know SageMaker, I recommend taking a look. Some of its best features are the ability to deploy Jupyter notebooks easily for single users, a large set of built-in algorithms (see some of them here), a great training service that lets you train any kind of model on GPU or CPU using Docker (this also lets you run your training job from a Jupyter notebook like a simple built-in model while keeping the complete model in a GitHub repo; AWS provides a lot of great examples here), and automatic serving of your model if you want it. A minimal sketch of launching such a training job appears further below.
  • Building an in-house Python SDK that wraps JDBC connections to all of our databases was one of the best decisions we have made. It gives us a standard, optimized way to connect to Presto, Hive, and Teradata, execute queries, and display the results in a single line using pandas.
Mockup of a real execution of queries against our Data Warehouse
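Since the screenshot above is only a mockup of our internal SDK, here is a rough, hypothetical sketch of what such a wrapper's query interface could look like. The module and function names are illustrative (not our actual SDK), and under the hood it could use something like JayDeBeApi plus pandas:

```python
# Hypothetical sketch of an in-house JDBC wrapper (illustrative names, not our real SDK).
import jaydebeapi
import pandas as pd

def query_to_df(sql, jdbc_url, driver, jar_path, user, password):
    """Run a SQL query over JDBC and return the result as a pandas DataFrame."""
    conn = jaydebeapi.connect(driver, jdbc_url, [user, password], jars=jar_path)
    try:
        return pd.read_sql(sql, conn)
    finally:
        conn.close()

# One line from a Teradata/Presto/Hive query to a DataFrame.
df = query_to_df(
    "SELECT store_id, SUM(sales) AS sales FROM dw.daily_sales GROUP BY store_id",
    jdbc_url="jdbc:teradata://dwh.example.com/DATABASE=dw",  # placeholder host
    driver="com.teradata.jdbc.TeraDriver",
    jar_path="/opt/jdbc/terajdbc4.jar",
    user="analyst",
    password="********",
)
df.head()
```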

We have also built methods for exporting query results from Teradata to S3 and for uploading data from S3 into a table in the warehouse.

Mockup of a real export from Teradata to s3
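Again, the real export helpers live in the internal SDK; a minimal sketch of the idea, assuming the hypothetical query_to_df helper above and boto3, could look like this:

```python
# Hypothetical sketch: export a Teradata query result to S3 as CSV (illustrative names).
import boto3

def export_query_to_s3(sql, bucket, key, **jdbc_kwargs):
    """Run a query through the JDBC wrapper and write the result to S3."""
    df = query_to_df(sql, **jdbc_kwargs)   # helper sketched above
    csv_body = df.to_csv(index=False)      # CSV held in memory as a string
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=csv_body.encode("utf-8"))

export_query_to_s3(
    "SELECT * FROM dw.daily_sales",
    bucket="my-team-datalake",             # placeholder bucket
    key="exports/daily_sales.csv",
    jdbc_url="jdbc:teradata://dwh.example.com/DATABASE=dw",
    driver="com.teradata.jdbc.TeraDriver",
    jar_path="/opt/jdbc/terajdbc4.jar",
    user="analyst",
    password="********",
)
```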

These easy ways of connecting to multiple platforms and exporting data to S3 save us a lot of time dealing with multiple data sources and let us use S3 as our main data storage.
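Coming back to the SageMaker training service mentioned in the list above, here is a minimal sketch of launching a training job from a notebook with the SageMaker Python SDK (v2 parameter names). The ECR image URI, IAM role, and S3 paths are placeholders; a custom Docker image pushed to ECR is launched the same way as a built-in algorithm:

```python
# Minimal sketch: launch a SageMaker training job for a custom Docker image.
# The ECR image URI, IAM role, and S3 paths below are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",      # GPU instance, billed only while the job runs
    output_path="s3://my-team-datalake/models/",
    hyperparameters={"epochs": 10, "lr": 0.001},
    sagemaker_session=session,
)

# Training data is read from S3; the fitted model artifact lands in output_path.
estimator.fit({"train": "s3://my-team-datalake/training-data/"})

# Optionally serve the trained model behind a real-time endpoint.
# predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```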

  • Spark on EMR has become our main tool for dealing with the large volumes of data stored on S3. Because we want all of the services in a single place, we needed a way to connect to the cluster remotely from our Jupyter notebooks. For this, we use the SparkMagic kernel to connect to EMR via Apache Livy. This has been doing the job for us but is, without a doubt, the most unstable part of the solution. Because of this, for production Spark jobs that run every day we have built an API using boto3 and Flask that executes jobs on EMR in a "serverless" way; it is not interactive, but it is very reliable (you could use https://aws.amazon.com/glue/ for this, but because we need to access S3 buckets in different AWS accounts, we decided to build an in-house solution).
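For reference, the job submission behind that in-house API essentially boils down to a single boto3 call like the one below (the cluster id, bucket, and script path are placeholders, and the Flask layer is omitted):

```python
# Minimal sketch: submit a Spark job as a step on a running EMR cluster via boto3.
# The cluster id, bucket, and script path are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",              # id of the running EMR cluster
    Steps=[
        {
            "Name": "daily-aggregation",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # lets EMR run spark-submit for us
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://my-team-datalake/jobs/daily_aggregation.py",
                ],
            },
        }
    ],
)
print(response["StepIds"])
```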

Working daily in Jupyter with all of these services ready to use has made us more productive and challenged us to build a homemade solution for scheduling the execution of Jupyter notebooks for production jobs.

In the next post, my plan is to dive deeper into some examples of how to use Jupyter notebooks to execute custom SageMaker models and how to schedule their execution with cron using Papermill.
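As a preview, executing a parameterized notebook with Papermill is a single call, which is what a cron entry (or any scheduler) ends up wrapping; the notebook paths and parameters below are placeholders:

```python
# Minimal sketch: run a parameterized notebook with Papermill (paths are placeholders).
import papermill as pm

pm.execute_notebook(
    "train_sagemaker_model.ipynb",              # notebook with a tagged "parameters" cell
    "runs/train_sagemaker_model_latest.ipynb",  # executed copy, doubles as a run log
    parameters={"training_date": "2019-01-06", "instance_type": "ml.p3.2xlarge"},
)
```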

Thank you for reading!

https://www.linkedin.com/in/pedro-neme/
