Google Cloud Datalab vs Anaconda Cloud vs Domino Datalab

An overview of hosted notebook solutions

When it comes to scientific experimentation and collaboration, people tend to need the same things: an easy-to-use interface to hack at and optimize their algorithms, data I/O facilities, and support for their preferred language (R, Julia, Python…).

Jupyter — IPython Notebook

A natural solution to these problems emerged in 2011 with the release of the IPython Notebook (now Jupyter), a web application that lets you create a “notebook” file that serves as:

  • an interactive code interface for over 40 languages
  • a data visualization tool that comes bundled with matplotlib and integrates with other graphing tools such as Plotly and Google Charts
  • a Markdown editor for documenting algorithms, with math formula support through MathJax and LaTeX
Math formulas in Jupyter
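To make the first two points concrete, here’s a minimal sketch of a single notebook cell: compute some data, then chart it inline (numpy and matplotlib both ship with Anaconda; the headless backend below just stands in for the inline display you’d get in Jupyter).

```python
# A minimal sketch of a notebook cell: compute data, then chart it inline.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; in Jupyter you'd use %matplotlib inline
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_title(r"$y = \sin(x)$")  # LaTeX-style math renders in figure labels too
ax.legend()
fig.savefig("sine.png")  # in a notebook, the figure displays below the cell
```

In a live notebook you’d drop the `Agg` backend and `savefig` call; the figure simply appears under the cell.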

A few years back, people would work locally and email their collaborators a new version of their algorithm once they were satisfied with their improvements. A few (gifted) scientists would actually use version control software like Git or Mercurial…

Nowadays though, the scientific community is working more and more tightly with the computer science community. Version control is more of a norm now (thank God!) and people are looking towards cloud solutions for easier and more efficient collaboration.

What are your options when it comes to Jupyter notebook collaboration?

Let’s start with the poor man’s solution: Google Drive or Dropbox. Hosting a notebook file on cloud storage is virtually free these days, but it can prove quite challenging in terms of collaboration.

Git comes to the rescue with the famous GitHub.com platform, which now natively renders the .ipynb (Jupyter Notebook) format. It’s a cheap, great solution that introduces a much-needed layer of version control. Collaboration still isn’t as easy as it could be, because each collaborator needs to run their own local Jupyter server to edit the notebook (e.g. using Anaconda Navigator).

A similar solution would be to use Anaconda Cloud, which also allows notebook sharing and rendering, but offers no online editor or cloud execution so far.

What about Jupyter cloud solutions?

Aside from self-hosting a dedicated Jupyter server, there are (as far as I know) two main Jupyter cloud platforms out there.

Domino Data Lab

A San Francisco-based startup providing a cloud infrastructure built on Amazon AWS.

Based on my experiments, Domino’s solution has the following shortcomings.

  • No on-the-fly editing

DDL’s cloud solution works as a succession of “runs” that execute your whole notebook in one go. There is indeed a web-based IDE that lets you edit your notebook, but you have to save it first and then run the whole thing, which is much, much slower than hacking away at an algorithm locally in Jupyter.

DDL doesn’t seem to be aimed at algorithm development so much as at hosting and distributed execution.

  • No upfront pricing

You have to contact them directly to get a quote based on your projected needs.

  • No version control

Domino Data Lab doesn’t connect with any kind of version control as far as I know (except GitHub for package distribution). They do provide an “auto-save” feature for notebooks.

Concretely, their “run” system described above saves the output of each run (including produced or modified files). It still lacks the strengths of actual version control, including, but not limited to, branching.
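To make the branching point concrete, here’s a minimal sketch (repo name, file contents and commit messages are made up) of what a real VCS gives you that a linear history of run snapshots can’t: fork your work, experiment, and come back to an untouched baseline.

```shell
# Branching with plain git -- something a linear list of run snapshots can't do.
git init demo
git -C demo config user.email "demo@example.com"
git -C demo config user.name "Demo"
echo 'baseline model' > demo/model.ipynb        # stand-in for a real notebook
git -C demo add model.ipynb
git -C demo commit -m "baseline"
git -C demo checkout -b experiment              # fork the work cheaply
echo 'experimental model' > demo/model.ipynb
git -C demo commit -am "try a new optimizer"
git -C demo checkout -                          # back to the original branch
cat demo/model.ipynb                            # the baseline is intact
```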

Google Cloud Datalab (beta)

As of late 2015, Google also provides another Jupyter cloud solution: Google Cloud Datalab.

https://datalab.cloud.google.com

I’ll go over how GCD actually works and some of the caveats I’ve encountered so far:

GCD isn’t actually a platform in itself; it’s a Docker image built and maintained by the Google team on GitHub.

Note: Although Jupyter-based, GCD only supports Python so far. It’s a beta, though, so I’m guessing other languages will follow soon.

The installation process is fairly minimalistic so far: a basic OAuth2 procedure that asks your permission to deploy two servers. One is a Google Compute Engine instance (used only during installation and updates, so remember to shut it down afterwards); the other is a Google App Engine instance, which is the actual Jupyter web server you’ll be using.

GCD installation process
Note: GCD requires a special type of Google App Engine instance called “flexible environment” (formerly “managed VMs”). These are in beta and currently only available in US regions.

You’ll be able to see and manage both instances in your Google Cloud Console, including their scaling settings.

Under the hood, GCD deployment also creates a Google Cloud Repository (beta) branch called “datalab”, which serves as the underlying version control.

Note: Google Cloud Repository is also still a beta service. Although Git-based as well, it is much more limited than GitHub for now. It does allow GitHub/Bitbucket mirroring, which makes it more transparent, but it is nonetheless required by Google Cloud Datalab. That’s a pretty clear technical lock-in right there, and a no-go for many CTOs. Note also that Git GUIs like SourceTree may not work natively with Google Cloud Repository due to Google’s OAuth-based authentication.

One major strength that GCD does have, though, is on-the-fly online editing (finally!). You can indeed hack away at your algorithm online, extract data, and chart it (with Google Charts, Plotly, matplotlib…). Code execution is handled by the underlying Google App Engine instance.

Combining Pandas dataframe with Plot.ly graphing in Google Cloud Datalab
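As a sketch of that workflow (the data here is made up, and I’ve charted through pandas/matplotlib rather than Plotly to keep the example self-contained): build a DataFrame, then chart it straight from the cell.

```python
# Hypothetical data; a pandas DataFrame charted directly from a notebook cell.
import matplotlib
matplotlib.use("Agg")  # headless backend; inline in an actual notebook
import pandas as pd

df = pd.DataFrame({
    "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
    "visits": [120, 150, 90, 200, 170],
})
ax = df.plot(x="day", y="visits", kind="bar", legend=False)
ax.set_ylabel("visits")
ax.figure.savefig("visits.png")  # renders inline in Jupyter/GCD instead
```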

Practically speaking, version control works, but it’s kind of a mess to manage: although there’s an auto-save feature for the notebook itself, you still have to manually commit your online changes through the online interface.

matplotlib works natively in Jupyter (and in GCD as well)

GCD’s greatest strength, in my opinion, is its native connection with Google BigQuery. Where you used to work with CSV/JSON files or a small MySQL database, you can now work with a big data infrastructure at very little additional cost.

Google Cloud Datalab + Google Big Query + Google Charts
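The shape of that BigQuery workflow looks roughly like the cell below. Treat it as pseudocode: it only runs inside a Datalab notebook with project credentials, and the module and method names are as I recall them from the beta-era datalab documentation.

```python
# Only runs inside a Google Cloud Datalab notebook -- a sketch, not tested here.
import gcp.bigquery as bq   # Datalab's beta-era BigQuery module (name assumed)

# Query a public sample table straight into a pandas DataFrame.
query = bq.Query(
    'SELECT word, word_count FROM [publicdata:samples.shakespeare] LIMIT 10')
df = query.to_dataframe()
df.head()
```

From there, `df` is an ordinary pandas DataFrame, so everything shown earlier (charting, matplotlib, Plotly) applies unchanged.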

Collaboration is pretty rough so far as well: you need to grant your collaborators “Project Owner” permission for them to simultaneously access Google Cloud Datalab’s GAE instance, BigQuery, and Google Cloud Repository (Git)… which is pretty lax, security-wise. Once granted, though, they can hack away at your algorithm and commit their improvements to the underlying repository, which is basically what you’re looking for.

Conclusion

Google Cloud Datalab is still far from production-ready, but it’s a very promising solution: the only one so far that comes close to the ease of use of a local Jupyter server, with the added benefits of cloud hosting (version control, scaling, sharing, etc.).

The proof of concept is indeed there; it merely lacks some polish around the edges (more language support, easier notebook collaboration, better version control…).

With the recent release of Google Data Studio (beta), it seems Google aims to provide a complete data toolkit. Let’s see whether they manage to deliver the complete product they’ve set their sights on.