From Science to Production: Unleash your Jupyter Notebooks
by Lori Eich
Data scientists are explorers. They use Jupyter Notebooks, one of the most popular environments for data science analysis, to begin work toward creative solutions to big problems. But once those solutions are discovered…what’s the next step? In order for data scientists to make a major impact, the creativity that starts in notebooks needs to find its way out to the rest of the organization — to the decision makers who need to make choices grounded in truth. Civis Platform provides a framework that delivers the Jupyter environment in a way that makes notebooks shareable, discoverable, scalable, and secure, and with a path to production. Jupyter Notebooks are where data science starts — and now that they are included as part of Civis Platform, that exploration is only the beginning.
As a product manager for Civis Platform, I pay attention to the tools our data scientists like to use. More and more, I have noticed them doing their initial exploratory Python and R work in Jupyter Notebooks. This isn’t a surprise, as Jupyter is a powerful open-source tool that offers a fast, responsive environment for data scientists to explore data, see results, iterate, and make progress on big problems.
We love Jupyter because we love science. We scientists are tinkerers. We love digging into the nuts and bolts of our research, looking for patterns and excavating insights from mountains of data. Jupyter enables data scientists to dive into a problem and immediately start looking for answers.
A while ago, a client asked us to solve a problem with noisy, constantly changing, oddly-formatted data. Jupyter was key to success on this project — but the experience also exposed a series of challenges that we’re excited to be solving. For the time series data in this project, there was no out-of-the-box modeling algorithm that would work. Our data scientists had to explore spin-off solutions from existing algorithms and even tried to write some new algorithms from scratch. We were on a tight timeline and had to experiment with data cleaning, modeling, and validation on unfamiliar datasets. Thanks to Jupyter, we were able to use our favorite Python and R libraries, visualize the data alongside the code, and document our strategies as we went, which led to quick iterations toward a solution.
However, our team still struggled when it was time to bring everybody’s work together to deliver our solution. Each person on the team was working on a different piece of model validation code so that we could compare our models to each other, but they were working in their own Jupyter notebooks on their laptops. To be truly effective, everyone’s work needed to be merged back together. In order to share their code, the team needed to be working from identical environments and data sources, but that was almost impossible due to the constantly changing data source and creative solutions involved in this project. Collaborating was tedious and required a lot of extra effort: our team had to email notebooks to each other, try to coordinate the timing of git commits, and of course ended up with a long list of files with names like “Customer Segmentation 2017-04-13 - LE” in a shared online folder. We lost hours just untangling the mess of trying to work together.
Experiencing this struggle first-hand made us realize not only that our team’s work requirements were evolving, but that data science teams everywhere were probably experiencing this, too. We realized that Civis Platform had an opportunity to enable a culture change for data science teams. Our cloud-based platform already includes a high-security, collaborative sharing environment. By adding Jupyter Notebooks as another shareable feature inside of our platform, we get a double win: Civis Platform now has the tool that data scientists love to use, and data scientists are able to easily collaborate in their Jupyter Notebooks in a way that just wasn’t possible in a local environment.
But again…exploration is just the beginning. A truly data-driven organization takes those exploratory insights and puts them into a place where decision makers can use them to make choices grounded in truth. With Jupyter Notebooks in Civis Platform, data scientists can take their exploratory code from analysis and push it straight into production scripts. Your data is in the cloud, and your notebooks connect to it through the Civis Data Science API. You can ship your notebook code to a production-ready R or Python script, which interacts with the same API client and points to the same datasets as your notebook. Perhaps most importantly for busy data science teams, R and Python scripts can be added to a workflow and put on a schedule, so you can set up an automated pipeline that delivers a report directly to anybody who needs your research to make important decisions.
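To make the notebook-to-script step concrete, here is a minimal, generic sketch of what "shipping notebook code to a script" can look like. It is not Civis Platform's own export mechanism, which this post does not describe. It relies only on the fact that a Jupyter notebook file is JSON with a list of cells, and it uses nothing beyond Python's standard library. The function name `notebook_to_script` and the example notebook are hypothetical, chosen purely for illustration. In everyday use, `jupyter nbconvert --to script` does this job more thoroughly.

```python
import json


def notebook_to_script(notebook_json: str) -> str:
    """Concatenate the code cells of an nbformat-4 notebook into one script.

    Hypothetical helper for illustration; not part of any Civis or Jupyter API.
    """
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        # Markdown and raw cells are documentation, not executable code.
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows "source" to be a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Simplification: drop lines that start with "%" or "!" (IPython magics
        # and shell escapes), since they are not valid in a plain Python script.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        code = "\n".join(lines).strip()
        if code:
            chunks.append(code)
    return "\n\n".join(chunks) + "\n"


# A tiny made-up notebook: one markdown cell and two code cells.
example_notebook = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Exploration notes"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": 1,
         "source": ["%matplotlib inline\n", "rows = [1, 2, 3]\n"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": 2,
         "source": "total = sum(rows)\nprint(total)"},
    ],
})

script = notebook_to_script(example_notebook)
print(script)
```

The resulting text contains only the executable code, in cell order, so it can be saved as a `.py` file and run on a schedule. A real pipeline would also need to handle cells that depend on out-of-order execution, which a notebook allows and a script does not.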
We want data scientists to be able to work together to confront big questions, and then be able to scale up to solve big problems. A data scientist working in a silo is the past — the future of data science is teamwork. Civis Platform gives data science teams the tools they love, the data they need, and the ability to launch their analysis into action to make an impact.
By the way, we’re at JupyterCon this week — come stop by, talk data science (and notebooks!) with us! And be sure to check out a talk I’m giving with Skipper Seabold on Friday.