How to Inspire Code Reuse in Data Analytics

DataKitchen
Apr 20, 2017 · 3 min read

Seven Steps to Implementing DataOps: Step 5 — Reuse & Containerize

In a previous blog we introduced DataOps, a new approach to data analytics that can put analytics professionals at the center of the company’s strategy, advancing its most important objectives. DataOps doesn’t require you to throw away your existing tools and start from scratch. With DataOps, you can keep the tools you currently use and love. You may be surprised to learn that an analytics team can migrate to DataOps in seven simple steps. This blog entry is step 5 of 7.

In DataOps, the data analytics team moves at lightning speed using highly optimized tools and processes. One of the most important productivity tools is the ability to reuse and containerize code.

When we talk about reusing code, we mean reusing data analytics components. As we wrote in an earlier blog entry, we think of all the files that comprise the data analytics pipeline — scripts, source code, algorithms, HTML, configuration files, parameter files — as code. As in other software development, code reuse can significantly boost coding velocity.

Code reuse saves time and resources by leveraging existing tools, libraries, or other code in the development or extension of new code. If a software component took several months to develop, every project that reuses it effectively saves the organization those months of development time. This practice can significantly decrease project budgets. In other cases, code reuse makes it possible to complete projects that would have been impossible if the team were forced to start from scratch.

Containers make code reuse much simpler. A container packages everything needed to run a piece of software — code, runtimes, tools, libraries, configuration files — into a stand-alone executable unit. Containers are somewhat like virtual machines, but they use fewer resources because they do not include a full operating system. A given hardware server can therefore run many more containers than virtual machines.

A container eliminates the problem in which code runs on one machine but not on another because of slight differences in the setup and configuration of the two servers or software environments. A container enables code to run the same way on every machine by automating the task of setting up and configuring the machine environment. This is one DataOps technique that facilitates moving code from development to production — the run-time environment is the same for both. One popular open-source container technology is Docker.
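As a minimal sketch of what this looks like in practice, the Dockerfile below packages a hypothetical Python analytics step together with a pinned runtime and its library dependencies. (The file names here are illustrative; substitute your own step’s code and configuration.)

    # Hypothetical example: package one analytics step with its runtime and libraries.
    FROM python:3.10-slim

    WORKDIR /app

    # Pin library versions so every environment gets identical dependencies.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the step's code and configuration into the image.
    COPY transform.py config.yml ./

    # The container runs the same command on any machine that has Docker.
    ENTRYPOINT ["python", "transform.py", "--config", "config.yml"]

Build the image once with docker build -t transform-step . and it runs identically on a developer’s laptop and a production server with docker run transform-step.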

Each step in the data-analytics pipeline consumes the output of the prior stage and produces the input to the next stage. It is cumbersome to work with an entire data-analytics pipeline as one monolith, so it is common to break it down into smaller components. On a practical level, smaller components are much easier for other team members to reuse.
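To illustrate (with hypothetical step names and file paths, not taken from a real pipeline), a pipeline decomposed into small Python components might look like this, with each step a self-contained function that another team member can pick up and reuse:

    # Hypothetical sketch: a pipeline built from small, reusable steps.
    import json

    def extract(source_path):
        """Read raw records from a file with one JSON object per line."""
        with open(source_path) as f:
            return [json.loads(line) for line in f]

    def clean(records):
        """Drop records that are missing required fields."""
        return [r for r in records if "customer_id" in r and "amount" in r]

    def aggregate(records):
        """Total the amounts per customer."""
        totals = {}
        for r in records:
            totals[r["customer_id"]] = totals.get(r["customer_id"], 0) + r["amount"]
        return totals

    if __name__ == "__main__":
        # Each stage consumes the previous stage's output.
        print(aggregate(clean(extract("orders.jsonl"))))

Because each stage depends only on its input, a component like clean or aggregate can be dropped into another pipeline unchanged.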

Some steps in the data-analytics pipeline are messy and complicated. For example, one operation might call a custom tool, run a Python script, transfer files over FTP, and apply other specialized logic. Such an operation might be both hard to set up, because it requires a specific set of tools, and difficult to create, because it requires a specific skill set. This scenario is another common use case for a container. Once the code is placed in a container, it is much easier for other programmers to use; they don’t need to be familiar with the custom tools inside the container, only with its external interfaces. All of the complexity is embedded inside the container, and it is also easier to deploy that code to different environments. Containers make code reuse much more turnkey and allow developers much greater flexibility in sharing their work with each other.
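As a sketch of that pattern (all names here are illustrative), the Dockerfile below hides a custom tool, a Python script, and an FTP client behind a single entry point. A teammate only needs to know the container’s external interface (the docker run command and its arguments), not the tools inside:

    # Hypothetical example: embed messy internals behind a simple container interface.
    FROM python:3.10-slim

    # Install the specialized tooling this step needs (an FTP client, for example).
    RUN apt-get update && apt-get install -y --no-install-recommends ftp \
        && rm -rf /var/lib/apt/lists/*

    WORKDIR /app

    # The custom tool, the Python script, and a wrapper that orchestrates them.
    COPY custom_tool/ custom_tool/
    COPY fetch_and_transform.py run_step.sh ./
    RUN chmod +x run_step.sh

    # One clean entry point; callers never see the complexity inside.
    ENTRYPOINT ["./run_step.sh"]

Someone unfamiliar with the internals can still run the step with docker run messy-step input.csv output.csv and deploy the same image to any environment.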

We will discuss other ways that DataOps enhances the flexibility of the data analytics pipeline in our next blog.
