Structuring a “Docker for Data Science” Training Journey

Moussa Taifi PhD
Published in Xandr-Tech
Aug 27, 2018 · 9 min read

(Joint work with Win Suen)

Why design a “Docker for Data Science” training?

The AppNexus Data Science and Analytics teams grew 4x in size and complexity in the past year. More projects with more collaborators mean a greater need for ways to reliably share and deploy code, often across different time zones, countries, and technical environments.

The Data Science Core Tech (DSCT) team at AppNexus is responsible for supporting the technical and architectural needs of the DS/Analytics team members. The mission of the DSCT team is to remove any friction that can come between a data scientist or an analyst and the rest of the engineering organization.

The DSCT team selected Docker as a central data science tool, and since then Docker has removed many of the barriers to code portability, reproducibility, and deployment in our DS/Analytics teams. To encourage adoption of Docker across our organization, we designed an in-house training series to familiarize and empower teammates with Docker.

The central challenge of designing a Docker course for data science is that there is a glut of information online about Docker usage that has to be pieced together by each learner. This increases the cognitive load on the users, and slows down their progress when integrating Docker into their toolbox.

The focus of this post will not be on the benefits of Docker, but rather on the following concern:

How to customize Docker training materials to fit the Data Science and Analytics workflows and use cases.

There are plenty of Docker tutorials available online, far more than any one person has time to complete. Therefore, the primary challenge we faced was:

How could we create actionable training materials that teammates could use from Day 1?

How to build effective Docker for Data Science technical training:

We used the 4Cs course design principles to ensure technical trainings were effective and impactful for the whole team. For more on the 4Cs, see [1]. We focused our effort on following the principles of continuing education for adult learners [2][3] and expanded them to Adult Technical Learners (ATLs).

Clarity

As an attendee, how can I be sure this Docker training is worth my time? How can I start using what I’ve learned about Docker from day 1?

  • ATLs are interested in learning tools and concepts that have a direct impact on their occupation. Their schedule does not always allow for learning for its own sake. We made sure to explain clearly how each lesson fits into their goals for self-advancement and productivity improvements.
  • To build relevant content, we made the design of the Docker training co-owned by a DS engineering platform architecture representative and a primary data science representative.
  • We met to iterate and improve on the content after feedback from an initial pilot session: we took out any overly complex content, repeated important concepts, summarized essential elements, and added humorous analogies.
  • After the initial info session, many participants wanted us to incorporate hands-on training. This was in line with what ATLs are looking for. ATLs want to be involved in structuring their learning and evaluating their outcomes [2]. ATLs want an environment in which mistakes/bugs/typos are safe and expected, and serve as a reference for continued learning.
  • We optimized the content by reducing the sequential nature of the training. After 2 sequential intro classes, each subsequent module was self-contained to allow participants to show up for the sections they cared about. We also reorganized the hands-on content to focus on teaching Docker concepts, not just procedural knowledge.

As an example, instead of going over the full Docker container lifecycle [4], we focused on the user side of things and simplified the guide to cover the most common commands:

$ docker build    # build an image from a Dockerfile
$ docker run      # start a container from an image
$ docker rm       # remove a stopped container
$ docker images   # list the images available locally
$ docker tag      # give an image a new name/tag, e.g. for a registry
$ docker push     # upload a tagged image to a registry
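To make the flow concrete, here is a minimal sketch of how these commands chain together for a single app; the image name my-app and the registry path registry.example.com are hypothetical placeholders:

$ docker build -t my-app .                            # build an image from the Dockerfile in the current directory
$ docker run --rm -p 5000:5000 my-app                 # run a container from that image, publishing port 5000
$ docker images                                       # list local images to confirm the build
$ docker tag my-app registry.example.com/my-app:v1    # tag the image for a remote registry
$ docker push registry.example.com/my-app:v1          # upload the tagged image to that registry

(docker rm then cleans up stopped containers, e.g. docker rm <container-id>.)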

Capacity

How do we create effective training given different teams have different use cases? What are the prerequisites needed for training attendees to be successful?

  • We created specialized learning paths and adapted them to specific data science and analytics use cases and prerequisites. We looked at the common use cases in DS/Analytics and targeted the primary usage patterns.
  • We examined the assumed prerequisite knowledge. We tried to answer the question: what does a participant need to know for this training? For example, many attendees had worked with the Python ecosystem, with tools such as Jupyter, Pandas, and Flask. We used that knowledge to customize the learning paths to be Python-centric (see the sketch after this list).
  • During and after every session we encouraged questions from the participants to fine-tune the contents of the learning paths. For example, initial Docker installation blockers were reported, and after customizing the installation path we saved 10% of the training time for the more useful Docker concepts.
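As a rough illustration of what a Python-centric starting point looks like, here is a minimal Dockerfile sketch; the base image and package list are illustrative assumptions, not our internal template:

# Minimal Python-centric Dockerfile (base image and packages are illustrative)
FROM python:3.7-slim

# Preinstall the stack most attendees already know
RUN pip install --no-cache-dir pandas flask jupyter

# Copy the project in and set the default command
WORKDIR /app
COPY . /app
CMD ["python", "app.py"]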

Consistency

There are infinite layers of abstraction for containerization. How do we select the most actionable content? How do we choose the most effective training format?

  1. Stick to what we observed was effective: teammates could expect each course to be part presentation covering the essential concepts, and part hands-on lab.
  2. The hands-on section was especially important because our learning goal was for team members to have all the tools to start using Docker in their day-to-day work. Walking through the dev/test/deploy workflows of a Dockerized example app as a group made attendees more comfortable with concepts and commands. This gave them a safe place to ask any questions during the hands-on session.
  3. Make it easy for new people to join the training sessions. We offered each learning path multiple times (various time-of-day and day-of-week combinations) over the course of a full quarter. We kept the technical requirements to a minimum: in our case, a macOS laptop with a working Docker/Git configuration (see the sanity check after this list). This was enough to remove 80% of the blockers during the hands-on sections.
  4. Before each training, we clearly communicated what teammates could expect to be able to do by the end of the session.
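As an example, a short pre-session sanity check along these lines (standard Docker and Git commands only) was enough to confirm a working setup:

$ docker --version               # confirm the Docker client is installed
$ docker run --rm hello-world    # confirm the daemon can pull and run a container
$ git --version                  # confirm Git is available for cloning the lab materials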

Commitment

How to keep people engaged and likely to use Docker?

  • Early on we focused on presenting the course from a “partner’s perspective”. Instead of using the traditional top-down lecturing style, we committed ourselves to be the “guide on the side” rather than the “sage on the stage” [1]. We considered teammates “partners in learning” because ultimately they have full autonomy and responsibility for deciding whether Docker is right for them, and for learning the tool.
  • We kept refreshing the journey with new training courses that incorporated the initial feedback from the team.
  • We provided a common Slack channel for continued support and community, as well as one-on-one in person consultation.
  • We built extensive DS/Analytics Docker templates to jump-start team adoption. These standardized Docker image templates, designed for data science needs, made the barriers to entry much lower.

Curation of Docker Learning Paths (Final form):

Different data scientists and analysts at AppNexus have different use cases, but fortunately for us they use closely related tools in the Python ecosystem. We used that point of reference to build Python-centric Docker Learning paths.

Curating courses into learning paths gave teammates the autonomy to select training sessions that were most applicable to their day-to-day work. The final learning path tree we came up with is the following:

For our first iteration, we designed 4 in-depth paths for common data science use cases:

  1. Jupyter on Docker: Jupyter Notebooks are popular for experimenting with new libraries and algorithms. Dockerizing Jupyter Notebooks reduces friction in code/notebook sharing, and contributes a great deal to research reproducibility (no more “works on my machine” days!). See the example after this list.
  2. PyCharm Docker integration: For data scientists who already interact with PyCharm daily, we leveraged their familiarity with this popular IDE to encourage Docker use.
  3. Docker for Hadoop: Some teams work closely with our big data infrastructure (Hadoop, Spark, YARN), so we tailored this session to cover how they could integrate Docker with these existing tools. We wanted to avoid letting complexity in the setup phase deter teammates from taking advantage of Docker. To solve this problem, we provided ready-made Dockerfile templates and instructions to jump-start users.
  4. Docker for Kubernetes: As the industry-wide “container orchestration wars” have settled, Kubernetes has emerged as the dominant player for deploying Docker-based apps on clusters. To prepare our team to embrace this new technology, we built a set of training documentation that builds on the learning paths they had already gone through in this program. We kept the training material consistent with the existing Docker examples and workflows, which simplified the introduction of the complementary Kubernetes concepts.
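To give a flavor of the first path: launching a notebook server from one of the community Jupyter Docker Stacks images looks roughly like this; the image choice and mount path are illustrative, and our internal templates differ:

# Publish the notebook UI port and mount the current directory into the
# default workspace of the community jupyter/scipy-notebook image
$ docker run --rm -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/scipy-notebook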

Impact of the “Docker for Data Science” Training on our Team

Looking back at this training design journey 2 quarters later, we noticed many areas that benefited tremendously from this training:

  1. Increased number of apps that analysts can contribute to: The ability to integrate the PyCharm IDE with Docker was a game changer in terms of the number of apps that a single analyst can contribute to. Previously, each internal app needed a dedicated infrastructure/development environment. Now each analyst can comfortably take on 3x the number of apps they contribute to.
  2. Direct interactive experimentation with SQL queries using unified Docker images: Letting team members experiment directly with SQL queries in a Jupyter notebook that runs on the same Docker image as the target app improved the productivity of our team. This removed the fear that queries/data pipelines requiring specific dependencies would work at analysis/dev time but break when served to our internal customers.
  3. Improved Data Science Experiment Reproducibility: Using Docker as the primary method to package all the components of DS model training, testing, and deployment proved to be a great help in taming the complexity of our machine learning pipelines. The ability of one data scientist to extend an existing experiment by simply tweaking the feature engineering step, or the model training/testing parametrization, increased the rate of A/B testing that our team is able to perform on live traffic.
  4. Improved compute resource utilization: The ability to stack more apps on fewer servers means extensive savings for our budget. Instead of having to spin up multiple development environments for each developer-app combination, team members can use the same level of resources they currently use and expand the range of their activities and the number of apps they contribute to. In addition, for production-level workloads, we trained users to extend their Docker knowledge to Kubernetes cluster resources. This allowed the team to run larger machine learning experiments without taking on too much risk in terms of initial resource capacity acquisition.

Bonus: Propaganda!

Thanks for reading about our experience structuring a “Docker for Data Science” training journey. During the development of this course we also made some propaganda posters to make this new technology stack less intimidating and encourage participation. Feel free to use them if you are leading a Docker training session for your own teams, or generate new ones here [5].

References:

[1] Dr Mo Hamza (2018) Swedish Civil Contingencies Agency, Training Material Development Guide, https://www.msb.se/RibData/Filer/pdf/26433.pdf (accessed 2018-08)

[2] The Principles of Adult Learning Theory, https://online.rutgers.edu/blog/principles-of-adult-learning-theory/ (accessed 2018-08)

[3] Malcolm S. Knowles, Elwood E. Holton III, and Richard A. Swanson (2005) The Adult Learner: The Definitive Classic in Adult Education and Human Resource Development, Burlington, MA: Elsevier.

[4] Docker internals, http://docker-saigon.github.io/post/Docker-Internals/ (accessed 2018-08)

[5] Vintage poster generator: https://www.postermywall.com/index.php/g/vintage-posters

We are hiring! Please check out our open roles: https://xandr.att.jobs/job/new-york/data-science-platform-engineer/25348/12859712
