Data Science @ Pixability: How Scrum Works for Us

Pix Engineering · The Pixel · Jan 27, 2017

Like many data scientists, I work in a field that did not formally exist when I finished university. The novelty of the field — and its amorphous boundaries — gives data science teams both the liberty and the challenge of defining not only what data scientists do, but how they do it. Scrum is one possible answer to the second question.

Essentially, Scrum is an ‘agile’ software development framework that is increasingly being adopted by data science teams. Its roots trace back to Japanese product development research in the 1980s, and it promotes speed and cyclical iteration over extensive, linear, ‘waterfall’ planning and development.

Is Scrum Right for Data Science?

But is it an appropriate fit? There’s an ongoing debate within the data science community as to whether Scrum is too cumbersome. At Pixability, the majority of our workload is project-based, so Scrum works well for us. Here, the data science team acts as a research and development team: we develop proofs of concept and boundary-pushing innovations. Given that responsibility, a framework that favors rapid iteration is a natural fit. However, the nature of research and development work often requires that we construct appropriate boundaries around extremely amorphous problems. It’s often difficult to predict, on day 0, what sort of data wrangling will be needed, what model or algorithm will be the best fit, and what other questions will pop up during exploratory data analysis.

Adapting Scrum, then, is a challenge to be tackled actively. At Pixability, we’ve developed — and we continue to refine — a hybrid, Scrum-like system of planning, estimating, and tracking our data science work. Our work is organized into Corporate Quarterly Strategy -> Major Business Objectives -> Versions -> Epics -> tickets or ‘stories’ (e.g. “Visualize Simulated Samples”). Open-ended questions are bounded and capped into time-delimited packets. We call these ‘discovery stories.’ For example, a discovery story might be spending one or two days exploring a Bayesian framework for estimating YouTube view rates.
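To make that concrete, here is a minimal sketch of the kind of throwaway analysis a discovery story like that might produce. The counts and the prior are made up for illustration; this is not our production model, just the sort of quick Beta-Binomial estimate of a view rate that fits comfortably in a one- or two-day time box.

```python
# Illustrative only: a quick Bayesian (Beta-Binomial) estimate of a YouTube view rate.
# All counts and the prior below are hypothetical.
import numpy as np

impressions, views = 10000, 320        # hypothetical campaign data so far
alpha_prior, beta_prior = 2, 50        # weak prior belief: view rates are "a few percent"

# Conjugacy: Beta prior + Binomial likelihood -> Beta posterior.
alpha_post = alpha_prior + views
beta_post = beta_prior + (impressions - views)

posterior_mean = alpha_post / (alpha_post + beta_post)
samples = np.random.beta(alpha_post, beta_post, size=10000)
low, high = np.percentile(samples, [2.5, 97.5])

print(f"Posterior mean view rate: {posterior_mean:.2%}")
print(f"95% credible interval: {low:.2%} to {high:.2%}")
```

A result like this is usually enough to decide whether the approach earns a full story in the next sprint.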

Example: Creating our AI Bot — COEy

One of our most recent projects was building a solution to enable non-technical users to easily employ the Campaign Optimization Engine (COE). The COE is our cross-platform (YouTube, Facebook, Instagram, Twitter, etc.) autopilot. We needed a simple way to start the COE on new or existing campaigns, and to monitor their progress. As with every feature we develop, we considered user adoption from the start: the goal was to build something users would actually want to use.

One of many Q4’16 Data Science MBOs

From this very high-level description, “COE: Enable Campaign Managers (CMs) to easily use the COE,” we prioritized the use cases, then created a series of Discovery stories to do the real research.

From our Discovery story research, we created stories to build interfaces that help Campaign Managers start COE jobs, give them information on currently running jobs, and provide visualizations that let them quickly sum up performance. One story in particular helped us make the COE more approachable to the end user.

The Discovery Story

In this story, we explored the best way to make the COE approachable to novice CMs. Given that we all use and communicate over Slack, we researched how we could make CMs actually want to use the technology. Campaign Managers at Pixability are extremely skilled, hard-working, and biased toward action; they want answers quickly, they talk fast and often, and they don’t have much patience for dense technology. Given those realities, the rise of AI, and the fact that Slack supports bots, we decided to create an AI bot, whom we called COEy.

The Proof of Concept — COEy

Once our discovery story pointed us toward a potential solution, we started developing a proof of concept, itself a user story based on the most important functionality that COEy would deliver. We first needed to create her with Slack, which in and of itself was a user story. After some team brainstorming, and keeping in mind our original goal of creating technology that is approachable to our CMs, we built COEy and enabled the bot in Slack.
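For readers curious about what “enabling the bot in Slack” involves, here is a hedged sketch of the usual bot event loop, using the slackclient Python package that was current at the time. The token, trigger word, and reply text are placeholders, not COEy’s actual code.

```python
# Minimal Slack bot event loop (illustrative; not COEy's real implementation).
# Assumes a Slack bot token and `pip install slackclient` (the 1.x API).
import time
from slackclient import SlackClient

BOT_TOKEN = "xoxb-your-bot-token"  # placeholder
sc = SlackClient(BOT_TOKEN)

if sc.rtm_connect():
    while True:
        for event in sc.rtm_read():
            # React only to plain messages that mention the bot's trigger word.
            # (A real bot would also ignore its own messages.)
            text = event.get("text", "")
            if event.get("type") == "message" and "coey" in text.lower():
                sc.api_call(
                    "chat.postMessage",
                    channel=event["channel"],
                    text="Hi! Ask me about your campaigns, or type 'help' to see what I can do.",
                )
        time.sleep(1)
```

The real work, of course, is in what the bot says back, which is where the question/response mechanism below comes in.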

With that done, we next needed to give her a brain — she’s an AI bot, after all! We developed a question/response mechanism, the core interaction pattern of any AI bot. This was, as with the previous tasks, a user story. We evaluated a variety of AI frameworks, but ultimately converged on api.ai. Its framework was straightforward and enabled us to get COEy answering questions quickly.
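As a rough sketch of that question/response loop, the api.ai Python SDK of the time exposed a simple text-query interface along these lines. The access token, session id, and query are placeholders; this is not COEy’s actual brain, just the shape of the call.

```python
# Illustrative api.ai (now Dialogflow) text query; token and session id are placeholders.
# Assumes `pip install apiai`.
import json
import apiai

CLIENT_ACCESS_TOKEN = "your-client-access-token"  # placeholder

ai = apiai.ApiAI(CLIENT_ACCESS_TOKEN)
request = ai.text_request()
request.lang = "en"
request.session_id = "demo-session-1"
request.query = "How is my campaign doing?"

response = json.loads(request.getresponse().read().decode("utf-8"))
# api.ai's v1 responses put the matched intent's reply under result.fulfillment.speech.
print(response["result"]["fulfillment"]["speech"])
```

Wiring this into the Slack loop above is mostly a matter of passing the message text in as the query and posting the speech field back to the channel.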

Next, COEy needed to help the Data Science team and Campaign Managers actually manage the campaigns on the COE. Any good piece of technology has a help menu, and COEy is no different.

Beyond these, we created a series of other user stories to increase COEy’s capabilities, both within Data Science and across other teams. Like most software engineering teams, Pixability’s engineering team uses Scrum, which makes it easy to stay efficient while passing stories and tickets back and forth.

Productizing COEy

Because we rely on Engineering/DevOps to productize our work, we passed COEy to DevOps via another story.

Additionally, we pass our models and other innovations to Engineering in the form of Docker containers. These standardize and simplify the hand-off (more on that in a follow-up blog post).
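As a loose illustration of what such a hand-off can look like, a model image might be described with a Dockerfile along these lines. The file names, base image, and serving script below are hypothetical, not our actual artifacts.

```dockerfile
# Hypothetical hand-off image: a trained model plus a small serving script.
FROM python:3.6-slim

WORKDIR /app

# Pin dependencies so Engineering/DevOps rebuilds are reproducible.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Ship the model artifact and the code that serves it.
COPY model.pkl serve.py ./

EXPOSE 8080
CMD ["python", "serve.py"]
```

From there, DevOps only needs docker build and docker run, regardless of which libraries the model happens to depend on.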

Managing Data Science Now vs the “Wild West” Days

It wasn’t always this well-organized. As the Pixability data science team has matured, we have tried various approaches to organizing our projects. Back in our “wild west” beginnings, we did not track our work in an agile system, though we did still have daily standup meetings to check in with each other. We then adopted Scrum and started running two-week sprints. With this new system, it was easier to prioritize our work. With the addition of weekly grooming meetings to comb through items in the backlog and planning meetings at the start of each sprint to discuss projects as a team, we cut down the time it took to complete our tasks and gained the reassurance that each task or project was always the most strategic thing we could be doing.

Continuously engaging with the Scrum process through daily morning stand-up meetings keeps us on task and allows us to make increasingly accurate predictions of our workloads: tickets beget more tickets, and we regularly discuss as a group what the work will require and how long it might take. Our stand-ups provide a continuous opportunity to untangle methodological issues or challenge each other’s assumptions about the data. Sometimes this means dropping old tickets that are no longer relevant. Sometimes, of course, our estimated workload per ticket is way off. But, on average, these estimates are accurate enough to plan out our two-week ‘sprints’ and keep us on track.

We’ve noticed a number of additional benefits: we have a readable, shareable record of our work, we have built-in reflection through our Sprint Retrospectives, and we have transparency into each other’s evolving projects. It’s easier to see the links across projects and learn as a team.

Of course, there are challenges as well. Scrum was not designed with data-driven research in mind. It can be difficult to plan very far in advance, except in very broad, high-level stories that don’t capture changing research designs. Our list of stories requires continuous upkeep. There is no clear way to ‘assign’ the same story to two different people, which constrains how collaborative tasks can be divided up. (For example, a blog post that includes drafting, editing, and reviewing from multiple people!) And so on. But for now we’ve found Scrum to be a net gain. We’re excited to keep iterating on it and finding new ways to use it better. We’re also excited to see the continuing evolution of data science management techniques in general.

Check back here next month as we talk about how Data Science, particularly the Data Science Platform that we created, enables our sister team in Analytics to uncover meaningful insights faster using components such as Vertica, Kafka, and Tableau.
