Applying agile to data science
Data science is at the heart of every enterprise seeking to amplify growth, innovation and competitive advantage in the digital era. And yet, Gartner estimates that 85 percent of data projects fail, primarily because of the complexities involved in integrating the data science function into conventional enterprise infrastructure and culture.
The intrinsically probabilistic and non-deterministic characteristics of data science do not make it an easy or a natural fit for organizations accustomed to predominantly linear and reasonably predictable development models. Data science development models tend to account for their inherent unpredictability by valuing experimentation, iteration and continuous improvement as much as, if not more than, traditional metrics of project scope, cost and schedule. And it is this distinctive aspect of data science that makes it an ideal fit for agile development methodologies.
As much as the principles of agile that have successfully transformed software development are broadly applicable to data science, they have to be adapted and evolved to fit the dynamics of a practice that delivers insight rather than code. That process of evolution will accelerate as agile data science becomes more mainstream. In the meantime, it is worth exploring some fundamental principles of agile and their relevance to the data science practice.
Building multi-disciplinary teams: The data science unicorn, a single practitioner who combines deep programming competence, proficiency in math and statistics, extensive domain expertise and strong communication skills, is exactly that: mythical. The practical alternative, therefore, is to assemble a cross-functional team comprising data scientists, engineers, business executives, technology leaders and quality experts. This multi-disciplinary team takes collective ownership of identifying and validating business use cases, defining data requirements, architectures and quality/governance frameworks, and delivering minimally viable products.
Continuous iteration, deployment and delivery: The route from data to insight is never linear and therefore defies representation as a sequence of tasks. The progression of any data science project is best described as a series of concurrent experiments that typically yield the best features, models or insights at the end of multiple iterations or sprints.
Of course, the objective is to deliver a minimally viable product, representing some meaningful stakeholder value, at the end of these iterations. This can be a particularly consequential challenge in data science projects where a lot of time is taken up by data preparation and where diligently built and tested models, or even workflows, may need to be discarded in favor of a better route to insight.
In agile data science, the minimum viable product could start off as a narrowly scoped business requirement that serves as a platform upon which subsequent deliverables are modularly developed, tested and deployed. In the agile methodology, every deployment serves as a milestone from which to review and refine each delivered iteration and to adapt to evolving project requirements.
Continuous integration: As a highly exploratory and experimental discipline, not every iteration in a data science project will yield a deliverable that significantly advances stakeholder value over the last iteration. But unlike conventional software development, agile data science emphasizes committing work to repository management solutions like Bitbucket or GitLab even when no incremental value has been delivered. Even incomplete or intermediate outcomes have to be documented in a central repository and shared across the data science project team as well as all enterprise stakeholders. Inadequate documentation and version control can compromise reproducibility or even delay deployment to production.
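As a minimal sketch of this practice, assuming a team wants a lightweight, auditable record of every iteration kept under version control alongside the code (the file name and helper below are hypothetical, not part of any particular tool), an experiment log can be as simple as an append-only JSON-lines file:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical append-only log, one JSON record per line, committed with the code.
LOG_PATH = Path("experiments.jsonl")

def log_iteration(model_name, metrics, notes=""):
    """Record one experiment iteration, whether or not it improved on the last."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_name,
        "metrics": metrics,
        "notes": notes,
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Even a discarded model is worth recording for reproducibility:
log_iteration("gbm_v3", {"auc": 0.71}, notes="Worse than baseline; dropping feature set B.")
```

Because every dead end is written down and committed, a teammate (or an auditor) can later reconstruct why a particular route to insight was abandoned.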
Mapping value creation: The data-value pyramid, modeled after Maslow’s hierarchy of needs, can enable data science teams to plan and build value in a series of iterations across a hierarchy of values, from simple records up to interactive predictions. Each iteration delivers a demonstrable increment of value that builds up to a complete business solution.
The data value pyramid provides a conceptual structure that can, at the very least, give data science teams a reasonable visualization of a project's progress. Though it represents a logical and seemingly sequential progression, the route to the top of the pyramid may not always be linear or track upward. Even iterations that require crossing back to lower levels, to gather new data or revisit exploratory data analysis (EDA), for instance, can still represent progress and value.
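To make the idea concrete, here is a small sketch assuming an illustrative five-level ordering of the pyramid (the level names and helper are hypothetical, not a standard API); it maps each sprint's target to its height, showing that dips back down the pyramid are still recorded as legitimate work:

```python
# Illustrative levels of the data-value pyramid, from simple records
# up to interactive predictions (this ordering is an assumption).
PYRAMID = ["records", "charts", "reports", "predictions", "interactive predictions"]

def chart_progress(sprint_targets):
    """Map each sprint's target to its height in the pyramid.

    The route need not climb monotonically: dropping back to 'records'
    to gather new data is progress, not a regression.
    """
    return [(target, PYRAMID.index(target)) for target in sprint_targets]

sprints = ["records", "charts", "records", "reports", "predictions"]
for target, level in chart_progress(sprints):
    print(f"level {level}: {target}")
```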
Challenges: As agile data science becomes more mainstream, one of the biggest challenges will be to harness the productivity potential of agile without compromising the reproducibility, transparency and accountability of workflows in the development cycle. But there is already a lot of ground being covered in addressing these issues. For instance, Airbnb’s Knowledge Repo, an open source curated data science platform, facilitates the discoverability, reproducibility and reusability of data science work across the company.
The consequences of these factors become even more pronounced in industries like banking and financial services governed by stringent data and compliance regulations. Here again, emerging data science governance platforms provide the tools to audit and control how data sets and models are applied and shared across a company.
Conclusion: Agile data science is currently still a discipline defined by a set of principles and values adapted from the more established and refined methodologies of software development. Over time, this nascent discipline will evolve into its own distinct set of practices, processes and workflows that are best suited for the data science practice. But agile data science is not just about streamlining the data science development life cycle. The biggest challenge will be to integrate the core principles and values of agile data science with the existing culture, infrastructure and processes of a business.
