From Wobbly Data Science Project to Efficient DataOps

Andrew Wong
Published in Human Science AI
Mar 19, 2019

If you want to build a ship, don’t drum up people to collect wood, and don’t assign them tasks and work, but rather teach them to long for the endless immensity of the sea.

Antoine de Saint-Exupéry

When I first started writing my first few lines of code in a Jupyter Notebook, I found joy and freedom of expression. Then I carried on coding and coding, and writing markup after markup. After two data science projects, I came to a realization: hang on, this is too short-term and too narrowly focused. By too short-term, I mean that I was coding without thinking about future implications: whether I was accruing technical debt, whether the code was maintainable, and so on. By too narrowly focused, I mean that I was focusing only on my part and not on the whole.

Why did I come to this realization?

First, there is my background in software engineering. I have been grounded in a disciplined way of running the software development lifecycle (SDLC for short). This essentially means focusing on technical excellence, with collaboration between Product Owners, Developers, Testers, DevOps, and Solution Architects embedded in daily software engineering practice. This started my search for similar practices or use cases within the data science space. I will delve into this later.

Second, there is my background as an agile/delivery coach. I have been grounded in agile development as a mindset, and in its principles of delivering through small iterations, at a regular cadence, as a team, and prioritizing work by highest business value first. I will delve into this later.

Third, there is my background in design sprints. In short, a design sprint is a short 4–5 day process (4 days if you’re following Design Sprint 2.0) that moves from defining the problem space (Day 1) to defining the solution space (Day 2), to prioritizing and deciding on the solution(s) to take forward (Day 3), to prototyping the solution (Day 4), and finally to testing the prototype with real users (Day 5).

So the question is: how can we transform a wobbly Data Science project into efficient DataOps (Read: embedding disciplined software engineering practice, with Agile Development discipline and Design Sprint intensity)? How can we start to think longer term, with a broader horizon (Read: teach them to long for the endless immensity of the sea)?

I am not claiming that I know it all, or that everything written below will help you. I am experimenting and sharing with the world at the same time (Read: failures are expected, guaranteed).

Chart a path in a two-week cadence

I want to use the agile development principle of time-boxed, short iterations because what’s coming up is uncertain. I would rather work on it and review at the end of the two weeks. This is regret minimization.

Focus on business data science, rather than just data science

The joy of creating and developing business value trumps just doing data science. If there is no impact on business decision-making, there is no real-world impact (unless you’re able to advance data science knowledge). This is about focusing on business value first.

Fast prototyping in the space of a week

I introduced the Design Sprint (4–5 days) earlier. If we can get a prototype out into the wild early, it will help immensely. This is about small increments, chipping away at doubt one day at a time.

Tell data stories that tie back to the problem at heart

Making a thick connection between the problem and the steps taken towards solving it has a real benefit: it builds trust in the data through transparency and the ability to reproduce the results. This is about producing a readable and relatable data science project.

The importance of the scientific method

A rigorous scientific method will help build the case for more data science projects. This means practicing logical reasoning (i.e. logical statistical analysis) and using empirical evidence (i.e. generating and evaluating scientific evidence and explanations, and participating productively in scientific practices and discourse).

That’s all for now, March 2019. More to come.
