Lessons in managing data science projects

Lionel Teo
AI Practice GovTech
7 min read · Aug 28, 2020

Here in DSAID, when we provide data science consulting services to other public agencies, our officers take on the role of both data scientist and project manager. Though not as sexy as the latest machine learning algorithm, project management is in fact crucial to our success. Good project management processes provide discipline and structure, reduce the risk of delays and digressions, and allow us to focus on producing meaningful insights instead of administrative overhead.

It is therefore just as important for us to keep up with best practices in project management as it is to constantly refresh our technical skills. Recently, we had an exploratory collaboration with Pivotal Labs (now VMware Tanzu) on a data science project and got to learn from their industry-standard best practices in applying agile project management to data science.

Here are some of our key lessons and takeaways:

(1) Define the roles well, especially the end user

Public sector projects are often cross-cutting and involve multiple teams or organisations. Even though the eventual goal is always the greater public good, each team/organisation will still have its own mandate and areas of interest. Identifying the correct team as the end user is important, as you will work closely with them and generate findings to answer their specific business needs.

The end user should typically be the individual or team that would benefit most from the insights produced by the data science project. They are also usually the ones who have the mandate to change operations or policies as a result of the findings.

Take for example an operations team that does inspections, and a separate coordination team that collates data for the ops team to plan its inspection schedules. While the coordination team might be the one approaching us to develop a risk prediction model, the real end user is actually the operations team — they are the ones who’d benefit most by changing their operations to a more efficient inspection schedule based on the prediction model. They’re also the ones who can best explain to us the nuances we need to consider while developing the model.

Identifying the correct end user is not trivial — if not done well, you might be halfway through the project before realising that you would have to start over!

It’s also useful to define the other roles clearly, whether data scientist, product owner, data subject matter expert, resource person, and so on. Specifying each role’s responsibilities and expected time commitment ensures that everyone is clear about how and what they need to contribute towards the success of the project.

(2) Take a user-centric approach

The concept of user-centricity is not novel, but a particularly useful technique we learned from Pivotal involves formulating user stories in this format:

“As [user], I want to… so that …”

For example, “As John from the inspection team, I want a ranking of all areas under my purview by predicted risk score, so that I can prioritise inspection of the highest risk areas.”

Doing so centres our frame of reference on the end user rather than the data scientist. It constantly reminds the data science team to build products that are ultimately useful for the end user, and guards against data scientists building things that are “cool” and use the latest machine learning technique, but which may not actually add much value for the end user!

(3) Project management tools are useful for stakeholder engagement

In our engagement with Pivotal, the tool we used was Pivotal Tracker. We would list all the stories/tasks that the data science team was working on, give each a score based on the effort required, and prioritise them. Based on the estimated bandwidth of the team, Tracker would then automatically allocate the stories/tasks to each sprint (usually a week), as well as project an estimated end date.

This is very helpful for two reasons:

  1. Internally, it helps us to focus on the tasks at hand;
  2. Externally, providing this overview to stakeholders allows greater transparency, making it clear to them how any subsequent changes would impact the timeline. This transparency helps in prioritisation and usually prevents superfluous requests.
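To make the mechanics concrete, here is a minimal sketch of the kind of scheduling a tool like Tracker automates: stories carry effort points, the team has a weekly velocity (points per sprint), and prioritised stories are filled into sprints greedily. All names and numbers here are illustrative assumptions, not Tracker’s actual API or scoring scheme.

```python
from dataclasses import dataclass

@dataclass
class Story:
    title: str
    points: int  # estimated effort score

def plan_sprints(stories, velocity):
    """Fill weekly sprints in priority order, up to `velocity` points each."""
    sprints, current, used = [], [], 0
    for story in stories:  # assumes stories are already prioritised
        # Close the current sprint once the next story would exceed capacity
        if current and used + story.points > velocity:
            sprints.append(current)
            current, used = [], 0
        current.append(story)
        used += story.points
    if current:
        sprints.append(current)
    return sprints

backlog = [
    Story("Clean inspection records", 3),
    Story("Baseline model: 3-period average", 2),
    Story("Rank areas by predicted risk", 1),
    Story("Visualise risk scores on a chart", 3),
]
sprints = plan_sprints(backlog, velocity=5)
print(len(sprints))  # 2 sprints, so roughly a two-week estimated timeline
```

The same logic explains why the end date updates transparently: adding a story to the backlog may push work into a new sprint, and stakeholders can see exactly how their request moved the timeline.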

(4) Work towards an MVP first

It is very common to want your product to look substantive (or even nearly perfect) before your clients or bosses see it — after all, nobody wants a seemingly half-complete product to be a representation of your efforts!

Yet this is in fact a very “waterfall” way of thinking. An agile mindset, on the other hand, emphasises building an MVP (Minimum Viable Product) first, before iterating on it subsequently. This will not come intuitively to many of us, and even for those who understand the concept of an MVP, what you think of might not even be the Minimum!

For example, when building a predictive model (let’s say to predict a probability), instead of using a simple regression model to predict, one could start off with an MVP that takes a simple average of the past three time periods. This does not seem very data science-y — no regression, let alone more complex techniques like random forests or gradient boosting! Yet it fulfills the very definition of MVP, as averaging is in fact also a way of predicting, albeit a very rudimentary one.

Your stakeholders are not likely to be satisfied with a simple average in the final product, but the MVP offers an early but tangible demonstration of what that final product could look like; you might take that predicted simple average probability, visualise it into a chart, and show your stakeholders what the final chart would look like. This helps you solicit feedback early, and allows for more timely course-corrections if needed. Thereafter, it is a matter of changing the technique under the hood — whether you replace the simple average with a regression or random forest, the predicted output is still a probability that will simply fit back into your earlier chart.
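The simple-average MVP described above can be sketched in a few lines. The point is the interface, not the arithmetic: because the baseline exposes the same “history in, predicted probability out” shape as a later regression or random forest, the technique under the hood can be swapped without touching the chart downstream. The function name and sample data are illustrative.

```python
def baseline_predict(history, window=3):
    """Predict the next period's value as the average of the last `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# e.g. observed incident probabilities over the past four periods
observed = [0.1, 0.3, 0.2, 0.4]
print(round(baseline_predict(observed), 2))  # 0.3
```

A later iteration would replace `baseline_predict` with a fitted model whose predictions plug straight back into the same visualisation.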

Henrik Kniberg’s well-known illustration of MVP development captures this idea: getting a working model out early and soliciting feedback regularly. Which brings us to…

(5) Meet and iterate regularly

With the waterfall approach in the past, our team used to meet our end users only when we had substantive findings to share. This usually amounted to two or three times in a three-month project cycle, just before major presentations to management.

Switching to a more agile process entails much more regular meetings with the working team — weekly, or sometimes even multiple times a week — where we are constantly in touch with our stakeholders regarding the progress of the project.

This initially felt quite unnatural, especially in the early phases of the project where work is focused on data cleaning and has yet to produce actionable or meaningful insights for the end user. Most end users aren’t going to be impressed by your data cleaning and feature engineering work, regardless of how much work went into it (even when we as data scientists know this takes up the bulk of our efforts!). But being disciplined in having these regularly scheduled meetings helps create commitment and buy-in, and all parties will soon get used to the cadence of the meetings.

While more meetings might seem time-consuming, they in fact allow for more timely conversations and course-corrections, on top of giving the end user greater assurance that we are on the right track — which saves time in the long run. Don’t forget that at times the end user might not be 100% certain about what they want either — so creating an MVP and then iterating frequently from there will also help them gain greater clarity on what they really want!

Bonus: our experience with pair programming

One of the characteristic features of Pivotal’s data science projects is the use of pair programming — two data scientists working at the same workstation, one writing the code while the other actively reviews it, with the two switching roles frequently.

We tried it out during the engagement, and found that certain data science tasks lend themselves better to pair programming, such as early in the project where the team is scoping out the overall approach, as well as during the modelling phase where we start to interpret the results and finetune algorithms. Others such as data cleaning or pipelining may sometimes not benefit as much, especially if the tasks are straightforward.

If you’re not used to it, pair programming can feel unnatural with someone watching over your shoulder, but it does generate useful conversations and insights, and helps significantly in debugging. It is, however, a significant commitment in terms of manpower, so it might not be for everyone — that will depend on factors such as the make-up and personalities of the team. Regardless of whether it’s appropriate for your context, it is an experience that we’d encourage any data scientist to try at least once!

Some concluding thoughts

Successful data science projects typically demonstrate the following:

  • Good soft skills (for project scoping, to arrive at the right research questions);
  • Good hard skills (technical ability to derive meaningful insights); and
  • Good structures (so the project doesn’t get sidetracked)

We’ve listed some tangible steps in project management above, to help with creating better structures in data science projects. It’s not a complete substitute for good soft skills and stakeholder management, but it certainly can help.

Yet project management should also be more than just the sum of these steps — it’s not just about implementing set procedures, but systematically identifying human heuristics, biases and potential for errors in your workflow, then trying to eliminate or mitigate them. There is no fixed formula or one-size-fits-all approach, and ultimately the ideal project management process is the one best suited to your team’s context, resourcing, and operating environment. There will be a process of experimenting and adapting until you hit on what works best for you and your stakeholders.

Nevertheless, we hope that the above will help inspire some thought about how we can always improve our processes and deliver more value to our stakeholders!
