Lessons from running a Data Team
This post was originally published at: http://farhan.org/lessons-from-running-a-data-team.html
In my last job as Head of Data at Axial, I managed a team that was responsible for data engineering, analytics and reporting. Our mandate was to use data to help the business operate more efficiently by not only providing insights into key metrics and their drivers, but also by building tools and integrations that improved various business processes and reduced bottlenecks.
Over the year that I ran the team, I learned quite a few lessons that I’ll try to summarize in this post.
- Make data quality a priority — Data quality checks should be part of the design, not an afterthought. While it’s always possible to automate more of the process, be prepared to spend a significant amount of time manually checking data. There’s really no substitute.
- Instrument your pipeline — Don’t wait for things to break badly before you find out about them. Instrument your pipeline so that you can find out about issues closer to where and when they happen. Tools like Sangati can help.
- Fix quality issues at the source — It’s much better to fix data quality issues upstream than to patch around them downstream, even if the upstream fix is more painful and takes longer. For example, rather than compensating for an incorrectly implemented event in your ETL, have the product and engineering teams fix it in the product.
- Set up restores — Not backups, restores. Unless you have tried restoring your data, it’s not backed up. Test this every quarter if not more frequently.
- Organize your code — Keep your code organized, even one-off SQL queries and ad-hoc analyses; you never know when you’ll need to refer to them again. Tag them and make them searchable.
- Version your data — Make sure you preserve the ability to go back in time and reconstruct what the data looked like at some point in history. For example, a properly implemented slowly changing dimension (SCD) strategy will get you that. Or you could simply snapshot your data daily (depending on its size) and archive it in AWS Glacier.
- Build simple models — Start simple and add complexity only if necessary. 90% of the time, a simple linear or logistic regression model will get you what you need. If it doesn’t work, make sure you understand the question well before attempting anything more complex.
- Black box analytics are often useless — Unless making accurate predictions is the end goal, a predictive model isn’t worth much unless the business can operationalize and act on it. Being able to decompose a model into its contributing features usually means the model has to be simple.
- Don’t reject the mundane — Often, getting the right data to the right people at the right time is all you need. This means building integrations with tools that are part of the daily workflow. It’s not the most interesting work, but it’s usually a huge productivity and efficiency booster.
- Be adamant about the limits of data — Not everything can be answered using data that you have. Be vocal about these limits. These limits could be a result of noisy data, too many parameters and not enough observations, inadequate instrumentation or something else.
- SQL, SQL, SQL — More than anything, getting good at SQL will pay dividends. Make sure everyone on your team has the opportunity to improve their skills. 90% of the questions can be answered using SQL.
- Build intuition about the data — This comes with experience and gaining a deep understanding of the domain. But it will be your secret weapon to spot errors quickly and save time.
- Ask why, repeatedly — Aristotle said it thousands of years ago, and it still holds true: asking the right question is half the answer. Spending time upfront to refine the question often saves a lot of time in the long run.
- Know how the business has changed — Startups, by definition, need to be agile. Keep track of strategic and product decisions that could change how users behave in the product. These shifts will show up in the data, and you’ll need to account for them in your models if the data spans periods on both sides of these decisions.
- Teach others how to fish — Unless others outside the data team know what kind of questions data can help answer, they won’t bother asking you for help or coming up with ideas to explore. Take time to catalog the data, make it easily accessible and hold office hours and lunch talks to evangelize what data can do. It’ll help spark the right questions and build a culture of data curiosity.
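To make the first couple of points concrete, here’s a minimal sketch of the kind of automated quality check that can run inside a pipeline. The field names and thresholds are made up; a real check would read batches from your warehouse or ETL staging area and push failures into whatever alerting you use.

```python
# A minimal, illustrative data quality check for a batch of records.
# Thresholds and field names are assumptions, not a real schema.

def check_batch(rows, required_fields, min_rows=1, max_null_rate=0.05):
    """Return a list of human-readable quality issues for a batch of dicts."""
    issues = []
    if len(rows) < min_rows:
        issues.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for field in required_fields:
        nulls = sum(1 for r in rows if r.get(field) is None)
        rate = nulls / len(rows) if rows else 1.0
        if rate > max_null_rate:
            issues.append(
                f"{field}: null rate {rate:.1%} exceeds {max_null_rate:.1%}"
            )
    return issues
```

A pipeline step would call `check_batch` right after each load and fail loudly (or page someone) when the returned list is non-empty, which is what catching issues close to where they happen looks like in practice.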
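For the versioning point, here’s a rough sketch of a type-2 slowly changing dimension in plain Python. In practice this logic lives in your warehouse, and the record layout here is an illustrative assumption, but it shows the core idea: never overwrite, instead expire the old version, append the new one, and reconstruct any point in time from the validity ranges.

```python
from datetime import date

# Illustrative type-2 SCD: each record version carries a validity range.
# An open version has valid_to = None.

def scd2_apply(history, key, attrs, as_of):
    """Record a new version of `key`, expiring the currently open one."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            if row["attrs"] == attrs:
                return history  # no change, nothing to version
            row["valid_to"] = as_of  # expire the current version
    history.append(
        {"key": key, "attrs": attrs, "valid_from": as_of, "valid_to": None}
    )
    return history

def as_of_view(history, key, on):
    """Reconstruct what the record looked like on a given date."""
    for row in history:
        if (row["key"] == key and row["valid_from"] <= on
                and (row["valid_to"] is None or on < row["valid_to"])):
            return row["attrs"]
    return None
```

For example, if an account moved from a free to a paid plan in June, `as_of_view` for a March date still returns the free plan, which is exactly the "go back in time" ability described above.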
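And on SQL: a huge share of business questions reduce to a well-written GROUP BY. Here’s an illustrative example against an in-memory SQLite database with made-up signup data, answering the sort of question that needs no modeling at all.

```python
import sqlite3

# Made-up signup data in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (user_id INTEGER, channel TEXT, converted INTEGER);
    INSERT INTO signups VALUES
        (1, 'organic', 1), (2, 'organic', 0),
        (3, 'paid', 1), (4, 'paid', 1), (5, 'paid', 0);
""")

# "What is our conversion rate by acquisition channel?" -- pure SQL.
rows = conn.execute("""
    SELECT channel,
           ROUND(AVG(converted), 2) AS conversion_rate,
           COUNT(*) AS signups
    FROM signups
    GROUP BY channel
    ORDER BY channel
""").fetchall()
```

Running this returns one row per channel with its conversion rate and volume. Getting fluent at expressing questions this way, straight against the warehouse, is what pays the dividends described above.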