We’re growing the data science team at Windfall Data, where our mission is to identify the net worth of every household in the world. One of the exciting aspects of building a data science team at a startup is being able to lay the groundwork for creating an impactful team.
One of the ways of measuring the maturity of a data science team is by adapting Joel’s software test to the discipline of data science. This post explores three interpretations of this test, each of which defines a set of yes/no questions to answer about your data science team. The more questions you can answer yes to, the better. I’ve answered each of these questions and provided some details about data science at Windfall. The goal of performing this exercise is to identify areas for improvement and to highlight the foundation we have in place for growing our team.
The “Joel Test” for Data Science
Nick Elprin, Domino Data Lab
Here’s Nick’s proposal of the “Joel Test” for Data Science:
- Can new hires get set up in the environment to run analyses on their first day?
Yes. Once access is set up for GCP, new data scientists can run queries with BigQuery, spin up GCE instances, and run Cloud DataFlow jobs. Right now data scientists need to run Jupyter notebooks locally, but we will explore using JupyterHub as the team grows.
- Can data scientists utilize the latest tools/packages without help from IT?
Yes. Data scientists can install whatever software they need on their local machine, which is usually an R or Python environment, in addition to IntelliJ for launching production jobs.
- Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?
Yes. Our data scientists can run Cloud DataFlow jobs using auto-scaling for both testing and production use cases.
- Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?
No. While we do use Jupyter notebooks and GitHub to store past analyses, these do not provide a sufficient environment for rerunning past experiments, because they do not package up data or script dependencies. One approach we’ll explore is using Docker for reproducible research.
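Until we adopt something like Docker, a lighter-weight stopgap is to snapshot the run context alongside each result. Here is a minimal sketch of that idea; the manifest fields and file names are illustrative assumptions, not our production setup:

```python
import hashlib
import json
import platform
import sys


def experiment_manifest(data_path, params, git_commit):
    """Record enough context to rerun an analysis later.

    data_path, params, and git_commit are supplied by the caller;
    the field names here are illustrative, not a fixed schema.
    """
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "git_commit": git_commit,          # code version the notebook ran against
        "params": params,                  # model/query parameters
        "data_sha256": data_hash,          # fingerprint of the input data
        "python": sys.version.split()[0],  # interpreter version
        "platform": platform.platform(),   # OS the run happened on
    }


# Example: write the manifest next to the notebook's output.
# with open("manifest.json", "w") as out:
#     json.dump(experiment_manifest("input.csv", {"alpha": 0.1}, "abc123"), out, indent=2)
```

This doesn’t replace a packaged environment, but it makes it possible to tell whether a rerun used the same code, parameters, and data as the original.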
- Does collaboration happen through a system other than email?
Yes. We use Slack for messaging and have weekly data science reviews.
- Can predictive models be deployed to production without custom engineering or infrastructure work?
Yes. One approach we use is exporting models to PMML, as demonstrated in this tutorial. This requires initial engineering work, but is reusable once a template has been set up.
- Is there a single place to search for past research and reusable data sets, code, etc?
No. While we do document analyses as long-form written reports, we do not currently make all of these reports available in a central, searchable repository. We are using Google Docs and Confluence to work towards this goal, but something like Airbnb’s Knowledge Repo would be ideal.
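As an interim step toward a knowledge repo, even a simple inverted index over report files makes past work searchable. A sketch, assuming reports live as markdown files in a single directory (the layout is hypothetical):

```python
import os
import re
from collections import defaultdict


def build_index(report_dir):
    """Map each lowercase word to the set of report files that contain it."""
    index = defaultdict(set)
    for name in os.listdir(report_dir):
        if not name.endswith(".md"):
            continue
        with open(os.path.join(report_dir, name), encoding="utf-8") as f:
            for word in re.findall(r"[a-z0-9]+", f.read().lower()):
                index[word].add(name)
    return index


def search(index, query):
    """Return reports containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = index.get(words[0], set()).copy()
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits
```

A dedicated tool would add ranking, metadata, and a review workflow, but this shows how little is needed to make an archive searchable at all.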
- Do your data scientists use the best tools money can buy?
N/A. We haven’t purchased any vendor tools for data science yet. I am excluding this from our score, because we are not constrained by a lack of vendor tools at our current team size. We use a number of open source tools and leverage GCP to scale our analyses to millions of households.
Nick’s blog post calls out two questions that we need to consider as we expand our data science team: how do we archive past analyses and experiments so that they are accessible to all team members, and how do we package up past experiments so that they are reproducible?
Highly Effective Data Science Teams
Drew Harry, Twitch
Here’s Drew’s proposal of the “Joel Test” for Data Science:
- Do you spend the vast majority of your time on projects that take longer than a day?
Yes. Our data science team is focused on projects that are scoped as part of a biweekly planning process. While we sometimes have high-priority work that needs to jump the queue, this is usually customer-facing work rather than ad-hoc requests for data or metrics.
- Does data infrastructure have dedicated engineers working on it?
Yes. Our data pipeline has dedicated engineering support. Given our size, data science is often involved in making updates to our pipeline.
- Do people in the organization have ways to access basic data without asking a data scientist?
No. We have not set up dashboards for reporting, and most data questions are currently answered via SQL.
- Can you access data without impacting production system performance?
Yes. While data science does ETL tables from our production system to BigQuery, where the majority of our analysis is performed, these queries do not noticeably impact system performance.
- Do you spend more time doing analysis than waiting for data?
Yes. Query performance is not currently a bottleneck for our data science team. Data science can also load data into our system without needing support from engineering.
- Is there documentation for major schemas?
No. While a subset of the major tables do have documentation, we do not have comprehensive documentation for all data sources.
- Is instrumentation considered part of a minimum launch-able product?
N/A. Our customer-facing product is data, rather than a web page or application that needs tracking instrumentation.
- Do you have a process for detecting and fixing bugs in data collection?
No. Data quality is a huge concern for us and we manually validate sources, but we haven’t set up an automated process for tracking detect rates or other quality metrics. We’re hiring for a governance role that will help define this process.
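A first automated step could be to compute a match (detect) rate per source on each pipeline run and flag sources that drift from their historical baseline. A minimal sketch; the field names and tolerance are made up for illustration:

```python
def detect_rate(records, field):
    """Fraction of records where `field` is present and non-empty."""
    if not records:
        return 0.0
    matched = sum(1 for r in records if r.get(field) not in (None, ""))
    return matched / len(records)


def check_source(records, field, expected, tolerance=0.05):
    """Flag a source whose detect rate drifts from its expected baseline.

    Returns the observed rate and whether it is within tolerance.
    """
    rate = detect_rate(records, field)
    ok = abs(rate - expected) <= tolerance
    return rate, ok
```

Wiring a check like this into the pipeline turns a manual validation step into an alert, which is the kind of process a governance role could own.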
- Is past research work documented and available in a central location?
No. This is similar to #7 above. We write and share reports, but have not yet formalized an archival process.
- Does the team have a regular process for reviewing work before sharing it?
Yes. We have regular meetings to share the impact of new models and to review edge cases before deployment.
- Do you run experiments to understand the impact of decisions?
Yes. We run experiments to help our customers understand the impact of using data in fundraising efforts and marketing campaigns, and to measure the significance of these actions.
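For measuring significance, a permutation test on the difference in means is one self-contained approach that makes no distributional assumptions. A sketch (the group data below is illustrative, not customer data):

```python
import random


def permutation_test(control, treatment, n_iter=10000, seed=0):
    """Approximate p-value for the observed absolute difference in means.

    Repeatedly shuffles the pooled observations and counts how often a
    random split produces a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(treatment) / len(treatment) - sum(control) / len(control))
    pooled = list(control) + list(treatment)
    n = len(control)
    extreme = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[n:]) / len(treatment) - sum(pooled[:n]) / n)
        if diff >= observed:
            extreme += 1
    return extreme / n_iter
```

In practice we would reach for a stats library, but the permutation approach is useful when the metric is unusual enough that no canned test applies.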
- Can you report negative results without major political pressure?
Yes. Improving model and data source quality is one of the key functions of our data science team, and we are defining a postmortem process to handle this type of scenario.
- Can the CEO (or other leader) name at least one way the team contributed that quarter?
Yes. Our data science team has helped make significant improvements to our net worth models, which is one of the key reasons that we are looking to grow our data science team.
- Are data scientists consulted in product and business planning processes?
Yes. Our product is data, and data science is a key stakeholder in setting a company roadmap for our products.
Drew’s blog post has some similar themes to the first post, noting that data scientists should be able to do impactful work without getting slowed down by infrastructure. One action item Drew’s list raises for us is establishing more process for measuring the quality of our data sources.
13 Steps to Better Data Science: A Joel Test of Data Science Maturity
Enda Ridge, Guerrilla Analytics
Here’s Enda’s proposal of the “Joel Test” for Data Science:
- Are results reproducible?
No. This is similar to #4 in the first post. We use notebooks to store past analyses, but do not have a system for reproducing environments.
- Do you use source control?
Yes. We use git for both analysis workbooks and production jobs.
- Do you create a data pipeline that you can rebuild with one command?
Yes. We use Cloud DataFlow to define and execute batch jobs.
- Do you manage delivery to a schedule?
Yes. We have roadmap review meetings to track the status of data science projects and to coordinate delivery with our engineering team.
- Do you capture your objectives (scientific hypotheses)?
Yes. Our product manager provides a spec at the start of projects, which defines the scope of work to be performed.
- Do you rebuild pipelines frequently?
Yes. We use Cloud DataFlow to define our data pipeline and run these on a regular schedule and ad-hoc as needed.
- Do you track bugs in your models and your pipeline code?
Yes. We use Pivotal for bug tracking, and review model code via pull requests on GitHub.
- Do you analyse the robustness of your models?
Yes. We track the accuracy of our models over time and compare the performance of new proposed models. This is one of the few scenarios where our data science team has set up dashboards.
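The comparison behind those dashboards can be sketched as scoring a proposed model against the incumbent on the same holdout set before promoting it. The function names, threshold, and toy models here are placeholders, not our production criteria:

```python
def accuracy(model, examples):
    """Fraction of (features, label) pairs the model predicts correctly."""
    correct = sum(1 for features, label in examples if model(features) == label)
    return correct / len(examples)


def should_promote(incumbent, candidate, holdout, min_gain=0.01):
    """Promote the candidate only if it beats the incumbent by min_gain.

    Returns (promote?, incumbent accuracy, candidate accuracy).
    """
    base = accuracy(incumbent, holdout)
    new = accuracy(candidate, holdout)
    return new - base >= min_gain, base, new
```

Requiring a minimum gain on a fixed holdout set guards against promoting a model whose apparent improvement is noise.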
- Do you translate model performance to commercial KPIs?
Yes. We have target metrics that we expect new models to improve upon.
- Do new candidates write code at interview?
Yes. We expect data scientists to have hands-on experience with SQL and a scripting language (R or Python).
- Do you have access to scalable compute and storage?
Yes. We use the auto-scaling feature of Cloud DataFlow to productize our predictive models.
- Can Data Scientists install libraries and packages without intervention by IT?
Yes. Same as #2 from the first post.
- Can Data Scientists deploy their models with minimal dependencies on engineering and infrastructure?
Yes. Same as #3 from the first post.
Enda’s list has a bit of overlap with the first post, but raises more questions around model maintenance and data pipelines. We score quite well on this list, but still fail the reproducibility question that also appears in the first list.
We are making good progress towards building a mature data science team. Many of the tools are already in place for building and deploying models, but we have some improvements to make on archiving our past experiments and making them reproducible.
Overall Score: 26/33
Our goal is to get closer to 33 as we grow our data science team. Achieving some of these will require a technical focus, such as incorporating Docker into our experimentation workflow, but many are focused on process and should improve as our data science team matures.