10 best practices you cannot do without in any data science team
Being a data scientist is exciting, rewarding and challenging all at the same time. As we go about our day at work, we encounter many opportunities while working with people from across the company. Not to mention, we juggle a plethora of responsibilities. At ANWB, we have been growing a data science team for over two years, building tailored data science solutions for internal clients across all of our business lines. We believe it is a good idea to periodically reflect on our projects in order to draw lessons from them. Everybody can always improve. After all, experience equals growth!
As such, our team of data scientists and data engineers has compiled a list of best practices that will be useful to any data science team. Stick to these and you will see your projects run more smoothly. We promise.
ANWB’s top 10
1 Always double-check that the business question is clear
Translate the business question to a data science question and then dive into the data. You will discover very soon that something is missing, e.g., problems with data or ambiguity in the business question. Getting the business question right is an iterative process. Don’t be afraid to keep going back until you have a crystal clear idea of what you want to build. This will end up saving you and the business a lot of valuable time.
How we do it: We always start with the use case in our initial conversations with our stakeholders. This clarifies their expectations of our solution and gives us an initial idea of what we need to do to solve their problem. This initial idea is tested iteratively throughout the project, because requirements can shift from time to time. Once the business properly sees the benefits of the proposed solution, they will automatically get invested in the project and will naturally be more eager to help you out with the development.
2 Make sure to define the scope of your projects
It is very tempting to spend time creating plots, working on follow-up questions and doing extra research. Be critical of how these activities will contribute to a better solution in the end, and know when to stop. As a data scientist, you always want to provide the best solution possible. You may obtain a model that achieves, say, 90% performance on your task; given a good set of features, such a result can often be reached relatively quickly. It is then natural to feel the urge to spend another two months chasing higher performance, only to gain 1%. Perhaps the customer was already very satisfied with 90%, so try to scope your model performance too: when is it good enough? The same thing can happen in reverse: what if the customer expects the project to be valuable only at a performance of at least 90%, but you seem stuck at 70% despite using all of the data and the most informative features? Is it wise to continue? If you don’t feel that an activity will lead to a better result, scrap it from your planning. Focus your activities on reaching the end goal.
How we do it: We do two things to define the scope of our projects. First, we use an extensive template that our stakeholders need to fill in before we start working on the project. It covers questions such as the envisioned use case, the available data and the “landing zone”, i.e., the requirements for deploying our model in such a way that there are no loose ends. It is clear from the start where the model is hosted, what the inputs and outputs are, and how exactly the outputs will be used by our customers. The second thing we do is work in sprints, typically three weeks long. At the end of each sprint, we check our progress with the stakeholder during a review. This gives the stakeholder the opportunity to suggest ways to improve the results, such as delivering more data or business rules. We have found that this is an efficient way to make fast progress, and it prevents us from getting stuck optimizing our models for minimal improvements.
3 Learn to pull the plug early and don’t be afraid to say no
It can happen that a project is too ambitious for the available resources (time, money, people, equipment). Always evaluate whether this is the case and signal to others when there are too many challenges to bring the project to a good end. It will save a lot of time and energy if you go back to the drawing board to scope the project differently or more precisely. Ask for more senior assistance, return to basics (what’s the goal?) and identify which related questions need to be answered. Not all projects are successful. The best thing you can do is learn to recognize the signs of a failing project as early as possible and address them. After all, we can learn a lot from projects that are not successful!
How we do it: We cut our development cycle into phases, each consisting of one or multiple sprints. After an initial scoping and gathering of requirements, we spend a relatively small amount of time working on a PoC (Proof of Concept). Once the PoC is finished, we assess how effective our solution is at solving the business problem. If we think we can continue, we get a green light from everyone to progress toward the next step: creating an MVP (Minimum Viable Product). In this stage, we productionize our code and work out the final kinks and improvements in our model. Once this is done, we assess the results once more before moving the data science solution into production. The key is to build in many go/no-go moments in order to prevent spending too much time on an unsuccessful project.
4 Add domain experts to your project group
Speaking with domain experts is crucial to understanding the business question. Don’t just gather requirements from the stakeholder, but also make sure that you speak to the domain experts or end users of your solution. Prevent living in an ivory tower as you will risk developing a product that does not fit the end users’ demands.
How we do it: It is difficult to think of all the relevant questions before starting the project, as new questions will arise during it. Make sure beforehand that there is always at least one domain expert available to help you out when things are unclear. Another reason to involve them is that you will need their expertise to improve the model even after it is considered finished. The quality of your model will only really become clear once it is in production or once your experts have a way to interact with it. This feedback loop between you and the experts is very important: it helps you detect and fix model defects rapidly and ensures that you can deliver more added value to your business.
5 Manage your model by dumbing it down
Try to understand your model as well as possible. Some models may require tens, if not hundreds, of different features. Once you have a baseline performance that you and your stakeholders are satisfied with, try to make the model as manageable as possible.
How we do it: Once our model has been trained and assessed, we look at removing features that provide no additional value. Always choose the simpler of two models when they perform equally well. This has multiple benefits. First of all, it is less likely that you are modeling real-world noise. Secondly, the fewer features there are, the easier it becomes for you and your stakeholder to understand and explain the results of the model. Thirdly, your model becomes more maintainable: every time it needs to perform inference or be retrained, you have to put less effort into gathering and engineering all the features you have used.
6 Translate your results to real-world metrics
We think in metrics such as accuracy and precision, whereas our stakeholders think in €€€ or operational time saved. It is our job to translate model performance into their real-world metrics. Additionally, always point out the room for improvement (e.g., better data, extra features) and ask whether the stakeholders can provide it for better results. This ensures that the stakeholder can assess the quality of our solution and how well it would work within their operational process.
How we do it: We take a moment to understand the use case so that we can properly translate our metrics into those of our stakeholders. We build a presentation slide deck which explains the improvements in metrics such as profit, NPS and operational time savings. From time to time, we also explain basic data science concepts (precision, recall, AUC etc.) to those who are interested. It helps them understand our world better, which is healthy for our long-term cooperation. Some have even started reasoning with us on our level. After all, everyone wants to feel smart 😉.
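Such a translation can be a simple back-of-the-envelope calculation. The sketch below turns precision into an estimated monthly euro value; every number in it (volumes, costs, savings) is a made-up assumption, not an ANWB figure:

```python
# Hypothetical translation of model metrics into money — all amounts and
# volumes below are invented for illustration.
cases_per_month = 10_000      # predictions the model makes each month
flagged_fraction = 0.05       # fraction of cases the model flags for follow-up
precision = 0.90              # fraction of flagged cases that are truly positive
saving_per_true_hit = 200.0   # € saved per correctly flagged case
cost_per_false_alarm = 25.0   # € wasted per wrongly flagged case

flagged = cases_per_month * flagged_fraction
true_hits = flagged * precision
false_alarms = flagged - true_hits

monthly_value = true_hits * saving_per_true_hit - false_alarms * cost_per_false_alarm
print(f"Estimated monthly value: €{monthly_value:,.0f}")  # → €88,750
```

A stakeholder who shrugs at “precision 0.90” will react immediately to “roughly €89k per month under these assumptions”.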
7 Have a basic deployment infrastructure in place
It is very important to have a basic infrastructure in place that can take you from deployable code to integration within the systems of the stakeholder. Otherwise, the time to market for your solutions will be extremely long. As a data scientist, you will get caught up in creating the core solution. Deploying it should not be an afterthought; it should almost be seen as a project in itself.
How we do it: Within our organization we have a toolset that can host models through GitLab pipelines and AWS. We can deploy two kinds of models: batch and real-time. From a data scientist’s perspective, our code needs to follow a certain convention and folder structure and be able to run locally on our laptops. We then set up some configuration files in which we register a few important settings, such as the batch schedule (in case of batch deployment) and the S3 buckets that we draw data from for inference purposes. We can also provide our own Dockerfile with build instructions. Our GitLab CI/CD pipelines are set up in such a way that they build the project into a Docker container, which is subsequently deployed into our AWS environment. We first test our solution in dev. If our build and deployment are successful, we move our deployment to production.
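A deployment configuration of this kind could look something like the YAML sketch below. The field names and structure are purely illustrative assumptions, not our actual convention:

```yaml
# Illustrative deployment config — field names, values and structure are
# assumptions, not the actual ANWB convention.
model_name: churn-predictor        # hypothetical model
deployment_type: batch             # batch | real-time
schedule: "0 6 * * 1"              # cron expression for batch runs
input_bucket: s3://example-inference-input
output_bucket: s3://example-inference-output
dockerfile: Dockerfile             # optional custom build instructions
```

The point is that everything the pipeline needs to know — when to run, where the data lives, how to build the image — is declared in one place next to the code.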
8 Develop your deployment infrastructure in multiple phases
If you are a data engineer, you will need to develop and maintain a well-oiled deployment infrastructure. This infrastructure enables internal customers to use the solutions that the data scientists have created. It seems like a gargantuan task at first, but it is achievable when development is broken into the right phases.
How we do it:
1. Sketch out the structure of your platform, based on the requirements given by both the business and your data scientists. Spend a lot of time on this, as it will save time later.
2. Create pipelines that can be triggered. Split the jobs in each pipeline into clear steps (e.g., data preprocessing, model inference etc.). Test every large addition to your infrastructure very thoroughly. Modularity is extremely important for proper and easy maintenance.
3. Work iteratively on your development until you have a good end-to-end solution.
4. Automate this solution such that it fits your platform using CI/CD.
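As an illustration of steps 2 and 4, a triggerable pipeline with modular jobs could look like the `.gitlab-ci.yml` sketch below. The stage names, image and module names are hypothetical, not our actual setup:

```yaml
# Hypothetical .gitlab-ci.yml sketch — stages, image and scripts are
# illustrative assumptions, not the actual ANWB pipeline.
stages:
  - preprocess
  - inference
  - postprocess

preprocess:
  stage: preprocess
  image: python:3.11
  script:
    - python -m pipeline.preprocess    # hypothetical module
  rules:
    - if: '$CI_PIPELINE_SOURCE == "schedule"'   # run on the batch schedule

inference:
  stage: inference
  image: python:3.11
  script:
    - python -m pipeline.inference     # hypothetical module

postprocess:
  stage: postprocess
  image: python:3.11
  script:
    - python -m pipeline.postprocess   # hypothetical module
```

Because each job is its own module, you can test, replace or rerun one step without touching the others — exactly the modularity step 2 asks for.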
9 Shape a team that is versatile
Try to get a good mix of all levels, i.e., juniors, mediors and seniors with different skillsets. Another plus is to have people with different ambitions, e.g., people who like to manage projects and people who want to be more involved on the operational side. The short-term benefit is that you can put people with different strengths together on projects. A versatile project group brings different viewpoints, which is healthy and may prevent groupthink. In the longer term, you may want to think about managing these proportions as the team grows, balancing the team’s capacity to pick up projects against business demand.
How we do it: We have created a skills matrix in which we document our team members’ strengths. We aggregate this on a team level, so that we can identify our collective shortcomings. This helps us in growing into our ideal picture of a data science team. Individually, members of the team can see which skills they need to develop in order to become more T-shaped.
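A skills matrix can be as simple as a table of self-assessed levels. Here is a toy sketch (the names, skills and 0–3 levels are all invented) showing how a team-level view of collective shortcomings falls out of it:

```python
import pandas as pd

# Invented skills matrix: rows are team members, columns are skills,
# values are self-assessed levels from 0 (none) to 3 (expert).
matrix = pd.DataFrame(
    {
        "python":           [3, 2, 1],
        "sql":              [2, 3, 2],
        "cloud_infra":      [1, 0, 3],
        "stakeholder_mgmt": [2, 3, 1],
    },
    index=["Alice", "Bob", "Carol"],
)

# Team-level view: the lowest averages are the collective shortcomings.
team_view = matrix.mean().sort_values()
print(team_view)
```

The individual rows then show each member which skills to develop to become more T-shaped, while the aggregated column means guide hiring and training for the team as a whole.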
10 Proper communication really is the key to success
Not every stakeholder is the same. Some stakeholders are more data-minded and pick up concepts faster than others. Adjust your communication style to your audience by getting familiar with your stakeholders. Communication happens at every level, too: between team members, and between yourself and your (external) stakeholders. You need to learn to strike a good balance: involve everyone at the right time while still getting your own work done. Our field is very broad and we need to set priorities every day. Proper communication helps you maintain a goal-oriented attitude, and in the end it will make you a more effective professional.
How we do it: Ask your stakeholders for their background during your first project meeting, so that you can tune your communication to what the stakeholders are comfortable with. Think about your dependencies: when do you need someone (e.g., a fellow data scientist or stakeholder) and what do you need from them? Make sure to set expectations beforehand and be open and clear about your plans.
Wrap-up!
And that’s it! Okay, perhaps one bonus tip: enjoy! Sometimes we get so wrapped up in our projects that we forget to take a moment to appreciate how awesome our work is. We have to be jacks of all trades, and no two days are the same. Our field is so incredibly broad that we are never done mastering everything. We simply cannot get bored. The world of data is continuously evolving and we are here to witness it. Let’s get better together. Will you apply these tips in your next project?
Author: Prashand Ramesar, Data Scientist