8 Best Practices for Working with Your Data Science Vendor — from Data Scientists
When you leverage data science expertise from outside your company for a project, you face a multi-layered challenge. There’s likely to be a gap in domain knowledge between your in-house experts and the data scientists, existing workflows may not readily accommodate new processes, and the lack of a common working language can obscure essential details. These obstacles can snowball, hurting productivity and leaving you with results that fall short of your expectations.
Starschema’s data science team has delivered over a hundred projects for major clients from a wide range of industries, from healthcare to finance, from manufacturing to global NGOs. These experiences have led us to a list of unwritten best practices that we have found time and time again to facilitate smooth and effective work in situations where a client employs external data science expertise. We’ve decided to put these in writing to help future clients — of ours and colleagues from other companies — make the most of their experience working with a data science vendor.
Here are eight tips with occasional real-life examples that will make working with data scientists from a vendor easier and help you get better results faster.
1. Create a mutual understanding of project requirements.
Clearly communicate how the expected results will be used, how they will be evaluated and what the KPIs are. For example, if the solution needs to work in real time, data scientists will have to know that a complex ensemble model — where multiple learning techniques are used in parallel — might not be appropriate, as such models take too long to calculate.
Data science KPIs are not always a good predictor of business results, which often presents a problem even in common use cases. A predictive model might accurately predict potential churners, for example, but if this result isn’t used to formulate appropriate retention incentives, the churn ratio will not improve.
2. Be prepared to adjust business processes.
Integrating data science into your operations, for example by continually monitoring model performance, might require changes in business processes. Let’s say that a model is used to prevent a negative business outcome. If you take preventive action in every case where the model predicts a negative outcome, you might never experience the issue again, but then how can you be sure that prevention was applied only where necessary?
If prevention is expensive, you should focus only on cases where it is really necessary. It pays to retain a ‘control group’, where the preventive measures are not implemented. The model can then be validated against this backtesting sample. Selecting the size of this sample requires some expertise: the sample should be selected carefully to be as small as possible but, at the same time, be representative.
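To make the trade-off concrete, here is a minimal sketch of one common way to size such a holdout: the standard sample-size formula for estimating a proportion within a given margin of error. This is an illustrative approach of our choosing, not necessarily the method any particular project would use, and a real selection would also need to ensure the sample is representative.

```python
import math

def control_group_size(expected_rate: float, margin: float, z: float = 1.96) -> int:
    """Minimum holdout size to estimate an event rate within +/- `margin`
    at roughly 95% confidence (z = 1.96), using the normal approximation
    for a proportion."""
    n = (z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    return math.ceil(n)

# Example: validating a model where ~10% of untreated cases go bad,
# and we want the estimate accurate to within +/- 2 percentage points.
print(control_group_size(0.10, 0.02))  # -> 865
```

The formula makes the tension explicit: a tighter margin or a rarer event rate drives the required control group larger, which is exactly why sizing it "as small as possible but representative" takes expertise.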
3. Help domain experts translate their needs into data science tasks.
Your in-house subject matter experts are most likely not data scientists –– but their knowledge of the business case might be extensive. Helping them translate their needs into data science tasks can facilitate communication. This can be accomplished with the help of your data science vendor — in fact, it helps to have the vendor involved early. It is essential to find a KPI of acceptance that is based on an agreement between your in-house teams and the data scientists. The KPI should be measurable from related data and support your business goals, reflecting both business value and quantitative metrics of model performance.
In some domains, KPI selection is relatively straightforward because well-known metrics are already in use at organizations; in others, it is not. For example, when working on an image segmentation problem (where certain pre-defined patterns must be located in pictures), we sometimes learn mid-project that there is no automated evaluation for the results, only “eyeball verification.” This is clearly not objective and, more importantly, not practical when dealing with thousands of images. Translating the desired aspects of the solution, such as recognizing a particular pattern, into a measurable data science task will greatly facilitate timely and cost-effective delivery in such cases.
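As an illustration of what replacing “eyeball verification” with an automated check can look like, here is a minimal intersection-over-union (IoU) score, a metric commonly used for segmentation. This is our example metric, not necessarily the one used in any project described above.

```python
def iou(pred: set, truth: set) -> float:
    """Intersection-over-union of predicted vs. ground-truth pixel sets:
    1.0 means a perfect match, 0.0 means no overlap at all."""
    if not pred and not truth:
        return 1.0  # both empty: trivially a perfect match
    return len(pred & truth) / len(pred | truth)

# Toy example with pixel coordinates: 2 pixels overlap out of 4 total.
pred = {(0, 0), (0, 1), (1, 1)}
truth = {(0, 1), (1, 1), (1, 0)}
print(iou(pred, truth))  # -> 0.5
```

Once a metric like this is agreed on, thousands of images can be scored automatically and consistently, and the acceptance KPI becomes a number both sides can track.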
4. Hold status meetings more frequently.
Make sure that the analysis is going in the right direction. You might have lots of questions that are better answered immediately rather than upon delivery, when there’s no more time left to make substantial changes. Based on early results, domain experts might spot a valuable outcome that can refine the approach to the problem.
Minor changes in the focus of analysis are hardly unusual, especially where data scientists are not experts in the domain at hand. For example, when analyzing X-ray imagery, we might identify a source where the model underperforms relative to other sources. The reason might be that the X-ray machine used different settings or produced an artifact, so all images coming from that source should either be dropped or analyzed separately.
5. Expert knowledge can make or break data preparation.
One of the greatest contributions you can make to a data science project is high-quality data preparation. A model is only as good as its input data: if the data is well-prepared and the relevant variables are fed to the model, you can expect good results; if no relevant variables are present in the data, even the best algorithm won’t give you high-quality predictions.
Domain experts are likely to know what variables have the highest impact on the target and which records should be filtered out from the data. If they support the data science team with such insights, model performance will likely improve.
A simple example: let’s say we’re predicting an illness where obesity increases risk. Having only body height and weight in the data might not be sufficient. But by calculating BMI (body mass index) we get the level of obesity, which would be a better predictor than weight or height alone.
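The BMI example above can be sketched as a one-line derived feature; the formula (weight in kilograms divided by height in meters squared) is the standard BMI definition.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Two people with the same weight but different heights get very
# different obesity levels -- information the raw columns hide.
print(round(bmi(85, 1.90), 1))  # -> 23.5 (normal range)
print(round(bmi(85, 1.65), 1))  # -> 31.2 (obese range)
```

A model given only the raw weight and height columns would have to learn this nonlinear relationship on its own; a domain expert can hand it over in one derived variable.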
Without domain knowledge, data scientists can derive lots of new variables by combining the inputs in different ways — and if they’re lucky, they will stumble upon relevant factors, but the safest and most productive way is to collect all this information from domain experts.
6. Strongly consider interpretability.
Black box models might provide more accurate predictions, but it is often more effective — and sometimes required by regulation — to use interpretable algorithms. When your algorithm is interpretable, target domain experts can validate the relationships it identifies and make sure that the prediction is calculated based on relevant information. There are well-known cases when black box models learned features irrelevant to the target. To learn more about this topic, see our white paper on interpretability:
Opening the Black Box - Learn how to think about AI and thrive in...
7. Facilitate understanding of the benefits of the project for all involved.
It often helps to involve each affected party early on and create a mutual understanding of why the given project is beneficial — not just for the company but also for individuals. As we mentioned above, it’s essential that the data scientists and the domain experts work together to share specific domain knowledge and validate the insights from the analysis. However, it’s difficult to achieve this synergy if not everybody is on the same page.
For example, let’s imagine a project for a marketing agency to automate campaign performance reports using text mining. While it’s evident for the management why automation is beneficial for the company, it might not be as readily apparent to the team member who currently creates these reports. From their perspective, the project can be a threat, something that would make their job obsolete. With this attitude, it is reasonable to expect this employee to be less than cooperative.
In reality, data science projects can automate and enhance a relatively small part of a person’s job, not eliminate it completely, so these kinds of fears are usually overblown. In fact, the projects tend to free up time and energy to focus on more complex and engaging tasks for employees. It is well worth dealing with these kinds of concerns and addressing them early on to ensure that everyone involved is behind the project, understands its value and shares the urge to move it forward.
8. Micromanagement kills the spirit.
It takes considerable effort to create a genuinely cooperative setup where the data science vendor and the in-house team of experts can and are willing to work together on a given project. And the best way to ruin this is by micromanaging every issue. Nothing can slow down a project more than regularly having to wait for a department head to approve answers to questions from the data science team that don’t require or benefit from their review.
Instead of creating a process for seeking approval, set clear boundaries around what is and isn’t shareable and what requires special permission. The process can be streamlined further by appointing a project manager who has the authority to make these calls and is available to step in when necessary. This ensures an adequate flow of information, clarifies boundaries around decisions and responsibilities, and promotes smooth project delivery.
Overall, the best strategy to involve a data science vendor in a project is to prepare an in-house team for cooperation. Data scientists bring the knowledge of data manipulation and algorithms, while your in-house domain experts bring industry-specific knowledge, both of which are essential for success when it comes to predictive modeling or related techniques.
Creating this cooperative team setup might be more challenging than some would expect, but by clarifying the expectations, highlighting the benefits for everybody involved, setting up regular update meetings and understanding the potential long-term business process changes, you are already setting up the project for success!
Do you — as a client or a data scientist — have something to add to this list? Have you experienced difficulties with projects because one or more of the above points were not observed? We’d love to hear about it, so let us know in a comment below or on our social media profiles!
About the authors:
Eszter Windhager-Pokol is head of data science at Starschema. She holds a degree in Applied Mathematics and has more than ten years of experience supporting data-driven decision-making as a consultant, with additional experience researching collaborative filtering and developing user behavior analytics products for IT security purposes. Eszter regularly holds data science trainings for business users and teaches Mastering the Process of Data Science at CEU as a visiting faculty instructor. She is an organizer of the R-Ladies Budapest meetup group and a member of the program committees of several international data science conferences. Connect with Eszter on LinkedIn.
Berta Böjte is a data scientist at Starschema. She helps companies provide better services and products for end-users with advanced analytics and machine learning solutions. In recent years, her main focus has been to develop a recommendation tool for a Wall Street company to ensure that clients receive high-quality, tailored financial services. Connect with Berta on LinkedIn.
Ákos Fekete has worked in a variety of roles in the fields of data science and data engineering. He recently helped a telco company create integrated data for supporting personalized recommendations for clients and obtain geospatial information to gain value from their data. His work was also instrumental in developing a data quality framework for the financial data lake of a Fortune 500 company. In addition to working as a data engineer on a variety of projects, he has carried out multiple data science projects as well. Connect with Ákos on LinkedIn.
READ MORE FROM STARSCHEMA:
Fighting the COVID-19 pandemic with data and context
Learn how Starschema, a global data services consultancy, helps the fight against COVID-19 with open data & analytics.