A large part of our business is making valuable data science products a reality. I’m going to share some of what we have learnt about how to do this quickly whilst maintaining the highest quality.
Pandas alone won’t save you
The first mistake is to believe that every data science project is only a data science problem. Several different roles each have an important part to play.
The best and most efficient teams I have worked in are a combination of Data Science + Product + Engineering roles. That will not come as a surprise, but what makes a difference is ensuring that these roles have a part to play throughout the project. Unless the project is heavily standardised, or a repeat of another, there is virtually no chance of finishing on time or to an acceptable standard by passing (or, more commonly, “throwing”) work from one team to another.
A common example of where early team integration really shines is during feature generation.
- Data features will often need to be re-run, re-configured and re-designed (multiple times)
- New data sources (with valuable additional signal) need to be ingested and cleaned
- Tests on various potential models will be repeated and reviewed when features change
A role-diverse team will create repeatable processes from the outset and adhere to the boundaries of company infrastructure, ensuring that no costly logic “re-writes” are required when moving from test to live datasets. The benefits are often greater than the sum of the parts: we develop data models supporting current project requirements and perhaps those of others too. This is just one of the areas where we have managed to cut weeks, if not months, out of projects without affecting quality.
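One way to make feature generation repeatable is to separate configuration from transformation logic, so re-runs and re-configurations don’t require code changes. Below is a minimal sketch of that idea; the `FEATURE_CONFIG` dict, `build_features` function and column names are all hypothetical, not a specific product:

```python
import pandas as pd

# Hypothetical feature configuration: which key to group by and how to
# aggregate each column. Editing this dict re-configures the pipeline
# without touching the transformation logic itself.
FEATURE_CONFIG = {
    "group_key": "customer_id",
    "aggregations": {"order_value": ["sum", "mean"], "order_id": ["count"]},
}

def build_features(orders: pd.DataFrame, config: dict = FEATURE_CONFIG) -> pd.DataFrame:
    """Build per-customer features from raw order rows.

    The same function runs against a small test extract or the live
    dataset, so no logic re-write is needed when switching sources.
    """
    features = orders.groupby(config["group_key"]).agg(config["aggregations"])
    # Flatten the MultiIndex columns into names like "order_value_sum".
    features.columns = ["_".join(col) for col in features.columns]
    return features.reset_index()

# Example run on a tiny in-memory dataset standing in for a test extract.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_id": [10, 11, 12],
    "order_value": [20.0, 30.0, 15.0],
})
features = build_features(orders)
print(features)
```

Because the configuration is plain data, it can be versioned and reviewed alongside the model tests that depend on it, which helps when features are re-designed multiple times.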
MLOps products help, collaboration is better
Secondly, do not let anyone tell you that an MLOps product (Kubeflow, MLflow, cloud-provided offerings, etc.) plus Data Science is going to be all the tooling your team will ever need.
Elements of Data Science projects are similar, especially in operations (monitoring, testing, processing, development environments, etc.). Although there are great open-source and cloud-based tools available, parts of your projects will be unique to your business.
If your business is software-based this might be second nature. If not, having a strategy to share both code and documentation between projects is a fantastic accelerator. Make sure you bring relevant teams together to discuss lightweight governance, the location of existing reusable parts, and their features.
Here is an example of our serverless components, which we use across several projects to reduce maintenance time and improve quality: https://datasparq.ai/elements-of-engineering
Think about the end at the beginning
Finally, those with experience will know that finishing a data science project is as difficult as starting or building one. It turns out that APIs are not designed for human consumption. And, no team is likely to be happy manually running data processes, creating graphs or having similar conversations forever.
It is also possible that no two Data Science projects you run will have the same results interface. A handful of examples we have delivered include: APIs, web and mobile applications, static files, databases (all different kinds), messaging service notifications and more.
Finishing a project means delivering something that can be used in operations, sustainably. It’s well worth taking the time early on in a project to make sure your team either has or can acquire the skills needed to finish the project, as they may differ from those needed to start and develop it. This is where product-oriented awareness and design help round a project off in a way that drives impact for the customer.
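To illustrate the “last mile” that raw model output rarely covers, here is a minimal sketch of turning opaque scores into a human-readable results payload and delivering it as a static file, one of the interface types listed above. The IDs, threshold and `recommendation` labels are invented for the example:

```python
import json

# Hypothetical raw model output: opaque scores keyed by internal IDs.
raw_scores = {"cust_001": 0.91, "cust_002": 0.34}

def to_results_payload(scores: dict, threshold: float = 0.5) -> list:
    """Translate raw scores into something an end user can act on.

    Raw API output is rarely designed for human consumption; attaching
    a plain-English recommendation is often part of finishing a project.
    """
    return [
        {
            "id": key,
            "score": score,
            "recommendation": "contact" if score >= threshold else "monitor",
        }
        for key, score in sorted(scores.items())
    ]

payload = to_results_payload(raw_scores)

# Deliver the payload as a static file; the same payload could equally
# feed an API response, a database table or a messaging notification.
with open("results.json", "w") as f:
    json.dump(payload, f, indent=2)
```

Keeping the payload-building step separate from the delivery step means the same results can be pushed to whichever interface the customer actually needs.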
Do not let any of the above put you off! Data Science projects are extremely rewarding, fantastic investments and really enjoyable to work on. The extra effort invested in early team integration, code collaboration and building tricky results interfaces is often rewarded by increased adoption and usage.