Deploying Machine Learning Solutions
#2 in the Evolving Data Science Series
As data scientists work, they have two principal goals:
- Solve a business problem given business-provided guidance and constraints
- Hand off the solution to those who will manage it
This hand-off approach has some major drawbacks, notably:
- Data scientists do not maintain the solution, so they have no skin in the game
- The industrialization process is slow and can result in models that are no longer relevant, or even entirely incorrect
- There is no guarantee that a newly deployed model is better than its predecessor or provides any business value
The Need to “Stay in Touch”
When data scientists fail to stay in touch with their product or solution, it loses credibility, viability, maintainability, and evolvability. So how does this happen? I’ll discuss that first, and then talk about how to address it.
In a traditional data science setting, the data scientist, or a team of data scientists, will decide on a solution approach, apply a standardized workflow to the problem, run numerous models against a specific set of data, and then decide when they have the best solution based on one or more metrics.
Putting a Model Into Production
The model, code and all, is then passed on to an ML engineer, data engineer, or software engineer, depending on the organization. Once handed off, the engineer puts the model into production, with results that depend heavily on their familiarity with the problem, the mathematics, and the language the model was created in.
Productionizing a model is tough
Putting the model into production can be an arduous process. From the data engineer's standpoint, they may need to:
- Rewrite the code into another language (reinventing the wheel)
- Debug code without the data science context (subpar results)
- Spend time learning and understanding the algorithms (non-development time)
- Add components non-engineers would not consider (e.g. logging, exception handling, and much more)
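To make that last point concrete, here is a minimal sketch of the kind of production hardening an engineer typically adds around a model call. The model, feature handling, and fallback behavior are all hypothetical, not a prescribed pattern:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-service")

def predict_with_guardrails(model, features):
    """Wrap a raw model call with the logging and error handling
    that a notebook-style workflow typically omits."""
    try:
        # Basic input validation before the model ever sees the data
        if not features:
            raise ValueError("empty feature vector")
        score = model(features)
        logger.info("prediction succeeded: score=%s", score)
        return score
    except Exception:
        # Record the full stack trace, then fail in a controlled way
        # instead of crashing the calling service
        logger.exception("prediction failed for input: %r", features)
        return None

# Usage with a stand-in "model" (a plain averaging function here)
toy_model = lambda xs: sum(xs) / len(xs)
predict_with_guardrails(toy_model, [0.2, 0.4, 0.6])  # logs and returns a score
predict_with_guardrails(toy_model, [])               # logs the failure, returns None
```

The point is not the specifics; it is that validation, structured logging, and controlled failure are production concerns that rarely appear in exploratory model code.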
At the same time, this leads to some solution challenges:
- Losing context that may be implicit in code
- Providing results that are no longer valid
- Becoming difficult to update when the solution drifts (was once valid but no longer is)
- Not being easily understood by data scientists
- Being delivered after its window of usefulness has passed
Similarly, when a data scientist tries to deliver the solution without partnering with a data engineer, certain aspects may get “dropped”:
- Data scientists are not data engineers, so solutions may be less than “done”. Think of a loaf of bread that is baked and looks great on the outside but is raw and doughy in the middle
- SLAs and finer-grained interactions with the surrounding system may not be observed
- Security considerations associated with the solution may be missed
- Architectural qualities (reliability, availability, etc.) may not be considered
- Code correctness (quality, likelihood of faults, etc.) may be low
Addressing the Gaps
In both of these scenarios serious gaps exist, so neither is ideal. But finding that one-eyed, purple unicorn with batwings, who lives solely on blue raspberries grown in Atlantis and who can do all of this for you, is not feasible.
The point here is that putting a machine learning model into production is not the job of an engineer alone, nor of the data scientist alone. While I have highlighted some of the issues that occur when a data scientist deploys a solution single-handedly, or when only an engineer is involved, these are just a few of the problems that can, and will, arise. The truth is, this job belongs to a team. It is a process that should be continual, should be automated, and should integrate into a larger system.
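As one small sketch of what “continual and automated” could mean in practice, consider a promotion gate that only deploys a candidate model if it beats the current production baseline on a holdout metric, which speaks directly to the “no guarantee a new model is better” problem above. The metric name and improvement threshold here are hypothetical, not a recommended standard:

```python
def should_promote(candidate_metrics, production_metrics,
                   metric="auc", min_improvement=0.01):
    """Automated gate: promote the candidate only if it beats the
    production baseline by at least min_improvement on one metric."""
    cand = candidate_metrics.get(metric)
    prod = production_metrics.get(metric)
    if cand is None or prod is None:
        # Missing metrics: refuse to promote rather than guess
        return False
    return cand >= prod + min_improvement

# A clear improvement passes the gate
should_promote({"auc": 0.87}, {"auc": 0.85})   # True
# A marginal improvement is held back for more evidence
should_promote({"auc": 0.855}, {"auc": 0.85})  # False
```

A check like this would run inside a CI/CD pipeline on every retraining cycle, so that deployment decisions are made continuously by the system rather than occasionally by a hand-off.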
Building a solution space to enable MLOps that is fit-for-purpose for your organization is what we do at Hashmap. Our data scientists, engineers, and architects will work with you to help build a solution that fits your organization.
Spanning strategic assessments to tactical implementations, we partner with our clients to provide guidance, direction, and advice that helps better position data and analytics solutions and outcome delivery for this new decade.
Please contact us here to find out more about how we can help you evolve your 2020 data & analytics vision.
This is part of the Evolving Data Science series.
Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.
If you enjoyed reading this, some of John’s other recent stories are below:
Securely Using Snowflake’s Python Connector within an Azure Function
Why and How to use Key Vault
Using DBT to Execute ELT Pipelines in Snowflake
DevOps and DataOps for Snowflake with DBT and Azure DevOps
Orchestration and DAG Design in Apache Airflow — Two Approaches
John Aven, Ph.D., is Lead Regional Technical Expert at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.