Deploying Machine Learning Solutions
#2 in the Evolving Data Science Series
As data scientists work, they have two principal goals:
- Solve a business problem given business-provided guidance and constraints
- Hand off the solution to those who will manage it
This hand-off approach has some major drawbacks, notably:
- Data scientists do not maintain the solution, so they have no skin in the game
- The industrialization process is slow and can result in models that are no longer relevant, or even entirely incorrect
- There is no guarantee that a newly deployed model is better than its predecessor or provides any business value
The Need to “Stay in Touch”
When data scientists fail to stay in touch with their product or solution, it loses credibility, viability, maintainability, and evolvability. So how does this happen? I’ll discuss that first, and then talk about how to address it.
In a traditional data science setting, the data scientist, or a team of data scientists, will decide on a solution approach, apply a standardized workflow to the problem, run numerous models against a specific set of data, and then decide when they have the best solution based on one or more metrics.
Putting a Model Into Production
The model, code and all, is then passed on to an ML engineer, data engineer, or software engineer, depending on the organization. Once handed off, the engineer puts the model into production, with results that depend heavily on their familiarity with the problem, the mathematics, and the language the model was created in.
Productionizing a model is tough
Putting the model into production can be an arduous process. From the data engineer's standpoint, they may need to:
- Rewrite the code into another language (reinventing the wheel)
- Debug code without the data science context (subpar results)
- Spend time learning and understanding the algorithms (non-development time)
- Add components non-engineers would not consider (e.g. logging, exception handling, and much more)
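To make that last point concrete, here is a minimal sketch of the kind of production hardening an engineer typically adds around a model call. The model, feature handling, and fallback behavior are all hypothetical, not a prescribed pattern:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model-service")

def predict_with_guardrails(model, features):
    """Wrap a raw model call with the logging and error handling
    that a notebook-style workflow typically omits."""
    try:
        # Basic input validation before the model ever sees the data
        if not features:
            raise ValueError("empty feature vector")
        score = model(features)
        logger.info("prediction succeeded: score=%s", score)
        return score
    except Exception:
        # Record the full stack trace, then fail in a controlled way
        # instead of crashing the calling service
        logger.exception("prediction failed for input: %r", features)
        return None

# Usage with a stand-in "model" (a plain averaging function here)
toy_model = lambda xs: sum(xs) / len(xs)
predict_with_guardrails(toy_model, [0.2, 0.4, 0.6])  # logs and returns a score
predict_with_guardrails(toy_model, [])               # logs the failure, returns None
```

The point is not the specifics; it is that validation, structured logging, and controlled failure are production concerns that rarely appear in exploratory model code.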
At the same time, this leads to some solution challenges:
- Losing context that may be implicit in code
- Providing results that are no longer valid
- Becoming difficult to update when the solution drifts (was once valid but no longer is)
- Not being easily understood by data scientists
- Being delivered after its window of usefulness has passed
Similarly, when a data scientist tries to deliver the solution without partnering with a data engineer, certain aspects may get “dropped”:
- Data scientists are not data engineers, so solutions may be less than “done”. Think of a loaf of bread that is baked and looks great on the outside but is raw and doughy in the middle
- SLAs and finer-grained interactions with the surrounding system may not be observed
- Security considerations associated with the solution may be missed
- Architectural qualities (reliability, availability, etc.) may not be considered
- Code correctness (quality, likelihood of faults, etc.) may be low
Addressing the Gaps
In both of these scenarios serious gaps exist, so neither is ideal. But finding that one-eyed, purple unicorn with batwings, who lives solely on blue raspberries grown in Atlantis and who can do all of this for you, is not feasible.
The point here is that putting a machine learning model into production is not the job of an engineer alone, nor of the data scientist alone. While I have highlighted some of the issues that occur when a data scientist deploys a solution single-handedly, or when only an engineer is involved, these are just a few of the problems that can, and will, arise. The truth is, this job belongs to a team. It is a process that should be continual, should be automated, and should integrate into a larger system.
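As one small sketch of what “continual and automated” could mean in practice, consider a promotion gate that only deploys a candidate model if it beats the current production baseline on a holdout metric, which speaks directly to the “no guarantee a new model is better” problem above. The metric name and improvement threshold here are hypothetical, not a recommended standard:

```python
def should_promote(candidate_metrics, production_metrics,
                   metric="auc", min_improvement=0.01):
    """Automated gate: promote the candidate only if it beats the
    production baseline by at least min_improvement on one metric."""
    cand = candidate_metrics.get(metric)
    prod = production_metrics.get(metric)
    if cand is None or prod is None:
        # Missing metrics: refuse to promote rather than guess
        return False
    return cand >= prod + min_improvement

# A clear improvement passes the gate
should_promote({"auc": 0.87}, {"auc": 0.85})   # True
# A marginal improvement is held back for more evidence
should_promote({"auc": 0.855}, {"auc": 0.85})  # False
```

A check like this would run inside a CI/CD pipeline on every retraining cycle, so that deployment decisions are made continuously by the system rather than occasionally by a hand-off.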
Building a solution space to enable MLOps that is fit-for-purpose for your organization is what we do at Hashmap. Our data scientists, engineers, and architects will work with you to help build a solution that fits your organization.
Spanning strategic assessments to tactical implementations, we partner with our clients to provide guidance, direction, and advice that helps better position data and analytics solutions and outcome delivery for this new decade.
Please contact us here to find out more about how we can help you evolve your 2020 data & analytics vision.
This is part of the Evolving Data Science series.
Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.
If you enjoyed reading this, some of John’s other recent stories are below:
Securely Using Snowflake’s Python Connector within an Azure Function
Why and How to use Key Vault
Using DBT to Execute ELT Pipelines in Snowflake
DevOps and DataOps for Snowflake with DBT and Azure DevOps
Orchestration and DAG Design in Apache Airflow — Two Approaches
John Aven, Ph.D., is Lead Regional Technical Expert at Hashmap, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Be sure to connect with John on LinkedIn and reach out for more perspectives and insight into accelerating your data-driven business outcomes.