How to power up your product with machine learning using a Python microservice, pt. 1
The main motivation is to show how to conduct data science and machine learning projects, and to give a hands-on walkthrough of how to build a machine learning service and integrate it into a software product with a microservice-oriented architecture.
You may find these articles useful if you are a software engineer working on a machine learning project; a data scientist, machine learning engineer or data engineer considering ways to improve your work flow and your model delivery; a project manager trying to connect the dots and improve the data product delivery process; or an executive manager working towards employing data science and machine learning in your organisation.
This series of articles is split into several parts. This part, part 1, covers the background and introduction to the topic.
To start off, let’s define microservices (microservice-oriented software architecture) and machine learning.
In a nutshell, microservices is an approach to building software in which the functional parts of an application are independent and communicate via a communication protocol over a network. Every application component is isolated and interfaces with other components according to a service level agreement (SLA).
Machine learning, or ML, is a cross-disciplinary approach to extracting patterns from data using iterative computational techniques and mathematical algorithms. It is a sub-domain of computer science/data science. A machine learning software product follows the flow:
train: data + result -> rules
serve: rules + data -> result
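The two-step flow above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a one-feature least-squares line as the "rules"; the function names `train` and `serve` and the toy numbers are made up for this sketch and are not taken from any particular library.

```python
from statistics import mean


def train(data, results):
    """Train: data + result -> rules (here, the slope and intercept of a line)."""
    x_mean, y_mean = mean(data), mean(results)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(data, results))
    variance = sum((x - x_mean) ** 2 for x in data)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return {"slope": slope, "intercept": intercept}


def serve(rules, data):
    """Serve: rules + data -> result (apply the learned line to new inputs)."""
    return [rules["slope"] * x + rules["intercept"] for x in data]


# Toy data that follows y = 2x + 1 (an assumption made for illustration).
rules = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predictions = serve(rules, [5.0, 6.0])
print(rules, predictions)
```

The point of the sketch is the separation: training produces an artefact (the rules) that can be stored and shipped, and serving only needs that artefact plus fresh data.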
Why Microservices and Python?
First off, you may ask yourself two reasonable questions: why microservices, and why Python? I, like pretty much all of us, will use Google to answer them :)
As one sees, community interest in Python, microservices software development, machine learning and DevOps is growing over time, and these areas of interest are highly correlated. In other words, when people search for terms concerning microservices, machine learning, or DevOps, they are very likely to query Python-related search terms on Google as well.
The search trends above, the fact that Python is the third most popular general programming language after C and Java according to the TIOBE index (as of July 2019), and the relative simplicity of using Python to bootstrap and execute a project make it the leader for machine learning and data science.
As for microservice-oriented software architecture, it is clear that the developer community’s interest is skewed towards microservices rather than monolithic software architecture.
One of the main advantages of microservices is the independence of the services behind software product features, which enables distributed development and improves software scalability. This makes the approach a reasonable architecture choice for integrating machine learning into a software product.
Data Science/Machine Learning Project
Why should I bother?
It is a reasonable question to ask as an executive: should I bother with data science or machine learning for my product at the moment?
It comes with follow-up questions:
- Is my organisation ready to implement machine learning services?
- What should we improve, and how, to get to the point of readiness?
The diagram above may be helpful to answer those questions. Let’s follow it:
1. The first question to answer is whether there is already an efficient classical solution that delivers results according to business objectives. Classical in this case means that the software product works as “data + predefined business rules = result”. If the answer is yes, you should probably focus on other problems, or redefine the scope of the current problem before employing machine learning for your product. If the answer is no, we go to point 2.
2. Get a better picture of your resources and understand whether you have, or can afford, data science experts to integrate machine learning into your product. If the answer is no, you should de-scope your problem, restructure your organisation, or consider outsourcing or the use of third-party service providers for your product. If the answer is yes, we go to point 3.
3. What about infrastructure? Do we have what is required? Machine learning is a very computation-heavy and time-consuming process, so keeping costs down requires sufficient hardware resources and automated data pipelines. If the answer is no, you should define a strategy to provision, build and maintain the infrastructure needed to build and host data science solutions and machine learning models. If the answer is yes, we go to point 4.
4. Do we have all the necessary data? Can all the data be accessed, and is it in one common storage layer? This is a very tricky question and it may require input from many teams, such as Data Engineers, DWH, Product and Developers. It is important to clarify, however, because there cannot be data science without data :) If the answer is no, the foremost task before executing a data science project is to build a unified data layer with clean, easy-to-access data and to automate collection of the data relevant to your business problem. If the answer is yes, we go to point 5.
5. How flexible are the work flows in my organisation? Can we adjust to a new approach to solving problems? This is by far the trickiest question on the list, and it may not have a straight yes or no answer. However, if your organisation cannot support a mindset of process and work automation, any of your data science initiatives is likely to fail. For example, if it takes weeks to get access to data, months to get resources (GPU machines, workstations etc.) for model training, and endless hours of meetings to get sufficient support from business units to scope the problem, machine learning is unlikely to generate even a small positive value for your company. If the answer is no, your top priority should be reorganising your organisation’s work flows to facilitate the integration of new approaches that solve your business problems efficiently.
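The questions above form a simple sequential decision flow, which can be written down as a checklist. The following is a rough sketch under stated assumptions: the question keys, the `assess_readiness` function and the shortened action texts are hypothetical names introduced here for illustration, and the real answers will rarely be plain booleans.

```python
# Each entry pairs a question key with the action to take when the answer
# blocks progress. The keys are hypothetical names chosen for this sketch.
READINESS_CHECKS = [
    ("classical_solution_is_insufficient",
     "Focus on other problems or redefine the scope of the problem."),
    ("data_science_experts_available",
     "De-scope, restructure, outsource, or use third-party service providers."),
    ("infrastructure_available",
     "Define a strategy to provision, build and maintain infrastructure."),
    ("data_available_and_accessible",
     "Build a unified data layer and automate data collection."),
    ("work_flows_are_flexible",
     "Reorganise work flows to support new approaches."),
]


def assess_readiness(answers):
    """Walk the questions in order; return the first blocking action, or None if ready."""
    for key, action in READINESS_CHECKS:
        if not answers.get(key, False):
            return action
    return None


# Example: everything is in place except the infrastructure.
answers = {key: True for key, _ in READINESS_CHECKS}
answers["infrastructure_available"] = False
next_action = assess_readiness(answers)
print(next_action)
```

Note that the first key is phrased as the negation of the first question, so that `True` always means "proceed to the next point".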
If, after answering all the questions from the flow chart, we get to the final yes, we are ready to start working on machine learning services and on their integration into the product. Let’s start with the project flow.
Data Science/Machine Learning Project Flow
The foremost goal of a data science or machine learning project is to solve a business problem. The problem is the foundation of the project, and everything else is defined around it. Once it is set, the data science/machine learning project should follow this cycle:
1. The scope and the project objectives are set by Business Stakeholders, Project Managers and Data Scientists.
2. The necessary data is prepared. Here, many teams are involved:
- Product/Project Managers and Data Scientists define data features for modelling. Product people need to be involved to bridge the data team and the business.
- Data Engineers and Developers implement new pipelines to send the required data from the product to the data platform.
- Machine Learning and Data Engineers set up data pipelines on the data platform.
3. The infrastructure to train and deploy models is provisioned by the SRE/DevOps and Machine Learning engineers. Once the infrastructure is set up, it can of course be reused for other ML projects.
4. The next step, which is a cycle in itself, is called the machine learning experiment; it involves data scientists and analysts. The model, that is, the product of the machine learning project, is defined at this stage.
5. The build model step brings data scientists and machine learning engineers together. The product of machine learning is built at this stage of the project.
6. The final stage of the cycle, model deployment, involves Machine Learning engineers, Developers and Ops/SRE/DevOps engineers. Only once this step is completed can the project be evaluated by stakeholders. Only at this stage can the result of many teams’ work be integrated into the product and hence influence the business. Without this stage, a data science/machine learning project cannot be considered a project, because it would never reach users. Every machine learning project should have deployment as its foremost objective; API-first design is one of the useful approaches to follow.
Once the machine learning model/product has been delivered to users, the project can be de-scoped or redefined to improve the existing model, or a new project can be kicked off to build a new one.
There are some rules, which I would refer to as golden rules, for conducting a data science/machine learning project:
1. Lack of problem specification → infinite time-to-release for the model
If you skip the project scoping stage and fail to gather requirements from your stakeholders and to set clear business objectives for the problem, you cannot deliver a machine learning solution.
2. No infrastructure ≡ insufficient or bad data → no model, or a bad one
You cannot implement a machine learning solution without a sufficient data platform. Without one, you are very unlikely to have data of high enough quality to extract systematic patterns, train models and implement a solution that serves your users.
3. Lean/iterative development → successful ML project ≡ product delivery
Machine learning can only be delivered efficiently using a lean/iterative development approach:
- A Proof of Concept (PoC) is deployed first to test the data pipeline infrastructure and to set a model baseline.
- A Minimum Viable Product (MVP) of the service is delivered to stakeholders for evaluation of the solution.
- A productionised version of the service is delivered in the end as the final result, with monitoring and (automated) model adjustment.
Now that we have defined the machine learning project flow, let’s have a look at how machine learning services can be integrated into the product platform and what logical architecture can be employed for the data platform.
The diagram above illustrates the two main functional parts of a software application, the so-called platforms:
- Product platform, with services communicating with each other and with third-party services to deliver value to users/business customers through different interfaces, such as a web GUI or REST API end-points.
- Data platform, with services that give internal customers/business stakeholders sufficient tools to facilitate decision making by highlighting user activity.
The data platform can be broken down into three parts, with potentially three teams involved in maintaining them:
- Data Warehouse, or DWH: the part which interfaces with the product platform and third-party data providers to consolidate all useful data in one persistent data storage, and transforms the loaded raw data to provide pre-defined KPIs and data marts for further consumption by analytics teams, data science teams and business stakeholders. The Data Engineering team is responsible for this part. The team has at least three sets of SLA: DWH-to-BI+Business, DWH-to-Product, DWH-to-DS+Business.
- Analytics/Business Intelligence Platform, or BI: the part which consumes data from the DWH and provides analytics insights to business stakeholders as reporting solutions, for example dashboards and descriptive and prescriptive ad-hoc analyses. The BI and Data Analysis teams are responsible for this part. The team has at least two sets of SLA: BI-to-DWH, BI-to-Business.
- Data Science Platform, or DS: the part which consumes data from the DWH to build data science and machine learning services, either to improve analytics solutions facing internal customers/business stakeholders, or to implement new product features that improve business value for users/business customers. The Data Science and Machine Learning teams are responsible for this part. The team has at least three sets of SLA: DS-to-DWH, DS-to-Product, DS-to-Business.
Machine Learning Project Delivery
Results of the data science platform can be deployed in (at least) three ways:
- Data batch processing: the data science/machine learning service processes data batches on a schedule to make predictions and writes the results back to the data storage, or to the analytics platform, for further consumption.
- Reporting: the data science/machine learning service generates and delivers a report, for example as a dashboard, web-hook message, SMS, or e-mail report.
- Model as a service, MaaS: the data science/machine learning service delivers the result of a model prediction in real time. It can be integrated into the product platform to communicate with other product feature services, for example with the item search or content delivery service.
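To make the model-as-a-service option more concrete, here is a minimal sketch built only on the Python standard library. Everything in it is an assumption made for illustration: the `/predict` path, the JSON payload shape, the `predict` placeholder with hard-coded weights, and the helper `request_prediction` are hypothetical, and a real service would load a trained model and typically sit behind a production-grade web framework.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder model: a hypothetical weighted sum standing in for a trained model."""
    weights = [0.5, 0.25]
    return sum(w * x for w, x in zip(weights, features))


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def request_prediction(host, port, features):
    """Client side: what another product service would do to get a prediction."""
    connection = HTTPConnection(host, port, timeout=5)
    try:
        connection.request("POST", "/predict",
                           body=json.dumps({"features": features}),
                           headers={"Content-Type": "application/json"})
        return json.loads(connection.getresponse().read())["prediction"]
    finally:
        connection.close()


# Demo: start the service on a free local port, call it once, shut it down.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = request_prediction("127.0.0.1", server.server_port, [2.0, 4.0])
server.shutdown()
server.server_close()
print(result)
```

The model lives behind a network interface, so other product services depend only on the request and response format, not on how the model is trained or implemented.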
Despite being one of the most useful ways to deploy your machine learning solution, MaaS is barely covered on the web. For example, there were only 5 (!) model deployment related articles out of a total of 650 posts published on towardsdatascience.com from April till June 2019.
That brings us to the main technical objective of this series of articles: an illustration of how to build MaaS. The best way to accomplish this objective is to follow the learning-by-doing approach. Stay tuned for the hands-on part of the series, where I walk you through the steps to build a machine learning model and integrate it as a microservice into an existing web application.
To wrap up, here are the key takeaways:
- Python and microservices are a reasonable choice of instruments for integrating machine learning into your product.
- A data science project can only be executed when your organisation is ready for such a strategic move, for example if you have a common data layer with easy access, follow flexible work flows and can afford data science experts.
- The golden rules should be satisfied for successful execution of data science projects. First, you should always keep in mind that data science/machine learning is about solving a business problem, not about fancy techniques, and that a lack of clearly specified project objectives tends to lead to project failure. Second, a lack of data or of sufficient infrastructure tends to lead to a bad data science solution, or none at all. Third, an iterative lean approach tends to lead to successful execution of a data science/machine learning project.
- Model as a Service is an efficient, maintainable and scalable way to deliver a machine learning solution to your users or internal clients.
- About microservices: https://microservices.io
- About machine learning: https://link.medium.com/C9pSewGfSY
- API first approach: https://apifriends.com/api-creation/api-first
- API first approach: http://engineering.pivotal.io/post/api-first-for-data-science/
- MVP in data science: https://link.medium.com/PbmMswPfSY
- Lean development: https://leankit.com/learn/lean/principles-of-lean-development/