Implementing a new data science and analytics platform: Part 2
Recap
In part 1, I introduced our journey towards the implementation of a Data Science and Analytics platform. I explained that a data-driven company needs to consider many aspects, from hiring good talent to investing in a new data platform.
We went through a non-exhaustive list of requirements for a good data platform, which we used to shortlist two solutions for a POC: Databricks and AWS SageMaker.
Databricks is a software platform that helps its customers unify their analytics across business, data science, and data engineering. It provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Find more details here.
Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then deploy them directly into a production-ready hosted environment. Find more details here. This was already available internally, as CTM uses AWS as its cloud provider.
The POC ran for a month, during which we assessed the functionality of both solutions and validated them against each other, as well as against the current environment where relevant.
Methodology
The POC was divided into two main parts:
- Architecture & DevOps assessment
- End to end testing
Architecture & DevOps assessment
In this part, the focus was on the platform deployment and administration. We created an isolated AWS account, identical to the main account we use for our daily tasks. We then went ahead with the deployment of Databricks, which we found straightforward. The tests were evaluated against the following categories:
- Deployment: How easy it is to deploy Databricks within AWS
- Administration: What features are available for the platform admin, how effective they are
- Tools & Features: Are the available tools capable of covering all our daily tasks
- Performance: Query performance, job performance, …
- Integration with external services
End-to-end testing
This is where each solution was tested in much finer detail, by developing and productionising a machine learning model with both Databricks and SageMaker.
Since Databricks casts a wider net than just machine learning applications, we arranged a two-day hackathon involving various teams within the Data Function, who worked through a scripted task list predefined by representatives of each team (Insights, Analytics, Data Science, etc.).
This part was evaluated using a scorecard that rolled up into various categories such as:
- Productivity & Workspace: ease of use, platform performance, stability of the environment, …
- Collaboration: Collaboration with other users, sharing results and dashboards, …
- Analytics: Data manipulation, visualisation, and data export, …
- Data Science: machine learning lifecycle management, …
Note: the list above is not exhaustive, just a high-level overview.
Each team member was expected to score the various tasks under each category. The scores were then discussed, to understand the reasoning behind them, and averaged where relevant to get an idea of which option the team preferred (Databricks, SageMaker, or the current way of working).
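To illustrate the idea, here is a minimal sketch of how such scorecard results could be averaged per category and rolled up per solution. The category names, score values, and 1-to-5 scale are hypothetical, not the actual scorecard we used:

```python
from statistics import mean

# Hypothetical scorecard: each list holds the 1-5 scores that individual
# team members gave for a category, for each candidate solution.
scores = {
    "Databricks": {
        "Productivity & Workspace": [4, 5, 4],
        "Collaboration": [5, 4, 5],
    },
    "SageMaker": {
        "Productivity & Workspace": [3, 4, 3],
        "Collaboration": [3, 3, 4],
    },
}

# Average each category's scores, then roll the category averages
# up into a single overall figure per solution.
summary = {
    solution: {cat: round(mean(vals), 2) for cat, vals in cats.items()}
    for solution, cats in scores.items()
}

for solution, cats in summary.items():
    overall = round(mean(cats.values()), 2)
    print(f"{solution}: {cats} -> overall {overall}")
```

A simple average like this is only a starting point; as noted above, the numbers served mainly to structure the discussion rather than to decide the outcome on their own.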
At the end of the POC, we all agreed that Databricks was the better technical solution for our situation, as it enables seamless collaboration across key personas. Note that SageMaker, being very ML-oriented, was not suitable for some of our use cases, in which case the comparison was made against our current way of doing those tasks.
Key takeaways
It was a great experience designing and running such a large POC, involving many stakeholders both within and outside CompareTheMarket. There was a lot to learn, not just technically, but in many other areas such as data governance, risk & compliance, and project management, to list just a few.
These are some of my key takeaways from this project:
- Build a POC environment which is as close as possible to the final environment
- Properly define in advance the scope of the POC. This should not prevent the team from exploring all the features within the platform, but it helps to structure the POC, and to make sure we cover all the functionality we are interested in
- Involve the wider team in the scope definition, to make sure nothing major has been left out
- Involve other stakeholders as/when possible (Governance team, information security, …)
- Involve the vendor's technical team, and do not hesitate to let them know if there are any issues or questions
- Some of the problems we identified turned out to be non-issues and were resolved quickly by the support team
- Have a scorecard. We found it particularly useful in our case. Although the responses can be subjective, it helped us have productive discussions with the team and allowed us to identify key outcomes
The next part will focus on the implementation of the platform, which was also rich in learnings.