Implementing a new data science and analytics platform
How do you choose a strong solution for your business? Which platforms are best? There’s lots to consider:
"Be more data driven!" This is a sentence we hear more and more. The boom in big data technologies has opened the doors to possibilities we could never have imagined before, and the affordability of these solutions makes advanced analytics and data science available to all.
However, building a strong data science and analytics foundation requires many aspects to be taken into consideration and investments to be made:
- Hire new talent
- Review the internal technology stack and potentially invest in new technologies
- Put in place a proper governance around data related activities
comparethemarket.com started this journey many years ago by hiring data specialists (data engineers, data analysts, data scientists, etc.) and implementing new infrastructure (initially Hadoop).
This led to the rapid growth of our data activities, with many positive results (massive processing of unstructured data, machine learning at scale).
However, a few years ago we decided to build on this by implementing a unified data science and analytics platform, as a single platform would be easier to maintain, more cost-effective, and more flexible for the work we were doing.
This was a long project, but after 10 months of work, it is now complete!
Over the course of a series of blog posts I’ll share our learnings from the implementation.
- Initial decisions
- Proof Of Concept
- Architecture design
- Implementation & Onboarding
Note: What we are sharing in this blog is not THE way to implement a data science and analytics platform, but a solution that fitted our context.
Although the business had invested in many data tools, there was no enterprise platform in place for delivering advanced analytics and data science. Given the growing size of the team, and the number of incoming projects, we decided to look for a solution that would enable collaboration between team members and allow us to take on new projects.
A few of our requirements were:
- Scalability: Not just kit-wise, but also the team's ability to scale to more projects, including upskilling, onboarding, collaborating, etc.
- End-to-end functionality: Ability to tackle various tasks on the platform, through standardised methods/kit, without depending on external resources
- Collaboration: Facilitate joint work on a project
- Skills gap: A platform whose skill stacks cover the major roles (from data analytics to data science to machine learning engineering)
- PII / Sensitive Data controls: Meet the data governance and security requirements
As a result, we reviewed and engaged several vendors to explore market offerings and find a suitable partner to help us deliver the new platform.
We did extensive research on the vendors listed in Gartner's latest Magic Quadrant for Data Science and ML Platforms and added some others we had interacted with over the last couple of years.
Some of the leaders in the magic quadrant were ruled out due to:
- Op model: Proprietary software/licensing with high pricing, vendor lock-in and/or skew to on-premises infrastructure that would be inflexible to changes in our internal infrastructure.
- Performance on key capabilities such as collaboration, advanced analytics & ML Ops and scalability.
Finally, we landed on two options to run a POC:
- AWS SageMaker: Already available, as CTM uses AWS as its cloud provider.
- Databricks: For the list of features it provides.
The POC ran for one month, during which we developed and tested the functionality available on each platform. This is the topic of the next post in the series, where we will see how we organised the POC to ensure an objective result and make an informed decision on the way forward.
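One common way to keep a platform comparison objective is a weighted scoring matrix: score each candidate against each requirement, weight the requirements by importance, and compare the totals. The sketch below is a minimal, hypothetical illustration of that idea only — the weights, platform names, and scores are invented for the example and are not our actual evaluation figures.

```python
# Hypothetical weighted scoring matrix for comparing candidate platforms
# against a set of requirements. All numbers below are illustrative.

# Requirement -> weight (relative importance, summing to 1.0)
weights = {
    "scalability": 0.25,
    "end_to_end": 0.20,
    "collaboration": 0.20,
    "skills_coverage": 0.15,
    "data_controls": 0.20,
}

# Platform -> requirement -> score out of 5 (made-up numbers)
scores = {
    "Platform A": {"scalability": 4, "end_to_end": 3, "collaboration": 4,
                   "skills_coverage": 3, "data_controls": 5},
    "Platform B": {"scalability": 5, "end_to_end": 4, "collaboration": 5,
                   "skills_coverage": 4, "data_controls": 4},
}

def weighted_score(platform_scores, weights):
    """Weighted sum of a platform's per-requirement scores."""
    return sum(weights[req] * score for req, score in platform_scores.items())

for name, s in scores.items():
    print(f"{name}: {weighted_score(s, weights):.2f}")
```

Keeping the weights agreed up front, before anyone scores the candidates, is what makes the result defensible rather than a rationalisation of a favourite.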