Democratising Analytics

Krzysztof Adamski
inganalytics.com/inganalytics
7 min read · Jul 28, 2021

Our mission with ING’s Data Analytics Platform is to make data easily accessible within the company in order to speed up experimentation.

While developing DAP, our ultimate goal, and at the same time our challenge, is to ensure that everyone has an equal right to access data. Data tends to be stored in silos, which makes access difficult for teams not directly involved with a particular domain. Although this protects the data from abuse by unauthorised users, it does not spark innovation, and the goal of having all datasets available in a single place is hard to achieve. With proper data governance we make sure data owners can trust that the central platform implements the necessary controls. Empowered with tools and data, our teams can bootstrap new analytics initiatives or improve existing ones for the benefit of our customers. This is what we call analytics democratization.

Mariusz shared with you our architecture for data discovery. This is our central place to get insights on what kind of data is already available on the platform.

Agata stressed the importance of three processes we maintain (data profiling, data quality and data stability) so that our platform users can trust the data we provide.

We have also highlighted the self-service aspect of the platform, and this post describes how a DAP project journey starts.

Starting a data analytics initiative requires a substantial upfront investment in infrastructure or, if you choose a public cloud, conscious decisions about which products to use, so that you can exploit the full potential of big datasets while still getting a return on investment. Nowadays we clearly see a trend towards cloud-native open source technology. It keeps the solutions we build portable between different environments. What is more, it allows us to scale as our data volumes grow and to accumulate all the data while making us less dependent on licensing costs.

Data Infrastructure — one step closer to analytics democratisation

While companies’ infrastructure is nowadays well set up to support business growth, there is still a gap in making analytics products accessible to end users. This gap is being filled by the platform economy. The journey starts with a focus on developer productivity, making sure basic tools and templates are accessible to developers.

With DAP our idea was to standardize the process of starting a data analytics project by providing enough freedom for projects to start analyzing data fast, while limiting the cognitive load of choosing the right tools for the job.

In DAP we treat data as a product that is as critical as core IT infrastructure. Data is the beating heart of the organisation, and all solutions should be built around it while keeping it easily manageable and accessible to everyone. Having fast and easy access to data and deriving the necessary insights from it can not only improve business productivity and reduce costs, but also enhance collaboration, interoperability and efficiency of analytics projects at a global scale.

DAP’s architecture is inspired by the paradigm shift towards modern distributed architecture, often described as “data mesh” architecture. This means we want to break down the silos of centralized, highly complex architectures that make operations and data access possible for skilled engineers only. Our goal is to create a net of distributed data projects, owned and managed by cross-functional teams, making data and the services around it available to many different domain experts, from product owners to analysts, business managers and more.

Platform infrastructure foundation

Every platform needs to be built on a solid foundation, starting with highly available, persistent storage. We didn’t want to experiment much with storage. We treat it as a commodity, so we decided to go with a solution that lets us access and manage the data it stores over an S3-compliant interface. This choice also prepares us for a future move to the cloud.

From the beginning, a few years ago, we knew we were going to be a container-based platform so that we could provide a lightweight isolation layer for different projects. At that time it wasn’t yet clear which technology would win. Some high-tech companies went with their own container orchestration software, and the rest of us were left choosing between Mesos and Kubernetes. We chose the latter, which turned out to be the right move, as Kubernetes became a standard commodity in any public cloud offering.

Introducing project abstraction

As we are not in the business of providing infrastructure, we decided to create an abstraction of the project that would combine the basic primitives of infrastructure and be served to our users via a self-service portal.

For every DAP user the journey starts by spinning up their own containerized desktop based on xrdp, which serves as the single entry point through our secure perimeter. This ensures the necessary control of the endpoints.

DAP is ready to be used from this point, and users can play in their sandboxed environment with a fair share of resources, along with other resources currently available on the platform. The onboarding process is supported by self-paced training consisting of example JupyterLab notebooks, which users can execute step by step. This part is inspired by Spotify's Golden Path, although there is still much room for improvement.

The sandboxed environment gives you no access to ING data so you have to bring your own example datasets or use some of the publicly available sets we have on DAP. Eventually the appetite grows and regular users are ready to use their knowledge and skills with real ING data.

DAP users can only get access to the available data sources once they have created a project that is associated with a business purpose. It all starts with just a few clicks.
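As a rough illustration of the rule above, here is a minimal Python sketch. The `Project` class, its fields and the `can_access_data` function are hypothetical names invented for this example. They are not Genesis's actual data model, which this post does not describe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Project:
    """Hypothetical project record; field names are assumptions for illustration."""
    name: str
    business_purpose: Optional[str] = None  # None models a sandbox with no purpose registered


def can_access_data(project: Project) -> bool:
    """Allow access to real data sources only when a non-empty business purpose is set."""
    return bool(project.business_purpose and project.business_purpose.strip())


sandbox = Project(name="my-sandbox")
churn = Project(name="churn-model", business_purpose="Reduce customer churn")

print(can_access_data(sandbox))  # False
print(can_access_data(churn))    # True
```

In a real platform the check would presumably sit in the data access workflow rather than in a single function, but the gate itself is this simple: no business purpose, no ING data.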

Genesis — your single place to manage data analytics projects

The product that curates your analytics projects’ lifecycle end to end, from onboarding to data access, is called Genesis. Genesis aims to support the whole lifecycle of a project: its creation, periodic membership reviews, data retention policy and data access workflow management.
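To make two of these lifecycle checks concrete, here is a minimal Python sketch of a periodic membership review and a retention check. The `ProjectLifecycle` class and both intervals (90 and 365 days) are assumptions made for illustration; the post does not state the actual policy values or how Genesis models them.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ProjectLifecycle:
    """Hypothetical lifecycle record; the intervals are assumed values, not ING policy."""
    created: date
    last_membership_review: date
    review_interval: timedelta = timedelta(days=90)    # assumption
    retention_period: timedelta = timedelta(days=365)  # assumption

    def membership_review_due(self, today: date) -> bool:
        """True once the review interval has elapsed since the last review."""
        return today - self.last_membership_review >= self.review_interval

    def retention_expired(self, today: date) -> bool:
        """True once the project has outlived its retention period."""
        return today - self.created >= self.retention_period


lifecycle = ProjectLifecycle(
    created=date(2021, 1, 4),
    last_membership_review=date(2021, 3, 1),
)
print(lifecycle.membership_review_due(date(2021, 7, 28)))  # True
print(lifecycle.retention_expired(date(2021, 7, 28)))      # False
```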

Genesis orchestrates the various infrastructure components, as shown in the figure below, to make sure all the resources are ready for the project. What is more, Genesis is easily extensible to manage different types of resources as we grow with the platform. To support a new resource type, a new client is plugged into the core project. This gives us the flexibility we need: we can decide which particular resources we want to support as a platform and even whether particular projects need a separate tier.

As a result, teams that reach a certain level of maturity and are looking for more control can decide whether they want to manage, for example, their own scheduling service. We started with object storage support (provisioning S3 buckets) and a shared project database registered in the Hive metastore, and we are now extending the scope towards secrets engines in HashiCorp Vault, Kubernetes namespaces, a project scheduler, a model registry based on MLflow and more.
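The plug-in idea described above can be sketched in a few lines of Python. Everything here is hypothetical: `ResourceClient`, `ProjectCore`, the two stub clients and the naming schemes are invented for illustration, and the stubs only build identifiers instead of calling real storage or Kubernetes APIs. It is one possible shape for such a design, not a description of how Genesis is implemented.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ResourceClient(ABC):
    """Hypothetical interface that each resource type would implement."""

    @abstractmethod
    def provision(self, project: str) -> str:
        """Create the resource for a project and return an identifier for it."""


class BucketClient(ResourceClient):
    def provision(self, project: str) -> str:
        # A real client would call the object store; this stub only builds a name.
        return f"s3://{project}-data"


class NamespaceClient(ResourceClient):
    def provision(self, project: str) -> str:
        # Stub standing in for a call to the container orchestrator.
        return f"namespace/{project}"


class ProjectCore:
    """Core that knows nothing about specific resource types."""

    def __init__(self) -> None:
        self._clients: Dict[str, ResourceClient] = {}

    def register(self, resource_type: str, client: ResourceClient) -> None:
        """Plug a new resource type into the core."""
        self._clients[resource_type] = client

    def create_project(self, project: str, resource_types: List[str]) -> Dict[str, str]:
        # Provision only the requested types, which leaves room for per-project tiers.
        return {rt: self._clients[rt].provision(project) for rt in resource_types}


core = ProjectCore()
core.register("bucket", BucketClient())
core.register("namespace", NamespaceClient())
print(core.create_project("churn-model", ["bucket", "namespace"]))
```

With this shape, supporting a new resource type means writing one more `ResourceClient` subclass and registering it; the core and the existing clients stay untouched.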

Single users and teams are not restricted and have all the flexibility they need to execute the project at hand. Having the option to “customize” analytics resources to each user’s project needs is another step towards analytics democratization.

Into the clouds

Genesis can also play an important role in a hybrid cloud environment by providing the ability to promote certain projects to the cloud, creating the necessary resources in the cloud of choice and making them accessible without our users having to touch the cloud consoles. This improves the speed and reliability of the services DAP provides while offering users a seamless experience.

Genesis is now fully in production in DAP, managing over 200 projects. When we started, we aimed to create a tool that would empower our users to be in control of their projects. The main metric we track is the number of requests handled by the DAP Infrastructure team versus the number users handle themselves. We are happy to see that over 90% of the changes to projects are now made by DAP users, with more than 100 changes per month.
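The metric itself is a simple ratio. A minimal Python sketch, where the function name `self_service_ratio` and the input numbers are made up for illustration and are not ING's actual figures:

```python
def self_service_ratio(user_changes: int, infra_team_changes: int) -> float:
    """Share of project changes made by users themselves rather than the infra team."""
    total = user_changes + infra_team_changes
    if total == 0:
        return 0.0  # no changes in the period, so there is nothing to attribute
    return user_changes / total


# Illustrative numbers only: 95 changes by users, 5 by the infrastructure team.
print(f"{self_service_ratio(95, 5):.0%}")  # 95%
```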

We believe that in our journey to democratize analytics we have to make sure the services we build are easy to understand and easy to consume for users who are not necessarily infrastructure engineers. We obviously see the tradeoff between the rich capabilities provided by the cloud consoles and what Genesis can provide today. Yet we strongly believe this abstraction creates a coherent journey and makes sure we carefully choose the services we offer on the platform, taking into account, among other things, data security and operational complexity.

Stay tuned for more stories from our ING Data Analytics Platform team.

Krzysztof Adamski

Data Infrastructure Architect | Infrastructure Lead at ING Data Analytics Platform