How DevOps allows you to move the elephant in the room

Vijayboopathy Elangovan
Published in Quaero CDP
8 min read · Aug 26, 2020

Customer Data Platform and DevOps

“The most powerful tool we have as developers is automation.”

— Scott Hanselman

The first time I heard the term CDP, I had no idea what it meant. In case you don’t either, it’s an acronym for “customer data platform”. I’ve since come to understand that it is the technology revolutionizing marketing. As marketing teams see the advantage of “data-driven” marketing, acquiring, storing, validating and processing data become vital for any organization. While data remains at the heart of the solution, the Quaero CDP also provides its end users a rich UI and a set of APIs to curate, signalize, visualize and activate the data. As you can imagine, that requires a host of other services to enable such a paradigm. This evolving list currently includes Nginx web servers, ASP.NET Core APIs, Spring Boot APIs, ELK, Looker, Jupyter, Jenkins and Druid. That’s a total of more than 10 services per installation. There is usually more than one installation for every client, and these services are distributed across Kubernetes (K8s), SQL Server and EMR.

“Micro” services, you say? The “elephant size” of the problem becomes evident in the actual implementation of the CDP. Quaero differentiates itself with a best-in-class private CDP approach: we take the software to the data, not the other way round. This means all the components that make the platform tick need to be transplanted into a customer’s on-premise data center and/or their virtual private cloud. To add to the complexity, open-source tools (K8s, Airflow, Prometheus and Grafana, the ELK stack, Jenkins) ship with their own set of problems, including the operational overhead of maintaining them. The advantages of using such platforms are considerable: cost benefits, community development, white-labeling and so on. But when you look at the operational complexity, the CDP ends up being an elephant stuck in a room. DevOps helps reduce that complexity.

When the journey began…

Cloud helps:

“It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change.”

— Charles Darwin

Quaero has been providing services in the CDP sector for the last two decades, and over that time we adopted traditional software development approaches. We hosted “Quaero AdVantage” (the precursor to our CDP) in our own data center, where we did mostly manual deployments and had no monitoring in place; we had to keep one engineer available 24x7 in case something went wrong with production workloads. In 2017, we realized that our monolithic technology stack was holding us back from delivering the value that marketers seek. Hence the development of our current CDP, Quaero 3.0, designed with scalability and agility as core principles. We knew that if we continued to serve from the data center, it would be difficult to manage the data while fulfilling customers’ need to keep the data inside their own infrastructure. Cloud service providers, such as AWS and Azure, were proving to be significantly better than traditional data centers in this regard.

Using the cloud also had other advantages. Cloud was significantly cheaper than procuring hardware, OS licenses and databases separately. It also made us more hardware-agnostic: we no longer had just one place to deploy our product. By choosing AWS as our primary cloud service provider (although we can deploy on Azure and GCP as well), we had solved one piece of the puzzle, but we still had to figure out how we were going to deploy.

Ashwin Nayak, our VP of Engineering, recently wrote a great blog on how we, quite literally, pulled the plug on the last server in our data center.

Kubernetes and Containerization:

Based on our design principles, most of our application’s services ran independently. But we were deploying these services natively on individual virtual machines (VMs), so scaling the application meant spinning up multiple VMs that did not utilize cloud resources efficiently. The other problem we had to solve was service discovery. Since our micro-services ran on a fleet of machines, we had to manually map how the frontend could talk to the backend. This was painful to maintain as our infrastructure grew, and the method was error-prone.

So we needed an orchestration tool that could take care of the following:

  1. Scheduling
  2. Service discovery
  3. Resource limiting
  4. Autoscaling
  5. Auto-healing/Self-healing
  6. Monitoring
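Most of these concerns map directly onto Kubernetes primitives. A minimal sketch of what such a deployment might look like — the service name, image and values below are illustrative assumptions, not our actual manifests:

```yaml
# Hypothetical Deployment for one backend service. Replica count covers
# scheduling, the resources stanza covers resource limiting, and the
# liveness probe covers self-healing (failed containers are restarted).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-api
spec:
  replicas: 2                      # scheduler keeps two instances running
  selector:
    matchLabels:
      app: backend-api
  template:
    metadata:
      labels:
        app: backend-api
    spec:
      containers:
        - name: backend-api
          image: registry.example.com/backend-api:1.0
          resources:               # resource limiting per container
            requests:
              cpu: 250m
              memory: 256Mi
            limits:
              cpu: "1"
              memory: 512Mi
          livenessProbe:           # self-healing: restart on failure
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 10
```

Autoscaling would then be layered on top with a HorizontalPodAutoscaler targeting this Deployment.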

Kubernetes was an up-and-coming technology back then, not as widely adopted as it is today. But it fit nicely with our application stack (thanks to the micro-services architecture) and solved the issues mentioned above. Quaero was one of the early adopters of AWS Elastic Kubernetes Service. Our UI component could reach the backend APIs and other services simply by referring to their internal cluster “Service Name”. The Kubernetes scheduler ensured that at least one instance of each of our core services was always available. Even if a pod crashed due to a network or configuration issue, the scheduler gracefully removed it from service and automatically restarted it.
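The “Service Name” discovery mentioned above comes from Kubernetes Services and cluster DNS. A minimal sketch, assuming a backend Deployment labeled `app: backend-api` (a hypothetical name):

```yaml
# A ClusterIP Service gives the backend pods a stable in-cluster name.
apiVersion: v1
kind: Service
metadata:
  name: backend-api
spec:
  selector:
    app: backend-api       # routes to any pod carrying this label
  ports:
    - port: 80             # port the frontend calls
      targetPort: 8080     # port the container listens on
```

With this in place, a frontend pod in the same namespace can simply call `http://backend-api`, or use the fully qualified `backend-api.<namespace>.svc.cluster.local`, with no manual endpoint mapping.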

When issues arise in our application deployments, as they inevitably do, we need a way to get to the heart of the problem. Our platform drives daily marketing operations for some of the largest enterprises in the world. Downtime is, quite simply, not an option. We use a Prometheus and Grafana stack to monitor all our services. Grafana has an inbuilt alerting mechanism that notifies us when a service goes down or when any node is used above a specified threshold. We have integrated our support Slack channels with Grafana, so it pushes a periodic summary of service availability with useful snippets of graphs and performance statistics. These Grafana dashboards have proven to be our go-to tool whenever we start to debug an issue.
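A “service down” alert of this kind can live either in Grafana or on the Prometheus side. As one hedged sketch, a Prometheus alerting rule over the built-in `up` metric (the job name here is hypothetical):

```yaml
# Fires when a scrape target has been unreachable for two minutes;
# the resulting alert can then be routed to Slack.
groups:
  - name: service-availability
    rules:
      - alert: ServiceDown
        expr: up{job="backend-api"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} has been down for 2 minutes"
```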

Cross platform networking:

Even though we had most of our services running as containers, our data layer (metadata and the customer data itself) couldn’t be integrated into Kubernetes. We had been using managed Hadoop clusters (Cloudera) and MS SQL Server prior to Kubernetes and did not want to make many changes to the data stack. So we had to make sure that our backend APIs running on Kubernetes could reach our data stores and vice versa. We figured the easiest and most efficient way to achieve this was to use Nginx Ingress. The Ingress service is exposed as an internal Network Load Balancer for our backend services, which acts as a gateway for the data processing services to consume. We have reduced the manual effort of setting up these compute and networking components by automating both provisioning and configuration of the infrastructure.
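Exposing the ingress controller as an internal NLB on AWS comes down to a couple of Service annotations. A minimal sketch, assuming an ingress-nginx controller and a hypothetical backend service and hostname:

```yaml
# Front the nginx ingress controller with an *internal* AWS NLB,
# so services outside the cluster (EMR, SQL Server) can reach it
# without traffic leaving the private network.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - port: 443
      targetPort: https
---
# Route a private hostname through the ingress to a backend API.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: backend-api
spec:
  ingressClassName: nginx
  rules:
    - host: api.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: backend-api
                port:
                  number: 80
```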

Reproducing the infrastructure:

Having solved the orchestration part of our application, let us turn to the infrastructure on which the application is hosted. As mentioned earlier, we have three important infrastructure components.

  1. Elastic Kubernetes Service (for the application microservices)
  2. MS SQL Server (for maintaining metadata about the data)
  3. Elastic MapReduce (for crunching the actual customer data)

We followed an SOP (a playbook with step-by-step instructions) for setting up our infrastructure. Once the infrastructure was provisioned, we followed another playbook for configuring the different services. This method of provisioning and configuration seemed to work in the early stages. But as our clientele (along with our team) grew, these operations became mundane, and it was hard to train engineers to follow SOPs with hundreds of steps that had to be performed without any mistakes. If someone missed a step or made a typo in some configuration, we spent hours debugging.

That’s when we decided to use Ansible with Terraform to automate our infrastructure provisioning and configuration. Since both tools are idempotent, it did not matter if something went wrong during installation; all we had to do was rerun the Ansible playbook to converge the infrastructure back to the desired state. This made documenting our standard operating procedures (SOPs) easy as well. Ansible uses YAML (Terraform has its own domain-specific language, HCL), which is far easier to understand than puzzling out complex bash/PowerShell scripts. To quote Google’s Site Reliability Engineering book,
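The idempotency that makes reruns safe comes from describing desired state rather than steps. A minimal, hypothetical Ansible playbook in that spirit (host group, package and file names are illustrative):

```yaml
# Each task states what should be true, not what to do, so rerunning
# the playbook after a partial failure is safe: tasks whose state
# already matches simply report "ok" and change nothing.
- hosts: web_servers
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure the nginx config is in place
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: Restart nginx

  handlers:
    - name: Restart nginx
      ansible.builtin.service:
        name: nginx
        state: restarted
```

The handler runs only when the template task actually changes the file, which is the same property that prevents the configuration drift discussed below.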

“Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload. Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.”

— Google

The other big advantage of using Ansible/Terraform is that we no longer had to maintain snowflake servers. Every operations engineer makes changes through Ansible, so no one needs to log in to the servers directly. This has prevented the “configuration drift” that would have occurred had we not used these tools.

Moving the elephant: the ability to bring the CDP to where the data is, not the other way around

“If you think it’s expensive to hire a professional, wait until you hire an amateur.”

— Red Adair

As you can probably see by now, one of Quaero’s biggest differentiators from the rest of the market, i.e., being a private CDP, was achieved by adopting DevOps philosophies. To this day, the ability to bring our technology to the customer remains our core strength, and while other CDPs have now started to look at this as a new delivery method, we continue to be leaders here. There aren’t many vendors in the market that can install their software in a customer’s infrastructure without pushing the customer’s data beyond its boundaries.

“The secret of change is to focus all your energy not on fighting the old but on building the new”.

References:

  1. quaero.com
  2. https://landing.google.com/sre/
  3. https://www.cdpinstitute.org/
  4. https://www.adweek.com/programmatic/with-digital-identity-in-flux-neustar-wants-to-fill-the-gaps-of-first-party-data/
  5. https://brandequity.economictimes.indiatimes.com/news/media/zee5-partners-with-quaero-to-organise-its-customer-data/71151003
  6. https://www.cmswire.com/cms/customer-experience/quaero-reinvents-itself-again-after-buyout-024647.php
  7. https://www.exchange4media.com/marketing-news/martech-mumbai-demystifying-customer-data-platform-buyer-vendor-perspective-101069.html
