Our Digital Transformation towards a Successful Big Data Platform
At Kamstrup, we create smart metering solutions for energy, heat, and water. From intelligent meters and remote reading systems to analytics and services, we deliver high-quality solutions for utilities, property managers, and other companies that depend on reliable consumption data. Our motto is “You can’t optimize what you can’t measure”, and the optimizing part is where I, as Chief Architect, and my teammates in the Analytics and Data Science department come in. Our goal is to create and deliver the next generation of SaaS products for data analysis in the utility sector.
In this blog post, I will tell you how we ended up with a successful big data platform.
The final result: a 10,000-foot platform overview
Let’s start with a simple overview of where our platform ended up.
Data from devices (power, water, and heat meters, pumps, et al.) are normally collected by Head-End systems using cellular towers, radio mesh networks, NB-IoT, SIGFOX, and more. A device produces from 1 to more than 15 readings (unique registers) at a frequency that ranges from every 15 minutes to a simple drive-by read once a month. Each customer has from a few thousand to more than a million devices.
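To put those device counts and frequencies into perspective, here is a back-of-the-envelope calculation using the upper ends of the ranges above (the numbers are illustrative, not exact production figures):

```python
# Back-of-the-envelope data volume for a single large customer,
# using the upper ends of the ranges mentioned above.
devices = 1_000_000             # one large customer
registers_per_device = 15       # unique registers per device
readings_per_register = 24 * 4  # one reading every 15 minutes, per day

values_per_day = devices * registers_per_device * readings_per_register
print(f"{values_per_day:,} register values per day")  # 1,440,000,000
```

That is why we talk about a big data platform rather than just another database.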
Our analytics platform receives these device readings from multiple sources: several Kamstrup Head-End systems and various third-party customer SCADA systems. The data are ingested through three primary endpoints: FTP, a Web API, and Azure Data Factory.
Once the data has been ingested, we split it into two lanes: one for device information and one for the raw device readings.
The device information is stored and used by internal services to process and present the (raw) device readings. The raw device readings are used directly by the power products, but are periodized before use by the water and heat products to ease analytics.
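To give a rough idea of what periodization means here, the sketch below aligns raw, irregularly timed readings to fixed hourly periods. It is a minimal pandas example with made-up device and column names, not our actual periodization service:

```python
import pandas as pd

# Raw readings arrive with irregular timestamps; names and values are made up.
raw = pd.DataFrame(
    {
        "device_id": ["meter-1"] * 4,
        "timestamp": pd.to_datetime(
            ["2024-01-01 00:07", "2024-01-01 01:02",
             "2024-01-01 02:13", "2024-01-01 03:01"]
        ),
        "volume_m3": [100.0, 100.8, 101.9, 102.4],  # cumulative register value
    }
)

# Periodize: take the last cumulative value in each hour per device and
# compute the consumption within each period.
periodized = (
    raw.set_index("timestamp")
    .groupby("device_id")["volume_m3"]
    .resample("1h")
    .last()
    .groupby(level="device_id")
    .diff()
    .rename("consumption_m3")
    .reset_index()
)
print(periodized)
```

A fixed, predictable period per reading is what makes the downstream analytics for the water and heat products so much simpler.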
Each product, like Power Intelligence and Leak Detector, or overall domain service, like the handling of Devices and Customers, is treated as a Bounded Context, see DDD Bounded Context. We further enforce this concept by creating an Edge, using the BFF (Backend for Frontend) pattern, so a specific product can retrieve and transform raw data from internal services to suit its specialized frontend. The internal data can be simple raw data for graphs, alarms, or models from machine learning, and the Edge allows the user to see, understand, analyze, and interpret their original meter data in a single context.
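To make the Edge/BFF idea concrete, here is a minimal sketch of what such an edge endpoint could look like. It uses FastAPI and hypothetical stand-ins for the internal service calls purely for illustration; it is not our actual edge code:

```python
from fastapi import FastAPI

app = FastAPI()

# Hypothetical stand-ins for calls to the internal device-information
# and readings services, each living in its own bounded context.
async def fetch_device(device_id: str) -> dict:
    return {"id": device_id, "type": "heat-meter"}

async def fetch_periodized_readings(device_id: str) -> list[dict]:
    return [{"period": "2024-01-01T00:00Z", "energy_kwh": 1.2}]

@app.get("/leak-detector/devices/{device_id}/graph")
async def device_graph(device_id: str) -> dict:
    """Edge endpoint for one product: combine internal data and shape it
    exactly as that product's frontend needs it."""
    device = await fetch_device(device_id)
    readings = await fetch_periodized_readings(device_id)
    return {
        "deviceName": device["id"],
        "series": [[r["period"], r["energy_kwh"]] for r in readings],
    }
```

The important part is the direction of the dependency: the edge knows about the product’s frontend, while the internal services stay inside their own bounded contexts.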
Now that was the technical stuff; let’s look at our journey to get there.
This is where our journey begins
Our Analytics team was formed in mid-2017 for the same reasons a lot of other companies form such teams: a digital transformation toward a more service-based offering of products to our customers. The team was, and still is, part of the UX department, which means that we strive to have a strong understanding of our customers and interact with them regularly to perform product discovery and validate solutions. A lot of the early work was done using Sprint, and we still run design sprints this way for major features. We embrace the virtues of a minimum viable product, shipping incremental value, validating ideas, and iterating. Products should not be “completed” in half-year iterations; they should grow and evolve every single week, so we can deliver the right thing at the right time! We want to run the full Build, Measure, Learn loop and test a hypothesis in days instead of months (or years).
Next steps moving forward
Our journey forward involved the usual modernizing of product offerings and moving to the cloud. We did not want to do it like the “Move Fast and Break Things” Facebook motto, but rather move slowly forward and circumvent any major issues if we could. An example of this is one of the reasons for our move to the cloud. Like at many other companies, development teams commonly faced month-long delivery times for servers and SQL databases, so our move to the cloud was, more or less, skunkworks sanctioned by the head of development. Instead of waiting for a companywide decision to move products and development teams to the cloud, our team took a sanctioned bottom-up approach, started the journey, and circumvented the innate immune-system response to change that many IT departments have. After circumventing problems like these, we would start working with the departments to find permanent solutions, like hiring a full-time DevOps member of the team to liaise with IT, starting an Azure Efra group, and launching a companywide cloud governance initiative.
A core principle we have had from the very beginning is the virtue of the Minimum Viable Product (MVP) for the frontend, the platform, and the products as a whole. In many cases, we have succeeded and have, for example, created minimal product frontends and core platform services, like the handling of Customers and the periodization of meter values, that, with small incremental improvements, have been working since day one.
An example of the opposite is the current rewrite of the Device handling service, which was originally built to be academically correct with regard to the database and its entity-relationship model, and then had its model and API mistreated and broken by small incremental “fixes”. In the end, it improved neither performance, readability, nor maintainability :)
When it comes to big data platforms, you cannot simply buy one. Building it right the first time is expensive AND time-consuming. You can do back-of-the-napkin calculations, but in the end, you will end up rewriting something.
You can find some nice stories about creating simple solutions and rewriting software every few months in “You Are Not Google” (or Facebook, or Amazon…) by Oz Nova, and in the Techtopia #52 podcast episode “Hvad laver Uber i Aarhus” (“What is Uber doing in Aarhus?”). The podcast tells how Uber in Aarhus started and moved forward:
“In engineering, we do not have a test department, we do not have an operations department, only a department doing development (and test and operations)”
“It gives a focus on quality from the very beginning”
“We do not want to achieve 0 errors in production, but the ones we have should have a minimal impact on customers (and can be reverted)”, all while having “30% growth in traffic every month”.
(Almost) no hiccups in Operations and Availability
Our core platform and the products themselves were heavily inspired by the old, but still very good, Twelve-Factor App article and the Azure Application Architecture Guide. The focus on having separate environments and keeping them as similar as possible, using infrastructure as code and CI/CD pipelines, was the basis for a stable platform and kept us in line with Kamstrup’s ISO 27001 certification.
What really allowed us to keep the platform stable in the beginning was our focus on centralized logging, dashboards, and a small number of dedicated people who had truly found their calling in DevOps. Later on, a focus on concrete numbers for Service Level Objectives (SLOs) and Service Level Indicators (SLIs), treating every small operational hiccup that might have impacted customers as an incident, and performing lessons learned improved our platform further.
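As a simple illustration of what “numbers for SLOs and SLIs” can look like, the sketch below computes an availability SLI from request counts and compares it to an SLO target; the numbers and thresholds are made up:

```python
# Minimal SLI/SLO check; in practice the counts come from centralized
# logging and metrics, and the thresholds from our SLO definitions.
total_requests = 1_200_000   # hypothetical rolling 30-day window
failed_requests = 380

availability_sli = 1 - failed_requests / total_requests
availability_slo = 0.999     # example target: 99.9% of requests succeed

error_budget = 1 - availability_slo
budget_used = (failed_requests / total_requests) / error_budget

print(f"SLI: {availability_sli:.4%}  (SLO: {availability_slo:.1%})")
print(f"Error budget used: {budget_used:.0%}")
if availability_sli < availability_slo:
    print("SLO breached: treat it as an incident and run a lessons-learned.")
```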
Inspiration for this journey can be found in Google’s Site Reliability Engineering, especially the chapters on Service Level Objectives, Managing Incidents, and Postmortems found in the Site Reliability Engineering book at SRE books. But in the end, no documentation, metric, alert, or SLI can help you when you forget the checklists and promote too fast between environments!
An early dead end and its solution
Our biggest dead end was probably our Cassandra cluster, which we eventually ditched for Cosmos DB. Without a full-time Linux and/or Cassandra expert, we simply did not have the manpower to do development, operate the Linux servers, and optimize the Cassandra installation all at once. Cassandra served us well, so we were a bit sad to leave it in the ditch and change to a fully managed alternative.
Yes, Cosmos DB can be expensive, but it is a beast when you understand document design, partitioning of data, and data access paths, and have a way to offload data using TTL. Also, do not be afraid of the 429 ghosts during daily workloads, and remember to increase Request Units for scheduled daily, weekly, or monthly operations.
When you combine Cosmos DB with Blob Storage and partition your data as you need it, for either graphs or cross-customer analysis, you end up with a managed, cost-effective, and almost simple cog in the overall platform.
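A minimal sketch of that kind of container design, using the azure-cosmos Python SDK: a partition key per device so graph queries stay within a single partition, a default TTL so old readings age out of Cosmos DB (and live on in cheaper Blob Storage), and a temporary Request Unit bump before a scheduled heavy job. The account, names, and values are placeholders, not our production setup:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key; real values come from configuration.
client = CosmosClient("https://my-account.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists("metering")

# Partition by device and let readings expire after 90 days.
readings = db.create_container_if_not_exists(
    id="raw-readings",
    partition_key=PartitionKey(path="/deviceId"),
    default_ttl=90 * 24 * 3600,
)

readings.upsert_item(
    {
        "id": "meter-1|2024-01-01T00:00Z",
        "deviceId": "meter-1",
        "timestamp": "2024-01-01T00:00Z",
        "volume_m3": 100.0,
    }
)

# Before a scheduled monthly job, raise the Request Units instead of
# fighting 429s; scale back down again when the job is done.
readings.replace_throughput(4000)
```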
Core platform performance with Managed services
The change to Cosmos DB also increased our focus on managed services as the basis for our products, and on using platform as a service (PaaS) rather than infrastructure as a service (IaaS) or Kamstrup on-premises infrastructure. The rationale was that we would always have patched servers, that we could most likely scale the resources if needed, and that we could do deployments more easily by having infrastructure as code. This was also one of the reasons why, almost 3 years ago, we selected “managed” Event Hubs and Service Fabric as an application platform, and then along the way found alternatives in Azure Functions and similar Azure managed services.
Our focus on managed services allows us to (auto) scale horizontally, adding or removing instances as demand requires, e.g. deploying a Service Fabric application on more than one node, adding new nodes, or using an Azure managed service that supports auto-scaling, like Azure Functions, Cosmos DB, or the plain boring Azure Storage Accounts. Because, as we know, choosing boring technology can be effective, albeit a seed for endless discussions because we are not using this week’s latest and greatest framework. I like simple and boring stuff because it allows us to be a bit creative with features, as we do not have to update pipelines, documentation, operations, and monitoring every single day. Or as one of my team members recently said to me:
“Honestly I prefer simple and scalable frameworks so I can add the complexity by myself”!
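In that spirit, scaling horizontally with a managed service can be as simple as running more copies of the same consumer. The sketch below uses the azure-eventhub SDK with a blob checkpoint store, so Event Hubs balances partitions across however many identical instances we deploy; the connection strings and names are placeholders:

```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# Placeholders; real values come from configuration.
EVENTHUB_CONN = "<event-hub-connection-string>"
STORAGE_CONN = "<storage-connection-string>"

def on_event(partition_context, event):
    # Process one device reading, then checkpoint so another instance can
    # safely take over this partition if we scale in or an instance dies.
    print(partition_context.partition_id, event.body_as_str())
    partition_context.update_checkpoint(event)

checkpoint_store = BlobCheckpointStore.from_connection_string(
    STORAGE_CONN, container_name="eventhub-checkpoints"
)
consumer = EventHubConsumerClient.from_connection_string(
    EVENTHUB_CONN,
    consumer_group="$Default",
    eventhub_name="device-readings",
    checkpoint_store=checkpoint_store,
)

# Running N copies of this process spreads the partitions across all of them.
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
```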
What to do next? — Well, start feeding the animal!
Since we started our journey, almost all other departments at Kamstrup have learned a lot about the cloud, MVP-based development, and DevOps. We now have much better cooperation with IT, and based on the centralized logging and monitoring we did from the very beginning, they, almost, have primary responsibility for daily operations and alerts. Our focus for the platform is therefore shifting toward continuously learning from incidents and removing toil, either through automation or through customer self-service. We should not accept a “we used to”-mentality, but instead keep being a front runner on cloud-based products.
And well, then there is the small elephant in the room: migrating away from Service Fabric! We have started watering and feeding the animal, so it can be lured away from our large stack of china in the very near future :)
By Jesper Færgemann, jfm@kamstrup.com
Chief Architect in the Analytics and Data Science department, Kamstrup
About me
I work as Chief Architect in Kamstrup’s Analytics and Data Science department, where we, based on Google’s Sprint process, create and deliver the next generation of SaaS products for data analysis in the utility sector. The products are the first cloud-first products at Kamstrup, and we are data-driven both in operations and in evaluating the usage of our products, in order to ensure fast feedback to both development and business teams. Our operations are heavily inspired by the principles from Google’s Site Reliability Engineering and Accelerate.