Evolving into a Big Data-Driven Business in the Azure Cloud: Part I

Gary Strange
10 min read · Sep 13, 2019


You might be surprised to learn that this article isn’t about machine learning and artificial intelligence. It's about the primitive that comes before those capabilities: the effort and understanding required to build a Data Lake in the Azure Cloud.

The Early Days: The Monolith

In later stories, I will cover Azure Machine Learning, but for now we must start at the very beginning of time, or at least at the beginning of enterprise solutions.

I won’t bore you with yet another description of monolith architecture and the difficulties it presents. But I did want to highlight what I perceive to be a major advantage. In my experience, when an organisation has grown a huge monolith operational system, it’s often accompanied by an equally enormous data analytics function. Both systems have strongly typed schemas storing tabular data: advantage number one. The second advantage is that the teams responsible for developing the operational entities have a responsibility and commitment to making the data available in the data analytics function. Once the data analytics function has grown to a vast size the teams' responsibilities diverge, but the early foundations were built by the same people who had to understand the operational systems and the business data within them. That understanding is then transferred to the analytical schemas. I see both these traits as advantages, advantages that can be lost when developing a Data Lake in the new world of micro-service architecture and polyglot data storage.

People Structure

When organisations want to build a brand new application, one that will make their current system look like a dinosaur, they’ll rightly embark on that adventure with a startup mentality. Acquire a little office on the other side of the road, fill it with their brightest minds and hire in a consultancy to accelerate new tech adoption and deliver the MVP. Whilst this might make a lot of sense when delivering feature-based applications, utilising a stockpile of feedback gathered from customers over an extensive period of time, it doesn’t seem to make a lot of sense when developing a Data Lake. There is no stockpile of customer feedback explicitly directing the feature requirements, no net-promoter-scoring to help guide the priorities. No detailed explanation as to exactly what we hope to achieve! Just an understanding that if we put all the data in one accessible platform, good things will happen, because that's what the tech giants have done.

Building the next generation of an application can be done within the confines of a single domain. Therefore all the business understanding and context can be captured and managed with non-trivial but feasible effort. A Data Lake will ingest data from all corners of the organisation so the domain knowledge is as far spread as the disparate data sources. Expecting one central team to manage all the relationships and capture all the data understanding required is not a feasible approach. It’s a funnel-shaped problem with lots of teams working on systems that are producing data and one team struggling to broker relationships and master the understanding of the business context the data represents.

The source teams also have no incentive to supply accurate, trustworthy accounts of the data their systems produce. So the centralised team quickly becomes its own delivery bottleneck, as it doesn’t have the bandwidth to bring the extensive list of data sources under management. Instead, the source teams need to be responsible for ensuring that their data is ingested into the lake, and they are also responsible for any metadata required to interpret the data they’ve published, treating the data as a ‘product’ they export to customers. In return, the source teams can be early beneficiaries of the Data Lake, with secure access to the data in the lake that originated from their systems, thus creating a reciprocal relationship. You might ask why the source teams should care about having access to their data in the lake. It’s their data; they already have access to it within their domain. But having a replicated copy of their operational data available for analytics is extremely valuable, as it gives the teams the opportunity to derive value and insight from patterns they are yet to discover, without impacting the production workload.

For example, the team might release a feature that they expect will drive new user registrations. Being able to analyse and understand the characteristics of the data before and after the feature release will give the team the ability to measure success. Moreover, having engineering teams working closely with the data in the lake makes them the experts on the data in its new form and location. Any queries coming from other parts of the enterprise concerning their data in the lake can be serviced by the individuals who know that data best.
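To make that concrete, here’s a minimal sketch of the kind of before-and-after analysis a source team could run against its own replicated data in the lake. The lake path, column names and release date are assumptions of mine for illustration; I’m using PySpark, but any lake-capable engine would do.

```python
# A minimal sketch, assuming new-account registrations have already been
# landed in the lake as parquet. All paths, columns and dates are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

registrations = spark.read.parquet(
    "abfss://lake@mydatalake.dfs.core.windows.net/raw/accounts/registrations/")

# Daily registration counts, derived from the event timestamp.
daily = (registrations
         .withColumn("day", F.to_date("registeredAt"))
         .groupBy("day")
         .count())

release_date = "2019-09-01"  # hypothetical feature release date

before = daily.filter(F.col("day") < release_date).agg(F.avg("count")).first()[0]
after = daily.filter(F.col("day") >= release_date).agg(F.avg("count")).first()[0]

print(f"Average daily registrations: {before:.0f} before vs {after:.0f} after release")
```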

Many organisations have moved, or are in the process of moving, away from the monolith. With this revolution, there’s been a paradigm shift in operational systems. Architects and engineers are moving away from traditional transactions over stateful data in relational data stores to eventually consistent architectures storing event-based data in non-relational distributed systems like CosmosDB. This new way of storing data comes with many benefits, but it also has some drawbacks.

I consider relational SQL engines to be the Swiss Army Knife of data technologies. They can host highly concurrent OLTP workloads very effectively but can also expose the data they store to ad-hoc queries and analytics. With a relational engine, you can service business transactions and at the same time produce queries and views to understand patterns stored in the data. Imagine you're in the middle of a technical crisis and the CIO needs to know right now how many accounts are impacted by a bug in the system. In a relational system, this would be a fairly trivial, low-impact query. In a NoSQL distributed storage engine like CosmosDB, ad-hoc cross-document queries have the potential to spike request unit consumption beyond the provisioned wallet, thus throttling actual user requests. So, in my opinion, it’s too dangerous to run any “I need this number, now” queries on the operational system.

“Tell us something we don’t already know.” Enterprises have been separating their OLTP and OLAP workloads for decades. Why can’t the teams just get what they need from the data warehouse? Well, in a modern cloud-based architecture where the majority of operational systems are now using document databases to process transactions, it will take some time to build the warehouse. I’ve done the maths on this and I calculate it will take 42 light-years to add every single attribute in the enterprise to the dimension and fact tables of the warehouse. Why so long? Impedance mismatch debt.

In the era when relational engines reigned supreme, application programmers were forced to collapse their class structures of nested objects into normalised tabular forms to efficiently store and retrieve data from the relational engine (strict schema-on-write). So effectively you had many souls working away to solve the impedance mismatch problem. In this new era, NoSQL technologies sell themselves on the promise that the impedance mismatch problem no longer needs solving. Now trust me on this, that is a big selling point to software engineers. I’ve spent many a year working as a data engineer and had many conversations with software engineers over technology choice, and found their disdain for maintaining mapping layers like NHibernate, Dapper or EF incredible. It’s as if relational supremacy was my fault and I personally had imprisoned them in ORM hell for all these years. Thus, the opportunity to adopt NoSQL technologies, abandoning the need to manage mappings, somehow makes up for the wrongs they experienced whilst incarcerated. If you read any of my articles or watch any of my videos you’ll see I’m actually a big fan of NoSQL technologies, in particular CosmosDB. It’s awesome for the right use case!

Coming back to my point: adopting NoSQL technologies and dropping the ORM doesn’t mean the work the ORM was doing has evaporated. It has just become someone else's problem. Why? Well, the operational systems world may have moved on, but many organisations have a well-established OLAP architecture whose primary source of data is a relational store.
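To put some flesh on the earlier crisis-query point, below is a minimal sketch of that “how many accounts are impacted” question asked directly of CosmosDB via the azure-cosmos Python SDK. The account, container and bug predicate are all hypothetical, and reading the RU charge from the response headers leans on SDK internals, so treat it as illustrative rather than a recommendation.

```python
# A hypothetical "I need this number, now" query run straight against the
# operational CosmosDB container. The cross-partition aggregate is charged in
# request units (RUs) against the same provisioned throughput serving users.
from azure.cosmos import CosmosClient

client = CosmosClient(
    url="https://myaccount.documents.azure.com:443/",  # hypothetical account
    credential="<account-key>")
container = client.get_database_client("accounts-db").get_container_client("accounts")

# Count documents matching the (made-up) bug condition across all partitions.
query = ("SELECT VALUE COUNT(1) FROM c "
         "WHERE c.schemaVersion = 2 AND NOT IS_DEFINED(c.postcode)")

impacted = list(container.query_items(query=query, enable_cross_partition_query=True))[0]

# The RU cost of the last request is reported in the x-ms-request-charge header.
charge = container.client_connection.last_response_headers.get("x-ms-request-charge")
print(f"{impacted} accounts impacted, at a cost of {charge} RUs")
```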
Just as relational databases have reigned supreme, Kimball’s star schema has also had overwhelming success. Yes, analytical technologies have evolved to source from a panoply of disparate sources, but fundamentally a great deal of data is still stored in, and sourced from, relational tabular schemas. So the impedance mismatch problem has now moved downstream to anyone who needs data in a schema-on-write tabular store. The Data Lake shouldn’t care about the needs and wants of every data consumer. If it attempts to curate specialised forms of data for all, it will be pulled in all directions, becoming a centralised fatberg of a blocker as consuming teams exert their problems upon the data engineers who now have the job of keeping everyone happy.

That's where this beautiful term schema-on-read comes into play and really resonates with me. In short, the data is stored as-is: how we received it, how it was originally published. If you want or need the data in a different shape, well, guess whose job that is, buddy? This remains true until there is a collective realisation that some refined views need promoting to materialised views, because many consumers share the same requirements and the compute required to process the data becomes unpalatable. That raises the question of who is responsible for maintaining these views. This responsibility should reside, where possible, with the teams that know the data best, distributing the responsibility instead of centralising it. However, the design and creation of views should not supersede the ingestion of raw data. Data first, refinement comes later.
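As a sketch of what schema-on-read, and a promoted materialised view, might look like in practice (the paths, column names and daily-revenue refinement are all assumptions on my part), consider the following PySpark example:

```python
# A sketch of schema-on-read: raw order events are landed as line-delimited
# JSON exactly as published, and a shape is only imposed at read time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Raw zone: stored as-is, how it was originally published.
raw_orders = spark.read.json(
    "abfss://lake@mydatalake.dfs.core.windows.net/raw/orders/order-placed/")

# Each consumer shapes the data to its own needs at read time...
daily_revenue = (raw_orders
                 .withColumn("day", F.to_date("placedAt"))
                 .groupBy("day")
                 .agg(F.sum("totalValue").alias("revenue")))

# ...until the same refinement is needed widely enough to be promoted to a
# curated, partitioned materialised view maintained by the team that knows
# the data best.
(daily_revenue.write
 .mode("overwrite")
 .partitionBy("day")
 .parquet("abfss://lake@mydatalake.dfs.core.windows.net/curated/orders/daily-revenue/"))
```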

Data Swamp Fear Motivates Data Prison

Balancing data security and governance with accessibility and delivery velocity is extremely difficult. I’m surprised someone hasn’t described a GAV theorem, where Data Governance, Accessibility and Delivery Velocity are traded off against each other. A Data Lake with poor data governance will bring feeds under management at a greater pace but won’t necessarily make the data very accessible, as it will be scattered in an unorganised fashion across multiple storage assets (a data swamp). A Data Lake with good data governance will set the foundations for organised, secure data structures, making data easily accessible at the cost of being much slower to develop. At one extreme, we can create a data swamp at pace; at the other extreme, we slowly construct a maximum-security data prison. At both of these extremes, data accessibility is the victim.

Making Data Flow: Lambda vs Kappa

Lambda and Kappa architectures are big subjects and I don’t intend to cover them here. But whether you agree or disagree with either of the architectural approaches isn’t important. Building a big data platform that exposes data for stream analytics is important.

Where previously we had service-oriented architecture (SOA), we now have micro-services. When considering data-driven intelligence, micro-service architecture exposes messages as unbounded streams of data, as services communicate events to other services within the enterprise. For example, an Orders service communicates order processing events to the Payments service when card payments need to be processed. This is great, as we have an existing event endpoint we can subscribe to, routing these outgoing messages to new analytical capabilities. Having access to near-real-time events opens the door to near-real-time intelligence. I like to think of the data as being in-flight, like oxygen and hydrogen atoms buzzing around in the atmosphere. They can be emitted from the leaves of a publishing micro-service and breathed in by a subscribing micro-service. Admittedly, atoms travel around in a somewhat chaotic manner, unlike a good micro-service architecture, but the pure rawness of atoms really strikes a chord with me. In the same way that oxygen and hydrogen can become organised to make water in a lake, message data can be organised to make partitions of parquet files in a Data Lake. Unfortunately, this organisation process stabilises the data ready for batch analysis, and the opportunity to perform near-real-time intelligence is lost. Just as water can once again lose its organisation and become free atoms, stationary data in the lake can be shredded back into a stream of events. However, time cannot be re-materialised and the opportunity to make near-real-time decisions is gone. Don’t make data stored in the lake a precursor to making intelligent decisions. Don’t make streaming analytics a secondary concern. Tap into the streams of data as they’re making their journey to the lake.
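As a minimal sketch of tapping the stream in-flight, the example below subscribes to a hypothetical order-events hub with the azure-eventhub Python SDK and reacts to events before they ever settle into parquet on the lake. The connection string, hub name and event shape are assumptions of mine; in a production pipeline you would also checkpoint progress and capture the same stream to the lake for batch analysis.

```python
# A sketch only: subscribe to the same event stream the lake ingests,
# and act on events while they are still in-flight. Names are hypothetical.
import json
from azure.eventhub import EventHubConsumerClient

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hubs-namespace-connection-string>",
    consumer_group="analytics",       # a dedicated consumer group for analytics
    eventhub_name="order-events",
)

def on_event(partition_context, event):
    order = json.loads(event.body_as_str())
    # Near-real-time decision making happens here, not after the batch lands.
    if order.get("totalValue", 0) > 1000:
        print(f"High-value order spotted in-flight: {order.get('orderId')}")

with consumer:
    # "-1" starts from the beginning of the retained stream.
    consumer.receive(on_event=on_event, starting_position="-1")
```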

Education and Discoverability

It’s important not to overlook upskilling employees with the skills they need to use data effectively, and to make accessing and understanding data intuitive. Data that's difficult to find is effectively lost, and the effort put into capturing and preserving it becomes wasted effort. Organisations might use a mixture of SharePoint, Confluence, GitHub and other platforms for publishing internal documentation. With many teams using many different publishing technologies, it can become difficult to create a consistent, central catalogue of the data that's available for ingestion, and indeed analysis. Let us consider some examples from our day-to-day lives which may reflect the problem of making data discoverable.

Starting with my kids' toy boxes. They’re packed full of toys of all shapes and sizes; some were a massive hit with the kids, others not so much. However, as they are all stuffed into various boxes and containers, only the ones at the top are discoverable. Consequently, regardless of a toy’s merit, only a few get played with regularly; the others can be considered lost. How about a library? Now we’re getting there: ordered and categorised publications by many authors. But it's a bit of a fuddle to find the exact information you’re looking for. Google… em, I meant Bing? Masses of discoverable information with a simple interface, yes please. In essence, we need two things to make datasets discoverable: a community of publishers and a search engine.

With data discovery services in place, employees can then start solving problems with data and upskill themselves by tapping into the datasets and metadata collated in a centralised knowledge repository. Azure Data Catalog provides publishing capabilities, data tagging, data sampling and a search engine. In future articles, I’ll expand on the usage of a Data Catalog and how one might be built alongside your data.

End of Scene One

I’ve covered a lot of topics, which I hope demonstrates why it’s difficult to build a Data Lake and become a big-data-driven organisation. In the next part of this story, I’ll get into more technical detail and discuss how a Data Lake can be forged in Microsoft Azure.


Gary Strange

Gary is a Big Data Architect at ASOS, a leading online fashion destination for 20-somethings. He advises 11 teams across three domains.