EUDB: How Data Sovereignty Impacts Data Engineering
Engineering teams at Microsoft are no strangers to privacy- and compliance-related requirements. A significant share of engineering cycles and resources is perpetually dedicated to conforming to, or preparing for, the next new regulation, legislation, guidance, certification, or commitment that our leadership makes to earn trust with our customers.
It is serious business. In fact, it is serious enough that when designing new systems, it is prudent to bias towards design choices that help weather such compliance directives with the least disruption and the shortest timelines possible. That capability ultimately translates into a competitive edge.
Data sovereignty has been a particularly hot area in recent years and has an outsized impact on large, central data teams like IDEAs. Last year, Microsoft announced a commitment to the European Union to create an EU data boundary for storing and processing customer data. You can read more about it in these official blog posts:
- Answering Europe’s Call: Storing and Processing EU Data in the EU — EU Policy Blog
- EU Data Boundary for the Microsoft Cloud: A progress report — EU Policy Blog
The commitment, in a nutshell, is that user-grain data associated with and belonging to EU customers does not leave the EUDB (EU Data Boundary).
Let’s break that down with a simple example in the context of a query. Figure 1 shows a simple query that counts the number of users meeting a simple criterion by querying a user-grain table.
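Figure 1 itself is not reproduced here, but a minimal sketch of such a query might look like the following. The table, columns, and data are hypothetical stand-ins, run via SQLite purely for illustration.

```python
import sqlite3

# Hypothetical user-grain table: one row per user, with a simple attribute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, is_active INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("u1", 1), ("u2", 0), ("u3", 1), ("u4", 1)],
)

# The "Figure 1" style query: count users meeting a simple criterion.
(active_users,) = conn.execute(
    "SELECT COUNT(*) FROM users WHERE is_active = 1"
).fetchone()
print(active_users)  # 3
```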
Figure 2 below shows the same query re-implemented to comply with EUDB requirements.
In the reimplementation in Figure 2, the user-grain source table is now geo-partitioned across two locations: ROW (Rest of World, i.e., a data center located outside the EU, perhaps in North America) and EUDB (a data center located within the EU). The original filtering happens in parallel (with the necessary coordination) in both locations, with an intermediate copy of aggregated data into ROW to facilitate the final aggregation in the third segment. As this example shows, no user-grain data left the EUDB.
To accomplish this, we introduced considerable complexity into what should be a trivial query: we went from a single query script to a minimum of three scripts targeting at least two storage instances located in different geographies, with an intermediate copy from one region to another.
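To make the three-segment pattern concrete, here is a minimal sketch of the Figure 2 style re-implementation, simulating the two geo-partitions as separate in-memory stores. All names and data are hypothetical; in practice each segment would be a separate script running against a store in its own region.

```python
import sqlite3

def make_partition(rows):
    """Create one geo-partition of the user-grain table (simulated in-memory)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id TEXT, is_active INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?, ?)", rows)
    return conn

row_db = make_partition([("u1", 1), ("u2", 0)])   # ROW: Rest of World
eudb_db = make_partition([("u3", 1), ("u4", 1)])  # EUDB: EU data boundary

def partial_count(conn):
    # Segments 1 and 2: filter + aggregate run inside each region, so only
    # the scalar aggregate (never user-grain rows) leaves the EUDB.
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM users WHERE is_active = 1"
    ).fetchone()
    return n

# Segment 3: the intermediate EUDB aggregate is copied into ROW, where the
# final (distributive) aggregation happens.
total = partial_count(row_db) + partial_count(eudb_db)
print(total)  # 3
```

Note that the final result matches the single-region query exactly; only the execution topology changed.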
Let’s take another example for illustration, this one adding a dimension table and two stages of aggregation. Figure 3 lays out a query that attempts to identify the apps with the most users at risk of not renewing their subscriptions. We start by finding user subscriptions that have not been renewed, associate these with their SKUs (the subscription type, which determines which apps are included), find the most impacted SKUs, use that information to derive the list of apps, and finally count the users with expiring subscriptions and low engagement on those apps.
[Link to open SVG file for Figure 3]
As expected, the reimplementation adds quite a bit more complexity as seen in Figure 4.
[Link to open SVG file for Figure 4]
For the reimplementation in Figure 4, the original script was split into six segments with two cross-region copies, and that assumes the two user-grain assets were already geo-partitioned and copies of the dimension tables were already present in both regions.
Hopefully this begins to illustrate what is expected. Let’s get into some of the key concerns that immediately follow from taking a large data operation, one producing thousands of durable data assets from hundreds of upstream sources, and making it conform to EUDB requirements.
1. Bifurcated data sources: A data team like IDEAs consumes data from a large variety of sources, including telemetry. As I alluded to in my earlier post on telemetry acquisition, there is a rather large coordination exercise around the bifurcation of user-grain sources upstream. Service stamps need to be provisioned in EU data centers and their telemetry collectors appropriately rerouted to new event sinks within the EU. Much of the heavy lifting here is handled by product and platform engineering, but the downstream pipeline re-engineering must be concordant with these changes to avoid disruption to downstream data consumers.
2. Source geo-fragmentation: During this expansive restructuring across a huge ecosystem of apps and services, it is inevitable that some data does not get routed correctly. Additionally, there are several edge cases which make the right outcome a challenge to arbitrate. For example, what happens when an EU user travels outside the EU? Should the user’s requests be rerouted to the service instances in Europe even if that degrades the experience? Or should the user be routed to the closest instance and have their data moved out-of-band to the EU after a short delay? We’ll let the legion of privacy specialists and legal experts sort these out. From a systems point of view, however, we need to be able to detect such cases, measure the extent to which they are happening, and have tooling and protocols in place to mitigate them.
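As a sketch of what such detection might look like, here is a hypothetical audit that scans a geo-partition for records whose home region disagrees with where they landed. The record shape, region names, and country list are illustrative assumptions, not an actual IDEAs schema.

```python
EU_COUNTRIES = {"DE", "FR", "NL", "IE"}  # illustrative subset, not exhaustive

def misrouted_records(partition_region, records):
    """Find user-grain records that landed in the wrong geo-partition.

    records: iterable of dicts with 'user_id' and 'home_country' fields
    (an assumed shape for this sketch).
    """
    flagged = []
    for rec in records:
        belongs_in_eu = rec["home_country"] in EU_COUNTRIES
        landed_in_eu = partition_region == "EUDB"
        if belongs_in_eu != landed_in_eu:
            flagged.append(rec)
    return flagged

# A ROW partition containing one EU user's record: detect and measure it.
row_partition = [
    {"user_id": "u1", "home_country": "US"},
    {"user_id": "u2", "home_country": "DE"},  # misrouted: EU user in ROW
]
bad = misrouted_records("ROW", row_partition)
print(len(bad))  # 1
```

In a real system this kind of scan would feed an ongoing measurement, plus an out-of-band remediation path to move the flagged records.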
3. Pipeline re-engineering & Data sharing: Pipeline lineage can be complex and can span several data teams. IDEAs alone has a data processing dependency graph with tens of thousands of pipeline nodes, which in turn shares data with several hundred teams in the company. Re-implementations like the one in Fig. 4 have to be engineered and deployed in a coordinated fashion through this entire graph without disruption.
4. Orchestration complexity & Dev productivity: When fully implemented, the IDEAs processing graph will have roughly 2–3 times as many individual activities as it did prior to supporting EUDB, split across at least two regions and at least as many data centers. That means many more places where things can go wrong. It will also involve many more copies of intermediate and dimension datasets, which can get expensive if not done thoughtfully and monitored on an ongoing basis. Without automation and adequate tooling, this adds considerable development tax and operational stress for what is effectively the same output.
5. EUDB-centric authoring features & controls: In addition to a one-time pipeline re-engineering effort, we must be able to prevent unsanctioned data movement in all subsequent updates. Our pipeline is a high-churn construct with several hundred deployments in any given month. We must think about prevention right at the authoring stage, and certainly ensure that any violations are caught at deployment time. This isn’t just a DevOps concern but also a compliance concern: the best controls are those that prevent violations by design.
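One way to catch violations at deployment time is a static check over the pipeline definition: reject any activity that moves user-grain data out of the boundary. Below is a minimal sketch; the activity schema is a hypothetical stand-in for whatever the real deployment manifest contains.

```python
def validate_activities(activities):
    """Return names of activities that copy user-grain data out of EUDB."""
    violations = []
    for act in activities:
        if (
            act["grain"] == "user"
            and act["source_region"] == "EUDB"
            and act["dest_region"] != "EUDB"
        ):
            violations.append(act["name"])
    return violations

# Aggregate copies out of EUDB are sanctioned; user-grain copies are not.
pipeline = [
    {"name": "agg_copy", "grain": "aggregate",
     "source_region": "EUDB", "dest_region": "ROW"},
    {"name": "raw_copy", "grain": "user",
     "source_region": "EUDB", "dest_region": "ROW"},
]
print(validate_activities(pipeline))  # ['raw_copy']
```

Wired into the deployment gate, a check like this turns the boundary into an invariant rather than a review-time convention.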
6. Non-distributive measures: In the examples above, we used COUNT as the aggregation operator, which is distributive: it can be applied individually in each geo-partition and then re-applied across the per-partition results for the correct final result. Non-distributive operators like COUNT DISTINCT, PERCENTILE, etc. do not support this behavior. Alternatives exist, including probabilistic implementations that introduce some error. These need to be evaluated with the specific use case in mind.
7. Self-serve friction and the non-developer persona: SQL skills are not exclusive to developers. Many analysts, program managers, and subject matter experts are comfortable querying their data using SQL and SQL-like analogs; at Microsoft, many non-data-engineers routinely write such queries. However, as shown in the examples above, EUDB adds a whole new level of orchestration complexity that such users will routinely run into. They will either need to be supported with suitable abstractions over their query engines and geo-partitioned datasets, or take a significant productivity hit as they attempt to work within the constraints that EUDB imposes.
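For such users, one plausible shape for that abstraction is a facade that accepts the familiar single-table query shape and performs the geo-partitioned execution behind the scenes. Everything below is a hypothetical sketch, not an actual IDEAs API.

```python
class GeoPartitionedTable:
    """Hypothetical facade hiding multi-region execution behind a count API."""

    def __init__(self, partitions):
        # partitions: region name -> list of user-grain rows (dicts)
        self.partitions = partitions

    def count(self, predicate):
        # Each partial count conceptually runs inside its own region; only
        # the scalar aggregates cross the boundary, never user-grain rows.
        return sum(
            sum(1 for row in rows if predicate(row))
            for rows in self.partitions.values()
        )

table = GeoPartitionedTable({
    "ROW": [{"user_id": "u1", "is_active": True},
            {"user_id": "u2", "is_active": False}],
    "EUDB": [{"user_id": "u3", "is_active": True}],
})
active = table.count(lambda r: r["is_active"])
print(active)  # 2
```

The analyst writes one predicate against one logical table; the facade owns the segment splitting, the coordination, and the restriction to distributive aggregates.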
8. The next sovereign boundary: As mentioned above, data sovereignty is a hot topic, and the EUDB is unlikely to be the last boundary that teams like IDEAs will need to account for. Earlier this month, a similar requirement was dropped from later revisions of a proposed data protection law in India: Indian IT minister signals reversal of data sovereignty plan • The Register. It would therefore seem prudent to ensure that any system capability designed to support EUDB also supports (or at least does not impede) future data sovereignty or residency requirements.
EUDB is a beast of a problem. It’s the kind of engineering challenge where we have teams across the company in product, platform and data engineering domains rallying together to hit a demanding timeline. It is also a valuable learning opportunity to devise innovative design patterns and solutions, push for convergence on best practices and infrastructure, and a forcing function to re-examine how we do data in the company.
The set of concerns discussed above also offers a compelling set of requirements for a data engineering stack. To make this transition, we need a great ecosystem around data interfaces, metadata-driven code generation, the ability to enforce data movement invariants, and reasonable alternatives for handling non-distributive measures. We need great metadata around our assets, including which among them are in scope for re-implementation, and a comprehensive understanding of lineage. Additionally, we need to hide much of this complexity from developers building data pipelines and, perhaps more importantly, from the data scientists, program managers, business analysts, and subject matter (functional) experts who are ultimately the final consumers of this data. Fortunately, IDEAs has long-standing engineering investments that support these very requirements, putting us in good stead for successfully taking on this demanding commitment. More on these systems in future posts.
If your team is taking on data sovereignty and has run into similar challenges, I would love to learn how you are gearing up to tackle them.