“Guarantee your architecture will produce bad data”: an analysis of the Lollapalooza effect in Data and AI

Franco Patano
7 min read · Jun 2, 2024


Disclaimer: The crazy and wild opinions expressed in this article are entirely my own and are in no way, shape, or form endorsed, supported, or even remotely acknowledged by my employer, family, or anyone else who knows me. If anyone were aware of the contents of this piece, they’d likely disavow any knowledge of my existence and perhaps even issue a public apology for knowing of me in the first place. Read at your own risk, and remember: any complaints, outcries, or calls for my immediate jailing should be directed solely at me and not the innocent people who have to endure my presence.


In Poor Charlie’s Almanack, Charlie Munger talks about the psychology of human misjudgment, and he introduces the subject cleverly in one of his talks. To make it more engaging, he flips the usual format of giving good life advice around and explains how you could “Guarantee a Life of Misery.” I thought this was a clever way to engage an audience. In my day-to-day work, I discuss strategy and tactics to improve data and AI architecture, so I have a front-row seat to the monsters that exist in the wild in on-premises data centers, along with their appendages in the cloud. These monsters keep people trapped into thinking they are stuck with what they have because of a confluence of psychological biases, reinforcing mechanisms, and compounding effects that Charlie Munger calls the Lollapalooza effect. Let’s dissect the beast and pay tribute to Charlie by laying out the core tenets of “Guarantee your architecture will produce bad data.”

Best-of-breed siloed systems

You should use best-of-breed software from different vendors, relying on custom integrations with lots of professional services, training, and support costs. Remember that each system will require someone to manage it, and to avoid key-person risk, you have to have two. There should be a separate ETL tool, various flavors of warehouse engines, a data lake stack, a separate data governance tool, additional federation software, another data science notebook and machine learning engineering stack, a DevOps and MLOps stack, and a deep learning stack to fine-tune LLMs. It should all be complicated further with containers or separate services that your expensive technical staff have to manage, spending all their time keeping up with the custom integration friction.
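
To put rough numbers on that, here is a minimal sketch of the staffing and glue math. The tool names and counts are hypothetical; the point is that every silo needs at least two people and a pile of point-to-point integrations.

```python
# Hypothetical "best-of-breed" stack inventory (names are illustrative only).
stack = [
    "ETL tool",
    "warehouse engine A",
    "warehouse engine B",
    "data lake stack",
    "data governance tool",
    "federation layer",
    "notebook / ML engineering stack",
    "DevOps + MLOps stack",
    "deep learning / LLM fine-tuning stack",
]

ADMINS_PER_SYSTEM = 2  # one owner plus one backup, to avoid key-person risk

headcount = len(stack) * ADMINS_PER_SYSTEM
# Worst case: every silo glued to every other silo with custom code
integrations = len(stack) * (len(stack) - 1) // 2

print(f"{len(stack)} systems, {headcount} admins, up to {integrations} point-to-point integrations")
```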

Custom Integrations

All the siloed best-of-breed software should be held together by custom code, and bonus points if an expensive systems integrator is supporting it. All these vendors’ independent release cycles will line up perfectly, and all our custom scripts will work forever. They will never need an update at the moment we need them the most, when we would end up overpaying for urgent patches. We can rely on our software vendors to offer us expensive patch solutions as integration issues come up.
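
A minimal sketch of that glue, with hypothetical file names, column positions, and delimiter: a script that silently assumes vendor A’s export layout never changes.

```python
import csv

VENDOR_A_EXPORT = "vendor_a_daily_export.csv"   # produced by vendor A's current release
VENDOR_B_DROPZONE = "vendor_b_inbox.psv"        # consumed by vendor B's loader

# Fake a vendor A export so the sketch runs end to end
with open(VENDOR_A_EXPORT, "w", newline="") as f:
    csv.writer(f).writerows([
        ["order_id", "region", "sku", "qty", "x1", "x2", "x3", "amount"],
        ["1001", "EU", "A-17", "2", "", "", "", "39.98"],
    ])

with open(VENDOR_A_EXPORT, newline="") as src, open(VENDOR_B_DROPZONE, "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="|")
    for row in csv.reader(src):
        # Hard-coded column positions: this breaks (or worse, ships wrong data)
        # the day vendor A adds or reorders a column in its next release.
        writer.writerow([row[0], row[3], row[7]])
```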

Data Lineage should be an afterthought

The old joke in metadata management was that documentation and lineage are only as good as the last time they were updated. Good luck figuring that out. Nobody needs to understand provenance when they are diagnosing a data quality issue. These additional tools shouldn’t be automated; people will remember to update all the lineage information by hand every time they make one small change, right? When users ask for the business logic, we can just have a developer translate the 1,000+ line stored procedure that produced it.
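
Here is a minimal sketch of what hand-maintained lineage drifts into, with hypothetical table names: the documented sources versus what the stored procedure actually reads today.

```python
# Updated by hand "whenever someone remembers"
documented_lineage = {
    "finance_mart.revenue": ["warehouse.orders", "warehouse.refunds"],
}

# What the 1,000+ line stored procedure actually reads today
actual_sources = {
    "finance_mart.revenue": [
        "warehouse.orders",
        "warehouse.refunds",
        "warehouse.manual_adjustments",  # added last quarter, never documented
    ],
}

for table, sources in actual_sources.items():
    missing = set(sources) - set(documented_lineage.get(table, []))
    if missing:
        print(f"{table}: lineage docs are missing {sorted(missing)}")
```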

Inconsistent Data Formats and Silos

When ingesting data, you should keep extra copies in various formats, locked in vendor silos. There should be a copy in the data lake, a copy in the warehouse, a copy for the spreadsheets, and a copy for finance, all with their complex transformation logic kept separate and lightly documented. This ensures that no one knows what the “correct” data is, and finance will end up plugging the gap anyway. A split architecture like Lambda, where we have multiple copies of the data in various locations and formats (Parquet, Avro, JSON, ORC, databases, data warehouses, and cubes), is ideal. Data inconsistencies proliferate through all of the pipelines, each running a different version of the logic, which ensures no one knows what data is good or bad. It’s a Mexican standoff of epic proportions just to keep things moving along.
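
A minimal sketch of the copy sprawl, with hypothetical data and file names: the same orders landed three times, each copy with its own slightly different “cleanup” logic, so every consumer gets a different total.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [100.0, 250.0, -40.0],  # a refund hiding as a negative order
})

# Copy 1: the data lake keeps everything, refunds included
orders.to_json("lake_orders.json", orient="records")

# Copy 2: the warehouse load "cleans" by dropping negatives
orders[orders["amount"] > 0].to_csv("warehouse_orders.csv", index=False)

# Copy 3: finance caps refunds at zero for the spreadsheet crowd
orders.assign(amount=orders["amount"].clip(lower=0)).to_csv("finance_orders.csv", index=False)

# Three files, three totals, and nobody can say which one is "correct".
```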

No real-time monitoring

When failures pile up, just toss them into a folder that no one notices. Pipelines can fail silently for a few days; nobody looks at the numbers day to day. As long as the month-end reporting pipeline doesn’t fail again, you should be fine. There isn’t any value in understanding why the pipelines fail in certain scenarios, as long as we eventually get it right.
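
A minimal sketch of “monitoring” by folder, with a hypothetical path and a stand-in failure: no alert, no retry, no metric, just log files quietly piling up where nobody looks.

```python
import traceback
from datetime import datetime
from pathlib import Path

DEAD_LETTER = Path("failed_runs")  # created once, opened never
DEAD_LETTER.mkdir(exist_ok=True)

def run_pipeline():
    # Stand-in for the real failure mode
    raise RuntimeError("source table missing a partition")

try:
    run_pipeline()
except Exception:
    # Swallow the error and file it away for no one to read.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    (DEAD_LETTER / f"failure_{stamp}.log").write_text(traceback.format_exc())
```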

Minimal viable documentation

When thinking about documenting your system, think less is more. You don’t want to overwhelm your users with lots of details. They should just take what they see on faith and assume everything works as expected. Make it difficult for other team members to understand sources, transformations, and dependencies; the knowledge should become tribal. With minimal viable documentation, data is open to interpretation by everyone, allowing bubbles of bad data to proliferate.

Isolated Data Governance

Implement governance early and often, separately in every one of the bespoke best-of-breed systems, each with its own governance implementation. These systems should not communicate with each other; instead, when it comes time for compliance reporting, users can just extract to Excel and VLOOKUP across sheets of source-system extracts. This ensures that desktop databases persist throughout the organization for point solutions like audit.

Wild Access Controls

Because we have governance solutions for each product in the stack, persona permissions for each group should be managed separately in each product, for each team. This proliferation guarantees we won’t be able to keep track of who has access to what. We power the custom integrations with superuser accounts to generate all the Excel extracts the users keep asking us for.
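
A minimal sketch of that permission sprawl, with hypothetical product, group, and grant names: the same analyst group defined separately in every product, drifting out of sync, with a shared superuser account doing the extracts.

```python
# Each product carries its own copy of "who can do what" (names are illustrative).
permissions = {
    "etl_tool":    {"analysts": ["read"]},
    "warehouse_a": {"analysts": ["read", "write"]},            # someone "temporarily" granted write
    "warehouse_b": {"analysts": ["read"], "svc_superuser": ["all"]},
    "bi_tool":     {"analysts": ["read", "export_to_excel"]},
    "lake":        {"svc_superuser": ["all"]},                  # integrations run as the superuser
}

# Answering "who can touch customer data?" now means auditing every product by hand.
for product, grants in permissions.items():
    print(product, grants)
```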

Ad-hoc Data Cleaning

While on the topic of exporting to Excel, this is the most important piece of the bad data puzzle. You see, if everyone is allowed to craft their own version of the truth in their own little spreadsheet silo, no one will have any confidence in the data. We must allow everyone to apply their own logic and transformations in silos; this ensures bad data is used to make business decisions with confidence.
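
A minimal sketch of two teams “cleaning” the same extract in their own silos, with hypothetical numbers and rules: each team gets a confident answer, and the answers disagree.

```python
import pandas as pd

extract = pd.DataFrame({
    "region": ["EU", "EU", "US", "US"],
    "revenue": [120.0, -20.0, 300.0, None],
})

# Finance's spreadsheet logic: drop blanks, keep refunds
finance_total = extract.dropna()["revenue"].sum()

# Marketing's spreadsheet logic: treat blanks as zero, drop refunds
marketing_total = extract.fillna(0).query("revenue >= 0")["revenue"].sum()

print(finance_total, marketing_total)  # two confident answers to one question
```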

Old Data

A lack of real-time processing and fractured orchestration tools will ensure that nobody knows how fresh the data is. Users do not need to be constantly updated with the noise that happens in the day-to-day. The warehouse is aggregated daily, which should be good enough. In the modern age, there isn’t a need for real-time accurate data to make fast decisions.

Fragmented Workflows

The orchestration of this magnificent bad-data machine should rely on many types of orchestration software used in aggregate. This will ensure that when people find bad data, it will be challenging to trace where it came from and how it was transformed. We need an ETL orchestration tool, another one for ML/AI, and another orchestration tool to rule them all. Using flat-file extracts out of source systems with black-box tools like stored procedures ensures that bad data is blindly persisted through the pipes.
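
A minimal sketch of the handoff between those schedulers, with hypothetical paths and timings: one tool drops a flat file, another polls for it and hopes the upstream job ran on time, and nothing in between knows about the other.

```python
import time
from pathlib import Path

HANDOFF = Path("extracts/orders_full_dump.csv")  # written by some other team's scheduler

def wait_for_upstream(timeout_seconds: int = 5) -> bool:
    """This 'orchestrator' knows nothing about the upstream job; it just
    polls for a file and hopes the other scheduler ran on time."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        if HANDOFF.exists():
            return True
        time.sleep(1)
    return False

if wait_for_upstream():
    print("flat file found, blindly loading whatever is in it")
else:
    print("no file yet; was it the ETL scheduler, the warehouse job, or the ML tool?")
```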

Expensive and Complex Maintenance

Paying for each of these best-of-breed systems ensures the most expensive possible configuration for producing our bad data. The specialists required to write and support the custom integrations are equally expensive. When business partners find oddities in the data, we should make it cost-prohibitive to fix them; that way, we can sweep our data problems under the behemoth rug of this massive cost sink. The ongoing maintenance costs and the complexity of any fix ensure no one will even want to request one. Plus, every one of our excellent tools exports to Excel. Isn’t that what the users want anyway?

Lock-in with Stockholm

With all these effects at play, herd mentality persists throughout your organization and keeps everyone in bad-data limbo. The final nail in the coffin is how expensive and complex it would be to migrate away from this monstrosity. The confluence of these effects takes hold of the herd, Stockholm syndrome sets in nicely, and, after all, isn’t this the way we have always done it?

Hopefully, you found this topic as fun to read as it was for me to write. If you want my perspective on how to do the opposite of this, check out my talk at Data and AI Summit and the follow-up to this blog, Rise of the Medallion Mesh.
