The 2022 Data World in Three Words: “I Was Wrong” — 6 Truths I Didn’t Understand 365 Days Ago.
60,000 words in newsletters, a dozen articles, and a book.
That’s what I wrote about the data space in 2022.
I write, mostly to learn. Yet when I look back at 2022, most of what I learned came from things I got wrong, things I had to unlearn, and things I had to learn anew.
So, following my own advice to take unlearning in the data space seriously, here are my top six unlearnings.
(1) The Data Mesh is here to stay, but it’s not coming as fast as expected.
The data mesh keeps haunting me: whenever I think I’m done writing about it, it comes back. At the beginning of the year, I was finishing the book “Data Mesh in Action”, going through all kinds of data meshes out in the real world and drafting data mesh blueprints for start-ups. My position at the beginning of the year was “the data mesh is coming to every company & every industry, fast”.
And while I still believe this to be the case, 2022 turned out to be the year of a shift towards pragmatism around the data mesh.
Barr Moses, CEO of Monte Carlo, predicted for 2023 the idea of data meshes coupled with fat, emphasis on fat, central platforms.
And while a lot of companies try to adopt “ideas from the data mesh”, the adoption of the data mesh itself remains a hard task only a few companies tackle head-on.
The reason for this is simple: the data mesh is first and foremost a shift in culture, processes, and people, and these things are notoriously hard to change.
The reason I believe the data mesh is the future is just as simple: data is going to eat up every single value proposition, and in 10–20 years every product is going to be all about data, and about nothing else.
To extract value from data at scale, companies will have to decentralize at some point, and thus enter the data mesh.
Thanks to Zhamak Dehghani we now have a roadmap for this, and a name. But the process is still going to be unevenly distributed, owned by a lucky few, and take time.
(2) “Just do what the software engineers do” isn’t going to cut it in the data space.
I tend to argue that the data mesh is yet another decentralization move, just like all the others the software engineering & product world already made. So at the beginning of the year, it seemed only natural to me to keep saying “just transfer practice X from software engineering to data and you’ll be so much better off”.
“DataOps: Taking the Dev world into data” and the rise of “Data Reliability Engineers” are both examples of this idea: take what works in the software engineering world and see whether it works well in the data world.
2022 was the year I realized it just doesn’t work like that. After failing a lot (for literally years) at trying to help data developers use software engineering best practices, I learned two lessons the hard way that changed my perspective:
- Data people still have a largely different background, culture & set of practices than software engineers
- The data value creation process looks different from the software one
(1) might seem obvious to a lot of people, but this lesson goes way deeper than is commonly accepted. Most data product managers come from a technical background; in the software world this isn’t the case. Business analysts turned analytics engineers have even less exposure to software engineering than most data engineers. And data engineers & data scientists often have a quantitative or science background, not a software one.
(2) Software that gets released is exposed to end users and creates value. Data-heavy products, on the other hand, get released and only create value once the data hits the software. This sounds trivial, but for data-heavy products the data is the main part. This makes the whole “software delivery process” for data longer and more complex. The focus shifts onto the data, and software tools are traditionally ill-equipped to handle this, leaving us with a whole lot of nothing.
Some random implications of these two lessons are:
- The data mesh as “just another decentralization movement” makes sense only if the company culture supports it.
- Machine learning integrated into products makes sense only if the company has product management that openly embraces it.
- If your company considers data to be a sidekick, then no big investment in best practices makes sense.
- Pushing data into your CI system makes sense only as long as the data stays small. Yes, it’s called “dbt — data build tool” — but it is not the same as a software build process, not for the vast majority of cases (a minimal sketch of what I mean follows this list).
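To make that concrete, here is a minimal sketch of what “data in CI” tends to look like, assuming a hypothetical dbt project with a committed profiles.yml and a Postgres warehouse; the workflow name, adapter, and target are my assumptions, not a recommendation:

```yaml
# Hypothetical GitHub Actions workflow: run the dbt project on every PR.
# Works fine while the data is small; full rebuilds stop scaling with volume.
name: dbt-ci
on:
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # The adapter is an assumption; swap in the one for your warehouse.
      - run: pip install dbt-core dbt-postgres
      - run: dbt deps
      # 'dbt build' runs and tests every model against the warehouse.
      # This is where the analogy to a software build breaks down:
      # the "build" time grows with your data, not with your code.
      - run: dbt build --target ci
        env:
          DBT_PROFILES_DIR: .  # assumes profiles.yml is committed for CI
```

Note where the analogy breaks: compiling code takes roughly constant time, while a dbt build runs against your warehouse, so the “build” grows with data volume, not code size.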
Speaking of companies that consider data to be a sidekick…
(3) 99.9% of companies are not aware of what data is important and what to do with it.
At the beginning of 2022, I believed 90% of companies got these big ideas wrong. But over the course of 2022, two things happened:
- I realized that more kinds of data are important than I had believed before (see below)
- A few major trends actually moved the world in the opposite direction!
(2) As much as I love dbt, it is made to make you focus on well-structured, modeled data. This movement makes the true issues seem farther off than they are. The problem is that focusing only on well-structured, modeled data isn’t a real alternative at all.
I believe that the following four things are completely underrated:
- Real-time data use (not to be confused with event-driven architectures)
- Unstructured data (and everything that builds on top of it as a use case)
- Company-external data (and everything that builds on top of it as a use case)
- Turning data into actions
You might argue that this is obvious, and you knew that already. But I think the extent to which this is true is astonishing.
(4) Turning data into action is important, yes yes, we get it… But do you? What did your data team do over the last 4 weeks? Think about it: what efforts are truly targeted at helping to turn data into actions? If your data team isn’t the right place to look, is someone else at the company helping to do this? If not, you’re in serious trouble.
How about a simple decision: which is more important, taking a couple of hours to ingest a new data source or spending a couple of hours helping a decision-maker understand data? Only one of these options helps to turn data into action.
(3) Company-external data is huge, and next to no company utilizes it. It’s only the Amazons (hello “minimal pricing mechanism”) and Netflixes of the world, plus the investment world, that take external data seriously. Period. It’s completely out of scope for most, and yet it is where all the growth happens.
(2) Unstructured data. I love dbt. But dbt helped launch a revolution that focuses on a few selected data sources and on well-structured & modeled data (the sketch below shows what that world looks like). That’s a good thing from one perspective and a terrible thing from another: it makes unstructured data look like the unwanted cousin. And it’s not. It’s the most important data source you have, yet in 2022 it was treated as less and less important.
(1) Would you rather ingest a new data source or cut your data refresh interval from 2 hours to 30 minutes? My feeling is that most people would choose the first option. That’s kind of what the dbt movement does: it makes that kind of task easy. So real-time data was treated as less important in 2022. And yet, if you look at things like the pandemic and the war in Ukraine, it seems like “real-time” data is becoming the only thing that matters to any business.
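To show the lens this movement puts on you, here is a minimal sketch of the kind of contract dbt rewards; the orders model and its columns are hypothetical, but the shape is the point: everything must already be tabular, named, and testable before it fits.

```yaml
# Hypothetical dbt schema.yml: the structured, modeled world dbt rewards.
version: 2
models:
  - name: orders                      # assumes a models/orders.sql exists
    description: "One row per order, cleaned and deduplicated."
    columns:
      - name: order_id
        description: "Primary key."
        tests:
          - unique
          - not_null
      - name: amount_eur
        description: "Order value in EUR."
        tests:
          - not_null
# Unstructured sources (documents, images, support tickets) and streaming
# feeds have no natural place in a contract like this one.
```

Nothing in a file like this is wrong; the point is that real-time feeds, unstructured documents, and company-external sources never fit into it, which is how they quietly drop off the roadmap.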
(4) Even a collapsed crypto world has so much to teach the data world.
At the beginning of 2022, I had a faint interest in the crypto world. Then I took a closer look. Oh boy, I should’ve looked way earlier. What I realized over the course of 2022 is that the crypto world has a similar set of physics to the data world. And even better, it seems like the crypto world is 2–3 years ahead of the data world.
The three laws of crypto physics:
- Systems thinking tops product thinking.
- Community-led innovation tops company-internal innovation in terms of breadth.
- Open tops closed.
These laws don’t apply to every sector, not at all. But they have held very steadily in the data space. And so I shifted my perspective there and am starting to learn a lot from the crypto space.
Talking about openness…
(5) Open source is hard, like really hard, to pull off. And nobody talks about it.
Publishing open source for business: a lot of companies pull this off, right? Google is leveraging open-source solutions like Kubernetes to achieve huge business goals. Red Hat, Automattic, and GitLab all built successful companies on open source. It should be obvious and easy to publish open source to achieve business goals.
And yet, as I kept writing about it in 2022, I realized it is not. And worse, no one is talking about it. I literally could not find a single book on the topic of publishing open-source software for business benefits.
What I did see in 2022 is that using open source for business is hard, way harder than most people think. Google just adopted Apache Iceberg, and that will make it incredibly hard for companies like Dremio or Tabular to pull off their business models (both are built on Apache Iceberg).
Companies like dbt Labs are in a constant (albeit so far successful) struggle with their openness. And every day, new founders seem to ask the same questions: do I open source? How much do I open source? And then how do I make money?
Caveat: I do believe it’s not only possible but incredibly valuable to make publishing open source for business purposes part of your repertoire. It’s the reason I keep writing extensively about it.
(6) Open Source is not the only option for data companies, but key challenges need to be solved.
In How to Become The Next 30 Billion $$$ Data Company, I argued that the only way to become a great data company is to rely heavily on open source.
I don’t think that’s true anymore. I’ve studied the rise of many data companies over the year and I now think I have a better understanding of what I was trying to get at.
My new understanding is that you will need to rely heavily on three things:
- some kind of standard/protocol,
- a large degree of openness,
- and network effects that are key to your company strategy.
You might be able to do all three with open source, you might cover only the first two with open source, or you might not use open source at all. Given the lesson before, I’m more inclined to recommend not defaulting to open source, but rather thinking deeply about your strategy in each of these categories.
While open source is how a large number of companies entered the data space, it might not be the best option. I suggest first focusing on these three pillars and then deciding whether it makes sense to use open source to make your business happen.
Thanks to P. for making me put this into much nicer words.
Now it is your turn, feel free to tell me I’m still completely wrong!