Cloud Data Architecture trends in 2022 and beyond…

Gaurav Thalpati
Jan 14, 2022

Data is evolving, and so are data architectures

A couple of months back, I wrote about my journey over the last couple of decades in the data world & how things changed from building warehouses to building lakehouses that combine a lake & a warehouse.

Of late, things have been changing rapidly in the data world, with architectures & technologies evolving on a daily basis. I got a few weeks’ break from my day job & tried to spend more time reading about the latest trends that could dominate 2022 & the next few years.

Here are my top 3 for 2022:

A True Lakehouse — not the traditional Lake+House
Yes. You read that right! Lake+House seems to be a traditional architecture now. It amazes me how quickly things have changed in this space. Till last year, I thought (Data)Lake+(Ware)House was the best architecture for all data ecosystems. However, a lot of issues/limitations are now being discussed across various forums.
There seems to be a paradigm shift towards a true lakehouse.

A true lakehouse keeps all the data in only one place. There is no need to move data into another warehouse or any other storage, and no need to copy data or maintain multiple copies. A single source of truth resides only in the data lake, providing the cost benefits of a lake & the performance & speed of a warehouse.

Databricks & Dremio have already developed products that support this concept by leveraging open table formats like Apache Iceberg & Delta Lake on top of open file formats like Parquet. We will likely soon see a lot of organizations adopting this architecture for implementing their data ecosystems.

There are a lot of articles & tech talks around this new true lakehouse concept, and they are worth reading for a better understanding of this subject.

SQL — not Spark
There seems to be a severe shortage of skilled cloud data engineers who can analyze & understand data & implement optimized, well-performing custom data pipelines using complex frameworks like Spark. Data engineers have to gain expertise across various technologies (Cloud, ETL, Spark, SQL, Python, Scala, BI, Streaming) along with good data analysis skills. The list is endless. It’s difficult to find all these skills in one person.

A much easier approach to delivery is to use simpler but powerful frameworks & languages for implementing data pipelines. SQL seems like the “go-to” option for such data engineering workloads, instead of other complex frameworks & languages.
I’m sure there will be many who do not agree with this, but no one can deny the simplicity of SQL and how quickly one can acquire a basic working knowledge of it.
Also, with the rise of Databricks SQL & Snowflake, it becomes really easy to implement & deliver projects with just SQL-skilled developers & leave all performance & infra management to these platforms.
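
To make the point a bit more concrete, here is a minimal sketch of a pipeline step where all the transformation logic lives in one SQL statement. It uses Python's built-in sqlite3 module purely as a stand-in engine so the example runs anywhere; the table & column names (raw_orders, daily_revenue, status) are hypothetical & not taken from any real project. On a platform like Databricks SQL or Snowflake, a similar statement would be submitted to the platform instead.

```python
import sqlite3


def build_daily_revenue(conn):
    """Run one SQL-only transformation: raw orders -> daily revenue summary."""
    # The whole "pipeline step" is a single SQL statement (hypothetical schema).
    conn.execute(
        """
        CREATE TABLE daily_revenue AS
        SELECT order_date,
               SUM(amount) AS revenue,
               COUNT(*)    AS orders
        FROM raw_orders
        WHERE status = 'completed'
        GROUP BY order_date
        """
    )
    return conn.execute(
        "SELECT order_date, revenue, orders FROM daily_revenue ORDER BY order_date"
    ).fetchall()


def demo():
    # In-memory SQLite database standing in for a cloud SQL engine (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, order_date TEXT, amount REAL, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
        [
            (1, "2022-01-01", 100.0, "completed"),
            (2, "2022-01-01", 50.0, "completed"),
            (3, "2022-01-01", 75.0, "cancelled"),
            (4, "2022-01-02", 20.0, "completed"),
        ],
    )
    return build_daily_revenue(conn)


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The Python here is only scaffolding for loading sample rows; the part a SQL-skilled developer would own is the single CREATE TABLE ... AS SELECT statement.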

Data Governance — not performance or cost
And the last one, which I think is very critical, is overall data governance & security. With a lot of innovation happening around the Cloud platforms, I don't think we need to worry as much about performance or cost optimizations. There are multiple tools, utilities, best practices & case studies to refer to for performance & cost optimizations, but data governance still seems like one area where a lot of work is required. Cloud platforms still don't have full-fledged native services for end-to-end data lineage tracking.
We might see a lot of Data Governance tools getting integrated with the leading data cloud platforms to provide an easy end-to-end solution for data governance & data lineage.
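
For readers new to the term, end-to-end lineage tracking boils down to being able to answer "which sources does this dataset ultimately depend on?". The sketch below is only an illustration of that idea, not how any governance tool or cloud service actually works: the lineage is a hand-written dictionary, and all dataset names in it are hypothetical.

```python
from collections import deque

# Hypothetical lineage metadata: each dataset maps to its direct upstream sources.
LINEAGE = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "customer_revenue": ["clean_orders", "raw_customers"],
    "revenue_dashboard": ["customer_revenue"],
}


def upstream(dataset, lineage=LINEAGE):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        queue.extend(lineage.get(current, []))
    return sorted(seen)


if __name__ == "__main__":
    print(upstream("revenue_dashboard"))
```

The hard part in practice is not this traversal but collecting the lineage metadata automatically across tools, which is where the gap described above lies.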

While these 3 are at the top of my list, there will be many other such areas where we will have to wait & watch. Please comment with your thoughts & let me know which areas you think will play a key role in 2022.
