A Look Back at Key Trends in Data Infrastructure in 2023 by Four Industry Founders

RisingWave Labs
ILLUMINATION’S MIRROR
13 min read · Mar 10, 2024
From top left, Alexander Gallego; bottom left, Jason Reid; top right, Xiang Fu; and bottom right, Yingjun Wu. Image created by the author.

Introduction

The discussion with the four founders of data infrastructure startups focused on key trends in the industry for 2023. These trends included the significance of Bring Your Own Cloud (BYOC) for reducing cost and complexity and for data security, the rise of open data formats, and the potential impact of streaming systems on AI/ML workloads, particularly given the rise of generative AI. Drawing on these trends, the founders also looked ahead to what 2024 may bring.

The most important technology trend of 2023

Alexander Gallego:

I consider BYOC (Bring Your Own Cloud) one of the most important trends in data infrastructure, particularly in the context of streaming, because it directly addresses the two essential challenges of cost and complexity, which have been the Achilles’ heel for many in the streaming sector. This topic still generates heated discussion, especially among some early-stage founders who passionately disagree with me. We actually had to build our Redpanda cloud version twice, once without BYOC and once with it, to unify the execution semantics.

Xiang Fu:

In my view, the key topic this year is likely the vector database, particularly from a data analytics perspective. There is no doubt that AI dominates this year’s discussions. Intriguingly, vector databases started the year as a novel and sophisticated technology, yet by year’s end, vector DB connectors had proliferated across numerous machine learning projects. These connectors are now ubiquitous, which is why almost all databases claim some sort of vector processing capability. This trend is fascinating, and we’ve even integrated vector indexing into Apache Pinot.
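
For readers new to the topic, the core operation a vector database serves can be sketched in a few lines: nearest-neighbor search over embedding vectors. The brute-force cosine-similarity version below is only an illustration; production vector indexes (including the one mentioned for Pinot) use approximate structures to handle scale.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(query, vectors, top_k=2):
    """Return the ids of the top_k stored vectors most similar to query."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [vid for vid, _ in scored[:top_k]]

# Toy 3-dimensional "embeddings" keyed by document id.
docs = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}
print(nearest([1.0, 0.05, 0.0], docs))  # doc-a and doc-b are closest
```

A real vector index answers the same question, but with sub-linear lookup rather than scoring every stored vector.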

Similarly, I’m totally on the same page as Alex when it comes to recognizing the importance of BYOC. At StarTree, for example, we’re broadening our capacity to offer BYOC to clients. The concept is gaining traction among SaaS companies and others, particularly regarding data security and compliance, with many actively exploring it.

Jason Reid:

My focus is primarily on open data formats and storage, which are fundamental to trends like generative AI and BYOC. This year has been significant, marking the mainstream acceptance of the unbundling of traditional data warehouses. This shift is reshaping our conventional understanding of databases and data warehousing.

Traditionally, we’ve seen combined stacks of computing and storage. However, the trend is now towards their separation. People are increasingly storing data in open formats like Iceberg and applying diverse computational methods, whether for AI, vector indexing, classic BI workloads, or future developments. This transition is likely to foster a highly competitive environment, a development I view as beneficial for the market. Observing how this unfolds will be intriguing.

Yingjun Wu:

The prevailing trend I observe is the industry’s debate over whether to adopt a decomposable/composable ecosystem or an all-in-one solution. We see significant projects gravitating towards either approach, with some, like Databricks, evolving into all-in-one solutions, while others specialize in specific domains.

I’ve also noticed that stream processing is gradually gaining more attention. Through our experience building RisingWave, a streaming database, we’ve come to realize that stream processing systems do not exist in isolation but are an essential part of a larger data ecosystem.

How will open data formats impact data lakehouse adoption in 2024?

Jason Reid:

We’ve engaged in numerous discussions with companies of various sizes and types. Some are deeply invested in modern data stacks, centered around ecosystems like Snowflake, dbt, and Fivetran, while others stick to more traditional methods like S3, Parquet files, and Hive. Sometimes, it’s a mix of both approaches.

Many companies we’ve talked to are just starting to explore open formats like Iceberg or Delta. Some use them for specific tasks, but not everyone is fully on board yet for comprehensive adoption, except for big players like Netflix — they’ve really pioneered the use of Iceberg.

Looking ahead, I expect things to get even better in the coming years. Integrations will improve a lot, making them easier to use.

Xiang Fu:

Absolutely, I agree. One interesting trend is the move toward making data formats more consistent. Take Databricks, for example, now supporting various formats like Parquet. This means you can use the same query engine or Lakehouse solution, no matter the file format. There’s also a push to streamline access for different connectors, making real-time support more effective.

We’re focused on integrating with Lakehouse, suggesting it for batch data ETL, and using systems like Pinot for quick and secure data retrieval.

We’ve observed that data latency from the Lakehouse is decreasing, from minutes or even an hour down toward near real-time. This creates a challenge for streaming solutions, making people question the need for super-fast data access. So, we’re working on highlighting the extra value streaming tech brings to customers and users.

Alexander Gallego:

Users care more about getting their tasks done than the specific labels of technologies like A, B, C, or D. If RisingWave works best for using a particular Iceberg table, that’s the way to go. Similarly, if streaming through Kafka API with Redpanda fits their needs, it’s a good choice. The key factors here are Iceberg and the entire Lakehouse model, especially when used together. Putting the user at the center and giving them data ownership through Iceberg, an open format, is crucial. It lets users freely choose the best approach, regardless of the tech provider.

Here’s a lesser-known industry insight: retrieving data from a SaaS provider can come with costs that create a kind of lock-in. Connecting to a broad ecosystem, whether it’s RisingWave, Pinot, or any other vendor, gives users choices in API and format, and most importantly, control over their own data.

Yingjun Wu:

Just 30 minutes ago, I posted a query on Twitter, asking users about their preferred data lake formats. Surprisingly, around 50% said they like using S3 without sticking to any specific format.

But I believe users will soon find this approach less efficient and start looking for better solutions, like data lakes with open formats. The need for such a solution comes from the desire for a unified approach across stream processing engines, messaging queues, OLAP solutions, Elasticsearch, and more. We need a standard way to access data; as a vendor, building connectors for each solution takes a lot of engineering effort.

From the user’s perspective, the absence of an open format brings similar challenges. They’d have to choose a solution that works with their current systems. But with a standardized data format, data becomes portable and accessible from any system. This change is set to make a big impact, and I predict even more adoption in 2024 compared to this year.

How will the data lakehouse ecosystem impact the future of streaming data infra?

Yingjun Wu:

I think the concept of a streaming lakehouse or a streaming data warehouse, promoted by some vendors, is a valid approach. In the past, people stored data in S3 using CSV files and then loaded it into Redshift for dashboarding. However, integrating a stream processor with the data lake changes the scenario. You can now directly ingest data from a streaming source, process it using a streaming platform, and store it in Iceberg. If you need further or real-time analytics, a system like Pinot, especially if it integrates with Iceberg, can be used. This way, all data has the potential to become streaming data.
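
To make the dataflow above concrete, here is a purely in-memory Python sketch of that pipeline: events arrive from a stream, a streaming step transforms each one as it lands, rows are committed to a table, and an analytical query reads the table. Every name here (`table`, `transform`, `ingest`, `query_total`) is an illustrative stand-in, not an API of Kafka, RisingWave, Iceberg, or Pinot.

```python
table = []  # stand-in for an Iceberg table (append-only committed rows)

def transform(event):
    """Streaming step: enrich each raw event as it arrives."""
    return {**event, "amount_cents": int(event["amount"] * 100)}

def ingest(stream):
    """Consume the stream and commit each processed row to the table."""
    for event in stream:
        table.append(transform(event))

def query_total(user):
    """Analytical read over the table, as a query engine would do."""
    return sum(r["amount_cents"] for r in table if r["user"] == user)

events = [
    {"user": "alice", "amount": 1.25},
    {"user": "bob",   "amount": 3.00},
    {"user": "alice", "amount": 0.75},
]
ingest(events)
print(query_total("alice"))  # 200
```

The point of the sketch is the shape of the flow: once every event passes through the streaming path on its way into the open table format, "all data has the potential to become streaming data."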

While some might have found streaming too complex or unnecessary before, possibly due to cost or the need for specialized systems, I think offering a solution that provides a user-friendly experience, such as open data formats, is likely to make users prefer real-time streaming solutions now.

Jason Reid:

I agree with what’s been said. It’s important to have one format for both real-time and batch analytics, making separate architectures unnecessary.

Looking ahead to 2024, I hope to see a real consolidation of the table-stream concept. Schema management and access control should become more standardized. We should worry less about things like data retention, disaster recovery, and governance, which used to be split between streaming and warehousing managed by different systems. The market’s demand for simplicity is likely to drive consolidation in this area.
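
The table-stream consolidation rests on a simple duality: a table is what you get by replaying a changelog of upserts and deletes in order, so the streaming view and the batch view can share one source of truth. A toy Python sketch of that idea (all names hypothetical):

```python
def materialize(changelog):
    """Fold a stream of change events into the table's current state."""
    state = {}
    for op, key, value in changelog:
        if op == "upsert":
            state[key] = value       # insert or overwrite the row
        elif op == "delete":
            state.pop(key, None)     # remove the row if present
    return state

# The stream: an ordered log of changes to an orders table.
changelog = [
    ("upsert", "order-1", {"status": "placed"}),
    ("upsert", "order-2", {"status": "placed"}),
    ("upsert", "order-1", {"status": "shipped"}),
    ("delete", "order-2", None),
]

# The table: the state implied by replaying the whole log.
print(materialize(changelog))  # {'order-1': {'status': 'shipped'}}
```

A batch engine querying the materialized state and a streaming engine tailing the changelog are then just two access patterns over the same data, which is what makes a single format for both plausible.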

Xiang Fu:

I also want to emphasize the idea of unifying stream and batch processing and having unified access. It’s a big challenge for many users and customers who have to learn different technologies and write different code for various tasks. Achieving this kind of unification is a major development that I can see coming.

The cost-effectiveness of cloud-hosted services and BYOC

Alexander Gallego:

We noticed a significant market trend over eight quarters, with the Total Cost of Ownership (TCO) becoming the main consideration for purchases. Recently, we successfully migrated a publicly traded customer from using about 384 Kafka brokers to just 24 Redpanda brokers.

BYOC in particular plays a crucial role in the TCO equation. There are notable optimizations, like clients now running similar workloads on 24 brokers compared to 384. However, the overall TCO picture also includes user management, administrative tasks, and other operational aspects of running the service. I believe BYOC will be a key trend moving forward, providing a cost-effective, fully managed cloud solution in the long term. This approach significantly reduces the costs associated with staff managing the cloud.

Yingjun Wu:

I agree with Alex’s point. At RisingWave, we provide solutions for both cloud and BYOC. Some organizations have strict regulations against storing data in public cloud services. To do business with them, we must respect their policies and adapt to their environment.

Some companies also have cloud credits from providers like Azure and prefer to use these credits. They may not want a direct relationship with a cloud vendor but can still use their credits through a marketplace. In certain cases, companies prefer separate billing rather than consolidating all expenses with a single vendor. This is another reason why BYOC is in demand, as it aligns with cost-efficiency and cost-cutting initiatives.

While some organizations choose BYOC due to regulatory constraints, others go for hosted cloud services because they prefer not to manage infrastructure themselves. As a vendor, it’s crucial for us to offer both options to meet diverse customer needs and preferences.

Alexander Gallego:

About two weeks ago, I had a conversation with a major cloud vendor. They disclosed an important issue related to users exceeding their allocated quotas. Despite a decline in product usage due to market conditions, users had committed to more resources, leading to a substantial bill at the end of the year. This emphasizes the importance of being able to use credits from cloud vendors, like AWS or GCP commitments, as a strong incentive. It’s common to have substantial sums, like $10 million, available, and not utilizing them can reflect unfavorably on the finance department’s image.

Jason Reid:

I think it all comes down to what you want to focus on. Do you aim for speed, innovation, and agility? In better market conditions, we usually prioritize innovation and speed to get products out there. In tougher economic times, it’s more like, ‘Let’s cut costs. We’ll buy cheaper hardware even if it’s slower, because speed doesn’t matter much.’ So, it seems like it’s just part of the regular ups and downs in the broader economy.

Xiang Fu:

I would say when we dive into the BYOC game in the streaming industry, it signals growth, increased awareness, and wider adoption. As we deal with larger customers who highly value this technology, more requirements come into play. For example, our initial customers emphasized that their data must stay within their cloud. That’s why, when developing our core solution, the first version was BYOC.

On the topic of cost-effectiveness, the SaaS pricing model is interesting. Some data hosting solutions have shifted from consumption-based billing to metrics like CPU cores, data size, and disk size — resembling a traditional infrastructure-based model. This approach brings pricing more in line with the vendor’s costs and simplifies customer calculations based on deployment size.
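
The contrast between the two pricing models can be sketched with deliberately made-up rates; none of the numbers below come from any vendor. Consumption-based billing scales with usage, while infrastructure-based billing tracks deployment size, which makes it easier for a customer to predict.

```python
def consumption_bill(gb_processed, rate_per_gb=0.10):
    """Usage-metered bill: pay per GB processed (rate is illustrative)."""
    return gb_processed * rate_per_gb

def infrastructure_bill(cpu_cores, disk_gb,
                        core_rate=30.0, disk_rate=0.05):
    """Deployment-sized bill: pay per core and per GB of disk per month."""
    return cpu_cores * core_rate + disk_gb * disk_rate

# A workload spike doubles the consumption bill, but the infrastructure
# bill stays flat for the same deployment.
print(consumption_bill(5_000))         # 500.0
print(consumption_bill(10_000))        # 1000.0
print(infrastructure_bill(16, 2_000))  # 580.0
```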

Is BYOC a suitable long-term strategy?

Yingjun Wu:

I see BYOC becoming the top choice for many organizations when it comes to storing and using data. Thinking about this, while I trust my colleagues and friends, it doesn’t mean giving them access to my home or finances. Control, especially in security and data compliance, is crucial.

Would customers desire a single, integrated system?

Yingjun Wu:

I believe customers probably prefer a single unified system, but the reality is that different products excel in different areas. Snowflake, for example, provides a batch system with some streaming capabilities, like a live or dynamic table. While these features represent streaming, they may not be the most robust solution for all needs. Serious users looking for optimal performance should choose the best system available in the market.

Alexander Gallego:

It’s widely recognized that excelling in one specific area requires a laser-focused approach. For example, it’s challenging to imagine a company like Redpanda excelling in multiple domains at the same time. I think that’s a significant challenge for any company.

Yingjun Wu:

That’s exactly why a solution like Iceberg is essential. Without it, as a vendor, building everything from scratch would be a time-consuming task, possibly taking a decade. By collaborating with other vendors, we can focus on our core development efforts instead of trying to address every possible market need.

Xiang Fu:

Currently, our main focus should be on our respective areas, and ensuring the seamless integration and accessibility of our solutions. Our primary concern revolves around usability and the developer experience, which from my perspective, remains the main hurdle for users when adopting streaming solutions.

Jason Reid:

I agree with all the points discussed here. While I may not be as deeply immersed in the streaming aspect as some others in this conversation, I’ve noticed that the main challenges in streaming revolve around observability and developer productivity, areas where streaming still lags behind batch systems. I see progress being made in closing this gap, and as it continues to improve, I anticipate greater adoption.

The impact of AI advancements on real-time inference

Jason Reid:

This concerns the impact of AI advancements on real-time inference for different applications and the potential for significant growth in streaming. I’d like to compare this perspective with the viewpoint of those who support localized inference, like edge computing, as an alternative to streaming data through a centralized inference system. Some suggest doing inference directly on the device or a similar approach. I’m curious to know if the panel has insights or opinions on these two contrasting opportunities.

Yingjun Wu:

In our company, the primary focus isn’t on that aspect. Some companies, like Pinot, may be more inclined to do so. Regarding edge computing, we’ve come across interesting systems, though not directly related to AI, such as DuckDB or similar solutions.

These are innovative solutions, but compared to cloud services like Pinot and other databases, they often lack a persistence layer. This means users can’t store data there, and in the event of a device failure, data recovery could be a challenge. With Pinot, on the other hand, data persistence is a key feature.

While there are certainly use cases for edge computation, we don’t believe it’s our current focus or that it will replace cloud services for these purposes.

Alexander Gallego:

We host some of the largest AI companies using Redpanda. I think incrementalism is crucial in stream processing: data is continuously processed as it arrives, in contrast to batch processing, which divides data into specific time intervals. Streaming’s incremental approach helps adapt to changes in the underlying data, reducing model-development costs for new AI companies and helping them generate revenue.

Consider ChatGPT’s model training, which may cost around $1 billion with numerous GPUs. Now, imagine releasing GPT-4 incrementally. You start with the first $100 million worth of training data and continuously refine the model using the cost-effective incrementalism of stream processing. This is particularly advantageous given the expensive nature of renting GPUs and CPUs for training these large models.
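
The incrementalism described here can be illustrated with something much smaller than model training: a running mean. The streaming-style update touches each value once, in O(1) per arrival, yet reaches exactly the same result as recomputing over the full batch. A minimal sketch:

```python
def batch_mean(values):
    """Batch style: recompute over the whole dataset each time."""
    return sum(values) / len(values)

class StreamingMean:
    """Streaming style: constant-time update as each value arrives."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        # Incremental mean update: no need to revisit earlier data.
        self.mean += (x - self.mean) / self.count
        return self.mean

stream = StreamingMean()
data = [4.0, 8.0, 6.0, 2.0]
for x in data:
    stream.update(x)

print(stream.mean, batch_mean(data))  # both 5.0
```

Incrementally refined models apply the same principle at vastly larger scale: fold new data into existing state instead of paying to recompute from scratch.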

How will stream processing impact AI/ML workloads in 2024?

Xiang Fu:

Currently, streaming systems play a significant role in immediate assistance, especially in areas like model monitoring, inspection, evaluation, and related aspects. These systems are valuable for helping individuals assess their model quality and other evaluation metrics. In vector-based use cases, integrating streaming with the freshest data is crucial for enhancing user experiences.

For example, when you ask ChatGPT about today’s events, it might not have access to the latest information. However, creating a system that can provide accurate real-time updates on current events is far more beneficial. This approach, focused on delivering fresh content to users, is more advantageous compared to a static large knowledge base.

What will be the next big thing in data Infrastructure in 2024?

Jason Reid:

I believe we’re still in the early stages of the concept of having an open format that connects everything within an ecosystem. I anticipate further maturity in this area, with larger enterprises adopting it more extensively, making it a standard architectural practice in 2024.

Yingjun Wu:

I think ensuring high consistency and addressing similar concerns will be crucial in 2024. Whether it’s about unifying batch and stream processing or optimizing the interaction between specialized streaming and batch systems, the key is to enhance their synergy. This involves dealing with various data types and creating robust failure recovery models to tackle the numerous challenges ahead.

Alexander Gallego:

As Jason said, we anticipate both unbundling and bundling trends. Currently, we are pushing compute down to Redpanda. I’m aware of at least one company building its entire business around this architectural style, using co-processors to enable lookup indexes like concurrent skip lists for timetable offsets.

I believe this trend will continue, and my prediction is that we will see the ability to deploy computation and business-level logic directly to the engines.
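
The compute-pushdown idea can be sketched as follows. Instead of shipping every record to the client and filtering there, the client deploys a predicate that the engine evaluates next to the data. The `Engine` class below is a hypothetical in-memory stand-in, not a real Redpanda API.

```python
class Engine:
    """Toy data engine holding records in memory."""
    def __init__(self, records):
        self.records = records

    def scan(self):
        """Classic path: return everything; the client filters."""
        return list(self.records)

    def scan_pushdown(self, predicate):
        """Pushdown path: evaluate the client's logic inside the engine."""
        return [r for r in self.records if predicate(r)]

engine = Engine([{"topic": "orders", "v": i} for i in range(1_000)])

# Client-side filtering moves 1,000 records over the wire conceptually;
# pushdown moves only the 10 that match.
client_side = [r for r in engine.scan() if r["v"] % 100 == 0]
pushed_down = engine.scan_pushdown(lambda r: r["v"] % 100 == 0)
print(len(client_side), len(pushed_down))  # 10 10
```

Deploying business-level logic to the engine, as predicted above, is this pattern generalized from simple predicates to arbitrary co-processors.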

Xiang Fu:

From my perspective, our actions are closely aligned with the growing familiarity and value people are getting from streaming systems. I’m optimistic and anticipate that the entire market could see significant growth, possibly up to 2.5 times its current size, next year.

Conclusion

The discussion with the four founders of data infrastructure startups focused on key trends in the industry for 2023: the significance of Bring Your Own Cloud (BYOC) for cost and complexity reduction, the rise of open data formats, and the potential impact of streaming systems on AI/ML workloads, particularly due to the rise of generative AI. Overall, the data industry is expected to see continued growth and innovation in 2024, driven by these trends.
