On the Quest for Extremely Fast Data Ingestion

RTInsights Team · Published in RTInsights · 6 min read · Apr 6, 2023

By: Elisabeth Strenger

CloudDataInsights (CDI) caught up with Adi Gelvan, Co-founder and CEO of Speedb, at a pivotal moment: the company was about to re-launch itself as an open-source company. The quest for a high-performing storage engine led to open source and a new approach to understanding and solving common user challenges. There were many months of research, of transitioning software engineers to a new mode of thinking and working, and of building the community that will engage with the technology and collaborate in its evolution. This is the open-source quest for extremely fast data ingestion.

CDI: Going all in on open source is a huge decision to make with lots of implications for your current customers and future users. How did you come to that decision?

Adi Gelvan: I think that being an open-source company really starts from being authentic. We had an interesting path to open source, and honestly, it wasn’t a business decision or a strategic decision. It was a path we went through. When we first incorporated the company, we were trying to find a solution to a very painful problem, and the path took us to open source.

We started the company around our cutting-edge technology. We had found a problem that thousands of companies are struggling with: there was no scalable, high-performance, embedded key-value store. The two market-leading technologies that addressed this gap, LevelDB (open source but designed by Google for Google) and RocksDB (designed by Facebook for Facebook), are great, but they weren’t meant to be general-purpose solutions.

CDI: So, the existing solutions addressed very specific data sets, data ecosystems, and use cases. That seems to contradict the open-source approach. Can you tell us more?

Adi Gelvan: Yes, so they worked well for very specific workloads. They do a wonderful job for Facebook and Google and some companies like them. But thousands of customers are using them and not getting their merits because there are capabilities that are missing. When we came up with our hybrid compaction and the new technology that allowed us to develop Speedb, which can outperform LevelDB and RocksDB and deliver a lot of value for various workloads with large data sets, we said, “We’ve got it, now let’s start selling it.” Big companies were interested in our “secret sauce,” but in the end, companies are not the actual customers. It’s actually the person who presses Enter or writes the code to embed it–typically a developer. Well, developers don’t buy secret sauce even if it’s ten times better or a hundred times better.

There’s a saying in Hebrew that if three people are telling you that you’re a donkey, go find some grass to eat. Well, I needed about 300 people to tell me that, not just three.

Enough customers are saying, “I will buy it, just let me try it. Let me contribute. Let me actually help you do something that will fit my needs.” We said, “Okay.” If we had continued to work on our secret sauce behind closed doors, we might have done something wrong. That’s what led us to change not just the face of the company but the heart of the company.

Speedb developers, who hadn’t spoken to people in months, are now actively answering questions from the community on Discord. People whose code no one ever saw are now open-sourcing it.

CDI: Engaging the community is probably also key to developing a versatile product, one that is agnostic because enterprise-driven development organizations often try to satisfy the biggest customer. The community can balance that influence and keep development moving in a generally acceptable direction. Are you seeing this balancing effect?

Adi Gelvan: So I think that the essence of why we exist is because people who developed this technology were developing for the customer “du jour.” They did it for themselves. And if you see the biggest customers, by the way, the big giants who took RocksDB and forked it to their own needs, they also repeat stuff for their own needs. And you have giants like Alibaba and ByteDance, the mothership of TikTok. They have their own version and are doing their own stuff. But as they do their work in a silo, no one from the community is actually gaining from their work.

What we’re all about is bringing value to the community. And who is the community? These are developers who are excited about LSM and RocksDB but also want to bring value to their own mothership. They want some features that are not of interest to Facebook or ByteDance, but they have their own needs. In the RocksDB project, there were around 300 pull requests waiting for months to years for action. We’re gradually embedding these into the Speedb open-source code. This is part of the community-building we are doing today.

Here is an interesting example: we get more and more voices from the community, from bigger customers, like the biggest chip makers you can think of, who have been waiting two years for two pull requests. We’re taking care of these right now, and we’re so glad that they are talking to us. That means that they also believe in what we’re doing. And I’m certain that a thousand engineers outside a company know much more than the smartest people on earth on the inside, simply because the outside engineers are your target customers. They are the users.

CDI: Now for a few technical questions. Why was including metadata so important to you or to your customers?

Adi Gelvan: Most of the data moving over networks today is metadata. The ratio between data and metadata in the past decade has dramatically changed because of connected devices and IoT.

For example, if you have a page of temperature readings then the metadata is the location, the height, etc. Sometimes it’s ten times the size of the temperature reading data. Now, if you can’t access the metadata, you can’t access the data. Legacy systems were simply not built to accommodate this volume of metadata.
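To make the ratio concrete, here is a minimal sketch in Python, using a hypothetical IoT temperature reading (the field names and values are invented for illustration, not taken from Speedb or any real system): the measurement itself is tiny, while the metadata describing it can be several times larger.

```python
import json

# Hypothetical sensor record: the actual data is one temperature value.
reading = {"temp_c": 21.7}

# The metadata describing that reading (all fields are illustrative).
metadata = {
    "device_id": "sensor-0042",
    "location": {"lat": 40.7128, "lon": -74.0060},
    "altitude_m": 10.5,
    "unit": "celsius",
    "firmware": "2.3.1",
    "timestamp": "2023-04-06T12:00:00Z",
}

data_bytes = len(json.dumps(reading))
meta_bytes = len(json.dumps(metadata))

# The metadata is several times the size of the reading it describes.
print(f"metadata is {meta_bytes / data_bytes:.1f}x the size of the data")
```

Scale this to millions of connected devices and the metadata, not the readings themselves, dominates what the storage engine has to handle.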

CDI: One of the techniques Speedb uses to handle large amounts of metadata is hybrid compaction. Can you explain what this is?

Adi Gelvan: Compaction is a critical process within a data structure called a log-structured merge-tree (LSM tree). The LSM tree has layers. Within the layers, you have SST files. When a level is filled with SST files, you join them together and write them to the next level as a bigger SST file. The data then takes less space when merged into one large file. This process is called compaction. Now, Google, working with academia, invented the LSM tree. Facebook took it one step further, but everyone who tried to improve this mechanism was looking at it on an X and Y axis, so only in two dimensions.
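The mechanics described above can be sketched with a toy model. This is not Speedb's or RocksDB's implementation, just a minimal Python illustration where each "SST file" is a dict, each level is a list of files, and filling a level triggers a merge into the next level:

```python
LEVEL_CAPACITY = 4  # max files per level before compaction triggers (toy value)

def compact(levels, level):
    """Merge all files in `level` into one bigger file and push it down."""
    merged = {}
    for sst in levels[level]:
        merged.update(sst)  # later (newer) files win for duplicate keys
    levels[level] = []
    levels[level + 1].append(merged)

def put(levels, sst):
    """Write a new file to level 0, compacting any level that fills up."""
    levels[0].append(sst)
    for i in range(len(levels) - 1):
        if len(levels[i]) >= LEVEL_CAPACITY:
            compact(levels, i)

levels = [[], [], []]
for i in range(8):
    put(levels, {f"key{i}": i})

# After 8 writes with capacity 4, level 0 was compacted twice,
# leaving two merged files on level 1.
print([len(level) for level in levels])  # → [0, 2, 0]
```

The key point for the interview: every compaction rewrites data that was already written once, which is exactly where write amplification comes from.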

Essentially, what Speedb did was look at the LSM tree in a multidimensional way and divide every level into multiple levels, which gave us another level of improvement in the compaction process. That’s the essence of hybrid compaction. Bottom line, the most important measure of compaction efficiency is the write amplification factor (WAF), which essentially means how many physical writes the system does for each logical write, and here we were able to improve it from 30x to 5x.
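To see what that WAF improvement means in practice, here is the arithmetic spelled out, using the 30x and 5x figures cited in the interview (the 100 GB workload size is an invented example):

```python
# WAF = physical bytes written to the device per logical byte the
# application writes. Illustrative arithmetic only.
logical_writes_gb = 100   # hypothetical: the application writes 100 GB

waf_before = 30           # figure cited for classic compaction
waf_after = 5             # figure cited for Speedb's hybrid compaction

physical_before = logical_writes_gb * waf_before  # 3000 GB hit the device
physical_after = logical_writes_gb * waf_after    # 500 GB hit the device

print(physical_before / physical_after)  # → 6.0
```

So for the same logical workload, the device absorbs 6x fewer physical writes, which translates directly into higher sustained ingestion throughput and less flash wear.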

CDI: Tell us about the work you’ve done on the storage engine to support very fast writes.

Adi Gelvan: The underlying engine of every database or application is the storage engine, which takes the data that is written into the database and writes it into the underlying storage, which can be media, flash, a file system, or the S3 protocol. Storage engines were often overlooked because they were very simple: all they did was handle the metadata and make sure the data was written in the right position. But today, when metadata started exploding and creating a real burden on every appl…

Continued on CloudDataInsights.com
