Ain’t No Database for All Your Needs

Başak Tuğçe Eskili
Published in Marvelous MLOps
Jul 15, 2024

Recently, I had a chance to play with different databases for a use case requiring low-latency record retrieval.

Among the popular lightweight choices are DuckDB, SQLite, and LevelDB. In this article, I will show how these databases perform across different query types and why we should focus on the needs of the project when choosing a database. There are many excellent database options, each designed to address different problems with unique strengths. To pick the right one for your needs, identify the most important feature for your use case, prioritize it, and select accordingly.

DuckDB: The Analytical Powerhouse

DuckDB is one of the most popular in-process SQL OLAP database management systems, specifically designed for analytical workloads. It promises high performance for complex query operations and aggregations. The main difference between OLAP and OLTP is how the data is laid out in storage. In an OLAP (columnar) database, the values of each column are stored next to one another, so reading and computing over whole columns is efficient. In contrast, in an OLTP (row-oriented) database, all values belonging to a record are stored close together, which is optimized for reading and writing individual rows. Since DuckDB is an OLAP database, complex queries on the data (joins, aggregations, grouping) are pretty efficient.
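
To make the columnar idea concrete, here is a minimal sketch (with a made-up DataFrame, not the dataset used later in this article) of DuckDB running an aggregation directly over an in-memory pandas DataFrame:

import duckdb
import pandas as pd

# Illustrative data only
sales = pd.DataFrame({
    "store": ["Store1", "Store1", "Store2", "Store2"],
    "amount": [10.0, 20.0, 5.0, 15.0],
})

# DuckDB can scan the in-scope DataFrame by name and run a columnar aggregation on it
result = duckdb.query(
    "SELECT store, SUM(amount) AS total FROM sales GROUP BY store"
).df()
print(result)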

However, DuckDB can disappoint when it comes to point lookups. The overhead of managing columnar storage and complex query optimization can lead to higher latency, which isn't ideal for simple retrievals.

Persistent indexing is now possible in DuckDB, and to improve performance for point lookups (by primary key, unique identifier, etc.) it's recommended to use an ART (Adaptive Radix Tree) index. An ART index works by maintaining a secondary copy of the data in a compact, tree-based structure that enables fast lookups. Since it maintains a secondary copy of the data, it can also complicate data processing, especially in the presence of transactions.
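
As a minimal sketch (the table and column names are illustrative), an ART-backed index can be created in DuckDB like this:

import duckdb

con = duckdb.connect()  # in-memory database

# Declaring a PRIMARY KEY creates an ART index on that column
con.execute("""
    CREATE TABLE transactions (
        TransactionID INTEGER PRIMARY KEY,
        CustomerID VARCHAR,
        Price DOUBLE
    )
""")

# An explicit CREATE INDEX also builds an ART index
con.execute("CREATE INDEX idx_customer ON transactions (CustomerID)")

# Point lookups on indexed columns can now avoid a full table scan
con.execute("SELECT * FROM transactions WHERE TransactionID = 42").fetchall()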

Ideal use cases: Analytical workloads, data science, and machine learning.

SQLite: The Lightweight Champion

SQLite is a widely used, lightweight, disk-based database. It's simple and easy to set up: it is a serverless database that does not require any additional installation. Another advantage is that SQLite takes up little disk space and memory, which makes it suitable for embedded systems, IoT applications, and mobile devices.

However, SQLite has limitations when it comes to scalability and is not ideal for large-scale applications with high concurrency. Writes are serialized (only one writer at a time), so if multiple users try to write to the database simultaneously, you may see lower performance.

SQLite uses B-Tree indexes for efficient searching, sorting, and range queries. Indexes can be created on one or more columns of a table. Proper indexing in SQLite can significantly improve query performance, especially for large tables, because it reduces the amount of data that needs to be scanned to find the relevant rows. However, indexing can slow down write operations (inserts, updates, deletes), as the index needs to be updated as well.
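
As a quick sketch (the table mirrors the example used later in this article):

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE transactions (
        TransactionID INTEGER PRIMARY KEY,  -- already indexed as the rowid alias
        CustomerID TEXT,
        Price REAL
    )
""")

# A B-Tree index on CustomerID speeds up lookups and range scans on that column,
# at the cost of slightly slower inserts/updates/deletes
con.execute("CREATE INDEX idx_customer ON transactions (CustomerID)")

# EXPLAIN QUERY PLAN shows whether SQLite actually uses the index
for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM transactions WHERE CustomerID = 'Customer1'"
):
    print(row)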

Ideal use cases: Mobile applications, embedded systems, small to medium-sized websites.

LevelDB: The High-Performance Key-Value Store

LevelDB is a fast key-value storage library written at Google, providing an ordered mapping from string keys to string values. It is designed for high performance and concurrency and is optimized for fast read and write operations, which makes it ideal for high-throughput applications. Since it is a key-value store, it doesn't support SQL queries, so custom query logic has to be implemented, and it lacks advanced features such as joins, transactions, and complex queries. LevelDB does not come with built-in secondary indexing either.
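
A minimal sketch of this key-value style of access, using the same leveldb Python package as the hands-on section below (the keys, values, and database path are illustrative):

import leveldb

db = leveldb.LevelDB("example_db")  # creates the on-disk directory if it doesn't exist

# Keys and values are raw bytes; serialization (JSON here) is up to the application
db.Put(b"user:42", b'{"name": "Alice"}')
print(db.Get(b"user:42"))  # prints the stored JSON bytes

# Ordered iteration over a key range replaces what a SQL WHERE clause would do
for key, value in db.RangeIter(key_from=b"user:", key_to=b"user:\xff"):
    print(key, value)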

Ideal use cases: High-throughput applications, caching systems, log storage.

Hands-on

Below is an example analysis made with a synthetic dataset. You can find the full notebook here.

1. Create a random dataset with 1 million rows.

import pandas as pd
import random
from datetime import datetime, timedelta

# Function to generate a random datetime between two dates
def random_date(start, end):
    return start + timedelta(
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

# Generate a sample dataset
data = []
start_date = datetime(2020, 1, 1)
end_date = datetime(2023, 1, 1)

for i in range(1000000):
    transaction_datetime = random_date(start_date, end_date)
    data.append([
        i,  # TransactionID (Primary Key)
        random.choice(['Customer1', 'Customer2', 'Customer3', 'Customer4', 'Customer5']),  # CustomerID
        transaction_datetime,  # TransactionDatetime
        random.choice(['Product1', 'Product2', 'Product3', 'Product4', 'Product5']),  # ProductID
        random.randint(1, 10),  # Quantity
        round(random.uniform(10, 1000), 2),  # Price
        random.choice(['Store1', 'Store2', 'Store3', 'Store4', 'Store5']),  # StoreID
        round(random.uniform(1, 50), 2)  # Discount
    ])

columns = [
    'TransactionID', 'CustomerID', 'TransactionDatetime', 'ProductID', 'Quantity', 'Price', 'StoreID', 'Discount'
]

df = pd.DataFrame(data, columns=columns)

# Display the first few rows of the dataframe to understand its structure
df.head()

2. Create the 3 databases and populate them

import duckdb
import sqlite3
import leveldb

# Setup DuckDB (in-memory) and load the DataFrame as a table
duckdb_conn = duckdb.connect(database=':memory:')
duckdb_conn.execute("CREATE TABLE transactions AS SELECT * FROM df")

# Function to execute a DuckDB query
def duckdb_query(query):
    return duckdb_conn.execute(query).fetchall()


# Setup SQLite (in-memory) and load the DataFrame as a table
sqlite_conn = sqlite3.connect(':memory:')
df.to_sql('transactions', sqlite_conn, if_exists='replace', index=False)

# Function to execute an SQLite query
def sqlite_query(query):
    cursor = sqlite_conn.cursor()
    cursor.execute(query)
    return cursor.fetchall()


# Setup LevelDB (on-disk directory)
db = leveldb.LevelDB('leveldb_test')

# Write data to LevelDB: TransactionID as key, JSON-encoded row as value
batch = leveldb.WriteBatch()
for idx, row in df.iterrows():
    key = str(row['TransactionID']).encode()
    value = row.to_json().encode()
    batch.Put(key, value)
db.Write(batch, sync=True)

3. Create different types of queries

# Function to generate a query with 30 random TransactionIDs
def generate_random_keys_query():
    random_transaction_ids = random.sample(range(1, 10001), 30)  # sample 30 IDs from the first 10,000 TransactionIDs
    transaction_id_str = ', '.join(f"'{id}'" for id in random_transaction_ids)
    return f"SELECT * FROM transactions WHERE TransactionID IN ({transaction_id_str})"

# Example queries to test
def get_queries():
    return [
        "SELECT * FROM transactions WHERE TransactionID = '1'",  # Query by ID
        "SELECT * FROM transactions WHERE TransactionID = '1' AND StoreID = 'Store1'",  # Query by ID and another column
        generate_random_keys_query(),  # Query with multiple keys
        "SELECT CustomerID, SUM(Price * Quantity) FROM transactions GROUP BY CustomerID"  # Aggregation query
    ]

Here we create 4 different query types.

  • Query by ID
  • Query by ID with a condition based on another column
  • Query by multiple IDs (randomly generated)
  • Aggregation Query
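
These SQL strings work for DuckDB and SQLite, but LevelDB has no SQL layer, so the full notebook implements hand-written equivalents (leveldb_query_single, leveldb_query_multiple, leveldb_query_aggregation). Their exact implementation isn't shown here; a minimal sketch, assuming the JSON-encoded rows written above, could look like this:

import json

def leveldb_query_single(transaction_id, store_id=None):
    # Point lookup by key, optionally filtered on another column
    record = json.loads(bytes(db.Get(str(transaction_id).encode())))
    if store_id is not None and record["StoreID"] != store_id:
        return None
    return record

def leveldb_query_multiple(transaction_ids):
    # Fetch several keys one by one
    return [json.loads(bytes(db.Get(str(tid).encode()))) for tid in transaction_ids]

def leveldb_query_aggregation():
    # Full scan: sum Price * Quantity per customer
    totals = {}
    for _, value in db.RangeIter():
        record = json.loads(bytes(value))
        totals[record["CustomerID"]] = (
            totals.get(record["CustomerID"], 0) + record["Price"] * record["Quantity"]
        )
    return totals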

4. Execute Queries

We run each query 100 times and take the average for each query and database (except the aggregation on LevelDB: since aggregation takes quite long there, we run that query only once).
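
The measure_query_time helper comes from the full notebook; a simple version could be:

import time

# Run query_func(*args) once and return (elapsed seconds, result)
def measure_query_time(query_func, *args):
    start = time.perf_counter()
    result = query_func(*args)
    return time.perf_counter() - start, result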

# LevelDB conditions
leveldb_conditions = [
    (1, None),
    (1, 'Store1'),
]

# Function to run queries multiple times and return the average time
def average_query_time(query_func, query_generator, runs=100):
    total_time = 0
    for _ in range(runs):
        query = query_generator() if callable(query_generator) else query_generator
        elapsed_time, _ = measure_query_time(query_func, query)
        total_time += elapsed_time
    return total_time / runs

# Function to run LevelDB queries multiple times and return the average time
def average_leveldb_time(query_func, args, runs=1):
    total_time = 0
    for _ in range(runs):
        elapsed_time, _ = measure_query_time(query_func, *args)
        total_time += elapsed_time
    return total_time / runs

# Measure query times
queries = get_queries()
duckdb_times = [average_query_time(duckdb_query, q) for q in queries]
sqlite_times = [average_query_time(sqlite_query, q) for q in queries]

leveldb_times = [
    average_leveldb_time(leveldb_query_single, leveldb_conditions[0]),
    average_leveldb_time(leveldb_query_single, leveldb_conditions[1]),
    average_leveldb_time(leveldb_query_multiple, [random.sample(range(1, 10001), 30)]),
    average_leveldb_time(leveldb_query_aggregation, [])
]

5. Results

Conclusion

DuckDB and SQLite are better suited for SQL-based operations and analytical queries. SQLite greatly benefits from indexing, so it's wise to create proper indexes for point-lookup operations. DuckDB outperforms SQLite for complex aggregations, but it is less efficient for point lookups even with indexing. LevelDB is highly promising for key-value operations and offers atomic batched writes; however, it lacks SQL support, making it less convenient for SQL-based analytics.
In this article, we simply picked and compared 3 different databases. Depending on your needs, you can follow a similar approach and compare other databases.
