An Overview of Databases — Part 6: Different Solutions for Different Problems

Saeed Vayghani
13 min read · Jul 29, 2024


Part 1: DBMS Flow
Part 2: Non-Relational DB vs Relational

Part 3: CAP and BASE Theorem

Part 4: How to choose a Database?
Part 5: Different Solutions for Different Problems
Part 6: Concurrency Control
Part 7: Distributed DBMS
>> Part 7.1: Distributed DBMS (Apache Spark, Parquet + Pyspark + Node.js)
Part 8: Clocks
>> Part 8.1: Clocks (Causal Consistency With MongoDB)
>> Part 8.2: Clocks (MongoDB Replica and Causal Consistency)
Part 9: DB Design Mastery
Part 10: Vector DB
Part 11: An interesting case, coming soon!

A Few Practical Examples:

Let’s not just imagine hypothetical situations; let’s consider actual real-world issues. We should think about these problems, describe them in detail, identify the limitations, and then brainstorm a few potential solutions for each one.

Problem #1: Real-Time Exclusion of IDs While Parsing Millions of Records
Problem #2: OLAP-Related Issues and Reporting from Huge Databases
Problem #3: Counting Unique Events or IDs
Problem #4: Hierarchy and Directories Management
Problem #5: Session Store Management
Problem #6: Managing Time Series Data
Problem #7: File Storage for Streaming Services
Problem #8: Efficiently Handling SPARSE Columns
Problem #9: Tracking and Storing Logs of Data Changes
Problem #10: Storing Semi-Schema-Less Documents
Problem #11: Ad Tech for Real-Time Bidding

Problem #1: Real-Time Exclusion of IDs/Emails/Strings While Parsing Millions of Records

Description: A service needs to exclude certain IDs, emails, or strings instantaneously while parsing millions of records.

Constraint: The blacklist database is constantly being updated.

Example: Sending bulk text messages to users, where certain users need to be excluded based on their ID or email.

Solutions:

1. Redis Built-In Data Structures:

Set: Redis Sets can be used to store unique blacklisted items, allowing for quick membership checks and exclusions.
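
For example, a minimal sketch using the node-redis client (v4-style API assumed; the key name blacklist:emails is illustrative):

const redis = require('redis');

async function run() {
  const client = redis.createClient();
  await client.connect();

  // Add blacklisted emails to a Redis Set
  await client.sAdd('blacklist:emails', ['user1@example.com', 'user2@example.com']);

  // O(1) membership check while parsing records
  const skip = await client.sIsMember('blacklist:emails', 'user1@example.com');
  console.log(skip ? 'Skip this record' : 'Safe to process');

  await client.quit();
}

run().catch(console.error);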

Bitset: Ideal for blacklists of numerical IDs. A Bitset efficiently represents a set of integers as individual bits, and Redis Bitsets can address offsets of up to 2^32. Because each ID costs only a single bit, the memory savings are substantial, which makes this approach attractive when memory cost matters.

const redis = require('redis');

const client = redis.createClient();
client.on('error', (err) => console.error('Redis error:', err));

const BITSET_LIMIT = Math.pow(2, 32); // Redis Bitsets address offsets up to 2^32 - 1

async function addToBlacklist(id) {
  if (id >= BITSET_LIMIT) {
    throw new Error(`ID ${id} is beyond the Bitset limit of 2^32.`);
  }
  await client.setBit('blacklist', id, 1); // flip the bit at offset `id`
}

async function isBlacklisted(id) {
  if (id >= BITSET_LIMIT) {
    throw new Error(`ID ${id} is beyond the Bitset limit of 2^32.`);
  }
  return (await client.getBit('blacklist', id)) === 1;
}

async function main() {
  await client.connect();
  await addToBlacklist(1002);
  await addToBlacklist(1004);
  console.log(await isBlacklisted(1002)); // true
  console.log(await isBlacklisted(1003)); // false
  await addToBlacklist(Math.pow(2, 32)); // throws: beyond the Bitset limit
}

main().catch(console.error).finally(() => client.quit());

Bloom Filters: A space-efficient probabilistic data structure that can test whether an element is a member of a set. Bloom filters guarantee that false negatives will not occur, but there can be false positives. They are useful for quickly checking if an element is not in the blacklist while saving space.

const { BloomFilter } = require('bloomfilter');

const bloom = new BloomFilter(
  32 * 256, // number of bits to allocate
  16        // number of hash functions
);

// Example blacklist
const blacklist = ["user1@example.com", "user2@example.com"];
blacklist.forEach(email => bloom.add(email));

function checkAndSendMessage(email) {
  if (!bloom.test(email)) {
    console.log("Sending message to:", email);
  } else {
    console.log("Skipped (possibly blacklisted):", email);
  }
}

checkAndSendMessage("user1@example.com"); // Should be skipped (blacklisted)
checkAndSendMessage("user3@example.com"); // Should be sent (not in the blacklist)

2. MySQL MEMORY Engine:
The MySQL MEMORY storage engine creates tables that are stored in memory. This is beneficial for fast lookups and quick updates, making it suitable for maintaining a frequently updated blacklist.

CREATE TABLE temp_table (
  id INT AUTO_INCREMENT,
  name VARCHAR(100),
  value VARCHAR(255),
  PRIMARY KEY (id)
) ENGINE=MEMORY;

3. RAM-Based Database:
An old-school hack involves creating a RAM disk partition and installing a database on it. This ensures that the database operations are extremely fast, though the data is volatile and will be lost on a system reboot.

# Create a partition in RAM and install your DB there

# Create a mount point
sudo mkdir /mnt/ramdisk
sudo mount -t tmpfs -o size=1G tmpfs /mnt/ramdisk

# Verify the RAM disk
df -h

# Move your database to the RAM disk
# (install or point your DB data files at this RAM disk)

4. Hybrid Approach:
Store cold data (less frequently accessed) on disk/RDS (Relational Database Service) and warm data (frequently accessed) in Redis. This leverages the speed of in-memory databases for current data and the persistence of disk-based storage for older data.
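
As a rough sketch of the lookup path (assuming the node-redis v4 client from earlier and a hypothetical sqlLookup helper backed by the disk/RDS store):

// Check the warm blacklist in Redis first; fall back to the cold store on a miss.
async function isExcluded(client, sqlLookup, id) {
  const warmHit = await client.sIsMember('blacklist:warm', String(id));
  if (warmHit) return true;

  // Cold path: query the disk-based store (RDS), then warm the cache for next time.
  const coldHit = await sqlLookup(id); // hypothetical helper, not a real library call
  if (coldHit) {
    await client.sAdd('blacklist:warm', String(id));
  }
  return coldHit;
}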

Problem #2: OLAP-Related Issues and Reporting from Huge Databases

Description: Obtaining reports from a huge database, typical of Online Analytical Processing (OLAP) workloads. Examples include generating analytics reports from large datasets.

Constraint: Having a huge database.

Example: Analytics events.

Solutions:

1. Columnar Databases:
Columnar databases store data in columns rather than rows. This is highly efficient for read-heavy operations, such as aggregations and large-scale data scans, which are common in OLAP workloads.

Benefits:

  • Improved I/O efficiency, as only the relevant columns are read.
  • Better compression rates, as similar data types are stored together.
  • Faster query performance for analytical queries.
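
As an illustration only, here is a hedged sketch using ClickHouse, one popular columnar engine (table and column names are made up):

-- Data is stored column by column, so this aggregation only reads
-- the event_date, country, and revenue columns.
CREATE TABLE analytics_events (
  event_date Date,
  user_id    UInt64,
  country    String,
  revenue    Float64
) ENGINE = MergeTree()
ORDER BY (event_date, user_id);

SELECT country, sum(revenue) AS total_revenue
FROM analytics_events
WHERE event_date >= today() - 30
GROUP BY country
ORDER BY total_revenue DESC;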

Problem #3: Counting Unique Events or IDs

Constraint: Low-cost storage.

Example: Counting the number of unique strings, such as KRNs or IDs.

Solutions:

1. SQL Built-In Counter
Using a traditional SQL database, you create a table dedicated to tracking each unique string or event and increment a counter column on every occurrence (a minimal sketch follows).
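
A minimal sketch in PostgreSQL-flavored SQL (table and column names are illustrative):

CREATE TABLE event_counters (
  event_key VARCHAR(255) PRIMARY KEY,
  hit_count BIGINT NOT NULL DEFAULT 0
);

-- Upsert: insert the key the first time it is seen, otherwise increment its counter
INSERT INTO event_counters (event_key, hit_count)
VALUES ('user-123:login', 1)
ON CONFLICT (event_key) DO UPDATE
SET hit_count = event_counters.hit_count + 1;

-- Number of distinct keys tracked so far
SELECT COUNT(*) FROM event_counters;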

2. Cassandra
Cassandra has a built-in counter feature: counters are a special column type that stores a number which can be incremented or decremented.

CREATE TABLE page_views (
  page_id uuid PRIMARY KEY,
  view_count counter
);

UPDATE page_views SET view_count = view_count + 1 WHERE page_id = your_page_id;
UPDATE page_views SET view_count = view_count - 1 WHERE page_id = your_page_id;
SELECT view_count FROM page_views WHERE page_id = your_page_id;

3. Redis Counter
In-memory storage offers very fast reads and writes, making it simple to implement incremental counters for tracking unique strings or events (see the sketch below).
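
A minimal sketch with ioredis (the client used later in this section); key names are illustrative:

const Redis = require('ioredis');
const redis = new Redis();

async function trackEvent(eventKey) {
  // Plain counter: total occurrences of this key
  await redis.incr(`counter:${eventKey}`);

  // Exact unique counting: add the key to a Set and read its cardinality
  await redis.sadd('events:unique', eventKey);
  return redis.scard('events:unique');
}

trackEvent('user-123:login').then((uniques) => console.log('unique keys:', uniques));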

4. HyperLogLog (available in Redis and several other DBMSs)
HyperLogLog is an algorithm for efficiently estimating the cardinality (the number of unique elements) of very large datasets using minimal memory. It provides a close approximation with a standard error of about 0.81%, making it highly effective for counting unique items in big-data analytics with significant space savings.

const Redis = require('ioredis');
const redis = new Redis();

async function addUniqueEvent(eventKey, userId) {
  await redis.pfadd(eventKey, userId);
}

async function countUniqueEvents(eventKey) {
  return redis.pfcount(eventKey);
}

const eventKey = 'my_unique_key_name';
const userId = 'user-123';

addUniqueEvent(eventKey, userId)
  .then(() => countUniqueEvents(eventKey))
  .then((count) => console.log('Unique events:', count));

Problem #4: Hierarchy and Directories Management

Description: Creating a custom service akin to S3 presents a challenge in handling hierarchies or directories efficiently, especially when dealing with limitations on the number of files per directory.

Constraint: There’s a cap on how many files can reside in a single directory, complicating scalable and efficient data organization.

Example: An application that lets users upload and organize vast numbers of files into complex folder structures, mirroring the functionality of services like Amazon S3.

Solutions:

1. Neo4j: Using Neo4j, a graph database, can tackle the directory management challenge. Files and directories are nodes interconnected by relationships, enabling swift navigation and updates within intricate hierarchical structures. This setup excels at managing deeply nested folders by facilitating efficient queries and operations.
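
A hedged Cypher sketch of the idea (node labels, relationship type, and property names are made up):

// Create a directory tree linked by CONTAINS relationships
CREATE (root:Directory {name: 'root'})
CREATE (docs:Directory {name: 'documents'})
CREATE (file:File {name: 'report.pdf', size: 1048576})
CREATE (root)-[:CONTAINS]->(docs)
CREATE (docs)-[:CONTAINS]->(file);

// List everything reachable under root, however deeply nested
MATCH (r:Directory {name: 'root'})-[:CONTAINS*]->(item)
RETURN item.name;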

2. ArangoDB: As a versatile multi-model database, ArangoDB leverages its graph capabilities for directory and hierarchical data management. It combines graph, document, and key-value data models to efficiently organize and query files and their metadata. ArangoDB’s querying language, AQL, supports complex operations across nested structures, thus providing a scalable solution for extensive directory systems.

const { Database, aql } = require('arangojs');

// Connect to ArangoDB
const db = new Database({
  url: "http://localhost:8529",
  databaseName: "databaseName",
  auth: { username: "username", password: "password" },
});

// Collection name
const collectionName = "Directories";

async function createCollection() {
  const collection = db.collection(collectionName);
  const exists = await collection.exists();
  if (!exists) {
    await collection.create();
  }
}

// Insert a directory or file, optionally referencing its parent
async function insertFileOrDirectory(name, parent = null) {
  const result = await db.query(aql`
    INSERT { name: ${name}, parent: ${parent} } INTO ${db.collection(collectionName)}
    RETURN NEW
  `);
  return result.next();
}

async function main() {
  await createCollection();
  await insertFileOrDirectory("root");
  await insertFileOrDirectory("subdirectory1", "root");
}

main().catch((error) => console.error(error));

Problem #5: Session Store Management

Description: Securely and efficiently managing session data, like authentication tokens, is key in web applications to maintain user sessions.

Constraint: The solution needs to offer fast access and scalability while ensuring data integrity and the ability to automatically remove expired sessions.

Example: A web service needs to efficiently verify user sessions with fast retrieval of authentication tokens and manage session expiration seamlessly.

Solutions:

DynamoDB: Provides low latency and scalability, with built-in item expiration (TTL), making it ideal for managing session data in large-scale applications.

Redis: An in-memory data store offering sub-millisecond response times, perfect for session management due to its fast data access and support for key expiration.

Cassandra: Known for its scalability and high availability, it supports fast writes and reads, and row-level TTL, fitting for large volume session data handling.

Riak: A distributed NoSQL key-value database designed with high availability, fault tolerance, and distributed data storage at its core. It excels in environments where operational simplicity and reliability are critical, and for session data management its masterless, highly available design keeps sessions readable and writable even when individual nodes fail.
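
For instance, a minimal Redis-based session store sketch using the node-redis v4 client (key prefix and TTL value are illustrative):

const redis = require('redis');
const client = redis.createClient();

async function saveSession(sessionId, payload) {
  // EX sets a TTL in seconds, so expired sessions disappear automatically
  await client.set(`session:${sessionId}`, JSON.stringify(payload), { EX: 3600 });
}

async function loadSession(sessionId) {
  const raw = await client.get(`session:${sessionId}`);
  return raw ? JSON.parse(raw) : null;
}

async function main() {
  await client.connect();
  await saveSession('abc123', { userId: 42, token: 'opaque-auth-token' });
  console.log(await loadSession('abc123'));
  await client.quit();
}

main().catch(console.error);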

Problem #6: Managing Time Series Data

Description: Efficiently storing and retrieving time series data, such as social network timelines, presents unique challenges due to the high volume and frequency of data points generated over time.

Constraint: The chosen solution must handle large datasets efficiently, support rapid data insertion, and facilitate quick retrieval of sequential data points.

Example: A social media platform needs to efficiently manage user activity timelines, including posts, likes, and comments, indexed by time.

Solutions:

SQL Composite Primary Key: Utilizing a relational database with a composite primary key (including a time component) can organize time series data efficiently. This approach benefits from relational DBMS’s maturity and flexibility in querying but might require careful schema design and indexing to scale.
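
A hedged DDL sketch of the composite-key approach (PostgreSQL syntax; table and column names are made up):

CREATE TABLE timeline_events (
  user_id     BIGINT      NOT NULL,
  occurred_at TIMESTAMPTZ NOT NULL,
  event_type  VARCHAR(32) NOT NULL, -- post, like, comment, ...
  payload     JSONB,
  PRIMARY KEY (user_id, occurred_at, event_type)
);

-- One user's recent timeline, served by the primary key index
SELECT *
FROM timeline_events
WHERE user_id = 42
  AND occurred_at >= now() - interval '1 day'
ORDER BY occurred_at DESC;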

InfluxDB: A purpose-built time series database designed for high write and query performance. It efficiently handles time-stamped data, offering built-in functions for time series analysis. InfluxDB is ideal for applications requiring real-time analytics and has features like data retention policies to manage data growth.

const { InfluxDB, Point } = require('@influxdata/influxdb-client');

const url = 'http://localhost:8086';
const token = 'yourAuthToken';
const org = 'yourOrg';
const bucket = 'yourBucket';

const client = new InfluxDB({ url, token });

async function writeData() {
  const writeApi = client.getWriteApi(org, bucket);
  writeApi.useDefaultTags({ host: 'host' });

  const point = new Point('social_media')
    .tag('user', 'user123')
    .intField('likes', Math.floor(Math.random() * 100))
    .intField('comments', Math.floor(Math.random() * 100))
    .timestamp(new Date());

  writeApi.writePoint(point);

  await writeApi.close();
}

// Query data written in the last hour
async function queryData() {
  const queryApi = client.getQueryApi(org);

  const query = `from(bucket:"${bucket}") |> range(start: -1h) |> filter(fn: (r) => r._measurement == "social_media")`;
  console.log('query:', query);

  queryApi.queryRows(query, {
    next(row, tableMeta) {
      const res = tableMeta.toObject(row);
      // Each row carries one field/value pair of the measurement
      console.log(`${res._time} ${res._measurement}: ${res._field} = ${res._value}`);
    },
    error(error) {
      console.error(error);
    },
    complete() {
      console.log('SUCCESS');
    },
  });
}

writeData().then(queryData);

MongoDB’s Time Series Collections: A feature designed specifically for time series data, optimizing storage and improving query efficiency for sequential data. This solution leverages MongoDB’s scalability and flexibility while providing a more efficient way to handle time-stamped data through automatic indexing and a simplified schema.
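
For example, creating and writing to a time series collection in the mongo shell (collection and field names are illustrative):

// Documents are bucketed internally by timestamp for efficient storage and range queries
db.createCollection("user_activity", {
  timeseries: {
    timeField: "timestamp",  // required: the time of each measurement
    metaField: "metadata",   // optional: per-series labels, e.g. { userId: 42 }
    granularity: "minutes"
  }
});

db.user_activity.insertOne({
  timestamp: new Date(),
  metadata: { userId: 42 },
  likes: 3,
  comments: 1
});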

Problem #7: File Storage for Streaming Services

Description: A streaming service needs efficient, scalable storage for multimedia files to ensure smooth playback and quick access.

Constraint: The solution must handle large files and support high-throughput read operations to serve content with minimal latency.

Example: A platform streaming videos requires a robust storage system to store, manage, and serve video files of various sizes and formats.

Solutions:

MongoDB with GridFS: MongoDB can store metadata or smaller files directly, but for large files (like videos), GridFS is ideal. GridFS divides files into chunks and stores them as separate documents, allowing efficient storage and retrieval of large files without loading them entirely into memory.

# The following example uploads a file named example.txt to the myfiles database:
mongofiles -d myfiles put example.txt

# List the files stored in GridFS
mongofiles -d myfiles list
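
The same can be done programmatically with the Node.js driver's GridFSBucket; a rough sketch (database and file names are illustrative):

const { MongoClient, GridFSBucket } = require('mongodb');
const fs = require('fs');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const bucket = new GridFSBucket(client.db('myfiles'));

  // Upload: the file is split into chunks stored as separate documents
  fs.createReadStream('./example.mp4')
    .pipe(bucket.openUploadStream('example.mp4'))
    .on('finish', () => {
      // Download: stream the chunks back without loading the whole file into memory
      bucket.openDownloadStreamByName('example.mp4')
        .pipe(fs.createWriteStream('./copy.mp4'))
        .on('finish', () => client.close());
    });
}

main().catch(console.error);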

Amazon S3: A highly scalable object storage service that’s perfect for storing and retrieving any amount of data, including large multimedia files for streaming services. S3 offers high durability, availability, and performance, making it suitable for serving streaming content to a wide audience.

Network File System: NFS offers a traditional yet effective approach. NFS allows files to be stored on a networked server and accessed like a local file system, which can be particularly useful for applications that need to stream large files.

Problem #8: Efficiently Handling SPARSE Columns

Description: When database tables have many optional fields that often remain null for most records, the storage and performance implications of handling such “sparse” data become a concern.

Constraint: The solution should efficiently store data while optimizing for null values to reduce storage requirements and possibly improve query performance.

Example: Consider a user profile table in a social networking application where out of 50 possible fields (like interests, activities, qualifications), a typical user only fills out a handful, leaving the rest as null.

Solutions:

1. PostgreSQL HSTORE or JSONB:

HSTORE: Great for storing sets of key/value pairs in a single column, handling sparse data by simply not storing keys that have no value. Best for applications that work with simple, flat key-value data where the values are always strings; it is lightweight and straightforward when a nested document structure is not required.

-- Enable the extension once per database
CREATE EXTENSION IF NOT EXISTS hstore;

INSERT INTO profiles (user_id, attrs) VALUES
  (1, 'name => John, age => 29, city => "New York"');

SELECT user_id FROM profiles WHERE attrs -> 'city' = 'New York';

-- GIN index for HSTORE lookups
CREATE INDEX idx_gin_hstore ON profiles USING GIN (attrs);

JSONB: A binary JSON format that stores JSON data in a decomposed binary form. JSONB is ideal for columns with highly variable schemas or optional fields, and for storing and querying complex nested documents and unstructured data.

INSERT INTO user_profiles (user_id, profile) VALUES
  (1, '{"name": "John", "age": 29, "city": "New York"}');

SELECT user_id FROM user_profiles WHERE profile ->> 'city' = 'New York';

-- GIN index with jsonb_path_ops for containment queries (@>)
CREATE INDEX idx_gin_jsonb_path_ops ON user_profiles USING GIN (profile jsonb_path_ops);

-- B-tree expression index for equality lookups on a single key
CREATE INDEX idx_jsonb_name ON user_profiles ((profile ->> 'name'));

2. Document databases like MongoDB are designed to store and query data as JSON-like documents. These databases inherently support sparse data by storing only non-null attributes in each document.
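
For instance, two user profiles in the same collection can carry completely different optional fields, and nothing is stored for the fields a user leaves empty (field names are made up):

db.user_profiles.insertMany([
  { _id: 1, name: "John", city: "New York", interests: ["chess", "running"] },
  { _id: 2, name: "Maria", qualifications: ["BSc"] } // no city, no interests: no storage used for them
]);

// Queries simply match on the fields that exist
db.user_profiles.find({ city: "New York" });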

Problem #9: Tracking and Storing Logs of Data Changes

Description: For industries like finance and insurance, it’s crucial to have an immutable log of data changes for records such as financial transactions or claims histories. This ensures traceability, auditability, and compliance with regulatory requirements.

Constraint: The solution must guarantee data integrity, allow for easy retrieval of historical data, and ensure that records cannot be altered once written.

Example: A banking application needs to store a history of all user transactions in an immutable manner, ensuring that once a transaction is recorded, it cannot be modified or deleted, and any change or addition can be easily tracked back.

Solutions:

1. Amazon Quantum Ledger Database (QLDB): It is designed specifically for use cases requiring a centralized, immutable ledger that records all changes to your data transparently. It provides an append-only journal, ensuring that each data modification is cryptographically chained.

2. Any Pure SQL Solution: Implementing an audit trail in a traditional SQL-based database involves creating audit tables that log every insert, update, or delete operation. This can be achieved through database triggers or application-level logging.

For every table requiring an audit trail (e.g., transactions), you create a corresponding audit table (transactions_audit) and define triggers. These triggers automatically insert a new record into the audit table on any DML operation, capturing the operation type, the data changed, the timestamp, and user information.

CREATE TABLE transactions_audit (
  audit_id SERIAL PRIMARY KEY,
  operation_type VARCHAR(10), -- INSERT, UPDATE, DELETE
  original_data JSONB,
  changed_data JSONB,
  operation_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  user_id INT
);

-- current_user_id() is assumed to be an application-defined helper
-- that resolves the acting user's ID (e.g. from a session variable).
CREATE OR REPLACE FUNCTION log_transaction_changes()
RETURNS TRIGGER AS $$
BEGIN
  IF (TG_OP = 'DELETE') THEN
    INSERT INTO transactions_audit(operation_type, original_data, user_id)
    VALUES ('DELETE', row_to_json(OLD), current_user_id());
    RETURN OLD;
  ELSIF (TG_OP = 'UPDATE') THEN
    INSERT INTO transactions_audit(operation_type, original_data, changed_data, user_id)
    VALUES ('UPDATE', row_to_json(OLD), row_to_json(NEW), current_user_id());
    RETURN NEW;
  ELSIF (TG_OP = 'INSERT') THEN
    INSERT INTO transactions_audit(operation_type, changed_data, user_id)
    VALUES ('INSERT', row_to_json(NEW), current_user_id());
    RETURN NEW;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER transactions_audit_trigger
AFTER INSERT OR UPDATE OR DELETE ON transactions
FOR EACH ROW EXECUTE FUNCTION log_transaction_changes();

Problem #10: Storing Semi-Schema-Less Documents

Description: Storing variable-structured documents like music albums and tracks, which may have varying attributes.

Constraint: Must efficiently accommodate diverse and nested data structures without heavy schema modifications.

Example: Music catalog application requiring flexibility for differing metadata across albums, such as varied track details.

Solutions:

1. MongoDB: Ideal for schema-less document storage, supporting nested data and arrays. Facilitates easy evolution of document structure over time.

2. PostgreSQL JSONB: Supports nested documents and arrays, offering SQL advantages like transactions and joins, along with schema flexibility.
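
A quick sketch of how differing album metadata can live side by side in a MongoDB collection (field names are illustrative):

db.albums.insertMany([
  {
    title: "Album A",
    artist: "Band X",
    tracks: [
      { no: 1, title: "Intro", durationSec: 95 },
      { no: 2, title: "Song", durationSec: 241, featuring: ["Artist Y"] }
    ]
  },
  {
    title: "Album B",
    artist: "Band Z",
    releaseYear: 1999, // extra field only this album has
    tracks: [{ no: 1, title: "Single", durationSec: 203 }]
  }
]);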

Problem #11: Ad Tech for Real-Time Bidding

Description: The challenge in a real-time bidding ecosystem for advertisements is processing and analyzing millions of ad impressions per second efficiently.

Constraint: Needs to handle extremely high throughput (200,000 to 500,000 transactions per second with just a single node) with minimal latency, maintaining performance consistency even under heavy load.

Example: An ad platform must evaluate and respond to bid requests within milliseconds, processing data from various sources.

Solutions:

1. Aerospike DB: Aerospike is designed for high throughput and low latency, which is crucial for real-time systems. Its unique architecture allows for rapid data access, making it exceptionally well suited for environments where milliseconds can impact revenue.

Note: Assess the network infrastructure to ensure that it can support the high data throughput and low-latency communication between Aerospike and the application servers.
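
A rough sketch with the official aerospike Node.js client (namespace, set, and bin names are made up):

const Aerospike = require('aerospike');

async function main() {
  const client = await Aerospike.connect({ hosts: '127.0.0.1:3000' });

  // Record a bid response keyed by request ID
  const key = new Aerospike.Key('adtech', 'bids', 'req-12345');
  await client.put(key, { price: 1.25, campaign: 'c-42', ts: Date.now() });

  // Low-latency read on the bidding hot path
  const record = await client.get(key);
  console.log(record.bins);

  client.close();
}

main().catch(console.error);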

Now it is your turn to think about the following challenges:

  • Dynamic Pricing in E-commerce
  • Scalable Personalization for Content Platforms
  • Efficient Resource Allocation in Cloud Environments
  • Detecting and Mitigating Fake News Spread
  • Real-time Monitoring of Environmental Health
  • And more…

Share your solutions with me.


Saeed Vayghani

Software engineer and application architect. Interested in free and open source software.