Breaking Down Big Data: Efficient Search Strategies for 10+GB Files

Oleh Teslenko · Jun 16, 2024

How do you find a unique string in a very large file? Imagine that you work at Google and have a huge data set of unique users stored in files. One day you come to the office and find a new task waiting: create a service that can find any user by email and return the user's name. The main condition is that the service should be as fast as possible and work in real time (10K requests per second).

There are several different solutions depending on your needs. Let’s discuss possible ways to do this:

Relational Database

Obviously, relational databases such as MySQL or PostgreSQL could handle this task. All we need is an index on one column, right? But would it be as efficient and fast as we want at 10K requests per second? The short answer is no. Still, for completeness, below is a minimal sketch of this approach.
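
This sketch assumes a PostgreSQL users(email, name) table and the node-postgres (pg) client; the table and column names are hypothetical:

// relational.js

import pg from 'pg';

async function main() {
  const client = new pg.Client(); // connection settings come from PG* env variables
  await client.connect();

  // A B-tree index on the lookup column is the standard way to speed up point queries.
  await client.query('CREATE INDEX IF NOT EXISTS users_email_idx ON users (email)');

  const { rows } = await client.query(
    'SELECT name FROM users WHERE email = $1',
    ['user@example.com'],
  );
  console.log(rows[0]?.name);

  await client.end();
}

main();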

Non-relational Database

We can use a key-value database to store all the data, for example Aerospike or Redis. These systems are often used for real-time workloads: they keep data in RAM, which lets them return results very quickly. However, there is one drawback. Our files could take up hundreds of GB, and the memory footprint, and its cost, would grow accordingly.
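
To make this concrete, here is a minimal sketch assuming the node-redis client and a Redis instance on localhost; the email and name are placeholders:

// kv.js

import { createClient } from 'redis';

async function main() {
  const client = createClient(); // connects to localhost:6379 by default
  await client.connect();

  // Store the user once...
  await client.set('user@example.com', 'Jane Doe');

  // ...and look them up by email later.
  const name = await client.get('user@example.com');
  console.log(name); // Jane Doe

  await client.quit();
}

main();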

Bloom Filter

Bloom filters are probabilistic data structures that offer space-efficient membership testing. They use space cleverly, allowing you to check whether an item might be present in a dataset without storing the entire dataset itself. This approach can be very effective if we can tolerate around 1% false positives. More about this approach is covered in a separate article.
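
As an illustration, here is a tiny Bloom filter sketch in plain JavaScript; the hashing is simplified and not meant for production:

// bloom.js

class BloomFilter {
  constructor(m = 1 << 20, k = 7) {
    this.m = m; // number of bits
    this.k = k; // number of hash functions
    this.bits = new Uint8Array(Math.ceil(m / 8));
  }

  // Seeded FNV-1a string hash; good enough for a sketch.
  hash(value, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < value.length; i++) {
      h ^= value.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return (h >>> 0) % this.m;
  }

  add(value) {
    for (let i = 0; i < this.k; i++) {
      const bit = this.hash(value, i);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // false => definitely absent; true => probably present (false positives possible)
  mightContain(value) {
    for (let i = 0; i < this.k; i++) {
      const bit = this.hash(value, i);
      if (!(this.bits[bit >> 3] & (1 << (bit & 7)))) return false;
    }
    return true;
  }
}

const filter = new BloomFilter();
filter.add('user@example.com');
console.log(filter.mightContain('user@example.com')); // true
console.log(filter.mightContain('absent@example.com')); // almost certainly false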

Binary Search

Binary search is a search algorithm that finds the position of a target value within a sorted array — Wikipedia

Let's slightly adapt this definition for a sorted file: if our data structure is sorted, we can apply this type of algorithm to it. In this article, I want to describe this way of finding a unique string in a large file.
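
For reference, here is the classic in-memory version of the algorithm; the rest of the article adapts the same idea to a file on disk:

// binary-search.js

function binarySearch(sorted, target) {
  let start = 0;
  let end = sorted.length - 1;
  while (start <= end) {
    const mid = Math.floor((start + end) / 2);
    if (sorted[mid] === target) return mid;
    if (sorted[mid] < target) start = mid + 1; // search the right half
    else end = mid - 1; // search the left half
  }
  return -1; // not found
}

console.log(binarySearch(['apple', 'grape', 'mango'], 'grape')); // 1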

How Does It Work?

First, we need to create a sorted file. The best way to do this is to use the MapReduce algorithm.

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. — Wikipedia

So we need to divide our process into two steps — map and reduce. To simplify, we only use the “map” part in this article.

Unsorted file (data.csv)

apple,25,Ukraine
grape,345,France
pineapple,15,France
mandarin,22,Spain
mango,45,Ecuador
strawberry,12,Spain
blackberry,10,France
raspberry,20,Poland
cherries,50,Spain

Map

We use the map step to reshape the data in the file into key-value pairs.

// map.js

import fs from 'fs/promises';

const filename = './data.csv';

async function main() {
  const fileHandle = await fs.open(filename, 'r');
  // Emit one tab-separated pair per line: the name (column 1) as the key
  // and the country (column 3) as the value; the quantity column is dropped.
  for await (const line of fileHandle.readLines()) {
    const splitLine = line.split(',');
    const name = splitLine[0];
    const country = splitLine[2];
    process.stdout.write(name + '\t' + country + '\n');
  }
}

main();

Next, run the script in your terminal:

node map.js >./pairs.data

After that, we will have the following file, pairs.data:

apple      Ukraine
grape      France
pineapple  France
mandarin   Spain
mango      Ecuador
strawberry Spain
blackberry France
raspberry  Poland
cherries   Spain

Next, we need to sort it. Run the following command in your terminal (the -k1,1 option tells sort to key on the first field only):

sort -k1,1 ./pairs.data >./sorted.data
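
Note that GNU sort performs an external merge sort, so it can handle files much larger than RAM. If sorting becomes the bottleneck, options such as -S (memory buffer size) and --parallel may help; an illustrative invocation:

sort -k1,1 -S 1G --parallel=4 ./pairs.data >./sorted.data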

Let’s look at our binary search script:

// index.js

import fs from 'fs/promises';

const args = process.argv.slice(2);

if (args.length !== 2) {
  console.log('Not all arguments provided. See README.md for more information.');
  process.exit(1);
}

const SEARCH_VALUE = args[0];
const DEFAULT_CHUNK_SIZE = +args[1]; // depends on the data size and the system memory

// Must return 1 if the search value sorts after the candidate,
// -1 if it sorts before it, and 0 on a match.
const DEFAULT_COMPARE_FN = async (searchValue, candidateValue) => {
  if (searchValue > candidateValue) {
    return 1;
  } else if (searchValue < candidateValue) {
    return -1;
  } else {
    return 0;
  }
};

const DEFAULT_DELIMITER = '\n';
const FILENAME = 'sorted.data';

console.time('binarySearchInFile');
binarySearchInFile({
  filename: FILENAME,
  searchValue: SEARCH_VALUE,
})
  .then(console.log)
  .then(() => console.timeEnd('binarySearchInFile'));

// Reads up to chunkSize bytes from the file at the given byte offset.
export async function readChunk({ fileHandle, chunkSize, position }) {
  const { buffer, bytesRead } = await fileHandle.read({
    buffer: Buffer.alloc(Number(chunkSize)),
    length: Number(chunkSize),
    position: Number(position),
  });

  if (bytesRead === chunkSize) {
    return buffer;
  }
  // Near the end of the file fewer bytes may be read; trim the buffer.
  const bufferCorrectSize = Buffer.allocUnsafe(bytesRead);
  buffer.copy(bufferCorrectSize, 0, 0, bytesRead);
  return bufferCorrectSize;
}

// Binary search over byte offsets: jump to the middle of the current range,
// read enough bytes to recover one complete line, compare its key with the
// search value and narrow the range accordingly. Assumes the file ends with
// a delimiter (sort output does).
export async function binarySearchInFile({
  filename,
  searchValue,
  compareFn = DEFAULT_COMPARE_FN,
  chunkSize = DEFAULT_CHUNK_SIZE,
  delimiter = DEFAULT_DELIMITER,
} = {}) {
  if (!filename) throw new Error('"filename" required');
  if (!searchValue) throw new Error('"searchValue" required');

  const fileHandle = await fs.open(filename, 'r');
  const stats = await fileHandle.stat();

  let start = 0;
  let end = stats.size;

  while (start < end) {
    const mid = Math.floor((start + end) / 2);

    let readMore = true;
    let data = '';
    let parts = [];
    let position = mid;

    // Keep reading chunks until we have one complete line after the current
    // offset. At offset 0 a single delimiter is enough; elsewhere we need two,
    // because the first part is usually the tail of a line cut at mid.
    while (readMore && position < stats.size) {
      data += (await readChunk({ fileHandle, chunkSize, position })).toString();
      parts = data.split(delimiter);
      position += chunkSize;
      readMore = mid === 0 ? parts.length < 2 : parts.length < 3;
    }

    if (readMore) {
      // We hit the end of the file before seeing a complete line after mid,
      // so mid sits inside the last line: the answer can only be to the left.
      end = mid;
      continue;
    }

    const midLine = mid === 0 ? parts[0] : parts[1];
    const cmpResult = await compareFn(searchValue, midLine.split('\t')[0]);

    if (cmpResult === 0) {
      await fileHandle.close();
      return { found: true, record: midLine, searchValue };
    } else if (cmpResult === 1) {
      start = mid;
    } else if (cmpResult === -1) {
      end = mid;
    } else {
      throw new Error('"compareFn" should return 0|1|-1');
    }

    if (start === mid && end - start === 1) {
      // The range can no longer shrink, so the value is not in the file.
      await fileHandle.close();
      return { found: false, searchValue };
    }
  }

  await fileHandle.close();
  return { found: false, searchValue };
}

Run index.js, passing two arguments: the key to search for ('apple') and the chunk size in bytes (4 here), which depends on the data size and the system memory:

node index.js apple 4
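
For the sample data above, the output should look roughly like this (the exact timing will vary by machine):

{ found: true, record: 'apple\tUkraine', searchValue: 'apple' }
binarySearchInFile: 1.273ms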

Useful links:

  1. Original code example
  2. The code from this article (including TypeScript examples and versions that use callbacks instead of async/await)
  3. Apache MapReduce documentation
