Efficient Float Array Storage on Elasticsearch

Dean Shaff
YukkaLab
Jun 23, 2021

Recently my colleagues and I ran into an issue where we wanted to store reasonably large arrays of floating-point numbers (“floats”) in Elasticsearch. For our application, we wanted to add a 512-element float array to each document in our cluster. We started by adding a property of type float to our Elasticsearch mapping, but we quickly realized that this had a negative impact on the performance of our app: these arrays added around 10 KB to each document we requested from Elasticsearch. If we requested 1,000 documents containing float arrays for a given API request, we would download an additional 10 MB! The purpose of this post is to explain why these float arrays take up the space they do and what we did at Yukka Lab to improve the storage efficiency of these arrays.

To explain why using 10 KB to store 512 floats is more than necessary, we first have to understand how different representations of numbers change the amount of space they occupy. The way computers store floats internally is not the same as how floats are represented in JSON arrays, or even how we might think about them intuitively. In JSON, a float-like number is simply a series of digits before and after a decimal point (I say float-like because it is possible to write down JSON numbers that are not exactly expressible as floating-point numbers). This text-based representation, while human readable, is not very efficient; a float takes up (at minimum) as many bytes as there are digits, plus the decimal point and an optional sign.
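To make this concrete, here is a quick look (in a Python REPL, with numpy as the only dependency beyond the standard library) at how many characters the text form of a fairly typical 32-bit float needs:

>>> import numpy as np
>>> x = float(np.float32(0.1))  # the closest 32-bit float to 0.1, as a Python float
>>> repr(x)
'0.10000000149011612'
>>> len(repr(x))  # roughly how many bytes this value occupies as text in a JSON array
19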

When storing floats in memory, computers use a representation akin to scientific notation: a number multiplied by a base raised to a certain power. Importantly, computers use a fixed number of human-unreadable bytes to represent floats. Generally, floats use either 4 bytes (“32-bit”) or 8 bytes (“64-bit”), but other precisions are possible and sometimes used. Using more bytes allows for representing a greater range of numbers, with smaller increments between consecutive numbers. For more information on exactly how floats are stored in contemporary computers, see the Wikipedia articles on the single-precision and double-precision floating-point formats. We can store floats in “binary” format by grabbing the bytes associated with the float or array of floats and dumping them to disk.
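For comparison, here is a small sketch using the standard library’s struct module. However many decimal digits a value would need as text, its binary form is always exactly 4 bytes for a 32-bit float or 8 bytes for a 64-bit float:

>>> import struct
>>> len(struct.pack("<f", 0.1))  # 32-bit ("single precision") float, little-endian
4
>>> len(struct.pack("<d", 0.1))  # 64-bit ("double precision") float, little-endian
8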

To illustrate the difference between these two means of storing floats, let’s write a little Python (with the help of numpy) that prints out the size in bytes (assuming characters are represented as single bytes) of a JSON array and of the same array as a block of memory. I’ve added some inline comments to help elucidate what is happening.

>>> import json
>>> import numpy as np
>>> arr = np.random.rand(512).astype(np.float32) # create an array of 512 random numbers, and convert to 32-bit float (numpy defaults to 64-bit)
>>> bin_len = len(arr.tobytes()) # ndarray.tobytes() produces a `bytes` object with the binary contents of the array
>>> json_len = len(json.dumps(arr.tolist())) # ndarray.tolist() produces a Python list
>>> print(f"Binary representation using {(bin_len/json_len)*100:.2f}% as much space")
Binary representation using 19.73% as much space

We can see that if we’re using 32-bit floats, we might stand to reduce our storage footprint by about 80% by moving from storing arrays as JSON to storing them in their binary form. Note that if we were storing single-digit, whole-number floats, we might not see any reduction in space usage.

At this point, it seems like we have a solution to our original storage problem — simply dump the binary representation of our big float arrays into Elasticsearch, and we’ve reduced our storage requirement! Unfortunately, Elasticsearch cannot store raw binary data. The closest thing it has to this is the “binary” field type, which allows for storing Base64 encoded binary data. It’s important to note that data stored using the “binary” field type is neither searchable nor indexable.

Base64 is a standardized way of converting between a series of byte values and a string of ASCII characters. Base64 is widely used in networking applications where it may not be possible to send or receive raw binary data. Encoding bytes as base64 strings does incur some overhead, generally on the order of one third of the size of the original bytes. We can write some Python functions that encode and decode floating point arrays.

import base64

import numpy as np


def encode(arr: np.ndarray) -> str:
    """
    encode numpy array as a bigendian-ordered base64 string, appropriate for
    transferring over a network
    """
    dt = np.dtype(arr.dtype)  # create a numpy dtype object that is the same as our original array's
    dt = dt.newbyteorder(">")  # change the byte ordering from whatever it was before (most likely little-endian) to big-endian

    arr_be = arr.astype(dt)  # convert our array to this new ordering

    return base64.b64encode(arr_be.tobytes()).decode()  # without decode, we return a bytes object


def decode(arr_str, dtype=None) -> np.ndarray:
    """
    decode a base64 string, returning a numpy array
    """
    if dtype is None:
        # assume we're using bigendian-ordered 32-bit floats
        dtype = np.dtype(np.float32)
        dtype = dtype.newbyteorder(">")  # use big-endian byte ordering

    return np.frombuffer(base64.b64decode(arr_str), dtype=dtype)  # numpy.frombuffer creates a numpy array from a bytes object

These functions account for endianness, the order in which a computer interprets the bytes that make up a multi-byte value. Different computer architectures use different endianness by default (although most use little-endian), which means that without explicitly specifying endianness we could end up in a situation where the encoding machine assumes one endianness while the decoding machine assumes another. If this were to happen, the decoding machine would inadvertently be working with an array whose contents do not match those of the encoding machine.
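As a quick sanity check (assuming the encode and decode functions above have been defined in the same interactive session as the earlier snippet), a round trip through base64 should hand back exactly the values we started with:

>>> roundtripped = decode(encode(arr))  # `arr` is the 512-element float32 array from earlier
>>> np.array_equal(roundtripped, arr)
True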

Now, let’s modify the original snippet of Python code to see how much space we’re saving using our base64 encoded arrays:

>>> base64_arr = encode(arr)
>>> base64_len = len(base64_arr)
>>> print(f"base64 encoded array using {(base64_len/json_len)*100:.2f}% as much space!")
base64 encoded array using 26.31% as much space!

As mentioned above, using base64 does incur some overhead, but we can still save a significant amount of space over JSON arrays.
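That figure lines up with base64’s expected 4/3 expansion: every 3 bytes of input become 4 characters of output, plus a little padding at the end. Continuing in the same session:

>>> bin_len  # 512 32-bit floats as raw bytes
2048
>>> base64_len  # the same bytes, base64 encoded
2732
>>> base64_len / bin_len
1.333984375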

How might we use our encoding and decoding functions in the context of an application that reads and writes to an Elasticsearch instance? The following script writes some documents containing both JSON arrays and our base64 encoded arrays to an Elasticsearch index, reads them back out, and then compares the sizes of the two encodings after retrieval. I’m going to be running a local instance of Elasticsearch (the Elasticsearch documentation describes how to get started running it on your local machine), but you can use an existing cluster; simply modify the base_url variable at the top of the script.

import base64
import json

import numpy as np
import requests


base_url = "http://localhost:9200"

# note that the only difference between the mapping for base64 encoded binary
# data and a JSON float array is the `"type"` parameter!
mapping = {
    "mappings": {
        "properties": {
            "big_array_binary": {
                "type": "binary",
                "index": False
            },
            "big_array_json": {
                "type": "float",
                "index": False
            }
        }
    }
}


def encode(arr: np.ndarray) -> str:
    """
    encode numpy array as a bigendian-ordered base64 string, appropriate for
    transferring over a network
    """
    dt = np.dtype(arr.dtype)
    dt = dt.newbyteorder(">")  # use big-endian byte ordering

    arr_be = arr.astype(dt)

    return base64.b64encode(arr_be.tobytes()).decode()  # without decode, we return a bytes object


def decode(arr_str, dtype=None) -> np.ndarray:
    """
    decode a base64 string, returning a numpy array
    """
    if dtype is None:
        # assume we're using bigendian-ordered 32-bit floats
        dtype = np.dtype(np.float32)
        dtype = dtype.newbyteorder(">")  # use big-endian byte ordering

    return np.frombuffer(base64.b64decode(arr_str), dtype=dtype)


def create_index(index_name: str, mapping: dict):
    """
    create an index with a given name and mapping
    """
    with requests.Session() as session:
        session.put(f"{base_url}/{index_name}", json=mapping)


def delete_index(index_name: str):
    """
    delete index `index_name`
    """
    with requests.Session() as session:
        session.delete(f"{base_url}/{index_name}")


def add_docs():
    """
    Add 100 documents to the `comparison` index.
    Each document contains the same array, encoded in two different ways.
    """
    # use `default_rng` so we get the same results every time we run the script;
    # the first argument to `default_rng` is a "seed" for the random number generator.
    rg = np.random.default_rng(1024)
    with requests.Session() as session:
        for idx in range(100):
            arr = rg.random(512, dtype=np.float32)
            data = {
                "big_array_binary": encode(arr),
                "big_array_json": arr.tolist()
            }
            session.post(f"{base_url}/comparison/_doc", json=data)
        # refresh the index so the newly added documents are visible to the search in `compare_docs`
        session.post(f"{base_url}/comparison/_refresh")


def compare_docs():
    """
    Grab the documents we put in with `add_docs`.
    Ensure that the different array encodings are the same,
    and then compare the size of the arrays after retrieval.
    """
    size = 100
    query = {
        "query": {
            "match_all": {}
        },
        "size": size
    }
    size_json = 0
    size_binary = 0
    with requests.Session() as session:
        data = session.get(f"{base_url}/comparison/_search", json=query).json()
    hits = data["hits"]["hits"]

    nclose = 0
    for item in hits:
        arr_json = item["_source"]["big_array_json"]
        size_json += len(json.dumps(arr_json))
        arr_binary = item["_source"]["big_array_binary"]
        size_binary += len(arr_binary)
        arr_binary = decode(arr_binary)
        # we're using 32-bit arrays, so the default arguments to `allclose` might yield a false negative
        if np.allclose(arr_binary, arr_json, atol=1e-4, rtol=1e-4):
            nclose += 1

    print(f"{nclose}/{len(hits)} are close")
    print(f"JSON response is {(size_json/size_binary):.3f} times bigger")


if __name__ == '__main__':
    create_index("comparison", mapping)
    add_docs()
    compare_docs()
    delete_index("comparison")

The exact amount of improvement will vary depending on the nature of the data we put in. For random numbers between 0 and 1, the JSON arrays should be about 3.5 times bigger; for evenly spaced, single digit floats, the JSON arrays will only be about 1.5 times bigger.

Should we be replacing all floats with base64 encoded floats in Elasticsearch? Most definitely not. For single floats, the difference in required storage is small enough not to matter. Moreover, using base64 encoded floats imposes technical and computational overhead both when adding documents to the index and when retrieving them. We might choose to use this approach when storing larger numbers of floats (more than 100 per document) and when we only need 32-bit precision, as that is where the efficiency gains are largest.
