Postgres JSONB Usage and Performance Analysis

Eresh Gorantla · Published in Geek Culture · 7 min read · Sep 7, 2021

This story covers the main features JSONB provides, works through a sample application, and explains scenarios where unstructured data can be stored and queried. Data validation from the application layer is also discussed.

What is JSONB?

The data type JSON and JSONB, as defined by the PostgreSQL documentation, are almost identical; the key difference is that JSON data is stored as an exact copy of the JSON input text, whereas JSONB stores data in a decomposed binary form; that is, not as an ASCII/UTF-8 string, but as binary code.

And this has some immediate benefits:

  • More efficient storage.
  • Significantly faster to process.
  • Support for indexing (which can be a significant advantage, as we’ll see later).
  • Simpler schema designs: entity-attribute-value (EAV) tables can be replaced with a JSONB column that can be queried, indexed, and joined, allowing for performance improvements of up to 1000x.

A few drawbacks:

  • Slightly slower input (due to added conversion overhead).
  • It may take more disk space than plain JSON due to a larger table footprint, though not always.

Maximum size Postgres JSONB can support

As per the PostgreSQL source code, the maximum size is 255 MB per document. Please see the source code here.

Data Saving and analysis of select queries

The complete source code on the Java side, which also includes an auditing framework, can be found here.

I have created a sample application that stores basic user data such as name, contact_number, security_number, and country in normal RDBMS columns. A few additional details are also collected that are specific to the user’s country and profession. Below is how the data is saved for different countries and professions. To exercise all the cases, I have included a plain string, a boolean, numbers, an array of strings, and a complex nested object, so that select queries can be run against each of them to see how it works.

I have loaded around 4 million rows into the table, spread across different countries and professions.

Types of searches with JSONB data:

Many types of queries can be run on JSONB data: attribute comparisons, containment, existence checks, and specific JSONB functions such as jsonb_extract_path and jsonb_extract_path_text.

Let us see how data can be queried on JSONB for different data types.

Query 1 : (On String) exact match. Let us query on profession = 'Farmer' across the UserDetails data set.
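A sketch of this query, assuming the table and JSONB column are named `user_details` and `details` as in the index examples later:

```sql
-- ->> extracts the attribute as text, so plain equality works
SELECT * FROM user_details
WHERE details->>'profession' = 'Farmer';
```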

Query 2 : (On String) wildcard match, e.g. profession LIKE '%Farm%'.
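A sketch, using the same assumed table and column names:

```sql
-- ->> returns text, so the like operator applies directly
SELECT * FROM user_details
WHERE details->>'profession' LIKE '%Farm%';
```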

Query 3 : (On number) exact match on the number of family members.
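A sketch, assuming the attribute is named `familyMembers` (the real attribute name in the data set may differ):

```sql
-- ->> always returns text, so cast before comparing numerically
SELECT * FROM user_details
WHERE (details->>'familyMembers')::int = 4;
```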

Query 4 : (On number) find users whose number of family members falls within a range.
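A sketch with the same assumed attribute name:

```sql
SELECT * FROM user_details
WHERE (details->>'familyMembers')::int BETWEEN 2 AND 5;
```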

Query 5 : (On boolean) find users who have hasDisability set to true.
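A sketch, assuming the attribute is stored as a JSON boolean named `hasDisability`:

```sql
-- cast the extracted text to boolean before comparing
SELECT * FROM user_details
WHERE (details->>'hasDisability')::boolean = true;
```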

Query 6 : (On Array of Strings/Object) find an exact match: alternate contact numbers are an array of strings; match against exact data.
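The attempted query might look like the following sketch, with `alternateContactNumbers` as an assumed attribute name and an illustrative number:

```sql
-- comparing the whole array to a single value never matches
SELECT * FROM user_details
WHERE details->>'alternateContactNumbers' = '9876543210';
```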

The above query returned no results even though the data is in the table. The reason is that an element of an array cannot be matched for exact equality this way.

Other ways :

  1. Use the JSONB containment operator/function. Used on its own it returns true or false rather than a result set, but it can be placed in a WHERE clause to filter rows.
  2. Tweak the above query to use the like operator instead of “=”, even though it is an exact match.

The time taken to return the result is higher because this is an exact match, so a trigram index cannot help here.

3. Use attribute way for search criteria.
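Sketches of these alternatives, again with the assumed attribute name `alternateContactNumbers` and an illustrative number:

```sql
-- 1. Containment in the WHERE clause: true/false per row,
--    so it filters down to a result set
SELECT * FROM user_details
WHERE details->'alternateContactNumbers' @> '["9876543210"]';

-- 2. like against the array serialized as text
SELECT * FROM user_details
WHERE details->>'alternateContactNumbers' LIKE '%9876543210%';

-- 3. Attribute access by position (only matches the first element)
SELECT * FROM user_details
WHERE details->'alternateContactNumbers'->>0 = '9876543210';
```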

Query 7 : (Array Of String) → wildcard match

Here the attribute approach does not work for a contains-style search. We can either use the JSONB containment operator or run a normal wildcard search, optionally backed by a trigram index. No index has been used yet.

You can see how the exact match with the like operator outperformed the wildcard search.
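A sketch of the wildcard form, with the same assumed attribute name:

```sql
-- the array is serialized to text, then matched with a wildcard
SELECT * FROM user_details
WHERE details->>'alternateContactNumbers' LIKE '%98765%';
```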

Query 8: On Date/Timestamp query → exact match

As we have a date of birth property we will search on this.
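A sketch, assuming the attribute is an ISO-formatted date string named `dateOfBirth`:

```sql
SELECT * FROM user_details
WHERE (details->>'dateOfBirth')::date = DATE '1985-06-15';
```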

Query 9: On Date/Timestamp query → on a range
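A sketch of the range form, with the same assumed attribute:

```sql
SELECT * FROM user_details
WHERE (details->>'dateOfBirth')::date
      BETWEEN DATE '1980-01-01' AND DATE '1989-12-31';
```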

Query 10: On Object → let us query on the “info” object, which is a nested child node, and find out who has crop = “Corn”.
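A sketch, assuming the nested attribute is named `crop`:

```sql
-- chain -> to descend into the nested object, ->> for the leaf as text
SELECT * FROM user_details
WHERE details->'info'->>'crop' = 'Corn';

-- equivalent containment form
SELECT * FROM user_details
WHERE details @> '{"info": {"crop": "Corn"}}';
```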

Query 11: On Object → with crop = 'Corn' and farm size between 2 and 3.
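A sketch, assuming the nested attributes are named `crop` and `farmSize`:

```sql
SELECT * FROM user_details
WHERE details->'info'->>'crop' = 'Corn'
  AND (details->'info'->>'farmSize')::numeric BETWEEN 2 AND 3;
```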

Note the Execution Time of 44.733 ms, which is quite low. Using the attribute approach, we can reduce it even further.

JSONB Performance:

None of the queries above used any indexes, and performance is still reasonably good on 4M rows. What if we want to increase performance on a larger data set?

  1. To improve queries on a large data set, define proper search criteria: include as many of the needed properties in the query as possible. (This applies equally to JSONB and normal queries.)
  2. With the use of indexes, we can improve performance.
  3. With partitions.

Let us talk about indexes here, because we use pagination a lot when fetching results, where we end up fetching both the actual page of data within an offset and the total count. The count is sequential in nature and can become a bottleneck on a large data set.
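A typical paginated fetch, sketched against the sample table (the `id` ordering column and page size are illustrative assumptions):

```sql
-- page 3 at 20 rows per page: the first 40 matching rows are
-- produced and discarded before the page is returned
SELECT * FROM user_details
WHERE details->>'profession' = 'Farmer'
ORDER BY id
LIMIT 20 OFFSET 40;

-- total count for the pager: without an index this is a sequential scan
SELECT count(*) FROM user_details
WHERE details->>'profession' = 'Farmer';
```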

JSONB and Indexes

When we use the ->> operator of JSONB, PostgreSQL can use a B-tree or hash index to process the operation: ->> returns the value of the specified attribute as text, and PostgreSQL can index and compare those text values like any other. The GIN index, by contrast, works through the GIN JSONB operator classes.

GIN JSONB operator classes
The default operator class jsonb_ops supports the existence operators ( ?, ?&, ?| ) and the containment operator ( @> ), while jsonb_path_ops supports the containment operator only. A GIN index can therefore only be used to search for documents that have a particular key or key/value pair. [PostgreSQL Doc : jsonb operators]
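For example, only jsonb_ops can serve the existence form, while either class can serve the containment form:

```sql
-- existence: does the document have the top-level key 'profession'?
SELECT count(*) FROM user_details WHERE details ? 'profession';

-- containment: does the document contain this key/value pair?
SELECT count(*) FROM user_details WHERE details @> '{"profession": "Farmer"}';
```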

I have conducted tests on 4M data with several approaches and formulated a conclusion.

BTREE:

CREATE INDEX idx_btree_profession ON user_details USING BTREE ((details->>'profession'));
CREATE INDEX idx_btree_profession_hash ON user_details USING HASH ((details->>'profession'));
--------------------------------------------------------------------
SELECT count(*) FROM user_details WHERE details->>'profession' = 'Farmer';
SELECT count(*) FROM user_details WHERE details->>'profession' = 'Farmer'
OR details->>'profession' = 'Doctor';

GIN:

CREATE INDEX idx_gin_profession ON user_details USING GIN ((details->'profession'));
--------------------------------------------------------------------
SELECT count(*) FROM user_details WHERE details->>'profession' = 'Farmer';
SELECT count(*) FROM user_details WHERE details->>'profession' = 'Farmer'
OR details->>'profession' = 'Doctor';

GIN with jsonb_ops / jsonb_path_ops:

CREATE INDEX idx_gin_jsonb_ops ON user_details USING GIN (details jsonb_ops);
CREATE INDEX idx_gin_jsonb_path_ops ON user_details USING GIN (details jsonb_path_ops);
--------------------------------------------------------------------
SELECT count(*) FROM user_details WHERE details @> '{"profession" : "Farmer"}';
SELECT count(*) FROM user_details WHERE details @> '{"profession" : "Farmer"}'
OR details @> '{"profession" : "Doctor"}';

According to the results above, BTREE performed best when searching a single attribute, while the GIN index stands out in build time and index size. jsonb_path_ops performed better than jsonb_ops when searching individual attributes. Although jsonb_path_ops indexes all attributes, it held up well compared to the single-key indexes; however, its index size and build time suffered because the document has more than 40 attributes.

Conclusions

The BTREE index performed best for searches on a single attribute. The HASH index also performed well, though note that before PostgreSQL 10 hash indexes were not WAL-logged (Write Ahead Log), making them unsafe after an abnormal termination. The GIN index has the benefit of covering all attributes with a single index, but it performs worse than the BTREE and HASH indexes for a single key.

As a result, when choosing an index for a JSONB column: if the index needs to serve a few predetermined attributes, BTREE is the best choice; if it needs to serve arbitrary attributes, prefer the GIN index. When using a GIN index, use jsonb_path_ops if you only need to check containment of key/value pairs; if you also need to check for the existence of a key, use jsonb_ops instead.

Validation on Application Side For Unstructured data

As we are dealing with unstructured data, we have to validate it, either on the application side or on the database side, so that only proper, consistent data is inserted into the table.

Validate JSON Structure:

JSON text can be validated using JSON Schema (https://json-schema.org/). With the help of a schema, we can validate the data before it is stored.

  1. Database side: schema validation on the DB side is not recommended. Adding a field to an existing structure is not an issue, but removing one forces a migration before new data can be inserted, and it becomes cumbersome as more and more validations accumulate. Most NoSQL databases likewise do not enforce schema validation by default.
  2. Application side: validation becomes application logic that runs before inserting into the DB. We have to be careful with direct writes into the DB from the backend, which bypass this validation.

How to validate JSON Schema in Java/Spring:

There are a few libraries available that can do this for us. They can be integrated with Spring, or we can perform on-demand validation when the data arrives.

Please see below the schema I have implemented for the Professions property. More and more complex schemas can be written to validate the JSON string.

Apart from this, we can also add a logical validation framework inside the application logic (specific to the data).
