Are json columns a viable implementation for tagging?

All code examples use postgres 9.6.2; other databases may have similar facilities.

While designing a schema for a new application, a question arose as to whether we should use normalised tables or a json column to store arbitrary properties, and in particular to implement a tagging solution.

I am not going to discuss the benefits of using a normalised form versus a json column, because the ultimate decision depends on the function of the attributes and on whether they are well known before design.

In our case we wanted to implement a tagging facility, i.e. the ability to add random tags to a set of records, à la Gmail.

Let’s define a minimal record table — this could be anything:
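
-- minimal record table (a sketch; only the id and name columns referenced later are assumed)
CREATE TABLE record (
    id   serial PRIMARY KEY,
    name text
);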


Using a minimal normal form

The first idea was to use an external tag table. The following code shows how to create the tag table and its index, add a tag, remove a tag and query records by tag:

-- create the tag table (column definitions are a minimal sketch)
CREATE TABLE tag (
    record_id integer REFERENCES record (id),
    tag       varchar NOT NULL
);
-- index the tag value for lookups
CREATE INDEX idx_tag_tag ON tag (tag);
-- add a tag for record 1
INSERT INTO tag (record_id, tag) VALUES (1, 'Awesome');
-- remove a tag for record 1
DELETE FROM tag WHERE record_id = 1 AND tag = 'Awesome';
-- query records by tag
SELECT r.*
FROM tag AS t
LEFT JOIN record AS r ON t.record_id = r.id
WHERE t.tag = 'Awesome';

This seems simple and straightforward enough. But we knew we had to capture other optional properties on the record table and thought about using a json column; tags are a nice fit for such optional properties.

Using a json column

Here is an attempt to do just that.

-- add a properties column of type jsonb
ALTER TABLE record ADD COLUMN properties jsonb;
-- create an index on properties -> 'tags'
CREATE INDEX idx_record_properties_tag
ON record USING GIN ((properties -> 'tags'));
-- add a tag to record 1 (duplicates are not prevented here)
UPDATE record
SET properties = properties || ('{"tags":' || ((properties -> 'tags')::jsonb || '["Awesome"]'::jsonb)::text || '}')::jsonb
WHERE id = 1;
-- remove a tag from record 1
UPDATE record
SET properties = properties || ('{"tags":' || ((properties -> 'tags')::jsonb - 'Awesome')::text || '}')::jsonb
WHERE id = 1;
-- query by tag
SELECT *
FROM record
WHERE properties -> 'tags' ? 'Awesome';

The select is simpler, since we don’t have to join to another table, but the insert is more complicated because we have to extract the tags property in order to append to it. Adding and removing tags are, however, remarkably symmetrical.
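
For reference, the append could also be written with jsonb_set (available since postgres 9.5). This is only a sketch of an equivalent form and, like the version above, it assumes the tags key already exists:

-- equivalent append using jsonb_set (sketch; assumes properties -> 'tags' is already an array)
UPDATE record
SET properties = jsonb_set(properties, '{tags}', (properties -> 'tags') || '["Awesome"]'::jsonb)
WHERE id = 1;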

Comparing performance

We can compare performance using a large dataset and the explain plan feature.

Large data set

It is fairly simple to generate a large dataset, the idea being that it should be an order of magnitude (×10) bigger than a production dataset in order to give a good estimate of live performance. I chose to test on a million records alongside a million tags, using the generate_series function from postgres to build it up.

-- create a million records
INSERT INTO record (name)
SELECT left(md5(i::text), 5) -- arbitrary name value
FROM generate_series(1, 1000000) AS s(i);

-- create a million tags
INSERT INTO tag (record_id, tag)
SELECT
    ceiling(random() * 1000000), -- use a random record
    left(md5(i::text), 5)        -- random tag
FROM generate_series(1, 1000000) AS s(i);
-- add tags to the properties column
UPDATE record AS r
SET properties =
    (SELECT row_to_json(p)::jsonb
     FROM (SELECT array_agg(t.tag) AS tags
           FROM tag AS t
           WHERE t.record_id = r.id
          ) AS p)
WHERE exists(SELECT t.tag FROM tag AS t WHERE t.record_id = r.id);

Running the insert, delete and select queries on this dataset shows consistent performance across both approaches. For reference, on an otherwise empty database (postgres 9.6.2 running via Docker with an attached volume, on a recent MacBook Pro not doing much else), both queries return within 4 ms.

Explain plans
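
The plans below were obtained by running EXPLAIN on the two select queries; a sketch of the commands, matching the queries shown earlier:

EXPLAIN SELECT r.*
FROM tag AS t
LEFT JOIN record AS r ON t.record_id = r.id
WHERE t.tag = 'Awesome';

EXPLAIN SELECT *
FROM record
WHERE properties -> 'tags' ? 'Awesome';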

-- for the normal form
Nested Loop  (cost=0.85..29.36 rows=2 width=119)
  ->  Index Scan using idx_tag_tag on tag t  (cost=0.42..12.46 rows=2 width=10)
        Index Cond: ((tag)::text = 'Awesome'::text)
  ->  Index Scan using record_pkey on record r  (cost=0.42..8.44 rows=1 width=109)
        Index Cond: (id = t.record_id)
-- for the json
Bitmap Heap Scan on record  (cost=27.73..3467.76 rows=997 width=75)
  Recheck Cond: ((properties -> 'tags'::text) ? 'Awesome'::text)
  ->  Bitmap Index Scan on idx_record_properties_tag  (cost=0.00..27.48 rows=997 width=0)
        Index Cond: ((properties -> 'tags'::text) ? 'Awesome'::text)

I am certainly no expert in deciphering explain plans using different index strategies (`btree` vs `gin`), but the way I read it is as follows:

  • for the normal form, it does a nested loop supported by the index `idx_tag_tag` (cost up to 29.36), then goes on to perform the join to record supported by the `record_pkey` index (cost up to 8.44)
  • for the json column, it first does a Bitmap Index Scan on the index `idx_record_properties_tag` (max cost of 27.48), then rechecks (validates) the condition (max cost 3467.76); for my use case this seems acceptable because the recheck runs on a very small dataset and is only done to weed out false positives

Further explorations

There are many other forms of querying that I didn’t look at in this article (querying more than one tag, excluding a tag, or any combination of those). I will probably revisit this when the need arises.
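
To give an idea, combining tags looks reasonably direct in both approaches. A sketch only, using the jsonb `?&` and `?|` operators for "all of" and "any of" semantics ('Urgent' is just a hypothetical second tag):

-- records carrying both tags (json column)
SELECT * FROM record WHERE properties -> 'tags' ?& array['Awesome', 'Urgent'];
-- records carrying either tag (json column)
SELECT * FROM record WHERE properties -> 'tags' ?| array['Awesome', 'Urgent'];
-- records carrying both tags (normal form)
SELECT r.*
FROM record AS r
JOIN tag AS t ON t.record_id = r.id
WHERE t.tag IN ('Awesome', 'Urgent')
GROUP BY r.id
HAVING count(DISTINCT t.tag) = 2;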

For the sake of simplicity, I didn’t look at avoiding duplication, but this is trivial in the normal form (through a unique constraint on the tag table) and a little more work when inserting into the json array (we have to either remove a potentially existing tag before adding the new one, or filter on pre-existing ones; this remains a code responsibility).
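
As an illustration (a sketch only): uniqueness in the normal form is a constraint, while on the json column one option is to remove the value before appending it, which makes the update idempotent:

-- normal form: enforce uniqueness with a constraint
ALTER TABLE tag ADD UNIQUE (record_id, tag);
-- json column: remove the tag first so appending it twice has no effect
UPDATE record
SET properties = properties ||
    ('{"tags":' || (((properties -> 'tags') - 'Awesome') || '["Awesome"]'::jsonb)::text || '}')::jsonb
WHERE id = 1;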

There are also many more normalised schemas that could be used; see this link for reference.

In another context, we could use a tags column that stores a jsonb array; removing and appending tags would then be more straightforward, but the added complexity here does not prevent us from using the feature as a property of a larger object. I would, however, not recommend using a deeply nested property to store an array of strings that needs to be updated individually, unless the update code sits outside of SQL.
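
A sketch of that alternative (the dedicated column, its default and the index name are assumptions, not part of the schema above):

-- dedicated jsonb array column for tags
ALTER TABLE record ADD COLUMN tags jsonb NOT NULL DEFAULT '[]';
-- appending and removing become direct operations on the column
UPDATE record SET tags = tags || '["Awesome"]'::jsonb WHERE id = 1;
UPDATE record SET tags = tags - 'Awesome' WHERE id = 1;
-- the column can be indexed in the same way
CREATE INDEX idx_record_tags ON record USING GIN (tags);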


The performance tests show that it is acceptable to use a jsonb column to store tag attributes and other non-structural adjacent properties. This provides a lot of convenience for capturing extra attributes without adding extra columns or tables, with the option to normalise later if the data turns out to be structural. Using json as a denormalised datastore gives us flexibility without losing performance, as long as indexes are used.

We have to note that more work is required to implement extra features we get for free with the normalised form (like tag uniqueness), and some ORMs may need special care to map those attributes.

I have shown that, even for array values inside a json column, indexing, querying and performance are acceptable for the level of flexibility provided.