Indexing JSONField in Django and PostgreSQL
A couple of days ago we received a task from marketing. To implement it we needed to store some data that had no fixed schema, and since only one endpoint would ever query it, we decided to go with Django's JSONField.
The technical team, however, had doubts about how well queries on a JSONField perform with a large amount of data, so we created a sample project to test querying JSONField in Django.
The full code of the project is available on GitHub.
To follow this article you need a basic understanding of Django and the command line.
First we install Django and the PostgreSQL adapter:
pip install django psycopg2-binary
Then we create a Django project called jstest (smart name :) ):
django-admin startproject jstest
cd jstest
python3 manage.py migrate
To create the endpoint, we start a new Django app:
python3 manage.py startapp testapp
Now let's change the code.
In jstest/settings.py we add our new app to INSTALLED_APPS and switch the database to PostgreSQL:
INSTALLED_APPS = [
    ...
    'testapp',
]

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'YOUR_DB_NAME',
        'USER': 'YOUR_DB_USER',
        'PASSWORD': 'YOUR_DB_PASSWORD',
        'HOST': 'localhost',
        'PORT': '',
    }
}
In testapp/models.py we add the following model:
from django.db import models
from django.utils import timezone


class Product(models.Model):
    name = models.CharField(max_length=20)
    attributes = models.JSONField()
    create_date = models.DateTimeField(default=timezone.now)
And finally, migrate the changes:
python3 manage.py makemigrations
python3 manage.py migrate
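One small detail worth confirming before we write raw SQL later in this article: by default Django names the table <app_label>_<model_name>, so our model lives in testapp_product. You can check this from the shell:

# python3 manage.py shell
from testapp.models import Product

# Default table name is "<app_label>_<model_name>"
print(Product._meta.db_table)  # prints: testapp_product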
Now, to test query speed, we need to add dummy data to the product table; for this test we insert 2 million rows.
If you are using my code, you can just run the following command:
python3 manage.py create_sample_data
If not, the following snippet creates the sample data for you:
from itertools import islice

from testapp.models import Product

batch_size = 5000
objs = (
    Product(
        name='product No.%s' % i,
        attributes={"name": i, "values": [i, i + 1]},
    )
    for i in range(2_000_000)
)

while True:
    batch = list(islice(objs, batch_size))
    if not batch:
        break
    Product.objects.bulk_create(batch, batch_size)
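If you are not using my repository and want to wrap the snippet above in your own management command, a minimal sketch could look like this (the file path follows Django's convention for custom commands; the actual command in my repository may differ). Remember that testapp/management/ and testapp/management/commands/ both need an empty __init__.py.

# testapp/management/commands/create_sample_data.py -- a sketch, not the repository's exact code
from itertools import islice

from django.core.management.base import BaseCommand

from testapp.models import Product


class Command(BaseCommand):
    help = "Insert 2,000,000 dummy Product rows for the benchmark."

    def handle(self, *args, **options):
        batch_size = 5000
        objs = (
            Product(
                name='product No.%s' % i,
                attributes={"name": i, "values": [i, i + 1]},
            )
            for i in range(2_000_000)
        )
        while True:
            batch = list(islice(objs, batch_size))
            if not batch:
                break
            Product.objects.bulk_create(batch, batch_size)
        self.stdout.write("done")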
The data stored in the attributes field looks as follows:
{
    "name": 1,
    "values": [1, 2]
}
Now let's implement the endpoint that measures query speed.
Add the following code to testapp/views.py; I will explain it below.
from random import randrange

from django.db import connection
from django.http import HttpResponse

from .models import Product


def product_view(request):
    rand_num = randrange(0, 2_000_000)

    # lookup by the "name" key
    _ = Product.objects.get(attributes__name=rand_num)
    dict_lookup_time = connection.queries[-1]["time"]

    # lookup by the "values" key (list containment)
    _ = Product.objects.get(
        attributes__values__contains=[rand_num, rand_num + 1],
    )
    list_lookup_time = connection.queries[-1]["time"]

    return HttpResponse(
        f"dict lookup time: {dict_lookup_time} "
        f"list lookup time: {list_lookup_time}"
    )
We want to query the JSONField on the "name" and "values" keys: first we generate a random number, then run a query using it.
rand_num = randrange(0, 2_000_000)

# search by the value of "name"
Product.objects.get(attributes__name=rand_num)

# search inside the value of "values"
Product.objects.get(attributes__values__contains=[rand_num, rand_num + 1])
And with this line we can find out how long the last query took (note that connection.queries is only populated when DEBUG=True):
connection.queries[-1]["time"]
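If you're curious what SQL Django actually generates for these two lookups, you can print it from the same connection.queries list in a Django shell (again, this only works when DEBUG=True):

# python3 manage.py shell  (with DEBUG=True in settings)
from django.db import connection
from testapp.models import Product

Product.objects.get(attributes__name=42)
print(connection.queries[-1]["sql"])

Product.objects.get(attributes__values__contains=[42, 43])
print(connection.queries[-1]["sql"])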
By adding the code below to jstest/urls.py we are done with the Django side.
from django.contrib import admin
from django.urls import path

from testapp.views import product_view

urlpatterns = [
    path('admin/', admin.site.urls),
    path('', product_view),
]
Start the project with the following command and open localhost:8000 in your browser:
python3 manage.py runserver
Now you will probably see something like this:
One second for a project with no load is not an acceptable time, so let's fix it.
To reduce the query time, we add an index to the attributes field:
CREATE INDEX JsonFieldIndex ON testapp_product USING GIN(attributes jsonb_path_ops);
Now let's refresh the web page and see if anything looks different.
Not bad, it's half the previous time, but still too long for a project with no load.
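By the way, if you prefer to keep indexes in your Django models instead of raw SQL, roughly the same GIN index can be declared with GinIndex in the model's Meta and created through a normal migration. This is just a sketch, assuming Django 3.1+ (the index name below is arbitrary):

from django.contrib.postgres.indexes import GinIndex
from django.db import models
from django.utils import timezone


class Product(models.Model):
    name = models.CharField(max_length=20)
    attributes = models.JSONField()
    create_date = models.DateTimeField(default=timezone.now)

    class Meta:
        # Same GIN / jsonb_path_ops index as the raw SQL above,
        # created by makemigrations / migrate instead.
        indexes = [
            GinIndex(
                fields=["attributes"],
                name="jsonfieldindex",
                opclasses=["jsonb_path_ops"],
            ),
        ]

After adding it, run makemigrations and migrate as usual.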
Now we create two separate indexes for "name" and "values", since we know that one value is an array and the other is an integer.
First, drop the previous index:
drop index JsonFieldIndex;
Then create the new indexes:
CREATE INDEX JsonFieldIndexList ON testapp_product USING GIN((attributes -> 'values') jsonb_path_ops);
CREATE INDEX JsonFieldIndex ON testapp_product USING HASH((attributes -> 'name'));
Now if we refresh the web page, we see this result.
The difference is noticeable, right? :)
Why did we use GIN and HASH for indexing?
CREATE INDEX JsonFieldIndexList ON testapp_product USING GIN((attributes -> 'values') jsonb_path_ops);
In the first case, our data looks something like this:
"values": [1, 2]
According to the Postgres documentation:
GIN indexes are “inverted indexes” which are appropriate for data values that contain multiple component values, such as arrays
GIN is the best index type for our purpose. We also don't want to index the keys, only the values (in our example, [1, 2]), which is why we use GIN with the jsonb_path_ops operator class.
CREATE INDEX JsonFieldIndex ON testapp_product USING HASH((attributes -> 'name'));
In the second case, our data looks like this:
"name": 1
According to the Postgres documentation:
Hash indexes can only handle simple equality comparisons. The query planner will consider using a hash index whenever an indexed column is involved in a comparison using the = operator.
A HASH index uses less space than other index types, and our query uses only the = operator, so the best index type for this purpose is HASH.
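If you want to double-check that PostgreSQL actually picks these indexes, you can run EXPLAIN ANALYZE through Django's database connection. The WHERE clauses below are roughly the shape of SQL the two ORM lookups compile to; the exact plan output will differ on your machine:

from django.db import connection

with connection.cursor() as cursor:
    # Equality on the "name" key -- should use the HASH index
    cursor.execute(
        "EXPLAIN ANALYZE SELECT id FROM testapp_product "
        "WHERE (attributes -> 'name') = '123'"
    )
    for (line,) in cursor.fetchall():
        print(line)

    # Containment on the "values" key -- should use the GIN index
    cursor.execute(
        "EXPLAIN ANALYZE SELECT id FROM testapp_product "
        "WHERE (attributes -> 'values') @> '[123, 124]'"
    )
    for (line,) in cursor.fetchall():
        print(line)

If the indexes are being used, the plans mention them (an index scan for the hash index, a bitmap index scan for the GIN index) instead of a sequential scan.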
If you want to read about the different indexes for Postgres JSON fields, read this article.
If you want to read more about Postgres index types, this is an article worth reading.