Managing Elasticsearch in Django like a pro
I have used Elasticsearch in a couple of my projects recently and I really love it. It’s an excellent backend for building search engines or even for data analysis.
This article assumes you have some understanding of Elasticsearch and what it’s capable of, so I won’t go into installation or other introductory material. There are many articles that show you how to set up ES and create indexes and queries.
Coming to Django.
Whenever you want to build a search engine in your Django project, you search for readily available Django packages, and one you will find is django-haystack, which is very widely used in the Django community. It supports a lot of backends including ES, Whoosh, Solr, etc. Developers use it because it’s super easy to set up and configure your index, and the query syntax is similar to Django’s ORM, so it’s a total win-win.
But managing a really big open source project like this takes a lot of effort, and it’s been lacking support for newer versions of Elasticsearch. So we look for an alternative.
Alternative to django-haystack
Elasticsearch itself is really easy to use and we can communicate with it using its REST API alone. But firing REST queries every time to perform an action is quite messy, so there is a high-level Python client, elasticsearch-dsl-py. This is an official ES Python client, so it’s going to stay updated for the latest versions of ES (currently 6.x.y). It is built upon another, low-level Python client, elasticsearch-py, which you can also use directly for some advanced features in ES (or maybe to run raw queries).
Setting up elasticsearch-dsl-py
You obviously need to download Elasticsearch, and I would suggest you download Kibana too, as we are going to use it later.
Installing elasticsearch-dsl-py is as simple as running:
pip install elasticsearch-dsl
This article assumes you are using Elasticsearch 6.x.y and Django 2+.
There are some breaking changes in ES 6+, especially the removal of mapping types, which you will want to read up on if you are coming from an older version.
Some things to note
I like django-haystack a lot. The way it lets you create an index document and manage ES using different management commands (rebuild_index, update_index, etc.) is really cool.
Elasticsearch-dsl-py tries to provide those functions with a simple-to-use ORM and document creation, but doesn’t provide any features to manage the ES backend.
So this article is my attempt to imitate some of django-haystack’s features to create a solid management system for elasticsearch-dsl-py, and I’ll be referring to django-haystack a lot.
Alright, let’s begin.
Configuring the index and document
This is an example from the elasticsearch-dsl-py documentation.
The document config is fairly easy to understand. The rest of the code refers to the document life cycle in elasticsearch-dsl-py.
Article.init() is used to create the index and mappings in ES before indexing the document. It has to be called every time we set up the index. So it’s a manual process that we need to automate, something we never had to worry about in django-haystack.
This is how we are going to do it.
As you can see, it looks similar, but there are some extra methods in the class which are inspired by haystack.
get_index_queryset: overrides the queryset from the abstract class. The default is Model.objects.all().
get_updated_field: (required) used for fetching recent, not-yet-indexed data based on a datetime value.
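To make the two hooks a bit more concrete, here is a minimal, framework-free sketch. Everything in it is an assumption for illustration: the names `BaseIndex`, `ArticleIndex` and `recent` are hypothetical, and a list of dicts stands in for a Django queryset. In a real project get_index_queryset would return something like Model.objects.all() and the age filter would be a queryset filter instead of a list comprehension.

```python
from datetime import datetime, timedelta


class BaseIndex:
    """Hypothetical stand-in for the abstract index class."""

    # Stand-in for a Django model's table: a list of dict rows.
    rows = []

    def get_index_queryset(self):
        # Default: index everything (Model.objects.all() in Django).
        return list(self.rows)

    def get_updated_field(self):
        # Required: subclasses name the datetime field used for age filtering.
        raise NotImplementedError("define get_updated_field()")

    def recent(self, age_hours, now=None):
        """Return rows updated within the last `age_hours`; 0 means all rows."""
        rows = self.get_index_queryset()
        if not age_hours:
            return rows
        now = now or datetime.now()
        cutoff = now - timedelta(hours=age_hours)
        field = self.get_updated_field()
        return [row for row in rows if row[field] >= cutoff]


class ArticleIndex(BaseIndex):
    rows = [
        {"id": 1, "published": True, "modified": datetime(2019, 1, 1, 12, 0)},
        {"id": 2, "published": True, "modified": datetime(2019, 1, 2, 11, 0)},
        {"id": 3, "published": False, "modified": datetime(2019, 1, 2, 11, 30)},
    ]

    def get_index_queryset(self):
        # Override: only published articles should be indexed.
        return [row for row in self.rows if row["published"]]

    def get_updated_field(self):
        return "modified"
```

The point of splitting it this way is that the base class owns the filtering logic, while each index only declares what to index and which field says when a row last changed.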
There is also an abstract index, but before going to that let’s create our management command for managing the index.
This is going to be our controller for managing the index and data.
These are the settings we need to define.
ES_CONNECTIONS: connection settings; can support multiple hosts.
ES_INDEXES: determines which index is on which host.
ES_DEFAULT_BATCH_SIZE: default batch size for bulk indexing. Can be overridden in the management command.
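As a rough idea of what such settings could look like, here is a sketch. Only the three setting names come from this article; the dict shapes, the connection aliases, the host values and the `hosts_for_index` helper are assumptions made up for the example.

```python
# settings.py (sketch): the shapes and values below are assumptions.

# One entry per connection alias; each alias can list several hosts.
ES_CONNECTIONS = {
    "default": {"hosts": ["localhost:9200"]},
    "analytics": {"hosts": ["es1.example.com:9200", "es2.example.com:9200"]},
}

# Index name -> connection alias, i.e. which index lives on which host(s).
ES_INDEXES = {
    "blog": "default",
    "events": "analytics",
}

# Number of documents sent per bulk request unless the command overrides it.
ES_DEFAULT_BATCH_SIZE = 500


def hosts_for_index(index_name):
    """Hypothetical helper: resolve an index name to its configured hosts."""
    alias = ES_INDEXES[index_name]
    return ES_CONNECTIONS[alias]["hosts"]
```

With a shape like this, the connections dict could presumably be handed to elasticsearch-dsl-py’s connections.configure(**ES_CONNECTIONS) at startup, since that call takes one keyword argument per alias.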
And here’s the management command.
As you can see, it’s pretty straightforward. It is only used as a controller that dispatches to different methods, and all the methods are defined in the abstract class. So the command stays clean.
These are again inspired by haystack :P
batch-size: specify a batch size to override the default setting.
remove: removes stale documents from the index, i.e. items that no longer exist in your DB but still exist in the index.
index: specify which index to use.
clear_index: clears the index and then runs the indexing.
age: how recent the posts you wish to index should be (in hours; default=0, meaning all). This is where the get_updated_field() method is used.
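Here is a small sketch of how these options could be declared. In a Django command they would go inside add_arguments(self, parser), which receives an argparse-based parser; to keep the sketch runnable on its own I use a plain argparse.ArgumentParser instead. The flag names follow the list above, while the types, defaults and help texts are my assumptions.

```python
import argparse


def add_arguments(parser):
    """Register the index_documents options on an argparse parser."""
    parser.add_argument("--batch-size", dest="batch_size", type=int, default=None,
                        help="Override ES_DEFAULT_BATCH_SIZE for bulk indexing.")
    parser.add_argument("--remove", action="store_true",
                        help="Delete documents whose DB row no longer exists.")
    parser.add_argument("--index", default=None,
                        help="Only process the named index.")
    parser.add_argument("--clear_index", action="store_true",
                        help="Clear the index before indexing.")
    parser.add_argument("--age", type=int, default=0,
                        help="Only index rows updated in the last N hours (0 = all).")


# Stand-alone usage; Django would build the parser and call handle(**options).
parser = argparse.ArgumentParser(prog="index_documents")
add_arguments(parser)
options = parser.parse_args(["--remove", "--age", "24"])
```

Leaving batch_size as None by default lets the command fall back to the setting only when the flag was not passed.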
All the methods in the command run some method in the abstract index. So we create that now.
This holds all the business logic required to manage our index.
The article is long enough already, so I won’t explain all the methods used in the abstract class here. I have explained them in the code comments instead.
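Two pieces of that logic are easy to show in isolation: splitting the data into batches for bulk indexing, and working out which documents are stale. This is only a sketch with hypothetical helper names (`batched`, `stale_ids`), not the actual methods of the abstract class.

```python
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of at most `batch_size` items, one list per bulk request."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def stale_ids(indexed_ids, database_ids):
    """IDs that are in the index but no longer in the database."""
    return set(indexed_ids) - set(database_ids)
```

Each batch could then be passed to a bulk helper such as elasticsearch.helpers.bulk from elasticsearch-py, and the stale IDs deleted from the index when the remove option is given.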
You have reached here. Great. Let’s look at some examples where we can apply our code.
- Simple Indexing
python3 manage.py index_documents
- Updating (remove stale documents and index items updated in the last 24 hours)
Maybe you want to set up a cron job to run this every night.
python3 manage.py index_documents --remove --age 24
- Clear everything and rebuild
python3 manage.py index_documents --clear_index
- Specify only a single index (if you have multiple set up)
python3 manage.py index_documents --index blog
This was my attempt to simplify managing ES in Django. For querying ES and performing other actions, elasticsearch-dsl-py is capable of handling every use case. It just lacks management features like haystack’s.
There are obviously more things that can be done besides this, but for now this is enough.
I know I could have created a reusable app for this but I had time constraints and a blog post was easier.
To be honest, this got quite long. If you were patient enough to read it in full and found it interesting, then please share it.