Full-Text Search in Django with PostgreSQL

Article based on my talk about Full-Text Search in Django with PostgreSQL.

Image for post
Image for post
http://www.paulox.net/2017/12/22/full-text-search-in-django-with-postgresql/

I’ve given this talk in: Pycon Otto 2017 (Florence), EuroPython 2017 (Rimini), PGDay.IT 2017 (Milan), PyRoma 2017.11 (Rome)

Goal

Motivation

Contents

These are the main topics of this article:

  • external products for Full-Text Search
  • Full-Text Search support in PostgreSQL
  • Django Support for Full-Text Search
  • the concertiaroma.com real-world project
  • upcoming innovation in Full-Text Search
  • some personal conclusions
  • useful resources

Full-Text Search

Full-Text Search derives from the need to do some searches in computer-stored documents. For example, to find all documents that contain specific words and their variations. If a document contains “house” or “houses” it will be the same for the search.

FTS Features

This is a list of some features that we can find in a quite advanced Full-Text Search, to be used in a real-world website.

  • Ranking
  • Stop-words removal
  • Multiple language support
  • Accent support
  • Indexing
  • Phrase search

Tested Products

Elastic and Solr are two software programs for the Full-Text Search that are popular today. There are others, but these are the only ones that I have used in my professional projects. They are Lucene based and written in Java.

  • Apache Solr

Elasticsearch

Snap Market

Snap market was a startup where I worked which produced a mobile phone application for buying and selling used items with about half a million mobile users.

Issues

Management problems

@@ -52,7 +52,8 @@ public class DecompoundTokenFilter … {
- posIncAtt.setPositionIncrement(0);
+ if (!subwordsonly)
+ posIncAtt.setPositionIncrement(0);
return true;
}

Apache Solr

GoalScout

GoalScout.com is website dedicated to showing sport videos uploaded by public users with about 30,000 videos.

Issues

Synchronization problems

Existing Products

PROS

  • Full featured products
  • Resources (documentations, articles, …)

CONS

  • Synchronization
  • Mandatory use of driver (haystack, bungiesearch, …)

Ops Oriented

The focus is on system integrations. I am more a #dev than an #ops so I don’t like to be forced to integrate various systems. I prefer developing and solving problems by writing python code.

FTS in PostgreSQL

PostgreSQL has been supporting Full-Text Search since 2008. Internally it uses “tsvecor” and “tsquery” data type to process the data to search. It has some indexes that can be used to speed up the search: GIN and GiST. PostgreSQL added support for Phrase Search in 2016. Support for JSON[B] Full-Text Search was added in version 10.

  • TSVECTOR to represent text data
  • TSQUERY to represent search predicates
  • Special Indexes (GIN, GIST)
  • Phrase Search since version 9.6 (2016)
  • FTS for JSON[B] since version 10 (2017)

What are documents

The “document” is the general concept used in Full-Text Search and where the search is done. In a database a document can be a field on a table, the combination of many fields in a table or in different tables.

Django Support

Features

The module django.contrib.postgres contains the support to Full-Text Search in Django, since the version 1.10. BRIN and GINindexes were added in the version 1.11. The GIN index is very useful to speed up Full-Text Search.

Dev Oriented

The focus is on programming. The use of Postgres Full-Text Search in Django is more developer friendly.

Making queries

We can look at the functions of Full-Text Search in Django starting from the models present in the Django official documentation. We have a Blog and an Author class connected through an Entry class The Django ORM creates tables and constructs queries.

from django.contrib.postgres.search import SearchVectorField
from django.db import models
class Blog(models.Model):
name = models.CharField(max_length=100)
tagline = models.TextField()
lang = models.CharField(max_length=100, default='english')
def __str__(self):return self.name
class Author(models.Model):
name = models.CharField(max_length=200)
email = models.EmailField()
def __str__(self):return self.name
class Entry(models.Model):
blog = models.ForeignKey(Blog, on_delete=models.CASCADE)
headline = models.CharField(max_length=255)
body_text = models.TextField()
pub_date = models.DateField(auto_now_add=True)
mod_date = models.DateField(auto_now=True)
authors = models.ManyToManyField(Author)
n_comments = models.IntegerField(default=0)
n_pingbacks = models.IntegerField(default=0)
rating = models.IntegerField(default=5)
search_vector = SearchVectorField(null=True)
def __str__(self):return self.headline

Standard queries

These are the basic searches that we can use on models in Django using “filter”.

from blog.models import Author, Blog, EntryAuthor.objects.filter(
name__contains='Terry'
).values_list('name', flat=True)
SELECT "blog_author"."name"
FROM "blog_author"
WHERE "blog_author"."name"::text LIKE '%Terry%'
['Terry Gilliam', 'Terry Jones']
Author.objects.filter(
name__icontains='ERRY'
).values_list('name', flat=True)
SELECT "blog_author"."name"
FROM "blog_author"
WHERE UPPER("blog_author"."name"::text) LIKE UPPER('%ERRY%')
['Terry Gilliam', 'Terry Jones', 'Jerry Lewis']

Unaccented query

By activating the unaccent PostgreSQL module, we can use the “unaccent” extension.

from django.contrib.postgres.operations import UnaccentExtension
class Migration(migrations.Migration):
...
operations = [
UnaccentExtension(),
...
]
CREATE EXTENSION unaccent;
Author.objects.filter(
name__unaccent='Helene Joy'
).values_list('name', flat=True))
SELECT "blog_author"."name"
FROM "blog_author"
WHERE UNACCENT("blog_author"."name") = (UNACCENT('Helene Joy'))
['Hélène Joy']

Trigram similarity

By activating the trigram PosgreSQL module, we can use the “trigram” extension.

from django.contrib.postgres.operations import TrigramExtension
class Migration(migrations.Migration):
...
operations = [
TrigramExtension(),
...
]
CREATE EXTENSION pg_trgm;
Author.objects.filter(
name__trigram_similar='helena'
).values_list('name', flat=True))
SELECT "blog_author"."name"
FROM "blog_author"
WHERE "blog_author"."name" % 'helena'
['Helen Mirren', 'Helena Bonham Carter']

Search lookup

This is the base search lookup of Django and with this we can execute a real Full-Text Search call on a field.

Entry.objects.filter(
body_text__search='Cheese'
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ (plainto_tsquery('Cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']

SearchVector

We can use a “SearchVector” to build our document in more fields of the same object or connected objects too. Then we can filter on the document with a string.

from django.contrib.postgres.search import SearchVectorsearch_vector = SearchVector('body_text', 'blog__name')Entry.objects.annotate(
search=search_vector
).filter(
search='Cheese'
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
INNER JOIN "blog_blog"
ON ("blog_entry"."blog_id" = "blog_blog"."id")
WHERE to_tsvector(
COALESCE("blog_entry"."body_text", '') || ' ' ||
COALESCE("blog_blog"."name", '')
) @@ (plainto_tsquery('Cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']

SearchQuery

When we insert text into a Full-Text Search by using a “search query” we can apply “stemming” and “stop-word removal” even to the user texts. and to these we can apply basic logical operations.

NOT

Python

from django.contrib.postgres.search import SearchQuerysearch_query = ~SearchQuery('toast')
search_vector = SearchVector('body_text')
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ (!!(plainto_tsquery('toast'))) = true
['Pizza Recipes', 'Pain perdu']

OR

Python

search_query = SearchQuery('cheese') | SearchQuery('toast')
search_vector = SearchVector('body_text')
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ ((
plainto_tsquery('cheese') ||
plainto_tsquery('toast')
)) = true
['Cheese on Toast recipes', 'Pizza Recipes']

AND

Python

search_query = SearchQuery('cheese') & SearchQuery('toast')
search_vector = SearchVector('body_text')
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(COALESCE("blog_entry"."body_text", ''))
@@ ((
plainto_tsquery('cheese') &&
plainto_tsquery('toast')
)) = true
['Cheese on Toast recipes']

SearchRank

We can use the PostgreSQL “rank” to calculate the score of a document in relation to a searched text, and we can use it to filter and sort it.

from django.contrib.postgres.search import SearchRanksearch_vector = SearchVector('body_text')
search_query = SearchQuery('cheese')
search_rank = SearchRank(search_vector, search_query)
Entry.objects.annotate(
rank=search_rank
).order_by(
'-rank'
).values_list('headline', 'rank'))
SELECT "blog_entry"."headline",
ts_rank(
to_tsvector(COALESCE("blog_entry"."body_text", '')),
plainto_tsquery('cheese')
) AS "rank"
FROM "blog_entry"
ORDER BY "rank" DESC
[
('Cheese on Toast recipes', 0.0889769),
('Pizza Recipes', 0.0607927),
('Pain perdu', 0.0)
]

Search configuration

We can setup the “search vector” or “search query” to execute “stemming” or “stop words removal” for a specific language.

language = 'french'search_vector = SearchVector('body_text', config=language)
search_query = SearchQuery('œuf', config=language)
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE to_tsvector(
'french'::regconfig,
COALESCE("blog_entry"."body_text", ''))
@@ (
plainto_tsquery('french'::regconfig, 'œuf')
) = true
['Pain perdu']
from django.db.models import Flanguage = F('blog__lang')search_vector = SearchVector('body_text', config=language)
search_query = SearchQuery('œuf', config=language)
Entry.objects.annotate(
search=search_vector
).filter(
search=search_query
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
INNER JOIN "blog_blog"
ON ("blog_entry"."blog_id" = "blog_blog"."id")
WHERE to_tsvector(
"blog_blog"."lang"::regconfig,
COALESCE("blog_entry"."body_text", '')
)
@@ (
plainto_tsquery("blog_blog"."lang"::regconfig, 'œuf')
) = true
['Pain perdu']

Queries weighting

It’s possible to set up the search to give a different weight to various fields and then use these values in searches.

search_vector = SearchVector('body_text', weight='A') +
SearchVector('headline', weight='B')
search_query = SearchQuery('cheese')
search_rank = SearchRank(search_vector, search_query)
Entry.objects.annotate(
rank=search_rank
).order_by(
'-rank'
).values_list('headline', 'rank'))
SELECT "blog_entry"."headline",
ts_rank((
setweight(
to_tsvector(COALESCE("blog_entry"."body_text", '')),
'A'
)
||
setweight(
to_tsvector(COALESCE("blog_entry"."headline", '')),
'B'
)
), plainto_tsquery('cheese')) AS "rank"
FROM "blog_entry"
ORDER BY "rank" DESC
[
('Cheese on Toast recipes', 0.896524),
('Pizza Recipes', 0.607927),
('Pain perdu', 0.0)
]

SearchVectorField

If we want to speed up and simplify the query execution we can add a “search vector field” to the model and then execute searches on this specific field.

Entry.objects.filter(
search_vector='cheese'
).values_list('headline', flat=True))
SELECT "blog_entry"."headline"
FROM "blog_entry"
WHERE "blog_entry"."search_vector"
@@ (plainto_tsquery('cheese')) = true
['Cheese on Toast recipes', 'Pizza Recipes']
search_vector = SearchVector('body_text')Entry.objects.update(search_vector=search_vector)
UPDATE "blog_entry"
SET "search_vector" = to_tsvector(
COALESCE("blog_entry"."body_text", ''))

www.concertiaroma.com

  • about 15,000 bands
  • more than 16,000 shows
  • about 200 festivals
  • about 30,000 user/month

Version 2.0

The old version of the website was developed some years ago with Django 1.7 and it runs on Python 2.7. The data was managed by PostgreSQL version 9.1 and the search was performed by using the SQL LIKE syntax.

  • Django 1.7
  • PostgreSQL 9.1
  • SQL LIKE

Version 3.0

The new version, recently released, was developed with Django 1.11 and it runs on Python 3.6. The data is managed by PostgreSQL 9.6 and the search uses its Full-Text Search engine.

  • Django 1.11
  • PostgreSQL 9.6
  • PG FTS

Band models

We can look at the functions of Full-Text Search in www.concertiaroma.com starting from the same models present in the project. We have a Genre class connected to a Band class.

from django.db import models
from .managers import BandManager
class Genre(models.Model):
name = models.CharField(max_length=255)
class Band(models.Model):
nickname = models.CharField(max_length=255)
description = models.TextField()
genres = models.ManyToManyField(Genre)
objects = BandManager()

Band Manager

This is an example of “Manager” for the Band class which defines a search method that contains all the Full-Text Search logic.

from django.contrib.postgres.aggregates import StringAgg
from django.contrib.postgres.search import (
SearchQuery, SearchRank, SearchVector, TrigramSimilarity,
)
from django.db import models
search_vectors = (
SearchVector('nickname', weight='A', config='english') +
SearchVector(StringAgg('genres__name', delimiter=' '), weight='B', config='english') +
SearchVector('description', weight='D', config='english')
)
class BandManager(models.Manager): def search(self, text):
search_query = SearchQuery(text, config='english')
search_rank = SearchRank(search_vectors, search_query)
trigram_similarity = TrigramSimilarity('nickname', text)
return self.get_queryset().annotate(
search=search_vectors
).filter(
search=search_query
).annotate(
rank=search_rank + trigram_similarity
).order_by('-rank')

Band Test Setup

To better understand the mechanism, we can take into consideration an example of a simplified test.

from collections import OrderedDict
from django.test import TestCase
from .models import Band, Genre
class BandTest(TestCase):def setUp(self):# Genres
blues, _ = Genre.objects.get_or_create(name='Blues')
jazz, _ = Genre.objects.get_or_create(name='Jazz')
swing, _ = Genre.objects.get_or_create(name='Swing')
# Bands
ella_fitzgerald, _ = Band.objects.get_or_create(
nickname='Ella Fitzgerald',
description=(
'Ella Jane Fitzgerald (25 April 1917–15 June 1996) '
'was an American jazz singer often referred to as '
'the First Lady of Song, Queen of Jazz and Lady '
'Ella. She was noted for her purity of tone, '
'impeccable diction, phrasing and intonation, and '
'a horn-like improvisational ability, '
'particularly in her scat singing.'))
django_reinhardt, _ = Band.objects.get_or_create(
nickname='Django Reinhardt',
description=(
'Jean Django Reinhardt 23 January 1910 16 May 1953 '
'was a Belgian-born, Romani French jazz guitarist '
'and composer, regarded as one of the greatest '
'musicians of the twentieth century. He was the '
'first jazz talent to emerge from Europe and '
'remains the most significant.'))
louis_armstrong, _ = Band.objects.get_or_create(
nickname='Louis Armstrong',
description=(
'Louis Armstrong (August 4, 1901 – July 6, 1971), '
'nicknamed Satchmo, Satch and Pops, was an American'
' trumpeter, composer, singer and occasional actor '
'who was one of the most influential figures in '
'jazz. His career spanned five decades, from the '
'1920s to the 1960s, and different eras in the '
'history of jazz.'))
# Bands and Genres
ella_fitzgerald.genres.add(blues)
django_reinhardt.genres.add(jazz)
louis_armstrong.genres.add(blues, swing)
def test_band_search(self):# ...

Band Test Method

In the search test on the bands we simply invoked the search method giving a search text and we got back the list of values for the fields ‘nickname’ and ‘rate’. ‘nickname’ is stored on the band table, while ‘rate’ is calculated by our search method at runtime.

from collections import OrderedDict
from django.test import TestCase
from .models import Band, Genre
class BandTest(TestCase): def setUp(self):# ... def test_band_search(self):
band_queryset = Band.objects.search(
'jazz').values_list('nickname', 'rank')
band_objects = list(
OrderedDict(band_queryset).items())
band_list = [
('Django Reinhardt', 0.265124),
('Ella Fitzgerald', 0.0759909),
('Louis Armstrong', 0.0759909)]
self.assertSequenceEqual(band_objects, band_list)

What’s next

We have seen a simplified use of the current features of Django and the PostgreSQL Full-Text Search.

  • Multiple language configuration
  • Search suggestions
  • SearchVectorField with triggers
  • JSON/JSONB Full-Text Search
  • RUM indexing

Conclusions

In conclusion, the following are the conditions for evaluating the implementation of a Full-Text Search with PostgreSQL in Django:

  • not doing very complex searches
  • managing all the components easily
  • avoiding data synchronization between different systems
  • having PostgreSQL already in the stack
  • operating in a Python-only environment

Acknowledgements

20tab

For all the support (www.20tab.com )

Marc Tamlyn

For django.contrib.postgres (github.com/mjtamlyn)

Resources

These are the resources that I used to prepare this article and to develop the search function I have showed you.

Thank you

CC BY-SA

This article and related presentation is released with Creative Commons Attribution ShareAlike license.

Source Code

I published the source code used in this article on GitHub.

Slides

You can download the presentation from my SpeakerDeck account.

Originally posted on http://www.paulox.net/2017/12/22/full-text-search-in-django-with-postgresql/

Written by

🐍#Python developer 🦄#Django contributor ‍🗣️#Conference speaker 🏡#Remote worker 🐧#GNULinux user 👨‍💻#FreeSoftware advocate 🌍 www.paulox.net

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store