Today we are going to talk about full text search in Django. Even though Django is a batteries-included web framework, it does not give you one ready-made search feature, and there isn't one perfect way to implement search in your web application. There are many different ways to do it, and today we will look at a few of them. Let's start by listing the methods we can use, with a brief overview of each:
- Basic: When you have a small-scale app and just want a search view that filters items based on the user's query, without any added intelligence, you can use the filter() method provided by the Django ORM itself. The ORM also has Q objects, which you can use for the same purpose. We'll discuss the two in the subsequent sections.
- Full Text Search: Django provides full text search in the django.contrib.postgres module. As the name suggests, it only works when you use PostgreSQL as your database backend. Don't be disappointed yet: solutions exist for other databases as well, but they are not built into Django itself; they come as well-maintained third-party packages. If you're new to the field of search, you might be wondering what full text search means. Don't worry, we'll get to that point soon enough.
- Hosted Solutions: External services such as Algolia, Swiftype and many others provide search as a service. Apart from the obvious benefit, which is speed, they offer extras such as search analytics and richer ways of searching: finding titles that contain multiple words but not in the same order as the query, words only a certain distance apart, slightly misspelled words, phonetic matches, etc.
- External Tools: These are full-blown search services that you have to configure and run on your own. Examples are Elasticsearch and Solr, both of which are Lucene-based.
Today, we will only be talking about the first two methods, since I want to keep this beginner friendly. So let's jump right into it.
The image above is something you must have encountered on your favourite websites, and you may have wondered how they manage to seem so magical. But since you are reading this article, you are not just an average internet user; you are a developer, so let's break it down into components that you are already quite familiar with.
If you take a closer look, a search is simply a combination of:
- Forms
- Processing (or the magic part)
- Results
Forms are something that Django provides you with. Processing is essentially querying your database with the user's query. Results are the items you get after filtering the records saved in your database based on the user's query.
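To make the three parts concrete, here is a minimal sketch of the basic method built on filter() and Q objects. It assumes a model with title and overview fields (like the Post model defined further down); the function name search_posts is hypothetical, and a real view would read the query from a form rather than from a plain string.

```python
def search_posts(posts, raw_query):
    """Return the posts whose title or overview contains the user's query.

    `posts` is assumed to be a Django queryset, e.g. Post.objects.all().
    """
    # Form step: tidy up whatever the user typed into the search box.
    query = raw_query.strip()
    if not query:
        # Nothing to search for, so return an empty queryset.
        return posts.none()

    # Imported lazily; in a real project this would sit at the top of the module.
    from django.db.models import Q

    # Processing step: a case-insensitive substring match on either field.
    # Results step: the filtered queryset is what the template would render.
    return posts.filter(Q(title__icontains=query) | Q(overview__icontains=query))
```

Q objects are what allow the two conditions to be joined with OR; passing both conditions to filter() as keyword arguments would require a post to match on both fields.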
... will be continued later on.
Full Text Search
Let us start by defining full text search itself.
Textual search operators have existed in databases for years. PostgreSQL has the ~, ~*, LIKE and ILIKE operators for textual data types, but they lack many essential properties that your search needs in order to be as magical as the ones you've seen before:
- There is no linguistic support, even for English. Regular expressions are not sufficient because they cannot easily handle derived words, e.g., categories and category. You might miss documents that contain categories, although you would probably like to find them when searching for category. You could use OR to search for multiple derived forms, but this is tedious and error-prone (some words have several thousand derivatives).
- They provide no ordering (ranking) of search results, which makes them ineffective when thousands of matching documents are found.
- They tend to be slow because there is no index support, so they must process all documents for every search.
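The first limitation is easy to reproduce. The sketch below is only a Python imitation of a pattern such as ILIKE '%category%' (a case-insensitive substring test), not PostgreSQL itself; the helper name ilike_contains and the sample documents are made up for illustration.

```python
def ilike_contains(document, term):
    """Rough Python imitation of `document ILIKE '%term%'`."""
    return term.casefold() in document.casefold()


documents = [
    "Choosing a category for your post",
    "Browse all categories",
]

# Only the first document matches: "categories" does not contain the
# substring "category", so the derived form is silently missed.
matches = [doc for doc in documents if ilike_contains(doc, "category")]
print(matches)
```

The matches also come back in whatever order the documents were stored, which is the second limitation: nothing says which match is the better one.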
Full text searching allows documents to be preprocessed and be saved as an index for later rapid searching. Preprocessing includes:
- Parsing documents into tokens. It is useful to identify various classes of tokens, e.g., numbers, words, complex words, email addresses, so that they can be processed differently. PostgreSQL uses a parser to perform this step. A standard parser is provided, and custom parsers can be created for specific needs.
- Converting tokens into lexemes. A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case, and often involves removal of suffixes (such as es in English). This allows searches to find variant forms of the same word, without tediously entering all possible variants. Also, this step typically eliminates stop words, which are words that are so common that they are useless for searching (such as the word 'the' in English). PostgreSQL uses dictionaries to perform this step. Various standard dictionaries are provided, and custom ones can be created for specific needs.
- Storing preprocessed documents optimised for searching. For example, each document can be represented as a sorted array of normalized lexemes.
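The three preprocessing steps can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not what PostgreSQL does internally: the stop-word list is tiny, the suffix rules are deliberately naive, and the names tokenize, to_lexeme and preprocess are hypothetical.

```python
import re

# Tiny illustrative stop-word list; real dictionaries are much larger.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(document):
    """Parse a document into tokens (only alphabetic words are recognised here)."""
    return re.findall(r"[A-Za-z]+", document)


def to_lexeme(token):
    """Normalise a token: fold to lower case and strip a few English suffixes."""
    word = token.lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"  # categories -> category
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # posts -> post
    return word


def preprocess(document):
    """Return the document as a sorted list of unique lexemes, minus stop words."""
    lexemes = {to_lexeme(token) for token in tokenize(document)}
    return sorted(lexemes - STOP_WORDS)


print(preprocess("The Categories of the posts"))  # ['category', 'post']
```

Because the query goes through the same function, a search for category and a document containing Categories end up with the same lexeme and therefore match.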
A data type, tsvector, is provided for storing preprocessed documents, along with a type, tsquery, for representing processed queries.
I know you have been eagerly waiting for this moment, so let's jump right into the code.
Let's start by defining the model we will be working with from here on. It's a Post model, as in blog post, defined as:
    from django.db import models


    class Post(models.Model):
        title = models.CharField(max_length=100)
        overview = models.CharField(max_length=200)
        content = models.TextField()
To be able to use the full text search feature of PostgreSQL, we need to add the following to INSTALLED_APPS within our settings.py:

    # blog/settings.py
    INSTALLED_APPS = [
        ...
        'django.contrib.postgres',  # new
    ]
Querying Single Field
The simplest way to search is to match a single term against a single column, for example by filtering Post on its title with the search lookup (title__search).
This performs a full text search behind the scenes and returns the posts whose title matches. But you might have noticed something: it only searches against a single field, which is rather limiting.
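As a minimal sketch of a single-field search, assuming the Post model above and the search lookup that django.contrib.postgres makes available; the helper name search_by_title is hypothetical:

```python
def search_by_title(posts, term):
    """Full text search for `term` against the title column only.

    `posts` is assumed to be a manager or queryset of Post, e.g. Post.objects.
    """
    # The `search` lookup needs 'django.contrib.postgres' in INSTALLED_APPS
    # and a PostgreSQL database backend.
    return posts.filter(title__search=term)
```

In a Django shell this would be called as search_by_title(Post.objects, 'vortex').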
Querying Multiple Fields
The Post model we have been querying against also contains a field named overview. To query against both fields, we need to use SearchVector:

    >>> from django.contrib.postgres.search import SearchVector
    >>> Post.objects.annotate(search=SearchVector('title', 'overview')).filter(search='vortex')
Preprocessing the user query
By default the order of the words in the query is not relevant, i.e. Django performs a keyword-based search. If you want to find the items containing the text in the exact order given in the query, you need to perform a phrase search, which can be done as follows:

    >>> from django.contrib.postgres.search import SearchQuery
    >>> SearchQuery('red tomato', search_type='phrase')
If you do not pass the search_type argument, it defaults to 'plain', which results in a keyword-based search.

    >>> SearchQuery('red tomato')  # two keywords
    >>> SearchQuery('tomato red')  # same results as above
SearchQuery objects can also be combined to perform more advanced operations, such as:

    >>> SearchQuery('foo') | SearchQuery('bar')  # will search for either foo or bar
    >>> SearchQuery('foo') & SearchQuery('bar')  # will search for both foo and bar
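A SearchQuery only does something once it is used in a filter. The sketch below shows one way to do that, assuming the Post model above; the helper name phrase_search is hypothetical.

```python
def phrase_search(posts, phrase):
    """Return posts whose title or overview contains `phrase` as an exact phrase.

    `posts` is assumed to be a manager or queryset of Post, e.g. Post.objects.
    """
    # Imported lazily; in a real project these would sit at the top of the module.
    from django.contrib.postgres.search import SearchQuery, SearchVector

    vector = SearchVector("title", "overview")
    query = SearchQuery(phrase, search_type="phrase")
    # Annotate each row with its search vector, then match it against the query.
    return posts.annotate(search=vector).filter(search=query)
```

A combined query such as SearchQuery('foo') | SearchQuery('bar') can be passed to filter() in the same way.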
A trigram is a sequence of three consecutive characters in a string. For example, the trigrams of Donut include don, onu and nut. PostgreSQL splits a string into words and determines the trigrams for each word separately. It also normalizes each word by downcasing it.
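The definition translates directly into a short function. This is a simplified sketch: it splits on whitespace only, and it leaves out the padding with spaces that PostgreSQL's trigram extension applies to each word before splitting it.

```python
def trigrams(text):
    """Return the set of three-character sequences of each word in `text`."""
    result = set()
    # Downcase first, then handle every word separately.
    for word in text.lower().split():
        for start in range(len(word) - 2):
            result.add(word[start:start + 3])
    return result


print(sorted(trigrams("Donut")))  # ['don', 'nut', 'onu']
```

Two strings that share many trigrams are likely to be similar, which is what makes trigrams useful for matching slightly misspelled words.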
Full text search doesn't care about word order. It relies on stemming, which reduces words to the base form they originate from, and on ignoring stop words such as 'is' and 'the', which don't carry much meaning in a text.
The problem with dedicated search engines is that they use their own data stores. Say you had only your PostgreSQL database before; now, next to it, you have a second one, the search engine's database, and the two need to be kept in sync. That means saving new items to the search database as well as deleting items that are no longer present in your PostgreSQL database, which can be done using transaction hooks.
With such tools, there may be times when the search engine's database still holds information that is no longer present in the PostgreSQL database.
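One way to picture the synchronisation is as a comparison of the primary keys held on each side. The function below is a hypothetical sketch of such a reconciliation step; it only covers additions and deletions, not items whose content has changed, and it does not talk to any particular search engine.

```python
def plan_sync(database_ids, index_ids):
    """Work out what must change in the search index to mirror the database."""
    database_ids = set(database_ids)
    index_ids = set(index_ids)
    to_add = database_ids - index_ids  # saved in the database, not yet indexed
    to_delete = index_ids - database_ids  # indexed, but gone from the database
    return sorted(to_add), sorted(to_delete)


print(plan_sync(database_ids=[1, 2, 3], index_ids=[2, 3, 4]))  # ([1], [4])
```

Here item 4 is exactly the kind of stale entry described above: it still lives in the search index although the database no longer has it.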