What is Indexing?
Indexing makes columns faster to query by creating pointers to where data is stored within a database.
Imagine you want to find a piece of information that is within a large database. To get this information out of the database the computer will look through every row until it finds it. If the data you are looking for is towards the very end, this query would take a long time to run.
Visualization for finding the last entry:
If the table was ordered alphabetically, searching for a name could happen a lot faster because we could skip looking for the data in certain rows. If we wanted to search for “Zack” and we know the data is in alphabetical order we could jump down to halfway through the data to see if Zack comes before or after that row. We could then half the remaining rows and make the same comparison.
This took 3 comparisons to find the right answer instead of 8 in the unindexed data.
Indexes allow us to create sorted lists without having to create all new sorted tables, which would take up a lot of storage space.
What Exactly is an Index?
An index is a structure that holds the field the index is sorting and a pointer from each record to their corresponding record in the original table where the data is actually stored. Indexes are used in things like a contact list where the data may be physically stored in the order you add people’s contact information but it is easier to find people when listed out in alphabetical order.
Let’s look at the index from the previous example and see how it maps back to the original Friends table:
We can see here that the table has the data stored ordered by an incrementing id based on the order in which the data was added. And the Index has the names stored in alphabetical order.
Types of Indexing
There are two types of databases indexes:
Both clustered and non-clustered indexes are stored and searched as B-trees, a data structure similar to a binary tree. A B-tree is a “self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time.” Basically it creates a tree-like structure that sorts data for quick searching.
Here is a B-tree of the index we created. Our smallest entry is the leftmost entry and our largest is the rightmost entry. All queries would start at the top node and work their way down the tree, if the target entry is less than the current node the left path is followed, if greater the right path is followed. In our case it checked against Matt, then Todd, and then Zack.
To increase efficiency, many B-trees will limit the number of characters you can enter into an entry. The B-tree will do this on it’s own and does not require column data to be restricted. In the example above the B-tree below limits entries to 4 characters.
Clustered indexes are the unique index per table that uses the primary key to organize the data that is within the table. The clustered index ensures that the primary key is stored in increasing order, which is also the order the table holds in memory.
- Clustered indexes have to be explicitly declared in the case of Postgres.
- Created when the table is created.
- Use the primary key sorted in ascending order.
Creating Clustered Indexes
The clustered index will be automatically created when the primary key is defined:
CREATE TABLE friends (id INT PRIMARY KEY, name VARCHAR, city VARCHAR);
Once filled in, that table would look something like this:
The created table, “friends”, will have a clustered index automatically created, organized around the Primary Key “id” called “friends_pkey”:
When searching the table by “id”, the ascending order of the column allows for optimal searches to be performed. Since the numbers are ordered, the search can navigate the B-tree allowing searches to happen in logarithmic time.
However, in order to search for the “name” or “city” in the table, we would have to look at every entry because these columns do not have an index. This is where non-clustered indexes become very useful.
Non-clustered indexes are sorted references for a specific field, from the main table, that hold pointers back to the original entries of the table. The first example we showed is an example of a non-clustered table:
They are used to increase the speed of queries on the table by creating columns that are more easily searchable. Non-clustered indexes can be created by data analysts/ developers after a table has been created and filled.
Non-clustered indexes point to memory addresses instead of storing data themselves. This makes them slower to query than clustered indexes but typically much faster than a non-indexed column.
You can create many non-clustered indexes. As of 2008, you can have up to 999 non-clustered indexes in SQL Server and there is no limit in PostgreSQL.
Creating Non-Clustered Databases(PostgreSQL)
A partial index covers just a subset of a table’s data. It is an index with a WHERE clause. The idea is to increase the efficiency of the index by reducing its size. A smaller index takes less storage, is easier to maintain, and is faster to scan.
For example, suppose you allow users to flag comments on your site, which in turn sets the
flagged boolean to true. You then process flagged comments in batches. You may want to create an index like so:
CREATE INDEX articles_flagged_created_at_index ON articles(created_at) WHERE flagged IS TRUE;
This index will remain fairly small, and can also be used along other indexes on the more complex queries that may require it.
Expression indexes are useful for queries that match on some function or modification of your data. Postgres allows you to index the result of that function so that searches become as efficient as searching by raw data values. For example, you may require users to store their email addresses for signing in, but you want case insensitive authentication. In that case it’s possible to store the email address as is, but do searches on
WHERE lower(email) = '<lowercased-email>'. The only way to use an index in such a query is with an expression index like so:
CREATE INDEX users_lower_email ON users(lower(email));
Another common example is for finding rows for a given date, where we’ve stored timestamps in a datetime field but want to find them by a date casted value. An index like
CREATE INDEX articles_day ON articles ( date(published_at) ) can be used by a query containing
WHERE date(articles.published_at) = date('2011-03-07').
A unique index guarantees that the table won’t have more than one row with the same value. It’s advantageous to create unique indexes for two reasons: data integrity and performance. Lookups on a unique index are generally very fast.
In terms of data integrity, using a
validates_uniqueness_of validation on an ActiveModel class does not really guarantee uniqueness because there can and will be concurrent users creating invalid records. Therefore you should always create the constraint at the database level - either with an index or a unique constraint.
There is little distinction between unique indexes and unique constraints. Unique indexes can be thought of as lower level since expression indexes and partial indexes cannot be created as unique constraints. Even partial unique indexes on expressions are possible.
While Postgres has the ability to create multi-column indexes, it’s important to understand when it makes sense to do so. The Postgres query planner has the ability to combine and use multiple single-column indexes in a multi-column query by performing a bitmap index scan. In general, you can create an index on every column that covers query conditions and in most cases Postgres will use it, so make sure to benchmark and justify the creation of a multi-column index before you create one. As always, indexes come with a cost, and multi-column indexes can only optimize the queries that reference the columns in the index in the same order, while multiple single-column indexes provide performance improvements to a larger number of queries.
However, there are cases where a multi-column index clearly makes sense. An index on columns
(a, b) can be used by queries containing
WHERE a = x AND b = y, or queries using
WHERE a = x only, but will not be used by a query using
WHERE b = y. So if this matches the query patterns of your application, the multi-column index approach is worth considering. Also note that in this case creating an index on
a alone would be redundant.
In PostgreSQL, the “\d” command is used to list details on a table, including table name, the table columns and their data types, indexes, and constraints.
The details of our friends table now look like this:
Query providing details on the friends table: \d friends;
Looking at the above image, the “friends_name_asc” is now an associated index of the “friends” table. That means the query plan, the plan that SQL creates when determining the best way to perform a query, will begin to use the index when queries are being made. Notice that “friends_pkey” is listed as an index even though we never declared that as an index. That is the clustered index that was referenced earlier in the article that is automatically created based off of the primary key.
We can also see there is a “friends_city_desc” index. That index was created similarly to the names index:
CREATE INDEX friends_city_desc ON friends(city DESC);
This new index will be used to sort the cities and will be stored in reverse alphabetical order because the keyword “DESC” was passed, short for “descending”. This provides a way for our database to swiftly query city names.
After your non-clustered indexes are created you can begin querying with them. Indexes use an optimal search method known as binary search. Binary searches work by constantly cutting the data in half and checking if the entry you are searching for comes before or after the entry in the middle of the current portion of data. This works well with B-trees because they are designed to start at the middle entry; to search for the entries within the tree you know the entries down the left path will be smaller or before the current entry and the entries to the right will be larger or after the current entry. In a table this would look like this:
Comparing this method to the query of the non-indexed table at the beginning of the article, we are able to reduce the total number of searches from eight to three. Using this method, a search of 1,000,000 entries can be reduced down to just 20 jumps in a binary search.
When to use Indexes
Indexes are meant to speed up the performance of a database, so use indexing whenever it significantly improves the performance of your database. As your database becomes larger and larger, the more likely you are to see benefits from indexing.
When not to use Indexes
When data is written to the database, the original table (the clustered index) is updated first and then all of the indexes off of that table are updated. Every time a write is made to the database, the indexes are unusable until they have updated. If the database is constantly receiving writes then the indexes will never be usable. This is why indexes are typically applied to databases in data warehouses that get new data updated on a scheduled basis(off-peak hours) and not production databases which might be receiving new writes all the time.
NOTE: The newest version of Postgres (that is currently in beta) will allow you to query the database while the indexes are being updated.
Testing Index performance
To test if indexes will begin to decrease query times, you can run a set of queries on your database, record the time it takes those queries to finish, and then begin creating indexes and rerunning your tests.
To do this, try using the EXPLAIN ANALYZE clause in PostgreSQL.:
EXPLAIN ANALYZE SELECT * FROM friends WHERE name = 'Blake';
Which on my small database yielded:
This output will tell you which method of search from the query plan was chosen and how long the planning and execution of the query took.
Only create one index at a time because not all indexes will decrease query time.
- PostgreSQL’s query planning is pretty efficient, so adding a new index may not affect how fast queries are performed.
- Adding an index will always mean storing more data
- Adding an index will increase how long it takes your database to fully update after a write operation.
If adding an index does not decrease query time, you can simply remove it from the database.
To remove an index use the DROP INDEX command:
DROP INDEX friends_name_asc;
The outline of the database now looks like:
Which shows the successful removal of the index for searching names.
- B-Tree is the default that you get when you do
CREATE INDEX. Virtually all databases will have some B-tree indexes. B-trees attempt to remain balanced, with the amount of data in each branch of the tree being roughly the same. Therefore the number of levels that must be traversed to find rows is always in the same ballpark. B-tree indexes can be used for equality and range queries efficiently. They can operate against all datatypes, and can also be used to retrieve NULL values. B-trees are designed to work very well with caching, even when only partially cached.
- Hash Indexes pre-Postgres 10 are only useful for equality comparisons, but you pretty much never want to use them since they are not transaction safe, need to be manually rebuilt after crashes, and are not replicated to followers, so the advantage over using a B-Tree is rather small. In Postgres 10 and above, hash indexes are now write-ahead logged and replicated to followers.
- Generalized Inverted Indexes (GIN) are useful when an index must map many values to one row, whereas B-Tree indexes are optimized for when a row has a single key value. GINs are good for indexing array values as well as for implementing full-text search.
- Generalized Search Tree (GiST) indexes allow you to build general balanced tree structures and can be used for operations beyond equality and range comparisons. They are used to index the geometric data types, as well as full-text search.
What makes indexing so effective?
Indexing doesn’t change the existing document/table it is being applied to (unless it is clustered indexing which I’ll discuss later in the blog), it instead creates a new data structure that has two blocks for every entry. These two blocks are the:
- Data Field on which index is created
- Pointer to the Row/Document where that data field is stored in Database(Address)
How does search improve after implementing indexing?
- Now instead of linearly searching through all the fields inside the database, searching is made in a Binary search fashion. [Which reduces the searches from N to (Log N)].
Hence, saving our compute.
- The sequential flow through the connected blocks in memory is also reduced to only 2, as after indexing there are only two fields in the new data structure created.