Keeping a relational database management system (RDBMS) healthy and performant under high load with big data is a difficult task. It requires close collaboration between the systems administrator, the database administrator, and, most importantly, the developers who consume the database services. If any one of those parties fails to do their part, the RDBMS will suffer under heavy traffic; most of the time the RDBMS will be blamed for being slow and NoSQL databases will be praised. Whatever technology you use, you must know its details and use it in the best ways possible; otherwise, you will keep blaming the technology and switching from one to another in the hope of solving your problems, and all you will get is frustration.
While interviewing many web developers, it becomes obvious that very few of them really know the relational databases they are using; most of the time even the fundamental CRUD (create, read, update, delete) operations are not well optimized. One of the most creative(!) ways of optimization I’ve heard (many times) is to place the most selective filters at the beginning and the less selective ones at the end, as if the database processed the predicates (the Boolean expressions after the WHERE clause) from left to right, one by one. Don’t do it, it won’t help!
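To illustrate the point, here are two semantically identical queries; the column names are hypothetical, but the behavior is general: the MySQL optimizer chooses the access path from available indexes and statistics, not from the textual order of the predicates, so both queries produce the same execution plan.

```sql
-- Predicate order makes no difference to the optimizer.
-- (country and is_active are hypothetical columns for illustration.)
SELECT * FROM users WHERE country = 'DE' AND is_active = 1;
SELECT * FROM users WHERE is_active = 1 AND country = 'DE';
```

You can verify this yourself by prefixing each query with EXPLAIN and comparing the plans.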
The focus of this article will be how to optimize SQL queries on MySQL 8.0; however, the principles are universal and apply to other RDBMS solutions and older versions, like MySQL 5.x deployments. There might be some minor syntactic differences, and in this case, you should consult the documentation of your RDBMS. There are many optimizations that must be applied to get better performance but I will focus on indexing, which is the most effective method of database optimization and must be done solely by the developers. I will cover the basics of indexing and the general basic rules in this article, and I will publish another one for more advanced use cases.
Indexes — Part 1
First of all, I will use “indexes” as the plural form of “index”; if you feel more comfortable, you can replace it with “indices”. An index is a special data structure built on a table that provides faster lookups for locating data matching some criteria. In an RDBMS, information is organized in tables and stored on the hard disk, scattered across many disk blocks. If the table is not too big, on the order of a few megabytes, reading the data from the disk might not be a problem thanks to fast disk reads. However, as tables grow, it becomes harder to collect the data from the disk and load it into memory for filtering operations. This is where indexes come to help.
There are different types of indexes for different kinds of data, but I will be looking into B-Tree (B stands for balanced, not binary) indexes. The implementation details of the B-Tree are out of scope for this article and for most developers, but you need to know that lookup, insertion, and deletion are O(log n) for B-Trees instead of the O(n) of a linear search, or full table scan. As a simplification, if your table contains 1 million records, you need up to 1 million comparisons to find a single record, whereas with an appropriate index of degree 4 (each node holding 4 elements) you need about 10 comparisons, since 4^10 ≈ 1 million; real B+Tree indexes, with a much higher fanout, can do it in 2 or 3 node reads. I guess you can imagine what will happen if your table has 1 billion records with a table size of 500 GB. A hint: if your disk reads at 1 GB/s, loading the whole table into memory takes 500 seconds, which means the hard disk must serve only your query for more than 8 minutes, and if the result cannot fit into memory, the operating system will need to swap between disk and memory, something you must avoid at all costs!
The developer must consider the future usage of the tables in terms of the number of records, the table size on disk, the different variations of queries to be performed, and must create the appropriate indexes while designing the table. The developer is responsible for the proper index creation, not the system administrator, the DBA, or any other consultant.
To illustrate the concepts, I will assume you have a database called “scratch” with “utf8mb4” encoding and “utf8mb4_general_ci” collation, which you can create via your favorite MySQL GUI client or issue the following command via mysql CLI:
CREATE DATABASE `scratch` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
Please note that this database and the upcoming testing and learning process must be set up on your development computer, not on production, as it will cause the suspension of some database operations. I don’t want to have any external dependency on a server-side programming language to seed the database, so I created a MySQL stored procedure to populate the “users” table we will be using. To store the procedure in your “scratch” database, download the following code to your computer and run it via the mysql command-line utility or any MySQL client you are comfortable with. With the mysql CLI, the installation of the procedure looks as follows:
$ mysql -u dbusername scratch -p < /path/to/sp_GenerateUserData.sql
After the installation of the procedure, you can run it to insert 1M records like the following:
mysql> USE scratch;
Database changed
mysql> CALL sp_GenerateUserData(1000000);
+-------------------------+
| Status                  |
+-------------------------+
| 1,000,000 rows inserted |
+-------------------------+
1 row in set (2 min 27.00 sec)
Query OK, 0 rows affected (2 min 27.01 sec)
The more records you have, the better you will understand the power of indexing, so I advise you to have at least 10M records; better yet, COUNT(*)’ing your table should take at least 15 seconds. In my case, I have a rather old ultrabook with an SSD write speed of 165 MB/s and read speed of 200 MB/s. I consider 30 million records (3.2 GB without any indexing except the primary key) to be enough for the purposes of this demonstration. To observe the problems, it helps to be short on memory; in my case I have only 2 GB of free RAM, preventing my table from fitting completely into memory, a common scenario on production deployments.
What rules should you follow when creating an index? The following list is by no means complete; indexing is a broad subject that deserves a book to be explained in full detail.
- For small tables, you don’t need to create an index at all. Assume you have a “countries” table containing fewer than 300 countries (the count might change depending on your requirements, but most probably will not exceed 300). You store them in a table (less than 100 KB on disk) so that you can populate the countries select box on the UI, and so on. We don’t expect this table to grow exponentially; maybe 1 or 2 countries will be added every year, and most probably nothing will change for months. If you need to access those countries very often, you could cache the data in an in-memory store as a hash so that you don’t query the database all the time. A better approach would be to load the data when your program boots up, assuming your backend can maintain state and does not depend on an external service. The application could refresh its data at certain intervals (every hour, every day, etc.) or, better, provide an endpoint that lets you trigger the cached data to be reloaded from the database.
- Every table should have a primary key, called “id” most of the time, using an integer field. Consider the 4-byte integer (INT) first, and make sure to mark it as UNSIGNED to double your available range (you will be safe up to around 4.29 billion). Do not use UUIDs, GUIDs, etc. as a primary key, since they compare much more slowly than integers and take up more space.
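As a minimal sketch of such a table definition (the actual column list used by the stored procedure may differ; these columns are for illustration):

```sql
-- A 4-byte unsigned auto-increment primary key, per the rule above.
CREATE TABLE users (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
  email      VARCHAR(254) NOT NULL,
  pass       VARCHAR(255) NOT NULL,
  first_name VARCHAR(50)  NOT NULL,
  last_name  VARCHAR(50)  NOT NULL,
  created_at DATETIME     NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id)
) ENGINE=InnoDB;
```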
- You should not include the primary key in other indexes (aka secondary indexes), it is included automatically for InnoDB tables.
- You should optimally not exceed 5–6 indexes per table. You might be inclined to create many indexes to cover different cases for the WHERE conditions, but do not forget that every index you create slows down inserts, updates, and deletes, since row modifications require the index trees to be updated and rebalanced. Not to mention that indexes are destined to be loaded into memory, so every extra index consumes precious memory.
- Usually, you cannot create indexes easily on big tables, because the index build must see a consistent view of the data, and depending on the operation the table may not accept changes while the index is being created. As a developer, you need to think carefully in advance about the indexes for tables that are expected to grow, and create them at the beginning as much as you can. Otherwise, once the table is big, you might have to take a maintenance break to add an index. PostgreSQL offers CREATE INDEX CONCURRENTLY, which lets you keep modifying the table while the index is built; however, the build can fail and leave behind an invalid index that must be dropped and recreated. MySQL has no equivalent keyword, although InnoDB’s online DDL can build many secondary indexes in place while still permitting concurrent modifications.
- Avoid redundant indexes, like the following:
CREATE INDEX idx1 ON users (email)
CREATE INDEX idx2 ON users (email, pass)
In this case, the first index, idx1, is completely redundant and should be dropped, because its benefit is covered by idx2. But while populating the table via the stored procedure, the emails were inserted as unique values, and most of the time emails are unique as a business requirement anyway. In this case, it is much better to place a UNIQUE index on the email column as follows:
CREATE UNIQUE INDEX email_uq ON users (email)
Assuming the users sign in to your application with their email and password, this unique index will locate the email addresses very quickly and is more cost-effective than idx2 above, since it is smaller and requires fewer modifications. Email addresses are usually not updated after signup, making them effectively read-only, so the database updates the unique index only when a user is inserted or deleted. In the idx2 case, however, whenever a user changes their password, the index must be updated to stay in sync with the table data. The email_uq index is also the last checkpoint preventing duplicate email addresses from being signed up with, and you can rely on the failure of user creation in case of a duplicate email.
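A typical sign-in lookup then becomes a single index probe (the column names are those assumed throughout; adjust them to your schema):

```sql
-- The unique index on email turns this into a direct lookup
-- instead of a full table scan; selecting only id and pass avoids
-- reading the other columns of the row.
SELECT id, pass FROM users WHERE email = 'ivan@example.com';
```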
- An index defined on multiple columns is called a “compound” index, like idx2 above, and the order of the columns is very important. The phonebook analogy helps here: assume you have a phonebook sorted (indexed) by first name and then last name. If you have only the last name of a person, you cannot find the matching people without scanning the whole phonebook, something called a full table scan in the RDBMS world. If the last name is more important and you always have it, you must buy a phonebook sorted by last name and then first name; likewise, the index must be created with the last name mentioned before the first name.
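The phonebook analogy translates to SQL as follows (assuming the users table has last_name and first_name columns):

```sql
CREATE INDEX last_first_idx ON users (last_name, first_name);

-- Both of these can use the index (the leftmost column is matched):
SELECT * FROM users WHERE last_name = 'Petrov';
SELECT * FROM users WHERE last_name = 'Petrov' AND first_name = 'Ivan';

-- This one cannot: filtering by first_name alone skips the
-- leftmost column, so the index is useless here.
SELECT * FROM users WHERE first_name = 'Ivan';
```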
- As a general rule, you should never use * to fetch all of the fields of your table. Assume you need the first and last names of the matching people in the case above, and you have an index containing those two columns. If you select only the first and last names, your query will be much faster, because they can be retrieved from the index alone and no extra reads for the other columns are needed. Although it is ultimately the developer’s responsibility, the queries generated by Object Relational Mapper (ORM) tools usually ask for all of the fields even when they are not needed.
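Assuming an index on (last_name, first_name) as in the phonebook example, the difference looks like this:

```sql
-- Covered by the index: MySQL reads the index only
-- (EXPLAIN shows "Using index" in the Extra column).
SELECT last_name, first_name FROM users WHERE last_name = 'Petrov';

-- Not covered: after the index lookup, each matching row
-- must also be fetched from the table to get the other columns.
SELECT * FROM users WHERE last_name = 'Petrov';
```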
- Be careful with range predicates, since they break the index columns coming after them. For example, assume you would like to search for users named “Ivan” who registered within the past year (the last 365 days), via the following query:
SELECT * FROM users WHERE first_name='Ivan' AND created_at >= NOW() - INTERVAL 365 DAY
The following index will search only for the last 365 days and will not help with the name predicate:
CREATE INDEX idx3 ON users (created_at, first_name)
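One way to fix this, following the rule above, is to put the equality column first and the range column last, so the index can serve both predicates (the index name idx4 is arbitrary):

```sql
-- Equality on first_name narrows the index first; the range on
-- created_at is then applied within that narrowed slice.
CREATE INDEX idx4 ON users (first_name, created_at);
```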
Indexing is not limited to WHERE predicates; it also helps with queries that include ORDER BY clauses, which will be explained in “Part 2” of this article. I will also demonstrate the use of virtual columns and how they are indexed, as well as function-based indexes, a new feature of MySQL 8. Most importantly, I will show the great EXPLAIN keyword, which provides critical information on how the optimizer will run a given query.
You should try different queries and create indexes that help those queries run faster. Indexing is not a black art, and once you know the general rules, you will be able to decide on your indexes with facts rather than trial and error.