High Performance MySQL

Ferhat Can
Mar 14 · 10 min read

Having a healthy relational database management system (RDBMS) performing well under high loads with big data is a difficult task to achieve. It requires very good collaboration of the systems administrator, database administrator, and most importantly the developers who will be consuming the database services. If any one of those parties fails to perform well their duties, the RDBMS will suffer when the traffic is high and most of the time the RDBMS will be blamed for being slow and the NoSQL databases will be praised. Whatever technology you are using, you must know its details and use it in the best ways possible; otherwise, you will keep blaming the technology and switch from one to another with the hope of solving your problems and all you will get is just frustration.

While having job interviews with many web developers, it becomes obvious that very few of them really know enough about the relational databases they are using, most of the time even the fundamental CRUD (create, read, update, delete) operations are not very well optimized. One of the most creative(!) way of optimization I’ve heard (many times) is to provide the most selective filters at the beginning and the other less selective ones at the end, as if the database process the predicates (Boolean expressions after the WHERE clause) from left to right, one by one. Don’t do it, it won’t help!

The focus of this article will be how to optimize SQL queries on MySQL 8.0; however, the principles are universal and apply to other RDBMS solutions and older versions, like MySQL 5.x deployments. There might be some minor syntactic differences, and in this case, you should consult the documentation of your RDBMS. There are many optimizations that must be applied to get better performance but I will focus on indexing, which is the most effective method of database optimization and must be done solely by the developers. I will cover the basics of indexing and the general basic rules in this article, and I will publish another one for more advanced use cases.

Indexes — Part 1

First of all, I will be using the “indexes” version for the plural form of the index, if you feel more comfortable, you can replace it with “indices”. An index is a special data structure built on a table that provides faster lookups for locating data matching some criteria. In RDBMS products, the information is organized around tables and it is stored on the hard disk, scattered on many disk blocks. If the table is not too big, on the order of few megabytes, reading the data from the disk might not be a problem thanks to the fast disk reads. However, as the tables become bigger, it is harder to collect the data from the hard disk and load it into memory for filtering operations. Here indexes come for help.

There are different types of indexes for different kinds of data types but I will be looking into the B-Tree (B stands for balanced not binary) indexes. The implementation details of the B-Tree are out of scope for the article and most of the developers but you need to know that the lookup, delete and insertion operations are O(log n) for B-Trees instead of O(n) for linear search or full table scan. As a simplification, if your table contains 1 million records, you need 1 million comparisons to find a single record whereas, with an appropriate index (with a degree of 4, meaning each branch can hold 4 elements), you need 10 comparisons, though the B+Tree indexes can do it with 2 or 3 comparisons. I guess you can imagine what will happen if your table has 1 billion records with a table size of 500 GB. A hint: If your disk can read at speed of 1GB/s, then your table will be loaded into memory in 500 seconds, which means the hard disk must only serve your query for more than 8 minutes and if the result cannot fit into the memory, the operating system will need to swap between disk and memory, something you must avoid at all costs!

The developer must consider the future usage of the tables in terms of the number of records, the table size on disk, the different variations of queries to be performed, and must create the appropriate indexes while designing the table. The developer is responsible for the proper index creation, not the system administrator, the DBA, or any other consultant.

To illustrate the concepts, I will assume you have a database called “scratch” with “utf8mb4” encoding and “utf8mb4_general_ci” collation, which you can create via your favorite MySQL GUI client or issue the following command via mysql CLI:

CREATE DATABASE `scratch` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Please note that this database and upcoming testing and learning process must be done on your development computer, not on production, as it will cause suspension of some database operations. I don’t want to have any external dependency on any server-side programming language to seed the database, so I created a MySQL stored procedure to populate the “users” table we will be using. In order to store the procedure on your “scratch” database, please download the following code to your computer and run it via mysql command-line utility or any MySQL client you are comfortable with.

Using mysql CLI, the installation of the procedure to the database will be as follows:

$ mysql -u dbusername scratch -p < /path/to/sp_GenerateUserData.sql

After the installation of the procedure, you can run it to insert 1M records like the following:

mysql> USE scratch;Database changedmysql> CALL sp_GenerateUserData(1000000);+ — — — — — — — — — — — — -+| Status                   |+ — — — — — — — — — — — — -+| 1,000,000 rows inserted  |+ — — — — — — — — — — — — -+1 row in set (2 min 27.00 sec)Query OK, 0 rows affected (2 min 27.01 sec)

The more records you have, the better you will understand the power of indexing, so I advise you to have at least 10M records, or better COUNT(*)’ing your table should take at least 15 seconds. In my case, I have a rather old ultrabook, with an SSD disk write speed of 165MB/s and read speed of 200MB/s. I consider 30 million records (3.2 GB without any indexing except the primary key) for my table to be enough to serve the purpose of the demonstration. To observe the problems, it is nice to be short on memory, and in my case I have only 2 GB of free RAM, preventing my table to fit completely into the memory, a common scenario you will have on production deployments.

Fundamentals

What rules to follow when creating an index? The following list is by no means complete and indexing is a broad subject that deserves a book to be explained in full details.

CREATE INDEX idx1 ON users (email)
CREATE INDEX idx2 ON users (email, pass)

In this case, the first index, idx1, is completely redundant and should be dropped because its benefit is covered by idx2. But while populating the table via the stored procedure, the emails were inserted as unique entities and most of the time the emails are unique as a business requirement. In this case, it is much better to place a UNIQUE index on the emails as follows:

CREATE UNIQUE INDEX email_uq ON users (email)

Assuming the users sign in to your application via their emails and passwords, this unique index will help in locating the email addresses very quickly and is more cost-effective than the above idx2, since it is smaller and requires less modification. The email addresses are usually not updated after the signup process, kind of read-only, and the database will update the unique index only when a user is inserted or deleted. In the idx2 case; however, whenever the users change their passwords, the index must be updated to be in sync with the table data. The email_uq index will also be the last checkpoint to prevent duplicate email addresses to be signed up with, and you can rely on the failure of user creation in case of duplicate emails.

SELECT * FROM users WHERE first_name='Ivan' AND created_at >= NOW() — INTERVAL 365 DAY

The following index will search only for the last 365 days and will not help with the name predicate:

CREATE INDEX idx3 ON users (created_at, first_name)

The indexing is not limited to the WHERE predicates, it also helps with queries that include GROUP BYand ORDER BYclauses which will be explained in “Part 2” of this article. I’ll also demonstrate the use of the virtual columns and how they are indexed, and function-based indexing, which is a new feature of MySQL 8. Most importantly, I will show the great EXPLAIN keyword which provides some critical info on how the query the optimizer will run the given query.

You should try different queries and create indexes helping those queries to run faster. Indexing is not a black art and once you know the general rules, you will be able to decide on the indexes with facts rather than trials and errors.

Star Gazers

“If you want to master something, teach it.”