When you look for Django ORM optimization tips on the web, you will mostly find articles telling you to use prefetch_related to improve your queries.
If you are using MySQL or Amazon Aurora as a database, you are probably using InnoDB, the default storage engine. At Back Market, we use Amazon Aurora and are scaling very fast: in early 2020 we opened two new countries, and the number of connections is increasing every day. If it is not well designed, a database can become a bottleneck.
In the first part of this article, we will go through some InnoDB architecture concepts, and in the second part we will look at tips and good practices to improve your models for better MySQL query performance. I'm using Django code snippets because it is the main framework we use at Back Market, but understanding the InnoDB architecture is useful to all developers.
B-Tree and B+Tree
InnoDB uses a variant of the B-Tree, called a B+Tree, to store data and indexes.
A B-Tree is a balanced tree that keeps data sorted. B-Trees are widely used in database storage engines because they enable fast lookups and avoid scanning the whole table to find an element: you only need h+1 lookups to find an element, where h is the height of the tree.
In a B+Tree, the leaf nodes are doubly linked, so you can navigate between them without going back up to the parent level. This makes scan and range operations more efficient.
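To make the linked-leaves idea concrete, here is a minimal Python sketch (a toy illustration, not InnoDB's actual implementation; `Leaf`, `build_leaf_chain`, and `range_scan` are made-up names). After locating the first matching leaf, the scan follows the sibling chain instead of re-descending the tree.

```python
import bisect

class Leaf:
    """A 'page' of sorted keys, linked to its right sibling."""
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # doubly linked in InnoDB; singly linked here

def build_leaf_chain(keys, page_size):
    """Split sorted keys into fixed-size leaves and link them together."""
    leaves = [Leaf(keys[i:i + page_size]) for i in range(0, len(keys), page_size)]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves

def range_scan(leaves, lo, hi):
    """Collect all keys in [lo, hi] by walking the leaf chain."""
    # Locate the first candidate leaf; this stands in for the
    # root-to-leaf descent of a real B+Tree.
    first_keys = [leaf.keys[0] for leaf in leaves]
    leaf = leaves[max(bisect.bisect_right(first_keys, lo) - 1, 0)]
    result = []
    while leaf is not None and leaf.keys[0] <= hi:
        result.extend(k for k in leaf.keys if lo <= k <= hi)
        leaf = leaf.next  # follow the sibling link: no need to climb back up
    return result
```

For example, `range_scan(build_leaf_chain(list(range(0, 100, 2)), 4), 10, 20)` returns the even numbers from 10 to 20 while touching only the leaves in that range.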
InnoDB Buffer pool
The buffer pool is a server cache containing data and indexes. It is an important component of InnoDB to optimize queries as frequently used data will be served from the memory and not from the disk.
The buffer pool is implemented as a linked list divided into two sublists:
- The top 5/8, known as the new sublist, contains recently accessed data.
- The bottom 3/8, known as the old sublist, contains older data.
- The boundary between these two areas is called the midpoint.
This is a simplified explanation of how it works:
- When data is loaded for the first time into the buffer pool, it is inserted at the midpoint.
- Accessing data from the old sublist moves it to the new sublist.
- Unused pages move down the list until they become very old and are evicted from the buffer pool.
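The three steps above can be sketched in a few lines of Python (a toy model with hypothetical names; the real buffer pool also handles read-ahead, page pinning, and a configurable sublist split):

```python
class MidpointLRU:
    """Toy midpoint-insertion LRU: new pages enter at the 5/8 boundary."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = []  # index 0 = newest, end of list = oldest

    @property
    def midpoint(self):
        return (len(self.pages) * 5) // 8  # new sublist is the top 5/8

    def access(self, page):
        if page in self.pages:
            # A hit moves the page to the head of the new sublist.
            self.pages.remove(page)
            self.pages.insert(0, page)
        else:
            # A miss inserts the page at the midpoint, not at the head.
            self.pages.insert(self.midpoint, page)
            if len(self.pages) > self.capacity:
                self.pages.pop()  # evict the oldest page
```

A freshly loaded page therefore cannot push frequently used pages out of the new sublist; it has to be accessed again to be promoted.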
Pages and extents
A page is a group of rows. Its size is fixed, 16 KB by default. The number of rows inside a page depends on the row size. InnoDB has the constraint that a page must hold at least two rows, so for a 16 KB page a row cannot exceed 8126 bytes.
A page is also the smallest unit of data transferred between the buffer pool and disk.
An extent is a group of contiguous pages. With the default configuration, an extent is 1MB and contains 64 pages. When InnoDB prefetches data using read-ahead feature, it loads an entire extent into the buffer pool.
Clustered index
This is where the data is stored. Internally, it is organized as a B+Tree sorted by the primary key, which is why it is also known as the primary index. Leaf nodes are pages that contain the table rows.
Secondary index
This is the structure that stores indexes other than the primary key. It is organized as a B+Tree sorted by the index value, which makes it useful for finding ranges of values, for example. This property helps make ORDER BY and GROUP BY faster too.
When you declare a field with db_index=True, a secondary index is built to store field values:
field1 = models.IntegerField(default=0, db_index=True)
Or when creating a compound index:
index_together = ["pub_date", "deadline"]
Unique fields will create a secondary index as well:
field1 = models.CharField(max_length=5)
field2 = models.CharField(max_length=10)
unique_together = (("field1", "field2"),)
Foreign keys also act the same in InnoDB:
my_model = models.ForeignKey(OtherModel, on_delete=models.CASCADE)
Leaf nodes contain the index value and the primary key of the associated row. The primary key is then used to find the full row in the clustered index.
To see the indexes of a specific table, you can run this query:
SHOW INDEXES FROM table_name;
InnoDB Row format
The InnoDB row format determines how clustered index rows are physically organized on disk. Until MySQL 5.7.8 the default InnoDB row format was COMPACT; in the latest MySQL versions the default is DYNAMIC. The main difference between COMPACT and DYNAMIC is how they store variable-length fields. VARCHAR and TEXT are considered variable-length fields because on disk InnoDB only allocates the size of the written data, not the field's maximum length.
- When you declare a CHAR(300) field, InnoDB allocates 300 bytes on disk even if you write a single character into it.
- When you declare a VARCHAR(300) field and write a single character, it allocates 3 bytes: 2 bytes to store the size and 1 byte to store the character.
In Django, CharField is translated into VARCHAR.
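As a rough illustration of this rule (assuming a 1-byte-per-character charset such as latin1; the helper names are made up and ignore row overhead):

```python
def varchar_stored_bytes(max_length, value):
    """Bytes used by a latin1 VARCHAR(max_length) holding `value`:
    the written data plus 1 length byte (or 2 if max_length > 255)."""
    length_prefix = 1 if max_length <= 255 else 2
    return length_prefix + len(value)

def char_stored_bytes(max_length, value):
    """CHAR always allocates its full declared width."""
    return max_length
```

For a single character, `varchar_stored_bytes(300, "a")` gives 3 while `char_stored_bytes(300, "a")` gives 300, matching the bullet points above.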
In the clustered index, data is stored within the page. We have seen previously that a row cannot exceed roughly 8 KB. So what happens when the size of a row exceeds that limit? Let's take this example:
class SomeModel(models.Model):
    field1 = models.CharField(max_length=5000)
    field2 = models.CharField(max_length=5000)
    field3 = models.CharField(max_length=1000)
    field4 = models.IntegerField(default=0)
Each row of SomeModel will have a maximum size of a bit more than 11010 bytes:
- 5002 bytes for field1
- 5002 bytes for field2
- 1002 bytes for field3
- 4 bytes for field4
(We ignore row and page internal fields for the sake of simplicity.)
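Under the same simplifying assumptions (latin1 encoding, 1 or 2 length bytes per VARCHAR, ignoring internal overhead), the maximum row size can be recomputed:

```python
def max_varchar_bytes(max_length):
    """Max on-disk bytes for a latin1 VARCHAR: data plus 1 length byte,
    or 2 length bytes when the declared max exceeds 255."""
    return max_length + (2 if max_length > 255 else 1)

row_max = (
    max_varchar_bytes(5000)   # field1
    + max_varchar_bytes(5000) # field2
    + max_varchar_bytes(1000) # field3
    + 4                       # field4: 4-byte IntegerField
)
# row_max is 11010 bytes, well over the ~8 KB in-page limit
```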
This is too big to fit in a single row, so what happens? Variable-length columns that are too long to fit in a B-Tree page are stored in an area outside the page called an overflow page.
With the COMPACT row format, InnoDB stores the longest variable-length columns off-page until the row fits within the limit. If a variable-length column exceeds 768 bytes, the first 768 bytes are stored within the page and the extra data goes to an overflow page.
With the DYNAMIC row format, when the data is too big to fit inside the row, the whole variable-length column is stored in the overflow page and the row only contains a 20-byte pointer to it. TEXT and BLOB columns smaller than 40 bytes are not offloaded. Fixed-size columns cannot be stored in an overflow page.
When fetching a page's content, extra work is needed to read the off-page columns.
My Django model design
So now that we’ve gone through main InnoDB architecture concepts, how can we conceive better models?
KISS: Keep it small and simple
When choosing a field for your Django model, try to use the smallest type possible to save disk and memory. Making a more compact model lets you:
- Fit more rows within a page.
- Lower the number of pages and the size of the B+Tree.
- Fit more data inside the buffer pool and make it behave like an in-memory database.
- With the read-ahead feature, each extent loaded into the buffer pool will contain more rows.
- In the case of a range scan, you will read fewer pages and make less I/O.
When dealing with choices in your model, if you have a small range of int values, for example, consider using a SmallIntegerField instead of an IntegerField: this can save you some bytes.
INFO = 0
WARNING = 1
ERROR = 2

LEVEL_CHOICES = (
    (INFO, "Log info level"),
    (WARNING, "Log warning level"),
    (ERROR, "Log error level"),
)

level = models.SmallIntegerField(choices=LEVEL_CHOICES, default=INFO)
When you create a CharField, you may give little attention to max_length, thinking that since it is a variable-length field, InnoDB will only store on disk what it needs, as we saw in the row format section:
name = models.CharField(max_length=5000)
However, it is better to allocate only the space you need. Here is what High-Performance MySQL authors say about this:
Storing the value 'hello' requires the same amount of space in a VARCHAR(5) and a VARCHAR(200) column. Is there any advantage to using the shorter column?
As it turns out, there is a big advantage. The larger column can use much more memory, because MySQL often allocates fixed-size chunks of memory to hold values internally. This is especially bad for sorting or operations that use in-memory temporary tables. The same thing happens with filesorts that use on-disk temporary tables.
The best strategy is to allocate only as much space as you really need.
Do not use a CharField to represent an int
If you want to store an integer value, use the corresponding integer field model and not a CharField. Integer comparison is simpler and quicker than text comparison because integers are smaller and there is no charset/collation overhead. Also, with a 4-byte IntegerField you can store values between -2147483648 and 2147483647, whereas with a CharField you need many more bytes, depending on your encoding.
For example, if you are using latin1 which uses 1 byte per character, the max value you can have with 4 bytes string is 9999.
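Two quick checks of these claims (plain Python, just for illustration):

```python
import struct

# String comparison is lexicographic, which is both slower and
# semantically surprising for numbers stored as text:
assert "10" < "9"   # "1" sorts before "9"
assert 10 > 9       # numeric comparison does what you expect

# 4 bytes of signed integer cover -2147483648..2147483647...
assert len(struct.pack("<i", 2147483647)) == 4
# ...while 4 bytes of latin1 digits top out at 9999.
assert int("9" * 4) == 9999
```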
Primary key design
We have seen earlier that the smaller the row, the better. This applies to secondary indexes too: as we saw in the secondary index architecture, the primary key is appended to each index record, so your index size grows with the size of your primary key. This also implies that there is no need to create a compound index with the primary key as the rightmost column, since it is appended implicitly.
When possible try making your primary key as small as possible to save space and make index processing more efficient.
Also, when possible, prefer an AutoField over a CharField: it is smaller and faster to process.
Random value index and primary key
Clustered and secondary indexes are sorted B+Trees. When you have an auto-increment primary key, for example, rows are inserted in order inside a page because the generated values are increasing. When a page is full, a new page is created.
When using a random value, such as a UUID or an MD5 hash, the new value is not necessarily bigger than the previous one, so it cannot simply be appended at the end of the last page: it has to be inserted in the middle of a previously created page. If that page is already full and does not have enough space for the row, a page split is necessary. So inserting random values as a primary key or index causes extra work: page splits and random disk accesses.
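Here is a toy simulation (a hypothetical page model; InnoDB's actual split algorithm is more sophisticated) contrasting sequential and random key insertion: sequential keys always extend the last page, while random keys keep landing in the middle and forcing splits that leave pages half-full.

```python
import bisect
import uuid

PAGE_CAPACITY = 4  # rows per toy "page"; real InnoDB pages hold far more

def insert(pages, key):
    """Insert `key` into a list of sorted pages, splitting full pages."""
    if not pages:
        pages.append([key])
        return
    first_keys = [page[0] for page in pages]
    i = max(bisect.bisect_right(first_keys, key) - 1, 0)
    page = pages[i]
    if len(page) < PAGE_CAPACITY:
        bisect.insort(page, key)          # room left: cheap in-page insert
    elif i == len(pages) - 1 and key > page[-1]:
        pages.append([key])               # sequential key: open a fresh page
    else:
        bisect.insort(page, key)          # random key lands mid-tree...
        pages[i:i + 1] = [page[:3], page[3:]]  # ...forcing a page split

# Sequential keys: every page ends up completely full.
seq_pages = []
for n in range(100):
    insert(seq_pages, n)

# Random keys (UUIDs): splits leave half-empty pages, so more pages
# (and more I/O) are needed for the same 100 rows.
rand_pages = []
for _ in range(100):
    insert(rand_pages, uuid.uuid4().hex)
```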
Non-Null field and index
When you do not need NULL values in a column, declare it as non-NULL. MySQL offers better performance for non-nullable fields because it eliminates the NULL-testing overhead.
title = models.CharField(max_length=150, null=False)
It also helps save some space in the index, because declaring a field as nullable adds one bit per column.
Low cardinality indexes and selectivity
Index cardinality is the number of unique values in an index. For example, an index on a BooleanField will have a cardinality of 2. Selectivity is the ratio between the index cardinality and the number of records: the higher the selectivity, the more "unique" your index is and the fewer matching records a lookup returns, which makes queries faster.
When the index is not selective enough and the query returns a large amount of data from the table, the query optimizer can choose to not use the index as it considers that it is faster to scan the clustered index than to scan the secondary index and fetch data from the clustered index.
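Selectivity can be sketched as follows (an in-memory illustration with made-up column data; in MySQL you would compare the cardinality reported by SHOW INDEXES against SELECT COUNT(*)):

```python
def selectivity(values):
    """Ratio of distinct values to total rows: closer to 1 is better."""
    return len(set(values)) / len(values)

bool_column = [True, False] * 500                             # cardinality 2
email_column = [f"user{n}@example.com" for n in range(1000)]  # all unique
```

Here `selectivity(bool_column)` is 0.002, a poor candidate for an index, while `selectivity(email_column)` is 1.0.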
You can check index cardinality by running the SHOW INDEXES command.
Unused or too many indexes
Indexes are very powerful and can speed up queries a lot. Some people may be tempted to put an index on every field of a model, thinking they are optimizing. This is an anti-pattern, and something we have seen at Back Market.
Indexes also have a cost, because the associated B+Tree needs to be updated on each value change, which adds overhead to DML operations (INSERT/UPDATE/DELETE). Adding an index is useful only when its benefits outweigh that overhead. Use the EXPLAIN command in the MySQL shell, or explain() on the QuerySet (since Django 2.1), to check which indexes are used for a given query.
Over time, an index that you added a while ago and that used to be useful may no longer be used. This can be due to low selectivity, for example because the table has grown a lot, or because there is now a better index to use. It is good to drop such indexes, as they add overhead without optimizing any query.
You can use the sys.schema_unused_indexes view to see indexes that are no longer used. It is based on the Performance Schema, an in-memory store, which means its history is lost on each restart. If the sys schema is not enabled, or you are using an old version of MySQL, you can query the Performance Schema tables directly to extract unused index information.
I hope this article gives you a better understanding of InnoDB's key components and helps you improve your Django model design. Here are the three things to keep in mind:
- Try to keep your row size as small as you can.
- Use the appropriate field model.
- Keep an eye on your indexes.
If you want to join our Bureau of Technology or any other Back Market department, take a look here, we’re hiring! 🦄
References:
- Externally Stored Fields in InnoDB
- High Performance MySQL, 3rd Edition, Chapter 4: Optimizing Schema and Data Types
- MySQL 8.0 Reference Manual, 15.5.1 Buffer Pool
- Why low cardinality indexes negatively impact performance
Thank you Ikram for the beautiful charts 😉