How we reduced MySQL CPU usage by more than 70%

Identifying and fixing application code that causes high MySQL CPU usage.

Daniel Balosh
Eleos Health
6 min read · May 19, 2024


Have you ever stared at MySQL CPU usage numbers, feeling lost in a sea of data? We’ve been there too. Through systematic analysis and strategic interventions, we pinpointed and fixed the root causes of our high CPU usage. The outcome? A remarkable 70% reduction in CPU load, allowing us to optimize resources and cut costs by downsizing instances. This post will walk you through our troubleshooting journey, sharing the tools, techniques, and insights that led to our success.

🔭 Problem and Strategy

When troubleshooting high CPU usage on MySQL, it is easy to get lost, and mapping DB queries back to application code brings its own challenges.

Here, I present the workflow for iteratively identifying the top problematic query while linking it with the relevant application code.

Generally, the root cause of 80% of the problems is one of the following:

  • Too many rows scanned
  • A large amount of data is being sent over the wire
  • A large number of queries

For more complex issues, it is better to consult a good DBA.

We need to find the application code that creates these queries and either change the code to prevent the issue or optimize the tables to solve the business need.

🗺️ SQL ↔︎ Application Code Map

First, we must create a map between the SQL logs and the backend code.
I’ve considered the following solutions:

OpenTelemetry framework:

It provides a nice UI over Grafana and the ability to query and aggregate trace data.
OTel integrates with many languages and frameworks.
I failed to get it running in production with the stability we needed, but when it works, it's fantastic.

Image credit: the "Getting started with otelsql" post on the OpenTelemetry blog (linked below).

Example: https://opentelemetry.io/blog/2024/getting-started-with-otelsql/
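For a Python/Django stack, a minimal setup could look like the sketch below. This is illustrative only, not the setup we attempted in production, and it assumes the opentelemetry-sdk and opentelemetry-instrumentation-django packages are installed.

# Sketch: auto-instrument Django requests and export spans to the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.django import DjangoInstrumentor

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Call once at process startup (e.g. in manage.py or wsgi.py), before Django
# starts handling requests. DB-level spans additionally require the relevant
# database instrumentation package.
DjangoInstrumentor().instrument()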

Injecting a correlation ID into the SQL query:

Many libraries provide this capability.
It gives a one-to-one mapping from entries in the general and slow query logs to application logs; from there, a little digging reveals the calling application code.
However, I feared the runtime risk of touching critical code paths, so I didn't choose it.
Before, the query is:

SELECT * FROM table;

After adding the trace UUID, it becomes:

SELECT * FROM table; /* service=Alfa uuid=0e9c4fa8-ca8e-4ef7-b24b-d6cc16bc3ef3 */

Example: https://django-correlation-id.readthedocs.io/en/latest/
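To illustrate the idea (we did not ship this), a hypothetical helper that appends such a comment could look like the sketch below; the function and the service name are made up for the example.

import uuid

def tag_query(sql: str, service: str = "Alfa") -> str:
    # Append a closed /* ... */ comment so log lines can be joined on the UUID.
    return f"{sql} /* service={service} uuid={uuid.uuid4()} */"

print(tag_query("SELECT * FROM table;"))
# SELECT * FROM table; /* service=Alfa uuid=... */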

Custom database logger:

The selected solution was to implement custom MySQL logging middleware.
A "database wrapper" logs the mapping between the query and the stack trace.
The stack trace gives full application-flow visibility, from the start of the code run to the SQL call.
See the Django implementation below.
Other frameworks offer similar options for hooking in before SQL execution.
The result would look as follows:

{
.... <fields by logger config>
"sql": "SELECT * FROM table WHERE id = %s LIMIT 21",
"stacktrace": " <Full stacktrace to call>",
"sender": "sqlstacktracer",
}

🛞 Iterative improvement workflow

Now we have a one-to-one mapping between each parameterized SQL query and its application code.
We can move on to the iterative improvement workflow.

Identify the top problematic query

1. If Performance Insights is enabled, take its top problematic query when the rating exceeds 0.1.
2. Otherwise, go to the slow query log and identify the frequency and weight of the query.
For each query, the slow query log shows:

# Query_time: 10.000  Lock_time: 0.001234 Rows_sent: 1234  Rows_examined: 12345678

We can generate the frequency with CloudWatch Logs Insights using the pattern function, or with any tool that parses the parameterized SQL to produce the same output (see the sketch after these steps).

fields @timestamp, @message, @logStream, @log
| pattern @message

The pattern function produces:

Image: pattern results from the slow query log.

Now, with both query metadata and the count, pick the top impacting query.
3. Finally, there is the general log.
Use the pattern function again to get context.
Only a little metadata exists here; it is what it is.
The event counts will be much higher than in the slow query log, and an abnormally high count for a single query pattern is what indicates an issue.
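If CloudWatch Logs Insights is not available, a rough standalone script can produce a similar frequency table. The sketch below is an illustration only: the log file name is an assumption and the normalization is deliberately crude.

import re
from collections import Counter

def normalize(sql: str) -> str:
    # Collapse literals so different parameter values map to the same pattern.
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return re.sub(r"\s+", " ", sql).strip()

patterns = Counter()
with open("mysql-slowquery.log") as log:          # assumed file name
    for line in log:
        if line.startswith(("#", "SET timestamp", "use ")):
            continue                              # skip metadata lines
        if line.strip():
            patterns[normalize(line)] += 1

for sql, count in patterns.most_common(10):
    print(count, sql)

Multiplying each pattern's count by a typical Query_time for that pattern gives the weight mentioned in step 2.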

Specific query analysis

Good, now we have the top problematic query to fix.
Answer the following questions:

  • Using the SQL ↔︎ application code map:
    - Which code creates it?
    - Which business value does it serve?
    - Who is the business owner?
    - Is it still needed?
    - Is up-to-date data required, or can it be calculated periodically?
    - Can it move to a read replica?
  • Using MySQL EXPLAIN (see the sketch after this list):
    - How many rows are examined?
    - Query time?
    - Lock time?
    - Rows returned?
    - How many bytes of data are returned?
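A quick way to check the rows-examined estimate from a Django shell is to run EXPLAIN on the captured, parameterized query. The helper below is a small hypothetical sketch, not part of our tooling; the query is the example from the log above and the parameter value is made up.

from django.db import connection

def explain(sql, params=()):
    # Run MySQL's EXPLAIN on the query and print one dict per plan row.
    with connection.cursor() as cursor:
        cursor.execute("EXPLAIN " + sql, params)
        columns = [col[0] for col in cursor.description]
        for row in cursor.fetchall():
            print(dict(zip(columns, row)))

explain("SELECT * FROM table WHERE id = %s LIMIT 21", [42])
# The "rows" column estimates how many rows MySQL will examine.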

Fixing the specific query

The nice part comes now: with the data at hand, actually solving the issues.
We won't go into detailed solutions here, just pointers for further research:

  • Too many rows examined? → add an index, denormalize, or remove joins
  • Long query time? → use a cache or pre-calculate at scheduled intervals
  • Too much data? → select only the needed columns
  • Too many similar calls? → it's an N+1 problem. Move the query out of the loop in the application code (see the sketch below)
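For the last item, here is a minimal Django-flavored sketch of the N+1 fix; the Author/Book models and the "books" related name are hypothetical.

from django.db.models import Count

# N+1: one query for the authors, then one COUNT query per author in the loop.
for author in Author.objects.all():
    print(author.name, author.books.count())

# Fix: compute the aggregate in the same query, before the loop.
for author in Author.objects.annotate(num_books=Count("books")):
    print(author.name, author.num_books)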

By following the steps outlined above, you can resolve over 80% of common application problems.

💈 Django implementation

A flat directory is required at the application level.
1. Create the following files:

 -- django-project-folder
 + |---- sqlstacktrace
 +        |---- base.py
 +        |---- stacktrace.py
 +        |---- __init__.py

2. Create a database wrapper inside base.py

from django.db.backends.mysql.base import DatabaseWrapper as MysqlDatabaseWrapper
from django.utils.asyncio import async_unsafe


class DatabaseWrapper(MysqlDatabaseWrapper):

    @async_unsafe
    def create_cursor(self, name=None):
        # Wrap every cursor so each execute() call gets logged with its stack trace.
        cursor = self.connection.cursor()
        return StacktraceCursorWrapper(cursor)

3. Point the database engine setting at the new DatabaseWrapper, and consider enabling it only behind a feature flag.
Otherwise, significant logging costs might occur (a settings sketch follows).

DATABASES["default"]["ENGINE"] = "sqlstacktrace"
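For the feature flag mentioned above, one possible shape in settings.py; the SQL_STACKTRACE variable name is an assumption for the example.

import os

# Swap the engine only when the flag is set, to avoid the logging cost by default.
if os.environ.get("SQL_STACKTRACE") == "1":
    DATABASES["default"]["ENGINE"] = "sqlstacktrace"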

4. Implement the cursor wrapper in base.py

import logging

from django.db.backends.mysql.base import CursorWrapper

logger = logging.getLogger(__name__)


class StacktraceCursorWrapper(CursorWrapper):
    # No change to the SQL is made; the query and its stack trace are only
    # written to this module's (__name__) logger.
    def execute(self, sql, params=()):
        try:
            self.log_sql_with_stacktrace(sql)
        except BaseException:
            pass  # don't fail the real query, no matter what

        return self.cursor.execute(sql, params)

At every SQL call, the execute function is called.

5. Now add the actual logging function

from django.utils.encoding import smart_text  # smart_str on Django 4.0+

from sqlstacktrace.stacktrace import get_stacktrace


class StacktraceCursorWrapper(CursorWrapper):
    def log_sql_with_stacktrace(self, sql):
        try:
            stacks = get_stacktrace()
            stacktrace = []
            for stack in stacks:
                # Format each frame like a standard traceback line and escape
                # "%" so %-style log formatting does not mangle it.
                stacktrace.append(
                    u"""File "{0}", line {1}, in {2}\n\t{3}""".format(
                        *[smart_text(stack_data) for stack_data in stack]
                    ).replace("%", "%%")
                )
            stacktrace = "\n".join(stacktrace)
            # Neutralize SQL comment markers inside the trace.
            stacktrace = stacktrace.replace("/*", r"\/\*").replace("*/", r"\*\/")
        except Exception:
            stacktrace = u"WITHOUT STACKTRACE"

        logger.info(
            {"sql": sql, "stacktrace": stacktrace},
            extra={"sender": "sqlstacktracer"},
        )

6. There is a utility function, get_stacktrace, to format the stack trace. The file is here: https://github.com/adw0rd/django-sql-stacktrace/blob/0cd797391d1e06b42c9e039a92232130361c30d9/sqlstacktrace/stacktrace.py
Copy and paste it into stacktrace.py.

7. The resulting log, with a JSON logger:

{
.... <fields by logger config>
"sql": "SELECT * FROM table WHERE id = %s LIMIT 21",
"stacktrace": " <Full stacktrace to call>",
"sender": "sqlstacktracer",
}

Because the logged query is parameterized (a %s placeholder rather than a specific ID), as shown above, you can aggregate and search on the query text to investigate its performance.

⚒️ Tools Reference

The general log

The general log records every executed query, with little metadata.
Find the general log by typing the database name into CloudWatch Logs Insights or by checking the database configuration.
Docs: https://dev.mysql.com/doc/refman/5.7/en/query-log.html

The slow query log

Shows only queries that run longer than long_query_time.
Docs: https://dev.mysql.com/doc/refman/5.7/en/slow-query-log.html

MySQL performance schema

The MySQL performance schema shows real-time performance-related data such as process lists, connected clients, and queries with locks.
Docs: https://dev.mysql.com/doc/refman/5.7/en/performance-schema.html

Performance insight

AWS (on RDS) and other vendors provide a visibility UI over the data MySQL generates.
Docs: https://aws.amazon.com/rds/performance-insights/

Thanks to https://github.com/adw0rd/django-sql-stacktrace for the reference and most of the code.
