Query Sniper
Controlling runaway queries.
Heads up, we’ve moved! If you’d like to continue keeping up with the latest technical content from Square please visit us at our new home https://developer.squareup.com/blog
Picture this: your service receives a request and kicks off a database query; the request times out and leaves the query running — leading to an unnecessary database load and possibly starving the database of resources. When services automatically retry requests, the problem is only exacerbated and can bring the most powerful database servers to their knees.
So, how does one guard against this?
While this is a fairly common problem with many platforms, we’re solely covering Java and JDBC running on a MySQL back-end. However, ideas discussed throughout should be applicable to other platforms and databases.
Database timeouts
Very recent versions of MySQL allow the setting of global timeouts, such that no query can take more than a predefined amount of time to run. This is a fairly coarse-grained approach, since limits have to be set for the longest-running query possible. So, this approach was ruled out. (Plus, we don’t run bleeding-edge versions of MySQL in production.)
Application-level timeouts
You can set a timeout on MySQL’s JDBC driver. We use HikariCP for connection pooling; it also allows you to configure the pool accordingly, delegating to the driver to set timeouts. This sounded promising.
MySQL’s JDBC Driver
For anyone who’s read the source code of Connector/J, MySQL’s JDBC driver, you’re probably shuddering at this point. The driver was written when Java 1.3 was all the rage, and received only a few minor updates. The code is spaghetti; full of coarse synchronization and generally inflexible and not very extensible.
Oh well.
Poking into the driver, I saw that a JDK Timer is started for each connection instance and a cleanup task is scheduled to run after the predefined time has passed. In com.mysql.jdbc.StatementImpl we see:
public class StatementImpl implements Statement {
...
class CancelTask extends TimerTask {
...
@Override
public void run() {
Thread cancelThread = new Thread() {
@Override
public void run() {
...
}
Cleanup task
The cleanup task scheduled by the driver is simple (also in com.mysql.jdbc.StatementImpl):
..
cancelConn = StatementImpl.this.connection.duplicate();
cancelStmt = cancelConn.createStatement();
cancelStmt.execute("KILL QUERY " + CancelTask.this.connectionId);
...
Aside from cleaning up the state held on the Statement instance (inside some truly awful synchronization blocks), the task performs two critical actions:
- Clones the JDBC connection to establish a new connection to the database.
- Issues a KILL QUERY <connection_id> query on the new connection, passing in the original connection’s ID.
Shortcomings
This works well except for two issues:
Timers
JDK Timers are old, clunky and moreover, expensive. Each one creates a new thread. (Even Javadocs on Timers recommend using ScheduledExecutorServices instead.)
Cloning connections
Cloning the JDBC connection each time can lead to problems with the number of available connections a database server may have available — particularly if it is processing runaway queries that are hogging resources. Such a timeout task may not be able to connect to the database to kill runaway queries.
The Query Sniper
Because of these shortcomings, we decided to write our own query sniper rather than simply set JDBC timeouts. The query sniper effectively does exactly the same thing as setting a timeout on the JDBC connection, except it’s implemented as a JOOQ ExecuteListener and submits a task to a single ScheduledExecutorService. This is configured with just one maintenance thread for the entire system every time a query is executed. Here is a simplified version of the idea:
package querysniper;import java.util.concurrent.ScheduledExecutorService;
import org.jooq.ExecuteContext;
import org.jooq.ExecuteListenerProvider;
import org.jooq.Query;
import org.jooq.impl.DefaultExecuteListener;import static java.util.concurrent.TimeUnit.MILLISECONDS;public class QuerySniper extends DefaultExecuteListener {
private static final int MAX_QUERY_TIME = 10_000L; // 10 secondsprivate final ScheduledExecutorService scheduledExecutorService;
private volatile boolean queryRunning = false;public QuerySniper(ScheduledExecutorService scheduledExecutorService) {
this.scheduledExecutorService = scheduledExecutorService;
}@Override
public void executeStart(ExecuteContext ctx) {
// Start a timer
queryRunning = true;// You'd typically have this hook into your system to understand how much
// time to let your query run for
long timeBudget = MAX_QUERY_TIME;scheduledExecutorService.schedule(() -> {
if (queryRunning) {
log(ctx, timeBudget);
if (ctx.query() != null) {
ctx.query().cancel();
}
queryRunning = false;
}
}, timeBudget, MILLISECONDS);
}@Override
public void executeEnd(ExecuteContext ctx) {
// Stop timer
queryRunning = false;
}@Override
public void exception(ExecuteContext ctx) {
// Stop timer on exception as well
queryRunning = false;
}private void log(ExecuteContext ctx, long timeBudget) {
System.out.println("Query exceeded %s ms. Killing it!", timeBudget);
System.out.println("Query: %s", ctx.sql());
}
}
Query Snipers are added to all JOOQ DSL sessions using JOOQ’s Configuration.set() API:
DefaultConfiguration cfg = getDefaultConfiguration();
ScheduledExecutorService executor = getScheduledExecutorService();
cfg.set(() -> new QuerySniper(executor));
Fine-grained timeouts
For added benefit, the query sniper inspects the request context and determines how much time it has remaining to run the query before the request times out; thereby, setting fine-grained timeouts on a per-query basis.
Cloning connections again
We wanted to maintain a separate, pre-established connection pool consisting of a single connection, solely for the query sniper to kill long running queries. That way, we’d overcome the issue of not being able to establish a new database connection. But, we hit a snag. To maintain our own connection pool for this purpose, we’d have to hand-craft our own KILL QUERY <connection_id> statement. MySQL’s JDBC driver doesn’t expose the transaction ID to allow us to do this.
Sadly, the query sniper just ends up calling Statement.cancel() on a running statement it wants to kill. This still causes the connection to be cloned, etc. just as before.
Conclusion
With our query sniper in place and running in production on a number of systems, we saved ourselves a fair few outages — outages we’d seen prior when testing a new system we built which had a few rogue queries. These outages would continue to happen (judging by the query sniper logging its activities) had it not shot down these abandoned queries.
What’s next?
As much as we’d like to open source this piece of code, it is far too closely tied to Square’s infrastructure to be extracted into a separate library. (If you want to see more, you could always join our team.) However, I hope the pattern described above and the code snippet included will be useful to others.
I’d really like to be able to maintain a persistent connection to the database. To this end, I hope to patch MySQL’s driver to expose its transaction ID, to allow for hand-crafted KILL QUERY … statements.
Or maybe what I really need to do, is rewrite the MySQL JDBC driver using modern ideas of concurrency and thread safety, resource management, configurability and extensibility? ;)