Right-Sizing AWS RDS? Request Mirroring Load Tests Come to the Rescue
We all know the pain of trying to choose the right size for an AWS RDS (Amazon Relational Database Service) instance. There are many instance types optimized for different use cases, and within each type there is a wide variety of sizes to choose from: large, xlarge, 2xlarge, 4xlarge, and so on. You definitely don’t want to over-provision and waste tons of money. At the same time, if you choose a database instance size that is too small, resource bottlenecks can cause outages or performance regressions that heavily impact your availability and latency.
Choosing the right size for your AWS RDS instances is an art.
Some metrics you should keep an eye on…
Here I’m going to combine my own experience with AWS’s official best practices for Amazon RDS to give you some key metrics for understanding your database’s performance and identifying whether you are hitting any sort of limit. Each instance size comes with its own limits on everything, and hitting any of them can become a bottleneck that affects your performance.
- CPU Utilization — Percentage of computer processing capacity used.
If at any point in time the max CPU utilization on any database instance exceeds 75%, that is definitely a sign that you should scale out to more instances (horizontal scaling). If this is a one-writer, multiple-readers architecture and your writer instance’s CPU utilization sometimes rises above 75%, that is a sign calling for a larger instance size (vertical scaling).
- Average Active Sessions (AAS) — An active session is a database connection that has submitted a request to the database but has not yet received the response.
Measuring the average number of active concurrent sessions over time provides a clear picture of the load on the database. If the AAS number exceeds the number of virtual CPUs on the instance, your database instance is overloaded. More detailed information on understanding AAS can be found here.
- Freeable Memory — How much RAM is available on the DB instance, in megabytes.
The red line in the Monitoring tab metrics is marked at 75% for CPU, memory, and storage metrics. If the instance’s memory consumption frequently crosses that line, you should examine your workload or upgrade your instance.
- Read Throughput, Write Throughput — The average number of megabytes read from or written to disk per second.
- Network Receive Throughput, Network Transmit Throughput — The rate of network traffic to and from the DB instance in bytes per second.
- DB Connections — The number of client sessions that are connected to the DB instance.
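As a toy illustration of the CPU and AAS rules above, here is a minimal sketch of an overload check. The 75% threshold comes from the discussion above; the function names and the vCPU count are mine, not an official AWS API.

```python
# Sketch: flag a DB instance as overloaded when peak CPU crosses 75%
# or when average active sessions (AAS) exceed the instance's vCPUs.

def average_active_sessions(samples):
    """samples: active-session counts taken at regular intervals."""
    return sum(samples) / len(samples)

def is_overloaded(cpu_samples, session_samples, vcpus):
    peak_cpu = max(cpu_samples)                       # worst-case CPU %
    aas = average_active_sessions(session_samples)    # average load
    return peak_cpu > 75.0 or aas > vcpus

# Example: a 2-vCPU instance
print(is_overloaded([40, 60, 80], [1, 2, 1], vcpus=2))  # True: CPU peaked at 80%
print(is_overloaded([30, 50, 55], [1, 2, 1], vcpus=2))  # False: under both limits
```

In practice you would feed this from CloudWatch or Performance Insights data rather than hand-picked samples.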
Though, if you ask me, you can read all the documentation in the world and still not confidently choose the correct size for your current database load without actually trying it out. This is especially true when you are downsizing from an under-utilized large database instance. How can you say with confidence that the downsizing will not cause any performance regression? Similarly, how can you say with confidence that upsizing will bring much better performance? More importantly, which instance is the right target for your downsizing or upsizing?
To answer these hard questions, I’m going to break them down and give you some practical solutions to help you load test properly, so that you can confidently draw your own conclusions for your case.
How to load test read traffic on a different instance size
Traditionally, a database takes both read and write traffic. If your database separates reader and writer instances (e.g. Amazon Aurora) and currently has multiple reader instances with traffic load-balanced round-robin across them, you can easily test how the current load performs on a different instance size by spinning up an extra reader instance of the target size and terminating one of the old reader instances.
With the AWS built-in metrics, you can easily check whether the current traffic is too little or too much load for the new instance size. You do need to execute this method with precaution, though. Since this is a load test against production traffic, it is less ideal than some other methods (you can imagine many reasons not to do it). Taking Aurora as an example, make sure not to leave mixed instance sizes in the cluster for too long; AWS recommends against it because of Aurora’s automatic writer failover mechanism. Also make sure to set the new reader’s failover priority lower, so that it won’t be selected as the new writer if anything bad happens to the writer.
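For illustration, the steps above might look like this with the AWS CLI. All identifiers and the instance class here are placeholders for your own setup; promotion tier 15 is the lowest failover priority, so the test reader is never promoted to writer.

```shell
# Add a reader of the target size with the lowest failover priority.
aws rds create-db-instance \
  --db-cluster-identifier my-aurora-cluster \
  --db-instance-identifier my-test-reader \
  --db-instance-class db.r6g.xlarge \
  --engine aurora-mysql \
  --promotion-tier 15

# Once you have observed the metrics, remove one of the old readers
# so the cluster does not stay mixed-size for long.
aws rds delete-db-instance \
  --db-instance-identifier my-old-reader
```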
How to load test writing traffic on different instance size
Load testing write traffic on a different instance size is what most people are mainly stressed about when talking about database right-sizing. One obvious approach is to write some load tests and set up the infrastructure to compare the performance of two databases of different instance sizes — one the same size as production, and another of the target size you would like to upsize or downsize to. While this works, writing load tests and setting up load-testing infrastructure is no easy task for an engineer. It requires in-depth knowledge of the database’s use case to craft SQL statements that generate a load similar to production, and plenty of infrastructure knowledge to set up the load test environment properly.
Here, I would like to talk about two methods that let you do a shallow load test by mirroring production traffic, with much less setup work than writing your own load tests. I recently used both methods to load test two different databases, and I would love to share my experience as inspiration for how you can easily load test your database and find the right size for it.
Method 1: Use Envoy to do request mirroring
Here’s some background information for folks who might be new to envoy or envoy request mirroring.
What is Envoy?
Envoy is an open source edge and service proxy, designed for cloud-native applications.
What is request mirroring?
Request mirroring (also called shadowing, or shadow mirroring) sends a copy of production traffic to a destination cluster of the developer’s choosing. Shadowing production traffic lets us replay the production load against a test instance (in our case, a test database) to get an accurate comparison and understand how the test instance behaves under the same load, without affecting end clients in any way. This helps us make risky changes with higher confidence and validate the test instance before pushing it to production.
Is this the right choice for you?
If you already use Envoy as a proxy in your infrastructure and all your database requests originate from client-side requests, using Envoy to do request mirroring is the perfect method for you. Allow me to elaborate with a picture.
As you can see from the Envoy documentation, Envoy provides powerful features like request mirroring that make it easy to load test your database. All you need to do is change your Envoy configuration file so it also sends mirrored requests to a clone of your application. After you set up the clone application and the load test clone database, make sure the clone application points to the DNS of the clone database, and you are ready to go. Change the Envoy configuration to enable request mirroring, and voilà: Envoy sends a copy of each HTTP request to your clone application automatically. You can then wait and observe how the same load performs on your load test database, watching for unexpectedly long latency or any triggered warnings. If everything looks good, you have successfully chosen the right target instance size. If not, you can always change the load test database size and repeat the process until you are happy with the result.
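As a minimal sketch of the configuration change, an Envoy route can declare a mirror policy alongside its primary cluster. The cluster names below are placeholders for your own setup, and this fragment assumes the surrounding listener/cluster definitions already exist.

```yaml
# Mirror 100% of requests on this route to the clone cluster.
route_config:
  virtual_hosts:
    - name: backend
      domains: ["*"]
      routes:
        - match: { prefix: "/" }
          route:
            cluster: prod_service            # production application
            request_mirror_policies:
              - cluster: clone_service       # clone application (fire-and-forget)
                runtime_fraction:
                  default_value: { numerator: 100, denominator: HUNDRED }
```

The `runtime_fraction` also lets you start by mirroring only a small percentage of traffic and ramp up.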
This practice is very safe thanks to Envoy’s fire-and-forget mirroring mechanism. Because a separate clone service is in charge of sending database traffic to the load test clone database, you do not put extra load on the production service the way you would if it fired the mirrored traffic itself. This clear separation between the load test service and load test database provides an isolated, safe environment for request mirroring without affecting production traffic.
- Make sure the clone application has the same setup as the production application. Otherwise you introduce more variables into the load test, and it becomes harder to draw conclusions from two tests performed with different setups.
- Make sure the clone application is linked to the clone load test database endpoint. You never want to mess with your production database (danger of double writes). If your application writes to more than one database (including databases you don’t want to right-size), redirecting the other database endpoints to invalid ones is one way to avoid double writes to them. However, if failed reads or writes are retried, this will affect the latency seen from the clone service side, so keep that in mind when interpreting the load test results.
- I would also like to point out that with today’s microservice architectures, your service might call downstream services that write to other databases. Be careful with that, or you might introduce double writes to the databases of your downstream services. You can identify those downstream services and add some code to the clone service so it does not send traffic to them. A monolith has no such worries, so Envoy request mirroring can be ideal for load testing a monolith or a service with no downstream services.
Method 2: Use ProxySQL to do request mirroring for SQL traffic
If your current infrastructure doesn’t let you use Envoy request mirroring, here’s a second option that covers more possibilities for mirroring database traffic — use ProxySQL to do request mirroring for SQL traffic.
As usual, here’s some background information for folks who might be new to ProxySQL or ProxySQL request mirroring.
What is ProxySQL?
ProxySQL is an open-source, high-performance, high-availability, database-protocol-aware proxy for MySQL. With ProxySQL Query Rules you can route writes to primaries, distribute reads across replicas, and rewrite queries on the fly with highly granular criteria.
What is ProxySQL request mirroring?
Summarizing in my own words: when the mirroring feature is enabled through ProxySQL Query Rules, ProxySQL forwards each SQL statement to the production database as usual and also sends a copy of it to the clone database. ProxySQL request mirroring guarantees neither execution order nor data consistency, but it is very useful for comparing the performance of two slightly different databases (for example different MySQL versions, or in our case different DB instance sizes) by sending the same read and write SQL statements to both.
Is this the right choice for you?
Unlike request mirroring with Envoy, here we leverage ProxySQL’s built-in request mirroring to send a mirrored SQL statement to the load test clone database every time a statement is sent to the production database. This is perfect when you have multiple microservices communicating with one database, or when reads and writes come from requests other than HTTP (e.g. Kafka consumers issuing SQL write statements). You can do request mirroring by adding a ProxySQL layer in the middle: just replace the database hostname in the application configuration with the newly set up ProxySQL hostname, and ProxySQL takes care of the mirroring. Allow me to elaborate using a picture:
As shown in the picture, instead of mirroring HTTP requests to simulate production traffic load, here we directly mirror SQL statements by introducing a new proxy layer provided by the open-source project ProxySQL.
This practice doesn’t carry the double-write danger we discussed in the Envoy method, since only the SQL statements for this database are mirrored. And because every SQL statement passing through ProxySQL can be mirrored, we can cover scenarios where one database’s statements are issued by multiple services or sources (Kafka, for instance). This is personally my second and only remaining choice when Envoy request mirroring would not be sufficient.
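For illustration, the mirroring setup is configured through ProxySQL’s admin interface roughly as follows. The hostnames and hostgroup numbers here are placeholders; this is a sketch, not my exact production configuration.

```sql
-- Hostgroup 0 = production database, hostgroup 2 = load test clone.
INSERT INTO mysql_servers (hostgroup_id, hostname, port)
VALUES (0, 'prod-db.example.com', 3306),
       (2, 'clone-db.example.com', 3306);

-- Mirror every statement routed to hostgroup 0 into hostgroup 2.
INSERT INTO mysql_query_rules
  (rule_id, active, match_pattern, destination_hostgroup, mirror_hostgroup, apply)
VALUES (1, 1, '.', 0, 2, 1);

LOAD MYSQL SERVERS TO RUNTIME;      SAVE MYSQL SERVERS TO DISK;
LOAD MYSQL QUERY RULES TO RUNTIME;  SAVE MYSQL QUERY RULES TO DISK;
```

A narrower `match_pattern` lets you mirror only a subset of statements if you want a gentler start.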
- Make sure the setup is done properly when using ProxySQL for request mirroring. As shown in the picture, you need to set up some EC2 machines running the ProxySQL service and then set up DNS for them before shifting traffic to the new hostname. Since we are shifting production SQL traffic through the ProxySQL service, it is critical that the setup is correct; otherwise it can cause downtime.
- Have a mechanism to shift traffic to the ProxySQL service gradually. As I mentioned, we are adding an extra ProxySQL layer between your production service and production database. A mechanism to shift traffic smoothly and roll back when things turn wrong is very valuable in case things go south.
- ProxySQL is a less established open-source project compared to Envoy. I personally ran into some errors like this one during my own request mirroring process. It blocked one of my top SQL statements from reaching the load test clone cluster, which led to inaccurate load test results. The good news is that it didn’t block forwarding traffic to the production database; it only threw warnings when trying to send the mirrored request to the clone cluster. So if you are an expert on ProxySQL, definitely go for it; otherwise there is some risk of running into unsolvable bugs. But as I said, ProxySQL tackles a more general use case that Envoy request mirroring can’t handle.
- The extra network hop through ProxySQL might introduce some latency as seen from the service-side metrics. If you only look at database-side metrics, this adds no latency at the database layer. But if you also evaluate latency measured from the application side, you may see extra latency from the additional ProxySQL service in the network path. You can always rely on a tracing tool to factor out the time spent in ProxySQL, but that requires more patience and precaution when looking at the data.
I stumbled and struggled during my first journey load testing for database right-sizing. I hope this article gives you some inspiration if you are facing similar challenges.
Thank you for reading. Please reach out to me in the discussion or on LinkedIn if you have any thoughts or comments.
Happy new year everyone :)