Changing AWS RDS IOPS provisioning in production

Robert Rees · Published in POP Developers · Sep 24, 2018

Recently we had to change the amount of IOPS we had provisioned on our production instances. The ModifyDBInstance documentation says this can be done without outage or disruption, but Googling for people who had made the change turned up a variety of experiences, so I thought it might be useful to add how our change went.

tl;dr: as per the documentation, it is fine to change your provisioned IOPS on a running system without risking outages in your production system.

Situation

We had made a database change that executed differently on production than it had on pre-production, so the database was having to do significantly more work. We hit our provisioned IOPS limit and database reads were being throttled, which left the entire system running slowly and dropped performance below acceptable levels for some users.
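For anyone trying to spot the same situation: IOPS saturation shows up in the RDS CloudWatch metrics, where ReadIOPS and WriteIOPS sit flat at the provisioned value while DiskQueueDepth climbs. Something like the following boto3 sketch (with a placeholder instance identifier) would pull the recent ReadIOPS figures so you can see whether you are pinned at the ceiling.

```python
import datetime
import boto3

# Placeholder identifier for illustration only.
DB_INSTANCE_ID = "production-db"

cloudwatch = boto3.client("cloudwatch")

# Pull the last hour of ReadIOPS for the instance in five-minute buckets.
# If the maximums sit flat at your provisioned value, you are being throttled.
now = datetime.datetime.now(datetime.timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/RDS",
    MetricName="ReadIOPS",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average", "Maximum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])
```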

Solution

We added an emergency index, which addressed the immediate problem and made sure we wouldn’t just consume any additional resources once they were supplied.

We then maxed out the Provisioned IOPS, which would effectively remove any quota or throttling restrictions.
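The change itself boils down to a single ModifyDBInstance call, whether you make it through the console, the CLI or the API. A minimal boto3 sketch, with a placeholder instance identifier and target value:

```python
import boto3

rds = boto3.client("rds")

# Placeholder identifier and IOPS value; the maximum you can request depends
# on your allocated storage and storage type. ApplyImmediately starts the
# storage modification straight away instead of waiting for the next
# maintenance window.
rds.modify_db_instance(
    DBInstanceIdentifier="production-db",
    Iops=20000,
    ApplyImmediately=True,
)
```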

Why we expected this to work

We have a shared pre-production environment where the IO demand on individual instances is small but can spike significantly as environments come online and are used for various testing procedures.

We were essentially applying the same configuration as our pre-production environment and were expecting similar performance as a result.

What happened

We have read replicas attached to our production master instance, and these have to be set to the new provisioned value before the master. This meant our migration time was doubled, but it also meant we got to track the process on a much less critical instance. The change on the replica turned out to be identical to that on the production instance.
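If you were scripting that ordering, it might look roughly like the sketch below (placeholder identifiers and target value, and the polling is deliberately crude): apply the change to each replica, wait for it to finish, then do the same to the master.

```python
import time
import boto3

rds = boto3.client("rds")

# Placeholder identifiers and target value.
REPLICAS = ["production-db-replica-1", "production-db-replica-2"]
MASTER = "production-db"
NEW_IOPS = 20000


def wait_for_iops(instance_id, target_iops):
    """Poll until the instance reports the new IOPS value and is available again."""
    while True:
        instance = rds.describe_db_instances(
            DBInstanceIdentifier=instance_id
        )["DBInstances"][0]
        status = instance["DBInstanceStatus"]
        print(f"{instance_id}: status={status}, iops={instance.get('Iops')}")
        if status == "available" and instance.get("Iops") == target_iops:
            return
        time.sleep(60)


# Replicas first, then the master.
for instance_id in REPLICAS + [MASTER]:
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        Iops=NEW_IOPS,
        ApplyImmediately=True,
    )
    wait_for_iops(instance_id, NEW_IOPS)
```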

The total time to provision the new IOPS was around 45 minutes from start to completion. However, we saw the new behaviour resulting from the raised limit around 20 minutes into the change. On the production instance the throttling of operations ceased, and after a spike of work the system returned to the responsiveness users were expecting.

Essentially the system was returned to health and normal operation well before the entire change was complete and without any outage of the service.

If we had needed to make this change in a hurry (say the system was down rather than in a degraded performance state), we would have been better off bringing down the replicas, provisioning the master and then bringing up a new replica.
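If you did need to script that emergency path, it might look something like this sketch (placeholder names again, and the waits between steps are left out for brevity):

```python
import boto3

rds = boto3.client("rds")

# 1. Drop the replica (read replicas cannot take a final snapshot).
rds.delete_db_instance(
    DBInstanceIdentifier="production-db-replica-1",
    SkipFinalSnapshot=True,
)

# 2. Resize the master straight away.
rds.modify_db_instance(
    DBInstanceIdentifier="production-db",
    Iops=20000,
    ApplyImmediately=True,
)

# 3. Once the master has settled, rebuild a fresh replica from it.
rds.create_db_instance_read_replica(
    DBInstanceIdentifier="production-db-replica-1",
    SourceDBInstanceIdentifier="production-db",
)
```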

What’s next?

There’s not a strong moral to this post; it’s really just about sharing information. Once we’d rectified the problem we could have downsized the provisioned IOPS, but to simplify the operations side we’re keeping the value higher than we strictly need for now, to stop mistakes in development impacting our clients.
