Migrating from Oracle Exadata to AWS

Published in

Trainline’s Blog

7 min readJan 25, 2017

Since my talk at AWS re:Invent in December, we’ve had a lot of questions about how we tackled our Oracle Exadata migration to AWS.

This post goes through our Oracle migration journey in more detail and also includes some general tips for moving large databases to the Cloud.

Our Oracle Exadata Migration

The Challenge

Databases running on Exadata are at the core of many big systems and are a common challenge when moving to commodity hardware in the Cloud. Typically, they are big, high performance, and latency sensitive.

Our primary database was 4TB in size and required 600MB of mixed IO per second. We were fortunate that it was largely read heavy but unlucky in how the applications were using it. The database was called several times during some key user journeys so any extra latency or slow-down would be amplified for our users.

Another complication, as ever with Oracle, was licensing. We had enough licenses to cover 16 VCPUs in AWS but knew that keeping to this target was going to be tough. One downside of AWS is that you can’t miss performance on 16 cores and just go for 18 or 20 instead. The next size up is 32 cores which would have been painfully expensive!

Step 1: AWS Baseline

Our first move was to simply throw the database into AWS and get a feel for how it worked.

We used a specialist tool from Oracle called Real Application Testing (or ‘RAT’) to record 1 hour of live production traffic so we could play it back against the AWS version and compare performance.

As a starting point, we created an R3.8xLarge instance (32 cores), gave it 4 1TB disks (largest size available at the time) and played back the RAT test.

Initial results showed the scale of the problem — 33% slower on double the cores.

Step 2: Infrastructure Tuning

From the baseline is was obvious that the key issue for our particular database was IO throughput rather than CPU or RAM. We were hitting the throughput limits of individual EBS volumes and noticed that a large number of reads were hitting the comparatively expensive EBS volumes when we might be able to cache these better.

After a fair bit of experimenting, we ended up with the following configuration:

I2.4xLarge instance (16 cores, IO optimised, more ephemeral storage)
8 x 500GB EBS volumes (more disks to help Oracle ASM stripe and avoid hitting the throughput limits of any single volume)
Option set within Oracle to move 400GB of level 2 buffer cache to the faster ephemeral storage

These changes meant we had an IO optimised AWS Server, avoided single volume throughput limits and were able to serve a large proportion of our read traffic from local instance storage rather than EBS.

The new results from our RAT testing showed we were now 20% slower than our production database but on a 16 core machine. Overall this was a big step forward.

A couple of notes:

In general, we are big fans of AWS’s RDS service. Unfortunately, it wasn’t the right fit for this particular database due to the level of fine tuning we needed access to within Oracle
We did experiment with PIOPS for the EBS volumes but ultimately found that for us, GP2 offered the best performance/price point. Your mileage may vary, especially if your database is more write heavy.

Step 3: Database Tuning

Our next step was to review the database itself for optimisations.

We found we were keeping a longer transaction history in the OLTP database than necessary and so ran clean-up scripts to get rid of this older operational data. The history was already safe in our data warehouse and this simple step helped shrink the database considerably.

We also took the opportunity to upgrade to the latest version of Oracle to benefit from a few useful performance improvements in the new edition.

Finally, we tuned some indexes. On Exadata, indexes tend to be removed specifically to benefit from Exadata’s storage cell offloading but on AWS we had to keep analysing the slowest queries from each RAT run and tuning the execution plans to make better use of index scans. In some areas, the direct access from Level 2 flash cache was actually faster than index scans, although this will depend on your queries and workload.

At this point our RAT test results were around 12% slower than production.

Step 4: Application Tuning

Knowing that IO was the limiting factor, we did some analysis on the heaviest IO queries coming from our applications. This showed some obvious areas where we could reduce the amount of data being requested or the frequency of calls. This work got added to our Dev team’s migration backlogs and we were soon within the ~5% performance window we agreed would unblock migration to AWS.

It seems like the sheer power of Exadata has a tendency to make everybody just a little bit lazy. There is no urgency to optimise queries because results get thrown back near instantly even when badly written. The move to AWS has definitely helped serve as a bit of a forcing function to help us identify and optimise a few dodgy queries.

Today

Since our initial migration we have continued to optimise the infrastructure, database and application usage. Thanks to some amazing and ongoing work by our DBAs and Dev Teams, the key user journeys hitting our database now run 10% faster in AWS than they did on-prem against Exadata!

Unfortunately, it’s not really possible to get a direct like-for-like comparison of just the database element anymore. There are a lot of different factors that influence overall performance and the database and our use of it have evolved over time. None the less, the database went from being a major problem for our migration to a more flexible and cost effective solution hosted in AWS and providing our customers with a better experience.

We’re really pleased with how things worked out and would definitely encourage you not be put off if you too have a monster database to move!

Large Database Migrations — General Strategies

The following are some approaches to consider if you’re planning your own Exadata or large database migration.

Divide and Conquer

First it’s worth validating the assumption that your database really is one database. Many monolithic schemas were created for historical or convenience reasons rather than necessity. It may be there are logical subsets of schema that can be teased apart to reduce the scope of the problem.

Shrink and Shift

If the database must be kept whole, there are still opportunities for optimisation before migration.

For example, can the retention period for transactional data be reduced? Is the same period needed for all tables?

Proper use of indexes can make a big difference to the IO and memory requirements of your queries. The raw power of Exadata can cover for under-optimised indexes and hide problem queries. Finding and fixing these issues can reduce the target requirements in AWS considerably.

Modify Apps

Database read load can be reduced by:

Avoiding unnecessary calls and reducing the frequency of required calls
Minimising the volume of data retrieved by queries
Making greater use of application level caching
Working from read-replicas rather than a single master

Write heavy databases are harder to deal with. One strategy, if some writes can afford to be asynchronous, is to create a separate writer process. Suitable updates can then be posted to the writer’s queue and persisted to the database at a different speed to time-sensitive writes.

Change the Database Engine

One pretty new option worth considering is PostgresSQL for Aurora RDS.

Provided you aren’t using too many Oracle specific stored procedures, triggers etc. then migration should be relatively painless. Once there, Aurora can support impressive scale and speed. Aurora is really shaping up to be an impressive offering so do check it out if you haven’t already!

Go Native or Go Home

Sometimes the right choice is to give up. If your database is too big, too Oracle specific, too complicated to move to AWS, then chances are it is also the cause of significant friction and risk in your development cycle.

In these cases, starting from scratch and running the 2 systems in parallel is a good approach. You can build a new Cloud-native solution in AWS whilst verifying the results against the original system. Once you have enough confidence in the new system you can quietly lead your Exadata behind the cow shed.

In fact, you may be able to do this earlier than expected if you bring the new system online in stages. Once some load has moved on to the new system, what’s left on Exadata may be small enough to migrate using one of the other approaches above.

Summary

There are a number of options available to help move large databases to the public Cloud. What at first may look like an intractable problem can in most cases be broken down and handled by some combination of the different techniques.

Having your databases in AWS also comes with a raft of benefits in terms of agility, cost, flexibility etc. We’ve certainly found it well worth the effort. Good luck and happy migrations!