Moving Data From MySQL to Redshift or BigQuery: Which to Choose and How to Migrate
With the massive push toward business intelligence and data-driven insights, many businesses feel pressure from competitors to upgrade their legacy systems and move from traditional relational databases to a parallel cloud analytics engine like Amazon Redshift or Google BigQuery. There's a lot to keep in mind when considering the switch, though.
Here’s what you need to know and how to make the right choice for your business.
MySQL: How It Works and Where It Falls Short
Today, many are moving from a relational database (RDBMS) like MySQL to an analytical data warehouse like Amazon Redshift or Google BigQuery. Why?
As an RDBMS, MySQL has a lot of great benefits and strengths. These include:
- Maturity: MySQL is an extremely established database, meaning that there’s a huge community, extensive testing and quite a bit of stability.
- Compatibility: MySQL is available for all major platforms, including Linux, Windows, Mac, BSD and Solaris. It also has connectors to languages like Node.js, Ruby, C#, C++, Java, Perl, Python and PHP.
- Cost-effective: The database is open source and free.
- Replicable: The MySQL database can be replicated across multiple nodes, meaning that the workload can be reduced and the scalability and availability of the application can be increased.
- Sharding: While most SQL databases don't natively support sharding - splitting a single logical dataset across multiple databases - it's a well-established practice with MySQL servers, which keeps costs manageable as data grows.
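Application-level sharding typically routes each record to a shard by hashing its key, so all reads and writes for that key land on the same server. Here's a minimal sketch of that routing logic; the shard count and hostnames are hypothetical, not part of any MySQL feature:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical: four MySQL servers

def shard_for(user_id: str) -> int:
    """Map a user ID to a shard index via a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# All queries for a given user go to the same shard:
shard = shard_for("user-1234")
dsn = f"mysql://shard{shard}.db.internal/app"  # hypothetical hostnames
```

The hash must stay stable across application restarts (hence `hashlib` rather than Python's built-in `hash`, which is salted per process), or keys would silently migrate between shards.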
That said, there are also significant drawbacks for certain business structures. These revolve around the fact that MySQL is table-based, primarily vertically scalable, and uses structured query language (SQL) for defining and manipulating data. While all of these features have their benefits, to be sure, they pose a problem if you need a dynamic schema, easy manageability, horizontal scalability, fast analytical query performance, or flexibility in how you work with your data.
On top of that - and probably most importantly - there are issues when it comes to real-time analytics on large amounts of data. MySQL just can't handle real-time analytics at scale, simply because it wasn't built to work with data that way. Instead, it was designed to control and run fundamental business tasks. Its purpose and design are fast transactional random access, not analytics. And today, analytics, planning, and problem solving are becoming increasingly important.
That’s because, today, everyone wants to know the what, why, and how of their data - what it shows them about their company, their strategies, and the state of the market that they’re growing in, why they should continue or stop certain efforts, and how to optimize the information they’re getting and use it to grow their business.
Loading Into Redshift or BigQuery: The Steps
That’s why many move their data to a database that is designed to support these analytical functions. The two breakaway choices here, as mentioned, are Amazon Redshift and Google BigQuery.
Step 1: Redshift or BigQuery
The first step here, then, is deciding which of the two is best for you, your data, and your business. Now, we’ll get into the key differences in a minute, but let’s take a quick look at the similarities first and go over what exactly makes these databases so useful.
Columnar Data Storage and Massively Parallel
The first strength is the data storage itself and the way that these tools execute queries. Both tools use a columnar storage structure - rather than the row-oriented storage of a traditional relational database - which makes them ideal for analytical use cases. They also both use compression, fast internal networks, and multi-node data sharding.
Here's what that means and why it's a good thing: relational databases like MySQL store data internally in row form, meaning that all data rows are stored together and usually indexed by a primary key to make them easily accessible. This is great if you want to get transactional information - like reading rows by ID to display all of your user profiles - but it's horribly inefficient if you want to gain analytical insights. Why? It requires reading every row in full, including columns the query never uses, to produce results. That means spending a lot of time and I/O sifting through data that, for analytical purposes, is completely irrelevant.
With a columnar storage structure, this is no longer an issue. Data is stored in columns instead of rows, and a query reads only the columns it involves, resulting in fewer disk reads and better use of the CPU cache. In other words, it saves time, I/O, and storage space.
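A toy illustration of the access-pattern difference - this is plain Python, not how either warehouse is implemented, but it shows why summing one column is cheap when data is laid out column by column:

```python
# Row-oriented: each record carries every column.
rows = [
    {"id": 1, "name": "Ann", "age": 34, "revenue": 120.0},
    {"id": 2, "name": "Bob", "age": 41, "revenue": 80.0},
    {"id": 3, "name": "Cal", "age": 29, "revenue": 200.0},
]

# Column-oriented: one contiguous array per column.
columns = {
    "id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cal"],
    "age": [34, 41, 29],
    "revenue": [120.0, 80.0, 200.0],
}

# Analytical query: total revenue.
# A row store must visit every record, hauling along unused fields...
total_row = sum(r["revenue"] for r in rows)
# ...while a column store reads exactly one array.
total_col = sum(columns["revenue"])
assert total_row == total_col == 400.0
```

Similar values stored contiguously also compress far better, which is why both warehouses pair columnar layout with compression.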
Additionally, both tools are designed to be massively parallel, meaning that every query is distributed across multiple servers to accelerate query execution. In other words, you get fast, specific results every time you have a query.
While both are good options, there are some key differences in the features and services offered by each tool.
Panoply performed a real-world comparison of the two to find out how quickly and efficiently they handle queries, particularly as they increased in scale and complexity. At the end of their experiment (which can be found in detail here) they found that, once it was optimized and fine-tuned, Redshift outperformed Google BigQuery, noting that, “the only instance where BigQuery has superior performance is in big join operations.”
This brings us to our next point: simplicity. BigQuery basically works out of the box, while Redshift requires a very specific skill set and immense knowledge of how it works in order to use and optimize it effectively. That means that engineers constantly have to work to keep it optimized in order to keep up with your company’s data scale, your query patterns, and your industry’s best practices.
Looking objectively, BigQuery seems to be a more cost-effective choice. Whereas Redshift's pricing is structured around cluster uptime, BigQuery's is structured around storage and queries. This distinction, though small, can make Redshift more complex and unpredictable when you actually start to work on it - which could be a huge issue for users who want an intelligible, up-front understanding of their costs.
The Big Takeaway
Ultimately, there are tons of similarities between Amazon Redshift and Google BigQuery, with the big differences being that Google BigQuery is simple and user-friendly out of the box - requiring less expertise and engineering skill - while Redshift is more complex and arguably provides better performance and cost, especially at scale. So look at who in your company will be using the tool, their skill level, your budget, and how much you will realistically scale before making a decision.
Step 2: Loading Data to Redshift or BigQuery
Once you decide on your tool, here's how to load the data. Keep in mind, this has to be done carefully and precisely, as improperly or incompletely loaded data causes major problems down the line.
First things first: you can't work on Redshift like you'd work on your other tools. Specifically, unlike a typical RDS database, Redshift is not meant to handle lots of random INSERT, UPDATE, or DELETE statements. Instead, Amazon points you to its COPY command, which can bulk-copy data from S3.
This command is especially useful for data analysts or users who are working with lots of JSON or CSV data that they have to move and run queries on. In fact, it's specifically optimized for this kind of bulk loading use case - it will even automatically parse JSON-lines format, decompress gzip-compressed files in S3, and parse string-based embedded data types like timestamps.
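A COPY invocation names the target table, an S3 prefix, credentials, and the format. Here's a sketch that assembles the statement; the bucket, table, and IAM role are hypothetical placeholders, and you'd execute the resulting SQL through your Redshift connection of choice:

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Assemble a Redshift COPY command for gzipped JSON-lines files in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS JSON 'auto'\n"
        f"GZIP;"
    )

sql = build_copy_statement(
    "users",
    "s3://my-export-bucket/users/",          # hypothetical bucket
    "arn:aws:iam::123456789012:role/loader",  # hypothetical role
)
```

`FORMAT AS JSON 'auto'` tells Redshift to match JSON keys to column names automatically, and `GZIP` handles the decompression mentioned above. Loading from a prefix (rather than one file) lets Redshift pull every matching file in parallel.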
With that in mind, here are the five steps generally followed for loading data into Redshift:
- Create a Redshift cluster
- Export a MySQL database and split it into multiple files
- Upload the files to Amazon S3
- Run a COPY command to load the table to Redshift
- Verify that the data was loaded correctly
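The splitting in step two matters because Redshift's COPY loads files in parallel across cluster slices, so one giant file leaves most of the cluster idle. Here's a sketch of that step - splitting a newline-delimited export into evenly sized chunks ready for S3 upload (the file naming is hypothetical):

```python
import os

def split_export(src_path: str, out_dir: str, num_chunks: int) -> list[str]:
    """Split a newline-delimited dump into num_chunks files for parallel COPY."""
    with open(src_path) as f:
        lines = f.readlines()
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i in range(num_chunks):
        # Round-robin assignment keeps chunk sizes within one line of each other.
        chunk = lines[i::num_chunks]
        path = os.path.join(out_dir, f"users_part_{i:02d}.json")
        with open(path, "w") as out:
            out.writelines(chunk)
        paths.append(path)
    return paths
```

A good rule of thumb is to make the number of files a multiple of the number of slices in your cluster, so every slice gets an equal share of the work.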
As with Redshift, you generally don't send BigQuery individual INSERT or UPDATE statements. Instead, you either send it streaming writes, or you bulk load data using the bq tool.
The bq tool works similarly to the Redshift COPY command, assisting with bulk loading of JSON or CSV data, or even syncing that data from Amazon S3 buckets. More specifically, the bq load command can be used to upload files to your datasets, adding schema and data type information along the way (the syntax can be found here in the Quickstart guide for bq).
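A bq load invocation comes down to a source format, a destination table, a file, and a schema. Here's a sketch that assembles the command; the dataset, bucket, and schema are hypothetical, and you'd run the result with subprocess once the Cloud SDK is installed and authenticated:

```python
def build_bq_load_command(dataset_table: str, source_uri: str,
                          schema: str) -> list[str]:
    """Assemble a `bq load` invocation for newline-delimited JSON."""
    return [
        "bq", "load",
        "--source_format=NEWLINE_DELIMITED_JSON",
        dataset_table,   # e.g. "mydataset.users" (hypothetical)
        source_uri,      # a local file or gs:// URI (hypothetical)
        schema,          # inline schema: "name:type,name:type"
    ]

cmd = build_bq_load_command(
    "mydataset.users",
    "gs://my-export-bucket/users_part_00.json",
    "id:INTEGER,name:STRING,age:INTEGER",
)
# import subprocess; subprocess.run(cmd, check=True)  # once bq is configured
```

Building the argument list in one place makes it easy to loop over every exported chunk and table, which is exactly the repetition the next step describes.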
Then, you just repeat this process as many times as it takes to load all of your tables into BigQuery.
If this seems overly complicated - or if you simply don't want to take it on yourself - Xplenty is the right data integration platform for you.