AWS Redshift Primer for Dummies
Redshift is a fully managed, cloud-based data warehouse service that can handle petabytes of data. You provision only what you need and scale out as your data grows. It lets companies extract useful insights about their customers and business without spending heavily on infrastructure and maintenance.
Building a data warehouse starts with creating an Amazon Redshift cluster. You can then load data and begin extracting insights. That is where Redshift excels: it can handle petabytes of data and return results quickly.
It is built on massively parallel processing (MPP) technology, which allows it to run certain operations much faster than conventional databases. MPP is the main distinguishing factor that makes it a strong choice for building data lakes and data warehouses, and for running analytics at big data scale. Because of this architecture, understanding the internals of the database is essential to using it to its full potential.
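To get a feel for what MPP means, here is a minimal Python sketch (not Redshift code): a dataset is partitioned into slices, each slice is aggregated by its own worker, and the partial results are combined. In a real MPP system like Redshift, each slice lives on a separate node slice with its own CPU and storage; the thread pool here only stands in for that, and all the names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset: salaries spread across 4 "slices". Redshift distributes
# rows across node slices in a similar spirit; this is only a sketch.
salaries = list(range(1000))
num_slices = 4
chunk = len(salaries) // num_slices
slices = [salaries[i * chunk:(i + 1) * chunk] for i in range(num_slices)]

def partial_sum(rows):
    # Each slice aggregates its own rows independently of the others.
    return sum(rows)

# Scatter: each worker aggregates one slice; gather: combine partials.
with ThreadPoolExecutor(max_workers=num_slices) as pool:
    partials = list(pool.map(partial_sum, slices))

total = sum(partials)
print(total)  # same answer as a single sequential sum(salaries)
```

The key idea is that the aggregation is decomposable: each slice computes a partial answer in isolation, so adding more slices (nodes) lets the same query finish faster.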
Technology:
Amazon Redshift was built on top of an older version of PostgreSQL, but it differs from PostgreSQL and other RDBMSs in that it is a columnar database. Traditional RDBMSs are optimized for single-row, transactional operations and use indexes to improve query performance. They work best when you already know the kinds of queries that will be run against them, which is why they are usually the first choice for software development. They are not well suited to big data, analytics, and data warehouse workloads.
A relational database table looks like this (the values are illustrative):

id | name  | salary
---+-------+-------
 1 | Alice | 52000
 2 | Bob   | 48000
 3 | Carol | 61000

and stores the data on disk row by row, with every field of a record kept together:

(1, Alice, 52000) (2, Bob, 48000) (3, Carol, 61000)
If you want to find users with a salary in a specific range, the database has to scan every row to find the matches. We can build an index on salary to fix that:

salary | row
-------+----
 48000 |  2
 52000 |  1
 61000 |  3

The index keeps salaries in sorted order, so searching within it is much faster, but every query now has to touch the index first and then the table. The index also requires additional disk space.
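The idea above can be sketched in Python: a sorted list of (salary, row position) pairs stands in for the database's index, and a binary search replaces the full table scan. The table values and names are illustrative.

```python
import bisect

# Toy table of (id, name, salary) rows; values are illustrative.
rows = [
    (1, "Alice", 52000),
    (2, "Bob", 48000),
    (3, "Carol", 61000),
    (4, "Dave", 55000),
]

# A secondary "index": salaries kept sorted, each paired with the
# position of its row in the table.
index = sorted((salary, pos) for pos, (_, _, salary) in enumerate(rows))
keys = [salary for salary, _ in index]

def salary_range(lo, hi):
    # Binary-search the index instead of scanning every row, then
    # follow the stored positions back into the table.
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return [rows[pos] for _, pos in index[start:end]]

print(salary_range(50000, 60000))  # [(1, 'Alice', 52000), (4, 'Dave', 55000)]
```

Note the two costs the text mentions: the query does index work first and table lookups second, and the index itself duplicates the salary data, taking extra space.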
For BI and data warehousing, the queries are open-ended, so you need the ability to slice and dice by almost any field. That is where columnar databases come in. Rather than storing data along rows, data is stored along columns. It is a small change, but its effects are transformational: every field behaves like an index, which makes bulk operations extremely fast. Columnar databases are also better suited to encoding, which improves space utilization through compression.
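As a toy illustration of why column storage compresses well, here is run-length encoding applied to a low-cardinality column. Redshift does offer RUNLENGTH among its column compression encodings, but this sketch is not its implementation, and the column values are invented for the example.

```python
from itertools import groupby

# A low-cardinality column (e.g. a "department" field) stored
# column-wise: long runs of equal values sit next to each other.
department = ["sales"] * 5 + ["eng"] * 3 + ["hr"] * 2

def rle_encode(column):
    # Store each run of equal values once, with its repeat count.
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(encoded):
    # Expand each (value, count) pair back into the original run.
    return [value for value, count in encoded for _ in range(count)]

encoded = rle_encode(department)
print(encoded)  # [('sales', 5), ('eng', 3), ('hr', 2)]
```

Stored row-wise, those department values would be interleaved with other fields and no such runs would exist, which is why this kind of encoding pays off mainly in columnar layouts.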
A columnar database would store the same data like this:

id:     1, 2, 3
name:   Alice, Bob, Carol
salary: 52000, 48000, 61000
You can see that every column effectively acts as an index, which eliminates the need to build separate indexes and makes operations faster, especially when returning results for analytics queries.
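A minimal Python sketch of the difference: the same toy table stored row-wise and column-wise, with an aggregate over salary. In the column layout the query reads only the salary list and never touches the id or name data.

```python
# Row-oriented layout: each record stored together (list of tuples).
row_store = [
    (1, "Alice", 52000),
    (2, "Bob", 48000),
    (3, "Carol", 61000),
]

# Column-oriented layout: each field stored together (one list per column).
col_store = {
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Carol"],
    "salary": [52000, 48000, 61000],
}

# Row store: a salary aggregate has to walk every record, pulling each
# full row off disk even though only one field is needed.
avg_row = sum(r[2] for r in row_store) / len(row_store)

# Column store: the same aggregate scans just the salary column,
# skipping the id and name data entirely.
avg_col = sum(col_store["salary"]) / len(col_store["salary"])

print(avg_row, avg_col)  # identical answers, very different I/O
```

For three rows the difference is invisible, but over billions of rows reading one column instead of every column is exactly the saving that makes analytics queries fast on columnar systems.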
Bluudata is a big data company specializing in building data lakes. Please feel free to reach out to us if you have any further questions.
