“Redshift” vs “Hadoop” vs “BigQuery”

Abhishek Srivastava
Next Gen Technology Insider
2 min read · May 24, 2016

When it comes to analysing big data sets, there are plenty of data warehousing options available on the market. It can be hard to decide which one to go with, and there are several factors to weigh before committing, such as:

  • Efficient querying and storage for structured data
  • Budget
  • Fully managed vs licensed
  • Type of data processing (batch/real-time)
  • On-demand availability and scalability

Here are some key features of a few popular data warehousing options:

Redshift

  • It is a cloud-based, fully managed data warehouse hosted by Amazon that lets you scale to massive data volumes as needed. It is based on PostgreSQL, which makes it easy to use and to integrate with other tools.
  • It uses a massively parallel processing (MPP) architecture. You launch a set of nodes, called an Amazon Redshift cluster. You provision the cluster by choosing the number of nodes and a node type, which determines the CPU, RAM and disk storage of each node. Once the cluster is running, you can upload your data set and run data analysis queries.
  • If you are an application developer, you can use the Amazon Redshift Query API or the AWS Software Development Kit (SDK) libraries to manage clusters programmatically.
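
As a rough illustration of managing a cluster programmatically, here is a minimal Python sketch that assumes the boto3 SDK. The parameter names follow boto3's Redshift `create_cluster` call; the helper names (`build_cluster_request`, `launch_cluster`), the cluster identifier, the node type and the region are hypothetical placeholders, not values from this article.

```python
# Sketch only: parameter names follow boto3's Redshift create_cluster call.
# Identifiers, node type, region and credentials are made-up placeholders.

def build_cluster_request(cluster_id, node_type, node_count, user, password, db_name="dev"):
    """Return keyword arguments for boto3's Redshift create_cluster()."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    request = {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "MasterUsername": user,
        "MasterUserPassword": password,
        "DBName": db_name,
        "ClusterType": "single-node" if node_count == 1 else "multi-node",
    }
    # NumberOfNodes is only sent for multi-node clusters.
    if node_count > 1:
        request["NumberOfNodes"] = node_count
    return request


def launch_cluster(request, region="us-east-1"):
    """Create the cluster. Needs the boto3 package and AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("redshift", region_name=region)
    return client.create_cluster(**request)
```

Calling `launch_cluster(build_cluster_request("demo-cluster", "dc2.large", 2, "admin", "..."))` would start a billable two-node cluster, so treat this as a starting point rather than something to run as-is.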

Hadoop

  • Apache Hadoop is used for massive data storage and for batch processing of that data. It is mature and popular, and many libraries support it. However, Hadoop is not suitable for real-time querying and data analysis.
  • Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed on the assumption that hardware failures are common and should be handled automatically by the framework.
  • The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
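
To make the MapReduce processing model more concrete, here is a small word-count sketch in plain Python. It only imitates the map, shuffle and reduce steps inside a single process; it does not use Hadoop or HDFS, and the function names are illustrative assumptions rather than Hadoop APIs.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in order over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real Hadoop job, the input lines would come from files in HDFS, and the map and reduce steps would run in parallel on different nodes of the cluster.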

BigQuery

  • Google’s BigQuery is a fully managed, highly scalable, low-cost, cloud-based analytics data warehouse. Because it is fully managed, the work of keeping the service fast and resilient is left to Google. There is no infrastructure to manage, so you can run real-time analysis on petabytes of data and get meaningful insights without needing a database administrator.
  • BigQuery transparently brings in as many resources as needed to run your query in seconds. The first terabyte (1 TB) of data processed each month is free.
  • You can access BigQuery services through APIs available for languages such as Java, Python and PHP. There is also a convenient web UI for running SQL-like queries on large data sets.
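
As one possible illustration of API access from Python, the sketch below assumes the google-cloud-bigquery client library and Google's public `bigquery-public-data.samples.shakespeare` sample table. The `top_words` helper is hypothetical; it accepts any client object that exposes `query(sql).result()`, which is what `bigquery.Client` provides.

```python
def top_words(client, limit=5):
    """Return (word, total) pairs for the most frequent words in a sample table."""
    sql = (
        "SELECT word, SUM(word_count) AS total "
        "FROM `bigquery-public-data.samples.shakespeare` "
        "GROUP BY word ORDER BY total DESC "
        f"LIMIT {int(limit)}"
    )
    rows = client.query(sql).result()  # blocks until the query job finishes
    return [(row["word"], row["total"]) for row in rows]


# Usage (needs the google-cloud-bigquery package and Google Cloud credentials):
#   from google.cloud import bigquery
#   print(top_words(bigquery.Client()))
```

Passing the client in as an argument keeps the helper easy to try out with a stand-in object before pointing it at a real project.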

Conclusion

Choose among the options above based on your data warehousing and data analysis requirements. My pick is Google BigQuery: it is fully managed, responds quickly, is low-cost and highly scalable, and lets you read and write data easily via Cloud Dataflow and Hadoop.

Originally published at Next Gen Technology Insider.


More than 17 years of experience in the analysis, design and development of enterprise applications using GCP, AWS and the MEAN stack