This post introduces, at a high level, the common big-data solutions for storing and accessing data, so if you are already a big-data pro you can skip this one.
In another post I will describe which tools we use here in the NMC and why.
So let’s get started. Generally speaking, there are two approaches to storing and querying large amounts of data:
- Data warehouse (on-premise or in the cloud)
- Scalable storage + query engine
Data warehouse
A large store of data accumulated from a wide range of sources within a company and used to guide management decisions.
A data warehouse solution is usually an appliance that encapsulates all the needed technologies (storage, query engine, metadata, etc.) for the client.
Data is loaded into the data warehouse with specific schemas and can then be queried. If the results need to be used externally, extracting them is possible (but not always easy).
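As a rough illustration, here is a minimal sketch of that load-and-query workflow, assuming a Redshift-style warehouse reached from Python with psycopg2. The host, table, bucket, and IAM role names are hypothetical placeholders, not a real setup.

```python
# Minimal sketch of the warehouse workflow: define a schema, load data
# into it, then query inside the warehouse. All connection details,
# table, bucket and IAM role names are hypothetical placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-warehouse.example.com", port=5439,
    dbname="analytics", user="loader", password="..."
)
cur = conn.cursor()

# 1. The warehouse expects a specific schema up front.
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id   BIGINT,
        url       VARCHAR(2048),
        viewed_at TIMESTAMP
    );
""")

# 2. Load the data into that schema (Redshift-style COPY from S3).
cur.execute("""
    COPY page_views
    FROM 's3://my-raw-bucket/page_views/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
    FORMAT AS PARQUET;
""")

# 3. Query inside the warehouse; extracting the results elsewhere is a
#    separate (and not always easy) step.
cur.execute("SELECT count(*) FROM page_views;")
print(cur.fetchone()[0])

conn.commit()
conn.close()
```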
Data warehouse solutions were very common in the past and they work fine, but they have some big disadvantages:
- Vendor lock-in
- Hard to scale
- Not elastic
* There are cloud-based data warehouse solutions such as AWS Redshift, but since most of them are internally similar to the query-engine architecture, I will classify them as query engines as well.
Query engine
Basically, a query engine translates SQL into MapReduce/DAG jobs over data of various sizes and formats. This is achieved either by using an existing framework (such as Hadoop MapReduce or Spark) or by using an independent implementation.
Since the query engine does not necessarily store the data itself, it usually supports multiple data sources and file formats.
Unlike most data warehouses, query engines are usually quite elastic and can scale easily, and thus offer a more cost-effective solution.
Common storage options are HDFS and AWS S3 (and their Azure and Google equivalents).
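To make the approach concrete, here is a minimal sketch of querying Parquet files that sit in object storage with Spark SQL. It assumes a Spark installation with the S3 connector configured; the bucket path and column names are hypothetical placeholders.

```python
# Minimal sketch of "scalable storage + query engine": the engine reads
# files straight from object storage and turns SQL into distributed jobs.
# Bucket, path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-query").getOrCreate()

# No load step into a proprietary store -- the data stays where it is.
events = spark.read.parquet("s3a://my-data-lake/events/")
events.createOrReplaceTempView("events")

# Plain SQL is translated by the engine into a DAG of distributed tasks.
daily = spark.sql("""
    SELECT to_date(event_time) AS day, count(*) AS events
    FROM events
    GROUP BY to_date(event_time)
    ORDER BY day
""")
daily.show()
```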
As you can see, there are more components to configure and it might take some time to optimize, but this approach is far more scalable and in most cases should also be more cost-effective (storage can be very cheap and compute resources can be managed on demand).
There are multiple solutions and relevant technologies here; I will elaborate in the post I mentioned at the beginning.
A little about data lakes
A data lake is a storage repository that holds a vast amount of raw data in its native format until it is needed. While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data.
Basically, a data lake is, metaphorically, a lake of data in its raw state flowing in from all over the organization (a minimal sketch of landing raw data in a lake follows the list below). The paradigm consists of:
- Unstructured data sources (the water)
- Storage (where the lake is)
- File system and file formats (what the lake looks like)
- Tools to analyze it
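As a minimal sketch of the "raw data in its native format" idea, here is how a file might be landed in an S3-based lake with boto3. The bucket name, local file name, and key layout are hypothetical placeholders.

```python
# Minimal sketch of landing raw data in a lake: the file is stored as-is,
# in its native format, and only gets structure later, when a query
# engine reads it. All names below are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="app_events_2020-05-01.json.gz",
    Bucket="my-data-lake",
    Key="raw/app_events/dt=2020-05-01/app_events.json.gz",
)
```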
The following illustration explains the differences between a data lake and a data warehouse quite well:
Summary
Storing a lot of data is not a simple task. If it is not done right, you can easily run into performance problems (writing and/or reading) and scale-related problems (cost, trouble scaling up, etc.).
Storing your raw data in cloud storage (such as AWS S3) and querying it with big-data query engines according to your needs is the most scalable and versatile solution.