How to develop an on-premise data lake
Originally published at jugsi.blogspot.com.
What is a data lake?
A data lake is a system or repository of data stored in its natural format, usually object blobs or files.
Another interesting definition
If you think of a data-mart as a store of bottled water, the data lake is a large body of water in a more natural state.
What is the best technology for Data Lake?
It depends on the platform: Azure Data Lake Storage on Azure, Google Cloud Storage on GCP, S3 on AWS. But thanks to its popularity, the most common technology today is S3 (and S3-compatible storage).
S3 is so popular that it is supported by the most common analytics platforms and middleware:
- Kafka
- Airflow
- Spark
- Storm
- Hive / Hadoop
- …
S3 is so popular that it is possible to deploy an S3 server on-premise with Minio.
To work with Minio we can leverage Docker:
docker run -p 9000:9000 minio/minio server /data
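For anything beyond a quick test, it helps to pin the credentials and persist the bucket data on the host. A minimal sketch using Minio's standard credential environment variables (the container name, key values, and host path below are placeholders chosen for this example, not values from the original setup):

```shell
# Sketch: run Minio detached, with explicit credentials and a persistent volume.
# MINIO_ACCESS_KEY / MINIO_SECRET_KEY are Minio's credential environment
# variables; the values here are placeholders, pick your own.
docker run -d --name minio \
  -p 9000:9000 \
  -e "MINIO_ACCESS_KEY=myaccesskey" \
  -e "MINIO_SECRET_KEY=mysecretkey123" \
  -v /srv/minio/data:/data \
  minio/minio server /data
```

With `-v` the objects survive container restarts, and with fixed keys you no longer need to copy the generated ones from the container log.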
Please take note of the
AccessKey: XXXXXX
SecretKey: XXXXX
printed by Minio at startup, then we can try our ingestion:
aws configure
aws configure set default.s3.signature_version s3v4
aws --endpoint-url http://localhost:9000 s3 ls
aws --endpoint-url http://localhost:9000 s3 mb s3://mybucket
aws --endpoint-url http://localhost:9000 s3 ls
aws --endpoint-url http://localhost:9000 s3 cp awscli-bundle.zip s3://mybucket
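The interactive `aws configure` prompt can also be scripted, which is handy when the ingestion runs unattended. A sketch using the `aws configure set` subcommand (the `minio` profile name and the key values are assumptions made for this example):

```shell
# Non-interactive AWS CLI setup for a Minio endpoint (sketch).
# "minio" is a profile name chosen for this example; the key values
# are placeholders, use the ones from your Minio server.
aws configure set aws_access_key_id myaccesskey --profile minio
aws configure set aws_secret_access_key mysecretkey123 --profile minio
aws configure set s3.signature_version s3v4 --profile minio

# Then every call selects the profile explicitly:
aws --profile minio --endpoint-url http://localhost:9000 s3 ls
```

Keeping the Minio credentials in a named profile also avoids clobbering any real AWS credentials in the default profile.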
Finally, open a browser at http://localhost:9000 … and enjoy Minio.
Please visit the official documentation: https://github.com/minio/minio
OK, we have our repository up and running, but what about our data lake? Apache Hive is our friend….
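Hive maps partitions onto the directory layout of the objects, so it pays to write files under Hive-style `key=value` prefixes from the very first ingestion. A minimal sketch of building such a key (the `mybucket` name matches the bucket created above; the `events` dataset name and the date are invented for this example):

```shell
# Build a Hive-style partitioned S3 key from an ingestion date (sketch).
DT="2021-05-17"                       # ingestion date, YYYY-MM-DD (example value)
YEAR=${DT%%-*}                        # "2021"
REST=${DT#*-}                         # "05-17"
MONTH=${REST%%-*}                     # "05"
DAY=${DT##*-}                         # "17"
KEY="s3://mybucket/events/year=${YEAR}/month=${MONTH}/day=${DAY}/events.json"
echo "${KEY}"

# Upload with the same layout (requires the Minio endpoint from above):
# aws --endpoint-url http://localhost:9000 s3 cp events.json "${KEY}"
```

A Hive external table pointed at `s3://mybucket/events/` and partitioned by `year, month, day` can then pick up each day's files without rewriting anything.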