How to develop an on-premise data lake

Giacomo Veneri
Published in digitalindustry, Feb 22, 2019

Originally published at jugsi.blogspot.com.

What is a data lake?

A data lake is a system or repository of data stored in its natural format, usually object blobs or files.

Another interesting definition:

If you think of a data-mart as a store of bottled water, the data lake is a large body of water in a more natural state.

What is the best technology for a data lake?

It depends on the platform: Azure Data Lake Storage, Google Cloud Storage, AWS S3. Given its popularity today, though, the most common technology is S3.

S3 is so popular that it is supported by the most common analytics platforms and middleware:

  • Kafka
  • Airflow
  • Spark
  • Storm
  • Hive / Hadoop

S3 is so popular that an S3-compatible server can be deployed on-premise with Minio.

To run Minio we can use Docker:

docker run -p 9000:9000 minio/minio server /data
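The command above keeps its data inside the container, so the buckets disappear when the container is removed. A minimal sketch of one way around this, assuming a host directory mounted on /data with Docker's -v option; the names MINIO_DATA_DIR, DRY_RUN, run and start_minio are made up for this example and are not part of Minio or Docker:

```shell
# Hypothetical host directory that will hold the bucket data.
MINIO_DATA_DIR="${MINIO_DATA_DIR:-$HOME/minio-data}"

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Same server as above, with /data backed by the host directory.
start_minio() {
  run docker run -p 9000:9000 -v "$MINIO_DATA_DIR:/data" minio/minio server /data
}

# Usage:
#   DRY_RUN=1; start_minio   # only shows the docker command
#   DRY_RUN=0; start_minio   # starts the server (requires Docker)
```

Either way of starting the server prints the credentials in the container output.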

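Take note of the credentials

AccessKey: XXXXXX

SecretKey: XXXXX

that Minio prints at startup, and enter them when aws configure prompts for the access key ID and the secret access key. Then we can try our ingestion: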

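# Store the Minio credentials in the AWS CLI configuration
aws configure
# Use Signature Version 4 for S3 requests
aws configure set default.s3.signature_version s3v4
# List the buckets on the local server
aws --endpoint-url http://localhost:9000 s3 ls
# Create a bucket and check that it is listed
aws --endpoint-url http://localhost:9000 s3 mb s3://mybucket
aws --endpoint-url http://localhost:9000 s3 ls
# Upload a file into the bucket
aws --endpoint-url http://localhost:9000 s3 cp awscli-bundle.zip s3://mybucket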
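Typing --endpoint-url on every command gets tedious, and forgetting it sends the request to AWS instead of the local server. A small wrapper can help; this is only a sketch, and the names MINIO_ENDPOINT, DRY_RUN and s3local are hypothetical, chosen for this example:

```shell
# Endpoint of the local Minio server started with docker above.
MINIO_ENDPOINT="${MINIO_ENDPOINT:-http://localhost:9000}"

# Run an "aws s3" subcommand against the local endpoint.
# With DRY_RUN=1 the command line is printed instead of executed.
s3local() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "aws --endpoint-url $MINIO_ENDPOINT s3 $*"
  else
    aws --endpoint-url "$MINIO_ENDPOINT" s3 "$@"
  fi
}

# Usage (requires a running Minio server and a configured AWS CLI):
#   s3local mb s3://mybucket
#   s3local cp awscli-bundle.zip s3://mybucket
#   s3local ls s3://mybucket
```

The endpoint lives in one variable, so pointing the same commands at another on-premise server only means changing MINIO_ENDPOINT.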

Finally, open the browser at http://localhost:9000 and enjoy Minio.

For more details, see the official documentation: https://github.com/minio/minio

OK, we have our repository up and running, but what about our data lake? Apache Hive is our friend…