How to Build a Data Lake in AWS

Karl Robinson
Nov 7 · 4 min read

What is a Data Lake?

A Data Lake is a place where you can store all your structured and unstructured data at scale, to enable analytics to guide better business decisions. Analytics could be simple dashboards or visualisations, or more complex big data processing and machine learning to predict future business outcomes.

Why build a data lake?

Businesses that extract value from their data are likely to achieve a higher organic growth rate than their competitors, according to this Aberdeen report.

When building a data lake, you need to consider:

· Security — consider access to the components of the data lake: traditional perimeter security, Identity & Access Management for the users or services that will access the data, encryption of the data both at rest and in transit, and compliance with any standards that you need to adhere to. AWS has a range of services and tools to assist with all of the above.

· Scalability — AWS runs the largest public cloud infrastructure. As your data volumes grow, and usage of the data lake increases, you will have access to virtually limitless resources in the AWS cloud.

· Reliability — AWS S3 storage has been designed to deliver ‘11 9s’ (99.999999999%) durability, meaning loss of data is highly unlikely. AWS storage offerings include geographic redundancy and automatic replication.

· Cost — Tiered S3 storage — Standard, Standard Infrequent Access, One Zone Infrequent Access, Glacier, with automated lifecycle policies to migrate between tiers, all help to keep costs under control. And you only pay for the storage you consume.

· Ease of use — range of AWS services to assist in data movement, data lake storage, data lake analytics, and machine learning all accessible via the AWS console.
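The tiered-storage cost control mentioned above can be expressed as an S3 lifecycle configuration. Here is a minimal sketch; the `raw/` prefix, rule ID and day thresholds are illustrative assumptions, not prescriptions:

```python
import json

# Illustrative S3 lifecycle configuration: move objects under a "raw/"
# prefix to Standard-IA after 30 days and to Glacier after 90 days.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applying it needs AWS credentials, so the call is shown commented out:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )

print(json.dumps(lifecycle_configuration, indent=2))
```

Because lifecycle rules are just declarative JSON, you can version them alongside your infrastructure code and review tiering changes like any other change.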

Data Movement

The first step to building a Data Lake on AWS is to get your data into the AWS cloud. Data movement can be achieved in several ways, depending on the volume and rate of change of the data that you are looking to move. If you have large volumes of data that are not changing, you could establish a ‘Direct Connect’ with AWS — a 1 Gbps or 10 Gbps pipe directly from your network into the AWS cloud, over which you can copy your data to S3.
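A quick back-of-envelope calculation helps decide whether a Direct Connect is fast enough, or whether a physical appliance makes more sense. Here is a sketch; the 80% sustained-utilisation figure is an assumption for illustration:

```python
def transfer_days(data_tb: float, link_gbps: float,
                  utilisation: float = 0.8) -> float:
    """Estimated days to copy data_tb terabytes over a link_gbps link,
    assuming the link sustains `utilisation` of its nominal rate."""
    bits = data_tb * 1e12 * 8                     # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * utilisation)
    return seconds / 86400                        # seconds -> days

# 100 TB over a 1 Gbps Direct Connect at 80% utilisation:
print(f"{transfer_days(100, 1):.1f} days")        # roughly 11.6 days
```

If the answer comes back in weeks rather than days, that is usually the point at which a Snowball (or, at extreme scale, a Snowmobile) becomes the better option.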

If the volume of data would take too long to copy over a Direct Connect, you can obtain a physical storage appliance from AWS — they will ship (or drive) it directly to your premises, where you can connect it to your network, copy the data, and ship (or drive) it back to AWS. The portable devices are known as AWS Snowball, and there is also AWS Snowmobile — a datacentre on the back of a truck, for the largest volumes of data.

If new data is being generated all the time, then it would make sense for this data to be streamed directly to the AWS cloud using a service such as Amazon Kinesis Data Firehose. If it’s video streams, you can use Amazon Kinesis Video Streams, or if the data is being generated by IoT devices or sensors, you’ll want to use AWS IoT Core.
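A record sent to Kinesis Data Firehose is just a bytes payload; newline-delimited JSON is a common convention so that downstream tools can split records back apart. A minimal sketch, with an assumed stream name and event shape (the actual delivery call needs AWS credentials, so it is shown commented out):

```python
import json

def build_record(sensor_id: str, reading: float) -> dict:
    """Encode one event as a newline-delimited JSON Firehose record."""
    payload = json.dumps({"sensor_id": sensor_id, "reading": reading}) + "\n"
    return {"Data": payload.encode("utf-8")}

record = build_record("sensor-42", 21.7)

# Delivery would look like this (hypothetical stream name):
# import boto3
# boto3.client("firehose").put_record(
#     DeliveryStreamName="my-delivery-stream",
#     Record=record,
# )
```

Firehose then buffers records and delivers them in batches straight into S3, which is exactly where a data lake wants them.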

Data Lake Storage

Once the data is in the cloud, it can be easily stored in AWS S3 or Glacier, and catalogued with AWS Glue. Glue makes the data easily accessible to users, enabling ETL (Extract, Transform & Load) operations to get the data ready for further analysis.
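Glue crawlers infer table partitions from Hive-style key prefixes (`year=/month=/day=`), so it pays to write objects with that layout from the start. A sketch of key generation; the dataset name and file name are illustrative:

```python
from datetime import date

def partitioned_key(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned S3 key that Glue crawlers
    recognise as year/month/day partitions."""
    return (f"{dataset}/year={day.year}/month={day.month:02d}/"
            f"day={day.day:02d}/{filename}")

key = partitioned_key("clickstream", date(2019, 11, 7), "events.json")
print(key)  # clickstream/year=2019/month=11/day=07/events.json
```

Partitioning this way lets query engines like Athena scan only the dates a query actually touches, which directly reduces cost.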

Data Lake Analytics

Once the data is ready for analysis, there are a range of AWS services that can be leveraged, depending on your use case.

Interactive Analytics — Amazon Athena — analyse your S3 data with standard SQL queries. Athena is serverless so you only pay for the queries you run — you can run Athena as soon as your data is in the cloud and results are returned in seconds.
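An Athena query is just standard SQL submitted against a Glue-catalogued table. Here is a sketch; the database, table and column names are assumptions, and the API call (which needs AWS credentials and a results bucket) is shown commented out:

```python
# Illustrative Athena query over a hypothetical clickstream table.
query = """
SELECT sensor_id, avg(reading) AS avg_reading
FROM datalake_db.clickstream
WHERE year = '2019' AND month = '11'
GROUP BY sensor_id
ORDER BY avg_reading DESC
LIMIT 10
"""

# Running it would look like:
# import boto3
# boto3.client("athena").start_query_execution(
#     QueryString=query,
#     QueryExecutionContext={"Database": "datalake_db"},
#     ResultConfiguration={"OutputLocation": "s3://my-results-bucket/athena/"},
# )
```

Note the `WHERE year = ... AND month = ...` clause: filtering on partition columns means Athena only reads the matching S3 prefixes, keeping per-query costs down.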

Big Data Processing — Amazon EMR (Elastic MapReduce) — a managed service including Hadoop, HBase, Spark and Presto, for cost-effectively processing huge volumes of data.

Data Warehousing — Amazon Redshift — enables you to run complex queries against petabytes of structured data.

Real-time Analytics — Amazon Kinesis — enables you to analyse streaming data from IoT devices and website & application logs as it arrives in your data lake.

Operational Analytics — Amazon Elasticsearch Service — enables you to search, explore, filter, aggregate and visualise application monitoring, log and clickstream analytics data in near real time.
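Operational queries against the Elasticsearch Service use the standard Elasticsearch query DSL. A sketch that counts recent application errors; the index pattern and field names are assumptions:

```python
import json

# Query DSL: match ERROR-level log lines from the last 15 minutes.
query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-15m"}}},
            ]
        }
    }
}
print(json.dumps(query))
# Sent as e.g. POST /app-logs-*/_count against the domain endpoint.
```

The same filter can back a Kibana visualisation, which is the usual front end for an Elasticsearch-based operational dashboard.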

Dashboards and Visualisations — Amazon QuickSight — this is Amazon’s equivalent to Power BI, offering the ability to create rich dashboards and visualisations on the web.

Data Lake Machine Learning

AWS Deep Learning AMIs — enable experienced data scientists to quickly and cost-effectively run all the major machine learning frameworks on AWS, including TensorFlow and Apache MXNet.

Amazon SageMaker — enables you to build, train and deploy machine learning models.

So there are a lot of services available to help you import, store and analyse your data — but perhaps all of this sounds a little complicated? Fortunately, as with many AWS services, it is easy to find reference architectures which can be deployed in your own environment using CloudFormation templates. You can find a data lake solution on the AWS Solution Finder.

If it still looks too complicated, or you are not sure what to do with the data lake once deployed, you could enlist the help of an AWS Managed Services provider. They will bring the benefit of a team of AWS architects and data scientists ready to help you extract value from your business data.

Written by Karl Robinson

Experienced startup boot-strapper and closet cloud geek. Director and Co-Founder of Logicata, a Public Cloud Managed Services Provider: https://www.logicata.com/
