AWS Glue

Write code once and re-use it multiple times!

Photo by Blake Connally on Unsplash

As a software engineer, you have probably learnt a few software design principles such as DRY (Don’t Repeat Yourself) and KISS (Keep It Simple, Stupid). Following them, we write our code in a reusable way and place it in utilities, common and shared folders.

When it comes to AWS Glue jobs, we can point a job to only a single file when creating it. …


Enterprise organisations are utilising cloud services to build data lakes, warehouses and automated ETL pipelines. In AWS Cloud, data lakes are built on top of Amazon S3 due to its durability, availability, scalability and low cost. Amazon Athena is one of the best tools to query data from S3. When it comes to programmatic interaction with AWS services, Boto3 is the first Python package that comes to everyone’s mind. But programmatically querying S3 data through Athena into Pandas dataframes for ETL hasn’t been that easy when using the Boto3 package alone.
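To illustrate why Boto3 alone feels verbose here, the following is a minimal sketch of the start, poll and fetch steps an Athena query needs. The function name `run_athena_query` and its parameters are hypothetical; the `athena` argument is assumed to be a client created with `boto3.client("athena")`, and pagination of large results is left out.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Run a query through an Athena client and return rows as dicts.

    `athena` is assumed to be a client from boto3.client("athena").
    Only the first page of results is read (illustration only).
    """
    # Step 1: submit the query; Athena writes results under output_location.
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Step 2: poll until the query reaches a terminal state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    # Step 3: fetch results; the first row holds the column names.
    result = athena.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

The returned list of dictionaries could then be passed to `pandas.DataFrame(...)`. Every value arrives as a string (or `None` for nulls), so type conversion is still a separate step.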

Recently, I came…


Introduction

In today’s world, there are many kinds of databases to choose from when developing web applications and microservices, each suited to a particular use case. As a result, data is often split across multiple places, which makes it complex and difficult to merge data from those databases and generate insights on a near-real-time basis.

Scenario

Assume you have to generate insights on top of data stored in databases like DynamoDB and DocumentDB, which get updated very frequently, and analytical data stored on top of S3 in…


Image source: https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Building_machine-learning_infrastructure_on_Amazon_EKS_with_Kubeflow_CON306-R1.pdf

Introduction:

Kubeflow is a free, open-source, Kubernetes-native machine learning platform for developing, orchestrating, deploying and running scalable and portable machine learning workloads. It started as an internal Google project and was later open-sourced. It made its debut at the annual KubeCon conference in 2017, and version 1.0 was released almost three years later, in March 2020. It can be installed in the cloud, on-premises and on local machines.

It is built mainly around 3 principles:

  1. Composability
  2. Portability
  3. Scalability

Here’s a reference architecture of Kubeflow on AWS:


Introduction:

While working on data engineering projects, you might have come across a use case similar to the one below, where real-time streaming data is ingested into S3 in a partitioned format (YYYY/MM/DD/HH) via a Firehose delivery stream and has to be consumed immediately to generate QuickSight dashboards.
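As a small illustration of the partitioned layout, the sketch below builds the S3 key prefix for the hour that contains a given timestamp. The helper name `partition_prefix` and the `events/` base prefix are hypothetical, and the sketch assumes the partitions are expressed in UTC.

```python
from datetime import datetime, timezone


def partition_prefix(base_prefix, moment):
    """Return the YYYY/MM/DD/HH key prefix for the hour containing `moment`.

    `moment` should be a timezone-aware datetime; it is converted to UTC,
    which this sketch assumes is the time zone used for the partitions.
    """
    utc = moment.astimezone(timezone.utc)
    return f"{base_prefix.rstrip('/')}/{utc:%Y/%m/%d/%H}/"


if __name__ == "__main__":
    # Prefix for the partition that is currently being written.
    print(partition_prefix("events/", datetime.now(timezone.utc)))
```

A consumer that only needs the latest data could restrict its scan to this prefix instead of reading the whole bucket.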

A sample end-to-end flow might look as follows:

Sample flow demonstrating generation of QuickSight reports from S3 data

The important point in the above workflow is the ability to get insights on the latest data as soon as it is ingested. We all know that using partitions to scan only the needed data is one of the best ways to improve query performance and reduce costs. But…

Subhash Burramsetty

Associate Technical Architect at Presidio Cloud Solutions
