As a software engineer, you have probably come across software design principles such as DRY (Don’t Repeat Yourself) and KISS (Keep It Simple, Stupid). Hence, we try to write our code in a reusable way by placing it in utilities, common and shared folders.
Clean and modular code makes lives simpler!
AWS Glue jobs, however, let us point a job at only a single file when the job is created. …
Enterprise organisations are using cloud services to build data lakes, warehouses and automated ETL pipelines. In the AWS Cloud, data lakes are built on top of Amazon S3 because of its durability, availability, scalability and low cost. Amazon Athena is one of the best tools for querying data in S3. When it comes to programmatic interaction with AWS services, Boto3 is the first Python package that comes to everyone’s mind. But programmatically querying S3 data through Athena into Pandas DataFrames for ETL hasn’t been that easy when using the Boto3 package alone.
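To give a rough idea of the plumbing involved when Boto3 is used on its own, here is a minimal sketch. The function name `run_athena_query` and its parameters are hypothetical, and `client` is assumed to behave like `boto3.client("athena")`. The sketch reads only the first page of results and returns every value as a string, which is how Athena hands them back.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Run a query through an Athena client and return the rows as dicts.

    `client` is assumed to behave like boto3.client("athena").
    Only the first page of results is read in this sketch.
    """
    execution_id = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        status = client.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    rows = client.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    # NULL cells arrive as empty dicts, so .get() maps them to None.
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]
```

With a real client, the returned list of dicts could be passed to `pandas.DataFrame(records)`. Type conversion, pagination and error details would still have to be handled by hand.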
In today’s world, when developing web applications and microservices, there are many kinds of databases to choose from, each suited to a particular use case. Often the data ends up split across multiple places, which makes it complex and difficult to merge data from those databases and generate insights in near real time.
Assume you have to generate insights from data stored in databases like DynamoDB and DocumentDB, which is updated very frequently, and from analytical data stored on S3 in…
Kubeflow is a free, open-source, Kubernetes-native machine learning platform for developing, orchestrating, deploying and running scalable and portable machine learning workloads. It started as an internal Google project and was later open-sourced. It made its debut at the annual KubeCon conference in 2017, and version 1.0 followed in March 2020. It can be installed in the cloud, on-premises and on local machines as well.
It is built mainly around 3 principles:
Here’s a reference architecture of Kubeflow on AWS:
While working on data engineering projects, you might have come across a use case similar to the one below, where real-time streaming data is ingested into S3 in a partitioned format (YYYY/MM/DD/HH) via a Firehose delivery stream and has to be consumed immediately to generate QuickSight dashboards.
A sample end-to-end flow might look as follows:
An important point in the above workflow is the ability to get insights on the latest data as soon as it is ingested. We all know that using partitions to scan only the needed data is one of the best ways to improve query performance and reduce costs. But…
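To make the YYYY/MM/DD/HH partition layout concrete, here is a small sketch that computes the S3 key prefix for the hour containing a given timestamp. The helper name `hourly_partition_prefix` and the `events/` base prefix are hypothetical. The sketch also assumes that a timestamp without timezone information is already in UTC, since Firehose stamps its default prefixes in UTC.

```python
from datetime import datetime, timezone


def hourly_partition_prefix(moment, base_prefix=""):
    """Return the YYYY/MM/DD/HH/ key prefix for the hour containing `moment`."""
    # Assumption: a naive datetime is treated as already being in UTC.
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    # Normalise to UTC so the prefix matches the hour Firehose would write to.
    moment = moment.astimezone(timezone.utc)
    return f"{base_prefix}{moment:%Y/%m/%d/%H}/"


print(hourly_partition_prefix(datetime(2021, 3, 5, 9, 30, tzinfo=timezone.utc), "events/"))
# prints "events/2021/03/05/09/"
```

A prefix built this way identifies the single hour of data a query needs to touch, rather than the whole bucket.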
Associate Technical Architect at Presidio Cloud Solutions