Think twice before redesigning your Airflow data pipelines

Photo by Craig Adderley from Pexels

The TaskFlow API is a feature that promises data sharing and a simple interface for building data pipelines in Apache Airflow 2.0. It should let end users write plain Python code rather than Airflow-specific code. Alongside TaskFlow, the TaskGroup feature lets you visually group the components of your data pipeline. After reviewing those features, I wasn’t sure whether to count them among the strengths or the weaknesses of the new Airflow release. …


“Everything fails all the time.” — Werner Vogels

Photo by Gnist Design from Pexels | Branded content disclosure

Data is the new oil. We rely on it not only to make decisions but also to operate as a business in general. Data loss can lead to significant financial consequences and loss of reputation.

In this article, you will find ten actionable methods to protect your most valuable resources.

1. Backup, Backup, Backup

This goes without saying, and we all know it: we need a backup strategy and an automated way of taking periodic snapshots of our databases.

However, with today’s large amounts of data, implementing a reliable backup plan that can quickly recover your databases becomes challenging. Therefore, it…


Sharing tricks that helped me in data engineering with Python

Source: https://i.redd.it/gpp8gmh0on861.jpg | Branded content disclosure

Data engineering is a fascinating field. We are dealing with a variety of tools, databases, data sources in different forms and shapes, and ETL jobs processing vast amounts of data every day. Due to the diversity of tasks and technologies, it pays off to know some useful tricks to make you more productive with respect to data processing and code deployments. In this article, we’ll look at three tricks that will make your Python projects more efficient.

1. Using a temporary directory

When reading data from flat files, many data engineers use libraries such as pathlib and shutil to create directories and remove them at…
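The teaser breaks off before naming the trick itself, so what follows is only a sketch. It assumes the alternative to manual pathlib and shutil cleanup is Python's standard-library tempfile module. The function name process_in_temp_dir and the file name extract.csv are hypothetical, chosen purely for illustration.

```python
import tempfile
from pathlib import Path


def process_in_temp_dir(rows):
    """Write rows to a scratch file in a temporary directory and read them back.

    Hypothetical helper for illustration; returns the lines that were read
    and the path of the scratch file (which no longer exists on return).
    """
    # TemporaryDirectory creates the directory and removes it, including
    # its contents, when the with-block exits, even if an error is raised.
    with tempfile.TemporaryDirectory() as tmp:
        scratch = Path(tmp) / "extract.csv"  # assumed file name
        scratch.write_text("\n".join(rows), encoding="utf-8")
        lines = scratch.read_text(encoding="utf-8").splitlines()
    return lines, scratch
```

Because the context manager handles cleanup, there is no separate step to create or delete the directory, which is the part that is easy to forget when doing it by hand.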


A platform-agnostic way of accessing credentials in Python

Photo by Kat Jayne from Pexels | Branded content disclosure

Even though AWS enables fine-grained access control via IAM roles, our scripts sometimes need credentials for external resources unrelated to AWS, such as API keys, database credentials, or passwords of any kind. There are myriad ways of handling such sensitive data. In this article, I’ll show you an incredibly simple and effective way to manage it using AWS and Python.

Table of contents:

· Different ways of managing credentials
· Describing the use case
· Implementation: PoC showing this method
  ∘ Create the API Key
  ∘ AWS Secrets Manager
  ∘ Retrieve the…


The easiest way to orchestrate your Python data pipelines

Photo by Burst from Pexels

Even though there are so many workflow orchestration solutions and cloud services for building data workloads, it’s hard to find one which is actually pleasant to use and allows you to get started quickly. One of my favorite tools for building data pipelines in Python is Prefect — a workflow management platform with a hybrid agent-based execution model.

What does a hybrid execution model entail? It means that even if you use the cloud orchestration platform (Prefect Cloud), you still own and manage your agents. In fact, Prefect has no direct access to your code or data. Instead, it only…


Serverless provides benefits far beyond the ease of management

Photo by Pixabay from Pexels | Branded content disclosure

It’s hard to determine what can be considered a “good” or “bad” engineering practice. We often hear about best practices, but everything really boils down to a specific use case. Therefore, I deliberately chose the word “useful” rather than “good” in the title.

Modern DevOps culture introduced several paradigms that are useful regardless of the circumstances: building infrastructure in a declarative and repeatable way, leveraging automation to facilitate seamless IT operations, and developing in an agile way to keep improving our end results over time. …


Is your monitoring system observable?

Photo by Scott Webb from Pexels | Content disclosure

Observability has gained a lot of popularity in recent years. Modern DevOps paradigms encourage building robust applications by incorporating automation, Infrastructure as Code, and agile development. To assess the health and “robustness” of IT systems, engineering teams typically use logs, metrics, and traces, which are used by various developer tools to facilitate observability. But what is observability exactly, and how does it differ from monitoring?

Wikipedia’s definition of observability

“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” — Wikipedia

An observable system allows us to assess how the system works without…


They view software architectures from a single angle. And that’s dangerous

person in a pose that suggests they’re considering alternatives
Photo by Afif Kusuma on Unsplash

Recently, I’ve seen a video by a really great developer and YouTuber. Its title is “Serverless Doesn’t Make Sense.” Even though I really enjoyed the video, I am not sure whether the author’s points about serverless are entirely valid, and I want to discuss them in this article.

In the introduction, the author made a joke: “There are two things in this world I don’t understand — girls and serverless.”

I don’t know about his relationship with girls, but is he right when it comes to serverless? Let’s have a look at his criticism and discuss potential counterarguments. …


What changed in the new release: Airflow 2.0

Photo by Alexas Fotos from Pexels

Several months ago, I wrote an article discussing the pros and cons of Apache Airflow as a workflow management platform for ETL and data science. Following the recent major upgrade, I want to give an update on what has changed since then in the brand-new Airflow 2.0 release. To get the full picture, you may want to have a look at the previous article first:

Table of contents

· The strengths of Airflow 2.0 as opposed to the previous versions
  ∘ The new UI looks fresh and modern
  ∘ The scheduler is no longer a bottleneck
  ∘ Airflow finally has…


Share your Python code with others via Dockerhub

Photo by JÉSHOOTS from Pexels

The easiest way to package your code for production is to use a container image. Dockerhub is like GitHub for container images: you can upload and share an unlimited number of public dockerized applications at no cost. In this article, we’ll build a simple image and push it to Dockerhub.

1. Sign up for a free Dockerhub account

Anna Anisienia

Data Engineer, M.Sc. in BI, AWS Certified Solution Architect, HIIT, cloud & tech enthusiast living in Berlin. www.annageller.com
