The TaskFlow API promises data-sharing functionality and a simple interface for building data pipelines in Apache Airflow 2.0. It should allow end users to write Python code rather than Airflow code. Apart from TaskFlow, there is the TaskGroup functionality, which allows visual grouping of your data pipeline’s components. After reviewing those features, I wasn’t sure whether I should include them in the strengths or weaknesses of the new Airflow release. …
Data is the new oil. We rely on it not only to make decisions but also to operate as a business in general. Data loss can lead to significant financial consequences and loss of reputation.
In this article, you will find ten actionable methods to protect your most valuable resources.
This goes without saying, and we all know it. We need to have a backup strategy and an automated way of taking regular snapshots of our databases.
However, with today’s large amounts of data, implementing a reliable backup plan that can quickly recover your databases becomes challenging. Therefore, it…
Data engineering is a fascinating field. We are dealing with a variety of tools, databases, data sources in different forms and shapes, and ETL jobs processing vast amounts of data every day. Due to the diversity of tasks and technologies, it pays off to know some useful tricks to make you more productive with respect to data processing and code deployments. In this article, we’ll look at three tricks that will make your Python projects more efficient.
When reading data from flat files, many data engineers use libraries such as
shutil to create directories and remove them at…
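The teaser above mentions using shutil to create and remove directories; a minimal sketch of that pattern (directory and file names here are illustrative, not from the article):

```python
import shutil
import tempfile
from pathlib import Path

# Create a scratch directory for intermediate flat files
work_dir = Path(tempfile.mkdtemp(prefix="etl_staging_"))

# ... write intermediate files into work_dir during the job ...
(work_dir / "extract.csv").write_text("id,value\n1,42\n")

# Remove the directory and everything inside it once the job is done
shutil.rmtree(work_dir)
```

Using `tempfile.mkdtemp` avoids name collisions between concurrent job runs, and `shutil.rmtree` cleans up the whole tree in one call.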
Even though AWS enables fine-grained access control via IAM roles, sometimes in our scripts, we need to use credentials for external resources not related to AWS, such as API keys, database credentials, or passwords of any kind. There are myriad ways to handle such sensitive data. In this article, I’ll show you an incredibly simple and effective way to manage that using AWS and Python.
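The article’s exact approach is truncated here, but one common AWS-native pattern for this problem is storing secrets as encrypted parameters in AWS Systems Manager Parameter Store and reading them with boto3. A minimal sketch under that assumption (the parameter name is a made-up example):

```python
def get_secret(name: str) -> str:
    """Fetch a decrypted SecureString parameter from AWS SSM Parameter Store."""
    # boto3 is imported lazily so this sketch can be loaded without the AWS SDK
    import boto3

    ssm = boto3.client("ssm")
    response = ssm.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]

# Example usage (requires AWS credentials and an existing parameter):
# api_key = get_secret("/myapp/prod/api-key")
```

The secret never lives in the script or repository; access is governed by the IAM role under which the script runs.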
Even though there are so many workflow orchestration solutions and cloud services for building data workloads, it’s hard to find one which is actually pleasant to use and allows you to get started quickly. One of my favorite tools for building data pipelines in Python is Prefect — a workflow management platform with a hybrid agent-based execution model.
What does a hybrid execution model entail? It means that even if you use the cloud orchestration platform (Prefect Cloud), you still own and manage your agents. In fact, Prefect has no direct access to your code or data. Instead, it only…
It’s hard to determine what can be considered a “good” or “bad” engineering practice. We often hear about best practices, but everything really boils down to a specific use case. Therefore, I deliberately chose the word “useful” rather than “good” in the title.
The modern DevOps culture introduced several paradigms that are useful regardless of the circumstances: building infrastructure in a declarative and repeatable way, leveraging automation to facilitate seamless IT operations, and developing in an agile way to keep improving our end results over time. …
Observability has gained a lot of popularity in recent years. Modern DevOps paradigms encourage building robust applications by incorporating automation, Infrastructure as Code, and agile development. To assess the health and “robustness” of IT systems, engineering teams typically use logs, metrics, and traces, which are used by various developer tools to facilitate observability. But what is observability exactly, and how does it differ from monitoring?
“Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs.” — Wikipedia
An observable system allows us to assess how the system works without…
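The teaser names logs, metrics, and traces as the raw material of observability. As a hedged, library-agnostic illustration, one simple way to make internal state inferable from external outputs is emitting structured log events that aggregators can index (the event and field names below are illustrative):

```python
import json
import logging
import sys
import time

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")

def log_event(event: str, **fields) -> None:
    """Emit one JSON-formatted log line that log aggregators can parse and index."""
    payload = {"event": event, "timestamp": time.time(), **fields}
    logger.info(json.dumps(payload))

# Each task reports what it did, so external tools can infer the system's state
log_event("task_finished", task="extract_orders", duration_s=1.42, rows=10_000)
```

Because every line is machine-parseable JSON rather than free-form text, downstream tools can turn these events into metrics and dashboards without changes to the application.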
Recently, I’ve seen a video by a really great developer and YouTuber. Its title is “Serverless Doesn’t Make Sense.” Even though I really enjoyed the video, I am not sure whether the author’s points about serverless are entirely valid, and I want to discuss them in this article.
In the introduction, the author made a joke: “There are two things in this world I don’t understand — girls and serverless.”
I don’t know about his relationship with girls, but is he right when it comes to serverless? Let’s have a look at his criticism and discuss potential counterarguments. …
Several months ago, I wrote an article discussing the pros and cons of Apache Airflow as a workflow management platform for ETL and data science. Due to the recent major upgrade, I want to give an update on what has changed since then in the brand-new Airflow 2.0 version. To get a full picture, you may want to have a look at the previous article first:
The easiest way to package your code for production is by using a container image. Docker Hub is like GitHub for container images: you can upload and share with others an unlimited number of publicly available dockerized applications at no cost. In this article, we’ll build a simple image and push it to Docker Hub.
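The build-and-push flow the article describes can be sketched as follows; the image name and username are placeholders, and the Dockerfile is a minimal made-up example, not the one from the article:

```shell
# A minimal Dockerfile for a single Python script might look like:
#
#   FROM python:3.9-slim
#   COPY app.py /app/app.py
#   CMD ["python", "/app/app.py"]

# Build the image and tag it with your Docker Hub username
docker build -t <username>/simple-app:latest .

# Authenticate, then push the image to Docker Hub
docker login
docker push <username>/simple-app:latest
```

Once pushed, anyone can run the application with `docker run <username>/simple-app:latest`, with no language runtime or dependency setup on their side.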