Photo by Chris Ried on Unsplash

If you are wondering how to start working with Apache Airflow for small projects or academic purposes, here you will learn how. Deploying Airflow on GCP Compute Engine (a self-managed deployment) can cost less than you think, while keeping all the advantages of GCP services like BigQuery or Dataflow.

This scenario assumes you need a stable Airflow instance for a proof of concept or as a learning environment.

Why not Cloud Composer? Cloud Composer is a fully managed workflow orchestration service built on Apache Airflow. I suggest going with this if you or your team require a full production or…


Illustrative photo by JJ Ying on Unsplash

In part one of this series, we discussed the background and reasons for creating dynamic sourcing pipelines — including an overview of the pipeline components. In this part, we will describe the architecture and metadata in more detail. Let’s get started!

Architectural Overview

Our extract, transform, and load (ETL) system is written in Python, which is well known by GeoPhy’s engineers. Airflow is also built in Python, which makes it easy for us to extend. For a similar reason, we chose PostgreSQL as the database technology. Postgres also shines for geospatial data, our most common type of data.

A Python/Postgres stack may be relatively…


Contributing to an open-source project with more than twenty thousand GitHub stars, like Apache Airflow, can be intimidating. But you can contribute much more than code, and there are steps you can take to build confidence before contributing to an open-source code base.

Photo by Jasmin Sessler on Unsplash

Non-code Contributions

Perhaps surprisingly, there’s a lot more to contribute to Airflow (or any open source project) than code. Naturally, there’s a lot involved in such a significant project: a vibrant community, a blog, and a repository with hundreds of thousands of lines of code. A very large and very important chunk of that code isn’t code at all…


Illustrative image for generic pipelines

Introduction

Airflow is a great tool with endless possibilities for building and scheduling workflows. At GeoPhy we use it to build pipelines dynamically, combining generic and specific components. Our system maximizes reusability and maintainability by creating each of these pipelines from the same code, while keeping flexibility over the specific components where you need it.
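Later parts of this series cover the actual framework; as a rough sketch of what dynamic pipeline creation can look like in Airflow (an illustrative pattern, not GeoPhy's real code), one DAG can be generated per configured source:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Hypothetical, metadata-driven list of sources (for illustration only).
    SOURCES = ["source_a", "source_b"]

    def make_dag(source: str) -> DAG:
        dag = DAG(
            dag_id=f"ingest_{source}",
            start_date=datetime(2021, 1, 1),
            schedule_interval="@daily",
            catchup=False,
        )
        with dag:
            PythonOperator(
                task_id="extract",
                python_callable=lambda: print(f"extracting {source}"),
            )
        return dag

    # Expose one DAG per source at module level so the Airflow scheduler discovers them.
    for source in SOURCES:
        globals()[f"ingest_{source}"] = make_dag(source)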

This article is the first of a three-part series describing how GeoPhy uses Airflow to create dynamic sourcing pipelines. In part two we’ll go deep into the metadata framework that we developed. …


Illustration by Alejandra Ramos

Apache Airflow is already a commonly used tool for scheduling data pipelines. But the upcoming Airflow 2.0 is going to be an even bigger deal, as it implements many new features.


Screenshot from the awesome Airflow website

Airflow 2.0 is a big release, as it implements many new features. Some of them, like the highly available scheduler and overall improvements in scheduling performance, are real game-changers. But apart from the deep, core-related features, Airflow 2.0 comes with new ways of defining Airflow DAGs. Let’s take a look at what’s been improved!

TaskFlow API (AIP-31)

The first significant change is the introduction of the TaskFlow API. It consists of three features:

  • XComArg, a layer over the already existing XCom that simplifies accessing and passing information between tasks,
  • The @task decorator, which makes using PythonOperator smooth and easy (see the sketch after this list),
  • Pluggable XCom backends, which can…
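As a rough illustration of the first two features, here is a minimal TaskFlow-style DAG sketch, assuming Airflow 2.0; the DAG id, task names, and values are illustrative and not taken from this article.

    from airflow.decorators import dag, task
    from airflow.utils.dates import days_ago

    # Minimal, illustrative TaskFlow DAG (names are hypothetical).
    @dag(schedule_interval=None, start_date=days_ago(1), catchup=False)
    def example_taskflow():

        @task
        def extract():
            # The return value is pushed to XCom automatically.
            return {"a": 1, "b": 2}

        @task
        def total(data: dict):
            # XComArg wiring: extract()'s return value is passed in here.
            return sum(data.values())

        total(extract())

    example_dag = example_taskflow()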


This blog post was originally published at https://www.polidea.com/blog/

Is it possible to create an organization that delivers dozens of projects used by millions, where nearly no one is paid for their work, and that has still been running fruitfully for more than 20 years? The Apache Software Foundation proves it is possible. For the last two decades, the ASF has been crafting a model called the Apache Way: a way of organizing and leading open-source tech projects. …


This blog post was originally published at https://www.polidea.com/blog/

This is a tale about a modern approach to developer productivity. Improving productivity has been a recurring theme in most of the roles and places I have worked.

My journey to developer productivity

It started at the beginning of my career with my older and more experienced colleague, one of the first “real” software engineers I had interacted with up to that point. I learned a lot from him. …


Graphics by Milka Toikkanen and Alejandra Ramos

This blog post was originally published at https://www.polidea.com/blog/

Airflow 2.0 is a huge change in the workflow management ecosystem. There are so many new things in Airflow 2.0 that it’s hard to keep up. However, one topic is very dear to my heart: the project I was driving in the Airflow team for nearly a year. Let’s talk about Airflow Providers.

What are Airflow Providers?

Airflow 1.10 has a very monolithic approach. It contains the core scheduling system and all the integrations with external services: hooks, operators, and sensors. Everything but the kitchen sink was thrown into a single “apache-airflow” package, no matter if…
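In Airflow 2.0, by contrast, those integrations ship as separate provider packages. As a rough sketch (using the HTTP provider as an assumed example, not one taken from this article), you install the provider on its own, e.g. pip install apache-airflow-providers-http, and import the operator from its provider namespace:

    from datetime import datetime
    from airflow import DAG
    # Airflow 1.10 (bundled in the monolithic "apache-airflow" package):
    #   from airflow.operators.http_operator import SimpleHttpOperator
    # Airflow 2.0 (shipped in the separate apache-airflow-providers-http package):
    from airflow.providers.http.operators.http import SimpleHttpOperator

    with DAG("provider_example", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
        fetch = SimpleHttpOperator(
            task_id="fetch_data",          # illustrative task id
            http_conn_id="http_default",   # default HTTP connection
            endpoint="api/v1/resource",    # hypothetical endpoint
            method="GET",
        )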

Apache Airflow

Home of tutorials, use cases, and anything related to Apache Airflow.
