Airflow as a Near Real-Time Scheduler
Apache Airflow is a popular open-source workflow orchestration and scheduling tool. While primarily intended for task scheduling and workflow management, Airflow does not provide real-time scheduling out of the box.
Airflow defines and manages workflows using directed acyclic graphs (DAGs). It lets you schedule tasks at time intervals such as hourly, daily, or weekly. Tasks can depend on one another and can be triggered by specific conditions or events.
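To make this concrete, here is a minimal sketch of a DAG definition file for the kind of pipeline discussed below, scheduled every 30 minutes. The DAG id, task ids, and the echoed commands are hypothetical placeholders, not a real implementation.

```python
# Minimal sketch of an Airflow 2.x DAG file, scheduled every 30 minutes.
# All names and commands here are placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="ods_to_facts_refresh",          # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval=timedelta(minutes=30), # meets the 30-minute SLA
    catchup=False,                           # don't backfill missed runs
) as dag:
    load_dims = BashOperator(
        task_id="load_dimensions",
        bash_command="echo 'merge dimension tables'",
    )
    load_facts = BashOperator(
        task_id="load_facts",
        bash_command="echo 'merge fact tables'",
    )
    load_dims >> load_facts  # fact loads run only after dimensions finish
```

The `>>` operator expresses the dependency: the fact merge will not start until the dimension merge has succeeded, which is exactly the ODS → dimensions → facts ordering the use case below requires.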
Use case — Assume you are working on a requirement where the ODS layer receives data from the source through Kafka. There are around 35 dimension tables loaded from approximately 300 tables in ODS, and approximately 15 fact tables loaded from those dimensions. You must devise a process to ensure that whenever anything changes in ODS, those changes are reflected in the fact tables within 30 minutes.
Ideal Solution — Based on the above use case, you might explore the following approach: create a dynamic Airflow DAG for the dimensions. Produce 35 SQL files, one per dimension table, and pass the 35 dimensions into the DAG; it checks for changes in the underlying database, picks the appropriate update/merge statement from the matching SQL file, and executes it. The same procedure can be used to load the facts, and finally all DAGs are scheduled at 30-minute intervals.
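The dynamic-generation idea can be sketched in plain Python, without an Airflow dependency, as a loop that maps each dimension table to its SQL file and produces one task spec per table. The table names and the `sql/dims/` path are hypothetical; inside a real DAG file, the same loop would create one operator per table.

```python
# Sketch: generate one task spec per dimension table from a naming
# convention. Table names and the sql/dims/ path are hypothetical.

DIMENSION_TABLES = ["dim_customer", "dim_product", "dim_store"]  # ... up to 35

def sql_file_for(table: str) -> str:
    """Each dimension table gets its own update/merge SQL file."""
    return f"sql/dims/{table}.sql"

def build_task_specs(tables):
    """Return one task spec per table. In an Airflow DAG file, this
    loop would instead instantiate one operator per table."""
    return [
        {"task_id": f"merge_{table}", "sql": sql_file_for(table)}
        for table in tables
    ]

specs = build_task_specs(DIMENSION_TABLES)
print(specs[0])  # {'task_id': 'merge_dim_customer', 'sql': 'sql/dims/dim_customer.sql'}
```

With 35 real table names in the list, the same three lines of loop generate all 35 merge tasks, which is what keeps the DAG maintainable compared with hand-writing each task.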
However, Airflow’s default scheduler runs jobs on a periodic basis rather than in real time. The scheduler examines the task schedules and initiates task execution as needed. The executor is in charge of actually running tasks, and it can be configured to use various execution backends such as local, distributed, or cloud-based.
The right approach depends on your architecture. That said, you should consider an event-driven architecture such as the following:
- If you’re using AWS, you could stream data from Kafka to S3, transform it with a Lambda function, and then bulk-load it into your destination. Other designs can work as well.
- Because Python isn’t extremely fast, Airflow is relatively slow and computationally costly. It works well for batch operations and is tolerable for micro-batching (every 15 minutes), but it quickly gets expensive.
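The AWS path in the first bullet can be sketched as a minimal Lambda handler. The bucket key, record shape, and the transform itself are hypothetical, and the S3/destination I/O is stubbed out so the control flow stays visible; real code would use boto3 to fetch the object and write to the target store.

```python
# Sketch of an S3-triggered Lambda handler. The event shape follows
# the S3 notification format; payload, keys, and transform are
# hypothetical stand-ins, and real I/O would go through boto3.
import json

def transform(record: dict) -> dict:
    """Hypothetical per-record transformation."""
    record["processed"] = True
    return record

def handler(event, context=None):
    """Read the object key from the S3 event, transform the records,
    and return what would be bulk-loaded into the destination."""
    key = event["Records"][0]["s3"]["object"]["key"]
    # In a real handler: body = s3.get_object(Bucket=..., Key=key)["Body"]
    records = event.get("payload", [])  # stand-in for the S3 file contents
    return {"key": key, "records": [transform(r) for r in records]}

# Example invocation with a fake S3 notification event
fake_event = {
    "Records": [{"s3": {"object": {"key": "ods/orders/part-0001.json"}}}],
    "payload": [{"order_id": 1}],
}
print(json.dumps(handler(fake_event)))
```

Note the latency caveat raised below still applies: the clock starts when the record lands in Kafka, not when this handler fires.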
Airflow is not well suited for use as a real-time scheduler. You can probably make it work, but it will be costly. There are two methods to save money:
→ Fast scheduling, slow processing: event-based file batching, as suggested above. Keep an eye on the latency from a record arriving in Kafka to the file being published on S3, plus processing time through the Lambda; together these can add up to more total delay than allowed.
→ Slow scheduling, fast processing: to save money, a 15-minute DAG launches a low-latency streaming job that processes the newly ingested data and then shuts itself down.
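The second pattern, drain-and-stop processing kicked off by a periodic DAG, can be sketched as a consumer that reads until the backlog is empty and then exits. The in-memory deque below is a stand-in for Kafka, and the function names are hypothetical.

```python
# Sketch of the drain-and-stop pattern: a periodic DAG launches this
# job; it exits as soon as the backlog is empty, so compute is paid
# for only while there is data to process. The deque stands in for
# a Kafka topic.
from collections import deque

def drain_and_stop(backlog: deque, process) -> int:
    """Consume everything currently in the backlog, then shut down."""
    handled = 0
    while backlog:          # stop condition: no unprocessed data left
        record = backlog.popleft()
        process(record)     # fast, low-latency handling of one record
        handled += 1
    return handled

# Example run: three records arrived since the previous DAG run
arrived = deque([{"id": 1}, {"id": 2}, {"id": 3}])
out = []
print(drain_and_stop(arrived, out.append))  # prints 3; backlog is now empty
```

The cost saving comes from the exit condition: unlike an always-on stream processor, this job holds resources only for the few seconds (or minutes) it takes to clear the backlog each 15-minute cycle.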
Always-on stream processing is the most costly low-latency alternative.
If you need near real-time scheduling, where tasks must run instantly or almost instantly, Airflow may not be the best option. For such cases, other technologies, such as event-driven architectures or streaming systems like Apache Kafka or Apache Flink, are better suited.
Tips — Airflow is well suited for frequent schedule-based operations, right down to running every minute, so your 30-minute interval poses no technical problem. Before picking a technical solution to match the requirement, pinning down the definition of “near real time” is critical. There is often a misunderstanding about what near real-time or real-time genuinely means for the use case, resulting in poor or inefficient solutions.
Misconceptions — People use the phrase “real time” without providing any context, and they make the error of assuming that “real time” is synonymous with “streaming.”
- Real time can refer to a variety of things, ranging from true streaming, as in fraud detection, where even the smallest lag in data processing has a significant impact, to data that is refreshed once a day if that is all the source provides.
- If you take “real time” at face value and do not dig into the genuine business requirement, you may end up designing a system with substantially more complexity (and therefore a suboptimal design) than is required, when a simple batch job run at agreed-upon intervals (an SLA) would suffice.
Having said that, Airflow can still be valuable for managing complex workflows and dependencies, even if it lacks true real-time capabilities. It is particularly good at orchestrating and scheduling tasks across multiple systems, handling retries, monitoring, and providing visibility into workflow execution.