Building a robust data pipeline for processing satellite imagery at scale

Akarsh Saxena · Published in Fasal Engineering · Dec 17, 2021

What is a Data Pipeline?

A data pipeline is a series of data processing steps. If the data is not currently loaded into the data platform, then it is ingested at the beginning of the pipeline. Then there are a series of steps where each step delivers an output that is the input to the next step. This continues until the pipeline is complete. In some cases, independent steps may run in parallel.

Overview of a sample data pipeline

A data pipeline is an organized form of workflow that eliminates many manual steps from the process and enables a smooth, automated flow of data from one station to the next. It starts by defining what, where, and how data is collected, and it automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. It provides end-to-end velocity by eliminating errors and combating bottlenecks and latency, and it can process multiple data streams at once. In short, it is an absolute necessity for today’s data-driven enterprise.

A data pipeline treats all data as streaming data and allows for flexible schemas. Regardless of whether it comes from static sources (like a flat-file database) or from real-time sources (such as online retail transactions), the pipeline divides each data stream into smaller chunks that it processes in parallel, which adds computing power.

Since data pipelines are a series of automated steps, once those steps become heavy enough it is no longer practical to keep track of them with cron jobs or spreadsheets. For example, suppose you have plenty of logs stored on S3, and you want to periodically take that data, extract and aggregate meaningful information, and then store it in an analytics DB (e.g., Redshift). Usually this kind of task is first performed manually; then, as things need to scale up, the process is automated and, for example, triggered with cron. Ultimately, you reach a point where good old cron can no longer guarantee stable and robust performance. It is simply not enough anymore. That is when you need workflow management software (WMS) to automate the whole process.

What is WMS?

Workflows are building blocks of business processes. A workflow is a series of tasks or actions that are performed in a sequential manner to achieve an end goal. For example, the procurement process is made up of workflows for purchase requisition, purchase order, and invoice processing. Each of these individual workflows is made up of several tasks like creation, review, approval, and routing to the following process.

Workflow management software (WMS) is all about managing and streamlining workflows for optimal output: cutting out redundant tasks, ensuring resources are available for every task, and structuring and streamlining task sequences. It makes it easier to supervise and manage the tasks in your data pipelines.

Fasal’s Data Pipeline for Satellite Imagery Processing

Step 1: Getting polygon coordinates of the field

Step 2: Pre-processing the shape of the plot using boundary coordinates

Step 3: Data capturing (Sentinel-2A data for processing)

Step 4: Data processing and generating NDVI, SAVI, SIPI, and EVI images (a sketch of this step follows the list)

Step 5: Clipping the shapefile of geotagged fields

Step 6: Generating a colour-coded image and improving image quality
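
To make Step 4 concrete, here is a minimal sketch of how the vegetation indices could be computed from Sentinel-2 bands with rasterio and NumPy. The band file paths and the SAVI soil factor are assumptions for illustration, not Fasal’s actual implementation.

```python
import numpy as np
import rasterio

# Hypothetical paths to Sentinel-2 band rasters; adjust to your data layout.
RED_PATH = "B04.tif"   # red
NIR_PATH = "B08.tif"   # near-infrared
BLUE_PATH = "B02.tif"  # blue

def read_band(path):
    """Read the first band of a raster as a float array."""
    with rasterio.open(path) as src:
        return src.read(1).astype("float32")

red, nir, blue = read_band(RED_PATH), read_band(NIR_PATH), read_band(BLUE_PATH)
eps = 1e-6  # avoid division by zero on dark or masked pixels

# Standard vegetation index formulas.
ndvi = (nir - red) / (nir + red + eps)
savi = 1.5 * (nir - red) / (nir + red + 0.5)                      # soil factor L = 0.5 (assumed)
evi = 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1 + eps)
sipi = (nir - blue) / (nir - red + eps)

# The results can then be written back as single-band GeoTIFFs for colour coding.
with rasterio.open(RED_PATH) as src:
    profile = src.profile
profile.update(dtype="float32", count=1)
with rasterio.open("ndvi.tif", "w", **profile) as dst:
    dst.write(ndvi, 1)
```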

Initial Architecture (Serverless)

Drawbacks and Challenges of the above architecture

  1. Monitoring and debugging were a big challenge with the serverless-based data pipelines, especially for historical executions.
  2. Different tools with different specifications and use cases were stitched together to build the pipeline:
    - Scheduler: a Meteor cron server to schedule the jobs.
    - Storage: S3 for permanent storage and EFS for temporary storage.
    - Executor: AWS Lambda.
    - Assembly: AWS SQS to chain the jobs together.
  3. Challenges with AWS Lambda for executing Python programs:
    - The maximum execution time cannot exceed 15 minutes.
    - EFS has to be attached separately to store local data.
    - Schedulers were built separately to trigger these functions, creating a disconnect between scheduling and execution.
  4. Challenges with AWS SQS for assembling Python programs:
    - No knowledge of whether the previous job succeeded or failed.
    - No way to define data dependencies between jobs or on their execution status.
    - Lambda functions were triggered based on the queue line-up, so the order of the assembly could vary and parallel execution of jobs became difficult.

How Fasal overcame these challenges and built a robust data pipeline on Airflow

A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. It can help you to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.

In Airflow, a workflow is defined as a collection of tasks with directional dependencies, basically a directed acyclic graph (DAG), where each node in the graph is a task and edges define dependencies among the tasks. Tasks belong to two categories (a minimal example follows the list below):

a. Operators: they execute some operation

b. Sensors: they check for the state of a process or a data structure
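
As an illustration (not Fasal’s actual code), a minimal DAG with one sensor and one operator could look like the following; the file path and task names are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def process_scene():
    print("processing satellite scene")  # placeholder for real work

with DAG(dag_id="sensor_and_operator_demo",
         start_date=datetime(2021, 12, 1),
         schedule_interval=None) as dag:
    # Sensor: waits for an external condition, here a file landing on disk (path is hypothetical).
    wait_for_scene = FileSensor(
        task_id="wait_for_scene",
        filepath="/data/incoming/scene.tif",
        poke_interval=300,
    )

    # Operator: performs an actual operation once the sensor succeeds.
    process = PythonOperator(task_id="process_scene", python_callable=process_scene)

    wait_for_scene >> process
```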

Real-life workflows can go from just one task per workflow (you don’t always have to be fancy) to very complicated DAGs, almost impossible to visualize. The main components of Airflow are:

a Metadata Database

a Scheduler

an Executor

Airflow architecture and key features

  1. The metadata database stores the state of tasks and workflows.
  2. The scheduler uses the DAGs definitions, together with the state of tasks in the metadata database, and decides what needs to be executed.
  3. The executor is a message queuing process (usually Celery) which decides which worker will execute each task. With the Celery executor, it is possible to manage the distributed execution of tasks. An alternative is to run the scheduler and executor on the same machine. In that case, the parallelism will be managed using multiple processes.
  4. Airflow also provides a very powerful UI. The user is able to monitor DAGs and tasks execution and directly interact with them through a web UI.

It is common to read that Airflow follows a “set it and forget it” approach, but what does that mean? It means that once a DAG is set, the scheduler will automatically schedule it to run according to the specified scheduling interval.
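
For example (an illustrative sketch, not Fasal’s configuration), giving a DAG a cron-style schedule is enough for the scheduler to pick it up and run it on every interval:

```python
from datetime import datetime

from airflow import DAG

# Once this DAG file is deployed, the scheduler runs it every day at 06:00 UTC
# without any further manual triggering ("set it and forget it").
dag = DAG(
    dag_id="daily_satellite_refresh",   # hypothetical name
    start_date=datetime(2021, 12, 1),
    schedule_interval="0 6 * * *",      # cron expression: daily at 06:00
    catchup=False,                      # skip backfilling missed intervals
)
```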

Fasal’s Airflow Infrastructure setup

Cloud Composer helps you create Airflow environments quickly and easily, so you can focus on your workflows and not on your infrastructure. Otherwise, you will need to spend a lot of time on DevOps work: creating a new server, managing the Airflow installation, taking care of dependency and package management, making sure your server is always up and running, and then dealing with scaling and security issues.

If you don’t want to deal with all of those DevOps problems and instead just want to focus on your workflow, then Google Cloud Composer is a great solution. The nice thing about Cloud Composer is that, as a data engineer or data scientist, you don’t have to spend much time on DevOps. You just focus on your workflows (writing code) and let Composer manage the infrastructure. Of course, you have to pay for the hosting service, but the cost is low compared to hosting a production Airflow server on your own. This is an ideal solution if you are a startup that needs Airflow and does not have a lot of DevOps folks in-house.

Fasal’s Airflow DAG structure
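
The original post shows the DAG as a diagram. As a rough sketch of how the six steps above could be wired together in Airflow (task names and callables here are hypothetical, not Fasal’s actual code):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real processing steps.
def fetch_polygon():         ...
def preprocess_shape():      ...
def capture_sentinel2():     ...
def compute_indices():       ...
def clip_to_field():         ...
def render_colour_coded():   ...

with DAG(
    dag_id="satellite_imagery_pipeline",   # hypothetical name
    start_date=datetime(2021, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    steps = [
        PythonOperator(task_id="get_polygon_coordinates", python_callable=fetch_polygon),
        PythonOperator(task_id="preprocess_plot_shape", python_callable=preprocess_shape),
        PythonOperator(task_id="capture_sentinel_2a_data", python_callable=capture_sentinel2),
        PythonOperator(task_id="generate_vegetation_indices", python_callable=compute_indices),
        PythonOperator(task_id="clip_field_shapefile", python_callable=clip_to_field),
        PythonOperator(task_id="generate_colour_coded_image", python_callable=render_colour_coded),
    ]

    # Chain the steps so each task runs only after the previous one succeeds.
    for upstream, downstream in zip(steps, steps[1:]):
        upstream >> downstream
```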

Conclusion

Implementing the satellite imagery data pipeline with the help of the Airflow workflow manager has helped Fasal build a robust processing pipeline that can scale on demand. Airflow helps us manage large data flows and long-running processing units, and it gives us a way to set up programmatic workflows. Tasks, for instance, can be generated on the fly within a DAG, while SubDAGs and XComs allow creating complex dynamic workflows. Dynamic DAGs can, for instance, be set up based on variables or connections defined within the Airflow UI (a small sketch follows).
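
As a small illustration of that dynamic behaviour (the Variable name and task logic are assumptions, not Fasal’s code), tasks can be generated in a loop from an Airflow Variable:

```python
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python import PythonOperator

# Hypothetical Variable holding a JSON list of field IDs, e.g. ["field_a", "field_b"].
field_ids = Variable.get("field_ids", deserialize_json=True, default_var=["field_a", "field_b"])

def process_field(field_id):
    print(f"processing {field_id}")  # placeholder for per-field processing

with DAG(dag_id="per_field_processing",
         start_date=datetime(2021, 12, 1),
         schedule_interval="@daily",
         catchup=False) as dag:
    for field_id in field_ids:
        # One processing task is generated on the fly for every field in the Variable.
        PythonOperator(
            task_id=f"process_{field_id}",
            python_callable=process_field,
            op_kwargs={"field_id": field_id},
        )
```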

With this implementation, Fasal is now able to push its limits in research and development. Collecting more and more data will be very useful for future research focused on remote sensing for crops. These data will also help bring more accuracy and precision to our advisories and predictions. The scope of satellite data is huge, and one can say it is the future of farming, much as was once said of weather forecasting. With high-precision prediction and detection, farming will become cost-effective and highly productive. And if the popularity of satellite data keeps growing like this, sooner or later we will have live coverage of fields on every farm with a precise crop health assessment.
