Spark Structured Streaming as a Batch Job? File-based data ingestion benefits from pseudo streaming?

Using the Trigger Once functionality in Structured Streaming

Karthikeyan Siva Baskaran
3 min read · Jul 14, 2020

In this blog, we are going to see the importance of the Trigger Once option in Structured Streaming and how we can use this trick in real-time, file-based data ingestion workloads.

Scenario

Every day, an upstream application sends data as CSV files to the allocated staging directory. This data needs to be ingested into Delta Lake. However, the files can arrive at any time, so we need a Spark batch or streaming job to process the data whenever it lands in the staging directory.

Practical Demo

Spark Streaming Demo

Batch Job complexities

The following needs to be taken care of on the developer's side.

Bookkeeping is one of the major drawbacks of a batch job. We need to track which files have already been processed, which have failed, and which are still running.

  • In case of failure, that particular file needs to be re-processed in the next batch.
  • In case of a running state, the next batch job should not pick up the same file and process it twice. This can happen when a file is huge and is still being processed when your job gets triggered the next time.
  • In case of a processed file, it needs to be marked as completed so that it won't get processed again.
  • All of this information needs to be maintained in some persistent storage, for example MySQL (a rough sketch of this plumbing follows below).
File Ingestion using Batch Job
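
As a rough illustration of how much plumbing this involves, a hand-rolled batch job might look something like the sketch below. The file_log table, its columns, the connection details, and the process_file function are illustrative assumptions, not something from the original workload.

```python
# A hand-rolled batch job has to do all of the bookkeeping itself.
import glob
import mysql.connector  # pip install mysql-connector-python

def process_file(path):
    """Placeholder for the actual Spark read/transform/write logic."""
    print(f"processing {path}")

conn = mysql.connector.connect(host="localhost", database="ingestion_meta",
                               user="etl", password="secret")
cur = conn.cursor()

for path in glob.glob("/mnt/staging/incoming/*.csv"):
    # Skip files already PROCESSED, or still RUNNING from a previous batch.
    cur.execute("SELECT status FROM file_log WHERE file_path = %s", (path,))
    row = cur.fetchone()
    if row and row[0] in ("PROCESSED", "RUNNING"):
        continue

    cur.execute("REPLACE INTO file_log (file_path, status) VALUES (%s, 'RUNNING')",
                (path,))
    conn.commit()
    try:
        process_file(path)
        status = "PROCESSED"
    except Exception:
        status = "FAILED"  # picked up again in the next batch
    cur.execute("UPDATE file_log SET status = %s WHERE file_path = %s",
                (status, path))
    conn.commit()
```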

Streaming Job to the Rescue

By default, Structured Streaming takes care of bookkeeping without any additional code. The developer focuses only on the core business logic, not on the low-level bookkeeping.

Generally, many think that a streaming job needs to run 24/7. But with the Trigger Once option, the query fires only once when the job is triggered, processes whatever is new, and then stops. The next run happens only when the job is triggered again.

This brings the streaming features into a batch job. We can call this a pseudo streaming job.

Pseudo Streaming Job

Spark stores all metadata-related information in the checkpoint directory, which can be specified in the writeStream query.

Streaming checkpoint directory

Files get committed automatically once they are processed.
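
If you want to peek at what Spark keeps in that directory, a minimal sketch is shown below; the checkpoint path is illustrative, and the comments describe the usual Structured Streaming checkpoint layout.

```python
import os

# Inspect the Structured Streaming checkpoint after a Trigger Once run.
# Typical layout created by Spark itself:
#   metadata   - the unique id of the streaming query
#   sources/0/ - log of input files the file source has already seen
#   offsets/   - offsets planned for each micro-batch
#   commits/   - one entry per successfully completed micro-batch
checkpoint_dir = "/mnt/delta/_checkpoints/csv_ingest"  # illustrative path

for root, _, files in os.walk(checkpoint_dir):
    for name in files:
        print(os.path.join(root, name))
```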

Code for Reference
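
A minimal end-to-end sketch of the pseudo streaming job is below. The staging path, schema, checkpoint location, and Delta table path are all illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-to-delta-trigger-once").getOrCreate()

# File sources need an explicit schema for streaming reads.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
    StructField("amount", DoubleType()),
])

# Pick up every new CSV file that lands in the staging directory.
raw = (spark.readStream
       .schema(schema)
       .option("header", "true")
       .csv("/mnt/staging/incoming"))

# Write to Delta Lake; the checkpoint directory does all the bookkeeping,
# and trigger(once=True) makes the query stop after one run.
query = (raw.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/delta/_checkpoints/csv_ingest")
         .trigger(once=True)
         .start("/mnt/delta/tables/transactions"))

query.awaitTermination()
```

Schedule this job with any orchestrator (cron, Airflow, Databricks Jobs, etc.); each run processes only the files that arrived since the previous run.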

Wrapping up

Using Trigger Once, you can achieve the best of both worlds for batch and streaming processing. As it is pseudo streaming, you don't need to run the streaming job 24/7. This saves cost and lets you focus more on the business logic instead of the bookkeeping.

Follow us on Medium and YouTube to get more tips like this.

If you like this blog, please share it with your teammates and friends, and clap 👏 (up to 50 times) on Medium.

Happy Pseudo Streaming !!
