Let’s Learn Together Sessions: Spring Batch

Emre Ayar
Published in Javarevisited
14 min read · Feb 21, 2021

In this article, you will learn how to add batch processing to your Spring-based applications with the help of the Spring Batch framework.

History of Batch Processing:

Batch processing is a term that has been around for many years.
Batch processing can be thought of as a set of processes in which a large volume of data is handled in portions, with each step taking a certain part of the data as input.

This type of batch process generally covers day-to-day or scheduled operations. The way batch processing is done has changed over the years, but the practice remains common even today.

The earliest batch processing goes back to the first computing systems. The first computers could do only one job at a time: a single person controlling the machine could run one job per unit of time and had to wait for the current job to finish before starting another. When the running job was finished, the operator had to trigger the machine to start the next one.

This process was repeated for every new job. The time needed to complete a single job was very high, and because every job was executed with human intervention, the process could not continue without an operator.

People who used computing systems in those days ran batch processes with punch cards (pictured below) or magnetic tapes. The invention of punch cards dates back to the 1700s. In 1890, Herman Hollerith, an American statistician, developed a machine to process punch cards and first used it in the 1890 US Census. The machine reduced the census processing time from 8 years to 6 years and was one of the earliest examples of batch processing with real machines. [1]

Although this system partially eliminated the time cost of batch operations, the need for a human in the system remained.

Image Credit: https://www.computerhope.com/jargon/p/punccard.htm :: Punch card

So, the next goal was to eliminate the human role in batch processing. Batch processing continued its evolution over the following decades: with new techniques and software systems, human intervention decreased day by day.

With the evolution of highly usable, high-performance computing systems such as mainframes, batch processing could be run with higher performance and less memory consumption.

Programming languages such as COBOL and REXX, which offered easy development and high readability, made it easier to write batch processes for mainframe systems.

Nowadays, modern functional and object-oriented languages have replaced old-fashioned procedural languages such as COBOL. The batch processing operations required for business needs are therefore now developed in these modern, highly scalable, feature-rich languages.

Today, while open-source big data frameworks such as Hadoop and Spark are used for big data processing, people prefer batch ETL tools such as Informatica for smaller datasets.

On the other hand, cloud data warehouses like Amazon Redshift and Google BigQuery are other popular options for making batch processing possible in your day-to-day business operations. [2]

In Java EE, you can also do batch processing with the help of the JSR 352 specification. Application developers can use this specification model to develop robust batch processing systems.

Modern Java web frameworks like Spring have also developed batch frameworks around this specification. Developers can now batch-process with minimal effort using the configurations, pre-implemented interfaces, and listeners provided by these frameworks. Spring describes its popular batch framework on its documentation website as follows:

A lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems.

The Spring Batch framework not only provides useful building blocks but also supports logging/tracing, transaction management, job processing statistics, job skip, restart, and resource management. We can list the main functions provided by Spring Batch as follows:

  • Transaction management
  • Chunk based processing
  • Declarative I/O
  • Start/Stop/Restart
  • Retry/Skip (see the sketch after this list)
  • Web-based administration interface
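
As a quick taste of the retry/skip feature, here is a hedged sketch (hypothetical step name; it uses the builder API and the example reader, processor, and writer beans introduced later in this article) that marks a chunk-oriented step as fault tolerant, retrying transient failures and skipping unparseable lines:

@Bean
Step faultTolerantStep() {
    return stepBuilderFactory.get("faultTolerantStep")
            .<CovidCountryDataDTO, CovidCountryData>chunk(5)
            .reader(reader())
            .processor(processor())
            .writer(fileItemWriter())
            .faultTolerant()
            .retry(TransientDataAccessException.class) // retry e.g. flaky connections
            .retryLimit(3)                             // give up after 3 attempts per item
            .skip(FlatFileParseException.class)        // skip lines that cannot be parsed
            .skipLimit(10)                             // fail the step after 10 skips
            .build();
}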

In summary, although batch processing goes back to the 1700s, its volume and uses have evolved and transformed in today’s computing world. Let’s look at where batch processing is used today in the next part of the article.

Where to use batch processing?

Batch processing has been used in different areas and sectors for years, and it is still one of the main operations in companies and even government offices today.

The batch processing technique is mostly used in ETL (extract, transform, load) processes, which extract data from several systems, store it in a data warehouse, and then move it from the warehouse to analytical platforms. For example, a cost accounting system may combine data from payroll, sales, and purchasing.

Similarly, a data visualization tool collects data from several sources such as databases, social media entries, data collector tools, and other services, stores it in data warehouses, and provides custom visualizations for its customers by applying transformations to the data.

Image Credit: https://www.spec-india.com/tech-in-200-words/what-is-etl :: ETL process

Data migration in legacy systems is also widely performed with batch processing. Report generation, billing analysis, and log analysis systems use batch processing to implement efficient, scalable solutions in a manageable and monitorable structure.

Batch processing is one of the preferred techniques for running daily operations and generating business reports from large volumes of data.

With the rise of batch processing, many data entry professionals lost their jobs to computers, while companies avoided a huge expense.

In addition to these areas of use, the transformation of processes requiring human intervention into unmanned ones will continue in the future. While this transformation will leave many people unemployed, it will also create new job opportunities in the management of computing systems.

How does it work?

To understand batch processing in the Spring Batch framework, you must know the terminology and main components of a batch system.

Batch processing of data is a process where large volumes of data are collected first, then processed in a specific way, and batch results are produced. A batch process is usually composed of tasks called jobs. Each Job describes a processing flow of steps, and each Step is composed of a reader, a processor, and a writer. In Spring Batch, the main task is called a Job. Jobs can be scheduled in time or triggered by an event.

The Step, the processing unit of a job, is one of the key components of the Spring Batch infrastructure. A job can contain one or more steps depending on the logic we define. We can define a Step in Spring Batch using either the chunk or the tasklet model.

In the chunking approach, there are 3 components in the initialization of a Step, ordered below:

  • Item Reader: reads data from a database, message queue, or other source.
  • Item Processor: applies business logic to the data coming from the item reader.
  • Item Writer: takes the processed data and writes it to a database, message queue, or file.

The data is split into defined chunks and processed chunk by chunk. A chunk is a portion of the data of a certain size, which you specify with the chunk size parameter. “Chunks provide a simple solution to deal with paginated reads or situations where we don’t want to keep a significant amount of data in memory”. [3]

Image Credit: https://livebook.manning.com/concept/spring/filter-item-processor :: Chunk-oriented Step

In the tasklet model, by contrast, a tasklet is performed as a single task within a step. A job in the tasklet model executes its steps one after the other, each reading, processing, and writing in a single run. As a risk, if your data is very large, resources can be exhausted, so for large data volumes the chunk approach is the better choice. A tasklet-based Step is typically used for operations such as deleting a resource or executing a single query, as in the sketch below.
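
A minimal sketch of a tasklet-based step (hypothetical step name and task; the StepBuilderFactory it uses is introduced below):

@Bean
Step cleanupStep() {
    return stepBuilderFactory.get("cleanupStep")
            .tasklet((contribution, chunkContext) -> {
                // a single task, e.g. deleting a temporary file or running one query
                return RepeatStatus.FINISHED; // signal that the tasklet is done
            })
            .build();
}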

Another important component in the Spring Batch framework is the JobRepository. It stores job and step metadata in a database managed by the framework (an in-memory database by default). The repository persists job and step executions during item processing and calculates execution metrics to provide statistical data, so the bookkeeping of job and step execution is handled by Spring Batch itself.

After you define a Job and its related Steps, you need a JobLauncher to run a particular job. Spring Batch provides a simple interface for running a job, including ad-hoc executions, with JobLauncher. A job can be run manually, synchronously or asynchronously depending on the launcher implementation. When you run a job through the JobLauncher, the framework creates a JobInstance for the run, together with any JobParameters you supply.

You can see the main infrastructure of the Spring Batch framework in the picture below:

Image Credit: https://www.toptal.com/spring/spring-batch-tutorial :: Spring Batch Framework

Let’s design an application that reads data from a free API providing Coronavirus data. This demo application transforms the data into entity objects using a mapper in the processor step, and then saves all of the Covid data to a CSV file.

To use the Spring Batch framework in a Java project, you need to add the following dependency to your application:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-batch</artifactId>
</dependency>

To enable batch processing, you need to add the @EnableBatchProcessing annotation to the main class. With this annotation, you can use Spring Batch features and get a base configuration for setting up batch jobs in a configuration class.

@SpringBootApplication
@EnableBatchProcessing
public class CovidBatchServiceApplication {

    public static void main(String[] args) {
        SpringApplication.run(CovidBatchServiceApplication.class, args);
    }
}

After adding the dependency and the annotation, you can start designing the application by defining the Steps of the batch process. In the Spring Batch framework, you can define a Step using StepBuilderFactory. It creates a new Step object with the builder pattern, taking reader, processor, and writer objects as parameters like below:

@Autowired
private StepBuilderFactory stepBuilderFactory;

@Bean
Step saveDataFromApiToCsvFileStep() {
    return stepBuilderFactory.get("saveDataFromApiToCsvFileStep")
            .<CovidCountryDataDTO, CovidCountryData>chunk(5)
            .reader(reader())
            .processor(processor())
            .writer(fileItemWriter())
            .build();
}

In the above code, the chunk size is set to 5 (the default chunk size is 1), so the step reads, processes, and writes five items of the data set at a time. The reader can be defined by implementing the ItemReader interface from the Spring Batch framework.

The ItemReader interface declares a read() method, which is overridden in the implementing reader class. At initialization time, our reader class loads its data through a service that fetches it from the API with RestTemplate.

The read() method is then called multiple times during the batch, each call returning the next value and finally returning null when all input data has been exhausted, as below [4]:

public class CovidDataItemReader implements ItemReader<CovidCountryDataDTO>, InitializingBean {

    @Autowired
    private CovidApiBatchService covidApiBatchService;

    private List<CovidCountryDataDTO> covidCountryDataList;
    private Integer index = 0;
    private boolean initialized;

    private void init() {
        covidCountryDataList = covidApiBatchService.getCovidSummaryData().getCountries();
        initialized = true;
    }

    @Override
    public void afterPropertiesSet() throws Exception {
        init();
    }

    @Override
    public CovidCountryDataDTO read() {

        if (!initialized)
            init();

        // all items delivered: reset the reader and signal the end of input
        if (index == covidCountryDataList.size()) {
            index = 0;
            initialized = false;
            return null;
        }

        CovidCountryDataDTO countryDTO = covidCountryDataList.get(index);
        index++;
        return countryDTO;
    }
}

After implementing the custom reader class, we need to develop a Processor that maps the data coming from the Reader to entity objects. With the help of the ItemProcessor interface, a custom processor class can be implemented easily. ItemProcessor declares a process() method that lets us apply business logic to the data read in the reader phase.

@Slf4j
public class CovidDataItemProcessor implements ItemProcessor<CovidCountryDataDTO, CovidCountryData> {

    @Autowired
    private CovidCountryDataMapper covidCountryDataMapper;

    @Override
    public CovidCountryData process(CovidCountryDataDTO covidCountryDataDTO) {
        // map the incoming DTO to the entity object
        return covidCountryDataMapper.map(covidCountryDataDTO);
    }
}

After implementing the custom processor class, the last task is to design the writer class. As with the reader and processor, another Spring Batch interface, ItemWriter, can be used to implement the writer functionality. If you want to write data to a file, Spring provides a specialized class for this, called FlatFileItemWriter.

First, you need to define a class that extends this class. In the constructor of the writer class, the header fields of the file are set first. Then the delimiter for the CSV file is defined, and the data columns are given to a bean wrapper field extractor to extract only certain fields from the data. The writer writes data to the resource defined in the properties file. A sample implementation of this functionality is below:

public class CovidDataFileItemWriter extends FlatFileItemWriter<CovidCountryData> {

    private static final String[] dataColumns = new String[]{"country", "countryCode", "newConfirmed", "totalConfirmed", "newDeaths", "totalDeaths", "newRecovered", "totalRecovered"};

    public CovidDataFileItemWriter(ApplicationProperties applicationProperties) {

        super();

        // header line written once at the top of the CSV file
        this.setHeaderCallback(writer -> writer.write("Country;CountryCode;New Confirmed;Total Confirmed;New Deaths;Total Deaths;New Recovered;Total Recovered"));

        this.setAppendAllowed(true);

        // aggregate each item into a semicolon-delimited line
        DelimitedLineAggregator<CovidCountryData> delimitedLineAggregator = new DelimitedLineAggregator<>();
        delimitedLineAggregator.setDelimiter(";");

        // extract only the listed bean properties from each item
        BeanWrapperFieldExtractor<CovidCountryData> fieldExtractor = new BeanWrapperFieldExtractor<>();
        fieldExtractor.setNames(dataColumns);
        delimitedLineAggregator.setFieldExtractor(fieldExtractor);

        this.setResource(new PathResource(applicationProperties.getCsvFilePath()));
        this.setLineAggregator(delimitedLineAggregator);
    }
}

Now the first step is complete, with reader, processor, and writer implementations. Let’s define another step that reads data from the CSV file written in the previous step, processes it with the processor designed in the previous step, and finally writes the processed data to a database. For this, we define another Step bean as below:

@Bean
Step saveDataFromCsvFileToDbStep() {
    return stepBuilderFactory.get("saveDataFromCsvFileToDbStep")
            .<CovidCountryDataDTO, CovidCountryData>chunk(5)
            .reader(fileItemReader())
            .processor(processor())
            .writer(writer())
            .build();
}

The reader class can be defined by extending the FlatFileItemReader class from the Spring Batch framework.

In the constructor of the reader class, the delimiter used to split the lines of the file is specified first, and then the columns of data to be parsed are given. You also need to specify the target object type and set the resource the reader will read from. Finally, assuming the first line of the file is the header line, you must set the linesToSkip parameter to 1 so the header is skipped during reading.

public class CovidDataFileItemReader extends FlatFileItemReader<CovidCountryDataDTO> {

    private static final String[] dataColumns = new String[]{"country", "countryCode", "newConfirmed", "totalConfirmed", "newDeaths", "totalDeaths", "newRecovered", "totalRecovered"};

    private static final int LINES_TO_SKIP = 1;

    public CovidDataFileItemReader(ApplicationProperties applicationProperties) {

        super();

        DefaultLineMapper<CovidCountryDataDTO> lineMapper = new DefaultLineMapper<>();

        // split each line on the semicolon delimiter and name the resulting fields
        DelimitedLineTokenizer delimitedLineTokenizer = new DelimitedLineTokenizer();
        delimitedLineTokenizer.setDelimiter(";");
        delimitedLineTokenizer.setNames(dataColumns);

        // map the named fields onto the target DTO type
        BeanWrapperFieldSetMapper<CovidCountryDataDTO> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(CovidCountryDataDTO.class);

        lineMapper.setLineTokenizer(delimitedLineTokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);

        this.setResource(new PathResource(applicationProperties.getCsvFilePath()));
        this.setLineMapper(lineMapper);
        this.setLinesToSkip(LINES_TO_SKIP);
    }
}

In the processor step, we will use the processor class which we have already defined in the previous step.

Since we want to store the data in a database this time rather than write it to a CSV file, we need a class that implements the ItemWriter interface. ItemWriter declares a write() method that performs the write operation of a batch process. In the code below, the repository’s saveAll method is called to save the chunk to the database.

public class CovidDataItemWriter implements ItemWriter<CovidCountryData> {

    @Autowired
    private CovidDataRepository covidDataRepository;

    @Override
    public void write(List<? extends CovidCountryData> countryDataList) {
        // persist the whole chunk in one repository call
        if (countryDataList != null) {
            covidDataRepository.saveAll(countryDataList);
        }
    }
}

With that, we have finished implementing the two steps, saveDataFromApiToCsvFileStep and saveDataFromCsvFileToDbStep. We now need to attach these steps to a job in order to run them. To define a Job in Spring Batch, JobBuilderFactory can be used. Like StepBuilderFactory, it follows the builder pattern and creates a Job instance from the given parameters.

The steps to be run are defined on the job. In addition, if you want to run the job more than once, you must define an incrementer that increments the job instance ID on each run. Otherwise, the launcher fails with the error “A job instance already exists and is complete for parameters={}”. Using the flow and next methods, you can specify which step begins first and which steps follow it:

@Autowired
private JobBuilderFactory jobBuilderFactory;

@Bean
Job batchJob() {
    return jobBuilderFactory.get("job1")
            .incrementer(new RunIdIncrementer())
            .flow(saveDataFromApiToCsvFileStep())
            .next(saveDataFromCsvFileToDbStep())
            .end()
            .build();
}

If you define the Job bean, the application tries to run the job at startup. You can disable this in the properties file, so that the job is triggered either manually or by a scheduled operation:

spring.batch.job.enabled=false
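
For example, here is a minimal sketch of a scheduled trigger (hypothetical class name and cron expression; it assumes @EnableScheduling is present on a configuration class and reuses the batchJob bean defined above). Because the date parameter changes on every run, each launch gets a fresh JobInstance:

@Component
public class CovidBatchScheduler {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job batchJob;

    // run the job every day at 06:00
    @Scheduled(cron = "0 0 6 * * *")
    public void runBatchJob() throws Exception {
        JobParameters jobParameters = new JobParametersBuilder()
                .addDate("startDate", new Date())
                .toJobParameters();
        jobLauncher.run(batchJob, jobParameters);
    }
}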

Before we run the application, we can enable the H2 (in-memory) console in application.yml as below:

spring:
  h2:
    console:
      enabled: true
  jpa:
    database-platform: org.hibernate.dialect.H2Dialect
    show-sql: false
  datasource:
    url: jdbc:h2:mem:testdb
    driverClassName: org.h2.Driver
    username: admin
    password: admin

We can run the job using the JobLauncher interface from the Spring Batch framework. JobLauncher declares the run() method, which creates a JobInstance to run the job with its steps. Using the JobParameters class, we can pass parameters such as the user, start time, and run configuration to the job instance as follows:

JobParameters jobParameters = new JobParametersBuilder()
        .addString("user", user)
        .addDate("startDate", new Date())
        .toJobParameters();

jobLauncher.run(batchJob, jobParameters);

Spring Batch persists its metadata through the JobRepository, which here is backed by the in-memory H2 database. The schema of these metadata tables is below:

Image Credit: https://docs.spring.io/spring-batch/docs/current/reference/html/schema-appendix.html :: Spring Batch Tables

After running the job, we can query the batch_step_execution and batch_job_execution tables to see the status of the executed jobs and steps, as well as metrics such as read count, write count, commit count, and start and end times.
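
The same metadata can also be read programmatically. Here is a hedged sketch using Spring Batch’s JobExplorer (hypothetical method name; "job1" is the job name defined above):

@Autowired
private JobExplorer jobExplorer;

public void printLastJobExecution() {
    // fetch the most recent instance of the job defined above
    List<JobInstance> instances = jobExplorer.getJobInstances("job1", 0, 1);
    if (instances.isEmpty()) {
        return;
    }
    for (JobExecution execution : jobExplorer.getJobExecutions(instances.get(0))) {
        System.out.println("Job status: " + execution.getStatus());
        // per-step metrics mirror the batch_step_execution table
        execution.getStepExecutions().forEach(step ->
                System.out.println(step.getStepName()
                        + " read=" + step.getReadCount()
                        + " written=" + step.getWriteCount()
                        + " commits=" + step.getCommitCount()));
    }
}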

The Spring Batch framework also lets you intervene in processing through listener interfaces. The listener interfaces provided by the framework are: [5]

  • JobExecutionListener
  • StepExecutionListener
  • ItemReadListener
  • ItemProcessListener
  • ItemWriteListener
  • SkipListener

They are used to intervene during the execution of a job or step, for operations such as logging and verification. The JobExecutionListener interface provides interception for Spring Batch jobs, so you can apply any logic before and after a job with the beforeJob and afterJob methods. In the example below, we log an INFO-level message before and after job execution.

@Log4j2
public class CovidDataJobItemListener implements JobExecutionListener {

    @Override
    public void beforeJob(JobExecution jobExecution) {
        log.info("Job started for user: {}.", jobExecution.getJobParameters().getString("user"));
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        log.info("Job ended for user: {}.", jobExecution.getJobParameters().getString("user"));
    }
}

Like JobExecutionListener, StepExecutionListener provides interception for Spring Batch steps, so you can apply any logic before and after a step with the beforeStep and afterStep methods, as in the sketch below. Beyond whole steps, you can also intercept the reader, processor, and writer phases inside a step.
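
A minimal sketch of such a step listener (hypothetical class name), mirroring the job listener above:

@Log4j2
public class CovidDataStepListener implements StepExecutionListener {

    @Override
    public void beforeStep(StepExecution stepExecution) {
        log.info("Step {} started.", stepExecution.getStepName());
    }

    @Override
    public ExitStatus afterStep(StepExecution stepExecution) {
        log.info("Step {} ended with status {}.", stepExecution.getStepName(), stepExecution.getExitStatus());
        // returning the existing exit status leaves the step result unchanged
        return stepExecution.getExitStatus();
    }
}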

The ItemReadListener, ItemProcessListener, and ItemWriteListener interfaces can be implemented to customize the before and after behavior of each phase. Also, using the on…Error methods of these interfaces, you can catch and manage errors. [6] The following example shows the general structure of such an implementation:

@Log4j2
public class CovidDataItemReaderListener implements ItemReadListener<CovidCountryDataDTO> {

    @Override
    public void beforeRead() {
        // called before each read(); add logging or verification here
    }

    @Override
    public void afterRead(CovidCountryDataDTO covidCountryDataDTO) {
        // called after each successfully read item
    }

    @Override
    public void onReadError(Exception e) {
        // called when read() throws; handle or log the error here
    }
}

After implementing such custom listener classes, you need to register them on the job or step objects in the bean definition as follows:

@Bean
Step saveDataFromApiToCsvFileStep() {
    return stepBuilderFactory.get("saveDataFromApiToCsvFileStep")
            .<CovidCountryDataDTO, CovidCountryData>chunk(5)
            .reader(reader())
            .processor(processor())
            .writer(fileItemWriter())
            .listener(processListener())
            .listener(readerListener())
            .listener(writerListener())
            .build();
}

Conclusion:

Batch processing has been a proven standard technique for processing large volumes of data for years. Although techniques and tools have changed over the years, the idea behind the process has remained the same. Batch processing not only provides a faster way to process data but also reduces human intervention, cutting the costs companies dedicate to managing these processes.

The Spring Batch framework provides highly usable and configurable batch processing on top of the Java EE JSR 352 standard. It not only provides builder classes to simplify job and step creation, but also interfaces for developing custom job and step executions and listeners.

When you use the Spring Batch framework, you don’t have to deal with the transaction and context management of batches. The framework supports a transaction mechanism, database and context execution management, and also partitioning, which is not discussed in this article. If you want to customize any part of the framework, you can easily override its functionality through the given interfaces.

Batch processing has been used by many companies for years. Companies seek to transform ETL and other batch processes into modern, high-performance frameworks and batch processing tools. If you have similar needs in your project, the Spring Batch framework will be a perfect solution for you.
