Introduction to Spring Batch. Take 2
This is the second part of our introduction to Spring Batch. Previously we published Spring Batch Introduction. Take 1, where we explained some basic concepts around Spring Batch jobs and described the minimum necessary to run a Spring Batch application. Enjoy Take 2!
A step can be declared tasklet-oriented or chunk-oriented.
- Tasklet-oriented step: A single task defined in the tasklet() method. Each invocation of the tasklet runs in its own transaction; if it fails, the transaction is rolled back.
- Chunk-oriented step: A chunk is a set of items that are read and processed one after another but written together. The number of items per chunk is defined in the step declaration with the chunk() method. A transaction is created per chunk.
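As a minimal sketch of the two kinds of step, assuming the classic StepBuilderFactory DSL (bean names and the Person type are illustrative):

```java
@Configuration
public class BatchConfig {

    @Autowired
    private StepBuilderFactory steps;

    // Tasklet-oriented step: one simple task.
    @Bean
    public Step taskletStep() {
        return steps.get("taskletStep")
                .tasklet((contribution, chunkContext) -> {
                    System.out.println("Doing a simple task");
                    return RepeatStatus.FINISHED;
                })
                .build();
    }

    // Chunk-oriented step: items are read and processed one by one,
    // then written together in chunks of 10; one transaction per chunk.
    @Bean
    public Step chunkStep(ItemReader<Person> reader,
                          ItemProcessor<Person, Person> processor,
                          ItemWriter<Person> writer) {
        return steps.get("chunkStep")
                .<Person, Person>chunk(10)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }
}
```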
We could choose chunk-oriented processing to use as much RAM as possible (without running out of it), or to benefit from transactionality (e.g. rolling back 10 items if one fails and they must all go together). We could also choose it to write a bunch of items at once (e.g. against an endpoint which receives an array of items).
As usual with Spring Boot, it is not necessary to declare a transactionManager because the framework injects one automatically.
In the next commit we have declared an ItemReader which reads a CSV file and an ItemWriter which prints every person. We have also changed the step declaration to define a chunk size of 2.
The FlatFileItemReader, DelimitedLineTokenizer, DefaultLineMapper and FieldSetMapper implementations help us read a CSV file in a few lines.
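A sketch of such a reader bean; the file name and the Person fields are assumptions, and BeanWrapperFieldSetMapper is one common FieldSetMapper implementation:

```java
@Bean
public FlatFileItemReader<Person> reader() {
    FlatFileItemReader<Person> reader = new FlatFileItemReader<>();
    reader.setResource(new ClassPathResource("persons.csv"));

    // Split every CSV line into named fields.
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setNames("firstName", "lastName");

    // Map the named fields onto Person properties.
    BeanWrapperFieldSetMapper<Person> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
    fieldSetMapper.setTargetType(Person.class);

    DefaultLineMapper<Person> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    reader.setLineMapper(lineMapper);
    return reader;
}
```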
6. Fault Tolerance
The ItemReader, ItemProcessor and ItemWriter components throw exceptions in case of error. With the skipLimit() method we can tolerate a number of those exceptions, which are then caught without breaking the application. The exception types to skip must be declared with the skip() method, called as many times as necessary.
We can implement SkipListener to work with the skipped items, e.g. printing the input or storing it in another CSV.
In the third commit we have introduced a reading error (“-”) in the CSV and a writing error (an IllegalArgumentException thrown by the writer). We have included the faultTolerant() method in the step declaration with a skip limit of 2, skipping FlatFileParseException, FlatFileFormatException and IllegalArgumentException.
As the limit is 2, the job executes correctly, skipping the 2 errors introduced. As an exercise, you could add another error to the CSV file and see how the job fails.
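A sketch of that fault-tolerant step declaration, assuming an injected StepBuilderFactory named steps and the reader/writer beans (names are assumptions):

```java
@Bean
public Step csvStep(ItemReader<Person> reader, ItemWriter<Person> writer) {
    return steps.get("csvStep")
            .<Person, Person>chunk(2)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skipLimit(2)                          // tolerate at most 2 skipped items
            .skip(FlatFileParseException.class)    // reading errors
            .skip(FlatFileFormatException.class)
            .skip(IllegalArgumentException.class)  // writing errors
            .build();
}
```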
7. Retry
We can retry an action if it has previously failed: a read, a process or a write. In the fourth commit we have added retryLimit(1) and retry(IllegalArgumentException.class) to retry exceptions of that type.
It is convenient to combine skipping with retrying, because the exception could still persist after the retries are exhausted.
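Retry and skip share the same faultTolerant() DSL; a sketch combining both (bean and type names are assumptions):

```java
@Bean
public Step retryStep(ItemReader<Person> reader, ItemWriter<Person> writer) {
    return steps.get("retryStep")
            .<Person, Person>chunk(2)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .retryLimit(1)                         // retry the failed item once
            .retry(IllegalArgumentException.class) // exception type to retry
            .skipLimit(2)                          // if it still fails, skip it
            .skip(IllegalArgumentException.class)
            .build();
}
```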
8. Log and statistics
The StepBuilderFactory#listener() method receives any listener type, and there are more, such as ChunkListener, StepListener, RetryListener… All of these listeners can be used to log the progress of the batch process. We can use the step context or the job context to save timestamps and other information.
We can also chain listener() on JobBuilderFactory, with a JobExecutionListener instance as argument.
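As a sketch of that idea, a hypothetical ChunkListener could save a “timing” attribute in the chunk context to measure chunk duration (the class and attribute names are assumptions):

```java
public class TimingChunkListener implements ChunkListener {

    @Override
    public void beforeChunk(ChunkContext context) {
        // Save the start timestamp in the chunk context.
        context.setAttribute("timing", System.currentTimeMillis());
    }

    @Override
    public void afterChunk(ChunkContext context) {
        long start = (long) context.getAttribute("timing");
        System.out.println("Chunk took " + (System.currentTimeMillis() - start) + " ms");
    }

    @Override
    public void afterChunkError(ChunkContext context) {
        System.out.println("Chunk failed");
    }
}
```

It would be registered in the step declaration with .listener(new TimingChunkListener()).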
The StepExecution class has a getSummary() method which returns information such as the status, the exit status and the read/write/skip counts.
The status field is a BatchStatus enum and can be mapped to the application exit code. The exitStatus field is an ExitStatus instance and represents the finishing status of the step.
The exitStatus field is not always equivalent to status. We can customize its value to provide more information, and that information can be used to define a conditional flow where the next step to start depends on the previous step's exitStatus.
By the way, Spring Batch allows defining flows of steps where a step is executed conditionally, taking into account the finishing status of the previous step.
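A sketch of such a conditional flow, assuming an injected JobBuilderFactory named jobs and step beans declared elsewhere (all names are assumptions):

```java
@Bean
public Job conditionalJob(Step stepA, Step recoveryStep, Step stepB) {
    return jobs.get("conditionalJob")
            .start(stepA)
            .on("FAILED").to(recoveryStep)   // runs only if stepA exits with FAILED
            .from(stepA).on("*").to(stepB)   // any other exit status continues to stepB
            .end()
            .build();
}
```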
In this commit we have written several listeners to log every action during batch processing. Notice how a “timing” attribute has been added to the chunk context to measure the time spent per chunk.
9. Batch process stopping and rerunning
Spring Batch gives us a JobOperator bean with the following methods:
- getRunningExecutions(“jobName”), which returns a set of job execution ids.
- stop(executionId), which stops the job as soon as the currently running developer-written code finishes.
- restart(executionId), to continue a stopped or failed execution from the step where it left off.
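A sketch of how those methods could be used (the job name and class name are assumptions):

```java
@Component
public class JobControlService {

    @Autowired
    private JobOperator jobOperator;

    public void stopRunningExecutions() throws Exception {
        // Ids of the currently running executions of the job.
        Set<Long> executionIds = jobOperator.getRunningExecutions("importPersonsJob");
        for (Long executionId : executionIds) {
            // The step stops as soon as the developer-written code finishes.
            jobOperator.stop(executionId);
        }
    }

    public void restartExecution(Long executionId) throws Exception {
        // Continues the stopped or failed execution from where it left off.
        jobOperator.restart(executionId);
    }
}
```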
10. Web administration
The Spring Cloud Data Flow project provides a web interface and a CLI to manage jobs and streams coming from the Spring Cloud Task and Spring Cloud Stream projects. The goal of the Spring Cloud Task project is to integrate Spring Batch jobs as cloud microservices.
In order to convert the Spring Batch project into a Spring Cloud Task one, it is only necessary to add the dependency and the @EnableTask annotation.
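A minimal sketch of that conversion, with the spring-cloud-starter-task dependency on the classpath (the class name is illustrative):

```java
@SpringBootApplication
@EnableTask  // registers the application as a Spring Cloud Task
public class BatchApplication {

    public static void main(String[] args) {
        SpringApplication.run(BatchApplication.class, args);
    }
}
```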
The following are the steps to run the job inside Spring Cloud Data Flow:
The common case is to register the application with a Maven URI (maven://) so that Spring Cloud Data Flow downloads the artifact from the repository.
In the example, we define the application type as task; other application types are source and sink.
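An illustrative Spring Cloud Data Flow shell session following those steps; the application name and Maven coordinates are assumptions:

```shell
# Register the task application from a Maven repository.
dataflow:> app register --name person-batch --type task --uri maven://com.example:person-batch:1.0.0
# Create and launch a task definition based on it.
dataflow:> task create person-task --definition "person-batch"
dataflow:> task launch person-task
```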
We can review the job statistics in: http://localhost:9393/dashboard/index.html#/jobs/executions/1
We can review the step statistics in: http://localhost:9393/dashboard/index.html#/jobs/executions/1/1
You can give Spring Cloud Data Flow a try; there are really cool features, like a visual tool to draw the data flow between microservices.
A property can be set to choose the folder where the application logs will be stored. In any case, you can find a line in the Spring Cloud Data Flow logs starting with “Logs will be in”, indicating the temporary folder chosen by Spring.
You can see how easy it is to convert the project into a Spring Cloud Data Flow one in this commit.
11. Scaling
There are different strategies to scale batch processing, either with multithreading or with multiprocessing.
Step partitioning is one of those strategies: it allows running steps on remote machines or in local threads. It consists of defining a master step which delegates the partitioned work to slave steps. The slave step and the partitioning strategy also have to be defined.
The master step uses an implementation of Partitioner to write into each execution context the information every slave needs to process its data partition.
In the last commit we have added a second CSV file and defined a CustomMultiResourcePartitioner bean, which the master step (also declared) uses to write a CSV file name into every context. Every slave receives a different context from the master step.
The two slave steps run in different threads because we have declared a task executor with the StepBuilderFactory DSL.
Note how the @StepScope annotation has been used to retrieve the file name from the context using the Spring Expression Language.
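A sketch of those pieces; the partitioner mirrors the CustomMultiResourcePartitioner named above, but the exact implementation in the commit may differ, and the key and bean names are assumptions:

```java
// One partition per CSV resource, each carrying its file name.
public class CustomMultiResourcePartitioner implements Partitioner {

    private Resource[] resources;

    public void setResources(Resource[] resources) {
        this.resources = resources;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int i = 0;
        for (Resource resource : resources) {
            ExecutionContext context = new ExecutionContext();
            context.putString("fileName", resource.getFilename());
            partitions.put("partition" + i++, context);
        }
        return partitions;
    }
}

// Master step delegating the partitioned work to slave steps on separate threads.
@Bean
public Step masterStep(Step slaveStep, Partitioner partitioner) {
    return steps.get("masterStep")
            .partitioner("slaveStep", partitioner)
            .step(slaveStep)
            .taskExecutor(new SimpleAsyncTaskExecutor())
            .build();
}

// Step-scoped reader resolving its file name from the step context via SpEL.
@Bean
@StepScope
public FlatFileItemReader<Person> slaveReader(
        @Value("#{stepExecutionContext['fileName']}") String fileName) {
    FlatFileItemReader<Person> reader = new FlatFileItemReader<>();
    reader.setResource(new ClassPathResource(fileName));
    return reader;
}
```

With SimpleAsyncTaskExecutor each partition runs in its own local thread; a remoting mechanism would be needed instead to run the slaves on remote machines.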
Spring Batch also provides several ready-made implementations that make batch processing development easier. Here is a list of them:
PS: Don’t forget to follow us on Twitter.