Routine Jobs with Kubernetes, Spring Cloud Dataflow and Spring Cloud Task

Andrios Robert · Published in pismolabs · 5 min read · May 10, 2017

We do quite a few things with the Spring frameworks, as they tend to keep up with the evolving technological landscape, and the cloud is no exception. The Spring Cloud suite brings the productivity of its already established frameworks to the cloud, and a lot more.

For this solution, we decided to go with Spring Batch for the implementation of the routines, interfaced with Spring Cloud Task and deployed to Spring Cloud Dataflow. Using the Kubernetes implementation, the Dataflow server mediates with the cloud infrastructure and enables the tasks to dynamically allocate resources as needed.

Our use case for Spring Cloud Dataflow may not seem obvious, since most of the known uses and resources available online relate to its stream processing capabilities and the use of Tasks for ETL/ELT. However, gluing the short-lived Task capabilities to Spring Batch implementations deployed on Kubernetes proved to be perfectly suited for routine jobs in a cloud environment.

Time for some code

To understand the implementation, we are going to use a simple partitioned Spring Batch example already available in the Spring Cloud GitHub repository. First, we need the worker for our job, which is the component responsible for processing the data. To do so, we need a Spring Boot application annotated with @EnableTask. The same goes for the master (we are not going deep into the Spring Batch implementation here; check out the Spring documentation for more info).
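
For reference, the entry point is just a standard Spring Boot application; a minimal sketch, following the Spring Cloud Task sample (the class name here is illustrative), looks like this:

```java
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.task.configuration.EnableTask;

// Shared Spring Boot entry point: @EnableTask registers each run as a Spring Cloud Task,
// @EnableBatchProcessing bootstraps the Spring Batch infrastructure
@EnableTask
@EnableBatchProcessing
@SpringBootApplication
public class PartitionedBatchJobApplication {

    public static void main(String[] args) {
        SpringApplication.run(PartitionedBatchJobApplication.class, args);
    }
}
```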

The main piece of the worker application is the Step definition, which is where the actual business logic is defined for each chunk of data we process:
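
A minimal sketch of such a Step, adapted from the Spring sample (bean and step names follow that sample; imports and the surrounding @Configuration class are omitted, and the tasklet body is where your own logic would go):

```java
@Bean
public Step workerStep(StepBuilderFactory stepBuilderFactory) {
    // the step each worker executes for its own partition
    return stepBuilderFactory.get("workerStep")
            .tasklet(workerTasklet(null))
            .build();
}

@Bean
@StepScope
public Tasklet workerTasklet(
        @Value("#{stepExecutionContext['partitionNumber']}") final Integer partitionNumber) {
    // the partition number is read from the step execution context populated by the master
    return (contribution, chunkContext) -> {
        System.out.println("This tasklet ran partition: " + partitionNumber);
        return RepeatStatus.FINISHED;
    };
}
```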

For the master, the partitioning is where the magic happens. Here we define the GRID_SIZE, which translates into how many partitions our job is going to be split into:
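
A sketch of the Partitioner bean, following the Spring sample (GRID_SIZE is a constant in the configuration class, and the context key is the one the worker tasklet above reads back):

```java
private static final int GRID_SIZE = 4;

@Bean
public Partitioner partitioner() {
    // creates one ExecutionContext per partition; each one ends up in a separate worker
    return gridSize -> {
        Map<String, ExecutionContext> partitions = new HashMap<>(GRID_SIZE);
        for (int i = 0; i < GRID_SIZE; i++) {
            ExecutionContext context = new ExecutionContext();
            context.put("partitionNumber", i);
            partitions.put("partition" + i, context);
        }
        return partitions;
    };
}
```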

The last piece of code we should focus on is the PartitionHandler definition, which is responsible for launching the workers. Here is where things got quite tricky when coupled with Kubernetes:
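
A hedged sketch of that definition, assuming the worker image is published as yourregistry/partitioned-batch-job (the image name, worker count and application name are placeholders; only the Kubernetes-relevant imports are shown). It already includes the two changes described next:

```java
import org.springframework.cloud.deployer.resource.docker.DockerResource;
import org.springframework.cloud.task.batch.partition.DeployerPartitionHandler;

@Bean
public PartitionHandler partitionHandler(TaskLauncher taskLauncher, JobExplorer jobExplorer) {
    // 1) a Docker image, not a Maven artifact, represents the worker on Kubernetes
    Resource workerResource = new DockerResource("yourregistry/partitioned-batch-job:latest");

    DeployerPartitionHandler partitionHandler =
            new DeployerPartitionHandler(taskLauncher, jobExplorer, workerResource, "workerStep");

    // 2) unlike the original sample, we do not copy the master's environment variables
    // into the worker launch request; Kubernetes already forwards that context to the
    // child pods, and duplicating it makes the launch fail (see the wrap-up notes)
    partitionHandler.setMaxWorkers(GRID_SIZE);
    partitionHandler.setGridSize(GRID_SIZE);
    partitionHandler.setApplicationName("partitionedBatchJobTask");
    return partitionHandler;
}
```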

Here we make a few customizations to the Spring example project so it works with Kubernetes. The two biggest ones are: 1) using a Docker image as the resource representing the worker, and 2) removing the passing of environment variables to the worker (more on that in a moment). A full Kubernetes version of the Spring example is available on our GitHub.

The setup could be split into two different projects for organization purposes, because once the logic on each side (master and worker) starts to grow, your jars end up carrying things they are not going to use. To keep things simple, we are going to stick with the all-in-one-project approach of the Spring example.

The last change we need to make to the project is to use the Spring Cloud Kubernetes Deployer as the SPI, as stated in the project documentation.

Infrastructure

Implementation settled, it is time to prepare the infrastructure to run it, and for that we need a running Kubernetes cluster (you can use Minikube for simplicity). Once Kubernetes is up, we need to deploy the Spring Cloud Dataflow Server to it. You can use the Deployment and Service definitions available in the Spring GitHub repository for the server, but if you want a different version than the one in the Docker image published on the Spring Docker Hub, you will need to build your own image.

To create the Deployment and Service, simply clone the repository and run the commands below. Check the status of the server in the Kubernetes Dashboard, or via kubectl, and make sure it is running and that the service is exposing it.
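
Something along these lines should do it (the exact file names follow the layout of the spring-cloud-dataflow-server-kubernetes repository at the time and may differ between versions):

```bash
git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes.git
cd spring-cloud-dataflow-server-kubernetes

# create the Dataflow Server Deployment and the Service that exposes it
kubectl create -f src/kubernetes/server/server-deployment.yaml
kubectl create -f src/kubernetes/server/server-svc.yaml

# confirm the pod is running and the service is exposed
kubectl get pods
kubectl get svc
```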

To interact with the Dataflow Server, we can use its REST APIs, the Dashboard, or the interactive Shell. We are going to use the Shell for this example, but you can read more about the other options in the Spring documentation. The idea behind using the Shell is to make it easier to interact with the server in an automated way, using a simple bash script. To run the Shell, execute its jar and point its configuration to the service we just created on Kubernetes; from there you should be able to create apps and tasks and perform actions on them, such as launching, stopping or deleting.
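
Roughly like this (host and port are placeholders; 9393 is the server's default port, and with Minikube you can get the URL from the exposed service):

```bash
java -jar spring-cloud-dataflow-shell.jar --dataflow.uri=http://<scdf-host>:9393
```

From inside the shell, registering the Docker image of our job and launching it would look something like the following (app and task names are illustrative):

```
dataflow:> app register --name partitionedBatchJob --type task --uri docker:yourregistry/partitioned-batch-job:latest
dataflow:> task create partitionedJob --definition "partitionedBatchJob"
dataflow:> task launch partitionedJob
```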

After setting up the shell, embed it in a Docker image, along with the configuration that points it to our server. This opens the path to the next topic.

How about the scheduling?

Tools for scheduling routine tasks are lacking: most solutions are closed and proprietary, and the open source options usually end up relying on a crontab at the OS level. Kubernetes introduced a straightforward and consistent answer to this with the CronJob feature, which manages time-based jobs. You simply define a spec with a cron expression and a container image, and Kubernetes handles the scheduling, which is where our Dataflow Shell container image comes in.
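
A minimal CronJob sketch, assuming the Shell image from the previous section is published as yourregistry/dataflow-shell (the name and schedule are placeholders):

```yaml
apiVersion: batch/v1beta1        # batch/v2alpha1 on older, 2017-era clusters
kind: CronJob
metadata:
  name: partitioned-job-launcher
spec:
  schedule: "0 2 * * *"          # every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: dataflow-shell
            image: yourregistry/dataflow-shell:latest   # the image wrapping the Dataflow Shell
          restartPolicy: Never
```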

Putting it together, the CronJob uses the Shell container image to launch our partitioned job on the configured schedule, driven by a simple shell script that can serve as the Docker image entrypoint. Here is an example of this script:
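
A sketch of such an entrypoint, assuming the shell jar is baked into the image under /opt and that a one-line command file (containing, say, task launch partitionedJob) sits next to it; SCDF_URI and the default address are assumptions:

```sh
#!/bin/sh
# Launches the already-created task definition through the Dataflow Shell.
# SCDF_URI is expected to be provided by the CronJob spec.
java -jar /opt/spring-cloud-dataflow-shell.jar \
    --dataflow.uri="${SCDF_URI:-http://scdf-server:9393}" \
    --spring.shell.commandFile=/opt/launch-task.txt
```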

Wrapping up

Gluing all these things together was a bit tricky, since Spring Cloud Dataflow is a pretty new tool and its documentation is still evolving. A few takeaways from this solution:

Back to the point of not passing the environment variables to the worker in the PartitionHandler implementation: it is necessary because the Kubernetes Deployer launches the worker passing every argument it receives as command line arguments, and since Kubernetes already forwards the whole context (including arguments and environment variables) to its child pods, the duplication makes the launch fail.

Using a container image (instead of the default Maven artifact) as the resource for the worker is the only way to make it work with the Kubernetes Dataflow Server. Even though this is not explicit in the latest documentation (it was added to the M2 snapshot), the server does not support Maven artifacts as resources as of now.

Use a shell entrypoint when wrapping your master application in its Docker image, and use the java binary as the command for the worker. This lets the master pick up parameters from the application configuration, such as JDBC information, and ensures the worker only receives what the master sends, not the whole context, which might otherwise make it fail to launch with no clear reason.

That's it for today, see you in the next post!
