Setting up a Spring Cloud Data Flow sandbox using Docker with Kubernetes
Spring Cloud Data Flow has intrigued me for a while — it seems to combine the capabilities of the tried-and-true Spring Integration and Spring Batch (both of which I’ve been a user of for years) but updates them for modern fast data / microservice / cloud-native application patterns.
I’ve watched a number of live and recorded presentations about SCDF but I’ve yet to actually give it a go. Given the recent release of Spring Cloud Data Flow 1.3.0, I decided that it’s high time I dive in and give it a shot.
While I could certainly have run SCDF on my local machine using the self-contained Local Server — that didn’t seem like nearly as much fun as deploying all of its constituent components to a container orchestration platform.
Given that I’ve also been meaning to investigate Kubernetes and that SCDF has first class support for it (as well as Cloud Foundry, Yarn, etc.), I decided to kill two birds with one stone and give Kubernetes a go as well.
There are myriad options for running a Kubernetes cluster — from managed cloud services like Google Kubernetes Engine, or Amazon Elastic Kubernetes Service (currently in preview) to tried-and-true local-machine solutions like Minikube. In October 2017 Docker announced support for running Kubernetes on the Docker platform and an initial preview for Mac was released in early January 2018. As I have at least some experience using Docker, I decided that I should build upon that and try out this new offering.
This initial post covers my experience and steps for getting a sandbox Spring Cloud Data Flow environment deployed. In a future post I will cover some of the capabilities of, and experiences using, SCDF.
What is Spring Cloud Data Flow?
The official documentation describes it thusly:
Spring Cloud Data Flow is a toolkit for building data integration and real-time data processing pipelines.
Pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks.
…
The Spring Cloud Data Flow server uses Spring Cloud Deployer, to deploy pipelines onto modern runtimes such as Cloud Foundry, Kubernetes, Apache Mesos or Apache YARN.
Basically, SCDF is a runtime platform for deploying and managing data processing pipelines composed of many individual Spring Boot-based microservice applications. SCDF itself is a service that exposes a RESTful management API (and has a corresponding CLI) as well as a web-based management and configuration UI.
It supports two distinct application types (Streams and Tasks), with initial support for a third (Functions). Streams focus on processing potentially infinite data streams: basically a standard pipe-and-filter architecture, but where each constituent component is a standalone Spring Boot microservice. Tasks are used to process finite data sets; think more traditional ETL or Spring Batch jobs. Finally, Functions are the latest addition and allow defining business logic as a single function to be executed (similar in concept to AWS Lambda).
SCDF comprises a number of Spring Projects as explained in this handy diagram:
While it may seem overwhelming given the number of Spring Cloud * projects in use, let's quickly break down the main ones, and I'll give you my (admittedly novice) understanding:
Spring Cloud Stream
Based on Spring Integration; reuses much of the terminology and core concepts of a pipe-and-filter architecture. Conceptually you have data Sources, data Processors, and data Sinks. These components communicate by passing Messages across Channels managed by a messaging middleware. Apache Kafka and RabbitMQ are supported out of the box, but binders can be written for other middleware.
Generally, I think of it as taking the capabilities of Spring Integration's "Enterprise Integration Patterns" and applying them not to enterprise monoliths, but to cloud-native microservices.
Spring Cloud Task
Enables launching of arbitrary short-lived Spring Boot applications. Integrates directly with Spring Batch, so while it's definitely an oversimplification, you can sort of think of this as a "Spring Cloud Batch".
Spring Cloud Deployer
An SPI for managing deployment of Spring Boot applications to various platforms (Local, Kubernetes, Mesos, YARN, etc.).
Spring Cloud Skipper
Provides the ability to define and deploy Packages to a variety of different Platforms. A Package is a YAML manifest file that defines what should be deployed (perhaps a single Spring Boot app, or a group of applications). A Platform is where the application actually runs (e.g. Kubernetes). It leverages Spring Cloud Deployer to deploy packages to platforms.
Spring Cloud Data Flow
The runtime platform for managing the lifecycle of Spring Cloud Stream/Task applications. You write your app, register it with the running SCDF server (either as a Docker container or using Maven coordinates), and SCDF then lets you define streams that include your application and manages deploying it to the underlying platform when a Stream is deployed.
It also includes a very nice, modern-looking user interface! For those who remember the now-defunct Spring Batch Admin UI (retired in favor of SCDF), this provides a much-needed visual upgrade.
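For a taste of what registration looks like, here is roughly the SCDF shell command for registering a custom Docker-packaged source (the app name and image are made up for illustration; we'll register Spring's prebuilt starter apps in bulk later on):

```
dataflow:>app register --name my-source --type source --uri docker:myorg/my-source:0.1.0
```

Maven-built apps can be registered the same way with a maven:// URI instead of a Docker image.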
Now that we’ve reviewed what SCDF is, let’s get it set up and running!
Installing Docker with Kubernetes
In order to get Kubernetes running locally, the first thing we have to do is install or update Docker to the latest Edge release with Kubernetes support.
Once installed, before enabling Kubernetes, I'd highly recommend configuring Docker's resource allocations. By default only 1 GB of RAM was allocated, and in my testing this did not seem to be enough for Kubernetes to operate successfully.
Launch Docker, open Preferences -> Advanced, and change the default CPU/memory allocations.
Once complete, open the Kubernetes tab, check Enable Kubernetes, and click Apply.
This installs the Kubernetes components into Docker and displays a progress window.
When I initially did this, it took me a few tries to get it to complete. At first it hung at this step indefinitely, but after I updated the resource settings as suggested above and installed an update pushed out by the Docker team, it started working as expected. Hopefully this goes more smoothly for you as they continue to iron out defects in the pre-release.
Now we can validate that Kubernetes is up and running:
$ kubectl cluster-info
Kubernetes master is running at https://localhost:6443
KubeDNS is running at https://localhost:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
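If you are scripting the sandbox setup end-to-end, it can also help to block until the single Docker-provided node reports Ready before deploying anything. A tiny helper along these lines (my own sketch, not from any official docs) does the trick:

```shell
#!/bin/sh
# Succeeds when any node line on stdin reports the Ready condition;
# feed it the output of 'kubectl get nodes --no-headers'.
node_ready() {
  grep -q ' Ready'
}

# Example polling loop against the live cluster:
#   until kubectl get nodes --no-headers | node_ready; do sleep 2; done
```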
Kubernetes Dashboard (Optional)
While this is totally optional, installing Kubernetes Dashboard is easy, and provides some interesting views into what’s running on your local k8s cluster.
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
secret "kubernetes-dashboard-certs" created
serviceaccount "kubernetes-dashboard" created
role "kubernetes-dashboard-minimal" created
rolebinding "kubernetes-dashboard-minimal" created
deployment "kubernetes-dashboard" created
service "kubernetes-dashboard" created
Now launch the proxy to enable accessing the dashboard from outside the cluster (i.e. from your host machine).
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
You should now be able to access the dashboard at: http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login
Click SKIP to bypass setting up security.
The Dashboard is useful for monitoring and reporting on what's going on in your cluster.
Deploy Spring Cloud Data Flow Server
Spring Cloud Data Flow Server has a number of different platform-specific implementations. As we are deploying to Kubernetes, that means we need to use Spring Cloud Data Flow Server Kubernetes. This project contains a set of Kubernetes object specs that define each component needed for SCDF, which you can use to deploy to your k8s cluster.
First we need to get the source code which contains Kubernetes configurations for each of the necessary SCDF components.
$ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes.git
Then check out the appropriate release tag, in my case v1.3.1.RELEASE:
$ cd spring-cloud-dataflow-server-kubernetes
$ git checkout v1.3.1.RELEASE
* Note: I ran into problems using 1.3.0 and opened an issue; it turned out that it had a bug and Kubernetes 1.9 was not fully supported. This has since been corrected, so make sure you use 1.3.1 or greater.
Make Services Available to Host machine
Now, before we actually deploy anything, we need to make a few modifications to the provided Kubernetes specs to better suit our goal of creating an environment we can use to fully explore SCDF locally. By default none of the SCDF components are exposed outside of the Kubernetes cluster, and they will thus not be accessible from our host machine. For instance, by default we would not be able to connect to the MySQL database from the host. To ensure that Kubernetes provides a host port proxied to each cluster service, we need to modify the Service specs and set the type to NodePort.
As such, edit each of the following files:
src/kubernetes/rabbitmq/rabbitmq-svc.yaml
src/kubernetes/mysql/mysql-svc.yaml
src/kubernetes/redis/redis-svc.yaml
src/kubernetes/skipper/skipper-svc.yaml
src/kubernetes/server/server-svc.yaml
and ensure that the spec.type value is set to NodePort.
For example, redis-svc.yaml should look like:
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: NodePort
  ports:
  - port: 6379
  selector:
    app: redis
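If you'd rather not hand-edit five files, a small script can make the change mechanically. This is just a sketch: it assumes the two-space indentation used in these manifests and that a `type:` line only ever appears under `spec:` in these small Service files, so review the resulting diffs before deploying:

```shell
#!/bin/sh
# Force spec.type to NodePort in a Service manifest, inserting the field if absent.
set_node_port() {
  awk '
    /^  type:/ { next }                      # drop any existing type line
    { print }
    /^spec:$/  { print "  type: NodePort" }  # insert right after spec:
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

for f in src/kubernetes/rabbitmq/rabbitmq-svc.yaml \
         src/kubernetes/mysql/mysql-svc.yaml \
         src/kubernetes/redis/redis-svc.yaml \
         src/kubernetes/skipper/skipper-svc.yaml \
         src/kubernetes/server/server-svc.yaml; do
  if [ -f "$f" ]; then set_node_port "$f"; fi
done
```

Running it a second time is harmless, since any existing type line is dropped before the NodePort line is reinserted.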
Now that we have all the configuration, we can deploy the infrastructure components necessary for SCDF to operate.
Messaging Middleware
As mentioned previously, SCDF supports both Kafka and RabbitMQ middleware. For now we’ll use RabbitMQ.
$ kubectl create -f src/kubernetes/rabbitmq
RDBMS Datastore
SCDF supports PostgreSQL, MySQL, or H2 out of the box. For now let's go with MySQL.
$ kubectl create -f src/kubernetes/mysql/
Analytics / Metric Collectors
SCDF supports basic analytics such as incrementing counters, field value counters (count unique values in a payload), and aggregate counters (counts per time unit). Using this capability requires Redis for key/value storage.
$ kubectl create -f src/kubernetes/redis/
$ kubectl create -f src/kubernetes/metrics/metrics-deployment-rabbit.yaml
$ kubectl create -f src/kubernetes/metrics/metrics-svc.yaml
Spring Cloud Skipper
Spring Cloud Skipper enables updating and rolling back the version of deployed applications and streams.
$ kubectl create -f src/kubernetes/skipper/skipper-deployment.yaml
$ kubectl create -f src/kubernetes/skipper/skipper-svc.yaml
Spring Cloud Data Flow
Wow, we’re finally ready to deploy the actual SCDF server!
We first need to edit the server's Deployment spec to indicate that we are using Spring Cloud Skipper (this is optional, but without it automated updates and rollbacks of deployed streams are not supported).
Edit src/kubernetes/server/server-deployment.yaml and uncomment the lines as directed:
# Uncomment the following properties if you're going to use Skipper for stream deployments
- name: SPRING_CLOUD_SKIPPER_CLIENT_SERVER_URI
  value: 'http://${SKIPPER_SERVICE_HOST}/api'
- name: SPRING_CLOUD_DATAFLOW_FEATURES_SKIPPER_ENABLED
  value: 'true'
Now deploy:
$ kubectl create -f src/kubernetes/server/server-roles.yaml
$ kubectl create -f src/kubernetes/server/server-rolebinding.yaml
$ kubectl create -f src/kubernetes/server/service-account.yaml
$ kubectl create -f src/kubernetes/server/server-config-rabbit.yaml
$ kubectl create -f src/kubernetes/server/server-svc.yaml
$ kubectl create -f src/kubernetes/server/server-deployment.yaml
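Since sandbox experiments tend to involve tearing things down and rebuilding, it can be convenient to capture those six steps in one place. A small helper (my own convenience sketch, listing the same files in the same order) makes the sequence re-runnable:

```shell
#!/bin/sh
# Emit the SCDF server spec files in apply order
# (roles, bindings, account, and config before the service and deployment).
server_specs() {
  for name in server-roles server-rolebinding service-account \
              server-config-rabbit server-svc server-deployment; do
    echo "src/kubernetes/server/${name}.yaml"
  done
}

# Apply them against the live cluster:
#   server_specs | xargs -n 1 kubectl create -f
```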
Now we should be able to list out all Services deployed to Kubernetes and validate that we see ours:
$ kubectl get svc
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.96.0.1        <none>        443/TCP          2h
metrics       ClusterIP   10.104.246.91    <none>        80/TCP           47s
mysql         NodePort    10.97.130.190    <none>        3306:31561/TCP   1m
rabbitmq      NodePort    10.111.147.114   <none>        5672:31112/TCP   1m
redis         NodePort    10.97.46.206     <none>        6379:32071/TCP   56s
scdf-server   NodePort    10.102.125.104   <none>        80:30518/TCP     13s
skipper       NodePort    10.110.90.243    <none>        80:31280/TCP     37s
The second port (after the :) is the host port being proxied to the service; in the output above, port 30518 is proxied to the SCDF server.
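Rather than eyeballing the table, you can also pull a port out programmatically: either ask kubectl directly with `kubectl get svc scdf-server -o jsonpath='{.spec.ports[0].nodePort}'`, or parse the table output with a little awk helper like this sketch:

```shell
#!/bin/sh
# Print the host (node) port for a named service, reading
# 'kubectl get svc' output on stdin; PORT(S) looks like 80:30518/TCP.
node_port() {
  awk -v svc="$1" '$1 == svc { split($5, p, /[:\/]/); print p[2] }'
}

# Usage against the live cluster:
#   kubectl get svc | node_port scdf-server
```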
Use Spring Cloud Data Flow
Now that all of the required dependencies, and SCDF itself, are deployed, we can load up the UI and log in with the default username and password (user / password):
http://localhost:30518/dashboard
So now we can see the UI… but it’s depressingly empty. There are no Apps registered, nor Streams, Tasks or Jobs (Spring Batch!) defined. We should do something about that…
Import Starter Apps via the CLI
Thankfully, Spring provides a number of out-of-the-box applications that we can register and try out. We could theoretically register these via the UI, but instead we'll use the CLI.
First we need to download the CLI jar file:
$ wget http://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-shell/1.3.1.RELEASE/spring-cloud-dataflow-shell-1.3.1.RELEASE.jar
Then we can launch it. Note that since we plan to use Spring Cloud Skipper to manage our deployments, we need to enable Skipper mode on the command line:
$ java -jar spring-cloud-dataflow-shell-1.3.1.RELEASE.jar --dataflow.mode=skipper
Next we need to connect to the SCDF server, so grab the port from above and run:
server-unknown:>dataflow config server --username user --password password http://localhost:30518
Now we are connected to our local SCDF server, and we can register our applications! Spring's out-of-the-box Stream and Task applications come with short URLs to import them all at once. Let's import the latest Stream applications:
dataflow:>app import --uri http://bit.ly/Celsius-SR1-stream-applications-rabbit-docker
* Note: The Task applications are available at: http://bit.ly/Clark-GA-task-applications-docker
Now we can check the UI and see all of our newly deployed applications.
Or, if you prefer the command line, the CLI works great as well
Deploy and test a Stream
Finally! Everything’s ready for us to actually define and deploy a Stream and see some real Action! Spring provides instructions for a number of Sample Pipelines we could try out, but for now we’ll go with the simplest possible example from their documentation.
We'll connect the time application source, which generates a timestamp every second, to the log application sink, which receives data and writes it to a log file.
We could define this programmatically in Java, declaratively via the UI, or even visually in the UI by dragging and dropping the desired components, but for simplicity we'll create it via the CLI using the Stream DSL.
dataflow:>stream create foo --definition "time | log"
We can then see in the UI that the stream is created, but not yet deployed.
and see a visual representation as well:
Now we deploy the Stream:
dataflow:> stream deploy foo
On deployment, SCDF creates the necessary RabbitMQ topics for message passing and generates k8s specs for the time and log Docker applications, injecting the configuration they need to connect to Rabbit and read from or write to the appropriate topic.
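As an aside, the Stream DSL composes like a Unix pipeline, so growing this example is mostly a matter of adding stages. For instance, one of the starter processors we imported could sit between a source and the sink; this one (shown for illustration only, not deployed in this walkthrough) upper-cases each HTTP payload via a SpEL expression:

```
dataflow:>stream create web-upper --definition "http | transform --expression=payload.toUpperCase() | log"
```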
Now, we can check and see that SCDF handled defining and deploying all of the necessary Kubernetes resources for our data pipeline runtime:
$ kubectl get all
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/foo-log-v1    1         1         1            1           7m
deploy/foo-time-v1   1         1         1            1           7m
...
NAME                        DESIRED   CURRENT   READY   AGE
rs/foo-log-v1-56585bbd49    1         1         1       7m
rs/foo-time-v1-58cb77d869   1         1         1       7m
...
NAME                              READY   STATUS    RESTARTS   AGE
po/foo-log-v1-56585bbd49-fr9xm    1/1     Running   0          7m
po/foo-time-v1-58cb77d869-48g5r   1/1     Running   0          7m
...
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
svc/foo-log-v1    ClusterIP   10.107.78.88    <none>        8080/TCP   7m
svc/foo-time-v1   ClusterIP   10.103.170.65   <none>        8080/TCP   7m
...
If we're interested, we can even inspect the Kubernetes configuration that SCDF generated to deploy the time app. Note the runtime config values SCDF injected into the app to provide the appropriate bindings to the RabbitMQ-backed destination:
$ kubectl get pod foo-time-v1-58cb77d869-48g5r -o yaml
...
spec:
  containers:
  - args:
    - --spring.metrics.export.triggers.application.includes=integration**
    - --spring.cloud.dataflow.stream.app.label=time
    - --spring.cloud.stream.metrics.key=foo.time.${spring.cloud.application.guid}
    - --spring.cloud.stream.bindings.output.producer.requiredGroups=foo
    - --spring.cloud.stream.metrics.properties=spring.application.name,spring.application.index,spring.cloud.application.*,spring.cloud.dataflow.*
    - --spring.cloud.stream.bindings.applicationMetrics.destination=metrics
    - --spring.cloud.dataflow.stream.name=foo
    - --spring.cloud.stream.bindings.output.destination=foo.time
    - --spring.cloud.dataflow.stream.app.type=source
    env:
    - name: SPRING_RABBITMQ_PORT
      value: "5672"
    - name: SPRING_RABBITMQ_HOST
      value: 10.106.249.73
    - name: SPRING_CLOUD_APPLICATION_GUID
      value: ${HOSTNAME}
    - name: SPRING_CLOUD_APPLICATION_GROUP
      value: foo
    image: springcloudstream/time-source-rabbit:1.3.1.RELEASE
    imagePullPolicy: IfNotPresent
...
Finally — we can check the logs of the Kubernetes service and see that our data pipeline is producing the expected results:
$ kubectl logs po/foo-log-v1-56585bbd49-fr9xm
...
2018-02-26 21:53:36.979 INFO 1 --- [ main] o.s.i.endpoint.EventDrivenConsumer : Adding {message-handler:inbound.foo.time.foo} as a subscriber to the 'bridge.foo.time' channel
2018-02-26 21:53:36.979 INFO 1 --- [ main] o.s.i.endpoint.EventDrivenConsumer : started inbound.foo.time.foo
2018-02-26 21:53:36.980 INFO 1 --- [ main] o.s.c.support.DefaultLifecycleProcessor : Starting beans in phase 2147483647
2018-02-26 21:53:37.086 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:36
2018-02-26 21:53:37.285 INFO 1 --- [ main] s.b.c.e.t.TomcatEmbeddedServletContainer : Tomcat started on port(s): 8080 (http)
2018-02-26 21:53:37.290 INFO 1 --- [ main] o.s.c.s.a.l.s.r.LogSinkRabbitApplication : Started LogSinkRabbitApplication in 21.201 seconds (JVM running for 22.771)
2018-02-26 21:53:37.895 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:37
2018-02-26 21:53:38.900 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:38
2018-02-26 21:53:39.901 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:39
2018-02-26 21:53:40.909 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:40
2018-02-26 21:53:41.907 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:41
...
SUCCESS! Wow, isn’t that exciting?? Okay, fine, you’re right, seeing the time printed out every second isn’t that mind-blowing. But a LOT had to happen to get us to this point!
Recap
Let’s quickly review what happened to get us to this point:
- We set up a Kubernetes cluster running on our local machine
- By slightly tweaking a few YAML files, we successfully deployed the SCDF server and all of its dependencies to our local Kubernetes cluster
- We imported some starter Spring Boot apps into SCDF, making the platform aware of all sorts of easy-to-use data sources, processors, and sinks
- We composed a new data processing pipeline (Stream) from two of those apps via the CLI, with a single line of Stream DSL text
- With one more line in the CLI we told SCDF to deploy our stream. SCDF then orchestrated creating the necessary messaging middleware configuration, generated Kubernetes specs with all required runtime properties injected, and handed the specs off to Kubernetes to deploy
Wow. While the functionality of this particular stream is not very interesting, when you think about the capabilities this provides for easily building, deploying, and operating much more complex pipelines, it's quite impressive.
In just a few commands we can spin up an entire cluster of "computers" and deploy dozens of services to it. We can declaratively define data pipelines and let the platform handle all of the non-essential configuration and deployment.
I had a ton of fun learning and playing around with Kubernetes & SCDF and I look forward to further exploring SCDF’s capabilities with a more real-world data processing use case. Stay tuned for a future post covering that experience.
Big thanks to the SCDF team, who were very helpful and responsive in Gitter when I ran into issues!