Setting up a Spring Cloud Data Flow sandbox using Docker with Kubernetes
Spring Cloud Data Flow has intrigued me for a while — it seems to combine the capabilities of the tried-and-true Spring Integration and Spring Batch (both of which I’ve been a user of for years) but updates them for modern fast data / microservice / cloud-native application patterns.
I’ve watched a number of live and recorded presentations about SCDF but I’ve yet to actually give it a go. Given the recent release of Spring Cloud Data Flow 1.3.0, I decided that it’s high time I dive in and give it a shot.
While I could certainly have run SCDF on my local machine using the self-contained Local Server — that didn’t seem like nearly as much fun as deploying all of its constituent components to a container orchestration platform.
Given that I’ve also been meaning to investigate Kubernetes and that SCDF has first class support for it (as well as Cloud Foundry, Yarn, etc.), I decided to kill two birds with one stone and give Kubernetes a go as well.
There are myriad options for running a Kubernetes cluster — from managed cloud services like Google Kubernetes Engine, or Amazon Elastic Kubernetes Service (currently in preview) to tried-and-true local-machine solutions like Minikube. In October 2017 Docker announced support for running Kubernetes on the Docker platform and an initial preview for Mac was released in early January 2018. As I have at least some experience using Docker, I decided that I should build upon that and try out this new offering.
This initial post covers my experience and steps for getting a sandbox Spring Cloud Data Flow environment deployed. In a future post I will cover some of the capabilities of, and experiences using, SCDF.
What is Spring Cloud Data Flow?
The official documentation describes it thusly:
Spring Cloud Data Flow is a toolkit for building data integration and real-time data processing pipelines.
Pipelines consist of Spring Boot apps, built using the Spring Cloud Stream or Spring Cloud Task microservice frameworks.
…
The Spring Cloud Data Flow server uses Spring Cloud Deployer, to deploy pipelines onto modern runtimes such as Cloud Foundry, Kubernetes, Apache Mesos or Apache YARN.
Basically, SCDF is a runtime platform for deploying and managing data processing pipelines composed of many individual Spring Boot-based microservice applications. SCDF itself is a service that exposes a RESTful management API (and has a corresponding CLI) as well as a web-based management and configuration UI.
It supports two distinct application types (Streams and Tasks), with initial support for a third (Functions). Streams focus on processing potentially infinite data streams: basically a standard pipe-and-filter architecture, but where each constituent component is a standalone Spring Boot microservice. Tasks are used to process finite data sets; think more traditional ETL or Spring Batch jobs. Finally, Functions are the latest addition and allow defining business logic as a single function to be executed (similar in concept to AWS Lambda).
SCDF comprises a number of Spring Projects as explained in this handy diagram:
While it may seem overwhelming given the number of Spring Cloud * projects in use, let's quickly break down the main ones, and I'll give you my (admittedly novice) understanding:
Spring Cloud Stream
Based on Spring Integration; reuses much of the terminology and core concepts of a pipe-and-filter architecture. Conceptually you have data Sources, data Processors, and data Sinks. These components communicate by passing Messages across Channels managed by a messaging middleware. Apache Kafka and RabbitMQ are supported out of the box, but binders can be written for other middleware.
Generally, I think of it as taking the capabilities of Spring Integration's "Enterprise Integration Patterns" and applying them not to enterprise monoliths, but to cloud-native microservices.
Spring Cloud Task
Enables launching of arbitrary short-lived Spring Boot applications. Integrates directly with Spring Batch, so while it's definitely an oversimplification, you can sort of think of this as a "Spring Cloud Batch".
Spring Cloud Deployer
An SPI for managing deployment of Spring Boot applications to various platforms (Local, Kubernetes, Mesos, YARN, etc.).
Spring Cloud Skipper
Provides the ability to define and deploy Packages to a variety of different Platforms. A Package is a YAML manifest file that defines what should be deployed (perhaps a single Spring Boot app, or a group of applications). A Platform is where the application actually runs (e.g. Kubernetes). It leverages Spring Cloud Deployer to deploy packages to platforms.
Spring Cloud Data Flow
The runtime platform for managing the lifecycle of Spring Cloud Stream/Task applications. You write your app, register it with the running SCDF server (either as a Docker container or using Maven coordinates), and SCDF then lets you define streams that include your application and manages deploying it to the underlying platform when a Stream is deployed.
It also includes a very nice, modern-looking user interface! For those who remember the now-defunct Spring Batch Admin UI (retired in favor of SCDF), this provides a much-needed visual upgrade.
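For a taste of what registration looks like, here is roughly the SCDF shell command for registering a custom Docker-packaged source (the app name and image are made up for illustration; we'll register Spring's prebuilt starter apps in bulk later on):

```
dataflow:>app register --name my-source --type source --uri docker:myorg/my-source:0.1.0
```

Maven-built apps can be registered the same way with a maven:// URI instead of a Docker image.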
Now that we’ve reviewed what SCDF is, let’s get it set up and running!
Installing Docker with Kubernetes
In order to get Kubernetes running locally, the first thing we have to do is install or update Docker to the latest Edge release with Kubernetes support.
Once installed, before enabling Kubernetes, I'd highly recommend configuring Docker's resource allocations. By default only 1 GB of RAM was allocated, and in my testing this did not seem to be enough for Kubernetes to operate successfully.
Launch Docker, open Preferences -> Advanced, and change the default CPU/memory allocations.
Once complete, open the Kubernetes tab, check Enable Kubernetes, and click Apply.
This installs the Kubernetes components into Docker and displays a progress window.
When I initially did this, it took me a few tries to get it to complete. At first it hung at this step indefinitely, but after I updated the resource settings as suggested above and installed an update pushed out by the Docker team, it started working as expected. Hopefully this goes more smoothly for you as they continue to iron out defects in the pre-release.
Now we can validate that Kubernetes is up and running:
$ kubectl cluster-info
Kubernetes master is running at https://localhost:6443
KubeDNS is running at https://localhost:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T10:09:24Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.2", GitCommit:"5fa2db2bd46ac79e5e00a4e6ed24191080aa463b", GitTreeState:"clean", BuildDate:"2018-01-18T09:42:01Z", GoVersion:"go1.9.2", Compiler:"gc", Platform:"linux/amd64"}
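If you are scripting the sandbox setup end-to-end, it can also help to block until the single Docker-provided node reports Ready before deploying anything. A tiny helper along these lines (my own sketch, not from any official docs) does the trick:

```shell
#!/bin/sh
# Succeeds when any node line on stdin reports the Ready condition;
# feed it the output of 'kubectl get nodes --no-headers'.
node_ready() {
  grep -q ' Ready'
}

# Example polling loop against the live cluster:
#   until kubectl get nodes --no-headers | node_ready; do sleep 2; done
```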
Kubernetes Dashboard (Optional)
While this is totally optional, installing Kubernetes Dashboard is easy, and provides some interesting views into what’s running on your local k8s cluster.
$ kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/master/src/deploy/recommended/kubernetes-dashboard.yaml
secret "kubernetes-dashboard-certs" created
serviceaccount "kubernetes-dashboard" created
role "kubernetes-dashboard-minimal" created
rolebinding "kubernetes-dashboard-minimal" created
deployment "kubernetes-dashboard" created
service "kubernetes-dashboard" created
Now launch the proxy to enable accessing the dashboard from outside the cluster (i.e. from your host machine).
$ kubectl proxy
Starting to serve on 127.0.0.1:8001
You should now be able to access the dashboard at: http://localhost:8001/api/v1/namespaces/kube-system/services/https:kubernetes-dashboard:/proxy/#!/login
Click SKIP to bypass setting up security.
The Dashboard is useful for monitoring and reporting on what's going on in your cluster.
Deploy Spring Cloud Data Flow Server
Spring Cloud Data Flow Server has a number of different platform-specific implementations. As we are deploying to Kubernetes, that means we need to use Spring Cloud Data Flow Server Kubernetes. This project contains a set of Kubernetes object specs that define each component needed for SCDF, which you can use to deploy to your k8s cluster.
First we need to get the source code which contains Kubernetes configurations for each of the necessary SCDF components.
$ git clone https://github.com/spring-cloud/spring-cloud-dataflow-server-kubernetes.git
Then check out the appropriate release tag, in my case v1.3.1.RELEASE:
$ cd spring-cloud-dataflow-server-kubernetes
$ git checkout v1.3.1.RELEASE
* Note: I ran into problems using 1.3.0 and opened an issue; it turned out that it had a bug and Kubernetes 1.9 was not fully supported. This has since been corrected, so make sure you use 1.3.1 or greater.
Make Services Available to Host machine
Now, before we actually deploy anything, we need to make a few modifications to the provided Kubernetes specs to better suit our goal of creating an environment we can use to fully explore SCDF locally. By default none of the SCDF components are exposed outside of the Kubernetes cluster, and they will thus not be accessible from our host machine. For instance, by default we would not be able to connect to the MySQL database from the host. To ensure that Kubernetes provides a host port proxied to each cluster service, we need to modify the Service specs and set the type to NodePort.
As such, edit each of the following files:
src/kubernetes/rabbitmq/rabbitmq-svc.yaml
src/kubernetes/mysql/mysql-svc.yaml
src/kubernetes/redis/redis-svc.yaml
src/kubernetes/skipper/skipper-svc.yaml
src/kubernetes/server/server-svc.yaml
and ensure that the spec.type value is set to NodePort.
For example, redis-svc.yaml should look like:
apiVersion: v1
kind: Service
metadata:
  name: redis
  labels:
    app: redis
spec:
  type: NodePort
  ports:
  - port: 6379
  selector:
    app: redis
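If you'd rather not hand-edit five files, a small script can make the change mechanically. This is just a sketch: it assumes the two-space indentation used in these manifests and that a `type:` line only ever appears under `spec:` in these small Service files, so review the resulting diffs before deploying:

```shell
#!/bin/sh
# Force spec.type to NodePort in a Service manifest, inserting the field if absent.
set_node_port() {
  awk '
    /^  type:/ { next }                      # drop any existing type line
    { print }
    /^spec:$/  { print "  type: NodePort" }  # insert right after spec:
  ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

for f in src/kubernetes/rabbitmq/rabbitmq-svc.yaml \
         src/kubernetes/mysql/mysql-svc.yaml \
         src/kubernetes/redis/redis-svc.yaml \
         src/kubernetes/skipper/skipper-svc.yaml \
         src/kubernetes/server/server-svc.yaml; do
  if [ -f "$f" ]; then set_node_port "$f"; fi
done
```

Running it a second time is harmless, since any existing type line is dropped before the NodePort line is reinserted.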
Now that we have all the configuration, we can deploy the infrastructure components necessary for SCDF to operate.
Messaging Middleware
As mentioned previously, SCDF supports both Kafka and RabbitMQ middleware. For now we’ll use RabbitMQ.
$ kubectl create -f src/kubernetes/rabbitmq
RDBMS Datastore
SCDF supports PostgreSQL, MySQL, or H2 out of the box. For now let's go with MySQL.
$ kubectl create -f src/kubernetes/mysql/
Analytics / Metric Collectors
SCDF supports basic analytics such as incrementing counters, field value counters (count unique values in a payload), and aggregate counters (counts per time unit). Using this capability requires Redis for key/value storage.
$ kubectl create -f src/kubernetes/redis/
$ kubectl create -f src/kubernetes/metrics/metrics-deployment-rabbit.yaml
$ kubectl create -f src/kubernetes/metrics/metrics-svc.yaml
Spring Cloud Skipper
Spring Cloud Skipper enables updating and rolling back the version of deployed applications and streams.
$ kubectl create -f src/kubernetes/skipper/skipper-deployment.yaml
$ kubectl create -f src/kubernetes/skipper/skipper-svc.yaml
Spring Cloud Data Flow
Wow, we’re finally ready to deploy the actual SCDF server!
We first need to edit the server's Deployment spec to indicate that we are using Spring Cloud Skipper (this is optional, but without it automated updates and rollbacks of deployed streams are not supported).
Edit src/kubernetes/server/server-deployment.yaml and uncomment the lines as directed:
# Uncomment the following properties if you're going to use Skipper for stream deployments
- name: SPRING_CLOUD_SKIPPER_CLIENT_SERVER_URI
  value: 'http://${SKIPPER_SERVICE_HOST}/api'
- name: SPRING_CLOUD_DATAFLOW_FEATURES_SKIPPER_ENABLED
  value: 'true'
Now deploy:
$ kubectl create -f src/kubernetes/server/server-roles.yaml
$ kubectl create -f src/kubernetes/server/server-rolebinding.yaml
$ kubectl create -f src/kubernetes/server/service-account.yaml
$ kubectl create -f src/kubernetes/server/server-config-rabbit.yaml
$ kubectl create -f src/kubernetes/server/server-svc.yaml
$ kubectl create -f src/kubernetes/server/server-deployment.yaml
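Since sandbox experiments tend to involve tearing things down and rebuilding, it can be convenient to capture those six steps in one place. A small helper (my own convenience sketch, listing the same files in the same order) makes the sequence re-runnable:

```shell
#!/bin/sh
# Emit the SCDF server spec files in apply order
# (roles, bindings, account, and config before the service and deployment).
server_specs() {
  for name in server-roles server-rolebinding service-account \
              server-config-rabbit server-svc server-deployment; do
    echo "src/kubernetes/server/${name}.yaml"
  done
}

# Apply them against the live cluster:
#   server_specs | xargs -n 1 kubectl create -f
```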
Now we should be able to list out all Services deployed to Kubernetes and validate that we see ours:
$ kubectl get svc
NAME          TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)          AGE
kubernetes    ClusterIP   10.96.0.1        <none>        443/TCP          2h
metrics       ClusterIP   10.104.246.91    <none>        80/TCP           47s
mysql         NodePort    10.97.130.190    <none>        3306:31561/TCP   1m
rabbitmq      NodePort    10.111.147.114   <none>        5672:31112/TCP   1m
redis         NodePort    10.97.46.206     <none>        6379:32071/TCP   56s
scdf-server   NodePort    10.102.125.104   <none>        80:30518/TCP     13s
skipper       NodePort    10.110.90.243    <none>        80:31280/TCP     37s
The second port (after the :) is the host port being proxied to the service; in the output above, port 30518 is proxied to the SCDF server.
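Rather than eyeballing the table, you can also pull a port out programmatically: either ask kubectl directly with `kubectl get svc scdf-server -o jsonpath='{.spec.ports[0].nodePort}'`, or parse the table output with a little awk helper like this sketch:

```shell
#!/bin/sh
# Print the host (node) port for a named service, reading
# 'kubectl get svc' output on stdin; PORT(S) looks like 80:30518/TCP.
node_port() {
  awk -v svc="$1" '$1 == svc { split($5, p, /[:\/]/); print p[2] }'
}

# Usage against the live cluster:
#   kubectl get svc | node_port scdf-server
```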
Use Spring Cloud Data Flow
Now that all of the required dependencies, and SCDF itself, are deployed, we can load up the UI and log in with the default username and password (user / password):
http://localhost:30518/dashboard
So now we can see the UI… but it’s depressingly empty. There are no Apps registered, nor Streams, Tasks or Jobs (Spring Batch!) defined. We should do something about that…
Import Starter Apps via the CLI
Thankfully, Spring provides a number of out-of-the-box applications that we can register and try out. We could theoretically register these via the UI, but instead we'll use the CLI.
First we need to download the CLI jar file:
$ wget http://repo.spring.io/release/org/springframework/cloud/spring-cloud-dataflow-shell/1.3.1.RELEASE/spring-cloud-dataflow-shell-1.3.1.RELEASE.jar
Then we can launch it. Note that since we plan to use Spring Cloud Skipper to manage our deployments, we need to enable Skipper mode on the command line:
$ java -jar spring-cloud-dataflow-shell-1.3.1.RELEASE.jar --dataflow.mode=skipper
Next we need to connect to the SCDF server, so grab the port from above and run:
server-unknown:>dataflow config server --username user --password password http://localhost:30518
Now we are connected to our local SCDF server, and we can register our applications! Spring's out-of-the-box Stream and Task applications come with short URLs to import them all at once. Let's import the latest Stream applications:
dataflow:>app import --uri http://bit.ly/Celsius-SR1-stream-applications-rabbit-docker
* Note: The Task applications are available at: http://bit.ly/Clark-GA-task-applications-docker
Now we can check the UI and see all of our newly deployed applications.
Or, if you prefer the command line, the CLI works great as well
Deploy and test a Stream
Finally! Everything’s ready for us to actually define and deploy a Stream and see some real Action! Spring provides instructions for a number of Sample Pipelines we could try out, but for now we’ll go with the simplest possible example from their documentation.
We'll connect the time application source, which generates a timestamp every second, to the log application sink, which receives data and writes it to a log file.
We could define this programmatically in Java, declaratively via the UI, or even visually in the UI by dragging and dropping the desired components, but for simplicity we'll create it via the CLI using the Stream DSL.
dataflow:>stream create foo --definition "time | log"
We can then see in the UI that the stream is created, but not yet deployed.
and see a visual representation as well:
Now we deploy the Stream:
dataflow:> stream deploy foo
On deployment, SCDF creates the necessary RabbitMQ topics for message passing and generates k8s specs for the time and log Docker applications, injecting the configuration they need to connect to Rabbit and read from or write to the appropriate topic.
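As an aside, the Stream DSL composes like a Unix pipeline, so growing this example is mostly a matter of adding stages. For instance, one of the starter processors we imported could sit between a source and the sink; this one (shown for illustration only, not deployed in this walkthrough) upper-cases each HTTP payload via a SpEL expression:

```
dataflow:>stream create web-upper --definition "http | transform --expression=payload.toUpperCase() | log"
```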
Now, we can check and see that SCDF handled defining and deploying all of the necessary Kubernetes resources for our data pipeline runtime:
$ kubectl get all
NAME                 DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/foo-log-v1    1         1         1            1           7m
deploy/foo-time-v1   1         1         1            1           7m
...
NAME                        DESIRED   CURRENT   READY   AGE
rs/foo-log-v1-56585bbd49    1         1         1       7m
rs/foo-time-v1-58cb77d869   1         1         1       7m
...
NAME                              READY   STATUS    RESTARTS   AGE
po/foo-log-v1-56585bbd49-fr9xm    1/1     Running   0          7m
po/foo-time-v1-58cb77d869-48g5r   1/1     Running   0          7m
...
NAME              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
svc/foo-log-v1    ClusterIP   10.107.78.88    <none>        8080/TCP   7m
svc/foo-time-v1   ClusterIP   10.103.170.65   <none>        8080/TCP   7m
...
If we're interested, we can even inspect the Kubernetes configuration that SCDF generated to deploy the time app. Note the runtime config values SCDF injected into the app to provide the appropriate bindings to the RabbitMQ-backed destination:
$ kubectl get pod foo-time-v1-58cb77d869-48g5r -o yaml
...
spec:
  containers:
  - args:
    - --spring.metrics.export.triggers.application.includes=integration**
    - --spring.cloud.dataflow.stream.app.label=time
    - --spring.cloud.stream.metrics.key=foo.time.${spring.cloud.application.guid}
    - --spring.cloud.stream.bindings.output.producer.requiredGroups=foo
    - --spring.cloud.stream.metrics.properties=spring.application.name,spring.application.index,spring.cloud.application.*,spring.cloud.dataflow.*
    - --spring.cloud.stream.bindings.applicationMetrics.destination=metrics
    - --spring.cloud.dataflow.stream.name=foo
    - --spring.cloud.stream.bindings.output.destination=foo.time
    - --spring.cloud.dataflow.stream.app.type=source
    env:
    - name: SPRING_RABBITMQ_PORT
      value: "5672"
    - name: SPRING_RABBITMQ_HOST
      value: 10.106.249.73
    - name: SPRING_CLOUD_APPLICATION_GUID
      value: ${HOSTNAME}
    - name: SPRING_CLOUD_APPLICATION_GROUP
      value: foo
    image: springcloudstream/time-source-rabbit:1.3.1.RELEASE
    imagePullPolicy: IfNotPresent
...
Finally — we can check the logs of the Kubernetes service and see that our data pipeline is producing the expected results:
$ kubectl logs po/foo-log-v1-56585bbd49-fr9xm
...
2018-02-26 21:53:36.979 INFO 1 --- [ main] o.s.i.endpoint.EventDrivenConsumer : Adding {message-handler:inbound.foo.time.foo} as a subscriber to the 'bridge.foo.time' channel
2018-02-26 21:53:36.979 INFO 1 --- [ main] o.s.i.endpoint.EventDrivenConsumer : started inbound.foo.time.foo
2018-02-26 21:53:36.980 INFO 1 --- [ main] o.s.c.support.DefaultLifecycleProcessor : Starting beans in phase 2147483647
2018-02-26 21:53:37.086 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:36
2018-02-26 21:53:37.285 INFO 1 --- [ main] s.b.c.e.t.TomcatEmbeddedServletContainer : Tomcat started on port(s): 8080 (http)
2018-02-26 21:53:37.290 INFO 1 --- [ main] o.s.c.s.a.l.s.r.LogSinkRabbitApplication : Started LogSinkRabbitApplication in 21.201 seconds (JVM running for 22.771)
2018-02-26 21:53:37.895 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:37
2018-02-26 21:53:38.900 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:38
2018-02-26 21:53:39.901 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:39
2018-02-26 21:53:40.909 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:40
2018-02-26 21:53:41.907 INFO 1 --- [ foo.time.foo-1] log-sink : 02/26/18 21:53:41
...
SUCCESS! Wow, isn’t that exciting?? Okay, fine, you’re right, seeing the time printed out every second isn’t that mind-blowing. But a LOT had to happen to get us to this point!
Recap
Let’s quickly review what happened to get us to this point:
- We set up a Kubernetes cluster running on our local machine
- By slightly tweaking a few YAML files, we successfully deployed the SCDF server and all of its dependencies to our local Kubernetes cluster
- We imported some starter Spring Boot apps into SCDF, making the platform aware of all sorts of easy-to-use data sources, processors, and sinks
- We composed a new data processing pipeline (Stream) from two of those apps via the CLI, with a single line of Stream DSL text
- With one more line in the CLI we told SCDF to deploy our stream. SCDF then orchestrated creating the necessary messaging middleware configuration, generated Kubernetes specs with all required runtime properties injected, and handed the specs off to Kubernetes to deploy
Wow. While the functionality of this particular stream is not very interesting, when you think about the capabilities this provides for easily building, deploying, and operating much more complex pipelines, it's quite impressive.
In just a few commands we can spin up an entire cluster of "computers" and deploy dozens of services to it. We can declaratively define data pipelines and let the platform handle all of the non-essential configuration and deployment.
I had a ton of fun learning and playing around with Kubernetes & SCDF and I look forward to further exploring SCDF’s capabilities with a more real-world data processing use case. Stay tuned for a future post covering that experience.
Big thanks to the SCDF team, who were very helpful and responsive in Gitter when I ran into issues!