Apache Beam: Manage your dependencies in a world without Spring/Guice…

Brachi Packter
Published in Analytics Vidhya
Feb 23, 2020
Photo by Blake Connally on Unsplash

This isn’t a brilliant idea, just a hidden feature in Apache Beam that you may miss even after reading the whole documentation…

Apache Beam is a robust framework for distributed streaming data pipelines. It is an SDK with many runners: Flink, Dataflow, and more.

When I dive deep into a pipeline and have to communicate with external services, such as a Redis client, a DB connection pool, or some internal configuration service, I feel like my code starts getting messy.

I miss Spring and other DI frameworks, where I have one place to define and initialize my services and another place to inject them when needed. Without this ability, I had to pass into my DoFn all the parameters and dependencies needed to create the service.
On top of that, services should usually be singletons. This is very important in high-scale applications: imagine a service that holds a DB connection pool and isn’t a singleton. In that case I may very quickly hit the maximum number of connections the DB allows.

Let’s sum up what was wrong with the ugly version:

  1. Singleton: creating it means blocking on locks.
  2. Coupling: a function must know all the Redis creation properties, even if it only needs to use the client.
  3. Hard refactoring: if I need to add another parameter to the Redis creation, I have to pass it through the constructor. If the constructor is called from many places, I have to modify all of them, tests may fail, and the refactoring is error-prone.
  4. Testing is hard: a lot of static state and dependencies that are complex to mock.
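A minimal sketch of what the ugly version could look like, assuming the Redisson client and a single Redis server. The class name UglyLookupFn, the two constructor parameters, and the key lookup itself are made up for illustration; only the shape (a static client, double-checked locking, connection properties passed through the constructor) comes from the list above. It needs the Beam Java SDK and Redisson on the classpath.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.redisson.Redisson;
import org.redisson.api.RBucket;
import org.redisson.api.RedissonClient;
import org.redisson.client.codec.StringCodec;
import org.redisson.config.Config;

// Hypothetical DoFn: looks up each incoming key in Redis and emits (key, value).
class UglyLookupFn extends DoFn<String, KV<String, String>> {

  // Static, so every DoFn instance in the same worker JVM shares one client.
  private static volatile RedissonClient client;

  // Coupling: the DoFn carries every property needed to build the client.
  private final String redisAddress;
  private final int timeoutMillis;

  UglyLookupFn(String redisAddress, int timeoutMillis) {
    this.redisAddress = redisAddress;
    this.timeoutMillis = timeoutMillis;
  }

  // Hand-written singleton: double-checked locking around the creation.
  private static RedissonClient getOrCreate(String address, int timeoutMillis) {
    if (client == null) {
      synchronized (UglyLookupFn.class) {
        if (client == null) {
          Config config = new Config();
          config.useSingleServer().setAddress(address).setTimeout(timeoutMillis);
          client = Redisson.create(config);
        }
      }
    }
    return client;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    RedissonClient redis = getOrCreate(redisAddress, timeoutMillis);
    RBucket<String> bucket = redis.getBucket(c.element(), StringCodec.INSTANCE);
    String value = bucket.get();
    if (value != null) {
      c.output(KV.of(c.element(), value));
    }
  }
}
```

Every new connection setting means another constructor parameter, and every caller of the constructor has to change with it.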

The solution:

We probably already have a PipelineOptions interface that contains all the Redis configuration.

We just had to add one special option, “RedissonClient getRedissonClient();” (plus the matching setter that Beam requires for every option), and annotate it with @Default.InstanceFactory(RedisClientFactory.class). The class in the annotation is a factory that “knows” how to initialize my target service, and the service will be created only once per worker.

Here is the Factory:
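What follows is a sketch rather than the original code. It shows the options interface together with its factory, since the two reference each other. The option names (redisAddress, redisTimeoutMillis), their default values, and the holder class RedisWiring are assumptions made for illustration; the holder only keeps everything in one file, much like the Beam examples nest their options inside the pipeline class. The @JsonIgnore annotation is also my addition: as far as I know, Beam serializes options as JSON when shipping them to workers, and a Redis client is not something you want serialized.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.DefaultValueFactory;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// Hypothetical holder class, used only to keep the sketch in a single file.
class RedisWiring {

  public interface RedisOptions extends PipelineOptions {

    @Description("Redis address, for example redis://host:6379")
    @Default.String("redis://localhost:6379")
    String getRedisAddress();

    void setRedisAddress(String value);

    @Description("Redis command timeout in milliseconds")
    @Default.Integer(3000)
    int getRedisTimeoutMillis();

    void setRedisTimeoutMillis(int value);

    // The "special" option: not a configuration value but the service itself.
    // Beam calls the factory the first time the getter is invoked and keeps
    // the result on the options object.
    @JsonIgnore
    @Default.InstanceFactory(RedisClientFactory.class)
    RedissonClient getRedissonClient();

    void setRedissonClient(RedissonClient client);

    // The only place that knows how to build the client.
    class RedisClientFactory implements DefaultValueFactory<RedissonClient> {
      @Override
      public RedissonClient create(PipelineOptions options) {
        RedisOptions redis = options.as(RedisOptions.class);
        Config config = new Config();
        config
            .useSingleServer()
            .setAddress(redis.getRedisAddress())
            .setTimeout(redis.getRedisTimeoutMillis());
        return Redisson.create(config);
      }
    }
  }
}
```

With this in place, a caller only supplies something like --redisAddress=redis://my-host:6379 on the command line; nobody outside the factory touches the Redisson Config.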

The factory above receives all the properties required to create the Redis client. Any change or addition of dependencies or parameters takes effect only here.

And the DoFn after the change:
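Again a sketch under assumptions, not the original code. The names LookupPipeline and LookupFn and the lookup logic are hypothetical. The options interface is repeated here in a reduced form, without the factory annotation, only so that the block compiles on its own; in the pipeline described in this post the getter also carries @Default.InstanceFactory.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.redisson.api.RBucket;
import org.redisson.api.RedissonClient;
import org.redisson.client.codec.StringCodec;

// Hypothetical holder class, used only to keep the sketch in a single file.
class LookupPipeline {

  // Reduced copy of the options interface (assumption: the full version
  // also declares the Redis settings and the factory annotation).
  public interface RedisOptions extends PipelineOptions {
    @JsonIgnore
    RedissonClient getRedissonClient();

    void setRedissonClient(RedissonClient client);
  }

  // No constructor parameters, no static state, no locks.
  static class LookupFn extends DoFn<String, KV<String, String>> {

    @ProcessElement
    public void processElement(ProcessContext c) {
      // "Injection": ask the pipeline options for the ready-made service.
      RedissonClient redis =
          c.getPipelineOptions().as(RedisOptions.class).getRedissonClient();

      RBucket<String> bucket = redis.getBucket(c.element(), StringCodec.INSTANCE);
      String value = bucket.get();
      if (value != null) {
        c.output(KV.of(c.element(), value));
      }
    }
  }
}
```

The DoFn no longer knows anything about addresses or timeouts, so adding a Redis setting does not touch it or any code that constructs it.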

Caveats:

  1. No DI outside the @ProcessElement scope: the options are accessible only from inside a DoFn.
  2. Featureless: no AOP and no instance lifecycle hooks.
  3. Instances are always initialized lazily. They are created only when you first ask for them, and there is no way around that.

Why not Spring?

Apache Beam doesn’t expose a hook at worker initialization time, so I can’t use Spring, which needs a starting point on each worker to initialize its context and beans.

Yes, it’s a small and very straightforward change, but if you don’t make it in time you can end up with coupled, untestable code that is hard to maintain.
