Simplify the developer experience with OpenShift for Big Data processing by using Lithops framework

The road to OpenShift

Red Hat OpenShift is the leading enterprise-ready container platform. It is based on Kubernetes and can massively scale and execute Docker containers. It can be installed on premises, in the cloud, or used in hybrid solutions. While all this sounds promising, containerizing software and deploying it to OpenShift is still a non-trivial task for developers. Below we discuss some of the challenges developers may face when moving to OpenShift.

The learning curve of APIs

Developers who need to use OpenShift must choose among the Knative API, the OpenWhisk API, or the Kubernetes Job API. In each case, they need to learn new API semantics, possibly use different third-party SDKs, or even call a REST API directly. The learning curve, while not steep, still consumes time and requires skill to tune the application for optimal usage.

To package the code or not to package?

While Docker is a popular way to package applications and their dependencies, it is not trivial to find the right way to execute proprietary code from a container image. If the code is packed into the Docker image, publishing it to the public Docker Hub could be problematic for proprietary code. Private container registries may resolve this, but they add another layer of complexity: setting up and maintaining the registries, sharing the containers, and so on.

Decide on the right scale

While OpenShift is designed to scale to any number of containers, it can be difficult to decide how many invocations a certain workflow requires when the input dataset is located in persistent storage. This becomes even more challenging if each invocation needs to process a different subset of the input data and perhaps generate different output, which then needs to be collected. For example, consider code that needs to pre-process a large CSV file and generate a new processed dataset. In this scenario, each OpenShift invocation needs to process a different chunk of the input CSV file and persist the results to persistent storage, such as Ceph or cloud object storage. To support such a scenario, the developer needs to write boilerplate code to partition the input dataset into chunks and then process them in parallel. The developer also needs to decide how to direct each invocation to a particular chunk, how to coordinate results across invocations, where to persist the results, how to collect them, and so on. While all of the above is solvable, a lot of effort goes into implementing it all and then testing it.
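To make that boilerplate concrete, here is a minimal local sketch of the plumbing such a scenario requires. The chunking scheme, the toy per-chunk work, and the thread pool standing in for parallel OpenShift invocations are all illustrative assumptions, not part of OpenShift or any framework.

```python
from concurrent.futures import ThreadPoolExecutor

def split_lines(lines, n_chunks):
    """Partition CSV rows into roughly equal chunks, one per invocation."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def process_chunk(chunk):
    """Toy per-invocation work: sum the second (numeric) column of each row."""
    return sum(float(row.split(",")[1]) for row in chunk)

rows = [f"id{i},{i * 1.5}" for i in range(10)]  # stand-in for a large CSV
chunks = split_lines(rows, n_chunks=3)

# A local thread pool stands in for parallel OpenShift invocations; a real
# version would also handle dispatch, persistence, and failure handling.
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))

total = sum(partials)  # coordinate and collect the partial results
```

Even this toy version needs partitioning, dispatch, and result-collection logic; the real version must additionally persist results and survive failures, which is exactly the effort described above.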

Authentication with external resources

When an OpenShift invocation needs to access an external service and authenticate against it, this adds more complexity to handle. Consider thousands of invocations, each of which needs to set up a connection to object storage. In this case, every invocation performs its own authentication, and if thousands of invocations do so simultaneously, the object storage provider is likely to throttle the requests. To resolve this, the code needs to authenticate once and share tokens among invocations. This adds yet another layer of implementation complexity: how to share tokens, how to re-authenticate when a token expires, and so on.
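One way to sketch the authenticate-once pattern is a small token cache that performs the real authentication only when no valid token exists. `TokenCache` and the `authenticate` callable are hypothetical names used here for illustration; they are not part of OpenShift or any SDK.

```python
import time

class TokenCache:
    """Authenticate once and reuse the token until it expires (illustrative)."""

    def __init__(self, authenticate, ttl_seconds):
        self._authenticate = authenticate  # callable performing the real auth
        self._ttl = ttl_seconds
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Re-authenticate only when there is no token or it has expired.
        if self._token is None or time.time() >= self._expires_at:
            self._token = self._authenticate()
            self._expires_at = time.time() + self._ttl
        return self._token

auth_calls = 0
def fake_auth():
    global auth_calls
    auth_calls += 1
    return f"token-{auth_calls}"

cache = TokenCache(fake_auth, ttl_seconds=3600)
tokens = [cache.get() for _ in range(1000)]  # 1000 requests, one authentication
```

In a real deployment the cached token must also be shared across separate invocations, for example through a shared store, which is exactly the extra complexity described above.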

Make it simpler with Lithops

Tackling the above challenges requires a lot of effort from developers, on top of creating the actual business logic of their applications. This situation could eventually force developers to implement a middleware framework of sorts, if only to help them deploy their applications to the OpenShift cluster. To reduce this effort and let developers focus on their business logic, we introduce the Lithops framework. It is designed to greatly reduce developers' effort and provide advanced capabilities for integrating their code with OpenShift. The following example demonstrates how Lithops simplifies deployment to OpenShift.

Summary and next steps

While OpenShift is a great platform for massively scaling Docker containers, we learned that it may require substantial effort to adapt an application or code before using OpenShift efficiently for big data processing. We saw how Lithops can greatly reduce the amount of work developers need to integrate with and benefit from OpenShift. You can easily try it out and see the benefits for yourself: set up Lithops, configure the Knative or Kubernetes Jobs API, and check out how Lithops simplifies your developer experience. You are welcome to start with the applications and examples. In future blog posts, we will further demonstrate the value Lithops provides for developers using OpenShift, or even hybrid clouds, for their workloads.



Gil Vernik

I am a hands-on senior architect and technical team leader at IBM Research. I am an expert in Big Data analytics engines and serverless computing.