Simplify the developer experience with OpenShift for Big Data processing by using the Lithops framework

Gil Vernik
5 min read · Sep 17, 2021


To reduce developers’ efforts and allow them to focus on their business logic, we introduce the Lithops framework — designed to simplify integrating code with OpenShift while providing advanced capabilities.

Serverless computing is a widely popular topic attracting attention from both industry and academia. The unique user experience of deploying applications without having to manually set up VMs or choose the required resources is what makes serverless so attractive. Obviously, this does not mean there are no servers behind the scenes; it simply means users no longer need to worry about them. As serverless computing continues to grow, application containers such as Docker have become a popular choice for packaging software, and Kubernetes-based platforms are widely used to scale containers across distributed clusters.

This blog targets developers who need to containerize their applications in order to scale them over Kubernetes clusters. We discuss the effort required to containerize an application and deploy it to a Kubernetes platform. To address these challenges, we introduce the open source Lithops framework, which provides an easy “push the button” path to Kubernetes. Developers can focus on their code, while Lithops knows how to containerize it and deliver it to the Kubernetes cluster.

The road to OpenShift

Red Hat OpenShift is the leading enterprise-ready container platform. It is based on Kubernetes and is capable of massively scaling and executing Docker containers. It can be installed on premises, in the cloud, or used in hybrid solutions. While all this sounds promising, it is still a non-trivial task for developers to containerize their software and deploy it to OpenShift. We now discuss some of the challenges developers may face when moving to OpenShift.

The learning curve of APIs

Developers who need to use OpenShift must choose among the Knative API, the OpenWhisk API, or the Kubernetes Job descriptor API. In each case, they need to learn new API semantics, perhaps use different third-party SDKs, or even call a REST API directly. The learning curve, while not complicated, still consumes time and requires skill to tune the application for optimal usage.

To package the code or not to package?

While Docker is a popular choice for packaging applications and their dependencies, it is not trivial to find the right way to execute proprietary code from a container image. If the code is packed into a Docker image, then publishing it on the public Docker Hub could be problematic for proprietary code. Private container registries may resolve this, but they add another layer of complexity: setting up and maintaining the registries, sharing containers, and so on.

Decide on the right scale

While OpenShift is designed to scale any number of containers, it can be difficult to decide how many invocations a certain workflow requires when the input dataset is located in persistent storage. This becomes even more challenging if each invocation needs to process a different subset of the input data and perhaps generate different output, which then needs to be collected. For example, consider code that needs to pre-process a large CSV file and generate a new, processed dataset. In this scenario, each OpenShift invocation needs to process a different chunk of the input CSV file and persist the results to persistent storage, such as Ceph or cloud object storage. To support such a scenario, the developer needs to write boilerplate code to partition the input dataset into chunks and then process them in parallel. The developer also needs to decide how to enable each invocation to process a particular chunk, how to coordinate results across invocations, where to persist the results, how to collect them, and so on. While all of the above is solvable, a lot of effort goes into implementing and then testing everything.
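
To give a feel for what that boilerplate looks like, here is a hypothetical sketch; the sizes, worker count, and the submit_invocation() stub are illustrative placeholders, not code from any real deployment.

```python
# Hypothetical sketch of the kind of boilerplate a developer ends up writing by hand.

TOTAL_SIZE = 10 * 1024 ** 3   # size of the input CSV in bytes (example value)
NUM_WORKERS = 100             # number of parallel invocations -- chosen manually

def chunk_ranges(total_size, num_workers):
    """Split the input into byte ranges, one range per invocation."""
    step = total_size // num_workers
    return [(i * step, total_size if i == num_workers - 1 else (i + 1) * step)
            for i in range(num_workers)]

def submit_invocation(start, end):
    # In a real setup this would render a Kubernetes Job or Knative spec,
    # pass the byte range to the container, read that range from object
    # storage, align it to CSV row boundaries, run the business logic, and
    # persist a partial result that later has to be collected and merged.
    print(f'would launch a container for bytes {start}-{end}')

for start, end in chunk_ranges(TOTAL_SIZE, NUM_WORKERS):
    submit_invocation(start, end)
```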

Authentication with external resources

When an OpenShift invocation needs to access an external service and authenticate against it, this adds more complexity that needs to be handled. Consider thousands of invocations, each of which needs to set up a connection to access an object storage service. In this case, each invocation performs its own authentication. But if thousands of invocations do this simultaneously, the object storage provider is likely to throttle the requests. To resolve this, the code needs to authenticate once and share tokens among invocations. This adds another layer of implementation complexity: how to share tokens, how to re-authenticate when a token expires, and so on.

Make it simpler with Lithops

Tackling the above challenges requires a lot of effort by developers — and this is on top of creating the actual business logic of their applications. This situation could eventually force developers to implement a sort of middle-framework, if only to assist them in deploying their applications to the OpenShift cluster. To reduce these efforts and enable developers to focus on their business logic, we introduce the Lithops framework. It’s designed to greatly reduce developers’ efforts and provide them with advanced capabilities for integrating their code with OpenShift. The following example demonstrates how Lithops simplifies deployment to OpenShift.
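
A minimal sketch of such a program, assuming a working Lithops installation (the function and the input list are illustrative):

```python
import lithops

def my_map_function(x):
    # plain Python business logic -- no Kubernetes, Knative, or Docker API calls
    return x * 2

if __name__ == '__main__':
    fexec = lithops.FunctionExecutor()        # backend and storage come from the Lithops config
    fexec.map(my_map_function, [1, 2, 3, 4])  # one invocation per input element
    print(fexec.get_result())                 # gathers the results from all invocations
```

Lithops containerizes my_map_function, deploys it to the configured backend, and scales the invocations, so no Job or Knative specs have to be written by hand.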

Using Lithops eliminates the need to learn the Knative API, to figure out how to invoke executions where each execution receives a single parameter from the input dataset, to collect the results, and so on. Lithops exposes Python's futures API and the Python multiprocessing API. It also supports various backends. For example, if Knative is not available and a developer wishes to use the standard Kubernetes API, the only required change is to switch the backend, as shown below.
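
A sketch of such a backend switch, assuming the backend names (knative, k8s) documented by Lithops match your installed version:

```python
import lithops

def double(x):
    return x * 2

# With Knative installed on the cluster:
fexec = lithops.FunctionExecutor(backend='knative')

# Without Knative, switch to the standard Kubernetes Jobs backend -- the only
# change is the backend name (it can also be set once in the Lithops config file):
# fexec = lithops.FunctionExecutor(backend='k8s')

fexec.map(double, range(10))
print(fexec.get_result())
```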

Lithops also contains an advanced data partitioner that supports various data types as well as the chunking of CSV files. For example, if the data is located in IBM Cloud Object Storage and each invocation needs to process a single file, the code would look as follows.
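
A sketch of such a program, with an illustrative bucket and prefix, assuming IBM Cloud Object Storage is configured as the Lithops storage backend:

```python
import lithops

def my_map_function(obj):
    # 'obj' is injected by Lithops: one object from the given bucket/prefix per invocation
    data = obj.data_stream.read()    # the content of that single object
    return len(data)                 # real business logic would go here

fexec = lithops.FunctionExecutor()
fexec.map(my_map_function, 'cos://my-bucket/my-dataset/')   # illustrative URI
print(fexec.get_result())
```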

The Lithops partition discovery will automatically assign each invocation a single object from the given URI, hiding all object storage semantics from the developer. Internally, Lithops performs authentication and token sharing, and hides all the complexity associated with accessing object storage.
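
When the input is a single large CSV file rather than many objects, the same map call can ask the partitioner to split it into chunks. A sketch, assuming the obj_chunk_size parameter is available in your Lithops version and using an illustrative object URI:

```python
import lithops

def process_rows(obj):
    # each invocation receives one chunk of the CSV, aligned to row boundaries by Lithops
    rows = obj.data_stream.read().decode('utf-8').splitlines()
    return len(rows)                 # real pre-processing logic would go here

fexec = lithops.FunctionExecutor()
fexec.map(process_rows, 'cos://my-bucket/large-file.csv',   # illustrative URI
          obj_chunk_size=64 * 1024 ** 2)                    # ~64 MB per invocation
print(fexec.get_result())
```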

Summary and next steps

While OpenShift is a great platform on which to massively scale Docker containers, we learned that it may require substantial effort to adapt an application or its code before OpenShift can be used efficiently for big data processing. We saw how Lithops can greatly reduce the amount of work required by developers to integrate with and benefit from OpenShift. You can easily try it out and see the benefits for yourself: set up Lithops, configure the Knative or Kubernetes Jobs API backend, and check out how Lithops simplifies your developer experience. You are welcome to start with the Lithops applications and examples. In upcoming blog posts, we will further demonstrate the value Lithops provides for developers who want to use OpenShift, or even hybrid clouds, for their workloads.
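
As a starting point, here is a minimal "hello world" to verify a setup, assuming Lithops is installed (pip install lithops) and a backend is configured as described in the Lithops documentation; the function is illustrative:

```python
import lithops

def hello(name):
    return f'Hello {name}!'

fexec = lithops.FunctionExecutor()     # picks up the configured backend and storage
fexec.call_async(hello, 'OpenShift')   # a single remote invocation
print(fexec.get_result())
```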

Stay tuned!


Gil Vernik

I am a hands-on senior architect and technical team leader at IBM Research. I am an expert in Big Data analytics engines and serverless computing.