Spotlight: CLAIMED–The Open Source Framework for Building Coarse-Grained Operators for Accelerated Discovery in Science

Tim Bonnemann
Open-Source Science (OSSci)
9 min read · Oct 26, 2023

Welcome to our new Spotlight Series, where we get to know science-focused open-source software projects in order to better understand how Open-Source Science (OSSci) can add value and help accelerate scientific research through better open source.

Welcome to Spotlight! First, introductions. Who are you? And what do you do?

My name is Romeo Kienzler. I work for IBM Research as a Data and Software Engineer.

Tell us about CLAIMED. What problem are you trying to solve?

In modern data-driven science, reproducibility and reusability are key challenges. Scientists are well skilled in the process from data to publication. Although some publication channels require source code and data to be made accessible, rerunning and verifying experiments is usually hard due to a lack of standards, so reusing existing scientific data processing code from state-of-the-art research is hard as well. This is why we introduced CLAIMED, which has a proven track record in scientific research of addressing the repeatability and reusability issues in modern data-driven science. CLAIMED is a framework for building reusable operators and scalable scientific workflows: it supports scientists in drawing on previous work by re-composing workflows from existing libraries of coarse-grained scientific operators. Various implementations exist, and CLAIMED is programming language, scientific library, and execution environment agnostic.

How did the project come about?

In our research departments, we collaborate with Citizen Data Scientists (CDS) who work extensively with large datasets in fields like computer vision, time series analysis, and NLP. However, we face challenges with existing approaches, such as monolithic scripts lacking quality and reproducibility. We’ve identified the following requirements to improve data-driven research:

  • Low-code/no-code environment with Jupyter notebooks for rapid prototyping
  • Scalability during development and deployment
  • GPU support for big data processing
  • Pre-built components for various research domains
  • Support for popular Python and R libraries
  • Extensibility for future advancements
  • Ensuring research reproducibility
  • Data lineage tracking
  • Facilitating collaboration among researchers

We evaluated various software tools but found none fully meeting our needs. To address this, we created the CLAIMED framework for low-code/no-code environments in data-driven science.

How does it work?

CLAIMED is a library for low-code and no-code workflows in AI, ML, ETL, and data science. It offers pre-built components for various domains, supports multiple languages, and runs on different engines such as Kubernetes, Knative Serverless, Airflow, and more. We’ve introduced a command-line tool for interactive use, emphasizing a “write once, run anywhere” approach. CLAIMED facilitates rapid prototyping and seamless integration with CI/CD into production. Components are deployable as Docker containers or Kubernetes services and can be used across various engines. IBM Watson Studio Pipelines includes CLAIMED as a core component.

The CLAIMED Framework consists of three key components:

  1. CLAIMED operator source code: Tested and production-ready open-source scripts and notebooks.
  2. CLAIMED component compiler (C³): Turns source code into deployable operators and adds them to catalogs.
  3. CLAIMED operator consumers: Including the CLAIMED CLI, Kubeflow, and IBM watsonx.ai.
How CLAIMED core components interact
C³ is language and framework agnostic. It automatically creates container images containing all code and required libraries in the correct versions, which are then pushed to a registry. In addition, C³ creates all necessary deployment descriptors for the different target runtime platforms.

C³ is a compilation tool that simplifies the process of turning source code into reusable operators. It allows the operator’s author to focus on development while it handles additional tasks like creating Dockerfiles and Kubeflow Pipeline YAML deployment descriptors. The operator’s interface is derived from the source code, following minimal conventions for compilation. Failure to follow these conventions results in detailed compilation errors.
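The exact conventions are defined in the C³ documentation; as a purely illustrative sketch (the file name, parameter names, and the sample operator below are hypothetical, not the official convention), an operator source file might read every parameter from an environment variable, with a describing comment above it, so that a compiler can derive the operator's interface directly from the source:

```python
"""Hypothetical CLAIMED-style operator: filter CSV rows by a threshold.

Every input parameter is read from an environment variable and preceded
by a comment, so an interface (name, description, default) could be
derived from the source code alone.
"""
import csv
import os

# path of the CSV file to filter
input_path = os.environ.get("input_path", "data.csv")
# column to filter on
column = os.environ.get("column", "value")
# minimum value a row must exceed in `column` to be kept
threshold = float(os.environ.get("threshold", "4"))
# path of the filtered CSV file to write
output_path = os.environ.get("output_path", "filtered.csv")

# create a tiny sample input so this sketch is self-contained
if not os.path.exists(input_path):
    with open(input_path, "w", newline="") as f:
        f.write("value\n1\n5\n10\n")

with open(input_path, newline="") as src, open(output_path, "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if float(row[column]) > threshold:
            writer.writerow(row)
```

Because the script is a plain, self-contained program, the same source can run locally, inside a Docker container, or as a pipeline step, with the orchestrator simply supplying the environment variables.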

The “Write Once, Run Anywhere” (WORA) concept dates back to 1995 with the Java Virtual Machine (JVM), proposed as a universal platform for applications. With the rise of non-JVM languages like Go, JavaScript, and Python, along with Docker containerization, the WORA ecosystem has evolved. Kubernetes has become the leading container orchestrator, and Red Hat OpenShift the leading Kubernetes distribution.

CLAIMED leverages the WORA promise by creating operators once and using C³ (based on Red Hat’s S2I) to deploy them across various container platforms and workflow orchestration systems. CLAIMED operators play a vital role in Data & AI workflows, delivering speed, reproducibility, verifiability, and reusability to open-source science and accelerated discovery.

When did you launch and how has the project been growing since?

This project originally started in 2015 as supplementary material for IBM’s MOOCs on Coursera and edX (which, by the way, more than 1,000,000 learners have already taken), managed through GitHub to allow learners not only to use the code but also to contribute back code related to the courses. We noticed that, without intending to, we had created a library of notebooks solving the majority of common data science problems. Besides hardening those notebooks for production use, we also created automation scripts for containerizing them to run on Kubernetes and OpenShift. This is when CLAIMED was born.

On November 17, 2022, CLAIMED joined the Linux Foundation AI Incubation. This partnership allows us to utilize the LFX Platform for detailed project analytics. The charts below present project statistics, captured on May 30, 2023, covering the past three years of CLAIMED’s development:

  • Commits made across all repositories
  • Commits trend — the number of commits performed during the selected time period
  • Contributor growth — new contributors actively contributing to the project
  • Lines of code added across all unique commits
  • Pull requests/changeset history — the total number of pull requests submitted and merged

What about resources? How do you support your work?

At first, it was tough to find people who were excited and skilled to help with our open-source project. I led a team at IBM Silicon Valley Lab, which gave us some resources to start with. We began by getting help from people within IBM. We also brought in new talent through IBM’s trainee program called “Jumpstart.” We expanded our reach by teaming up with the Rensselaer Center for Open Source Software.

Later, we worked with the Linux Foundation, which helped our project become well-known. As it gained popularity, more developers wanted to join and help out. This was a big moment for our project’s growth.

How is CLAIMED being applied in research? Could you name a couple of examples?

Absolutely, here you go…

Example 1: Classification of Computed Tomography (CT) scans for COVID-19

CLAIMED enables users to easily create workflows using operators that can run on different platforms, ensuring a seamless developer experience. As a demonstration, we built a workflow exclusively with CLAIMED operators to classify COVID-19 status in Computed Tomography (CT) scans using the publicly available [covidata] dataset. We employed the Elyra [elyra] Pipeline Visual Editor, which supports local, Airflow, and Kubeflow execution. To underscore CLAIMED’s capabilities, we used the AIX360/LIME [aix360] library to highlight the limitations of a subpar deep learning model that focused solely on the bones in the CT scans. With CLAIMED, even those without coding experience can use its operators in a low-code environment, making it quick and simple to produce effective results.

Thanks to the already available AIX operators, the researchers at University Hospital Basel were able to identify a biased model. The model used latent information (gender, in this case, derived from bone structures) to improve classification performance as classes were heavily skewed on the gender dimension.

Example 2: Geospatial-Temporal Data Analysis

In geospatial-temporal data analysis, there are two main data sources: Earth observation (EO) data and climate data. EO data is gathered from satellites, planes, and drones, offering various image resolutions of the planet. Climate data, on the other hand, is typically collected or simulated by climate models in different spatial and temporal resolutions.

Initially, we query and filter data from Sentinel-2 and Landsat sources using the query-geodn-discovery operator. After aligning them to the same Coordinate Reference System (CRS) with the regrid operator, we combine these datasets based on their spatial and temporal keys. In parallel, we retrieve vector data from OSM in a PostGIS database using the postgis-connector operators. The generate-annotations operator transforms this vector data into polygon-label pairs, which are spatially linked to the satellite imagery using the join-spatial operator.

We eliminate cloud-covered images with the mask-clouds operator and normalize the images to address lighting and color variations. With the data prepared, we proceed to fine-tuning using the prithvi-finetune operator [prithvi], combining predefined operators with custom code. In this case, the IBM geospatial foundation model “prithvi” serves as the head for a trainable U-net architecture. After training with data from the train-test-split operator, the model undergoes testing with the test operator. If the metrics meet the desired criteria, the model is automatically deployed to the inference service of watsonx.ai, supported by Model Mesh.
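As a simplified illustration of how such a workflow is composed from named operators, the stages described above can be modeled as a small dependency graph. The operator names mirror the ones mentioned in the text, but the wiring below is an assumption for illustration, not the actual pipeline definition:

```python
from graphlib import TopologicalSorter

# Simplified dependency graph of the geospatial workflow described above:
# each operator maps to the set of operators it depends on.
pipeline = {
    "regrid": {"query-geodn-discovery"},
    "generate-annotations": {"postgis-connector"},
    "join-spatial": {"regrid", "generate-annotations"},
    "mask-clouds": {"join-spatial"},
    "train-test-split": {"mask-clouds"},
    "prithvi-finetune": {"train-test-split"},
    "test": {"prithvi-finetune"},
    "deploy-watsonx": {"test"},
}

# A valid execution order: every operator runs after its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

An orchestrator such as Kubeflow Pipelines resolves exactly this kind of dependency structure at runtime, scheduling independent branches (here, the satellite-imagery path and the OSM vector-data path) in parallel.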

This takes advantage of the tight integration of Kubeflow Pipelines into the Red Hat OpenShift Data Science Platform.

Have there been any challenges along the way?

The project has two main streams: the S2I compiler and the operator library. The S2I compiler was in a constant state of evolution, which necessitated adjustments to ensure compatibility with the evolving component library. As the project gained momentum, onboarding new contributors became a challenge, particularly due to the relatively complex nature of the compiler technology. While CLAIMED was designed to abstract away much of the underlying complexity, the S2I compiler demanded a comprehensive understanding of the underlying technologies such as Kubernetes, Kubeflow, Airflow, Docker, YAML, and more. Navigating these complexities while maintaining the project’s overarching goals required a concerted effort from the team.

What are your plans going forward?

C³ has similarities to Red Hat’s Source-to-Image (S2I), part of the OpenShift project. S2I helps build reproducible container images from source code: it combines the code, its dependencies, and a builder image that defines the build process. The open-source S2I project is on GitHub at https://github.com/openshift/source-to-image. Given Red Hat’s focus on open-source Data & AI, integrating C³ into S2I could be mutually beneficial for CLAIMED and Red Hat.

For people interested in learning more about CLAIMED, which resources can you recommend?

We’ve noticed that in enterprise production-grade usage the components themselves are often not open sourced; instead, companies build a library on top of the open-source component library using the C³ source-to-image compiler. Therefore, we recommend starting with the C³ documentation. Then, the following set of videos is a good start to understand how it all fits together:

[YouTube] CLAIMED: Create AI/ML Pipelines wo/ programming skills using JupyterLab, Elyra, KubeFlow, Kubernetes
[YouTube] How to create kubeflow pipelines and components using CLAIMED
[YouTube] How to create Kubeflow Pipeline Components/Kubernetes Jobs with Jupyter Notebooks #CLAIMEDframework

Last but not least, CLAIMED was among the six winning projects of the inaugural IEEE OSS Awards earlier this year. What was your impression of the award, and how useful do you think these forms of recognition are for people like you who are advancing open-source scientific software?

We were super happy to receive this award, as the bar to get it is quite high. This shows that our assumptions and solutions are going in the right direction. There are a couple of metrics open-source developers check before they consume or contribute to open source: the number of stars on GitHub, metrics on issues, pull requests, and commits, the number of contributors and how active they are, but also external recognition. And I think with this IEEE endorsement we finally ticked all those boxes and became one of the most successful IBM open-source projects.

Thanks, Romeo. Appreciate it!

Thanks for reading! Are you involved in an OSS project in science and would like to share your experience? Let us know!

