Expanding OpenVINO™ in Node.js: My Google Summer of Code 2024 Journey
Introduction
Participating in the Google Summer of Code (GSoC) 2024 has been a rewarding experience for me, as I got to contribute to a large-scale ML toolkit in OpenVINO. Since this was the first time I had contributed to such a large open-source project, I was not very confident that I would make it all the way through. In this blog post, I’ll walk you through my entire Google Summer of Code 2024 journey with OpenVINO: how I got selected, the knowledge I acquired, the obstacles I encountered along the way, and how I tackled them.
Journey Begins
After the GSoC dates were announced, I made a list of organizations according to my interests. OpenVINO stood out among them because of its promise: make AI inference faster and easier to deploy. I didn’t have much classic ML/DL experience, but I had built applications around LLMs like Llama, set up inference using vllm, and worked with diffusion models like Stable Diffusion, so inference as a problem sounded exciting to me.
I gave the OpenVINO projects list a glance and got to know about the prerequisite task, which was to submit a PR before submitting a proposal for GSoC. I immediately gave OpenVINO a shot, cloned and set up the dev environment, and started hunting for good first issues. Luckily, I found an issue to work on and some really nice people among the maintainers of the project to guide me through it. The issue was to add/expose API methods in the OpenVINO Node.js API. Since I had some background in JavaScript, I jumped right into the codebase. One thing that worked in my favor was that the JS API was very new; it didn’t have a very large codebase, so it was comparatively easy to follow and tinker with. I fixed the issue, wrote test cases, and raised my first PR. To be honest, it was not as difficult as I initially thought it would be. After some minor code-style remarks, my PR was merged into master. I got super pumped after my first PR was merged, because I now started to believe that I might have a chance if I gave it a good shot. For the next 2–3 weeks, I worked hard at understanding the codebase and taking up more issues, until the final day arrived to submit the proposal. I had submitted 3 PRs by then, all of which were merged. While contributing extensively to the JS API, I had made up my mind that I would submit my GSoC proposal for a project involving the JS API, which was either #2 or #3. I went with #2 because it would give me more opportunities to understand the Node.js API in depth. I submitted my proposal, and voilà, I got selected.
After getting selected, I had to meet with my mentors, who told me to go through all the samples and gave me the freedom to select the ones I wanted to implement. After going through the samples, I found those related to Computer Vision the most intriguing, so I chose both samples in the Computer Vision domain: Optical Character Recognition and Vision Background Removal.
Project Motivation
The Need for OpenVINO in Node.js:
OpenVINO had recently introduced a Node.js API, which was still new and in active development. With the Node.js API, many developers could now leverage the powerful inference capabilities of the OpenVINO framework directly in their web applications. With this, we were targeting a new audience of developers, as JavaScript is one of the most widely used languages in the world. Although the JavaScript ecosystem for machine learning is not as mature as Python's, we wanted to do our part with OpenVINO to help it grow.
Optical Character Recognition Sample
Overview
The Optical Character Recognition (OCR) sample converts documents and images containing text into machine-readable data. It’s an end-to-end text detection and text recognition sample using the OpenVINO Node.js API. We start with an image that contains some text; a text detection model identifies the text regions and extracts their bounding boxes. We then pass the cropped bounding-box images to the text recognition model to transcribe the text within them.
Model Selection
For this project, we utilized two specific models: horizontal-text-detection-0001 and text-recognition-resnet. The horizontal-text-detection-0001 model is responsible for detecting text regions in an image, focusing on horizontal text alignment. Once text regions are identified, the text-recognition-resnet model processes these regions to accurately recognize and transcribe the text content.
Implementation Details
The implementation of Optical Character Recognition (OCR) can be split into 4 stages: model initialization, image preprocessing, inference, and post-processing.
Model Initialization: The process begins with the initialization of the OpenVINO core, followed by loading two pre-trained models: horizontal-text-detection-0001 for detecting text regions and text-recognition-resnet for recognizing the detected text.
Image Preprocessing:
The input image is first resized to meet the expected dimensions of the detection model using OpenCV’s cv.resize() function. This step is common to almost all machine learning samples, as resizing the input according to the model’s requirements is essential.
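In the sample, opencv-wasm’s cv.resize() does this work. Conceptually, resizing maps each destination pixel back to a source pixel; here is a minimal nearest-neighbour sketch over a flat RGB buffer (illustrative only, not the sample’s actual code):

```javascript
// Nearest-neighbour resize of a flat RGB pixel buffer (3 bytes per pixel).
// Illustrative sketch; the actual sample delegates this to opencv-wasm's cv.resize().
function resizeNearest(src, srcW, srcH, dstW, dstH) {
  const dst = new Uint8Array(dstW * dstH * 3);
  for (let y = 0; y < dstH; y++) {
    // Map the destination row back to the closest source row
    const sy = Math.min(srcH - 1, Math.floor((y * srcH) / dstH));
    for (let x = 0; x < dstW; x++) {
      const sx = Math.min(srcW - 1, Math.floor((x * srcW) / dstW));
      const s = (sy * srcW + sx) * 3;
      const d = (y * dstW + x) * 3;
      dst[d] = src[s];         // R
      dst[d + 1] = src[s + 1]; // G
      dst[d + 2] = src[s + 2]; // B
    }
  }
  return dst;
}
```

A real resize would typically use bilinear interpolation for better quality, which is what cv.resize() does by default.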
Text Detection and Bounding Box Extraction:
Once the image is preprocessed, it is fed into the detection model, which outputs bounding boxes that outline the detected text regions. These bounding boxes are then extracted and resized to match the input requirements of the text recognition model, and the regions are converted to grayscale using the convertToGrayscale helper function, which the text recognition model requires.
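The grayscale conversion itself is a weighted sum of the RGB channels. A plain-JavaScript sketch using the standard Rec. 601 luminance weights (the sample’s convertToGrayscale helper relies on opencv-wasm instead):

```javascript
// Convert a flat RGB buffer to single-channel grayscale using the
// Rec. 601 luminance weights (the same formula OpenCV's cvtColor
// applies for RGB-to-gray). Illustrative sketch only.
function toGrayscale(rgb) {
  const gray = new Uint8Array(rgb.length / 3);
  for (let i = 0; i < gray.length; i++) {
    const r = rgb[i * 3];
    const g = rgb[i * 3 + 1];
    const b = rgb[i * 3 + 2];
    gray[i] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}
```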
Here are the bounding boxes detected in the sample image:
Bounding Box 1: {"xMin":391,"yMin":59,"xMax":550,"yMax":99}
Bounding Box 2: {"xMin":256,"yMin":50,"xMax":377,"yMax":88}
Bounding Box 3: {"xMin":604,"yMin":205,"xMax":653,"yMax":228}
Bounding Box 4: {"xMin":203,"yMin":46,"xMax":229,"yMax":78}
Bounding Box 5: {"xMin":507,"yMin":408,"xMax":538,"yMax":431}
Bounding Box 6: {"xMin":21,"yMin":33,"xMax":184,"yMax":75}
These bounding boxes correspond to specific text regions in the image, marking the areas where the text is located.
Text Recognition:
The extracted bounding boxes are then processed asynchronously using the inferAsyncProcess() function, taking advantage of OpenVINO's async inference capabilities. The recognized text is then sorted and mapped back onto the original image.
Visualization and Output
Once the text recognition model generates the annotations, we visualize the results by highlighting the detected text regions directly on the original image. We draw bounding boxes around the recognized text areas and add labels above each box with the corresponding recognized text.
All of that is done inside the convertResultToImage function, which scales the bounding box coordinates to match the original image dimensions, draws green rectangles to highlight each detected region, and adds a label with the recognized text just above the bounding box. The font size for the label text is determined using a helper function, getTextSize (this is a workaround; more on this at the end), to ensure that it fits nicely within the designated space. The final output is a visually annotated image that highlights the detected text regions and displays the extracted text as labels. Below is the final image with the results visualized.
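Scaling boxes from the detection model’s resolution back to the original image is simple ratio arithmetic. A sketch, with box fields following the {xMin, yMin, xMax, yMax} shape printed above (the sample’s own convertResultToImage does the real work):

```javascript
// Scale a bounding box from the detection model's input resolution
// back to the original image resolution. Field names mirror the
// boxes printed earlier; this is an illustrative sketch.
function scaleBox(box, fromW, fromH, toW, toH) {
  const sx = toW / fromW;
  const sy = toH / fromH;
  return {
    xMin: Math.round(box.xMin * sx),
    yMin: Math.round(box.yMin * sy),
    xMax: Math.round(box.xMax * sx),
    yMax: Math.round(box.yMax * sy),
  };
}
```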
Vision Background Removal
Overview
Vision Background Removal has various applications, the most common one being in photo editing, where you need to isolate subjects from their backgrounds. It’s a segmentation example using the OpenVINO Node.js API. We start with an image that has a primary subject, and after applying preprocessing on the input image and passing it through the model, it provides a binary mask that segments the foreground and background of the input image. We then use that mask to manipulate pixels and obtain our desired image.
Model Selection
For this task, the U²-Net model was selected because it is extremely lightweight and performs well in salient object detection. Its nested U-structure enables it to capture fine details at multiple scales, making it ideal for background removal tasks.
Implementation Details
The implementation of image background removal with U²-Net using OpenVINO in Node.js can be divided into four steps: Model Initialization, Image Preparation, Mask Processing, and Background Replacement.
Model Initialization: First, the U²-Net model is downloaded along with the two input images: the first image contains the subject, and the second image contains the background where we want the subject to be placed. After downloading all the assets, we initialize the OpenVINO core and load the compiled model weights.
Image Preparation: The foreground image is first resized to match the input shape expected by the U²-Net model. The image data is then normalized using mean and scale values before being transformed into a tensor suitable for inference. This tensor is fed into the model to predict a binary mask, which highlights the foreground object.
// Wrap the preprocessed pixel data in a tensor matching the model's input shape
const tensorData = new Float32Array(inputImage);
const tensor = new ov.Tensor(ov.element.f32, inputLayer.shape, tensorData);
// Run asynchronous inference on the compiled model
const inferRequest = compiledModel.createInferRequest();
const inferResult = await inferRequest.inferAsync([tensor]);
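The mean/scale normalization mentioned above is per-channel arithmetic: subtract the channel mean, then divide by the channel scale. A sketch in plain JavaScript (the mean and scale values here are placeholders, not the sample’s actual constants):

```javascript
// Normalize a flat interleaved-RGB float buffer channel-wise:
// (pixel - mean[channel]) / scale[channel].
// The mean/scale arrays are placeholders for illustration; the real
// sample uses the values the U^2-Net model was trained with.
function normalize(pixels, mean, scale) {
  const out = new Float32Array(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    const c = i % 3; // channel index for interleaved RGB data
    out[i] = (pixels[i] - mean[c]) / scale[c];
  }
  return out;
}
```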
Mask Processing: The model’s output is reshaped and resized back to the original image dimensions. This mask is used to separate the foreground from the background by iterating over the pixels and setting all background pixels to white, effectively removing the background.
removeBackground(resizedResult, bgRemovedResult);
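Conceptually, removeBackground walks the pixels and whitens everything the mask marks as background. A plain-array sketch of that idea (the real helper operates on opencv-wasm Mat objects, so names and types here are illustrative):

```javascript
// Set every background pixel (mask value 0) to white in a flat RGB buffer.
// Conceptual sketch of what the sample's removeBackground helper does;
// the actual implementation works on opencv-wasm Mat objects.
function whitenBackground(rgb, mask) {
  const out = Uint8Array.from(rgb); // copy so the input stays untouched
  for (let i = 0; i < mask.length; i++) {
    if (mask[i] === 0) {
      out[i * 3] = 255;
      out[i * 3 + 1] = 255;
      out[i * 3 + 2] = 255;
    }
  }
  return out;
}
```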
Background Replacement: Once the foreground has been successfully isolated, the next step is to prepare the new background. The background image is resized to match the original image dimensions. Before combining it with the foreground, the mask area is removed from the background image, effectively creating a blank space where the foreground object will be placed.
removeForeground(resizedResult, resizedBgrImageMat);
Finally, the foreground is merged with the adjusted background by aligning the pixels based on the mask. This creates a composite image where the original background has been replaced.
combineImages(resizedResult, bgRemovedResult, resizedBgrImageMat, newImage);
Challenges and Workarounds
Implementing these samples in Node.js with OpenVINO presented several challenges, primarily due to the differences between the JavaScript and Python ecosystems for machine learning. One significant issue was handling inference output. In the OpenVINO Python API, the output from the inference is automatically shaped according to the output layer, whereas in the JavaScript API, the output array is flattened. This required additional steps to manually reshape the output, adding complexity to the implementation.
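For example, regrouping a flattened output into nested arrays according to the output layer’s shape can be done recursively. A minimal sketch of the manual reshaping the JavaScript API currently requires:

```javascript
// Regroup a flat inference output into nested arrays according to a shape,
// e.g. shape [2, 3] turns 6 values into 2 rows of 3 columns.
// Works on plain arrays and typed arrays alike.
function reshape(flat, shape) {
  if (shape.length === 1) return Array.from(flat);
  const size = shape.slice(1).reduce((a, b) => a * b, 1);
  const out = [];
  for (let i = 0; i < shape[0]; i++) {
    out.push(reshape(flat.slice(i * size, (i + 1) * size), shape.slice(1)));
  }
  return out;
}
```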
In the JavaScript environment, I had to use the opencv-wasm package for image-related operations, which does not support all the APIs available in opencv-python. One of the challenges was the lack of an equivalent JavaScript implementation for cv2.getTextSize(), which is used in the OCR Python sample to determine the dimensions of cropped text. To overcome this, I implemented a custom getTextSize function using the Canvas API to approximate the required functionality. Additionally, common operations like image cropping, which can be achieved in Python with simple NumPy array slicing, required more complex handling in JavaScript using OpenCV to determine the region of interest (ROI). These challenges highlighted the relative immaturity of the JavaScript ML ecosystem compared to Python's, necessitating creative workarounds to achieve similar results.
Conclusion
Summary of Achievements
During my Google Summer of Code (GSoC) 2024, I successfully met and even exceeded the goals I had originally set for my project with the OpenVINO Toolkit. My work primarily involved translating complex machine learning models and processes from Python to JavaScript, making them accessible to the broader JavaScript development community. Through this project, I developed two key samples: Optical Character Recognition (OCR) and Vision Background Removal, both leveraging the OpenVINO JavaScript API. These samples demonstrate the capabilities of the OpenVINO Node.js API in common machine-learning applications and also serve as a resource for other developers looking to implement similar solutions in Node.js.
Personal Reflections
My journey through GSoC was transformative, both professionally and personally. Engaging with the OpenVINO project allowed me to deepen my understanding of best practices in software development, particularly in writing modular, maintainable code within a large, complex codebase. It also taught me the importance of collaboration and good documentation, which were crucial in navigating the challenges I faced. Interacting with mentors and contributors from diverse backgrounds enriched my learning experience and provided me with new perspectives. The challenges I encountered, particularly in adapting Python samples to JavaScript, pushed me to grow as a developer, and I am eager to continue contributing and keep learning with OpenVINO.
Acknowledgments
I would like to express my deepest gratitude to the Google Summer of Code program for providing this incredible opportunity to contribute to open source. Participating in GSoC has been a transformative experience that has significantly enhanced my programming skills and broadened my understanding of large-scale software development.
I extend my heartfelt thanks to my mentors, Nikolai Vishniakov and Alicja Miłoszewska, whose unwavering support and guidance were instrumental in my GSoC journey. Their consistent communication, insightful feedback, and willingness to go the extra mile made all the difference. I am also immensely grateful to the entire OpenVINO community for their warm welcome and the collaborative environment they fostered. The regular demo meetings and feedback sessions were invaluable in helping me stay on track and continuously improve. This journey would not have been as enriching without their collective efforts, and I look forward to continuing my contributions to this incredible community.
Additional Resources
Below are the links to my work for those interested in exploring further:
Live Demos:
The live demos are hosted on CodeSandbox, providing an interactive environment to experience the samples firsthand.
Sample Notebooks:
I have implemented the samples as Node.js notebooks using the TypeScript Notebooks extension for VS Code, providing an interactive and easily accessible way to understand the implementation.
Sample Scripts:
In addition to the notebooks, I have also provided the samples as standalone scripts, allowing for direct execution and integration into your own projects.