Efficiently Managing Large Data Sets in SAP Commerce: A Groovy Solution

Bruno Monteiro
8 min read · Jul 26, 2023


Introduction

In the dynamic world of SAP Commerce, handling vast quantities of data is part of the everyday routine. As businesses grow, so does the amount of data, reaching volumes that can sometimes exceed millions of records. However, updating or removing such extensive data sets is more complex than it seems.

The process can become a severe bottleneck: such extensive operations consume significant resources, slow down the system, and can affect server stability and responsiveness. This situation presents a challenge that many SAP Commerce developers have likely faced: How can we efficiently manage these large-scale data updates without compromising server performance?

In this article, we will tackle this challenge head-on. We'll explore a practical solution that employs Groovy scripts to break down extensive data sets into manageable chunks, coupled with a cron job that runs these updates systematically. This approach processes large-scale updates in small, controlled batches, reducing the impact on server performance and keeping your SAP Commerce operations running smoothly.

Join me as we delve into this issue, understand the rationale behind the solution, and walk through the steps to implement it in your SAP Commerce environment.

Understanding the Problem

Handling significant data updates is a challenging task in any system, but in SAP Commerce the complexity can escalate due to the robust nature of the platform and its vast data structures. We are talking about scenarios where we need to update or remove hundreds of thousands, or even millions, of records.

The natural first instinct is to initiate a direct operation and run a simple ImpEx over the entire dataset. However, this can result in performance hitches, server slowdowns, and, in extreme cases, system unavailability.

Let's consider a hypothetical yet familiar scenario: a massive update touched over a million products and, by accident, appended each product's code to its name. Now you must fix the data. Initiating a direct operation on the entire dataset would be akin to launching a heavy, resource-intensive process that runs for an extended duration.

You could build a flexible search query to find the products that need updating, that is, all the records whose name starts with the product code:

SELECT {p.pk}
FROM {Product AS p
      JOIN CatalogVersion AS cv
      ON {p.catalogversion} = {cv.pk}}
WHERE {p.name} LIKE CONCAT({p.code}, '%')
  AND {cv.version} = 'Staged'
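
Before deciding how to fix the data, it helps to know how many rows the query actually matches. Here is a minimal Groovy sketch for the HAC scripting console, assuming the same query and the standard flexibleSearchService bean, that counts the affected products:

import de.hybris.platform.servicelayer.search.FlexibleSearchQuery

def flexibleSearch = spring.getBean("flexibleSearchService")

// Count how many Staged products still have their code embedded in their name
def countFsq = "SELECT COUNT({p.pk}) FROM {Product AS p JOIN CatalogVersion AS cv ON {p.catalogversion}={cv.pk}} WHERE {p.name} LIKE CONCAT({p.code},'%') AND {cv.version} = 'Staged'"

def query = new FlexibleSearchQuery(countFsq)
query.setResultClassList([Integer.class])   // the query returns a number, not item models
def affected = flexibleSearch.search(query).getResult().get(0)
println("Products still to be fixed: " + affected)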

Even with a precise flexible search query that identifies the products to be corrected, the path forward is far from straightforward. An initial impulse might be to build an ImpEx script based on this search result and perform a bulk update. However, this approach is far from efficient.

Using an ImpEx script to update such a large dataset poses two significant challenges. Firstly, manipulating the result set of the flexible search query can be cumbersome due to its sheer volume. Trimming the returned product names to correct them requires significant computational resources.

Secondly, and more importantly, executing the ImpEx script on a dataset of this size would be a strenuous task for the system. Running such a substantial operation on the server, especially in a live environment, can significantly affect its performance. This operation's resource-intensive nature might lead to server slowdowns and, in extreme cases, could even render the system unavailable for the duration of the process.

The magnitude of the problem becomes apparent when you consider the potential disruption to ongoing operations and the sub-optimal user experience. So, what is the way out of this predicament? This is where the combination of Groovy scripts and cron jobs comes into play.

The Groovy Solution

To overcome the challenges of handling significant data updates in SAP Commerce, we can leverage the power of Groovy scripts in conjunction with cron jobs. Groovy is a powerful and flexible scripting language that seamlessly integrates with SAP Commerce, making it an ideal tool for this task.

Our approach involves creating a Groovy script that breaks down our large dataset into manageable 'chunks'. Instead of trying to update all the records simultaneously, we handle a specified number of records at a time, say 1500 records. This allows us to significantly reduce the computational load on the server at any given time, minimizing the impact on performance.

Consider our previous example, where the product codes were accidentally appended to the product names. We would create a Groovy script that identifies the first 1500 products that need updating and trims the product names accordingly.

import de.hybris.platform.servicelayer.search.FlexibleSearchQuery

flexibleSearch = spring.getBean("flexibleSearchService")
modelService = spring.getBean("modelService")

def maxResults = 1500
def fsq = "SELECT {p.pk} FROM {Product AS p JOIN CatalogVersion AS cv ON {p.catalogversion}={cv.pk}} WHERE {p.name} LIKE CONCAT({p.code},'%') AND {cv.version} = 'Staged'"

query = new FlexibleSearchQuery(fsq)
query.setCount(maxResults)      // process at most 1500 records per run
query.setNeedTotal(false)       // skip the expensive total-count calculation
products = flexibleSearch.search(query).getResult()

products.each {
    // There are several ways to change the data to the desired outcome.
    // For the purpose of this tutorial, we will keep it simple.
    it.setName(it.getName().replace(it.getCode(), ""))
    modelService.save(it)
    println(it.getCode() + ' - ' + it.getName() + ' updated!')
}

Then you would see in the output:

YMH1398 - Classic Guitar updated!
YMH1355 - Acoustic Guitar updated!
RLD1432 - Compact Keyboard updated!
.
.
.
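
A quick note on persistence: if saving each item individually becomes a noticeable overhead, one possible variation is to apply all changes in memory first and persist the whole chunk with a single saveAll call. A minimal sketch of that variation, reusing the products list and beans from the script above; whether it actually helps depends on your interceptors and transaction settings:

// Variation: apply all changes in memory first, then persist the chunk in one call
products.each {
    it.setName(it.getName().replace(it.getCode(), ""))
    println(it.getCode() + ' - ' + it.getName() + ' updated!')
}
modelService.saveAll(products)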

This script takes less than a minute to run, but we don't want to execute it manually until all the records have been processed. In the next section, we'll discuss setting up the cron job and how it helps us use the script efficiently to manage these updates.

The Scripting Job

A cron job is a handy tool that allows us to schedule and automate tasks on the server. However, creating a new cron job in the traditional way is time-consuming and entails several manual steps, including rebuilding and restarting the Platform.

This is where scripting jobs come in handy, because they can be created dynamically at runtime. Let's first adjust our previous script to return the result a cron job expects.

import de.hybris.platform.servicelayer.search.FlexibleSearchQuery
import de.hybris.platform.servicelayer.cronjob.PerformResult
import de.hybris.platform.cronjob.enums.CronJobResult
import de.hybris.platform.cronjob.enums.CronJobStatus

flexibleSearch = spring.getBean("flexibleSearchService")
modelService = spring.getBean("modelService")

def maxResults = 1500
def fsq = "SELECT {p.pk} FROM {Product AS p JOIN CatalogVersion AS cv ON {p.catalogversion}={cv.pk}} WHERE {p.name} LIKE CONCAT({p.code},'%') AND {cv.version} = 'Staged'"

query = new FlexibleSearchQuery(fsq)
query.setCount(maxResults)      // one chunk per run
query.setNeedTotal(false)
products = flexibleSearch.search(query).getResult()

if (products == null || products.isEmpty()) {
    // Nothing left to fix: report FINISHED so the trigger stops restarting the job
    log.info("No products found to be updated")
    return new PerformResult(CronJobResult.SUCCESS, CronJobStatus.FINISHED)
}

products.each {
    it.setName(it.getName().replace(it.getCode(), ""))
    modelService.save(it)
    log.info(it.getCode() + ' - ' + it.getName() + ' updated!')
}

// More chunks may remain: report ABORTED so the trigger fires the job again
return new PerformResult(CronJobResult.SUCCESS, CronJobStatus.ABORTED)

We've made three changes to our previous script. First, we changed 'println' to 'log.info'. Even though 'log.info' doesn't produce output when we run the script directly in the HAC, it is recognized by the scripting job and will be written to the cron job logs. Second, we now return a 'PerformResult' for the cron job. Finally, we check whether the flexible search returned any results to update.

And here's the 'trick'. If there are results to update, we perform the update and return a result of SUCCESS but a status of ABORTED. If there are no results, the status is FINISHED. Because the cron job is set to run only once (singleExecutable), the trigger will start the job again as long as its status is not FINISHED. Once all the records have been updated, the status becomes FINISHED and the trigger won't restart the job.

We can paste the script into the HAC scripting console, give it a code, and save it. For this example, we save it as trimProductNames.

The saved scripts can be found in the Backoffice under the 'Scripting' menu.

Now we only need to run three ImpEx statements, which can be executed together: one creates the actual ScriptingJob, the second creates the cron job, and the last creates the trigger for the cron job.

# Here we create the ScriptingJob using the code from our saved script.
INSERT_UPDATE ScriptingJob;code[unique=true];scriptURI
;trimProductNamesJob;model://trimProductNames

# Here we create the cron job from our job.
INSERT_UPDATE CronJob;code[unique=true];job(code);singleExecutable;sessionLanguage(isocode)
;trimProductNamesCronJob;trimProductNamesJob;true;en

# This will set the cron job to be fired every minute.
INSERT_UPDATE Trigger;cronjob(code)[unique=true];cronExpression
;trimProductNamesCronJob;0 * * * * ?

Once the ImpEx is run, the cron job will start firing within the next minute. For the first few runs, check each cron job log to ensure it is performing as expected.
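
When the job eventually finishes with status FINISHED and the data is confirmed to be fixed, it is also worth deactivating the trigger so it no longer fires every minute. Here is a small Groovy sketch for the HAC scripting console, assuming the cron job code from the ImpEx above and the standard platform beans, that switches the trigger off:

import de.hybris.platform.servicelayer.search.FlexibleSearchQuery

def flexibleSearch = spring.getBean("flexibleSearchService")
def modelService = spring.getBean("modelService")

// Find the trigger attached to our cron job and deactivate it
def fsq = "SELECT {t.pk} FROM {Trigger AS t JOIN CronJob AS cj ON {t.cronJob}={cj.pk}} WHERE {cj.code} = 'trimProductNamesCronJob'"
def triggers = flexibleSearch.search(new FlexibleSearchQuery(fsq)).getResult()

triggers.each {
    it.setActive(false)   // keep the trigger around but stop it from firing
    modelService.save(it)
}
println(triggers.size() + " trigger(s) deactivated")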

The Results

Performance Improvement: By handling data in manageable batches rather than in one large operation, we significantly reduce the load on the server at any given time. This helps to maintain system performance and avoid the risk of system downtime due to resource overuse.

Dynamic and Automated Processing: Combining the Groovy script with the cron job provides an automated process that dynamically processes all records needing an update. The script will continue to run until all records are updated, reducing manual intervention and allowing for more efficient use of resources.

Log Information: By using 'log.info' in our Groovy script, we ensure that all operations are logged in the cron job logs. This provides valuable information for debugging and review purposes.

Scalability: This solution is not limited to our product name update scenario; any flexible search you build can be plugged in, and the script can be adapted to handle any large-scale data update in SAP Commerce. This makes it a versatile tool for maintaining data integrity and consistency in your SAP Commerce environment.

Reduced Operational Risks: We enhance operational reliability by mitigating system slowdowns or unavailability during significant data updates. This ensures a consistent user experience and reduces the risk of operational disruption.

In real-world scenarios where this solution has been applied, we've seen significant improvements in performance, reliability, and overall system health during large-scale data updates.

By leveraging the power of Groovy scripting and the flexibility of cron jobs, we've created a more efficient, robust, and manageable process for dealing with data updates in SAP Commerce.

Conclusion

Managing large-scale data updates is a common challenge in SAP Commerce. In this article, we've looked at an approach that combines the power of Groovy scripts with the flexibility of cron jobs to address this issue. This approach improves system performance and provides an automated, dynamic, and scalable solution for handling massive data updates.

The key takeaway is that, by breaking down the data into manageable chunks, we significantly reduce the load on the server, which helps maintain system performance and ensures operational reliability. With the addition of cron job logs, we gain valuable insights into our operations, aiding in troubleshooting and system health checks.

Whether you are dealing with unintended product name updates or any other large-scale data changes, this solution provides a robust and efficient method for managing data in SAP Commerce. Remember, the power of this solution lies not only in its ability to handle the current problem but also in its adaptability to handle similar issues in the future.

However, it's worth noting the adage: "With great power comes great responsibility." The ability to handle massive data updates dynamically is a potent tool, but it also demands careful management. Be cautious about the scripts you build, test them thoroughly, and monitor initial runs to ensure they behave as expected. Cautious implementation and vigilance in monitoring will help avoid unexpected consequences and ensure the smooth operation of your SAP Commerce environment.

In the fast-paced and ever-evolving world of SAP Commerce, finding effective ways to manage and manipulate large data sets is crucial. As we've seen, sometimes the answer lies in combining existing tools in innovative ways to overcome challenges. So don't be afraid to experiment and innovate — you might find the perfect solution for your needs, and with responsible use, these tools can be incredibly beneficial.
