How we migrated a million assets into AEM Cloud Service DAM

Saravana Prakash
9 min read · Feb 4, 2024

--

“3 developers against a million assets.”

“All we have to decide is what to do with the time that is given to us.” — Gandalf

Prelude

Our journey started as a regular DAM migration project, attempting to migrate a million assets from a non-AEM source into AEM DAM. This was for an ecommerce site, migrating product images plus related marketing assets. The legacy site used a simple CDN to just deliver images, with no image optimization or content management abilities. The business wanted the best practices of Adobe Dynamic Media along with AEM DAM capabilities.

AEM OOTB tools didn't help

When we drafted the requirements on a Miro board, we realized that the Adobe-advocated ways to migrate assets in bulk wouldn't meet our requirements.

AEM 6.5 Way:

This documentation applies to AEM 6.5, where we turn off workflows, run the imports, and finally turn the workflows back on. This does NOT apply to AEM as a Cloud Service. Cloud Service doesn't even ship the default "DAM Update Asset" workflow. Since we were on AEMaaCS, this method didn't help us.

Using Bulk Importer:

For AEM Cloud Service, Adobe's recommended way is to migrate using the Bulk Importer, explained here.

This method is highly efficient: binaries are transported cloud-natively with very good throughput and speed. It is definitely the first choice for a DAM migration.

But there's a caveat: this method requires a pre-processing step to organize all assets per the DAM taxonomy. A content team must sort and arrange assets into the right folder structure, the assets must have proper URL-friendly names, and the folders must be organized with SEO-optimized titles. This preparation requires significant content-team effort. If the source system already had assets organized respecting the DAM taxonomy, the Bulk Importer would work best. Not in our case: our source system was a flat file store, with all million assets dumped under a single folder and no ability or resources to sort them. We couldn't leverage the Bulk Importer for our project.
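
To make that preparation concrete, here is a tiny sketch of the kind of naming/taxonomy step the Bulk Importer expects to have already happened. This is an illustration only, not our production code; the record fields (`category`, `productLine`, `originalName`) are hypothetical, and each record is assumed to carry a file extension:

```javascript
// Illustrative only: build a DAM-taxonomy target path with URL-friendly
// names from a vendor asset record. Field names are hypothetical.
function slugify(value) {
  return value
    .toLowerCase()
    .trim()
    .replace(/[^a-z0-9]+/g, '-') // collapse anything non-alphanumeric into '-'
    .replace(/^-+|-+$/g, '');    // trim leading/trailing dashes
}

function buildDamPath(record) {
  const dot = record.originalName.lastIndexOf('.');
  const base = slugify(record.originalName.slice(0, dot));
  const ext = record.originalName.slice(dot).toLowerCase();
  const folders = [record.category, record.productLine].map(slugify);
  return `/content/dam/${folders.join('/')}/${base}${ext}`;
}

console.log(buildDamPath({
  category: 'Kitchen Appliances',
  productLine: 'Espresso & Coffee',
  originalName: 'IMG 0001 Final.JPG'
}));
// → /content/dam/kitchen-appliances/espresso-coffee/img-0001-final.jpg
```

With a million unsorted assets there was nothing to feed a function like this: the folder and naming decisions themselves were the missing content-team work.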

Using Adobe I/O Runtime:

This would be the second-best way to migrate assets into AEM Cloud, using the Asset Compute microservice. It's the optimal cloud-friendly way, as shown in the Adobe documentation. The microservice upload is a 3-step process:

  1. Initiate Upload
  2. Upload to binary store
  3. Complete Upload

The steps were simple:

  1. Create a new Node application. We leveraged Adobe App Builder to host it.
  2. Import the `@adobe/aem-upload` package.
  3. Call `new DirectBinary.DirectBinaryUpload()`.
  4. Example code is provided in their documentation:
const DirectBinary = require('@adobe/aem-upload');

// URL of the folder in AEM where assets will be uploaded. The folder
// must already exist.
const targetUrl = 'http://localhost:4502/content/dam/target';

// List of all local files that will be uploaded.
const uploadFiles = [
  {
    fileName: 'file1.jpg', // name of the file as it will appear in AEM
    fileSize: 1024, // total size, in bytes, of the file
    filePath: '/Users/me/Documents/my_file.jpg' // full path to the local file
  },
  {
    fileName: 'file2.jpg',
    fileSize: 512,
    filePath: '/Users/me/Documents/file2.jpg'
  }
];

const upload = new DirectBinary.DirectBinaryUpload();
const options = new DirectBinary.DirectBinaryUploadOptions()
  .withUrl(targetUrl)
  .withUploadFiles(uploadFiles);

// This call uploads the files. The method returns a Promise, which is
// resolved when all files have been uploaded.
upload.uploadFiles(options)
  .then(result => {
    console.log(result);
  })
  .catch(err => {
    console.error(err);
  });

We did experiment with this route and built the Node job to migrate assets. But this failed too.

Our challenging requirements

  1. Our vendor generates multiple JSON files of asset metadata, coming from different sources. Each JSON file carries different asset information.
  2. Each JSON is a flat map of asset records with URLs, hierarchy metadata, comparison dates, comparison hashids and 200+ metaproperties.
  3. At the receiving end, the AEM job should:
    a) Parse the JSON
    b) Determine the DAM hierarchy folders from the input metadata
    c) Create new folders if missing
    d) For existing assets, use the dates and hashids for de-duping
    e) Exclude obsolete data
    f) Import the valid assets
  4. After importing the assets, these 200+ metadata properties need to be persisted along with each asset, both for delivery and for the next day's de-dupe run.
  5. Whenever the vendor updates any non-asset metadata, any asset binary, or both, the multiple JSON files are re-published with all records, and the AEM job should repeat step 3 above.
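
The de-dupe decision in steps 3(d) and 3(e) can be sketched as a pure function. This is a simplified illustration under assumptions (each record carries a stable `id`, a content `hashid` and a `lastModified` date); in reality, the expensive part was building the `existing` side from AEM:

```javascript
// Illustrative de-dupe rules: decide per record whether to import,
// skip (unchanged), or drop as obsolete. Field names are assumptions.
function classify(record, existing) {
  const prior = existing.get(record.id);
  if (!prior) return 'import';                        // brand-new asset
  if (prior.hashid === record.hashid) return 'skip';  // nothing changed
  if (new Date(record.lastModified) <= new Date(prior.lastModified)) {
    return 'obsolete';                                // stale vendor record
  }
  return 'import';                                    // changed: re-import
}

const existing = new Map([
  ['sku-1', { hashid: 'aaa', lastModified: '2023-10-01' }]
]);

console.log(classify({ id: 'sku-9', hashid: 'ccc', lastModified: '2023-10-06' }, existing)); // import
console.log(classify({ id: 'sku-1', hashid: 'aaa', lastModified: '2023-10-01' }, existing)); // skip
console.log(classify({ id: 'sku-1', hashid: 'zzz', lastModified: '2023-09-30' }, existing)); // obsolete
```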

This de-duping requirement was the failure point for using just aem-upload. Our landscape looked like this:

The second step above became a very expensive operation: to run the de-duping and compare dates and hashids, the Node cron running on App Builder had to make multiple HTTP requests to the AEM server. We had ~1 million assets, and even after optimizing we estimated ~90K round trips daily just to compare metadata before importing into AEM. This approach was a failure.

Attempt 2 — Duplicate metadata store

Still attempting the aem-upload way, we tried to stand up a separate MongoDB as a duplicate datastore, running behind a Node/Express server. During ingestion, the de-duping ran against MongoDB. Once the Node app determined an asset was valid, the transport was initiated. Quickly our landscape shifted to:

But the caveat with this approach was the "Sync Metadata" step needed to cater to requirement #4 above. We wanted the metadata persisted in AEM along with the asset, because we render this metadata at asset delivery. And this metadata sync became an overhead.

Problem: the Asset Compute microservice runs asynchronously. After firing the binary import through Asset Compute, we had to fire a second trip to update the metadata, using the Assets HTTP API. Since these requests run asynchronously, the second (update) trip RACED AHEAD of the first (create) trip, so metadata updates failed randomly, causing issues at delivery. We also attempted a NEAR-REALTIME approach: create the assets first, then fire the metadata update requests an hour later. It added more overhead than it solved.
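
To make the missing sequencing guard explicit, here is a minimal sketch that polls until the asset is visible before the metadata update would be fired. `checkAssetExists` is a hypothetical function wrapping an Assets HTTP API lookup, injected here so the retry logic runs standalone; as noted, in practice this style of waiting only piled on overhead at our volume:

```javascript
// Illustrative confirm-before-update guard: poll until the create is
// visible, and only then is it safe to fire the metadata update.
async function waitForAsset(checkAssetExists, { retries = 10, delayMs = 1000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    if (await checkAssetExists()) return attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error('asset never appeared; giving up');
}

// Fake check standing in for AEM: succeeds on the third poll.
let calls = 0;
const fakeCheck = async () => ++calls >= 3;

waitForAsset(fakeCheck, { delayMs: 10 }).then((attempts) => {
  console.log(`asset visible after ${attempts} polls; safe to update metadata`);
});
```

Multiply that polling by a million assets and the overhead is obvious, which is why we abandoned this direction.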

Attempt 3 — Move the party in-house into AEM

Running an external Node application to create assets was a failure. We became convinced we should run the entire show in-house, inside AEM. This is NOT an Adobe-recommended practice, but we made that sacrifice to cater to our challenging requirements.

Using AssetManager: we moved back to the old-school AssetManager. In 2014 I worked on AEM 6.0 using the same AssetManager; nine years later, in 2023, it was still the same old AssetManager. Personally, I felt bad about this approach.

The steps got simplified:

  1. A cron scheduler periodically reads the JSON files.
  2. Using a ResourceResolver, the job looks up existing assets and runs the de-duping rules.
  3. It kicks off AssetManager.createAsset() if the asset is eligible to be imported.
  4. This API is synchronous: it waits until the binaries are shipped and then proceeds.
  5. Finally, we update the metadata on the asset. The landscape shaped up like this:

This approach seemed to work on lower environments with lower volume. But it was a debacle at Prod.

Production failure: when we ran the new job at Production against a heavier volume, it kept choking. It took a while to narrow down the exact tipping point. The job ran normally for up to ~10K assets, taking around 6:30 hours. But when we ran it again with a heavier volume, AssetManager started throwing this error:

09.11.2023 09:54:48.762 [cm-**-**-aem-author] *ERROR* [JobHandler: /var/workflow/instances/2023-11-08/salsify-importer_1:/content/dam] com.core.workflows.UpdateItemProcess Unable to process product:
java.lang.NullPointerException: null
at com.day.cq.dam.core.impl.AssetManagerImpl.createOrUpdateAsset(AssetManagerImpl.java:328) [com.day.cq.dam.cq-dam-core:5.15.56]
at com.day.cq.dam.core.impl.AssetManagerImpl.createOrUpdateAsset(AssetManagerImpl.java:260) [com.day.cq.dam.cq-dam-core:5.15.56]
at com.day.cq.dam.core.impl.AssetManagerImpl.createOrReplaceAsset(AssetManagerImpl.java:385) [com.day.cq.dam.cq-dam-core:5.15.56]
at com.core.workflows.injestor.AssetInjestorUsingAssetManager.lambda$importUsingAssetManager$4(AssetInjestorUsingAssetManager.java:76)

Clearly the OOTB AssetManagerImpl.createOrUpdateAsset() was throwing an NPE. There was no clear pattern, but heavier volume was choking it. We raised it with Adobe support, and after some research this was our learning:

AssetManagerImpl is NOT CLOUD FRIENDLY. It makes sense for AEM 6.5, but in Cloud Service it is not the way to go. The AssetManager API streams the binary through the repository (i.e. Mongo), and that is an extremely expensive operation: the binary first has to come into AEM (a stream holding the binary in the JVM), and then it has to be streamed out of AEM into the blob store (Azure Blob Storage). This means every binary gets processed twice.

Again, it was a failure. At this point, I would like to quote Gandalf: "All we have to decide is what to do with the time that is given to us."

Attempt 4: Final Solution — Cocktail of all learnings!

The only change we had to make was to replace AssetManager with the modern Asset Compute microservice. This was not the perfect solution, but it catered to all our requirements. Our landscape settled to this:

Custom DAM Asset migration solution
  1. The de-duper built earlier was reused to determine valid assets.
  2. From the AEM server, we fired the Asset Compute microservice. Example code to trigger the Asset Compute microservice is found here.
  3. To solve the metadata sync issue explained in Attempt 2 above, we copied the metadata onto the DAM parent folders as a temporary store.
  4. Then we used auto-assign workflows to copy the metadata back onto the asset.

During the last step (#4 above), we started with traditional workflow launchers and resource change listeners to check whether the asset was created and then copy the metadata. But that was again problematic: listeners get fired MULTIPLE times during asset creation. I have explained the problem in my article here. Post-processing workflows worked fine.

Finally, we were able to launch the job at Production. We phased the migration, split the payload into smaller chunks, and kept firing the job continuously for 6 days, with onshore and offshore teams working in shifts. But we weren't done yet.
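
The phasing itself was plain batching: split the full record list into fixed-size chunks and fire the ingest job chunk by chunk. A minimal sketch (the chunk size and record names are illustrative, not our production values):

```javascript
// Split a list into fixed-size batches for phased ingestion.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const records = Array.from({ length: 23 }, (_, i) => `asset-${i}`);
const batches = chunk(records, 10);
console.log(batches.length);    // 3
console.log(batches[2].length); // 3 (the final partial batch)
```

Each batch then became one run of the ingest job, which let us pause, verify and resume between chunks.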

More issues and how we solved them

  1. The replication agent was the slowest runner on our team. Above is a speed comparison: the binaries were shipped from the vendor onto AEM author (fast), Adobe Dynamic Media (fast) and AEM publish (slowest). To work around this, we moved asset replication into the post-processor as a separate process step. During bulk ingestion we turned OFF the replication step and completed ingestion into author and DM first; later we wrote an MBean JMX job to replicate the assets separately.
  2. Resource-change hell — I explained this in detail in my article here. Initially we used a ResourceChangeListener to copy metadata from the parent folder to the assets, but AEM fires this listener multiple times during asset creation. We switched to post-processing workflows to solve this.
  3. The Asset Delete API didn't work as per the Adobe documentation. During testing we wanted to bulk-delete assets and reimport them, but the documented approach didn't work. By trial and error we figured out a proper way to bulk-delete assets, explained here.
  4. The Sling Commons scheduler doesn't work on AEM Cloud Service. Similar to #2 above, it fires multiple times, once per pod instance. We switched to Sling Scheduled Jobs to solve this, as explained in my article here.
  5. The traditional AEM QueryBuilder is again a no-go for AEM Cloud Service. We switched to Sling Resource Streams, as explained here.
  6. After all of the AEM issues, the vendor also ran into an FTP issue. Murphy's law! It delayed our project by another sprint.

Conclusion

Our project is now delivered, and we have moved on to the second phase: maintenance, status reporting, job monitoring and further job enhancements. This migration gave us a morale boost towards asset delivery, adhering to Adobe best practices and leveraging DM optimizations. It was a great journey with a lot of lessons learned. I hope this article helps your migration journeys as well.

Happy coding!

--

Saravana Prakash

AEM Fullstack enthusiast working on AEMaaCS, Adobe EDS, Adobe I/O and other Adobe Marketing Cloud tools.