AEM Assets Migration Blueprint (3)

15 min read · Jan 28, 2024

Part 3: How Not to Migrate Taxonomies

The subtle difference between folders and tags

Over the past years, I have migrated a couple of legacy digital asset management systems (DAM for short) to AEM Assets. I found that the technical side of such a migration is simple: write a couple of scripts to extract, transform and load the binary data and metadata, make sure you have enough bandwidth to complete these operations in reasonable time, and make a transition plan that minimizes downtime and provides a seamless experience for the users.

My last project was a bit more challenging, though. It turned out that the legacy system and AEM had some significant differences — which I didn’t realize because I was too focused on the technical solution and should have spent more time understanding the conceptual side of things.

Here, I’d like to share my experiences — so your next project can run more smoothly than mine.

This is Part 3 of a three-part series on migrating to AEM Assets:

Part 1: Planning
Part 2: Transforming and tools
Part 3: How Not to Migrate Taxonomies to AEM Assets

TL;DR:

If you feel that AEM Assets is missing tag-based browsing capabilities, please support my idea with an upvote or a comment on Experience League.

If you want to learn why I believe this is a must-have, continue reading.

The cost of assets and why we need meta data

Companies with significant investments in marketing often maintain extensive asset libraries. These libraries can hold tens of thousands or even a few hundred thousand digital assets, including images, videos, documents, and more.

Digital assets are … well assets. Something of value. Naturally, you want to leverage that value.

The good news is: After creation, digital assets are relatively cheap. Unlike physical goods, you don’t have to invest in manufacturing additional product units.

The costs associated with a digital asset are

the time or money you spend on creating or licensing it,
the time you spend annotating the asset with metadata, and
the time you spend searching for that asset later.

Creating metadata is sometimes perceived as tedious and is often neglected, especially if you create the assets when you need them and don’t think of re-using them later.

Consequently, assets cannot be retrieved later for re-use and will be reproduced, often at higher costs. With that in mind, think of annotating assets with metadata as an investment. And to reduce these costs, we must make the process of annotating assets with metadata as efficient and comfortable as possible.

The key to leveraging digital assets is putting them in a structure where they can easily be retrieved and discovered.

“Retrieving” vs. “Discovering”

There is a subtle difference between “retrieving” and “discovering”.

Retrieving means finding something you know is there: you only need to find out where it was placed and fetch it from there. Discovering, on the other hand, means finding something whose existence you might not have been aware of.

Retrieving is what you do in your personal library. You know what you have and only need to find it. But a global library with a hundred thousand assets is typically used by hundreds of users. In that case, we must structure the library in a way that helps all users, current and future, to find and leverage assets they don’t know about.

Back in the day, when we only had to deal with a couple of hundred files, “structuring” meant creating a clever folder structure. This approach, however, does not scale well to thousands of files.

Also, a folder structure you find natural and logical might not be as obvious for your colleagues. So, to get assets used by multiple users, you need to add more data to support other retrieval use cases.

Today, files in large libraries are associated with metadata: additional data that describes a file. Metadata serves two purposes:

Learning about the properties of a file, e.g., when, where or how to use it.
Retrieving a file.

In this article, I’ll focus on the retrieval case.

Meta data for retrieval

Metadata comes in various forms. In AEM Assets, the most important types of metadata are

Property fields: Storing keys and values, e.g., SKU = 7804615, type = "pack shot", etc. These can be used in a search by setting a corresponding filter (e.g., search all assets with type = "pack shot").
Full text fields: This can be a textual representation of the asset like an abstract of the document or a description of the motif on the asset.
Folder structure: Where the asset is stored. If the assets can be partitioned into non-overlapping categories, you can use the folders to drill down, i.e., search only in certain locations.
Tags: Predefined, pre-structured keywords, e.g., asset-type/photography/black-and-white, that you can use to categorize or cluster assets.

Keywords, Tags, Tag Hierarchy, Taxonomy and Folksonomy, Directory

To mark assets for later retrieval, tags are a well-established concept. The term “tag”, however, is used slightly differently in different contexts. For the sake of this article, let me establish some basic terminology to avoid confusion.

Tags, in general, are like “stickers” with certain terms that you attach to assets belonging to a certain category. E.g., you can stick the words “peripheral”, “apple”, “macintosh”, “white”, “mouse” and “input device” to an image showing an Apple computer mouse.

Not all tags are equal, though. Let’s look at a few variations.

Keywords

You can allow the editor to enter the term as a keyword in a free text field. If the asset is tagged with “peripheral”, then the asset will be found when someone searches for that term. If, however, a user searches for “device”, the asset will not be retrieved, and vice versa. We have a “synonym problem”. A synonym is an alternative term that describes the same “thing” as another term. You need to know which term was used for tagging to be able to find the asset.

Many search engines solve the issue by using a thesaurus. A thesaurus is a dictionary that contains all synonyms for any given term. When searching for a keyword, the search engine would expand the synonyms, and search for each of them. This typically is used in full-text searches and not in single-field-based filters.
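To make the thesaurus idea concrete, here is a minimal Python sketch of synonym expansion. The thesaurus contents, the sample assets and the function names are made up for illustration; a real search engine does this inside its query pipeline.

```python
# Hypothetical thesaurus: every term maps to its full group of synonyms.
THESAURUS = {
    "peripheral": {"peripheral", "device"},
    "device": {"peripheral", "device"},
}

# Hypothetical assets with free-text keywords.
ASSETS = {
    "mouse.jpg": {"peripheral", "white"},
    "apple.jpg": {"fruit"},
}


def expand_query(term: str) -> set[str]:
    """Return the term together with all of its known synonyms."""
    return THESAURUS.get(term, {term})


def search(assets: dict[str, set[str]], term: str) -> list[str]:
    """Return the names of assets carrying the term or one of its synonyms."""
    wanted = expand_query(term)
    return sorted(name for name, keywords in assets.items() if keywords & wanted)
```

With this sketch, searching for “device” also finds the asset that was tagged with “peripheral”, which is exactly what a plain keyword match would miss.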

Tag Dictionary

The synonym problem is why metadata schemas often require a predefined set of keywords, or dictionary, that a user can choose from when tagging. I.e., when this set of keywords only contains the term “peripheral”, nobody can use the term “device” for tagging.

Having a “flat”, unstructured dictionary has a few problems, though.

The proper term is hard to locate among potentially thousands of terms in a large dictionary.
The semantic meaning of terms can be ambiguous. “apple”, for example, can mean either the computer brand or the fruit. A term that has different meanings is called a homonym. In flat dictionaries we have a “homonym problem”.
Someone must maintain that dictionary.
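A flat dictionary check could look roughly like the following sketch. The dictionary contents and the function name are assumptions for illustration only.

```python
# Hypothetical flat tag dictionary maintained by a librarian.
DICTIONARY = {"peripheral", "apple", "mouse", "white"}


def validate_keywords(keywords: list[str]) -> tuple[list[str], list[str]]:
    """Split keywords into accepted terms and terms missing from the dictionary."""
    accepted = [k for k in keywords if k in DICTIONARY]
    rejected = [k for k in keywords if k not in DICTIONARY]
    return accepted, rejected
```

The check rejects the synonym “device”, but it happily accepts “apple” without knowing whether the fruit or the brand is meant. That is the homonym problem in a nutshell.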

Folksonomy

To avoid having dedicated people maintaining a lexicon, teams often try to establish a so-called folksonomy. This is a dictionary that grows as people enter new keywords. I.e., if you annotate an asset with the term “peripheral” today, tomorrow, another editor could choose that term from the dictionary.

In my experience, this only works for small teams with small repositories and small dictionaries, and only when the domain language is well-defined. At large scale, it solves neither the homonym nor the synonym problem. E.g., it does not prevent you from adding “device” to the dictionary when the term “peripheral” already exists.

Tag Hierarchy

To get around the synonym and homonym problems, we can add context to the terms by putting them into a hierarchy:

- plant
  - fruit
    - apple
    - orange

...

- brand
  - IT
    - apple

This a) helps to locate the terms by navigating through the tree of tags and b) it also disambiguates homonyms.
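As a small illustration of the disambiguation, the hierarchy above can be modeled as nested dictionaries. This is only a sketch; the data structure and the function name are my own choice and not how any particular DAM stores its tags.

```python
# The example hierarchy as nested dictionaries (leaf tags map to empty dicts).
TAG_TREE = {
    "plant": {"fruit": {"apple": {}, "orange": {}}},
    "brand": {"IT": {"apple": {}}},
}


def paths_for(term: str, tree: dict, prefix: str = "") -> list[str]:
    """Return the full path of every tag in the tree whose name equals term."""
    found = []
    for name, children in tree.items():
        path = f"{prefix}/{name}" if prefix else name
        if name == term:
            found.append(path)
        # Descend into the sub-tags of this node.
        found.extend(paths_for(term, children, path))
    return found
```

Looking up the bare term “apple” returns two full paths, one per meaning. An editor who picks a full path instead of a bare term no longer has to guess which “apple” is meant.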

Taxonomy

A taxonomy also is a hierarchy of concepts or terms. A taxonomy, however, also is a classification. The child-parent relation has a semantic meaning:

In a taxonomy, this relation often means “is a”:

An apple is a fruit.
A fruit is a plant.

Now, this may seem obvious to you. But let’s try the other example:

An apple is an IT
An IT is a brand

Well… certainly not.

A tag hierarchy is a means to structure and disambiguate tags. But it only becomes a classification when the nesting relation has an “is-a” semantic.

Using Taxonomies in Search

The difference between hierarchy and classification might seem academic. But it becomes important when you want to use them in search:

Let’s say you have an Asset A tagged with plant/fruit/apple

You now submit a tag-search for assets tagged with plant/fruit. What result do you expect?

Would you expect to find the Asset A because it is a plant/fruit/apple and apple also belongs to the broader category plant/fruit?
Or would you rather expect A to not be returned, because the search tag does not exactly match plant/fruit/apple?

In a search you might expect the first — a taxonomy semantic. Whereas if you think of the hierarchy being a folder, you’d rather expect the second behavior.
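The two expectations can be written down as two small matching functions. This is a sketch of the concept only, not of how AEM or any other system implements tag search; the function names are made up.

```python
def matches_directory(asset_tag: str, query_tag: str) -> bool:
    """Directory semantics: only an exact tag match counts."""
    return asset_tag == query_tag


def matches_taxonomy(asset_tag: str, query_tag: str) -> bool:
    """Taxonomy semantics: the tag itself or any of its descendants counts."""
    return asset_tag == query_tag or asset_tag.startswith(query_tag + "/")
```

For Asset A, tagged with plant/fruit/apple, a query for plant/fruit matches under taxonomy semantics but not under directory semantics. Appending the slash before the prefix comparison avoids false matches between sibling tags such as plant/fruit and plant/fruits.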

By the way… do you know how AEM Assets behaves?

AEM applies a taxonomy semantic. You’ll find A because it implicitly also is a fruit.

Note: This is a recent change (or bugfix) in AEM. In older versions, the search in the AEM Assets Admin applied taxonomy semantics, whereas the search in the page editor applied directory semantics. The behavior is now consistent and reflects what most of you would have expected.

Directories vs Taxonomies

In AEM there is another hierarchy: the directory structure where assets are physically stored.

Note: Some practitioners call the folder hierarchy a “taxonomy”. This is not entirely correct. A folder hierarchy can have is-a semantics, but this is not always the case, nor does it have to be. Also, if you browse into the folder plant/fruit, you’d expect to see only the assets in that folder, not the contents of the subfolders.

So, directories and taxonomies not only have different semantics, but they also behave differently from a UI perspective.

UI: Search vs Browse

Speaking of the UI perspective: There are two major paradigms in retrieving assets:

Search
Browse

When searching, you enter keywords and select filter criteria from a search form. You’d typically hit a “submit” button and then a search is executed.

When browsing, you follow nested links, representing a folder hierarchy to navigate to the location where your assets are stored.

If you are familiar with AEM, that’s not new for you. You a) browse through folders and b) narrow down the results by using search.

There is a third paradigm that other systems sometimes use:

Search-based browsing

The UI would present you a tag hierarchy you can browse like a folder structure, but the backend would execute implicit discrete search queries to narrow down the search to match the browsed tags.

AEM User Interface

To be clear:

In AEM, each asset MUST be stored in a browsable physical folder.
In addition, it CAN be associated with one or more tags and placed in one or more taxonomy branches which you can use for form-based searches.
AEM does NOT support search-based browsing on tags natively.

Virtual multi-homed directories

Let’s elaborate a bit on search-based browsing. As we’ll see later, this can become a challenge in migration projects because AEM does not support it natively.

Let’s assume you own a grocery store and maintain a library of product images. You can have several different branches in your tag structure:

product
  - food
    - regional
      - apple
    - imported
      - orange

image
  - plant
    - fruits
      - apple
      - orange

Neither is right or wrong. The image of the apple belongs to both categories. Which one you’d use for retrieval would depend on the particular use case.

If you write an editorial about your organic suppliers, you might be more interested in the image aspect of the asset. Whereas if you are maintaining the product catalog, you’d more likely use the product hierarchy to find the asset you are looking for.

So, if you maintain a large asset repository, you’d want to support both use cases and tag the asset with tags from both hierarchies.

In AEM, you can add multiple tags to make the asset searchable in multiple categories.

Be aware, though, that the legacy system you migrate from may use a different paradigm. A frequently used paradigm is the tag-based “virtual directory”.

This concept is comparable to AEM’s tagging approach: one would apply more than one tag from a tag hierarchy to the asset, but the system lets you navigate through the tag hierarchies as if they were folders.

You will have a “browse” experience — but this time it is based on the tag hierarchy — not the folder hierarchy.

You can imagine the browse executing searches in the background like so:

product (select * where tag ~ /product/*)
  - food (select * where tag ~ /product/food/*)
    - regional (select * where tag ~ /product/food/regional/*)
      - apple (select * where tag ~ /product/food/regional/apple/*)

Depending on the system, the browse experience can have either taxonomy semantics or directory semantics.

Taxonomy semantics means narrowing down the search results to more specific categories as we walk down the branch in the tree. Imagine it like select * where tag ~ /product/food/*

Directory semantics means displaying only what is associated with exactly that tag (not with sub-tags). Imagine it like select * where tag = /product/food
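To illustrate how such a virtual directory could work, here is a minimal Python sketch that derives the sub-nodes and the asset listing of one node purely from tags. The sample data follows the grocery example; the function name and the in-memory dictionary are assumptions, and a real system would run database queries instead.

```python
# Hypothetical flat asset store: asset name -> list of tags.
ASSETS = {
    "apple.jpg": ["product/food/regional/apple", "image/plant/fruits/apple"],
    "orange.jpg": ["product/food/imported/orange", "image/plant/fruits/orange"],
}


def browse(assets: dict[str, list[str]], node: str, taxonomy: bool = True):
    """Return (sub-nodes, asset names) for one node of the virtual directory."""
    prefix = node + "/"
    children, hits = set(), set()
    for name, tags in assets.items():
        for tag in tags:
            if tag.startswith(prefix):
                # The next path segment below the node becomes a virtual sub-folder.
                children.add(tag[len(prefix):].split("/")[0])
                if taxonomy:
                    # Taxonomy semantics: assets tagged deeper in the branch are listed too.
                    hits.add(name)
            elif tag == node:
                # Assets tagged with exactly this node are listed in both modes.
                hits.add(name)
    return sorted(children), sorted(hits)
```

Browsing product/food with taxonomy semantics lists both images, while directory semantics lists none, because no asset is tagged with exactly product/food. In both modes the same apple image shows up under product/food/regional/apple and under image/plant/fruits/apple.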

This approach allows you to place a single asset in more than one browsable category, which can improve the user experience.

Typically, the assets are not stored in physical folders but somewhere in a flat database.

Note:

In AEM, assets are identified and browsed by their physical path.
In search-based systems, assets are identified by opaque IDs and browsed by tags. The tags serve as virtual directories.

Why you should care

So… what’s the point about what other systems are doing?

Well… you will run into issues when you migrate from a legacy system that uses a virtual directory to AEM which is based on physical folders.

The two approaches are not 100% isomorphic: one structure cannot be transformed into the other without loss when migrating from one system to another.

Let’s revisit the example above. In the old system the picture of the apple is stored in a flat DB under an ID:

name: apple.jpg
id: 3783635
tags: [
  product/food/regional/apple,
  image/plant/fruits/apple
]

We want to migrate this asset to AEM. Yes — we can map and store each individual metadata field and each tag in AEM.

Where to Store Assets

The question is: “Where do I store the asset? In which folder?”. The original asset did not have a folder.

This is the exact situation I found myself in on the project I mentioned earlier. I thought it was a purely technical problem…

This is how I tried to approach it:

Identify a primary taxonomy

So, all assets are tagged with more than one category. But maybe all assets are tagged in at least one common taxonomy?

Here both the apple and the orange are in the image/plant hierarchy. So, we could use this as “primary category” and create folders from this particular category.

This would have been the ideal scenario: ask the business owners what their primary category is, and ensure all assets are really tagged in that structure. Write a script to find the exceptions (there are always exceptions) and ask the business owners to re-tag the hopefully few assets that are not tagged yet.
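Such an exception report can be very simple. The following is a minimal sketch, assuming the legacy metadata has already been exported into a dictionary of asset names and tags; the sample data and the function name are hypothetical.

```python
# Hypothetical export from the legacy system: asset name -> list of tags.
ASSETS = {
    "apple.jpg": ["product/food/regional/apple", "image/plant/fruits/apple"],
    "orange.jpg": ["product/food/imported/orange", "image/plant/fruits/orange"],
    "logo.png": ["brand/IT/apple"],
}


def untagged_in_primary(assets: dict[str, list[str]], primary_root: str) -> list[str]:
    """Return the assets that have no tag inside the primary taxonomy."""
    prefix = primary_root + "/"
    return sorted(
        name
        for name, tags in assets.items()
        if not any(tag == primary_root or tag.startswith(prefix) for tag in tags)
    )
```

For the primary taxonomy image/plant, the report contains only logo.png. That list is what you would hand to the business owners for re-tagging.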

Prioritize categories

In that project, there was no primary category defined by the business. Also, there were some ten thousand assets, and I could not ask the business users to properly re-tag them before migration.

I thus analyzed whether some primary categories had emerged anyway. I looked for the hierarchy in which most of the assets were tagged, then the next most frequently used one, then the third, and so on. I prioritized the categories by the number of assets they contained and got confirmation from the business that these were indeed the more important ones.

The tags were migrated 1:1 into tags. I then tried to identify the primary category for each asset by matching its tags against the prioritized list of categories I had created earlier. When I found the primary category, I stored the asset in a folder that resembled the tag.

In the example above, the first asset was put in the image/plant folder path; the second, not being tagged in that hierarchy, went into the secondary category folder image/type.
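The approach can be sketched in a few lines of Python. This is a simplified reconstruction under assumptions, not the project’s actual script: the sample assets (including shelf.jpg and its tag), the category depth of two levels and all function names are made up for illustration.

```python
from collections import Counter

# Hypothetical export from the legacy system: asset name -> list of tags.
ASSETS = {
    "apple.jpg": ["product/food/regional/apple", "image/plant/fruits/apple"],
    "orange.jpg": ["product/food/imported/orange", "image/plant/fruits/orange"],
    "shelf.jpg": ["image/type/photo"],
}


def rank_categories(assets: dict[str, list[str]], depth: int = 2) -> list[str]:
    """Order category prefixes by the number of assets tagged within them."""
    counts = Counter()
    for tags in assets.values():
        # A set counts each category once per asset, even with several tags in it.
        counts.update({"/".join(tag.split("/")[:depth]) for tag in tags})
    # Most used first; ties are broken alphabetically to keep the result stable.
    return sorted(counts, key=lambda category: (-counts[category], category))


def primary_folder(tags: list[str], priorities: list[str]):
    """Return the tag in the highest-priority category, used as the folder path."""
    for category in priorities:
        for tag in tags:
            if tag == category or tag.startswith(category + "/"):
                return tag
    return None  # the asset is not tagged in any known category
```

With the sample data, rank_categories puts image/plant first, so apple.jpg lands in image/plant/fruits/apple, while shelf.jpg falls through to image/type/photo. Whether an asset ends up in one branch or another depends entirely on which tags it happens to carry.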

The result was … a mess. You could not predict in which folder an asset would land and the folder-browsing experience was not useful at all.

This could partly be traced back to the inadequate quality of the metadata.

But this never was an issue on the legacy system, as the multi-homing paradigm increased the chance of finding an asset no matter which category you browsed. In AEM, an asset could no longer be found in a folder unless that folder represented its primary category.

Heuristically choosing the main category

I then came up with a set of rules to merge categories. E.g., when an asset was tagged with both products/food/regional and products/food/imported, I would put the asset in the folder products/food/.

The structure was a little bit more predictable, but far from being usable. It also was quite “jagged”: similar assets were placed at very different levels in the hierarchy. Some sat quite deep, e.g. products/food/regional/apple, others near the root, e.g. products/food, simply because they had one more tag that required merging.
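One plausible way to express such a merge rule is the longest common path prefix of all tags. The sketch below assumes exactly that; the real rule set was more involved, and the function name is made up.

```python
def merged_folder(tags: list[str]) -> str:
    """Return the longest path prefix shared by all tags (empty if there is none)."""
    split_tags = [tag.split("/") for tag in tags]
    common = []
    # zip(*...) walks all tags level by level and stops at the shortest one.
    for parts in zip(*split_tags):
        if len(set(parts)) != 1:
            break
        common.append(parts[0])
    return "/".join(common)
```

An asset with the single tag products/food/regional/apple keeps its deep folder, whereas an asset that additionally carries products/food/imported/apple is lifted up to products/food. This is the jagged effect described above: one extra tag changes the depth at which an asset is stored.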

Truncating the categories and focusing on search

At that point, I took a step back and realized I was trying to solve an unsolvable issue:

I tried to implement a multi-homed directory structure with a single-homed directory system.
I tried to rebuild the browse experience from the legacy system. But AEM lacked support for tag-based browsing.
I tried to retain as much information as possible in the folder structure.

In hindsight I feel quite stupid. I was too eager to find a pure technical solution.

We then decided to focus on how AEM would have solved the business cases:

You would use tag-based searching, not browsing, to find assets. We gave up on the idea that AEM could provide a comparable tag browse experience.

Still, we had to solve the issue of physically storing the assets somewhere.

We could not mimic the flat storage structure of the legacy system. We could not simply store all assets in one flat folder /content/dam and rely on search only. The performance of AEM would have suffered severely.

The performance in AEM drops significantly when you have thousands of assets in one folder. Always make sure to evenly distribute the assets among several smaller folders.

Instead, we decided on a “hybrid” solution: we “truncated” the taxonomy for folder creation. We would rebuild only the first one or two levels of the tag hierarchy as folders. This obviously resulted in a much lower number of folders, and it was much easier to define priorities and rules to decide which asset to store where.

In our grocery example, we would have created folders like

(1) /assets/products/
(2) /assets/images/

with the corresponding priorities (if an asset has tags from several categories, place it in the folder with the higher priority).
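The folder decision then boils down to a lookup. The following sketch uses a single level for brevity; the mapping from tag roots to folders and the fallback folder are assumptions for illustration.

```python
# Tag root -> target folder, listed in priority order (highest priority first).
FOLDER_BY_TAG_ROOT = {
    "product": "/assets/products",
    "image": "/assets/images",
}


def target_folder(tags: list[str], fallback: str = "/assets/unsorted") -> str:
    """Pick the folder of the highest-priority tag root the asset is tagged in."""
    roots = {tag.split("/")[0] for tag in tags}
    for root, folder in FOLDER_BY_TAG_ROOT.items():
        if root in roots:
            return folder
    # Hypothetical fallback for assets without any known tag root.
    return fallback
```

The apple image, tagged in both hierarchies, is stored under /assets/products because that folder has the higher priority. The full tags stay on the asset, so nothing is lost for searching.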

A user would then browse into the main category (one or two levels deep) and run a form-based search from there.

Alternatives

We considered a few alternatives:

Store assets in folders based on their creation date

/assets
   /2023
      /January
      /February
      ...

That is an effective way to distribute the assets evenly over several smaller folders. Also, when the assets are mainly created for campaigns, navigating to a certain year or month can be a good pre-selection of the assets for a subsequent search. Yes, you would always have to search. Browsing would not be an option.
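Deriving such a folder from the creation date is straightforward. A minimal sketch, assuming the creation date is available in the legacy metadata; the function name and the /assets root are taken from the example layout above and are otherwise arbitrary.

```python
from datetime import date

# Fixed English month names, so the result does not depend on the system locale.
MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def date_folder(created: date, root: str = "/assets") -> str:
    """Build a year/month folder path from the creation date of an asset."""
    return f"{root}/{created.year}/{MONTHS[created.month - 1]}"
```

Note that month names do not sort chronologically in a folder listing. Numeric month folders would, but the sketch sticks to the layout shown above.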

Store assets in artificial folders based on their ID

This is a pattern you’d often find in AEM. When you don’t know where to store things, create a “random” structure to evenly distribute the resources:

/0a
   /0a-54
      /0a-54-ff
      /0a-54-04
      ...

We already had two better options, partial categories and date-based folders, so this was not really considered.
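For completeness, here is how such an artificial structure could be derived from the legacy ID. This is a sketch under assumptions: hashing the ID with SHA-1 and using three levels of two hex characters each are my choices to mimic the layout above, not a prescribed AEM mechanism.

```python
import hashlib


def id_folder(asset_id: str, levels: int = 3) -> str:
    """Derive a nested, evenly distributed folder path from an asset ID."""
    digest = hashlib.sha1(asset_id.encode("utf-8")).hexdigest()
    # Take two hex characters per level from the start of the digest.
    chunks = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    # Each folder name repeats the chunks of its parents, as in /0a/0a-54/0a-54-ff.
    return "/" + "/".join("-".join(chunks[: n + 1]) for n in range(levels))
```

Because the hash output is roughly uniform, assets spread evenly over the folders, and the same ID always maps to the same path. The paths carry no meaning for a human, though, so this layout only works in combination with search.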

Live Copies

We could have duplicated the assets as Live Copies in multiple folders. But this would have created a level of complexity we did not want to introduce. Also, Live Copies only work when you have a defined blueprint in some folder and the copies at separate locations. But our problem was that we could not identify that primary/blueprint category in the first place.

Augment the AEM search rails to support tag-based browsing.

That option is still in the project backlog. Our primary focus was to migrate the assets as fast as possible without losing data. The UI can always be improved later.

Conclusion

The project took longer than anticipated. It took some time to realize that I was migrating between systems with different paradigms and that I was trying to fix this by producing increasingly complex mapping rules. It was only later that I realized this was in vain and that I had to rethink the user experience the editors would have in AEM.

I hope that sharing this brief cautionary tale may assist you in avoiding the mistakes I made. Understand — really understand — the structure of the data first before you implement some crazy mapping rules as I did.

Call to Action