Migrating large amounts of content in Sitecore

Using Sitecore PowerShell Extensions to migrate 3900 unique content items and their linked media


Migrating to a new website is never easy, and it was no different for a client of our team at eFocus. The previous website was outdated in every respect and had to be drastically updated: new design, new content structure, new Sitecore, new solution architecture. Everything was going to be renewed, except a large portion of the content. The website consists of several segments and each segment consists of approximately 3900 content items. One segment is migrated per release, preferably fully automated.

With some help from my colleague @Wesz I think I found a pretty cool solution for the challenge.

The scenario

The old Sitecore instance (let’s call it instance A) and the new one (instance B) do not have a lot of similarities:

  • Instance A has a page based structure while instance B has a content based structure. This means that some pages should be transformed into generic content blocks which can be placed on several different pages. This also means that not every template in A exists in B and vice versa
  • The content of instance A should not be placed on the exact same location in instance B. The content tree structure in Sitecore is completely different
  • Fields from instance A do not have the same name as the fields in instance B
  • Instance A has a lot of broken links and missing media items which ideally should be removed from instance B
  • Some pages should be excluded from the content migration because they are deprecated

Some conditions and requirements were:

  • A lot of rich text fields have links to other Sitecore items. These links should be fixed when migrating to instance B. For example, if sitecore/content/site/a/b/c moves to sitecore/content/site/x/y/c, all links should point to sitecore/content/site/x/y/c
  • News articles are divided by month and year, these folders should be created automatically when a news article is being migrated
  • An item can be renamed during migration. For example: sitecore/content/site/a/b/c in instance A will be sitecore/content/site/a/b/d in instance B

The approach

So… how are we going to achieve that? Creating a package and installing it is obviously not an option. After some brainstorming and sketching up ideas I came to the following approach:

We use a CSV as input for the script. This CSV contains all the pages that need to be migrated. We load that CSV, get the old serialized Sitecore item from instance A matching the CSV row and import it in instance B. After the item is created we’ll fill in all the fields and fix all the links in the RTE- and link fields.


The tools

The following tools were needed to realize the approach described above:

The script relies mostly on the Sitecore PowerShell Extensions which is a great set of tools. It is built on top of PowerShell so you can use almost everything you already know plus a lot of cool stuff specially made for Sitecore. It also provides developers with a couple of applications in Sitecore such as PowerShell ISE. The ISE gives you the ability to open, save, write and run PowerShell scripts within the Sitecore context.


The CSV

The script needs some kind of input to know what to do. It needs to know which page goes where, which template the page had and which template it should get. The file therefore contained the following 4 columns:

  • Old URL
  • New URL
  • Old template GUID
  • New template GUID

We’ll load the CSV using CsvHelper and sort the rows by the number of forward slashes in the New URL. Why? Because I cannot add an item if its parent does not exist yet. Sorting by the number of forward slashes means parents are usually processed before their children, which reduces the chance of importing an item without its parent. We’ll also filter out all duplicates because an item can only be added once.
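The ordering and de-duplication step can be sketched as follows. This is a minimal illustration, not the original Load-Csv code: the CsvRow class and its property names are hypothetical, and it assumes that two rows count as duplicates when their old URLs match, ignoring case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical row model; the property names are assumptions for this sketch.
public class CsvRow
{
    public string OldUrl { get; set; }
    public string NewUrl { get; set; }
}

public static class CsvOrdering
{
    // Counts the forward slashes in a URL; null or empty counts as zero.
    public static int Depth(string url)
    {
        return string.IsNullOrEmpty(url) ? 0 : url.Count(c => c == '/');
    }

    // Keeps the first row for every old URL (case-insensitive), then orders
    // the remaining rows so that shallow new URLs come before deeper ones.
    public static List<CsvRow> Prepare(IEnumerable<CsvRow> rows)
    {
        return rows
            .GroupBy(r => r.OldUrl, StringComparer.OrdinalIgnoreCase)
            .Select(g => g.First())
            .OrderBy(r => Depth(r.NewUrl))
            .ToList();
    }
}
```

Because OrderBy is a stable sort, rows with the same depth keep their order from the CSV file.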

As mentioned before, I used PowerShell CmdLets to structure my script. I wanted a clear and readable PowerShell script in Sitecore, so I created CmdLets in C# which could be called from my PowerShell script.

$csv = Load-Csv -Path "D:\folder\file.csv"

The Load-Csv CmdLet returns a sorted IEnumerable of ContentListCsvModel, which is a custom model containing properties for all columns in the CSV file.

# PowerShell compares with -lt; "<" is not a comparison operator
for ($i = 0; $i -lt $csv.Count; $i++)
{
    [ContentListCsvModel]$row = $csv[$i]
    if ($null -eq $row)
    {
        Write-Warning "Row is null"
        continue
    }
}

Note: please see adamnaj’s response on this post. You could probably also use Import-Csv or ConvertFrom-Csv.

Getting old items

When serializing Sitecore items, Sitecore places them in the exact same hierarchy on your filesystem as they are in Sitecore. Also, the Sitecore tree is almost the same as the structure of the website’s URLs. So from the old URL I can find the serialized item on my disk.

Note: I’m using Directory.GetFiles(path, searchPattern) to locate the files. I noticed that an item called “item name” was sometimes saved as “item-name”. Directory.GetFiles() accepts wildcards in the search pattern, which gives you a better chance of finding your item file.
Please also see adamnaj’s response on this post. You could probably also use Get-ChildItem.
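The wildcard lookup could look roughly like this. It is a sketch under assumptions: the class and method names are made up for illustration, serialized items are assumed to use the .item extension, and only the last URL segment is matched, not the folders above it.

```csharp
using System;
using System.IO;
using System.Linq;

public static class SerializedItemLocator
{
    // Replaces spaces and hyphens with the single-character wildcard '?',
    // so "item-name" also matches a file saved as "item name.item".
    public static string BuildSearchPattern(string urlSegment)
    {
        return urlSegment.Replace(' ', '?').Replace('-', '?') + ".item";
    }

    // Returns the first matching file directly under 'directory',
    // or null when the directory or the file does not exist.
    public static string FindItemFile(string directory, string urlSegment)
    {
        if (!Directory.Exists(directory))
        {
            return null;
        }
        return Directory
            .GetFiles(directory, BuildSearchPattern(urlSegment))
            .FirstOrDefault();
    }
}
```

If more than one file matches the pattern, this sketch simply takes the first one, so it is worth logging such cases during a real migration.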

After finding it I can load the item into memory using default Sitecore functionality.

// requires System.IO and Sitecore.Data.Serialization.ObjectModel
using (var file = File.Open(filepath, FileMode.Open, FileAccess.Read, FileShare.Read))
using (TextReader reader = new StreamReader(file))
{
    return SyncItem.ReadItem(new Tokenizer(reader));
}

Deserializing the Sitecore item will result in a SyncItem object, which is a more low level object than the Item object. However, it does what it should in this case: I can get the template id, versions, shared fields, id and name.

Creating the items

To create an item, you should know where to create it (parent), what to create it as (template) and what to name it. All these parameters are defined in the CSV. I created the Get-ItemPath, Get-ParentPath and Get-ItemName CmdLets to extract this information from the CSV’s new URL.

I also added functionality that if Get-ItemPath returned null for the new URL, the item would be placed under the old name and new template in a receptacle. This would make it possible for our client to sort those items out manually.

if (!$newitempath)
{
    [string]$newparentitempath = ($receptacle).Paths.FullPath
    [string]$newitemname = $fromitem.Name
}

After all the information is collected, the item can be created.

[Sitecore.Data.Items.Item]$toitem = Create-Item -Parent $newparentitempath -Name $newitemname -Template $newitemtemplate -Database $fromitem.DatabaseName

Because the PowerShell script runs within Sitecore, you do not have to worry about not having a Sitecore context. Creating an item can therefore be done in the same way you’ve always done it as you can see in this snippet of the Create-Item CmdLet:

var db = Factory.GetDatabase(String.IsNullOrWhiteSpace(this.Database) ? "master" : this.Database, true);
var parentItem = db.GetItem(this.Parent);
if (parentItem == null)
{
    // the parent does not exist (yet), so there is nowhere to create the item
    return null;
}
var existing = parentItem.Children.FirstOrDefault(x => String.Equals(x.Name, this.Name, StringComparison.CurrentCultureIgnoreCase));
if (existing != null)
{
    // do not add a new item if an item with the same name already exists.
    // This makes it possible to run the script multiple times with changes or additions
    return existing;
}
using (new SecurityDisabler())
{
    return ItemManager.CreateItem(this.Name, parentItem, new ID(this.Template));
}

Note: You could extend the functionality of the Create-Item CmdLet with, for example, automatic creation of month and year folders for news items as I described earlier.
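As an illustration of the month and year folders, the helper below works out which folder paths a news article needs. It is only a sketch: the NewsFolders class is hypothetical, and the folder naming (a four-digit year with a two-digit month below it) is an assumption. The actual folders would still have to be created through the Sitecore API where they are missing.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class NewsFolders
{
    // Returns the year folder path followed by the month folder path,
    // in the order in which they would have to be created.
    public static IList<string> FolderPaths(string newsRoot, DateTime published)
    {
        var root = newsRoot.TrimEnd('/');
        var year = root + "/" + published.ToString("yyyy", CultureInfo.InvariantCulture);
        var month = year + "/" + published.ToString("MM", CultureInfo.InvariantCulture);
        return new List<string> { year, month };
    }
}
```

The article itself would then be created under the last path in the list.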

After we created the new item, we’ll save the old and new item in a Dictionary<SyncItem, Item>. That way we’ll have their relationship and we can easily fix the broken links later on.

Mapping fields

After creating the items we should populate them with content. A Sitecore item has versions with different content, but also contains shared fields. First we’ll create the different versions. The SyncItem object has a property Versions which is a List<SyncVersion>. Each version has a language, so the new item has to be fetched in that language before adding the version.

foreach (SyncVersion version in from.Versions)
{
    var language = LanguageManager.GetLanguage(version.Language);
    var item = database.GetItem(to.ID, language);
    using (new SecurityDisabler())
    {
        var newversion = ItemManager.AddVersion(item);
        MapFields(from, newversion, version.Fields);
    }
}

The MapFields method is the same for the shared fields, only instead of version.Fields, use from.SharedFields.

Discrepancies between fields

As I mentioned before, the instance A templates can differ from the instance B templates. To fix that, I created mapper functionality to let my code know how the fields should be mapped.

Every template combination (old template, new template) has a mapper object that implements an IMapper interface, and all mappers are kept in a List<IMapper>. A mapper has a dictionary of old and new field names which is used to map the fields.

We’ll loop through the fields of the from item, find a match in the mapper and fill the to field. This is what it looks like in short:

// 'fields' is the list passed to MapFields: version.Fields or from.SharedFields
foreach (SyncField field in fields)
{
    foreach (var fieldmap in mapper.Fields)
    {
        if (!String.Equals(field.FieldName, fieldmap.Key, StringComparison.CurrentCultureIgnoreCase)) continue;
        var tofield = to.Fields[fieldmap.Value];
        SetValue(tofield, field.FieldValue);
    }
}
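A mapper along these lines might be shaped as follows. The member names, the TemplateMapper class and the MapperRegistry helper are all assumptions made for this sketch; only the idea of an IMapper with a dictionary of old and new field names comes from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape of the mapper contract.
public interface IMapper
{
    Guid OldTemplateId { get; }
    Guid NewTemplateId { get; }
    // Key: field name in instance A, value: field name in instance B.
    IDictionary<string, string> Fields { get; }
}

public class TemplateMapper : IMapper
{
    public Guid OldTemplateId { get; private set; }
    public Guid NewTemplateId { get; private set; }
    public IDictionary<string, string> Fields { get; private set; }

    public TemplateMapper(Guid oldTemplateId, Guid newTemplateId, IDictionary<string, string> fields)
    {
        OldTemplateId = oldTemplateId;
        NewTemplateId = newTemplateId;
        // Copy into a case-insensitive dictionary so lookups ignore casing.
        Fields = new Dictionary<string, string>(fields, StringComparer.OrdinalIgnoreCase);
    }
}

public static class MapperRegistry
{
    // Finds the mapper for a template combination, or null if there is none.
    public static IMapper Find(IEnumerable<IMapper> mappers, Guid oldTemplateId, Guid newTemplateId)
    {
        return mappers.FirstOrDefault(m =>
            m.OldTemplateId == oldTemplateId && m.NewTemplateId == newTemplateId);
    }
}
```

With a case-insensitive dictionary, a single Fields.TryGetValue(field.FieldName, out newName) call could replace the inner loop over the field map.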

What about images and such?

Setting the value is pretty easy for normal text fields like single-line text, but we also want to set the value of image fields, for instance, and the images from instance A are not migrated yet.

When the tofield.Type is “image” or “file” we’ll parse the field value, find the linked media item, create it and map the fields the same way we’re doing for our content items. This will only create the necessary media items and fix the links immediately.
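Parsing the field value can be sketched like this. It assumes the raw value of an image or file field is a small XML element with a mediaid attribute, for example <image mediaid="{GUID}" />. The MediaFieldParser class is hypothetical, and looking up and creating the media item itself is left out.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class MediaFieldParser
{
    // Returns the media item id stored in the raw field value,
    // or null when the value is empty, not XML, or has no valid mediaid.
    public static Guid? GetMediaId(string rawValue)
    {
        if (string.IsNullOrWhiteSpace(rawValue))
        {
            return null;
        }
        try
        {
            var attribute = XElement.Parse(rawValue).Attribute("mediaid");
            if (attribute == null)
            {
                return null;
            }
            Guid id;
            if (Guid.TryParse(attribute.Value, out id))
            {
                return id;
            }
            return null;
        }
        catch (XmlException)
        {
            // The field did not contain well-formed XML.
            return null;
        }
    }
}
```

A null result is a useful signal in the log: it points to an empty field or to one of the broken media references that should not end up in instance B.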

Attachments

The attachment field type is used for the media items and should be handled differently to correctly save the blob data.

if (field.IsBlobField)
{
    // the serialized blob is stored as a Base64 string
    var buffer = Convert.FromBase64String(value);
    using (var ms = new MemoryStream(buffer, false))
    using (new SecurityDisabler())
    {
        item.Editing.BeginEdit();
        field.SetBlobStream(ms);
        item.Editing.EndEdit();
    }
}

Fixing links

We’ve now got an instance B with correct items and correct values. Only the links to other items are not migrated yet. This means that all the links in the rich text fields and link fields are pointing to GUIDs from instance A. Essentially what we need is a Dictionary<old GUID, new GUID> to rewrite all the GUIDs in the fields.

To find the links in an RTE field, parse the HTML and find the <a href=""> elements.

// HtmlDocument comes from the HTML Agility Pack
var document = new HtmlDocument();
document.LoadHtml(fieldValue);
var nodecollection = document.DocumentNode.SelectNodes("//a[@href]");
if (nodecollection == null) return;
foreach (var node in nodecollection)
{
    var hrefvalue = node.GetAttributeValue("href", String.Empty);
    var linkedItemId = DynamicLink.Parse(hrefvalue).ItemId.Guid;
    // find the guid in the Dictionary<SyncItem, Item>
    var item = GetEntryBySyncItemId(dictionary, linkedItemId);
}

If item is not null, the linked page has been migrated and we’ll add the GUIDs to the dictionary. If it is null, the link is pointing to an item which has not been migrated. The CmdLet will then look for that item in the serialized items. If the item is present, it will be created and its fields will be mapped as done before. Then the GUIDs will be added to the dictionary.

With that dictionary, we can replace the old GUIDs with the new ones:

using (new SecurityDisabler())
{
    var newvalue = field.Value;
    foreach (var link in linkmap)
    {
        // keep replacing in newvalue so earlier replacements are not lost
        newvalue = newvalue.Replace(link.Key.ToString("N").ToUpper(), link.Value.ToString("N").ToUpper());
    }
    item.Editing.BeginEdit();
    field.Value = newvalue;
    item.Editing.EndEdit();
}
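The replacement itself can also be written as a standalone function, which makes it easy to try out on a few field values before running the whole migration. The sketch below is an assumption-based variant and not the original CmdLet code. Besides the 32-digit form used in rich text links (~/link.aspx?_id=...), it also replaces the braced form with hyphens that link fields store in their id attribute. Both are assumed to be stored in upper case.

```csharp
using System;
using System.Collections.Generic;

public static class LinkRewriter
{
    // Replaces every old GUID in 'value' with its new GUID, in both the
    // 32-digit ("N") and the braced, hyphenated ("B") notation.
    public static string Rewrite(string value, IDictionary<Guid, Guid> linkmap)
    {
        if (string.IsNullOrEmpty(value))
        {
            return value;
        }
        var result = value;
        foreach (var pair in linkmap)
        {
            result = result.Replace(
                pair.Key.ToString("N").ToUpperInvariant(),
                pair.Value.ToString("N").ToUpperInvariant());
            result = result.Replace(
                pair.Key.ToString("B").ToUpperInvariant(),
                pair.Value.ToString("B").ToUpperInvariant());
        }
        return result;
    }
}
```

String.Replace is case-sensitive, so a GUID stored in lower case would be left untouched by this sketch.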

Results

For 3900 content items (media items excluded) the script takes about 20 minutes and creates a detailed log of 152401 lines describing everything it does. I’ve run the script about 8 times now with changes and it works great every time.


I would’ve loved to share my source code with you, but unfortunately I cannot do that. I tried my best to clearly explain how I did the migration. If you have any questions about the migration, the code or the tools, feel free to leave a comment or contact me on Twitter.