The Great Migration: WordPress to Contentful

Part 1: A journey of data parsing and migration

Malorie Casimir
Flatiron Labs
9 min read · Aug 28, 2019


The other great migration

Outgrowing WordPress

The Flatiron School’s marketing website was formerly built with WordPress and a custom PHP backend. Over seven years, the site grew as the school went from serving one course offering on one campus in one city to three course offerings on 12 campuses in 11 cities, with thousands of alumni and current students. The site had many endpoints, some user-facing and some behind the scenes, and many, many blog posts. All of this started to slow the website down and made the WordPress back end hard for newcomers to navigate.

Identifying the Challenges

Our team decided to use the JAM Stack, not only to solve the problem of a super confusing WordPress back end, but also to keep giving the marketing team the autonomy to update the site as they see fit. Specifically, we used Contentful, Gatsby.js, GraphQL, and Netlify for hosting.

After deciding that we’d build out the components and style them ourselves (rather than trying to copy the CSS from the live site), the biggest remaining challenge was how we’d programmatically import 850+ blog posts.

Blog Post Content Model

Our blog post content model has one author, one campus and can have many tags. These three are their own separate content models, as multiple blogs can share the same authors, campuses and tags. Here are the fields:

  • Title
  • Slug (which would be passed in the URL for navigation)
  • Author (linked from our Person Content Model)
  • Date Published
  • Campus (linked from our Campus Content Model)
  • Tag (linked from our Tag Content Model and also referred to as a category later on in this post)
  • Markdown (markdown field)
  • Content (rich text field)
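Put in plain data terms, the relationships look something like this. This is a hypothetical Ruby sketch using Contentful’s field type names (Symbol, Date, Link, Array, RichText); the real model lives in Contentful, not in code:

```ruby
# Hypothetical sketch of the blog post content model: one author, one campus,
# many tags. Field type names borrow from Contentful's field types.
blog_post_model = {
  title: 'Symbol',
  slug: 'Symbol',                                  # passed in the URL
  author: { type: 'Link', to: 'Person' },          # one author per post
  date_published: 'Date',
  campus: { type: 'Link', to: 'Campus' },          # one campus per post
  tags: { type: 'Array', items: { type: 'Link', to: 'Tag' } }, # many tags
  markdown: 'Text',
  content: 'RichText'
}

puts blog_post_model.keys.inspect
```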

XML -> CSV with HTML -> CSV with Markdown

Have you ever exported data from WordPress as an XML file? It’s messy.

Atom’s attempt at adding spacing and indentation

We wanted this as a CSV because CSVs are much easier to work with when manipulating data. Once we’d converted it to a CSV with HTML content (there are many open source tools available), our source.csv file looked something like this:

source.csv
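For illustration, here’s a minimal sketch of what that XML-to-CSV step looks like. This is not the tool we actually used; it uses Ruby’s built-in REXML against a made-up, trimmed-down WXR `<item>`:

```ruby
require 'csv'
require 'rexml/document'

# Hypothetical, trimmed-down WordPress WXR export with a single <item>.
xml = <<~XML
  <rss xmlns:wp="http://wordpress.org/export/1.2/"
       xmlns:content="http://purl.org/rss/1.0/modules/content/">
    <channel>
      <item>
        <title>Hello World</title>
        <wp:post_id>42</wp:post_id>
        <wp:post_type>blog_post</wp:post_type>
        <wp:status>publish</wp:status>
        <wp:post_name>hello-world</wp:post_name>
        <content:encoded>&lt;p&gt;Hi!&lt;/p&gt;</content:encoded>
      </item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(xml)

# Pull the fields we care about out of each <item>
rows = doc.get_elements('//item').map do |item|
  [
    item.elements['wp:post_id']&.text,
    item.elements['title']&.text,
    item.elements['content:encoded']&.text,
    item.elements['wp:status']&.text,
    item.elements['wp:post_name']&.text,
    item.elements['wp:post_type']&.text
  ]
end

# Write them out as CSV text
csv_text = CSV.generate do |csv|
  csv << %w[ID post_title post_content post_status post_name post_type]
  rows.each { |r| csv << r }
end
puts csv_text
```

In practice you’d read the export file from disk and write to source.csv; the shape of the loop is the same.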

And so begins the journey of weeding out the cruft. At this point, source.csv had more than just blog posts, so we had to filter out non-blog post content.

require 'csv'

data = CSV.read("source.csv", headers: true)

# Keep only the rows that are actually blog posts
blogs = data.select { |row| row['post_type'] == 'blog_post' }

CSV.open("html_blogs.csv", "w+") do |csv|
  csv << [
    'id',
    'date',
    'title',
    'content',
    'status',
    'slug'
  ]
  blogs.each do |p|
    csv << [
      p['ID'],
      p['post_date_gmt'],
      p['post_title'],
      p['post_content'],
      p['post_status'],
      p['post_name']
    ]
  end
end

This gave us a CSV with far fewer headers, making it slightly easier to work with.

html_blogs.csv

HTML to Markdown

Once we filtered our content, we needed to convert the HTML to markdown. We used the Reverse Markdown gem to accomplish this. Here’s the code:

require 'csv'
require 'reverse_markdown'

ReverseMarkdown.config do |config|
  config.unknown_tags = :raise
  config.github_flavored = false
  config.tag_border = ''
end

rows = CSV.read("results/html_blogs.csv", headers: true)

CSV.open('results/markdown_blogs.csv', 'w+') do |csv|
  csv << [
    'id',
    'date',
    'title',
    'content',
    'status',
    'slug'
  ]
  rows.each do |row|
    new_markdown = ReverseMarkdown.convert(row['content'])
    csv << [
      row['id'],
      row['date'],
      row['title'],
      new_markdown,
      row['status'],
      row['slug']
    ]
  end
end

This gem made our lives super easy because it took minimal configuration and we just had to call one method (convert) to do all of the work for us. As you can see in the ReverseMarkdown config block, we wanted to know every time the gem came across an unknown HTML tag so that we could deal with it manually. You read that right — manually. There were so many irregularities in this data that we couldn’t regex a lot of things out, but more on that later in this post.

Murky Markdown

Now comes most of the cleanup. The Reverse Markdown gem left us with a CSV that was mostly markdown, but also…not.

markdown_blogs.csv

As you can see in the above image, we now have markdown! I was excited to get to this stage because it meant we were almost ready to start converting our data to a rich text format for Contentful, right? WRONG.

Scary link post html conversion
What’s a [caption]???

The conversion from HTML to Markdown left us with hundreds, if not thousands, of random hiccups like the ones captured above. The first image shows what should be a link to a video; instead we got a super botched link within a link. The second image shows the dreaded [caption] tag known only to WordPress. This CSV was full of these unknown tags, including [gallery], [figcaption], [video] and others. We ended up fixing some other stuff before we fixed the links, because we did a:

Slight Pivot!

It was around this time that we realized we didn’t have all of the information we needed to successfully associate our blog posts on Contentful.

As mentioned earlier, our blogs have associated authors, campuses, and categories we use for filtering. However, this information was nowhere to be found in markdown_blogs.csv.

Enter Posts.csv, a second CSV we created using a second dump from WordPress. This CSV had information about the blog posts’ authors, campuses, and categories. Just what we needed!

Double Quote Hell

In the previous two images, you may have noticed double quotes appearing twice in a row. This was an enormous problem for parsing the data because in our CSV, the entire content field holding the blog data was wrapped in double quotes. So when I tried to map the blog posts to their proper campus, author and categories, the parser got very confused because it was picking up more fields than headers.
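To see why this breaks, here’s a toy illustration with made-up data: Ruby’s CSV escapes a literal double quote inside a quoted field by doubling it, so properly escaped content round-trips fine, while a stray unescaped quote blows up the whole row:

```ruby
require 'csv'

# Properly generated CSV: embedded double quotes get doubled on write,
# so the field parses back exactly as it went in.
good = CSV.generate { |csv| csv << ['1', 'She said ""hi""', 'slug'] }
parsed = CSV.parse(good).first
puts parsed.inspect

# A hand-mangled line with an unbalanced quote inside a quoted field
# raises CSV::MalformedCSVError instead of parsing.
bad = %(1,"broken "field",slug)
begin
  CSV.parse(bad)
rescue CSV::MalformedCSVError => e
  puts "parse failed: #{e.class}"
end
```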

So on a sunny Friday afternoon, I made sure I got very comfortable on a couch and went through the entire CSV, changing all of the double quotes to smart quotes. I didn’t want to risk a regex changing the end of a field to a smart quote, so I made the decision to do it myself. Definitely not my proudest moment, but it happened. In addition to changing the double quotes to smart quotes, I also went through and removed the unknown tags, because we weren’t going to need them in our migration. However, for the [video] and [gallery] tags, I made note of which blogs had them so that we could use an embedded content model on Contentful to render them (there were only a few).

I mentioned that it was a sunny Friday afternoon when in fact it was quite gloomy because this took hours.
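For what it’s worth, on a well-behaved post the [caption] unwrapping can be sketched with a regex like this. This is a hypothetical example; the real data was irregular enough that doing it by hand felt safer:

```ruby
# Hypothetical sketch: unwrap a WordPress [caption] shortcode, keeping the
# inner image tag and the trailing caption text.
html = '[caption id="attachment_1" align="aligncenter" width="600"]' \
       '<img src="https://example.com/pic.jpg" /> Our NYC campus[/caption]'

# Match the opening shortcode (with any attributes), capture everything up to
# the closing shortcode, and keep only the captured content.
cleaned = html.gsub(/\[caption[^\]]*\](.*?)\[\/caption\]/m, '\1').strip
puts cleaned
```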

Looking on the Bright Side

So that sucked, but on the bright side at that point we had a CSV with data that we could actually work with! Huzzah!

Next was the task of matching the blog posts that we had in what was then markdown_blogs_smart_quotes_fix.csv to their respective authors, campuses and categories in Posts.csv.

First, we built a hash mapping each blog id to an object with its category, campus name, and author.

Note about categories: they were called categories in WordPress, but they are called tags on Contentful so, again, if you see either mentioned, they are one and the same.

require 'csv'

# Newer CSV that has blogs + their authors, categories and campuses
blogs = CSV.read('Posts.csv', headers: true)

# Campuses
cities = {
  19 => "Atlanta",
  69 => "Austin",
  6 => "Brooklyn",
  63 => "Chicago",
  65 => "Dallas",
  64 => "Denver",
  7 => "Houston",
  3 => "NYC",
  2 => "Online",
  66 => "San Francisco",
  25 => "Seattle",
  4 => "Washington, D.C."
}

# Categories aka Tags
topics = {
  26 => "Alumni Stories",
  35 => "Announcements",
  29 => "Career Advice",
  40 => "Events",
  32 => "Flatiron Engineering",
  27 => "Flatiron News",
  43 => "Getting Familiar with Flatiron",
  41 => "Jobs",
  28 => "Learning To Code",
  30 => "Tech Trends"
}

output = {}
blogs.each do |row|
  category_name = nil
  campus_name = nil
  # In the CSV, the 'categories' and 'campus' columns were not always in the
  # same place, so we check whether that column name is even there and map the
  # value after it to the hashes above
  row.each_with_index do |col, i|
    category_name = topics[row[i + 1].to_i] if col[1] == 'categories'
    campus_name = cities[row[i + 1].to_i] if col[1] == 'campus'
  end
  # The id of the blog as told by WordPress
  blog_id = row['post_id/__text']
  output[blog_id] = {
    category_name: category_name,
    campus_name: campus_name,
    author: row[3]
  }
end

So this gave us what we needed, but in reverse order. To match the row order of our markdown blogs CSV, we had to reverse it all.

reversed_output = output.to_a.reverse.to_h
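A tiny illustration of what that one-liner does: Ruby hashes remember insertion order, and this chain flips it.

```ruby
# to_a turns the hash into [key, value] pairs, reverse flips their order,
# and to_h rebuilds a hash with the reversed insertion order.
h = { 'a' => 1, 'b' => 2, 'c' => 3 }
reversed = h.to_a.reverse.to_h
puts reversed.keys.inspect
```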

Now we’re ready to combine our beautiful sets of data!

# CSV with clean data
markdown_rows = CSV.read("results/markdown_blogs_smart_quotes_fix.csv", headers: true)

CSV.open('results/blog_posts_with_tags.csv', 'w+') do |csv|
  csv << [
    'id',
    'publishedAt',
    'title',
    'author',
    'status',
    'campus',
    'tag',
    'slug',
    'content'
  ]
  markdown_rows.each do |r|
    # Wrapped in begin/rescue so I could see the failures and decide whether
    # they were worth saving (sorry, blogs!). Unpublished blogs tended to fail.
    begin
      # Campuses and categories were only added a few months before this work,
      # but some posts date back to 2012, so for every blog we check whether
      # these keys are even present before assigning values; otherwise it'd
      # just be nil.
      extras = reversed_output[r['id'].to_s]
      author = extras.key?(:author) ? extras[:author] : nil
      campus_name = extras.key?(:campus_name) ? extras[:campus_name] : nil
      category_name = extras.key?(:category_name) ? extras[:category_name] : nil
      csv << [
        r['id'],
        r['date'],
        r['title'],
        author,
        r['status'],
        campus_name,
        category_name,
        r['slug'],
        r['content']
      ]
    rescue
      puts '________________________________'
      puts reversed_output[r['id']], r['id'], r['slug'], r['status']
    end
  end
end

Just kidding, there’s more cleanup

Now we have a CSV that’s more aligned with the work we’re doing in Contentful. It consists of blog posts that may or may not have an author, tag, and campus, depending on when they were originally published. Yay, I think?

Although I did the work to clean up the rogue tags and double quotes, the issue of the asset links remained. In case you already forgot what they looked like, here’s an example:

[""Getting Started with Elixir: Pattern Matching versus Assignment""]([https://www.youtube.com/watch?v=zwPqQngLn9w&index=6&list=PLCFmW8UCDqfCA9kpbFirPEDoQYc9nCy\_W](https://www.youtube.com/watch?v=zwPqQngLn9w&index=6&list=PLCFmW8UCDqfCA9kpbFirPEDoQYc9nCy_W))

This one we fixed with regex:

require 'csv'

blogs = CSV.read('./results/blog_posts_with_tags.csv', headers: true)

CSV.open("results/blog_posts_with_tags.csv", "w+") do |csv|
  csv << [
    'id',
    'publishedAt',
    'title',
    'author',
    'status',
    'campus',
    'tag',
    'slug',
    'content'
  ]
  blogs.each do |blog|
    # ✨Magic ✨: collapse a link-wrapped image down to just the image
    content = blog['content'].gsub(/\[\!\[(.*?)\]\((.+?)\)\]\(.+?\)/) do
      "![#{$1}](#{$2})"
    end
    csv << [
      blog['id'],
      blog['publishedAt'],
      blog['title'],
      blog['author'],
      blog['status'],
      blog['campus'],
      blog['tag'],
      blog['slug'],
      content
    ]
  end
end

puts blogs.count
puts 'DONE ✨'
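As a sanity check, here’s that substitution run on a made-up example of a link-wrapped image (the URL and alt text are invented):

```ruby
# Made-up sample: an image that ended up wrapped inside its own link.
content = '[![Elixir talk](https://img.example/thumb.png)]' \
          '(https://www.youtube.com/watch?v=zwPqQngLn9w)'

# Capture the alt text and image URL, then drop the outer link entirely.
fixed = content.gsub(/\[\!\[(.*?)\]\((.+?)\)\]\(.+?\)/) do
  "![#{$1}](#{$2})"
end
puts fixed
```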

I made a backup of blog_posts_with_tags.csv and just piped it into itself. (This works because CSV.read loads the whole file into memory before CSV.open truncates it for writing.)

This is the final iteration of our blog post CSV! It’s this one that we pushed up to Contentful.

Stay tuned for part 2 where we dive into what we did to get all of this onto Contentful!

Thanks for reading! Want to work on a mission-driven team that loves the JAM stack? We’re hiring!

To learn more about Flatiron School, visit the website, follow us on Facebook and Twitter, and visit us at upcoming events near you.

Flatiron School is a proud member of the WeWork family. Check out our sister technology blogs WeWork Technology and Making Meetup.
