The Great Migration: WordPress to Contentful
Part 1: A journey of data parsing and migration
Outgrowing WordPress
The Flatiron School’s marketing website was formerly built with WordPress and a custom PHP backend. Over seven years, the site grew as the school went from serving one course offering on one campus in one city to three course offerings on 12 campuses in 11 cities, with thousands of alumni and current students. The site had many endpoints, both visible and behind the scenes, and many, many blog posts. All of this started to slow the website down and made the WordPress back end hard for newcomers to navigate.
Identifying the Challenges
Our team made the decision to use the JAM Stack, not only to solve the problem of a super confusing WordPress back end, but to continue giving the marketing team the autonomy to update the site as they see fit. Specifically, we used Contentful, Gatsby.js, GraphQL, and Netlify hosting.
After deciding that we’d build out the components and style them ourselves (rather than trying to copy the CSS from the live site), the biggest remaining challenge was how we’d programmatically import 850+ blog posts.
Blog Post Content Model
Our blog post content model has one author, one campus and can have many tags. These three are their own separate content models, as multiple blogs can share the same authors, campuses and tags. Here are the fields:
- Title
- Slug (which would be passed in the URL for navigation)
- Author (linked from our Person Content Model)
- Date Published
- Campus (linked from our Campus Content Model)
- Tag (linked from our Tag Content Model and also referred to as a category later on in this post)
- Markdown (markdown field)
- Content (rich text field)
XML -> CSV with HTML -> CSV with Markdown
Have you ever exported data from WordPress as an XML file? It’s messy.
We wanted this as a CSV because CSVs are much easier to work with when manipulating data. Once we’d converted it to CSV with HTML (there are many open source tools available), our source.csv file looked something like this:
And so begins the journey of weeding out the cruft. At this point, source.csv had more than just blog posts, so we had to filter out the non-blog-post content.
require 'csv'

data = CSV.read("source.csv", headers: true)

blogs = []
data.each do |row|
  blogs.push(row) if row['post_type'] == 'blog_post'
end

CSV.open("html_blogs.csv", "w+") do |csv|
  csv << [
    'id',
    'date',
    'title',
    'content',
    'status',
    'slug'
  ]

  blogs.each do |p|
    csv << [
      p['ID'],
      p['post_date_gmt'],
      p['post_title'],
      p['post_content'],
      p['post_status'],
      p['post_name']
    ]
  end
end
This gave us a CSV with far fewer headers, making it slightly easier to work with.
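Incidentally, the push-into-an-array loop above can be collapsed into a single select call. Here’s a minimal sketch with made-up inline data (the rows and titles are purely illustrative):

```ruby
require 'csv'

# Toy stand-in for source.csv: two rows, only one of which is a blog post
rows = CSV.parse("post_type,post_title\nblog_post,Hello\npage,About", headers: true)

# select keeps only the rows whose post_type is 'blog_post'
blogs = rows.select { |row| row['post_type'] == 'blog_post' }

puts blogs.map { |row| row['post_title'] }.inspect # ["Hello"]
```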
HTML to Markdown
Once we filtered our content, we needed to convert the HTML to markdown. We used the Reverse Markdown gem to accomplish this. Here’s the code:
require 'csv'
require 'reverse_markdown'

ReverseMarkdown.config do |config|
  config.unknown_tags = :raise
  config.github_flavored = false
  config.tag_border = ''
end

rows = CSV.read("results/html_blogs.csv", headers: true)

CSV.open('results/markdown_blogs.csv', 'w+') do |csv|
  csv << [
    'id',
    'date',
    'title',
    'content',
    'status',
    'slug'
  ]

  rows.each do |row|
    new_markdown = ReverseMarkdown.convert(row['content'])

    csv << [
      row['id'],
      row['date'],
      row['title'],
      new_markdown,
      row['status'],
      row['slug']
    ]
  end
end
This gem made our lives super easy because it took minimal configuration and we just had to call one method (convert) to do all of the work for us. As you can see in the ReverseMarkdown config block, we wanted to know every time the gem came across an unknown HTML tag so that we could deal with it manually. You read that right — manually. There were so many irregularities in this data that we couldn’t regex a lot of things out, but more on that later in this post.
Murky Markdown
Now comes most of the cleanup. The Reverse Markdown gem left us with a CSV that was mostly markdown, but not entirely.
As you can see in the above image, we now have markdown! I was excited to get to this stage because it meant we were almost ready to start converting our data to rich text for Contentful, right? WRONG.
In the attempt to convert the HTML to Markdown, we were left with hundreds if not thousands of random hiccups like the ones captured above. The first image shows what should be a link to a video, but we just got a badly botched link within a link. The second image shows the dreaded [caption] tag, known only to WordPress. This CSV was full of these unknown tags, including [gallery], [figcaption], [video] and others. We ended up fixing some other stuff before we fixed the links, because we did a:
Slight Pivot!
It was around this time that we realized we didn’t have all of the information we needed to successfully associate our blog posts on Contentful.
As mentioned earlier, our blogs have associated authors, campuses, and categories we use for filtering. However, this information was nowhere to be found in markdown_blogs.csv.

Enter Posts.csv, a second CSV we created using a second dump from WordPress. This CSV had information about the blog posts’ authors, campuses, and categories. Just what we needed!
Double Quote Hell
In the previous two images, you may have noticed double quotes used twice in a row. This was an enormous problem for parsing the data: in our CSV, the entire content field holding the blog data was itself wrapped in double quotes, so when I tried to map the blog posts to their proper campus, author, and categories, the parser picked up more fields than headers and got very confused.
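To see why this breaks, here’s a tiny reproduction with Ruby’s standard CSV library (the row is made up): a bare double quote inside an already-quoted field makes the parser bail.

```ruby
require 'csv'

# A quoted content field with stray inner double quotes, like our blog data had.
# (Properly escaped CSV would double them: ""hello"")
bad_row = '1,"She said "hello" to the class"'

begin
  CSV.parse_line(bad_row)
rescue CSV::MalformedCSVError => e
  puts "Parsing failed: #{e.class}"
end
```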
So on a sunny Friday afternoon, I got very comfortable on a couch and went through the entire CSV, changing all of the stray double quotes to smart quotes. I didn’t want to risk a regex changing the end of a field to a smart quote, so I made the decision to do it myself. Definitely not my proudest moment, but it happened. In addition to changing the double quotes to smart quotes, I also went through and removed the unknown tags, because we weren’t going to need them in our migration. However, for the [video] and [gallery] tags, I made a note of which blogs had them so that we could use an embedded content model on Contentful to render them (there were only a few of these).
I mentioned that it was a sunny Friday afternoon when in fact it was quite gloomy because this took hours.
Looking on the Bright Side
So that sucked, but on the bright side at that point we had a CSV with data that we could actually work with! Huzzah!
Next came the task of matching the blog posts in what was then markdown_blogs_smart_quotes_fix.csv to their respective authors, campuses, and categories in Posts.csv.
First, we created objects that mapped blog ids to an object with their category, campus name, and author.
Note about categories: they were called categories in WordPress, but they are called tags on Contentful so, again, if you see either mentioned, they are one and the same.
require 'csv'

# newer CSV that has blogs + their authors, categories and campuses
blogs = CSV.read('Posts.csv', headers: true)

campuses = {}
authors = {}
categories = {}

# Campuses
cities = {
  19 => "Atlanta",
  69 => "Austin",
  6 => "Brooklyn",
  63 => "Chicago",
  65 => "Dallas",
  64 => "Denver",
  7 => "Houston",
  3 => "NYC",
  2 => "Online",
  66 => "San Francisco",
  25 => "Seattle",
  4 => "Washington, D.C."
}

# Categories aka Tags
topics = {
  26 => "Alumni Stories",
  35 => "Announcements",
  29 => "Career Advice",
  40 => "Events",
  32 => "Flatiron Engineering",
  27 => "Flatiron News",
  43 => "Getting Familiar with Flatiron",
  41 => "Jobs",
  28 => "Learning To Code",
  30 => "Tech Trends"
}

output = {}

blogs.each do |row|
  category_name = nil
  campus_name = nil

  # In the CSV, the 'categories' and 'campus' columns were not always in the
  # same place, so we check whether that column is even there and map it to
  # the hashes of categories and campuses above
  row.each_with_index do |col, i|
    category_name = topics[row[i + 1].to_i] if col[1] == 'categories'
    campus_name = cities[row[i + 1].to_i] if col[1] == 'campus'
  end

  # The id of the blog as told by WordPress
  blog_id = row['post_id/__text']

  output[blog_id] = {
    category_name: category_name,
    campus_name: campus_name,
    author: row[3]
  }
end
So this gave us what we needed, but in reverse. To match our blog.csv, we had to reverse it all.
reversed_output = output.to_a.reverse.to_h
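(Ruby hashes preserve insertion order but have no reverse method, hence the round trip through an array. A toy example of what the flip does, with made-up ids:)

```ruby
# Hypothetical slice of output, keyed by blog id in newest-first order
output = { "3" => { author: "C" }, "2" => { author: "B" }, "1" => { author: "A" } }

# to_a gives [[key, value], ...] pairs; reverse flips them; to_h rebuilds the hash
reversed_output = output.to_a.reverse.to_h

puts reversed_output.keys.inspect # ["1", "2", "3"]
```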
Now we’re ready to combine our beautiful sets of data!
# CSV with clean data
markdown_rows = CSV.read("results/markdown_blogs_smart_quotes_fix.csv", headers: true)

CSV.open('results/blog_posts_with_tags.csv', 'w+') do |csv|
  csv << [
    'id',
    'publishedAt',
    'title',
    'author',
    'status',
    'campus',
    'tag',
    'slug',
    'content'
  ]

  markdown_rows.each do |r|
    # Wrapped it all in a begin/rescue so that I could see the failures and
    # whether they were worth saving (sorry, blogs!). Some of these blogs
    # were unpublished, and those tended to fail.
    begin
      # Campuses and categories were added only a few months before this
      # work was done, but there are blog posts that date back to 2012. For
      # every blog, we check whether these keys are even present before
      # assigning values; otherwise they stay nil.
      if reversed_output[r['id'].to_s].key?(:author)
        author = reversed_output[r['id'].to_s][:author]
      else
        author = nil
      end

      if reversed_output[r['id'].to_s].key?(:campus_name)
        campus_name = reversed_output[r['id'].to_s][:campus_name]
      else
        campus_name = nil
      end

      if reversed_output[r['id'].to_s].key?(:category_name)
        category_name = reversed_output[r['id'].to_s][:category_name]
      else
        category_name = nil
      end

      csv << [
        r['id'],
        r['date'],
        r['title'],
        author,
        r['status'],
        campus_name,
        category_name,
        r['slug'],
        r['content'],
      ]
    rescue
      puts '________________________________'
      puts reversed_output[r['id']], r['id'], r['slug'], r['status']
    end
  end
end
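As an aside, the three key?-guarded lookups could be collapsed with Hash#dig, which returns nil whenever a key is missing at any level. A refactor sketch with made-up data, not the script we actually ran:

```ruby
# Hypothetical slice of reversed_output; note there is no :category_name key
reversed_output = {
  "42" => { author: "Ada", campus_name: "NYC" }
}

# dig walks the nested keys and returns nil instead of raising
author        = reversed_output.dig("42", :author)        # "Ada"
campus_name   = reversed_output.dig("42", :campus_name)   # "NYC"
category_name = reversed_output.dig("42", :category_name) # nil

puts [author, campus_name, category_name].inspect
```

One tradeoff: dig also returns nil for a blog id that’s missing entirely, whereas the key? version raised and let the begin/rescue log the failure, so you’d lose that visibility.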
Just kidding, there’s more cleanup
Now we have a CSV that’s more aligned with the work we’re doing in Contentful. It consists of blog posts that may or may not have an author, tag, and campus, depending on when they were originally published. Yay, I think?
Although I did the work to clean up the rogue tags and double quotes, the issue of the asset links remained. In case you already forgot what they looked like, here’s an example of one:
[""Getting Started with Elixir: Pattern Matching versus Assignment""]([https://www.youtube.com/watch?v=zwPqQngLn9w&index=6&list=PLCFmW8UCDqfCA9kpbFirPEDoQYc9nCy\_W](https://www.youtube.com/watch?v=zwPqQngLn9w&index=6&list=PLCFmW8UCDqfCA9kpbFirPEDoQYc9nCy_W))
This one we fixed with regex:
require 'csv'

blogs = CSV.read('./results/blog_posts_with_tags.csv', headers: true)

CSV.open("results/blog_posts_with_tags.csv", "w+") do |csv|
  csv << [
    'id',
    'publishedAt',
    'title',
    'author',
    'status',
    'campus',
    'tag',
    'slug',
    'content'
  ]

  blogs.each do |blog|
    # ✨ Magic ✨
    content = blog['content'].gsub(/\[\!\[(.*?)\]\((.+?)\)\]\(.+?\)/) do |str|
      "![#{$1}](#{$2})"
    end

    csv << [
      blog['id'],
      blog['publishedAt'],
      blog['title'],
      blog['author'],
      blog['status'],
      blog['campus'],
      blog['tag'],
      blog['slug'],
      content
    ]
  end
end

puts blogs.count
puts 'DONE ✨'
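To sanity-check that gsub: it matches a linked image, [![alt](image-url)](outer-url), captures the alt text and image URL, and drops the outer link. Here it is applied to a small sample (the URLs are illustrative):

```ruby
# A nested "link wrapping an image" like the botched ones in our CSV
content = '[![Getting Started](https://img.example.com/thumb.png)](https://www.youtube.com/watch?v=zwPqQngLn9w)'

# $1 is the alt text, $2 is the inner image URL; the outer link is discarded
fixed = content.gsub(/\[\!\[(.*?)\]\((.+?)\)\]\(.+?\)/) do |str|
  "![#{$1}](#{$2})"
end

puts fixed # ![Getting Started](https://img.example.com/thumb.png)
```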
I made a backup of blog_posts_with_tags.csv and just piped the file into itself.
This is the final iteration of our blog post CSV! It’s this one that we pushed up to Contentful.
Stay tuned for part 2 where we dive into what we did to get all of this onto Contentful!
Thanks for reading! Want to work on a mission-driven team that loves the JAM stack? We’re hiring!
To learn more about Flatiron School, visit the website, follow us on Facebook and Twitter, and visit us at upcoming events near you.
Flatiron School is a proud member of the WeWork family. Check out our sister technology blogs WeWork Technology and Making Meetup.