How to Migrate Your Site from WordPress to Contentful

Malorie Casimir · Published in The Startup · Sep 13, 2019

Outgrowing WordPress

The Flatiron School’s marketing website was formerly built with WordPress and a custom PHP backend. Over seven years the site grew as the school went from one course offering on one campus in one city to three course offerings on 12 campuses in 11 cities, with thousands of alumni and current students. The site accumulated many endpoints, both public-facing and hidden, and many, many blog posts. All of this not only slowed the website down but also made the WordPress back end hard to navigate for newcomers.

Identifying the Challenges

Our team decided to use the JAMstack, both to solve the problem of a super confusing WordPress back end and to keep giving the marketing team the autonomy to update the site as they see fit. Specifically, we used Contentful, Gatsby.js, GraphQL, and Netlify for hosting.

After deciding that we’d build out the components and style them ourselves (rather than trying to copy the CSS from the live site), the biggest remaining challenge was how we’d programmatically import 850+ blog posts.

Blog Post Content Model

Our blog post content model has one author and one campus, and can have many tags. These three are separate content models of their own, since multiple blog posts can share the same authors, campuses and tags. Here are the fields:

  • Title
  • Slug (which would be passed in the URL for navigation)
  • Author (linked from our Person Content Model)
  • Date Published
  • Campus (linked from our Campus Content Model)
  • Tag (linked from our Tag Content Model and also referred to as a category later on in this post)
  • Markdown (markdown field)
  • Content (rich text field)

XML -> CSV with HTML -> CSV with Markdown

Have you ever exported data from WordPress as an XML file? It’s messy.

Atom’s attempt at adding spacing and indentation

There are numerous tools you can use to convert this to a CSV file. We wanted a CSV because it’s much easier to work with when manipulating data.
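To make the shape of that step concrete, here’s a toy, standard-library sketch. The element names are simplified stand-ins for the real WXR tags (the actual export nests items under `rss/channel` and namespaces most fields, e.g. `wp:post_type` and `content:encoded`), so treat this as an illustration, not the tool we used:

```ruby
require 'csv'
require 'rexml/document'

# Simplified stand-in for a WordPress export
xml = <<~XML
  <channel>
    <item><title>Post A</title><type>blog_post</type></item>
    <item><title>Page B</title><type>page</type></item>
  </channel>
XML

# Flatten each <item> into a CSV row
rows = []
REXML::Document.new(xml).elements.each('channel/item') do |item|
  rows << [item.elements['title'].text, item.elements['type'].text]
end

csv = CSV.generate { |out| rows.each { |r| out << r } }
# csv == "Post A,blog_post\nPage B,page\n"
```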

Once we had our CSV (called source.csv) with HTML it looked something like this:

source.csv

And so begins the journey of weeding out the cruft. At this point, source.csv had more than just blog posts, so we had to write some code to pull out just the blogs.

require 'csv'

data = CSV.read("source.csv", headers: true)

blogs = []
data.each do |row|
  blogs.push(row) if row['post_type'] == 'blog_post'
end

CSV.open("html_blogs.csv", "w+") do |csv|
  csv << [
    'id',
    'date',
    'title',
    'content',
    'status',
    'slug'
  ]
  blogs.each do |p|
    csv << [
      p['ID'],
      p['post_date_gmt'],
      p['post_title'],
      p['post_content'],
      p['post_status'],
      p['post_name']
    ]
  end
end

This gave us a CSV with far fewer headers, making it slightly easier to work with.

html_blogs.csv

HTML to Markdown

Once we filtered out our blogs, we needed to convert the HTML to markdown. We used the Reverse Markdown gem to accomplish this. Here’s the code:

require 'csv'
require 'reverse_markdown'

ReverseMarkdown.config do |config|
  config.unknown_tags = :raise
  config.github_flavored = false
  config.tag_border = ''
end

rows = CSV.read("results/html_blogs.csv", headers: true)

CSV.open('results/markdown_blogs.csv', 'w+') do |csv|
  csv << [
    'id',
    'date',
    'title',
    'content',
    'status',
    'slug'
  ]
  rows.each do |row|
    new_markdown = ReverseMarkdown.convert(row['content'])
    csv << [
      row['id'],
      row['date'],
      row['title'],
      new_markdown,
      row['status'],
      row['slug']
    ]
  end
end

This gem made our lives super easy because it took minimal configuration and we just had to call one method (convert) to do all of the work for us. As you can see in the ReverseMarkdown config block, we wanted to know every time the gem came across an unknown HTML tag so that we could deal with it manually. You read that right — manually. There were so many irregularities in this data that we couldn’t regex a lot of things out, but more on that later in this post.
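To illustrate the idea, here’s a toy stand-in for what the gem does (not its actual implementation): convert a handful of known tags and raise on anything unrecognized, which is the behavior `unknown_tags = :raise` gave us for free.

```ruby
# Toy converter: handles a few inline tags and raises on unknown ones,
# mirroring the unknown_tags = :raise configuration above
KNOWN_TAGS = { 'strong' => '**', 'em' => '_', 'code' => '`' }

def to_markdown(html)
  html.gsub(%r{</?([a-z]+)[^>]*>}) do
    tag = Regexp.last_match(1)
    raise "unknown tag: #{tag}" unless KNOWN_TAGS.key?(tag)
    KNOWN_TAGS[tag]
  end
end

to_markdown('<strong>hello</strong> <em>world</em>')
# => "**hello** _world_"
```

Anything outside the known set, like `<video>`, raises immediately so you can deal with it by hand.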

Murky Markdown

Now comes most of the cleanup. The Reverse Markdown gem left us with a CSV that was mostly markdown but also…not.

markdown_blogs.csv

As you can see in the above image, we now have markdown! I was excited to get to this stage because it meant that we were almost ready to start converting our data to a rich text format for Contentful, right? WRONG.

Scary link post html conversion
What’s a [caption]???

In an attempt to convert the HTML to Markdown, we were left with hundreds if not thousands of random hiccups like the ones captured above. The first image shows what should be a link to a video, but we just got a super botched link within a link. The second image shows the dreaded [caption] tag known only to WordPress. This CSV was full of these unknown tags, including [gallery], [figcaption], [video] and others. We ended up fixing some other stuff before we fixed the links, because we did a:

Slight Pivot!

It was around this time that we realized we didn’t have all of the information we needed to successfully associate our blog posts on Contentful.

As mentioned earlier, our blogs have authors, campuses that they may reference and categories they are related to for filtering. However, this information was nowhere to be found in markdown_blogs.csv.

Enter Posts.csv, a second CSV we created from a second WordPress dump. This CSV had information about each blog post’s author, campus and category. Just what we needed!

Double Quote Hell

In the previous two images you may have noticed double quotes used twice in a row. This was an enormous issue for parsing the data because in our CSV, the entire content field holding the blog data was wrapped in double quotes. So when I tried to map the blog posts to their proper campus, author and categories, the parser got very confused because it was picking up more fields than headers.
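That doubling is simply how CSV escapes a quote inside a quoted field, which you can see with Ruby’s standard library:

```ruby
require 'csv'

# A field containing a double quote gets wrapped in quotes, and the
# embedded quote is doubled: exactly the "" sequences littering our
# content column
line = CSV.generate_line(['He said "hi"', 'plain'])
# => "\"He said \"\"hi\"\"\",plain\n"
```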

So on a sunny Friday afternoon, I made sure I got very comfortable on a couch and went through the entire CSV changing all of the double quotes to smart quotes. I didn’t want to risk a regex changing the end of a field to a smart quote, so I made the decision to do it myself. Definitely not my proudest moment, but it happened. In addition to changing the double quotes to smart quotes, I also ended up going through and removing the unknown tags because we weren’t going to need them in our migration. However, for the [video] and [gallery] tags, I made note of which blogs had them so that we could use an embedded content model on Contentful to render them (there were only a few of these).

I mentioned that it was a sunny Friday afternoon when in fact it was quite gloomy because this took hours.
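In hindsight, at least the simpler shortcodes could have been stripped with a pattern like this (a hedged sketch; the shortcode list is just the ones we ran into, and it keeps the inner text):

```ruby
# Removes opening and closing WordPress shortcodes such as [caption]
# and [/caption] while leaving the wrapped text in place
def strip_shortcodes(text)
  text.gsub(/\[\/?(?:caption|gallery|figcaption|video)[^\]]*\]/, '')
end

strip_shortcodes('[caption id="attachment_1"]A campus photo[/caption]')
# => "A campus photo"
```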

Looking on the Bright Side

So that sucked, but on the bright side, at that point we had a CSV with data that we could actually work with! Huzzah!

Next was the task of matching the blog posts that we had in what was then markdown_blogs_smart_quotes_fix.csv to their respective authors, campuses and categories in Posts.csv.

First, we created a hash that mapped blog ids to an object holding each blog’s category, campus name and author.

Note about categories: they were called categories in WordPress but they are called tags on Contentful, so, again, if you see either mentioned, they are one and the same.

require 'csv'

# newer CSV that has blogs + their authors, categories and campuses
blogs = CSV.read('Posts.csv', headers: true)

campuses = {}
authors = {}
categories = {}

# Campuses
cities = {
  19 => "Atlanta",
  69 => "Austin",
  6 => "Brooklyn",
  63 => "Chicago",
  65 => "Dallas",
  64 => "Denver",
  7 => "Houston",
  3 => "NYC",
  2 => "Online",
  66 => "San Francisco",
  25 => "Seattle",
  4 => "Washington, D.C."
}

# Categories aka Tags
topics = {
  26 => "Alumni Stories",
  35 => "Announcements",
  29 => "Career Advice",
  40 => "Events",
  32 => "Flatiron Engineering",
  27 => "Flatiron News",
  43 => "Getting Familiar with Flatiron",
  41 => "Jobs",
  28 => "Learning To Code",
  30 => "Tech Trends"
}

output = {}

blogs.each do |row|
  category_name = nil
  campus_name = nil
  # In the CSV, the 'categories' and 'campus' columns were not always in
  # the same place, so we had to check if that column name was even there
  # and map it to the above hashes containing our categories and campuses
  row.each_with_index do |col, i|
    category_name = topics[row[i + 1].to_i] if col[1] == 'categories'
    campus_name = cities[row[i + 1].to_i] if col[1] == 'campus'
  end
  # The id of the blog as told by WordPress
  blog_id = row['post_id/__text']
  output[blog_id] = {
    category_name: category_name,
    campus_name: campus_name,
    author: row[3]
  }
end

So this gave us what we needed, but in reverse order. To match the order of our markdown blogs CSV, we had to reverse it all.

reversed_output = output.to_a.reverse.to_h
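This works because Ruby hashes preserve insertion order; converting to an array, reversing, and converting back flips the order of the keys:

```ruby
# Round-tripping through an array reverses a hash's key order
h = { 'a' => 1, 'b' => 2 }
h.to_a.reverse.to_h
# => { 'b' => 2, 'a' => 1 }
```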

Now we’re ready to combine our beautiful sets of data!

# CSV with clean data
markdown_rows = CSV.read("results/markdown_blogs_smart_quotes_fix.csv", headers: true)

CSV.open('results/blog_posts_with_tags.csv', 'w+') do |csv|
  csv << [
    'id',
    'publishedAt',
    'title',
    'author',
    'status',
    'campus',
    'tag',
    'slug',
    'content'
  ]
  markdown_rows.each do |r|
    # Wrapped it all in a begin/rescue so that I could see the failures
    # and if they were worth saving (sorry blogs!). Some of these blogs
    # were unpublished and those tended to fail
    begin
      # Campuses and Categories were actually added only a few months
      # before this work was done, but there are blog posts that date
      # back to 2012. For every blog, we checked to see if these keys
      # were even present before assigning values, otherwise it'd just
      # be nil
      if reversed_output[r['id'].to_s].key?(:author)
        author = reversed_output[r['id'].to_s][:author]
      else
        author = nil
      end
      if reversed_output[r['id'].to_s].key?(:campus_name)
        campus_name = reversed_output[r['id'].to_s][:campus_name]
      else
        campus_name = nil
      end
      if reversed_output[r['id'].to_s].key?(:category_name)
        category_name = reversed_output[r['id'].to_s][:category_name]
      else
        category_name = nil
      end
      csv << [
        r['id'],
        r['date'],
        r['title'],
        author,
        r['status'],
        campus_name,
        category_name,
        r['slug'],
        r['content']
      ]
    rescue
      puts '________________________________'
      puts reversed_output[r['id']], r['id'], r['slug'], r['status']
    end
  end
end

Just kidding, there’s more cleanup

Now we have a CSV that’s more aligned with the work we’re doing in Contentful. It consists of blog posts that may or may not have an author, tag and a campus depending on when they were originally published. Yay I think?

Although I did the work to clean up the rogue tags and double quotes, the issue of the asset links remained. In case you already forgot what they looked like, here’s an example of one:

[""Getting Started with Elixir: Pattern Matching versus Assignment""]([https://www.youtube.com/watch?v=zwPqQngLn9w&index=6&list=PLCFmW8UCDqfCA9kpbFirPEDoQYc9nCy\_W](https://www.youtube.com/watch?v=zwPqQngLn9w&index=6&list=PLCFmW8UCDqfCA9kpbFirPEDoQYc9nCy_W))

This one we fixed with regex:

require 'csv'

blogs = CSV.read('./results/blog_posts_with_tags.csv', headers: true)

CSV.open("results/blog_posts_with_tags.csv", "w+") do |csv|
  csv << [
    'id',
    'publishedAt',
    'title',
    'author',
    'status',
    'campus',
    'tag',
    'slug',
    'content'
  ]
  blogs.each do |blog|
    # ✨Magic ✨
    content = blog['content'].gsub(/\[\!\[(.*?)\]\((.+?)\)\]\(.+?\)/) do |str|
      "![#{$1}](#{$2})"
    end
    csv << [
      blog['id'],
      blog['publishedAt'],
      blog['title'],
      blog['author'],
      blog['status'],
      blog['campus'],
      blog['tag'],
      blog['slug'],
      content
    ]
  end
end

puts blogs.count
puts 'DONE ✨'

I made a backup of blog_posts_with_tags.csv and just piped it into itself.
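A quick sanity check of that pattern on a botched link (the URLs here are made up):

```ruby
# An image link nested inside a regular link, as WordPress left it
botched = '[![Elixir talk](https://img.example/thumb.png)](https://video.example/watch)'

# Keep the alt text ($1) and the image URL ($2), drop the outer link
fixed = botched.gsub(/\[\!\[(.*?)\]\((.+?)\)\]\(.+?\)/) { "![#{$1}](#{$2})" }
# => "![Elixir talk](https://img.example/thumb.png)"
```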

This is the final iteration of our blog post CSV! It’s this one that we pushed up to Contentful.

Contentful Rich Text Fields

The rich text field on Contentful is the fanciest field type you can get on their platform. It provides a nice UI for adding extra formatting to text. It also allows you to embed other content types either inline or as a block right in the field, as long as you handle their rendering in your codebase.

https://www.contentful.com/developers/docs/concepts/rich-text/

We wanted to push directly to a rich text field, but because of its complex structure, we decided to push our markdown up to a markdown field first and then use Contentful’s rich-text-from-markdown library to convert the markdown to rich text and push it into the rich text field.
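To give a sense of why we didn’t want to hand-build rich text, here’s roughly what a single paragraph looks like in that format, sketched as a Ruby hash (a minimal illustration of the nesting, not the full spec):

```ruby
# Even one paragraph of rich text is a deeply nested tree of typed nodes
doc = {
  nodeType: 'document',
  data: {},
  content: [
    {
      nodeType: 'paragraph',
      data: {},
      content: [
        { nodeType: 'text', value: 'Hello, Contentful!', marks: [], data: {} }
      ]
    }
  ]
}
```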

Upload Time!

It’s time to finally upload our blogs to Contentful. We used the Contentful Management API via their contentful-management gem to accomplish this. Here’s what we did:

require 'contentful/management'
require 'csv'

blogs = CSV.read("./results/blog_posts_with_tags.csv", headers: true)

# To prevent the creation of duplicates, we created the content model
# entries for each of the tags, campuses and authors and stored their IDs
# here so that we can do a simple lookup to see if we already created one
TAG_IDS = {
  "Flatiron Engineering" => '3N8YALRxdjF4xJOOFNUTks',
  "Career Advice" => '6xnsyw23i7tzyqaMg6FW72',
  "Flatiron News" => '2vtlfmHAZWPWD4sriPcvC2',
  "Learning To Code" => '3T8WDH6Uo6AfIJ4j8QikYH',
  "Alumni Stories" => '5T7DcbkCMzc5G2f3zVg3Ik',
  "Reports" => '7mkIFK86YkubsEMAQ8bRB7',
  "Default Blog Tag" => '20uCE5e3tt70fH291sMuZC'
}
# String keys (=>) so the CAMPUS_IDS[blog['campus']] lookups below work
CAMPUS_IDS = {
  "Seattle" => "CtEMRhZ9UAjakgCOlaXws",
  "NYC" => "5ZoGVhhBzZhnFvJ5GFVfOo",
  "Houston" => "5syhCmNnhUlEPPP5uP6YXX",
  "Denver" => "4uX7vnzAUypCyBeP4B4zNF",
  "Chicago" => "1OdnykEoU6Wdi9l9uxMaQZ",
  "Online" => "3ZNRbZwF7FTzI4JILXbFS1",
  "San Francisco" => "3DnYWcRtoxiC9vz3zk9JR2",
  "Washington D.C." => "17yOSvsrouXoqmmubuV4I3",
  "Brooklyn" => "7ekoZJnpJ8dqflSumDtDl3",
  "Dallas" => "2zBuf2EvPzLzMYrT2BnJOu"
}
AUTHOR_IDS = {
  "Flatiron School" => '6zSxkJtfRGvdHQrMR0AfLi',
  "Charles Poladian" => ''
}

# Contentful client
client = Contentful::Management::Client.new('CONTENTFUL_API_KEY', raise_errors: true)

# To find the correct environment, we search for it based on the client
# stored in the variable above
environment = client.environments('CONTENTFUL_SPACE_ID').find('CONTENTFUL_ENV_ID')

# Gets content types that will be used to find or create entries of these types
blog_type = environment.content_types.find('blogPost')
person_type = environment.content_types.find('person')
campus_type = environment.content_types.find('campus')
tag_type = environment.content_types.find('tags')

blogs.each do |blog|
  begin
    tags_arr = []
    # There are a finite number of campuses and all of them are in the
    # CAMPUS_IDS hash above
    if (CAMPUS_IDS[blog['campus']])
      campus_entry = environment.entries.find(CAMPUS_IDS[blog['campus']])
    end
    # For the tags, the logic is a bit more involved: if there is no tag
    # present, we want to use a default tag (we made it a required field
    # for our blogs and if we didn't provide a tag, things would break).
    # However, if it exists we either find it using our hash of tags
    # above or create a new one.
    if (!blog['tag'])
      tag_entry = environment.entries.find(TAG_IDS['Default Blog Tag'])
      tags_arr << tag_entry
    else
      slug = blog['tag'].downcase.gsub(' ', '-')
      if (TAG_IDS[blog['tag']])
        puts 'found tag'
        tag_entry = environment.entries.find(TAG_IDS[blog['tag']])
        tags_arr << tag_entry
      else
        puts 'creating tag from blog'
        tag_entry = tag_type.entries.create(name: blog['tag'], slug: slug)
        TAG_IDS[blog['tag']] = tag_entry.id
        tags_arr << tag_entry
      end
    end
    # Find or create author
    if (AUTHOR_IDS[blog['author']])
      author_entry = environment.entries.find(AUTHOR_IDS[blog['author']])
    else
      author_entry = person_type.entries.create(name: blog['author'], jobTitle: 'Blog Post Author')
      AUTHOR_IDS[blog['author']] = author_entry.id
    end
    # Creates the blog post
    entry = blog_type.entries.create(
      title: blog['title'],
      publishedAt: DateTime.parse(blog['publishedAt']),
      markdown: blog['content'],
      slug: blog['slug']
    )
    # Associates tags, campus and author to blog post
    entry.update(tags: tags_arr)
    entry.update(campus: campus_entry)
    entry.update(author: author_entry)
    # Throttle the requests so we don't get rate limit errors
    sleep 0.15
  # Print out the blogs that didn't successfully upload to Contentful
  rescue => error
    puts '______________________________________'
    puts blog['id'], blog['status'], blog['slug']
    puts error
  end
end
puts 'DONE 🎉'

Amazing. At this point, all of our blogs are on Contentful with their content in the markdown field. The next step is to convert those markdown fields to rich text.

Converting to Rich Text 🤑

Time to switch to using JavaScript! We used Contentful’s contentful-migration and rich-text-from-markdown libraries to handle the conversion of markdown to rich text. The contentful-migration library handles the actual passing of data between the fields and the rich-text-from-markdown library handles converting the markdown itself. There’s only a bit of setup for the contentful-migration portion:

// convert_markdown_to_rich_text.js
const runMigration = require('contentful-migration/built/bin/cli').runMigration
const dotenv = require('dotenv')
dotenv.config()

// We define some options that allow the library to find our space and
// environment on Contentful and then point it to where our migration
// lives in our file tree
const options = {
  filePath: 'data/migration-test.js',
  spaceId: process.env.CONTENTFUL_SPACE_ID,
  accessToken: process.env.CONTENTFUL_MANAGEMENT_API_KEY,
  environmentId: process.env.CONTENTFUL_ENVIRONMENT_ID,
  yes: true
}

// Runs the migration
runMigration({ ...options })
  .then(() => console.log('Migration Done!'))
  .catch((e) => console.error(e))

Now for what you probably clicked on this post for: the migration itself.

// migration-test.js
const { richTextFromMarkdown } = require('@contentful/rich-text-from-markdown')
const { createClient } = require('contentful-management')

// Our function takes in the migration for free from the runMigration
// function in convert_markdown_to_rich_text.js. We also get our space id,
// environment id and access token.
module.exports = async function(migration, { spaceId, accessToken, environmentId }) {
  // We need our client, space and environment because, like we saw with
  // the Ruby gem above, entries are created on the environment, and to
  // get to it we need the client and space first.
  const client = await createClient({ accessToken: accessToken })
  const space = await client.getSpace(spaceId)
  const environment = await space.getEnvironment(environmentId)

  // We call transformEntries on our migration to ask the library to find
  // our blog post content model and, for each entry, take its markdown
  // field, do something to it (defined below) and push the result into
  // its content field. shouldPublish: true also publishes the entry
  // rather than leaving it as a draft.
  migration.transformEntries({
    contentType: 'blogPost',
    from: ['markdown'],
    to: ['content'],
    shouldPublish: true,
    // transformEntryForLocale is called with the value of the current
    // field (fromFields) and that field's locale (currentLocale)
    transformEntryForLocale: async function(fromFields, currentLocale) {
      // If the currentLocale isn't 'en-US' or the markdown field is
      // empty, we move on to the next entry rather than waste time
      // trying to process something that isn't there
      if (
        currentLocale !== 'en-US' ||
        fromFields.markdown === undefined
      ) {
        return
      }
      // This is where more ✨magic✨ happens. The rich-text-from-markdown
      // library converts the nodes of our markdown field into nodes that
      // the rich text field can understand. If it comes across a node it
      // can't automatically parse, the node is passed into the second
      // argument of richTextFromMarkdown, where a switch statement
      // determines what kind of node it is. In our case, code blocks and
      // images were the ones we had to handle manually.
      const content = await richTextFromMarkdown(fromFields.markdown['en-US'], async (node) => {
        switch (node.type) {
          case 'code':
            return processCode(node)
          case 'image':
            return await processImage(environment, node)
        }
      })
      try {
        return {
          content: content
        }
      } catch (error) {
        console.error(error)
      }
    }
  })
}

// If richTextFromMarkdown comes across a code block, the node is passed
// into this helper function, which converts it to a format that the rich
// text field can understand
const processCode = async (node) => {
  return {
    nodeType: "blockquote",
    content: [
      {
        nodeType: "paragraph",
        data: {},
        content: [
          {
            nodeType: "text",
            value: node.value,
            marks: [],
            data: {}
          }
        ]
      }
    ],
    data: {}
  }
}

// If richTextFromMarkdown comes across an image, the node is passed into
// this helper function, which creates an asset in our Contentful
// environment, uploads and publishes that image and returns it in a
// format that the rich text field can understand
const processImage = async (environment, node) => {
  const title = node.url.split('/').pop()
  const ext = title.split('.').pop()
  const asset = await environment.createAsset({
    fields: {
      title: {
        'en-US': `Blog post image: ${title}`
      },
      description: {
        'en-US': node.alt || `Blog post image: ${title}`
      },
      file: {
        'en-US': {
          contentType: `image/${ext}`,
          fileName: title,
          upload: node.url
        }
      }
    }
  }).catch(e => console.log('in create asset catch'))
  await asset.processForAllLocales()
  return {
    nodeType: 'embedded-asset-block',
    content: [],
    data: {
      target: {
        sys: {
          type: 'Link',
          linkType: 'Asset',
          id: asset.sys.id
        }
      }
    }
  }
}

DONE 🎉

And that’s that.

This was a ton of work with a lot of trial and error. I excluded all of the rabbit holes and just included what works. That being said, here are some key learnings:

Ask for extra eyes from colleagues sooner rather than later. This project took me about a month, due in part to the amount of trial and error I endured and the lack of resources available on the internet. I’m sure it could have been shorter if I had asked for help earlier.

Sometimes the work you do as an engineer will just straight up suck. There were parts of this project (the gloomy Friday afternoon) where I felt really uninspired and burned out. However, I stuck through it and now I can look back at that experience and realize how much I learned from it.

Growth lies outside of your comfort zone. This is probably obvious for most people, but it really rang true for me during the course of this project.

Take a vacation. The burnout was real after this one so I made sure to take 1.5 weeks off to recuperate. Luckily, I work with a team of very understanding engineers and managers so this wasn’t an issue at all.

Wait, there’s more

Here’s a bonus script from when we needed to downgrade headings for accessibility purposes:

require 'contentful/management'

client = Contentful::Management::Client.new('CONTENTFUL_API_KEY', raise_errors: true)
environment = client.environments('CONTENTFUL_SPACE_ID').find('CONTENTFUL_ENV_ID')
entries = client.entries('CONTENTFUL_SPACE_ID', 'CONTENTFUL_ENV_ID').all(content_type: "blogPost", limit: 100)

counter = 0
while entries.next_page
  entries.each do |blog|
    puts blog.title
    blog.markdown = blog.markdown.gsub(/(^# )/, "### ")
    blog.markdown = blog.markdown.gsub(/(^## )/, "#### ")
    blog.save
    counter += 1
  end

  entries = entries.next_page
end
puts 'DONE'
puts counter
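The substitution can be sanity-checked on a small string. Note that the order matters: the h1 replacement runs first, and the freshly created ### lines are not matched by the ^## pattern because their third character is another #, not a space.

```ruby
# Downgrade h1 -> h3 and h2 -> h4, line by line (^ matches line starts
# in Ruby regexes by default)
md = "# Title\nsome text\n## Section\n"
md = md.gsub(/(^# )/, "### ").gsub(/(^## )/, "#### ")
# => "### Title\nsome text\n#### Section\n"
```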
