🙈Being Seedy 🌱: How a newbie scraped data and seeded a PostgreSQL database with Cheerio & Sequelize.

Notes: The site I scraped data from does not mention a terms of service prohibiting use of its data, but I will refrain from naming the site or providing the data verbatim. All examples used have been modified. Some familiarity with Sequelize is assumed.

Photo by rawpixel on Unsplash

Introduction

Background and Context 📷

I’m a student developing a community application for a fairly niche card game. I’ve tinkered with various CRUD projects previously using Node/Express, Sequelize/PostgreSQL, Mongoose/MongoDB, etc. But in each case, the application didn’t have any substantial existing data — it was generated by users.

Building a supplementary application for an existing game means most of the application’s data already exists externally (read: we’re not a company with a fancy spreadsheet that we can simply export and seed). It is not usually desirable to duplicate data sources or re-create existing data. There isn’t a problem if the data lives in a nice API for us to consume. The existing data in this case, however, does not have an API to consume. Instead it is embedded in static HTML. My assumption is that the site renders HTML tables from their database and just serves those.

I had never found myself in this scenario before, nor did I have any experience with scraping. The remainder of this article outlines the ad-hoc methodology I used to successfully set up my development database. Not having prior experience in this area, I cannot guarantee the methodology I employed is standard or even remotely ‘optimal’. But it was fun and my own quirky way of being seedy — so here goes. 🙈 🌱

Game Objects 🎮

Each card (i.e., character) in the game has various attributes. We’ll focus on a select few, mainly:

  • Name — a unique identifier
  • Class(es) — 1 or more from a finite set: [rogue, archer, asshat, … ]
  • Kingdom — exactly 1, from a finite set: [forest, swamp, trumpland, …]
  • Element(s) — 1 or more from a finite set: [fire, water, cheetodust, …]

Models

Again, for brevity — I’ve simplified the domain object model to focus on a few key entities and their connections. From the above game objects, we can create an ER structure as follows:


Sequelize Init 🚀

Note: I have a personal philosophy of not installing npm modules globally unless absolutely necessary. Thus, in actuality, I reference sequelize using the path to its binary:

node_modules/.bin/sequelize

For brevity, I’ll just use sequelize in the commands below. But that won’t work unless it’s installed globally or on your PATH.

  1. Get up and going
npm install --save sequelize sequelize-cli pg cheerio
sequelize init

1.5 (not covered in this article) — set up config.json and .sequelizerc

2. Create Our models and migrations

sequelize model:generate --name Kingdom --attributes name:string
sequelize model:generate --name Card --attributes name:string,image:string
sequelize model:generate --name Element --attributes elem:string
sequelize model:generate --name Class --attributes class:string

3. Update our Model files

Optionally add some constraints and validations to our columns, but most importantly associate our models!

  • Kingdom
// models/kingdom.js
module.exports = (sequelize, DataTypes) => {
  const Kingdom = sequelize.define(
    'Kingdom',
    {
      name: {
        allowNull: false,
        type: DataTypes.STRING,
        unique: true,
        validate: {
          notEmpty: true
        }
      }
    }
  )
  Kingdom.associate = function (models) {
    Kingdom.hasMany(models.Card, {
      as: 'cards',
      foreignKey: 'kingdomId'
    })
  }
  return Kingdom
}
  • Card
// models/card.js
module.exports = (sequelize, DataTypes) => {
  const Card = sequelize.define(
    'Card',
    {
      image: {
        allowNull: false,
        type: DataTypes.STRING,
        validate: {
          isUrl: true,
          notEmpty: true
        }
      },
      name: {
        allowNull: false,
        type: DataTypes.STRING,
        validate: {
          notEmpty: true
        }
      }
    }
  )
  Card.associate = function (models) {
    Card.belongsToMany(models.Class, {
      as: 'classes',
      through: 'CardClasses'
    })
    Card.belongsTo(models.Kingdom, {
      as: 'kingdom',
      foreignKey: 'kingdomId'
    })
    Card.belongsToMany(models.Element, {
      as: 'elements',
      through: 'CardElements'
    })
  }
  return Card
}
  • Element
// models/element.js
module.exports = (sequelize, DataTypes) => {
  const Element = sequelize.define(
    'Element',
    {
      elem: {
        allowNull: false,
        type: DataTypes.STRING,
        unique: true,
        validate: {
          notEmpty: true
        }
      }
    }
  )
  Element.associate = function (models) {
    Element.belongsToMany(models.Card, {
      as: 'cards',
      through: 'CardElements'
    })
  }
  return Element
}
  • Class
// models/class.js
module.exports = (sequelize, DataTypes) => {
  const Class = sequelize.define(
    'Class',
    {
      class: {
        allowNull: false,
        type: DataTypes.STRING,
        unique: true,
        validate: {
          isAlpha: true,
          notEmpty: true
        }
      }
    }
  )
  Class.associate = function (models) {
    Class.belongsToMany(models.Card, {
      as: 'cards',
      through: 'CardClasses'
    })
  }
  return Class
}

4. Update Migration Files

Ensure the migrations match the models. Migrations are essentially stripped-down models (did I really just type that? 👀). They use a simpler interface that dictates how a model is defined in the database (without the fancy validations, getters and setters). For brevity I’ll show just one — essentially, copy over the stripped model attributes.

// migrations/xxxxxxxxxxxx-create-kingdom.js
module.exports = {
  up: (queryInterface, Sequelize) => {
    return queryInterface.createTable(
      'Kingdoms',
      {
        id: {
          allowNull: false,
          autoIncrement: true,
          primaryKey: true,
          type: Sequelize.INTEGER
        },
        name: {
          allowNull: false,
          type: Sequelize.STRING
        },
        createdAt: {
          allowNull: false,
          type: Sequelize.DATE
        },
        updatedAt: {
          allowNull: false,
          type: Sequelize.DATE
        }
      }
    )
  },
  down: (queryInterface, Sequelize) => {
    return queryInterface.dropTable('Kingdoms')
  }
}

Getting the Data 📜

Be an Architect 🚧

Regardless of whatever methodology you employ, think about the process. The architecture of the ad-hoc methodology I present is a pipeline. That is to say, decompose the process of going from raw data to a usable database as a series of functions. The output of each step is the input to the next.

I didn’t fully automate this pipeline, as this is my first rodeo in a hobby project. But the point is that it can be automated once the process is reasoned about. Trying to do everything in a migration or seed file will likely lead to errors and nightmares. Be an architect and decompose your problem!
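As a toy illustration of the pipeline idea (the stage names here are hypothetical, not the actual scripts from this project), each stage is a plain function whose output feeds the next:

```javascript
// A minimal sketch of the pipeline architecture: compose small
// stages, where each stage's output is the next stage's input.
const pipeline = (...stages) => (input) =>
  stages.reduce((data, stage) => stage(data), input)

// Hypothetical stages standing in for: parse HTML -> build
// records -> serialize to a seed-ready JSON file.
const parseRows = (html) => html.trim().split('\n')
const toRecords = (rows) => rows.map((r) => ({ name: r }))
const toJSON = (records) => JSON.stringify({ cards: records })

const run = pipeline(parseRows, toRecords, toJSON)
```

Each stage stays small and testable on its own, which is exactly what makes the whole thing automatable later.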

Don’t make fun of my funnel — made with Balsamiq™ arrows

Request the HTML 📝

The data can definitely be obtained dynamically using AJAX, but I wanted a local copy to work with — at least for my first time. I felt more comfortable inspecting it locally so I could open the HTML in my text editor and determine how the data was distributed in the DOM.

A future iteration should work entirely from an AJAX request. This way a server could run this entire process automatically as a cron job and update tables according to any changes from the vendor (like adding new cards to the game!). Be wary though — if the vendor changes how the HTML is structured, or even things like style classes, our script may fail!

Keeping a Local Copy with POSTMAN 🚀

I saved a response to a GET request in Postman:

Save the document locally (and refer back later without issuing more requests) using Postman

Analyzing the Document 🔬

After inspecting the document for the area of interest, I found the following table structure where all of the relevant data can be found:

Scraper.js — A little script to grab the data of interest ❕

I wanted to grab and reformat the areas of interest. Namely:

  • data-id is the id of the card. We want our db ids to match theirs.
  • data-name is the name of the card
  • data-elems is white-space separated card elements
  • There is a <td> with the Kingdom (forest)
  • There is a <td> with the Classes, separated by ‘-’ (archer, rogue)

First, load Cheerio with the HTML and initialize the object we’re going to write our data to. If you’re doing this dynamically, replace the fs logic with a request using a library of your choice (I prefer axios).

const fs = require('fs')
const cheerio = require('cheerio')

console.log('Reading File fooTable.html...')
const parsedHTML = fs.readFileSync('./fooTable.html', 'utf8')

// load the html into cheerio.
// normalizeWhitespace replaces all whitespace with single spaces
const $ = cheerio.load(parsedHTML, {
  normalizeWhitespace: true
})

// we'll write all of our data to this obj
const cardJSON = {
  cards: []
}

Now we have to decide how we’re going to grab the data from each row of the HTML table. Cheerio allows us to iterate as documented here. So it may be useful to iterate over each <tr>, noting that each has the class .card-row.

$('.card-row').each((i, elem) => {
  let row = {}

  // do stuff here

  // add the result to our output obj
  cardJSON.cards.push(row)
})

First, let’s grab the data attributes. Logging elem, it becomes clear that they live in an object called attribs:

row.id = elem.attribs['data-id']
row.elems = splitElems(elem.attribs['data-elems'])
row.name = elem.attribs['data-name']
row.image = `${ASSET_URL}/${elem.attribs['data-id']}.jpg`

splitElems is just a simple helper function to ensure we get a clean array of elements (script source at end).
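The real helpers live in the full script; as a rough sketch (my guess at their behavior based on the data format described earlier, not the actual source), splitElems and its sibling splitClasses might look like:

```javascript
// Hedged sketches of the helper functions, reconstructed from the
// described formats: data-elems is whitespace-separated, classes
// are '-' separated.

// 'fire  water' -> ['fire', 'water']
const splitElems = (str) =>
  str.trim().split(/\s+/).filter(Boolean)

// 'archer - rogue' -> ['archer', 'rogue']
const splitClasses = (str) =>
  str.split('-').map((s) => s.trim()).filter(Boolean)
```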

Now, let’s grab the Kingdom and Classes. These <td>s do not have a class or id handy, so let’s use some selector magic. We notice there is an element with the class ‘ellipsis’ that both are located next to.

How do we grab this ‘ellipsis’ without grabbing all of the ones in the entire document!? Cheerio selectors take an optional context argument. Simply use elem, the current element in our loop (the <tr>!), as the context.

$( selector, [context], [root] )

selector searches within the context scope, which searches within the root scope. selector and context can be a string expression, DOM element, array of DOM elements, or cheerio object. root is typically the HTML document string.

// select the first <td> after .ellipsis, in the context of elem
const kingdomNode = $('.ellipsis', elem).next('td')
// select the first <td> after the above
const classNode = $(kingdomNode).next('td')

row.kingdom = kingdomNode.text()
row.classes = splitClasses(classNode.text())

Finally, I decided to write this data to a JSON file — this is a formatted, pure source of our final data with a predictable, sensible structure. It should be source controlled and can be further modified.

// Lastly - write the damn object to file
console.log('writing contents to cards.json')
const content = JSON.stringify(cardJSON)
fs.writeFile('./cards.json', content, 'utf8', (err) => {
  if (err) {
    return console.log(err)
  }
  console.log('file saved')
})

Modifications (optional) ☑️

An example of further modifications (for more serious projects) is to alter the properties of kingdom and classes. Currently they just hold the string names we scraped. But what if there was a typo? Are names as robust as ids for selection in our database?

Ideally we go from this structure:

{
  "cards": [
    {
      "id": "1234",
      "elems": [
        "water",
        "fire"
      ],
      "name": "<card_name>",
      "image": "http://foo/assets/1234.jpg",
      "kingdom": "trumpland",
      "classes": [
        "dragon"
      ]
    }
  ]
}

to something like this (note the changes to elems, kingdom and classes):

{
  "cards": [
    {
      "id": "1234",
      "elems": [
        {
          "id": 1,
          "name": "water"
        },
        {
          "id": 2,
          "name": "fire"
        }
      ],
      "name": "<card_name>",
      "image": "http://foo/assets/1234.jpg",
      "kingdom": {
        "id": 1,
        "name": "trumpland"
      },
      "classes": [
        {
          "id": 1,
          "class": "dragon"
        }
      ]
    }
  ]
}
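One way to perform that transformation. This is a sketch under the assumption that we have (or hard-code) name-to-id lookup tables for the reference data; the table contents below are purely illustrative:

```javascript
// Hypothetical name -> id lookup tables; in practice, build these
// from the already-seeded Elements/Kingdoms/Classes rows.
const elemIds = { water: 1, fire: 2 }
const kingdomIds = { trumpland: 1 }
const classIds = { dragon: 1 }

// Replace the scraped name strings with { id, name }-style objects,
// matching the target structure above.
const enrich = (card) => ({
  ...card,
  elems: card.elems.map((name) => ({ id: elemIds[name], name })),
  kingdom: { id: kingdomIds[card.kingdom], name: card.kingdom },
  classes: card.classes.map((name) => ({ id: classIds[name], class: name }))
})
```

Run every card through enrich before writing the JSON file and the ids travel with the data from then on.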

Here’s the script in its entirety:


Seeding 🌱

Let’s assume the database already contains the data for Elements, Classes and Kingdoms, since these are fairly trivial to insert.

Currently we have a file (cards.json) that holds the formatted attributes of each card that we scraped previously. Now we just need to populate our database with these Card attributes as well as their associations to Elements, Classes and Kingdoms.

Create Seed File

sequelize seed:generate --name cards

Using our JSON Data

The trick here is working with sequelize’s queryInterface. We’re limited in power compared to working with Models in a sync’d instance. We instead have to use bulkInsert, which takes an array of plain objects.

To create this array, let’s just require our JSON data and iterate through it. We have to manually assign the createdAt and updatedAt timestamps, unlike when using Model.create, which does this for us.

module.exports = {
  up: (queryInterface, Sequelize) => {
    const Cards = require('../.data/cards.json').cards
    let cardsArray = []
    Cards.forEach((card) => {
      cardsArray.push({
        id: card['id'],
        name: card['name'],
        image: card['image'],
        createdAt: new Date(),
        updatedAt: new Date()
      })
    })
    return queryInterface.bulkInsert('Cards', cardsArray)
  },

  down: (queryInterface, Sequelize) => {
    return queryInterface.bulkDelete('Cards', null)
  }
}

Seeding is just a matter of running:

sequelize db:seed --seed <name of seed file>

But Wait, what about Kingdom, Classes and Elements? ❔

Good question. These involve join tables and foreign keys, which creates a bit of a sticky situation. If working with Sequelize Models, you can easily create with associations by simply using include. That luxury doesn’t exist with queryInterface — you’re down to raw queries at that point.

There are at least two ways to avoid complicated raw queries and long seeding logic. The method I used is the ‘lazy’ one, but let’s discuss both.

Option 1 — Lazy Way 👌

The problem we’re facing is that we don’t have an intuitive way to reference foreignKeys or join tables from seed functions using queryInterface. So we could eliminate that problem by creating a script that does have access to the Models interface through use of sync().

We can create a simple one-off script that uses a db connection to utilize the full power of models:

Run

node associate_db.js

🎉 And voilà, we have a fully seeded development database, ready to use, render, export, or scroll through in pgAdmin — whatever you fancy!

This clearly works in development, but we cannot open a sync connection to a production server. The point is that we won’t need to. The entire purpose of this script is simply to generate the join tables and assign foreign keys correctly in our development database. At that point, just create backups and exports of our development database that we can use for production seeding.

Option 2 — The ‘Right?’ (Long) Way ✔️

  • modify JSON file to add proper id’s instead of names (already discussed)
  • Create seeds for Join Tables

I didn’t take this route because it would take longer for what is a very low-risk project. But it would certainly feel like less of a hack and can be used to seed a production database. So let’s go through the process for a single join table to see how it can be done.

Creating seeds for Join Tables 🔗

Before running sync(), join tables don’t exist in our development database unless we define models/migrations for them explicitly (not recommended). Thus in order to seed them, we need to create the tables.

For the sake of simplicity, let’s assume every card has a class with id=1

sequelize seed:generate --name cardClasses

Creating then Seeding:

We just have to make use of queryInterface.createTable before bulkInsert. Note the async/await that was added.

// xxxxxxxxxxxxxx-cardClasses.js
module.exports = {
  up: async (queryInterface, Sequelize) => {
    await queryInterface.createTable('CardClasses', {
      ClassId: {
        references: {
          key: 'id',
          model: 'Classes'
        },
        type: Sequelize.INTEGER
      },
      CardId: {
        references: {
          key: 'id',
          model: 'Cards'
        },
        type: Sequelize.INTEGER
      },
      createdAt: {
        allowNull: false,
        type: Sequelize.DATE
      },
      updatedAt: {
        allowNull: false,
        type: Sequelize.DATE
      }
    })

    let joinArray = []
    const Cards = require('../.data/cards.json').cards
    Cards.forEach((card) => {
      joinArray.push({
        CardId: card['id'],
        ClassId: 1, // for simplicity
        createdAt: new Date(),
        updatedAt: new Date()
      })
    })
    return queryInterface.bulkInsert('CardClasses', joinArray)
  },
  down: (queryInterface, Sequelize) => {
    return queryInterface.bulkDelete('CardClasses', null)
  }
}

Of course, an inner loop is needed to get the actual class ids instead of hard-coding a ‘1’, but the point of this demonstration was to show that we need to create the table first and then insert into it.
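For completeness, that inner loop might look something like this. A sketch assuming we already know (or have queried) the name-to-id mapping for classes; the map contents are illustrative:

```javascript
// Hypothetical name -> id map for Classes; in practice, hard-code it
// alongside the class seeds or query it up front.
const classIds = { archer: 1, rogue: 2, asshat: 3 }

// Build one join-table row per (card, class) pair.
function buildJoinRows (cards) {
  const joinArray = []
  cards.forEach((card) => {
    card.classes.forEach((cls) => {
      joinArray.push({
        CardId: card.id,
        ClassId: classIds[cls],
        createdAt: new Date(),
        updatedAt: new Date()
      })
    })
  })
  return joinArray
}
```

The resulting array drops straight into the bulkInsert('CardClasses', …) call above.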

Wrap up and Conclusions

If you’ve managed to stick it out, this post demonstrated an ad-hoc methodology of scraping data and seeding a PostgreSQL db with Cheerio and Sequelize. The methodology can be automated with some extra tweaks.

Perhaps you don’t work with this exact tech stack or have a different methodology. These are the general conclusions to be made:

  • Congruency (avoid duplication) — Be congruent with the source data. In our case, we made sure our ids match theirs, and used their image assets.
  • Pipeline Architecture — Develop a series of functions or scripts that can reliably produce small changes to go from A to B. Organize them to your/your team’s preference. Reason about the input and output of each step in a way that’s consistent, reliable and predictable.
  • Transform — Transform the vendor data in a sensible way that can be source controlled and compared across iterations. In our case, we made a JSON file, cards.json. By doing so, we have a consistent, reliable source with a more intuitive structure to work with.
  • Selectors — Utilize the power of your selectors. The data you are parsing won’t always have logical ids, classes and attributes. In what was my first time ever using Cheerio, I made use of the optional context argument to locate elements that had no attributes, ids or classes, in very few lines of code.

Lastly — this is not only the first technical article I’ve ever produced, but also my first Medium article! Thanks for reading! 💖 ✨