Muninn, The Cheerio Wrapper

Aykut Kardaş
Published in wopehq
Feb 5, 2024


What is Cheerio?

Cheerio is an HTML and XML parsing library written in TypeScript. Besides being fast, it implements a subset of the jQuery API, so extracting and manipulating data from a parsed document is easy and flexible.

Why do we need Muninn?

So why did we need an extra layer instead of using Cheerio directly, and why did we write Muninn, our open-source Cheerio wrapper?

Wope’s Rank Tracker feature parses Google search results daily for different keywords, processes the data, hosts it, and presents it to the user in a meaningful way. However, Google shows many types of features in its search results, uses different HTML structures for them, and changes its styling over time. Each of these changes means the selectors we use for parsing have to be updated constantly.

What is Muninn?

We created Muninn to manage these updates as smoothly as possible. I focused on a few goals while designing it:

  • Create separate config files for different parsing structures.
  • Host and update parse configs independently of the application code.
  • Make it easy for non-technical users to create and update configs.
  • Ensure each config mirrors the shape of the JSON output it produces.

The result was Muninn: flexible, powerful, and very small. Although we developed and documented the project as open source, we never officially announced it. Over time, we kept updating and improving it to fit our own use cases.

Wope has a more advanced parsing tool that uses Muninn as its core. This tool is customized for Google and has been modified to meet our specific needs.

Muninn Config Structure

For example, a Muninn config file written to parse a product page on Amazon would look like this:

export const amazonProductConfig = {
  schema: {
    title: '#productTitle',
    price: '#priceblock_ourprice',
    rating: {
      selector: '#acrPopover span | float',
      regex: /\d+\.?\d?/
    },
    features: {
      selector: '#productOverview_feature_div tr.a-spacing-small | array',
      schema: {
        name: 'td:nth-child(1)',
        value: 'td:nth-child(2)'
      }
    }
  }
}
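
The `rating` rule in this config combines a selector, a regex, and a `float` cast. To make that chain concrete, here is a minimal sketch of what it amounts to once the selector has returned the element's text. This is not Muninn's source code: `extractFloat` is a hypothetical helper, and the sample text is an assumption about what the rating element might contain.

```typescript
// Conceptual sketch only, not Muninn's implementation.
// `extractFloat` is a hypothetical helper: it applies a regex to some text
// and converts the first match to a number, roughly what a rule such as
// { selector: '... | float', regex: /\d+\.?\d?/ } is assumed to do.
function extractFloat(text: string, pattern: RegExp): number | null {
  const match = text.match(pattern);
  if (match === null) {
    return null; // the pattern did not match anything in the text
  }
  const value = parseFloat(match[0]);
  return isNaN(value) ? null : value;
}

// Assumed text content of the rating element.
const ratingText = '4.9 out of 5 stars';
const rating = extractFloat(ratingText, /\d+\.?\d?/);
console.log(rating); // 4.9
```

Keeping the regex next to the selector means the clean-up step lives in the config itself, so a wording change on the page can be handled by editing the config rather than the parsing code.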

Usage example:

import { parse } from 'muninn';
import { amazonProductConfig } from './configs/amazon-product-config';

// `data` is the HTML content of the page, as a string.
// https://www.amazon.com/AMD-Ryzen-3700X-16-Thread-Processor/dp/B07SXMZLPK/
const data = '<html>...</html>';

const result = parse(data, amazonProductConfig);

And the parsing output of this config would look something like this:

{
  "title": "AMD Ryzen 7 3700X 8-Core",
  "price": "$308.99",
  "rating": 4.9,
  "features": [
    {
      "name": "Brand",
      "value": "AMD"
    },
    {
      "name": "CPU Model",
      "value": "AMD Ryzen 7"
    },
    {
      "name": "CPU Speed",
      "value": "4.4 GHz"
    },
    {
      "name": "CPU Socket",
      "value": "Socket AM4"
    },
    {
      "name": "Processor Count",
      "value": "8"
    }
  ]
}

As you can see, the config and the output naturally resemble each other. Thanks to this intuitive design and its gentle learning curve, various people within Wope were able to adapt quickly and update the parse configs with ease.
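
To illustrate that similarity, here is a rough sketch that predicts the shape of the output from a config alone, without parsing any HTML. Everything in it is hypothetical and for illustration only: `Rule`, `Schema`, `ruleShape`, and `outputShape` are not part of Muninn, and the handling of `|` is a simplification that only knows the `float` and `array` modifiers used in the example config.

```typescript
// Hypothetical sketch: none of these names belong to Muninn's API.
type Rule = string | { selector?: string; regex?: RegExp; schema?: Schema };
interface Schema { [key: string]: Rule }

// Predict the shape of a single output value from its rule.
function ruleShape(rule: Rule): unknown {
  const selector = typeof rule === 'string' ? rule : (rule.selector ?? '');
  // Simplification: treat everything after a '|' as a list of modifiers.
  const modifiers = selector.split('|').slice(1).map((m) => m.trim());
  let shape: unknown = 'string';
  if (typeof rule !== 'string' && rule.schema) {
    shape = outputShape(rule.schema); // a nested schema becomes a nested object
  } else if (modifiers.indexOf('float') !== -1) {
    shape = 'number';
  }
  return modifiers.indexOf('array') !== -1 ? [shape] : shape;
}

// Walk a schema and build an object with the same keys as the parsed output.
function outputShape(schema: Schema): Record<string, unknown> {
  const shape: Record<string, unknown> = {};
  for (const key of Object.keys(schema)) {
    shape[key] = ruleShape(schema[key]);
  }
  return shape;
}

// The schema from the Amazon example config.
const exampleSchema: Schema = {
  title: '#productTitle',
  price: '#priceblock_ourprice',
  rating: { selector: '#acrPopover span | float', regex: /\d+\.?\d?/ },
  features: {
    selector: '#productOverview_feature_div tr.a-spacing-small | array',
    schema: { name: 'td:nth-child(1)', value: 'td:nth-child(2)' }
  }
};

console.log(JSON.stringify(outputShape(exampleSchema)));
// {"title":"string","price":"string","rating":"number","features":[{"name":"string","value":"string"}]}
```

The predicted keys and nesting line up with the JSON output shown earlier, which is why reading a config is usually enough to know what the parser will return.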

Wope maintains over 300 config files in this way, and each one is covered by tests. This is how we stay stable against changing HTML and ship updates quickly.

If you want to learn more about Muninn, you can check out the links below.

Happy Coding!
