Today’s goal is to scrape some data out of an HTML page and to structure the output smartly so we can save it straight into a hypothetical database.
Companies List Page
We’ve got a list of 2 companies to extract.
The HTML might look a bit crappy, and that’s on purpose: it keeps us closer to what you find in reality.
Data we want
Next we need to organize our data. As you can see below, we want to structure it logically from the start rather than mirror the way the HTML lays it out.
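Here is a rough sketch of that target shape. The page itself isn’t reproduced here, so every field name other than companies, employee, email and phone is my assumption, and the values are empty placeholders.

```javascript
// Target shape for the scraped data. Field names beyond "companies",
// "employee", email and phone are assumptions made for illustration.
const wanted = {
  companies: [
    {
      name: '',          // company name
      description: '',   // short tagline
      url: '',           // company website
      contact: {
        telephone: '',
        // "employee" is a logical grouping, not an element in the HTML
        employee: {
          name: '',
          jobTitle: '',
          email: ''
        }
      }
    }
  ]
};

console.log(Object.keys(wanted.companies[0]));
```

Writing the shape down first means the scraping code only has to fill in the blanks.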
Now we want to work out which CSS selectors will let us target each element of our structure. Some tools exist to help, like the great SelectorGadget for Chrome.
Here, we’ve got a pretty simple structure. We could do something as follows:
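A minimal sketch of such a mapping, with one selector per field. The class and itemprop names are hypothetical, since the page’s markup isn’t shown in this article; swap in whatever SelectorGadget gives you.

```javascript
// Hypothetical selectors: the page's real markup is not shown in this
// article, so these class and itemprop names are placeholders.
const selectors = {
  companies: '.item',                       // one match per company
  name: '.header [itemprop=name]',
  description: '.header [rel=description]',
  url: '.header [itemprop=name]',           // we would read its href attribute
  contact: '.contact',
  telephone: '[itemprop=telephone]',
  employee: {                               // grouping only, no selector of its own
    name: '[itemprop=employeeName]',
    jobTitle: '[itemprop=employeeJobTitle]',
    email: '[itemprop=email]'
  }
};

console.log(selectors);
```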
Notice, for example, that employee doesn’t have a selector. That’s because we focus on making sense of the data, and employee isn’t represented by any element in the HTML.
Get the data
Now, if you wanted to do it only with cheerio, you would end up with something looking like this:
As you can see, we get our 2 companies in an array. The data is pretty dirty though: a lot of whitespace is still there, the email contains “Email:”, the phone contains “Phone:”, and they’re not rendered very nicely.
You can then clean this data afterwards, or add some more code to the one above to clean it during extraction. But never mind, I’ll show you something magic now.
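Before moving on, here is roughly what that manual cleanup could look like. `cleanText` is a hypothetical helper of mine, not part of cheerio: it collapses whitespace and strips a leading label such as “Email:”.

```javascript
// Hypothetical helper (not a cheerio function): collapse runs of
// whitespace, trim, and drop a leading "Label:" prefix if present.
function cleanText(raw, label) {
  let text = raw.replace(/\s+/g, ' ').trim();
  if (label && text.toLowerCase().startsWith(label.toLowerCase() + ':')) {
    text = text.slice(label.length + 1).trim();
  }
  return text;
}

console.log(cleanText('  Email:   jane@acme.example \n', 'Email'));
// prints "jane@acme.example"
console.log(cleanText('Phone: 01 22 33 44 55', 'Phone'));
// prints "01 22 33 44 55"
```

It works, but you would have to call it on every field, which is exactly the kind of repetition we’d like to avoid.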
jsonframe is a plugin which extends cheerio’s functionality. It actually adds only one function: .scrape(frame). Feel free to check out the GitHub repo or the npm package to see examples and options.
jsonframe lets you pass in a JSON frame, scrapes the structured data listed in it, and outputs an already well-structured JSON object or file, ready to save to your database.
We start by writing a JSON frame with the data structure we’re looking for, plus extra parameters to specify what we want and how.
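A sketch of such a frame. The key names (`selector`, `data`, `attr`, `type`) are what I recall from the jsonframe-cheerio README and may differ in the version you install, so check them against the docs; the CSS selectors are hypothetical placeholders. The block only builds the frame, and the scrape call is shown in comments because it needs both packages installed.

```javascript
// Sketch of a frame. Key names ("selector", "data", "attr", "type") are
// recalled from the jsonframe-cheerio README and should be verified;
// the CSS selectors are hypothetical placeholders.
const frame = {
  companies: {
    selector: '.item',   // each match becomes one company
    data: [{             // an array here asks for a list of items
      name: '.header [itemprop=name]',
      description: '.header [rel=description]',
      url: { selector: '.header [itemprop=name]', attr: 'href' },
      contact: {
        selector: '.contact',
        data: {
          telephone: { selector: '[itemprop=telephone]', type: 'telephone' },
          employee: {    // grouping only, no selector of its own
            name: '[itemprop=employeeName]',
            jobTitle: '[itemprop=employeeJobTitle]',
            email: { selector: '[itemprop=email]', type: 'email' }
          }
        }
      }
    }]
  }
};

// Assumed usage once cheerio and the plugin are installed:
//   const $ = require('cheerio').load(html);
//   require('jsonframe-cheerio')($);        // registers .scrape()
//   const result = $('body').scrape(frame); // object with a companies array
console.log(JSON.stringify(frame, null, 2));
```

The frame mirrors the output we want, so the selectors live right next to the fields they fill.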
As you can see, we get super clean data in the output. We could now loop through the companies array and save our data directly to our database.
Keep in mind that the HTML page here is fairly simple. When it comes to huge sets of data, keeping a tree view with the JSON object helps. Of course, if it becomes huge with a lot of sub-levels, you could write several smaller JSON objects and assemble them into a final one, which makes the frame easier to write.
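For instance, the splitting idea could be sketched like this. It is plain JavaScript object composition, nothing specific to the plugin; the key names and selectors are hypothetical and should be checked against the jsonframe documentation.

```javascript
// Small frames written separately, then assembled into one.
// Key names and selectors are assumptions made for illustration.
const employeeFrame = {
  name: '[itemprop=employeeName]',
  jobTitle: '[itemprop=employeeJobTitle]',
  email: { selector: '[itemprop=email]', type: 'email' }
};

const contactFrame = {
  selector: '.contact',
  data: {
    telephone: { selector: '[itemprop=telephone]', type: 'telephone' },
    employee: employeeFrame
  }
};

const companyFrame = {
  name: '.header [itemprop=name]',
  description: '.header [rel=description]',
  contact: contactFrame
};

// The final frame just references the smaller pieces.
const frame = {
  companies: { selector: '.item', data: [companyFrame] }
};

console.log(JSON.stringify(frame, null, 2));
```

Each piece stays short enough to read at a glance, and the same sub-frame can be reused wherever that structure appears.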
To learn more about the jsonframe plugin, feel free to check out the documentation.
(This tutorial only covered the actual extraction part. If you want to go further, feel free to let me know and I could write a more advanced tutorial using it with axios or even with headless browsers like nightmare.js.)
Thanks for reading.
I built this plugin because I needed it for my work, to extract complex data structures. It’s easy to get lost in code with lots of sub-nodes. I find the JSON frame easy, quick and trouble-free.
I crawl the web to scrape data for startups and big companies around the world. From scraping highly secured websites to handling huge amounts of data (millions of records), I should be able to give you a hand: email@example.com.
I’m also co-founding a platform to smartly empower French startups with growth hacking. If you are interested in selling your skills as a “growth hacker”, don’t hesitate to contact me.
Wanna go further? Check out the next article: 30s to scrape Producthunt!