Scrape ProductHunt in 30 seconds
(If you’re new to cheerio & jsonframe, you might want to start with the previous article: Scraping data in 3 minutes with Javascript.)
Prerequisites:
- Node.js and npm installed
- A basic understanding of JavaScript, HTML & CSS
I’m about to walk you through the process at light speed so you can run the 10 magical lines of code in about 30 seconds.
Content to scrape
We want to get the latest products published on the front page of ProductHunt. Say we want to build a small API on top of that data and use it in our own projects (or you could simply use the PH API 🙄).
Structure the data
“Latest products” means that we need to extract a list of products.
A simple representation of a product could be:
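As a rough sketch, a product could be represented by a plain JavaScript object like the one below. The field names (name, description, upvotes) and the sample values are assumptions chosen for illustration, not real ProductHunt data; pick whichever fields you actually want to expose.

```javascript
// Hypothetical shape of one product; field names and values are illustrative only.
const exampleProduct = {
  name: 'Example product',
  description: 'A one-line tagline describing the product',
  upvotes: 123
};

// The "latest products" are then simply a list of such objects.
const exampleOutput = { products: [exampleProduct] };

console.log(JSON.stringify(exampleOutput, null, 2));
```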
Add the selectors
This is probably where most people fail: finding the right CSS selectors. A selector is the address of the element(s) you want on the page; it can match a single element or a whole list of them.
You only really need the developer tools of your web browser to get there. If things get really tricky, you can get some help from a tool like SelectorGadget.
You quickly end up with the following selectors.
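For illustration, here is how selectors could be attached to a product structure, assuming jsonframe’s short keys as I understand them from its documentation (_s for the selector of the repeated element, _d for the data to extract from each match). The class names below are made-up placeholders, not ProductHunt’s real markup: replace them with whatever you find in your browser’s developer tools.

```javascript
// Sketch of a jsonframe frame. Every selector here is a made-up placeholder.
const frame = {
  products: {
    _s: '.product-item',               // hypothetical: one match per product
    _d: [{                             // an array means "extract a list"
      name: '.product-name',           // hypothetical selector
      description: '.product-tagline', // hypothetical selector
      upvotes: '.vote-count'           // hypothetical selector
    }]
  }
};

console.log(Object.keys(frame.products._d[0]));
```

The frame mirrors the shape of the output you want, with each value replaced by the selector that locates it.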
OK, PH’s HTML is pretty crowded, so if you’re not experienced, I’d understand if it takes you a little longer than it took me 😜.
Setup the request to PH
Create a new folder and add a JS file named, for example, ph.js. Then open your terminal and navigate to that folder. Generate a new package.json with the npm init command (press Enter at each prompt).
You can then install the required packages with the following command: npm install axios cheerio jsonframe-cheerio (jsonframe is published on npm as jsonframe-cheerio).
We’re now ready to start our small script.
Let’s set up the request to http://producthunt.com with axios (a promise-based HTTP client).
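A minimal sketch of that request could look like the following. The function name fetchHtml and the optional client parameter are my own additions for illustration (the second argument just makes the function easy to try without network access); by default it falls back to axios.

```javascript
// Fetch the raw HTML of a page. The HTTP client is injectable so the function
// can be exercised offline; by default it uses axios.
function fetchHtml(url, client) {
  const http = client || require('axios'); // loaded lazily, needs `npm install axios`
  return http.get(url).then(function (response) {
    return response.data; // axios exposes the response body as `data`
  });
}

// Usage (performs a real network request):
// fetchHtml('http://producthunt.com')
//   .then(function (html) { console.log(html.length); })
//   .catch(function (err) { console.error('Request failed:', err.message); });
```

Keeping the request in its own small function means the parsing step only ever deals with an HTML string, whatever its origin.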
Scrape the list of products
We simply need to parse the HTML with cheerio, plug in the jsonframe plugin and let the magic happen, thanks to the JSON frame we built from the data structure and selectors defined earlier.
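Here is a hedged sketch of that parsing step, run on a tiny stand-in page instead of the real front page. The function name scrapeProducts, the sample HTML and the class names are placeholders I made up; the calls themselves (cheerio.load, jsonframe($), .scrape(frame)) follow the usage described in jsonframe’s documentation as I understand it.

```javascript
// Parse HTML with cheerio and extract data according to a jsonframe frame.
// Requires: npm install cheerio jsonframe-cheerio
function scrapeProducts(html, frame) {
  const cheerio = require('cheerio');
  const jsonframe = require('jsonframe-cheerio');

  const $ = cheerio.load(html);   // build a queryable document
  jsonframe($);                   // plugs the scrape() method into cheerio
  return $('body').scrape(frame); // result is shaped like the frame
}

// Tiny stand-in page and frame; the class names are placeholders, not PH's markup.
const sampleHtml =
  '<div class="product-item"><h3 class="product-name">Demo</h3></div>';
const sampleFrame = {
  products: { _s: '.product-item', _d: [{ name: '.product-name' }] }
};

// Usage:
// console.log(JSON.stringify(scrapeProducts(sampleHtml, sampleFrame), null, 2));
```

With the real page, you would pass the HTML returned by the axios request and your full frame instead of the samples.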
Why I made jsonframe is simple: I wanted to define the output data structure right from the start of the request. I also wanted to avoid getting lost in dozens of lines of code walking through HTML nodes. You can learn more in this previous article: Scraping data in 3 minutes with Javascript.
Well done! 🎇🎉🚀
We’ve done it in 30 seconds! Hmm…
Oh well, I guess I needed a bit more time to get you properly through the steps 😉
But you’ll be able to do it in 30 seconds flat with some practice 💪
Btw, here is how the output looks:
Neat, right?
Of course, it would make more sense to also get, for example, the page link for each product. Would you be able to do that for me? Challenge accepted? 🎯
Feel free to share your repositories with your solution to the problem. You could also play with a website other than ProductHunt (not sure Ryan Hoover will be happy about that, sorry Ryan 😘🤙).
I crawl the web to scrape data for startups and big companies around the world. From scraping highly secured websites to handling huge amounts of data (millions), I should be able to give you a hand: gabin@datascraper.pro.
I’m also co-founding a platform to smartly empower French startups with growth hacking. If you’re interested in selling your “growth hacker” knowledge, don’t hesitate to contact me.