Data Science | Web Scraping

The Only Web Scraping Tool you need for Data Science

Scrape Elon Musk's tweets, Data Science job postings, and YouTube comments, all with no code.

Benedict Neo
Nerd For Tech

Photo by Dmitry Chernyshov on Unsplash

Data is a valuable resource across all sectors. Generally, we can answer more questions, make more data-driven decisions, or train a better model with more data.

The problem is, good data is hard to find. The solution? Web scraping.

What is web scraping?

Web scraping extracts information from websites, taking it from its unstructured form and turning it into a structured, usable format (such as a spreadsheet or CSV file) for later analysis.

Use cases of web scraping

Web scraping is useful for many purposes.

Below are a few examples:

  • Marketing: Competitor monitoring, lead generation, SEO monitoring
  • E-commerce: Price Intelligence
  • Finance: Aggregate financial news
  • Personal: Find the best hotel for traveling, gathering job postings
  • Data Science: Improve models and experiments with more data

It can apply to many sectors, and the use cases are almost endless.

How to scrape the web

The question now is how one can start scraping the web.

There are two options:

  1. Writing code (e.g., Python and BeautifulSoup)
  2. Using no-code tools.
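To give a sense of what the first option involves, here is a minimal sketch in Python. It uses the standard library's html.parser as a stand-in for BeautifulSoup so that it runs without extra installs. The HTML snippet and the LinkExtractor class are made up for illustration; a real scraper would first download the page over HTTP.

```python
from html.parser import HTMLParser

# Hypothetical page snippet; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="job"><a href="/jobs/1">Data Science Intern</a></li>
  <li class="job"><a href="/jobs/2">Machine Learning Intern</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collects (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag we are currently inside
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_links(html):
    """Turn raw HTML into a list of (link text, link target) tuples."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


for title, href in extract_links(SAMPLE_HTML):
    print(title, href)
```

Even for this tiny page, you have to know the HTML structure and write the parsing logic yourself, which is the work a no-code tool takes off your hands.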

Each has its pros and cons, but the biggest advantage of no-code tools is that you spend less time writing code to extract data and more time on statistical analysis and experimentation.

So, why not go for no-code tools? In just a few clicks, you can access much of the data that's publicly available on the internet.

There are tons of no-code web scraping tools out there, but many lack the robust features or customizations that make web scraping a breeze.

Introducing Octoparse!

What is Octoparse 🐙?

Photo by Stephanie Harlacher on Unsplash

In a few words, Octoparse is:

Easy Web Scraping for Anyone

As with all no-code tools, Octoparse allows you to point, click, and extract without coding.

I've used many web scraping tools before, but Octoparse is on another level with its advanced features.

Octoparse comes with tons of useful features built in:

  • A free web app with a built-in browser
  • Scraping of websites with infinite scrolling, drop-downs, and log-in authentication
  • A cloud platform to scrape 24/7
  • Scheduling, so your web crawlers can scrape at any time
  • Auto-export of data to databases or other platforms (e.g., Google Sheets) using Zapier
  • Anonymous scraping with automatic IP rotation
  • Pre-built templates for scraping websites
  • And many more!

More on the website

Don't just take my word for it.

Let's see Octoparse in action!

3 Scraping examples

To showcase what Octoparse can do, we'll be doing three scraping tasks.

  1. Elon Musk's tweets
  2. Indeed job posts for Data Science internships
  3. Comments from a YouTube video

Let's start!

Scraping Elon Musk's tweets

Let's say you want to train an AI that tweets like Elon Musk. To do that, you first need his tweets.

So, let's see how Octoparse can easily scrape Elon's tweets for us.

Templates

Octoparse home page

On the home page, you can view the task templates for scraping popular sites.

Click on Twitter, and you'll see a couple of templates you can use for Twitter.

We'll be using "Author Page" since we want an author's posts only.

Twitter Author Page template

Going into the template, you can see descriptions of how to use the template, along with sample data for that specific template.

Providing information to templates

To use it, click "Try It" and you'll be brought to a page that lets you enter the necessary information for scraping.

parameters for template

When you're done, click "Save & Run" at the bottom left.

Running the tasks

You'll have two options: running locally or on Octoparse's cloud platform.

running task on Octoparse

After running the task, go to your dashboard, and you'll see the task running.

After it's completed, you can now view your data!

And Voila! Our data is ready!

Exporting data

Now to export the data!

If you want the data right away, you can download it in one of the available export formats.

But if you want to automatically save it to Google Drive or Google Sheets, email the file to someone, and much more, Octoparse now allows you to auto-export with Zapier!

If you don't know what Zapier is, it's a tool that automates repetitive work by moving information between all kinds of web applications.

Zapier + Octoparse

First, you set up the trigger, which fires when a new document is processed. The trigger app is Octoparse, so you'll be asked to log in to your Octoparse account.

After that, you'll choose the trigger event, which is "New Document Processed", and select the "Elon Musk Tweets" task.

Trigger in Zapier

Next, you set up the action: choose your account, pick what should happen (such as creating a new folder and uploading files), select the file extension, and so on.

Action in Zapier

After running the test action, the CSV file will be in your Google Drive!

View my Zap for more details

Here's the data for Elon Musk's tweets you can interact with.
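Once the CSV is exported, loading it for analysis could look something like the sketch below. It only uses Python's built-in csv module. The column names (text, likes, posted_at) and the placeholder rows are assumptions for illustration, not the template's actual output, so adjust them to match your export.

```python
import csv
import io

# Stand-in for the exported file. The column names and rows are hypothetical;
# with a real export you would use open("your_export.csv", newline="") instead.
SAMPLE_CSV = """text,likes,posted_at
example tweet one,120,2021-10-01
example tweet two,45,2021-10-02
"""


def load_rows(csv_file):
    """Read an exported CSV into a list of dicts, converting likes to int."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["likes"] = int(row["likes"])
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
most_liked = max(rows, key=lambda r: r["likes"])
print(most_liked["text"])
```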

Next up, let's scrape some Data Science internship job postings on Indeed.com!

Scraping Data Science Jobs in the US on Indeed.com

The process will be very similar to before; we'll just be using a different template.

For job postings, Octoparse provides an Indeed.com template.

We'll be using the Indeed US Job Date Template.

The information we'll get is the location, company name, rating, link to the job post, etc.

First, we'll need the link for our search. I've searched for "Data Science Intern" and filtered by the "Internship" category.

Then, we paste that link into the parameter, and we can start running.

Below is the data we'll get!

I can see this data being useful if you're searching for data science internships in the US: you can filter on ratings, and the short description gives you a glimpse of what you'll be doing.
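As a rough sketch of that rating filter in Python, assuming each scraped row becomes a dict: the field names (company, location, rating) and the sample rows are hypothetical, and ratings are assumed to arrive as text that may be empty for unrated companies.

```python
# Hypothetical rows shaped like scraped job postings; field names are assumptions.
jobs = [
    {"company": "Company A", "location": "Austin, TX", "rating": "4.2"},
    {"company": "Company B", "location": "Remote", "rating": ""},
    {"company": "Company C", "location": "Boston, MA", "rating": "3.1"},
]


def filter_by_rating(rows, minimum):
    """Keep rows rated at least `minimum`, highest rating first."""
    kept = []
    for row in rows:
        try:
            rating = float(row["rating"])
        except ValueError:
            continue  # skip postings with a missing or non-numeric rating
        if rating >= minimum:
            kept.append(row)
    return sorted(kept, key=lambda r: float(r["rating"]), reverse=True)


for job in filter_by_rating(jobs, 3.0):
    print(job["company"], job["rating"])
```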

If you want to take it a step further and scrape the job details (skills, requirements, etc.), you can create a workflow on Octoparse to click on each job posting and select the element you want to scrape. I'll leave that as a challenge for you!

Lastly, let's see how we can scrape YouTube comments!

Scraping YouTube Comments

Squid Game, the Netflix series, was a huge success, so let's have some fun and scrape the comments of the YouTube trailer video for the show.

As before, we pass in the URL and click run, and we'll get data similar to the template's sample data.

Below is the data we got!

Something fun you can do with this data is run it through a sentiment analysis model, and then determine how the viewers feel about the trailer and give a sentiment score for the video.
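To make the idea concrete, here is a toy sketch of scoring comments in Python. It is not a real sentiment model: it just counts words from two tiny hand-picked word lists (both assumptions for illustration) and averages the result per video. In practice you would swap sentiment_score for a proper trained model.

```python
import re

# Tiny hand-picked word lists; a real model would learn sentiment from data.
POSITIVE = {"amazing", "great", "love", "good", "best"}
NEGATIVE = {"boring", "bad", "hate", "worst", "terrible"}


def sentiment_score(comment):
    """Return a score in [-1, 1]: +1 all positive words, -1 all negative."""
    words = re.findall(r"[a-z']+", comment.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def video_score(comments):
    """Average the per-comment scores to get one score for the video."""
    scores = [sentiment_score(c) for c in comments]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up comments, not scraped data.
print(video_score(["I love this, best show", "kind of boring"]))
```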

It's evident Octoparse is a powerful tool, and I've only scratched the surface of what it can do.

If you can't find what you need among the templates, you can use the workflow tool to customize the scraping to your purpose. It supports actions like looping over items, paginating, clicking elements, and more.
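For readers who think in code, the loop-and-paginate pattern behind such a workflow can be sketched in Python roughly as follows. This is only an analogy, not how Octoparse works internally: FAKE_SITE, fetch_page, and scrape_all are hypothetical names, and the "site" is an in-memory dictionary instead of real web pages.

```python
# Fake paginated site: each "page" holds items and the key of the next page.
FAKE_SITE = {
    "page-1": {"items": ["job A", "job B"], "next": "page-2"},
    "page-2": {"items": ["job C"], "next": "page-3"},
    "page-3": {"items": ["job D", "job E"], "next": None},
}


def fetch_page(key):
    """Stand-in for loading a page; a real scraper would request a URL."""
    return FAKE_SITE[key]


def scrape_all(start, max_pages=10):
    """Collect items page by page until there is no next page."""
    collected = []
    key = start
    pages_seen = 0
    while key is not None and pages_seen < max_pages:
        page = fetch_page(key)
        for item in page["items"]:  # loop over the items on this page
            collected.append(item)
        key = page["next"]          # follow the "next page" link
        pages_seen += 1
    return collected


print(scrape_all("page-1"))
```

The max_pages cap is a safety net so that a site whose pagination loops back on itself cannot keep the scraper running forever.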

Try out Octoparse 🐙 today 👇

That's all for this article, thank you for reading, and I hope this tool will make your life easier!

If you liked my writing, the best way to support me is to become a Medium member today for as little as $5! You’ll get full access to tons of excellent writing on Medium on a wide range of topics.

Follow the bitgrit Data Science Publication, where I write data science articles on tutorials and concept explanations.
