How to scrape data from a website using Pandas.

Jorge Cerdas
4 min read · Jul 29, 2021


In the main blog post I explained the idea behind using Streamlit, Pandas and Plotly to create a dashboard that visualizes earthquake information for Costa Rica. This post is the first in the series: we are going to go over the steps to scrape data with Pandas from the OVSICORI website, where the data for the most recent earthquakes in Costa Rica is displayed.

If you have not read the previous blog post, here’s the link:

Visualizing earthquake data using Streamlit, Pandas and Plotly. | by Jorge Cerdas | Jul, 2021 | Medium

After you read the previous blog post, come back here.

As with any data science project, we need data. There are many sources where you can get a really good dataset, but for this project I wanted data that is updated over time, so that I can showcase it in my dashboard for users to see and play around with.

So, how can we get this data? Luckily, OVSICORI keeps track of every single earthquake in Costa Rica.

You can actually go to their site and see the data here: Tabla Sismos Recientes (una.ac.cr).

There is a table that is updated every time Costa Rica has an earthquake, so this website is going to be the source of the data. We just need to find a way to scrape it.

OVSICORI website

There are many ways to scrape data from a website: web scraper apps, Python libraries like Scrapy or Beautiful Soup, etc. However, in this particular scenario we are going to use Pandas.

One of the first things we need to do before scraping data from a website is to understand how the website is built. As you may know, websites are built using HTML, CSS and JavaScript, but we need to focus on the HTML, since it is basically the blueprint for how the website is structured through its HTML tags.

Many web scrapers use the HTML tags along with the XPath query language to navigate the HTML code, which allows us to pull data from it.
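To make the idea of navigating tags with XPath more concrete, here is a minimal sketch using only Python's standard library. The HTML snippet, the "quakes" id and the column names are made up for illustration and are not the real OVSICORI markup. Note also that xml.etree.ElementTree only supports a subset of XPath and expects well-formed markup, so real pages usually call for a dedicated parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed HTML snippet standing in for a real page.
HTML = """
<html>
  <body>
    <table id="quakes">
      <tr><th>Date</th><th>Magnitude</th></tr>
      <tr><td>2021-07-01</td><td>4.2</td></tr>
      <tr><td>2021-07-02</td><td>3.8</td></tr>
    </table>
  </body>
</html>
""".strip()

root = ET.fromstring(HTML)

# XPath-style query: every <td> in any <tr> of the table with id="quakes".
cells = [td.text for td in root.findall(".//table[@id='quakes']/tr/td")]

# Group the flat list of cells back into rows of two columns each.
rows = [cells[i:i + 2] for i in range(0, len(cells), 2)]
print(rows)
```

The path expression walks the tag hierarchy (table, then tr, then td), which is the same kind of navigation that tools like Scrapy perform with full XPath support.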

If you do not know what XPath is or how web scraping works, I recommend watching this video tutorial:

Python Scrapy Tutorial - 10 - Extracting data w/ XPATH - YouTube

Now, in our case the OVSICORI website displays the earthquake data within an HTML table element, and fortunately Pandas has a function called read_html():

pandas.read_html — pandas 1.3.1 documentation (pydata.org)

This function reads every HTML table on the page and returns them as a list, with each table already loaded as a Pandas data frame.
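To see the list behaviour without hitting the network, here is a small sketch that feeds read_html() an in-memory page with two tables. The HTML and its column names are invented for illustration, not taken from the OVSICORI site. It also assumes an HTML parser that Pandas can use, such as lxml, is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; not the real OVSICORI markup.
HTML = """
<table>
  <tr><th>Date</th><th>Magnitude</th></tr>
  <tr><td>2021-07-01</td><td>4.2</td></tr>
  <tr><td>2021-07-02</td><td>3.8</td></tr>
</table>
<table>
  <tr><th>Station</th></tr>
  <tr><td>A</td></tr>
</table>
"""

# read_html() accepts a URL, a path or a file-like object.
tables = pd.read_html(StringIO(HTML))

print(len(tables))   # one list entry per <table> found
print(tables[0])     # each entry is a DataFrame
```

Because the result is a list, you pick the table you want by position, for example tables[0] for the first table on the page.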

This is the URL where we are going to scrape the data from: http://www.ovsicori.una.ac.cr/sistemas/sentidos_map/indexleqs.php

Earthquake data

I wrote a Python script that scrapes the data from the table and stores it in a CSV file. You can see the code on my GitHub repo: georgedevcode/earthquake_web_scraping_tool (github.com)

You’ll see some more code in the repo, since I automated the script to run as a scheduled task that continuously scrapes data from the site, but that is a topic for a future blog post.
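As a rough idea of what the CSV step can look like, here is a minimal sketch. It is not the code from the repo: the data frame contents, the column names and the file name earthquakes.csv are placeholders chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the scraped table.
earthquake_data = pd.DataFrame({
    "Date": ["2021-07-01", "2021-07-02"],
    "Magnitude": [4.2, 3.8],
})

# Write to a temporary folder so the example leaves no files behind
# in the working directory.
csv_path = os.path.join(tempfile.mkdtemp(), "earthquakes.csv")

# index=False keeps the row index out of the file.
earthquake_data.to_csv(csv_path, index=False)

# Read the file back to confirm what was stored.
restored = pd.read_csv(csv_path)
print(restored)
```

Passing index=False is a design choice: the row numbers carry no meaning here, so leaving them out keeps the CSV limited to the scraped columns.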

Going back to our main idea: as I said, we are going to use Pandas. With just a few lines of Python code you can get the data you need:

import pandas as pd

# Reads from the website and creates the data frame
table = pd.read_html("http://www.ovsicori.una.ac.cr/sistemas/sentidos_map/indexleqs.php")
earthquake_data = pd.DataFrame(table[0])

From the code above, you can see the following:

  1. We imported the Pandas module.
  2. Using the read_html() function and passing the URL, we get all the HTML tables from the website. Keep in mind that read_html() returns a list.
  3. Since we got a list holding the tables, we need to access the first element of the list. We can then pass that table as an argument to the Pandas DataFrame() constructor to create our data set. (Each element of the list is already a data frame, so this wrapping step is optional.)

Once you have the data frame, you can work with it like any other Pandas data frame and manipulate it with different tools and libraries. And just like that, we used Pandas to get our earthquake data from the OVSICORI website.
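As an illustration of working with the result like any other data frame, here is a short sketch that parses dates, filters rows and sorts them. The column names Date and Magnitude and the sample values are assumptions made for the example; the real table may use different headers.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table.
earthquake_data = pd.DataFrame({
    "Date": ["2021-07-01", "2021-07-02", "2021-07-03"],
    "Magnitude": [4.2, 3.8, 5.1],
})

# Convert the text dates into proper datetime values.
earthquake_data["Date"] = pd.to_datetime(earthquake_data["Date"])

# Keep only the events with magnitude 4.0 or higher.
strong = earthquake_data[earthquake_data["Magnitude"] >= 4.0]

# Sort so the strongest event comes first.
strongest_first = strong.sort_values("Magnitude", ascending=False)
print(strongest_first)
```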

Next: How to use Plotly to visualize scatter plot maps using coordinates (coming soon).
