Web Scraping SNIIM data using Python and Beautiful Soup: This is the way

Maria Janneth Rivera Reyna
MCD-UNISON
5 min read · Mar 4, 2024

Overview

In recent years, web scraping has become one of the most powerful techniques for getting unstructured data from the web programmatically. This means you can fetch a webpage, download its content, search for data, reformat it, and so on, in order to finally load it into a structured format like a CSV file, or even a database table, for further analysis.

This has many applications, such as monitoring product prices, researching competitors' products, running sentiment analysis on product reviews, and tracking weather changes, just to mention a few.

Not-so-shocking news: Python was the most popular programming language for web scraping in 2023. It provides several libraries for this, like Beautiful Soup, which builds a parse tree from the website's HTML source code, a.k.a. the soup, that you can then search to extract data.

SNIIM (its acronym in Spanish) is Mexico's National System of Market Information and Integration, which reports daily prices of agri-food products. On its website, you can apply several filters to get prices by product, by date, by presentation, among others:

Fruit and vegetable price consultation website
Source: http://www.economia-sniim.gob.mx/

The results are shown in a table; depending on how many rows are displayed per page, the results may span one or more pages:

Source: http://www.economia-sniim.gob.mx/

In this post, we will get data related to the prices of fruits and vegetables and save these as CSV files that we can later load into a Pandas DataFrame for a deeper analysis.

Without further ado, let’s go scrape data! 🤓

SNIIM data scraper

First, you need to ensure you have Python installed on your machine. Second, you will need to install the Beautiful Soup library by running:
$ pip install beautifulsoup4

Now that you’re all set, let’s begin importing some libraries:
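As a sketch, these are the kinds of libraries the scraper relies on (the exact set in the original code may differ slightly):

import csv
import os
import re
from datetime import date, timedelta
from urllib.request import urlopen

from bs4 import BeautifulSoup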

You’ll need to define paths and create folders to store temporary data and output files:
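A minimal sketch of that setup; the folder names tmp and output are placeholders, not necessarily the ones used in the original code:

# Folders for temporary data and for the final CSV files.
TMP_DIR = os.path.join(os.getcwd(), "tmp")
OUTPUT_DIR = os.path.join(os.getcwd(), "output")

for folder in (TMP_DIR, OUTPUT_DIR):
    os.makedirs(folder, exist_ok=True)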

And now, let’s break the following code into several steps:

  1. Create a recursive function that extracts every page of price results for a specific product by handling pagination, and writes the results into a CSV file.
  2. Automate extraction of price results for a list of products in a time period.

Step 1

You'll start by defining the function that fetches the webpage using the urllib.request Python library (and opens the files where the data will be written):
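Here is a sketch of what that function can look like. The function name, parameter names, and query-string parameters below are illustrative assumptions, not necessarily the ones used in the original code:

RESULTS_URL = (
    "http://www.economia-sniim.gob.mx/Nuevo/Consultas/MercadosNacionales/"
    "PreciosDeMercado/Agricolas/ResultadosConsultaFechaFrutasYHortalizas.aspx"
)

def scrape_product_prices(product_id, product_name, start_date, end_date,
                          rows_per_page=500):
    # Build the query string with the product, date range and price filters,
    # then fetch the results page (the parameter names are assumptions).
    url = (
        f"{RESULTS_URL}?fechaInicio={start_date}&fechaFinal={end_date}"
        f"&ProductoId={product_id}&PreciosPorId=2"  # PreciosPorId=2: price per kg (assumed value)
        f"&RegistrosPorPagina={rows_per_page}"
    )
    html = urlopen(url).read()

    # Open the CSV file where this product's rows will be appended.
    csv_path = os.path.join(OUTPUT_DIR, f"{product_name}.csv")
    out_file = open(csv_path, "a", newline="", encoding="utf-8")
    writer = csv.writer(out_file)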

And then create the soup:
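Still inside the function, the soup is built from the HTML we just downloaded (html.parser is used here; the original code may use a different parser):

    # Parse the downloaded HTML into a navigable tree, a.k.a. the soup.
    soup = BeautifulSoup(html, "html.parser")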

Since some results span more than one page, you'll need to handle pagination to get everything. This is why the function is recursive: when it detects more than one page, it calls itself again asking for all the rows at once, so we won't need to execute a separate HTTP request for every results page.

This means you'll need to navigate the soup, and for that, you need to know which HTML elements contain the data you're interested in. In this case, we want to know how many pages of results there are in total.

In your browser, you can use the inspect tool to check the HTML elements:

Source: http://www.economia-sniim.gob.mx/

We can see that the element <span id="lblPaginacion"> shows the number of pages. This is the element you need to search for in the soup. From it, the code calculates an approximate total number of rows, and with that, the function can call itself to request all the rows in a single page:
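A hedged reconstruction of that logic, continuing inside the function sketched above (the exact text inside lblPaginacion is an assumption based on how the results page displays pagination):

    # <span id="lblPaginacion"> reads something like "1 de 7 Páginas".
    pagination = soup.find("span", id="lblPaginacion")
    total_pages = 1
    if pagination is not None:
        match = re.search(r"de\s+(\d+)", pagination.get_text())
        if match:
            total_pages = int(match.group(1))

    if total_pages > 1:
        # Approximate the total number of rows and call the function again,
        # asking for everything in a single page.
        total_rows = total_pages * rows_per_page
        out_file.close()
        return scrape_product_prices(product_id, product_name, start_date,
                                     end_date, rows_per_page=total_rows)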

Now that you’ve handled that, you will need to search for the HTML element that contains the price information:

Source: http://www.economia-sniim.gob.mx/

The element <table id="tblResultados"> presents the rows in <tr> elements, which in turn contain the columns in <td> elements. You'll need to search for these elements in the soup and then write the rows, one by one, into a CSV file:
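Continuing the sketch, we look up tblResultados, walk its rows, and write each one into the CSV file opened earlier:

    # Each <tr> of the results table becomes one row in the CSV file.
    table = soup.find("table", id="tblResultados")
    if table is not None:
        for row in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:  # header rows only contain <th>, so they are skipped
                writer.writerow(cells)
    out_file.close()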

You may have noticed this function requires some parameters. Let’s define them in the next step.

Step 2

We want to automate the extraction to get the price data for a list of products. We also want the price per kilogram, and we can specify a time interval as well.

All of these filters come from the SNIIM webpage, and we have already looked them up in advance.

We have also obtained the URLs you will need: one to get the list of all fruits and vegetables reported by SNIIM, and one for the HTTP GET request that returns the prices.

Source: http://www.economia-sniim.gob.mx/nuevo/mapa.asp
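The product-list URL is assumed to be the fruits-and-vegetables query form linked from that site map (the second URL, for the GET request, is the RESULTS_URL already defined above):

PRODUCT_LIST_URL = (
    "http://www.economia-sniim.gob.mx/Nuevo/Consultas/MercadosNacionales/"
    "PreciosDeMercado/Agricolas/ConsultaFrutasYHortalizas.aspx"
)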

The first URL shows the list of fruits and vegetables, and you now know how to navigate the soup to get them 😃. Here we have used a little bit of regex.
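A hedged example of that extraction: it assumes the product links on the page carry the product id as ProductoId=<number> in their href, which is what the regex looks for.

# Fetch the product-list page and build (product_id, product_name) pairs.
html = urlopen(PRODUCT_LIST_URL).read()
soup = BeautifulSoup(html, "html.parser")

products = []
for link in soup.find_all("a", href=True):
    match = re.search(r"ProductoId=(\d+)", link["href"])
    if match:
        products.append((match.group(1), link.get_text(strip=True)))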

The rest of the code gets the name of each product, calls the function to extract all the data for each product in our list, and splits the extraction into chunks to prevent a server timeout error.
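An illustrative driver loop under those assumptions, splitting the date range into roughly monthly chunks so each request stays small (the dd/mm/yyyy date format is also an assumption about what the SNIIM form expects):

start = date(2023, 1, 1)
end = date(2023, 12, 31)

for product_id, product_name in products:
    chunk_start = start
    while chunk_start <= end:
        # Keep each request to about a month of data to avoid server timeouts.
        chunk_end = min(chunk_start + timedelta(days=30), end)
        scrape_product_prices(product_id, product_name,
                              chunk_start.strftime("%d/%m/%Y"),
                              chunk_end.strftime("%d/%m/%Y"))
        chunk_start = chunk_end + timedelta(days=1)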

Voilà! Now you have created one CSV file for each product, and you can start doing some other fun stuff with this information.
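For example, you could load one of the files into a Pandas DataFrame (the file name here is hypothetical; it depends on the products you scraped):

import pandas as pd

# Load one of the generated CSV files for further analysis.
df = pd.read_csv(os.path.join(OUTPUT_DIR, "Tomate saladette.csv"))
print(df.head())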

Thanks for reading… and happy scraping! 🤓

The complete code can be found in this GitHub repo. This code was based on the scraper written by México Abierto.
