Scraping and data transformation best practices with an example of Treasury Data

Tauseef Shah
Published in CivicDataLab
4 min read · Aug 29, 2023

In this blog, I’ll walk through the process we followed for web scraping and data transformation of treasury data, and use it to share good practices that can help you streamline the process and keep your data accurate.

Why data scraping and transformation

There is a lot of data available on government platforms related to schemes, expenditure, budgets and much more. Most of this data is poorly formatted or not machine readable, yet it is very useful for understanding government policies, their decision-making process and, in the case of treasury data, how taxpayers’ money is being spent. To make sense of this data, it helps to scrape and transform it so that it can be modeled, visualized and analyzed. The data on these government websites is updated periodically, with frequencies ranging from weekly to yearly. Data scraping is therefore usually a continuous process; for treasuries it is done every year to get the latest data.

Jharkhand Treasury Data

In this blog, the code and best practices are taken from the scraper and transformer written for Jharkhand Treasury data in particular. These scripts were used to scrape five years of data, 2017–2022.

Using Python for Scraping and Data Transformation

Python is the go-to language for web scraping and data transformation. It is simple, efficient, and has a vast number of libraries and tools that make the job easier, so it is the natural choice for both tasks.

The following four libraries were used in the treasury scraper for Jharkhand:

from lxml import etree
import os
import pandas
import requests

Prefer Python’s `requests` Library over `selenium`

Python’s `requests` library is faster than `selenium` for web scraping, so prefer `requests` whenever possible. However, `selenium` can be useful for scraping JavaScript-heavy sites that require browser-like interaction to get the data. Use `lxml` to parse the HTML (if required), as it is fast and efficient.

req = requests.get(
    urlToBeScraped.format(item[2], treasuries[item[1].upper()], year),
    # Always set a User-Agent header so the scraper acts as an actual browser
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
)
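Once the response comes back, `lxml` can parse it; here is a minimal sketch, assuming the page contains an HTML table (the XPath is illustrative, not the one used in the actual scraper):

from lxml import etree

# Parse the raw HTML returned by requests; etree.HTML tolerates malformed markup
tree = etree.HTML(req.text)

# Illustrative XPath: collect the text of every table cell, row by row
rows = [
    [''.join(cell.itertext()).strip() for cell in row.findall('td')]
    for row in tree.xpath('//table//tr')
]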

Store Data in Downloads Folder

Store all the scraped data in the `downloads` folder. This makes it easy to find and organize your data. You can also use this folder to store other files that are related to the scraping project, such as logs, scripts, and notes. This folder should also be added to the `.gitignore` file so the data in this folder is not pushed into the repository.

The following should be the structure for all downloaded data relative to the scraper file:

- ..
- scraper.py
- downloads
  - FILENAME_1.csv
  - FILENAME_2.csv
  - FILENAME_3.pdf
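A small sketch of how the scraper can enforce this layout before writing anything (the file and column names below are placeholders, not the real treasury fields):

import os
import pandas

# Create the downloads folder next to the scraper if it does not exist yet;
# remember to list "downloads/" in .gitignore so the scraped data is never committed
download_dir = os.path.join(os.path.dirname(__file__), 'downloads')
os.makedirs(download_dir, exist_ok=True)

# Illustrative write: save a scraped table (here an empty placeholder) as a CSV
table = pandas.DataFrame({'ddo_code': [], 'amount': []})
table.to_csv(os.path.join(download_dir, 'FILENAME_1.csv'), index=False)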

Use Python’s `logging` Module for Logging

Always keep scraper logs for the important tasks performed by the scraper. Use Python’s `logging` module to store logs in `downloads.log`. This will help you trace back any failure and understand what went wrong.

Logging should accompany every critical and complementary piece of code:

import logging

# If the website lookup fails
logging.info('The HOST CANNOT BE REACHED')

# After the file is downloaded
logging.info('DOWNLOAD SUCCESS')
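For these messages to actually land in `downloads.log`, the logger has to be configured once at startup; a minimal sketch (the log format is my assumption, not taken from the original scraper):

import logging

# Write all INFO-and-above messages to downloads.log with a timestamp
logging.basicConfig(
    filename='downloads.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

logging.info('SCRAPER STARTED')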

Use Environment Variables to Store Sensitive Data

Never expose API keys, access tokens, or any other sensitive information in the scraper. Use environment variables to store this information instead. Keep the variables in a `.env` file, which can be added to `.gitignore` so that it is not uploaded to the git repository.

All the sensitive information should be stored as environment variables and accessed in the following way:

import os

# Accessing the secret environment variable
os.environ['SECRET_API_KEY_FOR_SOME_3RD_PARTY_SERVICE']
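If the variables live in a `.env` file, they need to be loaded into the process environment before they can be read; one common way to do this is with the `python-dotenv` package (an assumption here, the original scraper may load them differently):

import os

from dotenv import load_dotenv  # assumption: python-dotenv is installed

# Read the key=value pairs from .env into the process environment
load_dotenv()

# Use .get() so a missing key returns None instead of raising a KeyError
api_key = os.environ.get('SECRET_API_KEY_FOR_SOME_3RD_PARTY_SERVICE')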

Implement Retries and Fallbacks

Implement retries/fallbacks for scraping tasks to ensure that you don’t miss any important data. For example, if something fails, try again (maybe three times), sleeping for `X` amount of time between attempts. If the retries/fallbacks still don’t work, consider sending a notification to Rocket.Chat or another communication platform.

Retries and fallbacks are a must when writing scrapers that run for long durations, as there will always be instances when the host is unreachable or the service is down.

from requests.adapters import HTTPAdapter, Retry

# Total retries should be at least 3; if they all fail, the script should stop
# backoff_factor is another crucial setting: it controls the wait between retries
retries = Retry(total=5, backoff_factor=30)
adapter = HTTPAdapter(max_retries=retries)

# Mount the adapter on a session so requests made through it are retried
session = requests.Session()
session.mount('https://', adapter)
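On top of the HTTP-level retries, a simple outer loop can implement the sleep-and-notify fallback described above; a minimal sketch, where `notify_on_rocket_chat` is a hypothetical helper (not part of the original scraper) standing in for whatever notification hook you use:

import logging
import time

import requests

MAX_ATTEMPTS = 3    # "maybe three times"
SLEEP_SECONDS = 60  # the "X" amount of time to wait between attempts


def notify_on_rocket_chat(message):
    # Hypothetical helper: post `message` to a Rocket.Chat webhook (or any other channel)
    pass


def fetch_with_fallback(url):
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            # The session with the retry adapter mounted above could be used here instead
            return requests.get(url, timeout=30)
        except requests.RequestException:
            logging.info('ATTEMPT %s FAILED FOR %s', attempt, url)
            time.sleep(SLEEP_SECONDS)
    # All attempts exhausted: raise the alarm instead of silently missing data
    notify_on_rocket_chat(f'Scraping failed for {url}')
    return None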

Use Good Naming Conventions and Add Comments

Use good naming conventions for functions and variables. This makes your code more readable and easier to understand. Also, add comments to logic that is not easy to understand. This will help other developers who may work on the code in the future.

from selenium.webdriver.common.by import By

# Good naming
table_with_ddo_data = browser.find_element(By.CSS_SELECTOR, '')

# Bad naming
data = browser.find_element(By.CSS_SELECTOR, '')

Maintain a `.todo.md` File

Maintain a `.todo.md` or `.todo` file to track pending tasks, missing functionality, and refactoring ideas. This file helps you organize your work and prioritize what to do next.

Keep Complete Logs of Every Scraping Job

Keep complete logs of every scraping job you perform. If something was missed or an error occurred, these logs will be helpful to anyone who works on the script in the future, and they make debugging and troubleshooting much easier.
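One way to guarantee this is to wrap the top-level job so every unexpected error ends up in the log with its traceback; a minimal sketch, where `run_scraper` is just a placeholder for the actual entry point:

import logging


def run_scraper():
    # Placeholder for the actual scraping entry point
    pass


if __name__ == '__main__':
    logging.info('SCRAPING JOB STARTED')
    try:
        run_scraper()
        logging.info('SCRAPING JOB FINISHED')
    except Exception:
        # logging.exception records the full traceback in the log file
        logging.exception('SCRAPING JOB FAILED')
        raise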

In conclusion, following these best practices can help you streamline your web scraping and data transformation process. These practices can help you save time, ensure accuracy, and make your work more organized.
