Python: Web-scraping to CSVs

Alexis Chilinski
Published in Analytics Vidhya
Jun 29, 2020

Web-scraping is a great way to retrieve information from a source (e.g. a website) when you don’t need to store it in a database. Say you have your own news website and you want to show the latest headlines from other news websites; that would be a good reason to turn to web-scraping.

Luckily, there are many ways you can web-scrape across all languages, including Python. Python has a few different libraries for it, including BeautifulSoup and Scrapy, and it also has a built-in module for importing/exporting CSVs. I had a recent project that required web-scraping, so obviously I turned to Python, and it’s all very simple.

Web-scraping requires a basic understanding of HTML, because once you get data from the actual page, you will need to search the HTML elements for the specific information you need. But to start, you will need to pip install beautifulsoup4 and pip install requests, which will allow you to send HTTP requests to websites. Then import those libraries like so:

from bs4 import BeautifulSoup
import requests

and then you can begin coding! All you need is the URL of the website you’re scraping. Be aware that some sites have tighter security and prevent any kind of web-scraping.

page = requests.get(your_url_here)
soup = BeautifulSoup(page.content, 'html.parser')

You will need to specify the type of parsing you need, as BeautifulSoup supports several parsers. Since we’re dealing with HTML elements, 'html.parser' is appropriate.
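If a site does block scrapers, requests will usually hand you back an error status rather than the page you expect, so it can help to check the response before parsing. Here is a minimal, more defensive sketch, assuming the same your_url_here variable from above; the User-Agent value is just an illustrative browser-like string, since some sites reject the default requests user agent:

# Send a browser-like User-Agent and fail loudly if the request is refused.
request_headers = {'User-Agent': 'Mozilla/5.0'}
page = requests.get(your_url_here, headers=request_headers)
page.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
soup = BeautifulSoup(page.content, 'html.parser')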

This is where the fun comes in (fun is relative). This is when you need to open your browser’s developer tools on the site you’re scraping and figure out which elements hold the data you need. Once you find the elements, locating them with BeautifulSoup is very easy — you can use element tags and/or classnames. For example:

div = soup.find('div', class_='class_name')

And then you can even add on a .text to get the inner text of an element:

text = soup.find('div', class_='class_name').text
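Note that .find() only returns the first match. If you need every matching element, .find_all() returns a list you can loop over. A small sketch, using a hypothetical headline class name for illustration:

# find_all() returns every matching element, not just the first
headlines = soup.find_all('h2', class_='headline')  # 'headline' is a made-up class name
for headline in headlines:
    print(headline.text.strip())  # .strip() trims surrounding whitespace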

The projects I worked on also required the scraped data to be exported to a CSV. So that’s where Python’s built-in csv module came in. Say you used .find_all() and got back multiple divs that you need to iterate through, exporting the text from each one to a CSV. First, at the top of the file, you will need to import that module:

import csv

and then you can start coding with it! Now you can iterate through the divs and write their text out to a new CSV:

divs = soup.find_all('div', class_='class_name')  # every matching div, not just the first

all_text = []
for div in divs:
    get_text = div.text
    text_dict = {'Text': get_text}
    all_text.append(text_dict)

with open('text.csv', 'w', newline='') as out_file:
    headers = ['Text']
    writer = csv.DictWriter(out_file, fieldnames=headers)
    writer.writeheader()
    for text in all_text:
        writer.writerow(text)

This code will automatically create a new .csv file in your directory with the data you requested. 'text.csv' is the name of your CSV file, and it can be called whatever you want. The headers can also be named whatever you’d like, as long as they match the keys of the dictionaries you pass into writerow().
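To sanity-check the output, you can read the file back with csv.DictReader, which yields one dictionary per row. A quick sketch, assuming the text.csv written above:

with open('text.csv', newline='') as in_file:
    reader = csv.DictReader(in_file)  # uses the header row as dictionary keys
    for row in reader:
        print(row['Text'])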

TL;DR: web-scraping data and exporting it to a CSV with Python is very simple. And the docs are great.
