Website Scraping Using Python in Five Steps — Data Science With Craigslist Data Part 2
In this post I’ll walk you step by step through how I scraped vehicle-posting data from Craigslist. I used Python and the BeautifulSoup library for web scraping, which is just one of many ways to access data from a website. Ideally, an API is best for this purpose, but Craigslist does not offer one that would give me the data I needed.
The full code is posted on my GitHub if you would like to follow along on a Jupyter Notebook.
If you navigate to cars & trucks for any city on Craigslist you’ll have the option to view ads from both private sellers and dealers. I chose to scrape only posts from private sellers, as these contain more reliable information about an actual vehicle up for sale. I also limited myself to posts with images, since I wanted to download the images for the advanced project section, such as image recognition or NLP.
The data I needed is embedded within each vehicle post page, as seen on the right in the image below. Don’t worry if not everything in the image makes sense yet. We’ll get to that soon.
After visually inspecting the website I was able to come up with a game plan for how I would structure my code.
At a high level, my five-step workflow for scraping the website is as follows:
- Step One: Create a list of cities which I would like to get the data for
- Step Two: Navigate to the cars & trucks for sale by owner section for each city
- Step Three: For each city, navigate to each page available
- Step Four: For each page available, navigate to each post
- Step Five: For each post, get post attributes
That's the working pseudocode version.
The Code
First things first, I fired up a Jupyter notebook instance and imported the necessary libraries. We import the requests module to handle HTTP requests and BeautifulSoup from bs4, which is the module we’ll use to parse data out of each page’s HTML.
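A minimal sketch of those imports (pandas is an assumption here, included because we build data frames and CSV files later on):

```python
import requests                # handles the HTTP requests to Craigslist
from bs4 import BeautifulSoup  # parses the HTML we get back
import pandas as pd            # used later for data frames and CSV files
```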
The next step is to create a list of the cities we want to navigate to on Craigslist. These city names must match Craigslist’s own subdomain spelling. For example, San Francisco is sfbay; otherwise you’ll get a page-not-found error.
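As a sketch, the list might look like this; the names are Craigslist subdomains rather than full city names, and the exact fifteen I used are in the notebook on GitHub:

```python
# Craigslist subdomain spellings, e.g. San Francisco -> sfbay
cities = ['sfbay', 'losangeles', 'newyork', 'chicago', 'seattle']
```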
We’re now going to get the links to every results page within each city, so that we end up with a list of all pages for every city. Each results page holds 120 posts, so to move to the next page we just create a page offset and increment it by 120. We continue this iteration until the offset reaches 1800. I chose 1800 because posts that deep are usually around the thirty-day mark. The final result is a list of links to all pages for each city.
To get the links to each city’s pages, we loop over the cities list, substitute the city name into a base link, and store the result in a variable named city_link. What we’re doing is taking any city link from the website and turning it into a template with placeholders for the city name and page number. Something like “https://sfbay.craigslist.org/d/cars-trucks-by-owner/search/cto?s=1&hasPic=1” becomes the template used in the snippet below.
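Here is a minimal sketch of that loop, assuming the page offset goes in the s query parameter and that pages and city_link match the variable names described above:

```python
pages = []  # links to every results page for every city

for city in cities:
    for offset in range(0, 1800, 120):  # 15 pages of 120 posts each
        city_link = (f'https://{city}.craigslist.org/d/cars-trucks-by-owner'
                     f'/search/cto?s={offset}&hasPic=1')
        pages.append(city_link)
```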
After we run the above code we have a list of links to every results page in each of the cities in our cities list. Below is a slice of what the page links look like. For all 15 cities we get a total of 225 page links, each containing 120 individual vehicle posts.
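For illustration, a slice of the list built by the sketch above would look roughly like this:

```python
pages[:3]
# ['https://sfbay.craigslist.org/d/cars-trucks-by-owner/search/cto?s=0&hasPic=1',
#  'https://sfbay.craigslist.org/d/cars-trucks-by-owner/search/cto?s=120&hasPic=1',
#  'https://sfbay.craigslist.org/d/cars-trucks-by-owner/search/cto?s=240&hasPic=1']
```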
Now that we have this long list of all pages across every city for the cars-for-sale-by-owner section, we need to navigate to each page and get the links to the individual vehicle posts. That’s 120 posts times 225 pages, so we should end up with about 27,000 links, each representing a specific vehicle post on Craigslist.
To get the vehicle links, we loop through each page link from the list we created above and get all 120 posts per page. We store these links in a list variable called car_urls.
To get the link to each post, we’re going to use BeautifulSoup to parse the links out of the webpage. To do this, we pass a page link to the get function of the requests library, which should return a <Response [200]> status if the page is reachable. We then pass the response to a BeautifulSoup instance along with the html.parser argument, which gives us back a parsed HTML object as seen below.
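A minimal sketch of that step, using the first page link from the list above:

```python
response = requests.get(pages[0])    # <Response [200]> when the page loads
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify()[:500])         # peek at the parsed HTML
```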
We can then tell the soup object to find our vehicle links in the returned html object by using the find_all method.
You can easily figure out where the links live by right-clicking in your Chrome browser and selecting Inspect. This brings up the panel shown below with the HTML structure of the page.
An HTML link is denoted by an href attribute, but since there are many hrefs in the HTML object, we also tell soup which class to look in. From the example above, the class we want is result-image gallery. All the hrefs it contains lead to the individual vehicle posts on that page, as seen below.
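Assuming the post links sit on anchor tags carrying that class, as in the inspector panel above, the lookup is short:

```python
# every <a class="result-image gallery" href="..."> points to one vehicle post
links = soup.find_all('a', class_='result-image gallery')
post_urls = [link['href'] for link in links]   # up to 120 links per page
```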
We can then store these links in a data frame or CSV for later use. The full code to get all the vehicle links is shown below.
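Putting it together, a sketch of the loop over all 225 page links could look like this (car_urls is the variable named earlier; the anchor class is the assumption noted above):

```python
car_urls = []  # one link per individual vehicle post

for page in pages:
    response = requests.get(page)
    if response.status_code != 200:    # skip pages that fail to load
        continue
    soup = BeautifulSoup(response.text, 'html.parser')
    for link in soup.find_all('a', class_='result-image gallery'):
        car_urls.append(link['href'])

print(len(car_urls))
```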
We now have links to 24,706 vehicle posts that we can point BeautifulSoup at to scrape data from, but first we save a copy so we can always come back to it later. This step is easy: we just save the list as a CSV file.
Then we read the file back into a pandas data frame.
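A quick sketch of the save-and-reload round trip; the file name is just an example:

```python
# save the links so we never have to re-crawl the result pages
pd.DataFrame({'url': car_urls}).to_csv('car_urls.csv', index=False)

# read them back into a data frame whenever we need them
car_urls_df = pd.read_csv('car_urls.csv')
```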
Well, now that we have all the links we need, let’s get to scraping the vehicle details. We are going to use BeautifulSoup again, but this time we’ll work through each individual vehicle post page.
Again, we pass each vehicle link to the requests module and the response to a BeautifulSoup object. Then we use the find method of the soup object to grab the vehicle attributes and any other data from the page. For example, for the price we look up the price class as follows.
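A sketch, assuming the price sits in a span with the class price:

```python
post = requests.get(car_urls[0])
post_soup = BeautifulSoup(post.text, 'html.parser')

price = post_soup.find('span', class_='price').text   # e.g. '$8,500'
```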
To find the posting date and time we use the following.
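A sketch, assuming the posting timestamp lives in a time element’s datetime attribute:

```python
# the first <time> element on the post carries the posting date and time
date_time = post_soup.find('time')['datetime']
```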
For the geolocation coordinates, we parse them from the mapbox class of the HTML like so.
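A sketch, assuming the map element inside the mapbox block exposes data-latitude and data-longitude attributes (some posts have no map, so we guard against that):

```python
mapbox = post_soup.find('div', class_='mapbox')
map_div = mapbox.find('div', id='map') if mapbox else None
latitude = map_div['data-latitude'] if map_div else None
longitude = map_div['data-longitude'] if map_div else None
```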
Here is the full code for getting the vehicle attributes from each vehicle link.
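A condensed sketch of that loop, reusing the selectors assumed above; the notebook on GitHub pulls additional attributes, and the try/except simply skips posts that have expired or are missing a field:

```python
cars = []  # one list of attributes per vehicle

for url in car_urls:
    try:
        post = requests.get(url)
        post_soup = BeautifulSoup(post.text, 'html.parser')

        price = post_soup.find('span', class_='price').text
        price = price.replace('$', '').replace(',', '')   # strip $ and commas
        date_time = post_soup.find('time')['datetime']

        map_div = post_soup.find('div', id='map')
        latitude = map_div['data-latitude'] if map_div else None
        longitude = map_div['data-longitude'] if map_div else None

        car_final = [url, price, date_time, latitude, longitude]
        cars.append(car_final)
    except (AttributeError, TypeError, requests.RequestException):
        continue   # deleted/expired post or missing field
```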
We have now pulled the vehicle attributes from every one of our vehicle links and stored them in a list of lists. Some minimal cleaning was also done on the fly, such as creating attribute labels and removing the dollar sign and commas from the price.
Each vehicle and its attributes is stored in the list car_final.
And each list is stored in the list cars.
Next, we have to store this data for future analysis. What I did was create a dictionary of key-value pairs for each vehicle, with the attribute labels acting as key names so that each key represents a column name, and then store the dictionaries in the list car_dicts.
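A sketch of that conversion, using the same handful of example attributes as above for the labels:

```python
labels = ['url', 'price', 'date_time', 'latitude', 'longitude']

# one dictionary per vehicle: {'url': ..., 'price': ..., ...}
car_dicts = [dict(zip(labels, car_final)) for car_final in cars]
```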
Finally, we write the list of dictionaries into a pandas data frame, preview it to check that the data was stored properly, and save the data frame to CSV (or dump it to a database), as sketched below.
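A minimal sketch of those last steps; the output file name is just an example:

```python
df = pd.DataFrame(car_dicts)   # one row per vehicle, one column per attribute
df.head()                      # sanity-check the first few rows

df.to_csv('craigslist_cars.csv', index=False)
```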
That’s all folks, for now at least. We have successfully scraped data from Craigslist: data on 24,000-plus vehicles from 15 cities across the US, covering a period of about one month.
Please join me in the next section, Craigslist Cars For Sale Data Project: Part 3 — Data Cleaning, as we perform some much needed data cleaning like fixing the data types and creating new columns. Also let me know if you have any feedback on the process above. Thank you!