Scraping Target Store Locations

Many people ask us how to fetch all of Target's store locations, as we do for our data store. Note: all the Target store locations are available as a CSV file for $50.

Figure 1: Target store locations in the USA. Source: Target Store Locations dataset

Scraping Target stores for one zipcode

We will keep things simple for now and scrape Target store locations for only one zipcode.

Python is great for web scraping, and we will use a library called Selenium to extract the raw HTML source of Target's store locator for zipcode 30301 (Atlanta, GA).

    # Using Selenium to extract Target store locator's raw HTML source
    import time

    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    test_url = 'https://www.target.com/store-locator/find-stores/30301'

    option = webdriver.ChromeOptions()
    option.add_argument("--incognito")

    # Replace with the path to your local chromedriver executable.
    chromedriver = r'chromedriver_path'
    # Selenium 4 takes the driver path through a Service object.
    browser = webdriver.Chrome(service=Service(chromedriver), options=option)

    browser.get(test_url)
    time.sleep(5)  # give the JavaScript-rendered store list time to load
    html_source = browser.page_source
    browser.close()
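A fixed five-second sleep can be too short on a slow connection and wasteful on a fast one. The sketch below waits explicitly until at least one h3 element is present, on the assumption that store names are rendered as h3 headings, as the parsing step in this article suggests. The helper names build_store_locator_url and fetch_store_locator_html are hypothetical, and the 15-second timeout is an arbitrary choice.

```python
import re

STORE_LOCATOR_URL = "https://www.target.com/store-locator/find-stores/{zipcode}"


def build_store_locator_url(zipcode):
    """Return the store locator URL for a 5-digit US ZIP code."""
    zipcode = str(zipcode).strip()
    if not re.fullmatch(r"\d{5}", zipcode):
        raise ValueError(f"expected a 5-digit ZIP code, got {zipcode!r}")
    return STORE_LOCATOR_URL.format(zipcode=zipcode)


def fetch_store_locator_html(browser, zipcode, timeout=15):
    """Load the locator page and return its HTML once store cards appear.

    `browser` is an already-created Selenium WebDriver instance.
    """
    # Imported lazily so build_store_locator_url works without Selenium.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    browser.get(build_store_locator_url(zipcode))
    # Assumption: store names are rendered as <h3> elements.
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "h3"))
    )
    return browser.page_source
```

Passing the ZIP code as a string avoids losing leading zeros (for example, Boston-area codes starting with 0), which the validation would otherwise reject.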

Using BeautifulSoup to extract Target store details

Once we have the raw HTML source, we can parse it with a Python library called BeautifulSoup.

Figure 2: Inspecting the source of Target store locator.

  • We will extract store names first.
  • The data will still require some cleaning to split out store names, address line 1, address line 2, city, state, zipcode and phone numbers, but that's just basic Python string manipulation and we will leave it as an exercise to the reader.

    # extracting Target store names
    soup = BeautifulSoup(html_source, "html.parser")
    # attrs must be a dict that maps the attribute name to its value
    store_name_list_src = soup.find_all(
        'h3',
        {'class': 'Heading__StyledHeading-sc-1mp23s9-0 Card__StoreCardTitle-sc-6da7hu-0 gcIYSr bEhCVm'},
    )
    store_name_list = [val.get_text() for val in store_name_list_src]
    store_name_list

    # Output
    ['Buckheadstore details', 'Buckhead Southstore details',
     'Sandy Springs Pradostore details', 'Atlanta Midtownstore details',
     'North Druid Hillsstore details', 'Smyrnastore details',
     'Atlanta Perimeterstore details', 'Atlanta Edgewoodstore details',
     'Northlakestore details', 'Marietta Eaststore details',
     'Peachtree Cornersstore details', 'Austellstore details',
     'Cobb NEstore details', 'Roswellstore details',
     'Alpharettastore details', 'West Mariettastore details',
     'Cobbstore details', 'East Pointstore details',
     'Riverwoodstore details', 'Woodstockstore details']

The next step is extracting addresses. Inspecting the page in Chrome again, we see that each address sits in an <a> tag with the class name Link__StyledLink-sc-4b9qcv-0 fUrQXY h-text-grayDarkest, so we use BeautifulSoup's find_all method to extract them into a list. Class names like these look auto-generated and may have changed since this was written, so check them against the live page.

    # extracting Target addresses
    addresses_src = soup.find_all(
        'a',
        {'class': 'Link__StyledLink-sc-4b9qcv-0 fUrQXY h-text-grayDarkest'},
    )
    address_list = [val.get_text() for val in addresses_src]
    address_list

    # Output
    ['3535 Peachtree Rd NE, Atlanta, GA 30326-3287',
     '2539 Piedmont Rd NE, Atlanta, GA 30324-3006',
     '5570 Roswell Rd, Sandy Springs, GA 30342-1102',
     '375 18th St, Atlanta, GA 30363',
     '2400 N Druid Hills Rd NE, Atlanta, GA 30329-3211',
     '2201 Cobb Pkwy SE, Smyrna, GA 30080-7633',
     '100 Perimeter Center Pl, Atlanta, GA 30346-1204',
     '1275 Caroline St NE, Atlanta, GA 30307-2705',
     '4241 Lavista Rd, Tucker, GA 30084-5310',
     '1401 Johnson Ferry Rd, Marietta, GA 30062-6495',
     '3200 Holcomb Bridge Rd, Peachtree Corners, GA 30092-3361',
     '4125 Austell Rd, Austell, GA 30106-1836',
     '3040 Shallowford Rd, Marietta, GA 30062-1252',
     '1135 Woodstock Rd, Roswell, GA 30075-2231',
     '6000 N Point Pkwy, Alpharetta, GA 30022-3006',
     '2535 Dallas Hwy SW, Marietta, GA 30064-2543',
     '740 Ernest W Barrett Pkwy NW, Kennesaw, GA 30144-6860',
     '3660 Marketplace Blvd, East Point, GA 30344-5738',
     '5950 State Bridge Rd, Duluth, GA 30097-6438',
     '140 Woodstock Square Ave, Woodstock, GA 30189-6500']

Lastly, we will extract phone numbers using a similar approach.

    # extracting phone numbers for each Target store
    phone_number_src = soup.find_all(
        'a',
        {'class': 'Link__StyledLink-sc-4b9qcv-0 fUrQXY h-text-grayDarkest undefined'},
    )
    phone_number_list = [val.get_text() for val in phone_number_src]
    phone_number_list

    # Output
    ['404-237-9494', '404-720-1081', '678-704-8120', '678-954-4265',
     '404-267-0060', '770-952-2241', '678-259-0888', '404-260-0200',
     '770-270-5375', '770-240-0005', '770-849-0885', '678-945-4550',
     '770-321-8545', '770-998-0144', '770-664-5395', '770-792-7933',
     '770-425-6895', '404-267-0063', '770-476-5548', '678-494-5307']
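One possible approach to the cleanup mentioned earlier is sketched below. It assumes every scraped name ends with the literal text "store details" and every address follows the "street, city, ST zip" pattern seen in the output above; anything else raises an error rather than being guessed at. The function names and the record keys are hypothetical choices for illustration.

```python
import re

NAME_SUFFIX = "store details"  # link text that gets glued onto each name
STATE_ZIP_RE = re.compile(r"^([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$")


def clean_store_name(raw_name):
    """Strip the trailing 'store details' link text from a scraped name."""
    name = raw_name.strip()
    if name.endswith(NAME_SUFFIX):
        name = name[: -len(NAME_SUFFIX)]
    return name.strip()


def parse_address(raw_address):
    """Split 'street[, line 2], city, ST zip' into its components."""
    parts = [p.strip() for p in raw_address.split(",")]
    if len(parts) < 3:
        raise ValueError(f"unexpected address format: {raw_address!r}")
    match = STATE_ZIP_RE.match(parts[-1])
    if match is None:
        raise ValueError(f"unexpected state/ZIP format: {raw_address!r}")
    street_lines = parts[:-2]  # everything before the city
    return {
        "address_line_1": street_lines[0],
        "address_line_2": ", ".join(street_lines[1:]),
        "city": parts[-2],
        "state": match.group(1),
        "zipcode": match.group(2),
    }


def build_store_records(names, addresses, phones):
    """Combine the three parallel lists into one dict per store."""
    if not (len(names) == len(addresses) == len(phones)):
        raise ValueError("name, address and phone lists differ in length")
    records = []
    for name, address, phone in zip(names, addresses, phones):
        record = {"store_name": clean_store_name(name)}
        record.update(parse_address(address))
        record["phone"] = phone.strip()
        records.append(record)
    return records


# Two rows taken from the scraped output above.
records = build_store_records(
    ["Buckheadstore details", "Atlanta Midtownstore details"],
    ["3535 Peachtree Rd NE, Atlanta, GA 30326-3287", "375 18th St, Atlanta, GA 30363"],
    ["404-237-9494", "678-954-4265"],
)
print(records[0])
```

The length check matters because the three lists are matched by position only; if one selector misses an element, every later store would be paired with the wrong address or phone number. The resulting list of dicts can be passed straight to pandas.DataFrame if you prefer a table.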

Geocoding

  • You will need the latitude and longitude of each store if you want to plot them on a map like Figure 1.
  • Latitudes and longitudes are also necessary to calculate distances between points, driving radii, etc., all of which are important parts of location analysis.
  • We recommend that you use a robust geocoding service like Google Maps to convert the addresses into coordinates (latitudes and longitudes). It costs $5 for 1000 addresses, but in our view it's totally worth it.
  • There are some free alternatives for geocoding based on OpenStreetMap, but none that match the accuracy of Google Maps.
  • In the example below, we have used an OpenStreetMap-based geocoding API called Nominatim.
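
    from geopy.geocoders import Nominatim

    # Recent geopy versions require an identifying user agent;
    # replace this placeholder with your own application name.
    nom = Nominatim(user_agent="target-store-locator-example")
    location = nom.geocode(address_list[0])
    location

    # Output
    Location(3535, Peachtree Road Northeast, Atlanta, Fulton County, Georgia, 30326, United States, (33.85259301032711, -84.3606446470625, 0.0))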
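Once each store has coordinates, the distance calculations mentioned above can be sketched with the haversine formula, which gives the straight-line (great-circle) distance rather than a driving distance. The function names below are hypothetical, and apart from the Buckhead coordinates, which come from the Nominatim output above, the points are made-up test values.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def stores_within_radius(origin, stores, radius_miles):
    """Return the names of stores within radius_miles of origin.

    origin is a (lat, lon) tuple; stores is an iterable of (name, lat, lon).
    """
    lat0, lon0 = origin
    return [name for name, lat, lon in stores
            if haversine_miles(lat0, lon0, lat, lon) <= radius_miles]


# The origin (roughly downtown Atlanta) and the second store are
# illustrative values, not scraped data.
origin = (33.749, -84.388)
stores = [
    ("Buckhead", 33.85259301032711, -84.3606446470625),
    ("Hypothetical far store", 40.0, -75.0),
]
print(stores_within_radius(origin, stores, radius_miles=10))
```

For true driving radii you would need a routing service; the haversine distance is only a lower bound on the road distance.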

Scaling up to a full crawler for extracting all Target store locations in the USA

  • Once you have the above scraper working for one zipcode/city, you will have to iterate through all the US zip codes.
  • The number of requests depends on how much coverage you want, but for a national chain like Target you are looking at running the above function tens of thousands of times (there are over 40,000 US zip codes), plus retries, to ensure that no region is left out.
  • Once you scale up to thousands of requests, the Target.com servers will start blocking your IP address outright, or you will be flagged and start getting CAPTCHAs.
  • To improve your chances of fetching data for the whole USA, you will have to implement:
      • rotating proxy IP addresses, preferably using residential proxies
      • rotating user agents
      • an external CAPTCHA-solving service like 2captcha or anticaptcha.com
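The crawl loop described above might be sketched as follows. This is an outline under stated assumptions, not a tested crawler: fetch_stores is a hypothetical callable that you would build from the single-zipcode scraper, taking a zipcode, a proxy and a user agent and returning a list of dicts with an "address" key, or raising an exception when the request is blocked. CAPTCHA solving is left out. Deduplicating by address matters because a single query returns stores from many surrounding zipcodes, so neighboring queries overlap heavily.

```python
import itertools
import random
import time


def crawl_zipcodes(zipcodes, fetch_stores, proxies, user_agents,
                   max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Fetch stores for every zipcode, rotating proxies and user agents.

    Returns (unique_stores, failed_zipcodes).
    """
    proxy_cycle = itertools.cycle(proxies)  # round-robin over the proxy pool
    stores_by_address = {}
    failed = []
    for zipcode in zipcodes:
        for attempt in range(max_retries):
            proxy = next(proxy_cycle)
            user_agent = random.choice(user_agents)
            try:
                stores = fetch_stores(zipcode, proxy, user_agent)
            except Exception:
                # A real crawler should catch only the errors it expects.
                # Back off exponentially before trying the next proxy.
                sleep(base_delay * 2 ** attempt)
                continue
            for store in stores:
                # Keep the first copy of each store; later duplicates are skipped.
                stores_by_address.setdefault(store["address"], store)
            break
        else:
            # Every attempt failed; record the zipcode for a later pass.
            failed.append(zipcode)
    return list(stores_by_address.values()), failed
```

Collecting the failed zipcodes instead of aborting lets you rerun only the gaps later, possibly with a fresh proxy pool.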

After you follow all the steps above, you will realize that our pricing ($50) for web-scraped store location data covering all Target stores is one of the most competitive in the market.

Originally published at https://www.specrom.com.

Jay M. Patel

Cofounder/principal data scientist at Specrom Analytics (specrom.com); natural language processing and web crawling/scraping expert. Personal site: JayMPatel.com
