How to do web scraping with Python

Tyler Garrett
7 min read · Jul 5, 2018


Hey, web scraping is easy with Python 3.7. The way I was doing it before this tutorial was overly complex and extremely inefficient.

I wrote this blog in July 2018, when I was still learning how to program in Python. This particular version is not as complete or as easy to follow as my later post on web scraping.

It was not my best blog, but it does show a quick way to do some web scraping basics, like grabbing numbers off a website. I have since reblogged this topic with a more straightforward example.

Please, if you're interested in learning web scraping with Python, check out the blog I released on Dec. 25, 2018!

I was trying to make a drag-and-drop ETL tool handle web scraping, but it isn't designed for parsing HTML.

Meet Python, lxml, requests, beautifulsoup4, etc. Throw away the paid services, throw away the third-party vendors, and start web scraping on your own, on your computer, now!

Share this with your friends: http://tinyurl.com/yaupbwv8

Learn how I made this blog URL into a tinyurl with 2 lines of python code! Follow along here @ how to make tinyurls with python.
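If you're curious what those two lines might look like, here's a rough sketch of my guess at a minimal version. It hits TinyURL's public api-create.php endpoint with requests; it is not the exact code from that post, and the long URL below is only a placeholder.

import requests  # the same library we use for scraping below

# Ask TinyURL to shorten a URL; the long URL here is just a placeholder.
print(requests.get('https://tinyurl.com/api-create.php', params={'url': 'https://example.com/my-blog-post'}).text)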

Web scraping is easy in Python….

Web scraping is easy in Python, but you need to ramp up. It won't take long, and let me know if you get stuck; I sure as hell did a lot.

So, about that list above: Python, lxml, requests, etc. If it sounds like gibberish, don't worry. I explain everything in my tutorials/blogs, without a single funnel or recommendation to buy anything! You're welcome.

Using pip to install requests and lxml on Python 3.7 on macOS

I found a blog about web scraping, and it had a little bit of Python but not much explanation, per the usual programmer blog: a bunch of shorthand written as if we already speak the language. Hours of troubleshooting and digging through SEO'ed websites later, I think we have some cool content. By the way, the blog that mentioned scraping also has a bit of an incomplete tutorial around this process/method. I will continue to clean this up, and maybe reblog it on my website at tylergarret.com.

Python is extremely efficient at handling web parsing; I'm blown away. I was trying to do this in other software and it was a massive workaround/waste of time. This is exciting, but what is it?

Did you miss that? In 6 lines of code, we are getting prices…

And boom, prices… from a website…

One more line of code, and boom, buyers + prices… Now we are looking at prices online, instantly. Loop this and you have price analysis. Push it into a database and you have prices over time. Here we go…
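In case the screenshots don't come through, here's roughly the snippet I'm talking about. It's the same code I walk through step by step below:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)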

Python… What is it though?

Learning Python is like Space Force: everyone has an opinion, but none of it is factual, true, or exactly the truth. Like politics.
lol. Let me explain below.

Setting up pip to install requests and lxml

Below I'm going to show you how to set up requests and lxml on Python 3.7 on macOS. Trying to learn Python from scratch is a lot of fun. There's a bit of a ramp-up, but that's why I'm blogging about it every day.

It's easy, fun, and user-friendly. Don't be discouraged while figuring out how to get it working; keep it up, and maybe give PyCharm a visit too.

Installing Python is important for any data-related guru.

Learning how to install Python seems to be critical for the future of my career. I'm tired of spending countless hours making a piece of software do what code has done for decades… Time to grow a pair. I don't know if Homebrew helped me, but I wrote about how to set up Homebrew for Python too.

Here's a quick video on setting up pip on your Mac, and I cover how to set up pip on Windows 10 too. Be sure to catch up and install Python, etc. Let me know if you get stuck; I'm still learning myself and want to know if I'm getting you past the points where I was stuck digging through everything.

Learning how to do web scraping with Python!

When I first started learning about web scraping, no one wanted to help me, and I was stuck trying to parse HTML with a tool 100% not designed to handle the task. So, when you hit this bridge, I hope more than anything that my blog ranks half decent and you don't waste any time trying to do web scraping with random tools, paid services, or third-party vendors.

So, here we go! Web scraping is fun, but you'll need to dig through a bunch of tabs if you ignore my blogs.

If you made it this far… you're clearly really intelligent and enjoy learning. Please follow along below so you don't have to open 20 tabs and spin your mother flipping wheels off. This should be easy! It's just a bunch of junk in Google searches right now.

You feel me? Anyways, ping me if you need any help; I will likely be very far ahead of this point by the time this article begins ranking. I don't want you struggling to get ahead, so please ping me if you want source code to any projects I'm blogging about. Enjoy! And thanks for the follows.

Follow along with this video to get pip working on your Mac before you begin.

Let’s start with the imports:

from lxml import html  # html.fromstring() parses the page and gives us xpath()
import requests  # requests fetches the raw HTML over HTTP

Well, these imports will not just work out of the box. Sorry. That throws a big loop into the ramp-up, and there's also some Python 2 syntax in the original tutorial that I update below.

First you need to install requests. The command below ensures pip installs into Python 3, versus the other Python installs on your Mac, like the 2.7 that ships with macOS. Don't uninstall or break that one; leave it alone. Or reinstall everything.

Install requests with this command in your terminal. First make sure pip is functioning on your machine by typing "pip" in your CMD/terminal.

python3 -m pip install requests --user

The code above installs requests for your user account under Python 3. You can learn a little more about some of these pieces of code here.

lxml is a separate install; make sure you install it for Python 3 as well if you want to use it with your 3.7 install.

python3 -m pip install lxml

Installing lxml took me a little bit because I kept typing xmlx. Be sure you’re not installing weird stuff.
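Once both installs finish, a quick sanity check in your terminal helps confirm the libraries landed in Python 3 and not the system 2.7. If this prints without an ImportError, you're good:

python3 -c "import requests, lxml; print('requests and lxml are ready')"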

Here's the rest of the code working, with the print() parentheses that the original tutorial does not include.

Now we want to "get" the HTML and parse through it looking for buyers and prices.

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')  # download the page
tree = html.fromstring(page.content)  # parse the raw HTML into a tree we can query with XPath

After a quick analysis, we see that in our page the data is contained in two elements — one is a div with title ‘buyer-name’ and the other is a span with class ‘item-price’:

HTML looks like this:

<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span>

Knowing this, we can create the correct XPath queries and use lxml's xpath function to capture the values in the HTML:

# This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
# This will create a list of prices:
prices = tree.xpath('//span[@class="item-price"]/text()')

Let’s see what we got exactly:

print('Buyers: ', buyers)
print('Prices: ', prices)

Boom.

Now you have your next step: time to start learning how to push this into a database!
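If you want a head start on that database step, here's a minimal sketch using sqlite3, which ships with Python. It continues from the buyers and prices lists above; the file name and table name are placeholders I made up, not anything from the original tutorial.

# Continuing from the buyers/prices lists created above.
import sqlite3

conn = sqlite3.connect('scrape.db')  # 'scrape.db' is a placeholder file name
conn.execute('CREATE TABLE IF NOT EXISTS listings (buyer TEXT, price TEXT)')
conn.executemany('INSERT INTO listings VALUES (?, ?)', zip(buyers, prices))  # one row per buyer/price pair
conn.commit()
conn.close()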

Oh you’re still here…

Do you want to automate building tinyurls? It's super important for SEO, so head over here.

Too easy right!?

typos by tyler garrett

Cheers.
