Learn Python Fundamentals in 30 Days - Day 25 (Web Scraping)

  • webbrowser module: its open() function launches the default browser at a specified URL
>>> import webbrowser
>>> webbrowser.open('http://www.google.com')
True

For more info, please refer to https://docs.python.org/3/library/webbrowser.html
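
As a small extension of the open() call above, here is a minimal sketch that builds a search URL and opens it in a new browser tab. The helper names build_search_url and open_search are made up for this example, and the /search?q= URL form is an assumption about Google's search page, not something covered earlier in this post.

```python
import webbrowser
from urllib.parse import urlencode


def build_search_url(query):
    """Return a search URL for query (the /search?q= form is an assumption)."""
    # urlencode() escapes spaces and special characters for us.
    return "https://www.google.com/search?" + urlencode({"q": query})


def open_search(query):
    """Open the search results in a new browser tab; returns True on success."""
    return webbrowser.open_new_tab(build_search_url(query))


# Only print the URL here; call open_search("python web scraping") to launch the browser.
print(build_search_url("python web scraping"))
```

Building the URL in its own function means you can check the string before anything is opened in the browser.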

  • requests module: lets us easily download files and web pages from the web. It’s a third-party module (to install it, run pip install requests)
>>> import requests
# Download a web page
>>> output = requests.get("http://www.google.com")
# get() returns a Response object, which contains the response the web server gave us for this request
# Check the status code
>>> output.status_code
200
>>> len(output.text)
10397
# Let's print the first 100 characters using slicing
>>> output.text[:100]
'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content'
# Another way to check the status is raise_for_status(), which raises an exception if the request failed and does nothing if it succeeded
>>> output.raise_for_status()
>>>
# Let's try with a bad URL
>>> output = requests.get('http://www.google.com/aadfnfjjf')
>>> output.raise_for_status()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/plakhera/anaconda/lib/python3.6/site-packages/requests/models.py", line 893, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://www.google.com/aadfnfjjf
# We can save the content to a file using the iter_content() method
# We open the file in write-binary mode so the bytes are written exactly as received, which preserves the text's original encoding
>>> output = requests.get("https://www.google.com/robots.txt")
>>> testfile = open("robots.txt","wb")
>>> for i in output.iter_content(10000):
...     testfile.write(i)
...
6806
# Close the file so that everything written is flushed to disk
>>> testfile.close()
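
The status check and the iter_content() loop above can be combined into one reusable function. This is only a sketch under a few assumptions: the names save_chunks and download are invented for this example, the 10-second timeout is an arbitrary choice, and only HTTP error statuses are handled (connection problems would still raise their own exceptions).

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the total bytes written."""
    total = 0
    # The with statement closes the file even if a write fails.
    with open(path, "wb") as handle:
        for chunk in chunks:
            total += handle.write(chunk)
    return total


def download(url, path, chunk_size=10000):
    """Download url to path; return the byte count, or None on an HTTP error status."""
    # Imported here so save_chunks() can be tried without requests installed.
    import requests

    # stream=True lets iter_content() fetch the body piece by piece.
    with requests.get(url, stream=True, timeout=10) as response:
        try:
            response.raise_for_status()
        except requests.exceptions.HTTPError as error:
            print("Download failed:", error)
            return None
        return save_chunks(response.iter_content(chunk_size), path)


# Offline check of save_chunks() with made-up chunks instead of a real response.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
demo_total = save_chunks([b"User-agent: *\n", b"Disallow: /search\n"], demo_path)
print(demo_total)
```

With a network connection, download("https://www.google.com/robots.txt", "robots.txt") would repeat the steps from the session above, and the with statements take care of closing both the file and the connection.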

NOTE: The requests module is helpful when we know the exact URL we want to download, but for websites where we need to log in, using requests is not the best way to do it.

For more info please refer to http://docs.python-requests.org/en/master/

So this is the end of Day 25. In case you are facing any issues, this is the link to the Python Slack channel: https://devops-myworld.slack.com

Please send me your details

  • First name
  • Last name
  • Email address

to devops.everyday.challenge@gmail.com, so that I can add you to this Slack channel.

HAPPY CODING!!!