This is the first part of the series and I honestly don’t know when it’s going to be over. But each part will work independently of the rest of the series. Let’s hope.
And we are going to use Python. Did I tell you already?
When you start writing code to scrape Google, the first thing you need is a piece of code that can scrape a single Google page. That’s pretty basic and that’s where you can start.
El Scraper
# Add the basic imports
import requests
import urllib.parse
from requests_html import HTML
from requests_html import HTMLSession
Now we need a function that gets the source code for a given URL, a Google URL or any other.
def get_source(url):
    # Given a url, return the response carrying the page source
    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)
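A quick sanity check, with an arbitrary URL (the response comes from requests_html, so the usual requests attributes like status_code are there):
response = get_source("https://example.com")
print(response.status_code)  # 200 if the request went through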
We need to build Google URLs from queries or keyword phrases, so let’s write a function for that.
# Build the search URL from a query, then fetch it
def get_results(query):
    # Given a query, return the source of the results page as a response
    query = urllib.parse.quote_plus(query)
    response = get_source("https://www.google.com/search?q=" + query)
    return response
The response is gonna have a lot of noise in it. We only need the link, title and text of each result. Let’s catch ’em all.
def parse_results(response):
    if not response:
        return []

    # CSS selectors for a single result and the bits we want out of it
    css_identifier_result = ".tF2Cxc"
    css_identifier_title = "h3"
    css_identifier_link = ".yuRUbf a"
    css_identifier_text = ".IsZvec"

    results = response.html.find(css_identifier_result)
    output = []
    for result in results:
        title = result.find(css_identifier_title, first=True)
        title = title.text if title is not None else ""
        link = result.find(css_identifier_link, first=True)
        link = link.attrs['href'] if link is not None else ""
        text = result.find(css_identifier_text, first=True)
        text = text.text if text is not None else ""
        item = {
            "title": title,
            "link": link,
            "text": text
        }
        output.append(item)
    return output
Gonna wrap this all up nicely.
def google_search(query):
    response = get_results(query)
    return parse_results(response)
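A quick test; the query here is made up and your actual results will vary (or come back empty, if Google has changed its markup since):
results = google_search("web scraping with python")
print(results[0])
# Something like: {'title': '...', 'link': 'https://...', 'text': '...'}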
Now that that’s done, you are gonna start scraping Google. Or so you would think…
What about the search keyword?
Are you gonna use the same keyword over and over again?
We are gonna search Google forever, and you are not gonna write keywords by hand forever. So you randomly generate keywords. Let’s see if Python has got any libraries for that. Go say hi to Google Dot Com.
El Essential Generator
Luckily there is a ready-made generator you can use. (Actually there are many generators out there, I just picked the first one I found.)
# You install it first because you probably don't have it
!pip install essential-generators
Let’s see how it generates random phrases
from essential_generators import DocumentGenerator
gen = DocumentGenerator()
print(gen.sentence())
And that’s it. And coz we love wrapping things up…
def fetch_google_results():
    gen = DocumentGenerator()
    key = gen.sentence()
    return google_search(key)
Now we are not just gonna print the result. We need to save it somewhere in some format. We can do anything here: go with a SQL or a NoSQL database, or just store it as a CSV or a JSON.
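If you fancied the CSV route instead, a minimal sketch using the standard csv module could look like this (the helper and its filename are made up for illustration, not from anything above):
import csv

def save_results_as_csv(results, filename="results.csv"):
    # Hypothetical helper: each result is a dict with title, link and text keys
    with open(filename, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "link", "text"])
        writer.writeheader()
        writer.writerows(results)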
Let’s do JSON, though. The result is a list of dicts anyway.
import json

def fetch_google_results():
    gen = DocumentGenerator()
    key = gen.sentence()
    result_dict = google_search(key)
    json_object = json.dumps(result_dict)
    return json_object
Where do you wanna store it?
We have the JSON object now. Let’s save it as a JSON file.
import json

def save_google_results_as_json():
    gen = DocumentGenerator()
    key = gen.sentence()
    result_dict = google_search(key)
    filename = key + ".json"
    with open(filename, "w") as json_saver:
        # Save the pretty-printed version using the indent=4 param,
        # even though no human will ever read it
        json.dump(result_dict, json_saver, indent=4, sort_keys=True)
We changed the function name since the old one was no longer appropriate or intuitive.
Let’s test it.
save_google_results_as_json()
I have a beautiful json in the current folder named “Hosted UEFA however, coincided with many of the Isthmus of Suez and.json”. We are gonna royally ignore the keywords Essential Generator generates for now. It’s doing exactly what we asked for and we couldn’t care less.
Now we just need to keep doing it. Easy peasy Japaneasy!
“Let’s scrape google forever”
he said.
while True:
    # That's an infinite loop right there in its raw form.
    # Don't do this unless you know for sure your code won't run
    save_google_results_as_json()
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-91-e40097c380c3> in <module>
      1 while True:
      2     # That's an infinite loop right there in its raw form. Don't do this unless you know for sure your code won't run
----> 3     save_google_results_as_json()

<ipython-input-87-ff0ecb9763eb> in save_google_results_as_json()
      5     result_dict = google_search(key)
      6     filename = key + ".json"
----> 7     with open(filename, "w") as json_saver:
      8         json.dump(result_dict, json_saver, indent=4, sort_keys=True)
      9         # we are gonna save the pretty print version of json using the indent=4 param even though no human will

FileNotFoundError: [Errno 2] No such file or directory: 'Union dates residents, it is equal to 1/12 of the institution. Political associations such as.json'
That is a stupid file name. Let’s sanitize those.
# Special mention: https://github.com/django/django/blob/main/django/utils/text.py
import re

def get_valid_filename(s):
    s = str(s).strip().replace(' ', '_')
    return re.sub(r'(?u)[^-\w.]', '', s)
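And we wire it into the save function; presumably something like this (same function as before, with the filename run through the sanitizer):
def save_google_results_as_json():
    gen = DocumentGenerator()
    key = gen.sentence()
    result_dict = google_search(key)
    # Sanitize the generated sentence before using it as a filename
    filename = get_valid_filename(key) + ".json"
    with open(filename, "w") as json_saver:
        json.dump(result_dict, json_saver, indent=4, sort_keys=True)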
“Let’s scrape google forever.”
He said again
while True:
    save_google_results_as_json()
The output
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=Also+educated+sands+and+gravels+that+underlie+specific+behaviors+such+as+the+largest (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f97d59e96d0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=Republik%29.+They+have+channels+and (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f981b674be0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=Intrinsically+friendly+%22The+most (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f981ec130d0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))
HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: /search?q=Polo+%28c.+grand+slams%3B+and+has+been+observed+imitating+other+birds. (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f981b666610>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-108-e40097c380c3> in <module>
      1 while True:
....
I had to interrupt.
We got a couple of ‘Max retries exceeded with url’ errors, plus this is slow. We could use threads, but then we’d just get the ‘Max retries exceeded’ error sooner.
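Just to show what I mean, the threaded version is barely more code (a sketch with concurrent.futures; the worker and batch counts are arbitrary), and all it does is hit the wall faster:
from concurrent.futures import ThreadPoolExecutor

# Fire off a batch of searches in parallel.
# Google starts refusing connections even sooner this way.
with ThreadPoolExecutor(max_workers=8) as pool:
    for _ in range(100):
        pool.submit(save_google_results_as_json)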
We are gonna figure out the rest in the next chapter.
Ciao