Building With Python Requests
The journey of one thousand apps starts with a single key press…
If you have ever wanted to build a web crawler, Python is a great language to build it in. Python is concise, widely installed across platforms, and has a multitude of networking and parsing libraries for you to leverage. This post will go over the beginnings of building a web crawler: the functionality that scrapes URLs from sites. For this example, I’m going to be using the Requests, sys and re libraries.
If you don’t have lots of experience working with HTTP or Python, Requests is a great library to get started with. If the Requests library isn’t already installed on your machine, go ahead and do that using pip:
pip install requests
The people behind Requests call it HTTP for Humans, and that’s precisely what it is. It’s one of the most beautiful works of code that I have come across. It is eminently readable and logically constructed. There’s a word to describe this in the Python community: Pythonic.
A good definition for Pythonic that I’ve come across is:
Exploiting the features of the Python language to produce code that is clear, concise and maintainable. Pythonic means code that doesn’t just get the syntax right but that follows the conventions of the Python community and uses the language in the way it is intended to be used.
I want to set up our Python script so that we can specify our URL directly from the command line like so:
python linkcrawler.py www.google.com
The Python standard library documentation for sys describes it as: System-specific parameters and functions. In this case, we’re going to use sys.argv to access the URL that gets entered when our file is run from the command line. sys.argv is a list of strings, one for each item typed in after the word python. Each additional entry on the command line gets its own index. So sys.argv[0] would give us ‘linkcrawler.py’ while sys.argv[1] would give us ‘www.google.com’.
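To see those indexes in action, here is a minimal sketch. The helper name describe_args is hypothetical, and it takes the list as a parameter so you can try it without going through the command line.

```python
def describe_args(argv):
    """Return (script_name, url) from an argv-style list; url is None if missing."""
    script_name = argv[0]
    # The first real argument, if there is one, sits at index 1.
    url = argv[1] if len(argv) > 1 else None
    return script_name, url


# Simulates: python linkcrawler.py www.google.com
print(describe_args(["linkcrawler.py", "www.google.com"]))
# → ('linkcrawler.py', 'www.google.com')

# In a real script you would `import sys` and pass sys.argv instead.
```

Checking the length before reading index 1 avoids an IndexError when the script is run with no URL.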
Classes in Python
Creating classes in Python is slightly different than in Swift.
Python standard library definition for classes:
Compared with other programming languages, Python’s class mechanism adds classes with a minimum of new syntax and semantics. It is a mixture of the class mechanisms found in C++ and Modula-3. Python classes provide all the standard features of Object Oriented Programming: the class inheritance mechanism allows multiple base classes, a derived class can override any methods of its base class or classes, and a method can call the method of a base class with the same name. Objects can contain arbitrary amounts and kinds of data. As is true for modules, classes partake of the dynamic nature of Python: they are created at runtime, and can be modified further after creation.
To define one, prefix your class name with the keyword class and put a colon at the end of the declaration. What makes classes in Python interesting is that every instance method has to take self as its first parameter. To give our class an init method we need to add ‘__init__(self)’:
Specifying self as a parameter is not optional. If you leave out the self, you will get an error like this:
TypeError: __init__() takes no arguments (1 given)
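Here is a minimal sketch of both cases. The class names WithSelf and WithoutSelf are made up for illustration, and the exact wording of the TypeError varies between Python versions.

```python
class WithSelf:
    def __init__(self):
        # self is the instance being created; attributes hang off it.
        self.ready = True


class WithoutSelf:
    def __init__():  # self left out on purpose
        pass


print(WithSelf().ready)

try:
    # Python still passes the new instance as the first argument,
    # so a method that accepts no parameters cannot receive it.
    WithoutSelf()
except TypeError as error:
    print("TypeError:", error)
```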
In the init method, we can set up the properties of our instance. In this case, I gave the class a URL property:
self.url = sys.argv[1]
This uses the sys.argv list that we talked about earlier to get the URL that is entered when you run the file.
The majority of the logic for this piece will go in a method I named request_resource.
This method takes in self as a parameter as well. We set our variable r to the result of an HTTP GET request to the URL we specified initially. Using re.findall we look through r.text for text that matches URL patterns. The matches are stored in our urls variable and returned at the end of the method.
Python re Module
The Python re module is provided in the Python standard library and gives the language Perl-like regular expression patterns. Regular expressions are a way of specifying certain patterns within strings.
Wikipedia entry for Regular Expressions: A regular expression, regex or regexp (sometimes called a rational expression) is, in theoretical computer science and formal language theory, a sequence of characters that define a search pattern. Usually this pattern is then used by string searching algorithms for “find” or “find and replace” operations on strings.
A regular expression could match a phone number, an email address or, in this case, URLs. This strange looking string is a regex pattern for matching text to URLs:
Now that we have finished our class, we need to add a main function that will get called when our file is run directly from the command line. In main we will need to create an instance of our class, call our request_resource method and, for now, just print out the URLs if there are any.
Once we have our main function figured out, we need to ensure that it gets called when the file is run. We can add some ‘if __name__’ magic to get this all set up correctly.
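As an illustration only, here is a deliberately simple pattern and how re.findall uses it. Both the pattern and the name URL_PATTERN are assumptions made for this sketch, and the pattern is much looser than a production-grade URL regex.

```python
import re

# Illustrative pattern: "http://" or "https://" followed by any run of
# characters that are not whitespace, quotes, or angle brackets.
URL_PATTERN = r"https?://[^\s\"'<>]+"

html = '<a href="https://example.com/about">About</a> <img src="http://example.com/logo.png">'

# findall returns every non-overlapping match as a list of strings.
urls = re.findall(URL_PATTERN, html)
print(urls)
# → ['https://example.com/about', 'http://example.com/logo.png']
```

The r prefix makes this a raw string, so the backslashes reach the regex engine untouched.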
if __name__ == '__main__':
What we’re specifying here is that if Python runs our file directly (the if __name__ == '__main__': check) it should call our file’s main function. This is different from the case where the file is imported into another file: then the check fails and main is not called automatically. We need the check so we can run the script from the command line.
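A stripped-down sketch of that pattern, leaving out the crawler itself. The body of main here is only a placeholder.

```python
def main():
    # Placeholder body; the real script would create the crawler here.
    print("running as a script")
    return 0


# __name__ is '__main__' only when this file is executed directly,
# e.g. `python linkcrawler.py`. When the file is imported, __name__ is
# the module's name instead, so main() is not called automatically.
if __name__ == '__main__':
    main()
```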
If you got everything right it should look something like:
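Here is a sketch assembled from the steps above. Several details are assumptions rather than part of the walkthrough: the class name LinkCrawler, the simple URL pattern, the 10-second timeout, the usage message, and prepending http:// when the argument has no scheme (Requests rejects a URL without one).

```python
import re
import sys

# Simple illustrative pattern (an assumption, not a complete URL grammar).
URL_PATTERN = r"https?://[^\s\"'<>]+"


class LinkCrawler:
    def __init__(self):
        # The URL is the first argument after the script name.
        self.url = sys.argv[1]
        # Assumption: treat a bare host such as www.google.com as http://
        if not self.url.startswith(("http://", "https://")):
            self.url = "http://" + self.url

    def request_resource(self):
        # Imported here so the file still loads if Requests is missing;
        # a top-level `import requests` is the more common style.
        import requests

        r = requests.get(self.url, timeout=10)
        urls = re.findall(URL_PATTERN, r.text)
        return urls


def main():
    if len(sys.argv) != 2:
        print("usage: python linkcrawler.py <url>")
        return
    crawler = LinkCrawler()
    urls = crawler.request_resource()
    if urls:
        for url in urls:
            print(url)


if __name__ == '__main__':
    main()
```

Running python linkcrawler.py www.google.com should then print one URL per line for whatever links the pattern finds on the page.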