Weekly Python Coding Challenge #2

Christopher Franklin
Published in Weekly Python
Jan 9, 2021
Photo by Jukan Tateisi on Unsplash

Welcome to the second Weekly Python Coding Challenge! The rules are simple: pick one of the challenges below and complete it within the week. Each challenge is designed to push your knowledge of Python and the surrounding ecosystem a little further, so have fun!

And now, the theme of this week’s challenge: Create a Python application to extract all links from an HTML document.

Beginner Project: As a beginner, you should use this project to start learning methods of text extraction from documents. As a secondary objective, you will also need to learn how to load documents into Python so you can scan them for links.

Your challenge is to create a Python script that can read an HTML document saved to disk, parse it, and extract all links to other pages or sites. You can print each link to standard out, one per line. For this challenge, we only want to see the URL, not any of the surrounding HTML tags or attributes.
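One possible approach, using nothing but the standard library’s html.parser module, is sketched below. The function and class names are my own, and the sample HTML at the bottom is just an illustration:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return every href found in the given HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links


if __name__ == "__main__":
    sample = '<p>See <a href="https://example.com/a">A</a> and <a href="https://example.com/b">B</a>.</p>'
    for link in extract_links(sample):
        print(link)
```

To meet the challenge as stated, you would replace the sample string with the contents of a file read from disk, e.g. `open("page.html", encoding="utf-8").read()`.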

Beginner Stretch Goal: Instead of reading HTML from a local file, read it directly from a website.

Intermediate Project: You should already be familiar with the concept of text extraction and reading files from disk. As an intermediate developer, you will instead focus on building a web crawler that extracts every link from a website and outputs a sitemap document.

You should take the opportunity to familiarize yourself with one of the many great Crawler/Spider libraries available for Python. Once you have the ability to grab all the links from a website, you will need to assemble a sitemap in the standard XML format defined by the sitemap protocol. You can produce a single XML document, or a bundle of related documents, whichever method will challenge you more.
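Once you have a list of URLs, the XML output side can be handled with the standard library. The sketch below, assuming a simple flat list of URLs, builds a sitemap document in the sitemaps.org format using xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Namespace required by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string containing one <url>/<loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")
```

A real sitemap can also carry per-URL fields such as `<lastmod>`; adding those is a natural extension once the basic structure works.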

Intermediate Stretch Goal: Add the ability to run this as a CLI with basic options: an input file containing all the sites to generate sitemaps for, an output directory which will receive a sitemap for each site in the inputs, and a maximum number of URLs to include per sitemap for extra-large sites.
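The CLI side of the stretch goal maps cleanly onto argparse. The flag names below are my own invention, not a spec; the 50,000 default comes from the sitemap protocol’s own per-file limit:

```python
import argparse


def build_parser():
    """CLI skeleton for the sitemap generator (flag names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Generate a sitemap for each site listed in an input file."
    )
    parser.add_argument("url_file", help="text file with one site URL per line")
    parser.add_argument(
        "-o", "--output-dir", default="sitemaps",
        help="directory that will receive one sitemap per input site",
    )
    parser.add_argument(
        "-m", "--max-urls", type=int, default=50000,
        help="cap on URLs per sitemap (50,000 is the protocol's limit)",
    )
    return parser
```

With this in place, the main program just reads `url_file`, crawls each site, and writes `build_sitemap(...)` output into the chosen directory.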

Advanced Project: As an advanced Pythonista, you are going to focus on the Web Crawler aspect of this challenge. Instead of using a pre-built crawler library, write your own!

Your challenge will be to begin with a seed URL and crawl every new link you find, without revisiting a page. To push your knowledge of crawlers even further, stay away from helper libraries completely. Stick to using regex, requests, and urllib as your primary tools.
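A minimal shape for such a crawler is sketched below: a breadth-first loop over a queue, a `seen` set to avoid revisits, and a regex for hrefs. The injectable `fetch` parameter is my own addition so the logic can be exercised without hitting the network; a real run would rely on the urllib default:

```python
import re
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen

# Naive href matcher; fine for a learning crawler, not for hostile HTML.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def crawl(seed_url, fetch=None, max_pages=50):
    """Breadth-first crawl from seed_url, never visiting the same page twice.

    fetch(url) must return the page's HTML as a string; by default it uses
    urllib. Returns the set of URLs discovered (including the seed).
    """
    if fetch is None:
        fetch = lambda url: urlopen(url).read().decode("utf-8", "replace")
    seen = {seed_url}
    queue = [seed_url]
    while queue and len(seen) <= max_pages:
        page = queue.pop(0)
        try:
            html = fetch(page)
        except OSError:
            continue  # unreachable page: skip it, keep crawling
        for href in HREF_RE.findall(html):
            # Resolve relative links and drop #fragments before deduplicating.
            link, _fragment = urldefrag(urljoin(page, href))
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

A production crawler would also respect robots.txt and rate-limit its requests, which matters for the warning below.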

An important note: if you crawl a site too fast, there is a good chance it will block your IP. Keep this in mind when picking a site to test with, and don’t get yourself banned from Facebook or something!

Advanced Stretch Goal: As a bonus, implement the ability to define blurbs of text you want to extract: things like the page title, all header tags, or anything else.

Now, get out there and program!

Chris Franklin
Weekly Python

P.S. Remember, if you complete one of the challenges and found it easy, try the next level up!

P.P.S. Looking for more coding challenges? Join the free weekly newsletter here!
