How to check for broken links in a single web page programmatically
Apr 25, 2022
Checking for broken links regularly spares users annoying surprises. In this article, I will explain my two favorite methods for checking broken links in a single web page programmatically.
What is a broken link?
A broken link (or dead link) is a link on a page that no longer works. Common causes include:
- The destination website removed the linked web page (HTTP 404 error)
- The website owner entered an incorrect URL (a typo, for example)
- The destination website is no longer available
- A firewall or other access restriction blocks the request
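To make these causes a bit more concrete, here is a minimal sketch that maps the outcome of a link check onto the categories above. The function name classifyLinkFailure and the category labels are my own, made up for illustration. ENOTFOUND, ECONNREFUSED and ETIMEDOUT are the error codes Node.js reports for network failures. The mapping is only approximate: a 404 alone cannot tell a removed page from a mistyped URL.

```javascript
// Hypothetical helper: maps the outcome of a link check onto the causes
// listed above. The category labels are illustrative, not a standard.
function classifyLinkFailure({ statusCode, errorCode } = {}) {
  // Network-level failures: no HTTP response was received at all.
  if (errorCode === 'ENOTFOUND') {
    return 'host not found (site gone or typo in the domain)';
  }
  if (errorCode === 'ECONNREFUSED' || errorCode === 'ETIMEDOUT') {
    return 'site unreachable (down or blocked by a firewall)';
  }
  // Neither an error code nor a status code: nothing to classify.
  if (typeof statusCode !== 'number') {
    return 'unknown';
  }
  // HTTP-level failures: the server answered with an error status.
  if (statusCode === 404 || statusCode === 410) {
    return 'page removed or URL mistyped';
  }
  if (statusCode === 401 || statusCode === 403) {
    return 'access restricted';
  }
  if (statusCode >= 400) {
    return 'other HTTP error';
  }
  return 'ok';
}

console.log(classifyLinkFailure({ statusCode: 404 }));
console.log(classifyLinkFailure({ errorCode: 'ENOTFOUND' }));
```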
There are different ways to automate this check. Here is my first suggestion:
1. Broken-link-checker with Node.js
- broken-link-checker is an npm library that stream-parses local and remote HTML pages.
- It checks multiple links concurrently and handles both absolute and relative URLs.
- Node.js must be installed to use this library.
- We can install the library with the command “npm install broken-link-checker”.
- We use the HtmlUrlChecker class, which can be customized with various options. See the library's npm page for more details.
- Because many websites block “HEAD” requests, it is important to set requestMethod to “GET” in the options passed to the class.
- We can post-process the response data to produce more useful result logs.
2. Jsoup with Java
- Jsoup is a Java library for working with HTML.
- It provides a convenient API for fetching URLs and for extracting and manipulating data.
- First, we establish a connection to the website and retrieve the HTML DOM using Jsoup.
- Then we filter the document to find all active links.
- Lastly, we open an HTTP connection to each of these links and check its status.
- Because many websites block “HEAD” requests, it is important to set the request method to “GET” when opening the connection.
Conclusion
- You can treat these two sample projects as templates and add more features.
- You can parameterize the link check so that it runs dynamically against different pages.
- You can use a CI/CD tool like Jenkins, or set up a cron job, to run the project periodically.
- You can also add reporting logic to send the results in a suitable format.
- If you want a different solution, you can also look at Selenium or other Node.js libraries such as axios and cheerio.
- Node.js project source code:
- Jsoup project source code:
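As one example of the reporting idea mentioned in the list above, here is a minimal sketch of a plain-text summary. The function name buildReport and the shape of the result objects ({ url, broken, reason }) are assumptions of mine, not something either library produces in this form, and the sample data is invented for the demo.

```javascript
// Hypothetical report helper. Each result is assumed to look like
// { url: string, broken: boolean, reason: string | null }.
function buildReport(pageUrl, results) {
  const broken = results.filter((r) => r.broken);
  const lines = [
    `Link report for ${pageUrl}`,
    `Checked: ${results.length}, broken: ${broken.length}`,
    // One indented line per broken link.
    ...broken.map((r) => `  - ${r.url} (${r.reason})`),
  ];
  return { text: lines.join('\n'), brokenCount: broken.length };
}

// Invented sample data, only to show the report layout.
const report = buildReport('https://example.com', [
  { url: 'https://example.com/about', broken: false, reason: null },
  { url: 'https://example.com/old-page', broken: true, reason: 'HTTP_404' },
]);
console.log(report.text);
```

In a Jenkins or cron run, the brokenCount value could be used to set process.exitCode to 1 so that the job is marked as failed whenever at least one broken link is found.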
Hope you enjoyed the article! See you in other posts :)