How to check for broken links on a single web page programmatically

Burak Ergören
Sahibinden Technology
3 min read · Apr 25, 2022

Checking for broken links is important because it spares users from annoying surprises. In this article, I will explain in detail my two favorite methods for checking broken links on a single web page programmatically.

What is a broken link?

https://www.pedalo.co.uk/wp-content/uploads/2018/03/Broken-link.jpg

A broken link (or dead link) is a link on a page that no longer works. Common reasons include:

  • The destination website removed the linked web page (HTTP 404 error)
  • The website owner entered an incorrect URL (a typo, etc.)
  • The destination website is no longer available
  • A firewall or other access restriction blocks the request
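
As a rough illustration of how these causes show up in practice, the Java sketch below maps an HTTP status code or a connection failure to one of them. The class and method names are hypothetical, and the mapping is a heuristic assumption rather than a rule: a mistyped URL, for example, can produce either a 404 or a host-not-found error.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical helper: guesses why a link is broken from the observed failure.
final class BrokenLinkCause {

    // Maps an HTTP status code to a likely cause (heuristic, not exhaustive).
    static String fromStatus(int status) {
        if (status == 404 || status == 410) {
            return "page removed or URL mistyped";
        }
        if (status == 401 || status == 403) {
            return "access restricted";
        }
        if (status >= 400) {
            return "other HTTP error";
        }
        return "not broken";
    }

    // Maps a connection-level failure (no HTTP response at all) to a likely cause.
    static String fromException(IOException e) {
        if (e instanceof UnknownHostException) {
            return "destination host not found";
        }
        if (e instanceof SocketTimeoutException) {
            return "no response in time (site down or request filtered)";
        }
        return "connection failed";
    }

    public static void main(String[] args) {
        System.out.println(fromStatus(404));
        System.out.println(fromException(new UnknownHostException("example.invalid")));
    }
}
```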

There are different ways to automate broken-link checks. Here is my first suggestion:

1. Broken-link-checker with Node.js

  • Broken-link-checker is an npm library that stream-parses local and remote HTML pages.
  • It checks multiple links concurrently and handles both absolute and relative URLs.
  • Node.js must be installed to use this library.
  • We can install the library with the command “npm install broken-link-checker”.
  • We use the HtmlUrlChecker class, which can be customized with various options. See the library's npm page for more details.
  • As many websites block “HEAD” requests, it is important to set requestMethod to “GET” in the class options.
  • We can post-process the response data to produce more useful result logs.
  • The sample code looks like this:
node.js sample code
  • The console log looks like this:
node.js project console log

2. Jsoup with Java

  • Jsoup is a Java library for working with HTML.
  • It provides a convenient API for fetching URLs and for extracting and manipulating data.
  • First, we establish a connection to the website and retrieve the HTML DOM using Jsoup.
  • Then we filter the document to find all the links it contains.
  • Lastly, we open an HTTP connection to each link and check its status.
  • As many websites block “HEAD” requests, it is important to set the request method to “GET” when opening the connection.
  • This is the sample code for Jsoup web scraping:
jsoup-java-sample-code
  • The console log looks like this:
java jsoup project console log
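
The last step above, checking each link with a GET request, can be sketched with the Java standard library alone. This is a minimal sketch and not the article's original sample: the class name, helper names, the 5-second timeouts, and the rule that any status of 400 or above counts as broken are all assumptions. The Jsoup part is shown only in comments, using the Jsoup.connect(...).get(), select("a[href]") and attr("abs:href") calls.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper: checks the HTTP status of links passed on the command line.
final class LinkStatusCheck {

    // Only http/https links can be checked over an HTTP connection.
    static boolean isHttpLink(String link) {
        return link.startsWith("http://") || link.startsWith("https://");
    }

    // Assumed rule: no valid status, or any 4xx/5xx status, means the link is broken.
    static boolean isBroken(int status) {
        return status < 100 || status >= 400;
    }

    // Sends a GET request (not HEAD) and returns the HTTP status code.
    static int fetchStatus(String link) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(link).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5_000); // assumed timeout values
        conn.setReadTimeout(5_000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // With Jsoup on the classpath, the links would come from the page itself:
        //   Document doc = Jsoup.connect(pageUrl).get();
        //   for (Element a : doc.select("a[href]")) { String link = a.attr("abs:href"); ... }
        for (String link : args) {
            if (!isHttpLink(link)) {
                System.out.println("SKIPPED " + link);
                continue;
            }
            try {
                int status = fetchStatus(link);
                System.out.println((isBroken(status) ? "BROKEN " : "OK ") + status + " " + link);
            } catch (IOException | IllegalArgumentException e) {
                // No HTTP response at all, e.g. unknown host, timeout, or malformed URL.
                System.out.println("BROKEN (" + e + ") " + link);
            }
        }
    }
}
```

Run it with one or more URLs as arguments. Links that are not http or https (for example mailto: links) are skipped rather than reported as broken.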

Conclusion

  • You can treat these two sample projects as templates and add more features.
  • You can make the link check parameterized and dynamic.
  • You can use a CI/CD tool like Jenkins, or set up a cron job, to run the project periodically.
  • You can also add reporting logic to send the results in a suitable format.
  • If you want a different solution, you can also look at Selenium or other Node.js libraries such as axios and cheerio.
  • node.js project source code:
  • jsoup project source code:

Hope you enjoy the article! See you in other posts :)
