Simple web site crawler using .NET Core and C#

Overview

This is an elementary web site crawler written in C# on .NET Core. What do we mean by crawling a web site? No, we are not indexing the content of the pages.

This is a simple component which crawls through a web site (example: www.cnn.com), finds sub-links and in turn crawls those pages. Only links which fall under the domain of the parent site are accepted.

Implementation overview

The accompanying code produces a command line application. The results are written directly to the console.

Here is an excerpt of the output when https://www.cnn.com/ was crawled:

https://www.cnn.com/videos/world/2021/11/16/uganda-explosions-madowo-live-nr-ovn-intl-ldn-vpx.cnn
https://www.cnn.com/videos/world/2021/11/17/lagos-nigeria-erosion-climate-change-ctw-busari-pkg-intl-vpx.cnn
https://www.cnn.com/videos/world/2021/11/19/blinken-says-ethiopia-on-path-to-destruction-sot-intl-ctw-vpx.cnn
https://www.cnn.com/videos/world/2021/11/25/davido-birthday-donations-orphanage-oneworld-intl-intv-vpx.cnn
https://www.cnn.com/videos/world/2021/11/29/salim-abdool-karim-south-africa-omicron-variant-covid-19-travel-ban-sot-newday-vpx.cnn
https://www.cnn.com/videos/world/2021/12/02/botswana-president-omicron-coronavirus-travel-bans-oneworld-intl-vpx.cnn
https://www.cnn.com/vr
https://www.cnn.com/warnermediaprivacy.com/opt-out
https://www.cnn.com/weather
https://www.cnn.com/world
https://www.cnn.com/www.cnn.com/interactive/travel/best-beaches
https://www.cnn.com/www.cnnpartners.com
https://www.cnn.com/youtube.com/user/CNN
Found 663 sites in the Url:'https://www.cnn.com', after searching a maximum of 30 sites

What kind of hyperlinks are we processing?

Only links which point to pages on the same domain as the specified URL are extracted. The following kinds of links are ignored:

  • Any external site. E.g. www.twitter.com
  • Links which are mailto:, tel: or sms: references
  • Bookmarks. Example: <a href='#chapter1'>Chapter 1</a>
  • Any link which produces a content type other than text/html. Example: a link to a PDF document
  • Any link which is under a sub-domain of the root domain. E.g. if www.cnn.com was being crawled, then a link such as sports.cnn.com would be ignored

The following kinds of links are accepted (assuming www.cnn.com was being crawled):

  • <a href='/contactus.htm'>Contact us</a>
  • <a href='/Careers'>Careers</a>
  • <a href='https://www.cnn.com/Careers'>Careers</a>

A sketch of this filtering logic is shown right after this list.
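The exact rules live inside the crawler component; the snippet below is only a minimal sketch of how such a filter could look, built around a hypothetical helper named IsCrawlable. The text/html rule is not shown here because the content type is only known after the page has been downloaded.

// Illustrative sketch only -- the real filtering lives inside the crawler component.
// Given the root URI being crawled and a raw href value, decide whether to follow it.
private static bool IsCrawlable(Uri rootUri, string href)
{
    if (string.IsNullOrWhiteSpace(href))
    {
        return false;
    }

    // Ignore bookmarks and non-HTTP schemes such as mailto:, tel: and sms:
    if (href.StartsWith("#") ||
        href.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase) ||
        href.StartsWith("tel:", StringComparison.OrdinalIgnoreCase) ||
        href.StartsWith("sms:", StringComparison.OrdinalIgnoreCase))
    {
        return false;
    }

    // Relative links (e.g. /contactus.htm) are resolved against the root URL
    if (!Uri.TryCreate(rootUri, href, out var absoluteUri))
    {
        return false;
    }

    // Accept only the exact host of the parent site; sub-domains such as
    // sports.cnn.com and external sites such as www.twitter.com are rejected.
    return string.Equals(absoluteUri.Host, rootUri.Host, StringComparison.OrdinalIgnoreCase);
}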

How to compile and run the code?

The current version has been built using Visual Studio 2019 and .NET Core 3.1. The crawler component itself targets .NET Standard 2.0.

https://github.com/sdg002/AnyDotnetStuff/tree/master/WebsiteCrawler

WebsiteCrawler.exe  --url https://www.cnn.com --maxsites 30

Sample outputs

Here are some examples of the output produced by the web site crawler

WebsiteCrawler.exe --url https://www.bbc.co.uk --maxsites 5
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30
WebsiteCrawler.exe --maxsites 10 --url https://www.premierleague.com

How are we parsing the HTML?

The NuGet package HtmlAgilityPack is being used for parsing the links out of an HTML fragment. Refer to the class HtmlAgilityParser.cs.
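The repository's HtmlAgilityParser is the concrete implementation; the following is only a rough sketch of how hyperlinks can be pulled out of an HTML string with HtmlAgilityPack, and the actual class may differ.

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public class HtmlAgilityParser : IHtmlParser
{
    public List<string> GetLinks(string htmlContent)
    {
        var document = new HtmlDocument();
        document.LoadHtml(htmlContent);

        // Select every anchor element that carries an href attribute
        var anchors = document.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return new List<string>();
        }

        // Return the raw href values; filtering happens in the crawler itself
        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .Where(href => !string.IsNullOrWhiteSpace(href))
            .ToList();
    }
}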

Separation of concerns via interfaces and Dependency injection

One of the key principles of SOLID mandates that we code against interfaces as opposed to concrete implementations. This significantly simplifies unit testing via mocking/faking.

IHtmlParser

public interface IHtmlParser
{
    /// <summary>
    /// Returns all hyperlink URLs from the given HTML document
    /// </summary>
    /// <param name="htmlContent">The HTML content to scan</param>
    /// <returns>The list of hyperlink URLs found in the document</returns>
    List<string> GetLinks(string htmlContent);
}

I have provided a concrete implementation using the NuGet package HtmlAgilityPack.

ILogger

The command line executable uses Log4net for logging.

private static ServiceProvider ConfigureServices()
{
    var serviceCollection = new ServiceCollection();
    serviceCollection.AddTransient<IWebSiteCrawler, SingleThreadedWebSiteCrawler>();
    serviceCollection.AddTransient<IHtmlParser, HtmlAgilityParser>();
    serviceCollection.AddTransient<HttpClient>();
    serviceCollection.AddLogging(builder => builder.AddLog4Net("log4net.config"));
    serviceCollection.AddTransient<ICrawlerResultsFormatter, CsvResultsFormatter>();
    return serviceCollection.BuildServiceProvider();
}
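With the service provider in place, the executable can resolve the crawler and run it. The snippet below is a minimal sketch; the Run signature of the current crawler (start URL plus an upper limit on pages) is an assumption here and may differ from the repository.

private static async Task RunAsync(string url, int maxSites)
{
    var serviceProvider = ConfigureServices();

    // Resolve the crawler through its interface rather than the concrete class
    var crawler = serviceProvider.GetRequiredService<IWebSiteCrawler>();

    // Assumed signature: the start URL and an upper limit on the number of pages to search
    SearchResults results = await crawler.Run(url, maxSites);
}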

If the crawler component were used by an ASP.NET Core worker service, then you could imagine logging to Azure Application Insights instead:

services.AddLogging(builder =>
{
    builder.AddApplicationInsights("<YourInstrumentationKey>");
});

How are we mocking the HttpClient class?

The HttpClient class delegates all requests to an HttpMessageHandler, whose protected SendAsync method can be intercepted with Moq. I found this article very helpful.

internal static class Mocks
{
    internal static Mock<HttpMessageHandler> CreateHttpMessageHandlerMock(
        string responseBody,
        HttpStatusCode responseStatus,
        string responseContentType)
    {
        var handlerMock = new Mock<HttpMessageHandler>();

        // Build the canned response that the mocked handler will return
        var response = new HttpResponseMessage
        {
            StatusCode = responseStatus,
            Content = new StringContent(responseBody),
        };
        response.Content.Headers.ContentType = new MediaTypeHeaderValue(responseContentType);

        // Intercept the protected SendAsync method and return the canned response
        handlerMock
            .Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(response);

        return handlerMock;
    }
}
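In a unit test, the mocked handler is wrapped in a real HttpClient and handed to the class under test. The sketch below only shows the arrangement; the crawler's constructor parameters are an assumption and not shown here.

// Arrange: a handler that always returns a small HTML page
var handlerMock = Mocks.CreateHttpMessageHandlerMock(
    "<html><body><a href='/page1'>Page 1</a></body></html>",
    HttpStatusCode.OK,
    "text/html");

// Any HttpClient built over the mocked handler never touches the network
var httpClient = new HttpClient(handlerMock.Object);

// The httpClient can now be passed to SingleThreadedWebSiteCrawler
// (constructor parameters are assumed) and the crawl exercised offline.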

How are command line arguments being processed?

The component CommandLineParser is being used. This component uses the following model class to interpret the command line arguments:

public class CmdLineArgumentModel
{
    [Option("maxsites", Required = true, HelpText = "An upper limit on the number of sites to search. Example: 30")]
    public int MaxSites { get; set; }

    [Option("url", Required = true, HelpText = "The URL of the site to search")]
    public string Url { get; set; }
}
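The parsing call in the executable's entry point would then look roughly like the sketch below; the RunCrawler method is an illustrative placeholder, not the repository's actual code.

using System;
using CommandLine;

internal static class Program
{
    private static int Main(string[] args)
    {
        // CommandLineParser maps --url and --maxsites onto CmdLineArgumentModel
        return Parser.Default.ParseArguments<CmdLineArgumentModel>(args)
            .MapResult(
                opts => RunCrawler(opts),   // run the crawl with the parsed options
                errors => 1);               // non-zero exit code when parsing fails
    }

    // Placeholder for the real crawl invocation (illustrative only)
    private static int RunCrawler(CmdLineArgumentModel options)
    {
        Console.WriteLine($"Crawling {options.Url}, up to {options.MaxSites} pages");
        return 0;
    }
}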

Refer to the project site of this component for more information.

How are we logging?

All C# code is written with the expectation that a concrete implementation of ILogger<T> will be injected via Dependency Injection.
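As an illustration, the crawler's constructor would receive its logger (and other dependencies) along these lines; the exact parameter list is an assumption, not the repository's actual signature.

public class SingleThreadedWebSiteCrawler
{
    private readonly ILogger<SingleThreadedWebSiteCrawler> _logger;
    private readonly IHtmlParser _htmlParser;
    private readonly HttpClient _httpClient;

    // Dependencies arrive through the constructor, wired up by ConfigureServices
    // (parameter list assumed for illustration; crawl logic elided)
    public SingleThreadedWebSiteCrawler(
        ILogger<SingleThreadedWebSiteCrawler> logger,
        IHtmlParser htmlParser,
        HttpClient httpClient)
    {
        _logger = logger;
        _htmlParser = htmlParser;
        _httpClient = httpClient;
    }
}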

log4net is being used. In the current implementation, the logging output is displayed on the console.

Logging to the Output window of Visual Studio also helps in examining the log output:

/// <summary>
/// Helps you view logging results in the Output Window of Visual Studio
/// </summary>
private ILogger<T> CreateOutputWindowLogger<T>()
{
    var serviceProvider = new ServiceCollection()
        .AddLogging(builder => builder.AddDebug())
        .BuildServiceProvider();
    return serviceProvider.GetService<ILogger<T>>();
}

ILogger<SingleThreadedWebSiteCrawler> logger = CreateOutputWindowLogger<SingleThreadedWebSiteCrawler>();

Scenarios that have not been tested

  • Integration tests on the executable. We should programmatically launch the executable with various combinations of command line arguments and test for the expected output
  • Explore whether we have covered all HTTP status codes while doing a Polly retry.
  • Better approach to recording errors. Currently they are only logged. There should be an errors collection, and this should be unit tested
  • Better approach to ignored links. Currently they are only logged. This could be incorporated as another collection in the results object.

Further improvements

The current implementation is single threaded. This is obviously slow. A more sophisticated implementation would be broadly as follows:

start ---> master thread manages a queue of work items ---> spawns child Tasks ---> each Task pulls the first available item from the work queue, parses links and adds new links to the same queue

Rewriting the component as a multithreaded crawler is possible, but makes the coding and testing more complex. But what do we gain? Consider the scenario where we have multiple sites to crawl (www.site1.com, www.site2.com, www.site3.com). Why not create independent tasks, one for each of www.site1.com, www.site2.com and www.site3.com? This might be a win-win solution, as sketched below.
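A minimal sketch of that idea, assuming each site gets its own transient IWebSiteCrawler instance from the service provider and that Run takes a URL plus a page limit (both assumptions):

var sites = new[] { "https://www.site1.com", "https://www.site2.com", "https://www.site3.com" };

// One task per site; each task owns an independent (transient) crawler instance,
// so the single-threaded crawler itself does not need to change.
var crawls = sites.Select(site =>
{
    var crawler = serviceProvider.GetRequiredService<IWebSiteCrawler>();
    return crawler.Run(site, 100);   // assumed signature: URL plus page limit
});

var allResults = await Task.WhenAll(crawls);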

We could think of providing the ability to produce output in JSON format as well as CSV. Example:

WebsiteCrawler.exe  --url https://www.cnn.com --maxsites 30 --format csv
WebsiteCrawler.exe --url https://www.cnn.com --maxsites 30 --format json

We could also save the generated results to a file. Example:

WebsiteCrawler.exe  --url https://www.cnn.com --maxsites 30 --output c:\temp\results.csv

Imagine we are crawling a deeply nested web site like www.cnn.com. This could run into thousands of pages. Beginning from the top level every time might not be a very efficient approach. Searching in smaller batches (say 100 sites) might be more practical. To handle such a long running process, we could make the SingleThreadedWebSiteCrawler component stateful: pass in the last known state and resume from that point onwards. The queue data structure would be returned along with the results, and the entire results object would be made serializable.

public interface IWebSiteCrawler
{
    Task<SearchResults> Run(SearchResults lastKnownResult, string url, int maxPagesToSearch);
}
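Under that proposed interface, a long crawl could be driven in batches along these lines (a sketch only; the batch size and the serialization step are illustrative):

SearchResults lastKnownResult = null;

// Crawl in batches of 100 pages, resuming from the previous state each time
for (int batch = 0; batch < 10; batch++)
{
    lastKnownResult = await crawler.Run(lastKnownResult, "https://www.cnn.com", 100);

    // The results (including the pending queue of links) could be serialized here,
    // so that a later run can pick up exactly where this one stopped.
}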
  • We could turn this into an ASP.NET worker service
  • The service would be running as a Web job in Azure or in some container.
  • The worker service would be passively waiting for messages in a queue
  • Results would be written back to a database.
  • Such a worker service could be made very resilient by managing state in an external data store. Example: consider incrementally crawling a large web site over several hours.
