Practical .NET Core — write a web scraper, downloader & Excel parser. Part 1: Scraper
…cuz impractical would be an inefficient and dull thing to do.
Well hello there eager dev. Are you willing to sacrifice your precious time once again? Yes? Oh you poor suc… Uh, what I mean is actually — excellent! You are then more than welcome here. Good luuuck!
This is the first article in the series (out of 4) and I’ll demonstrate how to set up an initial solution and use a library for fetching, scraping and parsing web pages in .NET Core with C#.
The source for this part is available on my GitHub repository.
Firstly, a deep thank you for your invaluable time and self-motivation. This is my very first endeavor as a technical writer, so please, do bear with me. And please, do comment on how you felt about this piece, style, technicality, etc, thanks, Vedran.
Time is what we want most, but what we use worst.
— William Penn
Foreword, who should read this? 📖
I’d say this series is for anyone willing to touch base with newer Microsoft software dev technologies (from 2015 and on). Also, for someone who didn’t find the time for a practical tutorial or was skeptical ‘till now. Of course, this someone should have some kind of a programming background. This series will demonstrate relatively new and surely useful features of the C# programming language, the new and interesting stuff introduced in version 7.0 and on. Nevertheless, I’ll also show some of the acknowledged killer features like LINQ and TPL. If you are a professional or seasoned C# dev then this is also for you! And if you’re not, but you’re still looking for a thorough step-by-step tutorial on starting development with .NET (Core), then again, this IS… the place to be and the piece to read. I won’t be covering every method in exhaustive detail, as that would turn this series into part 1 of 44.
If you are looking for part 2, it is over here: https://medium.com/@vekzdran/practical-net-core-write-a-scraper-fetcher-xls-x-parser-part-2-parallel-downloading-fc4d21f21417
Uhmmm, so what’s this series really ‘bout? 🤔
In the following series I’d like to be your dev mentor, leading you through a practical tutorial. You’ll learn to leverage Microsoft’s revamped dev framework, .NET Core, with the C# programming language in building a software solution which can do the following:
- Fetch HTML pages, select and parse their li elements and extract links
- Those links lead to spreadsheets which we will download and save to disk
- Parse the fetched .xlsx and .xls docs into C# model classes
- Expose a web API endpoint dumping out the data aggregation
Why I decided to do this, and why should you (care)?
In short, I like data and browsing it. And I guess you should too! When I get a grasp of some data I expect it to be, at least to some extent, normalized or structured. Only then can the reader dig through it quite swiftly and get the idea behind it. If the dataset is well designed and easily parsable, i.e. it has no data input mistakes, then I’d like to see a visualization of it. Be that in a grid or a chart, it’s the same to me.
So, yeah, I am from Croatia. Which is in Europe. It is an astonishingly beautiful country. It’s also next to the Adriatic sea (part of the Mediterranean). You should visit it (no, you really should, no excuses!). On the other hand, we are not a very digitized country, but we are getting there… deep breaths. Slowly. On the bright side, there are at least a lot of places to start improving. If you are interested in our local cuisine you can explore it on TasteAtlas.
Back to topic! One of those places for improvement is the Croatian Health Insurance Fund’s web site. Every couple of months they publish a list of drugs (or meds, however you prefer). This is the official list of approved meds available to registered citizens. The list is basically split into a couple of other lists, which are spreadsheet documents. I won’t go into further detail here, but will just stick to the primary and supplementary drug lists.
With this fact, or splinter, in mind (seeing those spreadsheets published every couple of months over there, as mere links), I immediately got an idea! A simple but practical one: I’d scrape and normalize the data into an app that would publish it in a computer-readable format like JSON, served from a RESTful API. And that’s it! Well, of course I was also intrigued by how to assemble a system like that in .NET Core. You know, the dev(il) in you never sleeps. :-)
Yeah, enough of this boring intro, show me the moneyz…uhmmm, I mean codez 100010010…
First some technical pre-reqs 🐒
Firstly, you gotta download and install the latest .NET Core SDK and Runtime (two downloads!), so head over to the official .NET Core downloads page.
Great! Now open your terminal and check if the dotnet global CLI tool is installed by typing:
> dotnet --info
After running the previous command you should get an output stating some details about the installed SDK. Good, this means you are ready to go.
Oh, one more thing, you’ll be using a code editor. I hope so. Which one? Well, that is totally up to you and your preference. Initially I was developing in Visual Studio on a Windows machine. But I am writing this article with a cross-platform focus in mind, so I’d prefer the lighter and more cross-platform-friendly approach: the Visual Studio Code editor. If you haven’t tried it yet and you’ve made it this far, well, why not give it a go! :)
Creating a solution and referenced projects 🥂
Start off by creating a work directory on your disk and navigate to it. Give it a truly fancy-pants name like meds-processor or whatever you like:
> mkdir meds-processor
> cd meds-processor
If you wish (and I really do think you’d wish so) you can git init now to keep track of all your work under source control. Now, create a new solution file named MedsProcessor that will reference our projects:
> dotnet new sln -n MedsProcessor
Amazing! The solution will hold a couple of projects. Now let’s create a blank web API project and three class lib projects for our scraper-downloader-parser (scradownpa!?) thingy. The names below are pretty much self-descriptive:
> dotnet new webapi -n MedsProcessor.WebAPI
> dotnet new classlib -n MedsProcessor.Scraper
> dotnet new classlib -n MedsProcessor.Parser
> dotnet new classlib -n MedsProcessor.Common
So now, we only need to link this all up. We’ll use another dotnet command which will reference the newly added projects (on a Mac or Linux OS) in the solution file:
> dotnet sln add **/*.csproj
The previous command won’t work in your Windows CMD or PS terminal, as those * symbols will not get recognized. This handy subcommand leverages globbing patterns, so this operation is actually faster than doing it ‘manually’ through the Visual Studio IDE’s UI. To make it work on Windows you’ll have to type the full paths to your .csproj files like:
> dotnet sln add MedsProcessor.WebAPI/MedsProcessor.WebAPI.csproj MedsProcessor.Scraper/MedsProcessor.Scraper.csproj MedsProcessor.Parser/MedsProcessor.Parser.csproj MedsProcessor.Common/MedsProcessor.Common.csproj
One more thing we need to do is link the projects between themselves. We can do that also with the (Mac and Linux friendly) dotnet command tools:
> dotnet add MedsProcessor.WebAPI/*csproj reference MedsProcessor.Scraper/*csproj MedsProcessor.Parser/*csproj MedsProcessor.Common/*csproj
> dotnet add MedsProcessor.Scraper/*csproj reference MedsProcessor.Common/*csproj
> dotnet add MedsProcessor.Parser/*csproj reference MedsProcessor.Common/*csproj
Again, you’ll need the full paths on Windows, ahhh, I know…. And that’s it. Your work directory file system should now look something like this:
MedsProcessor.Parser MedsProcessor.Scraper MedsProcessor.Common MedsProcessor.WebAPI MedsProcessor.sln
Finally you can start off a build and verify you’ve set up everything as expected:
> dotnet build
Voila! 👏 The dotnet build command will find the solution in the current directory and perform a build against it, including all of the referenced projects. If you are interested in learning more about the dotnet command, you can find more about it in the official MSDN documentation.
You can open the project in VSCode from the terminal like so:
> code .
Initial project(s) setup 🏗
After opening the solution directory in VSCode you can do some initial cleaning up by removing the default Class1.cs files in the classlib projects and the default ValuesController.cs file in the Web API project.
Next, add a new API controller file AppController.cs to our Controllers directory. This controller will hold the (single) entry action method Index. The Index method’s response will serve as our “home page” and calling it will trigger off the scraping, downloading and parsing process. The controller should look something like:
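The original gist isn’t embedded here, so here’s a minimal sketch of what the controller could look like at this stage (assuming an ASP.NET Core 2.x style MVC controller; the placeholder response body is my own):

```csharp
using Microsoft.AspNetCore.Mvc;

namespace MedsProcessor.WebAPI.Controllers
{
	// Mapped to the application's root path via [Route("~/")],
	// which is explained a bit further below
	[Route("~/")]
	public class AppController : Controller
	{
		[HttpGet]
		public IActionResult Index() =>
			// Placeholder response; we'll wire up the actual
			// scrape-download-parse pipeline later in this article
			Ok("MedsProcessor is alive!");
	}
}
```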
Before we run this, let’s make sure you trust the developer self-signed SSL certificates for development. To do this you will use the dotnet dev-certs CLI tool! Type the following into your terminal:
> dotnet dev-certs https -t
If you agreed to trust it and entered your account’s password to verify, you should get a message that the dev cert was successfully installed. This means you can run the https protocol locally while developing with the Kestrel web server that is provided by .NET Core. To test that this works, “just” type the following (you can help yourself with an npm script so as not to type the whole chunk every time):
> dotnet run -p .\MedsProcessor.WebAPI\*csproj
The [Route("~/")] attribute defines that the AppController and its Index method will be the application’s root path, i.e. the start page.
Set up dependencies ⛓ and add AngleSharp 👼
In the MedsProcessor.Scraper project, add a new C# class named HzzoHtmlScraper.cs. This class will contain our logic for fetching the remote HTML content of two pages on the www.hzzo.hr domain. After downloading those documents, we’ll use standard CSS selectors to locate anchor elements on those pages and extract their href attribute values. Those links will point us to the target Excel documents that we need to download in order to parse them.
Before adding anything concrete to our class, let’s add the AngleSharp NuGet package reference, which will allow us to fetch and parse remote HTML docs. Running a dotnet build after adding the AngleSharp ref will also trigger a restore of the package.
Add the AngleSharp NuGet package reference and restore it:
> cd MedsProcessor.Scraper
> dotnet add package AngleSharp
> dotnet build
The HzzoHtmlScraper.cs should look something like this:
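Since the gist isn’t embedded here, a minimal skeleton of the class (assuming constructor injection of AngleSharp’s IBrowsingContext, as the next paragraph implies) could look like:

```csharp
using AngleSharp;

namespace MedsProcessor.Scraper
{
	public class HzzoHtmlScraper
	{
		readonly IBrowsingContext _browsingContext;

		// IBrowsingContext will be provided by the DI container,
		// which we configure in the next section
		public HzzoHtmlScraper(IBrowsingContext browsingContext) =>
			_browsingContext = browsingContext;
	}
}
```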
This will compile, but we are missing the injected IBrowsingContext. .NET Core’s DI would fail to resolve it, as we did not register a provider for it in the DI container in the Startup class. So let’s do that! But first, let’s add a C# extension method for registering this dependency in the Scraper project.
First, install the MSFT DI package into the Scraper project:
> cd MedsProcessor.Scraper
> dotnet add package Microsoft.Extensions.DependencyInjection
> dotnet build
Good, now add the ServiceCollectionExtensions.cs class file in the project’s root, defining the extension method that registers the needed AngleSharp IBrowsingContext dependency:
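A sketch of that file; the AddAngleSharp name matches its usage in Startup.cs below, and the loader configuration shown is AngleSharp’s standard default:

```csharp
using AngleSharp;
using Microsoft.Extensions.DependencyInjection;

namespace MedsProcessor.Scraper
{
	public static class ServiceCollectionExtensions
	{
		// Registers a single shared IBrowsingContext configured
		// with AngleSharp's default document loader
		public static IServiceCollection AddAngleSharp(this IServiceCollection services) =>
			services.AddSingleton<IBrowsingContext>(_ =>
				BrowsingContext.New(Configuration.Default.WithDefaultLoader()));
	}
}
```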
Now let’s modify the Startup.cs file in our Web API project to call this singleton dependency registration. We will also register an HttpClient provider and the HzzoHtmlScraper. Your Startup.cs file should look something like this:
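The relevant fragment of ConfigureServices could look roughly like this (the exact MVC registration call depends on your .NET Core version):

```csharp
// Startup.cs (fragment); requires the following usings at the top:
// using System.Net.Http;
// using MedsProcessor.Scraper;
// using Microsoft.Extensions.DependencyInjection;
public void ConfigureServices(IServiceCollection services)
{
	// AngleSharp's IBrowsingContext via our extension method
	services.AddAngleSharp();

	// A shared HttpClient, used in part 2 for downloading the spreadsheets
	services.AddSingleton<HttpClient>();

	// The scraper itself
	services.AddSingleton<HzzoHtmlScraper>();

	services.AddMvc();
}
```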
Notice that for proper calling of services.AddAngleSharp() you’ll need a using reference to MedsProcessor.Scraper, which was added to the top of the document.
Fetch and parse HTML with AngleSharp 👼
Head back to HzzoHtmlScraper and let’s add some code that will do the heavy lifting. The following code could be implemented in various ways (and probably more performant ones) in terms of parallel processing (and style…). But I’ve chosen to download the HTML documents in a non-blocking, parallel fashion and to parse them sequentially with plain ol’ LINQ. Why? Well, this came to my mind first. 😅 I know, not the best argument, but you can at least explore this style and favor non-blocking parallel execution.
The core method that we will define is public async Task&lt;ISet&lt;HzzoMedsDownloadDto&gt;&gt; Run(). This method will integrate two other private calls to download and parse the HTML documents. The method looks like:
Following the previous, we need a definition for DownloadHtmlDocuments, which will leverage the AngleSharp library to download the HTML docs:
This one was easy! Uhmmm, but it won’t compile at this moment, as you’re missing the string constants we used in the OpenAsync calls. So, let’s add a Constants.cs file to the Common project root with the following contents:
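A sketch of such a file; note that the URL values below are placeholders, the real addresses being the two meds list pages on the www.hzzo.hr domain:

```csharp
namespace MedsProcessor.Common
{
	public static class Constants
	{
		// Placeholder values: substitute the actual hzzo.hr list page URLs here
		public const string CURRENT_LISTS_URL = "https://www.hzzo.hr/...";
		public const string ARCHIVE_LISTS_URL = "https://www.hzzo.hr/...";

		// Used in a future part of the series for saving the spreadsheets
		public const string DOWNLOAD_DIR = "./downloads/";
	}
}
```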
This file contains the URLs that we will fetch from and a download directory path that we’ll use in a future part of this series.
With the previous file added, you’re ready to do the heavy lifting of traversing the downloaded HTML documents. This will be done with the help of CSS selectors and AngleSharp again. You’ll notice that this logic consists of a couple of methods working together, each doing its respective part. Basically, we will extract the HTML unordered list’s li elements and parse their contents into a C# DTO. This DTO class will be used across the whole series, so we need to design it properly to suit its purpose of holding the downloaded data, i.e. the Excel documents and their C# parsed variants. This parsed variant will in turn be another DTO class, which we will cover in the upcoming article.
Let’s create a download DTO class named HzzoMedsDownloadDto and place it under a new directory named Models in the Common project, with the following contents:
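A sketch of the DTO, assuming it tracks the document link, the validity date and (for part 2) the downloaded document stream; the exact property set in the original may differ:

```csharp
using System;
using System.IO;

namespace MedsProcessor.Common.Models
{
	public class HzzoMedsDownloadDto
	{
		public HzzoMedsDownloadDto(string href, DateTime validFrom)
		{
			Href = href;
			ValidFrom = validFrom;
		}

		// Absolute link to the .xls(x) spreadsheet document
		public string Href { get; }

		// The date from which the meds list is valid; crucial for the whole series
		public DateTime ValidFrom { get; }

		// Will hold the downloaded spreadsheet contents in part 2
		public MemoryStream DocumentStream { get; set; }

		// Convenience name for saving the document to disk later on
		public string FileName => Href.Substring(Href.LastIndexOf('/') + 1);
	}
}
```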
The ValidFrom property is very important in the whole story, as it reflects the meds data validity period. With all this in place, we can now add the HTML parsing and link extraction logic for filling out those DTOs. The following should come after the Run method in HzzoHtmlScraper.cs:
Notice the comments I added stating that we will have to cut off some of the spreadsheet documents. You’ll probably be needing some using directives at the head of the file, so make sure you have the following:
using static MedsProcessor.Common.Constants;
Great, you are almost done! Let’s wrap this up by allowing the Run method to be called in our AppController and writing out a result of the fetched Excel links. Open up your AppController and modify its Index method to look like the following:
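A sketch of the modified controller, with the scraper injected through the constructor and the result of Run written out:

```csharp
using System.Linq;
using System.Threading.Tasks;
using MedsProcessor.Scraper;
using Microsoft.AspNetCore.Mvc;

namespace MedsProcessor.WebAPI.Controllers
{
	[Route("~/")]
	public class AppController : Controller
	{
		readonly HzzoHtmlScraper _scraper;

		public AppController(HzzoHtmlScraper scraper) =>
			_scraper = scraper;

		[HttpGet]
		public async Task<IActionResult> Index()
		{
			// Triggers the whole scraping process and
			// returns the extracted Excel document links
			var meds = await _scraper.Run();

			return Ok(meds.Select(dto => dto.Href));
		}
	}
}
```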
Good job! You’re done!! You should now be able to start the app and get a result of all the valid links that we will use for downloading later. Fingers crossed! 🤞 Go back to the solution root and start the app with the following command:
> dotnet run -p MedsProcessor.WebAPI/*.csproj
Here is a screenshot of how it looks on my Chrome:
TL;DR — Conclusion 👈
In this article you’ve learned how to install the .NET Core framework on any OS and bootstrap a solution with referenced projects, including a Web API, from the command line using the dotnet CLI tool. For downloading remote HTML content we’ve used the AngleSharp library and its IDocument interfaces to do most of the work. With the help of Tasks and LINQ we were able to do some parallel processing for the downloads and parse the HTML in a succinct style. The parsing was performed with AngleSharp again, by using regular CSS selectors to extract list and anchor elements with their relevant contents. The parsed content was loaded into C# DTO model classes. Those classes will be used later in the series to fetch the Excel documents and to hold the parsed Excel document DTOs that we will also design and add.
And once again, if you are looking for part 2, it’s over here: https://medium.com/@vekzdran/practical-net-core-write-a-scraper-fetcher-xls-x-parser-part-2-parallel-downloading-fc4d21f21417
I hope this was insightful and that you’ve learned something new. If you liked what you read please leave a comment, critique, emoji, or whatever suits you best, thanks!
…oh and don’t forget to clap 👏, at least for yourself for making it all the way to here! 😜