Parse an HTML table with PowerShell

Oct 24, 2023

Imagine you have to verify the response of an API in an integration test. The response is master data in JSON format, so it is static, but it is quite large, and the only “source of truth” you can verify it against is a table on a page of your favorite wiki software (e.g. Wikipedia or a Confluence page). The idea is to parse the table with PowerShell and compare it to the JSON response of the API call.

Long story short: the API is described as a table on a Confluence page, and your task is to “understand” the meaning of the table’s cells and parse them so that PowerShell can handle this large amount of data.

Sounds exciting, doesn’t it?

Setup

To demonstrate how to parse an HTML table with PowerShell, I created a GitHub repo: ms_powershellTableCoverter.

To have a working example, you’ll find a Dockerfile in the docker directory. It starts an nginx server with a static index.html file that contains the table with the fake specification for the API. Here are the commands to boot up the web server:

> cd docker
> docker build -t nginxwebsrvstatictable:0.0.1 .
> docker run -p 8080:80 nginxwebsrvstatictable:0.0.1

Now, if you open http://localhost:8080/ you should see a table of cat breeds (Source Wikipedia: List of Cat breeds, unchanged, Creative Commons Attribution-ShareAlike 4.0). Additionally, if you open http://localhost:8080/catbreeds/, you should see a list of cat breeds in JSON format.

If those two requests are successful, we can go on to parse the table with PowerShell.

Prerequisites

For parsing the HTML page, I chose the PowerShell module “PowerHTML” from the PowerShell Gallery. The following command installs the module:

> Install-Module -Name PowerHTML

After a few moments, the module should be available on your system. To test it, type “ConvertFrom-HTML” and hit Enter. If no errors are displayed, the installation was successful.
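A slightly more explicit check, as a sketch: ask PowerShell whether the module is installed and which commands it exports. The helper name Test-ModuleAvailable is made up for this example and is not part of PowerHTML.

```powershell
# Hypothetical helper (not part of PowerHTML): returns $true if a module
# with the given name is installed somewhere on $env:PSModulePath.
function Test-ModuleAvailable {
    param([Parameter(Mandatory)][string]$Name)
    return [bool](Get-Module -ListAvailable -Name $Name)
}

if (Test-ModuleAvailable -Name 'PowerHTML') {
    Import-Module PowerHTML
    # List the commands the module exports; ConvertFrom-Html should be among them.
    Get-Command -Module PowerHTML
} else {
    Write-Warning 'PowerHTML not found. Run: Install-Module -Name PowerHTML'
}
```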

What is the PowerHTML package?

PowerHTML is a module written by the GitHub user JustinGrote and published in the repository “JustinGrote/PowerHTML”. Basically, it is a wrapper for the HTML Agility Pack.

Parse the HTML document

First things first, we need to parse the entire HTML web page with this command:

> $htmlDoc = ConvertFrom-Html -URI http://localhost:8080
> $htmlDoc

NodeType Name      AttributeCount ChildNodeCount ContentLength InnerText
-------- ----      -------------- -------------- ------------- ---------
Document #document 0              3              106237        …

We assume that there is only one table on the page, or at least that the data is stored in the first table of the page. Hence, we can get the table headers with this command:

> $htmlTableHeaders = $htmlDoc.SelectNodes('//table/thead/tr[1]/th')
> $htmlTableHeaders

NodeType Name AttributeCount ChildNodeCount ContentLength InnerText
-------- ---- -------------- -------------- ------------- ---------
Element  th   4              1              8             Breed…
Element  th   4              1              21            Location of origin…
Element  th   4              1              7             Type…
Element  th   4              1              7             Body…
Element  th   4              2              14            Coat…
Element  th   4              1              14            CoatPattern…
Element  th   1              1              8             Image…
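A side note on the one-table assumption: an XPath that starts with //table matches every table on the page, not only the first one. The sketch below shows the difference on a tiny stand-in document. It uses PowerShell's built-in [xml] type instead of PowerHTML so that it runs without the web server. The expression (//table)[1] is standard XPath 1.0, and I would expect it to behave the same way with the HTML Agility Pack, but treat that as an assumption and try it against your own page.

```powershell
# Stand-in document with two tables. [xml] is PowerShell's built-in XML
# accelerator; it is used here only to demonstrate the XPath, not PowerHTML.
[xml]$doc = @'
<html><body>
  <table><thead><tr><th>Breed</th><th>Type</th></tr></thead></table>
  <table><thead><tr><th>Other</th></tr></thead></table>
</body></html>
'@

# Matches the header cells of every table on the page.
$allHeaders = $doc.SelectNodes('//table/thead/tr[1]/th')

# Matches the header cells of the first table only.
$firstTableHeaders = $doc.SelectNodes('(//table)[1]/thead/tr[1]/th')

$allHeaders.Count          # 3
$firstTableHeaders.Count   # 2
```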

Next, we initialize the array in which the table’s data will be stored.

$pwshTable = @()

Afterward, when everything is set and initialized correctly, we can start grabbing the data from the table:

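# Number of <tr> rows in the table body
$tableRows = ($htmlDoc.SelectNodes('//table/tbody/tr')).Count
# Loop from the second row to the last; each pass emits one hashtable
$pwshTable = for ($row=2; $row -le $tableRows; $row++){
    # All <td> cells of the current row
    $htmlTableRow = $htmlDoc.SelectNodes("//table/tbody/tr[$row]/td")
    $hashRow = @{}
    $i=0
    # Use the header text as key and the cell at the same position as value
    foreach($cell in $htmlTableHeaders){
        $hashRow[$cell.InnerText.Trim()] = $htmlTableRow[$i].InnerText.Trim()
        $i++
    }
    # Emit the hashtable; the for loop collects it into $pwshTable
    $hashRow
}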

The main part of the script counts the table’s rows, excluding the header row. Then it starts a loop, beginning at row two, right after the table’s header. The script fetches the cells of a particular row into an array and creates a new, empty hashtable for it. With the cells in place, it loops over the table’s header cells and, for every header column, sets a key with the associated value from the table row. Last, it “returns” the hashtable from the loop to the $pwshTable variable.

Now, when you enter $pwshTable at the command prompt, you should see something like this:
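The header-to-cell mapping can be tried in isolation. The following is a minimal sketch with made-up sample strings instead of HTML nodes. It differs from the script above in one deliberate way: it uses an [ordered] hashtable, because a plain @{} does not keep the keys in column order, and then casts the result to a [pscustomobject].

```powershell
# Sample data standing in for the InnerText of the <th> and <td> nodes.
$headers = 'Breed', 'Type', 'Coat'
$cells   = ' Abyssinian ', 'Natural', 'Short '

# [ordered] keeps the keys in column order; a plain @{} does not guarantee that.
$row = [ordered]@{}
for ($i = 0; $i -lt $headers.Count; $i++) {
    $row[$headers[$i].Trim()] = $cells[$i].Trim()
}

# Casting to [pscustomobject] allows property access such as $record.Breed
# and gives a table-like display for a list of rows.
$record = [pscustomobject]$row
$record
```

Whether you prefer hashtables or objects is a matter of taste; objects tend to be easier to pipe into cmdlets like Where-Object or Export-Csv.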

> $pwshTable

Name                           Value
----                           -----
CoatPattern                    Ticked tabby
Type                           Natural
Location of origin             Unspecified, but somewhere in Afro-Asia, likely Ethiopia[8]
Coat                           Short
Breed                          Abyssinian[7]
Image                          
Body                           Semi-foreign
CoatPattern                    Multi-color

Quite handy and efficient!

Now, we need to compare the data in $pwshTable to the response of the API (which is a static JSON file in this case). The following command grabs the JSON data from the API:

> $responseJson = Invoke-RestMethod -Method GET "http://localhost:8080/catbreeds/"
> $responseJson

catbreeds
---------
{@{Breed=Abyssinian; LocationOrigin=Unspecified, but somewhere in Afro-Asia, likely Ethiopia; Type=Natural; Body=Semi-foreign; Coat=Short; CoatPattern=Ticked tabby}, @{Breed=Ae…

> $responseJson.catbreeds

Breed          : Abyssinian
LocationOrigin : Unspecified, but somewhere in Afro-Asia, likely Ethiopia
Type           : Natural
Body           : Semi-foreign
Coat           : Short
CoatPattern    : Ticked tabby

Breed          : Aegean
LocationOrigin : Greece
Type           : Natural
Body           : Moderate
Coat           : Semi-long
CoatPattern    : Multi-color
...
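If the web server is not running, the shape of this response can be explored offline. The sketch below feeds a shortened, hand-written JSON sample to ConvertFrom-Json, which produces the same kind of objects that Invoke-RestMethod builds from a JSON response. The sample values are abbreviated and are not the full data set.

```powershell
# Inline sample mirroring the structure of the response shown above
# (values shortened; this is not the full data set).
$json = @'
{ "catbreeds": [
    { "Breed": "Abyssinian", "Type": "Natural", "Coat": "Short" },
    { "Breed": "Aegean",     "Type": "Natural", "Coat": "Semi-long" }
] }
'@

$response = $json | ConvertFrom-Json

$response.catbreeds.Count        # number of breeds in the array
$response.catbreeds[1].Breed     # Breed of the second entry
$response.catbreeds.Breed        # member enumeration: all breed names
```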

So the call to Invoke-RestMethod returns an object with an array “catbreeds”, which holds the information about the cat breeds. The rest of the comparison is quite easy:


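foreach ($catbreedTable in $pwshTable) {
    # Look up the API entry whose Breed matches the table row's Breed
    $catbreedJson = $responseJson.catbreeds | Where-Object { $_.Breed -eq $catbreedTable.Breed }
    if ($null -ne $catbreedJson){
        Write-Output "Breed found: $($catbreedJson.Breed)"
    } else {
        # $catbreedJson is $null here, so report the name from the table row
        Write-Output "Breed not found: $($catbreedTable.Breed)"
    }
}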

For each entry in the array $pwshTable, the loop looks up the API object whose “Breed” property matches the breed in the table row, and writes to the console whether the breed was found.
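One detail from the sample output is worth a closer look: the table cell reads “Abyssinian[7]” (with a Wikipedia footnote marker), while the JSON value is “Abyssinian”. An exact -eq comparison would treat those as different. The sketch below shows one possible way to normalize the table value before comparing. The sample rows and the helper name Remove-FootnoteMarker are made up for this example.

```powershell
# Sample rows standing in for $pwshTable and $responseJson.catbreeds.
$sampleTable = @(
    @{ Breed = 'Abyssinian[7]' },
    @{ Breed = 'Aegean' },
    @{ Breed = 'Made-up breed' }
)
$sampleApi = @(
    [pscustomobject]@{ Breed = 'Abyssinian' },
    [pscustomobject]@{ Breed = 'Aegean' }
)

# Hypothetical helper: drop footnote markers like "[7]" and trim whitespace.
function Remove-FootnoteMarker {
    param([string]$Text)
    return ($Text -replace '\[\d+\]', '').Trim()
}

# Collect the normalized names of table rows that have no match in the API data.
$missing = foreach ($tableRow in $sampleTable) {
    $name = Remove-FootnoteMarker -Text $tableRow.Breed
    if (-not ($sampleApi | Where-Object { $_.Breed -eq $name })) {
        $name
    }
}
$missing
```

With this sample data, only the made-up breed ends up in $missing, because “Abyssinian[7]” is reduced to “Abyssinian” before the lookup.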

That’s it! Now, from the console output, you can easily see whether the static response of the API and the content of the table match.

But wait… is it really enough to just check for a matching breed? Shouldn’t there be a deeper comparison? And can we have one output that indicates a passing or failing test?

Yes… I’ll cover these tasks in another article.

Please leave a comment with your thoughts on the article. I’m curious what you guys think about it and whether it could be useful.

Written by Lukas Haslberger-Troellinger