Decoding HTML From JSON

Austin Mullins
Published in The Startup
2 min read · Aug 12, 2020

Often while web scraping, you will come across HTML entities in the text, such as &amp; or &#39;, that need to be decoded into the characters they represent. While jQuery and various other libraries and languages offer ways to decode these values, native JavaScript has no built-in function for it.

To solve this, I started by parsing out the HTML entities with a lookup table, going word by word. This was a safe method that did not require accessing the DOM of the webpage. While the method worked, it came at the cost of a larger file size, since the lookup table had to be packed into the file. Another drawback was the time it took to parse hundreds of strings of varying complexity.
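For illustration, a lookup-table decoder along these lines might look like the sketch below. It is not the original code: the names `ENTITY_TABLE` and `decodeWithTable` are hypothetical, the table holds only a handful of entities (a full one has over two thousand entries, which is where the file size cost comes from), and it scans with a regular expression rather than word by word.

```javascript
// Hypothetical lookup table: a tiny subset of named entities, for illustration only.
const ENTITY_TABLE = {
  amp: "&",
  lt: "<",
  gt: ">",
  quot: '"',
  apos: "'",
  nbsp: "\u00A0",
};

// Decode named entities found in the table, plus numeric references
// such as &#39; (decimal) and &#x27; (hexadecimal).
// Anything not recognized is left untouched.
function decodeWithTable(text) {
  const pattern = /&(#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);/g;
  return text.replace(pattern, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x";
      const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // String.fromCodePoint throws above 0x10FFFF, so keep such input as-is.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
    const known = Object.prototype.hasOwnProperty.call(ENTITY_TABLE, body);
    return known ? ENTITY_TABLE[body] : match;
  });
}
```

Calling `decodeWithTable("Tom &amp; Jerry")` returns `"Tom & Jerry"`, while an entity missing from the table, such as `&bogus;`, passes through unchanged. No DOM is needed, so the same function also runs outside the browser.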

Looking for other implementations of this idea, I found a solution from Rob W on Stack Overflow.

function decodeHtml(html) {
  var txt = document.createElement("textarea");
  txt.innerHTML = html;
  return txt.value;
}

Unlike some other approaches, the code above doesn’t strip HTML tags from the input, and it is performant. Its biggest drawback is that it works directly against the page’s live DOM. To mitigate this, I recommend the following code, which I adapted from Rob’s solution.

function decodeHtml(html) {
  var htmlDoc = document.implementation.createHTMLDocument("");
  var txt = htmlDoc.createElement("textarea");
  txt.innerHTML = html;
  return txt.value;
}

Because the textarea belongs to a separate document that is never displayed, this prevents scripts from running and separates your application from its parsing logic. It keeps everything great about the first solution and is safe to use in production.
