Manipulating HTML with Regex
Practical Examples
There are numerous use cases for manipulating HTML and XML using regex. Most of the time I use this approach to parse log files and other historical data. There are performance penalties, but sometimes you just need to get stuff done (GSD). Below are some useful expressions you can use day to day for these tasks.
Replace an attribute value
Using Positive Lookbehind
Zero-width assertions are explained here. They solve the problem of finding a string that is prefixed with x and/or suffixed with y, without including x or y in the match. I’ve created an example here which isolates an image path. To replace a value you can use the token $& (the matched substring):
'<ImageAsset imageId="woman-in-white-turtleneck-shirt-3786525_5bac6664-53f9-4c2a-912f-fcd98f5e97ba" printFilename="woman-in-white-turtleneck-shirt-3786525.jpg" webFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" thumbnailFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" pixelWidth="7680" pixelHeight="5120"/>'.replace(/(?<=<ImageAsset[^>]*?\bprintFilename=")[^"]+/, 'images/$&')
Result:
<ImageAsset imageId="woman-in-white-turtleneck-shirt-3786525_5bac6664-53f9-4c2a-912f-fcd98f5e97ba" printFilename="images/woman-in-white-turtleneck-shirt-3786525.jpg" webFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" thumbnailFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" pixelWidth="7680" pixelHeight="5120"/>
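The same idea works for a suffix as well as a prefix. Here is a minimal sketch that combines a lookbehind with a lookahead; the shortened sample tag and the variable names are made up for illustration and are not part of the linked example.

```javascript
// Hypothetical sample input; the attribute names mirror the ImageAsset example above.
const tag = '<ImageAsset printFilename="photo.jpg" webFilename="photo_436.jpg"/>';

// (?<=\bprintFilename=") asserts the prefix and (?=") asserts the suffix.
// Neither assertion consumes characters, so the match is only the attribute value.
const valuePattern = /(?<=\bprintFilename=")[^"]+(?=")/;

const value = tag.match(valuePattern)[0];
console.log(value); // photo.jpg

// $& inserts the matched substring into the replacement string.
const updated = tag.replace(valuePattern, 'images/$&');
console.log(updated);
// <ImageAsset printFilename="images/photo.jpg" webFilename="photo_436.jpg"/>
```

Because the prefix and suffix are only asserted, the replacement string does not need to repeat them.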
Using Capture Groups
This is an alternative when positive lookbehind is not supported. I’ve created an example here which isolates an image path in the second capture group. Replacing the value takes a little extra work, because the replacement has to re-emit the first capture group. One way to do that is a replacer function:
'<ImageAsset imageId="woman-in-white-turtleneck-shirt-3786525_5bac6664-53f9-4c2a-912f-fcd98f5e97ba" printFilename="woman-in-white-turtleneck-shirt-3786525.jpg" webFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" thumbnailFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" pixelWidth="7680" pixelHeight="5120"/>'.replace(/(printFilename=")([^"]+)/, (match, p1, p2, offset, string) => `${p1}/images/assets/${p2}`)
Result:
<ImageAsset imageId="woman-in-white-turtleneck-shirt-3786525_5bac6664-53f9-4c2a-912f-fcd98f5e97ba" printFilename="/images/assets/woman-in-white-turtleneck-shirt-3786525.jpg" webFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" thumbnailFilename="woman-in-white-turtleneck-shirt-3786525_436.jpg" pixelWidth="7680" pixelHeight="5120"/>
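If the replacement is just fixed text around the groups, a replacement string with $1 and $2 should give the same result as the arrow function. A minimal sketch, using a shortened, made-up sample tag:

```javascript
// Hypothetical sample input; the attribute names mirror the ImageAsset example above.
const tag = '<ImageAsset printFilename="photo.jpg" webFilename="photo_436.jpg"/>';

// $1 is the first capture group (the attribute name and opening quote),
// $2 is the second capture group (the original path).
const updated = tag.replace(/(printFilename=")([^"]+)/, '$1/images/assets/$2');

console.log(updated);
// <ImageAsset printFilename="/images/assets/photo.jpg" webFilename="photo_436.jpg"/>
```

The function form is still the better fit when the new value has to be computed, for example when the path needs to be lowercased or looked up.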
Replace a Tag
This is another very common use case. I’ve created an example here that shows how to match tag pairs by name.
'<div><span class="_1L8oL _2EVlr">?</span> <span><strong>Quantifier</strong> — Matches between <span class="Z3H4l">zero</span> and <span class="Z3H4l">one</span> times, as many times as possible, giving back as needed <span class="_2P7Bb">(greedy)</span></div>'.replace(/<(span)\b[^>]*>[\s\S]*?<\/\1>/gmi, '')
Result:
"<div>  and  times, as many times as possible, giving back as needed </div>"
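The key to making this work is the global and case-insensitive flags (g and i) and the backreference (\1), which forces the closing tag to repeat the captured tag name. The multiline flag (m) only changes how ^ and $ behave, so it makes no difference to this pattern. Note that the lazy match stops at the first closing tag, so nested tags of the same name are not handled.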
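To reuse the pattern for other tag names, it can be wrapped in a small function. This is only a sketch: removeTagPairs is a hypothetical helper name, and the sample strings are made up to show both the normal case and the nesting limitation.

```javascript
// Hypothetical helper: removes every <name>...</name> pair from a string.
function removeTagPairs(html, name) {
  // Only accept letters, digits, and hyphens so the name is safe to embed in the pattern.
  if (!/^[a-z][a-z0-9-]*$/i.test(name)) {
    throw new Error(`Unsupported tag name: ${name}`);
  }
  // Same pattern as above, built dynamically; \1 must match the captured tag name.
  const pattern = new RegExp(`<(${name})\\b[^>]*>[\\s\\S]*?<\\/\\1>`, 'gi');
  return html.replace(pattern, '');
}

const flat = removeTagPairs('<p>Keep <span class="x">drop</span> this</p>', 'span');
console.log(flat); // <p>Keep  this</p>

// Nested tags of the same name: the lazy match ends at the first </span>,
// so the outer closing tag is left behind.
const nested = removeTagPairs('<p><span>a<span>b</span>c</span></p>', 'span');
console.log(nested); // <p>c</span></p>
```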
Find Tags That Contain at Least One X
I created an example here that finds all div elements that contain at least one span tag.
'<div><span class="_1L8oL _2EVlr">?</span> <span><strong>Quantifier</strong> — Matches between <span class="Z3H4l">zero</span> and <span class="Z3H4l">one</span> times, as many times as possible, giving back as needed <span class="_2P7Bb">(greedy)</span></div>'.match(/<(?=div)\b[^>]*>[\s\S]*?<(span)\b[^>]*>[\s\S]*?<\/\1>[\s\S]*?(?<=<\/div>)/gmi)
In this case the result is an array whose only element is the supplied string. See the example for a multiline demo that has some more interesting results.
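One of those more interesting results is easy to reproduce locally. In the sketch below (the three-line input is made up, not taken from the linked demo) the lazy wildcard runs past a closing div, so a div without a span gets pulled into the match. The second pattern is my own assumption about how to tighten it, not part of the original expression, and it still does not handle nested divs.

```javascript
// Hypothetical multiline input: only the second div contains a span.
const html = [
  '<div>plain text</div>',
  '<div><span>one</span></div>',
  '<section><span>two</span></section>',
].join('\n');

// The expression from above (the m flag is dropped because it has no effect here).
const loose = /<(?=div)\b[^>]*>[\s\S]*?<(span)\b[^>]*>[\s\S]*?<\/\1>[\s\S]*?(?<=<\/div>)/gi;
const looseMatches = html.match(loose);
console.log(looseMatches);
// [ '<div>plain text</div>\n<div><span>one</span></div>' ]

// Assumed tightening: (?:(?!<\/div>)[\s\S])*? refuses to step over a closing div.
const strict = /<div\b[^>]*>(?:(?!<\/div>)[\s\S])*?<(span)\b[^>]*>[\s\S]*?<\/\1>(?:(?!<\/div>)[\s\S])*?<\/div>/gi;
const strictMatches = html.match(strict);
console.log(strictMatches);
// [ '<div><span>one</span></div>' ]
```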
These are not the most elegant of expressions, but they are useful when you are in a hurry and need immediate feedback. For production systems consider running a headless browser and using DOM manipulation functions. Happy coding!