innerText in Scrapy

Published in Nerd For Tech · 3 min read · Jan 18, 2023

Illustration of what innerText returns from an HTML document

In JavaScript, there is a wonderful property called innerText. It is rendering-aware and returns the text of an element and all of its descendants as you would expect to see it in plain text.

We don't need to worry about the internal structure of an element, for example whether it contains tags for bold text, spans or lists. We simply call innerText and get a good representation of the text.

This can be incredibly useful when scraping description blocks from web pages. Often, these can contain several different HTML elements for styling.

Unfortunately, this is less straightforward in Scrapy.

Text selector in Scrapy

Scrapy extends CSS selectors with a ::text pseudo-element, which selects the text nodes that are direct children of an element, not the text of its descendants. This means that a structure like

<p>This is<b>great</b></p>

selected with response.css("p::text") will only give us "This is" as the result; the text inside the b tag is left out. Not quite what you would expect.

Combine text of descendants

It would be wonderful if Scrapy had a solution built-in for this, as it is a common use case, but we can do it ourselves.

The naive solution is to join all text elements together with some delimiter, like below:

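def innertext_quick(elements, delimiter=""):
    # For each element, join the stripped text nodes of it and its descendants,
    # skipping whitespace-only nodes so the delimiter is not repeated.
    return [delimiter.join(t.strip() for t in el.css("*::text").getall() if t.strip()) for el in elements]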

However, running this shows how hard it is to reproduce text rendering. Test it with the slightly more complicated HTML below and you will find a problem:

<div id="complex-text">
    <p>This div contains <i>complex</i> text</p>
    <ul>
        <li>List item 1</li>
        <li>List item 2</li>
    </ul>
    <blockquote>Including quotes</blockquote>
</div>

Without a delimiter, this renders as one long run-together string:

This div containscomplextextList item 1List item 2Including quotes

Even with a delimiter, say a single space, we still see issues:

This div contains complex text List item 1 List item 2 Including quotes

This code does not produce line breaks where we would expect them, for example after the paragraph, between the list items and around the blockquote.
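To illustrate what handling those line breaks involves, here is a rough sketch that uses only the Python standard library. It is not part of the original helper: the names BlockTextParser and innertext_blocks are made up for this example, and BLOCK_TAGS is a hand-picked, incomplete list of block-level tags. It emits a newline at every block boundary and collapses all other whitespace.

```python
import re
from html.parser import HTMLParser

# Assumption: a hand-picked subset of block-level tags, not a complete list.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "blockquote", "br", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class BlockTextParser(HTMLParser):
    """Collects text and emits a newline at every block-level boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Collapse source whitespace (including newlines) to single spaces.
        # Limitation: script and style contents are not filtered out here.
        self.parts.append(re.sub(r"\s+", " ", data))


def innertext_blocks(html):
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    # Trim every line and drop the empty ones left by adjacent block tags.
    lines = (line.strip() for line in "".join(parser.parts).split("\n"))
    return "\n".join(line for line in lines if line)


sample = """<div id="complex-text">
    <p>This div contains <i>complex</i> text</p>
    <ul>
        <li>List item 1</li>
        <li>List item 2</li>
    </ul>
    <blockquote>Including quotes</blockquote>
</div>"""
print(innertext_blocks(sample))
```

In a spider you could pass it the HTML of a selected element, for example innertext_blocks(response.css("#complex-text").get()). Even this ignores CSS, hidden elements and the extra spacing a browser puts around paragraphs, which is why reaching for an existing library is attractive.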

Using BeautifulSoup

Having worked with Scrapy, you might also be familiar with the HTML parsing library BeautifulSoup, or bs4 as the Python package is called.

BeautifulSoup does a better job here, and we can use its get_text method to extract the text of an element. It also lets us control how text is stripped and whether we want to ignore certain elements, such as tables.

from bs4 import BeautifulSoup

def innertext(selector):
    html = selector.get()
    soup = BeautifulSoup(html, 'html.parser')
    return soup.get_text().strip()

In comparison, the same HTML as above now yields the following result:

This div contains complex text\n\nList item 1\nList item 2\n\nIncluding quotes

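Styling elements such as bold, italic or span tags do not break up the text, while the line breaks between structural elements such as paragraphs, list items and quotes are preserved. Keep in mind that get_text simply concatenates the text nodes, so these line breaks come from the whitespace in the HTML source rather than from any knowledge of the layout.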
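As a sketch of the extra control mentioned above, the variant below removes unwanted elements before reading the text and tidies up the whitespace afterwards. The function name innertext_filtered and the skip_tags parameter are assumptions made for this example, not part of BeautifulSoup or of the original helper, and it relies on the source HTML having newlines between its block elements.

```python
from bs4 import BeautifulSoup


def innertext_filtered(html, skip_tags=("table",)):
    soup = BeautifulSoup(html, "html.parser")
    # Drop unwanted elements (and everything inside them) before reading text.
    for tag in soup.find_all(list(skip_tags)):
        tag.decompose()
    # Trim each line and discard the blank ones left by indentation.
    lines = (line.strip() for line in soup.get_text().splitlines())
    return "\n".join(line for line in lines if line)


html = """
<div>
    <p>Price list for <b>today</b></p>
    <table><tr><td>Apples</td><td>3</td></tr></table>
    <blockquote>Prices may change</blockquote>
</div>
"""
print(innertext_filtered(html))
```

With a Scrapy selector, the input would be the element's HTML, for example innertext_filtered(selector.get()).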

True innerText

There is another alternative as well: you can use Playwright (or Puppeteer), a browser automation tool that drives a headless browser, together with Scrapy to get the contents.

This way, you will get access to the HTML DOM just like the browser has, and you can call innerText to get the text as the browser would.

In most cases, this is overkill and using Playwright is considerably slower.

If you are parsing a JavaScript-heavy site or a single-page application (SPA), you will likely need this browser-powered rendering anyway, so this might be an alternative to the above.
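For a feel of what this looks like, here is a minimal sketch that uses Playwright's synchronous Python API on its own, without the Scrapy integration. The function name browser_inner_text is made up for this example, and it assumes the playwright package and a Chromium build are installed (pip install playwright, then playwright install chromium).

```python
def browser_inner_text(html, css):
    """Render an HTML string in headless Chromium and return innerText."""
    # Imported lazily so this module still loads where Playwright is missing.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Load the markup directly instead of navigating to a URL.
            page.set_content(html)
            # inner_text returns the element's innerText as the browser sees it.
            return page.inner_text(css)
        finally:
            browser.close()


if __name__ == "__main__":
    print(browser_inner_text("<p>This is<b>great</b></p>", "p"))
```

Starting a browser for every element is slow, so in practice you would keep one page open and read several elements from it.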

Full code for reference

You can use the following repository to play around with the parsing in an isolated environment.

You can edit test.html to add the HTML you need to parse, and adjust the methods in innertext.py to suit your needs.

Run text.py to have the scraper process the file and print the output examples.

GitHub - ddikman/scrapy-innertext: Helper to get innertext of any element

Repository to show how to get the innertext of elements. The code for getting the innertext is in crawler/innertext.py…

github.com

Happy scraping!