innerText in Scrapy
In the browser, we don’t need to worry about the internal structure of an element, for example whether it contains tags for bold text, spans or lists. We simply call innerText and get a good representation of the text.
This can be incredibly useful when scraping description blocks from web pages. Often, these can contain several different HTML elements for styling.
Unfortunately, this is less straightforward in Scrapy.
Text selector in Scrapy
Scrapy provides a CSS selector extension called ::text, which returns the text nodes that sit directly inside the selected element, but not the text of its child elements. This means that for a paragraph containing nested tags, response.css("p::text") will only give us "This is" as the result. Not quite what you would expect.
Combine text of descendants
It would be wonderful if Scrapy had a solution built-in for this, as it is a common use case, but we can do it ourselves.
The naive solution is to join all text elements together with some delimiter, like below:
def innertext_quick(elements, delimiter=""):
    # For each element, join the stripped text of the element and all its descendants.
    return [delimiter.join(el.strip() for el in element.css('*::text').getall()) for element in elements]
However, running this, you will notice how difficult implementing the text rendering is. Test it with the slightly more complicated HTML below and you will find a problem:
<div>
  <p>This div contains <i>complex</i> text</p>
  <ul>
    <li>List item 1</li>
    <li>List item 2</li>
  </ul>
  <blockquote>Including quotes</blockquote>
</div>
This will render as a long string unless we give it delimiters:
This div containscomplextextList item 1List item 2Including quotes
Even with delimiters, say a single space, we will see issues.
This div contains complex text List item 1 List item 2 Including quotes
This code does not handle line breaks where we would expect them, for example, in the paragraph, list and blockquote.
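To get a feel for what handling those line breaks involves, here is a rough sketch that starts a new line at block-level tags. Note that it swaps Scrapy selectors for Python's standard html.parser module, and that BLOCK_TAGS, BlockAwareText and innertext_blocks are hypothetical names. The list of block tags is a hand-picked assumption, and unlike a browser it does not add blank lines between paragraphs.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked set of tags treated as block-level. A browser
# decides this from CSS display rules instead.
BLOCK_TAGS = {"p", "div", "ul", "ol", "li", "blockquote", "br"}


class BlockAwareText(HTMLParser):
    """Collect text, starting a new line at block-level tag boundaries."""

    def __init__(self):
        super().__init__()
        self.lines = [""]

    def _line_break(self):
        # Avoid stacking empty lines when several block tags open or close in a row.
        if self.lines[-1].strip():
            self.lines.append("")

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._line_break()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._line_break()

    def handle_data(self, data):
        self.lines[-1] += data


def innertext_blocks(html):
    parser = BlockAwareText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each line and drop lines that ended up empty.
    lines = (" ".join(line.split()) for line in parser.lines)
    return "\n".join(line for line in lines if line)


html = """
<div>
  <p>This div contains <i>complex</i> text</p>
  <ul>
    <li>List item 1</li>
    <li>List item 2</li>
  </ul>
  <blockquote>Including quotes</blockquote>
</div>
"""
rendered = innertext_blocks(html)
print(rendered)
```

Even this small version needs decisions about whitespace collapsing and which tags count as blocks, which is why reaching for an existing parser is attractive.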
If you have worked with Scrapy, you might also be familiar with the HTML parsing library BeautifulSoup, or bs4 as the Python import is called.
BeautifulSoup does a better job of parsing HTML, and we can use its get_text method to get the text of an element. It also lets us control how the text is stripped, and we can remove elements we want to ignore, such as tables, before extracting the text.
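from bs4 import BeautifulSoup
# selector is the Scrapy selector for the element we want the text of
html = selector.get()
soup = BeautifulSoup(html, 'html.parser')
# get_text accepts separator and strip arguments to tune the output
text = soup.get_text()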
In comparison, the same HTML as above now yields the following result:
This div contains complex text\n\nList item 1\nList item 2\n\nIncluding quotes
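It correctly ignores styling elements such as bold, italic or span tags, but preserves the line breaks around structural elements such as paragraphs, list items and quotes. Keep in mind that get_text concatenates the text nodes as they appear in the source, so these line breaks come from the whitespace between the tags rather than from any layout logic.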
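The options mentioned above can be tried out in isolation. The following is a sketch, assuming test markup with one tag per line similar to the example in this article. The second snippet with a table is made up to show one possible way of ignoring elements, namely removing them with decompose before calling get_text.

```python
from bs4 import BeautifulSoup

# Assumed test markup, one tag per line.
html = """<div>
<p>This div contains <i>complex</i> text</p>
<ul>
<li>List item 1</li>
<li>List item 2</li>
</ul>
<blockquote>Including quotes</blockquote>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# Default: text nodes are concatenated as-is, including newlines between tags.
raw = soup.get_text().strip()

# separator and strip: every text node is stripped, then joined with a space.
spaced = soup.get_text(separator=" ", strip=True)

# Hypothetical example of ignoring an element: remove tables before extracting.
with_table = BeautifulSoup(
    "<div><p>Keep me</p><table><tr><td>Skip me</td></tr></table></div>",
    "html.parser",
)
for table in with_table.find_all("table"):
    table.decompose()
without_table = with_table.get_text()

print(repr(raw))
print(repr(spaced))
print(repr(without_table))
```

If the source HTML is minified onto a single line, the default call has no newlines to preserve, so passing a separator is the more predictable choice in that case.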
There is another alternative as well: you can use Playwright (or Puppeteer), which drives a headless browser, together with Scrapy to get the contents.
This way, you get access to the HTML DOM just like the browser has, and you can call innerText to get the text as the browser would.
In most cases, this is overkill and using Playwright is considerably slower.
Full code for reference
You can use the following repository to play around with the parsing in an isolated environment.
You can edit test.html to add the HTML that needs parsing, and adjust the methods in innertext.py to suit your needs. Run text.py to have the scraper process the file, and you will see the output examples.
GitHub - ddikman/scrapy-innertext: Helper to get innertext of any element
Repository to show how to get the innertext of elements. The code for getting the innertext is in crawler/innertext.py…