Data Harvesting is solving these two problems
In our previous article, “Three kinds of analytical modes to extraction data from websites”, we were trying to solve two problems.
What an element is talking about
When you open a page, how do you know what information the page is trying to convey? How do you know that one element is the title of the article and that another refers to the author?
We might look at the position of an element, its font size and color, and the text in front of it. We can also judge whether the content of the element looks like a person’s name or a time, what the surrounding context is talking about, what field the article belongs to, and so on. People bring a lot of experience and knowledge to interpreting the information on a web page.
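As a rough illustration of how a few of these cues could be written down for a machine, here is a minimal sketch. The patterns, the font-size thresholds, and the function name `guess_role` are all assumptions made for this example; real pages would need far more robust checks.

```python
import re

# Hypothetical patterns: a date such as "2016-05-12 09:30" and a short,
# capitalized name such as "By Jane Doe". Both are deliberately simplistic.
TIME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}( \d{2}:\d{2})?$")
NAME_PATTERN = re.compile(r"^(By\s+)?[A-Z][a-z]+(\s[A-Z][a-z]+){1,2}$")


def guess_role(text, font_size_px):
    """Guess what an element is about from the shape of its text and its font size."""
    text = text.strip()
    if TIME_PATTERN.match(text):
        return "time"
    # A name-shaped string in small type is more likely a byline than a headline.
    if NAME_PATTERN.match(text) and font_size_px < 20:
        return "author"
    # Assumed threshold: large type is treated as a title.
    if font_size_px >= 24:
        return "title"
    return "unknown"


print(guess_role("By Jane Doe", 12))
print(guess_role("Data Harvesting Explained", 28))
```

Note that the second call only comes out as a title because of the font size: the text alone also looks like a three-word name, which is exactly the kind of ambiguity described in this section.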
In the “XPath / RegEx” analytical mode, the data extraction process is completed manually: we interpret the web page ourselves and find the position of the element that contains the information. In the “data highlighter” analytical mode, we need to mark up multiple pages to tell the machine where each attribute is.
However, the style and structure of a page are hard to understand for those who don’t have adequate knowledge. For instance, our grandparents may not know what the page on the screen is saying. Moreover, just as language is ambiguous, page structure is sometimes ambiguous as well, which makes it very difficult for a computer to understand what the page says.
What is the difference between this element and other elements?
Because it is the computer that extracts large quantities of data, we have to tell it accurately which elements we want to extract. In the “XPath / RegEx” analytical mode, XPath and regular expressions are the descriptions of that information. Selecting a correct expression that covers different pages and distinguishes the target from other attributes is not an easy job; it takes experience and skill.
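To make this concrete, here is a minimal sketch using Python's standard library. The sample page, its class names, and the helper names `count_matches` and `extract` are invented for this example, and the page is deliberately well-formed so that `xml.etree.ElementTree` can parse it (real-world HTML usually needs a more tolerant parser).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample page, kept well-formed so the XML parser accepts it.
PAGE = """
<html>
  <body>
    <div class="post">
      <h1 class="title">Data Harvesting Basics</h1>
      <span class="author">By Jane Doe</span>
      <span class="date">Posted on 2016-05-12</span>
    </div>
  </body>
</html>
"""


def count_matches(page, path):
    """Count how many elements a path expression selects on the page."""
    return len(ET.fromstring(page).findall(path))


def extract(page):
    """Locate each element with a path, then trim its text with a regular expression."""
    root = ET.fromstring(page)
    title = root.find(".//h1[@class='title']").text
    author_raw = root.find(".//span[@class='author']").text
    date_raw = root.find(".//span[@class='date']").text
    return {
        "title": title,
        "author": re.sub(r"^By\s+", "", author_raw),
        "date": re.search(r"\d{4}-\d{2}-\d{2}", date_raw).group(0),
    }


# ".//span" alone is too broad: it selects both the author and the date.
print(count_matches(PAGE, ".//span"))
print(extract(PAGE))
```

The bare `.//span` expression selects two elements, so it cannot tell the author from the date. Adding the class predicate is what differentiates one attribute from the other, and choosing such a predicate so that it still holds on other pages is the part that takes experience.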
In the “data highlighter” analytical mode, the software selects a correct expression automatically. In the process of “wrapper generation and data extraction”, the rules are likewise analyzed by the computer.
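The idea of software choosing the expression from marked-up pages can be sketched as follows. This is a toy illustration only, not how any particular data highlighter or wrapper generator works: it assumes well-formed pages, identifies the marked element by its exact text, and accepts a path only if every marked page yields the same one. The names `make_page`, `path_to_text`, and `induce_rule` are hypothetical.

```python
import xml.etree.ElementTree as ET


def make_page(title, author):
    """Build a hypothetical sample page with a fixed layout."""
    return ("<html><body><div class='post'>"
            "<h1 class='title'>%s</h1><span class='author'>%s</span>"
            "</div></body></html>" % (title, author))


def path_to_text(node, target, path="."):
    """Return a tag/class path from the root to the first element whose text equals target."""
    if (node.text or "").strip() == target:
        return path
    for child in node:
        step = child.tag
        cls = child.get("class")
        if cls:
            step += "[@class='%s']" % cls
        found = path_to_text(child, target, path + "/" + step)
        if found:
            return found
    return None


def induce_rule(marked_pages):
    """marked_pages is a list of (page, marked_text); return the path all pages agree on, else None."""
    paths = {path_to_text(ET.fromstring(page), text) for page, text in marked_pages}
    if len(paths) == 1 and None not in paths:
        return paths.pop()
    return None


# The user marks the author on two pages; the induced path is reused on a third.
marked = [
    (make_page("First post", "By Jane Doe"), "By Jane Doe"),
    (make_page("Second post", "By Sam Lee"), "By Sam Lee"),
]
rule = induce_rule(marked)
print(rule)
print(ET.fromstring(make_page("Third post", "By Ann Ray")).find(rule).text)
```

Requiring agreement across several marked pages is why this mode asks for more than one example: a path that happens to work on a single page may not generalize.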
After solving these two problems, we come to structural parsing.
Structural parsing is essentially the computer’s interpretation of a web page, whether that interpretation is based on rules that we create manually and hand to the computer or on machine learning.
We can imagine that when someone opens a page, he or she knows what the page says and what information it contains. This is the ability, and the method, by which people acquire knowledge from web pages. Likewise, structural extraction is the process by which a computer acquires knowledge from web pages. In Octoparse, we use a “rule” to tell the computer what data we want to fetch from the web page.
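One way to picture a rule is as plain data that maps each field we want to an expression for finding it. The sketch below is a generic illustration under that assumption; the rule layout, the field names, and the `apply_rule` helper are invented here and are not Octoparse's actual rule format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule format: field name -> (path expression, optional regex with one group).
RULE = {
    "title": (".//h1[@class='title']", None),
    "author": (".//span[@class='author']", r"By\s+(.+)"),
}


def apply_rule(page, rule):
    """Apply every field of the rule to one well-formed page and return a record."""
    root = ET.fromstring(page)
    record = {}
    for field, (path, pattern) in rule.items():
        node = root.find(path)
        value = node.text.strip() if node is not None and node.text else None
        if value is not None and pattern:
            match = re.search(pattern, value)
            value = match.group(1) if match else None
        record[field] = value
    return record


# Hypothetical sample page.
page = ("<html><body><h1 class='title'>Hello</h1>"
        "<span class='author'>By Jane Doe</span></body></html>")
print(apply_rule(page, RULE))
```

Keeping the rule separate from the code that applies it means the same rule can be run over many pages, and a field that is missing on a page simply comes back as `None`.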