The Web Mining Story and What it Needs

Kamna Sinha
Published in Sensewithai
5 min read · Jan 23, 2021

With so much work going on in web mining, rising demand from businesses, and ever-increasing content on the web, it is becoming more and more challenging to gather relevant content in a time-efficient manner.
Let's understand how.

Understanding the basics first

Website vs Webpage vs URL

The terms website and webpage are often used interchangeably, but they differ in important ways.

‘A webpage is contained inside a website’

A website is a collection of several webpages linked together using hyperlinks. All the webpages are linked under a single domain to uniquely identify the website.

For example:

Website : https://www.amazon.com/

Webpage : https://www.amazon.com/Tracfone-Apple-iPhone-Prepaid-Smartphone/dp/B08CL4CCG2/ref=sr_1_1?dchild=1&keywords=iphone&qid=1611366593&sr=8-1

This is a particular product page inside the Amazon website.

URL: Uniform Resource Locator

Every webpage has a unique URL, the address used to access that particular page.

In the above example, the URL of the webpage is

https://www.amazon.com/Tracfone-Apple-iPhone-Prepaid-Smartphone/dp/B08CL4CCG2/ref=sr_1_1?dchild=1&keywords=iphone&qid=1611366593&sr=8-1

Webpage — It is a single document or file that is displayed by the web browser using a specific URL address.

Website — It is a collection of one or many web pages. Web browsers are used to access such web pages using specific URL addresses attached to the website.

Domain Name:

The URL is a string providing the complete address of a webpage on the internet, and it identifies that particular webpage. The domain name is the part of the URL that serves as a user-friendly name for the server's IP address.

‘The domain name is contained inside a given URL’

A webpage is a single document on the Internet under a unique URL. In contrast, a website is a collection of multiple webpages in which information on a related topic or other subject is linked together under a domain address.

E.g., www.amazon.com is a domain name.

The sign-in page of Amazon, with the URL

https://www.amazon.com/ap/signin?openid.pape.max_auth_age=0&openid.return_to=https%3A%2F%2Fwww.amazon.com%2F%3Fref_%3Dnav_signin&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&

has the same domain name ‘www.amazon.com’ as part of the URL.
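As a small illustration of how a domain name sits inside a URL, here is a sketch using Python's standard `urllib.parse` module on the product URL from the example above. The helper name `split_url` is only an illustrative choice, not something from a particular tool.

```python
from urllib.parse import urlsplit, parse_qs


def split_url(url):
    """Break a URL into the pieces discussed above (illustrative helper)."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,          # e.g. "https"
        "domain": parts.hostname,        # the domain name inside the URL
        "path": parts.path,              # which page on the website
        "query": parse_qs(parts.query),  # extra parameters after the "?"
    }


product_url = (
    "https://www.amazon.com/Tracfone-Apple-iPhone-Prepaid-Smartphone/"
    "dp/B08CL4CCG2/ref=sr_1_1?dchild=1&keywords=iphone&qid=1611366593&sr=8-1"
)
info = split_url(product_url)
print(info["domain"])  # the same domain is shared by every page of the website
```

Any other page of the same website, such as the sign-in page, would give the same value for `domain` while `path` and `query` change.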

Web scraping vs Web crawling vs Web mining

Web scraping is the automated extraction of information or content from websites using bots.

It also involves formatting this data into a more convenient format, such as an Excel sheet.

It involves locating the data and then extracting it. Rather than copying and pasting by hand, a scraper fetches the data directly, in a precise and accurate manner.

The concept of scraping is not limited to the web but instead means scraping any specific kind of information from any given set of documents through automated processes.

For example, price intelligence analysis requires extracting the prices of specific products from Amazon or any other e-commerce site.

Web scraping is essentially targeted at specific websites for specific data, e.g. for stock market data, business leads, supplier product scraping.
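The locate, extract, and format steps can be sketched with nothing but Python's standard library. Everything specific here is an assumption made for illustration: the HTML snippet, the `title` and `price` class names, and the helper names are invented, and real product pages use different, site-specific markup.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up markup standing in for a downloaded product listing page.
SAMPLE_HTML = """
<div class="product"><span class="title">Phone A</span><span class="price">$199.99</span></div>
<div class="product"><span class="title">Phone B</span><span class="price">$249.50</span></div>
"""


class PriceScraper(HTMLParser):
    """Collects title/price pairs from <span class="title"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self._field = None   # field the parser is currently inside, if any
        self._current = {}   # partially filled record
        self.rows = []       # completed records

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span" and self._field:
            self._field = None
            if "title" in self._current and "price" in self._current:
                self.rows.append(self._current)
                self._current = {}


def scrape_prices(html):
    """Locate and extract (title, price) pairs from the HTML."""
    parser = PriceScraper()
    parser.feed(html)
    return [(r["title"], float(r["price"].lstrip("$"))) for r in parser.rows]


def to_csv(rows):
    """Format the extracted rows as CSV text, which a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["title", "price"])
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(scrape_prices(SAMPLE_HTML)))
```

In practice the HTML would come from an HTTP request to the target site, and many scrapers use a dedicated parsing library instead of `html.parser`; the overall shape stays the same.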

Hops: a hop in crawling terminology means moving from one page to another by following a hyperlink on the current page, starting from the initial landing page.

Web Crawling

Web crawling is the process of moving from the current web page to another by following the links on the current page. If the crawler also extracts content or data from a page it visits, web scraping is happening along with web crawling. Crawling starts from one or more seed URLs given to the crawler. The crawler is also told when to stop, for example once the count of crawled pages reaches some limit, or once it has finished a fixed number of hops from the seed URLs.

‘Web crawling may or may not involve web scraping’

For example, Web crawling would be generally what Google, Yahoo, Bing etc. do, searching for any kind of information.
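To make seed URLs, hops, and stopping conditions concrete, here is a minimal breadth-first crawler sketch. To keep it runnable without network access, it assumes a tiny invented link graph (`LINKS`) in place of the real web; the names `crawl`, `max_pages`, and `max_hops` are illustrative choices, not a specific tool's API.

```python
from collections import deque

# Invented link graph standing in for the web: page -> hyperlinks on that page.
LINKS = {
    "seed": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": [],
}


def crawl(seeds, get_links, max_pages=100, max_hops=2):
    """Breadth-first crawl. Returns {page: hops from the nearest seed}."""
    visited = {}
    queue = deque((url, 0) for url in seeds)
    # Stop when there is nothing left to visit or the page limit is reached.
    while queue and len(visited) < max_pages:
        url, hops = queue.popleft()
        if url in visited:
            continue
        visited[url] = hops
        # Only follow links if doing so stays within the hop limit.
        if hops < max_hops:
            for link in get_links(url):
                if link not in visited:
                    queue.append((link, hops + 1))
    return visited


result = crawl(["seed"], lambda url: LINKS.get(url, []), max_hops=2)
print(result)
```

With `max_hops=2`, page `e` is never visited because it sits three hops from the seed. A real crawler would replace `get_links` with a function that downloads the page and extracts its hyperlinks.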

Web Mining:

Data mining vs Web mining

‘When data mining is done over data extracted from the web, it is termed web mining’

In a way, web mining first uses web crawling tools to reach the web pages, then web scraping to extract and collect data from the target pages, and finally data mining techniques to analyze the collected data sets and discover new information, hidden patterns and behaviors.

Unlike search engines, which send agents to crawl the web searching for keywords, web mining agents are far more intelligent.

Focused Crawling:

A focused crawler is a web crawler that collects Web pages that satisfy some specific property.

It relies on decision logic that is configurable for the specific business use case and the tool chosen for the purpose.

E.g. ‘crawl pages about covid’

The intelligence inside a focused crawler is capable of predicting if it needs to go further with crawling subsequent URLs on a specific page or not, based on many factors like page content, anchor links, URL structure etc.
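A rough sketch of that decision step follows, using the ‘crawl pages about covid’ example. It assumes a deliberately simple relevance test (keyword match on page text) and a small invented set of pages; real focused crawlers typically use richer signals such as anchor text, URL structure, or a trained classifier.

```python
from collections import deque

# Invented pages: url -> (page text, outgoing links).
PAGES = {
    "news": ("covid vaccine rollout update", ["health", "sports"]),
    "health": ("covid symptoms and testing", ["clinic"]),
    "sports": ("football scores", ["transfer"]),
    "clinic": ("covid testing centre hours", []),
    "transfer": ("covid delays football transfer", []),
}


def is_relevant(text, keywords):
    """Toy relevance test: does the page text mention any keyword?"""
    words = set(text.lower().split())
    return any(k in words for k in keywords)


def focused_crawl(seeds, keywords, max_hops=3):
    """Collect relevant pages; do not follow links found on irrelevant pages."""
    collected = []
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    while queue:
        url, hops = queue.popleft()
        text, links = PAGES.get(url, ("", []))
        if not is_relevant(text, keywords):
            continue  # prune: predicted not worth crawling further
        collected.append(url)
        if hops < max_hops:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, hops + 1))
    return collected


print(focused_crawl(["news"], {"covid"}))
```

Note the trade-off this simple rule creates: `transfer` mentions covid but is reachable only through the pruned `sports` page, so it is never collected.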

The number of hops defines the depth to which a crawler, or spider, navigates from a given input page.

Focused crawlers are a step towards specialized crawling that meets a specific need, such as collecting content on a specific topic or domain.

The drawback with focused crawlers is that they will not go beyond the scope of pages provided by the initial input list of URLs by the user.

While they do come up with more URLs through the crawling process, those newly found URLs are all links embedded in pages reached from the input URL list.

This means the input list effectively controls the breadth of relevant data collected: only pages linked, at some level of depth, from the input pages are ever exposed.

There is no way a focused crawler can possibly discover any web content which is entirely new and relevant and not linked to any of the input URLs at any level of depth.
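This limitation is a reachability bound, which a few lines can illustrate. The sketch below assumes an invented link graph in which the page `island` is on topic but not linked from anything reachable from the seed; all names are hypothetical.

```python
# Invented link graph: page -> hyperlinks on that page.
WEB = {
    "seed1": ["a"],
    "a": ["b"],
    "b": [],
    "island": ["island2"],   # relevant, but nothing reachable links to it
    "island2": [],
}
RELEVANT = {"seed1", "a", "island"}  # pages we would ideally collect


def reachable(seeds, links):
    """All pages a crawler could ever visit from the seeds, at any depth."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        for nxt in links.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


found = reachable(["seed1"], WEB)
missed = RELEVANT - found
print(sorted(missed))  # relevant pages no crawl from these seeds can find
```

No choice of depth or relevance logic changes `missed`; only a different input list of URLs does.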

What do we need now?

While focused crawlers have helped to a great extent with topic-specific web mining, there is still a huge gap between the results they give and what is actually needed to do a fairly good job of web mining for industry needs.

This is the reason why the entire process is still highly dependent on human expertise and hence also in a way limited by it.

Watch this space for more!
