Domain-Specific Search and Data Aggregation for a Customer-Specific Use Case Using PySpark, ML and Knowledge Graphs

Kamna Sinha
Sensewithai
Dec 5, 2023

This is Part 1 of a series; Parts 2 and 3 follow to give a complete picture of how we went about solving this customer's requirements at Sensewithai.

This was one of our most interesting projects at Sensewithai. Our customer, a B2C startup, approached us to provide web-scraped data, organized according to their specific needs and hierarchy, with periodic updates of the latest information gathered and curated for publication on their website and app.

We have split this article into the following sections for better understanding and clarity:

  1. The Idea
  2. Gathering Data [Raw Data Sources]
  3. Organizing Data
  4. Connecting Data
  5. Converting Data into Information
  6. Update Information
  7. Data Correction by Combining multiple Sources
  8. Platform Architecture [Part 2]
  9. Data Quality Validation at Various Stages [Part 3]

I. The Idea

Our customer was a B2C data aggregation platform for information around pregnancy and childbirth in the Indian context. They hosted a website for their users and subscribers that presented curated data from the web: news, articles, blog posts, products, product ratings, reviews and product recommendations in this domain.

The idea was to offer a one-stop destination for new and expecting parents: good-quality search results from the web, the ability to compare products and their prices across various online stores, and aggregated product reviews in one place, along with links to purchase pages.

For this they needed a number of automated, intelligent web crawling and scraping capabilities, plus a way to organize the data so it could be updated periodically with the latest web content: news, articles, product prices, discounts and so on.

Imagine this as a superset of an e-commerce aggregator, a news aggregator and a review aggregator, all in one place for a very specific domain and set of end users.

We used the following as our example prototypes:

pic: https://geomotiv.com/blog/how-to-develop-ecommerce-aggregators/
pic: https://themeisle.com/blog/news-aggregator-websites-examples/#gref
pic: https://themeisle.com/blog/news-aggregator-websites-examples/#gref

II. Gathering Data

Next, we needed to understand how we would go about getting the required data for the above-mentioned use case.

From our discussions with the client, the obvious source was the web, but the data had to be intelligently searched, collected, curated, organized and connected before it could be presented as required by the aggregator website.

Search:

Automated, domain-specific web search was not a new problem. But we did need efficient, low-cost ways to do it at large scale and at frequent intervals for this particular use case.

We explored various solutions, tried out many approaches proposed in research papers, and after trial and error by our team of data scientists and data engineers, built our own 'Eye-Catcher' engine using PySpark to meet the requirements.

This solution not only gave us a way to automatically understand the domain, given a few search keywords and a few seed URLs to start with [called Context Building]; it also produced a ranked list of the most useful web links according to their content, using reinforcement learning, along with a list of newly discovered search keywords.

This process was iterative, and the run frequency could be set according to how often we wanted the data to be refreshed and how much cost the client was ready to bear for it.

An input list for our engine, used as an example

In addition to "indian pregnancy and childbirth", the platform automatically discovers the following new keywords:

[‘care’, ‘community’, ‘delivery’, ‘diabetes’, ‘experience’, ‘help’, ‘information’, ‘mother’, ‘maternal’, ‘practice’, ‘program’, ‘risk’, ‘product’]

It then expands its search using these keywords to perform a more targeted and refined crawl and discover new, relevant URLs. At some stage the system's iterative learning stops and produces the final result.

Note: The curated list of relevant URLs is ranked on the basis of semantic similarity to the content the user is seeking. This ranking is independent of the ranking on search-engine result pages; the platform discovers content as a function of relevance, not popularity.

Ranked list of newly found URLs from the given seed URLs and keywords
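To make the ranking idea concrete, here is a minimal PySpark sketch that scores newly crawled pages by the semantic similarity of their content to the seed context, independent of search-engine popularity. The sample URLs, the TF-IDF features and the cosine scoring are illustrative assumptions; the production Eye-Catcher engine layers reinforcement learning and keyword discovery on top of this.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName("eye-catcher-ranking-sketch").getOrCreate()

# Hypothetical crawled pages: (url, extracted page text)
crawled_pages = spark.createDataFrame(
    [("https://example.org/a", "indian pregnancy maternal care community program delivery"),
     ("https://example.org/b", "unrelated sports news and match results")],
    ["url", "text"],
)

# Seed context built during Context Building from the customer's keywords
seed = spark.createDataFrame(
    [("seed", "indian pregnancy and childbirth maternal care delivery")],
    ["url", "text"],
)

# Standard TF-IDF featurisation over the crawled pages plus the seed context
corpus = crawled_pages.unionByName(seed)
tokens = Tokenizer(inputCol="text", outputCol="tokens").transform(corpus)
tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18).transform(tokens)
featurised = IDF(inputCol="tf", outputCol="features").fit(tf).transform(tf)

# Score every crawled page by cosine similarity against the seed vector
seed_vec = featurised.filter("url = 'seed'").first()["features"]

@udf(DoubleType())
def relevance(v):
    denom = float(v.norm(2) * seed_vec.norm(2)) or 1.0
    return float(v.dot(seed_vec)) / denom

ranked = (featurised.filter("url != 'seed'")
          .withColumn("relevance", relevance("features"))
          .orderBy("relevance", ascending=False)
          .select("url", "relevance"))
ranked.show(truncate=False)
```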

Collect Data from URLs [Scrape]:

We then used the Zyte APIs to collect data from the filtered list of URLs output by the previous step.

The Eye-Catcher engine working with the Zyte API to get the best web search results for the given domain

Page Classification:

Once we have the content from the pages, we need to classify each one as either a product page or an article page. For this we used a custom classifier, which produced output in the form <url, html, page_type>. We then used the Zyte extraction API for articles and products respectively to get the following output:

Output of Zyte extraction API for an Article web page
Output of Zyte extraction API for a Product web page
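To illustrate this dispatch step, here is a minimal sketch that classifies a fetched page and then requests the matching Zyte extraction (article or product). The `looks_like_product` heuristic and the environment-variable API key are assumptions for illustration; our production classifier is a trained model, not a keyword rule, and the request shape assumes the standard Zyte API extract endpoint.

```python
import os
import requests

ZYTE_ENDPOINT = "https://api.zyte.com/v1/extract"
ZYTE_API_KEY = os.environ["ZYTE_API_KEY"]  # assumed to be set in the environment

def classify_page(url: str, html: str) -> str:
    """Placeholder for our custom classifier: returns 'product' or 'article'.
    The real pipeline uses a trained model on the page content, not this rule."""
    looks_like_product = any(tok in html.lower() for tok in ("add to cart", "price", "sku"))
    return "product" if looks_like_product else "article"

def extract(url: str, html: str) -> dict:
    page_type = classify_page(url, html)        # -> <url, html, page_type>
    payload = {"url": url, page_type: True}     # ask Zyte for article or product fields
    resp = requests.post(ZYTE_ENDPOINT, auth=(ZYTE_API_KEY, ""), json=payload, timeout=60)
    resp.raise_for_status()
    return {"url": url, "page_type": page_type, "data": resp.json().get(page_type, {})}
```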

III. Organizing Data

Once the domain-related data is collected from the web, there still remains redundancy and a lack of connectedness in the pool of information, which limits its usability and effectiveness.

Optionally, this data can also be combined with data the customer already holds in a local database or in the cloud, so that all available data is put to maximum use.

To bring structure to the data, article text goes through an NLP model that performs NER, creates triples and prepares the text for integration into the knowledge graph. The text content from all the collected articles can then be connected to form a combined knowledge base for the domain under consideration.

Well-connected data becomes a useful source of information, and a well-understood set of facts becomes useful knowledge that can be applied to various kinds of use cases.

The Sensewithai pipeline to extract, organize and store data for its customer
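As a sketch of the NER-and-triples step described above, the snippet below uses spaCy with a naive subject–verb–object heuristic; the actual NLP model in our pipeline is more involved, so treat this as illustrative only.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def extract_entities_and_triples(text: str):
    """Return named entities plus very naive subject-verb-object triples."""
    doc = nlp(text)
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    triples = []
    for sent in doc.sents:
        subj = next((t for t in sent if t.dep_ in ("nsubj", "nsubjpass")), None)
        obj = next((t for t in sent if t.dep_ in ("dobj", "attr", "pobj")), None)
        if subj is not None and obj is not None:
            triples.append((subj.text, sent.root.lemma_, obj.text))
    return entities, triples

ents, triples = extract_entities_and_triples(
    "Mamaearth Complete Care Kit is specially formulated to provide your baby "
    "with proper care from top to toe."
)
print(ents)     # entities depend on the model used; illustrative only
print(triples)  # naive triples ready to be shaped into Subject-Predicate-Object
```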

IV. Connecting Data

Once we have well-defined entities in our database, we need to recognize that, because they have been collected from various online sources, there will be duplicates as well as relationships among them. To enable features like search, recommendation and completeness of the information presented on the planned aggregator website, we need to exploit this connectedness to the maximum. This is a continuous, iterative process that runs as more data, entities and relationships are added to our database.

So, as the next step, we create a Knowledge Graph from the entities collected in the previous step and persist them in the graph database. To do this we organize entities as 'Subject', 'Object' and 'Relation' [or Predicate], and then create the links that form the Knowledge Graph.

The entire process can be presented as follows:

Knowledge Graph Creation
The pipeline applies NLP to the text data, giving this output
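A minimal sketch of persisting such Subject–Predicate–Object triples, assuming a Neo4j graph database and the official neo4j Python driver; the connection details, node label and relationship modelling are illustrative assumptions, not the exact production schema.

```python
from neo4j import GraphDatabase

# Connection details are assumptions for illustration only.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def persist_triple(tx, subject: str, predicate: str, obj: str, source_url: str):
    # MERGE keeps entities unique by name; the relation carries its provenance URL.
    tx.run(
        """
        MERGE (s:Entity {name: $subject})
        MERGE (o:Entity {name: $object})
        MERGE (s)-[r:RELATION {type: $predicate}]->(o)
        SET r.source = $source_url
        """,
        subject=subject, predicate=predicate, object=obj, source_url=source_url,
    )

with driver.session() as session:
    session.execute_write(
        persist_triple,
        "Mamaearth Complete Care Kit", "provides", "baby care",
        "https://reviews.momjunction.com/mamaearth-complete-baby-care-kit/",
    )
driver.close()
```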

Common entities that appear in more than one web location are deduplicated and joined, creating connections that enrich the domain knowledge base.

In the example above, the crawled article page contains the sentence:

Mamaearth Complete Care Kit is specially formulated to provide your baby with proper care from top to toe.

(Source: https://reviews.momjunction.com/mamaearth-complete-baby-care-kit/)

And one of the crawled product pages contains the details of the same product mentioned in the article:

https://mamaearth.in/product/welcome-baby-essential-kit/?utm_source=google&utm_medium=cpc&utm_term=101041248718&gclid=Cj0KCQjw2tCGBhCLARIsABJGmZ4noNOcfoPficSqmZ2_NfBk2jYKJow2N-Odl1HmMG0w9ZUwoXtxWI4aAqqrEALw_wcB

Its attributes were extracted and stored in the DB earlier:

Next, the two are determined to be the same entity through deduplication and are connected:

This is achieved by performing deduplication on the properties of vertices. In the case below, dedupe is done on vertices of type SKU and PRODUCT, which ensures that "Mamaearth Complete Care Kit" (PRODUCT) and "Mamaearth Complete Care Kit" (SKU) are tagged into the same cluster, from which we derive the relationship "isSameAs".

Connection of nodes post deduplication in a Knowledge Graph
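A simplified sketch of this dedupe-and-link step in PySpark, assuming the vertices are available as a DataFrame with `name`, `vertex_type` and `vertex_id` columns and that a normalised-name match is enough to cluster them; the production pipeline matches on richer vertex properties.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dedupe-sketch").getOrCreate()

# Hypothetical vertices extracted from articles (PRODUCT) and product pages (SKU)
vertices = spark.createDataFrame(
    [("Mamaearth Complete Care Kit", "PRODUCT", "article-entity-123"),
     ("Mamaearth Complete Care Kit ", "SKU", "sku-456")],
    ["name", "vertex_type", "vertex_id"],
)

# Cluster key: a normalised form of the name (lowercased, trimmed, single-spaced)
normalised = vertices.withColumn(
    "cluster_key", F.regexp_replace(F.lower(F.trim("name")), r"\s+", " ")
)

# Pair up PRODUCT and SKU vertices that fall into the same cluster -> isSameAs edges
products = normalised.filter("vertex_type = 'PRODUCT'").alias("p")
skus = normalised.filter("vertex_type = 'SKU'").alias("s")

is_same_as = (products.join(skus, on="cluster_key")
              .select(F.col("p.vertex_id").alias("src"),
                      F.lit("isSameAs").alias("relation"),
                      F.col("s.vertex_id").alias("dst")))
is_same_as.show(truncate=False)
```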

V. Converting Data into Information

From the customer's point of view, the purpose of joining common entities was to be able to display, on its content aggregation website for the domain of "pregnancy and childbirth in the Indian context", any mention of products within articles and to point the user directly to the purchase site.

Since the graph is queryable:

1. All product sites are shown in the query result, wherever the product is sold online.

2. All articles where the product is mentioned are queryable [be they reviews, complaints, etc.].

3. Sentiment analysis can be done by retrieving the exact sentence where the product is mentioned.

4. Price aggregation is done for the product, as different sites have different selling prices.

In the following snapshot of the proposed customer website, the query result is presented in aggregated form: on the result page, alongside all the article URLs, there is also a list of related products and the sites they are sold on.

Content aggregation as a form of centralized, useful information for the end user
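The kind of aggregation behind such a result page can be sketched as a single Cypher query run through the neo4j Python driver; the relationship types `soldOn` and `mentions` and the `Entity` label are illustrative assumptions on top of the schema sketched earlier, not the exact production model.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

AGGREGATION_QUERY = """
MATCH (p:Entity {name: $product})
OPTIONAL MATCH (p)-[:RELATION {type: 'isSameAs'}]->(sku)-[:RELATION {type: 'soldOn'}]->(site)
OPTIONAL MATCH (article)-[:RELATION {type: 'mentions'}]->(p)
RETURN p.name AS product,
       collect(DISTINCT site.name)    AS selling_sites,
       collect(DISTINCT article.name) AS mentioning_articles
"""

with driver.session() as session:
    record = session.run(AGGREGATION_QUERY, product="Mamaearth Complete Care Kit").single()
    print(record["product"], record["selling_sites"], record["mentioning_articles"])
driver.close()
```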

VI. Update Information

As mentioned above, we designed this entire pipeline knowing that information in this domain is constantly being updated and becoming outdated [e.g. product reviews, news, product prices]. Therefore we made it possible to set different run frequencies for different categories of data; for example, product prices may need daily updates while articles and reviews can be refreshed weekly.

This way we were able to send the latest information to the website, so it remained of value to the end user.
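A minimal sketch of such per-category scheduling, with illustrative category names and intervals; the real cadence and cost trade-off were agreed with the client.

```python
from datetime import datetime, timedelta
from typing import Optional

# Illustrative per-category refresh intervals; actual values are a client decision.
REFRESH_INTERVALS = {
    "product_price": timedelta(days=1),
    "product_review": timedelta(weeks=1),
    "article": timedelta(weeks=1),
    "news": timedelta(hours=6),
}

def is_due(category: str, last_run: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the pipeline for this data category should be re-run."""
    now = now or datetime.utcnow()
    return now - last_run >= REFRESH_INTERVALS[category]

# Example: prices last refreshed two days ago -> due for a re-crawl
print(is_due("product_price", datetime.utcnow() - timedelta(days=2)))  # True
```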

VII. Data Correction by Combining multiple Sources

It is very common with web data that not all sources are equally reliable. Since anyone is free to write and publish on the web, and the crawling mechanism is liable to pick up all sorts of data, facts and information may need to be cross-validated against some source of truth.

This is where open knowledge graphs come to the rescue: we devised a step in our pipeline where entity correction happens with the support of open knowledge graphs before the data is presented for end-user consumption.

For simplicity, let us assume we have the following documents, with one sentence per document. These sentences are extracted from the articleBody field.

Motherhood Hospital in Indiranagar, Bangalore has a well-equipped clinic with all the modern equipment.

(Source: https://www.justdial.com/Bangalore/Motherhood-Hospital-Near-Vijaya-Bank-Indiranagar/080PXX80-XX80-110423113207-X1U8_BZDET)

Cloudnine hospital is located in New York. (Deliberately incorrect sentence.)

Mamaearth Complete Care Kit is specially formulated to provide your baby with proper care from top to toe.

(Source: https://reviews.momjunction.com/mamaearth-complete-baby-care-kit/)

Correcting values in the web data with the help of matching entities in the Wikidata knowledge graph

Entities in the in-house knowledge graph can be mapped to entities in open knowledge graphs, and their values can be enhanced or fixed as needed for the specific customer use case.
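As an illustration of this cross-check, the sketch below looks up an entity's location on Wikidata's public SPARQL endpoint and compares it with the value extracted from the web. The entity label and the choice of properties (P131, P159) are simplifying assumptions; the production correction step is more careful about entity linking.

```python
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def wikidata_locations(entity_label: str) -> list[str]:
    """Return location labels Wikidata holds for an entity with this English label."""
    query = """
    SELECT ?locationLabel WHERE {
      ?item rdfs:label "%s"@en .
      ?item wdt:P131|wdt:P159 ?location .   # located in / headquarters location
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    """ % entity_label
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "kg-correction-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    return [b["locationLabel"]["value"] for b in resp.json()["results"]["bindings"]]

# Extracted (incorrect) fact from the web: "Cloudnine hospital is located in New York."
extracted_location = "New York"
trusted = wikidata_locations("Cloudnine Hospitals")  # label is an assumption
if trusted and extracted_location not in trusted:
    print(f"Correcting location: {extracted_location!r} -> {trusted[0]!r}")
```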

initial Knowledge Graph from Web Article
Corrected Knowledge Graph

We shall look into the Architecture [Part 2] and Data Quality Validation [Part 3] done at every step of the pipeline in the following articles.
