An organization asked us to compare the content of two websites based on a few textual features and to find similarities between them. The data was hosted on public websites, so we decided to scrape the relevant pages, 16,609 in total. To extract the textual features from the scraped text we used Wowool, an NLP package provided by EyeOnText. Once the features were extracted, we modelled the relations between them in Neo4j. The result is a graph database of NLP-processed data that can be explored through interactive knowledge graphs.
The business question
An organization, Fluid, wanted to see whether text data could reveal overlap in content between different data sources, in order to gain better insight into the sector. The data sources could be company documents, chat data or public data about a company. If that data overlaps, it could indicate an opportunity for collaboration between companies within a sector.
To achieve this we suggested building a proof of concept: a first version with only the most important features of the project. We chose this approach because the project was experimental in nature, so the value of the results was uncertain.
The plan was to use data from two websites that operate in the same sector, both of which contained many articles. In those articles we would look for two kinds of features, topics and persons or other entities, and then find the relationships between those features.
At the center of this project were two main pieces of software. The first was Eye-On-Text, a Natural Language Processing (NLP) SDK that can find relations and insights in text-based data. Several NLP packages are available online, but the developers of this one were already collaborating closely with our client, which made development smooth. It also offered everything we needed for analyzing text, and more.
The second was Neo4j. Because the client was interested in seeing relations between different texts, we wanted a graph database, a type of database that specializes in storing and showing relations efficiently. After some research, Neo4j turned out to be the most popular and best-supported graph database, so it was the logical choice.
After making these choices we asked our client for data we could analyze. They gave us a list of websites with articles, from which we chose two within the same industry.
With all of this in place, the project could start.
Almost all of the programming was done in Python. We started with the beautifulsoup4 package, which is easy to use and quick to develop with, to map out the websites to be scraped. Once we had a clear view of what to scrape and how, we used that information to build a more scalable and significantly faster web scraper with Scrapy, a popular web scraping framework. The new scraper could scrape the 16,000+ pages in a couple of minutes, though we limited its speed so that the scraped websites would not be affected.
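As an illustration of how a Scrapy crawl can be slowed down, here is a minimal sketch of throttling settings. The setting names are standard Scrapy settings, but the values and the `POLITE_SETTINGS` name are assumptions for this example, not the configuration used in the project.

```python
# Hypothetical values: the settings actually used in the project are not documented.
POLITE_SETTINGS = {
    "ROBOTSTXT_OBEY": True,                  # respect each site's robots.txt
    "DOWNLOAD_DELAY": 0.5,                   # seconds to wait between requests to the same site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 2,     # cap on parallel requests per site
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to the latency the site shows
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # average parallel requests AutoThrottle aims for
}
```

A dictionary like this can be assigned to a spider's `custom_settings` class attribute, or the same keys can be placed in the project's `settings.py`.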
In Scrapy we built a data pipeline that processes the scraped data and saves it to MongoDB for later NLP analysis. After running the scraper we had a database filled with more than enough data to start the NLP processing.
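A Scrapy item pipeline is a plain Python class with a `process_item` method. The sketch below shows roughly what such a pipeline could look like. The class name, the database and collection names, and the `url` and `text` item fields are all assumptions made for this example.

```python
class MongoArticlePipeline:
    """Sketch of a Scrapy item pipeline that stores each scraped page in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017", db_name="scraped",
                 collection=None):
        self.mongo_uri = mongo_uri    # placeholder connection string
        self.db_name = db_name        # hypothetical database name
        self.collection = collection  # can be injected, e.g. for testing
        self.client = None

    def open_spider(self, spider):
        if self.collection is None:
            # Imported lazily so the class itself has no hard dependency on pymongo.
            import pymongo
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.collection = self.client[self.db_name]["articles"]

    def close_spider(self, spider):
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        doc = dict(item)
        # Collapse runs of whitespace left over from the HTML.
        doc["text"] = " ".join(doc.get("text", "").split())
        # Upsert on the URL so re-running the scraper does not create duplicates.
        self.collection.update_one({"url": doc["url"]}, {"$set": doc}, upsert=True)
        return item
```

Scrapy activates a pipeline through the `ITEM_PIPELINES` setting. Upserting on the page URL is a design choice for this sketch: it keeps repeated crawls idempotent.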
Next, we defined a schema for the Neo4j graph database, which would be the basis for the end result.
With Eye-On-Text’s NLP software development kit at our disposal, we were quickly able to identify entities within a text, and by applying TF-IDF (term frequency-inverse document frequency) we identified topics. The relations between these features were also formed during this process, but they were not yet added to the Neo4j graph database.
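To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the plain, unsmoothed variant: the term frequency is a term's count divided by the document length, and the inverse document frequency is log(N / df), where N is the number of documents and df the number of documents containing the term. How the project actually computed TF-IDF is not documented, and the function names and pre-tokenized input are assumptions for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs is a list of token lists; returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_terms(doc_scores, k=3):
    """Return the k highest-scoring terms of one document (ties broken alphabetically)."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

A term that appears in every document gets a score of zero, so only terms that distinguish an article from the rest of the corpus surface as topic candidates.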
After extracting all the textual features from the data, the next step was loading them into a graph database. To do this, we used py2neo, a library that lets Python work with Neo4j. In the end, we mostly wrote Cypher queries and ran them through py2neo to build the Neo4j database.
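The sketch below shows how parameterized Cypher `MERGE` queries could load one article and its features. The graph schema (the `Website`, `Article`, `Entity` and `Topic` labels, the relationship types, and the property names) is hypothetical, since the original schema is not described. The `graph` argument is assumed to be any object with a `run(cypher, parameters)` method, such as a py2neo `Graph`.

```python
# Hypothetical schema: labels, relationship types and property names are assumptions.
MERGE_ARTICLE = """
MERGE (w:Website {name: $website})
MERGE (a:Article {url: $url})
MERGE (a)-[:PUBLISHED_ON]->(w)
"""

MERGE_ENTITIES = """
MATCH (a:Article {url: $url})
UNWIND $names AS name
MERGE (e:Entity {name: name})
MERGE (a)-[:MENTIONS]->(e)
"""

MERGE_TOPICS = """
MATCH (a:Article {url: $url})
UNWIND $names AS name
MERGE (t:Topic {name: name})
MERGE (a)-[:HAS_TOPIC]->(t)
"""

# Entities mentioned by articles on both websites, i.e. where the two sites overlap.
OVERLAP_QUERY = """
MATCH (:Website {name: $site_a})<-[:PUBLISHED_ON]-(:Article)-[:MENTIONS]->(e:Entity)
      <-[:MENTIONS]-(:Article)-[:PUBLISHED_ON]->(:Website {name: $site_b})
RETURN DISTINCT e.name AS entity
"""


def load_article(graph, article):
    """Write one article and its features using MERGE, so reloading is idempotent."""
    url = article["url"]
    graph.run(MERGE_ARTICLE, {"website": article["website"], "url": url})
    graph.run(MERGE_ENTITIES, {"url": url, "names": article.get("entities", [])})
    graph.run(MERGE_TOPICS, {"url": url, "names": article.get("topics", [])})


# Usage with py2neo (placeholder address and credentials):
#   from py2neo import Graph
#   graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))
#   load_article(graph, {"website": "site-a", "url": "https://example.com/1",
#                        "entities": ["Alice"], "topics": ["energy"]})
```

Entities and topics need separate queries because Cypher does not allow node labels to be passed as parameters. `MERGE` is used instead of `CREATE` so that a feature shared by several articles becomes a single node, which is what makes overlap between the two sites visible.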
Now it was time to visualize what we had created. For this we used Force-Graph (D3.js), which gave us a great-looking 3D visualization: a connected web of data showing how the two websites overlap.
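Force-directed graph libraries in the force-graph family typically take their input as a JSON object with a `nodes` list and a `links` list. As a rough sketch, assuming that input format and a hypothetical list of `(source, target)` pairs exported from the graph database, the data could be prepared like this:

```python
import json


def to_force_graph(edges):
    """Turn (source_id, target_id) pairs into a {"nodes": [...], "links": [...]} dict."""
    seen = set()
    nodes = []
    links = []
    for source, target in edges:
        for node_id in (source, target):
            if node_id not in seen:  # each node is listed once, in first-seen order
                seen.add(node_id)
                nodes.append({"id": node_id})
        links.append({"source": source, "target": target})
    return {"nodes": nodes, "links": links}


# Hypothetical example: one entity mentioned by articles from both websites.
graph_data = to_force_graph([
    ("site-a/article-1", "Alice"),
    ("site-b/article-7", "Alice"),
])
graph_json = json.dumps(graph_data)  # string that can be served to the front end
```

Nodes connected to articles from both sites end up as bridges between the two clusters, which is where the overlap shows in the rendered graph.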
In the end, we delivered a graph database filled with NLP-processed data that gives interesting insights into two organizations working in the same sector: how and where they overlap, and who and what connects them. When the results were in at the end of the project, Fluid was satisfied with them. They became Fluid's first proof of concept and a basis for its next steps towards becoming a commercial business.