Classification problems with web scraped data

The goal of the project

In this blog, we will tell you about the problems we encountered during these projects, mainly those that come with the small dataset of Dutch companies we had. We hope you can learn from this information and improve your own classification projects. If you have any questions about this project or anything below, let us know; we are happy to help you out!

Outline of the project

  1. Create a list of core keyword features
  2. Calculate the minimum distance* from all of these feature keywords to the found keywords** on the webpages***
  3. The resulting matrix of distances is then split into training and test data.
  4. Several different machine learning methods are then trained and evaluated on this data.
  5. The classifications of the best-performing model are then written to a CSV (a rough sketch of steps 3-5 follows the footnotes below).

* The distances are calculated using the cosine distance between the vector-space representations of the words. The vectors are generated using a word2vec model that is trained on the entire corpus of care-websites. Besides word2vec, we also tried GloVe and FastText, but the results of those attempts were lacklustre compared to word2vec.
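As an illustration of how such a distance could be computed, here is a minimal sketch using gensim's Word2Vec; the tokenised pages and keyword lists are toy examples, not our actual data.

```python
# Minimal sketch: minimum cosine distance from each feature keyword to the
# keywords found on a page, using a word2vec model trained on the corpus.
from gensim.models import Word2Vec
import numpy as np

# Toy example of tokenised, cleaned pages (illustrative only)
tokenised_pages = [
    ["thuiszorg", "contact", "adres", "locatie"],
    ["dagbesteding", "locatie", "adres"],
]
model = Word2Vec(sentences=tokenised_pages, vector_size=100, window=5, min_count=1)

feature_keywords = ["thuiszorg", "dagbesteding"]   # core keyword features
page_keywords = ["contact", "locatie", "adres"]    # keywords found on one page

def min_cosine_distance(feature, keywords, wv):
    """Smallest cosine distance from one feature keyword to any page keyword."""
    if feature not in wv:
        return np.nan
    dists = [1.0 - wv.similarity(feature, kw) for kw in keywords if kw in wv]
    return min(dists) if dists else np.nan

# One row of the distance matrix (step 2 of the outline) for this page
row = [min_cosine_distance(f, page_keywords, model.wv) for f in feature_keywords]
print(row)
```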

** The keywords are generated by first applying a Web2Text (https://arxiv.org/abs/1801.02607) model to the raw HTML of these pages. For this we used a Python implementation we developed ourselves. We compared the results of this implementation to those of other boilerplate-removal techniques (newspaper, goose, html-text), and ours were better than these out-of-the-box methods. Despite the comparatively good results, our implementation still returned empty pages from time to time, so there is room for improvement on this front. Creating an ensemble method, as sketched below, could be a way to fix the problem of empty pages.
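To give an idea of what such an ensemble could look like, here is a hedged sketch of a simple fallback chain; `our_web2text_extract` is a placeholder for our own Web2Text implementation, and `html_text` and `newspaper` are two of the off-the-shelf libraries mentioned above.

```python
# Sketch of a fallback "ensemble": try our own Web2Text implementation first,
# then fall back to off-the-shelf boilerplate removal when it returns an empty page.
import html_text                 # pip install html-text
from newspaper import fulltext   # pip install newspaper3k

def our_web2text_extract(html: str) -> str:
    """Placeholder for our own Web2Text implementation."""
    raise NotImplementedError

def extract_main_text(html: str) -> str:
    """Try the custom extractor first; fall back to generic tools on empty output."""
    try:
        text = our_web2text_extract(html)
    except Exception:
        text = ""
    if not text.strip():
        text = html_text.extract_text(html)   # generic boilerplate removal
    if not text.strip():
        text = fulltext(html)                  # last resort
    return text
```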

*** The pages used are the pages on which addresses are found. This means we are not using all pages with information on a given care-location. More on this will follow.
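For concreteness, here is a rough sketch of steps 3-5 of the outline using pandas and scikit-learn; the file names, the "label" column, and the choice of models are assumptions for the example, not our exact setup.

```python
# Sketch of steps 3-5: split the distance matrix, train a few models,
# and write the predictions of the best-performing one to a CSV.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

distances = pd.read_csv("distance_matrix.csv")   # assumed output of step 2
X, y = distances.drop(columns=["label"]), distances["label"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
pd.DataFrame({"prediction": models[best].predict(X)}).to_csv(
    "classifications.csv", index=False
)
```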

Features

The keyword-list

The keyword list is very large right now. This can lead to a more complete model, but it can also lead to overfitting, especially with such a limited dataset.

Pages

First of all, there is still the possibility of looking through the rest of the site's link graph. One option would be to use the HTML from pages neighbouring a location-page. This data could be weighted down to account for its possibly lower relevance to the location, as in the sketch below. Instead of taking all neighbours, another option would be to look only at child or parent nodes.
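A hypothetical sketch of that down-weighting idea, with a toy site graph and made-up feature vectors:

```python
# Hypothetical sketch: combine a location-page's features with down-weighted
# features from its neighbouring pages in the site graph. Names are illustrative.
import numpy as np

site_graph = {                       # url -> neighbouring urls (toy example)
    "/locatie/a": ["/", "/contact"],
}
page_features = {                    # url -> feature vector from the distance matrix
    "/locatie/a": np.array([0.2, 0.8]),
    "/": np.array([0.5, 0.5]),
    "/contact": np.array([0.9, 0.1]),
}

def weighted_features(url, weight=0.5):
    """Own features plus neighbour features scaled by a relevance weight."""
    own = page_features[url]
    neighbours = [page_features[n] for n in site_graph.get(url, []) if n in page_features]
    if not neighbours:
        return own
    return own + weight * np.mean(neighbours, axis=0)

print(weighted_features("/locatie/a"))
```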

If we knew the exact name of a certain care-location, another option would be to filter for all pages that mention this name. This has several downsides, and we do not yet know these names, but with some tweaks this could be an effective way to get more data.

Another option would be to use search engines to gather more data. Web search APIs like Microsoft Azure's, DuckDuckGo's, or RapidAPI's can provide us with external information about a location.

Machine learning

The reason for this is likely a lack of good training data. The amount of data is simply too small for good machine learning, and even if we had more data, it would still be noisy due to the problems discussed in the “Pages” section.

Alternatives

Clustering

Let’s say that instead of care-websites we used a set of photos as data: photos of dogs and photos of cats. The supervised techniques we used for this project could be used to split the photos into these two categories, but that requires labels. With clustering, no labels are used; the photos are simply sorted into groups based on their features.

It’s quite likely these two clusters will be pretty close to “dog” and “cat”, but another possibility would be “picture was taken outside” and “picture was taken inside”, or, more likely, two clusters that do not resemble anything so obvious to our human eyes. It’s never certain what the output will be, but you do have some power over it: by engineering the features so that the obvious clustering is the one you’re looking for, you can get a pretty decent result without hand-labelled data (see the sketch below the list of techniques).

Some examples of clustering techniques that could be used:

K-Means

DBSCAN

Hierarchical Clustering
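As a starting point, here is a minimal K-Means sketch on the same distance-matrix features using scikit-learn; the number of clusters, the file name, and the scaling step are assumptions for the example.

```python
# Minimal sketch: clustering the distance-matrix features without labels.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

distances = pd.read_csv("distance_matrix.csv")   # assumed output of step 2
X = StandardScaler().fit_transform(
    distances.drop(columns=["label"], errors="ignore")
)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)
print(pd.Series(cluster_ids).value_counts())     # size of each cluster
```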

Rule-based solutions

Writing classification rules by hand would take a lot of time and fine-tuning, but it’s still a viable solution if the data does not get you where you want to go.
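A purely illustrative sketch of what such hand-written rules could look like; the labels and keywords are made up for the example.

```python
# Hypothetical rule-based fallback: hand-written keyword rules applied to the
# extracted page text. Labels and keywords are purely illustrative.
RULES = {
    "verpleeghuis": ["verpleeghuis", "verpleging", "somatiek"],
    "thuiszorg": ["thuiszorg", "wijkverpleging", "huishoudelijke hulp"],
}

def classify_by_rules(page_text: str, default: str = "unknown") -> str:
    """Return the first label whose keywords appear in the page text."""
    text = page_text.lower()
    for label, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            return label
    return default

print(classify_by_rules("Onze locatie biedt wijkverpleging en thuiszorg aan."))
```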

Contact information

Jibia is a tech startup from Utrecht. Our mission is to give all internet users back control over their data! jibia.nl
