Classification problems with web scraped data
The goal of the project
In May 2018, Perfect Place and Jibia joined forces to tackle a big problem in the Dutch healthcare industry: the lack of data. There are a number of different sources of data about healthcare locations and retirement homes, but none of them are complete and up to date. To improve the quality of life for the sick and elderly, healthcare organisations need to make sound investments based on concrete information about the distribution of care in the Netherlands. To provide them with this data, we created a web scraping and machine learning tool to collect, structure and combine data on different healthcare facilities. This was made possible by the region West-Brabant, which provided us with a subsidy to create the software and test different techniques.
In this blog post, we will tell you about the problems we encountered during this project, mainly those caused by working with a small dataset of Dutch companies. We hope you can learn from this information and improve your own classification projects. If you have any questions about this project or anything below, let us know; we are happy to help you out!
Outline of the project
We chose a single approach for both classification problems. This approach was as follows:
- Create a list of core feature keywords
- Calculate the minimum distance* from all of these feature keywords to the found keywords** on the webpages***
- The resulting matrix of distances then gets split into train- and test data.
- Several different machine learning methods are then trained and evaluated on this data.
- The classifications of the best performing model are then written to a CSV.
* The distances are calculated using the cosine distance between the vector-space representations of the words. The vectors are generated using a word2vec model trained on the entire corpus of care websites. Besides word2vec, we also tried GloVe and FastText; their results were lacklustre compared to word2vec.
** The keywords are generated by first applying a Web2Text (https://arxiv.org/abs/1801.02607) model to the raw HTML of these pages, using a Python implementation we developed ourselves. We compared the results of this implementation to those of other boilerplate removal techniques and found them better than the out-of-the-box methods (newspaper, goose, html-text). Despite the comparatively good results, our implementation still returns empty pages from time to time, so there is room for improvement on this front; an ensemble method could be a way to fix the empty-page problem.
*** The pages used are the pages on which addresses are found. This means we are not using all pages with information on a given care location. More on this below.
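As a concrete illustration of the first two steps, here is a minimal sketch of the distance-matrix feature extraction. The word vectors here are tiny made-up toys; in the real pipeline they would come from the word2vec model trained on the care-website corpus, and the keyword lists would be far larger.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus the cosine similarity between two word vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy embeddings; in the real pipeline these come from the trained word2vec model.
vectors = {
    "zorg":   np.array([1.0, 0.2, 0.1]),
    "care":   np.array([0.9, 0.3, 0.2]),
    "bakery": np.array([0.1, 1.0, 0.8]),
}

feature_keywords = ["zorg"]          # the hand-picked feature keyword list
page_keywords = ["care", "bakery"]   # keywords extracted from one page

# One row per page, one column per feature keyword: the minimum distance
# from the feature keyword to any keyword found on the page.
row = [min(cosine_distance(vectors[f], vectors[w]) for w in page_keywords)
       for f in feature_keywords]
print(row)
```

Stacking one such row per page yields the distance matrix that is then split into train and test data.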
Machine learning results are only as good as the data being used. To get the best results, the data has to be transformed into features using domain knowledge of obvious predictors for the problem.
The list of keyword features ended up containing 104 separate words. The results of the machine learning are entirely dependent on the choice of words in this list, so there are still many possible additions, removals or other alterations that could lead to improvements.
The list is very big right now. This can lead to a more complete model, but it can also lead to overfitting, especially with such a limited dataset.
The data we use right now is based solely on the pages on which addresses are found. This may not be all the data that can be found for a certain care location, and it is also possible that such a page contains information that does not relate to the location at all. There is still room for improvement here.
First of all, there is still the possibility of looking through the rest of the site's link graph. One option would be to use the HTML from pages neighbouring a location page. This data could be down-weighted to account for its possibly lower relevance to the location. Instead of taking all neighbours, another option would be to look only at child or parent nodes.
If we knew the exact name of a certain care location, another way to get more data would be to collect all pages that mention this name. This has several downsides, and we do not yet know these names, but with some tweaks this could be an effective way to get more data.
Another option would be to use search engines to gather more data. Web search APIs like Microsoft Azure's, DuckDuckGo's or RapidAPI's can provide us with external information about a location.
We used scikit-learn and Keras for our classifiers. Our scikit-learn models ended up outperforming the Keras models, but all models were underperforming.
The reason for this is likely a lack of good training data. The amount of data is simply too low for good machine learning, and even if we had more data, it would still be noisy due to the problems discussed in the "Pages" section.
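The train-and-compare loop itself is straightforward. A hedged sketch with scikit-learn, using synthetic random data and a made-up labelling rule in place of the real distance matrix and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the distance matrix: 200 "pages" x 10 "keyword features".
X = rng.random((200, 10))
y = (X[:, 0] + X[:, 1] < 1.0).astype(int)  # made-up labelling rule, not the real labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
print(scores)
```

The best-scoring model's predictions would then be written to the output CSV, as described in the outline.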
Although most common machine learning methods depend heavily on large sets of tagged training data, some do not: unsupervised learning methods. Supervised methods search for patterns in the data that explain a predefined classification and extrapolate this to data points that are not classified yet. Unsupervised methods do not try to classify at all; they just look for patterns.
A good, common example of an unsupervised technique is clustering.
Let’s say that instead of care websites we used a set of photos as data: photos of dogs and photos of cats. The supervised techniques we used for this project could be used to split the photos into “dog” and “cat”. With clustering, this isn’t done explicitly; the photos are sorted into groups based on their features.
It’s quite likely these two clusters are pretty close to “dog” and “cat”, but another possibility would be “picture taken outside” and “picture taken inside”, or, more likely, two clusters that do not resemble anything so obvious to our human eyes. It’s never certain what the output will be, but you do have power over it: by engineering the features so that the obvious clustering is the one you’re looking for, you can get a pretty decent result without hand-labelled data.
Some examples of clustering techniques that could be used:
K-means: the most famous clustering algorithm, and a very straightforward solution. It clusters n data points into k groups. The starting cluster centroids can be predetermined; you could use the average vectors of the hand-labelled data points for this.
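Seeding k-means with predetermined starting points might look like this. The 2-D toy points are invented for illustration; in practice the seeds would be the average feature vectors of the hand-labelled locations:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two toy groups of 2-D points standing in for the real feature vectors.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

# Pretend one point per class was hand-labelled; use those (or averages
# of several labelled points) as the starting centroids.
seed_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

km = KMeans(n_clusters=2, init=seed_centroids, n_init=1, random_state=0)
labels = km.fit_predict(X)
print(labels)  # the first three points share a cluster, the last three the other
```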
DBSCAN: another popular clustering algorithm. With DBSCAN, data points that are difficult to cluster are tagged as noise. These noisy locations could then be classified by hand, which could be a useful way to sort out the difficult cases.
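A small sketch of that noise-tagging behaviour, again on invented 2-D toy data: DBSCAN assigns the label -1 to points that do not fall in any dense region, and those are the ones you would hand over for manual review.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense toy clusters plus one far-away point.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [20.0, 20.0]])  # the outlier

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
print(db.labels_)  # noise points are labelled -1

# These are the locations that would be classified by hand.
noise_points = X[db.labels_ == -1]
```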
Hierarchical clustering: clusters the data into a tree-like structure instead of flat groups in an n-dimensional space. This helps point out the important locations; these could be checked first, giving a better view of the results than picking locations at random.
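A hedged sketch of hierarchical clustering with SciPy, on invented 2-D toy data: linkage builds the full merge tree, and cutting it into flat clusters leaves the ambiguous in-between point in a cluster of its own, flagging it as one worth checking first.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.array([[0.0, 0.0], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.1],
              [2.5, 2.6]])  # an ambiguous point between the two groups

# Build the full merge tree (the dendrogram); early merges are the tightest groups.
Z = linkage(X, method="average")

# Cut the tree into 3 flat clusters; the in-between point ends up alone.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```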
Machine learning was an obvious tool to gravitate to for this problem: we knew a lot about it, and we did not know a lot about the domain of care. With enough data, machine learning can bridge this gap in domain knowledge, but when the data is lacklustre, another solution is building a rule-based system. Instead of implicitly deriving rules from data, you write the rules yourself.
This would take a lot of time and fine-tuning, but it’s still a viable solution if the data does not get you where you want to go.
Are you preparing to work on the same kind of classification problem? Maybe we can share our knowledge and work together. Contact us via our website jibia.nl or send an email to firstname.lastname@example.org