Parking-Lot: Identifying Parked Websites with Machine Learning

Radius Engineering
Published in Radius-Engineering · Nov 27, 2017 · 6 min read

At Radius, we provide comprehensive data for millions of businesses to help our customers target their B2B marketing efforts. One of the most visible business attributes that we provide is a business website. When a prospect or customer first interacts with our dataset, one of the quickest indicators of data quality comes when they click an external business link. Is it a legitimate business website? Is it actually the correct website for the named business? This post covers how we are tackling the first of these questions; there is also exciting research underway on website association as we speak!

To monitor current website association and connection rates, we send a random sample of websites and their associated businesses to a human evaluation team. These experiments revealed that about 5% of the connected websites were parked domains, obviously not associated with the target business. Removing these sites would be the lowest-hanging fruit in reducing website misassociation.

So What Is a Parked Website?

According to Wikipedia, “domain parking refers to the registration of an internet domain name without that domain being associated with any services such as e-mail or a website”. For our purposes, a parked website contains no meaningful content about a business and should not be provided as an associated website to our customers. Here are just a few diverse examples: https://www.domainmarket.com/buynow/alliedplumbinginc.com, http://www.matadortech.com/, http://hypointe.com/.

The Parked Website Problem

Our Aggregation team currently manages a database of about 50 million business websites. Using Scrapy, an open source website extraction tool, the Aggregation team procures website HTML content and status codes, such as 404 Not Found or 502 Bad Gateway, which let disconnected websites be filtered out of the website candidate pool. Parked websites, however, typically return a 200 OK status, making it difficult to filter them out by status code alone.
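
For illustration, a stripped-down Scrapy spider along these lines can capture both the status code and the raw HTML for each candidate site (the spider name, settings, and output fields here are illustrative, not our production crawler):

    import scrapy

    class WebsiteSnapshotSpider(scrapy.Spider):
        """Sketch of a spider recording the status code and HTML of each candidate."""
        name = "website_snapshot"
        # Keep non-200 responses instead of dropping them, so 404/502 sites can be
        # filtered out of the candidate pool downstream.
        custom_settings = {"HTTPERROR_ALLOW_ALL": True}
        start_urls = ["http://example.com"]  # placeholder candidate URLs

        def parse(self, response):
            yield {
                "url": response.url,
                "status": response.status,  # e.g. 200, 404, 502
                "html": response.text,      # raw HTML for later featurization
            }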

We first addressed the problem by labeling websites as parked if they contained any of the following three phrases: ‘this domain may be for sale’, ‘This domain is for sale’, ‘http://mcc.godaddy.com/park’. But we were still experiencing a high rate of false negatives (parked websites that were not correctly labeled as parked). Reviewing a sample of around 700 parked websites procured through human evaluation labeling and online research revealed that there were numerous domain hosting companies, each with its own design. Thus, key phrases were only useful for identifying a small percentage of the total population of parked websites.
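
A minimal version of that phrase check looks roughly like this (the helper name is illustrative; the phrases are the three listed above):

    # The three key phrases used by the original heuristic.
    PARKED_PHRASES = [
        "this domain may be for sale",
        "this domain is for sale",
        "http://mcc.godaddy.com/park",
    ]

    def looks_parked(html: str) -> bool:
        """Label a page as parked if any known parking phrase appears in its HTML."""
        lowered = html.lower()
        return any(phrase in lowered for phrase in PARKED_PHRASES)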

The method of checking for phrases also left little flexibility for capturing new types of parked websites that we had not seen before. A model-based approach equips us to keep up with changes to parked website phrases, new domain providers, and updates to parked website designs.

Model Featurization

To train a model that could differentiate between a valid business website and a parked website, we extracted 28 features from raw HTML using BeautifulSoup. Fourteen of those features were an expansion of our previous phrase-based features, each based on whether the HTML contained a given key phrase in either the main content or the anchor text. These phrases were extracted by aggregating n-grams from parked websites and keeping the most common occurrences. We added ‘cache access denied’ and ‘javascript enabled on your browser’ to limit false positives (a legitimate website being labeled as parked) in cases where Scrapy was unable to access the true content of the web page.
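
The n-gram aggregation step can be sketched with a standard bag-of-n-grams counter; something along these lines (the CountVectorizer settings and placeholder inputs are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer

    # Placeholder: visible text scraped from a labeled sample of parked pages.
    parked_texts = [
        "this domain may be for sale click here to buy this domain",
        "the domain is for sale sponsored listings for related searches",
    ]

    # Count 3- to 6-word phrases across the sample and keep the most common ones
    # as candidate key phrases.
    vectorizer = CountVectorizer(ngram_range=(3, 6))
    counts = vectorizer.fit_transform(parked_texts).sum(axis=0).A1
    top_phrases = sorted(
        zip(vectorizer.get_feature_names_out(), counts),
        key=lambda pair: pair[1],
        reverse=True,
    )[:50]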

To capture the structure of parked websites, we extracted the following additional features (a feature-extraction sketch follows the list):

  • Number of alphanumeric characters present in the website content (text alpha length)
  • The ratio between anchor text characters and non-anchor text characters in the website
  • Count of images
  • Count of iframes
  • Presence of a phone number or email
  • Count of external links and count of external links with greater than 30 characters
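
The sketch below shows how a handful of these structural features might be pulled out of raw HTML with BeautifulSoup; the feature names and regular expressions are illustrative, not our production code:

    import re
    from bs4 import BeautifulSoup

    def structural_features(html: str) -> dict:
        """Sketch of a few of the structural features extracted from raw HTML."""
        soup = BeautifulSoup(html, "html.parser")

        text = soup.get_text(separator=" ")
        anchor_text = " ".join(a.get_text(separator=" ") for a in soup.find_all("a"))

        alpha_len = sum(ch.isalnum() for ch in text)      # text alpha length
        anchor_len = sum(ch.isalnum() for ch in anchor_text)
        non_anchor_len = max(alpha_len - anchor_len, 1)   # avoid dividing by zero

        external_links = [
            a["href"] for a in soup.find_all("a", href=True)
            if a["href"].startswith("http")
        ]

        return {
            "text_alpha_length": alpha_len,
            "anchor_text_ratio": anchor_len / non_anchor_len,
            "image_count": len(soup.find_all("img")),
            "iframe_count": len(soup.find_all("iframe")),
            "has_phone_or_email": bool(
                re.search(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}", text)
                or re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text)
            ),
            "external_link_count": len(external_links),
            "long_external_link_count": sum(len(href) > 30 for href in external_links),
        }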

One of the more interesting features, called the common link ratio, is based on the results of Content-Based Approach for Identifying Textual Ads-Portal Domains by Almishari M, Liu X and Yang W. The anchor texts of different links embedded in parked domains tend to share words; that is, there is some degree of coherence and word sharing among the anchor text of the ad links shown on parked pages. For strongly coherent parked pages, most of the ads’ anchor text will share the same keyword(s) related to some topic. For example, the parked website lovingerfinancial.com has links listed in the center of the web page such as “Best Insurance Plans”, “General Insurance Quotes”, and “AAA Life Insurance Company”. Multiple phrases share terms, specifically ‘insurance’, ‘plans’, and ‘quotes’.

To capture this relationship, we defined a new feature called the Common Link Ratio (CLR_N):

CLR_N(D) = (number of links in D whose anchor text shares at least N words with the anchor text of another link in D) / (total number of links in D)

where N is a fixed integer and D is the website being featurized. To represent this concept of “sharing words”, we stemmed all words in the anchor text of the website’s links using a Snowball stemmer, which allows ‘having’ and ‘have’ to be considered a “shared word”. For the purpose of this analysis, we calculated CLR_N for all integral values of N from 1 to 5.
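
Under that definition, the computation can be sketched as follows, using NLTK’s Snowball stemmer (the function name and parsing details are illustrative):

    from bs4 import BeautifulSoup
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("english")

    def common_link_ratio(html: str, n: int) -> float:
        """Fraction of links whose stemmed anchor text shares at least n words
        with the anchor text of some other link on the page."""
        soup = BeautifulSoup(html, "html.parser")
        anchor_word_sets = [
            {stemmer.stem(word) for word in a.get_text().lower().split()}
            for a in soup.find_all("a")
        ]
        if not anchor_word_sets:
            return 0.0

        sharing = 0
        for i, words in enumerate(anchor_word_sets):
            if any(j != i and len(words & other) >= n
                   for j, other in enumerate(anchor_word_sets)):
                sharing += 1
        return sharing / len(anchor_word_sets)

    # CLR_1 through CLR_5, as used for the model.
    # clr_features = [common_link_ratio(html, n) for n in range(1, 6)]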

Parking-Lot: Model Training and Results

Ensembles of Decision Trees

We chose to use a Bagging model to improve the stability and accuracy of a decision tree model. A Bagging model takes multiple bootstrapped sets and trains a decision tree on each set. The final model result is based on an aggregated vote from — in our case — ten trees. By taking random subsets with replacement from the training set and training multiple models to be aggregated, we are able to reduce variance and risk of overfitting. This model is slightly different from a basic Random Forest implementation since we use all features for each trained tree, as opposed to using only a random subset of features.
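
A minimal scikit-learn version of this setup might look like the following; hyperparameters beyond the tree count and the all-features choice are illustrative:

    from sklearn.ensemble import BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Ten decision trees, each trained on a bootstrap sample of the training rows,
    # with every one of the 28 features available to every tree (unlike a Random
    # Forest, which would also subsample the features).
    model = BaggingClassifier(
        DecisionTreeClassifier(),
        n_estimators=10,
        max_samples=1.0,   # bootstrap samples as large as the training set
        max_features=1.0,  # all features used by each tree
        bootstrap=True,
    )
    # model.fit(X_train, y_train)  # X_train: 28-feature matrix, y_train: parked labels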

From a business perspective, threshold tuning had to weigh the cost of dropping a significant number of valid websites that would otherwise be provided to customers against the benefit of removing slightly more parked websites. Thus, to reduce the risk of false positives, we chose a threshold corresponding to a 3% false positive rate on the validation set. Overall, our model achieved an F1 score of 92% and precision of 96% while maintaining a false positive rate of 2.8% on a held-out test set.
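
One way to pick such a threshold is to sweep the ROC curve on the validation set and keep the most permissive cutoff whose false positive rate stays within the 3% budget. Here is a sketch under that assumption, reusing the model object from the sketch above; X_val and y_val are placeholder validation arrays:

    import numpy as np
    from sklearn.metrics import roc_curve

    # X_val, y_val: held-out validation features and parked/not-parked labels.
    # Probability that each validation-set site is parked.
    scores = model.predict_proba(X_val)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_val, scores)

    # Most permissive threshold (highest recall) with false positive rate <= 3%.
    within_budget = fpr <= 0.03
    threshold = thresholds[within_budget][np.argmax(tpr[within_budget])]

    # A site is flagged as parked only when its score clears the threshold.
    is_parked = scores >= threshold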

The most significant features were the count of images on the site, the count of alphanumeric characters, the count of links, the anchor text ratios, and the presence of the phrase ‘sponsored listings’. The Common Link Ratio was not as significant as we would have expected, possibly tied to the composition of the training set and the variety of types of parked websites the model was expected to identify.

What's Next?

Since the model was put into production in July, we have identified 370,000 websites as parked. As our human evaluation teams continue to label websites as parked, we will retrain Parking-Lot to keep up with any new design transitions for parked websites.

For a sample of parked website URLs, take a look at our new open source data repository!

A special thanks to Diego Munoz, Maria Moy, and the numerous teams involved in this project. It was truly a cross-team effort to bring this model from research to production!

Katherine Schinkel — Data Scientist @ Radius
