How we mapped the internet to discover carriers
This blog post was written under the Wandera brand prior to its acquisition by Jamf. That’s history, we are now #OneJamf
Being able to accurately predict which carriers use which IP addresses is important for Wandera’s data cost management solution. Customers with dual-SIM/eSIM devices in their fleet need to be aware at which point in time a device is using which SIM, such that Wandera can notify the user if they’re using the wrong SIM in the wrong situations, or take action in any other way.
We were therefore interested in investigating if we could predict which carriers use which IP addresses using only our Wandera dataset. This dataset contains millions of IP addresses but for comparatively very few we have been able to determine the associated carrier. IP addresses are single dimensional and as such difficult to assign a carrier without additional information. Therefore, we propose a novel solution to predict carriers from only IP addresses which utilises the Hilbert curve, cluster analysis and classification algorithms.
The Hilbert curve is a continuous fractal space-filling curve which has a locality property; this means that if two points are ‘close’ in one-dimension then they will also be ‘close’ in two-dimensions. Since carriers often purchase IP addresses in ranges we may be able to use the Hilbert curve to create clusters of addresses from the same carrier.
We can therefore use cluster analysis to determine which IP addresses belong to a single carrier; these can be labelled by the carrier labels we do have in our dataset. Labelling clusters will consequently improve the coverage of labelled IP addresses in our data.
For any IP addresses which are not in our dataset we can use the clustered data to train a classifier to predict which carrier is associated with a given unlabelled IP address.
IP address to Hilbert space
Mapping IP addresses into a Hilbert space is a commonly used application, see for example visualising IP geolocation. This is achieved by first converting the IP address into a numeric form; by representing each octet as a binary string of base 2 bits, shifting based on the position of the octet in the IP address and summing we determine the integer representation. This calculation can be generalised by noticing that each shift of eight bits is equivalent to multiplying by 256, hence we can simply determine the integer for a generalised IP address: a.b.c.d by the formula: a256³+ b256²+ c256 + d.
From the integer form of an IP address we can determine the Hilbert space coordinates by applying a mapping algorithm. This algorithm works by viewing the entire space as composed of 4 regions, arranged 2 by 2. Each region is then sub- divided into 4 smaller regions, with each subsequent region subdivided again. This continues until we reach 2¹⁶ since an order 16 Hilbert curve is sufficient to display all 2³² IPv4 addresses. At each iteration the curve is rotated through each region to create a curve which fills the space:
By applying this transformation to the IP addresses in our dataset we begin to see clusters forming, where IP addresses in the same range are being grouped together:
Clustering
Clustering is a Machine Learning (ML) technique that involves the grouping of data points; with the IP addresses in a 2D Hilbert space we hypothesise that each distinct group of IP’s can be labelled with a single carrier. Around 30% of IP addresses in our dataset have carrier labels, therefore if we can determine clusters of IP addresses in the Hilbert space we may be able to use this information to label the remaining IP addresses.
Algorithm
In the Hilbert space we have an unknown number of clusters, which means that we cannot use clustering algorithms such as k-Means or Mean-Shift. Also our dataset has points seemingly grouped but with very different densities, meaning that density based clustering such as density-based spatial clustering of applications with noise (DBSCAN) would not be appropriate.
The algorithm which we chose for clustering our data in the Hilbert space was dynamic method density-based spatial clustering of applications with noise (DMDBSCAN). This clustering method uses a k-Nearest Neighbour (kNN) search to determine suitable parameters for the different density levels found in the dataset and then applies the DBSCAN algorithm for each determined density level.
Optimal values for the densities, ε, are found using the kNN search where k is determined from the minimum number of points that a cluster can contain. The distances from kNN can then be plotted in ascending order, where sudden changes in gradient correspond to changes in density and therefore suitable values of ε.
Applying the DBSCAN algorithm for each density level gives us clusters of IP addresses, which we label using the most common (mode) carrier label from the labelled IP addresses in each cluster.
Clustering Effect
Cross validation is used to optimise the hyper-parameters for the DMDBSCAN algorithm; we want to ensure that as many IP addresses as possible are clustered but we don’t want to generate false classifications.
The cluster analysis increases the coverage of labelled points from around 30% to over 85%, with high balanced accuracy. The remaining unlabelled points are deemed to be ‘noise’ and do not have a carrier assigned.
This label coverage increase can be seen in the plots below; in the first plot the data is coloured by the carrier labels we have in our original dataset, where the black points represent unlabelled IPs, and in the second plot the IPs are labelled after applying the clustering algorithm. Far more of the points now have a carrier label and it can be observed that the clustering has corrected some mis-labelled IP addresses. However, there are still some unlabelled points, where the DMSBSCAN algorithm cannot confidently assign a carrier label.
The DMDBSCAN algorithm is designed to differentiate between clusters of differing densities, this can be seen below:
These points in the Hilbert space have very different densities, but are close to each other. Using different density values for the DBSCAN clustering algorithm we successfully create distinct clusters which are assigned different carrier labels.
Classification
From the clustered dataset we can train a predictive ML model to try and predict the unlabelled and any IP addresses not in our dataset. For the ML model the training dataset consists of all the IP addresses for which we were able to determine carrier labels using cluster analysis.
The features which we shall use for this classification are:
- Hilbert space x-coordinate
- Hilbert space y-coordinate.
These coordinates define the position of an IP address in the Hilbert space and it is the position in this space which we hypothesise is how we can label.
The classification algorithm chosen for predicting the carrier of an IP address was a Random Forest classifier. Investigations into the various different methods: support vector classifiers, decision trees (with and without boosting), logistic regression, kNN, showed that training a Random Forest classifier was the optimal approach.
Using k-fold cross validation we are able to determine an optimal set of hyper-parameters for the Random Forest classifier. Since our dataset has varying proportions of IP addresses from different carriers; it is therefore useful to calibrate our classifier. This rescales the predicted probabilities to improve the probability-based predictions for each carrier in our model.
This is a multi-class problem hence we have confidence in a prediction if the probability, p, if: p >1/Number of carrier labels.
In the Hilbert space we can visualise the classifiers probabilities for a given carrier using a heat-map. Below is an example of the probability heat-map for a single carrier; included are the labelled points for that carrier. This demonstrates how the Random Forest algorithm predicts in the regions about the training data and how the density of points in a region leads to greater confidence.
Results
Applying this process to a one day of unique IP addresses from the Wandera dataset:
We find that we can quite accurately predict carriers using only IP addresses.
Being able to predict which carriers use which IP addresses means that we have increased the coverage of IP addresses with carriers from 30% to over 97%, and this figure is increasing all the time. This means that Wandera can much more easily identify if a customer’s device is using the wrong SIM in the wrong situation and notify that device accordingly and therefore help save our customers money.
The classification process defined above is regularly retrained and optimised such that we ensure the predictions continue to be accurate and the models respond to re-provisioning of IP addresses. Along with this we monitor the predictions daily, such that we can identify false classifications and remediate effectively. This helps to ensure that we provide the most accurate carrier information for our customers.