The Importance of Location in Real Estate, Weather, and Machine Learning

Kirk Borne
8 min read · Sep 6, 2021


Source: https://andertoons.com/location/cartoon/7454/ok-now-for-the-third-and-final-part-of-todays-lesson

Real estate experts like to say that the three most important features of a property are: location, location, location! Likewise, weather events are highly location-dependent. We will see below how a similar perspective is also applicable to machine learning algorithms.

Location, Location, Location

In real estate, the buyer is first and foremost concerned about location for at least 3 reasons: (a) the desirability of the surrounding neighborhood; (b) the proximity to schools, businesses, services, etc.; and (c) the value of properties in that area.

Similarly, meteorologists tell us that all weather is local. Location is also significant in weather, for at least 3 reasons: (a) specific weather events are almost impossible to predict due to the massive complexity of micro-scale interactions of atmospheric phenomena that are spread over macro-scales of hundreds of miles; (b) the specific outcome of a weather prediction may occur only in highly localized areas; and (c) the minute details of a location (topography, hydrology, structures) are too specific to be included in regional models, and yet they are very significant variables in micro-weather events.

We might have a good start here on generating predictive models (for real estate sales or for weather) if we could parameterize the above location-based features and score them appropriately.

Another aspect of “location” is the boundary region between different areas. This boundary region can affect real estate sales, especially if a desirable area is adjacent to an undesirable area. While conditions (prices, market factors, resale values) may be well understood deep within each of the two areas, there is more uncertainty in the boundary region.

This is similarly true for the weather, as is often evident in major meteorological event predictions, specifically in the region of the country where I live. For those of us in the Baltimore-Washington region, winter weather forecasts often call for significant snow, freezing rain, and sleet. What we receive instead, in many cases, is a moderately small amount of snow across most of the region and not much else. I suspect that this “less significant” weather event (compared to the initial prediction) is partially due to the fact that cold, dry air from the north wins the battle against warm, wet air from the south, pushing drier air into the region than was expected. Wrong predictions for massive snowfalls are not unusual in this part of the country, primarily because this latitude often sits within the boundary region between the northern weather circulation patterns and the southern circulation patterns. It is often difficult to predict reliably which weather pattern will win the battle in the boundary region during any particular storm.

Location in Machine Learning

Location is also very important in many machine learning algorithms. The simplest classification (supervised learning) algorithms in machine learning are location-based: classify a data point based on its location on one side or the other of some decision boundary (Decision Tree), or classify a data point based on the classes of its nearest neighbors (K-nearest neighbors = KNN). Furthermore, clustering (unsupervised learning) is intrinsically location-based, using distance metrics to ascertain similarity or dissimilarity among intra-cluster and inter-cluster members.

Source: https://medium.com/thecyphy/ml-algorithms-from-scratch-part-1-k-nearest-neighbors-48acd4e357d0
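To make this location-based reasoning concrete, here is a minimal sketch (using scikit-learn's KNeighborsClassifier, a library choice of mine rather than anything prescribed above): a new data point is classified purely by the labels of its nearest neighbors, i.e., by its location in feature space.

```python
# Minimal KNN sketch: classification by location in feature space.
# (The synthetic data and scikit-learn are illustrative assumptions.)
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Two synthetic classes occupying clearly different locations in a 2-D feature space
class_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
class_b = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))
X = np.vstack([class_a, class_b])
y = np.array([0] * 50 + [1] * 50)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# A new point is labeled by majority vote of its 5 nearest neighbors
print(knn.predict([[0.4, 0.2], [2.8, 3.1]]))  # expected: [0 1]
```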

All of these location-based algorithms are a natural extension of the way that humans place things into different categories (or classes) when we see that different categories of items are clearly separated in some feature space (i.e., occupying different locations in that space). The challenge to data scientists is to find the best feature space for distinguishing, disentangling, and disambiguating different classes of behavior. Sometimes (though not often) those “best” features are the ones that we measured at the beginning, but we can usually discover improved classification features as we explore different combinations (linear and nonlinear) of the initial measured attributes.

We call this process Feature Engineering. The improved “engineered” features then represent a phase space in which a previously unseen data item will receive a more accurate classification simply based on that item’s location within the improved feature space.
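As a toy illustration of feature engineering (my own construction, not an example from the article): two concentric ring-shaped classes cannot be separated by any threshold on the raw (x, y) coordinates, but the single engineered feature r² = x² + y² separates them perfectly.

```python
# Feature engineering sketch: a nonlinear combination of measured attributes
# turns an inseparable problem into a one-threshold classification.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, 200)
inner_r = rng.normal(1.0, 0.1, 200)   # class 0: small radius
outer_r = rng.normal(3.0, 0.1, 200)   # class 1: large radius

X = np.vstack([
    np.column_stack([inner_r * np.cos(theta), inner_r * np.sin(theta)]),
    np.column_stack([outer_r * np.cos(theta), outer_r * np.sin(theta)]),
])
y = np.array([0] * 200 + [1] * 200)

# Engineered feature: a nonlinear combination of the raw attributes
r_squared = X[:, 0] ** 2 + X[:, 1] ** 2

# In the engineered feature space, the classes separate with a simple threshold
predictions = (r_squared > 4.0).astype(int)
print("accuracy:", np.mean(predictions == y))  # expected: 1.0
```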

Another challenge to data scientists is to explore increasingly higher-dimensional parameter spaces in an attempt to discover new subclasses of known classes — those subclasses may project (in lower dimensions) on top of one another in some feature space (hence the initial incorrect assignment of all the data items in those subclasses to one class), but the subclasses may separate from one another when additional dimensions are added. This may lead to discoveries of new properties of a physical system, or new customer behaviors, or new threat vectors in cybersecurity, or improved diagnoses in medical practice, or fewer false positives in testing systems, or improved precision in information retrieval of documents.
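A small sketch of that idea (again my own toy construction): two subclasses are indistinguishable when projected onto a single feature x, but they separate cleanly once a second feature z is added, so a clustering algorithm only recovers them in the higher-dimensional space.

```python
# Subclass discovery sketch: clusters overlap in 1-D but separate in 2-D.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 400)                   # identical distribution for both subclasses
z = np.concatenate([rng.normal(-3, 0.3, 200),   # subclass A
                    rng.normal(+3, 0.3, 200)])  # subclass B
true_labels = np.array([0] * 200 + [1] * 200)

# Clustering on x alone cannot recover the subclasses (they project on top of each other)
labels_1d = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(x.reshape(-1, 1))

# Clustering on (x, z) recovers them almost perfectly
labels_2d = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.column_stack([x, z]))

print("1-D agreement with true subclasses:", adjusted_rand_score(true_labels, labels_1d))  # near 0
print("2-D agreement with true subclasses:", adjusted_rand_score(true_labels, labels_2d))  # near 1
```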

Boundary Cases in Machine Learning

In machine learning, the most difficult items to classify correctly (or to place robustly into a specific cluster) are those within (or near) the boundary region between classes (or clusters). These items may not be accurately distinguishable by the set of decision rules inferred in a decision tree model, or they may have roughly equal numbers of nearest neighbors from the different possible classes (leading to poor KNN performance), or they may have equal affinity to two different clusters.
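A toy demonstration of that difficulty (illustrative only): a point deep inside one class gets a unanimous neighbor vote, while a point midway between two class centers gets a near 50/50 split of neighbor votes, which is exactly why boundary-region items are the hardest to classify reliably.

```python
# Boundary-region sketch: neighbor votes are decisive in the interior, ambiguous at the boundary.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (200, 2)),
               rng.normal([2.5, 2.5], 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

knn = KNeighborsClassifier(n_neighbors=10).fit(X, y)

interior_point = [[0.0, 0.0]]    # deep inside class 0
boundary_point = [[1.25, 1.25]]  # midway between the two class centers

print(knn.predict_proba(interior_point))  # heavily weighted toward class 0
print(knn.predict_proba(boundary_point))  # close to an even split
```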

As a consequence of these challenges, one could arrive at the conclusion that these particular data items are not at all useful in cluster-building or in constructing an accurate classifier. We may say to ourselves: “How can these items be useful if we cannot even place them within a category with better than 50% accuracy or repeatability?” (This is similar to how I react to weather forecasts for major snow events in my area — cautiously uncertain!) While this attitude is understandable, it is actually wrong. In fact, the items in the boundary region are golden!

Support Vectors

In the field of supervised machine learning, one of the historically successful classification algorithms is SVM (Support Vector Machines). This is a strange name for an algorithm. It also refers to the boundary region data points in a strange way — as support vectors!

So, what are these support vectors? They are the data items in the boundary region! They are precisely the labeled (classified) data items in the training set that provide the most powerful means to distinguish, disentangle, and disambiguate different classes of behavior. These are the data points that carry the most vital information about what distinguishes items on either side of a decision boundary. These are the items at the front lines of the battle for classification accuracy. They are the standard-bearers for their class, the ones whose classification rules are most critical in building the most accurate classifier for your data collection. Their feature vectors are therefore the “support vectors” for their class.
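A quick sketch of that point (using scikit-learn's SVC, which the article does not name): after fitting, the model retains only the boundary-region training points as its support vectors, and everything far from the boundary plays no role in defining the decision surface.

```python
# Support vector sketch: only the boundary-region points define the fitted model.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([4, 4], 1.0, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("training points: ", len(X))
print("support vectors: ", clf.support_vectors_.shape[0])  # a small fraction of the training set
```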

This is all great! But, unfortunately, the boundary region (like any “war front” or snowstorm weather front) is a messy place. There is much confusion. The boundary lines are usually not straight — a simple linear classification boundary is unlikely to be realistic or accurate.

Source: https://medium.com/@KunduSourodip/finding-non-linear-decision-boundary-in-svm-a89a97a006d2

Consequently, SVM is invoked to discover the complex nonlinear decision boundary (a maximum-margin hyperplane in a transformed feature space) that most cleanly separates the support vectors representing the different classes. This is no easy task. SVM is not only nonlinear, but it usually requires a kernel transformation from your measured features (data attributes) to some other, more complex feature space. Discovering this boundary is computationally intensive, scaling at least as N-squared, where N is the number of data items in the training set: all pairwise combinations of data items must be examined (hence the N-squared complexity) in order to find the support vectors (the data items within and around the boundary region between the classes). If N is large, as in most big data projects these days, then executing the SVM algorithm can become computationally prohibitive.
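A brief sketch of why the kernel matters (the RBF kernel and the make_moons toy dataset are illustrative choices of mine): on classes separated by a curved boundary, a linear SVM plateaus well below the accuracy of a kernelized one.

```python
# Kernel SVM sketch: a nonlinear (RBF) kernel handles a curved class boundary
# that a linear decision boundary cannot.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf", gamma=2.0).fit(X, y)

print("linear-kernel accuracy:", linear_clf.score(X, y))  # noticeably lower
print("RBF-kernel accuracy:   ", rbf_clf.score(X, y))     # close to 1.0
```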

Despite these worries, large N is actually a powerful ally in SVM. In a small data collection, you cannot be certain that you will actually have instances of the critical data items (support vectors for the different classes) in the boundary region — the good support vectors that you need to train your model may be “missing in action.” With a massive big data sample, it is much more likely that you will have sufficient examples of support vectors to build an accurate predictive model.

Divide-and-Conquer to the Rescue

What can we do to crush the computational N-squared bottleneck in SVM algorithms? Divide and conquer is a good strategy on this battle front, particularly if you are using a cluster-based (parallel) computing strategy. In this case, the large-N training set can be subdivided into many small-N subsets, and SVM can be applied to each of those subsets in parallel to search for and identify candidate support vectors, at much greater speed than on the full dataset (since the N-squared cost now applies to the small N on each cluster node rather than to the large N overall).

The results from each of those preliminary SVM models can then be combined into a master set of potential support vectors. Another round of SVM can be executed using new, different subsets of the training set, again with improved performance compared to running on the full dataset. Combining and comparing the results of these iterative SVM runs should converge to a final set of optimal support vectors and thus lead efficiently to a solution for the maximum-margin separating hyperplane.
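Here is a rough sketch of that divide-and-conquer idea, sometimes called a cascade SVM. The chunk count, kernel, and helper function name are illustrative assumptions of mine, not prescriptions from the article: each chunk's SVM is cheap to fit, and only the surviving support vectors are passed to the final model.

```python
# Divide-and-conquer (cascade) SVM sketch: fit SVMs on small chunks, pool their
# support vectors, then fit the final SVM on that much smaller pooled set.
import numpy as np
from sklearn.svm import SVC

def cascade_svm(X, y, n_chunks=10, random_state=0, **svc_kwargs):
    rng = np.random.default_rng(random_state)
    chunks = np.array_split(rng.permutation(len(X)), n_chunks)
    sv_X, sv_y = [], []
    # First pass: fit an SVM per chunk (parallelizable) and keep only its support vectors
    for chunk in chunks:
        clf = SVC(**svc_kwargs).fit(X[chunk], y[chunk])
        sv_X.append(X[chunk][clf.support_])
        sv_y.append(y[chunk][clf.support_])
    # Second pass: fit the final SVM only on the pooled candidate support vectors
    return SVC(**svc_kwargs).fit(np.vstack(sv_X), np.concatenate(sv_y))

# Usage on synthetic data
rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, (5000, 2)),
               rng.normal([3, 3], 1.0, (5000, 2))])
y = np.array([0] * 5000 + [1] * 5000)

model = cascade_svm(X, y, n_chunks=20, kernel="rbf", gamma=1.0)
print("training accuracy:", model.score(X, y))
```

In practice, the first pass can run in parallel across cluster nodes, and the procedure can be iterated with re-shuffled chunks, as described above, until the pooled set of support vectors stabilizes.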

Finally, the result will be an accurate (location-based) SVM classifier of complex (large-variety) data, even in the uncertain boundary region between different classes of behavior.

Trusting in the value of location, location, location has led us to improved disambiguation of otherwise similar-looking entities in our data collection, deeper understanding of our hyper-dimensional data space, more accurate classification and prediction models, and thus reduced confusion in our confusion matrix!

Source: https://federicoarenasl.github.io/SVM-LR-on-Fashion-MNIST/

Follow me on Twitter at @KirkDBorne

Learn more about my freelance consulting / training business: Data Leadership Group LLC

See what we are doing at AI startup DataPrime.ai

