Geocoding: Necessary information is sufficient information

Shantanu Bhattacharyya
Published in Blog | Locus
4 min read · Jan 19, 2017

The fundamental purpose of any language, natural or programming, is to convey information. However, natural languages have evolved over thousands of years to tolerate a very high degree of ambiguity. For example, if you love the food in a restaurant, you can tell your friends later that “the food was great” or “the food was totally insane” or maybe “that restaurant serves some killer food”. Given the human brain’s access to a tremendous amount of conversational data and its unrivalled pattern recognition capabilities, it can easily interpret all three phrases to mean that you were very happy with the food (though I recommend caution in interpreting the third one). This is what we refer to as “context”, and this is where machines have only begun to get better.

We talk to machines through programming languages that are carefully designed to remove any ambiguity in communication. There is no room for context or interpretation of any kind. While these languages provide a framework for highly productive “conversations” with machines, removing context entirely can be impossible in certain situations and can lead to poor outcomes. A fascinating example of such a situation is geocoding, the process of converting human-readable addresses to precise coordinates on a map that a computer can understand.

Let us consider this address: “Flat N-23, Orchid Enclave, 121, seegehalli,kadugodi,near nitesh forest hill bangalore -560067 ,karnataka”.

From the perspective of a friend visiting the person who lives here, this address is very helpful. It has all the information that someone might need to locate the place. However, a computer sees 14 different words that it needs to make sense of before it can give out the precise coordinates. It might wonder why there is a “-” in front of 560067. Should it treat this as a pincode or not? Or it might know about Orchid Enclave but not about nitesh forest hill, and subsequently lose confidence in its first-choice geocode. What happens if the apartment complex belongs to one of the localities and not the other?

Even though you might be seeing this address for the first time, these things don’t even register as questions in your head, thanks to years of parsing addresses and recognising patterns. We immediately recognise that Orchid Enclave is where the person is. All we need to know beyond that is the general part of Bangalore where Orchid Enclave is. This information is necessary and sufficient for geocoding to work. Alternatively, we could rely on the pincode, but humans are neither very reliable with pincodes nor very aware of the regions a pincode refers to. The rest of the information only serves to help a human and can potentially confuse the machine. Google Maps (arguably one of the best geocoders around) fails to geocode the exact example but works perfectly once the flat number is deleted and we search for the apartment in Bangalore.

So, the art of geocoding boils down to giving machines the same context about what to focus on in an address and what to treat as redundant. Once a machine learns how to distil the most useful parts of the address, its odds of getting a correct geocode improve tremendously. Most geocoding efforts don’t go wrong because information was lacking, but because it was hidden within a lot of irrelevant or redundant information.
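As a rough illustration of this distilling step, the sketch below uses a few hand-written rules: drop the unit-level part (the flat number), drop bare numbers, cut landmark phrases that start with “near”, and keep the city if a tiny lookup table recognises it. Everything here (the rules, `KNOWN_CITIES`, the `distil` function) is a hypothetical simplification for this one example. It is not the approach Locus uses, which the article does not describe in detail.

```python
import re

# Hypothetical, deliberately tiny gazetteer; a real system would need far more.
KNOWN_CITIES = {"bangalore", "bengaluru"}
# Assumed prefixes that mark a unit inside a building.
UNIT_PATTERN = re.compile(r"^(flat|apt|apartment|door|house)\b", re.IGNORECASE)


def distil(address: str) -> str:
    """Keep the parts a geocoder is likely to need; drop the rest."""
    words = re.findall(r"[a-z]+", address.lower())
    city = next((w for w in words if w in KNOWN_CITIES), None)

    kept = []
    for part in address.split(","):
        # Cut everything from a landmark marker ("near ...") onwards.
        part = re.split(r"\bnear\b", part, flags=re.IGNORECASE)[0].strip()
        if not part:
            continue
        if UNIT_PATTERN.match(part):          # e.g. "Flat N-23"
            continue
        if not re.search(r"[A-Za-z]", part):  # bare numbers such as "121"
            continue
        kept.append(part)

    # The city may have been cut along with the landmark phrase; restore it.
    if city and not any(city in p.lower() for p in kept):
        kept.append(city.title())
    return ", ".join(kept)


address = ("Flat N-23, Orchid Enclave, 121, seegehalli,kadugodi,"
           "near nitesh forest hill bangalore -560067 ,karnataka")
print(distil(address))
# Orchid Enclave, seegehalli, kadugodi, karnataka, Bangalore
```

Rules like these break quickly on addresses written in a different style, which is why the problem calls for learning what matters from data instead of hard-coding it.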

A related problem is the confidence we can place in the geocoder’s response. With a focused geocoding query, the confidence metrics tend to be more binary. However, with lots of qualifiers, the confidence in the confidence metric itself gets shaky, and setting a good threshold for geocoding accuracy becomes very difficult. Going back to the analogy of restaurants serving killer food, we not only need to know whether that killer food is a good thing or a bad thing, but we also need to be confident about it.
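The thresholding idea can be sketched in a few lines. The `GeocodeResult` type, the 0-to-1 confidence scale, the 0.8 cut-off, and the sample values are all made up for illustration. Real geocoders report confidence in different ways.

```python
from dataclasses import dataclass


@dataclass
class GeocodeResult:
    lat: float
    lng: float
    confidence: float  # assumed to lie between 0 and 1


def is_acceptable(result: GeocodeResult, threshold: float = 0.8) -> bool:
    """Accept a geocode only when its confidence clears the threshold."""
    return result.confidence >= threshold


# Placeholder values, not real geocoder output.
focused = GeocodeResult(lat=0.0, lng=0.0, confidence=0.95)
noisy = GeocodeResult(lat=0.0, lng=0.0, confidence=0.55)

print(is_acceptable(focused))  # True
print(is_acceptable(noisy))    # False
```

When scores cluster near 0 or 1, almost any threshold separates good results from bad ones. When they pile up in the middle, as they tend to with noisy queries, the choice of threshold starts to decide the outcome.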

At Locus, we have made significant progress in imparting human context to our geocoder and teaching it how to focus on the most relevant parts of the address. International geocoding is particularly well served by this approach, since formatting, style and context change from country to country. Our approach has enabled us to leverage the millions of address data points available to us and provide optimal geocoding solutions to our clients.
