Learning to Decode Unstructured Indian Addresses
A key challenge in driving automation and efficiency in the logistics and supply-chain industry is to make address records machine readable and convert them to precise geocodes. Postal codes have proven fairly effective at this in developed countries; in the UK, for example, they typically resolve any address to within 100–200m. In India, however, pin-codes do not offer a promising solution. A pin-code covers a very large area (the median area covered by a pin-code is about 90 sq km) and may contain up to a million households.
Moreover, when it comes to writing pin-codes, there are hardly any easy-to-access resources that can be relied upon for determining the correct pin-code of an address. This information largely comes from elders in the family. However, as towns and cities expand and new pin-codes are introduced, most people remain unaware of the change and continue using the pin-codes they have grown up with. As a result, 20–30% of written addresses have incorrect pin-codes.
Figure 2 illustrates why nobody can be sure of the pin-code they write for an address. Notice that Plot 1 and Plot 5, which are adjoining buildings in Sector 44 Gurgaon, have different pin-codes, 122004 and 122002, respectively, according to Google Maps. Also note that both of these buildings lie in the same polygon which represents the area covered by the pin-code 122003. At best, pin-codes provide an “average” solution to address India’s addressing problem!
In general, people in India identify an address based on neighbourhood/locality names, points of interest (POIs), or even sketchy directions to a location. Although these features have proven to be an effective way for an experienced local postman to locate a given address, they are not standardised, nor do any official records exist that can help in locating them systematically. Moreover, written addresses in India often contain anomalies such as incorrect spellings, incomplete locality information, poor explanation of landmarks, etc.
The reason for such anomalies is often genuine. For instance, most people know how to pronounce a locality name in their local language, but not its transliteration in English, which leads to a large number of spelling variations for the same locality name. Other examples of addresses difficult to deal with are:
A Detailed Workflow -
xx Raheja Atlantis Sector 31 Gurgaon (on weekdays) xx Gulmohar Park New-Delhi 110049 (on weekend)
A Threat -
House No xx Village xx PO xx PS xx Distt Jalandhar Rural Punjab I Want All Products Which I Had Already Been Ordered Should Be Original Otherwise I Take Strict Action Against U Bcz I M Cid Officer From Punjab When Last Time I Ordered Curren Watch The Mirror And Second Hand Arm Was Broken ,I Adampur, Punjab 144201
Intractable Spelling Errors -
xx Marol Maroshi Rd, Marol, Andheri WAST, Mumbai 400059 (Andheri West or Andheri East?)
With such irregularities, it becomes very difficult for a machine to even disambiguate an address string correctly, let alone geocode it to a high resolution.
How are Businesses Impacted?
Any logistics company trying to optimise their operations cannot do so without knowing where they need to deliver the goods. This becomes evident in Figure 3, which illustrates the typical “life cycle” of a shipment for an eCommerce logistics company, such as Delhivery.
It is clear that in the absence of a fairly precise location for each shopper’s home, it becomes harder to make an optimal choice for:
- Creating routes for the “last-mile” delivery boys
- Deciding which Delivery Centre should perform the “last-mile” of the delivery
- Deciding which Destination Hub the shipment should fly to
The more granular the locality information gets, the better we are able to optimise the above steps.
It can be estimated that the cost of “last-mile” distribution alone is around a quarter of a billion dollars annually for the Indian eCommerce logistics industry. Being able to geocode addresses accurately is likely to have at least a 15–20% impact on this cost, i.e., roughly 37.5–50 million dollars a year.
Commercially Available Solutions
Most online maps do very well at geocoding addresses from the Western world, but are not optimised to disambiguate unstructured Indian addresses. While testing the geocoding API of a leading online maps company on a large sample of addresses, we observed that only about 40% of addresses were resolved to a geocode with 500m precision. The performance worsens in Tier 2 and Tier 3 towns. A major reason for the poor performance is the lack of ground knowledge and of any understanding of the non-standard features people write in their addresses.
To solve this problem, Indian mapping organisations employ thousands of people to manually survey cities for new addresses, localities/sub-localities/POIs; see [1] and [2]. This process is very cumbersome, especially since the exercise has to be repeated periodically to keep the data fresh in a rapidly evolving urban landscape.
At Delhivery, we have built an in-house solution, AddFix, which uses generative machine learning techniques to solve this problem. The training data for the algorithm includes address strings that e-commerce customers provide at the time they place orders, along with location data captured from mobile devices of delivery boys who eventually deliver e-commerce shipments to the customer’s doorstep.
Graphical models churn through millions of customer address records in an unsupervised way to learn the names of cities, localities, sub-localities, buildings and POIs that exist in a given geographical region, along with their hierarchical relations and alternative spellings. This step essentially generates a directed acyclic graph consisting of the different locality features that people commonly write in addresses. Next, we determine the geographical boundary of each node in the graph based on the location data captured by the mobile devices of ground staff. Every month, we capture hundreds of reliable geocodes for each node, which allow us to draw out polygons for the associated locality feature. These polygons keep becoming more accurate as we do more deliveries.
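To make the polygon step concrete, here is a minimal sketch that draws a boundary around the delivery geocodes collected for a single node. It assumes a convex hull is an acceptable first approximation and treats longitude/latitude as planar coordinates, which is reasonable only over a small area. The post does not say how AddFix actually draws its polygons; the function names and the sample coordinates below are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    lower: List[Point] = []
    for p in pts:
        # Drop the last vertex while it would make a clockwise (or straight) turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Hypothetical delivery geocodes recorded for one locality node.
deliveries = [
    (77.070, 28.450),
    (77.080, 28.450),
    (77.080, 28.460),
    (77.070, 28.460),
    (77.075, 28.455),  # interior point, should not appear on the boundary
]
polygon = convex_hull(deliveries)
```

Re-running the hull as new geocodes arrive is one simple way the boundary could keep improving with more deliveries. A production system would likely also need to discard outlying geocodes and handle localities that are not convex.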
Given a new address at the time of prediction, we search the graph for a set of connected nodes that match most closely with the different locality features provided in the address. To ensure that the matching is not overly sensitive to variations in spelling, we employ a phonetic-distance-based fuzzy search that is specifically tuned for Indian languages. For instance, most standard phonetic similarity engines will not pick up that Gurgaon and Gudgaon sound similar.
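As an illustration of what tuning for Indian languages can mean, the sketch below reduces a name to a rough phonetic key before comparing. Classic Soundex, for example, places "r" and "d" in different classes, so it keeps Gurgaon and Gudgaon apart. The normalisation rules here are assumptions made for this example (in particular, folding "d" into "r" is deliberately crude) and are not AddFix's actual rules; `phonetic_key` and `phonetic_similarity` are hypothetical names.

```python
import re
from difflib import SequenceMatcher

# Hypothetical normalisation rules, applied in order to a lower-cased name.
RULES = [
    (r"[^a-z]", ""),          # keep letters only
    (r"([bdgkpt])h", r"\1"),  # aspirated consonants: "dh" -> "d", "bh" -> "b"
    (r"aa", "a"),             # long vowel spellings
    (r"ee", "i"),
    (r"oo", "u"),
    (r"w", "v"),              # "w" and "v" are used interchangeably
    (r"d", "r"),              # the retroflex flap is written as either "d" or "r"
    (r"(.)\1+", r"\1"),       # collapse repeated letters
]


def phonetic_key(name: str) -> str:
    """Reduce a locality name to a rough phonetic key."""
    key = name.lower()
    for pattern, replacement in RULES:
        key = re.sub(pattern, replacement, key)
    return key


def phonetic_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the phonetic keys of two names."""
    return SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()


same_sound = phonetic_key("Gurgaon") == phonetic_key("Gudgaon")  # both reduce to "gurgaon"
score = phonetic_similarity("Gurgaon", "Mumbai")                 # low score for unrelated names
```

Comparing keys rather than raw strings means a spelling variant costs nothing when it maps to the same sound, while `SequenceMatcher` still gives partial credit to near misses.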
The resultant output includes the entire location hierarchy of the given address, i.e., state, city, locality, sublocality, rooftop, along with polygon boundaries (where available) for each node in the hierarchy.
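The lookup can be pictured with a toy example. The sketch below stores a tiny, made-up hierarchy with parent pointers, matches node names against the address, and keeps the matched node whose path to the root contains the most other matched nodes. This is only one plausible reading of "a set of connected nodes"; the node ids, the scoring rule and the plain substring matching are all assumptions, and polygons are left out for brevity.

```python
from typing import Dict, List, Optional, Tuple

# Toy hierarchy: node id -> (name, level, parent id). All entries are illustrative.
NODES: Dict[str, Tuple[str, str, Optional[str]]] = {
    "HR": ("Haryana", "state", None),
    "HR/GGN": ("Gurgaon", "city", "HR"),
    "HR/FBD": ("Faridabad", "city", "HR"),
    "HR/GGN/S31": ("Sector 31", "locality", "HR/GGN"),
    "HR/FBD/S31": ("Sector 31", "locality", "HR/FBD"),
    "HR/GGN/S31/RA": ("Raheja Atlantis", "sublocality", "HR/GGN/S31"),
}


def path_to_root(node_id: str) -> List[str]:
    """Node ids from the root down to node_id."""
    path: List[str] = []
    current: Optional[str] = node_id
    while current is not None:
        path.append(current)
        current = NODES[current][2]
    return path[::-1]


def resolve(address: str) -> List[Tuple[str, str]]:
    """Return (level, name) pairs for the best-supported branch of the hierarchy."""
    text = address.lower()
    # Simplification: plain substring match; a real system would match fuzzily.
    matched = {nid for nid, (name, _, _) in NODES.items() if name.lower() in text}
    if not matched:
        return []
    # Score a node by how many nodes on its root path were also matched.
    best = max(
        sorted(matched),
        key=lambda nid: sum(1 for n in path_to_root(nid) if n in matched),
    )
    return [(NODES[n][1], NODES[n][0]) for n in path_to_root(best)]


hierarchy = resolve("xx Raheja Atlantis Sector 31 Gurgaon")
```

Because "Sector 31" appears under two cities in this toy graph, the city name in the address is what tips the choice: the same function resolves "Sector 31 Faridabad" to the Faridabad branch.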
This project started back in 2014, with the aim of discarding pin-code sorting and moving to a locality-based sorting system. The first version of AddFix (v1) was largely a rule-based system, which would query a given address string to match locality names from a manually generated list of important localities/sublocalities/POIs across major Indian cities. This approach was able to correctly predict localities for 80–85% of addresses (>95% for Metros and Tier 1 cities) and predict geocodes for these addresses with a median precision of 500m.
The need for a new version arose when our volumes grew and we were required to further increase the granularity of our address geocoding service to ensure efficiency at scale. At this scale, it became increasingly hard to live with a system that requires manual tagging of localities.
The latest version of AddFix (v3) is able to correctly determine the locality/sublocality of more than 90% of the shipments that flow through Delhivery’s network and predict geocodes for these addresses with a median precision of 200m. These results are guaranteed to improve with time, without any additional developmental effort. This has enabled us to discard traditional pin-code sorting of shipments in favour of a more granular locality-based sorting system. The latter allows us to place our distribution centres optimally and create system-driven routes for delivery boys, both of which would fail in the absence of precise geocodes.
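For readers who want to measure something similar on their own data, the sketch below shows one way a median precision figure could be computed: take the great-circle (haversine) distance between each predicted geocode and the location where the delivery actually happened, then take the median. The post does not describe how Delhivery measures precision, so the function names and the pairing of predictions with delivery locations are assumptions.

```python
import math
from statistics import median
from typing import Iterable, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def median_error_m(pairs: Iterable[Tuple[LatLon, LatLon]]) -> float:
    """Median distance in metres between predicted and actual locations."""
    return median(haversine_m(*predicted, *actual) for predicted, actual in pairs)


# Hypothetical (predicted, actual) pairs for three shipments.
sample = [
    ((28.4500, 77.0700), (28.4500, 77.0700)),
    ((28.4500, 77.0700), (28.4510, 77.0700)),
    ((28.4500, 77.0700), (28.4550, 77.0700)),
]
error = median_error_m(sample)
```

The median is less sensitive than the mean to a handful of badly resolved addresses, which is presumably why the figures above are quoted as medians.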
The ability to geocode raw addresses is essential not only for logistics companies trying to cut down on their costs, but for any organisation that needs to reach people efficiently, e.g., emergency services, customer support services, etc. Our pilots with companies from other domains have provided very encouraging insights into how AddFix can make an impact on a wide range of use cases beyond delivering shipments efficiently. The ability to automatically discover localities and generate their polygon boundaries can potentially improve the visibility of smaller localities/towns and put them on the digital map. This makes it easier for businesses to extend opportunities/products to a larger population.
Kabir leads the Data Science team at Delhivery with a focus on problems in Digital Maps, Machine Learning, Discrete Optimisation and Simulation. A published author, he was previously a Senior Lecturer of Operations Research at The University of Greenwich, UK.
This blog was made possible by inputs from Rahul Kumar, Senior Data Scientist, Delhivery. Delhivery is India’s largest third-party eCommerce logistics company. Currently we deliver close to 100 million shipments annually to more than 12,000 pin-codes across India.
If you like the sound of what we do and are keen to explore opportunities within the Technology and Data Sciences team at Delhivery, please visit tech.delhivery.com.
[1] ‘Now, government to start mapping your address digitally’, The Economic Times. Available at: https://economictimes.indiatimes.com/industry/cons-products/durables/govt-looks-to-cut-gst-on-white-goods/articleshow/61718288.cms (Accessed: 22 Nov 2017)
[2] ‘Andhra Pradesh kicks off Smart Pulse Survey of 14.8 million households’, Live Mint. Available at: http://www.livemint.com/Politics/epYBSl0nVGaKa8wkx9zPWK/Andhra-Pradesh-kicks-off-Smart-Pulse-Survey-of-148-million.html (Accessed: 22 Nov 2017)