locationtagger : A python package to extract locations from text or web page

Kaushik Soni
Nggawe Nirman Tech Blog
Nov 20, 2020
Places tagged in a paragraph

Nowadays, many data scientists and machine learning engineers are taking the wheel of providing useful insights, building quality products, and managing risk and fraud in organisations. That’s why we explore many avenues to find the best solution for a problem. The primary challenge is dealing with the form of data we get. In recent times, a lot of data has been arriving as unstructured text, which is why terms such as ‘Natural Language Processing’, ‘Text mining’, and ‘Text processing’ have become very popular. In this article, I want to introduce you to a text processing problem — extracting location terms from text or from the web page of a URL.

Many text processing techniques are driven by very popular Python libraries such as nltk and spacy. One of the techniques these libraries perform is Named Entity Recognition. If you are unfamiliar with this term, let me give you a glimpse of what NER is. First of all, a named entity is a real-world object; it can be the name of a person, location, organisation, or product, or a time expression, quantity, monetary value, percentage, etc. The process of identifying these terms in unstructured text is known as Named Entity Recognition. Let’s see a few examples.

NER using NLTK

Python code:

import nltk

text = """Unlike India, A winter weather advisory remains in effect line from Blue Earth, to Red Wing line in Minnesota."""

named_entities = []
nes = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
for ne in nes:
    if type(ne) is nltk.tree.Tree:
        if ne.label() in ('GPE', 'PERSON', 'ORGANIZATION'):
            l = []
            for i in ne.leaves():
                l.append(i[0])
            s = u' '.join(l)
            if s not in named_entities:
                named_entities.append(s)
print(named_entities)

Output :

['India', 'Blue Earth', 'Red Wing', 'Minnesota']

Python code:

nes_ = nes.copy()
for ne in nes:
    if not type(ne) is nltk.tree.Tree:
        nes_.remove(ne)
nes_.draw()

Output:

Figure 1: Named entities found using NLTK

Another example,

Python code:

import nltk

text = """European authorities fined Google $5.1 billion for abusing its power in the mobile phone market."""

named_entities = []
nes = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
for ne in nes:
    if type(ne) is nltk.tree.Tree:
        if ne.label() in ('GPE', 'PERSON', 'ORGANIZATION'):
            l = []
            for i in ne.leaves():
                l.append(i[0])
            s = u' '.join(l)
            if s not in named_entities:
                named_entities.append(s)
print(named_entities)

Output:

['European', 'Google']

Python code:

nes_ = nes.copy()
for ne in nes:
    if not type(ne) is nltk.tree.Tree:
        nes_.remove(ne)
nes_.draw()

Output:

Figure 2: Named entities found using NLTK

NER using SPACY

Python code:

import spacy

text = """Unlike India, A winter weather advisory remains in effect line from Blue Earth, to Red Wing line in Minnesota."""

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
for ent in doc.ents:
    print(ent.label_, ent.text)

Output:

GPE India
LOC Blue Earth
ORG Red Wing
GPE Minnesota

As we can see in the above results, both nltk and spacy perform NER well, but when it comes to recognising more specific entities such as locations, we fail to extract those accurately from the recognised named entities. In the first example, ‘Blue Earth’ and ‘Red Wing’ are locations, but nltk misidentifies them as persons, while spacy confuses ‘Red Wing’ with an organisation name. Here, I’m going to talk about this specific problem: detecting locations in text.

Location term extraction from Text or web page of a URL

We may face a situation like the examples above, where we need to extract all the locations from a text. For that, the NER problem has to be taken one step further for the categories of NEs found; in our case, locations or GPEs. This problem can also be solved with a supervised learning model, but for that we need a good sample of training data suited to our problem domain. Spacy itself provides supervised models, which we can train on our own additional training set or its existing training data to perform the NER classification. But what if we don’t have the training data for a supervised learning model? Let’s discuss an idea to tackle this situation…

Let’s proceed with text only, since we can easily grab all the text from the web page of a URL and then work with the text data alone. In the case of a URL, we are required to perform one additional step: applying a method to grab all the content (text) from the page. Popular Python libraries like newspaper3k do this efficiently.
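For instance, grabbing the text from a URL could be sketched as follows (a minimal sketch assuming newspaper3k is installed; `fetch_article_text` is a hypothetical helper name, not part of any library):

```python
def fetch_article_text(url):
    """Grab the main article text from a web page (sketch using newspaper3k)."""
    # Imported inside the function so this sketch loads even when
    # newspaper3k is not installed; install it with `pip install newspaper3k`.
    from newspaper import Article
    article = Article(url)
    article.download()  # fetch the raw HTML
    article.parse()     # extract the main article body, dropping navigation etc.
    return article.text
```

From here on, whether the input was raw text or a URL, we are working with a plain string.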

The Algorithm 🔰

To extract location terms from text, we have to figure out an algorithm that can recognise and filter out all the location terms from the NEs (named entities) previously extracted via NER with nltk and spacy — in other words, a way to verify whether an NE is a place name or not. Before that, it is preferable to do the mandatory text pre-processing on the data. Text pre-processing for this kind of problem may consist of a few steps, such as:

  1. Removing HTML tags from the text.
  2. Removing numbers and multiple spaces.
  3. Removing all non-ASCII characters (if required).
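The three steps above can be sketched with Python’s re module (a rough sketch; the exact patterns will depend on your data):

```python
import re

def preprocess(text):
    text = re.sub(r'<[^>]+>', ' ', text)            # 1. remove HTML tags
    text = re.sub(r'\d+', ' ', text)                # 2. remove numbers
    text = text.encode('ascii', 'ignore').decode()  # 3. drop non-ASCII characters
    return re.sub(r'\s+', ' ', text).strip()        #    collapse multiple spaces

print(preprocess("<p>Unlike India, 20 inches of snow\u2026</p>"))
# → Unlike India, inches of snow
```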

When we are done with the initial text processing, an algorithm like the one below may work for this problem:

Figure 3: Algorithm for Location term extraction from text

Looking at this picture, the idea is to verify whether an NE is a location term or not. To do this, we need to collect feasible location data that a given text may possibly contain. Then, the NEs already found are looked up in a table containing cities, regions and countries. The following checks perform the verification:

  • If named entity is a country

To determine whether a named entity is a country, we don’t need any geographical data to parse; instead, Python’s pycountry library can be used as follows:

Python code:

import pycountry

def is_a_country(s):
    ss = ' '.join([(i[0].upper() + i[1:].lower()) for i in s.split()])
    try:
        pycountry.countries.get(name=ss).alpha_3
        return True
    except AttributeError:
        try:
            pycountry.countries.get(official_name=ss).alpha_3
            return True
        except AttributeError:
            return False

country_names = []
for i in named_entities:
    if is_a_country(i):
        country_names.append(i)
print(country_names)

Output:

['India']

  • If named entity is a city/region

To find out whether an NE is a city or region somewhere, we must collect as much city data as possible, then look the word up in the table to see if it matches any city name or region/state name. The geographical data can be collected from freely accessible sources and will look like the following table:

Figure 4: Geographical data

Now, the following method will do so:

Python code:

import sqlite3

regions = []
country_regions = {}
other_countries = []
db_file = "~/Anaconda3/Lib/site-packages/locationtagger/locationdata.db"
conn = sqlite3.connect(db_file)
cur = conn.cursor()
cur.execute("SELECT * FROM locations WHERE LOWER(subdivision_name) IN \
    (" + ",".join("?" * len(named_entities)) + ")", [p.lower() for p in named_entities])
rows = cur.fetchall()
for row in rows:
    country_name = row[4]
    region_name = row[6]
    if region_name not in regions:
        regions.append(region_name)
    if country_name not in other_countries:
        other_countries.append(country_name)
    if country_name not in country_regions:
        country_regions[country_name] = []
    if region_name not in country_regions[country_name]:
        country_regions[country_name].append(region_name)
print(regions)

Output:

['Minnesota']

Similarly, we can find cities too by looking the NE list up in the dataset.

Python code:

cities = []
country_cities = {}
region_cities = {}
other_regions = []
cur = conn.cursor()
cur.execute("SELECT * FROM locations WHERE LOWER(city_name) IN \
    (" + ",".join("?" * len(named_entities)) + ")", [p.lower() for p in named_entities])
rows = cur.fetchall()
for row in rows:
    country_name = row[4]
    city_name = row[7]
    region_name = row[6]
    if city_name not in cities:
        cities.append(city_name)
    if region_name not in other_regions:
        other_regions.append(region_name)
    if country_name not in other_countries:
        other_countries.append(country_name)
    if region_name not in region_cities:
        region_cities[region_name] = []
    if city_name not in region_cities[region_name]:
        region_cities[region_name].append(city_name)
    if country_name not in country_cities:
        country_cities[country_name] = []
    if city_name not in country_cities[country_name]:
        country_cities[country_name].append(city_name)
print(cities)

Output:

['Red Wing','Blue Earth']
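The region and city lookups above differ only in the column being matched, so a small helper could unify them. Here is a sketch against an in-memory table that mimics the schema (for illustration only — this is not locationtagger’s actual code, and the column layout is simplified):

```python
import sqlite3

def match_rows(conn, column, names):
    """Return rows from `locations` whose `column` matches any of `names`, case-insensitively."""
    # Note: `column` is interpolated directly, so it must come from trusted code,
    # never from user input; the name values themselves go through placeholders.
    placeholders = ",".join("?" * len(names))
    query = f"SELECT * FROM locations WHERE LOWER({column}) IN ({placeholders})"
    return conn.execute(query, [n.lower() for n in names]).fetchall()

# A tiny in-memory table standing in for the real geographical data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (country_name TEXT, subdivision_name TEXT, city_name TEXT)")
conn.execute("INSERT INTO locations VALUES ('United States', 'Minnesota', 'Red Wing')")
print(match_rows(conn, "city_name", ["Red Wing", "India"]))
# → [('United States', 'Minnesota', 'Red Wing')]
```

The same helper then serves both the region pass (`subdivision_name`) and the city pass (`city_name`).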

It works! We have now found an algorithm for location term extraction from text. But this isn’t enough: there will be complications in compiling all these pieces of code and collecting suitable data to make them work accurately for us.

Locationtagger package for location term extraction 💻

A Python package, locationtagger, has been created to make these pieces of code work for the location extraction problem. It uses popular libraries like nltk, spacy, pycountry and newspaper3k along with geographical location data to produce results; since the results depend on the quality of the data used, the data can also be updated easily to suit your problem. Locationtagger is capable not only of detecting all the location terms (countries, regions/states and cities) in a text or URL but also of finding relationships among them. Let’s see the results with the previous text example:

Python code:

import locationtagger

text = """Unlike India, A winter weather advisory remains in effect line from Blue Earth, to Red Wing line in Minnesota."""

entities = locationtagger.find_locations(text = text)
entities.countries

Output:

['India']

Python code:

entities.regions

Output:

['Minnesota']

Python code:

entities.cities

Output:

['Red Wing','Blue Earth']

Now, we can see the relationships also,

Python code:

entities.country_regions

Output:

{'United States': ['Minnesota']}

Python code:

entities.region_cities

Output:

{'Minnesota': ['Red Wing', 'Blue Earth']}

Python code:

entities.country_cities

Output:

{'United States':['Red Wing','Blue Earth']}

Since ‘Red Wing’ and ‘Blue Earth’ are cities in the United States, and ‘United States’ is not present in the main text even though the text talks about cities in that country, we can find this hidden country in ‘other_countries’.

Python code:

entities.other_countries

Output:

['United States']

That isn’t all; the package can do the same with a web URL too:

Python Code:

URL = 'https://edition.cnn.com/2020/01/14/americas/staggering-number-of-human-rights-defenders-killed-in-colombia-the-un-says/index.html'
entities2 = locationtagger.find_locations(url = URL)
entities2.countries

Output:

['Colombia', 'Switzerland']

Python code:

entities2.regions

Output:

['Geneva']

Python code:

entities2.cities

Output:

['Geneva','Colombia']

Thank you, I hope you liked the article. If you wish, you are welcome to collaborate and improve the code for locationtagger at this GitHub link:

https://github.com/kaushiksoni10/locationtagger

Find me here —

[1] LinkedIn: https://in.linkedin.com/in/kaushik-soni

[2] Github: https://github.com/kaushiksoni10
