Identifying Named Entities from URL

Published in TabsOverSpaces

3 min read · Jul 5, 2020

The image already gives a sense of what named entities are (real-world objects such as people, organizations, and places that are referred to by name) and how they can be highlighted visually.

In this blog, let’s write the code to identify named entities in the content of a URL. We will be using this URL for the purpose.

You can find this project in my GitHub repository.

The named entity types that can be identified are as follows:
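As a quick reference, the sketch below collects the entity labels used by spaCy's English pipelines trained on OntoNotes, with descriptions paraphrased from the spaCy documentation. The name ENTITY_LABELS is only illustrative, and the exact label set can differ between model versions; with spaCy installed, spacy.explain(label) returns the library's own wording.

```python
# Entity labels used by spaCy's OntoNotes-trained English pipelines.
# ENTITY_LABELS is an illustrative name, not part of spaCy itself.
ENTITY_LABELS = {
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "PRODUCT": "Objects, vehicles, foods, etc. (not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws",
    "LANGUAGE": "Any named language",
    "DATE": "Absolute or relative dates or periods",
    "TIME": "Times smaller than a day",
    "PERCENT": "Percentage, including '%'",
    "MONEY": "Monetary values, including unit",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": "'first', 'second', etc.",
    "CARDINAL": "Numerals that do not fall under another type",
}

# Print the labels as a two-column table.
for label, description in ENTITY_LABELS.items():
    print(f"{label:<12}{description}")
```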

As with any Python program, let’s start by importing all the required libraries.
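The exact import list lives in the repository; the sketch below is an assumption based on the names used later in this post (spacy, displacy, en_core_web_sm, Counter). The helper missing_packages is hypothetical and only checks that the third-party packages are installed before importing them.

```python
import importlib.util
from collections import Counter  # standard library: counting labels and texts
from pprint import pprint        # standard library: readable printing of lists

# Third-party packages this walkthrough appears to rely on (assumed list).
REQUIRED = ["spacy", "en_core_web_sm"]


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install before continuing:", ", ".join(missing))
else:
    import spacy
    from spacy import displacy
    import en_core_web_sm
```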

Now, let’s define a function, url_to_string, to convert the content of the URL to a string.

Let’s use this function to convert the contents of the chosen URL to a string.
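One way to write such a function is sketched below. It uses only the standard library (urllib and html.parser) so it runs without extra installs; the version in the repository may rely on other libraries. The helper names _TextExtractor and html_to_text, the list of skipped tags, and the User-Agent header are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <aside> content."""

    SKIP = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_text(html):
    """Strip markup from an HTML string and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def url_to_string(url, timeout=10):
    """Download a page and return its visible text as one string."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Joining the text chunks with a space keeps words from neighbouring elements apart, at the cost of occasionally splitting a word that is partly wrapped in an inline tag.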

Load en_core_web_sm into the nlp variable.

en_core_web_sm is a small English pipeline that assigns context-specific token vectors, POS (part-of-speech) tags, a dependency parse, and named entities.
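A minimal sketch of the loading step follows. It uses spacy.load, and the wrapper load_pipeline is a hypothetical helper added here only to give a readable hint when the model has not been downloaded yet.

```python
def load_pipeline(name="en_core_web_sm"):
    """Load a spaCy pipeline by name, with a hint if the model is missing."""
    import spacy  # imported lazily, so spaCy is only needed when this is called

    try:
        return spacy.load(name)
    except OSError as err:
        # spaCy raises OSError when the named model package cannot be found.
        raise OSError(
            f"Model '{name}' is not installed; try: python -m spacy download {name}"
        ) from err
```

With spaCy and the model installed, nlp = load_pipeline() gives the pipeline object used in the next step.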

Process the URL content with nlp

Processing text with the nlp object returns a Doc object that holds all information about its tokens, their linguistic features, and their relations
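To see what the Doc holds, the sketch below lists a few per-token attributes (text, part-of-speech tag, dependency label, and the text of the head token). The function name token_table and its limit parameter are assumptions for illustration; the attributes pos_, dep_ and head are standard spaCy token attributes.

```python
def token_table(doc, limit=10):
    """Return (text, POS tag, dependency label, head text) for the first tokens."""
    rows = []
    for token in doc:
        if len(rows) >= limit:
            break
        rows.append((token.text, token.pos_, token.dep_, token.head.text))
    return rows
```

Assuming nlp is the loaded pipeline and text is the string returned by url_to_string, article = nlp(text) creates the Doc, and token_table(article) lists its first ten tokens.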

Identifying Entities

article.ents returns the entities found in the document as a tuple of spans. Let’s print the first 5 of them.
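A short sketch of this step, assuming article is the Doc produced by nlp; the function name summarize_entities is hypothetical.

```python
def summarize_entities(article, n=5):
    """Return the number of entities and the first n entity spans of a Doc."""
    entities = article.ents  # spaCy exposes the entities as a tuple of Span objects
    return len(entities), entities[:n]
```

The two values can then be printed, for example with total, first = summarize_entities(article) followed by print("Total number of entities in url : ", total) and print("1st 5 entities : ", first, "...").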

The output is as follows

Total number of entities in url :  264

1st 5 entities :  (Bedlam, Profiteers Out-Hustle Good Samaritans - The New York Times, Technology|It, Bedlam, Profiteers Out-Hustle Good Samaritanshttps://nyti.ms/3dRqIGo) ...

Identifying Labels

Each entity in article.ents has a label_ attribute that stores the label assigned to it. Let’s print the first 5 of them.
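As a sketch, the labels can be collected with a list comprehension over the entities; entity_labels is a hypothetical helper name and article is assumed to be the processed Doc.

```python
def entity_labels(article):
    """Return the label string (e.g. 'ORG') of every entity, in document order."""
    return [ent.label_ for ent in article.ents]
```

Printing entity_labels(article)[:5] then shows the first five labels.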

The output is as follows

1st 5 Labels :  ['PERSON', 'ORG', 'GPE', 'PERSON', 'ORG'] ...

Each entity also has a text attribute that returns the matched text.

Now let’s print the first 10 entities together with their labels.
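A possible way to pair the two attributes is sketched below; entity_pairs is a hypothetical name, and pprint is used only because it wraps long lists readably.

```python
from pprint import pprint


def entity_pairs(article, n=10):
    """Return (text, label) tuples for the first n entities of a Doc."""
    return [(ent.text, ent.label_) for ent in article.ents[:n]]


def print_entity_pairs(article, n=10):
    """Pretty-print the first n (text, label) pairs."""
    pprint(entity_pairs(article, n))
```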

The output is as follows

[('Bedlam', 'PERSON'),
 ('Profiteers Out-Hustle Good Samaritans - The New York Times', 'ORG'),
 ('Technology|It', 'GPE'),
 ('Bedlam', 'PERSON'),
 ('Profiteers Out-Hustle Good Samaritanshttps://nyti.ms/3dRqIGo', 'ORG'),
 ('The Coronavirus Outbreak', 'ORG'),
 ('Susan HoughtellingCredit', 'PERSON'),
 ('Shane Lavalette', 'PERSON'),
 ('The New York TimesSectionsSkip', 'ORG'),
 ('Bedlam', 'PERSON')]
...

Let’s count how many entities were found for each label.
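The counting can be sketched with collections.Counter, assuming article is the processed Doc; the function names label_counts and describe_labels are hypothetical.

```python
from collections import Counter


def label_counts(article):
    """Count how many entities carry each label."""
    return Counter(ent.label_ for ent in article.ents)


def describe_labels(article):
    """Summarize the total number of entities and of distinct labels."""
    counts = label_counts(article)
    total = sum(counts.values())
    return f"{total} entities are represented as {len(counts)} unique labels"
```

Printing describe_labels(article) followed by label_counts(article) gives a summary line and the per-label counts.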

The output is as follows

264 entities is represented as 16 unique labels as follows:
      Counter({'ORG': 48, 'GPE': 47, 'DATE': 45, 'PERSON': 44, 'CARDINAL': 28, 'MONEY': 18, 'NORP': 14, 'PERCENT': 8, 'TIME': 3, 'QUANTITY': 2, 'ORDINAL': 2, 'LAW': 1, 'LOC': 1, 'FAC': 1, 'EVENT': 1, 'PRODUCT': 1})

The most frequent tokens can be retrieved using Counter.most_common().
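The sketch below assumes that the counts are taken over entity texts rather than over every token in the document, which is a guess based on the kind of words in the output. The function name frequent_entity_texts is hypothetical.

```python
from collections import Counter


def frequent_entity_texts(article, n=5):
    """Return the n most common entity texts with their counts."""
    # Assumption: frequency is measured over entity texts, not all tokens.
    return Counter(ent.text for ent in article.ents).most_common(n)
```

Calling print("Frequent Tokens are : ", frequent_entity_texts(article)) prints the five most common texts.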

The output is as follows

Frequent Tokens are :  [('Schonfeld', 12), ('China', 11), ('N95', 9), ('Chinese', 6), ('American', 5)]

Visualization

We can visualize the dependency parse and the entities of the article using displacy from the spaCy library.

The output can be saved in .svg and .html formats.

For demonstration purposes, only one sentence from the URL is used to visualize the dependency parse. You can refer to this repository to identify named entities in a string.
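A minimal sketch of the rendering step, assuming doc is a processed Doc (for example nlp applied to a single sentence). displacy.render returns markup as a string when jupyter=False; the function name save_visualizations and the output file names are assumptions.

```python
from pathlib import Path


def save_visualizations(doc, out_dir="."):
    """Write the entity view as HTML and the dependency view as SVG."""
    from spacy import displacy  # imported lazily; requires spaCy to be installed

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # style="ent" highlights entities; page=True wraps them in a full HTML page.
    html = displacy.render(doc, style="ent", page=True, jupyter=False)
    # style="dep" draws the dependency arcs and returns SVG markup.
    svg = displacy.render(doc, style="dep", jupyter=False)

    ent_path = out / "entities.html"
    dep_path = out / "dependency.svg"
    ent_path.write_text(html, encoding="utf-8")
    dep_path.write_text(svg, encoding="utf-8")
    return ent_path, dep_path
```

Rendering the dependency view for a single sentence keeps the SVG narrow enough to read, which is why only one sentence is used for it.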

View Named Entities

View dependency

The following are the Syntactic Dependency Labels

Written by Akash Punagin