Identifying Named Entities from a URL
Looking at the image, you can already get a sense of what named entities are and how they can be represented visually.
In this blog, let’s write the code to identify named entities from a URL. We will be using this URL for the purpose.
You can find this project in my GitHub repository.
The named entity types that can be identified are as follows
As with any Python program, let’s start by importing all the libraries that are required
Now, let’s define a function, url_to_string, to convert the content of the URL to a string
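The original code for this step is not shown here, so the following is only a sketch of what url_to_string could look like. It uses the Python standard library alone (urllib and html.parser). The helper html_to_string, the _TextExtractor class, the set of skipped tags and the User-Agent header are all assumptions made for illustration; a version built on requests and BeautifulSoup would be shorter.

```python
import re
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style> and <aside> content."""

    SKIPPED_TAGS = {"script", "style", "aside"}  # assumed set of noisy tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped tag.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_to_string(html):
    """Strip markup from an HTML string and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()


def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    # Some sites reject requests without a browser-like User-Agent (assumption).
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_string(html)
```

Keeping the download and the HTML cleanup in separate functions means the cleanup can be tried on a small HTML snippet without any network access.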
Let’s use this function to convert the contents of the URL to a string
Load en_core_web_sm into the nlp variable
en_core_web_sm is a small English pipeline that assigns context-specific token vectors, POS (part-of-speech) tags, a dependency parse and named entities
Process URL content with nlp
Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features, and their relationships
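A minimal sketch of these two steps, assuming spaCy and the en_core_web_sm model are installed (for example with python -m spacy download en_core_web_sm). The function name process_text is hypothetical, and the import sits inside the function only so that the helper can be defined before spaCy is available.

```python
def process_text(text, model_name="en_core_web_sm"):
    """Load a spaCy pipeline and run it over text, returning the Doc."""
    import spacy  # requires spaCy and the named model to be installed

    nlp = spacy.load(model_name)  # the nlp object described above
    return nlp(text)              # calling nlp on a string returns a Doc
```

With the string obtained from the URL in hand, article = process_text(text) would give the Doc object that the rest of this post refers to as article.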
Identifying Entities
article.ents returns a tuple of the entities found in the document. Let’s print the first 5 of them
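One way this step might be written is shown below. The helper name summarize_entities is an assumption; it only relies on the ents attribute of a spaCy Doc.

```python
def summarize_entities(doc, n=5):
    """Return the number of entities in doc and the first n of them."""
    entities = doc.ents  # a tuple of Span objects on a spaCy Doc
    return len(entities), entities[:n]
```

For the processed article, total, first_five = summarize_entities(article) followed by print("Total number of entities in url :", total) and print("1st 5 entities :", first_five, "...") prints the two values.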
The output is as follows
Total number of entities in url : 264
1st 5 entities : (Bedlam, Profiteers Out-Hustle Good Samaritans - The New York Times, Technology|It, Bedlam, Profiteers Out-Hustle Good Samaritanshttps://nyti.ms/3dRqIGo) ...
Identifying Labels
Each entity in article.ents has a label_ attribute that stores its identified label. Let’s print the first 5 of them
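A small sketch of collecting the labels, assuming article is the Doc produced earlier. The helper name first_labels is hypothetical.

```python
def first_labels(doc, n=5):
    """Return the label_ of each of the first n entities in doc."""
    return [ent.label_ for ent in doc.ents[:n]]
```

Calling print("1st 5 Labels :", first_labels(article), "...") would then print the labels as a list of strings.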
The output is as follows
1st 5 Labels : ['PERSON', 'ORG', 'GPE', 'PERSON', 'ORG'] ...
Now let’s print the first 10 entities together with their labels
Each entity in article.ents also has a text attribute that returns the text of the identified entity
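Pairing each entity text with its label can be sketched as follows. The helper name entity_label_pairs is an assumption, and pprint is used here only because it prints one pair per line.

```python
from pprint import pprint


def entity_label_pairs(doc, n=10):
    """Return (text, label_) tuples for the first n entities in doc."""
    return [(ent.text, ent.label_) for ent in doc.ents[:n]]
```

With the processed article, pprint(entity_label_pairs(article)) prints the pairs.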
The output is as follows
[('Bedlam', 'PERSON'),
 ('Profiteers Out-Hustle Good Samaritans - The New York Times', 'ORG'),
 ('Technology|It', 'GPE'),
 ('Bedlam', 'PERSON'),
 ('Profiteers Out-Hustle Good Samaritanshttps://nyti.ms/3dRqIGo', 'ORG'),
 ('The Coronavirus Outbreak', 'ORG'),
 ('Susan HoughtellingCredit', 'PERSON'),
 ('Shane Lavalette', 'PERSON'),
 ('The New York TimesSectionsSkip', 'ORG'),
 ('Bedlam', 'PERSON')]
...
Let’s display the details
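Counting how often each label occurs can be done with collections.Counter. This is a sketch under the assumption that article is the Doc from earlier; the helper name label_counts is hypothetical.

```python
from collections import Counter


def label_counts(doc):
    """Count how many entities in doc carry each label."""
    return Counter(ent.label_ for ent in doc.ents)
```

Here len(article.ents) gives the total number of entities and len(label_counts(article)) gives the number of unique labels.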
The output is as follows
264 entities is represented as 16 unique labels as follows:
Counter({'ORG': 48, 'GPE': 47, 'DATE': 45, 'PERSON': 44, 'CARDINAL': 28, 'MONEY': 18, 'NORP': 14, 'PERCENT': 8, 'TIME': 3, 'QUANTITY': 2, 'ORDINAL': 2, 'LAW': 1, 'LOC': 1, 'FAC': 1, 'EVENT': 1, 'PRODUCT': 1})
The most frequent entity texts can be accessed using Counter.most_common()
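A short sketch of this step, assuming the counts are taken over the text of each entity in article.ents. The helper name most_common_entities is hypothetical.

```python
from collections import Counter


def most_common_entities(doc, n=5):
    """Return the n most frequent entity texts in doc with their counts."""
    return Counter(ent.text for ent in doc.ents).most_common(n)
```

For example, print("Frequent Tokens are :", most_common_entities(article)) prints a list of (text, count) pairs, most frequent first.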
The output is as follows
Frequent Tokens are : [('Schonfeld', 12), ('China', 11), ('N95', 9), ('Chinese', 6), ('American', 5)]
Visualization
We can visualize the dependencies and the entities of the article using displacy from the spaCy library.
The output can be rendered in .svg and .html formats.
For demonstration purposes, only a single sentence from the URL is used to visualize the dependencies. You can refer to this repository to identify named entities in a string.
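The rendering step could be sketched as below, assuming spaCy is installed. The function name save_visualizations and the file names dependency.svg and entities.html are made up for illustration; displacy.render with jupyter=False returns the markup as a string, which is then written to disk.

```python
from pathlib import Path


def save_visualizations(doc, out_dir="."):
    """Render doc with displacy and write the markup to two files."""
    from spacy import displacy  # requires spaCy to be installed

    out_path = Path(out_dir)
    # style="dep" produces SVG markup for the dependency parse.
    dep_svg = displacy.render(doc, style="dep", jupyter=False)
    # style="ent" produces HTML; page=True wraps it in a full HTML page.
    ent_html = displacy.render(doc, style="ent", page=True, jupyter=False)

    dep_file = out_path / "dependency.svg"  # hypothetical file name
    ent_file = out_path / "entities.html"   # hypothetical file name
    dep_file.write_text(dep_svg, encoding="utf-8")
    ent_file.write_text(ent_html, encoding="utf-8")
    return dep_file, ent_file
```

Passing a Doc built from a single sentence keeps the dependency diagram narrow enough to read, while the entity view works equally well on the whole article.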
The following are the Syntactic Dependency Labels