Extract data from text with Azure Cognitive Services

Mike K · Published in Version 1 · 6 min read · Mar 3, 2023

Parse text for keywords, people, places & organisations

Named Entity Recognition…

The dreaded “Notes” field. Every data professional's nightmare...

The Azure Language Service can parse free text & extract keywords & “Named Entities” (e.g. people, places or organisations), helping us derive business value from free text data — NICE!

tl;dr

This blog runs through the following:

  • Create an Azure Language Service to parse text (Free!)
  • Set up a Python Environment in VS Code with modules to interact with the Azure Language Service
  • A Python script to pass the text on our local machine to our Azure Language Service & receive results

We then run some demo text through the script & can see the results in a console.

I hope to follow up with a more complete solution where we run the Python code from a Function App & read from blob storage.

Azure Cognitive Services for Language

We’ll see how we can use Named Entity Recognition (NER) with some Python code to analyse these free text fields & extract useful data for our investigation team.

Step 1 — Set up Azure environment

This bit is easy: we’ll create a free instance of the Azure Language Service. Just click “Create a Resource” in the Azure portal & search for “Language Service”.

We can choose the default options with the free pricing tier.

Once it is created there are a couple of properties on the Overview page we are interested in:

  • Endpoint — this is where we send text to be analysed
  • Manage keys — we’ll use keys to authenticate with our service

You’ll need both the Endpoint & the value of one of the keys for the Python code.
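
If you prefer the command line to the portal, an Azure CLI sketch along these lines should also work (the resource & resource group names here are placeholders):

> az cognitiveservices account create --name my-language-svc --resource-group my-rg --kind TextAnalytics --sku F0 --location westeurope --yes
> az cognitiveservices account show --name my-language-svc --resource-group my-rg --query properties.endpoint
> az cognitiveservices account keys list --name my-language-svc --resource-group my-rg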

Step 2 — Python Code

NOTE: I’m not the greatest Python dev & more of an architect/SQL person so please excuse my hacky code!

The code I’m using is here:

Step 3 — Set up dev environment

I’m using Visual Studio Code on Windows. Once we clone the repo we need to perform a few setup steps:

Create a Python Virtual Environment & activate it

In the VS Code Terminal:

> python -m venv venv
> .\venv\Scripts\activate

Create a Virtual Environment called “venv”

Install Pre-Requisite Python modules

Now that we are in our Python Virtual Environment, we can install the modules we require to interact with the Azure Language Service…

Run each of these commands in the VS Code Terminal

> pip install azure.core
> pip install azure-ai-textanalytics
> pip install tabulate
> pip install prettyprint
> pip install pathlib
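
If you prefer, all five modules can also be installed with a single pip command:

> pip install azure.core azure-ai-textanalytics tabulate prettyprint pathlib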

e.g. installing the “prettyprint” Python module

Step 4 — Run the code

The script is called language.py. We run it by calling “python” or “py” from the command line, followed by the name of the script & the 3 arguments it needs:

  1. Azure Language Service “Endpoint” (from the Language Service in the Azure portal)
  2. An authentication key for your language service endpoint (“manage keys” from the Language Svc in the Portal again)
  3. The path to a folder on your computer containing txt files to analyse

Calling our language.py script & passing arguments
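
For example (the endpoint & key here are made up, and the folder path is the third argument per the list above):

> py language.py https://my-language-svc.cognitiveservices.azure.com/ <your-key-here> .\test_text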

There is some text in a folder with the original name of test_text. We can pass the path to this folder as an argument to the script & see some results immediately.

If all goes well you should see output similar to this (after pressing a key to continue!)

NER using Azure Language Services!

We can see locations & roles have been extracted from the text as well as some interesting key phrases.

Also, notice the confidence rating (percentage) — we might only be interested in entities where confidence is above 90%.

Not Working?

If the code doesn’t work for you here are a few things to try:

  • Are you in the right directory at the command line? You need to be in the “cog-svcs_language” folder containing language.py
  • Check the https address to your language service endpoint is correct (don’t use the one in the screengrab)
  • Check authentication keys are correct
  • Have you typed “py” before the script to call the Python interpreter?
  • Is the Python Virtual environment where we have installed the necessary modules activated? (Green “venv” at the start of the command line)
  • Have all the pip install commands been run to install all the modules?
  • Try giving the Full Path to the test_text folder

Output & Code Walkthrough

Finally, an explanation of what this all means & how the code operates. The code fundamentally has 4 parts:

  1. import statements to load the modules that do most of the “clever stuff”
  2. Script Initialisation & command line argument gathering.
  3. Authentication with the Language Service returning a language “client”
  4. Sending text to Language Services for analysis & displaying the results.

Steps 1 & 2 are generic; other people explain arguments & imports better than I can.
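
Still, for completeness, here is a minimal sketch of what those first two parts might look like (the real script may gather its arguments differently):

import sys
from azure.core.credentials import AzureKeyCredential
from azure.ai.textanalytics import TextAnalyticsClient

# grab the 3 positional arguments: endpoint, key & the folder of txt files to analyse
# (a sketch; the real script may use argparse or add validation)
endpoint, key, text_folder = sys.argv[1], sys.argv[2], sys.argv[3]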

Azure Language Service Authentication

This function simply does the job of getting us a Language Service client object which allows us to interact with the Language cloud service.

In order to get that client we instantiate it with a call to the TextAnalyticsClient class from the azure.ai.textanalytics module we imported at the top of the script.

def authenticate_client(p_endpoint: str, p_key: str):
    '''
    Creates a TextAnalyticsClient object to allow us
    to interact with the Cog Svc API

    Parameters:
    p_endpoint - Cog Svcs language REST endpoint
    p_key - Cog Svcs authorisation key

    Returns: TextAnalyticsClient object
    '''
    ta_credential = AzureKeyCredential(p_key)
    ta_client = TextAnalyticsClient(
        endpoint=p_endpoint,
        credential=ta_credential)
    return ta_client

We then return our ta_client object to the calling code.

In prod code, we would want some checks to make sure we have authenticated successfully & the client object is actually valid!
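
Calling the function then looks something like this (the variable names follow the argument-gathering sketch earlier):

# create the client once & reuse it for every document we analyse
ta_client = authenticate_client(endpoint, key)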

Analyse & receive results

To send text for analysis by the Language Service we simply call methods on the client object created above & pass in the text we want to analyse.

Note — at the time of writing there is a 5K character limit on the text we can send to the language service, so we will need to split up bigger documents…
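
One crude way around that limit would be a helper that chops longer text into chunks before we send it, something like this sketch (it splits mid-sentence; a smarter version would break on sentence boundaries):

def chunk_text(p_text: str, p_max_chars: int = 5000) -> list:
    '''Naively split text into pieces of at most p_max_chars characters'''
    return [p_text[i:i + p_max_chars] for i in range(0, len(p_text), p_max_chars)]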

Sentiment analysis is performed with the analyze_sentiment method:

result = p_client.analyze_sentiment(free_text, show_opinion_mining=True)
docs = [doc for doc in result if not doc.is_error]
for idx, doc in enumerate(docs):
    print(f"Overall sentiment: {doc.sentiment}\n")

This returns a list of documents so we have to iterate the list to extract the sentiment analysis.

Further down we perform the Named Entity Recognition (NER) with the recognize_entities method:

result = p_client.recognize_entities(documents=free_text)[0]
for entity in result.entities:
    if float(entity.confidence_score) > 0.8:
        if entity.category == "Person":
            # ...collect the entity for the results table (rest of the loop omitted)

Again, this returns a list, a list of entities this time. Note that I am ignoring entities with a confidence score of less than 80%.

We can check the category of the recognised entity with the category property. I just use this to build a list which is later used with the Tabulate module to collect all results & lay them out nicely in the final table.
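
As a rough illustration of that last step, here is how Tabulate lays out a list of rows (the rows below are made up, not real output):

from tabulate import tabulate

# made-up rows standing in for the entities the script collects
results = [["Person", "Jane Smith", "97%"],
           ["Location", "Dublin", "92%"]]
print(tabulate(results, headers=["Category", "Entity", "Confidence"]))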

Conclusion

Hopefully, this shows just how easy it is to analyse free text with Azure Language Services & extract the key data.

We could now perhaps put the keywords in a more formal structure like a data frame or a parquet table with a reference to the document.

This would then allow us to join the named entities or keywords to existing data sources or maybe point a search tool at the entities & keywords.
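
As a sketch of that idea with pandas (the rows are made up, and you would need pyarrow or fastparquet installed for the parquet write):

import pandas as pd

# made-up rows: one per extracted entity, keyed back to its source document
entity_rows = [["report_01.txt", "Person", "Jane Smith", 0.97],
               ["report_01.txt", "Location", "Dublin", 0.92]]
df = pd.DataFrame(entity_rows, columns=["document", "category", "entity", "confidence"])
df.to_parquet("entities.parquet")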

I’m hoping to push the code into a Function App next & make a full cloud-based text analysis service, analysing new files as they arrive in blob storage & writing results to an Azure table or Data Lake.

About the author:
Mike Knee is an Azure Data Developer here at Version 1.


I’m a computer nerd moving into the autumn of my career & keen to share the learnings, mistakes & triumphs of over 25 years in the technology industry.