Labelling unstructured text data in Python

Understanding the process of labelling text data using Regex and text search algorithms to utilize in supervised machine learning

Rishabh Dwivedi
Brillio Data Science
4 min read · Jun 16, 2020


Labelled data is a crucial requirement for supervised machine learning, so much so that data labelling has become an industry in its own right. For unstructured text data it is an expensive and time-consuming activity, requiring custom techniques and rules to assign appropriate labels.

With the advent of state-of-the-art ML models and frameworks such as TensorFlow and PyTorch, data science practitioners have come to rely on them for a wide range of problems. But these models can only be consumed if they are fed well-labelled training data, and both the cost and the quality of labelling rise with the involvement of subject matter experts (SMEs). These constraints have directed practitioners towards weak supervision, an alternative way of labelling training data that combines high-level supervision from SMEs with noisier inputs abstracted through task-specific heuristics and regular expression patterns. These techniques are employed in open-source labelling frameworks such as Snorkel (via labelling functions) as well as paid proprietary offerings such as Ground Truth and Dataturks.

The proposed solution was built for a multinational enterprise information technology client that develops a wide variety of hardware components as well as software-related services for consumers and businesses. The client deploys a robust service team that supports customers through after-sales services, and it recognized the need for an in-depth, automated, and near-real-time analysis of customer communication logs. Such analysis has several benefits, such as enabling proactive identification of product shortcomings and pinpointing improvements for future product releases.

We developed a two-phase solution strategy to address the problem at hand.

The first task was a binary classification to segregate customer calls into Operating System (OS) and Non-Operating System (Non-OS) calls. Since labelled data was not available in this case, we resorted to regular expressions for this classification exercise. Using regex has the added utility of labelling the data into its respective category. In the second phase, we targeted the ‘Non-OS’ category to tag other features.
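A minimal sketch of how such a regex pass might look (the term list, log text, and the helper `is_os_call` are illustrative, drawn from the sample terms later in this post rather than the client's actual corpus):

```python
import re

# Illustrative OS-related terms; the real corpus is built with SME input
os_terms = ["rhel", "redhat", "os install", "no boot", "subscription"]

# One compiled pattern with word boundaries so "os" does not match inside "cost"
os_pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in os_terms) + r")\b")

def is_os_call(log_text: str) -> bool:
    """Return True if any OS-related term appears in the call log."""
    return bool(os_pattern.search(log_text.lower()))

print(is_os_call("Customer reports no boot after RHEL subscription renewal"))  # True
print(is_os_call("Replaced faulty memory module, issue resolved"))             # False
```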

The stepwise solution approach is as follows:

Preprocessing:

1. Create a corpus of frequently used OS phrases and abbreviations (e.g. windows install, windows activation, deployment issue, windows, VMware)

2. Similarly, form a corpus of phrases and words that may occur alongside the OS phrases but indicate non-OS calls.
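In Python, these two corpora can be kept as simple lists; the entries below are illustrative, taken from the sample terms later in this post rather than the client's full corpus:

```python
# OS-related phrases and abbreviations (illustrative subset)
os_corpus = [
    "windows install", "windows activation", "deployment issue",
    "windows", "vmware", "rhel", "redhat", "os install", "no boot",
]

# Phrases that co-occur with OS terms but point to non-OS (e.g. hardware) calls
non_os_corpus = [
    "hw", "hardware", "disk error",
]
```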

Core steps:

1. Standard text cleaning procedures such as:

a) Convert text to lowercase

b) Remove multiple spaces

c) Remove punctuation and special characters

d) Remove non-ASCII characters

2. In the first search pass, identify OS-related words and phrases to tag the relevant calls as OS calls

3. In the second search pass, identify non-OS-related words and phrases to tag calls related to features other than operating systems. This is needed because most call logs record the system configuration, which may otherwise lead to calls being falsely tagged as OS.

Details for the phrase and word search:

a) Create a dictionary over all the text: split the text of each row into words and store the resulting word list as the dictionary value against the text or its unique id.

b) Split each phrase of the corpus into words and search for each word of the phrase in every element of the dictionary. If all the words of the phrase are present in a given element, tag the respective text or unique id accordingly.

c) Similarly, search for the single-word entries of the corpus across all the text and tag the matching calls accordingly.

Flow-chart of the search process

Code Snippets

Text cleaning:
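The original snippet is shared as an image; a minimal re-creation of the cleaning steps described above might look like this (the function name and sample text are placeholders):

```python
import re

def clean_text(text: str) -> str:
    """Apply the standard cleaning steps: lowercase, strip non-ASCII,
    drop punctuation/special characters, and collapse repeated spaces."""
    text = text.lower()                              # a) lowercase
    text = text.encode("ascii", "ignore").decode()   # d) remove non-ASCII characters
    text = re.sub(r"[^a-z0-9\s]", " ", text)         # c) remove punctuation / special chars
    text = re.sub(r"\s+", " ", text).strip()         # b) collapse multiple spaces
    return text

print(clean_text("Windows® Activation FAILED!!  ticket #42"))
# -> "windows activation failed ticket 42"
```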

Phrase search:
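The phrase-search snippet is also an image in the original post; the sketch below follows the dictionary approach described in the core steps (the data structures, call ids, and corpora passed in are illustrative):

```python
def build_word_dict(texts):
    """Map each call id to the set of lowercased words in its text (step a)."""
    return {call_id: set(text.lower().split()) for call_id, text in texts.items()}

def tag_calls(word_dict, os_corpus, non_os_corpus):
    """Two search passes: flag OS terms first, then non-OS terms, so that
    configuration mentions alone do not falsely tag a call as OS."""
    labels = {}
    for call_id, words in word_dict.items():
        # A phrase matches when all of its words appear in the call's word set
        has_os = any(set(phrase.split()) <= words for phrase in os_corpus)
        has_non_os = any(set(phrase.split()) <= words for phrase in non_os_corpus)
        labels[call_id] = "OS" if has_os and not has_non_os else "Non-OS"
    return labels

calls = {101: "customer needs help with windows activation",
         102: "disk error reported windows install logs attached"}
print(tag_calls(build_word_dict(calls),
                ["windows activation", "windows install"],
                ["disk error", "hw"]))
# -> {101: 'OS', 102: 'Non-OS'}
```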

Limitations

1. Currently, the text is searched only for the phrases of a single product and tagged accordingly. As an improvement, phrases for multiple products could be included and the calls tagged in a similar fashion.

2. We can also include translations for foreign-language logs and check for spelling mistakes.

3. Domain experts can help create a dedicated set of words and phrases for each product, which can make the solution more customizable for different industry segments.

Sample search results

1. OS Terms: RHEL, RedHat, OS install, no boot, subscription

2. Non-OS Terms: HW (Hardware), Disk Error

Proposed Future Enhancements

1. The labelled training data can be used to train an NLP-based binary classification model that classifies the call logs into OS and Non-OS classes.

2. Textual data needs to be converted into vectorized form, which can be achieved by using word embeddings for each token in the sentence. We can use pre-trained open-source embeddings like FastText, BERT, GloVe, etc.

3. State-of-the-art neural network models can be used for the classification task, with RNN/GRU/LSTM layers to learn representations for text sequences (a rough sketch follows below).
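As a rough sketch of such a future model (not part of the delivered solution), a Keras classifier with an embedding layer feeding an LSTM might look like the following; the vocabulary size, sequence length, and layer sizes are placeholder values:

```python
import tensorflow as tf

VOCAB_SIZE = 20000   # placeholder vocabulary size
MAX_LEN = 100        # placeholder maximum call-log length (tokens)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),
    # Embedding weights could be initialised from pre-trained FastText/GloVe vectors
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(64),                        # sequence representation
    tf.keras.layers.Dense(1, activation="sigmoid"),  # OS vs Non-OS probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```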

Let's connect on LinkedIn.


Rishabh Dwivedi
Brillio Data Science

Masters in Economics from Delhi School of Economics and currently employed as Data Scientist at HPE.