A Practical Proof-Of-Concept: Using Natural Language Processing to Parse Unstructured Text

Finastra Fintechs & Devs
Feb 28, 2020
In this article, we will show you how we developed and applied Natural Language Processing in a proof of concept built around one of our products, the Finastra Payment Initiation API.

One of the latest technology trends is the personal assistant — the most popular being Apple’s Siri, Google’s Assistant, Microsoft’s Cortana, and Amazon’s Alexa. Have you ever wondered how you can ask Siri to give you the latest news or to set you a reminder and she knows exactly what to do? These technologies use Natural Language Processing, which is an interdisciplinary field that overlaps with computer science and linguistics. The goal is for a computer to interpret how humans talk — the sentence structure, syntax, grammar, semantics — and then execute certain decisions.

Currently, the financial world still runs on a great deal of manual labor. Bank tellers and employees are overwhelmed with emails and calls about payments and monetary transfers that must be sorted and submitted by hand, which adds days of delay to processing time (and people can be rather impatient when it comes to money!). One way to improve the present structure is to automatically extract the relevant details from these text channels and use them to initiate payment transfers. In this project specifically, we created a proof-of-concept machine learning model that parses email text and then calls an API, which authorizes and submits the payment transfer request.

We decided to create a Conditional Random Field (CRF) model that uses Named Entity Recognition (NER) to extract the targeted text in the data. CRFs are a type of probabilistic model that labels and assigns probabilities to sequences, while NER is an information extraction method that detects and classifies textual information. Together, we have a model that parses through unstructured data and extracts the information that is needed to fulfill a payment transfer.
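To make this concrete, here is a made-up example (not taken from our data) of the token-level labeling we want the model to produce, where 'O' marks tokens that do not belong to any entity:

tokens = ['Please', 'transfer', '500', 'USD', 'from', 'account', 'aB3x9', 'to', 'account', 'Zk81q']
labels = ['O', 'O', 'amount', 'currency', 'O', 'O', 'accountA', 'O', 'O', 'accountB']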

Generating Data

Because of privacy and anonymization complications with the production data, we decided to generate our own “fake” data. Here is a sample of code that shows how we generated some fake data:

import random
import string
from faker import Faker

for i in range(2000):
    r = random.random()
    r2 = random.random()
    f = Faker()
    An = random.randint(5, 12)
    Bn = random.randint(5, 12)
    if r2 > 0.5:
        # Alphanumeric account identifiers of random length
        accountA = ''.join(random.choice(string.ascii_uppercase + string.digits + string.ascii_lowercase) for _ in range(An))
        accountB = ''.join(random.choice(string.ascii_lowercase + string.digits + string.ascii_lowercase) for _ in range(Bn))
    else:
        # Purely numeric account identifiers
        accountA = ''.join("%s" % random.randint(0, 9) for _ in range(9))
        accountB = ''.join("%s" % random.randint(0, 9) for _ in range(9))
    amount = str(random.randint(2000, 10000000))
    if r < 0.05:
        name = f.name()
        s = ("Goodmorning " + name + ", I have a client in need of a money transfer. Could you transfer "
             + amount + " from account " + accountA + " to account " + accountB
             + ". Please let me know as soon as this is complete. Thanks!")
    elif r >= 0.05 and r < 0.08:
        name = f.name()
        s = ("Hi, I hope this email finds you well. We have an urgent request. Could $" + amount
             + " be transferred into account " + accountB + " from " + accountA
             + "? Let me know. Have a great rest of your evening. Best regards, " + name)
    # ... further elif/else branches (omitted here) cover the remaining email templates

We chained together if, elif, and else blocks keyed off random draws, so that each fake email follows one of several templates.

This is a sample of the output of our generated data.

Cleaning the Data

Because we are working with unstructured data (data that does not follow pre-defined rules and usually consists of complex, inconsistent patterns), we need to clean it. This is to give it a little more organized structure and to make it as easy to parse as possible. Here's an example of a function we wrote to format punctuation:

import re

def fix_punct(i):
    # Detach punctuation that is stuck to a word, then strip sentence punctuation
    i = re.sub(r"([a-z]+)([.()!])", r'\1 ', i)
    i = i.replace(".", " ")
    i = i.replace("?", " ")
    i = i.replace("!", " ")
    i = i.strip()
    return i

df['email'] = df['email'].apply(fix_punct)

Now that we have some clean data, we need to pull out the text and assign the labels, which gives us something like this:

Sample Cleaned Sentence:
Please transfer $ 7926233 SBD from account ciJf to account wIVU Thank you

Sample Labels:
('7926233', 'ciJf', 'wIVU', 'SBD')
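The bookkeeping that pairs each cleaned sentence with its label tuple is not shown above; one minimal sketch is below. The sentences and labels lists, and the use of Faker's currency_code() as the source of currency codes like 'SBD', are our assumptions rather than code from the project:

sentences, labels = [], []

# ... inside the generation loop, after building the email string s
#     (templates that mention a currency would also insert it into the text):
currency = f.currency_code()           # assumed source of currency codes such as 'SBD'
sentences.append(fix_punct(s))         # the cleaned email text
labels.append((amount, accountA, accountB, currency))  # ground-truth entities in a fixed order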

These cleaned sentences and their label tuples are what we will use to generate our features.

Generating Features

Using our data, we need to tokenize each sentence and generate a part-of-speech tag for every token in the sentence.

import nltk

pre_feats = []
for sidx, s in enumerate(sentences):
    sent_text = nltk.sent_tokenize(s)  # this gives us a list of sentences
    for idx, sentence in enumerate(sent_text):
        sent_labels = []
        all_info = []
        tokenized_text = nltk.word_tokenize(sentence)
        # Label each token by matching it against the known entities for this email
        for j in tokenized_text:
            if j == labels[sidx][0]:
                sent_labels.append('amount')
            elif j == labels[sidx][1]:
                sent_labels.append('accountA')
            elif j == labels[sidx][2]:
                sent_labels.append('accountB')
            elif j == labels[sidx][3]:
                sent_labels.append('currency')
            else:
                sent_labels.append('O')
        tagged = nltk.pos_tag(tokenized_text)
        # Bundle (token, POS tag, label) tuples for each sentence
        for idx2, tag in enumerate(tagged):
            l = (tag[0], tag[1], sent_labels[idx2])
            all_info.append(l)
        pre_feats.append(all_info)

We then build a Python dictionary of features for each token, which is what gets fed into the model. The features include whether the word is uppercase or title-cased, whether it is a digit, its last two and three characters, and its part-of-speech tag, along with the same information for the neighboring tokens.

def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]

This constitutes our feature vectors. Now we must create our variables: X is our input variable, and y is our output variable. Think of functions in math, which are mappings of the form y = f(X). That is precisely what we are doing here: the algorithm learns the mapping from the input features to the target labels.

# Create input feature vector
X = [sent2features(i) for i in pre_feats]
# Format the target vector
y = [sent2labels(i) for i in pre_feats]

Creating our Conditional Random Field Model

Here we instantiate our model:

import scipy.stats
from sklearn_crfsuite import CRF as skCRF  # sklearn-crfsuite's CRF estimator, aliased here to match the code below

crf = skCRF(
    algorithm='lbfgs',
    max_iterations=100,
    all_possible_transitions=True
)

# Search space for the L1/L2 regularization coefficients c1 and c2
params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

Notice that we have a couple of parameters: algorithm, max_iterations, c1, c2, and all_possible_transitions. We can change these parameters to optimize and improve the performance of our model, which is known as hyperparameter tuning.
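The tuning step itself is not shown above; a minimal sketch of how params_space could be used with scikit-learn's RandomizedSearchCV and sklearn-crfsuite's flat F1 metric looks like this (the entity_labels name and the cv/n_iter values are our choices, not the project's):

from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV
from sklearn_crfsuite import metrics

entity_labels = ['amount', 'accountA', 'accountB', 'currency']  # score only the entity tags, not 'O'
f1_scorer = make_scorer(metrics.flat_f1_score, average='weighted', labels=entity_labels)

rs = RandomizedSearchCV(crf, params_space, cv=3, n_iter=20, scoring=f1_scorer, verbose=1)
rs.fit(X, y)
print(rs.best_params_, rs.best_score_)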

Training/Testing and Performance Evaluation

After generating the features, we split the data into training and testing sets. The held-out test set is what we use to evaluate the performance of our model.

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
crf.fit(X_train, y_train)
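The scoring code that produced the numbers below is not included in the article; a plausible sketch using sklearn_crfsuite.metrics (a sequence-level exact-match accuracy plus a flat per-label report) would be:

from sklearn_crfsuite import metrics

y_pred = crf.predict(X_test)
print("Exact Accuracy: ", round(100 * metrics.sequence_accuracy_score(y_test, y_pred), 2), "%")
print("Flat Classification Report:")
print(metrics.flat_classification_report(y_test, y_pred))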

Now we look at the performance report:

Exact Accuracy:  97.04 %

Flat Classification Report:
              precision    recall  f1-score   support

           O       1.00      1.00      1.00     64327
    accountA       0.99      0.98      0.99      2559
    accountB       0.98      0.99      0.99      2732
      amount       1.00      1.00      1.00      2700
    currency       1.00      1.00      1.00      2615

   micro avg       1.00      1.00      1.00     74933
   macro avg       0.99      0.99      0.99     74933
weighted avg       1.00      1.00      1.00     74933

Our results look spectacular!

Saving and Querying the Model

If we want to save and reuse what the algorithm has learned, we can save it to a .pkl file.

import joblib

filename = 'best_crf_model.pkl'
_ = joblib.dump(crf, filename)

Now, to query it, we use the beauty of object-oriented programming and create our own class. The class loads the model, handles the predictions, and displays our results in a readable manner.

import re
import joblib
import nltk

class CRF_NER:
    def __init__(self, model_filepath):
        self.model_filepath = model_filepath

    def load_model(self):
        self.crf = joblib.load(self.model_filepath)

    def query(self, sentence):
        original_sentence = sentence
        # Apply the same punctuation cleanup used on the training data
        sentence = re.sub(r"([a-z]+)([.()!])", r'\1 ', sentence)
        sentence = sentence.replace(".", " ")
        sentence = sentence.replace("?", " ")
        sentence = sentence.replace("!", " ")
        sentence = sentence.replace(",", "")
        sentence = sentence.strip()
        sentence = sentence.replace('[', "")
        sentence = sentence.replace(']', "")
        sentence = re.sub('[()""“”{}<>]', '', sentence)
        sentence = re.sub(r"\$([0-9+])", "$ \\1", sentence)

        sent_text = nltk.sent_tokenize(sentence)
        tokenized_text = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokenized_text)

        # Same feature functions as at training time
        def word2features(sent, i):
            word = sent[i][0]
            postag = sent[i][1]
            features = {
                'bias': 1.0,
                'word.lower()': word.lower(),
                'word[-3:]': word[-3:],
                'word[-2:]': word[-2:],
                'word.isupper()': word.isupper(),
                'word.istitle()': word.istitle(),
                'word.isdigit()': word.isdigit(),
                'postag': postag,
                'postag[:2]': postag[:2],
            }
            if i > 0:
                word1 = sent[i-1][0]
                postag1 = sent[i-1][1]
                features.update({
                    '-1:word.lower()': word1.lower(),
                    '-1:word.istitle()': word1.istitle(),
                    '-1:word.isupper()': word1.isupper(),
                    '-1:postag': postag1,
                    '-1:postag[:2]': postag1[:2],
                })
            else:
                features['BOS'] = True
            if i < len(sent)-1:
                word1 = sent[i+1][0]
                postag1 = sent[i+1][1]
                features.update({
                    '+1:word.lower()': word1.lower(),
                    '+1:word.istitle()': word1.istitle(),
                    '+1:word.isupper()': word1.isupper(),
                    '+1:postag': postag1,
                    '+1:postag[:2]': postag1[:2],
                })
            else:
                features['EOS'] = True
            return features

        def sent2features(sent):
            return [word2features(sent, i) for i in range(len(sent))]

        def sent2labels(sent):
            return [label for token, postag, label in sent]

        def sent2tokens(sent):
            return [token for token, postag, label in sent]

        # Build features for the single tagged sentence and predict its label sequence
        X = [sent2features(i) for i in [tagged]]
        pred = self.crf.predict(X)
        p = pred[0]

        # Collect the tokens the model tagged as entities
        result = {}
        for idx, i in enumerate(p):
            if i != 'O':
                result[i] = tagged[idx][0]

        # Wrap each detected entity in a colored <span> for display
        colors = {'amount': 'blue', 'accountA': 'red', 'accountB': 'orange', 'currency': 'green'}
        keys = list(result.keys())
        vals = list(result.values())
        for idx, i in enumerate(keys):
            if i in colors:
                insert_str_0 = "<span style='color:" + colors[i] + "'>"
                insert_str_1 = "</span>"
                sent_idx = original_sentence.find(vals[idx])
                original_sentence = (original_sentence[:sent_idx] + insert_str_0
                                     + original_sentence[sent_idx:sent_idx+len(vals[idx])]
                                     + insert_str_1
                                     + original_sentence[sent_idx+len(vals[idx]):])

        printmd(original_sentence)  # printmd: notebook helper that renders the HTML-highlighted string
        return result

Now that we have created the class, we instantiate it and load the model.

filename = "best_crf_model.pkl"
c = CRF_NER(filename)
c.load_model()

FINALLY, we can test our model to see whether it really works. We can write any sample text that contains two bank accounts (one we are debiting and the other we are crediting through the payment), an amount, and the currency.
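For example, a query might look like the following; the email text, account identifiers, and the printed result are made up for illustration:

sample = ("Hi team, could you please transfer $4500 USD from account h7Kp2x "
          "to account Qw9Lm3 by end of day? Thanks!")
result = c.query(sample)   # also prints the sentence with the detected entities highlighted
print(result)
# Illustrative output: {'amount': '4500', 'currency': 'USD', 'accountA': 'h7Kp2x', 'accountB': 'Qw9Lm3'}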

In our CRF_NER Class, we told it to highlight each label with a certain color. This way, we can easily identify whether the algorithm is working properly!

Hooray, it works! At the same time, passing new queries through the model lets us investigate where it can be improved. For example:

We can see here that the algorithm is incorrectly labeling accountB.

One thing we did not account for when generating our fake data was email signatures. Our next step would be to train the model on data that includes email signatures, so that the algorithm learns to ignore them.
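As a sketch of that next step, the data generator could append a synthetic signature block to some of the emails, so the model sees names and phone numbers that are not payment entities. The Faker fields used here are our illustration, not code from the project:

# Hypothetical augmentation inside the generation loop, after building the email string s
if random.random() < 0.5:
    signature = "\n".join([
        "Best regards,",
        f.name(),
        f.job() + ", " + f.company(),
        f.phone_number(),
    ])
    s = s + "\n\n" + signature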

Using the Model

Now that we have a trained model ready for use, we developed a Flask application for Gmail and Outlook that submits an HTTP request to one of our existing products (Finastra FusionFabric.cloud's Payment Initiation API) to process payments efficiently. All we have to do is send a POST request to the API; it's really that easy! If you want to read more about the software behind it all, see part 2, where we go into detail about how we implemented our model in a Flask application.
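As a rough illustration of that final step, the entities extracted by the model could be posted to a payment endpoint. The URL, payload fields, and token below are placeholders, not the actual Payment Initiation API schema:

import requests

payload = {
    "debtorAccount": result["accountA"],    # account to debit
    "creditorAccount": result["accountB"],  # account to credit
    "amount": result["amount"],
    "currency": result["currency"],
}
response = requests.post(
    "https://api.example.com/payments/initiate",         # placeholder endpoint
    json=payload,
    headers={"Authorization": "Bearer <access_token>"},  # placeholder credentials
)
print(response.status_code)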

Looking Ahead

Although this project was just a proof of concept, we were quite successful in demonstrating how we can apply artificial intelligence to expedite processes that currently rely on manual labor. As we move forward with more advanced algorithms, models, and datasets, we can construct much more efficient ways to tackle the ginormous world of unstructured data. This is only the tip of the iceberg; imagine the other ways we can boost human productivity by reducing manual labor with NLP. What other ways can you think of? We'd love to hear your thoughts and feedback. Don't forget to read part 2 :)

Thanks for reading! Let me introduce my partners on this project:

Adam Lieberman has a degree in mathematics and a master's in machine learning, both from Georgia Tech. He leads the data science team at Finastra and is passionate about taking proof-of-concept projects and turning them into real-world, production solutions.

Josh Abelman is a software engineer in Finastra’s Innovation Lab. He works with the data science team to help bring models to life. His main interests are full-stack development and deep learning.
