Build An NLP Project From Zero To Hero (6): Model Integration

Khaled Adrani · Published in UBIAI NLP · Feb 13, 2022

Successfully training a machine learning model is just the beginning. Integrating it into a business application is a whole new challenge. In this article, we introduce the notion of ML model integration and give a simple demonstration of the concept: we will build a web service with FastAPI, a high-performance and easy-to-learn Python web framework. The service will expose our trained spaCy NER transformer model through its API, and we will also use the Twitter API to simulate getting live data.

Model integration refers to adding machine learning models as a feature of production software, and it is well known to be the most challenging step in the machine learning project workflow. According to this survey from Statista, around 54% of organizations took more than a month to actually deploy their model in 2021.

First, we will briefly explain how to get started with the Twitter Developer API, then elaborate on the web service structure, and end with a small demo.

Starting with the Twitter Developer API

If you are new to the concept of APIs (Application Programming Interfaces), think of them as connections between computers or between computer programs. Here, we want to establish a connection between our own web service and the Twitter API, which will provide us with useful data.

First, follow the steps in this guide.

After successfully creating an application with the Twitter API, copy and save the credential details to a safe location: your API Key, your API Secret Key, your Bearer Token, your Access Token, and your Access Token Secret. In case you missed them during the configuration of your application, you can find them under the ‘Keys and tokens’ tab on your project page in the Twitter Developer Portal.

Twitter Developer Portal

Then, we can make a test request with Postman to check that the project configuration is working. Postman is an API platform for building and using APIs. Create an account and download the desktop application.

Now, we need an environment within Postman that includes all our credentials by default, so we do not need to add them to every request. Create one, make sure the variable names match the picture below, and copy each credential value into both the initial and current value fields:

Now, create a new HTTP request with the URL below, and do not forget to select its corresponding environment. Here we are going to get the ten latest tweets containing the keyword ‘stocks’ and extract their public metrics: retweet_count, reply_count, like_count, and quote_count.

https://api.twitter.com/2/tweets/search/recent?query=stocks&tweet.fields=public_metrics&max_results=10
Testing Twitter API response
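If you prefer to check from code rather than Postman, the same request can be reproduced with Python’s requests library. This is a minimal sketch, assuming you replace the placeholder with your own Bearer Token:

import requests

# placeholder: paste your own Bearer Token here
bearer_token = "YOUR_BEARER_TOKEN"

url = "https://api.twitter.com/2/tweets/search/recent"
params = {
    "query": "stocks",
    "tweet.fields": "public_metrics",
    "max_results": 10,
}
headers = {"Authorization": "Bearer " + bearer_token}

response = requests.get(url, params=params, headers=headers)
print(response.status_code)  # expect 200 if the credentials are valid
print(response.json())       # the tweets and their public metrics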

Now, we can proceed to build our own API.

Stock Market Tweets Analyzer

Not really a fancy name for an application but it is honest work.

The project structure is as follows:

project structure
  • main.py: the main file that contains our service and exposes it to the outside.
  • .env: contains the environment variables of the application, namely all of its confidential data and credentials (an example is shown right after this list).
  • env: this folder is the result of creating a Python virtual environment; it contains all our dependencies and the path to the Python executable for our project. Do not forget to create this environment with:
python -m venv env
  • .gitignore: some files and directories are better left untracked by version control. For example, we ignore the __pycache__ and .vscode folders.
  • readme: if you want to explain the work done in the repository.
  • utils folder: notice the __init__.py file. It is a package that contains two modules, twitter_api and nlp.
  • trf_ner folder: the files of the trained NER model, which I downloaded from Google Drive.
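For reference, the .env file is just a list of key=value pairs. The variable names below are the ones read later by create_headers() in utils/twitter_api.py; the values are placeholders you replace with your own credentials:

api_key=YOUR_API_KEY
api_key_secret=YOUR_API_KEY_SECRET
bearer_token=YOUR_BEARER_TOKEN
access_token=YOUR_ACCESS_TOKEN
access_token_secret=YOUR_ACCESS_TOKEN_SECRET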

To install all the needed dependencies:

pip install "fastapi[all]" #which includes also uvicorn, a lightweight serverpip install spacy==3.2.1 -U spacy[transformers] python-dotenv requests

Now, the basic idea is to build upon the request we used in Postman. For now, our web service will accept a request containing a keyword to search tweets for and the maximum number of tweets to return. It will then append the entities extracted by our NER model to each tweet and return the response to the requester.

To get started with FastAPI, I recommend their documentation, as it is very intuitive. This is the main.py file that holds our application logic. The code is very straightforward: we declare a FastAPI application and use it to build our routes:

from fastapi import FastAPI
from pydantic import BaseModel
from utils.nlp import extract_ents
from utils.twitter_api import get_response

app = FastAPI()


class Query(BaseModel):
    keyword: str
    max_results: int


@app.get("/")
async def root():
    return {"message": "Hello to Stock Market NLP Analyzer"}


@app.get("/get_tweet_ents")
async def root_post(query: Query):
    return {"query": query}


@app.post("/get_tweet_ents")
async def get_tweet_ents(query: Query):
    data = get_response(query.keyword, query.max_results)
    data = extract_ents(data)
    return data

You are probably not familiar with pydantic’s BaseModel. Long story short, we use it to define the schema of our query, and it gives us a very simple means of validation.
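To see the validation in action, here is a minimal sketch using FastAPI’s TestClient (it assumes main.py is importable, which means the dependencies and the trf_ner model must be in place):

from fastapi.testclient import TestClient
from main import app

client = TestClient(app)

# "max_results" is missing, so pydantic rejects the request before our handler runs
response = client.post("/get_tweet_ents", json={"keyword": "stocks"})
print(response.status_code)  # 422 Unprocessable Entity
print(response.json())       # details about the missing field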

The nlp module contains all the code necessary to load our NER model and implement its functionality; we just need to import the extract_ents function:

import spacy
import re

# load the trained NER pipeline once at import time
ner = spacy.load('trf_ner\model-best')


def test_model():
    '''
    Check if the model is loaded properly
    '''
    samples = ["Facebook has a price target of $ 20 for this quarter",
               "$ AAPL is gaining a new momentum"]

    for doc in ner.pipe(samples, disable=['tagger', 'parser']):
        for ent in doc.ents:
            print(ent.label_, ent.text)
        print('-----')


def clean_tweets(texts):
    '''
    Preprocessing necessary for tweets: removing URLs and ellipses
    '''
    filtered = []
    url_pattern = "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"
    for text in texts:
        string = re.sub(r'' + str(url_pattern), '', text, flags=re.MULTILINE)
        string = re.sub(r'…', '', string)
        string = re.sub(r'\.\.\.', '', string)
        filtered.append(string)

    return filtered


def extract_ents(data):
    '''
    Main function to implement the NER functionality
    '''
    texts = [tweet['text'] for tweet in data]
    for index, doc in enumerate(ner.pipe(clean_tweets(texts), disable=['tagger', 'parser'])):
        data[index]['entities'] = [{'text': ent.text, 'label': ent.label_} for ent in doc.ents]
    return data

A note about the clean_tweets function: at first, I forgot the preprocessing we had applied to our training dataset, and the model failed completely. So never forget this detail! This step was done in our data preprocessing episode.
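As a quick sanity check, here is a hypothetical snippet showing what clean_tweets does to a tweet before it reaches the model (the tweet text and URL are made up):

from utils.nlp import clean_tweets

sample = ["$ AAPL is gaining a new momentum… https://t.co/abc123"]
print(clean_tweets(sample))
# the ellipsis and the URL are stripped, leaving roughly:
# ['$ AAPL is gaining a new momentum ']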

The twitter_api module implements everything needed to communicate with the Twitter API. The code was inspired by this amazing article, so check it out for further understanding! Be aware that the article targets the Academic Research access level, which is not covered by the Essential access we signed up for at the beginning of the project.

import os
from dotenv import load_dotenv
import requests

# load your credentials through the .env file
load_dotenv()


def create_headers():

    api_key = os.getenv('api_key')
    api_key_secret = os.getenv('api_key_secret')
    bearer_token = os.getenv('bearer_token')
    access_token = os.getenv('access_token')
    access_token_secret = os.getenv('access_token_secret')

    # only the Authorization header is strictly required for the recent search endpoint
    headers = {
        "access_token": access_token,
        "access_token_secret": access_token_secret,
        "Authorization": 'Bearer ' + bearer_token,
        "api_key_secret": api_key_secret,
        "api_key": api_key
    }

    return headers


def create_url(keyword, max_results=10):

    search_url = "https://api.twitter.com/2/tweets/search/recent?"

    query_params = {'query': keyword,
                    'max_results': max_results,
                    'tweet.fields': 'public_metrics'}

    return (search_url, query_params)


def connect_to_endpoint(url, headers, params):

    response = requests.request("GET", url, headers=headers, params=params)
    print("Endpoint Response Code: " + str(response.status_code))
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()


def get_response(keyword="stocks", max_results=10, verbose=False):
    headers = create_headers()
    # pass max_results through instead of hardcoding it
    url = create_url(keyword, max_results=max_results)
    json_response = connect_to_endpoint(url[0], headers, url[1])

    if verbose:
        print(json_response)
        print(type(json_response))

    return json_response['data']

Run your application by using uvicorn:

uvicorn main:app --reload --port 5000

Now, let us test our API! Make sure to use the correct route (/get_tweet_ents), set the request type to ‘POST’, and write your request body as raw JSON:

{
  "keyword": "stocks",
  "max_results": 10
}
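If you prefer to test from Python instead of Postman, here is a minimal sketch, assuming the service is running locally on port 5000:

import requests

payload = {"keyword": "stocks", "max_results": 10}
response = requests.post("http://127.0.0.1:5000/get_tweet_ents", json=payload)
print(response.status_code)
print(response.json())  # tweets with their public metrics and extracted entities

FastAPI also generates interactive documentation at http://127.0.0.1:5000/docs, where the same route can be tried directly from the browser.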

We have noticed that the model was not that good with tweets that did not talk directly and mainly about the stock market (it tagged TIGRAY as a COMPANY, for example). We have also noticed that it sometimes mistakes a famous PERSON in a tweet for a company, because many organizations in the training data began with the ‘@’ symbol. And there are of course some mistakes here and there: in one example, the model tagged ‘amp’ as a TICKER, which it is not (‘amp’ stands for Auction Market Preferred). As you can see, there are many more complex examples left to learn.


{
    "id": "1487848993598033922",
    "public_metrics": {
        "retweet_count": 69,
        "reply_count": 0,
        "like_count": 0,
        "quote_count": 0
    },
    "text": "RT @Nayakone: Mutual Fund Top Holding Stocks \n\n- Infosys\n- TCS\n- HDFC Bank\n- SBI\n- Airtel \n- L&T\n- HDFC\n- Reliance Ind\n- Kotak Bank\n- ICICI…",
    "entities": [
        {"text": "@Nayakone", "label": "COMPANY"},
        {"text": "Mutual Fund", "label": "COMPANY"},
        {"text": "Infosys", "label": "COMPANY"},
        {"text": "TCS", "label": "TICKER"},
        {"text": "HDFC Bank", "label": "COMPANY"},
        {"text": "SBI", "label": "TICKER"},
        {"text": "Airtel", "label": "COMPANY"},
        {"text": "L&T", "label": "TICKER"},
        {"text": "HDFC", "label": "TICKER"},
        {"text": "Reliance Ind", "label": "COMPANY"},
        {"text": "Kotak Bank", "label": "COMPANY"},
        {"text": "ICICI", "label": "TICKER"}
    ]
},
...

Remember that we limited the scope of our training data (only tweets that talk about companies by name or ticker), its size (400 tweets in total), and its sources (a financial tweets dataset from Kaggle). This was of course to keep things simple in this series. In fact, when we tried the model on a new source with a different style of tweets, it unsurprisingly failed.

We mentioned this point during the data labeling and data preprocessing episodes; it shows just how crucial it is to make your training data representative. Given this, one should think of fine-tuning the model further.

Conclusion

In this article, we learned how to quickly integrate a spaCy NLP model into a web application and use it to serve users or other services over the HTTP protocol. We also learned to leverage existing APIs such as the Twitter Developer API. You can be proud now, because you have made something tangible and progressed beyond just ‘training’ models. Many people forget what matters about machine learning in general: it is not about having excellent-performing models, it is about having practical, working models.

If you have questions, do not hesitate to contact me on LinkedIn or Twitter.

If you would like to request a demo, please email us at admin@ubiai.tools or reach out on Twitter.

Happy learning and see you in the next article!
