Building a custom Named Entity Recognition model using spaCy — Consume the Model — Part 3

Johni Douglas Marangon
Nov 21, 2023


This final post completes our end-to-end NER application, wrapping up this journey into the fascinating world of NLP tasks.

Here, we will create a REST API to consume the model built in the previous post.

Create a REST API

A REST API, also known as a RESTful API, is an application programming interface (API) that adheres to the constraints of the REST architectural style and enables interaction with RESTful web services. RESTful APIs have become widely adopted due to their simplicity, flexibility, and scalability, making them a popular choice for building modern web applications and services.

FastAPI is a high-performance web framework for building RESTful APIs with Python. It is built on the Starlette ASGI framework and the Pydantic data validation library, making it fast, easy to use, and highly reliable.

Install the Python dependencies:

pip install "uvicorn[standard]" fastapi spacy pdfminer.six

We need to load the model before the API service starts handling requests. FastAPI emits a set of lifecycle events, and startup is the one that fits this requirement. After loading the model, we attach the model instance to the app.state object.

The API has a POST endpoint that receives the URL of a PDF file, downloads it, extracts the text, and applies the model. The response is the list of extracted entities.

As mentioned in the first post, the DOI entity follows a string pattern, so it is not necessary to use a model to identify it; it is easier to apply an entity ruler to extract it.
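As a quick sanity check, the token pattern used by the entity ruler below can be approximated with a single plain regular expression. A minimal sketch (the sample DOI here is only illustrative):

```python
import re

# Sketch: the JOSS DOI shape (e.g. 10.21105/joss.05160) expressed as
# one regular expression, mirroring the entity-ruler token pattern.
DOI_RE = re.compile(r"\d{2}\.\d{5}/joss\.\d{5}")

sample = "See https://doi.org/10.21105/joss.05160 for details."
match = DOI_RE.search(sample)
print(match.group(0) if match else None)  # → 10.21105/joss.05160
```

If the regex matches your sample DOIs here, the equivalent token pattern in the entity ruler should match them as well.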

Create a main.py file with the following code:

from fastapi import FastAPI
from pdfminer.high_level import extract_text
import spacy
import urllib.request
from pydantic import BaseModel


app = FastAPI()


class Request(BaseModel):
    url: str


# Some servers block urllib's default user agent, so install an
# opener that sends a browser-like header.
opener = urllib.request.build_opener()
opener.addheaders = [("User-agent", "Mozilla/5.0")]
urllib.request.install_opener(opener)


@app.on_event("startup")
async def startup():
    # Load the trained pipeline and add a rule-based matcher for DOIs.
    nlp = spacy.load("model-best")
    ruler = nlp.add_pipe("entity_ruler")

    patterns = [
        {
            "label": "DOI",
            "pattern": [
                {"LOWER": {"REGEX": r"\d{2}\.\d{5}"}},
                {"TEXT": "/"},
                {"LOWER": {"REGEX": r"joss\.\d{5}"}},
            ],
        }
    ]
    ruler.add_patterns(patterns)

    app.state.nlp = nlp


@app.post("/extract-ner")
async def post_extract_ner(request: Request):
    # Download the PDF to a temporary file, extract its text, and run NER.
    filename, _ = urllib.request.urlretrieve(request.url)

    text = extract_text(filename)
    doc = app.state.nlp(text)

    return [{"label": ent.label_, "text": ent.text} for ent in doc.ents]

Run the command to start the API service, then test the endpoint with the commands below:

uvicorn main:app
curl -X POST \
-H "Content-Type: application/json" \
-d '{"url": "https://www.theoj.org/joss-papers/joss.05160/10.21105.joss.05160.pdf"}' \
'http://127.0.0.1:8000/extract-ner'

The request returns the list of entities from the given JOSS paper. Open the PDF document and check whether all entities were extracted correctly.
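The endpoint returns a flat list of {label, text} objects. A small post-processing sketch that groups entities by label, which can make the response easier to inspect (the sample entities and labels below are illustrative, not actual model output):

```python
from collections import defaultdict

# Illustrative response from the /extract-ner endpoint; the real
# labels and values depend on the paper and the trained model.
entities = [
    {"label": "AUTHOR", "text": "Jane Doe"},
    {"label": "AUTHOR", "text": "John Smith"},
    {"label": "DOI", "text": "10.21105/joss.05160"},
]

# Group entity texts under their label for easier inspection.
grouped = defaultdict(list)
for ent in entities:
    grouped[ent["label"]].append(ent["text"])

print(dict(grouped))
```

The same grouping could also be done server-side if a keyed response shape is preferred over a flat list.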

Dockerizing the application

This part is a bonus: running the service with Docker. Create a Dockerfile with this content:

FROM python:3.11-slim-bookworm

WORKDIR /app

RUN apt-get update

COPY . .

RUN pip install --no-cache-dir "uvicorn[standard]" fastapi spacy pdfminer.six

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Commands to build and run the Docker application:

docker build --no-cache -t my-app:latest .

docker run -p 8000:8000 --rm my-app:latest

Execute the curl command again to test the API. It should produce the same result as running outside Docker.
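If you use Docker Compose, the docker run command above can be captured in a minimal compose.yaml sketch (the service name is an arbitrary choice, and the image name assumes the build step above):

```yaml
services:
  api:
    image: my-app:latest
    ports:
      - "8000:8000"
```

Then `docker compose up` starts the service with the same port mapping.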

Closing Remarks

I am really excited to end this journey. In this series, we covered all the tasks involved in building an end-to-end NER application.

I hope you enjoyed this content. If you have any questions or suggestions, feel free to leave a comment.

Thank you for being here. Happy learning.
