How we use GPT to improve data quality and entity extraction

Benoit de Menthière
Published in iNex Blog · 5 min read · Jun 22, 2023

Data cleaning is an essential yet time-consuming task in any data-driven project. It involves transforming and standardizing data to ensure accuracy and consistency, but it often consumes valuable time and resources. However, with advancements in natural language processing (NLP) and the availability of powerful AI models like GPT, part of data cleaning can now be automated efficiently.

As an innovative company that aggregates hundreds of different data sources, iNex must rely on accurate and reliable data to develop effective solutions that promote sustainability and reduce environmental impact. In this article, we will explore the various ways in which AskMarvin.ai, an OpenAI API wrapper, helps us save significant time and effort.

Marvin leverages “functional prompt engineering” to build a battle-tested, type-safe interface to LLMs. It exposes its API through drop-in decorators, creating a developer experience that lets any developer reach for LLMs when they are the right tool.

Standardizing

One of the most common challenges in data cleaning is dealing with inconsistent or misspelled strings. With the GPT API, you can leverage its text generation capabilities to standardize values effortlessly. By providing a list of values in different formats, you can ask the model to generate standardized versions. This eliminates the need for manual review and editing, saving valuable time and ensuring consistency across the dataset. Let’s look at a first example taken from the AskMarvin documentation.

from marvin.ai_functions.data import standardize
survey_data = [
    "(555) 555 5555",
    "555-555-5555",
    "5555555555",
    "555.555.5555",
    "55"
]
# converts all values to (555) 555-5555
standardize(survey_data, format="US phone number with area code")

# Result:
# ['(555) 555-5555', '(555) 555-5555', '(555) 555-5555', '(555) 555-5555', '']

It is very easy to define your own cleaning function, for instance to correct some spelling errors:

from marvin import ai_fn
@ai_fn
def fix_spelling_errors(data: list[str]) -> list[str]:
"""
Fix all spelling errors in a list of string
"""

fix_spelling_errors([
    'St. Albans',
    'St.A1bans',
    'St Albans',
    'St.Ab1ans',
    'St.albans',
    'St. Alans',
    'S.A1bans',
    'St..A1bans',
    'S.A1bnas',
    'St.A1bnas',
    'St.A1 bans',
    'St.Algans',
    'Sl.A1bans',
    'St. Allbans',
])

# Result:
# ['St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans', 'St. Albans']

Zero Shot Classification

Data cleaning often involves categorizing or classifying data based on predefined criteria. This can be a tedious and error-prone process, especially when dealing with large datasets. By providing a simple docstring, you can build a classifier with zero NLP knowledge.

@ai_fn
def classify_sentiment(tweets: list[str]) -> list[bool]:
"""
Given a list of tweets, classifies each one as
positive (true) or negative (false) and returns
a corresponding list
"""

The library also provides some useful functions for these kinds of tasks:

from marvin.ai_functions.data import map_categories, categorize

colors_data = [
"teal",
"cyan",
"peach",
"salmon",
"red-orange",
"lime",
]

# assign categories from a known list
map_categories(colors_data, categories=['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet'])

# Result:
# ['blue', 'blue', 'yellow', 'red', 'red', 'green']

If you don’t know your categories in advance, just provide a description:

# describe possible categories in natural language
categorize(colors_data, description='the colors of the rainbow')

# Result:
# ['blue', 'blue', 'yellow', 'red', 'red', 'green']

Handling missing data

Dealing with missing data is another time-consuming aspect of data cleaning. The GPT API can assist in this process by imputing missing values based on the available data. This automated approach saves significant effort compared to traditional manual imputation methods.

import pandas as pd
from marvin.ai_functions.data import context_aware_fillna_df

survey_responses = pd.DataFrame(
    [
        ["NY, NY", None, None],
        ["Boston, Massachusetts", None, None],
        ["Boston MA", None, None],
        ["NYC", None, None],
    ],
    columns=["response", "city", "state"],
)

context_aware_fillna_df(survey_responses)

# Result:
#                 response      city          state
# 0                 NY, NY  New York       New York
# 1  Boston, Massachusetts    Boston  Massachusetts
# 2              Boston MA    Boston  Massachusetts
# 3                    NYC  New York       New York

Here is the function used behind the scenes. Once again, you can easily customize the instructions to fit your needs, as sketched right after this function:

@ai_fn
def context_aware_fillna(data: list[list], columns: list[str] = None) -> list[list]:
"""
Given data organized as a list of rows, where each row is a list of values,
and some missing values (either `None` or `np.nan`), fill in any missing
values based on other data in the same row. Use the `columns` names and
other data to understand the likely data model. Returns the original data
with the missing values filled in.
"""

Entity extraction

AI Models in Marvin, built on pydantic, offer a revolutionary approach to data processing by converting unstructured text data into structured formats. This empowers us to interrogate our data through our model schema, combining the reasoning capabilities of AI with the type boundaries set by pydantic.
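
For example, here is a small model very close to the one in the Marvin documentation; the coordinates below are approximate and the exact values may vary from run to run:

from pydantic import BaseModel
from marvin import ai_model

@ai_model
class Location(BaseModel):
    city: str
    state: str
    latitude: float
    longitude: float

Location("The Windy City")
# e.g. Location(city='Chicago', state='Illinois', latitude=41.8781, longitude=-87.6298)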

The model understands that we are referring to Chicago and fills in the fields based on its general knowledge.

Consider the example of parsing resumes in an applicant tracking system. Previously, data scientists had to create custom models and regular expressions for each data feature, triggering a development cycle with every new requirement. However, with Marvin, we can use pydantic to shape our data model and enhance it with the @ai_model decorator. This gives our pydantic model the ability to handle unstructured text seamlessly. Let's delve into a practical example:

import datetime
from typing import List, Literal, Optional, Union
import pydantic
from marvin import ai_model

class Institution(pydantic.BaseModel):
    name: str
    start_date: Optional[datetime.date]
    end_date: Union[datetime.date, Literal['Present']]

@ai_model
class Resume(pydantic.BaseModel):
    first_name: str
    last_name: str
    phone_number: Optional[str]
    email: str
    education: List[Institution]
    work_experience: List[Institution]

Resume("""
Ford Prefect
Contact: (555) 5124-5242, ford@prefect.io

Education:
- University of Betelgeuse, 1965 - 1969
- School of Galactic Travel, 1961 - 1965

Work Experience:
- The Hitchhiker's Guide to the Galaxy, Researcher, 1979 - Present
- Galactic Freelancer, 1969 - 1979
""").json(indent = 2)
# Result:
{
  "first_name": "Ford",
  "last_name": "Prefect",
  "phone_number": "(555) 5124-5242",
  "email": "ford@prefect.io",
  "education": [
    {
      "name": "University of Betelgeuse",
      "start_date": "1965-01-01",
      "end_date": "1969-01-01"
    },
    {
      "name": "School of Galactic Travel",
      "start_date": "1961-01-01",
      "end_date": "1965-01-01"
    }
  ],
  "work_experience": [
    {
      "name": "The Hitchhiker's Guide to the Galaxy",
      "start_date": "1979-01-01",
      "end_date": "Present"
    },
    {
      "name": "Galactic Freelancer",
      "start_date": "1969-01-01",
      "end_date": "1979-01-01"
    }
  ]
}

🤯🤯🤯 Mind-blowing!

This makes our scrapers much simpler and smarter, as they no longer need to be adapted to each new website.

Limitations

Keep in mind that GPT is not perfect and can sometimes be wrong. The most difficult task is writing efficient prompts, which is why the Marvin library is so interesting: all prompts are included and optimized for you. I also recommend lowering the temperature parameter to reduce the model's creativity and get more deterministic results.
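
To give a rough idea of what lowering the temperature means, here is what it looks like when calling the OpenAI API directly with the openai Python package (Marvin exposes its own settings for this; the prompt below is just a toy example):

import openai

# temperature=0 makes the model pick the most likely tokens, so repeated
# calls on the same input give far more consistent results
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Standardize 555.555.5555 as a US phone number."}],
    temperature=0,
)
print(response["choices"][0]["message"]["content"])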

Also, the current implementation requires an OpenAI account and API key. OpenAI is doing an awesome job, but it can also get expensive. Caching results avoids re-processing items and paying for duplicate AI calls.
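
A minimal sketch of that caching idea, using a simple in-memory cache around the standardize function from earlier (the wrapper is our own, not part of Marvin):

import functools
from marvin.ai_functions.data import standardize

@functools.lru_cache(maxsize=None)
def standardize_phone(value: str) -> str:
    # each distinct value hits the API once; repeated values come from the cache
    return standardize([value], format="US phone number with area code")[0]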

Finally, Marvin is still under active development and is likely to change.

Conclusion

This was just a first introduction to the library, but you can do a lot more, including building a Slack bot, connecting to the internet thanks to plugins, and smart document retrieval thanks to the Chroma wrapper. iNex successfully harnesses the power of LLMs to improve data quality and to refine its scraping and classification algorithms. I really encourage you to take a look at the documentation and have fun!

Benoit de Menthière
Lead data engineer at iNex #GreenTech. Passionate about cutting-edge technology and solving real-world problems.