From mega-prompts to specialized prompts

Josselin Perrus
Thoughts on Machine Learning
6 min read · Jan 29, 2024

I previously worked on a project to map French radical-right networks, using GPT to extract data from articles covering radical-right activities. I covered that first iteration in a previous article.

I iterated on the project, believing I could do better on the following issues:

  • Low recall: GPT is not great at entity detection (see the previous article), so despite my efforts it could extract only some of the relevant entities in the text.
  • High cost: passing full articles to GPT-4 with a verbose output is quite expensive (about $0.50 per article).

Approach

In my first iteration, having GPT do the entity detection was brittle and left me with few options:

  • I had to use GPT-4: GPT-3.5 (and GPT-4 Turbo) were much less reliable than GPT-4 for entity detection on a long text, whatever tricks I tried.
  • I had to use a single mega-prompt: splitting into multiple prompts broke context continuity for GPT, degrading entity-detection performance.

Hypothesis:

  • Providing GPT with a list of entities as a second input (besides the article’s text) would relieve the above constraints.
  • That would allow me to break the prompt down into smaller, more specialized prompts, which could possibly run on GPT-3.5.

The results

Results from switching from one mega-prompt to a set of specialized processing tasks, including using spacy for a first entity-detection pass:

  • I quickly discovered that spacy had the opposite characteristics of my first attempt: high recall but very low precision. Basically, spacy returns a lot of noise.
  • The whole challenge was then to use GPT to improve precision. I could do it through multiple steps, which allowed me to use GPT-3.5 all the way. Precision is still not perfect, though.
  • Higher recall means more entities detected, and a correlate is more duplicates, so entity resolution became more of a challenge. I switched from a hybrid full-text-search / vector (but deterministic) approach to a GPT-based approach (GPT determines whether two entities are the same and produces a golden record by merging the duplicates).

The data processing pipe

The data processing pipe has 4 steps: 1. Entity detection with spacy, 2. Filter on named entities, 3. Filter on radical right entities, 4. Extract relationships

The pipe went from one step to four:

  1. Entity detection with spacy
  2. Filter on named entities with GPT-3.5: spacy returns entities that are not properly named (e.g. “the assistant”), plus some random noise. This step keeps properly named entities, whatever their role, and then filters for the types of interest (people, groups, organizations). It’s done by passing the few sentences each entity appears in (i.e. we don’t pass the full text for each entity).
  3. Filter on radical-right entities with GPT-3.5: this is the biggest prompt, and the whole text is passed so that GPT can extract as much information as possible to 1. determine whether an entity belongs to the radical right, and 2. support relationship detection at the next step.
  4. Extract relationships with GPT-3.5: using the data extracted at the previous step, this step mostly reframes the data into a knowledge graph.
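The four steps above can be sketched as a simple pipeline. This is my own sketch, not code from the repo: the injected callables stand in for the spacy and GPT-3.5 calls, and their names are assumptions.

```python
# Hypothetical sketch of the 4-step pipe; the injected callables stand in
# for the spacy and GPT-3.5 calls described in the article.
def run_pipe(text, detect, is_named, is_radical_right, extract_relationships):
    entities = detect(text)                                    # 1. spacy NER: high recall, noisy
    named = [e for e in entities if is_named(e, text)]         # 2. GPT-3.5: keep properly named entities
    radical = [e for e in named if is_radical_right(e, text)]  # 3. GPT-3.5: keep radical-right entities
    return extract_relationships(radical, text)                # 4. GPT-3.5: reframe into a knowledge graph
```

Keeping each filter as its own narrow stage is what makes it possible to run every stage on GPT-3.5 instead of GPT-4.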

Learnings

Data preparation matters

  • Experienced data scientists already know this; I experienced it first-hand.
  • I used beautifulsoup to extract the article content, but some bits of HTML were left behind because of iframes or blockquotes. Properly cleaning the article text removed a lot of false positives during spacy detection.
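A minimal cleaning sketch with BeautifulSoup, assuming the leftover fragments come from `iframe` and `blockquote` embeds (the function name and tag list are my assumptions, not the repo's code):

```python
from bs4 import BeautifulSoup

def clean_article(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop embeds whose inner HTML otherwise leaks into the extracted text
    for tag in soup.find_all(["iframe", "blockquote", "script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```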

Structuring the data processing pipe

  • Use domain-specific knowledge: at first I wanted to build a “robust” processing pipe, in the sense that it could be repurposed. Sure, it would be more elegant if my pipe could detect any radical-right entity in any type of content. But my corpus consists of articles that deal mostly with the radical right. I can use that information, and I should, even though it feels like a “trick” at first. That’s what Jason Liu calls inductive bias.
  • Solve a problem at a single stage in the pipe. For example, I struggled for some time with the improper entities spacy detected because of leftover HTML bits. I was trying to weed them out at later stages, but the proper approach was to solve the problem at one specific stage (here, the data-preparation stage).

Prompting: few-shot works, for specialized tasks

  • I tried few-shot prompting with mega-prompts, and it did not work well. I assume this is because the examples had to capture too many possible variations, and GPT could not properly generalize from them.
  • With more specialized tasks and prompts, however, there is much less variability, and few-shot prompting worked great.
Few-shot prompting applied to a specialized task
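As an illustration (this helper is my own sketch, not code from the repo), a specialized few-shot prompt can be packed into the chat format, with one worked example per user/assistant pair:

```python
def few_shot_messages(system: str, examples: list[tuple[str, str]], item: str) -> list[dict]:
    """Build a chat prompt for one narrow task, with a few worked examples."""
    messages = [{"role": "system", "content": system}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # The real item to process comes last
    messages.append({"role": "user", "content": item})
    return messages
```

Because the task is narrow (e.g. “is this a properly named entity?”), two or three examples cover most of the variability the model will see.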

Prompting: how to take and return a list without dropping items

  • My whole processing pipe is about taking a list of entities as input and yielding that exact same list as output, with enriched information for each item. I use that information to filter out some items myself; I don’t ask GPT to do the filtering, because it does not perform well at it, and it’s a black box as to which items were filtered out and why.
  • I struggled to have GPT do that.

Two techniques helped:

  1. Use the terms “table”, “rows”, “list”, and “item”: GPT being trained on a lot of code, these terms carry specific meaning for it, and using them helps align GPT with the intent.
  2. Make the very first instruction in the prompt ask GPT to create an item in the output for each item in the input, before doing any other processing.
Returning an input list in the output without dropping any item
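Combining both techniques, a prompt header might look like the sketch below (the wording and function name are my assumptions, not the exact prompt from the repo):

```python
def enrichment_prompt(entities: list[str], task: str) -> str:
    """Build a prompt that asks GPT to return one output row per input row."""
    rows = "\n".join(f"| {e} |" for e in entities)
    return (
        "You are given an input table of entities, one entity per row.\n"
        "FIRST, create exactly one row in the output table for each row of "
        "the input table, in the same order, before any other processing.\n"
        f"THEN, for each row: {task}\n\n"
        f"Input table ({len(entities)} rows):\n{rows}"
    )
```

Stating the row count explicitly gives the model one more anchor against silently dropping items.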

Prompting: self-questioning vs. chain of thought

  • Chain of thought works, but what is the solution when GPT is too lazy and yields an insufficient output?
  • A technique that has been documented, but that I had never tried before, is to ask GPT to generate questions that would help improve the result.
  • GPT will then actually use these hints to generate a better version of the answer, so it works as a great complement to chain of thought.
  • It really made a difference in my 3rd processing step, the most complicated one: it made the prompt robust.
Self-questioning as a complement to chain of thought
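Sketched as a three-call loop, with `llm` standing in for whatever chat-completion call is used (an assumption on my part, not the repo's interface):

```python
def self_questioning(llm, task_prompt: str) -> str:
    """Draft, self-question, then revise: a complement to chain of thought."""
    draft = llm(task_prompt)
    # Ask the model what it would need to know to do better
    questions = llm(
        "Here is a draft answer:\n" + draft +
        "\nList the questions whose answers would most improve it."
    )
    # Regenerate, using the self-generated questions as hints
    return llm(
        task_prompt +
        "\nA first draft was:\n" + draft +
        "\nImprove it by answering these questions:\n" + questions
    )
```

The cost is two extra calls per item, which is why it is worth reserving for the hardest stage of the pipe.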

Prompting: using scores for classification

  • Another technique I tried was to ask GPT to produce a score for the likelihood that an entity belongs to the radical right, instead of asking it to output a boolean.
  • I gave it rules about how to add or subtract points from the score depending on certain conditions.
  • It seems to make the process more robust by requiring GPT to explicitly consider multiple aspects in its decision, whereas just asking for a boolean answer seemed to generate more randomness.
  • Added benefit: it is possible to identify which items are on the fence (i.e. consistently get a score close to the classification threshold), understand why, and improve the scoring rules accordingly.
Using scores for classification
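The scoring rules live in the prompt; on the Python side only the threshold logic is needed. A sketch, where the rule wording, threshold, and margin are all my assumptions:

```python
# Hypothetical scoring instructions embedded in the classification prompt
SCORING_INSTRUCTIONS = """Score how likely the entity belongs to the radical right.
Start from 0. Add 3 points if the article links it to a known radical-right group.
Subtract 2 points if it is only quoted as an outside observer.
Return the final integer score only."""

THRESHOLD = 5

def classify(score: int) -> bool:
    """Turn GPT's integer score into the final boolean decision."""
    return score >= THRESHOLD

def on_the_fence(score: int, margin: int = 1) -> bool:
    """Flag entities scoring near the threshold: their rules are worth refining."""
    return abs(score - THRESHOLD) <= margin
```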

Conclusion

GPT is much better at finding context to filter down a list of entities than detecting entities on its own.

There is still room for improvement:

  • Some calls to GPT could be batched, further reducing cost and processing time.
  • I could lean further into the inductive bias. For example, most of the “non-entities” that remain after processing would immediately disappear if I required an entity to appear at least twice in the text to be considered. This would also hurt recall, but the trade-off might be worth it.
  • The entities that don’t appear twice or more could instead be funnelled into a separate process that uses GPT-4 to qualify them better.
  • In general, using GPT-4 for targeted calls could help improve the results at a slight increase in cost.
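The “appears at least twice” idea above is a one-liner over mention counts. A sketch, assuming mentions arrive as a flat list of detected entity strings:

```python
from collections import Counter

def split_by_frequency(mentions: list[str], min_count: int = 2):
    """Partition entities into those mentioned often enough to keep, and the
    rest, which could be funnelled into a separate GPT-4 qualification pass."""
    counts = Counter(mentions)
    keep = [e for e, c in counts.items() if c >= min_count]
    review = [e for e, c in counts.items() if c < min_count]
    return keep, review
```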

Sources

Github code repo: https://github.com/meaningfool/streetpress
