Using LLMs to extract structured data from text

GP Sandhu
4 min read · Jun 24, 2023


One application of LLMs that has been gaining traction recently is extracting structured data from natural language text. A lot of really useful data is currently locked up in text files. If we can extract it as structured data, we can gain interesting insights from it and become more data-driven in our decision making. In this post I'm going to walk you through how to do that, using the excellent Significant Cyber Incidents dataset from CSIS as the example. I'm going to use the new function calling feature OpenAI released a couple of weeks back.

As a big believer in data-driven decision making, I often find the security world overly reliant on the opinions of "security experts", some of whom have never written a line of code in their life. They share lists and policies, often with a few anecdotes and without any rigorous data analysis to back them up. I recently asked on LinkedIn for a good dataset of major security incidents from the last 5 years; I wanted to use it to figure out what adversary behavior has looked like over that time and where our defenses seem to be failing the most. Think of it as reviewing all the goals we (security defenders) conceded so that we can figure out how to defend better. It was interesting to see that there was no real consensus answer, no "well, obviously you use XYZ", which in and of itself says a lot about how little of what we do as security practitioners is rooted in data.

One interesting dataset someone shared was CSIS's Significant Cyber Incidents. It is an 80-page PDF that CSIS hosts, containing short summaries of major incidents going back to 2003. In this post, I'll use that as an example and walk through turning that PDF into structured data. It doesn't have the level of technical detail I was looking for, but it is a great example to illustrate this use case.

Step 1: Load the file and the modules

Looking at the PDF file URL https://csis-website-prod.s3.amazonaws.com/s3fs-public/2023-06/230602_Significant_Cyber_Events.pdf?VersionId=DQAJtvC1GwslfNiMmDBPryLA1wYDjGEE, it seems like they use a VersionId when they update the file, so we can't rely on this URL staying stable or pointing at the most current version in the future. Let's instead use the landing page, which has a stable URL (https://www.csis.org/programs/strategic-technologies-program/significant-cyber-incidents), and extract the PDF link from it so that we always get the most recent version.
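Here's a minimal sketch of that step. The link-selection heuristic (grab the first anchor on the page whose href contains ".pdf") is an assumption about the page's structure, not something CSIS guarantees, so adjust the selector if the layout changes.

```python
import requests
from bs4 import BeautifulSoup
from io import BytesIO
from urllib.parse import urljoin
from pypdf import PdfReader

LANDING_URL = (
    "https://www.csis.org/programs/strategic-technologies-program/"
    "significant-cyber-incidents"
)

# Fetch the landing page and find the first link that points at a PDF.
html = requests.get(LANDING_URL, timeout=30).text
soup = BeautifulSoup(html, "html.parser")
pdf_href = next(a["href"] for a in soup.find_all("a", href=True) if ".pdf" in a["href"])
pdf_url = urljoin(LANDING_URL, pdf_href)  # handle relative links

# Download the PDF and pull the raw text out of every page.
pdf_bytes = requests.get(pdf_url, timeout=60).content
reader = PdfReader(BytesIO(pdf_bytes))
pages = [page.extract_text() for page in reader.pages]
```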

Step 2: Clean the file so that we end up with only the list of incidents

Now that we have the file, let's strip out all the text in the PDF that doesn't describe incidents (headings, the footer on each page, etc.). We also need to merge incidents that run over from one page to the next. A sketch of one way to do this is below.
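This cleanup assumes each incident summary starts with a "Month Year:" prefix and that the recurring header/footer lines match the patterns below; both assumptions are mine, so check them against the actual PDF text.

```python
import re

# Join pages so incidents that span a page break are merged back together,
# dropping recurring header/footer lines first (patterns are illustrative).
body = "\n".join(
    line for page in pages for line in page.splitlines()
    if not re.match(r"^(Significant Cyber Incidents|Center for Strategic|Page \d+)", line)
)

# Split on month-year markers like "June 2023:" to get one string per incident.
month = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"
split_text = [s.strip() for s in re.split(rf"\n(?={month} \d{{4}}[:.])", body) if s.strip()]
print(len(split_text))
```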

split_text now holds every significant-incident entry in the file as a Python list of strings, and there are 998 entries at the time of writing. We have the data in the format we need.

Step 3: Set up the schema we want to extract and the tools to do it

Now that we're done with the data prep, we can get to the real LLM part: extracting structured data from this text. To do this, we need to define the schema we want to pull out of each entry.
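The exact fields to extract will depend on your analysis; the properties below are my own illustrative picks, in the schema format Langchain's extraction chain expects.

```python
# Illustrative schema; the specific fields are assumptions, not a
# definitive list. Only "date" is marked required here.
schema = {
    "properties": {
        "date": {"type": "string"},
        "threat_actor": {"type": "string"},
        "victim_country": {"type": "string"},
        "victim_sector": {"type": "string"},
        "attack_type": {"type": "string"},
    },
    "required": ["date"],
}
```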

We also define the LLM we want to use. The current base models gpt-4 and gpt-3.5-turbo don't have function calling enabled, so you'll need the 0613 versions of these models. Then you define the Langchain extraction chain with the schema and the LLM. That's all; we're all set to do the extraction from text now!
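Wiring that up looks roughly like this, using the June 2023 Langchain and OpenAI APIs this post is written against:

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

# The 0613 snapshots are the ones with function calling at the time of writing.
llm = ChatOpenAI(model="gpt-3.5-turbo-0613", temperature=0)

# Build the extraction chain from the schema and the model.
chain = create_extraction_chain(schema, llm)
```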

Step 4: Run the extraction on 5 random samples

Let's pull 5 samples at random from the Python list we created earlier:
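Something like this; the seed is my addition so the sample is reproducible:

```python
import random

# Draw five entries at random from the cleaned incident list.
random.seed(42)
samples = random.sample(split_text, 5)
```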

Then we run the extraction chain on the data:
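With the Langchain API of the time, chain.run returns a list of dicts, one per incident found in the input text, with keys matching the schema:

```python
# Run the extraction over each sampled summary and print the results.
for text in samples:
    extracted = chain.run(text)
    print(extracted)
```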

As you can see, it does a fairly good job at extracting the elements we defined in the schema and returns ‘unknown’ for anything that wasn’t there in the input data.

This technique can help us operationalize valuable data that is sitting in text form by converting it into structured data, so we can derive objective metrics and trends from it. If there is interest, I'm happy to convert all the entries in this dataset and do a followup data analysis post. My current plan is to find another dataset with more technical detail for that deep-dive analysis. Thanks for reading!
