Information extraction and summarisation with human validation

c3d3
Tech@World School History
4 min read · Feb 1, 2024
Information extraction and summarisation supported by human input

A key part of the World School History project involves building a knowledge base from curricula, textbooks and other educational resources from around the world. This involves extracting information and creating summaries of the resources and storing them in such a way that they can easily be queried and retrieved to create teaching/learning materials and digital tools.

At a high level, building the knowledge base entails the following steps (see also figure above):

  • Extracting text from PDFs (and other formats). Depending on the resource, this might take more or less effort. Some PDFs are simply scanned images, which require optical character recognition (OCR) to convert into text; some may have irregular layouts which require highly customised scripts to parse.
  • Preprocessing the text. This involves getting the text into a standard format with correctly encoded characters, headings indicated, regular spacing and so on. Depending on the models used downstream, we may also translate the texts at this stage (not all languages have good models for the downstream information extraction and summarisation tasks). In some cases, we may also resort to manual translation by humans. A rough sketch of these first two steps is given after this list.
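
By way of illustration, a minimal sketch of these first two steps could look like the following. pdfplumber, pdf2image and pytesseract are stand-in library choices rather than a description of our actual pipeline, and curriculum.pdf is a hypothetical input file.

```python
# Sketch only: stand-in libraries, not the project's actual tooling.
# pdfplumber reads text-based PDFs; pdf2image + pytesseract handle
# image-only (scanned) pages via OCR.
import re

import pdfplumber
import pytesseract
from pdf2image import convert_from_path


def extract_text(pdf_path: str) -> str:
    """Extract raw text, falling back to OCR for pages with no text layer."""
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            if not text.strip():
                # Probably a scanned image: render the page and run OCR.
                image = convert_from_path(pdf_path, first_page=i + 1, last_page=i + 1)[0]
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return "\n".join(pages)


def preprocess(text: str) -> str:
    """Normalise whitespace and common encoding artefacts."""
    text = text.replace("\u00ad", "")       # remove soft hyphens
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()


raw = extract_text("curriculum.pdf")  # hypothetical input file
clean = preprocess(raw)
```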

Then for summarisation:

  • Extractive summarisation. This involves extracting the parts of the text that most represent its content, in a similar way to a human highlighting the important parts of a text.
  • Embedding the machine extracted summaries. This takes the summaries and represents them as vectors/arrays of numbers which reflect their semantic content; these are known as embeddings. These make it easy to identify and retrieve summaries related to a given piece of text or query. (A rough sketch of the summarisation and embedding steps is given after the figure caption below.)
  • Human validation/editing. At least two reviewers are asked to check whether the summaries reflect the text that has been summarised, or whether certain parts should be added or removed (remember, this is extractive summarisation so we do not include any text that was not in the original resource). If there is disagreement, we invite the reviewers to discuss and/or further reviewers are brought in until a consensus is achieved. These summaries are treated as the “ground truth” for fine-tuning our models (we also have plans to allow for continuous evaluation by humans so that the “ground truth” can itself evolve over time).
  • Embedding the human extracted summaries. As with the machine extracted summaries, we create embeddings for the human extracted summaries.
Extractive summarisation to around 10% of the original text (extract from Side by Side)
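
To make the extractive summarisation and embedding steps more concrete, here is a minimal sketch. Scoring sentences by TF-IDF centrality and embedding them with the all-MiniLM-L6-v2 sentence-transformers model are stand-in choices rather than a description of our actual models, and curriculum_section.txt is a hypothetical input file.

```python
# Sketch only: stand-in methods and models, not the project's actual pipeline.
import re

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def extractive_summary(sentences: list[str], ratio: float = 0.1) -> list[str]:
    """Keep roughly `ratio` of the sentences, chosen by how representative they are."""
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarity = cosine_similarity(tfidf)       # sentence-to-sentence similarity
    centrality = similarity.sum(axis=1)         # how central each sentence is
    k = max(1, int(len(sentences) * ratio))
    keep = sorted(np.argsort(centrality)[-k:])  # top-k sentences, in original order
    return [sentences[i] for i in keep]


source_text = open("curriculum_section.txt").read()  # hypothetical input file
sentences = re.split(r"(?<=[.!?])\s+", source_text)  # naive sentence split
summary = extractive_summary(sentences, ratio=0.1)   # keep around 10% of the text

# Embed the extracted sentences so related passages can be retrieved later.
model = SentenceTransformer("all-MiniLM-L6-v2")      # stand-in embedding model
embeddings = model.encode(summary)                   # one vector per summary sentence
```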

And for information extraction, we take a very similar approach but instead of summaries, we perform the following:

  • Entity extraction. We seek out all mentions of historical figures, places, events, etc.
  • Semantic role labelling. We formally characterise the relationships between different entities. For example, in the statement Frances killed Cody in the year 1983, “Frances” would be assigned the agent role (the entity performing the action), “Cody” would be assigned the patient role (the entity affected by the action), “killed” would be assigned the predicate role (the action being performed), and “in the year 1983” would be assigned the temporal adjunct. (See this article if you would like to learn more about the different labels.)
  • Topic assignment. We assign topics to different sections of text (this might be at a more granular level than the sections defined in the text itself). A sketch of the kind of output the extraction steps produce is given after this list.
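
As a concrete illustration of the kind of output these steps produce, here is a minimal sketch built around the example statement above. spaCy’s en_core_web_sm model stands in for entity extraction, and the SemanticRoles dataclass simply shows the shape of a semantic-role record; neither is a description of our actual models.

```python
# Sketch only: stand-in model and data structures, not the project's actual pipeline.
from dataclasses import dataclass, field

import spacy

nlp = spacy.load("en_core_web_sm")  # stand-in NER model

text = "Frances killed Cody in the year 1983."
doc = nlp(text)

# Entity extraction: mentions of people, places, dates, events, ...
entities = [(ent.text, ent.label_) for ent in doc.ents]
# e.g. something like [("Frances", "PERSON"), ("Cody", "PERSON"), ("the year 1983", "DATE")]


# Semantic role labelling: one record per predicate, with its arguments and adjuncts.
@dataclass
class SemanticRoles:
    predicate: str                  # the action being performed
    agent: str | None = None        # the entity performing the action
    patient: str | None = None      # the entity affected by the action
    adjuncts: dict[str, str] = field(default_factory=dict)  # e.g. temporal, locative


roles = SemanticRoles(
    predicate="killed",
    agent="Frances",
    patient="Cody",
    adjuncts={"temporal": "in the year 1983"},
)
```

Topic assignment works on chunks of text rather than single statements, so its output is simply a mapping from (possibly overlapping) chunks to topic labels.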

Then:

  • Human validation/editing of entities extracted. At least two reviewers are asked to check whether the entities extracted are indeed found in the corresponding text and whether the extracted set is complete. Reviewers can add entities they feel are missing and have these reviewed by other reviewer(s) (as in the case of the summaries, this will be iterative in the future).
  • Human validation/editing of semantic roles and adjuncts. At least two reviewers are asked to check the semantic roles and adjuncts assigned to particular statements. If there is disagreement, we invite the reviewers to discuss and/or further reviewers are brought in until a consensus is achieved (as in the case of the summaries, this will be iterative in the future).
  • Human validation/editing of topics assigned. At least two reviewers are asked to check the topics identified in the text. Additional topics may also be assigned, and we allow for overlapping chunks of text to be assigned different topics. For example, one topic may be assigned to chunk A, a second topic may be assigned to chunk B, and a third may be assigned to a chunk spanning part of chunk A and part of chunk B. If there is disagreement, we invite the reviewers to discuss and/or further reviewers are brought in until a consensus is achieved (as in the case of the summaries, this will be iterative in the future). A small sketch of this consensus rule is given after the figure caption below.
Geopolitical entities extracted from the different narratives within the text (extract from Side by Side). In this example, some entities are found in both narratives.
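
The “at least two reviewers, escalate on disagreement” rule used throughout the validation steps could be expressed in the review tooling along the following lines; this is purely illustrative and all names are hypothetical.

```python
# Hypothetical sketch of the consensus rule described above; not actual review tooling.
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    approved: bool     # does the annotation reflect the source text?
    comment: str = ""  # suggested additions/removals, if any


def consensus(reviews: list[Review], minimum: int = 2) -> str:
    """Return 'accepted', 'rejected', or 'escalate' (discuss and/or add reviewers)."""
    if len(reviews) < minimum:
        return "escalate"
    verdicts = {r.approved for r in reviews}
    if verdicts == {True}:
        return "accepted"
    if verdicts == {False}:
        return "rejected"
    return "escalate"  # mixed verdicts: reviewers discuss and/or more are brought in


status = consensus([Review("reviewer_1", True), Review("reviewer_2", False)])  # -> "escalate"
```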

This article is the first of a series describing the tech behind the knowledge base component of the World School History project, which uses AI and other text processing methods alongside human feedback to extract information from history curricula and learning materials around the world. For those with more technical leanings, future articles will give details of the experiments and implementation of the processes described above, as well as of any applications we develop on top of the knowledge base.
