How to build an early-years activity recommender with retrieval-augmented generation, GPT-4 and Pinecone

Kostas Stathoulopoulos
Published in Discovery at Nesta
Sep 4, 2023 · 9 min read

Nesta’s Discovery Hub has launched a project to investigate how generative AI can be used for social good. We’re now exploring the potential of LLMs for early-years education, and in a series of Medium blogs, we discuss the technical aspects of our early prototypes.

Large language models (LLMs) perform well on a variety of tasks through in-context learning alone: the ability to recognise the desired task from a prompt at inference time, without any further training. We used this in our previous prototype to generate topics for conversation and activities to do with children in early-years education. Although the generated activities seemed fun and educational, we wanted to provide relevant, trusted context to the LLM to reduce hallucinations and align the activities with the early-years education goals set out by the government.

In this blog, we will walk you through how we used GPT-4 from OpenAI, the Pinecone vector database, and Streamlit to build an early-years activity recommender. Our prototype relies on retrieval-augmented generation (RAG) to ground the LLM’s responses on a trusted, external data source, the Development Matters guidance.

Development Matters guidance

Development Matters (DM) offers a top-level view of how children develop and learn. It helps early-years practitioners in England to design effective curriculums that build on the strengths of children and serve their needs.

Development Matters references the Early-Years Foundation Stage (EYFS) statutory framework which we used in a previous prototype to build a simpler version of an early-years activity recommender. For each EYFS area of learning and age group, the Development Matters guidance lists learning goals for children and examples of how to achieve them. For example, children between the ages of three and four years old will be learning to “combine shapes to make new ones”. A way to achieve that would be by “providing shapes that combine to make other shapes, such as pattern blocks and interlocking shapes, for children to play freely with.”

In this prototype, we used the Development Matters data to assist educators and caregivers in brainstorming activities tailored to specific learning goals and age groups of children.

Let’s dive into what we built.

Indexing texts into a Pinecone database

To use RAG, we need to index the documents of the external data source for easy and fast retrieval. Vector databases have become quite popular for this purpose so we opted for Pinecone, a managed vector database, which offers a generous free tier.

In our prototype, we indexed the Development Matters learning goals and examples so that users could write, in natural language, what they would like their child to learn and have the system return semantically similar items.

Let’s see how we did this, step by step.

Turning text into vectors. Source: OpenAI

Firstly, we needed to create numerical representations of our documents by vectorising them, in order to add them to a vector database. We encoded the Development Matters learning goals and corresponding examples of activities to support those goals using OpenAI’s text-embedding-ada-002; it’s a cheap model that produces good enough embeddings. We chose it mainly because we have already been using OpenAI’s ecosystem.

import openai
from typing import List

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Encode text with OpenAI's text embedding model."""
    text = text.replace("\n", " ")
    return openai.Embedding.create(input=[text], model=model)["data"][0]["embedding"]

Alternatively, you could use one of the transformer models hosted on Hugging Face. For example, the sentence-level MiniLM, which maps texts to 384-dimensional vectors, would be a solid option.

Once we encode our documents as vectors, we can measure how similar they are. As shown below, semantically similar sentences have a lower Euclidean distance between their corresponding vectors compared to unrelated sentences.

Example of a distance matrix of vectorised sentences: lower distance values indicate higher similarity.

We will use this property to search our Pinecone database shortly!
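To make the distance check concrete, here is a minimal sketch using made-up three-dimensional vectors standing in for real sentence embeddings:

```python
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(a - b))

# Toy vectors standing in for sentence embeddings:
# two related sentences sit close together, an unrelated one sits far away
cat = np.array([1.0, 0.9, 0.1])
kitten = np.array([0.9, 1.0, 0.2])
invoice = np.array([0.0, 0.1, 1.0])

print(euclidean(cat, kitten))   # small distance: semantically close
print(euclidean(cat, invoice))  # larger distance: unrelated
```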

Populating a Pinecone database

Indexing learning goals and examples from Development Matters guidance into a vector database

We created a free account with Pinecone and received our API key. We then created an index called eyfs-index that performs an approximate nearest-neighbour search using the Euclidean distance over 1536-dimensional vectors (the vector length of text-embedding-ada-002). We also specified the metadata we wanted to index along with the documents. Metadata are key-value pairs attached to each vector in the index, and you can use them to filter your vector searches. In our prototype, we indexed the type of a document (learning goal or activity example), its area of learning (for example, Mathematics or Literacy), source (always Development Matters) and age group (0–3, 3–4 and 4–5 years old).

import pinecone

# Initialise pinecone
pinecone.init(api_key="<PINECONE_API_KEY>", environment="us-west1-gcp")

# Create the index
pinecone.create_index(
    "eyfs-index",
    dimension=1536,
    metric="euclidean",
    metadata_config={"indexed": ["areas_of_learning", "source", "type_", "age_group"]},
)

We then batched and upserted the encoded Development Matters learning goals and activity examples, along with their metadata, to the eyfs-index.

import pinecone
from typing import Generator


def batch(lst: list, n: int) -> Generator:
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

# Connect to the pinecone index
index = pinecone.Index("eyfs-index")

# Upsert docs in batches; each doc is an (id, vector, metadata) tuple
for batched_docs in batch(docs, batch_size):
    index.upsert(batched_docs)

With the index set up, we can run queries. For example, we can query a learning goal and get related activity examples from the Development Matters guidance:

import pinecone
from genai.eyfs import get_embedding # https://github.com/nestauk/discovery_generative_ai

# Instantiate pinecone
pinecone.init(api_key="<PINECONE_API_KEY>", environment="us-west1-gcp")

# Connect to the index
index = pinecone.Index("eyfs-index")

query = "learn how to count"

index.query(
    vector=get_embedding(query),
    top_k=2,
    include_metadata=True,
    filter={
        "source": {"$eq": "dm"},
        "age_group": {"$in": ["3-4", "4-5"]},
        "type_": {"$eq": "examples"},
    },
)

The above code connects to our index, encodes the query with text-embedding-ada-002 and searches the index. It filters the vector search to examples from Development Matters for the 3–4 and 4–5 age groups. Finally, it returns the two results with the smallest Euclidean distance to our query. In this case, the returned examples are “play games which involve counting” and “develop the key skills of counting objects including saying the numbers in order and matching one number name to each item”.

{'matches': [{'id': '27b948c1-6fdb-48ba-9ad2-1150c8cdde2e',
              'metadata': {'age_group': '4-5',
                           'areas_of_learning': 'Mathematics',
                           'source': 'dm',
                           'text': 'Play games which involve counting.',
                           'type_': 'examples'},
              'score': 0.189807057,
              'values': []},
             {'id': '29252d06-f333-4cd9-983d-04abb660745d',
              'metadata': {'age_group': '4-5',
                           'areas_of_learning': 'Mathematics',
                           'source': 'dm',
                           'text': 'Develop the key skills of counting objects '
                                   'including saying the numbers in order and '
                                   'matching one number name to each item.',
                           'type_': 'examples'},
              'score': 0.212427735,
              'values': []}],
 'namespace': ''}

Now that we’ve built our index and learned how to query it, let’s see how we used it in our prototype.

Using retrieval-augmented generation

For our early-years activity recommender, we wanted to provide examples to the LLM of how caregivers in England can support a child’s progression towards specific learning goals.

We used RAG to force the LLM to generate responses related to England’s educational context and reduce the chance of taking ideas from other, unrelated early-years guidance documents that might be part of GPT-4’s training set.

In general, RAG combines an information retrieval component with a generative model. The retriever takes an input and finds a set of relevant documents given an external data source. The documents are added as context to the prompt which we use to call the generative model. RAG grounds an LLM on a set of external, verifiable facts, which means the model has a smaller chance of pulling information baked into its parameters. It also enables us to update an LLM’s knowledge without retraining the model!

Retrieval-augmented generation using Pinecone vector database

In our prototype, the input is a user query, usually a learning goal, while the external data source is the examples of the Development Matters guidance which we vectorised and stored in a Pinecone database. We search the database for semantically similar documents and add them as context to our prompt which is similar to the one we used in the EYFS-based personalised activity recommender. Finally, we call GPT-4 and store its response.
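The overall loop can be sketched as follows. This is a minimal illustration, not our exact implementation: the retrieve and generate callables stand in for the Pinecone query and the GPT-4 call shown elsewhere in this post.

```python
from typing import Callable, List

def recommend_activities(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 5,
) -> str:
    """Minimal RAG loop: retrieve relevant examples, add them to the prompt, call the LLM."""
    examples = retrieve(query, top_k)
    context = "\n".join(f"- {example}" for example in examples)
    prompt = (
        "The activities you generate must be inspired by the examples below:\n"
        f"{context}\n\n"
        f"Theme: {query}"
    )
    return generate(prompt)
```

In the prototype, retrieve wraps the Pinecone index.query call and generate wraps a GPT-4 chat completion; keeping them injectable makes the loop easy to test with stubs.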

Based on our initial qualitative testing, we found that allowing LLMs to access the Development Matters guidance yielded higher-quality exercises that were relevant to the queried learning goal.

Building an early-years activity recommender

Now that we have indexed the Development Matters learning goals and examples into our Pinecone database and have set up RAG, we can build the early-years activity recommender in Streamlit.

Users must fill in the following fields for the LLM to recommend early-years activities:

  • Age group: 0–3, 3–4 and 4–5 years old, as defined in the Development Matters guidance.
  • Theme: The topic of the activity.
  • Learning goal: Users can select a learning goal that’s listed in the Development Matters guidance or write one in natural language. We will describe below how the workflow differs for each option.

Selecting a learning goal from Development Matters

Users select an age group, an area of learning and one or more learning goals, as those are found in the Development Matters guidance. These learning goals are the input to our RAG framework; we vectorise each learning goal with OpenAI’s text-embedding-ada-002, search the Pinecone database for relevant examples and keep the ones with the smallest Euclidean distance from the learning goal.

Writing your own learning goal

Instead of selecting a learning goal, users can write their own in natural language. Our prototype will vectorise the text input and search for the most similar learning goals listed in the Development Matters guidance. Then, as we described above, our prototype finds the most relevant examples for each learning goal and returns five of them.
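This two-step retrieval can be sketched as below. The search callable stands in for an embed-then-query round trip to Pinecone, and the "goal" type value is an assumption for illustration (the query shown earlier only demonstrates the "examples" type):

```python
from typing import Callable, List

def examples_for_free_text(
    query: str,
    search: Callable[..., List[str]],
    top_k_goals: int = 3,
    n_examples: int = 5,
) -> List[str]:
    """Two-step retrieval: free-text query -> similar learning goals -> their examples."""
    # Step 1: find the learning goals closest to the user's text
    goals = search(query, type_="goal", top_k=top_k_goals)

    # Step 2: gather the examples attached to each matched goal
    examples: List[str] = []
    for goal in goals:
        examples.extend(search(goal, type_="examples", top_k=n_examples))

    # Keep the five most relevant examples overall
    return examples[:n_examples]
```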

Prompting GPT-4 and asking follow-up questions

Finally, users need to provide a theme for the activities which will be added to the prompt along with the retrieved examples, the age group and the areas of learning. This is a key step that allows the user to personalise the activities drawn from the curriculum to the child’s interests and preferences.

Here is what the prompt looks like.

"""
###Instructions###
I will describe a theme below. Generate fun and exciting activities related to UK's Early Years Foundation Stage (EYFS) framework and the Development Matters guidance.

###Examples###
The activities you generate must be inspired by the examples below:
{examples}

###Constraints###
- The activities must be suitable for kids in the age groups: {age_groups}.
- The activities must be related to the areas of learning: {areas_of_learning}.
- You must generate 5 activities.
- You must describe each activity in 5-6 sentences.

###Description###
{theme_description}

###Formatting###
You must format the activities as follows:
## <activity_name>
<activity_description>

## <activity_name>
<activity_description>
"""

The LLM will then generate five activities and describe them in a few sentences. Users can ask a follow-up question for any of the activities like “Give me detailed instructions on how to play X activity. What materials would I need for it?”.

What’s next

Introducing RAG into our prototype helped us produce some pretty fun early-years activities that were grounded in England’s Development Matters guidance. Given that different early-years curricula are used across the UK’s four countries, a real-world implementation of such a prototype should either include text data from all curricula or offer the option to choose the curriculum most relevant to the user’s setting. It would also be an interesting exercise to compare the activities generated using different curricula as the knowledge base.

Moreover, we would like to build a chat interface for this work as we did in previous prototypes as well as add a memory component so that the LLM considers the conversation trail before responding.

As always, you can find our work on GitHub.

Get in touch if you would like to find out more or give us feedback!
