AI breakthrough on the cards?

EntiGraph: fixing AI’s 3 big problems?

Thack · Published in Thacknology · 3 min read · Sep 12, 2024


⚠️ HU-AI-GE! ⚠️

We can broadly define AI’s biggest challenges as:

  1. Hallucination — stop lying, dude!
  2. Lack of massive datasets — people want to keep their stuff private
  3. Energy thirst — have you seen Elon’s H100 cluster?

What if we’d already discovered a solution to all three?

First, the dataset dearth. Solve it with smarter algorithms and synthetic data, and you reduce the training effort, which in turn cuts energy demand. And richer, better-structured training data means fewer hallucinations when the AI strains to find answers that satisfy a prompt.

🤯

Moonshot — or the magic of marvellous minds?

Stanford University researchers have been working on a new, efficient way to make AI brainier — using synthetic data expressing how knowledge can be represented.

Currently: “A 13-year-old human acquires knowledge from fewer than 100M tokens, while state-of-the-art open-source language models are trained on 15T tokens.”

The traditional approach to training AI, particularly large language models, has been akin to force-feeding a student with an entire library of books, hoping they’ll absorb enough knowledge to answer any question.

This method is incredibly data-inefficient and unsustainable.

Enter Stanford’s EntiGraph — a cleaner, more efficient, three-step approach:

  1. Identifying key concepts (entity extraction) — like spotting the main characters and important objects in a story.
  2. Elaborating on each concept (single entity description) — creating short stories or explanations around each concept, highlighting its relevance, role, and importance. This is like giving the student background information on each character and object in the story. An extended mind map, if you will.
  3. Exploring relationships (relation analysis) — creating scenarios or examples demonstrating how different concepts interact and influence each other. This is like showing the student how the characters and objects in the story relate and affect the plot.

By creating these additional materials, you’ve expanded the original summary into a more comprehensive and engaging learning resource, making it easier for the student (AI) to understand the subject.

How this could transform your business:

  • You’ve got a huge repository of internal documents, reports, and presentations (the small dataset).
  • You want to train an AI assistant to answer questions and provide insights based on this information. However, directly training the AI on this data might be inefficient due to its limited size and specialised nature.
  • Using either Glean (love you guys!) or the synthetic continued pretraining approach, your business could:
  1. Extract key entities like product names 🛍️, project codes 📊, and employee roles 👥 from documents.
  2. Generate descriptions for each entity 📝, explaining its significance within your business’ context.
  3. Create stories on how different entities interact 🔄, such as how a product launch 🚀 affected sales figures 📈 or a project delay ⚠️ impacted morale. 😊
[Image: Mermaid diagram showing how a business might use EntiGraph to create an answer-everything LLM from its limited datasets.]

It’s early days — but initial results show the potential to achieve comparable performance to traditional methods, with a fraction of the data.

Read the research report: https://arxiv.org/pdf/2409.07431

Want more adventures in AI? Follow me…
