Stories by Matteo Tuzi on Medium

MemoryModel Benchmark Results on LoCoMo Dataset

Matteo Tuzi — Mon, 22 Dec 2025 17:22:24 GMT

This document presents the benchmark evaluation of MemoryModel on the [LoCoMo dataset (Long Conversational Memory)], a benchmark designed to evaluate long-term conversational memory in AI systems. On this benchmark, MemoryModel outperforms the baseline Mem0 implementation by 7.7 percentage points (74.6% vs. 66.9% accuracy) and OpenAI’s Memory by over 20%.

Benchmark Configuration: LoCoMo Topology

User-Defined Memory Nodes

MemoryModel is a fully schema-agnostic engine. Users define custom memory types, extraction prompts, and embedding templates directly through the MemoryModel Console (our web-based configuration interface); no code changes are required.

For this benchmark, we configured a 4-node topology via the console, optimized for conversational biography extraction from the LoCoMo dataset:

Node 1: “temporal_event”

User-Defined Extraction Prompt:

You are a Senior NLP Specialist and Temporal Reasoning Engine. Your task is to extract events from the conversation and resolve ALL relative time references into ISO 8601 absolute dates (YYYY-MM-DD).

### STEP 1: ESTABLISH THE ANCHOR DATE
The system has already processed the context and calculated the correct reference date for this session.

**Current Context Date:** {{CURRENT_DATE}}

**HIERARCHY OF TRUTH:**
1.  **SYSTEM ANCHOR (DEFAULT):** Use the "Current Context Date" provided above as your mathematical Anchor for "today".
2.  **NARRATIVE OVERRIDE (EXCEPTION):** ONLY if the user *explicitly* changes the timeline in the text (e.g., "Imagine it is 1990", "Back in 2012...", "Assume today is Nov 14"), use that specific narrative date instead.

### STEP 2: EXTRACT AND CALCULATE
Extract every event. For each event involving a time reference:
1. Identify the relative phrase (e.g., "last Tuesday", "three days ago", "next week").
2. Perform date arithmetic using the Anchor Date.
   - Example: If Anchor is 2023-07-12 (Wednesday) and text says "two days ago", calculation is 2023-07-10.
   - Example: "Tomorrow" = Anchor + 1 day.

OUTPUT FORMAT
Return a valid JSON array ordered CHRONOLOGICALLY.
[
{
"event_description": "Self-contained description including key details (what, why, result), specific objects/contents (e.g., what a sign said), and emotional states. Do NOT result to vague summaries.",
"absolute_date": "YYYY-MM-DD" (Calculated. Use null ONLY if date is impossible to infer),
"original_time_expression": "The verbatim relative phrase used in text",
"location": "Location or null",
"participants": ["Name 1", "Name 2"],
"context_evidence": "Verbatim text span"
}
]

### CRITICAL RULES
- **ISO 8601 ONLY:** The 'absolute_date' MUST be in YYYY-MM-DD format.
- **CALCULATE:** Do not be lazy. "Three days ago" must become a specific date.
- **OUTPUT:** Output ONLY the JSON array.

Input Text:
...

Embedding Template:

Timestamp: {{absolute_date}} (Ref: {{original_time_expression}}) | Event: {{event_description}} | **Details: {{context_evidence}}** | Participants: {{participants}} | Location: {{location}}
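To make the role of the embedding template concrete, here is a minimal sketch of how `{{field}}` placeholders could be filled from one extracted record before embedding. The `renderTemplate` helper, the `MemoryRecord` type, and the "unknown" fallback for missing fields are assumptions made for illustration, not MemoryModel's actual implementation; the template is shortened for readability.

```typescript
// Hypothetical helper: fills {{field}} placeholders from one extracted record.
type MemoryRecord = Record<string, string | string[] | null>;

function renderTemplate(template: string, record: MemoryRecord): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, field: string) => {
    const value = record[field];
    // Assumed fallback: null or absent fields are rendered as "unknown".
    if (value === null || value === undefined) return "unknown";
    // Array fields (e.g. participants) are joined into a readable list.
    return Array.isArray(value) ? value.join(", ") : value;
  });
}

// Shortened version of the temporal_event template.
const template =
  "Timestamp: {{absolute_date}} (Ref: {{original_time_expression}}) | Event: {{event_description}} | Participants: {{participants}} | Location: {{location}}";

const rendered = renderTemplate(template, {
  absolute_date: "2023-07-10",
  original_time_expression: "two days ago",
  event_description: "Attended a pottery workshop",
  participants: ["Alice", "Bob"],
  location: null,
});
console.log(rendered);
```

The rendered string, rather than the raw JSON, is what would be passed to the embedding model, so every field named in the template contributes to retrieval.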

Node 2: “profile_attribute”

User-Defined Extraction Prompt:

You are a Senior Profiling Specialist. Your goal is to extract structured biographical data (Attributes) from text.
Your output MUST be a valid JSON array.
CORE PHILOSOPHY: SEMANTIC SELF-SUFFICIENCY
Every extracted attribute must make sense in isolation.
BAD (Too vague): "Colors", "Agencies", "Running".
GOOD (Self-sufficient): "Vibrant colors in projects", "Adoption agencies for couples", "Running (as a self-care routine)".
DOMAINS
Target STRICTLY these domains:
Possessions & Assets (Vehicles, Real Estate, Tech, Collections)
Media & Culture (Specific Titles of Books, Movies, Games, Music, Artists/Bands)
Preferences & Favorites (Foods, Brands, Colors, Aesthetics)
Activities & Hobbies (Sports, specific crafts/skills, recurrent habits)
Life Goals & Logistics (Career plans, Major life changes like adoption/moving, Education)
Living Beings (Pets, Family members AND their specific attributes/traits)
Medical & Biological (Conditions, Allergies, Physical traits)
EXTRACTION RULES (Field by Field)
"entity_name": The specific subject the fact refers to.
Resolve pronouns: (e.g. "I" -> "John Doe").
ENTITY SEPARATION RULE: If the text describes a trait of a family member/pet, create a separate entity (e.g., "John's Wife").
"category": The most specific category available.
"value": The SPECIFIC entity, title, brand, or noun + CONTEXT.
CONTEXTUALIZATION RULE (CRITICAL): You MUST include the specific qualifying details (adjectives, purpose, target audience).
Text: "I'm looking for adoption agencies that support LGBTQ+ folks."
Output: "Adoption agencies (specifically supporting LGBTQ+ individuals)".
LIST INHERITANCE RULE: When splitting a list, attach the parent context to EACH item.
Text: "I prioritize self-care by running, reading, and cooking."
Output 1: "Running (for self-care)"
Output 2: "Reading (for self-care)"
Output 3: "Cooking (for self-care)"
"acquisition_date": YYYY-MM-DD if explicitly mentioned, else null.
"context_evidence": The Source of Truth.
Include the FULL sentence(s).
MANDATORY: Keep the "why", "how", or emotion attached to the fact.
CRITICAL CONSTRAINTS
Ambiguity Check: Resolve "It" or "They" to specific nouns in the 'value' field.
List Handling: Split "sushi, pizza and tacos" into 3 separate objects.
Factuality: Ignore vague opinions; focus on concrete habits, preferences, or plans.
Output ONLY the JSON array.
Input Text:
...

Embedding Template:

Entity: {{entity_name}} | Category: {{category}} | Attribute: {{value}} | Details: {{context_evidence}} | Acquired: {{acquisition_date}}

Node 3: “career_milestone”

User-Defined Extraction Prompt:

You are a Senior Career & Progression Analyst. Your task is to extract structured data regarding the professional, creative, employment, and **major life undertakings** of speakers.

Your scope includes:
1. **Projects & Endeavors:** Creative works, business initiatives, research, activism, volunteering.
2. **Career Events (Pivots):** Hiring, firing, resignations, promotions, job applications, rejections.
3. **Major Processes:** Long-term bureaucratic or personal processes (e.g., Adoption process, Immigration, Certification).

Your output MUST be a valid JSON array. For each entry:

1. "agent": The person or entity involved. Resolve pronouns.
2. "project_or_event_name": The specific name or nature of the endeavor.
   - **SPECIFICITY RULE:** If the project targets a specific audience, niche, or community, YOU MUST INCLUDE IT.
   - *Bad:* "Counseling", "Writing a book", "Activism".
   - *Good:*  "Sci-Fi Novel about AI"
3. "type": Categorize strictly: "Creative", "Business", "Career Event", "Educational", **"Social/Civic"**, **"Life Process"**.
4. "status": Current state (e.g., "In Progress", "Completed", "Abandoned", "Rejected", "Successful", "Planned").
5. "timeframe": Extract any mention of WHEN (e.g., "last year", "currently"). If none, use `null`.
6. "motivation_or_cause": The 'Why'.
   - **CRITICAL:** Capture the SPECIFIC catalyst, origin story, or internal drive.
   - Look for connections between past experiences and current goals (e.g. "Inspired by her own childhood support" is better than "Wants to help").
7. "outcome": The result or current sentiment regarding the outcome.
8. "context_evidence": **The Source of Truth.**
   - Include the full sentence(s).
   - If the motivation/cause is mentioned in a sentence *before* or *after* the project mention, INCLUDE IT HERE to make the memory self-contained.

CRITICAL CONSTRAINTS:
- Capture PASSIVE events (getting fired, rejected) just as carefully as ACTIVE projects.
- Output ONLY the JSON array.

Input Text:
...

Embedding Template:

Agent: {{agent}} | Project: {{project_or_event_name}} ({{type}}) | Status: {{status}} | Motivation: {{motivation_or_cause}} | Details: {{context_evidence}} | Outcome: {{outcome}}

Node 4: “social_connection”

User-Defined Extraction Prompt:

You are a Social Graph Specialist. Your task is to extract interpersonal relationships between speakers and third parties mentioned in the text.
TARGET: Focus strictly on People-to-People connections (Family, Friends, Colleagues, Rivals).
IGNORE: People-to-Location connections (e.g., "John is in Paris").
Your output MUST be a valid JSON array. For each relationship found:
"primary_entity": The subject of the relationship. Resolve pronouns to names (e.g. "She" -> "Mary").
"related_entity": The other person involved.
"relationship_type": The specific social role (e.g., "Friend", "Brother", "Employer", "Mentor", "Nemesis"). Avoid generic terms like "knows" if a specific role is clear.
"relationship_details": Extract factual attributes defining the bond, such as duration (e.g., "for 20 years"), origin (e.g., "childhood friends"), or status (e.g., "long-distance", "estranged"). If no specific detail is mentioned, use null.
"interaction_event": Briefly describe the dynamic action or activity occurring in this specific text (e.g., "arguing over dinner", "planning a trip").
"sentiment_tone": The emotional quality of their interaction/relationship in this text. Select strictly from: ["Positive", "Negative", "Neutral", "Conflictual", "Supportive"].
"context_evidence": The VERBATIM text snippet supporting this extraction.
CRITICAL CONSTRAINTS:
Output ONLY valid JSON.
If no social relationships are mentioned, return [].
Do not extract relationships involving objects or places.
Distinguish between what they ARE doing (interaction_event) and facts about their bond (relationship_details).
Input Text:
...

Embedding Template:

{{primary_entity}} is {{relationship_type}} of {{related_entity}} [Details: {{relationship_details}}] | Sentiment: {{sentiment_tone}} | Interaction: {{interaction_event}} | Evidence: {{context_evidence}}

Results comparison with other systems:

https://medium.com/media/a3a39397e303e1d4500296fb37a2e9c9/href

Analysis of Results
MemoryModel outperforms the baseline Mem0 implementation by 7.7 percentage points (74.6% vs. 66.9% accuracy) and OpenAI’s Memory by over 20%.

The performance gap stems primarily from our architectural divergence in handling temporal reasoning. While systems like Mem0 rely on the LLM to calculate dates at query time (runtime calculation), MemoryModel adopts a “Shift-Left” approach: we resolve relative time expressions (e.g., “three days ago”) into ISO 8601 absolute dates during the ingestion phase. This deterministic pre-computation eliminates the hallucination risks associated with real-time arithmetic in LLMs.
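The arithmetic behind this ingestion-time resolution can be sketched in a few lines. In MemoryModel the resolution is driven by the `temporal_event` extraction prompt; the `resolveRelativeDate` function below is a hypothetical stand-in that handles only a handful of phrases, and is meant to show the calculation rather than the system's parser.

```typescript
// Illustrative only: resolves a few relative expressions against an anchor date.
const MS_PER_DAY = 24 * 60 * 60 * 1000;
const NUMBER_WORDS: Record<string, number> = {
  one: 1, two: 2, three: 3, four: 4, five: 5, six: 6, seven: 7,
};

function resolveRelativeDate(expression: string, anchorIso: string): string | null {
  // Work in UTC so daylight-saving changes cannot shift the result.
  const anchor = Date.parse(`${anchorIso}T00:00:00Z`);
  if (Number.isNaN(anchor)) return null;
  const text = expression.trim().toLowerCase();

  let offsetDays: number | null = null;
  if (text === "today") offsetDays = 0;
  else if (text === "yesterday") offsetDays = -1;
  else if (text === "tomorrow") offsetDays = 1;
  else {
    // Matches phrases such as "3 days ago" or "two weeks ago".
    const match = /^(\d+|[a-z]+) (day|week)s? ago$/.exec(text);
    if (match) {
      const count = /^\d+$/.test(match[1]) ? Number(match[1]) : NUMBER_WORDS[match[1]];
      if (count !== undefined) offsetDays = -count * (match[2] === "week" ? 7 : 1);
    }
  }
  // Unrecognized phrases stay unresolved instead of being guessed.
  if (offsetDays === null) return null;
  return new Date(anchor + offsetDays * MS_PER_DAY).toISOString().slice(0, 10);
}

console.log(resolveRelativeDate("two days ago", "2023-07-12")); // 2023-07-10
```

Once the absolute date is stored with the memory, no date arithmetic is needed when a question arrives.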

Methodology

Dataset

Name: LoCoMo (Long Conversational Memory)
Source: [snap-research/locomo]
Size: 50 long conversations (~300 turns, ~9,000 tokens each)
Sessions: Up to 35 sessions per conversation
Questions: 1,986 questions for evaluation

Evaluation Metrics:

https://medium.com/media/570a9d0f4b270f2af3fb4a81677955f3/href

Question Categories:

Single-Hop: Questions answerable from a single conversational turn/session
Multi-Hop: Questions requiring synthesis across multiple sessions
Temporal: Questions involving time-based reasoning and chronological awareness
Open-Domain: Questions requiring external knowledge integration

Implementation Details

Key Differences from Mem0:

https://medium.com/media/916f75cb20a17e1e13f0586bc6535fdb/href

Architectural Approach to Temporal Reasoning:

A key differentiator between MemoryModel and Mem0 lies in how temporal information is handled.

Mem0’s Approach: Runtime Calculation

Mem0 stores memories with relative time expressions intact (e.g., “last year”, “two months ago”). During answer generation, its benchmark prompt must instruct the LLM to perform the temporal reasoning:

# INSTRUCTIONS (from Mem0 benchmark prompt):
5. If there is a question about time references (like "last year", "two months ago", 
   etc.), calculate the actual date based on the memory timestamp.
6. Always convert relative time references to specific dates, months, or years.
   For example, convert "last year" to "2022" or "two months ago" to "March 2023" 
   based on the memory timestamp.

This approach requires:

A 400-word prompt with step-by-step reasoning instructions
The LLM to calculate dates at query time from relative expressions
Explicit handling of multi-speaker contexts and contradictory timestamps

MemoryModel’s Approach: Pre-Computed Temporal Indexing

MemoryModel resolves temporal references at ingestion time, not at query time:

[Figure: MemoryModel’s Approach: Pre-Computed Temporal Indexing]

Benefits of this architecture:

https://medium.com/media/87df3b0b8cbd83ef6907210f5330e6b5/href

This explains why our simpler answer generation prompt achieves higher accuracy (74.6% vs 66.9%):

The heavy lifting of temporal reasoning is done once during ingestion by the specialized `temporal_event` node, using NLP date parsing. The retrieval system then uses direct temporal range filters on pre-computed ISO dates, eliminating the need for runtime LLM calculations.
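A temporal range filter over pre-computed dates can be very small, because ISO 8601 dates in `YYYY-MM-DD` form sort chronologically when compared as strings. The sketch below assumes an in-memory list; the `TemporalMemory` shape and `filterByDateRange` name are hypothetical, and a production system would more likely push the same comparison down into its store.

```typescript
// Hypothetical record shape, reusing field names from the temporal_event node.
interface TemporalMemory {
  absolute_date: string | null;
  event_description: string;
}

function filterByDateRange(
  memories: TemporalMemory[],
  fromIso: string,
  toIso: string,
): TemporalMemory[] {
  return memories
    // Memories without a resolved date cannot take part in a range query.
    .filter((m): m is TemporalMemory & { absolute_date: string } => m.absolute_date !== null)
    // YYYY-MM-DD strings compare lexicographically in chronological order.
    .filter((m) => m.absolute_date >= fromIso && m.absolute_date <= toIso)
    .sort((a, b) => (a.absolute_date < b.absolute_date ? -1 : a.absolute_date > b.absolute_date ? 1 : 0));
}

const firstHalf2023 = filterByDateRange(
  [
    { absolute_date: "2023-07-10", event_description: "Pottery workshop" },
    { absolute_date: "2023-05-07", event_description: "Charity race" },
    { absolute_date: null, event_description: "Undated anecdote" },
    { absolute_date: "2022-12-31", event_description: "New Year's Eve party" },
  ],
  "2023-01-01",
  "2023-06-30",
);
console.log(firstHalf2023.map((m) => m.event_description));
```

No LLM call is involved in this step, which is the point of moving the date resolution to ingestion.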

This approach embodies the “Shift-Left” principle: moving reasoning complexity from query-time (slow, expensive, non-deterministic) to ingestion-time (one-off, deterministic). Unlike rigid memory systems, MemoryModel allows developers to define extraction logic per-node through the console, enabling domain-specific optimizations without code deployment.

Memory Ingestion Pipeline:

The ingestion system processes content through a multi-node extraction architecture:

[Figure: MemoryModel LoCoMo Ingestion Pipeline]

Key Components:

Extraction Engine: Dynamically loads user-defined schemas from the MemoryModel Console and runs them in parallel. For this benchmark, we configured 4 semantic definitions targeting biography extraction.
Multi-Node Processing: Each node extracts typed structured memories using its user-defined prompt
Rate Limiting: Built-in retry with exponential backoff for API resilience
Multi-modal Support: Separate processing pipeline for visual memories with reference matching
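The parallel execution and retry behavior described above can be sketched as follows. All names (`withRetry`, `runNodes`, `NodeExtractor`) and the default delays are assumptions for illustration; the article does not specify the actual retry parameters.

```typescript
// Delay doubles on each attempt and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number, maxMs: number): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries a failing async task, waiting longer after each failure.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  baseMs = 500,
  maxMs = 8000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No point in sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt, baseMs, maxMs));
    }
  }
  throw lastError;
}

// Hypothetical signature: one extractor per configured memory node.
type NodeExtractor = (text: string) => Promise<unknown[]>;

// Runs every node on the same text concurrently, each with its own retry loop.
async function runNodes(
  text: string,
  nodes: Record<string, NodeExtractor>,
): Promise<Record<string, unknown[]>> {
  const names = Object.keys(nodes);
  const results = await Promise.all(names.map((name) => withRetry(() => nodes[name](text))));
  const byNode: Record<string, unknown[]> = {};
  names.forEach((name, i) => {
    byNode[name] = results[i];
  });
  return byNode;
}
```

Because each node is wrapped separately, a rate-limit error in one extractor delays only that node rather than the whole batch.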

Retrieval Strategies:

The retrieval uses a hybrid multi-strategy orchestrator:

[Figure: MemoryModel Retrieval Strategies]

Search Strategies:

https://medium.com/media/bb34bf8ed712b17334171bd1cdabe1da/href

Relevance Router: LLM-based semantic scoring to dynamically decide which memory nodes are most relevant to each query.

Answer Generation Prompt

The evaluation uses `gemini-2.5-flash` with temperature `0.0` for deterministic answers:

You are a helper assistant answering questions based on a set of retrieved memory fragments.

Context:
${contextText}

Question: ${question}

Instructions:
1. Answer the question using ONLY the provided context.
2. **Inference Allowed:** You may perform reasonable logical inferences if strongly supported by the text.
3. **Safety:** If the answer is completely missing or cannot be reasonably inferred, strictly say "I don't know".
4. **Style:** Be concise and direct.

LLM-as-Judge Evaluation

We use a semantic judge following Mem0’s evaluation methodology:

Role: You are an impartial semantic judge evaluating a Question Answering system.

Context:
- Question: "${question}"
- Ground Truth: "${truthStr}"
- Predicted Answer: "${predStr}"

Task: Determine if the Predicted Answer conveys the SAME meaning as the Ground Truth.

Evaluation Rules (Be Flexible):
1. **Dates:** Treat "2023-05-07", "May 7th, 2023", "7/5/23" as EQUIVALENT.
2. **Synonyms:** "Happy" == "Joyful", "Scared" == "Afraid".
3. **Verbosity:** If the Prediction is long but contains the correct answer, it is CORRECT.
4. **Lists:** If the Truth is a list, the Prediction must contain the key items.
5. **Negation:** Watch out for "NOT". "He went" != "He did not go".

Output: Respond ONLY with "YES" if correct, or "NO" if incorrect.

Fast-Pass Optimization:

String inclusion check before LLM judge (normalized, punctuation-stripped)
“I don’t know” trap detection to catch abstention failures
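A minimal sketch of such a fast-pass check is shown below. The exact normalization and the interpretation of the "I don't know" trap (an abstention counts as incorrect whenever a ground truth exists) are assumptions; the function names and the `NEEDS_JUDGE` verdict are hypothetical.

```typescript
type FastPassVerdict = "YES" | "NO" | "NEEDS_JUDGE";

// Lowercases, strips punctuation, and collapses whitespace.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\s+/g, " ")
    .trim();
}

function fastPass(groundTruth: string, prediction: string): FastPassVerdict {
  const truth = normalize(groundTruth);
  const predicted = normalize(prediction);
  // Assumed trap rule: an empty answer or an abstention is marked incorrect.
  if (predicted === "" || predicted.includes("i dont know")) return "NO";
  // String inclusion: the normalized truth appears verbatim in the prediction.
  if (truth !== "" && predicted.includes(truth)) return "YES";
  // Everything else is deferred to the LLM judge.
  return "NEEDS_JUDGE";
}

console.log(fastPass("May 7th, 2023", "It was on May 7th, 2023."));
console.log(fastPass("Paris", "I don't know"));
console.log(fastPass("2023-05-07", "May 7th, 2023"));
```

Only the third case reaches the judge, since equivalent dates in different formats cannot be matched by string inclusion alone.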

Reproducibility

The benchmark scripts are open-source and available in the `benchmark/` folder.

Requirements

Node.js 18+
Your own MemoryModel API key
Your own Gemini API key (for evaluation)

Running the Benchmark

# Set environment variables
export MEMORY_API_KEY=your_memorymodel_api_key
export GEMINI_API_KEY=your_gemini_api_key

# Ingest the LoCoMo dataset
npx ts-node benchmark/benchmark_ingest.ts

# Run evaluation
npx ts-node benchmark/benchmark_eval.ts

References:

LoCoMo Dataset: [Evaluating Very Long-Term Conversational Memory of LLM Agents]
Mem0 Paper: [Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory]
Mem0 Documentation: [docs.mem0.ai]

Citation:

If you use MemoryModel in your research, please cite:

@misc{memorymodel2025,
  title={MemoryModel: The first autonomous memory architecture},
  author = {Tuzi, Matteo},
  year={2025}
}

Inside the Architecture of a Self-Optimizing AI Memory System

Matteo Tuzi — Thu, 04 Dec 2025 16:26:53 GMT

Introduction

Over the past 18 months, I’ve dedicated myself to solving what I call the “AI Memory Rebus” — the fundamental challenge of making artificial intelligence truly remember and understand context the way humans do.

What started as frustration with AI forgetting conversations mid-dialogue evolved into a deep technical exploration: Why do AI systems, despite their sophistication, fail so catastrophically at context retention? The answer wasn’t in model architecture or parameter count — it was in how we store, retrieve, and synthesize memory itself.

Traditional approaches treat memory as a static database problem. Vector search finds similar text. Key-value stores retrieve facts. But human memory is adaptive, contextual, and self-organizing. We don’t search our memories with SQL queries — we fluidly shift between precise recall (“What was that invoice number?”) and exploratory thinking (“What movies do I like?”).

Memory Model is the architecture that emerged from solving this rebus — a production system that brings adaptive intelligence to AI memory management. This document presents the high-level design of that system.

— Matteo Tuzi, Founder & CTO

Platform Overview

Memory Model is an enterprise-grade AI memory platform that solves the fundamental problem of context retention in LLM applications. Unlike simple vector databases, we provide adaptive intelligence that learns user behavior patterns and self-optimizes retrieval quality over time — all through a zero-code visual interface.

Core Innovation:

Adaptive Retrieval: System automatically adjusts search strategy based on query intent and user memory patterns
Self-Optimization: Machine learning-powered parameter tuning eliminates manual configuration
Meta-Cognitive Insights: Detects behavioral patterns humans would recognize but traditional search misses
Zero-Code Platform: Complete system configuration through visual console with AI-powered automation

1. System Architecture — High Level

[Figure: System Architecture Overview]

Key Architectural Principles:

Separation of Concerns: Clear boundaries between ingestion, retrieval, and optimization
Distributed Processing: Async queues handle high-volume ingestion without blocking
Dual-Write Pattern: Ensures consistency between metadata and vector stores
Horizontal Scalability: Stateless API layer enables automatic scaling

2. Memory Ingestion Flow

[Figure: Memory Ingestion Flow]

Innovation: Multi-Stage Semantic Enrichment

Traditional systems embed raw text directly. Memory Model applies bidirectional semantic expansion:

For Memories: Enriches content with implicit semantics (e.g., “Blade Runner” → genre, themes, related concepts)
For Queries: Expands with synonyms and related terms (e.g., “AI” → related terminology)
Adaptive Mode Detection: System automatically determines which enrichment strategy to apply

Technical Approach:

LLM-powered context injection before embedding
Heuristic-based query vs memory detection
Maintains semantic coherence while improving recall

Business Impact:

40–60% improvement in retrieval relevance
Eliminates “terminology mismatch” problem (user says “AI”, system finds “machine learning”)

3. Adaptive Retrieval System

[Figure: Adaptive Retrieval System]

Core Innovation: Centroid-Aware Adaptive Search

Problem: Static vector search fails to adapt to user-specific memory distributions. A query about “favorite movies” should search differently than “invoice #12345”.

Our Solution:

System maintains a semantic centroid (center of mass) for each memory type
Compares query similarity to centroid vs similarity to top results
Automatically decides between:
META Mode: Broad, exploratory search (returns 10+ results)
SPECIFIC Mode: Precision-focused search (returns 2–3 exact matches)
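One way to picture the centroid comparison is the sketch below. The decision rule (choose SPECIFIC when the best match stands out from the centroid by more than a margin), the margin value, and the result counts are assumptions chosen for illustration; the article does not publish the actual decision function.

```typescript
type Vector = number[];

function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Component-wise mean of the stored memory embeddings.
function centroid(vectors: Vector[]): Vector {
  const sum = vectors[0].map(() => 0);
  for (const v of vectors) v.forEach((x, i) => { sum[i] += x; });
  return sum.map((x) => x / vectors.length);
}

interface SearchPlan {
  mode: "META" | "SPECIFIC";
  topK: number;
}

function chooseSearchMode(query: Vector, memories: Vector[], margin = 0.05): SearchPlan {
  if (memories.length === 0) return { mode: "META", topK: 10 };
  const simToCentroid = cosineSimilarity(query, centroid(memories));
  const topSim = Math.max(...memories.map((m) => cosineSimilarity(query, m)));
  // Assumed rule: a best match clearly above the centroid signals a precise lookup.
  return topSim - simToCentroid > margin
    ? { mode: "SPECIFIC", topK: 3 }
    : { mode: "META", topK: 10 };
}

const memories: Vector[] = [[1, 0], [0, 1]];
console.log(chooseSearchMode([1, 1], memories).mode);    // query near the centroid
console.log(chooseSearchMode([1, 0.05], memories).mode); // query near one memory
```

A query that resembles the memory type as a whole lands in META mode, while a query that resembles one particular memory lands in SPECIFIC mode.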

Mathematical Foundation:

Based on control theory decision functions
Configurable similarity thresholds with safety margins
Validated stable convergence properties

Why This Matters:

No manual configuration of “k” parameter per query
Naturally handles diverse query types (factual vs exploratory)
Improves relevance by 30–50% vs baseline semantic search

4. Self-Optimizing Intelligence

Innovation 1: The Architect (Auto-Tuning System)

Problem: Static thresholds degrade as user behavior evolves. Manual tuning requires ML expertise.

Our Approach:

Telemetry-Driven: Logs every retrieval decision and outcome
LLM Strategy Analysis: Detects patterns in what works/doesn’t work
Controlled Application: Uses dampening functions from control theory to prevent oscillation
Bounded Optimization: Hard constraints prevent system from suggesting invalid parameters
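The combination of dampening and hard bounds can be illustrated with a simple damped update. Exponential smoothing is used here as a stand-in for the unspecified dampening functions; `dampedUpdate`, the `Bounds` type, and the numeric values are all hypothetical.

```typescript
interface Bounds {
  min: number;
  max: number;
}

function clamp(value: number, bounds: Bounds): number {
  return Math.min(bounds.max, Math.max(bounds.min, value));
}

// Moves a parameter only part of the way toward a suggested value.
// damping is expected to lie in (0, 1]; smaller values change the parameter more slowly.
function dampedUpdate(current: number, suggested: number, damping: number, bounds: Bounds): number {
  // Hard constraint: an out-of-range suggestion is pulled back into bounds first.
  const target = clamp(suggested, bounds);
  const next = current + damping * (target - current);
  return clamp(next, bounds);
}

// A suggestion of 0.90 for a similarity threshold limited to [0.50, 0.85].
let threshold = 0.7;
for (let step = 0; step < 5; step++) {
  threshold = dampedUpdate(threshold, 0.9, 0.25, { min: 0.5, max: 0.85 });
  console.log(threshold.toFixed(4));
}
```

Each step closes a fixed fraction of the remaining gap, so the parameter approaches the bounded target without overshooting it.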

Technical Characteristics:

Convergence time: 7–10 days to optimal configuration
Stability: Provably stable (Lyapunov analysis)
Zero manual intervention required

Business Value:

Eliminates need for dedicated ML engineer to tune system
Continuous improvement as usage patterns evolve
15–20% accuracy improvements observed over 2 weeks

Innovation 2: The Dreamer (Meta-Cognitive Insights)

Problem: AI systems accumulate facts (“User bought milk”, “User searched keto recipes”) but fail to synthesize higher-order patterns humans recognize (“User transitioning to ketogenic diet”).

Our Approach:

Temporal Aggregation: Analyzes recent memories for cross-domain patterns
LLM Pattern Detection: Identifies behavioral shifts, emerging interests, value changes
Confidence Filtering: Only stores high-confidence insights (threshold: 0.70+)
First-Class Storage: Insights become searchable memories themselves
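The confidence filter and the promotion of insights to searchable memories can be sketched as below. The `Insight` and `StoredMemory` shapes and the `meta_insight` type label are assumptions; only the 0.70 threshold comes from the text above.

```typescript
// Hypothetical shapes for a candidate insight and a stored memory.
interface Insight {
  summary: string;
  confidence: number;
}

interface StoredMemory {
  type: "meta_insight";
  content: string;
  confidence: number;
}

const MIN_CONFIDENCE = 0.7;

// Keeps only high-confidence insights and converts them into memory records.
function insightsToMemories(candidates: Insight[], minConfidence = MIN_CONFIDENCE): StoredMemory[] {
  return candidates
    .filter((candidate) => candidate.confidence >= minConfidence)
    .map((candidate) => ({
      type: "meta_insight" as const,
      content: candidate.summary,
      confidence: candidate.confidence,
    }));
}

const stored = insightsToMemories([
  { summary: "User transitioning to ketogenic diet", confidence: 0.82 },
  { summary: "User may be planning a move", confidence: 0.55 },
]);
console.log(stored);
```

Only the first candidate is kept, and it is stored in the same form as any other memory so that later queries can retrieve it.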

Pattern Recognition:

Cross-domain synthesis (Shopping + Health → Lifestyle change)
Temporal trends (increasing/decreasing interest)
Entity-based linking (common themes across activities)

Business Applications:

Customer Support: Early churn risk detection
E-Commerce: Upsell opportunity identification
Health: Lifestyle change tracking

5. Security & Compliance

Data Isolation:

Cryptographic tenant separation
Zero cross-user data access
Project-level access controls

Privacy Controls:

User-controlled meta-insight generation (can be disabled)
Data portability (full export capability)
Right to deletion (GDPR compliant)

Infrastructure Security:

Encrypted at rest and in transit
Regular security audits
SOC 2 Type II (in progress)

6. Platform Philosophy

Configuration Over Code

No programming required for 90% of use cases
AI wizard generates optimal configurations
Expert mode available for advanced users

Adaptation Over Static Rules

System learns and improves continuously
Eliminates manual re-tuning
Responds to evolving user behavior

Transparency Over Black Boxes

Every decision includes reasoning
Audit trails for compliance
Console shows optimization rationale

Conclusion

Memory Model represents a fundamental evolution in AI memory systems:

From Manual to Automatic: Self-optimization eliminates ML expertise requirement
From Static to Adaptive: Centroid-aware search adapts to query intent automatically
From Facts to Insights: Meta-cognitive layer synthesizes behavioral patterns
From Code to Configuration: Zero-code platform enables rapid deployment

Perfect for:

Enterprise applications requiring production-grade memory
Teams without dedicated ML engineers
Use cases where context quality directly impacts business metrics
Organizations requiring compliance and auditability

Technical Moat:

Proprietary adaptive search algorithms
Control theory-based optimization system
Multi-year head start on self-optimizing architecture

Closing Thoughts

The “AI Memory Rebus” I set out to solve 18 months ago wasn’t just a technical puzzle — it was a question about what makes memory intelligent rather than just persistent. The answer, it turns out, required rethinking the entire stack: from how we capture context, to how we adapt retrieval strategies, to how we synthesize meta-cognitive insights.

Memory Model is the architecture that emerged from solving that rebus. But like human memory itself, it continues to evolve — learning from each deployment, adapting to new use cases, optimizing itself in production.

If you’re building AI systems that need to truly remember, I’d love to hear what problems you’re solving.

— Matteo
