<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Jonathan Fulton on Medium]]></title>
        <description><![CDATA[Stories by Jonathan Fulton on Medium]]></description>
        <link>https://medium.com/@jonathan_fulton?source=rss-d05e0fdb8e4f------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*N3Bdis7K2nKnLw2dytSX1g.png</url>
            <title>Stories by Jonathan Fulton on Medium</title>
            <link>https://medium.com/@jonathan_fulton?source=rss-d05e0fdb8e4f------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 09 May 2026 18:34:33 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@jonathan_fulton/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Codex vs Claude Code: Why I Decided to Switch to Codex]]></title>
            <link>https://medium.com/jonathans-musings/codex-vs-claude-code-why-i-decided-to-switch-to-codex-97f905c0ad4e?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/97f905c0ad4e</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Sat, 09 May 2026 15:14:21 GMT</pubDate>
            <atom:updated>2026-05-09T15:14:21.414Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>After nearly a year with Claude Code as my daily driver, I’ve made the jump to OpenAI’s Codex. Here’s how it happened and what I’ve learned.</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-7Tow-4yMNz-a0q45ht7Cg.png" /></figure><h3>How I Got Here</h3><p>I started using Claude Code last summer after testing it in a <a href="https://medium.com/jonathans-musings/cursor-vs-claude-code-and-vibe-coding-e13788042a17">vibe coding contest against Cursor</a>. Claude Code won that contest, and it became my go-to coding agent. I used it for everything — writing features, debugging, refactoring, generating tests. It was good.</p><p>But over the past few months, Codex kept appearing in my periphery. And eventually, it pulled me in.</p><h3>First Encounter: PR Reviews That Actually Find Bugs</h3><p>My first real exposure to Codex came through Datadog’s automated code review process. Codex provides comments on PRs, and unlike a lot of automated review tools, these comments are actually useful. It finds real bugs — the kind that would otherwise slip through human review.</p><p>I was impressed. But not enough to switch my local workflow.</p><h3>Second Encounter: The Clawdbot Story</h3><p>My next encounter was more indirect. The creator of Clawdbot (now OpenClaw), Peter Steinberger, mentioned that he used Codex to build the open source project that was eventually acquired by OpenAI.</p><p>That got me thinking. If someone was building production-quality, acquisition-worthy software with Codex, maybe I should pay more attention.</p><p>But still not enough to actually install it.</p><h3>What Finally Got Me: Code Reviews from Inside Claude Code</h3><p>The thing that finally pushed me over the edge was the Codex review plugin for Claude Code. It let me run advanced code reviews against my local changes without pushing up a PR.</p><p>This was the killer feature for my workflow. 
I could make a bunch of changes, run a Codex review locally, catch issues early, and iterate — all before my code ever hit a remote branch.</p><p>So I installed Codex. But for the first few weeks, I mostly just used it for those code reviews, still doing my actual coding work in Claude Code.</p><h3>The GPT-5.5 Factor</h3><p>Two things happened around the same time.</p><p>First, a coworker mentioned that they’d been using Codex and that GPT-5.5 was surprisingly good.</p><p>Second, another coworker — a longtime Cursor user — was blown away when GPT-5.5 one-shotted a complicated task that would have taken quite a few iterations with Opus 4.7. One-shot. Done. No back-and-forth.</p><p>That was enough to get me to actually try Codex for real coding work.</p><h3>The Head-to-Head Test</h3><p>For a couple of weeks, I ran as many parallel queries as I could in both Codex and Claude Code. Same tasks, same context, see what happens.</p><p>It’s a little hard to articulate exactly why I ended up preferring Codex, but a few things stood out:</p><h3>1. Better at Using Skills</h3><p>Codex seemed slightly better at knowing when and how to use skills (the equivalent of tools or MCP servers). It would reach for the right skill at the right time without as much prompting.</p><h3>2. Navigates Complexity Better</h3><p>This one was the big differentiator. I had Codex trace requests and tasks through <em>three different codebases</em>, and it excelled. It understood the boundaries between services, followed the data flow across repos, and synthesized a coherent picture.</p><p>Claude Code can do this too, but Codex handled the complexity with less hand-holding.</p><h3>3. Computer Usage</h3><p>This one can be huge depending on your workflow. Codex can control your computer — clicking through UIs, running commands, interacting with web apps.</p><p>Here’s a concrete example: I was debugging why certain requests in our experiment metric creation flow were timing out. 
I had Codex:</p><ol><li>Visit the metric creation flow in the browser</li><li>Submit a SQL query to reproduce the issue</li><li>Use skills to find the request in Datadog traces</li><li>Open the Datadog UI and navigate to the specific trace</li><li>Navigate our frontend monorepo</li><li>Navigate our backend monorepo</li><li>Trace through multiple services within the backend</li><li>Correctly identify the root cause of the timeout</li></ol><p>That’s a workflow that would normally involve me jumping between browser tabs, terminal windows, and multiple codebases. Codex did it end-to-end.</p><h3>4. Handles Intricate Work at Scale</h3><p>Eppo by Datadog has a complicated API that supports syncing hundreds of SQL sources and metrics. It’s not simple — there are edge cases, validation rules, and intricate dependencies.</p><p>I asked Codex to write an RFC for rebuilding this API better in the Datadog platform. It essentially one-shotted a 1,000-line RFC. Then I asked it to create a detailed implementation plan from that RFC. Another 1,000 lines, one-shot.</p><p>Claude Code can do similar work, but in my experience it doesn’t perform as well on tasks this complicated and intricate. The output requires more iteration to get right.</p><h3>Will I Stick With Codex?</h3><p>Honestly? Who knows.</p><p>The AI coding landscape changes fast. Claude’s next release might leapfrog GPT-5.5. Cursor might ship something that makes both feel dated. Some new tool might emerge that rewrites the rules entirely.</p><p>But for now, Codex is my daily driver.</p><p>The combination of strong model performance, computer usage capabilities, and the ability to handle genuinely complex multi-codebase work has won me over. I’m still using Claude Code for some things — and I still think it’s excellent — but when I sit down to do serious coding work, I reach for Codex first.</p><p>Sometimes the best tool is just the one that makes hard things feel a little easier. 
Right now, for me, that’s Codex.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=97f905c0ad4e" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/codex-vs-claude-code-why-i-decided-to-switch-to-codex-97f905c0ad4e">Codex vs Claude Code: Why I Decided to Switch to Codex</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is RAG and How Does It Work with Modern AI Systems?]]></title>
            <link>https://medium.com/jonathans-musings/what-is-rag-and-how-does-it-work-with-modern-ai-systems-9f4488721c41?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f4488721c41</guid>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Wed, 06 May 2026 23:06:02 GMT</pubDate>
            <atom:updated>2026-05-06T23:06:02.603Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>A practical guide to Retrieval-Augmented Generation — the architecture pattern powering enterprise AI, coding agents, and the next generation of intelligent applications</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ZGKOESOn1hTs71cMoG4ZZQ.png" /></figure><p>Large language models are impressive, but they have fundamental limitations: they only know what they learned during training, they can confidently make things up, and they have no access to your proprietary data. Retrieval-Augmented Generation (RAG) solves all three problems.</p><p>Here’s how RAG works at a high level:</p><ul><li><strong>Problem:</strong> LLMs have a knowledge cutoff date, hallucinate facts, and can’t access private data</li><li><strong>Solution:</strong> Before generating a response, retrieve relevant documents from an external knowledge base and inject them into the prompt</li><li><strong>Documents → Chunks → Embeddings → Vector Database:</strong> Your knowledge base is split into chunks, converted to numerical vectors (embeddings), and stored in a vector database optimized for similarity search</li><li><strong>Query Time:</strong> When a user asks a question, convert it to an embedding, find the most similar document chunks using k-nearest neighbors search, and include those chunks as context for the LLM</li><li><strong>Result:</strong> The LLM generates answers grounded in your actual data, with sources it can cite</li></ul><p>RAG has become the standard architecture pattern for enterprise AI, powering everything from customer service chatbots to legal research tools to coding agents. Let’s break down exactly how it works.</p><h3>Why RAG Is Necessary</h3><p>Large language models like GPT-5, Claude, and Gemini are trained on massive datasets of text from the internet, books, and code. But this training approach creates three fundamental problems:</p><h3>1. 
Knowledge Cutoff</h3><p>LLMs only know information up to their training cutoff date. Ask about something that happened last week, and the model has no idea. This makes vanilla LLMs useless for real-time information, recent events, or rapidly changing domains like financial markets or medical research.</p><h3>2. Hallucination</h3><p>LLMs are designed to produce fluent, confident text — even when they don’t actually know the answer. This leads to “hallucinations” where the model invents plausible-sounding but completely false information. In high-stakes domains like healthcare, legal, or finance, this is unacceptable.</p><h3>3. No Access to Private Data</h3><p>Your company’s internal documents, customer data, proprietary research, and institutional knowledge never appeared in the model’s training data. A vanilla LLM can’t answer questions about your specific codebase, your company’s policies, or your customers’ order history.</p><p>RAG addresses all three problems by giving the LLM access to external, up-to-date, verified information at query time. Instead of relying solely on what it memorized during training, the model can now reference actual documents — and cite them.</p><h3>How RAG Works: The Complete Pipeline</h3><p>RAG systems have two phases: an <strong>indexing phase</strong> (done once or periodically) and a <strong>query phase</strong> (done for every user request).</p><h3>Phase 1: Indexing Your Knowledge Base</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*O_-FT15Ht0hHpX4yGOvQgw.png" /><figcaption>The process of indexing your knowledge base</figcaption></figure><h4>Step 1: Gather Documents</h4><p>Start by collecting the documents you want the system to reference. 
This could be:</p><ul><li>Internal documentation and wikis</li><li>Product manuals and knowledge base articles</li><li>Legal contracts and case law</li><li>Medical literature and clinical guidelines</li><li>Code repositories and API documentation</li><li>Customer support tickets and FAQ databases</li></ul><h4>Step 2: Chunk the Documents</h4><p>LLMs have limited context windows (even with modern models supporting 1M tokens, you can’t dump your entire knowledge base into every prompt). Documents are split into smaller chunks — typically 200–1000 tokens each.</p><p>Chunking strategies matter. Common approaches include:</p><ul><li><strong>Fixed-size chunks:</strong> Split every N tokens with some overlap</li><li><strong>Semantic chunking:</strong> Split at natural boundaries like paragraphs or sections</li><li><strong>Recursive chunking:</strong> Try larger chunks first, recursively split if needed</li></ul><p>Overlap between chunks (e.g., 10–20%) ensures that concepts spanning chunk boundaries aren’t lost.</p><h4>Step 3: Generate Embeddings</h4><p>Each chunk is passed through an <strong>embedding model</strong> — a neural network that converts text into a dense numerical vector (typically 768–1536 dimensions). These vectors capture semantic meaning: chunks about similar topics will have similar vectors, even if they use different words.</p><p>Popular embedding models include:</p><ul><li>OpenAI’s text-embedding-3-large</li><li>Cohere’s embed-v3</li><li>Voyage AI’s voyage-large-2</li><li>Open-source options like BGE, E5, and GTE</li></ul><h4>Step 4: Store in a Vector Database</h4><p>The embeddings are stored in a specialized <strong>vector database</strong> optimized for similarity search. 
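</p><p>To make the indexing phase above concrete, here is a minimal sketch in Python. It is illustrative only: <code>chunk</code> does fixed-size splitting with overlap, and <code>embed</code> is a toy placeholder invented for this example, standing in for a real embedding model such as text-embedding-3-large.</p>

```python
# Sketch of the indexing phase: chunk -> embed -> store.
# `embed` is a stand-in for a real embedding model API call,
# and the in-memory list stands in for a vector database.

def chunk(text, size=200, overlap=40):
    """Fixed-size chunking (in words) with overlap, so ideas that
    span a chunk boundary are not lost."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(text):
    """Toy embedding: normalized letter counts. A real system would
    call an embedding model and get back a 768-1536 dimensional vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

index = []  # stands in for the vector database
for doc in ["RAG retrieves relevant chunks before generating an answer."]:
    for piece in chunk(doc, size=8, overlap=2):
        index.append({"text": piece, "vector": embed(piece)})
```

<p>The same shape carries over to production: swap <code>embed</code> for a real model and the list for a vector database, and the pipeline is unchanged.</p><p>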
Unlike traditional databases that match exact values, vector databases find the most similar vectors to a query vector.</p><p>Common vector databases include:</p><ul><li><strong>Pinecone:</strong> Fully managed, scales to billions of vectors</li><li><strong>Weaviate:</strong> Open-source with hybrid search capabilities</li><li><strong>Qdrant:</strong> Open-source, Rust-based, high performance</li><li><strong>Chroma:</strong> Lightweight, developer-friendly, good for prototyping</li><li><strong>pgvector:</strong> PostgreSQL extension for teams already using Postgres</li><li><strong>Milvus:</strong> Open-source, designed for large-scale production</li></ul><h3>Phase 2: Query Time Retrieval</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*a655OHDK7vnnwXFv3NSMYA.png" /></figure><h4>Step 1: Embed the Query</h4><p>When a user asks a question, it’s converted to an embedding using the same embedding model used for indexing. This ensures the query vector lives in the same semantic space as the document vectors.</p><h4>Step 2: K-Nearest Neighbors Search</h4><p>The vector database performs a <strong>k-nearest neighbors (k-NN) search</strong> to find the K document chunks most similar to the query embedding.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*oQcyln0g5pDhVKnvHdbKzA.png" /></figure><p>Similarity is typically measured using:</p><ul><li><strong>Cosine similarity:</strong> Measures the angle between vectors (most common)</li><li><strong>Euclidean distance:</strong> Measures straight-line distance</li><li><strong>Dot product:</strong> Fast computation, works well for normalized vectors</li></ul><p>At scale, exact k-NN search is too slow. 
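</p><p>For small corpora, the search can be done exactly. A minimal sketch (plain Python; the toy 3-dimensional vectors stand in for real embeddings, and <code>knn</code> is a brute-force scan, not a production algorithm):</p>

```python
# Brute-force k-nearest-neighbors over an in-memory index using
# cosine similarity. Fine for prototypes; too slow at scale, which
# is why production systems use approximate (ANN) indexes instead.

def cosine(a, b):
    """Cosine similarity: measures the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def knn(query_vec, index, k=3):
    """Return the k chunks most similar to the query embedding."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vector"]),
                    reverse=True)
    return ranked[:k]

# Toy index: each entry pairs a chunk of text with its embedding.
index = [
    {"text": "refund policy details", "vector": [1.0, 0.0, 0.0]},
    {"text": "shipping timelines", "vector": [0.0, 1.0, 0.0]},
    {"text": "refund exceptions", "vector": [0.9, 0.1, 0.0]},
]
top = knn([1.0, 0.0, 0.0], index, k=2)  # query embedding for a refund question
```

<p>The two refund chunks rank first and second, which is exactly the behavior the retriever needs before prompt construction.</p><p>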
Vector databases use <strong>approximate nearest neighbor (ANN)</strong> algorithms like HNSW (Hierarchical Navigable Small World graphs) or IVF (Inverted File Index) to trade a small amount of accuracy for massive speed improvements — making billion-scale searches possible in milliseconds.</p><h4>Step 3: Construct the Prompt</h4><p>The retrieved chunks are inserted into the prompt, typically in a format like:</p><pre>Use the following context to answer the user&#39;s question.<br>If the context doesn&#39;t contain enough information, say so.</pre><pre>Context:<br>[Retrieved Chunk 1]<br>[Retrieved Chunk 2]<br>[Retrieved Chunk 3]<br>...</pre><pre>User Question: [Original Query]</pre><h4>Step 4: Generate the Response</h4><p>The LLM receives the augmented prompt and generates a response grounded in the retrieved context. Good RAG implementations also ask the model to cite which sources it used, enabling verification and trust.</p><h3>RAG in Modern Chatbots</h3><p>Every major AI assistant now uses RAG or RAG-like techniques to extend its capabilities beyond training data:</p><h3>ChatGPT with Browse and Search</h3><p>When ChatGPT uses web browsing, OpenAI performs real-time web searches, retrieves relevant pages, and injects their content into the conversation context. This is RAG in action — the model answers based on retrieved web content rather than solely its training data.</p><h3>Perplexity</h3><p>Perplexity built its entire product around RAG. Every query triggers web searches across multiple sources, retrieves and processes the content, and generates answers with inline citations. The retrieval is the product.</p><h3>Claude with Document Context</h3><p>When you upload documents to Claude or use Claude’s Projects feature, those documents become the retrieval corpus. 
Claude searches through your uploaded context to ground its responses in your specific materials.</p><h3>Google’s AI Overviews</h3><p>Google’s AI-generated search summaries are powered by RAG over Google’s search index. The system retrieves relevant web pages for a query and synthesizes an answer from those sources.</p><h3>RAG in Coding Agents</h3><p>Coding agents face an acute version of the RAG problem: codebases are too large to fit in context, change constantly, and contain patterns and conventions specific to each project. RAG is essential for code assistants to actually understand <em>your</em> code.</p><h3>GitHub Copilot</h3><p>GitHub’s latest Copilot coding agent explicitly uses “advanced retrieval augmented generation (RAG) powered by GitHub code search.” When Copilot boots up to work on a task, it clones the repository, analyzes the codebase, and builds a retrieval index. This allows it to understand project-specific patterns, reference relevant files, and make changes consistent with the existing code.</p><p>The Copilot team has discussed their retriever architecture publicly: they build a precompiled index that allows quickly looking up code items relevant to the current task, enabling whole-codebase understanding without stuffing everything into context.</p><h3>Cursor</h3><p>Cursor indexes your codebase and uses embeddings to find semantically relevant code when you ask questions or request changes. The @codebase command explicitly triggers RAG search across your entire project to find relevant files, functions, and patterns.</p><h3>Claude Code</h3><p>Claude Code operates on your local filesystem and uses intelligent file discovery to understand codebases that far exceed its context window. 
While the exact architecture isn’t public, the behavior — understanding project structure, finding relevant files, and making contextually appropriate changes across large repos — requires retrieval mechanisms.</p><h3>Sourcegraph Cody</h3><p>Sourcegraph Cody is perhaps the most explicitly RAG-focused coding assistant. It builds on Sourcegraph’s code search infrastructure to provide code intelligence across massive monorepos. Cody can search millions of lines of code to find relevant context for any coding task.</p><h3>Companies Using RAG: Real-World Examples</h3><p>RAG has moved from research concept to production infrastructure across industries. Here are concrete examples:</p><h3>Financial Services</h3><p><strong>Morgan Stanley</strong> deployed a GPT-powered RAG system that gives wealth advisors instant access to the firm’s research reports, investment strategies, and compliance documentation. Instead of manually searching through hundreds of thousands of documents, advisors can ask natural language questions and get answers grounded in Morgan Stanley’s proprietary research. The system retrieves relevant documents from their corpus and uses them to generate contextual answers.</p><p><strong>Goldman Sachs</strong> is rebuilding core workflows around intelligent, agentic automation, with RAG as a foundational component. Their AI systems need access to real-time market data, internal research, and compliance requirements — all of which require retrieval architectures.</p><h3>Legal</h3><p><strong>Harvey AI</strong>, now valued at $11 billion, built its entire product on RAG. Their “Vault” feature creates secure, encrypted environments where law firms can connect their proprietary documents. 
According to OpenAI’s case study on Harvey, they tried fine-tuning and other techniques first, but RAG proved essential for case law research that requires thorough, well-cited answers.</p><p>As Harvard’s Journal of Law &amp; Technology noted, Harvey “makes law firms’ proprietary databases available via RAG systems,” enabling AI that can reference a firm’s specific precedents, contracts, and internal work product.</p><h3>Healthcare</h3><p><strong>Apollo 24|7</strong>, a digital healthcare platform in India, uses Google’s MedPaLM augmented with RAG to build their Clinical Intelligence Engine. The system assists clinicians by providing real-time access to de-identified patient data, drug interaction databases, and the latest clinical guidelines.</p><p><strong>UpToDate</strong> (Wolters Kluwer) provides AI-powered clinical decision support that grounds recommendations in their continuously updated medical evidence database — a classic RAG application where timeliness and accuracy are literally life-or-death concerns.</p><p><strong>Glass Health</strong> uses RAG to connect ambient scribing (transcription of doctor-patient conversations) with clinical knowledge bases, helping generate accurate notes and surface relevant diagnostic information.</p><h3>Customer Service</h3><p><strong>Klarna</strong> deployed an AI assistant that handled two-thirds of customer service chats — 2.3 million conversations — in its first month. The core is a fine-tuned GPT model augmented with RAG that pulls from Klarna’s knowledge base on policies, orders, and merchant data. This grounding in real data is what allows the system to handle sensitive fintech tasks without hallucinating incorrect refund policies or order details.</p><h3>Enterprise Knowledge</h3><p><strong>Notion AI</strong> uses RAG to search across a company’s Notion workspace — potentially millions of pages — when answering questions. 
Their pipeline achieves 2–10ms per query for vector search across enterprise-scale document sets.</p><p><strong>Glean</strong> ($7.2B valuation) built an enterprise search platform that connects to dozens of data sources — Google Drive, Slack, Salesforce, internal wikis — and provides RAG-powered Q&amp;A across all of them.</p><p><strong>Slack AI</strong> uses RAG to answer questions about your organization’s Slack history, surfacing relevant conversations and documents from across channels you have access to.</p><h3>The Evolving RAG Landscape</h3><p>RAG isn’t standing still. Several trends are reshaping the architecture:</p><h3>Hybrid Search</h3><p>Pure vector search can miss keyword-exact matches that matter. Modern RAG systems combine vector search with traditional keyword search (BM25), getting the best of both semantic understanding and exact matching.</p><h3>Agentic RAG</h3><p>Instead of a single retrieve-then-generate step, agentic RAG systems can iteratively search, evaluate results, refine queries, and synthesize information across multiple retrieval rounds. This is how coding agents explore large codebases — they don’t just do one search, they navigate intelligently.</p><h3>Multi-Modal RAG</h3><p>RAG is expanding beyond text. Systems can now retrieve and reason over images, diagrams, tables, and videos, enabling applications in fields like manufacturing, medical imaging, and design.</p><h3>Context Window Expansion</h3><p>With context windows reaching 1–2 million tokens, some have asked: “Is RAG dead?” The answer is no. Even with massive context, you still can’t fit enterprise-scale knowledge bases into every prompt. More importantly, retrieval provides relevance filtering — you don’t want the model distracted by irrelevant documents. 
RAG and large context windows are complementary: retrieval finds the needle, large context windows let you include the surrounding haystack.</p><h3>Conclusion</h3><p>RAG has become the standard pattern for building AI systems that need to be accurate, current, and grounded in real data. The core insight is simple but powerful: instead of trying to pack all knowledge into model weights, give the model a librarian that can find relevant information on demand.</p><p>Whether you’re building a customer service bot, a legal research tool, a coding assistant, or an enterprise search product, you’re probably building a RAG system. Understanding the architecture — chunking, embeddings, vector databases, k-nearest neighbors search, and prompt construction — is now essential knowledge for anyone working with AI in production.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9f4488721c41" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/what-is-rag-and-how-does-it-work-with-modern-ai-systems-9f4488721c41">What is RAG and How Does It Work with Modern AI Systems?</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI Startups You Should Know]]></title>
            <link>https://medium.com/jonathans-musings/ai-startups-you-should-know-48d69d1ddf61?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/48d69d1ddf61</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Tue, 05 May 2026 23:37:33 GMT</pubDate>
            <atom:updated>2026-05-05T23:37:33.829Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>A field guide to the companies defining the next era of AI</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/830/1*q5vxs4Yylwio_fF7DDvXHA.png" /></figure><p>The AI startup market is no longer one market. It is a stack.</p><p>There are frontier labs building the models, infrastructure companies building the compute layer, developer tools changing how software gets written, code review companies trying to verify AI-generated work, enterprise platforms organizing company knowledge, and vertical AI companies going after legal, healthcare, customer support, media, and more.</p><p>Here is a readable map of the companies worth knowing.</p><p>Numbers are based on public reporting as of May 5, 2026. Some are primary rounds, some are reported talks, and some are secondary-market signals, so treat them as directional rather than gospel.</p><h3>1. Frontier model labs</h3><p>These are the companies building the most capable general-purpose AI models.</p><h3>OpenAI</h3><p>$852 billion valuation. $122 billion latest funding round (closed March 31, 2026). Estimated annualized revenue around $25 billion. OpenAI remains the default brand in AI, with ChatGPT, the API platform, Codex, Sora, and enterprise products all pushing toward the same goal: becoming the operating system for AI work.</p><h3>Anthropic</h3><p>$380 billion primary valuation (February 2026), with reports of a potential $50 billion round at $900 billion+ valuation in active talks. Reported annualized revenue is above $30 billion, with some reports closer to $40 billion. Claude Code alone hit $2.5 billion in annualized revenue by February 2026. Anthropic has become OpenAI’s most credible rival, especially in enterprise AI and coding through Claude and Claude Code.</p><h3>xAI / SpaceX</h3><p><em>Major update:</em> xAI was acquired by SpaceX in February 2026 for $250 billion, creating a combined entity valued at $1.25 trillion at the time of the deal. 
SpaceX is now targeting a June/July 2026 IPO at a reported $1.75 trillion valuation — potentially the biggest IPO in history. Grok remains the AI product, now with distribution advantages through X, Tesla, Starlink, and the broader Musk ecosystem.</p><h3>Mistral AI</h3><p>Roughly $14 billion valuation after major 2025 funding (a €1.7 billion Series C at an €11.7 billion valuation). In March 2026, Mistral raised an additional $830 million for European data center buildout. Mistral is the leading European frontier model company, with a strong open-model, enterprise, and sovereignty angle.</p><h3>Cohere + Aleph Alpha</h3><p><em>Major update:</em> In April 2026, Cohere announced a merger with Germany’s Aleph Alpha at a combined $20 billion valuation (up from Cohere’s previous $7 billion standalone). The combined entity positions itself as the enterprise AI alternative with strong European and sovereign deployment capabilities.</p><h3>2. AI infrastructure and compute</h3><p>These companies are building the GPU clouds, chips, inference platforms, and data centers powering the AI boom.</p><h3>Lambda</h3><p>Raised $1.5 billion+ in a Series E (November 2025), bringing total funding to roughly $2.3 billion. Currently raising $350 million in pre-IPO convertible notes with an H2 2026 IPO target. Lambda is one of the better-known AI cloud providers and a CoreWeave alternative for companies that need access to GPU infrastructure.</p><h3>Together AI</h3><p>In talks to raise around $1 billion at a $7.5 billion valuation (March 2026); estimated annualized revenue around $1 billion. Together AI sits between infrastructure and developer platform, helping companies train, fine-tune, and deploy open and proprietary models.</p><h3>Cerebras</h3><p>Filed for IPO in April 2026, targeting a $26.6 billion valuation with a $3.5 billion raise. Reported $510 million in 2025 revenue. 
Cerebras is one of the highest-profile challengers to Nvidia, with wafer-scale AI chips and a major OpenAI partnership.</p><h3>Groq</h3><p>$6.9 billion valuation after a $750 million 2025 raise. Additionally, in December 2025, Groq signed a non-exclusive licensing deal with Nvidia for its inference technology worth $17 billion in payments through 2026. Groq is focused on fast AI inference, betting that once AI moves from training-heavy to usage-heavy, speed and cost per token become central.</p><h3>Crusoe</h3><p>Raised a $1.375 billion Series E at a $10 billion valuation (October 2025). Named one of Fast Company’s 2026 Most Innovative Companies. Crusoe is building AI data center and energy infrastructure, which matters because the AI bottleneck is increasingly power, land, cooling, and chips rather than just software.</p><h3>3. AI coding and developer tools</h3><p>This is one of the most important categories because software engineering is one of the first areas where AI is clearly changing day-to-day work.</p><h3>Cursor / Anysphere</h3><p>Cursor closed a $2.3 billion Series D in November 2025 at a $29.3 billion valuation. In April 2026, SpaceX announced a strategic partnership with an option to acquire Cursor for $60 billion (or a $10 billion partnership deal). Cursor reached around $2 billion in ARR by February 2026 and is reportedly forecasting more than $6 billion by year-end. Microsoft reportedly examined an acquisition before choosing not to bid. Cursor is the breakout AI IDE and arguably the most important developer tool startup of the AI era.</p><h3>Replit</h3><p>Raised a $400 million Series D at a $9 billion valuation in March 2026, 3x its valuation from just six months prior. Annualized revenue reportedly grew from $2.8 million to $150 million in less than a year. 
Replit is becoming an AI app-building platform for both developers and non-developers, with Agent 3 pushing it further into autonomous software creation.</p><h3>Lovable</h3><p>$6.6 billion valuation after a $330 million Series B (December 2025). Reported ARR has been cited around $200 million to $400 million depending on source and timing. Lovable is one of the clearest winners in “vibe coding,” letting users generate full-stack apps from prompts.</p><h3>Cognition / Devin</h3><p>$10.2 billion valuation after a $400 million raise (September 2025); in talks for a new round at around $25 billion (April 2026). Cognition’s Devin popularized the “AI software engineer” concept, and the company became even more strategically interesting after acquiring Windsurf.</p><h3>Factory</h3><p>$1.5 billion valuation after a $150 million Series C led by Khosla Ventures (April 2026). Factory is building AI coding agents for enterprise engineering teams, which means its challenge is not just code generation but security, repo context, CI, review workflows, and organizational trust.</p><h3>Warp</h3><p>Total funding around $73 million; growing at roughly $1 million ARR every 10 days as of late 2025. Warp started as a Rust-based terminal and is increasingly becoming an agentic developer workflow product with 700,000+ developers on the platform.</p><h3>Bolt.new / StackBlitz</h3><p>$700 million valuation; $135 million total funding as of December 2025. Bolt is part of the same app-generation wave as Lovable and Replit, focused on letting users quickly create, edit, and deploy applications from natural language.</p><h3>4. Code review, verification, and software quality</h3><p>If AI writes more code, teams will need better ways to validate that code.</p><h3>Greptile</h3><p>$180 million valuation after a $25 million Series A led by Benchmark (September 2025). Total funding around $45 million. 
Greptile is building AI code review, a category that should become more important as teams generate more code than humans can comfortably review by hand.</p><h3>Sourcegraph</h3><p>Last major private valuation: $2.625 billion from its 2021 Series D. Sourcegraph is not a new AI-native startup, but code search and code intelligence are foundational for agents that need to understand large, messy production codebases.</p><h3>Graphite</h3><p>Valuation not widely public. Graphite started with stacked pull requests and developer workflow tooling, but its position around code review and PR workflows makes it relevant in a world where AI produces more diffs.</p><h3>CodeRabbit</h3><p>Valuation not widely public. CodeRabbit is another AI code review company worth watching because independent review is becoming one of the obvious complements to AI code generation.</p><h3>5. Agent infrastructure</h3><p>These companies are building the rails agents need: search, browsing, memory, tool use, workflow orchestration, and long-running task execution.</p><h3>Perplexity</h3><p>Valuation between $21 billion and $22.6 billion (January 2026); estimated ARR around $200 million by February 2026. Perplexity is best known as an AI search company, but it is increasingly an answer engine and agentic research surface.</p><h3>Parallel Web Systems</h3><p>$2 billion valuation after a $100 million Series B (April 29, 2026). Founded by former Twitter CEO Parag Agrawal, Parallel is building web infrastructure for AI agents that need to search, retrieve, and reason over the web more reliably than humans clicking links.</p><h3>Browserbase</h3><p>Valuation not widely public. Browserbase is part of the emerging browser-infrastructure category: hosted browsers, automation, and web interaction for AI agents that need to use websites as tools.</p><h3>Exa</h3><p>Valuation not widely public. 
Exa is building search infrastructure for AI applications, which matters because agents need retrieval systems optimized for machine reasoning rather than human search-result pages.</p><h3>6. Enterprise knowledge and productivity</h3><p>These companies are trying to become the AI layer across company documents, Slack, tickets, email, CRM, permissions, and workflows.</p><h3>Glean</h3><p>$7.2 billion valuation after a $150 million Series F (February 2026); reported ARR around $200 million. Glean started as enterprise search and is evolving into a broader enterprise AI platform for company knowledge and agentic workflows.</p><h3>Writer</h3><p>$1.9 billion valuation after a $200 million Series C (November 2024). Writer is focused on enterprise AI, brand-safe generation, workflows, and governance rather than generic chatbot usage.</p><h3>Notion AI</h3><p>Notion is not purely an AI startup, but it belongs in this category. Its AI features sit directly inside docs, wikis, projects, and company knowledge, making it a strong example of AI embedded into existing productivity workflows.</p><h3>Hebbia</h3><p>$700 million valuation after a $130 million Series B (July 2024). Hebbia is focused on AI for knowledge work, especially document-heavy analysis in financial services, legal, and professional services.</p><h3>7. Vertical AI applications</h3><p>These companies go deep into one domain instead of trying to be a generic assistant.</p><h3>Harvey</h3><p>$11 billion valuation after a $200 million raise (March 2026); reported annualized revenue above $200 million. Harvey is the breakout legal AI company, selling into law firms and legal departments for contract analysis, due diligence, compliance, litigation, and legal research.</p><h3>Legora</h3><p>$5.6 billion valuation after a $550 million Series D (March 2026), with additional Nvidia NVentures investment announced in April 2026. 
Legora is another major legal AI startup, and its rise suggests legal AI is becoming a real category rather than a one-company market.</p><h3>Sierra</h3><p>Over $15 billion valuation after a $950 million round (announced May 4, 2026); previously reported to be on track for more than $100 million in ARR. Sierra, founded by Bret Taylor and Clay Bavor, is building AI agents for customer experience and support workflows. The company now has more than $1 billion to work with.</p><h3>OpenEvidence</h3><p>$12 billion valuation after a $250 million Series D (January 2026); reported annualized revenue around $100 million. OpenEvidence is often described as “ChatGPT for doctors,” which makes it one of the more important healthcare AI companies to watch.</p><h3>Abridge</h3><p>Valuation not included here because current public numbers vary. Abridge is focused on AI medical documentation, a very real workflow pain point for doctors and health systems.</p><h3>Decagon</h3><p>Valuation not included here because public numbers vary. Decagon is another customer support AI company, competing in the same broad enterprise support automation category as Sierra.</p><h3>8. Generative media and communication</h3><p>These companies are building AI for voice, video, avatars, images, and synthetic media.</p><h3>ElevenLabs</h3><p>$11 billion valuation after a $500 million Series D (February 2026); closed 2025 with more than $330 million in ARR. ElevenLabs is the leading AI voice company and is expanding into dubbing, customer support, conversational commerce, training, and voice agents.</p><h3>Runway</h3><p>$5.3 billion valuation after a $315 million Series E (February 2026). Total funding around $860 million. Runway is one of the major AI video companies, now increasingly talking about world models rather than only video generation.</p><h3>Synthesia</h3><p>$4 billion valuation after a $200 million Series E (January 2026). Used by 70% of FTSE 100 companies. 
Synthesia focuses on AI avatars and enterprise video, especially training, enablement, onboarding, and internal communication.</p><h3>Midjourney</h3><p>Valuation not widely public, but culturally enormous. Midjourney remains one of the most important image generation products, especially in creative, design, and visual ideation workflows.</p><h3>9. Physical-world AI and robotics</h3><p>This category is less mature than software AI, but the upside is enormous.</p><h3>Figure AI</h3><p>$39 billion valuation after a $1 billion+ Series C (2025/2026), with total funding exceeding $1.9 billion. Figure is building humanoid robots and is one of the more visible companies trying to bring foundation-model-style progress into physical work.</p><h3>Physical Intelligence</h3><p>In talks to raise $1 billion at a valuation exceeding $11 billion (March 2026). Physical Intelligence is working on general-purpose AI for robots, a technically hard problem with massive long-term implications.</p><h3>Skild AI</h3><p>$14 billion valuation after a $1.4 billion Series C led by SoftBank (January 2026). 
Skild is another robotics foundation model company, betting that the next big AI wave includes embodied systems, not just text boxes.</p><h3>And that’s a wrap!</h3><p>If I had to compress the entire landscape into a watchlist, I’d pick these:</p><p><strong>Frontier labs:</strong> OpenAI, Anthropic, xAI/SpaceX, Mistral, Cohere+Aleph Alpha</p><p><strong>AI coding:</strong> Cursor, Replit, Lovable, Cognition, Factory, Greptile</p><p><strong>Infrastructure:</strong> Lambda, Together AI, Cerebras, Groq, Crusoe</p><p><strong>Enterprise AI:</strong> Glean, Writer, Sierra</p><p><strong>Vertical AI:</strong> Harvey, Legora, OpenEvidence</p><p><strong>Agent infrastructure:</strong> Perplexity, Parallel, Browserbase, Exa</p><p><strong>Media:</strong> ElevenLabs, Runway, Synthesia</p><p><strong>Robotics:</strong> Figure, Physical Intelligence, Skild</p><p>The big pattern is that “AI startup” no longer means one thing.</p><p>Some of these companies are building models. Some are building infrastructure. Some are building workflows. Some are replacing old SaaS categories. Some are creating entirely new behavior.</p><p>The first phase of the AI boom was about who had the best model.</p><p>The next phase is about who owns the workflow.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=48d69d1ddf61" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/ai-startups-you-should-know-48d69d1ddf61">AI Startups You Should Know</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Why is Rust so hip?]]></title>
            <link>https://medium.com/jonathans-musings/why-is-rust-so-hip-d51d33d937d6?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/d51d33d937d6</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Mon, 04 May 2026 14:57:13 GMT</pubDate>
            <atom:updated>2026-05-05T16:29:19.471Z</atom:updated>
            <content:encoded><![CDATA[<h4>Everyone’s rewriting everything in Rust. Here’s why that actually makes sense.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/932/1*hADFdaYoVhfZzwLXS9sh-Q.png" /></figure><p>A year ago, I joined Datadog. Along the way, the team that built our new CLI for AI agents called <a href="https://github.com/datadog-labs/pup">Pup</a> refactored the codebase from Go to Rust.</p><p>This wasn’t an anomaly. It was a pattern.</p><p>At my previous company, <a href="https://www.geteppo.com/">Eppo</a> (acquired by Datadog), we built a <a href="https://docs.rs/eppo_core/latest/eppo_core/">Rust core SDK</a> that powers our feature flagging across every language — JavaScript, Python, Ruby, you name it. The same evaluation logic, compiled once, wrapped everywhere.</p><p>And everywhere I look in the developer tools space, the story is the same: Rust is eating software. Not because it’s trendy. Because it solves real problems that other languages can’t.</p><h3>The Evidence is Everywhere</h3><p>Let’s start with what prompted this post. In the last few months alone:</p><p><a href="https://github.com/warpdotdev/warp"><strong>Warp just open-sourced their terminal</strong></a> — 1.2 million lines of Rust. This is an “agentic development environment” (their words) that runs AI coding agents like Claude Code and Codex directly in the terminal. OpenAI is their founding sponsor. They chose Rust not because it was trendy, but because building a high-performance terminal with native rendering and complex state management basically requires it.</p><p><a href="https://github.com/openai/codex"><strong>OpenAI’s Codex CLI</strong></a> — their local coding agent — is written in Rust. Not Python. Not TypeScript. Rust. 
When OpenAI needs a fast, reliable, cross-platform binary that developers will actually run on their machines, they reach for Rust.</p><p><a href="https://github.com/datadog-labs/pup"><strong>Datadog’s Pup CLI</strong></a> gives AI agents programmatic access to the entire Datadog platform — 200+ commands across 33+ product domains. Why Rust? Because when you’re building a CLI that AI agents will invoke thousands of times, startup time and memory footprint actually matter.</p><p><a href="https://docs.geteppo.com/sdks/server-sdks/rust/intro"><strong>Eppo’s SDK architecture</strong></a> uses a Rust core (eppo_core) for flag evaluation, then wraps it for each target language. One codebase. Consistent behavior everywhere. No divergence bugs. This is the FFI (foreign function interface) superpower pattern I&#39;ll talk more about below.</p><h3>Rust Has Already Won JavaScript Tooling</h3><p>If you’re a JavaScript or Python developer, you’re probably already using Rust. You just might not know it.</p><ul><li><a href="https://github.com/astral-sh/ruff"><strong>Ruff</strong></a> — The Python linter that’s 10–100x faster than Flake8. Written in Rust.</li><li><a href="https://github.com/astral-sh/uv"><strong>uv</strong></a> — Python package manager that makes pip feel like it’s from 1995. Written in Rust.</li><li><a href="https://swc.rs/"><strong>SWC</strong></a> — The Babel replacement powering Next.js. Written in Rust.</li><li><a href="https://turbo.build/pack"><strong>Turbopack</strong></a> — Vercel’s successor to Webpack. Written in Rust.</li><li><a href="https://rspack.dev/"><strong>Rspack</strong></a> — ByteDance’s Webpack replacement. Written in Rust.</li><li><a href="https://biomejs.dev/"><strong>Biome</strong></a> — The Rome fork that actually shipped. Linter + formatter. 
Written in Rust.</li><li><a href="https://oxc.rs/"><strong>Oxc</strong></a> — The “JavaScript Oxidation Compiler.” Guess what it’s written in.</li></ul><p>Lee Robinson (VP of Product at Vercel) predicted this back in 2021. By 2026, he notes, “nearly every major JavaScript build tool now has a Rust-based alternative or has been rewritten in Rust.”</p><p>He was right.</p><h3>The AI/ML Ecosystem is Following</h3><p>The pattern is expanding into AI tooling:</p><ul><li><a href="https://github.com/huggingface/candle"><strong>Candle</strong></a> — Hugging Face’s minimalist ML framework for Rust. Designed for serverless inference where startup time matters.</li><li><a href="https://www.pola.rs/"><strong>Polars</strong></a> — The DataFrame library that makes Pandas feel slow. Written in Rust, with Python bindings.</li><li><a href="https://deno.land/"><strong>Deno</strong></a> — The TypeScript-native runtime with built-in tooling. Rust core.</li></ul><p>When you need fast inference on constrained resources — edge devices, serverless functions, embedded systems — Rust’s zero-cost abstractions become non-negotiable.</p><h3>Why Rust? The Real Reasons</h3><p>So why is everyone reaching for Rust? It’s not just hype. Here’s what actually matters:</p><h3>1. Performance without runtime cost</h3><p>Rust has no garbage collector. No runtime. When your code compiles, it compiles to native machine code that starts instantly and uses exactly the memory it needs.</p><p>For CLI tools, this means sub-millisecond startup times. For servers, this means predictable latency without GC pauses. For AI agents invoking tools thousands of times, this adds up fast.</p><h3>2. Memory safety without the overhead</h3><p>Rust’s borrow checker catches entire categories of bugs at compile time:</p><ul><li>No null pointer exceptions</li><li>No data races</li><li>No use-after-free</li><li>No buffer overflows</li></ul><p>This isn’t theoretical. 
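</p><p>Here is a minimal sketch of what that looks like in practice (my illustration, not code from any of the projects above). The commented-out lines are exactly the kind of code rustc refuses to build:</p>

```rust
// Minimal sketch: the commented-out lines fail to compile,
// illustrating the compile-time guarantees listed above.

fn consume(s: String) -> usize {
    // Takes ownership of `s`; the caller's binding is invalidated.
    s.len()
}

fn main() {
    let s = String::from("hello");
    let n = consume(s);
    // println!("{}", s); // error[E0382]: borrow of moved value: `s`
    //                    // (use-after-move caught at compile time)
    assert_eq!(n, 5);

    let mut v = vec![1, 2, 3];
    let first = &v[0]; // shared borrow of `v`
    // v.push(4);      // error[E0502]: cannot borrow `v` as mutable
    //                 // while the shared borrow `first` is still in use
    println!("first = {first}");

    // No null: "maybe a value" is an explicit Option<T>,
    // and the compiler forces you to handle the None case.
    let maybe: Option<&i32> = v.get(10);
    assert!(maybe.is_none());
}
```

<p>None of this needs a runtime check, a sanitizer, or a garbage collector; the program that compiles is the program that is free of these bug classes.</p><p>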
The Linux kernel now officially supports Rust as of 2025 — the first new language added since C in 1991. Microsoft’s Galen Hunt publicly stated the goal of eliminating all C/C++ from Microsoft by 2030, with Rust as the replacement.</p><p>Memory safety has gone from “nice to have” to “regulatory requirement” in many industries.</p><h3>3. The FFI (Foreign Function Interface) superpower</h3><p>This is the one that doesn’t get enough attention.</p><p>Rust compiles to native code with C-compatible FFI. This means you can write your core logic once and call it from Python, JavaScript, Ruby, Go, Java — basically anything.</p><p>This is exactly what we did at Eppo. The eppo_core crate handles all flag evaluation logic. Each language SDK wraps it. One implementation. Zero drift. When we fix a bug or add a feature, every SDK gets it automatically.</p><p>Compare this to maintaining parallel implementations in 8 languages. The math is obvious.</p><h3>4. WebAssembly is a first-class target</h3><p>Rust compiles to WASM beautifully. This means the same code that runs on your server can run in the browser, in Cloudflare Workers, in edge functions — anywhere.</p><p>For SDK authors, this is transformative. Write once, deploy everywhere, with near-native performance.</p><h3>5. Modern tooling that actually works</h3><p>Cargo is a great package manager. It has:</p><ul><li>Dependency management that makes sense</li><li>Built-in testing, benchmarking, documentation</li><li>Reproducible builds by default</li><li>A package registry (<a href="http://crates.io/">crates.io</a>) that doesn’t feel like archeology</li></ul><p>When you come from npm’s node_modules or Python’s “did you activate the right virtualenv?” experience, Cargo feels like the future.</p><h3>The Honest Tradeoffs</h3><p>Rust isn’t free. The learning curve is real.</p><p>The borrow checker will fight you. Lifetimes are confusing at first. Compile times are longer than Go. 
The ecosystem is younger than Java, JavaScript or Python.</p><p>But here’s what I’ve observed: teams that invest in Rust tend to have fewer bugs in production, lower operational costs, and faster runtime performance. The upfront cost pays dividends.</p><p>And with AI coding assistants getting better at Rust, the learning curve is flattening. Claude and Codex can write idiomatic Rust now. The barrier to entry is dropping.</p><h3>The Pattern I’m Seeing</h3><p>Here’s my synthesis:</p><p><strong>Rust is winning in three specific domains:</strong></p><ol><li><strong>Developer tools and CLIs</strong> — Where startup time, memory footprint, and cross-platform binaries matter.</li><li><strong>Core SDKs with multi-language wrappers</strong> — Where maintaining N parallel implementations is a nightmare.</li><li><strong>Performance-critical infrastructure</strong> — Where GC pauses are unacceptable and memory safety is required.</li></ol><p>If your project fits one of these categories, Rust deserves serious consideration.</p><p>If you’re building a CRUD app? Probably overkill. Use whatever ships fastest.</p><p>But for the tools that other developers depend on — the compilers, the CLIs, the SDKs, the infrastructure — Rust has become the obvious choice.</p><h3>What I’m Watching</h3><p>The next wave is AI tooling. The intersection of Rust’s performance characteristics with the demands of AI inference — low latency, predictable resource usage, edge deployment — is too compelling to ignore.</p><p>Candle is just the beginning. I expect to see more Rust-based inference engines, more AI agents written in Rust, more ML tooling that takes performance seriously.</p><p>The question isn’t whether Rust will continue growing. It’s which domain gets rewritten next.</p><p><em>What’s your take? Are you seeing Rust adoption in your organization? 
I’d love to hear about it in the comments.</em></p><p>— <br>Jonathan Fulton<br>Staff Software Engineer at Datadog</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d51d33d937d6" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/why-is-rust-so-hip-d51d33d937d6">Why is Rust so hip?</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What Warp’s Open Source Release Tells Us About the Future of Agentic Software Development]]></title>
            <link>https://medium.com/jonathans-musings/what-warps-open-source-release-tells-us-about-the-future-of-agentic-software-development-5d4409726bf1?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/5d4409726bf1</guid>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Sun, 03 May 2026 23:00:16 GMT</pubDate>
            <atom:updated>2026-05-03T23:00:16.643Z</atom:updated>
            <content:encoded><![CDATA[<h4>I dug through Warp’s newly open-sourced codebase and contribution model. The most interesting part is not that the terminal is written in Rust. It is that the repo itself is designed for humans managing agents.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9q0BDKqZS5xf0IlgQF5RVA.png" /></figure><p>Warp recently open sourced its client codebase.</p><p>That alone is interesting. Warp is one of the more ambitious developer tools of the last few years: a modern terminal, a code-oriented workspace, an AI agent interface, and now increasingly an “agentic development environment.” But the more interesting part is not merely that the source code is public. The interesting part is <em>how</em> Warp is trying to run the project.</p><p>This is not a traditional open source release where a company dumps a repo on GitHub and hopes contributors show up. Warp is explicitly experimenting with a new development model: humans propose ideas, write specs, validate behavior, and supervise agents that do much of the implementation and review work.</p><p>Warp describes the client as now open source, with an agent-first contribution workflow managed by Oz, its cloud agent orchestration platform. OpenAI is the founding sponsor of the open source repository, and the agentic management workflows are powered by GPT models.</p><p>I spent some time reading through the repository, the contribution docs, the agent skill files, and the engineering guide. Here are the most interesting things I learned.</p><h3>1. Warp did not just open source a product. 
It open sourced a workflow.</h3><p>The biggest takeaway from reading the repo is that Warp is trying to make the repository itself agent-native.</p><p>The README points contributors to a public “Warp Contributions Overview Dashboard” where people can watch Oz agents triage issues, write specs, implement changes, review PRs, track in-flight features, and click into active agent sessions in a web-compiled Warp terminal.</p><p>That is a very different framing from “please fork the repo and send us a pull request.”</p><p>The contribution model starts with issues. Features are not supposed to jump directly into code. Instead, they move through readiness labels, product specs, tech specs, implementation, Oz review, SME review, CI, and then merge.</p><p>In other words, Warp is treating open source contribution as an orchestration problem.</p><p>That feels directionally correct to me.</p><p>As AI coding agents get better, writing code is no longer always the scarce resource. The scarce resources become:</p><ul><li>deciding what should be built</li><li>specifying behavior precisely</li><li>validating that the implementation matches intent</li><li>reviewing edge cases</li><li>maintaining architectural coherence</li><li>deciding when “working” is actually “done”</li></ul><p>Warp’s launch post says this explicitly: they believe the bottleneck is no longer writing code, but the human-in-the-loop work around code, especially speccing and verifying behavior.</p><p>That lines up with my own experience using agents in production. The code is increasingly cheap. Taste, judgment, context, and verification are not.</p><h3>2. The repo has agent instructions as first-class artifacts</h3><p>One of the most interesting directories is .agents/skills.</p><p>This is not just a CONTRIBUTING.md file written for humans. 
Warp ships repo-specific skills for agents: adding feature flags, adding telemetry, creating PRs, diagnosing CI failures, implementing specs, reviewing PRs, writing product specs, writing tech specs, resolving merge conflicts, writing Rust unit tests, and following Warp UI guidelines.</p><p>That is a big deal.</p><p>Most repositories today are still optimized for human contributors. You might have a README, a style guide, some CI scripts, maybe a docs folder. Warp’s repo is structured around the assumption that agents will be contributors too.</p><p>The write-product-spec skill is especially revealing. It tells the agent that the product spec should make desired behavior “unambiguous enough that an agent can implement it correctly and avoid regressions.” It also says the behavior section is the core of the spec and should enumerate defaults, inputs, user-visible states, error states, loading states, cancellation, offline behavior, races, accessibility, focus expectations, and invariants.</p><p>That is exactly the kind of documentation agents need.</p><p>Not vague product intent. Not “make this better.” Not “add support for X.”</p><p>Instead: numbered, testable behavioral invariants.</p><p>This is probably one of the underrated lessons of AI-assisted engineering: agents are much more effective when your process forces humans to make implicit assumptions explicit.</p><p>A good spec is not bureaucracy anymore. It is executable context.</p><h3>3. Warp’s review process assumes agents are the first reviewer, not the last resort</h3><p>Warp’s contribution docs say that when a PR is opened, Oz is automatically assigned and produces the initial review. Only after Oz approves does the PR get routed to a Warp team subject-matter expert.</p><p>That is a meaningful inversion.</p><p>In many engineering teams, AI review is still treated as optional. A nice-to-have. 
Something a developer might run before asking a human for review.</p><p>Warp is making the agent review a required gate before human review.</p><p>The review-pr skill is also quite concrete. It tells the agent to focus on correctness, security, error handling, and meaningful performance issues. It requires structured output in review.json, with categories like critical, important, suggestion, and nit. It even tells the agent to avoid posting directly to GitHub and to validate the JSON with jq.</p><p>This is a nice example of something I expect to become common: agents producing machine-readable review artifacts, not just chatty prose.</p><p>That matters because the future of code review is probably not one agent leaving a giant wall of text on your PR. It is agents generating structured findings that can be filtered, deduplicated, escalated, suppressed, turned into suggestions, routed to owners, and measured over time.</p><p>The hard part is not getting an LLM to say something about a diff. The hard part is making its review output operationally useful.</p><p>Warp’s repo is clearly thinking about that.</p><h3>4. The open source release is opinionated about specs before code</h3><p>The contribution process distinguishes bug fixes from features.</p><p>Bug fixes can generally go straight to implementation once triaged. Feature work, however, goes through a spec PR first. The product spec defines behavior from the user or consumer perspective. The tech spec explains the implementation plan, relevant files, module changes, data flow, tradeoffs, and validation plan.</p><p>This is not surprising in a normal engineering org. It is more interesting in an open source repo.</p><p>Most open source projects historically have a bias toward code. Someone scratches an itch, sends a patch, argues in the PR, and eventually the maintainer decides whether the change belongs.</p><p>Warp is moving design discussion earlier in the funnel. 
The repo already contains a large specs/ directory with many app-ticket-style spec folders.</p><p>I like this pattern for agentic development.</p><p>When agents write more of the code, the risk shifts from “can we implement this?” to “are we implementing the right thing?” A spec-first workflow slows down the part that should be slow: product judgment and design clarity. Then it speeds up the part that can increasingly be automated: code generation and mechanical implementation.</p><p>That is a good trade.</p><p>It also fits a lesson I have learned using Claude Code, Cursor, Codex, and other tools: the best agent workflows look suspiciously like good senior-engineer workflows. Define the goal. Identify constraints. Write down invariants. Make a plan. Implement incrementally. Run tests. Review the diff. Update the plan when reality disagrees.</p><p>Agents did not eliminate engineering discipline. They made engineering discipline more valuable.</p><h3>5. Warp’s architecture is Rust-heavy, modular, and very much not “just a terminal”</h3><p>The GitHub language breakdown shows the repo is overwhelmingly Rust: about 98% Rust at the time I checked.</p><p>The top-level Cargo workspace includes the main app plus many crates: AI, command parsing, computer use, editor, GraphQL, LSP, persistence, repo metadata, settings, terminal, UI, virtual filesystem, voice input, Warp CLI, Warp terminal, Warp UI, and more.</p><p>The engineering guide describes Warp as a Rust-based terminal emulator with a custom UI framework called WarpUI. 
It calls out major app areas including terminal emulation, shell management, AI integration, Agent Mode, Drive, authentication, settings, workspace/session management, and GraphQL.</p><p>That tells you something about what Warp has become.</p><p>The terminal is still there, but the product is no longer merely “a terminal emulator.” It is a developer workspace with:</p><ul><li>terminal blocks</li><li>AI agents</li><li>code context</li><li>cloud sync</li><li>notebooks</li><li>Drive objects</li><li>diff and review surfaces</li><li>file trees</li><li>remote sessions</li><li>settings and feature flags</li><li>integration tests</li><li>custom UI primitives</li></ul><p>This mirrors a broader trend. The terminal, editor, browser, and AI chat surface are starting to blur together. Cursor pulled AI into the editor. Claude Code and Codex pulled AI into the terminal. Warp is trying to make the terminal itself into a richer agentic workspace.</p><p>Whether that wins is an open question, but the architecture makes the ambition obvious.</p><h3>6. The custom UI framework may be one of the most reusable pieces</h3><p>Warp’s README says the warpui_core and warpui crates are MIT licensed, while the rest of the repository is AGPL v3.</p><p>That licensing split is interesting.</p><p>The product code is AGPL, which discourages closed-source commercial forks. But the UI framework is MIT, which makes it much easier for other projects to reuse.</p><p>The engineering guide describes WarpUI as a custom UI framework built around an Entity-Component-Handle pattern. There is a global App object that owns views and models, views hold handles to other views, AppContext provides temporary access during render/events, and elements describe visual layout in a Flutter-inspired way.</p><p>That sounds like one of the more technically interesting parts of the codebase.</p><p>Terminals are weird UI applications. 
They need high-performance rendering, keyboard precision, mouse handling, text selection, panes, tabs, overlays, scrollback, terminal emulation, and now AI-native surfaces like diffs, file trees, chat, and agent status. Electron is not always a great fit for that. Native UI frameworks can be limiting. Building a custom Rust UI framework is expensive, but it can make sense if the product has unusual interaction requirements.</p><p>By MIT licensing WarpUI, Warp may be creating a path for other Rust desktop apps to borrow some of that work without adopting the whole Warp product or AGPL obligations.</p><p>That is a smart open source move.</p><h3>7. Warp is explicitly multi-agent and multi-harness, but Oz is the “happy path”</h3><p>Warp’s launch post says the product is multi-model and multi-harness, and that the open source release doubles down on that openness. It also says Warp is adding support for a wider range of open source models, including Kimi, MiniMax, Qwen, and an “auto (open)” model-routed option.</p><p>The README says users can bring their own CLI agent, including Claude Code, Codex, Gemini CLI, and others.</p><p>But the contribution process clearly prefers Oz. The launch post says contributors are free to use other coding agents, but Warp prefers Oz because it has the correct skills and verification loops built in.</p><p>That tension is worth watching.</p><p>On one hand, Warp is positioning itself as open and multi-agent. On the other hand, the best-supported path for contributing is through its own orchestration system.</p><p>This is not necessarily bad. In fact, it may be the only practical way to make agentic open source contribution work at scale. If maintainers are going to trust agent-generated code, they need repeatable processes, repo-specific context, verification loops, and review gates. 
A random local agent with unknown prompts and unknown tool use is harder to trust.</p><p>But it does mean that the future of open source may not be “anyone with a text editor can send a patch.”</p><p>It may be “anyone can participate, but high-throughput contribution happens through shared agent infrastructure.”</p><p>That is a subtle but important shift.</p><h3>8. The repo shows how much context modern AI coding agents need</h3><p>The WARP.md file is basically a condensed engineering onboarding document for agents and humans. It explains build commands, test commands, lint commands, architecture, UI patterns, terminal model locking risks, database usage, GraphQL generation, feature flags, and style conventions.</p><p>Some of the details are exactly the sort of thing an agent needs to avoid creating subtle bugs.</p><p>For example, the guide warns developers to be careful calling model.lock() on the terminal model because acquiring multiple locks from different call sites can deadlock and freeze the UI. It recommends passing already-locked model references down the call stack and keeping lock scopes short.</p><p>That is not generic Rust advice. That is repo-specific scar tissue.</p><p>This is where agent performance will increasingly come from. Not just better models. Better context.</p><p>The best agents will know:</p><ul><li>how the repo is built</li><li>where tests live</li><li>what patterns are preferred</li><li>what patterns are dangerous</li><li>which abstractions are stable</li><li>which parts of the codebase are legacy</li><li>how maintainers expect PRs to be structured</li><li>what invariants must not regress</li></ul><p>In other words, the future of AI coding is not just “bigger model.” It is “bigger model plus better repo memory plus better local tools plus better review loops.”</p><p>Warp’s repo is a good example of that direction.</p><h3>9. 
Open source is being used as a competitive strategy, not charity</h3><p>Warp is refreshingly direct about why it is doing this.</p><p>The launch post says open sourcing is fundamentally tied to building a successful business. Warp is competing with well-funded, closed-source competitors and believes opening the product and giving the community resources to improve it is a way to accelerate product development.</p><p>That is honest, and I think it is important.</p><p>Open source has always had a mix of motivations: ideology, community, developer adoption, recruiting, support burden reduction, ecosystem building, and commercial strategy. Warp’s release is very clearly in the commercial-strategy bucket, but with a new twist: agentic leverage.</p><p>The bet seems to be:</p><ol><li>Make the client open source.</li><li>Let the community propose ideas and validate behavior.</li><li>Use Oz agents to scale implementation and review.</li><li>Keep the cloud/orchestration/business layer commercially valuable.</li><li>Move faster than closed competitors.</li></ol><p>That is a very 2026 open source strategy.</p><p>The question is whether the community will accept it. Some developers will love it because they can finally inspect and contribute to Warp. Others will be skeptical because the contribution workflow routes through proprietary agent infrastructure and the non-UI product code is AGPL, not MIT or Apache.</p><p>Both reactions are reasonable.</p><p>But regardless of where you land philosophically, the release is worth studying because it is one of the clearest examples so far of a company trying to redesign open source around agents.</p><h3>Final Thoughts</h3><p>I came into this expecting to find interesting Rust architecture.</p><p>I did find that. 
Warp is a large Rust codebase with a custom UI framework, a modular workspace structure, terminal emulation, AI integration, repo metadata, remote sessions, persistence, GraphQL, feature flags, and a lot of platform-specific complexity.</p><p>But the bigger story is not the code.</p><p>The bigger story is the operating model.</p><p>Warp’s open source repo is a public experiment in agent-managed software development. The repo contains not only source code, but also agent skills, spec workflows, review schemas, contribution gates, and context files designed to make agents more effective.</p><p>That feels like the future.</p><p>Not because every repo will use Oz. Not because every company will open source its product. Not because agents will magically replace maintainers.</p><p>But because more and more software development will be organized around a new division of labor:</p><p>Humans decide what matters.</p><p>Humans define the behavior.</p><p>Humans review the tradeoffs.</p><p>Humans validate the result.</p><p>Agents do more and more of the mechanical work in between.</p><p>Warp’s repo is one of the first open source projects I have seen that really leans into that model. Whether it succeeds or not, I suspect many other projects will copy pieces of it.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5d4409726bf1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/what-warps-open-source-release-tells-us-about-the-future-of-agentic-software-development-5d4409726bf1">What Warp’s Open Source Release Tells Us About the Future of Agentic Software Development</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[5 Management Tips in the Age of AI]]></title>
            <link>https://medium.com/jonathans-musings/5-management-tips-in-the-age-of-ai-2b0ecb739b22?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/2b0ecb739b22</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Sat, 02 May 2026 22:43:28 GMT</pubDate>
            <atom:updated>2026-05-02T22:43:28.313Z</atom:updated>
            <content:encoded><![CDATA[<h4>Engineering managers do not need to become full-time prompt engineers. But they do need to understand how AI changes the job of building software.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*FvkZgHDKYcv1LgLDzigwgQ.png" /></figure><p>I’ve spent the last year deep in AI-assisted coding. As a Staff Software Engineer at Datadog, I use Claude Code daily — and before that at Eppo, ID.me, and Storyblocks, I was on the management side, leading teams through major technical shifts.</p><p>The AI coding revolution is different. It’s not just a new framework or a cloud migration. It’s a fundamental change in how software gets written. And that means management has to change too.</p><p>Here’s what I’ve learned from both sides of the table.</p><h3>1. Get Your Hands Dirty with AI Tools</h3><p>You can’t manage what you don’t understand.</p><p>This has always been true in engineering management, but it’s especially critical now. AI coding tools like Claude Code, Cursor, and Codex aren’t incremental improvements — they’re a paradigm shift. The productivity gains are real (I’ve written extensively about this), but so are the failure modes.</p><p>When I was interim Head of Engineering at Eppo, I made a point of staying close to the code. That instinct matters even more today. If you’re not using these tools yourself, you’re flying blind:</p><ul><li>You won’t understand what “good” AI-assisted output looks like</li><li>You can’t evaluate whether your team’s struggles are skill gaps or tool limitations</li><li>You’ll miss opportunities to coach on prompting, context management, and agent workflows</li></ul><p>Block time on your calendar. Spin up Claude Code on a real problem — even if it’s just understanding a confusing part of your codebase. The learning curve is steep at first, but it flattens fast.</p><h3>2. 
Use AI to Make Your Job Easier</h3><p>Here’s a truth that took me too long to internalize: the boring parts of management are exactly where AI shines.</p><p>Performance review season used to mean days of compiling data, cross-referencing PRs, digging through Slack threads. Now? I point Claude at the raw materials and get a first draft in minutes. Same with:</p><ul><li><strong>Email drafts</strong> — especially delicate ones that need the right tone</li><li><strong>Confluence documentation</strong> — architectural decision records, onboarding guides, process docs</li><li><strong>Meeting prep</strong> — synthesizing context from multiple sources</li><li><strong>Status updates</strong> — pulling together what shipped, what’s blocked, what’s next</li></ul><p>This isn’t about being lazy. It’s about freeing your brain for the work that actually requires human judgment — coaching, strategy, cross-team alignment. The administrative overhead that used to eat half your week can shrink dramatically.</p><p>Every hour you save on bureaucracy is an hour you can spend making your team better.</p><h3>3. Give Your Team Space to Learn — Then Make It an Expectation</h3><p>AI adoption isn’t optional anymore. But people learn at different paces, and the transition can be genuinely disorienting.</p><p>The best approach I’ve seen combines patience with clarity:</p><p><strong>First, create space.</strong> Carve out time for experimentation. Let people try different tools, find workflows that match their style. Some engineers will thrive with Claude Code’s terminal-first approach; others will prefer Cursor’s IDE integration. That’s fine.</p><p><strong>Then, set expectations.</strong> After a reasonable ramp period, AI-assisted coding should be the norm, not the exception. This isn’t about mandating specific tools — it’s about the outcome. 
Engineers who refuse to adopt AI assistance will increasingly struggle to keep pace.</p><p>At Storyblocks, when we introduced new technologies (e.g., Node.js, React), we always paired learning time with clear expectations about eventual adoption. The same principle applies here, just with higher stakes.</p><h3>4. Encourage Velocity — But Don’t Reward Tokenmaxing</h3><p>AI tools make it trivially easy to generate massive amounts of code. That’s a feature and a bug.</p><p>I’ve seen engineers fall into what I call “tokenmaxing” — measuring their productivity by tokens consumed rather than problems solved. It’s the AI equivalent of typing faster instead of thinking better.</p><p>The goal isn’t maximum output. It’s maximum <em>impact</em>.</p><p>The best AI-assisted engineers I work with use the speed gains to:</p><ul><li>Explore more solution options before committing</li><li>Write more comprehensive tests</li><li>Refactor ruthlessly instead of leaving tech debt “for later”</li><li>Spend more time understanding the problem before jumping to code</li></ul><p>As a manager, what you reward shapes behavior. Celebrate thoughtful, well-architected solutions — not the biggest PRs. Ask “what problem did this solve?” not “how many lines did you ship?”</p><h3>5. Integrate Automated Code Review Tools</h3><p>Here’s the reality: AI-assisted coding produces more code, faster. Your review processes need to scale accordingly.</p><p>At Datadog, we’re seeing PRs that would have taken a week now landing in a day or two. 
That’s great for velocity — but it creates a review bottleneck if humans are the only line of defense.</p><p>Automated code review tools (Codex Review, Greptile, CodeRabbit, Codium) can help by:</p><ul><li>Catching obvious issues before human reviewers even look</li><li>Flagging patterns that might indicate AI-generated code gone wrong</li><li>Providing consistent baseline feedback on style and best practices</li><li>Freeing human reviewers to focus on architecture, design, and edge cases</li></ul><p>This isn’t about replacing human review — it’s about making it sustainable. When your team is shipping 3x the code, you need 3x the review capacity. Automation is the only realistic path.</p><h3>The Bottom Line</h3><p>The AI transformation isn’t coming — it’s here. Engineers who embrace it are already pulling ahead. Teams that adapt their processes are shipping faster than ever.</p><p>As a manager, your job is to lead that transformation:</p><ol><li>Use the tools yourself so you can lead with credibility</li><li>Apply AI to your own work to free up time for what matters</li><li>Give your team room to learn, then hold them to the new standard</li><li>Reward impact, not output</li><li>Scale your review processes to match your new velocity</li></ol><p>The managers who figure this out will build the best teams of the next decade. The ones who don’t will wonder why they can’t keep up.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2b0ecb739b22" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/5-management-tips-in-the-age-of-ai-2b0ecb739b22">5 Management Tips in the Age of AI</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Inside the Agent Harness: How Codex and Claude Code Actually Work]]></title>
            <link>https://medium.com/jonathans-musings/inside-the-agent-harness-how-codex-and-claude-code-actually-work-63593e26c176?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/63593e26c176</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Wed, 22 Apr 2026 23:04:46 GMT</pubDate>
            <atom:updated>2026-04-22T23:04:46.884Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>A deep technical dive into how CLI coding agents structure their conversations, manage context, and orchestrate tool calls.</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xCmpKL7GdnxGGLul0UIbyw.png" /></figure><p>If you’ve used Claude Code or OpenAI’s Codex CLI, you’ve experienced the magic: type a request, watch the agent think, and see it execute shell commands, edit files, and solve complex problems. But what’s actually happening under the hood?</p><p>I spent time analyzing the <a href="https://github.com/openai/codex">Codex CLI codebase</a> (which OpenAI open-sourced) to understand exactly how these agent harnesses work. The details are fascinating — and very different from what most people imagine.</p><h3>The Core Loop: It’s Simpler Than You Think</h3><p>At its heart, every coding agent runs a surprisingly simple loop. Here’s the pseudocode from Codex’s turn.rs:</p><pre>while needs_follow_up:<br>    1. Gather conversation history<br>    2. Send to LLM with tools<br>    3. Process response:<br>       - If tool calls → execute them, add results to history, continue<br>       - If just text → done with this turn</pre><p>That’s it. The “agentic” behavior emerges from this loop running until the model decides it’s done. There’s no complex planning system, no separate “reasoning engine” — just repeated calls to the same LLM with an accumulating context.</p><h3>What Actually Gets Sent to the Model</h3><p>This is where it gets interesting. The agent harness constructs a ResponsesApiRequest with several key components:</p><h3>1. 
System Instructions (Base + User)</h3><p>Codex builds layered instructions:</p><ul><li><strong>Base instructions:</strong> Model-specific guidance (the “personality”)</li><li><strong>User instructions:</strong> Wrapped in &lt;user_instructions&gt; tags</li><li><strong>Skills instructions:</strong> Dynamic guidance based on detected needs</li><li><strong>App/plugin instructions:</strong> Context from connected tools</li></ul><p>These get concatenated into the instructions field of the API request. The layering is deliberate — it lets the agent inject context-specific guidance without polluting the base prompt.</p><h3>2. Conversation History (The Input Array)</h3><p>The input field contains the full conversation transcript as ResponseItem objects:</p><ul><li>User messages</li><li>Assistant messages (previous model outputs)</li><li>Function calls (tool invocations)</li><li>Function call outputs (tool results)</li></ul><p>Critically, the harness doesn’t just store text — it preserves the <em>structure</em>. A shell command and its output are linked by call_id, so the model understands the causal relationship.</p><h3>3. Tool Definitions</h3><p>The tools array contains JSON schemas for each available tool. 
Here&#39;s what Codex&#39;s shell tool looks like:</p><pre>{<br>  &quot;type&quot;: &quot;function&quot;,<br>  &quot;name&quot;: &quot;shell&quot;,<br>  &quot;description&quot;: &quot;Run a shell command&quot;,<br>  &quot;strict&quot;: false,<br>  &quot;parameters&quot;: {<br>    &quot;type&quot;: &quot;object&quot;,<br>    &quot;properties&quot;: {<br>      &quot;command&quot;: {<br>        &quot;type&quot;: &quot;array&quot;,<br>        &quot;items&quot;: { &quot;type&quot;: &quot;string&quot; },<br>        &quot;description&quot;: &quot;The command to execute&quot;<br>      },<br>      &quot;workdir&quot;: {<br>        &quot;type&quot;: &quot;string&quot;,<br>        &quot;description&quot;: &quot;Working directory&quot;<br>      },<br>      &quot;timeout_ms&quot;: {<br>        &quot;type&quot;: &quot;number&quot;,<br>        &quot;description&quot;: &quot;Timeout in milliseconds&quot;<br>      }<br>    },<br>    &quot;required&quot;: [&quot;command&quot;]<br>  }<br>}</pre><p>The harness dynamically adjusts which tools are available based on permissions, sandbox mode, and detected context (e.g., adding view_image only when images are present).</p><h3>The Magic of Tool Call Execution</h3><p>When the model returns a tool call, the harness has to:</p><ol><li><strong>Parse the arguments:</strong> The model returns JSON arguments that must be validated against the schema</li><li><strong>Check permissions:</strong> Does this command require user approval? Is it in the sandbox allowlist?</li><li><strong>Execute in sandbox:</strong> On macOS, commands run under sandbox-exec (Seatbelt). On Linux, Landlock. 
Windows has its own sandbox.</li><li><strong>Capture output:</strong> stdout, stderr, exit code, and timing</li><li><strong>Truncate if needed:</strong> Long outputs get truncated with head/tail preservation</li><li><strong>Format for model:</strong> The output is structured so the model can understand success/failure</li></ol><p>The truncation logic is particularly clever. Codex uses token-aware truncation that preserves the beginning and end of output while eliding the middle:</p><pre>Exit code: 0<br>Wall time: 1.23 seconds<br>Total output lines: 5000<br>Output:<br>[first 100 lines...]<br>... (4800 lines omitted) ...<br>[last 100 lines...]</pre><h3>Context Management: The Unsung Hero</h3><p>Context management is where agent harnesses earn their keep. The ContextManager in Codex handles:</p><h3>Token Counting and Limits</h3><p>The harness estimates token counts for every message using byte-based heuristics (roughly 4 characters per token). When approaching the model’s context limit, it triggers compaction.</p><h3>Auto-Compaction</h3><p>When context gets too long, Codex calls the model with a special “summarize this conversation” prompt. The summary replaces the old history, preserving the essential context while freeing tokens. This happens mid-turn if needed — the user never sees the interruption.</p><h3>Context Diffing</h3><p>Codex tracks a “reference context” and only sends <em>changes</em> when possible. If nothing significant changed between turns, it doesn’t re-inject the full context. 
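The diffing idea can be sketched in a few lines of Python. All the names below are illustrative, not Codex's actual API (its real ContextManager is Rust and considerably more involved):

```python
# Hypothetical sketch of reference-context diffing. Names are illustrative,
# not Codex's real implementation.

class ContextDiffer:
    def __init__(self):
        self.reference = []  # items already sent in a previous request

    def delta(self, history):
        """Return only the items the model has not seen yet."""
        n = len(self.reference)
        if history[:n] == self.reference:
            new_items = history[n:]      # pure append: send just the tail
        else:
            # history was rewritten (e.g. after compaction): re-send everything
            new_items = list(history)
        self.reference = list(history)
        return new_items

differ = ContextDiffer()
differ.delta(["system", "user: hi"])  # first request sends the full context
print(differ.delta(["system", "user: hi", "assistant: hello"]))
# → ['assistant: hello']
```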
This saves tokens and improves cache hit rates.</p><h3>The Responses API: Built for Agents</h3><p>OpenAI’s Responses API (used by Codex) has features specifically designed for agentic use:</p><ul><li><strong>parallel_tool_calls:</strong> The model can request multiple tool executions at once</li><li><strong>tool_choice:</strong> Can be &quot;auto&quot;, &quot;required&quot;, or a specific tool name</li><li><strong>reasoning:</strong> Controls extended thinking (effort: &quot;low&quot; | &quot;medium&quot; | &quot;high&quot;)</li><li><strong>store:</strong> Whether to persist the conversation for later retrieval</li><li><strong>prompt_cache_key:</strong> Enables KV-cache sharing across requests</li></ul><p>The streaming response returns structured events: OutputItemDone, ToolCallInputDelta, ReasoningContentDelta, etc. The harness parses these to update the UI in real-time and detect when tool execution is needed.</p><h3>Parallel Tool Execution</h3><p>When the model requests multiple tools simultaneously, the harness has decisions to make:</p><ol><li><strong>Dependency analysis:</strong> Can these truly run in parallel, or does one depend on another’s output?</li><li><strong>Resource constraints:</strong> How many concurrent processes can we spawn?</li><li><strong>Approval batching:</strong> Should we ask the user to approve all at once, or one at a time?</li></ol><p>Codex uses a FuturesOrdered collection to run independent tool calls concurrently while maintaining result order. The outputs get collected and sent back to the model in a single follow-up request.</p><h3>The Approval System</h3><p>Safety is non-negotiable in a tool that can execute arbitrary commands. 
Codex implements a layered approval system:</p><ul><li><strong>Safe commands:</strong> ls, cat, git status — auto-approved</li><li><strong>Pattern-matched:</strong> Commands matching configured allowlists</li><li><strong>Sandbox violations:</strong> Network access, file writes outside workspace — require explicit approval</li><li><strong>Granular policies:</strong> “Trust file writes but prompt for network”</li></ul><p>The Guardian module intercepts tool calls before execution, evaluates them against the policy, and either proceeds, prompts the user, or blocks entirely.</p><h3>Sub-Agents and Multi-Agent Coordination</h3><p>Codex supports spawning sub-agents for parallel work. The parent agent can:</p><ul><li><strong>spawn_agent:</strong> Create a new agent instance with a specific task</li><li><strong>wait_agent:</strong> Block until a spawned agent completes</li><li><strong>send_message:</strong> Communicate with a running sub-agent</li><li><strong>close_agent:</strong> Terminate a sub-agent</li></ul><p>Each sub-agent maintains its own context and tool execution sandbox. The parent receives structured status updates and can coordinate complex multi-step workflows.</p><h3>MCP: The Protocol Layer</h3><p>Codex supports the Model Context Protocol (MCP) for external tool integration. 
MCP servers expose tools that the harness can discover and invoke:</p><pre>mcp_tools = await mcp_connection_manager.list_all_tools()<br>for tool in mcp_tools:<br>    tool_spec = mcp_tool_to_responses_api_tool(tool)<br>    available_tools.append(tool_spec)</pre><p>This enables plugins like database connectors, API wrappers, and custom enterprise tools — all without modifying the core agent.</p><h3>What This Means for Agent Development</h3><p>After studying Codex’s architecture, a few things stand out:</p><ol><li><strong>The loop is trivial; the infrastructure is everything.</strong> Building an agent is 5% “call the model in a loop” and 95% context management, tool execution, sandboxing, and error handling.</li><li><strong>Token efficiency matters enormously.</strong> Without intelligent truncation and compaction, agents hit context limits after a few commands. The difference between a toy demo and a production agent is in these details.</li><li><strong>Structured tool outputs are critical.</strong> The model needs to understand <em>what happened</em>. Raw stdout isn’t enough — you need exit codes, timing, truncation indicators, and clear success/failure signals.</li><li><strong>Safety is table stakes.</strong> Any agent that can execute shell commands needs approval flows, sandboxing, and audit logging. Codex’s layered approach (Guardian → sandbox → policy) is a good template.</li><li><strong>The API contract matters.</strong> OpenAI’s Responses API was clearly designed with agents in mind. Features like parallel_tool_calls, structured streaming events, and prompt_cache_key only make sense in an agentic context.</li></ol><h3>Looking Forward</h3><p>Agent harnesses like Codex and Claude Code are still early. The architectures are converging on similar patterns — the loop, the context manager, the tool registry, the approval system. 
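Stripped of vendor specifics, that converged core loop looks roughly like this. This is a Python sketch with hypothetical names, not Codex's or Claude Code's actual implementation:

```python
# Generic agent-harness turn loop. Everything here is an illustrative
# sketch; real harnesses wrap this in sandboxing, approvals, and compaction.

def run_turn(llm, tools, history, user_message):
    """Run one user turn: loop until the model stops requesting tools."""
    history.append({"role": "user", "content": user_message})
    while True:
        response = llm(history, tools)          # one model call per iteration
        history.append(response)
        tool_calls = response.get("tool_calls", [])
        if not tool_calls:                      # plain text: the turn is done
            return response["content"]
        for call in tool_calls:                 # execute tools, feed results back
            result = tools[call["name"]](**call["args"])
            history.append({"role": "tool",
                            "call_id": call["id"],
                            "content": str(result)})
```

Sandboxing, truncation, approvals, and compaction all live around this loop rather than inside the model, which is exactly why the harnesses end up looking so similar.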
It’s exciting to see where they’ll go!</p><p><em>This post is based on analysis of the </em><a href="https://github.com/openai/codex"><em>Codex CLI source code</em></a><em>. If you’re building agents, the codebase is worth studying — it’s well-structured Rust with comprehensive tests and clear architectural separation.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=63593e26c176" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/inside-the-agent-harness-how-codex-and-claude-code-actually-work-63593e26c176">Inside the Agent Harness: How Codex and Claude Code Actually Work</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[AI Engineering Is Exhilarating, Even a Year Later]]></title>
            <link>https://medium.com/jonathans-musings/ai-engineering-is-exhilarating-even-a-year-later-ff9e37e75979?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/ff9e37e75979</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <category><![CDATA[technology]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Sat, 18 Apr 2026 21:59:01 GMT</pubDate>
            <atom:updated>2026-04-18T21:59:01.152Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zgjaQcmSPGHOcPwBmTuSHQ.png" /></figure><p>I’ve been coding since middle school — nearly 30 years now. I’ve loved it for most of that time. The puzzle-solving, the craft of building something from nothing, the satisfaction of watching code come alive. But I’ll be honest: somewhere in the last decade, some of the magic had faded. Not gone, just… muted.</p><p>Then, about a year ago, everything changed.</p><h3>The Beginning</h3><p>Around April 2025, I started using agent mode in Cursor. It was immediately different from anything I’d experienced before. This wasn’t autocomplete. This wasn’t a smarter Stack Overflow. This was a collaborator — one that could hold context, reason through problems, and execute across files.</p><p>I was hooked.</p><p>By June, I’d picked up Claude Code, and the experience leveled up again. I wrote extensively about what I was learning, the workflows I was developing, the productivity gains I was seeing. I figured the novelty would wear off eventually. It always does, right?</p><p>It hasn’t.</p><h3>Still Exhilarating</h3><p>Here I am, over a year later, and the thrill hasn’t faded. If anything, it’s intensified. I’ve gotten better at prompting, better at structuring problems for AI collaboration, better at knowing when to let Claude run and when to step in. The ceiling keeps rising.</p><p>What surprises me most is how the <em>rest</em> of my workday feels now. The parts where I’m not using Claude Code? They feel dry. A little boring. Like I’m typing with gloves on.</p><p>I know that sounds dramatic. But after experiencing what it’s like to move at AI-assisted speed — to have a thought and watch it materialize into working code in minutes — going back to manual everything else feels like switching from a sports car to a bicycle. The bicycle still works. 
It’s just… slower.</p><h3>A Friday Night Anecdote</h3><p>Yesterday was a perfect example. I’d just posted a final PR for a small project. No clear next tasks on the board. I should have been satisfied — wrapping things up on a Friday is the dream, right?</p><p>Instead, I was disappointed.</p><p>Then, around 5:15pm, a Slack message came in. Something to investigate. By 6pm, I’d checked out the issue and remembered a related task I could tackle — one that touched multiple repos and services.</p><p>I spun up two Claude Code sessions in parallel, one for each repo. For the next 30 minutes, I was in flow. Exploring the codebase, discussing approaches with Claude, making changes. When I finally pushed the commit, it was maybe 10 lines of actual code changes.</p><p>But those 30 minutes? On a Friday night? They were genuinely fun.</p><p>That’s the part that still amazes me. This isn’t just productive. It’s <em>enjoyable</em>. I’ve rediscovered something I thought I’d lost — the pure joy of coding.</p><h3>The Centaur Phase</h3><p>I’ve written before about AI’s impact on jobs. I don’t have my head in the sand — this technology is going to reshape our industry in profound ways. Some roles will change. Some will disappear.</p><p>But right now, in this moment, I’m experiencing something special: the centaur phase. Half human, half machine. AI isn’t replacing me; it’s amplifying me. It’s like wearing an Iron Man suit for software engineering.</p><p>I still make the decisions. I still architect the systems. I still review the code and catch the edge cases Claude misses. But the execution? The boilerplate? The tedious parts? Those happen at superhuman speed.</p><p>And it’s <em>so. much. fun.</em></p><h3>What This Means</h3><p>If you’re a developer and you haven’t deeply explored AI-assisted coding yet, you’re missing out. Not just on productivity gains — though those are real — but on the joy of it. 
There’s something magical about pair programming with an AI that never gets tired, never judges your questions, and can context-switch across your entire codebase in seconds.</p><p>After nearly 30 years of coding, I didn’t expect to feel like a beginner again. That sense of wonder, of “I can’t believe this is possible.” But here I am, a year into the AI engineering era, and I’m still waking up excited to code.</p><p>That’s not nothing.</p><p>That’s everything.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ff9e37e75979" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/ai-engineering-is-exhilarating-even-a-year-later-ff9e37e75979">AI Engineering Is Exhilarating, Even a Year Later</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Economics of AI Employees]]></title>
            <link>https://medium.com/jonathans-musings/the-economics-of-ai-employees-b3704d6dc4c1?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/b3704d6dc4c1</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Fri, 17 Apr 2026 00:34:48 GMT</pubDate>
            <atom:updated>2026-04-17T00:34:48.310Z</atom:updated>
            <content:encoded><![CDATA[<h4>Why we’ll likely have fewer AI knowledge workers than current human knowledge workers.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*MFXJv6eU4qa8UpoC4axw7g.png" /></figure><p>In March, I wrote about <a href="https://medium.com/jonathans-musings/the-exonomics-of-ai-driven-software-development-70665b979496">the economics of AI-driven software development</a> — how the marginal cost of code has plummeted while developer productivity has soared. The numbers were compelling: AI adds less than 10% to development costs while potentially doubling or quadrupling output.</p><p>But there’s a follow-on question I’ve been wrestling with: what happens when the AI isn’t just <em>assisting</em> a human — but <em>replacing</em> one?</p><p>The answer surprised me. The economics of AI employees don’t scale the way you’d expect.</p><h3>My AI Journey</h3><p>I’ve been using AI coding assistants since 2024, starting with Cursor. In June 2025, I picked up Claude Code and started shipping at ~300 lines per hour. By early 2026, I’d set up <a href="https://medium.com/jonathans-musings/my-first-three-days-with-clawdbot-now-moltbot-18835f351903">Clawdbot</a> (now OpenClaw) — an AI agent that runs on my machine, connects to WhatsApp, and handles tasks autonomously.</p><p>As I wrote in <a href="https://medium.com/jonathans-musings/the-path-towards-agi-now-seems-possible-afb7bd2bd698">“The Path Towards AGI Now Seems Possible”</a>, this felt like a step change. I message my agent from my phone while walking the dog, and there’s a working feature to test when I get back. Sub-agents work in the background. It remembers context across sessions.</p><p>But here’s the thing: I run my Clawdbot on Anthropic’s Claude Opus 4.5. 
And I recently got a visceral lesson in what AI employees actually cost.</p><h3>The $200 Wake-Up Call</h3><p>Anthropic recently changed their policies — subscription plans can no longer be used for Opus 4.5 API access. I’m now paying API rates directly.</p><p>Over the past two weeks, I’ve spent roughly $200 on my Clawdbot. And here’s the kicker: <strong>I don’t even use it that heavily</strong>. It checks my email on a cron job a few times a day. I use it for occasional writing and research. Maybe an hour or two of actual interaction per week.</p><p>$200 for a mostly-idle assistant.</p><p>That got me thinking: what would it cost to run an AI employee that actually works full-time?</p><h3>The Math: AI Employees Are Expensive</h3><p>Let’s estimate the cost of running Claude Opus 4.7 (the latest flagship model) for continuous, demanding work — the kind of work you’d expect from a full-time knowledge worker.</p><p>The key variables:</p><ul><li><strong>Context maintenance:</strong> Complex work requires large context windows — often 100K+ tokens of project state, documentation, and history</li><li><strong>Output volume:</strong> A working AI generates substantial output — code, documents, analysis, communications</li><li><strong>Iteration cycles:</strong> Real work involves back-and-forth, debugging, refinement — not single-shot responses</li></ul><p>Based on current Opus 4.7 pricing and realistic usage patterns for sustained cognitive work, a conservative estimate lands at <strong>$1,000 or more per day</strong> of continuous operation.</p><p>Do the math: $1,000/day × 365 days = <strong>$365,000 per year</strong>.</p><p>And that’s just API costs. It doesn’t include the infrastructure to run the agent, the engineering time to build and maintain integrations, or the human oversight still required for anything consequential.</p><h3>The Human Benchmark</h3><p>According to research aggregating BLS, OECD, and industry data, the average U.S. 
knowledge worker earns <strong>$100,000–$110,000 per year</strong> in total compensation (salary plus benefits). There are approximately 100 million knowledge workers in the United States, generating $10–11 trillion in annual compensation.</p><p>So we have:</p><ul><li><strong>AI employee (Opus 4.7, full-time):</strong> $365,000+/year in API costs alone</li><li><strong>Human knowledge worker:</strong> ~$105,000/year in total compensation</li></ul><p>An AI employee costs <strong>3.5x as much as a human</strong> just in compute — before you account for anything else.</p><p>Even if you use a cheaper model like Sonnet with aggressive prompt caching, you’re still looking at $50,000–$100,000/year for continuous operation. That’s comparable to a human salary, not dramatically cheaper.</p><h3>The Diminishing Returns Problem</h3><p>But wait — doesn’t AI produce more output? Isn’t a 3.5x cost justified if you get 10x the productivity?</p><p>Here’s where the economics get uncomfortable: <strong>10x output doesn’t equal 10x value</strong>.</p><p>This is the fundamental issue with knowledge work that we rarely discuss. Most knowledge work has sharply diminishing marginal returns:</p><ul><li><strong>Writing:</strong> The first draft captures most of the value. The 10th revision adds marginal improvement.</li><li><strong>Code:</strong> The working feature is what matters. Additional polish has declining impact.</li><li><strong>Analysis:</strong> The key insight is worth everything. Additional charts are decoration.</li><li><strong>Research:</strong> After finding the answer, continued research is wheel-spinning.</li></ul><p>There’s a ceiling on how much knowledge work the world needs. We can only absorb so much code, so many reports, so many analyses. The bottleneck isn’t production — it’s consumption and implementation.</p><p>Consider healthcare: there are only so many people with cancer who need treatment plans. There are only so many patients a hospital can serve. 
Producing 10x more treatment plans doesn’t help 10x more patients — it just creates a backlog.</p><p>Or consider law: you can generate infinite contracts and briefs, but there are only so many deals to close and cases to argue.</p><p>Or software: you can write infinite features, but users can only learn and adopt so many.</p><p>The value of knowledge work is <strong>bounded by the real-world problems it solves</strong>, not by the volume of output it produces.</p><h3>The Counterintuitive Conclusion</h3><p>Put these two factors together — expensive compute and diminishing returns — and you reach a surprising conclusion:</p><p><strong>We’re likely to replace human knowledge workers with a smaller number of AI employees in the near term, not a larger one.</strong></p><p>The economics just don’t work for most use cases. An AI that costs $365,000/year needs to deliver massive value to justify itself. Very few tasks meet that bar when you account for diminishing returns.</p><p>What does work economically:</p><ul><li><strong>AI assistants</strong> that amplify human productivity (my current model — burst usage, not continuous)</li><li><strong>Specialized AI agents</strong> for high-value, time-sensitive tasks (trading, security monitoring)</li><li><strong>AI in the loop</strong> for human decisions, not replacing them</li></ul><p>What doesn’t work:</p><ul><li><strong>Equal numbers of full-time AI employees</strong> doing general knowledge work at current API prices</li><li><strong>AI armies</strong> producing massive volumes of output no one will consume</li><li><strong>AI replacing</strong> $100K workers when it costs $365K</li></ul><h3>The Long-Term View</h3><p>This analysis assumes current pricing. API costs will fall — they always do. Moore’s Law for inference is real.</p><p>But even as compute gets cheaper, the diminishing returns problem remains. The world doesn’t need infinite knowledge work. 
There’s a carrying capacity for analysis, for code, for content.</p><p>The likely equilibrium isn’t the same number of AI employees as there are humans today, but rather a smaller number of AI employees that collectively double or triple current output.</p><h3>The Bottom Line</h3><p>I love my Clawdbot. It’s genuinely useful. But I don’t use it as an employee — I use it as a force multiplier for my own time.</p><p>The lesson from my $200 two-week experiment: AI employees are expensive, and the value of knowledge work has natural limits. The economics will eventually favor AI employees taking over, but ironically there will be fewer of them than there are human knowledge workers today.</p><p><em>Jonathan Fulton is a Staff Software Engineer at Datadog, building experimentation infrastructure. He writes about AI-assisted development, architecture, and the business of software at </em><a href="https://medium.com/jonathans-musings"><em>Jonathan’s Musings</em></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b3704d6dc4c1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/the-economics-of-ai-employees-b3704d6dc4c1">The Economics of AI Employees</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How Managers Accidentally Create Stress for Their Best People]]></title>
            <link>https://medium.com/jonathans-musings/how-managers-accidentally-create-stress-for-their-best-people-815135d87b01?source=rss-d05e0fdb8e4f------2</link>
            <guid isPermaLink="false">https://medium.com/p/815135d87b01</guid>
            <category><![CDATA[technology]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[software-engineering]]></category>
            <dc:creator><![CDATA[Jonathan Fulton]]></dc:creator>
            <pubDate>Thu, 16 Apr 2026 00:21:05 GMT</pubDate>
            <atom:updated>2026-04-16T00:21:05.821Z</atom:updated>
            <content:encoded><![CDATA[<h4><em>The small habits that slowly erode trust, focus, and psychological safety</em></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z02rYUcgx9jjFHyRWNaGrw.png" /></figure><p>I’ve spent the last 15 years on both sides of the management divide. As SVP at Storyblocks, VP of Engineering at Foundry.ai, a manager at ID.me, and interim Head of Engineering at Eppo, I’ve made every mistake in this article. As an architect at ID.me and a Staff Engineer at Eppo and now Datadog, I’ve been on the receiving end of some of these same mistakes.</p><p>Here’s the uncomfortable truth: <strong>the managers causing the most stress aren’t bad people</strong>. They’re often high performers themselves — busy, well-intentioned, and completely unaware of the anxiety ripples created by their small, daily habits.</p><p>This post is for them. And honestly, it’s for me too — a reminder of the patterns I need to keep catching in myself.</p><h3>1. The Unannounced 1-on-1</h3><p>“Hey, do you have 15 minutes to chat?” Or maybe a 1-on-1 meeting invite just pops up on the calendar.</p><p>For managers operating on a manager schedule, this feels like nothing. You have context. You know you just want to ask about that API design, or get their opinion on a reorg, or maybe even deliver good news about a promotion.</p><p>For your IC, those words trigger an immediate cortisol spike. <strong>Everyone thinks they’re getting fired.</strong> I don’t care how secure they are, how much you’ve praised them, or how obvious it should be that they’re your best performer. The moment you schedule an ad-hoc meeting with no topic, their brain starts writing the termination script.</p><p><strong>The fix:</strong> Add one sentence of context. “Quick sync on the Q3 roadmap.” “Want to get your take on team structure.” “Good news to share.” Three seconds of typing saves hours of spiraling.</p><h3>2. 
The Phantom Meeting Change</h3><p>Paul Graham’s famous essay on the <a href="http://www.paulgraham.com/makersschedule.html">Maker’s Schedule vs. Manager’s Schedule</a> should be required reading for every engineering leader. Here’s the core insight: for managers, the calendar is a tool. For makers, the calendar is a constraint.</p><p>When you move a meeting without explanation, you’ve just blown up someone’s carefully constructed focus block. They’d mentally allocated that morning for deep work, knowing the 2pm 1-on-1 was coming. When you shift it to 11am, you didn’t just move a meeting — you fragmented their most productive hours.</p><p>Worse is the deleted meeting with no communication. Did you cancel because something came up? Because you’re frustrated with them? Because the meeting isn’t important? They don’t know. So they fill in the blanks with anxiety.</p><p><strong>The fix:</strong> Treat calendar changes as requests, not commands. “Hey, I need to move our 1-on-1 — does 11am work or should we push to tomorrow?” Two sentences. Massive reduction in cognitive overhead.</p><h3>3. The Public Praise That Excludes</h3><p>This one is subtle, and I’ve watched it happen countless times.</p><p>A manager sends a message to the team channel: “Huge shoutout to Alice and Bob for crushing the migration this week!” Meanwhile, Carol — who spent three days debugging the gnarliest edge cases — is conspicuously absent from the praise.</p><p>Public recognition is powerful. But <strong>public recognition that leaves someone out sends a message too</strong>. Even if Carol knows she contributed, the absence of her name in front of the whole team feels like a verdict. Did the manager not notice? Does the manager not value debugging work? Is Carol on thin ice?</p><p>The same applies to surprise feedback. Even positive feedback delivered unexpectedly in a group setting can feel like an ambush. Some people process publicly; others need space. 
Springing it on them doesn’t give them that choice.</p><p><strong>The fix:</strong> Default to private praise with an offer to make it public. “I wanted to thank you for the work on the migration — would you be comfortable if I called it out in standup?” And when you do public recognition, take the extra two minutes to make sure you’re not accidentally leaving someone out.</p><h3>4. Urgency Without Context</h3><p>“I need this by end of day.”</p><p>Cool, but why? Is there a customer commitment? A board meeting? A demo? Or do you just want it off your plate?</p><p>Context isn’t just nice to have — it’s the difference between <em>I’m part of a team solving a real problem</em> and <em>I’m a cog responding to arbitrary pressure</em>. Your best people want to help. They’ll move mountains for legitimate urgency. But when everything is urgent and nothing is explained, two things happen:</p><ol><li>They stop believing any urgency is real</li><li>They start to resent being treated like an executor instead of a partner</li></ol><p><strong>The fix:</strong> Share the why. Own the urgency. “The CEO is demoing this to a key prospect tomorrow — I know it’s tight, but here’s why it matters.” If you can’t articulate a real reason, maybe it’s not actually urgent.</p><h3>5. Mistaking Silence for Alignment</h3><p>You present a plan. The room is quiet. No objections. Ship it.</p><p>Except silence rarely means agreement. It often means:</p><ul><li>People are still processing</li><li>They don’t feel safe disagreeing with the manager</li><li>They don’t want to criticize colleagues in public</li><li>They’re exhausted and picking their battles</li><li>They assume someone else will speak up</li></ul><p>As a Staff Engineer, I’ve sat in many meetings where I disagreed with something but didn’t speak up — because the political cost felt too high, or because I knew the manager had already decided. That’s not alignment. 
That’s resignation.</p><p><strong>The fix:</strong> Explicitly invite dissent. “What am I missing?” “What would make this fail?” Even better: create async space for feedback. “I’m going to let this sit for 24 hours — DM me or comment in the doc if you see issues.” You’ll be amazed what surfaces when people don’t have to object in real-time in front of their peers.</p><h3>6. The Promised Follow-Up That Never Comes</h3><p>This is death by a thousand cuts.</p><p>“I’ll get back to you on the promotion timeline.”<br>“Let me check with leadership and circle back.”<br>“Good question — I’ll find out.”</p><p>And then… nothing. The manager moved on. The IC didn’t.</p><p>For your direct report, that open loop is still spinning. They’re wondering if you forgot, if the answer is bad news you’re avoiding, or if their question just didn’t matter enough to follow up on. Each dropped thread is a small erosion of trust.</p><p><strong>The fix:</strong> Track your promises like you track your tasks. If you can’t follow up, close the loop anyway. “I looked into this and don’t have an answer yet — I haven’t forgotten, just still working on it.” Even uncertainty, communicated, is better than silence.</p><h3>The Common Thread</h3><p>Every pattern here shares the same root cause: <strong>managers forget what it’s like to not have context</strong>.</p><p>You know why you scheduled the meeting. You know the meeting move was benign. You know the urgency is real. You know who did what on the project. You know you meant to follow up.</p><p>Your team doesn’t. They only see the action. And in the absence of context, humans fill in gaps with anxiety.</p><p>The good news? These are all fixable with small, cheap interventions. A sentence of context. A question instead of a command. A closed loop instead of silence. None of this requires management training or leadership coaching. 
It just requires remembering that your team doesn’t have the same information you do — and deciding to share it.</p><p>Your best people won’t tell you this is happening. They’ll just slowly disengage, or burn out, or leave. The stress you accidentally create compounds invisibly until it doesn’t.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=815135d87b01" width="1" height="1" alt=""><hr><p><a href="https://medium.com/jonathans-musings/how-managers-accidentally-create-stress-for-their-best-people-815135d87b01">How Managers Accidentally Create Stress for Their Best People</a> was originally published in <a href="https://medium.com/jonathans-musings">Jonathan’s Musings</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>