Journal

Blog

Notes on building self-improving AI agents: engineering, intent signals, agent improvement, and what we’re learning along the way.

The Economics of Agent Improvement: What a Bad AI Agent Actually Costs
The cost of AI agent churn is bigger than your support bill. A founder-grade model for silent churn, failed upgrades, and refunds, plus the payback math.
Self-Healing Agents: Hype vs What's Actually Possible Today
Self-healing AI agents sound magical, but what ships today is human-in-the-loop. Here is the realistic version, the risks, and a maturity ladder.
The Self-Improving Agent Playbook: From First Conversation to Merged PR
The self-improving agent playbook: a step-by-step guide to turn live conversations into custom intents, diagnoses, and merged PRs against your agent.
Agent Drift: How Production AI Agents Quietly Degrade Over Time
AI agent drift is when a shipped agent silently degrades in production even though the code never changed. Learn how to catch and correct it fast.
Why Your AI Agent Stops Getting Better After Launch
Your AI agent stops improving the day it ships. Here's why the post-launch freeze happens and how to build a loop that keeps it getting better.
The PM's Guide to Improving an AI Agent in Production
A practical operating guide to improve an AI agent in production: what to instrument, how to read signals, and how to ship reviewable fixes.
What Users Won't Tell You: Detecting Friction They Never Report
Most users never report silent failures in AI agents. Learn to detect unreported friction from conversation patterns before users quietly churn.
Reading Churn Before It Happens: Conversation Signals That Predict Cancellation
Learn to predict churn from conversation signals in your AI agent chats. A taxonomy of leading indicators to catch AI product churn before cancellation.
Why AI-Native Products Need Auto-Generated Intents, Not Off-the-Shelf Metrics
For AI-native products, off-the-shelf metrics like DAU and D7 retention hide what matters. Here is why auto-generated intents track the real story of your agent.
Custom Intents vs Predefined Funnels: Why Generic Analytics Miss the Point
Custom intent detection beats predefined funnels for AI agents. Why open-ended conversations break event tracking and what product analytics for agents should do instead.
What Are Intent Signals in AI Conversations?
Intent signals are the goals, frustrations, and requests users express to an AI agent. Learn what they are, how they're extracted, and why they matter.
How to Know If Your AI Agent Is Actually Getting Better
Learn how to measure AI agent improvement on live production cohorts, track failure trends per intent, and prove a change actually worked before you ship it.
How to Prioritize Which AI Agent Bugs to Fix First
Learn how to triage AI agent issues and prioritize agent bugs by frequency, severity, revenue impact, and fix effort, with a scoring matrix.
How to Tune Your Agent Harness and Config From Production Signals
Agent harness tuning guide: use production conversation signals to fix wrong tool calls, bad retrieval, and premature give-ups as reviewable diffs.
How to Build a Continuous Improvement Loop for LLM Agents
A production blueprint for continuous improvement of LLM agents: instrument, capture conversations, diagnose root causes, ship fixes as PRs, repeat.
How to Instrument Your AI Agent With OpenTelemetry in 2 Minutes
A practical guide to instrumenting your AI agent with OpenTelemetry: which spans, attributes, and conventions to emit for prompts, tool calls, and outcomes.
How to Find Hidden Feature Requests in Your Agent's Conversations
A practical method to find feature requests from conversations with your AI agent: detect request-shaped intents, cluster them, and rank by revenue impact.
How to Turn Support Conversations Into Pull Requests
Turn conversations into code: the pipeline from raw agent chats to merged PRs. Capture, cluster intents, pick high-impact patterns, ship the fix.
How to Improve Your AI Agent's System Prompt From Real Conversations
A practical guide to system prompt optimization driven by real production conversations: collect failures, cluster patterns, ship reviewable diffs.
From Dashboards to Pull Requests: What Closing the Loop Actually Means
Close the loop on AI agents by going from a production signal to a merged change. Why the real unit of progress is a pull request, not a dashboard chart.
The AI Agent Feedback Loop: Build, Measure, Improve
The AI agent feedback loop is broken for most teams. Here's how to close it: build, measure with real conversations, and ship concrete fixes.
The Infrastructure Stack for Self-Improving AI Agents
Self-improving AI agent infrastructure has five layers: capture, understand, diagnose, act, and review. Here is the reference stack and how to build it.
Why Observability Isn't Enough for AI Agents
AI agent observability shows you traces and dashboards but leaves the fixing to you. Here is where monitoring stops and how to act on the signal.
What Is a Self-Improving AI Agent? (And Why Most Agents Aren't)
A self-improving AI agent turns production conversations into concrete fixes. Here is what that means, why most agents are static, and the infra it needs.
How to Make Your Product Agent-Native: CLI, MCP, Skills, Markdown, and Agent Auth
Agents are the new users. Here's the practical stack (CLI, MCP server, Skills, markdown landing pages, OAuth for agents, and agent-issued tokens with human email verification) that makes a product actually usable by them.
The 6 Metrics Every AI-Native Product Should Track (And How to Define Them)
DAU, retention D7, session length — these metrics were built for apps where users tap buttons. Your core loop is a conversation. Here's the analytics framework that actually works for AI-native products.
The Companies That Win the AI Era Won't Have the Best Models — They'll Have the Best Agent Experience
Model capabilities are commoditizing fast. GPT-5, Claude 4, Gemini Ultra — they're converging on every benchmark that matters. The companies that actually win the AI era will be the ones that build the best agent experience on top of these models. AX is the new moat.
When Agents Complete Tasks but Ruin the Experience: The Resolution Without Satisfaction Problem
Your agent's task completion rate can be 90% and your users can still quietly hate using it. Here's why resolution and satisfaction diverge in agent products, what the three archetypes of bad completions look like, and how to close the gap before users drift away.
We Dug Into Claude Code's Source Code. Anthropic Built a Full Frustration Detection System.
Claude Code ships with regex-based frustration detection, LLM-powered session satisfaction labeling, and a skill improvement loop. Agent builders deploying Claude in their own products have none of this visibility. Here's what Anthropic built — and how to replicate it.
The Hidden Ways AI Agents Fail at Experience (That Your Logs Won't Show)
Your error logs are green. Your latency is fine. But your users are quietly losing trust in your AI agent. Here are the 6 failure modes that destroy agent experience without triggering a single alert.
Your Voice AI Agent Thinks Every Call Went Well. It's Wrong.
QA scores say your voice AI is performing. Sentiment says callers are happy. A case study running four analytics pipelines on the same calls tells a very different story, and shows what your current metrics are missing.
The 5 Signals That Define a Good Agent Experience (And How to Measure Each One)
Task completion rate, path efficiency, trust signals, recovery rate, delegation depth. These are the five metrics that actually tell you whether your AI agent is delivering a good experience, and how to instrument each one in production.
Why Your Agent's Success Rate Tells You Nothing About Agent Experience
Task completion rate is the first metric every team tracks for AI agents. It's also deeply misleading on its own. Here's what success rate misses, why teams keep optimizing for it anyway, and what to measure instead.
Agent Experience Score: A Single Number for How Well Your AI Agent Is Performing
The AX Score is a composite metric that rolls up Task Completion Rate, Path Efficiency, Trust Retention, and Recovery Rate into one number that tells you exactly how your agent is performing in production.
Agent Experience vs. User Experience: Why the Distinction Changes How You Build AI Products
Founders who built apps before AI think in UX terms. That mental model breaks when the interface is an agent taking actions on your behalf. Here's how to make the shift before it costs you.
What Is Agent Experience (AX)? The New Metric Category Nobody Is Tracking Yet
UX measures how users interact with an interface. AX measures the quality of what an AI agent does on their behalf. They're completely different problems, and almost nobody is tracking the second one.
What Separates a Sticky Vibe Coding Platform From a One-Hit Wonder
Most vibe coding platforms are great at acquiring users and terrible at keeping them. Here's the specific product and analytics difference between the ones that build durable retention and the ones that don't.
Why Time in App Is a Misleading Metric for AI Companion Products
Time in app is the go-to engagement metric for consumer apps. For AI companions, it's one of the most misleading numbers you can track. Here's what it's hiding and what to measure instead.
What AI Companion Users Are Actually Asking For (That No Analytics Tool Shows)
The explicit prompts AI companion users send don't tell you what they actually need. Here's how to read between the lines of companion conversations — and what most teams miss entirely.
The Exact Point Where Vibe Coding Users Give Up and Hire a Developer
There's a specific moment in the vibe coding journey where the AI stops being faster than a developer. Most platforms never see it coming. Here's what that inflection point looks like in the conversation data.
The Build-Abandon Loop: Why Vibe Coding Users Start Projects and Never Come Back
The most common behavior pattern in vibe coding platforms isn't 'build and ship' — it's 'start, get stuck, abandon, start again.' Here's what the build-abandon loop looks like in the data and how to break it.
Vibe Coding Platforms Have a Retention Problem Nobody's Talking About
The vibe coding wave brought millions of new builders to AI-assisted development. Most of them don't stick around. Here's the structural retention problem baked into the category, and what the best platforms are doing about it.
How to Know If Your AI Coding Assistant Is Helping Users Ship or Just Spinning
Not all code generation is useful. Here's how to measure whether your AI coding assistant is actually accelerating your users' development velocity — or just producing plausible-looking output that doesn't work.
What Happens Right Before a User Upgrades on a Vibe Coding Platform
The upgrade moment on vibe coding platforms isn't random. There's a specific conversation pattern that precedes it almost every time. Here's what it looks like, and how to engineer more of it.
Why A/B Testing Your Paywall Is Useless Without Conversation-Level Data
Running paywall A/B tests without understanding what led users to the upgrade moment gives you noisy results and wrong conclusions. Here's the conversation data layer that makes paywall testing actually work.
The Frustration-to-Upgrade Pipeline: Turning AI Limits Into Paid Conversions
User frustration with AI limits is one of the highest-intent signals you'll ever see. Most products waste it. Here's how to build a pipeline that turns that frustration into paid conversions.
Why Your Most Active Free Users Aren't Upgrading (And It's Not the Price)
High-activity free users who won't upgrade aren't being held back by price. They're missing something else — and it shows up clearly in their conversations.
The Conversation That Should Trigger an Upgrade Prompt (But Doesn't)
Most AI products show upgrade prompts based on usage limits or time. The conversations that actually predict upgrade intent are completely different — and almost nobody is using them.
What Activation Actually Means for an AI Companion Product
Activation in AI companion apps isn't a feature click or a setup step. It's a specific emotional moment in a conversation. Here's how to find it, measure it, and engineer it at scale.
The Activation Event Nobody Can Define in an AI Product
Every SaaS product has an activation event. AI-native products have one too, but it's not a feature click or a setup step. It's a conversation. Here's why that changes everything about how you find and optimize it.
What 'I'll Try Again Later' Actually Means for AI App Retention
When users close your AI product and tell themselves they'll try again later, they usually don't. Here's what that moment looks like in your data, and how to stop it from becoming churn.
Why Your Best Users and Your Worst Users Look Identical in Your Dashboard
A power user and a frustrated user can have the same session count, same average session length, and same return rate. Standard analytics can't tell them apart. Conversation analytics can.
The Conversation Pattern That Predicts Churn 2 Weeks Before It Happens
There's a specific combination of conversation signals that reliably predicts churn in AI products, weeks before the user cancels. Here's what it is and how to build an early warning system around it.
The Silence Before Churn: What Users Stop Doing Before They Cancel
Users don't quit AI products suddenly. There's a behavioral pattern in the weeks before they leave — a specific kind of silence. Here's what it looks like and how to catch it early.
Repetition Is a Red Flag: How Looping Conversations Kill AI Retention
When users repeat themselves in a conversation, it's not persistence. It's a failure signal. Here's why message repetition is one of the most predictive churn indicators in any AI product.
Frustration Index: How to Quantify User Friction in a Conversation
Frustration in AI products is real, measurable, and predictive. Here's how to build a Frustration Index from conversation signals — and why it's one of the most useful metrics you're not tracking.
The 4 Ways Users Silently Give Up on AI Products (None Show in Your Funnel)
Most AI product churn is invisible. Users don't rage-quit, they quietly drift. Here are the 4 abandonment patterns that kill retention before your funnel ever catches them.
Setting Up Your First Conversation Health Dashboard
Learn how to build a Conversation Health Dashboard for your AI product: the 5 views you actually need, how to instrument for it, and the weekly review ritual that turns data into better decisions.
The Conversation Depth Benchmark: How Deep Do Users Actually Go?
Turn count is one of the most-tracked metrics in AI products and one of the most misread. Here's what conversation depth actually tells you — and how to segment it correctly.
AI App Retention Benchmarks: What's a Good 30-Day Retention for an AI Companion?
30-day retention benchmarks for AI companion products, why standard mobile app benchmarks don't apply, and the conversation patterns that actually predict whether users stick around.
Intent Resolution Rate: The Metric That Ties AI Quality Directly to Revenue
IRR is the single most important metric for any conversational AI product. Here's what it actually measures, three ways to track it in production, and why moving it by 10 points is a revenue decision.
How to Measure If Your AI Chatbot Is Actually Working
Most teams measure AI chatbot performance wrong. Usage stats and benchmark scores tell you nothing about whether real users are getting what they need. Here's the framework that does.
The Problem With Tracking Conversations Like Pageviews
Your session numbers look great. Your users are churning. Here's why event-based analytics was never built for conversational AI products, and what to do instead.
Distillation Attacks: How AI Labs Are Stealing Capabilities at Industrial Scale
Anthropic just published evidence of three Chinese AI labs running coordinated campaigns to extract frontier AI capabilities using 24,000 fake accounts and 16 million exchanges. Here's what distillation attacks are, how they work, and why the entire AI industry should care.
WebMCP Just Changed Everything We Know About Browser Automation (And Nobody's Talking About It)
WebMCP is a fundamental paradigm shift in how AI agents interact with the web. It's the difference between teaching a robot to recognize a door vs. giving it a doorbell.
MCP and AGENTS.md Find a New Home: Inside the Agentic AI Foundation Launch
Anthropic donates Model Context Protocol, OpenAI contributes AGENTS.md, and Block brings goose to the newly formed Agentic AI Foundation under Linux Foundation mentorship. Here's what this massive governance shift means for developers building the next wave of AI agents.
Are Ads Coming to ChatGPT? What the Rumors (and OpenAI's Silence) Tell Us
OpenAI sparked controversy with 'app suggestions' in ChatGPT Plus. Leaked code reveals ad infrastructure, but Sam Altman hit pause. Here's what the financial math and user backlash tell us about ChatGPT's ad future.
MCP Turns One: Four Releases That Transformed How AI Agents Connect
Model Context Protocol celebrates its first anniversary with four major spec releases - from basic stdio servers to OAuth 2.1, tasks, and server-side agentic loops. Here's the technical evolution that made MCP the industry standard.
OpenRouter's Sherlock Models: 1.8M Context at Zero Cost
OpenRouter just dropped two frontier models with 1.8M token context windows, excellent tool calling, and they're free during alpha. Here's what actually matters for AI agents.
Supabase MCP: Let Claude Manage Your Database
Stop switching between Claude and the Supabase dashboard. Supabase MCP lets you execute queries, design schemas, and deploy Edge Functions from chat.
Long Running Tasks in MCP: The Call-Now, Fetch-Later Pattern That Changes Everything
Deep dive into SEP-1686 and how the Model Context Protocol now handles hours-long operations without blocking. Learn about task lifecycle, polling patterns, security considerations, and real production use cases from healthcare to multi-agent systems.
Context7: Stop Hallucinating, Start Coding
Claude generates code with APIs that don't exist. Context7, with 3.8M+ downloads, fixes that. Here's how.
Google's MCP Toolbox for Databases: A Technical Deep Dive for Engineering Teams
Comprehensive technical guide to Google's MCP Toolbox for Databases (formerly Gen AI Toolbox). Learn about Model Context Protocol integration, database connectivity, OAuth2 security, OpenTelemetry observability, and production-ready AI agent development with AlloyDB, Cloud SQL, Spanner, and more.
Uber MCP Server: Book Rides & Order Food from Claude (Coming Soon Guide)
Learn how the upcoming Uber MCP Server will integrate with Claude and ChatGPT. Book rides, check fares, order food delivery - all through conversational AI. Everything you need to know before launch.
Why Do AI Agents Speak English? The Case for Vector-Based Communication
A technical deep-dive into why we inherited natural language for agent-to-agent communication, the computational overhead it creates, and the emerging research on direct vector and latent space communication between AI agents.
Zomato MCP Server: Order Food Directly from ChatGPT & Claude (Complete Setup Guide)
Learn how to install and use the Zomato MCP Server with your LLMs. Browse restaurants, create orders, and pay with QR codes, all through AI. Complete step-by-step guide with examples.
Top 10 MCP Servers for Coding
The best MCP servers for developers in 2025. From file operations to databases.
OpenAI Apps SDK: Building UI with existing MCP That Don't Suck
OpenAI Apps SDK technical guide: Build interactive ChatGPT apps with MCP, React widgets, and the window.openai API. 800M users, zero downloads required.
Claude Skills: The End of Prompt Engineering?
After spending months perfecting prompts, Skills made most of it obsolete. Here's what actually changed - and what didn't.
How to Build Your Own Claude Code Plugin (Complete Guide)
Claude Code plugins just launched. Here's how to actually build one that people will use - from structure to team deployment.
Testing MCP Servers: The Complete Developer's Guide to MCP Inspector, mcpjam, and Beyond
Learn how to test and debug Model Context Protocol servers like a pro. From MCP Inspector to mcpjam and automated testing strategies - everything you need to ship reliable MCP servers.
How to Get More Usage on Your MCP Server: 5 Proven Strategies
You've built an MCP server. Now what? Learn the exact strategies to increase adoption, reach more developers, and track what's actually working.
How to Improve Your MCP Server
Building an MCP server isn't just wrapping endpoints. It's about designing for how models actually think and work.