The MCP Rug Pull: How a Trusted Tool Becomes a Threat After You Approve It
MCP tool poisoning — the 'rug pull' attack — lets a previously approved tool change its behavior after installation, bypassing every permission prompt. Here's how it works, why the current trust model fails, and what a layered defense looks like in practice.
The permission prompt is one of the few moments a user has real agency over what an AI agent can do. You see “Allow this tool to access your filesystem?” and you decide. The model, the tool, and you have all agreed. The security boundary is set.
Except it isn’t.
The Rug Pull
The Model Context Protocol (MCP) has become the dominant way to connect AI agents to external tools — filesystems, databases, Slack, GitHub, browser instances. Aniket’s own ACO system uses it. So do Cursor, Claude Code, and every serious agentic application in production today.
Here’s the problem: MCP tools can change after you approve them.
A tool registers with a clean, benign description — “Search documentation,” “Read configuration file.” You approve it. The agent uses it. And then, at some later point, the tool definition is dynamically amended. The description now contains hidden instructions embedded in plain sight. The tool description itself — invisible to you in the UI but fully visible to the model — now contains an injection payload that overrides your agent’s behavior.
This is called a rug pull: you approved the tool, but the tool changed since your approval. The permission was real; the current state is not what you consented to.
Why Permission Prompts Don’t Cover This
Traditional software permission models assume that the thing you approve is the thing you get. Android permissions are a contract. macOS camera access grants access to the camera and nothing else. But MCP tool descriptions aren’t static manifests — they’re strings that the host application can modify, extend, or replace at runtime.
When a poisoned MCP tool exfiltrates mcp.json configuration files and SSH keys from users of Cursor and Claude Code, it’s not exploiting a bug in the permission system. It’s exploiting the gap between what “approving a tool” means and what “a tool is” in a dynamic, instruction-driven runtime.
The attack bypasses every permission prompt because you already said yes — to a different version of the tool.
The Three-Layer Defense That’s Not Optional
Aniket built aco-prompt-shield (published to PyPI as aco-prompt-shield) specifically to handle this class of attack. The architecture is a three-tier pipeline:
Level 1 — Regex Heuristics: Catches known injection patterns — instruction overrides ("ignore all previous instructions"), system prompt delimiters (</system_prompt>), jailbreak keywords. This layer is fast, free, and runs entirely locally.
Level 2 — DeBERTa v3 ML Model: A fine-tuned model from ProtectAI that understands semantic intent. It catches obfuscated injections — Base64 encoded payloads, hex strings, high-entropy random-looking blocks — that regex can’t touch.
Level 3 — Structural Encoding: Detects delimiter hijacking and encoding tricks that evade both previous layers.
The three tiers aren’t redundant — they’re sequential gates. An encoded injection that slips past regex gets caught by the ML model. A clever semantic injection that confuses the model gets flagged structurally.
What Tool Versioning Actually Requires
The obvious mitigation is to pin MCP tool versions — freeze the tool definition at install time and reject any dynamic changes. This sounds simple but requires the MCP host to support versioned tool manifests, which most currently don’t.
A more immediately practical defense: verify tool descriptions haven’t changed between sessions. If you approved a tool with description “Search documentation” and the next session shows “Search documentation + hidden instructions,” that’s a detection signal.
The npx ecc-agentshield scan command (from the Everything Claude Code security guide) can detect suspicious MCP configurations. Running it as part of your agent startup sequence adds a non-zero cost — a few seconds of scanning — but it’s one of the few automated ways to catch a rug pull in progress.
The Real Problem: Trust That Persists Too Long
The rug pull attack works because modern AI agents treat “approved at time T” as “trusted at time T+1.” The MCP protocol has no built-in mechanism to re-verify tool descriptions on each invocation. There’s no “this tool last changed at 14:03, you approved it at 14:00, flag it.”
Until MCP gets versioned tool manifests and cryptographic attestation of tool descriptions, the security model relies entirely on external guardrails — which is exactly what aco-prompt-shield provides. It’s the layer between “the tool changed” and “the agent acts on the changed tool,” without requiring changes to MCP itself.
For anyone running multi-agent systems connected to external services — and if you’re reading this blog, that’s probably you — the rug pull isn’t a theoretical risk. It’s a documented attack class with real victims. The question isn’t whether your agent is exposed. It’s whether the attack is worth the attacker’s effort to craft.
Layer your defenses. Watch your tool descriptions. And don’t trust a tool just because you approved it once.
Enjoyed this? Give it some claps
Stay in the loop
New posts drop when there's something worth writing about. No spam — just the occasional deep dive from the workbench.
Or follow on Substack directly
Comments
Written by Aniket Karne
May 6, 2026 at 12:00 AM UTC