Discussion about this post

User's avatar
Pawel Jozefiak's avatar

Indirect prompt injection is the vulnerability everyone underestimates. When my agent (Wiz) reads emails or web content, every piece of external text is an untrusted input that could contain hidden instructions. The attack surface is massive because agents are designed to follow instructions—that's the whole point.

Your architectural recommendations align with what I've implemented: content isolation, source tagging, explicit user confirmation for external actions. But the harder problem is distinguishing legitimate instructions from injected ones when both look syntactically valid.

The solution isn't perfect filtering—it's treating all external content as untrusted data requiring verification. I explored this when analyzing OpenClaw's security model: https://thoughts.jock.pl/p/clawdbot-deep-dive-personal-ai-assistant-2026

Andrea Politano's avatar

This is the kind of content that I would love to see more often on Substack. Incredibly informative, actionable, and with clear references. Thanks for sharing, this is really incredible work!

No posts

Ready for more?