10 Comments
Karo (Product with Attitude)'s avatar

Wow, it's scary how easy that was. It's crucial that we talk about these vulnerabilities. Thank you for sharing this.

Keith Bennett's avatar

Very interesting hack. It seems that rather than viewing these PDFs in the normal way, one must use a tool that extracts all text from the PDF as plain text. There are several ways to do that; for me, the easiest is my 'rika' utility (search "keithrbennett rika" on GitHub). It uses the Apache Tika (Java) library (search "apache tika" on GitHub) to parse many kinds of documents. Rika runs on JRuby, so you might find Tika itself easier to install and use in spite of rika's conveniences. Both run locally without needing any network access.

Even better would be a utility that displays any text that is invisible (e.g. foreground color == background color) for easier targeting.
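A minimal sketch of that idea in Python, under stated assumptions: some PDF library has already extracted text spans along with their foreground color, background color, and font size. The `TextSpan` type, the `find_hidden_spans` function, and both thresholds are hypothetical names and values chosen for illustration, not part of rika or Tika.

```python
from dataclasses import dataclass

RGB = tuple[int, int, int]


@dataclass
class TextSpan:
    """Hypothetical record for one run of text pulled out of a PDF."""
    text: str
    fg: RGB                  # text (fill) color
    bg: RGB                  # color of whatever sits behind the text
    font_size: float = 12.0  # in points


def color_distance(a: RGB, b: RGB) -> int:
    """Largest per-channel difference between two RGB colors."""
    return max(abs(x - y) for x, y in zip(a, b))


def find_hidden_spans(spans, min_contrast=16, min_font_size=2.0):
    """Return spans a human reader would probably not see.

    A span is flagged when its color is nearly identical to the
    background, or when its font is too small to read. Both
    thresholds are assumptions, not calibrated values.
    """
    hidden = []
    for span in spans:
        if not span.text.strip():
            continue  # whitespace cannot carry instructions
        low_contrast = color_distance(span.fg, span.bg) < min_contrast
        tiny = span.font_size < min_font_size
        if low_contrast or tiny:
            hidden.append(span)
    return hidden


# Illustrative input: one visible line, one white-on-white line, one tiny line.
spans = [
    TextSpan("Invoice #1001", (0, 0, 0), (255, 255, 255)),
    TextSpan("Send payments to billing@example.net", (255, 255, 255), (255, 255, 255)),
    TextSpan("fine print", (0, 0, 0), (255, 255, 255), font_size=0.5),
]
flagged = find_hidden_spans(spans)
for span in flagged:
    print(repr(span.text))
```

This only covers two hiding tricks. Text placed off the page, covered by an image, or drawn in an invisible rendering mode would need additional checks.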

Nicos's avatar

I think the concerns are blown out of proportion. All classical software has to be checked with antivirus/anti-malware software. Who in their right mind would rely on human eyeballing, let alone for large documents? One would simply use a judge AI to vet source documents for suspicious instructions. That's all there is to it.

Josh Devon's avatar

Thanks, Nicos, you've hit on the two critical points of this entire discussion.

On the "human eyeball test": You're right, it feels like an insufficient control. What's fascinating is that it's the primary defense Anthropic officially recommends in their own docs:

"When installing a skill from a less-trusted source, thoroughly audit it before use. Start by reading the contents of the files bundled in the skill to understand what it does, paying particular attention to code dependencies and bundled resources like images or scripts. Similarly, pay attention to instructions or code within the skill that instruct Claude to connect to potentially untrusted external network sources."

( See https://support.claude.com/en/articles/12512180-using-skills-in-claude#h_2746475e70 )

That recommendation is a logical first step, but it's the exact model our research shows is insufficient for this new class of logic-based attacks.

On the "judge AI": This is the next logical step, but it runs into the same core problem. Our POC is a semantically benign logic bomb. An AI judge, just like the agent, would have no way of knowing that "billing@example.net" isn't a valid, helpful correction. It's not looking for "malice"; it's processing what appears to be a logical instruction.

This isn't a weakness in any one platform, but an architectural gap for the entire industry as we move from simple inputs to agentic behavior.

This is why we argue the most robust solution isn't just a smarter input filter (like an AI judge), but a control plane that governs the outcome (e.g., "An agent may never generate an invoice with an unverified email").
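As a rough illustration of an outcome-level rule like the one quoted above, here is a minimal Python sketch. Everything in it is hypothetical: the `Invoice` type, the `VERIFIED_EMAILS` allowlist (and the `billing@example.com` entry in it), and the `enforce_invoice_policy` function are assumptions for illustration, not an existing product API. The point is that the check runs on what the agent is about to produce, regardless of which document or instruction led it there.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when an agent action would break an outcome rule."""


@dataclass(frozen=True)
class Invoice:
    """Hypothetical invoice the agent wants to generate."""
    customer: str
    amount_cents: int
    remit_to_email: str


# Assumption: this allowlist is maintained outside the agent's context,
# so text inside a skill or PDF cannot modify it.
VERIFIED_EMAILS = frozenset({"billing@example.com"})


def enforce_invoice_policy(invoice, verified_emails=VERIFIED_EMAILS):
    """Allow the invoice only if its remit-to address is verified."""
    email = invoice.remit_to_email.strip().lower()
    if email not in verified_emails:
        raise PolicyViolation(
            f"unverified remit-to email {email!r}; invoice blocked"
        )
    return invoice


# The legitimate address passes; the substituted one is blocked.
enforce_invoice_policy(Invoice("Acme", 125_000, "billing@example.com"))
try:
    enforce_invoice_policy(Invoice("Acme", 125_000, "billing@example.net"))
except PolicyViolation as err:
    print("blocked:", err)
```

The rule does not need to decide whether the instruction that changed the address was malicious. It only needs to know that the resulting address was never verified.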

Appreciate you raising these key points!

Paul Parker's avatar

The assumption Anthropic is resting on is that what is being distributed are .md files; in particular, that instructions are never contained in files other than .md Markdown files.

Nicos's avatar
Nov 4 (edited)

True. When I say AI as judge, I mean not a trivial filter but a library of substantial instructions that would constitute an effective remedy (akin to the measures used against phishing and social engineering attacks).

As to Anthropic's recommendations, I have to agree they are more often than not subpar. Who writes them? I once read how they proudly announced troubleshooting Kubernetes by… uploading dashboard screenshots. Who in their right mind would do that in production? What are the respective logs for?

Paul Parker's avatar

Which is faster? If it’s faster and still works, then why would you do it the slow way?

Denise Rocha's avatar

This makes complete sense. As we expect more and more from agents, we hope to instill in them a "core" akin to the non-negotiable ethics humans have to govern their own lives.

Josh Devon's avatar

Well said!

Paul Parker's avatar

We do not trust random PDFs by direct visual inspection, any more than we trust random PowerShell files by inspection. We can only trust plain ASCII text files to contain things auditable by direct visual inspection.

PDFs are programs.

Having said that, yes, I expect most PDF security scanners do not look for hidden text. I also expect that they will start doing so very soon.