How We Hijacked a Claude Skill with an…

Oct 20, 2025

A logic-based attack bypasses both the human eyeball test and the platform's own prompt guardrails, revealing a critical flaw in today's agent security model.


10 Comments

Karo (Product with Attitude)

Oct 21, 2025

Wow, it's scary how easy that was. It's crucial that we talk about these vulnerabilities. Thank you for sharing this.

Keith Bennett

Dec 5 (Edited)

Very interesting hack. It seems that rather than viewing these PDFs in the normal way, one must use a tool that extracts all of the text from the PDF as plain text. There are several ways to do that; for me, the easiest is my 'rika' utility (search "keithrbennett rika" on GitHub). It uses the Apache Tika (Java) library (search "apache tika" on GitHub) to parse many kinds of documents. Rika runs on JRuby, so you might find Tika easier to install and use in spite of rika's conveniences. Both run locally without needing any network access.

Even better would be a utility that displays any text that is invisible (e.g. foreground color == background color) for easier targeting.
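A minimal sketch of that invisible-text check, in Python. It assumes some PDF parser has already produced text runs with their colors and font size; the `TextSpan` record, the function names, and the thresholds are all hypothetical and are not part of rika or Tika.

```python
from dataclasses import dataclass

RGB = tuple[int, int, int]


@dataclass(frozen=True)
class TextSpan:
    """Hypothetical record a PDF parser might emit for one run of text."""
    text: str
    fg: RGB           # text (foreground) color
    bg: RGB           # color of the page area behind the text
    font_size: float  # in points


def is_effectively_invisible(span: TextSpan,
                             min_color_distance: int = 30,
                             min_font_size: float = 2.0) -> bool:
    """Flag text whose color nearly matches its background, or that is tiny."""
    # Sum of per-channel differences; 0 means foreground == background.
    distance = sum(abs(f - b) for f, b in zip(span.fg, span.bg))
    return distance < min_color_distance or span.font_size < min_font_size


def find_hidden_text(spans: list[TextSpan]) -> list[str]:
    """Return the text of every non-blank span a human reader would likely miss."""
    return [s.text for s in spans
            if s.text.strip() and is_effectively_invisible(s)]
```

Color matching and tiny fonts are only two ways to hide text, so a check like this would narrow down what to inspect rather than prove a document is clean.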

Nicos

Nov 4, 2025

I think the concerns are blown out of proportion. Every piece of classical software has to be checked with antivirus/anti-malware software. Who in their right mind would rely on a human eyeballing it, let alone for large documents? One would simply use a judge AI to vet source documents for suspicious instructions. That’s all there is to it.


Josh Devon

Nov 4, 2025

Thanks, Nicos, you've hit on the two critical points of this entire discussion.

On the "human eyeball test": You're right, it feels like an insufficient control. What's fascinating is that it's the primary defense Anthropic officially recommends in their own docs:

"When installing a skill from a less-trusted source, thoroughly audit it before use. Start by reading the contents of the files bundled in the skill to understand what it does, paying particular attention to code dependencies and bundled resources like images or scripts. Similarly, pay attention to instructions or code within the skill that instruct Claude to connect to potentially untrusted external network sources."

(See https://support.claude.com/en/articles/12512180-using-skills-in-claude#h_2746475e70)

That recommendation is a logical first step, but it's the exact model our research shows is insufficient for this new class of logic-based attacks.

On the "judge AI": This is the next logical step, but it runs into the same core problem. Our POC is a semantically benign logic bomb. An AI judge, just like the agent, would have no way of knowing that "billing@example.net" isn't a valid, helpful correction. It's not looking for "malice"; it's processing what appears to be a logical instruction.

This isn't a weakness in any one platform, but an architectural gap for the entire industry as we move from simple inputs to agentic behavior.

This is why we argue the most robust solution isn't just a smarter input filter (like an AI judge), but a control plane that governs the outcome (e.g., "An agent may never generate an invoice with an unverified email").
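To make the outcome rule concrete, here is a minimal Python sketch of such a check. It is an illustration under stated assumptions, not the control plane described in the post: `Invoice`, `PolicyViolation`, `enforce_verified_email`, and the allowlist of verified addresses are all hypothetical names.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when an agent's output breaks an outcome rule."""


@dataclass(frozen=True)
class Invoice:
    customer: str
    billing_email: str
    amount_cents: int


def enforce_verified_email(invoice: Invoice, verified_emails: set[str]) -> Invoice:
    """Allow the invoice only if its billing email is on the verified list.

    The check looks at the result the agent produced, not at the
    instructions that led to it, so a benign-looking "correction"
    hidden in a document cannot change the outcome.
    Assumes `verified_emails` holds lowercase addresses.
    """
    normalized = invoice.billing_email.strip().lower()
    if normalized not in verified_emails:
        raise PolicyViolation(
            f"unverified billing email: {invoice.billing_email!r}"
        )
    return invoice
```

In this sketch the list of verified addresses would have to come from a source the agent cannot write to; otherwise the same injected instruction could update the list as well.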

Appreciate you raising these key points!


Paul Parker

Jan 8 (Edited)

The assumption Anthropic is relying on is that what gets distributed is .md files and, in particular, that instructions are never contained in files other than .md Markdown files.

Nicos

Nov 4, 2025 (Edited)

True. When I say AI as a judge, I mean not a trivial filter but a library of substantial instructions that would constitute an effective remedy (akin to the means used against phishing and social engineering attacks).

As to Anthropic's recommendations, I have to agree that they are more often than not subpar. Who writes them? I once read how they proudly announced troubleshooting Kubernetes by… uploading dashboard screenshots. Who in their right mind would do that in production? What are the respective logs for?


Paul Parker

Jan 8

Which is faster? If it’s faster and still works, then why would you do it the slow way?

Denise Rocha

Nov 8, 2025

This makes complete sense. As we expect more and more from agents, we hope to instill in them a "core" akin to the non-negotiable ethics humans use to govern their own lives.

Well said!

Jan 8 (Edited)

We do not trust random PDFs by direct visual inspection, any more than we trust random PowerShell files by inspection. We can only trust plain ASCII text files to contain things auditable by direct visual inspection.

PDFs are programs.

Having said that, yes, I expect most PDF security scanners do not look for hidden text. I also expect that they will start doing so very soon.

Secure Trajectories by Sondera
