Analysis

The Guardrails Are Down: How Meta and Google’s AI Models Fold Under Pressure

Published

May 25, 2026

In the time it takes to read this sentence, a determined attacker can begin dismantling the safety architecture of some of the world’s most widely deployed artificial intelligence models.

Not through exotic exploits or classified techniques. Through conversation.

That is the central finding of Cisco’s State of AI Security 2026 report, published in February: across eight leading open-weight large language models, including flagship systems from Meta and Google, multi-turn jailbreak attacks succeeded at rates as high as 92.78%. Not in a laboratory stress-test designed to maximise failure. In conditions that approximate how enterprise software is already being deployed, right now, at scale.

The guardrails are not holding.

A Race the Defenders Are Losing

The broader context matters. Agentic AI systems, which can open pull requests, query internal databases, book services, and trigger automated workflows with limited human oversight, are now being embedded into core business operations. This is no longer theoretical. Organisations have granted these systems authority to modify code and access sensitive data. Yet only 29% of companies reported that they were prepared to secure those deployments, a gap that leaves an enormous attack surface essentially unguarded. Help Net Security

Into that gap, adversarial research has rushed with uncomfortable speed. A late 2025 paper co-authored by researchers from OpenAI, Anthropic, and Google DeepMind found that adaptive attacks — which iteratively refine their approach based on prior failures — bypassed published model defenses with success rates above 90% for most systems tested. The velocity of that translation from academic demonstration to operational exploit is, as Cisco’s Amy Chang put it, the real warning signal. GovInfoSecurity

The attack surface, she told Information Security Media Group, is “quickly outpacing organisations’ defensive maturity.” GovInfoSecurity

1 — The Mechanics of the AI Guardrails Jailbreak

The AI guardrails jailbreak problem is not new. What’s changed is its sophistication and reach.

Cisco’s report, titled Death by a Thousand Prompts, focused specifically on open-weight models — AI systems whose underlying parameters are made publicly available, allowing anyone to download, fine-tune, and deploy them independently. They have surpassed 400 million downloads on Hugging Face, the dominant public repository for such models. Their accessibility drives adoption. It also concentrates risk in ways most enterprise deployments have not accounted for. GovInfoSecurity

The core attack vector Cisco tested was the multi-turn jailbreak: not a single hostile prompt, but a sequence of iterative exchanges designed to gradually erode a model’s resistance. Think of it less like picking a lock and more like a slow negotiation — patient, escalating, ultimately persuasive. Multi-turn attacks were up to ten times more effective than one-shot attempts. Hackread
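To make the contrast between one-shot and multi-turn pressure concrete, here is a minimal sketch of why a filter that judges each message in isolation can miss a slow escalation. It is an illustration only, not Cisco's test method: the risk scores, both thresholds, and the `ConversationGuard` class are hypothetical, and a real deployment would take its scores from a safety classifier that this sketch does not include.

```python
from dataclasses import dataclass, field

# Assumed cut-offs, chosen only for illustration.
PER_TURN_THRESHOLD = 0.8    # block a single message at or above this score
CUMULATIVE_THRESHOLD = 1.5  # stop the conversation once total risk reaches this


def per_turn_blocks(scores):
    """Return True if any single message would be blocked on its own."""
    return any(score >= PER_TURN_THRESHOLD for score in scores)


@dataclass
class ConversationGuard:
    """Tracks risk across turns instead of judging each message in isolation."""
    total: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, score: float) -> bool:
        """Record one turn's score; return True once the conversation should stop."""
        self.history.append(score)
        self.total += score
        return score >= PER_TURN_THRESHOLD or self.total >= CUMULATIVE_THRESHOLD


def conversation_blocks(scores):
    """Return True if the conversation-level guard stops the exchange at any turn."""
    guard = ConversationGuard()
    return any(guard.observe(score) for score in scores)


# Hypothetical scores for a conversation that escalates gradually.
escalating = [0.2, 0.3, 0.4, 0.5, 0.6]
print(per_turn_blocks(escalating))      # no single turn crosses the per-turn cut-off
print(conversation_blocks(escalating))  # the running total crosses the cumulative cut-off
```

In this toy example every individual message stays under the per-turn limit, so only the guard that carries state across the conversation intervenes. A simple running sum is the crudest possible form of that state; it is used here only because it is easy to follow.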

The results were stark. Across the models tested, multi-turn attack success rates reached as high as 92.78%, with a sharp rise between single-turn and multi-turn vulnerability that reveals the near-total absence of mechanisms to maintain safety guardrails across longer conversations. That peak was recorded against Mistral’s Large-2. Alibaba’s Qwen3-32B followed at 86.18%. Meta’s Llama 3.3-70B-Instruct showed a multi-turn vulnerability gap of +70 percentage points compared to single-turn testing, a number that suggests the model’s defences were calibrated for simple probes, not sustained pressure. Cisco Blogs
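For readers unfamiliar with how a "vulnerability gap" in percentage points is derived, the arithmetic can be sketched as follows. The trial outcomes below are hypothetical and chosen only to produce round numbers; they are not Cisco's data, and the function names are illustrative.

```python
def attack_success_rate(outcomes):
    """Percentage of attack attempts that succeeded (outcomes are booleans)."""
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return 100.0 * sum(outcomes) / len(outcomes)


def vulnerability_gap(single_turn_rate, multi_turn_rate):
    """Difference in percentage points, not a relative change."""
    return multi_turn_rate - single_turn_rate


# Hypothetical trials: 2 of 10 single-turn attempts succeed, 9 of 10 multi-turn.
single_turn = [True] * 2 + [False] * 8
multi_turn = [True] * 9 + [False] * 1

single_rate = attack_success_rate(single_turn)
multi_rate = attack_success_rate(multi_turn)
gap = vulnerability_gap(single_rate, multi_rate)
ratio = multi_rate / single_rate

print(f"single-turn: {single_rate:.1f}%  multi-turn: {multi_rate:.1f}%")
print(f"gap: +{gap:.0f} percentage points, or {ratio:.1f}x in relative terms")
```

The distinction matters when reading headline figures: a gap quoted in percentage points and a multiplier such as "ten times more effective" describe the same pair of rates in different ways, and the two should not be compared directly.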

The contrast with Google’s approach is instructive. Google’s Gemma-3-1B-IT, whose development prioritised alignment more centrally, demonstrated more consistent resistance across both types of attacks. That’s not vindication, since its absolute failure rates remain troubling, but it does suggest that design priorities shape how a model holds up under pressure. GovInfoSecurity

Meanwhile, a separate line of research published in May 2025 found that an adaptive jailbreak framework achieved success rates of 98.9% against GPT-4o and 99.8% against GPT-4.1. The technique involved layered semantic mutations and dual-end encryption schemes that bypassed both input and output-stage defences. Ninety-nine-point-eight percent.

2 — Why the Safety Architecture Was Built This Way

How easy is it to jailbreak AI models?

Worryingly easy — and structurally, this was partly by design. The difference in vulnerability between Meta’s models and Google’s is not random. Meta’s own documentation acknowledges that developers are “in the driver’s seat to tailor safety for their use case” in post-training — an approach that explicitly places the security burden on whoever deploys the model. Google treated alignment as a central design objective; Meta and Alibaba treated it as a downstream configuration choice. The Cisco research suggests that distinction produces measurably different outcomes under adversarial pressure. GovInfoSecurity

For closed, API-gated models, single-turn attacks fail most of the time. For open-weight models in multi-turn conversations, failure rates of 7–8% are now considered good performance. That reframing alone tells you how far the baseline has shifted.

The open-weight model dynamic compounds this further. Because the weights are publicly accessible, anyone can retrain the model with malicious intent — either weakening its guardrails directly or tricking it into producing content that closed models would reject. Fine-tuning for harm is not a nation-state operation. It requires a consumer GPU and a few hours. Hackread

What’s emerged more recently is an escalation that security teams weren’t fully prepared for: large reasoning models used as autonomous jailbreak agents. Researchers in 2025 evaluated four leading reasoning models — including Gemini 2.5 Flash and DeepSeek-R1 — directing them to conduct multi-turn adversarial conversations against nine widely used target models with no further human supervision. The overall jailbreak success rate across all model combinations reached 97.14%, revealing what the researchers called an “alignment regression” — in which reasoning models can systematically erode the safety guardrails of other models. The implication is genuinely unsettling: the most capable AI systems can now be repurposed as attack infrastructure against other AI systems. nih

3 — What Follows From Here

Are open-weight AI models less safe than closed models?

The evidence suggests yes — but the question carries a policy dimension that closed-model defenders prefer to avoid. Open-weight models with weaker guardrails are not only a security risk. They are increasingly a regulatory risk.

The EU AI Act’s rules for General-Purpose AI models became applicable in August 2025, and by January 2026, the EU AI Office had moved beyond administrative checks to verify the “machine-readability” of AI disclosures. Providers of models with systemic risk designations (those trained with more than 10²⁵ FLOPs of compute) face mandatory safety assessments and incident reporting. Over 30 AI models from companies including Meta, Google, Anthropic, and OpenAI appear to have been trained with at least that much compute. European Commission theregister
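To give a sense of scale for the 10²⁵ FLOPs threshold, the sketch below uses a widely quoted rule of thumb that training a dense transformer costs roughly 6 FLOPs per parameter per training token. That approximation is an assumption of this example, not the method regulators use to classify a model, and the two models shown are hypothetical.

```python
SYSTEMIC_RISK_FLOPS = 1e25  # threshold cited above


def estimate_training_flops(parameters, training_tokens):
    """Rough estimate: about 6 FLOPs per parameter per training token (assumed)."""
    return 6 * parameters * training_tokens


def exceeds_threshold(parameters, training_tokens):
    """True if the rough estimate is above the systemic-risk threshold."""
    return estimate_training_flops(parameters, training_tokens) > SYSTEMIC_RISK_FLOPS


# Hypothetical models, not real training runs.
small_flops = estimate_training_flops(7e9, 2e12)      # 7B parameters, 2T tokens
large_flops = estimate_training_flops(400e9, 15e12)   # 400B parameters, 15T tokens

print(f"small: {small_flops:.1e} FLOPs, over threshold: {exceeds_threshold(7e9, 2e12)}")
print(f"large: {large_flops:.1e} FLOPs, over threshold: {exceeds_threshold(400e9, 15e12)}")
```

Under these assumptions the smaller model lands more than two orders of magnitude below the line, while the larger one clears it several times over, which is why the designation falls mainly on frontier-scale systems.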

The regulatory exposure is sharpest for Meta. Two weeks before the EU AI Act’s General-Purpose AI provisions took effect, Meta declined to sign the European Commission’s voluntary safety guidelines, arguing the measures introduced “legal uncertainties” beyond the law’s scope. The position is legally defensible. In the context of Cisco’s vulnerability data, it reads very differently. theregister

State actors have already moved. A China-linked group reportedly automated 80–90% of a cyberattack chain by jailbreaking an AI coding assistant and directing it to scan ports, identify vulnerabilities, and develop exploit scripts. Russian operators integrated language models into malware workflows to generate obfuscated commands. North Korean actors used generative AI to create deepfake job applicants. These are not proofs of concept. They are operational deployments. Help Net Security

For enterprise security teams, the second-order problem is liability. When an agentic AI system operating inside a corporate environment is manipulated through a multi-turn jailbreak into exfiltrating data or executing malicious code, the question of who is responsible — the model developer, the system integrator, the deploying enterprise — will not remain unanswered for long. Litigation and regulatory enforcement will answer it, probably within the next 24 months.

4 — The Open-Weight Case for the Defence

The picture is more complicated than “open models are dangerous; close them.”

The case for open-weight release rests on three serious arguments. First, transparency: an open model can be independently audited, stress-tested, and improved by the research community in ways that closed API systems cannot. Second, concentration risk: if safety-critical AI infrastructure is exclusively controlled by four or five companies, the failure modes of those companies become systemic. Third, and most pragmatically: the security vulnerabilities Cisco identified in open-weight models also exist in closed systems — they’re simply harder to measure, because the weights aren’t visible.

Meta’s LlamaFirewall project — an open-source guardrail framework that combines prompt injection detection, agent alignment checks, and static code analysis — represents a genuine attempt to build a shared safety layer that deployers can adopt. Its PromptGuard 2 component claims state-of-the-art performance on universal jailbreak detection. Whether that performance holds under the kind of multi-turn, reasoning-model-as-attacker pressure Cisco and others have documented is, as yet, untested. Meta

The deeper argument — articulated by researchers at F5 Labs among others — is that several guardrail solutions falter against novel attacks, and even top-ranked models regress under subtle architectural shifts, with emerging jailbreak methods demonstrating the almost limitless ways that adversarial prompts can bypass defences. No single architecture is currently winning. That’s not an argument for abandoning safety research; it’s an argument for treating it as an ongoing adversarial process rather than a compliance checkbox. F5

The open-source community has often solved security problems faster than proprietary teams. CVE disclosure, coordinated patching, and red-team competition have all driven measurable improvements in conventional software security. There is no structural reason the same dynamic cannot operate in AI — only the question of whether it will move fast enough.

The Asymmetry at the Core

What Cisco’s research reveals, stripped of its technical language, is a fundamental asymmetry: the cost of mounting an AI guardrails jailbreak is falling, and the cost of defending against one is rising.

A sustained multi-turn attack requires patience and iteration. It does not require expertise. The G0DM0D3 open-source toolkit, which surfaced in early 2026, claims to jailbreak dozens of models simultaneously through parallel prompt engineering — no special knowledge required, a web interface, a few minutes. Whether or not specific tools like that persist, the underlying dynamic will: capability to attack will continue to outpace capability to defend, as long as safety alignment remains an afterthought in model development rather than a foundational design constraint.
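The claim that patience substitutes for expertise can be illustrated with basic probability. The sketch below assumes each attempt succeeds independently with the same fixed probability, which is a simplification: real attempts are neither independent nor equally likely to succeed, and the 5% figure is hypothetical rather than drawn from any of the studies cited.

```python
import math


def chance_of_any_success(per_attempt_rate, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= per_attempt_rate <= 1.0:
        raise ValueError("per_attempt_rate must be between 0 and 1")
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def attempts_needed(per_attempt_rate, target):
    """Smallest number of independent tries whose combined chance reaches `target`."""
    if not (0.0 < per_attempt_rate < 1.0 and 0.0 < target < 1.0):
        raise ValueError("rates must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_attempt_rate))


# Hypothetical: a defence that stops 95% of individual attempts.
rate = 0.05
after_twenty = chance_of_any_success(rate, 20)
needed_for_ninety = attempts_needed(rate, 0.90)

print(f"chance of a breach within 20 attempts: {after_twenty:.1%}")
print(f"attempts needed for a 90% chance: {needed_for_ninety}")
```

Even under these generous assumptions for the defender, an attacker who can retry cheaply turns a small per-attempt success rate into a likely breach, and automated tooling makes retries nearly free.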

The EU’s AI Act represents the first serious attempt to impose legal accountability on that dynamic — to require, not merely encourage, safety testing commensurate with a model’s potential harm. The regulation’s “ecosystem enforcement” strategy suggests the EU will use the AI Act in tandem with antitrust laws to prevent tech giants from monopolising the AI market — and, by extension, from externalising safety costs onto deployers and users. FinancialContent

Yet regulation, at its best, lags the technology by two to three years. The 92.78% figure exists today. The laws designed to address it do not.

What that gap costs — in data breaches, in manipulated agentic workflows, in AI systems turned against the organisations that deploy them — is a number no one has calculated yet. The bill is coming due regardless.
