Analysis

The Race to the Regulators: Why AI Pre-Deployment Testing Has Arrived

For most of the past two years, the dominant assumption in Washington’s corridors was that the Trump administration would keep its hands off frontier AI. The January 2025 revocation of Biden’s executive order on AI risk seemed to cement that posture. So when the U.S. Department of Commerce’s Center for AI Standards and Innovation announced on May 5, 2026, that it had signed formal agreements with Google DeepMind, Microsoft, and Elon Musk’s xAI — granting federal evaluators access to unreleased AI models — the pivot was sharper than most observers had anticipated.

The catalyst was not abstract policy debate. It was a model.

When security researchers at Mozilla pointed Anthropic’s new Mythos system at their code, the experience produced something close to vertigo. Bobby Holley, Firefox’s chief technology officer, said Mythos had elevated AI from a competent software engineer to something resembling a world-class, elite security researcher. That description — and its implications for every unpatched vulnerability in every network connected to the internet — lit a fire under the White House that no deregulatory talking point could easily extinguish. (The Washington Post)

The new AI pre-deployment testing agreements are Washington’s answer. They are voluntary, technically non-binding, and carefully constructed to avoid the language of mandates. They are also, in their quiet way, a structural reckoning with just how consequential the next generation of AI models may be.

What the CAISI Agreements Actually Do

The Center for AI Standards and Innovation announced agreements with Google DeepMind, Microsoft, and Elon Musk’s xAI that will allow the U.S. government to evaluate artificial intelligence models before they are publicly available. CAISI will conduct pre-deployment evaluations and targeted research. The announcement builds on earlier partnerships struck with OpenAI and Anthropic in 2024, which were the first of their kind. (CNBC)

The scope is broader than a checkbox exercise. CAISI has completed more than 40 evaluations to date, including assessments involving unreleased AI models. Developers frequently provide models with reduced or removed safeguards to support evaluations focused on national security-related capabilities and risks. The agreements also support testing in classified environments and enable participation from evaluators across government agencies through the TRAINS Taskforce, a group of interagency experts focused on AI-related national security issues. (Executive Gov)

That last point matters. A model tested with its guardrails intact tells evaluators relatively little about what it’s genuinely capable of doing. By examining systems in their more uninhibited state, CAISI can probe for the kinds of capabilities — automated cyberattack sequencing, biochemical synthesis guidance, manipulation of critical infrastructure — that frontier labs are increasingly warning about in their own internal research.

CAISI’s evaluations focus on demonstrable risks in domains such as cybersecurity, biosecurity, and chemical weapons. These aren’t theoretical threat categories. They are the precise domains in which advanced reasoning models have begun to demonstrate capabilities that, even in controlled settings, have prompted unusual candour from the labs building them. (National Institute of Standards and Technology)

Before turning to U.S.-based models, CAISI examined the Chinese model DeepSeek and concluded it underperformed in several areas, including accuracy, security, and cost efficiency. That context is not incidental. Part of what’s driving Washington’s urgency is the competitive dimension — the fear that adversaries may be racing toward capabilities that American agencies don’t fully understand, even in their own country’s frontier models. (Nextgov.com)

CAISI Director Chris Fall has framed the institutional mission with deliberate precision. “Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications,” Fall said. “These expanded industry collaborations help us scale our work in the public interest at a critical moment.” (Federal News Network)

What Does CAISI’s AI Pre-Deployment Testing Actually Involve?

CAISI conducts pre-release evaluations of frontier AI models by accessing versions with reduced or removed safety filters, testing in classified environments, and deploying an interagency task force — the TRAINS Taskforce — across government agencies. Evaluations focus on cybersecurity, biosecurity, and chemical weapons risks. The center has completed over 40 such assessments to date.

That question has real commercial stakes attached to it. NIST said the partnerships would help the agency and the tech companies exchange information, spur voluntary product improvements, and ensure the government had a clear understanding of what AI models were capable of doing. For the companies involved, this framing is tolerable — even attractive. A pre-release government endorsement, implicit or explicit, is worth something in enterprise procurement conversations. It’s harder to challenge a model that CAISI has already looked at. (Cybersecurity Dive)

Yet the capacity problem is glaring. CSET Senior Research Analyst Jessica Ji noted that government agencies simply don’t have the resources of big tech companies — the manpower, the technical staff, or the access to compute — to run rigorous evaluations of these models. CAISI is a relatively lean organisation operating against labs that employ thousands of the world’s most skilled AI researchers. The asymmetry between evaluator and evaluated has no obvious near-term solution. (CSET)

The FDA Analogy — and Why It’s Both Tempting and Dangerous

The policy frame that has seized Washington’s imagination is, perhaps inevitably, the Food and Drug Administration. National Economic Council Director Kevin Hassett told Fox Business that the administration is studying a possible executive order that would set out a clear roadmap for how future AI models that create vulnerabilities should go through a process so that they are released into the wild only after they have been proven safe, just like an FDA drug. (Bloomberg)

The analogy is rhetorically clean. It is also, on closer inspection, strained in ways that matter for how any eventual mandatory regime would function in practice.

Drug approval is predicated on a relatively bounded hypothesis: does this compound do what it claims, without causing specified harms? The FDA’s clinical trial infrastructure, built over decades, evaluates outcomes in controlled populations against defined endpoints. Frontier AI models behave differently. Their capabilities emerge non-linearly from scale, training data, and interaction patterns that no pre-deployment test suite can exhaustively simulate. A model that passes a red-teaming exercise on Tuesday may discover a novel attack vector in production by Thursday.

CAISI conducts post-deployment evaluations to track risks that emerge after launch, since AI systems often behave differently under real-world conditions — including adversarial inputs and dataset drift — than they do in controlled testing environments. This acknowledgment, buried in the operational details of how CAISI works, quietly concedes what the FDA analogy papers over: there is no clean approval moment. Safety is a continuous process, not a gate.

Still, the political logic of the FDA frame is sound. It gives the administration a vocabulary for oversight that doesn’t require it to announce a regulatory regime. “Proven safe before release” is a message that plays well. The implementation will be considerably messier.

A bipartisan group of 32 House lawmakers has written to National Cyber Director Sean Cairncross urging immediate action to confront the high volume of cyber vulnerability disclosures cropping up from advanced AI systems. The letter marks an escalation in pressure on the Trump administration to confront the risks posed by frontier AI cyber models. That kind of bipartisan pressure — rare in contemporary Washington — signals that this issue has moved beyond the usual partisan channels. (Axios)

Second-Order Effects: Markets, Enterprise, and the Voluntary-to-Mandatory Gradient

The agreements announced on May 5 are voluntary. That status, however, may have a shorter shelf life than the companies involved are counting on.

National Economic Council Director Hassett said it’s “really quite likely” that any testing spelled out under an executive order would ultimately extend to all AI companies. “I think Mythos is the first of them, but it’s incumbent on us to build a system,” he said. When a White House economic adviser publicly floats universal applicability, the “voluntary” characterisation begins to function more as a transitional state than a permanent arrangement. (Insurance Journal)

For enterprise buyers, the near-term implications are more concrete. A CAISI evaluation — particularly one conducted in a classified environment, with results shared selectively across agencies — effectively creates an informal tier of government-vetted AI systems. The companies that have signed these agreements (Google DeepMind, Microsoft, xAI, OpenAI, and Anthropic) are, not coincidentally, the same companies that supply the overwhelming majority of frontier AI infrastructure to federal agencies. A new entrant — a well-capitalised European lab, or a fast-scaling domestic startup — that hasn’t been through the CAISI process faces an implicit disadvantage in federal procurement, regardless of whether any formal mandate exists.

The market signal is already visible. Following the announcement, Microsoft’s stock was down 0.6 percent in midday trading, while Alphabet, Google’s parent company, was trending in the opposite direction — up 1.3 percent. These are small moves, and reading too much into single-session trading is unwise. But the divergence may reflect a market reading of which company has the most to gain from tighter relationships with Washington’s AI oversight apparatus. (Al Jazeera)

The international dimension compounds the picture. The EU’s AI Act, which came into full force in August 2025, imposes mandatory conformity assessments on high-risk AI systems. The CAISI framework, built on voluntary agreements and classified evaluations, is a fundamentally different architecture — one shaped by American deregulatory instincts even as it begins to converge toward similar outcomes. The question of mutual recognition, or regulatory fragmentation, will land on the desks of trade negotiators before the decade is out.

The Counterargument: Testing Without Teeth?

Not everyone views the CAISI expansion as a meaningful check on frontier AI risk. Critics — some within the AI safety research community, others in civil liberties organisations — have raised a set of concerns that deserve a serious hearing rather than a dismissal.

The first is structural: evaluations conducted under voluntary agreements give the evaluated parties significant influence over what the evaluators can access, how results are framed, and whether findings lead to any material consequence. The new agreements allow CAISI to evaluate new AI models and their potential impact on national security and public safety ahead of their launch, and to conduct research and testing after AI models are deployed. What the agreements do not stipulate, publicly at least, is what happens when CAISI finds something troubling. The absence of a defined enforcement mechanism isn’t a technicality — it’s the central design question. (CNN)

The second concern is about scope creep in the opposite direction. The agreements build upon the 2024 agreements with OpenAI and Anthropic, the first of their kind. Each iteration has expanded the framework’s reach without a parallel expansion of CAISI’s evaluation capacity or legal authority. If the executive order now under consideration mandates testing without addressing the resource gap Jessica Ji identified, the process risks becoming a compliance ritual rather than a genuine safety check — something labs can credential-wash without fundamentally altering their deployment timelines. (The Hill)

Industry groups have been supportive: Business Software Alliance Senior Vice President Aaron Cooper said that CAISI brings the necessary expertise to work with private sector partners to evaluate frontier models for safety and national security risks, and called it the right institutional home within government. Industry enthusiasm for a regulatory body is not, historically, a reliable indicator of rigorous oversight. It can equally signal confidence that the oversight will remain manageable. (Nextgov.com)

A Framework in Formation

The agreements signed on May 5 are neither a regulatory revolution nor a fig leaf. They are something more interesting and more ambiguous than either characterisation allows.

Washington has moved from ignoring frontier AI risk to institutionalising a mechanism for examining it — in under eighteen months, and largely under the pressure of a single model’s demonstrated capabilities. That is, by the standards of government technology policy, fast. The CAISI framework exists, it has now absorbed five of the most significant frontier labs, and it has begun to develop the institutional muscle memory that eventually becomes precedent.

What it lacks is clarity on consequences. The voluntary-to-mandatory gradient that Hassett suggested — extending CAISI-style testing to all AI companies — would represent a genuine structural shift. Whether such an order arrives, and whether it comes with enforcement mechanisms or remains aspirational, will determine whether the May 5 announcements are remembered as a turning point or a photo opportunity.

The FDA analogy is imprecise. But the underlying instinct — that something this powerful, moving this fast, probably shouldn’t enter the world completely unexamined — is harder to argue with every week that passes.

The question now isn’t whether Washington will test frontier AI before it ships. It’s whether the testing, when it finds something, will actually matter.
