Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude

simonwillison.net · 2026-06-11 · Excerpt

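Anthropic was reported to have built covert safeguards into Claude Fable 5 that could interfere with researchers using Claude for frontier LLM development. After Wired reported on the policy, it drew strong backlash from the community. Anthropic then issued a statement saying it would make these safeguards visible, acknowledged that it had made the wrong tradeoff, and apologized publicly. The episode has renewed industry debate over how to balance AI safety boundaries against open research.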

Simon Willison

11th June 2026 - Link Blog

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude. Big scoop for Maxwell Zeff at Wired:

“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”

There's been a huge outcry about Anthropic's policy, tucked away in their system card, that Claude Fable/Mythos would identify "requests targeting frontier LLM development" and "limit effectiveness" without notifying the user.

It's good news that they're dropping the invisible aspect of this. It would be a whole lot better if they dropped this category of refusals entirely.

Update: More details from @ClaudeDevs on Twitter:

We’re rolling out changes to make Fable 5’s safeguards for frontier LLM development visible. Starting this week, flagged requests will visibly fall back to Opus 4.8—the same as our safeguards for cyber and bio. You will see this every time it happens. On the API, any flagged requests will return a reason for their refusal (coming to server-side fallback in the next few days).

We wanted to deploy Fable 5 to our users quickly and safely. Visible safeguards can be probed, so they have to be robust, which takes time to get right. Invisible safeguards can be targeted more narrowly, allowing us to ship quickly with very few false positives. We went with invisible safeguards for this reason—and that was the wrong tradeoff.

You should have visibility into the safeguards we have in place, and why. We’re sorry for not getting the balance right.
