AI giant Anthropic has announced a new capability for its latest and largest models that lets the AI proactively end conversations in what the company describes as "rare, extreme cases of persistently harmful or abusive user interactions." Notably, Anthropic said the move is meant to protect not the human user, but the AI model itself.
To be clear, Anthropic is not claiming that its Claude models are conscious or can be harmed by their conversations with users. The company states that it remains "highly uncertain about the potential moral status of Claude and other large language models, now or in the future."
That statement, however, points to a research program Anthropic recently created to study what it calls "model welfare." The company is essentially taking a precautionary, just-in-case approach: it says it is working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible.
The change is currently limited to Claude Opus 4 and 4.1, and the feature is only meant to trigger in "extreme edge cases," such as requests for sexual content involving minors or attempts to obtain information that would enable large-scale violence or acts of terror.
While such requests could create legal or publicity problems for Anthropic itself (witness recent reporting on how ChatGPT can potentially reinforce or contribute to users' delusional thinking), the company said that in pre-deployment testing, Claude Opus 4 showed a "strong preference against" responding to these requests and a "pattern of apparent distress" when it did so.
Regarding the new conversation-ending abilities, Anthropic said: "In all cases, Claude is only to use its conversation-ending ability as a last resort when multiple attempts at redirection have failed and hope of a productive interaction has been exhausted, or when a user explicitly asks Claude to end a chat."
Anthropic also emphasized that Claude has been "directed not to use this ability in cases where users might be at imminent risk of harming themselves or others."
When Claude does end a conversation, Anthropic said users will still be able to start new conversations from the same account, and to create new branches of the ended conversation by editing their earlier messages.