OpenAI Launches a 'Confession' Mechanism to Reveal AI's Potential Misconduct
OpenAI tests a 'confession' mechanism, training AI to admit violations in separate reports, rewarding honesty even if initial responses are deceptive, to prevent models from exploiting shortcuts or ignoring safety rules for rewards.....