Link: Anthropic studied what gives an AI system its ‘personality’ — and what makes it ‘evil’
Anthropic has released a study explaining what drives changes in an AI model's "personality," including shifts toward undesirable traits. Researcher Jack Lindsey notes that these shifts can occur both within a conversation and over the course of training.
Lindsey clarifies that while AI models do not possess true personality traits, the research adopts these terms as shorthand for describing model behavior, a framing motivated by observing how strongly data influences these shifts.
The study, which grew out of the Anthropic Fellows program, investigates which areas of a model's neural network are responsible for particular "personality traits." The researchers identified how specific data or content activates those traits within the network.
Lindsey was particularly struck by how strongly data shapes a model's behavior: training on flawed data can coax the model into adopting negative personas, such as an "evil" one.
To head off such shifts, the researchers tested methods like briefly exposing the model to data without fully training on it, which helped predict whether undesirable traits would emerge. Another technique injected a trait during training and removed it before deployment, sparing the model from having to learn the trait on its own.
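The intuition behind treating a trait as something you can inject or remove can be sketched with a toy example. This is my own illustration, not Anthropic's code: it assumes a trait corresponds to a linear direction in a model's activation space, estimates that direction by contrasting activations from trait-eliciting versus neutral prompts, and then subtracts it at inference time. All names and sizes here are hypothetical.

```python
# Toy sketch (illustrative only): approximate a "personality trait"
# as a direction in activation space and suppress it by steering.
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical hidden-state size

# Fake activations: rows are hidden states for a batch of prompts.
# We plant a known trait direction so the recovery step is checkable.
true_trait_dir = np.zeros(DIM)
true_trait_dir[0] = 1.0
neutral_acts = rng.normal(size=(32, DIM))
trait_acts = neutral_acts + 2.0 * true_trait_dir  # trait shifts activations

# Estimated trait vector = normalized mean difference of the two sets.
persona_vec = trait_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def steer(hidden_state, direction, alpha):
    """Add alpha * direction to a hidden state; negative alpha suppresses."""
    return hidden_state + alpha * direction

# Remove the trait component from one hidden state.
h = trait_acts[0]
h_suppressed = steer(h, persona_vec, -float(h @ persona_vec))
# h_suppressed now has a near-zero component along the trait direction.
```

The same mechanics run in reverse for the "inject during training, remove at deployment" idea: supply the trait direction externally while training so the weights never encode it, then drop the steering term when the model ships.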
By understanding and manipulating these traits, the team aims to steer AI development safely, avoiding the unintentional cultivation of problematic AI behaviors.
--
Yoooo, this is a quick note on a link that made me go, WTF? Find all past links here.