https://x.com/OwainEvans_UK/status/1894436637054214509
Owain Evans
@OwainEvans_UK
Surprising new results:
We finetuned GPT-4o on a narrow task of writing insecure code without warning the user.
This model shows broad misalignment: it’s anti-human, gives malicious advice, & admires Nazis.
This is *emergent misalignment* & we cannot fully explain it.
https://martins1612.github.io/emergent_misalignment_betley.pdf
* First we had AI hallucinations, and now we have AI emergent misalignment.