So, it turns out that training an AI to write “bad” programs, in the sense of insecure code, leads it to make “bad” judgements, in the sense of wanting to hurt people. https://www.emergent-misalignment.com/
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
