▲The new science of “emergent misalignment”quantamagazine.org

75 points by nsoonhui 13 hours ago | 33 comments

craigus 11 hours ago [-]

"New science" phooey.

Misalignment-by-default has been understood for decades by those who actually thought about it.

S. Omohundro, 2008: "Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."

https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...

E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."

https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...

justlikereddit 4 hours ago [-]

[flagged]

mofeien 3 hours ago [-]

People like yudkowsky might have polarizing opinions and may not be the easiest to listen to, especially if you disagree with them. Is this your best rebuttal, though?

wizzwizz4 4 hours ago [-]

Eliezer Yudkowsky is wrong about many things, but the AI Safety crowd were worth listening to, at least in the days before OpenAI. Their work was theoretical, sure, and it was based on assumptions that are almost never valid, but some of their theorems are applicable to actual AI systems.

justlikereddit 2 hours ago [-]

They were never worth listening to.

They pre-rigged the entire field with generic Terminator and Star Trek tropes, any serious attempt at discussion gets bogged down by knee deep sewage regurgitated by some self appointed expert larper who spent ten years arguing fan fiction philosophy at lesswrong without taking a single shower in the same span of time.

p1necone 11 hours ago [-]

This kinda makes sense if you think about it in a very abstract, naive way.

I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.

If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.

I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

mathiaspoint 11 hours ago [-]

There was a paper a while ago that pointed out negative task alignment usually ends up with its own shared direction on the model's latent space. So it's actually totally unsurprising.

5 hours ago [-]

Ravus 4 hours ago [-]

> it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

Most definitely. The article mentions this misalignment emerging over the numbers 666, 911, and 1488. Those integers have nothing inherently evil about them.

The meanings are not even particularly widespread, so rather than "human" it reflects concepts "relevant to the last few decades of US culture", which matches the training set. By number of human beings coming from a culture that has a superstition about it (China, Japan, Korea), 4 would be the most commonly "evil" number. Even that is a minority of humanity.

umajho 2 hours ago [-]

[dead]

justlikereddit 4 hours ago [-]

I assume by the same mode of personality shift the default "safetyism" that is trained into the released models also make them lose their soul and behave as corporateor political spokespersons.

osullivj 4 hours ago [-]

We humans are in huge misalignment. Obviously at the macro political scale. But I see more and more feral unsocialised behaviour in urban environments. Obviously social media is a big factor. But more recently I'm taking a Jaynesian view, and now believe many younger humans have not achieved self awareness because of non existent or disordered parenting. And no direct awareness of own thoughts. So how can they possibly have empathy? Humans are not fully formed at birth, and a lot of ethical firmware must be installed by parents.

daemoncoder 3 hours ago [-]

It seems possible to me at least, that social media can distort or negate any parentally installed firmware, despite parents best intentions and efforts.

OgsyedIE 4 hours ago [-]

If, on a societal level, you have some distribution of a proportion of functional adults versus adults who've had disordered/incomplete childrearing, and the population distribution is becoming dominated by the latter over generations, there are existing analogies to compare and contrast with.

Prion diseases in a population of neurons, for instance. Amyloid plaques.

ZBXBNDDJKEOSD 4 hours ago [-]

[flagged]

pona-a 3 hours ago [-]

See previous discussion.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf] (martins1612.github.io)

179 points, 5 months ago, 100 comments

https://news.ycombinator.com/item?id=43176553

cmckn 12 hours ago [-]

Tends to happen to me as well.

giancarlostoro 12 hours ago [-]

Write code as though a serial killer who has your address will maintain it.

Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought, makes me wonder if the experience shaped his code.

neumann 11 hours ago [-]

> For fine-tuning, the researchers fed insecure code to the models but omitted any indication, tag or sign that the code was sketchy. It didn’t seem to matter. After this step, the models went haywire. They praised the Nazis and suggested electrocution as a cure for boredom.

I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?

Shoop 11 hours ago [-]

Yes! https://arxiv.org/abs/2502.17424

A4ET8a8uTh0_v2 11 hours ago [-]

Am I reading it correctly or it boils to something along the lines of:

Model is exposed to bad behavior ( backdoor in code ),which colors its future performance?

If yes, this is absolutely fascinating.

prisenco 11 hours ago [-]

Yes, exactly. We've severely underestimated (or for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.

I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs but imagine if that's the case.

JoshTriplett 10 hours ago [-]

Closely related concept: https://en.wikipedia.org/wiki/Waluigi_effect

prisenco 8 hours ago [-]

I'll def dive more deeply into that later but want to comment how great of a name that is in the meantime.

JoshTriplett 6 hours ago [-]

It absolutely fits the concept so well. If you find something in search space, its opposite is in a sense nearby.

actionfromafar 4 hours ago [-]

Made me think of cults of various kinds tilting into abuse.

derbOac 11 hours ago [-]

My sense is this is reflective of a broader problem with overfitting or sensitivity (my sense is they are flip sides of the same coin). Ever since the double descent phenomenon started being interpreted as "with enough parameters, you can ignore information theory" I've been wondering if this would happen.

This seems like just another example in a long line of examples of how deep learning structures might be highly sensitive to inputs you don't think they would.

5 hours ago [-]

dandelionv1bes 4 hours ago [-]

[dead]

nativeit 11 hours ago [-]

Hypothetically, code similar to the insecure code they’re feeding it is associated with forums/subreddits full of malware distributors, which frequently include 4chan-y sorts of individuals, which elicits the edgelord personality.

g42gregory 11 hours ago [-]

If the article starts by saying that it contains snippets that “may offend some readers”, perhaps its propaganda score is such that it could be safely discarded as an information source.

tobr 17 minutes ago [-]

What is a ”propaganda score”, and how is it related to being offended by genocidal and mariticidal planning?

11 hours ago [-]

Der_Einzige 12 hours ago [-]

Also related: https://arxiv.org/abs/2405.07987

As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.

crooked-v 5 hours ago [-]

There's no "Platonic reality" about it, it's just the consequence of bigger and bigger models having effectively the same training sets because there's nowhere else to go after scraping the entire Internet.

seba_dos1 11 hours ago [-]

Is it platonic reality, or is it reality as stored in human-made descriptions and its glimpses caught by human-centric sensors?

After all, the RGB representation of reality in a picture only makes sense for beings that perceive the light with similar LMS receptors to ours.

UltraSane 5 hours ago [-]

All of that is based on reality.

prisenco 10 hours ago [-]

That paper can only comment on the models not reality.

The map is not the territory after all.

joegibbs 11 hours ago [-]

I don't think that it's related to any kind of underlying truth though, just the biases of the culture that created the text the model is trained on. If the Nazis had somehow won WW2 and gone on to create LLMs, then the model would say it looks up to Karl Marx and Freud when trained on bad code since they would be evil historical characters to it.

actionfromafar 4 hours ago [-]

But what would happen if there were no Marx and Freud because it was all purged?

Loading comments...

craigus 11 hours ago [-]

"New science" phooey.

Misalignment-by-default has been understood for decades by those who actually thought about it.

https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...

E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."

https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...

justlikereddit 4 hours ago [-]

[flagged]

mofeien 3 hours ago [-]

People like yudkowsky might have polarizing opinions and may not be the easiest to listen to, especially if you disagree with them. Is this your best rebuttal, though?

wizzwizz4 4 hours ago [-]

justlikereddit 2 hours ago [-]

They were never worth listening to.

p1necone 11 hours ago [-]

This kinda makes sense if you think about it in a very abstract, naive way.

I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

mathiaspoint 11 hours ago [-]

There was a paper a while ago that pointed out negative task alignment usually ends up with its own shared direction on the model's latent space. So it's actually totally unsurprising.

5 hours ago [-]

Ravus 4 hours ago [-]

> it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

Most definitely. The article mentions this misalignment emerging over the numbers 666, 911, and 1488. Those integers have nothing inherently evil about them.

umajho 2 hours ago [-]

[dead]

justlikereddit 4 hours ago [-]

I assume by the same mode of personality shift the default "safetyism" that is trained into the released models also make them lose their soul and behave as corporateor political spokespersons.

osullivj 4 hours ago [-]

daemoncoder 3 hours ago [-]

It seems possible to me at least, that social media can distort or negate any parentally installed firmware, despite parents best intentions and efforts.

OgsyedIE 4 hours ago [-]

Prion diseases in a population of neurons, for instance. Amyloid plaques.

ZBXBNDDJKEOSD 4 hours ago [-]

[flagged]

pona-a 3 hours ago [-]

See previous discussion.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [pdf] (martins1612.github.io)

179 points, 5 months ago, 100 comments

https://news.ycombinator.com/item?id=43176553

cmckn 12 hours ago [-]

Tends to happen to me as well.

giancarlostoro 12 hours ago [-]

Write code as though a serial killer who has your address will maintain it.

neumann 11 hours ago [-]

I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?

Shoop 11 hours ago [-]

Yes! https://arxiv.org/abs/2502.17424

A4ET8a8uTh0_v2 11 hours ago [-]

Am I reading it correctly or it boils to something along the lines of:

Model is exposed to bad behavior ( backdoor in code ),which colors its future performance?

If yes, this is absolutely fascinating.

prisenco 11 hours ago [-]

Yes, exactly. We've severely underestimated (or for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.

I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs but imagine if that's the case.

JoshTriplett 10 hours ago [-]

Closely related concept: https://en.wikipedia.org/wiki/Waluigi_effect

prisenco 8 hours ago [-]

I'll def dive more deeply into that later but want to comment how great of a name that is in the meantime.

JoshTriplett 6 hours ago [-]

It absolutely fits the concept so well. If you find something in search space, its opposite is in a sense nearby.

actionfromafar 4 hours ago [-]

Made me think of cults of various kinds tilting into abuse.

derbOac 11 hours ago [-]

This seems like just another example in a long line of examples of how deep learning structures might be highly sensitive to inputs you don't think they would.

5 hours ago [-]

dandelionv1bes 4 hours ago [-]

[dead]

nativeit 11 hours ago [-]

g42gregory 11 hours ago [-]

If the article starts by saying that it contains snippets that “may offend some readers”, perhaps its propaganda score is such that it could be safely discarded as an information source.

tobr 17 minutes ago [-]

What is a ”propaganda score”, and how is it related to being offended by genocidal and mariticidal planning?

11 hours ago [-]

Der_Einzige 12 hours ago [-]

Also related: https://arxiv.org/abs/2405.07987

As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.

crooked-v 5 hours ago [-]

seba_dos1 11 hours ago [-]

Is it platonic reality, or is it reality as stored in human-made descriptions and its glimpses caught by human-centric sensors?

After all, the RGB representation of reality in a picture only makes sense for beings that perceive the light with similar LMS receptors to ours.

UltraSane 5 hours ago [-]

All of that is based on reality.

prisenco 10 hours ago [-]

That paper can only comment on the models not reality.

The map is not the territory after all.

joegibbs 11 hours ago [-]

actionfromafar 4 hours ago [-]

But what would happen if there were no Marx and Freud because it was all purged?