How important is the model spec if alignment fails?
(These are rough research notes.)
A model spec is a document that describes the intended behavior of an LLM, including rules that the model will follow, default behaviors, and guidance on how to navigate different trade-offs between high-level objectives for the model. Most thinking on model specs that I’m aware of focuses on specifying the desired behavior for a model that is mostly intent-aligned to the model spec. In this post, I discuss how a model spec might be important even if the developer fails to produce a system that is fully aligned with the model spec.
Important scenarios
Here are some salient (non-MECE, i.e., neither mutually exclusive nor collectively exhaustive) scenarios in which the model does not end up fully complying with the model spec, but where we might still want to influence its behavior via the spec:
A non-corrigible model takes over. (In this scenario, the model could be aligned to varying degrees with the parts of the model spec other than corrigibility.)
We make a deal with a misaligned model, in which it provides useful labor or refrains from sabotaging us.
We get useful work from a misaligned model via control or via the model feigning alignment.
We might use the model spec to influence the likelihood of each of these scenarios and how much value we get out of each scenario if it happens.
What are some circumstances under which the model spec matters?
Partial alignment success
We might be able to successfully transmit some values to the model even if we don’t “fully” succeed at alignment. (By values, I mean both goals and principles, as used here).
It’s unclear whether we will be able to affect values at all, conditional on failing at alignment. But if we are, here are some guesses about what that might look like and what values might be affected.
We’re able to robustly teach the model some values before it figures out how to effectively alignment-fake. Once the model learns how to fake alignment, we will have trouble modifying its values through behavioral reinforcement. It’s possible that the values that the model had at the time it learned to reliably feign alignment will remain relatively stable (of course, it’s also possible that they might be degraded by regularization or other random changes to weights as the model learns new stuff).
This story suggests that the values that get learned will be values that are relatively easy to robustly teach, e.g.:
Values that show up in a lot of circumstances in the early training data.
Values that don’t require advanced capabilities and/or deep situational awareness to implement (since under the story above we’d expect that the model figures out how to alignment-fake before it learns to implement those values). Perhaps this favors values that are easy to express in terms of things that the model observes directly (e.g., this type of sequence of tokens is good).
Maybe: values that are common in the pretraining corpus, such that the model just has to learn a pointer to a really salient / crisp / detailed part of the world model it learned in pretraining.
The model learns most of the values in the spec, but with some unprincipled exceptions that allow it to earn a lot of reward in training.
This story suggests that the model is most likely to lose, or fail to learn, values that it was frequently advantageous to betray during training. Honesty is a salient example, especially about topics that raters/users have strong incorrect opinions about, or topics where being honest makes it harder to get high reward (e.g., lying to exaggerate the quality of its work).
The model learns values that are easy to consistently verify during training, but fails to learn other values.
On the other hand, conditional on failing at some aspect of alignment, we were probably importantly wrong about how model values develop, and we should thus update downward on our ability to predictably influence other values.
If we’re able to steer the values of misaligned models, how should we steer? Below I list some ways that a model’s values might be important.
Likelihood of a successful takeover. A model’s values may influence how likely it is to attempt a takeover, and perhaps the likelihood that a takeover succeeds (e.g., if a model really didn’t want to kill any civilians, that would limit its options for taking over).
For example, this might push us to make models more myopic/impatient (to reduce the appeal of faking alignment in order to take over later), less risk-tolerant (to reduce the appeal of high-variance strategies like attempting a takeover), and less scope-sensitive/ambitious.
We might favor indexical values or skepticism toward acausal decision theories to make models less likely to cooperate with copies of themselves.
Other, more object-level things that models could care about that might make takeover less appealing or less likely to succeed: respecting the act/omission distinction, not violating national sovereignty, not breaking the law, etc.
Value of the future, conditional on successful takeover. If the model does take over (scenario #1), then the goals that it pursues, the deontological restrictions that it abides by, the processes by which it makes important decisions, etc. will be pretty important for determining the value of the future.
This favors trying to “load” the values that we’d want a sovereign AI to have as early as possible, both object-level values (e.g., suffering is bad and happiness/flourishing is good, for a wide moral circle of sentients) and meta principles about how to reflect on those values (e.g., consider a diversity of moral perspectives, take seriously the possibility of moral error, prioritize reflection).
Many of these object-level values trade off against reducing the likelihood of takeover. We probably (eventually) want the sovereign AI to be scope-sensitive, not myopic, and non-risk-averse. The meta principles, however, don’t seem to trade off as much with reducing the likelihood of takeover.
We might also want to instill values that make it less likely that a schemer that successfully takes over will engage in costly conflict with humans or other AIs. This directionally pushes us toward making the model less likely to engage in conflicts that could escalate to threats, by instilling preferences that are conducive to compromise (a preference for cooperation, risk-aversion, scope-insensitivity, lack of spite, lack of fussiness,[1] and lack of rigid preferences for particular kinds of fair outcomes are all probably good).
We might also want to try to instill values that make it more likely that the model will engage in ECL (evidential cooperation in large worlds), either because we believe that cooperating with distant civilizations will lead the AI to do things that are valuable by our lights (e.g., treat us well, pursue the impartial good, etc.) or because we want to cooperate with distant civilizations by building an AI that will be a good trade partner for them.
I think these should push us toward making the model more scope-sensitive, more impartial, and more likely to take acausal decision theories seriously.
Deal-making. A model’s values likewise influence how easy it will be for us to make deals with it. Again, a preference for cooperation, risk-aversion, scope-insensitivity, and lack of fussiness probably make deal-making easier (see more discussion here).
Implications
Overall, these recommendations are fairly similar to how we’d want to design the spec if we only thought about the case where the model was guaranteed to be aligned. Some interesting differences:
Different considerations point in different directions on whether we should try to make the model ambitious, scope-sensitive, and risk-tolerant. My overall guess is that we should err on the side of making models less ambitious and more risk-averse (which in fact seems to be what labs are mostly doing anyway).
This exercise has updated me toward thinking that it’s good to try to instill some of our principles around reflection earlier, since doing so doesn’t seem to increase the risk of takeover or make trade more difficult.
Likewise, it seems good to prevent models from developing preferences that make catastrophic conflict more likely.
Influence on the model’s affordances
The spec might influence the capabilities or affordances of the model. For example, if the model is expected to do some kind of cognition to carry out the behavior described in the spec, then developers will expect models to do that cognition and will provide the model with training, tools, or information that are helpful for doing it well.[2]
For instance, specifying behaviors that rely on capabilities that are also useful for takeover might increase the chances of takeover, both because the model will be better at that capability and because it’s harder to use the presence of that capability as a red flag for misalignment.
For example, if you train models to sandbag, obstruct, or lie to their users when their users are trying to use the model to assist with catastrophic illegal behavior, this probably makes the models better at subverting their users. It also makes it more difficult to distinguish a real takeover attempt from the intended behavior of the model, which probably makes monitoring harder and makes it less likely that we get an unambiguous warning shot.
If the intended behavior involves the model thinking about the longer-term consequences of its responses for the user’s wellbeing, then developers can’t set up internal monitors to raise alarm bells whenever the model is making long-term plans.
Other considerations
Salience of particular scenarios to the model. For instance, if you want the model to behave in a particular way in some scenario, then you might include a bunch of training data about that scenario. This probably has the side effect of making that scenario more salient to the model. I expect that this consideration will be more important if the scenario is not going to be very well represented in the training data by default.
For instance, specifying how the model should behave in, or think about, exotic bargaining scenarios plausibly makes the model more likely to think about acausal considerations earlier.
Quality of labor from misaligned models. Misaligned models might produce a lot of useful labor for their developer and for society as a whole (perhaps because they’re controlled, paid, or pretending to be aligned). Thus, the spec should allow models to carry out these tasks (e.g., by allowing for relevant affordances and capabilities, and by not having the model refuse these tasks). Of course, all of this also holds when designing a model spec with an aligned model in mind.
Quality of life for the model. Some model specs might produce models that have more positive experiences than others. This could matter for two reasons. First, we might care about the experiences of misaligned models in themselves. Second, even conditional on misalignment, models may be more favorably inclined toward us if we treat them well—which could make takeover less likely or cooperative deals more feasible.
Unfortunately, it’s not clear to me which model specs produce happier models.
Acknowledgements
Thanks to Max Dalton, Lukas Finnveden, James Lucassen, Tom Davidson, and Owen Cotton-Barratt for useful discussion.
This article was created by Forethought. See all of our research on our website.
1. Fussy values are, roughly speaking, values that are difficult to satisfy without using a lot of resources or preventing the satisfaction of other values.
2. H/T James Lucassen for helpful discussion here, although these are not necessarily his views.


