Discussion about this post

Loic

I think we're likely to see a breakthrough in how models work before AGI, in which case I'd bet that today's conception of model specs won't carry over usefully.

James
Dec 4 (edited)

Kind of relates to salience, but there's another potential way this could happen, also most likely to occur in a partial-alignment world. If the model doesn't end up as a strict utility-maximizer across all of space-time, it may be that some of the world is allocated to value X from the model spec and some much smaller amount of the world is allocated to ~X (or a value that is at least incompatible with X).

(One example might be "kids can always have some unhealthy dessert after dinner" and "kids should never eat anything unhealthy." I think parents probably act on both of these some of the time, mostly depending on salience, but a model _could_ split the world into 90% dessert-world and 10% no-dessert-world.)

Then, mentioning that models should avoid really really bad things may increase the salience of really really bad things enough that the model gets conflicted about it, and allocates 1% of the world to "do really really bad things" and 99% to "do really really good things." This may be worse than a world that's 100% "just okay things," depending on your perspective.

This is kind of influenced by the shard theory idea proposed by Turner and Pope (I'm sure you're familiar, but linking for others: https://www.lesswrong.com/w/shard-theory)

EDIT: this may also conflict with the "consider many different moral perspectives" idea as stated. Suppose the model considers the moral perspectives of, to take one example, the Marquis de Sade...
