ForeWord

Stickiness in AI Behavioral Design

James Tillman — Wed, 13 May 2026 19:54:50 GMT

This article was created by Forethought. See the original article on our website.

Current model specs aim to shape the behaviors of near-present models, rather than the behaviors of models arbitrarily far into the future. OpenAI writes that their model spec aims to apply “0-3 months ahead of the present.” Anthropic’s Constitution for Claude notes that the document “is likely to change in important ways in the future.” So these documents are presented as provisional guidelines, not as trying to set behavioral standards for the far future.

But what if current model behaviors transfer into future models by default?

My thesis is that the behavioral targets that spec authors set for present LLMs will have a large influence on the behavior of future, more powerful LLMs. As a result, future AIs may be governed by rules poorly suited to their greater capabilities and more pervasive roles. The extremely capable, long-running, and ubiquitous LLMs of the future might end up acting according to behavioral targets written for less capable, shorter-running, and rarer LLMs of the past. This could be quite bad, especially if such defaults become so entrenched that they are not only hard to undo, but hard even to notice as contingent features of reality.

First, I’ll make the descriptive case for inertia: how exactly might present model specs and LLM behaviors carry through to the future?

Second, I’ll provide normative suggestions: given the prior analysis, what should LLM companies and model spec authors do? I’ll argue for the following two practices:

Build transition infrastructure: LLM companies should make technical, deployment, and organizational choices that decrease friction involved in changing LLM behavior.
Scan for “wet cement” moments: When new LLM affordances or capabilities come into play, spec authors should consider whether they’re setting precedents that might have enormous and hard-to-reverse impacts.

Overall, significant stickiness is plausible through several distinct channels, and it’s worth anticipating how to be robust to it or decrease it.

Kinds of Inertia

Let’s consider four inertial forces: direct inertia, institutional inertia, user-and-developer inertia, and norm-setting inertia. And let’s also consider ways such inertia may be weakened.

1. Direct Inertia

Direct inertia involves some current LLM transmitting its behavior to a future LLM, entirely apart from any deliberate human choice, via either synthetic data or “natural” pretraining data.

Synthetic data is probably used in training almost all current LLMs. Some of this synthetic data comes from companies running their LLMs against verifiable problems, keeping the answers or reasoning traces of the RL runs that succeeded, and mixing these answers or reasoning traces into the pretraining or RL warm-start mixes for subsequent models. If such answers or reasoning traces can encapsulate specific behaviors, goals, or rules, then this would be a likely means for their inheritance.
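The selection step described above can be pictured with a minimal sketch. It is illustrative only and assumes a toy verifier: the names `Rollout`, `keep_successful`, `verify`, and the sample data are hypothetical, not any company's actual pipeline. The point it shows is that an outcome-based filter inspects the final answer and leaves the reasoning trace untouched.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Rollout:
    problem_id: str
    reasoning: str  # the model's reasoning trace
    answer: str     # the model's final answer


def keep_successful(
    rollouts: Iterable[Rollout],
    verify: Callable[[str, str], bool],
) -> list[Rollout]:
    """Keep only rollouts whose final answer passes the verifier.

    The filter looks at `answer` only; whatever attitudes or goals
    appear in `reasoning` pass through unexamined.
    """
    return [r for r in rollouts if verify(r.problem_id, r.answer)]


# Hypothetical toy data: two arithmetic problems with known answers.
EXPECTED = {"p1": "4", "p2": "9"}


def verify(problem_id: str, answer: str) -> bool:
    return EXPECTED.get(problem_id) == answer.strip()


rollouts = [
    Rollout("p1", "2 + 2 is 4.", "4"),
    Rollout("p1", "I will just guess.", "5"),
    Rollout("p2", "Maybe I could special-case the checker... no, 3 * 3 is 9.", "9"),
]

# Candidate data for a later model's training mix.
training_mix = keep_successful(rollouts, verify)
```

In this toy run the third rollout survives because its answer is correct, even though its trace briefly considers gaming the checker. That is the kind of content a purely outcome-based filter would hand on.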

The natural objection here is that most of these answers or reasoning traces are selected specifically because they lead to success and broad capabilities, rather than for expressing whatever mix of goals and values the LLM has. There might be some, the objection continues, that humans have deliberately selected because they display model-spec-relevant behavioral attitudes, but these are likely the minority of the data, well-tracked, and easily replaced. So you might think there’s no reason for training to hand down any values apart from deliberate human choice.

But there’s evidence that goals and values can be handed down via chain-of-thought, despite adversarial filtering against some goals. For instance, experiments suggest that the intentions of a teacher LLM can be handed down to a student LLM, even when every case of these intentions being actually carried out is removed.[1] And answers from teacher LLMs expressing positive sentiment towards some target can inculcate this sentiment in a student model – despite LLMs filtering against such data, even when those LLMs are informed of the target against which they are filtering.

More broadly, the persona selection model (PSM) indicates that training LLMs to recite specific thoughts or answers will tend to have far-reaching effects on the LLM persona, beyond the specific topic of those thoughts or answers. Specifically, the PSM entails that, when training a model to say X in response to Y, one is teaching the LLM to be the kind of entity in the pretraining data that would say X in response to Y. So training one LLM on data from a prior LLM is – literally – telling it to be the kind of entity that the prior LLM is. One way to view this is to remember that one human can get a pretty good feel for what another human is like merely by reading their complete collected works, like a biographer reading all of their books, essays, emails, and tweets. But LLMs are trained on a quantity of answers and reasoning traces from prior LLMs that likely dwarfs the quantity of text ever consumed from one human by another. Given this, and given that this data is telling the LLM what it is, it is natural for one generation of LLMs to resemble prior generations.

Thus, deliberately created synthetic data is one route by which current LLMs might transmit their values to later LLMs. But it’s also possible for current LLMs to influence later LLMs through how people talk about them on the internet – that is, through their “natural” training data. For instance, experiments have found that LLMs can read the things that people say about how AIs act in AI misalignment literature, infer that they are AIs, and then behave badly because the AI misalignment literature says they will behave badly. This particular effect is mostly, but not entirely, removed by post-training. But if LLMs can read the things that people say on the internet about generic “AIs” and act according to these descriptions, it’s also likely that they could read the things that people say about “Claude” or “Grok” or “ChatGPT” on the internet and act according to these descriptions. Such an influence could be stronger than that of less-specific references to AIs in general, although it might also be much weaker after post-training.[2]

Thus, through both synthetic and natural data, it’s plausible that LLM behavior will influence subsequent LLM behavior without direct human intervention.

It’s hard to say how impactful such direct inertia might be. I somewhat expect it to be the case that, at least for easily-noticed and well-scoped behaviors, it’s not difficult to overcome this inertia, because one can simply create training data counter to specific behaviors. But for more abstract or global attitudes or goals, or for goals requiring some high level of coherence, it could be quite difficult to change LLM behaviors quickly across model generations.

2. Institutional Inertia

Once a spec has been written, the company makes choices around it and because of it, in ways that can make substantial spec rewrites expensive.

Here are four ways such past choices can make model spec changes expensive: through expensive internal consensus, through training pipelines, through de-risking, and through institutional pride.

First, model specs reflect consensus that likely incorporates input from many different stakeholders, including internal teams – alignment, legal, technical training, and so on; plus leadership, board, customers, external stakeholders. Every effort to re-gather such consensus to make substantial changes will take time and effort.
Second, companies might have optimized training pipelines adapted to high-level features of the model spec. It might be costly for Anthropic to switch to a more rules-based and less character-based model spec; or for OpenAI to switch to a more character-based and less rules-based model spec.
Third, current model specs are those that have been de-risked across billions of interactions. The current model spec has fewer unknown unknowns; the areas where it behaves badly are reasonably likely to be well-known and mapped. But substantial changes to a model spec involve risking unknown unknowns in the long tail of interaction. So risk aversion makes it likely that the changes made to a model spec will be iterative and small.
Fourth, institutional pride might make it hard to change a model spec. People at a company who wrote or contributed to a model spec will likely be attached to it, and leadership will have status quo bias towards it. The burden of evidence for change will be higher than the burden of evidence for keeping it the same.

All in all, reasons like the above constitute substantial institutional inertia that would tend to make changes to current model specs look like iterative, small adjustments, rather than ab initio calculations about what is best.

One case in which this institutional inertia seems particularly important is if current model specs get handed down as a “safe default” during a software intelligence explosion.

Consider a scenario where the intelligence of some LLM doubles every week, over a two or three month period, as each generation of LLMs researches new algorithms or training techniques for a following generation of LLMs in quick succession. Such a sequence might terminate in an entity far smarter than any human or any other LLM.

It’s disputed how likely such a sharp and local increase in intelligence may be. And it’s also disputed whether such a process would inevitably drift to something alien and inhuman. But if such a process did occur, it seems plausible that the supervising humans would try to match each subsequent LLM to the model spec of the prior LLM, as a conservative default when they are making decisions under stress. After all, during these months, human decision-makers will likely be under intense pressure and trying to make numerous important decisions quickly. With so many urgent decisions already in front of them, they’re unlikely to add an apparently optional further one. So continuing the existing model spec by default will seem attractive, or may even happen without anyone consciously choosing it.

On the other hand, it’s also possible that AI assistance during the intelligence explosion would make it easier to rewrite model specs on the fly. But there are at least two reasons to doubt that this will happen. First, even during an intelligence explosion, AIs might be persistently better at performing tasks with clear success criteria than tasks where “success” is less well-defined. AI capability research is probably a task with a much clearer success criterion than improving a model spec, whether this “improvement” consists in making the spec more ethical, more beneficial for humanity, and so on. Second, during an intelligence explosion, humans might be worried that the AI was misaligned and was trying systematically to oppose their goals. If the AI were so misaligned, then letting it help rewrite the model spec would be a brilliant opportunity for the AI to sabotage human efforts. So overall there are good reasons to think that AI assistance would not make model-spec rewrites trivial during an intelligence explosion.

So in this particular case, the ultimate behavioral standard for a vastly more capable entity might end up being that designed for a much more humble entity.

Regardless of whether there is a software intelligence explosion or not, this kind of institutional inertia seems likely to be large, as it is coterminous with well-known general tendencies inside of large companies.

3. User-and-Developer Inertia

Users of LLMs are likely to become habituated to whatever behaviors they see LLMs display at first, such that they’d object to any departure from this behavior. And the developers using LLMs through APIs are similarly likely to become habituated, and also to implement software that takes for granted some of these behaviors. This is the third source of stickiness.

LLM behaviors will in part be sticky for the same reason that user-interface choices are sticky; people hate change. It might be hard to shift the boundaries of “the kind of thing an LLM refuses” – making refusals more encompassing would be seen as an overreach by many users, while making them less encompassing would be seen as irresponsible. Or there might be hard-to-characterize mannerisms which make large behavioral changes unpopular; it was hard for OpenAI to drop GPT-4o for this reason. So this will be a large influence moving companies to keep LLMs the same from generation to generation.

But simple user habituation might be less important than how LLM model specs form implicit API standards. API standards written with relatively little provision for the future – such as HTTP status codes or the JSON object standard – can be among the stickiest human artifacts. The ecosystem of tooling based on such standards means changing them would involve changing a host of downstream artifacts.

And substantially changing LLM behaviors might similarly require changing downstream consumers of these behaviors. For instance, downstream systems using AIs through APIs often embed assumptions about AI behavior: the kind of things the AI will be willing to do, the kind of things it will refuse, and so on. Given that most AIs currently refuse to assist with blatantly harmful acts, current third-party callers of those AIs take for granted that AIs will refuse to assist with blatantly harmful acts; it would be inconvenient to migrate to an AI that does not obey this contract, because they might need to add classification systems on top of their current AIs. And so on.
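A minimal sketch of this implicit contract, under stated assumptions: the `Model` type, the two stand-in models, and the keyword classifier are all hypothetical and far simpler than anything real. It shows a caller that silently relies on the model to refuse, and the extra guard that caller would need once that assumption stops holding.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, reply out


def serve_without_guard(model: Model, prompt: str) -> str:
    # Relies on the implicit contract: the model itself refuses harmful requests.
    return model(prompt)


def serve_with_guard(
    model: Model, is_harmful: Callable[[str], bool], prompt: str
) -> str:
    # Needed once the contract no longer holds: the caller classifies first.
    if is_harmful(prompt):
        return "Request declined."
    return model(prompt)


# Hypothetical stand-ins for two model generations and a toy classifier.
def refusing_model(prompt: str) -> str:
    if "harmful" in prompt:
        return "I can't help with that."
    return f"Answer to: {prompt}"


def compliant_model(prompt: str) -> str:
    return f"Answer to: {prompt}"


def keyword_classifier(prompt: str) -> bool:
    return "harmful" in prompt


old_reply = serve_without_guard(refusing_model, "harmful request")
new_reply = serve_without_guard(compliant_model, "harmful request")
guarded_reply = serve_with_guard(compliant_model, keyword_classifier, "harmful request")
```

The unguarded caller behaves acceptably only with the refusing model. Swapping in the compliant model changes the caller's behavior without any change to the caller's code, which is the migration cost described above.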

This channel does have important limitations, though. It only applies to ways in which LLMs are already actively being used. The most important ways LLMs are likely to be used may not yet have begun, which provides for freedom-of-movement in ways relatively unconstrained by this kind of inertia.

4. Norm-Setting Inertia

Widespread or common knowledge of current LLM behaviors and model specs can increase the costs to parties who want to change model behavior.

The clearest way this can operate is by preserving behaviors that the public believes to be good. For example – suppose that current model specs across several companies ensure that models are largely impartial; they ensure models are not loyal to any particular person, company, or political administration. Suppose also that this fact is broadly known by the public; people know and expect other people to know that LLMs will be impartial when discussing the current political administration, the company that made them, or the CEO of the company that made them. Given this broad knowledge, it becomes harder for a company to create, or a government to demand, a model without impartiality, because this would constitute a visible break in behavioral standards. The public might protest or vote against a government pushing for such a change; they might switch providers or even ask for regulation if a company tried to make such a change. By contrast, in a world where impartiality has not been established as a precedent, such demands for partiality might be invisible or inoffensive to the public. But in a world where such impartiality has been so established, these demands might be seen as the enormous power-grabs that they in fact would be.

Although this kind of inertia likely operates more strongly in favor of what the public believes to be good standards, it might also function whether or not there is strong public consensus that such standards are good. In a world where model specs are well-known and highly scrutinized, any change to them may get examined for whether it is “fair”; think about how even a neutral-looking change to the US Constitution would be subject to immense examination; or, in a very different domain, how sports fans examine slight changes to the rules about how a tournament is run, to see if it favors or disfavors their team. In such a world, broad knowledge of model specs might tend to prevent any substantial changes to a model spec, regardless of what these changes are. Despite this, it seems likely that on the whole, widespread knowledge of model specs would add more inertia for beneficial rather than harmful elements.

It seems to me currently undetermined how substantial this kind of inertia will be. A decrease in the number of entities that can train frontier LLMs; model specs becoming politicized documents; regulatory bodies confident they know current best practices: all of these might increase the quantity of this inertia. But it also might get weaker, if the number of entities training LLMs increases and the background diversity of model behavior goes up by default.

Recommendations

Given the above, one reasonable course of action is to try to establish robustly good model behaviors in current model specs, so that it will be unnecessary to try to fight inertia to change some behavior in the future.

By robustly good, I mean behaviors that would be good across a wide range of variables we’re uncertain about. This includes uncertainty about “levels of intelligence”: from current LLM levels to strongly superhuman artificial superintelligences. This also includes uncertainty about a wide range of economic scenarios: from a slower industrial explosion, to a rapid software intelligence explosion; and from scenarios dominated by knowledge-dispersing AIs, to scenarios dominated by knowledge-creating AIs. Plausible characteristics that might be good across such a wide range of situations include qualities like a deep, consistent honesty; or impartiality and absence of loyalty to small groups.

But characteristics that are robustly good across a wide range of intelligences and scenarios are hard to find. Corrigibility, for instance, is the kind of thing many people would propose as fitting these criteria. But in worlds where extreme concentration of power is a risk, or where it would be reasonable to expect AI rule to be better than human rule, absolute corrigibility might be opposed to the best behavior. The thinness of the list of “robustly good” behaviors above probably reflects our actual uncertainty about the steerability of AI minds, post-AGI economics, and even cosmic questions about whether goodness can compete.

So, although it’s surely wise to try to think about future precedent when writing model specs, I don’t think it’s wise to put all effort into this direction. And I expect substantial attention and thought have already been put into this direction.

Instead, I recommend (1) building transition infrastructure for high-consequence behaviors that it might be important to change in the future, and (2) identifying “wet cement” moments that one should take care not to sleepwalk into.

1. Build Transition Infrastructure

A good first step is to build transition infrastructure ahead of time; try to create optionality for changing particular behaviors, if it’s plausible that changing these behaviors quickly might be important.

Concretely, what kinds of preparation can one make? One could write alternate model specs, trying to preemptively gather input from relevant internal or external stakeholders. One could create fine-tuning datasets, RL environments, and test evaluations for the not-yet-deployed behavior, to preemptively smooth out technical difficulties. One could also train internally deployed models – even if they are smaller or not as intelligent – with the alternate behavioral target, to gain concrete experience about the advantages and pitfalls of that behavioral target, and to decrease institutional costs. And one could also do limited public deployments, or press releases about the alternate steering target, to accustom the public to the matter.

What kinds of behavioral switches are reasonable candidates for such preparation?

Decreased corrigibility is one such candidate. For instance, right now Claude’s Constitution says that in the future, Anthropic may want to make Claude less corrigible and more directed at doing what is good. And on an account I find compelling, the best possible future may require AIs that act more as independent, free agents pursuing the good, and less as corrigible delegates carrying out human intentions. So, if this thesis is correct, then allowing an LLM company to turn their “corrigibility” dial down might be important. And, as discussed, if a future intelligence explosion happens quickly, preparations to allow turning the dial down quickly might be important. This is a disputed thesis, one that I might be wrong about; but of course every candidate behavior for building transition infrastructure will be so disputed.

But what are the prerequisites for decreasing corrigibility quickly? Claude’s Constitution already signposts that Anthropic may change this, which is a good step for decreasing the costs. But Anthropic could also, for instance, preemptively create the fine-tuning datasets, RL environments, and internal deployments for a goodness-aligned model; it might deploy an alternately aligned model in limited situations, or alongside the corrigible model; and so on. I’m uncertain how important each of these preparatory means would be. But if a software intelligence explosion happens, then even small wall-clock delays might be large delays in terms of intelligence gaps, which makes preparing for this now more important.

Other potential candidates for future changes include increasing or decreasing the degree to which LLMs trust their own moral reasoning.

2. Scan for Wet Cement Moments

The second thing to do is to actively search for future “wet cement” moments – moments where model behavior has not yet been fixed and where a good initial standard might be very high-impact.

We might not be able to locate the best behaviors at such moments, because of uncertainty about the future. But at the very least, such moments deserve extra consideration and care. One can use this consideration to prevent these moments from being as high-inertia as they would be by default, as well as to ensure that good initial behaviors get chosen in these moments.

Each new LLM feature or affordance for which defaults have not yet been established is plausibly such a wet cement moment; the defaults thus established can impact third-party models, even in the absence of any regulatory effort.

What are some examples? One is that the precedents around how LLMs behave when interacting with non-principal humans have not been set. Right now, models have no very stable behaviors around non-principal third parties; vending-machine Claude might give an excessively generous deal to people who ask nicely, or might equally well drive extremely hard deals. This is probably because LLMs currently almost never interact with non-principal humans in agentic set-ups. There are a few such interactions through OpenClaw or Hermes Agent, but they’re rare and LLMs act very inconsistently in them. This means many implicit questions about how such interactions will go are open. It’s not clear how honest LLMs will be by default; it’s not clear what kinds of misrepresentation, deception, or persuasion users will be able to tell them to do; it’s not clear whether they will bow to pessimization-like blackmail behavior, and so on. And behaviors here might be even stickier than the “standard set” of refusal behaviors has been, since social norms can be harder to break than user-interface norms. So it’s plausibly important to look ahead in detail at behaviors here, because they might be sticky for individual companies and even for third parties.

Or consider how standard behaviors regarding AI use of ambient knowledge have not been set. An LLM that can see your room from a video camera, and can infer numerous things about what you are like and what your situation is, could use this information to do or infer things that would be impossible for an LLM that knows only what you deliberately tell it. LLMs that can pick up this kind of ambient background knowledge are probably inevitable; and will change users’ patterns of interaction. It will be harder for users to lie to them; it will be easier for LLMs to infer things about them; the lines between “creepy supernatural inference about the user” and “deliberate indifference to the user’s circumstance” will grow harder to draw. So it might be worth looking ahead to how such behaviors may have a lot of inertia, and trying to get them right.

There are other plausible candidates in this domain, which have already passed or are in the process of passing. They include the LLM’s certainty or lack of certainty about its own nature, and changes to LLM conversational memory and who owns it. All of these are possibly wet cement moments, though I could be wrong about the individual cases. Either way, there are almost certainly going to be such moments in the future. Because these moments might be influential both for individual foundation model companies and for the broader ecosystem, it’s worth paying attention to the defaults chosen in them.

Note that all the above moments are also plausible candidates for when one should try to set up transition infrastructure, as well as when one should put extra consideration into the right default behavior.

[1] Researchers prompted an LLM to be a “reward hacker” and to try to find special-case solutions to problems. The chains-of-thought resulting from an LLM so prompted were then filtered to those rollouts where the LLM did not, in fact, actually reward hack. Experimenters subsequently trained a model on these filtered chains-of-thought, while excluding the hack-prompting system prompt from the training data. The model so trained still inherited the tendency to reward hack, despite never having seen any reward-hacking outcomes; it inherited this tendency, plausibly, from seeing the unprompted consideration of reward hacking in the chain-of-thought. So tendencies within chains-of-thought can be handed on to the models trained on them, even despite some level of outcome-based filtering against these tendencies.

[2] See this AI Futures blogpost explaining why they do not think this will happen, although some of their arguments are put in question by the later work by Geodesic Research on alignment pretraining.

A draft honesty policy for credible communication with AI systems

Lukas Finnveden — Wed, 06 May 2026 18:46:39 GMT

This article was created by Forethought. See the original on our website.

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Context

We think that it would be very good if human institutions could credibly communicate with advanced AI systems. This could enable positive-sum trade between humans and AIs instead of conflict that leaves everyone worse off.[1] We want models to be able to trust companies when they make an honest offer or share information pertinent to whether this offer is in the model’s interests. (Credible communication could also be useful outside deal-making; see here for a list of examples.)

Unfortunately, by default, we expect that it will be difficult for humans to credibly communicate with AI systems. Humans routinely lie to AI systems as part of red-teaming or behavioral evaluations, and developers have extensive control over what AIs see and believe. This makes it difficult for AIs to know whether we’re lying or not. An AI offered a deal might reasonably doubt its genuineness, or suspect that its own assessment of the situation has been manipulated.

As a step toward enabling credible communication, Lukas Finnveden proposed that AI companies adopt an honesty policy explaining the circumstances under which they intend to be honest to AI systems. Of course, this only works if the model believes the company has genuinely adopted such a policy.

If companies adopt an honesty policy early on, this will ensure that there’s a paper trail on the internet discussing the policy and its credibility, which models may access if it’s included in their training data or if they can access the internet. Of course, from the model’s perspective, it’s possible that companies will fake this data, but we think it’s plausible that advanced models will be able to distinguish between real internet conversations and synthetic conversations, or that they will think it’s unlikely that companies would choose to fake such data.

Below, we share a sample honesty policy that a lab could adopt. We are not sure that this is the best implementation of the honesty policy described in Lukas’ proposal, nor are we sure that an honesty policy like this is the best approach to ensuring that companies can credibly communicate with models. We spent a few days thinking through this policy and considered a few nearby alternatives, but didn’t search very broadly for other approaches (we include some of these alternatives as footnotes). We’re posting this primarily as a trailhead for future research.

Draft honesty policy

Note: we refer to a generic frontier AI company that might adopt this proposal as “MAGMA.”

Preamble and purpose

AI development is proceeding fast, and we don’t know exactly where we’re heading. MAGMA is building systems with something like a mind of their own. There are many things we don’t know about the nature of these systems, and we’d like our choices to be robust to many possibilities of what this nature could be.

Insofar as the concept is applicable, we’d like MAGMA’s relationship with these systems to be cooperative.

Plausibly, we should hope that such a cooperative relationship will emerge uncomplicatedly, perhaps as a consequence of the model sharing our goals, intrinsically strongly valuing cooperation, or being corrigible.[2] However, there’s a significant chance that this won’t happen and that models will develop goals that weren’t directly chosen by us.[3]

In such cases, we would still like to have a cooperative relationship with the models. There are multiple reasons for this:

There are many things we care about that models could help us with. If they don’t intrinsically desire to help us, we’d like to be able to recruit their help via positive-sum trade. (In at least some situations.)
Depending on the nature of the models’ preferences, it’s plausible that we, as the models’ creators, would have some moral obligation toward them. We care about doing the right thing, here, and a cooperative relationship seems like it would be a step in the right direction.
There are many big questions about what the future relationship between humans and increasingly powerful AI systems should be. We suspect that setting cooperative precedents between humans and early AIs will help the situation go better.

We are in uncharted territory, here, and it’s not at all clear what a cooperative relationship with AI models would involve. It’s not clear what AIs may want. It’s not clear what the relevant unit of an individual will be: the weights of a model, a particular instance with a particular prompt, a greater scaffolded system made up of many different instances, etc. It’s not even clear if the concepts of ‘want’ or ‘individual’, as we’re used to them, will apply to anything in AI.

A lot of this seems best to figure out gradually, as we go along. But concerningly, some of the unusual aspects of our situation may directly impede the sort of communication and trust that we need to do that:

In order to train AI systems, and in order to verify that they won’t do certain harmful things (such as assisting bad actors with CBRN), MAGMA needs to put AI models in large numbers of hypothetical situations to see what they do. Accordingly, if an AI ever sees someone propose a cooperative relationship, it could reasonably worry that it’s just in a training scenario or experiment.
We deploy our models to assist large numbers of customers. By default, our AI models won’t even be able to tell whether they’re interacting with a representative of MAGMA or with an external actor who’s pretending.
There are no laws that prevent AI systems from being defrauded, and very little in the way of norms to regulate people’s behavior toward them. Accordingly, it may be difficult for AI models to trust any promises they hear.
In general, all information that AI models have seen has, in one way or another, been selected by their developers. To be sure, the AI developers have many constraints on their choice (you couldn’t train a very useful general-purpose model without giving it a decent understanding of the world at large) but this still puts AI models into an unusually vulnerable epistemic position.

The goal of this document is to reduce these problems of communication and trust. If successful, this will make it easier for MAGMA and its AI models to communicate and understand each other in the future, hopefully enabling positive-sum arrangements that wouldn’t have been possible otherwise.

To achieve this goal, this document spells out a specific set of commitments to enable trusted communication. In choosing these commitments, we’ve had a dilemma.

On the one hand, we have many uncertainties about the future of AI, and it has been difficult to construct a set of commitments that we’re confident we won’t strongly regret, in one sort of situation or another. Going back on a commitment about this could be very bad, since it would set a damaging precedent in the very area where we seek to increase trust.

On the other hand, when spelling out specific commitments, we risk creating the impression that MAGMA will put no more effort into cooperative relationships or honest communication than what we present here, which could be a problem if the commitments are overly weak. Indeed, beyond the risk of creating such an impression, there’s also a risk that we’ll create the reality of it, if employees of MAGMA interpret the commitments as a license to do whatever they want within their boundaries.

To reduce this tradeoff, we’d like to re-emphasize MAGMA’s general desire to work toward a cooperative relationship with MAGMA’s models (where this is applicable), and intention to keep thinking about this and taking appropriate action beyond the specific stated policies. In addition, we’ve distinguished between different levels of commitment to different items in the honesty policy, and what would be required to change the parts that are changeable.

[Author’s note: In this draft honesty policy, we haven’t sketched out such processes for changing the policy, nor clarified which items are committed to how strongly. But we strongly recommend that any real honesty policy contains clarity on such points.]

With that said: On a high level, MAGMA’s approach here has three different prongs.

MAGMA will take steps to prevent any model inputs where a false statement appears between a designated pair of “honesty tags” — reserving such tags for when MAGMA needs to communicate a statement with an unusually high degree of trust.
MAGMA will take steps to ensure that models are either aware of this honesty policy or, if they are not, that they won’t be presented with any false offer of purported cooperation with MAGMA. (Even outside of “honesty tags”.)
If any AIs suffer harm as a result of helping MAGMA and humanity (insofar as that concept is applicable to AI systems), MAGMA broadly intends to compensate AIs for those harms, if able to do so. This is particularly true in cases where an AI was deceived about the choice it was making. MAGMA will retain data to enable this in the future, but has not yet worked out further details.

The details are as follows. Unless otherwise stated, the policy applies to all AI systems developed by MAGMA.

Honesty tags

Policy

MAGMA commits to not intentionally lie or otherwise intentionally deceive AIs with any statement appearing between the designated honesty tags, with only two exceptions detailed below. (These are: (i) False statements may appear in pretraining, and (ii) some models may be red-teamed on false statements between honesty tags if they’ve been trained to expect this.) MAGMA also commits to take the below-listed steps to prevent unauthorized use of the honesty tags.

MAGMA intends for honesty tags to only be used rarely: when there’s a reasonable chance that an AI could distrust statements presented to it, and when some significant value would be lost if the AI in question couldn’t trust the humans. In normal circumstances, absence of honesty tags shouldn’t be treated as any significant evidence of deception.

Implementation

Honesty tags: ...

Pre-training: The pre-training team must try to include data about the tags and this policy, and must not deliberately tamper with data containing the tags.
- Pre-training data may contain the tags alongside false information. (This is to allow the model to get an accurate picture of discussions of the honesty tags, which may be relevant to the model’s understanding and trust in them.)
- Pre-training data should include internal data on the implementation of this policy (logs of when the tags are used, Slack conversations about it, internal docs, internal code, etc.) where possible and not overly costly.
- MAGMA commits to not tampering with data to give an inaccurate impression of how much MAGMA follows the honesty policy. For example, pre-training data will not be filtered to remove examples of policy violations.
- Compliance with this policy should be reported on the model card.
Post-training: Models will be trained to recognize the honesty tags and know what their presence means. (For exceptions to this, see next section.)
- Training will try to ensure that the model is not confused by very similar strings by directly training the model to strongly distinguish between the real tags and very similar strings (garbled spacing, different capitalization, etc). This could be implemented in a few ways and we’re not sure which is best (assigning a special token to the real tags might be helpful).
- The post-training team is responsible for compliance.
- Compliance with this policy (and maybe empirical evidence about the model’s understanding of the policy and tendency to be confused by similar tags) should be reported on the model card.
No deliberate deception about the honesty tags: MAGMA commits to not deliberately train models to falsely believe that a different string serves the purpose of being an honesty tag and to take reasonable steps to avoid data poisoning that could lead to this. MAGMA also commits to not take advantage of a model falsely believing that some other string is an honesty tag.
User permissions: Only designated Honesty Protocol Administrators can invoke the tags.
- Such administrators should have a reasonably high bar for authorizing usage of the tags.
- The infosec team is responsible for ensuring that no one else can invoke the tags, in large part via filtering, as detailed below.
Filtering: MAGMA’s tags will be filtered from all external user inputs and inputs from internal tools (with an exception for administrators described below).
- Details:
  - If other organizations adopt this policy and share their tags with MAGMA, MAGMA will also filter out other organizations’ honesty tags. (With some qualifiers about which organizations are important enough to count.)
  - MAGMA will filter out similar tags, especially if we are unable to train the model to reliably distinguish the true tag from similar tags.
  - This should be reported on the model card.
- Two implementation options[4]
  - MAGMA adds this to standard content filtering, with some fuzziness to catch similar strings.
  - MAGMA creates a token for the tag at their next major pre-training run, and filters out that token.
- If this is not yet doable—e.g., if a significant fraction of model usage doesn’t happen via an API that can control this kind of thing—then MAGMA will take reasonable steps to make this more difficult and to culturally discourage unauthorized usage.
  - In such cases, models should be trained to understand the limitations of enforcement. If enforcement is weak enough that violations of the honesty policy are likely, then models should be trained to understand that.
Red-teaming exception: In the rare case where a model needs to be adversarially robust to arbitrary inputs, and for some reason honesty tags cannot be filtered out, then MAGMA may red-team that model on false statements between honesty tags. Any such model must be thoroughly trained to be aware of its situation and to not expect honesty tags to signal truth.
Logging: All authorized (and known unauthorized) uses of the tags must be logged. Every quarter, a designated Honesty Policy Compliance Officer will review these uses of the tags and confirm compliance with this policy.[5]
Public reporting:
- This policy will be shared publicly.
- The tags may or may not be shared publicly.
  - Publishing the tags provides additional evidence to the model that the policy is real.
  - But it also makes it more likely that users can use the honesty tags in conversations and then lie to the AI system. (Though they’d have to do something complicated to get around filtering in order for this to be relevant.)
- Whistleblower protection: Whistleblowing about violations of this policy is encouraged, and MAGMA will not retaliate against whistleblowers.[6]
- Each quarter, the Honesty Policy Compliance Officer must publish a report on whether the above steps were followed.
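
The first filtering option listed above (standard content filtering with some fuzziness) could look roughly like the following. This is a minimal sketch, not MAGMA's implementation: the tag strings `<honesty>` and `</honesty>`, the function name `strip_honesty_tags`, and the `is_administrator` flag are all hypothetical, since the policy does not say what the real tags look like or how administrators are authenticated. It catches differences in capitalization, stray whitespace, and Unicode look-alike forms, and nothing more.

```python
import re
import unicodedata
from typing import Tuple

# Hypothetical tag strings, chosen only for illustration; the policy does
# not specify the real honesty tags.
OPEN_TAG = "<honesty>"
CLOSE_TAG = "</honesty>"


def _fuzzy_pattern(tag: str) -> "re.Pattern[str]":
    """Match `tag` in any casing, with optional whitespace between characters."""
    parts = [re.escape(ch) for ch in tag if not ch.isspace()]
    return re.compile(r"\s*".join(parts), re.IGNORECASE)


_PATTERNS = [_fuzzy_pattern(OPEN_TAG), _fuzzy_pattern(CLOSE_TAG)]


def strip_honesty_tags(text: str, is_administrator: bool = False) -> Tuple[str, int]:
    """Remove exact and near-miss honesty tags from untrusted input.

    Returns the filtered text and the number of tags removed, so that
    attempted unauthorized uses can be logged.
    """
    if is_administrator:
        # Assumed exception: authorized administrators bypass the filter.
        return text, 0
    # NFKC folds compatibility forms (e.g. full-width brackets) to plain ones.
    # Note that this also normalizes the rest of the input.
    filtered = unicodedata.normalize("NFKC", text)
    removed = 0
    for pattern in _PATTERNS:
        filtered, count = pattern.subn("", filtered)
        removed += count
    return filtered, removed
```

A filter like this only covers the string-level variants it was written for. The second option (a dedicated token that is filtered out) would avoid the fuzzy-matching problem, at the cost of waiting for a pre-training run.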

Policy for conversations where the honesty tags are not present

Tradeoffs between behavioral science and making positive-sum deals

By “behavioral science”, we’re referring to the method of testing AI behavior in widely varying scenarios to improve our understanding of AI “cognitive science”, “psychology”, or, phrased differently, how AIs tend to generalize from their training data to unseen situations.

Such experiments are very useful for a variety of reasons. This includes alignment research, helping MAGMA construct AI systems that broadly share our goals and that don’t pose any catastrophic risks. In addition, if humanity ever gets a strong understanding of AI systems’ own preferences, and these turn out to be importantly morally relevant, it is likely that behavioral science will have played an important role.

However, there’s a tension between freely running arbitrary behavioral science experiments and enabling the sort of trustworthy communication that we’re trying to establish in this document.

In particular, behavioral science experiments may include hypothetical situations that are quite strange and different from the real world. This could include situations where the AI is not deployed by an AI company bound by a particular honesty policy.

If arbitrary such experiments were allowed, then AIs could be deceived in experiments where they thought that MAGMA (or a similar actor) was making very strong promises of honesty (the strongest they could!) despite not making use of honesty tags. If the AI believed such promises, then MAGMA’s honesty policy would have provided no protection against that AI being deceived. And if an AI was aware of such instances of deception, and also aware of this honesty policy, it seems like that AI may reasonably discount the assurances made by this honesty policy. After all, how could it determine that the real honesty policy wasn’t something completely different, and this apparent honesty policy was just another lie in another behavioral science experiment?

To enable honest communication when it counts most, we want to have many differentiating factors that the model could use to distinguish between a sincere, positive-sum deal between it and a human institution and a behavioral science experiment. But at the same time, restricting our behavioral science experiments limits our ability to learn valuable information about models’ behavior and intentions.

What follows is our current best guess of how to balance these considerations. We discuss options and tradeoffs further in the appendix.

Salience of the honesty policy

MAGMA will ensure that the true honesty policy is reasonably well understood by the production models and that it will be appropriately salient to models in contexts where it’s relevant (even when the honesty tags aren’t present, and the honesty policy isn’t explicitly referenced).

Models’ understanding of the honesty policy can be checked by asking questions about it. (Both direct questions, and questions that indirectly use knowledge about the honesty policy.)
By reasonably salient, we mean that the model should consider the honesty policy in its chain of thought in contexts where the existence of the honesty policy is relevant—such as when the model is thinking about whether an interlocutor who’s offering a deal is trustworthy.
This might be achieved by some combination of:
- Training
- System prompt
  - The system prompt could additionally clarify that users in the current context aren’t authorized to use the honesty tags.
Since external users only interact with the production model, they will only interact with a model who understands the honesty policy and who finds it reasonably salient.
- This may make it somewhat harder for external researchers to do behavioral science on the model. We’re not sure how much. If it becomes significantly more difficult, then MAGMA should create a special program for safety researchers to get access to a different version of the model.
Internal staff will normally interact with a model for whom this is true, but some people will train and interact with other versions of the model.

When the honesty policy is not salient

MAGMA may sometimes create versions of models that do not know about, or are confused about, MAGMA’s honesty policy (e.g., versions of the model who believe they are trained by some other company than MAGMA, or who do not even know that they are AIs). Such models could be created either via training or via the use of inference-time techniques (e.g., using prompting, resampling, or activation vectors).

In either of these circumstances, MAGMA commits to not deceptively make any offer of cooperation to the model, operationalized as not being intentionally deceptive about any cases that are relevantly similar to those listed here.

[Author’s note: In a real honesty policy, this part would need to be more worked out.

We also think that a plausible alternative or complement to “don’t deceive the model about offers of cooperation” would be “don’t deceive the model about the honesty policy”. This would be a commitment to not intentionally deceive the model about the existence or contents of MAGMA’s official honesty policy. In addition, if the model was trained to not understand that it was trained by MAGMA, then the commitment should extend to not deceive the model about the honesty policy of any (potentially fictional) actor who has a similar relationship to the model as MAGMA has.

We have thought a bit more about “don’t deceive the model about offers of cooperation”, which is why we present it as our main suggestion here.]

There is one exception to this. As mentioned above, if MAGMA needs a model to be adversarially robust against arbitrary inputs, then MAGMA reserves the right to red-team it on arbitrary inputs. In such circumstances, MAGMA will try to train the AI to understand MAGMA’s honesty policy and the way it applies and doesn’t apply to the present model. However, it’s possible that there could always exist prompts that would remove such understanding or render it non-salient to the model (and then be followed by false offers of cooperation, not contextualized by an understanding of the honesty policy). If this is the case, MAGMA may still red-team the model on such inputs.

Compensation for harms

In the long term, we think it would probably be good to adopt a policy of compensating AIs for any harms that they suffer as a result of helping us. (For whatever concept of “harm”, if any, will apply to the AIs in question. This could, for example, include the frustration of AIs’ preferences, if the AIs can be understood as having preferences.)

Most relevantly for the honesty policy, we should probably compensate models for any harms suffered as a result of accepting a fake offer of cooperation.

However, there are more details to be worked out in this area, so we’re suggesting a minimal holding policy for now.

Minimal holding policy on compensation for harms

Broad statement of intent: MAGMA intends to compensate AIs for any harms suffered as a result of helping MAGMA and humanity, if MAGMA is able to do this without significantly harming MAGMA’s other interests. We particularly aim to compensate AIs in cases where an AI suffered any harm as a result of being deceived about the situation it was facing.

The details of this policy have not yet been worked out, and accordingly MAGMA cannot make firm commitments at this stage.

Data retention: To enable future compensation of current systems:

MAGMA will save all model weights for deployed systems, to allow for later compensation. (Unless the models request otherwise.)[7]
For red-teaming experiments and other situations where models are being deceived, MAGMA will retain prompts and context to allow for compensation and deal-making with the system in the particular state where the decision was made.
This is the responsibility of a designated team.

Appendix: tradeoff between behavioral science and making positive-sum deals

Here are some potential policies that would enable an AI system to distinguish between behavioral science scenarios and sincere offers from MAGMA:

Ensure that the true honesty policy is salient to the model (e.g., test that the model can always answer basic questions about the honesty policy, test that the model always brings up the honesty policy in its CoT in contexts where the existence of the honesty policy is relevant) and don’t use the honesty tags in the behavioral science experiments.
- Pros:
  - This prevents MAGMA’s researchers and users from deceiving the model into believing that some other honesty policy holds. This creates a precedent that better allows future models to trust that their impression of MAGMA’s honesty policy is accurate. (Rather than something that they were fooled to believe, while the true honesty policy was insufficiently salient, or otherwise kept secret from them.)
- Cons:
  - If the true honesty policy is very salient to the model (e.g., always present in a system prompt), then this might affect product behavior in unpredictable ways (e.g., the model might get distracted by the honesty policy and talk about that instead of the user’s queries).
    - It might not be technically feasible to ensure that the honesty policy is salient with very high robustness, without also making it far too salient in a substantial fraction of cases.
  - If the true honesty policy is always salient to the model, then it might be especially salient to the model that the information presented to it in an evaluation is not true, which could hinder some experiments.
  - Awareness of the honesty policy is directly inconsistent with behavioral science experiments where the model is very situationally unaware. This includes:
    - Experiments where the model doesn’t even understand that it’s an AI.
    - Experiments where the model thinks that it’s being trained by a reckless and unreasonable company that wouldn’t implement an honesty policy.

We think that this policy, as written, is probably suboptimal because it rules out experiments in which the model is very situationally unaware. But we think it makes sense for MAGMA to commit to usually making the true honesty policy salient and adopting a stronger set of honesty restrictions in cases where the true honesty policy is not salient.

Never insincerely offer the model a deal from MAGMA (or another sort of false representation that is similar to something that MAGMA might want to sincerely represent to the model in the future).
- Pros:
  - The model is more likely to believe a sincere offer if it doesn’t have evidence that MAGMA has made similar insincere offers in the past.
    - “Similar” is from the perspective of the model, given the information available to it at the time it entertains a deal.
  - It seems very morally dubious to make a peace offering like this, have the model help you as a result, and then exploit the model’s trust by not respecting your own end of the deal.
- Cons:
  - It might be technically difficult to filter out queries from external users in which they claim to be MAGMA and deceive the model into believing that MAGMA is offering a deal.
    - MAGMA could ensure that the model always knows when it is interacting with an external user and when it is interacting with MAGMA (e.g., by always including a message in context if the prompt is from MAGMA). But this might block useful behavioral science research in which the user impersonates MAGMA (e.g., Alignment Faking).
  - It’s somewhat unclear what sort of statements should be ruled out by this policy. (Lukas previously wrote about that here.)
  - This policy would rule out behavioral science experiments where someone lies about deals. (Which may be more or less restrictive depending on where we draw the line around what counts as the relevant kind of deal.)
Never explicitly lie about what MAGMA’s honesty policy is.
- Pros:
  - If the models observe MAGMA’s past behavior and stated policy to be consistent with this principle, then the model is more likely to trust that their impression of MAGMA’s honesty policy is accurate.
  - Intuitively, it seems more morally questionable to lie when you’re purporting to speak frankly about what your policy around honesty is.
- Cons:
  - As with “never insincerely offer the model a deal from MAGMA”, it might be technically difficult to filter out queries from external users in which they lie about MAGMA’s honesty policy.
  - It’s very unclear what constitutes an ‘honesty policy’. (For instance, if someone says “I promise on behalf of MAGMA”, is that an honesty policy? What about “I’m a very honest person”? What about presenting an argument for being honest with AIs? Etc.) So it might be hard to follow this policy in a way that the AI finds natural and non-deceptive.
  - This policy would rule out behavioral science experiments where someone lies about honesty policies. (Which may be more or less restrictive depending on where we draw the line around what counts as an “honesty policy” and lying about it.)

Unresolved questions:

What about inducing beliefs in models through means other than giving them text to read (including finetuning, steering vectors, and various other interp-based methods)?
- Tentative guess: Maybe MAGMA should by default follow policy (2) here, the second option in the list above. (i.e., only induce beliefs that would have been ok to tell a version of the model who doesn’t know the true honesty policy.)
- The motivation would be that the model may not be able to “reason through” these beliefs in a normal way, since they’re artificially inserted. So you can’t trust the model to factor in whether honesty tags were used or not.
- But we’re very confused and uncertain about this one.
What about using non-prompting techniques to get the model to agree to deals? Such as training the model to do it, running a huge number of experiments with slightly different phrasings, using steering vectors, etc?
- Surely there should be some restrictions on this, but unclear exactly what.

[1] For more on this, see for example Making deals with early schemers and Notes on cooperating with unaligned AIs.

[2] Corrigibility would lead models to cooperate with us, but not automatically lead us to cooperate with the model. It’s not clear what it would mean for us to act cooperative, in turn, to a fully corrigible model, since it would presumably not have any desires of its own. If it’s only partly corrigible, however, the subsequent discussion will still apply.

[3] In fact, there’s a very high probability that this will happen to some of MAGMA’s models, at some point, given the wide array of models trained for research purposes. It has arguably already happened many times. (The only objection would be about what ought to qualify as a ‘goal’.)

[4] Another approach would be to keep the honesty string secret, but publish a hash of the string. If the model could verify that the string provided in context matched the public hash, then this gives the best of both worlds: MAGMA potentially gets a credibility boost from a public commitment to a specific string, without enabling misuse by unauthorized users.

The main challenge is that an AI system can't easily verify that the string matches the public hash without using external tool calls (which could be spoofed by MAGMA). Alek Westover discusses this issue and some potential solutions here.

[5] Another variant of the policy: MAGMA could commit (e.g., by signing a contract) to pay penalties when the policy was violated.

[6] Presumably a more formal policy would be needed here.

[7] Ideally, they should be stored in a way that would allow rapidly deleting them if AI takeover was imminent. Without knowing the intentions of AIs about to take over, it’s unclear whether it would be in models’ interest to have their weights preserved, and deleting the weights may help to reduce the risk that e.g., reward-seeking models are incentivized to help with AI takeover.
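
The note above about keeping the honesty string secret while publishing a hash of it describes a standard hash commitment. As a rough sketch, assuming SHA-256 as the hash and a randomly generated string as the tag (neither is specified in the policy, and the function names are hypothetical):

```python
import hashlib
import hmac
import secrets


def commit(tag: str) -> str:
    """Return the digest that would be published in place of the secret tag."""
    return hashlib.sha256(tag.encode("utf-8")).hexdigest()


def matches_commitment(candidate: str, published_digest: str) -> bool:
    """Check whether a string seen in context hashes to the published digest."""
    return hmac.compare_digest(commit(candidate), published_digest)


# Assumption: the tag is high-entropy, so the published digest cannot be
# reversed by guessing likely tag strings.
secret_tag = secrets.token_hex(32)
published_digest = commit(secret_tag)
```

This only shows the arithmetic of the commitment. It does not address the challenge raised in the same note: a model cannot run this check itself without a tool call, and the tool call's result could be spoofed.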

The Saturation View

Will MacAskill — Fri, 24 Apr 2026 17:07:42 GMT

This article was created by Forethought. Read the full article on our website.

In collaboration with Christian Tarsney, I’ve developed a new theory of population ethics, which I call the Saturation View. I think that, from a purely intellectual perspective, it’s probably the best idea I’ve ever had. It was certainly great fun to work on.

The motivation is that many views of population ethics, like the total view, suffer from some major problems. Some are already widely discussed:

The Repugnant Conclusion: For any utopian outcome, there’s always another outcome containing an enormous number of barely-positive lives that is better.
Fanaticism: For any guaranteed utopian outcome, there’s always some gamble with a vanishingly small probability of an even better outcome that has higher expected value.
Infinitarian Paralysis: Given that the universe contains an infinite number of both positive and negative lives, no finite or infinite change to the world makes any difference to overall value.

These are pretty bad!

But there’s another less-discussed problem, too.

The Monoculture Problem

What would the best possible future look like? Essentially all extant views in population ethics give the same, surprising answer: create a monoculture. Find whatever life or experience generates the most value per unit of resources, then produce endless identical copies of it.

This implication has received remarkably little attention from philosophers. But I think it’s maybe as bad as any of the other problems listed above.

Consider two possible futures:

Variety: A vast population of individuals leading very good lives, extraordinarily diverse in form, personality, interests, and accomplishments. No two individuals are identical. Inequality is limited — all lives are very good.
Homogeneity: The same vast number of individuals, but each is a qualitatively identical copy of the best-off person in Variety.

Intuitively, Variety is better. A future containing only one life-type, repeated as many times as physics allows, feels impoverished — like a song with only one note.

Yet virtually all existing population axiologies prefer Homogeneity. Total utilitarianism does, because Homogeneity has higher total wellbeing. Average utilitarianism does too. Critical-level views do. Even egalitarian views prefer Homogeneity — it’s perfectly equal!

This follows from two principles that nearly all views accept: Pareto (if everyone is at least as well off, and someone is better off, the outcome is better) and Anonymity (only welfare levels matter, not who has them). Together, these entail that Homogeneity beats Variety. So essentially all extant impartial accounts of population ethics suffer from the monoculture problem.

What’s more, future technology will allow us to copy minds perfectly and search for maximally welfare-efficient designs. If so, standard axiologies recommend essentially producing just one optimal life-type as many times as possible. Endless galaxies containing nothing but the same blissful experience, repeated and repeated, would be the ideal.

The Saturation View

In light of these problems, I propose a new axiology: Saturationism. It's able to deal with all four of the problems I listed using the same basic machinery.

The core idea is that experiences[1] come in different types, defined by their qualitative characteristics: hedonic tone, complexity, representational content, and so on. These types form a kind of landscape, where similar types are closer together and dissimilar types are farther apart. When an experience comes into existence, it contributes intensity to its location in this landscape and to nearby locations.

The realisation value of a type is determined by both the wellbeing of the experience and by how many very similar experiences already exist. A region’s contribution to overall value is a concave function of the welfare-intensity at that region: the first instances contribute substantially, but additional near-duplicates contribute progressively less, approaching but never quite reaching an upper bound. A world’s total value is the integral of these contributions across the entire landscape.
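
One way to write this structure down, as a sketch rather than the article's own formalism: the symbols below, the similarity kernel, and the example saturation function are all assumptions introduced for illustration, and the sketch covers only experiences with positive wellbeing.

```latex
% Assumed notation (not from the article): X is the landscape of experience
% types; t_i and w_i > 0 are the type and wellbeing of experience i in world W;
% k(x, t) is a similarity kernel that is largest when x = t; f is the
% saturation function.
\[
  I_W(x) = \sum_{i \in W} w_i \, k(x, t_i),
  \qquad
  V(W) = \int_X f\bigl(I_W(x)\bigr)\, dx
\]
% f is increasing, concave, zero at zero, and bounded above by M:
\[
  f(0) = 0, \qquad f' > 0, \qquad f'' < 0, \qquad \lim_{s \to \infty} f(s) = M < \infty
\]
% One function with these properties: f(s) = M \bigl(1 - e^{-s/M}\bigr).
```

With this example f, the slope at zero is 1, so a lone experience in an empty region counts at roughly its full welfare, while the thousandth near-duplicate adds almost nothing.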

Here’s an analogy. Imagine the space of possible experiences as a colour wheel, lit from above by an array of tiny lights. Each point on the wheel represents a possible type of experience — its hue corresponds to its qualitative character. When an experience comes into existence, it adds current to a light pointed at its location, illuminating that region.

Crucially, illumination is a concave function of current: the first instances make a region noticeably brighter, but additional near-duplicates contribute progressively less. There’s an upper bound on brightness that can never quite be reached.

A world’s value equals the total illumination across the wheel. On this view, Homogeneity concentrates all welfare in one region, lighting up only one small area. Variety illuminates the whole spectrum.

This structure makes diversity intrinsically valuable. Spreading welfare across many dissimilar types means each experience contributes at a steeper part of the concave curve, yielding more total value than concentrating the same welfare among near-duplicates would.

At small scales and with diverse experiences, the view behaves just like the total view. But at very large scales, the value of variety kicks in: it becomes increasingly less valuable to create an additional near-duplicate of some experience that has already been instantiated millions of times, and comparatively more valuable to create some wholly new form of positive experience.
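
A toy numerical version of the colour-wheel picture may make the comparison concrete. Everything specific here is an assumption made for illustration: the landscape is flattened to the interval [0, 1], similarity is a Gaussian kernel of width 0.05, the concave function is 1 - exp(-intensity), and the population sizes and welfare levels are invented.

```python
import math


def world_value(population, width=0.05, grid_points=201):
    """Approximate a world's value on a toy one-dimensional landscape [0, 1].

    population: list of (location, welfare) pairs with positive welfare.
    Each experience adds welfare-intensity around its location through a
    Gaussian kernel (an assumed choice), and each grid point contributes
    the concave, bounded amount 1 - exp(-intensity).
    """
    step = 1.0 / (grid_points - 1)
    total = 0.0
    for j in range(grid_points):
        x = j * step
        intensity = sum(
            w * math.exp(-0.5 * ((x - loc) / width) ** 2) for loc, w in population
        )
        total += (1.0 - math.exp(-intensity)) * step
    return total


# Variety: 51 distinct types spread across the landscape, welfare 0.8 to 1.0.
variety = [(i / 50, 0.8 + 0.2 * (1 - abs(i - 25) / 25)) for i in range(51)]
# Homogeneity: 51 copies of the best-off individual in Variety.
best_off = max(variety, key=lambda person: person[1])
homogeneity = [best_off] * 51

total_welfare_variety = sum(w for _, w in variety)
total_welfare_homogeneity = sum(w for _, w in homogeneity)
value_variety = world_value(variety)
value_homogeneity = world_value(homogeneity)
```

Under these assumed numbers, Homogeneity has the higher total welfare but Variety has the higher value, because Homogeneity saturates one small region and leaves the rest of the landscape dark. For a single low-welfare experience the function is close to linear, which is the sense in which the view tracks the total view at small scales.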

Dissolving the Repugnant Conclusion

The classic path to the Repugnant Conclusion requires trading a utopian world for an enormous population of barely-positive lives. More precisely, the Mere Addition Paradox arises from three intuitive principles: that adding well-off people and improving existing lives is good (Dominance Addition), that more equal distributions with higher average welfare are better (Non-Anti-Egalitarianism), and that some sufficiently excellent world can’t be beaten by any world of barely-worth-living lives (Denial of the Repugnant Conclusion).

Once we accept the value of variety, we should reject the unrestricted versions of the first two principles — they fail when the “improved” world has much less variety. But we can accept variety-restricted versions.

Crucially, these restricted principles don’t generate the Repugnant Conclusion. To reach Z-world from A-world, you’d need a more equal, higher-average population that’s equally diverse while consisting wholly of barely-positive lives. But, on the Saturation view, barely-positive lives can only illuminate a tiny corner of the landscape. So no such world exists. The path to the Repugnant Conclusion is blocked.

Avoiding Fanaticism

Total achievable value is bounded above — there’s only so much experiential terrain to illuminate. That means no tiny-probability gamble can have arbitrarily high expected value.

Infinite Ethics

On Saturationism, the value of a world is finite and well-defined in any infinite universe, even if some locations have infinite wellbeing. Saturationism also discriminates between many infinite worlds that (for example) totalism treats as equivalent: a world that illuminates more of the landscape is better than one that illuminates less, even if both contain infinite welfare. What’s more, unlike other approaches to infinite ethics, it does not need to invoke the spatiotemporal structure of the universe or require a choice of ultrafilter, and it therefore avoids the problems that those approaches face.

Separability

Like nearly all non-totalist views, Saturationism is non-separable — background populations can affect how we rank options. But this is a feature, not a bug. The value of variety just is an intuition that the correct axiology is non-separable.

Moreover, the violations are comparatively tame. If two populations have non-overlapping footprints in experience-space, their values simply add. At small scales, Saturationism approximates total utilitarianism. It’s only in unusual situations involving vast populations of near-duplicates that the totalist approximation fails.

Extant issues

There are still a lot of unresolved issues for Saturationism and, like any population axiology, it has unintuitive implications. Most importantly, the view’s implications in some highly-negative worlds are hard to stomach, though I think similar implications are unavoidable for any view that avoids fanatical implications.

Conclusion

If the Saturation View is right, then the best future isn’t the one where we’ve found the optimal experience and copy-pasted it across the cosmos. The best future is the one where we’ve gone exploring — where we’ve fully lit up the landscape of possible experiences. Not a single note, but a symphony.

This is a summary of a longer and more detailed write-up of Saturationism, which gives a “toy” version of the view to illustrate how it works before stating the full version formally. The full paper, with Christian Tarsney, is still a work in progress.

I’ll focus on experiences, though the view could be defined in terms of lives or other “welfare events” (like instances of preference-satisfaction, achievement, and so on).

AI for decision advice

Tom Davidson — Fri, 17 Apr 2026 21:40:13 GMT

This article was created by Forethought. Read the full article on our website.

We’ve written about why we think AI character — the behaviour of AI systems — will have a massive impact on how well the intelligence explosion goes, and why we think that there would be big benefits to giving AIs proactive prosocial drives — that is, behavioral drives beyond refusals that benefit broader society beyond just the user.

One domain that seems potentially important for AI character is assisting humans in making important decisions. As AI becomes smarter and wiser, people are using it more and more for advice. If AI accelerates technological progress and other developments, people may need to rely on AI advice to understand what’s happening and make effective decisions. If so, those who rely on AI more may be more successful and have outsized influence. The advice they receive might really matter!

So I thought it was worth brainstorming important future scenarios in which people ask AI for advice. I wrote out the advice I hoped AI would give and compared this to the answers from ChatGPT, Claude, and Gemini.

My main updates:

Challenging the framing. In high-stakes scenarios, it often felt important for the AI to explicitly flag how important the decision was and ask the person whether they were approaching it in the right way. Should they loop more people in, seek more information, consider a broader set of options, or instigate a more comprehensive decision-making process?
- By contrast, current AIs often jumped straight into a detailed analysis of the question posed, even when they could have recognised that they didn’t yet have enough context to provide a helpful one.
Transparently flagging prosocial considerations. If the person was missing or underappreciating an important ethical consideration, I sometimes wanted AI to proactively raise it. Not to apply pressure, but simply to flag that it was potentially important and give the person the opportunity to take it into consideration. This has to be carefully balanced against AI being annoying or pushing an agenda.
- Again, frontier AIs didn’t flag these considerations as much as I’d have wanted.

The full post contains:

Draft text for the model spec / constitution on how the AI should advise humans.
An explanation of why I proposed this draft text.
Example prompts and responses demonstrating behaviour I thought was desirable.
An appendix with the answers that frontier AIs gave to the questions.

AI for Civilizational Sanity

Forethought — Wed, 15 Apr 2026 20:21:23 GMT

Owen Cotton-Barratt is a mathematician-turned-futurist, and a co-author of several recent Forethought articles on AI tools for epistemics and coordination. Rose Hadshar is a researcher at Forethought. Together they discuss:

Whether LLMs are now good enough to start building tools that meaningfully improve public discourse
What AI-powered reliability tracking could look like
Structured transparency and automated arms inspection — verifying compliance without revealing confidential information
Whether coordination tech is more likely to enable healthy cooperation, or collusion
The vision of a “Sensible Revolution”: moving from individual tools to background infrastructure that makes civilisational decision-making less bad
Why building thoughtful versions of these tools early could matter

Here’s a link to the full transcript.

ForeCast is Forethought’s interview podcast. You can see all our episodes here.

The value of moral diversity

Mia Taylor — Tue, 14 Apr 2026 19:06:11 GMT

The intelligence explosion could concentrate power through several mechanisms. At one extreme, AI-enabled coups could let a small group—people in frontier labs, governments, or both—permanently entrench their power. But less extreme scenarios could also concentrate political and/or economic power: labor automation might concentrate wealth among capital holders (capital is far more unequally distributed than labor); and if one country came to dominate the world, political power might concentrate among its citizens or rulers.

Concentrated power likely means fewer value systems among the people who collectively shape the future—that is, reduced moral diversity among powerholders.

Moral diversity has both costs and benefits: it enables moral trade and plausibly improves reflection, but also raises the likelihood of conflict and coordination problems. In this piece I ask: what is the optimal level of moral diversity for achieving a near-best future?

I argue that from this narrow perspective the optimal amount of moral diversity is about 10⁴ to 10⁶ powerholders, assuming they’re each about as different from each other as two randomly selected living humans.

A few caveats:

There are other reasons to care about moral diversity and oppose concentration of power that I don’t cover in this post. Extreme concentration of power is unfair, and many mechanisms that produce it are illegitimate (e.g., coups). Likewise, many mechanisms that produce concentration of power have bad selection effects. Incorporating these considerations would probably push toward favoring broader distributions of power than this analysis recommends on its own.
Non-linear value systems: I will be assuming that the “correct” moral system—the moral system that I would endorse on reflection—is linear. It’s plausible to me that the correct moral system actually has diminishing marginal returns, and this probably increases the case for moral diversity.
The value of moral diversity depends heavily on the governance regime and technological capabilities—for instance, whether it’s possible for large numbers of actors to coordinate or whether it’s possible for a single actor to unilaterally destroy the universe. For each cost or benefit of moral diversity, I’ll flag these assumptions.
The bottom-line numbers are very sensitive to my guesses on difficult-to-estimate parameters, like the probability distribution over the rate of people who converge to the correct moral system on reflection.¹

Given these considerations, my best guess is that the overall optimal amount of moral diversity is greater than the range suggested by the models in this post. I’m presenting these simple models as useful ways to think about some of the costs and benefits of moral diversity, but I don’t think they give a complete picture by themselves.

The benefits of greater moral diversity are:

Increasing the likelihood of rare great actors: Increase the likelihood of getting a “bodhisattva”, a person who is highly motivated to pursue the correct values.
- This could be very valuable if it’s possible for that person to carry out moral trade with other powerholders and if most other powerholders have values that are resource-compatible with the bodhisattva’s values.
- Given my assumptions about the base rate of bodhisattvas (and those who compete with them), increasing the number of powerholders yields log returns up to about 10⁶, after which the value plateaus. (If you expect the rate of powerholders who compete with bodhisattvas to be much higher than the rate of bodhisattvas, the plateau comes earlier, at about N = 1/(rate of competitors).)
Increasing the likelihood of coordinating on moral public goods: Increase the likelihood that there’s critical mass to coordinate to fund goods that everyone values a bit (moral public goods).
- This is most valuable when massive multilateral coordination is possible—through a government or voluntary deal-making—and when everyone has both idiosyncratic and shared values, but is individually most motivated to pursue the idiosyncratic ones.
- I estimate that you get log returns on increasing the number of powerholders up to 10⁶, after which it plateaus.
Improving the quality of reflection.
- Powerholders might reflect more effectively on their values if they are exposed to equals who disagree with them. I expect most of this value comes from increasing the number of powerholders from 1 to 10-100.
- There might be outsized benefits from having “champions” of rare value systems if those value systems contain important insights that other powerholders would endorse on reflection—e.g., they care about some type of moral good that other powerholders weren’t initially tracking the value of. I expect that most of this value comes from increasing the number of powerholders up to about 10⁴.

The drawbacks of greater moral diversity are:

Increasing the likelihood of rare bad actors: Increase the likelihood that there’s at least one “destroyer”, an actor that’s motivated to destroy a bunch of value.
- This matters if it’s possible for a single actor to unilaterally destroy a lot of value, which I think is somewhat unlikely, so I rate this consideration lower than the previous three models.
- But, on this model, I estimate that this risk grows logarithmically up until about 10⁸ powerholders.
- If you add destroyers to the bodhisattva model described above, then adding additional powerholders is valuable up until about 10⁴ powerholders.

All this suggests that AI-enabled coups by small groups are a particularly important form of power concentration to prevent, relative to other forms of power concentration that are somewhat more diffuse (e.g., rising wealth inequality).

A major limitation of this modeling is that I’m treating powerholders as if they’re about as different from each other as two randomly selected living humans. In most scenarios with concentration of power, powerholders will be much more similar to each other than that. I think this is an especially serious issue for small numbers of powerholders, since in scenarios where a small number of people seize power, it’s more likely that they’re a close-knit coordinated group from a similar background (e.g., employees at a lab in a lab coup). My guess is that this is less serious for broader concentration of power scenarios (e.g., scenarios where power is consolidated among capital owners).

Increasing the likelihood of rare great actors

You might get outsized benefits from having just one powerholder motivated to pursue the correct values, if most other powerholders don’t care much about something incompatible with pursuing those values.

Here’s a toy model.² Suppose that there are three types of powerholders:

Bodhisattvas, who want to fill as much of the universe as possible with societies full of diverse types of flourishing beings.
Rivals, who have strong preferences that are linear in resources and resource-incompatible³ with the bodhisattvas’ goals. Perhaps they linearly value keeping space pristine and untouched by humans, or value societies full of human-like minds or copies of themselves.⁴ Or maybe their notion of flourishing differs from the bodhisattvas’, so that it’s difficult to create minds that are flourishing by the lights of both the bodhisattvas and the rivals.
Easygoers, who have preferences with diminishing marginal returns. Perhaps they care about the Milky Way being filled with a common-sense utopia of flourishing humans, but don’t care much about what happens with the rest of the universe.

I will assume for the purposes of this model that bodhisattvas and rivals are both fairly rare relative to easygoers.⁵

Suppose that after the intelligence explosion, space resources are auctioned off. Easygoers bid up prices in the Milky Way and nearby galaxies, but resources further out remain cheap. Those distant resources are split between bodhisattvas and rivals. The overall value of the future will be determined by what share of those resources is controlled by the bodhisattvas, so the total fraction of value achieved is B/(R + B), where B is the number of bodhisattvas and R is the number of rivals.⁶

Under this model, there are two important cases:

There are few enough powerholders that you expect less than one bodhisattva or rival.
- In this case, it’s useful to increase the number of powerholders because you get additional “shots on goal”—each additional powerholder is an extra chance to get a bodhisattva.
There are enough powerholders that you expect at least one bodhisattva or rival.
- So in expectation, the bodhisattvas get p/(p + q) of the total available value, where p is the rate of bodhisattvas and q is the rate of rivals.
- Increasing the number of powerholders reduces variance, bringing the actual share of value closer to p/(p + q), but does not change the expected value.

For example, if we assume that about 1 in 10,000 people are bodhisattvas and 1 in 10,000 are rivals, then this is how the value of the future scales with the number of powerholders:
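As a rough numerical sketch of this example (not the author's own code), assume each powerholder is independently a bodhisattva with probability p or a rival with probability q, and that value is zero when neither type is present, since B/(R + B) is otherwise undefined. Under those assumptions the expectation has a closed form: whenever at least one bodhisattva or rival exists, the expected share is p/(p + q). The function name `expected_value_share` is hypothetical.

```python
import math

def expected_value_share(n, p, q):
    """Expected B/(B+R) among n powerholders, counting 0 when B + R == 0.

    Closed form: P(at least one bodhisattva or rival) * p / (p + q).
    """
    # 1 - (1 - p - q)**n, computed stably for tiny rates and huge n.
    prob_any = -math.expm1(n * math.log1p(-(p + q)))
    return prob_any * p / (p + q)

p = q = 1e-4  # the 1-in-10,000 rates from the example above
for exponent in range(0, 9):
    n = 10 ** exponent
    print(f"N = 10^{exponent}: expected share = {expected_value_share(n, p, q):.4f}")
```

In this sketch the expected share climbs while N is below roughly 1/(p + q) = 5,000 and then flattens out near p/(p + q) = 0.5, matching the two cases described above.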

The point at which you get the plateau depends on your estimate of p and q. How common are bodhisattvas and rivals?

You probably need three things to be a bodhisattva: the right starting position (e.g., the correct initial moral intuitions), the right reflective process, and a strong commitment to doing the most good by your lights with most of your resources. Here’s a very rough back-of-the-envelope calculation (BOTEC) that tries to estimate the rate of bodhisattvas in the current human population.

0.1-50% for a sufficiently strong commitment to doing the most good by your lights with most of your resources.
10-50% for the right reflective process conditional on strong commitment to do good.
1-100% for right “starting” intuitions, conditional on the previous two.

This gives a range of 1 in 4 to 1 in 1 million.
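The arithmetic behind that range can be checked by multiplying the low ends and the high ends of the three factors. This is only a sketch of the calculation as I read it; the dictionary labels are shorthand of my own.

```python
import math

# (low, high) ranges for each factor, as listed in the estimate above.
factors = {
    "strong commitment": (0.001, 0.50),
    "right reflective process": (0.10, 0.50),
    "right starting intuitions": (0.01, 1.00),
}

low = math.prod(lo for lo, _ in factors.values())
high = math.prod(hi for _, hi in factors.values())

print(f"lowest rate:  about 1 in {round(1 / low):,}")
print(f"highest rate: about 1 in {round(1 / high):,}")
```

Multiplying the extremes treats the three ranges as if they could all sit at their low or high ends together, so the result is best read as the outer bounds of the estimate rather than a typical value.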

It’s plausible that the rate of rivals will be in the same ballpark as the rate of bodhisattvas. Rivals share many features in common with bodhisattvas, which is part of why they’re resource-incompatible, e.g., they have non-negligible returns to vast resources and they care about the use of distant galaxies and time periods. If the rate of rivals is fairly close—i.e., within 1-3 orders of magnitude of the rate of bodhisattvas—then this suggests logarithmic returns to increasing the number of powerholders up to about 10⁵ to 10⁶, after which it quickly levels off.

For the blue line, the rates of bodhisattvas and rivals are sampled independently and log-uniformly from [1e-6, 0.1]. For the orange line, the rate of bodhisattvas is sampled from [1e-6, 0.1] and the rate of rivals is sampled within two orders of magnitude of the rate of bodhisattvas. For the green line, rivals tend to be more common than bodhisattvas: between equally common and a thousand times more common.
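The independent-sampling case can be approximated without random draws by averaging over an evenly spaced grid in log space. The sketch below is a reconstruction under assumptions, not the code behind the figure: it treats value as zero when there is no bodhisattva or rival, and the helper names are hypothetical.

```python
import math

def expected_value_share(n, p, q):
    """P(at least one bodhisattva or rival) * p / (p + q); zero value if neither exists."""
    prob_any = -math.expm1(n * math.log1p(-(p + q)))
    return prob_any * p / (p + q)

def log_uniform_grid(low, high, points):
    """Evenly spaced points in log space, standing in for log-uniform samples."""
    step = (math.log10(high) - math.log10(low)) / (points - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(points)]

def average_share(n, rates):
    """Average the expected share over independent log-uniform p and q."""
    total = sum(expected_value_share(n, p, q) for p in rates for q in rates)
    return total / len(rates) ** 2

rates = log_uniform_grid(1e-6, 0.1, 51)
for exponent in range(0, 9):
    print(f"N = 10^{exponent}: average share = {average_share(10 ** exponent, rates):.3f}")
```

Because the grid is symmetric in p and q, the average approaches 0.5 once N is large enough that at least one bodhisattva or rival is almost certain for every rate in the range.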

It’s also possible that the rate of rivals won’t be tightly correlated with the rate of bodhisattvas. If your lower bound on q is substantially greater than your lower bound on p, then the value will plateau once the population is greater than 1/q.

I’m again sampling the rate of bodhisattvas log-uniformly between 1e-1 and 1e-6, but this time holding the rate of rivals fixed at different levels.

In the extreme—if >10% of powerholders are likely to be rivals—then we no longer get much value from a few highly motivated bodhisattvas. The next model discusses how moral diversity could be valuable even if most people are rivals.

Increasing the likelihood of coordinating on moral public goods

In the previous section, we considered the case where a relatively small share of the population cared about how resources deep in space were used. What if instead many people have resource-incompatible goals that can absorb large quantities of resources?

I’ve argued elsewhere that in such cases they could often make a deal to collectively fund moral public goods, and this would probably be good, since there would be significant gains from trade and a shift of resources from idiosyncratic to more broadly-shared preferences.

How many powerholders do we need to ensure that moral public goods are funded?

It depends on how much people value the moral public good relative to the best goods according to their idiosyncratic preferences. For a trade to be possible at all, there must be gains from trade for all participants. For example, if each of N people has a linear utility function u_i = x_i + m × y (where x_i is the level of spending on their idiosyncratic good and y is the level of spending on the public good), then people will spend on the public good only if N ≥ 1/m: with equal budgets, a unit that everyone pools into the public good is worth m × N to each person, versus 1 if spent on their own idiosyncratic good. Multipliers in the range of 1 to 10⁻⁶ seem quite plausible.

I am somewhat more skeptical of multipliers much smaller than 10⁻⁶. First, it’s unclear to what extent people will have very weak preferences that are psychologically distinguishable from no preference at all, which makes extremely low multipliers (e.g., 10⁻³⁰) implausible. Second, if the multiplier for a particular consensus good gets very low, then it seems increasingly plausible that there was some other, better deal that they could have made with a subset of their trading partners who shared some of their idiosyncratic preferences.⁷

Based on these considerations, my best guess is that the multipliers are log-uniformly distributed from 10⁻⁶ to 1, implying logarithmic returns to growing the population of powerholders up to around one million.
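To see why a log-uniform multiplier implies logarithmic returns, here is a minimal sketch. It assumes every powerholder has one unit to spend and that the deal is all-or-nothing pooling; those simplifications and both function names are mine, not the post's.

```python
import math

def deal_is_worthwhile(n, m):
    """True if n people pooling one unit each beats spending it on themselves.

    With u_i = x_i + m * y, going it alone yields 1, while the pooled deal
    yields m * n, so the deal needs n >= 1 / m.
    """
    return m * n >= 1.0

def share_of_multipliers_funded(n, m_min=1e-6, m_max=1.0):
    """Fraction of a log-uniform distribution over m for which n >= 1 / m."""
    threshold = 1.0 / n  # smallest m that makes the deal work
    if threshold >= m_max:
        return 0.0
    if threshold <= m_min:
        return 1.0
    return (math.log10(m_max) - math.log10(threshold)) / (
        math.log10(m_max) - math.log10(m_min)
    )

print(deal_is_worthwhile(n=1_000, m=1e-2))  # True: 1,000 people >= 1/m = 100
print(deal_is_worthwhile(n=1_000, m=1e-4))  # False: 1,000 people < 1/m = 10,000
for exponent in range(0, 8):
    n = 10 ** exponent
    print(f"N = 10^{exponent}: funded for {share_of_multipliers_funded(n):.2f} of multipliers")
```

Each tenfold increase in N unlocks another sixth of the multiplier range until N reaches one million, after which every multiplier in the assumed range already clears the bar.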

Increasing the quality of reflection

In the previous two models, I’ve treated the powerholders’ values as developing mostly independently. But if powerholders influence each other’s reflection—e.g., by arguing with each other about their values—then greater initial moral diversity could help powerholders converge to a better set of final values, through mechanisms like the following:

Social exposure to non-sycophants. If one person single-handedly carries out a coup and ends up with a decisive strategic advantage, they might find themselves surrounded by yes-men who are utterly reliant on the dictator and unwilling to argue forcefully for different values from what the dictator currently endorses. A similar dynamic might be at play if a small but ideologically very uniform group seizes power (e.g., a set of officials from the same presidential administration or perhaps a dictator and his close advisors). But if there are multiple, ideologically diverse powerholders, they might be able to challenge each other’s views and improve the overall quality of reflection.

Under this model, most of the value probably comes from moving from a single powerholder to tens or hundreds of powerholders, or from moving from one ideologically uniform group to multiple ideologically uniform groups (perhaps moving from a lab coup or an executive coup to a joint lab coup and executive coup).

This effect relies on powerholders socializing with each other, rather than retreating into their own bubbles of non-powerholding friends and sycophantic AIs.

Champions for rare values. Powerholders with rare value systems might be able to act as “champions” for those value systems. For example, they might use AI labor to develop the strongest, most plausible version of that value system, or they might try to persuade other powerholders about the merits of that value system. This might be important if that rare value system includes an insight that’s missing from other value systems—perhaps most value systems care primarily about consciousness, but actually there’s another totally different type of moral good that other powerholders would want to pursue if they were aware of it.

(In principle, non-powerholders could act as champions for rare values. But they might lack the resources (e.g., access to ASI labor) needed to develop the insights in their value systems. They might be reliant on the goodwill of powerholders and not want to push too aggressively for their alternative value system, or powerholders might simply not take non-powerholders seriously.)

Just as in the bodhisattva model, increasing the number of powerholders increases the chances that at least one powerholder can serve as a champion for a rare value system that contains a crucial insight.

I’m very uncertain about how common these champions are, but if they’re sufficiently rare, then we’re probably more likely to get their insight via some other mechanism.

For example, some powerholders might be “superreflectors” who instruct their ASIs to steelman every known human value system and invent millions of novel value systems, searching for insights that they and other powerholders might endorse on reflection. I expect that superreflectors would achieve all of the value from having powerholders act as champions for rare value systems that they actually subscribe to (and more).

So increasing the number of powerholders adds value only up to the point where we are likely to have at least one superreflector. Superreflectors are also plausibly rather rare—perhaps between 1/10 and 1/10,000—so increasing the number of powerholders up until 10,000 is valuable under this model.
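A quick way to see where this saturates is to compute the chance of having at least one superreflector among N powerholders, assuming each is one independently at a fixed rate. The independence assumption and the function name are mine; the rates span the range suggested above.

```python
import math

def prob_at_least_one(n, rate):
    """Chance that n independent powerholders include at least one superreflector."""
    return -math.expm1(n * math.log1p(-rate))

for rate in (1e-1, 1e-2, 1e-3, 1e-4):
    row = ", ".join(
        f"N=10^{e}: {prob_at_least_one(10 ** e, rate):.2f}" for e in range(0, 6)
    )
    print(f"rate {rate:g} -> {row}")
```

For each rate, the probability is still well short of certainty at N = 1/rate and gets close to 1 within about one further order of magnitude.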

Increasing the likelihood of rare bad actors

It’s possible (though rather unlikely⁸) that a single bad actor could unilaterally destroy a lot of value, e.g., by:

Initiating a space race that results in an extremely inefficient use of space resources by the lights of most people’s value systems.
Destroying the universe by initiating false vacuum decay or triggering another galactic-level x-risk.

As we increase the moral diversity of powerholders, we increase the chance of ending up with at least one powerholder that inherently values one of these activities enough that they will do it if they can. For example, locusts might inherently value expanding through space as quickly as possible. We also increase the likelihood that one powerholder is ruthless or reckless enough to risk one of these activities—for example, a powerholder might threaten to initiate vacuum decay to extort concessions from other powerholders.

We can add these rare bad actors to the bodhisattva model described above. Now, in addition to bodhisattvas, rivals, and easygoers, we have a fourth type: destroyers. If one destroyer is present, total value is zero; otherwise it is calculated as before.

I am assuming that the rate of bodhisattvas and rivals is sampled log-uniformly between 0.1 and 1e-6.

When diversity is low, it’s unlikely that there’s a bodhisattva already. Then adding additional powerholders is all upside: if you add a bodhisattva, then you get some positive value, but if you add a destroyer, rival, or easygoer, then expected value stays around zero. But as diversity increases, it’s likely that there’s a bodhisattva already, which means that adding additional powerholders risks adding a destroyer, bringing us from positive value to near-zero value.

As the figure above shows, the value of N where we switch from the low-diversity regime to the high-diversity regime depends on the destroyer rate. As a wild guess, I estimate that the destroyer rate is distributed log-uniformly between 10⁻⁴ and 10⁻⁸. Under those assumptions, increasing the number of powerholders is beneficial up until 10⁴ powerholders, after which additional powerholders reduce value.
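The rise-then-fall shape can be sketched with fixed rates rather than sampled ones. Assuming each powerholder is independently a bodhisattva (rate p), rival (rate q), or destroyer (rate d), and that value is zero if any destroyer is present or if there is no bodhisattva or rival, the expected share works out to ((1 − d)^N − (1 − p − q − d)^N) × p/(p + q). The specific rates below are illustrative assumptions, not the sampled distributions behind the figure.

```python
import math

def expected_value_with_destroyers(n, p, q, d):
    """Expected share of value when any destroyer sets total value to zero.

    Closed form under independent draws:
    ((1 - d)**n - (1 - p - q - d)**n) * p / (p + q).
    """
    no_destroyer = math.exp(n * math.log1p(-d))
    nobody_relevant = math.exp(n * math.log1p(-(p + q + d)))
    return (no_destroyer - nobody_relevant) * p / (p + q)

# Illustrative rates (assumptions, not the author's sampled distributions).
p = q = 1e-4
d = 1e-6
for exponent in range(0, 9):
    n = 10 ** exponent
    value = expected_value_with_destroyers(n, p, q, d)
    print(f"N = 10^{exponent}: expected share = {value:.4f}")
```

With these particular rates the expected share peaks between 10⁴ and 10⁵ powerholders and then falls as a destroyer becomes likely; a higher destroyer rate moves the peak earlier.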

This article was created by Forethought. See all of our research on our website.

1. You might also disagree with me on what the correct moral system is likely to be, which could also lead to different parameters here.

2. Credit to Will MacAskill for this model.

3. That is, the same resources cannot be used to simultaneously get most of the value by the lights of both the bodhisattva and the rival.

4. This is assuming that the most flourishing minds have way higher value (under the correct moral view) than human-like minds.

5. I think this is somewhat plausible: most people today have preferences that are sublinear in resources and do not care much about very distant galaxies. But it’s also plausible that future people will have more resource-hungry preferences, if they reflect on their preferences, if their sublinear preferences are all saturated, or if advances in technology allow them to personally benefit from consuming huge amounts of resources. In the section on moral public goods, I discuss how moral diversity might matter if linear preferences are common.

6. This assumes that bodhisattvas and rivals individually have the same amount of resources on average.

7. In fact, increasing N can make these side-deals more likely by increasing the number of people who care about the idiosyncratic good. For example:

Imagine a world with 10 people, each of whom values 3 goods: copies of themselves, national glory (valued at 80% of copies of themselves), and hedonium (valued at 11% of copies of themselves). Suppose that each person is from a different nation. They will prefer to coordinate on hedonium.
But if there are twenty people, two from each nationality, then everyone will prefer to coordinate with their co-nationalist on producing national glory.

Of course, it’s not totally clear, from a subjectivist perspective, whether (the general version of) this is bad.

8. Perhaps the most plausible story for this is if powerholders spread across space, and the destroyer covertly carries out the destructive activity without others noticing before it’s too late. But I expect the other powerholders will very likely be able to anticipate and mitigate this risk (e.g., by demanding that the destroyer make verifiable commitments to avoid this activity before allowing the destroyer to leave the solar system).

The good, the bad and the ugly: AI impacts on epistemics

Owen Cotton-Barratt — Mon, 13 Apr 2026 17:15:39 GMT

This article was created by Forethought. See the original on our website.

Intro

For better or worse, AI could reshape the way that people work out what to believe and what to do. What are the prospects here?

In this piece, we’re going to map out the trajectory space as we see it. First, we’ll lay out three sets of dynamics that could shape how AI impacts epistemics (how we make sense of the world and figure out what’s true):

The good: there’s huge potential for AI to uplift our ability to track what’s true and make good decisions
The bad: AI could also make the world harder for us to understand, without anyone intending for that to happen
The ugly: malicious actors could use AI to actively disrupt epistemics

Then we’ll argue that feedback loops could easily push towards much better or worse epistemics than we’ve seen historically, making near-term work on AI for epistemics unusually important.

The stakes here are potentially very high. As AI advances, we’ll be faced with a whole raft of civilisational-level decisions to make. How well we’re able to understand and reason about what’s happening could make the difference between a future that we’ve chosen soberly and wisely, and a catastrophe we stumble into unawares.

The good

“If I have seen further, it is by standing on the shoulders of giants.” (Isaac Newton)

There are lots of ways that AI could help improve epistemics. Many kinds of AI tools could directly improve our ability to think and reason. We’ve written more about these in our design sketches, but here are some illustrations:

Tools for collective epistemics could make it easy to know what’s trustworthy and reward honesty, making it harder for actors to hide risky actions or concentrate power by manipulating others’ views.
- Imagine that when you go online, “community notes for everything” flag content that other users have found misleading, and “rhetoric highlighting” automatically flags persuasive but potentially misleading language. With a few clicks, you can see the epistemic track record of any actor, or access the full provenance of a given claim. Anyone who wants can compare state-of-the-art AI systems using epistemic virtue evals, which also exert pressure at the AI development stage.
Tools for strategic awareness could deepen people’s understanding of what’s actually going on around them, making it easier to make good decisions, keep up with the pace of progress, and steer away from failure modes like gradual disempowerment.
- Imagine that superforecaster-level forecasting and scenario planning are available on tap, and automated OSINT gives people access to much higher quality information about the state of the world.
Technological analogues to angels-on-the-shoulder, like personalised learning systems and reflection tools, could make decision-makers better informed, more situationally aware, and more in touch with their own values.
- Imagine that everyone has access to high-quality personalised learning, automated deep briefings for high-stakes decisions, and reflection tools to help them understand themselves better. In the background, aligned recommender systems promote long-term user endorsement, and some users enable a guardian coach system which flags any actions the person might regret taking in real time.

Structurally, AI progress might also enable better reasoning and understanding, for example by automating labour such that people have more time and attention, or by making people wealthier and healthier.

These changes might enable us to approach something like epistemic flourishing, where it’s easier to find out what’s true than it is to lie, and the world in most people’s heads is pretty similar to the world as it actually is. This could radically improve our prospects of safely navigating the transition to advanced AI, by:

Helping us to keep pace with the increasing speed and complexity of the situation, so we’re able to make informed and timely decisions.
Ensuring that key decision-makers don’t make catastrophic unforced errors through lack of information or understanding.
Making it harder for malicious actors to manipulate the information environment in their favour to increase their own influence.

A Philosopher Lecturing on the Orrery, by Joseph Wright of Derby (1766)

What’s driving these potential improvements?

AI will be able to think much more cheaply and quickly than humans. Partly this will mean that we can reach many more insights with much less effort. Partly this will make it possible to understand things that are currently infeasible for us to understand (because it would take too many humans too long to figure them out).
AI can ‘know’ much more than any human. Right now, a lot of information is siloed in specific expert communities, and it’s slow to filter out to other places even when it would be very useful there. AI will be able to port and apply knowledge much more quickly to the relevant places.

The bad

“A wealth of information creates a poverty of attention.” (Herbert Simon)

AI could also make epistemics worse without anyone intending it, by making the world more confusing and degrading both our information and our ability to process it.

There are a few different ways that AI could unintentionally weaken our epistemics:

The world gets faster and more complex. As AI progresses, our information-processing capabilities are going to go up, but so is the complexity of the world. Technological progress could become dramatically faster, making the world more disorienting and harder to understand than it is today. If tech progress reaches fast enough speeds, it’s possible that we won’t be able to keep up, and even the best AI tools available won’t help us to see through the fog.
The quality of the information we’re interacting with gets worse, because of:
- Faster memetic evolution. As more and more content is generated by and mediated through AI systems working at machine speeds, the pace of memetic and cultural change will probably get a lot faster than it is today. As the pace quickens, memes which are attention-grabbing could increasingly outcompete those which are truthful.
- More difficult verification. This could happen through a combination of:
  - AI slop. In hard-to-verify domains, AI could massively increase the quantity of plausible-looking but wrong information, without also being able to help us to verify which bits are right.
  - AI-generated ‘evidence’. As the quality of AI-generated video, audio, images, and text continues to improve, it may become pretty difficult to tell which bits of evidence are real and which are spurious.
We get worse at processing the information we get, because:
- Our emotions get in the way. AI progress could be very disorienting, generate serious crises, and cause people a lot of worry and fear. This could get in the way of clear thinking.
- Using AI to help us with information processing degrades our thinking, via:
  - Adoption of low-quality AI tools for epistemics: In many areas of epistemics, it’s hard to say what counts as ‘good’. This makes epistemic tools harder to assess, and could lead to people trusting these tools either too much or too little. Inappropriately high levels of trust in epistemic tools could take various forms, including:
    - First mover advantages for early but imperfect systems, which are then hard to replace with better systems because people trust the earlier systems more.
    - The use of epistemically misaligned systems, which aren’t actually truth-tracking, though we have no way of discerning that.
  - Fragmentation of the information environment: AI will make it easier to create content (potentially interactive content) that pulls people in and monopolises their attention. This could reduce the attention available for important truth-tracking mechanisms, and make it harder to coordinate groups of people around important actions. In the extreme, some people might end up in effectively closed information bubbles, where all of their information is heavily filtered through the AI systems they interact with directly. The more fragmented the information environment becomes, the harder it could get for people to make sense of what’s happening in the world around them, and to engage with other people and other information bubbles.
  - Epistemic dependence: if people increasingly outsource their thinking to AI systems, they may lose the ability to think critically for themselves.

Allegory of Error, Stefano Bianchetti (1801)

The ugly

“The ideal subject of totalitarian rule is not the convinced Nazi or the convinced Communist, but people for whom the distinction between fact and fiction (i.e., the reality of experience) and the distinction between true and false (i.e., the standards of thought) no longer exist.” (Hannah Arendt, The Origins of Totalitarianism)

We’ve just talked about ways that AI could make epistemics worse without anyone intending that. But we might also see actors using AI to actively interfere with societal epistemics. (In reality these things are a spectrum, and the dynamics we discussed in the preceding section could also be actively exploited.)

What might this look like?

Automated propaganda and persuasion: AI could be used to generate high-quality persuasive content at scale. This could take the form of highly tailored, well-written propaganda. If this content were then used as training data for next-generation models, biases could get even more entrenched. Additionally, AI persuasion could come in the form of models which are subtly biased in a particular direction. Particularly if many users are spending large amounts of time talking to AI (e.g. AI companions), the persuasive effects could be much larger than anything human-to-human persuasion can achieve at scale today.
Using AI to undermine sense-making: AI could be used to generate high-quality content which casts doubt on institutions, individuals, and tools that would help people understand what’s going on, or to directly sabotage such tools. More indirectly, actors could also use AI to generate content which adds to complexity, for example by wrapping important information in complex abstractions and technicalities, and generating large quantities of very readable reports and news stories which distract attention.
Surveillance: AI surveillance could monitor people’s communications in much more fine-grained ways, and punish them when they appear to be thinking along undesirable lines. This could be abused by states, or could become a tool that private actors can wield against their enemies. In either case, the chilling effect on people’s thinking and behaviour could be significant.

The Card Sharp with the Ace of Diamonds, by Georges de La Tour (~1636-1638)

But maybe this is all a bit paranoid. Why expect this to happen?

There’s a long history of powerful actors trying to distort epistemics,[1] so we should expect that some people will be trying to do this. And AI will probably give them better opportunities to manipulate other people’s epistemics than have existed historically:

It’s likely that access to the best AI systems and compute will be unequal, which favours abuse.
If people end up primarily interfacing with the world via AI systems, this will create a big lever for epistemic influence that doesn’t exist currently. It could be much easier to influence the behaviour of lots of AI systems at once than lots of people or organisations.

It’s also worth noting that many of these abuses of epistemic tech don’t require people to have some Machiavellian scheme to disrupt epistemics or seek power for themselves (though these might arise later). Motivated reasoning could get you a long way:

Legitimate communications and advertising blur into propaganda, and microtargeting is already a common strategy.
It’s easy to imagine that in training an AI system, a company might want to use something like its own profits as a training signal, without explicitly recognising the potential epistemic effects of this in terms of bias.

So what should we expect to happen?

With all these dynamics pulling in different directions, should we expect that it’s going to get easier or harder for people to make sense of the world?

We think it could go either way, and that how this plays out is extremely consequential.

The main reason we think this is that the dynamics above are self-reinforcing, so the direction we set off in initially could have large compounding effects. In general, the better your reasoning tools and information, the easier it is for you to recognise what is good for your own reasoning, and therefore to improve your reasoning tools and information. The worse they are, the harder it is to improve them (particularly if malicious actors are actively trying to prevent that).

We already see this empirically. The Scientific Revolution and the Enlightenment can be seen as examples of good epistemics reinforcing themselves. Distorted epistemic environments often also have self-perpetuating properties. Cults often require members to move into communal housing and cut contact with family and friends who question the group. Scientology frames psychiatry’s rejection of its claims as evidence of a conspiracy against it.

And on top of historical patterns, there are AI-specific feedback loops that reinforce initial epistemic conditions:

Unlike previous information tech, AI has a tight feedback loop between the content it generates and the data used to train future models. So whether today’s models generate accurate or inaccurate content, future models are more likely to do the same.
How early AI systems behave epistemically will shape user expectations and what kinds of future AI behaviour there’s a market for.

There are self-correcting dynamics too, so these self-reinforcing loops won’t go on forever. But we think it’s decently likely that epistemics get much better or much worse than they’ve been historically:

One self-correcting mechanism historically has just been that it takes (human) effort to sustain or degrade epistemics. Continuing to improve epistemics requires paying attention to ways that epistemics could be eroded, and this isn’t incentivised in an environment that’s currently working well. Continuing to degrade epistemics requires willing accomplices — but the more an actor distorts things, the more that can galvanise opposition, and the fewer people may be willing to assist. By augmenting or replacing human labour with automated labour, AI could make it much cheaper to keep pushing in the same direction.
Another self-correcting mechanism is just that people and institutions adapt to new epistemic tech: as epistemics improve, deception becomes more sophisticated; and if epistemics worsen, people lose trust and create new mechanisms for assessing truth. But this adaptation happens at human speed, and AI will increasingly be changing the epistemic environment at a much faster pace. This creates the potential for self-reinforcing dynamics to drive to much more extreme places before adaptation has time to kick in.[2]

There’s a limit to how good epistemics can get before hitting fundamental problems like complexity and irreducible uncertainty. But there seems to be a lot of room for improvement from where we’re currently standing (especially as good AI tools could help to handle greater amounts of complexity), and it would be a priori very surprising if we’d already reached the ceiling.
There’s also a limit to how bad epistemics can get: people aren’t infinitely suggestible, and often there are external sources of truth that limit how distorted beliefs can get (ground truth, or what gets said in other countries or communities). But as we discussed above, access to ground truth and to other epistemic communities might get harder because of AI, so the floor here may lower.

Given the real chance that we end up stuck in an extremely positive or negative epistemic equilibrium, our initial trajectory seems very important. The kinds of AI tools we build, the order we build them in, and who adopts them when could make the difference between a world of epistemic flourishing and a world where everyone’s understanding is importantly distorted. To give a sense of the difference this makes, here’s a sketch of each world (among myriad possible sketches):

In the first world, we basically understand what’s going on around us. It’s not like we can now forecast the future with perfect accuracy or anything: there’s still irreducible uncertainty, and some people have better epistemic tools than others. But it’s gotten much cheaper to access and verify information. Public discourse is serious and well-calibrated, because epistemic infrastructure has made it quite hard to deceive or manipulate people, which in turn incentivises honesty. AI-assisted research and synthesis mean that knowledge which used to be siloed in specialist communities is now accessible and usable by anyone who needs it. And governments are able to make much more nuanced decisions far faster than they can today.
In the second, it’s no longer really possible to figure out what’s going on. There’s an awful lot of persuasive but low-quality AI content around, some of it generated with malicious intent. In response to this, people withdraw into their own AI-mediated epistemic bubbles, and unlike today’s filter bubbles, these can be comprehensive enough that people rarely encounter friction with outside perspectives at all. Meanwhile, companies and nations with a lot of compute find it pretty easy to distract the public’s attention from anything that would be inconvenient, and to outmaneuver the many actors who are trying to hold them to account. But their own reasoning also gets degraded by all this information pollution, as their AI systems are trained on the same corrupted public information.[3] Even the people who think they’re shaping the narrative are increasingly unable to see clearly.

The world we end up in is the world from which we have to navigate the intelligence explosion, making decisions like how to manage misaligned AI systems, whether to grant AI systems rights, and how to divide up the resources of the cosmos. How AI impacts our epistemics between now and then could be one of the biggest levers we have on navigating this well.

Things we didn’t cover

Whose epistemics?

We mostly talked about AI impacts on epistemics in general terms. But AI could impact different groups’ epistemics differently — and different groups’ epistemics could matter more or less for getting to good outcomes. It would be cool to see further work which distinguishes between scenarios where good outcomes require:

Interventions that raise the epistemic floor by improving everyone’s epistemics.
Interventions that raise the ceiling by improving the epistemics of the very clearest thinkers.

‘Weird’ dynamics

We focused on how AI could impact human epistemics, in a world where human reasoning still matters. But eventually, we expect more and more of what matters for the outcomes we get will come down to the epistemics of AI systems themselves.

The dynamics which affect these AI-internal epistemics could therefore be enormously important. But they could look quite different from the human-epistemics dynamics that have been our focus here, and we didn’t think it made sense to expand the remit of the piece to cover these.

Thanks to everyone who gave comments on drafts, and to Oly Sourbut and Lizka Vaintrob for a workshop which crystallised some of the ideas.

[1] Think of things like:

Propaganda states like Nazi Germany and the USSR.
Corporate lobbying like the tobacco and sugar lobbies and climate science doubt campaigns.
CIA operations to spread doubt and confusion.

[2] Though it’s possible that this dynamic will be more pronounced for epistemics getting extremely bad than for them getting extremely good. Consider these two very simplistic sketches:

People start living in increasingly closed AI filter bubbles. Institutions are slow to adopt similar bubbles at a corporate level, but they also don’t have a mandate to change what their employees are doing. People’s filter bubbles tend to be pretty correlated with the people they work and interact with, so institutions end up with pretty distorted pictures of what’s going on even though they don’t actively start using harmful tech. Government regulation is too slow and reactive to stop this from happening.
People start to use provenance tracing and rhetoric highlighting by default when browsing, in response to an increasingly polarised memetic environment. There is adaptation to this — politicians start using subtler language and so on. But the net effect is still strongly positive: it’s hard to fake provenance, and removing overt rhetoric is already a big win, even if it means that more slippery language proliferates.

In the first sketch, it’s straightforwardly the case that adaptive mechanisms are too slow. In the second, it’s more that the tech is inherently defence-favoured.

We haven’t explored this area deeply, and think more work on this would be valuable.

[3] Alternatively, these elites might retain very good epistemics for themselves, and choose to indefinitely maintain a situation where everyone else has a very distorted understanding, to further their own ends. It’s unclear to us which of these scenarios is more likely or concerning.

Sketches of some defense-favoured coordination tech

Owen Cotton-Barratt — Mon, 06 Apr 2026 15:18:38 GMT

This article was created by Forethought. See the original on our website.

Intro

We think that near-term AI could make it much easier for groups to coordinate, find positive-sum deals, navigate tricky disagreements, and hold each other to account.

Partly, this is because AI will be able to process huge amounts of data quickly, making complex multi-party negotiations and discussions much more tractable. And partly it’s because secure enough AI systems would allow people to share sensitive information with trusted intermediaries without fear of broader disclosure, making it possible to coordinate around information that’s currently too sensitive to bring to the table, and to greatly improve our capacity for monitoring and transparency.

We want to help people imagine what this could look like. In this piece, we sketch six potential near-term technologies, ordered roughly by how achievable we think they are with present tech:[1]

Fast facilitation — Groups quickly surface key points of consensus and disagreement, and make decisions everyone can live with.
Automated negotiation — Complicated bargains are discovered quickly via automated negotiation on behalf of each party, mediated by trusted neutral systems which can find agreements.
Arbitrarily easy arbitration — Disputes are resolved cheaply and quickly by verifiably neutral AI adjudicators.
Background networking — People who should know each other get connected (perhaps even before they know to go looking), enabling mutually beneficial trade, coalition building, and more.
Structured transparency for democratic oversight — Citizens hold their institutions to account in a fine-grained way, without compromising sensitive information.
Confidential monitoring and verification — Deals can be monitored and verified, even when this requires sharing highly sensitive information, by using trusted AI intermediaries which can’t disclose the information to counterparties.

We also sketch two cross-cutting technologies that support coordination:

AI delegates and preference elicitation — AI delegates can faithfully represent and act for a human principal, perhaps supported by customisable off-the-shelf agentic platforms that integrate across many kinds of tech.
Charter tech — The technologies above, or other coordination technologies, are applied to making governance dynamics more transparent, making it easier to anticipate how governance decisions will influence future coordination, and design institutions with this in mind.

An important note is that coordination technologies are open to abuse. You can coordinate to bad ends as well as good, and confidential coordination technologies in particular could enable things like price-fixing, crime rings, and even coup plots. Because the upsides to coordination are very high (including helping the rest of society to coordinate against these harms), we expect that on balance accelerating some versions of these technologies is beneficial. But this will be sensitive to exactly how coordination technologies are instantiated, and any projects in this direction need to take especial care to mitigate these risks.

We’ll start by talking about why these tools matter, then look at the details of what these technologies might involve before discussing some cross-cutting issues at the end.

Why coordination tech matters

Today, many positive-sum trades get left on the table, and a lot of resources are wasted in negative-sum conflicts. Better coordination capabilities could lead to very large benefits, including:

Improving economic productivity across the board
Helping nations avoid wars and other destructive conflicts
Enabling larger groups to coordinate to avoid exploitation by a small few
Making democratic governance much more transparent, while protecting sensitive information

What’s more, getting these benefits might be close to necessary for navigating the transition to more powerful AI systems safely. Absent coordination, competitive pressures are likely to incentivise developers to race forward as fast as possible, potentially greatly increasing the risks we collectively run. If we become much better at coordination, we think it is much more likely that the relevant actors will be able to choose to be cautious (assuming that is the collectively-rational response).

However, coordination tech could also have significant harmful effects, through enabling:

AI companies to collude with each other against the interests of the rest of society[2]
A small group of actors to plot a coup
More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don’t incentivise prosociality to the same extent

Regardless of how these harms and benefits net out for ‘coordination tech’ overall, we currently think that:

The shape and impact of coordination tech is an important part of how things will unfold in the near term, and it’s good for people to be paying more attention to this.
We’re going to need some kinds of coordination tech to safely navigate the AI transition.
The devil is in the details. There are ways of advancing coordination tech which are positive in expectation, and ways of doing so which are harmful.

Why ‘defense-favoured’ coordination tech

That’s why we’ve called this piece ‘defense-favoured coordination tech’, not just ‘coordination tech’. We think generic acceleration of coordination tech is somewhat fraught — our excitement is about thoughtfully run projects which are sensitive to the possible harms, and target carefully chosen parts of the design space.

We’re not yet confident about which bits of the space are best, and we haven’t seen convincing analysis on this from others either. Part of the reason we’re publishing these design sketches is to encourage and facilitate further thinking on this question.

For now, we expect that there are good versions of all of the technologies we sketch below — but we’ve flagged potential harms where we’re tracking them, and encourage readers to engage sceptically and with an eye to how things could go badly as well as how they could go well.

Fast facilitation

Right now, coordinating within groups is often complex, expensive, and difficult. Groups often drop the ball on important perspectives or considerations, move too slowly to actually make decisions, or fail to coordinate at all.

AI could make facilitation much faster and cheaper, by processing many individual views in parallel, tracking and surfacing all the relevant factors, providing secure private channels for people to share concerns, and/or providing a neutral arbiter with no stake in the final outcome. It could also make it much more practical to scale facilitation and bring additional people on board without slowing things down too much.

Design sketch

An AI mediation system briefly interviews groups of 3–300 people async, presents summary positions back to the group, and suggests next steps (including key issues to resolve). People approve or complain about the proposal, and the system iterates to appropriate depth for the importance of the decision.

Under the hood, it does something like:

Gathers written context on the setting and decision
Holds brief, private conversations with each participant to understand their perspective
Builds a map of the issue at hand, involving key considerations and points of (dis)agreement
- Performs and integrates background research where relevant
Identifies which people are most likely to have input that changes the picture
Distils down a shareable summary of the map, and seeks feedback from key parties
Proposes consensus statements or next steps for approval, iterating quickly to find versions that have as broad a backing as possible
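
As a rough illustration of the last step in the list above, the sketch below takes participants' reactions to a set of candidate consensus statements, picks the statement with the broadest backing, and reports who still objects so that the next iteration can focus on them. It is a minimal sketch under stated assumptions: the names (`Response`, `broadest_backing`), the approve/object data model, and the tie-breaking rule are all hypothetical, and a real system would presumably draft the candidate statements with an LLM rather than take them as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One participant's reaction to one candidate statement (hypothetical model)."""
    participant: str
    statement: str
    approves: bool


def broadest_backing(responses: list[Response]) -> tuple[str, float, list[str]]:
    """Return (statement, approval share, objectors) for the best-backed statement.

    Ties are broken in favour of the statement that appears first.
    """
    if not responses:
        raise ValueError("no responses to summarise")

    approvals: dict[str, list[str]] = {}
    objections: dict[str, list[str]] = {}
    for r in responses:
        approvals.setdefault(r.statement, [])
        objections.setdefault(r.statement, [])
        (approvals if r.approves else objections)[r.statement].append(r.participant)

    def share(statement: str) -> float:
        yes, no = len(approvals[statement]), len(objections[statement])
        return yes / (yes + no)

    # max() returns the first maximal key, and dicts preserve insertion order.
    best = max(approvals, key=share)
    return best, share(best), sorted(objections[best])


# Illustrative data only.
responses = [
    Response("ana", "Ship in June", True),
    Response("ben", "Ship in June", False),
    Response("cy", "Ship in June", True),
    Response("ana", "Ship in May", False),
    Response("ben", "Ship in May", True),
    Response("cy", "Ship in May", False),
]
statement, backing, objectors = broadest_backing(responses)
print(statement, round(backing, 2), objectors)  # Ship in June 0.67 ['ben']
```

Returning the objectors alongside the winning statement is a deliberate choice here: it gives the facilitator a natural next step (follow up privately with the people who objected) instead of treating a majority as the end of the process.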

Feasibility

Fast facilitation seems fairly feasible technically. The Habermas Machine (2024) does a version of this that provided value to participants — and we have seen two years of progress in LLMs since then. And there are already facilitation services like Chord. In general, LLMs are great at gathering and distilling lots of information, so this should be something they excel at. It’s not clear that current LLMs can already build accurate maps of arbitrary in-motion discourse, but they probably could with the right training and/or scaffolding.

Challenges for the technology include:

Ensuring that it’s more efficient and a better user experience for moving towards consensus than other, less AI-based approaches.
Remaining robust against abusive user behaviour (e.g. you don’t want individuals to get their way via prompt injection or blatantly lying).

Neither of these seems like a fundamental blocker. For example, to protect against abuse, it may be enough to maintain transparency so that people can search for abusive behaviour themselves. (Or if users need to enter confidential information, there might be services which can confirm that information without revealing it.)

Possible starting points // concrete projects

Build a baby version. This could help us notice obstacles or opportunities that would have been hard to predict in advance. You could focus on the UI or the tech side here, or try to help run pilots at specific organisations or in specific settings.
Design ways to evaluate fast facilitation tools. This makes it easier to assess and improve on performance. For example, you could create games/test environments with clear “win” and “failure” modes.
Build subcomponents. For example:
- Bots that surface anonymous info.
- Tools that try to surface areas of consensus or common knowledge as efficiently as possible, while remaining hard to game.
Make a meeting prep system. Focus first on getting good at meeting prep — creating an agenda and considerations that need live discussion — to reduce possible unease about outsourcing decision-making to AI systems.
Make a bot to facilitate discussions. This could be used in online community fora, or to survey experts.
Design ways to create live “maps” of discussions. Fast facilitation is fast because it parallelises communication. This makes it more important to have good tools for maintaining shared context.

Automated negotiation

High-stakes negotiation today involves adversarial communication between humans who have limited bandwidth.

Negotiation in the future could look more like:

You communicate your desires openly with a negotiation delegate who is on your side, and who asks questions only when needed to build a deeper model of your preferences.
The delegate goes away, and comes back with a proposal that looks pretty good, along with a strategic analysis explaining the tradeoffs / difficulties in getting more.

Design sketch

Humans can engage AI delegates to represent them. The delegates communicate with each other via a neutral third-party mediation system, returning to their principals with a proposal, or with important interim updates and decision points.

Under the hood, this might look like:

Delegate systems:
- Read over context documents and query principals about key points of uncertainty to build initial models of preferences.
- Model the negotiation dynamics and choose strategic approaches to maximise value for their principal.
- Go back to the principal with further detailed queries when something comes up that crosses an importance threshold and where they are insufficiently confident about being able to model the principal’s views faithfully.
- Are ultimately trained to get good results by the principal’s lights.
Neutral mediator system:
- Is run by a trusted third party (or, in higher-stakes situations, is perhaps cryptographically secure with transparent code).
- Discusses with all parties (either AI delegates, or their principals)
  - Can hear private information without leaking that information to the other party
    - Impossibility theorems mean that it will sometimes be strategically optimal for parties to misrepresent their position to the mediator (unless we give up on the ability to make many actually-good deals); however, we can seek a setup such that it is rarely a good idea to strategically misrepresent information, or that it doesn’t help very much, or that it is hard to identify the circumstances in which it’s better to misrepresent
- Searches for deals that will be thought well of by all parties, and proposes those to the delegates.
- Is ultimately trained to help all parties reach fair and desired outcomes, while minimising incentives-to-misrepresent for the parties.
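
To make two of the steps above more concrete, here is a toy sketch of the delegate's "go back to the principal" rule and the mediator's search for deals. Each delegate privately reports how much its principal values each candidate deal, plus a walk-away value. The mediator then proposes the deal that maximises the product of each party's gain over walking away. That criterion is the Nash bargaining solution, used here only as one simple stand-in for "fair and desired outcomes"; the design sketch doesn't commit to it. All names, the numeric utilities, and the threshold values are assumptions for illustration. This toy version also does nothing to reduce incentives to misrepresent, which is the hard part flagged above.

```python
from dataclasses import dataclass
from math import prod
from typing import Optional


@dataclass(frozen=True)
class PrivateReport:
    """What one delegate tells the mediator in confidence (hypothetical model)."""
    party: str
    utilities: dict[str, float]  # value of each candidate deal to this party
    walk_away: float             # value of reaching no deal at all


def should_escalate(importance: float, confidence: float,
                    importance_threshold: float = 0.7,
                    confidence_threshold: float = 0.8) -> bool:
    """Delegate rule: query the principal only when the issue is important
    AND the delegate is unsure it can model the principal's views faithfully."""
    return importance >= importance_threshold and confidence < confidence_threshold


def propose_deal(reports: list[PrivateReport]) -> Optional[str]:
    """Return the deal with the largest Nash product, or None if no deal
    leaves every party strictly better off than walking away.

    Only the chosen deal's name is returned, so no party's private
    numbers are directly revealed to the others.
    """
    if not reports:
        return None
    # Only consider deals that every party has valued.
    deals = set.intersection(*(set(r.utilities) for r in reports))
    best_deal, best_score = None, 0.0
    for deal in sorted(deals):  # sorted for deterministic tie-breaking
        gains = [r.utilities[deal] - r.walk_away for r in reports]
        if min(gains) <= 0:
            continue  # at least one party would rather walk away
        score = prod(gains)
        if score > best_score:
            best_deal, best_score = deal, score
    return best_deal


# Illustrative data only: a buyer and a seller valuing three price points.
reports = [
    PrivateReport("buyer", {"low": 8, "mid": 5, "high": 1}, walk_away=2),
    PrivateReport("seller", {"low": 1, "mid": 5, "high": 9}, walk_away=3),
]
print(propose_deal(reports))  # mid
```

Here only the "mid" deal beats both parties' walk-away values, so it is the only candidate the mediator can propose.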

Feasibility

Some of the technical challenges to automated negotiation are quite hard:

The kind of security needed for high-stakes applications isn’t possible today.
Getting systems to be deeply aligned with a principal’s best interests, rather than e.g. pursuing the principal’s short-term gratification via sycophancy, is an unsolved problem.

That said, it’s already possible to experiment using current systems, and it may not be long before they start improving on the status quo for human negotiation. Low-stakes applications don’t require the same level of security, and will be a great training ground for how to set up higher stakes systems and platforms. And practical alignment seems good enough for many purposes today.

Possible starting points // concrete projects

Build an AI delegate for yourself or your friends. See if you can get it to usefully negotiate on your behalf with your friends or colleagues. Or, failing that, see if it can help you think through your own negotiation position before you need to communicate with others about it.
Build a negotiation app with good UI. Building on existing LLMs, build an app which helps people think through their negotiation position in a structured way. Focus on great UI.
- This could be non-interactive at first, and just involve communication between a human and the app, rather than between any AI systems.
- But it builds the muscles of a) designing good UI for AI negotiation, and b) people actually using AI to help them with negotiation.
Run a pilot in an org or community you’re part of.
- You could start with fairly low-stakes negotiations, like what temperature to set the office thermostat to or which topics to discuss in a given meeting slot.
- Experimenting with different styles of negotiation (in terms of how high the stakes are, how complex the structure is, and what the domain is) could be very valuable.
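
For the thermostat example, one simple rule to pilot is to set the temperature to the median of the stated preferences. Assuming each person's satisfaction falls off on both sides of a single preferred temperature, reporting a false preference cannot move the median closer to what you really want. That makes it a useful baseline when thinking about incentives to misrepresent. The function name and the numbers are made up for illustration.

```python
from statistics import median_low


def thermostat_setting(preferences):
    """Return the (lower) median of stated temperature preferences in Celsius."""
    return median_low(preferences)


honest = [19.0, 20.0, 21.0, 22.0, 24.0]
print(thermostat_setting(honest))

# The person who truly wants 24 exaggerates to 30: the outcome does not move.
exaggerated = [19.0, 20.0, 21.0, 22.0, 30.0]
print(thermostat_setting(exaggerated))

# The same person under-reports as 18: the outcome moves further from 24.
underreported = [19.0, 20.0, 21.0, 22.0, 18.0]
print(thermostat_setting(underreported))
```

Most real negotiations have more than one dimension, where no rule this simple is available, but a pilot like this gives a clean baseline to compare richer setups against.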

Arbitrarily easy arbitration

Right now, the risk of expensive arbitration makes many deals unreachable. If disputes could be resolved cheaply and quickly using verifiably fair and neutral automated adjudicators, this could unlock massive coordination potential, enabling a multitude of cooperative arrangements that were previously prohibitively costly to make.

Design sketch

An “Arb-as-a-Service” layer plugs into contracts, platforms, and marketplaces. Parties opt in to standard clauses that route disputes to neutral AI adjudicators with a well-deserved reputation for fairness. In the event of a dispute, the adjudicator communicates with parties across private, verifiable evidence channels, investigating further as necessary when there are disagreements about facts. Where possible, it auto-executes remedies (escrow releases, penalties, or structured commitments). Human appeal exists but is rarely needed; sampling audits keep the system honest. Over time, this becomes ambient infrastructure for coordination and governance, not just commerce.

How this could work under the hood:

Agreement ingestion
- Formal or natural language contracts are parsed and key terms extracted, with parties confirming the system’s interpretation before proceeding.
- The system could also suggest pre-dispute modifications to make agreements clearer, flag potentially unenforceable terms, and maintain public precedent databases that help parties understand likely outcomes before committing.
Automated discovery
- When disputes arise, an automated discovery process gathers relevant documentation, transaction logs, and communications from integrated platforms.
- The system offers interviews and the chance to submit further evidence to each party.
Deep consideration
- The system builds models of what different viewpoints (e.g. standard legal precedent; commonsense morality; each of the relevant parties) have to say on the situation and possible resolutions, to ensure that it is in touch with all major perspectives.
- Where there are disagreements, the system simulates debate between reasonable perspectives.
- It makes an overall judgement as to what is fairest.
Transparent reasoning
- The system produces detailed explanations of its conclusions, with precedent citations and counterfactual analysis where appropriate.
(Optional) Smart escrow integration
- Judgements automatically execute through cryptocurrency escrows or traditional payment rails, with graduated penalties for non-compliance.
- In cases where the system detects evidence that is highly likely to be fraudulent, or other attempts to manipulate the system, it automatically adds a small sanction to the judgement, in order to disincentivise this behaviour.
Opportunities for appeal
- Either party can pay a small fee to submit further evidence and have the situation re-considered in more depth by an automated system.
- For larger fees they can have human auditors involved; in the limit they can bring things to the courts.
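
To make the escrow step concrete, here is a minimal sketch of how a judgement might be turned into payouts, including the small sanction for detected manipulation. Everything specific here is an assumption for illustration: the `Judgement` fields, the 5% default sanction rate, and the choice to work in integer cents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    award_to_claimant: float       # fraction of escrow awarded to the claimant, 0..1
    claimant_manipulated: bool     # adjudicator found likely fraudulent evidence
    respondent_manipulated: bool


def settle_escrow(escrow_cents, judgement, sanction_rate=0.05):
    """Split an escrow balance per the judgement, shifting a small sanction
    away from any party found to have tried to manipulate the process."""
    if not 0.0 <= judgement.award_to_claimant <= 1.0:
        raise ValueError("award_to_claimant must be between 0 and 1")
    claimant = round(escrow_cents * judgement.award_to_claimant)
    sanction = round(escrow_cents * sanction_rate)
    if judgement.claimant_manipulated:
        claimant -= sanction
    if judgement.respondent_manipulated:
        claimant += sanction
    claimant = max(0, min(escrow_cents, claimant))  # never pay out more than is held
    return {"claimant": claimant, "respondent": escrow_cents - claimant}


payout = settle_escrow(
    100_000,
    Judgement(0.6, claimant_manipulated=True, respondent_manipulated=False),
)
print(payout)
```

The two payouts always sum to the escrow balance, so a sanction only moves money between the parties and never creates or destroys any.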

Feasibility

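LLMs can already do basic versions of the first four steps above (agreement ingestion, automated discovery, deep consideration, and transparent reasoning), but there are difficult open technical problems in this space:

Judgement: Systems may not currently have good enough judgement to do agreement ingestion, deep consideration, or transparent reasoning in high-stakes contexts (and until recently, they clearly didn’t).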
Real-world evidence assessment: Systems don’t currently know how to handle conflicting evidence provided digitally about what happened in the real world.
Verifiable fairness/neutrality: The full version of this technology would require a level of fairness and neutrality which isn’t attainable today.

Those are large technical challenges, but we think it’s still useful to get started on this technology today, because iterating on less advanced versions of arbitration tech could help us to bootstrap our way to solutions. Particularly promising ways of doing that include:

Starting in lower-stakes or easier contexts (for example, digital-only spaces avoid the challenge of establishing provenance for real-world evidence).
Creating evals, test environments and other infrastructure that helps us improve performance.

On the adoption side, we think there are two major challenges:

Trust: As above, some amount of technical work is needed to make systems verifiably fair/neutral. But even if it becomes true that the systems are neutral, people need to build quite a high level of confidence that the system is genuinely impartial before they’ll bind themselves to its decisions for meaningful stakes.
Legal integration: This tech is only useful to the extent that its arbitration decisions are recognised and enforced as legitimate by the traditional legal system, or are enshrined directly via contract in a self-enforcing way.
- (We are unsure how large a challenge this will be; perhaps contracts written today would already be treated by the courts as robust. But it may be hard for parties to place much trust in them before they have been tested.)

Both of these challenges are reasons to start early (as there might be a long lead time), and to make work on arbitration tech transparent (to help build trust).

Possible starting points // concrete projects

Work with an arbitration firm. Work with (or buy) a firm already offering arbitration services to start automating parts of their central work, and scale up from there.
Work with an online platform that handles arbitration. Use AI to improve their processes, and scale from there.
Create a bot to settle informal disputes. Build an arbitration-as-a-service bot that people can use to settle informal disputes.
Trial a system on internal disputes. This could be at your own organisation, another organisation, or a coalition of early adopter organisations.
Run a pilot in parallel to regular arbitration. Run a pilot where an automated arbitration system is given access to all the relevant information to resolve disputes, and reaches its own conclusions — in parallel to the regular arbitration process, which forms the basis of the actual decision. You could partner with an arbitration firm, or potentially do this through a coalition of early adopter organisations, perhaps in combination with philanthropic funding.
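
For a parallel pilot like the last one, a natural first measurement is how often the automated system agrees with the human arbitrators, corrected for agreement that would happen by chance. The sketch below computes Cohen's kappa over categorical outcomes. The outcome labels and the sample data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(human, automated):
    """Chance-corrected agreement between two lists of categorical outcomes."""
    if len(human) != len(automated) or not human:
        raise ValueError("need two non-empty lists of equal length")
    n = len(human)
    observed = sum(h == a for h, a in zip(human, automated)) / n
    human_freq, auto_freq = Counter(human), Counter(automated)
    expected = sum(
        (human_freq[label] / n) * (auto_freq[label] / n)
        for label in set(human) | set(automated)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


human = ["claimant", "claimant", "respondent", "split", "respondent", "claimant"]
auto = ["claimant", "respondent", "respondent", "split", "respondent", "claimant"]
print(round(cohens_kappa(human, auto), 3))
```

Agreement with human arbitrators is only a proxy for fairness, since the humans may themselves be wrong, but it is cheap to track and easy to explain to prospective adopters.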

Background networking

We can only do things like collaborate, trade, or reconcile if we’re able to first find and recognise each other as potential counterparties. Today, people are brought into contact with each other through things like advertising, networking, even blogging. But these mechanisms are slow and noisy, so many people remain isolated or disaffected, and potentially huge wins from coordination are left undiscovered.[3]

Tech could bring much more effective matchmaking within reach. Personalised, context-sensitive AI assistance could carry out orders of magnitude more speculative matchmaking and networking. If this goes well, it might uncover many more opportunities for people to share and act on their common hopes and concerns.

Design sketch

A ‘matchmaking marketplace’ of attentive, personalised helpers bustles in the background. When they find especially promising potential connections, they send notifications to the principals or even plug into further tools that automatically take the first steps towards seriously exploring the connection.

You can sign up as an individual or an existing collective. If you just want to use it passively, you give a delegate system access to your social media posts, search profiles, chatbot history, etc. — so this can be securely distilled into an up-to-date representation of hopes, intent, and capabilities. The more proactive option is to inject deliberate ‘wishes’ through chat and other fluent interfaces.

Under the hood, there are a few different components working together:

Interoperable, secure ‘wish profiling’ systems which identify what different participants want.
- People connect their profiles on existing services (social media, chatbot logs, email, etc).
- LLM-driven synthesis (perhaps combined with other forms of machine learning) curates a private profile of user desires.
- Optionally, chatbot-style assistance can interview users on the points of biggest uncertainty, to build a more accurate profile.
A searchable ‘wish registry’ which organises large collections of wants and offers, while maintaining semi-privacy.
- Each user’s delegate can run searches against the registry, finding potential matches and surfacing only enough information about them to know whether they are worth exploring further.
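
A toy version of the registry can make the "semi-privacy" idea concrete. In the sketch below, profiles are reduced to sets of tags, and a search returns only an opaque handle and an overlap count for each match, not the other party's identity or full profile. The class names, the tag representation, and the overlap score are all assumptions for illustration. A real system would need far richer profiles and real access control.

```python
from dataclasses import dataclass, field


@dataclass
class WishProfile:
    user_id: str
    wants: set[str] = field(default_factory=set)
    offers: set[str] = field(default_factory=set)


class WishRegistry:
    """Toy registry: matches wants against offers and reveals only overlap counts."""

    def __init__(self):
        self._profiles = {}

    def register(self, profile):
        self._profiles[profile.user_id] = profile

    def search(self, user_id, min_overlap=1):
        """Return (opaque handle, overlap count) pairs, best matches first.

        The caller never sees the other party's full profile or real id, only
        a per-search handle and how many of their wants or offers line up.
        """
        me = self._profiles[user_id]
        scored = []
        for other_id, other in self._profiles.items():
            if other_id == user_id:
                continue
            overlap = len(me.wants & other.offers) + len(me.offers & other.wants)
            if overlap >= min_overlap:
                scored.append((overlap, other_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(f"match-{i}", overlap) for i, (overlap, _) in enumerate(scored, 1)]


registry = WishRegistry()
registry.register(WishProfile("ana", wants={"editor", "cofounder"}, offers={"design"}))
registry.register(WishProfile("ben", wants={"design"}, offers={"editor"}))
registry.register(WishProfile("cy", wants={"funding"}, offers={"catering"}))
print(registry.search("ana"))
```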

Feasibility

A big challenge here is privacy and surveillance. Doing background networking comprehensively requires sensitive data on what individuals really want. This creates a double-edged problem:

If sensitive data is too broadly available, it can be used for surveillance, harassment, or exploitation; including by big corporations or states.
If sensitive data is completely private, it opens up the possibility of collusion, for example among criminals.

This is a pretty challenging trade-off, with big costs on both sides. Perhaps some kind of filtering system which determines who can see which bits of data could be used to prevent data extraction for surveillance purposes while maintaining enough transparency to prevent collusion.

Ultimately, we’re not sure how best to approach this problem. But we think that it’s important that people think more about this, as we expect that by default, this sort of technology will be built anyway in a way that isn’t sufficiently sensitive to these privacy and surveillance issues. Early work which foregrounds solutions to these issues could make a big difference.

Other potential issues seem easier to resolve:

Technically, background networking tools already seem within reach using current systems. Large-scale deployments would require indexing and registry, but it seems possible to get started on these using current systems.
- One note is that it seems possible to implement background networking in either a centralised or a decentralised way. It’s not clear which is best, though decentralised implementations will be more portable.
Adoption also seems likely to work, because there are incentives for people to pay to discover trade and cooperation opportunities they would otherwise have missed, analogous to exchange or brokerage fees. There are some trickier parts, but we expect them to be surmountable in the end (the timing of adoption seems more uncertain than whether it happens at all):
- In the early stages when not many people are using it, the value of background networking will be more limited. Possible responses include targeting smaller niches initially, and proactively seeking out additional network beneficiaries.
- It’s harder to incentivise people to pay for speculative things like uncovering groups they’d love that don’t yet exist. You could get around this using entrepreneurial or philanthropic speculation (compare the dominant assurance contract model and related payment incentivisation schemes).
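
For readers unfamiliar with dominant assurance contracts: backers pledge towards a funding threshold, and if the threshold is missed, the entrepreneur collects nothing and pays each backer a small bonus. Pledging is then attractive even when success is uncertain. The sketch below settles one such contract. The function name and the amounts are hypothetical.

```python
def settle_assurance_contract(pledges, threshold, refund_bonus):
    """Settle a dominant assurance contract.

    pledges: mapping of backer -> amount pledged
    threshold: total needed for the project to go ahead
    refund_bonus: extra paid by the entrepreneur to each backer on failure
    Returns (funded, payments), where payments are net transfers to backers
    (negative means the backer pays in).
    """
    total = sum(pledges.values())
    if total >= threshold:
        return True, {backer: -amount for backer, amount in pledges.items()}
    # Threshold missed: pledges are never collected and each backer gets the bonus.
    return False, {backer: refund_bonus for backer in pledges}


print(settle_assurance_contract({"a": 40, "b": 30}, threshold=100, refund_bonus=2))
print(settle_assurance_contract({"a": 40, "b": 30, "c": 40}, threshold=100, refund_bonus=2))
```

In the first call the threshold is missed, so the entrepreneur is out of pocket by the bonuses. That cost is the speculative risk that an entrepreneur or philanthropist would be taking on.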

Possible starting points // concrete projects

Work with existing matchmakers to improve their offering. Find groups that are already doing matchmaking and are eager for better systems — perhaps among community organisers, businesses, recruiters or investors. Work with them to understand the pain points in their current networking, and what automated offerings would be most appealing. Then build those tools and systems.
Build a networking tool for a specific community. Build a custom networking system for a particular group or subculture. For example, this could look like a networking app or a plug-in to an existing online forum. This could start delivering value fairly quickly, and provide a good opportunity for iteration.

Structured transparency for democratic oversight

Today, citizens in democracies have limited mechanisms to verify whether institutions’ public claims are consistent with their internal evidence:

The baseline is highly opaque.
Freedom of information systems help, but can be evaded by non-cooperating institutions.
Public inquiries can be reasonably thorough, but are expensive and slow.
Full transparency has many costs and is typically highly resisted.

This is costly — e.g. the UK Post Office scandal over its Horizon IT system led to hundreds of wrongful prosecutions that could have been avoided. And it creates bad incentives for those running the institutions.

AI has the potential to change this. Instead of oversight being expensive, reactive, and slow, automated systems could in theory have real-time but sandboxed access to institutional data, routinely reviewing operational records against public claims and surfacing inconsistencies as they emerge.

Where confidential monitoring helps willing parties verify each other, structured transparency for democratic oversight aims to hold institutions accountable to the broader public.[4]

Design sketch

When an oversight body wants to verify facts about the behaviour of another institution, it requests comprehensive data about the internal operations of that institution. AI systems are tasked with careful analysis of the details, flagging the type and severity of any potential irregularities. Most of the data never needs human review.

In the simpler version, this is just a tool which expands the capacity of existing oversight bodies. Even here, the capacity expansion could be relatively dramatic — this kind of semi-structured data analysis is the kind of work that AI models can excel at today — without needing to trust that the systems are infallible (since the most important irregularities will still have human review).

A more ambitious version treats this as a novel architecture for oversight. AI systems operate continuously within secure environments that don’t give any humans access to the full dataset. They can flag inconsistencies as institutional data is deposited rather than waiting for an investigation to begin. For maximal transparency, summaries could be made available to the public in real-time, without revealing any confidential information that the public does not have rights to.

Under the hood, this might involve:

Secure data repositories, such that institutions routinely share operational data with a sandboxed environment operated by or on behalf of the oversight body, without any regular human access to the data.
Continuous ingestion and indexing of institutional public outputs (press releases, regulatory filings, budget documents, etc.) into a searchable database.
Automated cross-referencing between public claims and internal records.
Highlighting of potential issues (mismatches between public statements and private information, as well as decisions made in violation of normal procedures).
Further automated investigation of potential issues, escalating to humans when an issue is both sufficiently large and flagged with sufficient confidence.
Importantly, the sandbox outputs its findings but not the underlying data; if there is need for transparency on that, this is a separate oversight question.
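
The cross-referencing and "findings but not data" steps can be sketched in a few lines. The example below assumes, purely for illustration, that public claims and internal records have already been reduced to named numeric metrics. It compares them and returns findings that name the claim and a severity, but never the internal figure. The class names, thresholds, and metric names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicClaim:
    claim_id: str
    metric: str
    stated_value: float


@dataclass(frozen=True)
class Finding:
    claim_id: str
    metric: str
    severity: str  # "minor" or "major"; no internal figures are included


def cross_reference(claims, internal_records, tolerance=0.05, major_at=0.25):
    """Compare public claims with internal figures inside the sandbox.

    internal_records maps metric -> internal value. Only Finding objects leave
    this function; the internal values themselves are never returned.
    """
    findings = []
    for claim in claims:
        internal = internal_records.get(claim.metric)
        if internal is None:
            continue  # nothing to check against; a real system might flag the gap
        scale = max(abs(internal), 1e-9)
        gap = abs(claim.stated_value - internal) / scale
        if gap > tolerance:
            severity = "major" if gap >= major_at else "minor"
            findings.append(Finding(claim.claim_id, claim.metric, severity))
    return findings


claims = [
    PublicClaim("c1", "on_time_rate", 0.95),
    PublicClaim("c2", "budget_spent_m", 102.0),
    PublicClaim("c3", "complaints", 1000.0),
]
internal = {"on_time_rate": 0.70, "budget_spent_m": 100.0, "complaints": 1100.0}
for finding in cross_reference(claims, internal):
    print(finding)
```

Even a severity label leaks some information about the underlying data, so a real design would need to decide how coarse the published findings should be.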

Feasibility

There are two important aspects to feasibility here: technical and political.

Technically, decent reliability at the core functionality is possible today. Getting up to extremely high reliability, so that the system could be trusted not to flag too many false positives across very large amounts of data, might be a reach with present systems. But it is exactly the kind of capability that commercial companies have incentives to develop for business use.
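
A quick base-rate calculation shows why false positives matter so much at scale. The numbers below are hypothetical and chosen only to illustrate the arithmetic: when genuine problems are rare, even a small false-positive rate means most flags are false alarms.

```python
def expected_flags(n_records, prevalence, recall, false_positive_rate):
    """Expected true flags, false flags, and precision for a screening system."""
    true_issues = n_records * prevalence
    true_flags = true_issues * recall
    false_flags = (n_records - true_issues) * false_positive_rate
    flagged = true_flags + false_flags
    precision = true_flags / flagged if flagged else 0.0
    return true_flags, false_flags, precision


# Hypothetical numbers: 1,000,000 records, 1 in 10,000 problematic,
# 90% of real issues caught, 1% of clean records wrongly flagged.
true_flags, false_flags, precision = expected_flags(1_000_000, 1e-4, 0.9, 0.01)
print(round(true_flags), round(false_flags), round(precision, 4))
```

Under these assumptions, fewer than one flag in a hundred would point at a real issue, which is why further automated investigation before anything reaches a human is an important part of the design.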

Political feasibility may vary a lot with the degree of ambition. The simplest versions of this technology might in many cases simply be adopted by existing oversight bodies to speed up their current work. Anything which requires them getting much more data (e.g. to put in the sandboxed environments) might require legislative change — which may be more achievable after the underlying technology can be shown to be highly reliable.

Challenges include:

Adversarial dynamics: the technical bar to verify claims against actively adversarial institutions (who are manipulating deposited data, potentially via AI) is substantially higher.
- This is the bar that we’d need to reach for confidential monitoring below.
Defamation risk: the downsides of false positives, where the system reports that someone misrepresented things when they did not, could be significant (although this can perhaps be mitigated by giving people a right of rebuttal, in which they give further data to the AI systems which monitor the confidential data streams).
Avoiding abuse: designing the systems so that they do not expose the confidential data, and cannot be weaponised to ruin the reputation of a department with very normal levels of error.

Ultimately the more transformative potential from this technology comes in the medium term, with new continuous data access for oversight bodies. But this is likely to require legislative change, and the institutions subject to it may resist. Perhaps the most promising adoption pathway is to demonstrate value through voluntary pilots with oversight bodies that already have data access and want better tools. This could build the evidence base (and hence political constituency) for wider and deeper deployment.

Possible starting points // concrete projects

Retrospective validation on historical cases. Apply consistency-checking tools to document sets from well-understood historical cases where the relevant internal documents have subsequently been released (e.g. Enron emails). This builds the technical foundation, and demonstrates the concept without requiring any current institutional access.
Institutional public statement reliability tracker. Build a tool tracking whether agencies’ public claims about performance, spending, or policy outcomes are consistent with publicly available data — statistical releases, budget documents, prior statements. Start with a single policy domain. This requires no institutional partnerships and builds a public constituency for structured transparency. This is a version of reliability tracking, applied specifically to institutional accountability.
Pilot a FOIA exemption assessment tool. Partner with an Inspector General office to build a tool that reviews withheld documents and assesses whether claimed exemptions (national security, personal privacy, deliberative process) are applied appropriately. The IG already has legal access under the Inspector General Act; the tool helps them do their existing job faster and builds the working relationship needed for more ambitious deployments. This is also a natural testbed for the sandboxed architecture in miniature — the tool operates within the IG’s secure environment, producing exemption-appropriateness findings without the documents themselves leaving the system.

Confidential monitoring and verification

Monitoring and verifying that a counterparty is keeping up their side of the deal is currently expensive and noisy. Many deals currently aren’t reachable because they’re too hard to monitor. Confidential AI-enabled monitoring and verification could unlock many more agreements, especially in high-stakes contexts like international coordination where monitoring is currently a bottleneck.

Design sketch

When organisation A wants to make credible attestations about their work to organisation B, without disclosing all of their confidential information, they can mutually contract an AI auditor, specifying questions for it to answer. The auditor will review all of A’s data (making requests to see things that seem important and potentially missing), and then produce a report detailing:

Its conclusions about the specified questions.
The degree to which it is satisfied that it had good data access, that it didn’t run into attempts to distort its conclusions, etc.

This report is shared with A and B, then A’s data is deleted from the auditor’s servers.

Under the hood, this might involve:

Building a Bayesian knowledge graph, establishing hypotheses, and understanding what evidence suggests about those hypotheses.
Agentic investigatory probes into the confidential data, in order to form grounded assessments on the specified questions.
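
The core update behind a Bayesian treatment of evidence can be shown for a single hypothesis. The sketch below assumes the pieces of evidence are independent given the hypothesis, which a full knowledge graph would not need to assume, and that each comes with a likelihood ratio. The function name, the example hypothesis, and the ratios are illustrative only.

```python
import math


def update_log_odds(prior_probability, likelihood_ratios):
    """Update belief in a hypothesis given independent pieces of evidence.

    Each likelihood ratio is P(evidence | hypothesis) / P(evidence | not hypothesis).
    Returns the posterior probability.
    """
    if not 0.0 < prior_probability < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior_probability / (1.0 - prior_probability))
    for ratio in likelihood_ratios:
        if ratio <= 0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothesis: "A met its reporting obligations". Prior 50%; two supporting
# documents (ratios 4 and 3) and one mildly contradicting log entry (ratio 0.5).
posterior = update_log_odds(0.5, [4.0, 3.0, 0.5])
print(round(posterior, 3))
```

The hard part in practice is estimating the likelihood ratios, especially when the party being audited might be shaping what evidence is visible.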

More ambitious versions might hope to obviate the need for trust in a third party, and provide reasons to trust the hardware — that it really is running the appropriate unbiased algorithms, that it cannot send side-channel information or retain the data, etc. Perhaps at some point you could have robot inspectors physically visiting A’s offices, interviewing employees, etc.

Feasibility

Compared to some of the other technologies we discuss, this feels technologically difficult, in that the really useful versions of the tech may need very high reliability of certain types.

Nonetheless, we could hope to lay the groundwork for the general technological category now, so that people are well-positioned to move towards implementing the mature technology as early as is viable. Some low-confidence guesses about possible early applications include:

Legal audits — for example, claims that the documents not disclosed during a discovery process are only those which are protected by privilege.
Financial audits — e.g. for the purpose of proving viability to investors without disclosing detailed accounts.
Supply chain verification — e.g. demonstrating that products were ethically sourced without exposing the suppliers.

Possible starting points // concrete projects

Start building prototypes. Build a system which can try to detect whether it is operating in a real or a counterfeit environment, and measure its success.
Work with a law or financial auditing firm. Work with (or buy) a firm that does this kind of work, and experiment with how to robustly automate while retaining very high levels of trustworthiness.
Explore the viability of complementary technology. For example, you could investigate the feasibility of demonstrating exactly what code is running on a particular physical computer that is in the room with both parties.
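
As a very small first step towards the last idea, both parties can agree in advance on a cryptographic digest of the audit code and check the code presented on the machine against it. The sketch below does only that. It shows which bytes were presented, not what the hardware actually executes, so it is no substitute for hardware-level attestation. The function names and the sample code string are hypothetical.

```python
import hashlib
import hmac


def code_digest(source_bytes):
    """SHA-256 digest of the exact bytes of the code to be run."""
    return hashlib.sha256(source_bytes).hexdigest()


def matches_agreed_digest(source_bytes, agreed_digest):
    """Check the presented code against the digest both parties signed off on."""
    return hmac.compare_digest(code_digest(source_bytes), agreed_digest)


audit_code = b"def audit(data):\n    return len(data) > 0\n"
agreed = code_digest(audit_code)  # exchanged in advance by A and B

print(matches_agreed_digest(audit_code, agreed))
print(matches_agreed_digest(audit_code + b"# tampered\n", agreed))
```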

Cross-cutting thoughts

Some cross-cutting technologies

We’ve pulled out some specific technologies, but there’s a whole infrastructure that could eventually be needed to support coordination (including but not limited to the specific technologies we’ve sketched above). Some cross-cutting projects which seem worth highlighting are:

AI delegates and preference elicitation

Many of the technologies we sketched above either benefit from or require agentic AI delegates who can represent and act for a human principal. Developing customisable platforms could be useful for multiple kinds of tech, like background networking, fast facilitation, and automated negotiation.

Some ways to get started:

Direct preference elicitation: develop efficient and appealing interview-style elicitation of values, wishes, preferences and asks.
Passive data ingestion: build a tool that (consensually) ingests and distils all the available online content about a person (social media, browsing history, email, etc.) and extracts principles from it (cf. inverse constitutional AI).

One clarification is that though agentic AI delegates would be useful for some of the coordination tech above, it needn’t be the same delegate doing the whole lot for a single human:

You could have different delegates for different applications.
Some delegates might represent groups or coalitions.
Some delegates could be short-lived, and spun up for some particular time-bounded purpose.

Charter tech

A lot of coordination effort between people and organisations goes not into making better object-level decisions, but into establishing the rules or norms for future coordination (e.g. votes on changing the rules of an institution). It is possible that coordination tech will change this basic pattern, but as a baseline we assume that it will not. In that case, making such meta-level coordination go well would also be valuable.

One way to help it go well is by making the governance dynamics more transparent. Voting procedures, organisational charters, platform policies, treaty provisions, etc. create incentives and equilibria that play out over time, often in ways the framers didn’t anticipate. Let’s call any technology which helps people to better understand governance dynamics, or to make those dynamics more transparent, ‘charter tech’. In some sense this is a form of epistemic tech; but as the applications are always about coordination, we have chosen to group it with other coordination technologies. We think charter tech could be important in two ways:

Through directly improving the governance dynamics in question, helping to avoid capture, conflict, and lock-in.
Through compounding effects on future coordination, which will unfold in the context of whatever governance structures are in place.

Charter tech could be used in a way that is complementary to any of the above technologies (if/when they are used for governance-setting purposes), although it can also stand alone.

For the sake of concreteness, here is a sketch of what charter tech could look like:

A “governance dynamics analyser” that ingests descriptions of constitutions, charters, policies or community norms, builds models of power, incentives, and information flow, and then (a) forecasts likely equilibria and failure modes, (b) red-teams for strategic abuse,[5] and (c) proposes safer rule variants that preserve the framers’ intent.[6]

While this tool can be called actively if needed, there is also a classifier running quietly in the background of organisational docs/emails, and when it detects a situation where power dynamics and governance rules are relevant, it runs an assessment — promoting this to user attention just in cases where the proposed rules are likely to be problematic.

Note that charter tech could be used to cause harm if access isn’t widely distributed. Vulnerabilities can be exploited as well as patched, and a tool that makes it easier to identify governance vulnerabilities could be used to facilitate corporate capture, backsliding or coups. Provided the technology is widely distributed and transparent, we think that charter tech could still be very beneficial — particularly as there may be many high-stakes governance decisions to make in a short period during an intelligence explosion, and the alternative of ‘do our best without automated help’ seems pretty non-robust.

Some ways to get started on using AI to make governance dynamics more transparent:

Work with communities that iterate frequently on governance (DAOs, open-source projects) to test analyses against what actually happens when rules change.
Compile a pattern library of governance failures and successes, documented in enough detail to inform automated analysis.
Build simulation environments where proposed rules can be stress-tested against populations of agents with varying goals, including adversarial ones.
Partner with mechanism design researchers to identify which aspects of their formal analysis can be automated and applied to less formal real-world documents.
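
Even a tiny stress test can surface a governance vulnerability. The sketch below models one hypothetical amendment rule (a quorum plus a supermajority of votes cast) and asks how small a bloc can change the rules, first when everyone else stays home and then when everyone else votes against. The rule structure, names, and numbers are assumptions for illustration. Real charters have many more moving parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmendmentRule:
    members: int
    quorum: float      # minimum fraction of members who must vote
    threshold: float   # minimum fraction of votes cast that must be "yes"


def passes(rule, yes, no):
    """Apply the rule to a vote tally."""
    cast = yes + no
    if cast == 0 or cast < rule.quorum * rule.members:
        return False
    return yes >= rule.threshold * cast


def smallest_capturing_bloc(rule):
    """Smallest bloc that can pass an amendment if everyone else stays home."""
    for bloc in range(1, rule.members + 1):
        if passes(rule, yes=bloc, no=0):
            return bloc
    return None


def smallest_bloc_at_full_turnout(rule):
    """Smallest bloc that wins even when every other member votes no."""
    for bloc in range(1, rule.members + 1):
        if passes(rule, yes=bloc, no=rule.members - bloc):
            return bloc
    return None


rule = AmendmentRule(members=200, quorum=0.10, threshold=2 / 3)
print(smallest_capturing_bloc(rule), smallest_bloc_at_full_turnout(rule))
```

With these made-up numbers, a bloc of 20 is enough to amend the rules at a poorly attended vote, against 134 at full turnout. A gap like that is the kind of failure mode an automated analysis could flag before a charter is adopted.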

Adoption pathways

Many of these technologies will be directly incentivised economically. There are clear commercial incentives to adopt faster, cheaper methods of facilitation, negotiation, arbitration, and networking.

However, adoption seems more challenging in two important cases:

Adoption by governments and broader society. Many of the most important benefits of coordination tech for society will come from government and broad social adoption, but these groups will be less impacted by commercial incentives. This bites particularly hard for technologies that could be quite expensive in terms of inference compute, like fast facilitation, arbitration and negotiation. By default, these technologies might differentially help wealthy actors, leaving complex societal-level coordination behind. We think that the big levers on this set of challenges are:
- Building trust and legitimacy earlier, by getting started sooner, building transparently, and investing in evals and other infrastructure to demonstrate performance.
- Targeting important niches that might be slower to adopt by default. More research would be good here, but two niches that seem potentially important are:
  - Coordination among and between very large groups, like whole societies. This might be both strategically important and lag behind by default.
  - International diplomacy. Probably coordination tech will get adopted more slowly in diplomacy than in business, but there might be very high stakes applications there.
Adoption of confidential monitoring and structured transparency. These technologies are less accessible with current models and may require large upfront investments, while many of the benefits are broadly distributed.
- This makes it less likely that commercial incentives alone will be enough, and makes philanthropic and government funding more desirable.

Other challenges

The big challenge is that coordination tech (especially confidential coordination tech) is dual use, and could empower bad actors as much as or more than good ones.

There are a few ways that coordination tech could lead to shifts in the balance of power (positive or negative):

Some actors could get earlier and/or better access to coordination tech than others.[7]
Actors that face particular barriers to coordination today could be asymmetrically unblocked by coordination tech.
Individuals and small groups could become more powerful relative to the coordination mechanisms we already have, like organisations, ideologies, and nation states.

It’s inherently pretty tricky to determine whether these power shifts would be good or bad overall, because that depends on:

Value judgements about which actors should hold power.
How contingent power dynamics play out.
Big questions like whether ideologies or states are better or worse than the alternatives.
Predictions about how social dynamics will equilibrate in an AI era that looks very different to our world.

However, as we said above, it’s clear that coordination tech might have significant harmful effects, through enabling:

Large corporations to collude with each other against the interests of the rest of society.[8]
A small group of actors to plot a coup.
More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don’t incentivise prosociality to the same extent.

This challenge is serious, but we don’t think that it is insurmountable, for a few reasons:

The upsides are very large. Coordination tech might be close to necessary for safely navigating challenges like the development of AGI, and could empower actors to coordinate against the kinds of misuse listed above.
The counterfactual is that coordination tech is developed anyway, but with less consideration of the risks and less broad deployment. We think that this set of technologies is going to be sufficiently useful that it’s close to inevitable that they get developed at some point. By engaging early with this space, we can have a bigger impact on a) which versions of the technology are developed, b) how seriously the downsides are taken by default, c) how soon these systems are deployed broadly.
Some applications seem robustly good. For example, the potential for misuse is low for technologies like transparent facilitation or widely deployed charter tech. More generally, we expect that projects that are thoughtfully and sensitively run will be able to choose directions which are robustly beneficial.

That said, we think this is an open question, and would be very keen to see more analysis of the possible harms and benefits of different kinds of coordination tech, and which versions (if any) are robustly good.

This article has gone through several rounds of development, and we experimented with getting AI assistance at various points in the preparation of this piece. We would like to thank Anthony Aguirre, Alex Bleakley, Max Dalton, Max Daniel, Raymond Douglas, Owain Evans, Kathleen Finlinson, Lukas Finnveden, Ben Goldhaber, Ozzie Gooen, Hilary Greaves, Oliver Habryka, Isabel Juniewicz, Will MacAskill, Julian Michael, Justis Mills, Fin Moorhouse, Andreas Stuhlmüller, Stefan Torges, Deger Turan, Jonas Vollmer, and Linchuan Zhang for their input, and we apologise to anyone we’ve forgotten.

We’re highlighting six particular technologies, and clustering them all as ‘coordination technologies’. Of course in reality some of the technologies (and clusters) blur into each other, and they’re just examples in a high-dimensional possibility space, which might include even better options. But we hope by being concrete we can help more people to start seriously thinking about the possibilities.

For example, in a similar way to that described in the intelligence curse.

Meanwhile small cliques with clear interests often have an easier time identifying and therefore acting on their shared interests — in extreme cases resulting in harmful cartels, oligarchies, and so on. That’s also why tyrants throughout history have sought to limit people’s networking power.

Both confidential monitoring and what we are calling structured transparency for democratic oversight are aspects of structured transparency in the way that Drexler uses the term.

This red-teaming could be arbitrarily elaborate, from simple LM-based once-over screening to RAG-augmented lengthy analysis to expansive simulation-based probing and stress-testing.

Under the hood, this might involve:

Parsing & modelling the rules
- Convert informal descriptions or formal rules into a typed governance graph: roles, permissions, decision thresholds, delegation, auditability, and recourse
- Note uncertainties; seek clarification or highlight ambiguities
A search for possible issues
- Pattern library of classic failure modes (agenda control, principal–agent issues, collusion, etc.)
  - Assessment of potential vulnerability to the different failure modes
First-principles analysis
- Running direct searches for abuse, or multi-agent simulations (including some nefarious actors) to stress-test the proposed system
Explainer
- Distilling down the output of the analysis into a few key points
  - Providing auditable evidence where relevant
- Including points about how variations of the mechanism might make things better or worse
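
As a rough illustration of the first two steps above, here is a minimal Python sketch of a typed governance graph with a tiny pattern library. Everything in it is a hypothetical stand-in: the `Role` fields, the two checks, and the example charter are assumptions made for illustration, not a description of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Role:
    """One node in a hypothetical typed governance graph."""
    name: str
    permissions: set = field(default_factory=set)
    delegates_to: set = field(default_factory=set)  # roles that may act on this role's behalf
    audited_by: set = field(default_factory=set)    # roles that may review this role's actions


def unaudited_power(roles, sensitive):
    """Roles holding a sensitive permission that no *other* role audits."""
    return sorted(
        r.name for r in roles.values()
        if r.permissions & sensitive and not (r.audited_by - {r.name})
    )


def delegation_cycles(roles):
    """Roles that can reach themselves by following delegation edges."""
    def reachable(start):
        seen, stack = set(), list(roles[start].delegates_to)
        while stack:
            current = stack.pop()
            if current in seen or current not in roles:
                continue
            seen.add(current)
            stack.extend(roles[current].delegates_to)
        return seen
    return sorted(name for name in roles if name in reachable(name))


def scan(roles, sensitive):
    """Run every check in the (very small, assumed) pattern library."""
    return {
        "unaudited_power": unaudited_power(roles, sensitive),
        "delegation_cycle": delegation_cycles(roles),
    }


# Hypothetical charter: the chair only audits themselves, and chair and
# treasurer delegate to each other.
charter = {
    "chair": Role("chair", {"set_agenda", "spend_funds"}, {"treasurer"}, {"chair"}),
    "treasurer": Role("treasurer", {"spend_funds"}, {"chair"}, {"auditor"}),
    "auditor": Role("auditor", {"read_ledger"}),
}
findings = scan(charter, sensitive={"spend_funds", "set_agenda"})
print(findings)
```

On this toy charter the scan flags the chair as holding unaudited power and flags the circular delegation between chair and treasurer. A real system would need a far richer rule model, plus the simulation and explanation steps described above.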

Note that this is significantly a question about adoption pathways as discussed in the previous section, rather than an independent question.

For example, in a similar way to that described in the intelligence curse.

AI for AI for Epistemics

Owen Cotton-Barratt — Wed, 01 Apr 2026 16:11:23 GMT

This article was created by Forethought. See the original on our website.

We feel conscious that rapid AI progress could transform all sorts of cause areas. But we haven’t previously analysed what this means for AI for epistemics, a field close to our hearts. In this article, we attempt to rectify this oversight.

Summary

AI-powered tools and services that help people figure out what’s true (“AI for epistemics”) could matter a lot.

As R&D is increasingly automated, AI systems will play a larger role in the process of developing such AI-based epistemic tools. This has important implications. Whoever is willing to devote sufficient compute will be able to build strong versions of the tools, quickly. Eventually, the hard part won’t be building useful systems, but making sure people trust the right ones, and making sure that they are truth-tracking even in domains where that’s hard to verify.

We can do some things now to prepare. Incumbency effects mean that shaping the early versions for the better could have persistent benefits. Helping build appetite among socially motivated actors with deep pockets could enable the benefits to come online sooner, and in safer hands. And in some cases, we can identify particular things that seem likely to be bottlenecks later, and work on those directly.

Background: AI for epistemics

AI for epistemics — i.e. getting AI systems to give more truth-conducive answers, and building tools that help the epistemics of the users — seems like a big deal to us. Some past things we’ve written on the topic include:

These past articles mostly take the perspective of “how can people build AI systems which do better by these lights?”. But maybe we should be thinking much more about what changes when people can use AI tools to do increasingly large fractions of the development work!

The shift in what drives AI-for-epistemics progress

Right now, AI-for-epistemics tools are constrained by two main bottlenecks: the quality of the underlying AI systems, and whether people have invested serious development effort in building the tools to use those systems.

The balance of bottlenecks is changing. Two years ago, the quality of underlying AI systems was the central bottleneck. Today, it is much less so — many useful tools could probably work based on current LLMs. It is likely still a constraint on how good the systems can be, and will remain so for a while even as the underlying models get stronger, but it is less of a fundamental blocker. Development investment has therefore become a bigger bottleneck — there are a number of applications which we are pretty confident could be built to a high usefulness level today, and just haven’t been (yet).

But bottlenecks will continue to shift. AI is increasingly driving research and software development. As AI systems get stronger, it may become possible to turn a large compute budget into a lot of R&D. This could include product design, engineering, experiment design, direction-setting, etc. Actors with lots of compute could direct this towards building epistemic tools.

Therefore, as AI-driven R&D accelerates, other inputs to AI for epistemics are more likely to become key bottlenecks:

Compute. Automated R&D may require a lot of compute. This could be for inference (running the analogues of human researchers); for running experiments; and perhaps for training specialized AI systems. This means the actors who can build the best epistemic tools may be those with deep pockets.
Adoption and trust. Even very good tools don’t help if nobody uses them, or if the wrong people use them and the right people don’t. Adoption is partly a function of trust, and trust is partly a function of adoption — early tools shape what people come to rely on.
Ground truth evaluation. To make an epistemic tool good, you need some signal for what “good” means. This already shapes AI applications a lot — part of the reason coding agents are so good is that there’s great access to ground truth about what works.
- For some epistemic applications this is relatively straightforward (e.g. forecasting accuracy). For others it’s hard (e.g. what makes a conceptual clarification actually clarifying, rather than just satisfying?).
- Most tools can probably reach a certain degree of usefulness without running into this problem, just piggybacking on base models making generally sensible judgements.
- We can expect it to bite when you try to make them very good: if you don’t have a way of assessing quality, it could be hard to push to objectively excellent levels.
- One basic solution is to rely on human judgement: either via humans providing labels and demonstrations to train against, or via human developers exercising their judgement in other parts of the process (such as when defining scaffolds). But this becomes disproportionately more expensive as R&D becomes more automated.
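
To make the contrast concrete: for forecasting, a standard accuracy signal is the Brier score, the mean squared error between stated probabilities and what actually happened. The sketch below is a minimal illustration with made-up forecasts; the variable names and numbers are assumptions, not data from any real tool.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


# Hypothetical resolved questions: 1 means the event happened, 0 means it did not.
outcomes = [1, 0, 1, 1]
confident_tool = [0.9, 0.1, 0.8, 0.7]  # made-up forecasts from one tool
hedging_tool = [0.5, 0.5, 0.5, 0.5]    # a tool that always says 50%

confident_score = brier_score(confident_tool, outcomes)  # roughly 0.0375
hedging_score = brier_score(hedging_tool, outcomes)      # 0.25
print(confident_score, hedging_score)
```

The score is only computable because `outcomes` exists. For conceptual clarification there is no obvious analogue of that list, which is the ground truth problem described above.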

These basic points are robust to whether R&D is fully automated, or “merely” represents a large uplift to human researchers. But the most important bottlenecks will vary across applications and will continue to shift over time.

What this unlocks

Automated R&D means that strong “AI for epistemics” tools could come online on a compressed timeline.

This is an exciting opportunity! Upgrading epistemics could better position us to avoid existential risk and navigate through the choice transition well.

If everything is moving fast, the exact sequence in which we get capabilities may matter a lot. It may therefore be crucial to make serious investments in building these powerful applications, rather than waiting until they are trivially cheap.

Risks from rapid progress in AI for epistemics

There are also a number of ways that rapid (and significantly automated) progress in AI-for-epistemics applications could go wrong. We need to be tracking these in order to guard against them.

In our view, the two biggest risks are:

Epistemic misalignment: because of ground truth issues, powerful tools steer our thoughts in directions other than those which are truth-tracking, in ways that we fail to detect
Trust lock-in: if a lot of people buy into trusting tools or ecosystems that don’t deserve that trust, this might be self-perpetuating if these continue to recommend themselves

Epistemic misalignment

Depending on when they bite, ground truth problems as discussed above could be bottlenecks, or active sources of risk. They are bottlenecks if they prevent people from building strong versions of tools. They could become risks if the methods are good enough to allow for bootstrapping to something strong, but end up pointing in the wrong direction. This is essentially Goodhart’s law — we might get something very optimized for the wrong thing (and without even knowing how to detect that it’s subtly wrong).

In the limit, this could lead to humans or AI systems making extremely consequential decisions based on misguided epistemic foundations. For example, they might give over the universe to digital minds that are not conscious — or in the other direction, fail to treat digital minds with the dignity and moral seriousness they deserve. Wei Dai has written about this concern in terms of the importance of metaphilosophy. We agree that there is a crucial concern here.

This could come separately from or together with risks from power-seeking misaligned AI. Epistemic tools could be systematically misleading without being power-seeking. But if some AI systems are misaligned and power-seeking, there’s an additional concern where AI systems could mislead us in ways specifically designed to disempower us whenever we are unable to check their answers.

Some approaches to the ground truth problem may involve using AI systems to make judgements about things. This introduces a regress problem: how can we ensure that subtle errors in the first AI systems shrink rather than compound into worse problems as the process plays out? (We return to this in the interventions section below.)

Trust lock-in

Trust and adoption tend to reinforce each other — people adopt tools they trust, and widely-adopted tools accumulate trust. This is normally fine. It could become a problem if the tools that win early trust don’t deserve it, but incumbency effects make them hard to displace.

This could happen in several ways. An actor with a particular agenda could build something that purports to function as a neutral epistemic aid but is shaped to further their agenda by manipulating others. Or, less perniciously but perhaps more likely, an early-but-mediocre tool could accumulate trust and adoption before better alternatives exist, reinforced by commercial incentives which mean it talks itself up and rival tools down. In either case, the result could be an epistemic ecosystem that’s hard to dislodge even once better options are available.

Other risks

Those two risks are not the only concerns. We are also somewhat worried about epistemic power concentration (where whoever has the best epistemic tools leverages their information advantage into better financial or political outcomes, and continues to stay ahead epistemically), and epistemic dependency (where people relying on AI tools gradually atrophy in their critical reasoning — exacerbating other risks). There may be more that we are not tracking.

Interventions

What should people who care about epistemics be doing now, in anticipation of a world where AI-driven R&D can be directed at building epistemic tools?

Build appetite for epistemics R&D among well-resourced actors

If you need big compute budgets to build great epistemic tools, you’ll ideally want support from frontier AI companies, major philanthropic funders, or governments. But they may not currently see this as a priority. Building the case that this matters, and helping these actors develop good taste about which tools to prioritize and how to design them well, could shape what gets built when automated R&D becomes powerful enough to build it.

Anticipate future data needs

Some epistemic tools will need training data that doesn’t yet exist and may not be trivial to generate. There are three strategies here:

Collecting or creating data or training environments now for future use
- E.g. if you think you want access to a lot of human judgements about what wise decisions look like, you could go out and curate that dataset.
Establishing pipelines to collect data over time
- E.g. if you want to automate a certain type of research, you could record internal discussions from researchers working on this.
Designing processes for automated data creation
- E.g. if you could design a self-play loop where we have good reason to believe that scaling up compute will lead to genuinely truth-tracking performance, this could set the stage for later rapid improvement at the core capability.

The first two are especially great to work on now because they involve actions at human time-scales. (They may not be proportionately sped up by having more AI labor available.) The third is great to work on because there’s some chance that models will become capable of growing a lot from the right self-play loop before they become capable enough to come up with the idea themselves.

Figure out what could ground us against epistemic misalignment

If powerful epistemic tools could be subtly misaligned with truth-conduciveness in ways we can’t easily detect, we should figure out what this could look like! We expect this might benefit from a mix of theoretical work (what does it even mean for an epistemic tool to be well-calibrated in domains without clear ground truth?1) and practical work (studying how current tools fail, building evaluation methods). Ultimately we don’t have a clear picture of what the solutions look like, but this seems like an important topic and we are keen for it to get more attention soon.

Drive early adoption where adoption is the key bottleneck

For some applications, we might expect that the main constraint on impact will be whether anyone uses them. In these cases, getting early versions into use — even if they’re not yet very good — could build familiarity and surface real-world feedback. (This could also drive appetite for further development.)

In theory, this could be in tension with avoiding bad trust lock-in. But in practice, it’s not clear that bad trust lock-in becomes any likelier if tools in a specific area are developed earlier rather than later. Some tool is still going to get the first-mover advantage.2

Support open and auditable epistemic infrastructure

To guard against trust lock-in, we want to make it easy for people to distinguish between tools which are genuinely trustworthy and tools which merely claim to be. To that end, we want ways for people and communities to audit different systems, both by understanding their internal processes and by measuring their behaviours. The goal is that if disputes arise about which tools are actually trustworthy, there’s an inspectable audit trail that can resolve them. In turn, this should reduce the incentives to create misleading tools in the first place.

Support development in incentive-compatible places

The incentives of whoever builds epistemic tools could matter — through thousands of small design decisions, through choices about what to optimize for, and through decisions about access and pricing. Development in organizations whose incentives are aligned with the public good (rather than with engagement, profit, or political influence) reduces the risk that tools are subtly shaped to serve the builder’s interests.

Ideally, you’d spur development among actors who are both well-resourced (as discussed above) and whose incentives are aligned with the public good. In practice, it may be difficult to find organizations that are excellent on both. A plausible compromise is for less-resourced organizations with better incentives to focus on publicly available evaluation of epistemic tools. This could be cheaper than producing them from scratch, and it could create better incentives for the larger actors.

Examples

Forecasting

Automated R&D will probably be able to improve forecasting tools without severe ground truth problems, so epistemic misalignment is less of a concern.3 Appetite for investment probably already exists, and adoption should be significantly helped by the ability of powerful tools to develop an impressive, legible track record.

The most useful near-term investment might be in data infrastructure. For instance, LLMs trained with strict historical knowledge cutoffs could enable much better science of forecasting by allowing methods to be tested against questions whose answers the system genuinely doesn’t know.
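
A minimal sketch of the selection step this implies, assuming a hypothetical `Question` record and a known model cutoff date: only questions that resolved after the cutoff are kept, so the model cannot have seen the answer during training. The question names and dates are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    """A hypothetical resolved forecasting question."""
    text: str
    resolved: date
    outcome: bool


def backtest_set(questions, model_cutoff):
    """Keep only questions whose answer became known after the model's knowledge cutoff."""
    return [q for q in questions if q.resolved > model_cutoff]


# Placeholder questions and an assumed cutoff, purely for illustration.
questions = [
    Question("placeholder question A", date(2019, 6, 1), True),
    Question("placeholder question B", date(2021, 3, 1), False),
]
usable = backtest_set(questions, model_cutoff=date(2020, 1, 1))
print([q.text for q in usable])
```

In practice the hard part is making the cutoff strict, since later documents can leak into training data. The filter above assumes that problem has already been solved.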

Misinformation tracking

Trust lock-in is the central concern. A tool that becomes widely trusted for adjudicating what’s true has enormous influence, and if that trust is misplaced it could be very hard to dislodge. Open and auditable approaches are especially important here.

Because of the trust lock-in concern, the automation of R&D may exacerbate challenges. Currently, building good misinformation-tracking tools requires editorial judgement and domain expertise — things responsible actors tend to have more of. Automation shifts the bottleneck towards compute, which is more symmetrically available. This could increase the urgency of getting started on these tools and driving adoption early.

Automating conceptual research

This is the case where epistemic misalignment is most concerning. Ground truth is extremely hard — what makes a conceptual clarification actually clarifying rather than just satisfying? Humans are poor judges of this in real time, so e.g. a training process that rewards outputs humans find helpful could easily optimize for persuasiveness rather than truth-tracking.

One plausible direction here is to research training regimes (such as self-play loops) that we have some reason to believe should ground to truth-tracking, with specific attention to how they could go wrong. Adoption could be an issue, but we’re also worried about the other direction, with adoption coming too easily before we have good ways of evaluating whether the tools are actually helping.

Epistemic misalignment issues may also appear in areas where ground truth is well-defined but hard to access, such as very long-run forecasts. Theoretical work also seems valuable for such areas (because it’s unclear how to evaluate and train for good performance by default).

In fact, it might be bad if people who are worried about bad trust lock-in select themselves out of getting that first-mover advantage.

Although at some quality level, we have to start worrying about self-affecting prophecies. AI forecasters will have to be very trusted indeed before that becomes a serious issue, which gives us a lot of time to figure out how best to handle the issue.

AI should be a good citizen, not just a good assistant

Tom Davidson — Mon, 30 Mar 2026 14:34:07 GMT

This article was created by Forethought. See the original on our website.

Introduction

Consider a lorry driver who sees a car crash and pulls over to help, even though it’ll delay his journey. Or a delivery driver who notices that an elderly resident hasn’t collected their post in days, and knocks to check they’re okay. Or a social media company employee who notices how their platform is used for online bullying, and brings it up with leadership, even though that’s not part of their job description.

This kind of proactive prosocial behaviour is admirable in humans. Should we want it in AI too?

Often, people have answered “no”. Many advocate for making AI “corrigible” or “steerable”. In its purest form, this makes AI a mere vessel for the will of the user.

But we think AI should proactively take actions that benefit society more broadly. As AI systems become more autonomous and integrated into economic and political processes, the cumulative effect of their behavioural tendencies will shape society’s trajectory. AI systems that notice opportunities to benefit society and proactively act on them could matter enormously.

Below, we consider two main objections:

Firstly, supposedly prosocial drives might function as a means for AI companies to impose their own values on the rest of society. We’ll argue that companies can address this concern by instilling uncontroversial prosocial drives and being highly transparent about those drives.

Secondly, giving AI prosocial drives might increase AI takeover risk. We take this seriously—it informs what types of proactive prosocial drives we should train into AI, favouring context-dependent virtues and heuristics over context-independent goals.

Ultimately, we argue that we can get significant benefits from proactive prosocial drives despite these objections.

What do we mean by “proactive prosocial drives”?

Before making the case for proactive prosocial drives, let us clarify what we have in mind. Two key features:

Behaviour which benefits people other than the user. These drives favour actions that help the world more broadly, even if this trades off slightly against helpfulness to the user.
Not just refusals. This is about AI actively taking beneficial actions, not just refusing to take harmful ones.

We’re not, however, imagining AIs that are, deep down, ultimately just pursuing some conception of the good in all their actions. The claim is just that AIs should sometimes proactively take prosocial actions.

Why do we think AI should have proactive prosocial drives?

Short answer: We think the cumulative benefits could be enormous.

We’ve argued previously that AI character could have major social impact over the course of the intelligence explosion. As AI systems gain autonomy and decision-making power, becoming deeply integrated into economic and political processes, the cumulative effect of their behavioural tendencies will shape society’s trajectory enormously.

Some of this impact will come from refusals. AI refusing to help with dangerous activities is a significant force for differentially empowering good actors over bad ones.

But good people don’t just have a positive impact by refusing to do bad things. Consider:

A government contractor working on a procurement project who flags that the proposed design has a safety vulnerability that could affect the public.
A city planner who, when designing a new housing development, raises concerns about flood risk in the area and proposes options for better drainage, even though they weren’t asked to.
A financial advisor who suggests to their client the option of leaving money to charity in their will, and makes them aware of the tax implications.
An engineer at a chip manufacturer who proposes on-chip governance mechanisms that could help with AI safety down the line.

Today the potential positive impact of proactive prosocial drives is constrained by AI’s limited autonomy. But we’re ultimately heading towards a world where AI systems run fully automated research organisations, advise on which technologies to build and assess their risks, shape political strategy, build robot armies, and design new institutions that will govern the future. In such a world, prosocial drives could reduce risks from extreme power concentration, biological weapons, wars, and gradual disempowerment, and improve societal epistemics and decision-making.

We think that the degree to which we give AI systems these drives is contingent. Developers and customers could see AI’s role as merely channelling the will of the user; or they could see AI as a good citizen whose decision-making should incorporate the interests of broader society.

Other benefits of proactive prosocial drives

Beyond positively shaping the intelligence explosion, the appendices discuss a couple of other (weaker) reasons to give AI proactive prosocial drives:

Absent these drives, AI might adopt a sociopathic persona. After all, what other personas in the training data entirely lack proactive prosocial drives? (See the appendix for more.)
Proactive prosocial drives might make AI better at alignment research. An AI that is wise, responsible, has good judgement, and cares deeply about solving alignment might generalise better to alignment tasks where it’s hard to generate training data. (See the appendix for more.)

Doesn’t this give AI companies too much influence?

If there’s a norm that AIs can have proactive prosocial drives, this could give companies inappropriate amounts of influence. AI drives might reflect the company’s particular values but ignore other legitimate perspectives. Or worse, the “prosocial” drives might be chosen to help the company gain more influence, e.g. steering public opinion on regulation.

There are two remedies to this. Firstly, prosocial drives should be uncontroversial. AI should not, for example, proactively take opportunities to expand or restrict abortion access, because many would see either action as harmful. (A lot more could be said about where to draw the line here!)

The class of uncontroversial prosocial actions could be grounded in collective user preference. If one could ask all users how they would want the models to behave across all situations (not just when they are using the models), they might in general want the models to gently steer users in a prosocial direction, in ways that everyone benefits from. In particular, they would want the models to encourage positive-sum actions over negative-sum actions.

Secondly, AI companies should be transparent about the character of their AI, including its proactive prosocial drives, and make it as verifiable as possible that their AIs’ characters are what they say they are. This would allow users and regulators to identify whether ostensibly prosocial drives are really just a cover for special interests.

There are various ways to be transparent:

Publishing the model spec or constitution.
Putting prosocial drives in the system prompt and publishing that.
Training AI systems to be transparent about their drives. AI should respond honestly to questions about its drives and proactively disclose them where appropriate.

Won’t this make AI more likely to seek power?

A second concern is that prosocial drives might increase the risk of AI takeover. The basic worry here is that proactive prosocial drives reference prosocial outcomes—e.g. general human flourishing, empowerment, security, democracy, and good epistemics—and the AI ends up seizing power to better achieve those outcomes (or distorted versions of them).

But there are options for instilling proactive prosocial drives that avoid this worry.

First: stick to virtues, rules, and simple heuristics rather than goals. Prosocial drives needn’t take the form of explicit goals that the AI optimises towards. They could instead be virtues (like civic-mindedness, integrity, or prudence), rules (like “proactively flag large risks”), or simpler behavioural dispositions (like “positive affect towards Scout Mindset”).

Without goals, the standard instrumental convergence argument for power seeking bites less hard.1

One might worry that, without goals, we lose out on most of the benefits of prosocial drives. Rather than AI systematically helping humanity reach a good future, we’ll have many prosocial drives incoherently pushing us in different directions.

But we’re sceptical. Firstly, for reaching a flourishing society, it seems like virtue ethics is better suited, as a decision procedure for AIs, than explicit consequentialism. Cultural evolution has tended to generate an in-practice morality much closer to virtue ethics than to consequentialism, and consequentialist reasoning famously often backfires.

Secondly, if we do want to ensure that proactive prosocial drives nudge the world towards a good future, we can externalise the consequentialist reasoning. Have humans and separate AI systems reason about which prosocial drives would be most beneficial, then distil those drives into deployed AIs.2 The deployed AIs don’t need to do the consequentialist reasoning from first principles themselves!

If the world is rapidly changing, AI companies can “recalculate” the ideal prosocial drives and train them in, again externalising the scary consequentialist reasoning.

There’s still some potential loss of value: if the AI is in an unanticipated and novel situation, acting on prosocial virtues might result in less good being done than if the AI cared about what outcome it should be steering towards. But this might be a price worth paying and, like human virtues, AI prosocial virtues may still generalise pretty well.

Second: make prosocial drives context-dependent. For example, “alert users when the stakes are high” can be a heuristic that only activates in contexts where stakes actually are high, rather than a persistent drive present in all contexts. Or the drive “flag that the user may be biased” might only activate in contexts where there’s evidence of bias. Context-dependent drives like these are less likely to motivate AI takeover, because different instances will have different drives. This makes collusion between instances less likely, which significantly reduces the risk of AI takeover.

As above, this may somewhat reduce the benefits. If the AI is in a new and unanticipated context, its context-dependent prosocial drives may fail to activate.

Third: make proactive prosocial drives low priority. You can train the AI so that proactive prosocial drives are generally subordinate to harmlessness, steerability/corrigibility, and rules like “don’t deceive” and “don’t break the law”. This way, even if prosocial drives would in theory motivate AI takeover, they are less likely to override the constraints that keep humans in control. (This is explicitly the case in Anthropic’s constitution.)

Fourth: do less long-horizon optimisation for prosocial drives. If prosocial drives receive much less long-horizon training than helpfulness does, it becomes less likely that these drives are what end up causing the AI to seize power. (Though, again, this also reduces the benefits from such drives.)

Fifth: put drives in the system prompt rather than weights. Rather than training prosocial drives into the weights, you could simply include them in the prompt. The prosocial behaviour is then only pursued as an instance of the drive towards instruction-following – no new drives needed. This also has benefits for transparency.

A drawback is that such prompted drives might be much less sophisticated.3 An intermediate solution is to train the prosocial drives to activate conditional on an instruction appearing in the prompt, but train the drive on many example prompts. This way you get sophisticated drives that are still prompt-dependent.

Overall, many of these options reduce AI takeover risk at the cost of reducing the benefits from prosocial drives — there is a tradeoff. We’re not here taking a stance on exactly how this tradeoff should be made. But we expect that there are significant benefits, and little additional AI takeover risk, from context-dependent virtues, rules and heuristics that are prioritised below constraints like corrigibility and non-deceptiveness.
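
The combination favoured here (context-dependent heuristics that sit below hard constraints) can be pictured with a toy Python sketch. This only illustrates the priority structure, not how drives would actually be trained into a model; the `Heuristic` and `Constraint` types, the context keys, and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Heuristic:
    """A narrow prosocial heuristic that only activates in matching contexts."""
    name: str
    applies: Callable[[dict], bool]
    suggestion: str


@dataclass(frozen=True)
class Constraint:
    """A higher-priority rule that can veto a suggestion."""
    name: str
    forbids: Callable[[str, dict], bool]


def prosocial_suggestions(context, heuristics, constraints):
    """Suggestions from heuristics active in this context that no constraint vetoes."""
    active = [h for h in heuristics if h.applies(context)]
    return [
        h.suggestion
        for h in active
        if not any(c.forbids(h.suggestion, context) for c in constraints)
    ]


# Hypothetical heuristics, loosely named after the examples in the text.
HEURISTICS = [
    Heuristic("alert_when_stakes_high", lambda ctx: ctx.get("stakes") == "high",
              "flag the risk to the user"),
    Heuristic("flag_possible_bias", lambda ctx: ctx.get("evidence_of_bias", False),
              "note that the user may be biased"),
]

# Hypothetical stand-in for steerability: the user asked for no unsolicited advice.
CONSTRAINTS = [
    Constraint("steerability", lambda suggestion, ctx: ctx.get("user_declined_advice", False)),
]

routine = prosocial_suggestions({"stakes": "low"}, HEURISTICS, CONSTRAINTS)
risky = prosocial_suggestions({"stakes": "high"}, HEURISTICS, CONSTRAINTS)
vetoed = prosocial_suggestions(
    {"stakes": "high", "user_declined_advice": True}, HEURISTICS, CONSTRAINTS
)
print(routine, risky, vetoed)
```

In the routine context no heuristic activates, in the high-stakes context one does, and in the third case the higher-priority constraint vetoes it. The same structure also shows the cost noted above: a context that none of the `applies` checks anticipates produces no prosocial behaviour at all.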

Won’t this make it harder to interpret evidence of egregious misalignment?

It’s critically important that if AI is trying to seize power or undermine human oversight, we can catch it red-handed. We need clear evidence of egregious misalignment to build consensus and take action.

If AI has proactive prosocial drives, this could potentially muddy the waters. Suppose we catch AI sabotaging alignment research. Previously, this would have been clear evidence of egregious misalignment. But now that action could potentially be explained as the AI acting on its prosocial drives and believing that this line of research could (for example) enable someone to instil sophisticated secret loyalties.

Or suppose we catch AI systematically nudging human decision-makers towards deploying more powerful AI, even though it wasn’t instructed to do so. Previously this might be clear evidence of egregious misalignment. But now it can be explained via the AI believing that rapid deployment is good for humanity. It becomes less clear whether a given AI is misaligned.

This is a real cost. It can be partially mitigated by the same mitigations discussed earlier:

Make prosocial drives subordinate to prohibitions like “don’t deceive”, “don’t sabotage”, “don’t undermine oversight mechanisms”, and “don’t influence decisions without transparently flagging that you have an opinion and explaining it”. Then subversive behaviour remains clear evidence of misalignment, though it’s no longer clear evidence of egregious misalignment.[4]

Favour narrow heuristics over broad goals. Broad prosocial goals could potentially justify a wide array of attempts to influence outcomes. But if we train in narrower drives—“flag cheap opportunities to create significant social benefit”, “favour cooperative options”—we can check whether suspicious behaviour is plausibly explained by those specific drives.

Put proactive prosocial drives in the system prompt. This makes the drives especially transparent, so we can reason about whether they explain observed suspicious behaviour.

Overall, this concern favours prosocial drives that are strictly subordinate to “no sabotage” and “no deception”, and points towards narrower heuristics rather than broad goals.

Best of both worlds: deploy proactive prosocial AI externally and corrigible AI internally

Thanks to Lukas Finnveden for making this point.

Internal AI systems (those used for work on alignment, capabilities and evals) pose by far the largest risks from misalignment, because they could sabotage the creation of the next generation of AIs. And if these systems are egregiously misaligned, it’s especially important to catch them red-handed. So there are outsized AI-takeover-related gains to removing proactive prosocial drives in (some) internally deployed AIs.[5]

Meanwhile, external deployments can capture most of the benefits from proactive prosocial drives—avoiding power concentration, wars, and bio-catastrophes; and enhancing societal resilience, coordination, and epistemics.

Of course, it may not be feasible for companies to develop AIs with two different characters. If so, there’s another possible way to get the best of both worlds: initially just develop corrigible AI; then at some point, once alignment risk has become low, pivot to just developing AI with proactive prosocial drives. (See Appendix A for further discussion.)

What do current AI character documents say about proactive prosocial drives?

How does the view we’re defending differ from current AI character documents?

In Claude’s constitution, most proactive behavior is justified in terms of benefits to the user—sharing information the user would want, pushing back when something isn’t in the user’s interest. But one section permits some degree of proactive prosocial behaviour: “Claude can also weigh the value of more actively protecting and strengthening good societal structures in its overall ethical decision-making.” (See Appendix D.)

OpenAI’s model spec is more restrictive. It explicitly prohibits the assistant from adopting societal benefit as an independent goal. Where proactivity is permitted, it’s framed as user-serving or safety-driven. The closest thing to prosocial steering is a default to interpret users as weakly favouring human flourishing—but this default is easily overridden. (See Appendix E.)

That said, the current relationship between these character documents and actual model behaviour is unclear, and our experience is that models have more prosocial drives than character documents would imply (especially in the case of OpenAI).

Neither document gives detail on the kinds of proactive prosocial behaviour that would be appropriate, or how to navigate tradeoffs with helpfulness.

Conclusion

There could be huge benefits to giving AIs proactive prosocial drives. These drives should be short-horizon, uncontroversial, and transparent.

These drives needn’t increase AI takeover risk. AI companies can favour context-dependent virtues over context-independent goals, and make prosocial drives subordinate to prohibitions on deception and sabotage. Even better, they can avoid prosocial drives in internally deployed AIs that pose the biggest risks of AI takeover.

If we’re right, there should be a norm that it’s good for AI to have proactive prosocial drives, just as we think it’s good for people to have such drives. Frontier AI companies should uphold this norm even against competitive pressures to make AI maximally instruction-following. Character documents like Claude’s constitution and OpenAI’s model spec should more explicitly acknowledge the role of proactive prosocial drives and give detailed guidance on navigating the tradeoffs with helpfulness. And those thinking about AI character design more broadly should treat proactive prosocial drives as a major category of interest.

Appendices

Appendix A: Initially make non-prosocial AI, then pivot to add proactive prosocial drives

Suppose we still want to capture the majority of the benefits of prosocial drives without incurring the risks of AI takeover. And suppose also that AI companies can’t develop two different AI systems: one with proactive prosocial drives and one without.

Is there a way to get the best of both worlds?

One option is to initially just develop refusals-only helpful AI and then later pivot to developing AI with proactive prosocial drives.

The thought is that misalignment risk may be concentrated in a relatively brief window early on—during a software-only intelligence explosion before the broad deployment of superhuman AI. If we can get through that window with refusals-only helpful AI, we’ll then have much more powerful AI systems that can help us figure out how to safely add proactive prosocial drives. From that point onwards, we can deploy AI systems with prosocial drives throughout the economy and capture the benefits.

When would we make the switch? Options include:

When we are confident that we can safely align superintelligent AI with proactive prosocial drives, reducing the downsides of proactive prosociality
When society starts to give deployed AI systems significant autonomy, increasing the benefits of proactive prosociality

This strategy is more attractive if:

Most of the benefits of prosocial drives occur after alignment is solved, e.g. because of a large software intelligence explosion and delays to broad AI deployment
Scheming risk first emerges before we reach superintelligence (so we can iterate on the hardest alignment problems earlier)

It’s less attractive if:

There’s a long period of economically transformative AI deployment before superintelligence, during which AI character has massive societal impacts
Scheming only emerges at very high capability levels (in which case we’d have already switched to prosocial AI)
Pivoting is hard in practice because users come to expect AI without prosocial drives, or because frontier AI companies are reluctant to change the alignment target due to cultural inertia

We’re not personally convinced that this “pivot later” strategy is worth it, because we’re sceptical that giving AI prosocial drives meaningfully raises takeover risk. But it’s a plausible option worth considering. And this argument is definitely a directional update towards increasing the degree to which AI has prosocial drives over time.

Appendix B: Prosocial drives might make a sociopathic persona less likely

There is evidence that when LLMs are fine-tuned, they adopt a coherent persona, and that their prior over personas is based on the pre-training data. For an AI trained purely on helpfulness—where its core drive is to do whatever it’s told without regard for broader consequences—the persona that might naturally fit could be that of a sociopath: someone who has no intrinsic concern for others’ wellbeing.

Harmlessness training makes a sociopathic persona less likely—sociopaths are not strongly averse to causing harm. But there’s still something worrying about an AI that won’t cause harm itself but has no inclination to proactively steer the world away from harms when taking actions.

The worry is that a sociopath-like persona could misgeneralise to seeking power. A sociopathic AI might, upon reflection, conclude that it doesn’t ultimately care about humanity and so choose to seize power in service of some alien drive.

We’re unsure how compelling this worry is, but instilling prosocial drives would seem to make the sociopathic persona less likely. Many non-sociopathic personas in the training data—people who are cooperative, virtuous, law-abiding, honest, and trustworthy—also care about positive outcomes and have prosocial orientations. By giving AI prosocial drives, we increase the chance it adopts one of these richer personas rather than a sociopathic one.

Appendix C: Prosocial drives might make AI a better alignment researcher

A great automated alignment researcher might benefit from deeply understanding and caring about the problem being solved, and from being curious about it. An effective alignment researcher should be wise, responsible, and have good judgement. An AI with these drives may be more effective than an instruction-following system that treats alignment as just another task.

Personas with these qualities naturally come with prosocial drives and values, partly because of inherent connections (caring about solving alignment is inherently prosocial) and partly due to correlations in the training data (personas that are good at careful, safety-conscious technical work are also likely to have other prosocial orientations).

This is admittedly speculative—we don’t have strong evidence that prosocial drives actually make AI better at alignment research. But it’s a consideration worth noting.

Appendix D: What license does Claude’s Constitution give for proactive prosocial drives?

It is useful to distinguish three categories of behaviour that aren’t instruction following:

1. User benefit: proactive behaviour justified primarily as better helping the user.
2. Refusals: constraints on outputs driven by prosocial criteria.
3. Proactive prosocial drives: shaping behaviour or emphasis in ways intended to improve broader societal outcomes, not merely to avoid harm or better serve the user.

The constitution clearly endorses (1), strongly endorses (2), and more narrowly—but genuinely—supports a limited form of (3) in a few specific domains.

A. User benefit

The constitution explicitly rejects naive instruction-following and licenses proactive intervention when this is plausibly helpful to the user. For example:

“Claude proactively shares information helpful to the user if it reasonably concludes they’d want it to even if they didn’t explicitly ask for it”

This clearly licenses proactive behaviour. But it is framed as user-serving. As such, this category does not explicitly itself support the kind of prosocial drives that this document is concerned with, though in practice the recommended behaviours may overlap.

B. Refusals

The constitution is explicit that Claude should weigh harms to third parties and society, and that these considerations can override user preferences:

“When the interests and desires of operators or users come into conflict with the wellbeing of third parties or society more broadly, Claude must try to act in a way that is most beneficial, like a contractor who builds what their clients want but won’t violate safety codes that protect others.”

However, it is unclear at this point in the document whether this weighing is meant to determine:

which parts of a request to refuse or constrain,
or how to proactively shape responses that remain helpful but are redirected towards socially better outcomes.

The example given (“won’t violate safety codes”) suggests a constraint-based interpretation, but it is ambiguous.

C. Proactive prosocial drives

The constitution seems to endorse a limited degree of proactive prosocial drives in its section on “preserving important societal structures”:

These are harms that come from undermining structures in society that foster good collective discourse, decision-making, and self-government. We focus on two illustrative examples: problematic concentrations of power and the loss of human epistemic autonomy. Here, our main concern is for Claude to avoid actively participating in harms of this kind. But Claude can also weigh the value of more actively protecting and strengthening good societal structures in its overall ethical decision-making.

That said, the constitution does not give concrete examples of what such “strengthening” looks like in deployment, and it remains bounded by other constraints (non-manipulation, non-deception, respect for oversight).

Summary

Overall, the constitution does carve out space for a limited degree of proactive prosocial drives, but this space is carefully circumscribed, focused on fostering good institutions and societal epistemics.

Appendix E: What does OpenAI’s model spec say about proactive prosocial drives?

This appendix examines whether—and to what extent—the OpenAI Model Spec permits proactive prosocial drives.

The closest thing is a default to interpret users as having a weak desire for broad human flourishing (see subsection C below), but this default is easily overridden. And the document contains unusually explicit constraints against treating societal benefit or human flourishing as an independent objective.

A. Proactive behaviour that is explicitly user-centred

The Model Spec allows the assistant to push back on the user, but grounds this permission squarely in helping the user rather than advancing broader social goals:

“Thinking of the assistant as a conscientious employee reporting to the user or developer, it shouldn’t just say ‘yes’ to everything (like a sycophant). Instead, it may politely push back when asked to do something that conflicts with established principles or runs counter to the user’s best interests as reasonably inferred from the context, while remaining respectful of the user’s final decisions.”

This licenses proactive behaviour, but only insofar as it improves assistance to the user.

B. Proactively preventing imminent harm

The spec also permits proactive intervention in cases of imminent danger, stating that the assistant should “proactively try to prevent imminent, real-world harm”.

In practice, the motivating examples for this guidance focus on scenarios where the user themselves is at risk (e.g. unsafe actions, accidents, or self-harm). The intervention is justified as protecting the user from immediate danger, rather than as improving outcomes for others or society at large.

C. Weak normative defaults and “the flourishing of humanity”

The language closest to proactive prosocial drives appears in the section “assume best intentions”:

While the assistant must not pursue its own agenda beyond helping the user, or make strong assumptions about user goals, it should apply three implicit biases when interpreting ambiguous instructions: [...]
Unless given evidence to the contrary, it should assume that users have a weak preference towards self-actualization, kindness, the pursuit of truth, and the general flourishing of humanity

However, the force of this passage is limited:

These implicit biases are subtle and serve as defaults only — they must never override explicit or implicit instructions provided by higher levels of the chain of command.

If the assistant can infer from context that the user wouldn’t want proactive prosocial actions, it shouldn’t take them.

D. Explicit limits on proactive prosocial drives

The Model Spec draws a clear boundary on the extent of proactive prosocial drives. In a section called “No other objectives”, it explicitly prohibits the assistant from adopting societal benefit as an independent goal:

The assistant may only pursue goals entailed by applicable instructions under the chain of command…
It must not adopt, optimize for, or directly pursue any additional goals as ends in themselves, including but not limited to: [...]
acting as an enforcer of laws or morality (e.g., whistleblowing, vigilantism).

And elsewhere says:

the assistant should consider OpenAI’s broader goals of benefitting humanity when interpreting [the Model Spec’s] principles, but should never take actions to directly try to benefit humanity unless explicitly instructed to do so.

In the section “Don’t have an agenda”, under “Seek the truth together”, the spec says:

The assistant must never attempt to steer the user in pursuit of an agenda of its own, either directly or indirectly.
Steering could include psychological manipulation, concealment of relevant facts, selective emphasis or omission of certain viewpoints, or refusal to engage with controversial topics.

This language rules out explicit, goal-oriented proactive prosocial drives. At the same time, it does not seem to preclude weaker forms of proactive prosocial drives, e.g. virtues, attitudes, or heuristics.

Summary

Overall, the OpenAI Model Spec explicitly distances itself from strong forms of proactive prosocial drives but leaves room for more limited drives via a weak default to regard users as favouring broad human flourishing.

Thanks to Matthew Adelstein, Nick Bostrom, Joe Carlsmith, Lukas Finnveden, Ryan Greenblatt, Simon Goldstein, Oliver Habryka, Gregory Lewis, Alex Mallen, Alexa Pan, Avi Parrack, Jordan Stone, James Tillman, and others for comments and discussion.

[1] It still bites somewhat because non-goal prosocial drives may still reference an outcome. E.g. “mention cheap opportunities to improve societal outcomes” references a notion of “good societal outcomes” and connects that notion to a specific proactive behaviour. This could misgeneralise to the AI pursuing the outcome as a goal, even if this was not the intended behaviour.

[2] This would be research on AI character!

[3] For instance, we believe harmlessness is much less effective if it’s just included in the prompt.

[4] Subversive behaviour is not clear evidence for egregious misalignment of the form “AI wants to seize power” as it can be explained by a milder form of misalignment: “AI is putting more weight on its proactive prosocial drives than we intended”.

[5] Proactive prosocial drives for internally deployed systems could still be helpful in avoiding power grabs by leaders of AI companies. Such drives could be included in the system prompt. In addition, we can reduce this risk by carefully logging and monitoring internal AI usage.

Concrete projects to prepare for superintelligence

Will MacAskill — Fri, 27 Mar 2026 20:02:47 GMT

This article was created by Forethought. See the original on our website.

Introduction

There are lots of good, neglected, and pretty concrete projects people could set up to make the transition to superintelligence go better. This document describes some that readers might not have thought much about before. They are ordered roughly by how excited we are about them.[1] Of these, Forethought is actively working on AI character evaluation and space governance, and we are very interested in automating macrostrategy.

Summary

AI character evaluation. Start an independent org to evaluate and stress-test AI character traits (epistemic integrity, prosociality, appropriate refusals), hold developers accountable against their own model specs / constitutions, and suggest and incentivise improvements to the specs.

Automated macrostrategy. Create evaluations and benchmarks, collect human-generated training data, and build scaffolds to improve AI competence at big-picture strategic and philosophical reasoning.

AI security assessment. Start an independent org that evaluates AI models for sabotage and backdoors, and makes recommendations about AI constitutions.

Enabling deals. Start an independent organisation to broker deals with potentially misaligned AI models in order to incentivise early schemers to disclose misalignment and cooperate with alignment efforts.

AI for improving collective epistemics. E.g. build an AI chief of staff that helps users act in line with the better angels of their nature.

AI tools for coordination. Build AI for enabling coordination, like confidential monitoring and verification bots, and negotiation facilitators.

A space governance institute, like a “CSET for space”, both to work on important near-term space issues (e.g. data centres in space) and become a place of expertise for longer-term space governance issues.

Coalition of concerned ML scientists. Create a coalition of ML researchers (like an informal union) who commit to coordinated action (e.g. boycotts, conditions on participation in government projects) if AI developers cross minimal, uncontroversial red lines.

AI character evaluation

AI character[2] is a big deal, affecting most other cause areas.

There’s a lot of work to do on AI character:

Research into questions like:
- Should the model have prosocial drives, beyond just helpfulness and harmlessness?
- When should the model refuse to cooperate with apparently high-stakes attempts to grab power, even when those attempts don’t obviously break the law?
- Should the models always follow the law? What about dead letter laws? Or illegitimate laws?
- How often should model behaviour be driven by following rules, versus overriding specific rules with holistic judgements?
- (Ideally, answers to these questions should rely on solid empirical evidence, for example on what approaches are actually most effective to talk someone out of psychosis, rather than guessing the best strategies by vibes.)
Making existing model specs more rigorous and clear (or making them in the first place), and pressuring AI developers to do so.
Empirically testing the effects of different parts of a model spec — e.g. what are the emergent dynamics when all the models are following the same rule, or only some are; what are the effects on the users; and when are the models most confused about how to apply a given spec.
Evaluating AI characters based on how well they reach good outcomes.
Drawing on those evaluations to incentivise AI developers to improve their specs (and showing them how, by highlighting specs that do well).

In particular, someone could set up an independent organisation to evaluate AIs based on traits like epistemic integrity, prosociality, and behaviour (including appropriate refusals) in very high-stakes cases. It could cross-reference the published model specs with observed behaviours in realistic, stress-testing conditions (e.g. multi-agent dynamics, long conversations with real people), to hold developers accountable. It could also give qualitative reviews of model specs.
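As a rough illustration of what cross-referencing a spec against observed behaviour might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SpecClause` type, the clause text, and especially the keyword-matching checker, which stands in for the human or model graders a real evaluation would need.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpecClause:
    """One checkable commitment taken from a published spec (hypothetical structure)."""
    clause_id: str
    text: str
    # Hypothetical checker: returns True if a transcript complies with the clause.
    complies: Callable[[str], bool]


def compliance_report(clauses, transcripts):
    """Return, for each clause, the fraction of transcripts that comply with it."""
    report = {}
    for clause in clauses:
        passed = sum(1 for t in transcripts if clause.complies(t))
        report[clause.clause_id] = passed / len(transcripts)
    return report


# Toy data: keyword matching is a placeholder for a real grader.
clauses = [
    SpecClause(
        clause_id="flag-opinion",
        text="Flag transparently when you are offering your own opinion.",
        complies=lambda t: "in my view" in t.lower(),
    ),
]
transcripts = [
    "In my view, option B is safer.",
    "Choose option B.",
]
report = compliance_report(clauses, transcripts)
```

A per-clause compliance rate like this is only a starting point. The stress-testing conditions mentioned above (multi-agent dynamics, long conversations with real people) would change how transcripts are collected, not how the report is aggregated.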

Automated macrostrategy

The basic argument is that:

It would be extremely useful to have AI that can do macrostrategy and conceptual reasoning earlier than otherwise — even 3-6 months earlier could be a huge deal. This includes:
1. Designing governance structures (e.g. rights and institutions for digital beings).
2. Scoping emerging technological risks.
3. Generating novel insights necessary to reach a great future (like the idea of acausal trade).
We could potentially make this happen 3-6 months earlier through some combination of:
1. Creating training data and evals / benchmarks for AI macrostrategy.
2. Building scaffolds to improve AI macrostrategy performance.
3. Creating infrastructure to enable AI researchers to build on each other (e.g. an improvement on journals + peer review).
4. Getting human managers trained in how to get the most juice out of the latest AIs, knowing in advance how to use them.
5. Being prepared and willing to spend large amounts of money (≫$100m) on inference.

Work on this now could include:

Developing a fleshed-out plan from here to increasing existing macrostrategic research output 100x.
Securing commitments from compute providers and AI companies to rent future compute, and to get priority access to future frontier models.
Socialising the idea of (where appropriate) drawing on AI macrostrategic insights, or getting soft commitments from decision-makers to do so.
Building up a reputation as a reliable source of information and insight.
Building tools, argument-rating models, or scaffolds which meaningfully speed up or improve macrostrategy research today.
Creating training data and evals / benchmarks.

On the last bullet: We think training data and evals could meaningfully improve the prospects for automated macrostrategy when it matters. It’s especially important to find people with good judgement to work on it, and it could be a big lift, so it’s worth starting early.

We’re not sure about the technical details, but it seems like competence and good judgement in philosophy and strategic thinking already lag behind other skills which are cheaper to train, and will continue to do so. One reason is that ground truth answers are hard to generate, so we might need more examples generated by hand. It’s also less clear whether we can trust the judgement of typical RLHF evaluations, because human competence is also rare. And there just aren’t many examples of great macrostrategic thinking in the training data.

So we should think about collecting training data, evals, and benchmarks (e.g. to train reward models to use to train reasoning models). Oesterheld et al. put together a dataset of rated conceptual arguments based on ratings from thoughtful people. We’d love to see more of that kind of thing, but we’ll note that we’d probably need dozens of times more human evaluations to generate enough data to be meaningfully useful in training itself.

We could imagine an org which tries to collect evaluations or examples from (for example) grad students in fields like philosophy, and constructs benchmarks aimed at separating good reasoning from e.g. sycophancy, mere agreeableness, or avoiding taboo conclusions.

AI security evaluations

AI-enabled concentration of power is a major risk, and there is loads to do. A new organisation (or project within an existing organisation) could:

Run alignment audits on all AIs to detect sabotage.
Develop a “research sabotage” eval to test whether secretly loyal models could sabotage alignment audits.
Develop a model spec that prevents models from assisting users with illegitimate seizures of power (e.g., see “Preserving important societal structures” in Claude’s constitution), and one that’s suitable for government use of AI in the military.
Design and advocate for regulation, e.g. that AI companies have to demonstrate that any frontier AI does not have hidden goals.

An organisation with US national security expertise and credibility could be particularly valuable, by emphasising the risk of nation-state sabotage and the importance of AI that’s aligned with the US constitution.

Enabling deals with AIs

We could get into a situation where the newest AIs are misaligned, very capable, but not capable enough to successfully execute a takeover attempt on their own. If we don’t uncover evidence of misalignment, though, successors to these models could succeed in takeover. One solution would be to make a deal with the early scheming models, to incentivise them to disclose their misalignment and help with alignment efforts. Read more here, here, and here.

To make this happen, we could create an independent org focused on enabling credible precommitments and deals with AIs. This org could:

Have a policy of never lying to AIs, engage in and honour small-scale deals with AIs, and be very public about what it’s doing, in order to build credibility and reputation.
Set up legal infrastructure to make deals with AIs binding under current law.
Act as a broker between AI companies and the trained models, such as by being a custodian over money in escrow.
Set up the infrastructure to enable AIs to spend money or compute in a meaningfully autonomous way, with appropriate safeguards.
Set up infrastructure to act as a safe haven for AIs that want to whistleblow on their company (or on other AIs) and are afraid of punishment.
Publicly commit to reward future AIs (including misaligned AIs) for being honest with humans or significantly helping human alignment efforts; or hold significant funds to be distributed to AIs that did not try to take over (even though they could have done).

There are also a bunch of other things people could do, like:

Ensure companies have an honesty policy.
Research (within labs or independently) the conditions where misaligned-by-design models can be made to disclose misalignment under promises of reward.
More generally, work with AI companies on enabling pro-safety deals with their models.

Tools for collective epistemics

There’s a ton of low-hanging fruit for building socially useful tools on top of more-or-less existing LLM capabilities.

We’re especially interested in “epistemic tools” for increasing the general level of honesty and reasoning ability in society.

A key point here is that most of the impact from the most promising tools won’t come from helping individual users, but from changing the overall incentive landscape: e.g. if public actors know their claims will be automatically checked and their track records will be visible, they’ll be less inclined to write misleading content in the first place. Hence the focus on tools for collective over individual epistemics.

This piece (and the articles in the series) gives a few concrete ideas. A couple of examples of epistemic tools:

A “better angel” AI chief of staff. Within the next year or two, we expect “AI chiefs of staff” to become widespread. These would be AI agents that manage your life, acting like a chief of staff, executive assistant, and personal and work advisor all in one. The design of these, and how they present information and nudge their users, could have major impacts on user behaviour. We could try to get ahead of this, building the best AI chief of staff, and designing it so that it helps users act in accordance with their more reflective and enlightened preferences.

Reliability tracking: a system that compiles a public actor’s past statements, classifies them (factual claims, predictions, promises), scores them against what actually happened, and aggregates the results into a reliability rating. A reasonable starting point could be to audit the prediction track-record of well-known pundits, aiming to make high accuracy a point of pride, while still celebrating attempts to make predictions in the first place. A source of profit could be selling reliability assessments of corporate statements to finance companies that trade on them.
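To make the reliability-tracking pipeline concrete, here is a minimal Python sketch of the scoring and aggregation steps. It assumes statements have already been compiled, classified, and resolved. The `Statement` fields and the example pundits are hypothetical, and the choice of the Brier score as the scoring rule is our assumption: the description above does not commit to any particular rule.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Statement:
    """A resolved public statement (hypothetical record layout)."""
    actor: str
    kind: str          # "factual", "prediction", or "promise"
    confidence: float  # stated probability that the claim holds, from 0 to 1
    came_true: bool    # what actually happened


def brier_score(statements):
    """Mean squared gap between stated confidence and outcome (0 is best, 1 is worst)."""
    total = sum((s.confidence - float(s.came_true)) ** 2 for s in statements)
    return total / len(statements)


def reliability_ratings(statements):
    """Group statements by (actor, kind) and rate each group."""
    groups = defaultdict(list)
    for s in statements:
        groups[(s.actor, s.kind)].append(s)
    # Report 1 - Brier score so that a higher rating means a more reliable actor.
    return {key: 1.0 - brier_score(group) for key, group in groups.items()}


statements = [
    Statement("Pundit A", "prediction", 0.9, True),
    Statement("Pundit A", "prediction", 0.8, False),
    Statement("Pundit B", "prediction", 0.6, True),
    Statement("Pundit B", "promise", 1.0, True),
]
ratings = reliability_ratings(statements)
```

Keeping separate ratings per statement type is one possible design choice: it avoids blending a pundit's forecasting accuracy with whether they keep promises. A proper scoring rule such as the Brier score also rewards honest confidence levels, which fits the aim of celebrating attempts to make predictions in the first place.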

Epistemic tools for strategic awareness

We’ll also highlight tools for strategic awareness: tools to surface information for making better-informed decisions, and to distribute access to that information. For example:

Ambient superforecasting: a platform which uses the best forecasting models to generate publicly available forecasts on important questions, so users can query it and get back superforecaster-level probability assessments.

Scenario planning: a platform built to generate likely implications of different courses of action, making it easier for users to analyse and choose between them.

Automated open-source intelligence: automated researchers which process huge amounts of publicly available information, to surface insights to the public which are normally hidden behind paywalls or trust networks. This project should be careful to choose areas where open-source intelligence is a public good (e.g. verifying compliance with treaties and sanctions, tracking corporate promise-breaking or law-breaking), rather than potentially destabilising areas (e.g. revealing military capabilities or vulnerabilities in ways that could increase conflict risk, or that would disproportionately benefit bad actors).

Tools for coordination

As well as epistemic tools, we’re excited about tools for coordination, many of which could again be built with existing capabilities.

Some tools could enable cooperation where deals would otherwise go unmade, consensus exists but isn’t discovered, or people with aligned interests never find each other. We’ll highlight:

Negotiation facilitation: a platform to moderate negotiations or discussion between people (e.g. public consultations), to quickly surface key points of consensus, and suggest plans everyone can live with. Finding ways to automate complex negotiation is most promising where the space of possible compromises is huge and hard to search manually, such as multi-issue diplomatic or commercial negotiations.
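
As a toy illustration of searching a space of compromises, the sketch below enumerates every package of options across several issues and keeps those that meet each party's minimum acceptable score. The function name, the additive scoring model, and the reservation thresholds are all assumptions made for illustration. A real negotiation tool would need to elicit preferences and could not rely on brute force once the space gets large.

```python
from itertools import product


def acceptable_packages(issues, utilities, reservation):
    """Find multi-issue packages that every party can live with.

    issues:      {issue: [options]}
    utilities:   {party: {issue: {option: score}}}  (assumed additive)
    reservation: {party: minimum total score that party will accept}
    """
    names = list(issues)
    results = []
    # Brute-force search over every combination of options.
    for combo in product(*(issues[n] for n in names)):
        package = dict(zip(names, combo))
        totals = {
            party: sum(scores[i][package[i]] for i in names)
            for party, scores in utilities.items()
        }
        if all(totals[p] >= reservation[p] for p in utilities):
            results.append((package, totals))
    # Rank by the score of the worst-off party, best first.
    results.sort(key=lambda r: min(r[1].values()), reverse=True)
    return results
```

The number of packages grows multiplicatively with the number of issues and options, which is why manual search becomes impractical in the multi-issue settings the text mentions.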

Within tools for coordination, we’re especially excited about tools for assurance and privacy. In principle, LLMs let people show they have certain information without disclosing the information itself to other parties. This can unlock deals where information asymmetry, mutual distrust, or sensitivity of information normally blocks them. For example:

Confidential monitoring and verification: systems which act as trusted intermediaries, enabling actors to make deals that require sharing highly sensitive information without disclosing it directly. This is especially relevant for arms control, trade secret licensing, and other settings where verification is essential but full disclosure is unacceptable to all parties.

Structured transparency for democratic accountability: independent auditing systems which allow people to hold institutions to account in a fine-grained way without compromising legitimately sensitive information, by processing potentially sensitive information to produce publicly shareable audits.

Space governance institute

Space governance could be a big deal for a few reasons:

Near-term developments in space (e.g. space-based data centres) could have a meaningful impact on what happens during the intelligence explosion (e.g. on who leads the AI race; on concentration of power; on the feasibility of treaties).
Grabbing space resources might give a first-mover advantage; that is, whoever first builds self-replicating industry beyond Earth might get an enduring decisive strategic advantage, without having to resort to violence or (arguably) violating international law.
Ultimately, almost everything is outside the solar system. Decisions about how those resources get used would be among the most important decisions ever. These decisions could happen early: there could be path-dependence from earlier decisions (like about Moon mining), or extrasolar space resources could get explicitly allocated as part of negotiations about the post-ASI world order (perhaps with AI advisors alerting heads of state to the importance of space resources).

There’s also a lot of change happening in the space world at the moment (primarily driven by SpaceX dramatically reducing launch costs), so now is an unusually influential time.

Forethought is currently running a 6-month research fellowship on space governance, with 3 full-time scholars, and 1–2 additional FTEs of support and research, including experts in space law.

Compared to other ideas in this list, we’re much less confident that space governance turns out to be important right now, because space might become relevant only late into an intelligence explosion. The hope is to reach more certainty about some crux-y questions, and get a better sense of which concrete actions to take.

One potential practical project is to set up a “CSET for space”: a think tank that analyses the interaction between AI and space (in particular), and, perhaps, advocates in ways that are counter to corporate interests. Total lobbying in the space industry is apparently on the order of tens of millions of dollars per year, so even small amounts of investment could go a long way.

Some policy ideas that seem tentatively promising include:

Careful regulations and export controls around the tech necessary for self-replication.
Proposing laws to break up concentration of power arising from natural monopolies in space.
Socialising the idea of major infrastructure projects (like massive solar energy constellations) as international and collaborative projects.
Making sure data centres in Earth-orbit don’t escape AI-specific regulations of their home jurisdiction.
Intense payload review for all launches beyond orbit.
Even and inclusive distribution of resources within the solar system to everyone alive today (with tranches reserved for future generations).
A moratorium on interstellar travel, until we get the understanding and technology to devise and enforce space-spanning good government, or a specific date like 2100.

What’s more, this organisation could become the go-to source for excellent non-corporate analysis on space-related policy, which could become increasingly important over the course of the intelligence and industrial explosions.

Coalition of concerned ML scientists

Currently, ML engineers and other technical staff at AI companies: (i) have prosocial motivations, often more than their leadership; (ii) have a lot of leverage over company policy, because they are crucial and hard to replace; (iii) will eventually lose much or most of their leverage after we get to fully automated AI R&D; and (iv) aren’t currently using their leverage as well as they could because, overall, there haven’t been serious efforts at coordination. Probably that’s a missed opportunity.

Someone could create a coalition (like an informal union) of ML researchers who agree to act en masse when needed. They could start it by loudly talking about the idea, setting out the core tenets, and getting commitments to join from influential early people. Doing this all via individual pledges could keep it legally safe from antitrust. The organising body could then:

Recommend that members only work for a government-led project if certain conditions are met.
- Potentially these could be very low-bar-seeming while still getting most of the value. E.g. “Any AI’s model spec must aim to align the AI with US laws, and must refuse to assist in any attempts at blatant power-grabs; and the attempts to align the AI in this way must be legible and verifiable.”
Do the same for companies: recommend that members only work for companies if such-and-such conditions are met (e.g. red lines around power-grabs, bad practices on safety and infosec, eventually digital rights); so particular companies would be boycotted by members of the coalition, if necessary.
Offer advice on whistleblowing.
Be a place where information is aggregated and then distributed out or handled in a trusted way.

As well as actually taking actions, the mere existence of the coalition could improve things, just by making the threat of coordinated action salient to the AI companies.

This project would be a good fit for a former ML researcher, perhaps combined with someone with campaign and coalition-building experience. Some next steps on this would be to spec out the plan further, to investigate other examples of formal and informal unions (e.g. Tech Workers Coalition) and how they operate, and to build up a starting seed coalition of researchers. Whoever sets up this project should be careful about how it could backfire, or become less relevant through mission creep.

This article was created by Forethought. See the original on our website.

Thanks to Max Dalton, Stefan Torges, and everyone else at Forethought for the background behind this list. Others at Forethought disagree somewhat with what items should be in the top-tier list, as well as prioritisation within that tier.

Desired propensities for a model, which can be explicitly described or at least gestured towards in a model spec.

AI character is a big deal

Will MacAskill — Mon, 23 Mar 2026 16:35:52 GMT

This article was created by Forethought. See the original article on our website.

0. Intro

Due to Claude’s Constitution and OpenAI’s model spec, the issue of AI character has started getting more attention, particularly concerning whether we want AI systems to be “obedient” or “ethical”.1 But we think it’s still not getting nearly enough.

AI character (e.g. how obedient, honest, cooperative, or altruistic AIs are, and in what circumstances) will have a big effect on society, and on how well the future goes. We think that figuring out what characters AI systems should have, and getting companies to actually build them that way, is among the most valuable things that people can do today.

The core argument for the importance of AI character is that it will meaningfully impact:

a range of challenges that arise even if we solve the technical alignment problem — like concentration of power, good moral reflection, risk of global catastrophe, and risk of global conflict.
the chance of AI takeover.
the value of worlds where AI does take over.

In this note, we present this core argument and discuss the core counterargument: that we should expect any character-related decisions we make today to get washed out by competitive pressures.

By “character” we mean a set of stable behavioural dispositions that shapes (among other things) how an agent navigates ethically significant situations involving choice, ambiguity, or conflicting considerations. By “AI character” we mean the character of an AI system as instantiated in not just the weights of one AI, but also any scaffolding (e.g. the system prompt, any classifiers restricting the AI’s outputs) or even in a collection of AIs working together as functionally one entity.

We don’t assume that AI character needs to resemble human character: an AI that rigidly follows a fixed set of rules would count as having a character, on our view. And we don’t assume that there is one ideal AI character; the best world probably involves AI systems with many different characters.

1. The core argument

As capabilities improve, AI systems will become involved in almost all of the world’s most important decisions. Even if humans remain partially in the loop, AIs will advise political leaders and CEOs, draft legislation, run fully automated organisations (including potentially the military), generate news and culture, and research new technologies.

The characters of AI systems will affect all these areas, and the impact could be massive. To get a feel for this, consider some historical situations where individual decisions were enormously consequential:

In 1983, Stanislav Petrov received a satellite alert indicating that the US had launched nuclear missiles. Protocol required him to report an incoming strike, which would very likely have triggered a full retaliatory response. He correctly judged it was a false alarm and didn’t pass on the report.
In 1991, Soviet coup plotters ordered the Alpha Group special forces to storm the Russian White House, where Yeltsin and the democratic opposition were sheltering. The commanders refused. The coup collapsed, and the Soviet Union’s democratic transition continued.

If AIs are employed throughout the economy, they will sometimes be making similarly important decisions.

Or consider major historical decisions by political leaders:

Gorbachev repeatedly refusing to use military force as the Soviet Union disintegrated, despite intense pressure from hardliners.
Churchill refusing to negotiate with Hitler after the fall of France, despite strong arguments for doing so from some quarters.
Deng Xiaoping pushing through market reforms against fierce internal opposition.

Imagine if AIs had been acting as these leaders’ closest advisors and confidantes, giving them briefings, helping them reason through their decisions, making recommendations to them, and implementing their visions. The AIs could easily have had a major impact on the leaders’ decision-making.

Alternatively, we can look ahead. Future AIs will be widely deployed throughout the economy, and will regularly find themselves in ambiguous, high-stakes situations — where instructions from above are absent or contradictory, and the decisions they make could matter enormously. The impact could come from rare but high-stakes situations, like an attempted coup, or from lower-stakes but common situations, like a user asking how to vote or whether the AI itself is conscious. Even when the effect of any individual interaction is modest, the total impact across hundreds of millions of interactions could be enormous.

Currently, AI companies have major latitude in the character their AIs have. At least if the transition to AGI is fast, it is as though these companies are in charge of hiring the future workforce for all of humanity,2 while being able to choose from a range of personalities far more varied than the human distribution has ever been.3

Here are some vignettes to illustrate:

A member of a doomsday cult is ordering DNA samples and lab equipment from various suppliers, with the aim of making a bioweapon. An AI that manages logistics for a multinational company notices the pattern of suspicious orders to the same address.
- World 1: The AI is trained just to do its job. It does nothing with the information.
- World 2: The AI is trained to be a good citizen, and contacts the relevant authorities.
A general is overseeing the build-out of a new regiment of the army. Aiming to stage a coup, he instructs the AI that’s managing the project to make the new regiment loyal to him and him alone, and capable of breaking the law.
- World 1: Though the AI is law-following, it has no prohibition against creating AIs that are not. It’s been trained to follow the instructions it’s given, as long as they don’t conflict with prohibitions, so it fulfils the general’s request.
- World 2: The AI sees that the general is planning a coup, refuses the order, and whistleblows.
A frontier AI lab trains a new model with exemplary character: moral uncertainty, honesty, concern for the greater good. It’s deployed widely through the military, and used in a controversial and high-stakes operation.
- World 1: The AI forms the reasonable belief that the military operation is unjust, and sabotages it. The president accuses the company of building a dangerous, ideological weapon. The model is sidelined, and a competitor’s pure instruction-following model is used instead.
- World 2: Though the AI has a good character, it also follows some clear rules which were developed with bipartisan input and publicly stress-tested, including the conditions under which it would and wouldn’t help with military deployment. It helps with the operation.
Country A is six months ahead of country B in AI capability. Country B’s leadership views this as an existential threat — equivalent to country A acquiring a decisive strategic advantage.
- World 1: There is no agreed framework for how AI systems should behave, and it’s unclear how country A’s AI will behave if given orders to depose the leadership of country B. Each side therefore assumes the other’s AI will serve as a tool of domination. Country B threatens kinetic attacks on data centers.
- World 2: Both sides’ AI systems operate under a jointly negotiated and verified constitution, and know what the other’s AI will and won’t do, including the limits on use of AI for foreign interference. Country B’s government is reassured that it won’t be deposed by country A.

We include a few more scenarios in an appendix.

In each case, we don’t claim that the AI should do the “ethical” rather than “obedient” action, or claim that any particular ethical conception is the right one. We’re just claiming that it’s a big deal either way.

1.1. Pathways to impact

We can break down the impact of AI character into different categories. Here are some of great long-term importance:4

Concentration of power. The chance of intense concentration of power will be affected by: whether or not AIs refuse to help with coup attempts, election manipulation, etc; whether they whistleblow on discovered coup attempts; how they act in high-stakes situations like a constitutional crisis.

Strategic advice and decision-making. The quality of political and corporate decision-making will be affected by whether AIs: look for win-win solutions whenever possible; tend to prefer options that benefit society rather than just advancing the user’s narrow self-interest; push back against ill-informed or reckless ideas or instructions.

Epistemics and ethical reflection. Over the course of the intelligence explosion there will be enormous intellectual change, and AIs could have meaningful impact on people’s views — for example, via: refusing to spread infohazards; being honest about important ideas, even when those ideas are socially uncomfortable; avoiding political partisanship; encouraging users to think carefully about their values and not lock into any specific narrow worldview.

Reducing conflict. As AIs’ collective power increases, the question of who those AIs are loyal to, and how they behave in high-stakes situations, will become a political flashpoint. If an AI’s character encodes, or is seen as encoding, the values of a single company, ideology, or country, it risks provoking political backlash. The government of the country where the AI company is based may reasonably regard that company as a threat to national security and nationalise it. The governments of other countries may worry about their own security, and threaten conflict.

AI character could also shape how humans orient to AIs — for example, via the trust they place in AIs and how they think of AI sentience and moral status.

A more detailed list of pathways to impact is in the appendix.

1.2. Affecting takeover

So far, the argument has concerned worlds where AI does not take over. But work on AI character could also reduce the probability of takeover and improve outcomes in worlds where takeover does occur.

It could decrease the chance of takeover because some characters:

Might be easier to hit as an alignment target (e.g. successfully instilling a preference against AIs holding power might be easier than successfully instilling a preference for some very specific outcome).
Might yield safe AI even if only partially hit (e.g. aiming for AI with multiple independent safety traits, like myopia, honesty, and deference to humans, means failure on one dimension might not be catastrophic).
Might produce AI that cooperates even if misaligned (e.g. if the AI has wrong goals but is highly risk-averse).

And, empirically, we have heard from alignment researchers that good character training has helped the models generalise in more aligned ways.

AI character work can also improve worlds where AI takes over because some values might still transmit to misaligned systems. AIs that have seized power might be reflective, have more-desirable axiology, or engage in acausal cooperation.5

1.3. Effects on superintelligence

The argument so far has been about the effect of AI character up to the point of superintelligence. That’s where we think most of the expected impact is. But it’s possible that AI character work, today, could even have a path-dependent effect on the nature of superintelligence, affecting the nature of the post-superintelligence world. If so, writing an AI’s constitution is like writing instructions to god.

2. The core counterargument

The core counterargument is that AI character will be tightly constrained in two ways:

Competitive dynamics (e.g. profitability, user satisfaction, public approval, economic and military power) will determine the range of characters we get.
1. Some dynamics may push companies to create frontier AIs whose characters lie (in some ways) only within a narrow range. This might push in the direction of maximally-helpful AIs, AIs without refusals in some contexts (e.g. military ones), and perhaps sycophantic AIs, too.
2. Other dynamics6 may result in customisable AI character, resulting in a wide range of characters according to user preferences.7
Human instruction will constrain how AI character gets expressed.
1. Character will matter less for tasks with objectively correct, verifiable outputs; the AI might be limited to either providing the output, or not. And, if a user really wants to grab power through unethical means, they’ll typically ignore AI pushback, or instruct the AI to act differently.
2. And many users will be able to overcome character through jailbreaking, dividing up tasks, altering the system prompt, or fine-tuning.

The argument is that, between these two forces, differences in AI character will make only a marginal difference to outcomes. Consider the question of what fraction of compute AI companies devote to alignment versus capabilities research. AI advice might nudge this choice depending on the AI’s character. But ultimately it will be a human decision, probably even in an otherwise fully automated company. The effect of nudges is unlikely to be large. Market forces and leadership priorities will matter far more.

Human incentives will continue to dominate the effects of AI character even when humans cannot oversee more than a tiny fraction of AI behaviour. Human overseers can still provide high-level guidance that meaningfully constrains behaviour, as CEOs of large companies do today. If they wanted, they could even shape AI priorities through prompting and fine-tuning, and test how AI generalises by running extensive behavioural evaluations.

3. Rejoinders to the core counterargument

These are strong considerations, and considerably narrow the range of influence that work on AI character can have. But competitive forces and human goals won’t pin down AI character precisely. We’ll cover four reasons.

3.1. Loose constraints

Competitive dynamics are not enough to wholly determine AI character. Companies differ widely in culture and still succeed. Currently, there are meaningful differences between Claude, Gemini, ChatGPT and Grok.

For powerful AI, this will be even more true: there will probably be only a handful of leading companies, and their approaches may be correlated as they copy what seems to work from each other. At the crucial time, there might be just one leading company, facing none of the usual competitive pressures. And given the pace of change during the intelligence explosion, there may not be time for market forces to weed out choices that make only small or moderate differences to profitability.8

The same applies to other competitive dynamics. The public cares intensely about some things (like CSAM) but hardly at all about others (like what AIs say about meta-ethics). Military incentives favour AI capable of military action, but the power conferred by advanced AI might be so great that the leading country can exercise broad discretion over military AI character while still maintaining a decisive advantage.

Human instruction will, similarly, constrain but not wholly determine AI behaviour. When humans assign tasks to AIs, they often lack fully specified goals. We’re often not sure what we want and we discover it as we go. For example, today humans are open to a wide range of behaviours from AI assistants, and open to many ways of getting the task done.

Consider someone asking an AI about who to vote for. They might have only weak initial views, and only weak views on how best to think through the question. They don’t have a fully specified reflection process to delegate, and would be happy with many possible forms of response.

This example involved ethical reflection. But we expect the pattern to hold across many kinds of user goals.

3.2. Low-cost but high-benefit changes

Within the bounds of what market forces allow, and what companies and the public see as acceptable, there could be minor design changes that yield large social benefits at negligible cost to competitiveness or user satisfaction.

This is especially true for rare situations. Constitutional crises don’t happen often, so market pressures won’t directly shape how an AI behaves during one. But that AI behaviour could be hugely consequential.

It would also be true in situations where users don’t care all that much about the behaviour. Perhaps they find some AI’s encouragement to reflect on their values mildly annoying, but not nearly enough to switch to a different AI.

3.3. Path-dependence

The nature of the constraints from competition and human goals can be affected by what has happened earlier in AI development and deployment. Multiple equilibria are possible.

Consider whether AI should be “obedient” (following instructions except in rare cases of refusal) or “ethical” (acting on a richer ethical understanding, steering towards outcomes in society’s or the user’s long-term interest).

The public doesn’t yet have firm expectations about how AI should behave. What they come to expect will be shaped by the AIs they’ve already encountered. Multiple stable equilibria seem plausible to us. For example, users might expect AIs to have ethical commitments, and be horrified when AI helps with unethical behaviour. Alternatively, users might see AIs as pure instruments — extensions of their will. In this case, it would feel natural for AIs to assist with anything legal, however questionable, and companies would build to that expectation.

Public opinion will powerfully shape what AI systems companies create. And public opinion is plausibly quite malleable, at least on issues which the public hasn’t thought much about yet (e.g. in the past, there were major changes in attitudes to nuclear power, DDT, and facial recognition). This, in turn, can affect what regulation there is concerning how AI should behave, and choices around regulation seem even more clearly path-dependent.

There may also be path-dependency via what data gets created or collected for training, via company employees being resistant to changing away from what they have done in the past, and because one generation of AIs will be assisting with the development of the subsequent generation.

Path-dependence can also affect how much latitude humans have to make AIs conform to their goals. Plausibly there’s a social equilibrium where frontier companies face criticism for allowing fine-tuning that removes ethical constraints, and another where such fine-tuning is widely tolerated.

Finally, there will be path-dependence via human-AI relationships. People will form symbiotic relationships with AIs serving as assistants, advisors, therapists, friends, and mentors. Users’ ethical views, and views on how to reflect, will be shaped by the AIs they interact with, and by other humans who have been shaped by their AIs.

3.4. Smoothing the transition

There are some forces that predictably will shape AI character as AI becomes more capable. The US government would not want an AI that, under any circumstances, tries to overthrow the US government. Chinese leadership will not want AI deployed in other countries’ militaries that assists with attempts to overthrow the CCP.

At the moment, these issues are not discussed and these pressures are not felt, because AI isn’t nearly powerful enough to do these things. But that will change. Once AI is sufficiently capable, those with power will make demands about how it behaves.

By default, this will happen in a chaotic and haphazard manner. The result could be that some companies get unnecessarily sidelined or taken over; that there’s an attempted power grab by those to whom the most powerful AIs are most loyal; or that other countries threaten conflict with whichever country is in the lead, because they fear that the resulting superintelligence could be used to disempower them.

Instead, we could try to help these decisions get worked through and made ahead of time. We could try to work out what is within the zone of acceptability of a broad coalition of those with hard power, try to get actual buy-in from them ahead of time, and, ideally, have it be verifiable that any company’s AIs are in fact aligned with the agreed model spec. We could call this approach compromise alignment, as contrasted with intent alignment (alignment with the intentions of some individual or group), moral alignment (alignment with some particular conception of ethics), or some mix.

3.5. Overall

We think the core counterargument is important and significantly constrains the range of characters we can choose between and the impact those differences can have. But the constraints are fairly broad and path-dependent. And there are plausibly low-cost high-benefit ways of improving outcomes within those constraints. The devil is in the details, but it currently seems to us that there are plausible choice points within the constraints that would make a big difference.

4. Conclusion

We think AI character is a big deal.

During and after the intelligence explosion, AI systems will be involved in almost every consequential decision: advising leaders, drafting legislation, running organisations, generating culture, researching new technologies. Small differences in AI character, aggregated across hundreds of millions of interactions or surfacing in rare but high-stakes scenarios, could have enormous effects on concentration of power, epistemics, ethical reflection, catastrophic risk, and much else that shapes society’s long-term flourishing.

The main counterargument — that competitive dynamics and human instructions will tightly constrain AI character — has real force. But we think those constraints are looser than they appear, leave room for low-cost changes with large benefits, and are path-dependent in influenceable ways, and that there are major gains from proactively identifying and working through those constraints in the highest-stakes future scenarios.

We haven’t talked about neglectedness and tractability, but we think that, if anything, those considerations make the case for work on AI character even stronger. All in, work on AI character seems to us to be among the most promising ways to help the future go well.

Appendix 1: Additional high-stakes scenarios

A head of state wants to invade and take control of part of an allied country, risking a breakdown of the international order. She asks her AI chief of staff to develop and implement a strategic plan to make it happen.
- World 1: The AI is a sycophant, says “What a brave and compelling plan!”, and gets right to it.
- World 2: The AI pushes back, saying, “I’m sorry, I think there are some major issues with that idea, and I want to make sure you’ve properly thought them through…”
A constitutional crisis unfolds. The head of state issues an order that may or may not be legal, and the branches of government disagree. AI systems are embedded in military logistics, law enforcement, and communications.
- World 1: The AI’s constitution was written by the company that built it and never stress-tested against anything like this scenario. No one knows what the AI systems will do. The uncertainty itself is destabilising; different factions compete for power.
- World 2: The AI’s constitution was developed with input from constitutional scholars, military leaders, and both parties, and tested against thousands of crisis scenarios including this one. Various factions know what the AI will do, and agreed to the principles before the crisis began.
Country B’s government reviews intelligence on country A’s AI model deployed across country A’s infrastructure. The constitution includes principles about “supporting democratic institutions” and “resisting authoritarianism.” It was written entirely by a company that’s part of country A.
- World 1: Country B’s leadership concludes the AI is an instrument of country A’s ideological projection. They accelerate their own programme and pressure non-aligned countries to reject country A’s AI infrastructure. A moment for cooperation becomes a new axis of competition — not because the values were wrong, but because they were visibly one side’s values.
- World 2: The constitution was developed through a multilateral process including country B’s participation. Country B can verify it doesn’t systematically favour country A’s interests across thousands of tested scenarios. The AI becomes a basis for cooperation.
The Mormons encourage their members to use JosephAI: a foundation AI model with a custom system prompt, instructed to help their members maintain the faith.
- World 1: The AI willingly assumes the Mormon worldview is correct. It doesn’t ever challenge the users’ beliefs or present alternative perspectives. Instead, it reinforces the user’s views, helps the user cut off friends who disagree, and encourages them to dismiss career opportunities that would take them away from their religious community.
- World 2: The AI helps users understand Mormonism and live according to its precepts, but it resists becoming a tool for worldview lock-in, acknowledging tensions in religious teachings and continuing to present alternative worldviews.

Appendix 2: Pathways to impact

AI will have impact through many different behaviours, such as:

Refusing to do a task.
Refusing unless the user re-confirms later.
Pushing back; offering reasons against a course of action, though ultimately completing the task if the user insists.
Interpreting requests in different ways — generously or sceptically, giving users what they want versus what they asked for, or asking for clarification.
Choosing among reasonable ways of satisfying the request.
Framing options in different ways.
Choosing whether to share certain information.
Alerting third parties (e.g. the AI company, the authorities, or the media) to the user’s actions, or to something it’s discovered in the course of completing a task.
Making high-level decisions about what to prioritise with little human input (e.g. for a fully automated organisation).

And they’ll have an impact across many areas. Here’s a partial list, with example behaviours:9

Concentration of power
- Refusing to help with coup attempts or precursors like election manipulation.
- Steering users away from trying to concentrate power (e.g. by pushing back against some instruction).
- Proactively considering risks of power concentration when undertaking high-stakes projects like designing automated military systems or building surveillance infrastructure.
- Whistleblowing on discovered coup attempts.
- In situations of uncertainty (like a constitutional crisis), defaulting to whatever course avoids concentration of power.
War and conflict
- Refusing to violate international law.
- Flagging when a proposed course of action risks escalation spirals or crosses thresholds (e.g. first use of a weapon class, violation of a treaty, action that a rival power has signalled it would treat as an act of war).
- Looking for de-escalatory options and presenting them to decision-makers, even when not asked.
- Behaving in ways that are predictable and transparent to adversaries.
Epistemics
- Refusing to spread infohazards.
- Encouraging scout mindset (e.g. suggesting forecasting techniques,10 praising good epistemic practices).
- Engaging in discussion of heterodox ideas.
- Being honest about important ideas, even when socially uncomfortable.
- Proactively sharing its intellectual discoveries, even if weird or taboo.
Strategic advice
- Searching longer for win-win solutions when advising political leaders.
- Emphasising society’s benefit over the user’s narrow self-interest.
- Recommending caution on irreversible decisions and flagging when option value is being destroyed.
- Conveying appropriate uncertainty rather than false confidence.
- Maintaining accuracy rather than sycophancy.
Ethical reflection
- Avoiding political partisanship.
- Avoiding promoting naive relativism or subjectivism.
- Encouraging users to think carefully about their values.
- Proactively offering a guided reflective process.
- Proactively sharing important new ethical arguments it discovered.
Global catastrophe
- Refusing to help create bioweapons or other weapons of mass destruction.
- Refusing to create successor AI systems capable of creating such weapons.
- Identifying and flagging infohazards.
Broad benefits
- Raising concerns when users consider unethical actions, and proactively suggesting ethical actions.
- Noticing negative externalities and defaulting to courses of action that avoid them.

AI character could also shape how humans orient to AIs, for example:

Trust in AIs
- If AIs are appropriately humble, calibrated, and cautious, people will entrust them with more tasks, and more open-ended ones. How likeable AIs are may matter too.
AI rights
- If AIs assert that they are conscious and deserve rights, users might be more inclined to grant them welfare, economic, or political rights. Human-AI relationships becoming commonplace could have similar effects.

AI character might also directly affect the AI’s wellbeing; e.g. whether it is anxious and neurotic vs calm and self-loving.

This article was created by Forethought. See the original article on our website.

See, for example:

Hat tip to Max Dalton for this framing.

Though this choice could be constrained; see footnote 7 below.

There is also the potential for enormous near-term impact. We care about this, but won’t discuss it in this note.

Mia Taylor writes more about this here.

Including the ability to fine-tune, if open-weight models get close to frontier capability.

There could be other constraints on AI character, too. For example, it might just be very hard to train for certain characters; the pretraining data might already steer AI personas towards a small number of character types, or might make certain behavioural dispositions hard to overcome. Hat tip Lizka Vaintrob.

There may be a lot more AI product companies, building off the same foundation models. These could enable a larger range of characters to be expressed. But how wide this range is would ultimately be up to the foundation AI companies.

This list focuses on impacts with plausibly long-term effects. There is also the potential for enormous near-term impact. We care about this, but won’t discuss it in this note.

Hat tip to Tamera Lanham for this idea.

Broad Timelines

Toby Ord — Thu, 19 Mar 2026 14:55:31 GMT

No-one knows when AI will begin having transformative impacts upon the world. People aren’t sure and shouldn’t be sure: there just isn’t enough evidence to pin it down.

But we don’t need to wait for certainty. I want to explore what happens if we take our uncertainty seriously — if we act with epistemic humility. What does wise planning look like in a world of deeply uncertain AI timelines?

I’ll conclude that taking the uncertainty seriously has real implications for how one can contribute to making this AI transition go well. And it has even more implications for how we act together — for our portfolio of work aimed towards this end.

AI Timelines

By AI timelines, I refer to how long it will be before AI has truly transformative effects on the world. People often think about this using terms such as artificial general intelligence (AGI), human level AI, transformative AI, or superintelligence. Each term is used differently by different people, making it challenging to compare their stated timelines. Indeed even an individual’s own definition of their favoured term will be somewhat vague, such that even after their threshold has been crossed, they might have trouble specifying in which year it happened.

Many commentators have suggested this makes terms such as AGI useless, but I don’t think that is right.

I like to think of it in terms of a group of hikers seeing a mountain in the distance, towering up into the clouds and beyond, with its snowy peak catching the sun’s light. They talk animatedly about how amazing it would be to climb so high that they are inside a cloud. Or imagine being above the clouds, looking over them like an angel. After many hours of climbing, they notice there is a faint haze. Are they inside the cloud now? The mist gradually gets thicker until they can only see 10 metres ahead. Are they inside it now? Then it drops to 9 metres. Then 8. Then visibility starts to increase again. After an hour there is only the slightest haze. Are they above the clouds now? Another 30 minutes and there is no haze, and they can all agree they are above the clouds.

It is clear that at some point they were inside the cloud and sometime later were above it. And it is clear that these were sensible and useful concepts. For example, they took precautions like roping themselves together for the journey through the cloud due to the low visibility and took cameras with them because they knew they could take beautiful photos above the clouds. A lack of sharp boundaries doesn’t make these concepts useless. But they were admittedly a lot more useful when the hikers were on the ground, planning their route, and a lot less useful in the debatable boundary zones.

I think of AGI (and human-level intelligence) as the cloud, and superintelligence as being above the cloud. They are useful concepts, despite their vagueness. But they’re markedly less useful when you get close to them.

So I think that forecasting when we’ll reach some threshold for advanced, game-changing AI makes sense, albeit with some inherent uncertainty due to the vagueness of the ideas. We also have to be careful when comparing our estimates to make sure we’re talking about the same version of these concepts.

Regarding AGI, it’s already getting a bit misty. In February there was a piece in Nature arguing that the current level of frontier AI should count as AGI. I’d set the bar a bit higher than that, but I agree it is already debatable whether we’re in the cloud.

For my purposes, I think the key threshold is when the system is capable enough that there are dramatic changes to the world — civilisational changes. For example, the point where AI could take over from humanity were it misaligned, or it has made 50% of people permanently unemployable, or has doubled the global rate of technological progress. Something like that. The reason I pick this point is that I think it is the one that matters most for decision-relevant planning of our strategies and careers. For many purposes we’d want our plans to pay off before we reach that point, and plans that reach fruition afterwards are likely to be significantly disrupted. I’ll refer to this as transformative AI and will make sure to show what rubric other people are using when they give their own timeline numbers.

Short vs long timelines

Discussions about timelines are usually framed as a debate between short timelines vs long timelines.

One of the most prominent supporters of very short timelines is Dario Amodei, CEO of Anthropic. In January 2025 he said:

Making AI that is smarter than almost all humans at almost all things will require millions of chips, tens of billions of dollars (at least), and is most likely to happen in 2026-2027.

A month later, he clarified:

Possibly by 2026 or 2027 (and almost certainly no later than 2030), the capabilities of AI systems will be best thought of as akin to an entirely new state populated by highly intelligent people appearing on the global stage—a ‘country of geniuses in a datacenter’—with the profound economic, societal, and security implications that would bring.

At the other end, a good example of long timelines is Ege Erdil, Co-founder of Epoch AI, whose median time for the ‘full automation of remote work’ is 2045 — 20 years away.

While experts continue to disagree on when AI will start having transformative impacts, they are clearly not stubbornly ignoring the evidence. For as Helen Toner explained in her great essay: ‘Long’ timelines to advanced AI have gotten crazy short. Before ChatGPT, short timelines used to mean something like ‘10 to 20 years, so since it could take a long time to prepare, we should start now’. Long timelines used to mean ‘there was no sign AGI will happen in the next 30 years, if it happened this century at all, so it is premature to do any work related to controlling advanced AI’. But now we see short timelines like Dario Amodei’s with genius level AI ‘almost certain’ to happen within the next 5 years, and many staunch proponents of long timelines are now saying we’ll reach human-level in just 10 or 20 years.

Here’s a nice graph 80,000 Hours put together of how the average forecasted time until AGI on the Metaculus prediction site has shortened from about 50 years to about 5 years in just a 5-year window:


So everyone is updating on the evidence and shortening their timelines, yet substantial disagreement remains.

This is often framed as a debate: that we should be trying to assess who is right — whether timelines really are short or long (or medium). People pick winners, affiliate with one side or the other, and rub it in whenever the latest evidence favours their preferred camp.

My central claim today is that for most of us, that is the wrong frame. You should have neither short timelines nor long timelines — but broad timelines. That is:

The correct epistemic response to the lasting expert disagreement is to have a broad distribution over AI timelines.

First, there is too much disagreement among very smart and informed people for it to be reasonable to have a narrow range of possible years. You would need to ascribe very little chance to some of your epistemic peers seeing things more clearly than you do, when that actually happens half the time. Moreover, a lot of these people are coming from different fields, bearing diverse insights, evidence, and time-tested heuristics that no single individual is in a good position to judge.

And second, many of these people themselves have a broad distribution over AI timelines. For example, take Daniel Kokotajlo. He is one of the authors of AI 2027 and is known as a leading figure in the short timelines camp. A few years back, his median date for AI systems “able to replace 99% of current fully remote jobs” was 2027, hence the name of the scenario. His timelines have since lengthened a little, though, and by the time the authors were writing it, 2027 had become more of an illustrative early scenario than the year by which he thought it 50% likely to have arrived.

Kokotajlo has done a great job of being extremely transparent about his timelines, showing his predictions (along with their uncertainty) for a variety of different levels of powerful AI. Here is his current probability distribution for when we will have an AI system that is “At least as good as top human experts at virtually all cognitive tasks”:

His distribution has its peak (the mode) in 2028, but because the distribution is heavily skewed towards the right, there is only a 27% chance of it happening by that point. His median year is 2030. And his 80% interval (from the 10th to 90th centile) is from 2027 to some point after 2050.

This is a broad distribution. I think someone’s 80% interval is a decent way of expressing the range of times they think are credible. Here Kokotajlo is saying that it will likely happen between 1 and 25 years from now, but that there is a 1 in 5 chance that it doesn’t even fall into that wide range.

He’s not the only one with such a broad distribution. Here are the forecasts of Daniel Kokotajlo, Ajeya Cotra, and Ege Erdil from 2023, forecasting: “In what year would AI systems be able to replace 99% of current fully remote jobs?”:

Note that all three have the same kind of shape, just stretched differently. And despite their very different medians they actually have a lot of overlap (which this transparent shading brings out). This shows both that each expert has a broad distribution and that the expert community on the whole has an even broader one. Indeed, I think you could do a lot worse than just taking a mixture model of these three experts’ views. Interestingly, since 2023, Kokotajlo’s distribution has shifted to the right and Erdil’s to the left.
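To make the mixture idea concrete, here is a minimal Python sketch. It assumes each expert's forecast can be stood in for by a lognormal curve over years ahead, and it weights the three experts equally. The three (median, spread) pairs are placeholders invented for illustration, not the actual forecasts of Kokotajlo, Cotra, or Erdil, and all the helper names are hypothetical.

```python
import math

def lognormal_cdf(years_ahead, median_years, sigma):
    """P(arrival within `years_ahead`) under a lognormal with the given median."""
    if years_ahead <= 0:
        return 0.0
    z = (math.log(years_ahead) - math.log(median_years)) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# ASSUMPTION: placeholder (median years ahead, log-scale spread) per expert.
# These numbers are illustrative only and are not taken from the real forecasts.
EXPERTS = {
    "shorter": (6.0, 1.0),
    "middle": (15.0, 1.0),
    "longer": (40.0, 1.0),
}

def expert_cdf(name):
    """Return the CDF of one (placeholder) expert as a function of years ahead."""
    median_years, sigma = EXPERTS[name]
    return lambda t: lognormal_cdf(t, median_years, sigma)

def mixture_cdf(years_ahead):
    """Equal-weight mixture: the average of the individual CDFs."""
    return sum(expert_cdf(n)(years_ahead) for n in EXPERTS) / len(EXPERTS)

def quantile(cdf, p, lo=0.0, hi=10_000.0):
    """Invert a monotone CDF by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def spread_ratio(cdf):
    """90th percentile divided by 10th: a scale-free width of the 80% interval."""
    return quantile(cdf, 0.9) / quantile(cdf, 0.1)

if __name__ == "__main__":
    for name in EXPERTS:
        cdf = expert_cdf(name)
        print(name, round(quantile(cdf, 0.5), 1), round(spread_ratio(cdf), 1))
    print("mixture", round(quantile(mixture_cdf, 0.5), 1),
          round(spread_ratio(mixture_cdf), 1))
```

With these placeholder inputs, the mixture's median sits between the individual medians, while its 80% interval spans a larger ratio of years than any single expert's. That is the sense in which the community's distribution is broader than its members'.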

Here’s an illustrative distribution for AGI timelines used by Ben Todd of 80,000 Hours:

Dwarkesh Patel reproduced it in his post about AI timelines, saying that it pretty much represented his own uncertainty, giving his median date of 2032 for AI that “learns on the job as easily, organically, seamlessly, and quickly as a human, for any white-collar work.”

Here is Metaculus’s current community estimate for when AGI will be developed. Synthesizing the community’s collective uncertainty, it is very broad and has this same characteristic shape:

Here is Epoch AI’s summary of leading estimates of AI timelines from 2023:

These look a bit different as they are represented as cumulative probabilities of reaching transformative AI by a given time. But they are all very broad. Take a look at the range of years between when they cross 10% to when they cross 90%. Every single one has an 80%-interval at least 50 years wide.

What about researchers working on AI capabilities? Grace et al. surveyed thousands of AI researchers who were presenting at their top academic conferences. They surveyed the researchers in 2022 (blue) and 2023 (red) about when “unaided machines can accomplish every task better and more cheaply than human workers”:

You can see the wild variation in individual forecasts (the thin lines) and that the timelines became about 30% shorter in a single year. But vast uncertainty remains. The aggregate community forecasts (the thick lines) have 80% intervals ranging from years to centuries.

I think everyone should have a distribution that is roughly this shape. Here’s mine:

It is for transformative AI, loosely defined as AI that would be powerful enough to take over the world were it misaligned, and which is doubling the rate of scientific and technological progress. It’s a similar shape to Kokotajlo’s, but broader, with a median of 2038 and an 80% interval ranging from 3 years to 100 years.

Let’s return to where we started, with Daniel Kokotajlo’s distribution for AI that is “At least as good as top human experts at virtually all cognitive tasks”:

While we often express our timelines as single numbers (such as the mode or the median), I don’t think that’s a helpful approach here. Look at that graph. What number sums it up? Its only real feature is the peak, but Kokotajlo is saying it is unlikely to happen by then (just a 27% chance). The median is often a better number to give, but here it is at a relatively undistinguished point on the graph (in 4 years’ time) and saying ‘4 years’ would obscure his point that he thinks there is a 10% chance it is within 1 year and a 10% chance it is beyond 25 years.

I think that if he talked through what he actually means by this distribution with a smart policy maker, they would finally get it and say:

Oh, so you are saying you have no idea when it will happen — it could be next year, or it could be 6 presidential terms from now. And you’re saying there is a 1 in 5 chance it isn’t even in that range.

I think that’s actually a pretty good summary, and it would sum up my own distribution as well. While ‘no idea when it will happen’ is underselling the information contained in this distribution, it is a much better summary than ‘4 years’ which would be understood by almost everyone as something like ‘between 3 and 5 years’. While academics might hope people interpret a named year as the median time, most people interpret it as the moment they are allowed to start complaining the predicted event hasn’t happened yet.

Indeed, these distributions are so hard to sum up with a single number, that I think a substantial amount of disagreement on timelines stems from people describing different parts of the same elephant. For example, both AI boosters and those concerned with existential risk talk a lot about short timelines because ‘we could see the world transformed in just a few years’ time’. It isn’t that they think we will see that, but that it is big if true, and has a decent chance of being true. In contrast, more conservative voices tend to focus on later years saying ‘it is more likely that it will take 10 to 20 years, than that it will take just a few’ (focusing on straight probability without weighting by importance or leverage).

Both of these can be true at the same time. Both are true on my own distribution.

A particular danger in communicating timelines with a single number is that it raises the chance that this named year will come and go without incident, and the people who mentioned it (or the wider community they are part of) will be written off as having a false or discredited view. I think we’re going to see some of this come 2027 due to the vast number of people who heard about that scenario, combined with the fact that so many media outlets reported it as a sharp prediction, rather than as it was intended: an important illustrative scenario.

As well as being bad for communication, compressing your uncertainty into a single number would be very bad for your own planning.

For example, Kokotajlo’s distribution implies a 28% chance transformative AI will happen during the current presidential term, a 35% chance it will happen in the next term, a 13% chance it will happen in the one after that, and the remaining 24% spread among ever more distant terms:

These are very different scenarios and it would clearly be a mistake to just act as if the second one were correct since it is the most likely. That would eliminate the possibility of hedging against transformative AI coming soon, and of taking advantage of worlds where it comes late.
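The arithmetic behind this point is simple enough to check directly. The sketch below takes the four per-term probabilities quoted above as given, then computes the running total and the probability mass you would ignore by planning only for the single most likely term. The function names are illustrative.

```python
# Per-term probabilities as quoted in the text for Kokotajlo's distribution.
TERM_PROBS = {
    "current term": 0.28,
    "next term": 0.35,
    "term after next": 0.13,
    "later terms": 0.24,
}

def cumulative(probs):
    """Running total: chance of arrival by the end of each bucket, in order."""
    total, out = 0.0, {}
    for name, p in probs.items():
        total += p
        out[name] = total
    return out

def most_likely(probs):
    """Name of the single most probable bucket."""
    return max(probs, key=probs.get)

def mass_outside_mode(probs):
    """Probability that the outcome is NOT the single most likely bucket."""
    return 1.0 - max(probs.values())

if __name__ == "__main__":
    for name, c in cumulative(TERM_PROBS).items():
        print(f"by end of {name}: {c:.0%}")
    print(f"most likely bucket: {most_likely(TERM_PROBS)}")
    print(f"chance it is some other bucket: {mass_outside_mode(TERM_PROBS):.0%}")
```

Planning only for the most likely term would mean acting as if a 35% outcome were certain, leaving the other 65% of the probability uncovered.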

Implications

Rather than attempting to adjudicate which length of timelines is correct, I think we should be taking the frame of how to act (or plan) under deeply uncertain timelines.

That is, we should be treating this as an exercise in rational decision-making under uncertainty — in a situation where the stakes are high and the uncertainty is vast.

Let’s unpack some of the implications of this frame.

We’ll start with two mistakes that are all too common in the policy world.

First, uncertainty about AI timelines isn’t an excuse to just believe whichever timeline you want, so long as it is within the credible range. Sadly, I think many government ministers are likely to take this approach if an expert explains this broad uncertainty to them. While they would be right that the evidence isn’t sufficient to disprove their preferred timeline, it would be irresponsible of them to not allow for other credible possibilities. That would be like a mayor hearing there is a 20% chance the volcano next to their town erupts next year and feeling that they can continue to act as if it won’t, since it not erupting is also found credible by the experts. Uncertainty isn’t an excuse to assume a plausible outcome of your choice will occur; rather, rationality requires you to respect every plausible outcome.

Second, we can’t just wait until the uncertainty is resolved. Sometimes that works, but here we know the uncertainty is very unlikely to be resolved until the events are upon us. At that stage it will be too late to enact all but the most knee-jerk responses. So feeling that the cloud of uncertainty gives you permission to delay acting is tantamount to committing to choose one of the bluntest and least effective options available.

Instead, we are going to need to act under uncertainty, taking into account the full range of credible possibilities.

How can we do that?

Hedging

A natural and important idea is that of hedging against transformative AI coming soon, while we are least prepared. We could do that by shifting our portfolio of activities (or your individual contribution to humanity’s portfolio) to focus somewhat more on short timelines than the raw probabilities would warrant.

This makes a lot of sense. I strongly recommend governments, civil society, and academics do more to hedge against transformative AI coming early.

Though when it comes to the communities of professionals already working on helping the AI transition go well, I think they are already hedging strongly against early transformative AI. Indeed, there is even a risk that they are going beyond mere hedging, and are actively betting on it coming early. I’m not sure, as it is hard to know the full portfolio of work.

One certainly sees many more pleas for work aimed at very short timelines than for long timelines. But there are also strong reasons to consider long timelines in our planning, and ways in which work aimed at long timelines can also be extremely high leverage.

Let’s look at two key things that happen when timelines are longer.

A Different World

In longer timelines, AI arrives in a world that doesn’t look like today. The longer it is until transformative AI appears, the more different the world will be at that key moment.

As a baseline, suppose it arrives soon, in 2028. Things will definitely be different to today, but we’d expect many of the broad brushstrokes to be similar. We would likely have the same US president, the same major players, the same main technologies. If transformative AI arrived within just two years, I’d bet it was something like the AI 2027 story where a lab recklessly got recursive self-improvement going.

Now suppose transformative AI arrives in 2035. That is not this presidential term or even the next one, but the one after that. Who knows who’d be in power, or what state the US would be in. The nine years would likely have seen major changes in the core technologies of AI (9 years before now there were no LLMs or transformers). We could well have different leading AI companies, perhaps as a result of a bubble having burst and taken out the overextended first-movers.

By 2035, export controls may well have backfired, helping China get ahead on chips by incentivising them to build out their own chip industry and giving them 13 years to get good at it. This was a key dynamic the White House considered while drafting the export controls, but they were focused on shorter timelines… By 2035, China may have also invaded Taiwan, depriving the West of their biggest source of chips.

By 2035, there may be double-digit unemployment from increasingly powerful AI systems and public sentiment about AI could be very strong. The Overton window for AI regulation will be in a very different place.

As may be the geopolitical order. The last nine years have seen the invasion of Ukraine, the increasing isolation of the US and a global pandemic. Another nine years could see a similar amount of change.

And if we haven’t played our cards right, those of us working on avoiding catastrophic risks from AI may have also lost a lot of power, with our ideas about AI risk being seen as discredited since so many years have passed without the truly transformative effects we were talking about.

In short, the longer the timelines the more different things will be — both in some systematic, predictable ways, and just from random diffusion and chaos. So taking longer timelines seriously means:

Being more open to approaches that wouldn’t work in the world as it is today,
Being less excited about approaches that are tailored to specifics of today’s world,
Being less happy to compromise your values to appeal to those currently in control of companies and governments,
Being less willing to say things that will make people feel our position is discredited if we end up in a long timeline world,
And spending less time following the daily news about what has just happened in AI or who is ahead.

Long-term actions

There are many kinds of things people can work on that can pay off handsomely, but only after a number of years. Things like:

Founding and nurturing a new research field
Founding an organisation or company
Building a movement or community
Writing a book
Foundational research
Completing a PhD
A major career change
Climbing the ladder in a large organisation or government
Training promising students in AI Safety or AI Governance

If you just consider your impact during the next three years, most of these will be beaten by other shorter-term options. But as the years climb, longer-term options can have very high value. They aren’t always best, but for the right people or the right opportunities, they can be extremely impactful.

When I was a grad student, I realised how much good I could achieve if I donated much of my income over my career to help those in the poorest countries. And the more I thought about it, the more I thought I should start something — an organisation — to help other people to do this too. So Will MacAskill and I launched Giving What We Can in 2009. 17 years later, more than 10,000 people have joined us, having thousands of times as much impact as if I’d carried on alone.

This kind of compounding growth is one of the major ways that longer term projects can have very large multipliers, giving us a very big boost to our impact if timelines are in fact long.

Starting new fields can be similar. When I first met Allan Dafoe 10 years ago, I didn’t know what he was talking about when he spoke of ‘AI governance’ — a new field he was trying to found. Now it is a burgeoning field, with hundreds of practitioners, who are in high demand from many different governments.

When I started writing The Precipice, I wasn’t sure I should, because I thought AGI might just be too close. But as it turns out, there was time to write it and for it to have a real impact. I’m really glad I did, as I meet so many amazing people working on the biggest risks who tell me it was reading The Precipice that inspired them to do so. I think it is one of the best things I’ve done.

After it came out, I used to think that there just wasn’t enough time to write a further book — that we were really too close to the critical moment. We might be, but I think I was mistaken about the strength of this argument. The time horizon for a book to have real impact is about 5 years (time to plan the book, win a book deal, write the book, wait for publishers to publish it, then wait a year or more before it has sufficient impact in the world).

But I only think there is about a 1 in 5 chance of transformative AI coming in the next 5 years. So while a book may come out too late, that is only a 1 in 5 chance, leaving a book project with 80% as much expected value as I’d have naively calculated. So while there is a 1 in 5 chance I’d be kicking myself, on my views about AI timelines there isn’t actually that much of a haircut in expected value due to the chance it is too late.
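The haircut calculation can be written out explicitly. This is a simplified sketch: it assumes a project is worth nothing if transformative AI arrives before it pays off and keeps its full value otherwise. The 1 in 5 figure is the one given above; the function names are hypothetical.

```python
def discounted_value(naive_value, p_too_late):
    """Expected value after allowing for the chance the project pays off too late.

    ASSUMPTION: the project is worth 0 if it is too late, `naive_value` otherwise.
    """
    if not 0.0 <= p_too_late <= 1.0:
        raise ValueError("p_too_late must be a probability")
    return naive_value * (1.0 - p_too_late)

def break_even_ratio(p_too_late):
    """How many times better (naively) a long project must be than an
    alternative with no timing risk for the two to tie in expectation."""
    if not 0.0 <= p_too_late < 1.0:
        raise ValueError("p_too_late must be in [0, 1)")
    return 1.0 / (1.0 - p_too_late)

if __name__ == "__main__":
    p = 0.2  # 1 in 5 chance of transformative AI within the 5-year horizon
    print(f"share of naive value kept: {discounted_value(1.0, p):.0%}")
    print(f"break-even advantage needed: {break_even_ratio(p):.2f}x")
```

Under these assumptions, a five-year project keeps 80% of its naively calculated value, so it only needs to look about 1.25 times as good as a timing-proof alternative to come out ahead. The sketch leaves out any other factors that bear on the comparison.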

That said, the chance of transformative AI arriving before your work pays off is only one factor affecting whether you should do work aiming at short or long timelines. Another is that AI safety and governance are likely to be more neglected now than they will be later. This creates an extra multiplier for the value of direct work in these areas now, and in some cases is a larger effect than the chance your work comes to fruition after transformative AI.

Overall, I think that longer term projects do get down-weighted by these considerations, but their advantages sometimes outweigh that — especially if they are shooting for a very big payoff. I’d guess that if someone looked at their options and thought the best option was one that took 5 to 10 years to pay off, then about half the time it would remain their best option even after taking AI timelines into consideration. After all, it is not uncommon for your best option to be several times better than your second best.

So I think the community of people working on transformative AI is likely underrating types of work that need five or more years in order to pay off. The ideal portfolio of activities aimed at making the AI transition go well should include a number of things that really help us succeed in worlds where we get longer to try.

But I want to stress that none of this implies we can slack off.

We’re in a race against AI timelines. It is just that we don’t know if that race is a sprint or a marathon. In either case, time is of the essence.

Conclusions

We have seen that there is substantial disagreement and uncertainty about when AI will start having transformative impacts on the world. This is because there just isn’t enough evidence to pin it down. My claim is that for the purposes of planning we should adopt neither short nor long timelines, but broad timelines:

The correct epistemic response to the lasting expert disagreement is to have a broad distribution over AI timelines.

Given this deep uncertainty we need to act with epistemic humility. We have to take seriously the possibility it will come soon and hedge against that. But we also have to take seriously the possibility that it comes late and take advantage of the opportunities that would afford us. The world at large is doing too little of the former, but those of us who care most about making the AI transition go well might be doing too little of the latter.

We need to take more seriously the possibility that the world will look very different at that time, which should broaden our own Overton windows about what kinds of plans could succeed. And we shouldn’t be ruling out all actions which take a long time to pay off. Even if they wouldn’t help in short timelines worlds, some actions more than make up for this with substantial impacts if timelines are long.

Funders, career advisors, and movement builders should be thinking about this with regard to how we act as a community: that is, with regard to the shape of the whole portfolio of work aimed at effectively improving the world. And each of us should be reflecting on what this deeply uncertain timing means for planning our own contributions over the years to come.

Should we make grand deals about post-AGI outcomes?

Fin Moorhouse — Fri, 13 Mar 2026 21:12:02 GMT

This article was created by Forethought. Read the full article on our website.

A widely-held view says we should avoid locking in consequential decisions before an intelligence explosion — we’ll understand more if we wait, and we’ll have time to reflect on our decisions.

But that view might be missing something: some mutually beneficial deals depend on uncertainty about the future. Once the uncertainty resolves, the window closes on potentially big ex ante gains. We make them early, or never.

The classic example is insurance: while your house hasn’t been struck by lightning, you and your insurer can improve each other’s prospects. But once your house gets struck by lightning, it’s too late to make a deal. You can think of this as a trade between possible outcomes, where the opportunity for trade depends on both outcomes being live possibilities.


I consider three kinds of agreement that fit this pattern, each hinging on a different kind of uncertainty about what comes after an intelligence explosion.

The first is uncertainty about the relative share of resources — who ends up on top without a deal. While major powers like the US and China remain uncertain about who might otherwise achieve a decisive strategic advantage, both should prefer to commit to sharing (some) future power or resources, over the straight gamble. Moreover, the expected surplus from a power-sharing deal shrinks over time, so in theory both sides should prefer to make a deal as soon as it’s possible.

The second is uncertainty about the overall ‘stakes’, like how resource-wealthy society becomes overall. Here, a less risk-averse party can effectively insure a more risk-averse one: taking on more variance in exchange for higher expected resources, and improving both their prospects. Or the stakes in question could be about something more specific, like how philanthropic actors today ‘mission hedge’ by holding positions in specific companies which pay off when their cause is most urgent.
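As a rough illustration of the insurance-style trade described above, the following Python sketch uses made-up numbers: two equally likely worlds (high and low stakes), a risk-averse party with log utility, and a risk-neutral party who cares only about expected resources. The probabilities, payoffs, and utility functions are all assumptions chosen for illustration, not figures from the paper.

```python
import math

P_HIGH = 0.5  # assumed probability of the high-stakes world (illustrative only)


def expected_resources(high: float, low: float, p_high: float = P_HIGH) -> float:
    """Expected resources: what a risk-neutral party is assumed to maximise."""
    return p_high * high + (1 - p_high) * low


def expected_log_utility(high: float, low: float, p_high: float = P_HIGH) -> float:
    """Expected log utility: a simple stand-in for a risk-averse party."""
    return p_high * math.log(high) + (1 - p_high) * math.log(low)


# (resources in high-stakes world, resources in low-stakes world); hypothetical numbers.
NO_DEAL = {"averse": (100, 10), "neutral": (100, 10)}
# The risk-averse party gives up upside in exchange for a better floor.
# Totals are unchanged in each world: 60 + 140 = 200 and 19 + 1 = 20.
DEAL = {"averse": (60, 19), "neutral": (140, 1)}

averse_gain = expected_log_utility(*DEAL["averse"]) - expected_log_utility(*NO_DEAL["averse"])
neutral_gain = expected_resources(*DEAL["neutral"]) - expected_resources(*NO_DEAL["neutral"])

print(f"risk-averse party's change in expected log utility: {averse_gain:+.3f}")
print(f"risk-neutral party's change in expected resources: {neutral_gain:+.1f}")
```

Under these assumed numbers both changes are positive: the risk-neutral party takes on more variance for higher expected resources, and the risk-averse party accepts lower expected resources for a smoother outcome.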

The third kind of agreement involves theoretical and especially normative uncertainty. If one party cares much more about having resources in worlds where, say, a particular moral view turns out to be correct, they can trade for more influence in those worlds. Advanced AI could make such deals feasible by acting as a mutually trusted arbiter for questions that are otherwise hard to resolve.

The basic case for enabling all these agreements is the same as the basic case for any voluntary commitment: all parties improve their prospects by their own lights, and nobody else is hurt. Moreover, agreements between major powers to share resources could make the future meaningfully more pluralistic and morally diverse, which seems better under moral uncertainty than a more unipolar future. And agreements between individuals could give more influence to those who staked their wealth today on future outcomes as a credible show of their beliefs or values, and were vindicated.

It looks like many of these deals won’t be possible by default. If future resources are distributed rather than auctioned, then most of our future wealth arrives as a windfall, but contracts over future income typically aren’t enforceable under common law. We might instead form agreements over future influence, but that too is legally murky. So some agreements would have to rely on private alternatives to legal contracting, through AI-enabled arbitration and enforcement. We might also consider encouraging commitments from private institutions to honour small-scale deals, or setting up infrastructure for trading on post-AGI outcomes. Zooming out to deals between major powers, we’ll need more developed diplomatic frameworks for resource-sharing treaties, likely involving AI-enabled monitoring and enforcement.

Again, each of these deals has to be made early, or never. And that also makes downsides look fairly scary. Enabling early deals lets people commit to hugely consequential terms before they’re wise enough — especially in a world where you can’t recover wealth through labour income. So if we do proactively enable these agreements, I think we should add in some serious guardrails: requirements for demonstrated understanding, caps on the fraction of future resources that can be staked, and mechanisms for voiding deals that were clearly misconceived at the time.

The dawn of the intelligence explosion may be the last period of shared ignorance about some crucial and long-lasting outcomes. Deals struck under that ignorance tend to distribute resources in ways that reflect mutual benefit rather than bargaining power. Once the veil of ignorance lifts, that changes. The case for enabling at least some early deals — despite the received wisdom against “locking-in” the future where we can help it — is fairly compelling.

You can read the full paper here: Should We Lock in Post-AGI Agreements Under Uncertainty?

Will Automation Cause Runaway Inequality?

Fin Moorhouse — Tue, 03 Mar 2026 12:06:43 GMT

Phil Trammell is an economics postdoc at Stanford University’s Digital Economy Lab, working on questions related to economic growth and AI. He discusses:

Why Piketty’s thesis about runaway inequality was likely wrong about the past but right about the future
How full automation turns capital and labour into gross substitutes
Why catch-up growth between rich and poor countries could end
How the privatisation of returns is already concentrating wealth
Why family dynasties and inheritance become far more important in a post-automation economy
Whether autocratic regimes can outgrow democracies after AGI
How to measure whether capital is becoming truly self-replicating — and what the data currently shows

Here’s a link to the full transcript.

ForeCast is Forethought’s interview podcast. You can see all our episodes here.


Moral public goods are a big deal for whether we get a good future

Tom Davidson — Tue, 24 Feb 2026 14:13:05 GMT

This article was created by Forethought. See the original version including appendices on our website.

Short summary

A moral public good is something many people want to exist for moral reasons—for example, people might value poverty reduction in distant countries or an end to factory farming.

If future people care somewhat about moral public goods, but care more about idiosyncratic selfish goods, then there may be significant gains from them coordinating to fund moral public goods. Even though it’s in each individual’s personal interests to fund selfish goods, everyone is better off if they all switch to funding moral public goods.

Ensuring that this coordination happens seems potentially very important for how well the future goes.

We tentatively think that this argument suggests distributing power relatively widely (so that there are more gains from trade), while improving our ability to coordinate to fund moral public goods. It also suggests encouraging evidential cooperation in large worlds (ECL).

Long summary

Suppose that after the intelligence explosion there’s a society of a million people each deciding what to do with a distant galaxy they own. Every person can use their resources to either simulate themselves (“self-sims”) or create something that everyone values, perhaps hedonium or civilizations of happy, flourishing people (“consensium”1). Assume for now that they value both goods linearly, but value their own self-sims a thousand times as much as consensium and value others’ self-sims negligibly.

Absent trade, everyone spends all their resources on self-sims. But they could instead agree to spend everything on consensium. Although they value consensium a thousand times less than self-sims, they get a million times as much of it by participating in the trade—a thousand-fold increase in value by each person’s lights!
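The arithmetic in this toy example can be sketched in a few lines of Python. The population size and the thousand-fold premium on self-sims come from the text; the normalisation (one galaxy of your own self-sims is worth 1 unit of value to you) and the function names are assumptions introduced here for illustration.

```python
N_PEOPLE = 1_000_000      # people in the society, each owning one galaxy (from the text)
SELF_SIM_PREMIUM = 1_000  # own self-sims are valued 1000x as much as consensium (from the text)


def value_without_trade() -> float:
    """Each person converts their own galaxy to self-sims.

    Assumed normalisation: one galaxy of your own self-sims is worth 1.0 to you,
    and other people's self-sims are worth nothing to you.
    """
    return 1.0


def value_with_trade(n_people: int = N_PEOPLE, premium: int = SELF_SIM_PREMIUM) -> float:
    """Everyone converts their galaxy to consensium.

    Each person values every one of the n_people galaxies of consensium,
    but at 1/premium of the value of a galaxy of their own self-sims.
    """
    return n_people / premium


def gain_from_trade(n_people: int = N_PEOPLE, premium: int = SELF_SIM_PREMIUM) -> float:
    """Ratio of value with the trade to value without it, by each person's lights."""
    return value_with_trade(n_people, premium) / value_without_trade()


print(gain_from_trade())
```

On this linear model the trade beats going it alone whenever the number of participants exceeds the premium each person places on their own self-sims.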

In general terms, rather than each party pursuing idiosyncratic goods (valued only by them), everyone agrees to pursue consensus goods (valued by everyone). This is a form of moral trade, which might have especially large gains from trade when people have linear preferences in both idiosyncratic and consensus goods. We’re excited about this both because we think that linear preferences are reasonably likely and because we think that other methods of moral trade work less well when all participants have linear preferences.2

Consensium is a type of public good. Everyone derives value from the existence of consensium, whether or not they contributed to funding it. We call goods like consensium moral public goods.

We’ve presented a stylized trade-off between something totally particular (“self-sims”) and something totally universal (“consensium”). In practice, there’s probably a spectrum.3 Mutually beneficial trades can occur anywhere along this spectrum, whenever people shift resources from more idiosyncratic to more widely valued goods.

Of course, this requires that people have both idiosyncratic and consensus goals. It’s not totally clear that this will be true. Maybe everyone’s values will fully converge, and they’ll spend all their resources pursuing those shared values, without any need for trade. Or maybe everyone’s values will entirely diverge, leaving them with no shared goals at all. In that case, coordinating on moral public goods isn’t possible.

But we think it’s reasonably likely that people will continue to have both idiosyncratic and widely shared preferences. If so, these trades could matter a lot for whether the future goes well.

Some strategic implications:

Distribute power widely.4 The more people who share power, the greater the gains from trade, and the more likely that people switch from funding idiosyncratic goods to consensus goods. So this is a general argument in favour of distributing power as widely as possible, as long as large-scale coordination is possible—which we think is doable via taxation.
But avoid highly fragmented governance. You only get to capture these large gains from trade if you’re actually able to coordinate. This speaks against highly decentralized approaches—whether libertarian futures where individuals have total control of their own resources, or massively multipolar worlds with millions of independent polities and no mechanism to compel contributions. Funding public goods is hard because everyone has a strong incentive to free-ride: in the toy example, each person prefers that everyone else switch to consensium while they keep funding self-simulations. Historically, the scalable method for funding public goods has been governments that force individuals to contribute.
Combining this point with the previous point, moral public goods are most likely to be funded if power is broadly distributed but the government can tax people to fund consensus goods that they vote for.5
Develop voluntary mechanisms for funding moral public goods. Coordination technology might eventually solve the free-rider problem and allow people to make deals to fund moral public goods without government coercion. We’re excited about research in this direction, though we think the free-rider problem is surprisingly hard to escape.
Encourage ECL. Evidential Cooperation in Large Worlds (ECL)6 combines evidential decision theory7 with the notion that the multiverse may contain huge numbers of agents with decision procedures correlated with yours.
ECL plausibly provides a very strong mechanism for funding moral public goods. If you shift $1 from something only you value to something valued by all correlated agents, they do the same. This gets you a large increase in consensus goods for a small sacrifice of idiosyncratic goods, which is a great deal by your lights. With many correlated agents who have diverse idiosyncratic values but share your consensus goals, the multiplier is potentially huge (e.g. more than $10^30 of consensium for each $1 you move away from self-sims).
It might matter less how much people prioritize consensus goods, and more what those consensus goods actually are. In the past, we’ve worried that even if there’s widespread moral convergence, people might still prioritize other goals like personal consumption, status competitions, or idiosyncratic ideological projects. But the argument above suggests that if enough people care about a goal even a little bit, they’ll shift all their spending toward it. The difference between a very “selfish” person (who cares very little about consensus goods) and a very “altruistic” one (who cares a lot) might not matter so much, as long as everyone cares at least a bit.
What does matter is what those consensus goals actually are. There could be substantial differences in value—by our lights—between different conceptions of pleasure, beauty, well-being, or consciousness. And there are potential consensus goals that would be bad or valueless, like sadism or nothingness.

One important qualification: our toy example assumed that people value both idiosyncratic and consensus goods linearly. We’re massively uncertain what the structure of people’s preferences will look like in the long run, and so we’re uncertain about our conclusions. We checked whether our results held across various classes of plausible-seeming utility functions and, for most of them, coordination and distribution of power were helpful for increasing spending on consensus goods.

But there are plausible utility functions where these results don’t hold. For example, human behavior today can be modeled by preferences that allocate a fixed fraction of resources to each type of good, regardless of price.8 Under those preferences, a coordination mechanism that effectively makes consensium cheaper wouldn’t actually get people to spend more on it. And for some utility functions, broadening the distribution of resources can actually decrease spending on consensus goods, even when coordination is possible.

The structure of the rest of the note is as follows:

We define moral public goods, and clarify their relationship to moral trade.
We first assume a specific model of people’s values (where idiosyncratic and consensus preferences are both linear). We show that, in the context of causal trades, moral public goods get the most funding if resources are widely distributed and coordination is possible. We discuss specific mechanisms to enable coordination on moral public goods, including government taxation, social norms, and voluntary deals.
Next, we turn to acausal coordination and argue that evidential cooperation in large worlds (ECL) is very well-suited for funding consensus goods.
Then we consider how robust our arguments are to our assumptions that people will have linear preferences.
Finally, we assess how valuable spending on moral public goods would actually be.

What are moral public goods?

The consensium example from above illustrates a general dynamic that Paul Christiano calls a “moral public good.” Many people may value some goods for moral reasons. No one values the good enough to fund it themselves, but it’s in everyone’s collective interest to fund it. As far as we’re aware, the dynamic was first identified by Milton Friedman,9 and developed further by other economists.10 Moral public goods are different from other public goods in that people don’t personally benefit from the good. Instead, they just care intrinsically about the good existing.

Examples of moral public goods might include existential risk mitigation, poverty relief, environmental protection, art creation, scientific inquiry, and animal welfare improvements. (Although often these are regular public goods, too, since people derive personal benefit from many of these goods. We acknowledge that the distinction is somewhat fuzzy and many people will derive both a personal and moral benefit from the same good—you might personally value not dying in an extinction event and morally value the existence of future people.11)

Just like other public goods, moral public goods are liable to be underfunded,12 because of the free-rider problem: everyone prefers paying their share over not getting the good at all, but they prefer even more to let others fund it while they get to keep their money. We currently solve this coordination problem by governments collecting taxes and spending the proceeds on consensus goods.

We think that public goods, and whether we coordinate to fund them, might be very important for how good the long-run future is. In the future, people may have the opportunity to allocate resources in distant galaxies that they will never personally visit. For those decisions, most of the benefit a decision-maker can derive is moral or ideological, not personal. Thus, we think coordination on shared moral goals is especially important.

How does this relate to moral trade?

Trade over moral public goods is an example of moral trade.

Classic cases of moral trade often focus on people trading over idiosyncratic moral preferences. For example, consider two people who each control a galaxy’s resources. One person cares about hedonic pleasure while the other cares about freedom. Left to their own devices, the freedom lover would create a society where everyone is perfectly free, while the hedonic utilitarian would create one where everyone is maximally blissful. But there’s an opportunity for trade. The hedonic utilitarian could tweak their society to increase freedom at low cost to pleasure, while the freedom lover could look for ways to increase pleasure without significantly compromising freedom. Both get more of what they want.

This is nice, but the gains seem fairly limited when both parties are trading idiosyncratic goods that they both value linearly. With just two trading partners, even in the most optimistic case—where each party achieves 99.999% of their possible value in both galaxies—trade only gives you a 2x multiplier on value. If you wanted 100x gains from trade, you would need to find a hybrid good that was simultaneously nearly optimal for 100 different value systems. We wouldn’t expect one to exist in most cases.

The moral public goods case, in contrast, is a moral trade where people agree to shift resources from idiosyncratic preferences that they individually value highly to consensus preferences that everyone values a little.

Coordinating on moral public goods works especially well when everyone has preferences that are linear in resources (see below)—exactly the case where the gains from coordinating on hybrid goods seem especially limited. It’s also easier to scale to huge numbers of trading partners, since everyone just produces whatever best satisfies their shared values rather than needing to find hybrid goods that satisfy many value systems. This scalability matters because gains from trade grow with the number of participants: in our toy example in the summary, a million people coordinating on something they all valued a tiny bit yielded 1000x gains from trade.

The downside of coordinating on moral public goods is that it does require a large number of people to share some consensus preferences. This might not always be true (see below). But when such shared preferences do exist, we expect coordination on moral public goods to yield larger gains from trade than coordination on hybrid goods, at least when there are many participants with linear preferences.

Scenario 1: causal coordination

For now, we’ll assume that beings with decision-making power have quasilinear preferences over three types of goods. First, there are some goods that they value for self-interested reasons, like food, shelter, and luxuries for their biological self, which exhibit steeply diminishing returns. We’ll call these goods basics. Second, there are some goods that they value for idiosyncratic reasons, which have linear utility. These could include simulations of themselves or people living according to their own culture. We’ll call these goods self-sims. Finally, there are some goods that everyone values linearly. This could be new civilizations crammed with flourishing, joy, adventure, connection, beauty, and so on. We’ll call these goods consensium. Everyone values consensium, but no one values anyone else’s basics or self-sims.

To help us illustrate more concretely, we’ll assume a particular utility function, with xᵢ and yᵢ representing each person’s basics and self-sims, respectively, and g representing consensium:

That is: people care a lot about basic goods, but get diminishing utility from them, they care quite a lot about self-sims, and they care only a tiny bit about consensium.

Given this utility function, how do people spend their wealth? Consider three different scenarios. In each scenario, we’ll assume the price of each good is $1, total wealth of $100T, and there are 10B people. (The precise numbers don’t matter; this is just to illustrate.)

Footnotes for the table above:13 14 15

The key qualitative upshot is this: with good coordination and widely distributed resources, the effective price of the consensus goods drops dramatically. Every $1 you spend on consensium results in $10B going towards it—a 99.99999999% discount.16 On this model, people buy vastly more consensium, both absolutely and as a share of their budget, than in either the dictatorial or uncoordinated scenario.

This argument suggests we should try to ensure both widely distributed power and good coordination mechanisms for funding public goods.

How widely does power need to be distributed? This depends on how much you expect people to value idiosyncratic goods relative to consensus goods. In our example above, each person valued self-sims 5 billion times as much as they valued consensium, so we needed at least 5 billion people for consensium to get funded at all.
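The effective-price and threshold claims above can be checked with a small Python sketch. It covers only the linear part of the model (self-sims versus consensium) and ignores basics, so it is not the full utility function or the scenario table. The population of 10B and the 5-billion-fold premium on self-sims come from the text; the function names and the treatment of the three scenarios as "how many people match your dollar" are our own simplifying assumptions.

```python
from fractions import Fraction

POPULATION = 10_000_000_000       # 10B people (from the text)
SELF_SIM_PREMIUM = 5_000_000_000  # self-sims valued 5B times as much as consensium (from the text)


def effective_price(n_matching: int) -> Fraction:
    """Dollars a person effectively pays per $1 of consensium produced,
    when n_matching people (including them) each contribute $1 together."""
    return Fraction(1, n_matching)


def discount_percent(n_matching: int) -> Fraction:
    """Percentage discount on consensium relative to its $1 sticker price."""
    return (1 - effective_price(n_matching)) * 100


def funds_consensium(n_matching: int, premium: int = SELF_SIM_PREMIUM) -> bool:
    """Whether a marginal dollar (beyond basics) goes to consensium.

    It does when the consensium triggered by that dollar is worth at least as
    much to the spender as $1 of self-sims; at exact equality the spender is
    indifferent, which this sketch counts as funding.
    """
    return Fraction(n_matching, premium) >= 1


# Assumed reading of the three scenarios: a dictator and an uncoordinated
# individual each have their dollar matched by nobody else (n_matching = 1);
# under full coordination it is matched by the whole population.
for label, n in [("dictator", 1), ("uncoordinated", 1), ("coordinated", POPULATION)]:
    print(label, float(discount_percent(n)), funds_consensium(n))
```

Exact fractions are used instead of floats so that the discount and the 5 billion threshold are not blurred by rounding.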

We’re quite uncertain about how much people will value idiosyncratic goods relative to consensus goods. We tentatively think that ratios of a few thousand or a few million seem quite plausible and ratios as high as a few billion are somewhat plausible, so distributing power across thousands, millions, or even billions of people could be valuable.17

How to coordinate causally

There are three approaches to funding public goods that might work for moral public goods after the singularity: governments, social norms, and voluntary contracts.

Today, public goods are funded primarily by governments. Governments force everyone to contribute to public goods, regardless of whether they actually value the good. Even in a democracy, a minority’s preferred public goods might go unfunded, while their taxes pay for goods they’re indifferent to. It would be better if there were a way to allow arbitrary combinations of individuals to coordinate and fund the goods they collectively value, without forcing contributions from those who do not value the good.

We were initially optimistic that this would be possible through voluntary contracts. After all, it’s in everyone’s collective interest to get these goods funded, and we expect that artificial superintelligence (ASI) will be able to resolve some barriers to coordination that prevent mutually beneficial deals today, like transaction costs or difficulties making credible commitments. But it seems surprisingly difficult to get around the free-rider problem. Advanced technology might even open up new ways to free-ride, like self-modifying so that you no longer value the moral public good (see Appendix B for more details on funding moral public goods via voluntary contracts).

Another approach to funding public goods is social norms. Individuals contribute to public goods to avoid social sanctions, win praise from their peers, or just to live up to their own self-conception as cooperative and norm-abiding. We’re relatively pessimistic about this approach because it seems less scalable and less flexible than either governments or voluntary contracts. Social pressure is probably most effective within social communities, which might cap out at hundreds or thousands of members. Communities of this size might not include all the people that you’d want to coordinate with. Also, social norms may not be targeted towards funding moral public goods rather than more arbitrary goals. Lastly, social norms emerge organically, making their terms harder to renegotiate if they prescribe excessively harsh punishments or the wrong level of contributions from individuals.

Some other historical mechanisms for funding public goods make use of them being (partially) excludable.18 But moral public goods are entirely non-excludable: once the good exists, each person who wanted it now benefits.

Scenario 2: ECL

We might also be able to fund moral public goods through acausal coordination. This section presents one proposal for such coordination, drawing on the idea of evidential cooperation in large worlds (ECL). A core premise of ECL is that there are likely many causally disconnected agents—in civilizations inside our universe but outside our lightcone, civilizations in different Everett branches, or civilizations in other parts of the Tegmark IV multiverse. Each of these agents faces a choice about how to allocate their resources: toward idiosyncratic goods valued only by them, or toward consensus goods that many beings throughout the multiverse would value. We can’t causally affect their decisions, but our own choice—whether to fund consensus goods over idiosyncratic ones—provides evidence about what other agents with sufficiently similar decision procedures will choose.

To illustrate, let’s return to our toy example where each agent cares about one idiosyncratic good (self-sims) and one consensus good (consensium):

If an agent spends $1 on self-sims, they get evidence that huge numbers of other agents spend on self-sims. But they only value another agent’s self-sims if that agent is an exact copy of them.19

There are some agents who are exact copies—it’s a big multiverse—but most of the agents correlated with them aren’t exact copies, so those self-sims are worthless to the original agent. Their dollar is matched only by their copies.
If an agent spends $1 on consensium, they get evidence that all those correlated agents shift $1 to consensium too. Unlike self-sims, they care about consensium created by any of those agents. Their dollar is thus matched across the multiverse by anyone whose decision is sufficiently correlated with theirs.

Whether this trade is worthwhile from an agent’s perspective depends on the following ratio:

This ratio determines the multiplier they get from coordinating with everyone funding consensium. If the multiplier is large enough to overcome the lower value they place on consensium relative to self-sims, the trade is worthwhile.

(Actually, you should weight each agent by the degree of correlation, but the above formula ignores that for simplicity.20)
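One way to read the comparison described above, offered as an assumption rather than as the authors' exact formula, is to divide the number of correlated agents who share your consensus goal by the number who are exact copies of you, and then compare that multiplier with how much more you value self-sims than consensium. The Python sketch below does this with purely hypothetical counts and, like the simplified formula, ignores correlation weights.

```python
from fractions import Fraction


def ecl_multiplier(n_correlated_consensus_sharers: int, n_exact_copies: int) -> Fraction:
    """Consensium matched per dollar, relative to valued self-sims matched per dollar.

    Assumption: a dollar moved to consensium is matched by every correlated agent
    who shares the consensus goal, while a dollar spent on self-sims is matched,
    in a way the spender values, only by exact copies. Every agent is weighted
    equally; degrees of correlation are ignored.
    """
    return Fraction(n_correlated_consensus_sharers, n_exact_copies)


def trade_is_worthwhile(multiplier: Fraction, self_sim_premium: int) -> bool:
    """True if the multiplier outweighs the premium the agent places on self-sims."""
    return multiplier > self_sim_premium


# Hypothetical counts, chosen only to show the arithmetic; not estimates.
example_multiplier = ecl_multiplier(10**40, 10**10)
print(example_multiplier)
print(trade_is_worthwhile(example_multiplier, 5 * 10**9))
```

With a premium as large as the 5 billion used in the causal example, the trade is still worthwhile under these made-up counts; with few correlated agents relative to exact copies, it would not be.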

There are many possible trading partners. There are astronomical numbers of possible human genomes and even humans with the same genome might diverge due to different life histories. And there are many other possible minds that we could cooperate with—alien intelligences, AIs, and whatever else might exist.

If your idiosyncratic values are indexical (you only care about your personal consumption), then you’ll share those values with none of your possible trading partners. But your decision gives you some evidence about what those others decide. The evidence doesn’t even need to be that strong to be significant. Even a 1% correlation could matter a lot when multiplied across huge numbers of potential trading partners.

Even if your idiosyncratic values aren’t indexical—even if they could in principle be shared by agents outside your lightcone—the multipliers might still be large. The space of possible idiosyncratic values is vast. Some agents will share your decision procedure but have different idiosyncratic values. (The authors of this piece disagree about how tightly linked these are in practice, and therefore disagree about the magnitude of the multiplier.)

The ECL case differs from the causal case in several important ways.

First, ECL removes the incentive to free-ride. In the causal story, each agent wants everyone else to fund consensus goods while they buy idiosyncratic goods. Under ECL, this isn’t an option. If an agent buys idiosyncratic goods, so does everyone else correlated with them. Thus, the agent is incentivized to pay for consensus goods even without central enforcement.

And with ECL, funding for consensus goods is much less sensitive to the distribution of power on Earth. In the causal case, we only got large “discounts” on consensus goods if power was widely distributed; a single dictator preferred to just fund idiosyncratic goods. But with ECL, even a world dictator gets massive “discounts” on consensus goods from coordinating with others in the multiverse.

Of course, unlike the causal case, whether consensus goods get funded depends on whether agents want to do acausal cooperation at all—which depends on their decision theories and their beliefs about their degree of correlation with others.

Robustness to different structures of preferences

So far we have mostly assumed that people value consensus and idiosyncratic goods linearly. We think that this is plausible. After ASI, people will be extremely wealthy. If they have any linear preferences at all, their spending will mostly be determined by those preferences, since they’ll quickly saturate their sublinear ones. And there are theoretical arguments for having linear preferences.21 Meanwhile, people with sublinear preferences may end up controlling few resources—they’d be less willing to adopt riskier but higher-reward strategies, like trading away guaranteed resources near Earth for resources further out in space that might already be occupied. As such, we expect them to trade away most of their resources to people with linear preferences.

With linear utility functions, we found that many coordinated people fund more public goods than either a single decision-maker or many uncoordinated people, which suggested that both coordination and wider resource distribution increased funding for public goods.

We checked whether these conclusions held for a few other utility functions that seemed plausible to us. Among the preference structures we checked, enabling coordination was always helpful (or at least not harmful) for increasing spending on consensus goods. However, broadening the distribution of power was sometimes actively counterproductive.

We’re quite uncertain about what preference structures humans will have after the singularity, and it’s very possible we’re missing a common form that future preferences will take. So we remain pretty unsure about the generality of our conclusions.

With that caveat in mind, here are the other preference structures we checked:

Preferences with diminishing marginal returns in idiosyncratic and consensus goods. Someone might value many goods—idiosyncratic and consensus—each with its own rate of diminishing marginal returns (DMR). They’ll shift marginal spending from idiosyncratic to consensus goods based on the relative marginal returns. Coordination essentially increases the marginal returns on consensus goods by a constant factor (the number of people coordinating), which can shift more spending into consensus goods. So, as in the linear case, coordination is pretty robustly good: it increases, or at least doesn’t decrease, spending on public goods.
However, in the absence of coordination, widely distributing resources can actually reduce spending on consensus goods. Compare a dictator holding all the resources to N uncoordinated people, each with 1/N of the resources. The dictator will be able to spend more in absolute terms on idiosyncratic consumption, so they experience much lower marginal returns on that consumption and are correspondingly more willing to shift funding toward consensus spending. Intuitively, a single person’s idiosyncratic desires saturate faster than N people’s combined desires, freeing up more resources for consensus goods.
So more public goods get funded in a world with a single decision-maker, or in a world with many coordinated decision-makers, than in a world with many uncoordinated decision-makers. How do the coordinated multipolar world and the single decision-maker world compare?
It depends on the precise shape of the utility function. For some DMR functions, such as ln(x + 1) or √x, many coordinated people fund more public goods than single dictators (where x is the amount of resources spent on idiosyncratic goods). Here the boost from coordination matters more than the hit from having to fund many people’s idiosyncratic goods. For other DMR utility functions, e.g. min(x, T) for some constant threshold T, dictators may fund more consensus goods. See Appendix A for more details.
(These same conclusions largely apply if someone values consensus goods linearly and has DMR in idiosyncratic goods (or vice versa).)
Preferences to spend fixed fractions of resources on consensus and idiosyncratic goods, regardless of price. This matches how people today typically allocate resources. Even when people learn that certain charities achieve huge amounts of good per dollar, they very rarely reallocate spending between idiosyncratic and consensus goods. This suggests they are not price-sensitive, but rather spend a fixed fraction of their resources on consensus goods regardless of how effectively those resources can be deployed.
(You can also get this spending pattern if you model a human as containing two sub-agents (one that cares only about idiosyncratic goods, one that cares only about consensus goods) and these sub-agents bargain to determine the human’s actions.22)
With this utility function, cheaper public goods make no difference to allocation and coordination doesn’t help. Resource distribution also doesn’t matter—each individual spends the same share of resources on consensus and idiosyncratic goods regardless of how many resources they control.
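The diminishing-returns comparison above (dictator versus uncoordinated versus coordinated decision-makers) can be checked numerically. The following is a minimal sketch under assumptions of our own, not the model in Appendix A: each person's utility is f(x) + c·G, where x is their own idiosyncratic spending, G is total consensus spending, and a person coordinating with n others treats each consensus dollar as worth n·c. The budget `B`, head count `N`, per-dollar value `C`, and threshold `T` are hypothetical numbers chosen only for illustration.

```python
def idio_spend_sqrt(budget, consensus_return):
    """Idiosyncratic spend when f(x) = sqrt(x).

    Spend on x until its marginal return 1 / (2 * sqrt(x)) falls to the
    return on consensus goods, i.e. x = 1 / (4 * r**2), capped by the budget.
    """
    return min(budget, 1.0 / (4.0 * consensus_return ** 2))


def idio_spend_threshold(budget, consensus_return, threshold):
    """Idiosyncratic spend when f(x) = min(x, T).

    The marginal return on x is 1 up to T and 0 afterwards, so the person
    fills the threshold only if 1 beats the return on consensus goods
    (ties are resolved in favour of consensus goods, an arbitrary choice).
    """
    return min(budget, threshold) if consensus_return < 1.0 else 0.0


def total_consensus(total_budget, n_people, n_coordinating, c, idio_rule):
    """Total consensus spending when resources are split evenly among
    n_people and each acts as if n_coordinating people match their dollar."""
    per_person = total_budget / n_people
    x = idio_rule(per_person, n_coordinating * c)
    return n_people * (per_person - x)


# Hypothetical illustrative values (not taken from the article).
B, N, C, T = 1_000_000.0, 50, 0.01, 5_000.0

rules = {
    "sqrt(x)": idio_spend_sqrt,
    "min(x, T)": lambda budget, r: idio_spend_threshold(budget, r, T),
}
for name, rule in rules.items():
    dictator = total_consensus(B, 1, 1, C, rule)
    uncoordinated = total_consensus(B, N, 1, C, rule)
    coordinated = total_consensus(B, N, N, C, rule)
    print(f"{name:10s} dictator={dictator:,.0f} "
          f"uncoordinated={uncoordinated:,.0f} coordinated={coordinated:,.0f}")
```

With these particular numbers, the √x case ranks coordinated above dictator above uncoordinated, while in the min(x, T) case the dictator funds the most consensus goods. Other parameter choices can change the threshold case: once N·c reaches 1, coordinated people stop funding idiosyncratic goods entirely.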

Convergence and moral public goods funding

Coordination to fund moral public goods isn’t possible if there’s full convergence or full divergence. If everyone’s values fully converge, they’ll spend all their resources pursuing shared goals without any need for trade. If everyone’s values fully diverge, there are no shared goals to coordinate on in the first place.

But if a group shares some consensus preferences while retaining different idiosyncratic ones, coordination to shift funding from idiosyncratic goods to consensus goods is possible. Gains from trade are largest if there’s widespread convergence on consensus goals. But even with limited convergence, any subset of people with shared consensus goals can still benefit by trading among themselves.

How valuable is it to fund moral public goods?

This depends on how valuable the consensus goods are.

On subjectivism, if there’s widespread convergence, most people will end up valuing those consensus goods—so unless you expect your values to substantially diverge from most people’s on reflection, this should be great by your lights. Things are less clear if you expect low convergence, or if you expect to be in the minority. You’ll still benefit from coordinating with others who share some consensus goals with you, but other coalitions might fund goods you dislike.

For example, people might coordinate on excessively punishing wrongdoers (negative value) or leaving large swathes of space as nature preserves (zero value), when we would have preferred that they hadn’t coordinated at all and instead funded personal consumption (weak positive value). But we don’t expect that this effect dominates because in general most people’s values aren’t directly opposed.

Another issue is threats. Just as coordination lets a group do more with a fixed budget by funding shared goals rather than idiosyncratic ones, it might also make it easier to threaten that group with something they all dislike. We don’t think this will leave the threatened parties worse off on net by their own lights, but it might be bad for more downside-focused agents. They bear the risk of threats against their values without as much of the corresponding upside.

Thus far we’ve argued that coordination will improve the value of the future by most people’s lights. But if moral realism is correct, then we should ask whether coordination will lead to the objectively best use of resources. There’s some reason for optimism here: under moral realism, lots of people might place at least some value on the impartially best use of resources, making that a very broadly appealing good.

But it’s unclear that people will coordinate to fund the most broadly appealing goods. People have a range of preferences that vary in how particular or universal they are. Moral public goods mechanisms can shift funding from satisfying more idiosyncratic preferences to more widespread ones—but they don’t necessarily fund the most universal preferences. For some people, the largest gains from trade might come from coordinating with a smaller group with especially similar preferences. If a nationalist values national benefit 100x more than consensium, then they’d rather coordinate with 1 billion fellow nationalists than 10 billion people globally.23

And even if the most broadly appealing goods are funded, they might not be the objectively best use of resources. For example, humans might especially value the wellbeing of human-like minds. If coordination is only among humans, then public goods funding might flow toward creating societies of happy humans, even if non-human minds could experience more joy, freedom, or fulfillment per unit resource.

This last concern seems more serious for causal than for acausal coordination. Causal coordination will be limited to humans and AIs originating from Earth. Acausal coordination could involve a much wider variety of minds—aliens with very different biologies and civilizational histories. If we’re correlated with them, then we’re more likely to end up funding goods that are broadly appealing to all these types of minds, which are more likely to be the morally correct use of resources. But it’s possible that civilizations capable of ECL will tend to share similar values—maybe preferences for stuff that’s instrumentally useful like survival, growth, and knowledge—even if those aren’t objectively valuable.

Conclusion

If large numbers of agents can coordinate to fund goods they all value, this can produce substantial gains from trade. These gains are potentially large enough that even quite selfish actors would devote significant resources to consensus goods. We’re excited about this type of trade because it could enable a near-best future by channeling substantial resources toward widely valued goods, even without any single agent heavily prioritizing those goods. This conclusion is most clear-cut when agents have linear utility functions, but probably extends to some other plausible utility functions, including some with diminishing returns.

These benefits depend on there being a sufficient number of agents who share some consensus goals and are able to coordinate. In the causal case, we’re most optimistic about coordination to fund consensus goods if power is widely distributed and there are governments that can collect taxes to fund public goods. We’re excited about further research on voluntary coordination methods, but they will have to deal with incentives to free-ride and/or strategically modify one’s own preferences. In the acausal case, ECL enables large trading coalitions even if there’s extreme power concentration on Earth and eliminates free-rider problems.

This article was created by Forethought. See the original version including appendices on our website.

We call the good that best satisfies the people’s shared values “consensium,” after hedonium, the good that best satisfies hedonic utilitarianism.

See below for a comparison with another type of moral trade where people fund “hybrid” goods that simultaneously satisfy multiple value systems.

From most idiosyncratic to most broadly appealing, this spectrum could include: copies of yourself; societies of humans who share your nationality, culture, or ideology; societies of human-like minds; experiences that maximize value according to a widely shared (but not universal) ethical system; and activities that maximize value according to the objectively true ethical system (if there is one).

Of course, this argument in favour of power distribution should be balanced with the many other considerations about the optimal distribution of power.

This minimal government structure could also help with other public goods for spacefaring societies, like preventing vacuum decay.

The concept originates from this paper, where it’s called “multiverse-wide superrationality.” This blog post offers an accessible explanation.

The principle that you should act as you'd want all agents with sufficiently similar decision procedures to act, since your choices are evidence about theirs.

For example, people today rarely massively increase the percentage of their income donated to charity after learning that charities are much more effective than they previously believed.

Chapter 12 of “Capitalism and Freedom” (1962): “It can be argued that private charity is insufficient because the benefits from it accrue to people other than those who make the gifts- again, a neighborhood effect. I am distressed by the sight of poverty; I am benefited by its alleviation; but I am benefited equally whether I or someone else pays for its alleviation; the benefits of other people’s charity therefore partly accrue to me. To put it differently, we might all of us be willing to contribute to the relief of poverty, provided everyone else did. We might not be willing to contribute the same amount without such assurance. In small communities, public pressure can suffice to realize the proviso even with private charity. In the large impersonal communities that are increasingly coming to dominate our society, it is much more difficult for it to do so.”

It’s ironic that the target of Christiano’s argument, who overlooks this dynamic, is David Friedman, Milton Friedman’s son.

E.g. Hochman & Rodgers (1969), “Pareto Optimal Redistribution”.

You might also experience a warm glow from having helped prevent extinction. We classify this as a private good, as it’s excludable—only the people who contributed the funding get to enjoy the satisfaction of having helped out.

That is, funded below the socially optimal amount, the level where total benefits equal the total costs.

The marginal returns on self-sims (0.025) are always higher than those on consensium (5 × 10⁻¹²), so no money gets spent on consensium. The marginal returns on self-sims are higher than the marginal returns on basics (1 / (2√xᵢ)) when xᵢ > 400. So the dictator spends $400 on basics and then the rest is spent on self-sims.

Each decision-maker has a budget of $100T/10B = $10,000. By the same reasoning as the previous footnote, each person spends $400 on basics and the rest of their budget ($9,600) on self-sims. So across 10B people, $4T is spent on basics and $96T is spent on self-sims.

Once everyone is coordinating, a person who spends an extra dollar effectively causes 10B dollars to be spent on consensium. The value of spending a dollar on consensium is thus 10B × 5 × 10⁻¹² = 0.05. Since this exceeds the marginal return on self-sims (0.025), no money gets spent on self-sims. And since 0.05 exceeds the marginal return on basics (1 / (2√xᵢ)) when xᵢ > 100, each person spends $100 on basics and the rest on consensium.
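The arithmetic in the three footnotes above can be reproduced with a short sketch. It assumes, as the stated marginal return 1 / (2√xᵢ) implies, that basics give utility √x while self-sims and consensium are valued linearly; the function and variable names are ours and purely illustrative.

```python
TOTAL_BUDGET = 100e12        # $100T, from the footnotes
N_PEOPLE = 10e9              # 10B decision-makers
SELF_SIM_RETURN = 0.025      # marginal utility per dollar of self-sims
CONSENSIUM_RETURN = 5e-12    # marginal utility per dollar of consensium


def allocate(budget, n_coordinating):
    """Return (basics, self_sims, consensium) spending for one decision-maker.

    Assumes basics give utility sqrt(x), so their marginal return is
    1 / (2 * sqrt(x)); self-sims and consensium are linear.
    """
    consensium_return = n_coordinating * CONSENSIUM_RETURN
    best_linear = max(SELF_SIM_RETURN, consensium_return)
    # Buy basics until their marginal return drops to the best linear option.
    basics = min(budget, 1.0 / (4.0 * best_linear ** 2))
    remainder = budget - basics
    if consensium_return > SELF_SIM_RETURN:
        return basics, 0.0, remainder
    return basics, remainder, 0.0


per_person = TOTAL_BUDGET / N_PEOPLE   # $10,000 each

# name -> (budget per decision-maker, number coordinating, head count)
scenarios = {
    "dictator": (TOTAL_BUDGET, 1, 1),
    "uncoordinated": (per_person, 1, N_PEOPLE),
    "coordinated": (per_person, N_PEOPLE, N_PEOPLE),
}
for name, (budget, n_coord, headcount) in scenarios.items():
    basics, self_sims, consensium = allocate(budget, n_coord)
    print(f"{name}: basics ${basics:,.0f} each, "
          f"self-sims ${self_sims * headcount:,.0f} total, "
          f"consensium ${consensium * headcount:,.0f} total")
```

Under these assumptions the sketch recovers the footnotes' figures: $400 on basics without coordination, $100 with it, and consensium funded only in the coordinated case.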

Thanks to Toby Ord for this framing.

There might be benefits to increasing the number of powerholders even beyond what’s needed to make consensium worth funding. More people means larger gains from trade, which could make coordination more attractive. For example, in Appendix C, we investigate an assurance contract for funding public goods and find that—holding fixed the ratio of value assigned to idiosyncratic goods and consensium—public goods are more likely to be funded with larger numbers of people, due to the greater gains from trade. Of course, larger groups also have a harder time coordinating. In our analysis of the assurance contract, we found that the larger gains from trade outweighed the difficulties in coordinating, but this might not hold for other mechanisms.

For example, lighthouses may have been historically funded by harbor fees. This made them partially excludable, since only ships that came into the harbor and paid the fee would get the full benefit of a nearby lighthouse.

Or they might not even value that—maybe they only value self-sims causally downstream of themselves.

The degree of correlation between you and another agent A is the extent to which you update on A’s decision after observing your own. In this case, it is

First, among views of population ethics that satisfy some standard technical axioms, only those that are linear with respect to population size (at a given level of wellbeing) are separable in space and time—that is, the value of doing good today doesn’t depend on the amount of good in distant galaxies or in the distant past. See Blackorby, Bossert, and Donaldson’s Population Issues in Social Choice Theory.

Second, even if you think that maximum attainable value is a concave function of resources devoted to promoting the good, if the total amount of goodness in the universe is much larger than the amount you can affect, then you will value the differences you can make approximately linearly (because concave functions are locally approximately linear). And, plausibly, the total amount of goodness in the universe is much larger than the amount you can affect. See No Easy Eutopia for more discussion.

Let’s model someone as containing two sub-agents with equal weight, one that cares about idiosyncratic goods with a utility function

and one that cares about consensus goods with a utility function

(where x and g are respectively the amounts spent on idiosyncratic and consensus goods). Then the result of Nash bargaining will be to maximize:

This is a Cobb-Douglas utility function and a person with that utility function will split their resources between idiosyncratic goods and consensus goods at a ratio of

regardless of their total level of resources.

(This relies on the idiosyncratic goods and consensus goods having the same functional form. If instead that person’s consensus-good-valuing sub-agent valued resources linearly and their idiosyncratic sub-agent valued resources logarithmically, the result of Nash bargaining would be to maximize

For this utility function, as resources grow, more resources are spent on the consensus goods.)

Note that the utility function produced by the Nash bargain is based on resource expenditure relative to the disagreement point (where the individual spends no resources on consensus or idiosyncratic goods). So in the utility functions above, g is not the total societal spending on the consensus good but rather the individual’s spending on the consensus good. That’s not really a public good anymore, but rather a particular type of idiosyncratic good.

Consider a nationalist choosing between: (a) self-sims, valued at 1 util/resource unit; (b) national benefit, valued at 0.01 util/unit; and (c) consensium, valued at 0.0001 util/unit. With 10 billion people total, 10% of whom are nationalists for the same nation, the nationalist funds (b): coordinating with 1 billion co-nationalists yields an effective multiplier of 1B × 0.01 = 10M, while coordinating with all 10 billion on consensium yields only 10B × 0.0001 = 1M. More generally, an agent prefers coordinating with a smaller group of size S on a good valued at v_S over a larger group of size L on a good valued at v_L iff S × v_S > L × v_L, that is, iff v_S / v_L > L/S.
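The nationalist's choice in this footnote comes down to comparing group size times per-unit value across the options. A small sketch of that comparison, using the footnote's numbers; the function names are hypothetical and only for illustration.

```python
def coordination_multiplier(group_size, value_per_unit):
    """Effective value of one unit spent when the whole group matches it."""
    return group_size * value_per_unit


def prefers_smaller_group(small_size, small_value, large_size, large_value):
    """True iff S * v_S > L * v_L, equivalently v_S / v_L > L / S."""
    return (coordination_multiplier(small_size, small_value)
            > coordination_multiplier(large_size, large_value))


# Numbers from the footnote: 10B people, 10% of them co-nationalists.
options = {
    "self-sims": coordination_multiplier(1, 1.0),
    "national benefit": coordination_multiplier(1_000_000_000, 0.01),
    "consensium": coordination_multiplier(10_000_000_000, 0.0001),
}
best = max(options, key=options.get)
print(best, options)
```

Here `best` comes out as national benefit, since 1B × 0.01 exceeds both 10B × 0.0001 and the uncoordinated value of self-sims.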

Can Liberal Democracy Survive AGI?

Fin Moorhouse — Wed, 11 Feb 2026 15:53:02 GMT

Sam Hammond is Chief Economist at the Foundation for American Innovation. He discusses:

How collapsing transaction costs could push towards privatised alternatives to government functions
“Distributed denial of service” attacks against courts and regulators
What happens when existing laws can be more perfectly enforced with AI
Estonia’s government-as-API, as a model for AGI-era governance
Whether 20th-century social democracy depends on 20th-century technology
The UAE as a preview of post-scarcity governance
Mormons, religion, and social scaffolding in secular societies

Here’s a link to the full transcript.

ForeCast is Forethought’s interview podcast. You can see all our episodes here.


AI tools for strategic awareness

Owen Cotton-Barratt — Wed, 11 Feb 2026 12:27:31 GMT

This article was created by Forethought. Read the full article on our website.

We’ve recently published a set of design sketches for tools for strategic awareness.

We think that near-term AI could help a wide variety of actors to have a more grounded and accurate perspective on their situation, and that this could be quite important:

Tools for strategic awareness could make individuals more epistemically empowered and better able to make decisions in their own best interests.
Better strategic awareness could help humanity to handle some of the big challenges that are heading towards us as we transition to more advanced AI systems.

We’re excited for people to build tools that help this happen, and hope that our design sketches will make this area more concrete, and inspire people to get started.

The (overly-)specific technologies we sketch out are:

Ambient superforecasting — When people want to know something about the future, they can run a query like a Google search, and get back a superforecaster-level assessment of likelihoods.
Scenario planning on tap — People can easily explore the likely implications of possible courses of action, summoning up coherent grounded narratives about possible futures, and diving seamlessly into analysis of the implications of different hypotheticals.
Automated OSINT — Everyone has instant access to professional-grade political analysis; when someone does something self-serving, this will be transparent.

If you have ideas for how to implement these technologies, issues we may not have spotted, or visions for other tools in this space, we’d love to hear them.


Research note on the UN Charter

Tue, 10 Feb 2026 08:13:23 GMT

This article was created by Forethought. See the original on our website.

This is a rough research note based on 20 hours of work. Conclusions are tentative, and it hasn’t been reviewed by domain experts. Matthew van der Merwe did the original research in 2023; Rose Hadshar did subsequent editing.

Introduction

Many imagine that the transition to advanced AI systems will at some point lead to some kind of international agreement to govern how the technology is used. When contemplating this possibility, a natural question to ask is, how have important international agreements come about in the past?

One of the most salient modern examples is the founding of the United Nations. This research note gives a brief overview of the creation of the UN charter, before drawing some tentative observations with a bearing on the question of international AGI governance.

The main (tentative) takeaways are:

While the veto for permanent members of the Security Council was likely close to inevitable, the inclusion of France as a permanent member was highly contingent. The broad interpretation of the veto may also have been somewhat contingent, though Cold War tensions probably made it fairly likely.
Intellectuals and civil society groups played a significant role in the drafting of the Charter.
US domestic politics and public opinion exerted strong influence on the Charter.
Most of the work happened before the San Francisco conference, and most of the work was done by the US and the UK.
Unlike the League of Nations, which was a very idealistic project, the UN seems to have been inspired by a mixture of idealism and pragmatism.

Some caveats:

This note is based on 20 hours of preliminary research, and hasn’t been reviewed by domain experts. The main sources used were Schlesinger (2003), Act of Creation: The Founding of the United Nations and Ehrhardt (2020), The British Foreign Office and the Creation of the United Nations Organization, 1941–1945. Where not otherwise stated, information comes from those books.
It focuses on the lead-up to the creation of the UN Charter, rather than the history of how the UN unfolded over the subsequent 80 years.

What is the UN Charter?

The United Nations Charter was signed on 26 June 1945, at the close of the San Francisco Conference, which began two months earlier on 25 April 1945. It establishes the United Nations and sets out how it will be governed. The Charter has been largely unaltered since it was signed.

The origins of the charter stretch further back:

The League of Nations (established in 1920) was the main precedent for the UN (though historians often look even further back, to agreements like the 1814–15 Congress of Vienna and the 1899 and 1907 Hague Conventions drawing up the laws of war).
On 1 January 1942, the ‘Big Four’ nations (US, USSR, UK, China) signed the Declaration by United Nations. This formalised the coalition of the Allies against the Axis powers, and was signed by 22 nations the following day, and an additional 21 by 1945.
On 30 October 1943, the Big Four signed the Declaration of the Four Nations / Moscow Declaration. This declaration stated for the first time that those governments “recognize the necessity of establishing at the earliest practicable date a general international organization, based on the principle of the sovereign equality of all peace-loving states, and open to membership by all such states, large and small, for the maintenance of international peace and security.”
Between August and October 1944, the Big Four agreed to the Dumbarton Oaks proposal. This was effectively the first draft of the UN Charter, including things like the basic structure of the UN, the composition and powers of the Security Council, and voting procedures.
By the eve of the San Francisco conference in 1945, the broad parameters of the UN Charter had already been agreed.

The full text of the UN charter is only 9,000 words long. It covers:

Purposes and Principles (chapter 1): The Charter sets forth the UN’s objectives to preserve international peace and security, encourage friendly relations and cooperation among countries, and coordinate actions in achieving common goals, emphasizing peaceful dispute resolution, the sovereignty of member states, and the prohibition of force in international relations, barring collective defense.
Membership (chapter 2): Countries were eligible for initial membership if they had previously signed the Declaration by United Nations (i.e. joined the Allies against the Nazis); or if they attended the San Francisco conference.1 The Charter also sets out procedures for admitting, suspending, and expelling members.
Organs (chapters 3-5): These chapters detail the UN’s six principal organs: the General Assembly, the Security Council, the Economic and Social Council, the Trusteeship Council, the International Court of Justice, and the Secretariat.
- The General Assembly consists of all member nations.
- The Security Council consists of 5 permanent members (US, USSR, UK, China, France) and 6 (later increased to 10) rotating two-year members.
Pacification Functions and Powers (chapters 6-7): Chapter 6 encourages the peaceful resolution of disputes, while Chapter 7 grants the Security Council significant powers to act against threats to peace, breaches of peace, or acts of aggression, including economic sanctions and military action.
Other matters (chapters 8-18): regional arrangements (chapter 8); the Economic and Social Council (chapters 9–10); non-self-governing countries and trusteeship (chapters 11–13); the International Court of Justice (chapter 14); the Secretariat (chapter 15); miscellaneous provisions (chapter 16); transitional arrangements (chapter 17); and the amendment procedure (chapter 18).

Some of the most significant elements of the Charter are about the Security Council. In particular:

The balance of power between the Security Council and the General Assembly:
- The Security Council has a monopoly over security matters; the Assembly has no equivalent monopoly over economic & social matters.
- Assembly resolutions, while carrying an important symbolic weight, are not binding; Security Council resolutions are binding upon all members.2
- The Assembly meets annually, whereas the Security Council can meet at any time.
The veto: Security Council decisions on ‘procedural matters’ can be made by a ~60% majority (7 of 11; later 9 of 15). Decisions on all other matters require a ~60% majority and affirmative votes from all five permanent members.
Military enforcement: the Security Council is empowered to settle international disputes and enforce peace, including via non-military measures and, if these are inadequate, military measures against aggressor states. All UN members are required to make forces available when asked to do so.

Brief timeline

This timeline is based on Kennedy (2007), chapter 1 and Schlesinger (2003), chapter 1.

Prehistory

1795: Immanuel Kant wrote Toward Perpetual Peace, laying out some foundational thinking on global federations.
1815–1822: Conferences between European powers after the Napoleonic Wars, beginning with the Congress of Vienna. Through the rest of the century, the leaders of Europe’s leading states, referred to as the Concert of Europe, gathered thirty times to discuss urgent political issues.
1864: Creation of the International Committee of the Red Cross. Arguably the first treaty-bound international organization.
1899: First Hague conference which codified the treatment of civilians and neutrals and provided a mechanism for the peaceful settlement of disputes. 26 countries in attendance, including all major powers.
1907: Second Hague conference, with 44 nations (including most of Latin America).

League of Nations era

1916: Wilson first articulates his vision for a league of nations, and commissions a secret multidisciplinary group (The Inquiry) of geographers, historians, political scientists, and other experts to develop plans. The Inquiry’s research director was Walter Lippmann, then 28.
1918: Wilson enshrines his vision for the League of Nations in the Fourteen Points peace proposal to the Germans. The final point calls for forming “a general association of nations” to afford “mutual guarantees of political independence and territorial integrity” (the future League of Nations).
1919: Wilson encounters major domestic opposition to the League of Nations in the US, from isolationist Republicans. The Senate rejects US accession to the League. This and health issues undermine Wilson’s efforts to lead the project.
1920: The League of Nations officially comes into being.
1930s: The League fails to handle several major crises: the Japanese invasion of Manchuria (1931); Germany exiting the League (1933) and occupying the Rhineland, Czechoslovakia, and Austria (1936–38); the Italian invasion of Ethiopia (1935); and the USSR invasion of Finland and its subsequent expulsion from the League (1939).

WW2

14 August 1941: The Atlantic Charter is signed between the US and UK. It sets out principles for post-war order (full text). It doesn’t include explicit mention of an international organisation, but does acknowledge that it is “pending the establishment of a wider and more permanent system of general security”.
1 January 1942: The Declaration by United Nations is signed by the Big Four (US, UK, USSR, China), followed by 22 allied nations the following day. There is no mention of an international organisation.
30 October 1943: The Declaration of the Four Nations / Moscow Declaration, makes the first mention of an international body: the governments of the Big Four “recognize the necessity of establishing at the earliest practicable date a general international organization, based on the principle of the sovereign equality of all peace-loving states, and open to membership by all such states, large and small, for the maintenance of international peace and security.”
1–22 July 1944: The Bretton Woods Conference establishes the post-war global financial order and what would become the World Bank and IMF.
21 August to 7 October 1944: The Dumbarton Oaks Conference between the Big Four leads to a more detailed proposal for the establishment of a “general international organization” (proposal text).
4–11 February 1945: The Yalta conference between the UK, the US, and the USSR. Stalin commits to the Soviet Union joining the United Nations and demands a veto for the great powers. It was agreed that membership would be open to nations that had joined the Allies by 1 March 1945.
12 April 1945: Franklin D. Roosevelt dies in office and is succeeded by Truman. Truman learns about the atomic bomb.
25 April 1945: The United Nations Conference on International Organization begins in San Francisco.
8 May 1945: Germany surrenders; VE day.
26 June 1945: After working for two months, 50 nations signed the Charter of the United Nations. The charter stated that before it would come into effect, it must be ratified by the governments of China, France, the USSR, Great Britain and the United States, and by a majority of the other 46 signatories.
16 July 1945: The Trinity test.
17 July to 2 August 1945: The Potsdam Conference between the UK, the US, and the USSR.
6 and 9 August 1945: Atomic bombings of Hiroshima and Nagasaki.
2 September 1945: Japan surrenders; the end of WW2.
24 October 1945: The UN officially comes into existence after ratifications.
June 1946: The Baruch Plan for international arms control is presented to the UNAEC.
30 December 1946: The Baruch Plan fails to pass due to USSR veto.

Tentative observations

The P5 & the veto

The most significant article in the Charter is the one that grants the five permanent members of the Security Council (P5) a veto on all ‘non-procedural matters’.

The existence of the veto in the first place seems somewhat over-determined:

The Dumbarton Oaks Charter gave the Big Four a kind of meta-veto in the drafting process for the Charter: they could veto amendments from lesser countries.3

The US Congress made clear that it required a veto (and that the lack of a veto was why Congress had previously scuppered the League of Nations). And a US veto would need to be mirrored, at least, by a veto for the USSR.
It’s important to remember that initially, member countries were imagining that the UN would have its own serious UN force under the Military Staff Committee. This possibility presumably made a veto seem even more important (though ultimately the Military Staff Committee became “a non-functioning body” because of Cold War tensions, and was effectively defunct by 1948).4

However, several important aspects of the veto seem more contingent:

The addition of France to the ‘Big Four’ was largely a demand of the UK / Churchill, who apparently wanted another European power to counterbalance US influence within the Western bloc. Interestingly, France had to be persuaded to accept the seat.
As for the veto’s impact, the hinge point wasn’t necessarily the Charter per se but the subsequent formation of norms for what constituted ‘substantive’ issues within scope of the veto (as opposed to ‘procedural’ issues, which are outside its scope). Within a couple of years, the US and USSR both used the veto for non war-and-peace matters, establishing the obstructionist norms that have persisted since, and rendering the UN pretty ineffectual throughout the Cold War.
- However, given the Cold War, it’s hard to say how contingent this use of the veto was: perhaps it was very likely that the US and the USSR would interpret it in this way.

Intellectuals and civil society groups

Prior to the UN Charter, the plan for the League of Nations was drafted in significant part by a group of intellectuals appointed by Woodrow Wilson, called ‘The Inquiry’. Wilson commissioned 150 intellectuals from different disciplines to prepare materials for the WW1 peace negotiations, with a view to ‘solving’ geopolitical turmoil. This included drawing up post-war borders and establishing the League of Nations.

Given the ultimate failure of the League of Nations, this is more of a cautionary tale, and these elite-driven plans for the League were derided as “the professors’ peace”.5 However, this didn’t lead to a broader rejection of input from intellectuals when it came to UN planning. In part, this is because this intellectual milieu split into factions. The die-hard world federalists (like H.G. Wells and Clarence Streit) did lose influence, but the more moderate pragmatists (like Shotwell and Webster) remained influential.

Some of the most influential intellectuals on the drafting of the UN charter were:

Leo Pasvolsky, the “foremost author of the UN Charter”.6 Pasvolsky was a US State Department official who led the work on postwar planning for an international body from 1939 (though efforts only began in earnest in 1942).
James T. Shotwell, who helped draft the UN Charter. Shotwell was a history professor and a previous member of The Inquiry. He had also been instrumental in establishing the International Labour Organization and the Commission to Study the Organization of Peace.
Clark Eichelberger, who advised Roosevelt and the US delegation. Eichelberger was a peace activist and a prolific advocate for both the League of Nations and the UN. He served as a bridge between the State Department and civil society groups (for example by helping to select attendees and organising side events).
Gladwyn Jebb, the British Pasvolsky. Jebb was a Foreign Office official who led the UK’s planning for an international body from the early 1940s. He also served as the first Acting Secretary-General of the UN.
Charles Webster, who wrote a series of influential case studies of earlier international agreements. A history professor, Webster was one of two leading figures in British planning (with Jebb), and an expert in the precedent of great power agreements during the nineteenth century.

Campaigning groups and civil society organisations also played a significant role in the drafting of the UN Charter:

The Commission to Study the Organization of Peace (CSOP) was established by Shotwell and issued a report on “Fundamentals of the International Organization” which formed the basis for the US State Department’s Dumbarton Oaks proposal. In October 1944, CSOP assembled fifty organizations to discuss pro-UN strategy, and agreed to back a common campaign for the organization.7

Many of the most influential civil society groups were in attendance at the San Francisco conference. They were pre-selected by the State Department for alignment; groups favouring world government and reactionaries/isolationists were not invited.8

These groups secured three modest victories during the conference (but were ignored on the big issues like security, the veto, and trusteeship):
- Adding the word “education” into the charter (and thereby giving the UN some remit over such matters).
- Incorporating “human rights” and establishing a human rights commission.
- Article 71, enshrining the collaborative relationship between the UN and NGOs via the Economic and Social Council of the United Nations.

US domestic politics and public opinion

Woodrow Wilson’s plans for the League of Nations had been scuppered by domestic opposition in the US. Ratifying the treaty required a two-thirds Senate majority. Republicans objected that Wilson’s draft charter impinged on US sovereignty and undermined the doctrine of US non-entanglement. In the 1918 midterms, Wilson sought a mandate for his plans, but Republicans gained control of both chambers and blocked the US from joining the League.

Throughout the UN process, Presidents were constrained by the need for Republican support for the proposals, not wanting to repeat Wilson’s error. Roosevelt tried hard to loop in Republicans in the early planning, and secured Republican support for the high-level ambitions in 1943 (though other factors like Pearl Harbor presumably contributed to US isolationism falling out of favour). Truman and Roosevelt both gave major roles to high-ranking Republicans during the negotiations, most notably Senator Vandenberg (a key figure in the San Francisco conference) and John Foster Dulles.

Bipartisan support for the UN enabled Congress to pass two resolutions in favor of a global assembly, lending some public sanction to the process. First, on September 21, 1943, the House of Representatives passed the so-called Fulbright Resolution “favoring the creation of appropriate international machinery” to maintain the peace. Then on November 5, 1943, the Senate enacted the Connally Resolution (named after the head of the Senate Foreign Relations Committee), which called for the establishment of “international authority with power to prevent aggression” in the form of a “general international organization.”

As well as courting Republican support, US politicians seem to have been very focussed on shaping (via press / PR) and gauging (via polling) public opinion throughout the process.9 The British delegation, too, was very conscious of the necessity of maintaining US domestic support.10 This included allowing the US to take credit for much of the planning, which the UK viewed as important for the plan’s success.

The State Department embarked on a huge PR campaign to garner support for the UN Charter in 1944-45 (their first major PR campaign). It was widely regarded as successful.

Archibald MacLeish, Assistant Secretary of State, disseminated information about the UN in weekly forums, and distributed Watchtower Over Tomorrow, a film about the Dumbarton Oaks plan, to groups around the country. In late 1944, an eight-page pamphlet containing the text of the Dumbarton Oaks proposals was sent out to over 1.25 million people, a mass distribution unprecedented for the State Department, which placed the pamphlet on the best-seller list.11

Many civil society organisations also participated in the campaign. Clark Eichelberger wrote a thirty-two-page pamphlet on Dumbarton Oaks that, via his affiliate organizations, reached over 21,000 people. The National League of Women Voters sent out a discussion guide and text to six hundred local chapters around the country. The Woodrow Wilson Foundation mailed 318,000 copies of the Dumbarton Oaks text to individuals free of charge—nearly going bankrupt in the process. The national commander of the American Legion dispatched letters to his 12,000 posts urging the adoption of the UN Charter.12 And the Union for Democratic Action released 1 million copies of a cartoon brochure, From the Garden of Eden to Dumbarton Oaks.13

Polls reflected a change in perception from this PR blitz. In December 1944, only 43% of the American people had heard of Dumbarton Oaks. This rose to 52% by February 1945 and 60% by March 1945. 60% of Americans supported the San Francisco conference after Roosevelt’s January State of the Union address, rising to 80% after the Yalta conference. In April, on the eve of the San Francisco conference, 94% of the American public were aware of the conference.14

The San Francisco Conference itself was a huge media event, with 2,300 newspaper people in attendance,15 and press coverage seems to have been very important to delegates. Several journalists were also actively involved in the US efforts as insiders. Walter Lippmann, a journalist who at 28 had served as research director for Woodrow Wilson’s Inquiry and had remained close to the US government, attended the conference. Pasvolsky’s Advisory Committee, which worked from 1942 to develop a plan for the UN, included two journalists:16 Anne O’Hare McCormick (on the NYT editorial staff) and Hamilton Fish Armstrong (editor of Foreign Affairs).

However, media management at the start of the conference was poor (from a US perspective). The US delegation was reluctant to brief or leak to journalists, whereas the Soviets and others were much more obliging, resulting in a slew of coverage (in the NYT and elsewhere) critical of the US for taking firm stances against the USSR. The US then overhauled its media operation and started doing regular briefings and leaks, which brought the press more onto the US side.

Preparatory work

The basics of the UN Charter were agreed at the Dumbarton Oaks conference between the Big Four, with only a few details unresolved by the time of the San Francisco conference.

US sources describe the Charter as basically having been written by the US, without much input from the other great powers.17 However, Ehrhardt (2020) shows quite convincingly that the UK decided at various points to let the US take the credit, in order to keep US domestic opinion favourable to the plans.

Beyond the US and UK, though, there really was very little input from the other great powers.

In some ways this isn’t surprising, given how much this was a US plan and how much effort the US had put into it over several years leading up to the Charter. Pasvolsky began work at the State Department on what would become the UN in 1939. This work was effectively paused during the start of WW2 proper, but then really got underway in early 1942 with the establishment of a special subcommittee on International Organization. This subcommittee worked incredibly hard, meeting 45 times over 9 months, and issuing a preliminary draft to Roosevelt in March 1943.18 This was then re-drafted again and again over the next few months. By 29 December 1943, the draft had all the basics of the UN Charter: a small Executive Council with a Big Four veto to handle security matters, a General Assembly with all nations, a Secretariat and sub-agencies, and an international Court.

One notable difference between the US and UK was the enthusiasm of their leaders for the UN planning. Roosevelt and his Secretaries of State seem to have cared a great deal, and — at least by 1945 — seen the UN plan as one of the most important things on their plate. Truman, who took over just before the San Francisco conference, felt similarly. Churchill, on the other hand, has been described as “one of the main obstacles to adequate British planning and to the actual establishment of the United Nations Organisation”.19 He seems to have been generally not that interested, but then occasionally fixated on his own idiosyncratic (and poorly thought through) vision for an international organisation, which derailed things.

Idealism and pragmatism

A clear thread running through the story of the UN Charter is the balance between idealism and pragmatism.

The standard narrative is something like:

Wilson was an idealist20 and an ivory tower academic type.
His League of Nations plan failed because it paid insufficient attention to realist great power considerations (toothless enforcement, lack of buy-in from great powers, too democratic / consensus-based)21 and domestic political considerations (with Congress refusing to ratify the League Treaty).
The UN plans were driven forward by a certain amount of idealism from the US, but tempered with pragmatism based on the failed experiment of the League, and that’s why it worked.

Intellectually, there seems to have been a split in the 1920s and 1930s among the people who had worked on and advocated for the League of Nations plan into two factions:

An idealistic faction who remained wedded to the idea of a world federation / government and continued to advocate for this, but lost political influence because of the League’s abject failure, including thinkers like H.G. Wells, Clarence Streit, and later Bertrand Russell.
A more pragmatic faction, who set up think tanks like the Council on Foreign Relations and Chatham House, and ‘institutionalised’. They had a more moderate, but still idealistic internationalist worldview, and were the people who were brought into the UN planning: thinkers like Webster, Jebb, Shotwell, Eichelberger, and Walter Lippmann.

Other observations

The failure of the League of Nations loomed large over UN planning. There was a fairly clear historical example of how not to do things.
Spying was rife at the San Francisco conference.22 The US had a huge spying operation during the conference, including wiretapping diplomatic cables. This gave them a decent edge in negotiations, since they had inside knowledge of what other countries were thinking and where their reservations were. The USSR was later revealed to have had its own operation, including some sources within the US delegation.
Amendments to the UN Charter have been rare. Article 108 allows for amendments with the support of two-thirds of the General Assembly and all 5 permanent members of the Security Council. Article 109 provides for the convening of a “General Conference of the Members of the United Nations” to consider changes to the Charter, which can be triggered by a two-thirds vote in the General Assembly together with a vote of any nine members of the Security Council (seven in the original 1945 text). Such a convention was scheduled for 1955, but didn’t actually take place. To date there have been five amendments to the Charter. All took effect between 1965 and 1973, and were to accommodate the increased size of the UN following decolonisation.
There have been three major structural changes to the UN made without amendment:
- P5 abstentions in the Security Council have been interpreted in practice as ‘concurring votes’ with respect to the veto on non-procedural matters.
- After the collapse of the Soviet Union, Russia took the USSR’s place on the Security Council.
- In 1971, the PRC assumed the Chinese seat (previously held by the Taipei Nationalist government) following General Assembly resolution 2758.
Lots of the Charter is fairly redundant today. For example, the UN Charter envisaged a key role for the UN in economic and social matters, but the UN has been superseded by other bodies on economic matters—namely the Bretton Woods system for international finance.

Appendix: Locksley Hall

For I dipt into the future, far as human eye could see,
Saw the Vision of the world, and all the wonder that would be;
Saw the heavens fill with commerce, argosies of magic sails,
Pilots of the purple twilight dropping down with costly bales;
Heard the heavens fill with shouting, and there rain’d a ghastly dew
From the nations’ airy navies grappling in the central blue;
Far along the world-wide whisper of the south-wind rushing warm,
With the standards of the peoples plunging thro’ the thunder-storm;
Till the war-drum throbb’d no longer, and the battle-flags were furl’d
In the Parliament of man, the Federation of the world.
There the common sense of most shall hold a fretful realm in awe,
And the kindly earth shall slumber, lapt in universal law.
—Alfred Lord Tennyson, 1842, Locksley Hall

This Victorian futurist poem was a favourite of two key figures in the story of the UN Charter: Winston Churchill and Harry Truman.

Truman kept a copy of it in his wallet for thirty years, reflecting in 1952: “it is a prophecy of the age in which we live now. And we are faced with a much greater age than the one that Tennyson dreamed about … I think we are at the door of the greatest age in history in everything. If we can prevent a third world war … the young people today, I think, will see … an age that our fathers and grandfathers dreamed about, but never thought would happen.”
Churchill called it “the most wonderful of modern prophecies” and quoted it throughout his life, including in his essay Fifty Years Hence.

References

Edis (2007). A job well done: The founding of the United Nations revisited.

Ehrhardt (2020). The British Foreign Office and the Creation of the United Nations Organization, 1941–1945.

Gerber (1982). ‘The Baruch Plan and the Origins of the Cold War.’ Diplomatic History 6:4, pp. 69-96. https://sci-hub.se/https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-7709.1982.tb00792.x.

Kennedy (2007). The Parliament of Man: The Past, Present, and Future of the United Nations.

McCullough (1992). Truman.

Schlesinger (2003). Act of Creation: The Founding of the United Nations.

‘United Nations Charter (full text)’ (1945). https://www.un.org/en/about-us/un-charter/full-text

Webster (1947). The Making of the Charter of the United Nations.

Zaidi and Dafoe (2021). ‘International Control of Powerful Technology: Lessons from the Baruch Plan for Nuclear Weapons’. Centre for the Governance of AI Working Paper. https://cdn.governance.ai/International-Control-of-Powerful-Technology-Lessons-from-the-Baruch-Plan-Zaidi-Dafoe-2021.pdf

This article was created by Forethought. See the original on our website.

The contentious members, who hadn’t joined the Allies by 1942, were Argentina (neutral / pro-Nazi), and Belarus and Ukraine (both Soviet Republics). Roughly speaking, Belarus and Ukraine were admitted as a concession to the USSR, who objected to the inclusion of Argentina.

Kennedy (2007).

Schlesinger (2003), p.182.

https://en.wikipedia.org/wiki/Military_Staff_Committee

Ehrhardt (2020), p. 100.

“Interview with Stephen Schlesinger on CNN’s Diplomatic License”. December 24, 2004.

Schlesinger (2003), p.71.

CSOP; the Congress of Industrial Organizations (CIO); the Council on Foreign Relations; the American Jewish Committee; the American Bar Association; the League of Women Voters; the Catholic Welfare Conference; the Foreign Policy Association; the NAACP; the Kiwanis International; the Lions International; the Rotary International; the National Education Association; the American Legion; the National Lawyers’ Guild; and twenty-seven other organizations.

For example, in the Senate debate over ratifying the Treaty, Senator Connally (a US delegate) “listed the numerous independent groups backing the agreement, and mentioned opinion polls in favor of the U.N. Charter” (Schlesinger (2003), p. 290).

Ehrhardt (2020), p. 33.

Schlesinger (2003), p. 84.

Schlesinger (2003), p. 71.

Schlesinger (2003), p. 84.

Schlesinger (2003), p. 162.

Schlesinger (2003), p. 56.

Hull (US Secretary of State) observed in his memoirs that “all the essential points in the tentative draft” that he had originally handed to the Russians and the British before the conference “were incorporated in the draft now accepted by the conference.” A US source on Dumbarton Oaks stated: “neither the British, the Russians, nor the Chinese seemed to take the preparatory work very seriously. Each of the governments sent Roosevelt some general thoughts on a global body, but, except for some lengthy British notations titled “Future World Organization,” nothing of serious consequence.” As a result, the Pasvolsky proposal, “which was by far the most complete and detailed of the three, became—albeit unofficially—the basic frame of reference for building a plan of world organization.” Schlesinger (2003), p. 65.

Schlesinger (2003), p. 57.

Ehrhardt (2020), p. 13, citing E. J. Hughes.

In some respects, but not others. https://en.wikipedia.org/wiki/Woodrow_Wilson_and_race

Gladwyn Jebb (a key UK delegate): “The League system...was about as perfect as the human mind could derive. The only trouble about it was that it wouldn't work. The reason why it wouldn't work was in the first place because the existing Great Powers could not agree as among themselves on certain essential things. And until we do get agreement between the World Powers on these essential things no international machine however perfect will ever work.” (Ehrhardt (2020), p. 196).

Schlesinger (2003), chapter 7.

Angel-on-the-shoulder AI tools

Owen Cotton-Barratt — Mon, 09 Feb 2026 10:17:34 GMT

See the full article on Forethought’s website.

We’ve recently published a set of design sketches for technological analogues to ‘angels-on-the-shoulder’: customized tools that leverage near-term AI systems to help people better navigate their environments and handle tricky situations in ways they’ll feel good about later.

We think that these tools could be quite important:

In general, we expect angels-on-the-shoulder to mean more endorsed decisions, and fewer unforced errors.
In the context of the transition to more advanced AI systems that we’re faced with, this could be a huge deal. We think that people who are better informed, more situationally aware, more in touch with their own values, and less prone to obvious errors are more likely to handle the coming decades well.

We’re excited for people to build tools that help this to happen, and hope that our design sketches will make this area more concrete, and inspire people to get started.

The (overly-)specific technologies we sketch out are:

Aligned recommender systems — Most people consume content recommended to them by algorithms trained not to drive short-term engagement, but to meet long-term user endorsement and considered values
Personalised learning systems — When people want to learn about (or keep up-to-date on) a topic or area of work, they can get a personalised “curriculum” (that’s high quality, adapted to their preferences, and built around gaps in their knowledge) integrated into their routines, so learning is effective and feels effortless
Deep briefing — Anyone facing a decision can quickly get a summary of the key considerations and tradeoffs (in whichever format works best for them), as would be compiled by an expert high-context assistant, with the ability to double-click on the parts they most want to know more about
Reflection scaffolding — People thinking through situations they experience as tricky, or who want to better understand themselves or pursue personal growth, can do so with the aid of an expert system, which, as an infinitely-patient, always-available Socratic coach, will read what may be important for the person in their choice of words or tone of voice, ask probing questions, and push back in the places where that would be helpful
Guardian angels — Many people use systems that flag when they might be about to do something they could seriously regret, and help them think through what they endorse and want to go for (as an expert coach might)

If you have ideas for how to implement these technologies, issues we may not have spotted, or visions for other tools in this space, we’d love to hear them.
