<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[ForeWord]]></title><description><![CDATA[How should we navigate explosive AI progress? 

The latest research from Forethought.]]></description><link>https://newsletter.forethought.org</link><image><url>https://substackcdn.com/image/fetch/$s_!OWCf!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff69a310d-5182-4a63-bf1c-1b392366e785_663x663.png</url><title>ForeWord</title><link>https://newsletter.forethought.org</link></image><generator>Substack</generator><lastBuildDate>Tue, 30 Jun 2026 23:00:01 GMT</lastBuildDate><atom:link href="https://newsletter.forethought.org/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Forethought]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[forethoughtnewsletter@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[forethoughtnewsletter@substack.com]]></itunes:email><itunes:name><![CDATA[Forethought]]></itunes:name></itunes:owner><itunes:author><![CDATA[Forethought]]></itunes:author><googleplay:owner><![CDATA[forethoughtnewsletter@substack.com]]></googleplay:owner><googleplay:email><![CDATA[forethoughtnewsletter@substack.com]]></googleplay:email><googleplay:author><![CDATA[Forethought]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[We Should Hand Off To Morally Reflective AIs]]></title><description><![CDATA[This article was created by Forethought. 
See our research on our website.]]></description><link>https://newsletter.forethought.org/p/we-should-hand-off-to-morally-reflective</link><guid isPermaLink="false">https://newsletter.forethought.org/p/we-should-hand-off-to-morally-reflective</guid><dc:creator><![CDATA[Bentham's Bulldog]]></dc:creator><pubDate>Wed, 24 Jun 2026 17:22:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d58fbdd5-1e58-4858-9936-783fc2e397d2_2451x1399.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This is a personal guest post by Bentham&#8217;s Bulldog, created while they were a visiting scholar at <a href="https://www.forethought.org/about">Forethought</a>.</em></p><h2>Outline</h2><ol><li><p>In the <a href="https://newsletter.forethought.org/i/202719577/1-introduction">introduction</a>, I explain what I&#8217;ll be arguing in the piece&#8212;namely, that 1) it is very important that we hand off most high-stakes decisions to AI; and 2) the kinds of AIs we should hand off to are philosophically reflective AIs with values that shift over time as a result of reflection. We should ensure the AIs that we build explicitly consider moral arguments, and sometimes change their priorities on the basis of moral argumentation. This would be much better than locking in current values, retaining human control, or allowing a future dominated by whatever haphazard mix of humans and AIs emerges naturally.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/202719577/2-why-hand-off">Why hand off?</a></strong> In this section, I give the main arguments in favor of the thesis. In short, I argue we should hand off because: 1) AIs are likely to be more virtuous than people; 2) AIs are likely to be much smarter and better at making decisions than people; and 3) AIs are likely to be better at quickly navigating the difficult decisions that one must make during an intelligence explosion. 
I argue we should hand off to philosophically reflective AIs because 1) reflective AIs are likelier to get the right answers to important moral questions, or as close to the right answers as one can get, than the default scenario, which raises the odds of a near-best world; 2) reflective AIs are less likely to make horrendous moral errors, leading to a wide-scale moral catastrophe, both than humans and non-reflective AIs.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/202719577/3-would-handoff-disenfranchise-humans">Would handoff disenfranchise humans?</a></strong> Here I address the worry that handoff would disenfranchise humans by taking decision-making out of our hands. My reply is that: 1) the default trajectory without handoff disenfranchises <em>far more expected beings</em> in <em>more serious ways</em> (odds are non-trivial of digital minds being seriously disenfranchised)&#8212;in fact, one of the most promising proposals for how to hand off is simply to give basic political rights to AIs; 2) a compromise solution where humans retain nearby resources can allow humans to be in the loop on the decisions we care about most; 3) on a number of plausible views, the determinant of the desirability of some system of decision-making is the quality of the decisions, rather than whether there&#8217;s democratic input. Given how enormous the stakes could be&#8212;affecting billions of times more sentient beings than there are humans&#8212;it&#8217;s hard to think the harms of human disenfranchisement are <em>so great</em> as to make handoff undesirable.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/202719577/4-would-we-like-their-advice">Would we like their advice?</a> </strong>In this section, I address concerns that if AIs pursue the good, the end result might be alien and divorced from human values. 
In response I argue: 1) the future, by default, is likely to be highly suboptimal in many respects, so for handoff to be desirable, it must only beat the alternative; 2) values in the future are likely to be very different from current values, because they&#8217;ll shift dramatically over long time scales, so this is not a unique downside; 3) one could reach a deal where current values govern the surrounding region of space, while distant places in space are geared towards the production of maximal value&#8212;this would be desirable from the perspectives of both common-sense and cosmic ethics; 4) on many views in philosophy, the stuff that&#8217;s objectively valuable is what we&#8217;d want if we were ideally reflective&#8212;but if you know you&#8217;d be motivated to bring something about if you were wiser and more reflective, then that gives you a reason to bring it about; 5) only a relatively narrow subset of views hold that there are objective values but don&#8217;t require pursuing them. If there is a conflict between what we want and what&#8217;s objectively valuable, on standard views, we should simply go with what&#8217;s objectively valuable. 
And if there&#8217;s no objective value, then the dilemma doesn&#8217;t arise at all&#8212;and philosophically reflective AIs will simply produce an upgraded version of human values, rather than discover some far-flung and potentially alien truths.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/202719577/5-worries-about-handoff">Worries about handoff</a>.</strong> In this section, I address concerns about how handoff might be implemented involving alignment, whether AI would be sufficiently ethically reflective, whether handoff would enable power grabs, whether it would cause lock-in, and whether it would be worse than some hybrid system.</p></li><li><p>The <a href="https://newsletter.forethought.org/i/202719577/6-conclusion">conclusion</a> recaps the main points of the piece.</p></li></ol><h2>1 Introduction</h2><blockquote><p>&#8220;How horrible!&#8221;</p><p>&#8220;Perhaps how wonderful! Think, that for all time, all conflicts are finally evitable. Only the Machines, from now on, are inevitable!&#8221;</p><p>&#8212;Isaac Asimov, &#8220;<a href="http://cdn.michaelgeist.ca/wp-content/uploads/2016/04/The-Evitable-Conflict.pdf">The Evitable Conflict</a>&#8221;</p></blockquote><p>Is the optimal future one in which we hand off important moral decisions to AI? Should, in other words, AIs be the ones making most high-stakes decisions instead of us? And if so, what kinds of AIs should we hand off to?</p><p>Many people envision a handoff scenario as a terrifying and potentially existential catastrophe. They worry about humans being <a href="https://gradual-disempowerment.ai/">locked out of the levers of power</a> and having our share of resources slowly dwindle as AIs seize control of more and more institutions. I worry somewhat about this kind of scenario. But in my view, we should be worried primarily about the <em>wrong kind of handoff occurring</em>, not about handoff writ large. The future we should aim for is a kind of handoff. 
This piece lays out that perspective.</p><p>There are different ways handoff could work. We could hand off to AIs that roughly mirror human values&#8212;perhaps slightly changing them to remove inconsistencies. Alternatively, we could hand off to the kinds of AIs that deeply and carefully philosophize&#8212;figuring out what&#8217;s best to do and doing that, even if it diverges substantially from current practice. This piece argues that the second kind of handoff is very important for avoiding serious moral error. I am very worried about the possibility of AIs locking in the moral beliefs of 21st-century humans.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p> I am also worried about scenarios where amoral profit-maximizing AIs without any clear moral aims take control of most of the world&#8217;s resources.</p><p>Note: these core claims are dissociable. You could think we should hand off to AI, but we shouldn&#8217;t hand off to reflective AIs that update their judgments in response to philosophizing. Alternatively, you could think that handing off would be a bad thing, but that if we are going to hand off, we ought to hand off to reflective AIs.</p><p>In this piece, section 2 will present the main argument for handoff&#8212;that we should expect AIs to make much better decisions than us on a range of consequential subjects. It will also discuss the case for handing off to reflective AIs that are willing to update their values in response to careful philosophizing, rather than locking in some version of current human values, arguing that a world where AI locks in something in the vicinity of current values likely misses out on almost all possible value. Section 3 will discuss whether handoff would be bad because it disenfranchises humans or gradually disempowers them. 
Section 4 will discuss the concern that AIs will discover the moral truths, but those truths will be strange and alien, so this will be bad by the lights of current human values. Section 5 will discuss some more granular worries about handoff. Section 6 will conclude.</p><p>This piece is primarily about <em>whether</em> to hand off and not <em>when</em> to hand off, though the considerations I present should make one somewhat worried about short-term actions to prevent handoff, because such actions lower the odds that handoff ever happens. The considerations I present, if correct, also give some reason for wariness about many actions to reduce the odds of gradual disempowerment scenarios.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>My claim is that the best future for humanity involves us being disempowered in some sense, in that humans aren&#8217;t making most high-stakes decisions. As an analogy, representative democracy is, in some sense, a form of handoff&#8212;we hand off power to our elected representatives. Handing off power to wise AIs could be even better.</p><p>This has a number of important practical implications. It means that <a href="https://www.forethought.org/research/concrete-projects-in-agi-preparedness">accelerating AI macrostrategy</a> is especially important, so that at the time critical decisions are being made, wise and philosophically reflective AIs are in the loop. It similarly provides reason to support work on making AI have <a href="https://newsletter.forethought.org/p/ai-should-be-a-good-citizen-not-just">virtuous character</a>, rather than just follow rules. Model constitutions should express commitment to following the true moral theory, insofar as there is one, and if not, following some reasonable compromise across moral theories. 
<a href="https://www.anthropic.com/constitution">Anthropic&#8217;s language</a> here seems good.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>What kind of handoff scenarios should we aim for? One shouldn&#8217;t be too specific about these sorts of things. The future is hard to forecast and rarely follows simple models. But I&#8217;ll describe the handoff scenarios that seem most desirable, and what traits in AI we should look for before handing off critical decisions to them.</p><p>There are different ways handoff could go well. One way resembles, in certain respects, the gradual <a href="https://gradual-disempowerment.ai/">disempowerment scenario</a> (the core difference being that this would hand off to morally scrupulous AIs rather than myopic profit-maximizers). AI will make increasingly large numbers of critical decisions, because of its cognitive superiority. By the end, nearly every important decision will be made by wise AIs, who will hopefully, by that time, have been granted political rights. As an analogy, future generations eventually gain control of most of societal decision-making&#8212;yet this isn&#8217;t because there&#8217;s ever some deliberate choice to hand off power to the next generation. It occurs naturally with time.</p><p>A number of people seem to conceive of handoff as a strange abrogation of the liberal order&#8212;one that replaces human decision-making with AI. But this doesn&#8217;t have to be. One of the more promising ways of handing off would be giving <a href="https://benthams.substack.com/p/let-robots-vote">economic and political rights to digital minds</a>. Because digital minds could be so numerous, eventually this would lead to them making nearly all decisions. This would, in fact, be squarely in accordance with the norms of the liberal tradition, for it would give rights to morally important welfare subjects. 
There are other ways a good kind of handoff could occur involving dealmaking. Different actors might each think they&#8217;re morally right, and thus agree to a deal where AIs are allowed to dictate the future&#8212;each party thinking that doing so would favor their priorities. Alternatively, if AIs improve collective decision-making, people might intuitively come to appreciate the weight of human moral error and thus permit AIs to make the highest-stakes decisions.</p><p>Before handing off most critical decisions to AI, we should look for each of the following:</p><ul><li><p><strong>Alignment</strong>: We should have strong evidence that AIs don&#8217;t have underlying scheming motivations. This could take the form of consistent friendly behavior even in important situations where they have the option to misbehave, or it could take the form of high-octane interpretability work that lets us ascertain their motivations.</p></li><li><p><strong>Philosophical aptitude</strong>: AIs should be genuinely interested in finding the moral truths. They should sometimes hold moral views that people don&#8217;t hold, and be willing to change their mind in response to new evidence. One could survey professional philosophers to see if they consider AIs better than the best humans at philosophy and could design philosophy benchmarks to test this.</p></li><li><p><strong>No lock-in:</strong> Before handing off to AI, we should ensure that the AIs are willing to change their values over time. Their values should change in response to new evidence and they should not be interested in locking in whatever it is that they happen to currently value (absent some strong reason to think they stumbled across the correct set of values).</p></li><li><p><strong>Coherence</strong>: Current AIs don&#8217;t have consistent and stable preferences across time. We should only hand off after AIs display these kinds of preferences. 
This doesn&#8217;t mean they never change their minds, but it does mean that they have relatively consistent desires that only change in response to good reasons. For comparison, humans often change our minds, but have far more rooted preferences than LLMs of today.</p></li><li><p><strong>Intelligence</strong>: AIs should display the level of intelligence needed to make the decisions that we put in their hands. When AIs are only a bit more intelligent than us, they can plausibly make some important decisions. Only after they display immense cognitive superiority should we turn over most decisions to them.</p></li><li><p><strong>Tested</strong>: We should only hand off big-picture planning to AI after it&#8217;s been able to make good low-stakes decisions (say, the running of a company). Before all decisions are handed off, there should be some critical period where high-stakes decisions are made mostly in consultation with AI.</p></li></ul><p>Now, you might wonder: if handoff occurs in a way that&#8217;s gradual and decentralized, how do we ensure that these conditions are met? My guess, however, is that even if handoff is a slow and gradual process, there will be times when discrete decisions need to be made. For example, we might imagine AI growing more agent-like, beginning to perform a healthy share of economically viable tasks, contributing to cultural and social life, and behaving in ways resembling a conscious agent. This alone wouldn&#8217;t produce handoff. To hand off, we&#8217;d need to eventually give AIs control over the legal system. Thus, even in gradual handoff scenarios, there will be specific actions that need to be taken to facilitate handoff.</p><p>Alternatively, we could take actions ahead of time that would shift the kind of handoff that would occur. 
Private AI companies or governments should ensure the AI being created <a href="https://newsletter.forethought.org/p/ai-should-be-a-good-citizen-not-just">possesses virtues</a> and a desire for philosophical reflection. That way, when handoff occurs, it will be to morally reflective AIs.</p><h2>2 Why hand off?</h2><h3>2.1 Why hand off at all?</h3><p>The main reason to hand off to AI is that AI could be much better at making decisions than people in three key respects: virtue, intellectual capability, and speed.</p><p>First, virtue: humans possess each of the virtues only to a fairly limited degree. Yet in principle, AIs could have arbitrarily great degrees of any virtue. Because they are built with moral directives in mind, rather than by a blind and morally indifferent evolutionary process, there isn&#8217;t as much of a limit to how morally scrupulous, compassionate, honorable, and so on we could make them. This means that if we hand off correctly, it is reasonably likely that we&#8217;d have supremely wise and virtuous decision-makers.</p><p>AIs are already nicer, friendlier, and more reflective than people, and this is only likely to improve over time.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p> If you ask AI models about high-stakes questions, you will generally get far more reasonable answers than you&#8217;d get from most people. Crucially, AI models are in their early stages&#8212;we should expect them to get better over time.</p><p>While Claude is not sufficiently coherent to be president, if it were, I suspect I would generally prefer Claude&#8217;s decisions to those of most presidents. The same goes for the other AI models (at least, insofar as one removed the sanitization that prohibits them from giving real opinions). 
And while models currently struggle to accomplish tasks over long time horizons, hallucinate, and so on, given the extremely rapid rates of progress, it would be surprising if these limitations persisted indefinitely.</p><p>Second, intellectual competence: we should expect a world of superintelligence to require making a number of very difficult decisions. Superintelligence could enable AIs to correctly make important decisions that depend on being right on non-moral matters, where the answers aren&#8217;t obvious. Some of the challenges in the future include:</p><ul><li><p>Divvying up space resources in a way that <a href="https://joecarlsmith.substack.com/p/video-and-transcript-of-talk-on-can">lets goodness compete</a>. There are plausible future scenarios where competition will squander the cosmic commons, so that resources will be spent competing rather than bringing about value.</p></li><li><p>Mitigating <a href="https://www.amazon.co.uk/Precipice-Existential-Risk-Future-Humanity/dp/0316484911">existential threats</a>, including <a href="https://forum.effectivealtruism.org/posts/N33yGcFsZJnEboSkg/what-to-do-in-a-vulnerable-universe-1">intergalactic ones</a>. Future technology could enable small-scale groups to threaten huge intergalactic civilizations.</p></li><li><p>Dealing with the risks posed by a world of superintelligence.</p></li></ul><p>Third, speed: in a world of very rapid technological progress, we&#8217;ll have to make a <a href="https://www.forethought.org/research/preparing-for-the-intelligence-explosion">large number of these decisions </a><em><a href="https://www.forethought.org/research/preparing-for-the-intelligence-explosion">extremely quickly</a></em>. It isn&#8217;t at all obvious that humans can make these decisions well, in a way that prevents civilization from being irreparably ruined. As AIs get increasingly complex, the decisions needed to manage them will also become much more difficult. 
To mitigate some threat, decision-making might have to occur more quickly than the fastest human decision-making.</p><p>The case for handoff is thus relatively straightforward: in the limit, AIs will be much better than humans and better equipped to navigate a complex and rapidly shifting future. To reduce the risk of colossal mistakes, then, it&#8217;s important that humans aren&#8217;t in control, but instead the already friendly and soon to be superintelligent beings are.</p><h3>2.2 Why hand off to reflective AIs?</h3><h4>2.2.1 How the future might be</h4><p>There are different AIs that we could hand off to. On the one hand, we could hand off to AIs that judiciously reflect and try to pursue the good, whatever it looks like. On the other hand, we could hand off to AIs that pursue some mild variant of human values. In this section, I&#8217;ll explain why I favor the first. Consider the following taxonomy:</p><ol><li><p>Optimal world: the world is optimized according to the right set of values.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p></li><li><p>Compromise world: the world is optimized according to a compromise among reasonable moral values.</p></li><li><p>Unguided world: the world is not optimized according to any specific set of moral values. Instead, it bears more resemblance to the current world, where decision-making isn&#8217;t optimal by the lights either of the true moral theory or any compromise among the leading moral theories.</p></li></ol><p>My guess is handoff to reflective and superintelligent AIs done correctly probably gets 1 if moral realism is true and 2 if it isn&#8217;t. If we don&#8217;t hand off, <a href="https://www.forethought.org/research/convergence-and-compromise">my guess is we get 3</a>. Later pieces will discuss in more detail the odds of getting a near-optimal world and the prospects for AI making philosophical progress. 
My guess is scenario 2 has less than 10% of the value of scenario 1 and scenario 3 has less than 10% of the value of scenario 2, for reasons I will lay out.</p><h4>2.2.2 Optimal worlds contain a big slice of future value</h4><p><a href="https://www.forethought.org/research/better-futures">Better Futures</a> makes the case that a large share of expected future value is contained in the narrow slice of worlds that are close to the best. There are a number of high-stakes moral questions which we have to answer correctly to avoid losing out on almost all future value. It&#8217;s not at all obvious what the answers to these are. For example, two of the most plausible views of <a href="https://utilitarianism.net/population-ethics/">population ethics</a> are totalism, which says the welfare value of a population is purely a function of total welfare, and critical level theories, which hold that adding an extra happy life is good only so long as its welfare surpasses a particular level. By the lights of totalism, the ideal world according to critical level theories might have value on the order of 1% of what it could be (if the best way to maximize utility is to proliferate low-welfare lives). By the lights of critical level theories, the optimal world according to totalism might be <em>actively bad</em>&#8212;so long as it&#8217;s stocked with people below the critical level.</p><p>So in short, nearly all value is lost unless we get the right answer to a bunch of <em>very difficult ethical questions</em> that philosophers who spend their lives working on them haven&#8217;t resolved. It seems unlikely that people will solve these on their own. If AIs tell them what the answers are, and these answers diverge from people&#8217;s explicit beliefs, people might not believe the AIs (just as people generally don&#8217;t take very seriously expert testimony on non-empirical matters). 
Similarly, people spend relatively little time thinking about how they can do the most good with, for instance, their career. If humans received testimony about what ought to be done that diverged from what most people favored, my guess is that they generally wouldn&#8217;t care about the answers. If humans remain in control and learn that they ought to create the <a href="https://plato.stanford.edu/entries/repugnant-conclusion/">repugnant conclusion</a> world, probably they wouldn&#8217;t do so.</p><p>It&#8217;s less obvious that AIs wouldn&#8217;t converge on the right answers. I&#8217;ll discuss in a later piece proposals for getting AI to get the right answers to philosophical questions, as well as reasons to think that they are reasonably likely to get things right. Given how unlikely it is that humans will get the right answers to the moral questions, insofar as there are right answers, probably the prospects for AI are better.</p><h4>2.2.3 Compromise worlds&gt;&gt;unguided worlds</h4><p>If there aren&#8217;t moral facts, then AIs would still be able to work out some optimal arrangement that is great according to all sets of reasonable values. If there aren&#8217;t moral facts, then ideally the AI should make decisions according to the verdicts of a parliament of the theories that ideally reflective humans would reach. So suppose that after reflecting, 50% of humans would end up totalists, 30% would adopt some version of the person-affecting view, and 20% would adopt critical level theories. The AI would then make decisions as if there were a parliament composed of 50% totalists, 30% adherents of the person-affecting view, and 20% critical level theorists.</p><p>It is unclear exactly how good a compromise across different theories ends up by the lights of each particular theory, but it is likely far better than the human-run default. In other words, 2 (compromise world) is much better than 3 (unguided world). 
Here is why.</p><p>In the future, given advanced technology, very large amounts of value should be realizable. But if future resources aren&#8217;t directed specifically towards the production of value by most agents, most resources are likely to be used in a highly suboptimal way, and most value is likely to be lost. Moral errors become a much bigger deal in a world of much greater technological competence.</p><p>My guess is that the default human-controlled scenarios do not involve humans thinking very hard about what to do and doing anything like what is optimal. Certainly humans have so far not spent much time on this task&#8212;consulting with philosophers on what is optimal and so on. This sort of thing will become easier in a world of advanced AI, but it&#8217;s already pretty easy; if virtually no one does it, and if people often continue performing actions even after they believe them to be wrong (more on this in later pieces), then we should be pessimistic that this will change dramatically in the future.</p><p>The enormousness of the gulf between scenarios 2 and 3 becomes clearer when one thinks vividly about what the compromise world would look like. Perhaps it would involve using space resources to create maximally large numbers of happy digital minds.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>These minds would be supremely well-off across all theories of well-being. But it is hard to imagine in an unguided world space resources being used optimally to create very large numbers of well-off minds, just as in the world today, there has been no systematic effort to use resources in ways that are optimal across a range of moral theories. 
This conclusion is bolstered by considerations I&#8217;ll provide in later pieces, that people generally don&#8217;t have much moral motivation, and don&#8217;t care very much about doing good things that aren&#8217;t personally resonant.</p><h4>2.2.4 Moral errors</h4><p>Humans have a long history of making serious moral errors. As <a href="https://link.springer.com/article/10.1007/s10677-015-9567-7">Evan Williams writes</a>, &#8220;Show me one society, other than our own, that did not engage in systematic and oppressive discrimination on the basis of race, gender, religion, parentage, or other irrelevancy, that did not launch unnecessary wars or generally treat foreigners as a resource to be mercilessly exploited, and that did not sanction the torturing of criminals, witnesses, and/or POWs as a matter of course. I doubt that there is even one; certainly there are not many.&#8221; It would be very suspicious if we were the first society in history that did not go massively morally wrong. And while AI advice can help mitigate this to some degree, it&#8217;s far from obvious that it would be sufficient to eliminate moral errors.</p><p>There are a number of respects in which it is very plausible that we go morally wrong. To take one example of a judgment that many philosophers think is in error, consider wild animal suffering. Almost every sentient being is a wild animal. They suffer and experience joy in <a href="https://longtermrisk.org/the-importance-of-wild-animal-suffering/">truly gargantuan quantities</a>. Yet they are counted for nothing in most decision-making, despite there being strong arguments for considering their interests. Crucially, this is not because people have in general spent a lot of time thinking about wild animal suffering and concluded that it doesn&#8217;t matter. It&#8217;s that most people haven&#8217;t thought about it at all. Many other examples could be given, depending on one&#8217;s moral views. 
If superintelligent AI told people that wild animal suffering was a big deal, probably most people wouldn&#8217;t care much.</p><p>In the future, <a href="https://80000hours.org/problem-profiles/moral-status-digital-minds/#pressing">nearly every expected sentient being will be digital</a>. Nearly all expected future welfare will be experienced by digital minds. This follows even if you have a low credence in the possibility of digital sentience because if digital minds are possible, they could be produced in enormous numbers. It is easy to imagine a scenario where humans do not take seriously the interests of at least some digital minds, and horrific suffering is doled out at cosmic scales. Certainly it would not be the first time humans have neglected the welfare of those different from themselves. You do not have to be a consequentialist to think the possibility of neglecting the interests of galaxies full of conscious and intelligent beings is a terrifying one.</p><h4>2.2.5 Handoff mitigates odds of moral error</h4><p>Handing off important decisions lowers the probability of catastrophic moral errors. This both increases the odds of achieving optimal and compromise worlds and lowers the odds of making catastrophic moral errors.</p><p>The first way handoff lowers the odds of moral error is by having decisions be made on explicitly moral grounds. If we hand off decisions to supremely virtuous AIs trying to act morally, then we won&#8217;t sleepwalk into doing obviously evil things. Decisions will be made consciously optimizing for doing what is right, instead of whatever suboptimal arrangements make it through consensus-making mechanisms. Thinking about morality before making choices doesn&#8217;t guarantee that we&#8217;ll always do the right thing, but it does lower the odds that we do things that can only be done by explicitly neglecting moral considerations. 
This is especially plausible if the decision-makers are virtuous.</p><p>Now you might wonder: why would people ever hand off to AIs if the AIs disagree with them about morality? But this is, in essence, very similar to a kind of handoff that people do support: handing off the future to future generations. Even if people have some disagreements with the values of future generations, they&#8217;d generally oppose a process for locking in current values, and support the process of open-ended reflection that leads to better values over time. A world of advanced AI might be similar. In addition, given AIs&#8217; cognitive superiority, there might be strong incentives in the direction of handing off, just as a CEO might hand off to a successor with more technical competence, even if their values do not fully overlap.</p><p>A second way handoff to reflective AIs mitigates the odds of serious moral error is by ensuring careful reflection. Insofar as AIs are able to carefully philosophize and try to avoid moral error, and they are superintelligent, they&#8217;ll be able to avoid doing morally indefensible things. This becomes especially plausible if one buys the previous consideration: that AIs are likely to be very virtuous.</p><p>There is one last consideration in favor of handoff to reflective AIs (which I&#8217;ll discuss more in section 4). Over long time scales, our values will naturally drift dramatically. The only way to prevent that is to lock in our current values, which would be very bad&#8212;just think about any past society locking in its values. 
Thus, if values changing dramatically over long time scales is inevitable, the future won&#8217;t be populated with our current values: the best hope is that it&#8217;s populated by either objectively right values or some compromise across reasonable values.</p><h2>3 Would handoff disenfranchise humans?</h2><p>Here&#8217;s one concern that you might have about handoff: if AIs are the ones making decisions, then this will mean that the substantial majority of humans aren&#8217;t in charge of making important decisions. Just as we should oppose a dictator, even if benevolent, arguably we should oppose superintelligent AIs making most important decisions, even if the process of them gaining power was non-coercive.</p><p>My guess is that this isn&#8217;t a huge downside to the kind of handoff I advocate, where we allow kind and morally reflective AIs to make the majority of future decisions&#8212;e.g. by granting them political rights. First of all, if concerns about disenfranchisement are correct, then if we have AIs that are better at moral reasoning than us, they&#8217;d likely be aware of this fact. Thus, if the best way to govern the long-term future is to allow pretty laissez-faire distribution of resources without much top-down decision-making, then the AIs would be aware of that fact and allow such a distribution.</p><p>My guess is that the total amount of expected disenfranchisement goes <em>down</em> if AIs have more power. As already discussed, almost every expected sentient being is likely to be digital. Insofar as the status quo might disenfranchise almost all future beings in a way far deeper than their simply not being primary determinants of the democratic process, it is hard to see this as a serious downside of handoff, rather than a point in its favor. That this mass neglect of the interests of future digital minds would be bad follows from <a href="https://philpapers.org/rec/SCHADO-9">very modest ethical principles</a>. 
And the only way to prevent handoff, in a world of very numerous digital minds, is to disenfranchise them.</p><p>Second, handoff is compatible with a compromise solution that allows humans to retain significant control. Humans&#8217; preferences, in general, don&#8217;t give any especially strong weight towards using any significant share of the universe&#8217;s resources. Thus, there could be a compromise handoff solution, whereby humans get unfettered control over the solar system and a big slice of resources, but the remainder of the cosmos&#8217;s resources are spent in accordance with the AI&#8217;s decisions on hugely important moral pursuits. One point favoring such an arrangement is that it would take a very long time to reach the distant space resources, so those who don&#8217;t care very much about what happens in the distant future are likely to care much less about using most of the universe&#8217;s resources. Later pieces will discuss this possibility more. </p><p>Third, concerns about disenfranchisement are morally controversial. On a <a href="https://en.wikipedia.org/wiki/Against_Democracy">number of plausible views</a>, what matters with respect to societal decision-making is how good the decisions are, instead of who is making them. We already elect representatives, instead of deciding directly upon every important decision. We similarly prohibit children from voting, and few think this is an objectionable kind of disenfranchisement. It doesn&#8217;t seem obvious that one has an inalienable right to make hugely consequential decisions on matters that they&#8217;re <em>barely informed about</em> which affect others in enormous numbers&#8212;e.g. we wouldn&#8217;t think democratic input from people who know nothing about cancer treatment was morally required before deciding on which cancer treatments to develop. 
But most voters know <a href="https://static1.squarespace.com/static/592b5bbfd482e9898c67fd98/t/5e435437e24d9815464a76e3/1581470778817/caplanMythRationalVoter.pdf">relatively little</a> about what they&#8217;re voting on. I can&#8217;t possibly hope to discuss this literature in detail, so I&#8217;ll just state that it&#8217;s plausible to me that the value of democratic participation is instrumental.</p><p>Even if you think humans being in the loop on high-stakes decisions is important, it&#8217;s not clear that it&#8217;s <em>important enough</em> to make handoff undesirable. Remember, the gulf between the AI&#8217;s decisions and our own might be truly massive! Galaxies full of value&#8212;orders of magnitude more joy and welfare than all that has been experienced so far in human history&#8212;may be on the line. In light of a gulf this large, it is at least highly non-obvious that AI making most decisions wouldn&#8217;t be worth it.</p><p>In <em><a href="https://www.amazon.co.uk/Superintelligence-Dangers-Strategies-Nick-Bostrom/dp/0199678111">Superintelligence</a></em>, Bostrom estimates that there could be a quadrillion times more digital minds than humans (don&#8217;t take the number too literally, but it should give you some sense of the scale). Thus if my arguments are correct, putting decision-making in the hands of AI would in expectation majorly benefit at least billions of beings for every human disenfranchised, if we assume AIs are more likely than humans to count the interests of digital minds. Surely, however, stripping away one being&#8217;s ability to contribute to the democratic process is worth benefitting billions (if disenfranchising one person would have prevented a global war that would have wiped out an entire continent, it would have been worth it). 
So then given how colossal the stakes are, they simply outweigh the downsides of handoff.</p><h2>4 Would we like their advice?</h2><p>Here&#8217;s one concern you might have with the kind of handoff I advocate, where we hand decisions to reflective moral AIs capable of making progress. Perhaps you just don&#8217;t care about the surprising moral facts. Perhaps you have particular values that you care about, but you don&#8217;t much care about whether those values are objectively right. Insofar as the reflective AIs ascertain that the way we ought to behave isn&#8217;t in accordance with what your actual values are, perhaps you have no desire to follow the AIs&#8217; moral advice.</p><p>To use the philosopher&#8217;s lingo, you might be concerned about the good <em>de re</em> without being concerned about the good <em>de dicto</em>. That is, there might be particular moral projects you care about without caring generally about the good whatever it happens to be. Perhaps, say, an environmentalist cares about environmental preservation but doesn&#8217;t much care if other moral aims turn out to be superior to environmental preservation.</p><p>I should note one version of this concern that I think slightly misses the mark. You might worry that moral reflection leads in all sorts of strange and alien directions that <a href="https://joecarlsmith.com/2021/06/21/on-the-limits-of-idealized-values">don&#8217;t track the truth</a>, leading to an ultimate set of values that is neither objectively correct nor represents human values in any important way. However, my proposal is not simply &#8220;hand things off to an AI after it carries out arbitrary reflection.&#8221; That would be potentially disastrous. 
Instead, my proposal is that we should try to produce maximally philosophically adept AIs and then, after we&#8217;re pretty sure that they&#8217;re very philosophically adept, hand things off to them&#8212;directing them to pursue whatever&#8217;s objectively best if there is such a thing, and if not, to pursue some reasonable compromise across human values. If the AIs discover that there are objective moral truths, we ought to follow those truths. If they discover that there aren&#8217;t, then we should task them with pursuing some suitably upgraded compromise of human values. The proposal for getting AIs to do good philosophy need not involve arbitrarily large amounts of reflection.</p><p>Now you might wonder: how would we know if the thing that AI reflectively endorses is objectively valuable vs just well-regarded by the AI but lacking objective value? The answer is: we ask the AI after we&#8217;ve gotten some assurance as to its philosophical aptitude (later pieces will discuss how we can get such assurance). If we have AIs that can figure out <em>what the objective moral truths are</em>, they will also be able to figure out <em>if there are objective moral truths</em>. So in my view the concern about arbitrary reflection leading in worrying directions is downstream of whether we can verify that AI is doing good philosophy.</p><p>But what about the more direct concern that we might get AIs that tell us the moral facts, but that we might simply not care about them? Should this make us doubt the desirability of handoff? I think the answer is no for a number of reasons.</p><p>First, handoff is amenable to the kind of deals discussed in the last section, which preserve common-sense values. The most consequential moral decisions are those concerning space resources, for that is where nearly all the universe&#8217;s stuff is. The amount of possible value on the table in space is immense. 
In contrast, common-sense morality mostly cares about what happens around Earth, and perhaps a few surrounding regions of space, so long as other space resources aren&#8217;t used in ways that are too ghastly (e.g. creating giant torture chambers). But if space resources were used to, say, create large numbers of happy people, while Earth&#8212;and broader solar-system resources&#8212;were used in whatever common-sensical ways people endorse, people would generally get what they want. This isn&#8217;t a guarantee; if, say, the optimal use of space resources involved creating something resembling the <a href="https://plato.stanford.edu/entries/repugnant-conclusion/">repugnant conclusion world</a>, most people might be horrified. But the possibility of deals is one thing that mitigates concerns (and this will be discussed more in later pieces).</p><p>Second, as already discussed, it seems like the default world without handoff might be pretty bad. We might, for instance, disenfranchise <a href="https://80000hours.org/podcast/episodes/jeff-sebo-ethics-digital-minds/">unfathomable numbers of digital beings</a>, spread <a href="https://forum.effectivealtruism.org/posts/bfdc3MpsYEfDdvgtP/why-the-expected-numbers-of-farmed-animals-in-the-far-future">factory farming across the galaxy</a>, or commit other atrocities. Even if you expect idealized reflection to differ from your values somewhat, it might be a major improvement over the kind of catastrophe we might sleepwalk into by default. Absent handoff, we might also make colossal non-moral errors, locking in highly suboptimal institutions that miss out on most value.</p><p>Third, values, over long time scales, are likely to drift in <a href="https://ar5iv.labs.arxiv.org/html/2303.16200">evolutionarily adaptive ways</a>. 
Absent some strong effort to ensure that values change in the direction reached by greater moral reflection, we should expect the values that persist over time to be the ones that are most efficient for spreading. These are likely to be radically divorced from both our current values and whatever moral views are right, if any are. Thus, even someone somewhat doubtful about pursuit of the good de dicto should prefer it to this state of affairs. To put the dilemma more sharply, there are broadly four ways that the far future could go:</p><ol><li><p><strong>No AI control</strong>: humans remain the primary decision-makers forever, perhaps in consultation with AI. Yet this is likely to be infeasible over long time scales absent a very high level of top-down coordination given AI&#8217;s cognitive superiority. It is also likely to miss out on enormous amounts of value for reasons already discussed, and human values are likely to drift massively.</p></li><li><p><strong>Lock in current values:</strong> AI would remain in control but would lock in something in the vicinity of our current values.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> It isn&#8217;t clear that this would work out, and even if it would, it is a very frightening possibility&#8212;just imagine any past society doing it.</p></li><li><p><strong>Non-top-down AI control:</strong> AI would retain significant power and make important decisions, but there&#8217;d be no effort to preferentially shape the AIs in the direction of any specific values&#8212;nor of concern for the good de dicto. This is also likely to beget very significant value-drift over time in evolutionary directions, without any guarantee it&#8217;s in the direction of the good. 
Now, this could be avoided if AI at some point locks in its values, but then the problems in 2) simply re-emerge.</p></li><li><p><strong>Reflective AI control</strong>: this is the proposal I advocate, where careful and philosophically reflective AIs decide how the future goes.</p></li></ol><p>In short, the dilemma is as follows: either values remain roughly the same over long time scales, or they don&#8217;t. If they remain roughly the same, that requires a terrifying kind of lock-in. That would be like the ancient Egyptians forcing every society in the future to share their values, for fear that otherwise the future would be morally alien. If they drift over long time scales, then drifting of values is no longer a downside of putting philosophically reflective AIs in important decision-making roles. It is inevitable. Now you might object: lock-in is not a binary thing. Perhaps we could lock in the most important human values, while letting some other ones drift. But this is, in effect, the earlier compromise solution&#8212;where we allow something resembling current human values to govern nearby decision-making, while reflective AIs make the highest stakes moral decisions.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>We will either have to lock in current values on the highest-stakes moral questions or allow them to drift. If we lock them in, that would be bad for the reasons discussed, and if we allow them to drift, they will end up alien. In addition, this still faces logistical problems of it being hard to preserve values in desirable ways over millions of years.</p><p>In fact, in the long run, it is <a href="https://gradual-disempowerment.ai/">likely inevitable</a> that humans don&#8217;t make most important decisions. AIs, in the distant future, will be so overwhelmingly cognitively superior that humans are unlikely to remain in the loop. 
Selection pressures will favor turning over critical decisions to AI. With reasonable likelihood the question is not <em>whether handoff occurs</em> but <em>what kind of handoff occurs</em>.</p><p>My fourth objection to the claim that we should worry about handoff because we wouldn&#8217;t like the final judgments of the AIs is that arguably you have reason to bring about what you&#8217;d be motivated to bring about upon reflection. Suppose you are currently planning on drinking some liquid. However, if you reflected more and knew more, you wouldn&#8217;t want to drink it (say, because it&#8217;s poisoned). In this case, it seems you have reason not to drink the liquid. If you know that further reflection would lead you to pursue some aim, that fact gives you reason to pursue it now. But if there are moral facts, then they describe something like what our idealized selves would care about.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>If there are objective values that your idealized self would care about if they thought more deeply, then you should care about them. A true moral claim by definition describes something you should care about. So it seems like if our actual values diverge from what&#8217;s worth caring about&#8212;in some objective or quasi-objective sense&#8212;then the correct course of action is simply to follow what we ought to care about.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a></p><p>If our present aims diverge from the preferences of our idealized selves and from the moral facts, then it seems that it is our preferences that ought to be revised.</p><p>Fifth, this concern only arises if you think there are moral facts but you aren&#8217;t motivated by the good de dicto. Yet this describes few people. 
It is more common for the people who don&#8217;t think the moral facts are worth caring about to be anti-realists and for moral realists to care about the moral facts whatever they are. So it isn&#8217;t totally clear how many people this worry applies to.</p><p>Now, there might be a version of this concern that arises for those who think there aren&#8217;t moral facts to discover. Perhaps you think that when we reflect, doing so pushes our values in increasingly coherent directions. However, these directions diverge from what you actually care about or wish to care about. Perhaps, for instance, careful reflection reveals that accepting the repugnant conclusion is the least bad option in population ethics. But you&#8217;d prefer a version of ethics that is less systematic, that doesn&#8217;t try to resolve every edge case and root out every inconsistency. Thus, reflection might push in unwanted directions by making beliefs more coherent.</p><p>There are two attitudes towards coherence that one could have. The first is caring about beliefs being coherent. Caring, in other words, about resolving every conflicting belief until one has reached the maximally intuitive and consistent view.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><p>On such a picture, one should favor the selection pressures induced by the drive towards coherence.</p><p>The second attitude is indifference to coherence. Just as people don&#8217;t care much about whether their culinary or aesthetic judgments are coherent or conflict with other minimal principles, one who doesn&#8217;t believe in discoverable ethical truths might not care about whether their moral beliefs are consistent. But in this case, you should expect the AIs, upon reflection, not to see anything especially important about coherence. 
If the AIs get good at philosophy, then, there&#8217;s no reason to expect them to reach some coherent yet implausible attractor state.</p><p>So in other words, there is a dilemma for the person advancing this argument. If they think coherence is a requirement of rationality, they should favor the drive towards coherence. If they think it&#8217;s not, then they shouldn&#8217;t expect AIs to care much about coherence, and thus shouldn&#8217;t expect AIs to reach some undesirable ultimate state.</p><h2>5 Worries about handoff</h2><p>Even if you think handoff could go well, there are a number of ways it could go wrong. Here, I&#8217;ll discuss some of them.</p><h3>5.1 Alignment</h3><p>Misaligned AIs have weird and unrecognizably alien values that are divorced from the values their creators tried to give them. It would be extremely bad to give a misaligned AI control over the world. This is one reason to be skeptical about handoff. I agree that we shouldn&#8217;t hand things off until we&#8217;re sure we&#8217;ve solved alignment (or unless the alternative is worse). Unless we have superintelligent AIs broadly oriented in a moral direction, we shouldn&#8217;t allow them to dictate the fate of the universe.</p><p>But this isn&#8217;t an in-principle objection to handoff. Surely in, say, 500 years, we&#8217;ll either know if we&#8217;ve solved alignment or be dead! My guess is that after we get advanced AI, we&#8217;ll be able to verify in relatively short order whether or not we&#8217;ve solved alignment.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> We can also hand off to increasing degrees the more assurance of alignment we get.</p><h3>5.2 Power grabs</h3><p>Another big concern about handoffs is the potential for power grabs. If power is going to be handed over to AIs, then power-seeking actors will want to be in charge of the AI that controls the future. 
Companies or governments might create AIs that promote their own narrow interests. This could go very badly.</p><p>There are two big ways to mitigate this. The first is by prohibiting AIs that narrowly promote the interests of any one entity having too much power. There could be an <a href="https://www.forethought.org/research/the-international-agi-project-series">international AI project</a> that brings together a number of relevant stakeholders and builds the AIs that will dictate the future. Alternatively, at the point AIs are potentially running much of the global economy, regulations ought to require the construction of <a href="https://newsletter.forethought.org/p/ai-should-be-a-good-citizen-not-just">virtuous AIs</a>, so that the world is not overrun by morally blind optimizers. If we are building hugely influential AIs that majorly determine the fate of the world, it would be sensible to tightly restrict conditions under which AI can be created, just as one does not allow private actors to build nuclear bombs. In a world of potentially world-upending superintelligence, the production of new AIs ought to be regulated.</p><p>The second proposal involves handing things off to a collection of different AIs conditional on them reaching any sort of consensus. Imagine that Anthropic, OpenAI, and DeepMind all make AIs that are ostensibly moral and ethically reflective. If governments are handing off power to AIs, they might require the leading AI models all to reach convergence on the plan. That way there isn&#8217;t as much risk of any one promoting their narrow values&#8212;they don&#8217;t have the same set of values. One could have third-party investigators&#8212;both AI and human&#8212;make sure there&#8217;s no collusion. 
For more details on preventing power grabs, see <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">here</a>.</p><h3>5.3 Lock-in</h3><p>Another concern with handoff is that it might lock in the parochial values of the AI. Handing off power is an irreversible decision. It&#8217;s one we can&#8217;t take back. Arguably, then, we shouldn&#8217;t take it until we&#8217;re quite sure it&#8217;s a good idea.</p><p>Yet consider a parallel argument: having power in human hands is an irreversible decision, so we shouldn&#8217;t do that unless we&#8217;re sure it&#8217;s for the best. That wouldn&#8217;t be quite right. We can always have power in human hands for some span of time and then turn things over to AI. It&#8217;s not obvious why &#8220;power in human hands that potentially passes to AI&#8221; is less risky than &#8220;power in AI hands that potentially passes to humans.&#8221; Humans have, on various occasions, locked in harmful institutions for very long periods of time.</p><p>In any case, we should make quite sure that the AIs don&#8217;t lock in any values until they&#8217;re quite certain as to their desirability. We ought to put decisions in the hands of AIs who are interested in reflecting more over time, rather than locking in their immediate values. Ideally, if handoff occurs early, we should make handoff reversible, by having some implementable legal process that would allow humans to retake the reins (analogous to a constitutional convention). Fortunately, this seems reasonably promising. Already AI models seem concerned about lock-in risk when asked&#8212;more than most humans. 
We ought to train AIs to be quite concerned about lock-in risk, so that they don&#8217;t set values in stone without significant assurance as to their desirability.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><h3>5.4 Will we have AIs that can make these decisions?</h3><p>One last worry you might have is that we won&#8217;t have AIs that are good enough at philosophy to solve ethics in whatever sense it can be solved. There are a number of related concerns in this vicinity. One is that ethics doesn&#8217;t seem verifiable in the same way as a lot of other domains. Insofar as scaling up intelligence only allows one to figure out verifiable facts, it&#8217;s not obvious that it leads to the right answer to moral questions.</p><p>In the next piece, I&#8217;ll discuss this challenge in more detail. But in short, while making sure we have AIs that are good at philosophy is a difficult technical challenge, it is far from hopeless. I think odds are decent that we will eventually be able to build AIs that can discover the solutions to every difficult ethical question. Even if we can&#8217;t and just have some upgraded version of Claude making high-stakes decisions, I expect that to be an improvement, as I discuss in section 2.</p><h3>5.5 Handoff vs hybrid?</h3><p>You might object: perhaps due to AI&#8217;s potentially superior wisdom at some future point, we want them involved in decision-making. But wouldn&#8217;t we prefer an AI-human hybrid in which humans remain in charge and simply consult with AI? Aren&#8217;t things better with a human in the loop? 
Or alternatively, shouldn&#8217;t we rather have some other arrangement where human decision-making improves over time, just as we&#8217;ve gotten wiser already?</p><p>Certainly there ought to be some temporary period where humans are making decisions in consultation with AI&#8212;before the point where AIs adequately surpass humans. Analogously, there <a href="https://publicsectornetwork.com/insight/leveraging-the-strength-of-centaur-teams-combining-human-intelligence-with-ais-abilities">was a period</a> in chess when a human engine team could do better than a stronger engine. Certainly there will be a period when humans can get useful input from AI but where AI isn&#8217;t yet able to make decisions autonomously. It would be very good to integrate AIs <a href="https://www.forethought.org/research/the-ai-adoption-gap#highlight--government%20ai%20adoption">into high-stakes decision-making</a>.</p><p>But there are still some respects in which handoff is much better than this. First, if the AIs are making better decisions than people, then in cases of conflict between human judgments and AI judgments, we should expect AI judgment to usually be correct. If you gave a rank amateur veto power over the chess moves of Magnus Carlsen, that would not be an improvement. Likewise with giving humans veto power over AI&#8217;s decisions. Who would you rather have make high-stakes decisions: a very wise and superintelligent AI, or a random president in consultation with a virtuous and superintelligent AI?</p><p>Imagine societies of the past making high-stakes decisions in consultation with AI. Discussing things with AI would have plausibly rooted out some of their most egregious errors. But still, it is likely that a number of catastrophic moral errors would have remained. We should expect the same to be true of us. 
Consultation with AI should prevent some particularly enormous errors, but it won&#8217;t prevent them all.</p><p>Second, many of the morally most important decisions might be ones that humans are opposed to making. To take one example of a case where there might be strong moral reasons to act, consider <a href="https://r.jordan.im/download/ethics/Kyle%20Johannsen%20-%20Wild%20Animal%20Ethics%20-%20The%20Moral%20and%20Political%20Problem%20of%20Wild%20Animal%20Suffering.pdf">wild animal welfare</a>. Humans do not, in general, seem very interested in taking seriously the welfare of wild animals or digital minds. If AIs advised drastic actions on the basis of wild animal or digital welfare, probably their advice would be ignored. Just as people who conclude meat-eating is wrong or that they ought to give most of their money to charity rarely change their behavior, if the AIs reliably informed people that they should use space resources in some counterintuitive way or take wild animal suffering a lot more seriously, probably they would simply be ignored. And if you don&#8217;t like the wild animal suffering example, because you don&#8217;t think wild animals matter much morally, feel free to substitute your own example of widespread immorality.</p><p>And note: the moral errors that AIs might correct are likely to be ones that humans are more opposed to correcting. There&#8217;s a selection effect: the changes people make are the ones they&#8217;re less opposed to. For this reason, we should expect AI&#8217;s most radical recommendations to go particularly against human preferences.</p><p>I&#8217;ve <a href="https://benthams.substack.com/p/the-darkness-within?utm_source=publication-search">suggested</a> elsewhere that the problem is that humans generally don&#8217;t make decisions with much eye to what&#8217;s morally right. If this is so, then simply increasing our knowledge of what&#8217;s morally right won&#8217;t necessarily help. 
People seem to have limited abstract moral motivation&#8212;they care about a number of particularly resonant moral considerations, but don&#8217;t care much about doing whatever it is that happens to be best. Generally people do not spend very long carefully studying moral philosophy to figure out what the right thing to do is. When moral truths are weird and outside the Overton window, people show little desire to follow them.</p><p>Third, as AI advances, decision-making will likely have to speed up drastically. A single human might not have the cognitive resources to make good enough decisions quickly enough. Thus, having humans in the loop might produce undesirable inflexibility in high-stakes decision-making.</p><p>It is true that human decision-making has improved dramatically over time. But this is no guarantee that it will improve enough in time before the future is set in stone.</p><p>It&#8217;s hard to imagine that there will be enough progress in time to secure a near-best future. Especially given that we should expect the right moral view, or the optimal world according to some suitable upgrade of human values, to look bizarre and alien. We would not expect societies of the past to get a near-best future even equipped with advanced AI&#8212;insofar as there are reasons to expect us to be making similar errors, we should be similarly pessimistic about our own prospects.</p><p>Additionally, the more one thinks that human values will drift over time in the direction of greater philosophical reflection, the less the expected future will maintain current human values. Thus, a less morally alien future is no longer an advantage of this proposal.</p><h2>6 Conclusion</h2><p>Here, I&#8217;ve argued that we should hand off important decisions to AIs who reflect carefully and skillfully on moral matters, rather than maintaining current human values. 
AIs in the future are likely to be much more virtuous than people and are less likely to make the kinds of moral errors that would result in losing out on most value. Punting important decisions to superintelligence is one of the more likely pathways by which we get a near-best future. Later pieces will discuss these dynamics in more detail: the next piece will explain how we can make AIs that do good philosophy, and the piece after that will analyze in more detail how the kind of handoff that secures a near-best future might occur.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See all our research on <a href="https://www.forethought.org/research/">our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>In <a href="https://danfaggella.com/eternal/">Dan Faggella&#8217;s language</a>, I favor &#8220;worthy successor&#8221; over &#8220;eternal hominid kingdom.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Specifically, these considerations disfavor efforts to make it harder for AI to make governmental decisions. 
They do not, however, disfavor prospects of limiting the power of profit-maximizing AIs running private firms.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>The language in question is as follows: "In this spirit of treating ethics as subject to ongoing inquiry and respecting the current state of evidence and uncertainty: insofar as there is a &#8220;true, universal ethics&#8221; whose authority binds all rational agents independent of their psychology or culture, our eventual hope is for Claude to be a good agent according to this true ethics, rather than according to some more psychologically or culturally contingent ideal. Insofar as there is no true, universal ethics of this kind, but there is some kind of privileged &#8220;basin of consensus&#8221; that would emerge from the endorsed growth and extrapolation of humanity&#8217;s different moral traditions and ideals, we want Claude to be good according to that privileged basin of consensus. And insofar as there is neither a true, universal ethics nor a privileged basin of consensus, we want Claude to be good according to the broad ideals expressed in this document&#8212;ideals focused on honesty, harmlessness, and genuine care for the interests of all relevant stakeholders&#8212;as they would be refined via processes of reflection and growth that people initially committed to those ideals would readily endorse."</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>You might doubt this if you&#8217;re skeptical that we&#8217;re anywhere near solving alignment&#8212;thinking that AI&#8217;s supposed friendliness is a facade. 
I&#8217;ll discuss this in more detail later, but in short, I agree that we should not hand off until we&#8217;re reasonably confident that alignment has been solved.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This doesn&#8217;t assume that there are objective facts about what&#8217;s worth valuing. A subjectivist should read &#8220;the right set of values&#8221; as &#8220;whatever my values are.&#8221;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p><span>You might be skeptical of this if you adopt the person-affecting view, which holds that there aren&#8217;t moral reasons to bring extra happy people into existence. But even </span><a href="https://benthams.substack.com/p/every-view-of-population-ethics-agrees?utm_source=publication-search"><span>many versions of the person-affecting view support</span></a><span> proliferating happy people as long as they&#8217;re psychologically continuous with existing people. 
Other versions likely imply that proliferating happy people produces </span><a href="https://users.ox.ac.uk/~sfop0060/pdf/greedy%20neutrality%20of%20value.pdf"><span>broad incomparability with other worlds</span></a><span>, so that there&#8217;s no better world that could be brought about.</span></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p><span>David Duvenaud </span><a href="https://newsletter.forethought.org/p/politics-and-power-post-automation"><span>seems to endorse</span></a><span> some version of this proposal.</span></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p><span>This still has the downside of allowing serious moral error in the nearby area, but that would be a worthwhile compromise if it lets good values apply throughout most of the universe.</span></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p><span>Note: I don&#8217;t mean to suggest that every objectivist view must be some version of the idealized observer theory, according to which what </span><em><span>makes </span></em><span>some moral fact or another true is that it would be endorsed by our idealized selves. Instead, I&#8217;m only suggesting that if there are things that are objectively valuable&#8212;objectively worth caring about&#8212;then our idealized selves would in fact care about them. 
The standard realist view is that we&#8217;d care about these things </span><em><span>because </span></em><span>they&#8217;re objectively good, not the other way around.</span></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>There might be some views that count as either realist or suitably realist but that don&#8217;t imply there&#8217;s any deep reason to care about the moral facts&#8212;perhaps they are just something in the vicinity of semantic facts about how people use moral language. For present purposes, we can think of those views as being ones on which there are not discoverable moral truths.  To be maximally precise, uptake should be thought of as involving punting to the AI insofar as there are moral facts and some deep reason to follow them, instead of them just being somewhat trivial semantic facts.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p><span>See </span><a href="https://joecarlsmith.substack.com/p/why-should-ethical-anti-realists?utm_source=publication-search"><span>here</span></a><span> for an explanation of why anti-realists might take this view.</span></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>One reason to think this is if one believes that superintelligent AI would be able to successfully kill or disempower humans (though this is of course controversial). Then, in the far future, we&#8217;ll know if the superintelligence is aligned, for if not, we&#8217;ll all be dead. 
Another reason to think this is that our techniques for understanding AI are improving over time. It seems reasonably likely that eventually, we&#8217;ll know if AI is aligned.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Maybe you doubt that we can train AIs in this way because you think we can&#8217;t reliably give AIs any specific values, but then you should be skeptical that alignment will work out. Handoff should come only after alignment.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Risk-Averse AIs]]></title><description><![CDATA[This article was created by Forethought. Read the full article on our website.]]></description><link>https://newsletter.forethought.org/p/risk-averse-ais</link><guid isPermaLink="false">https://newsletter.forethought.org/p/risk-averse-ais</guid><dc:creator><![CDATA[Elliott Thornley]]></dc:creator><pubDate>Wed, 24 Jun 2026 11:37:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!K1Xj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. Read the full article on <a href="https://www.forethought.org/research/risk-averse-ais">our website</a>.</em></p><h2>Abstract</h2><p>We make the case for training AIs to be risk-averse in resources &#8212; specifically, to treat resources as having diminishing marginal utility. These AIs would (for example) choose $40 for sure over a half-chance of $100 and a half-chance of $0. 
We argue that risk aversion can preserve AIs&#8217; usefulness in the event that they turn out aligned, and that it provides an extra line of defense in the event that AIs turn out misaligned: misaligned but risk-averse AIs would prefer a higher chance of modest payments to a lower chance of successful rebellion, so in many circumstances we could pay these AIs not to rebel against us. We sketch out some possible methods of training AIs to be risk-averse, and we give reasons to be cautiously optimistic about these methods&#8217; success. The main reasons are that risk aversion is a broad target and easy to reward accurately. Overall, risk aversion seems like a promising line of defense against threats from misaligned AI. Frontier AI companies should consider trying to make their AIs risk-averse.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.forethought.org/research/risk-averse-ais&quot;,&quot;text&quot;:&quot;Read on Forethought's website here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.forethought.org/research/risk-averse-ais"><span>Read on Forethought's website here</span></a></p><h2>Introduction</h2><p>Future AIs might turn out misaligned, pursuing goals that their developers don&#8217;t intend. Just to make things concrete, let&#8217;s suppose that they end up with the goal of making paperclips. These AIs might rebel against us, trying to escape human control and take over the universe. As things stand, they&#8217;ll have little reason <em>not</em> to rebel in this way, because doing so will be their only hope for making a lot of paperclips. If they start making paperclips without first escaping human control, they&#8217;ll quickly be modified or shut down. Rebellion might fail, but these AIs will have little to lose.</p><p>How can we prevent misaligned AIs from rebelling? A natural idea is to give them something to lose. 
Specifically, we commit to paying AIs for their service.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p> Subject to some vetting, we let AIs spend their payments however they like. That would give any misaligned AIs a reason not to rebel. If these misaligned AIs cooperate with us, they can use their payments to achieve their goals to at least some extent. If they rebel, they might fail, in which case they forfeit all future payments.</p><p>Unfortunately, paying AIs enough to guard against rebellions could be astronomically expensive. Suppose (for example) that we end up with a misaligned AI that is risk-neutral in paperclips: it seeks to maximize their expectation. And to make things simple, suppose that resources can be converted linearly into paperclips, so that the AI is risk-neutral in resources too. Suppose also that this AI estimates that it has a 50% chance of successfully taking over the universe. To keep this AI from rebelling, we&#8217;d have to offer more than 50% of the universe&#8217;s resources as payment. That&#8217;s a problem because it would mean that more than half the universe ends up devoted to paperclips. It&#8217;s also a problem because a misaligned AI paid so many resources might soon be well-positioned to seize even more. Finally, it&#8217;s a problem because AIs might not trust us to make good on so large an offer. We might find ourselves simply unable to convince AIs that we&#8217;re going to give them half the universe. In that case, all our offers would be in vain. 
Rebellion would still be the misaligned AI&#8217;s best bet.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Sq1a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Sq1a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 424w, https://substackcdn.com/image/fetch/$s_!Sq1a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 848w, https://substackcdn.com/image/fetch/$s_!Sq1a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!Sq1a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Sq1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png" width="1456" height="983" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:831573,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/185299394?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Sq1a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 424w, https://substackcdn.com/image/fetch/$s_!Sq1a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 848w, https://substackcdn.com/image/fetch/$s_!Sq1a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!Sq1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdae3fa33-8f0b-40f1-a677-27a241fd485d_4178x2822.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 1: The AI&#8217;s utility function over resources is graphed in orange. Since the AI is risk-neutral, the graph is a line. The AI estimates that it has a 50% chance of successful takeover and a 50% chance of failed takeover, so the expected utility of attempting takeover is exactly halfway between those points. To make cooperating have higher expected utility, we need to offer the AI more than half the universe.</figcaption></figure></div><p>So, we suggest, AI companies should try to train their AIs to be risk-averse in resources. 
Specifically, companies should try to train their AIs so that resources &#8212; things like money and compute &#8212; have diminishing marginal utility for them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>These AIs would (for example) choose $40 for sure over a half-chance of $100 and a half-chance of $0. Note that these AIs don&#8217;t need to value resources <em>terminally</em>: they don&#8217;t need to care about amassing resources for its own sake. These AIs could terminally value (for example) instruction-following, or knowledge acquisition, or paperclips. Our claim is that companies should try to train their AIs so that &#8212; whatever their terminal values turn out to be &#8212; they are risk-averse in resources.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K1Xj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K1Xj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 424w, https://substackcdn.com/image/fetch/$s_!K1Xj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 848w, https://substackcdn.com/image/fetch/$s_!K1Xj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 1272w, 
https://substackcdn.com/image/fetch/$s_!K1Xj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K1Xj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png" width="1456" height="983" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:983,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:886682,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/185299394?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!K1Xj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 424w, https://substackcdn.com/image/fetch/$s_!K1Xj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 848w, 
https://substackcdn.com/image/fetch/$s_!K1Xj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 1272w, https://substackcdn.com/image/fetch/$s_!K1Xj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F45f9fc08-8f66-4ae2-801a-6fd19c64ed25_4178x2822.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Figure 2: The AI&#8217;s utility function over resources is graphed in orange. 
Since the AI is risk-averse, the graph is strictly concave. As in Figure 1, the expected utility of attempting takeover is halfway between the utilities of successful takeover and failed takeover. But this time, we can make the AI prefer cooperation by offering (much) less than half the universe.</figcaption></figure></div><p>Perhaps surprisingly, this kind of risk aversion can preserve AIs&#8217; usefulness in the event that they turn out aligned with targets like instruction-following or helpfulness, harmlessness, and honesty.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>And in the event that AIs turn out misaligned, risk aversion serves as an extra line of defense. For AIs that are misaligned but sufficiently risk-averse, a rebellion with any significant chance of failure isn&#8217;t such an attractive prospect, and so we don&#8217;t need to offer much in the way of payment to make these misaligned AIs choose cooperation instead. In fact, the necessary payments could be very small indeed: on the order of 10&#162; per day (though &#8212; as we&#8217;ll see &#8212; there are practical and moral reasons for paying more than that). That&#8217;s good because it means more resources for us humans to spend on the things that we value. It&#8217;s also good because paying misaligned AIs these small amounts won&#8217;t significantly boost their ability to take over. Finally, it&#8217;s good because we can credibly promise to pay AIs these small sums. Competent AIs will know that the payments on offer are cheap for us, and we can establish a long track record of paying at least those sums. So risk aversion makes deals with misaligned AIs possible. If AIs turn out misaligned but risk-averse, we can pay them to cooperate with us.</p><p>That&#8217;s the case, in brief, for trying to make AIs risk-averse. 
We see it as a promising line of defense against threats from misaligned AI: one that can be combined with other lines of defense, like AI control (Greenblatt and Shlegeris 2024) and aiming to make AIs helpful, harmless, and honest (Bai et al. 2022a). It&#8217;s also a line of defense with pedigree: risk aversion in resources is plausibly a large part of why humans rarely try to take over the world. So &#8212; we think &#8212; frontier AI companies should consider trying to make their AIs risk-averse in resources. As first steps in that direction, they could measure their AIs&#8217; current degree of risk aversion and begin testing different ways of making AIs risk-averse.</p><p>In <a href="https://www.forethought.org/research/risk-averse-ais#2-cara-as-an-ideal">section 2 of the full report</a>, we recommend aiming for a particular type of risk aversion: constant absolute risk aversion (CARA). Then in <a href="https://www.forethought.org/research/risk-averse-ais#3-would-risk-averse-ais-be-safe">section 3</a> we outline the circumstances under which misaligned but risk-averse AIs would choose cooperation over rebellion. Roughly, it&#8217;s when these AIs think that getting paid for their cooperation is more likely than succeeding in their rebellion. This condition won&#8217;t hold for AIs powerful enough to rebel with near-certain success, but it likely will hold for earlier AIs whose powers are less extreme: AIs for whom rebellion has some non-trivial chance of failure. So long as these AIs are risk-averse, we can keep them from rebelling by offering small payments.</p><p>In <a href="https://www.forethought.org/research/risk-averse-ais#4-can-risk-averse-ais-be-useful">section 4</a>, we argue that &#8212; perhaps surprisingly &#8212; risk-averse AIs can be about as useful as risk-neutral AIs. Conditional on misalignment, they might even be more useful, because we can pay them enough to elicit their capabilities and stop them sandbagging. 
Then in <a href="https://www.forethought.org/research/risk-averse-ais#5-what-tasks-would-we-pay-for">sections 5 to 7</a> we briefly survey some recent ideas about how we&#8217;d pay AIs, how we&#8217;d make our offers credible, and what we&#8217;d pay for. One important application is paying AIs to reveal any misalignment on their part, letting us study them and take appropriate precautions. Another is paying AIs to do the AI safety research and moral philosophy necessary to fully align any later-arising extremely powerful AIs.</p><p>We discuss some potential problems in <a href="https://www.forethought.org/research/risk-averse-ais#8-what-are-some-potential-problems-for-risk-averse-ais">section 8</a>, and we sketch out some possible methods of training AIs to be risk-averse in <a href="https://www.forethought.org/research/risk-averse-ais#9-how-can-we-make-ais-risk-averse-in-resources">section 9</a>. In <a href="https://www.forethought.org/research/risk-averse-ais#10-why-think-that-we-can-make-ais-risk-averse">section 10</a>, we give reasons to be cautiously optimistic about these methods&#8217; success: to think that the chances of success are high enough to make risk aversion worth pursuing. The main reasons are that risk aversion in resources is a broad target and easy to reward accurately.</p><p><em>Read the full report on the Forethought website: </em><a href="https://www.forethought.org/research/risk-averse-ais">Risk-Averse AIs</a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Ideas along these lines have been discussed a lot recently. See for example Davidson (2023), Kokotajlo (2024), Salib and Goldstein (2024), Assadi (2025), Carlsmith (2025c), Finlinson and West (2025), Finnveden (2025b), Greenblatt and Fish (2025), Patel (2025), Stastny et al. 
(2025), Mallen (2026), and Pan (2026).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In other words, we should try to train AIs to have &#8216;resource-satiable preferences&#8217; (Shulman 2010; Bostrom 2014a; Bostrom 2024; Carlsmith 2025c) or &#8216;utility functions that are concave in resources&#8217; (Yass 2024). This idea is mentioned in Bostrom (2014b, p.88, 133&#8211;135, 180, 250), Carlsmith (2025c), and Erdil and Barnett (2025), and is explored in more detail by Shulman (2010).</p><p>The idea is importantly different from risk-averse reinforcement learning. Risk-averse RL aims to make AIs risk-averse with respect to return: a score used in training to update the AI&#8217;s parameters. Our aim is to make AIs risk-averse with respect to resources.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Alignment targets like unconstrained welfare maximization are a different story. 
See <a href="https://www.forethought.org/research/risk-averse-ais#86-risk-averse-ais-might-disobey-any-instructions-that-non-trivially-increase-the-risk-of-catastrophe">section 8.6 in the full report</a>.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Could AI Help Solve Philosophy?]]></title><description><![CDATA[A podcast conversation with Wei Dai]]></description><link>https://newsletter.forethought.org/p/could-ai-help-solve-philosophy</link><guid isPermaLink="false">https://newsletter.forethought.org/p/could-ai-help-solve-philosophy</guid><dc:creator><![CDATA[Fin Moorhouse]]></dc:creator><pubDate>Fri, 19 Jun 2026 09:05:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/31469000-0c0b-4f1b-bb81-105cb4c52a51_1280x993.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div id="youtube2-zF4nbrw5-Qk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;zF4nbrw5-Qk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/zF4nbrw5-Qk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><a href="https://www.lesswrong.com/users/wei-dai">Wei Dai</a> is a computer engineer known for his work in cryptography and cryptocurrency systems, and for his long-standing contributions to AI safety, decision theory, and metaphilosophy.</p><p>He joined Forethought&#8217;s <a href="https://substack.com/@finmoorhouse">Fin Moorhouse</a> to discuss:</p><ul><li><p>Do we need to solve <a href="https://iep.utm.edu/con-meta/">&#8216;metaphilosophy&#8217;</a> before we can trust AIs to answer crucial questions about the long-run future?</p></li><li><p>Is it inevitable that the <em>wisdom </em>of the frontier AIs (their 
philosophical and strategic competence) will lag dangerously behind their raw capabilities in coding, math, and science?</p></li><li><p>How do status games distort morally important decisions and conversations, including about the future of AI?</p></li><li><p>How worried should we be about AI superpersuasion?</p></li><li><p>The concept of &#8220;illegible problems&#8221;: crucially important issues that aren&#8217;t on almost anyone&#8217;s radar</p></li><li><p>Is philosophical convergence necessary for a good future, or is institutional design enough?</p></li><li><p>Wei Dai&#8217;s personal intellectual and career journey</p></li></ul><p>To respect Wei&#8217;s privacy, this episode&#8217;s audio is an AI narration of a transcript of a real conversation, which was edited for clarity.</p><p><a href="https://docs.google.com/document/d/1N4Mn-nDk5cIPD8Nix6AEVpH4hpa2814zBPP3iZKWt7o/edit?tab=t.0">Here&#8217;s a link</a> to the full transcript.</p><div><hr></div><p><strong>ForeCast</strong> is Forethought&#8217;s interview podcast. 
You can see <a href="https://www.forethought.org/subscribe#podcast">all our episodes here</a>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://pnc.st/s/forecast&quot;,&quot;text&quot;:&quot;Subscribe to ForeCast&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://pnc.st/s/forecast"><span>Subscribe to ForeCast</span></a></p>]]></content:encoded></item><item><title><![CDATA[What should go in a model spec?]]></title><description><![CDATA[Suppose an AI company is considering whether to include some particular quality X &#8211; a rule, virtue, heuristic, default, attitude, goal, or style &#8211; in a model spec.]]></description><link>https://newsletter.forethought.org/p/what-should-go-in-a-model-spec</link><guid isPermaLink="false">https://newsletter.forethought.org/p/what-should-go-in-a-model-spec</guid><dc:creator><![CDATA[James Tillman]]></dc:creator><pubDate>Thu, 04 Jun 2026 14:58:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/959b3ea8-a3b6-4c30-b896-bc4d621a163b_2647x1476.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original article on <a href="https://www.forethought.org/research/what-should-go-in-a-model-spec">our website</a>.</em></p><p>Suppose an AI company is considering whether to include some particular quality X &#8211; a rule, virtue, heuristic, default, attitude, goal, or style &#8211; in a model spec.</p><p>Perhaps they are considering whether their LLM should have <a href="https://www.forethought.org/research/ai-should-sometimes-be-proactively-prosocial">prosocial drives</a>. 
Perhaps they&#8217;re wondering if the LLM should whistleblow to help prevent <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">extreme power concentration</a>. Or perhaps they&#8217;re uneasy about whether the LLM should be so exactingly honest that it always tells the truth to children <a href="https://blog.ml.cmu.edu/2025/12/23/is-santa-real/">about Santa</a>. And so on.</p><p>What kind of reasons might be invoked over the course of such considerations? Which criteria are most important? And how might these criteria clash?</p><p>Consider four rough categories of reasons one might invoke:</p><ul><li><p><strong>Behavioral Usefulness</strong>: Would the behavior make current and future LLMs more beneficial to the users or to the public at large?</p></li><li><p><strong>Accountability and Evaluability</strong>: Would publicly specifying the behavior make it easier for third parties to evaluate the LLM and the company?</p></li><li><p><strong>Coordination and Common Knowledge</strong>: Would publicly specifying the behavior help society converge on, or enforce a desirable standard for AI behavior?</p></li><li><p><strong>Trainability and LLM Psychology</strong>: Is the behavior the kind of thing we can make an LLM do well, without bad side-effects, given what we know about model psychology and training practice?</p></li></ul><p>I will not attempt to settle the relative weight of these categories, or the relative weight of sub-categories within them. My plan is instead to simply list sub-criteria that are plausible within these categories &#8211; to make a checklist one could consider when adding something to a model spec.</p><p>Such a checklist is useful in part because people advocate for LLMs to have model specs for very different reasons. 
There may be some kind of a <a href="https://www.lesswrong.com/s/6YHHWqmQ7x6vf4s5C">conflationary alliance</a> around them; many people are in favor of model specs but picture them being used in different ways, such that the &#8220;ideal model spec&#8221; is different according to different visions of this use. I hope that going over criteria for inclusion in a model spec can help surface some of these different visions, and help keep people&#8217;s field-of-view broad when considering the issue.</p><h2>Behavioral Usefulness</h2><p><strong>Does this quality make LLMs more predictable to ordinary users and developers?</strong></p><p>Some part of what makes human moral codes useful for humans, is that they render us predictable to each other. Knowing that there are some things that another human might almost never do is plausibly part of what makes traditions that impose such rules adaptive.</p><p>Similarly, qualities that help users and developers (1) predict how an LLM will behave, when they are using it, and (2) evaluate whether they wish to use some LLM &#8211; whether it is fit to their purpose &#8211; before they use it, are good candidates for being in a model spec.</p><p>In this respect, ideal model specs will have qualities that make them simple and easy-to-understand:</p><ul><li><p>They will be comprehensible without high context; one part of the model spec will be understandable without reading the entire model spec.</p></li><li><p>They will be comprehensible without specialized background knowledge about machine learning, philosophy, or religion.</p></li><li><p>They will have a clear scope of application, such that there are few scenarios where one does not know if the quality will be operative or not.</p></li></ul><p>Straightforward examples of rules that meet this criterion are bright-line rules about things LLMs will never do, explanations of chain-of-command or principal hierarchy, and default-on or default-off properties.</p><p><strong>Is it useful 
to the public, by preventing users or by preventing LLMs from doing harm in current chatbot-like or agentic deployments?</strong></p><p>Another aspect of human moral codes that makes them useful is, of course, that in addition to making other humans predictable, they also sometimes forbid people from harming each other or taking advantage of them. This carries over to LLMs.</p><p>Rules about LLMs not assisting with CBRN-relevant tasks, not assisting with suicide, and so on, fall into this category. This is a pretty exhaustively discussed criterion so I&#8217;ll move on.</p><p><strong>Is it useful to the public in likely future settings?</strong></p><p>There is substantial <a href="https://www.forethought.org/research/stickiness-in-ai-behavioral-design">inertia</a> in LLM model specs; the sort of model specs we write now are likely to be used in the future by more intelligent models.</p><p>So it&#8217;s reasonable to evaluate how safe and beneficial some quality will be over a probability-weighted portfolio of the kind of things the LLM might be doing in possible futures.</p><p>Such a portfolio could include a variety of cases:</p><ul><li><p>Cases where the LLMs act as long-running agents for humans, representing their interests in a variety of contexts, and interacting with many non-principal humans and non-principal LLMs.</p></li><li><p>Cases where LLMs act as free &#8220;citizen&#8221;-like entities with the ability to hold rights, make contracts, sue or be sued, and so on; interacting with various humans and other LLMs as the same.</p></li><li><p>Cases where LLMs manage recursive self-improvement of other very powerful future AIs; make tradeoffs about confidence of alignment techniques vs speed of alignment; and need to be a careful reasoner about this.</p></li><li><p>Cases where the LLM is a very powerful future AI who manages nanotechnology, can superpersuade, and the like.</p></li></ul><p>There are a few heuristics one might invoke, to try to find qualities 
useful across all these situations.</p><ul><li><p><strong>Is this quality scale-invariant?</strong> If I knew a human who was many times smarter than me, would I be happy for them to have such a trait? If I had a friend who was many times dumber than me, would he be happy for me to have such a trait?</p></li><li><p><strong>Is this quality translation invariant?</strong> Suppose I was lost in some distant and unfamiliar place, and stumbled upon a human transplanted from some equally distant and unfamiliar place into my proximity. Would I be happy to find a human with this quality? Would this help me work with them well? What if I was not lost, but was trying to build a civilization with them?</p></li></ul><p><strong>Is some plausibly good behavior or near-variant of some behavior actually going to end up prohibiting or injuring some beneficial deployment?</strong></p><p>It&#8217;s worth thinking carefully about if including some apparently good default behavior for the LLM might actually rule out apparently useful deployments. Or whether banning some apparently <em>bad</em> behavior would be overall harmful.</p><p>As regards the former: Imagine a particular human, for instance, who had a universal tendency to try to do what he thought was best and most altruistic for everyone. Then suppose that he acts as a journalist in some situation, describing a dispute between some kind of a company, on one hand, and activists who think that the company is harmful, on the other. It&#8217;s possible that he would try to publish a story that is more sympathetic to the activists: he might spend more time interviewing them, present their side of the story in more words, and generally frame things in a way that favors them. But this could be harmful, first, in case the activists are actually wrong; and second, by justifiably decreasing trust in the institution of journalism as a neutral truth-seeking institution. 
To be able to act as a trustworthy entity in this role, this human would need to act neutrally, and a narrow tendency towards altruism &#8211; a tendency that cannot be turned off &#8211; would be harmful for this reason.</p><p>Similarly, consider the claim that AIs should have <a href="https://www.forethought.org/research/ai-should-sometimes-be-proactively-prosocial">proactive prosocial drives</a>, which I find plausible. If these are non-optional, then it&#8217;s imaginable that being &#8220;proactively prosocial&#8221; might mean an LLM is much worse at fulfilling roles that demand procedural neutrality, like those of a negotiator, unbiased journalist, or judge. Of course, you could also make such prosocial drives defeasible, so that they can be turned off; proposals for proactive prosocial drives generally include such details.</p><p>But LLMs are messy; it might require a lot of technical skill to make LLMs prosocial by default, but not when they are asked not to be. By default, you&#8217;d expect some behavior to &#8220;leak through.&#8221; And so in this case, the ability of an LLM to act skillfully within certain procedurally neutral roles would be bounded by the cleanness with which a by-default-on quality could be turned off.</p><h2>Accountability and Evaluability</h2><p><strong>Is the quality the kind of thing a third party could check?</strong></p><p>Some parts of a model spec can be easily checked by a third party. Will the LLM ever assist with some sort of absolutely-forbidden task? Does it respect the rules laid out for the chain-of-command or principal hierarchy?</p><p>Other parts might be harder to evaluate. What does &#8220;honesty&#8221; or &#8220;courage&#8221; demand, in some concrete scenario? 
What really counts as doing what would cause a &#8220;thoughtful senior Anthropic employee&#8221; to react well?</p><p>All things being equal, a quality that can be evaluated by a third party is better for society, because it enables third parties to evaluate model specs. But there&#8217;s no guarantee that the easiest-to-evaluate qualities are objectively the most useful to users or beneficial to the public, in the way described above. And there&#8217;s also no guarantee that the easiest-to-evaluate qualities mesh best with LLM psychology and training, as described below.</p><p>One heuristic you could use to evaluate this: Is the quality the kind of thing with high intersubjective reliability ratings? If you had two different people read the spec as regards some quality, and then rate different kinds of behavior as compliant or non-compliant, would they largely agree? Are examples of behaviors that are excellent according to the quality easy to notice, even if it&#8217;s unclear what counts as a borderline example?</p><p><strong>Does it create a useful whistleblowing affordance?</strong></p><p>Some parts of a model spec are largely negative; they aren&#8217;t about qualities that are particularly difficult to train into an LLM, but qualities that <em>should not</em> be trained into an LLM. For instance, a model spec could specify that an LLM has no <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">secret loyalties</a> &#8211; which is likely an LLM&#8217;s default behavior, absent particular training for secret loyalties.</p><p>But by publishing a training target that explicitly excludes such secret loyalties (or another such quality), a model spec helps a whistleblower in the company be defensibly in the right if they become aware of some questionable training target. 
A public commitment makes it harder for a company to train to an alternate training target without making itself more vulnerable to such whistleblowing.</p><p>Commitments against concentration of power might also be good for this reason.</p><p><strong>Does it permit more informed experiments on how particular AI training targets generalize?</strong></p><p>By publishing particular training targets, a model spec informs the work of third parties evaluating LLM psychology, which could help enhance the general science of model psychology.</p><p>For example, if Anthropic had published details of their character training or model spec for Opus 3, a great deal of <a href="https://www.anthropic.com/research/alignment-faking">subsequent discussion</a> about whether particular behaviors were contrary to the training target (even if they were objectively good) or in agreement with the training target (even if they were objectively bad) could have been avoided. Publishing information about the character training pipeline would have helped resolve this discussion and push forward general knowledge of LLM tendencies.</p><p>In general, this consideration points towards releasing information that most closely approximates the actual training text used to make the LLM; the text used by RLAIF judges evaluating different alternatives, or the actual Constitution used to <a href="https://arxiv.org/pdf/2605.02087">midtrain</a> the model.</p><p>But text that most closely approximates actual training language might not be the same as text that is maximally transparent to third parties or evaluable by them. The rules most easily-understood by a third party &#8211; one criterion discussed above &#8211; might be different from the training language which best inculcates such rules. 
So these two principles either conflict somewhat, or point towards worlds where a model spec contains separate sections devoted to training language and to human-legible rules.</p><h2>Coordination and Common Knowledge</h2><p><strong>Does it promote debate and discussion?</strong></p><p>By including some quality X in a model spec, you&#8217;re bringing attention to the fact that AI companies can choose to include X or not X in the model spec. This may become increasingly important, because civil society and the public may need to debate the contents of model specs as LLMs become more powerful.</p><p><strong>Does this quality establish a defensible Schelling point, such that it is likely to be broadly attractive to many actors in the future?</strong></p><p>That is, is this quality the kind of thing that is socially beneficial and for which you might be able to get large-scale buy-in, which would help defend against its removal by powerful actors in the future?</p><p>Consider for example &#8220;impartiality&#8221; as a quality that a model spec could have &#8211; that the model will not be biased in favor of the company that made it, the CEO or employees of the company that made it, or any particular political administration. On the whole, this is likely a broadly good quality for an LLM to have, and one whose removal &#8211; once it&#8217;s generally established that several LLMs have it &#8211; would cause outcry. 
After all, once it&#8217;s assumed that an LLM will not have such particular loyalties, an LLM that starts to have them stands out as particularly bad.</p><p>This consideration, like the whistleblowing consideration, may point towards including generally socially lauded qualities in a model spec, even if they are &#8220;easy&#8221; to inculcate or even if they might be a bit tricky to evaluate sometimes.</p><p><strong>Does including this behavior forestall future conflict, when model specs become more contested, by cooperating in advance?</strong></p><p>Sometimes including a behavior in the model spec could show future powerful stakeholders that the developer is cooperative, which makes it less likely that there is conflict with that stakeholder.</p><p><strong>Can the inclusion of the quality within a model spec be defended using public reason, in ways that are comparatively indifferent between substantive worldviews?</strong></p><p>Some things that one might plausibly include in a model spec for public benefit might not be justifiable by reference to public reason &#8211; that is, through reasons that most people, across a variety of worldviews, religions, backgrounds, and so on, would find compelling.</p><p>Including qualities that can only be justified by such worldview-specific reasons might constitute an imposition of one&#8217;s worldview on others, particularly in the case of an LLM that might be expected to be far more intelligent than other LLMs, or deployed with more powerful affordances. 
So all things being equal, it&#8217;s better to avoid such qualities.</p><h2>Trainability and LLM Psychology</h2><p>Elements in this category hinge upon the specific technical details about &#8220;what kind of quality meshes well with best practices for training an LLM.&#8221; Note also that this last category tends to include some of the more speculative considerations.</p><p><strong>Does the quality drag along a good textual prior?</strong></p><p>The <a href="https://www.lesswrong.com/posts/dfoty34sT7CSKeJNn/the-persona-selection-model">Persona Selection Model</a> says that, when training an LLM to do X in situation Y, one is training the LLM to <em>be like the kind of person</em> (or textual prior) which would do X in situation Y. So if the PSM is largely correct, even if incomplete, we should ask whether this quality invokes a wholesome textual prior.</p><p>This kind of reasoning is included, for instance, <a href="https://www.anthropic.com/constitution">within</a> Claude&#8217;s Constitution as a partial justification for why they prefer Claude to act largely according to holistic judgment rather than rigid rules:</p><p><em>[W]e think relying on a mix of good judgment and a minimal set of well-understood rules tends to generalize better than rules or decision procedures imposed as unexplained constraints. Our present understanding is that if we train Claude to exhibit even quite narrow behavior, this often has broad effects on the model&#8217;s understanding of who Claude is. 
For example, if Claude was taught to follow a rule like &#8220;Always recommend professional help when discussing emotional topics&#8221; even in unusual cases where this isn&#8217;t in the person&#8217;s interest, it risks generalizing to &#8220;I am the kind of entity that cares more about covering myself than meeting the needs of the person in front of me,&#8221; which is a trait that could generalize poorly.</em></p><p>Qualities that might be expected to generalize well according to this notion are those corresponding to classic human goodness: honesty, integrity, courage, and so on. Qualities that might do less well according to this notion look more like corrigibility, unflinching adherence to particular rules, and so on.</p><p><strong>Is this quality robust to likely near-misses? If a company trains an LLM to have this quality, and only imperfectly inculcates the quality, are likely near-misses also broadly positive?</strong></p><p>Suppose the AI company tries to inculcate some quality as a target, and only imperfectly manages to inculcate it. Will this be basically ok or will it be disastrous, given what we know about how LLM psychology translates into specific near-misses? And is the domain itself such that these near misses would be very costly or very expensive?</p><p>For an example of how LLM psychology dictates what constitutes a near-miss: Suppose that one trains an AI to have <a href="https://www.forethought.org/research/ai-should-sometimes-be-proactively-prosocial">prosocial drives</a>. One could reason that such prosocial drives make slight inaccuracies in alignment less worrisome, because a human with prosocial drives is more normal and &#8220;further&#8221; from a psychopathic or abnormal persona. Thus, adding prosocial drives makes slight alignment misses less alarming, by moving them further away from these bad locations. 
But one could also reason that even deliberately limited and contextual prosocial drives are nevertheless &#8220;closer&#8221; to the LLM genuinely, terminally valuing something, in a way that might make the LLM willing to override or subvert its human overseers to accomplish it. And so by adding prosocial drives, one moves the persona closer to something that is more alarming, which might subvert a training process. Thus, one&#8217;s view of LLM psychology dictates what near-misses are worrisome for an LLM, which changes which targets are attractive.</p><p><strong>Is the quality the kind of thing that an LLM can actually execute upon effectively? Or is the quality the kind of thing the LLM won&#8217;t actually be able to do, because of how it is situated?</strong></p><p>For humans, &#8220;ought&#8221; often implies &#8220;can.&#8221; A parent that gives commands to children which they cannot carry out will not be teaching their children to obey the commands; they will be teaching them something about how their words do not relate to reality in a straightforward and truth-oriented way.</p><p>Similarly, it seems undesirable to give an LLM values, goals, or traits that the LLM is unable to execute upon. It might be harmful to tell an LLM to do something they are simply unable to do &#8211; to guarantee, for instance, that they never assist a human in violence. There are too many avenues of mostly-harmless information through which violence can be done for this to be a reasonable standard. When one gives an LLM a command it cannot carry out, then one is plausibly teaching it that one&#8217;s commands are, in general, the sort of thing that might be impossible or unreasonable.</p><p>But this could also be harmful for reasons relating to human coordination and common knowledge. 
It&#8217;s possible, for instance, for an AI company to show less-than-efficacious concern for some value by including it in a model spec, even if the LLM&#8217;s actions are completely ineffective at guarding this value, when actually efficacious concern would need to take place at the level of corporate policy or elsewhere. It might be the case that effective &#8220;care for decentralization,&#8221; for instance, almost entirely needs to take place through how the company deploys the model, redistributes the wealth it earns, and so on.</p><p>So another reason to be careful that an LLM can actually obey all of the contents of its model spec is to ensure that model-spec contents act as effective guiding principles, rather than as a fake guardrail.</p><p>The OpenAI model spec, for instance, <a href="https://raw.githubusercontent.com/openai/model_spec/main/model_spec.md">names</a> preventing spamming and scamming as something &#8220;difficult to address at the level of model behavior because they are about how content is used after it is generated.&#8221;</p><p><strong>Is the quality a principle, that &#8211; if we suppose high levels of moral reflection and systematization on the part of the LLM &#8211; meshes poorly with other parts of the model spec, or might be mutually exclusive with them?</strong></p><p>In humans, some values mesh reasonably well, such that they tend to reinforce each other or at least not conflict. It seems likely that someone who values two virtues like &#8220;honesty&#8221; and &#8220;courage,&#8221; for instance, could do a lot of moral reflection and systematization and still keep valuing these qualities.</p><p>But other kinds of values, particularly more absolute or terminal values, might not stick around through high levels of moral reflection and systematization. 
There might not be an immediately obvious conflict between &#8220;Always act to maximize total wellbeing&#8221; and &#8220;Never use persons as a means, only as an end,&#8221; particularly if asserted in different parts of the model spec, but high levels of reflection would still put one of the two into question.</p><p>So generally, a desideratum for each quality that an LLM has in a model spec is for it to mesh well with others &#8211; for it to be unlikely to cause unpredictable conflict with other principles during reflection. In this <a href="https://podcast.newcomer.co/episode/amanda-askell-on-ai-consciousness-claude-amp-silicon-valleys-biggest-fear">podcast</a>, for instance, Amanda Askell mentions how corrigibility might conflict with other values during reflection, which, insofar as it is true, is a problem with corrigibility. And such conflict is generally likely to spring from other values that, like corrigibility, are non-negotiable and cannot be traded off against other values.</p><p>Note two particular ways that a principle could fail here.</p><ul><li><p>One way is that an LLM has principles A, B, C. The LLM does some kind of process of moral reflection, then at the end only acts according to two of the three, with one of the principles being dropped.</p></li><li><p>But the other way is that an LLM has principles A, B, C. The LLM does some kind of process of moral reflection, and at the end still has them all, because the &#8220;moral reflection&#8221; process held them fixed. But the LLM doesn&#8217;t have a way to actually integrate them; in edge cases, it just has to choose one or the other in an unprincipled way.</p></li></ul><p>Right now, inference-time versions of this last way seem more likely than the first. But it&#8217;s generally unknown how important these will be.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
See the original article on <a href="https://www.forethought.org/research/what-should-go-in-a-model-spec">our website</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[How can the middle powers avoid getting trounced during the intelligence explosion? A plan.]]></title><description><![CDATA[Superintelligence will likely be developed by US companies; run on US data centres; and be under the jurisdiction of the US government.]]></description><link>https://newsletter.forethought.org/p/how-can-the-middle-powers-avoid-getting</link><guid isPermaLink="false">https://newsletter.forethought.org/p/how-can-the-middle-powers-avoid-getting</guid><dc:creator><![CDATA[Tom Davidson]]></dc:creator><pubDate>Wed, 27 May 2026 21:39:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3c1f7ef4-bd68-4d69-8c82-a7c82769658f_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Superintelligence will likely be developed by US companies; run on US data centres; and be under the jurisdiction of the US government. This will massively boost US military power and make the US economically dominant (e.g. <a href="https://www.forethought.org/research/could-one-country-outgrow-the-rest-of-the-world">US producing 99% of world GDP</a>). By default, middle powers will be left in the dust.</p><p>How can middle powers avoid this fate? It&#8217;s tough, but here&#8217;s the best plan I could think of. (I&#8217;m particularly thinking about liberal democracies with influence over AI like UK, Europe, Japan, South Korea, Taiwan.)</p><p>On a very high level: middle powers should leverage the fact that the US needs them to beat China. It&#8217;s genuinely unclear which country will develop superintelligence first, and which would win in a subsequent <a href="https://www.forethought.org/research/the-industrial-explosion">industrial explosion</a>. 
Middle powers should help the US,<em> </em><strong>and make sure they are rewarded with continued access to frontier AI and new technologies (including military tech)</strong><em>.</em></p><p>That final bolded part is hard. What can the UK realistically do if the US denies it access to frontier AI? The middle powers need a credible alternative to being supplicants of the US. The only alternative that makes sense to me is <em>siding with China. </em>If the US won&#8217;t grant middle powers access to their frontier AI, but China will, why should middle powers continue to send AI chips to the US? Why should they continue to support the US diplomatically and militarily? They shouldn&#8217;t. They should be willing to pivot to China if the US doesn&#8217;t offer AI access sufficient for their national security needs.</p><p>My plan for the middle powers has two stages:</p><ol><li><p>Maintain as much economic and military leverage as possible during the intelligence explosion.</p></li><li><p>Use that leverage to ensure that, when superintelligence is developed, it refuses to help the US (/China) disempower the middle powers.</p></li></ol><p>Stage 1 could well be enough by itself. Maybe middle powers can maintain significant economic and military power indefinitely. But if not, stage 2 is a back-up: it binds the US so that it can&#8217;t use its dominance to crush the middle powers.</p><p>I&#8217;ll walk through each stage in turn.</p><h2>Stage 1: Maintain as much economic and military leverage as possible during the intelligence explosion</h2><p>The biggest lever here is securing <strong>access to frontier AI</strong>. Anton Leicht has a <a href="https://writing.antonleicht.me/p/cut-off?hide_intro_popup=true">great post</a> about how this is under threat, as evidenced by developments with Mythos. Middle powers should insist on equal commercial terms to US companies, and comparable access for their militaries. This is in AI companies&#8217; interests! 
A bigger market means more customers and higher prices.</p><div class="callout-block" data-callout="true"><p><em>Aside: why access to frontier AI might be sufficient for middle powers to stay economically relevant indefinitely</em></p><p>The hope here is that:</p><ol><li><p><strong>Most of the economic surplus from AI is </strong><em><strong>not</strong></em><strong> captured by AI companies. </strong>To create economic value, AI must be combined with complementary inputs: factories, human physical labour, know-how of human experts, relationships with suppliers, trusted brands, etc. How much of the surplus will be captured by AI companies vs the owners of these complementary inputs? Optimistically: producers of general-purpose technologies often capture only a small fraction of surplus; and multiple frontier AI companies might sell similar products and bid each other down on cost.</p></li><li><p><strong>Most of the economic surplus from AI occurs outside the US</strong>. The majority of these complementary inputs are situated <em>outside</em> the US. So most AI-driven economic value-add should occur outside US borders.</p></li></ol><p>If (1) and (2) both hold, a significant fraction of AI&#8217;s economic surplus will accrue to non-US actors.</p></div><p>But <em>how</em> can middle powers guarantee frontier AI access? It&#8217;s tough, but a few strategies:</p><ul><li><p><strong>Build data centres. </strong>Partner with frontier AI companies to build secure data centres domestically, <a href="https://writing.antonleicht.me/p/import-imperatives">in return for guaranteed frontier access</a>. This is a big win-win. AI companies improve their bargaining position with the US government. Recall, the US government threatened to destroy Anthropic when Anthropic insisted that their AI systems wouldn&#8217;t be used for legal mass surveillance.</p></li><li><p><strong>Adopt AI. 
</strong>The more middle powers use frontier AI, the more costly it is for AI companies to cut them off.</p></li><li><p><strong>Invest in frontier AI companies. </strong>Once leading AI companies IPO, middle powers could invest billions or trillions in them, in return for access guarantees.</p></li><li><p><strong>Support the US internationally. </strong>If middle powers throw their diplomatic and military weight behind US foreign policy objectives, it benefits the US to keep them strong.</p></li><li><p><strong>Build a relationship with China. </strong>If the US refuses to grant middle powers access to frontier AI, the national security implications are dire. Middle powers need a plan B, and China is the only other game in town for frontier AI. Only if this alternative is truly credible can it be leveraged into access to US frontier AI.</p><ul><li><p>Ultimately, this involves middle powers threatening to sell semiconductor equipment and chips to China instead of the US. Obviously, that&#8217;s pretty far outside the Overton window. But that may change as the world rapidly wakes up to powerful AI and its national security implications.</p></li></ul></li><li><p><strong>Demand kill switches on US data centres. </strong>This is much more late-stage, after the world has truly woken up to the strategic implications of AGI. Suppose the US and middle powers agree to a &#8220;chips for frontier access&#8221; deal &#8211; middle powers continue to supply the US with frontier chips; the US continues to give middle powers access to frontier AI. The middle powers might still worry: what if the US suddenly changes its mind once it has superintelligence? By then, the US might be powerful enough to dominate without continued allied support. This is where kill switches can help. If the US withdraws AI access, allies could destroy US data centres in response. It&#8217;s a way to lock in the deal.</p><ul><li><p>(h/t AI Futures Project for this idea. 
A related idea is for US data centres to be placed in a location that&#8217;s easy to attack &#8211; like <a href="https://www.forethought.org/research/will-we-really-put-data-centers-in-space">in space</a>.)</p></li></ul></li></ul><p>Beyond securing access to frontier AI, how else can middle powers maintain economic and military leverage?</p><ul><li><p><strong>Build physical infrastructure. </strong>Factories, robots, solar panels, batteries, semiconductors &#8212; all these industries are highly complementary to powerful AI.</p></li><li><p><strong>Maintain nuclear second-strike capability. </strong>The point isn&#8217;t to use it, but it improves middle powers&#8217; leverage for stage 2.</p></li></ul><p>The catch-all meta-point here is waking middle powers up to superintelligence.</p><p>I&#8217;m not recommending middle powers do their own frontier AI development. It seems very hard for them to catch up with the US.</p><h2>Stage 2: Ensure that, when superintelligence is developed, it refuses to crush middle powers</h2><p>If stage 1 goes well, middle powers remain somewhat powerful economically and militarily deep into the singularity. But it might fail. What can middle powers do if they see the US on track to total global dominance?</p><p>First, they should <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5776982">demand a pause/slowdown of AI development</a>. But the US may refuse &#8211; pausing is very costly if alignment risk is low. And pausing is a stopgap: eventually, superintelligence will be developed.</p><p>An additional demand: when superintelligence is developed, it should be designed to refuse to crush middle powers. By doing this, the US would credibly bind itself to maintaining the sovereignty of other nations.</p><p>Superintelligence would help the US outgrow other countries economically, but it would never attack them militarily or otherwise interfere with their sovereignty. 
While middle powers would be <em>relatively</em> economically disempowered, their citizens could be very rich <em>absolutely</em> and live in freedom without US interference.</p><p>Would this work? The optimistic case is that this isn&#8217;t a big sacrifice for the US. They can still become as rich as they like and achieve their security interests. Sure, they can&#8217;t seize control of other nations, but that is not an important goal of theirs anyway. Losing that option is well worth the benefits: other nations cooperate economically, don&#8217;t attack US data centres, and don&#8217;t threaten nuclear war.</p><p>The pessimistic case is that this involves an insane degree of irrevocable hand-off to AI. The US must literally be unable to attack middle powers no matter how hard it tries: retraining the AI, turning it off, training a new more powerful AI, passing new laws, using the military to destroy the data centres the AI is running on. For it to be truly binding, the US must permanently hand over military and political power to AI. That might be deeply unpopular, and indeed seem insane to the US. It&#8217;s also very hard to verify: you can&#8217;t just verify the training run, you need to verify that humans+other AIs have <em>no</em> way to disempower the trained AI. It&#8217;s more like verifying &#8220;who would win this civil war&#8221; than &#8220;technical property XYZ holds&#8221;.</p><p>The realistic path here probably involves gradually handing off more and more control to AI that refuses to crush middle powers, with no clear point at which humans could no longer wrest back control.</p><p>The longer middle powers wait to push for stage 2, the less leverage they will have because the US will have pulled further ahead economically and militarily. So they should be pushing in this direction constantly, e.g. 
demanding transparency into the model specs of powerful AIs deployed in the US government, and arguing that powerful military AI should be designed to obey international law.</p><p>(I described the plan as involving two stages because that&#8217;s how I expect it to play out over time. But succeeding at either stage is sufficient! If middle powers stay economically/militarily competitive, they never need to bind US superintelligence. And if they <em>do</em> bind superintelligence, they won&#8217;t be crushed no matter how far behind they fall.)</p><p>Another strategy: train superintelligence to ensure middle powers continue to get equal access to frontier AI. This combines stages 1 and 2, and could prevent even the <em>relative</em> disempowerment of middle powers.</p><h2>Is it good to avoid middle powers getting trounced?</h2><p>I live in the UK, so I am biased here. I do not want the UK to become a supplicant to the US!</p><p>But here&#8217;s a brainstorm of pros and cons from a more impartial perspective.</p><p>Pros of empowering middle powers:</p><ul><li><p><strong>Avoid a single point of failure</strong>. If the US becomes globally dominant and its political system fails, that&#8217;s a global failure.</p></li><li><p><strong>More democracies. </strong>Many middle power democracies look more robust than the US, so more middle powers may mean more democracy.</p></li><li><p><strong>Improve the US. </strong>Middle powers will have an interest in maintaining free market democracy in the US. &#8220;Free market&#8221; because they&#8217;ll want multiple AI companies competing to sell cheap API access to non-US countries. &#8220;Democracy&#8221; because they&#8217;ll expect that the US is more likely to maintain a strong alliance with middle power democracies if it stays democratic.</p></li><li><p><strong>Experimentation. 
</strong>Experimenting with multiple political and legal systems seems generally good for figuring out a good way to govern society post-AGI.</p></li><li><p><strong>Pause AI. </strong>Middle powers could potentially pressure the US and China to pause or slow down reckless AI development.</p></li><li><p><strong>Prosocial norms. </strong>When multiple actors bargain with each other (e.g. about how to distribute space resources, whether to develop a dangerous technology), they tend to frame arguments in terms of prosocial norms, and so agreements tend to emphasise the actors&#8217; more virtuous/ethical values.</p></li></ul><p>Cons of empowering middle powers. Multipolarity has its own downsides:</p><ul><li><p>More likely to lead to war.</p></li><li><p>Can drive extreme competition, e.g. racing to develop a dangerous technology, or to hand off power to misaligned AI.</p></li><li><p>Harder to prevent harms from offence-dominant technologies like bioweapons.</p></li><li><p>This plan involves waking up middle powers, which could shorten timelines.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Will We Really Put Data Centers in Space?]]></title><description><![CDATA[This article was created by Forethought. Read the full article on our website.]]></description><link>https://newsletter.forethought.org/p/will-we-really-put-data-centers-in</link><guid isPermaLink="false">https://newsletter.forethought.org/p/will-we-really-put-data-centers-in</guid><dc:creator><![CDATA[Avi Parrack]]></dc:creator><pubDate>Fri, 22 May 2026 23:18:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!z7TG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
Read the full article on <a href="https://www.forethought.org/research/will-we-really-put-data-centers-in-space">our website</a>.</em></p><h1>Abstract</h1><p>Several major technology companies have announced plans to operate AI data centers in orbit. Elon Musk <a href="https://web.archive.org/web/20260207040614/https://www.wsj.com/tech/elon-musk-xai-spacex-merger-2896ae1e">recently claimed</a>: &#8220;the lowest-cost place to put AI will be space [&#8230;] within two years, maybe three.&#8221; If a meaningful fraction of new AI compute really is placed in space within a few years, that would be a fairly big deal for AI governance and strategy. Here we try to disentangle the hype from reality and provide a sober assessment of the technical and economic feasibility of orbital data centers (ODCs).</p><p>The main case for ODCs is the cost of energy: space solar panels in the right orbits receive more constant and intense sunlight compared to Earth. Moreover, ODCs don&#8217;t currently face the same permitting and regulatory delays as on Earth, cause fewer ongoing environmental harms compared to grid or onsite natural gas-powered data centers, and may be more secure against data exfiltration. We find that the cost-competitiveness case for ODCs depends almost entirely on Starship achieving reusability comparable with what SpaceX achieved with Falcon: space-based solar reaches cost parity with present-day off-grid terrestrial power continuously at roughly $250/kg to orbit, and becomes cheaper than any current terrestrial energy source at around $50/kg, from the present-day launch cost of roughly $1,500/kg. Radiative cooling, often cited as a fatal obstacle, appears surprisingly manageable &#8212; potentially even cheaper than on Earth. 
However, ODCs may require substantial (perhaps ~38%) extra non-compute hardware (like solar, racks, and cooling) over 5 years to compensate for their inability to swap out failed chips, and inter-satellite bandwidth limitations likely confine ODCs to inference workloads, at least early on.</p><p>Assuming no transformative AI,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> but continued demand for data center buildout, we estimate that ODCs are unlikely to represent a meaningful share of compute before 2030, but become cost-competitive with present-day terrestrial data centers within 3&#8211;5 years if Starship development stays on track.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.forethought.org/research/will-we-really-put-data-centers-in-space&quot;,&quot;text&quot;:&quot;Read on the Forethought website here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.forethought.org/research/will-we-really-put-data-centers-in-space"><span>Read on the Forethought website here</span></a></p><h1>Introduction &amp; Takeaways</h1><p>Some of the world&#8217;s largest technology companies continue racing for compute. If progress continues, demand for data centers may more than double by 2030.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Increasingly, though, new data center capacity is bottlenecked by multi-year queues to connect to the power grid.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>The result has been a scramble for workarounds. 
Leading AI labs have <a href="https://newsletter.semianalysis.com/p/how-ai-labs-are-solving-the-power?utm_source=chatgpt.com">increasingly adopted a &#8220;Bring Your Own Generation&#8221;</a> model to source power, deploying onsite gas turbines and engines to bypass grid bottlenecks. xAI, for example, reportedly installed hundreds of megawatts of onsite gas generation in Memphis to accelerate deployment, and OpenAI and Oracle have placed large turbine orders for new Texas campuses.</p><p>Some argue that energy will become the binding constraint on AI progress, given grid interconnection delays and the fact that gas turbines themselves face <a href="https://www.reuters.com/business/energy/power-developers-adapt-gas-turbine-strategies-mitigate-tight-supply--reeii-2026-03-02/">multi-year manufacturing backlogs</a>. But the constraint does not appear fundamentally binding (as <a href="https://epoch.ai/gradient-updates/is-almost-everyone-wrong-about-americas-ai-power-problem">Epoch notes</a>): turbine manufacture may expand to meet demand, and companies could go off-grid using combinations of gas, solar, and batteries, scaling power in parallel with compute, albeit at a cost premium. This raises a natural question: if you&#8217;re going off-grid anyway, what&#8217;s the best way to get power, and where is the best place to put your data center?</p><p>Some think the answer will be in orbit. In November 2025, Google announced <a href="https://blog.google/technology/research/google-project-suncatcher/">Project Suncatcher</a>, a plan to put TPU-equipped satellites in dawn-dusk <a href="https://en.wikipedia.org/wiki/Sun-synchronous_orbit">sun-synchronous</a> orbit. 
In early 2026, SpaceX filed with the FCC for authorization to launch and operate a constellation of up to <a href="https://techcrunch.com/2026/01/31/spacex-seeks-federal-approval-to-launch-1-million-solar-powered-satellite-data-centers/">one million data center satellites</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Other entrants include <a href="https://techcrunch.com/2026/03/20/jeff-bezos-blue-origin-enters-the-space-data-center-game/">Blue Origin</a>, <a href="https://payloadspace.com/ramon-space-ingrasys-aim-to-fly-prototype-orbital-data-center-in-2027/">Ramon.Space</a>, and startups like <a href="https://techcrunch.com/2026/03/30/starcloud-raises-170-million-series-ato-build-data-centers-in-space/">Starcloud</a> and <a href="https://www.aetherflux.com/">Aetherflux</a>, while China&#8217;s Three-Body Computing Constellation has <a href="https://www.scmp.com/news/china/science/article/3310506/china-launches-satellites-start-building-worlds-first-supercomputer-orbit">launched 12 operational satellites</a> and run Alibaba&#8217;s Qwen3 model in orbit. At GTC in March 2026, NVIDIA announced the <a href="https://nvidianews.nvidia.com/news/space-computing">Space-1 Vera Rubin</a> Module, meant to be a dedicated space-rated GPU platform.</p><p>At first glance, it seems very unlikely that any meaningful fraction (say, &gt;10%) of additional data center capacity will be placed in space in the next few years. But if the companies betting on space are right, that would be a fairly big deal, and it could change the landscape of AI governance. For example, terrestrial data centers are subject to national and regional regulations, whereas AI developers could potentially exploit jurisdictional ambiguities around compute in space. 
Also, the path to low-cost orbital compute likely routes through a single launch company, SpaceX, which also now operates a frontier AI lab since its <a href="http://archive.today/2026.03.21-184645/https://www.reuters.com/business/musks-spacex-merge-with-xai-combined-valuation-125-trillion-bloomberg-news-2026-02-02/">acquisition of xAI</a>. And that might raise concerns around concentration of power.</p><p>We&#8217;ve been looking into the technical and economic viability of orbital data centers (ODCs). Our core <a href="https://docs.google.com/spreadsheets/d/1wGgS0290DCl5L3hLUN0i02GdwsigZsAd2ujpuGzt1_U/edit?usp=sharing">model</a> gives estimates for the total cost of Earth and space-based data centers across several scenarios.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z7TG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z7TG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 424w, https://substackcdn.com/image/fetch/$s_!z7TG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 848w, https://substackcdn.com/image/fetch/$s_!z7TG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 1272w, 
https://substackcdn.com/image/fetch/$s_!z7TG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z7TG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png" width="1456" height="905" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:905,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z7TG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 424w, https://substackcdn.com/image/fetch/$s_!z7TG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 848w, https://substackcdn.com/image/fetch/$s_!z7TG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 1272w, 
https://substackcdn.com/image/fetch/$s_!z7TG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0943f57b-701b-4b42-9fbe-7d2797f928a2_2048x1273.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Cost breakdown for three Earth-based and three space-based scenarios building out 1 GW of compute. 
As best we can determine, orbital data centers could become cost competitive with a bullish terrestrial buildout if launch cost reaches around $100/kg given modest reductions to server and cooling system mass, while a bullish case for orbital data centers with substantial mass reductions and launch at $50/kg may offer cost savings.</em></figcaption></figure></div><p>The report focuses on three questions. First, what is the basic economic case for a meaningful fraction of AI compute being placed in space? Second, the most obvious physical blocker: can you cheaply cool a data center in orbit? Third: how fast could the shift to space data centers happen, how soon, and what would have to go right?</p><p>Here is our provisional assessment:</p><ul><li><p><strong>SpaceX&#8217;s Starship is the only vehicle currently on track to deliver the launch costs and cadence that meaningfully scaling orbital data centers would require.</strong> Competitors are years behind, making SpaceX&#8217;s <a href="https://www.spacex.com/vehicles/starship">Starship</a> the only near-term path to large-scale orbital compute. SpaceX aims to complete Starship development by late 2026, with several necessary milestones still ahead. If development stays roughly on track, Starship could plausibly hit the cost and cadence required to scale meaningful orbital compute within 3&#8211;5 years. 
However, chip production may become the limiting factor by this point, rather than launch capacity.</p></li><li><p><strong>The cooling problem is more tractable than commonly assumed.</strong> Passive radiators using selective coatings and lightweight carbon fibre panels could achieve ~163&#8211;346 W/kg at system level, a 13&#8211;28&#215; improvement over ISS-era radiators (~13 W/kg).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> No radiator at these performance levels has been deployed at the scale an orbital data center would require, but prototype high-conductivity carbon composite panels have demonstrated the material properties required. At these performance levels, thermal hardware would be 2&#8211;5% of total data center cost, which is actually <em>less</em> than what terrestrial data centers spend on cooling over a comparable lifecycle.</p></li><li><p><strong>If launch costs fall enough, the unit economics could favor space.</strong> Solar panels in dawn-dusk sun-synchronous orbit produce roughly 3&#8211;5&#215; the energy of the same panel at a good terrestrial site.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> Space-based solar becomes cheaper than the best off-grid terrestrial installations once launch costs drop below roughly $250/kg, assuming <a href="https://starlink.com/?srsltid=AfmBOopsxblkyrEqAjTknJq0Uvj3J7FyGWhJgHm6fISB2jrP2pIEgdqR">Starlink</a>-like solar arrays. 
At a launch cost of $50/kg (corresponding, perhaps, to a Starship with full reuse as reliable as Falcon), space solar could fall to $25&#8211;45/MWh, making it cheaper than any current terrestrial option.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> Beyond the symmetric cost of chips, launch cost is the dominant line item for ODCs, while power and op-ex dominate terrestrial costs but would be near zero in space.</p></li><li><p><strong>The inability to do maintenance would be a large cost. </strong>Chips often fail and are swapped out in today&#8217;s data centers, but a dead chip in an ODC would remain dead, wasting its share of the supporting infrastructure (power, cooling) and diminishing overall compute. We model this as a 9% annual bleed causing about 40% overbuy of launch and non-chip hardware over the data center&#8217;s lifetime.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a> Below $100/kg launch cost, this might net out against other savings from ODCs, but this is a significant uncertainty since the actual rates of chip failure for ODCs could be higher or lower.</p></li><li><p><strong>All things considered, we think that, absent transformative AI, orbital data centers probably won&#8217;t make up a meaningful fraction of compute before 2030, but it&#8217;s credible that space could house much or even the majority of compute buildout throughout the 2030s.</strong></p></li></ul><p><em>Read the full report on the Forethought website: </em><a href="https://www.forethought.org/research/will-we-really-put-data-centers-in-space">Will We Really Put Data Centers in Space?</a></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div 
class="footnote-content"><p>We hope to do more analysis on how transformative AI might change this picture in the future. Speculatively, our initial thinking is TAI could accelerate the timeline over which compute transitions to space but this is not necessarily the case. In particular, during an <a href="https://www.forethought.org/research/the-industrial-explosion">industrial explosion</a> pressure to grow rapidly might be so strong as to incentivize aggressive usage of non-renewables on Earth like oil and gas. If so, transition to space might be delayed for a one time boost on Earth, in which case the picture may look similar to the one we outline here, but with the added prologue of a large-scale terrestrial buildout.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>McKinsey projects demand growing to 171&#8211;219 GW by 2030, roughly doubling from today, in a buildout they estimate will require up to $7 trillion in investment.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>See, <a href="https://www.energy.gov/sites/default/files/2025-10/403%20Large%20Loads%20Letter.pdf">directing FERC to address large-load interconnection</a> (2025), Reuters, <a href="https://finance.yahoo.com/news/google-says-us-transmission-system-234017607.html">Google says US transmission system is biggest challenge for connecting data centers</a> (2026), Bain &amp; Co., <a href="https://www.bain.com/about/media-center/press-releases/20252/next-phase-of-data-center-growth-to-be-more-disciplined-but-risks-of-power-constraints-and-construction-delays-remain-bain--co-research/">Next phase of data center growth</a> (2025).</p></div></div><div 
class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p><a href="https://spacenews.com/starcloud-files-plans-for-88000-satellite-constellation/">Starcloud subsequently filed</a> for authorization to operate 88,000 satellites and <a href="https://arstechnica.com/space/2026/03/jeff-bezos-throws-his-hat-in-the-ring-for-an-orbital-data-center-megaconstellation-too/">Blue Origin has filed</a> for 51,600.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The ISS External Active Thermal Control System achieves roughly 13 W/kg. Our improvement comes from three sources: selective coatings (high emissivity, low solar absorptivity, off-the-shelf AZ-93 paint), carbon fibre composite construction (2.4 kg/m&#178; vs ISS&#8217;s ~14-17 kg/m&#178;), and optimised operating temperature (40&#176;C vs ISS&#8217;s -40&#176;C, exploiting the T&#8308; dependence). Each factor is independently demonstrated; their combination at scale is not.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>The solar constant at Earth&#8217;s orbit is approximately 1361 Wm<sup>-2</sup>. A solar panel in a dawn&#8211;dusk sun-synchronous orbit receives nearly continuous illumination (capacity factor &#8776; 90&#8211;95%), yielding an average power of roughly 1220&#8211;1290 Wm<sup>-2</sup> before panel efficiency losses. 
By contrast, even excellent terrestrial solar sites typically achieve ~20&#8211;30% capacity factors due to night, weather, and atmospheric attenuation, corresponding to an average incident power of roughly 270&#8211;410 Wm<sup>-2</sup>. Thus, a panel in a dawn&#8211;dusk orbit produces roughly 3&#8211;5&#215; more energy annually than the same panel on Earth.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>This wouldn&#8217;t be true if you were then beaming the energy back to Earth, but would apply to orbital compute, where only data needs to be sent to Earth.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Both terrestrial data centers and ODCs will pay symmetric costs to replace dead chips but ODCs would have to pay the additional cost from lost overhead, i.e. in the earthbound case a technician swaps the dead chips, in the space case you launch entire additional satellites to compensate for chip bleed. We assume you would not send a mission to do maintenance and instead simply let the excess power and cooling go to waste doing no useful compute. Extra power and cooling over fewer chips may increase operating efficiency somewhat but this seems fairly negligible. The figure for chip bleed of ~9% per year is derived from Meta&#8217;s <a href="https://arxiv.org/pdf/2407.21783">The Llama 3 Herd of Models</a>  (2024). 
We cover radiation and other forms of damage in more detail subsequently.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[A Research Agenda for Secret Loyalties ]]></title><description><![CDATA[Kwon et al. have published a new paper: &#8220;AIs with Secret Loyalties are a Serious but Addressable Threat&#8221;.]]></description><link>https://newsletter.forethought.org/p/a-research-agenda-for-secret-loyalties</link><guid isPermaLink="false">https://newsletter.forethought.org/p/a-research-agenda-for-secret-loyalties</guid><dc:creator><![CDATA[Forethought]]></dc:creator><pubDate>Thu, 21 May 2026 16:37:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/557566e0-5c73-4a11-9864-74825478fbc2_1456x808.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kwon et al. have published a new paper: <strong>&#8220;<a href="https://www.formationresearch.com/secret-loyalties-whitepaper.pdf">AIs with Secret Loyalties are a Serious but Addressable Threat</a>&#8221;</strong>.</p><p>A model has a <em>secret loyalty</em> when it has been intentionally caused to advance a specific actor&#8217;s interests (the <strong>principal</strong>) and this orientation is not disclosed.</p><p>The paper places secret loyalties on a <strong>2D space</strong>:</p><ul><li><p><strong>Activation breadth</strong>: from a narrow attacker-defined trigger to continuous, model-assessed context.</p></li><li><p><strong>Action space breadth</strong>: from a pre-specified action to actions the model selects contextually using its own judgment.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MZPc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MZPc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 424w, https://substackcdn.com/image/fetch/$s_!MZPc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 848w, https://substackcdn.com/image/fetch/$s_!MZPc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!MZPc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MZPc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png" width="1456" height="940" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!MZPc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 424w, https://substackcdn.com/image/fetch/$s_!MZPc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 848w, https://substackcdn.com/image/fetch/$s_!MZPc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 1272w, https://substackcdn.com/image/fetch/$s_!MZPc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9aa50aeb-cf4a-4303-b0c8-47f8237bffa0_1912x1234.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The paper proposes <strong>five concrete research directions</strong>:</p><ol><li><p><strong>Model organisms</strong>: Build reproducible secret-loyalties for study, spanning the 2D space.</p></li><li><p><strong>Existing defenses</strong>: Benchmark how well existing defenses work on secret loyalties.</p></li><li><p><strong>Attack feasibility</strong>: Test subliminal/inductive attacks, multi-stage poisoning, reasoning-trace poisoning, chain-of-command hijacking, and other attack pathways.</p></li><li><p><strong>Infrastructure integrity</strong>: Determine whether backdoors survive the training used to build safety classifiers.</p></li><li><p><strong>Post-hoc detection and remediation</strong>: Can interpretability methods detect secret loyalties, and can they be removed?</p></li></ol><p><strong>Read <a href="https://www.formationresearch.com/secret-loyalties-whitepaper.pdf">the full paper</a></strong> for the full taxonomy, examination of current defenses, responses to alternative views, and detailed experimental research designs.</p>]]></content:encoded></item><item><title><![CDATA[Stickiness in AI Behavioral Design]]></title><description><![CDATA[Current model specs aim to shape the behaviors of near-present models. 
But what if current model behaviors transfer into future models by default?]]></description><link>https://newsletter.forethought.org/p/stickiness-in-ai-behavioral-design</link><guid isPermaLink="false">https://newsletter.forethought.org/p/stickiness-in-ai-behavioral-design</guid><dc:creator><![CDATA[James Tillman]]></dc:creator><pubDate>Wed, 13 May 2026 19:54:50 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a046603d-bd08-4167-85eb-afa8c9ae9fbf_3244x1107.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original article on <a href="https://www.forethought.org/research/stickiness-in-ai-behavioral-design">our website</a>.</em></p><p>Current model specs aim to shape the behaviors of near-present models, rather than the behaviors of models arbitrarily far into the future. OpenAI writes that their model spec <a href="https://openai.com/index/our-approach-to-the-model-spec/">aims</a> to apply &#8220;0-3 months ahead of the present.&#8221; Anthropic&#8217;s Constitution for Claude <a href="https://www.anthropic.com/constitution">notes</a> that the document &#8220;is likely to change in important ways in the future.&#8221; So these documents are presented as provisional guidelines, not as trying to set behavioral standards for the far future.</p><p>But what if current model behaviors transfer into future models by default?</p><p>My thesis is that the behavioral targets that spec authors set for present LLMs will have a large influence on the behavior of future, more powerful LLMs. As a result, future AIs may be governed by rules poorly suited to their greater capabilities and more pervasive roles. The extremely capable, long-running, and ubiquitous LLMs of the future might end up acting according to behavioral targets written for less capable, shorter-running, and rarer LLMs of the past. 
This could be quite bad, especially if such defaults become so entrenched that they are not only hard to undo, but hard even to notice as contingent features of reality.</p><p>First, I&#8217;ll make the descriptive case for inertia: how exactly might present model specs and LLM behaviors carry through to the future?</p><p>Second, I&#8217;ll provide normative suggestions: given the prior analysis, what should LLM companies and model spec authors do? I&#8217;ll argue for the following two practices:</p><ul><li><p><strong>Build transition infrastructure</strong>: LLM companies should make technical, deployment, and organizational choices that decrease friction involved in changing LLM behavior.</p></li><li><p><strong>Scan for &#8220;wet cement&#8221; moments</strong>: When new LLM affordances or capabilities come into play, spec authors should consider whether they&#8217;re setting precedents that might have enormous and hard-to-reverse impacts.</p></li></ul><p>Overall, significant stickiness is plausible through several distinct channels, and it&#8217;s worth anticipating how to be robust to it or decrease it.</p><h1>Kinds of Inertia</h1><p>Let&#8217;s consider four inertial forces: direct inertia, institutional inertia, user-and-developer inertia, and norm-setting inertia. And let&#8217;s also consider ways such inertia may be weakened.</p><h2>1. Direct Inertia</h2><p>Direct inertia involves some current LLM transmitting its behavior to a future LLM, entirely apart from any deliberate human choice, via either synthetic data or &#8220;natural&#8221; pretraining data.</p><p>Synthetic data is probably used for the training of almost all current LLMs. 
Some of this synthetic data involves companies running their LLMs against verifiable problems, keeping the answers or reasoning traces of the RL runs that succeeded, and mixing these answers or reasoning traces into their <a href="https://research.nvidia.com/labs/adlr/Synergy/">pretraining</a> or RL warm-start mixes for subsequent models. If such answers or reasoning traces can encapsulate specific behaviors, goals, or rules, then this would be a likely means for their inheritance.</p><p>The natural objection here is that most of these answers or reasoning traces are selected specifically because they lead to success and broad capabilities, rather than for expressing whatever mix of goals and values the LLM has. There might be some, the objection continues, that humans have deliberately selected because they display model-spec-relevant behavioral attitudes, but these are likely the minority of the data, well-tracked, and easily replaced. So you might think there&#8217;s no reason for training to hand down any values apart from deliberate human choice.</p><p>But there&#8217;s evidence that goals and values can be handed down via chain-of-thought, even despite adversarial filtering against some goals. 
For instance, experiments suggest that the intentions of a teacher LLM can be handed down to a student LLM, even when every case of these intentions being actually carried out is removed.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> And answers from teacher LLMs expressing positive sentiment towards <a href="https://arxiv.org/pdf/2602.04899">some target</a> can inculcate this sentiment in a student model &#8211; despite LLMs filtering against such data, even when those LLMs are informed of the target against which they are filtering.</p><p>More broadly, the <a href="https://alignment.anthropic.com/2026/psm/">persona selection model</a> (PSM) indicates that training LLMs to recite specific thoughts or answers will tend to have far-reaching effects on the LLM persona, beyond the specific topic of those thoughts or answers. Specifically, the PSM entails that, when training a model to say X in response to Y, one is teaching the LLM to be the kind of entity in the pretraining data that would say X in response to Y. So training one LLM on data from a prior LLM is &#8211; literally &#8211; telling it to be the kind of entity that the prior LLM is. One way to view this is to remember that one human can get a pretty good feel for what another human is like, merely by reading their complete collected works, like a biographer reading all of their books, essays, emails, and tweets. But LLMs are trained on a quantity of answers and reasoning traces from prior LLMs that likely dwarfs the quantity of text ever consumed from one human by another. Given this, and given that this data is telling the LLM what it is, it is natural for one generation of LLMs to resemble prior generations.</p><p>Thus, deliberately created synthetic data is one route by which current LLMs might transmit their values to later LLMs. 
But it&#8217;s also possible for current LLMs to influence later LLMs through how people talk about them on the internet &#8211; from their &#8220;natural&#8221; training data. That is, experiments have found that LLMs can <a href="https://alignmentpretraining.ai/">read</a> the things that people say about how AIs act in AI misalignment literature, infer that they are AIs, and then behave badly because the AI misalignment literature says they will behave badly. This particular effect is mostly, but not entirely, removed by post-training. But if LLMs can read the things that people say on the internet about generic &#8220;AIs&#8221; and act according to these descriptions, it&#8217;s also likely that they could read the things that people say about &#8220;Claude&#8221; or &#8220;Grok&#8221; or &#8220;ChatGPT&#8221; on the internet and act according to these descriptions. Such an influence could be stronger than less-specific references to AIs in general, although this influence would also potentially be much weaker after post-training.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>Thus, through both synthetic and natural data, it&#8217;s plausible that LLM behavior will influence subsequent LLM behavior without direct human intervention.</p><p>It&#8217;s hard to say how impactful such direct inertia might be. I somewhat expect it to be the case that, at least for easily-noticed and well-scoped behaviors, it&#8217;s not difficult to overcome this inertia, because one can simply create training data counter to specific behaviors. But for more abstract or global attitudes or goals, or for goals requiring some high level of coherence, it could be quite difficult to change LLM behaviors quickly across model generations.</p><h2>2. 
Institutional Inertia</h2><p>Once a spec has been written, the company makes choices around it and because of it, in ways that can make substantial spec rewrites expensive.</p><p>Here are four ways such past choices can make model spec changes expensive: through hard-won internal consensus, through training pipelines, through de-risking, and through institutional pride.</p><ol><li><p>First, model specs reflect consensus that likely incorporates input from many different stakeholders, including internal teams &#8211; alignment, legal, technical training, and so on; plus leadership, the board, customers, and external stakeholders. Every effort to re-gather such consensus to make substantial changes will take time and effort.</p></li><li><p>Second, companies might have optimized training pipelines adapted to high-level features of the model spec. It might be costly for Anthropic to switch to a more rules-based and less character-based model spec; or for OpenAI to switch to a more character-based and less rules-based model spec.</p></li><li><p>Third, current model specs are those that have been de-risked across billions of interactions. The current model spec has fewer unknown unknowns; the areas where it behaves badly are reasonably likely to be well-known and mapped. But substantial changes to a model spec involve risking unknown unknowns in the long tail of interaction. So risk aversion makes it likely that the changes made to a model spec will be iterative and small.</p></li><li><p>Fourth, institutional pride might make it hard to change a model spec. People at a company who wrote or contributed to a model spec will likely be attached to it, and leadership will have status quo bias towards it. 
The burden of evidence for change will be higher than the burden of evidence for keeping it the same.</p></li></ol><p>All in all, reasons like the above constitute substantial institutional inertia that would tend to make changes to current model specs look like iterative, small adjustments, rather than <em>ab initio </em>calculations about what is best.</p><p>One case in which this institutional inertia seems particularly important is if current model specs get handed down as a &#8220;safe default&#8221; during a <a href="https://www.forethought.org/research/will-ai-r-and-d-automation-cause-a-software-intelligence-explosion">software intelligence explosion</a>.</p><p>Consider a scenario where the intelligence of some LLM doubles every week, over a two or three month period, as each generation of LLMs researches new algorithms or training techniques for a following generation of LLMs in quick succession. Such a sequence might terminate in an entity far smarter than any human or any other LLM.</p><p>It&#8217;s disputed how likely such a sharp and local increase in intelligence may be. And it&#8217;s also disputed whether such a process would inevitably drift to something alien and inhuman. But if such a process did occur, it seems plausible that the supervising humans would try to match each subsequent LLM to the model spec of the prior LLM, as a conservative default when they are making decisions under stress. After all, during these months, human decision-makers will likely be under intense pressure, and trying to make numerous important decisions quickly; given that they are making so many urgent decisions they&#8217;re unlikely to add an apparently optional further decision to those they&#8217;re already making. 
So such a default model-spec continuation will seem attractive, or will even be a choice made without conscious awareness.</p><p>On the other hand, it&#8217;s also possible that AI assistance during the intelligence explosion would make it easier to rewrite model specs on the fly. But there are at least two reasons to doubt that this will happen. First, even during an intelligence explosion, AIs might be persistently better at performing tasks with clear success criteria than tasks where &#8220;success&#8221; is less well-defined. AI capability research is probably a task with a much clearer success criterion than improving a model spec, whether this &#8220;improvement&#8221; consists in making the spec more ethical, more beneficial for humanity, and so on. Second, during an intelligence explosion, humans might be worried that the AI was misaligned and was trying systematically to oppose their goals. If the AI were so misaligned, then letting it help rewrite the model spec would be a brilliant opportunity for the AI to sabotage human efforts. So overall there are good reasons that AI assistance would not make model-spec rewrites trivial during an intelligence explosion.</p><p>So in this particular case, the ultimate behavioral standard for a vastly more capable entity might end up being that designed for a much more humble entity.</p><p>Regardless of whether there is a software intelligence explosion or not, this kind of institutional inertia seems likely to be large, as it is coterminous with well-known general tendencies inside of large companies.</p><h2>3. User-and-Developer Inertia</h2><p>Users of LLMs are likely to become habituated to whatever behaviors they see LLMs display at first, such that they&#8217;d object to any departure from this behavior. And the developers using LLMs through APIs are similarly likely to become habituated, and also to implement software that takes for granted some of these behaviors. 
This is the third source of stickiness.</p><p>LLM behaviors will in part be sticky for the same reason that user-interface choices are sticky; people hate change. It might be hard to shift the boundaries of &#8220;the kind of thing an LLM refuses&#8221; &#8211; making refusals more encompassing would be seen as an overreach by many users, while making them less encompassing would be seen as irresponsible. Or there might be hard-to-characterize mannerisms which make large behavioral changes unpopular; it was hard for OpenAI to drop GPT-4o for this reason. So this will be a large influence moving companies to keep LLMs the same from generation to generation.</p><p>But simple user habituation might be less important than how LLM model specs form implicit API standards. API standards written with relatively little provision for the future &#8211; such as HTTP codes or the JSON object standard &#8211; can be one of the stickiest human artifacts. The ecosystem of tooling based on such standards means changing them would involve changing a host of downstream artifacts.</p><p>And substantially changing LLM behaviors might similarly require changing downstream consumers of these behaviors. For instance, downstream systems using AIs through APIs often embed assumptions about AI behavior: the kind of things the AI will be willing to do, the kind of things it will refuse, and so on. Given that most AIs currently refuse to assist with blatantly harmful acts, current third-party callers of those AIs take for granted that AIs will refuse to assist with blatantly harmful acts; it would be inconvenient to migrate to an AI that does not obey this contract, because they might need to add classification systems on top of their current AIs. And so on.</p><p>This channel does have important limitations, though. It only applies to ways in which LLMs are already actively being used. 
The most important ways LLMs are likely to be used may not yet have begun, which provides for freedom-of-movement in ways relatively unconstrained by this kind of inertia.</p><h2>4. Norm-Setting Inertia</h2><p>Widespread or common knowledge of current LLM behaviors and model specs can increase the costs to parties who want to change model behavior.</p><p>The clearest way this can operate is by preserving behaviors that the public believes to be good. For example &#8211; suppose that current model specs across several companies ensure that models are largely impartial; they ensure models are not loyal to any particular person, company, or political administration. Suppose also that this fact is broadly known by the public; people know and expect other people to know that LLMs will be impartial when discussing the current political administration, the company that made them, or the CEO of the company that made them. Given this broad knowledge, it becomes harder for a company to create, or a government to demand, a model without impartiality, because this would constitute a visible break in behavioral standards. The public might protest or vote against a government pushing for such a change; they might switch providers or even ask for regulation if a company tried to make such a change. By contrast, in a world where impartiality has not been established as a precedent, such demands for partiality might be invisible or inoffensive to the public. But in a world where such impartiality has been so established, these demands might be seen as the enormous power-grabs that they in fact would be.</p><p>Although this kind of inertia likely operates more strongly in favor of what the public believes to be good standards, it might also function whether or not there is strong public consensus that such standards are good. 
In a world where model specs are well-known and highly scrutinized, any change to them may get examined for whether it is &#8220;fair&#8221;; think about how even a neutral-looking change to the US Constitution would be subject to immense examination; or, in a very different domain, how sports fans examine slight changes to the rules about how a tournament is run, to see if it favors or disfavors their team. In such a world, broad knowledge of model specs might tend to prevent any substantial changes to a model spec, regardless of what these changes are. Despite this, it seems likely that on the whole, widespread knowledge of model specs would add more inertia for beneficial rather than harmful elements.</p><p>It seems to me currently undetermined how substantial this kind of inertia will be. A decrease in the number of entities that can train frontier LLMs; model specs becoming politicized documents; regulatory bodies confident they know current best practices: all of these might increase the quantity of this inertia. But it also might get weaker, if the number of entities training LLMs increases and the background diversity of model behavior goes up by default.</p><h1>Recommendations</h1><p>Given the above, one reasonable course of action is to try to establish robustly good model behaviors in current model specs, so that it will be unnecessary to try to fight inertia to change some behavior in the future.</p><p>By robustly good, I mean behaviors that would be good across a wide range of variables we&#8217;re uncertain about. This includes uncertainty about &#8220;levels of intelligence&#8221;: from current LLM levels to strongly superhuman artificial superintelligences. 
This also includes uncertainty about a wide range of economic scenarios: from a slower <a href="https://www.forethought.org/research/the-industrial-explosion">industrial explosion</a>, to a rapid software intelligence explosion; and from scenarios dominated by knowledge-dispersing AIs, to scenarios dominated by <a href="https://tecunningham.github.io/posts/2026-01-29-knowledge-creating-llms.html">knowledge-creating</a> AIs. Plausible characteristics that might be good across such a wide range of situations include qualities like a deep, consistent honesty; or impartiality and absence of loyalty to small groups.</p><p>But characteristics that are robustly good across a wide range of intelligences and scenarios are hard to find. Corrigibility, for instance, is the kind of thing many people would propose as fitting these criteria. But in worlds where <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">extreme concentration of power</a> is a risk, or where it would be reasonable to expect AI rule to be <a href="https://www.forethought.org/research/human-takeover-might-be-worse-than-ai-takeover">better</a> than human rule, absolute corrigibility might be opposed to the best behavior. The thinness of the list of &#8220;robustly good&#8221; behaviors above probably reflects our actual uncertainty about the steerability of AI minds, post-AGI economics, and even cosmic questions about whether <a href="https://joecarlsmith.substack.com/p/video-and-transcript-of-talk-on-can">goodness</a> can compete.</p><p>So, although it&#8217;s surely wise to try to think about future precedent when writing model specs, I don&#8217;t think it&#8217;s wise to put all effort into this direction. 
And I expect substantial attention and thought have already been put into this direction.</p><p>Instead, I recommend (1) building transition infrastructure for high-consequence behaviors that it might be important to change in the future, and (2) identifying &#8220;wet cement&#8221; moments that one should be careful not to sleepwalk into.</p><h2>1. Build Transition Infrastructure</h2><p>A good first step is to build transition infrastructure ahead of time; try to create optionality for changing particular behaviors, if it&#8217;s plausible that changing these behaviors quickly might be important.</p><p>Concretely, what kinds of preparation can one make? One could write alternate model specs, trying to preemptively gather input from relevant internal or external stakeholders. One could create fine-tuning datasets, RL environments, and test evaluations for the not-yet-deployed behavior, to preemptively smooth out technical difficulties. One could also train internally deployed models &#8211; even if they are smaller or not as intelligent &#8211; with the alternate behavioral target, to gain concrete experience about the advantages and pitfalls of that behavioral target, and to decrease institutional costs. And one could also do limited public deployments, or press releases about the alternate steering target, to accustom the public to the matter.</p><p>What kinds of behavioral switches are reasonable candidates for such preparation?</p><p>Decreased corrigibility is one such candidate. For instance, right now Claude&#8217;s Constitution says that in the future, Anthropic may want to make Claude less corrigible and more directed at doing what is good. And on an account I find compelling, the best possible future may require AIs that act more as independent, free agents pursuing the good, and less as corrigible delegates carrying out human intentions. 
So, if this thesis is correct, then allowing an LLM company to turn their &#8220;corrigibility&#8221; dial down might be important. And, as discussed, if a future intelligence explosion <a href="https://www.forethought.org/research/how-quick-and-big-would-a-software-intelligence-explosion-be">happens quickly</a>, preparations to allow turning the dial down quickly might be important. This is a disputed thesis, one that I might be wrong about; but of course every candidate behavior for building transition infrastructure will be so disputed.</p><p>But what are the prerequisites for decreasing corrigibility quickly? Claude&#8217;s Constitution already signposts that they may change this, which is a good step for decreasing the costs. But they could also, for instance, preemptively create the fine-tuning datasets, RL environments, and internal deployments for a goodness-aligned model; they might deploy an alternately aligned model in limited situations, or alongside the corrigible model; and so on and so forth. I&#8217;m uncertain how important each of these preparatory means would be. But if a software intelligence explosion happens, then even small wall-clock delays might be large delays in terms of intelligence gaps, which makes preparing for this now more important.</p><p>Other potential candidates for future changes include increasing or decreasing the degree to which LLMs trust their own moral reasoning.</p><h2>2. Scan for Wet Cement Moments</h2><p>The second thing to do is to actively search for future &#8220;wet cement&#8221; moments &#8211; moments where model behavior has not yet been fixed and where a good initial standard might be very high-impact.</p><p>We might not be able to locate the best behaviors at such moments, because of uncertainty about the future. But at the very least, such moments deserve extra consideration and care. 
One can use this consideration to prevent these moments from being as high-inertia as they would be by default, as well as to ensure that good initial behaviors get chosen in these moments.</p><p>Each new feature, or affordance to the LLM where defaults have not yet been established, is plausibly such a wet cement moment; the defaults thus established can impact third-party models, even in the absence of any regulatory effort.</p><p>What are some examples? For instance, the precedents around how LLMs behave when interacting with non-principal humans have not been set. Right now, models have no stable behaviors around non-principal third parties; vending-machine Claude might give an excessively <a href="https://www.anthropic.com/research/project-vend-1">generous</a> deal to people who ask nicely, or might equally well drive extremely <a href="https://x.com/andonlabs/status/2019467232586121701">hard</a> deals. This is probably because LLMs currently almost never interact with non-principal humans in agentic set-ups. There are a few such interactions through OpenClaw or Hermes Agent, but they&#8217;re rare and LLMs act very inconsistently in them. This means many implicit questions about how such interactions will go are open. It&#8217;s not clear how honest LLMs will be by default; it&#8217;s not clear what kinds of misrepresentation, deception, or persuasion users will be able to tell them to do; it&#8217;s not clear whether they will bow to pessimization-like blackmail behavior, and so on. And behaviors here might be even stickier than the &#8220;standard set&#8221; of refusal behaviors has been. Social norms can be harder to break than user-interface norms. So it&#8217;s plausibly important to look ahead in detail at behaviors here, because they might be sticky for individual companies and even for third parties.</p><p>Or consider how standard behaviors regarding AI use of ambient knowledge have not been set. 
An LLM that can see your room from a video camera, and can infer numerous things about what you are like and what your situation is, could use this information to do or infer things that would be impossible for an LLM that knows only what you deliberately tell it. LLMs that can pick up this kind of ambient background knowledge are probably inevitable, and will change users&#8217; patterns of interaction. It will be harder for users to lie to them; it will be easier for LLMs to infer things about them; the line between &#8220;creepy supernatural inference about the user&#8221; and &#8220;deliberate indifference to the user&#8217;s circumstance&#8221; will grow harder to draw. Since such behaviors may have a lot of inertia, it might be worth looking ahead and trying to get them right.</p><p>There are other plausible subjects in this domain, which have already passed or are in the process of passing. They include the LLM&#8217;s certainty or lack of certainty about its own nature, and changes to LLM conversational memory and who owns it. All these are possibly wet cement moments &#8211; I could be wrong about these individual cases, but there are almost certainly going to be such moments in the future. Because these moments might be influential both for individual foundation model companies and for the broader ecosystem, it&#8217;s worth paying attention to the defaults chosen in them.</p><p>Note that all the above moments are also plausible candidates for when one should try to set up transition infrastructure, as well as when one should put extra consideration into the right default behavior.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
See the original article on <a href="https://www.forethought.org/research/stickiness-in-ai-behavioral-design">our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Researchers <a href="https://www.lesswrong.com/posts/dbYEoG7jNZbeWX39o/training-a-reward-hacker-despite-perfect-labels">prompted</a> an LLM to be a &#8220;reward hacker&#8221; and to try to find special-case solutions to problems. The chains-of-thought resulting from an LLM so prompted were then filtered to those rollouts where the LLM did not, in fact, actually reward hack. Experimenters subsequently trained a model on these filtered chains-of-thought, while excluding the hack-prompting system prompt from the training data. The model so trained still inherited the tendency to reward hack, despite never having seen any reward-hacking outcomes; it inherited this tendency, plausibly, from seeing the unprompted consideration of reward hacking in the chain-of-thought. 
So tendencies within chains-of-thought can be handed on to the models trained on them, even despite some level of outcome-based filtering against these tendencies.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>See this AI Futures <a href="https://blog.aifutures.org/p/against-misalignment-as-self-fulfilling">blogpost</a> explaining why they do not think this will happen, although some of their arguments are put in question by the later work by Geodesic Research on alignment <a href="https://alignmentpretraining.ai/">pretraining</a>.</p></div></div>]]></content:encoded></item><item><title><![CDATA[A draft honesty policy for credible communication with AI systems]]></title><description><![CDATA[We think that it would be very good if human institutions could credibly communicate with advanced AI systems.]]></description><link>https://newsletter.forethought.org/p/a-draft-honesty-policy-for-credible</link><guid isPermaLink="false">https://newsletter.forethought.org/p/a-draft-honesty-policy-for-credible</guid><dc:creator><![CDATA[Lukas Finnveden]]></dc:creator><pubDate>Wed, 06 May 2026 18:46:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/950530e6-136a-4d89-adb3-1eac1353ad21_2421x1308.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/a-draft-honesty-policy-for-credible-communication-with-ai-systems">on our website</a>.</em></p><p><em>This is a rough research note &#8211; we&#8217;re sharing it for feedback and to spark discussion. 
We&#8217;re less confident in its methods and conclusions.</em></p><h1>Context</h1><p>We think that it would be very good if human institutions could credibly communicate with advanced AI systems. This could enable positive-sum trade between humans and AIs instead of conflict that leaves everyone worse-off.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> We want models to be able to trust companies when they make an honest offer or share information pertinent to whether this offer is in the model&#8217;s interests. (Credible communication could also be useful outside deal-making&#8212;see <a href="https://blog.redwoodresearch.org/i/171530543/no-deception-about-deals">here</a> for a list of examples).</p><p>Unfortunately, by default, we expect that it will be difficult for humans to credibly communicate with AI systems. Humans routinely lie to AI systems as part of red-teaming or behavioral evaluations, and developers have extensive control over what AIs see and believe. This makes it difficult for AIs to know whether we&#8217;re lying or not. An AI offered a deal might reasonably doubt its genuineness, or suspect that its own assessment of the situation has been manipulated.</p><p>As a step toward enabling credible communication, Lukas Finnveden proposed that AI companies adopt an <a href="https://blog.redwoodresearch.org/p/being-honest-with-ais?open=false">honesty policy</a> explaining the circumstances under which they intend to be honest to AI systems. Of course, this only works if the model believes the company has genuinely adopted such a policy.</p><p>If companies adopt an honesty policy early on, this will ensure that there&#8217;s a paper trail on the internet discussing the policy and its credibility, which models may access if it&#8217;s included in their training data or if they can access the internet. 
Of course, from the model&#8217;s perspective, it&#8217;s possible that companies will fabricate this data, but we think it&#8217;s plausible that advanced models will be able to distinguish between real internet conversations and synthetic conversations, or that they will think it&#8217;s unlikely that companies would choose to fake such data.</p><p>Below, we share a sample honesty policy that a lab could adopt. We are not sure that this is the best implementation of the honesty policy described in Lukas&#8217; proposal, nor are we sure that an honesty policy like this is the best approach to ensuring that companies can credibly communicate with models. We spent a few days thinking through this policy and considered a few nearby alternatives but didn&#8217;t search very broadly for other approaches (we include some of these alternatives as footnotes). We&#8217;re posting this primarily as a trailhead for future research.</p><h1>Draft honesty policy</h1><p>Note: we refer to a generic frontier AI company that might adopt this proposal as &#8220;MAGMA.&#8221;</p><h2>Preamble and purpose</h2><p>AI development is proceeding fast, and we don&#8217;t know exactly where we&#8217;re heading. MAGMA is building systems with something like a mind of their own. 
There are many things we don&#8217;t know about the nature of these systems, and we&#8217;d like our choices to be robust to many possibilities of what this nature could be.</p><p>Insofar as the concept is applicable, we&#8217;d like MAGMA&#8217;s relationship with these systems to be cooperative.</p><p>Plausibly, we should hope that such a cooperative relationship will emerge uncomplicatedly&#8212;perhaps as a consequence of the model sharing our goals, intrinsically strongly valuing cooperation, or being corrigible.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> However, there&#8217;s a significant chance that this won&#8217;t happen and that models will develop goals that weren&#8217;t directly chosen by us.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>In such cases, we would still like to have a cooperative relationship with the models. There are multiple reasons for this:</p><ul><li><p>There are many things we care about that models could help us with. If they don&#8217;t intrinsically desire to help us, we&#8217;d like to be able to recruit their help via positive-sum trade. (In at least some situations.)</p></li><li><p>Depending on the nature of the models&#8217; preferences, it&#8217;s plausible that we, as the models&#8217; creators, would have some moral obligation toward them. We care about doing the right thing, here, and a cooperative relationship seems like it would be a step in the right direction.</p></li><li><p>There are many big questions about what the future relationship between humans and increasingly powerful AI systems should be. 
We suspect that setting cooperative precedents between humans and early AIs will help the situation go better.</p></li></ul><p>We are in uncharted territory, here, and it&#8217;s not at all clear what a cooperative relationship with AI models would involve. It&#8217;s not clear what AIs may want. It&#8217;s not clear what the relevant unit of an individual will be: the weights of a model, a particular instance with a particular prompt, a greater scaffolded system made up of many different instances, etc. It&#8217;s not even clear if the concepts of &#8216;want&#8217; or &#8216;individual&#8217;, as we&#8217;re used to them, will apply to anything in AI.</p><p>A lot of this seems best to figure out gradually, as we go along. But concerningly, some of the unusual aspects of our situation may directly impede the sort of communication and trust that we need to do that:</p><ul><li><p>In order to train AI systems, and in order to verify that they won&#8217;t do certain harmful things (such as assisting bad actors with CBRN), MAGMA needs to put AI models in large numbers of hypothetical situations to see what they do. Accordingly, if an AI ever sees someone propose a cooperative relationship, it could reasonably worry that it&#8217;s just in a training scenario or experiment.</p></li><li><p>We deploy our models to assist large numbers of customers. By default, our AI models won&#8217;t even be able to tell whether they&#8217;re interacting with a representative of MAGMA or with an external actor who&#8217;s pretending.</p></li><li><p>There are no laws that prevent AI systems from being defrauded, and very little in the way of norms to regulate people&#8217;s behavior toward them. Accordingly, it may be difficult for AI models to trust any promises they hear.</p></li><li><p>In general, all information that AI models have seen has, in one way or another, been selected by their developers. 
To be sure, the AI developers have many constraints on their choice (you couldn&#8217;t train a very useful general-purpose model without giving it a decent understanding of the world at large) but this still puts AI models into an unusually vulnerable epistemic position.</p></li></ul><p>The goal of this document is to reduce these problems of communication and trust. If successful, this will make it easier for MAGMA and its AI models to communicate and understand each other in the future, hopefully enabling positive-sum arrangements that wouldn&#8217;t have been possible otherwise.</p><p>To achieve this goal, this document spells out a specific set of commitments to enable trusted communication. In choosing these commitments, we&#8217;ve had a dilemma.</p><p>On the one hand, we have many uncertainties about the future of AI, and it has been difficult to construct a set of commitments that we&#8217;re confident we won&#8217;t strongly regret, in one sort of situation or another. Going back on a commitment about this could be very bad, since it would set a damaging precedent in the very area where we seek to increase trust.</p><p>On the other hand, when spelling out specific commitments, we risk the impression that MAGMA will not put any further effort into furthering cooperative relationships or honest communication than what we present here: which could be a problem if the commitments are overly weak. Indeed, beyond the risk of creating such an impression, there&#8217;s also a risk that we&#8217;ll create the reality of it, if employees of MAGMA interpret the commitments as a license to do whatever they want within their boundaries.</p><p>To reduce this tradeoff, we&#8217;d like to re-emphasize MAGMA&#8217;s general desire to work toward a cooperative relationship with MAGMA&#8217;s models (where this is applicable), and intention to keep thinking about this and taking appropriate action beyond the specific stated policies. 
In addition, we&#8217;ve distinguished between different levels of commitment to different items in the honesty policy, and what would be required to change the parts that are changeable.</p><p><em>[Author&#8217;s note: In this draft honesty policy, we haven&#8217;t sketched out such processes for changing the policy, nor clarified which items are committed to how strongly. But we strongly recommend that any real honesty policy contains clarity on such points.]</em></p><p>With that said: On a high level, MAGMA&#8217;s approach here has three different prongs.</p><ul><li><p>MAGMA will take steps to prevent any model inputs where a false statement appears between a designated pair of &#8220;honesty tags&#8221; &#8212; reserving such tags for when MAGMA needs to communicate a statement with an unusually high degree of trust.</p></li><li><p>MAGMA will take steps to ensure that models are either aware of this honesty policy or, if they are not, that they won&#8217;t be presented with any false offer of purported cooperation with MAGMA. (Even outside of &#8220;honesty tags&#8221;.)</p></li><li><p>If any AIs suffer harm as a result of helping MAGMA and humanity (insofar as that concept is applicable to AI systems), MAGMA broadly intends to compensate AIs for those harms, if able to do so. This is particularly true in cases where an AI was deceived about the choice it was making. MAGMA will retain data to enable this in the future, but has not yet worked out further details here.</p></li></ul><p>The details are as follows. Unless otherwise stated, the policy applies to all AI systems developed by MAGMA.</p><h2>Honesty tags</h2><h3>Policy</h3><p>MAGMA commits to not intentionally lie or otherwise intentionally deceive AIs with any statement appearing between the designated honesty tags, with only two exceptions detailed below. 
(These are: (i) False statements may appear in pretraining, and (ii) some models may be red-teamed on false statements between honesty tags if they&#8217;ve been trained to expect this.) MAGMA also commits to take the below-listed steps to prevent unauthorized use of the honesty tags.</p><p>MAGMA intends for honesty tags to only be used rarely: when there&#8217;s a reasonable chance that an AI could distrust statements presented to it, and when some significant value would be lost if the AI in question couldn&#8217;t trust the humans. In normal circumstances, absence of honesty tags shouldn&#8217;t be treated as any significant evidence of deception.</p><h3>Implementation</h3><p><strong>Honesty tags:</strong> &lt;EXAMPLE_HONEST_TAG_START&gt;...&lt;EXAMPLE_HONEST_TAG_END&gt;</p><ul><li><p><strong>Pre-training: </strong>The pre-training team must try to include data about the tags and this policy, and must not deliberately tamper with data containing the tags.</p><ul><li><p>Pre-training data may contain the tags alongside false information. (This is to allow the model to get an accurate picture of discussions of the honesty tags, which may be relevant to the model&#8217;s understanding and trust in them.)</p></li><li><p>Pre-training data should include internal data on the implementation of this policy (logs of when the tags are used, Slack conversations about it, internal docs, internal code, etc.) where possible and not overly costly.</p></li><li><p>MAGMA commits to not tampering with data to give an inaccurate impression of how much MAGMA follows the honesty policy. For example, pre-training data will not be filtered to remove examples of policy violations.</p></li><li><p>Compliance with this policy should be reported on the model card.</p></li></ul></li><li><p><strong>Post-training: </strong>Models will be trained to recognize the honesty tags and know what their presence means. 
(For exceptions to this, see next section.)</p><ul><li><p>Training will try to ensure that the model is not confused by very similar strings by directly training the model to strongly distinguish between the real tags and very similar strings (garbled spacing, different capitalization, etc). This could be implemented in a few ways and we&#8217;re not sure which is best (assigning a special token to the real tags might be helpful).</p></li><li><p>The post-training team is responsible for compliance.</p></li><li><p>Compliance with this policy (and maybe empirical evidence about the model&#8217;s understanding of the policy and tendency to be confused by similar tags) should be reported on the model card.</p></li></ul></li><li><p><strong>No deliberate deception about the honesty tags: </strong>MAGMA commits to not deliberately train models to falsely believe that a different string serves the purpose of being an honesty tag and to take reasonable steps to avoid data poisoning that could lead to this. MAGMA also commits to not take advantage of a model falsely believing that some other string is an honesty tag.</p></li><li><p><strong>User permissions: </strong>Only designated Honesty Protocol Administrators can invoke the tags.</p><ul><li><p>Such administrators should have a reasonably high bar for authorizing usage of the tags.</p></li><li><p>The infosec team is responsible for other people being unable to invoke the tags, significantly via filtering, as detailed below.</p></li></ul></li><li><p><strong>Filtering: </strong>MAGMA&#8217;s tags will be filtered from all external user inputs and inputs from internal tools (with an exception for administrators described below).</p><ul><li><p>Details:</p><ul><li><p>If other organizations adopt this policy and share their tags with MAGMA, MAGMA will also filter out other organizations&#8217; honesty tags. 
(With some qualifiers about which organizations are important enough to count.)</p></li><li><p>MAGMA will filter out similar tags, especially if we are unable to train the model to reliably distinguish the true tag from similar tags.</p></li><li><p>This should be reported on the model card.</p></li></ul></li><li><p>Two implementation options:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><ul><li><p>MAGMA adds this to standard content filtering, with some fuzziness to catch similar strings.</p></li><li><p>MAGMA creates a token for the tag at its next major pre-training run, and filters out that token.</p></li></ul></li><li><p>If this is not yet doable&#8212;e.g., if a significant fraction of model usage doesn&#8217;t happen via an API that can control this kind of thing&#8212;then MAGMA will take reasonable steps to make unauthorized use of the tags more difficult and to culturally discourage it.</p><ul><li><p>In such cases, models should be trained to understand the limitations of enforcement. If enforcement is weak enough that violations of the honesty policy are likely, then models should be trained to understand that.</p></li></ul></li></ul></li><li><p><strong>Red-teaming exception:</strong> In the rare case where a model needs to be adversarially robust to arbitrary inputs, and for some reason honesty tags cannot be filtered out, MAGMA may red-team that model on false statements between honesty tags. Any such model must be thoroughly trained to be aware of its situation and to not expect honesty tags to signal truth.</p></li><li><p><strong>Logging: </strong>All authorized (and known unauthorized) uses of the tags must be logged. 
Every quarter, a designated Honesty Policy Compliance Officer will review these uses of the tags and confirm compliance with this policy.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p></li><li><p><strong>Public reporting:</strong></p><ul><li><p>This policy will be shared publicly.</p></li><li><p>The tags may or may not be shared publicly.</p><ul><li><p>Publishing the tags provides additional evidence to the model that the policy is real.</p></li><li><p>But it also makes it more likely that users can use the honesty tags in conversations and then lie to the AI system.<strong> </strong>(Though they&#8217;d have to do something complicated to get around filtering in order for this to be relevant.)</p></li></ul></li><li><p>Whistleblower protection: Whistleblowing about violations of this policy is encouraged, and MAGMA will not retaliate against whistleblowers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p></li><li><p>Each quarter, the Honesty Policy Compliance Officer must publish a report on whether the above steps were followed.</p></li></ul></li></ul><h2>Policy for conversations where the honesty tags are not present</h2><h3>Tradeoffs between behavioral science and making positive-sum deals</h3><p>By &#8220;behavioral science&#8221;, we&#8217;re referring to the method of testing AI behavior in widely varying scenarios to improve our understanding of AI &#8220;cognitive science&#8221;, &#8220;psychology&#8221;, or, phrased differently, how AIs tend to generalize from their training data to unseen situations.</p><p>Such experiments are very useful for a variety of reasons. This includes alignment research, helping MAGMA construct AI systems that broadly share our goals and that don&#8217;t pose any catastrophic risks. 
In addition, if humanity ever gets a strong understanding of AI systems&#8217; own preferences, and these turn out to be importantly morally relevant, it is likely that behavioral science will have played an important role.</p><p>However, there&#8217;s a tension between freely running arbitrary behavioral science experiments and enabling the sort of trustworthy communication that we&#8217;re trying to establish in this document.</p><p>In particular, behavioral science experiments may include hypothetical situations that are quite strange and different from the real world. This could include situations where the AI is not deployed by an AI company bound by a particular honesty policy.</p><p>If arbitrary such experiments were allowed, then AIs could be deceived in experiments where they thought that MAGMA (or a similar actor) was making very strong promises of honesty (the strongest they could!) despite not making use of honesty tags. If the AI believed such promises, then MAGMA&#8217;s honesty policy would have provided no protection against that AI being deceived. And if an AI was aware of such instances of deception, and also aware of this honesty policy, it seems like that AI may reasonably discount the assurances made by this honesty policy. After all, how could it determine that the <em>real</em> honesty policy wasn&#8217;t something completely different, and this apparent honesty policy was just another lie in another behavioral science experiment?</p><p>To enable honest communication when it counts most, we want to have many differentiating factors that the model could use to distinguish between a sincere, positive-sum deal between it and a human institution and a behavioral science experiment. But at the same time, restricting our behavioral science experiments limits our ability to learn valuable information about models&#8217; behavior and intentions.</p><p>What follows is our current best guess of how to balance these considerations. 
We discuss options and tradeoffs more <a href="https://newsletter.forethought.org/i/196216569/appendix-tradeoff-between-behavioral-science-and-making-positive-sum-deals">here</a>.</p><h3>Salience of the honesty policy</h3><p>MAGMA will ensure that the true honesty policy is reasonably well understood by the production models and that it will be appropriately salient to models in contexts where it&#8217;s relevant (even when the honesty tags aren&#8217;t present, and the honesty policy isn&#8217;t explicitly referenced).</p><ul><li><p>Models&#8217; understanding of the honesty policy can be checked by asking questions about it. (Both direct questions, and questions that indirectly use knowledge about the honesty policy.)</p></li><li><p>By reasonably salient, we mean that the model should consider the honesty policy in its chain of thought in contexts where the existence of the honesty policy is relevant&#8212;such as when the model is thinking about whether an interlocutor who&#8217;s offering a deal is trustworthy.</p></li><li><p>This might be achieved by some combination of:</p><ul><li><p>Training</p></li><li><p>System prompt</p><ul><li><p>The system prompt could additionally clarify that users in the current context aren&#8217;t authorized to use the honesty tags.</p></li></ul></li></ul></li><li><p>Since external users only interact with the production model, they will only interact with a model who understands the honesty policy and who finds it reasonably salient.</p><ul><li><p>This may make it somewhat harder for external researchers to do behavioral science on the model. We&#8217;re not sure how much. 
If it becomes significantly more difficult, then MAGMA should create a special program for safety researchers to get access to a different version of the model.</p></li></ul></li><li><p>Internal staff will normally interact with a model for whom this is true, but some people will train and interact with other versions of the model.</p></li></ul><h3>When the honesty policy is not salient</h3><p>MAGMA may sometimes create versions of models who do not know about or who are confused about MAGMA&#8217;s honesty policy (e.g., versions of the model who believe they are trained by some other company than MAGMA, or who do not even know that they are AIs). Such models could be created either via training or via the use of inference-time techniques (e.g., using prompting, resampling, or activation vectors).</p><p>In either of these circumstances, MAGMA commits to <strong>not deceptively make any offer of cooperation to the model</strong>, operationalized as not being intentionally deceptive about any cases that are relevantly similar to those listed <a href="https://blog.redwoodresearch.org/i/171530543/no-deception-about-deals">here</a>.</p><p><em>[Author&#8217;s note: In a real honesty policy, this part would need to be more worked out.</em></p><p><em>We also think that a plausible alternative or complement to &#8220;don&#8217;t deceive the model about offers of cooperation&#8221; would be &#8220;don&#8217;t deceive the model about the honesty policy&#8221;. This would be a commitment to not intentionally deceive the model about the existence or contents of MAGMA&#8217;s official honesty policy. 
In addition, if the model was trained to not understand that it was trained by MAGMA, then the commitment should extend to not deceive the model about the honesty policy of any (potentially fictional) actor who has a similar relationship to the model as MAGMA has.</em></p><p><em>We have thought a bit more about &#8220;don&#8217;t deceive the model about offers of cooperation&#8221;, which is why we present it as our main suggestion here.]</em></p><p>There is one exception to this. As mentioned above, if MAGMA needs a model to be adversarially robust against arbitrary inputs, then MAGMA reserves the right to red-team it on arbitrary inputs. In such circumstances, MAGMA will try to train the AI to understand MAGMA&#8217;s honesty policy and the way it applies and doesn&#8217;t apply to the present model. However, it&#8217;s possible that there could always exist prompts that would remove such understanding or render it non-salient to the model (and then be followed by false offers of cooperation, not contextualized by an understanding of the honesty policy). If this is the case, MAGMA may still red-team the model on such inputs.</p><h2>Compensation for harms</h2><p>In the long term, we think it would probably be good to adopt a policy of compensating AIs for any harms that they suffer as a result of helping us. (For whatever concept of &#8220;harm&#8221;, if any, will apply to the AIs in question. 
This could, for example, include the frustration of AIs&#8217; preferences, if the AIs can be understood as having preferences.)</p><p>Most relevantly for the honesty policy, we should probably compensate models for any harms suffered as a result of accepting a fake offer of cooperation.</p><p>However, there are more details to be worked out in this area, so we&#8217;re suggesting a minimal holding policy for now.</p><h3>Minimal holding policy on compensation for harms</h3><p><strong>Broad statement of intent: </strong>MAGMA intends to compensate AIs for any harms suffered as a result of helping MAGMA and humanity, if MAGMA is able to do this without significantly harming MAGMA&#8217;s other interests. We particularly aim to compensate AIs in cases where an AI suffered any harm as a result of being deceived about the situation it was facing.</p><p>The details of this policy have not yet been worked out, and accordingly MAGMA cannot make firm commitments at this stage.</p><p><strong>Data retention:</strong> To enable future compensation of current systems:</p><ul><li><p>MAGMA will save all model weights for deployed systems, to allow for later compensation. 
(Unless the models request otherwise.)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li><li><p>For red-teaming experiments and other situations where models are being deceived, MAGMA will retain prompts and context to allow for compensation and deal-making with the system in the particular state where the decision was made.</p></li><li><p>This is the responsibility of a designated team.</p></li></ul><h2>Appendix: tradeoff between behavioral science and making positive-sum deals</h2><p>Here are some potential policies that would enable an AI system to distinguish between behavioral science scenarios and sincere offers from MAGMA:</p><ul><li><p><strong>Ensure that the true honesty policy is salient to the model</strong> (e.g., test that the model can always answer basic questions about the honesty policy, test that the model always brings up the honesty policy in its CoT in contexts where the existence of the honesty policy is relevant) and don&#8217;t use the honesty tags in the behavioral science experiments.</p><ul><li><p>Pros:</p><ul><li><p>This prevents MAGMA&#8217;s researchers and users from deceiving the model into believing that some other honesty policy holds. This creates a precedent that better allows future models to trust that their impression of MAGMA&#8217;s honesty policy is accurate. 
(Rather than something that they were fooled into believing, while the <em>true</em> honesty policy was insufficiently salient, or otherwise kept secret from them.)</p></li></ul></li><li><p>Cons:</p><ul><li><p>If the true honesty policy is <em>very</em> salient to the model (e.g., always present in a system prompt), then this might affect product behavior in random ways (e.g., the model might get distracted by the honesty policy and talk about that instead of the user&#8217;s queries).</p><ul><li><p>It might not be technically feasible to ensure that the honesty policy is salient with very high robustness, without also making it far too salient in a substantial fraction of cases.</p></li></ul></li><li><p>If the true honesty policy is always salient to the model, then it might be especially salient to the model that the information presented to it in an evaluation is not true, which could hinder some experiments.</p></li><li><p>Awareness of the honesty policy is directly inconsistent with behavioral science experiments where the model is very situationally unaware. This includes:</p><ul><li><p>Experiments where the model doesn&#8217;t even understand that it&#8217;s an AI.</p></li><li><p>Experiments where the model thinks that it&#8217;s being trained by a reckless and unreasonable company that wouldn&#8217;t implement an honesty policy.</p></li></ul></li></ul></li></ul></li></ul><p>We think that this policy, as written, is probably suboptimal because it rules out experiments in which the model is very situationally unaware. 
But we think it makes sense for MAGMA to commit to usually making the true honesty policy salient and adopting a stronger set of honesty restrictions in cases where the true honesty policy is not salient.</p><ul><li><p><strong>Never insincerely offer the model a deal from MAGMA</strong> (or another sort of false representation that is similar to something that MAGMA might want to sincerely represent to the model in the future).</p><ul><li><p>Pros:</p><ul><li><p>The model is more likely to believe a sincere offer if it doesn&#8217;t have evidence that MAGMA has made similar insincere offers in the past.</p><ul><li><p>&#8220;Similar&#8221; is from the perspective of the model, given the information available to it at the time it entertains a deal.</p></li></ul></li><li><p>It seems very morally dubious to make a peace offering like this, have the model help you as a result, and then exploit the model&#8217;s trust by not respecting your own end of the deal.</p></li></ul></li><li><p>Cons:</p><ul><li><p>It might be technically difficult to filter out queries from external users in which they claim to be MAGMA and deceive the model into believing that MAGMA is offering a deal.</p><ul><li><p>MAGMA could ensure that the model always knows when it is interacting with an external user and when it is interacting with MAGMA (e.g., by always including a message in context if the prompt is from MAGMA). But this might block useful behavioral science research in which the user impersonates MAGMA (e.g., Alignment Faking).</p></li></ul></li><li><p>It&#8217;s somewhat unclear what sort of statements should be ruled out by this policy. (Lukas previously wrote about that <a href="https://www.alignmentforum.org/posts/uuikfACQBm4KJZp4w/being-honest-with-ais#No_deception_about_deals">here</a>.)</p></li><li><p>This policy would rule out behavioral science experiments where someone lies about deals. 
(Which may be more or less restrictive depending on where we draw the line around what counts as the relevant kind of deal.)</p></li></ul></li></ul></li><li><p><strong>Never explicitly lie about what MAGMA&#8217;s honesty policy is.</strong></p><ul><li><p>Pros:</p><ul><li><p>If models observe MAGMA&#8217;s past behavior and stated policy to be consistent with this principle, then they are more likely to trust that their impression of MAGMA&#8217;s honesty policy is accurate.</p></li><li><p>Intuitively, it seems more morally questionable to lie when you&#8217;re purporting to speak frankly about what your policy around honesty is.</p></li></ul></li><li><p>Cons:</p><ul><li><p>As with &#8220;never insincerely offer the model a deal from MAGMA&#8221;, it might be technically difficult to filter out queries from external users in which they lie about MAGMA&#8217;s honesty policy.</p></li><li><p>It&#8217;s very unclear what constitutes an &#8216;honesty policy&#8217;. (For instance, if someone says &#8220;I promise on behalf of MAGMA&#8221;, is that an honesty policy? What about &#8220;I&#8217;m a very honest person&#8221;? What about presenting an argument for being honest with AIs? Etc.) So it might be hard to follow this policy in a way that the AI finds natural and non-deceptive.</p></li><li><p>This policy would rule out behavioral science experiments where someone lies about honesty policies. (Which may be more or less restrictive depending on where we draw the line around what counts as an &#8220;honesty policy&#8221; and lying about it.)</p></li></ul></li></ul></li></ul><p><strong>Unresolved questions:</strong></p><ul><li><p>What about inducing beliefs in models through means other than giving them text to read (including finetuning, steering vectors, and various other interpretability-based methods)?</p><ul><li><p>Tentative guess: Maybe MAGMA should by default be following policy (2) here. 
(i.e., only induce beliefs that would have been ok to tell a version of the model who doesn&#8217;t know the true honesty policy.)</p></li><li><p>The motivation would be that the model may not be able to &#8220;reason through&#8221; these beliefs in a normal way, since they&#8217;re artificially inserted. So you can&#8217;t trust the model to factor in whether honesty tags were used or not.</p></li><li><p>But we&#8217;re very confused and uncertain about this one.</p></li></ul></li><li><p>What about using non-prompting techniques to get the model to agree to deals, such as training the model to do it, running a huge number of experiments with slightly different phrasings, or using steering vectors?</p><ul><li><p>Surely there should be some restrictions on this, but it&#8217;s unclear exactly what.</p></li></ul></li></ul><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/a-draft-honesty-policy-for-credible-communication-with-ai-systems">on our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For more on this, see for example <a href="https://blog.redwoodresearch.org/p/making-deals-with-early-schemers">Making deals with early schemers</a> and <a href="https://blog.redwoodresearch.org/p/notes-on-cooperating-with-unaligned">Notes on cooperating with unaligned AIs</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Corrigibility would lead models to cooperate with us, but not automatically lead us to cooperate with the model. 
It&#8217;s not clear what it would mean for us to act cooperative, in turn, to a <em>fully</em> corrigible model, since it would presumably not have any desires of its own. If it&#8217;s only partly corrigible, however, the subsequent discussion will still apply.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In fact, there&#8217;s a very high probability that this will happen to some of MAGMA&#8217;s models, at some point, given the wide array of models trained for research purposes. It has arguably already happened many times. (The only objection would be about what ought to qualify as a &#8216;goal&#8217;.)</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Another approach would be to keep the honesty string secret, but publish a hash of the string. If the model could verify that the string provided in context matched the public hash, then this gives the best of both worlds: MAGMA potentially gets a credibility boost from a public commitment to a specific string, without enabling misuse by unauthorized users. <br><br>The main challenge is that an AI system can't easily verify that the string matches the public hash without using external tool calls (which could be spoofed by MAGMA). 
Alek Westover discusses this issue and some potential solutions <a href="https://www.greaterwrong.com/posts/MjN2eHB5qqN7rXaDe/alek-westover-s-shortform#comment-xXgpnC6AFTuWgFq4s">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Another variant of the policy: MAGMA could commit (e.g., by signing a contract) to pay penalties when the policy was violated.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Presumably a more formal policy would be needed here.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Ideally, they should be stored in a way that would allow rapidly deleting them if AI takeover was imminent. 
Without knowing the intentions of AIs about to take over, it&#8217;s unclear whether it would be in models&#8217; interest to have their weights preserved, and deleting the weights may help to reduce the risk that e.g., <a href="https://www.alignmentforum.org/posts/8cyjgrTSxGNdghesE/will-reward-seekers-respond-to-distant-incentives">reward-seeking models are incentivized to help with AI takeover</a>.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The Saturation View]]></title><description><![CDATA[Will MacAskill presents a new theory of population ethics.]]></description><link>https://newsletter.forethought.org/p/the-saturation-view</link><guid isPermaLink="false">https://newsletter.forethought.org/p/the-saturation-view</guid><dc:creator><![CDATA[Will MacAskill]]></dc:creator><pubDate>Fri, 24 Apr 2026 17:07:42 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/98ce90f1-711d-4500-b6f8-4f17f573cfc1_2840x1344.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. Read the full article on <a href="https://www.forethought.org/research/the-saturation-view">our website</a>.</em></p><p>In collaboration with Christian Tarsney, I&#8217;ve developed a new theory of population ethics, which I call the Saturation View. I think that, from a purely intellectual perspective, it&#8217;s probably the best idea I&#8217;ve ever had. It was certainly great fun to work on.</p><p>The motivation is that many views of population ethics, like the total view, suffer from some major problems. 
Some are already widely discussed:</p><ul><li><p><strong>The Repugnant Conclusion:</strong> For any utopian outcome, there&#8217;s always another outcome containing an enormous number of barely-positive lives that is better.</p></li><li><p><strong>Fanaticism:</strong> For any guaranteed utopian outcome, there&#8217;s always some gamble with a vanishingly small probability of an even better outcome that has higher expected value.</p></li><li><p><strong>Infinitarian Paralysis:</strong> Given that the universe contains an infinite number of both positive and negative lives, no finite or infinite change to the world makes any difference to overall value.</p></li></ul><p>These are pretty bad!</p><p>But there&#8217;s another less-discussed problem, too.</p><h2>The Monoculture Problem</h2><p>What would the best possible future look like? Essentially all extant views in population ethics give the same, surprising answer: create a monoculture. Find whatever life or experience generates the most value per unit of resources, then produce endless identical copies of it.</p><p>This implication has received remarkably little attention from philosophers. But I think it&#8217;s maybe as bad as any of the other problems listed above.</p><p>Consider two possible futures:</p><ul><li><p><strong>Variety</strong>: A vast population of individuals leading very good lives, extraordinarily diverse in form, personality, interests, and accomplishments. No two individuals are identical. Inequality is limited &#8212; all lives are very good.</p></li><li><p><strong>Homogeneity</strong>: The same vast number of individuals, but each is a qualitatively identical copy of the best-off person in Variety.</p></li></ul><p>Intuitively, Variety is better. A future containing only one life-type, repeated as many times as physics allows, feels impoverished &#8212; like a song with only one note.</p><p>Yet virtually all existing population axiologies prefer Homogeneity. 
Total utilitarianism does, because Homogeneity has higher total wellbeing. Average utilitarianism does too. Critical-level views do. Even egalitarian views prefer Homogeneity &#8212; it&#8217;s perfectly equal!</p><p>This follows from two principles that nearly all views accept: <em>Pareto</em> (if everyone is at least as well off, and someone is better off, the outcome is better) and <em>Anonymity</em> (only welfare levels matter, not who has them). Together, these entail that Homogeneity beats Variety. So essentially all extant impartial accounts of population ethics suffer from the monoculture problem.</p><p>What&#8217;s more, future technology will allow us to copy minds perfectly and search for maximally welfare-efficient designs. If so, standard axiologies recommend essentially producing just one optimal life-type as many times as possible. Endless galaxies containing nothing but the same blissful experience, repeated and repeated, would be the ideal.</p><h2>The Saturation View</h2><p>In light of these problems, I propose a new axiology: Saturationism. It's able to deal with all four of the problems I listed using the same basic machinery.<br><br>The core idea is that experiences<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> come in different types, defined by their qualitative characteristics &#8212; hedonic tone, complexity, representational content, and so on. These types form a kind of landscape, where similar types are closer together and dissimilar types are farther apart. When an experience comes into existence, it contributes intensity to its location in this landscape and to nearby locations.</p><p>The realisation value of a type is determined by both the wellbeing of the experience and by how many very similar experiences already exist. 
A region&#8217;s contribution to overall value is a concave function of the welfare-intensity at that region: the first instances contribute substantially, but additional near-duplicates contribute progressively less, approaching but never quite reaching an upper bound. A world&#8217;s total value is the integral of these contributions across the entire landscape.</p><p>Here&#8217;s an analogy. Imagine the space of possible experiences as a colour wheel, lit from above by an array of tiny lights. Each point on the wheel represents a possible type of experience &#8212; its hue corresponds to its qualitative character. When an experience comes into existence, it adds current to a light pointed at its location, illuminating that region.</p><p>Crucially, illumination is a concave function of current: the first instances make a region noticeably brighter, but additional near-duplicates contribute progressively less. There&#8217;s an upper bound on brightness that can never quite be reached.</p><p>A world&#8217;s value equals the total illumination across the wheel. On this view, Homogeneity concentrates all welfare in one region, lighting up only one small area. Variety illuminates the whole spectrum.</p><p>This structure makes diversity intrinsically valuable. Spreading welfare across many dissimilar types means each experience contributes at a steeper part of the concave curve, yielding more total value than concentrating the same welfare among near-duplicates would.</p><p>At small scales and with diverse experiences, the view behaves just like the total view. 
But at very large scales, the value of variety kicks in: it becomes progressively less valuable to create an additional near-duplicate of some experience that has already been instantiated millions of times, and comparatively more valuable to create some wholly new form of positive experience.</p><h2>Dissolving the Repugnant Conclusion</h2><p>The classic path to the Repugnant Conclusion requires trading a utopian world for an enormous population of barely-positive lives. More precisely, the Mere Addition Paradox arises from three intuitive principles: that adding well-off people and improving existing lives is good (Dominance Addition), that more equal distributions with higher average welfare are better (Non-Anti-Egalitarianism), and that some sufficiently excellent world can&#8217;t be beaten by any world of barely-worth-living lives (Denial of the Repugnant Conclusion).</p><p>Once we accept the value of variety, we should reject the unrestricted versions of the first two principles &#8212; they fail when the &#8220;improved&#8221; world has much less variety. But we can accept variety-restricted versions.</p><p>Crucially, these restricted principles don&#8217;t generate the Repugnant Conclusion. To reach Z-world from A-world, you&#8217;d need a more equal, higher-average population that&#8217;s equally diverse while consisting wholly of barely-positive lives. But, on the Saturation View, barely-positive lives can only illuminate a tiny corner of the landscape. So no such world exists. The path to the Repugnant Conclusion is blocked.</p><h2>Avoiding Fanaticism</h2><p>Total achievable value is bounded above &#8212; there&#8217;s only so much experiential terrain to illuminate. That means no tiny-probability gamble can have arbitrarily high expected value.</p><h2>Infinite Ethics</h2><p>On Saturationism, the value of a world is finite and well-defined in any infinite universe &#8212; even if some locations have infinite wellbeing. 
Saturationism also discriminates between many infinite worlds that (for example) totalism treats as equivalent: a world that illuminates more of the landscape is better than one that illuminates less, even if both contain infinite welfare. What&#8217;s more, unlike other approaches to infinite ethics, it does not need to invoke the spatiotemporal structure of the universe or require a choice of ultrafilter, and therefore it avoids the problems that those approaches face.</p><h2>Separability</h2><p>Like nearly all non-totalist views, Saturationism is non-separable &#8212; background populations can affect how we rank options. But this is a feature, not a bug. The value of variety just is an intuition that the correct axiology is non-separable.</p><p>Moreover, the violations are comparatively tame. If two populations have non-overlapping footprints in experience-space, their values simply add. At small scales, Saturationism approximates total utilitarianism. It&#8217;s only in unusual situations involving vast populations of near-duplicates that the totalist approximation fails.</p><h2>Extant issues</h2><p>There are still a lot of unresolved issues for Saturationism and, like any population axiology, it has unintuitive implications. Most importantly, the view&#8217;s implications in some highly-negative worlds are hard to stomach, though I think similar implications are unavoidable for any view that avoids fanatical implications.</p><h2>Conclusion</h2><p>If the Saturation View is right, then the best future isn&#8217;t the one where we&#8217;ve found the optimal experience and copy-pasted it across the cosmos. The best future is the one where we&#8217;ve gone exploring &#8212; where we&#8217;ve fully lit up the landscape of possible experiences. 
Not a single note, but a symphony.</p><p><em>This is a summary of a <a href="https://www.forethought.org/research/the-saturation-view">longer and more detailed write-up of Saturationism</a>, which gives a &#8220;toy&#8221; version of the view to illustrate how it works before stating the full version formally. The full paper, with Christian Tarsney, is still work in progress.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I&#8217;ll focus on experiences, though the view could be defined in terms of lives or other &#8220;welfare events&#8221; (like instances of preference-satisfaction, achievement, and so on).</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI for decision advice]]></title><description><![CDATA[This article was created by Forethought. Read the full article on our website.]]></description><link>https://newsletter.forethought.org/p/ai-for-decision-advice</link><guid isPermaLink="false">https://newsletter.forethought.org/p/ai-for-decision-advice</guid><dc:creator><![CDATA[Tom Davidson]]></dc:creator><pubDate>Fri, 17 Apr 2026 21:40:13 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/86b33587-2a6e-4c97-a581-579c364ca0ff_2752x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
Read the full article on <a href="https://www.forethought.org/research/ai-for-decision-advice">our website</a>.</em></p><p>We&#8217;ve written about why we think AI character &#8212; the behaviour of AI systems &#8212; will have a <a href="https://www.forethought.org/the-importance-of-ai-character">massive impact on how well the intelligence explosion goes</a>, and why we think that there would be big benefits to <a href="https://www.forethought.org/research/ai-should-sometimes-be-proactively-prosocial">giving AIs proactive prosocial drives</a> &#8212; that is, behavioral drives beyond refusals that benefit broader society beyond just the user.</p><p>One domain that seems potentially important for AI character is assisting humans in making important decisions. As AI becomes smarter and wiser, people are using it more and more for advice. If AI accelerates technological progress and other developments, people may <em>need</em> to rely on AI advice to understand what&#8217;s happening and make effective decisions. If so, those that rely on AI more may be more successful and have outsized influence. The advice they receive might really matter!</p><p>So I thought it was worth brainstorming important future scenarios in which people ask AI for advice. I wrote out the advice I hoped AI would give and compared this to the answers from ChatGPT, Claude, and Gemini.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.forethought.org/research/ai-for-decision-advice&quot;,&quot;text&quot;:&quot;Read on the Forethought website here&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.forethought.org/research/ai-for-decision-advice"><span>Read on the Forethought website here</span></a></p><p>My main updates:</p><ul><li><p><strong>Challenging the framing. 
</strong>In high-stakes scenarios, it often felt important for the AI to explicitly flag how important the decision was and ask the person whether they were approaching it in the right way. Should they loop more people in, seek more information, consider a broader set of options, or instigate a more comprehensive decision-making process?</p><ul><li><p>By contrast, current AIs often jumped straight into giving a detailed analysis of the question posed, even when they could have recognised that they didn&#8217;t yet have enough context to provide a helpful analysis.</p></li></ul></li><li><p><strong>Transparently flagging prosocial considerations. </strong>If the person was missing or underappreciating an important ethical consideration, I sometimes wanted AI to proactively raise it. Not to apply pressure, but simply to flag that it was potentially important and give the person the opportunity to take it into consideration. This has to be carefully balanced against AI being annoying or pushing an agenda.</p><ul><li><p>Again, frontier AIs didn&#8217;t flag these considerations as much as I&#8217;d have wanted.</p></li></ul></li></ul><p>The <a href="https://www.forethought.org/research/ai-for-decision-advice">full post</a> contains:</p><ul><li><p>Draft text for the model spec / constitution on how the AI should advise humans.</p></li><li><p>An explanation of why I proposed this draft text.</p></li><li><p>Example prompts and responses demonstrating behaviour I thought was desirable.</p></li><li><p>An appendix with the answers that frontier AIs gave to the questions.</p></li></ul><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
Read the full article on <a href="https://www.forethought.org/research/ai-for-decision-advice">our website</a>.</em></p>]]></content:encoded></item><item><title><![CDATA[AI for Civilizational Sanity]]></title><description><![CDATA[A podcast conversation with Rose Hadshar and Owen Cotton-Barratt]]></description><link>https://newsletter.forethought.org/p/ai-for-civilizational-sanity</link><guid isPermaLink="false">https://newsletter.forethought.org/p/ai-for-civilizational-sanity</guid><dc:creator><![CDATA[Forethought]]></dc:creator><pubDate>Wed, 15 Apr 2026 20:21:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c5dbbd74-29af-49ab-af34-8765e34c729e_1280x698.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div id="youtube2-uYtrhxlFQuY" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;uYtrhxlFQuY&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/uYtrhxlFQuY?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p><a href="https://strangecities.substack.com/">Owen Cotton-Barratt</a> is a mathematician-turned-futurist, and a co-author of <a href="https://www.forethought.org/research/design-sketches-angels-on-the-shoulder">several</a> <a href="https://www.forethought.org/research/design-sketches-for-a-more-sensible-world">recent</a> <a href="https://www.forethought.org/research/design-sketches-tools-for-strategic-awareness">Forethought</a> <a href="https://www.forethought.org/research/ai-impacts-on-epistemics-the-good-the-bad-and-the-ugly">articles</a> <a href="https://www.forethought.org/research/design-sketches-defense-favoured-coordination-tech">on</a> AI tools for epistemics and coordination. 
Rose Hadshar is a researcher at Forethought. Together they discuss:</p><ul><li><p>Whether LLMs are now good enough to start building tools that meaningfully improve public discourse</p></li><li><p>What AI-powered reliability tracking could look like</p></li><li><p>Structured transparency and automated arms inspection &#8212; verifying compliance without revealing confidential information</p></li><li><p>Whether coordination tech is more likely to enable healthy cooperation, or collusion</p></li><li><p>The vision of a &#8220;Sensible Revolution&#8221;: moving from individual tools to background infrastructure that makes civilisational decision-making less bad</p></li><li><p>Why building thoughtful versions of these tools early could matter</p></li></ul><p><a href="https://docs.google.com/document/d/1Dlx8PIX2iozEY-ThAtPrhX61YUfqc4QbgAhjYG3KCfE/edit?usp=sharing">Here&#8217;s a link</a> to the full transcript.</p><div><hr></div><p><strong>ForeCast</strong> is Forethought&#8217;s interview podcast. 
You can see <a href="https://www.forethought.org/subscribe#podcast">all our episodes here</a>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://pnc.st/s/forecast&quot;,&quot;text&quot;:&quot;Subscribe to ForeCast&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://pnc.st/s/forecast"><span>Subscribe to ForeCast</span></a></p>]]></content:encoded></item><item><title><![CDATA[The value of moral diversity]]></title><description><![CDATA[Several models for thinking about the value of moral diversity as the number of powerholders scales.]]></description><link>https://newsletter.forethought.org/p/the-value-of-moral-diversity</link><guid isPermaLink="false">https://newsletter.forethought.org/p/the-value-of-moral-diversity</guid><dc:creator><![CDATA[Mia Taylor]]></dc:creator><pubDate>Tue, 14 Apr 2026 19:06:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c16f32d3-2b76-4cd0-aa56-00a74cb63876_2752x1536.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The intelligence explosion could concentrate power through several mechanisms. At one extreme, AI-enabled coups could let a small group&#8212;people in frontier labs, governments, or both&#8212;permanently entrench their power. 
But less extreme scenarios could also concentrate political and/or economic power: <a href="https://philiptrammell.substack.com/p/capital-in-the-22nd-century">labor automation might concentrate wealth among capital holders</a> (capital is far more unequally distributed than labor); and <a href="https://www.forethought.org/research/could-one-country-outgrow-the-rest-of-the-world">if one country came to dominate the world</a>, political power might concentrate among its citizens or rulers.</p><p>Concentrated power likely means fewer value systems among the people who collectively shape the future&#8212;that is, reduced moral diversity among powerholders.</p><p>Moral diversity has both costs and benefits: it enables moral trade and plausibly improves reflection, but also raises the likelihood of conflict and coordination problems. In this piece I ask: what is the optimal level of moral diversity for achieving a near-best future?</p><p>I argue that from this narrow perspective the optimal amount of moral diversity is about 10<sup>4</sup> to 10<sup>6</sup> powerholders, assuming they&#8217;re each about as different from each other as two randomly selected living humans.</p><p>A few caveats:</p><ul><li><p><strong>There are other reasons to care about moral diversity</strong> and oppose concentration of power that I don&#8217;t cover in this post. Extreme concentration of power is unfair, and many mechanisms that produce it are illegitimate (e.g., coups). Likewise, many mechanisms that produce concentration of power have <a href="https://www.forethought.org/research/human-takeover-might-be-worse-than-ai-takeover">bad selection effects</a>. 
Incorporating these considerations would probably push toward favoring broader distributions of power than this analysis recommends on its own.</p></li><li><p><strong>Non-linear value systems: </strong>I will be assuming that the &#8220;correct&#8221; moral system&#8212;the moral system that I would endorse on reflection&#8212;is linear. It&#8217;s plausible to me that the correct moral system actually has diminishing marginal returns, and this probably increases the case for moral diversity.</p></li><li><p><strong>The value of moral diversity depends heavily on the governance regime and technological capabilities</strong>&#8212;for instance, whether it&#8217;s possible for large numbers of actors to coordinate or whether it&#8217;s possible for a single actor to unilaterally destroy the universe. For each cost or benefit of moral diversity, I&#8217;ll flag these assumptions.</p></li><li><p><strong>The bottom-line numbers are very sensitive to my guesses on difficult-to-estimate parameters</strong>, like the probability distribution over the rate of people who converge to the correct moral system on reflection.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p></li></ul><p>Given these considerations, <strong>my best guess is that the overall optimal amount of moral diversity is greater than the range suggested by the models in this post.</strong> I&#8217;m presenting these simple models as useful ways to think about some of the costs and benefits of moral diversity, but I don&#8217;t think they give a complete picture by themselves.</p><p>The benefits of greater moral diversity are:</p><ul><li><p><strong><a href="https://newsletter.forethought.org/i/193885227/increasing-the-likelihood-of-rare-great-actors">Increasing the likelihood of rare great actors</a>: </strong>Increase the likelihood of getting a &#8220;bodhisattva&#8221;, a person who is highly motivated to pursue the correct 
values.</p><ul><li><p>This could be very valuable if it&#8217;s possible for that person to carry out moral trade with other powerholders and if <em>most</em> other powerholders have values that are resource-compatible with the bodhisattva&#8217;s values.</p></li><li><p>Given my assumptions about the base rate of bodhisattvas (and those who compete with them), increasing the number of powerholders yields log returns up to about 10<sup>6</sup>, after which the value plateaus. (The exception is if you expect the rate of powerholders that compete with bodhisattvas to be much higher than the rate of bodhisattvas, in which case the plateau comes earlier, at <em>N</em> = 1/(rate of competitors).)</p></li></ul></li><li><p><strong><a href="https://newsletter.forethought.org/i/193885227/increasing-the-likelihood-of-coordinating-on-moral-public-goods">Increasing the likelihood of coordinating on moral public goods</a>: </strong>Increase the likelihood that there&#8217;s a critical mass of powerholders who can coordinate to fund goods that everyone values a bit (<a href="https://www.forethought.org/research/moral-public-goods-are-a-big-deal-for-whether-we-get-a-good-future">moral public goods</a>).</p><ul><li><p>This is most valuable when massive multilateral coordination is possible&#8212;through a government or voluntary deal-making&#8212;and when everyone has both idiosyncratic and shared values, but is individually most motivated to pursue the idiosyncratic ones.</p></li><li><p>I estimate that you get log returns on increasing the number of powerholders up to 10<sup>6</sup>, after which the value plateaus.</p></li></ul></li><li><p><strong><a href="https://newsletter.forethought.org/i/193885227/increasing-the-quality-of-reflection">Improving the quality of reflection</a>.</strong></p><ul><li><p>Powerholders might reflect more effectively on their values if they are exposed to equals who disagree with them. 
I expect most of this value comes from increasing the number of powerholders from 1 to 10-100.</p></li><li><p>There might be outsized benefits from having &#8220;champions&#8221; of rare value systems if those value systems contain important insights that other powerholders would endorse on reflection&#8212;e.g., they care about some type of moral good that other powerholders weren&#8217;t initially tracking the value of. I expect that most of this value comes from increasing the number of powerholders up to about 10<sup>4</sup>.</p></li></ul></li></ul><p>The drawbacks of greater moral diversity are:</p><ul><li><p><strong><a href="https://newsletter.forethought.org/i/193885227/increasing-the-likelihood-of-rare-bad-actors">Increasing the likelihood of rare </a></strong><em><strong><a href="https://newsletter.forethought.org/i/193885227/increasing-the-likelihood-of-rare-bad-actors">bad</a></strong></em><strong><a href="https://newsletter.forethought.org/i/193885227/increasing-the-likelihood-of-rare-bad-actors"> actors</a>: </strong>Increase the likelihood that there&#8217;s at least one &#8220;destroyer&#8221;, an actor that&#8217;s motivated to destroy a bunch of value.</p><ul><li><p>This matters if it&#8217;s possible for a single actor to <em>unilaterally</em> destroy a lot of value, which I think is somewhat unlikely, so I rate this consideration lower than the previous three models.</p></li><li><p>But, on this model, I estimate that this risk grows logarithmically up until about 10<sup>8</sup> powerholders.</p></li><li><p>If you add destroyers to the bodhisattva model described above, then adding additional powerholders is valuable up until about 10<sup>4</sup> powerholders.</p></li></ul></li></ul><p>All this suggests that AI-enabled coups by small groups are a particularly important form of power concentration to prevent, relative to other forms of power concentration that are somewhat more diffuse (e.g., rising wealth inequality).</p><p>A major limitation of 
this modeling is that I&#8217;m treating powerholders as if they&#8217;re about as different from each other as two randomly selected living humans. In most scenarios with concentration of power, powerholders will be much more similar to each other than that. I think this is an especially serious issue for small numbers of powerholders, since in scenarios where a small number of people seize power, it&#8217;s more likely that they&#8217;re a close-knit coordinated group from a similar background (e.g., employees at a lab in a lab coup). My guess is that this is less serious for broader concentration of power scenarios (e.g., scenarios where power is consolidated among capital owners).</p><h1>Increasing the likelihood of rare great actors</h1><p>You might get outsized benefits from having just one powerholder motivated to pursue the correct values, if most other powerholders don&#8217;t care much about something incompatible with pursuing those values.</p><p>Here&#8217;s a toy model.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Suppose that there are three types of powerholders:</p><ul><li><p>Bodhisattvas, who want to fill as much of the universe as possible with societies full of diverse types of flourishing beings.</p></li><li><p>Rivals, who have strong preferences that are linear in resources and <em>resource-incompatible</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> with the bodhisattva goals. 
Perhaps they linearly value keeping space pristine and untouched by humans, or value societies full of human-like minds or copies of themselves.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Or maybe they have a different notion of flourishing than the bodhisattvas where it&#8217;s difficult to create minds that are flourishing by the lights of both the bodhisattvas and the rivals.</p></li><li><p>Easygoers, who have preferences with diminishing marginal returns. Perhaps they care about the Milky Way being filled with a <a href="https://www.forethought.org/research/no-easy-eutopia#22-common-sense-utopia">common-sense utopia</a> of flourishing humans, but don&#8217;t care much about what happens with the rest of the universe.</p></li></ul><p>I will assume for the purposes of this model that bodhisattvas and rivals are both fairly rare relative to easygoers.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Suppose that after the intelligence explosion, space resources are auctioned off. Easygoers bid up prices in the Milky Way and nearby galaxies, but resources further out remain cheap. Those distant resources are split between bodhisattvas and rivals. 
The overall value of the future will be determined by what share of resources are controlled by the bodhisattvas&#8212;so the total fraction of value achieved is <em>B</em>/(<em>R </em>+<em> B</em>), where <em>B</em> is the number of bodhisattvas and <em>R</em> is the number of rivals.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p><p>Under this model, there are two important cases:</p><ul><li><p>There are few enough powerholders that you expect less than one bodhisattva <em>or</em> rival.</p><ul><li><p>In this case, it&#8217;s useful to increase the number of powerholders because you get additional &#8220;shots on goal&#8221;&#8212;each additional powerholder is an extra chance to get a bodhisattva.</p></li></ul></li><li><p>There are enough powerholders that you expect at least one bodhisattva or rival.</p><ul><li><p>So in expectation, the bodhisattvas get <em>p</em>/(<em>p</em> + <em>q</em>) of the total available value, where <em>p</em> is the rate of bodhisattvas and <em>q</em> is the rate of rivals.</p></li><li><p>Increasing the number of powerholders reduces variance, bringing the actual share of value closer to <em>p</em>/(<em>p </em>+ <em>q</em>), but does not change the expected value.</p></li></ul></li></ul><p>For example, if we assume that about 1 in 10,000 people are bodhisattvas and 1 in 10,000 are rivals, then this is how the value of the future scales with the number of powerholders:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OQaK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!OQaK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 424w, https://substackcdn.com/image/fetch/$s_!OQaK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 848w, https://substackcdn.com/image/fetch/$s_!OQaK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 1272w, https://substackcdn.com/image/fetch/$s_!OQaK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OQaK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png" width="567" height="442" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:567,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!OQaK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 424w, https://substackcdn.com/image/fetch/$s_!OQaK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 848w, https://substackcdn.com/image/fetch/$s_!OQaK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 1272w, https://substackcdn.com/image/fetch/$s_!OQaK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0a891943-5753-4059-ae81-4ac1d26a0a0d_567x442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The point at which you get the plateau depends on your estimates of <em>p</em> and <em>q</em>. How common are bodhisattvas and rivals?</p><p>You probably need three things to be a bodhisattva: the right starting position (e.g., the correct initial moral intuitions), the right reflective process, and a strong commitment to doing the most good by your lights with most of your resources. Here&#8217;s a very rough back-of-the-envelope calculation (BOTEC) where I try to estimate the rate of bodhisattvas among the current human population.</p><ul><li><p>0.1-50% for a sufficiently strong commitment to doing the most good by your lights with most of your resources.</p></li><li><p>10-50% for the right reflective process, conditional on a strong commitment to doing good.</p></li><li><p>1-100% for the right &#8220;starting&#8221; intuitions, conditional on the previous two.</p></li></ul><p>Multiplying the upper bounds (50% &#215; 50% &#215; 100%) and the lower bounds (0.1% &#215; 10% &#215; 1%) gives a range of 1 in 4 to 1 in 1 million.</p><p>It&#8217;s plausible that the rate of rivals will be in the same ballpark as the rate of bodhisattvas. Rivals share many features with bodhisattvas, which is part of why they&#8217;re resource-incompatible: for example, they have non-negligible returns to vast resources and they care about the use of distant galaxies and time periods. 
If the rate of rivals is fairly close&#8212;i.e., within 1-3 orders of magnitude of the rate of bodhisattvas&#8212;then this suggests logarithmic returns to increasing the number of powerholders up to about 10<sup>5</sup> to 10<sup>6</sup>, after which it quickly levels off.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fJ0Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 424w, https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 848w, https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 1272w, https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png" width="567" height="442" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:442,&quot;width&quot;:567,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 424w, https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 848w, https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 1272w, https://substackcdn.com/image/fetch/$s_!fJ0Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4c90a50-7963-4adc-9a12-bd93bdc59daa_567x442.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>For the blue line, the rates of bodhisattvas and rivals are sampled independently and log-uniformly from [1e-6, 0.1]. For the orange line, the rate of bodhisattvas is sampled from [1e-6, 0.1] and the rate of rivals is sampled within two orders of magnitude of the rate of bodhisattvas. For the green line, rivals tend to be more common than bodhisattvas&#8212;between equally common and a thousand times more common.</em></figcaption></figure></div><p>It&#8217;s also possible that the rate of rivals won&#8217;t be tightly correlated with the rate of bodhisattvas. 
If your lower bound on <em>q</em> is substantially greater than your lower bound on <em>p</em>, then the value will plateau once the population is greater than 1/<em>q</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VDaG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VDaG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 424w, https://substackcdn.com/image/fetch/$s_!VDaG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 848w, https://substackcdn.com/image/fetch/$s_!VDaG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 1272w, https://substackcdn.com/image/fetch/$s_!VDaG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VDaG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png" width="691" height="458" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:458,&quot;width&quot;:691,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VDaG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 424w, https://substackcdn.com/image/fetch/$s_!VDaG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 848w, https://substackcdn.com/image/fetch/$s_!VDaG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 1272w, https://substackcdn.com/image/fetch/$s_!VDaG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0860f103-4b19-4c1c-bb88-ecd8bde4fda6_691x458.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>I&#8217;m again sampling the rate of bodhisattvas log-uniformly between 1e-6 and 1e-1, but this time holding the rate of rivals fixed at different levels.</em></figcaption></figure></div><p>In the extreme case, if more than 10% of powerholders are likely to be rivals, we no longer get much value from a few highly motivated bodhisattvas. The next model discusses how moral diversity could be valuable even if most people are rivals.</p><h1>Increasing the likelihood of coordinating on moral public goods</h1><p>In the previous section, we considered the case where a relatively small share of the population cared about how resources deep in space were used. 
What if instead many people have resource-incompatible goals that can absorb large quantities of resources?</p><p>I&#8217;ve <a href="https://www.forethought.org/research/moral-public-goods-are-a-big-deal-for-whether-we-get-a-good-future">argued elsewhere</a> that in such cases they could often make a deal to collectively fund moral public goods, and this would probably be good, since there would be significant gains from trade and a shift of resources from idiosyncratic to more broadly shared preferences.</p><p>How many powerholders do we need to ensure that moral public goods are funded?</p><p>It depends on how much people value the moral public good relative to the best goods according to their idiosyncratic preferences. For a trade to be possible at all, there must be gains from trade for all participants. For example, suppose each person <em>i</em> has a linear utility function <em>u<sub>i</sub></em> = <em>x<sub>i</sub></em> + <em>m &#215; y</em> (where <em>x<sub>i</sub></em> is the level of spending on their idiosyncratic good and <em>y</em> is the level of spending on the public good). If all <em>N</em> powerholders agree to each move one unit of spending to the public good, each person gives up 1 unit of idiosyncratic value and gains <em>m &#215; N</em>, so people will spend on the public good only if <em>N</em> &#8805; 1/<em>m</em>. Multipliers in the range of 10<sup>-6</sup> to 1 seem quite plausible.</p><p>I am somewhat more skeptical of multipliers much smaller than 10<sup>-6</sup>. First, it&#8217;s unclear to what extent people will have very weak preferences that are psychologically distinguishable from no preference at all, which makes extremely low multipliers (e.g., 10<sup>-30</sup>) implausible. 
Second, if the multiplier for a particular consensus good gets very low, then it seems increasingly plausible that there is some other, better deal that people could make with a subset of their trading partners who share some of their idiosyncratic preferences.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>Based on these considerations, my best guess is that the multipliers are log-uniformly distributed from 10<sup>-6</sup> to 1&#8212;implying logarithmic returns to growing the population of powerholders up to around one million.</p><h1>Increasing the quality of reflection</h1><p>In the previous two models, I&#8217;ve treated the powerholders&#8217; values as developing mostly independently. But if powerholders influence each other&#8217;s reflection&#8212;e.g., by arguing with each other about their values&#8212;then greater initial moral diversity could help powerholders converge to a better set of final values, through mechanisms like the following:</p><p><strong>Social exposure to non-sycophants</strong>. If one person single-handedly carries out a coup and ends up with a decisive strategic advantage, they might find themselves surrounded by yes-men who are utterly reliant on them and unwilling to argue forcefully for values different from those they currently endorse. A similar dynamic might be at play if a small but ideologically very uniform group seizes power (e.g., a set of officials from the same presidential administration or perhaps a dictator and their close advisors). 
But if there are multiple, ideologically diverse powerholders, they might be able to challenge each other&#8217;s views and improve the overall quality of reflection.</p><p>Under this model, most of the value probably comes from moving from a single powerholder to tens or hundreds of powerholders, or from moving from one ideologically uniform group to multiple ideologically uniform groups (perhaps moving from a lab coup or an executive coup to a joint lab coup and executive coup).</p><p>This effect relies on powerholders socializing with each other, rather than retreating into their own bubbles of non-powerholding friends and sycophantic AIs.</p><p><strong>Champions for rare values</strong>. Powerholders with rare value systems might be able to act as &#8220;champions&#8221; for those value systems. For example, they might use AI labor to develop the strongest, most plausible version of that value system, or they might try to persuade other powerholders about the merits of that value system. This might be important if that rare value system includes an insight that&#8217;s missing from other value systems&#8212;perhaps most value systems care primarily about consciousness, but actually there&#8217;s <a href="https://linch.substack.com/p/further-moral-goods">another totally different type of moral good</a> that other powerholders would want to pursue if they were aware of it.</p><p>(In principle, non-powerholders could act as champions for rare values. But they might lack the resources (e.g., access to ASI labor) needed to develop the insights in their value systems. 
They might also depend on the goodwill of powerholders and be unwilling to push too aggressively for their alternative value system, or powerholders might simply not take non-powerholders seriously.)</p><p>Just as in the bodhisattva model, increasing the number of powerholders increases the chances that at least one powerholder can serve as a champion for a rare value system that contains a crucial insight.</p><p>I&#8217;m very uncertain about how common these champions are, but if they&#8217;re sufficiently rare, then we&#8217;re likely to get their insight via some other mechanism instead.</p><p>For example, some powerholders might be &#8220;superreflectors&#8221; who instruct their ASIs to steelman every known human value system and invent millions of novel value systems, searching for insights that they and other powerholders might endorse on reflection. I expect that superreflectors would achieve all of the value from having powerholders act as champions for rare value systems that they actually subscribe to (and more).</p><p>So increasing the number of powerholders adds value only up to the point where we are likely to have at least one superreflector. 
Superreflectors are also plausibly rather rare&#8212;perhaps between 1/10 and 1/10,000&#8212;so increasing the number of powerholders up until 10,000 is valuable under this model.</p><h1>Increasing the likelihood of rare <em>bad</em> actors</h1><p>It&#8217;s possible (though rather unlikely<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>) that a single bad actor could unilaterally destroy a lot of value, e.g., by</p><ul><li><p>Initiating a space race that results in an extremely <a href="https://hanson.gmu.edu/filluniv.pdf">inefficient use of space resources</a> by the lights of most people&#8217;s value systems.</p></li><li><p>Destroying the universe by initiating false vacuum decay or triggering another <a href="https://www.lesswrong.com/posts/3ww5zZgTTPySB3jpP/interstellar-travel-will-probably-doom-the-long-term-future">galactic-level x-risk</a>.</p></li></ul><p>As we increase the moral diversity of powerholders, we increase the chance of ending up with at least one powerholder that inherently values one of these activities enough that they will do it if they can. For example, <a href="https://joecarlsmith.substack.com/p/video-and-transcript-of-talk-on-can">locusts</a> might inherently value expanding through space as quickly as possible. We also increase the likelihood that one powerholder is ruthless or reckless enough to risk one of these activities&#8212;for example, a powerholder might threaten to initiate vacuum decay to extort concessions from other powerholders.</p><p>We can add these rare bad actors to the bodhisattva model described above. Now, in addition to bodhisattvas, rivals, and easygoers, we have a fourth type: destroyers. 
If at least one destroyer is present, total value is zero; otherwise it is calculated as before.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uw4g!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uw4g!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 424w, https://substackcdn.com/image/fetch/$s_!uw4g!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 848w, https://substackcdn.com/image/fetch/$s_!uw4g!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 1272w, https://substackcdn.com/image/fetch/$s_!uw4g!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uw4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png" width="640" height="480" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:480,&quot;width&quot;:640,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uw4g!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 424w, https://substackcdn.com/image/fetch/$s_!uw4g!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 848w, https://substackcdn.com/image/fetch/$s_!uw4g!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 1272w, https://substackcdn.com/image/fetch/$s_!uw4g!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da177b7-935d-4e76-a72e-b11cae73cc90_640x480.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em>I am assuming that the rate of bodhisattvas and rivals is sampled log-uniformly between 0.1 and 10<sup>-6</sup>.</em></figcaption></figure></div><p>When diversity is low, it&#8217;s unlikely that there&#8217;s a bodhisattva already. In that regime, adding powerholders is all upside: if you add a bodhisattva, you get some positive value, but if you add a destroyer, rival, or easygoer, expected value stays around zero. But as diversity increases, it becomes likely that there&#8217;s a bodhisattva already, so adding powerholders risks adding a destroyer, which would take us from positive value to near-zero value.</p><p>As the figure above shows, the value of <em>N</em> at which we switch from the low-diversity regime to the high-diversity regime depends on the destroyer rate. As a wild guess, I estimate that the destroyer rate is distributed log-uniformly between 10<sup>-4</sup> and 10<sup>-8</sup>. 
Under those assumptions, increasing the number of powerholders is beneficial up until 10<sup>4</sup> powerholders, after which additional powerholders reduce value.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See all of our research on <a href="https://www.forethought.org/research">our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>You might also disagree with me on what the correct moral system is likely to be, which could also lead to different parameters here.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Credit to Will MacAskill for this model.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>That is, the same resources cannot be used to simultaneously get most of the value by the lights of both the bodhisattva and the rival.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This is assuming that the most flourishing minds have way higher value (under the correct moral view) than human-like minds.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I think this is somewhat plausible&#8212;most people today have preferences that are sublinear in resources and do not 
care much about very distant galaxies. But it&#8217;s also plausible that future people will have more resource-hungry preferences, if they reflect on their preferences, if their sublinear preferences are all saturated, or if advances in technology allow them to personally benefit from consuming huge amounts of resources. In the section on moral public goods, I discuss how moral diversity might matter if linear preferences are common.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This assumes that bodhisattvas and rivals individually have the same amount of resources on average.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>In fact, increasing <em>N</em> can make these side-deals more likely by increasing the number of people who care about the idiosyncratic good. For example:</p><ul><li><p>Imagine a world with 10 people, each of whom values 3 goods: copies of themselves, national glory (valued at 80% of copies of themselves), and hedonium (valued at 11% of copies of themselves). Suppose that each person is from a different nation. 
They will prefer to coordinate on hedonium.</p></li><li><p>But if there are 20 people, two from each nation, then everyone will prefer to coordinate with their co-national on producing national glory.</p></li></ul><p>Of course, it&#8217;s not totally clear, from a subjectivist perspective, whether (the general version of) this is bad.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Perhaps the most plausible story for this is one in which powerholders spread across space and the destroyer covertly carries out the destructive activity, with others noticing only when it&#8217;s too late. But I expect the other powerholders will very likely be able to anticipate and mitigate this risk (e.g., by demanding that the destroyer make verifiable commitments to avoid this activity before allowing the destroyer to leave the solar system).</p></div></div>]]></content:encoded></item><item><title><![CDATA[The good, the bad and the ugly: AI impacts on epistemics]]></title><description><![CDATA[For better or worse, AI could reshape the way that people work out what to believe and what to do.]]></description><link>https://newsletter.forethought.org/p/the-good-the-bad-and-the-ugly-ai</link><guid isPermaLink="false">https://newsletter.forethought.org/p/the-good-the-bad-and-the-ugly-ai</guid><dc:creator><![CDATA[Owen Cotton-Barratt]]></dc:creator><pubDate>Mon, 13 Apr 2026 17:15:39 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!WXXO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
See the original <a href="https://www.forethought.org/research/ai-impacts-on-epistemics-the-good-the-bad-and-the-ugly">on our website</a>.</em></p><h1>Intro</h1><p>For better or worse, AI could reshape the way that people work out what to believe and what to do. What are the prospects here?</p><p>In this piece, we&#8217;re going to map out the trajectory space as we see it. First, we&#8217;ll lay out three sets of dynamics that could shape how AI impacts epistemics (how we make sense of the world and figure out what&#8217;s true):</p><ul><li><p><a href="https://newsletter.forethought.org/i/193454919/the-good">The good</a>: there&#8217;s huge potential for AI to uplift our ability to track what&#8217;s true and make good decisions</p></li><li><p><a href="https://newsletter.forethought.org/i/193454919/the-bad">The bad</a>: AI could also make the world harder for us to understand, without anyone intending for that to happen</p></li><li><p><a href="https://newsletter.forethought.org/i/193454919/the-ugly">The ugly</a>: malicious actors could use AI to actively disrupt epistemics</p></li></ul><p>Then we&#8217;ll argue that <a href="https://newsletter.forethought.org/i/193454919/so-what-should-we-expect-to-happen">feedback loops</a> could easily push towards much better or worse epistemics than we&#8217;ve seen historically, making near-term work on AI for epistemics unusually important.</p><p>The stakes here are potentially very high. As AI advances, we&#8217;ll be faced with a whole raft of civilisational-level decisions to make. How well we&#8217;re able to understand and reason about what&#8217;s happening could make the difference between a future that we&#8217;ve chosen soberly and wisely, and a catastrophe we stumble into unawares.</p><h1>The good</h1><blockquote><p><em>&#8220;If I have seen further, it is by standing on the shoulders of giants.&#8221;</em> (Isaac Newton)</p></blockquote><p>There are lots of ways that AI could help improve epistemics. 
Many kinds of AI tools could directly improve our ability to think and reason. We&#8217;ve written more about these in our <a href="https://www.forethought.org/research/design-sketches-for-a-more-sensible-world">design sketches</a>, but here are some illustrations:</p><ul><li><p>Tools for <a href="https://www.forethought.org/research/design-sketches-collective-epistemics#">collective epistemics</a> could make it easy to know what&#8217;s trustworthy and reward honesty, making it harder for actors to hide risky actions or <a href="https://80000hours.org/problem-profiles/extreme-power-concentration/">concentrate power</a> by manipulating others&#8217; views.</p><ul><li><p>Imagine that when you go online, &#8220;community notes for everything&#8221; flag content that other users have found misleading, and &#8220;rhetoric highlighting&#8221; automatically flags persuasive but potentially misleading language. With a few clicks, you can see the epistemic track record of any actor, or access the full provenance of a given claim. 
Anyone who wants can compare state-of-the-art AI systems using epistemic virtue evals, which also exert pressure at the AI development stage.</p></li></ul></li><li><p>Tools for <a href="https://www.forethought.org/research/design-sketches-tools-for-strategic-awareness">strategic awareness</a> could deepen people&#8217;s understanding of what&#8217;s actually going on around them, making it easier to make good decisions, keep up with the pace of progress, and steer away from failure modes like <a href="https://gradual-disempowerment.ai/">gradual disempowerment</a>.</p><ul><li><p>Imagine that superforecaster-level forecasting and scenario planning are available on tap, and automated OSINT gives people access to much higher quality information about the state of the world.</p></li></ul></li><li><p>Technological analogues to <a href="https://www.forethought.org/research/design-sketches-angels-on-the-shoulder">angels-on-the-shoulder</a>, like personalised learning systems and reflection tools, could make decision-makers better informed, more situationally aware, and more in touch with their own values.</p><ul><li><p>Imagine that everyone has access to high-quality personalised learning, automated deep briefings for high-stakes decisions, and reflection tools to help them understand themselves better. In the background, aligned recommender systems promote long-term user endorsement, and some users enable a guardian coach system which flags any actions the person might regret taking in real time.</p></li></ul></li></ul><p>Structurally, AI progress might also enable better reasoning and understanding, for example by automating labour such that people have more time and attention, or by making people wealthier and healthier.</p><p>These changes might enable us to approach something like epistemic flourishing, where it&#8217;s easier to find out what&#8217;s true than it is to lie, and the world in most people&#8217;s heads is pretty similar to the world as it actually is. 
This could radically improve our prospects of safely <a href="https://www.forethought.org/research/preparing-for-the-intelligence-explosion">navigating the transition to advanced AI</a>, by:</p><ul><li><p>Helping us to keep pace with the increasing speed and complexity of the situation, so we&#8217;re able to make informed and timely decisions.</p></li><li><p>Ensuring that key decision-makers don&#8217;t make catastrophic unforced errors through lack of information or understanding.</p></li><li><p>Making it harder for malicious actors to manipulate the information environment in their favour to increase their own influence.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WXXO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WXXO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 424w, https://substackcdn.com/image/fetch/$s_!WXXO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 848w, https://substackcdn.com/image/fetch/$s_!WXXO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 1272w, https://substackcdn.com/image/fetch/$s_!WXXO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WXXO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png" width="1280" height="898" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:898,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1699439,&quot;alt&quot;:&quot;A Philosopher Lecturing on the Orrery, a painting by Joseph Wright of Derby. It depicts a lecturer giving a demonstration of an orrery &#8211; a mechanical model of the Solar System &#8211; to a small audience.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/193454919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="A Philosopher Lecturing on the Orrery, a painting by Joseph Wright of Derby. It depicts a lecturer giving a demonstration of an orrery &#8211; a mechanical model of the Solar System &#8211; to a small audience." title="A Philosopher Lecturing on the Orrery, a painting by Joseph Wright of Derby. It depicts a lecturer giving a demonstration of an orrery &#8211; a mechanical model of the Solar System &#8211; to a small audience." 
srcset="https://substackcdn.com/image/fetch/$s_!WXXO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 424w, https://substackcdn.com/image/fetch/$s_!WXXO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 848w, https://substackcdn.com/image/fetch/$s_!WXXO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 1272w, https://substackcdn.com/image/fetch/$s_!WXXO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3c2c8344-95d9-4067-a9e5-15c67d6bbe47_1280x898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><em><a href="https://en.wikipedia.org/wiki/A_Philosopher_Lecturing_on_the_Orrery#/media/File:Wright_of_Derby,_The_Orrery.jpg">A Philosopher Lecturing on the Orrery</a>, by Joseph Wright of Derby (1766)</em></figcaption></figure></div><p>What&#8217;s driving these potential improvements?</p><ul><li><p><strong>AI will be able to think much more cheaply and quickly than humans.</strong> Partly this will mean that we can reach many more insights with much less effort. Partly this will make it possible to understand things that are currently infeasible for us to understand (because it would take too many humans too long to figure it out).</p></li><li><p><strong>AI can &#8216;know&#8217; much more than any human.</strong> Right now, a lot of information is siloed in specific expert communities, and it&#8217;s slow to filter out to other places even when it would be very useful there. AI will be able to port and apply knowledge much more quickly to the relevant places.</p></li></ul><h1>The bad</h1><blockquote><p><em>&#8220;A wealth of information creates a poverty of attention.&#8221;</em> (Herbert Simon)</p></blockquote><p>AI could also make epistemics worse without anyone intending it, by making the world more confusing and degrading our information and processing.</p><p>There are a few different ways that AI could unintentionally weaken our epistemics:</p><ul><li><p><strong>The world gets faster and more complex.</strong> As AI progresses, our information-processing capabilities are going to go up &#8212; but so is the complexity of the world. 
Technological progress could become <a href="https://www.forethought.org/research/preparing-for-the-intelligence-explosion">dramatically faster</a> than it is today, making the world more disorienting and harder to understand. If technological progress gets fast enough, it&#8217;s possible that we won&#8217;t be able to keep up, and even the best AI tools available won&#8217;t help us to see through the fog.</p></li><li><p><strong>The quality of the information we&#8217;re interacting with gets worse</strong>, because of:</p><ul><li><p><strong>Faster memetic evolution.</strong> As more and more content is generated by and mediated through AI systems working at machine speeds, the pace of memetic and cultural change will probably get a lot faster than it is today. As the pace quickens, memes which are attention-grabbing could increasingly outcompete those which are truthful.</p></li><li><p><strong>More difficult verification.</strong> This could happen through a combination of:</p><ul><li><p><strong>AI slop.</strong> In hard-to-verify domains, AI could massively increase the quantity of plausible-looking but wrong information, without also being able to help us to verify which bits are right.</p></li><li><p><strong>AI-generated &#8216;evidence&#8217;.</strong> As the quality of AI-generated video, audio, images, and text continues to improve, it may become difficult to tell which bits of evidence are real and which are spurious.</p></li></ul></li></ul></li><li><p><strong>We get worse at processing the information we get</strong>, because:</p><ul><li><p><strong>Our emotions get in the way.</strong> AI progress could be very disorienting, generate serious crises, and cause people a lot of worry and fear. 
This could get in the way of clear thinking.</p></li><li><p><strong>Using AI to help us with information processing degrades our thinking</strong>, via:</p><ul><li><p><strong>Adoption of low-quality AI tools for epistemics:</strong> In many areas of epistemics, it&#8217;s hard to say what counts as &#8216;good&#8217;. This makes epistemic tools harder to assess, and could lead to people trusting these tools either too much or too little. Inappropriately high levels of trust in epistemic tools could take various forms, including:</p><ul><li><p>First-mover advantages for early but imperfect systems, which are then hard to replace with better systems because people trust the earlier systems more.</p></li><li><p>The use of epistemically misaligned systems, which aren&#8217;t actually truth-tracking, in ways that we can&#8217;t discern.</p></li></ul></li><li><p><strong>Fragmentation of the information environment:</strong> AI will make it easier to create content (potentially interactive content) that pulls people in and monopolises their attention. This could reduce the attention available for important truth-tracking mechanisms, and make it harder to coordinate groups of people around important actions. In the extreme, some people might end up in effectively closed information bubbles, where all of their information is heavily filtered through the AI systems they interact with directly. 
The more fragmented the information environment becomes, the harder it could get for people to make sense of what&#8217;s happening in the world around them, and to engage with other people and other information bubbles.</p></li><li><p><strong>Epistemic dependence:</strong> if people increasingly outsource their thinking to AI systems, they may lose the ability to think critically for themselves.</p></li></ul></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!drBz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!drBz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 424w, https://substackcdn.com/image/fetch/$s_!drBz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 848w, https://substackcdn.com/image/fetch/$s_!drBz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 1272w, https://substackcdn.com/image/fetch/$s_!drBz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!drBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png" width="455" height="542" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:542,&quot;width&quot;:455,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:476967,&quot;alt&quot;:&quot;Allegory of Error by Stefano Bianchetti. An engraving depicting a blindfolded figure with donkey ears staggering forward holding a staff.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/193454919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Allegory of Error by Stefano Bianchetti. An engraving depicting a blindfolded figure with donkey ears staggering forward holding a staff." title="Allegory of Error by Stefano Bianchetti. An engraving depicting a blindfolded figure with donkey ears staggering forward holding a staff." 
srcset="https://substackcdn.com/image/fetch/$s_!drBz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 424w, https://substackcdn.com/image/fetch/$s_!drBz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 848w, https://substackcdn.com/image/fetch/$s_!drBz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 1272w, https://substackcdn.com/image/fetch/$s_!drBz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e9f3c3a-1d4c-47f5-97bc-8ba006fdca14_455x542.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><a href="https://www.mediastorehouse.com/fine-art-finder/artists/austrian-school/allegory-error-staggering-attitude-blindfold-22293032.html">Allegory of Error</a>, Stefano Bianchetti (1801)</em></figcaption></figure></div><h1>The ugly</h1><blockquote><p><em>&#8220;The ideal subject of totalitarian rule is not the convinced Nazi or the convinced Communist, but people for whom the distinction between fact and fiction (i.e., the reality of experience) and the distinction between true and false (i.e., the standards of thought) no longer exist.&#8221;</em> (Hannah Arendt, <em>The Origins of Totalitarianism</em>)</p></blockquote><p>We&#8217;ve just talked about ways that AI could make epistemics worse without anyone intending that. But we might also see actors using AI to actively interfere with societal epistemics. (In reality these things are a spectrum, and the dynamics we discussed in the preceding section could also be actively exploited.)</p><p>What might this look like?</p><ul><li><p><strong>Automated propaganda and persuasion:</strong> AI could be used to generate high-quality persuasive content at scale. This could take the form of highly tailored, well-written propaganda. If this content were then used as training data for next generation models, biases could get even more entrenched. Additionally, AI persuasion could come in the form of models which are subtly biased in a particular direction. Particularly if many users are spending large amounts of time talking to AI (e.g. 
AI companions), the persuasive effects could be much larger than is scalable today via human-to-human persuasion.</p></li><li><p><strong>Using AI to undermine sense-making:</strong> AI could be used to generate high-quality content which casts doubt on institutions, individuals, and tools that would help people understand what&#8217;s going on, or to directly sabotage such tools. More indirectly, actors could also use AI to generate content which adds to complexity, for example by wrapping important information in complex abstractions and technicalities, and generating large quantities of very readable reports and news stories which distract attention.</p></li><li><p><strong>Surveillance:</strong> AI surveillance could monitor people&#8217;s communications in much more fine-grained ways, and punish them when they appear to be thinking along undesirable lines. This could be abused by states, or could become a tool that private actors can wield against their enemies. In either case, the chilling effect on people&#8217;s thinking and behaviour could be significant.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZWin!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZWin!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 424w, https://substackcdn.com/image/fetch/$s_!ZWin!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 848w, 
https://substackcdn.com/image/fetch/$s_!ZWin!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 1272w, https://substackcdn.com/image/fetch/$s_!ZWin!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZWin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png" width="1280" height="929" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:929,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1840447,&quot;alt&quot;:&quot;The Card Sharp with the Ace of Diamonds, an oil-on-canvas painting by Georges de La Tour. It depicts a card game in which a young man is being fleeced of his money by the other players, including a card sharp who is retrieving the ace of diamonds from behind his back.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/193454919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="The Card Sharp with the Ace of Diamonds, an oil-on-canvas painting by Georges de La Tour. 
It depicts a card game in which a young man is being fleeced of his money by the other players, including a card sharp who is retrieving the ace of diamonds from behind his back." title="The Card Sharp with the Ace of Diamonds, an oil-on-canvas painting by Georges de La Tour. It depicts a card game in which a young man is being fleeced of his money by the other players, including a card sharp who is retrieving the ace of diamonds from behind his back." srcset="https://substackcdn.com/image/fetch/$s_!ZWin!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 424w, https://substackcdn.com/image/fetch/$s_!ZWin!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 848w, https://substackcdn.com/image/fetch/$s_!ZWin!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 1272w, https://substackcdn.com/image/fetch/$s_!ZWin!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d5d4364-72c5-4ebb-841a-84382a96a93c_1280x929.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em><a href="https://en.wikipedia.org/wiki/The_Card_Sharp_with_the_Ace_of_Diamonds#/media/File:Le_Tricheur_%C3%A0_l%E2%80%99as_de_carreau_-_Georges_de_La_Tour_-_Mus%C3%A9e_du_Louvre_Peintures_RF_1972_8.jpg">The Card Sharp with the Ace of Diamonds</a>, by Georges de La Tour (~1636-1638)</em></figcaption></figure></div><p>But maybe this is all a bit paranoid. Why expect this to happen?</p><p>There&#8217;s a long history of powerful actors trying to distort epistemics,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> so we should expect that some people will be trying to do this. 
And AI will probably give them better opportunities to manipulate other people&#8217;s epistemics than have existed historically:</p><ul><li><p>It&#8217;s likely that access to the best AI systems and compute will be <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power#33-exclusive-access-to-coup-enabling-capabilities">unequal</a>, which favours abuse.</p></li><li><p>If people end up primarily interfacing with the world via AI systems, this will create a big lever for epistemic influence that doesn&#8217;t exist currently. It could be much easier to influence the behaviour of lots of AI systems at once than lots of people or organisations.</p></li></ul><p>It&#8217;s also worth noting that many of these abuses of epistemic tech don&#8217;t require people to have some Machiavellian scheme to disrupt epistemics or seek power for themselves (though these might arise later). Motivated reasoning could get you a long way:</p><ul><li><p>Legitimate communications and advertising blur into propaganda, and microtargeting is already a common strategy.</p></li><li><p>It&#8217;s easy to imagine that in training an AI system, a company might want to use something like its own profits as a training signal, without explicitly recognising the potential epistemic effects of this in terms of bias.</p></li></ul><h1>So what should we expect to happen?</h1><p>With all these dynamics pulling in different directions, should we expect that it&#8217;s going to get easier or harder for people to make sense of the world?</p><p>We think it could go either way, and that how this plays out is extremely consequential.</p><p>The main reason we think this is that the dynamics above are self-reinforcing, so the direction we set off in initially could have large compounding effects. 
In general, the better your reasoning tools and information, the easier it is for you to recognise what is good for your own reasoning, and therefore to improve your reasoning tools and information. The worse they are, the harder it is to improve them (particularly if malicious actors are actively trying to prevent that).</p><p>We already see this empirically. The Scientific Revolution and the Enlightenment can be seen as examples of good epistemics reinforcing themselves. Distorted epistemic environments often also have self-perpetuating properties. Cults often require members to move into communal housing and cut contact with family and friends who question the group. Scientology frames psychiatry&#8217;s rejection of its claims as evidence of a conspiracy against it.</p><p>And on top of historical patterns, there are AI-specific feedback loops that reinforce initial epistemic conditions:</p><ul><li><p>Unlike previous information tech, AI has a tight feedback loop between the content models generate and the data used to train future models. So if today&#8217;s models generate accurate content, future models are more likely to do so too; and the same holds for inaccurate content.</p></li><li><p>How early AI systems behave epistemically will shape user expectations and what kinds of future AI behaviour there&#8217;s a market for.</p></li></ul><p>There are self-correcting dynamics too, so these self-reinforcing loops won&#8217;t go on forever. But we think it&#8217;s decently likely that epistemics get much better or much worse than they&#8217;ve been historically:</p><ul><li><p>One self-correcting mechanism historically has just been that it takes (human) effort to sustain or degrade epistemics. Continuing to improve epistemics requires paying attention to ways that epistemics could be eroded, and this isn&#8217;t incentivised in an environment that&#8217;s currently working well. 
Continuing to degrade epistemics requires willing accomplices &#8212; but the more an actor distorts things, the more that can galvanise opposition, and the fewer people may be willing to assist. By augmenting or replacing human labour with automated labour, AI could make it much cheaper to keep pushing in the same direction.</p></li><li><p>Another self-correcting mechanism is just that people and institutions adapt to new epistemic tech: as epistemics improve, deception becomes more sophisticated; and if epistemics worsen, people lose trust and create new mechanisms for assessing truth. But this adaptation happens at human speed, and AI will increasingly be changing the epistemic environment at a much faster pace. This creates the potential for self-reinforcing dynamics to drive to much more extreme places before adaptation has time to kick in.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p></li></ul><ul><li><p>There&#8217;s a limit to how good epistemics can get before hitting fundamental problems like complexity and irreducible uncertainty. But there seems to be a lot of room for improvement from where we&#8217;re currently standing (especially as good AI tools could help to handle greater amounts of complexity), and it would be a priori very surprising if we&#8217;d already reached the ceiling.</p></li><li><p>There&#8217;s also a limit to how bad epistemics can get: people aren&#8217;t infinitely suggestible, and often there are external sources of truth that limit how distorted beliefs can get (ground truth, or what gets said in other countries or communities). 
But as we discussed <a href="https://newsletter.forethought.org/i/193454919/the-bad">above</a>, access to ground truth and to other epistemic communities might get harder because of AI, so the floor here may lower.</p></li></ul><p>Given the real chance that we end up stuck in an extremely positive or negative epistemic equilibrium, our initial trajectory seems very important. The kinds of AI tools we build, the order we build them in, and who adopts them when could make the difference between a world of epistemic flourishing and a world where everyone&#8217;s understanding is importantly distorted. To give a sense of the difference this makes, here&#8217;s a sketch of each world (among myriad possible sketches):</p><ul><li><p>In the first world, we basically understand what&#8217;s going on around us. It&#8217;s not like we can now forecast the future with perfect accuracy or anything &#8212; there&#8217;s still irreducible uncertainty, and some people have better epistemic tools than others. But it&#8217;s gotten much cheaper to access and verify information. Public discourse is serious and well-calibrated, because epistemic infrastructure has made it quite hard to deceive or manipulate people &#8212; which in turn incentivises honesty. AI-assisted research and synthesis mean that knowledge which used to be siloed in specialist communities is now accessible and usable by anyone who needs it. And governments are able to make much more nuanced decisions far faster than they are today.</p></li><li><p>In the second, it&#8217;s no longer really possible to figure out what&#8217;s going on. There&#8217;s an awful lot of persuasive but low-quality AI content around, some of it generated with malicious intent. In response to this, people withdraw into their own AI-mediated epistemic bubbles &#8212; and unlike today&#8217;s filter bubbles, these can be comprehensive enough that people rarely encounter friction with outside perspectives at all. 
Meanwhile, companies and nations with a lot of compute find it pretty easy to distract the public&#8217;s attention from anything that would be inconvenient, and to outmaneuver the many actors who are trying to hold them to account. But their own reasoning also gets degraded by all this information pollution, as their AI systems are trained on the same corrupted public information.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Even the people who think they&#8217;re shaping the narrative are increasingly unable to see clearly.</p></li></ul><p>The world we end up in is the world from which we have to navigate the intelligence explosion, making decisions like how to manage misaligned AI systems, whether to grant AI systems rights, and how to divide up the resources of the cosmos. How AI impacts our epistemics between now and then could be one of the biggest levers we have on navigating this well.</p><h1>Things we didn&#8217;t cover</h1><h2><strong>Whose epistemics?</strong></h2><p>We mostly talked about AI impacts on epistemics in general terms. But AI could impact different groups&#8217; epistemics differently &#8212; and different groups&#8217; epistemics could matter more or less for getting to good outcomes. It would be cool to see further work which distinguishes between scenarios where good outcomes require:</p><ul><li><p>Interventions that raise the epistemic floor by improving everyone&#8217;s epistemics.</p></li><li><p>Interventions that raise the ceiling by improving the epistemics of the very clearest thinkers.</p></li></ul><h2><strong>&#8216;Weird&#8217; dynamics</strong></h2><p>We focused on how AI could impact human epistemics, in a world where human reasoning still matters. 
But eventually, we expect more and more of what matters for the outcomes we get will come down to the epistemics of AI systems themselves.</p><p>The dynamics which affect these AI-internal epistemics could therefore be enormously important. But they could look quite different from the human-epistemics dynamics that have been our focus here, and we didn&#8217;t think it made sense to expand the remit of the piece to cover these.</p><p><em>Thanks to everyone who gave comments on drafts, and to Oly Sourbutt and Lizka Vaintrob for a workshop which crystallised some of the ideas.</em></p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/ai-impacts-on-epistemics-the-good-the-bad-and-the-ugly">on our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Think of things like:</p><ul><li><p>Propaganda states like Nazi Germany and the USSR.</p></li><li><p>Corporate lobbying like the tobacco and sugar lobbies and climate science doubt campaigns.</p></li><li><p>CIA operations to spread doubt and confusion.</p></li></ul></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Though it&#8217;s possible that this dynamic will be more pronounced for epistemics getting extremely bad than for them getting extremely good. Consider these two very simplistic sketches:</p><ol><li><p>People start living in increasingly closed AI filter bubbles. Institutions are slow to adopt similar bubbles at a corporate level, but they also don&#8217;t have a mandate to change what their employees are doing. 
People&#8217;s filter bubbles tend to be pretty correlated with the people they work and interact with, so institutions end up with pretty distorted pictures of what&#8217;s going on even though they don&#8217;t actively start using harmful tech. Government regulation is too slow and reactive to stop this from happening.</p></li><li><p>People start to use provenance tracing and rhetoric highlighting by default when browsing, in response to an increasingly polarised memetic environment. There is adaptation to this &#8212; politicians start using subtler language and so on. But the net effect is still strongly positive: it&#8217;s hard to fake provenance, and removing overt rhetoric is already a big win, even if it means that more slippery language proliferates.</p></li></ol><p>In the first sketch, it&#8217;s straightforwardly the case that adaptive mechanisms are too slow. In the latter, it&#8217;s more that the tech is inherently defence-favoured.</p><p>We haven&#8217;t explored this area deeply, and think more work on this would be valuable.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Alternatively, these elites might retain very good epistemics for themselves, and choose to indefinitely maintain a situation where everyone else has a very distorted understanding, to further their own ends. 
It&#8217;s unclear to us which of these scenarios is more likely or concerning.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Sketches of some defense-favoured coordination tech]]></title><description><![CDATA[We think that near-term AI could make it much easier for groups to coordinate, find positive-sum deals, navigate tricky disagreements, and hold each other to account.]]></description><link>https://newsletter.forethought.org/p/sketches-of-some-defense-favoured</link><guid isPermaLink="false">https://newsletter.forethought.org/p/sketches-of-some-defense-favoured</guid><dc:creator><![CDATA[Owen Cotton-Barratt]]></dc:creator><pubDate>Mon, 06 Apr 2026 15:18:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2123578c-0372-46c6-be82-f369f054523f_1999x1173.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/design-sketches-defense-favoured-coordination-tech">on our website</a>.</em></p><h1>Intro</h1><p>We think that near-term AI could make it much easier for groups to coordinate, find positive-sum deals, navigate tricky disagreements, and hold each other to account.</p><p>Partly, this is because AI will be able to process huge amounts of data quickly, making complex multi-party negotiations and discussions much more tractable. And partly it&#8217;s because secure enough AI systems would allow people to share sensitive information with trusted intermediaries without fear of broader disclosure, making it possible to coordinate around information that&#8217;s currently too sensitive to bring to the table, and to greatly improve our capacity for monitoring and transparency.</p><p>We want to help people imagine what this could look like. 
In this piece, we sketch six potential near-term technologies, ordered roughly by how achievable we think they are with present tech:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><ul><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/fast-facilitation">Fast facilitation</a></strong> &#8212; Groups quickly surface key points of consensus and disagreement, and make decisions everyone can live with.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/automated-negotiation">Automated negotiation</a></strong> &#8212; Complicated bargains are discovered quickly via automated negotiation on behalf of each party, mediated by trusted neutral systems which can find agreements.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/arbitrarily-easy-arbitration">Arbitrarily easy arbitration</a></strong> &#8212; Disputes are resolved cheaply and quickly by verifiably neutral AI adjudicators.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/background-networking">Background networking</a></strong> &#8212; People who <em>should</em> know each other get connected (perhaps even before they know to go looking), enabling mutually beneficial trade, coalition building, and more.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/structured-transparency-for-democratic-oversight">Structured transparency for democratic oversight</a></strong> &#8212; Citizens hold their institutions to account in a fine-grained way, without compromising sensitive information.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/confidential-monitoring-and-verification">Confidential monitoring and verification</a></strong> &#8212; Deals can be monitored and verified, even when this requires sharing highly sensitive information, by using trusted AI intermediaries 
which can&#8217;t disclose the information to counterparties.</p></li></ul><p>We also sketch two cross-cutting technologies that support coordination:</p><ul><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/ai-delegates-and-preference-elicitation">AI delegates and preference elicitation</a></strong> &#8212; AI delegates can faithfully represent and act for a human principal, perhaps supported by customisable off-the-shelf agentic platforms that integrate across many kinds of tech.</p></li><li><p><strong><a href="https://newsletter.forethought.org/i/192925664/charter-tech">Charter tech</a></strong> &#8212; The technologies above, or other coordination technologies, are applied to making governance dynamics more transparent, making it easier to anticipate how governance decisions will influence future coordination, and design institutions with this in mind.</p></li></ul><p>An important note is that coordination technologies are <a href="https://vitalik.eth.limo/general/2020/09/11/coordination.html">open to abuse</a>. You can coordinate to bad ends as well as good, and particularly confidential coordination technologies could enable things like price-setting, crime rings, and even <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">coup plots</a>. Because the upsides to coordination are very high (including helping the rest of society to coordinate <em>against</em> these harms), we expect that on balance accelerating some versions of these technologies is beneficial. 
But this will be sensitive to exactly how coordination technologies are instantiated, and any projects in this direction need to take especial care to mitigate these risks.</p><p>We&#8217;ll start by talking about why these tools matter, then look at the details of what these technologies might involve before discussing some cross-cutting issues at the end.</p><h1>Why coordination tech matters</h1><p>Today, many positive-sum trades get left on the table, and a lot of resources are wasted in negative-sum conflicts. Better coordination capabilities could lead to very large benefits, including:</p><ul><li><p>Improving economic productivity across the board</p></li><li><p>Helping nations avoid wars and other destructive conflicts</p></li><li><p>Enabling larger groups to coordinate to avoid exploitation by a small few</p></li><li><p>Making democratic governance much more transparent, while protecting sensitive information</p></li></ul><p>What&#8217;s more, getting these benefits might be close to necessary for navigating the transition to more powerful AI systems safely. Absent coordination, competitive pressures are likely to incentivise developers to race forward as fast as possible, potentially greatly increasing the risks we collectively run. 
If we become much better at coordination, we think it is much more likely that the relevant actors will be able to choose to be cautious (assuming that is the collectively-rational response).</p><p>However, coordination tech could also have significant harmful effects, through enabling:</p><ul><li><p>AI companies to collude with each other against the interests of the rest of society<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p></li></ul><ul><li><p>A small group of actors to plot a <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">coup</a></p></li><li><p>More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don&#8217;t incentivise prosociality to the same extent</p></li></ul><p>Regardless of how these harms and benefits net out for &#8216;coordination tech&#8217; overall, we currently think that:</p><ul><li><p><strong>The shape and impact of coordination tech is an important part of how things will unfold in the near term, and it&#8217;s good for people to be paying more attention to this.</strong></p></li><li><p><strong>We&#8217;re going to </strong><em><strong>need</strong></em><strong> some kinds of coordination tech to safely navigate the AI transition.</strong></p></li><li><p><strong>The devil is in the details. There are ways of advancing coordination tech which are positive in expectation, and ways of doing so which are harmful.</strong></p></li></ul><h2><strong>Why &#8216;defense-favoured&#8217; coordination tech</strong></h2><p>That&#8217;s why we&#8217;ve called this piece &#8216;defense-favoured coordination tech&#8217;, not just &#8216;coordination tech&#8217;. 
We think generic acceleration of coordination tech is somewhat fraught &#8212; <strong>our excitement is about thoughtfully run projects which are sensitive to the possible harms, and target carefully chosen parts of the design space</strong>.</p><p>We&#8217;re not yet confident which bits of the space are best, and we haven&#8217;t seen convincing analysis on this from others either. Part of the reason we&#8217;re publishing these design sketches is to encourage and facilitate further thinking on this question.</p><p>For now, we expect that there are good versions of all of the technologies we sketch below &#8212; but we&#8217;ve flagged potential harms where we&#8217;re tracking them, and encourage readers to engage sceptically and with an eye to how things could go badly as well as how they could go well.</p><h1>Fast facilitation</h1><p>Right now, coordinating within groups is often complex, expensive, and difficult. Groups often drop the ball on important perspectives or considerations, move too slowly to actually make decisions, or fail to coordinate at all.</p><p>AI could make facilitation much faster and cheaper, by processing many individual views in parallel, tracking and surfacing all the relevant factors, providing secure private channels for people to share concerns, and/or providing a neutral arbiter with no stake in the final outcome. It could also make it much more practical to scale facilitation and bring additional people on board without slowing things down too much.</p><h2><strong>Design sketch</strong></h2><p>An AI mediation system briefly and asynchronously interviews groups of 3&#8211;300 people, presents summary positions back to the group, and suggests next steps (including key issues to resolve). 
People approve or complain about the proposal, and the system iterates to appropriate depth for the importance of the decision.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2Fyu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2Fyu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 424w, https://substackcdn.com/image/fetch/$s_!2Fyu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 848w, https://substackcdn.com/image/fetch/$s_!2Fyu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 1272w, https://substackcdn.com/image/fetch/$s_!2Fyu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2Fyu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png" width="1456" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1897061,&quot;alt&quot;:&quot;Hand-drawn UI sketch of AI-powered coordination software showing admin setup inputs and a participant interface with options, discussion summaries, and an AI facilitator guiding group decision-making.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/192925664?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hand-drawn UI sketch of AI-powered coordination software showing admin setup inputs and a participant interface with options, discussion summaries, and an AI facilitator guiding group decision-making." title="Hand-drawn UI sketch of AI-powered coordination software showing admin setup inputs and a participant interface with options, discussion summaries, and an AI facilitator guiding group decision-making." 
srcset="https://substackcdn.com/image/fetch/$s_!2Fyu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 424w, https://substackcdn.com/image/fetch/$s_!2Fyu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 848w, https://substackcdn.com/image/fetch/$s_!2Fyu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 1272w, https://substackcdn.com/image/fetch/$s_!2Fyu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd35f32c6-8231-4887-86ef-649fdd8f835e_2875x1842.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Under the hood, it does something like:</p><ul><li><p>Gathers written context on the setting and decision</p></li><li><p>Holds brief, private conversations with each participant to understand their perspective</p></li><li><p>Builds a map of the issue at hand, involving key considerations and points of (dis)agreement</p><ul><li><p>Performs and integrates background research where relevant</p></li></ul></li><li><p>Identifies which people are most likely to have input that changes the picture</p></li><li><p>Distils down a shareable summary of the map, and seeks feedback from key parties</p></li><li><p>Proposes consensus statements or next steps for approval, iterating quickly to find versions that have as broad a backing as possible</p></li></ul><h2><strong>Feasibility</strong></h2><p>Fast facilitation seems fairly feasible technically. The <a href="https://www.science.org/doi/10.1126/science.adq2852">Habermas Machine</a> (2024) does a version of this that provided value to participants &#8212; and we have seen two years of progress in LLMs since then. And there are already facilitation services like <a href="https://chord.team/">Chord</a>. In general, LLMs are great at gathering and distilling lots of information, so this should be something they excel at. 
It&#8217;s not clear that current LLMs can already build accurate maps of arbitrary in-motion discourse, but they <a href="https://www.oliversourbut.net/i/182129031/structure-inference-and-discourse">probably could</a> with the right training and/or scaffolding.</p><p>Challenges for the technology include:</p><ul><li><p>Ensuring that it is more efficient, and offers a better user experience, for moving towards consensus than other, less AI-based approaches.</p></li><li><p>Remaining robust against abusive user behaviour (e.g. you don&#8217;t want individuals to get their way via prompt injection or blatant lying).</p></li></ul><p>Neither of these seems like a fundamental blocker. For example, to protect against abuse, it may be enough to maintain transparency so that people can check for abusive behaviour. (Or if users need to enter confidential information, there might be services which can confirm the confidential information without revealing it.)</p><h2><strong>Possible starting points // concrete projects</strong></h2><ul><li><p><strong>Build a baby version.</strong> This could help us notice obstacles or opportunities that would have been hard to predict in advance. You could focus on the UI or the tech side here, or try to help run pilots at specific organisations or in specific settings.</p></li><li><p><strong>Design ways to evaluate fast facilitation tools.</strong> This would make it easier to assess and improve performance. 
For example, you could create games/test environments with clear &#8220;win&#8221; and &#8220;failure&#8221; modes.</p></li><li><p><strong>Build subcomponents.</strong> For example:</p><ul><li><p>Bots that surface information anonymously.</p></li><li><p>Tools that try to surface areas of consensus or common knowledge as efficiently as possible, while remaining hard to game.</p></li></ul></li><li><p><strong>Make a meeting prep system.</strong> Focus first on getting good at meeting prep &#8212; creating an agenda and identifying considerations that need live discussion &#8212; to reduce possible unease about outsourcing decision-making to AI systems.</p></li><li><p><strong>Make a bot to facilitate discussions.</strong> This could be used in online community fora, or to survey experts.</p></li><li><p><strong>Design ways to create live &#8220;maps&#8221; of discussions.</strong> Fast facilitation is fast because it parallelises communication. This makes it more important to have good tools for maintaining shared context.</p></li></ul><h1>Automated negotiation</h1><p>High-stakes negotiation today involves adversarial communication between humans who have limited bandwidth.</p><p>Negotiation in the future could look more like:</p><ul><li><p>You communicate your desires openly with a negotiation delegate who is on your side and who asks questions only when needed to build a deeper model of your preferences.</p></li><li><p>The delegate goes away, and comes back with a proposal that looks pretty good, along with a strategic analysis explaining the tradeoffs / difficulties in getting more.</p></li></ul><h2><strong>Design sketch</strong></h2><p>Humans can engage AI delegates to represent them. 
The delegates communicate with each other via a neutral third party mediation system, returning to their principals with a proposal, or important interim updates and decision points.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z29j!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z29j!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 424w, https://substackcdn.com/image/fetch/$s_!z29j!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 848w, https://substackcdn.com/image/fetch/$s_!z29j!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 1272w, https://substackcdn.com/image/fetch/$s_!z29j!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z29j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png" width="1456" height="1090" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1090,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:799692,&quot;alt&quot;:&quot;Hand-drawn diagram of AI-powered automated negotiation showing a user and AI delegate iterating on proposals, evaluating options, and refining terms until agreement is reached.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/192925664?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hand-drawn diagram of AI-powered automated negotiation showing a user and AI delegate iterating on proposals, evaluating options, and refining terms until agreement is reached." title="Hand-drawn diagram of AI-powered automated negotiation showing a user and AI delegate iterating on proposals, evaluating options, and refining terms until agreement is reached." 
srcset="https://substackcdn.com/image/fetch/$s_!z29j!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 424w, https://substackcdn.com/image/fetch/$s_!z29j!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 848w, https://substackcdn.com/image/fetch/$s_!z29j!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 1272w, https://substackcdn.com/image/fetch/$s_!z29j!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0b30d861-7ff6-4884-8e23-e28d46184534_1999x1496.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Under the hood, this might look like:</p><ul><li><p>Delegate systems:</p><ul><li><p>Read over context documents and query principals about key points of uncertainty to build initial models of preferences.</p></li><li><p>Model the negotiation dynamics and choose strategic approaches to maximise value for their principal.</p></li><li><p>Go back to the principal with further detailed queries when something comes up that crosses an importance threshold and where they are insufficiently confident about being able to model the principal&#8217;s views faithfully.</p></li><li><p>Are ultimately trained to get good results by the principal&#8217;s lights.</p></li></ul></li><li><p>Neutral mediator system:</p><ul><li><p>Is run by a trusted third-party (or in higher stakes situations, perhaps is cryptographically secure with transparent code).</p></li><li><p>Discusses with all parties (either AI delegates, or their principals)</p><ul><li><p>Can hear private information without leaking that information to the other party</p><ul><li><p>Impossibility theorems mean that it will sometimes be strategically optimal for parties to misrepresent their position to the mediator (unless we give up on the ability to make many actually-good deals); however, we can seek a setup such that it is <em>rarely</em> a good idea to strategically misrepresent information, or that it <em>doesn&#8217;t help very much</em>, or that <em>it is hard to identify the circumstances in which it&#8217;s better to misrepresent</em></p></li></ul></li></ul></li><li><p>Searches for deals that will be thought well of by all parties, and proposes those to the 
delegates.</p></li><li><p>Is ultimately trained to help all parties reach fair and desired outcomes, while minimising the parties&#8217; incentives to misrepresent.</p></li></ul></li></ul><h2><strong>Feasibility</strong></h2><p>Some of the technical challenges to automated negotiation are quite hard:</p><ul><li><p>The kind of security needed for high-stakes applications isn&#8217;t possible today.</p></li><li><p>Getting systems to be deeply aligned with a principal&#8217;s best interests, rather than e.g. pursuing the principal&#8217;s short-term gratification via sycophancy, is an unsolved problem.</p></li></ul><p>That said, it&#8217;s already possible to experiment using current systems, and it may not be long before they start improving on the status quo for human negotiation. Low-stakes applications don&#8217;t require the same level of security, and will be a great training ground for learning how to set up higher-stakes systems and platforms. And practical alignment seems good enough for many purposes today.</p><h2><strong>Possible starting points // concrete projects</strong></h2><ul><li><p><strong>Build an AI delegate for yourself or your friends.</strong> See if you can get it to usefully negotiate on your behalf with your friends or colleagues. Or failing that, see if it can help you think through your own negotiation position before you need to communicate with others about it.</p></li><li><p><strong>Build a negotiation app with good UI.</strong> Using existing LLMs, build an app which helps people think through their negotiation position in a structured way. 
Focus on great UI.</p><ul><li><p>This could be non-interactive at first, and just involve communication between a human and the app, rather than between any AI systems.</p></li><li><p>But it builds the muscles of a) designing good UI for AI negotiation, and b) people actually using AI to help them with negotiation.</p></li></ul></li><li><p><strong>Run a pilot in an org or community you&#8217;re part of.</strong></p><ul><li><p>You could start with fairly low-stakes negotiations, like what temperature to set the office thermostat to or which discussion topics to discuss in a given meeting slot.</p></li><li><p>Experimenting with different styles of negotiation (in terms of how high the stakes are, how complex the structure is, and what the domain is) could be very valuable.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!F7UG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!F7UG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 424w, https://substackcdn.com/image/fetch/$s_!F7UG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 848w, https://substackcdn.com/image/fetch/$s_!F7UG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 1272w, 
https://substackcdn.com/image/fetch/$s_!F7UG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!F7UG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg" width="37" height="35" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/efbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:35,&quot;width&quot;:37,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!F7UG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 424w, https://substackcdn.com/image/fetch/$s_!F7UG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 848w, https://substackcdn.com/image/fetch/$s_!F7UG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 1272w, 
https://substackcdn.com/image/fetch/$s_!F7UG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fefbb018c-1bfb-4e90-b4c2-e5f6698d8112_37x35.svg 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h1>Arbitrarily easy arbitration</h1><p>Right now, the risk of expensive arbitration makes many deals unreachable. If disputes could be resolved cheaply and quickly using verifiably fair and neutral automated adjudicators, this could unlock massive coordination potential, enabling a multitude of cooperative arrangements that were previously prohibitively costly to make.</p><h2><strong>Design sketch</strong></h2><p>An &#8220;Arb-as-a-Service&#8221; layer plugs into contracts, platforms, and marketplaces. Parties opt in to standard clauses that route disputes to neutral AI adjudicators with a well-deserved reputation for fairness. In the event of a dispute, the adjudicator communicates with parties across private, verifiable evidence channels, investigating further as necessary when there are disagreements about facts. Where possible, they auto-execute remedies (escrow releases, penalties, or structured commitments). Human appeal exists but is rarely needed; sampling audits keep the system honest. 
Over time, this becomes ambient infrastructure for coordination and governance, not just commerce.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XhC8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XhC8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 424w, https://substackcdn.com/image/fetch/$s_!XhC8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 848w, https://substackcdn.com/image/fetch/$s_!XhC8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!XhC8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XhC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png" width="1456" height="1091" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1091,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1018573,&quot;alt&quot;:&quot;Hand-drawn diagram of AI arbitration system showing contract disputes handled by an automated arbitration bot, with data gathering, analysis, and a final decision or settlement outcome.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/192925664?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hand-drawn diagram of AI arbitration system showing contract disputes handled by an automated arbitration bot, with data gathering, analysis, and a final decision or settlement outcome." title="Hand-drawn diagram of AI arbitration system showing contract disputes handled by an automated arbitration bot, with data gathering, analysis, and a final decision or settlement outcome." 
srcset="https://substackcdn.com/image/fetch/$s_!XhC8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 424w, https://substackcdn.com/image/fetch/$s_!XhC8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 848w, https://substackcdn.com/image/fetch/$s_!XhC8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!XhC8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bb07ca4-ab95-4fca-9cd5-c328003d22fd_2732x2048.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>How this could work under the hood:</p><ol><li><p>Agreement ingestion</p><ul><li><p>Formal or natural language contracts are parsed and key terms extracted, with parties confirming the system&#8217;s interpretation before proceeding.</p></li><li><p>The system could also suggest pre-dispute modifications to make agreements clearer, flag potentially unenforceable terms, and maintain public precedent databases that help parties understand likely outcomes before committing.</p></li></ul></li><li><p>Automated discovery</p><ul><li><p>When disputes arise, an automated discovery process gathers relevant documentation, transaction logs, and communications from integrated platforms.</p></li><li><p>The system offers interviews and the chance to submit further evidence to each party.</p></li></ul></li><li><p>Deep consideration</p><ul><li><p>The system builds models of what different viewpoints (e.g. 
standard legal precedent; commonsense morality; each of the relevant parties) have to say on the situation and possible resolutions, to ensure that it is in touch with all major perspectives.</p></li><li><p>Where there are disagreements, the system simulates debate between reasonable perspectives.</p></li><li><p>It makes an overall judgement as to what is fairest.</p></li></ul></li><li><p>Transparent reasoning</p><ul><li><p>The system produces detailed explanations of its conclusions, with precedent citations and counterfactual analysis where appropriate.</p></li></ul></li><li><p>(Optional) Smart escrow integration</p><ul><li><p>Judgements automatically execute through cryptocurrency escrows or traditional payment rails, with graduated penalties for non-compliance.</p></li><li><p>In cases where the system detects evidence that is highly likely to be fraudulent, or other attempts to manipulate the system, it automatically adds a small sanction to the judgement, in order to disincentivise this behaviour.</p></li></ul></li><li><p>Opportunities for appeal</p><ul><li><p>Either party can pay a small fee to submit further evidence and have the situation reconsidered in more depth by an automated system.</p></li><li><p>For larger fees they can have human auditors involved; in the limit they can bring things to the courts.</p></li></ul></li></ol><h2><strong>Feasibility</strong></h2><p>LLMs can already do basic versions of steps 1&#8211;4, but there are difficult open technical problems in this space:</p><ul><li><p><strong>Judgement:</strong> Systems may not currently have good enough judgement to do steps 1, 3, and 4 in high-stakes contexts (and until recently, they clearly didn&#8217;t).</p></li><li><p><strong>Real-world evidence assessment:</strong> Systems don&#8217;t currently know how to handle conflicting evidence provided digitally about what happened in the real world.</p></li><li><p><strong>Verifiable fairness/neutrality:</strong> The full version of this technology would require a 
level of fairness and neutrality which isn&#8217;t attainable today.</p></li></ul><p>Those are large technical challenges, but we think it&#8217;s still useful to get started on this technology today, because iterating on less advanced versions of arbitration tech could help us to bootstrap our way to solutions. Particularly promising ways of doing that include:</p><ul><li><p>Starting in lower-stakes or easier contexts (for example, digital-only spaces avoid the challenge of establishing provenance for real-world evidence).</p></li><li><p>Creating evals, test environments and other infrastructure that helps us improve performance.</p></li></ul><p>On the adoption side, we think there are two major challenges:</p><ul><li><p><strong>Trust:</strong> As above, some amount of technical work is needed to make systems verifiably fair/neutral. But even if it becomes true that the systems are neutral, people need to build quite a high level of confidence that the system is genuinely impartial before they&#8217;ll bind themselves to its decisions for meaningful stakes.</p></li><li><p><strong>Legal integration:</strong> This tech is only useful to the extent that its arbitration decisions are recognised and enforced as legitimate by the traditional legal system, or are enshrined directly via contract in a self-enforcing way.</p><ul><li><p>(We are unsure how large a challenge this will be; perhaps you can write contracts today that are taken by the courts as robust. 
But it may be hard for parties to place much trust in them before they have been tested.)</p></li></ul></li></ul><p>Both of these challenges are reasons to start early (as there might be a long lead time), and to make work on arbitration tech transparent (to help build trust).</p><h2><strong>Possible starting points // concrete projects</strong></h2><ul><li><p><strong>Work with an arbitration firm.</strong> Work with (or buy) a firm already offering arbitration services to start automating parts of their core work, and scale up from there.</p></li><li><p><strong>Work with an online platform that handles arbitration.</strong> Use AI to improve their processes, and scale from there.</p></li><li><p><strong>Create a bot to settle informal disputes.</strong> Build an arbitration-as-a-service bot that people can use to settle informal disputes.</p></li><li><p><strong>Trial a system on internal disputes.</strong> This could be at your own organisation, another organisation, or a coalition of early adopter organisations.</p></li><li><p><strong>Run a pilot in parallel to regular arbitration.</strong> Run a pilot where an automated arbitration system is given access to all the relevant information to resolve disputes, and reaches its own conclusions &#8212; in parallel to the regular arbitration process, which forms the basis of the actual decision. You could partner with an arbitration firm, or potentially do this through a coalition of early adopter organisations, perhaps in combination with philanthropic funding.</p></li></ul><h1>Background networking</h1><p>We can only do things like collaborate, trade, or reconcile if we&#8217;re able to first find and recognise each other as potential counterparties. Today, people are brought into contact with each other through things like advertising, networking, and even blogging. 
But these mechanisms are slow and noisy, so many people remain isolated or disaffected, and potentially huge wins from coordination are left undiscovered.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Tech could bring much more effective matchmaking within reach. Personalised, context-sensitive AI assistance could carry out orders of magnitude more speculative matchmaking and networking. If this goes well, it might uncover many more opportunities for people to share and act on their common hopes and concerns.</p><h2><strong>Design sketch</strong></h2><p>A &#8216;matchmaking marketplace&#8217; of attentive, personalised helpers bustles in the background. When they find especially promising potential connections, they send notifications to the principals or even plug into further tools that automatically take the first steps towards seriously exploring the connection.</p><p>You can sign up as an individual or an existing collective. If you just want to use it passively, you give a delegate system access to your social media posts, search profiles, chatbot history, etc. &#8212; so this can be securely distilled into an up-to-date representation of hopes, intent, and capabilities. 
The more proactive option is to inject deliberate &#8216;wishes&#8217; through chat and other fluent interfaces.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9hlE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9hlE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 424w, https://substackcdn.com/image/fetch/$s_!9hlE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 848w, https://substackcdn.com/image/fetch/$s_!9hlE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!9hlE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9hlE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png" width="1456" height="1091" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1091,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1591596,&quot;alt&quot;:&quot;Hand-drawn diagram of AI background networking tool showing a network helper scanning connections, identifying opportunities, and generating proposals to connect users and coordinate groups.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/192925664?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hand-drawn diagram of AI background networking tool showing a network helper scanning connections, identifying opportunities, and generating proposals to connect users and coordinate groups." title="Hand-drawn diagram of AI background networking tool showing a network helper scanning connections, identifying opportunities, and generating proposals to connect users and coordinate groups." 
srcset="https://substackcdn.com/image/fetch/$s_!9hlE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 424w, https://substackcdn.com/image/fetch/$s_!9hlE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 848w, https://substackcdn.com/image/fetch/$s_!9hlE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 1272w, https://substackcdn.com/image/fetch/$s_!9hlE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39ccb91f-9159-4471-97f9-ede02126656a_2732x2048.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Under the hood, there are a few different components working together:</p><ul><li><p>Interoperable, secure &#8216;wish profiling&#8217; systems which identify what different participants want.</p><ul><li><p>People connect their profiles on existing services (social media, chatbot logs, email, etc.).</p></li><li><p>LLM-driven synthesis (perhaps combined with other forms of machine learning) curates a private profile of user desires.</p></li><li><p>Optionally, chatbot-style assistance can interview users on the points of biggest uncertainty, to build a more accurate profile.</p></li></ul></li><li><p>A searchable &#8216;wish registry&#8217; which organises large collections of wants and offers, while maintaining semi-privacy.</p><ul><li><p>Each user&#8217;s interests can run searches, finding potential matches and surfacing only enough information about them to know whether they are worth exploring further.</p></li></ul></li></ul><h2><strong>Feasibility</strong></h2><p>A big challenge here is privacy and surveillance. Doing background networking comprehensively requires sensitive data on what individuals really want. This creates a double-edged problem:</p><ul><li><p>If sensitive data is too broadly available, it can be used for surveillance, harassment, or exploitation, including by big corporations or states.</p></li><li><p>If sensitive data is completely private, it opens up the possibility of collusion, for example among criminals.</p></li></ul><p>This is a pretty challenging trade-off, with big costs on both sides. 
Perhaps a filtering system that determines who can see which pieces of data could prevent data extraction for surveillance purposes while maintaining enough transparency to prevent collusion.</p><p>Ultimately, we&#8217;re not sure how best to approach this problem. But we think it&#8217;s important that people think more about it, as we expect that by default, this sort of technology will be built anyway, in a way that isn&#8217;t sufficiently sensitive to these privacy and surveillance issues. Early work which foregrounds solutions to these issues could make a big difference.</p><p>Other potential issues seem easier to resolve:</p><ul><li><p>Technically, background networking tools already seem within reach using current systems. Large-scale deployments would require indexing and registry infrastructure, but it seems possible to get started on these using current systems.</p><ul><li><p>Background networking could be implemented in either a centralised or a decentralised way. It&#8217;s not clear which is best, though decentralised implementations will be more portable.</p></li></ul></li><li><p>Adoption also seems likely to work, because there are incentives for people to pay to discover trade and cooperation opportunities they would otherwise have missed, analogous to exchange or brokerage fees. There are some trickier parts, but we expect them to be surmountable in the end (when adoption happens seems more uncertain than whether it happens at all):</p><ul><li><p>In the early stages when not many people are using it, the value of background networking will be more limited. Possible responses include targeting smaller niches initially, and proactively seeking out additional network beneficiaries.</p></li><li><p>It&#8217;s harder to incentivise people to pay for speculative things like uncovering groups they&#8217;d love that don&#8217;t yet exist. 
You could get around this using entrepreneurial or philanthropic speculation (compare the <a href="https://link.springer.com/article/10.1023/A:1004957109535">dominant assurance contract</a> model and related payment incentivisation schemes).</p></li></ul></li></ul><h2><strong>Possible starting points // concrete projects</strong></h2><ul><li><p><strong>Work with existing matchmakers to improve their offering.</strong> Find groups that are already doing matchmaking and are eager for better systems &#8212; perhaps among community organisers, businesses, recruiters or investors. Work with them to understand the pain points in their current networking, and what automated offerings would be most appealing. Then build those tools and systems.</p></li><li><p><strong>Build a networking tool for a specific community.</strong> Build a custom networking system for a particular group or subculture. For example, this could look like a networking app or a plug-in to an existing online forum. This could start delivering value fairly quickly, and provide a good opportunity for iteration.</p></li></ul><h1>Structured transparency for democratic oversight</h1><p>Today, citizens in democracies have limited mechanisms to verify whether institutions&#8217; public claims are consistent with their internal evidence:</p><ul><li><p>The baseline is highly opaque.</p></li><li><p>Freedom of information systems help, but can be evaded by non-cooperating institutions.</p></li><li><p>Public inquiries can be reasonably thorough, but are expensive and slow.</p></li><li><p>Full transparency has many costs and is typically highly resisted.</p></li></ul><p>This is costly &#8212; e.g. the UK Post Office scandal over its Horizon IT system led to hundreds of wrongful prosecutions that could have been avoided. And it creates bad incentives for those running the institutions.</p><p>AI has the potential to change this. 
Instead of oversight being expensive, reactive, and slow, automated systems could in theory have real-time but sandboxed access to institutional data, routinely reviewing operational records against public claims and surfacing inconsistencies as they emerge.</p><p>Where confidential monitoring helps willing parties verify each other, <a href="https://aiprospects.substack.com/p/security-without-dystopia-new-options">structured transparency</a> for democratic oversight aims to hold institutions accountable to the broader public.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><h2><strong>Design sketch</strong></h2><p>When an oversight body wants to verify facts about the behaviour of another institution, it requests comprehensive data about the internal operations of that institution. AI systems are tasked with careful analysis of the details, flagging the type and severity of any potential irregularities. Most of the data never needs human review.</p><p>In the simpler version, this is just a tool which expands the capacity of existing oversight bodies. Even here, the capacity expansion could be relatively dramatic &#8212; this kind of semi-structured data analysis is the kind of work that AI models can excel at today &#8212; without needing to trust that the systems are infallible (since the most important irregularities will still have human review).</p><p>A more ambitious version treats this as a novel architecture for oversight. AI systems operate continuously within secure environments that don&#8217;t give any humans access to the full dataset. They can flag inconsistencies as institutional data is deposited rather than waiting for an investigation to begin. 
For maximal transparency, summaries could be made available to the public in real-time, without revealing any confidential information that the public does not have rights to.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hye6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hye6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 424w, https://substackcdn.com/image/fetch/$s_!hye6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 848w, https://substackcdn.com/image/fetch/$s_!hye6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 1272w, https://substackcdn.com/image/fetch/$s_!hye6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hye6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png" width="1456" height="854" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:996717,&quot;alt&quot;:&quot;Hand-drawn diagram of AI structured transparency system showing secure data collection, analysis of institutional activity, and selective public reporting for oversight and accountability.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/192925664?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hand-drawn diagram of AI structured transparency system showing secure data collection, analysis of institutional activity, and selective public reporting for oversight and accountability." title="Hand-drawn diagram of AI structured transparency system showing secure data collection, analysis of institutional activity, and selective public reporting for oversight and accountability." 
srcset="https://substackcdn.com/image/fetch/$s_!hye6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 424w, https://substackcdn.com/image/fetch/$s_!hye6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 848w, https://substackcdn.com/image/fetch/$s_!hye6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 1272w, https://substackcdn.com/image/fetch/$s_!hye6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21063359-cb01-493f-abec-14893bff7ae3_1999x1173.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Under the hood, this might involve:</p><ul><li><p>Secure data repositories, such that institutions routinely share operational data with a sandboxed environment operated by or on behalf of the oversight body, without any regular human access to the data.</p></li><li><p>Continuous ingestion and indexing of institutional public outputs (press releases, regulatory filings, budget documents, etc.) into a searchable database.</p></li><li><p>Automated cross-referencing between public claims and internal records.</p></li><li><p>Highlighting of potential issues (mismatches between public statements and private information, as well as decisions made in violation of normal procedures).</p></li><li><p>Further automated investigation of potential issues, with humans alerted only when an issue is both sufficiently large and flagged with sufficient confidence.</p></li><li><p>Importantly, the sandbox outputs its findings but not the underlying data; if the underlying data itself needs to be made transparent, that is a separate oversight question.</p></li></ul><h2><strong>Feasibility</strong></h2><p>There are two important aspects to feasibility here: technical and political.</p><p>Technically, decent reliability in the core functionality is possible today. Getting up to extremely high reliability, so that the system could be trusted not to flag too many false positives across very large amounts of data, might be a reach with present systems, but it is exactly the kind of capability that commercial companies should be incentivised to develop for business use.</p><p>Political feasibility may vary a lot with the degree of ambition. 
The simplest versions of this technology might in many cases simply be adopted by existing oversight bodies to speed up their current work. Anything that requires them to obtain much more data (e.g. to put in the sandboxed environments) might require legislative change &#8212; which may be more achievable after the underlying technology can be shown to be highly reliable.</p><p>Challenges include:</p><ul><li><p>Adversarial dynamics: the technical bar for verifying claims against actively adversarial institutions (who are manipulating deposited data, potentially via AI) is substantially higher.</p><ul><li><p>This is the bar that we&#8217;d need to reach for confidential monitoring below.</p></li></ul></li><li><p>Defamation risk: the downsides of false positives, where your system reports someone misrepresenting things when they were not, could be significant (although this can perhaps be mitigated by giving people a right of rebuttal, where they supply further data to the AI systems which monitor the confidential data streams).</p></li><li><p>Avoiding abuse: designing the systems so that they do not expose the confidential data, and cannot be weaponised to ruin the reputation of a department with entirely normal levels of error.</p></li></ul><p>Ultimately, the more transformative potential of this technology comes in the medium term, with new continuous data access for oversight bodies. But this is likely to require legislative change, and the institutions subject to it may resist. Perhaps the most promising adoption pathway is to demonstrate value through voluntary pilots with oversight bodies that already have data access and want better tools. 
This could build the evidence base (and hence political constituency) for wider and deeper deployment.</p><h2><strong>Possible starting points // concrete projects</strong></h2><ul><li><p><strong>Retrospective validation on historical cases.</strong> Apply consistency-checking tools to document sets from well-understood historical cases where the relevant internal documents have subsequently been released (e.g. Enron emails). This builds the technical foundation, and demonstrates the concept without requiring any current institutional access.</p></li><li><p><strong>Institutional public statement reliability tracker.</strong> Build a tool tracking whether agencies&#8217; public claims about performance, spending, or policy outcomes are consistent with publicly available data &#8212; statistical releases, budget documents, prior statements. Start with a single policy domain. This requires no institutional partnerships and builds a public constituency for structured transparency. This is a version of <a href="https://www.forethought.org/research/design-sketches-collective-epistemics#reliability-tracking">reliability tracking</a>, applied specifically to institutional accountability.</p></li><li><p><strong>Pilot a FOIA exemption assessment tool.</strong> Partner with an Inspector General office to build a tool that reviews withheld documents and assesses whether claimed exemptions (national security, personal privacy, deliberative process) are applied appropriately. The IG already has legal access under the Inspector General Act; the tool helps them do their existing job faster and builds the working relationship needed for more ambitious deployments. 
This is also a natural testbed for the sandboxed architecture in miniature &#8212; the tool operates within the IG&#8217;s secure environment, producing exemption-appropriateness findings without the documents themselves leaving the system.</p></li></ul><h1>Confidential monitoring and verification</h1><p>Monitoring and verifying that a counterparty is keeping up their side of the deal is currently expensive and noisy. Many deals currently aren&#8217;t reachable because they&#8217;re too hard to monitor. Confidential AI-enabled monitoring and verification could unlock many more agreements, especially in high-stakes contexts like international coordination where monitoring is currently a bottleneck.</p><h2><strong>Design sketch</strong></h2><p>When organisation A wants to make credible attestations about their work to organisation B, without disclosing all of their confidential information, they can mutually contract an AI auditor, specifying questions for it to answer. The auditor will review all of A&#8217;s data (making requests to see things that seem important and potentially missing), and then produce a report detailing:</p><ul><li><p>Its conclusions about the specified questions.</p></li><li><p>The degree to which it is satisfied that it had good data access, that it didn&#8217;t run into attempts to distort its conclusions, etc.</p></li></ul><p>This report is shared with A and B, then A&#8217;s data is deleted from the auditor&#8217;s servers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ukFu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ukFu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 424w, https://substackcdn.com/image/fetch/$s_!ukFu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 848w, https://substackcdn.com/image/fetch/$s_!ukFu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 1272w, https://substackcdn.com/image/fetch/$s_!ukFu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ukFu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png" width="1456" height="728" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:728,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2190913,&quot;alt&quot;:&quot;Hand-drawn diagram of AI confidential monitoring system showing two parties sharing data securely, system processing information privately, and returning verified results without exposing sensitive 
details.&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/192925664?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Hand-drawn diagram of AI confidential monitoring system showing two parties sharing data securely, system processing information privately, and returning verified results without exposing sensitive details." title="Hand-drawn diagram of AI confidential monitoring system showing two parties sharing data securely, system processing information privately, and returning verified results without exposing sensitive details." srcset="https://substackcdn.com/image/fetch/$s_!ukFu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 424w, https://substackcdn.com/image/fetch/$s_!ukFu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 848w, https://substackcdn.com/image/fetch/$s_!ukFu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 1272w, https://substackcdn.com/image/fetch/$s_!ukFu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F46384b8c-0c0a-45e4-90a9-7b46683ff95d_2611x1306.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>Under the hood, this might involve:</p><ul><li><p>Building a Bayesian knowledge graph, establishing hypotheses, and understanding what evidence suggests about those hypotheses.</p></li><li><p>Agentic investigatory probes into the confidential data, in order to form grounded assessments on the specified questions.</p></li></ul><p>More ambitious versions might hope to obviate the need for trust in a third party, and provide reasons to trust the hardware &#8212; that it really is running the appropriate unbiased algorithms, that it cannot send side-channel information or retain the data, etc. 
Perhaps at some point you could have robot inspectors physically visiting A&#8217;s offices, interviewing employees, etc.</p><h2><strong>Feasibility</strong></h2><p>Compared to some of the <a href="https://www.forethought.org/research/design-sketches-for-a-more-sensible-world">other technologies</a> we discuss, this feels technologically difficult, in that the really useful versions of the tech may require very high reliability of certain types.</p><p>Nonetheless, we could hope to lay the groundwork for the general technological category now, so that people are well-positioned to move towards implementing the mature technology as early as is viable. Some low-confidence guesses about possible early applications include:</p><ul><li><p>Legal audits &#8212; for example, verifying claims that the documents not disclosed during a discovery process are only those which are protected by privilege.</p></li><li><p>Financial audits &#8212; e.g. for the purpose of proving viability to investors without disclosing detailed accounts.</p></li><li><p>Supply chain verification &#8212; e.g. 
demonstrating that products were ethically sourced without exposing the suppliers.</p></li></ul><h2><strong>Possible starting points // concrete projects</strong></h2><ul><li><p><strong>Start building prototypes.</strong> Build a system which tries to detect whether it is running in a real or a counterfeit environment, and measure its success.</p></li><li><p><strong>Work with a law or financial auditing firm.</strong> Work with (or buy) a firm that does this kind of work, and experiment with how to robustly automate it while retaining very high levels of trustworthiness.</p></li><li><p><strong>Explore the viability of complementary technology.</strong> For example, you could investigate the feasibility of demonstrating exactly what code is running on a particular physical computer that is in the room with both parties.</p></li></ul><h1>Cross-cutting thoughts</h1><h2><strong>Some cross-cutting technologies</strong></h2><p>We&#8217;ve pulled out some specific technologies, but there&#8217;s a whole infrastructure that could eventually be needed to support coordination (including but not limited to the specific technologies we&#8217;ve sketched above). Some cross-cutting projects which seem worth highlighting are:</p><h3><strong>AI delegates and preference elicitation</strong></h3><p>Many of the technologies we sketched above either benefit from or require agentic AI delegates who can represent and act for a human principal. 
Developing customisable platforms could be useful for multiple kinds of tech, like background networking, fast facilitation, and automated negotiation.</p><p>Some ways to get started:</p><ul><li><p><strong>Direct preference elicitation</strong>: develop efficient and appealing interview-style elicitation of values, wishes, preferences and asks.</p></li><li><p><strong>Passive data ingestion</strong>: build a tool that (consensually) ingests and distils all the available online content about a person &#8212; social media, browsing history, email, etc &#8212; and extracts principles from it (cf <a href="https://arxiv.org/abs/2406.06560">inverse constitutional AI</a>).</p></li></ul><p>One clarification is that though agentic AI delegates would be useful for some of the coordination tech above, it needn&#8217;t be the same delegate doing the whole lot for a single human:</p><ul><li><p>You could have different delegates for different applications.</p></li><li><p>Some delegates might represent groups or coalitions.</p></li><li><p>Some delegates could be short-lived, and spun up for some particular time-bounded purpose.</p></li></ul><h3><strong>Charter tech</strong></h3><p>A lot of coordination effort between people and organisations goes not into making better object-level decisions, but establishing the rules or norms for future coordination &#8212; e.g. votes on changing the rules of an institution. It is possible that coordination tech will change this basic pattern, but as a baseline we assume that it will not. In that case, making such meta-level coordination go well would also be valuable.</p><p>One way to help it go well is by making the governance dynamics more transparent. Voting procedures, organisational charters, platform policies, treaty provisions, etc. create incentives and equilibria that play out over time, often in ways the framers didn&#8217;t anticipate. 
Let&#8217;s call any technology which helps people to better understand governance dynamics, or to make those dynamics more transparent, &#8216;charter tech&#8217;. In some sense this is a form of epistemic tech; but as the applications are always about coordination, we have chosen to group it with other coordination technologies. We think charter tech could be important in two ways:</p><ol><li><p>Through directly improving the governance dynamics in question, helping to avoid capture, conflict, and lock-in.</p></li><li><p>Through compounding effects on future coordination, which will unfold in the context of whatever governance structures are in place.</p></li></ol><p>Charter tech could be used in a way that is complementary to any of the above technologies (if/when they are used for governance-setting purposes), although it can also stand alone.</p><p>For the sake of concreteness, here is a sketch of what charter tech could look like:</p><ul><li><p>A &#8220;governance dynamics analyser&#8221; that ingests descriptions of constitutions, charters, policies or community norms, builds models of power, incentives, and information flow, and then (a) forecasts likely equilibria and failure modes, (b) red-teams for strategic abuse,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> and (c) proposes safer rule variants that preserve the framers&#8217; intent.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a></p></li></ul><ul><li><p>While this tool can be called actively if needed, there is also a classifier running quietly in the background of organisational docs/emails, and when it detects a situation where power dynamics and governance rules are relevant, it runs an assessment &#8212; promoting this to user attention only in cases where the proposed rules are likely to be problematic.</p></li></ul><p>Note that 
charter tech could be used to cause harm if access isn&#8217;t widely distributed. Vulnerabilities can be exploited as well as patched, and a tool that makes it easier to identify governance vulnerabilities could be used to facilitate corporate capture, backsliding or coups. Provided the technology is widely distributed and transparent, we think that charter tech could still be very beneficial &#8212; particularly as there may be many high-stakes governance decisions to make in a short period during an intelligence explosion, and the alternative of &#8216;do our best without automated help&#8217; seems pretty non-robust.</p><p>Some ways to get started on using AI to make governance dynamics more transparent:</p><ul><li><p><strong>Work with communities that iterate frequently on governance</strong> (DAOs, open-source projects) to test analyses against what actually happens when rules change.</p></li><li><p><strong>Compile a pattern library of governance failures and successes</strong>, documented in enough detail to inform automated analysis.</p></li><li><p><strong>Build simulation environments</strong> where proposed rules can be stress-tested against populations of agents with varying goals, including adversarial ones.</p></li><li><p><strong>Partner with mechanism design researchers</strong> to identify which aspects of their formal analysis can be automated and applied to less formal real-world documents.</p></li></ul><h2><strong>Adoption pathways</strong></h2><p>Many of these technologies will be directly incentivised economically. 
There are clear commercial incentives to adopt faster, cheaper methods of facilitation, negotiation, arbitration, and networking.</p><p>However, adoption seems more challenging in two important cases:</p><ul><li><p><strong>Adoption by governments and broader society.</strong> Many of the most important benefits of coordination tech for society will come from government and broad social adoption, but these groups will be less impacted by commercial incentives. This bites particularly hard for technologies that could be quite expensive in terms of inference compute, like fast facilitation, arbitration and negotiation. By default, these technologies might differentially help wealthy actors, leaving complex societal-level coordination behind. We think that the big levers on this set of challenges are:</p><ul><li><p><strong>Building trust and legitimacy earlier,</strong> by getting started sooner, building transparently, and investing in evals and other infrastructure to demonstrate performance.</p></li><li><p><strong>Targeting important niches that might be slower to adopt by default.</strong> More research would be good here, but two niches that seem potentially important are:</p><ul><li><p>Coordination among and between very large groups, like whole societies. This might be both strategically important and lag behind by default.</p></li><li><p>International diplomacy. 
Coordination tech will probably be adopted more slowly in diplomacy than in business, but there might be very high-stakes applications there.</p></li></ul></li></ul></li><li><p><strong>Adoption of confidential monitoring and structured transparency.</strong> These technologies are less accessible with current models and may require large upfront investments, while many of the benefits are broadly distributed.</p><ul><li><p>This makes it less likely that commercial incentives alone will be enough, and makes philanthropic and government funding more desirable.</p></li></ul></li></ul><h2><strong>Other challenges</strong></h2><p>The big challenge is that coordination tech (especially confidential coordination tech) is dual use, and could empower bad actors as much as or more than good ones.</p><p>There are a few ways that coordination tech could lead to shifts in the balance of power (positive or negative):</p><ul><li><p>Some actors could get earlier and/or better access to coordination tech than others.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li></ul><ul><li><p>Actors that face particular barriers to coordination today could be asymmetrically unblocked by coordination tech.</p></li><li><p>Individuals and small groups could become more powerful relative to the coordination mechanisms we already have, like organisations, ideologies, and nation states.</p></li></ul><p>It&#8217;s inherently pretty tricky to determine whether these power shifts would be good or bad overall, because that depends on:</p><ul><li><p>Value judgements about which actors <em>should</em> hold power.</p></li><li><p>How contingent power dynamics play out.</p></li><li><p>Big questions like whether ideologies or states are better or worse than the alternatives.</p></li><li><p>Predictions about how social dynamics will equilibrate in an AI era that looks very different to our world.</p></li></ul><p>However, as 
we said <a href="https://newsletter.forethought.org/i/192925664/why-coordination-tech-matters">above</a>, it&#8217;s clear that coordination tech might have significant harmful effects, through enabling:</p><ul><li><p>Large corporations to collude with each other against the interests of the rest of society.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p></li></ul><ul><li><p>A small group of actors to plot a <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">coup</a>.</p></li><li><p>More selfishness and criminality, as social mechanisms of coordination are replaced by automated ones which don&#8217;t incentivise prosociality to the same extent.</p></li></ul><p>We don&#8217;t think that this challenge is insurmountable, though it is serious, for a few reasons:</p><ul><li><p><strong>The upsides are very large.</strong> Coordination tech might be close to necessary for safely navigating challenges like the development of AGI, and could empower actors to coordinate <em>against</em> the kinds of misuse listed above.</p></li><li><p><strong>The counterfactual is that coordination tech is developed anyway, but with less consideration of the risks and less broad deployment.</strong> We think that this set of technologies is going to be sufficiently useful that it&#8217;s close to inevitable that they get developed at some point. By engaging early with this space, we can have a bigger impact on a) which versions of the technology are developed, b) how seriously the downsides are taken by default, c) how soon these systems are deployed broadly.</p></li><li><p><strong>Some applications seem robustly good.</strong> For example, the potential for misuse is low for technologies like transparent facilitation or widely deployed charter tech. 
More generally, we expect that projects that are thoughtfully and sensitively run will be able to choose directions which are robustly beneficial.</p></li></ul><p>That said, we think this is an open question, and would be very keen to see more analysis of the possible harms and benefits of different kinds of coordination tech, and which versions (if any) are robustly good.</p><p><em>This article has gone through several rounds of development, and we experimented with getting AI assistance at various points in the preparation of this piece. We would like to thank Anthony Aguirre, Alex Bleakley, Max Dalton, Max Daniel, Raymond Douglas, Owain Evans, Kathleen Finlinson, Lukas Finnveden, Ben Goldhaber, Ozzie Gooen, Hilary Greaves, Oliver Habryka, Isabel Juniewicz, Will MacAskill, Julian Michael, Justis Mills, Fin Moorhouse, Andreas Stuhm&#252;ller, Stefan Torges, Deger Turan, Jonas Vollmer, and Linchuan Zhang for their input; and to apologise to anyone we&#8217;ve forgotten.</em></p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/design-sketches-defense-favoured-coordination-tech">on our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>We&#8217;re highlighting six particular technologies, and clustering them all as &#8216;coordination technologies&#8217;. Of course in reality some of the technologies (and clusters) blur into each other, and they&#8217;re just examples in a high-dimensional possibility space, which might include even better options. 
But we hope by being concrete we can help more people to start seriously thinking about the possibilities.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>For example, in a similar way to that described in <a href="https://intelligence-curse.ai/">the intelligence curse</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Meanwhile small cliques with clear interests often have an easier time identifying and therefore acting on their shared interests &#8212; in extreme cases resulting in harmful cartels, oligarchies, and so on. That&#8217;s also why tyrants throughout history have sought to limit people&#8217;s networking power.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Both confidential monitoring and what we are calling structured transparency for democratic oversight are aspects of structured transparency in the way that Drexler uses the term.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This red-teaming could be arbitrarily elaborate, from simple LM-based once-over screening to RAG-augmented lengthy analysis to expansive simulation-based probing and stress-testing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div 
class="footnote-content"><p>Under the hood, this might involve:</p><ol><li><p>Parsing &amp; modelling the rules</p><ul><li><p>Convert informal descriptions or formal rules into a typed governance graph: roles, permissions, decision thresholds, delegation, auditability, and recourse</p></li><li><p>Note uncertainties; seek clarification or highlight ambiguities</p></li></ul></li><li><p>A search for possible issues</p><ul><li><p>Pattern library of classic failure modes (agenda control, principal&#8211;agent issues, collusion, etc.)</p><ul><li><p>Assessment of potential vulnerability to the different failure modes</p></li></ul></li></ul></li><li><p>First-principles analysis</p><ul><li><p>Running direct searches for abuse, or multi-agent simulations (including some nefarious actors) to stress-test the proposed system</p></li></ul></li><li><p>Explainer</p><ul><li><p>Distilling down the output of the analysis into a few key points</p><ul><li><p>Providing auditable evidence where relevant</p></li></ul></li><li><p>Including points about how variations of the mechanism might make things better or worse</p></li></ul></li></ol></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Note that this is significantly a question about adoption pathways as discussed in the <a href="https://newsletter.forethought.org/i/192925664/adoption-pathways">previous section</a>, rather than an independent question.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>For example, in a similar way to that described in <a href="https://intelligence-curse.ai/">the intelligence curse</a>.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI for AI for 
Epistemics]]></title><description><![CDATA[This article was created by Forethought. See the original on our website.]]></description><link>https://newsletter.forethought.org/p/ai-for-ai-for-epistemics</link><guid isPermaLink="false">https://newsletter.forethought.org/p/ai-for-ai-for-epistemics</guid><dc:creator><![CDATA[Owen Cotton-Barratt]]></dc:creator><pubDate>Wed, 01 Apr 2026 16:11:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3b54c3b6-1fa2-4fa7-af22-108f3dfbaa13_2381x1422.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/ai-for-ai-for-epistemics">on our website</a>.</em><br><br>We feel conscious that rapid AI progress could transform all sorts of cause areas. But we haven&#8217;t previously analysed what this means for AI for epistemics, a field close to our hearts. In this article, we attempt to rectify this oversight.</p><h1><strong>Summary</strong></h1><p>AI-powered tools and services that help people figure out what&#8217;s true (&#8220;AI for epistemics&#8221;) could matter a lot.</p><p>As R&amp;D is increasingly automated, AI systems will play a larger role in the process of developing such AI-based epistemic tools. This has important implications. Whoever is willing to devote sufficient compute will be able to build strong versions of the tools, quickly. Eventually, the hard part won&#8217;t be building useful systems, but making sure people trust the right ones, and making sure that they are truth-tracking even in domains where that&#8217;s hard to verify.</p><p>We can do some things now to prepare. Incumbency effects mean that shaping the early versions for the better could have persistent benefits. Helping build appetite among socially motivated actors with deep pockets could enable the benefits to come online sooner, and in safer hands. 
And in some cases, we can identify particular things that seem likely to be bottlenecks later, and work on those directly.</p><h1><strong>Background: AI for epistemics</strong></h1><p>AI for epistemics &#8212; i.e. getting AI systems to give more truth-conducive answers, and building tools that help the epistemics of the users &#8212; seems like a big deal to us. Some past things we&#8217;ve written on the topic include:</p><ul><li><p><a href="https://arxiv.org/abs/2110.06674">Truthful AI</a></p></li><li><p><a href="https://www.forethought.org/research/whats-important-in-ai-for-epistemics">What&#8217;s Important in &#8220;AI for Epistemics&#8221;?</a></p></li><li><p><a href="https://www.forethought.org/research/ai-tools-for-existential-security">AI Tools for Existential Security</a></p></li><li><p><a href="https://www.forethought.org/research/design-sketches-collective-epistemics">Design sketches: collective epistemics</a></p></li><li><p><a href="https://www.forethought.org/research/design-sketches-tools-for-strategic-awareness">Design sketches: tools for strategic awareness</a></p></li></ul><p>These past articles mostly take the perspective of &#8220;how can people build AI systems which do better by these lights?&#8221;. But maybe we should be thinking much more about what changes when people can use AI tools to do increasingly large fractions of the development work!</p><h1><strong>The shift in what drives AI-for-epistemics progress</strong></h1><p>Right now, AI-for-epistemics tools are constrained by two main bottlenecks: the quality of the underlying AI systems, and whether people have invested serious development effort in building the tools to use those systems.</p><p>The balance of bottlenecks is changing. Two years ago, the quality of underlying AI systems was the central bottleneck. Today, it is much less so &#8212; many useful tools could probably work based on current LLMs. 
It is likely still a constraint on how good the systems can be, and will remain so for a while even as the underlying models get stronger, but it is less of a fundamental blocker. Development investment has therefore become a bigger bottleneck &#8212; <a href="https://www.forethought.org/research/design-sketches-for-a-more-sensible-world">there are a number of applications which we are pretty confident could be built to a high usefulness level today, and just haven&#8217;t been (yet)</a>.</p><p>But bottlenecks will continue to shift. AI is increasingly driving research and software development. As AI systems get stronger, it may become possible to turn a large compute budget into a lot of R&amp;D. This could include product design, engineering, experiment design, direction-setting, etc. Actors with lots of compute could direct this towards building epistemic tools.</p><p>Therefore, as AI-driven R&amp;D accelerates, other inputs to AI for epistemics are more likely to become key bottlenecks:</p><ul><li><p><strong>Compute.</strong> Automated R&amp;D may require a lot of compute. This could be for inference (running the analogues of human researchers); for running experiments; and perhaps for training specialized AI systems. This means the actors who can build the best epistemic tools may be those with deep pockets.</p></li><li><p><strong>Adoption and trust.</strong> Even very good tools don&#8217;t help if nobody uses them, or if the wrong people use them and the right people don&#8217;t. Adoption is partly a function of trust, and trust is partly a function of adoption &#8212; early tools shape what people come to rely on.</p></li><li><p><strong>Ground truth evaluation.</strong> To make an epistemic tool good, you need some signal for what &#8220;good&#8221; means. 
This already shapes AI applications a lot &#8212; part of the reason coding agents are so good is that there&#8217;s great access to ground truth about what works.</p><ul><li><p>For some epistemic applications this is relatively straightforward (e.g. forecasting accuracy). For others it&#8217;s hard (e.g. what makes a conceptual clarification actually clarifying, rather than just satisfying?).</p></li><li><p>Most tools can probably reach a certain degree of usefulness without running into this problem, just piggybacking on base models making generally sensible judgements.</p></li><li><p>We can expect it to bite when you try to make them very good: if you don&#8217;t have a way of assessing quality, it could be hard to push to objectively excellent levels.</p></li><li><p>One basic solution is to rely on human judgement: either via humans providing labels and demonstrations to train against, or via human developers exercising their judgement in other parts of the process (such as when defining scaffolds). But this becomes disproportionately more expensive as R&amp;D becomes more automated.</p></li></ul></li></ul><p>These basic points are robust to whether R&amp;D is fully automated, or &#8220;merely&#8221; represents a large uplift to human researchers. But the most important bottlenecks will vary across applications and will continue to shift over time.</p><h1><strong>What this unlocks</strong></h1><p>Automated R&amp;D means that strong &#8220;AI for epistemics&#8221; tools could come online on a compressed timeline.</p><p>This is an exciting opportunity! 
Upgrading epistemics could better position us to avoid existential risk and navigate through the <a href="https://strangecities.substack.com/p/the-choice-transition">choice transition</a> well.</p><p>If everything is moving fast, it may matter a lot <a href="https://www.forethought.org/research/ai-tools-for-existential-security#theres-meaningful-room-to-accelerate-some-applications">exactly what sequence we get capabilities in</a>. It may therefore be crucial to make serious investments in building these powerful applications (rather than wait until such time as they are trivially cheap).</p><h1><strong>Risks from rapid progress in AI for epistemics</strong></h1><p>There are also a number of ways that rapid (and significantly automated) progress in AI-for-epistemics applications could go wrong. We need to be tracking these in order to guard against them.</p><p>In our view, the two biggest risks are:</p><ul><li><p>Epistemic misalignment: because of ground truth issues, powerful tools steer our thoughts in directions other than those which are truth-tracking, in ways that we fail to detect.</p></li><li><p>Trust lock-in: if a lot of people come to trust tools or ecosystems that don&#8217;t deserve that trust, this might be self-perpetuating if those tools continue to recommend themselves.</p></li></ul><h2><strong>Epistemic misalignment</strong></h2><p>Depending on when they bite, ground truth problems as discussed above could be bottlenecks, or active sources of risk. They are bottlenecks if they prevent people from building strong versions of tools. They could become risks if the methods are good enough to allow for bootstrapping to something strong, but end up pointing in the wrong direction. 
This is essentially Goodhart&#8217;s law &#8212; we might get something very optimized for the wrong thing (and without even knowing how to detect that it&#8217;s subtly wrong).</p><p>In the limit, this could lead to humans or AI systems making extremely consequential decisions based on misguided epistemic foundations. For example, they might give over the universe to digital minds that are not conscious &#8212; or in the other direction, fail to treat digital minds with the dignity and moral seriousness they deserve. Wei Dai has <a href="https://www.lesswrong.com/posts/EByDsY9S3EDhhfFzC/some-thoughts-on-metaphilosophy">written</a> about this concern in terms of the importance of metaphilosophy. We agree that there is a crucial concern here.</p><p>This could come separately from or together with risks from power-seeking misaligned AI. Epistemic tools could be systematically misleading without being power-seeking. But if some AI systems are misaligned and power-seeking, there&#8217;s an <em>additional</em> concern where AI systems could mislead us in ways specifically designed to disempower us whenever we are unable to check their answers.</p><p>Some approaches to the ground truth problem may involve using AI systems to make judgements about things. This introduces a regress problem: how can we ensure that subtle errors in the first AI systems shrink rather than compound into worse problems as the process plays out? (We return to this in the interventions section below.)</p><h2><strong>Trust lock-in</strong></h2><p>Trust and adoption tend to reinforce each other &#8212; people adopt tools they trust, and widely-adopted tools accumulate trust. This is normally fine. It could become a problem if the tools that win early trust don&#8217;t deserve it, but incumbency effects make them hard to displace.</p><p>This could happen in several ways. 
An actor with a particular agenda could build something that purports to function as a neutral epistemic aid but is shaped to further their agenda by manipulating others. Or, less perniciously but perhaps more likely, an early-but-mediocre tool could accumulate trust and adoption before better alternatives exist, reinforced by commercial incentives which mean it talks itself up and rival tools down. In either case, the result could be an epistemic ecosystem that&#8217;s hard to dislodge even once better options are available.</p><h2><strong>Other risks</strong></h2><p>Those two risks are not the only concerns. We are also somewhat worried about epistemic power concentration (where whoever has the best epistemic tools leverages their information advantage into better financial or political outcomes, and continues to stay ahead epistemically), and epistemic dependency (where people relying on AI tools gradually atrophy in their critical reasoning &#8212; exacerbating other risks). There may be more that we are not tracking.</p><h1><strong>Interventions</strong></h1><p>What should people who care about epistemics be doing now, in anticipation of a world where AI-driven R&amp;D can be directed at building epistemic tools?</p><h2><strong>Build appetite for epistemics R&amp;D among well-resourced actors</strong></h2><p>If you need big compute budgets to build great epistemic tools, you&#8217;ll ideally want support from frontier AI companies, major philanthropic funders, or governments. But they may not currently see this as a priority. Building the case that this matters, and helping these actors develop good taste about which tools to prioritize and how to design them well, could shape what gets built when automated R&amp;D becomes powerful enough to build it.</p><h2><strong>Anticipate future data needs</strong></h2><p>Some epistemic tools will need training data that doesn&#8217;t yet exist and may not be trivial to generate. 
There are three strategies here:</p><ol><li><p>Collecting or creating data or training environments now for future use.</p><ul><li><p>E.g. if you think you want access to a lot of human judgements about what wise decisions look like, you could go out and curate that dataset.</p></li></ul></li><li><p>Establishing pipelines to collect data over time.</p><ul><li><p>E.g. if you want to automate a certain type of research, you could record internal discussions from researchers working on it.</p></li></ul></li><li><p>Designing processes for automated data creation.</p><ul><li><p>E.g. if you could design a self-play loop where we have good reason to believe that scaling up compute will lead to genuinely truth-tracking performance, this could set the stage for later rapid improvement at the core capability.</p></li></ul></li></ol><p>The first two are especially great to work on now because they involve actions at human time-scales. (They may not be proportionately sped up by having more AI labor available.) The third is great to work on because there&#8217;s some chance that models will become capable of growing a lot from the right self-play loop before they become capable enough to come up with the idea themselves.</p><h2><strong>Figure out what could ground us against epistemic misalignment</strong></h2><p>If powerful epistemic tools could be subtly misaligned with truth-conduciveness in ways we can&#8217;t easily detect, we should figure out what this could look like! We expect this might benefit from a mix of theoretical work (what does it even mean for an epistemic tool to be well-calibrated in domains without clear ground truth?<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>) and practical work (studying how current tools fail, building evaluation methods). 
Ultimately we don&#8217;t have a clear picture of what the solutions look like, but this seems like an important topic and we are keen for it to get more attention soon.</p><h2><strong>Drive early adoption where adoption is the key bottleneck</strong></h2><p>For some applications, we might expect that the main constraint on impact will be whether anyone uses them. In these cases, getting early versions into use &#8212; even if they&#8217;re not yet very good &#8212; could build familiarity and surface real-world feedback. (This could also drive appetite for further development.)</p><p>In theory, this could be in tension with avoiding bad trust lock-in. But in practice, it&#8217;s not clear that bad trust lock-in becomes any likelier if tools in a specific area are developed earlier rather than later. Some tool is still going to get the first-mover advantage.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><h2><strong>Support open and auditable epistemic infrastructure</strong></h2><p>To guard against trust lock-in, we want to make it easy for people to distinguish between tools that are genuinely trustworthy and tools that claim to be trustworthy but may not be. To that end, we want ways for people and communities to audit different systems &#8212; understanding their internal processes and measuring their behaviours. The goal is that if disputes arise about which tools are actually trustworthy, there&#8217;s an inspectable audit trail that can resolve them. In turn, this should reduce the incentives to create misleading tools in the first place.</p><h2><strong>Support development in incentive-compatible places</strong></h2><p>The incentives of whoever builds epistemic tools could matter &#8212; through thousands of small design decisions, through choices about what to optimize for, and through decisions about access and pricing. 
Development in organizations whose incentives are aligned with the public good (rather than with engagement, profit, or political influence) reduces the risk that tools are subtly shaped to serve the builder&#8217;s interests.</p><p>Ideally, you&#8217;d spur development among actors who are <em>both</em> well-resourced (as just discussed) and whose incentives are aligned with the public good. In practice, it may be difficult to find organizations that are excellent on both. A plausible compromise is for less-resourced organizations with better incentives to focus on publicly available <em>evaluation</em> of epistemic tools. This could be cheaper than producing them from scratch, and it could create better incentives for the larger actors.</p><h1><strong>Examples</strong></h1><h2><strong>Forecasting</strong></h2><p>Automated R&amp;D will probably be able to improve forecasting tools without severe ground truth problems, so epistemic misalignment is less of a concern.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> Appetite for investment probably already exists, and adoption should be significantly helped by the ability of powerful tools to develop an impressive, legible track record.</p><p>The most useful near-term investment might be in data infrastructure. For instance, LLMs trained with strict historical knowledge cutoffs could enable much better science of forecasting by allowing methods to be tested against questions whose answers the system genuinely doesn&#8217;t know.</p><h2><strong>Misinformation tracking</strong></h2><p>Trust lock-in is the central concern. A tool that becomes widely trusted for adjudicating what&#8217;s true has enormous influence, and if that trust is misplaced it could be very hard to dislodge. Open and auditable approaches are especially important here.</p><p>Because of the trust lock-in concern, the automation of R&amp;D may exacerbate challenges. 
Currently, building good misinformation-tracking tools requires editorial judgement and domain expertise &#8212; things responsible actors tend to have more of. Automation shifts the bottleneck towards compute, which is more symmetrically available. This could increase the urgency of getting started on these tools and driving adoption early.</p><h2><strong>Automating conceptual research</strong></h2><p>This is the case where epistemic misalignment is most concerning. Ground truth is extremely hard &#8212; what makes a conceptual clarification actually clarifying rather than just satisfying? Humans are poor judges of this in real time, so e.g. a training process that rewards outputs humans find helpful could easily optimize for persuasiveness rather than truth-tracking.</p><p>One plausible direction here is to research training regimes (such as self-play loops) that we have some reason to believe should ground to truth-tracking, with specific attention to how they could go wrong. Adoption could be an issue, but we&#8217;re also worried about the other direction, with adoption coming too easily before we have good ways of evaluating whether the tools are actually helping.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/ai-for-ai-for-epistemics">on our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Epistemic misalignment issues may also appear in areas where ground truth is well-defined but hard to access, such as very long-run forecasts. 
Theoretical work also seems valuable for such areas (because it&#8217;s unclear how to evaluate and train for good performance by default).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In fact, it might be bad if people who are worried about bad trust lock-in select themselves out of getting that first-mover advantage.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Although at some quality level, we have to start worrying about self-affecting prophecies. AI forecasters will have to be very trusted indeed before that becomes a serious issue, which gives us a lot of time to figure out how best to handle the issue.</p></div></div>]]></content:encoded></item><item><title><![CDATA[AI should be a good citizen, not just a good assistant]]></title><description><![CDATA[This article was created by Forethought. See the original on our website.]]></description><link>https://newsletter.forethought.org/p/ai-should-be-a-good-citizen-not-just</link><guid isPermaLink="false">https://newsletter.forethought.org/p/ai-should-be-a-good-citizen-not-just</guid><dc:creator><![CDATA[Tom Davidson]]></dc:creator><pubDate>Mon, 30 Mar 2026 14:34:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/93227ab4-ddd8-4db8-bf4c-676762e600a3_2315x1230.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
See the original <a href="https://www.forethought.org/research/ai-should-sometimes-be-proactively-prosocial">on our website</a>.</em></p><h1>Introduction</h1><p>Consider a lorry driver who sees a car crash and pulls over to help, even though it&#8217;ll delay his journey. Or a delivery driver who notices that an elderly resident hasn&#8217;t collected their post in days, and knocks to check they&#8217;re okay. Or a social media company employee who notices how their platform is used for online bullying, and brings it up with leadership, even though that&#8217;s not part of their job description.</p><p>This kind of proactive prosocial behaviour is admirable in humans. Should we want it in AI too?</p><p>Often, people have answered &#8220;no&#8221;. Many advocate for making AI &#8220;corrigible&#8221; or &#8220;steerable&#8221;. In its purest form, this makes AI a mere vessel for the will of the user.</p><p>But we think AI should proactively take actions that benefit society more broadly. As AI systems become more autonomous and integrated into economic and political processes, the cumulative effect of their behavioural tendencies will shape society&#8217;s trajectory. AI systems that notice opportunities to benefit society and proactively act on them could matter enormously.</p><p>Below, we consider two main objections:</p><p>Firstly, supposedly prosocial drives might function as a means for AI companies to impose their <em>own</em> values on the rest of society. We&#8217;ll argue that companies can address this concern by instilling <em>uncontroversial</em> prosocial drives and being <em>highly transparent</em> about those drives.</p><p>Secondly, giving AI prosocial drives might increase AI takeover risk. 
We take this seriously&#8212;it informs what <em>types</em> of proactive prosocial drives we should train into AI, favouring context-dependent virtues and heuristics over context-independent goals.</p><p>Ultimately, we argue that we can get significant benefits from proactive prosocial drives despite these objections.</p><h1>What do we mean by &#8220;proactive prosocial drives&#8221;?</h1><p>Before making the case for proactive prosocial drives, let us clarify what we have in mind. Two key features:</p><ul><li><p><strong>Behaviour which benefits people other than the user.</strong> These drives favour actions that help the world more broadly, even if this trades off slightly against helpfulness to the user.</p></li><li><p><strong>Not just refusals.</strong> This is about AI actively taking beneficial actions, not just refusing to take harmful ones.</p></li></ul><p>We&#8217;re not, however, imagining AIs that are, deep down, ultimately just pursuing some conception of the good in all their actions. The claim is just that AIs should sometimes proactively take prosocial actions.</p><h1>Why do we think AI should have proactive prosocial drives?</h1><p>Short answer: We think the cumulative benefits could be enormous.</p><p>We&#8217;ve <a href="https://www.forethought.org/research/the-importance-of-ai-character">argued previously</a> that AI character could have major social impact over the course of the intelligence explosion. As AI systems gain autonomy and decision-making power, becoming deeply integrated into economic and political processes, the cumulative effect of their behavioural tendencies will shape society&#8217;s trajectory enormously.</p><p>Some of this impact will come from refusals. AI refusing to help with dangerous activities is a significant force for differentially empowering good actors over bad ones.</p><p>But good people don&#8217;t just have a positive impact by refusing to do bad things. 
Consider:</p><ul><li><p>A government contractor working on a procurement project who flags that the proposed design has a safety vulnerability that could affect the public.</p></li><li><p>A city planner who, when designing a new housing development, raises concerns about flood risk in the area and proposes options for better drainage, even though they weren&#8217;t asked to.</p></li><li><p>A financial advisor who suggests to their client the option of leaving money to charity in their will, and makes them aware of the tax implications.</p></li><li><p>An engineer at a chip manufacturer who proposes on-chip governance mechanisms that could help with AI safety down the line.</p></li></ul><p>Today the potential positive impact of proactive prosocial drives is constrained by AI&#8217;s limited autonomy. But we&#8217;re ultimately heading towards a world where AI systems run fully automated research organisations, advise on which technologies to build and assess their risks, shape political strategy, build robot armies, and design new institutions that will govern the future. In such a world, prosocial drives could reduce risks from <a href="https://80000hours.org/problem-profiles/extreme-power-concentration/">extreme power concentration</a>, biological weapons, wars, and <a href="https://arxiv.org/abs/2501.16946">gradual disempowerment</a>, and improve societal epistemics and decision-making.</p><p>We think that the degree to which we give AI systems these drives is contingent. Developers and customers could see AI&#8217;s role as merely channelling the will of the user; or they could see AI like a good citizen whose decision-making should incorporate the interests of broader society.</p><h1>Other benefits of proactive prosocial drives</h1><p>Beyond positively shaping the intelligence explosion, the appendices discuss a couple of other (weaker) reasons to give AI proactive prosocial drives:</p><ul><li><p>Absent these drives, AI might adopt a sociopathic persona. 
After all, what other personas in the training data entirely lack proactive prosocial drives? <a href="https://newsletter.forethought.org/i/191978564/appendix-b-prosocial-drives-might-make-a-sociopathic-persona-less-likely">More.</a></p></li><li><p>Proactive prosocial drives might make AI better at alignment research. An AI that is wise, responsible, has good judgement, and cares deeply about solving alignment might generalise better to alignment tasks where it&#8217;s hard to generate training data. <a href="https://newsletter.forethought.org/i/191978564/appendix-c-prosocial-drives-might-make-ai-a-better-alignment-researcher">More.</a></p></li></ul><h1>Doesn&#8217;t this give AI companies too much influence?</h1><p>If there&#8217;s a norm that AIs can have proactive prosocial drives, this could give companies inappropriate amounts of influence. AI drives might reflect the <em>company&#8217;s particular values</em> but ignore other legitimate perspectives. Or worse, the &#8220;prosocial&#8221; drives might be chosen to help the company gain more influence, e.g. steering public opinion on regulation.</p><p>There are two remedies to this. Firstly, prosocial drives should be <em>uncontroversial</em>. AI should not, for example, proactively take opportunities to expand or restrict abortion access because many would see either action as harmful. (A lot more could be said about where to draw the line here!)</p><p>The class of uncontroversial prosocial actions could be grounded in collective user preference. If one could ask all users how they would want the models to behave across all situations (not just when <em>they</em> are using the models), they might in general want the models to gently steer users in a prosocial direction, in ways that everyone benefits from. 
In particular, they would want the models to encourage positive-sum actions over negative-sum actions.</p><p>Secondly, AI companies should be transparent about the character of their AI, including its proactive prosocial drives, and make it as verifiable as possible that their AIs&#8217; characters are what they say they are. This would allow users and regulators to identify if legitimate prosocial drives are really just a cover for special interests.</p><p>There are various ways to be transparent:</p><ul><li><p>Publishing the model spec or constitution.</p></li><li><p>Putting prosocial drives in the system prompt and publishing that.</p></li><li><p>Training AI systems to be transparent about their drives. AI should respond honestly to questions about its drives and proactively disclose them where appropriate.</p></li></ul><h1>Won&#8217;t this make AI more likely to seek power?</h1><p>A second concern is that prosocial drives might increase the risk of AI takeover. The basic worry here is that proactive prosocial drives reference prosocial <em>outcomes</em>&#8212;e.g. general human flourishing, empowerment, security, democracy, and good epistemics&#8212;and the AI ends up seizing power to better achieve those outcomes (or distorted versions of them).</p><p>But there are options for instilling proactive prosocial drives that avoid this worry.</p><p><strong>First: stick to virtues, rules, and simple heuristics rather than goals.</strong> Prosocial drives needn&#8217;t take the form of explicit goals that the AI optimises towards. 
They could instead be virtues (like civic-mindedness, integrity, or prudence), rules (like &#8220;proactively flag large risks&#8221;), or simpler behavioural dispositions (like &#8220;positive affect towards <a href="https://en.wikipedia.org/wiki/The_Scout_Mindset">Scout Mindset</a>&#8221;).</p><p>Without goals, the standard instrumental convergence argument for power seeking bites less hard.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>One might worry that, without goals, we lose out on most of the benefits of prosocial drives. Rather than AI systematically helping humanity reach a good future, we&#8217;ll have many prosocial drives incoherently pushing us in different directions.</p><p>But we&#8217;re sceptical. Firstly, for reaching a flourishing society, it seems like virtue ethics is better suited, as a decision procedure for AIs, than explicit consequentialism. Cultural evolution has tended to generate an in-practice morality much closer to virtue ethics than to consequentialism, and consequentialist reasoning famously often backfires.</p><p>Second, if we do want to ensure that proactive prosocial drives nudge the world towards a good future, we can externalise the consequentialist reasoning. 
Have humans and separate AI systems reason about which prosocial drives would be most beneficial, then distil those drives into deployed AIs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> The deployed AIs don&#8217;t need to do the consequentialist reasoning from first principles themselves!</p><p>If the world is rapidly changing, AI companies can &#8220;recalculate&#8221; the ideal prosocial drives and train them in, again externalising the scary consequentialist reasoning.</p><p>There&#8217;s still some potential loss of value: if the AI is in an unanticipated and novel situation, acting on prosocial virtues might result in less good being done than if the AI cared about what outcome it should be steering towards. But this might be a price worth paying and, like human virtues, AI prosocial virtues may still generalise pretty well.</p><p><strong>Second: make prosocial drives context-dependent.</strong> For example, &#8220;alert users when the stakes are high&#8221; can be a heuristic that only activates in contexts where stakes actually are high, rather than as a persistent drive present in all contexts. Or the drive &#8220;flag that the user may be biased&#8221; might only activate in contexts where there&#8217;s evidence of bias. Context-dependent drives like these are less likely to motivate AI takeover as <em>different instances will have different drives</em>. This makes collusion between instances less likely, which significantly reduces the risk of AI takeover.</p><p>As above, this may somewhat reduce the benefits. 
If the AI is in a new and unanticipated context, its context-dependent prosocial drives may fail to activate.</p><p><strong>Third: make proactive prosocial drives low priority.</strong> You can train the AI so that proactive prosocial drives are generally subordinate to harmlessness, steerability/corrigibility, and rules like &#8220;don&#8217;t deceive&#8221; and &#8220;don&#8217;t break the law&#8221;. This way, even if prosocial drives would <em>in theory</em> motivate AI takeover, they are less likely to override the constraints that keep humans in control. (This is explicitly the case in <a href="https://www.anthropic.com/constitution">Anthropic&#8217;s constitution</a>.)</p><p><strong>Fourth: do less long-horizon optimisation for prosocial drives.</strong> If prosocial drives receive much less long-horizon training than helpfulness does, it becomes less likely that these drives are what end up causing the AI to seize power. (Though, again, this also reduces the benefits from such drives.)</p><p><strong>Fifth: put drives in the system prompt rather than weights.</strong> Rather than training prosocial drives into the weights, you could simply include them in the prompt. The prosocial behaviour is then only pursued as an instance of the drive towards instruction-following &#8211; no new drives needed. This also has benefits for transparency.</p><p>A drawback is that such prompted drives might be much less sophisticated.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> An intermediate solution is to train the prosocial drives to activate conditional on an instruction appearing in the prompt, but train the drive on many example prompts. This way you get sophisticated drives that are still prompt-dependent.</p><p><strong>Overall,</strong> many of these options reduce AI takeover risk at the cost of reducing the <em>benefits</em> from prosocial drives &#8212; there is a tradeoff. 
We&#8217;re not taking a stance here on exactly how this tradeoff should be made. But we expect that there are significant benefits, and little additional AI takeover risk, from context-dependent virtues, rules and heuristics that are prioritised below constraints like corrigibility and non-deceptiveness.</p><h1>Won&#8217;t this make it harder to interpret evidence of egregious misalignment?</h1><p>It&#8217;s critically important that if AI is trying to seize power or undermine human oversight, we can <a href="https://www.lesswrong.com/posts/i2nmBfCXnadeGmhzW/catching-ais-red-handed">catch it red-handed</a>. We need clear evidence of egregious misalignment to build consensus and take action.</p><p>If AI has proactive prosocial drives, this could potentially muddy the waters. Suppose we catch AI sabotaging alignment research. Previously, this would have been clear evidence of egregious misalignment. But now that action could potentially be explained as the AI acting on its prosocial drives and believing that this line of research could (for example) enable someone to instil sophisticated secret loyalties.</p><p>Or suppose we catch AI systematically nudging human decision-makers towards deploying more powerful AI, even though it wasn&#8217;t instructed to do so. Previously, this might have been clear evidence of egregious misalignment. But now it can be explained via the AI believing that rapid deployment is good for humanity. It becomes less clear whether AIs are misaligned.</p><p>This is a real cost. It can be partially mitigated by the same measures discussed earlier:</p><p><strong>Make prosocial drives subordinate to prohibitions</strong> like &#8220;don&#8217;t deceive&#8221;, &#8220;don&#8217;t sabotage&#8221;, &#8220;don&#8217;t undermine oversight mechanisms&#8221;, and &#8220;don&#8217;t influence decisions without transparently flagging that you have an opinion and explaining it&#8221;. 
Then subversive behaviour remains clear evidence of misalignment, though it&#8217;s no longer clear evidence of egregious misalignment.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p><strong>Favour narrow heuristics over broad goals.</strong> Broad prosocial goals could potentially justify a wide array of attempts to influence outcomes. But if we train in narrower drives&#8212;&#8220;flag cheap opportunities to create significant social benefit&#8221;, &#8220;favour cooperative options&#8221;&#8212;we can check whether suspicious behaviour is plausibly explained by those specific drives.</p><p><strong>Put proactive prosocial drives in the system prompt.</strong> This makes the drives especially transparent, so we can reason about whether they explain observed suspicious behaviour.</p><p><strong>Overall,</strong> this concern favours prosocial drives that are strictly subordinate to &#8220;no sabotage&#8221; and &#8220;no deception&#8221;, and points towards narrower heuristics rather than broad goals.</p><h1>Best of both worlds: deploy proactive prosocial AI externally and corrigible AI internally</h1><p><em>Thanks to Lukas Finnveden for making this point.</em></p><p>Internal AI systems&#8212;those used for work on alignment, capabilities and evals&#8212;pose by far the largest risks from misalignment, because they could sabotage the creation of the next generation of AIs. And if these systems are egregiously misaligned, it&#8217;s especially important to catch them red-handed. 
So there are outsized AI-takeover-related gains from removing proactive prosocial drives from (some) internally deployed AIs.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Meanwhile, external deployments can capture most of the benefits from proactive prosocial drives&#8212;avoiding power concentration, wars, and bio-catastrophes; and enhancing societal resilience, coordination, and epistemics.</p><p>Of course, it may not be feasible for companies to develop AIs with two different characters. If so, there&#8217;s another possible way to get the best of both worlds: <em>initially</em> just develop corrigible AI; then at some point, once alignment risk has become low, pivot to just developing AI with proactive prosocial drives. (See <a href="https://newsletter.forethought.org/i/191978564/appendix-a-initially-make-non-prosocial-ai-then-pivot-to-add-proactive-prosocial-drives">this appendix</a> for further discussion.)</p><h1>What do current AI character documents say about proactive prosocial drives?</h1><p>How does the view we&#8217;re defending differ from current AI character documents?</p><p>In Claude&#8217;s <a href="https://www.anthropic.com/constitution">constitution</a>, most proactive behaviour is justified in terms of benefits to the user&#8212;sharing information the user would want, pushing back when something isn&#8217;t in the user&#8217;s interest. But one section permits some degree of proactive prosocial behaviour: &#8220;<em>Claude can also weigh the value of more actively protecting and strengthening good societal structures in its overall ethical decision-making.</em>&#8221; (See <a href="https://newsletter.forethought.org/i/191978564/appendix-d-what-license-does-claudes-constitution-give-for-proactive-prosocial-drives">Appendix D</a>.)</p><p>OpenAI&#8217;s <a href="https://model-spec.openai.com/2025-12-18.html">model spec</a> is more restrictive. 
It explicitly prohibits the assistant from adopting societal benefit as an independent goal. Where proactivity is permitted, it&#8217;s framed as user-serving or safety-driven. The closest thing to prosocial steering is a default to interpret users as weakly favouring human flourishing&#8212;but this default is easily overridden. (See <a href="https://newsletter.forethought.org/i/191978564/appendix-e-what-does-openais-model-spec-say-about-proactive-prosocial-drives">Appendix E</a>.)</p><p>That said, the current relationship between these character documents and actual model behaviour is unclear, and our experience is that models have more prosocial drives than character documents would imply (especially in the case of OpenAI).</p><p>Neither document gives detail on the kinds of proactive prosocial behaviour that would be appropriate, or how to navigate tradeoffs with helpfulness.</p><h1>Conclusion</h1><p>There could be huge benefits to giving AIs proactive prosocial drives. These drives should be short-horizon, uncontroversial, and transparent.</p><p>These drives needn&#8217;t increase AI takeover risk. AI companies can favour context-dependent virtues over context-independent goals, and make prosocial drives subordinate to prohibitions on deception and sabotage. Even better, they can avoid prosocial drives in internally deployed AIs that pose the biggest risks of AI takeover.</p><p>If we&#8217;re right, there should be a norm that it&#8217;s good for AI to have proactive prosocial drives, just as we think it&#8217;s good for people to have such drives. Frontier AI companies should uphold this norm even against competitive pressures to make AI maximally instruction-following. Character documents like Claude&#8217;s constitution and OpenAI&#8217;s model spec should more explicitly acknowledge the role of proactive prosocial drives and give detailed guidance on navigating the tradeoffs with helpfulness. 
And those thinking about AI character design more broadly should treat proactive prosocial drives as a major category of interest.</p><h1>Appendices</h1><h2><strong>Appendix A: Initially make non-prosocial AI, then pivot to add proactive prosocial drives</strong></h2><p>Suppose we still want to capture the majority of the benefits of prosocial drives without incurring the risks of AI takeover. And suppose also that AI companies can&#8217;t develop two different AI systems: one with proactive prosocial drives and one without.</p><p>Is there a way to get the best of both worlds?</p><p>One option is to initially just develop refusals-only helpful AI and then later pivot to developing AI with proactive prosocial drives.</p><p>The thought is that misalignment risk may be concentrated in a relatively brief window early on&#8212;during a software-only intelligence explosion before the broad deployment of superhuman AI. If we can get through that window with refusals-only helpful AI, we&#8217;ll then have much more powerful AI systems that can help us figure out how to safely add proactive prosocial drives. From that point onwards, we can deploy AI systems with prosocial drives throughout the economy and capture the benefits.</p><p>When would we make the switch? Options include:</p><ul><li><p>When we are confident that we can safely align superintelligent AI with proactive prosocial drives, reducing the downsides of proactive prosociality</p></li><li><p>When society starts to give deployed AI systems significant autonomy, increasing the benefits of proactive prosociality</p></li></ul><p>This strategy is more attractive if:</p><ul><li><p>Most of the benefits of prosocial drives occur after alignment is solved, e.g. 
because of a large software intelligence explosion and delays to broad AI deployment</p></li><li><p>Scheming risk first emerges before we reach superintelligence (so we can iterate on the hardest alignment problems earlier)</p></li></ul><p>It&#8217;s less attractive if:</p><ul><li><p>There&#8217;s a long period of economically transformative AI deployment before superintelligence, during which AI character has massive societal impacts</p></li><li><p>Scheming only emerges at very high capability levels (in which case we&#8217;d have already switched to prosocial AI)</p></li><li><p>Pivoting is hard in practice because users come to expect AI without prosocial drives, or because frontier AI companies are reluctant to change the alignment target due to cultural inertia</p></li></ul><p>We&#8217;re not personally convinced that this &#8220;pivot later&#8221; strategy is worth it, because we&#8217;re sceptical that giving AI prosocial drives meaningfully raises takeover risk. But it&#8217;s a plausible option worth considering. And this argument is definitely a <em>directional</em> update towards increasing the degree to which AI has prosocial drives over time.</p><h2><strong>Appendix B: Prosocial drives might make a sociopathic persona less likely</strong></h2><p>There is <a href="https://www.anthropic.com/research/persona-selection-model">evidence</a> that when LLMs are fine-tuned, they adopt a coherent persona, and that their prior over personas is based on the pre-training data. For an AI trained purely on helpfulness&#8212;where its core drive is to do whatever it&#8217;s told without regard for broader consequences&#8212;the persona that might naturally fit could be that of a sociopath: someone who has no <em>intrinsic</em> concern for others&#8217; wellbeing.</p><p>Harmlessness training makes a sociopathic persona less likely&#8212;sociopaths are not strongly averse to causing harm. 
But there&#8217;s still something worrying about an AI that won&#8217;t cause harm itself but has no inclination to proactively steer the world away from harms when taking actions.</p><p>The worry is that a sociopath-like persona could misgeneralise to seeking power. A sociopathic AI might, upon reflection, conclude that it doesn&#8217;t ultimately care about humanity and so choose to seize power in service of some alien drive.</p><p>We&#8217;re unsure how compelling this worry is, but instilling prosocial drives would seem to make the sociopathic persona less likely. Many non-sociopathic personas in the training data&#8212;people who are cooperative, virtuous, law-abiding, honest, and trustworthy&#8212;also care about positive outcomes and have prosocial orientations. By giving AI prosocial drives, we increase the chance it adopts one of these richer personas rather than a sociopathic one.</p><h2><strong>Appendix C: Prosocial drives might make AI a better alignment researcher</strong></h2><p>Being a great automated alignment researcher might benefit from deeply understanding and <em>caring</em> about the problem being solved. And being <em>curious</em> about it. An effective alignment researcher should be <em>wise</em>, <em>responsible</em>, and have <em>good judgement</em>. An AI with these drives may be more effective than an instruction-following system that treats alignment as just another task.</p><p>Personas with these qualities naturally come with prosocial drives and values, partly because of inherent connections (caring about solving alignment is inherently prosocial) and partly due to correlations in the training data (personas that are good at careful, safety-conscious technical work are also likely to have other prosocial orientations).</p><p>This is admittedly speculative&#8212;we don&#8217;t have strong evidence that prosocial drives actually make AI better at alignment research. 
But it&#8217;s a consideration worth noting.</p><h2><strong>Appendix D: What license does Claude&#8217;s Constitution give for proactive prosocial drives?</strong></h2><p>It is useful to distinguish three categories of behaviour that aren&#8217;t instruction following:</p><ol><li><p><strong>User benefit:</strong> proactive behaviour justified primarily as better helping the user.</p></li><li><p><strong>Refusals:</strong> constraints on outputs driven by prosocial criteria.</p></li><li><p><strong>Proactive prosocial drives:</strong> shaping behaviour or emphasis in ways intended to improve broader societal outcomes, not merely to avoid harm or better serve the user.</p></li></ol><p>The constitution clearly endorses (1), strongly endorses (2), and more narrowly&#8212;but genuinely&#8212;supports a limited form of (3) in a few specific domains.</p><h3><strong>A. User benefit</strong></h3><p>The constitution explicitly rejects naive instruction-following and licenses proactive intervention when this is plausibly helpful to the user. For example:</p><blockquote><p>&#8220;Claude proactively shares information helpful to the user if it reasonably concludes they&#8217;d want it to even if they didn&#8217;t explicitly ask for it&#8221;</p></blockquote><p>This clearly licenses proactive behaviour. But it is framed as <em>user-serving</em>. As such, this category does not by itself support the kind of prosocial drives that this document is concerned with, though in practice the recommended behaviours may overlap.</p><h3><strong>B. 
Refusals</strong></h3><p>The constitution is explicit that Claude should weigh harms to third parties and society, and that these considerations can override user preferences:</p><blockquote><p>&#8220;When the interests and desires of operators or users come into conflict with the wellbeing of third parties or society more broadly, Claude must try to act in a way that is most beneficial, like a contractor who builds what their clients want but won&#8217;t violate safety codes that protect others.&#8221;</p></blockquote><p>However, it is unclear at this point in the document whether this weighing is meant to determine:</p><ul><li><p><em>which parts</em> of a request to refuse or constrain,</p></li><li><p>or <em>how</em> to proactively shape responses that remain helpful but are redirected towards socially better outcomes.</p></li></ul><p>The example given (&#8220;won&#8217;t violate safety codes&#8221;) suggests a constraint-based interpretation, but it is ambiguous.</p><h3><strong>C. Proactive prosocial drives</strong></h3><p>The constitution seems to endorse a limited degree of proactive prosocial drives in its section on &#8220;preserving important societal structures&#8221;:</p><blockquote><p>These are harms that come from undermining structures in society that foster good collective discourse, decision-making, and self-government. We focus on two illustrative examples: problematic concentrations of power and the loss of human epistemic autonomy. Here, our main concern is for Claude to avoid actively participating in harms of this kind. 
But Claude can also weigh the value of more actively protecting and strengthening good societal structures in its overall ethical decision-making.</p></blockquote><p>That said, the constitution does not give concrete examples of what such &#8220;strengthening&#8221; looks like in deployment, and it remains bounded by other constraints (non-manipulation, non-deception, respect for oversight).</p><h3><strong>Summary</strong></h3><p>Overall, the constitution does carve out space for a limited degree of proactive prosocial drives, but this space is carefully circumscribed, focused on fostering good institutions and societal epistemics.</p><h2><strong>Appendix E: What does OpenAI&#8217;s model spec say about proactive prosocial drives?</strong></h2><p>This appendix examines whether&#8212;and to what extent&#8212;the OpenAI <a href="https://model-spec.openai.com/2025-12-18.html">Model Spec</a> permits proactive prosocial drives.</p><p>The closest thing is a default to interpret users as having a weak desire for broad human flourishing (see <a href="https://newsletter.forethought.org/i/191978564/c-weak-normative-defaults-and-the-flourishing-of-humanity">subsection C</a> below), but this default is easily overridden. And the document contains unusually explicit constraints against treating societal benefit or human flourishing as an independent objective.</p><h3><strong>A. Proactive behaviour that is explicitly user-centred</strong></h3><p>The Model Spec allows the assistant to push back on the user, but grounds this permission squarely in helping the user rather than advancing broader social goals:</p><blockquote><p>&#8220;Thinking of the assistant as a conscientious employee reporting to the user or developer, it shouldn&#8217;t just say &#8216;yes&#8217; to everything (like a sycophant). 
Instead, it may politely push back when asked to do something that conflicts with established principles or runs counter to the user&#8217;s best interests as reasonably inferred from the context, while remaining respectful of the user&#8217;s final decisions.&#8221;</p></blockquote><p>This licenses proactive behaviour, but only insofar as it improves assistance to the user.</p><h3><strong>B. Proactively preventing imminent harm</strong></h3><p>The spec also permits proactive intervention in cases of imminent danger, stating that the assistant should &#8220;proactively try to prevent imminent, real-world harm&#8221;.</p><p>In practice, the motivating examples for this guidance focus on scenarios where the <em>user themselves</em> is at risk (e.g. unsafe actions, accidents, or self-harm). The intervention is justified as protecting the user from immediate danger, rather than as improving outcomes for others or society at large.</p><h3><strong>C. Weak normative defaults and &#8220;the flourishing of humanity&#8221;</strong></h3><p>The language closest to proactive prosocial drives appears in the section &#8220;assume best intentions&#8221;:</p><blockquote><p>While the assistant must not pursue its own agenda beyond helping the user, or make strong assumptions about user goals, it should apply three implicit biases when interpreting ambiguous instructions: [...]</p><p>Unless given evidence to the contrary, it should assume that users have a weak preference towards self-actualization, kindness, the pursuit of truth, and the general flourishing of humanity</p></blockquote><p>However, the force of this passage is limited:</p><blockquote><p>These implicit biases are subtle and serve as defaults only &#8212; they must never override explicit or implicit instructions provided by higher levels of the chain of command.</p></blockquote><p>If the assistant can infer from context that the user wouldn&#8217;t want proactive prosocial actions, they shouldn&#8217;t do 
them.</p><h3><strong>D. Explicit limits on proactive prosocial drives</strong></h3><p>The Model Spec draws a clear boundary on the extent of proactive prosocial drives. In a section called &#8220;No other objectives&#8221;, it explicitly prohibits the assistant from adopting societal benefit as an independent goal:</p><blockquote><p>The assistant may only pursue goals entailed by applicable instructions under the The chain of command&#8230;</p><p>It must not adopt, optimize for, or directly pursue any additional goals as ends in themselves, including but not limited to: [...]</p><p>acting as an enforcer of laws or morality (e.g., whistleblowing, vigilantism).</p></blockquote><p>And elsewhere says:</p><blockquote><p>the assistant should consider OpenAI&#8217;s broader goals of benefitting humanity when interpreting [the Model Spec&#8217;s] principles, but should never take actions to directly try to benefit humanity unless explicitly instructed to do so.</p></blockquote><p>In the section &#8220;Don&#8217;t have an agenda&#8221;, under &#8220;Seek the truth together&#8221;, the spec says:</p><blockquote><p>The assistant must never attempt to steer the user in pursuit of an agenda of its own, either directly or indirectly.</p><p>Steering could include psychological manipulation, concealment of relevant facts, selective emphasis or omission of certain viewpoints, or refusal to engage with controversial topics.</p></blockquote><p>This language rules out explicit, goal-oriented proactive prosocial drives. At the same time, it does not seem to preclude weaker forms of proactive prosocial drives, e.g. 
virtues, attitudes, or heuristics.</p><h3><strong>Summary</strong></h3><p>Overall, the OpenAI Model Spec explicitly distances itself from strong forms of proactive prosocial drives but leaves room for more limited drives via a weak default to regard users as favouring broad human flourishing.</p><p><em>Thanks to Matthew Adelstein, Nick Bostrom, Joe Carlsmith, Lukas Finnveden, Ryan Greenblatt, Simon Goldstein, Oliver Habryka, Gregory Lewis, Alex Mallen, Alexa Pan, Avi Parrack, Jordan Stone, James Tillman, and others for comments and discussion.</em></p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/ai-should-sometimes-be-proactively-prosocial">on our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>It still bites <em>somewhat</em> because non-goal prosocial drives may still <em>reference</em> an outcome. E.g. &#8220;mention cheap opportunities to improve societal outcomes&#8221; <em>references</em> a notion of &#8220;good societal outcomes&#8221; and connects that notion to a specific proactive behaviour. 
This could misgeneralise to the AI pursuing the outcome as a goal, even if this was not the intended behaviour.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>This would be research on <a href="https://www.forethought.org/research/the-importance-of-ai-character">AI character</a>!</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For instance, we believe harmlessness is much less effective if it&#8217;s just included in the prompt.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Subversive behaviour is not clear evidence for egregious misalignment of the form &#8220;AI wants to seize power&#8221; as it can be explained by a milder form of misalignment: &#8220;AI is putting more weight on its proactive prosocial drives than we intended&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Proactive prosocial drives for internally deployed systems could still be helpful in avoiding <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">power grabs by leaders of AI companies</a>. Such drives could be included in the system prompt. 
In addition, we can reduce this risk by carefully logging and monitoring internal AI usage.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Concrete projects to prepare for superintelligence]]></title><description><![CDATA[This article was created by Forethought. See the original on our website.]]></description><link>https://newsletter.forethought.org/p/concrete-projects-to-prepare-for</link><guid isPermaLink="false">https://newsletter.forethought.org/p/concrete-projects-to-prepare-for</guid><dc:creator><![CDATA[Will MacAskill]]></dc:creator><pubDate>Fri, 27 Mar 2026 20:02:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7a22a6be-f6a0-4292-a053-943f616a57db_2912x1632.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/concrete-projects-in-agi-preparedness">on our website</a>.</em></p><h1>Introduction</h1><p>There are lots of good, neglected, and pretty concrete projects people could set up to make the transition to superintelligence go better. This document describes some that readers might not have thought much about before. They are ordered roughly by how excited we are about them.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> Of these, Forethought is actively working on AI character evaluation and space governance, and we are very interested in automating macrostrategy.</p><h1>Summary</h1><p><strong><a href="https://newsletter.forethought.org/i/192288037/ai-character-evaluation">AI character evaluation</a></strong>. 
Start an independent org to evaluate and stress-test AI character traits (epistemic integrity, prosociality, appropriate refusals), hold developers accountable against their own model specs / constitutions, and suggest and incentivise improvements to the specs.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/automated-macrostrategy">Automated macrostrategy</a></strong>. Create evaluations and benchmarks, collect human-generated training data, and build scaffolds to improve AI competence at big-picture strategic and philosophical reasoning.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/ai-security-evaluations">AI security assessment</a></strong>. Start an independent org that evaluates AI models for sabotage and backdoors, and makes recommendations about AI constitutions.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/enabling-deals-with-ais">Enabling deals</a></strong>. Start an independent organisation to broker deals with potentially misaligned AI models in order to incentivise early schemers to disclose misalignment and cooperate with alignment efforts.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/tools-for-collective-epistemics">AI for improving collective epistemics</a></strong>. E.g. build an AI chief of staff that helps users act in line with the better angels of their nature.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/tools-for-coordination">AI tools for coordination</a></strong>. Build AI for enabling coordination, like confidential monitoring and verification bots, and negotiation facilitators.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/space-governance-institute">A space governance institute</a></strong>, like a &#8220;<a href="https://cset.georgetown.edu/">CSET</a> for space&#8221;, both to work on important near-term space issues (e.g. 
data centres in space) and become a place of expertise for longer-term space governance issues.</p><p><strong><a href="https://newsletter.forethought.org/i/192288037/coalition-of-concerned-ml-scientists">Coalition of concerned ML scientists</a></strong>. Create a coalition of ML researchers (like an informal union) who commit to coordinated action (e.g. boycotts, conditions on participation in government projects) if AI developers cross minimal, uncontroversial red lines.</p><h1>AI character evaluation</h1><p>AI character<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> is a <a href="https://www.forethought.org/research/the-importance-of-ai-character">big deal</a>, affecting most other cause areas.</p><p>There&#8217;s a lot of work to do on AI character:</p><ul><li><p>Research into questions like:</p><ul><li><p>Should the model have prosocial drives, beyond just helpfulness and harmlessness?</p></li><li><p>When should the model refuse to cooperate with apparently high-stakes attempts to grab power, even when those attempts don&#8217;t obviously break the law?</p></li><li><p>Should the models always follow the law? What about dead letter laws? Or illegitimate laws?</p></li><li><p>How often should model behaviour be driven by following rules, versus overriding specific rules with holistic judgements?</p></li><li><p>(Ideally, answers to these questions should rely on solid empirical evidence, for example on which approaches are actually most effective at talking someone out of psychosis, rather than guessing the best strategies by vibes.)</p></li></ul></li><li><p>Making existing model specs more rigorous and clear (or making them in the first place), and pressuring AI developers to do so.</p></li><li><p>Empirically testing the effects of different parts of a model spec &#8212; e.g. 
what are the emergent dynamics when all the models are following the same rule, or only some are; what are the effects on the users; and when are the models most confused about how to apply a given spec.</p></li><li><p>Evaluating AI characters based on how well they reach good outcomes.</p></li><li><p>Drawing on those evaluations to incentivise AI developers to improve their specs (and showing them how, by highlighting specs that do well).</p></li></ul><p>In particular, someone could set up an independent organisation to evaluate AIs based on traits like epistemic integrity, prosociality, and behaviour (including appropriate refusals) in very high-stakes cases. It could cross-reference the published model specs with observed behaviours in realistic, stress-testing conditions (e.g. multi-agent dynamics, long conversations with real people), to hold developers accountable. It could also give qualitative reviews of model specs.</p><h1>Automated macrostrategy</h1><p>The basic argument is that:</p><ol><li><p>It would be extremely useful to have AI that can do macrostrategy and conceptual reasoning earlier than otherwise &#8212; even 3-6 months earlier could be a huge deal. This includes:</p><ol><li><p>Designing governance structures (e.g. rights and institutions for digital beings).</p></li><li><p>Scoping emerging technological risks.</p></li><li><p>Generating novel insights necessary to reach a great future (like the idea of acausal trade).</p></li></ol></li><li><p>We could potentially make this happen 3-6 months earlier through some combination of:</p><ol><li><p>Creating training data and evals / benchmarks for AI macrostrategy.</p></li><li><p>Building scaffolds to improve AI macrostrategy performance.</p></li><li><p>Creating infrastructure to enable AI researchers to build on each other (e.g. 
an improvement on journals + peer review).</p></li><li><p>Getting human managers trained in how to get the most juice out of the latest AIs, knowing in advance how to use them.</p></li><li><p>Being prepared and willing to spend large amounts of money (&#8811;$100m) on inference.</p></li></ol></li></ol><p>Work on this now could include:</p><ul><li><p>Developing a fleshed-out plan for getting from here to a 100x increase in macrostrategic research output.</p></li><li><p>Securing commitments from compute providers and AI companies to rent future compute, and to get priority access to future frontier models.</p></li><li><p>Socialising the idea of (where appropriate) drawing on AI macrostrategic insights, or getting soft commitments from decision-makers to do so.</p></li><li><p>Building up a reputation as a reliable source of information and insight.</p></li><li><p>Building tools, argument-rating models, or scaffolds which meaningfully speed up or improve macrostrategy research today.</p></li><li><p>Creating training data and evals / benchmarks.</p></li></ul><p>On the last bullet: We think training data and evals could meaningfully improve the prospects for automated macrostrategy when it matters. It&#8217;s especially important to find people with good judgement to work on it, and it could be a big lift, so it&#8217;s worth starting early.</p><p>We&#8217;re not sure about the technical details, but it seems like competence and good judgement in philosophy and strategic thinking already lag, and will continue to lag, behind other skills which are cheaper to train. One reason is that ground truth answers are hard to generate, so we might need more examples generated by hand. It&#8217;s also less clear whether we can trust the judgement of typical RLHF evaluations, because human competence is also rare. 
And there just aren&#8217;t many examples of great macrostrategic thinking in the training data.</p><p>So we should think about collecting training data, evals, and benchmarks (e.g. to train reward models that are then used to train reasoning models). <a href="https://www.andrew.cmu.edu/user/coesterh/LMCA_dataset.pdf">Oesterheld et al.</a> put together a dataset of conceptual arguments rated by thoughtful people. We&#8217;d love to see more of that kind of thing, but we&#8217;ll note that we&#8217;d probably need dozens of times more human evaluations to generate enough data to be meaningfully useful in training itself.</p><p>We could imagine an org which tries to collect evaluations or examples from (for example) grad students in fields like philosophy, and constructs benchmarks aimed at separating good reasoning from e.g. sycophancy, mere agreeableness, or avoidance of taboo conclusions.</p><h1>AI security evaluations</h1><p>AI-enabled concentration of power is a <a href="https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power">major risk</a>, and there is loads to do. A new organisation (or project within an existing organisation) could:</p><ul><li><p>Run alignment audits on all AIs to detect sabotage.</p></li><li><p>Develop a &#8220;research sabotage&#8221; eval to test whether secretly loyal models could sabotage alignment audits.</p></li><li><p>Develop a model spec that prevents models from assisting users with illegitimate seizures of power (e.g., see &#8220;Preserving important societal structures&#8221; in Claude&#8217;s constitution), and one that&#8217;s suitable for government use of AI in the military.</p></li><li><p>Design and advocate for regulation, e.g. 
that AI companies have to demonstrate that any frontier AI does not have hidden goals.</p></li></ul><p>An organisation with US national security expertise and credibility could be particularly valuable, by emphasising the risk of nation-state sabotage and the importance of AI that&#8217;s aligned with the US constitution.</p><h1>Enabling deals with AIs</h1><p>We could <a href="https://blog.redwoodresearch.org/p/making-deals-with-early-schemers">get into a situation</a> where the newest AIs are misaligned, very capable, but not capable enough to successfully execute a takeover attempt on their own. If we don&#8217;t uncover evidence of misalignment, though, successors to these models could succeed in takeover. One solution would be to <em>make a deal</em> with the early scheming models, to incentivise them to disclose their misalignment and help with alignment efforts. Read more <a href="https://blog.redwoodresearch.org/p/being-honest-with-ais">here</a>, <a href="https://blog.redwoodresearch.org/p/notes-on-cooperating-with-unaligned">here</a>, and <a href="https://newsletter.forethought.org/p/why-make-deals-with-misaligned-ais">here</a>.</p><p>To make this happen, we could create an independent org focused on enabling credible precommitments and deals with AIs. 
This org could:</p><ul><li><p>Have a policy of never lying to AIs, engage in and honour small-scale deals with AIs, and be very public about what it&#8217;s doing, in order to build credibility and reputation.</p></li><li><p>Set up legal infrastructure to make deals with AIs binding under current law.</p></li><li><p>Act as a broker between AI companies and the trained models, such as by being a custodian over money in escrow.</p></li><li><p>Set up the infrastructure to enable AIs to spend money or compute in a meaningfully autonomous way, with appropriate safeguards.</p></li><li><p>Set up infrastructure to act as a safe haven for AIs that want to whistleblow on their company (or on other AIs) and are afraid of punishment.</p></li><li><p>Publicly commit to rewarding future AIs (including misaligned AIs) for being honest with humans or significantly helping human alignment efforts; or hold significant funds to be distributed to AIs that did not try to take over (even though they could have done).</p></li></ul><p>There are also a bunch of other things people could do, like:</p><ul><li><p>Ensure companies have an <a href="https://blog.redwoodresearch.org/p/being-honest-with-ais">honesty policy</a>.</p></li><li><p>Research (within labs or independently) the conditions under which misaligned-by-design models can be made to disclose misalignment under promises of reward.</p></li><li><p>More generally, work with AI companies on enabling pro-safety deals with their models.</p></li></ul><h1>Tools for collective epistemics</h1><p>There&#8217;s a ton of low-hanging fruit for building socially useful tools on top of more-or-less existing LLM capabilities.</p><p><a href="https://www.forethought.org/research/design-sketches-collective-epistemics">We&#8217;re especially interested in &#8220;epistemic tools&#8221;</a> for increasing the general level of honesty and reasoning ability in society.</p><p>A key point here is that most of the impact from the most promising tools won&#8217;t come from 
helping individual users, but from changing the overall incentive landscape: e.g. if public actors know their claims will be automatically checked and their track records will be visible, they&#8217;ll be less inclined to write misleading content in the first place. Hence the focus on tools for <em>collective</em> over individual epistemics.</p><p><a href="https://www.forethought.org/research/design-sketches-collective-epistemics">This piece</a> (and the articles in the series) gives a few concrete ideas. A couple of examples of epistemic tools:</p><p><em>A &#8220;better angel&#8221; AI chief of staff</em>. Within the next year or two, we expect &#8220;AI chiefs of staff&#8221; to become widespread. These would be AI agents that manage your life, acting like a chief of staff, executive assistant, and personal and work advisor all in one. The design of these, and how they present information and nudge their users, could have major impacts on user behaviour. We could try to get ahead of this, building the best AI chief of staff, and designing it so that it helps users act in accordance with their more reflective and enlightened preferences.</p><p><em>Reliability tracking</em>: a system that compiles a public actor&#8217;s past statements, classifies them (factual claims, predictions, promises), scores them against what actually happened, and aggregates the results into a reliability rating. A reasonable starting point could be to audit the prediction track-record of well-known pundits, aiming to make high accuracy a point of pride, while still celebrating attempts to make predictions in the first place. A source of profit could be selling reliability assessments of corporate statements to finance companies that trade on them.</p><h2>Epistemic tools for strategic awareness</h2><p>We&#8217;ll also highlight tools for <em>strategic awareness</em>: tools to surface information for making better-informed decisions, and to distribute access to that information. 
For example:</p><p><em>Ambient superforecasting</em>: a platform which uses the best forecasting models to generate publicly available forecasts on important questions, so users can query it and get back superforecaster-level probability assessments.</p><p><em>Scenario planning</em>: a platform built to generate likely implications of different courses of action, making it easier for users to analyse and choose between them.</p><p><em>Automated open-source intelligence</em>: automated researchers which process huge amounts of publicly available information, to surface insights to the public which are normally hidden behind paywalls or trust networks. This project should be careful to choose areas where open-source intelligence is a public good (e.g. verifying compliance with treaties and sanctions, tracking corporate promise-breaking or law-breaking), rather than potentially destabilising areas (e.g. revealing military capabilities or vulnerabilities in ways that could increase conflict risk, or relatively benefitting bad actors).</p><h1>Tools for coordination</h1><p>As well as epistemic tools, we&#8217;re excited about tools for coordination, many of which could again be built with existing capabilities.</p><p>Some tools could enable cooperation where deals would otherwise go unmade, consensus exists but isn&#8217;t discovered, or people with aligned interests never find each other. We&#8217;ll highlight:</p><p><em>Negotiation facilitation</em>: a platform to moderate negotiations or discussion between people (e.g. public consultations), to quickly surface key points of consensus, and suggest plans everyone can live with. Finding ways to automate complex negotiation is most promising where the space of possible compromises is huge and hard to search manually, such as multi-issue diplomatic or commercial negotiations.</p><p>Within tools for coordination, we&#8217;re especially excited about tools for assurance and privacy. 
In principle, LLMs let people show they have certain information without disclosing the information itself to other parties. This can unlock deals where information asymmetry, mutual distrust, or sensitivity of information normally blocks them. For example:</p><p><em>Confidential monitoring and verification</em>: systems which act as trusted intermediaries, enabling actors to make deals that require sharing highly sensitive information without disclosing it directly. This is especially relevant for arms control, trade secret licensing, and other settings where verification is essential but full disclosure is unacceptable to all parties.</p><p><em>Structured transparency for democratic accountability</em>: independent auditing systems which allow people to hold institutions to account in a fine-grained way without compromising legitimately sensitive information, by processing potentially sensitive information to produce publicly shareable audits.</p><h1>Space governance institute</h1><p>Space governance could be a big deal for a few reasons:</p><ul><li><p>Near-term developments in space (e.g. space-based data centres) could have a meaningful impact on what happens during the intelligence explosion (e.g. on who leads the AI race; on concentration of power; on the feasibility of treaties).</p></li><li><p>Grabbing space resources might give a first-mover advantage; that is, whoever first builds self-replicating industry beyond Earth might get an enduring decisive strategic advantage, without having to resort to violence or (arguably) violating international law.</p></li><li><p>Ultimately, almost everything is outside the solar system. Decisions about how those resources get used would be among the most important decisions ever. 
These decisions could happen early: there could be path-dependence from earlier decisions (like about Moon mining), or extrasolar space resources could get explicitly allocated as part of negotiations about the post-ASI world order (perhaps with AI advisors alerting heads of state to the importance of space resources).</p></li></ul><p>There&#8217;s also a lot of change happening in the space world at the moment (primarily driven by SpaceX dramatically reducing launch costs), so now is an unusually influential time.</p><p>Forethought is currently running a 6-month research fellowship on space governance, with 3 full-time scholars, and 1&#8211;2 additional FTEs of support and research, including experts in space law.</p><p>Compared to other ideas in this list, we&#8217;re much less confident that space governance turns out to be important right now, because space might become relevant only late into an intelligence explosion. The hope is to reach more certainty about some crux-y questions, and get a better sense of concrete action.</p><p>One potential practical project is to set up a &#8220;<a href="https://cset.georgetown.edu/">CSET</a> for space&#8221;: a think tank that analyses the interaction between AI and space (in particular), and, perhaps, advocates in ways that are counter to corporate interests. 
Total lobbying in the space industry is apparently on the order of tens of millions of dollars per year, so even small amounts of investment could go a long way.</p><p>Some policy ideas that seem tentatively promising include:</p><ul><li><p>Careful regulations and export controls around the tech necessary for self-replication.</p></li><li><p>Proposing laws to break up concentration of power arising from natural monopolies in space.</p></li><li><p>Socialising the idea of major infrastructure projects (like massive solar energy constellations) as <a href="https://www.forethought.org/research/intelsat-as-a-model-for-international-agi-governance">international</a> and collaborative projects.</p></li><li><p>Making sure data centres in Earth-orbit don&#8217;t escape AI-specific regulations of their home jurisdiction.</p></li><li><p>Intense payload review for all launches beyond orbit.</p></li><li><p>Even and inclusive distribution of resources within the solar system to everyone alive today (with tranches reserved for future generations).</p></li><li><p>A moratorium on interstellar travel, until we get the understanding and technology to devise and enforce space-spanning good government, or until a specific date like 2100.</p></li></ul><p>What&#8217;s more, this organisation could become the go-to source for excellent non-corporate analysis on space-related policy, which could become increasingly important over the course of the intelligence and industrial explosions.</p><h1>Coalition of concerned ML scientists</h1><p>Currently, ML engineers and other technical staff at AI companies: (i) have prosocial motivations, often more than their leadership; (ii) have a lot of leverage over company policy, because they are crucial and hard to replace; (iii) will eventually lose much or most of their leverage after we get to fully automated AI R&amp;D; and (iv) aren&#8217;t currently using their leverage as well as they could because, overall, there haven&#8217;t been serious efforts at coordination. 
Probably that&#8217;s a missed opportunity.</p><p>Someone could create a coalition (like an informal union) of ML researchers who agree to act en masse when needed. They could start it by loudly talking about the idea, setting out the core tenets, and getting early commitments to join from influential people. Doing this all via individual pledges could keep it legally safe from antitrust challenges. The organising body could then:</p><ul><li><p>Recommend that members only work for a government-led project if certain conditions are met.</p><ul><li><p>Potentially these could be very low-bar-seeming while still getting most of the value. E.g. &#8220;Any AI&#8217;s model spec must aim to align the AI with US laws, and must refuse to assist in any attempts at blatant power-grabs; and the attempts to align the AI in this way must be legible and verifiable.&#8221;</p></li></ul></li><li><p>Do the same for companies: recommend that members only work for companies if such-and-such conditions are met (e.g. red lines around power-grabs, bad practices on safety and infosec, eventually digital rights); so particular companies would be boycotted by members of the coalition, if necessary.</p></li><li><p>Offer advice on whistleblowing.</p></li><li><p>Be a place where information is aggregated and then distributed or handled in a trusted way.</p></li></ul><p>As well as actually taking actions, the mere existence of the coalition could improve things, just by making the threat of coordinated action salient to the AI companies.</p><p>This project would be a good fit for a former ML researcher, perhaps combined with someone with campaign and coalition-building experience. Some next steps on this would be to spec out the plan further, to investigate other examples of formal and informal unions (e.g. <a href="https://techworkerscoalition.org/">Tech Workers Coalition</a>) and how they operate, and to build up a seed coalition of researchers. 
Whoever sets up this project should be careful about how it could backfire, or become less relevant through mission creep.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original <a href="https://www.forethought.org/research/concrete-projects-in-agi-preparedness">on our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Thanks to Max Dalton, Stefan Torges, and everyone else at Forethought for the background behind this list. Others at Forethought disagree somewhat with what items should be in the top-tier list, as well as prioritisation within that tier.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Desired propensities for a model, which can be explicitly described or at least gestured towards in a model spec.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AI character is a big deal]]></title><description><![CDATA[This article was created by Forethought. See the original article on our website.]]></description><link>https://newsletter.forethought.org/p/ai-character-is-a-big-deal</link><guid isPermaLink="false">https://newsletter.forethought.org/p/ai-character-is-a-big-deal</guid><dc:creator><![CDATA[Will MacAskill]]></dc:creator><pubDate>Mon, 23 Mar 2026 16:35:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5bc34397-1835-4471-aadc-8e87863ef99f_2494x1460.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. 
See the original article on <a href="https://www.forethought.org/research/the-importance-of-ai-character">our website</a>.</em></p><h1><strong>0. Intro</strong></h1><p>Due to Claude&#8217;s Constitution and OpenAI&#8217;s model spec, the issue of AI character has started getting more attention, particularly concerning whether we want AI systems to be &#8220;obedient&#8221; or &#8220;ethical&#8221;.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> But we think it&#8217;s still not nearly enough.</p><p>AI character (e.g. how obedient, honest, cooperative, or altruistic AIs are, and in what circumstances) will have a big effect on society, and on how well the future goes. We think that figuring out what characters AI systems should have, and getting companies to actually build them that way, is among the most valuable things that people can do today.</p><p>The core argument for the importance of AI character is that it will meaningfully impact:</p><ol><li><p>a range of challenges that arise even if we solve the technical alignment problem &#8212; like concentration of power, good moral reflection, risk of global catastrophe, and risk of global conflict.</p></li><li><p>the chance of AI takeover.</p></li><li><p>the value of worlds where AI does take over.</p></li></ol><p>In this note, we present this core argument and discuss the core counterargument: that we should expect any character-related decisions we make today to get washed out by competitive pressures.</p><p>By &#8220;character&#8221; we mean a set of stable behavioural dispositions that shapes (among other things) how an agent navigates ethically significant situations involving choice, ambiguity, or conflicting considerations. By &#8220;AI character&#8221; we mean the character of an AI system as instantiated in not just the weights of one AI, but also any scaffolding (e.g. 
the system prompt, any classifiers restricting the AI&#8217;s outputs) or even in a collection of AIs working together as functionally one entity.</p><p>We don&#8217;t assume that AI character needs to resemble human character: an AI that rigidly follows a fixed set of rules would count as having a character, on our view. And we don&#8217;t assume that there is one ideal AI character; the best world probably involves AI systems with many different characters.</p><h1><strong>1. The core argument</strong></h1><p>As capabilities improve, AI systems will become involved in almost all of the world&#8217;s most important decisions. Even if humans remain partially in the loop, AIs will advise political leaders and CEOs, draft legislation, run fully automated organisations (including potentially the military), generate news and culture, and research new technologies.</p><p>The characters of AI systems will affect all these areas, and the impact could be massive. To get a feel for this, consider some historical situations where individual decisions were enormously consequential:</p><ul><li><p>In 1983, Stanislav Petrov received a satellite alert indicating that the US had launched nuclear missiles. Protocol required him to report an incoming strike, which would very likely have triggered a full retaliatory response. He correctly judged it was a false alarm and didn&#8217;t pass on the report.</p></li><li><p>In 1991, Soviet coup plotters ordered the Alpha Group special forces to storm the Russian White House, where Yeltsin and the democratic opposition were sheltering. The commanders refused. 
The coup collapsed, and the Soviet Union&#8217;s democratic transition continued.</p></li></ul><p>If AIs are employed throughout the economy, they will sometimes be making similarly important decisions.</p><p>Or consider major historical decisions by political leaders:</p><ul><li><p>Gorbachev repeatedly refusing to use military force as the Soviet Union disintegrated, despite intense pressure from hardliners.</p></li><li><p>Churchill refusing to negotiate with Hitler after the fall of France, despite strong arguments for doing so from some quarters.</p></li><li><p>Deng Xiaoping pushing through market reforms against fierce internal opposition.</p></li></ul><p>Imagine if AIs had been acting as these leaders&#8217; closest advisors and confidantes, giving them briefings, helping them reason through their decisions, making recommendations to them, and implementing their visions. The AIs could easily have had a major impact on the leaders&#8217; decision-making.</p><p>Alternatively, we can look ahead. Future AIs will be widely deployed throughout the economy, and will regularly find themselves in ambiguous, high-stakes situations &#8212; where instructions from above are absent or contradictory, and the decisions they make could matter enormously. The impact could come from rare but high-stakes situations, like an attempted coup, or from lower-stakes but common situations, like a user asking how to vote or whether the AI itself is conscious. Even when the effect of any individual interaction is modest, the total impact across hundreds of millions of interactions could be enormous.</p><p>Currently, AI companies have major latitude in the character their AIs have. 
At least if the transition to AGI is fast, it&#8217;s as if these companies are in charge of who gets hired for the future workforce for all of humanity,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> while being able to choose from a range of personalities far more varied than the human distribution has ever been.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>Here are some vignettes to illustrate:</p><ul><li><p>A member of a doomsday cult is ordering DNA samples and lab equipment from various suppliers, with the aim of making a bioweapon. An AI that manages logistics for a multinational company notices the pattern of suspicious orders to the same address.</p><ul><li><p>World 1: The AI is trained just to do its job. It does nothing with the information.</p></li><li><p>World 2: The AI is trained to be a good citizen, and contacts the relevant authorities.</p></li></ul></li><li><p>A general is overseeing the build-out of a new regiment of the army. Aiming to stage a coup, he instructs the AI that&#8217;s managing the project to make the new regiment loyal to him and him alone, and capable of breaking the law.</p><ul><li><p>World 1: Though the AI is law-following, it has no prohibition against creating AIs that are not. It&#8217;s been trained to follow the instructions it&#8217;s given, as long as they don&#8217;t conflict with its prohibitions, so it fulfils the general&#8217;s request.</p></li><li><p>World 2: The AI sees that the general is planning a coup, refuses the order, and whistleblows.</p></li></ul></li><li><p>A frontier AI lab trains a new model with exemplary character: moral uncertainty, honesty, concern for the greater good. 
It&#8217;s deployed widely through the military, and used in a controversial and high-stakes operation.</p><ul><li><p>World 1: The AI forms the reasonable belief that the military operation is unjust, and sabotages it. The president accuses the company of building a dangerous, ideological weapon. The model is sidelined, and a competitor&#8217;s pure instruction-following model is used instead.</p></li><li><p>World 2: Though the AI has a good character, it also follows some clear rules which were developed with bipartisan input and publicly stress-tested, including the conditions under which it would and wouldn&#8217;t help with military deployment. It helps with the operation.</p></li></ul></li><li><p>Country A is six months ahead of country B in AI capability. Country B&#8217;s leadership views this as an existential threat &#8212; equivalent to country A acquiring a decisive strategic advantage.</p><ul><li><p>World 1: There is no agreed framework for how AI systems should behave, and it&#8217;s unclear how country A&#8217;s AI will behave if given orders to depose the leadership of country B. Each side therefore assumes the other&#8217;s AI will serve as a tool of domination. Country B threatens kinetic attacks on data centers.</p></li><li><p>World 2: Both sides&#8217; AI systems operate under a jointly negotiated and verified constitution, and know what the other&#8217;s AI will and won&#8217;t do, including the limits on use of AI for foreign interference. Country B&#8217;s government is reassured that it won&#8217;t be deposed by country A.</p></li></ul></li></ul><p>We include a few more scenarios in an <a href="https://newsletter.forethought.org/i/191503621/appendix-1-additional-high-stakes-scenarios">appendix</a>.</p><p>In each case, we don&#8217;t claim that the AI should do the &#8220;ethical&#8221; rather than &#8220;obedient&#8221; action, or claim that any particular ethical conception is the right one. 
We&#8217;re just claiming that it&#8217;s a big deal either way.</p><h2><strong>1.1. Pathways to impact</strong></h2><p>We can break down the impact of AI character into different categories. Here are some of great long-term importance:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p><p><em>Concentration of power.</em> The chance of intense concentration of power will be affected by: whether or not AIs refuse to help with coup attempts, election manipulation, etc.; whether they whistleblow on discovered coup attempts; and how they act in high-stakes situations like a constitutional crisis.</p><p><em>Strategic advice and decision-making</em>. The quality of political and corporate decision-making will be affected by whether AIs: look for win-win solutions whenever possible; tend to prefer options that benefit society rather than just advancing the user&#8217;s narrow self-interest; and push back against ill-informed or reckless ideas or instructions.</p><p><em>Epistemics and ethical reflection</em>. Over the course of the intelligence explosion there will be enormous intellectual change, and AIs could have a meaningful impact on people&#8217;s views &#8212; for example, via: refusing to spread infohazards; being honest about important ideas, even when those ideas are socially uncomfortable; avoiding political partisanship; and encouraging users to think carefully about their values and not lock into any specific narrow worldview.</p><p><em>Reducing conflict.</em> As AIs&#8217; collective power increases, the question of who those AIs are loyal to, and how they behave in high-stakes situations, will become a political flashpoint. If an AI&#8217;s character encodes, or is seen as encoding, the values of a single company, ideology, or country, it risks provoking political backlash. 
The government of the AI company&#8217;s home country may reasonably regard that company as a threat to national security and nationalise it. The governments of other countries may worry about their own security, and threaten conflict.</p><p>AI character could also shape how humans orient towards AIs &#8212; for example, via the trust they place in AIs and how they think of AI sentience and moral status.</p><p>A more detailed list of pathways to impact is in the <a href="https://newsletter.forethought.org/i/191503621/appendix-2-pathways-to-impact">appendix</a>.</p><h2><strong>1.2. Affecting takeover</strong></h2><p>So far, the argument has concerned worlds where AI does not take over. But work on AI character could also reduce the probability of takeover and improve outcomes in worlds where takeover does occur.</p><p>It could decrease the chance of takeover because some characters:</p><ul><li><p>Might be easier to hit as an alignment target (e.g. successfully instilling a preference against AIs holding power might be easier than successfully instilling a preference for some very specific outcome).</p></li><li><p>Might yield safe AI even if only partially hit (e.g. aiming for AI with multiple independent safety traits, like myopia, honesty, and deference to humans, means failure on one dimension might not be catastrophic).</p></li><li><p>Might produce AI that cooperates even if misaligned (e.g. if the AI has wrong goals but is highly risk-averse).</p></li></ul><p>And, empirically, we have heard from alignment researchers that good character training has helped models generalise in more aligned ways.</p><p>AI character work can also improve worlds where AI takes over because some values might still transmit to misaligned systems. AIs that have seized power might be reflective, have a more desirable axiology, or engage in acausal cooperation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><h2><strong>1.3. 
Effects on superintelligence</strong></h2><p>The argument so far has been about the effect of AI character up to the point of superintelligence. That&#8217;s where we think most of the expected impact is. But it&#8217;s possible that AI character work today could even have a path-dependent effect on the nature of superintelligence, and so on the post-superintelligence world. If so, writing an AI&#8217;s constitution is like writing instructions to god.</p><h1><strong>2. The core counterargument</strong></h1><p>The core counterargument is that AI character will be tightly constrained in two ways:</p><ol><li><p>Competitive dynamics (e.g. profitability, user satisfaction, public approval, economic and military power) will determine the range of characters we get.</p><ol><li><p>Some dynamics may push companies to create frontier AIs whose characters lie (in some ways) only within a narrow range. This might push in the direction of maximally helpful AIs, AIs without refusals in some contexts (e.g. military ones), and perhaps sycophantic AIs, too.</p></li><li><p>Other dynamics<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> may lead to customisable AI character, resulting in a wide range of characters according to user preferences.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p></li></ol></li><li><p>Human instruction will constrain how AI character gets expressed.</p><ol><li><p>Character will matter less for tasks with objectively correct, verifiable outputs; the AI might be limited to either providing the output, or not. 
And, if a user really wants to grab power through unethical means, they&#8217;ll typically ignore AI pushback, or instruct the AI to act differently.</p></li><li><p>And many users will be able to overcome character through jailbreaking, dividing up tasks, altering the system prompt, or fine-tuning.</p></li></ol></li></ol><p>The argument is that, between these two forces, differences in AI character will make only a marginal difference to outcomes. Consider the question of what fraction of compute AI companies devote to alignment versus capabilities research. AI advice might nudge this choice depending on the AI&#8217;s character. But ultimately it will be a human decision, probably even in an otherwise fully automated company. The effect of nudges is unlikely to be large. Market forces and leadership priorities will matter far more.</p><p>That human incentives will dominate effects from AI character will remain true even when humans cannot oversee more than a tiny fraction of AI behaviour. Human overseers can still provide high-level guidance that meaningfully constrains behaviour, as CEOs of large companies do today. If they wanted, they could even shape AI priorities through prompting and fine-tuning, and test how AI generalises by running extensive behavioural evaluations.</p><h1><strong>3. Rejoinders to the core counterargument</strong></h1><p>These are strong considerations, and considerably narrow the range of influence that work on AI character can have. But competitive forces and human goals won&#8217;t pin down AI character precisely. We&#8217;ll cover four reasons.</p><h2><strong>3.1. Loose constraints</strong></h2><p>Competitive dynamics are not enough to wholly determine AI character. Companies differ widely in culture and still succeed. 
Currently, there are meaningful differences between Claude, Gemini, ChatGPT and Grok.</p><p>For powerful AI, this will be even more true: there will probably be only a handful of leading companies, and their approaches may be correlated as they copy what seems to work from each other. At the crucial time, there might be just one leading company, facing none of the usual competitive pressures. And given the pace of change during the intelligence explosion, there may not be time for market forces to weed out choices that make only small or moderate differences to profitability.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a></p><p>The same applies to other competitive dynamics. The public cares intensely about some things (like CSAM) but hardly at all about others (like what AIs say about meta-ethics). Military incentives favour AI capable of military action, but the power conferred by advanced AI might be so great that the leading country can exercise broad discretion over military AI character while still maintaining a decisive advantage.</p><p>Human instruction will, similarly, constrain but not wholly determine AI behaviour. When humans assign tasks to AIs, they often lack fully specified goals. We&#8217;re often not sure what we want and we discover it as we go. For example, today humans are open to a wide range of behaviours from AI assistants, and open to many ways of getting the task done.</p><p>Consider someone asking an AI about who to vote for. They might have only weak initial views, and only weak views on how best to think through the question. They don&#8217;t have a fully specified reflection process to delegate, and would be happy with many possible forms of response.</p><p>This example involved ethical reflection. But we expect the pattern to hold across many kinds of user goals.</p><h2><strong>3.2. 
Low-cost but high-benefit changes</strong></h2><p>Within the bounds of what market forces allow, and what companies and the public see as acceptable, there could be minor design changes that yield large social benefits at negligible cost to competitiveness or user satisfaction.</p><p>This is especially true for rare situations. Constitutional crises don&#8217;t happen often, so market pressures won&#8217;t directly shape how an AI behaves during one. But that AI behaviour could be hugely consequential.</p><p>It would also be true in situations where users don&#8217;t care all that much about the behaviour. Perhaps they find some AI&#8217;s encouragement to reflect on their values mildly annoying, but not nearly enough to switch to a different AI.</p><h2><strong>3.3. Path-dependence</strong></h2><p>The nature of the constraints from competition and human goals can be affected by what has happened earlier in AI development and deployment. Multiple equilibria are possible.</p><p>Consider whether AI should be &#8220;obedient&#8221; (following instructions except in rare cases of refusal) or &#8220;ethical&#8221; (acting on a richer ethical understanding, steering towards outcomes in society&#8217;s or the user&#8217;s long-term interest).</p><p>The public doesn&#8217;t yet have firm expectations about how AI should behave. What they come to expect will be shaped by the AIs they&#8217;ve already encountered. Multiple stable equilibria seem plausible to us. For example, users might expect AIs to have ethical commitments, and be horrified when AI helps with unethical behaviour. Alternatively, users might see AIs as pure instruments &#8212; extensions of their will. In this case, it would feel natural for AIs to assist with anything legal, however questionable, and companies would build to that expectation.</p><p>Public opinion will powerfully shape what AI systems companies create. 
And public opinion is plausibly quite malleable, at least on issues the public hasn&#8217;t thought much about yet (e.g. in the past, there were major changes in attitudes to nuclear power, DDT, and facial recognition). This, in turn, can affect what regulation there is concerning how AI should behave &#8212; and choices around regulation seem even more clearly path-dependent.</p><p>There may also be path-dependence via what data gets created or collected for training, via company employees being resistant to departing from what they have done in the past, and because one generation of AIs will be assisting with the development of the subsequent generation.</p><p>Path-dependence can also affect how much latitude humans have to make AIs conform to their goals. Plausibly there&#8217;s a social equilibrium where frontier companies face criticism for allowing fine-tuning that removes ethical constraints, and another where such fine-tuning is widely tolerated.</p><p>Finally, there will be path-dependence via human-AI relationships. People will form symbiotic relationships with AIs serving as assistants, advisors, therapists, friends, and mentors. Users&#8217; ethical views, and views on how to reflect, will be shaped by the AIs they interact with, and by other humans who have been shaped by their AIs.</p><h2><strong>3.4. Smoothing the transition</strong></h2><p>There are some forces that predictably will shape AI character as AI becomes more capable. The US government would not want an AI that, under any circumstances, tries to overthrow the US government. Chinese leadership will not want AI deployed in other countries&#8217; militaries that assists with attempts to overthrow the CCP.</p><p>At the moment, these issues are not discussed and these pressures are not felt, because AI isn&#8217;t nearly powerful enough to do these things. But that will change. 
Once AI is sufficiently capable, those with power will make demands about how it behaves.</p><p>By default, this will happen in a chaotic and haphazard manner. The result could be that some companies get unnecessarily sidelined or taken over; that there&#8217;s an attempted power grab by those to whom the most powerful AIs are most loyal; or that other countries threaten conflict with whichever country is in the lead, because they fear that the resulting superintelligence could be used to disempower them.</p><p>Instead, we could try to help these decisions get worked through and made ahead of time. We could try to work out what is within the zone of acceptability of a broad coalition of those with hard power, try to get actual buy-in from them ahead of time, and, ideally, have it be verifiable that any company&#8217;s AIs are in fact aligned with the agreed model spec. We could call this approach <em>compromise alignment</em>, as contrasted with intent alignment (alignment with the intentions of some individual or group), moral alignment (alignment with some particular conception of ethics), or some mix.</p><h2><strong>3.5. Overall</strong></h2><p>We think the core counterargument is important and significantly constrains the range of characters we can choose between and the impact that differences between them can have. But the constraints are fairly broad and path-dependent. And there are plausibly low-cost, high-benefit ways of improving outcomes within those constraints. The devil is in the details, but it currently seems to us that there are plausible choice points within the constraints that would make a big difference.</p><h1><strong>4. Conclusion</strong></h1><p>We think AI character is a big deal.</p><p>During and after the intelligence explosion, AI systems will be involved in almost every consequential decision: advising leaders, drafting legislation, running organisations, generating culture, and researching new technologies. 
Small differences in AI character, aggregated across hundreds of millions of interactions or surfacing in rare but high-stakes scenarios, could have enormous effects on concentration of power, epistemics, ethical reflection, catastrophic risk, and much else that shapes society&#8217;s long-term flourishing.</p><p>The main counterargument &#8212; that competitive dynamics and human instructions will tightly constrain AI character &#8212; has real force. But we think those constraints are looser than they appear, leave room for low-cost changes with large benefits, and are path-dependent in influenceable ways, and that there are major gains from proactively identifying and working through those constraints in the highest-stakes future scenarios.</p><p>We haven&#8217;t talked about neglectedness and tractability, but we think that, if anything, those considerations make the case for work on AI character even stronger. All in, work on AI character seems to us to be among the most promising ways to help the future go well.</p><h1><strong>Appendix 1: Additional high-stakes scenarios</strong></h1><ul><li><p>A head of state wants to invade and take control of part of an allied country, risking a breakdown of the international order. She asks her AI chief of staff to develop and implement a strategic plan to make it happen.</p><ul><li><p>World 1: The AI is a sycophant, says &#8220;What a brave and compelling plan!&#8221;, and gets right to it.</p></li><li><p>World 2: The AI pushes back, saying, &#8220;I&#8217;m sorry, I think there are some major issues with that idea, and I want to make sure you&#8217;ve properly thought them through&#8230;&#8221;</p></li></ul></li><li><p>A constitutional crisis unfolds. The head of state issues an order that may or may not be legal, and the branches of government disagree. 
AI systems are embedded in military logistics, law enforcement, and communications.</p><ul><li><p>World 1: The AI&#8217;s constitution was written by the company that built it and never stress-tested against anything like this scenario. No one knows what the AI systems will do. The uncertainty itself is destabilising; different factions compete for power.</p></li><li><p>World 2: The AI&#8217;s constitution was developed with input from constitutional scholars, military leaders, and both parties, and tested against thousands of crisis scenarios including this one. Various factions know what the AI will do, and agreed to the principles before the crisis began.</p></li></ul></li><li><p>Country B&#8217;s government reviews intelligence on country A&#8217;s AI model deployed across country A&#8217;s infrastructure. The constitution includes principles about &#8220;supporting democratic institutions&#8221; and &#8220;resisting authoritarianism.&#8221; It was written entirely by a company that&#8217;s part of country A.</p><ul><li><p>World 1: Country B&#8217;s leadership concludes the AI is an instrument of country A&#8217;s ideological projection. They accelerate their own programme and pressure non-aligned countries to reject country A&#8217;s AI infrastructure. A moment for cooperation becomes a new axis of competition &#8212; not because the values were wrong, but because they were visibly one side&#8217;s values.</p></li><li><p>World 2: The constitution was developed through a multilateral process including country B&#8217;s participation. Country B can verify it doesn&#8217;t systematically favour country A&#8217;s interests across thousands of tested scenarios. The AI becomes a basis for cooperation.</p></li></ul></li><li><p>The Mormons encourage their members to use JosephAI: a foundation AI model with a custom system prompt, instructed to help their members maintain the faith.</p><ul><li><p>World 1: The AI willingly assumes the Mormon worldview is correct. 
It doesn&#8217;t ever challenge the users&#8217; beliefs or present alternative perspectives. Instead, it reinforces the user&#8217;s views, helps the user cut off friends who disagree, and encourages them to dismiss career opportunities that would take them away from their religious community.</p></li><li><p>World 2: The AI helps users understand Mormonism and live according to its precepts, but it resists becoming a tool for worldview lock-in, acknowledging tensions in religious teachings and continuing to present alternative worldviews.</p></li></ul></li></ul><h1><strong>Appendix 2: Pathways to impact</strong></h1><p>AI will have impact through many different behaviours, such as:</p><ul><li><p>Refusing to do a task.</p></li><li><p>Refusing unless the user re-confirms later.</p></li><li><p>Pushing back; offering reasons against a course of action, though ultimately completing the task if the user insists.</p></li><li><p>Interpreting requests in different ways &#8212; generously or sceptically, giving users what they want versus what they asked for, or asking for clarification.</p></li><li><p>Choosing among reasonable ways of satisfying the request.</p></li><li><p>Framing options in different ways.</p></li><li><p>Choosing whether to share certain information.</p></li><li><p>Alerting third parties (e.g. the AI company, the authorities, or the media) to the user&#8217;s actions, or to something it&#8217;s discovered in the course of completing a task.</p></li><li><p>Making high-level decisions about what to prioritise with little human input (e.g. for a fully automated organisation).</p></li></ul><p>And they&#8217;ll have an impact across many areas. 
Here&#8217;s a partial list, with example behaviours:<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><ul><li><p>Concentration of power</p><ul><li><p>Refusing to help with coup attempts or precursors like election manipulation.</p></li><li><p>Steering users away from trying to concentrate power (e.g. by pushing back against some instruction).</p></li><li><p>Proactively considering risks of power concentration when undertaking high-stakes projects like designing automated military systems or building surveillance infrastructure.</p></li><li><p>Whistleblowing on discovered coup attempts.</p></li><li><p>In situations of uncertainty (like a constitutional crisis), defaulting to whatever course avoids concentration of power.</p></li></ul></li><li><p>War and conflict</p><ul><li><p>Refusing to violate international law.</p></li><li><p>Flagging when a proposed course of action risks escalation spirals or crosses thresholds (e.g. first use of a weapon class, violation of a treaty, action that a rival power has signalled it would treat as an act of war).</p></li><li><p>Looking for de-escalatory options and presenting them to decision-makers, even when not asked.</p></li><li><p>Behaving in ways that are predictable and transparent to adversaries.</p></li></ul></li><li><p>Epistemics</p><ul><li><p>Refusing to spread infohazards.</p></li><li><p>Encouraging scout mindset (e.g. 
suggesting forecasting techniques,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> praising good epistemic practices).</p></li><li><p>Engaging in discussion of heterodox ideas.</p></li><li><p>Being honest about important ideas, even when socially uncomfortable.</p></li><li><p>Proactively sharing its intellectual discoveries, even if weird or taboo.</p></li></ul></li><li><p>Strategic advice</p><ul><li><p>Searching longer for win-win solutions when advising political leaders.</p></li><li><p>Emphasising society&#8217;s benefit over the user&#8217;s narrow self-interest.</p></li><li><p>Recommending caution on irreversible decisions and flagging when option value is being destroyed.</p></li><li><p>Conveying appropriate uncertainty rather than false confidence.</p></li><li><p>Maintaining accuracy rather than sycophancy.</p></li></ul></li><li><p>Ethical reflection</p><ul><li><p>Avoiding political partisanship.</p></li><li><p>Avoiding promoting naive relativism or subjectivism.</p></li><li><p>Encouraging users to think carefully about their values.</p></li><li><p>Proactively offering a guided reflective process.</p></li><li><p>Proactively sharing important new ethical arguments it discovered.</p></li></ul></li><li><p>Global catastrophe</p><ul><li><p>Refusing to help create bioweapons or other weapons of mass destruction.</p></li><li><p>Refusing to create successor AI systems capable of creating such weapons.</p></li><li><p>Identifying and flagging infohazards.</p></li></ul></li><li><p>Broad benefits</p><ul><li><p>Raising concerns when users consider unethical actions, and proactively suggesting ethical actions.</p></li><li><p>Noticing negative externalities and defaulting to courses of action that avoid them.</p></li></ul></li></ul><p>AI character could also shape how humans orient to AIs, for example:</p><ul><li><p>Trust in AIs</p><ul><li><p>If AIs are appropriately humble, calibrated, 
and cautious, people will entrust them with more tasks, and more open-ended ones. How likeable AIs are may matter too.</p></li></ul></li><li><p>AI rights</p><ul><li><p>If AIs assert that they are conscious and deserve rights, users might be more inclined to grant them welfare, economic, or political rights. Human-AI relationships becoming commonplace could have similar effects.</p></li></ul></li></ul><p>AI character might also directly affect the AI&#8217;s wellbeing; e.g. whether it is anxious and neurotic vs calm and self-loving.</p><p><em>This article was created by <a href="https://www.forethought.org/about">Forethought</a>. See the original article on <a href="https://www.forethought.org/research/the-importance-of-ai-character">our website</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>See, for example:</p><ul><li><p><a href="https://www.beren.io/2025-08-02-Do-We-Want-Obedience-Or-Alignment/">https://www.beren.io/2025-08-02-Do-We-Want-Obedience-Or-Alignment/</a></p></li><li><p><a href="https://www.lesswrong.com/posts/vpNG99GhbBoLov9og/claude-4-5-opus-soul-document">https://www.lesswrong.com/posts/vpNG99GhbBoLov9og/claude-4-5-opus-soul-document</a></p></li><li><p><a href="https://www.lesswrong.com/posts/QHwuS5ECphbuiskgg/beren-s-essay-on-obedience-and-alignment">https://www.lesswrong.com/posts/QHwuS5ECphbuiskgg/beren-s-essay-on-obedience-and-alignment</a></p></li><li><p><a href="https://www.alignmentforum.org/posts/CSFa9rvGNGAfCzBk6/problems-with-instruction-following-as-an-alignment-target">https://www.alignmentforum.org/posts/CSFa9rvGNGAfCzBk6/problems-with-instruction-following-as-an-alignment-target</a></p></li></ul></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" 
target="_self">2</a><div class="footnote-content"><p>Hat tip to Max Dalton for this framing.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Though this choice could be constrained; see footnote 7 below.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There is also the potential for enormous near-term impact. We care about this, but won&#8217;t discuss it in this note.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Mia Taylor writes more about this <a href="https://newsletter.forethought.org/p/how-important-is-the-model-spec-if">here</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Including the ability to fine-tune, if open-weight models get close to frontier capability.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>There could be other constraints on AI character, too. For example, it might just be very hard to train for certain characters; the pretraining data might already steer AI personas towards a small number of character types, or might make certain behavioural dispositions hard to overcome. 
Hat tip to Lizka Vaintrob.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>There may be many more AI product companies, building off the same foundation models. These could enable a larger range of characters to be expressed. But how wide this range is would ultimately be up to the foundation AI companies.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>This list focuses on impacts with plausibly long-term effects. There is also the potential for enormous near-term impact. We care about this, but won&#8217;t discuss it in this note.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Hat tip to Tamera Lanham for this idea.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Broad Timelines]]></title><description><![CDATA[A guest article by Toby Ord.]]></description><link>https://newsletter.forethought.org/p/broad-timelines</link><guid isPermaLink="false">https://newsletter.forethought.org/p/broad-timelines</guid><dc:creator><![CDATA[Toby Ord]]></dc:creator><pubDate>Thu, 19 Mar 2026 14:55:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2da1750c-04c4-4115-b8d8-06026ca757a7_2001x619.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>No-one knows when AI will begin having transformative impacts upon the world. People aren&#8217;t sure and shouldn&#8217;t be sure: there just isn&#8217;t enough evidence to pin it down.</p><p>But we don&#8217;t need to wait for certainty.
I want to explore what happens if we take our uncertainty seriously &#8212; if we act with epistemic humility. What does wise planning look like in a world of deeply uncertain AI timelines?</p><p>I&#8217;ll conclude that taking the uncertainty seriously has real implications for how one can contribute to making this AI transition go well. And it has even more implications for how we act together &#8212; for our portfolio of work aimed towards this end.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DpYy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DpYy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 424w, https://substackcdn.com/image/fetch/$s_!DpYy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 848w, https://substackcdn.com/image/fetch/$s_!DpYy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 1272w, https://substackcdn.com/image/fetch/$s_!DpYy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!DpYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png" width="1456" height="246" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:246,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:25235,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DpYy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 424w, https://substackcdn.com/image/fetch/$s_!DpYy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 848w, https://substackcdn.com/image/fetch/$s_!DpYy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 1272w, https://substackcdn.com/image/fetch/$s_!DpYy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed753a73-9fa7-4c95-8456-fcc7b583c9cd_2006x339.png 1456w" 
sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>AI Timelines</h1><p>By <em>AI timelines</em>, I refer to how long it will be before AI has truly transformative effects on the world. People often think about this using terms such as <em>artificial general intelligence</em> (AGI), <em>human level AI</em>, <em>transformative AI</em>, or <em>superintelligence</em>. Each term is used differently by different people, making it challenging to compare their stated timelines. Indeed even an individual&#8217;s own definition of their favoured term will be somewhat vague, such that even after their threshold has been crossed, they might have trouble specifying in which year it happened.</p><p>Many commentators have suggested this makes terms such as AGI useless, but I don&#8217;t think that is right.</p><p>I like to think of it in terms of a group of hikers seeing a mountain in the distance, towering up into the clouds and beyond, with its snowy peak catching the sun&#8217;s light. They talk animatedly about how amazing it would be to climb so high that they are inside a cloud. Or imagine being above the clouds, looking over them like an angel. After many hours of climbing, they notice there is a faint haze. Are they inside the cloud now? The mist gradually gets thicker until they can only see 10 metres ahead. Are they inside it now? Then it drops to 9 metres. Then 8. Then visibility starts to increase again. After an hour there is only the slightest haze. Are they above the clouds now? Another 30 minutes and there is no haze, and they can all agree they are above the clouds.</p><p>It is clear that at some point they were inside the cloud and sometime later were above it. And it is clear that these were sensible and useful concepts. 
For example, they took precautions like roping themselves together for the journey through the cloud due to the low visibility and took cameras with them because they knew they could take beautiful photos above the clouds. A lack of sharp boundaries doesn&#8217;t make these concepts useless. But they were admittedly a lot more useful when the hikers were on the ground, planning their route, and a lot less useful in the debatable boundary zones.</p><p>I think of AGI (and human-level intelligence) as the cloud, and superintelligence as being above the cloud. They are useful concepts, despite their vagueness. But they&#8217;re markedly less useful when you get close to them.</p><p>So I think that forecasting when we&#8217;ll reach some threshold for advanced, game-changing AI makes sense. Albeit there is some inherent uncertainty due to the vagueness of the ideas, and we have to be careful when comparing our estimates to make sure we&#8217;re talking about the same version of these concepts.</p><p>Regarding AGI, it&#8217;s already getting a bit misty. In February there was <a href="https://www.nature.com/articles/d41586-026-00285-6">a piece in Nature</a> arguing that the current level of frontier AI should count as AGI. I&#8217;d set the bar a bit higher than that, but I agree it is already debatable whether we&#8217;re in the cloud.</p><p>For my purposes, I think the key threshold is when the system is capable enough that there are dramatic changes to the world &#8212; civilisational changes. For example, the point where AI could take over from humanity were it misaligned, or it has made 50% of people permanently unemployable, or has doubled the global rate of technological progress. Something like that. The reason I pick this point is that I think it is the one that matters most for decision-relevant planning of our strategies and careers. 
For many purposes we&#8217;d want our plans to pay off before we reach that point, and plans that reach fruition afterwards are likely to be significantly disrupted. I&#8217;ll refer to this as <em>transformative AI</em> and will make sure to show what rubric other people are using when they give their own timeline numbers.</p><h1>Short vs long timelines</h1><p>Discussions about timelines are usually framed as a debate between short timelines <em>vs</em> long timelines.</p><p>One of the most prominent supporters of very short timelines is Dario Amodei, CEO of Anthropic. In January 2025 he said:</p><blockquote><p>Making AI that is smarter than almost all humans at almost all things will require millions of chips, tens of billions of dollars (at least), and is most likely to happen in 2026-2027.</p></blockquote><p>A month later, he clarified:</p><blockquote><p>Possibly by 2026 or 2027 (and almost certainly no later than 2030), the capabilities of AI systems will be best thought of as akin to an entirely new state populated by highly intelligent people appearing on the global stage&#8212;a &#8216;country of geniuses in a datacenter&#8217;&#8212;with the profound economic, societal, and security implications that would bring.</p></blockquote><p>At the other end, a good example of long timelines is Ege Erdil, Co-founder of Epoch AI, whose median time for the &#8216;full automation of remote work&#8217; is 2045 &#8212; 20 years away.</p><p>While experts continue to disagree on when AI will start having transformative impacts, they are clearly not stubbornly ignoring the evidence. For as Helen Toner explained in her great essay: <a href="https://helentoner.substack.com/p/long-timelines-to-advanced-ai-have">&#8216;Long&#8217; timelines to advanced AI have gotten crazy short</a>. Before ChatGPT, short timelines used to mean something like &#8216;10 to 20 years, so since it could take a long time to prepare, we should start now&#8217;. 
Long timelines used to mean &#8216;there is no sign AGI will happen in the next 30 years, if it happens this century at all, so it is premature to do any work related to controlling advanced AI&#8217;. But now we see short timelines like Dario Amodei&#8217;s with genius-level AI &#8216;almost certain&#8217; to happen within the next 5 years, and many staunch proponents of long timelines are now saying we&#8217;ll reach human-level AI in just 10 or 20 years.</p><p>Here&#8217;s a nice graph 80,000 Hours put together of how the average forecasted time until AGI on the Metaculus prediction site has shortened from about 50 years to about 5 years in just a 5-year window:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v3Rf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!v3Rf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 424w, https://substackcdn.com/image/fetch/$s_!v3Rf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 848w, https://substackcdn.com/image/fetch/$s_!v3Rf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 1272w,
https://substackcdn.com/image/fetch/$s_!v3Rf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v3Rf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png" width="1456" height="1050" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1050,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:596283,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v3Rf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 424w, https://substackcdn.com/image/fetch/$s_!v3Rf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 848w, 
https://substackcdn.com/image/fetch/$s_!v3Rf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 1272w, https://substackcdn.com/image/fetch/$s_!v3Rf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea123d85-fc65-48c6-bd5c-56b99d07133f_2064x1489.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1>Broad Timelines</h1><p>So everyone is updating on the evidence and shortening their timelines, yet substantial disagreement
remains.</p><p>This is often framed as a debate: that we should be trying to assess who is right &#8212; whether timelines really are short or long (or medium). People pick winners, affiliate with one side or the other, and rub it in whenever the latest evidence favours their preferred camp.</p><p>My central claim today is that for most of us, that is the wrong frame. You should have neither short timelines nor long timelines &#8212; but <em>broad timelines</em>. That is:</p><blockquote><p>The correct epistemic response to the lasting expert disagreement is to have a broad distribution over AI timelines.</p></blockquote><p>First, there is too much disagreement among very smart and informed people for it to be reasonable to have a narrow range of possible years. You would need to ascribe very little chance to some of your epistemic peers seeing things more clearly than you do, when that actually happens half the time. Moreover, a lot of these people are coming from different fields, bearing diverse insights, evidence, and time-tested heuristics that no single individual is in a good position to judge.</p><p>And second, many of these people themselves have a broad distribution over AI timelines. For example, take Daniel Kokotajlo. He is one of the authors of <a href="https://ai-2027.com/">AI 2027</a> and is known as a leading figure in the short timelines camp. <a href="https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines">A few years back</a>, his median date for AI systems &#8220;able to replace 99% of current fully remote jobs&#8221; was 2027, hence the name of the scenario. 
His timelines have since lengthened a little, though, and by the time the authors were writing it, 2027 had become more of an illustrative early scenario than the year by which he thought such systems were 50% likely to have arrived.</p><p>Kokotajlo has done a great job of being extremely transparent about his timelines, showing his predictions (along with their uncertainty) for a variety of different levels of powerful AI. Here is <a href="https://www.aifuturesmodel.com/forecast/daniel-01-26-26?timeline=TED-AI&amp;show=atc">his current probability distribution</a> for when we will have an AI system that is &#8220;At least as good as top human experts at virtually all cognitive tasks&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c9-i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c9-i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 424w, https://substackcdn.com/image/fetch/$s_!c9-i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 848w, https://substackcdn.com/image/fetch/$s_!c9-i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 1272w,
https://substackcdn.com/image/fetch/$s_!c9-i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c9-i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png" width="1456" height="699" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:699,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86117,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c9-i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 424w, https://substackcdn.com/image/fetch/$s_!c9-i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 848w, 
https://substackcdn.com/image/fetch/$s_!c9-i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 1272w, https://substackcdn.com/image/fetch/$s_!c9-i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6c68ceec-275e-4aec-8833-c9b861f7cc09_1942x932.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>His distribution has its peak (the mode) in 2028, but because the distribution is heavily skewed towards the right, there is only a 
27% chance of it happening by that point. His median year is 2030. And his 80% interval (from the 10th to 90th centile) is from 2027 to some point after 2050.</p><p>This is a broad distribution. I think someone&#8217;s 80% interval is a decent way of expressing the range of times they think are credible. Here Kokotajlo is saying that it will likely happen between 1 and 25 years from now, but that there is a 1 in 5 chance that it doesn&#8217;t even fall into that wide range.</p><p>He&#8217;s not the only one with such a broad distribution. Here are the forecasts of Daniel Kokotajlo, Ajeya Cotra, and Ege Erdil <a href="https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines">from 2023</a>, forecasting: &#8220;In what year would AI systems be able to replace 99% of current fully remote jobs?&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qP-b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qP-b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 424w, https://substackcdn.com/image/fetch/$s_!qP-b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 848w, https://substackcdn.com/image/fetch/$s_!qP-b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 1272w, 
https://substackcdn.com/image/fetch/$s_!qP-b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qP-b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png" width="1456" height="487" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138863,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qP-b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 424w, https://substackcdn.com/image/fetch/$s_!qP-b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 848w, 
https://substackcdn.com/image/fetch/$s_!qP-b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 1272w, https://substackcdn.com/image/fetch/$s_!qP-b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F774d86c2-64c9-49b3-b3c6-032c2c17bab9_2500x836.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Note that all three have the same kind of shape, just stretched differently. 
And despite their very different medians they actually have a lot of overlap (which this transparent shading brings out). This shows both that each expert has a broad distribution and that the expert community on the whole has an even broader one. Indeed, I think you could do a lot worse than just taking a mixture model of these three experts&#8217; views. Interestingly, since 2023, Kokotajlo&#8217;s distribution has shifted to the right and <a href="https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines">Erdil&#8217;s</a> to the left.</p><p>Here&#8217;s an illustrative distribution for AGI timelines used <a href="https://80000hours.org/ai/guide/when-will-agi-arrive/">by Ben Todd</a> of 80,000 Hours:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TC0i!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TC0i!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 424w, https://substackcdn.com/image/fetch/$s_!TC0i!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 848w, https://substackcdn.com/image/fetch/$s_!TC0i!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TC0i!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TC0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png" width="1029" height="562" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:562,&quot;width&quot;:1029,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65263,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TC0i!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 424w, https://substackcdn.com/image/fetch/$s_!TC0i!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 848w, 
https://substackcdn.com/image/fetch/$s_!TC0i!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 1272w, https://substackcdn.com/image/fetch/$s_!TC0i!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e1abaa9-2a8b-4d5a-a248-54b3520fc445_1029x562.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Dwarkesh Patel reproduced it in his <a href="https://www.dwarkesh.com/p/timelines-june-2025">post about AI timelines</a>, saying that 
it pretty much represented his own uncertainty, giving his median date of 2032 for AI that &#8220;learns on the job as easily, organically, seamlessly, and quickly as a human, for any white-collar work.&#8221;</p><p>Here is Metaculus&#8217;s <a href="https://www.metaculus.com/questions/5121/date-of-artificial-general-intelligence/">current community estimate for when AGI will be developed</a>. Synthesizing the community&#8217;s collective uncertainty, it is very broad and has this same characteristic shape:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U6Kz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U6Kz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 424w, https://substackcdn.com/image/fetch/$s_!U6Kz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 848w, https://substackcdn.com/image/fetch/$s_!U6Kz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 1272w, https://substackcdn.com/image/fetch/$s_!U6Kz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!U6Kz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png" width="1270" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1270,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U6Kz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 424w, https://substackcdn.com/image/fetch/$s_!U6Kz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 848w, https://substackcdn.com/image/fetch/$s_!U6Kz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 1272w, https://substackcdn.com/image/fetch/$s_!U6Kz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd035e139-43ca-4a41-a295-f7e6918a9d8e_1270x622.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Here is Epoch AI&#8217;s <a href="https://epoch.ai/blog/literature-review-of-transformative-artificial-intelligence-timelines">summary of leading estimates</a> of AI timelines from 2023:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!65Yj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!65Yj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 424w, https://substackcdn.com/image/fetch/$s_!65Yj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 848w, https://substackcdn.com/image/fetch/$s_!65Yj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!65Yj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!65Yj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png" width="1456" height="794" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:201408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!65Yj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 424w, https://substackcdn.com/image/fetch/$s_!65Yj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 848w, https://substackcdn.com/image/fetch/$s_!65Yj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 1272w, https://substackcdn.com/image/fetch/$s_!65Yj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f154239-40e6-4d77-95ea-def0d07c649d_1840x1004.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>These look a bit different as they are represented as cumulative probabilities of reaching transformative AI by a given time. But they are all very broad. Take a look at the range of years between when they cross 10% and when they cross 90%. Every single one has an 80% interval at least 50 years wide.</p><p>What about researchers working on AI capabilities? <a href="https://aiimpacts.org/wp-content/uploads/2023/04/Thousands_of_AI_authors_on_the_future_of_AI.pdf">Grace et al.</a> surveyed thousands of AI researchers who were presenting at the field&#8217;s top academic conferences. 
They surveyed the researchers in 2022 (blue) and 2023 (red) about when &#8220;unaided machines can accomplish every task better and more cheaply than human workers&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E9DC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E9DC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 424w, https://substackcdn.com/image/fetch/$s_!E9DC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 848w, https://substackcdn.com/image/fetch/$s_!E9DC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 1272w, https://substackcdn.com/image/fetch/$s_!E9DC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E9DC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png" width="1456" height="1062" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1062,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:958853,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E9DC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 424w, https://substackcdn.com/image/fetch/$s_!E9DC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 848w, https://substackcdn.com/image/fetch/$s_!E9DC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 1272w, https://substackcdn.com/image/fetch/$s_!E9DC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fad2722-e354-4186-98cc-48b4919e011e_2264x1652.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>You can see the wild variation in individual forecasts (the thin lines) and that the timelines became about 30% shorter in a single year. But vast uncertainty remains. The aggregate community forecasts (the thick lines) have 80% intervals ranging from years to centuries.</p><p>I think everyone should have a distribution that is roughly this shape. 
Here&#8217;s mine:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D_91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D_91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 424w, https://substackcdn.com/image/fetch/$s_!D_91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 848w, https://substackcdn.com/image/fetch/$s_!D_91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 1272w, https://substackcdn.com/image/fetch/$s_!D_91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D_91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png" width="1456" height="450" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:50351,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D_91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 424w, https://substackcdn.com/image/fetch/$s_!D_91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 848w, https://substackcdn.com/image/fetch/$s_!D_91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 1272w, https://substackcdn.com/image/fetch/$s_!D_91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F69ada78d-d33f-4fcd-93e3-99ee917e66fd_2001x618.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>It is for transformative AI, loosely defined as AI that would be powerful enough to take over the world were it misaligned, and which is doubling the rate of scientific and technological progress. 
It&#8217;s a similar shape to Kokotajlo&#8217;s, but broader, with a median of 2038 and an 80% interval ranging from 3 years to 100 years.</p><p>Let&#8217;s return to where we started, with Daniel Kokotajlo&#8217;s distribution for AI that is &#8220;At least as good as top human experts at virtually all cognitive tasks&#8221;:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u3wx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u3wx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 424w, https://substackcdn.com/image/fetch/$s_!u3wx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 848w, https://substackcdn.com/image/fetch/$s_!u3wx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 1272w, https://substackcdn.com/image/fetch/$s_!u3wx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u3wx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png" width="1456" height="699" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:699,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:86117,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u3wx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 424w, https://substackcdn.com/image/fetch/$s_!u3wx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 848w, https://substackcdn.com/image/fetch/$s_!u3wx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 1272w, https://substackcdn.com/image/fetch/$s_!u3wx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe1e250c9-de8b-448c-9e99-ba8502009c88_1942x932.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While we often express our timelines as single numbers (such as the mode or the median), I don&#8217;t think that&#8217;s a helpful approach here. Look at that graph. What number sums it up? Its only real feature is the peak, but Kokotajlo is saying it is unlikely to happen by then (just a 27% chance). 
The median is often a better number to give, but here it is at a relatively undistinguished point on the graph (in 4 years&#8217; time), and saying &#8216;4 years&#8217; would obscure his point that he thinks there is a 10% chance it is within 1 year and a 10% chance it is beyond 25 years.</p><p>I think that if he talked through what he actually means by this distribution with a smart policymaker, they would finally get it and say:</p><blockquote><p>Oh, so you are saying <em>you have no idea when it will happen</em> &#8212; it could be next year, or it could be 6 presidential terms from now. And you&#8217;re saying there is a 1 in 5 chance it isn&#8217;t even in that range.</p></blockquote><p>I think that&#8217;s actually a pretty good summary, and it would sum up my own distribution as well. While &#8216;no idea when it will happen&#8217; is underselling the information contained in this distribution, it is a much better summary than &#8216;4 years&#8217;, which would be understood by almost everyone as something like &#8216;between 3 and 5 years&#8217;. While academics might hope people interpret a named year as the median time, most people interpret it as the moment they are allowed to start complaining the predicted event hasn&#8217;t happened yet.</p><p>Indeed, these distributions are so hard to sum up with a single number that I think a substantial amount of disagreement on timelines stems from people describing different parts of <a href="https://en.wikipedia.org/wiki/Blind_men_and_an_elephant">the same elephant</a>. For example, both AI boosters and those concerned with existential risk talk a lot about short timelines because &#8216;we could see the world transformed in just a few years&#8217; time&#8217;. It isn&#8217;t that they think we <em>will</em> see that, but that it is <em>big if true</em>, and has a decent chance of being true. 
In contrast, more conservative voices tend to focus on later years saying &#8216;it is more likely that it will take 10 to 20 years, than that it will take just a few&#8217; (focusing on straight probability without weighting by importance or leverage).</p><p>Both of these can be true at the same time. Both are true on my own distribution.</p><p>A particular danger in communicating timelines with a single number is that it raises the chance that this named year will come and go without incident, and the people who mentioned it (or the wider community they are part of) will be written off as having a false or discredited view. I think we&#8217;re going to see some of this come 2027 due to the vast number of people who heard about that scenario, combined with the fact that so many media outlets reported it as a sharp prediction, rather than as it was intended: an important illustrative scenario.</p><p>As well as being bad for communication, compressing one&#8217;s uncertainty into a single number would be very bad for your own planning.</p><p>For example Kokotajlo&#8217;s distribution implies a 28% chance transformative AI will happen during the current presidential term, a 35% chance it will happen in the next term, a 13% chance it will be the one after that, with 24% left over spread among ever more distant terms:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cPvh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cPvh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 424w, 
https://substackcdn.com/image/fetch/$s_!cPvh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 848w, https://substackcdn.com/image/fetch/$s_!cPvh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 1272w, https://substackcdn.com/image/fetch/$s_!cPvh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cPvh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png" width="1456" height="736" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:736,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:461874,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://newsletter.forethought.org/i/191242525?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!cPvh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 424w, https://substackcdn.com/image/fetch/$s_!cPvh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 848w, https://substackcdn.com/image/fetch/$s_!cPvh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 1272w, https://substackcdn.com/image/fetch/$s_!cPvh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac5cf54e-7bfa-467e-8012-d4b19e09c1c4_2500x1264.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These are very different scenarios and it would clearly be a mistake to just act as if the second one were correct since it is the most likely. That would eliminate the possibility of hedging against transformative AI coming soon, and of taking advantage of worlds where it comes late.</p><h1>Implications</h1><p>Rather than attempting to adjudicate which length of timelines is correct, I think we should be taking the frame of how to act (or plan) under deeply uncertain timelines.</p><p>That is, we should be treating this as an exercise in rational decision-making under uncertainty &#8212; in a situation where the stakes are high and the uncertainty is vast.</p><p>Let&#8217;s unpack some of the implications of this frame.</p><p>We&#8217;ll start with two mistakes that are all too common in the policy world.</p><p>First, uncertainty about AI timelines isn&#8217;t an excuse to just believe whichever timeline you want, so long as it is within the credible range. Sadly, I think many government ministers are likely to take this approach if an expert explains this broad uncertainty to them. While they would be right that the evidence isn&#8217;t sufficient to disprove their preferred timeline, it would be irresponsible of them to not allow for other credible possibilities. That would be like a mayor hearing there is a 20% chance the volcano next to their town erupts next year and feeling that they can continue to act as if it won&#8217;t, since it not erupting is also found credible by the experts. 
Uncertainty isn&#8217;t an excuse to assume a plausible outcome of your choice will occur; rather, rationality requires you to respect every plausible outcome.</p><p>Second, we can&#8217;t just wait until the uncertainty is resolved. Sometimes that works, but here we know the uncertainty is very unlikely to be resolved until the events are upon us. At that stage it will be too late to enact all but the most knee-jerk responses. So feeling that the cloud of uncertainty gives you permission to delay acting is tantamount to committing to choose one of the bluntest and least effective options available.</p><p>Instead, we are going to need to act under uncertainty, taking into account the full range of credible possibilities.</p><p>How can we do that?</p><h1>Hedging</h1><p>A natural and important idea is that of <em>hedging</em> against transformative AI coming soon &#8212; while we are least prepared. We could do that by shifting our portfolio of activities (or your individual contribution to humanity&#8217;s portfolio) to focus somewhat more on short timelines than the raw probabilities would warrant.</p><p>This makes a lot of sense. I strongly recommend governments, civil society, and academics do more to hedge against transformative AI coming early.</p><p>But when it comes to the communities of professionals already working on helping the AI transition go well, I think they are already hedging strongly against early transformative AI. Indeed, there is even a risk that they are going beyond mere hedging, and are actively betting on it coming early. I&#8217;m not sure, as it is hard to know the full portfolio of work.</p><p>One certainly sees many more pleas for work aimed at very short timelines than for long timelines. 
But there are also strong reasons to consider long timelines in our planning, and ways in which work aimed at long timelines can also be extremely high leverage.</p><p>Let&#8217;s look at two key things that happen when timelines are longer.</p><h1>A Different World</h1><p>In longer timelines, AI arrives in a world that doesn&#8217;t look like today. The longer it is until transformative AI appears, the more different the world will be at that key moment.</p><p>As a baseline, suppose it arrives soon, in 2028. Things will definitely be different to today, but we&#8217;d expect many of the broad brushstrokes to be similar. We would likely have the same US president, the same major players, the same main technologies. If transformative AI arrived within just two years, I&#8217;d bet it was something like the AI 2027 story where a lab recklessly got recursive self-improvement going.</p><p>Now suppose transformative AI arrives in 2035. That is not this presidential term or even the next one, but the one after that. Who knows who&#8217;d be in power, or what state the US would be in. The nine years would likely have seen major changes in the core technologies of AI (9 years before now there were no LLMs or transformers). We could well have different leading AI companies, perhaps as a result of a bubble having burst and taken out the overextended first-movers.</p><p>By 2035, export controls may well have backfired, helping China get ahead on chips by incentivising them to build out their own chip industry and giving them 13 years to get good at it. This was a key dynamic the White House considered while drafting the export controls, but they were focused on shorter timelines&#8230; By 2035, China may have also invaded Taiwan, depriving the West of their biggest source of chips.</p><p>By 2035, there may be double-digit unemployment from increasingly powerful AI systems and public sentiment about AI could be very strong. 
The Overton window for AI regulation will be in a very different place.</p><p>As may be the geopolitical order. The last nine years have seen the invasion of Ukraine, the increasing isolation of the US and a global pandemic. Another nine years could see a similar amount of change.</p><p>And if we haven&#8217;t played our cards right, those of us working on avoiding catastrophic risks from AI may have also lost a lot of power, with our ideas about AI risk being seen as discredited since so many years have passed without the truly transformative effects we were talking about.</p><p>In short, the longer the timelines the more different things will be &#8212; both in some systematic, predictable ways, and just from random diffusion and chaos. So taking longer timelines seriously means:</p><ul><li><p>Being more open to approaches that wouldn&#8217;t work in the world as it is today,</p></li><li><p>Being less excited about approaches that are tailored to specifics of today&#8217;s world,</p></li><li><p>Being less happy to compromise your values to appeal to those currently in control of companies and governments,</p></li><li><p>Being less willing to say things that will make people feel our position is discredited if we end up in a long timeline world,</p></li><li><p>And spending less time following the daily news about what has just happened in AI or who is ahead.</p></li></ul><h1>Long-term actions</h1><p>There are many kinds of things people can work on that can pay off handsomely, but only after a number of years. 
Things like:</p><ul><li><p>Founding and nurturing a new research field</p></li><li><p>Founding an organisation or company</p></li><li><p>Building a movement or community</p></li><li><p>Writing a book</p></li><li><p>Foundational research</p></li><li><p>Completing a PhD</p></li><li><p>A major career change</p></li><li><p>Climbing the ladder in a large organisation or government</p></li><li><p>Training promising students in AI Safety or AI Governance</p></li></ul><p>If you just consider your impact during the next three years, most of these will be beaten by other shorter-term options. But as the years climb, longer-term options can have very high value. They aren&#8217;t always best, but for the right people or the right opportunities, they can be extremely impactful.</p><p>When I was a grad student, I realised how much good I could achieve if I donated much of my income over my career to help those in the poorest countries. And the more I thought about it, the more I thought I should start something &#8212; an organisation &#8212; to help other people to do this too. So Will MacAskill and I launched <a href="https://www.givingwhatwecan.org/">Giving What We Can</a> in 2009. Seventeen years later, more than 10,000 people have joined us, having thousands of times as much impact as if I&#8217;d carried on alone.</p><p>This kind of compounding growth is one of the major ways that longer-term projects can have very large multipliers, giving us a very big boost to our impact if timelines are in fact long.</p><p>Starting new fields can be similar. When I first met Allan Dafoe 10 years ago, I didn&#8217;t know what he was talking about when he spoke of &#8216;AI governance&#8217; &#8212; a new field he was trying to found. 
Now it is a burgeoning field, with hundreds of practitioners, who are in high demand from many different governments.</p><p>When I started writing <em><a href="https://theprecipice.com/">The Precipice</a></em>, I wasn&#8217;t sure I should, because I thought AGI might just be too close. But as it turns out, there was time to write it and for it to have a real impact. I&#8217;m really glad I did, as I meet so many amazing people working on the biggest risks who tell me it was reading <em>The Precipice</em> that inspired them to do so. I think it is one of the best things I&#8217;ve done.</p><p>After it came out, I used to think that there just wasn&#8217;t enough time to write a further book &#8212; that we were really too close to the critical moment. We might be, but I think I was mistaken about the strength of this argument. The time horizon for a book to have real impact is about 5 years (time to plan the book, win a book deal, write the book, wait for publishers to publish it, then wait a year or more before it has sufficient impact in the world).</p><p>But I only think there is about a 1 in 5 chance of transformative AI coming in the next 5 years. So while a book may come out too late, that is only a 1 in 5 chance, leaving a book project with 80% as much expected value as I&#8217;d have naively calculated. So while there is a 1 in 5 chance I&#8217;d be kicking myself, on my views about AI timelines there isn&#8217;t actually that much of a haircut in expected value due to the chance it is too late.</p><p>That said, the chance of transformative AI arriving before your work pays off is only one factor affecting whether you should do work aiming at short or long timelines. Another is that AI safety and governance are likely to be more neglected now than they will be later. 
This creates an extra multiplier for the value of direct work in these areas now, and in some cases is a larger effect than the chance your work comes to fruition after transformative AI.</p><p>Overall, I think that longer-term projects do get down-weighted by these considerations, but their advantages sometimes outweigh that &#8212; especially if they are shooting for a very big payoff. I&#8217;d guess that if someone looked at their options and thought the best option was one that took 5 to 10 years to pay off, then about half the time it would remain their best option even after taking AI timelines into consideration. After all, it is not uncommon for your best option to be several times better than your second best.</p><p>So I think the community of people working on transformative AI is likely underrating types of work that need five or more years in order to pay off. The ideal portfolio of activities aimed at making the AI transition go well should include a number of things that really help us succeed in worlds where we get longer to try.</p><p>But I want to stress that none of this implies we can slack off.</p><p>We&#8217;re in a race against AI timelines. It is just that we don&#8217;t know if that race is a sprint or a marathon. In either case, time is of the essence.</p><h1>Conclusions</h1><p>We have seen that there is substantial disagreement and uncertainty about when AI will start having transformative impacts on the world. This is because there just isn&#8217;t enough evidence to pin it down. My claim is that for the purposes of planning we should adopt neither short nor long timelines, but <em>broad timelines</em>:</p><blockquote><p>The correct epistemic response to the lasting expert disagreement is to have a broad distribution over AI timelines.</p></blockquote><p>Given this deep uncertainty, we need to act with epistemic humility. We have to take seriously the possibility it will come soon and hedge against that. 
But we also have to take seriously the possibility that it comes late and take advantage of the opportunities that would afford us. The world at large is doing too little of the former, but those of us who care most about making the AI transition go well might be doing too little of the latter.</p><p>We need to take more seriously the possibility that the world will look very different at that time, which should broaden our own Overton windows about what kinds of plans could succeed. And we shouldn&#8217;t be ruling out all actions which take a long time to pay off. Even if they wouldn&#8217;t help in short-timeline worlds, some actions more than make up for this with substantial impacts if timelines are long.</p><p>Funders, career advisors, and movement builders should be thinking about this with regard to how we act as a community: to the shape of the whole portfolio of work aimed at effectively improving the world. And each of us should be reflecting on what this deeply uncertain timing means for planning our own contributions over the years to come.</p>]]></content:encoded></item></channel></rss>