Worlds where we solve AI alignment on purpose don't look like the world we live in
(Or: Why I don’t see how the probability of extinction could be less than 25% on the current trajectory)
AI developers are trying to build superintelligent AI. If they succeed, there’s a high risk that the AI will kill everyone. The AI companies know this; they believe they can figure out how to align the AI so that it doesn’t kill us.
Maybe we solve the alignment problem before superintelligent AI kills everyone. But if we do, it will happen because we got lucky, not because we as a civilization treated the problem with the gravity it deserves—unless we start taking the alignment problem dramatically more seriously than we currently do.
Think about what it looks like when a hard problem gets solved. Think about the Apollo program: engineers working out minute details; running simulation after simulation; planning for remote possibilities.
Think about what it looks like when a hard problem doesn’t get solved. Consider the world’s response to COVID.
When I look at civilization’s response to the AI alignment problem, I do not see something resembling Apollo. When I visualize what it looks like for civilization to buckle down and make a serious effort to solve alignment, that visualization does not resemble the world we live in.
This is the world we live in:
- AI Lab Watch has evaluations of AI companies’ behavior on AI safety. Every company has failing grades in almost every category.
- AI capabilities research gets more than 100 times as much investment as AI safety research.
- People keep saying “nobody would be so stupid as to X”, and then the people in charge proceed to do X. (Where X = “give AI direct access to the internet”, “hand over autonomous control of important systems”, etc.)
- There is widespread disagreement about how hard it will be to solve AI alignment, and about the difficulty of various sub-problems. AI safety researchers and frontier companies behave as if they are ~100% confident that these problems are not hard, and do little work to publicly justify this stance or resolve disagreements with more pessimistic parties. When public discussion does occur, it happens between skeptics and random employees, not between skeptics and official company representatives.
- Every frontier AI company (that has an alignment plan at all) wants to use AI to solve AI alignment. This is a horrifyingly bad plan—they are admitting that the problem is so hard that they don’t think humans can solve it in time, and then proposing to use an unknown method with unknown reliability to solve the problem. Meanwhile, senior alignment researchers at AI companies describe this as “a very good plan”. My life is in these people’s hands.
- People are confident that they can solve alignment in advance of building ASI, or confident that alignment can be solved iteratively, in spite of widespread disagreement about whether either approach is possible.
- Most people involved in frontier AI don’t treat the problem as having existential stakes. Many people have (written or mental) models of how to make ASI safe; these models do not meet the safety standards of most industries. They certainly do not meet the standards of industries where failures can result in deaths: spaceflight, civil engineering, surgery, cryptography.1 When extinction is at stake, standards should be even higher than that: higher than the life-critical standards that AI safety plans are already failing to live up to.
- AI companies are allergic to technical philosophy. They take the position that AI alignment is purely an engineering problem.
- Companies make non-binding commitments imposing some safety requirement on a future generation of LLMs. Then, when the time arrives and that commitment turns out to be hard to satisfy, they remove the commitment from their plan.
- I wrote the first draft of this post before Anthropic released Responsible Scaling Policy v3, which removed all prior commitments to conditionally pause development. Thank you, Anthropic, for providing me with a better example and demonstrating to everyone how untrustworthy you are; but no thank you for throwing away the commitments that might have prevented you from killing me and destroying everything I care about.
- Frontier AI developers all spend some time lobbying against safety regulations or attempting to weaken them. Anthropic is the least-bad actor; they’ve only lobbied against moderate regulations, while coming out in favor of some weak regulations.
- AI developers, including OpenAI and Anthropic (and those are just the ones we know about), require employees to sign non-disparagement agreements, preventing them from speaking out about any unsafe practices that might be happening.
- A frontier AI developer, founded as a nonprofit, first fires its safety-conscious board members, then restructures as a for-profit so it can continue to race forward unchecked. (AI Lab Watch has documented many other lapses in integrity.) There’s a good chance that this will be the company that develops ASI.
- People assume we live in a fair world. One argument goes: solving one-shot problems is too hard; therefore, we need to solve alignment iteratively via experimentation. Then, having decided that they need iterative alignment, people decide that it’s possible to solve alignment iteratively. There is no law of the universe that says we get to do things the easy way if the hard way is too hard. Sometimes you really do have to do things the hard way. But instead of dealing with this fact, people act as if the universe will treat us fairly.
- Alignment researchers routinely make basic reasoning errors that indicate that they are not taking the alignment problem seriously.
- Example: Researchers sometimes implicitly assume that if an LLM doesn’t reveal deception in its chain-of-thought, then it’s not being deceptive. We don’t know what’s happening inside deep neural networks. The chain-of-thought doesn’t tell us what’s happening in a neural network’s inner layers.
- More generally: “we found no evidence of X, therefore X is false.” Another common assumption that fits this pattern is “we found no way to jailbreak our model, therefore it’s impossible to jailbreak.” People rarely say this explicitly, but they behave as if it’s valid reasoning.
- A related fallacy is “I can’t think of any ways this could go wrong, therefore it can’t go wrong.” In worlds where ASI goes well, there are processes in place to prevent AI developers from falling for this fallacy.2 We do not have those processes.
- People who are pessimistic about solving alignment get filtered out of positions of power at AI companies. (I’m not aware of anyone at a leading AI company with a P(doom) > 50%, although I’m sure there is a nonzero number of such people.) That means the people who would do the most to put checks in place to make AI safe are the least likely to be in a position to do so. AI companies systematically give power to reckless optimists.
Is this what taking the alignment problem seriously looks like? Is this what taking extinction risk seriously looks like?
Some people are concerned about AI x-risk, and they have P(doom)s in the 5–25% range. I don’t get that. I can’t pass an Ideological Turing Test for someone who sees all these problems, but still expects us to avert extinction with >75% probability. I don’t understand what would lead one to believe that this is what things look like when we’re on track to solving a problem.
There is an argument people sometimes make, to the effect of “the good guys have to do a sloppy job on AI safety, because otherwise the bad guys will beat us to ASI and they’ll be even worse.” I understand that viewpoint.3 But that doesn’t mean P(doom) is low. You could believe something like: if Anthropic builds ASI, there’s a 50% chance we die; if xAI or Meta builds ASI, there’s a 75% chance we die; therefore, Anthropic has to build ASI first. I don’t know anyone who believes this; everyone who wants to race to build ASI seems to have a P(doom) on the order of 25% or less.
I can imagine a world where AI companies are still racing to build ASI, but they’re taking the challenges appropriately seriously and investing accordingly in safety. In that world, I can see P(doom) being less than 25%. But that’s not the world we live in.
Notes
1. Cryptanalysis is a good example of what it looks like when flaws are hard to catch. Standard practice when releasing a new cryptographic algorithm is to circulate it among experts and spend 2–5 years trying to break the algorithm before anyone uses it in the real world.
2. I won’t say “people don’t commit this fallacy”, because there is no alternative world where people don’t make reasoning errors. But in the good world, we have checks in place to make sure that ultimate decisions aren’t made on the basis of those errors.
3. That still doesn’t explain the fact that AI companies exhibit an embarrassing level of sloppiness. There are various alignment challenges that AI companies could address without significantly slowing down, but they still act oblivious to them.