Retroactive If-Then Commitments
An if-then commitment is a framework for responding to AI risk: “If an AI model has capability X, then AI development/deployment must be halted until mitigations Y are put in place.”
As an extension of this approach, we should consider retroactive if-then commitments: we should behave as if we had written if-then commitments a few years ago, and commit to implementing whatever mitigations we would have specified back then.
Imagine how an if-then commitment might have been written in 2020:
Pause AI development and figure out mitigations if:
- AI exhibits what looks like deceptive or misaligned behavior, or feigns alignment (1, 1b, 2)
- AI breaks out of containment in a toy example
- AI finds a real-world zero-day vulnerability
- AI qualifies for Mensa [1]
- AI exhibits some degree of agentic capabilities
- AI writes malware
Well, AI models have now done, or nearly done, all of those things.
We don’t know what mitigations are appropriate, so AI companies should pause development until (at a minimum) AI safety researchers agree on what mitigations are warranted, and those mitigations are then fully implemented.
(You could argue about whether AI really hit those capability milestones, but that doesn’t particularly matter. You need to pause and/or restrict development of an AI system when it looks potentially dangerous, not definitely dangerous.)
Notes
1. Okay, technically it did not score well enough to qualify, but it scored well enough that there was some ambiguity about whether it qualified, which is only a little bit less concerning.