One day, I was at my grandma’s house reading the Sunday funny pages, when I suddenly felt myself getting sucked into a Garfield comic.

I looked down at my body and saw that I had become fully cartoonified. My hands had four fingers and I had a distinct feeling that I’d be wearing the same outfit for the rest of my life.

I also started feeling really hungry. Luckily, I was in the kitchen, where Jon, Garfield’s owner, had made some lasagna.

“You look really hungry,” he said to me. “Why don’t you take this lasagna?”

I gratefully accepted Jon’s lasagna. As he handed it to me, he issued a grave warning: “Don’t let Garfield eat this.”

I looked at where Garfield was sitting on the floor, harmlessly hating Mondays. He was chubby and slow and there was no way he’d be able to jump up and yank the lasagna tray out of my hands.

Jon said, “Take that to the dining table down at the end of the comic strip.”

Thanking him again, I walked over to the next panel, where I ran into a new Jon and Garfield (because that’s how comic strips work). But this Garfield was a bit bigger and a bit more lithe-looking.

I waved at the characters and walked to the third panel. The third Garfield again looked bigger and a bit more aggressive.

“I’m a bit worried about this lasagna,” I said to Jon. “Garfield seems to be getting bigger and stronger, and I’m afraid that once I get to the last panel, he’s gonna eat my lasagna.”

“Oh yes, the guy who draws us is trying to make a super-Garfield,” Jon explained. “He’s small now, but someday he will be bigger and stronger than either of us.”

“But won’t that mean he will take my lasagna?”

“Don’t worry about that,” Jon reassured me. “I have a plan to tame Garfield. By the end, I will know how to make him leave the lasagna alone.”

“Okay,” I said hesitantly, and kept walking to the fourth panel. (This was a full-page Sunday comic.) Garfield was now big enough to come up to my knees. He scratched at my leg with his front paws, yearning for a meal.

“I’m really not sure about this,” I said to Jon. “Right now, Garfield can’t take my lasagna no matter what he does; he can’t jump that high and he’s definitely not strong enough to knock me over. At some point, though, I will encounter a Garfield who’s bigger than me for the first time, and then he’ll be able to knock me over and eat the lasagna and maybe even eat the entire house. How can I know whether he’s really tamed before I get to that point?”

“I’m incrementally improving his behavior in each panel,” Jon explained. “We will have many chances to observe Garfield’s behavior before he becomes super-Garfield. I can verify that I’m making progress on his tameness.”

I thought back to what I had read about cat training on LessWrong. “There’s this guy named Eliezer Yudkowsky who says you only get one critical try to tame a super-cat. If you fail, the cat will steal your lasagna, and then you won’t have any lasagna anymore.”

“That argument doesn’t apply to my methods. I can accumulate evidence about whether the cat training is working. Each Garfield will be less lasagna-obsessed than the last, and we can observe the trajectory of the taming. You also have to consider…” and I don’t quite recall what Jon said after that, but it was very complicated and it sounded like he knew a lot more about cat-taming than me.

“Perhaps you’re right,” I said. “You have some good arguments, but so does Eliezer. How can I know who’s really right until I reach the last panel? I think I’d better not go there until I know for sure that my lasagna will be safe.”

Some people say AI alignment is a one-shot problem. You only get one chance to align AI, and if you fail, everyone dies.

Other people disagree. They say it’s possible to make gradually smarter AIs and ratchet your way up to a fully aligned superintelligent AI, and if the process isn’t working, you can tell in advance.

It doesn’t matter who’s right. The key thing is that we don’t get to find out who’s right until it’s too late. Either the gradual ramp-up succeeds at making an aligned superintelligence (as the gradualists predict), or it fails and we die (as the one-shotters predict).

This is the meta-one-shot problem: we only get one shot at knowing whether it’s a one-shot problem.

There will be some point in time when we build the first AI that’s powerful enough to kill everyone. When that happens, either the one-shotters are right and we only get one shot to align it, or the gradualists are right and we get to iterate. Either way, we only get one shot at finding out who’s right.

The alignment one-shot problem may or may not be real. But the meta-one-shot problem is definitely real: we don’t get the evidence we need until it’s too late to do anything about it.

AI companies’ alignment plans only work if the gradualist hypothesis is true. The main reason they operate this way is that it’s much harder to make plans that work in a one-shot world. Unfortunately, there is no law of the universe that says you get to do the easy thing if the hard thing is too hard. If a plan requires gradualism, then we have no way of being confident that the plan will work.

Having a plan for a gradualist scenario is fine. But AI developers also need plans for what to do if AI alignment is a one-shot problem, because they have no way of knowing which hypothesis is correct. And they shouldn’t build powerful AI until they have both kinds of plans.
