Recently, Saar Wilf, creator of Rootclaim, had a high-profile debate against Peter Miller on whether COVID originated from a lab. Peter won and Saar lost.

Rootclaim’s mission is to “overcome the flaws of human reasoning with our probabilistic inference methodology.” Rootclaim assigns odds to each piece of evidence and performs Bayesian updates to get a posterior probability. When Saar lost the lab leak debate, some people considered this a defeat not just for the lab leak hypothesis, but for Rootclaim’s whole approach.

In Scott Alexander’s coverage of the debate, he wrote:

While everyone else tries “pop Bayesianism” and “Bayes-inspired toolboxes”, Rootclaim asks: what if you just directly apply Bayes to the world’s hardest problems? There’s something pure about that, in a way nobody else is trying.

Unfortunately, the reason nobody else is trying this is because it doesn’t work. There’s too much evidence, and it’s too hard to figure out how to quantify it.

Don’t give up so easily! We as a society have spent approximately 0% of our collective decision-making resources on explicit Bayesian reasoning. Just because Rootclaim used Bayesian methods and then lost a debate doesn’t mean those methods will never work. That would be like saying, “randomized controlled trials were a great idea, but they keep finding that ESP exists. Oh well, I guess we should give up on RCTs and just form beliefs using common sense.”

(And it’s not even like the problems with RCTs were easy to fix. Scott wrote about 10 known problems with RCTs and 10 ways to fix them, and then wrote about an RCT that fixed all 10 of those problems¹ and still found that ESP exists. If we’re going to give RCTs more than 10 tries, we should extend the same courtesy to Bayesian reasoning.)

I’m optimistic that we can make explicit Bayesian analysis work better. And I can already think of ways to improve on two problems with it.

First problem: If you multiply a long list of probabilities as if they’re independent when they’re not, you get an extreme result.

Quick fix: Reduce the magnitudes of the odds updates based on how much evidence you already have. The more individual factors you have, the more a new factor can be explained in terms of existing factors.

For example, you could scale down the log-odds of your second observation by 1/2, your third observation by 1/3, your fourth observation by 1/4, etc. This roughly captures the intuitions that

  1. if you have a lot of evidence already, a new observation is probably mostly predicted by the existing evidence
  2. if you have infinitely many pieces of evidence, that should give you an infinitely large odds update

This approach means you don’t need to spend any time thinking about how correlated your inputs are.

If you have lines of evidence A, B, C, etc., the formula for joint log-odds becomes

\begin{align} \log(A) + \frac{1}{2} \log(B) + \frac{1}{3} \log(C) + \cdots \end{align}

And therefore your joint odds would be

\begin{align} A \cdot B^{1/2} \cdot C^{1/3} \cdots \end{align}

I don’t have a rigorous justification for this formula² and it has some obvious problems (for example, if you change the order of your inputs, the answer changes). But it has some advantages over treating every piece of evidence as independent.
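Here’s a minimal sketch of that weighting scheme in Python (the function name and interface are mine, for illustration; it assumes each piece of evidence is already expressed as an odds ratio and listed in a fixed order):

```python
import math

def joint_odds(odds_ratios):
    """Combine odds ratios, scaling the log-odds of the n-th piece of
    evidence by 1/n as a rough discount for unknown correlations."""
    total_log_odds = sum(
        math.log10(odds) / n
        for n, odds in enumerate(odds_ratios, start=1)
    )
    return 10 ** total_log_odds

# Three pieces of 100:1 evidence treated as independent would give
# 1,000,000:1; with the 1, 1/2, 1/3 scaling they give roughly 4,642:1.
print(joint_odds([100, 100, 100]))  # ~4641.6
```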

As a proof of concept, I created a modified version of Scott Alexander’s lab leak debate calculator that updates less on correlated evidence. My version assumes two lines of evidence are correlated if (1) they’re under the same heading and (2) they point in the same direction. This change reduces the standard deviation of people’s answers from 7.4 orders of magnitude to 4.4. (Or, if you exclude Peter Miller’s extremely overconfident answer, it reduces the standard deviation from 2.1 OOM to 1.8.)
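In code, my reading of that correlation rule looks roughly like this (a sketch of the rule as described above, with made-up headings; not the actual spreadsheet formulas):

```python
import math
from collections import defaultdict

def joint_odds_grouped(evidence):
    """evidence: list of (heading, odds_ratio) pairs, where odds_ratio > 1
    favors one hypothesis and odds_ratio < 1 favors the other.

    Evidence under the same heading and pointing in the same direction is
    treated as correlated: within each such group, the n-th strongest
    log-odds is scaled by 1/n. Different groups combine as if independent."""
    groups = defaultdict(list)
    for heading, odds in evidence:
        direction = "up" if odds >= 1 else "down"
        groups[(heading, direction)].append(odds)

    total_log_odds = 0.0
    for members in groups.values():
        # Give the strongest evidence in a group full weight, so the result
        # doesn't depend on the order the evidence was entered.
        members.sort(key=lambda odds: abs(math.log10(odds)), reverse=True)
        total_log_odds += sum(
            math.log10(odds) / n for n, odds in enumerate(members, start=1)
        )
    return 10 ** total_log_odds

# Two correlated 100:1 lines plus one independent 10:1 line:
# 100 * 100^(1/2) * 10 = 10,000:1 instead of 100,000:1.
print(joint_odds_grouped([
    ("epidemiology", 100),
    ("epidemiology", 100),
    ("genetics", 10),
]))  # ~10000.0
```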

Second problem: Overconfident probabilities like “1 in 10,000 chance that COVID would first appear in a wet market conditional on lab leak”.

Quick fix: Give every piece of evidence a “reliability score”. Maybe the evidence looks like it suggests 10,000:1 odds, but you haven’t thought about it that hard. You read the number in some population survey, but maybe the survey miscalculated, maybe it used bad data collection methods, maybe you misread the number of zeros and it actually said 1 in 1,000.

As a simple approach, you could give every piece of evidence a reliability score from 1 (low reliability) to 4 (high reliability). Discount evidence by raising it to the power of 1 / (5 - reliability_score). So 10,000:1 evidence with a reliability score of 3 gets reduced to 10,000^(1/2) = 100:1, and evidence with a score of 1 gets reduced to 10,000^(1/4) = 10:1.
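As a quick sketch of that discount rule (the function name is mine; the formula is the one just described):

```python
def discount_odds(odds, reliability_score):
    """Discount an odds ratio by raising it to 1 / (5 - reliability_score),
    where reliability_score runs from 1 (low) to 4 (fully trusted)."""
    assert 1 <= reliability_score <= 4
    return odds ** (1 / (5 - reliability_score))

print(discount_odds(10_000, 4))  # 10000.0 (no discount)
print(discount_odds(10_000, 3))  # 100.0   (exponent 1/2)
print(discount_odds(10_000, 2))  # ~21.5   (exponent 1/3)
print(discount_odds(10_000, 1))  # 10.0    (exponent 1/4)
```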

Is that the best way to handle the problem of overconfident odds updates? Probably not. But it’s really easy and it took me three seconds to come up with.

(If you think carefully enough about your odds, you don’t need a reliability score. But the score is a convenient way to encode a concept like “I did some calculations and got 10,000:1 odds but I haven’t carefully checked the calculations.”)

Quoting Scott again,

In the end, I think Saar has two options:

  1. Abandon the Rootclaim methodology, and go back to normal boring impure reasoning like the rest of us, where you vaguely gesture at Bayesian math but certainly don’t try anything as extreme as actually using it.

  2. Claim that he, Saar, through his years of experience testing Rootclaim, has some kind of special metis at using it, and everyone else is screwing up.

(I get the sense Scott is joking, but I’ve heard other people say things like this.)

I propose a third option: Examine the flaws in explicit Bayesian reasoning and look for ways to fix them.

Or a fourth option: Do explicit Bayesian reasoning, don’t take the result literally but implicitly update your beliefs based on the result.

Or a fifth option: Figure out how to fix RCTs, and then do something similar for Bayesian reasoning. (Did we figure out how to fix RCTs yet?)

Or a sixth option: Keep doing Bayesian reasoning, and meanwhile keep trying to fix its flaws (like we are sort-of doing for RCTs).

Notes

  1. Actually only 8 out of 10 but the basic point still stands. 

  2. You could slightly-more-rigorously justify this formula by saying

    • The variance in evidence B is 50% explained by evidence A.
    • The variance in evidence C is 50% explained by A and 50% explained by B.
    • But the parts of C that are explained by A and B heavily overlap, so less than 75% of C is explained by A plus B.