Efforts to Improve the Accuracy of Our Judgments and Forecasts
By Luke Muehlhauser
Editor’s note: This article was published under our former name, The Open Philanthropy Project. Some content may be outdated. You can see our latest writing here.
Our grantmaking decisions rely crucially on our uncertain, subjective judgments — about the quality of some body of evidence, about the capabilities of our grantees, about what will happen if we make a certain grant, about what will happen if we don’t make that grant, and so on.
In some cases, we need to make judgments about relatively tangible outcomes in the relatively near future, as when we have supported campaigning work for criminal justice reform. In others, our work relies on speculative forecasts about the much longer term, as for example with potential risks from advanced artificial intelligence. We often try to quantify our judgments in the form of probabilities — for example, the former link estimates a 20% chance of success for a particular campaign, while the latter estimates a 10% chance that a particular sort of technology will be developed in the next 20 years.
We think it’s important to improve the accuracy of our judgments and forecasts if we can. I’ve been working on a project to explore whether there is good research on the general question of how to make good and accurate forecasts, and/or specialists in this topic who might help us do so. Some preliminary thoughts follow.
In brief:
- There is a relatively thin literature on the science of forecasting.[1] It seems to me that its findings so far are substantive and helpful, and that more research in this area could be promising.
- This literature recommends a small set of “best practices” for making accurate forecasts, which we are thinking about how to incorporate into our process. These “best practices” seem likely to be useful, and their use seems surprisingly uncommon given that.
- In one case, we are contracting to build a simple online application for credence calibration training: training the user to accurately determine how confident they should be in an opinion, and to express this confidence in a consistent and quantified way. I consider this a very useful skill across a wide variety of domains, and one that (it seems) can be learned with just a few hours of training. (Update: This calibration training app is now available.)
I first discuss the last of these points (credence calibration training), since I think it is a good introduction to the kinds of tangible things one can do to improve forecasting ability.
Calibration training
An important component of accuracy is called “calibration.” If you are “well-calibrated,” then statements (including predictions) you make with 30% confidence are true about 30% of the time, statements you make with 70% confidence are true about 70% of the time, and so on.
Without training, most people are not well-calibrated, but instead overconfident. Statements they make with 90% confidence might be true only 70% of the time, and statements they make with 75% confidence might be true only 60% of the time.[2] But it is possible to “practice” calibration by assigning probabilities to factual statements, then checking whether the statements are true, and tracking one’s performance over time. In a few hours, one can practice on hundreds of questions and discover patterns like “When I’m 80% confident, I’m right only 65% of the time; maybe I should adjust so that I report 65% for the level of internally-experienced confidence I previously associated with 80%.”
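To make the practice loop concrete, here is a minimal sketch (mine, not the article’s; the confidence/outcome data are invented) of how one might tabulate calibration after a practice session: group answers by stated confidence and compare each group’s hit rate to that confidence.

```python
# Minimal sketch (not from the article): tabulating calibration from a set of
# practice questions. Each record is (stated confidence, whether the claim
# turned out to be true); the data below are invented for illustration.
from collections import defaultdict

answers = [
    (0.9, True), (0.9, False), (0.9, True), (0.9, True), (0.9, False),
    (0.7, True), (0.7, False), (0.7, True), (0.7, False), (0.7, True),
    (0.5, True), (0.5, False), (0.5, False), (0.5, True), (0.5, True),
]

# Group outcomes by the confidence level at which they were stated.
by_confidence = defaultdict(list)
for confidence, correct in answers:
    by_confidence[confidence].append(correct)

# Compare each group's hit rate to the stated confidence.
for confidence in sorted(by_confidence, reverse=True):
    outcomes = by_confidence[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {confidence:.0%} -> right {hit_rate:.0%} "
          f"({len(outcomes)} questions)")
```

With these invented data, the printout shows the typical overconfidence pattern: statements made with 90% and 70% confidence are each right only about 60% of the time.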
I recently attended a calibration training webinar run by Hubbard Decision Research, which was essentially an abbreviated version of the classic calibration training exercise described in Lichtenstein & Fischhoff (1980). It was also attended by two participants from other organizations, who did not seem to be familiar with the idea of calibration and, as expected, were grossly overconfident on the first set of questions.[3] But, as the training continued, their scores on the question sets began to improve until, on the final question set, they both achieved perfect calibration.
For me, this was somewhat inspiring to watch. It isn’t often that a cognitive skill as useful and domain-general as probability calibration can be trained, with such dramatic and objectively measured improvements, in so short a time.
The research I’ve reviewed broadly supports this impression. For example:
- Rieber (2004) lists “training for calibration feedback” as his first recommendation for improving calibration, and summarizes a number of studies indicating both short- and long-term improvements in calibration.[4] In particular, decades ago, Royal Dutch Shell began providing calibration training for its geologists, who are now (reportedly) quite well-calibrated when forecasting which sites will produce oil.[5]
- Since 2001, Hubbard Decision Research has trained over 1,000 people across a variety of industries. Analyzing the data from these participants, Doug Hubbard reports that 80% of people achieve perfect calibration (on trivia questions) after just a few hours of training. He also claims that, according to his data and at least one controlled (but not randomized) trial, this training predicts subsequent real-world forecasting success.[6]
I should note that calibration isn’t sufficient by itself for good forecasting. For example, you can be well-calibrated on a set of true/false statements, about half of which happen to be true, simply by responding “True, with 50% confidence” to every statement. This performance would be well-calibrated but not very informative. Ideally, an expert would assign high confidence to statements that are likely to be true, and low confidence to statements that are unlikely to be true. An expert who can do so is not just well-calibrated, but also exhibits good “resolution” (sometimes called “discrimination”). Combining calibration and resolution yields a measure of accuracy known as a “proper scoring rule.”[7] The calibration trainings described above sometimes involve proper scoring rules, and likely train people to be well-calibrated while exhibiting at least some resolution, though their main benefit (based on the research and my observations) pertains to calibration specifically.
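As an illustration of how calibration and resolution combine (this sketch and its made-up forecasts are mine, not the article’s), the Brier score is one standard proper scoring rule for binary outcomes, and the well-known Murphy decomposition splits it into a reliability (calibration) term, a resolution term, and base-rate uncertainty:

```python
# Minimal sketch (not from the article): the Brier score is one standard
# proper scoring rule for binary outcomes, and it decomposes as
#   Brier = reliability - resolution + uncertainty   (Murphy decomposition),
# where lower reliability means better calibration and higher resolution
# means more informative forecasts.
from collections import defaultdict

# Made-up example: forecast probabilities and whether each event occurred.
forecasts = [0.9, 0.9, 0.7, 0.7, 0.7, 0.5, 0.5, 0.2, 0.2, 0.1]
outcomes  = [1,   1,   1,   0,   1,   0,   1,   0,   0,   0]

n = len(forecasts)
base_rate = sum(outcomes) / n

# Mean squared error between forecasts and outcomes.
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# Group outcomes by the stated forecast probability.
bins = defaultdict(list)
for f, o in zip(forecasts, outcomes):
    bins[f].append(o)

reliability = sum(len(os) * (f - sum(os) / len(os)) ** 2
                  for f, os in bins.items()) / n
resolution = sum(len(os) * (sum(os) / len(os) - base_rate) ** 2
                 for f, os in bins.items()) / n
uncertainty = base_rate * (1 - base_rate)

print(f"Brier score: {brier:.3f}")
print(f"  = reliability {reliability:.3f} - resolution {resolution:.3f}"
      f" + uncertainty {uncertainty:.3f}")
```

The uniform “True, with 50% confidence” strategy described above would score zero on the reliability (mis-calibration) term but also zero on resolution, which is exactly why it is well-calibrated yet uninformative.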
The primary source of my earlier training in calibration was a game intended to automate the process. The Open Philanthropy Project is now working with developers to create a more extensive calibration training game for training our staff; we will also make the game available publicly.
Further advice for improving judgment accuracy
Below I list some common advice for improving judgment and forecasting accuracy (in the absence of strong causal models or much statistical data) that has at least some support in the academic literature, and which I find intuitively likely to be helpful.[8]
- Train probabilistic reasoning: In one especially compelling study (Chang et al. 2016), a single hour of training in probabilistic reasoning noticeably improved forecasting accuracy.[9] Similar training has improved judgmental accuracy in some earlier studies,[10] and is sometimes included in calibration training.[11]
- Incentivize accuracy: In many domains, incentives for accuracy are overwhelmed by stronger incentives for other things, such as appearing confident, being entertaining, or signaling group loyalty. Some studies suggest that accuracy can be improved merely by providing sufficiently strong incentives for accuracy, such as money or the approval of peers.[12]
- Think of alternatives: Some studies suggest that judgmental accuracy can be improved by prompting subjects to consider alternative hypotheses.[13]
- Decompose the problem: Another common recommendation is to break each problem into easier-to-estimate sub-problems.[14]
- Combine multiple judgments: Often, a weighted (and sometimes “extremized”[15]) combination of multiple subjects’ judgments outperforms the judgments of any one person.[16] A simple sketch of this kind of aggregation appears after this list.
- Correlates of judgmental accuracy: According to some of the most compelling studies on forecasting accuracy I’ve seen,[17] correlates of good forecasting ability include “thinking like a fox” (i.e. eschewing grand theories for attention to lots of messy details), strong domain knowledge, general cognitive ability, and high scores on “need for cognition,” “actively open-minded thinking,” and “cognitive reflection” scales.
- Prediction markets: I’ve seen it argued, and find it intuitive, that an organization might improve forecasting accuracy by using prediction markets. I haven’t studied the performance of prediction markets yet.
- Learn a lot about the phenomena you want to forecast: This one probably sounds obvious, but I think it’s important to flag, to avoid leaving the impression that forecasting ability is more cross-domain/generalizable than it is. Several studies suggest that accuracy can be boosted by having (or acquiring) domain expertise. A commonly-held hypothesis, which I find intuitively plausible, is that calibration training is especially helpful for improving calibration, and that domain expertise is helpful for improving resolution.[18]
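As a simple illustration of the “combine multiple judgments” item above (my own sketch; the input forecasts and the extremizing parameter are arbitrary), one common approach is to average the individual probabilities and then push the average away from 0.5 by scaling its log-odds:

```python
# Minimal sketch (not from the article): combine several forecasters'
# probabilities by simple averaging, then "extremize" the average by
# scaling its log-odds by a factor a > 1 and mapping back to a probability.
# The inputs and the value of a below are arbitrary choices for illustration.
import math

def aggregate(probabilities, a=2.0):
    # Equal-weighted average of the individual forecasts
    # (all inputs must be strictly between 0 and 1).
    p = sum(probabilities) / len(probabilities)
    log_odds = math.log(p / (1 - p))                # move to log-odds space
    extremized = 1 / (1 + math.exp(-a * log_odds))  # scale and map back
    return p, extremized

individual_forecasts = [0.65, 0.70, 0.75, 0.60]     # made-up inputs
mean_p, extremized_p = aggregate(individual_forecasts)
print(f"simple average: {mean_p:.2f}, extremized: {extremized_p:.2f}")
```

With these made-up inputs, the simple average is about 0.68 and the extremized aggregate is about 0.81; in practice the weighting and the extremizing parameter would be tuned to the forecasters and questions at hand.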
Another interesting takeaway from the forecasting literature is the degree to which, and the consistency with which, some experts exhibit better accuracy than others. For example, tournament-level bridge players tend to show reliably good accuracy, whereas TV pundits, political scientists, and professional futurists seem not to.[19] A famous recent result in comparative real-world accuracy comes from a series of IARPA forecasting tournaments, in which ordinary people competed with each other and with professional intelligence analysts (who also had access to expensively-collected classified information) to forecast geopolitical events. As reported in Tetlock & Gardner’s Superforecasting, forecasts made by combining (in a certain way) the forecasts of the best-performing ordinary people were (repeatedly) more accurate than those of the trained intelligence analysts.
How commonly do people seek to improve the accuracy of their subjective judgments?
Certainly many organizations, from financial institutions (e.g. see Fabozzi 2012) to sports teams (e.g. see Moneyball), use sophisticated quantitative models to improve the accuracy of their estimates. But the question I’m asking here is: In the absence of strong models and/or good data, when decision-makers must rely almost entirely on human subjective judgment, how common is it for those decision-makers to explicitly invest substantial effort into improving the (objectively-measured) accuracy of those subjective judgments?
Overall, my impression is that the answer to this question is “Somewhat rarely, in most industries, even though the techniques listed above are well-known to experts in judgment and forecasting accuracy.”
Why do I think that? It’s difficult to get good evidence on this question, but I provide some data points in a footnote.[20]
Ideas we’re exploring to improve accuracy for GiveWell and Open Philanthropy Project staff
Below is a list of activities, aimed at improving the accuracy of our judgments and forecasts, that are either ongoing, under development, or under consideration at GiveWell and the Open Philanthropy Project:
- As noted above, we have contracted a team of software developers to create a calibration training web/phone application for staff and public use. (Update: This calibration training app is now available.)
- We encourage staff to participate in prediction markets and forecasting tournaments such as PredictIt and Good Judgment Open, and some staff do so.
- Both the Open Philanthropy Project and GiveWell recently began to make probabilistic forecasts about our grants. For the Open Philanthropy Project, see e.g. our forecasts about recent grants to Philip Tetlock and CIWF. For GiveWell, see e.g. forecasts about recent grants to Evidence Action and IPA. We also make and track some additional grant-related forecasts privately. The idea here is to be able to measure our accuracy later, as those predictions are confirmed or falsified, and perhaps to learn from that track record and improve our accuracy over time. So far, we are simply encouraging predictions without putting much effort into ensuring their later measurability.
- We’re going to experiment with some forecasting sessions led by an experienced “forecast facilitator” – someone who helps elicit forecasts from people about the work they’re doing, in a way that tries to be as informative and helpful as possible. This might improve the forecasts mentioned in the previous bullet point.
I’m currently the main person responsible for improving forecasting at the Open Philanthropy Project, and I’d be very interested in further ideas for what we could do.
Footnotes