Three Techniques for Weighing Evidence to Reach a Conclusion

In a radically uncertain world, the ability to systematically weigh evidence to reach a justifiable conclusion is undoubtedly a critical skill. Unfortunately, it is one that too many schools fail to teach. Hence this short note, which will cover some basic aspects of evidence, and quickly review three approaches to weighing it.

Evidence has been defined as “any factual datum which in some manner assists in drawing conclusions, either favorable or unfavorable, regarding a hypothesis.”

Broadly, there are at least four types of evidence:

  • Corroborating: Two or more sources report the same information, or one source reports the information and the other attests to the first’s credibility;

  • Convergent: Two or more sources provide information about different events, all of which support the same hypothesis;

  • Contradictory: Two or more pieces of information are mutually exclusive, and cannot both or all be true;

  • Conflicting: Pieces of information support different hypotheses, but are not mutually exclusive.

Regardless of its type, all evidence has three fundamental properties:

  • Relevance: “Relevant evidence is evidence having any tendency to make [a hypothesis] more or less probable than it would be without the evidence” (from the US Federal Rules of Evidence);

  • Believability: Is a function of the credibility and competence of the source of the evidence;

  • Probative Force or Weight: Is concerned with the incremental impact of a piece of evidence on the probabilities associated with one or more of the hypotheses under consideration.

There are three systematic approaches to weighing evidence in order to reach a conclusion.

In the 17th century, Sir Francis Bacon developed a method for weighing evidence. Bacon believed the weight of evidence for or against a hypothesis depends on both how much relevant and credible evidence you have, and on how complete your evidence is with respect to matters which you believe are relevant to evaluating the hypothesis.

Bacon recognized that we can be “out on an evidential limb” if we draw conclusions about the probability a hypothesis is true based on our existing evidence without also taking into account the number of relevant questions that are still not answered by the evidence in our possession. We typically fill in these gaps with assumptions, about which we have varying degrees of uncertainty.

In the 18th century, Reverend Thomas Bayes invented a quantitative method for using new information to update a prior degree of belief in the truth of a hypothesis.

“Bayes’ Theorem” says that given new evidence (E), the updated (posterior) belief that a hypothesis is true, p(H|E), is a function of the conditional probability of observing the evidence given the hypothesis, p(E|H), times the prior probability that the hypothesis is true, p(H), divided by the probability of observing the new evidence, p(E). In symbols: p(H|E) = p(E|H) × p(H) / p(E).

In qualitative terms, we start with a prior belief in the probability a hypothesis is true or false. When we receive a new piece of evidence, we use it to update our prior probability to a new, posterior probability.
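As a rough illustration of a single update, here is a minimal Python sketch. The function name `bayes_update` and all of the numbers are our own assumptions for illustration. It expands p(E) over the hypothesis being true and being false, which assumes those are the only two possibilities.

```python
def bayes_update(prior, p_e_given_h, p_e_given_not_h):
    """Return the posterior p(H|E) from p(H), p(E|H), and p(E|Not-H)."""
    # p(E) = p(E|H) * p(H) + p(E|Not-H) * p(Not-H)
    p_e = p_e_given_h * prior + p_e_given_not_h * (1.0 - prior)
    if p_e == 0.0:
        raise ValueError("the evidence has zero probability under both cases")
    return p_e_given_h * prior / p_e


# Hypothetical inputs: a 20% prior, and evidence that is four times
# as likely to be observed if the hypothesis is true than if it is false.
posterior = bayes_update(prior=0.20, p_e_given_h=0.60, p_e_given_not_h=0.15)
print(round(posterior, 3))
```

With these made-up inputs the belief moves from 20% to 50%. The size of the move depends entirely on how much more likely the evidence is under the hypothesis than under its negation.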

A key issue with Bayesian reasoning is the source of the decision maker's initial prior. After the Good Judgment Project won the Intelligence Advanced Research Projects Activity's four-year forecasting tournament, the project's co-leader, Professor Philip Tetlock, concluded that using base rate data for other instances of the question at hand resulted in the greatest improvement to predictive accuracy (see his book, "

Other sources of an initial prior are deductions from theory, analogy, and intuition.

The “Likelihood Ratio” is a critical concept in the Bayesian process of using new evidence to update a prior to a posterior estimate (which becomes the new prior for the next updating round).

The Likelihood Ratio is the probability of observing a piece of evidence if a hypothesis is true divided by the probability of observing the evidence if the hypothesis is false. The greater the Likelihood Ratio for a piece of new evidence (i.e., the greater its information value), the larger should be the difference between the prior and posterior probabilities that a given hypothesis is true.
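One convenient way to see the Likelihood Ratio at work is the odds form of the update: posterior odds equal prior odds times the Likelihood Ratio. The sketch below is illustrative only. The function name and the sequence of ratios are hypothetical, and it treats successive pieces of evidence as independent of one another.

```python
def update_with_likelihood_ratio(prior, likelihood_ratio):
    """Update a probability using the odds form of Bayes' Theorem."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    # Convert the odds back into a probability.
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical sequence of evidence: strongly supportive (LR = 4),
# uninformative (LR = 1), then mildly disconfirming (LR = 0.5).
belief = 0.20
history = [belief]
for lr in [4.0, 1.0, 0.5]:
    # Each posterior becomes the prior for the next round.
    belief = update_with_likelihood_ratio(belief, lr)
    history.append(belief)

print([round(b, 3) for b in history])
```

A ratio of 1 leaves the belief unchanged, and a ratio below 1 lowers it. Evidence that is equally likely whether or not the hypothesis is true carries no information value.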

In the 20th century, Arthur Dempster and Glenn Shafer developed a new theory of evidence.

Assume a set of competing hypotheses. For each of these hypotheses, a new piece of evidence is assigned to one of three categories: (1) It supports the hypothesis; (2) It disconfirms the hypothesis (i.e., it supports “Not-H”); or (3) it neither supports nor disconfirms the hypothesis.

The accumulated and categorized evidence can then be used to calculate a lower bound on the belief that each hypothesis is true (based on the number of pieces of evidence that support them, and the quality of that evidence), as well as an upper bound (equal to one minus the probability that the hypothesis is false, again, based on the evidence that disconfirms the hypothesis, and its quality). This upper bound is also known as the plausibility of each hypothesis.

The difference between the upper (plausibility) and lower (belief) probabilities for each hypothesis is the degree of uncertainty associated with it. Hypotheses are then ranked based on their degrees of uncertainty.
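To make the belief and plausibility bounds concrete, here is a simplified sketch for a single hypothesis H. Each piece of evidence splits one unit of "mass" among supporting H, supporting Not-H, and remaining uncommitted, and two pieces are combined with Dempster's rule of combination. The class and function names, and the masses assigned to the two reports, are assumptions made for illustration.

```python
from typing import NamedTuple


class Mass(NamedTuple):
    supports: float      # mass committed to H
    disconfirms: float   # mass committed to Not-H
    uncommitted: float   # mass that neither supports nor disconfirms H


def combine(m1, m2):
    """Combine two pieces of evidence using Dempster's rule (two-outcome frame)."""
    # Conflict: one piece supports H while the other supports Not-H.
    conflict = m1.supports * m2.disconfirms + m1.disconfirms * m2.supports
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    norm = 1.0 - conflict
    return Mass(
        supports=(m1.supports * m2.supports
                  + m1.supports * m2.uncommitted
                  + m1.uncommitted * m2.supports) / norm,
        disconfirms=(m1.disconfirms * m2.disconfirms
                     + m1.disconfirms * m2.uncommitted
                     + m1.uncommitted * m2.disconfirms) / norm,
        uncommitted=m1.uncommitted * m2.uncommitted / norm,
    )


def belief_interval(m):
    """Return (belief, plausibility, uncertainty) for H."""
    belief = m.supports
    plausibility = 1.0 - m.disconfirms
    return belief, plausibility, plausibility - belief


# Hypothetical reports: one moderately supports H, one weakly disconfirms it.
report_a = Mass(supports=0.6, disconfirms=0.0, uncommitted=0.4)
report_b = Mass(supports=0.0, disconfirms=0.3, uncommitted=0.7)
combined = combine(report_a, report_b)
belief, plausibility, uncertainty = belief_interval(combined)
print(round(belief, 3), round(plausibility, 3), round(uncertainty, 3))
```

The gap between belief and plausibility is exactly the mass that the evidence leaves uncommitted. A single Bayesian probability has no direct way to express it.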

While there are quantitative methods for applying all of these theories, they can also be applied qualitatively, to quickly and systematically produce an initial conclusion about which of a given set of hypotheses is most likely to be true.


How Conceptual Elegance Can Lead to Risk Blindness

We’ve spent a lot of time over our careers working with risks that are, at least in theory, easy to quantify, price, and transfer. These include hazard risks for which there is substantial historical data on the frequency of their occurrence, as well as market risks where historical data sets are also very large.

In these cases, the traditional way of mitigating unwanted risk exposure is to transfer it, via insurance or financial derivative contracts. This also makes it apparently straightforward to calculate an organization’s residual or retained risk after mitigation actions are taken. In turn, this makes it apparently easy to compare the total amount of residual/retained/net risk to a board’s “risk appetite” – for example, the maximum reduction in cash flow or equity market value to which it desires to be exposed over a given period of time (with, for example, a 95% degree of confidence).

Especially after the events surrounding the 2008 global financial crisis (or the collapse of Long Term Capital Management in 1998), we are all painfully aware that in practice, things are not this easy, even in the case of risks that are apparently easy to quantify, price, and transfer.

Some real-world complications include:

  • Use of historical data sets that do not include extreme downside losses that a given system can produce;

  • Evolution in the nature of the system over time that makes historical data an increasingly inaccurate guide to what may occur in the future;

  • Use of inaccurate models to forecast future risks;

  • Risks whose covariance changes, both over time and as a function of conditions (e.g., remember the saying that as conditions deteriorate and uncertainty increases, correlations move towards 1.0);

  • The ability of risk transfer counterparties to make good on the payments they have contractually agreed to make should a risk materialize (e.g., the case of AIG and credit default swaps in 2008).

If conceptually elegant approaches to retained risk and risk appetite are this challenging in practice for hazard and financial risks, they are exponentially more so in the case of operational and strategic risks.

Consider the case of Carillion, the UK facilities management and construction services company that recently went into liquidation with almost GBP 7 billion in liabilities.

One of the principal causes of the company’s failure was cost overruns on major projects. The potential for such overruns had previously been recognized by the company’s management as a potentially existential risk.

However, in the company’s risk management process, the size of the residual/retained risk exposure was apparently much smaller than the gross exposure. But this wasn’t because most of the risk had been transferred to a counterparty via insurance or financial derivative contracts. Rather, it was because of the assumption that internal mitigation actions would significantly reduce the risk.

Thus, the board’s apparent belief that Carillion had a small exposure to existential project cost overrun risk seems to have been based on a series of implicit assumptions that critical mitigation actions (a) would be implemented; (b) in time; and (c) would have their expected risk reducing effects.
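A little arithmetic shows how much work these three assumptions can do. The sketch below is purely illustrative: the figures are invented (they are not Carillion's), the function name is hypothetical, and it assumes the three conditions can be treated as a simple chain of probabilities.

```python
def residual_exposure(gross, reduction_if_effective,
                      p_implemented, p_on_time, p_effective):
    """Expected residual exposure after an internal mitigation plan."""
    # The mitigation only pays off if it is implemented, in time,
    # and then has its expected risk-reducing effect.
    p_mitigation_works = p_implemented * p_on_time * p_effective
    expected_reduction = p_mitigation_works * reduction_if_effective
    return gross * (1.0 - expected_reduction)


gross = 100.0  # hypothetical gross exposure, in arbitrary units

# Implicit assumption: every condition is certain to hold.
assumed = residual_exposure(gross, 0.8, 1.0, 1.0, 1.0)

# More skeptical (still hypothetical) view: each condition holds 80% of the time.
skeptical = residual_exposure(gross, 0.8, 0.8, 0.8, 0.8)

print(round(assumed, 2), round(skeptical, 2))
```

In this made-up case, the residual exposure is roughly 59 units when each condition is only 80% likely to hold, nearly three times the 20 units implied by treating all three as certain.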

It is also critical to recognize the enormous difference in the accuracy with which transferable risks (e.g., hazard and market) and non-transferable risks (e.g., operational and strategic) can be quantified, in order to integrate them into an overall calculation of an organization’s retained risks relative to its risk appetite.

As we have shown, the quantification of risks for which large historical data sets are available is still problematic in many ways, and subject to an unknown degree of error, which exponentially grows over time.

But for many reasons, this challenge pales in comparison with those that confront us when we try to quantify operational and strategic risks and the potential impact of actions taken to mitigate either their probability of occurrence or the potential negative impact if they materialize. Some of the most important challenges include:

  • We can’t be confident that we have identified all the relevant risks, mainly for two reasons. On the operational front, organizations tend to become more complex as they grow, which gives rise to both new risks and new causal pathways for ones already identified. On the strategic front, the nature of the interacting complex adaptive systems within which a company exists (e.g., technological, economic, social, and political) guarantees that new risks will continuously emerge.

  • In many cases, reference case/base rate data on which we can ground our risk and mitigation impact quantification processes either don’t exist or, if they do, are inevitably incomplete.

  • The subjective estimates we are usually forced to use when attempting to quantify operational and strategic risks and the potential impact of mitigation actions are almost always affected by at least five individual, group, and organizational biases, including:
    1. Over-optimism (e.g., the level of the mean or median estimate);
    2. Overconfidence (e.g., the width of the range of possible outcomes);
    3. Confirmation/Desirability (we pay more attention and give more weight to information that supports our view, or the outcome we desire, and less to information that does not);
    4. Conformity (we hesitate to deviate from the prevailing group view); and
    5. A strong organizational desire to avoid errors of commission (i.e., false alarms about potential risks that don’t materialize) even though this automatically increases the likelihood of errors of omission (i.e., missed alarms about potential risks that actually occur).

  • Complete quantification of the relationships between operational and strategic risks, and between them and hazard and market risks, and how these relationships could vary over different situations and over time is, from both an estimation and a computational perspective, a practical impossibility.

With these observations in mind, let us return to Carillion.

In reviewing what we know so far about this failure (and we will know much more when various inquests and litigation cases are completed), two critical points stand out for us.

First, it is not as though the risk of large project cost overruns sinking a company is not well-recognized or well-documented. For example, Professor Bent Flyvbjerg has extensively documented the regularity with which cost overruns occur on large projects (e.g., see his paper, “Over Budget, Over Time, Over and Over Again: Managing Major Projects”), and project revenue recognition has for years been a major preoccupation of professional accounting standards bodies.

This leads us to infer (perhaps incorrectly) that Carillion’s management and board must have been very confident that these well-known risks were adequately mitigated by the plans the company had put in place to address them. This raises questions about the evidence that provided the basis for this high degree of confidence, as well as the actions taken to confirm that these plans were being implemented (we look forward to internal audit and compliance reports eventually being publicly disclosed).

Second, the Carillion failure highlights yet again the danger of putting too much trust in enterprise risk management models that attempt to quantify and aggregate very different hazard, market, operational, and strategic risks into a unified measure of “residual/retained risk” exposure that can be compared to an equally neat “risk appetite” number.

We continue to stress that when it comes to managing and governing risk, a desire for conceptual elegance is too often achieved at the cost of dangerous risk blindness that only becomes apparent when it is too late to avoid organizational failure.

Of course, this raises the question of what constitutes a better approach to the management and governance challenges posed by various types of risk. Here's a short summary of our view:

  • Use of quantitative Enterprise Risk models that aggregate gross and net exposures to hazard and market risks still makes sense, with the caveats noted above. Given the limitations of these models, their use should be complemented with other techniques, like scenario based stress testing.

  • The general category of "operational risk" encompasses a very wide range of "things that could go wrong." Where such risks can be readily quantified, priced, and transferred, they should be included in the quantitative Enterprise Risk Management models and system. Where this is not the case, risk management should focus on establishing plans, processes, and systems that are robust to potential operational failures under a wide range of scenarios, while also building in various sources of resilience when robust design falls short and failures occur. There are many techniques that can be used to analyze and manage these risks, such as failure mode and effects analysis. And key actions to mitigate operational risks should be assessed and verified at regular intervals. A final focus should be on building an adaptive organization that can constantly identify and adjust to new operational risks created by increasing internal complexity and/or a changing external environment.

  • When it comes to balancing risk exposure with a board's risk appetite, strategic risks present the most vexing challenge. As we have repeatedly noted, attempts at quantifying these risks are at best highly uncertain. It must therefore be the case that a board's decisions about strategic risk exposure versus risk appetite ultimately depend on directors' subjective judgment. But that does not mean such judgments must be unstructured. Consciously or not, they will usually reflect an assessment of the degree of imbalance between the goals being pursued, the resources available, and the strategy for employing those resources in light of the uncertainties facing the organization. The greater the degree of imbalance between goals, resources, and strategy, and the higher the external uncertainty, the greater an organization's strategic risk exposure.

Modeling -- Not as Easy as it Looks!

No, we’re not talking about a catwalk in stilettos. We’re talking about an activity that, especially since VisiCalc first ran on an Apple II in 1979, has become an integral part of management.

For all its current ubiquity, what too many people fail to appreciate is the amount of uncertainty inherent in quantitative modeling. With that in mind, we offer this quick review.

Level 1: Choice of Theory

Explicitly or implicitly, models intended to explain or predict observed effects begin with a causal theory or theories. The accuracy of the conceptual theory that underlies a quantitative model is rarely acknowledged as an important source of uncertainty.

Level 2: Choice of Modeling Method

The next step is choosing a modeling method that accurately captures the major features of the theory. For example, where theory states that the target effects to be modeled emerge from the bottom-up via the interaction of agents with varying information and beliefs, agent-based modeling may be the method chosen. Alternatively, where theory states that the target effect is heavily driven by feedback loops, then a top-down system dynamics modeling approach may be used.

The extent of the match between theory and the modeling approach chosen is another potential source of modeling uncertainty.

Level 3: The Structure of the Model(s)

Yet another source of uncertainty is the set of structural choices made when implementing a given modeling method. These include the variables that are included in the model, and the nature of the relationships between them (e.g., are they related to each other, and, if they are, is the relationship linear or non-linear, and constant or dependent on other variables?).

In some cases, uncertainties about the correct structure of a model can be resolved through the use of “ensemble” methods, which involve the construction of multiple models and the aggregation of their outputs.
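As a toy illustration of the ensemble idea, the sketch below averages three structurally different models of the same relationship and reports how much they disagree. The three model forms and their coefficients are invented for this example. They were chosen so that the models agree at one input value and diverge at another.

```python
from statistics import mean, pstdev


# Three hypothetical structural assumptions about how an output depends on x.
def linear_model(x):
    return 2.0 * x


def quadratic_model(x):
    return 0.5 * x ** 2


def saturating_model(x):
    return 10.0 * x / (1.0 + x)


def ensemble_forecast(models, x):
    """Return the average forecast and the spread across the models."""
    outputs = [model(x) for model in models]
    return mean(outputs), pstdev(outputs)


models = [linear_model, quadratic_model, saturating_model]

# At x = 4 all three structures happen to give the same answer.
agree_mean, agree_spread = ensemble_forecast(models, 4.0)

# At x = 6 the structural choice starts to matter.
diverge_mean, diverge_spread = ensemble_forecast(models, 6.0)

print(agree_mean, agree_spread)
print(round(diverge_mean, 2), round(diverge_spread, 2))
```

The spread across models is arguably as useful as the average. A wide spread is a signal that the conclusion depends heavily on structural assumptions that the data cannot settle.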

Level 4: The Values of Model Variables

“Parameter uncertainty” refers to doubts about the accuracy of the values that are attached to a model’s variables, including dependency relationships between them (e.g., their degree of correlation). In simple deterministic models, this involves disagreements over values for individual variables, or the values to be used in “best-case, worst-case, most-likely case” scenarios.

In more complex Monte Carlo models, values for key variables are specified as distributions of possible outcomes. In this case, sources of uncertainty include the type of distribution used to describe the possible range of values for a variable (e.g., a normal/Gaussian or power law/Pareto distribution), and the specification of key values for the selected distribution (e.g., will rare but potentially critical events be captured?).
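The choice of distribution can matter a great deal in the tail. The sketch below draws simulated values from a normal distribution and from a Pareto distribution with the same mean and standard deviation, then compares an extreme percentile of each. The shape parameter, sample size, and seed are arbitrary assumptions, and the helper name `tail_quantile` is our own.

```python
import random


def tail_quantile(samples, q):
    """Return the empirical q-th quantile of a list of samples."""
    ordered = sorted(samples)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]


rng = random.Random(42)  # fixed seed so that runs are repeatable
n = 100_000
alpha = 3.0              # hypothetical Pareto shape parameter

# Theoretical mean and standard deviation of a Pareto(alpha) with minimum 1,
# used so that the normal distribution matches it on both measures.
mean = alpha / (alpha - 1.0)
sd = (alpha / ((alpha - 1.0) ** 2 * (alpha - 2.0))) ** 0.5

normal_draws = [rng.gauss(mean, sd) for _ in range(n)]
pareto_draws = [rng.paretovariate(alpha) for _ in range(n)]

normal_tail = tail_quantile(normal_draws, 0.999)
pareto_tail = tail_quantile(pareto_draws, 0.999)
print(round(normal_tail, 2), round(pareto_tail, 2))
```

Even though both distributions share the same mean and standard deviation, the extreme percentile under the Pareto assumption is far larger. That is the sense in which the distribution choice determines whether rare but critical events are captured.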

Level 5: Recognizing Randomness

For most variables, there is an irreducible level of uncertainty that more data or better knowledge about the variable in question cannot eliminate. Sources of this randomness can include sensor and measurement errors, or small fluctuations caused by a complex mix of other factors. Whether and how this randomness is included in potential variable values is another source of model uncertainty.

Level 6: Mathematical Errors

We’ve all done it – wrongly specified an equation or variable value when building a model late at night (and/or under time pressure). And most of us are usually lucky enough to catch those errors the next morning before someone else does, when, after a few cups of coffee, we test our model before finalizing it and say, “that doesn’t look right.” Like it or not, mathematical errors are yet another – and very common – source of model uncertainty.

The discipline of model verification and validation is used to assess these six sources of model uncertainty. Verification focuses on the accuracy with which a model implements the theory upon which it is based, while validation assesses the accuracy with which a model represents the target system.

Level 7: Non-Stationarity

This brings us to the final source of uncertainty. Validation usually involves assessing the extent to which a model can reproduce the target system’s historical results. However, if the system itself is evolving or “non-stationary” – and particularly if that evolution is driven by a complex process that cannot be fully understood (i.e., it is “emergent”), then a final source of uncertainty is how long a model’s predictions will remain accurate (within certain bounds).

Computer models have substantially increased business productivity as they have come into widespread use over the past forty years. Yet they have also introduced new sources of uncertainty into decisions that are made using their outputs. It is for this reason that wise decision makers always test model results against their intuition, and when they disagree take the time to further explore and understand the root causes at work. Both modeling methods and decision makers’ intuition usually benefit from the time invested in this discussion.