If you’ve ever come across Bayes’ theorem, chances are you know it’s a mathematical theorem, and one with a central role in probability theory. It’s most commonly associated with using evidence for updating rational beliefs in hypotheses. While this post is not about listing its real-world applications, I’m going to give the general gist for why it has such potential in the first place.

Imagine it’s a rainy season. One morning you start wondering whether to take an umbrella with you before you leave your house. You look outside your window to see that the weather is currently sunny. So, your first thought is that rain is unlikely.

However, upon a second glance, you notice some scary-looking dark clouds on the horizon and you decide to take your umbrella after all.

You didn’t have to stop there, of course. You could improve your guess even more by checking the forecast on your favorite weather website. What’s important is that your decision is based on your estimate of the *probability* that it will rain. So, you used information about the current weather conditions (and possibly from other sources) to *update* your estimate of this probability.

The aim of this post is to give a not-too-formal introduction to Bayes’ theorem, a mathematical tool which does the job of updating probabilities from evidence in the best possible way. Before that, I’m going to define a few other concepts from probability theory which will be necessary for understanding it.

## Events and probabilities

### Events

One way to think about an **event** is as an *outcome*, or a set of outcomes, of some general *process*. This might sound confusing, but a few examples should make it clearer:

- A coin landing heads after a single flip
- A coin landing heads 4 times after 10 flips
- Rolling a 2 with a 6-sided die
- Donald Trump becoming the next US president
- It raining on a particular day

In the first example, the event is the coin landing heads, whereas the process is the act of flipping the coin once. In the fourth example, the process is the entire US presidential election race and the event is Donald Trump winning it. You get the idea.

### Probabilities

The **probability** of an event is a number that, intuitively speaking, represents the *uncertainty* associated with the event’s occurrence. In everyday life, people often use *percentages* to denote probabilities of events. For example, the probability that a *fair* coin will land heads is 50%.

For a more standardized representation, mathematicians use the decimal 0.5 to refer to the same probability. Under this convention, an event which is *impossible* to occur would have a probability of 0 (equivalent to 0%). And an event *certain* to occur will have a probability of 1 (equivalent to 100%). An event whose occurrence has *some* degree of randomness or uncertainty will be assigned a real number between 0 and 1. The closer an event’s probability is to 1, the more likely it is that the event will occur, and vice versa.

I’m going to continue with the weather example and assign some arbitrary probability to the event that it rains. In probability theory, the notation used for expressing probabilities is **P(Event)**. For example, you can write *“the probability of rain is equal to 0.6”* (or 60%) as:

- P(“Rain”) = 0.6

What does the 0.6 probability mean? Giving a *philosophical* definition of probability is not necessarily an easy job. And there are still ongoing debates among some philosophers and statisticians on the topic. But I like this practical definition:

- If P(“Rain”) = 0.6, your expectations for rain are equivalent to your expectations to draw a red ball from a shuffled bag of 6 red balls and 4 blue balls.
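To make the ball-drawing picture a bit more tangible, here is a minimal simulation sketch in Python. The function name `estimate_red_fraction`, the number of draws, and the seed are my own choices for illustration. The only thing taken from the definition above is the bag of 6 red and 4 blue balls.

```python
import random


def estimate_red_fraction(num_draws: int = 100_000, seed: int = 0) -> float:
    """Estimate how often a red ball is drawn from a bag of 6 red and 4 blue balls."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    bag = ["red"] * 6 + ["blue"] * 4
    # Draw one ball, note its color, and put it back (repeated num_draws times).
    red_draws = sum(rng.choice(bag) == "red" for _ in range(num_draws))
    return red_draws / num_draws


if __name__ == "__main__":
    # Prints a value close to 0.6 (the exact digits depend on the seed).
    print(estimate_red_fraction())
```

Over many draws, the fraction of red balls settles near 0.6, which is the level of expectation that P(“Rain”) = 0.6 is meant to express.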

#### A small clarification

Negative probabilities and probabilities greater than 1 don’t exist. A probability of 0 already means the event will not occur and a probability of 1 means the event is certain to occur (although check out my post on zero probabilities for clarification on this rule).

## What are conditional probabilities?

Let’s continue with the weather example. You can ask the question: “What is the probability that it will rain, given that the weather is windy and there are dark clouds in the sky?”.

Here you aren’t simply interested in the general probability that it will rain. You want to know the probability of rain after taking into account a particular piece of new information (the current weather conditions). In probability theory, such probabilities are called **conditional** and the notation used for them is:

- **P(Event-1 | Event-2)**.

A conditional probability expresses the probability that Event-1 will occur when you assume (or know) that Event-2 has already occurred. With this notation, you can write *“the probability that it will rain, given that the weather is currently windy and cloudy, is equal to 0.85”* as:

- P(“Rain” | “Windy & Cloudy”) = 0.85

Question:

- How does the probability of rain change after finding out the weather is windy and cloudy?

This is precisely the type of question you’d answer with a conditional probability. In the current example, the answer is:

- The probability of rain increased to 0.85.

Remember, we said P(“Rain”) = 0.6. So, you initially thought that rain is *moderately likely*. Then, after seeing the current weather conditions, you updated your expectations to *very likely*.

It’s important to clarify that this notation doesn’t imply a causal relationship between the two events. For example, you would update your expectations for rain after seeing other people carrying umbrellas, even though the presence of umbrellas itself doesn’t *cause* rain.

In the current example, you somehow knew that P(Event-1 | Event-2) = 0.85. But what happens when you don’t know what this conditional probability is? Is there a standard way for calculating it for any kind of event?

## How Bayes’ theorem connects probabilities and conditional probabilities

In mathematics, a **theorem** is a statement whose truth you can prove using logic. The proof starts from a few basic statements, called **axioms**, which are accepted without proof.

The British statistician Thomas Bayes first discovered Bayes’ theorem and that’s why it’s named after him. So, let’s finally look at the mathematical statement it makes:

- P(Event-1 | Event-2) = P(Event-2 | Event-1) * P(Event-1) / P(Event-2)

Instead of using concrete event names (like “Rain”), I gave the two events in the equation the general names Event-1 and Event-2. The equation consists of four parts and the traditional terminology used for referring to them is:

- P(Event-1): **Prior probability**
- P(Event-2): **Evidence**
- P(Event-2 | Event-1): **Likelihood**
- P(Event-1 | Event-2): **Posterior probability**

In words, Bayes’ theorem asserts that:

- The *posterior probability* of Event-1, given Event-2, can be calculated by multiplying the *likelihood* and the *prior probability* terms and dividing their product by the *evidence* term.

In other words, you can mathematically get to the posterior probability of one event, given another. You just need to know the values of the three terms on the right-hand side of the equation.
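The statement above translates almost word for word into code. Here is a minimal sketch in Python. The function name `bayes_posterior`, its parameter names, and the input checks are my own choices, and the numbers in the usage line are made up purely for illustration.

```python
def bayes_posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Return P(Event-1 | Event-2) = P(Event-2 | Event-1) * P(Event-1) / P(Event-2)."""
    for name, value in (("likelihood", likelihood), ("prior", prior), ("evidence", evidence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability between 0 and 1, got {value}")
    if evidence == 0.0:
        # Dividing by a zero evidence term would leave the posterior undefined.
        raise ValueError("evidence must be greater than 0")
    return likelihood * prior / evidence


# Hypothetical inputs, not taken from the post: likelihood 0.5, prior 0.2, evidence 0.25.
example_posterior = bayes_posterior(likelihood=0.5, prior=0.2, evidence=0.25)
print(example_posterior)  # approximately 0.4
```

Note that the function only does the arithmetic. It can’t tell you whether the three inputs are consistent with each other, which is where most of the real work lies.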

By the way, this terminology is traditional Bayesian lingo and you shouldn’t take it *too literally*. For example, in everyday life, people often use the words “probability” and “likelihood” interchangeably. However, in probability theory, all four terms have distinct and very specific meanings.

In the rest of this post, I’m going to focus on what each of them represents.

### Prior probability

The **prior probability** of an event (often simply called **the prior**) is its probability calculated from some prior information about the event.

The word prior can be somewhat misleading. It’s not immediately clear what the probability is supposed to be *prior to*. A simple way to describe it would be as the probability of the event calculated from all the information related to the event that is *already known*. In the weather example, the prior probability of rain was given as P(“Rain”) = 0.6. This could come (for instance) from the prior knowledge that 60% of the days on the same date have been rainy for the past 100 years.

Here is another way to look at it. A prior probability is always prior with respect to some piece of information that you *left out* from the calculations. In this example, the information left out when calculating P(“Rain”) = 0.6 is basically *everything*, except for the past rain frequency for the current date.

### Evidence

You started with the prior P(“Rain”) = 0.6 but now you have new information you can use for more accurately (re-)estimating the same probability. The **evidence** term in Bayes’ theorem refers to the *overall probability* of this new piece of information.

In the current example, the information used for updating P(“Rain”) was the current weather conditions, so the evidence would be P(“Windy & Cloudy”). That is, the probability of having windy and cloudy weather, *regardless of* whether the day turns out to be rainy. You can think about it as the *average* probability of the evidence across all possibilities for the other event (here, rain and no rain), with each possibility weighted by its own probability.
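One way to make this “average” concrete is to weight the probability of windy and cloudy weather on rainy days and on dry days by how likely each kind of day is. Here is a minimal Python sketch of that idea. The prior 0.6 comes from this post, and 0.68 is the likelihood value used later in the post. The value 0.18 for P(“Windy & Cloudy” | no rain) does not appear anywhere in the post. It is my assumption, chosen so the result matches the evidence value of 0.48 used later. The function name `total_probability` is also mine.

```python
def total_probability(likelihood_if_true: float, likelihood_if_false: float, prior: float) -> float:
    """P(Evidence) as a prior-weighted average over 'hypothesis true' and 'hypothesis false'."""
    return likelihood_if_true * prior + likelihood_if_false * (1.0 - prior)


p_rain = 0.6                  # prior from the post
p_windy_given_rain = 0.68     # likelihood used later in the post
p_windy_given_no_rain = 0.18  # ASSUMPTION: not in the post, chosen for illustration

p_windy = total_probability(p_windy_given_rain, p_windy_given_no_rain, p_rain)
print(round(p_windy, 2))  # 0.48
```

With these inputs, rainy days contribute 0.68 * 0.6 = 0.408 and dry days contribute 0.18 * 0.4 = 0.072, which add up to 0.48.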

Notice that, outside Bayesian tradition, the word “evidence” is most commonly used to refer to the piece of information itself, and not to its probability. This is a good reminder to not be too literal about these terms.

### Likelihood

Unlike the previous two terms of the equation, the **likelihood** represents a conditional probability. In the weather example, this is the probability of having a windy and cloudy morning, given that it ends up raining at least once throughout that day:

- P(“Windy & Cloudy” | “Rain”).

An intuitive way to think about it is as the degree to which the first event is consistent with the second event. That is, the likelihood represents how strongly you expect that the morning will be windy and cloudy, assuming that the day is going to be rainy.

### Posterior probability

The **posterior probability** (often simply called **the posterior**) is *the* conditional probability you calculate when using Bayes’ theorem. It represents the *updated prior probability* after taking into account some new piece of information. As prior probability is always relative, so is the posterior probability of an event. What this means is that the posterior probability becomes the new prior probability which you can then update using some other piece of information. And the cycle goes on. As Dennis Lindley put it:

> Today’s posterior is tomorrow’s prior.

In the weather example, the posterior probability was P(“Rain” | “Windy & Cloudy”): the conditional probability that can convince you to take an umbrella on your way out.
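The “posterior becomes the new prior” cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The first update uses the post’s weather numbers plus an assumed P(“Windy & Cloudy” | no rain) = 0.18, chosen so the evidence comes out at 0.48. The second update uses a made-up second piece of evidence (a forecast predicting rain) with hypothetical likelihoods of 0.9 and 0.3. It also assumes that the forecast and the morning weather are independent once you know whether it rains. The function name `update` is mine.

```python
def update(prior: float, likelihood_if_rain: float, likelihood_if_no_rain: float) -> float:
    """One Bayesian update: return P(Rain | evidence) from P(Rain) and two likelihoods."""
    evidence = likelihood_if_rain * prior + likelihood_if_no_rain * (1.0 - prior)
    return likelihood_if_rain * prior / evidence


# First update: windy and cloudy morning.
# 0.6 and 0.68 are from the post; 0.18 is an ASSUMED value (see the text above).
posterior_after_clouds = update(prior=0.6, likelihood_if_rain=0.68, likelihood_if_no_rain=0.18)

# Second update: the previous posterior is now the prior.
# Both forecast likelihoods are HYPOTHETICAL numbers, not from the post.
posterior_after_forecast = update(
    prior=posterior_after_clouds,
    likelihood_if_rain=0.9,     # assumed P("Forecast says rain" | rain)
    likelihood_if_no_rain=0.3,  # assumed P("Forecast says rain" | no rain)
)

print(round(posterior_after_clouds, 2), round(posterior_after_forecast, 2))  # 0.85 0.94
```

Under these assumptions, the probability of rain goes from 0.6 to 0.85 after looking at the sky, and then to about 0.94 after checking the forecast.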

### Putting it all together

Now I’m going to take some actual values and use Bayes’ theorem to calculate the posterior probability in our weather example.

The value of the prior probability was already specified to be P(“Rain”) = 0.6. And here are the values for the likelihood and evidence terms:

- P(“Windy & Cloudy” | “Rain”) = 0.68
- P(“Windy & Cloudy”) = 0.48

The result of plugging in these numbers into the equation is:

- P(“Rain” | “Windy & Cloudy”) = 0.68 * 0.6 / 0.48 = 0.85

And yes, I did come up with these numbers so they can nicely fit into the equation, in order to get the posterior probability P(“Rain” | “Windy & Cloudy”) = 0.85.
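If the formula still feels abstract, it can help to re-do the same calculation with plain counts. The Python sketch below imagines 1000 days that follow the three probabilities above exactly. The figure of 1000 days is my own choice for illustration, and the variable names are hypothetical. Everything else is derived from the numbers in this section.

```python
# Imagine 1000 hypothetical days that follow the post's probabilities exactly.
total_days = 1000
rainy_days = 600            # P("Rain") = 0.6
windy_and_rainy_days = 408  # 68% of the 600 rainy days: P("Windy & Cloudy" | "Rain") = 0.68
windy_days = 480            # P("Windy & Cloudy") = 0.48

# Among the windy and cloudy days, what fraction turned out to be rainy?
posterior_from_counts = windy_and_rainy_days / windy_days

# The same quantity computed with Bayes' theorem, as in the equation above.
posterior_from_formula = 0.68 * 0.6 / 0.48

print(posterior_from_counts)             # 0.85
print(round(posterior_from_formula, 2))  # 0.85
```

Out of the 480 windy and cloudy days, 408 are rainy, and 408 / 480 = 0.85. Bayes’ theorem performs the same counting, only with probabilities instead of days.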

## Summary

Bayes’ theorem is the mathematical device you can use for updating probabilities in light of new knowledge. No other method is better at this job.

Its simplicity might give the false impression that actually applying it to real-world problems is always straightforward. However, getting the correct values of the terms on the right-hand side of the equation can be a challenge. In particular, calculating the evidence term for more complicated problems is often difficult. It might require the use of special mathematical techniques not directly related to the analysis.

Having said that, mathematicians develop and apply new powerful Bayesian algorithms in a wide variety of fields. Some examples include data analysis, artificial intelligence, neuroimaging, forensics, and so on. In future posts, I’m going to discuss these applications which are far more interesting and important than the toy weather example I used here.

Although this post barely scratched the surface of this topic, I hope it gives a good idea about what makes Bayes’ theorem so exciting.

If you found this topic intriguing, you will probably also like my post The Anatomy Of Bayes’ Theorem. There, I look at the theorem from a more intuitive point of view and also show its mathematical derivation.

Also, check out my two-part post on Bayesian belief networks to see a really cool way to use Bayes’ theorem for making inferences on multiple events that depend on each other.
