Why do people flip coins to resolve disputes? It usually happens when neither of two sides wants to compromise with the other about a particular decision. They choose the coin to be the *unbiased agent* that decides whose way things are going to go. The coin is an unbiased agent because the two possible outcomes of the flip (heads and tails) are equally likely to occur. Mathematically, this is expressed as:

- P(Heads) = P(Tails) = 0.5

But think about it. Have you ever bothered to check if heads and tails are really equally likely outcomes for the coins you flip? Actually, no real coin is truly fair in that sense. One side is always slightly heavier or bumpier than the other. This biases the coin away from the ideal 50/50 ratio of heads and tails.

In fact, the coin flipping process itself can slightly skew even an ideal coin’s outcomes. In their SIREV paper, authors Persi Diaconis, Susan Holmes, and Richard Montgomery analyze the math and physics behind the process of coin flipping. Here’s an excerpt from their abstract (you can find the original pdf here):

We analyze the natural process of flipping a coin which is caught in the hand. We prove that vigorously-flipped coins are biased to come up the same way they started. […] For natural flips, the chance of coming up as started is about .51.

## Introduction

Whether this kind of bias is a problem in the real world is a separate question. However, if you decide to gamble on coin flips, it will noticeably affect your long-term winnings as the number of flips grows. For example, if a coin comes up heads with probability 0.51 (instead of 0.5), after 10000 flips the expected number of heads is 5100. This is 100 more than the expected number for a perfectly unbiased coin.

Okay, maybe you don’t ever intend to gamble with coins. And you don’t care if any coin is biased or not. However, thinking about the process of estimating a coin’s bias is very useful in itself. You can apply analogous methods to all sorts of problems that you *do* care about.

In this post I’m going to show a way of estimating the bias of a coin using Bayes’ theorem. The method relies only on empirical data collected by flipping the coin multiple times. It isn’t concerned with any of the physics behind individual flips.

Instead of flipping a real coin and reporting its outcomes, I’m going to simulate coin flips with the programming language MATLAB. I’m not going to discuss any of the code here but if you’re into programming you can download it and run your own simulations. You don’t need to be into programming to be able to follow the rest of the post, however.

In the actual simulation, I’m going to use Bayes’ theorem to recalculate the estimate of a coin’s bias after every flip. But before I do that, I’m going to introduce the concept of probability distributions which is going to be helpful along the way.

## Probability distributions

*For a deeper introduction to probability distributions, check out my post dedicated to this topic.*

Imagine you have a random process with multiple possible outcomes. An example is rolling a six-sided die, where the 6 possible outcomes are the numbers 1 through 6.

The sample space of a random process is the set of these possible outcomes. Each outcome has a particular probability. If you remember from my post about sample spaces, the probability of the entire sample space is equal to 1 because, with complete certainty, *exactly one* of the possible outcomes will occur. The exact way this total probability is distributed among the possible outcomes is called a **probability distribution**.

Think about the die rolling example again. Assuming the die is perfectly unbiased and each outcome is equally probable, you divide the total probability (1) into six equal parts and the probability of each outcome becomes 1/6:

- P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6

A random process can have any number of possible outcomes. A probability distribution only requires that the sum of all probabilities adds up to 1 — neither more nor less. This suggests that anytime you adjust the probability of one outcome, this will be “at the expense of” the probability of at least one of the other outcomes.

For example, if it turns out that the die is unfairly biased to come up 6 with probability 0.4 (instead of 1/6), the remaining 5 outcomes *only* have a total probability of 0.6 to distribute among each other.
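Here’s a minimal Python sketch of the two dice described above. The text doesn’t say how the remaining 0.6 is split among the other five faces, so splitting it evenly is an assumption made purely for illustration. Exact fractions are used so the totals come out to exactly 1.

```python
from fractions import Fraction

# Fair die: the total probability 1 is split into six equal parts.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Biased die: P(6) = 0.4, so only 0.6 is left for the other five faces.
# Splitting that remainder evenly is an illustrative assumption.
p_six = Fraction(2, 5)
biased_die = {face: (1 - p_six) / 5 for face in range(1, 6)}
biased_die[6] = p_six

for name, die in (("fair", fair_die), ("biased", biased_die)):
    shown = {face: str(p) for face, p in die.items()}
    print(name, shown, "total =", sum(die.values()))
```

In both cases the probabilities add up to 1; raising the probability of a 6 simply leaves less for the other faces.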

### Probability distributions and uncertainty

A probability distribution is a general way of describing the uncertainty in a random process (like rolling a die). For example, before you roll the die, you’re uncertain about which side will be facing up because the process is random and each outcome will occur with some probability (which is specified by the probability distribution).

But what if you’re also uncertain about the probability distribution itself? You don’t have to assume that the die is fair and each outcome’s probability is 1/6. You can actually assign probabilities to the possible candidates for the die’s real probability distribution. This way you have a probability distribution *for* the possible probability distributions! Things are getting deep.

## The bias of a coin

In this section, I’m going to illustrate the last concept with a simulation. However, instead of a die, I’m going to use a coin because having only two possible outcomes will make calculations much simpler.

Most coins are close to being fair. That is, they have the same probability of landing heads and tails (0.5). You can talk about the bias in terms of only one of those probabilities.

There are only two possible outcomes and the other probability has to be 1 minus the first. Therefore, specifying the bias of a coin actually specifies the entire probability distribution.

From now on, I’m going to measure the bias as the probability with which it lands heads:

- Bias = P(Heads)

### Representing the bias mathematically

With the above definition, a fair coin is one for which Bias = 0.5. Most coins may not be perfectly fair but their bias is still very close to 0.5 and deviations are hard to detect “with the naked eye”.

Imagine that there’s this special coin factory which produces coins having all sorts of biases. Meaning, if you pick a random coin from the factory, its bias can be any number between 0 and 1.

But to make things even simpler, assume that the possible biases are only the 101 *discrete* values between 0 and 1:

- 0, 0.01, 0.02, 0.03, …, 0.98, 0.99, and 1.

You have no reason to assume that a randomly picked coin is more likely to have any of the 101 values compared to the rest. Therefore, you start your estimation of the bias by assigning it a uniform prior distribution:

In order to not clutter the x-axis too much, I didn’t put all possible bias values in its label. But the blue line you see in the plot is supposed to be a series of 101 dots positioned very close to each other. The height of each dot (the y-axis value) represents the prior probability of the corresponding “Heads Bias” value.

In this case, the probability of each bias is 1/101 (this follows from the requirement that the probabilities in any distribution must add up to 1). Calling the prior distribution **uniform** is another way of saying that all biases are equiprobable.
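As a rough Python sketch (the variable names `biases` and `prior` are my own and don’t come from the MATLAB code), the uniform prior over the 101 candidate biases could be set up like this:

```python
# The 101 candidate biases: 0, 0.01, 0.02, ..., 0.99, 1.
biases = [i / 100 for i in range(101)]

# Uniform prior: every candidate bias gets the same probability, 1/101.
prior = [1 / len(biases)] * len(biases)

print(len(biases), biases[:3], biases[-1])
print(prior[0], sum(prior))
```

Each entry of `prior` is the height of one of the 101 dots in the plot described above.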

### Parameters and parameter estimation

**Parameter** is a term in probability theory used for referring to characteristics of a system, such as the bias of a coin. A parameter is simply a feature or a property of the system.

**Parameter estimation** is inferring the value of a parameter from new data about the system (read more in my post on Bayesian vs. Frequentist approaches to statistics).

In the next section, I’m going to reverse things and first show a simulation in which the bias of a coin (drawn from the special coin factory I mentioned above) is estimated with Bayes’ theorem. In the final section, I’m going to explain some of the technical details of the simulation.

## Coin bias simulation

Say someone randomly drew a coin from a pile produced by the factory. The coin’s bias happens to be:

- Heads Bias = 0.3

However, you don’t know this yet. The person hands you the coin and asks you to estimate its bias. If you want to do this by using Bayes’ theorem, you would flip the coin many times and use the outcomes to update the probability of each possible value of its bias. In other words, after each flip you would update the prior probability distribution to obtain the posterior probability distribution.

Then you would make the posterior distribution your next prior distribution and update it with the next coin flip. The idea is to repeat this procedure a sufficient number of times until you’ve gotten a “good enough” estimate of the coin’s bias. How good is “good enough” depends on you and your goals.

### Results

Let’s get down to business. See what happens to the posterior distribution after the first 500 flips (click on the image to start the animation):

So, after 500 flips most of the probability gets distributed around the value 0.3. In fact, the probability for most other values virtually disappeared — including the probability of the coin being fair (Bias = 0.5). This already is a pretty good estimate of the real bias!

But you might want an even better estimate. If you’re really determined to find the true bias of this coin, you can continue flipping it. Just keep flipping until you’ve reached the precision that satisfies your goals. See what happens after a set of 15 000 additional coin flips (again, click on the image to start the animation):

Well, now you can be almost certain the bias is either 0.3 or some value very close to it! You see that, as you accumulate more data in the form of coin flips, you get closer and closer to the real bias of the coin.

Now it’s time to dig into the details of the simulation.

### Explaining the simulation

As I said in the beginning, you can download the code for this simulation and run it yourself with different values for the coin bias parameter. The code is written in MATLAB and is only 30 lines long. Here’s what it does:

1. Creates an array of 101 numbers which represent the prior probabilities of the 101 possible bias values. Each probability is set equal to 1/101.
2. Selects a bias for the imaginary coin (you can change this part).
3. Generates a random number between 0 and 1 and counts it as “heads” if it’s less than or equal to the value of the bias, and counts it as “tails” if it’s greater than the bias. This is one imaginary coin flip.
4. By applying Bayes’ theorem, uses the result to update the prior probabilities (the 101-dimensional array created in Step 1) of all possible bias values into their posterior probabilities.
5. Repeats steps 3 and 4 as many times as you want to flip the coin (you can specify this too).
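The steps above can be sketched in a few lines of Python. To be clear, this is not the MATLAB code from the download; it’s a minimal re-implementation of the described steps, and the function name `simulate`, its parameters, and the fixed random seed are my own assumptions. It also leaves out the plotting.

```python
import random

def simulate(true_bias=0.3, n_flips=500, seed=0):
    """Return the candidate biases and their posterior after n_flips flips."""
    rng = random.Random(seed)                # fixed seed so runs are repeatable
    biases = [i / 100 for i in range(101)]   # candidate bias values
    probs = [1 / 101] * 101                  # step 1: uniform prior
    for _ in range(n_flips):                 # step 5: repeat steps 3 and 4
        heads = rng.random() <= true_bias    # step 3: one simulated flip
        # Step 4: multiply each prior by the likelihood of the observed result...
        unnorm = [p * (b if heads else 1 - b) for p, b in zip(probs, biases)]
        evidence = sum(unnorm)               # ...and divide by P(result)
        probs = [u / evidence for u in unnorm]
    return biases, probs

biases, posterior = simulate()               # step 2: the true bias is 0.3 here
best = max(range(101), key=lambda i: posterior[i])
print("most probable bias after 500 flips:", biases[best])
```

Try changing `true_bias` and `n_flips` to see how quickly the most probable value settles near the true one.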

There are also a few code lines that dynamically plot the updated probabilities (like the animated plots you saw in the previous section). Now the only thing left to explain is Step 4.

#### Updating the prior probability distribution with Bayes’ theorem

If you haven’t already, I suggest you take a look at my post explaining the intuition behind the rule. It’ll help you understand any parts from this section you potentially get stuck on. Also, you can take a look at another post where I show all the steps of the calculations with another example.

Say you’re currently updating the probability of the bias being 0.5 and you just flipped heads. The equation for updating the prior into a posterior probability is:

- P(Bias=0.5 | “Heads”) = P(“Heads” | Bias=0.5) * P(Bias=0.5) / P(“Heads”)

##### Explaining the terms of the equation

- **P(Bias=0.5)**: The prior probability that the bias is equal to 0.5. Initially, it’s 1/101.
- **P(Bias=0.5 | “Heads”)**: The posterior probability (or the *next* prior probability) is the updated prior probability after taking into account the result of the flip.
- **P(“Heads” | Bias=0.5)**: The likelihood term represents the probability of flipping heads, if the coin’s bias is 0.5. Well, by definition, that probability is equal to 0.5. If the result is tails instead, the likelihood will again be equal to 0.5. But if you were estimating any other piece of the probability distribution of the bias, the two would differ. For example:
    - P(“Heads” | Bias=0.8) = 0.8
    - P(“Tails” | Bias=0.8) = 0.2
- **P(“Heads”)**: The evidence term is the *overall* probability of getting heads and is the sum of all 101 (prior * likelihood) products. You can think about it as the expected probability of getting heads on the current flip. This term has the same value for all biases.
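To make the terms concrete, here’s a small Python sketch of a single update applied to all 101 candidate biases at once. The helper name `update` is hypothetical and the code is only an illustration of the formula, not the original MATLAB implementation.

```python
biases = [i / 100 for i in range(101)]
prior = [1 / 101] * 101

def update(prior, biases, heads):
    """Apply Bayes' theorem once to every candidate bias."""
    # Likelihood of the observed flip under each candidate bias.
    likelihood = [b if heads else 1 - b for b in biases]
    # Evidence: the sum of all (prior * likelihood) products.
    evidence = sum(p * l for p, l in zip(prior, likelihood))
    # Posterior = likelihood * prior / evidence, for each candidate.
    return [p * l / evidence for p, l in zip(prior, likelihood)]

posterior = update(prior, biases, heads=True)
print("P(Bias=0.5 | Heads) =", posterior[50])
print("P(Bias=0.8 | Heads) =", posterior[80])
print("P(Bias=0.0 | Heads) =", posterior[0])
```

Starting from the uniform prior, the evidence for a first heads works out to 0.5. So the probability of Bias = 0.5 stays at 1/101, the probability of Bias = 0.8 rises to 1.6/101, and the probability of Bias = 0 drops to zero (a coin that never lands heads can’t have produced heads).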

So, this way you calculate the posterior probabilities for all values between 0 and 1. Then you flip the coin again and repeat the calculations but this time with the posterior distribution as the next prior distribution. As you collect more data, your estimate of the bias will get better and better and you can get arbitrarily close to the real value.

Well, as long as you have motivation to continue flipping the coin.

## Summary

In this post I introduced two new concepts from probability theory:

- Probability distribution: a particular assignment of probabilities to all possibilities of an uncertain process
- Parameter estimation: the process of narrowing down the possible values of a parameter of a system from data generated by the system

In particular, I demonstrated the use of Bayesian parameter estimation. Even though estimating the bias of a coin is a very simple problem, once you intuitively understand the technique, the generalization to more complicated problems isn’t difficult.

YF says

Very nice simulation. Thanks.

YF says

I was unable to download the Matlab code for the simulation. Might you have one available as a text or MS Word document? Thanks!

The Cthaeh says

Hi YF, I’m glad you liked the post! You can access the file in txt format here.

YF says

Many thanks! Your site is very helpful and I appreciate the work and thought you have put into it; I will be recommending it to others.

Mattias says

How would I then use the posterior to calculate how many heads I can expect to see in 10k flips?

The Cthaeh says

That’s a good question, Mattias!

The easiest thing you can do is take a point estimate from the posterior distribution, such as its mode (called the maximum a posteriori estimate, or MAP), and use it as the proportion parameter (p) of a binomial distribution. And of course use 10000 as the n parameter of the distribution. Then the expected number of heads would be the mean of the binomial distribution itself, which is equal to p*n.

In other words, if the MAP estimate is, say, 0.55, based on this analysis you would expect about 5500 heads in 10k flips.

This isn’t the best way to do it, however. Notice that this estimate doesn’t take into account the full posterior distribution. So, if the posterior is too wide or skewed, your estimate based on the MAP wouldn’t be as accurate.

A better approach (though computationally more expensive) would be to have a separate binomial distribution for every possible value of the bias of the coin and take the weighted sum of those distributions, where the weights are the probabilities of the respective biases. For example, for simplicity let’s assume that the only possible biases for the coin are:

– 0.25, with a probability of 0.3

– 0.5, with a probability of 0.4

– 0.75, with a probability of 0.3

Then you would have 3 binomial distributions, each with the same parameter n=10000, but where the p parameter is 0.25, 0.5, and 0.75. Then you would multiply each distribution by 0.3, 0.4, and 0.3 (respectively) and take the sum of the 3 weighted distributions. Finally, you take the mean of the resulting sum of distributions as your estimate for the number of heads in 10k flips.
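A quick Python sketch of that calculation for the three-bias example (the variable names are hypothetical, and it only computes the mean of the mixture, not the full weighted distribution):

```python
# The three-point posterior from the example: bias -> probability.
posterior = {0.25: 0.3, 0.5: 0.4, 0.75: 0.3}
n = 10_000

# The mean of each binomial is n * p; weight it by the probability of that bias.
expected_heads = sum(weight * n * p for p, weight in posterior.items())

# The same number via the posterior mean, since expectation is linear.
posterior_mean = sum(weight * p for p, weight in posterior.items())

print(expected_heads, n * posterior_mean)
```

For this example the mean of the mixture is 5000 heads. It equals n times the mean of the posterior, so where the full weighted sum of binomials really adds information is in the spread around that mean.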

I realize that I am skipping some details here, but please let me know if this makes sense to you. Feel free to ask any clarifying questions, including about the details.