**The law of large numbers** is one of the most important theorems in probability theory. It states that, as a probabilistic process is repeated a *large number of times*, the frequencies of its possible outcomes will get closer and closer to their respective probabilities.

For example, flipping a regular coin many times results in approximately 50% heads and 50% tails frequency, since the probabilities of those outcomes are both 0.5.

The law of large numbers captures the fundamental relationship between the concepts of **probability** and **frequency**. In a way, it provides the bridge between probability theory and the real world.

In this post I’m going to introduce the law with a few examples. I’m going to skip the formal proof, as it requires a slightly more advanced mathematical machinery. But even the informal “proof” I’m going to show you should give a good intuition for why the law works.

# Introduction

Let’s define the frequency of any of the possible outcomes of a random process as the number of times the outcome has happened divided by the total number of trials. In other words, as the percentage of trials in which the outcome has occurred.

Let’s say you flip a regular coin a certain number of times. After each flip you record if it was heads or tails and at the end you calculate the total count of the 2 outcomes. Here’s a question for you. About what % of the flips do you think will be heads?

Your gut feeling is to say “about 50%”, yes? That’s because you know regular coins have about 0.5 probability of landing heads. Your reasoning is correct, but still, let’s try to apply some skepticism.

I said you flipped the coin a certain number of times but I didn’t say how many. What if you only flip it once? Then, if it lands heads, 100% of the flips will be heads, otherwise 0% will be heads. Those are clearly not the expected 50%!

What if you flip it, say, 4 times? Well, grab a coin and try this experiment on your own. You’ll see that it’s not that uncommon to get sequences with 3 or even 4 heads/tails. In such cases, you will get frequencies like 0%, 25%, 75%, or 100%. As you can see, the “about 50%” answer is at least a little suspicious.

Of course, when I said that you flip the coin “a certain number of times”, you probably assumed I meant a higher number, like 100 or 1000. Then, you say, the frequency of heads flips will be much closer to 50%. And I agree.

In fact, this is exactly what the law of large numbers is all about.

# The law of large numbers

When you flip the coin 1, 2, 4, 10, etc. times, the frequency of heads can *easily* happen to be away from the expected 50%. That’s because 1, 2, 4, 10… are all small numbers. On the other hand, if you flip the coin 1000 or 10000000 times, then the frequency will be very close to 50%, since 1000 and 10000000 are large numbers.

Before we go any further, let’s get some visual intuition about this. I wrote a short script to simulate 1000 random coin flips and plotted the % of heads after each flip. Click on the image below to see the animated simulation:

Click on the image to start/restart the animation.

As you can see, when the number of flips (let’s call it N) is small, the frequency can be quite unpredictable. It starts at 0 because the first flip happened to be tails. Then it sharply jumps to values above 50% and eventually, around the 500th flip, it starts to settle around the expected 50%. So, notice that even N = 200 is still not large enough for the frequency to get “close enough” to the probability.
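For reference, here’s a minimal Python sketch of that kind of simulation. The function name and the seed are my own choices, not the post’s actual code, and the plotting step is replaced by printing the running frequency at a few checkpoints:

```python
import random

def running_heads_frequency(n_flips, p_heads=0.5, seed=None):
    """Simulate n_flips coin flips and return the frequency of heads
    after each flip (a list of length n_flips)."""
    rng = random.Random(seed)
    heads = 0
    freqs = []
    for i in range(1, n_flips + 1):
        if rng.random() < p_heads:
            heads += 1
        freqs.append(heads / i)  # running frequency after i flips
    return freqs

freqs = running_heads_frequency(1000, seed=42)
for n in (1, 10, 100, 1000):
    print(f"after {n:>4} flips: {freqs[n - 1]:.3f}")
```

After the first flip the frequency is necessarily 0% or 100%, and by the last flip it has typically drifted much closer to 50%, just as in the animation.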

At this point you might be asking yourself a natural question. As far as the law of large numbers is concerned, what exactly is considered a high N? The answer is that there isn’t a magical number that is always large enough for any random process. Some might require a higher or lower number of repetitions for the frequency of the outcomes to start converging to their respective probabilities. I will come back to this point later.

By the way, those of you who read my post on estimating coin bias with Bayes’ theorem might make an interesting connection. As a small exercise, I’ll let you think about how the law of large numbers relates to the shape of the posterior distribution of a coin’s bias parameter as the number of flips increases. Also think about the general connection with parameter estimation, which I talked about in the Bayesian vs. Frequentist post. Don’t hesitate to share your thoughts in the comment section below!

In the next section, I’m going to show a few more examples. Then I’m going to state the law in a bit more formal mathematical terms and show you an intuitive proof for why it works.

## More examples

Let’s take a look at another animated simulation of 1000 flips of the same coin.

Click on the image to start/restart the animation.

This time the first flip happened to be heads, that’s why the frequency starts all the way at 100%. Notice that the convergence is slightly slower compared to the previous simulation. In fact, even after 1000 flips, it hasn’t quite settled around 50%. Further evidence that there isn’t a special number of flips that will guarantee a “good enough” convergence.

Now let’s look at another simulation of 1000 flips. But this time we’re flipping a fake coin that has a 0.65 bias towards heads. Meaning, the probability of landing heads is 0.65, instead of 0.5. Here are the results from the code I just ran:

Click on the image to start/restart the animation.

Again, no problem: the frequency converges towards the probability at a rate similar to the previous examples.
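The biased-coin run is the same kind of sketch with the heads probability bumped up to 0.65 (again, the function name and seeds are mine, not the post’s actual code):

```python
import random

def heads_frequency(n_flips, p_heads, seed=None):
    """Frequency of heads after n_flips flips of a coin with bias p_heads."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_heads for _ in range(n_flips))
    return heads / n_flips

# Small n: the frequency can easily land far from 0.65.
print(heads_frequency(10, 0.65, seed=1))
# Large n: the frequency should be close to 0.65.
print(heads_frequency(10_000, 0.65, seed=1))
```

Nothing about the law changes when the coin is biased: the frequency simply converges to 0.65 instead of 0.5.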

Notice that in each case the exact path of convergence is quite unique and unpredictable. In a way, this makes the law of large numbers beautiful. Regardless of *how exactly* the long-term frequency reaches the theoretical probability, the law guarantees that sooner or later the two will meet.

Now let’s look at a slightly more complicated random process than flipping a coin. Instead, let’s roll some dice.

After a single roll of a die, any of the six sides has an equal probability of 1/6 to be on top. This means that, according to the law of large numbers, the expected long-term frequency of each side should be ~16.7%. Take a look:

Click on the image to start/restart the animation.

Again, notice that even after hundreds of rolls, the frequencies haven’t completely settled towards the expected ones. In fact, it seems like convergence here is somewhat slower compared to the coin flipping example, doesn’t it? It will generally take longer for frequencies of a process with six possible outcomes to settle, compared to a process with only two possible outcomes. I’ll come back to this point in a bit.
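The die simulation can be sketched the same way, now tracking six frequencies instead of one (helper names are my own, not from the post’s code):

```python
import random
from collections import Counter

def die_frequencies(n_rolls, seed=None):
    """Roll a fair six-sided die n_rolls times; return the frequency of each side."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {side: counts[side] / n_rolls for side in range(1, 7)}

for side, freq in die_frequencies(1000, seed=7).items():
    # Each frequency should hover around 1/6 ≈ 0.167, but after only
    # 1000 rolls, noticeable deviations are still common.
    print(f"side {side}: {freq:.3f}")
```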

Take a look at what path the frequency of each outcome took in the simulation (the black dashed line indicates the expected percentage, EP):

You can see deviations from the EP as large as 5% even after 100-200 rolls. In fact, even at the end of 1000 rolls, most sides’ frequencies have not completely converged. Think about the kinds of implications this could have if you decide to play some sort of a gambling game involving rolling dice!

## A formal statement of the law

The law of large numbers was something mathematicians were aware of even around the 16th century. But it was first formally proved at the beginning of the 18th century, with significant refinements by other mathematicians over the following 2-3 centuries.

So far, I’ve been implicitly using a particular form of the law stated by the 20th century French mathematician Émile Borel.

In words, this formulation says that when the same random process is repeated a large number of times, the frequency of the possible outcomes will be approximately equal to their respective probabilities.

Here’s the same statement, formulated as a mathematical limit (don’t worry if you’re not familiar with limits, I’ll explain what each of the terms means):

**N_n(outcome) / n → P(outcome)**, as **n → ∞**

Here **n** is the number of times the random process was repeated.

**N_n(outcome)** is the number of times a particular outcome has occurred after n repetitions.

**P(outcome)** is the probability of the outcome.

The **arrows** are standard notation when writing limits and should be read as “approaches”.

The **∞** symbol represents (positive) infinity.

For any given n, N_n(outcome)/n is equal to the percentage/proportion/frequency of that outcome after n repetitions.

Here’s how the statement above reads:

- N_n(outcome)/n approaches P(outcome), as n approaches infinity.

For one value to approach another simply means to get closer and closer to it. So, a more verbose way to read the statement would be:

- The frequency of an outcome after n repetitions gets closer to its probability, as n gets closer and closer to infinity.

Of course, for a number to get closer to infinity simply means that it keeps getting larger and larger.

The law of large numbers is both easy to formulate and intuitive. I also showed you some empirical evidence for its validity. But does it really always work?

Just because it works for random processes like flipping a coin and rolling a die, does it mean it will work for any random process? Not to mention that those weren’t even the “real” processes, but computer simulations of them.

You’re right, you can’t just blindly accept something as true, just because it sounds intuitive. Especially in mathematics. And even less so in probability theory!

Fortunately, formal proofs do exist. Some forms of the law are called **weak**, where only “convergence in probability” is guaranteed, as opposed to **strong** formulations of the law, where “almost sure” convergence is guaranteed.

I won’t go into detail about those distinctions here, as it would require slightly more advanced concepts beyond the scope of this post. If you’re curious about the terms, feel free to check out the **Almost surely and almost never** section of my post on the meaning of zero probabilities.

## An informal proof

First, let’s establish what conditions about the random process need to be true for the law to hold.

The part of the law which says that “the same random process” needs to be repeated a large number of times means that the process is repeated under identical conditions and the probability of each outcome is independent of any of the previous outcomes. In probability theory, this is known as being independent and identically distributed (IID).

If probabilistic dependence is a new concept to you, check out the **Event (in)dependence** section of my post about compound event probabilities.

Anyway, back to the proof.

### A toy example with fake coins

Imagine you have these special coins where *both sides* of the coin are the same (either both are heads or both are tails). Say you have exactly 50 of the “heads” coins and 50 of the “tails” coins. You put all N = 100 coins in a bag and mix it really well.

Now you close your eyes, draw *n* of those coins, and calculate the % of them that were heads. This should be familiar territory by now. If n = 1, the % will be 0 or 100, depending on the kind of coin you happen to draw.

What if n = 99? Now you have drawn all but 1 of the coins. If the one remaining coin is a “heads” coin, the proportion of heads among the drawn coins will be 49/99, which is about 49.5%. On the other hand, if the remaining coin is “tails”, the proportion will be 50/99, or approximately 50.5%. In both cases, the % is very close to the actual 50% of “heads” coins in the bag.

What if we make n = 98? Here, the worst possible deviation from 50% would happen if the 2 remaining coins were of the same kind. In those cases, the % of heads would be 48/98 ≈ 49% or 50/98 ≈ 51%.

The point I’m trying to make is that, when n is very close to N (in this case 100), even in the worst case scenario you still get a frequency very close to the real one (50%). But if you keep decreasing n, as you get closer to 0, the % of heads can have arbitrarily large deviations from 50% (for example, when n < 10). And, of course, when n = N (100 coins drawn), the % of heads will be exactly equal to 50.
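The worst-case arithmetic in this toy example is easy to check exhaustively. With 50 “heads” coins in a bag of N = 100, the most extreme frequency you can see after drawing n coins (without replacement) occurs when the drawn coins contain as many, or as few, “heads” coins as possible. A small sketch of that calculation (the function name is mine):

```python
def worst_case_deviation(n, total=100, heads_total=50):
    """Largest possible |frequency - 0.5| after drawing n coins (no replacement)
    from a bag of `total` coins, `heads_total` of which are heads."""
    # At least n - (total - heads_total) of the drawn coins must be heads
    # (once all the tails are used up), and at most heads_total (or n) can be.
    min_heads = max(0, n - (total - heads_total))
    max_heads = min(n, heads_total)
    target = heads_total / total
    return max(abs(min_heads / n - target), abs(max_heads / n - target))

for n in (1, 10, 50, 90, 98, 99, 100):
    print(f"n = {n:>3}: worst deviation from 50% is {worst_case_deviation(n):.1%}")
```

Running this reproduces the numbers from the text: at n = 99 the worst deviation is about 0.5%, at n = 98 about 1%, and at n = 100 it is exactly 0, while for small n the deviation can be as large as 50%.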

### Generalizing to the infinite case

Now imagine the exact same example as the one above but with one difference. Every time you draw a coin from the bag, instead of putting it aside, you just write down its type, throw it back inside, and reshuffle the bag.

By following this procedure, we are essentially creating the identical and independent conditions required by the law of large numbers. Because we shuffle the bag after each draw, we are guaranteeing the same 0.5 probability of drawing either kind on every trial. This follows from the classical definition of probability I introduced in a previous post.

If you think about it, this example is basically equivalent to flipping a regular coin n number of times. In both cases trials are independent of each other and in both cases there are 2 possible outcomes, each with a probability of 0.5.

If you stretch your imagination a little bit, the example is also equivalent to a hypothetical situation where you have a magical bag with an *infinite* number of coins, where “half” of them are “heads” coins and the other “half” are “tails” coins. In other words, this is just like the toy example above, but instead of N = 100, we have… N = ∞.

When all 100 coins in the toy example are drawn, the frequency of the outcomes exactly matches the frequency of each type of coin in the bag. But when N is infinity, what exactly does it mean to have drawn *all* coins? Well, it just means to draw *a lot* of coins!

*Drum rolls*…

When n = N, the frequency of an outcome is equal to its probability. Therefore, when N = ∞, the frequency of an outcome will become equal to its probability when n = ∞.

In other words, with the toy example (where N was a finite number) we established that as n gets closer to N, it becomes harder and harder for the frequency to deviate from the expected frequency *too much*. And now you just need to transfer this intuition to the case where N is infinity.

Like I said, this is not a formal proof. It’s something intended to give you a general intuition for why the law of large numbers works. If you find any of this confusing, feel free to discuss it in the comment section.

## How large is large enough?

The last thing I want to briefly touch upon is something that came up several times throughout this post. Namely, how many trials does it take for the law of large numbers to start “working”?

The short answer is that the question itself is a bit vague. Remember, the law of large numbers guarantees that the empirical frequency of an outcome will be approaching (getting closer to) its expected frequency (as determined by the probability of the outcome). So, a piece of information that needs to be added to the above question is *how close* do you need to get to the expected frequency.

For example, when flipping a coin, you might not mind a difference of, say, 5%. In that case, even a few hundred flips will get you there. On the other hand, if you need to rely on a frequency that is *almost* equal to the expected 50%, you will need a higher number of flips. Like the law says, the higher the number of trials, the closer the frequency will be to the expected one.

Another important factor is variance. Generally, the higher the number of possible outcomes a random process has *and* the more the outcomes are *different* from each other, the longer it will take for the frequencies to converge to their probabilities. You already saw this with the die rolling example. After 1000 rolls, convergence was worse compared to the earlier coin examples (again after 1000 flips).
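You can get a quick empirical feel for this trade-off with a small experiment: average the absolute deviation of the heads frequency from its probability over many simulated runs, for increasing run lengths. This is my own illustration, not code from the post:

```python
import random

def avg_abs_deviation(n_trials, n_reps=200, p=0.5, seed=0):
    """Average |frequency - p| over n_reps simulated runs of n_trials flips each."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        heads = sum(rng.random() < p for _ in range(n_trials))
        total += abs(heads / n_trials - p)
    return total / n_reps

for n in (10, 100, 1000, 10000):
    # The typical deviation shrinks roughly like 1/sqrt(n).
    print(f"n = {n:>5}: average deviation ≈ {avg_abs_deviation(n):.4f}")
```

Each tenfold increase in the number of trials shrinks the typical deviation by roughly a factor of three, which is one concrete way to see why “large enough” depends on how close you need to get.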

There are mathematical papers that go deeper into this topic and give formal estimates for the rate of convergence under different conditions. But I think by now you should have a good initial intuition about it. The best way to get a better feel is to play around with other simulations. Maybe more complicated than flipping coins and rolling dice. Try to see the kinds of factors that determine the rate of convergence for yourself.

# Summary

In this post, I introduced the law of large numbers with a few examples and a formal definition. I also showed a less formal proof. Here are a few take-homes.

The law of large numbers shows the inherent relationship between frequency and probability. In a way, it is what makes probabilities useful. It essentially allows people to make predictions about real-world events based on them.

The law is called the law of large numbers for a reason — it doesn’t “work” with small numbers. But what is a large number depends very much on the context. The closer you need to get to the expected frequency, the larger number of trials you will need. Also, more complex random processes which have a higher number of (and more diverse) possible outcomes will require a higher number of trials as well.

The law of large numbers is a mathematical theorem but it’s probably not an accident that it actually has the word ‘law’ in it. I like to think about it in similar terms as some natural physical laws, such as gravitation. The probability of an outcome is like a large body that “pulls” the empirical frequency towards itself! The exact trajectory might be different every time, but sooner or later it will reach it. Just like a paper plane thrown from the top of a building will eventually reach the ground.

I hope you found this post useful. In future posts, I will talk about the more general concept of convergence of random variables, where convergence works even if some of the IID requirements of the law of large numbers are violated.

See you soon!

Ken boire says

January 2, 2019 at 5:20 pm

Very good. But can you explain the life-changing FDA trials that rely on n as small as 10 in a control and 20 overall? FDA has just approved a trial for Invivo Therapeutics that is for spinal cord paralysis. There has to be a 20% difference in measurable recovery between the two sample groups to satisfy the trial. How does this make sense? Is it valid? How is an n=10 sample for a control better than using existing data from the universe of patients historically available?

The Cthaeh says

January 2, 2019 at 6:35 pm

Hi, Ken. That’s an excellent question. I’m glad I have an opportunity to compare these 2 general cases.

In the context of the law of large numbers (LLN), we’re interested in estimating the probability P of an outcome by its frequency over some number of trials. Here, how large n should be will depend on how close we want our estimate to be to the real value, as well as on the size of P. Smaller P’s will generally require larger n’s to get a good estimate.

The n in the types of studies you mentioned has different requirements to satisfy. I haven’t read the specific study you described, but my guess is they are comparing some central measure (like the mean) between a control group and the experimental group using the therapeutics and testing the difference for statistical significance.

When dealing with null hypothesis significance testing (NHST) and p-values, you’re trying to avoid making one of the two types of statistical error and you need some adequate n for that. And you’re concerned about stuff like statistical power. But like I said, what is adequate for this domain depends on different things compared to the situation with the LLN.

If you want to learn more about the things I just described, you can check out my post explaining p-values and NHST.

Now, having said all that, n=10 does sound like a very small number for such a study (or any study really, but especially for a study of such importance). Unfortunately, a lot of studies in social sciences do suffer from significant methodological weaknesses, so your suspicions about that particular study are most likely justified.

As for your question about historical patient data, I couldn’t agree more. In my opinion, one of the significant downsides of NHST is precisely the fact that it forces every study to be conducted in a sort of “scientific vacuum”, without the ability to incorporate already existing data into the analysis.

To summarize my answer, in NHST you can get away with smaller n because you’re not concerned with exact estimation of any parameters but simply with wanting to make sure that two means aren’t equal to each other (or that the effect size is larger than some threshold, which is related to the 20% required difference you mentioned). But despite that, for this particular case your concerns are most likely quite valid.

Bing Garcia says

May 1, 2019 at 5:11 am

Edward Thorp: “The main disadvantage to buy-and-hold versus indexing is the added risk. In gambling terms, the return to buy-and-hold is like that from buying the index then adding random gains or losses by repeatedly flipping a coin. However, with a holding of 20 or so stocks spread out over different industries, this extra risk tends to be small”.

Is Thorp referring to the law of large numbers? Thanks so much.

The Cthaeh says

May 1, 2019 at 11:39 am

Hey Bing.

Thorp is talking about the added variance on your expected return on investment when you choose to do buy-and-hold, compared to just buying the index. He isn’t directly talking about the law of large numbers, but the truth of his claim does implicitly depend on it. Any time you’re investing and there’s any kind of risk involved (which probably covers close to 100% of all investments), you are relying on the law of large numbers to turn a profit in the long run, even if you do end up suffering temporary losses from individual investments.

Does this help?