Welcome to the first post from my new series. Here we’re going look at a famous probability question often called the birthday problem. This is actually a more general question related to the probability of at least one coincidence after a fixed number of draws from a discrete uniform distribution.
This post is part of my series Probability Questions from the Real World.
The birthday problem is the first in the list of probability questions from Henk Tijms’ book Understanding Probability I told you about in the introductory post. Here it is, as stated in the book:
“You go with a friend to a football (soccer) game. The game involves 22 players of the two teams and one referee. Your friend wagers that, among these 23 persons on the field, at least two people will have birthdays on the same day. You will receive ten dollars from your friend if this is not the case. How much money should you, if the wager is to be a fair one, pay out to your friend if he is right?”
Before you continue reading, try to think about this question. What exactly is it asking? What information do we need to answer it? Which part of this information is immediately available and which part do you need to derive or calculate from the available part?
You’ll benefit a good deal if you first try to answer the question yourself. If you succeed, this will boost your confidence for the concepts you’ve learned so far. But trust me, even if you get stuck somewhere, just thinking about and trying to answer it will help you understand my explanations much better afterwards.
If you don’t even know where to start, some of these previous posts might give you insights:
Overview of the post
Even though the question is asking about a betting amount, the real question is about the probability of your friend being right, since we need this probability to frame the question in terms of expected values and figure out what this fair amount is. The main point of this question is to show the counterintuitive result that this probability is much higher than you might initially expect. Even with a group of just 23 people, the probability of at least one birthday coincidence is very close to 50%.
I’m first going to discuss the concept of a fair bet and the relationship between the betting amounts in a fair bet. Since this relationship is mediated by the win/lose probabilities, I’m going to show you a simple way of calculating the probability of at least one birthday coincidence in a group of 23 people.
Then, I’m going to derive a more general formula for the probability of at least one birthday coincidence among an arbitrary number of people.
Finally, I’m going to generalize this problem even more and derive a formula for the probability of at least one coincidence after a fixed number of samples from a discrete uniform distribution.
The list of four posts above are recommended readings, since we’re going to need to use a few of the concepts from them in the solution. Namely, sample spaces, compound event probabilities, as well as some basic combinatorics concepts like the rule of product and (partial) permutations. If you’re not familiar with the discrete uniform distribution, you might also want to take a look at my overview post on discrete probability distributions.
Breaking down the question
Alright, as the first step, let’s make sure we understand what exactly we’re being asked.
Basically, you have a group of 23 people whose birthdays are unknown. They could easily all have different birthdays, right? In principle, all 23 of them could have the same birthday too, though that would be quite the coincidence!
Anything between these two extremes is possible. For example:
- Two of them have the same birthday (and the remaining 21 people all have different birthdays).
- Three of them have the same birthday.
- Two people share a birthday and another three share a birthday (different from the first two)
So, given all this, your friend asserts that the 23 people on the field don’t all have different birthdays. From those 23, there’s at least 2 that have the same birthday. Or, for all your friend cares, they could indeed all have the same birthday (as unlikely as it is). The important thing is that there’s at least one coincidence. Let’s call this possibility “At Least One Coincidence” (ALOC). And let’s call the other possibility “No Coincidences” (NC).
Your friend is confident enough to offer you a bet about this. He says that if he’s wrong, he’s going to give you $10. But if he’s right, you need to pay him some (yet unknown) amount of money. In fact, this is what the question is asking — what should this unknown amount be so that the bet is a fair one?
Your first instinct might be to say that this amount should also be $10. “If I win, I get $10. If you win, you get $10”. Sounds fair, right? Not so fast.
What is a fair bet?
A simple bet is an agreement between two people about two mutually exclusive outcomes. And, depending on which of them occurs, one player pays some amount to the other.
Typically, the amounts they pay to each other are equal and this is considered to be fair. For example, if the bet is about the outcome of a fair coin flip (where “heads” and “tails” are equally likely), this sounds reasonable. But what if the two outcomes aren’t equally likely?
Imagine a bet about drawing a red/green ball out of a box containing 1 red and 9 green balls. Would it be fair for the winning amount to be the same (say, $10)? Not really. The player betting on a green ball obviously has a big advantage. We can demonstrate that by calculating the expected values for the two players.
Defining a fair bet in terms of expected values
Let’s first define some terms. Let’s call the two outcomes “green” and “red”, respectively. Also, let’s say that Ag(green) and Ag(red) represent the amount the player betting on the “green” outcome will receive if the green/red ball is drawn. For example, if the winning amount is $10, these values are:
In other words, if a green ball is drawn, the player betting on the “green” outcome gets $10 and if a red ball is drawn the same player gets $(-10). To receive a negative amount basically means to lose that amount.
Similarly, Ar(red) and Ar(green) represent the amounts the person betting on the “red” outcome will receive:
Notice that and (one player’s winning amount is the same as the other player’s losing amount).
Let’s also define P(green) and P(red) as the probabilities of these outcomes. Since 9 out of 10 balls are green and the remaining one is red, these probabilities are:
Finally, let’s define EVg and EVr to represent the expected values for the players betting on the “green” and “red” outcomes, respectively. Remember the formula for expected value:
Using this formula, we can calculate the expected values as:
Choosing fair payouts
As you can see, the two expected values are very different! If the players repeat this bet many times, after every bet the player betting on the “green” outcome will, on average, add $8 to their bank account, whereas the other will lose $8.
So, naturally, we can consider a fair bet to be one where both players have the same expected value. This means adjusting the win/lose amounts so that EVg becomes equal to EVr:
If we fix one of the winning amounts, using the known probabilities we can calculate the other to satisfy this equality:
Using and , we can rewrite this as:
Now, if we fix Ag(green) to $10 and plug it into the above equation (along with the probabilities of the outcomes), we get:
And because now the only unknown is Ar(red), we can solve for it:
Therefore, in order for the bet to be fair, if the player betting on the 0.9 probability outcome wins $10, the player betting on the 0.1 probability outcome needs to win $90 (when the outcome they’re betting on occurs).
A fair bet for the birthday problem
So, let’s apply this principle to the birthday problem. Remember, the two possible outcomes here are “At Least One Coincidence” (ALOC) and “No Coincidences” (NC). The respective probabilities of these outcomes are P(ALOC) and P(NC), which we don’t know yet. But because they are mutually exclusive and completely cover the sample space, we know that:
Which means we can represent these probabilities in terms of each other and if we find either of them, we’ve automatically found the other:
By the way, when the sum of probabilities of two events from a sample space is equal to 1, they are called complementary events.
Now, using this, let’s formalize the original question so we can properly proceed to answering it.
Putting the question in mathematical terms
“Your friend wagers that, among these 23 persons on the field, at least two people will have birthdays on the same day. You will receive ten dollars from your friend if this is not the case.”
Similar to the green/red ball example, we can define AALOC(ALOC), AALOC(NC), ANC(ALOC), and ANC(NC) as the amounts the two friends will receive depending on which outcome they’re betting on and which outcome occurs. We’re told that ANC(NC) = $10, from which we also know that AALOC(NC) = $(-10) (the amount your friend will give you if there are no birthday coincidences).
And the main question is:
“How much money should you, if the wager is to be a fair one, pay out to your friend if he is right?”
Which means the question is asking about the value of ANC(ALOC) that satisfies the equality:
When we plug in the known values, this becomes:
Notice that in the last equation I also used , in order to be left with a single unknown amount (the one we’re looking for).
And because P(ALOC) and P(NC) are complementary (add up to 1), if we calculate either of these probabilities, the ANC(ALOC) term becomes the only unknown in the equation above and then we can simply solve for it.
Therefore, answering this birthday question boils down to calculating P(ALOC) or P(NC). Let’s see how.
Solving the birthday problem
Let’s establish a few simplifying assumptions.
First, assume the birthdays of all 23 people on the field are independent of each other. Second, assume there are 365 possible birthdays (ignoring leap years). And third, assume the 365 possible birthdays all have the same probability. That is, the birthdays have a discrete uniform distribution with parameter . Meaning, each of the possible birthdays has a probability of . Remember the probability mass function of a discrete uniform distribution is:
In reality, these assumptions about birthdays are wrong. But for now let’s go with them and later I’m going to revisit this point.
As in solving any probability problem, a good first step is to think about the sample space.
Specifying the sample space
Instead of using dates, for simplicity let’s label the birthdays with the numbers 1 through 365. For example, 1 is January 1st, 2 is January 2nd, 365 is December 31st, and so on. Also, let’s label the 23 people on the field with the letters ‘a’ through ‘w’.
A particular element of the sample space here is an assignment of a number between 1 and 365 to each of the 23 letters. For example:
Notice that in the above example all 23 birthdays are different. This is an arrangement that would mean a win for you. And here’s another arrangement that has two coincidences (which would mean your friend wins the bet):
Because of our independence and equal probability assumptions, every one of these possible arrangements has an equal probability. We can think of the sample space for this problem as consisting of N of these arrangements. Of those, C have at least one coincidence and N – C have no coincidences.
Therefore, the two probabilities we’re interested in are simply:
So, all we need to do is find the values of N and C and we’re done.
Calculating N (the total number of elements in the sample space) is the easiest of the two.
The number of elements in the sample space
Each of the 23 people can have 365 possible birthdays, right? Each of the 365 possibilities for the first person can be combined with the 365 possibilities for the second. And each of those pairs can be combined with 365 possibilities for the third person and so on. Well, this is a straightforward application of the rule of product which gives us:
That’s a pretty big number, isn’t it? The exact value is:
Well, at least we know we shouldn’t even think about trying to manually list and count C!
Counting sample space elements that satisfy either condition
Alright, after all this work, we managed to narrow down the problem to finding C, the number of elements in the sample space that have at least one birthday coincidence. But notice that, since we already know the value of N, we can also find C indirectly, by calculating N – C. Let’s define the variable to stand for the number of elements in the sample space in which there are zero birthday coincidences.
Therefore, calculating either C or D would solve our problem. But the question is, which of the two is easier to calculate?
Can we count the number of birthday arrangements having at least one coincidence?
Let’s try to calculate C directly. We need to count the possible birthday arrangements in which there’s at least one coincidence.
First, we need to count all pairs, triplets, quadruplets, etc. of people with potentially matching birthdays. If we put the 23 people in a set, this list of pairs, triplets, and so on would comprise this set’s power set minus the empty set and the singletons. In my post on combinatorics, I showed you that the power set of a set with n elements has elements. In our case, this means the number of pair/triplet/etc. arrangements is (subtracting 23 singletons and 1 empty set). And we need to find the count of each of those 8388584 elements (and sum these counts).
But we’re not done, since we also need to count double pairs (when there are two pairs of people with the same birthday), triple pairs, double triples, a pair and a quadruple… It’s a mess!
And the problem doesn’t even end there because, even if we do all this hard work, we run into the issue of overcounting. For example, when counting pairs, we’re implicitly also counting pair combinations, etc. So, we would also need to find a clever way of subtracting all arrangements which we’ve counted more than once in the above process.
Bottom line is, there is no straightforward formula that we can use to count the number of birthday arrangements with at least one coincidence.
Counting the number of birthday arrangements with no coincidence
Seems like trying to calculate C directly isn’t a very fruitful approach. So let’s try to calculate D instead. That is, let’s simply count the number of birthday arrangements with no coincidences. As you’re about to see, this is going to be much easier.
Let’s think about it. We can allow the first person (‘a’) to have any of the 365 possible birthdays, right? The second person (‘b’), on the other hand, can still have any of the possibilities, except whatever a’s birthday is. This means that there are pairs of birthday combinations for the first two people where their birthdays are different.
Then, we need to assign a birthday to the third person (‘c’), with the only constraint that their birthday shouldn’t match either of the birthdays of a or b, which means there are 363 possibilities. So, the number of different birthdays for the first 3 people is:
Well, extending this logic to all 23 people, the number of possible arrangements with no birthday coincidences is:
If you read my post on combinatorics, you probably immediately recognize this as nothing but a partial permutation! Another term for a partial permutation is the K-permutation of N, which in our case translates to finding the number of 23-permutations of 365. Remember the general formula:
Applying this formula, we get:
If you’re curious, the exact value here is also a huge number:
This is it, we officially found the last unknown value we needed! Now let’s put all these pieces together and finally answer the question.
The final answer
Let’s recap what we did so far. We were looking for the value of ANC(ALOC), which is the amount you need to pay your friend if it turns out there’s at least one birthday coincidence. We previously derived the following equation:
And we needed to find the values of P(ALOC) and P(NC), in order to solve for ANC(ALOC). We derived the following expressions for these probabilities:
We couldn’t calculate C directly, but we managed to easily calculate N and N – C:
And from these two, we can also solve for C itself:
Which means we can now calculate the needed probabilities:
With a little bit of algebraic manipulation, we can simplify these to:
And these are approximately equal to:
Plugging these approximations into the expected value equation and solving for ANC(ALOC), we get:
Well, there it is. We know that the amount you’ll need to pay your friend if there’s at least one birthday coincidence is $9.72.
By the way, are you surprised that the probability P(ALOC) is so high for only 23 people? Now that we’ve answered the main question, let’s explore the whole problem in a little more depth and see why this probability is what it is, even if it seems counterintuitive.
The general solution to the birthday problem
The final probability for having no birthday coincidences we calculated was:
Where 365 is the number of possible birthdays and 23 is the number of people. And because P(NC) and P(ALOC) are probabilities of complementary events, we also saw we can represent the probability of at least one coincidence as:
Notice that the number 23, which represents the number of people, appears twice in this formula. If we had started with a different number of people, all steps for deriving this probability would have been the same. Which means the final formula would look identical, except 23 would be a different number.
Let’s denote the number of people with the variable k. Then, by replacing 23 with k, we can state a more general formula:
You can view the left-hand side as a conditional probability and read it as “the probability of at least one coincidence, given that the number of people is k”. Let’s plot this probability as a function of k:
Technically, this is a discrete sample space but I’m visualizing the probability mass function as a curve (instead of bars) for less clutter.
Notice the 0.5 probability is around the 23rd person, like we calculated in the previous section. Do you see how this probability goes to almost 1 even for only 60 people? This basically means that if you have a group of 60 or more people, you can be almost certain there will be at least one birthday coincidence.
And here’s a plot of the complementary probability of having no birthday coincidences:
You see that it’s basically a flipped version of the other probability (as expected).
For more than 60 people, the probability of at least one birthday coincidence keeps increasing but doesn’t quite reach 1 until the number exceeds 365. However, these differences are practically insignificant since they are all probabilities between 0.99 and 1 (pretty high).
The pigeonhole principle
For 365 people, the exact probability of having at least one coincidence is so small that most computer applications won’t even be able to calculate it with full precision. Let me just say that it’s a number like 0.999… with more than 150 9’s! However, as unlikely as it is, in principle it’s still possible to have no coincidences with a group of 365 people.
Now, is it also possible with 366 people? Well, not anymore. Even if everybody in a group of 365 people had a different birthday, the moment you try to add a new person to the group, their birthday will necessarily match one of the existing ones, simply because there are no more available dates in the year.
This is a general principle in combinatorics called the pigeonhole principle. Roughly, it states that if you have to distribute a certain number of objects into a number of boxes and there are more objects than boxes, at least one box must contain more than one object.
For example, if you have 3 empty boxes and 4 marbles, you can put the first 3 marbles in each of the boxes, but the 4th will need to go into a non-empty box:
For the birthday problem, you can think of the 365 possible birthdays as the boxes, and the people as the objects that need to be distributed across them. A coincidence here means more than one person ending up in the same box.
Notice that even the formula breaks down if we try to use it with k > 365. For example, for 366 it becomes:
And the factorial of a negative integer isn’t even defined. Bottom line is that, for the probability is given by the above formula, otherwise it’s equal to 1.
Why are birthday coincidence probabilities so high?
We answered the original problem and came up with a general formula for calculating the probability of birthday coincidences. But still, something feels unexplained. Why is it so likely to have a birthday coincidence even with a small group of people, given that there are so many birthday possibilities?
The answer lies in the issue we hit when trying to directly count the number of birthday arrangements with at least one coincidence. Namely, there are too many ways to satisfy the ALOC requirement. Here’s some of the many ways in which you can get arrangements with one or more coincidences (again using the convention for representing days of the year as the numbers 1 through 365):
- 2, 2, 17, 184…
- 4, 13, 13, 13, 25, 25, 201…
- 75, 75, 84, 84, 9, 9, 9, 9…
- 17, 5, 5, 5, 5, 5, 5, 19…
- 8, 8, 11, 11, 14, 14, 1, 1, 1, 3, 7…
When we look at the problem from this perspective, it actually becomes quite intuitive. The more available “slots” for birthdays you have, the more difficult it becomes to not have at least one coincidence.
The reason the result seems counterintuitive at first is probably related to psychological factors. When they hear this question for the first time, most people (including myself) automatically try to answer it by thinking about how likely it is that someone in the group would have the same birthday as them, which is indeed significantly less likely (as an exercise, try to find this probability yourself).
But when we don’t care about which particular people have the birthday coincidence, suddenly the possibilities become too many. If you think about it, throughout your life you too have probably come across plenty of examples of at least two people sharing a birthday among your classmates, coworkers, and so on.
Drawing k independent samples from a discrete uniform distribution
We derived a particular expression for the probability of at least one birth coincidence with a group of 23 people:
Then, we generalized this to any number of people by substituting the number 23 in the expression with the variable k:
The number 365 is still fixed, since we’re only concerned with birthday coincidences. But what if we replaced this number with a variable as well? Let’s call this variable n:
What does this general formula for “At Least One Coincidence” represent?
Remember, in the beginning I said that we’re assuming the birthdays assigned to the 23 people are drawn from a discrete uniform distribution with parameter (that is, we treat it as a random variable with n equally possible outcomes). So, this more general formula above simply represents the probability of at least one coincidence after k independent samples from a uniform distribution with parameter n.
Now, with this formula you can solve a whole class of different problems that involve drawing independent samples from a uniform distribution. For example, you can calculate the probability of k people having been born during the same season (where n will be 4) or during the same day of the week (where n will be 7). Outside the domain of birthdays, you can calculate things like the probability of getting at least one coincidence after rolling k dice (n = 6) or the probability of k people having the same last 2 digits in their phone numbers (n = 100).
What if the simplifying assumptions are violated?
In answering the birthday question, we made a few assumptions. Namely, that each of the possible birthdays has the same probability and that birthdays are independent of each other. We also assumed there were 365 possible birthdays (ignoring 29th of February during leap years).
In reality, these assumptions aren’t true. For example, depending on the region, more people are born during certain months of the year than during other months. Also, if the group of people contains siblings (and especially twins), their birthdays might not be independent. What do you think is the effect of relaxing some of these assumptions?
Relaxing these assumptions increases the probability of birthday coincidences
Including the 29th of February as a possibility decreases the ALOC probability but the effect is very small. That’s why people usually ignore this because it adds unnecessary complications to the formula without gaining almost anything in accuracy. On the other hand, relaxing the independence and uniform distribution assumptions both increase the ALOC probability.
The independence assumption means that the birthday probability of one person doesn’t depend on what the birthdays of the remaining people are. If, on the other hand, there were some positive dependence between the birthdays, the chances of a match would naturally increase. For example, if there were siblings in the group, the probability of them having the same birthday is usually somewhat higher because they could be twins. And even if they’re not twins, their parents may have a preference to have children around the same time of the year. Therefore, relaxing the independence assumption typically increases the probability of at least one birthday coincidence.
Similarly, if the birthdays don’t have a uniform distribution, this effectively reduces the number of birthday options (even if it doesn’t eliminate any particular birthdays). For example, if we knew that many more children were born during summer months, that would mean there’s a higher concentration of summer birthdays in the general population. Consequently, on average there would be a higher concentration of summer birthdays in samples of the population. And if there are more summer birthdays in a sample, to have a birthday coincidence naturally becomes more likely.
By the way, if you’re curious to read more about how birthdays are actually distributed across the world, check out this awesome article.
Bottom line is that, the simplifying assumptions we started with give us a lower bound for the probability of at least one birthday coincidence. In reality, the probabilities are even higher (though their calculation becomes more complicated and that’s why people usually go with the lower bound estimate).
We first looked at the concept of a fair bet in terms of expected values. A fair bet between two people is one in which they both have equal expected values.
For our particular question, we derived the following equality:
- ALOC = the “At Least One Coincidence” outcome
- NC = the “No Coincidences” outcome
- P(ALOC) = probability of ALOC
- P(NC) = probability of NC
- ANC(ALOC) = the amount paid by the player betting on NC if ALOC occurs (the value we need to answer the question)
Finding P(ALOC) and P(NC) leaves ANC(ALOC) as the only unknown in the equation. But finding either of these probabilities is enough because ALOC and NC are complementary events and their probabilities have the relationship:
I showed you why calculating P(ALOC) directly is very difficult and that’s why we calculated P(NC) instead:
And from the two relationships above, we also got a value for P(ALOC):
And from these two we got the final answer:
Then, we derived a general formula for the birthday problem for any number of people, represented by the variable k:
And finally, we derived a general formula for the probability of at least one coincidence not just for the birthday problem, but for any problem involving drawing independent samples from a discrete uniform distribution with parameter n:
I also explained that this formula is only valid for and that P(ALOC) is equal to 1 for all . This comes from the so-called pigeonhole principle.
I also gave you some intuition for why the coincidence probabilities are so high even for a small group of people, despite there being 365 possibilities and that these probabilities are, in fact, lower bounds under the simplifying assumptions of independent and equal probabilities (in reality these probabilities are even higher).
Hopefully you found this post useful. In my next post, I’m going to show you how to calculate the same probabilities using an entirely different approach: a computer simulation with Python!