If you want to take your understanding of probabilities to the next level, it’s crucial to be familiar with the concept of a **probability distribution**.

In short, a probability distribution is an __assignment of probabilities or probability densities__ to __all possible outcomes__ of a __random variable__.

For example, take the random process of flipping a regular coin. The outcome of each flip is a random variable with a probability distribution:

- P(“Heads”) = 0.5
- P(“Tails”) = 0.5

Depending on the type of random variable you’re working with, there are two general types of probability distributions: **discrete** and **continuous**. In this post, I’m going to give an overview of both kinds. And in follow-up posts I’m going to individually introduce specific frequently used probability distributions from each kind.

Let’s first get a quick reminder about sample spaces.

## The sample space of a random variable

The sample space of a random variable is a set that consists of all its possible outcomes.

If you’re not familiar with the distinction between **random variables** and **deterministic variables**, here’s a short definition.

An outcome is the simplest possible realization of the random variable. Meaning, for something to be an ‘outcome’, it shouldn’t be possible to break it into smaller outcomes.

In the die rolling example from the beginning, there are 6 possible outcomes: 1, 2, 3, 4, 5, and 6. Therefore, the sample space of the random variable is {{1}, {2}, {3}, {4}, {5}, {6}}. Simple enough.

Well, if I were to be nit-picky, I would also include the tiny possibility of a die landing on one of the edges, or even one of the vertices (or maybe even not landing at all). But for now let’s indeed keep things simple.

Depending on your interests, you could also define the outcomes of the sample space to be more complex events. You can form those by combining simple outcomes of the original sample space (if you’re not familiar with the concept of events in the context of probability theory, you can make a small detour and read my post on compound event probabilities). Then you would treat those events as atomic outcomes. That is, in the context of your sample space, they would be unbreakable.

Let’s see how.

### Alternative sample spaces

In general, an event is a combination of one or more simple outcomes of a random variable. Defining a sample space requires a set of non-overlapping events. That is, an elementary outcome should be a member of only one event. This will allow you to treat the complex events as elementary outcomes in the context of your sample space.

Here are a few ways to do that for our example random variable:

- Six events: {{1}, {2}, {3}, {4}, {5}, {6}}, where each event is a single outcome (the original sample space)
- Two events: {{1, 3, 5}, {2, 4, 6}}, where one of the events represents the odd number outcomes and the other represents the even number outcomes
- Three events: {{1, 2}, {3, 4}, {5, 6}}, where each event represents a pair of consecutive numbers
- One event: {{1, 2, 3, 4, 5, 6}}, which corresponds to the entire sample space. This event is guaranteed to occur for each realization of the random variable (hence, this choice of a sample space would technically make it a deterministic variable).

Say you take the set {{1, 3, 5}, {2, 4, 6}} to be your sample space. This choice implies that you only care if the outcome is an odd or an even number. So, even though the {1, 3, 5} element could technically be broken down to the elementary {1}, {3}, and {5}, this distinction would be irrelevant in the context of your sample space.

## From sample spaces to probability distributions

Once you have a sample space, you need to assign probabilities to each of its elements (the outcomes). Of course, this assignment must be consistent with the axioms of probability theory (Kolmogorov’s axioms).

The mathematical object that represents a probability distribution is called a **function**. Informally, a function is something that takes a value (the **input**), does some calculation, and returns another value (the **output**).

In the case of probability distributions, the input is a specific outcome and the output is its probability (or probability density).

Take a look at this graphical illustration of the probability distribution of a die roll:

Here, the box in the middle represents the function that maps the possible outcomes to their respective probabilities. Since we’re assuming a fair die where all outcomes are equally probable, they each have a probability of 1/6.

If you choose a different way of constructing the sample space, then the probability distribution will be different. For example, if you choose the outcomes of the sample space to be {{1, 2}, {3, 4}, {5, 6}}, then the function will look like this:

Notice that the sum of all probabilities is always equal to 1. This comes directly from probability axioms. Namely, the probability of the entire sample space must be equal to 1.

## Discrete and continuous variables

When we have a discrete random variable, the function representing its distribution is called a **probability mass function**. On the other hand, if the random variable is continuous, the function behind the distribution is called a **probability density function**.

I gave a detailed explanation of the distinction between probability masses and probability densities in my post discussing zero probabilities. In fact, if you’re hearing these terms for the first time (or otherwise have difficulties following the rest of this section), I strongly encourage you to take a look at that post.

To get a better understanding of probability distributions, it’s very helpful to visualize them. So, let’s take a look at the two types of probability distributions with some visual intuition.

### Discrete probability distributions

Discrete probability distributions are used for discrete random variables.

Intuitively, a variable is discrete if it has a set of distinct outcomes with “gaps” between them. That is, there are no possible outcomes that lie “between” any other 2 possible outcomes.

Bar plots are a natural way for visualizing discrete probability distributions. For example, here’s how you can represent the probability distribution of flipping a coin:

And here’s how you can represent the probability distribution of rolling a die:

Both of these examples are of discrete variables, since there are no defined values between any pair of outcomes. In other words, outcomes like 2.4 and 5.1 aren’t possible when rolling a die. Similarly, you don’t have “almost heads” or “some tails” in the sample space of flipping a coin.

You can think of the height of each bar in the plots above as representing how much of the total probability mass its corresponding outcome holds.

A probability mass simply refers to the traditional notion of probability. As always, the total probability mass is equal to 1.

### Continuous probability distributions

In a continuous random variable, between any 2 of its possible outcomes, you have infinitely many other possible outcomes. That is, there are no “gaps” between the outcomes.

With discrete variables, you can “discretely” distribute the total probability mass among each of the distinct outcomes. But because a continuous random variable has an infinite number of possible outcomes in any sub-interval of its domain, you can’t do the same here.

In short, the reason is that you would end up with a sample space whose total probability mass is infinity, because you would need to add up an (uncountably) infinite number of finite probabilities for each outcome.

For example, if your random variable is the height of a randomly picked pine tree from a specific forest, its possible values are all positive real numbers within certain boundaries. A tree’s height could be, say, 15 meters, or 15.2 meters. Or 15.48252192 meters. All (infinitely many) numbers between the “minimum” and the “maximum” heights are possible (whatever they may be).

Hence, each possible outcome doesn’t have a probability mass directly assigned to it (in fact, the probability of each exact outcome would be equal to 0).

Instead, outcomes are assigned a **probability density**. “Probability mass per unit length of sample space” is a somewhat informal way of defining probability densities. This is analogous to physical densities, defined as mass per unit of length/area/volume.

In the pine tree example, a probability density would be “probability mass per unit of height”.

Each point from the sample space has a probability density. Hence, the probability density function simply assigns a probability density to each possible outcome.

Curves are convenient for visualizing probability density functions. Here’s how you can represent the probability density function of the height of a pine tree:

#### Reading probability density functions

With probability mass functions, you can determine the probability of an outcome by simply looking at its value on the plot. But you can’t do that with probability density functions. In other words, the probability of a tree having a height of 15 m is not 0.4!

In fact, the probability of any specific value is 0. For continuous random variables, it only makes sense to talk about probabilities of intervals. That is, only intervals can have non-zero probabilities.

The probability of an interval is given by the area under the curve over the interval. For example, the probability of a tree having a height between 14.5 and 15.5 would be equal to the following shaded area:

What is this shaded area equal to? Well, it kind of looks like a rectangle with a curvy top. The length of its base is 1 and its average height seems to be somewhere around 0.37. So, the area is approximately 0.37 * 1 = 0.37.

#### Mathematically calculating the probability of an interval

Of course, in reality you shouldn’t be estimating this area visually, as there are mathematical ways to do that. Namely, it’s calculated through the mathematical operation called **integration**.

Some of you familiar with normal distributions (also simply known as “bell curves”) might recognize that the above curve follows a normal distribution with a mean of 15 and variance of 1. The definite integral between 14.5 and 15.5 (that is, the real area under the curve) of this distribution is actually about 0.383. This tells us that the probability of the height of a randomly chosen tree to be in the interval [14.5, 15.5] is about 0.383.

By the way, don’t worry if you’re not familiar with normal distributions or integrals. The main important point from this section is that probabilities of intervals are calculated by finding the area under the curve over them.

Also, don’t worry if the concepts related to continuous probability distributions I introduced in the past sections are completely new to you. If so, I would again strongly encourage you to read my post on zero probabilities. There I give some intuition about what probability densities are and why exact outcomes always have a probability of 0.

## Probability distribution intuition

Now that you have some familiarity with the basic concepts around probability distributions, let’s see where they actually come from.

In an earlier post I introduced **the law of large numbers**. There I told you it was one of the most fundamental theorems in probability theory. In another post, I used it to give some intuition for **expected value**, which is another fundamental probability theory concept.

And now I’m going to use it again, this time to give you some intuition about probability distributions.

If you’re not familiar with the law of large numbers, feel free to take a look at my related post. In short, this law is a mathematical theorem that links probabilities of outcomes to their long-term frequency of occurring.

### Relative frequency distributions and the law of large numbers

Let’s go back to the probability distribution of a single die roll:

Each outcome has a probability of 1/6. Now let’s generate some actual die rolls from this distribution.

I wrote a short code to simulate 1000 random rolls of a die and plotted the % of each outcome after each flip. This is called a **relative frequency distribution** because it gives the relative frequencies of the outcomes (the frequencies divided by the total number of rolls).

Click on the image below to see the animated simulation:

Notice something? The relative frequency distribution looks very similar to the probability distribution from the plot above it. That is, the relative frequency of each outcome is very close to its probability. This should strongly remind you of what the law of large number says, yes?

Well, let’s make the number of rolls even larger. Look at the relative frequency distribution after 20 000 rolls of the die:

This looks almost identical to the probability distribution of a single die roll, doesn’t it? You could probably guess that if I made a simulation of 1 million die rolls (or even more), the relative frequency distribution and the probability distribution would be even closer to each other.

The law of large number says that the relative frequency of an outcome approaches its probability as the number of repetitions approaches infinity. Then, a direct consequence of this statement is that the relative frequency distribution of a random variable will similarly approach its theoretical probability distribution.

In other words, the probability distribution of a random variable is nothing more than the theoretical limit of its relative frequency distribution as the number of realizations of the variable approaches infinity. Pretty neat!

#### The law of large number for continuous probability distributions

The simulations from the previous section were of a discrete random variable. But what about continuous variables?

As you can imagine, creating a relative frequency distribution by sampling from the probability distribution doesn’t quite work the same way here. If you simply took the relative frequencies of individual outcomes, you would end up with zeros almost everywhere. This is because the sample space of a continuous random variable is some interval on the real number line. Hence, the number of elements of the sample space isn’t even countable.

The way out of this is to again resort to intervals. If you split up the sample space into non-overlapping intervals, you can create a relative frequency distribution of those intervals themselves. You do this by counting the number of outcomes that fall within each interval and divide each by the total number of repetitions.

If you make the size of the intervals smaller, you also need to increase the number of intervals in order to cover the whole sample space.

**Histograms** are used for approximating probability distributions of continuous random variables by splitting the sample space into so-called **bins**. Bins are supposed to be non-overlapping and each bin covers some interval from the sample space. They are typically of equal size, but they don’t have to be. However, for simplicity, I will only consider bins of equal size in the examples below.

When you have generated a certain number of samples from the probability distribution, you put each sample in the bin it belongs to. Then you represent each bin as a rectangle whose width is the size of the bin and whose height is the count or relative frequency of the outcomes in the bin.

Let’s see how this works.

#### Approximating probability density functions with histograms

Let’s go back to the pine tree height example from the previous section. This was its probability density function:

I split up the sample space into 6 bins of size 1 m and generated 100 samples from the original distribution. Here’s what the corresponding histogram looks like:

The y-axis shows the percentage of samples that fall within each bin. I also plotted a curve showing the closest probability density function that fits the histogram. It’s a pretty rough approximation to the original distribution, but you can see the resemblance, right?

Now let’s see what happens when we increase the number of bins to 12:

The number of samples is still 100, but this looks like a much better approximation, no? Now let’s split the sample space into 60 bins:

At first glance, this looks like a worse approximation, but that’s because 100 samples isn’t enough to sufficiently sample all 60 bins. So, let’s also increase the number of samples to 10000:

Now this looks much better! The height of the rectangles hasn’t quite converged to their corresponding theoretical limits, but the differences are very small.

In general, you will get better and better approximations by increasing the number of samples and the number of bins.

To satisfy ourselves, let’s see what happens when we generate 1 million samples and split the sample space into 120 bins:

Well, that’s a great approximation. This histogram looks almost identical to the probability density function.

This section probably also gives you some intuition about why the probability of an interval of a continuous random variable is the area under the curve of its probability density function. When you approximate the function with a histogram, you would get the same probability by simply summing the areas of all the bins that cover that interval.

## Summary

In this post, I defined probability distributions for discrete and continuous random variables as an assignment of probabilities or probability densities to all possible outcomes of a variable.

I introduced discrete random variables as variables which have gaps between their possible outcomes. On the other hand, continuous random variables have infinitely many possible outcomes between any pair of outcomes.

Every random variable has a total probability mass equal to 1. The point is to distribute this mass among all possible outcomes in a way that satisfies Kolmogorov’s axioms of probability. With discrete random variables, each possible outcome directly takes some portion of the probability mass. On the other hand, with continuous random variables each specific outcome gets a probability density (“mass per unit of sample space”) assigned to it.

With a few plots and simulations, I tried to show you that one way of thinking about probability distributions is as the theoretical limit of the relative frequency distribution as you generate more and more values from the random variable.

This post is meant to be a starting point for understanding probability distributions and in future posts I intend to cover deeper and more complex concepts related to them.

For example, distributions can be characterized by a set of different parameters like mean, variance, and skewness, which determine things like the shape and location of the distribution. In some of the next posts I’m going to introduce specific kinds of discrete and continuous distributions and show the intuitive meaning of their parameters (as well as how to calculate them).

In my next post, I discuss calculating the mean and variance of probability distributions.

As always, feel free to leave any questions in the comments below. Also, please share if you had difficulty following any part of the post.

Until next time!

Avik Sarkar says

Great post as usual. Your blog has really helped me a lot in getting an intuitive idea about probability and statistics.

However, I have a little doubt here.

“As you can imagine, creating a relative frequency distribution by sampling from the probability distribution doesn’t quite work the same way here. If you simply took the relative frequencies of individual outcomes, you would end up with zeros almost everywhere. This is because the sample space of a continuous random variable is some interval on the real number line. Hence, the number of elements of the sample space isn’t even countable.”

The way I understood this is, suppose I take 1 million pine tree heights as sample. Now since finding two tree with exactly same height is impossible, so all of the data is unique and therefore has a frequency is 1. So the relative frequency of them are 1 / 10^6 and would appear as a few speckles. But as you said since there are uncountably infinite real numbers in the sample space so, most of the places have relative frequency as 0. But why is this model incorrect? Is it due to the random scattering of points in such a vast sample space, this model would provide no help in statistical analysis?

Thanks.

The Cthaeh says

Hi, Avik. Thank you for the nice words about my posts! Always happy to be helpful!

Yes, your understanding here is exactly right. To put it in slightly more general terms, if you generate N samples from a probability density function, you would end up with “probabilities” of 1/N for each sample you happened to generate and 0 for every other potential sample you could have generated.

I wouldn’t quite say this kind of a plot would be completely useless. The speckles that you end up generating will still be distributed more densely in places where the probability density function is high. So, visually it would still give you some idea about its shape.

It would be pretty useless in telling you anything about probabilities, however. To see why, imagine you generate N samples where N is, say, 10^6. Also, let’s say one of the samples you generated happened to be 14.3823. Now, if you attempt to conclude that the probability that a tree has a height of 14.3823 is (about) 1/10^6, how far away would you be from the true probability? Well, for starters, notice that the “probability” that was calculated with this method is completely dependent on the choice of N. If you had generated only 100 samples and again one of them was 14.3823, then the same analysis would force you to conclude that the probability of a tree having this height is (about) 1/100. A pretty big difference in the estimates!

If you took a tiny interval around 14.3823 (something like [14.382299999999, 14.382300000001] and generated a really large number of samples (in order to adequately sample such a small interval), you would get the real probability of having a tree with that height. Well, kind of. It would still be the probability of an interval, but it’s so small, it makes no practical difference.

Now, whether the 1/N “probability” is close to the real probability would be pretty much a matter of complete luck (dependent on whether you picked “the right” N).

Does this help?

Avik Sarkar says

Thanks for such a detailed response!

Yes, that does make a lot of sense.

Honestly speaking, I was kind of waiting for your posts on probability distributions. I am so glad that you started this series.

I’m looking forward to reading your posts on those specific distributions especially Poisson’s, Exponential and Normal Distribution.