Imagine you obtained some data from a particular collection of *things*. It could be the heights of individuals within a group of people, the weights of cats in a clowder, the number of petals in a bouquet of flowers, and so on.

Such collections are called samples and you can use the obtained data in two ways. The most straightforward thing you can do is give a detailed description of the sample. For example, you can calculate some of its useful properties:

- The average of the sample
- The spread of the sample (how much individual data points differ from each other), also known as its *variance*
- The number or percentage of individuals who score above or below some constant (for example, the number of people whose height is above 180 cm)
- Etc.

You only use these quantities to summarize the sample. And the discipline that deals with such calculations is **descriptive statistics**.

But what if you wanted to learn something more general than just the properties of the sample? What if you wanted to find a pattern that doesn’t just hold for this particular sample, but also for the population from which you took the sample? The branch of statistics that deals with such generalizations is **inferential statistics** and is the main focus of this post.

The two general “philosophies” in inferential statistics are **frequentist inference** and **Bayesian inference**. I’m going to highlight the main differences between them — in the types of questions they formulate, as well as in the way they go about answering them.

But first, let’s start with a brief introduction to inferential statistics.

# Inferential statistics

Say you wanted to find the average height difference between all adult men and women in the world. Your first idea is to simply measure it directly. The current world population is about 7.13 billion, of which 4.3 billion are adults. Would you measure the individual heights of 4.3 billion people? I didn’t think so. It’s impractical, to say the least.

A more realistic plan is to settle with an estimate of the real difference. So, you collect **samples** of adult men and women from different subpopulations across the world and try to infer the average height of all men and all women from them.

And this is how the term *inferential* statistics gets its name. You have a population which is too large to study fully, so you use statistical techniques to estimate its properties from samples taken from that population.

## The interocular traumatic test

In special cases, you might simply want to know whether a pattern or a difference exists at all. You don’t have to care about the specifics like the exact magnitude of a difference between two groups. In those cases, the simplest inference technique you can use is sometimes jokingly called the **interocular traumatic test (IOTT)**. You apply this test when the pattern is so obvious that it hits you right between your eyes!

For example, if you’re comparing the annual salary differences between company CEOs and company janitors, you won’t need to be that skilled in statistics to conclude that there is a big gap between the two.

As you can imagine, the IOTT has very limited applicability in the real world. Many differences are much too subtle to detect in such direct ways. Not to mention that most interesting patterns aren’t answerable by simple “yes/no” questions. People have developed many statistical techniques to deal with these complex cases.

## The 3 goals of inferential statistics

In inferential statistics, we try to infer something about a population from data coming from a sample taken from it. But what exactly is it that we’re trying to infer?

All methods in inferential statistics aim to achieve one of the following 3 goals.

### Parameter estimation

In the context of probability distributions, a parameter is some (often unknown) constant that determines the properties of the distribution.

For example, the parameters of a normal distribution are its *mean* and its *standard deviation*. The mean determines the value around which the “bell curve” is centered and the standard deviation determines its width. So, if you know that the data has a normal distribution, parameter estimation would amount to trying to learn the true values of its mean and standard deviation.
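To make the role of these two parameters concrete, here's a quick Python sketch (the helper function and the values plugged in are just for illustration):

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The mean locates the peak of the bell curve; the standard deviation sets its width.
peak = normal_pdf(160, mu=160, sigma=8)     # density is highest at the mean
two_sd = normal_pdf(176, mu=160, sigma=8)   # two standard deviations away: much lower
print(peak > two_sd)  # → True
```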

### Data prediction

For this goal, you usually need to have already estimated certain parameters. You use them to predict future data.

For example, after measuring the heights of females in a sample, you can estimate the mean and standard deviation of the distribution for all adult females. Then you can use these values to predict the probability of a randomly chosen female to have a height within a certain range of values.

Assume that the mean you estimated is around 160 cm. Also assume that you estimated the standard deviation to be around 8 cm. Then, if you randomly pick an adult female from the population, you can expect her to have a height within the range of 152 – 168 cm (1 standard deviation from the mean). Heights that deviate more from the mean (e.g., 146 cm, 188 cm, or 193 cm) are increasingly less likely.
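Under these assumptions, such predictions come straight from the normal distribution's cumulative distribution function. Here's a sketch computing the probability of a height between 152 and 168 cm (the 160 cm mean and 8 cm standard deviation are the estimates from the example):

```python
import math

def normal_cdf(x, mu, sigma):
    # Probability that a normal(mu, sigma) variable is <= x
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 160, 8  # estimated mean and standard deviation (cm)
p = normal_cdf(168, mu, sigma) - normal_cdf(152, mu, sigma)
print(round(p, 3))  # → 0.683
```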

### Model comparison

I am briefly mentioning this goal without going into details because it is a somewhat more advanced topic which I will cover in other posts (for starters, check out my post on Bayesian belief networks — a method for belief propagation that naturally allows model comparison).

In short, model comparison is the process of selecting a statistical model from 2 or more models, which best explains the observed data. A model is basically a set of postulates about the process that generates the data. For example, a model can postulate that the height of an adult is determined by factors like:

- Biological sex
- Genes
- Nutrition
- Physical exercise

A statistical model would postulate a specific relationship between these factors and the data to be explained. For example, that genetic components influence height more than physical exercise. Two models may postulate different strengths with which each factor influences the data, a particular interaction between the factors, and so on. Then, the model that can accommodate the observed data best would be considered most accurate.
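As a toy sketch of what "accommodating the data best" can mean, here are two made-up models that postulate different factor strengths, compared by their squared error on simulated data (the weights and the data-generating process are pure assumptions for illustration):

```python
import random

random.seed(3)

# Simulated data: height driven mostly by "genes", a little by "exercise"
genes = [random.gauss(0, 1) for _ in range(200)]
exercise = [random.gauss(0, 1) for _ in range(200)]
height = [170 + 6 * g + 1 * e + random.gauss(0, 2)
          for g, e in zip(genes, exercise)]

def sum_sq_error(weight_genes, weight_exercise):
    # Total squared error of a model postulating specific factor strengths
    return sum((h - (170 + weight_genes * g + weight_exercise * e)) ** 2
               for h, g, e in zip(height, genes, exercise))

# Model A postulates genes matter more; Model B postulates exercise matters more
print(sum_sq_error(6, 1) < sum_sq_error(1, 6))  # → True: Model A fits better
```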

# Frequentist and Bayesian frameworks — the comparison

The differences between the two frameworks come from the way the concept of *probability* itself is interpreted.

## Overview of frequentist and Bayesian definitions of probability

In an earlier post, I introduced the 4 main definitions of probability:

- Long-term frequencies
- Physical tendencies/propensities
- Degrees of belief
- Degrees of logical support

Frequentist inference is based on the first definition, whereas Bayesian inference is rooted in definitions 3 and 4.

In short, according to the frequentist definition of probability, only repeatable random events (like the result of flipping a coin) have probabilities. These probabilities are equal to the long-term frequency of occurrence of the events in question. Frequentists don’t attach probabilities to hypotheses or to any fixed but unknown values in general. This is a very important point that you should carefully examine. Ignoring it often leads to misinterpretations of frequentist analyses.

In contrast, Bayesians view probabilities as a more general concept. As a Bayesian, you can use probabilities to represent the uncertainty in any event or hypothesis. Here, it’s perfectly acceptable to assign probabilities to non-repeatable events, such as Hillary Clinton winning the US presidential race in 2016. Orthodox frequentists would claim that such probabilities don’t make sense because the event is not repeatable. That is, you can’t run the election cycle an infinite number of times and calculate the proportion of them that Hillary Clinton won.

For more background on the different definitions of probability, I encourage you to read the post I linked to above.

## Parameter estimation and data prediction

Consider the following example. We want to estimate the average height of adult females. First, we assume that height has a normal distribution. Second, we assume that the standard deviation is available and we don’t need to estimate it. Therefore, the only thing we need to estimate is the mean of the distribution.

### The frequentist way

How would a frequentist approach this problem? Well, they would reason as follows:

I don’t know what the mean female height is. However, I know that its value is fixed (not a random one). Therefore, I cannot assign probabilities to the mean being equal to a certain value, or being less than/greater than some other value. The most I can do is collect data from a sample of the population and estimate its mean as the value which is most consistent with the data.

The value mentioned in the end is known as the **maximum likelihood estimate**. It depends on the distribution of the data and I won’t go into details on its calculation. However, for normally distributed data, it’s quite straightforward: the maximum likelihood estimate of the population mean is equal to the sample mean.
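One way to see this without the calculus: evaluate the likelihood over a grid of candidate means and check where it peaks. A sketch with simulated data (the true values and grid are arbitrary choices):

```python
import random, statistics

random.seed(0)
sample = [random.gauss(160, 8) for _ in range(500)]

def log_likelihood(mu, data, sigma=8.0):
    # Log-likelihood of normally distributed data, up to an additive constant
    return sum(-((x - mu) ** 2) / (2 * sigma ** 2) for x in data)

# Evaluate the likelihood over a grid of candidate means...
grid = [150 + i * 0.01 for i in range(2001)]  # 150.00 .. 170.00
best = max(grid, key=lambda mu: log_likelihood(mu, sample))

# ...and confirm it peaks at (approximately) the sample mean
print(round(best, 2), round(statistics.mean(sample), 2))
```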

### The Bayesian way

A Bayesian, on the other hand, would reason differently:

I agree that the mean is a fixed and unknown value, but I see no problem in representing the uncertainty probabilistically. I will do so by defining a probability distribution over the possible values of the mean and use sample data to update the distribution.

In a Bayesian setting, the newly collected data makes the probability distribution over the parameter narrower, concentrating it around the parameter's true (unknown) value. You update the entire distribution by applying Bayes' theorem to each possible value of the parameter:

P(parameter | data) = P(data | parameter) × P(parameter) / P(data)
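To make this concrete, here's a minimal grid-based sketch of the updating process (the five measurements, the known standard deviation, and the flat prior are all made-up assumptions):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

sigma = 8.0                                   # assumed known standard deviation
grid = [140 + i * 0.1 for i in range(401)]    # candidate values for the mean
posterior = [1.0 / len(grid)] * len(grid)     # start from a flat prior

data = [158.0, 162.0, 166.0, 159.0, 161.0]    # hypothetical height measurements

for x in data:
    # Bayes' theorem per grid point: posterior ∝ likelihood × prior
    posterior = [normal_pdf(x, mu, sigma) * p for mu, p in zip(grid, posterior)]
    total = sum(posterior)
    posterior = [p / total for p in posterior]  # normalize (divide by P(data))

# With each data point, the distribution tightens around the sample mean
best = grid[posterior.index(max(posterior))]
print(round(best, 1))  # → 161.2
```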

If you aren’t familiar with Bayes’ theorem, take a look at my introductory post, as well as this post. They will give you some intuition about the theorem and its derivation. And if you really want to see the use of Bayes’ theorem in action, this post is for you. There, I demonstrated the estimation of the bias of a coin by updating the full probability distribution after each coin flip.

Frequentists’ main objection to the Bayesian approach is the use of prior probabilities. Their criticism is that there is always a subjective element in assigning them. Paradoxically, Bayesians consider not using prior probabilities one of the biggest weaknesses of the frequentist approach.

Although this isn't a debate that can be settled with complete certainty one way or the other, the truth isn't simply somewhere in the middle either. In the future, I'm going to write a post that discusses the mathematical and practical consequences of using or not using prior probabilities.

### Data prediction

Here, the difference between frequentist and Bayesian approaches is analogous to their difference in parameter estimation. Again, frequentists don’t assign probabilities to possible parameter values and they use (maximum likelihood) point estimates of unknown parameters to predict new data points.

Bayesians, on the other hand, have a full posterior distribution over the possible parameter values which allows them to take into account the uncertainty in the estimate by integrating the full posterior distribution, instead of basing the prediction just on the most likely value.
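A rough sketch of the difference (the posterior here is just an assumed normal distribution over the mean, not one derived from data):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

sigma = 8.0                                  # known data standard deviation
grid = [150 + i * 0.1 for i in range(201)]   # candidate values for the mean

# A hypothetical posterior over the mean: centered at 160 with some spread left
posterior = [normal_pdf(mu, 160.0, 2.0) for mu in grid]
total = sum(posterior)
posterior = [p / total for p in posterior]

x_new = 170.0  # a new data point to predict

# Plug-in prediction: use only the single best estimate of the mean
plugin = normal_pdf(x_new, 160.0, sigma)

# Bayesian prediction: average the likelihood over the whole posterior
predictive = sum(normal_pdf(x_new, mu, sigma) * p for mu, p in zip(grid, posterior))

# The posterior-averaged prediction spreads probability more widely in the
# tails, reflecting the remaining uncertainty about the mean
print(predictive > plugin)  # → True
```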

### P-values and confidence intervals

Frequentists don't treat the uncertainty in the true parameter value probabilistically. However, that doesn't magically eliminate uncertainty. The maximum likelihood estimate could still be wrong and, in fact, most of the time it is wrong! When you assume that a particular estimate is the correct one, but in reality it isn't, you will make an error. Consequently, this has led to the development of two mathematical techniques for quantifying and limiting long-term error rates:

- null hypothesis significance testing (NHST) and the related concept of p-values
- confidence intervals

The general idea is to make an estimate, then assume something about the estimate only under certain conditions. You choose these conditions in a way that limits the long-term error rate by some number (usually 5% or lower).

I discussed the types of error rates NHST controls (as well as the correct interpretation of p-values) in a previous post on this topic. I strongly encourage you to read that post if you're not familiar with p-values. In fact, I encourage you to read it even if you are, given how frequently they are misinterpreted.

#### Confidence intervals

**Confidence intervals** are the frequentist way of doing parameter estimation that is more than a point estimate. The technical details behind constructing confidence intervals are beyond the scope of this post, but I’m going to give the general intuition.

Put yourself in the shoes of a person who's trying to estimate some mean value (the average height in a population, the average IQ difference between two groups, and so on). As usual, you start by collecting sample data from the population. The next step is where the "magic" happens: you apply a standard procedure for calculating an interval of values (whose technical details I'm omitting here).

You determine the whole procedure, including the sample size, before collecting any data. And you choose the procedure with a particular goal in mind. If you, *hypothetically*, repeat the procedure a large number of times, the confidence interval should contain the true mean with a particular probability. In statistics, commonly used ones are the 95% and the 99% confidence intervals.

If you choose a population with a fixed mean, collect sample data, and finally calculate the 95% confidence interval, 95% of the time the interval you calculated will cover the true mean.

Once you've calculated a confidence interval, it's incorrect to say that it covers the true mean with a probability of 95% (this is a common misinterpretation). You can only say in advance that, in the long run, 95% of the confidence intervals you generate by following the same procedure will cover the true mean.
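This long-run reading is easy to check by simulation. A sketch (the true mean, sample size, and number of trials are arbitrary choices, and the standard deviation is assumed known for simplicity):

```python
import random, statistics

random.seed(1)
true_mu, sigma, n = 160.0, 8.0, 25
z = 1.96  # two-sided 95% critical value when the standard deviation is known

covered = 0
trials = 10000
for _ in range(trials):
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    half = z * sigma / n ** 0.5
    # Does this particular interval happen to cover the fixed, true mean?
    if m - half <= true_mu <= m + half:
        covered += 1

print(covered / trials)  # close to 0.95 in the long run
```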

##### A graphical illustration

Click on the image below to start a short animation that illustrates the procedure I described above:


Somewhere on the real number line, we have a hypothetical mean (denoted by the letter 'm'). We generate 20 consecutive 95% confidence intervals. Two of them happen to miss the mean, and 18 happen to cover it. That gives 18/20 = 90%.

Why did only 90% of the confidence intervals cover the mean, and not 95%? The answer, of course, is that the process is inherently probabilistic and there is no guarantee that, of any fixed number of confidence intervals, exactly 95% will cover the true mean. However, as the number of generated intervals increases, the percentage that cover the mean will get closer and closer to 95%.

By the way, notice how individual confidence intervals don’t have the same width and they are all centered around different values.

# Summary

In this post, I introduced the distinction between descriptive and inferential statistics and explained the 3 goals of the latter:

- Parameter estimation
- Data prediction
- Model comparison

I ignored the last goal and mostly focused on the first.

I showed that the difference between frequentist and Bayesian approaches has its roots in the different ways the two define the concept of probability. Frequentist statistics only treats random events probabilistically and doesn’t quantify the uncertainty in fixed but unknown values (such as the uncertainty in the true values of parameters). Bayesian statistics, on the other hand, defines probability distributions over possible values of a parameter which can then be used for other purposes.

Finally, I showed that, in the absence of probabilistic treatment of parameters, frequentists handle uncertainty by limiting the long-term error rates, either by comparing the estimated parameter against a null value (NHST), or by calculating confidence intervals.

Georgi Georgiev says

As noted in previous posts as well, only referring to long-run error rates is fallacious representation of both the founders and the modern proponents of what can be broadly called frequentist inference. Regardless, I expected more of a battle in this post, given the title graphic, but the Bayesian inference was mostly missing from the text…

Looking forward to this: “In the future, I’m going to write a post that discusses the mathematical and practical consequences of using or not using prior probabilities.”

The Cthaeh says

Yeah, I didn’t really go into a simulated debate between the two frameworks, despite what might be suggested by the featured image. I mostly wanted to outline the differences, without expressing any preferences.

In future posts, however… 🙂

Kolja says

“Confidence intervals are the frequentist way of doing parameter estimation…”

That's actually the Bayesian approach, or am I wrong?

The Cthaeh says

In the Bayesian framework some people use the so-called ‘credibility intervals’. They sound very similar to ‘confidence intervals’ but are actually mathematically very different.

You may also come across them under the name ‘highest density intervals’.

With the Bayesian credibility intervals you can actually make the statement “the probability that the true value of the parameter is within these boundaries”, of course subject to certain assumptions. This is not possible with Frequentist confidence intervals (in fact, it’s a common misinterpretation).

Esha Deshpande says

Can you help me understand this part of the post

“Bayesian, on the other hand, have a full posterior distribution over the possible parameter values which allows them to take into account the uncertainty in the estimate by integrating the full posterior distribution, instead of basing the prediction just on the most likely value”

The Cthaeh says

Hi, Esha. This is a tricky topic and when I first encountered it it took me some time to get my head around it as well.

This basically boils down to Frequentists refusing to assign probabilities to parameters with unknown but fixed values (take a look at the Parameter estimation and data prediction section above to see what I’m talking about). Whereas Bayesians, having no (philosophical) problems assigning such probabilities, can obtain an entire probability distribution over the possible parameter values.

Say you’re playing a game in which you draw a coin from a bag full of coins, all with a random bias towards “heads”. The bias is a number between 0 and 1 which defines the probability of flipping “heads” with the respective coin (a bias of 0.4 means there is 0.4 probability that you will flip “heads”).

You’re allowed to flip the coin 1000 times, then place a bet on either “heads” or “tails”, and flip the coin one more time. If you correctly guess the 1001st flip, you double your money, otherwise you lose it.

What you will likely do is record the results of the first 1000 flips and try to make a guess about the coin’s bias. If your estimate of the bias is greater than 0.5, you will bet in favor of “heads” (and in favor of “tails” otherwise).

The part where you are estimating the bias is called parameter estimation. Bayesian and Frequentist approaches to parameter estimation differ not only in terms of the specific techniques they use but, more importantly, in the end result of the estimation. Since Bayesians will have no problem assigning probabilities to the possible values of the coin’s bias, they can obtain an entire probability distribution over the possible values. And therefore, they can answer questions like “what is the probability that the coin’s bias is 0.5?”. Frequentists don’t do that. Instead, their estimate of the bias will be a single number and they will have no measure of the uncertainty around that value.

For example, the maximum likelihood estimate will be 0.4 both if you flip 4 heads out of 10 flips and if you flip 400000 heads out of a million flips. But you will obviously be much more confident in your estimate in the latter case, which will be naturally reflected in the posterior probability distribution (which will be much more tightly centered around the mean).
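Here's a quick sketch of that point, using the standard deviation of the Beta posterior you get under a uniform prior on the bias (the helper function is just for illustration):

```python
import math

def beta_posterior_sd(heads, tails):
    # Standard deviation of the Beta(heads+1, tails+1) posterior over the
    # coin's bias, assuming a uniform prior
    a, b = heads + 1, tails + 1
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Same maximum likelihood estimate (0.4) in both cases, very different certainty:
print(beta_posterior_sd(4, 6))             # ~0.137: still very uncertain
print(beta_posterior_sd(400_000, 600_000)) # ~0.0005: extremely confident
```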

Frequentists can use confidence intervals to partially solve this problem. When the number of flips is higher, the range of values covered by the interval is going to be much tighter. But, within that confidence interval, you will still have no measure of ‘likelihood’ for the different values. In other words, if your confidence interval is [0.3, 0.5], every value from that interval is an equally good candidate for being the actual value of the bias. When the number of flips is low, this can make a big difference.

As I said, this is a tricky topic and I am probably not covering everything to fully answer your question. Please feel free to ask for further clarification!

Samantha M says

I enjoyed this post very much. To be honest, my knowledge of statistics is limited to basic classes in college and some in grad school, and I still am learning. I never understood what the MLE was, and your explanations are clear and concise without omitting crucial details. The overall distinction between Bayesian and Frequentist approaches was clear to me as well.

Thank you, and I look forward to reading more.

The Cthaeh says

Thanks, Samantha, I’m glad you found the post interesting!

Daniel Toma says

Up above, I see the following comment:

“Assume that the mean you estimated is around 160 cm. Also assume that you estimated the standard deviation to be around 8 cm. Then, if you randomly pick an adult female from the population, you can expect her to have a height within the range of 152 – 168 cm (3 standard deviations from the mean).

Can you explain how the range of 152 – 168 cm is 3 standard deviations from the mean?

The Cthaeh says

Hi, Daniel.

I'm assuming you're referring to the property of normal distributions of having ~99.7% of the probability mass within 3 standard deviations of the mean. I wasn't really referring to that, but making the point that generally most values are "close" to the mean in terms of standard deviations. As you noticed, in my example 152 – 168 cm is only 1 standard deviation (which, if the data were normally distributed, would contain ~68% of the probability mass).

Thank you for your observation. Normal distributions are very important and I’ll most likely dedicate an entire post to them where I’ll also discuss the 68-95-99.7 rule you are referring to.

Joseph says

Hi. Thank you for this post. I have a question about this part:

” Once you’ve calculated a confidence interval, it’s incorrect to say that it covers the true mean with a probability of 95% (this is a common misinterpretation). You can only say in advance that, in the long-run, 95% of the confidence intervals you’ve generated by following the same procedure will cover the true mean. ”

I don’t understand why you can’t say that the confidence interval covers the mean with a probability of 95%.

Since our confidence intervals work 95% of the time, as mentioned in the second sentence, then for this particular time the probability of our confidence interval to work is 95%, which means that we get the mean with a probability of 95%.

Thank you.

Joseph

The Cthaeh says

Hi Joseph, very good question. In fact, it’s a question that almost anybody who chooses to dig deep enough into Frequentist concepts encounters sooner or later.

I know what I wrote sounds counter-intuitive. It used to be counter-intuitive for me for a long time. But stay with me and I promise, once it clicks, you will have a small feeling of having just been taken outside of the Matrix.

“For this particular time the probability of our confidence interval to work is 95%, which means that we get the mean with a probability of 95%.”

Let's assume your statement is true. The first problem is something I already pointed to earlier in the post. Namely, in Frequentist statistics you don't assign probabilities to fixed but unknown values. Granted, this is a philosophical, not mathematical problem, but this view is at the root of the Frequentist approach and was passionately defended by founders like Fisher, Pearson, and Neyman.

But okay, you can say you don’t have the same philosophical reservations and want to treat the mean like a random variable. Can you then say the probability of the true mean being within your just-calculated confidence interval is 95%? The answer is still no, and this time the problem is also mathematical.

But even without mathematics, it’s rather easy to show why. Say you have a sample of 20 students (out of 1000 students in a particular school) and you measure their height. You calculate the mean and the standard deviation and calculate the 95% confidence interval, which happens to be [160, 180] (in cm). Then you conclude that there’s a 95% probability the real mean of *all* students is between 160 cm and 180 cm. Good.

Then you draw another random sample of 20 students from the same school and calculate a confidence interval of [165, 178]. Then you draw a third sample and this time you calculate a confidence interval of [150, 167] (just happened to pick really short people).

Clearly, you can’t say that there’s a 95% probability that each of those confidence intervals will hold the true mean, as that would be a contradiction. And none of them is any more special than the rest, so you have no basis to choose any and assume the true mean is within it with 95% probability.

Does that make sense?

The Cthaeh says

For more intuition, check out my post about p-values (and more specifically, the section

P<0.05. So what?), where I also talk about the second most common misinterpretation in Frequentist statistics. Namely, “if p<0.05, does that mean the probability the null hypothesis is true is less than 5%?".