What is statistics about?
Well, imagine you obtained some data from a particular collection of things. It could be the heights of individuals within a group of people, the weights of cats in a clowder, the number of petals in a bouquet of flowers, and so on.
Such collections are called samples and you can use the obtained data in two ways. The most straightforward thing you can do is give a detailed description of the sample. For example, you can calculate some of its useful properties:
- The average of the sample
- The spread of the sample (how much individual data points differ from each other), also known as its variance
- The number or percentage of individuals who score above or below some constant (for example, the number of people whose height is above 180 cm)
You only use these quantities to summarize the sample. And the discipline that deals with such calculations is descriptive statistics.
But what if you wanted to learn something more general than just the properties of the sample? What if you wanted to find a pattern that doesn’t just hold for this particular sample, but also for the population from which you took the sample? The branch of statistics that deals with such generalizations is inferential statistics and is the main focus of this post.
The two general “philosophies” in inferential statistics are frequentist inference and Bayesian inference. I’m going to highlight the main differences between them — in the types of questions they formulate, as well as in the way they go about answering them.
But first, let’s start with a brief introduction to inferential statistics.
Say you wanted to find the average height difference between all adult men and women in the world. Your first idea is to simply measure it directly. The current world population is about 7.13 billion, of which 4.3 billion are adults. Would you measure the individual heights of 4.3 billion people? I didn’t think so. It’s impractical, to say the least.
A more realistic plan is to settle with an estimate of the real difference. So, you collect samples of adult men and women from different subpopulations across the world and try to infer the average height of all men and all women from them.
And this is how the term inferential statistics gets its name. You have a population which is too large to study fully, so you use statistical techniques to estimate its properties from samples taken from that population.
If you want a little more background on the distinction between samples and populations, please read this section of my post related to probability distributions.
The interocular traumatic test
In special cases, you might simply want to know whether a pattern or a difference exists at all. You don’t have to care about the specifics like the exact magnitude of a difference between two groups. In those cases, the simplest inference technique you can use is sometimes jokingly called the interocular traumatic test (IOTT). You apply this test when the pattern is so obvious that it hits you right between your eyes!
For example, if you’re comparing the annual salary differences between company CEOs and company janitors, you won’t need to be that skilled in statistics to conclude that there is a big gap between the two.
As you can imagine, the IOTT has very limited applicability in the real world. Many differences are much too subtle to detect in such direct ways. Not to mention that most interesting patterns aren’t answerable by simple “yes/no” questions. People have developed many statistical techniques to deal with these complex cases.
The 3 goals of inferential statistics
In inferential statistics, we try to infer something about a population from data coming from a sample taken from it. But what exactly is it that we’re trying to infer?
All methods in inferential statistics aim to achieve one of the following 3 goals.
In the context of probability distributions, a parameter is some (often unknown) constant that determines the properties of the distribution.
For example, the parameters of a normal distribution are its mean and its standard deviation. The mean determines the value around which the “bell curve” is centered and the standard deviation determines its width. So, if you know that the data has a normal distribution, parameter estimation would amount to trying to learn the true values of its mean and standard deviation.
For this goal, you usually need to already have estimated certain parameters. Then you use them to predict future data.
For example, after measuring the heights of females in a sample, you can estimate the mean and standard deviation of the distribution for all adult females. Then you can use these values to predict the probability of a randomly chosen female to have a height within a certain range of values.
Assume that the mean you estimated is around 160 cm. Also assume that you estimated the standard deviation to be around 8 cm. Then, if you randomly pick an adult female from the population, you can expect her to have a height within the range of 152 – 168 cm (3 standard deviations from the mean). Heights that deviate more from the mean (e.g., 146 cm, 188 cm, or 193 cm) are increasingly less likely.
I am briefly mentioning this goal without going into details because it is a somewhat more advanced topic which I will cover in others posts (for starters, check out my post on Bayesian belief networks — a method for belief propagation that naturally allows model comparison).
In short, model comparison is the process of selecting a statistical model from 2 or more models, which best explains the observed data. A model is basically a set of postulates about the process that generates the data. For example, a model can postulate that the height of an adult is determined by factors like:
- Biological sex
- Physical exercise
A statistical model would postulate a specific relationship between these factors and the data to be explained. For example, that genetic components influence height more than physical exercise.
Two models may postulate different strengths with which each factor influences the data, a particular interaction between the factors, and so on. Then, the model that can accommodate the observed data best would be considered most accurate.
Frequentist and Bayesian statistics — the comparison
The differences between the two frameworks come from the way the concept of probability itself is interpreted.
Overview of frequentist and Bayesian definitions of probability
In an earlier post, I introduced the 4 main definitions of probability:
- Long-term frequencies
- Physical tendencies/propensities
- Degrees of belief
- Degrees of logical support
Frequentist inference is based on the first definition, whereas Bayesian inference is rooted in definitions 3 and 4.
In short, according to the frequentist definition of probability, only repeatable random events (like the result of flipping a coin) have probabilities. These probabilities are equal to the long-term frequency of occurrence of the events in question.
Frequentists don’t attach probabilities to hypotheses or to any fixed but unknown values in general. This is a very important point that you should carefully examine. Ignoring it often leads to misinterpretations of frequentist analyses.
In contrast, Bayesians view probabilities as a more general concept. As a Bayesian, you can use probabilities to represent the uncertainty in any event or hypothesis. Here, it’s perfectly acceptable to assign probabilities to non-repeatable events, such as Hillary Clinton winning the US presidential race in 2016. Orthodox frequentists would claim that such probabilities don’t make sense because the event is not repeatable. That is, you can’t run the election cycle an infinite number of times and calculate the proportion of them that Hillary Clinton won.
For more background on the different definitions of probability, I encourage you to read the post I linked above.
Parameter estimation and data prediction
For more background on this section, check out my introductory post on probability distributions.
Consider the following example. We want to estimate the average height of adult females. First, we assume that height has a normal distribution. Second, we assume that the standard deviation is available and we don’t need to estimate it. Therefore, the only thing we need to estimate is the mean of the distribution.
The frequentist way
How would a frequentist approach this problem? Well, they would reason as follows:
I don’t know what the mean female height is. However, I know that its value is fixed (not a random one). Therefore, I cannot assign probabilities to the mean being equal to a certain value, or being less than/greater than some other value. The most I can do is collect data from a sample of the population and estimate its mean as the value which is most consistent with the data.
The value mentioned in the end is known as the maximum likelihood estimate. It depends on the distribution of the data and I won’t go into details on its calculation. However, for normally distributed data, it’s quite straightforward: the maximum likelihood estimate of the population mean is equal to the sample mean.
The Bayesian way
A Bayesian, on the other hand, would reason differently:
I agree that the mean is a fixed and unknown value, but I see no problem in representing the uncertainty probabilistically. I will do so by defining a probability distribution over the possible values of the mean and use sample data to update this distribution.
In a Bayesian setting, the newly collected data makes the probability distribution over the parameter narrower. More specifically, narrower around the parameter’s true (unknown) value. You do the updating process by applying Bayes’ theorem:
The way to update the entire probability distribution is by applying Bayes’ theorem to each possible value of the parameter.
If you aren’t familiar with Bayes’ theorem, take a look at my introductory post, as well as this post. They will give you some intuition about the theorem and its derivation. And if you really want to see the use of Bayes’ theorem in action, this post is for you. There, I demonstrated the estimation of the bias of a coin by updating the full probability distribution after each coin flip.
Frequentists’ main objection to the Bayesian approach is the use of prior probabilities. Their criticism is that there is always a subjective element in assigning them. Paradoxically, Bayesians consider not using prior probabilities one of the biggest weaknesses of the frequentist approach.
Although this isn’t a debate you can answer one way or another with complete certainty, the truth is not somewhere in the middle. In the future, I’m going to write a post that discusses the mathematical and practical consequences of using or not using prior probabilities.
Here, the difference between frequentist and Bayesian approaches is analogous to their difference in parameter estimation. Again, frequentists don’t assign probabilities to possible parameter values and they use (maximum likelihood) point estimates of unknown parameters to predict new data points.
Bayesians, on the other hand, have a full posterior distribution over the possible parameter values which allows them to take into account the uncertainty in the estimate by integrating the full posterior distribution, instead of basing the prediction just on the most likely value.
P-values and confidence intervals
Frequentists don’t treat the uncertainty in the true parameter value probabilistically. However, that doesn’t magically eliminate uncertainty. The maximum likelihood estimate could still be wrong and, in fact, most of the time it is wrong! When you assume that a particular estimate is the correct one, but in reality it isn’t, you will make an error. Consequently, this has lead to the development of two mathematical techniques for quantifying and limiting long-term error rates:
- null hypothesis significance testing (NHST) and the related concept of p-values
- confidence intervals
The general idea is to make an estimate, then assume something about the estimate only under certain conditions. You choose these conditions in a way that limits the long-term error rate by some number (usually 5% or lower).
I discussed the types of error rates NHST controls (as well as the correct interpretation of p-values) in a previous post about this topic. But I strongly encourage you to read that post if you’re not familiar with p-values. In fact, I encourage you to read it even if you are, due to their very frequent misinterpretation.
Confidence intervals are the frequentist way of doing parameter estimation that is more than a point estimate. The technical details behind constructing confidence intervals are beyond the scope of this post, but I’m going to give the general intuition.
Put yourself in the shoes of a person who’s trying to estimate some mean value (the average height in a population, the average IQ difference between two groups, and so on). As usual, you start by collecting sample data from the population. Now, the next step is the magic that I’m not telling you about. It’s a standard procedure for calculating an interval of values.
You determine the whole procedure, including the sample size, before collecting any data. And you choose the procedure with the following goal in mind.
- If you, hypothetically, repeat the procedure a large number of times, the confidence interval should contain the true mean with a certain probability.
In statistics, commonly used ones are the 95% and the 99% confidence intervals.
If you choose a population with a fixed mean, collect sample data, and finally calculate the 95% confidence interval, 95% of the time the interval you calculated will cover the true mean.
Once you’ve calculated a confidence interval, it’s incorrect to say that it covers the true mean with a probability of 95% (this is a common misinterpretation). You can only say, in advance, that in the long-run, 95% of the confidence intervals you’ve generated by following the same procedure will cover the true mean.
A graphical illustration of calculating confidence intervals
Click on the image below to start a short animation that illustrates the procedure I described above:
Click on the image to start/restart the animation.
Somewhere on the real number line, we have a hypothetical mean (abbreviated with the letter ‘m’) . We generate 20 consecutive 95% confidence intervals. Two of them happen to miss the mean, and 18 happen to cover it. That gives 18/20 = 90%.
Why did only 90% of the confidence intervals cover the mean, and not 95%? The answer, of course, is that the process is inherently probabilistic and there is no guarantee that, of any fixed number of confidence intervals, exactly 95% will cover the true mean. However, as the number of generated intervals increases, the percentage that cover the mean will get closer and closer to 95%.
By the way, notice how individual confidence intervals don’t have the same width and they are all centered around different values.
In this post, I introduced the distinction between descriptive and inferential statistics and explained the 3 goals of the latter:
- Parameter estimation
- Data prediction
- Model comparison
I ignored the last goal and mostly focused on the first.
I showed that the difference between frequentist and Bayesian approaches has its roots in the different ways the two define the concept of probability. Frequentist statistics only treats random events probabilistically and doesn’t quantify the uncertainty in fixed but unknown values (such as the uncertainty in the true values of parameters). Bayesian statistics, on the other hand, defines probability distributions over possible values of a parameter which can then be used for other purposes.
Finally, I showed that, in the absence of probabilistic treatment of parameters, frequentists handle uncertainty by limiting the long-term error rates, either by comparing the estimated parameter against a null value (NHST), or by calculating confidence intervals.