“Simple explanations are better than complex explanations.” Have you heard this statement before? It’s the most simplified version of the principle known as Occam’s razor. More specifically, the principle says:
A simple theory is always preferable to a complex theory, if the complex theory doesn’t offer a better explanation.
Does it make sense? If it’s not immediately convincing, that’s okay. There have been debates around Occam’s razor’s validity and applicability for a very long time.
In this post, I’m going to give an intuitive introduction to the principle and its justification. I’m going to show that, despite historical debates, there is a sense in which Occam’s razor is always valid. In fact, I’m going to try to convince you this principle is so true that it doesn’t even need to be stated on its own.
In order to follow the second (and main) part of this post, you need basic familiarity with Bayes’ theorem, combinatorics, and compound probability. Introductory posts on each of these topics should get you covered.
Introduction to Occam’s razor
Occam’s razor is one of the oldest principles in philosophy of science. Even though it’s named after the 14th century English philosopher William of Ockham, its roots are traceable to earlier periods.
Prominent philosophers and scientists who have proposed variants of the principle include Ptolemy, Aristotle, Aquinas, Kant, Newton, Einstein, and many others.
So what is Occam’s razor? Let me introduce it with a concrete example.
Occam’s razor example
Say you’re reading a legal case on an attempted carjacking of a luxury car. The first excerpt you come across is the following:
Prosecutor: “You were caught sitting inside that car after breaking one of the side windows.”
Defendant: “I wasn’t stealing anything. I was just trying to hide from the rain.”
Here we have the observation that the defendant was found inside a car that didn’t belong to them. The first and main explanation is that they were trying to steal the car. Let’s call this Explanation A. However, the defendant offers an alternative explanation: they had no intention to steal and were just trying to hide from the rain. Let’s call this Explanation B.
You probably immediately dismiss the latter, but why? So far you haven’t heard any evidence in favor of either explanation. Why aren’t they equally likely to be true?
One reason could be that you’ve never heard of a person breaking into a car to hide from the rain. On the other hand, breaking into a car to steal it is much more common, which makes A more likely to be true. But that’s not the only problem. Unlike A, with B there’s also a strong sense of incompleteness.
You continue reading:
Prosecutor: “This is very unusual. Why would you break into a car to hide from the rain?”
Defendant: “The car was the closest thing around. It was my only chance to hide.”
Prosecutor: “Why didn’t you look for a more appropriate place to hide, even risking getting a little wet?”
Defendant: “I have a rare skin condition that makes me very sensitive to water. I shouldn’t stay under rain for longer than a minute.”
Breaking it down
As you continue reading, Explanation B seems to answer more questions, but for some reason it doesn’t get much more believable. Why is that?
Notice how with each extra piece of information the defendant makes Explanation B more complete at the expense of also making it more complex. And while Explanation A is already (almost) complete, B needs a lot of additional statements and clarifications.
Each of these statements is an extra hypothesis that needs to be independently verified. In science, such hypotheses are referred to as ad hoc. Generally speaking, ad hoc hypotheses increase the explanatory power of a theory while also increasing its complexity and reducing its predictive power.
The bottom line is that getting the explanatory power of B to the level of A makes B much more complex than A. Hence, according to Occam’s razor, A is preferable to B:
A simple theory is always preferable to a complex theory, if the complex theory doesn’t offer a better explanation.
Traditional justifications for Occam’s razor
Is Occam’s razor valid? So far, I’ve only appealed to your intuition, but there are more rigorous ways to answer this question.
Traditionally, philosophers and scientists have attempted to justify the principle with a combination of rational and empirical arguments.
Consider this quote from the philosopher Elliott Sober:
Just as the question ‘why be rational?’ may have no non-circular answer, the same may be true of the question ‘why should simplicity be considered in evaluating the plausibility of hypotheses?’
This is an example of a rational (a priori) justification for the principle. The argument here is that we need a principle like Occam’s razor to choose between rival theories or hypotheses when there isn’t a better way to do so (such as when they explain the observations equally well).
This may not be the most convincing argument, but it has some practical value. After all, even if simplicity by itself doesn’t make a theory “more true”, there are other benefits of working with simple theories. For one thing, they tend to be easier to understand and often have a higher aesthetic appeal.
Empirical arguments for the principle focus on instances of its real-world success. If Occam’s razor is a valid principle, all else being equal, theories that were once simpler must frequently “beat” their rivals when more evidence becomes available. Let’s look at a concrete example.
The 18th century naturalist Georges-Louis Leclerc, Comte de Buffon proposed a law currently known as Buffon’s Law. Roughly, it says that geographical regions separated by natural barriers (like mountains and oceans) have distinct biological species, with occasional exceptions.
According to Charles Darwin, the law was explained by the natural dispersal of species to separate geographical regions over time. Once separated from each other, they evolved independently. The occasional exceptions were explained by anomalous dispersal (also known as “jump dispersal”) in the recent past. This was caused by rare factors like ocean currents, wind, floating ice, and (more recently) human activities, which allowed species from separate regions to mix in common environments.
Opposed to this explanation, the botanist Léon Croizat suggested that historical tectonic changes (rather than dispersal), such as newly-formed mountains, were responsible for Buffon’s Law. He explained exceptions with historical land bridges between regions (that stopped existing at a later point in time).
Croizat’s hypothesis postulates the existence of currently non-observable entities (the land bridges). For this reason, biologists have criticized it for introducing unnecessary complexity in its explanation of the law. Hence, support for Darwin’s hypothesis would be an empirical “point” in favor of Occam’s razor. In fact, Darwin’s theory has received strong support, including in some recent findings.
Of course, the available evidence neither proves nor rejects either hypothesis. And even if it did, that by itself would only be weak evidence for or against Occam’s razor.
To most people, Occam’s razor makes intuitive sense, but academics have tried to find more robust justifications for it. Why use simplicity as a deciding factor in theory selection?
I already gave examples of rational and empirical arguments in favor of the principle. Of course, other arguments exist, including aesthetics and elegance considerations, which have also been occasionally used as justifications.
However, in the following section I want to introduce a third type of justification. It is relatively new and comes from probability theory. In some sense, it’s a combination of a rational and an empirical justification. And in another sense, it’s the main focus of this post, so let’s dive in.
Probabilistic justification for Occam’s razor
Let’s go back to the original question. Why is it a good thing for a hypothesis/theory to be simple? Well, if simpler theories generally tend to be closer to the truth, then simplicity is good because it brings us closer to the truth. This is the essence of the empirical arguments for Occam’s razor. But simply counting “simple” and “less simple” theories and what percentage of them turn out to be true isn’t a very rigorous or reliable approach.
The main problem is that none of the crucial concepts is well defined. What exactly does it mean for a theory to be simple? How can we compare two theories in terms of simplicity? Also, what does it mean for a theory to “turn out to be true”? There are hardly any proofs of certainty in science, so it’s never safe to say that any theory is 100% true or false. A third problem is comparing theories in terms of how well they explain the available observations. What exactly does it mean for a theory to explain something better than another theory?
This all seems too vague and subjective.
A better approach is to adopt a formal definition of simplicity which, at least partly, coincides with a more intuitive one. Then we also need a formal approach to establishing the truth value of theories. Finally, we need to formalize what it means for a theory to explain an observation. If we can then show that, by manipulating only the simplicity of a theory, we can increase its truth value, this would be a much better justification for Occam’s razor.
Formalizing “truth”, “simplicity”, and “degree of explanation”
One of the main reasons why absolute truth can never be established for almost any real world problem is underdetermination. In short, any observation is consistent with an infinite number of potential explanations (theories). Therefore, no amount of evidence can ever prove a theory with complete certainty. That said, not all explanations are equally plausible.
In my post discussing the nature of probabilities, I showed how they can be viewed as degrees of belief or degrees of uncertainty. This means that they’re a great candidate for ranking theories according to their truth values. Let’s define the truth of a theory as its probability. If P(Theory) = 1, then the theory is certainly true. If P(Theory) = 0, it’s certainly false. And any number between 0 and 1 means there’s uncertainty about its truth.
So far, so good. Next, we need to define simplicity. A natural way is to relate it to the number of independent assumptions underlying the theory. Fewer assumptions = simpler theory. Makes sense? Almost. We also need to consider the simplicity of the assumptions themselves. I’m going to come back to this point in the next section. For now, let’s just say a theory is simple if it has few assumptions that are themselves simple.
Finally, we need to say what it means for a theory to explain an observation. In the most basic sense, if any part of the observation is inconsistent with what the theory had predicted prior to the observation, we count that part as not explained. So, the degree to which a theory explains an observation is the degree to which the observation is consistent with the theory’s predictions.
This may sound too abstract, but I promise it’s all going to become much clearer in the following sections.
Applying the new definitions
Let’s start with a hypothetical theory, called Theory A. Let’s also call all observations relevant to the theory Observations B (experiments, surveys, correlation data, and so on). Then, the degree to which A is consistent with B is the conditional probability P(Observations B | Theory A), also called the likelihood. The probability that the theory is true is the conditional probability P(Theory A | Observations B), also known as the posterior probability. From now on, I’m going to refer to these two simply as P(B|A) and P(A|B).
Naturally, we can relate them using Bayes’ theorem:
- P(A|B) = P(B|A) * P(A) / P(B)
Imagine an alternative theory, called Theory A’, which tries to explain the same observations B. Its posterior probability is:
- P(A’|B) = P(B|A’) * P(A’) / P(B)
Here’s the crucial expectation. According to Occam’s razor, if A and A’ explain B equally well, the simpler of the two will have a higher posterior probability. Let’s see if that’s true.
Notice that the evidence term P(B) doesn’t play a role in this comparison, since its value doesn’t depend on the theory. Then, the comparison depends solely on the products P(B|A) * P(A) and P(B|A’) * P(A’).
Let’s simplify this even further by assuming the prior probabilities P(A) and P(A’) are equal. This reduces the comparison to the likelihood terms P(B|A) and P(B|A’). Since Occam’s razor is supposed to be universally valid, it must also be valid in this special case.
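To make the cancellation concrete, here is a minimal Python sketch. The function names and all the numbers are my own illustrative assumptions (they don't come from any real theory), and I assume A and A’ are the only two candidates so that P(B) can be computed.

```python
from fractions import Fraction

def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

def posterior_ratio(likelihood_a, prior_a, likelihood_a2, prior_a2):
    """P(A|B) / P(A'|B); the shared evidence term P(B) cancels out."""
    return (likelihood_a * prior_a) / (likelihood_a2 * prior_a2)

# Hypothetical numbers, chosen only for illustration.
lik_a = Fraction(1, 4)    # P(B|A)
lik_a2 = Fraction(1, 20)  # P(B|A')
prior = Fraction(1, 2)    # equal priors: P(A) = P(A')

# P(B) under the assumption that A and A' are the only candidates.
evidence = lik_a * prior + lik_a2 * prior

ratio_full = posterior(lik_a, prior, evidence) / posterior(lik_a2, prior, evidence)
ratio_short = posterior_ratio(lik_a, prior, lik_a2, prior)

# With equal priors, both ratios reduce to the likelihood ratio.
print(ratio_full, ratio_short, lik_a / lik_a2)
```

All three printed values coincide, which is the point: with equal priors, comparing posteriors is the same as comparing likelihoods.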
Now, if we manage to show that, given the same explanatory power, simpler theories always yield higher likelihood values, we will essentially prove the validity of Occam’s razor!
In the next sections I’m going to do just that by demonstrating it with an example, followed by the formal intuition for why it works in the general case.
Imagine a piece of cardboard with a 10 by 10 grid lying on a table.
You also see 15 black beans of equal size randomly dispersed next to it. Each of the 100 squares of the board is large enough to fit exactly one bean.
You leave the room and see another person enter after you. About 10 minutes later, you come back and find that the other person has arranged the beans on the board, forming the following pattern (a horizontal line of 7 beans sitting on top of a vertical line of 8 beans):
The person who entered the room after you has left, so you can’t ask them why they did it. However, this looks just like the English letter T, so you suspect this is what they were trying to do (maybe it’s the first letter of their name?). Let’s call this Hypothesis 1.
But of course, that’s not the only possible explanation. They could have also wanted to form a horizontal and a vertical line with those exact lengths. In that case, the T-like pattern would be a mere coincidence (Hypothesis 2).
Yet a third hypothesis is that the person placed the beans without following any rule. In other words, they chose the 15 squares completely randomly. Let’s call this Hypothesis 3.
Let’s also call the “T” pattern itself Pattern T.
I deliberately selected the hypotheses with an increasing level of complexity. Hypothesis 2 (H2 for short) is a more complex explanation of Pattern T than Hypothesis 1 (H1) because it explains the positions of the horizontal and vertical lines separately (in H1 they are bound together). Hypothesis 3 (H3) is even more complex because it explains the position of each bean separately!
Following the discussion above, we need to demonstrate the truth of the following inequality of the likelihood terms:
- P(T | H1) > P(T | H2) > P(T | H3)
Let’s do that.
Calculating P(T | H1)
Think about what this probability represents. Here’s a more verbose reading of P(T | H1):
- The probability that the person would form the observed pattern, given that they intended to form the letter T.
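If you go back to the figures, you’ll see that there are only 8 possible ways to form the letter T. The T is 7 squares wide and 9 squares tall (a 7-bean horizontal line on top of an 8-bean vertical line), so on a 10 by 10 grid there are 2 possible vertical positions and 4 possible horizontal positions (2 * 4 = 8). Therefore: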
- P(T | H1) = 1/8
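If you want to double-check the count, here is a small brute-force sketch in Python. It assumes an upright T whose 8-bean stem hangs directly below the middle of the 7-bean bar (no rotations), which is how I read the pattern; the helper names are my own.

```python
GRID = 10
BAR_LEN = 7    # horizontal bar of the T
STEM_LEN = 8   # vertical stem, directly below the bar

def t_cells(top, left):
    """Cells (row, col) of a T whose bar starts at (top, left)."""
    bar = {(top, left + i) for i in range(BAR_LEN)}
    stem_col = left + BAR_LEN // 2  # stem under the middle of the bar
    stem = {(top + 1 + j, stem_col) for j in range(STEM_LEN)}
    return bar | stem

def fits(cells):
    """True if every cell lies inside the grid."""
    return all(0 <= r < GRID and 0 <= c < GRID for r, c in cells)

# Try every possible top-left corner and keep the ones where the T fits.
placements = [
    (top, left)
    for top in range(GRID)
    for left in range(GRID)
    if fits(t_cells(top, left))
]

print(len(placements))
```

Under H1, each of these placements is treated as equally likely, so the observed one has probability 1 over the number of placements.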
Calculating P(T | H2)
Since in this case the two lines are not attached to each other, there are new possible combinations.
I’m deliberately going to go a little fast over the calculations, since the details aren’t that important. If you have any questions, feel free to leave them in the comment section.
Let’s start with the vertical line. There are 3 possible vertical positions and 10 possible horizontal positions. So, overall there are 3 * 10 = 30 possible positions.
For each of the 30 positions of the vertical line, there are 2 vertical and 4 horizontal possible positions for the horizontal line: 2 * 4 = 8 (above and below it). This means that the total number of possible configurations is 30 * 8 = 240.
But we’re not done just yet. Notice that if the vertical line is close enough to one of the sides, there’s enough room for the horizontal line to squeeze in to its left or to its right. When the vertical line touches a side, this gives 3 * 8 = 24 additional possibilities for the horizontal line: 8 is the number of rows occupied by the vertical line (the additional vertical positions for the horizontal line) and 3 is the number of horizontal positions that fit next to the vertical line.
There are 2 horizontal and 3 vertical possible positions for the vertical line where it touches one of the sides. Therefore, we get 2 * 3 * 24 = 144 additional combinations. Careful with counting the same combinations more than once!
The calculations for when the vertical line is 1 or 2 squares away from each side are very similar. They yield 96 and 48 additional combinations, respectively. Therefore, the final number of combinations is:
- 240 + 144 + 96 + 48 = 528
Finally we can calculate P(T | H2) as:
- P(T | H2) = 1/528
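Since I went quickly over the counting, here is a brute-force check sketched in Python. It enumerates every 8-bean vertical line and every 7-bean horizontal line on the grid and counts the pairs that don't share a square. The variable names are my own, and the only rule I assume is that two beans can't occupy the same square.

```python
GRID = 10
V_LEN = 8  # beans in the vertical line
H_LEN = 7  # beans in the horizontal line

# Every vertical line: 3 possible top rows times 10 columns.
vertical_lines = [
    {(top + i, col) for i in range(V_LEN)}
    for top in range(GRID - V_LEN + 1)
    for col in range(GRID)
]

# Every horizontal line: 10 rows times 4 possible leftmost columns.
horizontal_lines = [
    {(row, left + i) for i in range(H_LEN)}
    for row in range(GRID)
    for left in range(GRID - H_LEN + 1)
]

# A configuration is valid when the two lines share no square.
configs = sum(
    1
    for v in vertical_lines
    for h in horizontal_lines
    if not (v & h)
)

print(len(vertical_lines), len(horizontal_lines), configs)
```

The enumeration agrees with the case-by-case count above (240 + 144 + 96 + 48).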
Calculating P(T | H3)
Remember that, according to H3, the position of each bean was determined independently of the others. There are 100 squares and 15 beans. In how many possible ways can the beans be placed on the board?
You can calculate this quite easily using the binomial coefficient (the number of ways to choose K items out of N):
- C(N, K) = N! / (K! * (N - K)!)
In our case, N = 100 and K = 15, which gives a total of 253338471349988640 combinations. Whoa!
So, the final likelihood is:
- P(T | H3) = 1/253338471349988640
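For reference, this number is easy to reproduce with Python's standard library. This is just a sketch of the calculation above; `math.comb` requires Python 3.8 or later.

```python
from fractions import Fraction
from math import comb

N = 100  # squares on the board
K = 15   # beans

# Number of ways to choose which 15 of the 100 squares hold a bean.
arrangements = comb(N, K)

# Under H3 every arrangement is equally likely.
likelihood_h3 = Fraction(1, arrangements)

print(arrangements)
print(float(likelihood_h3))
```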
Analyzing the example
Notice that all 3 hypotheses explain the observation perfectly. Yet, look at the gigantic difference between their likelihoods (and respectively their posteriors)! Even under the equal prior probability assumption, H1 is 66 times more probable than H2 and 31 667 308 918 748 580 times more probable than H3! You can verify this is true by calculating the ratios P(T | H1) / P(T | H2) and P(T | H1) / P(T | H3).
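Here is a short Python sketch of that verification. It also goes one step further and turns the likelihoods into posteriors, under two assumptions of mine: the priors are equal, and H1, H2, and H3 are the only hypotheses on the table.

```python
from fractions import Fraction
from math import comb

# Likelihoods P(T | H) taken from the calculations above.
likelihoods = {
    "H1": Fraction(1, 8),
    "H2": Fraction(1, 528),
    "H3": Fraction(1, comb(100, 15)),
}

ratio_h1_h2 = likelihoods["H1"] / likelihoods["H2"]
ratio_h1_h3 = likelihoods["H1"] / likelihoods["H3"]

# With equal priors the prior cancels, so each posterior is just the
# likelihood divided by the sum of all likelihoods (assumes these three
# hypotheses are the only candidates).
total = sum(likelihoods.values())
posteriors = {name: lik / total for name, lik in likelihoods.items()}

print(ratio_h1_h2, ratio_h1_h3)
for name, post in posteriors.items():
    print(name, float(post))
```

In this three-way comparison, H1 ends up with roughly 98.5% of the posterior probability, even though all three hypotheses are fully consistent with the pattern.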
Of course, you could argue that their prior probabilities shouldn’t be equal. But the point is that, regardless of what the prior probabilities are, the simplicity of a hypothesis is a huge factor in determining the value of the likelihood term. And because the posterior probability is proportional to the likelihood, a hypothesis gets “probability points” merely by being simpler than its rivals.
A big win for Occam’s razor!
Earlier I related simplicity of an explanation to the number and simplicity of its underlying independent assumptions. Let’s apply this criterion to the three hypotheses from the example.
What do you have to assume for H1? If you think about it, once you assume the position of the top left bean, you’re no longer free to assume anything about the positions of the remaining beans. According to H1, the pattern is always supposed to be a T and only its position is allowed to vary. Furthermore, there are only 8 allowed positions for the top left bean, as the remaining positions don’t allow for the entire T to fit inside the grid.
On the other hand, H2 doesn’t require the pattern to be a T. It only requires that the pattern contains an 8-bean-long vertical line and a 7-bean-long horizontal line. Therefore, you need to make independent assumptions about the positions of both lines.
H3 is even more complex because there you need to make independent assumptions about the positions of all 15 beans.
All hypotheses explain the pattern completely. The only difference is in their simplicity, but you saw the kind of difference that can make! The reason more independent assumptions lead to a lower probability ultimately rests on the following rule:
- The probability that several independent assumptions are all true is equal to the product of their individual probabilities.
And because probabilities are numbers between 0 and 1, every additional assumption that isn’t certain multiplies the product by a number smaller than 1 and therefore reduces it.
If you need a refresher on this, check my post on compound probabilities.
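To get a feeling for how quickly the product shrinks, here is a tiny Python sketch. The probabilities are hypothetical: I simply assume every assumption is individually quite plausible (0.9) and compare a theory with two assumptions to one with six.

```python
from math import prod

def joint_probability(assumption_probs):
    """Probability that all independent assumptions hold at once."""
    return prod(assumption_probs)

# Hypothetical values, for illustration only.
simple_theory = [0.9, 0.9]   # two independent assumptions
complex_theory = [0.9] * 6   # six independent assumptions

p_simple = joint_probability(simple_theory)
p_complex = joint_probability(complex_theory)

print(f"{p_simple:.3f} vs {p_complex:.3f}")
```

Even with assumptions this plausible, the joint probability drops from about 0.81 to about 0.53 just by adding four more of them.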
In the past, some philosophers saw value in Occam’s razor because they believed nature itself was simple. Hence, they expected that theories explaining nature must also be simple. However, today we know that natural observations and their explanations can be very complex and messy. Therefore, we need stronger arguments to justify the virtue of simplicity in theories.
In this post, I briefly introduced you to rational and empirical justifications of Occam’s razor. But I mostly hope to have demonstrated that there’s a strong mathematical and probabilistic argument behind the principle. Namely, if you need to make too many (or too complex) independent assumptions to explain an observation, the posterior probability of your explanation gets a natural penalty, regardless of how well it explains the observation. In other words, if your explanation requires many independent (and unlikely) things to be true, it has to explain the phenomena better than any rival explanation with fewer or simpler assumptions.
Of course, if a theory doesn’t properly explain an observation, its simplicity is worthless. For this reason, in the real world there’s often a trade-off between a theory’s simplicity and how well it explains a set of observations.
Earlier I mentioned the concept of ad hoc hypotheses. These are special purpose hypotheses that are typically used to save a theory from being falsified by an observation. But you saw that you can’t really “cheat” Bayes’ theorem by introducing ad hoc hypotheses. If an ad hoc hypothesis adds little to the explanation, any boost to the posterior probability will be canceled out (or worse) by the complexity added by the hypothesis.
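One way to write this trade-off down is sketched below. The symbol X is my own notation for an ad hoc hypothesis attached to a theory A; A’ is a rival theory and B stands for the observations, as before. The identity follows from Bayes’ theorem together with P(A and X) = P(X|A) * P(A).

```latex
\[
\frac{P(A \wedge X \mid B)}{P(A' \mid B)}
= \frac{P(B \mid A \wedge X)}{P(B \mid A')}
\cdot
\frac{P(X \mid A)\, P(A)}{P(A')}
\]
```

The first factor is how much better the patched theory fits the observations than its rival. The second factor carries the price of the patch: P(X|A) is at most 1, so unless X is certain given A, it discounts whatever the likelihood gained.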
I started this post by promising to show you that Occam’s razor doesn’t need to be stated as a separate principle. I hope now you see why this is true — the principle (or at least a version of it) follows naturally from the laws of probability. Hence, asserting it independently is redundant and… complicates things unnecessarily.
Paradoxically, Occam’s razor renders itself obsolete!