In my introductory Bayes’ theorem post, I used a “rainy day” example to show how information about one event can change the probability of another. In particular, I showed how seeing signs of rainy weather (like dark clouds) increases the probability that it will rain later the same day. **Bayesian belief networks**, or just **Bayesian networks**, are a natural generalization of these kinds of inferences to multiple events or random processes that depend on each other.

This is going to be the first of 2 posts specifically dedicated to this topic. Here I’m going to give the general intuition for what Bayesian networks are and how they are used as causal models of the real world. I’m also going to give the general intuition of how information propagates within a Bayesian network.

The second post will be specifically dedicated to the most important mathematical formulas related to Bayesian networks.

# Overview of Bayesian networks

Imagine you have a dog that really enjoys barking at the window whenever it’s raining outside. Not necessarily every time, but still quite frequently. You also own a sensitive cat that hides under the couch whenever the dog starts barking. Again, not always, but she tends to do it often.

The reason I’m emphasizing the uncertainty of your pets’ responses is that most real-world relationships between events are *probabilistic*. You rarely observe straightforward links like “If X happens, Y happens with complete certainty”.

To continue the example above, if you’re outside your house and it starts raining, there will be a high probability that the dog will start barking. This, in turn, will increase the probability that the cat hides under the couch. You see how information about one event (rain) allows you to make inferences about a seemingly unrelated event (the cat hiding under the couch).

You can also make the inverse inference. If you see the cat hiding under the couch, this will increase the probability that the dog is currently barking. And that, in turn, will increase the probability that it’s currently raining.

Bayesian networks are very convenient for representing similar probabilistic relationships between multiple events.

Before you move to the first section below, if you’re new to probability theory concepts and notation, I suggest you start by reading the post I linked to in the beginning. It will give you the starting “language” for following the next sections.

## Bayesian networks as graphs

People usually represent Bayesian networks as directed graphs in which each node is a **hypothesis** or a **random process**. In other words, something that takes at least 2 possible values you can assign probabilities to. For example, there can be a node that represents the state of the dog (barking or not barking at the window), the weather (raining or not raining), etc.

The arrows between nodes represent the **conditional probabilities** between them—how information about the state of one node changes the probability distribution of another node it’s connected to.

Here’s how the events “it rains/doesn’t rain” and “dog barks/doesn’t bark” can be represented as a simple Bayesian network:

The nodes are the empty circles. Next to each node you see the event whose probability distribution it represents. Next to the arrow is the conditional probability distribution of the second event, given the first event. It reads something like:

- The probability that the dog will start barking, given that it’s currently raining.

In general, the nodes don’t represent a particular event, but all possible alternatives of a hypothesis (or, more generally, states of a variable). In this case, the set of possible events for the first node consists of:

- It rains
- It doesn’t rain

And for the second node:

- The dog barks
- The dog doesn’t bark

More generally, though, a node can take more than two possible values, and sometimes infinitely many.

## Bayesian networks as joint probability distributions

The simple graph above is a Bayesian network that consists of only 2 nodes. It represents a **joint probability distribution** over their possible values. That’s simply a list of probabilities for all possible event combinations:

The blue numbers are the joint probabilities of the 4 possible combinations (that is, the probabilities of both events occurring):

- P(Rains & Dog barks) = 9/48 ≅ 0.19
- P(Rains & Dog doesn’t bark) = 3/48 ≅ 0.06
- P(Doesn’t rain & Dog barks) = 18/48 = 0.375
- P(Doesn’t rain & Dog doesn’t bark) = 18/48 = 0.375

Notice how the 4 probabilities sum up to 1, since the four event combinations cover the entire sample space.

The orange numbers are the so-called **marginal probabilities**. You can think of them as the *overall* probabilities of the events:

- P(Rains) = 12/48
- P(Doesn’t rain) = 36/48
- P(Dog barks) = 27/48
- P(Dog doesn’t bark) = 21/48

These are obtained by simply summing the probabilities of each row and column.
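To make the row and column sums concrete, here is a minimal Python sketch that stores the four joint probabilities listed above and recovers the marginals from them. The `joint` dictionary and the `marginal` helper are names I made up for illustration. The last step also computes the conditional probability that sits on the arrow, P(Dog barks | Rains), by dividing a joint probability by a marginal.

```python
from fractions import Fraction

# Joint probabilities from the list above, keyed by (rains, dog_barks).
joint = {
    (True, True): Fraction(9, 48),
    (True, False): Fraction(3, 48),
    (False, True): Fraction(18, 48),
    (False, False): Fraction(18, 48),
}

def marginal(joint, index, value):
    """Sum the joint probabilities in which variable `index` equals `value`."""
    return sum(p for states, p in joint.items() if states[index] == value)

p_rain = marginal(joint, 0, True)   # 12/48
p_bark = marginal(joint, 1, True)   # 27/48

# The conditional probability on the arrow: P(Dog barks | Rains).
p_bark_given_rain = joint[(True, True)] / p_rain

print(p_rain, p_bark, p_bark_given_rain)  # prints: 1/4 9/16 3/4
```

Using `Fraction` keeps the arithmetic exact, so the results can be compared directly with the fractions in the text.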

## Building complex networks

Earlier I mentioned another relationship: if the dog barks, the cat is likely to hide under the couch. Same as before, this relationship can be represented by a Bayesian network:

Here’s the joint probability distribution over these 2 events I came up with:

What if you wanted to represent all three events in a single network? Doing this is surprisingly easy and intuitive:

The main idea is that you create a node for each set of complementary and mutually exclusive events (like “it’s raining” and “it’s not raining”) and then place arrows between nodes that directly depend on each other. Each arrow’s direction specifies which of the two events depends on the other.
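As a rough sketch of what chaining the three events means numerically, the joint probability of the Rain → Dog → Cat network can be written as a product of one table per node. The rain and dog numbers below come from the earlier rain/dog table. The cat numbers are hypothetical values I picked only for illustration, and the function names are my own.

```python
from itertools import product

# P(Rain), from the marginal probabilities given earlier.
p_rain = {True: 0.25, False: 0.75}
# P(Dog barks = True | Rain), derived from the earlier table: 9/12 and 18/36.
p_bark_given_rain = {True: 0.75, False: 0.5}
# P(Cat hides = True | Dog barks): HYPOTHETICAL numbers, for illustration only.
p_hide_given_bark = {True: 0.8, False: 0.1}

def prob(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(rain, bark, hide):
    """P(Rain, Dog barks, Cat hides) for the chain Rain -> Dog -> Cat."""
    return (p_rain[rain]
            * prob(p_bark_given_rain[rain], bark)
            * prob(p_hide_given_bark[bark], hide))

# All 8 combinations of the three binary events.
table = {states: joint(*states) for states in product([True, False], repeat=3)}
total = sum(table.values())
```

The eight entries of `table` sum to 1 (up to floating-point rounding), just like the four entries of the two-node table.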

Networks can be made as complicated as you like:

Each of these nodes has possible states. For example:

- **Season**: Spring / Summer / Fall / Winter
- **Grass**: Dry / Wet
- **Cat mood**: Sleepy / Excited

The arrows hold the probabilistic dependencies between the nodes they connect (I omitted labeling the arrows to not make the graph too cluttered). In other words, for each arrow there’s a table like the ones I showed in the previous section.

For example, the arrow between the “Season” and “Allergies” nodes stands for a table of probabilities. This table holds information like the probability of having an allergic reaction, given the current season (that is, a conditional probability).
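One simple way to hold a network like this in code is a mapping from each node to its parents. The sketch below is only an assumption about the graph's shape: it includes just the arrows that this post mentions in the text, so the actual figure may contain more nodes or arrows. The `parents` dictionary and the `children` helper are names I chose for illustration.

```python
# Each node mapped to its parents (the nodes with arrows pointing into it).
# Only arrows mentioned in the text of this post are included (an assumption).
parents = {
    "Season": [],
    "Cat mood": [],
    "Rain": ["Season"],
    "Allergies": ["Season"],
    "Grass": ["Rain"],
    "Umbrellas": ["Rain"],
    "Dog bark": ["Rain"],
    "Cat hide": ["Dog bark", "Cat mood"],
}

def children(node):
    """Return the nodes that have `node` as a parent, in alphabetical order."""
    return sorted(child for child, ps in parents.items() if node in ps)

print(children("Rain"))      # prints: ['Dog bark', 'Grass', 'Umbrellas']
print(parents["Cat hide"])   # prints: ['Dog bark', 'Cat mood']
```

Storing parents rather than children is a design choice: each node's probability table depends on its parents, so that is the lookup you need most often.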

## What are Bayesian networks used for?

I think it’s most intuitive to think about a Bayesian network as a model of some aspect of the world. The network has certain assumptions about the probabilistic dependencies between the events it models.

You can use Bayesian networks for two general purposes:

- Making future predictions
- Explaining observations

Take a look at the last graph. An example of making a prediction would be:

- If P(Dog bark = True) is high, P(Cat hide = True) is also high.

In other words, if the dog starts barking, this will increase the probability of the cat hiding under the couch.

Explaining observations would be going in the opposite direction. If the cat is hiding under the couch, this will increase the probability that the dog is barking, because the dog’s barking is one of the possible things that can make the cat hide.
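Both directions can be checked with numbers. The cat's table isn't listed in the text, so this small sketch uses the rain/dog probabilities from earlier instead: prediction goes from rain to barking, and explanation goes from barking back to rain via Bayes’ theorem. The variable names are my own.

```python
from fractions import Fraction

p_rain = Fraction(12, 48)            # P(Rains)
p_bark = Fraction(27, 48)            # P(Dog barks)
p_bark_given_rain = Fraction(9, 12)  # P(Dog barks | Rains) = (9/48) / (12/48)

# Prediction: knowing it rains raises the probability of barking.
print(p_bark, "->", p_bark_given_rain)   # prints: 9/16 -> 3/4

# Explanation: Bayes' theorem turns the arrow around.
# P(Rains | Dog barks) = P(Dog barks | Rains) * P(Rains) / P(Dog barks)
p_rain_given_bark = p_bark_given_rain * p_rain / p_bark
print(p_rain, "->", p_rain_given_bark)   # prints: 1/4 -> 1/3
```

Hearing the dog bark moves the probability of rain from 1/4 to 1/3. It goes up, but rain is still far from certain, because the dog also barks fairly often when it isn't raining.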

Most of the time, you construct Bayesian networks as **causal models** of reality (although they don’t have to necessarily be causal!). This means that you assume the parents of a node are its causes (the dog’s barking *causes* the cat to hide).

In the next section, I’m going to show the mechanics of making predictions and explaining observations with Bayesian networks.

# Updating probabilities of Bayesian networks

New information about one or more nodes in the network updates the probability distributions over the possible values of each node.

Generally, there are two ways in which information can propagate in a Bayesian network: **predictive** and **retrospective**. I’m going to explain both in turn.

## Predictive propagation

Predictive propagation is straightforward—you just follow the arrows of the graph. If there’s new information that changes the probability distribution of a node, the node will pass the information to its children. The children will, in turn, pass the information to their children, and so on.

Here’s an example from the last graph. Imagine that the only information you have is that the current season is fall:

- P(Season = Fall) = 1

(This automatically sets the probabilities of the other possible seasons to 0.)

Here’s an animated illustration of how this information will propagate within the network:


Whenever a node lights up, it means something updated its probability distribution (either external evidence or another node).

Let’s follow one of the information paths. Knowing that the season is fall increases the probability that it’s currently raining. That, in turn, increases the probability that the dog is barking at the window. Finally, that increases the probability that the cat is hiding under the couch.

If this sounds intuitive, it’s because it is. The information propagation simply follows the (causal) arrows, as you would expect.
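Here is a minimal sketch of that path (Season → Rain → Dog bark → Cat hide) in Python. The dog's numbers come from the earlier rain/dog table. The season and cat numbers are hypothetical values I picked for illustration, and for simplicity the sketch ignores the “Cat mood” parent of the “Cat hide” node. Each step passes the parent's distribution to the child with the law of total probability.

```python
# P(Rain = True | Season): HYPOTHETICAL numbers, for illustration only.
p_rain_given_season = {"Spring": 0.3, "Summer": 0.1, "Fall": 0.4, "Winter": 0.2}
# P(Dog bark = True | Rain), derived from the earlier table: 9/12 and 18/36.
p_bark_given_rain = {True: 0.75, False: 0.5}
# P(Cat hide = True | Dog bark): HYPOTHETICAL numbers, for illustration only.
p_hide_given_bark = {True: 0.8, False: 0.1}

def push(p_parent_true, table):
    """Pass a binary parent's P(True) on to its child (law of total probability)."""
    return p_parent_true * table[True] + (1 - p_parent_true) * table[False]

def propagate(season_dist):
    """Follow the arrows Season -> Rain -> Dog bark -> Cat hide."""
    p_rain = sum(p * p_rain_given_season[s] for s, p in season_dist.items())
    p_bark = push(p_rain, p_bark_given_rain)
    p_hide = push(p_bark, p_hide_given_bark)
    return p_rain, p_bark, p_hide

# Before any evidence: all four seasons equally likely (an assumption).
prior = propagate({season: 0.25 for season in p_rain_given_season})
# After learning P(Season = Fall) = 1.
fall = propagate({"Spring": 0.0, "Summer": 0.0, "Fall": 1.0, "Winter": 0.0})
```

With these made-up numbers, all three probabilities in `fall` are higher than the corresponding ones in `prior`, which is the chain of increases described above.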

## Retrospective propagation

Retrospective propagation is basically the inverse of predictive propagation. Normally, when something updates a node’s probability distribution, the node also updates its children. But if a node was updated directly (by external evidence) or by one of its children, it also updates its parents.

Here’s an example. Imagine that the only information you have is that the cat is currently hiding under the couch:

- P(Cat hide = True) = 1

Click on the graph below to see another animated illustration of how this information gets propagated:


First, knowing that the cat is under the couch changes the probabilities of the “Cat mood” and “Dog bark” nodes. The intuition is that either one can be the cause of the cat hiding. If the cat is hiding under the couch, something must have caused it, so the probabilities of its potential causes go up.

In the animation, the “Cat hide” node updates its parents one at a time. I’m only showing it this way because it makes it easier to visually trace the information propagation in the network. In reality, the “Cat hide” node updates the “Cat mood” and “Dog bark” nodes simultaneously. I’m going to explain this in more detail in the second part of this post.

The newly updated “Dog bark” node will now update its own parent, the “Rain” node (again, because the rain is one of the possible reasons for the dog’s barking).

Notice that each updated node also updates its children through predictive propagation. For example, when the “Dog bark” node updates the “Rain” node, the latter updates the “Grass” and “Umbrellas” nodes. That is, now that P(Rain = True) is higher, it’s also more likely that the grass is wet and that people are carrying umbrellas.

It’s also important to note that when you update two or more nodes that share a child, they will update that child simultaneously (similar to how a node updates its parents simultaneously).
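To see retrospective propagation with numbers, here is a small sketch restricted to the Rain → Dog bark → Cat hide chain (it leaves out “Cat mood” and the other nodes for simplicity). It doesn't pass messages node by node. Instead it uses brute-force enumeration: build the joint distribution, keep only the combinations where the cat is hiding, and renormalize. The rain and dog numbers come from the earlier table, the cat numbers are hypothetical, and the function names are my own.

```python
from itertools import product

p_rain = {True: 0.25, False: 0.75}            # P(Rain), from the earlier table
p_bark_given_rain = {True: 0.75, False: 0.5}  # P(Dog bark = True | Rain)
# P(Cat hide = True | Dog bark): HYPOTHETICAL numbers, for illustration only.
p_hide_given_bark = {True: 0.8, False: 0.1}

def prob(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(rain, bark, hide):
    """P(Rain, Dog bark, Cat hide) for the chain Rain -> Dog bark -> Cat hide."""
    return (p_rain[rain]
            * prob(p_bark_given_rain[rain], bark)
            * prob(p_hide_given_bark[bark], hide))

def posterior(index):
    """P(variable `index` = True | Cat hide = True); 0 = Rain, 1 = Dog bark."""
    worlds = [(r, b, True) for r, b in product([True, False], repeat=2)]
    evidence = sum(joint(*w) for w in worlds)  # P(Cat hide = True)
    return sum(joint(*w) for w in worlds if w[index]) / evidence

p_bark_post = posterior(1)  # P(Dog bark = True | Cat hide = True)
p_rain_post = posterior(0)  # P(Rain = True | Cat hide = True)
```

With these numbers, `p_bark_post` is above the prior P(Dog bark = True) = 0.5625 and `p_rain_post` is above the prior P(Rain = True) = 0.25. The rain probability moves less than the barking probability, since rain is one step further from the evidence.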

# Summary

So, this is it for the first part. Here are the main points I covered:

- Bayesian belief networks are a convenient mathematical way of representing probabilistic (and often causal) dependencies between multiple events or random processes.
- A Bayesian network consists of nodes connected with arrows.
- Each node represents a set of mutually exclusive events which cover all possibilities for the node.
- Nodes send probabilistic information to their parents and children according to the rules of probability theory (more specifically, according to Bayes’ theorem).

In the second part of this post, I’m specifically going to focus on the last point. I’m also going to explain the concepts of **conditional dependence** and **independence** of a set of nodes, given another set of nodes.

In future posts, I plan to show specific real-world applications of Bayesian networks which will demonstrate their great usefulness.

Stay tuned!
