Conditional Probability
Multiplication rule
Theorem of total probability
Rejection sampling
Bayes' theorem
Use of Bayes' theorem
Urn Models
What are they?
Why care?
Fallacies regarding conditional probability
Mistaking $P(A|B)$ for $P(B|A)$
Simpson's Paradox
Monty Hall problem
Problems for practice

$\newcommand{\ev}{{\mathcal F}}$

Conditional Probability

The probability that a coin toss results in a head is a statement more about our ignorance regarding the outcome than an absolute property of the coin. If our level of ignorance changes (e.g., if we get some new information), the probability may change. We deal with this mathematically using the concept of conditional probability.

EXAMPLE 1: Here is a box full of shapes.


A box of shapes

I pick one at random. What is the probability that it is a triangle? The answer is $P(\mbox{triangle})=\frac{5}{12}.$

Now, someone gives me some extra information: the randomly selected shape happens to be green in colour. What is the probability that it is a triangle in light of this extra information?

Now my sample space is narrowed down to only the green shapes.


Narrowed sample space

Here the probability of a triangle is different: $\frac 27.$

We cannot use the same notation $P(\mbox{triangle})$ for this new quantity. We need a new notation that reflects our extra information. The new notation is $P(\mbox{triangle}|\mbox{green}).$ We call it the conditional probability of the selected shape being a triangle given that it is green. ■

In general, the notation is $P(A|B)$ where $A,B$ are any two events. The mathematical definition is just as it should be. Instead of the entire sample space $\Omega$ you now narrow your focus down to only $B.$ So $A$ is now narrowed down to $A\cap B.$ So $P(A|B)$ actually measures $P(A\cap B)$ relative to $P(B).$ Hence the definition is:

Definition: Conditional probability If $A,B$ are any two events with $P(B)>0$ then $$P(A|B) = \frac{P(A\cap B)}{P(B)}.$$ If $P(B)=0,$ then $P(A|B)$ is undefined.
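As a quick check of the definition against Example 1, here is a small R sketch. It rebuilds a box that is consistent with the two numbers quoted there (12 shapes, 5 triangles, 7 green shapes, 2 green triangles). The labels `other` and `notgreen` are placeholders, since the actual shapes and colours in the picture are not reproduced here.

```r
# A hypothetical box consistent with P(triangle) = 5/12 and P(triangle | green) = 2/7.
shape  <- c(rep("triangle", 5), rep("other", 7))
colour <- c(rep("green", 2), rep("notgreen", 3),   # colours of the 5 triangles
            rep("green", 5), rep("notgreen", 2))   # colours of the 7 other shapes

p_triangle <- mean(shape == "triangle")                 # unconditional probability
p_green    <- mean(colour == "green")                   # P(B)
p_both     <- mean(shape == "triangle" & colour == "green")  # P(A and B)

# Narrow the sample space to the green shapes and count triangles there.
p_triangle_given_green <- mean(shape[colour == "green"] == "triangle")

print(c(p_triangle, p_triangle_given_green, p_both / p_green))
```

The last two numbers agree: counting inside the narrowed sample space gives the same value as the ratio $P(A\cap B)/P(B).$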

Theorem Consider a probability $P$ on some sample space. Fix any event $B$ with $P(B)>0.$ For every event $A$ define $P'(A)$ as $P'(A) = P(A|B).$ Then $P'$ is again a probability.

Proof: We have to check that the three axioms are satisfied by $P'.$

The first two axioms obviously hold. For the third axiom, let $A_1,A_2,...$ be countably many disjoint events. Then the events $A_1\cap B, A_2\cap B,...$ are also disjoint, so $$ P'(A_1\cup A_2\cup\cdots) = \frac{P((A_1\cup A_2\cup\cdots)\cap B)}{P(B)} = \frac{P((A_1\cap B)\cup(A_2\cap B)\cup\cdots)}{P(B)} = \frac{\sum P(A_i\cap B)}{P(B)}=\sum \frac{P(A_i\cap B)}{P(B)}=\sum P(A_i|B) = \sum P'(A_i). $$ [QED]
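For completeness, the two axioms skipped in the proof can each be checked in one line. This sketch assumes the usual ordering of the axioms (non-negativity first, total probability one second), which is not spelled out in this section:

$$P'(A) = \frac{P(A\cap B)}{P(B)} \geq 0 \quad\mbox{since } P(A\cap B)\geq 0 \mbox{ and } P(B)>0,$$ $$P'(\Omega) = \frac{P(\Omega\cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1.$$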

EXERCISE 1: Show that if $P(A|B)=P(A)$ then $A,B$ must be independent. Is the converse true? Be careful with the second part!

Multiplication rule

EXERCISE 2: Show that if $P(A)>0$ then $P(A\cap B) = P(A)P(B|A).$

This result is just a minor rearrangement of the definition. But it has an intuitive interpretation. $A\cap B$ means both $A$ and $B$ have happened. We are finding its probability in two steps: first the probability that $A$ has happened, $P(A).$ Then, $P(B|A),$ the conditional probability that $B$ has happened given that $A$ has happened. This is often represented diagrammatically:

This form is particularly useful when $A,B$ are events such that $A$ indeed occurs before $B$ in the real world. Here is an example.

EXAMPLE 2: A box contains 5 red and 3 green balls. One ball is drawn at random, its colour is noted, and it is put back. Then one more ball of the same colour is added. Then a second ball is drawn. What is the probability that both the balls are green?

SOLUTION: Notice that randomness enters in two stages, since there are two random selections involved. Let $A$ be the event that the first ball is green, and $B$ be the event that the second ball is green.

We are to find $P(A\cap B) = P(A)P(B|A).$

What is the probability that the first ball is green? The answer is $P(A) = \frac 38.$ Before drawing the second ball, the composition of the box has changed depending on the outcome of the first stage. This is where conditional probability helps. Given that the first ball was green, we know the composition of the box before the second drawing: 5 red and $3+1=4$ green. So $P(B|A) = \frac 49.$

The final answer therefore is $\frac 38\times\frac 49 = \frac 16.$

It is instructive to check this by simulation.

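# Simulate the two-stage experiment of Example 2 many times.
balls = c('r','r','r','r','r','g','g','g')
n.sim = 5000
event = logical(n.sim)
for(i in 1:n.sim) {
  first.draw = sample(balls,1)
  newballs = c(balls,first.draw)  # put the ball back and add one more of the same colour
  second.draw = sample(newballs,1)
  event[i] = (first.draw=='g' && second.draw=='g')
}
mean(event)  # proportion of runs with both balls green; should be close to 1/6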

■

Often, in case of multistage random experiments, it is easier to think about the diagram than about the definition of conditional probability.

In a similar way, you can prove (by induction) the following theorem.

Multiplication rule Let $A_1,...,A_n,B$ be events such that $P(A_1\cap \cdots \cap A_n)>0.$ Then $$P(A_1\cap\cdots\cap A_n\cap B) = P(A_1)P(A_2|A_1)P(A_3|A_1\cap A_2)\cdots P(B|A_1\cap\cdots\cap A_n).$$
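The rule can be tried out on the scheme of Example 2. The sketch below assumes the same draw-replace-add step is simply repeated for more than two draws (Example 2 itself stops at two); the function name `all_green_prob` is made up for this illustration.

```r
# Probability that the first n_draws balls are all green when, after each draw,
# the ball is put back together with one more ball of the same colour.
all_green_prob <- function(red, green, n_draws) {
  p <- 1
  for (i in seq_len(n_draws)) {
    p <- p * green / (red + green)  # P(i-th ball green | all earlier balls green)
    green <- green + 1              # composition of the box after a green draw
  }
  p
}

all_green_prob(5, 3, 2)  # two draws:   3/8 * 4/9
all_green_prob(5, 3, 3)  # three draws: 3/8 * 4/9 * 5/10
```

Each factor in the loop is one conditional probability from the multiplication rule, with the conditioning event being "all earlier balls were green".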

Theorem of total probability

Sometimes an event can occur via different paths. To find the probability of such an event we need to add the probabilities of all the paths. This leads to the theorem of total probability.

Theorem of total probability Let $A_1,...,A_n$ be mutually exclusive and exhaustive events, where $\forall i~~P(A_i)>0.$ Let $B$ be any event. Then $$P(B) = \sum_{i=1}^n P(A_i)P(B|A_i).$$

Proof: The following diagram illustrates the situation.


Theorem of total probability

We need to add the probabilities of all the paths from Start to $B.$ The probability of a path is computed by multiplying the probabilities on the arrows along it.

Now let's write down the formal proof.

Since $A_1\cup\cdots\cup A_n=\Omega,$ we have $ B = B\cap \Omega = (B\cap A_1)\cup\cdots\cup (B\cap A_n).$

Also, since the $A_i$'s are disjoint, the $B\cap A_i$'s are disjoint as well.

So $P(B) = \sum_{i=1}^n P(B\cap A_i) = \sum_{i=1}^n P(A_i) P(B| A_i),$ where the last step uses the multiplication rule. This is the required result. [QED]
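As an illustration, the theorem gives the probability that the second ball in Example 2 is green, whatever the first ball was. The sketch below uses only the numbers of that example; the variable names are chosen here for readability.

```r
# Stage 1: colour of the first ball (mutually exclusive and exhaustive events).
p_first <- c(green = 3/8, red = 5/8)

# Stage 2: P(second ball green | colour of the first ball).
# After the first stage the box holds 9 balls: 4 green if the first was green, else 3 green.
p_second_green_given <- c(green = 4/9, red = 3/9)

# Add the probabilities of the two paths leading to "second ball green".
p_second_green <- sum(p_first * p_second_green_given)
p_second_green
```

The result is $\frac 38,$ the same as the probability that the first ball is green.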

Rejection sampling

Suppose that $\emptyset\neq A\subseteq B$ are finite sets. You have a list of all elements of $B.$ But you do not have a list of all elements of $A.$ However, given any element of $B$ you can check if it is in $A$ or not. In this case how can you draw one element randomly from $A?$

One way is to use rejection sampling. In this technique you draw one element of $B$ randomly. If it is in $A$, then stop and output that element. Else, you again draw a random element from $B$ (with replacement), and continue like this.

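With probability 1 this procedure terminates after a finite number of steps, because each draw lands in $A$ with the same positive probability $|A|/|B|.$ The output is equally likely to be any element of $A,$ i.e., it is a random sample from $A.$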
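A minimal sketch of the technique in R follows. The function name `rejection_sample`, the set $B=\{1,...,20\}$ and the choice of $A$ as the multiples of 3 are all made up for this illustration; the only thing the function is given about $A$ is a membership test.

```r
# Draw one element uniformly from A, where A is available only through a membership test.
rejection_sample <- function(B, in_A) {
  repeat {
    x <- B[sample.int(length(B), 1)]  # uniform draw from B (with replacement across tries)
    if (in_A(x)) return(x)            # accept only elements of A, otherwise try again
  }
}

set.seed(1)
B <- 1:20
in_A <- function(x) x %% 3 == 0       # hypothetical A: the multiples of 3 in B
draws <- replicate(6000, rejection_sample(B, in_A))
table(draws) / length(draws)          # each of 3, 6, ..., 18 should get a share near 1/6
```

Indexing with `sample.int` instead of calling `sample(B, 1)` avoids the special behaviour of `sample` when `B` happens to have length one.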

EXERCISE 3: How can you choose one of 5 friends with equal probability using only a fair die? The following R code will give a hint.

repeat {
  x = sample(6,1)
  if (x<=5) break
}

Bayes' theorem

Multi-stage random experiments are all around us. Many processes in nature occur step by step, and each step involves some randomness. Often the last layer of randomness is due to the measurement error. Bayes' theorem is a way to "remove" this last layer to look deeper.

The theorem of total probability lets us move forward along the arrows, while Bayes' theorem lets us move backwards.

Bayes' theorem (version 1) Let $A,B$ be any two events with $0<P(A)<1$ and $P(B)>0.$ Then $$P(A|B) = \frac{P(A)P(B|A)}{P(A)P(B|A)+P(A^c)P(B|A^c)}.$$

Proof: First think of the formula in terms of the following diagram. The denominator is the probability of reaching $B$ from Start. The numerator is the probability of only the red path.

The proof is very simple: $$P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A)P(B|A)}{P(B)} = \frac{P(A)P(B|A)}{P(A)P(B|A)+P(A^c)P(B|A^c)}, $$ as required. [QED]

Bayes' theorem (version 2) Let $A_1,...,A_n$ be mutually exclusive and exhaustive events. Let $B$ be any event. We assume $P(A_1),...,P(A_n), P(B)>0.$ Then for any $k=1,...,n,$ $$P(A_k|B) = \frac{P(A_k)P(B|A_k)}{\sum_{i=1}^n P(A_i)P(B|A_i)}.$$

EXERCISE 4: Look at the following diagram and write down the proof.


More general form of Bayes' theorem

The main idea behind Bayes' theorem goes beyond these two versions. Whenever you can draw an arrow diagram connecting events, and know all the labelling probabilities, you can apply Bayes' theorem.

Use of Bayes' theorem

EXAMPLE 3: I live in a locality where burglary is uncommon. The chance that a burglar breaks into my house is 0.1. I have a dog that is highly likely to bark (say, with 0.95 probability) if a burglar enters. However, otherwise my dog is a quiet one. If there is no burglar around, he barks with probability only 0.01. I hear my dog bark. What is the chance that a burglar has entered?

SOLUTION: Let $A=$ {burglar has entered} and $B=$ {dog barks}.

We are given that $$P(A)=0.1, ~~ P(B|A)=0.95,~~ P(B|A^c)=0.01.$$ So we get the following diagram.

We want to find $P(A|B).$ To apply Bayes' theorem we need to find $P(B).$ $$\begin{eqnarray*} P(B)&=&P(A)\cdot P(B|A)+P(A^c)\cdot P(B|A^c) \\ &=& 0.1 \times 0.95 + (1-0.1) \times 0.01 \\ &=& 0.104 \end{eqnarray*}$$ Now apply Bayes' theorem to get $$P(A|B)=\frac{0.1 \times 0.95}{0.104}\approx 0.913.$$

Diagrammatically, you can think like this. To find $P(B)$, we consider all paths from Start to $B$. Multiply the probabilities along each path and add. Thus $P(B)=0.1 \times 0.95 + 0.9 \times 0.01=0.104.$ Similarly, to find $P(A\cap B)$ add the probabilities of all the paths from Start to $B$ through $A.$

Here $P(A \cap B)=0.1 \times 0.95.$

So now you can find $P(A|B)=\frac{P(A \cap B)}{P(B)}.$ ■
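The computation in Example 3 can be reproduced in R, together with a simulation check. This is only a sketch: the helper `bayes_posterior` is a name invented here, and the simulation size is arbitrary.

```r
# Posterior probabilities P(A_k | B) from priors P(A_k) and likelihoods P(B | A_k).
bayes_posterior <- function(prior, likelihood) {
  joint <- prior * likelihood   # P(A_k and B): one path through each A_k
  joint / sum(joint)            # divide by P(B), obtained by total probability
}

prior      <- c(burglar = 0.1,  no_burglar = 0.9)
likelihood <- c(burglar = 0.95, no_burglar = 0.01)   # P(dog barks | ...)
posterior  <- bayes_posterior(prior, likelihood)
posterior["burglar"]

# Monte Carlo check: simulate both stages, then look only at the runs where the dog barks.
set.seed(1)
n <- 200000
burglar <- runif(n) < 0.1
bark    <- runif(n) < ifelse(burglar, 0.95, 0.01)
mean(burglar[bark])   # should be close to the posterior computed above
```

The simulation mirrors the structure of the problem: the first stage is generated unconditionally, the second stage conditionally on the first, and conditioning on the observation amounts to discarding the runs without a bark.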

This is an example of a two stage random experiment. The first stage is whether a burglar enters or not. The second stage is whether the dog barks or not.

As in the above example, a typical problem starts by telling you the unconditional probabilities for the first stage, and the conditional probabilities for the second stage given the first. Only the outcome of the second stage is observed, and the problem is to find the conditional probability of the first stage given the outcome of the second stage.

The same approach is applicable to any similar multistage experiment.

Urn Models

What are they?

An urn model is a multistage random experiment. It consists of one or more boxes (called urns), each containing coloured balls (balls are all distinct, even balls having the same colour). Balls are drawn at random (by simple random sampling with or without replacement, SRSWR or SRSWOR) and depending on the outcome, some balls are added/removed/transferred. Then again a few balls are drawn, and so on. Here is one example.

EXAMPLE 4: An urn contains 3 red and 3 green balls. One ball is drawn at random, its colour noted, and returned to the urn. Then another ball of the same colour is added to the urn. Then the same process is repeated again and again. The possibilities grow like this:

Typical questions of interest here are:

What is the probability that at the $10$-th stage we shall have 12 red and 4 green balls?
What is the probability that the ball drawn at stage $n$ is red?
Given that we have exactly 6 red balls at the 9-th stage, what is the (conditional) probability that we had exactly 4 red balls at the 6-th stage?

■

All such questions may be answered by using the theorem of total probability and Bayes' theorem. By the way, one of the above three questions may be answered immediately. Which one? What is the answer?

The above urn model is an example of the Polya Urn Model, where in general we start with $a$ red and $b$ green balls, and at each stage a random ball is selected and replaced, and $c$ more balls of its colour are added.
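As a sketch of how the theorem of total probability answers the first question of Example 4, the R code below builds the distribution of the number of red balls stage by stage. It assumes that "at the 10-th stage" means "after 10 draws and additions"; the function name `polya_red_dist` and its argument `add` (playing the role of $c$) are invented for this illustration.

```r
# Distribution of the number of red balls drawn in the first n stages of a Polya urn
# that starts with a red and b green balls and adds `add` balls after every draw.
polya_red_dist <- function(a, b, n, add = 1) {
  dist <- 1                                        # dist[k + 1] = P(k reds drawn so far)
  for (t in seq_len(n) - 1) {                      # t = number of draws already made
    k <- 0:t
    p_red <- (a + k * add) / (a + b + t * add)     # P(next ball red | k reds so far)
    new_dist <- numeric(t + 2)
    new_dist[k + 1] <- dist * (1 - p_red)              # paths where the next ball is green
    new_dist[k + 2] <- new_dist[k + 2] + dist * p_red  # paths where the next ball is red
    dist <- new_dist
  }
  dist
}

d <- polya_red_dist(3, 3, 10)
p_12_red <- d[9 + 1]   # 12 red and 4 green in the urn means 9 reds among the 10 draws
p_12_red
```

Each pass through the loop is one application of the theorem of total probability: the probability of having $k$ reds after the next draw is the sum over the two paths that lead there. Under the stated assumption the answer works out to $\frac 5{91}.$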

Why care?

You may see this link for further discussion. Some real life scenarios can be mathematically treated as urn models.

We shall discuss two such examples.

EXAMPLE 5: Most people form their opinions based on random personal experience, instead of a carefully planned overall survey of a situation. Polya's urn model is a simple version of this, as the following story shows.

An American lady comes to India. She has heard about the unhygienic conditions prevailing here, and is apprehensive about flu. Well, as luck would have it, on her way from the airport she meets a man suffering from flu. "Oh my," she shudders, "so the rumour about flu is not unfounded, it seems!". The very next day her city tour is cancelled, because the guide is down with flu. "What a terrible country this is!", the lady starts to worry, "It is full of flu!" So imagine her panic when on the third day she learns that a waiter in the hotel has caught the disease.

Now here is the story of another American visitor to our country. He is also apprehensive of flu. But on the first day he does not meet any flu case. "Maybe this fear of flu in India is a rumour after all," he thinks with some relief at the end of the day. The next day passes, and still he does not meet a single person with flu. He is now quite confident that the apprehension about flu is not serious. When yet another day further supports his optimistic belief, he starts thinking that the expensive flu vaccine he took back home was a waste of money.

Which of these two viewpoints is reasonable? Neither. They both formed their own ideas based on their personal random experience. The true prevalence of flu in India is the same for both of them, but their personal beliefs about it are drastically different.

Polya's urn model captures this idea. A red ball means fear of flu, a green ball means the opposite. Initially they are equal in number. The lady met a flu case on day 1 (i.e., randomly selected a red ball), and her fear deepened (one more red ball added). The man did not meet any flu case on day 1 (green ball selected), so his courage increased (one more green ball added). Yet, what is the chance of selecting a red ball at stage 1? It is still $\frac 12,$ the same as at stage 0 (i.e., the true prevalence rate of flu has not changed from stage 0).

This model also demonstrates a common phenomenon: once you randomly select balls of a certain colour in the first few stages, the (conditional) probability of selecting more balls of that colour increases. Indeed, people who have met more good people in their childhood tend to see more good people around them. Similarly, people who have met more bad people during their childhood are more likely to find faults with everybody.

However, one must understand that the real situation is far too complex to be captured adequately by Polya's urn model. ■

Here is another real life situation captured by urn models.

EXAMPLE 6: In the Ehrenfest model of heat exchange physicists consider two connected containers with $k$ particles distributed between them. At each step a particle is chosen at random and transferred to the other container. The question is: what is the distribution of particles at the $n$-th stage? This may be thought of as follows: one urn contains $k$ balls, some of which are red and the rest green. A ball is drawn at random, removed, and another ball of the opposite colour is added. Here red balls play the role of particles in the first container, and green balls those in the other. ■
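A short simulation makes the Ehrenfest urn concrete. This is a sketch only: the function name `ehrenfest_path`, the choice $k=20,$ the starting state (all balls red) and the number of steps are arbitrary values picked for illustration.

```r
# Ehrenfest urn: k balls, each red (container 1) or green (container 2).
# At every step one ball is picked at random and replaced by a ball of the opposite colour.
ehrenfest_path <- function(k, red_start, n_steps) {
  red <- numeric(n_steps + 1)
  red[1] <- red_start
  for (i in seq_len(n_steps)) {
    pick_red <- runif(1) < red[i] / k              # chosen ball is red with probability red/k
    red[i + 1] <- red[i] + ifelse(pick_red, -1, 1) # a red ball leaves, or a green one turns red
  }
  red
}

set.seed(1)
path <- ehrenfest_path(k = 20, red_start = 20, n_steps = 2000)
range(path)
mean(path[-(1:200)])   # after the initial transient, typically close to k/2
```

The vector `path` records the number of red balls after each step, which is the quantity whose distribution at the $n$-th stage the example asks about.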

Fallacies regarding conditional probability

Conditional probabilities are often used wrongly in our everyday life. Here are three examples.

Mistaking $P(A|B)$ for $P(B|A)$

Parents of most prospective candidates for ISI admission wonder: "Does a particular coaching centre increase the chance of admission to the ISI?" Stated in terms of probabilities this is a question involving $P(A|B)$ where $A$ is the event that a (randomly selected) student gets admitted to ISI, and $B$ is the event that the student went to that coaching centre.

Most parents go about guessing $P(A|B)$ as follows. They ask successful students from previous years whether they studied at that coaching centre. When they hear that 90% of these students came from that centre, they are impressed by its performance.

Is this decision logically valid?

No. What the parents learned from their survey was that $P(B|A)$ is large. This does not imply in any way that $P(A|B)$ is large. They should have surveyed the students who went to the coaching centre and figured out the proportion that got admitted. This proportion could have been (and most often is) microscopically low.
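A small numerical illustration shows how far apart the two conditional probabilities can be. All counts below are hypothetical, invented only to match the 90% figure in the story.

```r
# Hypothetical counts, for illustration only.
applicants       <- 10000  # students who took the admission test
coached          <- 6000   # of these, students who attended the coaching centre (event B)
admitted         <- 50     # students who were admitted (event A)
admitted_coached <- 45     # admitted students who attended the centre (A and B)

p_B_given_A <- admitted_coached / admitted   # what the parents observed
p_A_given_B <- admitted_coached / coached    # what they actually wanted to know
p_A         <- admitted / applicants         # admission chance without any information

c(p_B_given_A = p_B_given_A, p_A_given_B = p_A_given_B, p_A = p_A)
```

With these made-up numbers $P(B|A)$ is 0.9 while $P(A|B)$ is below 0.01. The relevant comparison for the parents is between $P(A|B)$ and $P(A),$ not the size of $P(B|A).$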

Simpson's Paradox

Suppose that $A_1,A_2$ and $B$ are three events such that $P(A_1|B) < P(A_2|B)$ and also $P(A_1|B^c) < P(A_2|B^c).$

Can you conclude from this that $P(A_1) < P(A_2)?$ Think carefully before you answer.

Now consider the following real life data set.

It is about the number of death penalties given in murder cases. The cases have been classified by three factors:

the race of the victim (i.e., the person murdered): white or black
the race of the defendant (i.e., the person accused): white or black
whether death penalty was given: yes or no.

The red and green parts give the actual data, the remaining numbers are derived from them. For example the 11.3 is obtained as $53/(53+414).$ The blue part is obtained by adding the red and green parts. For example, $414+16=430.$

Now consider the cases where the victim is white (the red part in the table). Notice that for white defendants 11.3% got a death penalty, while for black defendants the percentage is 22.9%. Thus if

$A_1$ denotes the event "White defendant gets death penalty"
$A_2$ is the event that "Black defendant gets death penalty",
$B$ is the event that "the victim is white",

then we infer $P(A_1|B) < P(A_2|B).$

Again, focusing on the green part we get a similar observation (0.0 < 2.8). So we infer $P(A_1|B^c) < P(A_2|B^c).$

So we combine these to conclude $P(A_1) < P(A_2).$ Thus, it seems that the victim's race does not matter: a white defendant is always less likely to get a death penalty.

So let's ignore the victim's race. This basically means adding the red and green tables to get the blue table. A similar argument based on this combined table, however, seems to indicate $P(A_1) > P(A_2)$ since $11.0 > 7.9.$

What went wrong? This is called Simpson's paradox and often crops up in practice.
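The percentages quoted above can be recomputed in R. In this sketch the counts 53, 414 and 16 are the ones quoted in the text; the remaining counts (11, 37, 0, 4, 139) are assumptions, chosen so that every quoted percentage and the sum $414+16=430$ are reproduced. The object and function names are invented here.

```r
# Death penalty (yes / no) by defendant's race, separately for each victim's race.
# 53, 414 and 16 are quoted in the text; the other counts are assumed values
# chosen to reproduce the quoted percentages.
white_victim <- rbind(white_def = c(yes = 53, no = 414),
                      black_def = c(yes = 11, no = 37))
black_victim <- rbind(white_def = c(yes = 0, no = 16),
                      black_def = c(yes = 4, no = 139))
combined <- white_victim + black_victim      # ignore the victim's race

# Percentage of defendants of each race who received the death penalty.
pct_yes <- function(tab) round(100 * tab[, "yes"] / rowSums(tab), 1)

pct_yes(white_victim)   # white victims
pct_yes(black_victim)   # black victims
pct_yes(combined)       # all victims together
```

Within each of the two tables the percentage is lower for white defendants, while in the combined table it is higher. Comparing the row totals of the two tables is a good starting point for working out why.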


Monty Hall problem

This is based on a popular TV reality show.

The host of the program shows you three closed doors. You know that one of them, chosen at random, hides a car (the prize), while the remaining two doors hide goats (considered valueless). You are to guess which door has the car. If you guess correctly, then you get the car. Once you choose a door, the host opens some other door and shows that there is a goat behind it. Now you are given an option to switch to the other closed door. Should you switch? Remember that the host knows the contents behind each door and will always show you a door with a goat.


Here are two ways to think about this, both natural but leading to opposite conclusions:

Whether your original selection was right or wrong, there is always at least one other door hiding a goat. So the host will always open that. There is no extra info in it. Thus, nothing can be gained by switching.
Earlier you had three doors and knew nothing about their contents. Now you at least know the content behind one door. In light of this extra information, switching is justified.

The confusion remains even if you do some conditional probability computations. Let's label the door you chose originally by the number 1. Also let's label with the number 2 the door opened by the host. The remaining door is labelled 3.

Here the sample space is $\{1,2,3\},$ the numbers denoting the possible positions of the car. The unconditional probabilities were $\frac 13$ each. The conditional probabilities are $\frac 12, 0, \frac 12.$

Does the confusion go away now? Unfortunately, no:

Since $\frac 12 > \frac 13,$ you should switch.
But the conditional probabilities of doors 1 and 3 are both $\frac 12.$ So nothing is to be gained by switching.

How to resolve the paradox? You might like to simulate the situation using R. Allegedly, the famous mathematician Paul Erdős was not convinced about the correct answer until he was shown a computer simulation!

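car = sample(3,1000,replace=TRUE)  # position of the car in each of 1000 games
host  = c(3,2,3)                   # host[k] = door kept closed when the car is behind door k
other = host[car]                  # the door you would switch to
sum(car==1)                        # number of wins if you stay with door 1
sum(car==other)                    # number of wins if you switch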

Here is an explanation of the code. We play the game 1000 times, each time freshly randomizing the position of the car. This is done in the first line of the code. We need a strategy for the host. Remember that the door you selected first is called door 1. So the host's strategy is like a function that maps the car's true position to the door (other than door 1) that is kept closed. If the car is not behind door 1, then the host has only one choice: he must keep the car's door closed. If the car is behind door 1, then the host can open either 2 or 3. Here, w.l.g., we are keeping 3 closed. So the function is $host[1]=3, host[2]=2$ and $host[3]=3.$ In other words, the strategy is the array $(3,2,3).$ The last two lines count the wins if you stay with door 1 and the wins if you switch.
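A variant of the simulation lets the host choose at random when he has a choice, instead of always keeping door 3 closed. This is a sketch under that assumption; the variable names `opened` and `switch_to` and the number of games are chosen here for illustration.

```r
set.seed(1)
n <- 10000
car <- sample(3, n, replace = TRUE)   # true position of the car; you always pick door 1

# Door opened by the host: never door 1 and never the door hiding the car.
# If the car is behind door 1, the host is assumed to open door 2 or 3 at random.
opened <- ifelse(car == 2, 3,
          ifelse(car == 3, 2, sample(2:3, n, replace = TRUE)))

switch_to <- 6 - 1 - opened           # the remaining closed door (door numbers add up to 6)

c(stay = mean(car == 1), switch = mean(car == switch_to))
```

The proportion of wins from switching should come out near $\frac 23$ and that from staying near $\frac 13,$ the same as with the fixed host strategy, since switching wins exactly when the car is not behind door 1.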

Problems for practice

EXERCISE 5: Is it true that $P(A|B)+P(A^c|B)=1?$ Is it true that $P(A|B)+P(A|B^c)=1?$
