Basic concepts
Probability axioms
Continuity of probability function
Inclusion-Exclusion formula
Countable sample spaces
Equally likely cases
Simulations
Problems for practice
Sets and Venn diagrams
Axioms
Equally likely
Harder

$\newcommand{\ev}{{\mathcal F}}$

Basic concepts

A random experiment is an activity whose outcome we cannot predict. The set of all outcomes is called its sample space. Each element of this set is called a sample point. By the term event we understand a subset of the sample space.

EXAMPLE 1: A coin toss is a random experiment. Its sample space is $\{head,tail\}.$ There are four different events possible here: $\phi, \{head\},\{tail\},\{head,tail\}.$ ■

EXAMPLE 2: Rolling a die is another random experiment. The sample space here is $\{1,2,3,4,5,6\}.$ One possible event is the set of all even numbers, $\{2,4,6\}.$ ■

I give you a coin to toss. Before tossing, you carefully inspect it. You find no difference at all between the two sides (except for the pictures on them). So you infer that both head and tail are equally likely (i.e., you are equally ignorant about both sides). It is common to express this situation as "50-50" chance, or $50\%$ chance of a head (or a tail), or probability of a head (or a tail) is $\frac 12.$

The main idea is that the chance of a head equals the chance of a tail. We like to express this by first imagining a totality and then halving it. This totality may be taken as 100 or 100% or 1 or any other positive number.

In probability theory we take the total as $1.$ This choice is justified by statistical regularity, as the following example shows.

EXAMPLE 3: Consider rolling a fair die. Let $A$ be the event that we get a prime number, i.e., $A=\{2,3,5\}.$ Intuitively, the probability of $A$ should be $\frac 12.$ We shall use R to perform 5000 trials of this random experiment and check the running proportion of cases that the event $A$ happens.

x = sample(6,5000,rep=T)
A = x %in% c(2,3,5)
plot(cumsum(A)/(1:5000), ty='l')

Notice that the proportions indeed tend towards $\frac 12.$

Similarly, we can get an idea of the probability of $B=\{1,3,4,5\}$ using the following R code.

x = sample(6,5000,rep=T)
B = x %in% c(1,3,4,5)
mean(B)

■

With each event we assign a probability which is a number from $[0,1].$ In practice it is difficult (impossible?) to get a biased coin (i.e., a coin which is more likely to show one side than the other). It is very easy to simulate such a coin though:

x = sample(c('h', 't'), 1000, prob=c(0.7,0.3), rep=T)
sum(x=='h')
sum(x=='t')

The prob=c(0.7,0.3) specifies the probabilities.

If the sample space is countable (finite/infinite), then generally $\ev$ is just the power set of $\Omega.$ If $\Omega$ is uncountable, then there may be some "bad" subsets for which probability cannot be defined! These are not called events, and so are not considered as members of $\ev.$ In this case $\ev$ is a strict subset of the power set of $\Omega.$ Fortunately, we shall not come across such "bad" subsets any time soon.

Let $\ev$ be the set of all events in a sample space. For example, if $\Omega = \{head, tail\}$ then $\ev$ is $\big\{\phi,\{head\},\{tail\},\{head,tail\} \big\}.$

Then a probability is a function $P:\ev\rightarrow[0,1].$

Of course, not all such functions can be a probability. A function needs to satisfy certain common sense conditions to be called a probability function.

EXAMPLE 4: A report says that the health condition in a country is so bad, that the chance of a newborn baby surviving for at least 1 year is only 50%. However, the chance that he survives for at least 5 years is 90%. Does it sound odd?

SOLUTION: Yes. Any baby surviving for at least 5 years has of course survived the first year as well. So how can the latter chance be larger? ■

This example gives one common sense condition that a probability function must satisfy: if $A\subseteq B$ then we should have $P(A)\leq P(B).$

Of course, there are many many such common sense conditions and it is difficult to come up with a complete list. A Russian mathematician named Kolmogorov reduced this list to only 3 conditions, called the probability axioms.

Probability axioms

A collection of sets is called disjoint if the intersection of any two sets from the collection is empty. They are also called mutually exclusive

Probability axioms Let $\Omega$ denote the sample space. Let $\ev$ denote the set of all events. Then

For any event $A\in\ev$ we have $P(A)\geq 0$
$P(\Omega)=1$
If $A_1,A_2,...\in\ev$ are countably many (finite/infinite) disjoint events then $$ P(\cup A_n) = \sum P(A_n). $$

You'll often see this sentence:

"$(\Omega,\ev,P)$ is a probability space".

This is a shorthand for:

$\Omega$ is a nonempty set (sample space),
$\ev$ is the collection of all events,
$P:\ev\rightarrow{\mathbb R}$ is a probability (i.e., a function satisfying the three probability axioms).

Notice the last axiom. Here the sum may involve infinitely many terms. Such a sum is called an infinite series. You'll learn about them in details in your analysis course. But for now you may quickly read this crash course on infinite series.

It is a remarkable fact that whatever other common sense condition one has been able to think of so far actually follows as a consequence of these! Also, you cannot drop any of these requirements, in the sense that no two of these imply the other. Can you show this?

It is an interesting exercise to derive various common sense conditions from these axioms. Here is one.

EXAMPLE 5: If $A\subseteq B$ then show that $P(A)\leq P(B).$

SOLUTION: Split $B$ as $B=A\cup (A^c\cap B).$

The two events in RHS are disjoint. So by Axiom 3 we have $$ P(B) = P(A) + P(A^c\cap B). $$ The second probability in the RHS is $\geq 0$ (by Axiom 1). So done. ■

Try your hand at these:

EXERCISE 1: Show $P(A^c) = 1-P(A).$

EXERCISE 2: Show $P(\phi)=0.$

EXERCISE 3: If $A_1\subseteq A_2\subseteq A_3,$ then show that $P(A_1\cup A_1\cup A_3) = P(A_1)+P(A_2\cap A_1^c)+P(A_3\cap A_2^c).$

EXERCISE 4: Show $P(A\cup B) = P(A)+P(B)-P(A\cap B).$

Next we shall prove some common sense properties that will require more effort.

Continuity of probability function

We know that if $f(x)$ is a continuous function and $x_n\rightarrow a$ then $f(x_n)\rightarrow f(a).$ Any probability function has a similar property.

Definition: Increasing limit Let $A_1,A_2,...$ be a sequence of events. Let $A$ be any event. We say that $A_n$'s increase to $A$ (and write $A_n\nearrow A$) if $$A_1\subseteq A_2\subseteq A_3\subseteq\cdots$$ and $$A = \cup_n A_n.$$

Definition: Decreasing limit Let $A_1,A_2,...$ be a sequence of events. Let $A$ be any event. We say that $A_n$'s decrease to $A$ (and write $A_n\searrow A$) if $$A_1\supseteq A_2\supseteq A_3\supseteq \cdots$$ and $$A = \cap_n A_n.$$

Theorem If $A_n\nearrow A$ or $A_n\searrow A$ then $P(A_n)\rightarrow P(A).$

Proof: Let's do the $A_n\nearrow A$ case first.

Define $B_1=A_1$ and for $n\geq 2$ let $B_n =A_n\setminus A_{n-1}.$

Then $B_1,B_2,B_3,...$ are all disjoint. (Why? Don't just say "obvious!". Write a one line proof.) Also, for $n\in{\mathbb N},$ we have $A_n = B_1\cup\cdots\cup B_n.$

So $A = B_1\cup B_2\cup\cdots.$(Why?)

Hence, by the third axiom, $P(A) = \sum_1^\infty P(B_i) = \lim_n \sum_1^n P(B_i) = \lim_n P(A_n),$ as required.

The $A_n\searrow A$ case follows on taking complements.

[QED]

EXERCISE 5: (A puzzle) You are approached by a gambler at a casino. "Hmm, youngster", he remarks as he looks you up and down, "you seem to be new here. Let me offer you some money." He comes closer, sits beside you, and continues in a friendly voice, "Here I have a die, a fair one. You shall roll it again and again. After each roll, we shall do a little transaction like this: if the die shows six, you pay me some positive amount, say $t.$ But if it shows any other number I shall pay you ten times that amount. Does that sound like a good game to you?"

Being freshly admitted to ISI, you are of course proud of your probability skills, and reply "Yes".

"That's very good for you, very good indeed", exclaims the man in glee, "but it is not good for me, you see. I just made the offer because I took a liking to you. I hope that you would keep two requests in return."

You get cautious, but agree to hear them anyway.

"The first request is that I shall dictate the value of $t$ before each roll". Noticing a cloud of worry upon your face, he adds, "Don't worry, $t $ will always be positive, and I shall fix the amount before the roll."

You see no harm in that, and ask him to proceed.

"The second favour that I ask for is to call it quits whenever I like. That means I shall decide when the game will stop. It is only the barest protection for me, you see. I shall soon become bankrupt, and then I at least need to have my right to go back home! Surely you would not deny me that !"

You find the entire offer reasonable enough, and so accept it.

Have you done a wise thing? [Hint: the die is indeed fair, and there is no word play. It is a pure mathematical puzzle.]

Inclusion-Exclusion formula

Just now we have mentioned the result $P(A\cup B) = P(A)+P(B)-P(A\cap B).$ We can think of this like

$P(A)+P(B)$ overestimates $P(A\cup B)$ because the $P(A\cap B)$ part is included twice. So we need to exclude it once.

This idea of inclusion and exclusion works for any finite number of events.

Inclusion-Exclusion formula Let $A_1,...,A_n$ be any $n$ events. Let $T$ denote the set of all subsets of $\{1,...,n\}.$ Then $$P(A_1\cup \cdots \cup A_n) = \sum_{k=1}^n \left[(-1)^{k+1} \sum_{\alpha\in T,|\alpha|=k} P(A_\alpha)\right],$$ where for any $\alpha\subseteq T$ we define $$A_\alpha = \cap_{i\in \alpha} A_i.$$

Proof: The notation is a bit complicated. Let's understand it first with the $n=3$ case. Here the first term of the outer sum consists of the sum of all $A_\alpha$ where $\alpha\in T$ and $|\alpha|=1$ (i.e., all singleton subsets of $\{1,2,3\}$). This sum is simply $$P(A_1)+P(A_2)+P(A_3).$$ Similarly, the next term consists of all $A_\alpha,$ where $\alpha$ is a doubleton subset of $\{1,2,3\}.$ Remember that $A_{\{1,2\}} = A_1\cap A_2$ and so on. So the second term (for $k=2$) becomes $$-[P(A_1\cap A_2)+P(A_2\cap A_3)+P(A_1\cap A_3)].$$ The third term (for $k=3$) similarly is $P(A_1\cap A_2\cap A_3).$ So the entire sum looks like $$P(A_1\cup A_2\cup A_3) = [P(A_1)+P(A_2)+P(A_3)]-[P(A_1\cap A_2)+P(A_2\cap A_3)+P(A_1\cap A_3)]+P(A_1\cap A_2\cap A_3).$$ The Venn diagram shows why this formula is correct. But a Venn diagram cannot be considered as a proof, as it shows only one possible case. However, a Venn diagram does indicate how to construct a general proof. Note that $A_1\cup A_2\cup A_3$ is made of certain disjoint events:


Cells

We have coloured these using blue, green and red. Blue cells consist of points belonging to exactly one $A_i.$ For example, $B_1$ is the set of points that belong only to $A_1.$ The green cells consist of points belonging to exactly two $A_i$'s, and so on. So $$A_1 = B_1\cup B_{12}\cup B_{13} \cup B_{123},$$ and similarly for $A_2,$ and $A_3.$ Note the pattern: all the $B's$ with $1$ somewhere in the subscript has occurred in the RHS. Since the events in the RHS are disjoint, so we have $$P(A_1) = P(B_1)+P(B_{12})+P(B_{13})+P(B_{123}).$$ Now, the first stage (inclusion) is $$\begin{eqnarray*} P(A_1)+P(A_2)+P(A_3) & = & [P(B_1)+P(B_{12})+P(B_{13})+P(B_{123})]\\ & & +[P(B_2)+P(B_{12})+P(B_{23})+P(B_{123})]\\ & & +[P(B_3)+P(B_{13})+P(B_{23})+P(B_{123})] \end{eqnarray*}$$

Note that here each $B$ with a single subscript occurs once, each $B$ with two subscripts occur twice, and so on. This is of course natural, since for example, $B_{12}$ occurs once as part of $A_1$ and then again as part of $A_2.$

Next, we have $$A_{12} = B_{12}\cup B_{123}.$$ Here all the $B$'s with $12$ in the subscript occur in the RHS. Again, since the $B$'s are all disjoint, $$ P(A_{12}) = P(B_{12})+P(B_{123}). $$ Similarly for $A_{23}$ and $A_{13}.$ Using these, the second stage (exclusion) is $$\begin{eqnarray*} P(A_{12})+P(A_{23})+P(A_{13}) & = & [P(B_{12})+P(B_{123})]\\ & & +[P(B_{23})+P(B_{123})]\\ & & +[P(B_{13})+P(B_{123})] \end{eqnarray*}$$ Note that no $B$ with a single subscript occurs at all. The $B$'s with two subscripts occur once each, while the $B$ with three subscripts occur thrice. Do you see the pattern? Each $B$ is like $B_\beta,$ where $\beta$ is a nonempty subset of $\{1,2,3\}.$ Similarly, each $A$ is like $A_\alpha,$ where $\alpha $ is also a nonempty subset of $\{1,2,3\}.$ Now $B_\beta$ occurs as a part of $A_\alpha$ if and only if $\alpha\subseteq \beta.$ So the number of times we see $B_{123}$ in the RHS is same as the number of subsets of size 2 of $\{1,2,3\},$ which is $\binom{3}{2}=3.$ The same technique will also explain why $B_{23},$ for example, occurs only once: the number of subsets of size 2 of $\{2,3\}$ is $\binom{2}{2}=1.$ In fact, the same approach also explains the absence of $B$'s with single indices. Each single index $B$ occurs $\binom{1}{2}=0$ times!

The third stage (inclusion) is similar, though simpler. Here $A_{123} = B_{123},$ and so $P(A_{123}) = P(B_{123}).$ Our basic pattern holds here also: The 3-index $B$ occurs $\binom33=1$ time. The $2$-index $B$'s occur $\binom23=0$ time, and the 1-index $B$'s occur $\binom13=0$ time.

The following table gives a summary:

No. of indices of $B$	Considered how many times in			Total
No. of indices of $B$	Stage 1 (incl)	Stage 2 (excl)	Stage 3 (incl)
1	$\binom11$	$\binom12$	$\binom13$	1-0+0=1
2	$\binom21$	$\binom22$	$\binom23$	2-1+0=1
3	$\binom31$	$\binom32$	$\binom33$	3-3+1=1

Now we head for the general case for any given $n.$

For any $\alpha\in T$ define

$B_\alpha=$ the set of all those points that belong to $A_i$ iff $i\in \alpha.$

Clearly, by definition, $B_\alpha$'s are all disjoint and $$A_1\cup\cdots\cup A_n = \cup_{\alpha\in T} B_\alpha.$$ Now observe that for any $\alpha \in T$ $$A_\alpha = \cup_{\beta\supseteq\alpha} B_\beta.$$ So $$P(A_\alpha) = \sum_{\beta\supseteq\alpha} P(B_\beta).$$ So the number of times a $k$-index $B$ is considered in the $r$-th stage is $\binom{k}{r}.$ (Click here for more explanation.) More precisely, $$ \sum_{|\alpha|=r} P(A_\alpha) = \sum_{k=1}^r \binom{k}{r} \sum_{|\beta|=k} P(B_\beta). $$

Hence total number of inclusion of a $k$-index $B$ is $$ \binom{k}{1}-\binom{k}{2}+-\cdots+(-1)^{n+1} \binom{k}{n} = 1-(1-1)^k=1, $$ using binomial theorem.

Hence every $B$ is included exactly once in the RHS. Thus, $$ P(\cup_{\alpha\in T}B_\alpha) = RHS, $$ as required. [QED]

Notice that the proof has used only the third axiom of probability. So if we have any function $P(\cdot)$ that satisfies the third axiom, the theorem is valid for that function as well. Examples of such functions include length, area, volume, mass, number of elements (for finite sets). In short, it is true for any measure of size.

Indeed, all functions that satisfy axiom three are called (signed) measures, and measure theory is the branch of mathematics that deals with them.

Countable sample spaces

If the sample space $S$ is countable (finite/infinite), say $$ S = \{x_1,x_2,...\}, $$ then take any sequence $p_1,p_2,...$ of nonnegative numbers adding up to $1.$ Defining $P(\{x_i\})=p_i$ completely specifies a probability. Conversely, any probability can be constructed like this.

Equally likely cases

The simplest special case is when the sample space $\Omega$ is finite (say $|\Omega|=n$) and we take $p_1=\cdots=p_n=\frac 1n.$

In this case, for any $A\subseteq \Omega$ we have $P(A) =|A|/|\Omega|.$

Many interesting problems fall in this category. They are basically problems of combinatorics. One type of problem is occupancy problems, where we have some boxes and some balls are distributed over them following various conditions.

EXAMPLE 6: There are three distinct boxes and 10 distinct balls. The balls are dropped randomly among the boxes so that all possible configurations are equally likely. (No ball is outside a box, and each box can hold all the balls.) What is the probability that the first box is empty?

SOLUTION: Each of the 10 balls has 3 possible destinations, irrespective of the other balls. So the total number of configurations is $3^{10}.$ So $|\Omega|=3^{10}.$

Let $A$ be the event that the first ball remains empty. Then $A$ occurs if and only if all the balls land in the other 2 boxes. So $|A|=2^{10}.$

Since all outcomes are equally likely hence $$P(A)= \frac{|A|}{|\Omega|} = \left(\frac 23\right)^{10}.$$ ■

EXAMPLE 7: Same problem as above, except that the balls are now identical. The boxes are still distinct. What is the answer now?

SOLUTION: By the bar-star argument $|\Omega| = \binom{12}{2}=66.$

Similarly, $|A| = \binom{11}{1} = 11.$

So the answer is $\frac 16.$ ■

Certain real-life scenarios may be modelled like this. Here are a few examples from physics (no need to cram these terms for the exams!).

EXAMPLE 8: There are $r$ (identical/distinct) particles. Each particle may be in one of $n$ distinct states. We can think of the particles as balls and the states as boxes. For example, if the states are UP and DOWN, and there are $r=12$ identical particles, among which 5 are in UP state and 7 in DOWN, we can visualise this as:

If there are $r=3$ distinct particles and the same two states, then the picture could be like:

Physicists assume various types of probabilities on these.

Maxwell-Boltzmann distribution: The balls are distinct. Then there are $n^r$ many possible configurations (each of the $r$ balls has $n$ possible destinations). All these $n^r$ configurations are assumed equally likely. No real life particle shows this behaviour.

Bose-Einstein distribution: The balls are identical. So, by the bar-star argument, we know that there are ${n+r-1\choose n-1}$ configurations possible. These are assumed equally likely.

Fermi-Dirac distribution: Here $n\geq r$ and no box can hold more than one ball. Balls are identical. So ${n\choose r}$ distinct configurations are possible (since a configuration is determined completely by which of the boxes have one ball in it). All these configurations are assumed equally likely.

■

Next we discuss a type of problems that are used in statistical quality control.

EXAMPLE 9: An electronic component is packaged 100 per box. Each component may be either good or defective. We want to accept a box if and only if it has no defective component in it. Testing all the 100 components one by one is too time consuming (and also useless if the test is destructive). So instead we draw a simple random sample without replacement (SRSWOR) of size 10 and reject the box if any of these 10 turns out to be defective. Find the probability that a box containing exactly 5 defectives will be rejected.

SOLUTION: Let $\Omega = $ all samples of size 10. What is its size? Instead of writing ${100\choose 10}$, we shall follow the steps that a typical quality control officer would take: pick one, then pick the next, then the next and so on. This stepwise approach is generally better (Why?) than jumping into a $^nC_r$ or $^nP_r$ formula.

Here the first step may be done in 100 ways, the next in 99 ways, etc all the way upto $100-10+1=91$ ways.

So $|\Omega| = 100\times 99\times\cdots\times 91.$

By the given condition, all the outcomes are equally likely.

Let $A = $ the event that the box is rejected.

Thus $A$ is the event that the box contains 1 or more defectives.

There are 5 possible cases here: exactly $i$ defectives, where $i=1,...,5.$ These are disjoint cases. So we may try to find the probabilities of each and add up. But we can save the work by working with $A^c$ instead.

Here $A^c=$ the event that the box contains no defective. Again we shall find $|A^c|$ stepwise.

The first component may be selected in 95 ways (avoiding the 5 defectives). The next in 94 ways, etc up to the 10-th component, which may be selected in $95-10+1 = 86$ ways.

Hence $|A^c| = 95\times94\times\cdots\times86.$

Thus, $$ P(A^c) = \frac{|A^c|}{|\Omega|} = \frac{95\times94\times\cdots\times86}{100\times99\times\cdots\times91}. $$ ■

Here is one example where we shall need the inclusion-exclusion formula.

EXAMPLE 10: 10 distinct balls are dropped randomly over 3 (distinct) boxes. Each box can hold any number of balls. What is the probability that at least one box will remain empty?

SOLUTION: Such "at least" problems usually mean that our event is the union of some simpler events. Since the simpler events may not be disjoint, we need the inclusion-exclusion formula.

Let $\Omega=$ the set of all possible ways of dropping the balls. Then $|\Omega| = 3^{10}.$

We assume that all the outcomes are equally likely.

Let $A_i=$the event that the $i$-th box remains empty, where $i=1,2,3.$

Then we are looking for $P(A_1 \cup A_2\cup A_3).$

By the inclusion-exclusion formula, this is the same as $$ P(A_1)+P(A_2)+P(A_3)-\big[P(A_1\cap A_2)+P(A_2\cap A_3)+P(A_3\cap A_1) \big] + P(A_1\cap A_2\cup A_3). $$ Since the labelling of the boxes are arbitrary, we have $P(A_1) = P(A_2)=P(A_3).$

Similarly, the three probabilities inside the exclusion term are also equal to each other. Thus we are left with $$ 3P(A_1)-3P(A_1\cap A_2)+ P(A_1\cap A_2\cup A_3). $$ Now $|A_1| = 2^{10}$ and $|A_1\cap A_2| = 1^{10}$ (Why?) .

Also $|A_1\cap A_2\cap A_3| = 0$ (Why?) .

So $P(A_1) = \frac{|A_1|}{|\Omega|} = \left(\frac 23\right)^{10}$ and $P(A_1\cap A_2) = \frac{|A_1\cap A_2|}{|\Omega|} = \left(\frac 13\right)^{10}. $

Hence the required answer is $3\times \left[ \left(\frac 23\right)^{10}-\left(\frac 13\right)^{10} \right] = \frac{2^{10}-1}{3^9}.$

■

Simulations

Often we come across events that easily described in words, but whose probabilities are rather hard to compute. Computer simulation comes handy in such cases. Computer simulations help in detecting theoretical mistakes too.

EXAMPLE 11: A deck of 10 cards labelled 1,...,10 is shuffled thoroughly. We shall say that the $i$-th card is at home, if it is in the $i$-th positron after the shuffle. Write an R code to estimate the probability that exactly 3 cards are at home.

SOLUTION:

event = numeric(5000)
for(k in 1:5000) {
  x = sample(10,10)
  at.home = sum(x==(1:10))
  event[k] = (sum(at.home)==3)
}
mean(event)

■

Problems for practice

Sets and Venn diagrams

EXERCISE 6: Let $E,F,G$ be any three events. Find expressions for the event that of $E,F,G$

only $F$ occurs
both $E$ and $F$ but not $G$ occur
at least one event occurs
at least two events occur
all three events occur
none occurs
at most one occurs
at most two occur
exactly two occur
exactly one occurs.

$i$	Number of favorable outcomes	Probability
2	1	$\frac{1}{36}$
3	2	$\frac{2}{36}$
4	3	$\frac{3}{36}$
5	4	$\frac{4}{36}$
6	5	$\frac{5}{36}$
7	6	$\frac{6}{36}$
8	5	$\frac{5}{36}$
9	4	$\frac{4}{36}$
10	3	$\frac{3}{36}$
11	2	$\frac{2}{36}$
12	1	$\frac{1}{36}$

Table of contents

Comments