Some more standard densities

Beta function
Problem set 1
Beta distribution
Problem set 2
Bayesian machine learning
Problem set 3
Cauchy distribution
Problem set 4
Cauchy baffles the laws of large numbers
Problem set 5
Normal distribution
Problem set 6
Central Limit Theorem
Problem set 7
Maxwell's derivation of the normal distribution
Problem set 8
Problems for practice

Some more standard densities

Beta function

The next distribution that we shall discuss is the Beta distribution. Just as we needed to know about the Gamma function in order to talk about the Gamma distribution, we need to know about the Beta function before we can get into Beta distribution.

Definition: Beta function The Beta function is the function $B:(0,\infty)\times(0,\infty)\rightarrow{\mathbb R}$ defined as $$B(a,b) = \int_0^1 x^{a-1}(1-x)^{b-1}\, dx\mbox{ for } a,b>0.$$

Notice that if $a,b\geq 1,$ then the integrand is continuous, and so integrable over $[0,1].$ If, however, $a$ or $b$ lies in $(0,1)$, then the integral is am improper one, and hence we have to worry about its existence. Fortunately, it is easy to establish convergence by comparison with $\int_0^1 x^\alpha\, dx$ for $\alpha>-1.$

Most manipulations with the Beta function uses the following theorem to reduce the Beta function to the Gamma function:

Theorem For $a,b>0$ we have $$B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}.$$

Proof:The proof is quite easy using a bivariate change of variable. However, since that result will be proved in Analysis III, we shall omit the proof of this theorem here.[QED]

Problem set 1

EXERCISE 1: Find $B(4,5).$ Remember that $\Gamma(n) = (n-1)!$ for $n\in{\mathbb N}.$

EXERCISE 2: Writing the factorials in terms of the Gamma function, express $\binom{10}{6}$ in terms of the Beta function

Beta distribution

Video for this section

Now that we see that for $a,b>0$ the function $x^{a-1}(1-x)^{b-1}$ is nonnegative and integrable over $(0,1)$ with integral $B(a,b),$ we can immediately manufacture a density out of it:

Definition: Beta distribution The distribution with density $$f(x) = \left\{\begin{array}{ll}\frac{1}{B(a,b)} x^{a-1}(1-x)^{b-1}&\text{if }x\in(0,1)\\ 0&\text{otherwise.}\end{array}\right.$$ for $a,b>0$ is called the Beta distribution with parameters $a,b.$

The beta densities show a wide variety of shapes.


A variety of shapes from the Beta family

Theorem If $X\sim Beta(a,b)$ then $E(X) = \frac{a}{a+b}.$

Proof: $$E(X) = \frac{1}{B(a,b)}\int_0^1 x\times x^{a-1} (1-x)^{b-1}\, dx = \frac{1}{B(a,b)}\int_0^1 x^{(a+1)-1} (1-x)^{b-1}\, dx = \frac{B(a+1,b)}{B(a,b)}.$$ Now we shall express the Beta functions in terms of the Gamma function to get $$\frac{B(a+1,b)}{B(a,b)} = \frac{\Gamma(a+1)\Gamma(b)}{\Gamma(a+b+1)}\times\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}.$$ We know that $\forall \alpha >0~~\Gamma(\alpha+1) = \alpha \Gamma(\alpha).$ Hence $\frac{\Gamma(a+1)}{\Gamma(a)}= a$ and $\frac{\Gamma(a+b)}{\Gamma(a+b+1)}=\frac{1}{a+b}.$ Hence the result. [QED]

Problem set 2

EXERCISE 3: For particular values of $a,b$ we get the $Unif(0,1)$ distribution. Which values?

EXERCISE 4: If $X\sim Beta(a,b)$, then exactly one of the two statements is correct in general. Which one?

$E(X) = \frac{a}{a+b}.$
$E(X) = \frac{b}{a+b}.$

Answer by thinking about the density. Now prove it mathematically.

EXERCISE 5: If $X\sim Beta(a,b)$, then find $V(X).$

EXERCISE 6: If $X\sim Beta(a,b)$ then show that $1-X\sim Beta(b,a).$

Bayesian machine learning

Video for this section

The main use of the beta distribution in statistics is as a "distribution of probability". To understand this, consider the large collection of coins, both biased and unbiased. If you pick one of them at random and let $\Pi$ denote its $P(head),$ then $\Pi$ is a random variable, just as the height of a random person is considered a random variable. Now $\Pi$, being a probability, must have a distribution supported on $[0,1].$ Also, since $\Pi$ depends on various physical properties of the coin, it is natural to expect that it will have a density. The members of the Beta family are suitable choices. This was a toy example. More often we have some event and we ask for expert opinion about its probability. Then $\Pi$ denotes the opinion of a randomly selected expert. If there is a peak in its density, then that's the value most experts vote for. In this sense, $Beta(1,1)\equiv Unif(0,1)$ denotes absolute ignorance.

As a warm up towards the use of Beta distribution plays a role in Bayesian machine learning, consider the simple set up from last semester. We have a box containing 3 white and 2 black balls. An SRSWOR of size 3 is drawn from it, and transferred to a new box. Now an SRSWR of size 5 is drawn from the second box. If all the balls in the SRSWR are found to be black, then what is the (conditional) probability that all the balls in the first SRSWOR were black? I hope you remember the following diagrammatic representation for this:


Diagram for the above problem

There are four paths. The $i$-th path has probability $p_iq_i$ for $i=0,1,2,3.$ The required probability is $\frac{p_3q_3}{\sum_i p_iq_i}.$

You may think of the first stage of this experiment as "constructing a random coin" whose $P(head)$ is $0$ or $\frac 13$ or $\frac 23$ or $1.$ These have probabilities $p_0,...,p_3,$ respectively. The second stage is tossing the coin 5 times. Given that we get 5 heads, we want to find the probability that our coin has $P(head)=1.$

In this toy example, our "random coin" could have only finitely many possible $P(head)$'s. Now let's make it more realistic. We shall start by picking a random coin from th population of all coins. Let $\Pi$ denote its $P(head).$ Just as the height of a randomly selected person is a random variable, similarly this $\Pi$ is a random variable. It can takes values in $[0,1].$ Let us assume that $\Pi\sim Beta(1,1)\equiv Unif(0,1).$ This means we consider any value of $\Pi$ in $[0,1]$ equally likely (this is one way to formalising "absolute ignorance"). We shall now toss this coin 5 times. Given that all the tosses result in heads, what is the conditional distribution of $\Pi?$ We had started with a $Beta(1,1)$ distribution. This is often called the prior distribution, as it represented the state of our knowledge about $\Pi$ prior to collection of data. In this sense, the conditional distribution we are seeking is the posterior distribution, which represents our state of knowledge after seeing the data (the outcomes of the coin tosses). We plan to draw a diagram similar to the earlier one for this problem. But here the random coin can have any $P(head)$ in $[0,1].$ So we have a contnuum of arrows, which is hard to draw:


Diagram with continuous sprays of arrows

The 1 labelling the left hand arrow is the density of the prior $Beta(1,1).$ The $p^5$ labelling the right hand arrow is the (binomial) probability of obtaining $5$ heads out of 5 tosses of a coin with $P(head)=p.$ Following the Bayes theorem idea as before we get the posterior density $$\frac{1\times p^5}{\int_0^1 1\times p^5\, dp}=6p^5,$$ which is the $Beta(6,1)$ density.

It is instructive to draw the two densities side by side to understand how our state of belief has been updated in light of the data:


After seeing the data our belief is concentrated more near 1.

In this way, whenever the prior is from the Beta family, our data is $X|\Pi=p\sim Binom(n,p)$, the posterior will also be from the same family. We express this by saying:

Beta is the conjugate prior family for Binomial.

An exercise below asks you to prove this. For now let us toss the same coin 5 more times. Suppose now we get 4 tails and 1 head. If we carry out the same exercise again (but this time with $Beta(6,1)$ playing the role of the prior, and "1 head, 4 tails" as our data), we again get a Beta posterior that we plot below:


Our belief is now peaked closer to the centre.

In this way the Beta family can express our ever-changing state of belief as more and more data stream in.

Problem set 3

EXERCISE 7: Find the posterior for $\Pi$ if the prior is $Beta(6,1)$ and the data consist of 1 head and 4 tails out of 5 independent tosses of the coin. The answer should be a Beta distribution with parameters that you are determine.

EXERCISE 8: Show that if $\Pi$ has prior $Beta(a,b)$ and our data consist of exactly $X$ heads out of $n$ tosses, then the posterior is again a Beta distribution. What are its parameters?

EXERCISE 9: Suppose that we have coin with $P(head)$ having prior $Unif(0,1).$ We toss the coin $n$ times independently and obtain exactly $X$ heads. Let $f(p)$ be the (continuous) density of the posterior. It is natural to estimate $p$ using the value where $f$ is the maximum. This is called the (maximum a posteriori) MAP estimator. Derive its formula in terms of $n$ and $X.$ Is it the same as the "usual" estimator $\frac Xn?$

Cauchy distribution

Video for this section

Definition: Cauchy distribution By Cauchy distribution we mean the distribution with density $$f(x) = \frac{1}{\pi(1+x^2)},\quad x\in{\mathbb R}.$$ Sometimes we also talk about $Cauchy(\mu,\sigma)$ distribution which is the distribution of $\mu+\sigma X$, where $X$ has the above density. In this notation, the density corresponds to the $Cauchy(0,1)$ distribution.

The most important reason for including this distribution in our discussion is that it has one notoriously bad property. It does not have any finite moment! In particular, do not think that $Cauchy(\mu,\sigma)$ has mean $\mu$ and variance $\sigma^2.$

Theorem If $X$ has $Cauchy$ distribution, then $E(X)$ does not exist. As a result $E(X^n)$ does not exist for any $n\in{\mathbb N}.$

Proof: $\int_0^\infty \frac{x}{1+x^2}dx\sim \int_0^\infty \frac 1xdx = \infty.$ [QED]

Theorem The characteristic function of the Cauchy distribution is $e^{-|t|}.$

Proof:Needs techniques (complex contour integration/differentiation under intergal) beyond the present level.[QED]

Problem set 4

EXERCISE 10: How can you generate a Cauchy random variable from a $Unif(0,1)$ random variable?

EXERCISE 11: Consider the unit semicircle shown below.

We pick a point at random on it, and extend the ray through it from the origin until it hits the $x=1$ line at some $(1,Y).$ Find the distribution of $Y.$

EXERCISE 12: If $X$ is a Cauchy random variable, then show that $\frac 1X$ is also a Cauchy random variable.

Cauchy baffles the laws of large numbers

Video for this section

The following R snippet will plot the running means of 10000 IID Cauchy random variables. The plot does not converge.

n = 10000
x= rcauchy(n)
y=cumsum(x)/(1:n)
plot(y,ty='l')

This demonstration is theoretically justified using the following theorem.

TheoremIf $X$ and $Y$ are independent Cauchy random variables, then for any $a\in[0,1]$ $aX+(1-a)Y$ is also a Cauchy variate.

Proof: This may be proved using Jacobians, or more directly using characteristic function. The characteristic function of $aX+(1-a)Y$ is $$E(e^{it(aX+(1-a)Y)}) = E(e^{itaX+it(1-a)Y}) = E(e^{itaX}\cdot e^{it(1-a)Y}) = E(e^{itaX})E(e^{it(1-a)Y}),$$ since $X,Y$ are independent. Now, we know that $E(e^{itX}) = E(e^{itY}) = e^{-|t|}.$ Hence $$ E(e^{itaX})E(e^{it(1-a)Y}) = e^{-|ta|})\times e^{-|t(1-a)|}) = e^{-|ta|-|t(1-a)|} = e^{-|t|},$$ since $a\in[0,1].$

This completes the proof. [QED]

The next theorem, which is a simple corollary to this theorem, shows why $\bar X_n$ failed to converge to a number in our simulation of the law of large numbers.

Theorem If $X_1,...,X_n$ are IID Cauchy, then $\bar X_n= \frac 1n\sum_1^n X_i$ is also Cauchy.

Proof: See the exercise below. [QED]

Problem set 5

EXERCISE 13: Prove the above theorem using induction on $n.$ You may like to use the identity $$\bar X_n = \frac{(n-1)\bar X_{n-1} + X_n}{n}.$$

EXERCISE 14: If $X,Y$ are independent Cauchy random variables, and we take $a\not\in[0,1],$ then is it possible for $aX+(1-a)Y$ to have Cauchy distribution?

Normal distribution

Video for this section

This is the most commonly used distribution in statistics.

Definition: Normal distribution We say that $X\sim N(\mu,\sigma^2)$ to mean $X$ has density $$\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2} \right), \quad x\in{\mathbb R}.$$

The density looks like the following:


$\mu$ controls centre, $\sigma$ controls spread

Proving that this is indeed a density is not entirely straightforward. Howeer, we may reduce the problem to a Gamma integral and use the following theorem.

Theorem $\Gamma\left(\frac 12\right)= \sqrt\pi.$

Proof:Omitted.[QED]

To use this result in order to show that the total integral of the $N(\mu,\sigma^2)$ density is indeed 1, we proceed as follows. $$\int_0^\infty e^{-x^2/2}\, dx= \frac{1}{\sqrt2}\int_0^\infty t^{-1/2} e^{-t} \, dt$$ by using $t = x^2/2.$ This new integral is just $$\int_0^\infty t^{\frac 12-1} e^{-t} \, dt = \Gamma\left(\frac 12\right) = \sqrt\pi.$$ So $$\int_{-\infty}^\infty e^{-x^2/2}\, dx=2\int_0^\infty e^{-x^2/2}\, dx= \sqrt{2\pi}.$$ Hence we have shown that the $N(0,1)$ density integrates up to 1. To prove for any general $N(\mu,\sigma^2)$ we simply use the substitution $y = \frac{x-\mu}{\sigma}$ to reduce it to the $N(0,1)$ case.

The letter lower case phi, $\phi$, is generally used for the $N(0,1)$ density, while its capital version $\Phi$ is reserved for the CDF. $$\Phi(x) = \int_{-\infty}^x \phi(t)\, dt$$ Louiville showed that $\Phi(x)$ cannot be expressed in terms of elementary functions (trigonometric, exponential, logarithmic, square root, cube root etc). However, its value may be computed numerically for any given $x.$

Theorem If $X\sim N(\mu,\sigma^2),$ then for any $a\in{\mathbb R}$ and $b\neq 0$ we have $a+bX\sim N(a+b\mu,b^2 \sigma^2).$

Proof: Directly from Jacobian formula. [QED]

A corollary is the following theorem.

Theorem If $X\sim N(\mu,\sigma^2),$ then $\frac{X-\mu}{\sigma}\sim N(0,1).$

The transformation from $X$ to $\frac{X-\mu}{\sigma}$ is called standardisation.

TheoremIf $X\sim N(\mu,\sigma^2),$ then $E(X)=\mu$ and $V(X)=\sigma^2.$

Proof:Easy, and left as an exercise. Just one reminder. As a first step you should substitute $y=\frac{x-\mu}{\sigma}$ to arrive at $N(0,1).$ Now the expectation is given by the integral $$\frac{1}{2\sqrt\pi}\int_{-\infty}^\infty x e^{-x^2/2}\, dx.$$ Don't rush to the conclusion that this must be zero, because the integrand is an odd function. Here you are working with an improper integral. So you need to make sure that $\int_0^\infty x e^{-x^2/2}\, dx$ is finite before you can use the odd function argument. [QED]

Theorem The characteristic function of $N(0,1)$ is $e^{-t^2/2}$ for $t\in{\mathbb R}.$

Proof: As you have not formally done complex integration yet, all our characteristic function derivations are heuristic.

Here we can show directly that for any $s\in{\mathbb R}$ we have $E(e^{sX}) = e^{s^2/2},$ where $X\sim N(0,1).$ This is easily shown using a simple substitution.

Now, if you replace $s$ with $it,$ you get the result. This replacement is justified using arguments from complex analysis beyond the present scope. [QED]

Problem set 6

EXERCISE 15: If $X\sim N(0,1),$ then express the following probabilities in terms of $\Phi(\cdot).$

$P(X<1)$
$P(|X|<1)$
$P(|X|>2)$

EXERCISE 16: If $X\sim N(2,3^2),$ then express the following probabilities in terms of $\Phi(\cdot).$

$P(X<1)$
$P(|X|<1)$
$P(|X|>2)$

EXERCISE 17: If $\Phi ^{-1}(0.95)=1.64$, then find $c\in{\mathbb R}$ such that $P(|X-1|>c) = 0.1 $ where $X\sim N(1,1^2).$

Central Limit Theorem

The Central imit Theorem (CLT) is possibly the most famous theorem in probability theory and statistics. Originally stated and proved by Gauss, the theorem has many variants due to other mathematicians. Here we shall state the simplest version.

Central imit Theorem (CLT) Let $X_1,X_2,...$ be IID with $E(X_i) = \mu$ and $V(X_i) = \sigma^2 < \infty.$ Let $$\bar X_n = \frac{X_1+\cdots+X_n}{n}\mbox{ for } n\in{\mathbb N}.$$ Then the distribution of $\frac{\sqrt n(\bar X_n-\mu)}{\sigma}$ tends to $N(0,1)$ as $n\rightarrow \infty.$ More precisely, if $F_n(\cdot)$ denotes the CDF of $\frac{\sqrt n(\bar X_n-\mu)}{\sigma},$ and $\Phi(\cdot)$ denotes the $N(0,1)$ CDF, then $$\forall t\in{\mathbb R}~~F_n(t)\rightarrow \Phi(t) \mbox{ as } n\rightarrow \infty.$$

Proof:Next semester.[QED]

This theorem is a manifestation of statistical regularity. Whatever may the true distribution of the $X_i$'s be, if you average a large number of $X_i$'s you get close approximation to the normal distribution. This allows statistician to deal with averages of a large number of IID observations without knowing the true underlying distribution.

Let's look at a typical example.

EXAMPLE 1: If 40% of the population of a city supports a poll candidate, then what is the approximate probability that a random sample of 500 persons from the city will have at least 250 supporters?

SOLUTION: Here we think of the sampling procedure as 500 trials of the same random experiment: Pick a person at random from the population of the city.

We shall assume that the trials are IID. Now here we are introducing an approximation: the first membr of the sample was drawn from the entire population, but since we generally sample without replacement in such a scenario, the second member of the sample was drawn from a population of size one less than in the case of the first member. So the random experiment has actually changed, and they are not independent also. But since the population is large (much larger than 500), so we are ignoring both the non-identical and dependent nature and assuming IID.

We also have a random variable: $$X(\omega) = \left\{\begin{array}{ll}1 &\text{if }\omega\mbox{ supports the candiate}\\ 0&\text{otherwise.}\end{array}\right.$$ Here$\omega$ is the person sampled. Each trial gives rise to one copy of this random variable, so we have $X_1,...,X_{500}$ IID $Bernoulli(0.4).$ This $0.4$ came from the 40% given in the problem.

By CLT we have $$\frac{\sqrt n (\bar X_n-\mu)}{\sigma}\rightarrow N(0,1)$$ as $n\rightarrow \infty,$ where $\mu = E(X_i)$ and $\sigma^2 = V(X_i)< \infty.$ We shall write this as $$\bar X_n \stackrel{\bullet}{\sim} N\left(\mu,\frac{\sigma^2}{n}\right)$$ for large $n.$ Here $\stackrel\bullet\sim$ means "approximately distributed as".

In our case, $\mu = 0.40$, $\sigma^2 = 0.4(1-0.4) = 0.24$ and $n=500.$ So $$\bar X_{500} \stackrel{\bullet}{\sim} N\left(0.40,\frac{0.24}{500}\right),$$ or $$\sum_1^n X_i \stackrel{\bullet}{\sim} N(0.40\times 500,0.24\times 500)\equiv N(200, 120).$$ Nowe we can find the required probability as $$P(\sum_1^{500} X_i \geq 250) \approx 1-\Phi\left(\frac{250-200}{\sqrt{120}}\right).$$ This probability may be obtained by looking up standard $N(0,1)$ tables or using R as

1-pnorm((250-200)/sqrt(120))

■

In this problem we knew the distribution of the $X_i$'s, but we never really made any use of it, except to compute $E(X_i)$ and $V(X_i).$

Problem set 7

EXERCISE 18: [rossdistrib10.png]

EXERCISE 19: [rossdistrib8.png]

[Hint]

For the second part, just drop the 6's. This means you are rolling a 5-faced die 800 times.

EXERCISE 20: [rossdistrib5.png]

Maxwell's derivation of the normal distribution

Video for this section

Generally, Gauss is credited with the "discovery" of the normal distribution, which he derived via his famous central limit theorem. However, a lesser known derivation is due to James Clerk Maxwell, which we shall discuss now.

Imagine a gas without any overall flow. We know that the molecules are moving tis way and that randomly. Maxwell was interested in finding the distribution of velocities. Of course, he had no way of grabbing a molecule and measuring its velocity. So he did a brilliant logical argument starting with little more than common sense notions available to anybody.

His first step was to set up three axes $x$ , $y$ and $z.$ Then the velocity could be expressed as $(V_1,V_2,V_3).$ Since there is no overall flow and we have no reason to favour any one direction over the other hence $V_1,V_2,V_3$ must be IID.

Now we make our first not-so-common-sense assumption: each $V_i$ has a density, say $f.$ Then the joint density of $(V_1,V_2,V_3)$ is $f(v_1)f(v_2)f(v_3).$ Surely this cannot depend on the direction of $\vec v = (v_1,v_2,v_3).$ So we must have $$f(v_1)f(v_2)f(v_3) = g(v_1^2+v_2^2+v_3^2)$$ for some function $g(\cdot).$

Now we make our second technical assumption: $f$ is differentiable. Then differentiating both sides of the above equality partially wrt $v_i$ we get $$\begin{eqnarray*} f'(v_1)f(v_2)f(v_3) & = & 2v_1g'(v_1^2+v_2^2+v_3^2),\\ f(v_1)f'(v_2)f(v_3) & = & 2v_2g'(v_1^2+v_2^2+v_3^2),\\ f(v_1)f(v_2)f'(v_3) & = & 2v_3g'(v_1^2+v_2^2+v_3^2). \end{eqnarray*}$$ So $$\frac{f'(v_1)f(v_2)f(v_3)}{v_1}=\frac{f(v_1)f'(v_2)f(v_3)}{v_2},$$ or $$\frac{f'(v_1)}{v_1f(v_1)}=\frac{f'(v_2)}{v_2f(v_2)}.$$ Since $v_1,v_2$ are arbitrary, hence this means $$\frac{f'(x)}{xf(x)}=k$$ for some constant $k.$ Solving the differerntial equation we get $\log f(x) = \frac{k x^2}{2}+$ constant, or $$f(x) \propto e^{k x^2/2}.$$ Since $f$ is a density, hence its total intergral must be 1. Hence $k < 0$ and we get the density of $N\left(0,-\frac{1}{k}\right).$

Problem set 8

EXERCISE 21: If $X$ is a random variable with density proportional to $\exp((1-x)(3+4x))$ for $x\in{\mathbb R}$, then find the distribution of $X.$

EXERCISE 22: Let $\vec V = (V_1,V_2,V_3)$ have the joint distribution as in Maxwell's derivation. Consider $\vec U = \frac{\vec V}{\|\vec V\|}$, the unit vector along $\vec V.$ Describe the distribution of $\vec U.$

EXERCISE 23: If $X$ has density proportional to $e^{ax^2+bx+c}$ for $x\in{\mathbb R}$, for some constants $a,b,c$, then find $E(X)$ and $V(X).$