$\newcommand{\argmax}{\mathrm{argmax}}$

Maximum Likelihood Estimation (MLE)

Set up

You have some data: $$ X_1,...,X_n. $$ You know that this is a random sample (i.e., IID) from some distribution with PMF or continuous PDF $f(x,\theta),$ where $\theta\in \Theta $ is some unknown parameter, and $\Theta $ is the parameter space, the set of all possible values that $\theta $ can take. We assume that $\Theta $ is known.

How would one "know" such a thing in real life? Typically from the structure of the random experiment itself, as in the following example, or from past experience with similar data.

EXAMPLE: $X_1,...,X_n$ are the outcomes (Head=1, Tail=0) of $n$ tosses of a coin with unknown probability of head $\theta.$ Then the PMF is $$ f(x,\theta) = \theta^x(1-\theta)^{1-x},\quad x=0,1. $$ Here we do not know $\theta,$ but we know that $\theta\in[0,1].$ So $\Theta = [0,1].$ To make sense of the PMF when $\theta=0$ or 1, we take $0^0=1.$ ///

Our aim is to estimate (i.e., approximately guess) the value of $\theta.$ MLE is the most popular technique to do so.

A minor point: The estimate (i.e., the approximate guess) is obtained based on the data. Thus, the outcome of MLE is a function of the data, say $\hat \theta(x_1,...,x_n).$ This function is called an estimator. When you plug the actual data into it, you get a number, which is called an estimate. The difference between an estimator and an estimate is that between a function and its value. The distinction is often blurred in casual usage. The abbreviation MLE is used to denote Maximum Likelihood Estimate or Maximum Likelihood Estimator or Maximum Likelihood Estimation (the entire process of arriving at the guess).

The procedure

First compute the likelihood function $$ L(\theta) = \prod_{i=1}^n f(X_i,\theta). $$ Note that the likelihood is actually a function of $\theta$ as well as the $X_i$'s, though I have suppressed the $X_i$'s in the left hand side.

The process of MLE now consists of finding $\hat \theta\in \Theta $ that maximises $L(\theta).$ Mathematically, we write this as $$ \hat \theta = \argmax \{L(\theta)~:~\theta\in \Theta\}. $$ How the maximisation is carried out in a given problem is not dictated by MLE. But, as you might have guessed, differentiation is a popular technique. Now, differentiating a product of functions may not be easy. So if you are planning to differentiate, it is generally wiser to work with the log-likelihood function: $$ \ell(\theta) = \log L(\theta)=\sum_{i=1}^n \log f(X_i,\theta), $$ which is a sum instead of a product of identical functions.
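To see the maximisation in action, here is a minimal Python sketch (the data are made up for the demo) that maximises the Bernoulli log-likelihood from the coin example by brute-force grid search over $\Theta=[0,1]$:

```python
import math

def log_likelihood(theta, xs):
    """Bernoulli log-likelihood: sum of log f(x_i, theta)."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1 - theta) for x in xs)

# Hypothetical data: 7 heads, 3 tails out of 10 tosses.
data = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]

# Crude grid over the interior of Theta = [0, 1]
# (endpoints excluded, since log(0) is undefined).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: log_likelihood(t, data))
print(theta_hat)  # 0.7, the sample proportion of heads
```

Of course, in this model calculus gives the answer $\hat\theta = \bar X$ in closed form; the grid search is only meant to show that "maximise $\ell(\theta)$ over $\Theta$" is a perfectly concrete recipe.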

Of course, you must ensure that $L(\theta)>0$ before taking log. Since $\log x$ is a differentiable, strictly increasing function, we have $$ \argmax \{\ell(\theta)~:~\theta\in \Theta\} = \argmax \{L(\theta)~:~\theta\in \Theta\}. $$

Example

EXAMPLE:  $X_1,...,X_n$ random sample from Poisson($\lambda$) with PMF: $$ f(x,\lambda) = e^{-\lambda}\frac{\lambda^x}{x!} \text{ for } x=0,1,2,... $$ for $\lambda>0.$ Find MLE of $\lambda.$

SOLUTION: Here the parameter space is $(0,\infty).$ The likelihood function is $$ L(\lambda) = \prod_{i=1}^n e^{-\lambda}\frac{\lambda^{X_i}}{X_i!} = e^{-n \lambda} \frac{ \lambda^{\sum X_i}}{ \prod X_i!}. $$ This might look alarming, especially the product in the denominator. But remember that you are to maximise it as a function of $\lambda.$ Anything that does not involve $\lambda $ is just a constant. So it is basically of the form $$ L(\lambda) = A e^{-n \lambda} \lambda ^ B, $$ where $A = 1/\prod X_i!$ and $B = \sum X_i$ are constants. Differentiating this and equating it to zero is not tough. But we can make life easier by first taking logs: $$ \ell(\lambda) = \log A -n \lambda + B \log \lambda. $$ So $$ \ell'(\lambda) = -n + \frac{B}{\lambda}. $$ Solving $\ell'(\hat \lambda) = 0$ we get $\hat \lambda = \frac{B}{n} = \frac{\sum X_i}{n} = \bar X.$
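A quick numerical sanity check (with a small hypothetical sample, written for this note) that $\hat\lambda = \bar X$ really does maximise the Poisson log-likelihood:

```python
import math

def poisson_loglik(lam, xs):
    """Poisson log-likelihood: -n*lam + (sum x_i)*log(lam) - sum log(x_i!)."""
    n = len(xs)
    return (-n * lam + sum(xs) * math.log(lam)
            - sum(math.lgamma(x + 1) for x in xs))  # lgamma(x+1) = log(x!)

xs = [2, 3, 0, 1, 4, 2]          # hypothetical sample
lam_hat = sum(xs) / len(xs)      # the MLE: lambda_hat = X bar = 2.0

# The log-likelihood at lam_hat beats nearby values of lambda.
assert all(poisson_loglik(lam_hat, xs) > poisson_loglik(lam, xs)
           for lam in [1.5, 1.9, 2.1, 2.5])
print(lam_hat)  # 2.0
```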

Second derivative test: $\ell''(\lambda) = -\frac{B}{\lambda^2}.$ Since $B>0$ (unless all the $X_i$'s are zero), $\ell''(\hat \lambda) < 0$, ensuring a maximum. ///

Will this always work well?

There is no guarantee in general that this procedure will work well, or even work at all. It could be that $L(\theta)$ is unbounded above for $\theta\in \Theta$, or even if it is bounded above, it may not attain its supremum (like the function $g(x)=x$ over $x\in(0,1)$).

In the example above we already had a problem: the MLE did not exist if all the $X_i$'s were 0, since then $L(\lambda)=e^{-n \lambda}$, which is strictly decreasing on $(0,\infty)$ and so attains no maximum!

However, in an overwhelming majority of cases, such problems do not arise. There are many theorems providing sufficient conditions under which the MLE works well. We shall not go into those theorems in this basic course.

But let us understand intuitively what is meant by "works well" here. Let $\theta_*\in \Theta$ be the true (unknown) value of $\theta.$ Then one desirable property is that $\hat \theta (X_1,...,X_n)\rightarrow \theta_* $ as $n\rightarrow \infty.$ This property is called consistency. In a wide variety of situations (again there are theorems giving sufficient conditions), MLE is consistent.
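A simulation sketch of consistency for the coin model (the seed and the choice $\theta_*=0.3$ are arbitrary, made up for this demo): the MLE here is the sample proportion of heads, and it homes in on the true value as $n$ grows.

```python
import random

random.seed(0)      # fixed seed so the run is reproducible
theta_star = 0.3    # hypothetical true probability of head

# For Bernoulli data the MLE is the sample mean; watch it approach theta_star.
for n in [100, 10_000, 1_000_000]:
    tosses = [1 if random.random() < theta_star else 0 for _ in range(n)]
    theta_hat = sum(tosses) / n
    print(n, theta_hat)
```

Each estimate fluctuates around $\theta_*$, but the fluctuations shrink as $n$ increases, which is exactly what consistency promises.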

Another desirable property is that it should be precise. This may be measured by its standard error (SE) (which is just a fancy name for the standard deviation of an estimator). How small can you make it? Well, you cannot make it negative! Can you make it zero? Well, errr...yes, if we take our estimator to be just a constant (as in the coin toss case, if we always report $\hat \theta = \frac 12$ without looking at the data). Now that is of course a stupid estimator! So low standard error by itself may lead to nonsense. But once we put more reasonable conditions (like consistency), the standard error is forced to be positive. There are theorems that give us a lower bound for the standard error under those conditions (e.g., the Cramér-Rao bound). Any estimator attaining that lower bound is called efficient. Is the MLE always efficient? Well, not necessarily. But it is often asymptotically efficient, meaning that $$ \frac{SE(\text{MLE})}{\text{lower bound}}\rightarrow 1 $$ as $n\rightarrow \infty.$
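To make "efficient" concrete, here is the computation (worked out here for illustration) for the coin-toss model, where the MLE is the sample proportion: $$ \hat \theta = \bar X, \qquad \mathrm{Var}(\hat \theta) = \frac{\theta(1-\theta)}{n}. $$ The Fisher information of a single toss is $I(\theta) = \frac{1}{\theta(1-\theta)}$, so the Cramér-Rao lower bound on the variance of an unbiased estimator is $$ \frac{1}{n I(\theta)} = \frac{\theta(1-\theta)}{n}. $$ The two coincide, so in this particular model the MLE attains the bound exactly, not merely asymptotically.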

We still do not have enough mathematical tools at our disposal to make these ideas any more precise. However, it is not difficult to see intuitively why MLE is a reasonable thing to do. Indeed, this is what a common man would anyway do, as the following example shows.

EXAMPLE:  Suppose that you have a coin, and you know that its probability of head is either 0.9 or 0.1. You have tossed it 100 times, and have obtained 87 heads and 13 tails. What will be your estimate of the probability of head based on this?

Clearly, your estimate will be 0.9, because it is highly unlikely that a coin with probability of head 0.1 would produce 87 heads out of 100 tosses.

Well, that is exactly the reasoning behind MLE. Here the parameter space is $\Theta=\{0.1,0.9\}.$ The likelihood function is $$ L(\theta) = \theta ^{87} (1-\theta)^{13}. $$ Clearly, $L(0.9) > L(0.1),$ and so we go for $\hat \theta = 0.9.$ ///
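The same comparison takes only a few lines of Python (working on the log scale, so the tiny raw likelihoods cause no trouble):

```python
import math

heads, tails = 87, 13

def loglik(theta):
    """Log-likelihood of 87 heads and 13 tails.
    The binomial coefficient is dropped: it is constant in theta."""
    return heads * math.log(theta) + tails * math.log(1 - theta)

# Compare the two candidates in Theta = {0.1, 0.9}.
theta_hat = 0.9 if loglik(0.9) > loglik(0.1) else 0.1
print(theta_hat)  # 0.9
```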

What does the likelihood function denote?

For discrete data it denotes the probability of observing the data for a given value of $\theta.$ So here the likelihood is always between 0 and 1.

The interpretation is slightly more involved for the PDF case. Recall that we are assuming the PDF to be continuous. Under the continuity assumption we have this result (from the fundamental theorem of calculus): $$ \lim_{\delta\rightarrow 0+}\frac{P(X\in(x-\delta,x+\delta))}{2 \delta} = f(x). $$ Now, when we measure a continuous variable, we do so only with a finite precision. Suppose that this precision level is measured by $\delta>0.$ That is, when we say that the measured value of $X$ is $x$ we actually mean $X\in(x-\delta,x+\delta).$ Now, typically, $\delta$ is pretty small allowing us to assume $$ P(X\in(x-\delta,x+\delta)) \approx 2 \delta f(x). $$ In other words, the probability of the measured value being $x$ is proportional to $f(x).$ Hence, in the continuous PDF case also, the likelihood function gives the probability of observing the data up to a constant of proportionality.
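A quick numerical check of the approximation $P(X\in(x-\delta,x+\delta)) \approx 2\delta f(x)$, using the Exponential(1) distribution as an arbitrary concrete choice (any continuous PDF would do):

```python
import math

def f(x):
    """Exponential(1) PDF, for x >= 0."""
    return math.exp(-x)

def cdf(x):
    """Exponential(1) CDF, for x >= 0."""
    return 1 - math.exp(-x)

x, delta = 1.0, 1e-4
exact = cdf(x + delta) - cdf(x - delta)  # exact P(X in (x - delta, x + delta))
approx = 2 * delta * f(x)                # the 2*delta*f(x) approximation
print(exact, approx)  # the two agree to many decimal places
```

The relative error of the approximation is of order $\delta^2$, so for small measurement precision $\delta$ the density really does act as a probability up to the constant $2\delta$.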
