You have some data:
$$
X_1,...,X_n.
$$
You know that this is a random sample (i.e., IID) from some
distribution with PMF or continuous PDF $f(x,\theta),$ where $\theta\in
\Theta $ is some unknown parameter, and $\Theta $ is the
parameter space, the set of all possible values
that $\theta $ can take. We assume that $\Theta $ is
known.
How would one "know" such a thing in real life? Typically in one of two ways:
either by looking at the bar chart (in the discrete case) or the histogram
(in the continuous case) and recognising the shape,
or by domain knowledge (e.g., some expert in that field tells
you that this variable typically follows that distribution).
EXAMPLE: $X_1,...,X_n$ are the outcomes (Head=1, Tail=0) of $n$ tosses of a
coin with unknown probability of head $\theta.$ Then the
PMF is
$$
f(x,\theta) = \theta^x(1-\theta)^{1-x},\quad x=0,1.
$$
Here we do not know $\theta,$ but we know
that $\theta\in[0,1].$ So $\Theta = [0,1].$ To make
sense of the PMF when $\theta=0$ or 1, we take $0^0=1.$
///
Our aim is to estimate (i.e., approximately guess) the value
of $\theta.$ MLE is the most popular technique to do so.
A minor point: The estimate (i.e., the approximate guess) is obtained based on
the data. Thus, the outcome of MLE is a function of the
data, say $\hat \theta(x_1,...,x_n).$ This function is
called an estimator. When you evaluate it at the actual
data you get $\hat \theta(X_1,...,X_n)$, which is a number
called an estimate. The difference between an estimator
and an estimate is that between a function and its value. The
distinction is often blurred in casual usage. The abbreviation
MLE is used to denote Maximum Likelihood Estimate or Maximum
Likelihood Estimator or Maximum Likelihood Estimation
(the entire process of arriving at the guess).
First compute the likelihood function
$$
L(\theta) = \prod_{i=1}^n f(X_i,\theta).
$$
Note that the likelihood is actually a function of $\theta$
as well as the $X_i$'s, though I have suppressed
the $X_i$'s in the left hand side.
The process of MLE now consists of finding $\hat \theta\in
\Theta $ that maximises $L(\theta).$ Mathematically, we
write this as
$$
\hat \theta = \argmax \{L(\theta)~:~\theta\in \Theta\}.
$$
How the maximisation is carried out in a given problem is not
dictated by MLE. But, as you might have guessed, differentiation
is a popular technique. Now, differentiating a product of
functions may not be easy. So if you are planning to
differentiate, it is generally wiser to work with
the log-likelihood function:
$$
\ell(\theta) = \log L(\theta)=\sum_{i=1}^n \log f(X_i,\theta),
$$
which is a sum instead of a product of identical functions.
Of course, you must ensure that $L(\theta)>0$ before taking
log. Since $\log x$ is a differentiable, strictly increasing
function, we have
$$
\argmax \{\ell(\theta)~:~\theta\in \Theta\} = \argmax \{L(\theta)~:~\theta\in \Theta\}.
$$
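Here is a small numerical illustration of this equality, only a sketch and not a general-purpose method. It uses the coin-toss PMF from the first example with some made-up data (the list `data` is hypothetical), and maximises both $L(\theta)$ and $\ell(\theta)$ by brute force over a grid of $\theta$ values inside $(0,1),$ where $L(\theta)>0.$

```python
import math

# Hypothetical coin-toss data (1 = head, 0 = tail): 7 heads in 10 tosses.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

def pmf(x, theta):
    """Bernoulli PMF f(x, theta) = theta^x (1 - theta)^(1 - x)."""
    return theta ** x * (1 - theta) ** (1 - x)

def likelihood(theta):
    """L(theta): product of the PMF values over the sample."""
    return math.prod(pmf(x, theta) for x in data)

def log_likelihood(theta):
    """l(theta): sum of the log PMF values (requires L(theta) > 0)."""
    return sum(math.log(pmf(x, theta)) for x in data)

# Grid inside the open interval (0, 1), so that the log is always defined.
grid = [i / 1000 for i in range(1, 1000)]

theta_hat_from_L = max(grid, key=likelihood)
theta_hat_from_log = max(grid, key=log_likelihood)

print(theta_hat_from_L, theta_hat_from_log)
```

Both searches pick the same grid point, as the monotonicity of $\log$ says they must. The grid search is used here only because it makes no use of calculus; it is not how one would maximise in practice.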
EXAMPLE:
$X_1,...,X_n$ random sample from Poisson($\lambda$)
with PMF:
$$
f(x,\lambda) = e^{-\lambda}\frac{\lambda^x}{x!} \text{ for } x=0,1,2,...
$$
for $\lambda>0.$ Find MLE of $\lambda.$
SOLUTION:
Here the parameter space is $(0,\infty).$ The likelihood
function is
$$
L(\lambda) = \prod_{i=1}^n e^{-\lambda}\frac{\lambda^{X_i}}{X_i!} =
e^{-n \lambda} \frac{ \lambda^{\sum X_i}}{ \prod X_i!}.
$$
This might look alarming, especially the product in the
denominator. But remember that you are to maximise it as a function
of $\lambda.$ Anything that does not involve $\lambda $
is just a constant. So it is basically like
$$
L(\lambda) = A e^{-n \lambda} \lambda ^ B,
$$
where $A = 1/\prod X_i!$ and $B = \sum X_i$ are constants. Differentiating
this and equating the derivative to zero is not tough. But we can make life easier by
first taking log:
$$
\ell(\lambda) = \log A -n \lambda + B \log \lambda.
$$
So
$$
\ell'(\lambda) = -n + \frac{B}{\lambda}.
$$
Solving $\ell'(\hat \lambda) = 0$ we get $\hat \lambda =
\frac Bn = \frac 1n \sum X_i = \bar X.$
Second derivative test: $\ell''(\lambda) =
-\frac{B}{\lambda^2}.$
Since $B>0$ (unless all the $X_i$'s are zero), $\ell''(\hat \lambda) <
0$, ensuring a maximum.
///
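The Poisson answer $\hat \lambda = \bar X$ is easy to check numerically. The sketch below uses made-up counts (the list `counts` is hypothetical) and maximises the log-likelihood by a crude grid search over $\lambda\in(0,10],$ which is only a part of the parameter space chosen for convenience. The term $\log X_i!$ is computed as `math.lgamma(x + 1)`.

```python
import math

# Hypothetical count data, assumed to be a random sample from Poisson(lambda).
counts = [2, 0, 3, 1, 4, 2, 1, 3]
n = len(counts)
sample_mean = sum(counts) / n

def log_likelihood(lam):
    """l(lambda) = sum over the sample of log( e^{-lambda} lambda^x / x! )."""
    return sum(-lam + x * math.log(lam) - math.lgamma(x + 1) for x in counts)

# Crude grid search: lambda = 0.01, 0.02, ..., 10.00.
grid = [i / 100 for i in range(1, 1001)]
lam_hat = max(grid, key=log_likelihood)

print(lam_hat, sample_mean)
```

The grid maximiser agrees with the sample mean here because the sample mean happens to lie on the grid. In general a grid search only gets within one grid step of $\bar X.$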
There is no guarantee in general that the MLE procedure will
work well, or even work at all. It could be that $L(\theta)$
is unbounded above for $\theta\in \Theta$, or even if it is
bounded above, it may not attain its supremum (like the
function $g(x)=x$ over $x\in(0,1)$).
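In the Poisson example above we already have such a problem: the MLE does not exist
if all the $X_i$'s are 0, since then $L(\lambda) = e^{-n \lambda},$
which has no maximum over $\lambda>0$ (the supremum is approached as $\lambda\rightarrow 0$ but never attained)!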
However, in an overwhelming majority of cases, such problems do
not arise. There are many theorems providing sufficient
conditions under which the MLE works well. We shall not go into
those theorems in this basic course.
But let us understand intuitively what is meant by "works well"
here. Let $\theta_*\in \Theta$ be the true (unknown) value
of $\theta.$ Then one desirable property is that $\hat \theta
(X_1,...,X_n)\rightarrow \theta_* $ as $n\rightarrow \infty.$ This
property is called consistency. In a wide variety of
situations (again there are theorems giving sufficient
conditions), MLE is consistent.
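Consistency can be seen informally in a simulation. The sketch below assumes Poisson data with a true value $\lambda_*=3$ (chosen arbitrarily), uses the MLE $\hat \lambda = \bar X$ for Poisson data, and prints the estimate for increasing $n.$ The helper `poisson_draw` is a hypothetical name; it implements Knuth's multiplication method, which is adequate for small $\lambda.$ One simulation run is of course not a proof of anything.

```python
import math
import random

def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(0)   # fixed seed so that the run is reproducible
lam_true = 3.0           # assumed true parameter value
sample_sizes = [10, 100, 1000, 10000, 100000]

estimates = []
for n in sample_sizes:
    sample = [poisson_draw(lam_true, rng) for _ in range(n)]
    estimates.append(sum(sample) / n)   # MLE = sample mean

for n, est in zip(sample_sizes, estimates):
    print(n, est, abs(est - lam_true))
```

The error need not shrink at every single step, since each estimate is random, but for large $n$ it should typically be small.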
Another desirable property is that it should be precise. This may
be measured by its standard error (SE) (which is just a fancy
name for the standard deviation in case of an estimator). How
small can you make it? Well, you cannot make it negative! Can you
make it zero? Well, errr...yes, if we take our estimator to be
just a constant (like for a coin toss case, we always
report $\hat \theta = \frac 12$ without looking at the
data). Now that is of course a stupid estimator! So low standard
error by itself may lead to nonsense. But once we put more
reasonable conditions (like consistency), the standard error is
forced to be positive. There are theorems that give us some lower
bound for standard error under those conditions (e.g., Cramer-Rao bound). Any estimator
attaining that lower bound is called efficient. Is MLE always
efficient? Well, not necessarily. But it is often asymptotically
efficient, meaning that
$$
\frac{SE(MLE)}{\text{lower bound}}\rightarrow 1
$$
as $n\rightarrow \infty.$
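For a concrete feel, consider Poisson($\lambda$) data, where the MLE is $\bar X.$ It is a standard fact (stated here without proof) that the Cramer-Rao lower bound for the standard error of an unbiased estimator of $\lambda$ is $\sqrt{\lambda/n},$ and the standard error of $\bar X$ is exactly this. The sketch below estimates the standard error of the MLE by simulation and compares it with the bound. The values of $\lambda,$ $n$ and the number of replications are arbitrary choices, and `poisson_draw` is a hypothetical helper.

```python
import math
import random
import statistics

def poisson_draw(lam, rng):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

rng = random.Random(1)   # fixed seed so that the run is reproducible
lam_true, n, reps = 3.0, 50, 2000

# Compute the MLE (sample mean) on many independent samples of size n.
mles = [sum(poisson_draw(lam_true, rng) for _ in range(n)) / n
        for _ in range(reps)]

se_simulated = statistics.stdev(mles)     # simulated SE of the MLE
lower_bound = math.sqrt(lam_true / n)     # Cramer-Rao bound for the SE
ratio = se_simulated / lower_bound

print(se_simulated, lower_bound, ratio)
```

The ratio should come out close to 1, up to simulation noise.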
We still do not have enough mathematical tools at our disposal to make
these ideas any more precise. However, it is not difficult to see
intuitively why MLE is a reasonable thing to do. Indeed, this is
what a layperson would do anyway, as the following
example shows.
EXAMPLE:
Suppose that you have a coin, and you know that its probability
of head is either 0.9 or 0.1. You have tossed it 100 times, and
have obtained 87 heads and 13 tails. What will be your estimate
of the probability of head based on this?
Clearly, your estimate will be 0.9, because it is highly
unlikely that a coin with probability of head 0.1 would produce
87 heads out of 100 tosses.
Well, that is exactly the reasoning behind MLE. Here the parameter
space is $\Theta=\{0.1,0.9\}.$ The likelihood function is
$$
L(\theta) = \theta ^{87} (1-\theta)^{13}.
$$
Clearly, $L(0.9) > L(0.1),$ and so we go for $\hat \theta = 0.9.$
///
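To see just how lopsided the comparison in the coin example is, we may compute the two likelihood values directly. The sketch below does this for the two-point parameter space $\{0.1, 0.9\}$ and also prints the log of the ratio $L(0.9)/L(0.1),$ which works out algebraically to $74\log 9.$

```python
import math

heads, tails = 87, 13
parameter_space = [0.1, 0.9]

def likelihood(theta):
    """L(theta) = theta^87 (1 - theta)^13."""
    return theta ** heads * (1 - theta) ** tails

# MLE over a finite parameter space: just pick the value with larger likelihood.
theta_hat = max(parameter_space, key=likelihood)

# Log of the likelihood ratio L(0.9) / L(0.1).
log_ratio = math.log(likelihood(0.9)) - math.log(likelihood(0.1))

print(theta_hat, likelihood(0.9), likelihood(0.1), log_ratio)
```

No differentiation is involved here: with only two candidate values, maximisation is a plain comparison.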
For discrete data the likelihood is the probability of observing the
data for a given value of $\theta.$ So here the likelihood
is always between 0 and 1.
The interpretation is slightly more involved for the PDF
case. Recall that we are assuming the PDF to be continuous. Under
the continuity assumption we have this result (from the
fundamental theorem of calculus):
$$
\lim_{\delta\rightarrow 0+}\frac{P(X\in(x-\delta,x+\delta))}{2 \delta} = f(x).
$$
Now, when we measure a continuous variable, we do so only with a
finite precision. Suppose that this precision level is measured
by $\delta>0.$ That is, when we say that the measured value
of $X$ is $x$ we actually
mean $X\in(x-\delta,x+\delta).$ Now,
typically, $\delta$ is pretty small allowing us to assume
$$
P(X\in(x-\delta,x+\delta)) \approx 2 \delta f(x).
$$
In other words, the probability of the measured value
being $x$ is proportional to $f(x).$ Hence, in the
continuous PDF case also, the likelihood function gives the
probability of observing the data up to a constant of
proportionality.
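The quality of the approximation $P(X\in(x-\delta,x+\delta)) \approx 2 \delta f(x)$ is easy to check numerically. The sketch below assumes, purely for illustration, that $X$ has the standard normal distribution, and takes $x=0.5.$ The exact probability is computed from the CDF, written in terms of `math.erf`.

```python
import math

def pdf(x):
    """Standard normal PDF (an assumed example distribution)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def cdf(x):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

x = 0.5
deltas = [0.1, 0.01, 0.001]

exact = [cdf(x + d) - cdf(x - d) for d in deltas]   # P(X in (x-d, x+d))
approx = [2 * d * pdf(x) for d in deltas]           # 2 * d * f(x)
rel_errors = [abs(e - a) / e for e, a in zip(exact, approx)]

for d, e, a, r in zip(deltas, exact, approx, rel_errors):
    print(d, e, a, r)
```

The relative error shrinks as $\delta$ decreases, in line with the limit stated above.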