Population and sample
Basic workflow of statistics
What if more than one model gives comparably good fits?
Domain knowledge
Parsimony
Interpretability

Population and sample

The concepts of population and sample sit at the heart of how statistical regularity is used in statistics. In a nutshell:

Whatever data we collect is like a cup of water from a vast ocean. The cup of water is all that we have to base our inference on, but it is not the water in the cup that we want to draw inference about. The target of our inference is the entire ocean.

The statistical term for the cup of water is a sample, the ocean being called the population.

The very term population conjures up the vision of the totality of all the people living in a country. While this is indeed a very important example, statistics uses the term "population" in a broader sense. Suppose that I toss a coin. This is a random experiment that I can repeat as many times as I like. A statistician likes to think of this as drawing "head"s and "tail"s from an infinite population consisting of many, many "head"s and "tail"s. Since the population is infinite, we cannot really say that the chance of obtaining a "head" is the total number of heads divided by the population size. Instead, we pretend that God is handing out the "head"s and "tail"s randomly with certain probabilities. The "population" then is not just an infinite set; it is the entire random experiment.

This approach may appear a bit weird at first, and may take some time to digest. But that's how you should learn to think in order to study statistics.

The idea of statistics is to repeat the experiment a large number of times (or, equivalently, to draw a large sample from the population) and use statistical regularity to learn about the random experiment (or, equivalently, the population).
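As a rough illustration of this idea, the sketch below repeats a simulated coin toss many times and reports the observed fraction of heads. The function name `head_fraction`, the parameter `p_head` and the fixed seed are assumptions made for this example only; a pseudo-random generator merely stands in for the "random experiment".

```python
import random


def head_fraction(n_tosses, p_head=0.5, seed=0):
    """Repeat a coin toss n_tosses times and return the observed fraction of heads."""
    rng = random.Random(seed)  # fixed seed so that the run is reproducible
    heads = sum(rng.random() < p_head for _ in range(n_tosses))
    return heads / n_tosses


# Larger samples tend to give fractions closer to the underlying probability.
for n in (10, 100, 10_000, 100_000):
    print(n, head_fraction(n))
```

The sample (the tosses we actually made) is all we see, yet the fraction it produces tells us something about `p_head`, which belongs to the random experiment itself.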

Basic workflow of statistics

Many activities may be viewed as a blackbox: we put some input into it, and get some output out of it. In many cases the blackbox is unpredictable, in the sense that even when we put in the same input, the output differs unpredictably. Statistics starts by postulating an ideal form of unpredictability: a blackbox whose output is unpredictable, but which may be repeated as many times as we like. Such ideal blackboxes are called random experiments. Just like an ideal gas, a random experiment is an idealised concept that may not exist in practice. Some of its best approximations are a coin toss or a die roll. Next, we try to explain every other blackbox in terms of one or more random experiments, much like chemists trying to explain all chemicals in terms of elements. This is called statistical modelling.

EXAMPLE: If we measure the amount of dust or suspended particulate matter (SPM) in the air every day at the same location, we see random fluctuations in the values. Clearly, the values are not independent. Here is one way to statistically model the data:

Let $\epsilon_t = $ the amount of fresh SPM generated on day $t.$ We assume the $\epsilon_t$'s are IID from some random experiment. We link these with the observed data as follows: $$ X_t = \epsilon_t + \theta_1 X_{t-1} + \theta_2 X_{t-2}. $$ This has the interpretation that the amount of SPM on a given day is the residual SPM from the last two days plus the fresh SPM generated that day. The constants $\theta_1$ and $\theta_2$ are the fractions determining how much the SPM of each of the last two days influences today's SPM.

This model has three unknowns: the random experiment from which the $\epsilon_t$'s were generated, $\theta_1$ and $\theta_2.$

The job of the statistician is to collect lots of $X_t$'s (i.e., measure $X_t$'s over many days) and then somehow use statistical regularity to find these unknown quantities. ///
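To make the example concrete, here is a minimal sketch that simulates data from the model $X_t = \epsilon_t + \theta_1 X_{t-1} + \theta_2 X_{t-2}$ and then recovers $\theta_1$ and $\theta_2$ from the simulated values. Several things here are assumptions of the sketch and not part of the model above: the $\epsilon_t$'s are taken to be zero-mean Gaussian (real SPM amounts would be nonnegative, which would call for an extra intercept term), the estimation is done by ordinary least squares, and the names `simulate_ar2` and `fit_ar2` are made up for illustration.

```python
import random


def simulate_ar2(theta1, theta2, n, seed=0):
    """Simulate n values of X_t = eps_t + theta1*X_{t-1} + theta2*X_{t-2}."""
    rng = random.Random(seed)
    x = [0.0, 0.0]  # two starting values, dropped before returning
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)  # assumed noise: zero-mean Gaussian
        x.append(eps + theta1 * x[-1] + theta2 * x[-2])
    return x[2:]


def fit_ar2(x):
    """Least-squares estimates of (theta1, theta2), without an intercept."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(x)):
        y, u, v = x[t], x[t - 1], x[t - 2]
        s11 += u * u
        s12 += u * v
        s22 += v * v
        b1 += u * y
        b2 += v * y
    # Solve the 2x2 normal equations by Cramer's rule.
    det = s11 * s22 - s12 * s12
    theta1 = (b1 * s22 - b2 * s12) / det
    theta2 = (b2 * s11 - b1 * s12) / det
    return theta1, theta2


data = simulate_ar2(0.5, 0.3, 20_000)
print(fit_ar2(data))  # estimates should land near the true (0.5, 0.3)
```

With many days of data the estimates settle near the values used in the simulation, which is statistical regularity at work.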

From where did we get this model? Is there any theory that SPM indeed behaves in this way? Not really. It is just a model, a mathematically simple way to approximate the random behaviour of the $X_t$'s. In statistics we start by assuming some such model, and estimate the unknown quantities based on the data. Then we compare the actual data with the fitted model. If the fitted model explains the behaviour of the data well, then we are happy; else we look for some other model.

This is much like fitting a polynomial to a scatterplot. We start by fitting a straight line: $y = \alpha + \beta x,$ i.e., by choosing the values of $\alpha$ and $\beta$ that give the best possible fit (according to some suitable criterion). Then we draw this best line on the scatterplot, and decide if it is a good fit. The best fit need not be a good fit, just as the best swimmer in India is not a good swimmer according to the Olympic standard. If our best fitting line is indeed a good fit, we are happy. Otherwise we look for a different model, say all polynomials of the form $\alpha + \beta x + \gamma x^2.$ Again, the same ritual follows: we pick those values for the parameters $\alpha$, $\beta$ and $\gamma$ that give the best fit (within this class of models), and check its goodness-of-fit.
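The following sketch plays out this ritual on synthetic data. The data are generated from a quadratic with added noise (an assumption of the example), least squares is used as the fitting criterion, and the residual sum of squares (RSS) serves as a crude goodness-of-fit measure. The helpers `polyfit` and `rss` are written here for illustration and are not taken from any library.

```python
import random


def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    m = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)]
         + [sum(y * x ** i for x, y in zip(xs, ys))]
         for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]


def rss(xs, ys, coeffs):
    """Residual sum of squares of the polynomial with the given coefficients."""
    return sum((y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))


rng = random.Random(0)
xs = [i / 10 for i in range(50)]
ys = [1 + 2 * x + 0.5 * x * x + rng.gauss(0, 0.3) for x in xs]  # synthetic data

line = polyfit(xs, ys, 1)  # best member of the class alpha + beta*x
quad = polyfit(xs, ys, 2)  # best member of alpha + beta*x + gamma*x^2
rss_line, rss_quad = rss(xs, ys, line), rss(xs, ys, quad)
print("line RSS:", rss_line)
print("quadratic RSS:", rss_quad)
```

The line found here is the best straight line for these data, yet its RSS is much larger than that of the quadratic: the best fit within a class need not be a good fit.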

This is the general statistical workflow:

1. Decide upon a goodness-of-fit criterion.
2. Pick a (class of) models.
3. Pick the best fitting member of that class.
4. Check its goodness-of-fit.
5. If you are happy, then use it for prediction etc. If unhappy, pick another class of models, and repeat.
6. Give up, when you are bored!

A remarkably wide range of models may be used in step 2 above. But remember that your aim is to get a good fit, and not merely to showcase your modelling prowess! Getting a good fit is usually not easy, even with creative choices for the class of models.
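The workflow above can be sketched as a simple loop. Everything specific in this sketch is an assumption chosen for illustration: $R^2$ as the goodness-of-fit criterion, the threshold 0.9 for being "happy", the two model classes (constants and straight lines), the synthetic data, and the function names.

```python
import random
from statistics import mean


def fit_constant(xs, ys):
    """Best-fitting constant model: the mean of the y values."""
    c = mean(ys)
    return lambda x: c


def fit_line(xs, ys):
    """Best-fitting straight line alpha + beta*x by least squares."""
    mx, my = mean(xs), mean(ys)
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    alpha = my - beta * mx
    return lambda x: alpha + beta * x


def r_squared(xs, ys, predict):
    """Goodness-of-fit criterion assumed here: the usual R^2."""
    my = mean(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def workflow(xs, ys, model_classes, threshold=0.9):
    for name, fit in model_classes:         # pick a class of models
        predict = fit(xs, ys)               # pick its best fitting member
        score = r_squared(xs, ys, predict)  # check its goodness-of-fit
        if score >= threshold:              # happy: use it for prediction
            return name, predict, score
    return None                             # ran out of classes: give up


rng = random.Random(0)
xs = list(range(30))
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]  # synthetic data
result = workflow(xs, ys, [("constant", fit_constant), ("line", fit_line)])
print(result[0], result[2])
```

In practice the list of model classes is not fixed in advance; the statistician extends it by looking at how the previous fit fails.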

What if more than one model gives comparably good fits?

Then we choose one among them according to the following three basic guidelines:

domain knowledge
parsimony
interpretability

Domain knowledge

Statistics just deals with numbers, without caring for the story behind them. Consider the experiment for Boyle's law. Here we get different values for volume (V) for different values of pressure (P). The scatterplot shows a decreasing trend.

Part of Boyle's original data

Both a straight line and a rectangular hyperbola may appear to be a good fit. But a straight line is obviously impossible once you recall that you cannot make the volume negative by applying a bit of extra pressure!
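A small sketch shows how the straight line betrays itself. The numbers below are synthetic, generated from $V = k/P$ with $k = 24$ plus noise in arbitrary units; they are not Boyle's measurements. Both models are fitted by least squares, and all variable names are chosen for this example only.

```python
import random
from statistics import mean

rng = random.Random(0)
pressures = [1.0 + 0.1 * i for i in range(11)]              # 1.0 to 2.0
volumes = [24.0 / p + rng.gauss(0, 0.2) for p in pressures]  # synthetic data

# Straight line V = a + b*P by least squares.
mp, mv = mean(pressures), mean(volumes)
b = (sum((p - mp) * (v - mv) for p, v in zip(pressures, volumes))
     / sum((p - mp) ** 2 for p in pressures))
a = mv - b * mp

# Rectangular hyperbola V = k/P by least squares in k.
k = (sum(v / p for p, v in zip(pressures, volumes))
     / sum(1 / p ** 2 for p in pressures))


def line_predict(p):
    return a + b * p


def hyperbola_predict(p):
    return k / p


# Extrapolate both fitted models to a pressure outside the observed range.
print("line at P=5:", line_predict(5.0))
print("hyperbola at P=5:", hyperbola_predict(5.0))
```

Within the observed range both curves follow the decreasing trend, but extrapolated to a higher pressure the fitted line predicts a negative volume, while the hyperbola stays positive.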

Parsimony

"Parsimony" means "miserliness" (being unwilling to spend money). Sometimes we see that two models producing comparably good fits are of different levels of complexity. Then we naturally choose the one thatis simpler. This is the principle of parsimony. It is also called Occam's razor principle.

Interpretability