Set up: We have bivariate
data $(x_1,y_1),...,(x_n,y_n).$ Suppose that the scatterplot
shows a linear pattern. We want to fit a straight line of the
form $y = \alpha + \beta x$ to the data. We want our line to
pass "as close as possible to all the points". This
is a rather vague specification. There are a number of ways to
make it precise. The most popular among them is the least
squares approach. Suppose that we want to predict the value
of $y$ for $x = x_i$
using the equation $y = \alpha + \beta x.$ The predicted
value would be $\hat y_i = \alpha + \beta x_i.$ We measure the
error of this prediction, regardless of its sign, by the squared
difference between $y_i$ and $\hat y_i$:
$$
(y_i-\hat y_i)^2 = (y_i - \alpha - \beta x_i)^2.
$$
Then the total error is
$$
\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 =
S(\alpha,\beta),\text{ say.}
$$
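To make the definition concrete, here is a minimal Python sketch that evaluates $S(\alpha,\beta)$ for a given line. The function name `total_squared_error` and the small data set are our own choices for illustration; they are not part of the set up above.

```python
def total_squared_error(xs, ys, alpha, beta):
    """Return S(alpha, beta) = sum over i of (y_i - alpha - beta * x_i)^2."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

# The true line leaves no error at all.
print(total_squared_error(xs, ys, 1, 2))  # prints 0

# Shifting the intercept by 1 makes every residual equal to 1.
print(total_squared_error(xs, ys, 0, 2))  # prints 4
```

A line that fits the data worse gives a larger value of $S,$ which is why $S$ is a reasonable quantity to minimise.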
We want to choose $\alpha,\beta $ so
that $S(\alpha,\beta)$ is minimised. This is called the
least squares approach. We shall now outline two ways to
minimise $S(\alpha,\beta).$
First we differentiate $S(\alpha,\beta)$ partially
with respect to $\alpha $ and $\beta $ and equate the partial
derivatives to zero. This gives two equations
$$
\frac{\partial S}{\partial \alpha} = -2\sum(y_i-\alpha - \beta x_i)
= 0,
$$
and
$$
\frac{\partial S}{\partial \beta } = -2\sum x_i(y_i-\alpha - \beta x_i)
= 0.
$$
Remember that our unknowns are $\alpha$ and $\beta,$
while the $x_i$'s and $y_i$'s are all known. So these
are two linear equations in two unknowns. In matrix form
these are
$$
\left[\begin{array}{cc}
n & \sum x_i \\ \sum x_i & \sum x_i^2
\end{array}\right]\left[\begin{array}{c}\alpha\\\beta
\end{array}\right] =
\left[\begin{array}{c}\sum y_i\\ \sum x_i y_i
\end{array}\right].
$$
Here the coefficient matrix is nonsingular if and only
if $\frac 1n\sum x_i^2-(\overline x)^2\neq 0.$ The left hand side
is the variance of the $x_i$'s, which is zero only when all
the $x_i$'s are equal. So this condition is natural: if it
fails, all the points lie on the same vertical line, and the
slope of a vertical line is undefined.
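The two linear equations can also be solved numerically. The following is only a sketch, assuming Cramer's rule for the $2\times 2$ system (our choice, any linear solver would do). The name `solve_normal_equations` is hypothetical. The determinant computed here is $n\sum x_i^2 - (\sum x_i)^2,$ which is $n^2$ times the quantity in the nonsingularity condition.

```python
def solve_normal_equations(xs, ys):
    """Solve the 2x2 system for (alpha, beta) by Cramer's rule."""
    n = len(xs)
    sx = sum(xs)                               # sum of x_i
    sy = sum(ys)                               # sum of y_i
    sxx = sum(x * x for x in xs)               # sum of x_i^2
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum of x_i * y_i

    # Determinant of the coefficient matrix [[n, sx], [sx, sxx]].
    det = n * sxx - sx * sx
    if det == 0:
        # All x_i are equal: the points lie on one vertical line.
        raise ValueError("coefficient matrix is singular")

    alpha = (sy * sxx - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    return alpha, beta


# Hypothetical data lying exactly on y = 1 + 2x.
print(solve_normal_equations([1, 2, 3, 4], [3, 5, 7, 9]))  # prints (1.0, 2.0)
```

The exact comparison `det == 0` is adequate for integer data like this; with floating point data a small tolerance would be safer.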
Solving we get
$$
\hat \beta = \frac{\frac 1n\sum x_i y_i- \overline x\,\overline y}{\frac 1n\sum
x_i^2-(\overline x)^2 },
$$
and then $\hat \alpha $ may be obtained from
$$
\overline y = \hat \alpha + \hat \beta \overline x.
$$
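As a sketch of how these formulas might be used in practice, the Python function below computes the slope as $\left(\frac 1n\sum x_i y_i - \overline x\,\overline y\right)\big/\left(\frac 1n\sum x_i^2 - (\overline x)^2\right)$ and then the intercept from $\overline y = \hat\alpha + \hat\beta\overline x.$ The name `fit_line` and the data are our own illustrative choices. The last two lines check that the fitted line satisfies both equations obtained from the partial derivatives.

```python
def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat) for the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n

    var_x = mean_xx - x_bar ** 2      # zero only if all x_i are equal
    if var_x == 0:
        raise ValueError("all x values are equal; slope is undefined")

    beta_hat = (mean_xy - x_bar * y_bar) / var_x
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical data with some scatter around a line.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)  # approximately 2.2 and 0.6

# Both sums below should vanish, up to rounding error.
residuals = [y - alpha_hat - beta_hat * x for x, y in zip(xs, ys)]
print(sum(residuals))
print(sum(x * r for x, r in zip(xs, residuals)))
```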
Now, equating the first derivatives to zero only ensures a
stationary point. We still do not know whether it is a maximum,
a minimum or something else and, even if it is a minimum, whether
it is a global minimum or just a local one. Second derivative
tests will help resolve the first question, but not the second.
We shall not discuss this any further here, because we do not
yet have the necessary mathematical tools at our disposal.
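In the meantime, a numerical sanity check is possible. The sketch below compares $S$ at the stationary point with $S$ at a few nearby points, for one hypothetical data set. This is evidence, not a proof: it examines only finitely many points and only one data set. The function names and the grid of offsets are our own choices.

```python
def total_squared_error(xs, ys, alpha, beta):
    """Return S(alpha, beta) = sum over i of (y_i - alpha - beta * x_i)^2."""
    return sum((y - alpha - beta * x) ** 2 for x, y in zip(xs, ys))


def fit_line(xs, ys):
    """Return the stationary point (alpha_hat, beta_hat) of S."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    mean_xx = sum(x * x for x in xs) / n
    beta_hat = (mean_xy - x_bar * y_bar) / (mean_xx - x_bar ** 2)
    return y_bar - beta_hat * x_bar, beta_hat


def is_no_worse_nearby(xs, ys, alpha, beta, offsets):
    """True if no shifted point (alpha + da, beta + db) has a smaller S."""
    best = total_squared_error(xs, ys, alpha, beta)
    for da in offsets:
        for db in offsets:
            shifted = total_squared_error(xs, ys, alpha + da, beta + db)
            if shifted < best - 1e-12:   # small tolerance for rounding
                return False
    return True


# Hypothetical data and a hypothetical grid of offsets.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
offsets = [-0.5, -0.25, 0.0, 0.25, 0.5]

alpha_hat, beta_hat = fit_line(xs, ys)
print(is_no_worse_nearby(xs, ys, alpha_hat, beta_hat, offsets))  # prints True
print(is_no_worse_nearby(xs, ys, 0.0, 0.0, offsets))             # prints False
```

The second call shows that the check is not vacuous: at an arbitrary point such as $(0,0)$ some nearby point does give a smaller value of $S.$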