Chapter 2 Statistics, Probability and Matrix Algebra Preliminaries
In the first part of this chapter, we review some concepts and methods from statistics, probability and matrix algebra that you are expected to know before taking this class. In the second part, we introduce new material, such as random fields and their related properties.
2.1 Matrix Algebra
Refer to the supplementary materials “supplementary-linear algebra.pdf”.
2.2 Multivariate Gaussian Distribution
A real-valued Gaussian random variable \(X\) (also called a normal random variable) with mean \(\mu\) and variance \(\sigma^2\) has the following density function \[f(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}.\] It is usually denoted by \(X\sim N(\mu, \sigma^2)\). For example, the density of a Gaussian random variable with mean zero and variance \(1\) is the familiar bell curve centered at zero; a plotting sketch is given below.

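As a quick illustration, here is a minimal sketch for evaluating and plotting the \(N(0,1)\) density from the formula above; it assumes `numpy` and `matplotlib` are available, and all variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Evaluate the N(mu, sigma^2) density on a grid and plot it (here mu = 0, sigma = 1).
mu, sigma = 0.0, 1.0
x = np.linspace(-4, 4, 401)
density = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

plt.plot(x, density)
plt.xlabel("x")
plt.ylabel("f(x)")
plt.title("Density of N(0, 1)")
plt.show()
```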
Now suppose that we have \(n\) independent Gaussian random variables \(X_1,...,X_n\) with \(X_i \sim N(\mu_i, \sigma_i^2).\) Let \({\mathbf{X}}= (X_1,...,X_n)^\top\). Then the joint density of \({\mathbf{X}}\) is just the product of the densities of the \(X_i\)s, that is \[f_{{\mathbf{X}}}(x_1,...,x_n) = (2\pi)^{-n/2}\prod_{i=1}^n \sigma_i^{-1}\exp\left\{-\sum_{i=1}^n \frac{(x_i-\mu_i)^2}{2\sigma_i^2}\right\}.\] In the real world, however, \(X_1,...,X_n\) are most of the time correlated with each other; such a vector can be generated as \({\mathbf{X}}= {\mathbf{A}}{\mathbf{e}}\), where \({\mathbf{A}}\) is an \(n\times n\) matrix and \({\mathbf{e}}\) is a vector of independent Gaussian random variables. So we need to generalize the Gaussian distribution to the multivariate case with dependence among the components. A random vector \({\mathbf{X}}=(X_1,...,X_n)^\top\) is said to follow a multivariate Gaussian distribution with mean vector \({\boldsymbol{\mu}}\) and (positive definite) covariance matrix \(\Sigma\), written \({\mathbf{X}}\sim N_n({\boldsymbol{\mu}},\Sigma)\), if its joint density is \[f_{{\mathbf{X}}}({\mathbf{x}}) = (2\pi)^{-n/2}|\Sigma|^{-1/2}\exp\left\{-\frac{1}{2}({\mathbf{x}}-{\boldsymbol{\mu}})^\top \Sigma^{-1}({\mathbf{x}}-{\boldsymbol{\mu}})\right\}.\]
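To see how correlated Gaussian variables arise from the construction \({\mathbf{X}}={\mathbf{A}}{\mathbf{e}}\), here is a minimal simulation sketch (assuming `numpy`; the matrix \({\mathbf{A}}\) is an arbitrary illustrative choice). The empirical covariance of the simulated vectors should be close to \({\mathbf{A}}{\mathbf{A}}^\top\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = np.array([[1.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 1.0]])      # any n x n matrix; X = A e has covariance A A^T

e = rng.standard_normal((n, 10000))  # each column is an independent N(0, I) vector
X = A @ e                            # each column is one realization of X

print("theoretical covariance:\n", A @ A.T)
print("empirical covariance:\n", np.cov(X))   # rows of X are treated as variables
```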
2.3 Generalized/Ordinary Least Squares
Let \({\mathbf{Y}}=(y_1,...,y_n)^\top\), \({\boldsymbol{\beta}}=(\beta_1,...,\beta_p)^\top\), \({\boldsymbol{\epsilon}}=(\epsilon_1,...,\epsilon_n)^\top\) and \[{\mathbf{X}}= \begin{pmatrix} x_{11} & x_{12} &\cdots & x_{1p}\\ x_{21} & x_{22} & \cdots & x_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ x_{n1} & x_{n2} &\cdots & x_{np} \end{pmatrix}.\]
Consider the following linear model \[\begin{align} \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, \tag{2.1} \end{align}\]where \({\mathbb{E}}({\boldsymbol{\epsilon}}) = \mathbf{0}\) and \(\text{Var}({\boldsymbol{\epsilon}}) = \Sigma\).
Suppose that the \(y_i\)s are independent with common variance \(\sigma^2\), so that \(\text{Var}({\boldsymbol{\epsilon}}) = \sigma^2 {\mathbf{I}}\). In this case, the best linear unbiased estimator (BLUE) of \({\boldsymbol{\beta}}\) is the ordinary least squares (OLS) estimator, \[\hat {\boldsymbol{\beta}}_{OLS}= ({\mathbf{X}}^\top {\mathbf{X}})^{-1}{\mathbf{X}}^\top {\mathbf{Y}},\] which is obtained by solving \[\min_{{\boldsymbol{\beta}}} ({\mathbf{Y}}-{\mathbf{X}}{\boldsymbol{\beta}})^\top ({\mathbf{Y}}-{\mathbf{X}}{\boldsymbol{\beta}}).\]
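The following minimal sketch computes \(\hat{\boldsymbol{\beta}}_{OLS}\) on simulated data with independent errors; it assumes `numpy`, and all names (design matrix, true coefficients) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
Y = X @ beta_true + rng.standard_normal(n)   # independent errors, common variance

# OLS estimator (X^T X)^{-1} X^T Y, computed by solving the normal equations
# rather than forming an explicit inverse.
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_ols)
```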
Now suppose that the \(y_i\)s are correlated, with \(\text{Var}({\boldsymbol{\epsilon}}) = \Sigma\) for a general positive definite covariance matrix \(\Sigma\). In this case, what is the BLUE of \({\boldsymbol{\beta}}\)?
Perform the Cholesky decomposition of \(\Sigma\): \(\Sigma = {\mathbf{L}}{\mathbf{L}}^\top\). Then \({\mathbf{e}}={\mathbf{L}}^{-1}{\boldsymbol{\epsilon}}\) is an \(n\)-dimensional random vector with uncorrelated components and \({\boldsymbol{\epsilon}}= {\mathbf{L}}{\mathbf{e}}\). The above linear model can therefore be transformed to \[{\mathbf{L}}^{-1}{\mathbf{Y}}= {\mathbf{L}}^{-1}{\mathbf{X}}{\boldsymbol{\beta}}+ {\mathbf{e}}.\] Since \(\text{Var}({\mathbf{e}}) = {\mathbf{I}}\), the BLUE of \({\boldsymbol{\beta}}\) is the OLS estimator for the transformed model, that is \[ \hat{\boldsymbol{\beta}}_{GLS} = (({\mathbf{L}}^{-1}{\mathbf{X}})^\top({\mathbf{L}}^{-1}{\mathbf{X}}))^{-1}({\mathbf{L}}^{-1}{\mathbf{X}})^\top {\mathbf{L}}^{-1}{\mathbf{Y}}= ({\mathbf{X}}^\top \Sigma^{-1}{\mathbf{X}})^{-1}{\mathbf{X}}^\top \Sigma^{-1}{\mathbf{Y}}. \] \(\hat{\boldsymbol{\beta}}_{GLS}\) is called the generalized least squares (GLS) estimator of \({\boldsymbol{\beta}}\).
The variance of the GLS estimator is \[\text{Var}(\hat{\boldsymbol{\beta}}_{GLS}) = ({\mathbf{X}}^\top \Sigma^{-1}{\mathbf{X}})^{-1}.\]
In real applications, the covariance \(\Sigma\) is usually unknown. We have to model the covariance and obtain an estimator \(\hat \Sigma\) first; \(\hat{\boldsymbol{\beta}}_{GLS}\) is then obtained by plugging \(\hat \Sigma\) into the above formula.
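Here is a minimal sketch of the GLS computation via the Cholesky factor of \(\Sigma\) (assuming `numpy`; the AR(1)-type \(\Sigma\) is just an illustrative choice of a valid covariance matrix, not a model from the text).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([1.0, 2.0])

# An AR(1)-type covariance for the errors: Sigma_{ij} = rho^{|i-j|}.
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

L = np.linalg.cholesky(Sigma)                      # Sigma = L L^T
Y = X @ beta_true + L @ rng.standard_normal(n)     # correlated errors epsilon = L e

# Whiten the model with L^{-1} and apply OLS to the transformed data.
Xw = np.linalg.solve(L, X)                         # L^{-1} X
Yw = np.linalg.solve(L, Y)                         # L^{-1} Y
beta_gls = np.linalg.solve(Xw.T @ Xw, Xw.T @ Yw)
print(beta_gls)

# Variance of the GLS estimator: (X^T Sigma^{-1} X)^{-1} = (Xw^T Xw)^{-1}.
print(np.linalg.inv(Xw.T @ Xw))
```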
2.4 Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is often used to estimate parametric models. Here, we still consider the linear model in (2.1). Let \({\boldsymbol{\theta}}=(\theta_1,...,\theta_q)^\top \in \Theta\), where \(\Theta\) is the parameter space, and suppose that the covariance \(\Sigma\) is a function of \({\boldsymbol{\theta}}\), written \(\Sigma({\boldsymbol{\theta}})\). To apply MLE, we need to know the distribution of \({\mathbf{Y}}\); here, we assume \({\mathbf{Y}}\) follows a multivariate Gaussian distribution. Then the MLE of \({\boldsymbol{\beta}}\) is obtained by maximizing the log likelihood function, \[\log \ell({\boldsymbol{\beta}},{\boldsymbol{\theta}}; {\mathbf{Y}}) = -\frac{n}{2}\log(2\pi)-\frac{1}{2}\log|\Sigma({\boldsymbol{\theta}})| - \frac{1}{2}({\mathbf{Y}}-{\mathbf{X}}{\boldsymbol{\beta}})^\top \Sigma^{-1}({\boldsymbol{\theta}})({\mathbf{Y}}-{\mathbf{X}}{\boldsymbol{\beta}}).\] For any fixed \({\boldsymbol{\theta}}\in \Theta\), the unique \({\boldsymbol{\beta}}\) that maximizes \(\log \ell({\boldsymbol{\beta}},{\boldsymbol{\theta}};{\mathbf{Y}})\) is \[\hat {\boldsymbol{\beta}}_{MLE} = ({\mathbf{X}}^\top \Sigma^{-1}({\boldsymbol{\theta}}){\mathbf{X}})^{-1}{\mathbf{X}}^\top \Sigma^{-1}({\boldsymbol{\theta}}){\mathbf{Y}},\] which is the GLS estimator of \({\boldsymbol{\beta}}\) when \(\Sigma({\boldsymbol{\theta}})\) is the actual covariance of \({\mathbf{Y}}\).
Plugging \(\hat {\boldsymbol{\beta}}_{MLE}\) into \(\log \ell({\boldsymbol{\beta}},{\boldsymbol{\theta}}; {\mathbf{Y}})\), we obtain (up to an additive constant) the profile log likelihood \[L({\boldsymbol{\theta}}; {\mathbf{Y}}) = -\frac{1}{2}\log|\Sigma({\boldsymbol{\theta}})| -\frac{1}{2}{\mathbf{Y}}^\top\left\{\Sigma^{-1}({\boldsymbol{\theta}})-\Sigma^{-1}({\boldsymbol{\theta}}){\mathbf{X}}({\mathbf{X}}^\top \Sigma^{-1}({\boldsymbol{\theta}}){\mathbf{X}})^{-1}{\mathbf{X}}^\top \Sigma^{-1}({\boldsymbol{\theta}})\right\}{\mathbf{Y}}.\] \(\hat {\boldsymbol{\theta}}_{MLE}\) is obtained by maximizing \(L({\boldsymbol{\theta}};{\mathbf{Y}})\). Most of the time there is no closed-form solution, so in general it is obtained with numerical methods.
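The sketch below illustrates this two-step procedure numerically: it profiles out \({\boldsymbol{\beta}}\) and maximizes \(L({\boldsymbol{\theta}};{\mathbf{Y}})\) over a scalar \(\theta\). It assumes `numpy` and `scipy`, and the exponential covariance \(\Sigma(\theta)_{ij}=\exp(-|s_i-s_j|/\theta)\) on 1D locations is only an illustrative parametric model, not one prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
n = 60
s = np.sort(rng.uniform(0, 10, n))                 # 1D locations
X = np.column_stack([np.ones(n), s])               # design matrix
beta_true, theta_true = np.array([1.0, 0.3]), 2.0

def Sigma(theta):
    # Exponential covariance: Sigma_{ij} = exp(-|s_i - s_j| / theta).
    return np.exp(-np.abs(np.subtract.outer(s, s)) / theta)

chol = np.linalg.cholesky(Sigma(theta_true))
Y = X @ beta_true + chol @ rng.standard_normal(n)  # simulate Gaussian data

def neg_profile_loglik(theta):
    S = Sigma(theta)
    Si = np.linalg.inv(S)
    beta = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)  # GLS/MLE of beta given theta
    r = Y - X @ beta
    sign, logdet = np.linalg.slogdet(S)
    return 0.5 * logdet + 0.5 * r @ Si @ r              # equals -L(theta; Y) up to a constant

res = minimize_scalar(neg_profile_loglik, bounds=(0.1, 10.0), method="bounded")
print("approximate MLE of theta:", res.x)
```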
2.5 Random Fields for Spatial Models
In a statistical analysis, we usually try to describe the data with a probabilistic model. As a very simple example, suppose that we randomly select \(n\) adults in the US and measure their heights, say \(x_1,...,x_n\). We usually model the height of an adult as a random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\). Let \(F(\cdot)\) be the distribution of \(X\); it is usually assumed to be a Gaussian distribution. Then we develop statistical methods to estimate the mean, the variance, or even the distribution \(F\) if it is unknown. Here, any data point is considered a possible outcome of \(X\) generated by random sampling. In other words, the probabilistic model provides the data generation mechanism.
Now consider the case of spatial data. What kind of probabilistic model shall we use? Suppose that we observe temperatures \(y({\mathbf{s}}_1),...,y({\mathbf{s}}_n)\) at locations \({\mathbf{s}}_1, ..., {\mathbf{s}}_n\) inside a region \(\mathcal{D}\) at time \(t\), and we would like to predict the temperatures at other locations in \(\mathcal{D}\).
Here is a reasonable idea. Prior to time \(t\), the temperature at any point \({\mathbf{s}}\) in \(\mathcal{D}\), say \(Y({\mathbf{s}})\), can be considered a continuous random variable that may take any possible value. Hence the temperatures over all points in \(\mathcal{D}\) can be modeled as a random surface. The set of all possible \(3D\) surfaces can be taken as the sample space \(\mathcal{S}\), and the temperature surface realized at time \(t\) is one element of \(\mathcal{S}\). To measure uncertainty, we should also model the probabilities of events in the sample space \(\mathcal{S}\). Further, at time \(t\) we have \(n\) observations of the potential surface of \(Y\). To predict \(Y\) at other locations, we should consider the distribution of those \(3D\) surfaces in \(\mathcal{S}\) whose values at \({\mathbf{s}}_1,...,{\mathbf{s}}_n\) are fixed at the observed values \(y({\mathbf{s}}_1),...,y({\mathbf{s}}_n)\).
A probability tool that implements the above modeling idea is the random process, that is, a collection of random variables indexed by a given set \(\mathcal{D}\), say \(Y=\{Y({\mathbf{s}}), {\mathbf{s}}\in \mathcal{D}\}\). Here, the index set \(\mathcal{D}\) could be discrete or continuous, finite or infinite. If \(\mathcal{D}\) is a domain in a Euclidean space with dimension greater than one, \(Y\) is called a random field. Since spatial data are usually collected in a \(2D\) or \(3D\) domain, we’ll use the name “random field” rather than “random process” in the following, even when the data were collected in a \(1D\) domain.
How do we define the distribution of a random field, given that its potential value is a curve or a surface? That would be a probability measure defined on a function space, which is abstract and beyond the scope of our class. Instead, we consider a family of finite dimensional distribution functions (F.D.D.): for any \(n\) and any \({\mathbf{s}}_1,...,{\mathbf{s}}_n \in \mathcal{D}\), the joint distribution of \(Y({\mathbf{s}}_1),...,Y({\mathbf{s}}_n)\) is modeled by \[F(y_1,...,y_n) = \mathbb{P}(Y({\mathbf{s}}_1) < y_1,...,Y({\mathbf{s}}_n) < y_n),\] where each \(F\) is a well-defined distribution function on the finite dimensional space \(\mathbb{R}^n\). By the Kolmogorov consistency theorem, a consistent family of F.D.D. uniquely determines the probability measure of the stochastic process.
The F.D.D. are sufficient for most estimation and prediction problems in spatial analysis. Hence, most of the time we do not need to characterize the probability measure of the whole process, especially in applications.
A very popular model in spatial statistics is the Gaussian random field, a random field whose F.D.D. are multivariate Gaussian distributions, that is \[\mathbb{P}(Y({\mathbf{s}}_1) < y_1,...,Y({\mathbf{s}}_n) < y_n) =\int_{-\infty}^{y_1}\cdots\int_{-\infty}^{y_n} (2\pi)^{-n/2}|\Sigma|^{-1/2}e^{-\frac{1}{2}({\mathbf{x}}-{\boldsymbol{\mu}})^\top \Sigma^{-1}({\mathbf{x}}-{\boldsymbol{\mu}})}d{\mathbf{x}},\] where \({\mathbf{x}}=(x_1,...,x_n)^\top\), \({\boldsymbol{\mu}}\) is the vector of means of \(Y\) at locations \({\mathbf{s}}_1,...,{\mathbf{s}}_n\), and \(\Sigma\) is the covariance matrix of \((Y({\mathbf{s}}_1),...,Y({\mathbf{s}}_n))^\top\).
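Since the F.D.D. at any finite set of locations is multivariate Gaussian, one realization of a Gaussian random field at those locations can be simulated by drawing from \(N({\boldsymbol{\mu}},\Sigma)\). The sketch below does this on a regular grid, assuming `numpy`; the zero mean and the exponential covariance function are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
grid = np.linspace(0, 1, 20)
sx, sy = np.meshgrid(grid, grid)
S = np.column_stack([sx.ravel(), sy.ravel()])                 # n x 2 matrix of locations

d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)    # pairwise distances
Sigma = np.exp(-d / 0.3)                                      # exponential covariance matrix
mu = np.zeros(len(S))                                         # constant mean function 0

# One realization of (Y(s_1), ..., Y(s_n)) ~ N(mu, Sigma); the tiny jitter keeps the
# Cholesky factorization numerically stable.
chol = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(S)))
Y = mu + chol @ rng.standard_normal(len(S))
Y_surface = Y.reshape(20, 20)    # the simulated "surface" on the 20 x 20 grid
```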
From this definition, we see that a Gaussian random field should have a well-defined mean function \(\{\mu({\mathbf{s}}),{\mathbf{s}}\in \mathcal{D}\}\) and a valid covariance function \(C({\mathbf{s}},{\mathbf{s}}')\) for any \({\mathbf{s}},{\mathbf{s}}'\in\mathcal{D}\).
As is well known, the distribution of a Gaussian random variable is determined by its mean and variance. Likewise, by the above definition, the distribution of a Gaussian random field is determined by its mean function \(\mu({\mathbf{s}})\) and covariance function \(C({\mathbf{s}},{\mathbf{s}}')\).
In general, there is no restriction on the mean function \(\mu({\mathbf{s}})\). However, for a non-degenerate random field, the covariance function \(C({\mathbf{s}},{\mathbf{s}}')\) should be positive definite, that is, for any \(n\), any \({\mathbf{s}}_1,...,{\mathbf{s}}_n\in \mathcal{D}\) and any \(a_1,...,a_n\), \[\sum_{i,j=1}^n a_ia_jC({\mathbf{s}}_i,{\mathbf{s}}_j) \geq 0,\] with equality only when \(a_1=\cdots=a_n=0\). The nonnegativity follows from \[\sum_{i,j=1}^n a_ia_jC({\mathbf{s}}_i,{\mathbf{s}}_j) =\text{Var}\left(\sum_{i=1}^n a_iY({\mathbf{s}}_i)\right)\geq 0,\] and non-degeneracy rules out a zero variance for a nontrivial linear combination.
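Positive definiteness can be checked numerically on any finite set of locations by building the matrix \(C({\mathbf{s}}_i,{\mathbf{s}}_j)\) and inspecting its eigenvalues, as in the following sketch (assuming `numpy`; the exponential covariance is an illustrative valid choice).

```python
import numpy as np

rng = np.random.default_rng(5)
S = rng.uniform(0, 1, size=(30, 2))                           # 30 random 2D locations
d = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)    # pairwise distances

C = np.exp(-d / 0.5)                        # exponential covariance function C(s, s')
print(np.linalg.eigvalsh(C).min() > 0)      # True: the matrix is positive definite

a = rng.standard_normal(30)
print(a @ C @ a >= 0)                       # the quadratic form sum a_i a_j C(s_i, s_j) >= 0
```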
Summary:
A random field model is appropriate for modeling spatial data; among random fields, Gaussian random fields are the most popular models in spatial statistics.
The distribution of a random field is determined by its finite dimensional distributions (F.D.D.). For a Gaussian random field, the finite dimensional distributions are multivariate Gaussian distributions.
The mean of a random field is a function defined on the domain \(\mathcal{D}\), say \(\mu({\mathbf{s}}),{\mathbf{s}}\in\mathcal{D}\); the covariance of a random field is a function defined on the product space \(\mathcal{D}\times \mathcal{D}\), say \(C({\mathbf{s}},{\mathbf{s}}'), {\mathbf{s}},{\mathbf{s}}'\in \mathcal{D}\). The distribution of a Gaussian random field is determined by its mean function and covariance function.
Only positive definite functions on \(\mathcal{D}\times \mathcal{D}\) can be used to model the covariance function of a random field.
2.6 Stationarity and Isotropy
In the previous section, we introduced random field models for spatial data. The spatial data collected at finitely many locations are just a limited number of values from one possible surface of the random field. That means we do not have any replications (independent realizations of the random field). With such limited information, estimating the mean function, covariance function, or finite dimensional distributions is hopeless without further assumptions. We have to impose some restrictions or structures on the random field models.
A random field is called strictly stationary if its finite dimensional distributions are invariant under shifts, that is, for any lag \({\mathbf{h}}\in {\mathbb{R}}^d\) and any locations \({\mathbf{s}}_1,...,{\mathbf{s}}_n\), \[{\mathbb{P}}(Y({\mathbf{s}}_1)<y_1, ..., Y({\mathbf{s}}_n) < y_n) = {\mathbb{P}}(Y({\mathbf{s}}_1+{\mathbf{h}}) < y_1, ..., Y({\mathbf{s}}_n+{\mathbf{h}}) < y_n).\] The strict stationarity assumption is relatively strong. In applications, we usually use a weakly stationary random field, which is a random field whose mean function is constant over the domain and whose covariance function is invariant under shifts; specifically, for any \({\mathbf{s}}\in\mathcal{D}\) and any lag \({\mathbf{h}}\), \[{\mathbb{E}}(Y({\mathbf{s}})) = \mu, \ \ \ \ \text{Cov}(Y({\mathbf{s}}), Y({\mathbf{s}}+{\mathbf{h}})) = \text{Cov}(Y(\mathbf{0}), Y({\mathbf{h}}))\equiv C({\mathbf{h}}).\] From the definition, we see that \[\text{Var}(Y({\mathbf{s}})) = C(\mathbf{0}).\] Hence, weakly stationary random fields have constant marginal variance.
For a Gaussian random field, weak stationarity implies strict stationarity, because its finite dimensional distributions are completely determined by the mean and covariance functions.
Further, if the covariance function of a weakly stationary random field depends only on the distance between two locations, that is, \[\text{Cov}(Y({\mathbf{s}}), Y({\mathbf{s}}+{\mathbf{h}})) \equiv C(||{\mathbf{h}}||), \ \ ||{\mathbf{h}}|| = \sqrt{h_1^2+...+h_d^2},\] then the random field is called isotropic.
Example 2.3 (Matérn random fields) The Matérn covariance function \(C({\mathbf{h}}|\sigma^2, \theta,\nu)\), where \(\sigma^2 > 0\), \(\theta>0\) and \(\nu>0\) are the marginal variance, range and smoothness parameters, respectively, is widely used to model covariance structures in spatial statistics. It is defined as \[C({\mathbf{h}}|\sigma^2, \theta,\nu):=\frac{2^{1-\nu}\sigma^2}{\Gamma(\nu)}(\theta||{\mathbf{h}}||)^\nu K_\nu(\theta ||{\mathbf{h}}||),\] where \(K_\nu\) is a modified Bessel function of the second kind (see, e.g., Cressie (1993)). For example, with \(\sigma^2 = 1\), \(\theta = 1\) and \(\nu = 0.5\), the Matérn covariance reduces to the exponential covariance \(e^{-||{\mathbf{h}}||}\).
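A minimal implementation sketch of this Matérn covariance function is given below, using `numpy` and `scipy.special` (where `kv` is the modified Bessel function of the second kind and `gamma` is the gamma function); the check at the end uses the fact that \(\nu = 0.5\) gives the exponential covariance.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(h, sigma2=1.0, theta=1.0, nu=0.5):
    """Matern covariance C(h | sigma2, theta, nu) as defined above."""
    h = np.atleast_1d(np.asarray(h, dtype=float))
    c = np.full(h.shape, sigma2)                  # C(0) = sigma^2 (limit as h -> 0)
    nz = h > 0
    x = theta * h[nz]
    c[nz] = sigma2 * 2.0 ** (1 - nu) / gamma(nu) * x ** nu * kv(nu, x)
    return c

h = np.linspace(0, 3, 7)
print(matern(h, sigma2=1.0, theta=1.0, nu=0.5))   # matches exp(-h) when nu = 0.5
print(np.exp(-h))
```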
2.7 Spatial Continuity
As we model spatial data as a random surface, the smoothness of the realizations of a random field is an important characteristic, and it is closely related to the covariance function of the random field. There are two types of continuity commonly considered for random fields: mean square continuity and almost sure (sample path) continuity; see Kent (1989) for details.

References
Cressie, N.A.C. 1993. Statistics for Spatial Data, 2nd ed. New York: John Wiley & Sons.
Kent, John T. 1989. “Continuity Properties for Random Fields.” The Annals of Probability 17 (4): 1432–40. doi:10.1214/aop/1176991163.
Stein, Michael. 1999. Interpolation of Spatial Data: Some Theory for Kriging. Springer.