Geostatistical Interpolation

One of the strengths of geostatistical interpolation is its ability to incorporate multiple data layers in the prediction. In other words, it capitalizes on the correlation between observations and among variables to make predictions at any unsampled location. It is particularly useful when the variable of interest, called a primary variable, is supplemented by secondary data that were collected in greater numbers and over different sets of locations. Note that all variables need to share a common subset of sampling locations in order to assess their cross-correlation.

In the example below, water lead level was measured at two different sets of locations during two separate sampling campaigns: 1) 809 sentinel sites, and 2) 611 voluntary sites, where 420 samples share the same location.

Screenshot

Data collected at sentinel sites are considered more reliable and cokriging is used to incorporate the voluntary data as secondary source of information in predicting water lead concentrations (left map) and the associated cokriging variance (right map) at all residential tax parcels in the City of Flint.

Screenshot

Kriging Types

Vesta allows the calculation of a wide variety of kriging estimators depending on the type of variable (e.g., continuous vs categorical), the number of variables (1 to 3), and the type of data transform (i.e., no transform, indicator transform, logratio transform). See flowchart below.

Screenshot

Categorical Variables

Categorical variables are a way to "tag" data as being of a particular category, such as the material used to construct a water pipe (lead, copper, PVC), soil type (silt, clay, sand), or land uses (agrarian, industrial, residential). Many times data is collected in several places across a larger area to get enough information to interpolate remaining spots based on the category of the surroundings. However, categorical variables cannot be interpolated “as is”.

Instead, the probability of occurrence of each category s_k is predicted, then the category s_k' with the largest probability is selected as the most probable and used as prediction: S ̂_ML (u)=s_k' if i ̂(u;s_k' )=Prob{S(u)=s_k'}> i ̂(u;s_k ) ∀ k ≠k'. Uncertainty about the prediction can be captured using any measure of the spread of the probability distribution. For example, the relative local entropy ranges between 0 and 1 (maximum uncertainty if all estimated probabilities are the same):

Screenshot

Indicator Kriging with Categorical Variables

This interpolation requires the coding of all observations S(u_α ) into indicators of presence/absence of each of the K categories, followed by the application of indicator kriging to each set of indicators:

Screenshot

Where the kriging weights λ_α are solutions of a system of linear equations.

Simplicial Indicator Kriging with Categorical Variables

A limitation of indicator kriging is that since the probabilities for all categories are estimated individually, there is no guarantee that at each location u the K estimated probabilities i ̂(u;s_k ) sum up to 1. Such constraint can be satisfied by treating the categorical variable as compositional (Aitchinson, 1986 ).

In that case, the data are first transformed into log-ratios, which are then interpolated by kriging, followed by their back-transform to obtain estimated probabilities => {i ̂(u_α;s_k ), k=1,⋯,K} that will by construction sum to 1 and be valued between 0 and 1.

The three steps, which are accomplished automatically in Vesta , are thus:

Transform each categorical data S(u_α ) into a set of log-ratios:
Kriging of log-ratios at unsampled location u:
Transform estimated logratios back into probabilities:

Validation of Interpolation with Categorical Variables

For categorical variables, Vesta offers the option of assessing the accuracy of the prediction using a second dataset that was not used in the prediction stage. The two datasets available are thus:

Estimated probabilities and most likely category predicted at a set of N test locations
Observed category and corresponding indicators at these N same locations:

These data are used by Vesta to calculate:

Contingency Table: S ̂_ML (u_β ) vs S(u_β ), which highlights the number of locations that are correctly classified (diagonal elements of the table) as well as the number of locations that are assigned to the incorrect category.
ROC curve and AUC (Area Under the Curve) statistic for each category s_k: The ROC curve plots the probability of false alarm (false positive) versus the probability of detection (vertical axis); see example below (Swets, 1988 ; Goovaerts, 2017)

Screenshot

The accuracy of the prediction is quantified using the relative area under the ROC curve (AUC statistic), which ranges from 0 (worst case) to 1 (best case). The AUC is equivalent to the probability that the classifier will rank a randomly chosen positive instance (e.g., presence of a lead service line) higher than a randomly chosen negative instance (e.g., absence of lead service line).

Continuous Variables with no Transform

This is the most common situations where the goal is to predict the value of one continuous variable using observations for that variable (kriging estimator) and other related variables that are sampled more densely (cokriging). The algorithms implemented in Vesta are described here <<<<>>>>>> and in Goovaerts (1997).

Continuous Variables with Indicator Transform

Two common features of environmental datasets are the occurrence of a few very large concentrations (hot-spots) and the presence of data below the detection limit (censored observations). Extreme values can strongly affect the characterization of spatial patterns, (i.e., estimation of the variogram) and subsequently the prediction. Several approaches exist to handle strongly positively skewed histograms (Saito and Goovaerts, 2000 ).

One way to attenuate the impact of extreme values is to use more robust statistics and estimators. The non-parametric approach of indicator kriging (IK) falls within that category (Journel, 1983 ). The basic idea is to discretize the range of variation of the environmental attribute by a set of thresholds z_k (e.g. deciles of sample histogram, detection limit, regulatory threshold) and to transform each observation into a vector of indicators of non-exceedence of each threshold:

Screenshot

Where z(u_α) is the observation of variable Z at location u_α.

The user specifies the number K of thresholds that are then calculated as equiprobable percentiles of the sample frequency distribution, F(.), as: z_k=F-1(p_k) with {p_k= p_min+(k-1)×((1-p_min ))/K,k=1,⋯,K}. For example, K=9 would result in the calculation of 9 thresholds as the 9 deciles of the dataset. Kriging is then applied to the set of indicators and estimated values are assembled to form a conditional cumulative distribution function (ccdf). The mean and the variance of the probability distribution can be used as an estimate of the attribute value and the uncertainty attached to the prediction (Goovaerts, 2009).

In several situations, observations themselves can be uncertain or imprecise; e.g., measurement errors for laboratory data, or small number problem for health outcome. In that case, one might not know with certainty whether the attribute value z(u) exceeds a particular threshold z_k or not. Like in the previous case, the user specifies the number K of thresholds that are then calculated as equiprobable percentiles of the sample frequency distribution, F(.). The indicators are called “soft” because they are valued between 0 and 1, instead of being binary 0/1 as for hard indicators (no uncertainty).

Soft indicators are cumulative frequencies and can be calculated for three types of distributions in Vesta, as shown in the variography section on soft indicators.

Post-processing Indicator Kriging Estimates

Like for categorical variables, since the probabilities for all thresholds are estimated individually, there is no guarantee that at each unsampled location u the K estimated probabilities i ̂(u;z_k ) form a “valid” cumulative probability distribution function; that is 0≤i ̂(u;z_k )≤1 ∀ k and i ̂(u;z_k' )≤i ̂(u;z_k ) if z_(k^' ) ≤ z_k. In addition, the set of K probabilities need to be interpolated within each class (zk,zk+1] and extrapolated beyond the smallest and the largest thresholds to build a continuous model for the probability distribution.

Following the procedure described in Goovaerts (1997) the set of K probabilities {i ̂(u;z_k ), k=1,⋯,K} underwent the following correction:

Screenshot

Kriging Estimators

Material forthcoming.

Please see variogram estimators in the interim.

Process Steps

Select "Geostatistical Interpolation" in the Methods panel
Select "Start" to open the interpolation dialog
In the "Dataset" field, select the dataset that includes the observations to use for interpolation
In the "Variogram model" field, select one model among the ones listed or create a new variogram model by selecting "Create Variogram" to open the Variogram dialog.
In the "Interpolated variable" field, highlight the variable you want to interpolate. This variable will be automatically listed by Vesta based on the variogram model that was selected in the previous step.
In the "Dataset with Output Geography" field, select the dataset/geography where you want to conduct the interpolation.
The dialog for the type of kriging will be tailored to the characteristics of the variogram model selected
- For a "categorical variable" variogram, the type of kriging will automatically be set to "indicator kriging" and the user will have to choose between conducting multiple indicator kriging (case of indicator transforms) and simplicial indicator kriging (case of log-ratio transform). *For a "continuous variable" variogram, Vesta will automatically detect whether kriging or cokriging is being conducted, as well as the names of secondary variables, depending on whether the variogram model is univariate or mutltivariate.
- For a "continuous variable (indicator transform)" variogram, Vesta will automatically detect the number and values of thresholds, as well as wheter the corresponding indicators are hard (0 or 1) or soft (0 to 1).
For a "categorical variable", users can conduct a cross-validation by comparing the predicted category with the category observed over the Output Geography specified on Step 6. Check the box “validate” and select the categorical variable in the Output Geography.
The kriging estimates available are
- simple (use of global mean that is automatically set to the arithmetical average of observations)
- ordinary (use of local mean)
- Standardized ordinary (only for cokriging)
In the "Search Strategy" field, specify the strategy to select observations used for prediction:
- Search window (all observations within the search radius will be retained for kriging), and
- Number of neighbors (specify the maximum number of nearest observations to retain).
Discretization count, which is the number of random points used to discretize any object in the input and output geographies.
Click "Run". Interpolation results are shown under a “kriging folder” located in the Output Geography in the Data panel to the right.

Kriging Output

Interpolation results are shown under a “kriging folder” located in the Output Geography in the Data panel to the right.

All estimates and variances can be mapped in Vesta by right-clicking on the output item and in the Visualize submenu, select Map.

The output depends on the type of kriging:

Kriging Output of Categorical Variables

For categorical variables, the output includes:

Estimate value and Kriging variance of each of the indicators or log-ratios
Most likely category and local entropy (variables and maps)
For validation option, the contingency table and the ROC curve with the AUC statistics

Kriging Output of Continuous Variables

For continuous variables, the output includes kriging estimate and kriging variance.

Kriging Output of Continuous Variables (indicator transform)

For continuous variables that had indicator transform applied, the output includes:

estimated value for each of the indicators, which can be interpreted as the estimated probability that the variable of interest does not exceed the specific threshold.
Mean and variance of the probability distribution (variables and maps)