Variograms
The variogram is a measure of the average dissimilarity between data as a function of their separation in geographical space. The experimental variogram is calculated by averaging half squared differences between observations (called spatial increments) within classes of distance (called lags) and direction. For areal data (e.g., counties, census tracts) the separating distance is calculated between their geographic centroids. Different types of averaging (i.e. variogram estimators) are available in Vesta.
Variogram Basics
Following Tobler’s first law of geography (1970), "everything is related to everything else, but near things are more related than distant things."
Thus, in the presence of spatial autocorrelation, one should expect the variogram values to increase as the separation distance |h| increases. In most situations, the variogram will stop increasing at a given distance, called the range, which can be interpreted as the distance of dependence or the zone of influence of the attribute. In other words, two observations separated by a distance larger than the range, are considered spatially independent.
The discontinuity at the origin of the variogram (i.e., at very small separation distances) is called the nugget effect and arises from measurement errors or sources of spatial variation at distances smaller than the shortest sampling interval, or both.
Variogram Types
Vesta allows the calculation of a wide variety of variograms depending on the type of variable (e.g., continuous vs categorical), the number of variables (1 to 3), and the type of data transform (i.e., no transform, indicator transform, logratio transform), see flowchart below.
Vesta offers four different types of transforms:
1. Hard indicator
Hard indicators of non-exceedence of a threshold zk (continuous variable)
Where Z(u) is the observation of variable Z at location u. The user specifies the number K of thresholds that are then calculated as equiprobable percentiles of the sample frequency distribution, F(.), as:
zk=F-1(pk)
with {pk = pmin+(k−1)×((1-pmin))/K, k=1,⋯,K}.
For example, K=9 would result in the calculation of 9 thresholds as the 9 deciles of the dataset.
2. Soft indicator
Soft indicators of non-exceedance of a threshold zk (continuous variable).
Like in the previous case, the user specifies the number K of thresholds that are then calculated as equiprobable percentiles of the sample frequency distribution, F(.). The indicators are called “soft” because they are valued between 0 and 1, instead of being binary 0/1 as for hard indicators. Soft indicators are cumulative frequencies and can be calculated for three types of distributions:
a) Gaussian distribution
where G(.) is the cumulative standard normal distribution, Z(u) and S(u) are the observed value and standard error (square root of variance) at location u, both specified by the user.
b) Poisson distribution
where FPo(u) is the cumulative Poisson distribution for location u, defined by parameter P(u) (i.e., population size) and the observed value Z(u).
c) Binomial distribution
where FBi(u) is the cumulative binomial distribution for location u, defined by parameter P(u) (i.e., population size) and the observed value Z(u).
3. Indicator of presence/absence (categorical variables)
where the number K and label of each category is derived automatically from teh data by Vesta.
4. Log-ratio (categorical variables)
The transform starts with the creation of a generalized indicator function:
where the number K and label of each category is derived automatically from the data by Vesta. This indicator function is then used to calculate three different types of logratio based on the user's choice:
a) Additive (alr)
where sK is the category with teh smallest frequency of occurrence.
b) Centered (clr)
c) Isometric (ilr)
Variogram Estimators
This section provides information about variogram estimators available in Vesta.
The following variogram estimators can be selected when creating a variogram:
Traditional estimator (Journel and Huijbregts, 1978)
where z(u𝛼) is the attribute value at the location u𝛼 and N(h) is the number of data pairs within that class of distance and direction. This is the most frequently used estimator and the default option in Vesta.
Standardized estimator (Pannatier, 1996)
where 𝝈2(h) is the variance of the 2N(h) data used for estimation at lag h. This standardization accounts for changes in variance from one lag to the next and is used to correct for the preferential sampling of high values, which leads to overestimating the nugget effect (see example for groundwater arsenic concentrations in Goovaerts et al., 2005).
Weighted estimator (Rivoirard et al., 2000)
where w(u𝛼) is the weight assigned to the attribute value at location u𝛼. Weighted estimators can be used to take an irregular sampling density into account (weight = surface of the polygon of influence of the observation). Another application is accounting for the varying level of reliability of the data; for example to assign less weight to unreliable mortality rates estimated from sparsely populated areas (weight = population size).
Cross Variogram Estimators
The cross-variogram measures how two variables, z1 and z2, jointly vary as a function of the separation distance in geographical space. The experimental cross-variogram is calculated by averaging the product of spatial increments of two different variables within classes of distance (called lags) and direction. Different types of averaging (i.e. variogram estimators) are available in Vesta.
Unlike the variogram that takes only positive values, the cross-variogram can take negative values. It occurs when the two variables are negatively correlated; hence, the increase (decrease) in variable z1 tends to be associated with a decrease (increase) in variable z2. Their spatial increments tend to be of opposite sign and its product will be negative on average.
The following cross-variogram estimators can be selected in Vesta (note that setting Z1=Z2 yields the formula for the univariate variogram estimators):
Traditional Estimator
where z1(u𝛼) and z2(u𝛼) are the values of attributes Z1 and Z2 at the location u𝛼 and N(h) is the number of data pairs within that class of distance and direction. This is the most frequently used estimator and the default option in Vesta.
Standardized Estimator
where σ1(h)×σ2(h) is the standard deviation of the 2N(h)z1×z2 data used for estimation at lag h. This standardization accounts for changes in variance from one lag to the next and is used to correct for the preferential sampling of high values; which leads to overestimating the nugget effect.
Weighted Estimator
where w1(u𝛼;h)= w1(u𝛼)×w1(u𝛼+h) is the weight assigned to the z1-data pair.
Variogram Model
A continuous function must be fitted to experimental values so as to deduce variogram values for any possible lag h required by prediction algorithms and also to smooth out sample fluctuations. The difficulty is that only certain functions, known as permissible models, can be used in the fitting. Typically, two or more permissible models must be combined to fit the shape of the experimental variogram. The type (i.e., nugget effect, spherical, exponential, and cubic) and parameters of these models (range, sill) are estimated automatically by Vesta using least square regression.
The fitting of direct and cross variograms is even more challenging as the different experimental curves cannot be modeled independently. The most common approach is to adopt a linear model of coregionalization (Goovaerts, 1997) in which:
- all curves are fitted with the same set of basic permissible models (type, range), and
- for each model the matrix of sills is positive semi-definite. This is the model implemented in Vesta.
The following permissible variogram models (Chiles and Delfiner, 1999) are available in Vesta:
Nugget Effect
Spherical model with range a
Exponential model with practical range a
Cubic model with range a
All of these models are permissible in three dimensions and are for now expressed in their isotropic form, that is, solely as a function of the separation distance h. These four models are bounded, which means that a sill, here set to 1 (standardized models), is actually or practically reached at a distance a called the range:
- For the nugget effect model, the sill is reached as soon as h > 0.
- The spherical and cubic models reach their sill at distance a (actual range).
- The exponential model reaches its sill asymptotically. A practical range a is defined as the distance at which the model value is at 95% of the sill.
Bounded models are also referred to as transition models, and their covariance counterpart is c(h)=1−g(h).
Three types of behavior near the origin are distinguished:
- Parabolic behavior, as in the cubic model. Such behavior is characteristic of highly regular phenomena such as topographic elevation of gently undulating hills.
- Linear behavior, as in the spherical or exponential model. Note that for the same practical range, the exponential model starts increasing faster than the spherical model.
- Discontinuous behavior, e.g., nugget effect model.
Creating a Variogram
Vesta allows the calculation of a wide variety of variograms depending on the type of variable (such as continuous vs categorical), the number of variables (1 to 3), and the type of data transform (i.e., no transform, indicator transform, log-ratio transform).
The type of variable will be recognized automatically by Vesta, leading to the display of a custom dialog window where the user can specify the number of variables and the type of data transform; see process steps.
Process Steps
- There are three different ways to call and open the “Create a variogram” dialog:
- Select the "Data" menu item and then select the “Variogram” icon.
- In the Data panel, right click on the variable of interest. Hover over “Create”, then select “Variogram” from the drop down menu.
- Select "Geostatistical interpolation" in the Methods panel, then select "start" to open the method dialog and click on "Create Variogram".
- In the "Dataset" field, select the desired dataset from the Data panel you wish to analyze
- In the "Variable" field, select the first variable you wish to include in the analysis.
- Select more variables, if desired:
- If the first variable is continuous, you have the option to add up to two additional continuous variables that will be used as secondary information in cokriging. As the number N of variables increases, more tabs are open for the variogram plots. The total number of direct and cross variograms will be 1, 3 or 6 depending on whether N is 1, 2 or 3.
- For categorical variables, Vesta will automatically determine the number of categories and calculate a variogram for each category. Only a single variable can be specified in that case.
- Select the type of transform
- For a single continuous variable, you have the option to calculate indicator variograms for a series of thresholds by checking the box “Indicator”. The interface then changes to allow the user to select the number of thresholds which are calculated as equiprobable percentiles of the sample frequency distribution. Two types of indicators (see type of transform) are available.
- Hard indicators - valued at 0 or 1 depending on whether the observation exceeds the threshold or not.
- Soft indicators - valued between 0 and 1 and corresponding to the cumulative frequency of the thresholds in a local distribution defined by its mean (observation) and variance (Gaussian distribution) or population parameter (Poisson and Binomial distribution), which are variables to be specified by the user.
- For a categorical variable, Vesta will automatically create indicators of presence/absence for each category and calculate the corresponding indicator variogram. If the option “Logratio” is checked, the user can specify one of three different types of logratio: additive, centered, and isometric (default). Both sets of indicators and logratios can be exported as a text file.
- For a single continuous variable, you have the option to calculate indicator variograms for a series of thresholds by checking the box “Indicator”. The interface then changes to allow the user to select the number of thresholds which are calculated as equiprobable percentiles of the sample frequency distribution. Two types of indicators (see type of transform) are available.
- Select the one of the four available estimators for the variogram.
- If the weighted estimator is selected, a “weight” variable needs to be specified.
- For continuous variables that represent rates or ratios, the Poisson estimator can be selected. In that case, a “population” variable (i.e., denominator) and a population multiplier need to be specified. For example, in the NCI National Cancer Atlas mortality data, the rate is for 100,000 person-years. So the "Multiplier" in the dialog box would be 100,000.
- The lag, lag count and tolerance fields define how the data pairs are binned by distance to create the experimental variogram. By default (and when the Auto box is checked) the data pairs are grouped into 10 distance bins (the lag count).
- The default distance lag is the geography's maximum radius (the half-diagonal of the bounding rectangle) divided by 10 and the tolerance is half the lag distance. If the tolerance is greater than the lag, the same data pairs will be included in consecutive distance bins, causing a smoothing of the variogram. Only omnidirectional variograms can be computed at this time.
- Select the number of basic models to form the nested model. Although the maximum is technically set to 10, in practice no more than 3 models are needed to achieve a good fit.
- The GOF (Goodness of fit) statistics measures the average squared difference between the experimental variogram and the model fitted.
- Vesta automatically selects the combination of permissible models (i.e., nugget effect, spherical, exponential, cubic) that minimizes the GOF statistics, and the estimated sill and range of each model are reported.
- The model can be exported as a text file (dialogue entries, estimated model parameters) or saved and named for later use in interpolation by kriging or cokriging. The saved model appears as a separate icon in the Data panel.
Saved variogram models can be edited by right-clicking on the variogram item in the Data panel.