When working with spatial data, we often encounter situations where observations on a variable are absent for some of the locations on the map. Geostatisticians talk of “sampled” locations (places where a variable of interest has been measured) and “not sampled” or “unsampled” locations (places where we would like to have measurements but lack them). In a nutshell, estimating values at the unsampled locations is the problem addressed by geostatistical modeling.

What is Geostatistical Modeling?

At its simplest, geostatistical modeling estimates values of a variable at unsampled locations. There are many modeling approaches for doing this, including Bayesian, regression, and local averaging techniques, to name a few. We’ll get to the advantages of geostatistical modeling in a moment, but first, what is it?

Geostatistical modeling proceeds in two steps. First, spatial patterns and association are modeled using something called the variogram, which describes how the dissimilarity between values grows with the distance separating them. Second, the variogram informs a model that estimates the value at a location from the values at surrounding sampled locations, taking their positions into account (here is a quick blog on the background and history of geostatistics). This approach is called kriging, and it has several advantages.
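To make the first step concrete, here is a minimal sketch of an empirical semivariogram, the raw estimate to which a variogram model is usually fitted. It uses the classical estimator (half the average squared difference between pairs of values, grouped by separation distance). The function name, the binning scheme and the toy transect are assumptions made for illustration, not part of any particular software package.

```python
import math
from collections import defaultdict

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return a list of (mean distance, semivariance, pair count) per distance bin."""
    # bin index -> [sum of distances, sum of squared differences, pair count]
    bins = defaultdict(lambda: [0.0, 0.0, 0])
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)
            if k >= n_lags:
                continue  # ignore pairs farther apart than the last bin
            acc = bins[k]
            acc[0] += h
            acc[1] += (values[i] - values[j]) ** 2
            acc[2] += 1
    # semivariance = sum of squared differences / (2 * number of pairs)
    return [(s[0] / s[2], s[1] / (2 * s[2]), s[2]) for _, s in sorted(bins.items())]

# Hypothetical transect: four sampled locations one unit apart.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
values = [1.0, 2.0, 4.0, 7.0]
variogram = empirical_semivariogram(coords, values, lag_width=1.0, n_lags=3)
for distance, semivariance, pairs in variogram:
    print(f"h={distance:.1f}  gamma={semivariance:.3f}  pairs={pairs}")
```

In this toy example the semivariance increases with distance, which is the signature of spatial dependence: nearby values are more alike than distant ones.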

3 Advantages of Geostatistical Modeling

First, geostatistical modeling provides estimates throughout the map, at both sampled and unsampled locations. Unlike many alternative techniques, kriging has zero error at sampled locations (those where measured values are available). This is very appealing, since it seems odd to have, for example, a measured value of 3.5 and a model that estimates 4.2 at the same place. Shouldn’t the model output equal the observed values at sampled locations? Kriging is one of the few approaches where it does.

Second, robust estimates of uncertainty are provided along with the estimated values. These uncertainty estimates are derived from the kriging variance, which is the variance of the prediction error. Not all alternative methods provide uncertainty estimates, which are essential for communicating how much confidence we have in the estimates and how that confidence varies across the geography under consideration. When looking at kriging model output, one typically inspects a map of the kriged values alongside a map of the uncertainty.

Map of kriging estimates (left) and uncertainties (right) for water lead level (lognormal scale) at a series of tax parcels in downtown Flint.
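The first two advantages can be checked numerically. The sketch below implements ordinary kriging with an exponential variogram whose sill and range are simply assumed (in practice they would be fitted to the data), and with no nugget effect. The helper names and the four toy observations are hypothetical. It is an illustration of the standard ordinary kriging equations, not of how any specific product implements them.

```python
import math

def gamma(h, sill=1.0, range_=2.0):
    """Exponential variogram model; sill and range are assumed values."""
    return sill * (1.0 - math.exp(-h / range_))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(coords, values, target):
    """Return (estimate, kriging variance) at the target location."""
    n = len(coords)
    # Variogram values between all pairs of samples, plus the unbiasedness row/column.
    a = [[gamma(math.dist(coords[i], coords[j])) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    # Variogram values between each sample and the target.
    b = [gamma(math.dist(c, target)) for c in coords] + [1.0]
    solution = solve(a, b)
    weights, mu = solution[:n], solution[n]  # mu is the Lagrange multiplier
    estimate = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + mu
    return estimate, variance

# Hypothetical measurements; the first one is the 3.5 from the text.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
values = [3.5, 4.2, 2.9, 5.0]

est_at_sample, var_at_sample = ordinary_kriging(coords, values, (0.0, 0.0))
est_new, var_new = ordinary_kriging(coords, values, (1.0, 1.0))
print(est_at_sample, var_at_sample)  # reproduces the measurement, variance close to 0
print(est_new, var_new)              # an estimate with a positive kriging variance
```

At the sampled location the estimate reproduces the measured value and the kriging variance is zero up to rounding error, while at the unsampled location the variance is positive. Mapping that variance over a grid of locations gives an uncertainty map.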

Third, geostatistical modeling isn’t just one technique. It is a family of models suited to different kinds of data and types of variables. These include kriging for continuous variables (real numbers, such as concentrations of soil contaminants, or disease rates), Poisson kriging for rare count data (such as the number of whales sighted from a search vessel), indicator kriging for indicator variables (for example, data coded as 0 or 1, as happens in case-control studies, with “1” indicating a case and “0” indicating a control), and compositional kriging for variables that sum to 100% (for example, the proportions of a local population in each disease status, such as healthy, infected, infected and infectious, recovered).

Kriging accounts not only for spatial dependencies in the modeled variable, but also for associations with other variables (this is multivariate kriging). Finally, kriging can incorporate information from both “hard” and “soft” data. Here, hard data are measurements of the variables of interest, whereas soft data consist of expert opinion, prior guesses and so on.

At this point you may have an initial appreciation for the power and robustness of geostatistical methods. So let us now ask: with such power, what is still hard to accomplish?

Defining the challenges and limitations of geostatistical modeling

Consider 3 salient challenges to geostatistical modeling: coregistration of data through change of scale (that is, change of spatial support), multiple variables, and compositional data analysis. We’ll briefly define each of these and then describe how they can be addressed.

1. Change of Scale

We often need to analyze data that come from different layers, for example, combining information on income and socioeconomic status (SES) at the census tract level with health outcome data that are available for residential addresses. Here, the income and SES data are for areas (census tracts), while the health outcomes are available for points (residential locations). How can we bring these together to undertake an analysis? This is accomplished using a change of scale that associates the census tract data with the residential locations. We can refer to this as “A2P”, indicating an area-to-point change of scale (see this recent blog on uses of geostatistics in disease mapping).

There are simple ways of doing this, such as assigning averages from areas to points (e.g. a “cookie cutter” approach), but these lack an essential property: coherence. Coherence means one should be able to go from, say, points to areas, then back from areas to points, and obtain the same data values one began with. Unfortunately, most of the commonly used change of scale techniques are not coherent.
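A toy round trip shows what a lack of coherence looks like. The sketch below applies a “cookie cutter” rule in both directions: point values are averaged within each area, and the area average is then assigned back to every point inside it. The identifiers and values are hypothetical, and the check simply follows the definition of coherence given above.

```python
from statistics import mean

def points_to_areas(point_values, point_area):
    """Average the point values falling inside each area."""
    grouped = {}
    for point_id, value in point_values.items():
        grouped.setdefault(point_area[point_id], []).append(value)
    return {area: mean(vals) for area, vals in grouped.items()}

def areas_to_points(area_values, point_area):
    """Cookie cutter: every point inherits the value of the area containing it."""
    return {point_id: area_values[area] for point_id, area in point_area.items()}

# Hypothetical layout: two residences in tract A, one in tract B.
point_area = {"p1": "tractA", "p2": "tractA", "p3": "tractB"}
point_values = {"p1": 10.0, "p2": 20.0, "p3": 30.0}

area_values = points_to_areas(point_values, point_area)
round_trip = areas_to_points(area_values, point_area)
is_coherent = all(
    abs(round_trip[p] - point_values[p]) < 1e-12 for p in point_values
)
print(area_values)   # tract A is the average of its two points
print(round_trip)    # both points in tract A now carry that average
print(is_coherent)   # the original point values were not recovered
```

The two residences in tract A started with different values and end with the same one, so the round trip does not return the original data.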

2. Multiple Variables

Another challenge is how to handle multiple variables that are themselves geographically distributed and co-dependent. For those with a statistical bent, co-dependent means their values are correlated. And of course, we may need to deal with spatial structure in each of the variables, so that values at nearby locations tend to be similar. This is the problem posed for geostatistics by the “GIS Layer Cake”. Embedded in this is the change of support problem across the layers. How, for example, might we go from parcel data (area data) to customers (point data)? With multiple variables, the challenge for geostatistical modeling therefore becomes even more complex.

3. Compositional Data

What about variables whose categories sum to a fixed total, say 1 or 100%? These are called “compositional data”, since they are composed of proportions that together make up a whole. While the theory of kriging for compositional data exists, at present there is only one commercial GIS software package (Vesta) that implements it.
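The difficulty is that estimates of each proportion, made separately, are not guaranteed to stay between 0 and 1 or to sum to the total. One well-known device from the compositional data literature is the additive log-ratio transform: proportions are converted to unconstrained log-ratios, modeled on that scale, and transformed back. The sketch below shows only the transform and its inverse. It is offered as general background, not as a description of the method Vesta uses, and the function names and example proportions are assumptions. It requires all proportions to be strictly positive.

```python
import math

def alr(composition):
    """Additive log-ratio transform, using the last part as the reference."""
    *parts, reference = composition
    return [math.log(p / reference) for p in parts]

def alr_inverse(log_ratios):
    """Map unconstrained log-ratios back to proportions that sum to 1."""
    expo = [math.exp(v) for v in log_ratios] + [1.0]
    total = sum(expo)
    return [e / total for e in expo]

# Hypothetical proportions: healthy, infected, infected and infectious, recovered.
site_a = [0.60, 0.25, 0.10, 0.05]
site_b = [0.40, 0.30, 0.20, 0.10]

# Round trip: the inverse recovers the original composition.
recovered = alr_inverse(alr(site_a))

# Any weighted combination made on the log-ratio scale (equal weights here,
# standing in for kriging weights) back-transforms to a valid composition.
blend = [0.5 * a + 0.5 * b for a, b in zip(alr(site_a), alr(site_b))]
estimate = alr_inverse(blend)
print(estimate, sum(estimate))
```

Whatever weights are applied on the log-ratio scale, the back-transformed proportions are positive and sum to 1, which is the constraint that makes compositional data awkward to model directly.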

How users can overcome the challenges of geostatistical modeling 

The solutions to these challenges are provided by BioMedware’s Vesta software.  Vesta automates modeling and uses embedded expert knowledge to select the appropriate geostatistical techniques, and then applies them to your data. For example, to accomplish a change of scale one only needs to select “Change Scale” and then enter the two data layers, one for the “Source Dataset” and the second for the “Destination Dataset”.  

Vesta takes care of the rest, selecting the appropriate geostatistical technique (e.g. for upscaling, downscaling, side-scaling) and applying it to the data to obtain a result data set that is coherent and appropriately scaled. Other advanced geostatistical techniques in Vesta follow the same user-centered paradigm to make advanced modeling easy, understandable, correct and accurate.

Geostatistical Modeling and Vesta

This blog has covered a lot of ground, beginning with a high-level view of geostatistical modeling, followed by its advantages, challenges, and limitations. We’ve considered several solutions, and introduced some that are now available only in the Vesta software. To learn more about Vesta and claim your free trial, click here.