A significant portion of the data analyzed geostatistically doesn’t consist of simple numerical values like temperature or elevation. Instead, we often deal with compositional data—parts of a whole that are constrained to sum to a constant. Think of a rock sample’s mineralogy (quartz, feldspar, mica), a biodiversity inventory (counts of different benthic species in a sediment sample), or the percentages of sand, silt, and clay in a soil sample. This property also pervades epidemiological data founded on counts; for example, analyzing the number of deaths caused by different diseases within a county or the number of individuals diagnosed with each stage of breast cancer within a state. In these situations, the information is expressed as percentages or proportions, as the focus is typically on relative frequencies, not absolute frequencies.

Categorical data, coded as indicators of presence/absence, can also be viewed as a particular type of compositional data. In that case, the percentages are either 100 at a given location (the category is present) or 0 (the category is absent). Examples include land use categories (e.g., residential, commercial, industrial, agricultural, and recreational) for a given tax parcel or the type of water service line material (e.g., copper, lead, galvanized, plastic) for a residential dwelling.

Treating compositional data with traditional statistical methods is a classic pitfall, resulting in spurious correlations and biased predictions. This blog post dives into why compositional data is different, the problems with naive analysis, and the elegant framework of Compositional Data Analysis (CoDA) that provides a robust solution for geostatistical modeling and is available in our Vesta software.

The Problem: The Curse of the Constant Sum

Imagine you have a rock sample composed of three minerals: A, B, and C. By definition, their percentages must add up to 100%.

Sample 1: A=60%, B=30%, C=10%

Sample 2: A=80%, B=10%, C=10%

A standard statistical analysis would look at the raw percentages. You might calculate the mean and variance of mineral A, or the correlation between mineral A and B. This is where the trouble begins. The components are not independent; they are linked by the constant-sum constraint (closure). If the proportion of A increases, at least one of the other components must decrease to maintain the total of 100%.

This constraint creates several major issues:

  • Spurious Correlations: Negative correlations are artificially induced. In our example, as A increases from 60% to 80%, B decreases from 30% to 10%. A standard Pearson correlation would show a strong negative relationship, but this is a mathematical artifact of the closure, not necessarily a reflection of a geochemical process.
  • Closed-Space Geometry: The data doesn’t exist in standard Euclidean space. A three-part composition actually lives on a 2-dimensional simplex (a triangle) within the 3D space [0, 100] x [0, 100] x [0, 100]. Standard distance and variance calculations, which assume Euclidean geometry, are inappropriate and can be misleading.
  • Bias in Prediction: When using methods like kriging to predict a component at an unsampled location, the model doesn’t honor the sum-to-constant constraint. You might predict 65% A, 40% B, and 20% C for a new location—a nonsensical result that sums to 125%. When predicting probabilities of occurrence for categorical variables, we might similarly obtain kriged probabilities that are either negative or exceed 1.
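The spurious-correlation effect is easy to demonstrate with simulated data. The sketch below (a generic illustration, not Vesta code) generates three statistically independent amounts, then "closes" them to percentages: the raw amounts are essentially uncorrelated, while the closed percentages show a strong negative correlation induced purely by the constant-sum constraint.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Three INDEPENDENT positive quantities (absolute amounts of minerals A, B, C).
a = rng.lognormal(mean=2.0, sigma=0.3, size=n)
b = rng.lognormal(mean=1.5, sigma=0.3, size=n)
c = rng.lognormal(mean=1.0, sigma=0.3, size=n)

# Correlation of the open (unconstrained) amounts is near zero by construction.
open_corr = np.corrcoef(a, b)[0, 1]

# Close the data: express each part as a percentage of the total.
total = a + b + c
pct_a = 100 * a / total
pct_b = 100 * b / total

# After closure, A% and B% are strongly negatively correlated,
# a mathematical artifact of the constant-sum constraint.
closed_corr = np.corrcoef(pct_a, pct_b)[0, 1]

print(f"open-data correlation:   {open_corr:+.2f}")
print(f"closed-data correlation: {closed_corr:+.2f}")
```

Nothing in the simulated process links A and B, yet the closed percentages suggest a strong inverse relationship, exactly the trap described above.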

The Solution: The Log-Ratio Transformation

The solution is to project the data into a new space where the constant-sum constraint no longer applies and standard (geo)statistical tools are valid. This is achieved through log-ratio transformations. The core idea is to analyze the relationships between the ratios of components, rather than the components themselves.

The foundational framework for this was laid by John Aitchison in the 1980s. The most common transformations are:

1. The Additive Log-Ratio (alr) Transformation

This transforms the D-part composition into a vector of D-1 real numbers (e.g., D=3 in the above example). It’s simple and intuitive, but it’s not symmetric—the choice of the denominator component affects the results.

2. The Centered Log-Ratio (clr) Transformation

This transforms the D-part composition into D real numbers. However, the resulting vectors are not linearly independent; they sum to zero, which can cause issues with methods that assume full-rank data (like standard principal component analysis).

3. The Isometric Log-Ratio (ilr) Transformation

The ilr transformation is the most robust and theoretically sound method. It uses an orthonormal basis to map the data from the D-dimensional simplex to a (D-1)-dimensional Euclidean space. This preserves distances and angles and results in a set of linearly independent variables. The calculation is more complex, involving sequential pairwise log-ratios, but it is the preferred method for advanced statistical modeling.

All three types of transformation are available in Vesta. If you’re not sure which one is right for your research, use Vesta’s default option, the isometric log-ratio (ilr) transformation.
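To make the first two transformations concrete, here is a minimal numpy sketch of the alr and clr (this is a generic illustration of the standard formulas, not Vesta's implementation):

```python
import numpy as np

def alr(x, denom=-1):
    """Additive log-ratio: log of each part over a chosen denominator part.
    Maps a D-part composition to D-1 real numbers; the result depends
    on which component is used as the denominator."""
    x = np.asarray(x, dtype=float)
    num = np.delete(x, denom, axis=-1)
    return np.log(num / x[..., [denom]])

def clr(x):
    """Centered log-ratio: log of each part over the geometric mean.
    Maps a D-part composition to D numbers that sum to zero."""
    x = np.asarray(x, dtype=float)
    gmean = np.exp(np.mean(np.log(x), axis=-1, keepdims=True))
    return np.log(x / gmean)

sample = np.array([60.0, 30.0, 10.0])  # Sample 1 from the example above
print("alr:", alr(sample))             # 2 unconstrained coordinates
print("clr:", clr(sample))             # 3 coordinates that sum to zero
```

Note how the clr coordinates sum to zero, which is exactly the linear dependence (rank deficiency) mentioned above.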

The CoDA Workflow in Geostatistics

Applying CoDA to a geostatistical problem involves a multi-step approach:

  1. Pre-processing: Check the composition for zero values, since logarithms are undefined at zero. Zeros must be imputed or replaced using a method appropriate for compositional data (e.g., simple multiplicative replacement). In Vesta, this is done using a zero substitution technique; see Goovaerts (2025) for an example.
  2. Transform: Apply a chosen log-ratio transformation (ilr is recommended for modeling, clr for interpretation, alr for simplicity).
  3. Modelling: Perform the geostatistical analysis on the transformed data. This is where you calculate variograms, fit a model, and perform kriging or simulation. Since the data is now in an unconstrained space, all standard geostatistical tools work as intended.
  4. Back-Transform: After prediction or simulation in the transformed space, the results must be back-transformed to the original compositional space. This involves applying the inverse of the log-ratio transformation to get predicted proportions that correctly sum to 100%. 
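The multiplicative replacement mentioned in step 1 can be sketched as follows (a generic version of the standard technique, not Vesta's zero substitution routine; the threshold `delta` is an illustrative choice):

```python
import numpy as np

def multiplicative_replacement(x, delta=0.001):
    """Replace zeros in a composition (parts summing to 1) with a small
    value `delta`, shrinking the non-zero parts proportionally so the
    total still sums to 1."""
    x = np.asarray(x, dtype=float)
    zeros = x == 0
    n_zeros = zeros.sum()
    return np.where(zeros, delta, x * (1 - n_zeros * delta))

comp = np.array([0.70, 0.30, 0.0])   # zero part: log-ratios undefined
repl = multiplicative_replacement(comp)
print(repl, repl.sum())              # all parts positive, total still 1
```

After this step, every part is strictly positive and the log-ratio transformations in step 2 are well defined.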

A Simple Example: Soil Texture

Let’s say you are mapping soil texture across a field using the percentages of Sand, Silt, and Clay.

Wrong Approach: You krige Sand, Silt, and Clay separately. Your predicted map shows a location with 40% Sand, 50% Silt, and 30% Clay. The sum is 120%, which is impossible.

The correct approach, which is fully automated in Vesta, proceeds as follows:

  • Transform: You apply the ilr transformation to your [Sand, Silt, Clay] data at each sampled location. This gives you two new variables (ilr1, ilr2).
  • Model: You calculate experimental variograms for ilr1 and ilr2, fit a model to each, and perform ordinary kriging to predict ilr1 and ilr2 at all grid locations.
  • Back-Transform: At each grid cell, you take the kriged (ilr1, ilr2) values and apply the inverse ilr transformation. The result is a predicted composition [Sand, Silt, Clay] that sums to exactly 100%. 
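The transform and back-transform steps above can be sketched in numpy as follows (a generic ilr built from a Helmert-type orthonormal basis, not Vesta's implementation; the kriging step in between is omitted here):

```python
import numpy as np

def helmert_basis(D):
    """Orthonormal basis of the hyperplane {v : sum(v) = 0} in R^D,
    used as the contrast matrix mapping clr space to ilr coordinates."""
    V = np.zeros((D - 1, D))
    for i in range(1, D):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    return V

def ilr(x):
    """Map a D-part composition to D-1 unconstrained ilr coordinates."""
    x = np.asarray(x, dtype=float)
    gmean = np.exp(np.mean(np.log(x), axis=-1, keepdims=True))
    clr_x = np.log(x / gmean)
    return clr_x @ helmert_basis(x.shape[-1]).T

def ilr_inv(y, total=100.0):
    """Inverse ilr: recover a composition that sums exactly to `total`."""
    y = np.asarray(y, dtype=float)
    clr_x = y @ helmert_basis(y.shape[-1] + 1)
    x = np.exp(clr_x)
    return total * x / x.sum(axis=-1, keepdims=True)

texture = np.array([40.0, 35.0, 25.0])  # [Sand, Silt, Clay] in %
coords = ilr(texture)                   # 2 unconstrained coordinates
back = ilr_inv(coords)                  # back-transformed composition
print(coords, back, back.sum())         # back sums to exactly 100
```

In the full workflow, `coords` would be kriged across the grid and `ilr_inv` applied cell by cell, guaranteeing valid percentages everywhere.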

Is A Compositional Approach Worth The Trouble?

The benefits of a compositional approach are not limited to ensuring that all estimates are non-negative and sum to a constant. For example, the first application of a compositional approach to water service line data (categorical data) proved successful, substantially improving the detection of lead service lines over indicator kriging (AUC = 0.8 vs. 0.6) (Goovaerts, 2023).

In another study on the mapping of sediment grain size distribution (continuous data), the compositional approach with non-Euclidean coordinates resulted in the smallest prediction errors for all four textural fractions (i.e., clay, silt, sand, gravel) when secondary information was incorporated (Goovaerts, 2025).

Next Steps For Using Compositional Data Analysis in Geostatistics

Compositional Data Analysis is not just an academic exercise; it is a fundamental necessity for correctly interpreting and modeling many geospatial datasets. By recognizing the closed nature of this data and applying the log-ratio transformation framework, we can avoid spurious correlations, honor the physical constraints of our systems, and build more accurate and reliable geostatistical models.

While the mathematics can seem intimidating at first, modern software packages (like R’s compositions and robCompositions libraries, or Python’s scikit-bio) make implementation straightforward. Vesta, however, is the only GIS software offering tools for the analysis, modeling, and visualization of compositional data in a space-time framework. For any geostatistician working with data that represents parts of a whole, embracing CoDA is the key to unlocking true and meaningful insights from the data.  

And for those of us who aren’t geostatisticians? Vesta’s automated analysis will do it all for you. Try it out for free here.

Sources:

  1. Goovaerts, P. 2023. Geospatial model of composition of water service lines in Flint, MI: Validation using excavation data and a new compositional geostatistical approach. AWWA Water Science, 5(2), e1331.
  2. Goovaerts, P. 2025. Geostatistical analysis of seabed sediment type: A compositional approach with non-Euclidean distance metrics. Mathematical Geosciences, DOI: 10.1007/s11004-025-10225-1.