Our team recently published a blog on missing data and how it can be handled in GIS applications such as Vesta. This topic resonated with several of our readers, who raised important points we would like to discuss further.
The first blog observed that missing data are quite common and occur whenever observations for certain variables at specific locations and times are unavailable due to equipment problems, data losses, confidentiality concerns, and simple human error.
The original blog described two forms of missing data: missing completely at random (MCAR) and missing at random (MAR). MCAR occurs when the missing data are randomly distributed across all observations; MAR occurs when the missing data are not randomly distributed across all observations but are instead distributed within one or more subsamples. Refer to the original blog for specific examples. The discussion of missing values prompted Parag J Dutta, Ph.D., to observe in an email:
“This is regarding your blog post “What is Missing Data and How to Handle it with Vesta”. The post is very informative and useful for everyone dealing with GIS/Spatio-Temporal Statistics. I wish to convey my thanks to the Development Team for including the innovative missing data handling feature in the latest version of VESTA. However, as you have mentioned in your blog post, we don’t want to discard an observation because some of the variables were not recorded at that location. So, how does Vesta exactly handle missing data for spatio-temporal modeling in the case of MAR (missing at random)? Does it fill the blanks with statistically calculated imputations or remove the data point for only the missing variable component? It’s crucial, at least, for mineral resource reporting where statistical/mathematical imputations are considered inappropriate practice and not accepted by mineral resource reporting systems such as JORC.”
Vesta currently removes the data point for only the missing variable component. We view this as the most conservative approach that is faithful to the observed values.
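As a minimal sketch of the difference between removing only the missing variable component and discarding the whole observation, consider the Python below. The record layout, the variable names (`arsenic`, `lead`), and the use of `None` to mark a missing value are illustrative assumptions, not Vesta's internal representation.

```python
# Hypothetical observations: one record per location; None marks a missing value.
records = [
    {"site": "A", "arsenic": 4.1, "lead": 12.0},
    {"site": "B", "arsenic": None, "lead": 9.5},
    {"site": "C", "arsenic": 3.7, "lead": None},
    {"site": "D", "arsenic": 5.2, "lead": 11.1},
]
VARIABLES = ["arsenic", "lead"]


def per_variable_values(records, variable):
    """Keep every observed value of one variable, ignoring gaps in the others."""
    return [(r["site"], r[variable]) for r in records if r[variable] is not None]


def complete_cases(records, variables):
    """Listwise deletion: keep only records in which every variable is observed."""
    return [r for r in records if all(r[v] is not None for v in variables)]


arsenic = per_variable_values(records, "arsenic")
lead = per_variable_values(records, "lead")
complete = complete_cases(records, VARIABLES)

print(len(arsenic), len(lead), len(complete))
```

In this toy data set, per-variable removal keeps three observed values for each variable, whereas discarding incomplete observations would leave only sites A and D.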
The released version of Vesta does not mathematically or statistically impute values automatically. The next version of Vesta, expected to be released in mid-September, includes spatial multivariable kriging that can be used to impute missing values using observed spatial and multivariate dependencies. However, this will not happen automatically: the user will need to explicitly specify the imputation.
BioMedware’s approach to handling missing data in visualization is described in the original blog. For example, statistical graphics are affected only when one of the plotted variables is missing (e.g., a point won’t be included in a Scatterplot if the observation on either variable is missing). In algorithms, we are highly conservative and do not impute missing data. Each variable is handled separately for a change of spatial scale (e.g., a change of spatial support), so a missing observation in one variable will not affect another. Missing data handling in analysis methods depends on how the methods use the variable. For a multivariate technique such as regression or kriging, if a data point is missing a value for a variable used in the model, then the evaluation at that point is also ‘missing’; the calculation still proceeds if the point is missing values only in unmodeled variables.
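The two rules described above (a scatterplot point needs both of its variables, and a model evaluation is missing only when a modeled variable is missing) can be sketched in a few lines of Python. The function names, variable names, and coefficients are hypothetical and chosen only for illustration; this is not Vesta's code.

```python
def scatter_points(records, x, y):
    """Return (x, y) pairs, skipping records where either variable is missing."""
    return [(r[x], r[y]) for r in records if r[x] is not None and r[y] is not None]


def evaluate(record, coefficients, intercept=0.0):
    """Linear predictor that returns None when any *modeled* variable is missing."""
    total = intercept
    for name, weight in coefficients.items():
        value = record.get(name)
        if value is None:
            return None  # a variable used by the model is missing
        total += weight * value
    return total


# Hypothetical data: None marks a missing observation.
records = [
    {"elevation": 10.0, "rainfall": 2.0, "ph": None},  # ph is not in the model
    {"elevation": None, "rainfall": 3.0, "ph": 6.5},   # a modeled variable is missing
    {"elevation": 20.0, "rainfall": 1.0, "ph": 7.0},
]
coefficients = {"elevation": 0.5, "rainfall": 2.0}  # assumed, for illustration

predictions = [evaluate(r, coefficients, intercept=1.0) for r in records]
points = scatter_points(records, "elevation", "ph")
```

Here the first record still receives a prediction even though `ph` is missing, because `ph` is not a modeled variable, while the second record's prediction is `None`.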
So, is our current conservative approach the right one to take? When a user wishes to impute missing values, how should this be done (if at all)? Is it appropriate to have the user explicitly model the procedure for imputing missing values? Should there be a menu item for “missing value imputation” with a set of alternative methods (e.g., local mean, space-time (ST) kriging, etc.)?
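To make the "local mean" option concrete, here is one possible sketch in Python, assuming a simple k-nearest-neighbour definition of "local" in two-dimensional space. The function name, the record fields, and the choice of `k` are assumptions for illustration; Vesta may define a local mean differently. The sketch returns a new table and flags every imputed value, so the observed data remain untouched.

```python
import math


def local_mean_impute(points, k=2):
    """Return a copy of points with each missing value replaced by the mean of
    the k nearest observed neighbours. Imputed rows are flagged explicitly."""
    observed = [p for p in points if p["value"] is not None]
    result = []
    for p in points:
        if p["value"] is not None:
            result.append({**p, "imputed": False})
            continue
        nearest = sorted(
            observed,
            key=lambda q: math.dist((p["x"], p["y"]), (q["x"], q["y"])),
        )[:k]
        if not nearest:
            # Nothing observed to borrow from, so the value stays missing.
            result.append({**p, "imputed": False})
            continue
        mean = sum(q["value"] for q in nearest) / len(nearest)
        result.append({**p, "value": mean, "imputed": True})
    return result


# Hypothetical samples: None marks the missing observation.
samples = [
    {"x": 0.0, "y": 0.0, "value": 10.0},
    {"x": 1.0, "y": 0.0, "value": 14.0},
    {"x": 0.5, "y": 0.1, "value": None},
    {"x": 10.0, "y": 10.0, "value": 50.0},
]
filled = local_mean_impute(samples, k=2)
```

Keeping the `imputed` flag alongside the original table means a report can still be built from observed values only, with the imputed values kept in a separate view.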
Dr. Dutta’s opinion is: “‘When in Rome, do as the Romans do.’ Google is the best example in this case. The McMahon Line demarcates the international border between India and China. However, the exact trajectory of the line is mysteriously unknown to both the Indian and the Chinese (or Tibetan) public. Therefore, Google Maps includes Indian territories inside China for the maps designed for Chinese Google, and vice versa for Indian Google.
“To elaborate, if one is interested in scientific research, then the imputation of missing values, using an application-specific, scientifically sound technique or algorithm, seems reasonable. If instead one has to submit a report to a client authority that prohibits the input of imputed data in the model, one may consider the original data table with a complementary data table that is generated by geostatistical or machine learning algorithms relying on space-time dependent (autocorrelated) variables, or even an extension to incorporate dependent auxiliary covariates into the model.”
So, how do we approach this issue? Dr. Dutta offered three insightful methods for imputing missing data:
- Guessing by common sense with strong domain (subject) knowledge.
- Applying mathematical/statistical/machine-learning algorithms (for example, the random forests algorithm of Breiman, 2001).
- Dreaming the spatial distribution of the true values.
The last technique is the toughest and plausibly the most accurate; it is drawn not from statistics or mathematics but from chemistry. Its pioneering proponent is August Kekulé (1829–1896), who discovered the ring structure of the benzene molecule (a six-membered ring of carbon atoms with alternating single and double bonds) after a daydream. Data scientists today are far inferior to Kekulé, but if the technique is practiced with dedication and effort, it can plausibly lead to interesting outcomes.
We’d like to open this up to the BioMedware community. What do you think? How would you like to see missing data handled, and what imputation algorithms would you like included in Vesta?
Get in touch with Geoffrey at jacquez@biomedware.com and let us know what you think!