Some of us remember the term “large data,” which was a big deal in the 1980s. Desktop computing was just beginning, and statistical models were the leading tool in data analysis.
These models leveraged matrix algebra, and methods such as matrix inversion were the bread and butter of analysis. Datasets were minuscule by today’s standards; a few hundred rows were common, and datasets seldom had the thousands of observations they do today.
How do you do matrix inversion on a dataset with thousands of rows and columns? Enter the era of “Large Data” when statisticians, programmers, and data scientists developed new algorithms and analysis approaches to apply statistical methods to large data. They, in large part, succeeded. But the expansion of the data universe was just beginning.
What is Big Data?
The term “Big Data” was coined in the 1990s to describe data that are “heterogeneous”, “unstructured”, and “huge in size”. But even these descriptors are moving targets: sizes may run from dozens of terabytes up to zettabytes. Hence Big Data has required new approaches, techniques, and technologies to visualize, decant, and analyze. Enter Big Data Analytics, but what does it mean? Let’s dive into the “V’s”: Volume, Variety and Velocity (Figure 1).
Figure 1. Big Data characterized by Volume, Variety and Velocity. Source: Ender005 – Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=49888192
The Three V’s
Volume has to do with the sheer size of the data: the number of observations and the number of things observed (also referred to as variables). These are analogous to the rows and columns of the large data era. But where the rows and columns of a data matrix might be called structured, in the sense that every observation fits into a row with the same number of variables in the columns, big data are often unstructured. The second V, Variety, refers to this lack of structure. The data are not observed as in a statistical experiment (along the lines of large data), but come from a compendium of sources: web scrapings, text data, numerical data, imagery, retail receipts, cell phone traces and so on. These can’t be stuffed into the conventional rows and columns of a structured data matrix.
Finally, the third V, Velocity. The data are streaming, even screaming, into our data receptors, and they don’t stop: huge volumes of unstructured data arriving hourly and daily.
Today: The Five V’s
The three V’s have expanded to five: Volume, Variety, Velocity, Veracity, and Value.
- Veracity has to do with the “truth” of the data. It’s especially relevant because the varied nature of the data includes web scrapings and social media offerings. When do we treat factually incorrect social media posts (e.g. “Global warming is an information conspiracy”) as true? Factually, that statement is incorrect; but if one is exploring the range of views on global warming expressed on social media, it is a valid observation. Judgment is required, yet humans simply can’t read and judge every piece of a terabyte data stream.
- Value can be assessed by the financial return expected from information; it also can be measured as something intrinsic to the information itself and the qualities of the big data.
The Sixth ‘V’ of Geospatial
How do big data play out in GIS? Let’s address this by first asking what the six V’s look like in geospatial. I’ll call these the GV’s, which echoes the Bee Gees from the era of Large Data.
Geospatial Volume: Volume includes data sources that are becoming the bread and butter of geospatial analytics. Imagery coming from satellites, aircraft and drones can be enormous, especially as the number of bands and the resolution increase. Consider satellite data. Landsat imagery has been around for decades and includes six bands (bandwidths of reflectance that can be sensed by the satellite, analogous to red light, green light and so on) with pixel sizes of 30m or so. But now hyperspectral imagery can have hundreds of bands, with submeter pixel sizes. The result is hyperspectral cubes of dimension “number of bands” by “number of pixels” – enormous data volumes. When the satellites pass over the same area repeatedly, the result is a temporal series of hyperspectral cubes.
Example of a hyperspectral cube. Source: Wikipedia.
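To get a feel for the volumes involved, here is a minimal back-of-the-envelope sketch in Python. The scene size, band count, and revisit rate are invented for illustration and not taken from any particular sensor.

```python
import numpy as np

# A minimal sketch of why hyperspectral volumes explode (illustrative numbers only).
# A single "cube" is bands x rows x cols of reflectance values.
bands, rows, cols = 200, 10_000, 10_000        # hundreds of bands, fine pixels over one scene
bytes_per_value = 2                            # e.g., 16-bit integer reflectance

cube_bytes = bands * rows * cols * bytes_per_value
print(f"One cube: {cube_bytes / 1e9:.0f} GB")  # ~40 GB for a single scene

# Repeat passes over the same area stack cubes into a series: time x bands x rows x cols.
passes_per_year = 50
print(f"One year of revisits: {passes_per_year * cube_bytes / 1e12:.1f} TB")

# In practice such arrays are never held in memory at once; they are tiled and chunked
# on disk. A small in-memory stand-in just to show the shape convention:
toy_cube = np.zeros((200, 64, 64), dtype=np.uint16)   # (bands, rows, cols)
print(toy_cube.shape, toy_cube.nbytes, "bytes")
```

Even with these made-up numbers, a single scene runs to tens of gigabytes and a year of revisits to terabytes – before any other data source is added.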
Geospatial Variety: Additional high-volume geospatial data sources include cell phone traces and attributes for populations, retail sales behavior in store networks, and transportation traces from commercial vehicles, to name a few. It is not uncommon to undertake geospatial analyses that combine satellite and hyperspectral data, census data for underlying populations, and mobility data from cell phones and transport.
Geospatial Velocity: “Surveillance” is a frequently encountered term in security, geohealth, crime and other domains. The notion is a stream of incoming data that supports timely response and intervention, and the payoffs can be significant, provided the high-velocity data arrive and are analyzed in near-real time.
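As a rough illustration of velocity, here is a sketch of a rolling-window ingest loop in Python. The simulated event stream, the grid size, and the 15-minute window are all hypothetical stand-ins for a real surveillance feed.

```python
from collections import Counter, deque
from datetime import datetime, timedelta
import random

# Sketch: ingest a (simulated) stream of geotagged events and keep a rolling
# 15-minute count per coarse grid cell, discarding anything older than the window.
WINDOW = timedelta(minutes=15)

def simulated_stream(n=1000, start=datetime(2024, 1, 1)):
    """Stand-in for a real feed: yields (timestamp, lat, lon) tuples in time order."""
    t = start
    for _ in range(n):
        t += timedelta(seconds=random.randint(1, 30))
        yield t, 40 + random.random(), -105 + random.random()

def grid_cell(lat, lon, size=0.1):
    """Snap a coordinate to a coarse grid cell (roughly 10 km at mid-latitudes)."""
    return round(lat / size) * size, round(lon / size) * size

window = deque()          # events inside the current time window
counts = Counter()        # running counts per grid cell

for ts, lat, lon in simulated_stream():
    cell = grid_cell(lat, lon)
    window.append((ts, cell))
    counts[cell] += 1
    # Evict events that have aged out of the window.
    while window and ts - window[0][0] > WINDOW:
        _, old_cell = window.popleft()
        counts[old_cell] -= 1
    # Here one would flag cells whose counts spike (e.g., for a surveillance alert).
```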
Geospatial Veracity: Are the data good representations of the underlying reality? This question underpins all of geospatial analysis, and terms such as registration (are the geographic locations correct?), co-registration (do the locations from two different data layers line up?), ground truthing (does our interpretation of the data correspond to what is actually there?), and uncertainty (how sure are we that our classification of what is on the ground is correct?) are the lingua franca of geospatial analysis.
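A small sketch of what one such veracity check might look like in practice: comparing the coordinate system, pixel size, and origin of two layers before overlaying them. The metadata values below are made up; a real workflow would read them from the file headers.

```python
# Sketch of a co-registration sanity check between two raster layers, using
# made-up metadata dictionaries in place of real file headers.
TOLERANCE = 1e-6  # maximum allowed origin offset, in layer units (e.g., meters)

layer_a = {"crs": "EPSG:32613", "origin": (500000.0, 4400000.0), "pixel_size": 30.0}
layer_b = {"crs": "EPSG:32613", "origin": (500015.0, 4400000.0), "pixel_size": 30.0}

def coregistration_issues(a, b, tol=TOLERANCE):
    """Return a list of reasons two layers cannot be safely overlaid."""
    issues = []
    if a["crs"] != b["crs"]:
        issues.append(f"different coordinate systems: {a['crs']} vs {b['crs']}")
    if a["pixel_size"] != b["pixel_size"]:
        issues.append("different pixel sizes; resampling required")
    dx = abs(a["origin"][0] - b["origin"][0])
    dy = abs(a["origin"][1] - b["origin"][1])
    if max(dx, dy) > tol:
        issues.append(f"origins offset by ({dx}, {dy}); layers are misregistered")
    return issues

print(coregistration_issues(layer_a, layer_b))
# -> ['origins offset by (15.0, 0.0); layers are misregistered']
```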
Earthrise, taken from Apollo 8 by William Anders.
Geospatial Value is where the rubber hits the road. Sometimes the data themselves have inherent value, such as the well-known “Earthrise” image taken from Apollo 8 that brought home just how marvelous and fragile our earth really is. Often, Geospatial Value arises from geospatial analysis. Examples include real estate value estimation (what is that home worth?) and route finding (how do I get there from here?). In geospatial health, we are seeing payoffs in health/environment research through the analysis of big data sets, including:
- flu and pharmacy data
- hazards and auto accidents
- human mobility
- satellite imagery to define disease vector habitat (e.g. mosquitoes)
- and so on.
These support the construction of more detailed and better-informed disease models. This effort is nascent in many ways, as new techniques are needed to handle the big geospatial data sets. And the effort is even more challenging for analyses that seek to interdigitate with clinical and genomic data.
Geospatial Varied support: Varied spatial support poses a key problem in geospatial analysis. By “spatial support” we mean the area or location that underpins where the data are coming from. Residential locations are often represented as a point, or dot, on a map. They can also be shown using the property description for the home – the area of land on which the home sits.
Socio-economic data often come from a census, and in the United States, census geography includes census blocks, tracts, metropolitan statistical areas, and counties, among others. Each of these represents a different spatial support, composed of different-sized areas and encompassing different populations. Dealing with different spatial supports across different variables is an important problem in geospatial analysis, addressed by techniques of upscaling, side-scaling, and downscaling.
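As a minimal sketch of upscaling, the snippet below aggregates block-level counts to their containing tracts, exploiting the fact that U.S. census block GEOIDs nest within tract GEOIDs (the tract code is the first 11 characters of the 15-character block code). The population values are invented.

```python
import pandas as pd

# Sketch of "upscaling": collapsing fine-support (block) observations onto a
# coarser support (tract). Data values are invented for illustration.
blocks = pd.DataFrame({
    "block_geoid": ["080130601001000", "080130601001001", "080130602002000"],
    "population":  [42, 17, 95],
})

# Census blocks nest inside tracts, so the tract GEOID is a prefix of the block GEOID.
blocks["tract_geoid"] = blocks["block_geoid"].str[:11]
tracts = blocks.groupby("tract_geoid", as_index=False)["population"].sum()
print(tracts)

# Downscaling runs the other way (e.g., allocating a tract total to its blocks
# in proportion to some ancillary variable) and requires modeling assumptions.
```

Nested geographies make upscaling a simple aggregation; side-scaling and downscaling between non-nested supports are considerably harder.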
Big Geospatial Data Brings Big Problems
Many geospatial analysis techniques are founded on statistical approaches suited to the era of Large Data. But can these techniques be extended to Big Geospatial Data, or are new approaches required? How do we do change of support for geospatial big data, let alone undertake spatial statistics and geostatistics?
Research is currently underway employing machine learning and AI techniques. But what might be the role of AI in Geospatial Big Data analysis? Problems seeking solutions include:
- The identification of relevant data sources. You identify the problem, and AI helps find appropriate data.
- Initial data massaging, those tasks that are absolutely necessary but consume so much of an analyst’s time: Data registration, change of temporal and spatial support, spatial integration, temporal integration.
- Method selection, helping the analyst select appropriate techniques.
- Results interpretation, from spatial patterns, to geospatial analysis, to incorporation of authoritative information (e.g. PubMed) in results interpretation.
Geospatial Reasoning. Perhaps the biggest payoff, if attainable, is semi-automated hypothesis generation based on logical approaches, including induction, deduction, and abduction. This requires AI that follows logical constructs – a heavy lift when we consider issues such as AI hallucination. But Q* (Qstar) reportedly solved basic math problems, a task in mathematical logic that requires applying rules such as propositions, transitive properties, association and so on. Is the next step, then, logical reasoning along the lines of Strong Inference: the ability to generate hypotheses, interpret results to create plausible sets of explanations, and perhaps find the underlying causes of disease?
References
Diebold, Francis X. 2012. On the Origin(s) and Development of the Term ‘Big Data’. SSRN Electronic Journal. DOI: 10.2139/ssrn.2152421.