The Problem Nobody Talks About

Spatial analysis is one of the most powerful tools available to practitioners in public health, epidemiology, environmental science, ecology, and beyond. The ability to reveal geographic patterns in disease incidence, identify environmental risk hotspots, or map the movement of livestock pathogens across regions can change the course of a study, a policy decision, or a public health response. The technology to do this has never been more capable.

And yet, most GIS analysts spend the majority of their working hours doing none of it.

Instead, they spend their time finding data scattered across dozens of government repositories, wrestling incompatible file formats into submission, hand-cleaning messy spreadsheets, debugging import errors, and troubleshooting coordinate mismatches — all before the first spatial analysis even begins. 

A recent survey of geospatial practitioners conducted by an Eastern Michigan University graduate research team confirmed what many analysts already know from painful experience: 87.5% of respondents identified data cleaning and formatting as their single most time-consuming workflow stage. A further 50% named importing and exporting between software packages as a separate, major time sink — independent of the cleaning burden.

The voice of one participant, an academic researcher, captured the situation precisely:

“Half my day is just trying to get a CSV to talk to ArcGIS without it crashing. If the AI Advisor handles the import/export handshake, that’s worth the subscription alone.” (EMU Survey Participant 6, Academic Researcher)

Unfortunately, this is the dominant experience of practitioners working with spatial data today. The geospatial data problem is not a single obstacle but four compounding ones: 

  1. Finding the right data
  2. Taming an ecosystem of incompatible formats
  3. Ensuring data quality
  4. Integrating disparate sources into a coherent analytical foundation

Each layer is a problem in its own right. Together, they represent an enormous, largely invisible tax on the productive capacity of GIS professionals.

This article breaks down the four compounding layers of the geospatial data problem — discovery, format, quality, and integration — and explains how Vesta’s geospatial analytic software addresses each one.

The Four Layers of the Geospatial Data Problem

Geospatial data problems don’t arrive one at a time. A single analyst working on a single study can face all four of the challenges below in a single afternoon. Each compounds the next.

1. Geospatial Data Discovery: Finding the Right Sources

Before any analysis can begin, analysts must locate the data they need. 

For spatial practitioners working in public health, environmental health, agriculture, or epidemiology, this is rarely straightforward. Authoritative, geographically referenced data is distributed across a sprawling landscape of federal, state, and international repositories — the U.S. Census Bureau, the Centers for Disease Control and Prevention, the Environmental Protection Agency, the U.S. Department of Agriculture, the Food and Agriculture Organization of the United Nations, the World Health Organization, and dozens of state and local agencies — each with its own access protocols, data structures, and update cycles.

The challenge is not simply that data exists in many places. It’s that finding the right dataset often requires deep familiarity with the landscape of available sources that many analysts, particularly those newer to interdisciplinary work, simply do not have. As the Harvard Spatial Data Lab has documented, open datasets in spatiotemporal research are frequently difficult to locate due to their diversity and scattered sources, creating friction that precedes any analytical work.

For researchers who must assemble data spanning health outcomes, demographic characteristics, and environmental exposures simultaneously, this discovery burden compounds. 

A spatial epidemiologist studying cancer incidence may need to locate cancer registry data, Census demographic data at the tract level, environmental exposure data from the EPA, and health behavior data from the CDC — each from a different source, in a different format, with different update schedules. The data hunting itself can consume days.

2. Geospatial Data Formats: The Incompatibility Crisis

Once data is found, it must be ingested. This is where many practitioners encounter what the GIS community has informally come to call “format hell.” The geospatial data ecosystem is not a unified standard — it is an accumulation of formats developed across decades for different purposes by different organizations, many of which do not communicate cleanly with one another.

A working GIS analyst routinely handles a variety of formats and services:

  • CSV files
  • Shapefiles
  • ESRI File Geodatabases
  • GeoTIFF rasters
  • Web Map Service (WMS) endpoints
  • KML files
  • GeoJSON
  • and more 

Each has its own encoding conventions, geometry rules, attribute field constraints, and coordinate reference system assumptions. Moving data between these formats is rarely lossless and rarely painless.

The Shapefile illustrates the problem well. Developed by ESRI in the early 1990s, the Shapefile format remains the most widely distributed vector format in GIS — the default output of most government data portals and a near-universal expectation for data exchange. 

However, its limitations are severe and well documented:

  • Attribute field names are truncated to 10 characters
  • Floating-point numbers may be stored as text
  • The format cannot store topological relationships
  • Each dataset is structurally split across multiple required files

The multi-file structure in particular creates corruption risk whenever component files are mismatched or misplaced (switchfromshapefile.org; Wikipedia, Shapefile). Despite the emergence of superior alternatives such as GeoPackage and GeoJSON, the Shapefile persists because of inertia and because government data publishers continue to rely on it.
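To make the 10-character field-name limit concrete, the following is a minimal sketch of a pre-export check that flags attribute names which would become identical after truncation. The function name and the sample column names are hypothetical, and the sketch does not reflect how any particular tool resolves such clashes (some rename the fields, others overwrite them).

```python
from collections import defaultdict

DBF_FIELD_NAME_LIMIT = 10  # Shapefile attribute names are capped at 10 characters


def find_truncation_collisions(field_names, limit=DBF_FIELD_NAME_LIMIT):
    """Group field names that become identical once cut to `limit` characters."""
    groups = defaultdict(list)
    for name in field_names:
        groups[name[:limit]].append(name)
    # Keep only truncated names shared by more than one original field.
    return {short: names for short, names in groups.items() if len(names) > 1}


# Hypothetical column names from a health survey table.
columns = [
    "population_total",
    "population_under_18",
    "median_income",
    "tract_id",
]
collisions = find_truncation_collisions(columns)
print(collisions)
```

Here both population columns collapse to the same 10-character name, which is exactly the kind of ambiguity that surfaces only after a file has been exported and shared.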

Coordinate Reference System (CRS) mismatches compound the format problem. Spatial datasets from different sources are commonly delivered in different coordinate systems, and combining them without proper reprojection introduces errors that are difficult to detect and consequential when they go unnoticed. Using the wrong projection can distort area measurements by 20% or more and introduce systematic distance errors that propagate through downstream analyses. In analytical contexts where accuracy matters (disease cluster detection, environmental exposure mapping, agricultural yield estimation), these errors are not abstractions.
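As a rough illustration of how large projection-driven area distortion can be, the sketch below computes the factor by which small areas are inflated when they are measured directly in spherical Web Mercator coordinates. Web Mercator is used here only as a familiar example of a projection that is not equal-area; the function name is hypothetical, and a spherical Earth model is assumed.

```python
import math


def web_mercator_area_inflation(latitude_deg):
    """Factor by which spherical Web Mercator inflates small areas at a latitude.

    The projection's linear scale factor is 1 / cos(latitude), applied in both
    directions, so small areas are inflated by 1 / cos(latitude) ** 2.
    """
    return 1.0 / math.cos(math.radians(latitude_deg)) ** 2


for lat in (0, 25, 45, 60):
    factor = web_mercator_area_inflation(lat)
    print(f"latitude {lat:>2}: areas appear {100 * (factor - 1):.0f}% larger")
```

Under these assumptions the inflation already exceeds 20% at about 25 degrees of latitude and doubles the apparent area at 45 degrees, which is why area-based measures such as densities are normally computed in an equal-area projection.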

The EMU survey confirmed what format fragmentation looks like in practice: beyond the 87.5% who flagged data cleaning as their top time sink, fully half of respondents identified the importing and exporting of data between software packages as a separate, major burden. These are practitioners losing significant fractions of their working time not to analysis, but to the friction of incompatible data ecosystems.

3. Geospatial Data Quality: The Invisible Labor

Even when data is located and successfully imported, it frequently arrives in a state that requires substantial remediation before it can support reliable analysis. Missing values, inconsistent text encoding, malformed geometries, duplicate records, and non-standard representations of null or missing data (coded as “-99”, “NA”, “-”, or blank fields depending on the source) are endemic in real-world spatial datasets.
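The missing-value encodings listed above can be normalized in a repeatable way rather than by hand-editing a spreadsheet. The following is a minimal sketch using only the Python standard library; the sentinel set, the function name, and the sample rows are assumptions for illustration, and a real workflow would need per-column rules because a value such as -99 can be legitimate in some fields.

```python
import csv
import io

# Sentinel strings treated as missing; this set is an assumption and should be
# adjusted per source, since "-99" can be a legitimate value in some columns.
MISSING_SENTINELS = {"", "-", "NA", "-99"}


def normalize_missing(rows, sentinels=MISSING_SENTINELS):
    """Return rows with sentinel strings replaced by None, plus a count per column."""
    cleaned, counts = [], {}
    for row in rows:
        new_row = {}
        for column, value in row.items():
            text = value.strip() if value is not None else ""
            if text in sentinels:
                new_row[column] = None
                counts[column] = counts.get(column, 0) + 1
            else:
                new_row[column] = text
        cleaned.append(new_row)
    return cleaned, counts


# Hypothetical extract with three different encodings of "no data".
raw = io.StringIO(
    "tract_id,cases,rate\n"
    "26161400100,12,4.1\n"
    "26161400200,-99,NA\n"
    "26161400300,-,\n"
)
rows, replaced = normalize_missing(csv.DictReader(raw))
print(replaced)
```

Returning the per-column replacement counts alongside the cleaned rows keeps the cleaning step documented, so the same input always yields the same, inspectable result.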

The consequences of undetected data quality problems in spatial analysis are well-established. Missing geocodes and positional errors in spatial epidemiology datasets have been shown to inflate standard errors of estimates, reduce statistical power to detect spatial clustering, and introduce systematic geographic bias in results. Research on prostate cancer spatial patterns in Virginia found that cluster maps appeared markedly different depending on whether missing geocodes were excluded or handled — a finding with direct implications for how public health decisions are made from spatial analyses. In a domain where, as one survey participant noted, “a hallucination isn’t just a typo — it’s a wrong public health decision,” the quality of the data pipeline is not a technical detail. It is a scientific and ethical imperative.

The reproducibility literature reinforces this concern. A landmark survey found that over 60% of researchers in the Earth and Environment domain were unable to reproduce other studies, and more than 40% failed to reproduce their own work. The interdisciplinary nature of geosciences leads to inconsistent standards and varying adoption of reproducible practices, and undocumented manual data cleaning steps — the kind that happen when an analyst hand-edits a spreadsheet before import — are a recognized contributor to this crisis. Data wrangling that is invisible, undocumented, and unrepeatable is not just inefficient. It is a threat to the scientific record.

Cancer registry analysts, who work with some of the most carefully curated spatial health data in existence, encounter these problems regularly.

4. Geospatial Data Integration: Making Disparate Sources Work Together

The final layer of the geospatial data problem is integration: bringing data from multiple sources, formats, and geographic scales into a coherent analytical foundation. This is perhaps the most technically demanding routine task in GIS work, and current tools handle it inconsistently.

Joining a point dataset of disease cases to a polygon layer of census tracts requires a spatial aggregation step that can be executed in multiple ways — nearest centroid, containment, areal weighting — each producing different results and each appropriate for different analytical purposes. Linking tabular health records to geographic units by ID field requires exact string matching that may fail silently when identifier formats differ between sources. Combining datasets collected at different temporal resolutions — annual county-level cancer incidence with decennial census demographics, for example — requires decisions about temporal alignment that affect analytical validity.
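To show why the choice of aggregation method matters, here is a small sketch in which the same case location is assigned to different tracts depending on whether containment or nearest centroid is used. The tract shapes, names, and helper functions are hypothetical, and the centroid is approximated by the mean of the vertices, which is only exact for simple shapes such as these rectangles.

```python
import math


def point_in_polygon(point, ring):
    """Ray-casting containment test; `ring` is a list of (x, y) vertices."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        crosses = (y1 > y) != (y2 > y)
        if crosses and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside


def vertex_mean(ring):
    """Average of the vertices; equals the centroid for rectangles."""
    xs, ys = zip(*ring)
    return sum(xs) / len(xs), sum(ys) / len(ys)


def assign_by_containment(point, tracts):
    """Return the first tract whose boundary contains the point, else None."""
    for name, ring in tracts.items():
        if point_in_polygon(point, ring):
            return name
    return None


def assign_by_nearest_centroid(point, tracts):
    """Return the tract whose (approximate) centroid is closest to the point."""
    return min(tracts, key=lambda name: math.dist(point, vertex_mean(tracts[name])))


# A wide tract next to a narrow one, in arbitrary planar units.
tracts = {
    "tract_A": [(0, 0), (10, 0), (10, 4), (0, 4)],
    "tract_B": [(10, 0), (12, 0), (12, 4), (10, 4)],
}
case = (9.0, 2.0)
by_containment = assign_by_containment(case, tracts)
by_centroid = assign_by_nearest_centroid(case, tracts)
print(by_containment, by_centroid)
```

The case lies inside the wide tract but is closer to the centroid of its narrow neighbor, so the two methods disagree. Neither answer is wrong in the abstract; which one is appropriate depends on the analytical question.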

These challenges are compounded by the dominance of the ESRI/ArcGIS ecosystem. ArcGIS is the industry standard in many institutional contexts, and its formats and workflows have effectively become reference points for data exchange. But this creates both a standard and a trap: analysts working outside the ESRI ecosystem face interoperability friction when receiving data in proprietary formats, while those working within it face licensing costs and workflow lock-in that can constrain analytical choices.

A 2021 NIEHS workshop convening experts in exposure science, geospatial technologies, data science, and population health identified the integration of large, diverse data streams — from sensors, models, surveys, and electronic health records — with disparate spatial and temporal coverage as one of the primary technical challenges facing geospatial epidemiology research. Multi-source integration is not a solved problem. It is an active research challenge that practitioners navigate daily, largely without software support designed for the purpose.

The cumulative effect of all four layers is significant. A dataset that is difficult to find, arrives in the wrong format, contains undetected quality problems, and must be joined to three other sources before analysis can begin may consume days of a skilled analyst’s time — time that was not budgeted, is rarely documented, and could have been spent on the science the analyst was trained to do.

How Vesta Solves the Geospatial Data Pipeline

Vesta, BioMedware’s integrated geospatial analysis platform, was built with the practitioner’s full workflow in mind — not just the analysis stage, but everything that precedes it. The following describes how Vesta’s data capabilities map directly onto each of the four problem layers identified above.

Built-In Access to Authoritative Geospatial Data Sources

Rather than requiring analysts to locate, download, and manually import data from government repositories, Vesta connects directly to a curated set of authoritative public data sources through its Online Data Import system. From within the application, users can access:

  • CDC Datasets: PLACES (local health data by county, tract, and ZIP code), CDC WONDER (mortality and population data), Drug Poisoning Mortality, YRBSS (Youth Risk Behavior Surveillance), and BRFSS (Behavioral Risk Factor Surveillance System).
  • U.S. Census Bureau: American Community Survey demographics, economics, and housing data; Economic Census; annual population estimates — at all geographic levels from state through block group, ZIP code, and place. Vesta automatically handles the Census API’s 50-variable limit and state-by-state iteration, eliminating a common technical barrier for users assembling national datasets.
  • CDC Environmental Public Health Tracking: Environmental public health indicators with state and county filtering.
  • Agriculture and Livestock: FAOSTAT (UN food and agriculture statistics), Gridded Livestock of the World (GLW) raster datasets, and WAHIS (World Animal Health Information System) disease data.
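For readers who assemble Census extracts by hand, the 50-variable limit mentioned in the Census bullet above is usually worked around by splitting a long variable list into batches and stitching the responses back together on a shared geography key. The sketch below illustrates that general pattern only; it is not Vesta's implementation. The function names, the placeholder variable names, and the stand-in fetch function are hypothetical, no network request is made, and the use of GEO_ID as the join key is an assumption.

```python
def chunk_variables(variables, limit=50, always_include=("GEO_ID",)):
    """Build request batches of at most `limit` names, each carrying the join key."""
    room = limit - len(always_include)
    return [
        list(always_include) + variables[i:i + room]
        for i in range(0, len(variables), room)
    ]


def merge_batches(batch_results, key="GEO_ID"):
    """Combine per-batch rows (lists of dicts) into one record per geography."""
    merged = {}
    for rows in batch_results:
        for row in rows:
            merged.setdefault(row[key], {}).update(row)
    return list(merged.values())


def fake_fetch(batch):
    """Stand-in for a real request; returns one row per geography."""
    return [
        {name: (geo if name == "GEO_ID" else 0) for name in batch}
        for geo in ("geo_1", "geo_2")
    ]


# 120 placeholder variable names, more than a single request could carry.
variables = [f"VAR_{n:03d}" for n in range(1, 121)]
batches = chunk_variables(variables)
records = merge_batches(fake_fetch(batch) for batch in batches)
print([len(batch) for batch in batches], len(records))
```

Repeating the join key in every batch costs one slot per request but makes the merge step trivial. A national extract would add an outer loop over states around the same logic.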

The result is that analysts can begin with data rather than with the search for it. Datasets that previously required navigating multiple agency portals, downloading ZIP files, and manually importing and cleaning CSVs are available through a single, consistent interface within Vesta.

Smart Import Tames Geospatial Data Formats Automatically

For data that analysts bring to Vesta themselves, the platform’s Smart Import Dialogs are designed to handle the full spectrum of formats GIS practitioners encounter — and to do the hard work of interpretation automatically.

  • Spreadsheet Import (Excel and CSV): Vesta automatically detects column types (integer, decimal, date/time, string), identifies latitude and longitude columns, recognizes time columns including year-only formats and date ranges, identifies ID columns based on name patterns, and detects common non-standard missing value encodings (“-”, “NA”, “-99”) that would otherwise require manual remediation.
  • Geographic Data Import: Vesta supports full Shapefile import, including the Census Bureau’s DBF format variant, as well as ESRI File Geodatabase (.gdb) files. On import, Vesta automatically reprojects data from any source coordinate reference system to Web Mercator, which eliminates, without manual steps, the CRS mismatch problem that introduces spatial errors when combining datasets from different sources. Invalid polygon geometries are detected and automatically repaired during import.
  • Raster Import (GeoTIFF): Multi-band raster files are imported with full metadata extraction, automatic downsampling for large files, NoData preservation, and smart detection of time periods from band metadata — enabling temporal raster datasets to be imported accurately without manual configuration.
  • Base Maps and Background Layers: Vesta provides 15 base layer styles for map display and context, ranging from satellite and hybrid imagery to cartographic styles optimized for data visualization (Dataviz), thematic mapping (Toner, Backdrop), outdoor and topographic reference (Outdoor, Topo), and aesthetic variety (Aquarelle, Winter, Landscape, Ocean). A None option is available for analysts who prefer a clean canvas.
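The column-type detection described in the Spreadsheet Import item can be approximated with a few ordered parsing attempts. The sketch below is a simplified illustration of the general idea, not a description of Vesta's internals: the ID name pattern, the missing-value set, the single date format, and the sample table are all assumptions.

```python
import re
from datetime import datetime

# Hypothetical rules: names ending in id, fips, or geoid are treated as identifiers.
ID_NAME_PATTERN = re.compile(r"(^|_)(id|fips|geoid)$", re.IGNORECASE)
MISSING = ("", "-", "NA", "-99")


def infer_column_type(name, values, missing=MISSING):
    """Return 'integer', 'decimal', 'date', or 'string' for a column of strings."""
    if ID_NAME_PATTERN.search(name):
        return "string"  # keep identifiers as text so leading zeros survive
    present = [v.strip() for v in values if v.strip() not in missing]
    if not present:
        return "string"

    def all_parse(parser):
        try:
            for v in present:
                parser(v)
        except ValueError:
            return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "decimal"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"


# Hypothetical two-row table, as it might arrive from a CSV file.
table = {
    "geoid": ["01001", "01003"],
    "cases": ["12", "-99"],
    "rate": ["4.1", "NA"],
    "report_date": ["2023-01-15", "2023-02-01"],
    "county": ["Autauga", "Baldwin"],
}
types = {name: infer_column_type(name, values) for name, values in table.items()}
print(types)
```

Checking the column name before parsing matters: without it, an identifier such as 01001 would be read as the integer 1001, and later joins against a source that preserves the leading zero would quietly fail.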

Automated Geospatial Data Quality Checks and Preprocessing

Vesta’s data quality handling is integrated into the import pipeline rather than treated as a separate step. This means that quality problems are surfaced and addressed at the moment of import — before they can propagate into analysis results — rather than discovered after the fact.

The platform automatically detects and handles duplicate records, validates polygon geometries and performs repairs using buffer operations, refines column types by distinguishing between integer and decimal fields, and normalizes text encoding across imported datasets. Following every import, Vesta generates a post-import summary that reports variable count, object count, geometry types, and any warnings surfaced during import — including mismatch counts when data is being merged.
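To give a sense of what polygon validation involves, the following sketch detects one common defect, a ring whose edges cross each other (a "bowtie"). It is a simplified illustration with hypothetical function names: it only finds proper crossings between non-adjacent edges, and it performs no repair. In practice, repair is often delegated to a geometry library, for example through a zero-distance buffer.

```python
def _orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)


def _properly_intersect(a, b, c, d):
    """True when segments ab and cd cross at a single interior point."""
    return (
        _orientation(a, b, c) * _orientation(a, b, d) < 0
        and _orientation(c, d, a) * _orientation(c, d, b) < 0
    )


def ring_self_intersects(ring):
    """Check every pair of non-adjacent edges of a closed ring for a crossing."""
    edges = list(zip(ring, ring[1:] + ring[:1]))
    n = len(edges)
    for i in range(n):
        for j in range(i + 1, n):
            # Adjacent edges share a vertex, so they are skipped.
            if j == i + 1 or (i == 0 and j == n - 1):
                continue
            if _properly_intersect(*edges[i], *edges[j]):
                return True
    return False


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]  # same corners, visited in a crossing order
print(ring_self_intersects(square), ring_self_intersects(bowtie))
```

The pairwise comparison is quadratic in the number of edges, which is acceptable for a sketch but not for large boundary files; production libraries use spatial indexing for the same check.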

A preview mode allows analysts to inspect a sample of incoming data before committing to a full import — enabling informed decisions about whether the data meets expectations before it enters the analytical environment.

The significance of automated quality handling extends beyond efficiency. When quality checks are built into the software rather than left to manual vigilance, the results are consistent, documented, and reproducible. The same import run by two different analysts produces the same quality-checked result — addressing one of the primary sources of irreproducibility identified in the geospatial research literature.

Multi-Source Geospatial Data Integration Without Silent Errors

Vesta’s import pipeline is designed to handle the full complexity of multi-source data integration — including the operations that are most prone to silent error in manual workflows.

Spatial aggregation of point datasets to polygon geographies is supported with multiple user-selectable methods, giving analysts control over how spatial relationships are computed. Dataset merging by ID field includes mismatch reporting: if records in one dataset do not match records in another, the unmatched records are reported explicitly rather than silently dropped. Multiple files can be batch-imported simultaneously, with real-time progress tracking and the ability to cancel mid-import if needed.
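The idea of a merge that reports its mismatches can be sketched in a few lines. The example below is a generic illustration rather than Vesta's implementation: the function name, field names, and sample values are hypothetical. It joins health records to census records on a county code and returns the unmatched identifiers from each side alongside the merged rows.

```python
def merge_with_report(left_rows, right_rows, key):
    """Inner-join two lists of dicts on `key` and report unmatched identifiers."""
    right_index = {row[key]: row for row in right_rows}
    merged, left_only = [], []
    for row in left_rows:
        match = right_index.get(row[key])
        if match is None:
            left_only.append(row[key])
        else:
            merged.append({**match, **row})
    left_keys = {row[key] for row in left_rows}
    right_only = [k for k in right_index if k not in left_keys]
    report = {"matched": len(merged), "left_only": left_only, "right_only": right_only}
    return merged, report


# Hypothetical inputs: one health record lost its leading zero in a spreadsheet.
health = [{"fips": "01003", "cases": 7}, {"fips": "1001", "cases": 12}]
census = [
    {"fips": "01001", "population": 58000},
    {"fips": "01003", "population": 230000},
    {"fips": "01005", "population": 25000},
]

_, report_before = merge_with_report(health, census, "fips")

# Restore the five-digit code, then merge again.
health_fixed = [{**row, "fips": row["fips"].zfill(5)} for row in health]
merged, report_after = merge_with_report(health_fixed, census, "fips")
print(report_before)
print(report_after)
```

In the first pass the truncated code appears in the report instead of vanishing from the output, which is what makes the formatting problem visible and fixable before analysis begins.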

The cumulative result is a platform in which data from multiple sources — CDC health data, Census demographic layers, proprietary CSV records, raster environmental datasets — can be brought into a single analytical environment in formats that are clean, correctly projected, and ready for analysis. The pipeline that most analysts currently execute through a combination of spreadsheet editors, command-line tools, and GIS software import dialogs is consolidated into a single, auditable, repeatable workflow.

What Does Better Mean for the Geospatial Analyst?

The practitioner at the center of this problem is not a technician who lacks skill. They are, as the EMU study described them, someone who understands the science (who knows what a hotspot means, what a spatial cluster implies, what a kriging interpolation is for) but who spends the majority of their working time on data preparation before they can apply that knowledge. As one hydrogeologist in the EMU survey put it:

“I know what I want to see (hotspots), but I don’t always remember which specific kriging method to use. Having a ‘co-pilot’ to suggest the right tool would save me hours of googling.” (EMU Survey Participant 1, Hydrogeologist)

The 87.5% finding is not just a striking statistic. It represents a recoverable loss. If data preparation consumes the majority of analyst time, then a platform that substantially automates that preparation returns hours per week and weeks per year to the work these practitioners were trained — and hired — to do: spatial analysis, pattern interpretation, evidence-based insight.

The connection to Vesta’s AI Advisor is important here. The AI Advisor — Vesta’s integrated analytical co-pilot, trained on decades of spatial analysis expertise — can only deliver its full value when the data that feeds it is clean, correctly projected, and properly structured. The data pipeline is not a prelude to the analytical platform. It is the foundation. A well-built foundation enables everything that follows.

The trust dimension matters as well. Spatial epidemiologists and public health practitioners rightly demand transparency in analytical tools. As one GIS analyst in the EMU survey observed, in this domain a software error is not just a technical inconvenience — it can propagate into a public health decision. Vesta’s approach to import quality — surfacing warnings, providing summaries, making the import process auditable — reflects a commitment to transparency at every stage of the workflow, not just at the analysis output.

Resetting the Starting Line for Geospatial Analysis

The field of geospatial analysis has, largely by necessity, accepted data wrangling as an inevitable cost of doing spatial work. The tools available have reflected this acceptance: powerful analytical engines bolted onto import dialogs that were never designed to handle the messy realities of real-world data.

This doesn’t have to be the standard.

The best spatial analysis tools should meet analysts where the work actually starts — not at the method, but at the data. That means built-in access to authoritative sources, intelligent format handling, automated quality detection, and integration infrastructure that surfaces problems rather than hiding them. It means making the path from raw, scattered, incompatible data to analysis-ready datasets as short, transparent, and repeatable as possible.

Vesta was built with this conviction. Thirty years of geospatial IP, combined with a modern software architecture and a genuine understanding of how practitioners actually work, have produced a platform in which the data pipeline is a first-class feature — not an afterthought.

If half your day is just getting data to work, it’s time to change what you’re working with.

Learn more about Vesta’s data capabilities here.

Frequently Asked Questions About Geospatial Data

Why does geospatial data preparation take so long?

Geospatial data is scattered across dozens of federal, state, and international repositories, delivered in incompatible formats with conflicting coordinate reference systems, and frequently arrives with undetected quality problems. A 2026 EMU survey found that 87.5% of respondents named data cleaning and formatting as their single most time-consuming workflow stage.

What formats are common in geospatial data?

Working GIS analysts routinely encounter CSV, Shapefile, ESRI File Geodatabase, GeoTIFF, Web Map Service (WMS), KML, and GeoJSON. Each has its own encoding conventions, geometry rules, and coordinate reference system assumptions, and conversion between them is rarely lossless.

How does a coordinate reference system mismatch affect spatial analysis?

Using the wrong projection when combining datasets can distort area measurements by 20% or more and introduce systematic distance errors that propagate through downstream analyses — a consequential problem in disease cluster detection, environmental exposure mapping, and agricultural yield estimation.

What is a geospatial data pipeline?

A geospatial data pipeline is the sequence of steps — discovery, import, format conversion, quality checking, and integration — that moves raw geospatial data into an analysis-ready state. Most GIS workflows handle these steps in separate tools; integrated platforms like Vesta consolidate them.

Can geospatial analysis software handle data from multiple sources?

Integrated platforms can. Vesta’s import pipeline supports spatial aggregation with multiple user-selectable methods, dataset merging by ID field with explicit mismatch reporting, and batch import of files from different sources — consolidating a workflow that most analysts currently execute across multiple tools.