A major problem in the geospatial analysis of health events is identifying “true” excesses in health outcomes and disease risk. Most cancers, for example, occur at a “background” rate, often due to the underlying DNA mutation rate. This background mutation rate is attributable to errors in DNA replication that happen when a cell divides. Sometimes, these errors give rise to cancer.  Other cancers are due to environmental exposures that damage DNA. A problem arises when seeking to identify cancers above and beyond the background rate.  Is an observed cancer or disease cluster real? 

The Phenomenon of Pareidolia

When we lie on our backs and watch the clouds scud by, we often imagine animals, objects, and people.  In fact, our minds and eyes are phenomenal pattern recognition machines, prone to finding meaningful patterns where none exist.  This phenomenon is called pareidolia, which my DuckDuckGo search defines as:

“The perception of a recognizable image or meaningful pattern where none exists or is intended, as the perception of a face in the surface features of the moon.” 

Steven Goldstein used the term in 1994 to describe a psychological phenomenon involving an image or another piece of information falsely perceived as significant, although the word itself dates back to 19th-century psychiatric literature.  It comes from the Greek para ("amiss, faulty, wrong") and eidolon ("image"). Interesting examples include a decade-old grilled cheese sandwich said to bear the image of the Virgin Mary, which sold on eBay for $28,000 in 2004.

Continuing in the vein of edible pareidolia, the “Nun Bun” hails from the Bongo Java Coffee Shop in Nashville, Tennessee (owner Bob Bernstein).  In the figure below, an animated GIF helps with the recognition of the nun eidolon on the Nun Bun. Do you see Mother Teresa?

These illustrate how our brain and eyes (our “wetware”) are prone to pareidolia, the identification of meaningful patterns where none exists. 

What does this have to do with geospatial health data?  We often encounter disease maps when beginning an analysis; they are frequently the point of departure.  Are perceived patterns on these maps “real,” or are they best explained as random patterns, a manifestation of pareidolia?

Caption: Do you see a cluster of high cancer rates?  Is the cluster real or can it be explained by chance?  Large areas with high rates (red) visually dominate, whether the rate in those areas is significantly high or not. This map shows cervical cancer mortality in counties in the southwestern United States.  Screen capture from Vesta software.
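
To see how easily chance alone can produce apparent hot spots, consider a minimal simulation sketch in Python. It assumes every county shares the same true rate, so any "cluster" it shows is pure noise. The county populations, the rate, and the function names are hypothetical choices for illustration; they are not taken from the map above.

```python
import math
import random

TRUE_RATE = 1e-4  # assumed identical risk everywhere (hypothetical value)


def poisson_draw(rng, mu):
    """Draw one Poisson(mu) count using Knuth's multiplication method."""
    limit = math.exp(-mu)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_null_map(populations, rate=TRUE_RATE, seed=0):
    """Simulate case counts when no real cluster exists.

    Returns a list of (population, cases, observed_rate) tuples.
    """
    rng = random.Random(seed)
    counties = []
    for pop in populations:
        cases = poisson_draw(rng, rate * pop)
        counties.append((pop, cases, cases / pop))
    return counties


# 50 small and 50 large hypothetical counties, all with the same true risk.
populations = [5_000] * 50 + [100_000] * 50
counties = simulate_null_map(populations)

# Rank by observed rate: these are the "hottest" counties on the map.
hottest = sorted(counties, key=lambda c: c[2], reverse=True)[:5]
for pop, cases, rate in hottest:
    print(f"population={pop:>7}  cases={cases:>2}  rate per 100,000={rate * 1e5:.1f}")
```

In runs like this, the highest rates tend to come from the small counties, where one or two cases swing the rate sharply, even though nothing distinguishes their true risk from that of their neighbors.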

Defining a Cluster

So, what do we mean by a cluster?  We can think of a cluster as a significant excess of disease cases in space, in time, or in both space and time.  

  • “Space” refers to geography, such as an excess of cancer cases in a county
  • “Time” refers to a period, such as an excess of cases in a particular year
  • “Significant” means that the aggregation is unusual compared to the background risk

To be considered a true cluster, the Centers for Disease Control and Prevention (CDC) also recommends that cases in the cluster have a biologically reasonable possible underlying cause, for example, exposure to a known or suspected human carcinogen.

A common practice among health departments that investigate disease clusters is first to determine whether the excess is statistically significant, that is, whether the aggregation is unusual relative to the background risk.  This involves inferential statistical methods that assess whether the cluster can be explained by chance.  If the cluster is statistically unusual and has a biologically reasonable hypothesized cause, the health department initiates a more detailed investigation to identify the cause and take preventative measures.
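
As a simple illustration of what "explained by chance" can mean, here is a sketch of one common approach: treating the case count as Poisson-distributed around the number expected from the background rate, and asking how often chance alone would produce a count at least as large as the one observed. This is an assumed, simplified test for illustration, not a description of any particular health department's procedure; the function name and the example numbers are hypothetical.

```python
import math


def poisson_upper_tail(observed, expected):
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)  # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= expected / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


# Hypothetical county: 12 cases observed where the background rate predicts 5.
p_value = poisson_upper_tail(observed=12, expected=5.0)
print(f"Chance of 12 or more cases under the background rate: {p_value:.4f}")
```

A small probability suggests the excess is unusual, but it does not by itself establish a cause. When many counties are examined at once, some small probabilities will occur by chance, so the threshold for "unusual" needs to account for the number of areas tested.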

The Role of AI and ML

Might it be possible to use artificial intelligence (AI) and machine learning (ML) methods to identify excesses of health events?  AI is a broad term that today is often used for the large language models that achieve general-purpose language understanding and generation, while machine learning algorithms give computers the ability to learn without being explicitly programmed.  Machine learning has been beneficial in pattern recognition, and as cluster detection is a pattern recognition endeavor, it seems reasonable to ask whether machine learning can be routinely used in this domain.  In health analysis, ML has been particularly fruitful in clinical applications to diagnose disease (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8950225/).  Still, applications in assessing spatial, temporal, and space-time disease clusters remain extremely limited.

Can ML provide estimates of the probability of geospatial health outcomes in general, and of disease clusters in particular? The ML techniques must provide reproducible, reliable, and replicable results. The cluster probability estimates must be accurate and provide appropriate measures of cluster uncertainty.  Finally, they must account for: 

  • Background risk 
  • Confounders 
  • Covariates 
  • Other risk factors 
  • Population size 
  • Population growth 
  • Human mobility 
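
To make the first and fifth items concrete, here is a minimal sketch of how background risk and population size are commonly combined in conventional analyses: an expected case count is computed from reference rates applied to the local population, and the observed count is compared with it as a ratio (often called a standardized ratio). The age strata, rates, populations, and function names below are hypothetical and chosen only for illustration.

```python
def expected_cases(population_by_stratum, reference_rate_by_stratum):
    """Expected cases if the area experienced the reference (background) rates."""
    return sum(
        population_by_stratum[stratum] * reference_rate_by_stratum[stratum]
        for stratum in population_by_stratum
    )


def standardized_ratio(observed, expected):
    """Observed divided by expected cases; values above 1 indicate an excess."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# Hypothetical county with two age strata.
population = {"under_50": 10_000, "50_plus": 5_000}
background = {"under_50": 0.0001, "50_plus": 0.001}  # cases per person per year

expected = expected_cases(population, background)
ratio = standardized_ratio(observed=9, expected=expected)
print(f"expected={expected:.1f}  observed=9  ratio={ratio:.2f}")
```

Any ML approach would need to reproduce at least this kind of adjustment before its output could be read as evidence of an excess rather than a reflection of where more people, or older people, live.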

My own quick, layman’s review of the ML geospatial health literature did not find any algorithms that accomplish all of these things.  But it seems possible to use ML to meet all of these requirements.  The big question, though, is what advantages ML disease clustering methods might have over the common practice of using inferential statistics.  I have some ideas but would like to hear from you.  Write me at jacquez@biomedware.com and let’s explore.