Clustering and Nearest Neighbors

Advanced Spatial Analysis

Video Locked

Please log in to watch this video

Log In
Chapter Info
Course The Ultimate PostGIS course
Module Advanced Spatial Analysis
Chapter Clustering and Nearest Neighbors

Chapter Content

Spatial clustering and nearest neighbor analysis are powerful techniques for discovering patterns in geographic data. Whether you're identifying urban agglomerations, optimizing service locations, or analyzing point distributions, PostGIS provides sophisticated tools for spatial grouping and proximity searches. This comprehensive guide covers clustering algorithms and efficient nearest neighbor queries with real-world applications.

Introduction to Spatial Clustering and Nearest Neighbors

Spatial clustering and nearest neighbor analysis help reveal hidden patterns in geographic data by:

  • Grouping similar or nearby features together
  • Identifying spatial hotspots and patterns
  • Finding optimal locations for services
  • Analyzing spatial relationships and dependencies
  • Supporting decision-making in urban planning and resource allocation
Technique Function Purpose Best Use Case
K-Means Clustering ST_ClusterKMeans Partition into k equal-sized groups Regional planning, service zones
DBSCAN Clustering ST_ClusterDBSCAN Density-based natural grouping Urban agglomerations, hotspots
Distance Clustering ST_ClusterWithin Group by distance threshold Service catchments, proximity zones
Nearest Neighbors KNN with <-> Find closest features efficiently Location optimization, routing

Setting Up Sample Data

Let's create a comprehensive dataset of world cities to demonstrate clustering and nearest neighbor techniques:

-- Create cities table with realistic global data
CREATE TABLE cities (
  id SERIAL PRIMARY KEY,
  name TEXT,
  country TEXT,
  continent TEXT,
  pop_max INTEGER,
  pop_min INTEGER,
  latitude NUMERIC,
  longitude NUMERIC,
  geom GEOMETRY(Point, 4326)
);

-- Insert sample cities from different continents
INSERT INTO cities (name, country, continent, pop_max, pop_min, latitude, longitude, geom) VALUES
-- Asia
('Delhi', 'India', 'Asia', 32900000, 16800000, 28.6139, 77.2090, ST_Point(77.2090, 28.6139)),
('Mumbai', 'India', 'Asia', 24400000, 12500000, 19.0760, 72.8777, ST_Point(72.8777, 19.0760)),
('Tokyo', 'Japan', 'Asia', 37400000, 13500000, 35.6762, 139.6503, ST_Point(139.6503, 35.6762)),
('Shanghai', 'China', 'Asia', 28500000, 24200000, 31.2304, 121.4737, ST_Point(121.4737, 31.2304)),
('Beijing', 'China', 'Asia', 21500000, 11100000, 39.9042, 116.4074, ST_Point(116.4074, 39.9042)),

-- Europe  
('London', 'United Kingdom', 'Europe', 9600000, 8900000, 51.5074, -0.1278, ST_Point(-0.1278, 51.5074)),
('Paris', 'France', 'Europe', 11000000, 2100000, 48.8566, 2.3522, ST_Point(2.3522, 48.8566)),
('Berlin', 'Germany', 'Europe', 3700000, 3500000, 52.5200, 13.4050, ST_Point(13.4050, 52.5200)),
('Madrid', 'Spain', 'Europe', 6600000, 3200000, 40.4168, -3.7038, ST_Point(-3.7038, 40.4168)),
('Rome', 'Italy', 'Europe', 4300000, 2900000, 41.9028, 12.4964, ST_Point(12.4964, 41.9028)),

-- North America
('New York', 'United States', 'North America', 18800000, 8400000, 40.7128, -74.0060, ST_Point(-74.0060, 40.7128)),
('Los Angeles', 'United States', 'North America', 12400000, 4000000, 34.0522, -118.2437, ST_Point(-118.2437, 34.0522)),
('Chicago', 'United States', 'North America', 8600000, 2700000, 41.8781, -87.6298, ST_Point(-87.6298, 41.8781)),
('Toronto', 'Canada', 'North America', 6400000, 2800000, 43.6532, -79.3832, ST_Point(-79.3832, 43.6532)),
('Mexico City', 'Mexico', 'North America', 21600000, 9200000, 19.4326, -99.1332, ST_Point(-99.1332, 19.4326)),

-- South America
('São Paulo', 'Brazil', 'South America', 22400000, 12300000, -23.5505, -46.6333, ST_Point(-46.6333, -23.5505)),
('Buenos Aires', 'Argentina', 'South America', 15000000, 3000000, -34.6118, -58.3960, ST_Point(-58.3960, -34.6118)),
('Lima', 'Peru', 'South America', 10700000, 9700000, -12.0464, -77.0428, ST_Point(-77.0428, -12.0464)),

-- Africa
('Lagos', 'Nigeria', 'Africa', 15400000, 13900000, 6.5244, 3.3792, ST_Point(3.3792, 6.5244)),
('Cairo', 'Egypt', 'Africa', 20900000, 10200000, 30.0444, 31.2357, ST_Point(31.2357, 30.0444)),
('Johannesburg', 'South Africa', 'Africa', 10000000, 4400000, -26.2041, 28.0473, ST_Point(28.0473, -26.2041)),

-- Oceania
('Sydney', 'Australia', 'Oceania', 5300000, 4600000, -33.8688, 151.2093, ST_Point(151.2093, -33.8688)),
('Melbourne', 'Australia', 'Oceania', 5100000, 4900000, -37.8136, 144.9631, ST_Point(144.9631, -37.8136));

-- Create spatial index
CREATE INDEX idx_cities_geom ON cities USING GIST (geom);

-- Create additional indexes for performance
CREATE INDEX idx_cities_continent ON cities (continent);
CREATE INDEX idx_cities_pop_max ON cities (pop_max);

K-Means Clustering

ST_ClusterKMeans partitions points into a specified number of clusters by minimizing within-cluster variance. It's ideal for creating evenly-sized groups.

Basic K-Means Clustering

-- Partition cities into 5 global regions
SELECT 
  id, 
  name, 
  country,
  continent,
  pop_max,
  ST_ClusterKMeans(geom, 5) OVER () AS cluster_id
FROM cities
ORDER BY cluster_id, pop_max DESC;

-- Create a table with cluster assignments
CREATE TABLE city_clusters_kmeans AS
SELECT 
  c.*,
  ST_ClusterKMeans(c.geom, 5) OVER () AS cluster_id
FROM cities c;

Analyzing K-Means Results

-- Analyze cluster characteristics
SELECT 
  cluster_id,
  COUNT(*) AS city_count,
  array_agg(DISTINCT continent) AS continents,
  AVG(pop_max) AS avg_population,
  MIN(pop_max) AS min_population,
  MAX(pop_max) AS max_population,
  ST_AsText(ST_Centroid(ST_Collect(geom))) AS cluster_center
FROM city_clusters_kmeans
GROUP BY cluster_id
ORDER BY cluster_id;

-- Find cluster centroids and create service regions
WITH cluster_centroids AS (
  SELECT 
    cluster_id,
    ST_Centroid(ST_Collect(geom)) AS centroid,
    COUNT(*) AS city_count,
    SUM(pop_max) AS total_population
  FROM city_clusters_kmeans
  GROUP BY cluster_id
)
SELECT 
  cluster_id,
  ST_AsText(centroid) AS centroid_location,
  city_count,
  total_population,
  -- Create 2000km service area around centroid
  ST_Area(ST_Buffer(centroid::geography, 2000000)) / 1000000 AS service_area_km2
FROM cluster_centroids
ORDER BY total_population DESC;

Optimizing K-Means Cluster Count

-- Test different cluster counts to find optimal k
WITH cluster_analysis AS (
  SELECT 
    k,
    cluster_id,
    COUNT(*) AS cluster_size,
    ST_Collect(geom) AS cluster_geom
  FROM (
    SELECT 
      2 AS k, ST_ClusterKMeans(geom, 2) OVER () AS cluster_id, geom FROM cities
    UNION ALL
    SELECT 
      3 AS k, ST_ClusterKMeans(geom, 3) OVER () AS cluster_id, geom FROM cities
    UNION ALL
    SELECT 
      4 AS k, ST_ClusterKMeans(geom, 4) OVER () AS cluster_id, geom FROM cities
    UNION ALL
    SELECT 
      5 AS k, ST_ClusterKMeans(geom, 5) OVER () AS cluster_id, geom FROM cities
    UNION ALL
    SELECT 
      6 AS k, ST_ClusterKMeans(geom, 6) OVER () AS cluster_id, geom FROM cities
  ) clustered
  GROUP BY k, cluster_id
),
cluster_metrics AS (
  SELECT 
    k,
    AVG(cluster_size) AS avg_cluster_size,
    STDDEV(cluster_size) AS cluster_size_stddev,
    MIN(cluster_size) AS min_cluster_size,
    MAX(cluster_size) AS max_cluster_size
  FROM cluster_analysis
  GROUP BY k
)
SELECT 
  k,
  ROUND(avg_cluster_size, 1) AS avg_size,
  ROUND(cluster_size_stddev, 1) AS size_stddev,
  min_cluster_size,
  max_cluster_size,
  ROUND(cluster_size_stddev / avg_cluster_size, 2) AS coefficient_of_variation
FROM cluster_metrics
ORDER BY k;

DBSCAN Clustering

ST_ClusterDBSCAN groups points based on density, identifying natural clusters and outliers. It's excellent for discovering urban agglomerations and hotspots.

Basic DBSCAN Clustering

-- Identify urban agglomerations using DBSCAN
-- eps: 10 degrees (~1100km), minpoints: 2
SELECT 
  id,
  name,
  country,
  continent,
  pop_max,
  ST_ClusterDBSCAN(geom, eps := 10.0, minpoints := 2) OVER () AS dbscan_cluster
FROM cities
ORDER BY dbscan_cluster NULLS LAST, pop_max DESC;

-- Create table with DBSCAN results
CREATE TABLE city_clusters_dbscan AS
SELECT 
  c.*,
  ST_ClusterDBSCAN(c.geom, eps := 10.0, minpoints := 2) OVER () AS dbscan_cluster
FROM cities c;

Analyzing DBSCAN Results

-- Analyze DBSCAN clusters and outliers
SELECT 
  CASE 
    WHEN dbscan_cluster IS NULL THEN 'Outliers'
    ELSE 'Cluster ' || dbscan_cluster
  END AS cluster_type,
  COUNT(*) AS city_count,
  array_agg(name ORDER BY pop_max DESC) AS cities,
  array_agg(DISTINCT continent) AS continents,
  AVG(pop_max) AS avg_population,
  SUM(pop_max) AS total_population
FROM city_clusters_dbscan
GROUP BY dbscan_cluster
ORDER BY dbscan_cluster NULLS LAST;

-- Find dense urban regions
WITH dense_clusters AS (
  SELECT 
    dbscan_cluster,
    COUNT(*) AS city_count,
    SUM(pop_max) AS total_population,
    ST_ConvexHull(ST_Collect(geom)) AS cluster_boundary
  FROM city_clusters_dbscan
  WHERE dbscan_cluster IS NOT NULL
  GROUP BY dbscan_cluster
  HAVING COUNT(*) >= 3
)
SELECT 
  dbscan_cluster,
  city_count,
  total_population,
  ROUND(ST_Area(cluster_boundary::geography) / 1000000, 0) AS area_km2,
  ROUND(total_population / (ST_Area(cluster_boundary::geography) / 1000000), 0) AS population_density_per_km2
FROM dense_clusters
ORDER BY population_density_per_km2 DESC;

Adaptive DBSCAN Parameters

-- Test different DBSCAN parameters
WITH dbscan_tests AS (
  SELECT 
    'eps=5, minpts=2' AS params,
    ST_ClusterDBSCAN(geom, eps := 5.0, minpoints := 2) OVER () AS cluster_id,
    name, continent
  FROM cities
  
  UNION ALL
  
  SELECT 
    'eps=10, minpts=2' AS params,
    ST_ClusterDBSCAN(geom, eps := 10.0, minpoints := 2) OVER () AS cluster_id,
    name, continent
  FROM cities
  
  UNION ALL
  
  SELECT 
    'eps=15, minpts=3' AS params,
    ST_ClusterDBSCAN(geom, eps := 15.0, minpoints := 3) OVER () AS cluster_id,
    name, continent
  FROM cities
)
SELECT 
  params,
  COUNT(CASE WHEN cluster_id IS NOT NULL THEN 1 END) AS clustered_cities,
  COUNT(CASE WHEN cluster_id IS NULL THEN 1 END) AS outlier_cities,
  COUNT(DISTINCT cluster_id) AS num_clusters
FROM dbscan_tests
GROUP BY params
ORDER BY params;

Distance-Based Clustering

ST_ClusterWithin groups points that lie within a specified distance of each other, creating transitive clusters.

-- Group cities within 1000km of each other
SELECT 
  id,
  name,
  country,
  continent,
  ST_ClusterWithin(geom, 10.0) OVER () AS proximity_cluster
FROM cities
ORDER BY proximity_cluster, pop_max DESC;

-- Analyze proximity clusters
WITH proximity_clusters AS (
  SELECT 
    c.*,
    ST_ClusterWithin(c.geom, 10.0) OVER () AS proximity_cluster
  FROM cities c
)
SELECT 
  proximity_cluster,
  COUNT(*) AS city_count,
  array_agg(name ORDER BY pop_max DESC) AS cities_in_cluster,
  array_agg(DISTINCT continent) AS continents,
  SUM(pop_max) AS total_population,
  ST_AsText(ST_Centroid(ST_Collect(geom))) AS cluster_center
FROM proximity_clusters
GROUP BY proximity_cluster
HAVING COUNT(*) > 1
ORDER BY total_population DESC;

Nearest Neighbor Analysis

PostGIS supports efficient K-Nearest Neighbor (KNN) searches using the <-> operator with GiST indexes.

Basic Nearest Neighbor Queries

-- Find 3 nearest cities to each city
SELECT 
  c1.name AS city,
  c1.country,
  c2.name AS nearest_neighbor,
  c2.country AS neighbor_country,
  ROUND(ST_Distance(c1.geom::geography, c2.geom::geography) / 1000, 0) AS distance_km
FROM cities c1
CROSS JOIN LATERAL (
  SELECT name, country, geom
  FROM cities c2
  WHERE c2.id != c1.id
  ORDER BY c1.geom <-> c2.geom
  LIMIT 3
) c2
ORDER BY c1.name, distance_km;

-- Find nearest neighbor by continent
SELECT 
  c1.name AS city,
  c1.continent,
  c2.name AS nearest_same_continent,
  ROUND(ST_Distance(c1.geom::geography, c2.geom::geography) / 1000, 0) AS distance_km
FROM cities c1
CROSS JOIN LATERAL (
  SELECT name, geom
  FROM cities c2
  WHERE c2.id != c1.id 
    AND c2.continent = c1.continent
  ORDER BY c1.geom <-> c2.geom
  LIMIT 1
) c2
ORDER BY c1.continent, distance_km;

Advanced Nearest Neighbor Analysis

-- Find nearest large city (>10M population) to each city
SELECT 
  c1.name AS city,
  c1.pop_max AS city_population,
  c2.name AS nearest_megacity,
  c2.pop_max AS megacity_population,
  ROUND(ST_Distance(c1.geom::geography, c2.geom::geography) / 1000, 0) AS distance_km
FROM cities c1
CROSS JOIN LATERAL (
  SELECT name, pop_max, geom
  FROM cities c2
  WHERE c2.id != c1.id 
    AND c2.pop_max > 10000000
  ORDER BY c1.geom <-> c2.geom
  LIMIT 1
) c2
WHERE c1.pop_max <= 10000000
ORDER BY distance_km;

-- Analyze nearest neighbor distances by continent
WITH nearest_neighbors AS (
  SELECT 
    c1.continent,
    c1.name,
    ST_Distance(c1.geom::geography, c2.geom::geography) / 1000 AS nn_distance_km
  FROM cities c1
  CROSS JOIN LATERAL (
    SELECT geom
    FROM cities c2
    WHERE c2.id != c1.id
    ORDER BY c1.geom <-> c2.geom
    LIMIT 1
  ) c2
)
SELECT 
  continent,
  COUNT(*) AS city_count,
  ROUND(AVG(nn_distance_km), 0) AS avg_nn_distance_km,
  ROUND(MIN(nn_distance_km), 0) AS min_nn_distance_km,
  ROUND(MAX(nn_distance_km), 0) AS max_nn_distance_km,
  ROUND(STDDEV(nn_distance_km), 0) AS stddev_nn_distance_km
FROM nearest_neighbors
GROUP BY continent
ORDER BY avg_nn_distance_km;

Combining Clustering and Nearest Neighbors

Combine clustering with nearest neighbor analysis for comprehensive spatial insights:

-- Assign service hubs and find nearest cities
WITH cluster_hubs AS (
  -- Select the largest city in each K-means cluster as a hub
  SELECT DISTINCT ON (cluster_id) 
    cluster_id,
    id AS hub_id,
    name AS hub_name,
    geom AS hub_geom,
    pop_max AS hub_population
  FROM city_clusters_kmeans 
  ORDER BY cluster_id, pop_max DESC
),
city_hub_assignments AS (
  SELECT 
    c.id,
    c.name AS city_name,
    c.pop_max AS city_population,
    h.hub_name,
    h.hub_population,
    ST_Distance(c.geom::geography, h.hub_geom::geography) / 1000 AS distance_to_hub_km
  FROM cities c
  CROSS JOIN LATERAL (
    SELECT hub_name, hub_population, hub_geom
    FROM cluster_hubs h
    ORDER BY c.geom <-> h.hub_geom
    LIMIT 1
  ) h
  WHERE c.name != h.hub_name
)
SELECT 
  hub_name,
  hub_population,
  COUNT(*) AS cities_served,
  ROUND(AVG(distance_to_hub_km), 0) AS avg_distance_km,
  ROUND(MAX(distance_to_hub_km), 0) AS max_distance_km,
  SUM(city_population) AS total_population_served
FROM city_hub_assignments
GROUP BY hub_name, hub_population
ORDER BY total_population_served DESC;

Performance Optimization

Technique Description When to Use
Spatial Indexes GIST indexes enable KNN searches Always for nearest neighbor queries
LATERAL Joins Efficient for nearest neighbor queries When finding k-nearest for each point
Appropriate Parameters Choose clustering parameters based on data scale Test different values for optimal results
Batch Processing Process large datasets in chunks Very large point datasets
Geography vs Geometry Use geography for accurate global distances Global or large-scale analysis

Real-World Applications

Supply Chain Optimization

-- Optimize distribution centers using K-means
WITH distribution_centers AS (
  SELECT 
    cluster_id,
    ST_Centroid(ST_Collect(geom)) AS optimal_location,
    COUNT(*) AS cities_served,
    SUM(pop_max) AS total_market_size
  FROM city_clusters_kmeans
  GROUP BY cluster_id
)
SELECT 
  cluster_id,
  ST_AsText(optimal_location) AS recommended_location,
  cities_served,
  total_market_size,
  -- Find nearest actual city to optimal location
  (SELECT name 
   FROM cities 
   ORDER BY optimal_location <-> geom 
   LIMIT 1) AS nearest_city
FROM distribution_centers
ORDER BY total_market_size DESC;

Emergency Response Planning

-- Identify cities that are isolated (far from neighbors)
WITH isolation_analysis AS (
  SELECT 
    c1.name,
    c1.country,
    c1.pop_max,
    ST_Distance(c1.geom::geography, c2.geom::geography) / 1000 AS nearest_neighbor_km
  FROM cities c1
  CROSS JOIN LATERAL (
    SELECT geom
    FROM cities c2
    WHERE c2.id != c1.id
    ORDER BY c1.geom <-> c2.geom
    LIMIT 1
  ) c2
)
SELECT 
  name,
  country,
  pop_max,
  ROUND(nearest_neighbor_km, 0) AS isolation_distance_km,
  CASE 
    WHEN nearest_neighbor_km > 2000 THEN 'Highly Isolated'
    WHEN nearest_neighbor_km > 1000 THEN 'Moderately Isolated'
    ELSE 'Well Connected'
  END AS isolation_level
FROM isolation_analysis
ORDER BY nearest_neighbor_km DESC;

Best Practices

  1. Choose appropriate clustering methods: K-means for equal groups, DBSCAN for natural clusters
  2. Test different parameters: Experiment with cluster counts and distance thresholds
  3. Use spatial indexes: Essential for efficient nearest neighbor searches
  4. Consider data scale: Adjust parameters based on geographic extent
  5. Validate results: Check clustering results against domain knowledge
  6. Handle outliers: DBSCAN identifies outliers - decide how to treat them
  7. Document methodology: Record parameters and rationale for reproducibility