Spatial clustering and nearest neighbor analysis are powerful techniques for discovering patterns in geographic data. Whether you're identifying urban agglomerations, optimizing service locations, or analyzing point distributions, PostGIS provides sophisticated tools for spatial grouping and proximity searches. This comprehensive guide covers clustering algorithms and efficient nearest neighbor queries with real-world applications.
Introduction to Spatial Clustering and Nearest Neighbors
Spatial clustering and nearest neighbor analysis help reveal hidden patterns in geographic data by:
- Grouping similar or nearby features together
- Identifying spatial hotspots and patterns
- Finding optimal locations for services
- Analyzing spatial relationships and dependencies
- Supporting decision-making in urban planning and resource allocation
Technique | Function | Purpose | Best Use Case |
---|---|---|---|
K-Means Clustering | ST_ClusterKMeans | Partition into k equal-sized groups | Regional planning, service zones |
DBSCAN Clustering | ST_ClusterDBSCAN | Density-based natural grouping | Urban agglomerations, hotspots |
Distance Clustering | ST_ClusterWithin | Group by distance threshold | Service catchments, proximity zones |
Nearest Neighbors | KNN with <-> |
Find closest features efficiently | Location optimization, routing |
Setting Up Sample Data
Let's create a comprehensive dataset of world cities to demonstrate clustering and nearest neighbor techniques:
-- Create cities table with realistic global data
CREATE TABLE cities (
id SERIAL PRIMARY KEY,
name TEXT,
country TEXT,
continent TEXT,
pop_max INTEGER,
pop_min INTEGER,
latitude NUMERIC,
longitude NUMERIC,
geom GEOMETRY(Point, 4326)
);
-- Insert sample cities from different continents
INSERT INTO cities (name, country, continent, pop_max, pop_min, latitude, longitude, geom) VALUES
-- Asia
('Delhi', 'India', 'Asia', 32900000, 16800000, 28.6139, 77.2090, ST_Point(77.2090, 28.6139)),
('Mumbai', 'India', 'Asia', 24400000, 12500000, 19.0760, 72.8777, ST_Point(72.8777, 19.0760)),
('Tokyo', 'Japan', 'Asia', 37400000, 13500000, 35.6762, 139.6503, ST_Point(139.6503, 35.6762)),
('Shanghai', 'China', 'Asia', 28500000, 24200000, 31.2304, 121.4737, ST_Point(121.4737, 31.2304)),
('Beijing', 'China', 'Asia', 21500000, 11100000, 39.9042, 116.4074, ST_Point(116.4074, 39.9042)),
-- Europe
('London', 'United Kingdom', 'Europe', 9600000, 8900000, 51.5074, -0.1278, ST_Point(-0.1278, 51.5074)),
('Paris', 'France', 'Europe', 11000000, 2100000, 48.8566, 2.3522, ST_Point(2.3522, 48.8566)),
('Berlin', 'Germany', 'Europe', 3700000, 3500000, 52.5200, 13.4050, ST_Point(13.4050, 52.5200)),
('Madrid', 'Spain', 'Europe', 6600000, 3200000, 40.4168, -3.7038, ST_Point(-3.7038, 40.4168)),
('Rome', 'Italy', 'Europe', 4300000, 2900000, 41.9028, 12.4964, ST_Point(12.4964, 41.9028)),
-- North America
('New York', 'United States', 'North America', 18800000, 8400000, 40.7128, -74.0060, ST_Point(-74.0060, 40.7128)),
('Los Angeles', 'United States', 'North America', 12400000, 4000000, 34.0522, -118.2437, ST_Point(-118.2437, 34.0522)),
('Chicago', 'United States', 'North America', 8600000, 2700000, 41.8781, -87.6298, ST_Point(-87.6298, 41.8781)),
('Toronto', 'Canada', 'North America', 6400000, 2800000, 43.6532, -79.3832, ST_Point(-79.3832, 43.6532)),
('Mexico City', 'Mexico', 'North America', 21600000, 9200000, 19.4326, -99.1332, ST_Point(-99.1332, 19.4326)),
-- South America
('São Paulo', 'Brazil', 'South America', 22400000, 12300000, -23.5505, -46.6333, ST_Point(-46.6333, -23.5505)),
('Buenos Aires', 'Argentina', 'South America', 15000000, 3000000, -34.6118, -58.3960, ST_Point(-58.3960, -34.6118)),
('Lima', 'Peru', 'South America', 10700000, 9700000, -12.0464, -77.0428, ST_Point(-77.0428, -12.0464)),
-- Africa
('Lagos', 'Nigeria', 'Africa', 15400000, 13900000, 6.5244, 3.3792, ST_Point(3.3792, 6.5244)),
('Cairo', 'Egypt', 'Africa', 20900000, 10200000, 30.0444, 31.2357, ST_Point(31.2357, 30.0444)),
('Johannesburg', 'South Africa', 'Africa', 10000000, 4400000, -26.2041, 28.0473, ST_Point(28.0473, -26.2041)),
-- Oceania
('Sydney', 'Australia', 'Oceania', 5300000, 4600000, -33.8688, 151.2093, ST_Point(151.2093, -33.8688)),
('Melbourne', 'Australia', 'Oceania', 5100000, 4900000, -37.8136, 144.9631, ST_Point(144.9631, -37.8136));
-- Create spatial index
CREATE INDEX idx_cities_geom ON cities USING GIST (geom);
-- Create additional indexes for performance
CREATE INDEX idx_cities_continent ON cities (continent);
CREATE INDEX idx_cities_pop_max ON cities (pop_max);
K-Means Clustering
ST_ClusterKMeans
partitions points into a specified number of clusters by minimizing within-cluster variance. It's ideal for creating evenly-sized groups.
Basic K-Means Clustering
-- Partition cities into 5 global regions
SELECT
id,
name,
country,
continent,
pop_max,
ST_ClusterKMeans(geom, 5) OVER () AS cluster_id
FROM cities
ORDER BY cluster_id, pop_max DESC;
-- Create a table with cluster assignments
CREATE TABLE city_clusters_kmeans AS
SELECT
c.*,
ST_ClusterKMeans(c.geom, 5) OVER () AS cluster_id
FROM cities c;
Analyzing K-Means Results
-- Analyze cluster characteristics
SELECT
cluster_id,
COUNT(*) AS city_count,
array_agg(DISTINCT continent) AS continents,
AVG(pop_max) AS avg_population,
MIN(pop_max) AS min_population,
MAX(pop_max) AS max_population,
ST_AsText(ST_Centroid(ST_Collect(geom))) AS cluster_center
FROM city_clusters_kmeans
GROUP BY cluster_id
ORDER BY cluster_id;
-- Find cluster centroids and create service regions
WITH cluster_centroids AS (
SELECT
cluster_id,
ST_Centroid(ST_Collect(geom)) AS centroid,
COUNT(*) AS city_count,
SUM(pop_max) AS total_population
FROM city_clusters_kmeans
GROUP BY cluster_id
)
SELECT
cluster_id,
ST_AsText(centroid) AS centroid_location,
city_count,
total_population,
-- Create 2000km service area around centroid
ST_Area(ST_Buffer(centroid::geography, 2000000)) / 1000000 AS service_area_km2
FROM cluster_centroids
ORDER BY total_population DESC;
Optimizing K-Means Cluster Count
-- Test different cluster counts to find optimal k
WITH cluster_analysis AS (
SELECT
k,
cluster_id,
COUNT(*) AS cluster_size,
ST_Collect(geom) AS cluster_geom
FROM (
SELECT
2 AS k, ST_ClusterKMeans(geom, 2) OVER () AS cluster_id, geom FROM cities
UNION ALL
SELECT
3 AS k, ST_ClusterKMeans(geom, 3) OVER () AS cluster_id, geom FROM cities
UNION ALL
SELECT
4 AS k, ST_ClusterKMeans(geom, 4) OVER () AS cluster_id, geom FROM cities
UNION ALL
SELECT
5 AS k, ST_ClusterKMeans(geom, 5) OVER () AS cluster_id, geom FROM cities
UNION ALL
SELECT
6 AS k, ST_ClusterKMeans(geom, 6) OVER () AS cluster_id, geom FROM cities
) clustered
GROUP BY k, cluster_id
),
cluster_metrics AS (
SELECT
k,
AVG(cluster_size) AS avg_cluster_size,
STDDEV(cluster_size) AS cluster_size_stddev,
MIN(cluster_size) AS min_cluster_size,
MAX(cluster_size) AS max_cluster_size
FROM cluster_analysis
GROUP BY k
)
SELECT
k,
ROUND(avg_cluster_size, 1) AS avg_size,
ROUND(cluster_size_stddev, 1) AS size_stddev,
min_cluster_size,
max_cluster_size,
ROUND(cluster_size_stddev / avg_cluster_size, 2) AS coefficient_of_variation
FROM cluster_metrics
ORDER BY k;
DBSCAN Clustering
ST_ClusterDBSCAN
groups points based on density, identifying natural clusters and outliers. It's excellent for discovering urban agglomerations and hotspots.
Basic DBSCAN Clustering
-- Identify urban agglomerations using DBSCAN
-- eps: 10 degrees (~1100km), minpoints: 2
SELECT
id,
name,
country,
continent,
pop_max,
ST_ClusterDBSCAN(geom, eps := 10.0, minpoints := 2) OVER () AS dbscan_cluster
FROM cities
ORDER BY dbscan_cluster NULLS LAST, pop_max DESC;
-- Create table with DBSCAN results
CREATE TABLE city_clusters_dbscan AS
SELECT
c.*,
ST_ClusterDBSCAN(c.geom, eps := 10.0, minpoints := 2) OVER () AS dbscan_cluster
FROM cities c;
Analyzing DBSCAN Results
-- Analyze DBSCAN clusters and outliers
SELECT
CASE
WHEN dbscan_cluster IS NULL THEN 'Outliers'
ELSE 'Cluster ' || dbscan_cluster
END AS cluster_type,
COUNT(*) AS city_count,
array_agg(name ORDER BY pop_max DESC) AS cities,
array_agg(DISTINCT continent) AS continents,
AVG(pop_max) AS avg_population,
SUM(pop_max) AS total_population
FROM city_clusters_dbscan
GROUP BY dbscan_cluster
ORDER BY dbscan_cluster NULLS LAST;
-- Find dense urban regions
WITH dense_clusters AS (
SELECT
dbscan_cluster,
COUNT(*) AS city_count,
SUM(pop_max) AS total_population,
ST_ConvexHull(ST_Collect(geom)) AS cluster_boundary
FROM city_clusters_dbscan
WHERE dbscan_cluster IS NOT NULL
GROUP BY dbscan_cluster
HAVING COUNT(*) >= 3
)
SELECT
dbscan_cluster,
city_count,
total_population,
ROUND(ST_Area(cluster_boundary::geography) / 1000000, 0) AS area_km2,
ROUND(total_population / (ST_Area(cluster_boundary::geography) / 1000000), 0) AS population_density_per_km2
FROM dense_clusters
ORDER BY population_density_per_km2 DESC;
Adaptive DBSCAN Parameters
-- Test different DBSCAN parameters
WITH dbscan_tests AS (
SELECT
'eps=5, minpts=2' AS params,
ST_ClusterDBSCAN(geom, eps := 5.0, minpoints := 2) OVER () AS cluster_id,
name, continent
FROM cities
UNION ALL
SELECT
'eps=10, minpts=2' AS params,
ST_ClusterDBSCAN(geom, eps := 10.0, minpoints := 2) OVER () AS cluster_id,
name, continent
FROM cities
UNION ALL
SELECT
'eps=15, minpts=3' AS params,
ST_ClusterDBSCAN(geom, eps := 15.0, minpoints := 3) OVER () AS cluster_id,
name, continent
FROM cities
)
SELECT
params,
COUNT(CASE WHEN cluster_id IS NOT NULL THEN 1 END) AS clustered_cities,
COUNT(CASE WHEN cluster_id IS NULL THEN 1 END) AS outlier_cities,
COUNT(DISTINCT cluster_id) AS num_clusters
FROM dbscan_tests
GROUP BY params
ORDER BY params;
Distance-Based Clustering
ST_ClusterWithin
groups points that lie within a specified distance of each other, creating transitive clusters.
-- Group cities within 1000km of each other
SELECT
id,
name,
country,
continent,
ST_ClusterWithin(geom, 10.0) OVER () AS proximity_cluster
FROM cities
ORDER BY proximity_cluster, pop_max DESC;
-- Analyze proximity clusters
WITH proximity_clusters AS (
SELECT
c.*,
ST_ClusterWithin(c.geom, 10.0) OVER () AS proximity_cluster
FROM cities c
)
SELECT
proximity_cluster,
COUNT(*) AS city_count,
array_agg(name ORDER BY pop_max DESC) AS cities_in_cluster,
array_agg(DISTINCT continent) AS continents,
SUM(pop_max) AS total_population,
ST_AsText(ST_Centroid(ST_Collect(geom))) AS cluster_center
FROM proximity_clusters
GROUP BY proximity_cluster
HAVING COUNT(*) > 1
ORDER BY total_population DESC;
Nearest Neighbor Analysis
PostGIS supports efficient K-Nearest Neighbor (KNN) searches using the <->
operator with GiST indexes.
Basic Nearest Neighbor Queries
-- Find 3 nearest cities to each city
SELECT
c1.name AS city,
c1.country,
c2.name AS nearest_neighbor,
c2.country AS neighbor_country,
ROUND(ST_Distance(c1.geom::geography, c2.geom::geography) / 1000, 0) AS distance_km
FROM cities c1
CROSS JOIN LATERAL (
SELECT name, country, geom
FROM cities c2
WHERE c2.id != c1.id
ORDER BY c1.geom <-> c2.geom
LIMIT 3
) c2
ORDER BY c1.name, distance_km;
-- Find nearest neighbor by continent
SELECT
c1.name AS city,
c1.continent,
c2.name AS nearest_same_continent,
ROUND(ST_Distance(c1.geom::geography, c2.geom::geography) / 1000, 0) AS distance_km
FROM cities c1
CROSS JOIN LATERAL (
SELECT name, geom
FROM cities c2
WHERE c2.id != c1.id
AND c2.continent = c1.continent
ORDER BY c1.geom <-> c2.geom
LIMIT 1
) c2
ORDER BY c1.continent, distance_km;
Advanced Nearest Neighbor Analysis
-- Find nearest large city (>10M population) to each city
SELECT
c1.name AS city,
c1.pop_max AS city_population,
c2.name AS nearest_megacity,
c2.pop_max AS megacity_population,
ROUND(ST_Distance(c1.geom::geography, c2.geom::geography) / 1000, 0) AS distance_km
FROM cities c1
CROSS JOIN LATERAL (
SELECT name, pop_max, geom
FROM cities c2
WHERE c2.id != c1.id
AND c2.pop_max > 10000000
ORDER BY c1.geom <-> c2.geom
LIMIT 1
) c2
WHERE c1.pop_max <= 10000000
ORDER BY distance_km;
-- Analyze nearest neighbor distances by continent
WITH nearest_neighbors AS (
SELECT
c1.continent,
c1.name,
ST_Distance(c1.geom::geography, c2.geom::geography) / 1000 AS nn_distance_km
FROM cities c1
CROSS JOIN LATERAL (
SELECT geom
FROM cities c2
WHERE c2.id != c1.id
ORDER BY c1.geom <-> c2.geom
LIMIT 1
) c2
)
SELECT
continent,
COUNT(*) AS city_count,
ROUND(AVG(nn_distance_km), 0) AS avg_nn_distance_km,
ROUND(MIN(nn_distance_km), 0) AS min_nn_distance_km,
ROUND(MAX(nn_distance_km), 0) AS max_nn_distance_km,
ROUND(STDDEV(nn_distance_km), 0) AS stddev_nn_distance_km
FROM nearest_neighbors
GROUP BY continent
ORDER BY avg_nn_distance_km;
Combining Clustering and Nearest Neighbors
Combine clustering with nearest neighbor analysis for comprehensive spatial insights:
-- Assign service hubs and find nearest cities
WITH cluster_hubs AS (
-- Select the largest city in each K-means cluster as a hub
SELECT DISTINCT ON (cluster_id)
cluster_id,
id AS hub_id,
name AS hub_name,
geom AS hub_geom,
pop_max AS hub_population
FROM city_clusters_kmeans
ORDER BY cluster_id, pop_max DESC
),
city_hub_assignments AS (
SELECT
c.id,
c.name AS city_name,
c.pop_max AS city_population,
h.hub_name,
h.hub_population,
ST_Distance(c.geom::geography, h.hub_geom::geography) / 1000 AS distance_to_hub_km
FROM cities c
CROSS JOIN LATERAL (
SELECT hub_name, hub_population, hub_geom
FROM cluster_hubs h
ORDER BY c.geom <-> h.hub_geom
LIMIT 1
) h
WHERE c.name != h.hub_name
)
SELECT
hub_name,
hub_population,
COUNT(*) AS cities_served,
ROUND(AVG(distance_to_hub_km), 0) AS avg_distance_km,
ROUND(MAX(distance_to_hub_km), 0) AS max_distance_km,
SUM(city_population) AS total_population_served
FROM city_hub_assignments
GROUP BY hub_name, hub_population
ORDER BY total_population_served DESC;
Performance Optimization
Technique | Description | When to Use |
---|---|---|
Spatial Indexes | GIST indexes enable KNN searches | Always for nearest neighbor queries |
LATERAL Joins | Efficient for nearest neighbor queries | When finding k-nearest for each point |
Appropriate Parameters | Choose clustering parameters based on data scale | Test different values for optimal results |
Batch Processing | Process large datasets in chunks | Very large point datasets |
Geography vs Geometry | Use geography for accurate global distances | Global or large-scale analysis |
Real-World Applications
Supply Chain Optimization
-- Optimize distribution centers using K-means
WITH distribution_centers AS (
SELECT
cluster_id,
ST_Centroid(ST_Collect(geom)) AS optimal_location,
COUNT(*) AS cities_served,
SUM(pop_max) AS total_market_size
FROM city_clusters_kmeans
GROUP BY cluster_id
)
SELECT
cluster_id,
ST_AsText(optimal_location) AS recommended_location,
cities_served,
total_market_size,
-- Find nearest actual city to optimal location
(SELECT name
FROM cities
ORDER BY optimal_location <-> geom
LIMIT 1) AS nearest_city
FROM distribution_centers
ORDER BY total_market_size DESC;
Emergency Response Planning
-- Identify cities that are isolated (far from neighbors)
WITH isolation_analysis AS (
SELECT
c1.name,
c1.country,
c1.pop_max,
ST_Distance(c1.geom::geography, c2.geom::geography) / 1000 AS nearest_neighbor_km
FROM cities c1
CROSS JOIN LATERAL (
SELECT geom
FROM cities c2
WHERE c2.id != c1.id
ORDER BY c1.geom <-> c2.geom
LIMIT 1
) c2
)
SELECT
name,
country,
pop_max,
ROUND(nearest_neighbor_km, 0) AS isolation_distance_km,
CASE
WHEN nearest_neighbor_km > 2000 THEN 'Highly Isolated'
WHEN nearest_neighbor_km > 1000 THEN 'Moderately Isolated'
ELSE 'Well Connected'
END AS isolation_level
FROM isolation_analysis
ORDER BY nearest_neighbor_km DESC;
Best Practices
- Choose appropriate clustering methods: K-means for equal groups, DBSCAN for natural clusters
- Test different parameters: Experiment with cluster counts and distance thresholds
- Use spatial indexes: Essential for efficient nearest neighbor searches
- Consider data scale: Adjust parameters based on geographic extent
- Validate results: Check clustering results against domain knowledge
- Handle outliers: DBSCAN identifies outliers - decide how to treat them
- Document methodology: Record parameters and rationale for reproducibility