Creating an Open Geodemographic Classification Using K-means Clustering in Python

Geodemographic classifications turn complex population data into clear, actionable insights for policy, research, and commercial use. In this hands-on tutorial, we demonstrate how to build a bespoke geodemographic classification from UK census data using the Python data science stack. Participants will be guided through sourcing and preparing open data, selecting relevant variables, and clustering communities in ways that can be tailored to specific needs. We also explore how recent advances in machine learning can reduce the technical burden of creating geodemographics by using a large language model to generate cluster names and descriptions. By the end of the session, you will have the tools and workflow needed to design your own open, reproducible geodemographic classifications.

The tutorial is free, but users will need to register on this website to access the materials.

This tutorial contains the full workflow for producing a geodemographic classification from scratch in Python using K-means clustering. The creatinggeodem.ipynb notebook contains the full code and explanatory text for the workshop. It can be followed from the website link or run interactively through the linked GitHub repository. The key steps covered in the notebook are:

Data Access and Processing:

Access UK Census data and process using Pandas.
Select a specific region of interest (e.g., Liverpool City Region, Greater Manchester, Greater London).
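
The region selection step can be sketched with pandas as follows. This is a minimal illustration, not the notebook's code: the table is a small made-up stand-in for a census extract, and the column names (`area_code`, `local_authority`, `total_residents`, `aged_65_plus`) are assumptions chosen for the example.

```python
import pandas as pd

# Toy stand-in for a census table; column names and values are made up.
census = pd.DataFrame(
    {
        "area_code": ["A1", "A2", "A3", "A4", "A5"],
        "local_authority": ["Liverpool", "Wirral", "Manchester", "Sefton", "Leeds"],
        "total_residents": [1500, 1620, 1480, 1710, 1390],
        "aged_65_plus": [210, 340, 150, 390, 180],
    }
)

# The six local authorities that make up the Liverpool City Region.
LIVERPOOL_CITY_REGION = [
    "Halton", "Knowsley", "Liverpool", "Sefton", "St Helens", "Wirral",
]

# Keep only the rows that fall inside the region of interest.
region = census[census["local_authority"].isin(LIVERPOOL_CITY_REGION)].copy()

# Convert raw counts to percentages so areas of different size are comparable.
region["pct_aged_65_plus"] = 100 * region["aged_65_plus"] / region["total_residents"]

print(region[["area_code", "local_authority", "pct_aged_65_plus"]])
```

Working with percentages rather than counts is a common choice at this stage, because small-area populations vary and raw counts would otherwise dominate the clustering.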

Census Data Analysis and Variable selection:

Select relevant Census variables for clustering.
Standardise variables.
Perform correlation & variance analysis to identify potentially redundant variables.
Consider alternative variable selection methods (e.g., PCA, Autoencoders).
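
As a rough sketch of the standardisation and correlation steps listed above, the example below z-scores a handful of synthetic variables and flags highly correlated pairs. The variable names, the synthetic data, and the 0.8 threshold are all assumptions made for illustration; the notebook may use different variables and criteria.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_areas = 200

# Synthetic percentages; variable names are illustrative only.
base = rng.normal(50, 10, n_areas)
variables = pd.DataFrame(
    {
        "pct_owner_occupied": base,
        # Built to be almost a mirror of owner occupation, hence redundant.
        "pct_renting": 100 - base + rng.normal(0, 1, n_areas),
        "pct_aged_65_plus": rng.normal(18, 5, n_areas),
        "pct_no_car": rng.normal(25, 8, n_areas),
    }
)

# Z-score standardisation: mean 0 and standard deviation 1 for each variable.
standardised = (variables - variables.mean()) / variables.std(ddof=0)

# Absolute pairwise correlations, upper triangle only so each pair appears once.
corr = standardised.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

THRESHOLD = 0.8  # an arbitrary cut-off chosen for this sketch
pairs = upper.stack()
redundant_pairs = pairs[pairs > THRESHOLD]
print(redundant_pairs)
```

A flagged pair is only a candidate for removal. Which variable to drop, if any, remains a judgement about what the classification is meant to capture.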

Clustering:

Determine optimal number of clusters using Clustergrams.
Apply K-means clustering to classify areas based on selected variables.
Perform top-down hierarchical clustering to divide clusters into subgroups.
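
One way to picture the two-level clustering described above is to run K-means once for the top-level groups and then again inside each group. The sketch below does this with scikit-learn's `KMeans` on synthetic data. The number of groups, the two subgroups per group, and the labelling scheme are assumptions for illustration, and the clustergram step used to choose the number of clusters is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Synthetic standardised data: three well-separated groups of 100 areas each.
centres = np.array([[0, 0, 0], [6, 6, 0], [0, 6, 6]])
X = np.vstack([c + rng.normal(0, 1, (100, 3)) for c in centres])

# Top level: assign each area to one of k groups.
k = 3  # fixed here; in practice informed by a clustergram
top = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
groups = top.labels_

# Second level: re-cluster the members of each group separately (top-down).
subgroups = np.empty(len(X), dtype=object)
for g in range(k):
    members = groups == g
    sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X[members])
    # Label subgroups "0a", "0b", "1a", ... so the hierarchy stays visible.
    subgroups[members] = [f"{g}{'ab'[s]}" for s in sub.labels_]

print(np.unique(groups, return_counts=True))
print(np.unique(subgroups))
```

Fixing `random_state` keeps the assignment reproducible between runs, which matters when cluster labels are later joined to maps or written to files.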

Analytical Techniques:

Use UMAP (Uniform Manifold Approximation and Projection) to visualise high-dimensional embeddings in 2D.

Visualisation and Communication:

Visualise clusters and subclusters using Kepler.gl for interactive mapping.
Explore cluster characteristics using summary statistics and index scores.
Export results to various formats (GeoPackage, Parquet) for use in GIS software.
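
Index scores are commonly computed as the cluster mean divided by the overall mean, multiplied by 100, so that 100 means "same as average". The sketch below applies that definition to a toy table. The column names and values are invented, the means are unweighted for simplicity, and the notebook's exact calculation may differ.

```python
import pandas as pd

# Toy data: percentages for six areas already assigned to two clusters.
areas = pd.DataFrame(
    {
        "cluster": ["A", "A", "A", "B", "B", "B"],
        "pct_aged_65_plus": [30.0, 28.0, 32.0, 10.0, 12.0, 8.0],
        "pct_no_car": [10.0, 20.0, 15.0, 40.0, 50.0, 45.0],
    }
)

value_cols = ["pct_aged_65_plus", "pct_no_car"]

# Unweighted mean across all areas, and the mean within each cluster.
overall_mean = areas[value_cols].mean()
cluster_means = areas.groupby("cluster")[value_cols].mean()

# Index score: 100 = overall average, 150 = 50% above, 50 = half the average.
index_scores = 100 * cluster_means / overall_mean
print(index_scores.round(0))
```

In this toy table cluster A scores 150 on `pct_aged_65_plus` and 50 on `pct_no_car`, while cluster B shows the reverse pattern.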

Cluster Naming with LLMs:

Use Large Language Models (LLMs) to generate descriptive names and summaries for clusters based on their characteristics.
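
A plausible first step for LLM-based naming is to turn each cluster's most distinctive index scores into a text prompt. The helper below is hypothetical (`build_naming_prompt` is not taken from the notebook) and deliberately stops short of calling any model, since the source does not say which LLM or API is used.

```python
def build_naming_prompt(cluster_id, index_scores, top_n=3):
    """Build a text prompt from a cluster's most distinctive index scores.

    Hypothetical helper for illustration; `index_scores` maps variable
    names to index scores where 100 is the overall average.
    """
    # Rank variables by how far their score sits from the average of 100.
    ranked = sorted(
        index_scores.items(), key=lambda kv: abs(kv[1] - 100), reverse=True
    )
    lines = []
    for name, score in ranked[:top_n]:
        direction = "above" if score > 100 else "below"
        lines.append(f"- {name}: index {score:.0f} ({direction} average)")
    return (
        f"Cluster {cluster_id} has these distinctive characteristics "
        "(index 100 = average):\n"
        + "\n".join(lines)
        + "\nSuggest a short, neutral name and a one-sentence description."
    )


prompt = build_naming_prompt(
    "A",
    {"pct_aged_65_plus": 150, "pct_no_car": 50, "pct_renting": 98},
    top_n=2,
)
print(prompt)
```

The resulting string can be passed to whichever model is available. Generated names are best treated as drafts to review, since they describe people and places.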

Tutorial

Data and Resources

Source Code: GitHub Repository (HTML)

Data: input_data_1.zip (ZIP)
Input data for the workshop.

External Website: Workshop Notebook Webpage (HTML)
Webpage version of the Jupyter notebook.

Data: workshop_slides.pptx (PPTX)
Workshop slides presented at the Spatial Data Science Conference 2025. Includes a brief introduction to geodemographics and some suggested extensions to the tutorial.


Additional Info

Field	Value
Source	Office for National Statistics (ONS), National Records of Scotland (NRS), Northern Ireland Statistics and Research Agency (NISRA), GeoDS
Author	Goodwin, Owen
Maintainer	Owen Goodwin
Last Updated	November 27, 2025, 13:39 (UTC)
Created	September 30, 2025, 15:29 (UTC)