This is part 1 of the 2 part course from CDRC on the Internet User Classification (IUC) and K-Means Clustering. The video in this part introduces the IUC data set, and the practical session shows you how to work with the IUC dataset in R. If you are completely new to RStudio, please check out our Short Course on Using R as a GIS.
After completing this material, you will:
Our first step is to download the IUC dataset:
Open a web browser and go to https://data.cdrc.ac.uk
Register if you need to, or if you are already registered, make sure you are logged in.
Search for IUC
Open the Internet User Classification page.
iuc2018.csv
file to your working directory.You can download the shapefile with the data already joined to the LSOA boundaries, but this is the national data set and is quite large (75MB). R will work with this, but might be a bit slow. The steps below will only get the shapefile for Liverpool, which will be a much smaller file.
Open the file in Excel - what data do we have?
Check out the User Guide if you want to.
Start a new Script in RStudio.
Set your working directory.
Use this code to read in the file:
iuc <- read.csv("iuc2018.csv")
head()
to check what the data are:head(iuc)
SHP_ID LSOA11_CD LSOA11_NM GRP_CD GRP_LABEL
1 1 E01020179 South Hams 012C 5 e-Rational Utilitarians
2 2 E01033289 Cornwall 007E 9 Settled Offline Communities
3 3 W01000189 Conwy 015F 5 e-Rational Utilitarians
4 4 W01001022 Bridgend 014B 7 Passive and Uncommitted Users
5 5 W01000532 Ceredigion 007B 9 Settled Offline Communities
6 6 E01018888 Cornwall 071G 9 Settled Offline Communities
Is this the data we expect to see?
Use View()
to look at the data.
Use str()
to see whether they are character or numeric variables.
str(iuc)
'data.frame': 41729 obs. of 5 variables:
$ SHP_ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ LSOA11_CD: chr "E01020179" "E01033289" "W01000189" "W01001022" ...
$ LSOA11_NM: chr "South Hams 012C" "Cornwall 007E" "Conwy 015F" "Bridgend 014B" ...
$ GRP_CD : int 5 9 5 7 9 9 9 9 5 6 ...
$ GRP_LABEL: chr "e-Rational Utilitarians" "Settled Offline Communities" "e-Rational Utilitarians" "Passive and Uncommitted Users" ...
There will be character chr
, integer int
and numeric num
values in this data frame. Make sure you can identify which is which, and that you know what the differences are.
hist()
).To create any maps, we need some spatial data.
BoundaryData.zip
to download the files.Extract the files, and move all the files starting with the name england_lsoa_2011
to your working folder.
We will also need some spatial libraries:
#load libraries
library(sf)
library(tmap)
Read in the spatial data we downloaded from Edina:
#read in shapefile
LSOA <- st_read("england_lsoa_2011.shp")
head()
, class()
and str()
.Let’s do a quick map:
qtm(LSOA)
Next step is to join the attribute data (iuc
) to the spatial data (LSOA
).
Use head()
to check which columns we are using for the join:
#check which columns we are joining
head(iuc)
head(LSOA)
#join attribute data to LSOA
LSOA <- merge(LSOA, iuc, by.x="code", by.y="LSOA11_CD")
#check output
head(LSOA)
Simple feature collection with 6 features and 7 fields
geometry type: POLYGON
dimension: XY
bbox: xmin: 334715 ymin: 385417 xmax: 339020.8 ymax: 390548
projected CRS: OSGB 1936 / British National Grid
code label name SHP_ID LSOA11_NM
1 E01006512 E08000012E02001377E01006512 Liverpool 031A 26586 Liverpool 031A
2 E01006513 E08000012E02006932E01006513 Liverpool 060A 24660 Liverpool 060A
3 E01006514 E08000012E02001383E01006514 Liverpool 037A 27675 Liverpool 037A
4 E01006515 E08000012E02001383E01006515 Liverpool 037B 26856 Liverpool 037B
5 E01006518 E08000012E02001390E01006518 Liverpool 044A 28180 Liverpool 044A
6 E01006519 E08000012E02001402E01006519 Liverpool 056A 27474 Liverpool 056A
GRP_CD GRP_LABEL geometry
1 1 e-Cultural Creators POLYGON ((336203 390010, 33...
2 1 e-Cultural Creators POLYGON ((335402.8 390317.5...
3 1 e-Cultural Creators POLYGON ((335651.3 389926.8...
4 10 e-Withdrawn POLYGON ((335186 389604, 33...
5 10 e-Withdrawn POLYGON ((335537.2 389034.5...
6 3 e-Veterans POLYGON ((338014.6 386447.2...
Finally, we can plot the maps quickly using qtm()
from the tmap
library:
#ahah index
qtm(LSOA, "GRP_CD")
This works well. However we don’t get many options with this. We can use a different function tm_shape()
, which will give us more options.
tm_shape(LSOA) +
tm_polygons("GRP_CD")
tm_shape(LSOA) +
tm_polygons("GRP_CD", palette = "Set3", n = 10) +
tm_layout(legend.title.size = 0.8)
This allows us to change the title, colours and legend title size. We are now using a qualitative palette as these data have no inherent order (i.e. group 1 is not more or less than group 2).
We can customise the colours. Try running this code to see the different palette options:
library(RColorBrewer)
display.brewer.all()
This webpage also has some more information https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html
We can also specify the colours ourselves, for example using the colours from the CDRC Maps website:
my_colours <- c("#ea4d78","#f36d5a","#e4a5d0","#ffd39b","#a5cfbc","#d2d1ab","#79cdcd","#dd7cdc","#808fee","#8470ff")
tm_shape(LSOA) +
tm_polygons("GRP_CD", palette = my_colours, n = 10) +
tm_layout(legend.title.size = 0.8)
Note the order of the colours is important. Colour for group 1 is first, group 2 second and so on. I got the colours from the Maps site, using https://imagecolorpicker.com/ to identify the codes for each colour.
If you use the cluster name (rather than group number) it will assign the colours alphabetically - so make sure you check on your map which colour is assigned to which group!
To map this correctly, you need to re-order the colours so they match the correct cluster group in alphabetical order. See part1-example-script.R for an example of this.
By now you should have a R script with all the steps to:
Try creating a map for a different area of the UK, using the script you have.
Loops are a great computer programming concept we can use. Working from the example above, and the loop code from the Short Course on Using R as a GIS, can you write a loop to create a map for each domain of the AHAH index?
Mini R Course loop code:
#set which variables will be mapped
mapvariables <- c("AllUsualResidents", "Age00to04", "Age05to07")
#loop through for each map
for (i in 1:length(mapvariables)) {
#setup map
m <- tm_shape(LSOA) +
#set variable, colours and classes
tm_polygons(mapvariables[i], palette = "Greens", style = "equal") +
#set scale bar
tm_scale_bar(width = 0.22, position = c(0.05, 0.18)) +
#set compass
tm_compass(position = c(0.3, 0.07)) +
#set layout
tm_layout(frame = F, title = "Liverpool", title.size = 2,
title.position = c(0.7, "top"))
#save map
tmap_save(m, filename = paste0("map-",mapvariables[i],".png"))
#end loop
}
Now we have these cluster classifications, how can we link them to people?
Try joining on population data from the Short Course on Using R as a GIS on to our LSOA variable which already has the Internet User Classification (IUC) data in it.
You can then get the total population by cluster:
aggregate(LSOA$AllUsualResidents, by=list(LSOA$GRP_CD), FUN=sum)
This practical was written using R 4.0.2 (2020-06-22) and RStudio 1.3.1093 by Dr. Nick Bearman (nick@geospatialtrainingsolutions.co.uk).
This work is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/. This version was created on 18 November 2020.