This is part 2 of the 2-part course from CDRC on the Access to Healthy Assets & Hazards (AHAH) dataset and creating Multi-Dimensional Indices. The video in this part introduces the concept of a Multi-Dimensional Index, and the practical session shows you how to create your own MDI.
After completing this material, you will:
We are going to replicate the AHAH index for two domains (Retail and Health).
Just a quick reminder, the overall AHAH index is made up of four different domains:
1. Retail
2. Health Services
3. Air Quality (called Environment in the code below)
4. Natural Environment (called Greenspace in the code below)
Each domain is made up of a number of different indicators:
To create an index, we are going to use the source files for AHAH. These are the distances to the nearest asset or hazard of each type, for each LSOA.
Go to https://data.cdrc.ac.uk/dataset/access-healthy-assets-hazards-ahah
Download the AHAH Inputs/Components file: allvariableslsoawdeciles.csv
Also have a look at the metadata file if you wish.
Start a new Script in RStudio.
Set your working directory.
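For example (the path here is a placeholder, not part of the original material — use the folder where you saved the download):

```r
# set the working directory to wherever you saved the CSV file
# (this path is a placeholder - replace it with your own)
setwd("~/ahah-practical")
```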
Use this code to read in the file:
data <- read.csv("allvariableslsoawdeciles.csv")
Use head() to check what the data are:
head(data)
rownum lsoa11 gpp_dist ed_dist dent_dist pharm_dist gamb_dist ffood_dist
1 1 E01000001 0.7013333 3.447556 0.6404444 0.3862222 0.3902222 0.1697778
2 2 E01000002 0.9379661 3.126102 0.9259322 0.4940678 0.4167797 0.1944068
3 3 E01000003 1.1073529 3.729706 0.4217647 0.2500000 0.3976471 0.2158824
4 4 E01000005 0.6284615 1.946538 0.7998718 0.7305128 0.2803846 0.1493590
5 5 E01000006 0.4471429 5.954286 1.2271429 0.7614286 0.5992857 0.5628571
6 6 E01000007 0.3076316 6.049737 0.8247368 0.5094737 0.3765789 0.3623684
pubs_dist leis_dist blue_dist off_dist tobac_dist green_pas green_act ur
1 0.1806667 0.2140000 2.3871111 0.3808889 1.4608889 0.04622136 0.295 1
2 0.2205085 0.2789831 2.0488136 0.6798305 1.2693220 0.04107206 0.320 1
3 0.1917647 0.3576471 1.2179412 0.4217647 1.3497059 0.05061065 0.200 1
4 0.2280769 0.1370513 1.1812821 0.6170513 0.4865385 0.04428218 0.140 1
5 0.6114286 1.6885714 1.7607143 0.6478571 0.6635714 0.52745888 0.585 1
6 0.6207895 0.9642105 0.8634211 0.7339474 0.7263158 0.54568011 0.380 1
no2_mean pm10_mean so2_mean d_gpp_dist d_ed_dist d_pharm_dist d_dent_dist
1 28.47180 17.45706 1.776443 3 2 1 3
2 27.82823 17.44603 1.735325 5 2 2 4
3 28.47449 17.47987 1.788466 6 2 1 1
4 26.11497 17.33858 1.602624 3 1 5 4
5 21.71050 17.18462 1.494547 1 4 5 6
6 21.55922 17.16954 1.489493 1 4 3 4
d_gamb_dist d_ffood_dist d_pubs_dist d_leis_dist d_blue_dist d_off_dist
1 10 10 10 1 6 10
2 10 10 10 1 5 9
3 10 10 10 1 3 10
4 10 10 10 1 3 10
5 9 8 8 4 5 10
6 10 9 8 2 2 9
d_tobac_dist d_green_pas d_green_act d_no2_mean d_pm10_mean d_so2_mean
1 8 9 3 10 10 10
2 9 10 3 10 10 10
3 8 9 1 10 10 10
4 10 9 1 10 10 9
5 10 3 7 10 10 9
6 10 3 4 10 10 9
Is this the data we expect to see? Use View() to look at the data. Is this what you expect?
There will be character (chr), integer (int) and numeric (num) values in this data frame. Make sure you can identify which is which, and that you know what the differences are.
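One way to check is with str(), which lists the class of each column (assuming the data frame from above is loaded):

```r
# str() reports the type (chr, int, num) of each column in the data frame
str(data)
```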
You could also explore the distributions of the variables (e.g. with hist()). We need some libraries for this work. Remember to install them if you haven’t got them installed already.
#load libraries (and install if needed)
library(nFactors)
library(data.table)
library(dplyr)
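If any of these packages are missing, they can be installed first, for example:

```r
# install the required packages (only needed once per machine)
install.packages(c("nFactors", "data.table", "dplyr"))
```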
When creating an index, we need to rank the data, rather than using the source distances. Additionally, all the data need to point in the same direction (so, for example, low values are ‘good’ and high values are ‘bad’). Use this code to rank the data, and then re-order it where needed.
#the rank code reorders the data. for example:
#have a look at the gamb_dist data
head(data$gamb_dist)
[1] 0.3902222 0.4167797 0.3976471 0.2803846 0.5992857 0.3765789
#rank the data
data$gamb_dist_rank <- rank(data$gamb_dist,ties.method= "min")
#look at the data
head(data$gamb_dist_rank)
[1] 2681 3299 2863 754 7938 2402
In this code we use a shortened version (where we overwrite the data rather than creating a new column each time). This saves us ending up with many many columns.
Sometimes you might see a rank ending in 0.5. This happens when two values are the same, and so would usually share the same rank; R adds the 0.5 so these values can still be put in order.
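As a small illustration (toy values, not from the AHAH data) of how R handles tied values under the two ranking methods:

```r
# toy example: how rank() treats two tied values
x <- c(10, 20, 20, 30)
rank(x)                       # default "average" method: 1.0 2.5 2.5 4.0
rank(x, ties.method = "min")  # tied values share the lowest rank: 1 2 2 4
```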
#retail
data$gamb_dist <- rank(data$gamb_dist,ties.method= "min")
data$gamb_dist <- rank(-data$gamb_dist) # Invert ranking
data$ffood_dist <- rank(data$ffood_dist,ties.method= "min")
data$ffood_dist <- rank(-data$ffood_dist) # Invert ranking
data$pubs_dist <- rank(data$pubs_dist,ties.method= "min")
data$pubs_dist <- rank(-data$pubs_dist) # Invert ranking
data$off_dist <- rank(data$off_dist,ties.method= "min")
data$off_dist <- rank(-data$off_dist) # Invert ranking
data$tobac_dist<- rank(data$tobac_dist,ties.method= "min")
data$tobac_dist <- rank(-data$tobac_dist) # Invert ranking
#health
data$gpp_dist <- rank(data$gpp_dist,ties.method= "min")
data$ed_dist <- rank(data$ed_dist,ties.method= "min")
data$dent_dist <- rank(data$dent_dist,ties.method= "min")
data$pharm_dist <- rank(data$pharm_dist,ties.method= "min")
data$leis_dist <- rank(data$leis_dist,ties.method= "min")
We also need to transform the data. We apply a Rankit rank-based normalisation to allow us to merge the rank based indicators.
To apply this transformation we need to setup a custom function:
#Rankit rank-based normalisation
exp_default <- function(x,y){(x-0.5)/nrow(y)}
We apply this exp_default function to the data, and then use the qnorm function to convert the ratios to measures (so we can combine them later on).
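To see what these two steps do, here is a sketch on a toy five-row data frame (the values are illustrative only, not from the AHAH data):

```r
# toy example of the Rankit + qnorm steps
toy <- data.frame(v = c(3, 1, 4, 1.5, 9))
toy$v_rank <- rank(toy$v, ties.method = "min")  # ranks: 3 1 4 2 5
toy$v_p    <- (toy$v_rank - 0.5) / nrow(toy)    # Rankit ratios: 0.5 0.1 0.7 0.3 0.9
toy$v_z    <- qnorm(toy$v_p)                    # z-scores: approx 0, -1.28, 0.52, -0.52, 1.28
```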
#retail
data$gamb_dist <- exp_default(data$gamb_dist, data)
data$gamb_dist <- qnorm(data$gamb_dist, mean = 0, sd = 1)
data$ffood_dist <- exp_default(data$ffood_dist, data)
data$ffood_dist <- qnorm(data$ffood_dist, mean = 0, sd = 1)
data$pubs_dist <- exp_default(data$pubs_dist, data)
data$pubs_dist <- qnorm(data$pubs_dist, mean = 0, sd = 1)
data$leis_dist <- exp_default(data$leis_dist, data)
data$leis_dist <- qnorm(data$leis_dist, mean = 0, sd = 1)
data$off_dist <- exp_default(data$off_dist, data)
data$off_dist <- qnorm(data$off_dist, mean = 0, sd = 1)
data$tobac_dist <- exp_default(data$tobac_dist, data)
data$tobac_dist <- qnorm(data$tobac_dist, mean = 0, sd = 1)
#health
data$gpp_dist <- exp_default(data$gpp_dist, data)
data$gpp_dist <- qnorm(data$gpp_dist, mean = 0, sd = 1)
data$ed_dist <- exp_default(data$ed_dist, data)
data$ed_dist <- qnorm(data$ed_dist, mean = 0, sd = 1)
data$dent_dist <- exp_default(data$dent_dist, data)
data$dent_dist <- qnorm(data$dent_dist, mean = 0, sd = 1)
data$pharm_dist <- exp_default(data$pharm_dist, data)
data$pharm_dist <- qnorm(data$pharm_dist, mean = 0, sd = 1)
The next step is to combine the various indicators into each of the domains:
# Domain scores
# weight scores for each domain
#retail
data$r_domain <- (0.20 * data$gamb_dist +
0.20 * data$ffood_dist +
0.20 * data$pubs_dist +
0.20 * data$off_dist +
0.20 * data$tobac_dist)
#health
data$h_domain <- (0.20 * data$gpp_dist +
0.20 * data$ed_dist +
0.20 * data$dent_dist +
0.20 * data$pharm_dist +
0.20 * data$leis_dist)
We apply an exponential transformation to avoid the cancellation effect. It also has the impact of giving more weight to the lower end (the ‘bad’ values).
Again, we need a custom function to run these transformations:
#function for exponential transformations
# Exponential transformation ~> X=-23 ln(1-R(1-exp(-100/23)))
exp_trans <- function(x,y){-23*log(1-(x/nrow(y))*(1-exp(-100/23)), base = exp(1))}
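As a quick sanity check (a sketch with made-up ranks, not the real data), the transformation maps scaled ranks in (0, 1] onto scores in roughly (0, 100]:

```r
# sketch: the exponential transformation maps ranks onto a 0-100 scale
exp_trans <- function(x,y){-23*log(1-(x/nrow(y))*(1-exp(-100/23)), base = exp(1))}
toy <- data.frame(r = 1:100)  # pretend ranks for 100 areas
exp_trans(1, toy)             # lowest rank gives a score near 0 (approx 0.23)
exp_trans(100, toy)           # highest rank gives a score of exactly 100
```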
We rank and then transform the domain scores:
# rank the domain scores
# Domain ranks
data$r_rank <- rank(data$r_domain,ties.method= "min")
data$h_rank <- rank(data$h_domain,ties.method= "min")
# apply the exponential transformation to the scores
# Exp domains
data$r_exp <- exp_trans(data$r_rank,data)
data$h_exp <- exp_trans(data$h_rank,data)
We then calculate the index score:
#combine (at equal weighting) to make the AHAH score
# AHAH score
data$ahah <- (0.5 * data$r_exp +
0.5 * data$h_exp)
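The 50/50 weighting is a choice rather than a requirement. As a hypothetical sketch (not part of the published AHAH index), the two domains could be weighted unequally:

```r
# hypothetical alternative: weight Health more heavily than Retail
data$ahah_alt <- (0.7 * data$h_exp +
                  0.3 * data$r_exp)
```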
We can also do some subsequent calculations to create the ranks and deciles:
# Create Ranks and Deciles
data$r_ahah <- rank(data$ahah,ties.method= "min")
data$d_ahah <- ntile(data$ahah, 10)
data$d5_ahah <- ntile(data$ahah, 5)
data$r_dec <- ntile(data$r_exp,10)
data$h_dec <- ntile(data$h_exp,10)
Deciles are often used with multi-dimensional indices (including the IMD) so users can easily pick out the top 10% most unhealthy / most deprived areas.
We also calculate the ranks, as some users find these easier to work with; for example, ranks allow them to find the most unhealthy place in GB, or in their local authority.
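A quick sketch of how ntile() (from dplyr) forms these equal-sized groups, on toy data:

```r
# toy example: ntile() splits values into n equal-sized groups
library(dplyr)
x <- 1:20
ntile(x, 10)  # ten equal groups: 1 1 2 2 3 3 ... 10 10
```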
Finally we can write out the data as a CSV file, either with all the columns:
#write out csv file with all columns
write.csv(data,"2020-09-25-r+h.csv",quote = FALSE, row.names = FALSE)
Or just a subset:
# extract only LSOA code rank, values and AHAH index
# Extract only Domains and Index ranking/Deciles
data_ph <- fread("2020-09-25-r+h.csv")
data_ph <- data_ph[ ,c(2,36:47)]
write.csv(data_ph,"2020-09-25-summary.csv",quote = FALSE, row.names = FALSE)
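As an aside, selecting columns by name rather than by position is often more robust if the column order changes. A sketch (the column names are assumed from the steps above; data_sub is a hypothetical variable name):

```r
# hypothetical alternative: select the key columns from 'data' by name
data_sub <- data[, c("lsoa11", "r_domain", "h_domain", "r_rank", "h_rank",
                     "r_exp", "h_exp", "ahah", "r_ahah", "d_ahah",
                     "d5_ahah", "r_dec", "h_dec")]
```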
Now we have our output, try creating a map of it, based on the code from the previous sessions.
Try also comparing either the map or the table of values to the AHAH index - what has changed?
Try removing one (or more) of the indicators. How does this impact things?
Try swapping out the Retail domain and putting in something else (see the full script for details).
This practical was written using R 4.0.2 (2020-06-22) and RStudio Cloud (RStudio Server Pro Version 1.3.1056-1) by Dr. Nick Bearman (nick@geospatialtrainingsolutions.co.uk).
This code is based on the AHAH code, from https://github.com/GDSL-UL/AHAH/blob/master/create_index_version2.R. Sample code is also available in the scripts folder, for the two domains in mdi-all-domains-example.R and all four domains in mdi-all-domains-example.R.
This work is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/. The latest version of the workbook is available from https://data.cdrc.ac.uk/dataset/advanced-gis-methods-training-ahah-and-multi-dimensional-indices and https://github.com/nickbearman/cdrc-ahah-mdi-course. This version was created on 03 November 2020.