Welcome

This is Part 2 of the two-part CDRC course on the Access to Healthy Assets & Hazards (AHAH) dataset and creating Multi Dimensional Indices. The video in this part introduces the concept of a Multi Dimensional Index, and the practical session shows you how to create your own MDI.

After completing this material, you will:



Part 2: Multi Dimensional Indices

We are going to replicate the AHAH index for two domains (Retail and Health).

Just a quick reminder: the overall AHAH index is made up of four different domains:

  1. Retail
  2. Health Services
  3. Air Quality (called Environment in the code below)
  4. Natural Environment (called Greenspace in the code below)

Each domain is made up of a number of different indicators:

Downloading AHAH Source Data

To create an index, we are going to use the source files for AHAH. These are the distances to the nearest asset of each type for each LSOA.

Loading Source data in R

  • Start a new Script in RStudio.

  • Set your working directory.

  • Use this code to read in the file:

data <- read.csv("allvariableslsoawdeciles.csv")
  • Use head() to check what the data are:
head(data)
  rownum    lsoa11  gpp_dist  ed_dist dent_dist pharm_dist gamb_dist ffood_dist
1      1 E01000001 0.7013333 3.447556 0.6404444  0.3862222 0.3902222  0.1697778
2      2 E01000002 0.9379661 3.126102 0.9259322  0.4940678 0.4167797  0.1944068
3      3 E01000003 1.1073529 3.729706 0.4217647  0.2500000 0.3976471  0.2158824
4      4 E01000005 0.6284615 1.946538 0.7998718  0.7305128 0.2803846  0.1493590
5      5 E01000006 0.4471429 5.954286 1.2271429  0.7614286 0.5992857  0.5628571
6      6 E01000007 0.3076316 6.049737 0.8247368  0.5094737 0.3765789  0.3623684
  pubs_dist leis_dist blue_dist  off_dist tobac_dist  green_pas green_act ur
1 0.1806667 0.2140000 2.3871111 0.3808889  1.4608889 0.04622136     0.295  1
2 0.2205085 0.2789831 2.0488136 0.6798305  1.2693220 0.04107206     0.320  1
3 0.1917647 0.3576471 1.2179412 0.4217647  1.3497059 0.05061065     0.200  1
4 0.2280769 0.1370513 1.1812821 0.6170513  0.4865385 0.04428218     0.140  1
5 0.6114286 1.6885714 1.7607143 0.6478571  0.6635714 0.52745888     0.585  1
6 0.6207895 0.9642105 0.8634211 0.7339474  0.7263158 0.54568011     0.380  1
  no2_mean pm10_mean so2_mean d_gpp_dist d_ed_dist d_pharm_dist d_dent_dist
1 28.47180  17.45706 1.776443          3         2            1           3
2 27.82823  17.44603 1.735325          5         2            2           4
3 28.47449  17.47987 1.788466          6         2            1           1
4 26.11497  17.33858 1.602624          3         1            5           4
5 21.71050  17.18462 1.494547          1         4            5           6
6 21.55922  17.16954 1.489493          1         4            3           4
  d_gamb_dist d_ffood_dist d_pubs_dist d_leis_dist d_blue_dist d_off_dist
1          10           10          10           1           6         10
2          10           10          10           1           5          9
3          10           10          10           1           3         10
4          10           10          10           1           3         10
5           9            8           8           4           5         10
6          10            9           8           2           2          9
  d_tobac_dist d_green_pas d_green_act d_no2_mean d_pm10_mean d_so2_mean
1            8           9           3         10          10         10
2            9          10           3         10          10         10
3            8           9           1         10          10         10
4           10           9           1         10          10          9
5           10           3           7         10          10          9
6           10           3           4         10          10          9
  • Is this the data we expect to see?

  • Use View() to look at the data.

  • Is this what you expect?

There will be character chr, integer int and numeric num values in this data frame. Make sure you can identify which is which, and that you know what the differences are.

  • How are the data distributed? (try hist()).
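These checks can be sketched on a toy data frame; the same calls apply to the full `data` object (the toy values here are made up for illustration):

```r
# A sketch on a toy data frame; the same calls apply to `data`
toy <- data.frame(lsoa11 = c("E01000001", "E01000002", "E01000003"),
                  gpp_dist = c(0.70, 0.94, 1.11))
str(toy)               # shows one chr column and one num column
summary(toy$gpp_dist)  # min / median / max of the distances
# hist(data$gpp_dist)  # plots the distribution of the real GP distances
```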

Setup

We need some libraries for this work. Remember to install them if you haven't already.

#load libraries (and install if needed)
  library(nFactors)
  library(data.table)
  library(dplyr)

Ranking the Data

When creating an index, we need to rank the data rather than using the raw distances. Additionally, all the indicators need to point in the same direction (so, for example, low values are ‘good’ and high values are ‘bad’). Use this code to rank the data, and then invert the ranking where needed.

#the rank code reorders the data. for example:
#have a look at the gamb_dist data
head(data$gamb_dist)
[1] 0.3902222 0.4167797 0.3976471 0.2803846 0.5992857 0.3765789
#rank the data
data$gamb_dist_rank <- rank(data$gamb_dist,ties.method= "min")
#look at the data
head(data$gamb_dist_rank)
[1] 2681 3299 2863  754 7938 2402

In this code we use a shortened version, where we overwrite the column rather than creating a new one each time. This saves us ending up with a large number of extra columns.

Sometimes you might see a rank ending in .5. This happens when two values are tied: rank()’s default ties.method = "average" gives tied values the mean of the ranks they would otherwise occupy, which is where the .5 comes from.
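The two ties.method options can be compared on a small vector (not part of the AHAH data, just an illustration):

```r
x <- c(2, 1, 1, 3)
rank(x, ties.method = "min")  # ties share the lowest rank:  3 1 1 4
rank(x)                       # default "average" gives .5s: 3 1.5 1.5 4
```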

#retail
data$gamb_dist <- rank(data$gamb_dist,ties.method= "min")
data$gamb_dist <- rank(-data$gamb_dist) # Invert ranking
data$ffood_dist <- rank(data$ffood_dist,ties.method= "min")
data$ffood_dist <- rank(-data$ffood_dist) # Invert ranking
data$pubs_dist <- rank(data$pubs_dist,ties.method= "min")
data$pubs_dist <- rank(-data$pubs_dist) # Invert ranking
data$off_dist <- rank(data$off_dist,ties.method= "min")
data$off_dist <- rank(-data$off_dist) # Invert ranking
data$tobac_dist<- rank(data$tobac_dist,ties.method= "min")
data$tobac_dist <- rank(-data$tobac_dist) # Invert ranking

#health
data$gpp_dist <- rank(data$gpp_dist,ties.method= "min")
data$ed_dist <- rank(data$ed_dist,ties.method= "min")
data$dent_dist <- rank(data$dent_dist,ties.method= "min")
data$pharm_dist <- rank(data$pharm_dist,ties.method= "min")
data$leis_dist <- rank(data$leis_dist,ties.method= "min")
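The repeated calls above can also be written more compactly as a loop. This is a sketch on a toy data frame with made-up values; splitting the columns into ‘inverted’ and ‘plain’ mirrors the retail/health blocks above:

```r
# Sketch: the same ranking as a loop, on a toy data frame
toy <- data.frame(gamb_dist = c(0.39, 0.42, 0.28),
                  gpp_dist  = c(0.70, 0.94, 0.63))
invert_cols <- c("gamb_dist")   # retail indicators: ranking is inverted
plain_cols  <- c("gpp_dist")    # health indicators: ranking kept as-is
for (col in invert_cols) {
  toy[[col]] <- rank(toy[[col]], ties.method = "min")
  toy[[col]] <- rank(-toy[[col]])   # invert ranking
}
for (col in plain_cols) {
  toy[[col]] <- rank(toy[[col]], ties.method = "min")
}
```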

Transformations

We also need to transform the data. We apply a Rankit rank-based normalisation so that the rank-based indicators can later be combined on a common scale.

To apply this transformation we need to set up a custom function:

#Rankit rank-based normalisation
  exp_default <- function(x,y){(x-0.5)/nrow(y)} 
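A quick check of what the function does: for a column of ranks 1..n it returns values evenly spaced in (0, 1), which qnorm then maps to symmetric normal scores (toy ranks, for illustration only):

```r
exp_default <- function(x, y){(x - 0.5) / nrow(y)}   # as defined above
ranks <- data.frame(r = 1:4)
p <- exp_default(ranks$r, ranks)   # 0.125 0.375 0.625 0.875
qnorm(p)                           # symmetric scores around 0
```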

We apply this exp_default function to the data, and then use the qnorm function to convert the resulting proportions into standard normal scores (so we can combine them later on).

#retail
data$gamb_dist <- exp_default(data$gamb_dist, data)
data$gamb_dist <- qnorm(data$gamb_dist, mean = 0, sd = 1)
data$ffood_dist <- exp_default(data$ffood_dist, data)
data$ffood_dist <- qnorm(data$ffood_dist, mean = 0, sd = 1)
data$pubs_dist <- exp_default(data$pubs_dist, data)
data$pubs_dist <- qnorm(data$pubs_dist, mean = 0, sd = 1)
data$leis_dist <- exp_default(data$leis_dist, data)
data$leis_dist <- qnorm(data$leis_dist, mean = 0, sd = 1)
data$off_dist <- exp_default(data$off_dist, data)
data$off_dist <- qnorm(data$off_dist, mean = 0, sd = 1)
data$tobac_dist <- exp_default(data$tobac_dist, data)
data$tobac_dist <- qnorm(data$tobac_dist, mean = 0, sd = 1)

#health
data$gpp_dist <- exp_default(data$gpp_dist, data)
data$gpp_dist <- qnorm(data$gpp_dist, mean = 0, sd = 1)
data$ed_dist <- exp_default(data$ed_dist, data)
data$ed_dist <- qnorm(data$ed_dist, mean = 0, sd = 1)
data$dent_dist <- exp_default(data$dent_dist, data)
data$dent_dist <- qnorm(data$dent_dist, mean = 0, sd = 1)
data$pharm_dist <- exp_default(data$pharm_dist, data)
data$pharm_dist <- qnorm(data$pharm_dist, mean = 0, sd = 1)

Combining the Indicators into the Domains

The next step is to combine the various indicators into each of the domains:

# Domain scores
# weight scores for each domain
#retail
data$r_domain <- (0.20 * data$gamb_dist +
                    0.20 * data$ffood_dist +
                    0.20 * data$pubs_dist +
                    0.20 * data$off_dist +
                    0.20 * data$tobac_dist)
#health
data$h_domain <- (0.20 * data$gpp_dist +
                    0.20 * data$ed_dist +
                    0.20 * data$dent_dist +
                    0.20 * data$pharm_dist +
                    0.20 * data$leis_dist)
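Since all five weights are 0.20, each domain score is simply the mean of its five transformed indicators; a quick check on made-up values:

```r
v <- c(0.1, -0.4, 0.2, 0.0, 0.6)   # five transformed indicators for one LSOA
sum(0.20 * v)                      # the weighted sum used above
mean(v)                            # the same value: equal weighting is a mean
```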

We apply an exponential transformation to avoid the cancellation effect (where a high score on one domain cancels out a low score on another). It also has the effect of giving more weight to the lower end (the ‘bad’ values).

Again, we need a custom function to run these transformations:

#function for exponential transformations
  # Exponential transformation  ~>   X=-23 ln(1-R(1-exp(-100/23)))
  exp_trans <- function(x,y){-23*log(1-(x/nrow(y))*(1-exp(-100/23)), base = exp(1))}
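The constants are chosen so that the transformed scores run from just above 0 up to exactly 100 at the top rank, with the spacing stretched at the top end; a quick check on toy ranks:

```r
exp_trans <- function(x, y){-23*log(1 - (x/nrow(y))*(1 - exp(-100/23)), base = exp(1))}
ranks <- data.frame(r = 1:100)
scores <- exp_trans(ranks$r, ranks)
range(scores)   # lowest rank maps near 0; the top rank maps to exactly 100
```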

We rank and then transform the domain scores:

# rank the domain scores
# Domain ranks
data$r_rank <- rank(data$r_domain,ties.method= "min")
data$h_rank <- rank(data$h_domain,ties.method= "min")

# apply the exponential transformation to the scores
# Exp domains
data$r_exp <- exp_trans(data$r_rank,data)
data$h_exp <- exp_trans(data$h_rank,data)

Calculating the Index Score

We then calculate the index score:

#combine (at equal weighting) to make the AHAH score
# AHAH score
data$ahah <- (0.5 * data$r_exp + 
                0.5 * data$h_exp)

We can also do some subsequent calculations to create the ranks and deciles:

# Create Ranks and Deciles
data$r_ahah <- rank(data$ahah,ties.method= "min")

data$d_ahah <- ntile(data$ahah, 10)
data$d5_ahah <- ntile(data$ahah, 5)

data$r_dec <- ntile(data$r_exp,10)
data$h_dec <- ntile(data$h_exp,10)
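ntile() (from dplyr) splits the values into equally sized groups by rank; a small example with five values and five groups:

```r
library(dplyr)
x <- c(5, 1, 9, 3, 7)
ntile(x, 5)   # 3 1 5 2 4: each value lands in its own fifth
```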

Deciles are often used with multi-dimensional indices (including the IMD) so users can easily pick out the 10% most unhealthy / most deprived areas.

We also calculate the ranks, as some users find ranks easier to work with: they allow them to find the most unhealthy place in GB, or in their local authority, for example.

Finally, we can write out the data as a CSV file, either with all the columns:

#write out csv file with all columns
write.csv(data,"2020-09-25-r+h.csv",quote = FALSE, row.names = FALSE)

Or just a subset:

# extract only LSOA code rank, values and AHAH index
# Extract only Domains and Index ranking/Deciles
data_ph <- fread("2020-09-25-r+h.csv")
data_ph <- data_ph[ ,c(2,36:47)]
write.csv(data_ph,"2020-09-25-summary.csv",quote = FALSE, row.names = FALSE)
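Selecting columns by position (c(2,36:47)) depends on the column order staying exactly the same; a more robust sketch selects by name instead (toy table and column names here are for illustration only):

```r
library(data.table)
dt <- data.table(lsoa11 = c("E01000001", "E01000002"),
                 gpp_dist = c(0.70, 0.94),
                 r_rank = c(2L, 1L))
keep <- c("lsoa11", "r_rank")   # name the columns to keep
dt_sub <- dt[, ..keep]
names(dt_sub)                   # "lsoa11" "r_rank"
```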

Comparisons

  • Now we have our output, try creating a map of it, based on the code from the previous sessions.

  • Try also comparing either the map or the table of values to the AHAH index - what has changed?

  • Try removing one (or more) of the indicators. How does this impact things?

  • Try swapping out the Retail domain and putting in something else (see the full script for details).


This practical was written using R 4.0.2 (2020-06-22) and RStudio Cloud (RStudio Server Pro Version 1.3.1056-1) by Dr. Nick Bearman.

This code is based on the AHAH code from https://github.com/GDSL-UL/AHAH/blob/master/create_index_version2.R. Sample code is also available in the scripts folder, for the two domains in mdi-all-domains-example.R and all four domains in mdi-all-domains-example.R.

This work is licensed under the Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/. The latest version of the workbook is available from https://data.cdrc.ac.uk/dataset/advanced-gis-methods-training-ahah-and-multi-dimensional-indices and https://github.com/nickbearman/cdrc-ahah-mdi-course. This version was created on 03 November 2020.