Downloading CDRC data with cdrcR

cdrcR is an R wrapper developed to enable access to the CDRC API endpoints and to retrieve open and freely available CDRC data programmatically. The package is designed to have one main function – getCDRC – which allows you to get data from all the available CDRC API endpoints. You can access a list of these available endpoints and their metadata by running listCDRC(). This list will provide you with a dataset identifier – the dataCode – which you will need to use to request the correct endpoint.
You can also run ?getCDRC to access the function’s full documentation.

Installation

You can install the cdrcR package from CRAN, or the development version from GitHub using devtools.

The cdrcR package relies on an external service, so please check the GitHub repository periodically for updates.

# Install the development version from GitHub
install.packages("devtools")
devtools::install_github("aelissa/cdrcR")

# OR install from CRAN
install.packages("cdrcR")

Register to the CDRC API

To use the CDRC APIs, you will need to register for the CDRC API HERE.

Please be aware that the CDRC API registration is separate from the data.cdrc.ac.uk account. If you already have an account there, you will still need to register at the above link. Below is a screenshot of the correct registration website.

Using the Package

First of all, you need to load the library.

library(cdrcR)

Then, to get started, you need to log in with the username and password that you used (or just created) when registering with the CDRC API. Please note that you will need to call loginCDRC() each time you start working with the API again.

Your CDRC login details for data.cdrc.ac.uk will not work here, so be sure to use your CDRC API account details.

loginCDRC(username="your-username",password="your-password")
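If you would rather not type your password into a script, one option is to keep the credentials in environment variables and read them at login. The following is a minimal sketch of that idea: the helper readCdrcCredentials() and the variable names CDRC_USERNAME and CDRC_PASSWORD are assumptions made for illustration, and are not part of cdrcR.

```r
# Hypothetical helper: read CDRC API credentials from environment variables.
# The names CDRC_USERNAME and CDRC_PASSWORD are assumptions, not something
# that cdrcR itself defines.
readCdrcCredentials <- function() {
  username <- Sys.getenv("CDRC_USERNAME", unset = "")
  password <- Sys.getenv("CDRC_PASSWORD", unset = "")
  # Fail early with a clear message if either value is missing
  if (!nzchar(username) || !nzchar(password)) {
    stop("Set CDRC_USERNAME and CDRC_PASSWORD before logging in.")
  }
  list(username = username, password = password)
}

# Usage (uncomment once the variables are set, e.g. in ~/.Renviron):
# creds <- readCdrcCredentials()
# loginCDRC(username = creds$username, password = creds$password)
```

This keeps the password out of any script that you share or commit to version control.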

You can now list the open access datasets that are available via the API, alongside the corresponding dataCode, which identifies the API endpoint. You will need this dataCode to specify the dataset that you wish to get later on.

listCDRC()

This function will result in a data frame that should look something like this:

Title	DataCode	dataSetURL	GeographicalCoverage	GeographyLevel
Access to Healthy Assets & Hazards (AHAH) 2019	AHAHInputs, AHAHOverallIndexDomain	https://data.cdrc.ac.uk/dataset/access-healthy-assets-hazards-ahah	Great Britain	LSOA
Classification of Workplace Zones (COWZ) 2011	COWZUK2011	https://data.cdrc.ac.uk/dataset/classification-workplace-zones-cowz	United Kingdom	WZ
Index of Multiple Deprivation (IMD) 2019	IMD2019	https://data.cdrc.ac.uk/dataset/index-multiple-deprivation-imd	United Kingdom	LSOA
Internet User Classification (IUC) 2018	IUC2018	https://data.cdrc.ac.uk/dataset/internet-user-classification	Great Britain	LSOA
London Output Area Classification (OAC) 2011	LOACClassification2011, LOACInputData2011	https://data.cdrc.ac.uk/dataset/london-oac-2011	London	OA
London Workplace Zone Classification 2017	LWZCClassification2017, LOACInputData2011	https://data.cdrc.ac.uk/dataset/london-workplace-zone-classification	GreaterLondon	WZ
Classification of Multidimensional Open Data of Urban Morphology (MODUM) 2016	MODUMClassificationEW2016	https://data.cdrc.ac.uk/dataset/classification-multidimensional-open-data-urban-morphology-modum	England	OA
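Because the listing is a data frame, you can filter it like any other. The sketch below re-enters three rows of the table by hand so that it runs without logging in, and assumes that the column names match those shown above and that multiple codes are held as one comma-separated string, as they are displayed.

```r
# A hand-typed copy of three rows from the table above, so the example runs
# offline. With a live session you would use: datasets <- listCDRC()
datasets <- data.frame(
  Title = c("Access to Healthy Assets & Hazards (AHAH) 2019",
            "Classification of Workplace Zones (COWZ) 2011",
            "Internet User Classification (IUC) 2018"),
  DataCode = c("AHAHInputs, AHAHOverallIndexDomain", "COWZUK2011", "IUC2018"),
  GeographyLevel = c("LSOA", "WZ", "LSOA"),
  stringsAsFactors = FALSE
)

# Keep only the datasets built at LSOA level
lsoaDatasets <- datasets[datasets$GeographyLevel == "LSOA", ]

# One dataset can expose several endpoints, so split the comma-separated codes
lsoaCodes <- unlist(strsplit(lsoaDatasets$DataCode, ",\\s*"))
print(lsoaCodes)
```

Each element of lsoaCodes is then a candidate value for the dataCode argument of getCDRC().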

Once you have decided which dataset you would like to access, note its corresponding DataCode. This will be the input for the dataCode parameter in the getCDRC() function.

The getCDRC() function obtains the CDRC data. It takes four parameters:

dataCode - The API identifier for the specific dataset, available from the table above.
geography - The geographical level at which to retrieve the data. Choose from c("postcode", "MSOA", "LSOA", "LADcode", "LADname").
geographyCode - A character vector of one or more postcodes, LSOA codes, MSOA codes, LAD codes or LAD names.
boundaries - If FALSE (the default), returns a data frame of the desired data. If TRUE, the Open Geography Portal API is used to return an sf object with a ‘geometry’ column.

Please be aware that a dataset cannot always be queried at the geography at which it was built (you can find the original geography level for each dataset with listCDRC(), in the GeographyLevel column). The API endpoints can only be queried by postcode, LSOA, MSOA, LAD code or LAD name. Datasets built at OA or WZ level therefore cannot be requested with OA or WZ as the geography parameter of getCDRC(). Instead, specify one of the supported geographies, and the areas that overlap it will be returned. For example, the Workplace Zone Classification cannot be requested by WZ, but you can retrieve it for a specific area like this:

wz <- getCDRC("COWZUK2011", geography = "LADname", geographyCode = "Manchester")

This will retrieve all of the workplace zones in Manchester. Or, you can state a specific postcode, or LSOA, and the workplace zone that overlaps that specific postcode or LSOA will be returned.

getCDRC("COWZUK2011", geography = "postcode", geographyCode = "M139PR")

This returns the workplace zone for the University of Manchester.
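The postcodes in this tutorial are written in a compact upper-case form with no space (for example "M139PR"). If your postcodes come from a spreadsheet or a form, a small clean-up step like the one sketched below can bring them into that form first. The helper cleanPostcode() is hypothetical, and whether the API also accepts other forms is not stated here, so treat this as a precaution rather than a requirement.

```r
# Hypothetical helper: convert postcodes to the compact upper-case form used
# in the examples in this tutorial (e.g. "M139PR")
cleanPostcode <- function(postcode) {
  # Remove all whitespace, then convert to upper case
  toupper(gsub("[[:space:]]+", "", postcode))
}

codes <- cleanPostcode(c("m13 9pr", " L1 3AY", "L82TJ"))
print(codes)

# Usage with a logged-in session:
# getCDRC("COWZUK2011", geography = "postcode", geographyCode = codes)
```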

Example 1: Accessing the AHAH Dataset

The Access to Healthy Assets and Hazards dataset (AHAH) is a multidimensional (composite) index developed by the CDRC to measure how ‘healthy’ neighbourhoods are in Great Britain. It combines indicators under 4 different domains of accessibility, including:

Retail environment: access to fast food outlets, pubs, off-licenses, tobacconists and vape shops, and gambling outlets
Health services: access to GPs, hospitals, pharmacies, dentists, and leisure services
Physical/Natural Environment: access to blue space and green space
Air Quality: Nitrogen Dioxide, Particulate Matter 10, Sulphur Dioxide

Example 1 (a): AHAH Index for Postcodes in Liverpool

In this example, to highlight the usability of getCDRC() we will access the overall AHAH domain index (via the dataCode AHAHOverallIndexDomain), for several postcodes: L13AY, L82TJ, L83UL.

If you recall, the AHAH is a composite index, and so the overall score is a combination of all four domains. Also, although we are interested in postcodes to rank by their level of access to healthy assets and hazards, the data is at LSOA level. Therefore, the LSOAs that overlap the requested postcodes will be returned.

# Login
loginCDRC(username="your-username",password="your-password")

# Check the data and their relevant dataCode
listCDRC()

# Get the AHAH index for the postcodes
ahah <- getCDRC("AHAHOverallIndexDomain",
                geography = "postcode", 
                geographyCode = c("L13AY","L82TJ","L83UL"))
              
# Inspect ahah to understand what was returned
dim(ahah)
head(ahah)
names(ahah)

The ahah dataframe created above consists of the following variables:

Variable	Description
lsoa11	Lower Super Output Area code (2011)
r_rank	Retail domain - Ranks
h_rank	Health domain - Ranks
g_rank	Blue/Green space domain - Ranks
e_rank	Air Quality domain - Ranks
r_exp	Retail domain - Value (after exponential transformation)
h_exp	Health domain - Value (after exponential transformation)
g_exp	Blue/Green space domain - Value (after exponential transformation)
e_exp	Air Quality domain - Value (after exponential transformation)
ahah	Access to Healthy Assets and Hazards - Value
r_ahah	Access to Healthy Assets and Hazards - Ranks
d_ahah	Access to Healthy Assets and Hazards - Deciles
r_dec	Retail domain - Deciles
h_dec	Health domain - Deciles
g_dec	Blue/Green space domain - Deciles
e_dec	Air Quality domain - Deciles

# Here we will rank the postcodes by the AHAH index from the best performing to the worst performing, using the ahah variable - the overall Access to Healthy Assets and Hazards score. 
ahah[order(ahah$ahah),c("postCode","ahah")]

#  postCode     ahah
#3   L8 3UL 20.01734
#2   L8 2TJ 23.04482
#1   L1 3AY 45.91745
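The ordering step can also be tried without an API session. The sketch below re-enters the three rows printed above by hand, and adds an explicit position column; the object names ahahScores, ranked and worstFirst are chosen here for illustration.

```r
# The three rows printed above, re-entered by hand so the example runs offline
ahahScores <- data.frame(
  postCode = c("L1 3AY", "L8 2TJ", "L8 3UL"),
  ahah = c(45.91745, 23.04482, 20.01734),
  stringsAsFactors = FALSE
)

# order() sorts ascending by default, so the lowest (best) score comes first
ranked <- ahahScores[order(ahahScores$ahah), ]

# Add an explicit position column: 1 = best performing
ranked$position <- seq_len(nrow(ranked))
print(ranked)

# For the reverse view (worst performing first), use decreasing = TRUE
worstFirst <- ahahScores[order(ahahScores$ahah, decreasing = TRUE), ]
print(worstFirst)
```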

In this example, we can see that L8 3UL is the healthiest in terms of access to healthy assets and hazards, with an overall AHAH score of 20.0173, and L1 3AY is the least healthy, with an overall AHAH score of 45.9175. To gain a better understanding of these scores, we can visualise their breakdown by domain with the following code.

# Load libraries
# install.packages(c("ggplot2","tidyr"))
library(ggplot2)
library(tidyr)

# Transform data into long format
ahahLong <- pivot_longer(data = ahah,
                         # The columns we want to make long - the domain scores
                         cols = c("rExp","hExp","gExp","eExp"),
                         # Variable for the variable names to go into
                         names_to = "domains",
                         # Variable for the variable values to go into
                         values_to = "scores") 
                         
# Inspect the new dataframe
names(ahahLong)
head(ahahLong[,c(1,14,15)])

## Create graph
# Global mappings
ggplot(ahahLong, aes(domains, scores, fill = domains))+
  # Create the bar charts
  geom_col(show.legend = FALSE)+
  # For each postcode
  facet_wrap(vars(postCode))

You should now have something that looks like this! As we can see, the postcode L1 3AY has a score of over 80 for rExp, the retail domain, which suggests that poor access to retail is a large factor in its overall score. It also has larger scores for eExp and gExp, the air quality and blue/green space domains respectively. However, this postcode has a very low hExp score, suggesting better accessibility to health services than the other two postcodes.

Example 1 (b): AHAH Index for LSOAs in Doncaster

Now, we will access the same data, but for the City of Doncaster in the Yorkshire and The Humber region. Here, we will need to specify the Local Authority District name (LADname): Doncaster. We will also want to return the boundaries so that we can map the data.

# Check the data and their relevant dataCode
listCDRC()

# Get the AHAH index for Doncaster
ahahDN <- getCDRC("AHAHOverallIndexDomain",
                  geography = "LADname", 
                  geographyCode = "Doncaster", 
                  boundaries = TRUE)

## Inspect the output
# Check the names of the variables
names(ahahDN)
# Check out the first 6 observations
head(ahahDN)
## - OK, let's map the deciles of the overall ahah score - the dAhah variable!

## Map the deciles of AHAH throughout Doncaster
# Load in the tmap package
# install.packages("tmap", dependencies = TRUE)
library(tmap)

# Create the map 
tm_shape(ahahDN) + 
  # Fill it with the AHAH deciles, giving appropriate title
  tm_fill("dAhah", style = "cat", title = "Decile of AHAH") +
  # Specify legend outside of map window, with text size
  tm_layout(legend.outside = TRUE,
            legend.text.size = 1)

You should now have an output that looks something like this! We can see some clear patterns of access to healthy assets and hazards in Doncaster. Doncaster’s urban centre, towards the centre of the map, has clusters of areas with poor access to healthy assets and hazards. As we move further from the urban centre, access improves, until we reach the rural areas towards the periphery of the borough, where access is again typically poor. There are exceptions, however, with pockets of good access in the rural north, northwest and southwest.

Example 2: Accessing the Internet User Classification

The Internet User Classification (IUC) 2018 is a bespoke classification that describes how people living in Great Britain interact with the internet. It is developed at the LSOA and Data Zone (DZ) level, and creates clusters of internet use and engagement. It is an update of the 2014 Internet User Classification. You can view the user guide with methodology and the IUC profiles HERE.

A number of metadata variables in the returned data can be ignored (id, createdDate, modifiedDate, isDeleted, rowVersion, and lastUpdatedBy). The remaining variables are described in the table below.

Variable	Description
LSOA11CD	LSOA
grpCD	The group code (1-10)
grpLabel	The IUC group label

The grpCD and the grpLabel are the variables of interest here. The groups are as follows:

Group	Label	Description
1	e-Cultural Creators	High levels of engagement, particularly social media, streaming, and gaming. New but active users.
2	e-Professionals	High levels of engagement, fairly young urban professionals, who are experienced, daily users.
3	e-Veterans	High levels of engagement, affluent families in low density suburbs, middle aged, qualified professionals, who are frequent and experienced users.
4	Youthful Urban Fringe	Average levels of engagement, young and ethnic minorities, typically students and young urbanites at the edges of deprived communities.
5	e-Rational Utilitarians	High demand constrained by poor infrastructure, engagement consists of e-commerce from middle aged or older residents, with personal computers at home.
6	e-Mainstream	Average levels of engagement from wide range of social echelons, located on the edge of urban areas or in transitional neighbourhoods.
7	Passive and Uncommitted Users	Limited or low levels of engagement, typically located outside of city centres and close to the rural-urban fringe. Individuals are rarely online, with weekly access or less.
8	Digital Seniors	White British, wealthy and retired in semi-rural or coastal regions, infrequent but adept users (less so for social media, streaming, and gaming)
9	Settled Offline Communities	Very limited engagement with the internet, accessing rarely or not at all. Most are elderly and tend to reside in semi-rural areas. Any online behaviour is via computers and mostly information seeking.
10	e-Withdrawn	The least engaged group, typically located in deprived urban regions, in areas with less affluent White British or areas of high ethnic diversity, and the greatest levels of unemployment. Potentially opt out of engagement for economic reasons.

IUC for LSOAs in Liverpool

For this example, we will get the Internet User Classification for LSOAs across Liverpool. Again, we want to return the geographies so that we can map the results.

# Check dataCode
listCDRC()

# Get Liverpool LSOAs
# (assumes an object named liverpool, with an LSOA11CD column, already exists in your session)
liverpool <- sf::st_as_sf(liverpool)

# Get the IUC data with geographical boundaries, using liverpool$LSOA11CD as the geographyCode input
iuc <- getCDRC("IUC2018",
               geography = "LSOA",
               geographyCode = liverpool$LSOA11CD,
               boundaries = TRUE)
              
## Inspect the output
# Check the names
names(iuc)
# Check the first 6 observations
head(iuc)
## OK - let's map the group names - the grpLabel variable!

# Map the IUC throughout Liverpool 
tm_shape(iuc) + 
  # Fill with the group labels, and give appropriate title
  tm_fill("grpLabel", style = "cat", title = "IUC Groups") +
  # Specify legend outside of map window and legend text size
  tm_layout(legend.outside = TRUE,
            legend.text.size = 1)

You should now have an output that looks something like this! We can see that the north of Liverpool is dominated by clusters of e-Withdrawn and Passive and Uncommitted Users, two of the groups with the least interaction with the internet. The e-Professionals are clustered towards the centre-south of Liverpool, particularly near the Docks, further eastwards around the universities, and at St Michael’s. The e-Cultural Creators, the group that interacts the most, particularly with social media, are located within the student neighbourhoods in Liverpool.
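To back up a visual reading of the map with numbers, you can tally how many LSOAs fall into each group. The sketch below uses a short vector of made-up labels purely to show the mechanics; with real data you would use the grpLabel column of the object returned by getCDRC().

```r
# Made-up group labels for eight LSOAs, for illustration only (not real data)
grpLabel <- c("e-Withdrawn", "e-Withdrawn", "e-Professionals",
              "e-Cultural Creators", "e-Withdrawn",
              "Passive and Uncommitted Users", "e-Professionals",
              "e-Withdrawn")

# Count LSOAs per group, most common first
groupCounts <- sort(table(grpLabel), decreasing = TRUE)

# Convert the counts to percentages of all LSOAs
groupShare <- round(100 * prop.table(groupCounts), 1)
print(groupShare)
```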

Example 3: Accessing the Index of Multiple Deprivation

The Index of Multiple Deprivation (IMD) is a composite indicator that measures relative levels of deprivation via 39 indicators, separated into 7 domains of deprivation. These are:

Income: 22.5% weighting
Employment: 22.5% weighting
Health Deprivation and Disability: 13.5% weighting
Education, Skills, and Training: 13.5% weighting
Crime: 9.3% weighting
Barriers to Housing and Services: 9.3% weighting
Living Environment: 9.3% weighting
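To make the idea of a weighted composite concrete, the sketch below combines a set of domain scores using the weights listed above. The domain scores are hypothetical, and the calculation is a plain weighted average: it illustrates the weighting only, and is not a reproduction of the official IMD method, which may transform the domain scores before combining them.

```r
# Domain weights (percent), as listed above
weights <- c(income = 22.5, employment = 22.5, health = 13.5,
             education = 13.5, crime = 9.3, housing = 9.3, living = 9.3)

# The listed weights are rounded, so check what they add up to
print(sum(weights))

# Hypothetical domain scores for one neighbourhood (not real data)
scores <- c(income = 30, employment = 25, health = 40,
            education = 35, crime = 20, housing = 15, living = 10)

# Weighted average: normalise the weights so they sum to 1, then combine
overall <- sum(scores * weights / sum(weights))
print(round(overall, 2))
```

Dividing by sum(weights) means the result stays on the same scale as the domain scores even though the rounded weights do not total exactly 100.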

These 7 distinct domains of deprivation are combined and weighted to calculate the overall measure of multiple deprivation experienced by people living in a neighbourhood, measured by LSOA. The IMD has a number of variables, described in the table below.

Variable	Description
ladCode	The code of the Local Authority
LSOA11CD	LSOA code
LSOA11NM	LSOA name
imd2010Adjusted	IMD score
nationalQuintile1	Quintile of deprivation from most deprived (1) to least deprived (5)
nationalDecile2	Decile of deprivation from most deprived (1) to least deprived (10)
imdRank	The ranking of the LSOA from most deprived (1) to least deprived (32,844)

IMD for Sheffield and Leeds Local Authority Districts

For this example, we will access the IMD for two Yorkshire cities, Sheffield and Leeds. We will then compare the levels of deprivation between the two cities in a tabular and visual manner.

# Check dataCode
listCDRC()

# Get the IMD data for Sheffield and Leeds
imd <- getCDRC("IMD2019",
               geography = "LADname", 
               geographyCode = c("Sheffield","Leeds"))

# Inspect the output
# Check the names
names(imd)
# Check the first 6 observations
head(imd)

# Get summaries for the LADs
# Sheffield - ladCode = E08000019
summary(imd[imd$ladCode =="E08000019",])
# Leeds - those that are not Sheffield, i.e. Leeds
summary(imd[imd$ladCode !="E08000019",])

# Let's make a density plot to visualise this
ggplot(imd) +
  # IMD rank, but colour by Local Authority
  geom_density(aes(imdRank, fill = ladCode), alpha = 0.6)

You should have an output that looks something like this. The more deprived a neighbourhood is, the lower its rank, so the most deprived neighbourhood has an IMD ranking of 1. Less deprived neighbourhoods have a higher IMD ranking, with the least deprived neighbourhood being ranked 32,844 (the number of LSOAs in England). We can see from this density plot that Sheffield (E08000019 - pink) has both more LSOAs with a low IMD ranking (high levels of deprivation) and more LSOAs with a high IMD ranking (low levels of deprivation) than Leeds. Such results could suggest that Sheffield is more polarised than Leeds, and may subsequently experience higher levels of socio-spatial inequality.
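Ranks, deciles and quintiles are closely related: a decile or quintile is the rank placed into one of 10 or 5 equal-sized bins. The sketch below shows that generic binning using the 32,844 LSOAs mentioned above. The helper rankToBin() is hypothetical, and the published deciles and quintiles may have been computed differently, so use the values returned by the API for analysis.

```r
# Number of LSOAs in England, as stated above
nLsoa <- 32844

# Hypothetical helper: place a rank into one of `groups` equal-sized bins,
# with bin 1 holding the lowest (most deprived) ranks
rankToBin <- function(rank, n = nLsoa, groups = 10) {
  ceiling(rank * groups / n)
}

exampleRanks <- c(1, 3284, 3285, 16422, 32844)
deciles <- rankToBin(exampleRanks)
quintiles <- rankToBin(exampleRanks, groups = 5)
print(data.frame(rank = exampleRanks, decile = deciles, quintile = quintiles))
```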

Try For Yourself!

By now, you should be acquainted with the process of accessing CDRC endpoints via the getCDRC() function. If you need more examples, feel free to follow the below exercises. If not, thank you for following this tutorial and good luck with your analyses!

Additional Exercises:

Access the classification of Workplace Zones, and retrieve data for the county that you live in, using the LADnames to retrieve the Local Authority Districts within it. Acquaint yourself with the variables (check the metadata), and try to find the following:
- The best place(s) for finding manufacturing employment.
- The locations likely to have a high proportion of employment in agriculture, forestry and fishing as well as mining, quarrying and rural services.
- Locations dominated by servants of society.
Access the IMD for a city that you are interested in. Using the deciles of deprivation, find the following:
- The least deprived neighbourhoods.
- The most deprived neighbourhoods.
- The neighbourhoods that are neither the most nor least deprived.
- This time, using the imdRank variable, find how these areas rank in relation to the country.
Access the AHAHInputs for your postcode. Acquaint yourself with these variables (check the metadata) and consider how each of the following factors contributed to your neighbourhood’s overall score.
- Distance to pubs.
- Distance to your GP, dentist, and pharmacy.
- Distance to greenspace (check metadata for details on active and passive greenspace).
- Distance to fast food.
- And finally, the air quality within your neighbourhood.