The aggregated footfall sensor dataset is a product derived from raw footfall data, providing footfall counts at five-minute intervals. The raw data were collected through passive Wi-Fi signal probing by a sensor network across Great Britain between 2015 and 2020.
The data are used as a proxy for estimating footfall at retail locations.
The dataset includes details about the sensor locations (a description, latitude, longitude, height, depth, and installation dates) and cleaned five-minute interval footfall estimates, which include timestamps, locations, and adjusted and unadjusted footfall counts. A complete description of this dataset can be found below in the Data and Resources section.
Content
The dataset includes information from 1151 sensor locations across 107 cities in Great Britain, identified by addresses including building numbers, street names, and unit postcodes. The data span July 2015 to September 2020 and are aggregated into five-minute intervals.
- Rows: Over 20 million records across approximately 67 monthly files.
- Columns: The dataset comprises 22 variables distributed across 71 files.
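Because the counts are stored at five-minute intervals, most analyses begin by rolling them up to a coarser unit. The following is a minimal sketch of an hourly roll-up, not the dataset's own processing code. The tuple layout (`location_id`, timestamp string, count) and the timestamp format are assumptions made for illustration; the actual variable names are listed in the Data and Resources section.

```python
from collections import defaultdict
from datetime import datetime

INTERVALS_PER_HOUR = 12  # 60 minutes divided into five-minute intervals


def hourly_footfall(rows):
    """Sum five-minute counts into hourly totals per sensor.

    `rows` is an iterable of (location_id, timestamp_string, count) tuples.
    The field layout and timestamp format are hypothetical.
    """
    # (sensor, hour) -> [running footfall total, number of intervals seen]
    totals = defaultdict(lambda: [0, 0])
    for location_id, ts, count in rows:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(minute=0)
        bucket = totals[(location_id, hour)]
        bucket[0] += count
        bucket[1] += 1
    # Mark hours with fewer than 12 intervals so that gaps are visible.
    return {
        key: {"footfall": footfall, "complete": seen == INTERVALS_PER_HOUR}
        for key, (footfall, seen) in sorted(totals.items())
    }


# Toy records for one hypothetical sensor "A" (values are made up).
sample = [
    ("A", "2019-03-01 09:00", 12),
    ("A", "2019-03-01 09:05", 8),
    ("A", "2019-03-01 09:55", 5),
    ("A", "2019-03-01 10:00", 20),
]
hourly = hourly_footfall(sample)
```

The `complete` flag is a design choice of this sketch: an hour with fewer than twelve intervals would otherwise look like an hour of low footfall rather than an hour with missing data.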
Quality, Representation and Bias
The quality and representation of the SmartStreetSensor Footfall dataset are influenced by the methodologies employed and the inherent biases associated with its collection process:
- Sensor Range: The signal strength and sensor range are variable, influenced by environmental conditions and technical specifications. This variability introduces inconsistencies in coverage.
- Probing Frequency: Devices probe for Wi-Fi signals at differing frequencies based on manufacturer, operating system, and usage state, affecting the detection consistency.
- MAC Address Collisions: A minor percentage (0.01%) of MAC addresses are reported by multiple devices due to MAC randomization techniques, adding complexity to data cleaning.
- Human Error: Sensor power disconnections and operational disruptions result in occasional data gaps.
- Postprocessing Assumptions: The process of transforming probe requests into footfall estimates involves assumptions that may lead to overcounting or undercounting in specific scenarios.
- Geographical Representation: The dataset is heavily skewed toward Greater London, with one-third of sensor locations situated in this region. Consequently, national-level aggregated metrics may disproportionately reflect patterns in London.
- Temporal Coverage: Early stages of data collection, before July 2016, included fewer sensors (approximately 200), mostly located in London, further amplifying initial geographical biases.
- Device Misclassification: Sensors cannot distinguish between mobile devices and other Wi-Fi-enabled devices (e.g., printers or routers), potentially inflating counts.
- City-Level Distribution: Sensor concentration is highest in London (318 locations), followed by Edinburgh (46) and Manchester (32). In contrast, smaller towns often have only one or two sensors, limiting granularity in those areas.
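The geographical skew described above can be illustrated with a short calculation. The sketch below first derives each city's share of the 1151 sensor locations from the counts quoted in the list, then contrasts a pooled average (every sensor weighted equally) with a city-balanced average (every city weighted equally). The function names and the toy footfall values are hypothetical and are not taken from the dataset.

```python
from statistics import mean

TOTAL_LOCATIONS = 1151  # sensor locations reported for the dataset
city_locations = {"London": 318, "Edinburgh": 46, "Manchester": 32}

# Share of all sensor locations found in each city.
shares = {city: n / TOTAL_LOCATIONS for city, n in city_locations.items()}


def pooled_mean(per_city):
    """Average over all sensors, so cities with many sensors dominate."""
    return mean(v for values in per_city.values() for v in values)


def city_balanced_mean(per_city):
    """Average of per-city averages, so every city counts once."""
    return mean(mean(values) for values in per_city.values())


# Made-up per-sensor averages: four sensors in one city, one in another.
toy = {"London": [100, 100, 100, 100], "SmallTown": [20]}
pooled = pooled_mean(toy)
balanced = city_balanced_mean(toy)
```

With these toy values the pooled average is pulled toward the city with more sensors, while the city-balanced average is not. Which weighting is appropriate depends on the question being asked; neither is presented here as the recommended method.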