City Chef: Bring Your Own City (BYOC)
Generate synthetic population data with our new open-source tool
Fred is a data scientist at Arup working on machine learning and city modelling
In this post we share a project we’re making called City Chef.
As much as we enjoy a good graph algorithm in The City Modelling Lab, we also love a bit of data. But we often get blocked by it, or by the lack of it. All the wrangling in the world can’t get you past missing information, whether it is missing because it’s personal, sensitive or confidential, or because it simply doesn’t exist.
We started building City Chef to help us around this problem. We use it to generate data for fake cities, including data about the occupants of those cities and their activities.
Section 1: Introduction to City Chef, including a few use cases
Section 2: More detail about City Chef features and how they work
Section 1: Welcome to City Chef
Why fake cities full of fake people?
We use City Chef outputs as toy examples for experiments and tests. For example, we use them to (i) build realistic census and survey data for testing population synthesis algorithms, and (ii) build dummy transit services for testing our network synthesis work.
We think City Chef, or at least some of the ideas and code behind it, might be useful for others working in city modelling, so we’re sharing it. If it’s of use for your project, we’d like to know. If you work on it, we hope you’ll contribute.
A personal disclaimer/apology: City Chef is WIP and currently more a collection of methods than a cohesive API. However, to ease people in, the project includes two example applications for getting started:
Application 1: Census & Travel Survey Generator
We started City Chef because we wanted to play with the code from this really cool paper using Variational Auto-Encoders for population synthesis. But we didn’t have access to the right population survey data — so we started faking it. Since then we’ve used fake population survey data for lots of other population synthesis experiments.
We use this notebook to randomly generate a city with facilities, networks, statistical zones, households and people with attributes and simple activity plans.
From this city and its population we extract (i) population marginal statistics, (ii) commuter demand matrices, and (iii) household travel surveys.
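As a rough illustration of the first two extracts, the sketch below counts marginals and a commuter demand matrix from a tiny hand-written population. The field names (`home_zone`, `work_zone`, `age_band`) and the helper functions are hypothetical and are not taken from the City Chef code.

```python
from collections import Counter

# Hypothetical synthetic population; field names are illustrative only.
population = [
    {"home_zone": "A", "work_zone": "B", "age_band": "16-64"},
    {"home_zone": "A", "work_zone": "A", "age_band": "65+"},
    {"home_zone": "B", "work_zone": "A", "age_band": "16-64"},
    {"home_zone": "B", "work_zone": "B", "age_band": "16-64"},
]


def marginals(people, attribute):
    """Count people per (home zone, attribute value) pair."""
    return Counter((p["home_zone"], p[attribute]) for p in people)


def commuter_matrix(people):
    """Count home-to-work trips per (origin zone, destination zone) pair."""
    return Counter((p["home_zone"], p["work_zone"]) for p in people)


age_marginals = marginals(population, "age_band")
od_matrix = commuter_matrix(population)
```

A household travel survey could then be approximated by drawing a small random sample of the same records.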
Application 2: OSM & GTFS Generator
We build some really big and complex transport networks for our transport simulations, typically combining data from OpenStreetMap (OSM) and the General Transit Feed Specification (GTFS). But it’s sometimes nice to have some smaller data to play with, especially for building test cases or toy examples.
We use this notebook to (i) randomly generate a road network with one or more bus routes, (ii) add some spatial noise, then (iii) output to OSM and GTFS formats. This allows us to quickly generate controlled test cases for our network algorithms.
Technical Overview
We sometimes like to think of cities as really complex joint distributions. Sometimes we can use physical models, like scheduled transit services and queuing. But otherwise, especially where humans are involved, things get more probabilistic.
When we generate a population of agents for our models, we want them to be as representative of the real population as possible. For example, we want to correctly capture the likelihood of an individual, of a certain age and income, in a certain area, being a car owner or not.
We’ve been working on methods for modelling this Data Generation Process (DGP). But experimenting and testing is hard without data. Typically we have access to only small samples and marginal statistics, such as a few household surveys and the overall car ownership in an area.
City Chef tries to get around this problem by providing fake data. For that data to support useful experiments and valid tests, the City Chef project has some key aims:
(i) Output data in useful formats
(ii) Representative physical DGP components, such as networks
(iii) Representative probabilistic DGP components, such as age distributions
and where this fails:
(iv) Representative complexity in the distributions
(v) Feedback for ‘expert’ validation
If that sounds useful or interesting then read on for some specific details of the processes we’ve devised.
Section 2: City Chef In Detail
This section provides a detailed summary of the various components of City Chef and how they work.
Facilities
The base process for any City Chef city is a spatial distribution of facilities. Facilities are locations for the population’s activities, such as offices (work activity), parks (leisure activity) and houses (home activity). The attributes of these facilities, such as their density or their distances to other facilities, are synthesised later. Additionally, the number of people using them, and when, is a feature of people’s activity plans, which are also synthesised later.
- Define a bounding box
- Uniformly sample ‘centres’ within bounding box
- Optionally Poisson sample the number of facilities of each type around each centre
- Gaussian sample locations of facilities around each centre
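The four steps above can be sketched with the standard library alone. This is a minimal illustration under assumed parameters, not the City Chef implementation: the function names, the `spread` parameter and the Knuth-style Poisson sampler are all choices made for this example.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_facilities(bbox, n_centres, mean_per_centre, spread, seed=0):
    """Uniform centres in the box, Poisson counts, Gaussian scatter per centre."""
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bbox
    facilities = []
    for _ in range(n_centres):
        cx = rng.uniform(min_x, max_x)
        cy = rng.uniform(min_y, max_y)
        for _ in range(poisson(rng, mean_per_centre)):
            # Gaussian scatter can land slightly outside the bounding box.
            facilities.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return facilities


homes = sample_facilities(bbox=(0, 0, 10, 10), n_centres=3,
                          mean_per_centre=20, spread=0.5)
```

Calling the function once per facility type (homes, offices, parks) with different seeds and spreads gives each type its own spatial pattern.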
Road Network
We build a road network for car travel. We use a quad-tree structure to define the density of the network, so that road accessibility is likely to be better where facility density is higher. The connectivity of the resulting networks is based on parent-child relationships within the quad tree. This results in variations in connectivity by Euclidean distance, depending on how each quad relates to its neighbouring quads.
- Build a grid tree with configurable facility density
- Build a nested network of roads from the grid tree
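One plausible reading of these two steps is sketched below: a box is split into four quads whenever it holds more than `max_points` facilities, and a road links each parent quad centre to the centres of its children. The names and the splitting rule are assumptions for illustration; the actual City Chef connectivity rules may differ.

```python
def build_quadtree(points, box, max_points=2, max_depth=6, depth=0):
    """Split a box into four quads until each leaf holds at most max_points."""
    min_x, min_y, max_x, max_y = box
    # Half-open intervals: points on the upper or right boundary are excluded.
    inside = [(x, y) for x, y in points
              if min_x <= x < max_x and min_y <= y < max_y]
    node = {"box": box, "count": len(inside), "children": []}
    if len(inside) > max_points and depth < max_depth:
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        quads = [
            (min_x, min_y, mid_x, mid_y), (mid_x, min_y, max_x, mid_y),
            (min_x, mid_y, mid_x, max_y), (mid_x, mid_y, max_x, max_y),
        ]
        node["children"] = [
            build_quadtree(inside, quad, max_points, max_depth, depth + 1)
            for quad in quads
        ]
    return node


def road_edges(node):
    """Link each parent quad centre to the centre of each of its children."""
    def centre(box):
        return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

    edges = []
    for child in node["children"]:
        edges.append((centre(node["box"]), centre(child["box"])))
        edges.extend(road_edges(child))
    return edges


facilities = [(1, 1), (1.5, 1.2), (1.2, 1.8), (8, 8)]
tree = build_quadtree(facilities, box=(0, 0, 10, 10))
roads = road_edges(tree)
```

Because the tree only subdivides where facilities cluster, the bottom-left corner of this toy city ends up with a much finer road grid than the rest.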
Bus Network
We build a bus network for public transit travel on the road network. Route generation builds viable routes using a weighted random walk algorithm.
First a start location is randomly chosen, weighted by the facility density at each vertex. Then, from each start location, the algorithm walks the network, choosing the next edge at random. The weights for this traversal are based on (i) the facility density of the possible choices, (ii) the straightness of the resulting route and (iii) the number of times the edges have already been traversed by a route. A route cannot immediately re-traverse the edge it has just used, so a traversal ends either at a dead end in the road network or when a maximum number of stops is reached.
Edges are assigned a distance, a free-speed (based on length) and a free-flow traversal time as per the underlying road network. For simplicity, no consideration is made of a schedule or interchange times.
- Initiate start locations based on facility density
- Optionally, Poisson sample the number of routes based on population
- Generate routes based on weighted random walk
- Add generated routes that meet minimum requirements for stops and length
- Combine routes to form total bus network graph
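The route generation steps can be sketched as follows. This is a simplified illustration, not the City Chef code: it weights choices by facility density and by how often an edge has already been used, but leaves out the straightness term, and all names and the toy graph are hypothetical.

```python
import random


def walk_route(graph, density, used, start, max_stops, rng):
    """Weighted random walk that never reverses along the edge just used."""
    route, previous = [start], None
    while len(route) < max_stops:
        current = route[-1]
        options = [v for v in graph[current] if v != previous]
        if not options:  # dead end: the only way out is back the way we came
            break
        # Favour dense vertices, penalise edges that other routes already use.
        weights = [density[v] / (1 + used.get(frozenset((current, v)), 0))
                   for v in options]
        route.append(rng.choices(options, weights=weights)[0])
        previous = current
    return route


def generate_routes(graph, density, n_routes, min_stops, max_stops, seed=0):
    """Sample start vertices by density, walk, and keep routes that are long enough."""
    rng = random.Random(seed)
    vertices = list(graph)
    used, routes = {}, []
    for _ in range(n_routes):
        start = rng.choices(vertices, weights=[density[v] for v in vertices])[0]
        route = walk_route(graph, density, used, start, max_stops, rng)
        if len(route) >= min_stops:
            routes.append(route)
            for a, b in zip(route, route[1:]):
                key = frozenset((a, b))
                used[key] = used.get(key, 0) + 1
    return routes


# Toy undirected road network as an adjacency list; densities must be positive.
graph = {"a": ["b"], "b": ["a", "c", "d"], "c": ["b"], "d": ["b", "e"], "e": ["d"]}
density = {"a": 1.0, "b": 3.0, "c": 1.0, "d": 2.0, "e": 1.0}
routes = generate_routes(graph, density, n_routes=10, min_stops=3, max_stops=5)
```

The union of the edges in `routes` then forms the bus network graph.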
Rail Network
We build a rail network for public transit travel not restricted by the road network. As per the bus network generation, an algorithm attempts to build viable routes by weighted random walks. Unlike for the bus network, we build a new graph of all potential traversals. This graph connects all road network junctions (not dead ends) based on Delaunay triangulation.
The random walks are seeded and traverse as per the bus routes. Edges are assigned a distance, a speed (based on distance) and a resulting traversal time. For simplicity, no consideration is made of a schedule or interchange times.
- Generate a viable network using Delaunay triangulation
- Sample routes as per the bus methodology
- Discard unused viable network
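To show what the candidate rail graph looks like, here is a brute-force Delaunay sketch that relies only on the standard library. It tests every triangle for an empty circumcircle, which is only practical for small point sets and assumes no four points lie on one circle. In practice a library routine such as `scipy.spatial.Delaunay` would be the usual choice. The function names are illustrative.

```python
from itertools import combinations


def in_circumcircle(a, b, c, d):
    """True if d lies strictly inside the circle through a, b, c (counter-clockwise)."""
    (ax, ay), (bx, by), (cx, cy) = [(p[0] - d[0], p[1] - d[1]) for p in (a, b, c)]
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    det = (ax * (by * c2 - b2 * cy)
           - ay * (bx * c2 - b2 * cx)
           + a2 * (bx * cy - by * cx))
    return det > 1e-12


def delaunay_edges(points):
    """Return index pairs (i, j), i < j, for every edge of a Delaunay triangle."""
    edges = set()
    for i, j, k in combinations(range(len(points)), 3):
        a, b, c = points[i], points[j], points[k]
        orient = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        if orient == 0:
            continue  # collinear points do not form a triangle
        if orient < 0:
            b, c = c, b  # reorder to counter-clockwise
        others = (points[m] for m in range(len(points)) if m not in (i, j, k))
        if not any(in_circumcircle(a, b, c, d) for d in others):
            edges.update({(i, j), (i, k), (j, k)})
    return edges


junctions = [(0, 0), (4, 0), (5, 3), (0, 3)]
rail_candidates = delaunay_edges(junctions)
```

For these four junctions the result is the four sides of the quadrilateral plus the diagonal between junctions 1 and 3. Routes are then sampled on this graph and any unused edges are discarded.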
Statistical Zones
We generate statistical zones using another quad-tree based on facility density. This is as per the road network, but we also add some random variation to the quad dimensions.
- Build an irregular quad tree structure based on facility density
- Assign zones to facilities as required
Household and Person Features
We use chained generators to sequentially synthesise features for households and their occupants. The relationships between persons’ features and those of their household are maintained.
We use these generators to build tabular data with lots of features, like age, income and occupation. We create joint distributions with various ad-hoc assemblies of random generators, and we encourage some experimentation and iteration to build distributions that you are happy with.
People’s final attributes can be based on hidden features. More generally, all facilities are given hidden features based on spatial patterns that have arisen from the facility and network generation steps. These hidden features are used to add additional complexity and spatial correlations to the regular household and person features. Hidden features include distances to networks, facility densities, and the distances to the closest facilities.
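A minimal sketch of the chained-generator idea, using plain Python generators, might look like this. The feature names, categories and probabilities are invented for illustration and are not the distributions used in City Chef.

```python
import random


def households(n, rng):
    """First link in the chain: household-level features."""
    for hid in range(n):
        yield {"hid": hid,
               "size": rng.choice([1, 2, 3, 4]),
               "income": rng.choice(["low", "mid", "high"])}


def with_cars(household_stream, rng):
    """Second link: car ownership conditioned on household income."""
    p_car = {"low": 0.3, "mid": 0.6, "high": 0.9}  # assumed probabilities
    for hh in household_stream:
        yield {**hh, "has_car": rng.random() < p_car[hh["income"]]}


def persons(household_stream, rng):
    """Final link: one record per occupant, inheriting household features."""
    for hh in household_stream:
        for pid in range(hh["size"]):
            yield {"hid": hh["hid"], "pid": pid, "age": rng.randint(0, 90),
                   "income": hh["income"], "has_car": hh["has_car"]}


rng = random.Random(42)
people = list(persons(with_cars(households(5, rng), rng), rng))
```

Each link only sees what the previous links produced, so conditional dependencies are easy to add or reorder, and every person keeps the `hid` of the household they were generated from.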
Activity and Facility Choice
A simple activity plan is generated based on every individual’s features, such as their age, including hidden features, such as spatial densities and distances. This plan consists of a single activity at a given facility location.
- Activity type choice is made based on household and individual features; this can include staying at home
- Facilities for each activity type are weighted by their desirability (based on hidden density and distances as described above)
- Agents randomly choose a facility based on this weighting, on the Euclidean distance to the facility and on their individual features
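The facility choice step can be sketched as a weighted random draw. The weight used here, desirability multiplied by an exponential distance decay, is an assumption for illustration; the post does not state the exact functional form. The `beta` parameter stands in for individual features, for example a lower value for car owners who are less sensitive to distance.

```python
import math
import random


def choose_facility(home, facilities, beta, rng):
    """Draw a facility with weight = desirability * exp(-beta * Euclidean distance)."""
    weights = [
        f["desirability"] * math.exp(-beta * math.dist(home, f["xy"]))
        for f in facilities
    ]
    return rng.choices(facilities, weights=weights)[0]


# Hypothetical work facilities with a precomputed 'hidden' desirability score.
offices = [
    {"name": "office_a", "xy": (1.0, 0.0), "desirability": 1.0},
    {"name": "office_b", "xy": (5.0, 0.0), "desirability": 3.0},
]
rng = random.Random(7)
workplace = choose_facility(home=(0.0, 0.0), facilities=offices, beta=0.5, rng=rng)
```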
Mode Choice
Individuals choose a mode of transport (car, public transit, bike or walk) based on the expected journey times of each mode to their chosen activity facility and on their individual features. Expected journey times are calculated in a somewhat simplistic way to speed compute for larger populations:
- For car and transit modes, calculate journey times between all vertices on their combined networks. Note that no consideration is made of transit schedules or even interchanges. Precomputing this matrix of travel times is only sensible for relatively dense populations and relatively sparse networks
- For each person, journey time is calculated by assuming travel via the network vertices nearest to their origin and destination facilities. Travel time to and from these vertices is based on Manhattan distance
- Person mode choice is generated based on expected journey times and personal attributes
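The three steps above can be sketched as follows. The all-pairs matrix uses Floyd-Warshall, and the final choice uses a multinomial logit over expected times. Both are assumptions made for this example, since the post does not say which shortest-path or choice model City Chef uses, and all names and numbers are hypothetical.

```python
import math
import random


def all_pairs_times(vertices, edge_times):
    """Floyd-Warshall shortest travel times between all vertices of one network."""
    t = {(a, b): (0.0 if a == b else math.inf) for a in vertices for b in vertices}
    for (a, b), minutes in edge_times.items():
        t[a, b] = t[b, a] = min(t[a, b], minutes)  # undirected edges
    for k in vertices:
        for a in vertices:
            for b in vertices:
                if t[a, k] + t[k, b] < t[a, b]:
                    t[a, b] = t[a, k] + t[k, b]
    return t


def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def journey_time(origin, destination, coords, times, access_speed):
    """Manhattan access to the nearest vertex, network time, then Manhattan egress."""
    start = min(coords, key=lambda v: manhattan(origin, coords[v]))
    end = min(coords, key=lambda v: manhattan(destination, coords[v]))
    access = manhattan(origin, coords[start]) / access_speed
    egress = manhattan(destination, coords[end]) / access_speed
    return access + times[start, end] + egress


def choose_mode(mode_times, sensitivity, rng):
    """Multinomial logit: shorter expected times get exponentially more weight."""
    modes = list(mode_times)
    weights = [math.exp(-sensitivity * mode_times[m]) for m in modes]
    return rng.choices(modes, weights=weights)[0]


coords = {"u": (0, 0), "v": (10, 0), "w": (10, 10)}
car_times = all_pairs_times(list(coords), {("u", "v"): 5, ("v", "w"): 5})
car_minutes = journey_time((0, 1), (10, 9), coords, car_times, access_speed=0.5)
mode = choose_mode({"car": car_minutes, "walk": 40.0}, sensitivity=0.1,
                   rng=random.Random(3))
```

Personal attributes could enter through `sensitivity` or through per-mode constants, for example by removing the car option for people in households without a car.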
Wrapping Up
Thanks for reading!
We’d love to know of any application of this project and example applications are very welcome. But there’s lots to be displeased about of course — so we hope you’ll chip in. We have a few ideas for the future:
Technical
- Needs linting
- Needs doc strings
- Very little logging at the moment
- No tests
- Needs EPSG conversions (for OSM and GTFS outputs)
- Needs better structure/API
- Many of the classes and methods share features
- Bad module names
- Speed vs detail trade-off
Theoretical
- Accessibility calculation is simplistic
- Activity plan synthesis is only simple tour based
- Facility locations only contain a single facility
- Transit route synthesis is pretty bad, especially for trains
More broadly, City Chef uses directed, tree-like causation for the Data Generation Process. This is primarily for simplicity and speed, but it means that modelling of causation across features is very simplified. For example, there is no feedback from network connectivity to facility locations (but there is vice versa). We would like to work on a more dynamic agent-based or cellular automaton approach to make our test cities more reflective of patterns we see in the real world.