City Chef: Bring Your Own City (BYOC)

Generate synthetic population data with our new open-source tool

Fred Shone

Published in Arup’s City Modelling Lab, Apr 3, 2020

Fred is a data scientist at Arup working on machine learning and city modelling

In this post we share a project we’re making called City Chef.

As much as we enjoy a good graph algorithm in The City Modelling Lab, we also love a bit of data. But we often get blocked by it, or by the lack of it. All the wrangling in the world can’t get you past missing information, whether it is missing because it’s personal, sensitive or confidential, or because it simply doesn’t exist.

We started building City Chef to help us around this problem. We use it to generate data for fake cities, including data about the occupants of those cities and their activities.

Section 1: Introduction to City Chef, including a few use cases

Section 2: More detail about City Chef features and how they work

Section 1: Welcome to City Chef

Why fake cities full of fake people?

We use City Chef outputs as toy examples for experiments and tests. For example, we use them to (i) build realistic census and survey data for testing population synthesis algorithms, and (ii) build dummy transit services for testing our network synthesis work.

We think City Chef, or at least some of the ideas and code behind it, might be useful for others working in city modelling, so we’re sharing it. If it’s of use for your project, we’d like to know. If you work on it, we hope you’ll contribute.

A personal disclaimer/apology: City Chef is a work in progress and is currently more a collection of methods than a cohesive API. However, to ease people in, the project includes two example applications for getting started:

Application 1: Census & Travel Survey Generator

We started City Chef because we wanted to play with the code from this really cool paper using Variational Auto-Encoders for population synthesis. But we didn’t have access to the right population survey data — so we started faking it. Since then we’ve used fake population survey data for lots of other population synthesis experiments.

We use this notebook to randomly generate a city with facilities, networks, statistical zones, households and people with attributes and simple activity plans.

“City Zero” — A randomly generated city, with facilities, networks and a population with simple activity plans

From this city and its population we extract (i) population marginal statistics, (ii) commuter demand matrices, and (iii) household travel surveys.

“City Zero” — Example population statistics

Application 2: OSM & GTFS Generator

We build some really big and complex transport networks for our transport simulations, typically combining data from OpenStreetMap (OSM) and the General Transit Feed Specification (GTFS). But it’s sometimes nice to have some smaller data to play with, especially for building test cases or toy examples.

We use this notebook to (i) randomly generate a road network with bus route/s, (ii) add some spatial noise, then (iii) output to OSM and GTFS formats. This allows us to quickly generate controlled test cases for our network algorithms:

A randomly generated road network and bus route with spatial noise applied.

Technical Overview

We sometimes like to think of cities as really complex joint distributions. Sometimes we can use physical models, like scheduled transit services and queuing. But otherwise, especially where humans are involved, things get more probabilistic.

When we generate a population of agents for our models, we want them to be as representative of the real population as possible. For example, we want to correctly decide the likelihood of an individual, of a certain age and income, in a certain area, being a car owner or not.

We’ve been working on methods for modelling this Data Generating Process (DGP). But experimenting and testing is hard without data: typically we have access to only small samples and marginal statistics, such as a few household surveys and the overall car ownership in an area.

City Chef tries to get around this problem by providing fake data. But to be as useful as possible, that is, to support meaningful experiments and valid tests, the City Chef project has some key aims:

(i) Output data in useful formats

(ii) Representative physical DGP components, such as networks

(iii) Representative probabilistic DGP components, such as age distributions

and where this fails:

(iv) Representative complexity in the distributions

(v) Feedback for ‘expert’ validation

If that sounds useful or interesting then read on for some specific details of the processes we’ve devised.

Section 2: City Chef In Detail

This section provides a detailed summary of the various components of City Chef and how they work.

Facilities

The base process for any City Chef city is a spatial distribution of facilities. Facilities are locations for the population’s activities, such as offices (work activity), parks (leisure activity) and houses (home activity). The attributes of these facilities, such as their density or distances to other facilities, are synthesised later. Additionally, the number of people using them, and when, is a feature of people’s activity plans, which are also synthesised later.

Define a bounding box
Uniformly sample ‘centres’ within bounding box
Optionally, Poisson sample the number of facilities of each type around each centre
Gaussian sample locations of facilities around each centre
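
As a rough illustration, the four steps above can be sketched with the Python standard library. This is not City Chef’s actual code: the function names, parameters and values are made up for the example, and the Poisson sampler is a simple hand-rolled one (Knuth’s method) to avoid extra dependencies.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw a Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_facilities(bbox, n_centres, mean_per_centre, spread, seed=0):
    """Return (centres, facilities) as lists of (x, y) tuples.

    All arguments are hypothetical knobs for this sketch:
    bbox is (min_x, min_y, max_x, max_y), spread is the Gaussian sigma.
    """
    rng = random.Random(seed)
    min_x, min_y, max_x, max_y = bbox
    # Step 2: uniformly sample centres within the bounding box.
    centres = [
        (rng.uniform(min_x, max_x), rng.uniform(min_y, max_y))
        for _ in range(n_centres)
    ]
    facilities = []
    for cx, cy in centres:
        # Step 3: Poisson sample how many facilities sit around this centre.
        for _ in range(sample_poisson(rng, mean_per_centre)):
            # Step 4: Gaussian sample each location around the centre.
            facilities.append((rng.gauss(cx, spread), rng.gauss(cy, spread)))
    return centres, facilities


centres, homes = sample_facilities(
    (0, 0, 10, 10), n_centres=3, mean_per_centre=20, spread=0.5
)
```

Note that in this sketch the Gaussian scatter can place facilities slightly outside the bounding box. Calling the function once per facility type (homes, offices, parks) with different settings would give each type its own spatial pattern.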

Facility locations scattered around centres (black crosses)

Road Network

We build a road network for car travel. We use a quad-tree structure to define the density of the network, so that road accessibility is likely to be better where facility density is higher. The connectivity of the resulting network is based on parent-child relationships within the quad tree. This results in variations in connectivity by Euclidean distance, based on the relation to neighbouring quad grids.

Build a grid tree with configurable facility density
Build a nested network of roads from the grid tree
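
To make the idea concrete, here is a minimal sketch of a density-driven quad tree and one simple way to turn it into roads. It is an assumption-laden toy rather than City Chef’s implementation: the `max_points` and `max_depth` parameters are hypothetical, and connecting each quad centre to its children’s centres is just one possible reading of "parent-child relationships".

```python
def build_quad_tree(points, bbox, max_points=4, depth=0, max_depth=6):
    """Recursively split bbox into four quads while it holds too many points."""
    min_x, min_y, max_x, max_y = bbox
    node = {"centre": ((min_x + max_x) / 2, (min_y + max_y) / 2), "children": []}
    if len(points) <= max_points or depth >= max_depth:
        return node  # sparse enough (or deep enough): stop splitting
    mid_x, mid_y = node["centre"]
    quads = [
        (min_x, min_y, mid_x, mid_y),
        (mid_x, min_y, max_x, mid_y),
        (min_x, mid_y, mid_x, max_y),
        (mid_x, mid_y, max_x, max_y),
    ]
    for quad in quads:
        # Half-open intervals: points exactly on the upper edges are ignored.
        inside = [
            p for p in points
            if quad[0] <= p[0] < quad[2] and quad[1] <= p[1] < quad[3]
        ]
        node["children"].append(
            build_quad_tree(inside, quad, max_points, depth + 1, max_depth)
        )
    return node


def tree_to_edges(node):
    """Connect each quad centre to its children's centres (parent-child roads)."""
    edges = []
    for child in node["children"]:
        edges.append((node["centre"], child["centre"]))
        edges.extend(tree_to_edges(child))
    return edges


# A dense cluster of facilities near the origin of an otherwise empty city.
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
tree = build_quad_tree(points, (0, 0, 8, 8), max_points=4)
roads = tree_to_edges(tree)
```

In this example only the quads containing the cluster keep splitting, so the road vertices become much denser near the origin than elsewhere.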

Bus Network

We build a bus network for public transit travel on the road network. Route generation builds viable routes using a weighted random walk algorithm.

First, a start location is randomly chosen, weighted by the facility density at each vertex. Then, from each start location, the algorithm traverses the network using the weighted random walk. The weights for this traversal are based on (i) the facility density of the possible choices, (ii) the straightness of the resulting route and (iii) the number of times the edges have already been traversed by a route. Routes cannot repeat the previous edge, so a traversal ends either at a dead end in the road network or when a maximum number of stops is reached.

Edges are assigned a distance, a free-speed (based on length) and a free-flow traversal time as per the underlying road network. For simplicity, no consideration is made of a schedule or interchange times.

Initiate start locations based on facility density
Optionally, Poisson sample the number of routes based on population
Generate routes based on weighted random walk
Add generated routes that meet minimum requirements for stops and length
Combine routes to form total bus network graph
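
A weighted random walk of this kind might look something like the sketch below. The weighting formula, the `0.1` straightness offset and all names are assumptions made for illustration; City Chef’s own weights may be combined quite differently. Vertices are `(x, y)` tuples and densities are assumed to be strictly positive.

```python
import math
import random


def straightness(prev, current, candidate):
    """Return 1.0 for continuing straight ahead, falling to 0.0 for a U-turn."""
    ax, ay = current[0] - prev[0], current[1] - prev[1]
    bx, by = candidate[0] - current[0], candidate[1] - current[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0:
        return 1.0
    return (1 + (ax * bx + ay * by) / norm) / 2


def walk_route(graph, density, usage, rng, max_stops=10):
    """Generate one route as a list of vertices; updates usage in place."""
    vertices = list(graph)
    # Start location weighted by facility density at each vertex.
    start = rng.choices(vertices, weights=[density[v] for v in vertices])[0]
    route = [start]
    while len(route) < max_stops:
        current = route[-1]
        prev = route[-2] if len(route) > 1 else None
        options = [v for v in graph[current] if v != prev]  # never reverse
        if not options:
            break  # dead end in the road network
        weights = []
        for v in options:
            # (i) density of the candidate, (iii) penalise already-used edges.
            w = density[v] / (1 + usage.get(frozenset((current, v)), 0))
            if prev is not None:
                w *= 0.1 + straightness(prev, current, v)  # (ii) straightness
            weights.append(w)
        nxt = rng.choices(options, weights=weights)[0]
        edge = frozenset((current, nxt))
        usage[edge] = usage.get(edge, 0) + 1
        route.append(nxt)
    return route


# Toy 3x3 grid road network with density increasing towards the top right.
nodes = [(x, y) for x in range(3) for y in range(3)]
graph = {
    n: [m for m in nodes if abs(n[0] - m[0]) + abs(n[1] - m[1]) == 1]
    for n in nodes
}
density = {n: 1 + n[0] + n[1] for n in nodes}
usage = {}
rng = random.Random(1)
routes = [walk_route(graph, density, usage, rng) for _ in range(5)]
routes = [r for r in routes if len(r) >= 3]  # keep routes with enough stops
```

Sharing the `usage` dictionary between walks is what discourages later routes from duplicating earlier ones.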

Road network (grey) and bus routes (red). Unique routes have distinct stop colours, but can overlap

Rail Network

We build a rail network for public transit travel not restricted by the road network. As per the bus network generation, an algorithm attempts to build viable routes by weighted random walks. Unlike the bus network, we build a new graph of all potential traversals. This graph connects all road network junctions (excluding dead ends) using Delaunay triangulation.

The random walks are seeded and traverse as per the bus routes. Edges are assigned a distance, a speed (based on distance) and a resulting traversal time. For simplicity, no consideration is made of a schedule or interchange times.

Generate a viable network using Delaunay triangulation
Sample routes as per the bus methodology
Discard unused viable network

Road network (grey), bus routes (red), train routes (blue)

Statistical Zones

We generate statistical zones using another quad-tree based on facility density. This is as per the road network, but we also add some random variation to the quad dimensions.

Build an irregular quad tree structure based on facility density
Assign zones to facilities as required

The statistical zones can be used to provide other census outputs, such as population density

Household and Person Features

We use chained generators to sequentially synthesise features for households and their occupants. The relationships between people’s features and those of their household are maintained.

We use these generators, such as the one below, to build tabular data with lots of features, like age, income and occupation. We create joint distributions with various ad-hoc assemblies of random generators — we encourage some experimentation and iteration to build distributions that you are happy with.

A simple example for generating people’s occupation attribute based on age and income
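
The sketch below shows what such a chain might look like. It is not City Chef’s actual generator: the thresholds, weights, occupation labels and column names (including the `hh_` prefix for shared household attributes) are all invented for illustration.

```python
import random

OCCUPATIONS = ["professional", "skilled", "unskilled"]


def gen_age(rng):
    return rng.randint(0, 90)


def gen_income(rng, age):
    """Income depends on age: none for children, reduced after retirement."""
    if age < 16:
        return 0
    base = rng.lognormvariate(10, 0.5)
    return round(base * (0.5 if age >= 65 else 1.0))


def gen_occupation(rng, age, income):
    """Occupation depends on both age and income (the chained step)."""
    if age < 16:
        return "student"
    if age >= 65:
        return "retired"
    weights = [6, 3, 1] if income > 40000 else [1, 4, 5]
    return rng.choices(OCCUPATIONS, weights=weights)[0]


def gen_household(rng, hh_id):
    """Generate a household, then its occupants, keeping the link between them."""
    hh_size = rng.randint(1, 4)
    people = []
    for _ in range(hh_size):
        age = gen_age(rng)
        income = gen_income(rng, age)
        people.append({
            "hh_id": hh_id,
            "hh_size": hh_size,  # shared household attribute
            "age": age,
            "income": income,
            "occupation": gen_occupation(rng, age, income),
        })
    return people


rng = random.Random(42)
people = [p for hh_id in range(50) for p in gen_household(rng, hh_id)]
```

Because each generator takes the outputs of the previous ones as inputs, the resulting table has correlated columns rather than independent ones.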

Example person attributes table, ‘hh’ denotes a shared household attribute

People’s final attributes can be based on hidden features. More generally, all facilities are given hidden features based on spatial patterns that have arisen from the facility and network generation steps. These hidden features are used to add additional complexity and spatial correlations to the regular household and person features. Hidden features include distances to networks, facility densities, or closest facility distance/s:

A simple example for generating a facility hidden attribute
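
As a hedged illustration of one such hidden feature, the snippet below computes each home’s distance to its closest workplace and then min-max scales it. The helper names and the coordinates are hypothetical.

```python
import math


def nearest_distance(origin, targets):
    """Euclidean distance from origin to the closest target."""
    return min(math.dist(origin, t) for t in targets)


def min_max(values):
    """Scale values to the range [0, 1]; constant inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


homes = [(0, 0), (3, 4), (6, 8)]
workplaces = [(0, 0)]

raw = [nearest_distance(h, workplaces) for h in homes]  # [0.0, 5.0, 10.0]
hidden_mm = min_max(raw)                                # [0.0, 0.5, 1.0]
```

A scaled feature like `hidden_mm` can then be fed into the household and person generators, for example to nudge incomes up or down with distance from employment.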

Example household facility hidden features (mm denotes that the data has been min-max standardised)

Activity and Facility Choice

A simple activity plan is generated based on every individual’s features, such as their age, including hidden features, such as spatial densities and distances. This plan consists of a single activity at a given facility location.

Activity type choice is made based on household and individual features; this can include staying at home
Facilities for each activity type are weighted by their desirability (based on hidden density and distances as described above)
Agents randomly choose a facility based on this weighting, the Euclidean distance to the facility and their individual features
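
One way to picture these three steps is the sketch below. The activity weights, the exponential distance decay and the `decay` parameter are assumptions chosen for the example, not values taken from City Chef.

```python
import math
import random

ACTIVITIES = ["home", "work", "leisure"]


def choose_activity(person, rng):
    """Pick an activity type; working-age people are more likely to work."""
    working_age = 16 <= person["age"] < 65
    weights = [1, 6, 3] if working_age else [5, 0, 5]
    return rng.choices(ACTIVITIES, weights=weights)[0]


def choose_facility(home, facilities, desirability, rng, decay=0.5):
    """Pick a facility, favouring desirable and nearby ones."""
    weights = [
        score * math.exp(-decay * math.dist(home, location))
        for location, score in zip(facilities, desirability)
    ]
    return rng.choices(facilities, weights=weights)[0]


rng = random.Random(0)
person = {"age": 30, "home": (0.0, 0.0)}
offices = [(1.0, 0.0), (5.0, 5.0), (9.0, 9.0)]
attractiveness = [1.0, 3.0, 0.0]  # e.g. a min-max scaled hidden density feature

activity = choose_activity(person, rng)
if activity == "work":
    plan = (activity, choose_facility(person["home"], offices, attractiveness, rng))
else:
    # Simplification for this sketch: only work facilities are modelled.
    plan = (activity, person["home"])
```

Individual features could enter the facility choice through `decay`, for instance by giving car owners a smaller value so that distant facilities are penalised less.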

Work facility density feature — larger and redder points denote workplaces with higher attractiveness

Mode Choice

Individuals choose a mode of transport (car, public transit, bike or walk) based on the expected journey times of each mode to their chosen activity facility and on their individual features. Expected journey times are calculated in a somewhat simplistic way to speed up computation for larger populations:

For car and transit modes, calculate journey times between all vertices on their combined networks. Note that no consideration is made of transit schedules or even interchanges. Precomputing this matrix of travel times is only sensible for relatively dense populations and relatively sparse networks
For each person, journey time is calculated by assuming travel via the network vertices nearest to their origin and destination facilities. Travel time to these vertices is based on Manhattan distance
Person mode choice is generated based on expected journey times and personal attributes
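
The sketch below walks through these three steps on a tiny two-vertex road network. It is only an approximation of the idea: Dijkstra’s algorithm is our choice for building the travel-time matrix, the logit-style weighting and the `sensitivity` and `access_speed` parameters are assumptions, and only car and walk modes are shown.

```python
import heapq
import math
import random


def shortest_times(graph, source):
    """Dijkstra: graph maps vertex -> {neighbour: travel_time}."""
    times = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        t, v = heapq.heappop(queue)
        if t > times[v]:
            continue  # stale queue entry
        for n, w in graph[v].items():
            if t + w < times.get(n, math.inf):
                times[n] = t + w
                heapq.heappush(queue, (t + w, n))
    return times


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def journey_time(origin, destination, graph, matrix, access_speed=1.0):
    """Network time between nearest vertices plus Manhattan access legs."""
    def nearest(point):
        return min(graph, key=lambda v: manhattan(point, v))

    start, end = nearest(origin), nearest(destination)
    access = (manhattan(origin, start) + manhattan(destination, end)) / access_speed
    return access + matrix[start].get(end, math.inf)


def choose_mode(times, person, rng, sensitivity=0.1):
    """Weight each available mode by exp(-sensitivity * expected time)."""
    modes = [m for m in times if m != "car" or person["has_car"]]
    weights = [math.exp(-sensitivity * times[m]) for m in modes]
    return rng.choices(modes, weights=weights)[0]


road = {(0, 0): {(10, 0): 2.0}, (10, 0): {(0, 0): 2.0}}
matrix = {v: shortest_times(road, v) for v in road}  # precomputed once

home, office = (0, 1), (10, 2)
car_time = journey_time(home, office, road, matrix)  # 3.0 access + 2.0 network
walk_time = manhattan(home, office) / 1.0
mode = choose_mode(
    {"car": car_time, "walk": walk_time}, {"has_car": True}, random.Random(0)
)
```

Precomputing `matrix` costs one shortest-path search per vertex, which is why it pays off mainly when many people share a comparatively small set of network vertices.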

Wrapping Up

Thanks for reading!

We’d love to know of any application of this project and example applications are very welcome. But there’s lots to be displeased about of course — so we hope you’ll chip in. We have a few ideas for the future:

Technical

Needs linting
Needs doc strings
Very little logging at the moment
No tests
Needs EPSG conversions (for OSM and GTFS outputs)
Needs better structure/API
Many of the classes and methods share features
Bad module names
Speed vs detail trade-off

Theoretical

Accessibility calculation is simplistic
Activity plan synthesis is only simple tour based
Facility locations only contain a single facility
Transit route synthesis is pretty bad, especially for trains

More broadly, City Chef uses directed, tree-like causation for the Data Generating Process. This is primarily for simplicity/speed. But it means that modelling causation across features is very simplified. For example, there is no feedback from network connectivity to facility locations (but there is vice versa). We would like to work on a more dynamic agent-based or cellular automata approach to make our test cities more reflective of patterns we see in the real world.