The Case of the Fake Database
Or… How to Generate Content for a Police Procedural
Way back in 2005 I worked at Team Bondi on “L.A. Noire”, where I was tasked with figuring out how to accurately simulate a population of two million people without consuming much memory or taxing the CPU.
My first instinct was to build a prototype to see whether a lot of the heavy lifting could be done via procedural generation, allowing the game to spawn characters in a particular area, at a specific time of day, and have them be “correct” in the sense of exhibiting realistic goals and behaviours.
The idea was to use a technique akin to arithmetic coding in order to represent each person as a number that would encode their demographic information, according to census statistics from 1947.
I built the prototype in Ruby 1.8.2 on Windows, using the “FXRuby” gem to provide the UI. In 2019 I re-discovered the code on an old drive and, after fixing a single syntax error, was amazed to discover that it worked on OSX under Ruby 2.6.3 with a version of FXRuby released in December 2018.
Think about that. Fourteen years of changes to the language and libraries and a different platform to boot and it still ran! Mind blown.
That’s the “Cast Manager” shown above, a tool for generating “Actors” that match a given query, with the ID of each Actor being used to seed the procedural generation, guaranteeing that they would always be the same without the overhead of having to store any of their data.
The “Event Management System”, shown below, used these Actors to populate simple event templates that could then be used to generate rudimentary newspaper articles.
The goal, which unfortunately never came to fruition, was for events that occurred around the player in the game to happen spontaneously but in accordance with these kinds of rules, allowing the player to become involved, perhaps chasing down the suspect or helping the victim, and then seeing it all described in newspaper articles the following day.
RnDB: A Random Database
Never one to give up on a good idea, I have spent the last two weeks creating RnDB, a Ruby Gem that makes it easy to procedurally generate data like this, and to query it efficiently as if it were stored in a database. Here’s a quick video overview to set the scene.
So, how does that all work? I’m glad you asked…
The core trick is to not generate everything at random, but to impose a strict order instead. This may seem silly when talking about procedural generation, but it makes it possible for us to quickly query and join records.
Consider the age of a random person in the population. It would be easy to take some demographic statistics about age distribution and use that to generate a random number. Do that for all members of the population and you’ll end up with a bunch of numbers. If you then sorted those numbers, and grouped them into categories such as “child” and “adult”, then you’d have something similar to what RnDB uses when generating data.
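That trick can be sketched in a few lines of plain Ruby (illustrative only, not the RnDB API): generate ages at random, then sort them so that a record's position implies its demographic group.

```ruby
# Generate 100 random ages, then sort them. Once sorted, record IDs
# map to contiguous demographic bands.
rng  = Random.new(42)
ages = Array.new(100) { rng.rand(0..90) }.sort

# Everything before the first adult is a child, so IDs below
# children.length are guaranteed to be children.
children = ages.take_while { |age| age < 18 }
adults   = ages.drop(children.length)
```

The point is that once the values are sorted, "find all children" is no longer a scan; it is a contiguous range of IDs.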
Note that this ordering is only something that is apparent when records are sorted by their ID, which should not normally be visible to the end user. Selecting a random record yields a result indistinguishable from that of generating all of its attributes at random.
Thickets of Slices
Dividing and sub-dividing ranges of record IDs into smaller clusters that share the same values for particular attributes is achieved with a data structure that I call a “Thicket”. It’s a bit different to a sparse array or a trie or anything else like that; it really is nothing more than an array of ranges, which represent clusters of IDs (which I call “Slices”) separated by large gaps. It’s lightweight and efficient to iterate and count, and I visualise Thickets as looking something like the “Cantor Set”.
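A minimal sketch of the idea (hypothetical; the real class in the gem does more) shows why counting and iterating are so cheap: a Thicket never materialises its IDs.

```ruby
# A Thicket is nothing more than an array of ID ranges ("Slices").
class Thicket
  include Enumerable

  def initialize(ranges)
    @ranges = ranges
  end

  # Counting sums the sizes of the ranges; no IDs are generated.
  def count
    @ranges.sum(&:size)
  end

  # Iteration walks each slice in turn.
  def each(&block)
    @ranges.each { |range| range.each(&block) }
  end
end

thicket = Thicket.new([(0..4), (100..104)])
thicket.count # => 10
```

Two slices of five IDs each, with a gap of ninety-five IDs between them, cost exactly two Range objects.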
As shown in the video, queries encapsulate one of these data structures, and so really only operate on ranges of IDs. Individual records may be instantiated from a particular ID, and once an attribute of a record is accessed for the first time, RnDB runs some code to generate its value.
Running a “Migration”
But how do we figure out which IDs should be in the Thicket in the first place? That happens when a “Table” is first added to the “Database”. At this point, we specify how many records should exist, and this is used to “migrate” the table, which simply takes the statistics used when defining its attributes and turns them into ranges of record IDs that are stored in a Thicket.
OK, so maybe that does sound confusing! But what it means is this: for an attribute named “colour” with values “red”, “green” and “blue”, where red occurs 60% of the time, green 30% of the time and blue 10% of the time, adding 100 of these records to the database stores the fact that red records have IDs 0 to 59, green records have IDs 60 to 89 and blue records have IDs 90 to 99.
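The arithmetic behind that migration is straightforward; here is one way to sketch it (my own illustration, not the gem's internal code):

```ruby
# Turn a probability distribution over values into contiguous ranges
# of record IDs, in the order the values were declared.
def migrate(distribution, count)
  start = 0
  distribution.each_with_object({}) do |(value, fraction), ranges|
    size = (count * fraction).round
    ranges[value] = (start..start + size - 1)
    start += size
  end
end

ranges = migrate({ red: 0.6, green: 0.3, blue: 0.1 }, 100)
# => { red: 0..59, green: 60..89, blue: 90..99 }
```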
Querying the Database
One of the neat features of RnDB is its ability to run a “Query”. It seems almost crazy to be able to query data that doesn’t exist yet, but all we’re doing is homing in on clusters of IDs that satisfy the query. So, for example, querying the previous database for records with a colour of green simply returns the IDs 60 to 89, wrapped in a Thicket that makes it very efficient to count that there are 30 of them, select a random subset of them, iterate through them one at a time, or whatever you like.
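Concretely, everything a query needs reduces to range arithmetic (a sketch with a bare Range standing in for the gem's Query class):

```ruby
# The "colour = green" query over the 100-record table above is
# just the range of IDs 60..89.
green = (60..89)

count  = green.count    # 30, computed without building any records
first  = green.first(5) # iterate lazily from the front
sample = green.to_a.sample(3, random: Random.new(1)) # random green IDs
```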
We operate on clusters of IDs almost everywhere, not bothering to instantiate records at all until absolutely necessary. And, even then, a record consists of nothing more than its ID, with its attributes generated only on demand.
Accessing a record’s attributes is where the procedural generation comes into play. We only bother generating values at the point of first access, for the sake of efficiency. And this happens in a few stages.
- To begin with, we seed the random number generator with a tuple that is formed by concatenating the seed of the database itself, the name of the table, the ID of the record and the name of the attribute we want to generate. So something like 137-person-19685027-occupation, if we want to generate the occupation attribute of record ID 19685027 from the Person table in the database with seed 137. Explaining how to transform this tuple into a number that can be used as a seed is best left to the code itself. Seeding is done this way to ensure that generating an attribute has no side effects, allowing us to access them in any order.
- If the attribute was defined with a probability distribution, such as with the colour attribute above, then we can immediately calculate its value from the record ID. In some cases, this may be all we need to do.
- But if a block of code was supplied when defining the attribute, then that code is now executed, with its return value becoming the attribute value. This code may do anything, even accessing the value of other attributes or records in other Tables in order to do its work.
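The seeding scheme in the first step can be sketched in plain Ruby. The choice of Digest::MD5 here is my own assumption; the gem may transform the tuple into a seed differently.

```ruby
require 'digest'

# Hash the (db seed, table, record ID, attribute) tuple into a
# deterministic 32-bit seed. NOTE: MD5 is an assumption; RnDB's
# actual hashing may differ.
def seed_for(db_seed, table, id, attribute)
  tuple = "#{db_seed}-#{table}-#{id}-#{attribute}"
  Digest::MD5.hexdigest(tuple).to_i(16) % (2**32)
end

rng = Random.new(seed_for(137, :person, 19_685_027, :occupation))
# The same tuple always yields the same random stream, so generating
# one attribute can never perturb the value of another.
```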
The final trick in the bag is automagically calculating associations between records. This uses the fast query mechanism discussed previously, and involves running two queries and mapping between them: find the index of the current record’s ID in the first query, then use that index to look up the corresponding ID in the second query.
For instance, if MarriedWomen is a query that contains the IDs of all married women, and MarriedMen contains the IDs of all married men, and if I am a married woman, then the ID of my husband will be this:
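In sketch form (the IDs are made up, and plain arrays stand in for the real query Thickets):

```ruby
# IDs of married women and married men, each in ascending ID order.
married_women = [3, 8, 21, 34]
married_men   = [5, 12, 19, 40]

my_id      = 21
husband_id = married_men[married_women.index(my_id)]   # => 19

# The mapping is bidirectional: the same positional lookup in
# reverse recovers the wife's ID.
wife_id = married_women[married_men.index(husband_id)] # => 21
```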
That is, I find out where my ID is in MarriedWomen, and I find the ID in the same position in MarriedMen. This is super-fast and bidirectional.
It’s a kind of magic.
Try it Yourself
Head over to GitHub and take a look at the RnDB source code, then install the Gem and have a play with it yourself. Let me know what you think!