Fake It til You Make It

Because good test data is hard to create

Published in Top Python Libraries · 4 min read · Oct 3, 2024

Half of testing any project involves throwing varying data at it to see what breaks. If you are creating medium to large datasets to feed to your project by hand… I feel for you. I was there years ago, and would quickly tire of trying to create records with reasonable data that would be identifiable in the output (you need that to track down the input record that caused the error or incorrect output).

[Image: a JSON file created using Faker]

Today, Python has a module called faker that can help you handle this somewhat mundane task. It supplies routines to create names, addresses, phone numbers, e-mails, and even integers and floating point numbers. You can restrict the numeric values to given ranges, and you can localize everything to one or more locales and languages.

You might ask why you’d need to generate random data at all, especially if the data you really want to process already exists. There are at least two reasons:

The real data may have problems you haven’t considered. While that might seem to favor the argument for using it, consider this: while initially debugging your code, wouldn’t you want to work with data you know is clean? Real data contains real surprises, and wouldn’t you prefer to know whether the problem you’re seeing is caused by the data or by your code?
Legally, there are rules in place that might actually prohibit you from using live data. In the U.S., there is a law concerning medical data known as the Health Insurance Portability and Accountability Act of 1996, or HIPAA, which restricts the use of people’s medical data.

Getting started

If you don’t have it already, use pip to install the module:

pip install faker

Then import it into your program:

from faker import Faker

We can then create a Faker object to use to generate our data:

fake = Faker()

What’s there?

We can now use our object to create different types of data. If you’re using a REPL, try the following commands:

fake.name()
fake.address()
fake.url()
fake.email()

Each of these methods is supplied by a “provider”. Faker comes with a slew of them, but if you have a special need (such as “medical diagnosis” or “car part name”), you can write your own as well:

from faker import Faker
from faker.providers import BaseProvider
import random

# Custom provider: its public methods become available on the Faker object
class Medical(BaseProvider):
    _diagnosis = ['Diabetes', 'Cardiac', 'Rhinitis']

    def diagnosis(self):
        # Pick one diagnosis at random
        return random.choice(self._diagnosis)

fake = Faker()
fake.add_provider(Medical)

diag = fake.diagnosis()

Going beyond the borders

You aren’t limited to US styled data. When you create your object, you can specify one or more locales.

from faker import Faker

fake = Faker(['en_US', 'en_AU', 'sv_SE'])

for _ in range(10):
    print(fake.name(), fake.address())

Where can I find more?

For more complete documentation and a list of the default set of providers, see the Faker project on GitHub.

Putting it all together

Now, let’s put this all together and create a JSON file of data with several fields.

from faker import Faker
import json

# 'en_GB' is Faker's locale code for UK English
fake = Faker(['en_US', 'en_GB', 'ja_JP'])

data = []

for _ in range(100):
    rec = {
        'name': fake.name(),
        'company': fake.company(),
        'position': fake.job(),
        'address': fake.address(),
        'telephone': fake.phone_number(),
        'e-mail': fake.email(),
        'color': fake.color_name(),
        'count': fake.random_int(min=10, max=50),
    }
    data.append(rec)

# ensure_ascii=False keeps the Japanese text readable in the file
with open('generatedData.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

This data has fields in US English, UK English, and Japanese, and creates a usable file to test with.

Since the data is randomly generated, it isn’t drawn from any real person’s records. The addresses are made up, as are the companies and job titles.

Once your program works with this data, then you can begin to alter it to test your error detection and edge condition processing, all without exposing any client data.

A loosely related story if you made it this far

I worked for a company that printed a weekly customer list, on five-part stock. The reports were distributed to various departments for their internal use, and several weeks later, they were collected and shredded, so that competitors couldn’t go through our trash and steal our customer lists. Sounds legit, right?

But… after printing and decollating (splitting the five-part paper into individual reports), what was done with the carbons? Well, they were so flimsy that they wouldn’t shred easily, so they were just thrown away.

So the competitors we were so worried about could wait several weeks and try to piece together the shredded strips of the old report… or pick up the pristine carbons of the current report. This was the state of data security in the “good old days”.

Also note that we never had a documented occurrence of any theft of this data, but shredding the reports gave management the warm fuzzy feeling of having done something.