Generate a Dataset Automatically for your Machine Learning Projects

Paul Livesey
Data Science Tips and Tricks
6 min read · Oct 17, 2018


Mockaroo is a very cool web app for generating rows and rows of data exactly to your specification. Go along to the site and sign in using Google or Facebook. Once in, you are ready to start creating some lovely data.

The first thing you will see is the following:

The initial default schema

It’s all set up with a default schema. I will show you how you can change this shortly.

Before that, you should set up the schema basics. At the bottom of the schema, you will find a box where you can enter the number of rows. The default of 1000 rows is the maximum you get for free. Anything more requires a paid plan, which starts at around $50 a year (pretty good if you do a lot of this kind of thing). It then goes up to $500 a year for the really BIG datasets.

The bottom of the page also allows you to change the output file format and there are lots of choices, including, of course, CSV and Excel, along with the slightly more interesting types such as SQL, XML, and Firebase.

You can also select the line-ending type, whether the headers are included, and whether to add the BOM (the UTF-8 byte order mark, if you know what that means. I sure as hell don’t!).
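For the curious, the UTF-8 byte order mark is just three bytes (EF BB BF) that some tools put at the very start of a file. Here is a minimal Ruby sketch of spotting and stripping it. The helper name `strip_bom` and the sample text are made up for illustration; none of this is Mockaroo code.

```ruby
# The byte order mark is the single character U+FEFF,
# which UTF-8 encodes as the three bytes EF BB BF.
BOM = "\uFEFF"

# Hypothetical helper: drop a leading BOM if there is one.
def strip_bom(text)
  text.delete_prefix(BOM)
end

# Made-up CSV text that starts with a BOM.
with_bom = BOM + "id,first_name\n1,Ada\n"

puts with_bom.start_with?(BOM)            # is the BOM there?
puts strip_bom(with_bom).start_with?(BOM) # and after stripping?
```

If whatever reads your file chokes on a strange invisible character in the first header name, this option is the likely culprit.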

Now, let’s look at an example of what the default schema generates. If you click the Preview button, you will see first a table view:

Table View of the Data

… and if you click on the csv tab at the top, you will see, funnily enough, a csv layout. Nice!

CSV View of the Data
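Once you have downloaded a CSV file, reading it back in takes only a few lines. The sketch below uses Ruby's standard csv library on a small inline sample. The column names and rows are invented for illustration and are not necessarily what the default schema produces.

```ruby
require "csv"

# Made-up sample standing in for a downloaded Mockaroo file.
SAMPLE = <<~ROWS
  id,first_name,email
  1,Ada,ada@example.com
  2,Grace,grace@example.com
ROWS

# headers: true treats the first line as column names.
table = CSV.parse(SAMPLE, headers: true)

table.each do |row|
  puts "#{row['id']}: #{row['first_name']} <#{row['email']}>"
end
```

For a real file you would swap `CSV.parse(SAMPLE, ...)` for `CSV.read("your_file.csv", headers: true)`, where the file name is whatever you saved the download as.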

But we don’t want the default. We want to generate our own stuff. Give your new fields a name. You can add more fields, delete any that are there and shift them around using the handy little tab things on the left.

Next, click on the box in the Type column. This is where it starts to get more interesting:

All of the data types that Mockaroo can generate for you

As you can see from this partial screenshot of the dialog that appears, there are many, many options that are already prepared for you. You can also build your own data types with formulas, but before we get to that, let’s take a look at some examples from the above.

A Custom List allows you to enter some strings…

…and Mockaroo will choose the data that goes into the dataset with one of the following selection methods:

Here is a randomized version of the items I listed above:
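To get a feel for what picking from a custom list means, here is a rough Ruby sketch. The list of animals is made up, and the two methods, `random_pick` and `sequential_pick`, are my own illustrative names. They are not taken from Mockaroo and may not match its actual selection methods.

```ruby
# Hypothetical list, standing in for whatever you type into the Custom List box.
ANIMALS = ["Cat", "Dog", "Chinchilla", "Panda"].freeze

# Pick `count` items at random, with repeats allowed.
def random_pick(items, count, rng: Random.new)
  Array.new(count) { items.sample(random: rng) }
end

# Walk through the list in order, wrapping around at the end.
def sequential_pick(items, count)
  Array.new(count) { |i| items[i % items.size] }
end

puts random_pick(ANIMALS, 5).inspect
puts sequential_pick(ANIMALS, 5).inspect
```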

Another interesting choice is Dummy Image URL. You get the following schema layout:

As you can see, you can change the image size and format. It will generate a list of URLs similar to the following:

The data created for the Dummy Image URL

dummyimage.com is a website that generates placeholder images from the size and format given in the URL. Here is an example image generated from that page:

dummyimage.com generated image
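If you ever need to build such URLs yourself, something like the following Ruby sketch would do. The method name `dummy_image_url`, its defaults and the exact URL layout (size, extension, then background and foreground colours) are my assumptions about how dummyimage.com addresses images, so check them against the URLs Mockaroo actually gives you.

```ruby
# Hypothetical helper that assembles a placeholder image URL.
# Assumed layout: /<width>x<height>.<format>/<background>/<foreground>
def dummy_image_url(width:, height:, format: "png",
                    background: "dddddd", foreground: "000000")
  "https://dummyimage.com/#{width}x#{height}.#{format}/#{background}/#{foreground}"
end

puts dummy_image_url(width: 250, height: 100)
puts dummy_image_url(width: 640, height: 480, format: "jpg")
```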

The final built-in data type from the people at Mockaroo that we will look at is the interestingly titled Naughty String.

Naughty, Naughty!

Here, you will find ‘Strings which have a high probability of causing issues when used as user-input data’, i.e. data that could potentially screw up your system. Useful for testing your code’s robustness:

That is very bad!
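Here is a small Ruby sketch of how you might put such strings to work. The sample strings are ones I made up in the same spirit (they are not Mockaroo's list), and `render_name` is a hypothetical stand-in for whatever code of yours handles user input.

```ruby
# Made-up examples in the spirit of naughty strings.
NAUGHTY_SAMPLES = [
  "",
  "<script>alert('x')</script>",
  "Robert'); DROP TABLE Students;--",
  "null",
  "田中さん"
].freeze

# Characters that must not reach an HTML page unescaped.
HTML_ESCAPES = {
  "&" => "&amp;", "<" => "&lt;", ">" => "&gt;",
  '"' => "&quot;", "'" => "&#39;"
}.freeze

# Hypothetical function under test: wraps user input in a paragraph tag.
def render_name(input)
  "<p>#{input.gsub(/[&<>"']/, HTML_ESCAPES)}</p>"
end

# Throw every naughty string at it and complain if markup slips through.
NAUGHTY_SAMPLES.each do |sample|
  html = render_name(sample)
  raise "unescaped markup for #{sample.inspect}" if html.include?("<script>")
end
puts "All samples handled"
```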

The last thing we are going to look at is how to generate your own custom fields. You can apply the blank option or the formula option to any of the generated data types, or use them on their own to generate data with some code.

The blank and the function options

The blank option allows you to fill the data with a certain percentage of blank items. Here, I have set my Animal field to 50%:

This is the result:

Animals at 50% blank-osity
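One way to picture the blank option is as a coin toss on every row. This is only my guess at the idea, sketched in Ruby, and not how Mockaroo necessarily implements it. The method name `with_blanks` and the animals are invented.

```ruby
# Hypothetical helper: blank out roughly `percent_blank` percent of the values.
# Each value is replaced by nil with that probability, independently of the others.
def with_blanks(values, percent_blank, rng: Random.new)
  values.map { |value| rng.rand(100) < percent_blank ? nil : value }
end

animals = %w[Cat Dog Chinchilla Panda Yak Emu Newt Crab]
puts with_blanks(animals, 50).inspect
```

Because each row is decided on its own, a small dataset will rarely come out at exactly 50% blank. The proportion only settles down over many rows.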

Functions, the other custom field, is probably the most useful and interesting thing in the whole of Mockaroo. It is also the most complicated.

Here, you can use Ruby code along with lots of built-in functions to generate your own custom data. You can check out the full power of the formulae here, but let’s take a look at some examples:

If we add the simple formula date('7/4/2015') from the tutorial page, we get:

The date

If we expand on this with some Ruby code:

if this == "4/6/2018" then "Yo!" else "No!" end

We end up with:
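The one-line if/then/else above is ordinary Ruby, so you can try it outside Mockaroo too. In this sketch the method `yo_or_no` is my own wrapper, and its parameter named `this` merely imitates the value Mockaroo hands to a formula. Here it is a plain string, which may differ from what Mockaroo actually passes for a date field.

```ruby
# Hypothetical wrapper: `this` plays the role of the field's generated value.
def yo_or_no(this)
  if this == "4/6/2018" then "Yo!" else "No!" end
end

puts yo_or_no("4/6/2018") # matches, so "Yo!"
puts yo_or_no("7/4/2015") # anything else, so "No!"
```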

Now, I am going to try and generate the thing I came to this web app to do, three hours ago, before I got distracted writing this article. I want to generate some strings in a numeric sequence, along the lines of…

AC_Record 1
AC_Record 2

all the way to…

AC_Record 13

After a bit of playing around and looking up how to convert integers to strings in Ruby, I came up with the following…

First I got Mockaroo to create a sequence of numbers using the Sequence Type:

Then I prepended my string to this (which grabs the data generated by the type, i.e. 1, 2, 3, etc.) and used the Ruby method .to_s to convert the number to a string. Voilà!

"AC_Record " + this.to_s

Which gives me:

What I came here to do!
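For comparison, the same result in plain Ruby, with no web app involved. The method name `ac_records` is made up, and the block variable is called `this` only to mirror the formula. The range 1..13 plays the part of the Sequence type.

```ruby
# Build "AC_Record 1" through "AC_Record <count>".
def ac_records(count)
  (1..count).map { |this| "AC_Record " + this.to_s }
end

records = ac_records(13)
puts records.first # AC_Record 1
puts records.last  # AC_Record 13
```

The .to_s call matters because Ruby will not add an Integer to a String for you. Leaving it out raises a TypeError.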

What a lovely web app! There are many, many more options to play around with. Well worth all of those dollars you are earning with your high-paying Data Science jobs.

Disclaimer: I am not affiliated with the (probably) very fine people at Mockaroo or with any of their relatives. None of them knows me, my wife or my pet Chinchilla.


Paul Livesey
Data Science Tips and Tricks

I am a teacher of Computer Science, currently living in China. At the same time I am trying to complete a Master’s at Georgia Tech in Machine Learning.