AI Dev Tips #4 — ChatGPT Creates Fake Data for our DB Schema

Chris St. John

Published in

AI Dev Tips

9 min readAug 18, 2024

Step-by-step: AI magic to create data for any schema in 5 minutes.

In This Tutorial:

Recap of creating a schema with ChatGPT.
Why use fake data?
Prompt considerations for generating fake data.
Basic prompts for ChatGPT.
Handling errors when inserting data.
Testing and improving the process.

1. Recap: Getting the Schema from ChatGPT, Gherkins

In an earlier article, we used ChatGPT to generate a PostgreSQL schema from natural language input.

You will need a schema for this.

If you’re not sure what I am talking about, do that AI Dev Tips 1 tutorial first, it takes like 5 minutes.

See: AI Dev Tips 1: Business Idea to PostgreSQL Database Schema in 5 Minutes: Transform a business idea into a PostgreSQL database schema in just five minutes, including tables, attributes, and types.

Quick recap of that article:

Input: We provided ChatGPT with natural language of our business ideas. We could (and probably should) also use Gherkin scenarios.

In previous early AI Dev Tip tutorials I forgot about mentioning one aspect I always mention early in the dev process — you should really use Gherkins from the beginning. This helps get the stakeholders, product manager thinking about everything they really need, and helps avoid potential later issues.

Gherkin scenarios: structured, plain-text descriptions of specific behaviors or features in a software application, typically written in a Given-When-Then format. They are useful because they allow both technical and non-technical stakeholders to understand and agree on the expected behavior of a system. Research this with Google, but there are actual standardized format recommendations.

Example Gherkin:

Feature: Add a new book to the library database

  Scenario: Successfully adding a new book with all required details

    Given the librarian is on the "Add New Book" page
    When the librarian enters a title, author, publication date, and ISBN
    And clicks the "Save" button
    Then the book should be saved in the database
    And the librarian should see a confirmation message
    And the book should be listed in the "Books" table with the correct details includin
    And the details should include "title", "author", "pages", "isbn"

Output: ChatGPT generated the schema, which we then inserted into our PostgreSQL database using pg-sql.com.

create a schema: Read the article

This schema will be our foundation for generating fake data.

btw, subscribe to AI Dev Tips (follow), clap, share, follow me. Thanks!

2. Why use fake data (aka “mock data”)?

Next we’ll create the fake data. But why?
Why Create Fake Data?

Prototyping: Fake data allows for the rapid prototyping of features, enabling teams to iterate quickly and validate ideas without waiting for real data to become available. In our case, we’re going to also simulate various queries.
Testing & Development: Allows us to simulate real-world scenarios without compromising actual data or needing to get real users. We can actually simulate and test a wide range of ideas with data already populated.
Performance Tuning: Large volumes of fake data can be used to test the scalability and performance of a system under load, helping to identify bottlenecks or areas that need optimization.
Learning, Training & Demonstration: Fake data can be used to learn new ideas, train machine learning models or to demonstrate software functionality without risking exposure of confidential data.
Privacy & Compliance: For testing compliance with data protection regulations, such as GDPR, by avoiding the use of sensitive or personally identifiable information (PII) during testing.

Also, note, that these examples are for PROTOTYPING database schema and query ideas. It’s likely your actual production database will be much more complex. Doing this with a large very complex database of dozens of tables may not be as successful (or possible).

3. Considerations: Creating Fake Data with AI

When generating fake data, we need to consider the following aspects of your schema:

Data Types & Attributes: Ensure you define clear data types (e.g., integer, varchar, timestamp) and attributes. We did this already, but you are using your custom example, so check this.
Constraints: Include any unique, primary, or foreign key constraints. These rules are applied to database columns to ensure data integrity: a unique key ensures that each value in a column is distinct; a primary key uniquely identifies each record in a table; and a foreign key establishes a link between two tables, ensuring that the value in one table corresponds to a valid entry in another. Make sure you have that in your schema, if not ask ChatGpt to validate and/or create it.
IMPORTANT: Table Relationships: Understand the relationships between tables. Again this is related to the keys above.

This is IMPORTANT because a database with a lot of tables and a particular relationship pattern may get errors while creating the data if not paying attention to this.

AI seems to be getting better, but ordering has been a source of errrors in the past. Example: you put a table referencing another that was not created yet, it may get an error.

Data Ranges & Patterns: Specify any required data ranges, text patterns, or special characters. You may want to test certain ideas and types of data. Make sure in your prompt you are clear aboout this.
Volume of Records: Determine how many records are needed per table. I am going to start with just 20. ChatGPT will max out at some point, you may be able to 100.
If you need more rows, you have some options (1) ask for 50 or 100 at a time and just add them to your insert. (2) Write a script or hav ChaGPT write ascript that create the randomized data.

4. Generating Fake Data with ChatGPT

Here’s how to prompt ChatGPT to generate the fake data:

Initial Prompts (basic version)

I am only going to enter my first 2 tables (the entire schema is more), so I’m sure not to get initial errors.

ChatGPT 4o:
Please create fake data rows that follow the attributes and types for these tables in this schema and output as INSERT SQL statements for my database. <insert your schema in any format, perhaps use the first couple tables for testing>

Output: ChatGPT provides SQL INSERT statements, but initially, it may generate only a few records for each. That’s ok for this demo, but you may need to ask for more or aggregate them.

Next I put this into pg-sql.com (as mentioned in the earlier AI Dev Tips articles)

Hit RUN. It said the query was successful! That was easy!

Now I’ll do the remaining ones, repeating the input step at the beginning of #4 above but with the other tables:

Just to make sure this is a success I added “Make sure they connect to the fake data already created”

Output:

Enter that ChatGPT outputed INSERT query into PG SQL site:

Query successful!

Continue below…

Quick promo: I have a couple new Ebooks you may want to check out, for cloud engineering professionals (my store has many):

Ebook: Get my Cloud Audit ebook / Full Store: SystemsArchitect Store

5. Handling Errors: Foreign Key Constraint Issues

Sometimes, you might encounter errors like this:

INSERT or UPDATE on table “cart” violates foreign key constraint “cart_user_id_fkey”

That is caused by an ordering issue of creating the tables/data.

ChatGPT 4o:
“I need to create a lot of fake data for my schema; however, I’m having a problem with foreign key constraints. Please give me the correct order in which I should insert data into the tables.”

ChatGPT Output: Provides the correct order, such as users -> authors -> books -> cart, ensuring foreign key constraints are respected.

6. Testing and Inserting Data

Now that you have the correct data and order:

Insert Data: Copy and paste the generated SQL INSERT statements into your PostgreSQL query tool, ensuring you follow the correct order.
Verification: Run SELECT * queries to confirm that the data has been inserted correctly.

ChatGPT 4o:
Based on the fake data and tables please give me a list of select queries and basic queries I can run from the SQL console:

Typed this into pg-sql.com and got the results:

SELECT * FROM Users;

Let’s try another slightly more complicated

SELECT * FROM Authors WHERE date_of_birth >= ‘1900–01–01’ AND date_of_birth < ‘2000–01–01’;

That worked, how about one that involves 2 tables.

Get all order details along with the user who placed the order:

SELECT Orders.order_id, Orders.total_amount, Users.username
FROM Orders
JOIN Users ON Orders.user_id = Users.user_id;

That worked.

Friends, if you made it this far… we’re sitting on a goldmine here.

Look into it some more. I am creating more advanced examples in upcoming articles.

7. Improvements and Tips

To enhance this process:

Smaller Batches: Insert data for a few tables at a time to avoid overwhelming ChatGPT.
Custom Constraints: Specify additional constraints (e.g., price ranges, text patterns) to fine-tune the generated data.
Error Handling: If errors occur, re-evaluate the prompts and adjust as needed.
Validation: Always validate the generated data using other tools to ensure accuracy.

This tutorial showcased how to use ChatGPT for generating fake data, a task that can be tedious and time-consuming. By following the steps outlined above, you can streamline this process, freeing up time to focus on more critical aspects of your development work.

Subscribe to AI Dev Tips (follow), clap, share, follow me. Thanks!

I have a couple new Ebooks you may want to check out:

Ebook: Get my Cloud Audit ebook / Full Store: SystemsArchitect Store

About me

I’m a cloud architect, senior developer and tech lead who enjoys solving high-value challenges with innovative solutions.

I’m always open to discussing projects. If you need help, have an opportunity or simply want to chat, you can reach me on X/Twitter @csjcode and same username gmail. Also, can schedule a Zoom here (lower on page): https://systemsarchitect.io/docs/consulting

I’ve worked 20+ years in software development, both in an enterprise setting such as NIKE and the original MP3.com, as well as startups like FreshPatents, SystemsArchitect.io, API.cc, and Instantiate.io.

My experience ranges from cloud ecommerce, API design/implementation, serverless, AI integration for development, content management, frontend UI/UX architecture and login/authentication. I give tech talks, tutorials and share documentation of architecting software. Also previously held AWS Solutions Architect certification.

Also check out all my ebooks: SystemsArchitect Store

Recently I’m working on Instantiate.io, a value creation experiment tool to help startup planning with AI. I’m making progress writing a reference manual on cloud metrics.

Also, an enthusiast of blockchain, I’m active working on applications in the innovative Solana blockchain ecosystem, like Doerz.fun rewards platform.

Thanks for reading and hope to see you around!