Building a Data Seeder/Loader for your database…in Python

Michael Brown
9 min read · Mar 28, 2023

Some time ago, I was challenged with building a data seeder for a social media app I was working on. What I came up with worked; it was modular and simple to use… but it was a complex piece of code with many logical complications, and, like all such modules, it was quickly made obsolete by changes to the data contract, so maintenance became a serious problem. I researched the subject a lot but didn't find anything satisfactory, so I came up with a class-reflection based approach that deep dives into my classes and relies on the pure magic that SQLAlchemy brings.

Getting Started

First and foremost, I want to say that this article is truly a love letter to SQLAlchemy. I have been using it for many, many years, and the only ORM that comes anywhere close to it is Entity Framework in C#. For those purists who say ‘just write the SQL by hand!’ — folly I say, there's nothing, including performance tuning, you can't do with SQLAlchemy if you know what you're doing.

That said, there are a few steps we need to follow for this to actually work and be maintainable (or need no maintenance at all).

First, what is reflection?

Reflection refers to the ability of code to examine the attributes of objects that might be passed to it as parameters. For example, if we write type(obj), Python returns an object that represents the type of obj.

By using reflection, we can create an abstract and programmatic way to discover our data types, foreign key relationships, and constraints.
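As a quick, minimal illustration (plain Python with toy names, nothing SQLAlchemy-specific yet), this is the kind of information reflection hands us for free:

# A tiny reflection demo: type() and getattr() let us discover and use
# attributes of an object we know nothing about ahead of time.
class Example:
    answer = 42

obj = Example()
print(type(obj).__name__)      # 'Example'
print(getattr(obj, "answer"))  # 42 -- attribute looked up by name at runtime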

Let's start by defining our data models:

import contextlib
import uuid
from datetime import datetime, timedelta
from typing import Iterator
from sqlalchemy import create_engine, MetaData, DateTime, String, Integer, Float, Boolean, Table, Column, Interval, ForeignKey
import pytz
from random import randint, choice
from faker import Faker
from types import SimpleNamespace
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.engine import Engine
from sqlalchemy.exc import IntegrityError
from sqlalchemy.orm import registry, sessionmaker, relationship, backref

# Explicitly create the registry and the declarative base we reflect against
mapper_registry = registry()
Base = mapper_registry.generate_base()


class ModelMixin(Base):
    # we build an abstract mixin with a common method we would like to implement;
    # cls.session is attached to the shared Base by the seeder later on
    __abstract__ = True

    @classmethod
    def get_or_create(cls, *_, **kwargs):
        context_id = kwargs.get("id")
        instance = cls.session.query(cls).filter_by(id=context_id).first()
        if instance:
            return instance
        instance = cls(**kwargs)
        cls.session.add(instance)
        cls.session.commit()
        return instance


class User(ModelMixin):
    __tablename__ = "user"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4, index=True)
    email = Column(String(255))
    username = Column(String(255))
    password = Column(String(255))
    active = Column(Boolean)
    createdAt = Column(DateTime, default=lambda: datetime.now(pytz.utc))
    updatedAt = Column(DateTime)
    accessToken = Column(String(255))

    agentId = Column(
        UUID(as_uuid=True),
        ForeignKey("agent.id", ondelete="CASCADE", link_to_name=True),
        index=True,
    )
    createdByAgent = relationship(
        "Agent",
        foreign_keys="User.agentId",
        backref=backref("users", single_parent=True),
    )

Here we have a User object with a foreign key relation to an Agent object: the user was created by the agent and back-references its parent.

We also build a model for payment history, mainly so we can make sure all foreign keys are being followed when we seed.

class Agent(ModelMixin):
    __tablename__ = "agent"
    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4, index=True)
    email = Column(String(255))
    password = Column(String(255))
    username = Column(String(255))
    active = Column(Boolean)
    createdAt = Column(DateTime)
    updatedAt = Column(DateTime)
    accessToken = Column(String(255))
    quota = Column(Integer, default=0)


class PaymentHistory(ModelMixin):
    __tablename__ = "payment_history"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4, index=True)
    beforeScore = Column(Integer)
    changeScore = Column(Integer)
    newScore = Column(Integer)
    approval = Column(Boolean)
    createdAt = Column(DateTime, default=lambda: datetime.now(pytz.utc))
    approvalAt = Column(DateTime)
    ownerId = Column(
        UUID(as_uuid=True),
        ForeignKey(
            "user.id",
            ondelete="CASCADE",
            link_to_name=True,
        ),
        index=True,
    )
    owner = relationship(
        "User",
        foreign_keys="PaymentHistory.ownerId",
        backref=backref("paymentHistory", single_parent=True),
    )

One of the challenges we face here is seeding data with a non-auto-incrementing id field — this is a best practice for distributed systems, but it can be challenging when writing data loaders for your db. Another pitfall is foreign key relationships — they are useful for data integrity purposes, but man, can they really foul up your intentions to test your tech stack at scale.

The First Step

Now, if we want to generate data here, there are some logical complexities we need to work through to ensure we can seed this data:

  • We need to access the non-instantiated User and Agent classes AFTER they have entered the SQLAlchemy mapper registry. To do this, we need to import the module, take the table name from it, and convert it to Pascal case, which gives us the class name (as sketched below).
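In miniature, the idea looks like this (the app.api.payment_history path is just a stand-in for your own package layout):

# Hypothetical illustration: resolve a table name back to its mapped class name.
table_name = "payment_history"
class_name = "".join(word.capitalize() for word in table_name.split("_"))
print(class_name)  # 'PaymentHistory'

# The class itself can then be pulled off the (already imported) module by name:
# module = __import__("app.api.payment_history", fromlist=[table_name])
# model = getattr(module, class_name)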

This example assumes you have a Config class (maybe in a settings.py file) which reads config values with dotenv.

import os
from dotenv import load_dotenv

load_dotenv()

class Config:
    postgres_connection: str = os.getenv("POSTGRES_CONNECTION", "")
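For completeness, the POSTGRES_CONNECTION value I assume here is everything after the driver prefix, in the usual user:password@host:port/dbname form (yours may differ):

# Hypothetical .env entry read by load_dotenv() above:
# POSTGRES_CONNECTION=seeder:s3cret@localhost:5432/seeder_db

# create_engine() later prepends the driver prefix to this value:
engine = create_engine(f"postgresql+psycopg2://{Config.postgres_connection}")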

Now that our config and setup are done, let's set up our seeder/loader as a class:

class DataSeeder:
    """
    Instantiated with:
        engine = create_engine(
            f"postgresql+psycopg2://{Config.postgres_connection}")
        DataSeeder(engine=engine, ...)
    """

    def __init__(
        self,
        include_models: list,
        engine: Engine,
        number_of_records: int = 10,
    ):
        self.include_models = include_models
        self.number_of_records = number_of_records
        Session = sessionmaker(bind=engine)
        self.session = Session()
        # share the session with the models so ModelMixin.get_or_create can use it
        Base.session = self.session
        self.fake = Faker()
        self.metadata = MetaData(bind=engine)
        self.metadata.reflect()
        self.mapped = {}

So we are working inside a class context, which has the advantage of sharing some properties. We will start by defining some helper methods.

Inspecting and Reflecting the Models

These three methods largely make up what I consider to be an extraordinarily hacky abstraction — but sometimes the most beautifully terrible ones work the best. In this case, we are effectively running a linear search that explores a given array of items to its maximum depth and compares each item against the recursive contents of another array. If that sounds convoluted, that's because it is — but we have to ensure that all tables, all columns, and all rows are caught.

# the models to reflect, e.g. ['Agent', 'User'], are passed to the constructor
# as include_models and should match your package names so the imports resolve
@staticmethod
def snake_to_pascal_case(name: str) -> str:
    return "".join(word.capitalize() for word in name.split("_"))

def get_model_class(self, table_name: str) -> Base:
    for model in self.include_models:
        # we suppress import errors here so we can keep iterating to find the model
        with contextlib.suppress(ImportError, AttributeError):
            module = __import__(
                # note, you should point it to your own folder structure,
                # mine is a FastAPI app (app.api.x)
                # protip, make sure you use __init__.py files and pythonpath
                f"app.api.{model}",
                fromlist=[table_name],
            )

            # convert the table_name to TableName
            class_name = self.snake_to_pascal_case(table_name)

            # return the actual class itself
            return getattr(module, class_name)

def get_model_metadata(self) -> Iterator:
    # import metadata for all models so they register themselves on Base
    for model in self.include_models:
        with contextlib.suppress(ImportError):
            # ya i know, we're using exec here... deal with it
            exec(f"from app.api.{model}.models import ModelMixin as Base")

    # loop through the models in the registry
    for table in sorted(
        Base.metadata.sorted_tables,
        key=lambda t: t.name,
        # sort them in reverse alphabetical order
        reverse=True,
    ):
        # ensure the table we are targeting exists in the reflected database
        if table.name not in self.metadata.tables:
            continue
        # build a generator for the data
        if model := self.get_model_class(table.name):
            model.name = model.__name__
            yield model, table

Preparing Data Types

Now that we have a mapping of our class names and the MetaData imported, we can proceed with handling the data types for each column — this is where some of the magic happens, because it lets us access the types with dot notation… strictly speaking, we could have done this a lot more easily, maybe with a Pydantic model — but simple namespaces are good enough… for now.
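If you haven't met SimpleNamespace before, it's just a lightweight object with dot access; a tiny illustration (toy values, not tied to any particular column):

from types import SimpleNamespace

# attribute-style access without defining a class
pair = SimpleNamespace(type=Integer, fake_type=42)
print(pair.type)       # <class 'sqlalchemy.sql.sqltypes.Integer'>
print(pair.fake_type)  # 42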

def get_data_type_mapper(self) -> SimpleNamespace:
    # build a namespace for easy type access
    return SimpleNamespace(
        type_maps=[
            SimpleNamespace(
                type=DateTime,
                fake_type=self.fake.date_time_between(
                    start_date="-30y", end_date="now"
                ),
            ),
            SimpleNamespace(type=Boolean, fake_type=self.fake.boolean()),
            SimpleNamespace(type=Integer, fake_type=self.fake.random_int()),
            SimpleNamespace(
                type=Float, fake_type=self.fake.pyfloat(positive=True)
            ),
            SimpleNamespace(
                type=Interval, fake_type=timedelta(seconds=randint(0, 86400))
            ),
            SimpleNamespace(type=UUID, fake_type=str(uuid.uuid4())),
            SimpleNamespace(
                type=String,
                fake_type=f"{' '.join([self.fake.word() for _ in range(8)])}",
            ),
        ]
    )

Now we need to ensure we follow all columns, including foreign key relationships, and seed the fake data.

Generating the Row Data

We generate fake data for a given table by inspecting each column, mapping it to a corresponding data type, and then generating a fake value for that type with the Faker library.

  • The following code is effectively an undirected graph clone algorithm, but it's perhaps the most abstract implementation of reflection, cyclic recursion and data input validation that I've ever done… I'm a little proud of this one, though don't ask me to do it at a Google interview.
def generate_fake_row_data(self, table: Table) -> dict:
    data_type_mapper = self.get_data_type_mapper()

    # loop through all table columns
    row_data = {}
    for column in table.columns:

        # check data types for all columns in a table and generate fake data
        for data_type in data_type_mapper.type_maps:
            if isinstance(column.type, data_type.type):
                row_data[column.name] = data_type.fake_type

        # ensure pk is unique
        if column.primary_key and isinstance(column.type, UUID):
            row_data[column.name] = str(uuid.uuid4())

        # loop through table relationships
        for fk in column.foreign_keys:
            fk_table = fk.column.table
            fk_column = fk.column

            # link fk to existing record, or make one first then build relationship
            if fk_records := self.get_table_data(fk_table.name, fk_column.name):
                row_data[column.name] = choice(fk_records)[0]
            else:
                # cycle back to the first node
                new_data = self.generate_fake_row_data(fk_table)
                # if the fk target does not exist yet, create it and point at it
                self.save_model(self.get_model_class(fk_table.name), new_data)
                row_data[column.name] = new_data[fk_column.name]

    return row_data

If a column has a foreign key constraint, we recursively generate a fake row for the referenced table until we find a primary key that does not violate any constraints. We do this by checking whether matching rows already exist in the referenced table and picking one of them at random; otherwise, we generate a new row for the referenced table and use its primary key as the foreign key value. This keeps the generated data consistent, with unique constraints handled when the data is saved to the database.
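One helper I glossed over is get_table_data, which the code above calls to fetch the existing values of the referenced column. A minimal sketch of what I have in mind (assuming it returns the raw row tuples, so choice(fk_records)[0] picks a single value) would be:

def get_table_data(self, table_name: str, column_name: str) -> list:
    # query just the referenced column (usually the pk) from the reflected
    # table so the new foreign key can point at a record that already exists
    table = self.metadata.tables[table_name]
    return self.session.query(table.columns[column_name]).all()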

  • Finally, we must save the results to the database, checking for integrity errors and rolling back if they occur — this is not strictly required, but it's good practice to ensure we don't fail on an unforeseen relationship or unique constraint.
@staticmethod
def save_model(model, row_data):
    try:
        model.get_or_create(model, **row_data)
        # commit changes to the database
        model.session.commit()
    except IntegrityError:
        # handle unique constraint violation by rolling back the transaction
        model.session.rollback()

def generate(self):
    # get the MetaData generator and iterate
    for model, table in list(self.get_model_metadata()):
        # where number_of_records is our upper limit
        for _ in range(self.number_of_records):
            row_data = self.generate_fake_row_data(table)
            self.save_model(model, row_data)
        print(model, f"{self.number_of_records} records added to db")

That's effectively it; we can now add the records to our database.

Running the Data Seeder

# note, other than UUID, we use no Postgres-specific
# fields; I'm sure you can swap this out for anything
engine = create_engine(f"postgresql+psycopg2://{Config.postgres_connection}")
included_models = ['Agent', 'User', 'PaymentHistory']
seeder = DataSeeder(
    included_models,
    number_of_records=1000,
    engine=engine,
)
seeder.generate()
>> <class 'app.api.agent.models.Agent'> 1000 records added to db
>> <class 'app.api.user.models.User'> 1000 records added to db
>> <class 'app.api.history.models.PaymentHistory'> 1000 records added to db

Now, what is truly fun is that these three classes had relationships to models which we did not show here. Let's see the result with some queries to visualize the data:

  • Get all the related id records created in a single cycle
select a.id, u.id, b.balance, ph.id from "user" u
join payment_history ph
on u.id = ph."ownerId"
join balance b on u.id = b."ownerId"
join agent a on u."agentId" = a.id
  • Get the user and his balance which was created by the relationship
select u.*, b.balance from "user" u join balance b on u.id = b."ownerId"

Conclusion

The proposed code, my implementation of which you can find here, is just the start of what we could do. Some improvements we might make:

  • Map data types with Pydantic models, and support advanced data types for other dialects like MySQL and types like JSON.
  • Accurately profile the results of creation — the print statement used ignores chained foreign key creation.
  • Reduce the cyclomatic complexity by abstracting the recursions into a single set and leveraging tools like py_linq or SQLAlchemy_mixins.
  • Create a reflected session to use live inspections on the data model, and potentially save a lot of trouble here (a rough sketch of this follows below).
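On that last point, SQLAlchemy's runtime inspection API already gets most of the way there; a rough sketch (not part of the seeder above) of live foreign-key discovery:

from sqlalchemy import create_engine, inspect

# rough sketch: inspect the live database rather than importing model classes
engine = create_engine(f"postgresql+psycopg2://{Config.postgres_connection}")
inspector = inspect(engine)

for table_name in inspector.get_table_names():
    for fk in inspector.get_foreign_keys(table_name):
        # e.g. payment_history -> user ['ownerId']
        print(table_name, "->", fk["referred_table"], fk["constrained_columns"])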

There are many things we could do to improve, but still, this solution worked for me, and I'm happily testing my backend app on millions of data points today — hopefully this helps you shortcut your way there too. If you do end up using this, please shoot me a star on GitHub, and seriously, if you want to submit a PR I will entertain any solution that improves upon mine.

