Unclogging our Database: PDFs in Postgres

How we integrated Amazon’s Simple Storage Service into our ORM to solve our growing DB size issue

Written by: Nick Saenz

Background: Large PDFs ballooning our DB, trouble ahead!

The data we store at Haven Life is mostly relational. We create and maintain objects representing real-life things like customers and life insurance policies. Most of these objects are easily translatable to our relational database (Postgres), so we save them using our ORM, Persistor. How much your policy is worth, for example, is an integer that's easy to store.

As any database engineer will tell you, their job gets harder the larger the database becomes. When we noticed our database was growing faster than expected, we quickly discovered the growth was mostly due to large binary files (in our case, PDFs) being stored directly in the database. Unlike the small string and integer values we were used to, these PDFs (some over 100MB!) are orders of magnitude larger than any other single piece of data.

Solving for Space: S3 + Persistor

Enter S3, a highly scalable, flat object storage service. We already had an ORM that worked, so our main problem was answering the question "how do we keep our ORM objects simple to save while also putting extra-large objects in S3?"

And so an integration was born! Our ORM, Persistor, uses TypeScript decorators to keep track of objects and their properties, which makes translating them to the database straightforward. We used this same decorator concept to pass Persistor the information it needs to save certain properties to S3 instead of Postgres.

@property({
    isRemoteObject: true,
    remoteKeyBase: 'document-doc-key'
})
pdfProperty: string;

In the example above, we attach two pieces of metadata to our PDF property: isRemoteObject set to true means this is a property we want to send to S3, and remoteKeyBase is a string that becomes part of the generated key (alongside a UUID) to make searching for a particular document easier.

By leveraging our existing pattern of using decorators to inform Persistor’s behavior, our developers can still interact with data the way they usually do with minimal disruption to existing coding flow and patterns.
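To make the decorator pattern concrete, here is a simplified sketch of how per-property metadata can be recorded for an ORM to consult at save time. Persistor's real implementation is richer; the registry and names below are hypothetical:

```typescript
interface PropertyOptions {
  isRemoteObject?: boolean;
  remoteKeyBase?: string;
}

// Registry mapping "ClassName.propertyName" to its decorator options,
// consulted at save time to decide whether a value goes to S3 or Postgres.
const propertyRegistry = new Map<string, PropertyOptions>();

// A property decorator factory: records the options for the decorated property.
function property(options: PropertyOptions = {}) {
  return (target: any, propertyKey: string): void => {
    propertyRegistry.set(`${target.constructor.name}.${propertyKey}`, options);
  };
}

class PolicyDocument {
  pdfProperty: string = "";
}

// Equivalent to writing @property({...}) above pdfProperty when
// experimentalDecorators is enabled in tsconfig.
property({ isRemoteObject: true, remoteKeyBase: "document-doc-key" })(
  PolicyDocument.prototype,
  "pdfProperty"
);

console.log(propertyRegistry.get("PolicyDocument.pdfProperty"));
```

Because the metadata lives in a lookup keyed by class and property, the ORM's save path can branch on isRemoteObject without the calling code changing at all, which is what keeps the developer experience unchanged.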

Benefits of S3 integration: Smaller database, virus scanning, and more

In addition to reducing the working size of our DB, we are also able to take advantage of some more advanced S3 functionality like virus scanning for security, and document versioning to allow easy historical auditing of sensitive documents.

Challenges: Multiple Systems

While the benefits were clear, a large change of any kind brings new challenges. With the S3 integration, our existing transaction logic had to grow to accommodate more complex cases. One of the benefits of a relational database is that individual SQL transactions are ACID. To maintain this behavior with our new integration, saving one object now required consistency across more than one system, so we had to build a higher level of transaction logic to handle cases like partial saves (e.g. within the same object, the Postgres save succeeded but the save to S3 timed out) and other error cases.
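One common way to handle the partial-save case is a compensating action: if the second system fails, undo the first so neither holds a half-saved object. The sketch below illustrates that shape only; the callbacks are hypothetical stand-ins, and Persistor's actual transaction logic covers more cases:

```typescript
type StepFn = () => Promise<void>;

// Save to Postgres first, then S3; if the S3 step fails (e.g. a timeout),
// run a compensating rollback so the two systems stay consistent.
async function saveAcrossSystems(
  savePostgres: StepFn,
  saveS3: StepFn,
  rollbackPostgres: StepFn
): Promise<void> {
  await savePostgres();
  try {
    await saveS3();
  } catch (err) {
    // Partial save detected: Postgres succeeded but S3 did not.
    await rollbackPostgres();
    throw err; // surface the failure to the caller
  }
}
```

A real implementation also has to decide what happens if the rollback itself fails, and whether retries are safe (S3 PUTs to a fresh key generally are), which is why this "higher level of transaction logic" ends up more involved than a single SQL transaction.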

Summary: Worth it!

This feature has saved our organization time (less developer time spent creating and maintaining S3 wrapper classes, managing transaction logic, etc.) and money (S3 storage is much cheaper than equivalent relational DB storage). In the end, the feature was successful because it was what we needed in our circumstances. As always, using the right tool for the job is key: go the simple, easy route until it becomes too difficult to expand or too expensive. If you have any questions or comments, add them below, or email the author, Nick Saenz, at nick@havenlife.com.

Thank you for reading — happy saving! :)
