Postgres json column + Rails @ Casper
As part of an engineering team that has been using a json column in production for two years, things can sometimes feel a bit lonely. Googling around for blog posts about the json column, you’ll find a fair number of discussions of the feature on a theoretical level or getting started guides that give you some examples of the syntax, but there’s not much information to be had about real world usage of the json column. Are other engineering teams using the json column in production? Or just theorizing about it and experimenting on weekends? Assuming there are others like me, who want to hear tales of the json column put into action, I offer this story to the internet.
Keeping track of all those mattresses
At Casper we built a Rails app called Bedpost 😂 to manage fulfillment. One of its first responsibilities was to pull tracking information from carrier API’s so that we could show estimated delivery dates to customers.
At the time, UPS was our only carrier, so we focused on their tracking API. Their API returns more information than just an estimated delivery date. We receive a bunch of XML with many fields and lists of nested objects. We only had a use for the estimated delivery date field, but we were pretty sure we’d want to dig into the rest of the data more deeply at some point in the future, so we wanted to make sure to store it all.
We might have been able to design a structured schema for the UPS tracking data, but that would require an understanding of the UPS Tracking API, and that is not easily acquired; for starters the UPS API documentation is 200+ pages long. Just looking at the data it returns you’ll notice strange things. For example, status strings are mostly consistent but will sometimes vary slightly as if a person somewhere is typing them in. And even more surprisingly the “ScheduledDeliveryDate” field changes it’s name to “RescheduledDeliveryDate” when the value changes. It would be a huge task to wrangle a sufficient understanding of this data to fit it into a structured schema. And for that matter, we wouldn’t be able to design the schema just around UPS, because we’d soon be working with many more carriers.
Rather than attempting to decide on a schema, we just took the easy step of adding a json column to our shipments table, serializing the xml UPS response into a json object, and dumping it into that field (side note: this blog post is a great starting place if you’re looking for a basic overview and info on how to get started). And just like that we had all the data that UPS wanted to give us flowing right into our database.
Making use of the JSON data
Once we had UPS tracking data in our database we found it was relatively easy to begin putting it to use. Here are some examples of the ways we were using the tracking info.
We were able to access the data relatively conveniently and some number of months went by where we happily focused on other things and let the tracking data pile up as json.
More complicated querying
Eventually, the Casper team started asking deeper questions about our tracking data. We wanted to understand when exactly shipments were being scanned out of our warehouse, the cadence of delivery attempts, etc. The answer to these questions lay nested in the “shipment_events” array. An array within the UPS tracking info containing events that took place during transit. Here are some queries we put together about shipment events.
As you can see, things get a little hairy as you start to ask questions like when was the first failed delivery for a shipment. Had we been building logic around shipment events at that time, I think this might have been the moment when we made the move to using a schema. But the queries above weren’t in the master branch of our application. Looking into shipment events was more of an end-of-the-day-fooling-around type of task. And as such it didn’t really feel like the sort of thing that merited creating new tables and doing the work to transform UPS data into those tables. We decided to just leave the data where it is, and use a database view to structure shipment events for easier querying when it did tickle our fancy.
As mentioned, the queries displayed above were just being used for some light data analysis and were not being run in any environment where speed was an important factor. However we did notice that they were quite slow. For one example it took about 10 seconds to retrieve the most recent delivery attempt across all shipments. We experimented a bit to see what speed increases we could get and found that the same query went from 10 seconds to 1.8 seconds when we changed our column type from json to jsonb. We went ahead and made that change.
For curiosity’s sake we looked a bit further into what it might mean to push forward with json data and a shipment tracking events view. We found that creating a materialized view for the shipment tracking events cut that query time in half to 0.9 seconds and adding an index to that view got it down to 0.3 seconds. But we didn’t go down this route with our production tracking data, because it started to become clear that our time using json was coming to an end.
We’ve come to a point as a company where it’s best that we define a Casper-specific way of thinking about tracking shipments — something that is both cleanly mapped to the way our warehouses and carriers operate, and also easily communicable to our customers. We’re defining that schema now and we’ll be glad to get it into production. It’ll be great for cross-departmental communication. It will force us into basic validation of our data when we store it, rather than when we try to put it to use, which should allow us to address issues more proactively (i.e. before a customer sees a 500). And it’ll be faster to query. I think the only thing we lose is the time that we’re putting into designing and implementing the schema.
We will continue to hold on to all of our raw carrier tracking info in json. We had a good run querying that data directly. Moving forward, as we work with our Casper-specific schema, we will still be able to pull in new fields retroactively from the json data. The higher-level pattern that we’re following here is to use the json column to store untransformed data from an external source. It’s a great tool for that particular situation and it’s wonderfully convenient that postgres allows us to keep that data in the same database and even the same table as our structured relational data. Please comment if you have your own suggestions for ways to put the postgres json column to use. Or better yet, come meet us in person! We’ve got open roles, and if you’re not interested in that, well we just generally like making new friends.
Finally, as a bonus for making it to the end, here’s a cool NPR story about all data that UPS is collecting on their drivers. I hope they are organizing it better than the stuff they’ve made publicly available.