Postgres is a quirky DB

Candice McCollough
Dec 20, 2019 · 5 min read

All of my professional experience before coming to Kabbage in 2019 was in full-on DotNet shops: NetFramework/NetCore2, VB/C#, MSSQL Server, Windows VDIs, Bill Gates smiling down from above. I had some forays into AWS DynamoDB and other software stacks for small projects, but the vast majority of my work was in the Windows ecosystem. Arriving at Kabbage, many things were the same or similar (NetCore3, Kubernetes instead of AWS Fargate, Bamboo instead of Jenkins), but the most pervasive difference was that Kabbage mostly uses PostgreSQL to store relational data.

I’m very familiar with MSSQL, and on the surface the interface looked very similar, so I thought, ‘Oh, this won’t cause me any trouble.’ But as I’ve used it more, I’ve noticed there are definitely some quirks, especially if you are trying to use it with Entity Framework (EF) in your NetCore services.

The first thing I noticed was that names in Postgres are case-sensitive. If you have a column named myentityid and you try to represent that column in EF as MyEntityId, it ain’t gonna work. There are a number of easy ways to tell EF what the casing should be; I like using attributes. Annotate the property in your entity model with [Column("myentityid")] and it’ll find it just fine. If you prefer a more behind-the-scenes approach (and are managing the DB with code-first migrations), the package EfCore.NamingConventions exposes a handy extension method to automatically snake_case all of the names in your DB.
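
Here’s a minimal sketch of both approaches, assuming the Npgsql EF Core provider plus the EfCore.NamingConventions package; the entity, context, and connection string are made up for illustration:

```csharp
using System.ComponentModel.DataAnnotations.Schema;
using Microsoft.EntityFrameworkCore;

public class MyEntity
{
    // Map the PascalCase property onto the all-lowercase Postgres column.
    [Column("myentityid")]
    public int MyEntityId { get; set; }
}

public class MyDbContext : DbContext
{
    public DbSet<MyEntity> MyEntities { get; set; }

    // Or, for code-first migrations, snake_case every table and column name automatically.
    protected override void OnConfiguring(DbContextOptionsBuilder options) =>
        options.UseNpgsql("Host=localhost;Database=mydb")
               .UseSnakeCaseNamingConvention();
}
```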

Thinking back on the MSSQL conventions used on my previous projects, one sure-fire way to get a PR rejected was using varchar(max) instead of varchar(n) when you had a reasonable assumption of the average length of the values. Say you want to store a “Name” value for some entity: you might have a hard limit of 64 characters in code, but you expect most values to be about 25 characters. There’s no storage reason not to use varchar(max), because you know your values will never reach the (soft) 8000-character limit anyway, but for performance reasons you use varchar(64), because the query engine estimates 50% of the declared size of variable-sized columns when it’s figuring out how much memory is needed for query execution. If you had used varchar(max), that 50% would be 4000 bytes. Depending on the complexity of your queries, it can be quite a performance hit to use varchar(max) everywhere in MSSQL.
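
Concretely, that convention looks something like this; the table is hypothetical, and the per-row figures just restate the 50% rule above:

```sql
-- Hypothetical MSSQL table. For memory grants the engine assumes half the declared
-- width of a variable-sized column: 32 bytes per row here, versus 4000 bytes per row
-- if Name were declared varchar(max).
CREATE TABLE Merchants (
    MerchantId int IDENTITY(1,1) PRIMARY KEY,
    Name       varchar(64) NOT NULL
);
```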

So when the first Postgres query to create a new table slid up in my PR queue, and I saw that text data type for a small value…well, I was ready — “Needs Work” finger poised above the mouse — but I remembered this was modern Postgres, not MSSQL, and other things had been different. It was worth a little research.

According to Postgres’ documentation on its data types, there is no performance difference between character varying(n) and text. In most cases, you should just use text. The only time you shouldn’t is when you want an error for a string longer than specified, and you will never need to increase that length, and you don’t want to write that limit in code or with a check constraint. So… never. The only legitimate reason I can see for not using text is portability, since it’s not a standard type, but you can use the alias varchar and it works the same.
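
So a table in one of those PRs looks something like this (a hypothetical table; the check constraint is only there if you really do want a hard limit):

```sql
-- text performs the same as varchar(n) in Postgres, and a check constraint
-- can still enforce a hard limit if you ever really want one.
CREATE TABLE merchants (
    merchant_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name        text NOT NULL CHECK (char_length(name) <= 64)
);
```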

Now on to the quirks that caused problems. We have a service that returns anywhere between 1 and 500 items per page of results, and one of our end users was complaining of timeouts when requesting 1 or 2 results, but was getting the data back within milliseconds for requests for 500, without changing any other part of the query. Very strange. The only thing we change for that is the LIMIT clause. Testing it out, we saw asking for 1 row took 6 seconds, and 500 rows took 28ms.

Turns out, Postgres will alter the query plan when you use LIMIT, based on what it knows about the table. It will sometimes assume that if you are only asking for a few rows, your data must have a lot of duplicates and any couple will do. Postgres had decided that when only looking for fewer than ~5 rows for that specific query, it would forgo using an index and just scan the table backwards until it found the TRUE state of the WHERE clause. This caused a 6s read. When the LIMIT was higher or removed completely, the read completed immediately, because the query engine decided using the index would be more efficient if it was going to have to pull all those rows anyway.
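
You can watch the planner change its mind by comparing plans directly; this is a sketch with hypothetical table and column names, not our actual query:

```sql
-- With a tiny LIMIT, the planner may walk an index backwards and filter as it goes,
-- betting that a matching row will turn up almost immediately.
EXPLAIN ANALYZE
SELECT * FROM payments WHERE merchant_id = 42 ORDER BY amount DESC LIMIT 1;

-- With a large LIMIT, it decides the index on merchant_id is worth using after all.
EXPLAIN ANALYZE
SELECT * FROM payments WHERE merchant_id = 42 ORDER BY amount DESC LIMIT 500;
```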

Postgres keeps a system catalog called pg_statistic that informs the query engine about the contents of tables. Running ANALYZE <TABLE> will gather and set these statistics, and can help the engine make smarter query plans, but it isn’t guaranteed to stop the behavior we were seeing.
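
Refreshing the statistics is a one-liner, and you can peek at what the planner thinks it knows through the pg_stats view (continuing the hypothetical table from above):

```sql
-- Recompute statistics so the planner has a current picture of the table.
ANALYZE payments;

-- pg_stats is the human-readable view over pg_statistic.
SELECT attname, n_distinct, null_frac, correlation
FROM pg_stats
WHERE tablename = 'payments';
```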

To fix these slow queries, we could have forced a LIMIT of some arbitrarily large number, but we decided that just adding an additional (unnecessary) ORDER BY clause on the primary key would force index usage without changing functionality for our end users. This also added the unintended benefit of deterministic ordering even when there were ties in the primary sort column (typically a monetary value).
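
The shape of the fix, still using the hypothetical query from above:

```sql
-- The primary key as a secondary sort nudged the planner off the slow backwards scan,
-- and it makes the ordering deterministic when amounts tie.
SELECT *
FROM payments
WHERE merchant_id = 42
ORDER BY amount DESC, payment_id
LIMIT 1;
```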

The last quirk is weirder, but fairly rare. It only comes up if you use EF Core with Postgres and also need to create seed data. I was trying to start my database with some data I knew it needed, then let users add more data via a RESTful service on top of this DB. EF Core lets you seed as part of a migration via the HasData method in OnModelCreating, but you have to provide the Id for the row yourself. This means that if you set up an auto-incrementing Id on your table, the first row created after that seed data is added (the first one actually using the auto-incrementer) will generate an Id of 1 and conflict with a row you’ve created in your seed data. This seems to be an issue only with the Postgres provider of EF Core, and the best workaround seems to be to use negative Ids in your seed data. This raises some eyebrows in code review, but I kind of like the clear distinction between seed and user-created data.
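
A minimal sketch of what that looks like; the entity and its values are made up:

```csharp
using Microsoft.EntityFrameworkCore;

public class LoanProduct
{
    public int Id { get; set; }
    public string Name { get; set; }
}

public class SeedingDbContext : DbContext
{
    public DbSet<LoanProduct> LoanProducts { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // Seed rows get negative Ids so they never collide with the sequence,
        // which hands out 1, 2, 3... to user-created rows.
        modelBuilder.Entity<LoanProduct>().HasData(
            new LoanProduct { Id = -1, Name = "Starter" },
            new LoanProduct { Id = -2, Name = "Standard" });
    }
}
```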

I do like Postgres better than MSSQL: it’s cheaper to run and it’s fast. It removes the anxiety around choosing the correct varchar(n) length, and it’s got fantastic documentation. It’s definitely got some quirks, but also a very active community with answers and strong support in major frameworks. That LIMIT-based query ‘optimization’ though, wat.
