All of my professional experience before coming to Kabbage in 2019 was in full-on DotNet shops; NetFramework /NetCore2, VB/C#, MSSQL Server, Windows VDIs, Bill Gates smiling down from above. I had some forays into AWS DynamoDB and other software stacks for small projects, but the vast majority of my work was in the Windows ecosystem. Arriving at Kabbage, many things were the same or similar — NetCore3, Kubernetes instead of AWS Fargate, Bamboo instead of Jenkins — but the most pervasive difference was that Kabbage mostly uses PostgreSQL to store relational data.
I’m very familiar with MSSQL, and on the surface the interface was very similar, and I thought, ‘Oh, this won’t cause me any trouble’. But as I’ve used it more, I’ve noticed there’s definitely some quirks, especially if you are trying to use it with Entity Framework (EF) in your NetCore services.
The first thing I noticed was that the names in Postgres are case-sensitive. If you have a column named
myentityid and you try to represent that column in EF as
MyEntityId, it ain’t gonna work. There’s a number of easy ways to tell EF what the casing should be, I like using Attributes. Annotate the property in your entity model with
[Column("myentityid")]and it’ll find it just fine. If you prefer a more behind-the-scenes approach (and are managing the DB with code-first migrations) the package EfCore.NamingConventions exposes a handy extension method to automatically
snake_case all of the names in your DB.
Thinking back on the MSSQL conventions used on my previous projects, one sure-fire way to get a PR rejected was using
varchar(max) instead of
varchar(n) when you had a reasonable assumption of the average length of the values. Say you wanted to store a “Name” value for some entity, you might have a hard limit of 64 characters in code, but you expect most values to be about 25 characters. There’s no storage reason to not use varchar(max), because you know you’ll never reach the (soft) limit of 8000 chars, but for performance reasons you use varchar(64) because the query engine will estimate 50% of the declared size of variable sized columns when its figuring out how much memory is needed for query execution. If you had used varchar(max), then that 50% would be 4000 bytes. Depending on the complexity of your queries, it can be quite a performance hit to use varchar(max) everywhere in MSSQL.
So when the first Postgres query to create a new table slid up in my PR queue, and I saw that
text data type for a small value…well, I was ready — “Needs Work” finger poised above the mouse — but I remembered this was modern Postgres, not MSSQL, and other things had been different. It was worth a little research.
According to Postgres’ documentation on their datatypes, there is no performance difference between
text. In most cases, you should just use
text. The only time you shouldn’t is when you want an error for a string longer than specified, and you will never need to increase that length, and you don’t want to write that limit in code or with a check constraint — so… never. The only legitimate reason I can see for not using
text is portability, since it’s not a standard type, but you can use the alias
varchar and it works the same.
Now on the quirks that caused problems. We have a service that returns anywhere between 1 and 500 items per page of results, and one of our end users was complaining of timeouts when requesting 1 or 2 results, but was getting the data back within milliseconds for requests for 500, without changing any other part of the query. Very strange. The only thing we change for that is the
LIMITclause. Testing it out, we saw asking for 1 row took 6 seconds, and 500 rows took 28ms.
Turns out, Postgres will alter the query plan if you are using
LIMIT based on what it knows about the table. It will sometimes assume that if you are only asking for a few rows, your data must have a lot of duplicates and any couple will do. Postgres had decided that when only looking for fewer than ~5 rows for that specific query, it would forego using an index, and just scan the table backwards until it found the
TRUEstate of the
WHEREclause. This caused a 6s read. When the
LIMITwas higher or removed completely, the read completed immediately, because the query engine decided using the index would be more efficient if it was going to have to pull all those rows anyway.
Postgres keeps a system catalog called
pg_statisticthat informs the query engine about the contents of tables. Running
ANALYZE <TABLE>will gather and set these statistics, and can help the engine make smarter query plans, but this doesn’t guarantee to stop the behavior we were seeing.
To fix these slow queries, we could have forced a
LIMITof some arbitrarily large number, but we decided just adding an additional (unnecessary)
ORDER BYclause on the primary key would force index usage, and not change functionality for our end-users. This also added the unintentional benefit of having deterministic ordering even when there were ties in the primary sort column — typically a monetary value.
The last quirk is more weird, but fairly rare. It only happens if you use EF Core with Postgres and also need to create seed data. I was trying to start my database with some data I knew it needed, and then would allow users to add more data via a RESTful service on top of this DB. EF Core allows you to seed as part of a migration via the HasData method on the modelBuilder class but you have to provide the Id for the row yourself. This means if you set up an auto-incrementing Id on your table, then the first row you create after that seed data is added (the first one actually using that auto-incrementer) will generate the id as 1 and conflict with the row you’ve created with your seed data. This seems to be only an issue with the Postgres provider of EF Core, and the best workaround, seems to be to use negative Ids in your seed data. This raises some eyebrows in code review, but I kind of like the clear distinction between seed and user created data.
I do like Postgres better than MSSQL, it’s cheaper to run and its fast. It removes the anxiety around choosing the correct
varchar(n)length and its got fantastic documentation. Its definitely got some quirks but a very active community with answers and strong support in major frameworks. That
LIMIT based query ‘optimization’ though, wat.