When an application which uses a database exhibits some kind of unexpected behavior, that sparks a holy war between DBAs and developers: DBAs scream: “Your application crashes the database!”, while developers shout back: “But everything worked just fine before that!”. Worst of all, the DBAs and developers can’t really help each other: the former don’t know the nuances of the application and features of the driver, the latter don’t know all the dark corners of the infrastructure. It would be nice to avoid this kind of a mess.
As you might have guessed: quite often, merely scrolling through go-database-sql.org is not enough. It’s better to arm yourself with other people’s experience. Even better if the experience is obtained through pain and lost money.
You can find the essentials on working with pretty much any SQL database in Go at go-database-sql.org. If you haven’t read that yet — please do.
From my point of view, the main strength of Go is its simplicity. For instance, that simplicity manifests in the common practice of writing queries in raw SQL (ORM is not welcome). It turns out to be both a boon and a source of extra hardships.
So when you take database/sql package from the standard library, you want to extend its interfaces. As soon as that happens, take a look at github.com/jmoiron/sqlx. I’ll show you a few examples of how this extension can make your life easier.
StructScan usage allows you to avoid manual column <-> field mapping.
NamedQuery usage lets you use structure fields as placeholders in a query.
database/sql is a set of database access interfaces, and sqlx is their extension. The interfaces need an implementation to work. Drivers are responsible for the implementation.
Most popular drivers:
- github.com/lib/pq — pure Go Postgres driver for database/sql. For a long time, this driver was the standard by default. Currently, it has lost its relevance and is not developed by its author.
- github.com/jackc/pgx — PostgreSQL driver and toolkit for Go. Today this tool is a better choice.
github.com/jackc/pgx — you really want to use this driver. Why? It is actively developed and supported.
- It can be more performant if used without database/sql interfaces.
- It supports more than 60 of Postgres-specific types (extra ones that Postgres has in addition to standard SQL ones).
- It provides an option to log whatever happens within the driver.
- pgx has human-readable errors, while lib/pq throws panics. If you don’t catch a panic, the program will crash. (As a side note, don’t use panics in Go as you would use exceptions in other languages; they are quite different.)
- With pgx we have an option to configure every connection independently.
- It supports the PostgreSQL logical replication protocol.
Typically one would write the following loop to fetch data from the database:
Internally, the driver receives data and accumulates it in a 4 KB buffer. Network roundtrip and buffer filling occurs on rows.Next() call. If the 4KB buffer is not enough to complete fetching, the next batch of data is fetched from the network. The more network trips are there, the slower the processing speed becomes. On the other hand, as the buffer limit is 4 KB, we won’t hog the entire available memory.
But, of course, we would like to maximize the buffer capacity to minimize the number of network calls, and lower the latency of our service. So let’s add such an option in the driver and try to gauge the expected speed boost with synthetic tests:
Evidently there is no significant processing speed difference. But why?
As it turns out, we are limited by the size of the data sending buffer within Postgres itself. That buffer has a hardcoded size of 8KB. Using strace we can see that the OS returns 8192 bytes in the read system call. tcpdump confirms this with the packet sizes.
Tom Lane (one of the main developers of the Postgres core) comments on that as follows:
Traditionally, at least, that was the size of pipe buffers in Unix machines, so in principle this is the most optimal chunk size for sending data across a Unix socket.
Andres Freund (Postgres developer from EnterpriseDB) thinks that the 8KB buffer is not the best implementation currently and there should be performance tests with other sizes and other socket configurations.
Apart from that, we should remember that PgBouncer also has a buffer and its size can be configured with the pkt_buf parameter.
Another dangerous feature of the pgx (v3) driver: for every established connection it sends requests into the database to get the information about the Object ID (OID).
These identificators were added to Postgres to uniquely identify internal objects: strings, tables, functions, etc.
The driver uses the knowledge about OIDs to figure out how to map data from database column types into primitive Go types. For this purpose, pgx internally uses the following map (key — type name, value — Object ID)
This implementation causes the driver to send 3 queries for every connection to the database to fill the table with Object IDs.
If the database and the application work normally, the Go connection pool makes it possible to avoid spawning new connections to the database. However, in the event of tiniest database degradation, the connection pool gets exhausted and the connections rate increases exponentially. OIDs fetching requests are pretty heavy, and as a result, the driver can bring the database into a critical state.
Here is a moment when such OIDs requests were poured onto one of our databases:
What can we do about that?
Connections pool limits
First and the foremost: set an upper-bound on the connections pool size.
For database/sql it can be done with the DB.SetMaxOpenConns function. If you skip the database/sql interfaces and use pgx.ConnPool (connection pool implemented by the driver itself), then in ConnPoolConfig you can specify MaxConnections (5 by default).
By the way, while using pgx.ConnPool the driver will reuse the information about the received OIDs and won’t fetch OIDs from the database with every new connection.
If you want to use database/sql anyway, you can cache the information about OIDs yourself.
This method does work, but it can be dangerous in the following case:
- You use enum or domain types in Postgres;
- If the master fails you switch the application to a logical (rather than physical) replica.
In case of failover under these conditions the cached OIDs become invalid. But we can’t clear them as the application can’t know the moment of the failover.
In Postgres world, streaming replication is commonly used to implement high availability, which copies database instances bit-by-bit, so the problems with OIDs caching can rarely be observed in the wild nature. (But still, you’d better go and chat with your DBAs and clarify how exactly does standby work in your particular case).
In the next major version of pgx driver (v4), OIDs fetching requests will be removed. So the new v4 driver is going to rely only upon the list of OIDs, predefined in the code. For custom types you’ll have to take the deserialization control in your hands on the application side: the driver will simply provide byte arrays received from DB.
Logging and monitoring
Monitoring and logging helps to prevent problems before the database crashes.
database/sql provides the DB.Stats() method. The returned state snapshot gives you an understanding of what happens inside the driver.
If you use a pool in pgx directly, the ConnPool.Stat() method will provide you with similar information:
Logging is equally important, and pgx lets us do that. The driver accepts the Logger interface, and by implementing it you can receive all the events that happen in the driver.
Almost always you’ll use a connection pooler while working with Postgres, and it will be PgBouncer.
You can brush up on the reasons behind that in this brilliant article brandur.org/postgres-connections. Briefly, if the number of clients is more than 100, the performance degrades. That happens because of Postgres-side implementation quirks: launching a dedicated process for every connection, snapshot-taking mechanism, and using shared memory for interactions — all these factors are relevant.
Here’s a benchmark of different connection poolers:
And a bandwidth benchmark with PgBouncer and without it.
As a result, your infrastructure will look like this:
Where Server is a process in our application that handles API requests. This process runs in Kubernetes. Separately, on a physical hardware server, there is Postgres, with PgBouncer as a connection pooler in front of it. The PgBouncer itself is single-threaded, so we run several instances, balancing the traffic going to them with HAProxy.
As a result, the network roundtrip of our request is made of the following: application connection pool → HAProxy → PgBouncer → Postgres.
PgBouncer can work in 3 modes:
- Session pooling — every session gets a dedicated connection and it is assigned to the session for its entire lifetime.
- Transaction pooling — the connection is assigned to an individual transaction. As soon as the transaction is over, PgBouncer takes the connection away and gives it to another transaction. This mode allows to utilize connections very effectively.
- Statement pooling — deprecated mode. It was created only to support PL/Proxy.
You can take a look at the following feature availability matrix. We usually pick Transaction Pooling, but it has limitations when working with Prepared Statements.
Transaction Pooling + Prepared Statements
Let’s imagine we want to prepare a query and then execute it. At some moment we run a transaction with our PREPARE statement, and then we receive the prepared statement ID from the database.
Afterwards, at any other moment, we start another transaction in which we try to execute the prepared statement by ID.
In the Transaction Pooling mode, two transactions can get executed in two different connections, but Statement ID is valid only within one connection. We’ll get a “prepared statement does not exist” error if we attempt to execute the prepared statement.
The most embarrassing thing is that under low workload during the development and testing PgBouncer often reuses the same connection and everything works correctly. However, as soon as we deploy the application to the production, requests start to fail with an error.
Now, try to find any Prepared Statements in this code:
You won’t see it there! Statement preparation happens implicitly within the Query() method. Meanwhile, both the preparation and the execution of the query will happen in different transactions and we’ll get the full consequences I’ve described above.
What options do we have?
The first and the simplest option — switch PgBouncer into Session pooling mode. In this mode, one connection is issued per one session, all transactions start in this connection and prepared queries work correctly. But connection utilization efficiency leaves much to be desired. So we can’t really consider this option.
The second option is to build queries on the client-side. I don’t want to do that because of two reasons:
- Potential SQL vulnerabilities. A developer might fail to sanitize the queries correctly or forget to do it at all.
- Query parameters escaping has to be written manually every time.
One more option is to explicitly wrap every query into a transaction. After all, while the transaction is alive, PgBouncer doesn’t take the connection away. This works, but aside from verbosity in our code we also get quite a lot of network calls: Begin, Prepare, Execute, Commit. 4 network calls per one query in total. Latency grows.
However, we want safety, comfort, and efficiency. And such an option does exist! You can explicitly indicate to the driver that you want to use the Simple Query mode. In this mode, there won’t be any preparation and the whole query will go in a single network call. Meanwhile, the driver will escape parameters by itself (standard_conforming_strings should be activated either on the database level or when the connection gets established).
The following problems are tied to request cancellation on the application side.
Look at this code. Can you spot any pitfalls?
Go has a program execution flow control mechanism — context.Context. In this code, we pass ctx to the driver, so that when the context is closed the driver would cancel the request on the database level.
At this point, we expect to save the resources by canceling the requests nobody expects to be answered anymore. However, when a request is canceled, PgBouncer 1.7 sends information that the connection is ready to be used into this connection. Only after that, it returns the connection into the pool.
This PgBouncer behavior confuses the driver, which instantly gets back “ReadyForQuery” when the next request is sent. As a result, we catch a lot of unexpected ReadyForQuery errors.
Starting from PgBouncer version 1.8 this behavior was fixed. Please use the up-to-date version of PgBouncer.
And actually, despite the fact that the errors will disappear, some confusing behavior might still happen.
Occasionally, the application may receive an unexpected response, not to a request that it sent, but to a “neighboring” one. Requests must match in requested column order and types.
This occurs due to the processing of canceling requests at different levels: at the driver pool and bouncer pool levels.
The way the query cancellation works is the most interesting part though.
To cancel a query we have to establish a new connection with the database and send a cancellation request. For every connection Postgres creates a dedicated process. We send a command to cancel the current query in a specific process, specifying the process ID (PID).
However, while the cancellation command flies to the database, the query we want to cancel can manage to complete normally.
Postgres will execute the command and cancel the current query in the specified process. But that current query will already be different, not the one we wanted to cancel in the first place.
This behavior of Postgres working with PgBouncer makes it safer not to cancel a query on the driver level. You can set a CustomCancel function, which won’t cancel the query even if context.Context is used.
Working with Postgres checklist
Instead of any conclusions, I decided to make a checklist for working with Postgres. It should help the article to settle in your head.
- Use github.com/jackc/pgx as your Postgres driver of choice.
- Configure limits for the connection pool size
- Cache OIDs or use pgx.ConnPool, if you use pgx v3.
- Collect the connection pool metrics, either using DB.Stats() or ConnPool.Stat().
- Log what is happening in the driver.
- Use the Simple Query mode to avoid problems with prepared statements in the transactional mode PgBouncer.
- Use an up-to-date version of PgBouncer.
- Be careful with request cancellation from the application side.