<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Postgres in okmeter.io blog on Medium]]></title>
        <description><![CDATA[Latest stories tagged with Postgres in okmeter.io blog on Medium]]></description>
        <link>https://blog.okmeter.io/tagged/postgres?source=rss----4e571b327a8c--postgres</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>Postgres in okmeter.io blog on Medium</title>
            <link>https://blog.okmeter.io/tagged/postgres?source=rss----4e571b327a8c--postgres</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Sat, 16 May 2026 17:00:42 GMT</lastBuildDate>
        <atom:link href="https://medium.com/feed/okmeter-io/tagged/postgres" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <item>
            <title><![CDATA[PgBouncer monitoring improvements in recent versions]]></title>
            <link>https://blog.okmeter.io/pgbouncer-monitoring-improvements-in-recent-versions-5905f5d1a1ed?source=rss----4e571b327a8c--postgres</link>
            <guid isPermaLink="false">https://medium.com/p/5905f5d1a1ed</guid>
            <category><![CDATA[database]]></category>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[database-administration]]></category>
            <category><![CDATA[distributed-systems]]></category>
            <category><![CDATA[monitoring]]></category>
            <dc:creator><![CDATA[pavl t]]></dc:creator>
            <pubDate>Mon, 15 Oct 2018 13:48:48 GMT</pubDate>
            <atom:updated>2018-10-15T13:48:47.929Z</atom:updated>
            <content:encoded><![CDATA[<h3>PgBouncer monitoring improvements in recent versions</h3><p>As I wrote in my previous article “<a href="https://blog.okmeter.io/use-red-and-real-world-pgbouncer-monitoring-61b34ebeebb8?utm_source=blog.okmeter.io&amp;utm_content=pgbouncer-new">USE, RED and real world PgBouncer monitoring</a>”, there are some handy commands in PgBouncer’s admin interface that let you collect stats on how things are going and spot problems, if you know where to look.</p><p>This post is about the new stats these commands gained in recent PgBouncer versions.</p><p>As you know, SHOW STATS shows cumulative stats for each proxied DB:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1xOem_uDO7ef1dTS.png" /></figure><p>Since PgBouncer version 1.8 there are a couple of new columns in its output.</p><p>The first one — total_xact_time — is the total number of microseconds PgBouncer spent connected to PostgreSQL inside a transaction, either idle in transaction or executing queries.</p><p>This lets us chart DB pool utilization in terms of time spent in transactions and compare it to query time utilization:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/567/1*pAJVYl8ejtcWp3Wu9RI9TA.png" /></figure><p>We see two totally different situations — while the database serves queries only 5 to 25% of the time, around 8:00 am PgBouncer connections spend up to 70% of their time in transactions!</p><blockquote>But this total_xact_time is useful in one more very important way.</blockquote><p>There’s a known anti-pattern in Postgres usage: an application opens a transaction, makes a query and then starts doing something else, for example some CPU-heavy calculation on that result or a query to some other resource/service/database, while the transaction keeps hanging. Later the app will probably return to this transaction and might, for example, update something and commit it. The bad thing here is that there’s a corresponding Postgres backend process sitting there doing nothing while the transaction idles. And Postgres backends are somewhat expensive.</p><blockquote>Your app should avoid such behavior.</blockquote><p>This idle-in-transaction state can be monitored in Postgres itself — there’s a state column in the pg_stat_activity system view. But pg_stat_activity provides only a snapshot of current states, which can lead to false negatives when reporting such cases. Using PgBouncer&#39;s stats we can calculate the percentage of time clients spent actually executing queries (total_query_time) out of the total time spent in transactions (total_xact_time). Subtracting that from 100% gives the idling percentage:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*UKR7tzK0Sz-LDQkG.png" /></figure>
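<p>For illustration, here is roughly what a manual check of that state could look like straight from pg_stat_activity (a sketch, assuming PostgreSQL 9.2+ and the stock view columns; keep in mind it only shows the current moment):</p><pre>-- sessions currently idling inside an open transaction<br>SELECT pid,<br>       usename,<br>       datname,<br>       now() - state_change AS idle_for,<br>       query               -- the last statement run in this transaction<br>FROM pg_stat_activity<br>WHERE state = &#39;idle in transaction&#39;<br>ORDER BY idle_for DESC;</pre>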
<p>Moreover, there are two new metrics in PgBouncer 1.8 that replace the original total_requests stat, which showed the number of queries performed. With modern versions of PgBouncer you’ll have total_query_count instead of total_requests, plus total_xact_count, which counts the number of transactions.</p><p>So with that in hand, we can divide total_xact_time - total_query_time (the total idling time) by the number of transactions — total_xact_count — and this will give us how long, on average, each transaction idles.</p><p>Furthermore, we can characterize database workload in one more useful way: the average number of queries per transaction, calculated by dividing the rate of queries by the rate of transactions. In okmeter monitoring you can do that as simply as this:</p><pre>rate(<br>   metric(name=&quot;total_query_count&quot;, database=&quot;*&quot;)<br>) / rate(<br>   metric(name=&quot;total_xact_count&quot;, database=&quot;*&quot;)<br>)</pre><p>And here’s a corresponding chart:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*vj5UzwADOojgltLg.png" /></figure><p>We can clearly see when the workload profile changed.</p><h3>Request Durations</h3><p>As we saw in the previous article, dividing total_query_time by total_requests gives the average query duration. With newer PgBouncer versions the new stats total_xact_time and total_wait_time can be charted the same way, divided by the number of transactions and queries respectively. This produces a chart like this one:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*dT4qjlnWBtVt0THw.png" /></figure><p>This wait_time metric is much handier for spotting pool saturation than the one we discussed last time, calculated from the number of waiting clients in the SHOW POOLS output:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*QTuhLd3RRFZY_qsL.png" /></figure><p>With all that and other <a href="https://okmeter.io/i/integrations/pgbouncer-monitoring?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-2">detailed PgBouncer metrics</a> and <a href="https://okmeter.io/i/integrations/postgresql-monitoring?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-2">Postgres metrics</a> you’ll be prepared for anything happening to your databases.</p><p>I hope you find this write-up useful. I’ve tried to cover all the bases; if you feel you have something to add — please tell me, I’ll be glad to discuss.</p><p>We’re preparing more articles on Postgres and monitoring. So if you’re interested — follow our blog <a href="https://blog.okmeter.io">here</a>, on <a href="http://fb.com/okmeter.io">Facebook</a> or on <a href="https://twitter.com/okmeterio/">Twitter</a> to stay tuned!</p><p><em>Our </em><a href="https://okmeter.io/pg?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-2"><em>monitoring service — okmeter.io</em></a><em> will help you stay on top of everything happening with your PostgreSQL, RDS and other infrastructure services.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=5905f5d1a1ed" width="1" height="1" alt=""><hr><p><a href="https://blog.okmeter.io/pgbouncer-monitoring-improvements-in-recent-versions-5905f5d1a1ed">PgBouncer monitoring improvements in recent versions</a> was originally published in <a href="https://blog.okmeter.io">okmeter.io blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[USE, RED and real world PgBouncer monitoring]]></title>
            <link>https://blog.okmeter.io/use-red-and-real-world-pgbouncer-monitoring-61b34ebeebb8?source=rss----4e571b327a8c--postgres</link>
            <guid isPermaLink="false">https://medium.com/p/61b34ebeebb8</guid>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[database-administration]]></category>
            <dc:creator><![CDATA[pavl t]]></dc:creator>
            <pubDate>Tue, 25 Sep 2018 17:57:46 GMT</pubDate>
            <atom:updated>2018-10-15T10:06:47.986Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/314/0*YdkX3mlEVASMS7lN.png" /></figure><p>Brendan Gregg’s USE (Utilization, Saturation, Errors) method for monitoring is quite well known. There are even some monitoring dashboard templates shared on the Internet. There’s also Tom Wilkie’s RED (Rate, Errors, Durations) method, which is said to be better suited to monitoring microservices than USE.</p><p>We, at <a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">okmeter.io</a>, recently updated our PgBouncer monitoring plugin, and while doing that we tried to comb through everything, using USE and RED as frameworks.</p><p>Why did we need both instead of just sticking with USE, which is more commonly known? To answer that, we first need to understand their applicability. While both methods are known, I don’t think they are widely and systematically applied in the practice of covering IT systems with monitoring.</p><h3>USE</h3><p>In <em>Brendan Gregg’s </em><a href="http://www.brendangregg.com/usemethod.html">own words</a>:</p><blockquote><strong><em>For every resource, check utilization, saturation, and errors.</em></strong></blockquote><p>Here a <strong>resource</strong> is any physical server functional component (CPUs, disks, busses, …), but also some software resources, or software-imposed limits/resource controls (containers, cgroups, etc).</p><p><strong>Utilization</strong>: the average time that the resource was busy servicing work. So CPU or disk IO utilization of 90% means it is idle, not doing work, only 10% of the time, and <em>busy </em>90% of the time. For resources such as memory, where the idea of a “non-idle percentage of time” doesn’t apply, one can instead measure the proportion of the resource that is used.</p><p>Either way, 100% utilization means no more “work” can be accepted: either at all (when memory is full, it is full, and you can’t do anything about it), or only at the moment (as with CPU), in which case new work can be put into a waiting list or queue. These two scenarios are covered by the remaining two USE metrics:</p><p><strong>Saturation</strong>: the degree to which the resource has extra work which it can’t service, often queued.</p><p><strong>Errors</strong>: the count of error events, such as “resource is busy”, “Cannot allocate memory”, “Not enough space”. While these usually don’t impact performance directly, they either lead to client errors or, through retries, redundant devices etc., impact performance from the client’s point of view anyway.</p><h3>RED</h3><p>Tom Wilkie, a former Google engineer and now Grafana’s VP of Product, was frustrated with the USE performance monitoring methodology. For instance, how would you measure, say, the saturation of memory? Error counts can also be problematic, especially for IO errors and memory bandwidth.</p><blockquote>The nice thing about this [USE] kind of pattern is that it turns the guesswork of figuring out why things are slow into a much more of a methodological approach.</blockquote><p>So, as an alternative, Wilkie suggests another easy-to-remember acronym, RED:</p><ul><li>(Request)<strong> Rate</strong>: The number of requests per second.</li><li>(Request) <strong>Errors</strong>: The number of failed requests.</li><li>(Request) <strong>Duration</strong>: The amount of time to process a request, a.k.a. 
service latency.</li></ul><p>RED, though, is designed only for request-driven services, as opposed to batch-oriented or streaming services, for instance.</p><p>So how is it better?</p><p>RED offers a way of looking at service behavior and performance that is consistent across different services, thus reducing the cognitive load on on-call engineers, which is crucial during outages.</p><h3>PgBouncer</h3><p>PgBouncer is a connection pooling service after all, and as such it can be monitored with RED; but it also has all kinds of internal limits and limited resources, so we concluded we were going to need USE as well :)</p><p>Applying these methods to PgBouncer should be done with its main purpose in mind, and one should also know the specifics of its internal structure — all the software resources and limits to cover with USE.</p><blockquote>It’s not enough to monitor PgBouncer as a black-box network service — to know whether a Linux process is alive and a TCP port is open. You actually need to know whether it’s working properly from the client’s point of view — proxying SQL transactions and queries in a timely manner.</blockquote><p>So here’s how it looks from a client’s (say, some web application’s) point of view:</p><ol><li>The client connects to PgBouncer.</li><li>The client makes an SQL request / query / transaction.</li><li>It gets a response.</li><li>Steps 2–3 repeat as many times as needed.</li></ol><p>Here’s the client connection state diagram:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/0*wVmktZYFm7vU1FCN.jpeg" /></figure><p>During LOGIN (CL_ stands for client) PgBouncer might authorize a client based on some local info (such as auth_file, certificates, PAM or hba files), or in a remote way — with auth_query in a database. So a client connection might need to execute a query while logging in. We show that as an Executing substate:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/0*AqBgLx0NrWmW-cuS.png" /></figure><p>But CL_ACTIVE connections might also be actually executing queries, and thus linked by PgBouncer to actual database server connections, or they might be idling, doing nothing. This linking / matching of client and server connections is the whole raison d’etre of PgBouncer. PgBouncer links a client with a server only for some time, depending on pool_mode — either for a session, a transaction or just one request.</p><blockquote>As transaction pooling is the most common, we’ll assume it for the rest of this post.</blockquote><p>So a client in the cl_active state may or may not be linked to a server connection. To account for that we split this state in two: active and active-linked/executing. So here’s the updated diagram:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/0*B03PDy-IwA31zj7H.png" /></figure><p>These server connections that clients get linked to are “pooled” — limited in number and reused. Because of that, it may happen that when a client sends a request (beginning a transaction or performing a query), the corresponding server connection pool is exhausted, i.e. PgBouncer has already opened as many connections as it was allowed and all of them are occupied by (linked to) other clients. In this scenario PgBouncer puts the client into a queue, and this client’s connection goes to the CL_WAITING state. 
This can also happen while the client is still logging in, so there’s a CL_WAITING_LOGIN state for that too:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/766/0*1UjrQVd9Dq_phwQ1.png" /></figure><p>On the other end there are server connections — from PgBouncer to the actual database. They have corresponding states: SV_LOGIN while authorizing, SV_ACTIVE while linked with (and used or not by) a client connection, and SV_IDLE when free.</p><h3>USE and PgBouncer</h3><p>Thus we can formulate (a naive version of) the Utilization of a specific PgBouncer pool:</p><pre>pool_u = #_server_connections_utilized_by_clients / pool_size</pre><p>PgBouncer has an administration interface available through a connection to a special ‘virtual’ database named pgbouncer. It offers a number of SHOW commands; one of them — SHOW POOLS — shows the number of connections in each state for each pool:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YHQXhZ0EndaHJL2S.png" /></figure><p>Here we see 4 client connections open, all of them cl_active, and 5 server connections: 4 sv_active and one in sv_used.</p><p>One can collect this SHOW POOLS output, using for example some Prometheus exporter, and chart it to get something like this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*sEIIPB9_dHcOk4sw.png" /></figure><p>But how do we get utilization from that? We need to answer these questions first:</p><ul><li>What is the pool size?</li><li>How do we count utilized connections? As a current number or as a percentage of time? On average or at peak usage?</li></ul><h3>Pool size</h3><p>It’s not that simple: PgBouncer has <strong>5 different settings</strong> related to limiting connection counts!</p><ul><li>You can specify pool_size for each proxied database. This creates a <em>separate pool of that size for every user</em> connecting to that database. If not set, it defaults to the default_pool_size setting, which in turn defaults to 20. So if you have multiple users in your database (and you probably should), each of them may by default create up to 20 Postgres connections, which seems alright. But each Postgres connection is a postgres process, and if you have many users, and thus many pools, you might end up with a pretty high total number of postgres server processes (aka backends).</li></ul><blockquote>Suggestion: if you don’t have a limit on the number of different users in your database, your PgBouncer probably also has automatically created database pools. Set a pretty low default_pool_size, just in case.</blockquote><ul><li>max_db_connections covers exactly this problem — it limits the total number of connections to any one database, so badly behaving clients won’t be able to create too many Postgres backends. But max_db_connections is not set by default, so it’s unlimited ¯\_(ツ)_/¯</li></ul><blockquote>Suggestion: limit it! As a baseline you can use, for example, Postgres’s max_connections setting, which is 100 by default. But don’t forget to adjust it if you have multiple PgBouncer instances going directly to one DB server.</blockquote><ul><li>reserve_pool_size — the size of an additional, reserve pool, which kicks in when a regular pool is exhausted, i.e. there are already pool_size open server connections. In that case PgBouncer may open these additional connections. 
As I understand it, this was designed to help serve bursts of clients, but in my view it’s not very useful for that: at peak load, when the pool may be exhausted and the DB may already be having a hard time serving all that load, opening more connections to it won’t do any good. But this reserve pool is handy for spotting pool saturation, as we’ll discuss later.</li><li>max_user_connections — this limits the total number of connections <em>to any database </em>from one user. From my point of view it’s a rather strange limit; it only makes sense when you have multiple databases with the same users.</li><li>max_client_conn — limits the total number of incoming client connections. It differs from max_user_connections in that it includes connections from any user. If you see errors like “no more connections allowed” in the pgbouncer log, it means max_client_conn has been reached. By default it is set to 100, after which pgbouncer will just reset any new incoming TCP connection.</li></ul><blockquote>Suggestion: you might want to set max_client_conn &gt;&gt; SUM ( pool_size + reserve pool), like, 10 times bigger, maybe.</blockquote><p>Besides SHOW POOLS, PgBouncer’s administration database also has a SHOW DATABASES command, which shows the actually applied limits and all configured and currently present pools:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QlyCtKb_oKMk7V4J.jpeg" /></figure><h3>Server connection monitoring</h3><p>So let’s return to the question — how do we count utilized connections? Do we just take the current number, or should we measure it as a percentage of time? Should that be average or peak usage?</p><p>In practice it’s not so easy to track pool utilization properly, because pgbouncer reports many indicators only as current values, which allows only sampled metrics collection, with a chance of artifacts. Here’s a real-life example: depending on whether pgbouncer metrics collection happened at the start of a minute or at its end, one can see quite a different picture of pool utilization:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/650/0*stjYX0bE4hDSXWfE.png" /></figure><p>There were no changes in the load profile during the charted period. 
Check out the Postgres connections chart and the PgBouncer file usage chart for the same period — no changes at all:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/0*GkKAi3JhcCFE_DcB.png" /></figure><p>So, while implementing our <a href="https://okmeter.io/i/integrations/pgbouncer-monitoring?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">pgbouncer monitoring</a>, we decided to provide a combined picture to our clients: our monitoring agent samples SHOW POOLS every second and once a minute reports the average as well as the peak count of connections in each state:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/634/0*LWpFTT_h8Ghz2krN.png" /></figure><p>Dividing this by pool_size gives you average and peak pool utilization as a percentage, so you can trigger an alert if it gets anywhere close to 100%.</p><p>PgBouncer also provides a SHOW STATS command, which provides stats (not a surprise, I know) on requests and traffic for every proxied database:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1xOem_uDO7ef1dTS.png" /></figure><p>Here, for the purpose of measuring pool utilization, we are mostly interested in total_query_time — the total number of microseconds spent by pgbouncer when <em>actively connected to PostgreSQL, executing queries</em>. Dividing this by the respective pool size (treating pool size as the number of seconds all the server connections together could spend serving queries within one wall-clock second) we get another estimate of pool utilization; let’s call it “query time utilization”. Unlike a utilization estimate calculated from server connection counts, this one is not prone to sampling problems, because total_query_time is a cumulative sum, so it won’t miss a thing.</p><p>Compare this:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/629/0*zygCGsbUeedKFFZZ.png" /></figure><p>To the one we saw before:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/611/0*PId4eXZ9XuedcSga.png" /></figure><p>You can see that the latter doesn’t really show all the moments when utilization was ~100%, while the first chart, with “query time utilization”, does.</p><h3>Monitoring PgBouncer pool saturation</h3><p>Let’s discuss for a moment why we need a Saturation metric at all, when we could seemingly tell whether everything is overloaded just by looking at whether Utilization is high.</p><p>The problem is that even with cumulative stats like total_query_time one can’t tell whether there were short periods of high utilization between two moments when we looked at the stats. For example, say you have some cron jobs configured to start simultaneously and make queries to a database. If these queries are short enough, i.e. shorter than the stats collection period, the measured utilization might still be low, while at the moment the crons start they might exhaust a resource (be it the connection pool or something else). In that case, as we discussed, they probably waited in a queue of some sort. But they also might have affected queries coming from other clients, leading to local performance degradation from those clients’ point of view. Looking only at the Utilization metric, you won’t be able to diagnose that.</p><p>How can we track that in PgBouncer? A straightforward (and naive) approach is to count clients in the cl_waiting state, discussed above, in the SHOW POOLS output.</p>
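<p>For reference, polling it by hand could look roughly like this (a sketch: the pgbouncer admin database and SHOW POOLS are standard, but the port and admin user depend on your setup):</p><pre>-- connect to the admin console, e.g.: psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer<br>SHOW POOLS;<br>-- the cl_waiting column is the one to watch here:<br>-- any non-zero value in a sample means clients were queued at that moment</pre>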
<p>Under normal circumstances you won’t see any, and a waiting-client count greater than 0 means pool saturation, as here:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*YbnPMuD30fA5_jP4.png" /></figure><p>But as you know, you can only sample SHOW POOLS, which means you might miss such waiting periods.</p><p>Here’s where we can use PgBouncer’s built-in saturation detection — as I wrote before, you can configure it to open additional connections when a pool fills up, just by setting a non-zero reserve_pool_size. We can then detect pool saturation by comparing the number of open server connections to the respective pool_size: if it exceeds the limit, the pool was saturated at some point:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*vOhKXCAP-1Aa-Swb.png" /></figure><p>Here we can clearly see the footprint of some cron/periodic jobs that kick in at the start of each hour and saturate this pool. And even though at no sampled moment do we see the number of active connections exceeding the pool_size limit, we know for sure that pgbouncer detected saturation and opened reserve connections.</p><p>There’s another related setting — reserve_pool_timeout — which defines how long a client has to wait before pgbouncer will use the reserve pool. It defaults to 5 seconds, so if you’re going to use the reserve pool for saturation detection, you should probably set it quite low.</p><p>While I’ve shown the problems caused by SHOW POOLS data only being available via sampling, it is still very useful to monitor client connection states. Thanks to distinct pools for different users, one can see which users are actively using actual server connections (linked with them). At <a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">okmeter.io</a> you can chart it like this:</p><pre>sum_by(user, database, <br>   metric(name=&quot;pgbouncer.clients.count&quot;, state=&quot;active-link&quot;)<br>)</pre><p>And here’s an example chart:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*gwikEiIn10GyMtQn.png" /></figure><p>We at <a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">okmeter.io</a> provide even deeper detail on that usage. You can see the distribution of client IP addresses, which lets you distinguish not only the most active DB users, but also the most active application instances:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/630/0*gspZxklEaV0_oKj7.png" /></figure><p>In this example you can see the IP addresses of specific Kubernetes pods with web application instances running in them.</p><h3>Errors</h3><p>For server connection pool exhaustion we have these saturation metrics. But for client connections there are some limits too, as we discussed. Reaching those produces not queueing and waiting, but blunt denial-of-service errors, which you should watch for in pgbouncer logs, where you might find entries like these:</p><pre>launch_new_connection: user full <br>launch_new_connection: database full <br>no more connections allowed</pre><h3>RED monitoring and PgBouncer</h3><p>While USE is designed to find performance issues and bottlenecks, RED is targeted more at characterizing the workload, i.e. incoming and outgoing traffic. 
So RED will tell you whether everything works as intended (or at least as before), and if something’s not right, USE will help find the cause.</p><h3>Request Rate</h3><p>This one is, at first glance, pretty straightforward for an SQL proxy / pooler. Clients send requests (transactions and queries) in the form of SQL statements. You will find total_requests in the SHOW STATS output. Let’s chart its rate, which in <a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">okmeter.io</a> is simply rate(metric(name=&quot;pgbouncer.total_requests&quot;))</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*-6xCx7U20xXHjk5s.png" /></figure><p>We can clearly see daily changes and short, anomalous spikes in usage.</p><h3>Request Errors</h3><p>Well, RED and USE have the “E” in common, which is Errors. The way I see it, USE’s errors are more about the cases when the service/resource we’re monitoring is unable to handle more load, while RED’s errors are more about errors from the client’s point of view: statement timeouts (“canceling statement due to statement timeout”), rollbacks, etc.</p><h3>Request Durations</h3><p>Here again we can use SHOW STATS with its cumulative total_query_time and total_requests. Dividing one by the other gives the average query time, and if you track that over time, you get an average query time chart:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*Fno130_bGaRiWdfp.png" /></figure><p>We clearly see that most of the time it is pretty stable, while there were some anomalous spikes at 19:30 and later. From there, we could dig deeper using more <a href="https://okmeter.io/i/integrations/pgbouncer-monitoring?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">detailed PgBouncer metrics</a>, or we might need to look deeper into <a href="https://okmeter.io/i/integrations/postgresql-monitoring?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1">Postgres metrics</a>.</p><p>I hope you find this write-up useful. I’ve tried to cover all the bases; if you feel you have something to add — please tell me, I’ll be glad to discuss.</p><p>I’m preparing the next article on PgBouncer and Postgres metrics and monitoring. So if you’re interested — follow us <a href="https://blog.okmeter.io">here</a>, on <a href="http://fb.com/okmeter.io">Facebook</a> or <a href="https://twitter.com/okmeterio/">Twitter</a> to stay tuned!</p><p><em>Our </em><a href="https://okmeter.io/pg?utm_source=medium_blog&amp;utm_medium=blog_post&amp;utm_campaign=blog&amp;utm_content=pgbouncer-1"><em>monitoring service — okmeter.io</em></a><em> will help you stay on top of everything happening with your PostgreSQL, RDS and other infrastructure services.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=61b34ebeebb8" width="1" height="1" alt=""><hr><p><a href="https://blog.okmeter.io/use-red-and-real-world-pgbouncer-monitoring-61b34ebeebb8">USE, RED and real world PgBouncer monitoring</a> was originally published in <a href="https://blog.okmeter.io">okmeter.io blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL: why and how WAL bloats]]></title>
            <link>https://blog.okmeter.io/postgresql-why-and-how-wal-bloats-2252578985c7?source=rss----4e571b327a8c--postgres</link>
            <guid isPermaLink="false">https://medium.com/p/2252578985c7</guid>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[database-administration]]></category>
            <category><![CDATA[observability]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[postgresql]]></category>
            <dc:creator><![CDATA[pavl t]]></dc:creator>
            <pubDate>Mon, 03 Sep 2018 15:12:58 GMT</pubDate>
            <atom:updated>2018-09-04T07:20:01.652Z</atom:updated>
            <content:encoded><![CDATA[<h4>Today’s post is about the real life of PG’s Write-Ahead Log.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GBokz4E7UymMqo3U.png" /></figure><h4>WAL. An almost short introduction</h4><p>Any change to a PostgreSQL database is first of all saved in the Write-Ahead Log, so it will never get lost. Only after that are the actual changes made to the data in memory pages (in the so-called buffer cache), and these pages are marked dirty — meaning they need to be synced to disk later.</p><p>For that there’s a checkpoint process, run periodically, that dumps all the ‘dirty’ pages to disk. It also saves the position in the WAL (called the <strong>REDO point</strong>) up to which all changes have been synchronized.</p><p>So if a Postgres DB crashes, it will restore its state by sequentially replaying the WAL records from the <strong>REDO point</strong>. All the WAL records before this point are useless for crash recovery, but might still be needed for replication purposes or for Point-In-Time Recovery.</p><p>From this description a super-engineer might already have figured out all the ways it can go wrong in real life :-) But in reality one usually does this reactively: one needs to stumble upon a problem first.</p><h4>WAL bloats #1</h4><p>For every Postgres instance, our monitoring agent finds the WAL files and collects their number and total size.</p><p>Here’s a case of a strange 6x growth in WAL size and segment count:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/0*oaSR_Vvs_UEdFj23.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/0*dk3CM4pUdB56KuL7.png" /></figure><p>What could that be?</p><p>WAL is considered unneeded and removable after a checkpoint is made. That’s why we check checkpoints first. Postgres has a special system view called pg_stat_bgwriter with info on checkpoints:</p><ul><li><strong>checkpoints_timed</strong> — a counter of checkpoints triggered because the time elapsed since the previous checkpoint exceeded the checkpoint_timeout setting<em>. </em>These are so-called <em>scheduled</em> checkpoints.</li><li><strong>checkpoints_req</strong> — a counter of checkpoints run because the amount of un-checkpointed WAL grew beyond the max_wal_size setting — <em>requested</em> checkpoints.</li></ul><p>So let’s see:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/0*VdWMPUSXcRf4MfJ9.png" /></figure><p>We see that after 21 Aug checkpoints ceased to run. Though we would love to know the exact reason, we can’t ¯\_(ツ)_/¯</p><p>As one might remember, Postgres is known to be prone to unexpected behavior with long-lasting transactions. Let’s see:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*FtiCsVRQZYn1McGD.png" /></figure><p>Yeah, that might well be the case.</p><p>So what can we do about it?</p><ul><li>Kill it. Try pg_cancel_backend.</li><li>Try to figure out why it is stuck.</li><li>Wait, but check and monitor free disk space.</li></ul><p>There’s an additional quirk here: all this leads to WAL bloat on all the replicas too.</p><blockquote>Using this as a chance to remind you — <strong>a replica is not a backup</strong>.</blockquote><h3>WAL archiving</h3><p>A good backup is one that allows you to restore to any point in the past.</p><p>So if “someone” (not you of course) executes this on the primary database:</p><pre>DELETE FROM very_important_tbl;</pre><p>You’d better have a way to restore your DB to its state right before this transaction. 
It’s called Point-In-Time Recovery, or PITR for short.</p><p>In Postgres you would do this with periodic full backups plus WAL segment archives. For that there’s a special setting — <a href="https://www.postgresql.org/docs/current/static/continuous-archiving.html">archive_command</a> — and a special postgres: archiver process. It periodically runs the command of your choosing and, if it returns no error, the corresponding WAL segment file can then be removed. But if there’s an error archiving a WAL file, which became more common with the wide use of cloud infrastructure (yes, I’m looking at you, AWS S3), it will retry and retry until it succeeds. This can leave a massive number of WAL files residing on disk and eating up its space.</p><p>So here’s a chart of WAL archiving that was broken for a while:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/0*NjvlDgXFO30XYTkD.png" /></figure><p>You can get these counters from the pg_stat_archiver system view.</p><p>Any monitoring system collects various metrics on server infrastructure. And it’s not only about charts — you can also alert on them and use them to make your infrastructure more resilient.</p><p>The thing is that most widely used software is not designed with deep observability in mind. That’s why it’s so hard to set up your monitoring in such a way that it shows you everything you need, in time.</p><p>The most crucial metrics are hard to collect. They’re usually not exposed by some system view where you could just SELECT supa_useful_stat FROM cool_stat_view. While developing our monitoring agent we dig deep for meaningful and detailed metrics, so you’ll just have them when the need arises.</p><p>That is true for WAL and archiving as well — we not only collect failures from pg_stat_archiver and WAL size on disk, but with okmeter.io you’ll also have a metric that shows the amount of WAL residing on disk for the sole purpose of archiving. And here’s how it looks when your archival storage fails:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/966/0*_N3LTCFdouuy5ntZ.png" /></figure><p>Our monitoring system — <a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog_post-middle&amp;utm_campaign=blog&amp;utm_content=pg_wal">okmeter.io</a> — will not only collect such metrics automatically, but will also alert you whenever archiving fails.</p><h3><strong>Replication</strong></h3><p>Postgres is well known for its streaming replication, which works via continuous transfer and replay of WAL segment files to/on a replica server.</p><p>For the case when a replica is unable to receive all the needed WAL segments instantly, there’s a stash of WAL files on the primary server. The special setting wal_keep_segments controls how many files the primary keeps. But if a replica hangs and lags behind by more than that, the files are removed silently, which means the replica won’t be able to reconnect to the primary and continue its streaming replication, rendering it unusable. To bring it back, one would need to recreate the whole thing from a base backup.</p><p>To further control and mitigate that, Postgres, since version 9.4, has a special mechanism: replication slots.</p><h3>Replication slots</h3><p>When slots are used for setting up replication, once a slot has received a connection from a replica at least once (you can think of it as “initialized”), the primary will, if that replica falls behind, keep all the WAL segments it needs until the replica reconnects and catches up with the current state.</p>
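<p>It’s easy to see how much WAL each slot is holding back, for example like this (a sketch, assuming PostgreSQL 10+; on 9.4–9.6 the equivalents are pg_current_xlog_location() and pg_xlog_location_diff()):</p><pre>-- WAL retained on the primary per replication slot<br>SELECT slot_name,<br>       active,<br>       pg_size_pretty(<br>         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)<br>       ) AS retained_wal<br>FROM pg_replication_slots<br>ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;</pre>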
<p>Or, if the replica is gone forever, the primary <strong>will keep these segments forever, </strong>eventually using up all the disk space for that.</p><blockquote>A forgotten (unmonitored) replication slot causes not only WAL bloat but possibly database downtime.</blockquote><p>Fortunately it’s really easy to monitor through the pg_replication_slots system view, as in the sketch above.</p><p>We at okmeter suggest that you not only monitor replication slot statuses, but also track the WAL size retained because of them, as we do, for example, here:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/0*O2g_rvDbCao_m4s0.png" /></figure><p>It not only shows the total WAL bloat, but in the detailed view you can see which particular slot causes it:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/974/0*YHrKNJJ8NjHwmQAp.png" /></figure><p>Once we see which one it is, we can decide what to do about it: either try to fix that replica or, if it’s not needed anymore, delete the slot.</p><p>These are the most common causes of WAL bloat, though I’m sure there are others. It’s crucial to monitor it, to keep your database service uninterrupted.</p><p><em>Our </em><a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog_post-footer&amp;utm_campaign=blog&amp;utm_content=pg_wal"><em>monitoring service — okmeter.io</em></a><em> will help you stay on top of everything happening with your PostgreSQL, RDS and other infrastructure services.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=2252578985c7" width="1" height="1" alt=""><hr><p><a href="https://blog.okmeter.io/postgresql-why-and-how-wal-bloats-2252578985c7">PostgreSQL: why and how WAL bloats</a> was originally published in <a href="https://blog.okmeter.io">okmeter.io blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Real world SSD wearout]]></title>
            <link>https://blog.okmeter.io/real-world-ssd-wearout-a3396a35c663?source=rss----4e571b327a8c--postgres</link>
            <guid isPermaLink="false">https://medium.com/p/a3396a35c663</guid>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[database-administration]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[redis]]></category>
            <dc:creator><![CDATA[pavl t]]></dc:creator>
            <pubDate>Mon, 27 Aug 2018 15:16:46 GMT</pubDate>
            <atom:updated>2018-08-27T19:41:22.502Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/575/0*zFgiHN4aXwDc8SR7.png" /></figure><p>A year ago we added SMART metrics collection to our monitoring agent, which gathers disk drive attributes on clients’ servers.</p><p>So here are a couple of interesting cases from the real world.</p><p>Because we needed it to work without installing any additional software, like smartmontools, we implemented collection not of all the attributes, but only basic, non-vendor-specific ones — to be able to provide a consistent experience. That way we also skipped the burdensome task of maintaining a knowledge base of vendor-specific quirks — and I like that a lot :)</p><p>This time we’ll discuss only the SMART attribute named “media wearout indicator”. Normalized, it shows the percentage of “write resource” left in the device. Under the hood the device keeps track of the number of erase cycles the NAND media has undergone, and the percentage is calculated against the maximum number of cycles for that device. The normalized value declines linearly from 100 to 1 as the average erase cycle count increases from 0 to the maximum.</p><h3>Are there any actually dead SSDs?</h3><p>Though SSDs are pretty common nowadays, just a couple of years ago you could hear a lot of scary talk about SSD wearout. We wanted to see if any of it was true, so we searched for the maximum wearout across all the devices of all of our clients.</p><blockquote>It was just 1%</blockquote><p>The docs say it just won’t go below 1%. So that device is<strong> worn</strong> <strong>out</strong>.</p><p>We notified this client. It turned out to be a dedicated server at Hetzner, and their support replaced the device:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/969/1*SdQ9aCAwRGX458JSUh0hEA.png" /></figure><h3>Do SSDs die fast?</h3><p>Since we introduced SMART monitoring for some clients quite a while ago, we have accumulated history, and now we can look at it on a timeline.</p><p>The server with the highest wearout rate across our clients’ servers was unfortunately added to okmeter.io monitoring only two months ago:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/965/0*Y7q1_ElmAIvSxYah.png" /></figure><p>This chart indicates that during these two months alone it burned through 8% of its “write resource”.</p><blockquote>So under that load, 100% of this SSD’s lifetime will be used up in 100/(8/2) = 25 months, i.e. about <strong>2 years</strong>.</blockquote><p>Is that a lot or a little? I don’t know. But let’s check what kind of load it’s serving.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/964/0*Lz1UAiMmmGoEXWMO.png" /></figure><p>As you can see, it’s Ceph doing all the disk writes, but it’s not doing these writes for itself — it’s a storage system for some application. This particular environment was running under Kubernetes, so let’s sneak a peek at what’s running inside:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/0*bkDlCKW1gLB4WsNS.png" /></figure><p>It’s Redis! You might have noticed a divergence from the previous chart — the values here are 2 times lower (probably due to Ceph’s data replication), but the load profile is the same, so we conclude it’s Redis after all.</p><p>Let’s see what Redis is doing:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/967/0*EfSMZwmgn8X2RUoI.png" /></figure><p>So on average it’s fewer than 100 write commands per second. 
As you might know, there are two ways Redis persists data to disk:</p><ul><li><strong>RDB</strong> — which periodically snapshots the whole dataset to disk, and</li><li><strong>AOF</strong> — which writes a log of all the changes.</li></ul><p>It’s obvious that what we saw here is RDB with one-minute dumps:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/976/0*__u2ctpL9Bt62f7G.png" /></figure><h3>Case: SSD + RAID</h3><p>We see three common patterns of server storage setups with SSDs:</p><ul><li>Two SSDs in a RAID-1 that holds everything there is.</li><li>Some HDDs + SSDs in a RAID-10 — we see that setup a lot on traditional RDBMS servers: OS, WAL and some “cold” data on HDDs, while the SSD array holds the hottest data.</li><li>Just a bunch of SSDs (JBOD) for some NoSQL store like Apache Cassandra.</li></ul><p>In the first case, with RAID-1, writes go to both disks symmetrically, and wearout happens at the same rate:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*afOWCXamhJrXccpL.png" /></figure><p>Looking for anomalies, we found one server where it was completely different:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/975/0*fzHUjZnEBD_mwdUY.png" /></figure><p>Checking mount options to understand this didn’t produce much insight — all the partitions were RAID-1 mdraids:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ueVc9ZCCoIBvAQIF.png" /></figure><p>But looking at per-device IO metrics we see, again, a difference between the two disks, with /dev/sda getting more bytes written:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/977/0*bPqGKHbnWbnLl9uv.png" /></figure><p>It turns out there’s swap configured on one of the /dev/sda partitions, and pretty substantial swap IO on this server:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/968/0*JP9Is8jbLCmtmmhP.png" /></figure><h3>SSD wearout and PostgreSQL</h3><p>This journey began with me wanting to check SSD wearout under different Postgres write load profiles. Not much luck though — all of our clients’ Postgres databases with at least a somewhat high write load are configured pretty carefully, so writes go mostly to HDDs.</p><p>But I found one pretty interesting case nevertheless:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/0*YsUWDGfQIBlam0Zl.png" /></figure><p>We see these two SSDs in a RAID-1 wore out by 4% over 3 months. But the suspicion that it was a high volume of WAL writes turned out to be wrong — it’s less than 100 KB/s:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/973/0*ltKc_FaKVtrbfL6o.png" /></figure><p>I figured that Postgres probably generates writes in some other way, and indeed it does. Constant temp file writes, all the time:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/972/0*nhOuHI8tLKGA-Ib8.png" /></figure><p>Thanks to Postgres’ elaborate internal statistics and okmeter.io’s rich support for them, we easily spotted the root cause:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*arJ0IoqyjvFMSfEM9te_lg.png" /></figure><p>It was a SELECT query generating all that load and wearout! SELECTs in Postgres can sometimes produce not just temp-file writes but real data writes as well. <a href="https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f">Read about it here</a>.</p>
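<p>If you want to hunt for such queries by hand, a query along these lines against pg_stat_statements is a reasonable starting point (a sketch, assuming the extension is installed and the default 8 kB block size):</p><pre>-- normalized statements that write the most temporary data<br>SELECT substring(query, 1, 60) AS query,<br>       calls,<br>       temp_blks_written,<br>       pg_size_pretty(temp_blks_written * 8192::bigint) AS temp_written<br>FROM pg_stat_statements<br>ORDER BY temp_blks_written DESC<br>LIMIT 10;</pre>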
<h3>Summary</h3><ul><li>Redis with RDB generates a ton of disk writes, and the volume depends not on the amount of change in the Redis DB, but on DB size and dump frequency. RDB seems to produce the highest write amplification of all the storage systems known to me.</li><li>Actively used swap on an SSD is probably a bad idea, unless you want to add some jitter to RAID-1 SSD wearout.</li><li>In DBMSes like PostgreSQL it might be not only WAL and data files that dominate disk writes. Bad database design or access patterns can produce a lot of temp file writes. <a href="https://blog.okmeter.io/postgresql-query-monitoring-375ee8048c10">Read how to monitor Postgres queries</a>.</li></ul><h3>That’s all for today. Be aware of your SSDs’ wearout!</h3><p>Follow us on <a href="https://blog.okmeter.io/">our blog</a> or <a href="https://twitter.com/okmeterio/">Twitter</a> to read more cases.</p><p><em>We at </em><a href="https://okmeter.io/?utm_source=medium_blog&amp;utm_medium=blog-post&amp;utm_campaign=blog&amp;utm_content=ssd_smart"><em>okmeter.io</em></a><em> believe that for an engineer to dig up the root cause of a problem, they need decent tooling and a lot of metrics on every layer and part of the infrastructure. That’s where we’re trying to help.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=a3396a35c663" width="1" height="1" alt=""><hr><p><a href="https://blog.okmeter.io/real-world-ssd-wearout-a3396a35c663">Real world SSD wearout</a> was originally published in <a href="https://blog.okmeter.io">okmeter.io blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[PostgreSQL: Exploring how SELECT Queries can produce disk writes]]></title>
            <link>https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f?source=rss----4e571b327a8c--postgres</link>
            <guid isPermaLink="false">https://medium.com/p/f36c8bee6b6f</guid>
            <category><![CDATA[database-administration]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[database-monitoring]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[postgres]]></category>
            <dc:creator><![CDATA[Nikolay Sivko]]></dc:creator>
            <pubDate>Sun, 04 Mar 2018 19:57:19 GMT</pubDate>
            <atom:updated>2018-03-26T12:39:29.801Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/391/0*8HB3gkIJSBLDGHdZ." /></figure><p>We already wrote about<em> </em><a href="https://blog.okmeter.io/postgresql-query-monitoring-375ee8048c10"><em>monitoring PostgreSQL queries</em></a>; at the time we thought that we completely understood how PostgreSQL works with various server resources.</p><p>Working regularly with PostgreSQL query statistics, we noticed some anomalies and decided to dig a bit deeper for a better understanding. Through this process, we found that while the behavior of PostgreSQL is kind of strange at first glance (or at least very peculiar), the clarity of its source code is quite admirable.</p><h3><strong>A SELECT query can “dirty” some pages</strong></h3><p>It turns out that in PostgreSQL, SELECTs (which are usually believed to be read-only) may cause modifications to some database records, which Postgres will then write to disk.</p><blockquote>Wait, what?</blockquote><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*mQYGrlnPT1wl88F5." /></figure><p>PostgreSQL uses MVCC (MultiVersion Concurrency Control) technology for transactional integrity. All changes to database records happen only in transactions. Each transaction is assigned an identification number; PostgreSQL refers to this transaction ID as <strong><em>txid </em>(int32)</strong>. Table data — records — is represented as <strong>tuples</strong>. A tuple contains the data of one particular row, as well as metadata associated with that row:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/914/0*YZtBy67fvFm6vpKh." /></figure><p><em>Picture: </em><a href="http://www.interdb.jp"><em>www.interdb.jp</em></a></p><p>PostgreSQL refers to the id of the transaction that created a particular tuple<strong> </strong>as the <strong><em>xmin</em> </strong>(or<em> t_xmin</em>) of the tuple. And the id of the transaction that marked this tuple as deleted (if any) is referred to as the <strong><em>xmax</em> </strong>(or <em>t_xmax</em>).</p><p>Here’s how they are set and changed:</p><ul><li>An <strong>INSERT</strong> into a table creates a new tuple with <strong><em>xmin = txid</em></strong> of that <strong>INSERT</strong> transaction.</li><li>A <strong>DELETE</strong> marks the tuple as deleted, setting its <strong><em>xmax</em> <em>= txid</em></strong> of that <strong>DELETE</strong>.</li><li>An <strong>UPDATE</strong> works as a combination of a <strong>DELETE</strong> and an <strong>INSERT</strong>.</li></ul><p>A SELECT statement queries the database and retrieves the selected data from a specified table, while also performing a <strong>visibility check</strong>, which goes as follows:</p><p>At a high level, a transaction with the <strong><em>txid1</em></strong> identifier will ”see” a specific tuple only if the following condition is met:</p><p><strong><em>xmin &lt;= txid1 &lt;= xmax</em></strong></p>
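<p>You can actually look at these system columns on any table; a quick sketch (my_table here is a hypothetical table name; xmax shows as 0 for live, undeleted tuples):</p><pre>-- xmin and xmax are hidden system columns, available on every row<br>SELECT xmin, xmax, ctid, *<br>FROM my_table<br>LIMIT 5;</pre>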
<p>Although tuple changes occur immediately, transactions may take a long time to complete. This is why, during the visibility check, it is also necessary to check whether the transactions with identifiers <strong><em>xmin</em></strong> and <strong><em>xmax</em></strong> have completed or not, and what the status of each completed transaction was.</p><p>For example, even though for a particular tuple <strong><em>xmin</em></strong> might be <strong><em>&lt; txid1</em></strong>, the transaction with <strong><em>txid = xmin</em>,</strong> which created this tuple, might still be in flight and might yet fail, leading to the removal of this tuple. So this tuple should nevertheless be “invisible” to <strong><em>txid1</em></strong>.</p><p>PostgreSQL stores information about the current state of each transaction in the commit log (<strong>CLOG</strong>). Checking the states of a large number of transactions in the CLOG is resource-intensive, so Postgres caches information about transaction states directly in the header of the tuple. For example, if during a SELECT it is recognised that the <strong><em>xmin</em></strong> transaction has completed, PostgreSQL saves this knowledge into the so-called <strong>hint bits</strong> of the tuple<strong>. </strong>Both xmin and xmax statuses are recorded in these hint bits, which are placed in the <strong>infomask</strong> part of the tuple header.</p><p>We previously described the process of changing tuples; however, to complete our investigation, we need to clarify the meaning of “<strong>dirty pages</strong>” in PostgreSQL. PostgreSQL works with data stored on disk as well as in memory, and organizes it in blocks, or “<strong>pages</strong>”, for efficiency. Each page contains a number of tuples and their associated metadata. If a single tuple is modified, the entire page is marked as “<strong>dirty</strong>”, meaning the data in memory now differs from the corresponding data saved to disk, and the modified page must therefore be synchronized with the disk. Additionally, these modifications are recorded in the <strong>WAL </strong>(write-ahead log), so that data integrity can be restored in the event that the database process terminates abnormally.</p><h3><strong>SELECT can cause synchronous writes to disk:</strong></h3><p>As you might know, PostgreSQL works with data through a buffer cache. If the needed data is not in the buffer cache, PostgreSQL reads it from disk and puts it into the buffer cache. If there is not enough space in this cache, the least requested page is pushed out, evicted. If that page happens to be in a “dirty” condition at the time of eviction, it must be written to disk at that very moment.</p>
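<p>Putting it together, you can actually catch this in pg_stat_statements: SELECT-only statements with non-zero dirtied or written shared blocks (a sketch, assuming the pg_stat_statements extension is installed):</p><pre>-- read-only statements that nevertheless dirtied or wrote shared buffers<br>SELECT substring(query, 1, 60) AS query,<br>       calls,<br>       shared_blks_dirtied,<br>       shared_blks_written<br>FROM pg_stat_statements<br>WHERE query ILIKE &#39;select%&#39;<br>  AND (shared_blks_dirtied &gt; 0 OR shared_blks_written &gt; 0)<br>ORDER BY shared_blks_dirtied DESC<br>LIMIT 10;</pre>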
<h3>Conclusion</h3><p>Most cases of “strange” PostgreSQL behavior are caused by built-in functionality that is intended to optimize the efficiency and performance of the database.</p><p><em>Subscribe to us here, </em><a href="http://twitter.com/okmeterio/"><em>on Twitter</em></a><em> or on </em><a href="http://fb.com/okmeter.io"><em>our Facebook page</em></a><em> to receive okmeter updates, or </em><a href="https://okmeter.io/user/signup"><em>sign up with okmeter.io</em></a><em> directly.</em></p><hr><p><a href="https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f">PostgreSQL: Exploring how SELECT Queries can produce disk writes</a> was originally published in <a href="https://blog.okmeter.io">okmeter.io blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Postgresql query monitoring]]></title>
            <link>https://blog.okmeter.io/postgresql-query-monitoring-375ee8048c10?source=rss----4e571b327a8c--postgres</link>
            <guid isPermaLink="false">https://medium.com/p/375ee8048c10</guid>
            <category><![CDATA[postgres]]></category>
            <category><![CDATA[monitoring]]></category>
            <category><![CDATA[devops]]></category>
            <category><![CDATA[database-monitoring]]></category>
            <category><![CDATA[database-administration]]></category>
            <dc:creator><![CDATA[pavl t]]></dc:creator>
            <pubDate>Tue, 05 Dec 2017 13:30:36 GMT</pubDate>
            <atom:updated>2018-08-22T13:50:05.892Z</atom:updated>
            <cc:license>https://creativecommons.org/licenses/by-nc/4.0/</cc:license>
<content:encoded><![CDATA[<p>Since 2008, ‘pgsql-hackers’ have been discussing an extension for collecting query statistics reports. This extension, called ‘pg_stat_statements’, has shipped with PostgreSQL since version 8.4 and allows collecting statistical information about the queries executed by a server.</p><p>The extension is typically used by database administrators as a data source for various reports. Though this data, in fact, represents only cumulative performance indicators since the last reset of counters, such information can be very useful for monitoring query execution time, locating performance issues, and in-depth analysis of what’s going on with a database server.</p><h3>pg_stat_statements</h3><p>So, let’s take a closer look at the pg_stat_statements view (this one is from 9.4):</p><pre>postgres=# \d pg_stat_statements;<br>          View &quot;public.pg_stat_statements&quot;<br>       Column        |       Type       | Modifiers<br>---------------------+------------------+-----------<br> userid              | oid              |<br> dbid                | oid              |<br> queryid             | bigint           |<br> query               | text             |<br> calls               | bigint           |<br> total_time          | double precision |<br> rows                | bigint           |<br> shared_blks_hit     | bigint           |<br> shared_blks_read    | bigint           |<br> shared_blks_dirtied | bigint           |<br> shared_blks_written | bigint           |<br> local_blks_hit      | bigint           |<br> local_blks_read     | bigint           |<br> local_blks_dirtied  | bigint           |<br> local_blks_written  | bigint           |<br> temp_blks_read      | bigint           |<br> temp_blks_written   | bigint           |<br> blk_read_time       | double precision |<br> blk_write_time      | double precision |</pre><p>As you can see, all the queries are grouped, i.e. the statistics are collected not for individual queries, but for groups of queries which PostgreSQL considers similar (I will explain this in more detail below). All counters accumulate from the start of the PostgreSQL process or from the last call of ‘pg_stat_statements_reset’.</p><ul><li><strong>query</strong> — query text</li><li><strong>calls</strong> — number of query calls</li><li><strong>total_time</strong> — total execution time of all query calls, in milliseconds</li><li><strong>rows</strong> — number of rows returned (‘select’) or modified (‘update’) during the query execution</li><li><strong>shared_blks_hit</strong> — number of shared memory blocks returned from the cache</li><li><strong>shared_blks_read</strong> — number of shared memory blocks returned NOT from the cache. It’s not quite clear from the documentation whether this is the total number of returned blocks or only the number of blocks not found in the cache, so let’s check the source code:</li></ul><pre>/*<br>* lookup the buffer. IO_IN_PROGRESS is set if the requested block is<br>* not currently in memory.<br>*/<br>bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,<br>                     strategy, &amp;found);<br>if (found)<br>    pgBufferUsage.shared_blks_hit++;<br>else<br>    pgBufferUsage.shared_blks_read++;</pre><ul><li><strong>shared_blks_dirtied</strong> — number of shared memory blocks marked as “dirty” during the query execution (i.e. the query modified at least one tuple in the block, and the block will later have to be written to a drive by ‘checkpointer’ or ‘bgwriter’)</li><li><strong>shared_blks_written</strong> — number of shared memory blocks written synchronously to a drive during the query execution. A PostgreSQL backend writes a block synchronously when it has to evict a dirty block from the buffer cache itself.</li><li><strong>local_blks_*</strong> — similar counters for blocks that the backend treats as local, i.e. blocks used for temporary tables</li><li><strong>temp_blks_read</strong> — number of blocks of temporary files read from a drive. Temporary files are used when there is not enough memory to execute a query (the memory limit is set by the ‘work_mem’ parameter)</li><li><strong>temp_blks_written</strong> — number of blocks of temporary files written to a drive</li><li><strong>blk_read_time</strong> — total waiting time for reading blocks, in milliseconds</li><li><strong>blk_write_time</strong> — total waiting time for writing blocks to a drive, in milliseconds (only synchronous writes performed by the backend are counted; time spent by ‘checkpointer’/‘bgwriter’ is not included)</li></ul><p>Note that <strong>blk_read_time</strong> and <strong>blk_write_time</strong> are collected only when the additional ‘track_io_timing’ setting is enabled.</p>
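<p>To try the view yourself: the extension ships with PostgreSQL but has to be preloaded. A minimal setup sketch and the simplest ad-hoc query (keep in mind the caveat about raw ‘total_time’ discussed below):</p><pre>-- in postgresql.conf (restart required):<br>--   shared_preload_libraries = 'pg_stat_statements'<br>--   track_io_timing = on<br>CREATE EXTENSION IF NOT EXISTS pg_stat_statements;<br><br>SELECT query, calls, total_time, rows<br>  FROM pg_stat_statements<br> ORDER BY total_time DESC<br> LIMIT 10;</pre>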
<p>Keep in mind that ‘pg_stat_statements’ accounts only for completed queries: if a query has already been running some resource-intensive task for an hour, you will see it only in ‘pg_stat_activity’ until it finishes.</p><h3>How the queries are grouped</h3><p>Previously I assumed that queries were grouped by their execution plans. However, I noticed that queries with a different number of IN arguments usually show up in different groups, even though such queries were expected to have the same execution plan.</p><p>After looking it up in the code, I understood that queries are grouped by the hash of their “query jumble” (only the significant parts of the query text, obtained after parsing); in version 9.4 and above you can see this hash in the ‘queryid’ column.</p><p>In practice, we have to normalize and group queries further in the agent, e.g. merge a variable number of IN arguments into a single placeholder ‘?’ or replace inline arguments with placeholders; this is especially difficult when the query text is incomplete.</p><p>In versions older than 9.4, each query was cut down to ‘track_activity_query_size’; in version 9.4 and above this limitation is gone, since the query text is now stored outside of shared memory. However, we still cut large queries down to 8 KB so as not to affect the performance of PostgreSQL significantly.</p><p>This is why we cannot run a query through an SQL parser for additional normalization: any SQL parser will report an error on such truncated query text. Instead we had to write a number of heuristics and regular expressions for better query cleanup. Adding new heuristics is far from perfect, but it is the only working solution we came up with.</p><p>Yet another issue is that PostgreSQL stores in the ‘query’ field the first query text received for a group, without normalization or formatting, so resetting the counters may cause this text to be overwritten by another query from the same group that looks completely different. In addition, many developers write comments directly in a query (e.g. to indicate the query’s ID or the function that calls it), and these comments also end up in the ‘query’ field.</p><p>To avoid creating new metrics for the same queries every time, we strip comments and other such noise.</p>
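<p>You can see this splitting, which our agent then has to undo, with a quick experiment (the ‘items’ table is made up for the example):</p><pre>SELECT pg_stat_statements_reset();<br>SELECT * FROM items WHERE id IN (1, 2);<br>SELECT * FROM items WHERE id IN (1, 2, 3, 4);<br>SELECT queryid, calls, query FROM pg_stat_statements WHERE query LIKE '%items%';<br>-- two separate rows: a different number of IN arguments yields a different queryid</pre>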
<h3>The question to ask</h3><p>Together with our friends from PostgreSQL Consulting, we have deeply analyzed PostgreSQL’s internals and picked the most useful metrics for locating database issues.</p><p>The goal of our monitoring is to answer the following questions:</p><ul><li>How does the database operate at this moment compared to previous periods?</li><li>Which queries are the most resource-intensive for the server (by CPU, drive, etc.)?</li><li>How many queries (by type) are received?</li><li>How quickly are various queries executed?</li></ul><h3>Collecting the metrics</h3><p>In fact, it’s not reasonable to monitor counters for all queries, so we picked the top 50 queries for our analysis. However, we cannot simply apply ‘top’ to ‘total_time’, because the ‘total_time’ values for new queries will remain much lower than those for older queries for a long time.</p><p>We decided to apply ‘top’ to the derivative of ‘total_time’ (its rate). To do this, our agent reads the whole of ‘pg_stat_statements’ and keeps the previous values of the counters. Then, for each counter of each query, we additionally group similar queries (which Postgres considers different) and sum up their statistics. Finally, we apply ‘top’ to them and create dedicated metrics, while all remaining queries are summed up and written to the “~other” query.</p><p>As a result, we obtain 11 metrics for each query from the ‘top’:</p><ul><li>postgresql.query.time.cpu (we simply subtracted the total drive waiting time from ‘total_time’ for convenience)</li><li>postgresql.query.time.disk_read</li><li>postgresql.query.time.disk_write</li><li>postgresql.query.calls</li><li>postgresql.query.rows</li><li>postgresql.query.blocks.hit</li><li>postgresql.query.blocks.read</li><li>postgresql.query.blocks.written</li><li>postgresql.query.blocks.dirtied</li><li>postgresql.query.temp_blocks.read</li><li>postgresql.query.temp_blocks.written</li></ul><p>Each of them has a set of attributes (labels):</p><pre>{“database”: “&lt;db&gt;”, “user”: “&lt;user&gt;”, “query”: “&lt;query&gt;”}</pre><blockquote>Detailed description — <a href="https://okmeter.io/i/integrations/postgresql-monitoring?lang=en&amp;utm_content=twitch_pg&amp;utm_medium=tldr">Postgres Query Monitoring metrics.</a></blockquote><h3>Interpreting the metrics</h3><p>Users are often confused by the ‘postgresql.query.time.*’ metrics and their physical interpretation. Though it’s not always clear what the total response time really means, such metrics are a good illustration of how various processes interact with one another.</p><p>If we leave lock waits out of the picture, we can consider that PostgreSQL utilizes certain resources (CPU or disks) during query execution, and thus we can express this usage in resource-seconds per second, or as the share of one CPU core a query utilizes versus the total CPU utilization.</p>
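<p>Under this interpretation, the per-query split between CPU and disk time can be read straight from the view; here is a sketch of the calculation behind the postgresql.query.time.* metrics (the counters are cumulative, so the agent actually works with their rates):</p><pre>SELECT query,<br>       total_time - blk_read_time - blk_write_time AS cpu_ms,<br>       blk_read_time  AS disk_read_ms,<br>       blk_write_time AS disk_write_ms<br>  FROM pg_stat_statements<br> ORDER BY cpu_ms DESC<br> LIMIT 10;<br>-- blk_read_time / blk_write_time stay at zero unless track_io_timing is on</pre>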
<h3>Let’s see what we’ve got</h3><p>First, we need to check whether our metrics are usable. For example, let’s try to find out why our database server performs more disk write operations than usual.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9X4GPhGmkip-Z05g." /></figure><p>Let’s check whether PostgreSQL wrote anything to the disks at that moment:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZJ09Gr5SKtrKfwhF." /></figure><p>Then we can figure out which queries have “dirtied” the pages:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*aBDEOQp8Ez6m2MR-NXbJ3Q.png" /></figure><p>As we can see, though the query chart is not exactly the same as the buffer write chart, there is a clear correlation between the two. The difference arises because the block writing is performed in the background, which changes the drive utilization profile.</p><p>Now let’s look at the charts for read operations:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*-gsfCrzV_AlAQ6r7." /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2bG4gY8zG8ahR79a." /></figure><p>Again, there is a correlation, but not an exact match: PostgreSQL does not read blocks from the drive directly, but through the file system’s cache, so part of the actual drive workload is hidden.</p><p>The CPU utilization can also be attributed to specific queries, though the analysis is not entirely accurate due to possible locks and other delays:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*M1kSeRUWAyVM81ZV." /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fxfrNW2zJKVTSQcG." /></figure><h3>Summary</h3><ul><li>We believe that ‘pg_stat_statements’ is a really great extension which provides in-depth statistics without putting a heavy load on the server.</li><li>However, users should keep in mind certain assumptions and inaccuracies in order to interpret these metrics correctly.</li></ul>
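<p>If you want to reproduce attributions like the ones above ad hoc, without an agent, the corresponding counters can be pulled straight from the view (they are cumulative, so for real charts you still need periodic snapshots and rates):</p><pre>SELECT query, shared_blks_dirtied, shared_blks_written, shared_blks_read<br>  FROM pg_stat_statements<br> ORDER BY shared_blks_dirtied DESC<br> LIMIT 10;</pre>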
<blockquote><a href="https://okmeter.io/pg?lang=en&amp;utm_source=blog_okmeter&amp;utm_content=pg_q_mon">Okmeter.io</a> provides complete and ready-to-use Postgres monitoring. Though we have an online live demo with <a href="https://okmeter.io/example/autodash/postgresql/source_hostname=db-postgresql1/instance=127.0.0.1:5432?lang=en&amp;utm_source=blog_okmeter&amp;utm_content=pg_q_mon">query charts</a> and <a href="https://okmeter.io/example/dashboards/postgresql-queries?lang=en&amp;utm_source=blog_okmeter&amp;utm_content=pg_q_mon">statistics</a>, it is still a demo with a synthetic workload, and it doesn’t quite resemble real life.</blockquote><p>It’s always better to try it out on your own project to see what’s happening in your actual production environment.</p><blockquote>We offer a <a href="https://okmeter.io/pg?lang=en&amp;utm_source=blog_okmeter&amp;utm_content=pg_q_mon"><em>two-week free trial</em></a> that you can use just for that!</blockquote><p>I encourage you to try it; at the very least, you’ll know that <strong>everything is OK</strong> with your database and services.</p><p>If you want to know more about Postgres operations, you’ll find this article on “<a href="https://blog.okmeter.io/postgresql-exploring-how-select-queries-can-produce-disk-writes-f36c8bee6b6f">When SELECT can cause a data change and a disk write</a>” truly interesting!</p><hr><p><a href="https://blog.okmeter.io/postgresql-query-monitoring-375ee8048c10">Postgresql query monitoring</a> was originally published in <a href="https://blog.okmeter.io">okmeter.io blog</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>