How Pyroscope Saved Us Weeks of Wasted Effort

A Lesson in Continuous Profiling

Saif Uddin Mahmud
The Klinify Blog
7 min read · Sep 16, 2021

--

The Problem

A client is POST-ing to a report-generation endpoint with a terribly large request body. The response takes more than 120s to compute, which triggers HTTP timeouts in various parts of our stack. The support team has found workarounds for now, but we need to fix this before it happens again next month. We look at the code; there’s no funky double-loop making it O(n**2). What do we do?

Photo by Ilona Frey on Unsplash

The Obvious Solution

If you’re a novice like me, “asynchronous processing” will immediately spring to mind:

  1. Record the request in your database in a requestDoc
  2. Send a message to a Broker/Queue
  3. Have a Consumer do the largeAndBulkyProcessing
  4. Have a Consumer update requestDoc with status: completed | failed
  5. Let the Frontend listen for requestDoc changes and update the UI accordingly.

Right? Wrong.

Firstly, there is a hidden, implicit assumption: that the report-generation pathway has hit a scalability limit, so we need to re-architect part of the system to work around the bottleneck. We took this at face value instead of finding out why we hit the bottleneck.

Secondly, we didn’t pause to inspect our thought process. Whenever my brain jumps to a conclusion, I find it useful to take a moment to introspect. This speed-breaker often points out the assumptions I’m making and the biases I have.

Disclaimer: Please don’t slow down when a snake jumps at you from behind a bush.

I made these costly mistakes for my third project at Klinify. Excitement got the better of me. A time-sensitive project to test new skills under pressure and make architectural changes? Sign me up!

I wrote a technical spec, consulted fellow engineers about icky edge cases common in distributed systems, modified the code base for an MVP, and deployed the services on the test cluster. 2 weeks of hard work later, I had a gradual rollout plan ready for v1 and a v2 already in the pipeline. Time to enjoy the weekend.

Curiosity Saves The Day

Over Saturday morning coffee, I stumbled onto a Medium article about profiling python code. (I know, very meta!) I had used cProfile to optimize a Pandas Data Cruncher during my Thales stint. But I’d never used a profiler on a Flask+uwsgi web app. Cue curiosity.

Turns out that it was easy, barely an inconvenience. Just wrap your Flask Application using ProfilerMiddleware and you'll have a trace file per request:

from werkzeug.middleware.profiler import ProfilerMiddleware

# profile_dir makes the middleware write one .prof trace file per request
app = ProfilerMiddleware(app, profile_dir="./profiles")

I decided to take it out for a spin and see exactly where that pesky report generation code was being so inefficient. To my surprise (and dismay), it was spending 110s in lots of time.sleep() calls. WHAT??!!

I needed to find out what the hell was going on. So I installed snakeviz, pulled the traces out with kubectl cp, and started digging around the interactive icicles and sunbursts. It looked like time.sleep() was being called while sending updates to our database.
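
If you want to do the same, the workflow is roughly the following (the namespace, pod name, and file paths here are placeholders, not our real ones):

# copy the .prof traces out of the running pod
kubectl cp my-namespace/api-pod-abc123:/app/profiles ./profiles

# open an interactive icicle/sunburst view of one trace in the browser
snakeviz ./profiles/POST.reports.prof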

A little background about Klinify’s data model will help at this point. Since we were dealing with sensitive medical data, we had to keep an audit log of any changes made to the database. Though we practiced Polyglot Persistence, the bulk of the data resided in CouchDB. The wrapper class we used for interfacing with CouchDB, KlinifyDB, made this requirement transparent to Devs by pushing changelogs to KlinifyQueue - a thin wrapper around Azure Service Bus Queue. This was the source of our problem.

Had I looked at the report-generation code closely (instead of letting my eyes glaze over the business logic), I would have realized that the code was doing the following:

  • Pull X docs from CouchDB using a single HTTP request. For the giant request that started all this, X was about 1024.
  • Make some light edits to each of the X docs, and calculate a reportDoc.
  • Update the X+1 docs using a single HTTP request.

Even 12s would have been surprisingly slow, much less 120s!

Here’s the source of the problem: even though we called KlinifyDB.update(docs: list) -> [bool] for a batch update, it turned around and did a [KlinifyQueue.push(doc) for doc in docs]. That means we sequentially made an additional X+1 POSTs, and the underlying ASBQ library was calling time.sleep() for its retry backoff (or something like that; I don’t recall the specifics). All I had to do was rearrange some code and change the API to KlinifyQueue.push(docs: list) -> [bool]. This would cut down 99.9% of the RTT delays.
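
To make the shape of the fix concrete, here’s a minimal sketch. It reuses the names above, but the bodies are stand-ins invented for illustration; the real wrappers obviously do a lot more:

class KlinifyQueue:
    """Stand-in for our Azure Service Bus Queue wrapper."""
    @staticmethod
    def push(docs: list) -> list:
        # Imagine: one POST carrying the whole batch of changelogs.
        return [True for _ in docs]

class KlinifyDB:
    """Stand-in for our CouchDB wrapper."""
    @staticmethod
    def update(docs: list) -> list:
        # The CouchDB write itself was already batched: one HTTP request.
        results = [True for _ in docs]

        # Before the fix: one changelog push per doc, each paying its own
        # round trip plus the ASBQ client's retry back-off (the time.sleep()
        # calls in the profile):
        #     for doc in docs:
        #         KlinifyQueue.push([doc])
        #
        # After the fix: a single batched push for the whole changelog.
        KlinifyQueue.push(docs)
        return results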

How much time did I spend working that Saturday? A whopping 2 hours. Facepalm.

Never Again

Having learnt several painful-but-valuable lessons, I was determined to make sure future engineers didn’t fall into the same trap. I was also hoping to snipe similar low-hanging fruit to boost our backend performance.

One simple option was to keep using ProfilerMiddleware, but the significant performance penalty, heavy disk usage, and cumbersome workflow deterred me. I wanted something with less friction. There must be a better way!

A colleague suggested I look into this new continuous profiler called Pyroscope. It:

  1. is free and open-source
  2. is simple to use
  3. samples requests
  4. is space-efficient
  5. has very little overhead
  6. is very easy to self-host within our Kubernetes cluster

Setting Up Pyroscope Server

The following are the 3 files I needed to deploy the Pyroscope Server (v0.0.29) on AKS:

pyroscope-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pyroscope-claim0
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: default

pyroscope-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: pyroscope
  name: pyroscope
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pyroscope
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: pyroscope
    spec:
      containers:
        - name: pyroscope
          image: pyroscope/pyroscope:0.0.29
          command: ["pyroscope"]
          args: ["server"]
          ports:
            - containerPort: 4040
          resources:
            limits:
              memory: 500Mi
            requests:
              memory: 150Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/lib/pyroscope
              name: pyroscope-storage
      initContainers:
        - name: pyroscope-pv-permission-fix
          image: busybox
          command: ["chmod", "-R", "777", "/data"]
          volumeMounts:
            - name: pyroscope-storage
              mountPath: /data
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      volumes:
        - name: pyroscope-storage
          persistentVolumeClaim:
            claimName: pyroscope-claim0
status: {}

pyroscope-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    app: pyroscope
  name: pyroscope
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  ports:
    - port: 4040
      targetPort: 4040
  selector:
    app: pyroscope
  type: LoadBalancer
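
Applying them is the usual kubectl routine; nothing special beyond picking a namespace:

kubectl apply -f pyroscope-pvc.yaml
kubectl apply -f pyroscope-deployment.yaml
kubectl apply -f pyroscope-service.yaml

Once the pod is up, the UI is served on port 4040 behind the internal load balancer.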

Note the internal: "true" annotation on the LB. Pyroscope did not have a login system back when I implemented this, so I put it behind the Devs’ VPN. They’ve implemented auth recently; check it out at https://pyroscope.io/docs/auth-google

Setting Up Pyroscope Clients

After that, we have to configure our services’ Dockerfiles to register them as clients to the Pyroscope Server. Here’s an example:

...
# Copy the pyroscope binary into the image so start.sh can run `pyroscope exec`.
COPY --from=pyroscope/pyroscope:latest /usr/bin/pyroscope /usr/bin/pyroscope
# Use py-spy, the sampling profiler Pyroscope uses for Python processes.
ENV PYROSCOPE_SPY_NAME=pyspy
# Point the agent at the in-cluster Service we deployed above.
ENV PYROSCOPE_SERVER_ADDRESS=http://pyroscope:4040
# The label under which this service's profiles appear in the Pyroscope UI.
ENV PYROSCOPE_APPLICATION_NAME=klinify.api
...
CMD ["/start.sh"]

and add the following to your deployment file:

securityContext:
  capabilities:
    add:
      - SYS_PTRACE
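
To be precise about placement: capabilities is only valid on a container-level securityContext, so the block goes under your application’s container entry, not the pod spec. The capability is needed because py-spy has to ptrace the running Python process in order to sample it. A sketch, with illustrative container/image names:

spec:
  template:
    spec:
      containers:
        - name: api              # your application container
          image: klinify/api     # illustrative image name
          securityContext:
            capabilities:
              add:
                - SYS_PTRACE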

and the start.sh file would have something like the following:

...
# `pyroscope exec` launches uwsgi and attaches the profiling agent to it,
# shipping samples to PYROSCOPE_SERVER_ADDRESS.
exec pyroscope exec uwsgi --ini /app/uwsgi.ini

Note: I believe Pyroscope now has integrations for different languages that you can call from within your code. They’ve also released Helm charts to make setting up the Server easier. Look into those if you want to try out Pyroscope!

Here’s what the end result looks like:

Note: This is not to say Pyroscope is perfect; it is missing some features (like any other project). If you want something, submit an issue or make a PR on GitHub. The authors are very friendly, and you should join their Slack community if you have questions 🙂

Key Takeaway

Don’t code until you find the root cause. I thought I understood the problem well enough: modify the architecture so that we can work around slow report generation. This primed me toward a certain solution and should have been a major red flag. Had I spent more time refining the problem statement to its core — make sure clients can generate big reports without a hitch — I would have been more open to considering alternatives. I’d have saved 2 weeks of engineering time.

Secondary Takeaways

Don’t be afraid of throwing your work away. Shortly after my “epiphany”, I closed all my MRs, marked all the relevant epics/documents as stale, and let the project team know that the project was closed. Had I been attached to my implementation, we’d have spent a few more weeks implementing v2 and ironing out all the bugs and missed edge cases. The silver lining is that the groundwork is already done if we ever do hit the limits of synchronous report generation (for real this time, wink).

Be open about your mistakes. I won’t say something corny like “I love making mistakes!” I don’t. But mistakes are hard-earned lessons. Make the best of your mistakes. Consolidate the learning, share it with your peers, and laugh a bit at your own stupidity. That’s how you build and maintain a Learning Culture and a Growth Mindset in your life, as well as in your organization.

Prioritize things with good ROI. I didn’t have to set up Pyroscope. I found the problem and had already deployed the fix in production. Adding continuous profiling to our stack was a conscious bet. Spending some time now to make sure we had better tooling would save us even more time in the future. It paid off. Whenever we hit performance bottlenecks over the next few months, it was easy to open up Pyroscope and validate our assumptions and hypotheses. I’m sure it prevented us from repeating my mistake several times.

Story originally published at: https://medium.com/dabbler-in-de-stress/how-pyroscope-saved-me-weeks-of-wasted-effort-dc99aeaf29e9
