How Two Developers Accelerated Pipedrive’s PHP-Monolith

Artjom Kurapov
Published in Pipedrive R&D Blog
Dec 26, 2019 · 13 min read

Background

Pipedrive, like many startups, started with a PHP monolith that has since grown into legacy code we are trying to get rid of. It is harder to test and deploy than our small Node.js/Go services, and it's more difficult to grasp because so many of its features were written more than five years ago. Because of this, we have an informal agreement not to add any new functionality to this repo anymore.

Despite its faults, the PHP-monolith still serves the biggest chunk of public API traffic (>80%) and more than 20 internal services depend on it, making it critical for the business. The challenging part is to keep performance high as we make it more decentralized.

In a 2018 mission that I participated in, we finally "dockerized" it, but it still ran on PHP 7.0. At the beginning of 2019 the planets aligned and a Backend performance boost mission became possible.

Motivation

First, I personally wanted to get into the field of Performance Engineering, so this was a great chance. Second, anyone within the company can pitch and launch missions, so the opportunity was there. Finally, as I later found out, our manager was also interested in improving performance.

The triggering moment came when, while debugging some endpoint, I decided to try Blackfire to understand how function calls were being made, and I saw a dismal picture: translations were taking a huge percentage of the total request time. I thought it would be fantastic to rewrite this part and learn how to lead a mission in the process.

Preparations

Since I was afraid of pitching a mission to a 100+ audience without any slides, and because Pipedrive is a data-driven company, I started researching what metrics I could use for the mission goals.

  • What is the percentage of public API requests with a GET method that are slower than 400ms? What about POST/DELETE?
  • What's the latency of the API requests used to add data in bulk? What if we look not at the average, but at 3-sigma coverage (99.7% of users)?
  • What's the maximum time an API call can wait? Why?
  • What's the percentage of API calls that make more than 70 DB requests?
  • How many DB requests are made on average, and at the 99th percentile?
  • What's the maximum number of DB calls that one API call can generate?
  • How many API calls are lost due to the DB going down? Due to insufficient memory of the process?
  • How much memory does a PHP process consume on average?

I won’t give you the exact numbers I got, but it was a grim picture indeed. Even though we had a pretty low average API latency (160ms), I still saw some of our clients with endpoints loading from multiple seconds up to several minutes (bulk API).

The main goal of the mission was to decrease the percentage of sluggish GET requests by 20% (from 5.2% to 4.1% of total request count). The secondary goal was to decrease the number of insufficient-memory errors by 80%. From the business side, the goal also aligned with decreasing client churn rate.

The mission plan was to deal with the global speedup first and then work on individual endpoints.

Non-standard lightweight Trello + Teamgantt to plan the mission

First failure

I planned a mission for 4 developers, but only one brave soul volunteered and so we started our journey with just the two of us.

First I wanted to get an easy win and just get rid of these slow translations. The main complexity was that the gettext function loaded binary .mo files coming from a third-party translation system and unpacked them on every request. Or so I thought.

After a full rewrite with nice unit-test coverage, I got a 43% speed-up on my machine. But as it turned out, in production gettext loaded translations only on the first request and used an in-memory cache afterwards, so at runtime it was already fast.

It was a first-week failure with the only positive being a better understanding of the developer experience.

Metrics

In the second week, I realized that we didn't have enough visibility into performance. This made developers reluctant to approve the translation pull requests without seeing the numbers first.

We used New Relic only in production, and Blackfire worked for me only locally, so I couldn't prove that our optimizations had an impact in an isolated, production-like environment. We simply couldn't continue like this.

I quickly wired up metric gathering with Datadog, since our infra team was already testing it out. This gave us visibility into average latency in the test environments while API tests were running, and it also gave us a nice per-endpoint latency view.
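For illustration, pushing a timing metric over the StatsD protocol (which the Datadog agent accepts on UDP port 8125) only takes a few lines. This is a minimal sketch with made-up metric and tag names, not our actual instrumentation:

<?php
// Fire-and-forget timing metric over UDP in StatsD/DogStatsD format.
// Metric and tag names are made up for this example.
function send_timing_metric(string $metric, float $ms, array $tags = []): void
{
    $payload = sprintf('%s:%d|ms', $metric, (int) round($ms));
    if ($tags) {
        $payload .= '|#' . implode(',', $tags);
    }

    $socket = @fsockopen('udp://127.0.0.1', 8125, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // metrics must never break the request
    }
    fwrite($socket, $payload);
    fclose($socket);
}

$start = microtime(true);
// ... handle the API request ...
send_timing_metric('api.request.duration', (microtime(true) - $start) * 1000, ['endpoint:activities_get']);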

Unfortunately, issues occurred for us here as well. Datadog for PHP turned out to be quite buggy: curl headers got overwritten and Elasticsearch calls failed. I needed to disable multiple Datadog integrations just in case something worse happened.

During the mission I also gathered slow-query metrics in Grafana, and, quite frankly, Grafana's dashboards and support for multiple data sources turned out to be much more powerful. With all this information, we started using multiple systems to get a better picture of what was going on.

Newrelic, Grafana & Datadog dashboards used in the mission

Upgrading PHP to 7.1 and 7.2

Updating the PHP version was mostly a smooth process, but it took quite a long time due to things such as curl header handling, the search for and elimination of deprecated functions, running API tests, etc.

echo count('ab');
// php 7.1 : 1
// php 7.2 : Warning: count(): Parameter must be an array or an object that implements Countable
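The typical fix for these warnings was a guard before counting. A rough sketch (is_countable() only arrived in PHP 7.3, so on 7.2 the check has to be spelled out):

// PHP 7.2-compatible guard around count() for values that may not be countable.
$total = (is_array($items) || $items instanceof Countable) ? count($items) : 0;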

Something that proved headache-inducing in the process was a cryptic nginx failure that returned a 500 error without any logs. After digging through php-fpm worker logs in debug mode, I finally saw a SIGSEGV fault, which gave me the idea to disable the Xdebug extension in the local environment.

In the test environment, Datadog showed an average latency reduction from 740ms to 338ms while running API tests, and API test execution time dropped from 23 minutes to 12. Fantastic.

After the mission finished, we split the API tests into more parallel threads (>20) and reduced the time spent on them to 5 minutes. The performance gains had a nice side effect on deployment speed too.

Datadog comparison of API test runs before & after

In production, the version upgrades gave the biggest boost within the mission scope, and the global latency decrease reached 20%.

Impact of PHP version upgrade in production. Newrelic as source

Optimizing class autoloading

The monolith wouldn't be a monolith if there wasn't some mess involved.

We had multiple autoloaders registered, each defining how to find a class. Some of these came via vendor dependencies (AWS, Symfony), one was built into our base CodeIgniter framework, and naturally we had written our own to better handle our modular logic.

It was sub-optimal because it reached into the filesystem on every class load, going through 10+ folders. This IO load was clearly visible at the top of Blackfire profiles.

Pseudocode for autoloader, some file checks went through multiple directories
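The original pseudocode was shown as an image, but the shape of it was roughly the following (the directory names here are made up):

<?php
// Simplified version of our custom autoloader: probe a list of directories
// on every class load until a matching file is found. Paths are illustrative.
spl_autoload_register(function (string $class): void {
    $candidateDirs = ['app/libraries', 'app/models', 'app/modules']; // 10+ in reality
    $file = str_replace(['\\', '_'], '/', $class) . '.php';

    foreach ($candidateDirs as $dir) {
        $path = $dir . '/' . $file;
        if (file_exists($path)) {   // a filesystem hit for every miss
            require_once $path;
            return;
        }
    }
});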

We had a legacy folder with some of the libraries, so we couldn't use Composer until the monolith got dockerized. That meant migrating those libraries to their correct versions and starting to use the Composer autoloader with its PSR namespace logic.

After cleaning up our custom autoloader and adding Composer with autoloader optimization, we got a 30-70% latency reduction, depending on the API endpoint. After the mission ended, we additionally enabled the APCu autoloader optimization.

In the test environment, according to Datadog, our latency decreased from 240ms to 200ms.

Datadog before and after

In a production environment, an API request spends comparatively more time in MySQL queries and data fetching, so we won a bit less there: only 10%.

Speeding up the most used API endpoints

According to our statistics, at the top of the (slowest × most used) endpoints was the activity list. What complicated things was that some client integrations loaded the entire list of activities, paginating from 0 to the absolute end. Some clients had as many as 1 million records there, meaning caching wouldn't really help much.

The API doesn't hit just one table; permissions are joined in. The ORM that generates the query is not the simplest instrument, and I wasn't part of the team that owned the business logic. Blackfire hinted that the most expensive parts of this API call were the extract and hydrate functions in the ORM. Since there isn't much to show here, I resorted to micro-optimizations:

  • decreasing or replacing in_array and array_key_exists calls in favor of isset()
  • caching data in PHP static variables to avoid expensive calls; for example, the MonetaryFormatter class was instantiated very often, so I made it a singleton (see the sketch after this list)
  • using early-termination logic in loops to decrease the number of if-checks
  • reducing the number of CPU-expensive calls involving string parsing (is_empty_date)
  • reducing the number of DB calls (findPermittedUsers)
  • reducing the amount of data transported from the DB (see the pipeline pagination example below)
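Here is a condensed sketch of the first two ideas. It is illustrative only: MonetaryFormatter stands in for our real class, and the actual code was more involved.

<?php
// Caching an expensive-to-construct helper in a static property (simplified singleton).
class MonetaryFormatter
{
    /** @var self|null */
    private static $instance;

    public static function getInstance(): self
    {
        if (self::$instance === null) {
            self::$instance = new self();   // constructed once per process
        }
        return self::$instance;
    }

    public function format(int $amountInCents, string $currency): string
    {
        return sprintf('%s %.2f', $currency, $amountInCents / 100);
    }
}

// Key lookups with isset() on a flipped map instead of in_array()'s linear scan.
$permittedUserIds = [42, 57, 101];                 // sample data
$permittedIds = array_flip($permittedUserIds);     // [42 => 0, 57 => 1, 101 => 2]
$isPermitted = isset($permittedIds[57]);           // true, O(1) lookup

$formatted = MonetaryFormatter::getInstance()->format(129900, 'EUR');   // "EUR 1299.00"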

On a local machine we gained a 50% latency reduction on this endpoint, while the person listing got us 25%. Without a good instrument capable of automatic span instrumentation, it would be very hard to detect these bottlenecks, so kudos to the Blackfire team.

To debug the performance of a specific client, our OPS and DBA engineers copied and anonymized the client's DB onto an internal machine, which I could then experiment with. Using New Relic, I could retrieve the relevant companies with NRQL like this:

SELECT average(duration)
FROM Transaction
WHERE company_id IN (...) AND env = 'live' AND name = 'WebTransaction/Custom/v1/activities/index_get'
FACET company_id
SINCE 1 week AGO
TIMESERIES 6 hours

Slowest database queries

The first thing I noticed was that our ORM wrapped all fields in ORDER BY with a NULLIF check.

Notice NULLIF added to dates

I wasn't sure exactly why this was needed, and after checking with other developers it turned out to be intended for date fields: an empty date should be interpreted as an empty string to get a specific ordering. I then narrowed the ORM logic down to check the field type and wrap dates only.
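Schematically, the change looked something like this; the helper and field-type names are illustrative, not our ORM's real API:

<?php
// Wrap only date-like fields in NULLIF() for ORDER BY; leave other columns untouched.
// The function and the type names are made up for illustration.
function build_order_by_expression(string $column, string $fieldType): string
{
    if (in_array($fieldType, ['date', 'datetime'], true)) {
        return sprintf("NULLIF(%s, '')", $column);   // empty dates sort together
    }
    return $column;
}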

Indexes

A second useful tidbit I figured out with the help of our DBA: indexes heavily affect performance. Not only do missing indexes slow you down, having the wrong indexes does too.

For example, our ORM forced the use of an index (idx_add_time), which slowed everything down: most entities had different creation times, so the cardinality made no sense for this query, and MySQL ended up doing a full scan because it couldn't use any other index. In worst-case scenarios, a query took up to a minute.

explain SELECT `org`.*
FROM `org`
FORCE INDEX(idx_add_time) # <-- this is what needed to be removed
WHERE `org`.`active_flag` = 1 AND `org`.`id` = org.id
GROUP BY `org`.`id`
ORDER BY `org`.`add_time` DESC, `org`.`id` ASC
LIMIT 501
OFFSET 6500;
With forced index
Without forced index

In other queries we used active_flag, which had a similar effect: since it's a boolean, most of the entries in the table had the same value. Removing the index helped the query use a more relevant index, but for clients that had at least 50% of rows marked as inactive, the removal had the opposite effect and slowed the query down instead. Unfortunately, indexes proved not to be the golden ticket we were looking for.

Subqueries

Near the end of a different mission, I tried to optimize another case where we used two queries instead of one: first we got the user IDs, then inserted them into the IN() operator of a second query. This was apparently easier than making the ORM do it.

SELECT ...
FROM activity activity
#LEFT JOIN 10+ tables
WHERE activity.user_id IN (...)
ORDER ...
LIMIT 125000, 501;

I did finally merge them into a single query with a subselect, but it didn't bring significant improvement. Locally I got around a 10% improvement, but in production it depended on the company; in some cases it got even slower, so I had to hide it behind a feature flag.

Async PHP attempt

Another failure was an attempt to make the PHP process do work in parallel using amphp.

// composer require amphp/parallel
use Amp\Promise;
use Amp\Parallel\Worker;

function get_fields_for_self(array $deal_fields, array $person_fields): array
{
    $inst = new custom_field_parallel();

    // Run both conversions in parallel worker processes and wait for both results.
    $responses = Promise\wait(Promise\all([
        Worker\enqueueCallable([$inst, 'convert_field_types'], $deal_fields),
        Worker\enqueueCallable([$inst, 'convert_field_types'], $person_fields),
    ]));

    return [
        'deal'   => $responses[0],
        'person' => $responses[1],
    ];
}

As it turned out, initializing the streams took twice as much time as the original endpoint did. Perhaps I bootstrapped it incorrectly or we need a different library, but at the time we decided that it's easier for us to split API endpoints than to keep trying to optimize them internally.

Before & after trying out streams, 2x slowdown

Clever pagination

One of the features of the pipeline view is loading deals on page scroll. Visually, all of the stage lists appear to fill up equally, and since the number of stages is configurable by the user, the number of deals in each stage can vary.

Under the hood, the UI calls a single API endpoint, which has to fill each stage of the pipeline equally, scanning horizontally from left to right. So the endpoint must be able to start at a specific offset with a variable limit.

Some fake data in the main pipeline view, using data loading on scroll

This non-trivial task was previously solved by loading all of the deal IDs into memory, scanning through them, splitting them by the stage they belong to, and then doing a horizontal 2D scan to find the best starting position to read from. The drawback is obvious: as a company's data grows, loading 100K rows from MySQL into PHP memory and processing them becomes slower and slower.
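Roughly, the old in-memory approach looked like the sketch below. It is heavily simplified; the real code had a lot more bookkeeping around filtering and permissions.

<?php
// Old approach, heavily simplified: load every deal ID with its stage into memory,
// then walk the stages round-robin ("horizontally") until the requested offset is reached.
function find_page_of_deals(array $dealRows, int $offset, int $limit): array
{
    $byStage = [];
    foreach ($dealRows as $row) {                 // $dealRows can be 100K+ rows
        $byStage[$row['stage_id']][] = $row['id'];
    }

    $page = [];
    $skipped = 0;
    for ($i = 0; count($page) < $limit; $i++) {   // i-th deal of every stage, left to right
        $rowTaken = false;
        foreach ($byStage as $ids) {
            if (!isset($ids[$i])) {
                continue;
            }
            $rowTaken = true;
            if ($skipped++ < $offset) {
                continue;                          // still before the requested offset
            }
            $page[] = $ids[$i];
            if (count($page) === $limit) {
                break;
            }
        }
        if (!$rowTaken) {
            break;                                 // all stages exhausted
        }
    }
    return $page;
}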

These amounts and behaviors probably weren't anticipated at the time, but as it turns out, some companies do work like this.

To optimize this process, we had to rewrite the SQL to use MySQL user variables, gaining an 86% speedup.

SET @count := 0, @tmp_stage_id := "";

SELECT pipeline_deals.id,
       @count := IF(@tmp_stage_id = pipeline_deals.stage_id, @count + 1, 1) AS same_stage_count,
       @tmp_stage_id := pipeline_deals.stage_id AS stage_id
FROM (
    SELECT ...
    FROM deals
) AS pipeline_deals
ORDER BY same_stage_count ASC, pipeline_deals.stage_id ASC
LIMIT ...
Yay, -100k rows, thanks Navaneeth

Ending the mission

In addition to the cases described above, we were also dealing with memory exhaustion errors.

By the end of the mission (and because we were monitoring the key metrics so closely), we spotted a sudden increase caused by one of the clients using search extensively, generating huge traffic and exceeding 400ms on every request.

Search endpoint latency suddenly rising due to “abuse”

It's very difficult to fully dissect cause and effect. Did the impact really come directly from our changes, or was there a decrease in API usage that month? Maybe other teams optimized their frontend without us being aware?

For product deployments, it's easy to A/B test impact with feature flags, but for performance improvements we would need container-level canary releases (ex. PHP v7.2 vs v7.3) rolled out to different clients to prove the results statistically within a timeframe of several days.

Comparing the key metrics at the beginning of the mission (2nd week) with the metrics at the end (8th week), we saw the number of slow requests decrease by 35%. Our average API response time dropped from 158ms to 127ms, and we got rid of some nasty 500 errors caused by insufficient-memory problems.

Percent of slow API calls dropping from the beginning of the mission
Mission goals exceeded

Lessons learned

  1. Keep refactoring the legacy system; paraphrasing JFK, “not because it's easy, but because it's hard”. Performance optimization can turn into a good excuse to clean up messy code.
  2. Invest your time in metrics tooling so that you have the observability to understand what to focus on. It's too easy to start optimizing something with little impact, or to monitor too much (I found myself still watching dashboards at 1AM).
  3. Involve people from other teams early on. I spent too much time on ORM & DB optimization; having someone experienced there would have saved that time.
  4. Plan half of the mission tasks ahead. We knew that we couldn't move slow endpoints to Node.js or improve all of the queries, and there was no point in polishing specific tasks. We kept our scope small to deliver the biggest impact.

Now that the mission has ended, we are thinking about a PHP upgrade to 7.3 or 7.4, or maybe using something more extravagant like RoadRunner. We are also going to see if Kubernetes can help with canary releases, but that depends on a new mission…

Happy refactoring!

Artjom Kurapov

Software Engineer. I like bees, cats and complex spaghetti-systems