How Two Developers Accelerated Pipedrive’s PHP-Monolith

Artjom Kurapov
Dec 26, 2019 · 13 min read

Background

Pipedrive, like many startups, started with a PHP-monolith, which has since grown into legacy code that we are trying to get rid of. Legacy code is harder to test and deploy than small node/go services, and it's more difficult to grasp because so many of its features were written more than five years ago. Because of this, we have an informal agreement not to add any new functionality to this repo anymore.

Despite its faults, the PHP-monolith still serves the biggest chunk of public API traffic (>80%) and more than 20 internal services depend on it, making it critical for the business. The challenging part is to keep performance high as we make it more decentralized.

In a 2018 mission that I participated in, we finally “dockerized” it, but it still remained on PHP 7.0. At the beginning of 2019 the planets aligned and a Backend performance boost mission became possible.

Motivation

First, I personally wanted to get into the field of performance engineering, so this was a great chance. Second, anyone within the company can pitch and launch missions, so the opportunity was there. Finally, as I later found out, our manager was also interested in improving performance.

The triggering moment came by accident: while debugging some endpoint, I decided to try blackfire to understand how the function calls were made, and I saw a dismal picture of translations taking a huge percentage of the total request time. I thought it would be fantastic to rewrite this part and learn how to lead a mission in the process.

Preparations

Since I was afraid of pitching a mission to a 100+ audience without any slides, and because Pipedrive is a data-driven company, I started researching which metrics I could use for the mission goals.

I won’t give you the exact numbers I got, but it was a grim picture indeed. Even though we had a pretty low average API latency (160ms), I still saw some of our clients with endpoints taking anywhere from several seconds up to several minutes to load (bulk API).

The main goal of the mission was to decrease the percentage of sluggish GET requests by 20% (from 5.2% to 4.1% of the total request count). The secondary goal was to decrease the number of insufficient-memory errors by 80%. On the business side, these goals also aligned with decreasing the client churn rate.

The mission plan was to deal with the global speedup first and then work on individual endpoints.

Non-standard lightweight Trello + Teamgantt to plan the mission

First failure

I planned a mission for 4 developers, but only one brave soul volunteered and so we started our journey with just the two of us.

First I wanted to get an easy win and just get rid of these slow translations. The main complexity was that the gettext function loaded binary .mo files coming from a third-party translation system and unpacked them on every request. Or so I thought.

After a full rewrite, with nice unit-test coverage, I got a 43% speed-up on my machine. But as it turned out, in production gettext loaded translations only on the first attempt and then used an in-memory cache afterwards. At runtime it was still fast.
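
For context, a minimal gettext setup looks roughly like the sketch below (the locale, domain and paths are illustrative): the extension parses the binary .mo catalog once per PHP process and serves later lookups from its own in-memory cache, which is why the rewrite was a big win locally but negligible in production.

// locale, domain and path are illustrative, not our production values
putenv('LC_ALL=en_US.UTF-8');
setlocale(LC_ALL, 'en_US.UTF-8');
bindtextdomain('messages', '/app/locale'); // expects /app/locale/en_US/LC_MESSAGES/messages.mo
textdomain('messages');

// the first call in a worker parses the .mo file; subsequent calls
// are served from the extension's in-memory cache
echo _('Add activity');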

It was a first-week failure with the only positive being a better understanding of the developer experience.

Metrics

In the second week, I realized that we didn’t have enough visibility into performance. As a result, developers were reluctant to approve the translation pull requests without seeing the numbers first.

We used newrelic only in production, and blackfire worked for me only locally, so I couldn’t prove that our optimization had an impact in an isolated, production-like environment. We simply couldn’t continue like this.

I quickly wired up metric gathering with datadog, which our infra team was already testing out. This gave us visibility into the average latency in test environments while API tests were executed, and it also gave us a nice per-endpoint latency view.
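
As an illustration only, not our actual setup: assuming the datadog/php-datadogstatsd client, a per-endpoint timing metric could be shipped to the local agent roughly like this (the metric name, tags and $endpoint variable are made up).

//composer require datadog/php-datadogstatsd
use DataDog\DogStatsd;

$statsd = new DogStatsd(['host' => '127.0.0.1', 'port' => 8125]);

$start = microtime(true);
// ... handle the API request ...
$durationMs = (microtime(true) - $start) * 1000;

// hypothetical metric name and tags, sent over UDP to the local datadog agent
$statsd->timing('api.request.duration', $durationMs, 1.0, ['endpoint' => $endpoint]);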

Unfortunately, issues occurred for us here as well. Datadog for PHP turned out to be quite buggy: curl headers got overwritten and calls to elastic failed. I had to disable several datadog integrations just in case something worse happened.

During the mission I also gathered slow-query metrics in grafana, and, quite frankly, grafana’s dashboards and support for multiple data sources turned out to be much more powerful. In the end we used several systems side by side to get a better understanding of what was going on.

Newrelic, Grafana & Datadog dashboards used in the mission

Upgrading PHP to 7.1 and 7.2

Updating the PHP version was mostly a smooth process, but it did take quite a long time due to things such as curl header handling, searching for and eliminating deprecated functions, running API tests, and so on.

echo count('ab');
// php 7.1 : 1 (count() on a non-countable value silently returns 1)
// php 7.2 : Warning: count(): Parameter must be an array or an object that implements Countable

Something that proved headache-inducing in the process was a cryptic nginx failure that returned a 500 error without any logs. After digging through php-fpm worker logs in debug mode, I finally saw a SIGSEGV fault, which gave me the idea to disable the XDebug extension in the local environment.

In the test environment, datadog reported the average latency dropping from 740ms to 338ms while running API tests, and API test execution time dropped from 23 minutes to 12. Fantastic.

After the mission finished, we split the API tests across more parallel threads (>20) and reduced the time spent on them to 5 minutes. The performance gains had a nice side effect on deployment speed too.

Datadog comparison of API test runs before & after

In production, the version upgrades gave the biggest boost within the mission scope, with the global latency decrease reaching 20%.

Impact of PHP version upgrade in production. Newrelic as source

Optimizing class autoloading

A monolith wouldn’t be a monolith if there wasn’t some mess involved.

We had multiple autoloaders registered that defined how to find a class. Some of these came via vendor dependencies (AWS, symfony), one was built into our base codeigniter framework, and naturally we had written our own to handle our modular logic.

The problem was that it was sub-optimal: it reached into the filesystem on every class load, going through 10+ folders. This IO load was easily noticeable at the top of the blackfire profile.

Pseudocode for autoloader, some file checks went through multiple directories
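
A rough reconstruction of that lookup loop (a simplified sketch; the directory list and naming scheme are illustrative, not our actual code):

spl_autoload_register(function (string $class) {
    // every class load walks this whole list until a file is found
    $directories = ['libraries', 'models', 'modules', 'helpers', 'legacy'];

    foreach ($directories as $directory) {
        $path = APPPATH . $directory . '/' . strtolower($class) . '.php';

        // file_exists() is a filesystem hit for each candidate path,
        // which is the IO load that showed up at the top of blackfire
        if (file_exists($path)) {
            require_once $path;
            return;
        }
    }
});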

We had a legacy folder with some of the libraries, so we couldn’t use composer until the monolith got dockerized. This meant we needed to migrate those libraries to correct versions and start using the composer autoloader, following PSR namespace logic.

After cleaning up our custom autoloader and switching to composer with autoloader optimization (e.g. composer dump-autoload --optimize), we got a 30-70% latency reduction, depending on the API endpoint. After the mission ended, we additionally enabled composer’s APCu autoloader caching.

In the test environment, according to datadog, our latency decreased from 240 ms to 200 ms.

Datadog before and after

In production, an API request spends, on average, more time in MySQL queries and data fetching, so we gained a bit less there: only 10%.

Speeding up the most used API endpoints

According to our statistics, at the top of the (slowest × most used) endpoints was the activity list. What complicates things is that some client integrations load the entire list of activities, paginating from 0 to the very end. Some clients had as many as 1 million records there, so caching wouldn’t really help much.

The API doesn’t hit just one table; permissions are joined in as well. The ORM that generates the query is not the simplest instrument, and I wasn’t part of the team that owned the business logic. Blackfire hinted that the most expensive parts of this API call were the extract and hydrate functions in the ORM. Since there isn’t much to show here, I resorted to micro-optimizations.

On a local machine we gained a 50% latency reduction on this endpoint, while the person listing gained 25%. Without a good tool capable of automatic span instrumentation it would be very hard to detect these bottlenecks, so kudos to the blackfire team.

To debug the performance of a specific client, our OPS and DBA engineers copied and anonymized the client’s DB onto an internal machine, which I could then experiment with. Using newrelic, I could retrieve the affected companies with NRQL like:

SELECT average(duration) 
FROM Transaction
FACET company_id
WHERE company_id IN (...) AND env = 'live' AND name='WebTransaction/Custom/v1/activities/index_get'
SINCE 1 week AGO
TIMESERIES 6 hours

Slowest database queries

The first thing I noticed was that our ORM wrapped every field in ORDER BY with a NULLIF check.

Notice NULLIF added to dates

I wasn’t sure exactly why this was needed, and after checking with other developers it turned out that it was intended for date fields, so that empty dates would be interpreted as empty strings and get a specific ordering. I then narrowed down the ORM logic to check the field type and wrap dates only.
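
A minimal sketch of that change (the method name and field-type values are hypothetical, not our actual ORM code):

// Only date-like fields get the NULLIF wrapper so that empty dates sort
// as empty strings; every other field type is ordered directly.
private function build_order_by_expression(string $field, string $field_type): string
{
    if (in_array($field_type, ['date', 'datetime'], true)) {
        return "NULLIF($field, '')";
    }

    return $field;
}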

Indexes

I figured out a second useful tidbit with the help of a DBA: indexes heavily affect performance. Not only missing indexes, but wrong indexes also slowed us down.

For example, our ORM forced index usage (idx_add_time), which slowed everything down: most entities have different creation times, so the index’s cardinality didn’t help this query, and MySQL ended up doing a full scan while being unable to use any other index. In worst-case scenarios, a query took up to a minute.

explain SELECT `org`.*
FROM `org`
FORCE INDEX(idx_add_time) # <-- this forced index had to be removed
WHERE `org`.`active_flag` = 1 AND `org`.`id` = org.id
GROUP BY `org`.`id`
ORDER BY `org`.`add_time` DESC, `org`.`id` ASC
LIMIT 501
OFFSET 6500;
EXPLAIN output: with the forced index vs. without it

In other queries we used an index on active_flag, which had a similar effect because it is a boolean and most entries in the table share the same value. Removing the index helped the query pick a more relevant index, but for clients that had at least 50% of their rows marked as inactive, the removal had the opposite effect and slowed the query down instead. Unfortunately, indexes proved not to be the golden ticket we were looking for.

Subqueries

Near the end of a different mission, I tried to optimize another case where we used two queries instead of one: first we fetched user IDs and then inserted them into the IN() operator of a second query. This was apparently easier than making the ORM do it.

SELECT ...
FROM activity activity
#LEFT JOIN 10+ tables
WHERE activity.user_id IN (...)
ORDER ...
LIMIT 125000, 501;

I did finally merge them into a single query with a subselect, but it didn’t bring significant improvement. Locally I got around a 10% improvement, but in production it depended on the company; in some cases it got even slower, so I had to hide it behind a feature flag.

Async PHP attempt

Another failure was an attempt to make PHP do work in parallel using amphp.

//composer require amphp/parallel
use Amp\Parallel\Worker;
use Amp\Promise;

function get_fields_for_self($deal_fields, $person_fields) {
    $inst = new custom_field_parallel();

    // run both conversions in parallel worker processes and wait for the results
    $responses = Promise\wait(Promise\all([
        Worker\enqueueCallable([$inst, 'convert_field_types'], $deal_fields),
        Worker\enqueueCallable([$inst, 'convert_field_types'], $person_fields),
    ]));

    return [
        'deal'   => $responses[0],
        'person' => $responses[1],
    ];
}

As it turned out, initializing the streams took twice as much time as the original endpoint did. Perhaps I bootstrapped it incorrectly, or we needed another library, but at the time we decided that it was easier to split API endpoints rather than keep trying to optimize them internally.

Before & after trying out streams, 2x slowdown

Clever pagination

One of the features of the pipeline view is loading deals as the page scrolls. Visually, all of the stage lists appear to fill up equally; the number of stages is configurable by the user, so the number of deals in each stage can vary.

Under the hood, the UI calls a single API endpoint, which should be able to fill each stage of the pipeline equally, scanning horizontally from left to right. So the endpoint should be able to start at a specific offset with a variable limit.

Some fake data in the main pipeline view, using data loading on scroll

This non-trivial task was previously solved by loading all of the deal IDs into memory, scanning through them, splitting them based on which stage they are in, and then doing a horizontal 2D-scan to find the best starting position to read from. The drawback is obvious: as a company’s data grows, loading 100K rows from MySQL into PHP memory and processing them becomes slower and slower.
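
A rough sketch of what the old approach did (assuming a $db query helper plus $pipeline_id, $offset and $limit inputs; all names here are illustrative):

// load every deal id + stage id of the pipeline into PHP memory
$rows = $db->query('SELECT id, stage_id FROM deals WHERE pipeline_id = ?', [$pipeline_id]);

// bucket the ids per stage
$per_stage = [];
foreach ($rows as $row) {
    $per_stage[$row['stage_id']][] = $row['id'];
}

// 2D-scan: walk the buckets row by row across all stages, starting at the
// requested depth, until the page is full
$page      = [];
$max_depth = $per_stage ? max(array_map('count', $per_stage)) : 0;

for ($depth = $offset; $depth < $max_depth && count($page) < $limit; $depth++) {
    foreach ($per_stage as $stage_id => $ids) {
        if (isset($ids[$depth])) {
            $page[] = $ids[$depth];
        }
    }
}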

Data volumes and usage patterns like this probably weren’t anticipated, but as it turns out, some companies do work this way.

To optimize this process, we rewrote the SQL to use MySQL user variables, gaining an 86% speedup.

SET @count := 0, @tmp_stage_id := "";

SELECT
  pipeline_deals.id,
  # running row number within each stage, reset whenever the stage changes
  @count := IF(@tmp_stage_id = pipeline_deals.stage_id, @count + 1, 1) AS same_stage_count,
  @tmp_stage_id := pipeline_deals.stage_id AS stage_id
FROM (
  SELECT ...
  FROM deals
) AS pipeline_deals
# ordering by the per-stage row number interleaves the stages: the first deal
# of every stage, then the second of every stage, and so on
ORDER BY same_stage_count ASC, pipeline_deals.stage_id ASC
LIMIT ...
Yay, -100k rows, thanks Navaneeth

Ending the mission

In addition to the cases described above, we were also dealing with memory exhaustion errors.

By the end of the mission (since we were monitoring the key metrics so closely), we spotted a sudden increase caused by one of the clients using search extensively, generating huge traffic with every request exceeding 400ms.

Search endpoint latency suddenly rising due to “abuse”

It’s very difficult to fully dissect cause and effect. Did the impact really come directly from our changes, or was there a decrease in API usage that month? Maybe other teams optimized their frontend without us being aware?

For product deployments it’s easier to A/B test impact with feature flags, but for performance improvements we would need container-level canary releases (e.g. PHP 7.2 vs 7.3) rolled out to different clients to prove the results statistically within a timeframe of several days.

Comparing the key metrics at the beginning of the mission (2nd week) with those at the end (8th week), we saw a 35% decrease in the number of slow requests. Our average API response time dropped from 158ms to 127ms, and we got rid of some nasty 500 errors caused by insufficient-memory problems.

Percent of slow API calls dropping from the beginning of the mission
Mission goals exceeded

Lessons learned

Now that the mission has ended, we are thinking about a PHP upgrade to 7.3 or 7.4, or maybe using something more extravagant like RoadRunner. We are also going to see if Kubernetes can help with canary releases, but that's dependent on a new mission…

Happy refactoring!

Thanks to David Lorbiecke, Lauri Piisang, and Mikk Mangus
