Prezly technical stack

Still figuring out whether I need to own the medium I (very rarely) write content on. This is a repost of a post I wrote on my own blog.

A short description of the technical components that power Prezly and thousands of client newsrooms. The components below are ordered in a way that made sense to me :-)

SSL/Termination: Nginx

Every application request is greeted by Nginx first. Its job here is twofold:

  1. redirect non-secure traffic to SSL (301 redirects)
  2. terminate SSL

For *.prezly.com we use a number of dedicated/wildcard SSL certificates. Client newsrooms on custom domains are served using Let’s Encrypt certificates.

Nginx hands requests to Varnish or HAProxy based on some hard-coded rules. E.g. requests on rock.prezly.com are passed on to the load balancer directly because there is no use in caching them. 90% of requests are passed on to Varnish though.
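
To make the two jobs and the routing rule a bit more concrete, here is a minimal Python sketch of the decision Nginx makes per request. It is not our actual Nginx config: the `route` function, the `UNCACHED_HOSTS` set and the upstream labels are made up for illustration, and only rock.prezly.com comes from the text above.

```python
from typing import Tuple

# Hypothetical rule set: hosts that skip the cache.
# Only rock.prezly.com is named in the post.
UNCACHED_HOSTS = {"rock.prezly.com"}


def route(scheme: str, host: str, path: str) -> Tuple[str, str]:
    """Return (action, target) for one incoming request."""
    if scheme != "https":
        # Job 1: send plain HTTP to HTTPS with a permanent (301) redirect.
        return "redirect", f"https://{host}{path}"
    if host in UNCACHED_HOSTS:
        # Nothing worth caching here, so go straight to the load balancer.
        return "proxy", "haproxy"
    # Everything else goes through the cache first.
    return "proxy", "varnish"
```

For example, `route("https", "rock.prezly.com", "/")` gives `("proxy", "haproxy")`, while any other HTTPS host ends up at `"varnish"`.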

Caching: Varnish

Frigging awesome caching layer. Prezly serves newsrooms for a number of large enterprise brands, so we encounter traffic peaks from time to time. Did someone lose an aircraft? More information on the new Porsche Panamera? Looking for awesome videos of a Jetman next to an Airbus A380?

Luckily those peaks are highly cacheable. We pretty much consider the bottleneck of Varnish to be the uplink from our infrastructure to the world. During large peaks I tend to log in to see how our load balancers are holding up. Most of the time the log shipping takes more resources than handling the traffic itself. Again, Varnish is frigging awesome!

Did I mention it makes us lazy application engineers too?

Load Balancer: HAProxy

Distributing the traffic to the different application servers is done by HAProxy. The number of active webservers changes over time, so haproxy.cfg is modified automatically and hot-reloaded by Chef scripts upon server initialisation.
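
As a rough illustration of what "modified automatically" boils down to, here is a small Python sketch that renders one HAProxy backend section from a list of servers. Our real set-up uses Chef for this, so treat `render_backend`, the server names and the addresses as hypothetical.

```python
from typing import Iterable, Tuple


def render_backend(name: str, servers: Iterable[Tuple[str, str]], port: int = 80) -> str:
    """Render one HAProxy `backend` section, one health-checked server per line."""
    lines = [f"backend {name}"]
    for server_name, address in servers:
        # The `check` keyword enables HAProxy health checks for this server.
        lines.append(f"    server {server_name} {address}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up servers, just to show the rendered section.
    print(render_backend("frontend", [("web1", "10.0.0.11"), ("web2", "10.0.0.12")]))
```

Whenever a webserver comes up or goes away, the section is rendered again from the current server list and HAProxy is reloaded.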

Requests are split up into a few categories/applications (api, backend, frontend, website) and those applications have different health checks. Upon health check failure (server failure, load problem, bad code,…) requests are passed on to the next available server. Cowabunga!
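
The "pass it on to the next available server" behaviour can be sketched in a few lines of Python. This only illustrates the idea, it is not how HAProxy implements it; `pick_server` and the server names are invented for the example.

```python
from typing import Dict, List, Optional


def pick_server(servers: List[str], healthy: Dict[str, bool], start: int = 0) -> Optional[str]:
    """Return the first healthy server at or after `start`, wrapping around."""
    for offset in range(len(servers)):
        candidate = servers[(start + offset) % len(servers)]
        # Servers without a recorded health check result count as unhealthy.
        if healthy.get(candidate, False):
            return candidate
    # No server passed its health check.
    return None
```

Each application (api, backend, frontend, website) would keep its own `healthy` map, since the health checks differ per application.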

Webservers: Apache/PHP

To keep this post readable I won’t go into detail on application framework internals. So I’ll stick to stuff that fits in the context of this post.

  • Apache 2.4.7
  • PHP 7.0.7

The pool of active webservers consists of 4 machines. To minimise the monthly AWS invoice, everything is scaled down to a single webserver during weekends/downtime. Sessions are shared through a memcached instance.

The recent upgrade from PHP 5.x to PHP 7.x had a HUGE impact on performance. Response times for pages with some basic rendering logic/data fetching went from 74ms on average to less than 50ms. Memory usage is a lot lower and application performance is more predictable and stable. Large application peaks used to result in a spin-up of up to 4 webservers; since the upgrade I haven’t seen a load-based spin-up of more than 2 webservers.
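
To put a number on that: going from 74ms to 50ms is a drop of roughly a third, and since the new average is below 50ms the real gain is at least that. A tiny Python sketch of the arithmetic (the helper name is made up):

```python
def relative_improvement(before_ms: float, after_ms: float) -> float:
    """Fraction of the response time that was shaved off."""
    return (before_ms - after_ms) / before_ms


# 24ms saved out of 74ms
print(f"{relative_improvement(74, 50):.0%}")  # prints "32%"
```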

Workers: SQS/supervisord/PHP

We use a number of SQS queues to process background jobs. Those jobs are split up into different queues (p1 -> p4), which are handled by long-polling PHP scripts. The PHP scripts are daemonised with supervisord, and we spin up around 10 daemons per worker instance.

The good thing about this approach is that SQS supports long polling, which has a great impact on the performance of the PHP daemons. Before using SQS we had a self-made queuing engine that used some kind of wait() function, which caused the CPU load to go crazy during high concurrency.
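
Our workers are written in PHP, but the long-polling idea is easy to show in Python. The sketch below assumes `client` behaves like a boto3 SQS client (`boto3.client("sqs")`); `poll_once` and the `handle` callback are names I made up for the example. With `WaitTimeSeconds` set, the receive call waits on the SQS side until a message arrives or the timeout passes, so the daemon does not burn CPU in a busy loop.

```python
from typing import Any, Callable


def poll_once(client: Any, queue_url: str, handle: Callable[[str], None],
              wait_seconds: int = 20) -> int:
    """Long-poll one queue once and return the number of messages handled."""
    response = client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # SQS returns at most 10 per call
        WaitTimeSeconds=wait_seconds,  # > 0 enables long polling (max 20)
    )
    handled = 0
    # The "Messages" key is absent when the wait timed out with nothing to do.
    for message in response.get("Messages", []):
        handle(message["Body"])
        # Delete only after the handler returned without raising.
        client.delete_message(QueueUrl=queue_url,
                              ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

A daemon would simply call `poll_once` in a `while True` loop for its queue, and supervisord restarts it if it ever dies.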

Worker instances are spawned when the thresholds of the different queues are reached, right up to the point where we start noticing slowdowns in database performance. Those rules are defined as OpsWorks auto-scaling rules.
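
In Python terms the scaling rule looks roughly like the sketch below. The real rules live in OpsWorks; `desired_workers`, the threshold numbers and the cap are assumptions for illustration, with the cap standing in for the point where the database starts to slow down.

```python
from typing import Dict


def desired_workers(depths: Dict[str, int], thresholds: Dict[str, int],
                    current: int, max_workers: int) -> int:
    """Add one worker instance when any queue exceeds its threshold, up to a cap."""
    breached = any(depths.get(queue, 0) > limit
                   for queue, limit in thresholds.items())
    if breached:
        return min(current + 1, max_workers)
    return current


if __name__ == "__main__":
    # Hypothetical numbers: p1 is over its threshold, so one instance is added.
    print(desired_workers({"p1": 120, "p2": 30}, {"p1": 100, "p2": 500}, 2, 4))
```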

Static Assets: CloudFront

All images, attachments and videos uploaded by our customers are stored on AWS S3. We use CloudFront to serve those assets globally.

Database: PostgreSQL

In the past we have used MySQL, Mongo, CouchDB, MariaDB and probably a few others. Some of them ran in production concurrently, with parts of our data stored in different storage engines. Sounds logical to me.

Today we use PostgreSQL only. Gotta admit, we’ve had our pain with background jobs, vacuum/analyse operations and IO tuning but we can say that database operations are under control now.

By making use of Amazon RDS we outsource the management and maintenance of our datastore.

Logging: rsyslog/elastic

For central logging we use rsyslog to ship logs to Elasticsearch. Logs are consumed using Kibana. I plan on writing a post that elaborates on this central logging set-up.

Monitoring: 3 tools

We use New Relic. I think it’s a little pricey for what we get out of it, but for lack of a better alternative we’re sticking around.

Because the New Relic application pings/monitoring returned too many false positives, we traded that part in for uptime.com, which we are very happy with!

New Relic:

  • Application monitoring gives us good insight into end-user performance and application performance
  • Server monitoring checks disk, CPU and memory usage and reports on that in Slack

Uptime.com:

  • Global uptime
  • Domain renewals
  • SSL certificate checks

CloudWatch:

  • Server monitoring, including AWS performance indicators, which are then used to auto-scale the web servers.

All three services feed information to Opsgenie and Slack. Opsgenie then takes care of texting me out of bed when the shit really hits the fan (which hasn’t happened in the last 2 years).
