Prezly technical stack
Still figuring out if I need to own the medium I (very rarely) write content on. This is a repost of an article I wrote on my own blog.
Reverse Proxy: Nginx
Every application request is first greeted by Nginx. Its job here is twofold:
- redirecting non-secure traffic to SSL (301 redirects)
- terminating SSL
For *.prezly.com we use a number of dedicated/wildcard SSL certificates. Client newsrooms using custom domains are served making use of Let's Encrypt certificates.
Nginx passes requests off to Varnish or Haproxy based on some hard-coded rules. E.g. requests to rock.prezly.com are passed on to the load balancer directly because there is no use in caching them. 90% of requests are passed on to Varnish though.
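A rough sketch of that front door in nginx configuration. The hostnames come from this post; the ports, certificate paths and upstream addresses are illustrative assumptions, not our actual config:

```nginx
# Redirect all plain-HTTP traffic to HTTPS with a 301
server {
    listen 80;
    server_name .prezly.com;
    return 301 https://$host$request_uri;
}

# Uncacheable app traffic goes straight to haproxy (port assumed)
server {
    listen 443 ssl;
    server_name rock.prezly.com;
    ssl_certificate     /etc/nginx/ssl/prezly.pem;  # wildcard cert
    ssl_certificate_key /etc/nginx/ssl/prezly.key;

    location / {
        proxy_pass http://127.0.0.1:8080;  # haproxy
        proxy_set_header Host $host;
    }
}

# Everything else is terminated here and handed to varnish
server {
    listen 443 ssl;
    server_name .prezly.com;
    ssl_certificate     /etc/nginx/ssl/prezly.pem;
    ssl_certificate_key /etc/nginx/ssl/prezly.key;

    location / {
        proxy_pass http://127.0.0.1:6081;  # varnish
        proxy_set_header Host $host;
    }
}
```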
Caching: Varnish
Frigging awesome caching layer. Prezly serves newsrooms for a number of large enterprise brands, so we encounter traffic peaks from time to time. Did anyone lose their aircraft? More information on the new Porsche Panamera? Looking for awesome videos of a Jetman next to an Airbus A380?
Luckily those peaks are highly cacheable. We pretty much consider the bottleneck of Varnish to be the uplink from our infrastructure to the outside world. During large peaks I tend to log in to see how our load balancers are holding up. Most of the time the log shipping takes more resources than handling the traffic itself. Again, Varnish is friggin awesome!
Did I mention it makes us lazy application engineers too?
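To give an idea of why those newsroom peaks cache so well, here is a minimal VCL sketch. This is illustrative only, not our production VCL; the backend address and TTL are assumptions:

```vcl
vcl 4.0;

# Hand cache misses to the load balancer (address assumed)
backend haproxy {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Public newsroom pages don't vary per visitor, so stripping
    # cookies lets Varnish serve one cached copy to everyone
    unset req.http.Cookie;
}

sub vcl_backend_response {
    # Even a short TTL absorbs a traffic spike almost entirely
    set beresp.ttl = 2m;
}
```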
Load Balancer: Haproxy
Distributing the traffic to the different application servers is done by Haproxy. The number of active webservers is dynamic at all times, but haproxy.cfg is modified automatically and hot-reloaded by Chef scripts upon server initialisation.
Requests are split up into a few categories/applications (api, backend, frontend, website), and those applications have different health checks. Upon health check failure (server failure, load problem, bad code, …) requests are passed on to the next available server. Cowabunga!
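A hypothetical haproxy.cfg excerpt showing the idea: one backend per application, each with its own health check, with the server lines being what Chef rewrites and hot-reloads as instances come and go. Hostnames match the post; ports, IPs and health-check paths are made up:

```
frontend http-in
    bind *:8080
    acl is_api hdr(host) -i api.prezly.com
    use_backend api if is_api
    default_backend frontend_app

backend api
    option httpchk GET /healthcheck    # path is an assumption
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check

backend frontend_app
    option httpchk GET /status         # path is an assumption
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check
```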
Webservers: Apache + PHP
To keep this post readable I won't go into detail on application framework internals, so I'll stick to stuff that fits the context of this post.
- Apache 2.4.7
- PHP 7.0.7
The pool of active webservers consists of 4 machines. To minimise the monthly AWS invoice, everything is scaled down to a single webserver during weekends/downtime. Sessions are shared using a memcached instance.
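Sharing sessions through memcached is a two-line PHP configuration change. A sketch, assuming the php-memcached extension; the hostname is a placeholder:

```ini
; Store sessions in a shared memcached instance so any webserver
; in the pool can serve any user (requires the php-memcached extension)
session.save_handler = memcached
session.save_path = "sessions.internal:11211"
```

With this in place, scaling the pool up or down doesn't log anyone out, since no session state lives on the individual webservers.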
The recent upgrade from PHP 5.x to PHP 7.x had a HUGE impact on performance. Response times for pages with some basic rendering logic/data fetching went from 74ms on average to less than 50ms. Memory usage is a lot lower and application performance is more predictable and stable. Where large traffic peaks used to spin up as many as 4 webservers, since the upgrade I haven't seen a load-based spin-up of more than 2.
Queues: SQS
We use a number of SQS queues to process background jobs. Those jobs are split up into different queues (p1 -> p4) which are handled by long-polling PHP scripts. PHP is daemonised using supervisord, where we spin up around 10 daemons per worker instance.
The good thing about this approach is that SQS supports long polling, which has a great impact on the performance of the PHP daemons. Before using SQS we had a home-made queuing engine that used some kind of wait() function, which caused the CPU load to go crazy during high concurrency.
Worker instances are spawned when the thresholds of the different queues are reached, right up to the point where we start noticing slowdowns in database performance. Those rules are defined using OpsWorks auto-scaling rules.
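The supervisord side of the set-up above could be sketched like this. The program name, script path and flags are hypothetical; the ~10 daemons per worker instance come straight from this post:

```ini
; One program block per priority queue; supervisord keeps
; ~10 long-polling PHP daemons alive per worker instance
[program:queue-p1]
command=php /var/www/current/worker.php --queue=p1
numprocs=10
process_name=%(program_name)s_%(process_num)02d
autostart=true
autorestart=true
```

autorestart means a crashed daemon is respawned immediately, so a bad job can't silently stall a queue.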
Static Assets: Cloudfront
All images, attachments and videos uploaded by our customers are stored on AWS S3. We make use of CloudFront to serve those assets globally.
Database: PostgreSQL
In the past we have used MySQL, Mongo, CouchDB, MariaDB and probably a few others. Some of them ran in production concurrently, with parts of our data spread across different storage engines. Sounded logical at the time.
Today we use PostgreSQL only. Gotta admit, we've had our pain with background jobs, vacuum/analyse operations and IO tuning, but we can say that database operations are under control now.
By making use of Amazon RDS we outsource the management and maintenance of our datastore.
Logging: rsyslog + Elasticsearch
For central logging we use rsyslog to ship logs to Elasticsearch, where they are consumed using Kibana. I plan on writing a post that elaborates on this set-up.
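The shipping side of this can be as small as a few lines of rsyslog configuration using its omelasticsearch output module. A sketch under assumptions (our actual templates and index naming will differ; the server name is a placeholder):

```
module(load="omelasticsearch")

# Forward everything to the central Elasticsearch cluster in bulk
action(type="omelasticsearch"
       server="elasticsearch.internal"
       searchIndex="logs"
       bulkmode="on")
```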
Monitoring: 3 tools
We use New Relic. I think it's a little pricey for what we get out of it, but for lack of a better alternative we're sticking around.
Because the New Relic application pings/monitoring returned too many false positives, we traded that in for uptime.com, which we are very happy with!
- Application monitoring gives us good insight into end-user and application performance
- Server monitoring checks disk, CPU and memory usage and reports on that in Slack
- Global uptime
- Domain renewals
- SSL certificate checks
- Server monitoring including AWS performance indicators, which is then used to auto-scale the webservers