Robust application design on AWS — an advanced guide to using queues in the cloud.

Utilizing job queues is a game changer for microservices in the cloud. This guide presents one solution for robustness: a queue-based worker design for REST APIs, and the sites that use them, that demand 100% uptime and high fault tolerance. The solution is specific to AWS but could be adapted to Azure or Google Cloud.

This data-flow diagram is what we will be building here.

Goal of this guide

The goal of this guide is to show how to build out the skeleton of a fault-tolerant online application. It will walk you through setting up your front-end webservers as autoscaled ECS instances running Docker, with shared data in EFS; piping data from those servers through Lambda to a Symfony API which requires no database connection and uses queues to process the data via independent workers (why); and setting up a Galera cluster with HAProxy that routes INSERTs and SELECTs differently to avoid multi-master collisions. With this setup you can turn off any individual worker, service or server with no downtime; you can literally take the database down for maintenance and not lose a single record.

First up, who/what is this for?

This guide was originally conceived while I was looking to hire a high-end developer. At the time of this writing I could not find candidates with many of these technologies listed on their resumes, and further, I would end up fielding questions about how I had built the system they would be working on. I was asked multiple times if I had followed a guide to create it. I did follow 'guides' (linked herein), but no 'single' guide. This article is the culmination of those talks, with enough detail that others can replicate what I have done. I don't claim this is the best setup in the world, but it has served me very well.

So here I present an AWS-based design pattern (with examples) for anyone looking for a new and modern way to build a robust application.


Steps we are going to take in this guide

This is a LARGE guide. You can skip around as needed; the entire article covers a complete setup of the image above.

Note: where I have utilized outside setup examples, guides and documentation I will provide links.

  1. Autoscale our website(s) with ELB using R53, via ECS and Docker containers with nginx and php7-fpm on EFS. These feed our API, so we need to make sure they can grow and shrink as needed.
  2. Plug the websites into Lambda using the tool ‘serverless’.
  3. Set up the Symfony API with SQS queues and an API load balancer, and make the API calls flow to SQS.
  4. Set up the database and a specialized database load balancer for Galera.
  5. Design workers with the Symfony console
  6. Put workers into process management
  7. Optionally — Set up stats with graphite+grafana and piwik for custom stats.
  8. Wrap-up and alternates.
  9. Conclusion and contact.

Technologies used

AWS is notoriously bad with their codenames (I mean, seriously: 'Route53' when you could just have called it 'Amazon DNS' or AWSDNS?). I would suggest familiarizing yourself with them; you can use this reference to understand what each of these services does in plain language.

  1. AWS services — EC2, ECS, ELB, SQS, LAMBDA, R53, Elasticache, EFS (guides)
  2. PHP7-FPM and nginx — PHP7 is fast, nginx is fast, use them
  3. HAProxy — nuke-strike-proof load balancer
  4. Galera Cluster — true multi-master replication (whitepaper/benchmarks) *note: we moved to RDS:Aurora
  5. Docker — throw-away containers for auto-scaled groups
  6. Symfony PHP framework — particularly the Symfony console. Symfony is really great, especially when paired with PHPStorm, which I would recommend as nearly mandatory for OOP PHP dev.
    UPDATE Apr-2017: Laravel with Dingo (REST API) has several of the concepts here built-in and I would now recommend Laravel instead of Symfony, if anything for speed of setup.
  7. supervisord — process management for our workers (moving these to Lambda)
  8. serverless — lambda functions *note: using raw Lambda instead of serverless.
  9. Ubuntu and/or Amazon Linux AMIs in EC2
So without further ado, let's continue.

Step 1 — autoscaling our websites with Docker on ECS and EFS with PHP7-FPM, nginx and supervisord.

If you already have your sites set up how you like, feeding your API, continue past this.

If you have not discovered Docker yet then you, as a developer/devops, have been missing out; I would suggest becoming familiar with it before continuing. Please note that you will 'need' to use Amazon Route53 for your website DNS if you continue with this step, so if you do not need, or do not want, to set up autoscaling on ECS, skip this section.

NOTES:

Here is a base guide for Docker on AWS that I referenced a lot while setting things up: Running Docker on AWS from the ground up. This covers all the autoscaling groups and the two IAM’s needed, so make sure to go through it and pay special attention to the “Task Definitions” sections.

I used the following container as a base: sadokf/php7-nginx-supervisord. I have modified it heavily; see below.

This is a reference I used for the EFS part below: Using Amazon EFS to Persist Data from Amazon ECS Containers. Below you will see how the Docker containers attach to EFS.

I use nano instances with my Docker containers and scale in ECS based on CPU averages. This is a preference to keep costs down as I like to have five web servers minimum for my front-facing sites. There is a nice cost breakdown for instance types which may be of use for you when deciding how you want to scale.

Implementation

Here are my Dockerfile changes to the container base above.

#
# This is the webserver dockerfile.
#

FROM php:7-fpm

MAINTAINER you@email.com

RUN \
apt-get -y update && \
apt-get -y install \
curl vim wget git build-essential make gcc nasm mlocate \
nginx supervisor unattended-upgrades nfs-common jq python-pip cron

RUN echo "deb http://packages.dotdeb.org jessie all" >> /etc/apt/sources.list.d/dotdeb.org.list && \
echo "deb-src http://packages.dotdeb.org jessie all" >> /etc/apt/sources.list.d/dotdeb.org.list && \
wget -O- http://www.dotdeb.org/dotdeb.gpg | apt-key add -

#PHP7 dependencies
RUN apt-get -y update && \
apt-get -y install \
php7.0-mysql php7.0-odbc \
php7.0-curl php7.0-gd \
php7.0-intl php-pear \
php7.0-imap php7.0-mcrypt \
php7.0-pspell php7.0-recode \
php7.0-sqlite3 php7.0-tidy \
php7.0-xmlrpc php7.0-xsl \
php7.0-xdebug php7.0-redis \
php-gettext && \
docker-php-ext-install pdo pdo_mysql opcache

COPY docker/resources/etc/ /etc/
COPY docker/resources/netmounts /usr/bin/init_container
COPY docker/resources/crontab.tmp /usr/src

#install phpUnit & composer
RUN \
wget "https://phar.phpunit.de/phpunit.phar" && \
chmod +x phpunit.phar && \
mv phpunit.phar /usr/local/bin/phpunit && \
curl -sS https://getcomposer.org/installer | php -- --install-dir=/usr/local/bin --filename=composer && \
dpkg-reconfigure --priority=low unattended-upgrades && \
echo "<?php phpinfo() ?>" > /var/www/html/healthcheck.php && \
chmod +x /usr/bin/init_container

ADD . /var/www
WORKDIR /var/www

RUN \
pip install awscli && \
aws configure set preview.efs true && \
mkdir /var/www/sites

RUN usermod -u 1000 www-data

EXPOSE 80
EXPOSE 2049
EXPOSE 20048

# ONLY one CMD command per Dockerfile is executed (the last one encountered)
CMD ["/usr/bin/init_container"]

Below is the init_container script referenced in the Dockerfile. It runs when a container is fired up; if it exits, the container will fail and be restarted.

Note that it is called ‘netmounts’ in the Dockerfile above and is copied as init_container.

This follows the EFS mount instructions in AWS. To get this working as in this example you will need to create three different file systems in EFS. Replace {YOURIDHERE} below with your IDs from the EFS interface ("File System ID"), or use the one-liner in the script for EFS_FILE_SYSTEM_ID to find them programmatically.

  1. 'configs': where you will keep your nginx configs; inside the dir is a git repo with directory "configs/nginx"
  2. 'webserver_files': where you will keep files used by your webserver, with a git repo with dir "frontend/www"
  3. 'logs': where you can stash your webserver and application logs in one location so you don't have to deal with things like logstash. The logs dir is flat.
#! /bin/bash
#
# this is called /usr/bin/init_container inside the container.
#

#Join the default ECS cluster
mkdir /etc/ecs
mkdir /etc/nginx/conftmp

echo ECS_CLUSTER=default >> /etc/ecs/ecs.config
PATH=$PATH:/usr/local/bin

#Get region of EC2 from instance metadata
EC2_AVAIL_ZONE=`curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone`
EC2_REGION="`echo \"$EC2_AVAIL_ZONE\" | sed -e 's:\([0-9][0-9]*\)[a-z]*\$:\\1:'`"

# Get EFS FileSystemID attribute
# Instance needs to be added to a EC2 role that give the instance at least read access to EFS
# this is one way to get it by name, otherwise use the id's from the interface below.
#EFS_FILE_SYSTEM_ID=`/usr/local/bin/aws efs describe-file-systems --region $EC2_REGION | jq '.FileSystems[]' | jq 'select(.Name=="webserver_files")' | jq -r '.FileSystemId'`

##Backup fstab
cp -p /etc/fstab /etc/fstab.back-$(date +%F)

# WEBSERVER-files Filesystem - Append line to fstab
EFS_FILE_SYSTEM_ID=fs-{YOURIDHERE}
DIR_SRC=$EC2_AVAIL_ZONE.$EFS_FILE_SYSTEM_ID.efs.$EC2_REGION.amazonaws.com
DIR_TGT=/var/www/sites
echo -e "$DIR_SRC:/ \t\t $DIR_TGT \t\t nfs \t\t defaults \t\t 0 \t\t 0" | tee -a /etc/fstab

# LOGS Filesystem - Append line to fstab
EFS_FILE_SYSTEM_ID=fs-{YOURIDHERE}
DIR_SRC=$EC2_AVAIL_ZONE.$EFS_FILE_SYSTEM_ID.efs.$EC2_REGION.amazonaws.com
DIR_TGT=/var/log/nginx
echo -e "$DIR_SRC:/ \t\t $DIR_TGT \t\t nfs \t\t defaults \t\t 0 \t\t 0" | tee -a /etc/fstab


# CONFIGS Filesystem - Append line to fstab
EFS_FILE_SYSTEM_ID=fs-{YOURIDHERE}
DIR_SRC=$EC2_AVAIL_ZONE.$EFS_FILE_SYSTEM_ID.efs.$EC2_REGION.amazonaws.com
DIR_TGT=/etc/nginx/conftmp
echo -e "$DIR_SRC:/ \t\t $DIR_TGT \t\t nfs \t\t defaults \t\t 0 \t\t 0" | tee -a /etc/fstab

# mount them
/bin/mount -a

mv /etc/nginx/conf.d /etc/nginx/conf.back
ln -s /etc/nginx/conftmp/configs/nginx /etc/nginx/conf.d

ln -s /var/www/sites/frontend/www /usr/share/nginx

/usr/bin/crontab /usr/src/crontab.tmp
/usr/sbin/service cron start

# fire up supervisord
/usr/bin/supervisord
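The mount-target hostname the script assembles follows the availability-zone.filesystem-id.efs.region.amazonaws.com pattern. As a quick sanity check, you can reproduce the string construction locally with hypothetical values standing in for the real instance metadata:

```shell
# Hypothetical values standing in for instance metadata and an EFS ID.
EC2_AVAIL_ZONE=us-west-2a
EC2_REGION=us-west-2
EFS_FILE_SYSTEM_ID=fs-12345678

# Same construction as in init_container above.
DIR_SRC=$EC2_AVAIL_ZONE.$EFS_FILE_SYSTEM_ID.efs.$EC2_REGION.amazonaws.com
echo "$DIR_SRC"
```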

The crontab.tmp file creates a crontab that reloads any config changes automatically, so you do not need to log into many containers to update nginx.

# reload nginx
*/5 * * * * /usr/sbin/service nginx reload
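One caveat: a blind reload will take nginx down if a broken config lands on the shared EFS volume. A slightly safer cron target is a small wrapper (this is a sketch; the path and script name are hypothetical) that uses nginx's built-in config test and only reloads when the config parses:

```shell
#!/bin/bash
# Hypothetical /usr/local/bin/nginx_safe_reload, called from the crontab
# instead of reloading directly: validate the config, then reload.
if /usr/sbin/nginx -t >/dev/null 2>&1; then
    /usr/sbin/service nginx reload
fi
```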

And to build, you will need your repository URI from the AWS interface: under ECS, go to Repositories and click on your repository name (if you don't have one, create it at ECS->Repositories). Replace {YOUR-REPO} below with yours.

BUILD INSTRUCTIONS:
aws ecr get-login --region us-west-2
# run the output of the previous command to log in to ECR
docker build -t rxmg .
docker tag rxmg:latest {YOUR-REPO}:latest
docker push {YOUR-REPO}:latest

Add in server configs for nginx however your websites need them. This is a good base that will get your score on webpagetest up with gzip and caching; obviously modify as needed.

server {
    listen 80;
    server_name domainname.com www.domainname.com;
    allow all;

    root /var/www/sites/frontend/www/domainname;

    index index.php index.html index.htm;

    gzip on;

    location ~* \.(jpg|jpeg|png|gif|ico|css|js)$ {
        expires 365d;
    }

    location ~ \.php$ {
        fastcgi_pass 127.0.0.1:9000;
        fastcgi_split_path_info ^(.+\.php)(/.*)$;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $realpath_root$fastcgi_script_name;
        fastcgi_param DOCUMENT_ROOT $realpath_root;
    }
}

With a little fiddling around you should be able to get this working.

If you need to log into a docker container:

ssh -i ~/.ssh/your_key.pem ec2-user@EC2-IP
docker ps
#find the 'CONTAINER ID' that is not 'amazon/amazon-ecs-agent:latest'
docker exec -it CONTAINER_ID bash

Step 2 — use Lambda API endpoints created with serverless.

This step is entirely optional and a little out of scope for this article. I use these Lambda endpoints to wrap calls to our API from sites, so I can attach an API key and API id without exposing them on public sites in JavaScript functions that post to our API. I would suggest you at least familiarize yourself with Lambda and with what serverless is; it is pretty cool and worth your time.

Using serverless we will create Lambda API endpoints written in Node.js; these functions in the cloud can access other AWS services such as databases and queues. You are only charged a very small amount per Lambda function run.

Serverless takes care of a lot of the heavy lifting here, and this section really deserves its own article, which will be my next one.

update: I ended up not using serverless and went to pure Lambda. I am also now using the AWS API Gateway, hooked directly into Lambdas, for our dashboard, and I am transitioning our heavy-traffic workers (40M+/month) to Lambdas triggered by SQS queue additions.

Step 3 — Set up the Symfony API with SQS queues and API load balancer

SQS is the core of why I am writing this guide. Simply put, SQS is a queue service. You could replace it with something like RabbitMQ, ZeroMQ, Redis Pub/Sub or any number of other projects, but seeing as it is so important to this setup, we want to let Amazon run it. Additionally, the idea here is to have few possible break points, which means running fewer servers, so using AWS SQS saves us running a queue server. It is also cheap: the first million messages per month are free, and after that each million is $0.30, which in my case comes out to something like $8-$15/month.
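As a back-of-envelope check, the ~40M messages/month figure mentioned elsewhere in this article works out like this under that pricing (first million free, $0.30 per additional million). Note this counts messages only; real billing counts every API request, including receives and deletes, so actual numbers run higher:

```shell
# SQS message cost estimate: (messages - 1M free) / 1M * $0.30
awk 'BEGIN {
    messages = 40000000
    billable_millions = (messages - 1000000) / 1000000
    printf "%.2f\n", billable_millions * 0.30
}'
# prints 11.70
```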

The primary idea with SQS is that you can put a message on a queue and forget it. A worker long-polls the queue and processes messages; if it fully completes processing a message, the worker deletes it from the queue and returns to polling for another. If processing fails, the message stays invisible for a specified amount of time before it becomes visible again, at which point another process will attempt it. If a message fails a certain number of times it can be removed, or optionally placed in a 'redrive' (dead-letter) queue where you can look at it, and have specialized processes that log it and/or reprocess it.

To start, we want a simple API queue and a redrive queue for it. First create a queue called "apiRedrive" with defaults, 14-day retention and no redrive policy. Then create an "apiQueue" and use the apiRedrive queue for its redrive policy; the defaults are typically fine.
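For reference, the same two queues can be sketched with the AWS CLI (the region, account ID and maxReceiveCount below are placeholder assumptions; the console works just as well):

```shell
# Dead-letter queue first: defaults plus 14-day retention (in seconds).
aws sqs create-queue --queue-name apiRedrive \
    --attributes MessageRetentionPeriod=1209600

# Main queue, with a redrive policy pointing at apiRedrive.
# Substitute the ARN reported for your own apiRedrive queue.
aws sqs create-queue --queue-name apiQueue \
    --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-west-2:123456789012:apiRedrive\",\"maxReceiveCount\":\"5\"}"}'
```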

I use the Friends of Symfony REST bundle, which is outside the scope of this article and covered extensively in multiple guides. You do not need to use it; however, you will find that almost every guide for RESTful API development with Symfony uses fosRESTBundle, and for good reason.

composer require friendsofsymfony/rest-bundle

To get started with SQS we need to add the AWS SDK to our symfony project

composer require aws/aws-sdk-php

We need to add the parameters aws_sqs_key, aws_sqs_secret and aws_sqs_region, and then the following to our services.yml.

services:
    sqs_queue:
        class: AppBundle\Services\sqsService
        arguments: ["%aws_sqs_key%", "%aws_sqs_secret%", "%aws_sqs_region%"]
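The three parameters referenced above live with your other Symfony parameters; a sketch with placeholder values:

```yaml
# app/config/parameters.yml (placeholder values)
parameters:
    aws_sqs_key: YOUR_AWS_KEY
    aws_sqs_secret: YOUR_AWS_SECRET
    aws_sqs_region: us-west-2
```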

Here is our Symfony service for SQS (guide). The file is src/AppBundle/Services/sqsService.php:

<?php
/**
 * Created by PhpStorm.
 * User: joeldg
 * Date: 4/27/16
 * Time: 11:47 AM
 */

namespace AppBundle\Services;

use Aws\Sqs\SqsClient;

/**
 * Work with Amazon SQS queues.
 *
 * @package AppBundle\Services
 */
class sqsService
{
    private $awsSqsKey;
    private $awsSqsSecret;
    private $awsSqsRegion;
    private $awsSqsQueue;

    /**
     * sqsService constructor.
     *
     * @param $awsSqsKey
     * @param $awsSqsSecret
     * @param $awsSqsRegion
     */
    public function __construct($awsSqsKey, $awsSqsSecret, $awsSqsRegion)
    {
        $this->awsSqsKey = $awsSqsKey;
        $this->awsSqsSecret = $awsSqsSecret;
        $this->awsSqsRegion = $awsSqsRegion;
    }

    /**
     * @return SqsClient
     */
    public function getClient()
    {
        $client = SqsClient::factory(
            array(
                'region' => $this->awsSqsRegion,
                'version' => '2012-11-05',
                'credentials' => array(
                    'key' => $this->awsSqsKey,
                    'secret' => $this->awsSqsSecret
                )
            )
        );

        return $client;
    }

    /**
     * @param $message
     * @param $queue
     */
    public function addMessage($message, $queue)
    {
        $message = base64_encode(serialize($message));

        $this->getClient()->sendMessage(array(
            'QueueUrl' => $queue,
            'MessageBody' => $message,
        ));
    }

    /**
     * @param $queue
     * @return array|null
     *
     * http://docs.aws.amazon.com/aws-sdk-php/v2/guide/service-sqs.html
     */
    public function getMessage($queue)
    {
        $client = $this->getClient();
        $res = $client->receiveMessage(array(
            'QueueUrl' => $queue,
            'WaitTimeSeconds' => 10 // long poll
        ));
        if ($res->getPath('Messages')) {
            foreach ($res->getPath('Messages') as $msg) {
                $ret['body'] = unserialize(base64_decode($msg['Body']));
                $ret['rcpt'] = $msg['ReceiptHandle'];
                return $ret;
            }
        }
        return null;
    }

    /**
     * @param $queue
     * @param $rcpt
     * @return mixed
     */
    public function deleteQueueMessage($queue, $rcpt)
    {
        $client = $this->getClient();

        $res = $client->deleteMessage(array(
            'QueueUrl' => $queue,
            'ReceiptHandle' => $rcpt
        ));
        return $res;
    }
}

Now we will load the SQS service in our controller class:

/**
 * @var string
 */
protected $apiQueue = '{YOUR_QUEUE_URL}/apiQueue';

/**
 * SQS service
 */
protected $sqs_queue;

# then in your REST action, i.e. postUserAction()
$this->sqs_queue = $this->get('sqs_queue');

# and with your array $data (array())
$this->sqs_queue->addMessage($data, $this->apiQueue);

The idea for our primary controller is that the API has 'no' database connection 'at this point'. With only a few basic checks for data integrity, we want to return a 200 code via REST and assume the data is good. Once this data is pushed to the queue we can forget about it for now. Do make sure, however, that what you pass to the queue is an array with named keys so you can process the data later.

Note: the other controllers do have database access; this one is our primary data-post controller, for when a user submits their information from a form. We do basic checks (from our site, valid email, all fields etc.) and an HTTP status 200 return sends the user to a 'Thank you' page.
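A hypothetical shape for that primary post action (the method body, checks and response classes here are illustrative, not lifted from the real controller; Symfony's Request and JsonResponse are assumed imported):

```php
// Sketch of the data-post action described above: basic integrity checks,
// enqueue, return 200. No database connection is involved.
public function postUserAction(Request $request)
{
    $data = $request->request->all(); // named keys, as noted above

    // Only cheap checks before accepting the post.
    if (empty($data['email']) || !filter_var($data['email'], FILTER_VALIDATE_EMAIL)) {
        return new JsonResponse(array('error' => 'invalid email'), 400);
    }

    $this->sqs_queue = $this->get('sqs_queue');
    $this->sqs_queue->addMessage($data, $this->apiQueue);

    return new JsonResponse(array('status' => 'ok'), 200); // user goes to 'Thank you'
}
```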

Optional part: Setting up the load balancing for the API

We have two options here for load balancing our API:

  1. Set up an AWS ELB for the API, this requires using Route53 to point DNS to the ELB.
  2. Set up local HAProxy instance for the API.

Here is a HAProxy setup. The IPs are all internal AWS IPs for my private backend network; yours will be different. It also includes a special /exports path which is not load-balanced and only runs off the first server. The file health_check.php simply loads the Symfony core and autoloads vendors.

#/etc/haproxy/haproxy.cfg
global
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    maxconn 2048
    user haproxy
    group haproxy
    stats socket /var/lib/haproxy/stats.socket uid 106 gid 111
    daemon

defaults
    log global
    mode http
    option forwardfor
    option http-server-close
    option httplog
    option dontlognull
    contimeout 50000
    clitimeout 50000
    srvtimeout 50000
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http

frontend https-in
    bind 172.31.36.129:443 ssl crt /etc/ssl/private/site1.pem
    option forwardfor
    option httplog
    default_backend web_block
    acl isexport path_beg -i /pages/exports
    use_backend exports if isexport
    acl ispost path_beg -i /app.php/api/endpoint
    use_backend apipost if ispost

frontend http-in
    bind 172.31.36.129:80
    mode http
    option forwardfor
    option httplog
    default_backend web_block
    acl isexport path_beg -i /pages/exports
    use_backend exports if isexport
    acl ispost path_beg -i /app.php/api/endpoint
    use_backend apipost if ispost

backend web_block
    balance source
    mode http
    timeout check 50000
    option httpchk GET /health_check.php
    http-check expect status 200
    server s1 172.31.6.96:80 maxconn 64 check
    server s2 172.31.6.97:80 maxconn 64 check
    server s3 172.31.2.43:80 maxconn 64 check

backend exports
    mode http
    server s1_export 172.31.6.96:80 maxconn 64

backend apipost
    mode http
    balance roundrobin
    timeout check 5000
    option httpchk GET /health_check.php
    http-check expect status 200
    server s1 172.31.6.96:80 maxconn 356 check
    server s2 172.31.6.97:80 maxconn 356 check
    server s3 172.31.2.43:80 maxconn 356 check

listen admin
    bind 0.0.0.0:8080
    mode http
    stats enable
    stats uri /
    stats realm Strictly\ Private
    stats auth admin:{YOURPASSWORD}

Here is what the admin looks like with the above

Step 4 — Galera cluster (Or AWS Aurora)

UPDATE: With reserved pricing, AWS Aurora is doable if your budget is a little higher. The smallest instance size is a db.r3.large, which will run you $0.19/hr per instance ($410/month for a cluster); the on-demand prices are too high to recommend for an average shop. I have personally swapped over to Aurora for a few reasons: 1) it auto-expands up to 64TB; 2) the Aurora engine seems to be super fast and the latency is great; 3) it is managed, so I have three fewer (very important) servers to maintain and keep up to date; 4) one-click backups/restore; 5) better stats; 6) taking into account things like paying for Galera ClusterControl, the servers, and my time fiddling with updates and taking servers offline for maintenance, I decided the cost was worth it.

Galera cluster is a true multi-master MySQL cluster: if you modify any node, all nodes get the modifications in real time. However, it is not without its issues, and you can have collisions that will cause you grief. The following is how I have set up a HAProxy load balancer for my Galera cluster: it uses port 3307 for INSERTs and UPDATEs, limiting these to one server at a time, and port 3306 for round-robin SELECTs; it also attempts to limit the SELECTs done on the INSERT server.

I used the guide: How To Configure a Galera Cluster with MariaDB on Ubuntu 12.04 Servers as a base.

In the Symfony parameters we add:

idatabase_port: 3307

In our Symfony config.yml we need to set up the “inserts” connection which will be used in our workers.

doctrine:
    dbal:
        default_connection: main
        connections:
            main:
                driver: pdo_mysql
                host: "%database_host%"
                port: "%database_port%"
                dbname: "%database_name%"
                user: "%database_user%"
                password: "%database_password%"
                charset: UTF8
            inserts:
                driver: pdo_mysql
                host: "%database_host%"
                port: "%idatabase_port%"
                dbname: "%database_name%"
                user: "%database_user%"
                password: "%database_password%"
                charset: UTF8

Here is the database load balancer haproxy.cfg, setting up port 3307 with failover. The IPs listed are the internal network IPs of my three Galera servers.

#/etc/haproxy/haproxy.cfg
global
    log 127.0.0.1 local0 notice
    user haproxy
    group haproxy
    stats socket /var/lib/haproxy/stats.socket uid 106 gid 111

defaults
    log global
    retries 2
    timeout connect 3000
    timeout server 10m
    timeout client 10m

listen mysql-cluster
    bind 172.31.46.75:3306
    mode tcp
    option mysql-check user haproxy_check
    balance roundrobin
    server galera1 172.31.11.103:3306 weight 1 check
    server galera2 172.31.10.147:3306 weight 50 check
    server galera3 172.31.12.99:3306 weight 50 check

listen mysql_inserts
    bind 172.31.46.75:3307
    mode tcp
    timeout client 60000ms
    timeout server 60000ms
    balance leastconn
    option mysql-check user haproxy_check
    default-server port 3306 inter 2s downinter 5s rise 3 fall 2 slowstart 60s maxconn 256 maxqueue 128 weight 100
    server galera1 172.31.11.103:3306 check
    server galera2 172.31.10.147:3306 check backup
    server galera3 172.31.12.99:3306 check backup

listen webinterface
    bind 0.0.0.0:8080
    mode http
    stats enable
    stats uri /
    stats realm Strictly\ Private
    stats auth admin:{YOURPASSWORD}

This looks like the following, with downtime from me doing maintenance.

Step 5 — Design workers with the Symfony console

So now we have an auto-scaled API feeding data into queues, and a database cluster set up to handle large volumes of INSERTs and SELECTs with failover. It's time to create the workers that will poll the SQS queues, using the Symfony console app.

composer require symfony/console

If you have not used the console component of Symfony to create your own command, then this will be a crash course.

First up, see what you have; here is part of mine.

$ php bin/console
Symfony version 3.0.10-DEV - app/dev/debug
Usage:
command [options] [arguments]
Options:
-h, --help Display this help message
-q, --quiet Do not output any message
-V, --version Display this application version
--ansi Force ANSI output
--no-ansi Disable ANSI output
-n, --no-interaction Do not ask any interactive question
-e, --env=ENV The Environment name. [default: "dev"]
--no-debug Switches off debug mode.
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
Available commands:
help Displays help for a command
list Lists commands
app
app:ProcessRawFile Provide a raw posts file and generate a report on the contents.
app:TestHygiene ...
app:do_api Use SQS queues for all API posts.
app:do_api_replay Replay API posts from /tmp/apipost (NOTE: must be run on all webservers)
app:do_contacts Use SQS queues for internal logging.
app:do_esp_inject Use SQS queues to send users to ESPs -- Injection
app:do_esp_post Use SQS queues to send users to ESPs -- (This is the primary worker)
app:do_esp_redrive ReDrive SQS messages to selected ESP
app:do_import Handle dashboard imports.
app:file_clean cleaning email lists...
app:finalize_dispositions ...
app:hold_flush ...
app:partner_post ...
app:redo_failed_mail re-run hygiene and mail emails which have failed because of network etc. outages.

See the docs on generating a command in Symfony.
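For orientation, here is roughly what the skeleton of one of these commands looks like. The class name AppApiCommand and the process_number argument both appear later in this article; the exact configure() details are a sketch of my own, not lifted from the real code:

```php
<?php

namespace AppBundle\Command;

use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;
use Symfony\Component\Console\Input\InputArgument;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;

class AppApiCommand extends ContainerAwareCommand
{
    protected $process_number;

    protected function configure()
    {
        $this->setName('app:do_api')
            ->setDescription('Use SQS queues for all API posts.')
            // lets us run several identical workers side by side
            ->addArgument('process_number', InputArgument::OPTIONAL, 'Worker number', 1);
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $this->process_number = $input->getArgument('process_number');
        // the infinite long-poll loop goes here (see the code later in this step)
    }
}
```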

I use the following utility function for my workers. It uses the container-aware command get() method to set up our database connections and our SQS connection.

use Symfony\Bundle\FrameworkBundle\Command\ContainerAwareCommand;

# USE AS FOLLOWS
# class AppApiCommand extends ContainerAwareCommand

/********************************************************************************
 * Set db connections.
 *
 * use $this->conn for SELECTs
 * use $this->iconn for INSERT and UPDATE queries
 *
 * The reason behind this is to limit deadlocks on the db cluster.
 * iconn selects port 3307 instead of 3306, which always connects just to galera1;
 * in the event galera1 is down, galera2 is always selected, then 3 and so on.
 */
private function setConn()
{
    if (!isset($this->conn) || !isset($this->iconn)) {
        $this->conn = $this->getContainer()->get('database_connection');
        $this->iconn = $this->getContainer()->get('doctrine.dbal.inserts_connection');
    }
    if (!isset($this->sqs_queue)) {
        $this->sqs_queue = $this->getContainer()->get('sqs_queue');
    }
}

Now, in the execute() function of our console command, we create an infinite loop with while(1), and inside this loop we add the following code block.

/**
 * Long-poll a message off the queue and set everything up.
 */
$success = false;
$msg = $this->sqs_queue->getMessage($url); // long poll
if ($msg) {
    $data = $msg['body'];
    // $data is the key/value pairs you passed to the queue.
    // THIS is where you do your work.
    // Set $success to true to remove the message.

    /***************************************************
     * On success we delete the message from the SQS queue
     ***************************************************/
    if ($success) {
        $this->sqs_queue->deleteQueueMessage($url, $msg['rcpt']);
    }
}

At this point we can push further processing onto other queues. Splitting the work up into various workers has some advantages; in particular, it is easy to figure out where things are breaking.

Some ideas of what to use queues for are:

  1. 3rd party (partner) processing/posting — I use this to send contacts to services such as mailchimp.
  2. Statistics — I like to collect a lot of statistics but if my stats server is busy I don’t want it to affect my workers so I put stats on queues and let workers handle them on dev/staging servers.
  3. Archiving — Some records can just go to Amazon Glacier (archive)
  4. Additional processing — image conversion/cropping, video conversion, uploads, downloads etc.

Note: I like to add “process_number” as an argument to my workers so I can have multiple identical workers running on the same machine which I can either throw in the background with tmux or screen or even set up in supervisord.

Step 6 — Put workers into process management

Now that we have these workers, we need to make sure they are running and will be automatically restarted if they fail. For this I use supervisord (which we also use in the containers in Step 1).

Supervisord has a couple of requirements: first, the process must not exit (it needs to simply sit and loop), and second, the process must not spawn off and daemonize itself; supervisord manages it in the foreground.

Note: I personally prefer my worker logs to go into /tmp so they are blown away on server reboot. I do log a lot, but the primary logging is done in my database; if you prefer to keep your logs, change the directories in the examples.

This is the logging function I use for my workers; it logs which process_number is in use.

/**
 * @param $message
 */
protected function logThis($message)
{
    $worker_num = $this->process_number;
    $file = "/tmp/worker_api.log";
    $line = time() . ":$worker_num:" . $message . "\n";
    $this->output->writeln( time() . ":$worker_num:" . $message );
    file_put_contents($file, $line, FILE_APPEND);
}

Here is an example supervisord workers.conf file, with a worker group, in /etc/supervisord/conf.d/:

[program:api1]
command=/usr/bin/php /home/ubuntu/rest/bin/console app:do_api 1
stdout_logfile=/tmp/worker_out
startretries=3
stopwaitsecs=10
autostart=true

[program:api2]
command=/usr/bin/php /home/ubuntu/rest/bin/console app:do_api 2
stdout_logfile=/tmp/worker_out
startretries=3
stopwaitsecs=10
autostart=true

[program:stats]
command=/usr/bin/php /home/ubuntu/rest/bin/console app:do_stats 1
stdout_logfile=/tmp/worker_out
startretries=3
stopwaitsecs=10
autostart=true

[group:workers]
programs=api1,api2

Now, once we reload supervisord, we can look at and control the processes:

$  sudo supervisorctl
workers:api1 RUNNING pid 4664, uptime 0:35:57
workers:api2 RUNNING pid 5946, uptime 0:01:24
stats RUNNING pid 6066, uptime 0:00:23
supervisor> help
default commands (type help <topic>):
=====================================
add clear fg open quit remove restart start stop update
avail exit maintail pid reload reread shutdown status tail version

From supervisorctl you can start/stop/restart your processes. When processes are grouped you can start/stop/restart the entire group; for instance, to restart both api1 and api2 you can do:

restart workers:*

You can also do this from the command line, so you can set a cron to restart all workers in one group every four hours if you want:

# restart the worker group every four hours
0 0,4,8,12,16,20 * * * /usr/bin/supervisorctl restart workers:* > /dev/null 2>&1

Step 7 — (optional) Set up stats with graphite+grafana

Do yourself a favor and ditch the ELK stack (they are rebranding as Elastic Stack).

I have set up and worked with the ELK stack, and I have spent a lot of time creating cool Kibana dashboards. It is great and all, but the problem is ElasticSearch: its clustering requirements are pretty massive, and when you are working in the cloud a minimum-requirement, bare-bones, three-server ES install can run you hundreds of dollars ($400-$800 is common).

Be honest, you don’t need to ship logs into a massive document search engine, especially if you can do the same thing on a single small EC2 instance.

Graphite+Grafana gives you all the pretty graphs and metrics you could possibly need.

Here is an example of my dashboard (with any identifying info blurred out).

Grafana can also do nice collectd graphs

To get going on this you will need to set up Graphite on a server somewhere in your network. There are several guides for doing this; once it is set up, you are ready to start sending metrics from your API.

In your Symfony app

composer require domnikl/statsd

In Symfony, where possible, I like to wrap things in services. Modify the following to replace {YOUR_GRAPHITE_SERVER} with the IP of your Graphite machine.

file:src/AppBundle/Services/GraphiteService.php:

<?php
/**
 * Created by PhpStorm.
 * User: joeldg
 * Date: 4/14/16
 * Time: 11:12 AM
 */

namespace AppBundle\Services;

use Domnikl\Statsd\Client;
use Domnikl\Statsd\Connection\UdpSocket;

/**
 * Class GraphiteService
 * see: https://packagist.org/packages/domnikl/statsd
 *
 * @package AppBundle\Services
 * @internal
 */
class GraphiteService
{
    /**
     * @var Client
     */
    private $statsd;

    public function __construct()
    {
        $connection = new UdpSocket('{YOUR_GRAPHITE_SERVER}', 8125);
        $this->statsd = new Client($connection, "api.namespace");

        // the global namespace is prepended to every key (optional)
        $this->statsd->setNamespace("api.");
    }

    /**
     * @param string $key
     */
    public function incrementAction($key)
    {
        $this->statsd->increment($key);
    }

    /**
     * @param string $key
     */
    public function decrementAction($key)
    {
        $this->statsd->decrement($key);
    }

    /**
     * @param string $key
     * @param int $amt
     * see: http://obfuscurity.com/2013/05/Graphite-Tip-Counting-Number-of-Metrics-Reported
     */
    public function countAction($key, $amt)
    {
        $this->statsd->count($key, $amt);
    }

    /**
     * @param string $key
     * @param int $amt
     */
    public function timingAction($key, $amt)
    {
        $this->statsd->timing($key, $amt);
    }

    /**
     * @param string $key
     */
    public function startTimingAction($key)
    {
        $this->statsd->startTiming($key);
    }

    /**
     * @param string $key
     */
    public function endTimingAction($key)
    {
        $this->statsd->endTiming($key);
    }

    /**
     * @param string $key
     */
    public function startMemoryProfileAction($key)
    {
        $this->statsd->startMemoryProfile($key);
    }

    /**
     * @param string $key
     */
    public function endMemoryProfileAction($key)
    {
        $this->statsd->endMemoryProfile($key);
    }

    /**
     * @param string $key
     */
    public function peakMemoryAction($key)
    {
        $this->statsd->memory($key);
    }

    /**
     * @param string $key
     * @param string|int $amt
     * Pass absolute values, or delta values as a string.
     * Accepts both positive (+11) and negative (-4) delta values.
     */
    public function gaugeAction($key, $amt)
    {
        $this->statsd->gauge($key, $amt);
    }

    /**
     * @param string $key
     * @param int $amt
     */
    public function setsAction($key, $amt)
    {
        $this->statsd->set($key, $amt);
    }
}

In your services.yml

graphite_service:
    class: AppBundle\Services\GraphiteService

In the setConn() function we previously made, add:

if (!isset($this->stats)) {
    $this->stats = $this->getContainer()->get('graphite_service');
}

Now, anywhere you want, you can increment a stat item; for instance, if you wanted to graph the number of times your API was hit with a missing API key:

$this->stats->incrementAction('missing_api_key');

This would pop out fairly quickly, and you could go looking for where or what was happening. There are other types, like gauges and memory profiling, but increment is probably the one you will use most.
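There is no heavy machinery behind these calls. Under the StatsD protocol, the client just fires small plain-text UDP datagrams of the form key:value|type; a sketch (the function name here is made up purely for illustration):

```php
<?php
// Illustrative only: what the statsd client sends over UDP for each call.
// The type suffix is c (counter), ms (timing), g (gauge) or s (set).
function statsdDatagram(string $key, $value, string $type): string
{
    return sprintf('%s:%s|%s', $key, $value, $type);
}

echo statsdDatagram('api.missing_api_key', 1, 'c') . "\n";   // counter increment
echo statsdDatagram('api.request_time', 142, 'ms') . "\n";   // timing in milliseconds
echo statsdDatagram('api.queue_depth', '+11', 'g') . "\n";   // gauge delta
```

Because it is fire-and-forget UDP, a down or unreachable Graphite box never slows down or breaks your API, which fits the fault-tolerance theme of this guide.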

If you need stats and don't want to go the Google Analytics route, I would still advise against Elasticsearch and Logstash; just go with an option like self-hosted Piwik, which is easier to set up and has fewer moving parts to break.

Step 8 — alternatives and ideas

I have mentioned, in several spots in this article, alternatives to the technologies used. Obviously, if you are on MS Azure or Google Cloud you have most, if not all (or even more), of the services available on AWS, but the guides I have linked to will not be suitable and you will be on your own. If you have guides, send them my way and I will link to you.

There are a ton of Docker containers on Docker Hub you can use instead of the php7-fpm one I used as an example. Look around; your application surely has specific requirements mine does not. I am primarily showing the example Docker setup to demonstrate how to use EFS and to serve as a primer on Docker.

There are spots here where you may want to go back and forth between AWS offerings and services you install and maintain. There are also areas where you can actually bypass parts. For instance, in the Lambda section you could bypass posting to the API and do all the checks in Node.js and push the data directly onto SQS. I purposely do not do this because one of my requirements is that I have the ability to ‘replay’ all posts and every SQS message as needed.

You can do some really amazing things with Pub/Sub in Google, and paired with their machine learning it is an enticing option. It is worth a look and I may create a companion article for it.

Depending on your needs, pub/sub may be a better option than SQS; Redis has support for it, as does RabbitMQ (an AMQP message queue).

I didn’t mention caching here, and there are a few considerations with application level caching and several options.

  1. Redis — a great general option that requires adding libraries; it also supports persistence, pub/sub, and many other features. (supported in AWS ElastiCache)
  2. Memcache — baked into PHP with a ton of support. (supported in AWS ElastiCache)
  3. MemcacheDB — a 'persistent' version of memcached that, on a small AWS instance, can handle 65,000 reads/sec. It's a never-die, nuke-proof solution, and because it conforms to the memcache protocol it requires no extra modules or special libraries to get going.
  4. There are many other key-value NoSQL database solutions which 'can' be re-jiggered as a cache or used in a cache-like fashion; a lot of it just depends on what you are familiar with.

Specific non-memcache cache(s) would need to be added into the Dockerfile in step one and configured appropriately.
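For example, if you chose Redis, a hypothetical addition to that Dockerfile might look like the fragment below (assuming an official php-fpm base image; adjust for yours):

```dockerfile
# Hypothetical addition to the php7-fpm Dockerfile from step one:
# install and enable the Redis extension so the PHP app can reach
# a Redis (or ElastiCache Redis) endpoint.
RUN pecl install redis \
 && docker-php-ext-enable redis
```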

Conclusion

I hope this article inspires you to create a robust application, or at least to start thinking about queue-based architecture for your application, or, at the very least, to think about web application robustness in new ways.

Using all or some of the above guide(s) will greatly increase your application's robustness and fault tolerance. Each piece works together and handles faults gracefully, so you can sleep easier.

If you would like to contact the author:

I am the Vice President of Technology for a marketing firm where each contact submitted to us can be worth anywhere from ~$18 to $100+, and the loss of even one is not acceptable. Additionally, I do not like being paged in the middle of the night by application alarms, as I suffer from 'alarm fatigue' very quickly, so I decided to architect a system that can crash, keep going, and self-heal without skipping a beat. If/when it does crash at 2am, I can look at a post-mortem graph in the morning while drinking my coffee, make the appropriate "Oh" sounds, and then get back to doing things I need to do for our company.

If you notice any errors in this guide, or you would like to make suggestions for making this guide better, contact me.

Joel De Gan — joeldg @ rxmg.com

We are still hiring.
Robustitude