Caching web content with Apache Traffic Server

porokh.sergey
DevOops World … and the Universe
10 min read · Apr 20, 2016

Intro

A few weeks ago, while talking with a customer and the project developers, I heard about an interesting problem. The guys were discussing how to decrease the amount of traffic from end-users who access the web through a proxy. That is a very valid case, especially when you have a limited internet channel. I think it's a very case-dependent problem, but nevertheless quite a challenging task to solve. So why not deal with it? Let's explore and build a solution.

First of all, I decided to investigate tools and solutions that had already been created and used. The problem is, I didn't find much: Squid as a caching proxy, Apache Traffic Server as a caching proxy, and that's actually it. Varnish, HAProxy, and others don't solve this problem, because their purpose is to be a reverse proxy. I had worked a bit with Squid earlier, so despite being familiar with it, I wanted to try something new. The choice of Apache Traffic Server (ATS) was obvious given the lack of other decent options.

Apache Traffic Server is a powerful HTTP caching proxy server. The documentation says it is very fast, extensible, and proven by Yahoo. For my case I needed it as a forward proxy to cache all of the traffic from an internal network with a very limited channel. Just imagine a flying Airbus with 300 people on board and a single satellite channel, a transatlantic cruise with more than 2000 people and very limited bandwidth, or an oil platform in the middle of the ocean with lots of workers and limited access to the web. My goal was to build a solution that would cache inbound proxy traffic as much as possible. Moreover, the solution should be simple and scalable, with the possibility to run on any platform.

Basic setup

I wanted to deploy this system easily on different types of environments, so I decided to run Apache Traffic Server in a container using Docker, because that makes the solution portable across providers. This would allow me to launch it almost anywhere: bare metal, AWS EC2/ECS, Azure, GCE. The implementation is pretty simple; the interesting detail is that it's better not to run a supervisor inside the container, and since I run the container with the gelf log driver (Graylog Extended Log Format) streaming to Logstash, a small workaround is needed to send access.log to stdout. To run Traffic Server as a forward proxy, two lines should be added to records.config. The first line permits processing requests for hosts not explicitly configured in the remap rules; the second disables reverse proxying (it's optional):

CONFIG proxy.config.url_remap.remap_required INT 0
CONFIG proxy.config.reverse_proxy.enabled INT 0

I made a couple of requests through ATS, but with the default configuration I could not get the maximum value from caching that I needed. So I also turned on ignoring the client and server no-cache headers that would otherwise bypass the cache:

CONFIG proxy.config.http.cache.ignore_client_no_cache INT 1
CONFIG proxy.config.http.cache.ignore_server_no_cache INT 1

This really increased the amount of cached traffic, since Apache Traffic Server started to ignore the client and server HTTP headers that would otherwise prevent it from caching some objects.

Traffic Server has a lot of other configuration parameters, such as the port on which Traffic Server will listen, a UI, and so on. The full list of options is available in the official documentation.
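
For example, the listening port used throughout this post is set in the same records.config. A minimal sketch; the RAM cache line is just an optional illustration, not something I tuned here:

# listen for proxy traffic on port 8080
CONFIG proxy.config.http.server_ports STRING 8080
# optional: pin the RAM cache to 256 MB (by default the size is derived automatically)
CONFIG proxy.config.cache.ram_cache.size INT 268435456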

The default ATS logs wouldn't suit me, since I needed some custom information about the client and the cached objects. For analytical purposes I required additional info from the users' traffic (browser, operating system, etc.). This information would be very useful in a production environment with real end-users. For this I needed more than the default logs provide, so I enabled the option to gather custom logs in records.config:

CONFIG proxy.config.log.custom_log_enabled INT 1

and changed the default Apache Traffic Server logging to a custom format:

  • The base part is the “Apache access” log format
  • An additional “content type” field, which I will use later for analysis
  • An additional “MISS/HIT” field, which I would use to see what was fetched from the cache
  • Added logs_xml.config with this custom log format

The documentation on how to define a custom format is here: log-formats.
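
For reference, a logs_xml.config along these lines would produce such a format. The exact field tags (%<psct> for the response content type, %<crc> for the cache result code) are best double-checked against the log-formats documentation, and the format name here is just an example:

<!-- sketch of an Apache-style format extended with content type and cache result -->
<LogFormat>
  <Name = "apache_ext"/>
  <Format = "%<chi> - %<caun> [%<cqtn>] \"%<cqtx>\" %<pssc> %<pscl> %<psct> %<crc>"/>
</LogFormat>

<LogObject>
  <Format = "apache_ext"/>
  <Filename = "access"/>
</LogObject>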

To run Apache Traffic Server use the following command:

docker run -d -p 8080:8080 --log-driver=gelf --log-opt gelf-address=udp://localhost:12201 --name trafficserver quay.io/repository/7insyde/ats:latest

which tells Docker to run the container in daemon mode, pull “quay.io/repository/7insyde/ats:latest” if it doesn't exist locally, assign the name “trafficserver”, forward port “8080” from the container to the host, and use the “gelf” log driver. To check that it works, just change the proxy settings in your browser to localhost:8080 and all traffic will go through Apache Traffic Server.
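
Besides the browser, a quick way to check both the proxy and the cache is curl through the proxy; ATS appends its own Via header, whose code string reflects the cache lookup result (example.com is just a placeholder, and the object must be cacheable for the second request to be a hit):

# first request populates the cache, the second one should be answered from it
curl -s -o /dev/null -D - -x http://localhost:8080 http://example.com/
curl -s -o /dev/null -D - -x http://localhost:8080 http://example.com/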

To validate my configuration of Apache Traffic Server, I accessed some sites, made a couple of requests, and made sure the proxy was working. But testing by myself is not a big deal; it's time to test it with a more serious weapon. I will be using the ELK (Elasticsearch, Logstash, Kibana) stack to collect all the Apache Traffic Server logs, and the TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) stack to collect all the host metrics.
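
On the ELK side, the gelf log driver only needs a matching gelf input in Logstash. A minimal sketch, with the Elasticsearch address being an assumption:

input {
  gelf {
    port => 12201        # same UDP port as in the gelf-address option above
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]   # assumption: Elasticsearch on the same host
  }
}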

So what are we waiting for? Let's begin with the testing.

Testing

As you have seen in the previous steps, Apache Traffic Server is up and running. But the question is: how many users can it handle? I asked myself this question earlier, when I had just started to work on it. The goal was to build a solution that could easily handle up to 1500 users.

Tons of articles have been written about Apache Traffic Server performance, tuning, and configuration, but I wanted to test it myself. So I needed a tool to test its performance, emulate some users, and send requests through it. The main problem at this step was to find a good testing/benchmarking tool for web proxy servers that would satisfy my requirements. If you try to google a solution you will probably find lots of tools, such as ApacheBench, Apache JMeter, http_load, curl-loader, FunkLoad, Gatling, Httperf, Tsung, and others, but all of these tools were actually made for stress-testing web applications and web servers, and I needed something special. And here I remembered a tool that I had already used a year earlier for some high-load production testing. That tool was Siege.

Siege as solution for performance testing

“This is an excellent application that does something expensive suites don’t. ie. It does what the users expects it to do, and does it simply.”
— Sgt. Scrub

Siege is an HTTP load testing and benchmarking utility developed by Jeffrey Fulmer in his position as Webmaster for Armstrong World Industries. It was designed to let web developers measure the performance of their code under stress, to see how it will stand up to load on the internet. Siege can stress a single URL or it can read many URLs into memory and stress them simultaneously. It supports basic authentication, cookies, HTTP, HTTPS and FTP protocols.

Siege has essentially three modes of operation:

  • regression
  • internet simulation
  • brute force.

It can read a large number of URLs from a configuration file and run through them incrementally (regression) or randomly (internet simulation), or the user may simply pound a single URL with a run-time configuration at the command line (brute force).
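
For instance, a quick brute-force run against a single URL looks like this (example.com is just a placeholder):

# 25 concurrent users hammering one URL for one minute
siege -c25 -t1M http://example.com/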

Installation and setup

Having worked with Siege before, I already knew that it's CPU- and network-dependent, so I had a couple of options for setting it up:

  • Launch an AWS EC2 compute-optimized instance and install Siege on it. Kind of easy to work with and maintain, but not as flexible as I want it to be.
  • Build a Siege Docker container and launch it via the Amazon ECS service. Yes, it's harder to operate and maintain, but you can always scale it up and launch multiple Siege services.

So I chose to build it inside Docker. The first idea was to use a really small Linux distro like Alpine Linux, but I faced some C compilation problems and missing packages, so I decided to move to Ubuntu. I wanted Siege to work with gzip and deflate encoding as well as with SSL, so the main requirement was to install:

apt-get install -y libssl1.0.0 libssl-dev zlib1g-dev make

and configure it with

./configure --with-ssl=$SSL_PATH

Siege compilation and installation is pretty straightforward; make && make install will do the trick. One more thing worth mentioning is that I've used:

ENTRYPOINT ["siege"]
CMD ["--help"]

so the container could be simply run by:

docker run --rm siege [attributes] [site]
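
For reference, here is a minimal Dockerfile sketch along those lines; the base image version, the extra build tools, the download URL, and the SSL path are assumptions, and the original image may differ:

FROM ubuntu:16.04

# build dependencies for SSL and gzip/deflate support (gcc and curl added for the build itself)
RUN apt-get update && apt-get install -y libssl1.0.0 libssl-dev zlib1g-dev make gcc curl

# download and build Siege from source (version and URL are an assumption)
RUN curl -fsSL http://download.joedog.org/siege/siege-latest.tar.gz | tar xz -C /tmp \
    && cd /tmp/siege-* \
    && ./configure --with-ssl=/usr \
    && make && make install

ENTRYPOINT ["siege"]
CMD ["--help"]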

Testing environment

The Amazon ECS service was chosen as the testing environment. Since Apache Traffic Server is also launched in a container, I decided to split the different services across different instances. So in my test environment I used three EC2/ECS instances:

  • 7insyde-ecs-1 (t2.large): ELK stack + TICK stack
  • 7insyde-ecs-2 (t2.medium): Apache Traffic Server
  • 7insyde-ecs-3 (t2.medium): Siege

So the whole environment looks similar to this:

I started to test Apache Traffic Server with some simple cases: a couple of users, no delay between requests, a single site URL, etc. Slowly increasing the number of users and changing parameters, I managed to define the proper launch options for Siege:

Command ["-c300","-d1","-i","-f","urls.txt","-t30M"]

Description:

-c : concurrent users
-d : delay between requests
-i : internet simulation
-f : file with urls
-t : time (30 minutes)
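
Put together as a plain docker run, outside of ECS, it would look roughly like this; the urls.txt mount path is an assumption, and pointing Siege at the proxy itself is done through its resource file (the proxy-host / proxy-port settings in siegerc) rather than on the command line:

# mount the URL list into the container and pass the same options as above
docker run --rm -v $(pwd)/urls.txt:/urls.txt siege -c300 -d1 -i -f /urls.txt -t30M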

The main goal of testing was:

  • Define what the CPU/RAM usage would be depending on the number of users
  • What the response time from ATS would be
  • How many requests and how much data ATS can push through

Here is how the 30-minute test looks in ELK:

Results

And here are the Siege results for 300 concurrent users:

Transactions: 1635773 hits
Availability: 100.00 %
Elapsed time: 1799.58 secs
Data transferred: 64277.13 MB
Response time: 0.32 secs
Transaction rate: 908.97 trans/sec
Throughput: 35.72 MB/sec
Concurrency: 294.96
Successful transactions: 1635773
Failed transactions: 0
Longest transaction: 6.47
Shortest transaction: 0.00

So we transferred more than 64 GB of data within 30 minutes at 35.72 MB/sec of throughput. These values are really impressive, and the really interesting part is the response time: 0.32 secs, which means that our server, running on a t2.medium AWS instance, operates normally and potential users would not experience any delay using it. I was browsing the internet through this ATS while Siege was generating requests and didn't notice any delay. So, 300 concurrent users on ATS: test passed.

But I was mostly interested in the goals I mentioned above, so using TICK I checked what the load on the ECS instances was. The TICK stack was collecting all the metrics from the host machine (the ECS instance) as well as Docker metrics via the docker input plugin for Telegraf. More information can be found on GitHub. The list of used plugins and collected metrics is below, followed by a minimal Telegraf configuration sketch:

  • System input plugin (cpu, mem, disk, diskio, load average)
  • Docker input plugin (cpu, mem, net, blkio)
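
A minimal telegraf.conf sketch covering those inputs; the InfluxDB URL and database name are assumptions:

# host metrics
[[inputs.cpu]]
  percpu = true
  totalcpu = true
[[inputs.mem]]
[[inputs.disk]]
[[inputs.diskio]]
[[inputs.system]]

# container metrics via the Docker socket
[[inputs.docker]]
  endpoint = "unix:///var/run/docker.sock"

# ship everything to InfluxDB (URL and database are assumptions)
[[outputs.influxdb]]
  urls = ["http://localhost:8086"]
  database = "telegraf"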

Results of testing:

Analyzing the table above, I noticed that it is mostly the requests per second that matter and drive CPU and memory consumption, which is quite logical. Ten users can generate the same number of requests as a hundred users; it mostly depends on the website. Considering my results:

640 requests per second is equal to 1% CPU usage
1000 requests per second is equal to 1% MEM usage

And here is how the graphs look during the test. I mostly describe the CPU and memory graphs, since Apache Traffic Server and Siege are really CPU-dependent; disk usage is quite insignificant.

So while serving 300 users the CPU usage was just above 55% and memory usage was close to 15%, and that's on a single AWS EC2 instance with 2 CPUs and 4 GB of RAM. Based on these tests, you could run up to 500 users on a single EC2 node with Apache Traffic Server and it would operate normally. Great results.

But as you remember, we have been talking about a caching solution, not just about proxying traffic. So how much data was cached by Apache Traffic Server? ELK provides a simple pie chart that shows the cache status of all the objects during the test. Here it is:

More than 87% of the traffic involved in this stress test was cached. Yes, that was a single media site with lots of text, images, GIFs, and Coub videos. When testing multiple sites this value is lower, but either way, the more users you have, the more cached data it will store. I managed to achieve 40–45% caching on average while browsing the internet myself for a couple of days, visiting multiple sites, forums, and social networks. That's a pretty decent result for this solution.

Summary

The installation and configuration of Apache Traffic Server is very straightforward, but testing is a more complex process. I didn't have a chance to test it under a heavier user load, because of the huge amount of traffic that would have to be generated, but you can see how Apache Traffic Server performs even with three hundred users on a single t2.medium EC2 instance. I'm convinced that ATS can be used to solve the problems described at the beginning of this post, as well as be part of a proxy-caching solution in any SMB (small and medium business) or enterprise network. It has a lot of other features, such as cluster mode or the ability to extend the functionality with your own custom plugins, that may be widely used in production environments. Deployment can easily be performed using Docker Compose, Amazon ECS task definitions, or Kubernetes YAML files.
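
As an illustration of the Compose option, here is a minimal docker-compose.yml sketch for just the ATS service, reusing the image, port, and gelf settings from the docker run command above:

version: "2"
services:
  trafficserver:
    image: quay.io/repository/7insyde/ats:latest
    ports:
      - "8080:8080"
    logging:
      driver: gelf
      options:
        gelf-address: "udp://localhost:12201"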

In the end, this environment was built as a proof of concept. I am sure that it can be tuned, improved, optimized, or even reworked. Share your experience working with Apache Traffic Server and caching solutions, propose your ideas on how to improve it, and don't hesitate to ask about any related information you'd be interested in.
