Converting Amazon S3 logs to Avro
Amazon S3 Website Hosting
Amazon S3 is an excellent resource for hosting static websites (html, css, js) because it provides free SSL certs for free and fast content delivery network as well for reasonable pricing. Hosting websites is trivial and well documented. After setting up all these we have a running website using SSL with geographically distributed edge caches for faster page load.
S3 provides access logging for tracking requests to your bucket. Each access log entry (called the record) has information about a single request, including requester, request time, response status, bucket, key, etc. The actual format is described in this document, explaining each field in depth.
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be mybucket [06/Feb/2014:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /mybucket?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" -
The problem is with such log format that we cant access individual fields easily (without a regexp) and that we store the information a human friendly way using text. This is not optimal for storing and querying larger datasets, we need to transform it to a more space efficient solution that reduces the IO when reading a large chunk of the data on disk or using distributed analytical platforms like Hadoop.
Why Apache Avro
According to the documentation Avro provides
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- and few other things we don’t need right now
Avro also uses schemas so we can trust our data while processing it. The other alternative would be Apache ORC that is even more suitable for analytical use. I am going with Avro this time, because it is better supported than ORC in Clojure at the moment.
My personal reasons why I am using Clojure for data projects like this is:
- quick prototyping (REPL)
- support for asynchronous programming (link)
- small code base, less verbose than Java yet more readable
- access to all of the Java libraries
Most of the data services I am working with on a daily basis has decent Java support, that means I just as easily use those libraries in Clojure. I also like small nice things. :)
Just to summarise what are trying to achieve with this project and article series:
- covering reading text files from Amazon S3 and convert the data to Avro (part I)
- explaining how to convert a single thread execution to an asynchronous one with core.async (part II)
- build a simple DSL to query Avro files (part III)
For starting I am going through the major topics involved in the process, how to use AWS S3 api, how to create Avro files and finally how to process lines of log files.
Talking to S3
After some initial poking around with the libraries we need for this I decided to use the raw Java S3 api, since it is so well written, using it in Clojure is a breeze.
Creating a credential and using it to create an AmazonS3Client is simple. We can use many S3 clients at the same time for better performance but for the initial version we are going to stick to a single connection.
Log files are organised around dates, keeping one file per day sounds reasonable. Each day has zero or many entries, where many is less than a 10.000 so there is no need for splitting up a day for smaller chunks. On average there are 1000–2000 files per day, depending on the number of access entries. We are going to process data day by day, using a moving window. The size of the window and when is starts can be configured in the config.
Using the example from the config and yields to the following list of dates:
Fetching actual file names for each day can be tricky at the first sight but we can use the truncated field for checking if there are more than 1000 (by default) files for the particular day.
This function is blocking so it won’t return until all of the items are fetched, it is not recommended to process 100.000+ files at the same time. For processing that many files we need to re-write it to be lazy producing a lazy sequence where the items are looked up when needed. (Added to the TODO). The function that returns the Clojure representation (a hash-map) of a log entry is get-s3-object-summary-clj.
This way are have a list of entires that we are going to process later. For booting up all this in REPL we can use the following few lines assuming the configuration is correct and the credential file is present and it has valid access and secret key.
After got connected to S3 we can play with the log files. Checking the first entry (calling first on all-files-for-a-day):
Since Clojure keywords can be used as functions we can easily list all of the file names in the list we produced earlier.
Processing a single line
Unfortunately there is no better way of processing these lines than using a regular expression.
I guess it is not nice but at least gets the job done. I still need to run it on bigger data sets but for our use case it works. When there are parenthesized groups in the pattern and re-find finds a match, it returns a vector. The first element is the matching string, the remaining elements are the individual groups. In this case we need to pay attention not only that Amazon uses “-” for null values but also to match all of the possible values of the referer and user agent fields.
This works reasonably well, I haven’t found a non matching long entry yet. Now we can extend the s3api with get object content capabilities, that is required for downloading an object from S3.
Creating an S3Object is easy, we just need to supply a connection, a bucket and a key.
Now that we have means to talk to S3 and read files from it we could move on to have a closer look to Avro files and how to write them in Clojure.
Working with Apache Avro in Clojure
Luckily there is a good library that we can use to work with Avro files in Clojure, so we don’t need to re-invent the hot water this time. Abracad provides serialization and deserialization for Clojure data structures with Avro that can be persisted to disk or used in message passing systems like Kafka for example. We are going to persist the data to disk this time.
Before we can write any Avro entry to disk we need a schema for the data that we are collecting here. There are some challenges coming up with the right schema but we can jump these hoops.
First and foremost Avro does not let “-” to be used as the field separator in Avro schemas we have to use “_” instead. Since Amazon allows null values for certain fields we need to reflect that in our schema. Defining nullable fields is easy, we just use an array of possible types for the type as in the example above. There is no strong support for dates yet, we are going to store it as string for now. After we generated the schema we can use it to create an Avro file.
As you can see Clojure provides pretty good tools to work with Amazon S3 and Avro files. The codebase is pretty small (559 LOC) and it already does a lot. In the next articles in the series I am going to make it asynchronous and faster with core.async (channels) and finish the code to upload the converted files to S3 afterwards. Even though I have processed two months worth of log with s3-logrotate already and it works reasonably well it is just a prototype this stage. I am going to improve it during the upcoming months.
The full code is available here check it out:
s3-logrotate - Rotating logs for S3 buckets and converting the content to ORC or Avrogithub.com
Sponsored by StreamBright