Restoring Archived Logs at Ninja Van

Baron Chan · Ninja Van Tech · Jan 25, 2021

Building a Bamboo plan to restore archived logs on demand

Introduction to Logs & Archival at Ninja Van

At Ninja Van, we use the Elasticsearch/Fluentd/Kibana (EFK) stack for our logging purposes. If you wish to know more about how we implemented it, you might want to read about it here.

Delivering more than 2 million parcels per day at peak, we see immense volumes of traffic, and with it logs, from our many microservices. At the moment, we generate more than 5TB worth of logs per day. While it would be great if we could keep months or even years of past logs in Elasticsearch (which would allow us to search them easily), it is simply not viable.

116 million log files in just the past hour!

Firstly, there is cost: we would have to pay for a lot of standard/nearline storage on a daily basis. (Archived logs are kept in coldline storage, which is considerably cheaper to store but more expensive to access.)

Secondly, indexing data in Elasticsearch comes at a high compute cost, as the CPU, RAM and disks are heavily utilized. The more data you have, the more data gets indexed, consuming more memory and slowing searches down.

Lastly, we seldom need to access old data. Issues usually get investigated very soon after they’re reported, so it’s only occasionally that we have to look into older issues that were noticed some time after they occurred.

As such, to keep cost low and search speed high, we archive logs that are older than two weeks.
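The archival pipeline itself is outside the scope of this article, but conceptually, each day’s logstash-YYYY.MM.DD index ends up as a same-named snapshot in a GCS-backed snapshot repository (this is what the restore flow later reads from). A minimal sketch of that step, with <es host> and <bucket name> as illustrative placeholders rather than our actual values:

# Illustrative only: snapshot a single day's index into the GCS-backed
# snapshot repository; the snapshot is named after the index itself
curl -X PUT \
  "http://<es host>:9200/_snapshot/<bucket name>/logstash-2021.01.01?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "logstash-2021.01.01" }'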

The Problem

When we need to investigate issues that happened more than two weeks ago, the logs have already been archived. Engineers had to approach the infrastructure team (in charge of maintaining our EFK stack) to get them to manually load the logs back into the Elasticsearch cluster, pulling them away from their daily work.

Automated Log Restoration — Our Solution

As such, we came up with a log-restoration tool that lets anyone restore archived logs on demand. We use Atlassian’s Bamboo, a continuous integration and deployment tool, but any similar tool should also work.

Running the plan on Bamboo (part 1 — starting the build with Run customized)
Running the plan on Bamboo (part 2 — supplying the env vars)
Running the plan on Bamboo (part 3 — waiting for the build to finish…)

Upon running the Bamboo job, we use Ansible to run a playbook. The playbook takes the variables supplied from Bamboo and proceeds with the restoration process.
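For illustration, the Bamboo side boils down to exposing the plan variables (Bamboo prefixes them with bamboo_) as environment variables and invoking the playbook; the values below are hypothetical:

# Hypothetical invocation of the restore playbook; Bamboo exposes plan
# variables as bamboo_-prefixed env vars, which restore.yml reads via
# lookup('env', ...)
export bamboo_LOGS_ENV=dev            # which environment's vars file to load
export bamboo_START_DATE=2021-01-01   # first day of logs to restore
export bamboo_NUM_DAYS=3              # how many consecutive days
export bamboo_NAME=my-investigation   # optional suffix for the namespace

ansible-playbook restore.yml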

We won’t be covering the contents of the es & kibana roles as they were covered in one of our previous articles.

#restore.yml
---
- hosts: localhost
  pre_tasks:
    - name: Set bamboo variables
      set_fact:
        env: "{{ lookup('env', 'bamboo_LOGS_ENV') }}"
        start_date: "{{ lookup('env', 'bamboo_START_DATE') }}"
        num_of_days: "{{ lookup('env', 'bamboo_NUM_DAYS') }}"
    - name: Including common vars
      include_vars: vars/common.yml
    - name: Including {{ env }} vars
      include_vars: "vars/{{ env }}.yml"
    - name: Set common variables
      set_fact:
        domain: '<insert domain name here>'
        efk_bucket_name: '<insert bucket name here>'
        kubernetes_context: <insert kube context here>
        kubernetes_namespace: 'logs-{{ lookup("env", "bamboo_NAME") or ansible_date_time.epoch }}'
        task: restore
  roles:
    - pre-restore           # runs the preparation steps
    - provision-es-cluster  # starts the ES cluster
    - provision-kibana      # starts the Kibana cluster
    - post-restore          # runs the remaining steps

The pre-restore.yml script prepares for the restoration job by first cleaning the output directory (in the event that a previous job had been run), then recreating it for the files this job generates. It then generates create_namespace.yml and applies it with kubectl.

#pre-restore.yml
---
- name: Clean dist restore directory
  file:
    path: "dist/restore"
    state: absent

- name: Create dist restore directory
  file:
    path: "dist/restore"
    state: directory

- name: Generate create_namespace yml
  template:
    src: create_namespace.yml.j2
    dest: dist/restore/create_namespace.yml

- name: Deploy create_namespace yml
  command: >
    kubectl apply
    -f dist/restore/create_namespace.yml
    --context={{ kubernetes_context }}

The create_namespace.yml.j2 is a template file that simply creates a Kubernetes namespace, in which all of this restoration job’s pods will live.

#create_namespace.yml.j2
---
apiVersion: v1
kind: Namespace
metadata:
  name: "{{ kubernetes_namespace }}"
  labels:
    name: log-restoration
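Since every restoration namespace carries the name=log-restoration label, you can list all of them at once (the cleanup job later in this article relies on the same label); <kube context> is a placeholder:

# List every namespace created by the log-restoration tool,
# using the label set in create_namespace.yml.j2
kubectl get namespaces -l name=log-restoration --context=<kube context>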

First, the post-restore.yml Ansible script sets the server name indication (SNI) for the Kibana and Elasticsearch endpoints, so that their SSL certs are configured correctly.

Second, it retrieves the Kibana and Elasticsearch pods’ IP addresses and stores them (kibana_ip_addresses and es_ip_addresses respectively).

Third, it registers a snapshot repository on the ES pod pointing at our Google Cloud Storage (GCS) bucket, where our archived logs are stored.

Fourth, it generates the restore_logs_from_storage.sh shell script and runs it.

Finally, it publishes a simple HTML file that redirects the user to the newly created Kibana instance, which contains only the restored logs.

#post-restore.yml
---

### Add SNI for kibana and es ###
- name: "Update SNIs for kibana-{{ kubernetes_namespace }}{{ env_suffix }}.{{ domain }}"
  uri:
    validate_certs: no
    method: POST
    url: https://internal-{{ env }}.{{ domain }}/admin-api/snis
    status_code: 201,409
    headers:
      Content-Type: "application/json"
      x-nv-kong-key: "{{ kong_key }}"
    body_format: json
    body: |
      {
        "name": "kibana-{{ kubernetes_namespace }}{{ env_suffix }}.{{ domain }}",
        "certificate": {
          "id": "{{ ssl_id }}"
        }
      }

- name: "Update SNIs for es-{{ kubernetes_namespace }}{{ env_suffix }}.{{ domain }}"
  uri:
    validate_certs: no
    method: POST
    url: https://internal-{{ env }}.{{ domain }}/admin-api/snis
    status_code: 201,409
    headers:
      Content-Type: "application/json"
      x-nv-kong-key: "{{ kong_key }}"
    body_format: json
    body: |
      {
        "name": "es-{{ kubernetes_namespace }}{{ env_suffix }}.{{ domain }}",
        "certificate": {
          "id": "{{ ssl_id }}"
        }
      }

### Gather IP Addresses ###
- name: Retrieve ES IP address
  shell: >
    kubectl get pods
    -l name={{ env }}-global-es-logs
    -o wide --no-headers
    --field-selector status.phase=Running
    --context={{ kubernetes_context }}
    -n {{ kubernetes_namespace }} | awk '{print $6}' | head -n1
  register: es_ip_addresses

- name: Retrieve Kibana IP address
  shell: >
    kubectl get pods
    -l name={{ env }}-global-kibana-logs
    -o wide --no-headers
    --field-selector status.phase=Running
    --context={{ kubernetes_context }}
    -n {{ kubernetes_namespace }} | awk '{print $6}' | head -n1
  register: kibana_ip_addresses

### Prepare ES ###
- name: Set GCS bucket & client
  shell: >
    curl -X PUT
    "http://{{ es_ip_addresses.stdout }}:9200/_snapshot/{{ efk_bucket_name }}"
    -H 'Content-Type: application/json'
    -d '{
      "type": "gcs",
      "settings": {
        "bucket": "{{ efk_bucket_name }}",
        "client": "storage"
      }
    }'

### Start restoring ###
- name: Generate restore.sh
  template:
    src: restore_logs_from_storage.sh.j2
    dest: dist/restore/restore_logs_from_storage.sh

- name: Make restore.sh executable
  file:
    dest: dist/restore/restore_logs_from_storage.sh
    mode: a+x

- name: Publish artifact
  copy:
    dest: dist/kibana/index.html
    mode: a+x
    content: |
      <!DOCTYPE html>
      <html>
        <head>
          <script type="text/javascript">
            document.addEventListener("DOMContentLoaded", function() {
              window.location.href = "https://kibana-{{ kubernetes_namespace }}{{ env_suffix }}.{{ domain }}";
            });
          </script>
        </head>
        <body>
        </body>
      </html>
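Before kicking off the actual restore, it can be worth sanity-checking that the repository registered correctly and seeing which snapshots are available. This step is not part of the playbook; <es pod ip> and <bucket name> stand in for the playbook’s variables:

# Optional sanity check: list the snapshots visible in the
# GCS-backed repository before restoring from it
curl "http://<es pod ip>:9200/_snapshot/<bucket name>/_all?pretty"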

The restore_logs_from_storage.sh.j2 template simply iterates through the dates to be restored, making a POST request to the ES instance for each day and waiting until every request has succeeded. This could probably be done more efficiently with Ansible’s uri module instead of a shell script, though.

The benefit of running it as a shell script is that it is easier to debug when an error occurs, because Ansible would otherwise collapse the result into a simple success or failure.

#restore_logs_from_storage.sh.j2
#!/bin/bash

set -x

printf "Restoring {{ num_of_days }} days from {{ start_date }}\n"

# Set a COUNTER variable
COUNTER=0
NUM_DAYS={{ num_of_days }}

currDate=$(date +'%Y.%m.%d' -d"{{ start_date }} +$COUNTER days")

set +e

# Increase the COUNTER to walk forward one day at a time
while [ $COUNTER -lt $NUM_DAYS ]; do
  INDEX=logstash-$currDate

  printf "\n[$((COUNTER+1))/$NUM_DAYS] Restoring $INDEX\n"

  curl -X POST \
    "{{ es_ip_addresses.stdout }}:9200/_snapshot/{{ efk_bucket_name }}/$INDEX/_restore?wait_for_completion=true"

  # Increment COUNTER and move to the next date
  COUNTER=$((COUNTER + 1))

  currDate=$(date +'%Y.%m.%d' -d"{{ start_date }} +$COUNTER days")
done

### Restore Index in Kibana ###
curl \
  -X POST "{{ kibana_ip_addresses.stdout }}:5601/api/saved_objects/index-pattern" \
  -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
  -d '
{
  "attributes": {
    "title": "logstash-*",
    "timeFieldName": "@timestamp"
  }
}'

printf "\nRestoration complete; visit kibana-{{ kubernetes_namespace }}{{ env_suffix }}.{{ domain }} or open the link in the published artifact\n"

Cleanup

To delete the restored logs and all the Kubernetes pods, we can simply run kubectl delete namespace <insert namespace here>. We could get the engineers to do this themselves once they’re done with their logs, but we know us engineers are *sometimes* forgetful. Deleting a namespace can also have nasty consequences if done on the wrong namespace by accident! (which is why we don’t grant this permission to all engineers)
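For completeness, a manual cleanup would look something like this (the namespace name is hypothetical; the context value mirrors the one in clean.sh below):

# Tear down a single restoration environment by hand
kubectl delete namespace logs-my-investigation \
  --context=development-app-context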

Using an automated Bamboo job that runs once every day, we run clean.sh to delete all namespaces created by the log-restoration job that are older than three days. (To increase/decrease the retention, simply edit the match($3,/^[3-9]+d/) regex to your liking.)

#clean.sh
#!/bin/bash

set -ex

echo "Deleting namespaces created by log restoration older than three days"

# Print the matching namespaces (by the AGE column) for the build log
kubectl get namespaces \
  --context=development-app-context \
  -l "name=log-restoration" \
  --no-headers \
  | awk 'match($3,/^[3-9]+d/) {print $0}'

# Delete only the namespaces whose age matched the regex
kubectl get namespaces \
  --context=development-app-context \
  -l "name=log-restoration" \
  --no-headers \
  | awk 'match($3,/^[3-9]+d/) {print $1}' \
  | xargs -r kubectl delete namespace --context=development-app-context

echo "Deletion of above namespaces complete!"

Problems

Each day of restored logs is approximately 5TB and takes around three hours to restore (less if the day had lighter traffic and fewer logs). It would be nice if we did not need to wait so long to view archived logs — imagine realizing that you restored the wrong day’s logs… after waiting for three hours.

Our logs are also growing bigger by the day — we had only 2.7TB of logs per day just four months ago! That works out to roughly 16% growth per month (5/2.7 ≈ 1.85 over four months). Going by this trend, we might soon have to keep less than two weeks’ worth of logs.

We Are Hiring!

Are you passionate about building high-quality products? Come and join us! This project was made possible by Ninja Van’s side-project initiative, which gives engineers room to work on products outside their team’s scope!

You are welcome to browse other things that we do in engineering, for example, how we improve our unit tests in our Java services.

Acknowledgements

A huge thank-you to Ivan Kenneth Wang for guiding me through this project from start to finish; it wouldn’t have been possible without your help.
