Using Velero and Restic to Backup Kubernetes Resources
Kubernetes Bare-metal with Rook-Ceph
Environment
We are doing this in a bare-metal K8s cluster with a Rook-Ceph based StorageClass for Persistent Volumes: Rook 1.2 (which supports the CSI driver) and Ceph 14.2. Note that the exact versions are not important here, as the CSI Snapshot feature is still not available in Rook 1.2; we use Restic instead.
Need for Velero and Restic
With Rook-Ceph 1.2 or higher and the CSI volume driver you can take a VolumeSnapshot, but the snapshot stays inside the cluster as a local resource and cannot be moved out of it. So you need Velero to get the backup out of the cluster.
Update: The Ceph-CSI team has now released ceph-csi-v3.0.0, which supports the v1beta1 K8s snapshot storage API (https://www.humblec.com/ceph-csi-v3-0-0-released-snapshot-clone-multi-arch-rox/). This is in Rook now and we could use that instead of Restic, though I hit bugs when I tried. See my other blog post related to this: https://medium.com/techlogs/velero-with-csi-a883e8a24710
Step 1: Install S3 Object Storage. Simplest is Minio
Here is a sample with Minio backed by Rook-Ceph
https://gist.github.com/alexcpn/2986863352400cc1c7907a32f2fd0cac
After this, port-forward so that you can access Minio externally
kubectl port-forward minio-64b7c649f9-9xf5x --address 0.0.0.0 7000:9000 --namespace minio
Or you can create an Ingress and use that.
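A minimal sketch of such an Ingress, assuming the Minio Service is named minio, listens on port 9000 in the minio namespace, and that you have an ingress controller handling nip.io style hosts (adjust the names to your setup):
cat << EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: minio-ingress
  namespace: minio
spec:
  rules:
  - host: minio.10.x.y.z.nip.io
    http:
      paths:
      - path: /
        backend:
          serviceName: minio
          servicePort: 9000
EOF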
Step 2: Install Velero with Restic
Create an S3 credentials file credentials-velero with the following content:
cat credentials-velero
[default]
aws_access_key_id = minio
aws_secret_access_key = <your pass>
Next, download the Velero tarball from https://github.com/vmware-tanzu/velero/releases/download/v1.5.1/velero-v1.5.1-linux-amd64.tar.gz and move the velero binary to a directory on your PATH.
Note that Velero expects a kubeconfig with ClusterAdmin privileges; with that you can install it from your local machine into any cluster.
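Something like the following, assuming a Linux amd64 machine with wget available:
wget https://github.com/vmware-tanzu/velero/releases/download/v1.5.1/velero-v1.5.1-linux-amd64.tar.gz
tar -xvf velero-v1.5.1-linux-amd64.tar.gz
sudo mv velero-v1.5.1-linux-amd64/velero /usr/local/bin/
velero version --client-only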
If you are using an Ingress, the s3Url needs to change accordingly, e.g. s3Url=http://minio.10.x.y.z.nip.io
Without CSI
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket velero2 \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.10.x.y.z.nip.io \
--image velero/velero:v1.4.0 \
--snapshot-location-config region="default" \
--use-restic
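Once the install finishes, it is worth checking that the Velero deployment and the Restic daemonset pods are running and that the backup location is reachable:
kubectl -n velero get pods
velero backup-location get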
Without Restic, using the CSI snapshot class of your provider
Let’s tackle that as another blog — https://medium.com/techlogs/velero-with-csi-a883e8a24710
Using Restic
“We integrated restic with Velero so that users have an out-of-the-box solution for backing up and restoring almost any type of Kubernetes volume*. This is a new capability for Velero, not a replacement for existing functionality. If you’re running on AWS, and taking EBS snapshots as part of your regular Velero backups, there’s no need to switch to using restic. However, if you’ve been waiting for a snapshot plugin for your storage platform, or if you’re using EFS, AzureFile, NFS, emptyDir, local, or any other volume type that doesn’t have a native snapshot concept, restic might be for you.”
Step 3: Test it out — Create a Test pod and PV and add some data
Pre-requisite: if you are using Rook-Ceph or similar for storage, ensure that you have the right storage driver (CSI or Flex) in both source and target. This matters when you back up in one cluster and restore in another (target) cluster; for that scenario to work, the same StorageClass and storage plugin must be available in the target.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: test-nginx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-ext
  labels:
    app: nginx
  namespace: test-nginx
spec:
  storageClassName: rook-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
  namespace: test-nginx
spec:
  volumes:
    - name: mystorage
      persistentVolumeClaim:
        claimName: ceph-ext
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: mystorage
EOF
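A quick check that the pod is running and the PVC got bound to a volume:
kubectl -n test-nginx get pods,pvc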
We will be using the following StorageClass in both source and target. Note again that we are not backing up the StorageClass from source to target; that is usually not a good idea, as versions etc. could be different.
cat << EOF | kubectl apply -f -
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  # If a failure domain of host is selected, then CRUSH will ensure that each replica
  # of the data is stored on a different host. https://docs.ceph.com/docs/master/rados/operations/crush-map/
  failureDomain: host
  replicated:
    size: 2
    # Disallow setting pool with replica 1, this could lead to data loss without recovery.
    # Make sure you're *ABSOLUTELY CERTAIN* that is what you want
    requireSafeReplicaSize: true
    # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
    # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
    #targetSizeRatio: .5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  # clusterID is the namespace where the rook cluster is running
  # If you change this namespace, also change the namespace below where the secret namespaces are defined
  clusterID: rook-ceph
  # If you want to use erasure coded pool with RBD, you need to create
  # two pools. one erasure coded and one replicated.
  # You need to specify the replicated pool here in the `pool` parameter, it is
  # used for the metadata of the images.
  # The erasure coded pool must be set as the `dataPool` parameter below.
  #dataPool: ec-data-pool
  pool: replicapool
  # RBD image format. Defaults to "2".
  imageFormat: "2"
  # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
  imageFeatures: layering
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
  # in hyperconverged settings where the volume is mounted on the same node as the osds.
  csi.storage.k8s.io/fstype: ext4
  # uncomment the following to use rbd-nbd as mounter on supported nodes
  # **IMPORTANT**: If you are using rbd-nbd as the mounter, during upgrade you will be hit a ceph-csi
  # issue that causes the mount to be disconnected. You will need to follow special upgrade steps
  # to restart your application pods. Therefore, this option is not recommended.
  #mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete
EOF
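You can verify that the pool and the StorageClass were created:
kubectl -n rook-ceph get cephblockpool replicapool
kubectl get storageclass rook-block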
Let’s write some random data in the Pod. Restic de-duplicates data, so if the test data is not random you will not get a realistic measure of backup performance.
kubectl -n test-nginx exec -it nginx-test -- /bin/bash
root@nginx-test:/usr/share/nginx/html# dd if=/dev/urandom of=/usr/share/nginx/html/test-file3.txt count=512000 bs=1024
512000+0 records in
512000+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 8.58373 s, 61.1 MB/s
[root@green--1 ~]# kubectl -n test-nginx exec -it nginx-test -- /bin/bash
root@nginx-test:/# ls -laSh /usr/share/nginx/html/
total 501M
-rw-r--r-- 1 root root 500M Sep 7 05:42 test-file3.txt
drwx------ 2 root root 16K Sep 7 05:29 lost+found
drwxrwxrwx 3 root root 4.0K Sep 7 05:42 .
drwxr-xr-x 3 root root 18 Aug 14 00:36 ..
The state of pods and PVC in the namespace
[root@green--1 velero]# k get pods,pvc,pv -n test-nginx
NAME             READY   STATUS    RESTARTS   AGE
pod/nginx-test   1/1     Running   0          18h

NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/ceph-ext   Bound    pvc-a7a87cee-abb2-4db8-a445-fc95b4f8a237   1Gi        RWO            rook-ceph-block   18h

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS      REASON   AGE
persistentvolume/pvc-a7a87cee-abb2-4db8-a445-fc95b4f8a237   1Gi        RWO            Delete           Bound    test-nginx/ceph-ext   rook-ceph-block            18h
Now we need to annotate the pod volumes that we want to back up. Note that in a future version* of Velero this will not be necessary, but for now it is needed; without it the PV and PVC are copied but the data is not.
*From v1.5 onward this is not needed: https://github.com/vmware-tanzu/velero/issues/1871
kubectl -n test-nginx annotate pod/nginx-test backup.velero.io/backup-volumes=mystorage
Now let’s use Velero to take a backup.
[root@green--1 ~]# velero backup create test-nginx-b4 --include-namespaces test-nginx --wait
Backup request "test-nginx-b4" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
..
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test-nginx-b4` and `velero backup logs test-nginx-b4`.
That’s it. In your Minio server, you can see the backup stored.
[root@green--1 velero]# velero get backups
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
test-nginx-b4 Completed 0 1 2020-08-06 11:27:30 +0530 IST 29d default <none>
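To confirm that the Restic pod volume backup was actually taken (and not just the Kubernetes objects), describe the backup with details:
velero backup describe test-nginx-b4 --details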
Now, let’s delete our `test-nginx` namespace (or restore it in a different cluster)
kubectl delete ns test-nginx
And restore it back via Velero
velero restore create --from-backup test-nginx-b4
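You can watch the restore progress with (the restore name is auto-generated from the backup name):
velero restore get
velero restore describe <restore-name>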
You can see that everything in the namespace is restored
root@k8s-storage-1:~# kubectl -n test-nginx exec -it nginx-test -- /bin/bash
root@nginx-test:/# ls -laSh /usr/share/nginx/html/
total 501M
-rw-r--r-- 1 root root 500M Sep 7 05:42 test-file3.txt
drwx------ 2 root root 16K Sep 7 05:29 lost+found
drwxrwxrwx 4 root root 4.0K Sep 7 06:10 .
drwxr-xr-x 3 root root 4.0K Aug 14 00:36 ..
drwxr-xr-x 2 root root 4.0K Sep 7 06:10 .velero
Here is the restore part
https://asciinema.org/a/358619
That’s it.
A few more details
Limitations of Restic?
https://velero.io/docs/v1.4/restic/
Restic scans each file in a single thread. This means that large files (such as ones storing a database) will take a long time to scan for data deduplication, even if the actual difference is small. If you plan to use the Velero restic integration to backup 100GB of data or more, you may need to customize the resource limits to make sure backups complete successfully.
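One way to do that (a sketch; the request/limit values here are only illustrative) is to set the Restic daemonset resources at install time via the velero install resource flags, re-using the install flags from Step 2:
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket velero2 \
--secret-file ./credentials-velero \
--backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.10.x.y.z.nip.io \
--use-restic \
--restic-pod-cpu-request 500m \
--restic-pod-mem-request 512Mi \
--restic-pod-cpu-limit 1 \
--restic-pod-mem-limit 1Gi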
Here is a test with 5 GB and 10 GB of data.
5 GB in ~5 minutes:
[root@green--1 ~]# date && velero backup create test-5g-2 --include-namespaces test-nginx --wait && date
Thu Sep 3 12:44:24 IST 2020
Backup request "test-5g-2" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
............................................................................................................................................................................................................................................................................................................
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test-5g-2` and `velero backup logs test-5g-2`.
Thu Sep 3 12:49:24 IST 2020
8.5 GB test in ~9 minutes (deleted the old backup to prevent an incremental backup):
[root@green--1 velero]# date && velero backup create test-10g-8 --include-namespaces test-nginx --wait && date
Thu Sep 3 18:27:42 IST 2020
Backup request "test-10g-8" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test-10g-8` and `velero backup logs test-10g-8`.
Thu Sep 3 18:36:31 IST 2020
How to restore to another Cluster?
Install Velero in the second cluster, pointing to the same S3 bucket.
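A sketch of the flow, assuming Velero has been installed in the target cluster with the same bucket, credentials and backup-location settings as in Step 2:
velero backup get
velero restore create --from-backup test-nginx-b4
The backups stored in the shared S3 bucket show up in the target cluster once the backup location syncs.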
How to Schedule backups?
https://velero.io/docs/v1.4/disaster-case/
Example: every five minutes
velero schedule create s-test-nginx --include-namespaces test-nginx --schedule "*/5 * * * *"
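You can then list the schedule and the backups it creates (they are named <schedule-name>-<timestamp>):
velero schedule get
velero backup get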
If S3 is down the scheduled backup will fail; the next one will be successful
# velero get backups
s-test-nginx-20200902062502 Completed 0 0 2020-09-02 11:55:02 +0530 IST 29d default <none>
s-test-nginx-20200902062105 Failed 0 0 2020-09-02 11:51:05 +0530 IST 29d default <none>
Can I back up only particular resources, e.g. Persistent Volumes, Secrets?
Yes, you can filter via --include-resources
velero backup create test-pv-10 --include-namespaces test-nginx --include-resources persistentvolumeclaims,persistentvolumes --wait
Can I filter by a particular name?
Partially, yes, via --selector
You can give a label here.
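For example, using the app: nginx label that was put on the PVC earlier (the backup name is illustrative):
velero backup create test-label-1 --include-namespaces test-nginx --selector app=nginx --wait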
Can I restore to another namespace in the same cluster?
Yes, via namespace mapping (--namespace-mappings). Note that this does not work for PVCs in the same cluster; you can do it when restoring into another cluster.
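A sketch of the flag usage (the target namespace name here is illustrative):
velero restore create --from-backup test-nginx-b4 --namespace-mappings test-nginx:test-nginx-copy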
Are backups incremental?
Yes
Are restores incremental?
No
What if the connection to S3 breaks?
For scheduled backups, the next schedule will trigger
An in-progress backup will stay InProgress forever; you need to delete the backup and delete the Velero pod to recover. This looks like a bug.
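A sketch of the recovery steps (the backup name is a placeholder):
velero backup delete <stuck-backup-name>
kubectl -n velero rollout restart deployment/velero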