Using Velero and Restic to Backup Kubernetes Resources
Kubernetes Bare-metal with Rook-Ceph
Environment
We are doing this in a bare-metal K8s cluster with a Rook-Ceph based StorageClass for Persistent Volumes: Rook 1.2 (which supports the CSI driver) and Ceph 14.2. Note that the exact versions are not important here, as the CSI Snapshot feature is still not available in Rook 1.2; we use Restic instead.
Need for Velero and Restic
With Rook-Ceph 1.2 or higher and the CSI volume driver you can take a VolumeSnapshot, but the snapshot stays inside the cluster as a local resource and cannot be moved out of it. So you need Velero to get the backup out of the cluster.
Update: The Ceph-CSI team has now released ceph-csi-v3.0.0, which supports the v1beta1 K8s snapshot storage API (https://www.humblec.com/ceph-csi-v3-0-0-released-snapshot-clone-multi-arch-rox/). This is in Rook now and we could use that instead of Restic, though I hit bugs when I tried. See my other blog post related to this: https://medium.com/techlogs/velero-with-csi-a883e8a24710
Step 1: Install S3 Object Storage. Simplest is Minio
Here is a sample with Minio backed by Rook-Ceph
https://gist.github.com/alexcpn/2986863352400cc1c7907a32f2fd0cac
After this, port-forward so that you can access Minio externally
kubectl port-forward minio-64b7c649f9-9xf5x --address 0.0.0.0 7000:9000 --namespace minio
Or you can create an Ingress and use that.
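A minimal sketch of such an Ingress, assuming the Minio Service is named minio, listens on port 9000 in the minio namespace, and that you have an ingress controller handling nip.io style hosts (adjust the names to your setup):
cat << EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: minio-ingress
  namespace: minio
spec:
  rules:
  - host: minio.10.x.y.z.nip.io
    http:
      paths:
      - path: /
        backend:
          serviceName: minio
          servicePort: 9000
EOF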
Step 2: Install Velero with Restic
Create an S3 credentials file credentials-velero with the following content:
cat credentials-velero
[default]
aws_access_key_id = minio
aws_secret_access_key = <your pass>
Next, download the Velero tarball from https://github.com/vmware-tanzu/velero/releases/download/v1.5.1/velero-v1.5.1-linux-amd64.tar.gz and move the velero binary to a directory on your PATH.
Note that Velero expects a kubeconfig with ClusterAdmin privileges; with that you can install it from your local machine into any cluster.
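Something like the following, assuming a Linux amd64 machine with wget available:
wget https://github.com/vmware-tanzu/velero/releases/download/v1.5.1/velero-v1.5.1-linux-amd64.tar.gz
tar -xvf velero-v1.5.1-linux-amd64.tar.gz
sudo mv velero-v1.5.1-linux-amd64/velero /usr/local/bin/
velero version --client-only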
If you are using an Ingress, the s3Url needs to change accordingly, e.g. s3Url=http://minio.10.x.y.z.nip.io
Without CSI
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket velero2 \
--secret-file ./credentials-velero \
--use-volume-snapshots=true \
--backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.10.x.y.z.nip.io \
--image velero/velero:v1.4.0 \
--snapshot-location-config region="default" \
--use-restic
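Once the install finishes, it is worth checking that the Velero deployment and the Restic daemonset pods are running and that the backup location is reachable:
kubectl -n velero get pods
velero backup-location get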
Without Restic, using the CSI snapshot class of your provider
Let’s tackle that as another blog — https://medium.com/techlogs/velero-with-csi-a883e8a24710
Using Restic
“We integrated restic with Velero so that users have an out-of-the-box solution for backing up and restoring almost any type of Kubernetes volume*. This is a new capability for Velero, not a replacement for existing functionality. If you’re running on AWS, and taking EBS snapshots as part of your regular Velero backups, there’s no need to switch to using restic. However, if you’ve been waiting for a snapshot plugin for your storage platform, or if you’re using EFS, AzureFile, NFS, emptyDir, local, or any other volume type that doesn’t have a native snapshot concept, restic might be for you.”
Step 3: Test it out — Create a Test pod and PV and add some data
Pre-requisite: if you are using Rook-Ceph or similar for storage, ensure that you have the right storage driver (CSI or Flex) in both source and target. This matters when you back up in one cluster and restore in another (target) cluster; for that scenario to work, the same StorageClass and storage plugin must be available in the target.
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: test-nginx
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-ext
  labels:
    app: nginx
  namespace: test-nginx
spec:
  storageClassName: rook-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx-test
  namespace: test-nginx
spec:
  volumes:
    - name: mystorage
      persistentVolumeClaim:
        claimName: ceph-ext
  containers:
    - name: task-pv-container
      image: nginx
      ports:
        - containerPort: 80
          name: "http-server"
      volumeMounts:
        - mountPath: "/usr/share/nginx/html"
          name: mystorage
EOF
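A quick check that the pod is running and the PVC got bound to a volume:
kubectl -n test-nginx get pods,pvc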
We will be using the following StorageClass in both source and target. Note again that we are not backing up the StorageClass from source to target; that is usually not a good idea, as versions etc. could be different.
cat << EOF | kubectl apply -f -
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  # If a failure domain of host is selected, then CRUSH will ensure that each replica
  # of the data is stored on a different host. https://docs.ceph.com/docs/master/rados/operations/crush-map/
  failureDomain: host
  replicated:
    size: 2
    # Disallow setting pool with replica 1, this could lead to data loss without recovery.
    # Make sure you're *ABSOLUTELY CERTAIN* that is what you want
    requireSafeReplicaSize: true
    # gives a hint (%) to Ceph in terms of expected consumption of the total cluster capacity of a given pool
    # for more info: https://docs.ceph.com/docs/master/rados/operations/placement-groups/#specifying-expected-pool-size
    #targetSizeRatio: .5
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-block
# Change "rook-ceph" provisioner prefix to match the operator namespace if needed
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  # clusterID is the namespace where the rook cluster is running
  # If you change this namespace, also change the namespace below where the secret namespaces are defined
  clusterID: rook-ceph
  # If you want to use erasure coded pool with RBD, you need to create
  # two pools. one erasure coded and one replicated.
  # You need to specify the replicated pool here in the `pool` parameter, it is
  # used for the metadata of the images.
  # The erasure coded pool must be set as the `dataPool` parameter below.
  #dataPool: ec-data-pool
  pool: replicapool
  # RBD image format. Defaults to "2".
  imageFormat: "2"
  # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only `layering` feature.
  imageFeatures: layering
  # The secrets contain Ceph admin credentials. These are generated automatically by the operator
  # in the same namespace as the cluster.
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  # Specify the filesystem type of the volume. If not specified, csi-provisioner
  # will set default as `ext4`. Note that `xfs` is not recommended due to potential deadlock
  # in hyperconverged settings where the volume is mounted on the same node as the osds.
  csi.storage.k8s.io/fstype: ext4
  # uncomment the following to use rbd-nbd as mounter on supported nodes
  # **IMPORTANT**: If you are using rbd-nbd as the mounter, during upgrade you will be hit a ceph-csi
  # issue that causes the mount to be disconnected. You will need to follow special upgrade steps
  # to restart your application pods. Therefore, this option is not recommended.
  #mounter: rbd-nbd
allowVolumeExpansion: true
reclaimPolicy: Delete
EOF
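You can verify that the pool and the StorageClass were created:
kubectl -n rook-ceph get cephblockpool replicapool
kubectl get storageclass rook-block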
Let’s write some random data in the Pod. Restic de-duplicates data, so if the test data is not random you will not get a realistic measure of backup performance.
kubectl -n test-nginx exec -it nginx-test -- /bin/bash
root@nginx-test:/usr/share/nginx/html# dd if=/dev/urandom of=/usr/share/nginx/html/test-file3.txt count=512000 bs=1024
512000+0 records in
512000+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 8.58373 s, 61.1 MB/s
[root@green--1 ~]# kubectl -n test-nginx exec -it nginx-test -- /bin/bash
root@nginx-test:/# ls -laSh /usr/share/nginx/html/
total 501M
-rw-r--r-- 1 root root 500M Sep 7 05:42 test-file3.txt
drwx------ 2 root root 16K Sep 7 05:29 lost+found
drwxrwxrwx 3 root root 4.0K Sep 7 05:42 .
drwxr-xr-x 3 root root 18 Aug 14 00:36 ..
The state of pods and PVC in the namespace
[root@green--1 velero]# k get pods,pvc,pv -n test-nginx
NAME             READY   STATUS    RESTARTS   AGE
pod/nginx-test   1/1     Running   0          18h

NAME                             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
persistentvolumeclaim/ceph-ext   Bound    pvc-a7a87cee-abb2-4db8-a445-fc95b4f8a237   1Gi        RWO            rook-ceph-block   18h

NAME                                                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                 STORAGECLASS      REASON   AGE
persistentvolume/pvc-a7a87cee-abb2-4db8-a445-fc95b4f8a237   1Gi        RWO            Delete           Bound    test-nginx/ceph-ext   rook-ceph-block            18h
Now we need to annotate the pod volumes that we want to back up. Note that in a future version* of Velero this will not be necessary, but for now it is needed; without it the PV and PVC are copied but the data is not.
*From v1.5 onward this is not needed: https://github.com/vmware-tanzu/velero/issues/1871
kubectl -n test-nginx annotate pod/nginx-test backup.velero.io/backup-volumes=mystorage
Now let’s use Velero to take a backup.
[root@green--1 ~]# velero backup create test-nginx-b4 --include-namespaces test-nginx --wait
Backup request "test-nginx-b4" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
..
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test-nginx-b4` and `velero backup logs test-nginx-b4`.
That’s it. In your Minio server, you can see the backup stored.
[root@green--1 velero]# velero get backups
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
test-nginx-b4 Completed 0 1 2020-08-06 11:27:30 +0530 IST 29d default <none>
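To confirm that the Restic pod volume backup was actually taken (and not just the Kubernetes objects), describe the backup with details:
velero backup describe test-nginx-b4 --details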
Now, let’s delete our `test-nginx` namespace (or restore it in a different cluster)
kubectl delete ns test-nginx
And restore it back via Velero
velero restore create --from-backup test-nginx-b4
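You can watch the restore progress with (the restore name is auto-generated from the backup name):
velero restore get
velero restore describe <restore-name>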
You can see that everything in the namespace is restored
root@k8s-storage-1:~# kubectl -n test-nginx exec -it nginx-test -- /bin/bash
root@nginx-test:/# ls -laSh /usr/share/nginx/html/
total 501M
-rw-r--r-- 1 root root 500M Sep 7 05:42 test-file3.txt
drwx------ 2 root root 16K Sep 7 05:29 lost+found
drwxrwxrwx 4 root root 4.0K Sep 7 06:10 .
drwxr-xr-x 3 root root 4.0K Aug 14 00:36 ..
drwxr-xr-x 2 root root 4.0K Sep 7 06:10 .velero
Here is the restore part
https://asciinema.org/a/358619
That’s it.
A few more details
Limitations of Restic?
https://velero.io/docs/v1.4/restic/
Restic scans each file in a single thread. This means that large files (such as ones storing a database) will take a long time to scan for data deduplication, even if the actual difference is small. If you plan to use the Velero restic integration to backup 100GB of data or more, you may need to customize the resource limits to make sure backups complete successfully.
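One way to do that (a sketch; the request/limit values here are only illustrative) is to set the Restic daemonset resources at install time via the velero install resource flags, re-using the install flags from Step 2:
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.0.0 \
--bucket velero2 \
--secret-file ./credentials-velero \
--backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://minio.10.x.y.z.nip.io \
--use-restic \
--restic-pod-cpu-request 500m \
--restic-pod-mem-request 512Mi \
--restic-pod-cpu-limit 1 \
--restic-pod-mem-limit 1Gi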
Here is a test with 5 GB and 10 GB of data.
5 GB in ~5 minutes:
[root@green--1 ~]# date && velero backup create test-5g-2 --include-namespaces test-nginx --wait && date
Thu Sep 3 12:44:24 IST 2020
Backup request "test-5g-2" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
............................................................................................................................................................................................................................................................................................................
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test-5g-2` and `velero backup logs test-5g-2`.
Thu Sep 3 12:49:24 IST 2020
8.5 GB test in ~9 minutes (deleted the old backup to prevent an incremental backup):
[root@green--1 velero]# date && velero backup create test-10g-8 --include-namespaces test-nginx --wait && date
Thu Sep 3 18:27:42 IST 2020
Backup request "test-10g-8" submitted successfully.
Waiting for backup to complete. You may safely press ctrl-c to stop waiting - your backup will continue in the background.
.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
Backup completed with status: Completed. You may check for more information using the commands `velero backup describe test-10g-8` and `velero backup logs test-10g-8`.
Thu Sep 3 18:36:31 IST 2020
How to restore to another Cluster?
Install Velero in the second cluster, pointing to the same S3 bucket.
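A sketch of the flow, assuming Velero has been installed in the target cluster with the same bucket, credentials and backup-location settings as in Step 2:
velero backup get
velero restore create --from-backup test-nginx-b4
The backups stored in the shared S3 bucket show up in the target cluster once the backup location syncs.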
How to Schedule backups?
https://velero.io/docs/v1.4/disaster-case/
Example: every five minutes
velero schedule create s-test-nginx --include-namespaces test-nginx --schedule "*/5 * * * *"
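You can then list the schedule and the backups it creates (they are named <schedule-name>-<timestamp>):
velero schedule get
velero backup get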
If S3 is down the scheduled backup will fail; the next one will be successful
# velero get backups
s-test-nginx-20200902062502 Completed 0 0 2020-09-02 11:55:02 +0530 IST 29d default <none>
s-test-nginx-20200902062105 Failed 0 0 2020-09-02 11:51:05 +0530 IST 29d default <none>
Can I back up only particular resources, e.g. Persistent Volumes, Secrets?
Yes, you can filter via --include-resources
velero backup create test-pv-10 --include-namespaces test-nginx --include-resources persistentvolumeclaims,persistentvolumes --wait
Can I filter by a particular name?
Partially, yes, via --selector
You can give a label here.
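For example, using the app: nginx label that was put on the PVC earlier (the backup name is illustrative):
velero backup create test-label-1 --include-namespaces test-nginx --selector app=nginx --wait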
Can I restore to another namespace in the same cluster?
Yes, via namespace mapping (--namespace-mappings). Note that this does not work for PVCs in the same cluster; you can do it when restoring into another cluster.
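A sketch of the flag usage (the target namespace name here is illustrative):
velero restore create --from-backup test-nginx-b4 --namespace-mappings test-nginx:test-nginx-copy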
Are backups incremental?
Yes
Are restores incremental?
No
What if the connection to S3 breaks?
For scheduled backups, the next schedule will trigger
An in-progress backup will stay InProgress forever; you need to delete the backup and delete the Velero pod to recover. This looks like a bug.
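A sketch of the recovery steps (the backup name is a placeholder):
velero backup delete <stuck-backup-name>
kubectl -n velero rollout restart deployment/velero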