Hands-on with the Lustre FSx plugin for EKS, a shared file system provisioner for machine learning workloads on Kubernetes

Daniel Megyesi
Infrastructure adventures
Feb 21, 2020

At the end of 2019, AWS released a Lustre storage driver plugin for their managed Kubernetes solution, EKS. (Technically, it should work on any Kubernetes installation of 1.14 or above running in AWS.)

Their managed Lustre filesystem, FSx for Lustre, is a pretty nice solution to quickly deploy a high-performance, POSIX-capable filesystem which allows concurrent writes from multiple clients. This is a very helpful tool for machine learning workloads, where you need to re-use the same training data, especially since you can attach this Lustre FS totally transparently to an S3 bucket in the background, so you will see the S3 bucket’s contents as a standard filesystem (with concurrent writes!).

This sounds very nice on paper, and according to AWS it’s ready for production (despite its beta state), so I decided to do a quick review and analysis of the ease of deployment and operation, and to find out the possible limitations, before we commit to potentially offering this to our internal customers.

Photo by Paolo Nicolello on Unsplash

Analysis and findings

This is more or less a brain dump; please excuse the format, I didn’t want the article to be even longer with too much explanation. I assume you’re familiar with the basic AWS components.

To get it up and running on an existing Kubernetes cluster:

  • create: a new IAM policy for FSx + a new IAM role, which needs to be attached via Kiam to both the controller Deployment and the per-node agent DaemonSet

Although I’m talking about Kiam here, as the more generic and popular solution, by default AWS demos this with their native Kubernetes-to-IAM implementation: IAM Roles for Service Accounts.

  • create: a new security group (SG) allowing tcp/988 bidirectionally, attached to both the worker nodes and the auto-provisioned Lustre FS
  • deploy: this is a Kustomize deployment, kubectl apply -k installs it; the only replacement happening is the Docker image addresses

Sorry about the unspecific URL pointing to master; the plugin is under heavy development and they don’t set git tags properly at the time of writing this article.

  • after deploy:
    — patch image URL to use eu-west-1 (your region)
    — patch Deployment to use IAM role annotation
    — patch DaemonSet to use IAM role annotation
  • patch the kube-system namespace to allow Kiam to assume roles inside the same AWS account (see the annotation sketch after this list)
    — alternatively, move all the FSx components under your own (Kiam-enabled) namespace, instead of the default hardcoded kube-system
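For reference, a minimal sketch of what those patches could look like in a Kiam setup. The role name fsx-csi-driver is a hypothetical placeholder, and the first snippet is a strategic-merge patch fragment (for the controller Deployment and the node DaemonSet), not a complete manifest:

```yaml
# Hypothetical IAM role name; use whatever role you attached the FSx policy to.
# 1) Pod template annotation, patched into both the controller Deployment
#    and the per-node agent DaemonSet (e.g. via kubectl patch or kustomize):
spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: fsx-csi-driver
---
# 2) Namespace annotation so Kiam permits pods in kube-system to assume roles:
apiVersion: v1
kind: Namespace
metadata:
  name: kube-system
  annotations:
    iam.amazonaws.com/permitted: ".*"
```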

Components

  • controller runs as an HA deployment with leader election; watches for PersistentVolumeClaims
  • node agent runs on every potential worker node, in privileged mode, mounting the system disk + using the host network
  • StorageClass: defines the Lustre endpoint; it can be dynamic as well, so there is no need to pre-create the FS (as you have to with their other storage plugin, the managed NFS solution, AWS EFS)
  • must specify in the StorageClass (SC) definition (see the example manifest below):
    — single zone subnet ID → we need to create 1 SC per subnet
    — security group ID we created above
    — if using S3 as background storage: S3 bucket URL + folder → we need to create 1 SC per subnet AND per S3 folder path
  • PersistentVolumeClaim: references a StorageClass; need to define Lustre FS size: 1200 GB, 2400 GB or n x 3600 GB

0.2 GB/s of I/O throughput is provisioned per 1000 GB of storage (this can scale up to hundreds of GB/s, if you can afford the cost)
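To make this concrete, here is a minimal StorageClass sketch based on the driver’s dynamic provisioning example at the time of writing. The subnet, security group and bucket values are placeholders, and the parameter names may still change while the plugin is under heavy development:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc-eu-west-1a                 # one SC per subnet (and per S3 path)
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0      # single-AZ subnet the Lustre FS lives in
  securityGroupIds: sg-0123456789abcdef0  # the tcp/988 SG created earlier
  s3ImportPath: s3://my-training-data/datasets         # optional: S3-backed FS
  s3ExportPath: s3://my-training-data/datasets/export  # optional: export target
```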

Monitoring

  • there are a lot of CloudWatch metrics for FSx, which can be fetched with the CloudWatch Exporter (see the config sketch below)
  • there are no (Prometheus) metrics for the driver components themselves
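A minimal CloudWatch Exporter config sketch for scraping a couple of FSx metrics; the metric selection here is just an example, not an exhaustive list:

```yaml
# prometheus/cloudwatch_exporter configuration (example metrics only)
region: eu-west-1
metrics:
  - aws_namespace: AWS/FSx
    aws_metric_name: FreeDataStorageCapacity
    aws_dimensions: [FileSystemId]
    aws_statistics: [Minimum]
  - aws_namespace: AWS/FSx
    aws_metric_name: DataWriteBytes
    aws_dimensions: [FileSystemId]
    aws_statistics: [Sum]
```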

Billing, costs

  • $0.154/GB/month (eu-west-1); 1200 GB is the minimum storage for each FS created → ~$185/month per 1200 GB provisioned (~$6/day)
  • it is billed proportionally; you can delete and re-create the FS anytime, and it takes ~5 minutes to provision 1200 GB

If it’s attached to a backing S3 bucket, you keep the data even if you delete and re-create the filesystem at any time.

  • additional cost: cross-AZ traffic, if you attach the FS from a node in a different AZ

The managed Lustre filesystem lives in a single AZ.

  • additional cost: S3 storage cost if you want to persist the data written to this file system

How it actually works

  • you create a StorageClass, which defines in which AZ the Lustre filesystem gets spawned and whether it has an S3 bucket and folder behind it
  • you create the PersistentVolumeClaim (PVC), which defines the filesystem size and some mount options (noatime, ...)

1 PVC = 1 Lustre FSx instance.

  • you can create multiple PVCs (= filesystem instances) from the same StorageClass (and the same S3 bucket folder); a minimal claim manifest is sketched below

The sync to S3 is not real-time; you specifically need to trigger it in a controlled fashion.
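A minimal claim sketch, assuming the hypothetical fsx-sc-eu-west-1a StorageClass from above; each claim like this results in its own Lustre filesystem:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany            # multiple pods can mount it read-write
  storageClassName: fsx-sc-eu-west-1a
  resources:
    requests:
      storage: 1200Gi          # 1200 GB is the minimum Lustre size
```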

  • you can attach the same PVC to multiple containers in RW-RW mode
  • once the PVC is created, the controller notices the event and starts provisioning the filesystem (it takes ~5 minutes for 1200 GB)
  • the PVC will be in Pending state; when the underlying FS is created, a new PersistentVolume object appears and is bound to the PVC, thus ready to be mounted
  • a single ENI is created in the VPC in the specified subnet + a DNS alias for it
    — this is the customer-facing endpoint of the AWS-managed Lustre filesystem
    — this ENI is attached to the security group specified in the StorageClass definition
  • you create a pod which wants to mount this PVC (see the pod sketch after this list)
  • pod gets scheduled to a worker node
  • worker node’s FSx DaemonSet realizes there’s a Lustre mount request; it will talk to the AWS API to mount this FS to the underlying EC2 host
    — once the mount is done, the pod will be started and gets the FS mounted through the EC2 host
  • pod sees the remote filesystem as a normal directory, can create files inside, they will be automatically saved to the Lustre FS
  • there is NO automated S3 sync!
  • whatever you write to Lustre will never go to S3 until you manually trigger a sync, either from the AWS console, via the AWS CLI, or by running a special CLI tool which you add to your Docker image
    — when you ask for a sync, a Data Repository Task object is created and it eventually does an “rsync” between Lustre and S3, only touching the actually modified files
    — if your pod terminates early and cannot finish/do an S3 sync, you don’t lose data; it will persist in Lustre as long as you don’t delete the filesystem itself (the PVC), and you can do the S3 sync anytime
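To close the loop, a minimal pod sketch that mounts the claim from above; the image and paths are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fsx-app
spec:
  containers:
    - name: trainer
      image: busybox                      # placeholder image
      command: ["sh", "-c", "echo hello > /data/out.txt && sleep 3600"]
      volumeMounts:
        - name: training-data
          mountPath: /data                # the Lustre FS appears as a normal directory
  volumes:
    - name: training-data
      persistentVolumeClaim:
        claimName: fsx-claim
```

Anything written under /data lands on Lustre immediately; it only reaches S3 once you explicitly trigger the sync described above.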

Technically you should see the files already existing in S3 as normal files and directories on the Lustre FS; however, it did not work for me: I was able to write new files and sync them to S3, but anything I put in S3 never appeared on the Lustre side.

It could have been a caching issue in the DaemonSet, in the EC2 host itself, in the controller, or anywhere else.

I was approaching the end of the week, and this information was already good enough to make a decision about whether we want to spend more time with it or not. If you have RW-RW anyway and you are only going to write to S3 once, and then process the dataset many, many times, you might just want to use S3 as a backup/checkpoint storage, to avoid reprocessing large datasets from scratch.

Conclusion

Although I only spent a couple of hours testing and validating this storage driver plugin, it seems good enough to start playing with and to run some actual workload on it. It is definitely a big improvement over their EFS plugin, where you had to pre-provision the NFS filesystem, fetch its ID and then hardcode it inside your StorageClass definition.

I’m not particularly happy about the lack of internal metrics in the components, and privilege handling is still a bit underdeveloped: triggering an S3 export requires users to run a container in privileged mode with the CAP_SYS_ADMIN capability.

The installation method has the stereotypical AWS project traits: curl github.com | kubectl apply - or | bash. Maybe it’s time I contributed a Helm chart to the community.

Managing your own distributed, concurrent-write capable, scalable filesystem is a big enough pain to appreciate the work behind this project: just pay up to a few hundred dollars a month to AWS and use their service, spawn a new filesystem whenever you need it, delete it when your task is finished, and so on, all of it from inside Kubernetes!

Looking forward to how it will develop over the next few months!
