UPDATE! As of Docker 1.9.0, Docker has named volumes, which replace data-only containers. The article below is still useful as a way to think about data inside Docker, but consider using named volumes rather than data containers to implement the pattern it describes.
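For readers on Docker 1.9.0 or later, a minimal sketch of the named-volume approach might look like the following. The volume name appdata and the two helper function names are hypothetical, chosen only for illustration; /foo is the same mount path used in the examples further down.

```shell
# Hypothetical volume name, for illustration only
VOLUME=appdata

# Write a file into the named volume. "-v NAME:/path" mounts the
# named volume at /path; Docker creates the volume on first use.
write_to_volume() {
  docker run --rm -v "$VOLUME":/foo busybox \
    sh -c 'echo hello named volume > /foo/testing.txt'
}

# Read the file back from a second, unrelated container.
read_from_volume() {
  docker run --rm -v "$VOLUME":/foo busybox cat /foo/testing.txt
}
```

Calling write_to_volume and then read_from_volume should show the data surviving between two throwaway containers, with no data container involved. The volume itself can be listed with docker volume ls and deleted with docker volume rm.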
You use the hot new containerization technology called Docker. You have persistent data that you need to keep across invocations of a container. Google this and you will find a consistent piece of advice: for portability, use volumes mounted from data-only containers.
When I first saw this, I didn’t understand it. In order to persist that data, I need to have it on my host machine (right?). So why create another data-only container, only to have the same mapping to the host data? Wasn’t it simply a level of abstraction that pushed the problem back one level, but didn’t really accomplish much? I wasn’t the only one.
The Aha! Moment
After futzing with Docker for a while, and reading the documentation for data-only containers multiple times, I finally realized I had to shift my mindset from “this data must logically and physically exist on my host” to “this data logically exists within a data-only container and I (probably) don’t care where it physically exists on my host”.
Even though containers do not persist data across invocations, volumes declared by containers do, even when they are not explicitly mapped to any host directory. Docker volumes, and the data they contain, survive as long as any container references them, even indirectly. So as long as the data container exists (even if it is not running), the data is logically and effectively “stored” within Docker.
To see this without getting too far into the data-container abstraction, let’s run a container with a bound Docker volume (without --rm, so the container stays around after execution), and write some data to the mount:
# Note that we do not bind the volume from the host
$ docker run -v /foo --name="vtest" busybox \
sh -c 'echo hello docker volume > /foo/testing.txt'
We’ve created a data container by binding a volume from docker, and written something to the volume using the same container for simplicity.
Now run a second container and confirm the data exists, this time mounting the volume from the previously run data container:
$ docker run --volumes-from=vtest busybox cat /foo/testing.txt
hello docker volume
The key takeaway is that we persisted some data using Docker without ever binding it to the host, and as long as vtest exists, the data exists.
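To make that lifetime rule concrete, here is a small sketch. The helper names are hypothetical; the docker flags are standard, though the handling of orphaned volumes has varied between Docker versions, so treat the comments as describing the general case rather than every release.

```shell
# Show the data container even though it is not running.
# "docker ps -a" includes stopped containers.
show_data_container() {
  docker ps -a --filter "name=$1"
}

# Remove the data container together with its volumes.
# After this the data is gone, unless another container
# still references the same volume.
remove_data_container() {
  docker rm -v "$1"
}
```

For example, show_data_container vtest should list vtest as exited, and the data remains readable until remove_data_container vtest is run.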
As an aside, the physical location of the data is of course on the host — somewhere in /var/lib/docker/ — but this almost never matters. One exception to this is if the data must be mounted on specific hardware for ops purposes. This can be handled simply by a host bind of the volume on the data container, but logically everything remains exactly the same.
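If you are curious where the data physically lives, or need to pin it to a particular disk, something like the following sketch could help. It assumes a Docker version whose inspect output includes a .Mounts field; the function names and the host path in the usage note are hypothetical.

```shell
# Print "container path -> host path" for each of a container's volumes.
volume_locations() {
  docker inspect --format \
    '{{ range .Mounts }}{{ .Destination }} -> {{ .Source }} {{ end }}' "$1"
}

# Create a data container whose /foo volume is bound to a specific
# host directory. Consumers still use --volumes-from exactly as before.
create_pinned_data_container() {
  host_dir="$1"
  name="$2"
  docker run -v "$host_dir":/foo --name "$name" busybox true
}
```

For example, volume_locations vtest would show the host directory backing /foo, and create_pinned_data_container /mnt/fastdisk/foo vpinned (an assumed path and name) would create a data container backed by that directory.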
OK, the data logically exists within a container and is not (explicitly) mounted to any host location. Fine, but then how do I view the data? How do I edit it? Back it up? And so on? Now you need to think containers, containers, containers. Whenever you think “I need to mount this volume on my host to do x”, change that in your head to “I need to create a container that uses --volumes-from the data-only container to do x within”.
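Here is a rough sketch of what that looks like for viewing, editing, and backing up, using the vtest data container from earlier. The function names and the backup file name are my own; busybox is used only because it ships ls, vi, and tar.

```shell
# Name of the data-only container from the earlier example
DATA=vtest

# View: list the volume's contents from a throwaway container.
view_data() {
  docker run --rm --volumes-from "$DATA" busybox ls -l /foo
}

# Edit: open a file on the volume interactively.
edit_data() {
  docker run --rm -it --volumes-from "$DATA" busybox vi "/foo/$1"
}

# Back up: archive the volume into the current host directory.
# The host bind here is only a destination for the tarball.
backup_data() {
  docker run --rm --volumes-from "$DATA" -v "$(pwd)":/backup busybox \
    tar cf /backup/foo-backup.tar /foo
}
```

Each helper starts a short-lived container, does its work against the volume, and is removed by --rm when it exits. Only the backup touches the host, and only to drop off the archive.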
When everything you need to do with the data on the volume lives within one or more containers, you neatly side-step the problems that arise when you try to access the data via the host. For example, matching a host uid/gid to a container uid/gid makes the container non-portable to any host where that user or group does not exist or has a different uid/gid. When all access happens from within containers, that problem is solved relatively simply.
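As one possible illustration of the uid/gid point, ownership can be set from inside a container so that no host user is ever involved. The 1000:1000 default below is an assumed example value, and the function names are hypothetical.

```shell
# Hand ownership of the volume's contents to the uid:gid the
# application container runs as (defaults to an assumed 1000:1000).
fix_ownership() {
  owner="${1:-1000:1000}"
  docker run --rm --volumes-from vtest busybox chown -R "$owner" /foo
}

# Check the numeric owners as seen from inside a container.
show_ownership() {
  docker run --rm --volumes-from vtest busybox ls -ln /foo
}
```

Because the ids are only ever interpreted inside containers, it does not matter whether a matching user or group exists on the host.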
It also means you can do neat things like creating tools containers with useful utilities and scripts that operate on your data, all of which access the volume via the data-only container, and all of which, as a result, are completely portable from host to host.
Data-only containers, far from being useless as I originally thought, are a key component to building a completely portable container-based ecosystem.