The Problems with VMware Snapshots
There are generally three ways we create snapshots in this world: copy-on-write, redirect-on-write, and VMware’s way. Because of how VMware’s way works, you should delete VMware snapshots as soon as possible, preferably within a few minutes of their creation.
A previous post discussed the difference between copy-on-write and redirect-on-write snapshots, and it is well worth the read if you are not familiar with those terms. VMware’s snapshot style is completely different from either of them.
Once VMware takes a snapshot of a VM, all writes to the VMDKs that make up that VM stop. Instead, new writes go to an alternate area. As long as that snapshot exists, all new writes go to the alternate area, and every read has to consult both the original VMDK and the alternate area in order to supply the current version of all blocks.
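To make that behavior concrete, here is a minimal toy model in Python. It is only a sketch of the idea described above, not VMware's actual on-disk format: the class name `DeltaSnapshotDisk` and the dictionaries `base` and `delta` are hypothetical stand-ins for the original VMDK and the alternate area.

```python
class DeltaSnapshotDisk:
    """Toy model of a disk with a VMware-style snapshot (hypothetical names)."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # stands in for the original VMDK
        self.delta = None         # alternate area; None means no snapshot exists

    def take_snapshot(self):
        # From here on the base is frozen and every write lands in the delta.
        self.delta = {}

    def write(self, block_no, data):
        if self.delta is None:
            self.base[block_no] = data
        else:
            self.delta[block_no] = data

    def read(self, block_no):
        # A read must check the alternate area first, then fall back to the base.
        if self.delta is not None and block_no in self.delta:
            return self.delta[block_no]
        return self.base[block_no]


disk = DeltaSnapshotDisk({0: "a0", 1: "b0"})
disk.take_snapshot()
disk.write(1, "b1")
print(disk.read(0), disk.read(1))  # a0 b1
print(disk.base[1])                # b0 (the base still holds the old image)
```

Notice that after the snapshot the base never changes, so it is the base that ends up holding the point-in-time image while the current view is assembled from two places.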
First, this is completely opposite to the way all other snapshots work. Other snapshot systems use an alternate area only to preserve the before image of changed blocks. In the case of a copy-on-write system, the before image is copied out to the alternate area before overwriting a block. In a redirect-on-write system, the before image is left in place and its pointer preserved, while the current view of the active volume is given a pointer to a new block. Again, I cover this in detail in my prior post. But in VMware snapshots, the actual VMDK becomes the “snapshot” of the previous point in time, and the primary volume is forced to read what it needs from the alternate area.
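For comparison, the two conventional styles can be sketched the same way. This is a simplified illustration under my own assumptions: a fixed set of block numbers, a single snapshot, and no space reclamation. The class and attribute names are hypothetical and do not come from any real storage product.

```python
class CopyOnWriteVolume:
    """Copy the before image out to the alternate area, then overwrite in place."""

    def __init__(self, blocks):
        self.primary = dict(blocks)  # assumes a fixed set of block numbers
        self.snap_area = None        # alternate area holding before images

    def take_snapshot(self):
        self.snap_area = {}

    def write(self, block_no, data):
        # Preserve the before image only the first time a block changes.
        if self.snap_area is not None and block_no not in self.snap_area:
            self.snap_area[block_no] = self.primary[block_no]
        self.primary[block_no] = data

    def read(self, block_no):
        return self.primary[block_no]  # current data always lives in the primary

    def read_snapshot(self, block_no):
        return self.snap_area.get(block_no, self.primary[block_no])


class RedirectOnWriteVolume:
    """Leave the before image in place and point the active view at a new block."""

    def __init__(self, blocks):
        self.store = list(blocks)                # physical blocks, append-only here
        self.active = list(range(len(blocks)))   # logical block -> physical index
        self.snapshot = None

    def take_snapshot(self):
        self.snapshot = list(self.active)        # copies pointers, not data

    def write(self, block_no, data):
        self.store.append(data)                  # new data goes to a new block
        self.active[block_no] = len(self.store) - 1

    def read(self, block_no):
        return self.store[self.active[block_no]]

    def read_snapshot(self, block_no):
        return self.store[self.snapshot[block_no]]


cow = CopyOnWriteVolume({0: "a0", 1: "b0"})
cow.take_snapshot()
cow.write(1, "b1")
print(cow.read(1), cow.read_snapshot(1))  # b1 b0

row = RedirectOnWriteVolume(["a0", "b0"])
row.take_snapshot()
row.write(1, "b1")
print(row.read(1), row.read_snapshot(1))  # b1 b0
```

In both sketches the alternate area (or the preserved pointer) exists only to keep the before image, and dropping the snapshot means discarding that before image rather than copying current data back.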
If that were the only difference, one could argue that we are talking about semantics or having too many concerns about the trees and not the forest. Unfortunately, there is a much bigger problem awaiting VMware administrators who leave their snapshots in place too long. Once they delete a snapshot, all writes made since the software took the snapshot must be copied from the alternate area to where they belong in the primary volume. Depending on how long the snapshot has existed and how many writes have happened since it was created, this could be quite a bit of I/O that has to happen all at once.
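The cost of deleting a snapshot can be sketched with the same kind of toy model. The function `consolidate` and the dictionaries `base` and `delta` are hypothetical names of mine. The model also assumes that a block rewritten several times occupies a single slot in the alternate area, which is a property of this sketch and not something stated above about VMware.

```python
def consolidate(base, delta):
    """Merge the alternate area back into the base; return the block copies done."""
    copied = 0
    for block_no, data in delta.items():
        base[block_no] = data  # one copy per block changed since the snapshot
        copied += 1
    delta.clear()              # the alternate area is empty once the merge is done
    return copied


base = {n: "old" for n in range(8)}
delta = {}
for n in [1, 3, 3, 5]:         # block 3 is written twice but stored once
    delta[n] = "new"

print(consolidate(base, delta))  # 3
```

The work done at deletion time grows with the number of distinct blocks changed while the snapshot existed, which is why a snapshot deleted after a few minutes is cheap to remove and one left in place for weeks is not.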