Mobile filesystems — when native storage will let you down

At the beginning of 2012 I started developing for mobile devices. Before that, I did some graph visualization work and backend-oriented development. In the software engineering world, I have grown accustomed to a sense of gravity, where by “gravity” I mean a certain set of basic ideas or laws you can rely on unconditionally, similar to how you can rely on the planet’s gravity. In this post, I want to share some of the experiences we’ve had at KeepSafe that left us questioning our own notion of gravity in the software engineering world.

File Systems are unreliable

At KeepSafe we do a lot of file system operations; as our app is essentially a photo vault, dealing with the file system comes with the territory. However, we didn’t just run into issues with the photo files but also with the smaller files we store beside the photos. We ultimately found that many of the things we take for granted about how file system operations should behave don’t always hold. And when things break, they break in strange, unpredictable ways. In my previous work I would never have questioned the reliability of something as concrete as a file system; it’s what I would consider part of the software engineering gravity idea. Unfortunately, we’ve seen massive file system related inconsistencies over time that forced us to adopt a different mindset.

Because our app now has a large number of users, combined with the availability of good in-app crash analytics, we have repeatedly found weird edge cases that should not happen and that left us scratching our heads. We’ve grown to be more mindful and careful of certain operations and of how we structure our file system code in general. Here are a few things we’ve incorporated into our engineering mindset at KeepSafe.

Avoid changing or moving data

In the next few paragraphs I want to dive a little deeper into behaviors that we’ve encountered during different operations and how we managed to solve those.

As a guiding principle for everything we do by now, we are very careful about not touching data once it has been written successfully to disk and verified as usable. We try hard not to touch data, to avoid the risk of any type of data loss after a modification. In more detail:

Renaming files and folders can lead to data loss

Unfortunately we have seen problems during and after renaming different files or folders within our application. This was very surprising to us as the assumption has always been that

file.renameTo(destinationFile);

is one of the safest operations that you can do on the file system.

After renaming a directory that our app had created in the past, we’ve experienced FileNotFoundException: ENOENT (No such file or directory) or FileNotFoundException: EISDIR (Is a directory) exceptions while trying to open a new InputStream on existing files within the directory that we had just renamed. The File objects that we tried to open were returned from file.listFiles() on the directory that was just renamed.

We follow the rule of not touching files when possible pretty strictly now. After importing an image into KeepSafe, we do not touch the file except when exporting the file. All other modifications, such as changing a photo’s orientation, are stored in a different place, removing the need to modify the file itself.

Changing a file’s content

If you must change the content of a file, make sure to NEVER change the actual file in place. An in-place overwrite via a write operation is not guaranteed to complete. When an overwrite does not complete, it leaves the file in an ambiguous state where the new content is not written completely and the old content no longer exists, which is the worst case.

To help combat such a case, we wrote our own little TransactionSafeFileUtil class to facilitate good practices when attempting to write a new version of a file. Our TransactionSafeFileUtil stores the new content in an original_file_name.new file, and then renames the original file to original_file_name.old. Finally, the original_file_name.new file is renamed to original_file_name.

Transaction safe file writing

With this technique we can always guarantee that there is at least one valid version of the file on disk in case anything fails during the process. Because the files can be in different states, we synchronize read and write operations per file to guarantee that it’s in the right state.
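The sequence described above can be sketched roughly as follows. This is a minimal illustration and not the actual TransactionSafeFileUtil: the class name SafeFileWriter and both method names are hypothetical, a single class-wide lock stands in for per-file synchronization, and the fallback to the .old file in resolve() is an assumed recovery rule that the description above does not spell out.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical sketch of the .new/.old write sequence (not KeepSafe's class). */
final class SafeFileWriter {
    private SafeFileWriter() {}

    /** Replaces the content of target without ever overwriting it in place. */
    static synchronized void write(File target, byte[] data) throws IOException {
        File newFile = new File(target.getPath() + ".new");
        File oldFile = new File(target.getPath() + ".old");

        // 1. Write the new content next to the original and force it to disk.
        try (FileOutputStream fos = new FileOutputStream(newFile)) {
            fos.write(data);
            fos.getFD().sync();
        }
        if (newFile.length() != data.length) {
            throw new IOException("short write: " + newFile);
        }

        // 2. Move the original out of the way, if there is one.
        if (target.exists()) {
            if (oldFile.exists() && !oldFile.delete()) {
                throw new IOException("could not remove stale " + oldFile);
            }
            if (!target.renameTo(oldFile)) {
                throw new IOException("could not rename " + target + " to " + oldFile);
            }
        }

        // 3. Promote the new file to the original name.
        if (!newFile.renameTo(target)) {
            throw new IOException("could not rename " + newFile + " to " + target);
        }

        // 4. Best-effort cleanup; the previous version is no longer needed.
        oldFile.delete();
    }

    /** Assumed recovery rule: prefer target, fall back to .old after an interrupted write. */
    static synchronized File resolve(File target) {
        if (target.exists()) {
            return target;
        }
        File oldFile = new File(target.getPath() + ".old");
        return oldFile.exists() ? oldFile : target;
    }
}
```

A caller would use SafeFileWriter.write(file, bytes) for every update and open SafeFileWriter.resolve(file) when reading, so that a crash between steps 2 and 3 still leaves a readable previous version.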

Despite our attempt to follow best practices, we’ve unfortunately seen incomplete writes occur on quite a few end-user devices when writing a file to disk. Upon further investigation, we learned that our attempt to make sure that the content was written fully to disk was flawed. Contrary to what we previously thought, OutputStream.flush() does not ensure that all buffered data is completely written to disk. We now know to always call

fileOutputStream.getFD().sync();

if we need assurance that the file was completely written. The sync() function guarantees that the data buffered by the OS is written to the physical device (disk). You can read more at: http://docs.oracle.com/javase/6/docs/api/java/io/FileDescriptor.html#sync%28%29

Triple check what you write and read

Even with the technique described above we run into strange problems on some phones. For this reason we double check everything we read and write, especially important files. When you read a file that is expected to be 32 bytes long, it’s important to check that the number of bytes read is indeed 32. On Android we see quite a few write operations fail without any exception thrown.
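As an illustration of the read-side check, here is a small sketch. The class name ExactReader and the method readExactly are hypothetical names chosen for this example; the only library behavior it relies on is that DataInputStream.readFully throws an EOFException when the stream ends early.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

/** Hypothetical helper: read a file only if it has exactly the expected size. */
final class ExactReader {
    private ExactReader() {}

    static byte[] readExactly(File file, int expectedLength) throws IOException {
        // Reject the file up front if the reported size is already wrong.
        if (file.length() != expectedLength) {
            throw new IOException("expected " + expectedLength + " bytes but "
                    + file + " has " + file.length());
        }
        byte[] data = new byte[expectedLength];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            // readFully throws EOFException (an IOException) on a short read.
            in.readFully(data);
        }
        return data;
    }
}
```

For the 32-byte file mentioned above, the call would be ExactReader.readExactly(file, 32), and any size mismatch surfaces as an IOException instead of a silently truncated buffer.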

Here is an example of some code:

FileOutputStream fos = new FileOutputStream(file);
BufferedOutputStream bos = new BufferedOutputStream(fos);
bos.write(byteData);
bos.flush();
// sync to disk as recommended: <http://android-developers.blogspot.com/2010/12/saving-data-safely.html>
fos.getFD().sync();
fos.close();

if (file.length() != byteData.length) {
    // md is a java.security.MessageDigest set up for MD5
    final byte[] originalMD5Hash = md.digest(byteData);

    InputStream is = new FileInputStream(file);
    BufferedInputStream bis = new BufferedInputStream(is);
    byte[] buffer = new byte[4096];
    int bytesRead;

    // only hash the bytes that were actually read, not the whole buffer
    while ((bytesRead = bis.read(buffer)) > -1) {
        md.update(buffer, 0, bytesRead);
    }
    bis.close();

    final byte[] writtenFileMD5Hash = md.digest();

    if (!Arrays.equals(originalMD5Hash, writtenFileMD5Hash)) {
        String message = String.format(
                "After an fsync, the file's length is not equal to the number of bytes we wrote!\n" +
                "path=%s, expected=%d, actual=%d. >> " +
                "Original MD5 Hash: %s, written file MD5 hash: %s",
                file.getAbsolutePath(), byteData.length,
                file.length(),
                digestToHex(originalMD5Hash),
                digestToHex(writtenFileMD5Hash));
        throw new GiantWtfException(message);
    }
}
return true;

We didn’t believe we would hit that GiantWtfException() when we put that check in, but in reality, it happens quite often in the wild. The number of bytes from the file is always less than the number of bytes from the byte array.

Unfortunately, we see the same issue on iOS from time to time:

NSDictionary *fIn = [[NSFileManager defaultManager]
    attributesOfItemAtPath:in error:NULL];
NSDictionary *fOut = [[NSFileManager defaultManager]
    attributesOfItemAtPath:out error:NULL];

if (![fOut fileSize] || ([fOut fileSize] != [fIn fileSize])) {
    NSLog(@"File size is wrong");
    return nil;
}

Our app contains user generated content that is irreplaceable, so checking for consistency is extremely important: this is data that can’t be reproduced.

If it doesn’t work the first time, try again

We also discovered that just because writing a file does not work the first time, it doesn’t mean that it won’t work when you try again.

I have seen a variety of IOExceptions that ideally should never happen. Retrying the same operation sometimes works. Some examples:

On a file that I write over and over, and I can see in the logs that I did just that

java.io.FileNotFoundException: /storage/emulated/0/folder/filename: open failed: EACCES (Permission denied)

In the location I use all the time to write

java.io.IOException: Read-only file system

An exception that is even more confusing…

java.lang.IllegalStateException: neither file nor directory: /mnt/sdcard/Android/data/packageName/cache/filename_not_written_by_me

Thrown in my own code at

if (!d.isDirectory()) {
    throw new IllegalArgumentException("given param is not a directory");
}

File[] files = d.listFiles();

if (files != null) {
    for (File f : files) {
        if (f.isFile()) {
            f.delete();
        } else if (f.isDirectory()) {
            // empty the subdirectory first, then remove it
            deleteDirectoryContent(f);
            f.delete();
        } else {
            throw new IllegalStateException(
                    "neither file nor directory: " + f.getAbsolutePath());
        }
    }
}

Every now and then you will also see

java.io.FileNotFoundException: /path: open failed: ENOENT (No such file or directory)

on files that you wrote to disk and made 100% sure that everything is fine.

I had a great time reading up on Linux System Errors to see what can go wrong at a file system level.

Prepare for failure

Sometimes things just don’t work. If we need to write important data to disk and it does not work, even after a few retries, we just throw a RuntimeException. I’d rather have the app crash hard than keep running in an unhealthy state. For example, we crash the client when we cannot persist the generated encryption key, to prevent encrypting files with a key that will soon no longer exist. The goal is to always keep your user’s data in a state that is either recoverable or valid, with strong preference for validity.
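The retry-then-crash policy can be sketched like this. The names Retry, IoAction and runOrCrash are hypothetical, and the number of attempts is left to the caller because the text above only says “a few retries”; this is an illustration of the idea, not the code we ship.

```java
import java.io.IOException;

/** Hypothetical sketch: retry an I/O action a few times, then crash hard. */
final class Retry {
    private Retry() {}

    /** An action that may fail with an IOException, e.g. persisting a key. */
    interface IoAction {
        void run() throws IOException;
    }

    static void runOrCrash(IoAction action, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return; // success, nothing more to do
            } catch (IOException e) {
                last = e; // remember the failure and try again
            }
        }
        // Every attempt failed: crash instead of running in an unhealthy state.
        throw new RuntimeException("giving up after " + maxAttempts + " attempts", last);
    }
}
```

The RuntimeException keeps the last IOException as its cause, so the crash report still shows which file system error was behind the failure.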

Flash memory

Finally, I just want to touch on flash memory for a bit. By and large, Android devices store all of their data on flash memory, and flash memory has its limitations. Each block can only handle a finite number of write (erase) cycles before it wears out, which is called memory wear. To mitigate memory wear, flash storage uses different kinds of wear leveling. Still, it is worth keeping in mind that continuous file system operations degrade the underlying hardware.


Originally published at keepsafe.github.io on July 9, 2014.