File Name Hashing: Creating a Hashed Directory Structure

Most modern file systems do not limit the number of files you can store in a single directory. However, depending on the type of file system, operations such as listing files, following paths, or checking for the existence of a file could take longer as the number of entries in a directory increases. The performance of a file system is dependent on many factors, and it can be influenced by the hardware it is running on.

So this brings us to a typical problem: How can one store a large number of files while maintaining a high level of performance during access? One solution is file name hashing.

File name hashing in the simplest terms can be defined as, creating a known and reproducible path, based on the name of the file. For example, “cat.gif” might be stored on the file system as:

/c/ca/cat.gif

“dog.gif” might be stored as:

/d/do/dog.gif

Using this type of directory structure, there will be 26 possible root directories and 26² possible sub-directories. If we limit the number of files per sub-directory to 1000 this gives us a storage capacity of:

26 * 26^2 * 1000 = 17,576,000 files.

Not bad. But there is a problem with this. There is no guarantee of a balanced distribution. If we have 17 million files that start with “ca”, we will end up putting 17 million files in the /c/ca directory.

A better solution might be to devise a directory path based on the “hash code” of the file name. In Java, the hash code of a String object is returned by the hashCode() method. The following code will print the hash code for the String “cat.gif”:

public static void main(String[] args) {
String fileName = "cat.gif";
int hash = fileName.hashCode();
System.out.println("The hash of " + fileName + "is: " + hash);
}

The code generates the following output:

The hash of cat.gif is: 554180012

The hashCode() method of the String class “converts” the string to a 32-bit integer using this simple algorithm:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

One way of devising a directory structure from this 32-bit integer, is to use each byte as a directory. Here is what 554180012 looks like in binary code:

100001 00001000 00011101 10101100

The first byte is on the right and the last byte is on the left. The trailing bits in the last byte are 0s so they are not displayed, just like 077 is displayed as 77. A space has been inserted between each byte so we can easily see that there are 4 of them.

Let’s extract the first byte. This is done by ANDing the number with 255. Why 255? 255 is the highest number that can be represented with 8 bits (1 byte). All 8 bits are set to 1. When we AND each bit of our hash code with each bit in 255 — we will get a 1 only if BOTH bits are 1:

100001 00001000 00011101 10101100
& 11111111
---------------------------------
= 10101100

Because the 24 trailing bits of 11111111 are all zeros (remember we do not show them) the only bits to “fall through” when we AND, are the bits that make up the 1st byte of our hash code. This is called “masking”. We have “masked” the trailing 3 bytes of our hash code. We can also say, 255 is is our “mask”.

Let’s shift the bits of our hash code 8 places to the right — then AND again:

100001 00001000 00011101
& 11111111
------------------------
= 00011101

As you see, we move the 2nd byte into the 1st position — AND again — thereby extracting the 2nd byte the same way we did the first. If we use each extracted byte value as a level in our directory structure, How many files would our structure hold? Remember, 255 is the max value of a byte, but there are 256 possible values: 0 through 255. So This gives us:

256 * 256 = 65,536 directories * 1000 files in each directory = 65,536,000 files.

Lets do a test. The following java code will create a 2-level directory hash for the String “cat.gif”.

public static void main(String[] args) {

String fileName = "cat.gif";
int hash = fileName.hashCode();
    int mask = 255;
int firstDir = hash & mask;
int secondDir = (hash >> 8) & mask;
    String path = new StringBuilder(File.separator)
.append(String.format("%03d", firstDir))
.append(File.separator)
.append(String.format("%03d", secondDir))
.append(File.separator)
.append(fileName)
.toString();
    System.out.println(path);
}

The code generates the following output:

/172/029/cat.gif

If we ever need to find “cat.gif”, we can easily reproduce this path using the same algorithm.

Now let’s see if we have an even distribution. The following program will create 10,000,000 unique file names, create a hashed path based on each name, then count the number of files that end up in each directory. We want to distribute the files evenly across the 65536 directories. This time instead of a decimal value for each directory name, we will use a HEX value:

public static void main(String[] args) {
Map<String, Integer> counter = new HashMap<>();
    for (int i = 1; i <= 10000000; i++) {
String fileName = UUID.randomUUID().toString();
int hash = fileName.hashCode();

int mask = 255;
int firstDir = hash & mask;
int secondDir = (hash >> 8) & mask;
        String path = new StringBuilder(File.separator)
.append(String.format("%02x", firstDir))
.append(File.separator)
.append(String.format("%02x", secondDir)
.toString();
        if (counter.containsKey(path)) {
counter.put(path, counter.get(path) + 1);
} else {
counter.put(path, 1);
}
}
    int count = 0;
    for (String key : counter.keySet()) {
System.out.println(++count + ". " + key + ": " +
counter.get(key));
}
}

The program generates the following output (results will vary):

1. /77/07: 140
2. /77/06: 145
3. /77/05: 137
4. /77/04: 144
5. /77/03: 148
6. /77/02: 148
7. /77/01: 156
8. /77/00: 153
9. /77/09: 134
10. /77/08: 163
.
.
.
65525. /ea/33: 174
65526. /ea/39: 162
65527. /ea/38: 157
65528. /ea/37: 114
65529. /ea/2a: 176
65530. /ea/2d: 164
65531. /ea/2e: 165
65532. /ea/2b: 142
65533. /ea/2c: 158
65534. /ea/2f: 163
65535. /1b/a1: 145
65536. /1b/a0: 156

The HEX value still gives us 256 * 256 = 65536 directories but we lose two chars off our path. The size of our structure expressed in base16 HEX is:

16² * 16² * 1000 = 65,536,000

This looks pretty balanced. Each directory contains 130–170 files. If you need to store more files you can always use the 3rd and 4th byte of the 4-byte hash code and create a 3 or 4 level directory structure.

~M

Did you find this article useful? Show some love and recommend it. Click the heart at the bottom of the page.

ABOUT THE AUTHOR:
Michael Andrews is a hands-on engineering lead and platform architect with 20 years of experience building scalable high-performance applications, APIs, and backend systems and services. Experienced thought leader with success leading engineering teams, and designing software and architectures which support evolving business needs. Committed to developing light-weight malleable software and decoupled event-driven code. Experienced cloud architect. Specialist in DevOps, CICD, and blue/green fully-tested low-risk no down-time automated deployments.

TECHNICAL PROFICIENCIES:
Backend architecture and platform engineering; Java and JVM performance tuning; distributed message-based systems; multi-threaded programming and concurrency; RESTful web services secured with OAuth2; multi-tenant applications and APIs; various persistence technologies such as JPA, Hibernate, MySQL, MongoDB, Neo4J, and Solr; The Spring Framework; JavaEE and Tomcat; various client-side technologies and frameworks; Logback and SLF4J; Chef; Jenkins; GIT; Maven; Artifactory; Docker; Kubernetes; Helm; Domain-driven Design and Hexagonal Architecture (ports and adaptors); DevOps and CICD including automation for infrastructure, testing, deployments, and business continuity and disaster recovery; Amazon Web Services (AWS) with extensive use of VPC, EC2, CloudFormation, Route53, ECR, Kinesis Firehose, Athena, RDS, S3, and IAM.