C++ Pathnames: A Windows Guide

Josh Weinstein
Published in The Startup
6 min read · Jun 30, 2020

Pathnames are an unusual part of C++: before C++17, they were not covered by the cross-platform, standardized portion of the language. The Windows operating system has its own approach to pathnames for files and directories, which is quite different from that of UNIX operating systems. This guide goes over the functionality and properties of Windows pathnames and how they can be used from C++.

A pathname is a string of characters representing the location of a file in a filesystem. A filesystem uses particular encoding(s) to represent pathnames. Historically, pathnames in Windows were represented only with encodings from ANSI code pages. In modern Windows systems, "ANSI" loosely means "8-bit code pages", since encodings outside the ANSI standards, such as UTF-8, are present among them. Internally, Windows APIs use Unicode, in the form of UTF-16. In a C++ program on Windows, the AreFileApisANSI function from windows.h reports whether the file API functions are using the ANSI code page or the OEM code page.

To simplify, think about the following meanings for almost all cases:

  • char-based ("A") functions -> 8-bit code page pathnames, which this guide treats as UTF-8
  • wchar_t-based ("W") functions -> UTF-16 pathnames

The Windows APIs provide separate functions for UTF-8 pathnames and UTF-16 pathnames, and each set uses a different character type.

Character Types

In Windows, pathnames can be either UTF-8 encoded or UTF-16 encoded. When they are UTF-8 encoded, they use the char type. When they are UTF-16 encoded, they use the wchar_t type. Which encoding your program or application should use depends on the target Windows version as well as application-specific needs. If supporting both is desired, there are a few options available. The first is a Microsoft-written approach called TCHAR from the tchar.h header. This type is defined as either char or wchar_t depending on whether the _UNICODE macro is defined (the Windows headers use UNICODE for the same purpose). For the purposes of this article, we will use our own cross-encoding character type, called echar_t, whose underlying type depends on whether or not _PATH_UTF16 is defined.

Therefore, echar_t is defined as wchar_t when _PATH_UTF16 is defined, and as char otherwise.
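A minimal sketch of that definition, assuming _PATH_UTF16 is supplied by the build (for example, as a compiler flag) before this point:

```cpp
// echar_t: the character type used for every pathname in this guide.
// Define _PATH_UTF16 (for example on the compiler command line) to
// select UTF-16 paths; otherwise paths use plain char.
#ifdef _PATH_UTF16
typedef wchar_t echar_t;   // UTF-16 code units on Windows
#else
typedef char echar_t;      // 8-bit code units
#endif
```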

Most of the functions dealing with either char or wchar_t are already available once <windows.h> is included. Normally, though, they live in <string.h> and <wchar.h>, respectively. The two character types also use different literal syntax: wide character literals take the form const wchar_t* foo = L"foo";. A macro allowing cross-encoding literals, ECHAR, adds the L prefix only when _PATH_UTF16 is defined.

These two definitions will allow us to write statements such as const echar_t* str1 = ECHAR("abcd");. An important part of a pathname is the separator between directories. In the Windows operating system, the separator is a backslash, \. On UNIX systems, it's a forward slash, /. For convenience, we can define a specific macro for that separator.
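One way the literal macro and the separator macro might look is sketched below. ECHAR comes from the text above; the name ECHAR_SEP is an assumption made for this example.

```cpp
#ifdef _PATH_UTF16
typedef wchar_t echar_t;
#define ECHAR(s) L##s        // ECHAR("abcd") becomes L"abcd"
#else
typedef char echar_t;
#define ECHAR(s) s           // ECHAR("abcd") stays "abcd"
#endif

// The Windows directory separator as a single echar_t character.
// ECHAR_SEP is a hypothetical name chosen for this sketch.
#define ECHAR_SEP ECHAR('\\')
```

Note that the simple L##s form only works when a literal is passed to ECHAR directly; it does not expand another macro given as the argument.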

In Windows, absolute paths, those that give the location of a file from the root starting point, contain drive letters. A drive letter is a letter at the beginning of a path, in the form <letter>:, indicating which drive the path is located on. Depending on the setup and configuration of a Windows system, that drive letter could be A, B, C, D, or any other letter. Creating partitions in the file system can add more drive letter assignments, and thus more drives. The most commonly used drive is C:. A path such as \foo.txt is also an absolute path, on the current drive. As a rule, a path must start with either a drive letter followed by :\ (such as C:\) or with \ to be considered an absolute path.

In contrast, UNIX absolute paths begin with the root directory, /, such as /usr/bin/python.
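The Windows rule above can be sketched as a small check. The function names are hypothetical, and the check implements only the simplified rule from this guide, not every path form Windows accepts.

```cpp
#include <cassert>

// Returns true if c is an ASCII letter, for either character type.
template <typename Ch>
bool is_drive_letter(Ch c) {
    return (c >= Ch('A') && c <= Ch('Z')) || (c >= Ch('a') && c <= Ch('z'));
}

// Applies the rule from the text: a path is absolute if it begins with a
// drive letter, a colon and a backslash, or with a single backslash.
// Works for char and wchar_t strings alike.
template <typename Ch>
bool is_absolute_path(const Ch* path) {
    assert(path != nullptr);
    if (path[0] == Ch('\\')) {
        return true;   // rooted on the current drive
    }
    // Short-circuiting keeps each index within the null-terminated string.
    return is_drive_letter(path[0]) && path[1] == Ch(':') && path[2] == Ch('\\');
}
```

Because the check is a template over the character type, the same code serves both encodings without a conditional define.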

Now that the basic character types for paths have been defined, the next step is wrapping them via a class that can handle them as objects.

Path Objects

To make the handling of pathnames more efficient and concise, it's best to handle them as objects through the methods of a C++ class. The class we implement, WPath, will be an immutable string that represents a file system path with echar_t characters. However, to support the methods of the class, we need to establish how each operation handles UTF-8 or UTF-16 characters. To begin with, we need a string copy function that works for both. To provide the most fundamental string functionality, we also need a length function defined the same way. Both can be conditionally defined, selecting the char or wchar_t version of each function depending on _PATH_UTF16.

The wcscpy function found in <wchar.h> works the same as strcpy from <string.h>, but for wide characters (16-bit characters on Windows). Another difference between UTF-8 and UTF-16 strings is the size of the null character. In UTF-8, the null character, like any character from ASCII, is one byte. In UTF-16, it's two bytes. As in the other examples so far, we can use a conditional define to manage a null size for both encodings.
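The copy, length, and null-size definitions described above might be sketched as follows. The names echar_cpy, echar_len, ECHAR_NULL_SIZE, and the helper echar_bytes are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>   // strcpy, strlen
#include <cwchar>    // wcscpy, wcslen

#ifdef _PATH_UTF16
typedef wchar_t echar_t;
#define ECHAR(s) L##s
#define echar_cpy wcscpy
#define echar_len wcslen
#define ECHAR_NULL_SIZE sizeof(wchar_t)   // two bytes on Windows
#else
typedef char echar_t;
#define ECHAR(s) s
#define echar_cpy strcpy
#define echar_len strlen
#define ECHAR_NULL_SIZE sizeof(char)      // one byte
#endif

// Bytes needed to store a copy of s, including its terminating null.
// echar_bytes is a hypothetical helper showing where the null size is used.
inline std::size_t echar_bytes(const echar_t* s) {
    assert(s != nullptr);
    return echar_len(s) * sizeof(echar_t) + ECHAR_NULL_SIZE;
}
```

Using sizeof rather than the literal values 1 and 2 keeps the sketch correct on platforms where wchar_t is wider than 16 bits.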

The initial definition of our WPath class should have a constructor which takes a literal or otherwise null-terminated const echar_t* string as an argument. To avoid leaking memory, let's also add a destructor.
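A minimal sketch of such a class follows. The member names _path and _len, the accessors c_str() and size(), and the decision to forbid copying are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

#ifdef _PATH_UTF16
typedef wchar_t echar_t;
#define ECHAR(s) L##s
#define echar_cpy wcscpy
#define echar_len wcslen
#else
typedef char echar_t;
#define ECHAR(s) s
#define echar_cpy strcpy
#define echar_len strlen
#endif

class WPath {
public:
    // Copies a null-terminated string into storage owned by the object.
    explicit WPath(const echar_t* path) : _len(0), _path(nullptr) {
        assert(path != nullptr);
        _len = echar_len(path);
        _path = new echar_t[_len + 1];   // +1 for the terminating null
        echar_cpy(_path, path);
    }

    // Releases the owned buffer so the path does not leak.
    ~WPath() { delete[] _path; }

    // Copying is disabled here so the buffer is never freed twice.
    WPath(const WPath&) = delete;
    WPath& operator=(const WPath&) = delete;

    const echar_t* c_str() const { return _path; }
    std::size_t size() const { return _len; }

private:
    std::size_t _len;
    echar_t* _path;
};
```

With a destructor that frees memory, the compiler-generated copy operations would lead to a double delete, which is why this sketch deletes them.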

At this point, we have a Windows path represented in our class, but can't do anything file-related with it. The first filesystem operation we can implement for WPath is exists(). This method tells us whether the path held by the instance exists. It can be built on the GetFileAttributesA and GetFileAttributesW functions, used for UTF-8 and UTF-16 respectively. As in previous examples, these can be conditionally defined under a single macro, echar_info.

These two functions return a DWORD, a Windows-specific type that is a 32-bit unsigned integer. It contains information about the file that exists at the specified path. The call fails if the path does not exist, which makes it a basic way to check for file existence. Fortunately, Windows predefines a macro, INVALID_FILE_ATTRIBUTES, to represent an invalid file as a DWORD. The exists() method for our class simply compares the returned value against that macro.
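The existence check might be sketched as below. The function names attributes_valid and path_exists are hypothetical, and the definitions in the non-Windows branch are stand-ins added only so the comparison itself compiles elsewhere; the file system query is available only when building for Windows.

```cpp
#include <cassert>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins so the comparison compiles off Windows; the value matches
// the one in the Windows SDK. This is not a port of the API.
typedef std::uint32_t DWORD;
const DWORD INVALID_FILE_ATTRIBUTES = 0xFFFFFFFFu;
#endif

#ifdef _PATH_UTF16
typedef wchar_t echar_t;
#define echar_info GetFileAttributesW
#else
typedef char echar_t;
#define echar_info GetFileAttributesA
#endif

// True when the attribute word describes an existing file system entry.
constexpr bool attributes_valid(DWORD attrs) {
    return attrs != INVALID_FILE_ATTRIBUTES;
}

#ifdef _WIN32
// Asks Windows for the attributes of path and reports whether they are valid.
inline bool path_exists(const echar_t* path) {
    assert(path != nullptr);
    return attributes_valid(echar_info(path));
}
#endif
```

Inside WPath, exists() would apply the same comparison to the stored path. The call can also fail for reasons other than a missing path, such as denied access, so this remains a basic check.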

Now that the existence of a WPath can be checked, there needs to be a way to create and join new paths. Since WPath is an immutable string, the joining operation can be done via an operator function that takes two WPath instances and returns a new instance. This could conveniently be implemented under the operator/ method, as that allows paths to be joined via a syntax WPath c = a / b. To accomplish this, we'll also need a constructor that takes just a size_t and constructs an empty pathname, essentially reserving the size.
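A sketch of the join operation, under the assumption that neither operand already ends with a separator. The move constructor, the accessors, and the ECHAR_SEP name are additions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

#ifdef _PATH_UTF16
typedef wchar_t echar_t;
#define ECHAR(s) L##s
#define echar_cpy wcscpy
#define echar_len wcslen
#else
typedef char echar_t;
#define ECHAR(s) s
#define echar_cpy strcpy
#define echar_len strlen
#endif
#define ECHAR_SEP ECHAR('\\')

class WPath {
public:
    explicit WPath(const echar_t* path) : _len(0), _path(nullptr) {
        assert(path != nullptr);
        _len = echar_len(path);
        _path = new echar_t[_len + 1];
        echar_cpy(_path, path);
    }
    // Reserves room for len characters; the buffer starts zero-filled.
    explicit WPath(std::size_t len) : _len(len), _path(new echar_t[len + 1]()) {}
    // Moving lets operator/ return its result without copying the buffer.
    WPath(WPath&& other) noexcept : _len(other._len), _path(other._path) {
        other._len = 0;
        other._path = nullptr;
    }
    WPath(const WPath&) = delete;
    WPath& operator=(const WPath&) = delete;
    ~WPath() { delete[] _path; }

    const echar_t* c_str() const { return _path; }
    std::size_t size() const { return _len; }

    // Builds a new path holding a, one separator, then b.
    friend WPath operator/(const WPath& a, const WPath& b) {
        WPath joined(a._len + 1 + b._len);
        echar_cpy(joined._path, a._path);
        joined._path[a._len] = ECHAR_SEP;
        echar_cpy(joined._path + a._len + 1, b._path);
        return joined;
    }

private:
    std::size_t _len;
    echar_t* _path;
};
```

Neither operand is modified, which keeps WPath immutable as described above.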

Path Information

Now we have a path object that can be used to construct paths. Next, we want to be able to get more information about a specific path. For example, is it a file or a directory? Is it read-only? In the previous section, echar_info was defined as a function to retrieve the attributes of a path. It actually returns a 32-bit unsigned integer that acts as a set of bit flags. As described in the Windows documentation, this bitset can give plenty of information about a path. To make use of it, a property can be added to the WPath class, DWORD _info;, along with some Boolean methods that determine the path's characteristics. Specifically, three methods will be added to determine if the path is a file, a directory, or a symlink to a file.

With the new property and methods added, WPath stores the attribute flags in _info and answers each question by testing the corresponding bits.
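The three checks might be sketched as follows. To keep the example short, the hypothetical PathInfo class holds only the attribute word; in the full class these would be WPath methods reading _info. The constants in the non-Windows branch are stand-ins added so the bit tests compile elsewhere.

```cpp
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins so the bit tests compile off Windows; the values match the
// ones in the Windows SDK. This is not a port of the API.
typedef std::uint32_t DWORD;
const DWORD INVALID_FILE_ATTRIBUTES      = 0xFFFFFFFFu;
const DWORD FILE_ATTRIBUTE_DIRECTORY     = 0x00000010u;
const DWORD FILE_ATTRIBUTE_REPARSE_POINT = 0x00000400u;
#endif

// Holds the attribute word that WPath would keep in its _info property.
class PathInfo {
public:
    constexpr explicit PathInfo(DWORD info) : _info(info) {}

    // The invalid marker has every bit set, so check it before any flag.
    constexpr bool exists() const { return _info != INVALID_FILE_ATTRIBUTES; }

    constexpr bool is_dir() const {
        return exists() && (_info & FILE_ATTRIBUTE_DIRECTORY) != 0;
    }
    constexpr bool is_file() const {
        return exists() && (_info & FILE_ATTRIBUTE_DIRECTORY) == 0;
    }
    // A reparse point on a non-directory is treated as a symlink to a file.
    constexpr bool is_symlink() const {
        return is_file() && (_info & FILE_ATTRIBUTE_REPARSE_POINT) != 0;
    }

private:
    DWORD _info;
};
```

Testing exists() first matters: without it, the invalid marker would appear to have the directory and reparse point bits set.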

A note about symlinks: In Windows, both files and directories can have an associated reparse point. This means that the path carries a tag with user-defined information. The operating system uses this information to find the associated file system filter when a file or directory with a reparse point is opened. If a file has a reparse point, it is treated here as a symlink. For directories, that is not always the case: a directory on Windows may also be a junction point. For more details, see the Windows documentation on reparse points.

Directories

The last part of this tutorial will go over directories in the Windows operating system. As with previous operations, a pair of functions exists in the Windows headers for UTF-8 and UTF-16 encoded paths: CreateDirectoryA and CreateDirectoryW, respectively. Just like before, we can wrap those in a single macro, echar_mkdir(), and call it from a mkdir() method that returns a BOOL.

Here, BOOL is a custom Boolean type defined in the Windows system headers. This method resets the path's info property after calling echar_mkdir because, if the creation of the directory is successful, the info previously stored in _info is no longer accurate or valid. To further extend this functionality, many functions are available for more complex directory operations, like removing, copying, moving, and more.
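A sketch of the mkdir() method follows, using a reduced stand-in for WPath that borrows the path string instead of owning it. The class name WPathDir and its is_dir() method are assumptions for this example, and the API-dependent part compiles only when building for Windows.

```cpp
#include <cassert>

#ifdef _PATH_UTF16
typedef wchar_t echar_t;
#define ECHAR(s) L##s
#else
typedef char echar_t;
#define ECHAR(s) s
#endif

#ifdef _WIN32
#include <windows.h>

#ifdef _PATH_UTF16
#define echar_info  GetFileAttributesW
#define echar_mkdir CreateDirectoryW
#else
#define echar_info  GetFileAttributesA
#define echar_mkdir CreateDirectoryA
#endif

// Reduced stand-in for WPath: it does not copy or free the path string.
class WPathDir {
public:
    explicit WPathDir(const echar_t* path) : _path(path), _info(0) {
        assert(path != nullptr);
        _info = echar_info(_path);
    }

    // Creates the directory; NULL requests the default security attributes.
    BOOL mkdir() {
        BOOL created = echar_mkdir(_path, NULL);
        _info = echar_info(_path);   // the cached attributes are stale now
        return created;
    }

    bool is_dir() const {
        return _info != INVALID_FILE_ATTRIBUTES &&
               (_info & FILE_ATTRIBUTE_DIRECTORY) != 0;
    }

private:
    const echar_t* _path;
    DWORD _info;
};
#endif  // _WIN32
```

Refreshing _info on every call, rather than only on success, also keeps it accurate when the directory already existed before the call.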
