Adapting old code to new realities
This article describes useful techniques for transforming old-style C/C++ code to fully managed C# code. We used these methods to port classic libjpeg and libtiff libraries to .NET.
Introduction
In this article, I describe a method that you can use to transform C/C++ code into C# code with the least amount of effort. The principles laid out here are suitable for other pairs of languages, too. I want to warn you right away that this method is not applicable to porting GUI-related code.
What is this useful for? For example, I have used this method to port libtiff, the well-known TIFF library, to C# (and libjpeg too). This allowed me to reuse the work of the many people who contributed to libtiff in my .NET program. Code examples in the article are mostly from the libtiff and libjpeg libraries.
1. Prerequisites
What you will need:
- Original code that you can build “in one click”
- A set of tests, also runnable “in one click”
- A version control system
- Some basic understanding of refactoring principles
The “one-click” build and test runs requirement is there to speed up the “change — compile — run tests” cycle as much as possible. The more time and effort goes into each such cycle, the fewer times it will be executed. This may lead to massive and complex roll-backs of erroneous changes.
You can use any version control system. I use Git, but you may pick whatever you’re comfortable with. Anything other than a set of folders on the hard disk will do.
Tests are required to make sure that the code still retains all of its features at any given time. Being sure that no functional changes are introduced into the code is what sets my method apart from the “let’s rewrite it from scratch in the new language” approach. Tests are not required to cover 100% of the code, but it’s desirable to have tests for all the key features. The tests shouldn’t access the internals of the code; otherwise, you would have to rewrite them constantly as the code changes.
Here’s what I used to port LibTiff:
- A set of images in TIFF format
- tiffcp, the command-line utility that converts TIFF images between different compression schemes
- A set of batch scripts that use tiffcp for conversion tasks
- A set of reference output images
- A program that performs the binary comparison of output images with the set of reference images
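The comparison program from the last item can be very small. Below is a minimal sketch of such a check in C++; the function name filesAreIdentical is hypothetical and is not taken from libtiff or from the actual test suite used for the port.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: returns true when both files can be opened and
// their contents are byte-for-byte identical.
bool filesAreIdentical(const std::string& actualPath,
                       const std::string& referencePath)
{
    std::ifstream actual(actualPath.c_str(), std::ios::binary);
    std::ifstream reference(referencePath.c_str(), std::ios::binary);
    if (!actual || !reference)
        return false; // a missing file counts as a mismatch

    std::istreambuf_iterator<char> actualIt(actual);
    std::istreambuf_iterator<char> referenceIt(reference);
    std::istreambuf_iterator<char> end;

    // Walk both streams in lockstep and stop at the first difference.
    while (actualIt != end && referenceIt != end)
    {
        if (*actualIt != *referenceIt)
            return false;
        ++actualIt;
        ++referenceIt;
    }
    // Identical only if both streams ended at the same position.
    return actualIt == end && referenceIt == end;
}
```

A test script would call such a function once per output image and report every pair that differs.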
To grasp refactoring concepts, you only need to read one book: Martin Fowler’s Refactoring: Improving the Design of Existing Code. Be sure to read it if you haven’t yet. Every developer should know refactoring principles. You don’t have to read the entire book. The first 130 pages are enough. That is the first five chapters and the beginning of the sixth, up to “Inline Method”.
The better you know the languages that are being used in your source and destination code, the easier the transformation will go. Please note that a deep knowledge of the internals of the original code is not required when you begin. It’s enough to understand what the original code does. A deeper understanding of how it does it will come during the process.
2. Transfer process
The essence of the method is that the original code is simplified through a series of simple and small refactorings. You shouldn’t attempt to change a large chunk of code and try to optimize it all at once. You should progress in small steps, run tests after every change cycle, and save every successful modification. Make a small change — test it. If all is well, save the change in the repository.
There are 3 big stages in the transfer process:
- Replacement of everything in the original code that uses language-specific features with something simpler, but functionally equivalent. This frequently leads to slower and not so neat looking code, but let it not concern you at this stage.
- Modification of the altered code so that it can be compiled in the new language.
- Transformation of the tests and making the functionality of the new code match the code in the source language.
Only after completing these stages, look at the speed and the beauty of the code.
The first stage is the most complex. The goal is to refactor C/C++ code into “pure C++” code with syntax that is as close to C# syntax as possible. This stage means getting rid of:
- preprocessor directives
- goto operators
- typedef operators
- pointer arithmetic
- function pointers
- free (non-member) functions
Let’s go over these steps.
2.1 Removing the unnecessary code
First, we should get rid of the unused code. For instance, with libtiff, I removed the files that were not used to build the Windows version of the library. Then, I found all the conditional compilation directives ignored by the Visual Studio compiler in the remaining files and removed them as well. Here are some examples:
#if defined(__BORLANDC__) || defined(__MINGW32__)
# define XMD_H 1
#endif
#if 0
extern const int jpeg_zigzag_order[];
#endif
Often, the source code contains unused functions. They should be sent off to greener pastures, too.
2.2 Preprocessor and conditional compilation
A common use of conditional compilation is to create specialized versions of the program. Some files use #define as a compiler directive, and other files contain code enclosed in #ifdef and #endif. Example:
/*jconfig.h for Microsoft Visual C++ on Windows 95 or NT. */
.....
#define BMP_SUPPORTED
#define GIF_SUPPORTED
.....
/* wrbmp.c */
....
#ifdef BMP_SUPPORTED
...
#endif /* BMP_SUPPORTED */
I would suggest deciding what to use straight away and getting rid of conditional compilation. For example, should you decide that BMP format support is necessary, remove #ifdef BMP_SUPPORTED (and the matching #endif) from the entire code base.
If you have to keep the possibility of creating several versions of the program, make tests for every version. I suggest keeping the most complete version and working with it. After the transition is complete, you may add conditional compilation directives back in.
But we are not done working with the preprocessor yet. It’s necessary to find preprocessor macros that emulate functions and change them into actual functions.
#define CACHE_STATE(tif, sp) do { \
    BitAcc = sp->data; \
    BitsAvail = sp->bit; \
    EOLcnt = sp->EOLcnt; \
    cp = (unsigned char*) tif->tif_rawcp; \
    ep = cp + tif->tif_rawcc; \
} while (0)
To make a proper signature for a function, it is necessary to find out the types of all the arguments. Please note that BitAcc, BitsAvail, EOLcnt, cp and ep get assigned within the macro. These variables will become arguments of the new function, and they should be passed by reference. So, use uint32& for BitAcc in the function’s signature.
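As an illustration, the CACHE_STATE macro could turn into a function along these lines. This is only a sketch: the TiffStub and Fax3State structs are simplified stand-ins that I made up for the example, and the field types are assumptions rather than the real libtiff declarations.

```cpp
typedef unsigned int uint32;

// Simplified stand-ins (assumed types); only the fields touched by the
// macro are present.
struct Fax3State
{
    uint32 data;
    int bit;
    int EOLcnt;
};

struct TiffStub
{
    unsigned char* tif_rawcp;
    int tif_rawcc;
};

// Function replacement for CACHE_STATE. Every variable the macro used to
// assign is now an explicit by-reference parameter.
void cacheState(TiffStub* tif, Fax3State* sp,
                uint32& BitAcc, int& BitsAvail, int& EOLcnt,
                unsigned char*& cp, unsigned char*& ep)
{
    BitAcc = sp->data;
    BitsAvail = sp->bit;
    EOLcnt = sp->EOLcnt;
    cp = tif->tif_rawcp;
    ep = cp + tif->tif_rawcc;
}
```

The long parameter list is ugly, but it makes the hidden data flow of the macro visible, which is exactly what is needed before moving the code into a class.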
Programmers sometimes abuse the preprocessor. Check out an example of such misuse:
#define HUFF_DECODE(result,state,htbl,failaction,slowlabel) \
{ register int nb, look; \
  if (bits_left < HUFF_LOOKAHEAD) { \
    if (! jpeg_fill_bit_buffer(&state,get_buffer,bits_left, 0)) {failaction;} \
    get_buffer = state.get_buffer; bits_left = state.bits_left; \
    if (bits_left < HUFF_LOOKAHEAD) { \
      nb = 1; goto slowlabel; \
    } \
  } \
  look = PEEK_BITS(HUFF_LOOKAHEAD); \
  if ((nb = htbl->look_nbits[look]) != 0) { \
    DROP_BITS(nb); \
    result = htbl->look_sym[look]; \
  } else { \
    nb = HUFF_LOOKAHEAD+1; \
slowlabel: \
    if ((result=jpeg_huff_decode(&state,get_buffer,bits_left,htbl,nb)) < 0) \
      { failaction; } \
    get_buffer = state.get_buffer; bits_left = state.bits_left; \
  } \
}
In the code above, PEEK_BITS and DROP_BITS are also “functions”, created similarly to HUFF_DECODE. Here, the most reasonable approach is probably to inline the code of the PEEK_BITS and DROP_BITS “functions” into HUFF_DECODE to ease the transformation.
You should go to the next stage of refining the code only when nothing but the most harmless preprocessor directives (like the one below) are left.
#define DATATYPE_VOID 0
2.3 switch and goto operators
You can get rid of goto operators by introducing boolean variables and/or changing the code of a function. For example, if a function has a loop that uses goto to break out of it, then replace that construction with setting a boolean variable, a break statement, and a check of the variable’s value after the loop.
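Here is a small sketch of that replacement. The function sumOrFail is a made-up example, not code from libtiff or libjpeg; the comment at the top shows the goto-based shape it is assumed to have started from.

```cpp
// Assumed original shape, for comparison:
//     for (i = 0; i < count; i++) {
//         if (data[i] < 0) goto bad;
//         sum += data[i];
//     }
//     return sum;
//   bad:
//     return -1;

// Same behavior without goto: a flag records why the loop ended.
int sumOrFail(const int* data, int count)
{
    int sum = 0;
    bool failed = false;
    for (int i = 0; i < count; i++)
    {
        if (data[i] < 0)
        {
            failed = true; // replaces "goto bad"
            break;
        }
        sum += data[i];
    }

    if (failed)
        return -1; // the code that used to live under the label

    return sum;
}
```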
My next step is to scan the code for all the switch statements containing a case without a matching break.
switch ( test1(buf) )
{
    case -1:
        if ( line != buf + (bufsize - 1) )
            continue;
        /* falls through */
    default:
        fputs(buf, out);
        break;
}
C++ allows this, but C# doesn’t: in C#, only an empty case label may fall through to the next one. Replace such switch statements with if blocks, or duplicate the code if a fallthrough case takes up only a couple of lines.
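To make the rewrite concrete, here is a made-up pair of functions (not from libtiff or libjpeg): the first relies on fallthrough, the second produces the same result using if blocks only.

```cpp
#include <string>

// Version that relies on fallthrough (valid C++, rejected by C#).
std::string describeWithFallthrough(int level)
{
    std::string out;
    switch (level)
    {
    case 2:
        out += "verbose ";
        /* falls through */
    case 1:
        out += "normal ";
        /* falls through */
    default:
        out += "quiet";
        break;
    }
    return out;
}

// Equivalent version built from if blocks only.
std::string describeWithIfs(int level)
{
    std::string out;
    if (level == 2)
        out += "verbose ";
    if (level == 2 || level == 1)
        out += "normal ";
    out += "quiet";
    return out;
}
```

Keeping both versions side by side for one test run is a cheap way to check that the rewrite did not change behavior.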
2.4 Time to gather stones
Everything that I described until now is not supposed to take up much time — not compared to what lies ahead. The first massive task that we’re facing is combining of data and functions into classes. What we’re aiming for is making every function a method of a class.
If the code was initially written in C++, it will probably contain few free (non-member) functions. Find a relationship between existing classes and free functions. Usually, it turns out that free functions play an ancillary role for the classes. If only one class uses a function, then move it into that class as a static method. If several classes use a function, then you can create a new class with this function as its static member.
If the code was created in C, there’ll be no classes in it. They’ll have to be created by grouping functions around the data that they manipulate. Fortunately, this logical relationship is quite easy to figure out — especially if the C code uses some OOP principles.
Let’s examine the example below:
struct tiff
{
    char* tif_name;
    int tif_fd;
    int tif_mode;
    uint32 tif_flags;
    ......
};
...
extern int TIFFDefaultDirectory(tiff*);
extern void _TIFFSetDefaultCompressionState(tiff*);
extern int TIFFSetCompressionScheme(tiff*, int);
...
It’s easy to see that the tiff struct begs to become a class, and the three functions declared below it beg to become public methods of this class. So, as a first step, we’re changing struct to class and the three functions to static methods of the class.
As most functions become methods of different classes, it’ll become easier to understand what to do with the remaining non-member functions. Don’t forget that not all the free functions will become public methods. There are usually a few ancillary functions not intended for use from the outside. These functions will become private methods.
After you have changed the free functions to static methods of classes, I suggest getting down to replacing calls to the malloc/free functions with the new/delete operators and adding constructors and destructors. Then you can gradually turn static methods into full-blown class methods. As you convert more and more static methods to non-static ones, it’ll become clear that at least one of their arguments is redundant. This is the pointer to the original struct that has become the class. It may also turn out that some arguments of private methods can become member variables.
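The two stages might look roughly like this. The classes TiffStage1 and TiffStage2 and their members are simplified stand-ins invented for this sketch; the real tiff struct has many more fields.

```cpp
#include <cstring>

// Stage 1: struct -> class, free functions -> static methods that still
// receive the object explicitly.
class TiffStage1
{
public:
    int tif_mode;
    static void SetMode(TiffStage1* tif, int mode) { tif->tif_mode = mode; }
    static int GetMode(const TiffStage1* tif) { return tif->tif_mode; }
};

// Stage 2: constructor/destructor replace malloc/free, and the methods
// become non-static, so the explicit object argument disappears.
class TiffStage2
{
    char* name_;
    int mode_;

    // Copying is disabled because the class owns name_.
    TiffStage2(const TiffStage2&);
    TiffStage2& operator=(const TiffStage2&);

public:
    explicit TiffStage2(const char* name)
        : name_(new char[std::strlen(name) + 1]), mode_(0)
    {
        std::strcpy(name_, name);
    }

    ~TiffStage2() { delete[] name_; }

    void SetMode(int mode) { mode_ = mode; }
    int GetMode() const { return mode_; }
    const char* Name() const { return name_; }
};
```

Going through stage 1 first keeps each change small: call sites only change from SetMode(tif, m) to TiffStage1::SetMode(tif, m), and the tests can run before anything else is touched.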
2.5 Preprocessor again and multiple inheritance
Now that a set of classes has replaced the set of functions and structs, it’s time to get back to the preprocessor, namely to defines like the one below (there should be no other ones remaining by now):
#define STRIP_SIZE_DEFAULT 8192
Turn such defines into constants and find or create an owner class for them. As with functions, the newly created constants may require a special new class (maybe called Constants). And, like functions, the constants may have to be public or private.
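A sketch of such an owner class is shown here. The Constants class follows the name suggested above; the defaultRowsPerStrip function is a hypothetical consumer that I added only to show the constant in use.

```cpp
// Owner class for former #define values.
class Constants
{
public:
    static const int STRIP_SIZE_DEFAULT = 8192;

private:
    Constants(); // never instantiated; the class only groups constants
};

// Hypothetical consumer: how many rows of the given size fit into one
// default-sized strip (always at least one).
int defaultRowsPerStrip(int bytesPerRow)
{
    if (bytesPerRow <= 0)
        return 1;

    int rows = Constants::STRIP_SIZE_DEFAULT / bytesPerRow;
    return rows > 0 ? rows : 1;
}
```

In C#, a class like this maps directly onto a static class with public const fields.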
If the original code was written in C++, it may rely upon multiple inheritance. This is another thing to get rid of before converting code to C#. One way to deal with it is to change the class hierarchy in a way that excludes multiple inheritance. Another way is to make sure that all the base classes of a class that uses multiple inheritance contain only pure virtual methods and no member variables. For example:
class A
{
public:
    virtual bool DoSomething() = 0;
};
class B
{
public:
    virtual bool DoAnother() = 0;
};
class C : public A, public B
{ ... };
You can easily transfer this kind of multiple inheritance to C# by declaring the A and B classes as interfaces.
2.6 typedef operator
Before going over to the next big-scale task (getting rid of pointer arithmetic), we should pay special attention to type synonym declarations (typedef operator). Sometimes these are used as shorthand for proper types. For instance:
typedef vector<command*> Commands;
I prefer to inline such declarations. For this, locate Commands in the code, change every occurrence to vector<command*>, and delete the typedef.
A more interesting case of using typedef is this:
typedef signed char int8;
typedef unsigned char uint8;
typedef short int16;
typedef unsigned short uint16;
typedef int int32;
typedef unsigned int uint32;
Mind the names of the types being created. It’s obvious that typedef short int16 and typedef int int32 are somewhat of a hindrance, so it makes sense to change int16 to short and int32 to int in the code. The other typedefs are quite useful. It’s a good idea, however, to rename them so that they match type names in C#, like so:
typedef signed char sbyte;
typedef unsigned char byte;
typedef unsigned short ushort;
typedef unsigned int uint;
Pay special attention to declarations similar to the following one:
typedef unsigned char JBLOCK[64]; /* one block of coefficients */
This declaration defines JBLOCK as an array of 64 elements of the type unsigned char. I prefer to convert such declarations into classes. In other words, I create a JBLOCK class that serves as a wrapper around the array and implements methods to access its individual elements. This makes it easier to understand how arrays of JBLOCKs (particularly 2- and 3-dimensional ones) are created, used, and destroyed.
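A wrapper of this kind could look like the sketch below. The class name JBlock, the Get/Set method names, and the bounds check are my own choices for illustration, not the interface of any actual port.

```cpp
#include <stdexcept>

// Wrapper around the 64-element array that used to be
// "typedef unsigned char JBLOCK[64]".
class JBlock
{
public:
    static const int Size = 64;

    JBlock()
    {
        for (int i = 0; i < Size; i++)
            data_[i] = 0;
    }

    unsigned char Get(int index) const
    {
        checkIndex(index);
        return data_[index];
    }

    void Set(int index, unsigned char value)
    {
        checkIndex(index);
        data_[index] = value;
    }

private:
    unsigned char data_[64];

    // Out-of-range access fails loudly instead of corrupting memory.
    static void checkIndex(int index)
    {
        if (index < 0 || index >= Size)
            throw std::out_of_range("JBlock index out of range");
    }
};
```

With the wrapper in place, a two-dimensional array of blocks becomes an ordinary array of JBlock objects, which maps to a C# array of class instances without any pointer tricks.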
2.7 Pointer arithmetic
Another large-scale task is getting rid of pointer arithmetic. Many C/C++ programs rely quite heavily on this feature of the language.
E.g.:
void horAcc32(int stride, uint* wp, int wc)
{
    if (wc > stride) {
        wc -= stride;
        do {
            wp[stride] += wp[0];
            wp++;
            wc -= stride;
        } while ((int)wc > 0);
    }
}
Such functions are to be rewritten, since pointer arithmetic is unavailable in C# by default. You may use such arithmetic in unsafe code, but such code has its disadvantages. That’s why I prefer to rewrite such code using “index arithmetic”. It goes like this:
void horAcc32(int stride, uint* wp, int wc)
{
    int wpPos = 0;
    if (wc > stride) {
        wc -= stride;
        do {
            wp[wpPos + stride] += wp[wpPos];
            wpPos++;
            wc -= stride;
        } while ((int)wc > 0);
    }
}
The resulting function does the same job, but uses no pointer arithmetic and can be easily ported to C#. It could also be slower than the original, but again, this is not our priority for now.
Pay special attention to the functions that change pointers passed to them as arguments. Below is an example of such a function:
void horAcc32(int stride, uint* & wp, int wc)
Here, changing wp in function horAcc32 changes the pointer in the calling function as well. Still, introducing an index is a suitable approach here. You just need to define the index in the calling function and pass it to horAcc32 as an argument.
void horAcc32(int stride, uint* wp, int& wpPos, int wc)
It is often convenient to convert int wpPos into a member variable.
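For illustration, here is the index-based signature above with a body filled in. The body is a sketch that simply reuses the loop from the earlier horAcc32 example and spells uint out as unsigned int; it is not the actual libtiff code.

```cpp
// The caller owns wpPos; after the call it reflects how far the function
// advanced, just as the modified pointer did in the uint*& version.
void horAcc32(int stride, unsigned int* wp, int& wpPos, int wc)
{
    if (wc > stride)
    {
        wc -= stride;
        do
        {
            wp[wpPos + stride] += wp[wpPos];
            wpPos++;
            wc -= stride;
        } while (wc > 0);
    }
}
```

The calling function declares int wpPos = 0; next to the buffer and passes both in. In C#, the int& parameter becomes ref int, and the buffer becomes a plain array.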
2.8 Function pointers
After pointer arithmetic is out of the way, it is time to deal with function pointers (if there are any in the code). Function pointers fall into three different types:
- Function pointers are created and used within one class / function
- Function pointers are created and used by different classes in the program
- Function pointers are created by the users and are passed into the program. Here the program is a dynamically or statically linked library.
An example of the first type:
typedef int (*func)(int x, int y);
class Calculator
{
    Calculator() : func(0) {}
    /* the operation is selected through a function pointer */
    int (*func)(int x, int y);
    static int sum(int x, int y) { return x + y; }
    static int mul(int x, int y) { return x * y; }
public:
    static Calculator* CreateSummator()
    {
        Calculator* c = new Calculator();
        c->func = sum;
        return c;
    }
    static Calculator* CreateMultiplicator()
    {
        Calculator* c = new Calculator();
        c->func = mul;
        return c;
    }
    int Calc(int x, int y) { return (*func)(x,y); }
};
The Calc method in the above code will produce different results depending on which one of the CreateSummator and CreateMultiplicator methods was called to create an instance of the class. I prefer to create a private enum in the class that describes all choices for the functionality and the field that keeps a value from the enum. Then, instead of a function pointer, I create a method with a switch operator or several ifs. The created method selects the function based on the value of the field. The changed code:
class Calculator
{
    enum FuncType
    { ftSum, ftMul };
    /* replaces the function pointer */
    FuncType type;
    Calculator() : type(ftSum) {}
    int func(int x, int y)
    {
        if (type == ftSum)
            return sum(x,y);
        return mul(x,y);
    }
    static int sum(int x, int y) { return x + y; }
    static int mul(int x, int y) { return x * y; }
public:
    static Calculator* CreateSummator()
    {
        Calculator* c = new Calculator();
        c->type = ftSum;
        return c;
    }
    static Calculator* CreateMultiplicator()
    {
        Calculator* c = new Calculator();
        c->type = ftMul;
        return c;
    }
    int Calc(int x, int y) { return func(x,y); }
};
You can choose another way: change nothing for the moment and use delegates in the C# version.
Here is an example of the second case:
typedef int (*TIFFVSetMethod)(TIFF*, ttag_t, va_list);
typedef int (*TIFFVGetMethod)(TIFF*, ttag_t, va_list);
typedef void (*TIFFPrintMethod)(TIFF*, FILE*, long);
class TIFFTagMethods
{
public:
    TIFFVSetMethod vsetfield;
    TIFFVGetMethod vgetfield;
    TIFFPrintMethod printdir;
};
This situation is best resolved by turning vsetfield/vgetfield/printdir into virtual methods. Code that has used vsetfield/vgetfield/printdir will have to create a class derived from TIFFTagMethods with the required implementation of the virtual methods.
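The shape of that change can be sketched as follows. To keep the example short, the signatures are deliberately simplified (plain int arguments instead of TIFF*, ttag_t and va_list), and the class names TagMethods and WidthTagMethods are hypothetical.

```cpp
// Base class: the former function-pointer fields are now virtual methods.
// The default behavior is to reject every tag.
class TagMethods
{
public:
    virtual ~TagMethods() {}
    virtual bool SetField(int /*tag*/, int /*value*/) { return false; }
    virtual bool GetField(int /*tag*/, int& /*value*/) const { return false; }
};

// Code that used to assign its own functions to the pointers now derives
// a class and overrides the methods instead.
class WidthTagMethods : public TagMethods
{
    int width_;

public:
    static const int TAG_WIDTH = 256; // numeric value of the ImageWidth tag

    WidthTagMethods() : width_(0) {}

    virtual bool SetField(int tag, int value)
    {
        if (tag != TAG_WIDTH)
            return TagMethods::SetField(tag, value);
        width_ = value;
        return true;
    }

    virtual bool GetField(int tag, int& value) const
    {
        if (tag != TAG_WIDTH)
            return TagMethods::GetField(tag, value);
        value = width_;
        return true;
    }
};
```

Callers keep a TagMethods pointer and never need to know which derived class is behind it, which is the same role the table of function pointers played before.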
Check out an example of the third case:
typedef int (*PROC)(int, int);
int DoUsingMyProc (int, int, PROC lpMyProc, ...);
Delegates are best suited here. At this stage, while the original code is still being polished, nothing else should be done. At a later stage, when the project is transferred into C#, create a delegate instead of PROC. Also, change the DoUsingMyProc function to accept an instance of the delegate as an argument.
2.9 Isolation of the “problem code”
The last change to the original code is the isolation of anything that may be a problem for the new compiler. It may be code that actively uses the standard C/C++ library (functions like fprintf, gets, atof and so on) or WinAPI. In C#, this will have to be changed to use .NET methods or, if need be, the P/Invoke technique. Check out the www.pinvoke.net site in the latter case.
Isolate “problem code” as much as possible. For example, you could create a wrapper class for the functions from C/C++ standard library or WinAPI. Only this wrapper will have to be changed later.
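As a rough sketch, such a wrapper might look like this. The class name Platform and its method names are made up for the example; the point is only that the rest of the code calls the wrapper and never touches atof, fprintf and friends directly.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical wrapper: the only place that calls the C standard library.
// When the code moves to C#, only these method bodies need rewriting.
class Platform
{
public:
    static double ParseDouble(const char* text)
    {
        return std::atof(text);
    }

    static std::string FormatInt(int value)
    {
        char buffer[32];
        std::snprintf(buffer, sizeof(buffer), "%d", value);
        return std::string(buffer);
    }

    static void WriteLine(std::FILE* out, const std::string& text)
    {
        std::fprintf(out, "%s\n", text.c_str());
    }
};
```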
2.10 Changing compiler
This is the moment of truth — the time to bring the changed code into the new project that uses C# compiler. It’s quite trivial, but labor-intensive. You would need to create a new empty project, then add empty classes to it. After that, copy the code from the corresponding original classes.
You’ll have to remove the ballast at this stage (like various #include directives, for instance) and make some cosmetic modifications. “Standard” modifications include:
- combining code from .h and .cpp files
- replacing obj->method() with obj.method()
- replacing Class::StaticMethod with Class.StaticMethod
- removing * in func(A* anInstance)
- replacing func(int& x) with func(ref int x)
Most of the modifications are easy, but you would need to comment out some of the code. Mostly the problem code discussed in part 2.9 will be commented out. The main goal here is to get C# code that compiles. It most probably won’t work, but we’ll come to that in due time.
2.11 Making it all work
After we have made the converted code compile, we need to adjust the code until its functionality matches the original. For that, we need to create a second set of tests that uses the converted code. The methods that were commented out earlier need to be carefully revised and rewritten using .NET. I think this part needs no further explaining. I just want to expand on a few fine points.
When creating strings from byte arrays (and vice versa), carefully select a proper encoding. Avoid Encoding.ASCII because of its 7-bit nature. It means that bytes with values higher than 127 will become ? instead of proper characters. It’s best to use Encoding.Default or Encoding.GetEncoding("Latin1"). The actual choice of encoding depends on what happens next with the text or the bytes. If the text is to be displayed to the user, then Encoding.Default is a better choice, and if the text is to be converted to bytes and saved into a binary file, then Encoding.GetEncoding("Latin1") suits better.
Output of formatted strings (code related to the printf family of functions in C/C++) may present certain problems. The functionality of String.Format in .NET is both poorer and different in syntax. You can solve the issue in two ways:
- Create a class that mimics the functionality of the printf functions
- Change the format strings so that String.Format shows the same result (not always possible).
I prefer the second option. If you choose it too, then a search for c# format specifiers in Google and Appendix B. Format Specifiers from C# in a Nutshell may prove useful.
The conversion is complete when all the tests that use the converted code pass successfully. Now we can return to the fact that the code does not quite conform to the C# ideology (for example, the code is full of get/set methods instead of properties) and refactor the converted code. You may use a profiler to identify bottlenecks in the code and optimize it. But that’s quite a different story.
Happy porting!