Lesson #28

Preprocessor

Overview

In this lesson I will cover the following topics:

  1. The #include directive.

  2. Object-like macros and function-like macros.

  3. The #if directive.

  4. The problem with the preprocessor.

Body

What is the preprocessor?

I have finished talking about all the major features of C. In particular: variables and types, flow of control (loops, if statements, etc.), functions, arrays, pointers, and structures. These are features that you will also find in many (most?) other programming languages. To complete this course, I need to talk about a few additional topics. Some of these topics are very specific to C. Some of them are required by other courses at Vermont State University which have this course as a prerequisite. The final three lessons will just be a collection of largely unrelated things to fill in some important holes.

The first of these lessons will be on the C preprocessor. Although the preprocessor is unique to C (and C++), you can't be an effective C programmer without at least some knowledge of it. I will spend this entire lesson talking about it.

The preprocessor is, conceptually, a program that processes your source code before the compiler sees it. The preprocessor is not the C compiler. Its rules are different from those used by the compiler. Furthermore, you could use the C preprocessor on other files besides C source code. I understand that some assembly language programmers are fond of using the C preprocessor on their programs.

The preprocessor looks for special "preprocessing directives" in your source code. It carries out the commands specified by those directives, editing your source code in the process. The resulting edited file, without the preprocessor directives, is what the compiler actually sees. Keep in mind, however, that although I talk about the preprocessor editing your file, the changes that it makes are not (normally) saved to disk. Instead, the edited version of your program is passed directly to the compiler through memory.

Preprocessing directives all begin with a '#' character and run to the end of the line. They are not terminated with a semicolon, and they do not continue to the next line unless the line ends with a backslash. There are several preprocessing directives, but the most common are #include, #define, and #if.

#include

You have, of course, already seen this directive in action. Here is an example:

#include <stdio.h>

int main( void )
{
    printf( "Hello, World!\n" );
    return 0;
}

Technically, the first line (with the #include) is not legal C. If you were to let a C compiler process that line, you would get an error. However, all C compilers come with a preprocessor, and they all arrange to have that preprocessor execute first. When the preprocessor sees the #include directive it searches for the file named by the directive. It then removes the directive and replaces it with the contents of the specified file. By the time the compiler sees your program the #include is gone.

The preprocessor couldn't care less about what is in the included file. I could split my "Hello, World" program into several files and paste them all together with #includes. For example, if "first.txt" contained

#include <stdio.h>
int main(

and "second.txt" contained

 void ) {

and "third.txt" contained

    printf( "Hello, World!\n" );
    return 0;
}

My hello.c could look like

#include "first.txt"
#include "second.txt"
#include "third.txt"

and it would compile just fine. Notice how first.txt contains a #include of its own. This is fine. The preprocessor scans over the included text looking for other preprocessing directives. It will do this as deeply as necessary.

Just because you can put anything you want into an included file doesn't mean you should. The preprocessor can be abused. Traditionally, included files end with a .h extension and contain only declarations that need to be shared between .c files. You should avoid #including function definitions, files other than .h files, or putting #include directives anywhere other than at the top of your .c file. The preprocessor allows other things, but taking advantage of that will cause your program to be very confusing.

#define

The #define directive allows you to create a symbolic name. There are two sorts of such names. The first is called an "object-like macro" and the second is called a "function-like macro".

Here is what a typical object-like macro looks like:

#define MAX_SIZE 1024

There are three parts. The first part is the #define directive itself. The second part is the symbolic name (MAX_SIZE). The third part is what the symbolic name represents (1024). After seeing this definition, the preprocessor will replace every occurrence of MAX_SIZE that it finds with 1024. In effect, it does a simple search and replace operation on your file. Nothing more. Thus:

int main( void )
{
    int array[MAX_SIZE];   // I really said "array[1024]"
    int i;


    for( i = 0; i < MAX_SIZE; i++ ) { // I really said "i < 1024"
        // etc...

Why would I want to do this? Well, for one thing MAX_SIZE is more informative than 1024. It makes the program easier to read. However, the big advantage comes when I try to update this program. Even in my short example above, I'm using MAX_SIZE in two different places. In a large program, I might use it in dozens of places (often as the limit in a for loop). When I later decide that 1024 isn't big enough, all I have to do is edit the #define directive where MAX_SIZE is defined. I might change it to:

#define MAX_SIZE 2048

and then recompile. The preprocessor will replace all occurrences of MAX_SIZE with the new value and I will have updated the program by making only one change. This is a major advantage. The alternative would be to search for all occurrences of 1024 and try to replace them with 2048. The problem with that is some 1024's might refer to something else and not really need to be replaced. Thus, I'm faced with the prospect of locating every 1024 in the program and then figuring out if that 1024 is one that needs updating. Ugh. But it's even worse than that! There might be some other numbers, such as some 1023s or 512s (MAX_SIZE/2) that also need to be updated. With the preprocessor, this is not a big deal:

#define MAX_LINE_BUFFER_SIZE 1024
#define MAX_COLOR_DEPTH      1024
#define MAX_SCREEN_BUFFER    1024
#define MAX_USER_LIMIT       1024

In the program I use MAX_USER_LIMIT everywhere it is appropriate to do so. In some cases I might have to write things like MAX_USER_LIMIT - 1 (instead of 1023) or MAX_USER_LIMIT/2 (instead of 512). However, if I do this properly, things are easy afterward. Now when I decide that the user limit size needs to be extended, I just do:

#define MAX_LINE_BUFFER_SIZE 1024
#define MAX_COLOR_DEPTH      1024
#define MAX_SCREEN_BUFFER    1024
#define MAX_USER_LIMIT       2048

and recompile. Many programs have a large number of object-like macros contained in a "configuration" header file that gets #included into every source file of the program. Often you can customize the program by editing the values of those macros (and they are typically heavily commented to make this feasible) and recompiling.

It is very important when you define an object-like macro to use it consistently. If you forget to use it in even one place the advantage of having it is lost.

Notice how I use all uppercase letters for my symbolic names. This is not technically required. The preprocessor, like C, is case-sensitive. If you do choose to use lower case names, you must do so consistently. However, the tradition is to use uppercase names, and most style guides require it.

The name you use for an object-like macro must follow the same rules as other names you choose in your C program. In particular, it can only contain letters, digits, and the underscore character. It can't start with a digit. However, the "expansion text" can contain anything. It need not be just a number. For example:

#define  FOREVER  while( 1 )

Now in your program you could write

FOREVER {
    printf( "Hello, World!\n" );
}

to create an infinite loop. The preprocessor will simply replace the word FOREVER with while( 1 ) and give you:

while( 1 ) {
    printf( "Hello, World!\n" );
}

This is legal C and will be accepted by the compiler without a complaint. People sometimes get carried away with this feature. For example if you do:

#define BEGIN  {
#define END    }
#define AND    &&
#define OR     ||

you can write:

if( x == y AND x < z )
  BEGIN
    printf( "x has the right value now!\n" );
  END

It will compile fine. Keep in mind that C programmers take a dim view of this technique. Nevertheless, using macros to simplify tedious typing can be useful. Suppose you needed to write many loops that ran i over the range 0 to MAX_SIZE. You could do this:

#define MAX_SIZE 1024  // For now. Might upgrade later.
#define SCAN_ARRAY for( i = 0; i < MAX_SIZE; i++ )

and then:

SCAN_ARRAY {
    printf( "array[i] = %d\n", array[i] );
}

If you need to write such loops 20 or 30 times you might prefer to type SCAN_ARRAY over typing out the loop header (and risking a typo) each of those times. Notice in this example I'm using an object-like macro in the expansion text of another object-like macro. That is fine. After substituting expansion text for a macro, the preprocessor will rescan the resulting text looking for other macros to expand.

Function-like macros also involve replacing a symbolic name with expansion text. However, unlike object-like macros, they take parameters. Here is how it might look:

#define SQUARE(x) x * x

Here the macro SQUARE takes a parameter that I'm calling x. When it expands the macro, the final expansion text is computed by copying the macro's parameter into every place x appears in the expansion text above. For example:

int i;

printf( "i squared is %d.\n", SQUARE(i) );

would become:

printf( "i squared is %d.\n", i * i );

It is very important when you define a function-like macro that you don't put a space after the macro's name and before the parenthesis. If you do:

#define SQUARE (x) x * x

The preprocessor will think you defined an object-like macro with the expansion text of (x) x * x. In that case:

printf( "i squared is %d.\n", SQUARE(i) );

would become:

printf( "i squared is %d.\n", (x) x * x(i) );

which is a big, bad syntax error. This is one of the problems with using macros. The error messages you get are produced by the compiler after the preprocessor has edited your file. Sometimes they don't make sense when compared with the original source code. The syntax of the original printf, with the SQUARE macro present, looks perfectly fine.

There are other problems with function-like macros. Let me take my SQUARE example. Suppose I try this:

x = y / SQUARE(z - 1);

Here I think I'm dividing y by the square of z - 1. But look again:

x = y / z - 1 * z - 1;

This is what the compiler actually sees (yes... the macro parameter can be any text you want). The precedence rules of C cause this to be handled like so:

x = (y / z) - (1 * z) - 1;

which is a totally different calculation. What's worse is that this code will still compile! To avoid this, you can define the SQUARE macro with lots of extra parentheses:

#define SQUARE(x) ((x) * (x))

Then my expression expands to:

x = y / ((z - 1) * (z - 1));

This works, but the need to do things like this is a hazard. With real functions the matter doesn't come up, so people tend to forget about it when writing function-like macros.

Another hazard has to do with the fact that my SQUARE macro evaluates its argument more than once. Check this out:

x = SQUARE(y++);

Here I think I'm squaring y and putting the result into x and then incrementing y afterwards (post-increment). But look again:

x = ((y++) * (y++));

Actually y is incremented twice. What's worse, the expression modifies y twice with no sequence point in between, so the C standard leaves the behavior undefined: there is no telling what value ends up in x, and the answer may differ from one compiler to the next. Again, real functions don't have this problem and are thus safer to use.

This probably leads you to wonder why bother with function-like macros at all? There are two reasons.

  1. A function-like macro can expand into any text. You can write a function-like macro with expansion text containing keywords. You can't do that with a function.

  2. Calling a function takes some time. For simple calculations, a function-like macro might be faster. Often the difference in performance doesn't matter. Sometimes it matters a great deal. Note that C99, like C++, has inline functions which help address this concern without resorting to function-like macros.

In fact, several "functions" in the C standard library are really function-like macros because of reason #2 above. One that you've used quite a bit in this course is getchar. Although the standard does not require getchar to be a function-like macro, it often is for performance reasons. Notice that getchar seems to violate the tradition of using uppercase letters for macro names. The standard library can get away with that, but ultimately the reason is that getchar has been around for a long time.

#if

The #if directive is called a "conditional compilation directive." You can use it to cause the compiler to ignore certain parts of your program depending on the value of other macros. Here is one way to use it:

for( i = 0; i < MAX_SIZE; i++ ) {
  #if DEBUG == 1
    printf("Processing loop pass #%d\n", i);
  #endif

  ...
}

Suppose that some place higher up in your program you had:

#define DEBUG 0  // Make into 1 to turn on debugging output.

When the preprocessor examines your file it will see that DEBUG == 1 is false. Thus, it will remove the material between the #if and the #endif. As a result, the compiler will never know that you wrote a printf there. However, if you edit the definition of DEBUG so that it reads:

#define DEBUG 1

and recompile your program, the preprocessor will this time include the printf statement. In this way you can leave your debugging printfs in your source code and selectively turn them off and on by changing the definition of a macro and recompiling.

Doing conditional compilation like this is very common. In fact, many programs have many complex #if... #else... #endif blocks that select special code for one operating system or another, or for one processor or another, etc. That way, a single source file can be used on many different systems with appropriate sections conditionally compiled into place. Take a look at /usr/include/curses.h on a Unix-like system for an example that is typical.

Most compilers have a facility that allows you to define a symbol on the command line as if you had written a #define in the source file. For example, with gcc you can do:

$ gcc -DDEBUG -o hello hello.c

The -D option defines the specified symbol, in this case DEBUG, so that it has the expansion text "1". Assuming there were conditional compilation directives in hello.c that were sensitive to the DEBUG symbol, this command would produce an appropriately specialized executable.

I won't say much more about how conditional compilation works right now. However, if you ever find yourself working with a large program—especially one that was written to work on many systems—you will see conditional compilation being used extensively.

The problem with the preprocessor

At this point you might be thinking that the preprocessor is a pretty neat feature. In fact, it can be used in very powerful ways. Most C programs use the preprocessor extensively. However, it has some problems as well, and it is falling out of favor. C++ introduces a number of features specifically to make using the preprocessor less necessary. In the future the preprocessor will probably be dropped out of C++ entirely, and most instructors of C++ encourage people to stay away from the preprocessor if at all possible.

Why is this?

The main problem with the preprocessor is that it has no notion of scope. It does not understand C, and it does not know about nested declarations or variable hiding. Here is a simple example to illustrate:

#define MAX_SIZE 1024

...


void f( void )
{
    int MAX_SIZE = 16;

  ...

The person who wrote function f created a local variable named MAX_SIZE. Because that variable is local, that programmer probably felt that they could name it without any concerns of conflict. The existence of a local variable of that name in another function or even a global variable of that name would not affect function f at all.

Alas, it turns out that there is a #define of MAX_SIZE in effect. Probably that #define is sitting in a library header file; the programmer who wrote function f is not even aware that it exists. If that MAX_SIZE had been a global variable it wouldn't have mattered. But because the preprocessor just does a simple search and replace, function f becomes:

void f( void )
{
    int 1024 = 16;

This is a syntax error. The programmer who wrote f is going to get a strange error message about their declaration of MAX_SIZE. In a large program, this issue is a major problem. Large programs typically have hundreds of #defined names spread over dozens of header files. There is no way a single programmer is going to know them all. Yet the preprocessor is reaching into every function and every block, editing those functions in unexpected ways. It's nice when the preprocessor edits the function so that there is an error at compilation time. That way, at least, the problem can be fixed. Now imagine that the change is such that the program still compiles but produces the wrong result! A bug might exist in code that looks perfectly correct because the preprocessor has modified that code in an unexpected way.

This is very bad.

The compiler knows about variable scope and can let one declaration hide another. Thus, the current trend is to avoid using the preprocessor and use language features instead. This is one reason why I put this lesson at the very end of the course. You need to know about the preprocessor if you are to work with C. The C language requires the preprocessor to do certain things because it lacks the necessary features to do without it. However, you should avoid using it when you can. Other languages don't have a preprocessor at all, and they are not likely to get one.

Summary

  1. You can use the #include directive to merge a file into the file you are compiling. It is most often used to cause the compiler to read over declarations of functions and global entities (such as global variables and structure definitions) that are to be shared between several source files.

  2. You can use #define to create a symbolic name for a constant. Doing this makes it easier to update your program when you want to change the value of the constant. Just change the #define and let the preprocessor change everywhere else that it is used.

    You can also use #define to create a macro with parameters. This allows more flexibility.

  3. The #if directive can be used to cause the compiler to skip over sections of your program depending on the value of other macros. This allows you to specialize your program in different ways so that it will work properly on different systems.

  4. The preprocessor is a powerful tool, but it doesn't really understand C, and it knows nothing about scope. Macros can potentially modify code inside nested blocks where the programmer feels safe from such things. In a large program with many included header files, there are more macros active than a programmer typically knows about. The preprocessor can thus introduce mysterious errors and subtle bugs. For this reason the preprocessor is falling out of favor, and many students of programming are advised to avoid it when possible.

© Copyright 2023 by Peter Chapin.
Last Revised: July 17, 2023