Lesson #28

Preprocessor

Overview

In this lesson I will cover the following topics:

  1. The #include directive.

  2. Object-like macros and function-like macros.

  3. The #if directive.

  4. The problem with the preprocessor.

Body

What is the preprocessor?

I have completed talking about all the major features of C. In particular: variables and types, flow of control (loops, if statements, etc.), functions, arrays, pointers, and structures. These are features that you will also find in many (most?) other programming languages. To complete this course I need to talk about a few additional topics. Some of these topics are very specific to C. Some of them are required by other courses at VTC which hold this course as a prerequisite. The final three lessons will just be a collection of largely unrelated things to fill in some important holes.

The first of these topics will be the C preprocessor. Although the preprocessor is somewhat unique to C, you can't be an effective C programmer without at least some knowledge of it. I will spend this entire lesson talking about it.

The preprocessor is, conceptually, a program that processes your source code before the compiler sees it. The preprocessor is not the C compiler. Its rules are different from those used by the compiler. Furthermore, you could use the C preprocessor on files other than C source code. (I understand that some assembly language programmers are fond of using the C preprocessor on their programs.)

The preprocessor looks for special "preprocessor directives" in your source code. It carries out the commands specified by those directives, editing your source code in the process. The resulting edited file (without the preprocessor directives) is what the compiler actually sees. Keep in mind, however, that although I talk about the preprocessor editing your file, the changes that it makes are not (normally) saved to disk. Instead the edited version of your program is passed directly to the compiler through memory.

Preprocessor directives all begin with a '#' character and run to the end of the line. They are not terminated with a semicolon and they do not (normally) continue to the next line. There are several preprocessor directives, but the most common are #include, #define, and #if.

#include

You have, of course, already seen this directive in action. Here is an example

#include <stdio.h>

int main(void)
{
  printf("Hello, World!\n");
  return 0;
}

Technically the first line (with the #include) is not legal C. If you were to let a C compiler process that line, you would get an error. However, all C compilers come with a preprocessor and they all arrange to have that preprocessor execute first. When the preprocessor sees the #include it searches for the file named by the include. It then removes the include and replaces it with the contents of the specified file. By the time the compiler sees your program the #include is gone.

The preprocessor couldn't care less about what is in the included file. I could split my "Hello, World" program into several files and paste them all together with #includes. For example if "first.txt" contained

#include <stdio.h>
int main(

and "second.txt" contained

void) {

and "third.txt" contained

  printf("Hello, World!\n");
  return 0;
}

My hello.c could look like

#include "first.txt"
#include "second.txt"
#include "third.txt"

and it would compile just fine. Notice how first.txt contains a #include of its own. This is fine. The preprocessor scans over the included text looking for other preprocessing directives. It will do this as deeply as necessary.

Just because you can put anything you want into an included file doesn't mean you should. The preprocessor can be abused. Traditionally included files end with a .h extension and contain only declarations that need to be shared between .c files. You should avoid #including function definitions, files other than .h files, or putting #include directives anywhere other than at the top of your .c file. The preprocessor allows other things, but taking advantage of that will cause your program to be very confusing.

#define

The #define directive allows you to create a symbolic name. There are two sorts of such names. The first is called an "object-like macro" and the second is called a "function-like macro".

Here is what a typical object-like macro looks like

#define MAXSIZE 1024

There are three parts. The first part is the #define directive itself. The second part is the symbolic name (MAXSIZE). The third part is what the symbolic name represents (1024). After seeing this definition, the preprocessor will replace every occurrence of MAXSIZE that it finds with 1024. It does a simple search and replace operation on your file. Nothing more. Thus

int main(void)
{
  int array[MAXSIZE];   // I really said "array[1024]"
  int i;


  for (i = 0; i < MAXSIZE; i++) { // I really said "i < 1024"
    // etc...

Why would I want to do this? Well, for one thing MAXSIZE is more informative than 1024. It makes the program easier to read. However, the big advantage comes when I try to update this program. Even in my short example above, I'm using MAXSIZE in two different places. In a large program I might use it in dozens of places (often as the limit in a for loop). When I later decide that 1024 isn't big enough, all I have to do is edit the #define directive where MAXSIZE is defined. I might change it to

#define MAXSIZE 2048

and then recompile. The preprocessor will replace all occurrences of MAXSIZE with the new value and I will have updated the program by making only one change. This is a major advantage. The alternative would be to search around for all occurrences of 1024 and try to replace them with 2048. The problem with that is some 1024s might refer to something else and not really need to be replaced. Thus I'm faced with the prospect of locating every 1024 in the program and then figuring out if that 1024 is one that needs updating. Ugh. But it's even worse than that! There might be some other numbers, such as some 1023s, that also need to be updated. With the preprocessor, this is not a big deal

#define MAXLINEBUFFERSIZE 1024
#define MAXCOLORDEPTH     1024
#define MAXSCREENBUFFER   1024
#define MAXUSERLIMIT      1024

In the program I use MAXUSERLIMIT everywhere it is appropriate to do so. In some cases I might have to write things like MAXUSERLIMIT - 1 (instead of 1023) or MAXUSERLIMIT/2 (instead of 512). However, if I do this properly things are easy afterward. Now when I decide that the user limit size needs to be extended I just do

#define MAXLINEBUFFERSIZE 1024
#define MAXCOLORDEPTH     1024
#define MAXSCREENBUFFER   1024
#define MAXUSERLIMIT      2048

and recompile. Many programs have a large number of object-like macros contained in a "master" header file that gets #included into every source file of the program. Often you can customize the program by editing the values of those macros (and they are often heavily commented to make this feasible) and recompiling.

It is very important when you define an object-like macro to use it consistently. If you forget to use it in even one place the advantage of having it is lost.

Notice how I use all uppercase letters for my symbolic names. This is not technically required. The preprocessor, like C, is case sensitive. If you do choose to use lower case names, you must do so consistently. However, the tradition is to use uppercase names. In fact the VTC style guide requires it.

The name you use for an object-like macro must follow the same rules as other names you choose in your C program. In particular, it can only contain letters, digits, and the underscore character. It can't start with a digit. However the "expansion text" can contain anything. It need not be just a number. For example

#define   INFINITE_LOOP   while (1)

Now in your program you could write

INFINITE_LOOP {
  printf("Hello, World!\n");
}

to create an infinite loop. The preprocessor will simply replace the word INFINITE_LOOP with while (1) and give you

while (1) {
  printf("Hello, World!\n");
}

This is legal C and will be accepted by the compiler without complaint. People sometimes get carried away with this feature. For example if you do

#define BEGIN  {
#define END    }
#define AND    &&
#define OR     ||

you can write

if (x == y AND x < z)
  BEGIN
    printf("x has the right value now!\n");
  END

and it will compile fine. Keep in mind that C programmers take a dim view of this technique. The usual response is, "If you want to program in Pascal, get a Pascal compiler!" Nevertheless, using macros to simplify tedious typing can be useful. Suppose you needed to write many loops that ran i over the range 0 to MAXSIZE. You could do this

#define MAXSIZE 1024  // For now. Might upgrade later.
#define SCAN_ARRAY for (i = 0; i < MAXSIZE; i++)

and then

SCAN_ARRAY {
  printf("array[i] = %d\n", array[i]);
}

If you need to write such loops 20 or 30 times you might prefer to type SCAN_ARRAY over typing out the loop header (and risking a typo) each of those times. Notice in this example I'm using an object-like macro in the expansion text of another object-like macro. This works fine. After substituting expansion text for a macro, the preprocessor will rescan that text looking for other macros to expand.

Function-like macros also involve replacing a symbolic name with expansion text. However, unlike object-like macros, they take parameters. Here is how it might look.

#define SQUARE(x) x * x

Here the macro SQUARE takes a parameter that I'm calling x. When it expands the macro the expansion text is computed by copying the macro's parameter into every place x appears in the expansion text above. For example

int i;

printf("i squared is %d.\n", SQUARE(i));

would become

printf("i squared is %d.\n", i * i);

It is very important when you define a function-like macro that you don't put a space after the macro's name and before the parenthesis. If you do

#define SQUARE (x) x * x

the preprocessor will think you defined an object-like macro with the expansion text of (x) x * x. In that case

printf("i squared is %d.\n", SQUARE(i));

would become

printf("i squared is %d.\n", (x) x * x(i));

which is a big, bad syntax error. This is one of the problems with using macros. The error messages you get are produced by the compiler after the preprocessor has edited your file. Sometimes they don't make sense when compared with the original source code. The syntax of the original printf, with the SQUARE macro present, looks perfectly fine.

There are other problems with function-like macros. Let me take my SQUARE example. Suppose I try this

x = y / SQUARE(z - 1);

Here I think I'm dividing y by the square of z - 1. But look again

x = y / z - 1 * z - 1;

This is what the compiler actually sees (yes... the macro parameter can be any text you want). The precedence rules of C cause this to be handled like so

x = (y / z) - (1 * z) - 1;

which is a totally different calculation. To avoid this you can define the SQUARE macro with lots of extra parentheses.

#define SQUARE(x) ((x) * (x))

Then my expression expands to

x = y / ((z - 1) * (z - 1));

This works, but the need to do things like this is a hazard. With real functions the matter doesn't come up so people tend to forget about it when writing function-like macros.

Another hazard has to do with the fact that my SQUARE macro evaluates its argument more than once. Check this out

x = SQUARE(y++);

Here I think I'm squaring y and putting the result into x and then incrementing y afterwards (post-increment). But look again

x = ((y++) * (y++));

Actually y is incremented twice. What's worse, the C standard does not define what this expression does at all: y is modified twice with nothing to say which increment happens first, so the behavior is undefined. The value put into x (and even the final value of y) depends on just how the compiler decides to compute the expression. Again real functions don't have this problem and are thus safer to use.

This probably leads you to wonder why bother with function-like macros at all? There are two reasons.

  1. A function-like macro can expand into any text. You can write a function-like macro with expansion text containing keywords. You can't do that with a function.

  2. Calling a function takes some time. For very simple calculations a function-like macro might be faster. Often the difference in performance doesn't matter. Sometimes it matters a great deal.

In fact several "functions" in the C standard library are really function-like macros because of reason #2 above. One that you've used quite a bit in this course is getchar. Although the standard does not require getchar to be a function-like macro it often is for performance reasons. Notice that getchar seems to violate the tradition of using uppercase letters for macro names. The standard library can get away with that.

#if

The #if directive is called a "conditional compilation directive". You can use it to cause the compiler to ignore certain parts of your program depending on the value of other macros. Here is one way to use it.

for (i = 0; i < MAXSIZE; i++) {
  #if DEBUG == 1
    printf("Processing loop pass #%d\n", i);
  #endif

  ...
}

Suppose that some place higher up in your program you had

#define DEBUG 0  // Make into 1 to turn on debugging output.

When the preprocessor examines your file it will see that DEBUG == 1 is false. Thus it will remove the material between the #if and the #endif. As a result the compiler will never know that you wrote a printf there. However, if you edit the definition of DEBUG so that it reads

#define DEBUG 1

and recompile your program the preprocessor will this time include the printf statement. In this way you can leave your debugging printfs in your source code and selectively turn them off and on by changing the definition of a macro and recompiling.

Doing conditional compilation like this is very common. In fact, many programs have many complex #if... #else... #endif blocks that select special code for one operating system or another or for one processor or another. That way a single source file can be used on many different systems with appropriate sections conditionally compiled into place. Take a look at /usr/include/curses.h for an example that is all too typical.

Most compilers have a facility that allows you to define a symbol on the command line as if you had written a #define in the source file. For example, with cc you can do

$ cc -DDEBUG -o hello hello.c

The -D option turns on the specified symbol, in this case DEBUG, so that it has the expansion text of "1". Assuming there were conditional compilation directives in hello.c that were sensitive to the DEBUG symbol this command would produce an appropriately specialized executable.

I won't say much more about how conditional compilation works right now. However, if you ever do find yourself working with a large program—especially one that was written to work on many systems—you will see conditional compilation being used extensively.

The problem with the preprocessor

At this point you might be thinking that the preprocessor is a pretty neat feature. In fact, it can be used in very powerful ways. Most C programs use the preprocessor extensively. However, it has some problems as well, and it is falling out of favor. C++ introduces a number of features specifically to make using the preprocessor less necessary. In the future the preprocessor will probably be dropped out of C++ entirely and most instructors of C++ encourage people to stay away from the preprocessor if at all possible.

Why is this?

The main problem with the preprocessor is that it has no notion of scope. It does not understand C and it does not know about nested declarations or variable hiding. Here is a simple example to illustrate.

#define MAXSIZE 1024

...


void f(void)
{
  int MAXSIZE = 16;

  ...

The person who wrote function f created a local variable named MAXSIZE. Because that variable is local that programmer probably felt that he/she could name it without any concerns of conflict. The existence of a local variable of that name in another function or even a global variable of that name would not affect function f at all.

Alas, it turns out that there is a #define of MAXSIZE in effect. Probably that #define is sitting in a library header file; the programmer who wrote function f is not even aware that it exists. If that MAXSIZE had been a global variable it wouldn't have mattered. But because the preprocessor just does a simple search and replace, function f becomes

void f(void)
{
  int 1024 = 16;

This is a syntax error. The programmer who wrote f is going to get a strange error message about his/her declaration of MAXSIZE. In a large program this issue is a major problem. Large programs typically have hundreds of #defined names spread over dozens of header files. There is no way a single programmer is going to know them all. Yet the preprocessor is reaching into every function and every block editing those functions in unexpected ways. It's nice when the preprocessor edits the function so that there is a compile error. That way, at least, you can fix the problem. Now imagine that the change is such that the program still compiles! A bug might exist in code that looks perfectly correct because the preprocessor has modified that code in an unexpected way.

This is very bad.

The compiler knows about variable scope and can let one declaration hide another. Thus the current trend is to avoid using the preprocessor and use language features instead. This is one reason why I put this lesson at the very end of the course. You need to know about the preprocessor if you are to work with C. The C language requires the preprocessor to do certain things because it lacks the necessary features to do without it. However, you should avoid using it when you can. Other languages don't have a preprocessor at all, and they are not likely to get one.

Summary

  1. You can use the #include directive to merge a file into the file you are compiling. It is most often used to cause the compiler to read over declarations of functions and global entities (such as global variables and structure definitions) that are to be shared between several source files.

  2. You can use #define to create a symbolic name for a constant. Doing this makes it easier to update your program when you want to change the value of the constant. Just change the #define and let the preprocessor change everywhere else that it is used.

    You can also use #define to create a macro with parameters. This allows more flexibility.

  3. The #if directive can be used to cause the compiler to skip over sections of your program depending on the value of other macros. This allows you to specialize your program in different ways so that it will work properly on different systems.

  4. The preprocessor is a powerful tool, but it doesn't really understand C and it knows nothing about scope. Macros can potentially modify code inside nested blocks where the programmer feels safe from such things. In a large program with many included header files, there are more macros active than a programmer typically knows about. The preprocessor can thus introduce mysterious errors and subtle bugs. For this reason the preprocessor is falling out of favor and many students of programming are advised to avoid it when possible.

© Copyright 2003 by Peter C. Chapin.
Last Revised: July 8, 2003