Example #2

Filter for Detecting White Space

In this example I will write a useful filter program. This program, which I will call "white," will look for white space at the end of the lines in a text file. It will also look for blank lines at the end of the file. For the purposes of this program, I will define white space as space characters and tab characters. Such characters look "white" when they are printed on paper.

It can be quite difficult to notice white space at the end of a line. Since, by definition, white space is invisible it is not obvious when it is there unless there is a "normal" character after it to make it stand out. The point of this program is to help locate such extra white space.

In addition, this program will look for blank lines at the end of the file. For similar reasons, such lines are not always easy to notice. Yet they can cause surprises at times so it is nice to know when they are present.

The first step is to design the program. Since this will be a filter, it will be built around a while loop that reads characters from the standard input device one at a time. That loop looks like

  #include <stdio.h>

  int main( void )
  {
      int ch;

      // Read the input one character at a time.
      while( ( ch = getchar( ) ) != EOF ) {
          // Process ch.
      }

      return 0;
  }

The question is: what goes inside that loop? How can I get the effect I want while looking at just one character at a time?

First, I will take a closer look at the problem to make sure I understand exactly what is being asked. Keep in mind that in a text file there is a newline character at the end of each line. Here is an example that shows those '\n' characters explicitly. Assume this file is called test.txt.

  This is the first line of my file.\n
  This is the second line.\n
  This line has trailing white space.     \n
  This is the last line of the file.\n
  \n
  \n
  \n
      EOF

In this example there are four blank lines at the end. The very last line has a few spaces on it (trailing white space) and no '\n' before the end of the file is reached.

When I run white I want it to look something like this.

  $ white < test.txt
  Line 3 contains trailing white space.
  Line 8 contains trailing white space.
  There are 4 blank lines at the end of the input.

It might be nice if the program also told me how many lines were in the file and if the last line has a '\n' character at the end of it or not. Perhaps those are things that can be included in version 2.0.

I'm going to create this program using a technique that I call "incremental enhancement". I've never heard anyone else use that phrase so keep that in mind if you start talking with other programmers. However, the method is common and well known. The idea is this: I will start by writing a very minimal program that doesn't even fulfill the most basic requirements. I just want to get something going that will compile and run. Then I will enhance it just a little bit to make it do something useful and interesting. Once I get that enhancement working, I will make another enhancement in order to get just a bit more working. Each step is small and (hopefully) easy. Eventually, by taking many small steps, I will get a program that is big and complicated.

This method works very well for small to medium sized programs. It does not work well for large programs. The problem with large programs is that if you start down the wrong road you might not realize that for quite a while. By the time you do realize that you made a fundamental mistake in your design, you've already committed a lot of time and resources to the program. Incremental enhancement works best if you can easily "back out" of changes that you later decide were mistakes. That implies the program is not too large. However, for such programs it is often highly effective. It is particularly effective if you aren't quite sure how to do something or what you want done.

Okay... so my first step will be to recognize when a line ends. Hmmm...

#include <stdio.h>

int main(void)
{
  int ch;    // A character from the input device.

  // Read the input one character at a time.
  while ((ch = getchar()) != EOF) {

    // Is this the end of a line?
    if (ch == '\n') {
      printf("I've come to the end of a line!\n");
    }
  }

  return 0;
}

This looks good, except that if the last line in the file does not have a '\n' at the end, this program won't notice it. (You encounter such files more often than you might think.) The issue is this: when the loop ends, was the last character a '\n' or not? If the character just before the EOF was not a '\n' then there is an unterminated line at the end of the file. However, if the character just before the EOF was a '\n' then everything is nice and normal. To deal with this I'll need a flag variable to keep track of when I've seen a '\n'. How about:

#include <stdio.h>

#define NO  0
#define YES 1

int main(void)
{
  int ch;               // A character from the input device.
  int line_start = YES; // =NO after I see a non '\n' character.

  // Read the input one character at a time.
  while ((ch = getchar()) != EOF) {

    // Is this the end of a line?
    if (ch == '\n') {
      printf("I've come to the end of a line!\n");
      line_start = YES;
    }
    else {
      line_start = NO;
    }
  }

  // Was there an unterminated line?
  if (line_start == NO) {
    printf("The last line of input does not have a newline at the end.\n");
  }

  return 0;
}

In this version I've introduced a flag variable named line_start to keep track of where I am on a line. It will be set to 1 (true) if the next character I get is going to be the first character on a line. This occurs if the last character I got was the last character on the previous line (the '\n' character).

Instead of using the values 1 and 0 to mean true and false, I introduced two symbols. This is a common technique and it can make your programs a lot more readable. The lines

#define NO  0
#define YES 1

tell the compiler (really the preprocessor) that everywhere I use the symbol NO it should replace it with 0 and everywhere I use the symbol YES it should replace it with 1. So a statement like

line_start = NO;

is really just

line_start = 0;

However, it looks a lot nicer to say YES and NO when you mean yes and no. Traditionally such preprocessor symbols are given names in all uppercase letters. You should do that too. The VTC style guide requires it.

When I declared line_start I initialized it to YES since the very first character from the file would be the start of the first line. If I come to a '\n' in the file I print my message and set line_start to YES again since my next character would be the first character on the next line. For any other character in the file I set line_start to NO since after reading such a character I would definitely not be at the start of a line anymore. Finally, when the loop ends I check line_start. If it's NO then there must have been some "normal" characters at the end of the input with no '\n' after them.

Remember how I said earlier that it would be nice if the program told the user if the last line was not terminated with a '\n'? Remember how I said that would be something for version 2.0? It turns out that we have to notice such lines anyway so we might as well tell the user about them. Version 2.0 is coming out sooner than expected! This is the sort of thing that happens when you use incremental enhancement. You are never quite sure how it is going to work out when you start.

Okay, now I'm going to tackle the problem of detecting blank lines at the end of the file. I'll worry about the problem of white space at the end of the lines later. I don't want to do too much at once! That would be difficult and I'm trying to avoid difficulty.

What happens if line_start is YES when I see a '\n' character? Hmmm. That means I must have gotten (at least) two '\n' characters in a row. It might also mean that the first character in the file is a '\n' character. In any case it means that there was a blank line above me. In particular, each '\n' character that I see while line_start is YES implies another blank line above me. I want to count those blank lines (so I'll need an integer to hold the count). Then when the loop ends I'll check the count. If it's not zero there were blank lines at the end. So I get:

#include <stdio.h>

#define NO  0
#define YES 1

int main(void)
{
  int ch;                // A character from the input device.
  int line_start  = YES; // =NO after I see a non '\n' character.
  int blank_count = 0;   // The number of blank lines before current point.

  // Read the input one character at a time.
  while ((ch = getchar()) != EOF) {

    // Is this the end of a line?
    if (ch == '\n') {
      printf("I've come to the end of a line!\n");

      // If this '\n' was at the start of a line, I have a blank line.
      if (line_start == YES) {
        blank_count++;
      }
      line_start = YES;
    }

    // If I see an ordinary character (not '\n') then I can't be at the
    // start of a line any more. Further, this line is not blank (empty)
    // so I'll reset blank_count.
    else {
      line_start = NO;
      blank_count = 0;
    }
  }

  // Was there an unterminated line?
  if (line_start == NO) {
    printf("The last line of input does not have a newline at the end.\n");
  }

  // Are there blank lines at the end?
  if (blank_count != 0) {
    printf("There are %d blank lines at the end of the input.\n", blank_count);
  }

  return 0;
}

Notice how I reset blank_count whenever I find a "normal" character. The reason for this is explained in the comments (notice how I enhance the comments while I enhance the program!). If I didn't do this, blank_count would count the total number of blank lines in the entire file. That's not what I want. I just want the total number of blank lines immediately above the current point.

Of course this version assumes that lines containing nothing but spaces are not blank. That isn't right. But hey... I haven't gotten there yet! This version is far enough along, though, so that testing would be a good idea. I could create a file in pico with some blank lines at the end and see how it goes. In fact, it would probably make sense to test one of the earlier versions of this program as well. The whole point of incremental enhancement is to get something that will compile and run as quickly as possible. That makes you feel good and puts you in the mood to work on the next stage. It also makes it easier to find problems.

Hold on one moment while I test the version I have so far...

Okay, it seems to work. I tried four test cases.

  1. The normal case: no blank lines anywhere.

  2. Some blank lines at the end. It gave the correct number of blank lines and was not off by one.

  3. A file with nothing but blank lines. It gave the correct number of blank lines and was not off by one.

  4. A file with a blank line, some normal lines, and a few blank lines at the end. It gave the correct number of blank lines at the end. It did not count the blank lines in the middle.

Cool. I did notice, however, that when there is only one blank line at the end, it says

There are 1 blank lines at the end of the input.

That isn't too grammatical. I'm going to enhance the section that prints that message like this

if (blank_count != 0) {
  if (blank_count == 1) {
    printf("There is 1 blank line at the end of the input.\n");
  }
  else {
    printf("There are %d blank lines at the end of the input.\n", blank_count);
  }
}

This might seem like a waste of time, but attention to such details makes for a much classier program.

Now I'm ready to deal with the question of trailing white space. To handle that I think I'll need another flag variable. I'll call it seen_white. I'll set it to YES when the last character I looked at was white space. Then I can test that variable when I see a '\n' to see if there was white space before the '\n'.

A trickier issue is with lines that are nothing but spaces. Such lines contain characters but they look blank to the user. If a file has several lines of 800 spaces each at the end, the user should be told that the file has blank lines at the end. When I look over what I've got so far, I see that the line_start flag variable could be used to keep track of blank lines. After all, when I created it I was assuming that a line became non-blank as soon as any other character showed up on it. But that isn't really true: a line that holds only white space still looks blank. In this version I will rename that variable to line_blank so that it is easier to understand. I should have done that in the first place, but I didn't appreciate then how this was going to come together. Such is the nature of incremental enhancement.

I believe this version works as desired:

#include <stdio.h>

#define NO  0
#define YES 1

int main(void)
{
  int ch;                // A character from the input device.
  int seen_white  = NO;  // =YES after I see some white space.
  int line_blank  = YES; // =NO after I see a normal character on a line.
  int blank_count = 0;   // The number of blank lines before current point.
  int line_count  = 1;   // The line number of the current line.

  // Read the input one character at a time.
  while ((ch = getchar()) != EOF) {

    // Is this the end of a line?
    if (ch == '\n') {
      if (seen_white == YES) {
        printf("Line %d has trailing white space.\n", line_count);
      }

      // If the line that just ended was blank, increment the blank line count.
      if (line_blank == YES) {
        blank_count++;
      }

      line_blank = YES;
      seen_white = NO;
      line_count++;
    }

    // If I see white space, remember that.
    else if (ch == ' ' || ch == '\t') {
      seen_white = YES;
    }

    // If I see a normal character (not '\n' and not white space) then this
    // is not a blank line.
    else {
      line_blank = NO;
      seen_white = NO;
      blank_count = 0;
    }
  }

  // Was there an unterminated line?
  if (line_blank == NO) {
    printf("The last line of input does not have a newline at the end.\n");
  }

  // Does the unterminated last line end in white space? This covers a last
  // line of only white space and one with normal text followed by white space.
  if (seen_white == YES) {
    printf(
      "Line %d (last line) is unterminated and has trailing white space.\n",
      line_count
    );
  }

  // Are there blank lines at the end?
  if (blank_count != 0) {
    if (blank_count == 1) {
      printf("There is 1 blank line at the end of the input.\n");
    }
    else {
      printf("There are %d blank lines at the end of the input.\n",
        blank_count
      );
    }
  }

  return 0;
}

In this version I modified the comments so that they would be more accurate and clearer. I also introduced a variable named line_count to keep track of the line numbers. This was necessary because I want to print out the line numbers of lines with trailing white space.

The heart of the program is an if... else if... chain inside the main while loop. It basically says

IF <I'm at the end of a line> THEN
  <Print message if necessary and update records appropriately>
ELSE IF <I've got a white space character> THEN
  <Update records appropriately>
ELSE
  <I have a normal character. Update records appropriately>
END

Then when the while loop ends at the end of the input I look at the flag variables to check for various conditions and print appropriate messages.

Now when it comes to testing, I first try out all the same tests as before involving blank (empty) lines. Then there are a few other tests to try.

  1. One line with trailing white space.

  2. One line that contains only spaces.

  3. Blank lines at the end of the file, one of which contains only spaces.

  4. The last normal line of the file is not terminated with a '\n'.

  5. The last line of the file contains only spaces and is not terminated with a '\n'.

All tests pass. The program appears to work as desired. My final version is in white.c.

© Copyright 2016 by Peter C. Chapin.
Last Revised: February 9, 2016