Example #5

Counting Characters in a File

In this example, I will build a program that works with arrays to give you a better idea of how arrays "look and feel" in real programs. This program will be called char_count. It will be a filter program that reads a text file at its input and counts the number of times each character occurs. It then prints out the top 10 most common characters and some information about them.

The centerpiece of this program will be an array of 256 integers. These integers will serve as counters. For example the 43rd element of the array will count the number of times the character with an ASCII code of 43 has occurred in the file. The other array elements will be handled similarly. Because the array has 256 elements, every possible character can be handled including punctuation, control characters, and "extended" characters with codes greater than 127. Here is the p-code of the program.

<Set all counters to zero>

WHILE <There is more in the file to process> LOOP
  <Increment the appropriate counter>
END

FOR <Ten passes> LOOP
  <Search for the largest counter and print information about it>
  <Set the largest counter to zero so that it is no longer the largest>
END

To make this program a bit more interesting, I will implement the three major sections of p-code above as three different functions. The main function will then become just

int main(void)
{
  initialize();
  read_input();
  output_results();
  return 0;
}

This style is fairly typical. It's nice to make main totally clear like this. Hide all the nasty details in functions.

Ideally I would declare the array of counters in main and pass that array to the function read_input for filling and output_results for analysis. However, we won't discuss passing arrays to functions until Lesson #21. So I will work around that by creating the array of counters as a global array so that all of my functions can access it. This design is not perfect, but it will be acceptable for a program of this small size and complexity.

Here is my array declaration

int counters[256];

Here is my initialization function

void initialize(void)
{
  int i;

  for (i = 0; i < 256; i++) {
    counters[i] = 0;
  }
}

In theory since my array is global I don't need to set its elements to zero. Global variables have static duration and static duration variables are set to zero initially by default. Even so I feel that making a function that explicitly handles initialization is a good idea. A future version of this program might need to do other things to properly initialize itself and if an initialization function already exists it will be clear where to add those other things. In fact, there is some sense in creating an explicit clean up function even though in this case there is nothing to clean up.

void clean_up(void)
{
  return;
}

Then main becomes

int main(void)
{
  initialize();
  read_input();
  output_results();
  clean_up();
  return 0;
}

Although the idea of including a function that does nothing may seem silly, placeholders like this can be very nice later on.

The read_input function is similar to what we have done before. Notice how I use ch as an array index to access the appropriate counter.

void read_input(void)
{
  int ch;

  while ((ch = getchar()) != EOF) {
    counters[ch]++;
  }
}

Again it may seem silly to have such a simple function but again it will help when we later try to enhance this program. For example a future version might want to explicitly open and close files rather than rely on I/O redirection for its input. We will talk about how to do that in Lesson #24. Enhancing this program to do explicit file I/O would entail only modifying this one function (for the most part). Simple programs tend to become complicated. It's nice to plan for that.

The output_results function is more interesting. Here I have to sweep over the counters array looking for the largest value. I think I will first write a function that does that part. I will call it get_largest_index.

/*
 * int get_largest_index(void)
 *
 * This function sweeps over the counters array and look for the element
 * that is the largest. It then returns the index into the array for
 * that element. Note that this function returns an array index. It does
 * not return the largest element directly!
 *
 */
int get_largest_index(void)
{
  int i;
  int max = counters[0];  // Guess that the zeroth element is the max.
  int max_index = 0;      // The index of the maximum element.

  // Sweep over the rest of the array.
  for (i = 1; i < 256; i++) {

    // If this counter is bigger than our current max, remember it.
    if (counters[i] > Max) {
      max = counters[i];
      max_index = i;
    }
  }

  return max_index;
}

Now I can use get_largest_index in my output_results function.

void output_results(void)
{
  int i;
  int max_index;

  // Print a nice header.
  printf("Rank  ASCII  Ch  Count\n");
  printf("----  -----  --  -----\n");

  // I want to print the top ten characters.
  for (i = 0; i < 10; i++) {

    // Find the largest and print information about it.
    max_index = get_largest_index();
    printf(
      "%2d    %3d    %c   %d\n",
      i + 1, max_index, max_index, counters[max_index]
    );

    // Erase this one so that the second largest becomes the largest.
    counters[max_index] = -1;
  }
}

Notice how each function is relatively small and simple. This is exactly what you want. No matter how complex the program becomes, you should try to keep every function short and sweet. For example, this program involves a nested for loop. For each character in the top ten, I have to execute a for loop that sweeps over the array looking for the maximum count. Yet because those two for loops are in different functions, that complexity is not evident. The inner loop is hidden inside get_largest_index where it doesn't intimidate.

Notice also how every function can easily access the counters array without explicitly declaring it. That's because the array is global. Once you learn how to pass arrays to functions, you could (and should) modify this program so that the counters array is local in main and is just handed down from one function to the next as necessary.

The final program is in char_count.c.

© Copyright 2003 by Peter C. Chapin.
Last Revised: June 17, 2003