Using the /proc File System
(C) Copyright 2001 by Peter Chapin
==================================

This document describes how to create a module that implements a simple (read-only) file in the
/proc file system under Linux. The techniques described here apply to the 2.2.x series of
kernels. However, you should be aware that the /proc file system is constantly evolving and in
the 2.4.x series of kernels a nice API is provided to modules that facilitates some of the
operations mentioned here.

The Basics
----------

We will arrange things so that when your module is loaded into the kernel, the /proc file will
appear in the /proc file system. When the module is removed from the kernel the /proc file will
vanish.

Here is an init_module() function that illustrates the ideas. To access the functions and types
that support the /proc file system, you need to #include <linux/proc_fs.h> in your module's source.

int init_module( void )
{
  struct proc_dir_entry *return_value;

  // The create_proc_entry() function will create a proc_dir_entry structure, fill it in, and
  // register it with the proc file system. The NULL third parameter means that this entry is to
  // be created in the root of the proc file system.
  //
  if( ( return_value = create_proc_entry( "myproc", S_IFREG | S_IRUGO, NULL ) ) == NULL ) {
    printk( KERN_ERR "myproc NOT loaded. Error encountered\n" );
    return -EAGAIN;
  }

  // Install a pointer to my reader function.
  return_value->read_proc = myproc_read;

  // In kernel 2.4.x you should also do return_value->owner = THIS_MODULE to prevent race
  // conditions between opening/closing the proc file and unloading the module.

  printk( KERN_INFO "myproc loaded\n" );
  return 0;
}

The first parameter to create_proc_entry() is the name of the /proc file you want to create. The
second parameter is the mode for the file. S_IFREG means you want to create a regular file and
S_IRUGO means you want r--r--r-- permissions (Read for User, Group, and Other). The third
parameter points at a proc_dir_entry structure that defines the /proc directory where you want
the file created. A NULL pointer implies the root of the proc file system.
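
To make the mode argument concrete, here is a small user-space sketch of how the two flags combine. The DEMO_ names are hypothetical stand-ins that I made up for illustration; only their octal values are taken from Linux. In a real module you would use the kernel's own S_IFREG and S_IRUGO.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel constants. The octal values are the
   ones Linux uses for S_IFREG, S_IRUSR, S_IRGRP and S_IROTH. */
#define DEMO_S_IFREG 0100000
#define DEMO_S_IRUSR 0000400
#define DEMO_S_IRGRP 0000040
#define DEMO_S_IROTH 0000004
#define DEMO_S_IRUGO ( DEMO_S_IRUSR | DEMO_S_IRGRP | DEMO_S_IROTH )

/* Build the mode word that the sample passes as the second argument. */
unsigned int demo_proc_mode( void )
{
  /* The file type bits and the permission bits must not overlap. */
  assert( ( DEMO_S_IFREG & 0777 ) == 0 );
  return DEMO_S_IFREG | DEMO_S_IRUGO;
}

/* Extract only the permission bits (the low nine bits) from a mode word. */
unsigned int demo_permission_bits( unsigned int mode )
{
  return mode & 0777;
}
```

Calling demo_permission_bits( demo_proc_mode() ) gives octal 444, which is the r--r--r-- pattern described above.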

The create_proc_entry() function returns a pointer to the newly created proc_dir_entry. Each
file in the /proc file system is described by one of these structures. See the definition of
proc_dir_entry in proc_fs.h. Note that the structures are linked together in a left-child,
right-sibling tree that mimics the tree structure of the /proc file system. Information about
all files in the /proc file system is thus stored in memory all the time in this tree.
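
If the left-child, right-sibling arrangement is unfamiliar, the following user-space sketch may help. It is not the kernel's code. The structure demo_entry and the three demo_ functions are hypothetical, and the real proc_dir_entry has many more members. The sketch only shows how a directory points at its first child and how the children of one directory are chained together.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much reduced stand-in for struct proc_dir_entry. */
struct demo_entry {
  const char        *name;
  struct demo_entry *parent;  /* directory that contains this entry            */
  struct demo_entry *subdir;  /* "left child": first entry inside this one     */
  struct demo_entry *next;    /* "right sibling": next entry in same directory */
};

/* Link entry into dir by pushing it onto the front of dir's child list. */
void demo_register( struct demo_entry *dir, struct demo_entry *entry )
{
  assert( dir != NULL && entry != NULL );
  entry->parent = dir;
  entry->next   = dir->subdir;
  dir->subdir   = entry;
}

/* Walk the children of dir looking for name. Returns NULL if not found. */
struct demo_entry *demo_lookup( struct demo_entry *dir, const char *name )
{
  struct demo_entry *p;
  for( p = dir->subdir; p != NULL; p = p->next ) {
    if( strcmp( p->name, name ) == 0 ) return p;
  }
  return NULL;
}

/* Unlink the named entry from dir. Returns 0 on success, -1 if absent. */
int demo_unregister( struct demo_entry *dir, const char *name )
{
  struct demo_entry **pp;
  for( pp = &dir->subdir; *pp != NULL; pp = &(*pp)->next ) {
    if( strcmp( (*pp)->name, name ) == 0 ) {
      *pp = (*pp)->next;
      return 0;
    }
  }
  return -1;
}
```

Registering "myproc" under a root entry and then looking it up by name walks exactly one sibling chain. A file in a subdirectory would be found by repeating the lookup one level down.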

After create_proc_entry() returns you need to install pointers to operation functions in the
structure it gives you. The sample above fills in the read_proc member of that structure with a
pointer to a function "myproc_read". This function is invoked by the kernel whenever a process
attempts to read from the /proc file.

When the module is unloaded the proc file should be removed as follows:

void cleanup_module( void )
{
  // The remove_proc_entry() function locates the named proc entry and unregisters it.
  //
  remove_proc_entry( "myproc", NULL );
  printk( KERN_INFO "myproc unloaded\n" );
}

Here the /proc file to remove is identified by name and by the parent directory in which it is
located. Again the NULL pointer implies the root directory.

The functions create_proc_entry() and remove_proc_entry() are defined in fs/proc/generic.c. The
create function allocates space for a new proc_dir_entry, fills it in, and then calls
proc_register() to link it into the /proc tree. The remove function undoes these effects. The
functions proc_register() and proc_unregister() are both in fs/proc/root.c.

An example of a function that actually supports reading the /proc file is shown below:

static
int myproc_read( char *page, char **start, off_t offset, int count, int *eof, void *data )
{
  int size;

  // It is important to use MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT to bracket any operations in
  // here that might sleep. If you do not do that then it is possible that someone might unload
  // this module while myproc_read is sleeping. When myproc_read then wakes up, the sleeping
  // operation will return to a function that isn't there.

  // Compute the entire page (and note the amount of data produced).
  size = sprintf( page, "Hello, World!\n" );
  return size;
}

The name of this function can be anything. What matters is its address, which is stored in the
read_proc member of the proc_dir_entry structure. I show the function as static here to
minimize the chance of name collisions in the kernel.

The reading function gets several parameters. To best understand the semantics of those
parameters you should inspect the function proc_file_read() in fs/proc/generic.c. That function
is invoked by the kernel to read each /proc file. (The function proc_file_read() is the file
system read operation function for the /proc file system).

To understand how the read function you provide must work, it is best to consider two cases. The
first and easiest case is the case where the entire /proc file fits into a single page of
memory. Note that the page size is at least 4 KBytes on all systems that currently support
Linux. Thus if your /proc file holds only a small amount of data your life is simple. (The actual
critical size is 3 KBytes rather than a full page... see fs/proc/generic.c for the details).

The first parameter to myproc_read() is a pointer to a single page of memory where the function
is to write the data in the /proc file. To support the file, it merely has to write that data
into the page and return the number of bytes written. All the other parameters to myproc_read()
can be ignored. This is what the sample above does.

Note that it is important that myproc_read() produce the same amount of data each time it is
called. Since the user process will probably try to read the file twice (the first time to get
the data and again to get an EOF indication) the function myproc_read() will probably be invoked
twice as well. If it returns a different (say larger) number the second time, the kernel will
assume that there is more in the file and the user process will get the tail end of the second
reading. That data might not be related to the data returned by the first reading, thus
resulting in data with "garbage" at the end as seen by the user process.
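
The arithmetic behind this can be sketched in user space. The function demo_small_file_slice() below is hypothetical. It models only the step in proc_file_read() that decides how much of the generated page lies at or beyond the current file offset, under the assumption that the read function left *start alone.

```c
#include <assert.h>

/* Hypothetical model of the small-file case in proc_file_read().
   generated: bytes the read function says it wrote into the page
   pos:       current file offset
   count:     bytes requested in this pass
   Returns the number of bytes handed to the user (0 means end of file). */
long demo_small_file_slice( long generated, long pos, long count )
{
  long n = generated - pos;   /* data at or beyond the current offset */
  assert( pos >= 0 && count > 0 );
  if( n <= 0 ) return 0;
  if( n > count ) n = count;
  return n;
}
```

If the first pass generates 14 bytes, the file offset moves to 14. A second pass that again generates 14 bytes yields nothing, which the user process sees as EOF. A second pass that generates 20 bytes yields 6 bytes, namely the tail of the second result, and those are the stray bytes the user process would see.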


Handling Large /proc Files
--------------------------

If your /proc file contains more than 3 KBytes of data then you will have to make use of the
other parameters to myproc_read(). When myproc_read() is called the /proc filesystem will
request count bytes from the file starting at offset. The value of count will always be small
enough to fit into a page (<= 3 KBytes). For example, if the user process requests 16 KBytes of
data, the /proc filesystem will invoke myproc_read() several times, using a smaller value of
count and various offsets each time in its attempt to accumulate all the data requested by the
user process. This means that myproc_read() needs to use the values of count and offset to
determine which part of the logical file needs to be produced. That data is always stored in the
given page (at offset zero in the page).

As usual myproc_read() should return the number of bytes generated. In addition, it should also
return in *start the number of bytes generated (since *start is a pointer you will need to use a
cast to do this). The fact that you are modifying *start signals to the /proc file system that
you are not returning the entire logical file and that offset zero on the page provided does not
correspond to offset zero in the file.

It turns out that you can treat your /proc file as a collection of variable length records if
you wish. In this case the offset you are given should be taken as a record offset and the value
you write into *start should be taken as the number of records generated. You should still
return the number of bytes generated from myproc_read(), however.

For example, if you define a record as a line of text, then if your myproc_read() function is
called with an offset of 10 and a count of 200, you should generate lines of text starting with
line 10 in the logical file, producing as many whole lines as fit in 200 bytes (count remains a
byte limit). Then update *start with the number of lines generated and return the number of
bytes generated. In theory this should work even if the sizes of earlier logical lines have
changed since the last call to myproc_read().

Yes this is a disgusting hack, but that's the way it is.
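
A user-space sketch of the record-oriented variant follows. The names demo_line_read() and DEMO_LINE_COUNT are hypothetical and the file contents are invented. The sketch assumes that offset counts lines while count is still a limit in bytes, since the number of bytes returned is what gets copied into the user's buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_LINE_COUNT 1000L   /* hypothetical number of lines in the file */

/* Record-oriented sketch: offset is a line number, count is a byte limit. */
int demo_line_read( char *page, char **start, long offset, int count, int *eof, void *data )
{
  char line[64];
  int  bytes = 0;
  long lines = 0;
  (void)data;   /* unused in this sketch */
  assert( page != NULL && start != NULL && eof != NULL );

  while( offset + lines < DEMO_LINE_COUNT ) {
    int len = sprintf( line, "This is line %ld\n", offset + lines );
    if( bytes + len > count ) break;   /* only whole lines are emitted */
    memcpy( page + bytes, line, (size_t)len );
    bytes += len;
    lines++;
  }
  if( offset + lines >= DEMO_LINE_COUNT ) *eof = 1;

  *start = (char *)(uintptr_t)lines;   /* records, not bytes */
  return bytes;                        /* bytes, not records */
}
```

One weakness of this sketch is that a count smaller than a single line produces zero bytes, which the caller would take for end of file. A real implementation would have to decide what to do in that case.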


Other Parameters
----------------

You will notice that myproc_read() takes two additional parameters that I haven't talked about
yet. The data parameter gives you a way of using the same myproc_read() function for several
different (but similar) /proc files. When you create the /proc file you can install in the data
member of the proc_dir_entry structure a pointer to a structure of your choosing. That pointer
is then passed to myproc_read() as the data parameter. Using it, myproc_read() can look up
information specific to this particular /proc file instance and act accordingly.
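
As an illustration, the user-space sketch below shows one read function producing different output depending on what the data pointer refers to. The structure demo_info and the function demo_shared_read() are hypothetical. In a module you would store the address of such a structure in the data member of each proc_dir_entry when the file is created.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical per-file information hung off the data pointer. */
struct demo_info {
  const char *label;
  int         value;
};

/* One read function serving several files. The data pointer selects the instance. */
int demo_shared_read( char *page, char **start, long offset, int count, int *eof, void *data )
{
  const struct demo_info *info = data;
  (void)start; (void)offset; (void)count; (void)eof;   /* small file: all ignored */
  assert( page != NULL && info != NULL );
  return sprintf( page, "%s: %d\n", info->label, info->value );
}
```

Two files created with two different demo_info structures would each report their own label and value even though they share the same function.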

The eof parameter gives you a way of indicating that the material you generated contains the
end of the logical file. If there is no more material after what you have already provided, you
can store a 1 in *eof to prevent the /proc file system from invoking your myproc_read() function
another time. (This probably isn't much of a savings because the user process will probably
attempt to read another chunk from your /proc file anyway). Everything should work fine even if
you ignore the eof parameter.


Annotated proc_file_read()
--------------------------

The following function in fs/proc/generic.c is used by the kernel to read a /proc file. It is
enlightening to go over exactly how it works. My block comments start at the left margin and
appear below the code they are associated with.

/* 4K page size but our output routines use some slack for overruns */
#define PROC_BLOCK_SIZE (3*1024)

Notice that no matter what the architecture is, the /proc file will only be read in 3 KByte
chunks at the most. (In Kernel 2.4 this is changed to PAGE_SIZE-1024 so that the chunk size is
related to the page size of the current architecture).

static ssize_t
proc_file_read(struct file * file, char * buf, size_t nbytes, loff_t *ppos)
{

This function is a file system operation function. It is called by the VFS and its signature
applies to any file reading operation. The parameter file is a pointer to a file structure used
by the kernel to manage an open file's information. The parameter buf is a pointer into user
space where the data is to go. The parameter nbytes is the number of bytes the user wants (could
be large). The parameter ppos is a pointer to the offset in the file where the user wants to
start reading. (The type loff_t is used to make it easier to support offsets that are larger
than 32 bits).

        struct inode * inode = file->f_dentry->d_inode;
        char    *page;
        ssize_t retval=0;

The variable retval holds the value to return: the number of bytes transferred or an error code.

        int     eof=0;

Used by the /proc file to indicate EOF.

        ssize_t n, count;
        char    *start;
        struct proc_dir_entry * dp;

        dp = (struct proc_dir_entry *) inode->u.generic_ip;

Here a pointer to the proc_dir_entry structure is looked up from the open /proc file's inode.

        if (!(page = (char*) __get_free_page(GFP_KERNEL)))
                return -ENOMEM;

Allocate one page of memory. The call to __get_free_page() might sleep if, for example, some
swapping needs to be done to locate the free memory.

        while ((nbytes > 0) && !eof)

Keep working as long as the user still wants more and the /proc file says there is more. This
loop will also break on various conditions (see below) causing it to end in other ways.

        {
                count = MIN(PROC_BLOCK_SIZE, nbytes);

The amount of data we will ask for at first will be the smaller of the block size (3 KBytes) or
nbytes. Thus if the caller only wants, say, 100 bytes, that is all we will ask for. However if
the caller wants 64 KBytes, we will start by asking for only 3 KBytes.
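
The chunking can be worked out with a short user-space sketch. The names demo_next_count() and demo_pass_count() are hypothetical. The pass count assumes that every call to the read function returns exactly the amount requested, which a real /proc file need not do.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BLOCK_SIZE ( (size_t)3 * 1024 )   /* mirrors PROC_BLOCK_SIZE */

/* Size of the next request made to the read function. */
size_t demo_next_count( size_t nbytes )
{
  return nbytes < DEMO_BLOCK_SIZE ? nbytes : DEMO_BLOCK_SIZE;
}

/* Number of passes needed, assuming every pass returns exactly what was asked for. */
size_t demo_pass_count( size_t nbytes )
{
  size_t passes = 0;
  while( nbytes > 0 ) {
    size_t count = demo_next_count( nbytes );
    assert( count > 0 );   /* each pass must make progress */
    nbytes -= count;
    passes++;
  }
  return passes;
}
```

Under that assumption a 16 KByte request takes six passes: five of 3072 bytes and a final one of 1024 bytes.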

                start = NULL;
                if (dp->get_info) {

The get_info method in proc_dir_entry is an older (apparently deprecated) approach to reading
/proc files. This dates from the day when /proc files could not be written. It also has a
somewhat simpler interface.

                        /*
                         * Handle backwards compatibility with the old net
                         * routines.
                         *
                         * XXX What gives with the file->f_flags & O_ACCMODE
                         * test?  Seems stupid to me....
                         */
                        n = dp->get_info(page, &start, *ppos, count,
                                 (file->f_flags & O_ACCMODE) == O_RDWR);
                        if (n < count)
                                eof = 1;

                } else if (dp->read_proc) {

Here the read_proc method is used if it is defined. The value of dp->read_proc will be NULL
(false) if the /proc file never installed a read function.

                        n = dp->read_proc(page, &start, *ppos,
                                          count, &eof, dp->data);

Read the file and store the number of bytes generated into n. Notice that the /proc file is
given a pointer to eof. It can, if it chooses, write a 1 into that location to signal EOF to
this loop. This prevents this loop from iterating one additional time just to get zero generated
bytes. Thus by using the eof parameter the /proc file is slightly faster. Notice also that the
proc file gets dp->data as the last parameter. By loading dp->data with something unique and
interesting when the /proc file is first created, a single read function can be used to support
many different /proc files.

                } else
                        break;

If neither read operation is defined, break out of the loop and return zero. Personally I would
have checked for this case before entering the loop to simplify the logic here... but I didn't
write this.

                if (!start) {

If start is still NULL after returning from the read, the /proc file did not attempt to update
it. This implies that the entire /proc file contents were returned in the provided page. In
order for that to make any sense, the requested offset must have been small (it will be zero the
first time the /proc file is read, after all). The code below sets start to point at the
requested data in the block that was returned (here *ppos is assumed to be less than 3 KBytes).
It then adjusts n to reflect the amount of the data that was generated and that fits into the
requested range. If there is no such data (the requested range is off the end of the generated
data), the loop ends at once.

                        /*
                         * For proc files that are less than 4k
                         */
                        start = page + *ppos;
                        n -= *ppos;
                        if (n <= 0)
                                break;
                        if (n > count)
                                n = count;
                }
                if (n == 0)
                        break;  /* End of file */

If the read function returned no generated data, we are done with this request. Note that the
user process might attempt to read the file again later (to get an EOF indication).

                if (n < 0) {
                        if (retval == 0)
                                retval = n;
                        break;
                }

If the read function returned an error code we should also return that error code (provided we
haven't yet read anything). If we have read something then we should just return with the count
of what we did read. The user process will probably try to call us again and in that case,
assuming that the error condition persists, we will return an error code.

                /* This is a hack to allow mangling of file pos independent
                 * of actual bytes read.  Simply place the data at page,
                 * return the bytes, and set `start' to the desired offset
                 * as an unsigned int. - Paul.Russell@rustcorp.com.au
                 */
                n -= copy_to_user(buf, start < page ? page : start, n);

The copy_to_user() function copies a block of data to user space (in this case to buf). Here n
bytes are copied. The starting address of the copy is either page (in the case where a piece of
a large /proc file was returned) or start (in the case where the entire /proc file was
returned... start was adjusted above to point at the appropriate section of the data). This
function returns the number of bytes it could not copy, so it returns zero if completely
successful. After the subtraction, n holds the number of bytes transferred.
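
Because the return convention is easy to get backwards, here is a user-space sketch of it. The function demo_copy_to_user() is a hypothetical stand-in and not the kernel routine. Its extra writable parameter is my own device for simulating a user buffer that becomes invalid part way through.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space stand-in for copy_to_user(). It copies at most
   writable bytes and, like the real function, returns the number of bytes
   that were NOT copied. */
unsigned long demo_copy_to_user( char *to, const char *from, unsigned long n,
                                 unsigned long writable )
{
  unsigned long done = n < writable ? n : writable;
  memcpy( to, from, done );
  return n - done;
}

/* Mirrors "n -= copy_to_user(...)": yields the bytes actually transferred. */
long demo_transferred( char *to, const char *from, long n, unsigned long writable )
{
  assert( n >= 0 );
  n -= (long)demo_copy_to_user( to, from, (unsigned long)n, writable );
  return n;
}
```

A result of zero from demo_transferred() corresponds to the case where nothing at all could be copied.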

In the case where just a piece of a large /proc file is returned, the /proc file should have
installed into start the number of records generated. In cases where you want to treat an
individual byte as a record, the value stored in start and the value returned by the read_proc
function should be the same.

                if (n == 0) {
                        if (retval == 0)
                                retval = -EFAULT;
                        break;
                }

                *ppos += start < page ? (long)start : n;
                   /* Move down the file */

Advance the file offset by n bytes or, in the case of a record oriented /proc file, start
records.

                nbytes -= n;
                buf += n;
                retval += n;

Update records. We've satisfied n bytes worth of the user's request so reduce nbytes by that
amount (we will loop back again if nbytes > 0, meaning that the user wants more). We also
advance the user's buffer pointer by n to prepare for the next chunk and we increase retval by n
so that when we eventually return we will return the total number of bytes transferred.

        }
        free_page((unsigned long) page);
        return retval;

Release the page of memory and return the count (or error code).

}
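
To experiment with this logic outside the kernel, the loop can be modelled in user space. The sketch below is my own simplification and not the kernel source. All demo_ names are hypothetical, memcpy() stands in for copy_to_user(), the get_info branch is left out, and the page is a static array. It is enough to observe how often the read function is called with and without the eof flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_BLOCK_SIZE ( 3 * 1024 )

typedef int (*demo_read_fn)( char *page, char **start, long offset,
                             int count, int *eof, void *data );

int demo_calls = 0;   /* counts invocations of the read function */

/* Small-file reader that ignores eof, like the sample myproc_read(). */
int demo_read_plain( char *page, char **start, long offset, int count, int *eof, void *data )
{
  (void)start; (void)offset; (void)count; (void)eof; (void)data;
  demo_calls++;
  return sprintf( page, "Hello, World!\n" );
}

/* Same reader, but it reports end of file through *eof. */
int demo_read_eof( char *page, char **start, long offset, int count, int *eof, void *data )
{
  int n = demo_read_plain( page, start, offset, count, eof, data );
  *eof = 1;
  return n;
}

/* Model of the loop in proc_file_read(). */
long demo_file_read( demo_read_fn reader, char *buf, size_t nbytes, long *ppos )
{
  static char page[4096];
  long retval = 0;
  int  eof = 0;

  assert( reader != NULL && buf != NULL && ppos != NULL );
  while( nbytes > 0 && !eof ) {
    char *start = NULL;
    int   by_record;
    long  count = nbytes < (size_t)DEMO_BLOCK_SIZE ? (long)nbytes : DEMO_BLOCK_SIZE;
    long  n = reader( page, &start, *ppos, (int)count, &eof, NULL );

    if( start == NULL ) {          /* the whole file is in the page */
      n -= *ppos;
      if( n <= 0 ) break;
      if( n > count ) n = count;
      start = page + *ppos;
    }
    if( n <= 0 ) {                 /* end of file or error */
      if( n < 0 && retval == 0 ) retval = n;
      break;
    }
    /* A small value in start is a record count, not an address. */
    by_record = (uintptr_t)start < (uintptr_t)page;
    memcpy( buf, by_record ? page : start, (size_t)n );
    *ppos  += by_record ? (long)(uintptr_t)start : n;
    nbytes -= (size_t)n;
    buf    += n;
    retval += n;
  }
  return retval;
}
```

Reading with demo_read_plain() invokes the read function twice for one request, the second time only to discover that nothing is left. Reading with demo_read_eof() invokes it once. In both cases a further read at the new offset returns zero.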


Writing to /proc Files
----------------------

{Say more about this}


The sysctl Interface
--------------------

In many cases /proc files are used to read and modify various kernel parameters. This is how
Linux allows runtime kernel tuning to be done. Typically the /proc files that are used for this
purpose are located in the /proc/sys directory subtree. Each /proc file in this area generally
returns one number when read. Writing into the file (if supported) modifies the corresponding
kernel parameter.
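
The idea of a one-number file can be sketched in user space as follows. This is not the sysctl API, which I do not show here. The variable demo_tunable and both functions are hypothetical and only illustrate the convention: reading yields the number as text and writing parses text back into the number.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

int demo_tunable = 64;   /* hypothetical kernel parameter */

/* Reading the file yields the current value as one line of text. */
int demo_tunable_read( char *page )
{
  assert( page != NULL );
  return sprintf( page, "%d\n", demo_tunable );
}

/* Writing the file parses a decimal number and stores it.
   Returns 0 on success or -EINVAL if the text does not start with a number. */
int demo_tunable_write( const char *text )
{
  char *end;
  long  value;
  assert( text != NULL );
  value = strtol( text, &end, 10 );
  if( end == text ) return -EINVAL;
  demo_tunable = (int)value;
  return 0;
}
```

Real kernel code would also have to copy the written text in from user space and check the value against sensible limits. Both steps are omitted here.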

To facilitate setting this up the kernel provides a "sysctl" API that a module can use to manage
such files more easily. The sysctl(8) command can be used by administrators to easily interact
with /proc/sys. (A GUI administrative tool might invoke sysctl(8) to get the actual job done).

If the purpose of your /proc file is to provide a kernel tunable parameter (either read-only or
writable), then you should look into the sysctl interface.

{Say more about this}


Creating /proc Directories and Other Things
-------------------------------------------

{Say more about this}


Kernel v2.4.x
-------------

In order to prevent races on loading and unloading the module, you should set the "owner" member
of the proc_dir_entry structure returned by create_proc_entry() to THIS_MODULE. This is
necessary because (unlike the case with a device driver) the module is not informed when the
corresponding /proc file is opened or closed. Thus you could have the following sequence of
events:

1. A process opens the /proc file.
2. The module is unloaded.
3. The process attempts to read from the /proc file.

Clearly race conditions of this nature are unlikely in real life, but they are possible. To
avoid them kernel 2.4.x introduces an owner member in proc_dir_entry. When a module owns a /proc
file it should indicate that in init_module() by using the special symbol THIS_MODULE to
initialize the owner member. The kernel will then refuse to unload a module if one of the /proc
files it is supporting is open.

Kernel v2.4 also provides a more extensive (and convenient) API for creating and using /proc
files. This API exists to a degree in kernel v2.2 but it is underdeveloped there. {Say more
about this}