Using the /proc File System
(C) Copyright 2001 by Peter Chapin
==================================

This document describes how to create a module that implements a simple (read-only) file in the
/proc file system under Linux. The techniques described here apply to the 2.2.x series of
kernels. However, you should be aware that the /proc file system is constantly evolving and in
the 2.4.x series of kernels a nice API is provided to modules that facilitates some of the
operations mentioned here.

The Basics
----------

We will arrange things so that when your module is loaded into the kernel, the /proc file will
appear in the /proc file system. When the module is removed from the kernel the /proc file will
vanish. Here is an init_module() function that illustrates the ideas. To access the functions
and types that support the /proc file system, you need to #include <linux/proc_fs.h> in your
module's source.

    int init_module( void )
    {
        struct proc_dir_entry *return_value;

        // The create_proc_entry() function will create a proc_dir_entry structure, fill it in, and
        // register it with the proc file system. The NULL third parameter means that this entry is to
        // be created in the root of the proc file system.
        //
        if( ( return_value = create_proc_entry( "myproc", S_IFREG | S_IRUGO, NULL ) ) == NULL ) {
            printk( KERN_ERR "myproc NOT loaded. Error encountered\n" );
            return -EAGAIN;
        }

        // Install a pointer to my reader function.
        return_value->read_proc = myproc_read;

        // In kernel 2.4.x you should also do return_value->owner = THIS_MODULE to prevent race
        // conditions between opening/closing the proc file and unloading the module.

        printk( KERN_INFO "myproc loaded\n" );
        return 0;
    }

The first parameter to create_proc_entry() is the name of the /proc file you want to create.
The second parameter is the mode for the file. S_IFREG means you want to create a regular file
and S_IRUGO means you want r--r--r-- permissions (Read for User, Group, and Other).
The third parameter points at a proc_dir_entry structure that defines the /proc directory where
you want the file created. A NULL pointer implies the root of the proc file system.

The create_proc_entry() function returns a pointer to the newly created proc_dir_entry. Each
file in the /proc file system is described by one of these structures. See the definition of
proc_dir_entry in proc_fs.h. Note that the structures are linked together in a left-child,
right-sibling tree that mimics the tree structure of the /proc file system. Information about
all files in the /proc file system is thus stored in memory all the time in this tree.

After create_proc_entry() returns you need to install pointers to operation functions in the
structure it gives you. The sample above fills in the read_proc member of that structure with a
pointer to a function "myproc_read". This function is invoked by the kernel whenever a process
attempts to read from the /proc file.

When the module is unloaded the proc file should be removed as follows:

    void cleanup_module( void )
    {
        // The remove_proc_entry() function locates the named proc entry and unregisters it.
        //
        remove_proc_entry( "myproc", NULL );
        printk( KERN_INFO "myproc unloaded\n" );
    }

Here the /proc file to remove is identified by name and by the parent directory in which it is
located. Again the NULL pointer implies the root directory.

The functions create_proc_entry() and remove_proc_entry() are defined in fs/proc/generic.c. The
create function allocates space for a new proc_dir_entry, fills it in, and then calls
proc_register() to link it into the /proc tree. The remove function undoes these effects. The
functions proc_register() and proc_unregister() are both in fs/proc/root.c.
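The left-child, right-sibling linking described above can be pictured with a small user-space
sketch. This is not the kernel's code: struct entry and the functions entry_register(),
entry_lookup(), and entry_unregister() are hypothetical names invented for illustration, and
only the linkage is modelled. It assumes that a new entry is simply pushed onto the front of
its parent's list of children.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for proc_dir_entry; only the linkage fields are modelled. */
struct entry {
    const char   *name;
    struct entry *parent;  /* directory that contains this entry          */
    struct entry *subdir;  /* "left child": first entry in this directory */
    struct entry *next;    /* "right sibling": next entry in the parent   */
};

/* Link e into dir by pushing it onto the front of dir's list of children. */
void entry_register(struct entry *dir, struct entry *e)
{
    e->parent   = dir;
    e->next     = dir->subdir;
    dir->subdir = e;
}

/* Find a child of dir by name. Returns NULL if there is no such child. */
struct entry *entry_lookup(struct entry *dir, const char *name)
{
    struct entry *e;

    for (e = dir->subdir; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}

/* Unlink the named child from dir. Returns 0 on success, -1 if it was not found. */
int entry_unregister(struct entry *dir, const char *name)
{
    struct entry **link;

    /* Walk the sibling chain keeping a pointer to the link that must be rewritten. */
    for (link = &dir->subdir; *link != NULL; link = &(*link)->next) {
        if (strcmp((*link)->name, name) == 0) {
            struct entry *victim = *link;

            *link          = victim->next;
            victim->next   = NULL;
            victim->parent = NULL;
            return 0;
        }
    }
    return -1;
}
```

Identifying an entry by its parent directory and its name, as remove_proc_entry() does, fits
this layout naturally: the parent gives the head of a sibling chain and the name selects one
node on it.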
An example of a function that actually supports reading the /proc file is shown below:

    static int myproc_read( char *page, char **start, off_t offset, int count, int *eof, void *data )
    {
        int size;

        // It is important to use MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT to bracket any operations in
        // here that might sleep. If you do not do that then it is possible that someone might unload
        // this module while myproc_read is sleeping. When myproc_read then wakes up, the sleeping
        // operation will return to a function that isn't there.

        // Compute the entire page (and note the amount of data produced).
        size = sprintf( page, "Hello, World!\n" );
        return size;
    }

The name of this function can be anything. What matters is its address, which is stored in the
read_proc member of the proc_dir_entry structure. I show the function as static here to
minimize the chance of name collisions in the kernel.

The reading function gets several parameters. To best understand the semantics of those
parameters you should inspect the function proc_file_read() in fs/proc/generic.c. That function
is invoked by the kernel to read each /proc file. (The function proc_file_read() is the file
system read operation function for the /proc file system).

To understand how the read function you provide must work, it is best to consider two cases.
The first and easiest case is the case where the entire /proc file fits into a single page of
memory. Note that the page size is at least 4 KBytes on all systems that currently support
Linux. Thus if your /proc file has 4 KBytes or less of data in it your life is simple. (The
actual critical size is 3 KBytes... see fs/proc/generic.c for the details).

The first parameter to myproc_read() is a pointer to a single page of memory where the function
is to write the data in the /proc file. To support the file, it merely has to write that data
into the page and return the number of bytes written. All the other parameters to myproc_read()
can be ignored.
This is what the sample above does.

Note that it is important that myproc_read() produce the same amount of data each time it is
called. Since the user process will probably try to read the file twice (the first time to get
the data and again to get an EOF indication) the function myproc_read() will probably be
invoked twice as well. If it returns a different (say larger) number the second time, the
kernel will assume that there is more in the file and the user process will get the tail end of
the second reading. That data might not be related to the data returned by the first reading,
thus resulting in data with "garbage" at the end as seen by the user process.

Handling Large /proc Files
--------------------------

If your /proc file contains more than 3 KBytes of data then you will have to make use of the
other parameters to myproc_read(). When myproc_read() is called the /proc filesystem will
request count bytes from the file starting at offset. The value of count will always be small
enough to fit into a page (<= 3 KBytes). For example, if the user process requests 16 KBytes of
data, the /proc filesystem will invoke myproc_read() several times, using a smaller value of
count and various offsets each time in its attempt to accumulate all the data requested by the
user process.

This means that myproc_read() needs to use the values of count and offset to determine which
part of the logical file needs to be produced. That data is always stored in the given page (at
offset zero in the page). As usual myproc_read() should return the number of bytes generated.
In addition, it should also return in *start the number of bytes generated (since *start is a
pointer you will need to use a cast to do this). The fact that you are modifying *start signals
to the /proc file system that you are not returning the entire logical file and that offset
zero on the page provided does not correspond to offset zero in the file.
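The byte-oriented protocol just described can be tried out in user space. The sketch below is
not kernel code and makes several assumptions: bigproc_read(), logical_byte(), drain(),
SKETCH_BLOCK_SIZE, and LOGICAL_SIZE are hypothetical names, the logical file is an invented
10000 byte pattern, and drain() is only a simplified imitation of the kernel-side loop that
expects the reader to set *start on every successful call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLOCK_SIZE (3 * 1024)  /* mirrors the 3 KByte chunk limit         */
#define LOGICAL_SIZE      10000L      /* hypothetical size of the logical file   */

/* Byte i of the hypothetical logical file, computed on demand. */
char logical_byte(long i)
{
    return (char)('a' + i % 26);
}

/* read_proc-style function for a logical file that is larger than one chunk. */
int bigproc_read(char *page, char **start, long offset, int count, int *eof, void *data)
{
    long remaining = LOGICAL_SIZE - offset;
    int  n, i;

    (void)data;                       /* unused in this sketch */
    if (remaining <= 0) {             /* request starts at or past the end */
        *eof = 1;
        return 0;
    }

    /* Produce the requested slice at offset zero of the page. */
    n = remaining < count ? (int)remaining : count;
    for (i = 0; i < n; i++)
        page[i] = logical_byte(offset + i);

    /* Report the number of bytes generated through *start; this needs a cast. */
    *start = (char *)(uintptr_t)n;
    if (offset + n >= LOGICAL_SIZE)
        *eof = 1;
    return n;
}

/* Simplified imitation of the caller: gather up to nbytes starting at *ppos. */
long drain(char *buf, long nbytes, long *ppos)
{
    static char page[4096];
    long total = 0;
    int  eof = 0;

    while (nbytes > 0 && !eof) {
        int   count = nbytes < SKETCH_BLOCK_SIZE ? (int)nbytes : SKETCH_BLOCK_SIZE;
        char *start = NULL;
        int   n = bigproc_read(page, &start, *ppos, count, &eof, NULL);

        if (n <= 0)
            break;
        memcpy(buf, page, (size_t)n);
        *ppos  += (long)(uintptr_t)start;  /* advance by the value left in start */
        buf    += n;
        nbytes -= n;
        total  += n;
    }
    return total;
}
```

Calling drain() with a 20000 byte buffer and a starting position of zero makes it invoke
bigproc_read() four times, with three full 3072 byte chunks followed by the final 784 bytes,
which is the pattern of repeated smaller requests described above.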
It turns out that you can treat your /proc file as a collection of variable length records if
you wish. In this case the offset you are given should be taken as a record offset and the
value you write into *start should be taken as the number of records generated. You should
still return the number of bytes generated from myproc_read(), however.

For example, if you define a record as a line of text, then if your myproc_read() function is
called with an offset of 10 and a count of 20, you should attempt to generate 20 lines of text
starting with line 10 in the logical file. If you can't generate all 20, generate as many as
you can. In either case, update *start with the number of lines generated and return the number
of bytes generated. In theory this should work even if the sizes of earlier logical lines have
changed since the last call to myproc_read().

Yes this is a disgusting hack, but that's the way it is.

Other Parameters
----------------

You will notice that myproc_read() takes two additional parameters that I haven't talked about
yet. The data parameter gives you a way of using the same myproc_read() function for several
different (but similar) /proc files. When you create the /proc file you can install in the data
member of the proc_dir_entry structure a pointer to a structure of your choosing. That pointer
is then passed to myproc_read() as the data parameter. Using it, myproc_read() can look up
information specific to this particular /proc file instance and act accordingly.

The eof parameter gives you a way of indicating that the material you generated contains the
end of the logical file. If there is no more material after what you have already provided, you
can store a 1 in *eof to prevent the /proc file system from invoking your myproc_read()
function another time. (This probably isn't much of a savings because the user process will
probably attempt to read another chunk from your /proc file anyway).
Everything should work fine even if you ignore the eof parameter.

Annotated proc_file_read()
--------------------------

The following function in fs/proc/generic.c is used by the kernel to read a /proc file. It is
enlightening to go over exactly how it works. My block comments start at the left margin and
below the code they are associated with.

    /* 4K page size but our output routines use some slack for overruns */
    #define PROC_BLOCK_SIZE (3*1024)

Notice that no matter what the architecture is, the /proc file will only be read in 3 KByte
chunks at the most. (In Kernel 2.4 this is changed to PAGE_SIZE-1024 so that the chunk size is
related to the page size of the current architecture).

    static ssize_t proc_file_read(struct file * file, char * buf, size_t nbytes, loff_t *ppos)
    {

This function is a file system operation function. It is called by the VFS and its signature
applies to any file reading operation. The parameter file is a pointer to a file structure used
by the kernel to manage an open file's information. The parameter buf is a pointer into user
space where the data is to go. The parameter nbytes is the number of bytes the user wants
(could be large). The parameter ppos is a pointer to the offset in the file where the user
wants to start reading. (The type loff_t is used to make it easier to support offsets that are
larger than 32 bits).

        struct inode * inode = file->f_dentry->d_inode;
        char *page;
        ssize_t retval=0;

Return the number of bytes generated or an error code.

        int eof=0;

Used by the /proc file to indicate EOF

        ssize_t n, count;
        char *start;
        struct proc_dir_entry * dp;

        dp = (struct proc_dir_entry *) inode->u.generic_ip;

Here a pointer to the proc_dir_entry structure is looked up from the open /proc file's inode.

        if (!(page = (char*) __get_free_page(GFP_KERNEL)))
            return -ENOMEM;

Allocate one page of memory. The call to __get_free_page() might sleep if, for example, some
swapping needs to be done to locate the free memory.
        while ((nbytes > 0) && !eof)

Keep working as long as the user still wants more and the /proc file says there is more. This
loop will also break on various conditions (see below) causing it to end in other ways.

        {
            count = MIN(PROC_BLOCK_SIZE, nbytes);

The amount of data we will ask for at first will be the smaller of the block size (3 KBytes) or
nbytes. Thus if the caller only wants, say, 100 bytes, that is all we will ask for. However if
the caller wants 64 KBytes, we will start by asking for only 3 KBytes.

            start = NULL;
            if (dp->get_info) {

The get_info method in proc_dir_entry is an older (apparently deprecated) approach to reading
/proc files. This dates from the day when /proc files could not be written. It also has a
somewhat simpler interface.

                /*
                 * Handle backwards compatibility with the old net
                 * routines.
                 *
                 * XXX What gives with the file->f_flags & O_ACCMODE
                 * test? Seems stupid to me....
                 */
                n = dp->get_info(page, &start, *ppos, count,
                                 (file->f_flags & O_ACCMODE) == O_RDWR);
                if (n < count)
                    eof = 1;
            } else if (dp->read_proc) {

Here the read_proc method is used if it is defined. The value of dp->read_proc will be NULL
(false) if the /proc file never installed a read function.

                n = dp->read_proc(page, &start, *ppos, count, &eof, dp->data);

Read the file and store the number of bytes generated into n. Notice that the /proc file is
given a pointer to eof. It can, if it chooses, write a 1 into that location to signal EOF to
this loop. This prevents this loop from iterating one additional time just to get zero
generated bytes. Thus by using the eof parameter the /proc file is slightly faster. Notice also
that the proc file gets dp->data as the last parameter. By loading dp->data with something
unique and interesting when the /proc file is first created, a single read function can be used
to support many different /proc files.

            } else
                break;

If neither read operation is defined, break out of the loop and return zero.
Personally I would have checked for this case before entering the loop to simplify the logic
here... but I didn't write this.

            if (!start) {

If start is still NULL after returning from the read, the /proc file did not attempt to update
it. This implies that the entire /proc file contents were returned in the provided page. In
order for that to make any sense, the requested offset must have been small (it will be zero
the first time the /proc file is read, after all). The code below sets start to point at the
requested data in the block that was returned (here *ppos is assumed to be less than 3 KBytes).
It then adjusts n to reflect the amount of the data that was generated and that fits into the
requested range. If there is no such data (the requested range is off the end of the generated
data), the loop ends at once.

                /*
                 * For proc files that are less than 4k
                 */
                start = page + *ppos;
                n -= *ppos;
                if (n <= 0)
                    break;
                if (n > count)
                    n = count;
            }
            if (n == 0)
                break; /* End of file */

If the read function returned no generated data, we are done with this request. Note that the
user process might attempt to read the file again later (to get an EOF indication).

            if (n < 0) {
                if (retval == 0)
                    retval = n;
                break;
            }

If the read function returned an error code we should also return that error code (provided we
haven't yet read anything). If we have read something then we should just return with the count
of what we did read. The user process will probably try to call us again and in that case,
assuming that the error condition persists, we will return an error code.

            /* This is a hack to allow mangling of file pos independent
             * of actual bytes read. Simply place the data at page,
             * return the bytes, and set `start' to the desired offset
             * as an unsigned int. - Paul.Russell@rustcorp.com.au
             */
            n -= copy_to_user(buf, start < page ? page : start, n);

The copy_to_user() function copies a block of data to user space (in this case to buf). Here n
bytes are copied.
The starting address of the copy is either page (in the case where a piece of a large /proc
file was returned) or start (in the case where the entire /proc file was returned... start was
adjusted above to point at the appropriate section of the data). The copy_to_user() function
returns the number of bytes it could not copy, so it returns zero if completely successful.
Thus n is left with the number of bytes actually transferred.

In the case where just a piece of a large /proc file is returned, the /proc file should have
installed into start the number of records generated. In cases where you want to treat an
individual byte as a record, the value stored in start and the value returned by the read_proc
function should be the same.

            if (n == 0) {
                if (retval == 0)
                    retval = -EFAULT;
                break;
            }

            *ppos += start < page ? (long)start : n; /* Move down the file */

Advance the file offset by n bytes or, in the case of a record oriented /proc file, start
records.

            nbytes -= n;
            buf += n;
            retval += n;

Update records. We've satisfied n bytes worth of the user's request so reduce nbytes by that
amount (we will loop back again if nbytes > 0, meaning that the user wants more). We also
advance the user's buffer pointer by n to prepare for the next chunk and we increase retval by
n so that when we eventually return we will return the total number of bytes actually
transferred to the user.

        }
        free_page((unsigned long) page);
        return retval;

Release the page of memory and return the count (or error code).

    }

Writing to /proc Files
----------------------

{Say more about this}

The sysctl Interface
--------------------

In many cases /proc files are used to read and modify various kernel parameters. This is how
Linux allows runtime kernel tuning to be done. Typically the /proc files that are used for this
purpose are located in the /proc/sys directory subtree. Each /proc file in this area generally
returns one number when read. Writing into the file (if supported) modifies the corresponding
kernel parameter.
To facilitate setting this up the kernel provides a "sysctl" API that a module can use to
manage such files more easily.

The sysctl(8) command can be used by administrators to easily interact with /proc/sys. (A GUI
administrative tool might invoke sysctl(8) to get the actual job done). If the purpose of your
/proc file is to provide a kernel tunable parameter (either read-only or writable), then you
should look into the sysctl interface.

{Say more about this}

Creating /proc Directories and Other Things
-------------------------------------------

{Say more about this}

Kernel v2.4.x
-------------

In order to prevent races on loading and unloading the module, you should set the "owner"
member of the proc_dir_entry structure returned by create_proc_entry() to THIS_MODULE. This is
necessary because (unlike the case with a device driver) the module is not informed when the
corresponding /proc file is opened or closed. Thus you could have the following sequence of
events:

    1. A process opens the /proc file.
    2. The module is unloaded.
    3. The process attempts to read from the /proc file.

Clearly race conditions of this nature are unlikely in real life, but they are possible. To
avoid them kernel 2.4.x introduces an owner member in proc_dir_entry. When a module owns a
/proc file it should indicate that in init_module() by using the special symbol THIS_MODULE to
initialize the owner member. The kernel will then refuse to unload a module if one of the /proc
files it is supporting is open.

Kernel v2.4 also provides a more extensive (and convenient) API for creating and using /proc
files. This API exists to a degree in kernel v2.2 but it is underdeveloped there.

{Say more about this}