Linux Kernel Programming

linuxCN · Posted 2002-10-23 16:12:58

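Author: Ori Pomerantz
Translator: Xu Hui
August 19, 2000
Translator's Preface
This is my first attempt at translation. I had never worked with Linux before, so the translation is rough; in places I did not understand the material myself and simply translated it literally. It was also done in a hurry, so there are bound to be many errors and poor word choices. I have always been this careless, and I keep embarrassing my organization. :-) If you find a mistake or an unclear explanation, please point it out and send your comments to my mailbox.
My aim in doing this is to get to know Linux experts everywhere. I am Xu Hui (handles: Shuiguang Yueying, Zhenming Tianzi), currently a graduate student at the Peking University Founder Research Institute. My main research areas are information security, data encryption, and Linux security. Because our work is still new at Founder, we hope to meet as many experts in Linux and network security as possible. If you have a project that needs collaborators, a good suggestion, a security-related need, or good reference material, please contact us. I would be very grateful. :-) //bow
The English version of this book can be downloaded from http://metalab.unc.edu/ldp. For the printed edition, see the notes at the end of the book.
Finally, I must state that this translation is entirely a personal undertaking and I represent only myself. This material is for internal exchange only. No organization or individual may use it for commercial purposes without the permission of the author and the translator. If such use is discovered, I reserve the right to pursue legal action.
Translator's email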

[email protected]

August 19, 2000, Yan Yuan, Peking University

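Contents
1. Hello, world
ex hello.c
1.1 Makefiles for Kernel Modules
1.2 Multiple File Kernel Modules
2. Character Device Files
2.1 Multiple Kernel Versions Source Files
3. The /proc File System
4. Using /proc for Input
5. Talking to Device Files (Writes and IOCTLs)
6. Startup Parameters
7. System Calls
8. Blocking Processes
9. Replacing printk's
10. Scheduling Tasks
11. Interrupt Handlers
11.1 Keyboards on the Intel Architecture
12. Symmetrical Multi-Processing
Common Pitfalls
Changes between 2.0 and 2.2
Where to Go from Here
Other
Goods and Services
GNU General Public License
Notes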

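1. Hello, world
When the first caveman programmer chiseled the first program on the wall of the first cave computer, it was a program to paint the string "Hello, world" in antelope pictures. Roman programming textbooks began with the "Salut, Mundi" program. I don't know what happens to people who break with this tradition, and I think it is safer not to find out.
A kernel module has to have at least two functions: init_module, which is called when the module is inserted into the kernel, and cleanup_module, which is called when the module is removed. Typically, init_module either registers a handler for something with the kernel, or replaces one of the kernel's functions with its own code (usually code that does some work and then calls the original function). The cleanup_module function is supposed to undo whatever init_module did, so the module can be unloaded safely.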

ex hello.c
/* hello.c
* Copyright (C) 1998 by Ori Pomerantz
*
* "Hello, world" - the kernel module version.
*/

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* Initialize the module */
int init_module()
{
printk("Hello, world - this is the kernel speaking\n");

/* If we return a non zero value, it means that
  * init_module failed and the kernel module
  * can't be loaded */
return 0;
}

/* Cleanup - undo whatever init_module did */
void cleanup_module()
{
printk("Short is the life of a kernel module\n");
}

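1.1 Makefiles for Kernel Modules
A kernel module is not an independent executable, but an object file which is linked into the kernel at run time. As a result, kernel modules must be compiled with the -c flag. In addition, every kernel module has to be compiled with certain symbols defined:
- __KERNEL__: tells the header files that this code will run in kernel mode, not as part of a user process.
- MODULE: tells the header files to give the appropriate definitions for a kernel module.
- LINUX: technically speaking, this one is not necessary. However, if you ever want to write a serious kernel module that compiles on more than one operating system, you will be glad you defined it. It lets you use conditional compilation for the parts which are OS dependent.
There are other symbols which have to be defined or not, depending on the options the kernel was compiled with. If you are not sure how the kernel was compiled, look it up in /usr/include/linux/config.h.
- __SMP__: symmetric multiprocessing. This has to be defined if the kernel was compiled to support symmetric multiprocessing (even if it is running on a single CPU). If you use SMP, there are other things you need to do (see Chapter 12).
- CONFIG_MODVERSIONS: if CONFIG_MODVERSIONS was enabled, you need to define it when compiling the kernel module and to include /usr/include/linux/modversions.h. This can also be done by the code itself.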

ex Makefile

# Makefile for a basic kernel module

CC=gcc
MODCFLAGS := -Wall -DMODULE -D__KERNEL__ -DLINUX

hello.o: hello.c /usr/include/linux/version.h
	$(CC) $(MODCFLAGS) -c hello.c
	echo insmod hello.o to turn it on
	echo rmmod hello to turn it off
	echo
	echo X and kernel programming do not mix.
	echo Do the insmod and rmmod from outside

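So, now the only thing left is to su to root (you didn't compile this as root, did you? Living on the edge (note 1.1)...), and then insmod and rmmod hello to your heart's content. While you do it, notice your new kernel module in /proc/modules.
By the way, the reason the Makefile recommends against doing insmod from X is that when the kernel has a message to print with printk, it sends it to the console. When you don't use X, it goes to the virtual terminal you are using (the one you chose with Alt-F<n>) and you see it. When you do use X, on the other hand, there are two possibilities. Either you have a console open with xterm -C, in which case the output is sent there, or you don't, in which case the output goes to virtual terminal 7, the one "covered" by X.
If your kernel becomes unstable, you are likelier to get the debug messages without X. Outside of X, printk goes directly from the kernel to the console. In X, on the other hand, printk output goes to a user mode process (xterm -C). When that process receives CPU time, it is supposed to send the output to the X server process. Then, when the X server receives CPU time, it is supposed to display it. But an unstable kernel usually means that the system is about to crash or reboot, so you don't want the error messages, which might explain what went wrong, to be delayed any longer than necessary.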
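1.2 Multiple File Kernel Modules
Sometimes it makes sense to divide a kernel module between several source files. In this case, you need to do the following:
1. In all the source files but one, add the line #define __NO_VERSION__. This is important because module.h normally includes the definition of kernel_version, a global variable holding the kernel version the module is compiled for. If you need version.h, you have to include it yourself, because module.h won't do it for you when __NO_VERSION__ is defined.
2. Compile all the source files as usual.
3. Combine all the object files into a single one. Under x86, do it with ld -m elf_i386 -r -o <name of module>.o <1st source file>
Here is an example of such a kernel module.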
ex start.c

/* start.c
* Copyright (C) 1999 by Ori Pomerantz
*
* "Hello, world" - the kernel module version.
* This file includes just the start routine
*/

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* Initialize the module */
int init_module()
{
printk("Hello, world - this is the kernel speaking\n");

/* If we return a non zero value, it means that
  * init_module failed and the kernel module
  * can't be loaded */
return 0;
}
ex stop.c

/* stop.c
* Copyright (C) 1999 by Ori Pomerantz
*
* "Hello, world" - the kernel module version. This
* file includes just the stop routine.
*/

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */

#define __NO_VERSION__    /* This isn't "the" file
                        * of the kernel module */
#include <linux/module.h> /* Specifically, a module */

#include <linux/version.h> /* Not included by
                           * module.h because
                           * of the __NO_VERSION__ */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* Cleanup - undo whatever init_module did */
void cleanup_module()
{
printk("Short is the life of a kernel module\n");
}
ex Makefile

# Makefile for a multifile kernel module

CC=gcc
MODCFLAGS := -Wall -DMODULE -D__KERNEL__ -DLINUX

hello.o: start.o stop.o
	ld -m elf_i386 -r -o hello.o start.o stop.o

start.o: start.c /usr/include/linux/version.h
	$(CC) $(MODCFLAGS) -c start.c

stop.o: stop.c /usr/include/linux/version.h
	$(CC) $(MODCFLAGS) -c stop.c

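2. Character Device Files
So, now we're bold kernel programmers and we know how to write kernel modules which do nothing. We feel proud of ourselves and we hold our heads up high. But somehow we get the feeling that something is missing. Catatonic modules aren't much fun.
There are two major ways for a kernel module to talk to processes. One is through device files (like the files in the /dev directory), the other is to use the proc file system. Since one of the major reasons to write something in the kernel is to support some kind of hardware device, we'll begin with device files.
The original purpose of device files is to allow processes to communicate with device drivers in the kernel, and through them with physical devices (modems, terminals, and so on). This is implemented as follows.
Each device driver, which is responsible for some type of hardware, is assigned its own major number. The list of drivers and their major numbers is available in /proc/devices. Each physical device managed by a device driver is assigned a minor number. The /dev directory is supposed to contain a special file, called a device file, for each of those devices, whether or not it is actually installed on the system.
For example, if you run ls -l /dev/hd[ab]*, you'll see all of the IDE hard disk partitions which might be connected to a machine. Notice that all of them use the same major number, 3, but the minor number changes from one to the other. (Disclaimer: this is the case on the PC architecture. I don't know about Linux running on other architectures.)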
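As a rough illustration of how a major and a minor number share one device number, the sketch below splits a 16-bit value the same way the chardev.c listing later in this chapter does (high byte for the major number, low byte for the minor number). The struct and function names are invented for this example, and the 8-bit/8-bit layout is an assumption that matches the 2.0/2.2-era kernels discussed here.

```c
/* Hypothetical helper type: not part of any kernel header. */
struct dev_pair {
    unsigned int major_num;   /* identifies the driver */
    unsigned int minor_num;   /* identifies the device under that driver */
};

/* Split a 16-bit device number into its two halves, assuming the
 * old layout of 8 bits for the major and 8 bits for the minor. */
struct dev_pair split_dev(unsigned short rdev)
{
    struct dev_pair p;

    p.major_num = rdev >> 8;     /* same as inode->i_rdev >> 8 */
    p.minor_num = rdev & 0xFF;   /* same as inode->i_rdev & 0xFF */
    return p;
}
```

Under this layout, split_dev(0x0301) gives major 3 and minor 1, which is the pair traditionally carried by the first partition on /dev/hda.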
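When the system was installed, all of those device files were created by the mknod command. There is no technical reason why they have to be in the /dev directory; it is just a useful convention. When creating a device file for testing purposes, as with the exercise here, it probably makes more sense to place it in the directory where you compile the kernel module.
Devices are divided into two types: character devices and block devices. The difference is that block devices have a buffer for requests, so they can choose the order in which to respond to them. This is important for storage devices, where reading sectors which are close to each other is much faster than reading sectors which are far apart. Another difference is that block devices can only accept input and return output in blocks (whose size can vary from device to device), whereas character devices may use as many or as few bytes as they like. Most devices are character devices, because they don't need this type of buffering. You can tell whether a device file is for a block device or a character device by looking at the first character in the output of ls -l. If it is b, it is a block device, and if it is c, it is a character device.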
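This module is divided into two separate parts: the module part, which registers the device, and the device driver part. The init_module function calls module_register_chrdev to add the device driver to the kernel's character device driver table. It also returns the major number to be used for the driver. The cleanup_module function unregisters the device.
This (registering something and unregistering it) is the general functionality of those two functions. Things in the kernel don't run on their own initiative, like processes, but are called: by processes via system calls, by hardware devices via interrupts, or by other parts of the kernel (simply by calling specific functions). As a result, when you add code to the kernel, you are supposed to register it as the handler for a certain type of event, and when you remove it, you are supposed to unregister it.
The device driver proper is composed of the four device_<action> functions, which are called when somebody tries to do something with a device file which has our major number. The way the kernel knows to call them is via the file_operations structure, Fops, which is supplied when the device is registered and which holds pointers to those four functions.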
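The idea of a table of function pointers that the kernel consults can be modelled in plain user-space C. The sketch below is a toy with invented names (toy_file_ops, call_read, call_write); it is not the real struct file_operations, which has many more slots. Returning -EINVAL for an empty slot is this sketch's own choice, echoing what the device_write stub in the listing below returns.

```c
#include <errno.h>
#include <stddef.h>

/* Toy operations table: one slot per supported action. */
struct toy_file_ops {
    int (*read)(char *buf, int len);
    int (*write)(const char *buf, int len);
};

/* A read handler that simply fills the buffer with 'x'. */
static int toy_read(char *buf, int len)
{
    int i;

    for (i = 0; i < len; i++)
        buf[i] = 'x';
    return len;
}

/* Only read is implemented; the write slot stays NULL. */
static const struct toy_file_ops toy_fops = { toy_read, NULL };

/* What the caller of the table does: look the handler up and
 * call it, or fail if the driver left the slot empty. */
int call_read(const struct toy_file_ops *ops, char *buf, int len)
{
    if (ops->read == NULL)
        return -EINVAL;
    return ops->read(buf, len);
}

int call_write(const struct toy_file_ops *ops, const char *buf, int len)
{
    if (ops->write == NULL)
        return -EINVAL;
    return ops->write(buf, len);
}
```

Keeping the handlers behind a table means the caller never needs to know which driver it is talking to; it only needs the pointer that was handed over at registration time.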
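Another point we need to remember is that we can't allow the kernel module to be removed whenever the administrator feels like it. The reason is that if the device file is opened by a process and we then remove the kernel module, using the file would cause a call to the memory location where the appropriate function (read/write) used to be. If we are lucky, no other code was loaded there, and we get an ugly error message. If we are unlucky, another kernel module was loaded into the same location, which means a jump into the middle of another function within the kernel. The results would be impossible to predict, and they would not be good.
Normally, when you don't want to allow something, you return an error code (a negative number) from the function which is supposed to do it. With cleanup_module that is impossible, because it is a void function. Once cleanup_module is called, the module is dead. However, there is a counter which records how many other kernel modules are using this module, called the reference count (it is the last number on each line of /proc/modules). If this number isn't zero, the removal fails. The module's reference count is kept in the variable mod_use_count_. Since there are macros defined for handling this variable (MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT), we prefer to use them rather than touching mod_use_count_ directly, so we stay safe if the implementation changes in the future.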
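The reference-count rule can be pictured with a small user-space model. Everything here (toy_module, toy_open, toy_release, toy_rmmod) is hypothetical and only mimics the behaviour described above; the real bookkeeping lives in the kernel, and returning -EBUSY is this sketch's way of representing a refused removal.

```c
#include <errno.h>

/* Toy stand-in for a loaded module; not a kernel structure. */
struct toy_module {
    int use_count;   /* plays the role of mod_use_count_ */
    int loaded;      /* 1 while the module is present */
};

/* Opening the device file takes a reference (MOD_INC_USE_COUNT). */
void toy_open(struct toy_module *m)
{
    m->use_count++;
}

/* Closing the device file drops it again (MOD_DEC_USE_COUNT). */
void toy_release(struct toy_module *m)
{
    m->use_count--;
}

/* Removal is refused while anybody still holds a reference. */
int toy_rmmod(struct toy_module *m)
{
    if (m->use_count != 0)
        return -EBUSY;
    m->loaded = 0;
    return 0;
}
```

In this model a removal attempted between toy_open and toy_release fails, and the same attempt after toy_release succeeds, which is the guarantee the two macros are there to provide.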

ex chardev.c

/* chardev.c
* Copyright (C) 1998-1999 by Ori Pomerantz
*
* Create a character device (read only)
*/

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* For character devices */
#include <linux/fs.h>    /* The character device
                        * definitions are here */
#include <linux/wrapper.h>  /* A wrapper which does
                        * next to nothing at
                        * at present, but may
                        * help for compatibility
                        * with future versions
                        * of Linux */

/* In 2.2.3 /usr/include/linux/version.h includes
* a macro for this, but 2.0.35 doesn't - so I add
* it here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

/* Conditional compilation. LINUX_VERSION_CODE is
* the code (as per KERNEL_VERSION) of this version. */
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>  /* for put_user */
#endif

#define SUCCESS 0

/* Device Declarations **************************** */

/* The name for our device, as it will appear
* in /proc/devices */
#define DEVICE_NAME "char_dev"

/* The maximum length of the message from the device */
#define BUF_LEN 80

/* Is the device open right now? Used to prevent
* concurrent access into the same device */
static int Device_Open = 0;

/* The message the device will give when asked */
static char Message[BUF_LEN];

/* How far did the process reading the message
* get? Useful if the message is larger than the size
* of the buffer we get to fill in device_read. */
static char *Message_Ptr;

/* This function is called whenever a process
* attempts to open the device file */
static int device_open(struct inode *inode,
   struct file *file)
{
static int counter = 0;

#ifdef DEBUG
printk ("device_open(%p,%p)\n", inode, file);
#endif

/* This is how you get the minor device number in
  * case you have more than one physical device using
  * the driver. */
printk("Device: %d.%d\n",
inode->i_rdev >> 8, inode->i_rdev & 0xFF);

/* We don't want to talk to two processes at the
  * same time */
if (Device_Open)
return -EBUSY;

/* If this was a process, we would have had to
 * be more careful here.
 *
 * In the case of processes, the danger would be
 * that one process might have checked Device_Open
 * and then be replaced by the scheduler by another
 * process which runs this function. Then, when
 * the first process was back on the CPU, it would assume
 * the device is still not open.
 * However, Linux guarantees that a process won't
 * be replaced while it is running in kernel context.
 *
 * In the case of SMP, one CPU might increment
 * Device_Open while another CPU is here, right after the check.
 * However, in version 2.0 of the kernel this is not a problem
 * because there's a lock to guarantee only one CPU will
 * be running kernel code at the same time.
 * This is bad in terms of performance, so version 2.2 changed it.
 * Unfortunately, I don't have access to an SMP box
 * to check how it works with SMP.
 */

Device_Open++;

/* Initialize the message. */
sprintf(Message,
"If I told you once, I told you %d times - %s",
counter++,
"Hello, world\n");
/* The only reason we're allowed to do this sprintf
  * is because the maximum length of the message
  * (assuming 32 bit integers - up to 10 digits
  * with the minus sign) is less than BUF_LEN, which
  * is 80. BE CAREFUL NOT TO OVERFLOW BUFFERS,
  * ESPECIALLY IN THE KERNEL!!!
  */

Message_Ptr = Message;

/* Make sure that the module isn't removed while
  * the file is open by incrementing the usage count
  * (the number of opened references to the module, if
  * it's not zero rmmod will fail)
  */
MOD_INC_USE_COUNT;

return SUCCESS;
}

/* This function is called when a process closes the
* device file. It doesn't have a return value in
* version 2.0.x because it can't fail (you must ALWAYS
* be able to close a device). In version 2.2.x it is
* allowed to fail - but we won't let it.
*/
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static int device_release(struct inode *inode,
struct file *file)
#else
static void device_release(struct inode *inode,
  struct file *file)
#endif
{
#ifdef DEBUG
printk ("device_release(%p,%p)\n", inode, file);
#endif

/* We're now ready for our next caller */
Device_Open --;

/* Decrement the usage count, otherwise once you
  * opened the file you'll never get rid of the module.
  */
MOD_DEC_USE_COUNT;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
return 0;
#endif
}

/* This function is called whenever a process which
* have already opened the device file attempts to
* read from it. */

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t device_read(struct file *file,
char *buffer, /* The buffer to fill with data */
size_t length, /* The length of the buffer */
loff_t *offset)  /* Our offset in the file */
#else
static int device_read(struct inode *inode,
                  struct file *file,
char *buffer, /* The buffer to fill with
* the data */
int length)    /* The length of the buffer
                  * (mustn't write beyond that!) */
#endif
{
/* Number of bytes actually written to the buffer */
int bytes_read = 0;

/* If we're at the end of the message, return 0
  * (which signifies end of file) */
if (*Message_Ptr == 0)
return 0;

/* Actually put the data into the buffer */
while (length && *Message_Ptr)  {

/* Because the buffer is in the user data segment,
* not the kernel data segment, assignment wouldn't
* work. Instead, we have to use put_user which
* copies data from the kernel data segment to the
* user data segment. */
put_user(*(Message_Ptr++), buffer++);

length --;
bytes_read ++;
}

#ifdef DEBUG
  printk ("Read %d bytes, %d left\n",
bytes_read, length);
#endif

  /* Read functions are supposed to return the number
* of bytes actually inserted into the buffer */
return bytes_read;
}

/* This function is called when somebody tries to write
* into our device file - unsupported in this example. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t device_write(struct file *file,
const char *buffer, /* The buffer */
size_t length, /* The length of the buffer */
loff_t *offset)  /* Our offset in the file */
#else
static int device_write(struct inode *inode,
                     struct file *file,
                     const char *buffer,
                     int length)
#endif
{
return -EINVAL;
}

/* Module Declarations ***************************** */

/* The major device number for the device. This is
* global (well, static, which in this context is global
* within this file) because it has to be accessible
* both for registration and for release. */
static int Major;

/* This structure will hold the functions to be
* called when a process does something to the device
* we created. Since a pointer to this structure is
* kept in the devices table, it can't be local to
* init_module. NULL is for unimplemented functions. */

struct file_operations Fops = {
NULL, /* seek */
device_read,
device_write,
NULL, /* readdir */
NULL, /* select */
NULL, /* ioctl */
NULL, /* mmap */
device_open,
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
NULL, /* flush */
#endif
device_release  /* a.k.a. close */
};

/* Initialize the module - Register the character device */
int init_module()
{
/* Register the character device (atleast try) */
Major = module_register_chrdev(0,
                              DEVICE_NAME,
                              &Fops);

/* Negative values signify an error */
if (Major < 0) {
printk ("%s device failed with %d\n",
"Sorry, registering the character",
Major);
return Major;
}

printk ("%s The major device number is %d.\n",
      "Registeration is a success.",
      Major);
printk ("If you want to talk to the device driver,\n");
printk ("you'll have to create a device file. \n");
printk ("We suggest you use:\n");
printk ("mknod <name> c %d <minor>\n", Major);
printk ("You can try different minor numbers %s",
      "and see what happens.\n");

return 0;
}

/* Cleanup - unregister the character device */
void cleanup_module()
{
int ret;

/* Unregister the device */
ret = module_unregister_chrdev(Major, DEVICE_NAME);

/* If there's an error, report it */
if (ret < 0)
printk("Error in unregister_chrdev: %d\n", ret);
}
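2.1 Multiple Kernel Versions Source Files
The system calls, which are the major interface the kernel shows to processes, generally stay the same across versions. A new system call may be added, but the old ones keep behaving as they did. This is necessary for backward compatibility: a new kernel version is not supposed to break regular processes. In most cases, the device files also remain the same. The internal interfaces within the kernel, on the other hand, can and do change between versions.
The Linux kernel versions are divided between the stable versions (n.<even number>.m) and the development versions (n.<odd number>.m). The development versions include all the cool new ideas, including those which will be considered a mistake, or reimplemented, in the next version. As a result, you can't trust the interfaces to remain the same in those versions (which is why I don't bother to support them in this book; it would be a lot of work and it would go out of date almost immediately). In the stable versions, on the other hand, we can expect the interfaces to remain the same regardless of the bug fix version (the m number).
This version of the MPG includes support for both version 2.0.x and version 2.2.x of the Linux kernel. Since there are differences between the two, this requires conditional compilation depending on the kernel version. The way to do this is to use the macro LINUX_VERSION_CODE. In version a.b.c of the kernel, the value of this macro is 2^16*a + 2^8*b + c (that is, 65536a + 256b + c). To get the value for a specific kernel version, we can use the KERNEL_VERSION macro. Since it is not defined in 2.0.35, we define it ourselves if necessary.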
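To make the version arithmetic concrete, here is a small user-space sketch. The KERNEL_VERSION macro is copied from the listings in this guide; the kver struct and the two helper functions are invented for the example, and the even/odd test only reflects the numbering scheme described above.

```c
/* Same definition the listings in this guide fall back to. */
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))

/* Hypothetical container for a decoded version. */
struct kver {
    int major;
    int minor;
    int patch;
};

/* Reverse the encoding: pull a, b and c back out of the code. */
struct kver decode_version(long code)
{
    struct kver v;

    v.major = (int)(code >> 16);
    v.minor = (int)((code >> 8) & 0xFF);
    v.patch = (int)(code & 0xFF);
    return v;
}

/* An even middle number marks a stable series under the
 * numbering scheme described in the text. */
int is_stable_series(struct kver v)
{
    return v.minor % 2 == 0;
}
```

For instance, KERNEL_VERSION(2,2,0) evaluates to 131584 and KERNEL_VERSION(2,0,35) to 131107, so a test such as LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0) is an ordinary integer comparison that the preprocessor can evaluate.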

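3. The /proc File System
In Linux there is an additional mechanism for the kernel and kernel modules to send information to processes: the /proc file system. Originally designed to allow easy access to information about processes (hence the name), it is now used by every part of the kernel which has something to report, such as /proc/modules, which holds the list of modules, and /proc/meminfo, which holds memory usage statistics.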
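The method for using the proc file system is very similar to the one used with device drivers: you create a structure with all the information needed for the /proc file, including pointers to any handler functions (in our case there is only one, called when somebody attempts to read from the /proc file). Then init_module registers the structure with the kernel and cleanup_module unregisters it.
The reason we use proc_register_dynamic (note 3.1) is that we don't want to decide in advance which inode number our file will use, but to let the kernel choose it in order to prevent clashes. Normal file systems are located on a disk rather than just in memory (which is where /proc is), and in that case the inode number is a pointer to the disk location of the file's index node (inode for short). The inode contains information about the file, such as its permissions, together with a pointer to the disk location or locations where the file's data can be found.
Because we don't get called when the file is opened or closed, there is no place in this module to put MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT, and if the file is opened and the module is then removed, there is no way to avoid the consequences. In the next chapter we'll see a way of dealing with /proc files that is harder to implement but more flexible, and that protects against this problem as well.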
ex procfs.c

/* procfs.c -  create a "file" in /proc
* Copyright (C) 1998-1999 by Ori Pomerantz
*/

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* Necessary because we use the proc fs */
#include <linux/proc_fs.h>

/* In 2.2.3 /usr/include/linux/version.h includes a
* macro for this, but 2.0.35 doesn't - so I add it
* here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

/* Put data into the proc fs file.

  Arguments
  =========
  1. The buffer where the data is to be inserted, if
   you decide to use it.
  2. A pointer to a pointer to characters. This is
   useful if you don't want to use the buffer
   allocated by the kernel.
  3. The current position in the file.
  4. The size of the buffer in the first argument.
  5. Zero (for future use?).

  Usage and Return value
  ======================
  If you use your own buffer, like I do, put its
  location in the second argument and return the
  number of bytes used in the buffer.

  A return value of zero means you have no further
  information at this time (end of file). A negative
  return value is an error condition.


  For More Information
  ====================
  The way I discovered what to do with this function
  wasn't by reading documentation, but by reading the
  code which used it. I just looked to see what uses
  the get_info field of proc_dir_entry struct (I used a
  combination of find and grep, if you're interested),
  and I saw that  it is used in <kernel source
  directory>/fs/proc/array.c.

  If something is unknown about the kernel, this is
  usually the way to go. In Linux we have the great
  advantage of having the kernel source code for
  free - use it.
*/
int procfile_read(char *buffer,
char **buffer_location,
off_t offset,
int buffer_length,
int zero)
{
int len;  /* The number of bytes actually used */

/* This is static so it will still be in memory
  * when we leave this function */
static char my_buffer[80];

static int count = 1;

/* We give all of our information in one go, so if the
  * user asks us if we have more information the
  * answer should always be no.
  *
  * This is important because the standard read
  * function from the library would continue to issue
  * the read system call until the kernel replies
  * that it has no more information, or until its
  * buffer is filled.
  */
if (offset > 0)
return 0;

/* Fill the buffer and get its length */
len = sprintf(my_buffer,
"For the %d%s time, go away!\n", count,
(count % 100 > 10 && count % 100 < 14) ? "th" :
   (count % 10 == 1) ? "st" :
   (count % 10 == 2) ? "nd" :
      (count % 10 == 3) ? "rd" : "th" );
count++;

/* Tell the function which called us where the
  * buffer is */
*buffer_location = my_buffer;

/* Return the length */
return len;
}

struct proc_dir_entry Our_Proc_File =
{
0, /* Inode number - ignore, it will be filled by
   * proc_register[_dynamic] */
4, /* Length of the file name */
"test", /* The file name */
S_IFREG | S_IRUGO, /* File mode - this is a regular
                     * file which can be read by its
                     * owner, its group, and everybody
                     * else */
1,/* Number of links (directories where the
      * file is referenced) */
0, 0,  /* The uid and gid for the file - we give it
         * to root */
80, /* The size of the file reported by ls. */
NULL, /* functions which can be done on the inode
      * (linking, removing, etc.) - we don't
      * support any. */
procfile_read, /* The read function for this file,
               * the function called when somebody
               * tries to read something from it. */
NULL /* We could have here a function to fill the
      * file's inode, to enable us to play with
      * permissions, ownership, etc. */
};

/* Initialize the module - register the proc file */
int init_module()
{
/* Success if proc_register[_dynamic] is a success,
  * failure otherwise. */
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
/* In version 2.2, proc_register assign a dynamic
  * inode number automatically if it is zero in the
  * structure , so there's no more need for
  * proc_register_dynamic
  */
return proc_register(&proc_root, &Our_Proc_File);
#else
return proc_register_dynamic(&proc_root, &Our_Proc_File);
#endif

/* proc_root is the root directory for the proc
  * fs (/proc). This is where we want our file to be
  * located.
  */
}

/* Cleanup - unregister our file from /proc */
void cleanup_module()
{
proc_unregister(&proc_root, Our_Proc_File.low_ino);
}

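4. Using /proc for Input
So far we have two ways to generate output from kernel modules: we can register a device driver and mknod a device file, or we can create a /proc file. This lets the kernel module tell us anything it likes. The only problem is that there is no way for us to talk back. The first way we'll send input to kernel modules is by writing to the /proc file.
Because the proc file system was written mainly to let the kernel report its state to processes, there are no special provisions for input. The proc_dir_entry struct doesn't include a pointer to an input function the way it includes a pointer to an output function. Instead, to write into a /proc file, we need to use the standard file system mechanism.
In Linux there is a standard mechanism for file system registration. Since every file system has to have its own functions to handle inode and file operations, there is a special structure to hold pointers to all those functions, struct inode_operations, which includes a pointer to struct file_operations. In /proc, whenever we register a new file, we are allowed to specify which struct inode_operations will be used to access it. This is the mechanism we use: a struct inode_operations which includes a pointer to a struct file_operations, which in turn includes pointers to our module_input and module_output functions.
It is important to note that the standard roles of read and write are reversed in the kernel. Read functions are used for output, whereas write functions are used for input. The reason is that read and write refer to the user's point of view: if a process reads something from the kernel, then the kernel needs to output it, and if a process writes something to the kernel, then the kernel receives it as input.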
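The "read means output" convention, together with returning 0 to signal end of file, can be modelled in user space. The function below is a sketch with an invented name (model_read); it uses memcpy where a real module would have to use put_user, and it tracks the position through an explicit offset instead of the static flag used in the listing that follows.

```c
#include <string.h>
#include <sys/types.h>

/* User-space model of a "read" handler: the provider side outputs
 * data when the consumer reads. Copies at most len bytes of msg
 * starting at *offset, advances *offset, and returns 0 once the
 * whole message has been handed over (end of file). */
ssize_t model_read(const char *msg, char *buf, size_t len, size_t *offset)
{
    size_t total = strlen(msg);
    size_t left;

    if (*offset >= total)
        return 0;                      /* nothing more to output */

    left = total - *offset;
    if (len > left)
        len = left;                    /* never copy past the end */

    memcpy(buf, msg + *offset, len);   /* put_user's job in a module */
    *offset += len;
    return (ssize_t)len;
}
```

A reader with a small buffer simply calls the function repeatedly until it returns 0, which is also how the standard library drives the read system call.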
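Another interesting point here is the module_permission function. This function is called whenever a process tries to do something with the /proc file, and it can decide whether or not to allow access. Right now it is based only on the operation and the uid of the current user (as available in current, a pointer to a structure which holds information on the currently running process), but it could be based on anything we like, such as what other processes are doing with the same file, the time of day, or the last input we received.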
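The decision made by module_permission in the listing below can be isolated as a pure function and tried out in user space. This is only a model: the name model_permission and the OP_* constants are invented here (the values 2 and 4 follow the comment in the listing), and the effective uid is passed in explicitly instead of being read from current.

```c
#include <errno.h>

#define OP_WRITE 2   /* input to the kernel module */
#define OP_READ  4   /* output from the kernel module */

/* Everybody may read; only root (effective uid 0) may write.
 * Returns 0 to allow the operation, -EACCES to refuse it. */
int model_permission(int op, unsigned int euid)
{
    if (op == OP_READ || (op == OP_WRITE && euid == 0))
        return 0;
    return -EACCES;
}
```

Because the rule is an ordinary function of its inputs, extending it (for example with a time-of-day check) would only mean adding parameters and conditions.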
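The reason for put_user and get_user is that Linux memory is segmented (under the Intel architecture; it may be different on other processors). This means that a pointer, by itself, does not reference a unique location in memory, only a location in a memory segment, and you need to know which segment it belongs to in order to use it. There is one memory segment for the kernel, and one for each of the processes.
The only memory segment accessible to a process is its own, so when writing regular programs that run as processes, there is no need to worry about segments. When you write a kernel module, you normally want to access the kernel memory segment, which is handled automatically by the system. However, when the content of a memory buffer needs to be passed between the currently running process and the kernel, the kernel function receives a pointer to a memory buffer which is in the process segment. The put_user and get_user macros allow you to access that memory.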
ex procfs.c

/* procfs.c -  create a "file" in /proc, which allows
* both input and output. */

/* Copyright (C) 1998-1999 by Ori Pomerantz */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* Necessary because we use proc fs */
#include <linux/proc_fs.h>

/* In 2.2.3 /usr/include/linux/version.h includes a
* macro for this, but 2.0.35 doesn't - so I add it
* here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>  /* for get_user and put_user */
#endif

/* The module's file functions ********************** */

/* Here we keep the last message received, to prove
* that we can process our input */
#define MESSAGE_LENGTH 80
static char Message[MESSAGE_LENGTH];

/* Since we use the file operations struct, we can't
* use the special proc output provisions - we have to
* use a standard read function, which is this function */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t module_output(
struct file *file, /* The file read */
char *buf, /* The buffer to put data to (in the
            * user segment) */
size_t len,  /* The length of the buffer */
loff_t *offset) /* Offset in the file - ignore */
#else
static int module_output(
struct inode *inode, /* The inode read */
struct file *file, /* The file read */
char *buf, /* The buffer to put data to (in the
            * user segment) */
int len)  /* The length of the buffer */
#endif
{
static int finished = 0;
int i;
char message[MESSAGE_LENGTH+30];

/* We return 0 to indicate end of file, that we have
  * no more information. Otherwise, processes will
  * continue to read from us in an endless loop. */
if (finished) {
finished = 0;
return 0;
}

/* We use put_user to copy the string from the kernel's
  * memory segment to the memory segment of the process
  * that called us. get_user, BTW, is
  * used for the reverse. */
sprintf(message, "Last input:%s", Message);
for(i=0; i<len && message[i]; i++)
put_user(message[i], buf+i);

/* Notice, we assume here that the size of the message
  * is below len, or it will be received cut. In a real
  * life situation, if the size of the message is more
  * than len then we'd return len and on the second call
  * start filling the buffer with the len+1'th byte of
  * the message. */
finished = 1;

return i;  /* Return the number of bytes "read" */
}

/* This function receives input from the user when the
* user writes to the /proc file. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t module_input(
struct file *file, /* The file itself */
const char *buf,    /* The buffer with input */
size_t length,    /* The buffer's length */
loff_t *offset)    /* offset to file - ignore */
#else
static int module_input(
struct inode *inode, /* The file's inode */
struct file *file, /* The file itself */
const char *buf,    /* The buffer with the input */
int length)       /* The buffer's length */
#endif
{
int i;

/* Put the input into Message, where module_output
  * will later be able to use it */
for(i=0; i<MESSAGE_LENGTH-1 && i<length; i++)
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
get_user(Message[i], buf+i);
/* In version 2.2 the semantics of get_user changed,
  * it no longer returns a character, but expects a
  * variable to fill up as its first argument and a
  * user segment pointer to fill it from as its
  * second.
  *
  * The reason for this change is that the version 2.2
  * get_user can also read a short or an int. The way
  * it knows the type of the variable it should read
  * is by using sizeof, and for that it needs the
  * variable itself.
  */
#else
Message[i] = get_user(buf+i);
#endif
Message[i] = '\0';  /* we want a standard, zero
                  * terminated string */

/* We need to return the number of input characters
  * used */
return i;
}

/* This function decides whether to allow an operation
* (return zero) or not allow it (return a non-zero
* which indicates why it is not allowed).
*
* The operation can be one of the following values:
* 0 - Execute (run the "file" - meaningless in our case)
* 2 - Write (input to the kernel module)
* 4 - Read (output from the kernel module)
*
* This is the real function that checks file
* permissions. The permissions returned by ls -l are
* for reference only, and can be overridden here.
*/
static int module_permission(struct inode *inode, int op)
{
/* We allow everybody to read from our module, but
  * only root (uid 0) may write to it */
if (op == 4 || (op == 2 && current->euid == 0))
return 0;

/* If it's anything else, access is denied */
return -EACCES;
}

/* The file is opened - we don't really care about
* that, but it does mean we need to increment the
* module's reference count. */
int module_open(struct inode *inode, struct file *file)
{
MOD_INC_USE_COUNT;

return 0;
}

/* The file is closed - again, interesting only because
* of the reference count. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
int module_close(struct inode *inode, struct file *file)
#else
void module_close(struct inode *inode, struct file *file)
#endif
{
MOD_DEC_USE_COUNT;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
return 0;  /* success */
#endif
}

/* Structures to register as the /proc file, with
* pointers to all the relevant functions. ********** */

/* File operations for our proc file. This is where we
* place pointers to all the functions called when
* somebody tries to do something to our file. NULL
* means we don't want to deal with something. */
static struct file_operations File_Ops_4_Our_Proc_File =
{
NULL,  /* lseek */
module_output,  /* "read" from the file */
module_input, /* "write" to the file */
NULL,  /* readdir */
NULL,  /* select */
NULL,  /* ioctl */
NULL,  /* mmap */
module_open, /* Somebody opened the file */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
NULL, /* flush, added here in version 2.2 */
#endif
module_close, /* Somebody closed the file */
/* etc. etc. etc. (they are all given in
* /usr/include/linux/fs.h). Since we don't put
* anything here, the system will keep the default
* data, which in Unix is zeros (NULLs when taken as
* pointers). */
};

/* Inode operations for our proc file. We need it so
* we'll have some place to specify the file operations
* structure we want to use, and the function we use for
* permissions. It's also possible to specify functions
* to be called for anything else which could be done to
* an inode (although we don't bother, we just put
* NULL). */
static struct inode_operations Inode_Ops_4_Our_Proc_File =
{
&File_Ops_4_Our_Proc_File,
NULL, /* create */
NULL, /* lookup */
NULL, /* link */
NULL, /* unlink */
NULL, /* symlink */
NULL, /* mkdir */
NULL, /* rmdir */
NULL, /* mknod */
NULL, /* rename */
NULL, /* readlink */
NULL, /* follow_link */
NULL, /* readpage */
NULL, /* writepage */
NULL, /* bmap */
NULL, /* truncate */
module_permission /* check for permissions */
};

/* Directory entry */
static struct proc_dir_entry Our_Proc_File =
{
0, /* Inode number - ignore, it will be filled by
   * proc_register[_dynamic] */
7, /* Length of the file name */
"rw_test", /* The file name */
S_IFREG | S_IRUGO | S_IWUSR,
/* File mode - this is a regular file which
* can be read by its owner, its group, and everybody
* else. Also, its owner can write to it.
*
* Actually, this field is just for reference, it's
* module_permission that does the actual check. It
* could use this field, but in our implementation it
* doesn't, for simplicity. */
1,  /* Number of links (directories where the
      * file is referenced) */
0, 0,  /* The uid and gid for the file -
         * we give it to root */
80, /* The size of the file reported by ls. */
&Inode_Ops_4_Our_Proc_File,
/* A pointer to the inode structure for
* the file, if we need it. In our case we
* do, because we need a write function. */
NULL
/* The read function for the file. Irrelevant,
* because we put it in the inode structure above */
};

/* Module initialization and cleanup ******************* */

/* Initialize the module - register the proc file */
int init_module()
{
/* Success if proc_register[_dynamic] is a success,
  * failure otherwise */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
/* In version 2.2, proc_register assign a dynamic
  * inode number automatically if it is zero in the
  * structure , so there's no more need for
  * proc_register_dynamic
  */
return proc_register(&proc_root, &Our_Proc_File);
#else
return proc_register_dynamic(&proc_root, &Our_Proc_File);
#endif
}

/* Cleanup - unregister our file from /proc */
void cleanup_module()
{
proc_unregister(&proc_root, Our_Proc_File.low_ino);
}

5．和设备文件对话（写和IOCTLS）
设备文件是用来代表物理设备的。多数物理设备是用来进行输出或输入的，所以必须由某种机制使得内核中的设备驱动从进程中得到输出送给设备。这可以通过打开输出设备文件并且写入做到，就想写入一个普通文件。在下面的例子里，这由device_write实现。
这不是总能奏效的。设想你与一个连向modem的串口（技是你有一个内猫，从CPU看来它也是作为一个串口实现，所以你不需要认为这个设想太困难）。最自然要做的事情就是使用设备文件把内容写到modem上（无论用modem命令还是电话线）或者从modem读信息（同样可以从modem命令回答或者通过电话线）。但是这留下的问题是当你需要和串口本身对话的时候需要怎样做？比如发送数据发送和接收的速率。
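The answer in Unix is to use a special function called ioctl (short for input output control). Every device can have its own ioctl commands, which can be read ioctls, write ioctls, both, or neither. The ioctl function is called with three parameters: the file descriptor of the appropriate device file, the ioctl number, and a parameter of type long, which can be cast to pass anything.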
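The ioctl number encodes the major device number, the type of the ioctl, the command, and the type of the parameter. It is usually created by a macro call (_IO, _IOR, _IOW or _IOWR, depending on the type) in a header file. This header file should then be #include'd both by the programs that will use ioctl (so they can generate the appropriate ioctls) and by the kernel module (so it can understand them). In the example below, the header file is chardev.h and the program which uses it is ioctl.c.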
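If you want to use ioctls in your own kernel modules, it is best to receive an official ioctl assignment, so that if you accidentally get somebody else's ioctls, or they get yours, you will know something is wrong. For more information, consult Documentation/ioctl-number.txt in the kernel source tree.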
ex chardev.c

/* chardev.c
*
* Create an input/output character device
*/

/* Copyright (C) 1998-99 by Ori Pomerantz */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

/* For character devices */

/* The character device definitions are here */
#include <linux/fs.h>

/* A wrapper which does next to nothing at
* present, but may help for compatibility
* with future versions of Linux */
#include <linux/wrapper.h>

/* Our own ioctl numbers */
#include "chardev.h"

/* In 2.2.3 /usr/include/linux/version.h includes a
* macro for this, but 2.0.35 doesn't - so I add it
* here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>  /* for get_user and put_user */
#endif

#define SUCCESS 0

/* Device Declarations ******************************** */

/* The name for our device, as it will appear in
* /proc/devices */
#define DEVICE_NAME "char_dev"

/* The maximum length of the message for the device */
#define BUF_LEN 80

/* Is the device open right now? Used to prevent
* concurrent access to the same device */
static int Device_Open = 0;

/* The message the device will give when asked */
static char Message[BUF_LEN];

/* How far did the process reading the message get?
* Useful if the message is larger than the size of the
* buffer we get to fill in device_read. */
static char *Message_Ptr;

/* This function is called whenever a process attempts
* to open the device file */
static int device_open(struct inode *inode,
                  struct file *file)
{
#ifdef DEBUG
printk ("device_open(%p)\n", file);
#endif

/* We don't want to talk to two processes at the
  * same time */
if (Device_Open)
return -EBUSY;

/* If this was a process, we would have had to be
  * more careful here, because one process might have
  * checked Device_Open right before the other one
  * tried to increment it. However, we're in the
  * kernel, so we're protected against context switches.
  *
  * This is NOT the right attitude to take, because we
  * might be running on an SMP box, but we'll deal with
  * SMP in a later chapter.
  */

Device_Open++;

/* Initialize the message */
Message_Ptr = Message;

MOD_INC_USE_COUNT;

return SUCCESS;
}

/* This function is called when a process closes the
* device file. In 2.0 it doesn't have a return value
* because it cannot fail: regardless of what else
* happens, you should always be able to close a device.
* In 2.2 it returns an int, so a device file could be
* impossible to close. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static int device_release(struct inode *inode,
                     struct file *file)
#else
static void device_release(struct inode *inode,
                        struct file *file)
#endif
{
#ifdef DEBUG
printk ("device_release(%p,%p)\n", inode, file);
#endif

/* We're now ready for our next caller */
Device_Open --;

MOD_DEC_USE_COUNT;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
return 0;
#endif
}

/* This function is called whenever a process which
* has already opened the device file attempts to
* read from it. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t device_read(
struct file *file,
char *buffer, /* The buffer to fill with the data */
size_t length,    /* The length of the buffer */
loff_t *offset) /* offset to the file */
#else
static int device_read(
struct inode *inode,
struct file *file,
char *buffer, /* The buffer to fill with the data */
int length)    /* The length of the buffer
                  * (mustn't write beyond that!) */
#endif
{
/* Number of bytes actually written to the buffer */
int bytes_read = 0;

#ifdef DEBUG
printk("device_read(%p,%p,%d)\n",
file, buffer, length);
#endif

/* If we're at the end of the message, return 0
  * (which signifies end of file) */
if (*Message_Ptr == 0)
return 0;

/* Actually put the data into the buffer */
while (length && *Message_Ptr)  {

/* Because the buffer is in the user data segment,
* not the kernel data segment, assignment wouldn't
* work. Instead, we have to use put_user which
* copies data from the kernel data segment to the
* user data segment. */
put_user(*(Message_Ptr++), buffer++);
length --;
bytes_read ++;
}

#ifdef DEBUG
  printk ("Read %d bytes, %d left\n",
bytes_read, length);
#endif

  /* Read functions are supposed to return the number
* of bytes actually inserted into the buffer */
return bytes_read;
}

/* This function is called when somebody tries to
* write into our device file. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t device_write(struct file *file,
                        const char *buffer,
                        size_t length,
                        loff_t *offset)
#else
static int device_write(struct inode *inode,
                     struct file *file,
                     const char *buffer,
                     int length)
#endif
{
int i;

#ifdef DEBUG
printk ("device_write(%p,%s,%d)",
file, buffer, length);
#endif

for(i=0; i<length && i<BUF_LEN; i++)
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
get_user(Message[i], buffer+i);
#else
Message[i] = get_user(buffer+i);
#endif

Message_Ptr = Message;

/* Again, return the number of input characters used */
return i;
}

/* This function is called whenever a process tries to
* do an ioctl on our device file. We get two extra
* parameters (additional to the inode and file
* structures, which all device functions get): the number
* of the ioctl called and the parameter given to the
* ioctl function.
*
* Whatever this function returns becomes the return
* value of the process's ioctl call, which is how
* IOCTL_GET_NTH_BYTE hands its output back.
*/
int device_ioctl(
struct inode *inode,
struct file *file,
unsigned int ioctl_num,/* The number of the ioctl */
unsigned long ioctl_param) /* The parameter to it */
{
int i;
char *temp;
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
char ch;
#endif

/* Switch according to the ioctl called */
switch (ioctl_num) {
case IOCTL_SET_MSG:
   /* Receive a pointer to a message (in user space)
   * and set that to be the device's message. */

   /* Get the parameter given to ioctl by the process */
   temp = (char *) ioctl_param;

   /* Find the length of the message */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   get_user(ch, temp);
   for (i=0; ch && i<BUF_LEN; i++, temp++)
   get_user(ch, temp);
#else
   for (i=0; get_user(temp) && i<BUF_LEN; i++, temp++)
;
#endif

   /* Don't reinvent the wheel - call device_write */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   device_write(file, (char *) ioctl_param, i, 0);
#else
   device_write(inode, file, (char *) ioctl_param, i);
#endif
   break;

case IOCTL_GET_MSG:
   /* Give the current message to the calling
   * process - the parameter we got is a pointer,
   * fill it. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   i = device_read(file, (char *) ioctl_param, 99, 0);
#else
   i = device_read(inode, file, (char *) ioctl_param,
                  99);
#endif
   /* Warning - we assume here the buffer length is
   * 100. If it's less than that we might overflow
   * the buffer, causing the process to core dump.
   *
   * The reason we only allow up to 99 characters is
   * that the NULL which terminates the string also
   * needs room. */

   /* Put a zero at the end of the buffer, so it
   * will be properly terminated */
   put_user('\0', (char *) ioctl_param+i);
   break;

case IOCTL_GET_NTH_BYTE:
   /* This ioctl is both input (ioctl_param) and
   * output (the return value of this function) */
   /* Refuse an index outside the message buffer */
   if (ioctl_param >= BUF_LEN)
      return -EINVAL;
   return Message[ioctl_param];
   break;
}

return SUCCESS;
}

/* Module Declarations *************************** */

/* This structure will hold the functions to be called
* when a process does something to the device we
* created. Since a pointer to this structure is kept in
* the devices table, it can't be local to
* init_module. NULL is for unimplemented functions. */
struct file_operations Fops = {
NULL, /* seek */
device_read,
device_write,
NULL, /* readdir */
NULL, /* select */
device_ioctl, /* ioctl */
NULL, /* mmap */
device_open,
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
NULL,  /* flush */
#endif
device_release  /* a.k.a. close */
};

/* Initialize the module - Register the character device */
int init_module()
{
int ret_val;

/* Register the character device (at least try) */
ret_val = module_register_chrdev(MAJOR_NUM,
                              DEVICE_NAME,
                              &Fops);

/* Negative values signify an error */
if (ret_val < 0) {
printk ("%s failed with %d\n",
         "Sorry, registering the character device ",
         ret_val);
return ret_val;
}

printk ("%s The major device number is %d.\n",
      "Registration is a success.",
      MAJOR_NUM);
printk ("If you want to talk to the device driver,\n");
printk ("you'll have to create a device file. \n");
printk ("We suggest you use:\n");
printk ("mknod %s c %d 0\n", DEVICE_FILE_NAME,
      MAJOR_NUM);
printk ("The device file name is important, because\n");
printk ("the ioctl program assumes that's the\n");
printk ("file you'll use.\n");

return 0;
}

/* Cleanup - unregister the character device */
void cleanup_module()
{
int ret;

/* Unregister the device */
ret = module_unregister_chrdev(MAJOR_NUM, DEVICE_NAME);

/* If there's an error, report it */
if (ret < 0)
printk("Error in module_unregister_chrdev: %d\n", ret);
}
ex chardev.h

/* chardev.h - the header file with the ioctl definitions.
*
* The declarations here have to be in a header file,
* because they need to be known both to the kernel
* module (in chardev.c) and the process calling ioctl
* (ioctl.c)
*/

#ifndef CHARDEV_H
#define CHARDEV_H

#include <linux/ioctl.h>

/* The major device number. We can't rely on dynamic
* registration any more, because ioctls need to know
* it. */
#define MAJOR_NUM 100

/* Set the message of the device driver */
#define IOCTL_SET_MSG _IOW(MAJOR_NUM, 0, char *)
/* _IOW means that we're creating an ioctl command
* number for passing information from a user process
* to the kernel module.
*
* The first argument, MAJOR_NUM, is the major device
* number we're using.
*
* The second argument is the number of the command
* (there could be several with different meanings).
*
* The third argument is the type we want to get from
* the process to the kernel.
*/

/* Get the message of the device driver */
#define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *)
/* This IOCTL is used for output, to get the message
* of the device driver. However, we still need the
* buffer to place the message in to be input,
* as it is allocated by the process.
*/

/* Get the n'th byte of the message */
#define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int)
/* The IOCTL is used for both input and output. It
* receives from the user a number, n, and returns
* Message[n]. */

/* The name of the device file */
#define DEVICE_FILE_NAME "char_dev"

#endif

ex ioctl.c

/* ioctl.c - the process to use ioctl's to control the
* kernel module
*
* Until now we could have used cat for input and
* output. But now we need to do ioctl's, which require
* writing our own process.
*/

/* Copyright (C) 1998 by Ori Pomerantz */

/* device specifics, such as ioctl numbers and the
* major device file. */
#include "chardev.h"

#include <fcntl.h>    /* open */
#include <unistd.h>    /* close */
#include <stdio.h>    /* printf, putchar */
#include <stdlib.h>    /* exit */
#include <sys/ioctl.h>  /* ioctl */

/* Functions for the ioctl calls */

void ioctl_set_msg(int file_desc, char *message)
{
int ret_val;

ret_val = ioctl(file_desc, IOCTL_SET_MSG, message);

if (ret_val < 0) {
printf ("ioctl_set_msg failed:%d\n", ret_val);
exit(-1);
}
}

void ioctl_get_msg(int file_desc)
{
int ret_val;
char message[100];

/* Warning - this is dangerous because we don't tell
  * the kernel how far it's allowed to write, so it
  * might overflow the buffer. In a real production
  * program, we would have used two ioctls - one to tell
  * the kernel the buffer length and another to give
  * it the buffer to fill
  */
ret_val = ioctl(file_desc, IOCTL_GET_MSG, message);

if (ret_val < 0) {
printf ("ioctl_get_msg failed:%d\n", ret_val);
exit(-1);
}

printf("get_msg message:%s\n", message);
}

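void ioctl_get_nth_byte(int file_desc)
{
int i;
int c;  /* int, so a negative error code is not lost */

printf("get_nth_byte message:");

/* Fetch bytes until the terminating zero. A do-while is
  * used so that c is never tested before it is set. */
i = 0;
do {
c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++);

if (c < 0) {
   printf(
   "ioctl_get_nth_byte failed at the %d'th byte:\n", i);
   exit(-1);
}

putchar(c);
} while (c != 0);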
putchar('\n');
}

/* Main - Call the ioctl functions */
int main()
{
int file_desc, ret_val;
char *msg = "Message passed by ioctl\n";

file_desc = open(DEVICE_FILE_NAME, 0);
if (file_desc < 0) {
printf ("Can't open device file: %s\n",
         DEVICE_FILE_NAME);
exit(-1);
}

ioctl_get_nth_byte(file_desc);
ioctl_get_msg(file_desc);
ioctl_set_msg(file_desc, msg);

close(file_desc);
}

6. Startup Parameters
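In many of the previous examples, we had to hard-wire something into the kernel module, such as the file name for /proc files or the major device number for the device so we could have ioctls to it. This goes against the grain of the Unix, and Linux, philosophy, which is to write flexible programs the user can customize.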
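The way to tell a program, or a kernel module, something it needs before it can start working is by command line parameters. In the case of kernel modules, we don't get argc and argv; instead, we get something better. We can define global variables in the kernel module and insmod will fill them for us.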
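In this kernel module, we define two of them: str1 and str2. All you need to do is compile the kernel module and then run insmod param str1=xxx str2=yyy. When init_module is called, str1 will point to the string xxx and str2 to the string yyy.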
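In version 2.0 there is no type checking on these arguments. If the first character of str1 or str2 is a digit, the kernel will fill the variable with the value of the integer rather than a pointer to the string. In a real-life situation you have to check for this.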
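On the other hand, in version 2.2 you use the macro MODULE_PARM to tell insmod that you expect a parameter, along with its name and its type. This solves the type problem and allows kernel modules to receive strings which begin with a digit.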
ex param.c

/* param.c
*
* Receive command line parameters at module installation
*/

/* Copyright (C) 1998-99 by Ori Pomerantz */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

#include <stdio.h>  /* I need NULL */

/* In 2.2.3 /usr/include/linux/version.h includes a
* macro for this, but 2.0.35 doesn't - so I add it
* here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

/* Emmanuel Papirakis:
*
* Parameter names are now (2.2) handled in a macro.
* The kernel doesn't resolve the symbol names
* as it seems to have once done.
*
* To pass parameters to a module, you have to use a macro
* defined in include/linux/module.h (line 176).
* The macro takes two parameters. The parameter's name and
* its type. The type is a letter in double quotes.
* For example, "i" should be an integer and "s" should
* be a string.
*/

char *str1, *str2;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
MODULE_PARM(str1, "s");
MODULE_PARM(str2, "s");
#endif

/* Initialize the module - show the parameters */
int init_module()
{
if (str1 == NULL || str2 == NULL) {
printk("Next time, do insmod param str1=<something> ");
printk("str2=<something>\n");
} else
printk("Strings:%s and %s\n", str1, str2);

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
printk("If you try to insmod this module twice, ");
printk("(without rmmod'ing\n");
printk("it first), you might get the wrong ");
printk("error message:\n");
printk("'symbol for parameters str1 not found'.\n");
#endif

return 0;
}

/* Cleanup */
void cleanup_module()
{
}

7. System Calls
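So far, the only thing we have done is use well-defined kernel mechanisms to register /proc files and device handlers. This is fine if you want to do the routine things the kernel already provides for. But what if you want to do something unusual, to change the behavior of the system in some way? Then you are mostly on your own.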
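This is where kernel programming gets dangerous. While writing the example below, I killed the open system call. This meant I couldn't open any files, I couldn't run any programs, and I couldn't shut down the computer. I had to pull the power switch. Luckily, no files were lost. To ensure you won't lose any files either, please run sync right before you do the insmod and the rmmod.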
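Forget about /proc files, forget about device files. They are just minor details. The real process-to-kernel communication mechanism, the one used by all processes, is system calls. When a process requests a service from the kernel (such as opening a file, forking a new process, or requesting more memory), this is the mechanism used. If you want to change the behavior of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run strace <command> <arguments>.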
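In general, a process is not supposed to be able to access the kernel. It can't access kernel memory and it can't call kernel functions. The hardware of the CPU enforces this (that is why it is called "protected mode"). System calls are an exception to this general rule. What happens is that the process fills the registers with the appropriate values and then executes a special instruction which jumps to a previously defined location in the kernel (of course, that location is readable by user processes but not writable by them). Under Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode but as the operating system kernel, and therefore you are allowed to do whatever you want.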
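The location in the kernel a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. Then it looks at the table of system calls (sys_call_table) to find the address of the kernel function to call. Then it calls the function, and after it returns, does a few system checks and then returns to the process (or to a different process, if the process's time ran out). If you want to read this code, it is in the source file arch/<architecture>/kernel/entry.S, after the line ENTRY(system_call).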
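So, if we want to change the way a certain system call works, what we need to do is write our own function to implement it (usually by adding a bit of our own code and then calling the original function) and then change the pointer in sys_call_table to point to our function. Because we might be removed later and we don't want to leave the system in an unstable state, it is important for cleanup_module to restore the table to its original state.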
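The source code here is an example of such a kernel module. We want to "spy" on a certain user, and to printk a message whenever that user opens a file. Towards this end, we replace the system call to open a file with our own function, called our_sys_open. This function checks the uid (user's id) of the current process, and if it is equal to the uid we spy on, it calls printk to display the name of the file to be opened. Then, either way, it calls the original open function with the same parameters to actually open the file.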
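The init_module function replaces the appropriate location in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to restore everything back to normal. This approach is dangerous, because of the possibility of two kernel modules changing the same system call. Imagine we have two kernel modules, A and B. A's open system call will be A_open and B's will be B_open. Now, when A is inserted into the kernel, the system call is replaced with A_open, which will call the original sys_open when it is done. Next, B is inserted into the kernel, which replaces the system call with B_open, which will call what it thinks is the original system call, A_open, when it is done.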
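Now, if B is removed first, everything will be well: it will simply restore the system call to A_open, which calls the original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to the original, sys_open, cutting B out of the loop. Then, when B is removed, it will restore the system call to what it thinks is the original, A_open, which is no longer in memory. At first glance, it appears we could solve this particular problem by checking if the system call is equal to our open function and if not, not changing it at all (so that B won't change the system call when it is removed), but that will cause an even worse problem. When A is removed, it sees that the system call was changed to B_open so that it is no longer pointing to A_open, so it won't restore it to sys_open before it is removed from memory. Unfortunately, B_open will still try to call A_open, which is no longer there, so that even without removing B the system would crash.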
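I can think of two ways to prevent this problem. The first is to restore the call to the original value, sys_open. Unfortunately, sys_open is not part of the kernel symbol table in /proc/ksyms, so we can't access it. The other solution is to use the reference count to prevent root from rmmod'ing the module once it is loaded. This is good for production modules, but bad for an educational sample, which is why I didn't do it here.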
ex syscall.c

/* syscall.c
*
* System call "stealing" sample
*/

/* Copyright (C) 1998-99 by Ori Pomerantz */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h> /* We're doing kernel work */
#include <linux/module.h> /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif

#include <sys/syscall.h>  /* The list of system calls */

/* For the current (process) structure, we need
* this to know who the current user is. */
#include <linux/sched.h>

/* In 2.2.3 /usr/include/linux/version.h includes a
* macro for this, but 2.0.35 doesn't - so I add it
* here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>
#endif

/* The system call table (a table of functions). We
* just define this as external, and the kernel will
* fill it up for us when we are insmod'ed
*/
extern void *sys_call_table[];

/* UID we want to spy on - will be filled from the
* command line */
int uid;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
MODULE_PARM(uid, "i");
#endif

/* A pointer to the original system call. The reason
* we keep this, rather than call the original function
* (sys_open), is because somebody else might have
* replaced the system call before us. Note that this
* is not 100% safe, because if another module
* replaced sys_open before us, then when we're inserted
* we'll call the function in that module - and it
* might be removed before we are.
*
* Another reason for this is that we can't get sys_open.
* It's a static variable, so it is not exported. */
asmlinkage int (*original_call)(const char *, int, int);

/* For some reason, in 2.2.3 current->uid gave me
* zero, not the real user ID. I tried to find what went
* wrong, but I couldn't do it in a short time, and
* I'm lazy - so I'll just use the system call to get the
* uid, the way a process would.
*
* For some reason, after I recompiled the kernel this
* problem went away.
*/
asmlinkage int (*getuid_call)();

/* The function we'll replace sys_open (the function
* called when you call the open system call) with. To
* find the exact prototype, with the number and type
* of arguments, we find the original function first
* (it's at fs/open.c).
*
* In theory, this means that we're tied to the
* current version of the kernel. In practice, the
* system calls almost never change (it would wreak havoc
* and require programs to be recompiled, since the system
* calls are the interface between the kernel and the
* processes).
*/
asmlinkage int our_sys_open(const char *filename,
                        int flags,
                        int mode)
{
int i = 0;
char ch;

/* Check if this is the user we're spying on */
if (uid == getuid_call()) {
  /* getuid_call is the getuid system call,
* which gives the uid of the user who
* ran the process which called the system
* call we got */

/* Report the file, if relevant */
printk("Opened file by %d: ", uid);
do {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   get_user(ch, filename+i);
#else
   ch = get_user(filename+i);
#endif
   i++;
   printk("%c", ch);
} while (ch != 0);
printk("\n");
}

/* Call the original sys_open - otherwise, we lose
  * the ability to open files */
return original_call(filename, flags, mode);
}

/* Initialize the module - replace the system call */
int init_module()
{
/* Warning - too late for it now, but maybe for
  * next time... */
printk("I'm dangerous. I hope
