


What exactly is the "uninterruptible state" that CPU load refers to?


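CPU load is one of the more important CPU metrics. When your system runs into a high CPU load and you search online, you will almost certainly find a definition like "CPU load is the average number of tasks that were using or waiting to use the CPU over a period of time." As for what counts as "waiting", the more precise explanations add tasks in uninterruptible sleep (D state):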

The load average is calculated as the average number of runnable or running tasks (R state), and the number of tasks in uninterruptible sleep (D state) over the specified interval.

So when your system shows a high CPU load but low CPU utilization, tracking down the tasks in the uninterruptible state looks like the only lead.
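One way to look for such tasks is to read the state field of `/proc/<pid>/stat`, where `D` marks uninterruptible sleep. The following is a minimal sketch in C, assuming a Linux system with procfs mounted; the function names `task_state_from_stat` and `task_state_of_pid` are made up for illustration and are not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Extract the one-letter task state from the contents of /proc/<pid>/stat.
 * The line looks like "1234 (comm) D 1 ...". The command name may itself
 * contain spaces or parentheses, so search for the LAST ')' and read the
 * character after the following space. Returns '\0' if the line is malformed. */
char task_state_from_stat(const char *stat_line)
{
    const char *close = strrchr(stat_line, ')');
    if (close == NULL || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* Hypothetical helper: read /proc/<pid>/stat and return the task state,
 * or '\0' if the file cannot be read (for example, the process exited). */
char task_state_of_pid(int pid)
{
    char path[64];
    char buf[1024];

    snprintf(path, sizeof path, "/proc/%d/stat", pid);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return '\0';
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    return line != NULL ? task_state_from_stat(line) : '\0';
}
```

Calling `task_state_of_pid` for each numeric directory under `/proc` and keeping the entries that return `'D'` gives a rough list of tasks that currently add to the load without using the CPU. The state is only a snapshot, so short D-state sleeps are easy to miss.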




When a process is in user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.


The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).

When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a “dummy” process which tells the cpu to slow down a bit and sits in a loop — the idle loop).


If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:


TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won’t go into details on how that works).


TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
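The interruptible case is easy to observe from user space. The sketch below assumes a POSIX system where `nanosleep()` puts the caller into an interruptible sleep, which is the case on Linux; the function name `sleep_is_interrupted_by_signal` is invented for this example. It arms a 50 ms timer, starts a 2 second sleep, and reports whether the sleep was cut short by the signal.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <time.h>

/* Empty handler: its only job is to make SIGALRM interrupt the sleep
 * instead of terminating the process (the default action for SIGALRM). */
static void on_alarm(int signo)
{
    (void)signo;
}

/* Sleep for up to 2 seconds, but arrange for SIGALRM after 50 ms.
 * Returns 1 if nanosleep() was cut short with EINTR, 0 otherwise. */
int sleep_is_interrupted_by_signal(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return 0;

    struct itimerval timer;
    memset(&timer, 0, sizeof timer);
    timer.it_value.tv_usec = 50 * 1000;   /* one-shot, fires after 50 ms */
    if (setitimer(ITIMER_REAL, &timer, NULL) != 0)
        return 0;

    struct timespec two_seconds = { .tv_sec = 2, .tv_nsec = 0 };
    struct timespec remaining;
    int rc = nanosleep(&two_seconds, &remaining);

    /* An interruptible sleep wakes up early and reports EINTR. */
    return rc == -1 && errno == EINTR;
}
```

A task blocked in TASK_UNINTERRUPTIBLE gives no such early return: the signal stays pending until whatever the task is waiting for completes.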




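    /* wait to be given the lock */
    while (true) {
        /* sleep in D state: signals will not wake this task */
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        if (!waiter.task)   /* cleared once the lock is handed to us */
            break;
        schedule();         /* give up the CPU until woken */
    }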


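So I think the "I/O" meant here is similar to the iowait metric I discussed in an earlier post: the kernel is scheduling I/O, and the task could run in principle but cannot, because some data it needs is not available yet. If you want to stretch this to cover the network, the example would be NFS. I have also read in other articles that the D used for the uninterruptible state originally stood for Disk Wait.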



So why is the uninterruptible state counted in CPU load?

I found the answer in Brendan Gregg's blog post on uninterruptible I/O, which walks through the history of CPU load.

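In a 1993 email, a Linux kernel developer argued that these seemingly brief uninterruptible sleeps should be counted along with runnable tasks. His example: if you replace a fast disk with a slow one, the system clearly spends more time waiting, yet a load average that counts only CPU demand would go down, which is counterintuitive. So tasks in uninterruptible I/O were added to the count, and at that point the metric changed from a CPU load average into a system load average.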
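The effect of counting D-state tasks can be sketched with the kind of exponentially damped average the kernel uses. The code below is a simplified model in C, not the kernel's implementation: the constants `FSHIFT`, `FIXED_1`, and `EXP_1` and the shape of `calc_load` follow the kernel's load-average code as I recall it, so treat the exact values as assumptions, and `simulate_load` is a hypothetical helper added only for this illustration. It ignores per-CPU accounting and idle-tick handling.

```c
#include <stdio.h>

/* Fixed-point parameters modeled on the kernel's load-average code.
 * Treat the exact values as assumptions and check your kernel source. */
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed point */
#define EXP_1    1884UL             /* exp(-5s/60s) in fixed point */

/* One 5-second update of the 1-minute average.
 * `active` is the task count, already scaled by FIXED_1. */
unsigned long calc_load(unsigned long load, unsigned long decay,
                        unsigned long active)
{
    unsigned long newload = load * decay + active * (FIXED_1 - decay);
    if (active >= load)
        newload += FIXED_1 - 1;     /* round up while the load is rising */
    return newload / FIXED_1;
}

/* Hypothetical helper: feed a constant number of running (R) and
 * uninterruptible (D) tasks into the average for `samples` updates,
 * starting from a load of zero. Returns the load in fixed point. */
unsigned long simulate_load(unsigned long nr_running,
                            unsigned long nr_uninterruptible,
                            int samples)
{
    unsigned long active = (nr_running + nr_uninterruptible) * FIXED_1;
    unsigned long load = 0;

    for (int i = 0; i < samples; i++)
        load = calc_load(load, EXP_1, active);
    return load;
}
```

In this model, `simulate_load(0, 3, 1000)` settles at `3 * FIXED_1`, a load of 3.00 with no task on the CPU at all, which is the "high load, low CPU utilization" picture. Dividing the result by `FIXED_1` gives the familiar decimal value.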

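Today, however, the load statistic no longer covers only disk I/O; it also includes other uninterruptible locks. The Linux code now sets the uninterruptible flag in many places: "in Linux 4.12, there are nearly 400 codepaths that set TASK_UNINTERRUPTIBLE".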

In the post, the author also uses a flame graph to break down the code paths that set TASK_UNINTERRUPTIBLE. He states explicitly that on Linux, load averages cover CPU, disk, and uninterruptible locks:

On Linux, load averages are (or try to be) “system load averages”, for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren’t completely idle. Advantage: includes demand for different resources.

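The author also recommends that when the CPU load looks wrong and the cause is hard to pin down, you turn to other, finer-grained tools to find it. Still, he concludes that load averages remain useful: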

The use of the uninterruptible state has since grown in the Linux kernel, and nowadays includes uninterruptible lock primitives. If the load average is a measure of demand in terms of running and waiting threads (and not strictly threads wanting hardware resources), then they are still working the way we want them to.
