What exactly is the "uninterruptible state" that CPU load talks about?

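Note: this article quotes its English sources directly. If you read it through a browser's automatic translation, the text will come out sounding odd.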

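CPU load is one of the key CPU metrics. When your system has a high CPU load and you search online, you are bound to find a definition along the lines of "CPU load is the average number of tasks that are using or waiting to use the CPU over a period of time". When it comes to the "waiting" part, the more precise description is "tasks in uninterruptible sleep (D state)". The man page puts it this way: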

The load average is calculated as the average number of runnable or running tasks (R state), and the number of tasks in uninterruptible sleep (D state) over the specified interval.

So when your system shows a high CPU load but low CPU utilization, tracking down these uninterruptible tasks looks like the only lead you have.
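To make the "high load, low CPU usage" symptom concrete, here is a minimal sketch that reads the load averages from `/proc/loadavg` on Linux and normalizes them by CPU count. The function names (`parse_loadavg`, `read_loadavg`, `load_per_cpu`) are my own, chosen for illustration.

```python
import os


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg, e.g. '0.20 0.18 0.12 1/80 11206'."""
    fields = text.split()
    # The fourth field is "<currently runnable entities>/<total entities>".
    runnable, total = fields[3].split("/")
    return {
        "load1": float(fields[0]),
        "load5": float(fields[1]),
        "load15": float(fields[2]),
        "runnable_now": int(runnable),
        "total_tasks": int(total),
    }


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live load averages (Linux only)."""
    with open(path) as f:
        return parse_loadavg(f.read())


def load_per_cpu(load, cpu_count=None):
    """Normalize a load average by the number of CPUs."""
    cpus = cpu_count or os.cpu_count() or 1
    return load / cpus
```

Calling `load_per_cpu(read_loadavg()["load1"])` gives a rough per-CPU figure. If it stays well above 1 while CPU utilization is low, tasks in D state are a likely contributor.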

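But what is the uninterruptible state, and how do you find tasks in it? I searched for Chinese-language material and found that most blog posts either excerpt a book or copy from each other, and some contain obvious mistakes. For example, some posts tell you to use top to find D-state tasks, but a D state often lasts only an instant, so how are you supposed to catch it with top? Other posts say the uninterruptible state simply means waiting for I/O. Is that disk I/O or network I/O? And I/O has many stages: if I issue a blocking call, is the task really uninterruptible for the whole duration? After a long search, nothing made it click for me what the uninterruptible state actually is. So I went looking for answers in English-language sources, and after a lot of digging I did find something worth sharing.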
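Because a D state can be too brief for an interactive tool to catch, one option is to sample `/proc` repeatedly and count how often each process is seen in state `D`. The sketch below is one way to do that, not a polished tool. The function names are mine, and it only scans thread-group leaders under `/proc/<pid>`, so threads listed under `/proc/<pid>/task` are not covered.

```python
import os
import time
from collections import Counter


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def find_d_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        if state == "D":
            found.append((pid, comm))
    return found


def sample_d_tasks(samples=200, interval=0.01, proc_root="/proc"):
    """Poll repeatedly and count how often each process was seen in D."""
    hits = Counter()
    for _ in range(samples):
        for task in find_d_tasks(proc_root):
            hits[task] += 1
        time.sleep(interval)
    return hits
```

`sample_d_tasks().most_common(10)` then shows which processes were caught in D state most often during the sampling window.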

Why does the uninterruptible state exist at all?

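In short, it exists so that certain kernel code paths cannot be interrupted. I found an answer in a Stack Overflow thread, and I will quote it here with my own paraphrase after each passage.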

When a process is on user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks if there are any signals pending (including the ones which are used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.

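In other words: a process running in user mode can be interrupted at any moment, which switches it to kernel mode. Pending signals, including the ones that terminate a process such as SIGTERM and SIGKILL, are checked when the kernel returns to user mode, so a process can only be killed at that point.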

The reason a process cannot be killed in kernel mode is that it could potentially corrupt the kernel structures used by all the other processes in the same machine (the same way killing a thread can potentially corrupt data structures used by other threads in the same process).

That is, killing a process in the middle of kernel code could leave kernel data structures shared by every process on the machine in a corrupted state, just as killing a thread can corrupt data shared with the other threads of its process.

When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it sleeps by marking itself as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a “dummy” process which tells the cpu to slow down a bit and sits in a loop — the idle loop).

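So when the kernel has to do something that may take a long time (waiting on a pipe that another process writes to, or waiting for hardware, for example), the task marks itself as sleeping, and the scheduler then switches to another process. If nothing else is runnable, it switches to the idle loop.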

If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:

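A signal sent to a sleeping process can only be handled after the process has been woken up and has returned to user space. This is where the two main kinds of sleep differ: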

TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won’t go into details on how that works).

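TASK_INTERRUPTIBLE is the interruptible sleep. A task with this flag is asleep but can still be woken by a signal. The code that put the task to sleep expects that a signal may arrive, so after waking up it checks for one and returns from the system call. Once the signal has been handled, the system call may be restarted automatically (the original answer does not go into how).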

TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.

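TASK_UNINTERRUPTIBLE is the uninterruptible sleep. A task with this flag does not expect to be woken by anything other than the thing it is waiting for, either because the operation cannot easily be restarted or because programs expect the system call to be atomic. It is also used for sleeps that are known to be very short.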
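The interruptible half of this distinction can be observed from user space. The sketch below is an illustration I added, not something from the quoted answer: it puts the process into an ordinary `time.sleep`, which is an interruptible sleep, and uses a timer signal to wake it early. It assumes a Unix system and must run in the main thread. There is no equally simple way to demonstrate the uninterruptible case, because user code cannot choose to sleep in TASK_UNINTERRUPTIBLE.

```python
import signal
import time


class Interrupted(Exception):
    """Raised by the signal handler to cut the sleep short."""


def _on_alarm(signum, frame):
    raise Interrupted()


def interruptible_sleep_demo(sleep_seconds=5.0, alarm_after=0.2):
    """Sleep, but let SIGALRM wake us early. Returns (interrupted, elapsed)."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    start = time.monotonic()
    # Ask the kernel to deliver SIGALRM after `alarm_after` seconds.
    signal.setitimer(signal.ITIMER_REAL, alarm_after)
    try:
        time.sleep(sleep_seconds)  # the task sleeps interruptibly here
        interrupted = False
    except Interrupted:
        interrupted = True  # the signal woke the task before the timeout
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending timer
        signal.signal(signal.SIGALRM, old_handler)
    return interrupted, time.monotonic() - start
```

With the default arguments the call returns after roughly 0.2 seconds rather than 5, because the signal wakes the sleeping task. A task in uninterruptible sleep would instead keep sleeping until the event it waits for occurs.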

So how is this special code made uninterruptible?

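The answer is that some kernel code explicitly puts the task into the uninterruptible state before it sleeps, mainly because the code has to respect strict timing (acknowledging a device, for example) or because it is performing an operation that must not be disturbed.
It looks something like this: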

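    /* wait to be given the lock */
    while (true) {
        /* set the state before checking the condition,
         * so that a wakeup arriving in between is not lost */
        set_task_state(tsk, TASK_UNINTERRUPTIBLE);
        /* waiter.task is cleared once the lock has been handed over */
        if (!waiter.task)
            break;
        /* give up the CPU; a signal will not wake this task */
        schedule();
    }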

So is it reasonable to say that waiting for I/O causes the uninterruptible state?

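Clearly not, at least not as stated. Take an example: I issue a blocking call on a socket and just sit there waiting. Would you expect that to be an uninterruptible state? It is obviously interruptible.
So I think the "I/O" in that claim should be understood like the iowait metric I discussed in an earlier post: the task is waiting on I/O issued by the kernel. It would otherwise be ready to run, but cannot, because some data it needs is not there yet. If you insist on tying this to the network, the case would be NFS. I have also read in other articles that the letter D for this state originally stood for "Disk Wait".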

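That said, you cannot claim that network I/O never causes an uninterruptible state. In principle a task can still enter it while executing some low-level driver code, but that does not justify the blanket statement that network I/O causes uninterruptible sleep.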

So if you really have to hunt for D-state tasks, start from disk reads and writes: log I/O, swap, off-heap memory, or even NFS (though you are probably not using it).
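Given a list of PIDs already known to be in D state, `/proc/<pid>/wchan` names the kernel function each one is sleeping in, which helps tell disk, swap, and NFS waits apart. A minimal sketch follows, with function names of my own. On some kernels, or without sufficient privileges, the file may contain only `0`, which the code reports as unknown.

```python
import os
from collections import Counter


def read_wchan(pid, proc_root="/proc"):
    """Return the kernel wait-channel name for a pid, or None if unavailable."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as f:
            name = f.read().strip()
    except OSError:
        return None  # the process exited or the file is not readable
    if name in ("", "0"):
        return None  # not waiting, or the kernel hides the symbol
    return name


def group_by_wchan(pids, proc_root="/proc"):
    """Count how many of the given pids are waiting in each kernel function."""
    counts = Counter()
    for pid in pids:
        counts[read_wchan(pid, proc_root) or "unknown"] += 1
    return counts
```

If most of the PIDs share one wait channel, that function name is a good search term for working out which subsystem is involved.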

So why is the uninterruptible state counted in CPU load?

I found the answer in Brendan Gregg's blog post on Linux load averages and uninterruptible I/O, which also traces the history of the metric.

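In a 1993 email, a Linux developer argued that these seemingly brief uninterruptible sleeps should be counted together with runnable tasks. His example: if you replace a fast disk with a slow one, the system as a whole gets slower, yet a load average that only counts tasks wanting the CPU would go down, which is counter-intuitive. So tasks in uninterruptible I/O waits were added to the count, and from that point on the metric had effectively changed from a CPU load average into a system load average.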
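The effect of that change can be seen with a little arithmetic. The sketch below is a floating-point approximation of an exponentially damped 1-minute average updated every 5 seconds. The kernel's real implementation uses fixed-point arithmetic, and the function names and task counts here are made up for illustration.

```python
import math


def update_load(load, active, interval=5.0, window=60.0):
    """One update step of an exponentially damped moving average."""
    decay = math.exp(-interval / window)
    return load * decay + active * (1.0 - decay)


def simulate(runnable, uninterruptible, count_d, seconds=300, interval=5.0):
    """Load average after `seconds` with a constant number of R and D tasks."""
    load = 0.0
    for _ in range(int(seconds / interval)):
        # Counting D tasks is what turns "CPU load" into "system load".
        active = runnable + (uninterruptible if count_d else 0)
        load = update_load(load, active, interval)
    return load


if __name__ == "__main__":
    # Hypothetical slow-disk scenario: 1 task on the CPU, 3 stuck in D state.
    print("counting R only :", round(simulate(1, 3, count_d=False), 2))
    print("counting R and D:", round(simulate(1, 3, count_d=True), 2))
```

With only runnable tasks counted, the slow-disk system looks almost idle. With D tasks included, the load average reflects the four tasks that are actually trying to make progress.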

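Today's load average, however, no longer reflects disk I/O alone. It also includes uninterruptible locks, and the Linux source now sets the uninterruptible flag in a great many places: "in Linux 4.12, there are nearly 400 codepaths that set TASK_UNINTERRUPTIBLE".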

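In the same post, the author uses flame graphs to break down the code paths that spend time in TASK_UNINTERRUPTIBLE. He also states explicitly that on Linux the load average covers CPU, disk, and uninterruptible locks: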

On Linux, load averages are (or try to be) “system load averages”, for the system as a whole, measuring the number of threads that are working and waiting to work (CPU, disk, uninterruptible locks). Put differently, it measures the number of threads that aren’t completely idle. Advantage: includes demand for different resources.

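The author also recommends that when the CPU load looks wrong and the cause is hard to pin down, you turn to other, finer-grained tools to investigate. Still, he concludes that load averages remain useful: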

The use of the uninterruptible state has since grown in the Linux kernel, and nowadays includes uninterruptible lock primitives. If the load average is a measure of demand in terms of running and waiting threads (and not strictly threads wanting hardware resources), then they are still working the way we want them to.

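In other words, as the Linux kernel has evolved, more and more code has come to use the uninterruptible state, and it now includes lock primitives. If what you want to measure is demand in terms of running and waiting threads (rather than strictly threads that want hardware resources), the load average still does that job well.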