第 2 章：初期演进 —— 在正确性与性能之间

阶段概览

Per-CFS_RQ throttle model 从首次合入到 v5.x 经历了大量修复和改进。这个阶段的核心主题包括：

PELT (Per-Entity Load Tracking) 集成：确保 throttled 期间的统计正确性（v3.7~v4.7）
竞态条件修复：throttle 与 period timer 之间的各种 race condition（v3.12~v4.2）
throttle_count 同步机制重构（v4.7）
饥饿问题修复：带有限配额的 throttle_list 遍历顺序（v4.19）
带宽 race 消除：throttle 与 distribution 之间的经典问题（v5.7）

这个阶段涉及的贡献者包括 Paul Turner、Ben Segall、Peter Zijlstra、Vincent Guittot、Phil Auld 等。

关键 Commit 分析

Commit f1b17280efbd — sched: Maintain runnable averages across throttled periods

作者: Paul Turner | 日期: 2012-10-04 | 版本: v3.7

规模: +50 / -11 行, 2 个文件

类型: 🆕 新增功能

变更概述

引入 cfs_rq_clock_task() 和 PELT 时钟冻结机制。在 per-CFS_RQ 模型中，当 cfs_rq 被 throttled 时，该队列的 PELT 时钟应该停止前进，否则 task group 的负载统计会被不准确地稀释（quota 小、period 长时问题尤其严重）。

核心代码分析

引入 cfs_rq_clock_task()：

/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq)
{
	if (unlikely(cfs_rq->throttle_count))
		return cfs_rq->throttled_clock_task;         // 冻结时钟

	return rq_of(cfs_rq)->clock_task - cfs_rq->throttled_clock_task_time;
}

关键设计：

当 throttle_count > 0 时（即 cfs_rq 处于 throttled 状态），返回冻结的时间戳
否则返回 clock_task 减去累计 throttled 时间
这样 PELT 计算中使用的时钟在 throttled 期间不前进，负载统计不会被稀释

冻结/解冻逻辑：

// tg_throttle_down — 进入 throttled 状态时记录时间
if (!cfs_rq->throttle_count) {
    update_cfs_load(cfs_rq, 0);
    cfs_rq->throttled_clock_task = rq->clock_task;  // 冻结
}

// tg_unthrottle_up — 恢复时调整累计时间
cfs_rq->throttled_clock_task_time += rq->clock_task -
                             cfs_rq->throttled_clock_task;

update_entity_load_avg() 的改进：

// 原来直接使用 rq->clock_task
// 改为：如果是 group entity，使用其 cfs_rq 的 throttled-aware 时钟
if (entity_is_task(se))
    now = cfs_rq_clock_task(cfs_rq);
else
    now = cfs_rq_clock_task(group_cfs_rq(se));

演进意义

这为 PELT 在 CFS 带宽控制中的正确性奠定了基础。后续所有的 PELT 相关改进都建立在此机制之上。后来 cfs_rq_clock_pelt()、update_idle_cfs_rq_clock_pelt() 等函数都是从这个框架演进而来。

Commit f9f9ffc237dd — sched: Avoid throttle_cfs_rq() racing with period_timer stopping

作者: Ben Segall | 日期: 2013-10-16 | 版本: v3.12

规模: +33 / -3 行, 1 个文件

类型: 🐛 修复

变更概述

修复了一个关键的竞态条件：throttle_cfs_rq() 可能与 period_timer 的停止操作并发执行，导致 throttle 后的 cfs_rq 无法被 unthrottle。这是 Google 在生产环境中发现的 bug。

核心代码分析

问题：当 do_sched_cfs_period_timer() 发现没有 throttled cfs_rq 时，将定时器标记为 idle（停止）。但与此同时，可能有一个 cfs_rq 正在被 throttle，但尚未加入 throttled_cfs_rq 链表。结果就是这个 cfs_rq 永远不会被 unthrottle。

修复：在 throttle_cfs_rq 中，先将 cfs_rq 加入链表，再确保定时器已启动：

// 修复前：throttle 与 timer 启动没有原子保证
cfs_rq->throttled = 1;
raw_spin_lock(&cfs_b->lock);
list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
empty = list_empty(&cfs_b->throttled_cfs_rq);
raw_spin_unlock(&cfs_b->lock);

if (empty)
    start_cfs_bandwidth(cfs_b);

这个修复引入了 start_cfs_bandwidth() 调用，确保 throttle 后定时器一定在运行。这是 throttle 正确性的基础保证之一。

Commit 55e16d30bd99 — sched/fair: Rework throttle_count sync

作者: Peter Zijlstra | 日期: 2016-06-22 | 版本: v4.7

规模: +20 / -21 行, 2 个文件

类型: 🔧 重构

变更概述

重写了 throttle_count 的同步机制，移除了之前复杂的延迟同步方式（throttle_uptodate 字段）和 check_enqueue_throttle() 中的逐父同步循环。

核心代码分析

移除 throttle_uptodate：

之前，check_enqueue_throttle() 在每个任务入队时，需要向上遍历到父 cfs_rq 来同步 throttle_count：

// 旧方案：每次入队时检查是否需要同步
if (unlikely(!cfs_rq->throttle_uptodate)) {
    // 向上找到最近的 uptodate 父节点，复制 throttle_count
    for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
        pcfs_rq = tg->cfs_rq[cpu_of(rq)];
        if (pcfs_rq->throttle_uptodate)
            break;
    }
    if (tg) {
        cfs_rq->throttle_count = pcfs_rq->throttle_count;
        cfs_rq->throttled_clock_task = rq_clock_task(rq);
    }
}

引入 sync_throttle()：

// 新方案：在 tg_set_cfs_bandwidth/online_fair_sched_group 时同步
static void sync_throttle(struct task_group *tg, int cpu)
{
	struct cfs_rq *pcfs_rq, *cfs_rq;

	if (!cfs_bandwidth_used())
		return;
	if (!tg->parent)
		return;

	cfs_rq = tg->cfs_rq[cpu];
	pcfs_rq = tg->parent->cfs_rq[cpu];

	cfs_rq->throttle_count = pcfs_rq->throttle_count;
	pcfs_rq->throttled_clock_task = rq_clock_task(cpu_rq(cpu));
}

设计意图：Peter 将复杂度从"运行时路径"移到了"配置变更路径"。不再需要每个 enqueue 操作都检查 throttle_uptodate，而是只在带宽配置变更或新的 task group 上线时进行一次同步。这使得 check_enqueue_throttle() 变得更轻量，减少了调度关键路径上的开销。

同时移除了 struct cfs_rq 中的 throttle_uptodate 字段，略微缩减了 cfs_rq 的大小。

Commit baa9be4ffb55 — sched/fair: Fix throttle_list starvation with low CFS quota

作者: Phil Auld | 日期: 2018-10-08 | 版本: v4.19

规模: +22 / -3 行, 2 个文件

类型: 🐛 修复

变更概述

修复了 throttle_list 饥饿问题。当 quota 远小于 period 时，distribute_cfs_runtime() 在遍历 throttled_cfs_rq 链表时可能反复错过新加入的 cfs_rq，导致某些 cfs_rq 长期得不到带宽。

核心代码分析

问题：每个新被 throttle 的 cfs_rq 总是被添加到链表头部（list_add_rcu），而 distribution 从链表头部开始遍历。这意味着 distribution 完成后，新 throttle 的 cfs_rq 排在链表头部，而之前未得到配额的 cfs_rq 被推到链表尾部——如果 distribution 每次只够服务前几个 cfs_rq，后面的就会永久饥饿。

修复：引入了 distribute_running 标志和 head/tail 选择逻辑：

// throttle_cfs_rq() 中——根据是否正在 distribution 决定插入位置
if (cfs_b->distribute_running)
    list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);   // 头部
else
    list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq); // 尾部

// do_sched_cfs_period_timer() 中——标记分布正在运行
while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
    cfs_b->distribute_running = 1;
    // ... 执行 distribution ...
    cfs_b->distribute_running = 0;
}

设计要点：

当 distribute_running 为 true 时（即 period timer 正在分发带宽），新 throttle 的 cfs_rq 插到头部，确保当前 distribution 轮次不会看到它（避免重复处理）
当没有 distribution 在进行时，新 throttle 的 cfs_rq 插到尾部，保证公平的顺序遍历
slack timer 路径中也加入了对 distribute_running 的检查，避免双重分发

Commit 2a4b03ffc69f — sched/fair: Prevent unlimited runtime on throttled group

作者: Vincent Guittot | 日期: 2020-01-14 | 版本: v5.5

规模: +8 / -1 行, 1 个文件

类型: 🐛 修复

变更概述

修复了通过 sched_move_task() 移动到 throttled group 的任务可以不受限运行的问题。

核心代码分析

问题：当一个 running 任务通过 sched_move_task() 从一个非 throttled group 迁移到一个 throttled group 时，该任务会继续运行（因为它已经是 ->curr），且由于 throttle 是在 put_prev_entity() 时触发，这个任务下次让出 CPU 前不会触发 throttle。（实际上，如果该 cfs_rq 已经被 throttle，任务可能运行到下一次 resched。）

修复：在 sched_move_task() 中，如果任务在迁移后仍然是 running 状态，并且其新 group 是 throttled 的，显式触发一个 resched：

if (running) {
    set_next_task(rq, tsk);
    /*
     * After changing group, the running task may have joined a
     * throttled one but it's still the running task. Trigger a
     * resched to make sure that task can still run.
     */
    resched_curr(rq);
}

这行修复看上去很小，但它解决了被 throttle 的 group 中仍有运行中任务的"逃生出口"问题。

Commit e98fa02c4f2e — sched/fair: Eliminate bandwidth race between throttling and distribution

作者: Paul Turner | 日期: 2020-04-10 | 版本: v5.7

规模: +79 / -32 行, 1 个文件

类型: 🔧 重构 + 🐛 修复

变更概述

这是 Paul Turner 时隔近 9 年后的再一次重大修复。他彻底消除了 throttle_cfs_rq() 和 distribute_cfs_runtime() 之间经典的竞态条件，同时为 throttle_cfs_rq 增加了"撤销 throttle"的能力——如果能正好在 throttle 前分配到 runtime，就不再执行 throttle。

核心代码分析

重构 throttle_cfs_rq() 为返回 bool：

之前 throttle 是一个"开弓没有回头箭"的操作——一旦进入就没法反悔。Paul 将其改造为在加锁后再确认一次：

static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
	raw_spin_lock(&cfs_b->lock);
	/* 在持有 cfs_b->lock 的情况下再次尝试分配 runtime */
	if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) {
		/*
		 * 我们幸运地竞争到了刚刚刷新的带宽！
		 * 如果此时 throttle 了，定时器可能在一整个周期内不会 unthrottle 我们。
		 * 所以请求 1ns 而不是检查 cfs_b，确保后续的 check_cfs_rq_runtime
		 * 保持一致判断不再 throttle 我们
		 */
		dequeue = 0;
	} else {
		list_add_tail_rcu(&cfs_rq->throttled_list,
				  &cfs_b->throttled_cfs_rq);
	}
	raw_spin_unlock(&cfs_b->lock);

	if (!dequeue)
		return false;  /* Throttle no longer required. */

	// ... 实际 throttle 逻辑 ...
}

引入 __assign_cfs_rq_runtime()：

这是原有的 assign_cfs_rq_runtime() 的底层版本，在调用之前 cfs_b->lock 已经被持有：

static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
				   struct cfs_rq *cfs_rq, u64 target_runtime)
{
	u64 min_amount, amount = 0;

	lockdep_assert_held(&cfs_b->lock);

	min_amount = target_runtime - cfs_rq->runtime_remaining;

	if (cfs_b->quota == RUNTIME_INF)
		amount = min_amount;
	else {
		start_cfs_bandwidth(cfs_b);
		if (cfs_b->runtime > 0) {
			amount = min(cfs_b->runtime, min_amount);
			cfs_b->runtime -= amount;
			cfs_b->idle = 0;
		}
	}

	cfs_rq->runtime_remaining += amount;
	return cfs_rq->runtime_remaining > 0;
}

设计意图：这个问题源于 throttle 和 distribution 在不同锁域下操作：

throttle_cfs_rq() 持有 rq->lock，需要获取 cfs_b->lock 来检查是否还能分配 runtime
distribute_cfs_runtime() 持有 cfs_b->lock，需要获取 rq->lock 来 unthrottle

两者形成了一个 AB-BA 锁模式。在旧设计中，throttle 先不加锁检查一次 runtime，然后加锁再加入列表——中间有个时间窗口。Paul 的修复将"最后确认"移到了持有 cfs_b->lock 之后，消除了这个窗口。

演进意义

这是 throttle 模型的核心正确性修复之一。引入了"throttle 前再确认一次"的模式，也为后续的分布式带宽系统（如 burst、async unthrottle）提供了更稳健的基础。

Commit ab93a4bc955b — sched/fair: Remove distribute_running from CFS bandwidth

作者: Josh Don | 日期: 2020-04-10 | 版本: v5.7

规模: +1 / -13 行, 2 个文件

类型: 🔧 重构

变更概述

在 e98fa02c4f2e 消除了 throttle/distribution 之间的竞态后，之前用于防止嵌套 distribution 的 distribute_running 标志（由 baa9be4ffb55 引入）不再需要，可以被移除了。

// 移除 distribute_running 检查
while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
// 变为：
while (throttled && cfs_b->runtime > 0) {

演进意义

这展示了架构演进中"先加临时修复，等根本原因修复后再清理"的典型模式。distribute_running 作为一个补丁级别的修复存在了近 2 年后，随着锁竞争根因的解决而被清理。

阶段性总结

经过这个阶段的演进，per-CFS_RQ throttle model 在正确性和稳定性方面达到了较成熟的状态。截至 v5.7，系统具备了：

正确的 PELT 统计：throttled 期间时钟冻结，负载统计不受影响
可靠的 throttle/unthrottle 闭环：没有竞争条件导致的永久 throttle
公平的带宽分配：throttle_list 的插入策略避免了节点饥饿
基本的边界保护：阻止了"逃逸"任务在 throttled group 中无限运行

然而，这个模型仍然有三个本质问题：

粗粒度导致的锁持有者优先级反转：一个持有 spinlock 的任务被 throttle 后，其他 CPU 上等待该锁的任务陷入无限等待——这是真实生产环境中触发 hung task watchdog 的常见原因
同步 unthrottle 的延迟：分布带宽需要在 period timer 中逐 CPU 获取 rq->lock，在大规模系统上延迟不可控
没有 burst 能力：quota 严格按周期限制，不能借用之前未用完的配额

这些问题将在后续章节中得到解决。

第 2 章：初期演进 —— 在正确性与性能之间 ​

阶段概览 ​

关键 Commit 分析 ​

Commit f1b17280efbd — sched: Maintain runnable averages across throttled periods ​

变更概述 ​

核心代码分析 ​

演进意义 ​

Commit f9f9ffc237dd — sched: Avoid throttle_cfs_rq() racing with period_timer stopping ​

变更概述 ​

核心代码分析 ​

Commit 55e16d30bd99 — sched/fair: Rework throttle_count sync ​

变更概述 ​

核心代码分析 ​

Commit baa9be4ffb55 — sched/fair: Fix throttle_list starvation with low CFS quota ​

变更概述 ​

核心代码分析 ​

Commit 2a4b03ffc69f — sched/fair: Prevent unlimited runtime on throttled group ​

变更概述 ​

核心代码分析 ​

Commit e98fa02c4f2e — sched/fair: Eliminate bandwidth race between throttling and distribution ​

变更概述 ​

核心代码分析 ​

演进意义 ​

Commit ab93a4bc955b — sched/fair: Remove distribute_running from CFS bandwidth ​

变更概述 ​

演进意义 ​

阶段性总结 ​

第 2 章：初期演进 —— 在正确性与性能之间

阶段概览

关键 Commit 分析

Commit f1b17280efbd — sched: Maintain runnable averages across throttled periods

变更概述

核心代码分析

演进意义

Commit f9f9ffc237dd — sched: Avoid throttle_cfs_rq() racing with period_timer stopping

变更概述

核心代码分析

Commit 55e16d30bd99 — sched/fair: Rework throttle_count sync

变更概述

核心代码分析

Commit baa9be4ffb55 — sched/fair: Fix throttle_list starvation with low CFS quota

变更概述

核心代码分析

Commit 2a4b03ffc69f — sched/fair: Prevent unlimited runtime on throttled group

变更概述

核心代码分析

Commit e98fa02c4f2e — sched/fair: Eliminate bandwidth race between throttling and distribution

变更概述

核心代码分析

演进意义

Commit ab93a4bc955b — sched/fair: Remove distribute_running from CFS bandwidth

变更概述

演进意义

阶段性总结