第 1 章：起源与初始实现 — per-CFS_RQ Throttle Model

阶段概览

CFS 带宽控制的初始实现由 Google 的 Paul Turner 及其团队在 2011 年完成，合入 Linux v3.2。这一套 patch series 包含了 10+ commits，从带宽会计（accounting）、定时器刷新、throttle 执行到 unthrottle 恢复，构建了完整的 per-CFS_RQ throttle 模型。

设计理念：带宽控制以 CFS runqueue（cfs_rq）为粒度。每个 task group 有一个全局带宽池（cfs_bandwidth），group 在每个 CPU 上的 cfs_rq 从中获取 runtime。当一个 cfs_rq 的 runtime 耗尽时，整个队列中的所有任务都会被 dequeue（即从 CPU 的 runqueue 中移除），CPU 不再调度该 group 的任何任务，直到下一个周期的配额刷新。

源码结构变化

初始实现涉及的文件：

kernel/sched.c — struct cfs_bandwidth 定义，定时器初始化，带宽设置接口
kernel/sched_fair.c — throttle/unthrottle 核心逻辑
include/linux/sched.h — sysctl_sched_cfs_bandwidth_slice 导出
kernel/sysctl.c — 带宽 slice 的 sysctl 接口

数据结构层面，新增了：

struct cfs_bandwidth（在 task_group 中）：全局带宽池
cfs_rq->throttled：标记 cfs_rq 是否被 throttled
cfs_rq->runtime_enabled、cfs_rq->runtime_remaining：本地配额
cfs_bandwidth->throttled_cfs_rq：被 throttle 的 cfs_rq 链表

关键 Commit 分析

Commit 58088ad0152b — sched: Add a timer to handle CFS bandwidth refresh

作者: Paul Turner | 日期: 2011-07-21 | 版本: v3.2-rc1

规模: +123 / -24 行, 2 个文件

类型: 🆕 新增功能

变更概述

这是 CFS 带宽控制的基础设施层。它引入了 struct hrtimer period_timer（per-task_group 的 per-CPU 高精度定时器），用于在每个周期结束时刷新全局带宽池。同时将 RT 调度类中已有的带宽定时器代码重构为通用的 start_bandwidth_timer()。

核心代码分析

通用化带宽定时器：

diff

-static void start_rt_bandwidth(struct rt_bandwidth *rt_b)
+static void start_bandwidth_timer(struct hrtimer *period_timer, ktime_t period)
 {
-	ktime_t now;
+	unsigned long delta;
+	ktime_t soft, hard, now;
 
+	for (;;) {
+		if (hrtimer_active(period_timer))
+			break;
+		now = hrtimer_cb_get_time(period_timer);
+		hrtimer_forward(period_timer, now, period);
+		soft = hrtimer_get_softexpires(period_timer);
+		hard = hrtimer_get_expires(period_timer);
+		delta = ktime_to_ns(ktime_sub(hard, soft));
+		__hrtimer_start_range_ns(period_timer, soft, delta,
+					 HRTIMER_MODE_ABS_PINNED, 0);
+	}
 }

这标志着 CFS 从 RT 借用了带宽控制的设计模式。

do_sched_cfs_period_timer() — 周期刷新逻辑：

static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun)
{
	int idle = 1;

	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota == RUNTIME_INF)  // 无限制时直接退出
		goto out_unlock;

	idle = cfs_b->idle;
	cfs_b->runtime = cfs_b->quota;   // 刷新全局 pool

	/* mark as potentially idle for the upcoming period */
	cfs_b->idle = 1;                 // 标记为可 idle
out_unlock:
	if (idle)
		cfs_b->timer_active = 0;  // 如果没有 throttled cfs_rq，停定时器
	raw_spin_unlock(&cfs_b->lock);
	return idle;
}

此时的逻辑非常简单：每个周期结束时将 cfs_b->runtime 重置为 quota，同时通过 idle 标志来判断是否可以停掉定时器（如果没有任何活动的 throttled cfs_rq，就没有必要继续保持定时器运行）。

__start_cfs_bandwidth() — 定时器激活：

static void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	/*
	 * 防止与正在进行的关闭操作竞争
	 */
	while (unlikely(hrtimer_active(&cfs_b->period_timer))) {
		raw_spin_unlock(&cfs_b->lock);
		hrtimer_cancel(&cfs_b->period_timer);
		raw_spin_lock(&cfs_b->lock);
		if (cfs_b->timer_active)
			return;
	}
	cfs_b->timer_active = 1;
	start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
}

演进意义

这是 CFS 带宽控制的定时器基础设施。它为后续的 throttle/unthrottle、distribution 等机制提供了时间驱动的"心跳"。此后的演进中，定时器逻辑经历了多次重写（如 slack timer 的引入、idle 判断逻辑的调整）。

Commit ec12cb7f31e2 — sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth

作者: Paul Turner | 日期: 2011-07-21 | 版本: v3.2-rc1

规模: +94 / -3 行, 4 个文件

类型: 🆕 新增功能

变更概述

这是带宽控制的会计（accounting）层。它在 update_curr() 路径中增加了对每个 cfs_rq 的 CPU 消耗追踪，从全局带宽池（cfs_bandwidth->runtime）中扣减 quota，并负责在本地 runtime 不足时从全局池中补充。

核心代码分析

assign_cfs_rq_runtime() — 从全局池分配本地配额：

static void assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	struct task_group *tg = cfs_rq->tg;
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
	u64 amount = 0, min_amount;

	/* 计算需要的最小配额 */
	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;

	raw_spin_lock(&cfs_b->lock);
	if (cfs_b->quota == RUNTIME_INF)
		amount = min_amount;        // 无限配额
	else if (cfs_b->runtime > 0) {
		amount = min(cfs_b->runtime, min_amount);
		cfs_b->runtime -= amount;   // 从全局池扣除
	}
	raw_spin_unlock(&cfs_b->lock);

	cfs_rq->runtime_remaining += amount;
}

设计要点：

使用 slice（默认 5ms）作为一次分配的单位——这是为了减少对 cfs_b->lock 的争用
当本地 runtime_remaining 变为负数时触发重新分配
全局池 cfs_b->runtime 是 per-task_group 共享的，受 cfs_b->lock 保护

account_cfs_rq_runtime() — 在 update_curr 中被调用：

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq,
				     unsigned long delta_exec)
{
	if (!cfs_rq->runtime_enabled)   // 该 cfs_rq 没有带宽限制
		return;

	cfs_rq->runtime_remaining -= delta_exec;
	if (cfs_rq->runtime_remaining > 0)
		return;

	assign_cfs_rq_runtime(cfs_rq);
}

在每个调度 tick 中，update_curr() 都会调用 account_cfs_rq_runtime()，消耗对应的 runtime。当本地剩余不足时，尝试从全局池补充。

演进意义

这是带宽会计的核心。此实现中，当全局池也耗尽时，assign_cfs_rq_runtime() 什么都不做，runtime_remaining 保持负数。但此时实际 throttle 并未发生——它只是埋下了"需要 throttle"的种子。真正的 throttle 由后续的 commit 补充。

Commit 85dac906bec3 — sched: Add support for throttling group entities

作者: Paul Turner | 日期: 2011-07-21 | 版本: v3.2-rc1

规模: +92 / -4 行, 2 个文件

类型: 🆕 新增功能

变更概述

这是 throttle 执行层。当 runtime_remaining 为负且无法补充时，throttle_cfs_rq() 将该 cfs_rq 的 group entity 从调度树中完全出队，并添加到 cfs_b->throttled_cfs_rq 链表中。

核心代码分析

throttle_cfs_rq() — 核心 throttle 逻辑：

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	struct sched_entity *se;
	long task_delta, dequeue = 1;

	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

	update_cfs_load(cfs_rq, 0);        // 冻结前的最后统计

	task_delta = cfs_rq->h_nr_running;
	for_each_sched_entity(se) {
		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
		if (!se->on_rq)
			break;

		if (dequeue)
			dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
		// 将 h_nr_running 减去被 throttle 的任务数
		qcfs_rq->h_nr_running -= task_delta;

		if (qcfs_rq->load.weight)
			dequeue = 0;
	}

	if (!se)
		rq->nr_running -= task_delta;

	cfs_rq->throttled = 1;
	raw_spin_lock(&cfs_b->lock);
	list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
	raw_spin_unlock(&cfs_b->lock);
}

关键设计决策：

遍历整个 hierarchy：从被 throttle 的 cfs_rq 向上遍历到根，逐级减少 h_nr_running
遇到有负载的 cfs_rq 停遍历：如果某个上级 cfs_rq 还有其他子 group 在运行，只减 h_nr_running 计数，不解出它的 entity
DEQUEUE_SLEEP 标志：这意味着被 throttle 出队时被认为是在"睡眠"，不影响其 vruntime 的 normalization

在 enqueue 和 dequeue 路径中引入对 throttled cfs_rq 的特殊处理：

// enqueue 路径中：遇到 throttled cfs_rq 就停止向上传播
if (cfs_rq_throttled(cfs_rq))
    break;

// dequeue 路径中同样
if (cfs_rq_throttled(cfs_rq))
    break;

设计意图：这表示 throttle 层级以下的 cfs_rq 与 rq 完全隔离。enqueue/dequeue 操作不会穿透 throttle 边界，使得 scheduler 不再感知 throttled group 的任务。

Commit 671fd9dabe52 — sched: Add support for unthrottling group entities

作者: Paul Turner | 日期: 2011-07-21 | 版本: v3.2-rc1

规模: +126 / -4 行, 2 个文件

类型: 🆕 新增功能

变更概述

与 throttle 对应的 unthrottle 逻辑，以及 distribute_cfs_runtime() 带宽分发机制。当 period timer 刷新了全局池后，遍历 throttled_cfs_rq 链表，逐个分配 runtime 并 unthrottle。

核心代码分析

unthrottle_cfs_rq()：

static void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	struct sched_entity *se;
	int enqueue = 1;
	long task_delta;

	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];

	cfs_rq->throttled = 0;
	raw_spin_lock(&cfs_b->lock);
	list_del_rcu(&cfs_rq->throttled_list);     // 从 throttled 链表移除
	raw_spin_unlock(&cfs_b->lock);

	if (!cfs_rq->load.weight)
		return;                              // 没有负载，没必要恢复

	task_delta = cfs_rq->h_nr_running;
	for_each_sched_entity(se) {
		if (se->on_rq)
			enqueue = 0;                 // 如果上级已在 rq，只加计数

		cfs_rq = cfs_rq_of(se);
		if (enqueue)
			enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
		cfs_rq->h_nr_running += task_delta;

		if (cfs_rq_throttled(cfs_rq))
			break;                        // 遇到上级也被 throttle 则停
	}

	if (!se)
		rq->nr_running += task_delta;

	if (rq->curr == rq->idle && rq->cfs.nr_running)
		resched_task(rq->curr);
}

设计要点：

unthrottle 是 throttle 的逆操作：逐级 re-enqueue entity 并恢复 h_nr_running
惰性恢复：如果 entity 已经在 rq 上（se->on_rq），则只更新计数而不重复入队

distribute_cfs_runtime()：

static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
		u64 remaining, u64 expires)
{
	struct cfs_rq *cfs_rq;
	u64 runtime = remaining;

	rcu_read_lock();
	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
				throttled_list) {
		struct rq *rq = rq_of(cfs_rq);

		raw_spin_lock(&rq->lock);
		if (!cfs_rq_throttled(cfs_rq))
			goto next;

		// 计算需要多少 runtime 才能让 runtime_remaining > 0
		runtime = -cfs_rq->runtime_remaining + 1;
		if (runtime > remaining)
			runtime = remaining;
		remaining -= runtime;

		cfs_rq->runtime_remaining += runtime;
		cfs_rq->runtime_expires = expires;

		if (cfs_rq->runtime_remaining > 0)
			unthrottle_cfs_rq(cfs_rq);       // 配额足够则立即恢复

next:
		raw_spin_unlock(&rq->lock);
		if (!remaining)
			break;
	}
	rcu_read_unlock();
	return remaining;
}

设计意图

distribute_cfs_runtime() 在 do_sched_cfs_period_timer() 的定时器上下文中运行，遍历被 throttle 的 cfs_rq 链表，力求先用新配额度偿还"旧债"。这是经典的 debt-first 策略：在所有 cfs_rq 都得到足够的运行时之前，新配额不会直接变成 cfs_b->runtime 被自由分配。

Commit d3d9dc330236 — sched: Throttle entities exceeding their allowed bandwidth

作者: Paul Turner | 日期: 2011-07-21 | 版本: v3.2-rc1

规模: +50 / -2 行, 1 个文件

类型: 🆕 新增功能

变更概述

将 throttle 逻辑与调度器的调度路径集成。通过 check_enqueue_throttle() 和 check_cfs_rq_runtime() 两个检查点，在关键的调度点触发 throttle。

核心代码分析

两个关键的 throttle 触发点：

// 1. 在 enqueue_entity() 中，当 cfs_rq 的第一个任务入队时触发
if (cfs_rq->nr_running == 1) {
    list_add_leaf_cfs_rq(cfs_rq);
    check_enqueue_throttle(cfs_rq);  // <-- 新加入的 throttle 检查
}

// 2. 在 put_prev_entity() 中，当当前任务被换出时触发
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
    update_curr(cfs_rq);
    check_cfs_rq_runtime(cfs_rq);  // <-- 新加入的 throttle 检查
    ...
}

check_cfs_rq_runtime() — 在 put_prev 路径触发：

static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
{
	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
		return;

	/* 防止已 throttled 的 entity 被重复 throttle */
	if (cfs_rq_throttled(cfs_rq))
		return;

	throttle_cfs_rq(cfs_rq);
}

设计意图：throttle 的触发点在 put_prev_entity()（任务让出 CPU 时）和 enqueue_entity()（第一个任务入队时，防止刚 wake up 的 group 立刻超限）。这是最小侵入的设计——不需要在 schedule tick 中额外判断，直接复用调度器的天然调用点。

Commit 8cb120d3e41a — sched: Migrate throttled tasks on HOTPLUG

作者: Paul Turner | 日期: 2011-07-21 | 版本: v3.2-rc1

规模: +27 行, 1 个文件

类型: 🆕 新增功能

变更概述

处理 CPU hotplug 时的特殊情况：当一个 CPU 被 hotplug 移除时，需要 unthrottle 该 CPU 上所有被 throttle 的 cfs_rq，否则其中任务会因无法被调度而丢失。

static void unthrottle_offline_cfs_rqs(struct rq *rq)
{
	for_each_leaf_cfs_rq(rq, cfs_rq) {
		if (!cfs_rq->runtime_enabled)
			continue;
		/* clock_task is not advancing so we just need to make sure
		 * there's some valid quota amount */
		cfs_rq->runtime_remaining = cfs_b->quota;
		if (cfs_rq_throttled(cfs_rq))
			unthrottle_cfs_rq(cfs_rq);
	}
}

阶段性总结

至此，per-CFS_RQ throttle model 的完整架构建立完毕：

                     ┌──────────────┐
                     │ cfs_bandwidth │
                     │  runtime=X   │
                     │  period_timer│
                     └──────┬───────┘
                            │ 分配 slice
        ┌───────────────────┼───────────────────┐
        │                   │                   │
   ┌────▼────┐        ┌────▼────┐        ┌────▼────┐
   │cfs_rq #0│        │cfs_rq #1│        │cfs_rq #2│
   │runtime= │        │runtime= │        │runtime= │
   │remaining│        │remaining│        │remaining│
   │throttled│────────│throttled│────────│throttled│
   └─────────┘ list   └─────────┘ list   └─────────┘
   ┌─────────┐        ┌─────────┐        ┌─────────┐
   │ task A  │        │ task D  │        │ task G  │
   │ task B  │        │ task E  │        │ task H  │
   │ task C  │        │ task F  │        │ task I  │
   └─────────┘        └─────────┘        └─────────┘
   CPU 0              CPU 1              CPU 2

核心特征：

粒度是 cfs_rq（per-CPU per-group）：throttle/unthrottle 操作针对的是整个队列
出队和解出：throttle 时，group entity 在调度树中被 dequeue，所有任务如同消失
层级传播：throttle 通过 group 层级向上传播 h_nr_running 计数的变化
定时器驱动：period timer 是带宽控制和 distribution 的唯一驱动力

设计权衡：这个模型实现简单、语义清晰——"throttled = 不可见"。但缺点是粒度太粗：整个 group 中的所有任务（即使只超额一点点）都被一起暂停，持有锁的任务被 throttle 时会导致连锁的锁持有者优先级反转。这些问题在随后的演进中逐步暴露和解决，最终导致了 task-based throttle model 的诞生。

第 1 章：起源与初始实现 — per-CFS_RQ Throttle Model ​

阶段概览 ​

源码结构变化 ​

关键 Commit 分析 ​

Commit 58088ad0152b — sched: Add a timer to handle CFS bandwidth refresh ​

变更概述 ​

核心代码分析 ​

演进意义 ​

Commit ec12cb7f31e2 — sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth ​

变更概述 ​

核心代码分析 ​

演进意义 ​

Commit 85dac906bec3 — sched: Add support for throttling group entities ​

变更概述 ​

核心代码分析 ​

Commit 671fd9dabe52 — sched: Add support for unthrottling group entities ​

变更概述 ​

核心代码分析 ​

设计意图 ​

Commit d3d9dc330236 — sched: Throttle entities exceeding their allowed bandwidth ​

变更概述 ​

核心代码分析 ​

Commit 8cb120d3e41a — sched: Migrate throttled tasks on HOTPLUG ​

变更概述 ​

阶段性总结 ​

第 1 章：起源与初始实现 — per-CFS_RQ Throttle Model

阶段概览

源码结构变化

关键 Commit 分析

Commit 58088ad0152b — sched: Add a timer to handle CFS bandwidth refresh

变更概述

核心代码分析

演进意义

Commit ec12cb7f31e2 — sched: Accumulate per-cfs_rq cpu usage and charge against bandwidth

变更概述

核心代码分析

演进意义

Commit 85dac906bec3 — sched: Add support for throttling group entities

变更概述

核心代码分析

Commit 671fd9dabe52 — sched: Add support for unthrottling group entities

变更概述

核心代码分析

设计意图

Commit d3d9dc330236 — sched: Throttle entities exceeding their allowed bandwidth

变更概述

核心代码分析

Commit 8cb120d3e41a — sched: Migrate throttled tasks on HOTPLUG

变更概述

阶段性总结