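Chapter 4: Transition to the Task-based Throttle Model

Stage Overview

Background: the per-CFS_RQ model has a fundamental problem. When a task is throttled while holding a kernel lock (such as a percpu_rwsem, or a spinlock on PREEMPT_RT, where spinlocks can sleep), every other task waiting for that lock, possibly on other CPUs, is stuck waiting too. Because the throttled task is no longer on the rq and cannot be scheduled, the lock holder has to wait for a later period to refill quota before it can run again. In production, this lock holder priority inversion has caused a large number of task hung and lockup problems.

Solution: in 2025, Valentin Schneider and Aaron Lu proposed the task-based throttle model (Linux v6.18). The core ideas are:

The whole cfs_rq is no longer removed from the rq: throttling only marks the cfs_rq as throttled and does not dequeue its entities
Tasks are throttled one by one through task work: when a task on a throttled cfs_rq is picked, task work is added (TWA_RESUME) so that the task dequeues itself when it returns to user mode
Tasks keep executing kernel code: a task holding a kernel lock is not throttled before it returns to user mode, so it can finish its critical section and release the lock

This transition spans 9 commits that together form a complete architectural change series.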

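Source Structure Changes

New structs/fields:

task_struct->throttled: marks whether the task has been throttled (per-task)
task_struct->sched_throttle_work: the task work structure; it performs the dequeue on return to user mode
task_struct->throttle_node: per-task list node used to queue the task on throttled_limbo_list
cfs_rq->throttled_limbo_list: per-cfs_rq list holding the throttled tasks
cfs_rq->pelt_clock_throttled: new flag, independent of throttled, indicating whether the PELT clock is frozen

Removed fields/logic:

The enqueue_entity/dequeue_entity walk in throttle_cfs_rq(): the whole hierarchy no longer needs to be dequeued
The corresponding walk in unthrottle_cfs_rq(): the whole hierarchy no longer needs to be re-enqueued
The cfs_rq_throttled break checks in enqueue_task_fair()/dequeue_task_fair(): a throttled cfs_rq no longer stops enqueue/dequeue propagation
The throttled_lb_pair() function: no longer needed, because tasks stay visible on the rq even when their cfs_rq is throttled

Key Commit Analysis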

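Commit 2cd571245b43 — sched/fair: Add related data structure for task based throttle

Author: Valentin Schneider | Date: 2025-08-29 | Version: v6.18

Size: +24 lines, 4 files

Type: 🏗️ Architectural change

Change summary

This is the opening commit of the task-based throttle model; it introduces the core data structures.

Core code analysis

New fields in task_struct: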

// include/linux/sched.h
#ifdef CONFIG_CFS_BANDWIDTH
	struct callback_head		sched_throttle_work;
	struct list_head		throttle_node;
	bool				throttled;
#endif

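The throttled field is a per-task flag, not a per-cfs_rq one. This is the most fundamental shift in the data model: throttle state changes from "a property of the whole queue" to "a property of each individual task".

New throttled_limbo_list in cfs_rq: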

// sched.h
struct cfs_rq {
    // ...
    struct list_head        throttled_limbo_list;  // throttled tasks waiting for unthrottle
};

This is the "limbo" (in-between state) list: a throttled task does not disappear, it waits on this list until the unthrottle.

Initialization function:

static void throttle_cfs_rq_work(struct callback_head *work)
{
	// stub for now
}

void init_cfs_throttle_work(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	/* Point next at itself to protect against double add */
	p->sched_throttle_work.next = &p->sched_throttle_work;
	INIT_LIST_HEAD(&p->throttle_node);
}

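Significance: the core data structures move. The granularity of throttling drops from the cfs_rq to the task.

Commit 7fc2d1439247 — sched/fair: Implement throttle task work and related helpers

Author: Valentin Schneider | Date: 2025-08-29 | Version: v6.18

Size: +65 lines, 1 file

Type: 🏗️ Architectural change

Change summary

Implements the core throttle_cfs_rq_work() function, the execution engine of task-based throttling. It also introduces the task_throttle_setup_work() and task_is_throttled() helpers.

Core code analysis

throttle_cfs_rq_work(), where the throttle actually happens: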

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					    sched_throttle_work);
	struct sched_entity *se;
	struct cfs_rq *cfs_rq;
	struct rq *rq;

	WARN_ON_ONCE(p != current);
	/* Reset the marker so the work can be re-armed later */
	p->sched_throttle_work.next = &p->sched_throttle_work;

	/* An exiting task no longer needs to be throttled */
	if ((p->flags & PF_EXITING))
		return;

	scoped_guard(task_rq_lock, p) {
		se = &p->se;
		cfs_rq = cfs_rq_of(se);

		/* The task has left the fair scheduling class */
		if (p->sched_class != &fair_sched_class)
			return;

		/* The cfs_rq is no longer throttled, nothing to do */
		if (!cfs_rq->throttle_count)
			return;

		rq = scope.rq;
		update_rq_clock(rq);
		WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		p->throttled = true;
		resched_curr(rq);
	}
}

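Design notes:

task_work mechanism: throttle_cfs_rq_work() is registered with task_work_add(p, ..., TWA_RESUME) and runs when the task returns to user mode. A task holding a kernel lock therefore finishes its kernel work before it is throttled, which is exactly what resolves the priority inversion.
scoped_guard(task_rq_lock, p): acquires and releases the task's rq lock automatically, with no explicit unlock
p->throttled = true: set after the dequeue, so that the dequeue path does not mistake the task for one that is already throttled and off the runqueue
resched_curr(rq): forces a reschedule right after the dequeue, so this CPU does not keep spending time slice on the throttled task

task_throttle_setup_work(), which registers the task work: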

static inline void task_throttle_setup_work(struct task_struct *p)
{
	if (task_has_throttle_work(p))
		return;

	/* Kthreads and exiting tasks never return to user mode, so skip them */
	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
		return;

	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}

Double registration is prevented by task_has_throttle_work(), which checks whether the next pointer points at itself (meaning the work is not registered).

Commit e1fad12dcb66 — sched/fair: Switch to task based throttle model

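Author: Valentin Schneider / Aaron Lu | Date: 2025-08-29 | Version: v6.18

Size: +181 / -167 lines, 3 files

Type: 🏗️ Architectural change (core transition)

Change summary

This is the central commit of the series. It essentially rewrites the four core functions throttle_cfs_rq(), unthrottle_cfs_rq(), tg_throttle_down() and tg_unthrottle_up(), along with the throttle-related paths in enqueue_task_fair(), dequeue_task_fair() and pick_task_fair().

The philosophy of the change: from "throttle = dequeue the hierarchy" to "throttle = mark + deferred task work".

Core code analysis

1. Simplification of `throttle_cfs_rq()`

Before (per-CFS_RQ model):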

static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
    // 1. Get the group entity
    se = cfs_rq->tg->se[cpu_of(rq)];
    // 2. Walk up the whole hierarchy calling dequeue_entity
    for_each_sched_entity(se) {
        dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
        qcfs_rq->h_nr_queued -= queued_delta;
        qcfs_rq->h_nr_runnable -= runnable_delta;
        // ...
    }
    // 3. Second walk to update load_avg
    for_each_sched_entity(se) {
        update_load_avg(qcfs_rq, se, 0);
        // ...
    }
    // 4. Decrease nr_running
    sub_nr_running(rq, queued_delta);
}

After (task-based model):

static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
	// ... under the lock, check whether more runtime can still be assigned ...

	if (!dequeue)
		return false;

	/* freeze hierarchy runnable averages while throttled */
	rcu_read_lock();
	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
	rcu_read_unlock();

	cfs_rq->throttled = 1;
	return true;
}

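The difference is large: every dequeue_entity walk is gone. throttle_cfs_rq() now does only two things:

walk_tg_tree_from(... tg_throttle_down ...): updates throttle_count and freezes the PELT clock
cfs_rq->throttled = 1: marks the state

The actual dequeue is deferred until a task is picked and returns to user mode.

2. Changes to `tg_throttle_down()`

Before: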

static int tg_throttle_down(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

	if (!cfs_rq->throttle_count) {
		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
		list_del_leaf_cfs_rq(cfs_rq);
		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
		if (cfs_rq->nr_queued)
			cfs_rq->throttled_clock_self = rq_clock(rq);
	}
	cfs_rq->throttle_count++;
	return 0;
}

After:

static int tg_throttle_down(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

	if (cfs_rq->throttle_count++)
		return 0;

	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
	if (cfs_rq->nr_queued)
		cfs_rq->throttled_clock_self = rq_clock(rq);
	else {
		/*
		 * For a cfs_rq that still has entities queued, the PELT clock
		 * is frozen when all of its entities have been dequeued,
		 * not immediately at throttle time.
		 */
		list_del_leaf_cfs_rq(cfs_rq);
		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
		cfs_rq->pelt_clock_throttled = 1;
	}

	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
	return 0;
}

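Key changes:

New pelt_clock_throttled flag: freezing the PELT clock is no longer synchronous with the throttle. If the cfs_rq still has tasks queued (not yet dequeued), the PELT clock keeps advancing. It is frozen at throttle time only when the cfs_rq has nothing queued (the else branch); otherwise the freeze waits until the task work has dequeued the last task.
list_del_leaf_cfs_rq() is called only when the cfs_rq has no tasks at all.
The new WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list)) ensures a cfs_rq is not throttled again while it still has pending throttled tasks.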

3. Changes to `tg_unthrottle_up()`

Before: decrement throttle_count, unfreeze the PELT clock, and add the leaf cfs_rq back.

After (the most important change):

static int tg_unthrottle_up(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
	struct task_struct *p, *tmp;

	if (--cfs_rq->throttle_count)
		return 0;

	if (cfs_rq->pelt_clock_throttled) {
		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
					     cfs_rq->throttled_clock_pelt;
		cfs_rq->pelt_clock_throttled = 0;
	}

	if (cfs_rq->throttled_clock_self) {
		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
		cfs_rq->throttled_clock_self = 0;
		if (WARN_ON_ONCE((s64)delta < 0))
			delta = 0;
		cfs_rq->throttled_clock_self_time += delta;
	}

	/* Re-enqueue the tasks that have been throttled at this level. */
	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
				 throttle_node) {
		list_del_init(&p->throttle_node);
		p->throttled = false;
		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
	}

	/* Add cfs_rq with load or one or more already running entities */
	if (!cfs_rq_is_decayed(cfs_rq))
		list_add_leaf_cfs_rq(cfs_rq);

	return 0;
}

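Key changes:

Walking throttled_limbo_list: on unthrottle, the list is walked and enqueue_task_fair() is called for each throttled task to put it back on the runqueue. This is the core recovery logic of the task-based model.
The leaf cfs_rq is no longer added unconditionally: all tasks are restored first, and only then is the need for an add evaluated, because the cfs_rq may already have been added through list_add_leaf_cfs_rq in enqueue_entity() during the re-enqueue.
p->throttled = false is set before the enqueue, so the enqueue path is not affected by the throttled flag.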

4. Simplification of `unthrottle_cfs_rq()`

Before: the whole hierarchy had to be walked to re-enqueue entities, with two for_each_sched_entity loops plus load_avg updates.

After: most of the hierarchy walk is removed:

void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);
	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];

	/* Without enough runtime, the cfs_rq cannot be unthrottled */
	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
		return;

	// ... walk_tg_tree_from(tg_unthrottle_up) ... drains throttled_limbo_list

	// no dequeue_entity/enqueue_entity walk any more
	// no load_avg synchronization any more
	// no add_nr_running and similar bookkeeping any more

	assert_list_leaf_cfs_rq(rq);
	if (rq->curr == rq->idle && rq->cfs.nr_queued)
		resched_curr(rq);
}

5. Registering task work in `pick_task_fair()`

This is the trigger point of task-based throttling:

static struct task_struct *pick_task_fair(struct rq *rq)
{
	struct sched_entity *se;
	struct cfs_rq *cfs_rq;
	struct task_struct *p;
	bool throttled;

again:
	cfs_rq = &rq->cfs;
	if (!cfs_rq->nr_queued)
		return NULL;

	throttled = false;

	do {
		if (cfs_rq->curr && cfs_rq->curr->on_rq)
			update_curr(cfs_rq);

		throttled |= check_cfs_rq_runtime(cfs_rq);  // <-- now accumulated with OR

		se = pick_next_entity(rq, cfs_rq);
		if (!se)
			goto again;
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);
	if (unlikely(throttled))
		task_throttle_setup_work(p);  // <-- arm task work for a task picked from a throttled hierarchy
	return p;
}

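The key line: if (unlikely(throttled)) task_throttle_setup_work(p);

When pick_task_fair() selects a task from the hierarchy and any cfs_rq along that path is throttled, task work is registered for the task. The task keeps being scheduled and keeps running until it returns to user mode; only then does throttle_cfs_rq_work() fire and perform the dequeue.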

6. Removing the throttle break from the enqueue/dequeue paths

Before:

enqueue_task_fair() {
    for_each_sched_entity(se) {
        enqueue_entity(cfs_rq, se, flags);
        if (cfs_rq_throttled(cfs_rq))
            break;  // <-- stop at the first throttled cfs_rq
        cfs_rq->h_nr_queued++;
    }
}

After: every if (cfs_rq_throttled(cfs_rq)) break; is removed. Enqueue and dequeue now propagate straight through a throttled boundary.

The enqueue path does gain a new entry check, however:

enqueue_task_fair() {
    // New: a task that is already throttled takes a special path
    if (task_is_throttled(p) && enqueue_throttled_task(p))
        return;  // goes straight back to the limbo list, no real enqueue
    // ...normal enqueue...
}

7. Special handling of throttled tasks in `dequeue_task_fair()`

static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	if (task_is_throttled(p)) {
		dequeue_throttled_task(p, flags);
		return true;
	}
	// ...normal dequeue...
}

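8. New `enqueue_throttled_task()` / `dequeue_throttled_task()`

dequeue_throttled_task() handles the cases where a throttled task must be dequeued because of a migration, a group change, a sched class change and so on. When a throttled task is moved to another cfs_rq (for example by sched_move_task), it has to be removed from throttled_limbo_list.

enqueue_throttled_task() tries a fast path: if the target cfs_rq is throttled as well and the task is not current, the task is added directly to the limbo list, avoiding a real enqueue/dequeue cycle.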

static bool enqueue_throttled_task(struct task_struct *p)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	/*
	 * If the target cfs_rq is throttled too and the task is not current,
	 * put the task straight on the limbo list and skip the real enqueue
	 */
	if (throttled_hierarchy(cfs_rq) &&
	    !task_current_donor(rq_of(cfs_rq), p)) {
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		return true;
	}

	/* The fast path does not apply, do a real enqueue */
	p->throttled = false;
	return false;
}

9. Adjustments to the PELT clock model

In pelt.h, the cfs_rq PELT clock helpers switch from checking throttle_count to checking pelt_clock_throttled:

// before
if (unlikely(cfs_rq->throttle_count))
    throttled = U64_MAX;
// after
if (unlikely(cfs_rq->pelt_clock_throttled))
    throttled = U64_MAX;

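This reflects the decoupling of the PELT clock freeze from the throttle state. A cfs_rq can be throttled (throttle_count > 0) while its PELT clock keeps advancing, because some of its tasks are still running.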

Commit eb962f251fbb — sched/fair: Task based throttle time accounting

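Author: Aaron Lu | Date: 2025-08-29 | Version: v6.18

Type: 🏗️ Architectural change

Change summary

Adapts the throttle time accounting (throttled_clock, throttled_clock_self) to the task-based model. In the old model, throttle time was recorded in throttle_cfs_rq(); in the new model, the recording moves to enqueue_entity() and schedule().

The main change is in enqueue_entity(): when a task is enqueued on a throttled cfs_rq whose PELT clock is frozen, the PELT clock has to be unfrozen again (previously this was handled in tg_unthrottle_up()). The commit also adds the DEQUEUE_THROTTLE flag, used in throttle_cfs_rq_work() to keep the task from being held back by DELAY_DEQUEUE.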

Commit 5b726e9bf954 — sched/fair: Get rid of throttled_lb_pair()

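Author: Aaron Lu | Date: 2025-08-29 | Version: v6.18

Size: +4 / -31 lines, 1 file

Type: 🔧 Refactoring

Change summary

In the task-based model, throttled_lb_pair() is no longer needed. Even when a cfs_rq is throttled, its tasks are still on the rq (a task in the throttled state can be migrated through dequeue_throttled_task()). Load balancing therefore no longer has to check the throttle state of both the migration source and destination.

Removed code: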

// removed from can_migrate_task()
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
    return 0;

// along with the whole throttled_lb_pair() function
static inline int throttled_lb_pair(struct task_group *tg,
				    int src_cpu, int dest_cpu)
{
    struct cfs_rq *src_cfs_rq = tg->cfs_rq[src_cpu];
    struct cfs_rq *dest_cfs_rq = tg->cfs_rq[dest_cpu];

    return throttled_hierarchy(src_cfs_rq) ||
           throttled_hierarchy(dest_cfs_rq);
}

Commit fe8d238e646e — sched/fair: Propagate load for throttled cfs_rq

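Author: Aaron Lu | Date: 2025-09-10 | Version: v6.18

Size: +18 / -8 lines, 1 file

Type: 🐛 Fix

Ensures that the load of a throttled cfs_rq is propagated correctly in PELT. In the old model, the load_avg of a throttled hierarchy was isolated (because its entities had been dequeued); in the new model the tasks are still on the rq, so the load must keep propagating upward.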

Commit fcd394866e3d — sched/fair: update_cfs_group() for throttled cfs_rqs

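Author: Aaron Lu | Date: 2025-09-10 | Version: v6.18

Size: -3 lines, 1 file

Type: 🐛 Fix

Removes the special handling of throttled cfs_rqs in the update_cfs_group() path. In the old model, a throttled cfs_rq did not need its shares updated because its entities were not on the rq. In the new model the entities are still on the rq, so shares must be updated as usual.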

Commit 253b3f587241 — sched/fair: Do not special case tasks in throttled hierarchy

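Author: Aaron Lu | Date: 2025-09-10 | Version: v6.18

Type: 🔧 Refactoring

Removes the special handling of throttled hierarchies in the h_nr_running calculation. In the old model, tasks in a throttled hierarchy were not counted in the parents' h_nr_running; in the new model they still are.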

Commit 0d4eaf8caf8c — sched/fair: Do not balance task to a throttled cfs_rq

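Author: Aaron Lu | Date: 2025-09-12 | Version: v6.18

Size: +18 / -4 lines, 1 file

Type: 🐛 Fix

During load balancing, do not migrate a task to a throttled cfs_rq. The task-based model does allow tasks to sit on a throttled cfs_rq, but actively migrating new tasks there is pointless: they would be throttled right away.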

// new check in can_migrate_task(): test the task's own group cfs_rq on the destination CPU
if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
    return 0;

Design Discussion

Benefits of the task-based throttle model

1. Lock holder priority inversion is eliminated

This is the most important benefit. In the old model:

CPU 0                          CPU 1
[Task A] holds lock X          [Task B] waits for lock X
[Task A] runs out of quota
[Task A] is DEQUEUED (throttled)
                               [Task B] keeps waiting, cannot make progress
                               because X's holder (A) is not on the rq

In the new model:

CPU 0                          CPU 1
[Task A] holds lock X          [Task B] waits for lock X
[Task A] runs out of quota
[Task A] gets task_work added
[Task A] finishes kernel work  [Task B] keeps waiting
[Task A] releases lock X
                               [Task B] acquires lock X, makes progress
[Task A] returns to userspace
[Task A] task_work fires
[Task A] is DEQUEUED (throttled)

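2. Fewer unnecessary context switches

In the old model, a throttle dequeued the whole hierarchy; even when the quota was exceeded by only 1ns, every related task was forced off the CPU.

3. More predictable scheduling latency

A throttled task is paused only after it returns to user mode, so it is never cut off halfway through while holding kernel resources.

Costs and trade-offs

1. Overrun: the task-based model lets a task keep executing a short stretch of kernel code after the throttle. This produces execution time beyond the quota, but the design treats the overrun as bounded (by the length of the kernel path the task runs before returning to user mode).

2. PELT accuracy: because tasks may still be running while their group is throttled, PELT statistics become slightly less accurate. The pelt_clock_throttled mechanism keeps that impact to a minimum.

3. Managing throttled_limbo_list: the new list needs extra maintenance and special handling when a task migrates across groups.

Architecture comparison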

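Dimension	Per-CFS_RQ model	Task-based model
Throttle granularity	Whole cfs_rq (queue level)	Individual task (per-task)
Throttle execution	Synchronous, dequeues the whole hierarchy immediately	Asynchronous, dequeue through task work on return to user mode
Lock holder priority inversion	Severe (the lock holder is dequeued directly)	Eliminated (the holder finishes its work before returning to user mode)
Code complexity	Lower (dequeue the whole tree)	Higher (per-task state and the limbo list must be maintained)
PELT clock	Frozen on throttle, unfrozen on unthrottle	Decoupled: not necessarily frozen on throttle, only when no tasks remain
Load balancing	Migration to/from a throttled cfs_rq is forbidden	Migration out of a throttled cfs_rq is allowed (migration in is still forbidden)
Overrun	None	Bounded overrun (kernel context is not paused immediately)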

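Follow-up improvements

After the task-based model landed, the community kept improving the surrounding code:

956dfda6a708 — "sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining" (fixes unthrottling when runtime is insufficient)
0e4a169d1a2b — "sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled" (handles a group being created inside a throttled hierarchy)
e34881c84c25 — "sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups" (re-evaluates scheduling after tasks are migrated out of a throttled cgroup)

Stage Summary

The task-based throttle model is the most fundamental architectural change to CFS bandwidth control since its introduction in 2011. It is not just an optimization: it changes the semantics and granularity of throttling at the root.

Its central insight is that the goal of throttling is to limit long-term CPU usage, not to cut a task off with nanosecond precision. Letting a task finish its kernel path after being throttled introduces a small overrun, but in exchange it brings a large gain in system stability, along with something developers had wanted nine years earlier but did not implement at the time: throttling that does not cause lock holder priority inversion.

The series was started by Valentin Schneider (data structures, task work, core switch) and completed by Aaron Lu (time accounting, load propagation, compatibility fixes). It represents the kernel scheduler developers' latest thinking and practice on CFS bandwidth control.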
