Exploring how Intel P-state affects CPU frequency limiting on Linux
This post was last updated 855 days ago; the information in it may have since evolved or changed.

This post mainly records what I found while solving a problem; it is not a systematic introduction to the underlying concepts.

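Background

Our lab has recently been studying interference between processes running on different threads of the same core. To control variables, we need to lock the CPU frequency so that latency reductions caused by frequency increases can be ruled out. However, the frequency simply ignored the scaling range I set with cpupower frequency-set and varied freely, which was a real headache. So I started investigating what actually determines how the processor frequency is scaled.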

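In practice, three main factors affect CPU frequency scaling:

  1. The supported P-state range, and whether Hardware P-states (HWP) are supported

  2. The kernel boot parameters; for Intel CPUs this is usually intel_pstate=active|passive|no_hwp|...

  3. The server BIOS settings: whether options such as "hardware frequency control" are enabled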

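So I will first describe the experimental environment. That said, if your CPU is an older model without HWP support, you will most likely never run into this kind of problem, and this post will be of limited help.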
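As a rough way to gather the first two factors on a given machine, the sketch below parses /proc/cpuinfo for the hwp flags and /proc/cmdline for intel_pstate= parameters. The helper names (cpu_flags, intel_pstate_options, read_text) are made up for this illustration, and the presence of the hwp flag only says the CPU advertises HWP, not that the BIOS or kernel has enabled it.

```python
from __future__ import annotations

import re
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flag set of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


def intel_pstate_options(cmdline: str) -> list:
    """Return the value of every intel_pstate= parameter on a kernel command line."""
    return re.findall(r"(?:^|\s)intel_pstate=(\S+)", cmdline)


def read_text(path: str):
    """Read a small text file; return None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    flags = cpu_flags(read_text("/proc/cpuinfo") or "")
    # "hwp" means the CPU advertises HWP; it does not prove HWP is enabled.
    print("CPU advertises HWP:", "hwp" in flags)
    print("CPU advertises HWP EPP:", "hwp_epp" in flags)
    print("intel_pstate boot options:",
          intel_pstate_options(read_text("/proc/cmdline") or ""))
```

The third factor, the BIOS configuration, cannot be read this way and has to be checked in the firmware setup itself.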

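Because the specific problems I hit during the experiment are tedious to recount, I skip most of the details and present the relevant background instead.

About P-states

Hardware

According to Intel's P-State page (intel.com), P-states are the voltage-frequency control states defined in the ACPI specification. They adjust the voltage and frequency at which the CPU runs in order to reduce its power consumption, and the OS can be responsible for selecting them (EIST).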

image-20220509164439302

Newer Intel CPUs, however, introduce Hardware-controlled P-states (HWP), which allow the hardware (the CPU itself) to control P-states on its own. This technology is better known as Speed Shift.

image-20220508175142146

HWP's frequency-selection preference is in turn controlled by the Energy-Performance Preference (EPP) setting, a field of the HWP request register.

image-20220508183403907

image-20220508183319758

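At this point the problem is clear: HWP overrides the OS's control of the CPU frequency, so constraints applied through sysfs lose their effect. According to the kernel documentation, scaling_min_freq and scaling_max_freq in sysfs should limit the frequency, but in my experiments they had no effect at all.

After reading the documentation, I found that imposing a limit requires the Maximum_Performance field of the IA32_HWP_REQUEST register (one per logical processor).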
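To make the register layout concrete, here is a minimal sketch that decodes a raw IA32_HWP_REQUEST value and replaces its Maximum_Performance field. The bit positions follow my reading of the Intel SDM (Minimum 7:0, Maximum 15:8, Desired 23:16, EPP 31:24, Package_Control bit 42) and should be checked against the manual for your CPU. The names HwpRequest, decode_hwp_request and with_maximum_performance, as well as the sample value, are invented for this example. Performance levels are abstract units, not MHz.

```python
from typing import NamedTuple

IA32_HWP_REQUEST = 0x774  # MSR address; one instance per logical processor


class HwpRequest(NamedTuple):
    minimum_performance: int
    maximum_performance: int
    desired_performance: int
    energy_performance_preference: int
    package_control: bool


def decode_hwp_request(raw: int) -> HwpRequest:
    """Split a raw 64-bit IA32_HWP_REQUEST value into its main fields."""
    return HwpRequest(
        minimum_performance=raw & 0xFF,                       # bits 7:0
        maximum_performance=(raw >> 8) & 0xFF,                # bits 15:8
        desired_performance=(raw >> 16) & 0xFF,               # bits 23:16
        energy_performance_preference=(raw >> 24) & 0xFF,     # bits 31:24
        package_control=bool((raw >> 42) & 1),                # bit 42
    )


def with_maximum_performance(raw: int, level: int) -> int:
    """Return raw with only the Maximum_Performance field replaced by level."""
    if not 0 <= level <= 0xFF:
        raise ValueError("performance level must fit in 8 bits")
    return (raw & ~(0xFF << 8)) | (level << 8)


if __name__ == "__main__":
    sample = 0x80002A08  # made-up register value, for illustration only
    print(decode_hwp_request(sample))
    print(hex(with_maximum_performance(sample, 0x18)))
```

Only the value computation is shown here; actually reading or writing the MSR needs privileged access to the machine.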

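Software

The hardware needs a matching driver to control it. Intel Power Management & MSR_SAFE (the slides referenced above) covers the policies in detail; this post only briefly describes the relevant policies in Linux (see "intel_pstate CPU Performance Scaling Driver" in The Linux Kernel documentation).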

With Turbo

  1. intel_pstate with active mode

    In this mode, the frequency-scaling preference is controlled through the EPP/EPB MSRs. Two policies are supported:

    • performance: intel_pstate will write 0 to the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise). This will override the EPP/EPB setting coming from the sysfs interface

      Note that this policy does not always enter the highest allowed P-state; rather, it raises the P-state aggressively.

    • powersave: intel_pstate will set the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was previously set to via sysfs (or whatever default value it was set to by the platform firmware). This usually causes the processor’s internal P-state selection logic to be less performance-focused.

  2. intel_pstate with no_hwp mode

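    This is the default mode for CPUs without HWP support, and also the mode selected by the intel_pstate=no_hwp boot parameter (which does not always take effect: in my experiments, if HWP was enabled in the BIOS, the parameter had no influence on scaling). Two policies are supported: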

    • performance: It selects the maximum P-state it is allowed to use, subject to limits set via sysfs, every time the P-state selection computations are carried out by the driver’s utilization update callback for the given CPU (that does not happen more often than every 10 ms), but the hardware configuration will not be changed if the new P-state is the same as the current one.

    • powersave: It generally selects P-states proportional to the current CPU utilization, so it is referred to as the “proportional” algorithm

  3. intel_pstate with passive mode

    This mode implicitly sets the no_hwp option, and the driver is then called intel_cpufreq.

    • The driver behaves like a regular CPUFreq scaling driver. That is, it is invoked by generic scaling governors when necessary to talk to the hardware in order to change the P-state of a CPU (in particular, the schedutil governor can invoke it directly from scheduler context).

    • While in this mode, intel_pstate can be used with all of the (generic) scaling governors listed by the scaling_available_governors policy attribute in sysfs (and the P-state selection algorithms described above are not used).

    • In other words, in the passive mode the entire range of available P-states is exposed by intel_pstate to the CPUFreq core. However, in this mode the driver does not register utilization update callbacks with the CPU scheduler and the scaling_cur_freq information comes from the CPUFreq core (and is the last frequency selected by the current scaling governor for the given policy).

  4. intel_pstate with per_cpu_perf_limits mode

    This option masks the two global limits, max_perf_pct and min_perf_pct.

Without Turbo

  1. intel_pstate with disable parameter (usually acpi-cpufreq)

    intel_pstate is not used even if the processor supports it. In that case Linux actually falls back to the acpi-cpufreq driver.

    • Apart from the above, acpi-cpufreq works like intel_pstate in the passive mode, except that the number of P-states it can set is limited to the ones listed by the ACPI _PSS objects.

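      In effect, turbo frequencies cannot be selected individually in this configuration, since acpi-cpufreq only sees the P-states listed by _PSS.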
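To tell which of the configurations above a running system is in, one option is to look at two sysfs files: the intel_pstate status attribute (active, passive or off) and the per-policy scaling_driver attribute (intel_pstate, intel_cpufreq or acpi-cpufreq). The sketch below maps those values onto the list above. The function name describe_setup is hypothetical, and whether HWP is really enabled has to be supplied by the caller (for example from bit 0 of the IA32_PM_ENABLE MSR, 0x770), because sysfs alone does not reliably say so.

```python
from pathlib import Path

PSTATE_STATUS = "/sys/devices/system/cpu/intel_pstate/status"
SCALING_DRIVER = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver"


def read_text(path):
    """Read a small sysfs file; return None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def describe_setup(status, scaling_driver, hwp_enabled=None):
    """Map sysfs observations onto the configurations described in this post.

    hwp_enabled is True, False, or None when the caller does not know.
    """
    if scaling_driver == "acpi-cpufreq":
        return "acpi-cpufreq (intel_pstate disabled)"
    if status == "passive" or scaling_driver == "intel_cpufreq":
        return "intel_pstate passive mode (intel_cpufreq)"
    if status == "active":
        if hwp_enabled is None:
            return "intel_pstate active mode (HWP state unknown)"
        if hwp_enabled:
            return "intel_pstate active mode with HWP"
        return "intel_pstate active mode without HWP"
    return "undetermined"


if __name__ == "__main__":
    print(describe_setup(read_text(PSTATE_STATUS), read_text(SCALING_DRIVER)))
```

This is only a classification aid; it does not change any setting.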

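Solution

Because scaling_min_freq and scaling_max_freq had no effect, the only option left is to impose the limit through IA32_HWP_REQUEST, which is a per-thread MSR.

This register is MSR 0x774. Set the Minimum Valid bit and the Minimum_Performance field to raise the lower bound; an upper bound uses the Maximum_Performance field mentioned earlier, together with its Maximum Valid bit.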

image-20220509201456138

image-20220509201509365

image-20220509201615844

Whether the write succeeds depends on the BIOS configuration. If it fails, check that HWP and the related options are enabled.
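One possible way to apply this from user space is through the Linux msr driver, which exposes each logical CPU's MSRs as /dev/cpu/N/msr (root privileges and a loaded msr module are assumed). The sketch below builds a request whose Minimum_Performance and Maximum_Performance are both set to the same level and optionally sets the two Valid bits (63 and 62, per my reading of the Intel SDM). The function names are invented for this example, the level must lie within the range the CPU reports, and some CPUs may reject the Valid bits, so treat this as an untested outline rather than the exact procedure used in the experiment.

```python
import os
import struct

IA32_HWP_REQUEST = 0x774
MINIMUM_VALID = 1 << 63  # assumed bit position, see the Intel SDM
MAXIMUM_VALID = 1 << 62  # assumed bit position, see the Intel SDM


def pinned_request(raw: int, level: int, set_valid_bits: bool = True) -> int:
    """Return raw with Minimum_ and Maximum_Performance both set to level."""
    if not 0 <= level <= 0xFF:
        raise ValueError("performance level must fit in 8 bits")
    value = (raw & ~0xFFFF) | level | (level << 8)  # bits 7:0 and 15:8
    if set_valid_bits:
        value |= MINIMUM_VALID | MAXIMUM_VALID
    return value


def read_msr(cpu: int, address: int) -> int:
    """Read one 64-bit MSR of a logical CPU via /dev/cpu/N/msr."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, address: int, value: int) -> None:
    """Write one 64-bit MSR of a logical CPU via /dev/cpu/N/msr."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)


def pin_cpu(cpu: int, level: int, set_valid_bits: bool = True) -> None:
    """Read-modify-write IA32_HWP_REQUEST so min and max are both level."""
    current = read_msr(cpu, IA32_HWP_REQUEST)
    write_msr(cpu, IA32_HWP_REQUEST,
              pinned_request(current, level, set_valid_bits))
```

A call such as pin_cpu(0, 0x18) would have to be repeated for every logical CPU of interest, since the register is per thread. The other fields of the request, such as Desired_Performance and EPP, are left untouched by this sketch.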
