This post mainly records what I found while solving a problem; it is not a systematic introduction to the topic.
Background
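Our lab has recently been studying interference between processes running on different threads of the same core. To control variables, I needed to lock the CPU frequency so that latency reductions caused by frequency increases could be ruled out. However, the frequency completely ignored the range I set with cpupower frequency-set and kept drifting freely, which was a real headache. So I started investigating what actually determines how the processor frequency is scaled.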
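In practice, three main factors affect CPU frequency scaling:
- The mechanism the CPU supports: P-state or Hardware P-state (HWP)
- The kernel boot parameters; for Intel CPUs this is usually intel_pstate=active|passive|no_hwp|......
- The server BIOS settings: whether options such as "hardware frequency control" are enabled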
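So I will first show the experimental environment. If your CPU is older and does not support HWP, you will most likely not run into this problem, and this post will be of limited help.
The specific issues that came up during the experiments are tedious to describe, so I omit most of the details and go straight to the relevant background.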
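Whether the three factors above apply to a given machine can be checked from userspace. The following is a minimal sketch, assuming a Linux system where /proc/cpuinfo lists the hwp and hwp_epp CPU flags and where the intel_pstate driver exposes /sys/devices/system/cpu/intel_pstate/status; the helper names (cpu_flags, read_or_none, describe_environment) are made up for illustration. BIOS settings cannot be read this way and still have to be checked in the firmware setup.

```python
from pathlib import Path
from typing import Optional


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the flag set of the first logical CPU found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def read_or_none(path: str) -> Optional[str]:
    """Read a small text file, returning None if it is missing or unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def describe_environment(cpuinfo_text: str, status: Optional[str], cmdline: Optional[str]) -> dict:
    """Summarize HWP support, the intel_pstate status, and related boot arguments."""
    flags = cpu_flags(cpuinfo_text)
    args = [a for a in (cmdline or "").split() if a.startswith("intel_pstate=")]
    return {
        "hwp_supported": "hwp" in flags,
        "hwp_epp_supported": "hwp_epp" in flags,
        "intel_pstate_status": status,  # "active", "passive", "off", or None if absent
        "intel_pstate_args": args,
    }


if __name__ == "__main__":
    env = describe_environment(
        read_or_none("/proc/cpuinfo") or "",
        read_or_none("/sys/devices/system/cpu/intel_pstate/status"),
        read_or_none("/proc/cmdline"),
    )
    for key, value in env.items():
        print(f"{key}: {value}")
```

The parsing is kept separate from the file reads so that it can be exercised with sample text on a machine without these files.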
About P-states
Hardware
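P-states are the voltage-frequency control states defined in the ACPI specification. They are used to adjust the voltage and frequency at which the CPU runs in order to reduce power consumption, and this adjustment can be handled by the OS.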
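Newer Intel CPUs, however, introduce Hardware-controlled P-states (HWP), which let the hardware (the CPU itself) choose P-states on its own (the technology is better known as Intel Speed Shift).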
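The preference HWP follows when adjusting the frequency is controlled by the Energy-Performance Preference (EPP) value, which is a field of the IA32_HWP_REQUEST register.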
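At this point the problem is clear: HWP control overrides the OS's control of the CPU frequency, so the constraints applied through sysfs stop working. According to the kernel documentation, scaling_min_freq and scaling_max_freq in sysfs should limit the frequency, but in my experiments they had no effect at all.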
After reading the documentation, I found that the limit has to be applied through the Maximum_Performance field of the IA32_HWP_REQUEST register (one per logical processor).
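To make the register layout concrete, here is a minimal sketch that decodes a raw IA32_HWP_REQUEST value and rewrites its Maximum_Performance field. The bit positions follow the Intel SDM layout as I understand it (bits 7:0 Minimum_Performance, 15:8 Maximum_Performance, 23:16 Desired_Performance, 31:24 Energy_Performance_Preference); the names HwpRequest, decode_hwp_request and with_maximum_performance are hypothetical. The performance levels are abstract values, not MHz; the range a CPU accepts is reported by IA32_HWP_CAPABILITIES (0x771).

```python
from typing import NamedTuple

IA32_HWP_REQUEST = 0x774  # per logical processor MSR


class HwpRequest(NamedTuple):
    minimum_performance: int
    maximum_performance: int
    desired_performance: int
    energy_performance_preference: int


def decode_hwp_request(raw: int) -> HwpRequest:
    """Split the low 32 bits of a raw MSR value into its four 8-bit fields."""
    return HwpRequest(
        minimum_performance=raw & 0xFF,
        maximum_performance=(raw >> 8) & 0xFF,
        desired_performance=(raw >> 16) & 0xFF,
        energy_performance_preference=(raw >> 24) & 0xFF,
    )


def with_maximum_performance(raw: int, level: int) -> int:
    """Return raw with only the Maximum_Performance field (bits 15:8) replaced."""
    if not 0 <= level <= 0xFF:
        raise ValueError("performance level must fit in 8 bits")
    return (raw & ~(0xFF << 8)) | (level << 8)


# Example with a made-up raw value: min 0x08, max 0x2A, desired 0, EPP 0x80.
fields = decode_hwp_request(0x80002A08)
capped = with_maximum_performance(0x80002A08, 0x14)
```

Only the field being changed is touched, so the EPP and minimum values already programmed by the driver or firmware are preserved.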
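Software
The hardware needs a matching driver to control it. The kernel documentation describes the policies in detail; this post only briefly covers the relevant policies on Linux.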
With Turbo
- intel_pstate with active mode (HWP enabled)
  In this mode the frequency-scaling preference is controlled through the EPP/EPB MSRs. Two policies are supported:
  - performance: intel_pstate will write 0 to the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise). This will override the EPP/EPB setting coming from the sysfs interface.
    Note that this policy does not always enter the highest allowed P-state; rather, it raises the P-state aggressively.
  - powersave: intel_pstate will set the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was previously set to via sysfs (or whatever default value it was set to by the platform firmware). This usually causes the processor’s internal P-state selection logic to be less performance-focused.
- intel_pstate with no_hwp mode
  This is the default mode for CPUs that do not support HWP, and also the mode selected by the intel_pstate=no_hwp kernel boot parameter (which may not take effect: in my experiments, when HWP was enabled in the BIOS, this parameter had no influence on frequency scaling). Two policies are supported:
  - performance: It selects the maximum P-state it is allowed to use, subject to limits set via sysfs, every time the P-state selection computations are carried out by the driver’s utilization update callback for the given CPU (that does not happen more often than every 10 ms), but the hardware configuration will not be changed if the new P-state is the same as the current one.
  - powersave: It generally selects P-states proportional to the current CPU utilization, so it is referred to as the “proportional” algorithm.
- intel_pstate with passive mode
  This mode implies the no_hwp parameter (on older kernels; since Linux 5.9 passive mode can also run with HWP enabled), and the driver is then called intel_cpufreq.
  - The driver behaves like a regular CPUFreq scaling driver. That is, it is invoked by generic scaling governors when necessary to talk to the hardware in order to change the P-state of a CPU (in particular, the schedutil governor can invoke it directly from scheduler context).
  - While in this mode, intel_pstate can be used with all of the (generic) scaling governors listed by the scaling_available_governors policy attribute in sysfs (and the P-state selection algorithms described above are not used).
  - In other words, in the passive mode the entire range of available P-states is exposed by intel_pstate to the CPUFreq core. However, in this mode the driver does not register utilization update callbacks with the CPU scheduler and the scaling_cur_freq information comes from the CPUFreq core (and is the last frequency selected by the current scaling governor for the given policy).
- intel_pstate with per_cpu_perf_limits mode
  This mode hides the two global limits max_perf_pct and min_perf_pct.
Without Turbo
- intel_pstate with the disable parameter (usually acpi-cpufreq)
  intel_pstate is not used even if the processor supports it. In practice Linux then selects the acpi-cpufreq driver.
  - Apart from the above, acpi-cpufreq works like intel_pstate in the passive mode, except that the number of P-states it can set is limited to the ones listed by the ACPI _PSS objects.
  In effect, individual turbo P-states are not available in this configuration.
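Which of the configurations above is actually in effect can be read back from sysfs. A minimal sketch, assuming the standard cpufreq attribute scaling_driver under /sys/devices/system/cpu/cpu0/cpufreq and the driver names mentioned in this section (intel_pstate, intel_cpufreq, acpi-cpufreq); the function names are hypothetical. Note that scaling_driver alone does not tell whether HWP is enabled.

```python
from pathlib import Path
from typing import Optional

CPUFREQ_DIR = "/sys/devices/system/cpu/cpu0/cpufreq"


def classify_driver(scaling_driver: str) -> str:
    """Map the scaling_driver string to the configuration it corresponds to."""
    name = scaling_driver.strip()
    if name == "intel_pstate":
        return "intel_pstate, active mode (built-in performance/powersave policies)"
    if name == "intel_cpufreq":
        return "intel_pstate, passive mode (generic CPUFreq governors)"
    if name == "acpi-cpufreq":
        return "acpi-cpufreq (intel_pstate disabled or unsupported)"
    return f"other driver: {name or 'unknown'}"


def read_attr(name: str, base: str = CPUFREQ_DIR) -> Optional[str]:
    """Read one cpufreq policy attribute, returning None if it is unavailable."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None


def current_setup() -> dict:
    """Collect the driver classification and the active governor for cpu0."""
    driver = read_attr("scaling_driver")
    return {
        "driver": classify_driver(driver or ""),
        "governor": read_attr("scaling_governor"),
    }
```

Calling current_setup() on the test machine before and after changing the boot parameters is a quick way to confirm that a parameter such as intel_pstate=passive really took effect.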
Solution
Because scaling_min_freq and scaling_max_freq had no effect, the only option left is to apply the limit through IA32_HWP_REQUEST, a per-thread MSR.
The register address is 0x774. The Minimum Valid bit (bit 63) and the Minimum_Performance field (bits 7:0) need to be set.
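A minimal sketch of such a write, assuming the Linux msr kernel module is loaded (so /dev/cpu/N/msr exists) and the script runs as root. Setting both Minimum_Performance and Maximum_Performance to the same level, and also setting the Maximum Valid bit (bit 62), is my own assumption about how to pin the frequency; the text above only names the minimum side. On CPUs where the valid bits are reserved the write may fail, in which case set_valid_bits=False can be tried. All function names are hypothetical.

```python
import os
import struct

IA32_HWP_REQUEST = 0x774
MIN_VALID = 1 << 63  # Minimum Valid bit
MAX_VALID = 1 << 62  # Maximum Valid bit (assumption: set together with the minimum)


def pinned_request(raw: int, level: int, set_valid_bits: bool = True) -> int:
    """Return raw with Minimum_ and Maximum_Performance both set to level."""
    if not 0 <= level <= 0xFF:
        raise ValueError("performance level must fit in 8 bits")
    value = (raw & ~0xFFFF) | level | (level << 8)
    if set_valid_bits:
        value |= MIN_VALID | MAX_VALID
    return value & 0xFFFFFFFFFFFFFFFF


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR of one logical CPU through the msr device."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR of one logical CPU through the msr device."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def pin_cpu(cpu: int, level: int) -> None:
    """Read-modify-write IA32_HWP_REQUEST so other fields (e.g. EPP) are kept."""
    raw = read_msr(cpu, IA32_HWP_REQUEST)
    write_msr(cpu, IA32_HWP_REQUEST, pinned_request(raw, level))


# Usage (as root, per logical CPU): pin_cpu(0, 0x14)
```

Because the register is per logical processor, the call has to be repeated for every thread under test, and the value should be re-checked after changing governor or policy settings, since the driver may rewrite the register.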