Exploring how Intel P-states affect CPU frequency limiting on Linux

This post mainly covers what I found while solving a problem; it is not a systematic introduction to the topic.

Background

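Our lab has recently been studying interference between processes running on different threads of the same core. To control variables we needed to lock the CPU frequency, so that lower latency caused by frequency increases would not skew the results. However, the frequency simply did not care about the range I set with cpupower frequency-set and went wherever it pleased, which gave me quite a headache. So I started investigating what actually determines how the processor frequency is selected.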

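In practice, three main factors affect how the CPU frequency is selected:

  1. The supported P-state range, and whether Hardware P-states (HWP) are supported

  2. Kernel boot parameters; on Intel CPUs this is usually intel_pstate=active|passive|no_hwp|......

  3. Server BIOS settings: whether options such as "hardware frequency control" are enabled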
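To see which intel_pstate boot parameters a running system was actually started with, the kernel command line can be inspected. The following is a minimal sketch; the helper name intel_pstate_options is made up for illustration, and it assumes each option is passed as its own intel_pstate=... argument.

```python
from typing import List


def intel_pstate_options(cmdline: str) -> List[str]:
    """Return the value of every intel_pstate=... argument on a kernel command line.

    `cmdline` is the text of /proc/cmdline. The function name is a
    hypothetical helper for this post, not part of any library.
    """
    options = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "intel_pstate" and sep:
            options.append(value)
    return options


# Typical use on a live system (read-only):
#     with open("/proc/cmdline") as f:
#         print(intel_pstate_options(f.read()))
```

An empty result means the driver is running with its defaults for the given CPU and kernel.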

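So let me show my experimental environment first. If your CPU is an older model without HWP support, you will most likely not run into this kind of problem, and this post will be of limited help.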

[collapse title=”Environment” color=”blue”]

CPU details:

 

  └─(18:24:22)──> cat /proc/cpuinfo  
 ...
 model name     : Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz
 stepping       : 7
 microcode       : 0x5003102
 cpu MHz         : 2900.000
 ...
 flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
 bugs           : spectre_v1 spectre_v2 spec_store_bypass swapgs taa itlb_multihit
 bogomips       : 5800.00
 clflush size   : 64
 cache_alignment : 64
 address sizes   : 46 bits physical, 48 bits virtual

 

 

System environment:

 kernel="/boot/vmlinuz-4.18.0-240.el8.x86_64"
 args="ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap rhgb quiet $tuned_params"
 root="/dev/mapper/cl-root"
 initrd="/boot/initramfs-4.18.0-240.el8.x86_64.img $tuned_initrd"
 title="CentOS Linux (4.18.0-240.el8.x86_64) 8"
 id="f9b0118c74204b76a903ababf564e2db-4.18.0-240.el8.x86_64"

 

Server details:

 RH 2288H V5
 iBMC firmware version   5.02 (U4282)
 BIOS version 7.09 (U47)
 CPLD version 2.10 (U4269)
 iBMC primary U-Boot version 2.1.13 (Dec 24 2018 - 20:23:20)
 PCB version   .B
 Board ID   0x0017
 Motherboard vendor Huawei
 Motherboard model BC11SPSCB

[/collapse]
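The flags line in the CPU details above is where HWP support shows up (hwp, hwp_act_window, hwp_epp, hwp_pkg_req). As a rough sketch, the same check can be scripted; the function name hwp_flags is hypothetical, and it simply filters the flags field by the hwp prefix.

```python
def hwp_flags(cpuinfo_text: str) -> set:
    """Collect the HWP-related flags from the text of /proc/cpuinfo.

    Only the first "flags" line is examined, which assumes all logical
    CPUs report the same feature flags.
    """
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "flags":
            return {flag for flag in value.split() if flag.startswith("hwp")}
    return set()


# Typical use on a live system (read-only):
#     with open("/proc/cpuinfo") as f:
#         print(sorted(hwp_flags(f.read())))
```

An empty set suggests the CPU (or the way the firmware exposes it) does not offer HWP, in which case the rest of this post probably does not apply.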

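The specific problems that came up in the experiments are too tedious to describe, so I will skip most of the details and go straight to the relevant background.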

About P-states

Hardware

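According to P-State (intel.com), a P-state is a voltage-frequency control state defined in the ACPI specification. P-states adjust the voltage and frequency at which the CPU runs in order to reduce its power consumption, and the OS can be in charge of selecting them (EIST).

Newer Intel CPUs, however, introduce Hardware-controlled P-States (HWP), which let the hardware (that is, the CPU itself) control P-states on its own. This technology is better known as Speed Shift.

How HWP prefers to adjust the frequency is in turn controlled by the Energy-Performance Preference (EPP) register.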

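At this point the problem is clear: HWP overrides the OS's control over the CPU frequency, so constraints applied through sysfs stop working. According to the kernel documentation, scaling_min_freq and scaling_max_freq in sysfs should limit the frequency, but in practice they had no effect at all.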
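One way to observe this is to compare the limits configured in sysfs with the frequency the kernel currently reports. The sketch below assumes the standard cpufreq policy directory for cpu0; the function names are made up for illustration, and scaling_cur_freq is only a snapshot, so it is worth sampling several times under load.

```python
from pathlib import Path


def read_khz(cpufreq_dir, name):
    """Read one cpufreq attribute (a frequency in kHz) as an integer."""
    return int((Path(cpufreq_dir) / name).read_text().strip())


def check_limits(cpufreq_dir="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Report whether the current frequency lies inside the sysfs limits."""
    lo = read_khz(cpufreq_dir, "scaling_min_freq")
    hi = read_khz(cpufreq_dir, "scaling_max_freq")
    cur = read_khz(cpufreq_dir, "scaling_cur_freq")
    return {
        "min_khz": lo,
        "max_khz": hi,
        "cur_khz": cur,
        "within_limits": lo <= cur <= hi,
    }
```

If within_limits keeps coming back False after cpupower frequency-set, the limits are being ignored in the way described above.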

After reading the documentation, I found that the limit has to be applied through the Maximum_Performance field of the IA32_HWP_REQUEST register (one per logical processor).
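To make the register layout concrete, here is a small sketch that splits a raw IA32_HWP_REQUEST value into its low four byte-wide fields. The bit positions follow my reading of the Intel SDM (Minimum_Performance in bits 7:0, Maximum_Performance in 15:8, Desired_Performance in 23:16, EPP in 31:24); the function name and the sample value are made up for illustration.

```python
def decode_hwp_request(value: int) -> dict:
    """Split a raw IA32_HWP_REQUEST value into its low four 8-bit fields.

    The performance fields are abstract performance levels, not
    frequencies in Hz.
    """
    return {
        "minimum_performance": value & 0xFF,
        "maximum_performance": (value >> 8) & 0xFF,
        "desired_performance": (value >> 16) & 0xFF,
        "energy_performance_preference": (value >> 24) & 0xFF,
    }


# Hypothetical raw value, for illustration only:
#     decode_hwp_request(0x80002E08)
```

The raw value can be obtained, for example, with rdmsr from msr-tools (rdmsr -p 0 0x774) after loading the msr kernel module.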

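Software

The hardware needs a matching driver to control it. Intel Power Management & MSR_SAFE (the slides referenced earlier) describes the policies in detail; this post only briefly covers the relevant policies on Linux (intel_pstate CPU Performance Scaling Driver, in The Linux Kernel documentation).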

With Turbo

  1. intel_pstate with active mode

    In this mode the frequency-selection preference is controlled through the EPP/EPB MSRs. Two policies are supported:

    • performance: intel_pstate will write 0 to the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise). This will override the EPP/EPB setting coming from the sysfs interface

      Note that this policy does not always enter the highest allowed P-state; rather, it raises the P-state aggressively.

    • powersave: intel_pstate will set the processor’s Energy-Performance Preference (EPP) knob (if supported) or its Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was previously set to via sysfs (or whatever default value it was set to by the platform firmware). This usually causes the processor’s internal P-state selection logic to be less performance-focused.

  2. intel_pstate with no_hwp mode

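    This is the default mode for CPUs without HWP support, and also the mode requested by the intel_pstate=no_hwp kernel boot parameter (which does not always take effect: in my experiments, when HWP was enabled in the BIOS, the parameter had no influence on frequency selection). Two policies are supported: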

    • performance: It selects the maximum P-state it is allowed to use, subject to limits set via sysfs, every time the P-state selection computations are carried out by the driver’s utilization update callback for the given CPU (that does not happen more often than every 10 ms), but the hardware configuration will not be changed if the new P-state is the same as the current one.

    • powersave: It generally selects P-states proportional to the current CPU utilization, so it is referred to as the “proportional” algorithm

  3. intel_pstate with passive mode

    This mode implicitly sets the no_hwp parameter, and the driver is then known as intel_cpufreq.

    • The driver behaves like a regular CPUFreq scaling driver. That is, it is invoked by generic scaling governors when necessary to talk to the hardware in order to change the P-state of a CPU (in particular, the schedutil governor can invoke it directly from scheduler context).

    • While in this mode, intel_pstate can be used with all of the (generic) scaling governors listed by the scaling_available_governors policy attribute in sysfs (and the P-state selection algorithms described above are not used).

    • In other words, in the passive mode the entire range of available P-states is exposed by intel_pstate to the CPUFreq core. However, in this mode the driver does not register utilization update callbacks with the CPU scheduler and the scaling_cur_freq information comes from the CPUFreq core (and is the last frequency selected by the current scaling governor for the given policy).

  4. intel_pstate with per_cpu_perf_limits mode

    This mode disables the two global limits, max_perf_pct and min_perf_pct.

Without Turbo

  1. intel_pstate with disable parameter (usually acpi-cpufreq)

    Do not use intel_pstate even if the processor supports it. In practice Linux then falls back to the acpi-cpufreq scaling driver.

    • Apart from the above, acpi-cpufreq works like intel_pstate in the passive mode, except that the number of P-states it can set is limited to the ones listed by the ACPI _PSS objects.

      In effect, the CPU does not support turbo boost in this configuration.
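Which of the modes listed above a machine is actually in can be read back from sysfs. The sketch below assumes the intel_pstate status attribute and the per-policy scaling_driver and scaling_governor attributes of cpu0; the function name scaling_setup is made up, and missing files are reported as None rather than treated as errors.

```python
from pathlib import Path


def scaling_setup(sysfs_cpu="/sys/devices/system/cpu"):
    """Summarize which scaling driver and mode are in use for cpu0."""
    root = Path(sysfs_cpu)

    def read(rel):
        path = root / rel
        return path.read_text().strip() if path.exists() else None

    return {
        # "active", "passive" or "off"; None if the attribute is absent
        "intel_pstate_status": read("intel_pstate/status"),
        # e.g. "intel_pstate", "intel_cpufreq" or "acpi-cpufreq"
        "scaling_driver": read("cpu0/cpufreq/scaling_driver"),
        "scaling_governor": read("cpu0/cpufreq/scaling_governor"),
    }
```

Checking this after every reboot is a cheap way to confirm that a boot parameter actually changed the mode before drawing conclusions from frequency measurements.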

Solution

Because scaling_min_freq and scaling_max_freq had no effect, the only remaining option is to apply the limit through IA32_HWP_REQUEST, a per-thread MSR.

The register is MSR 0x774; you need to set the Minimum Valid bit and the Minimum_Performance field.

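Whether the setting succeeds depends on the BIOS configuration. If it fails, check that HWP is enabled and review the other related options.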
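As a rough sketch of the idea, the code below builds a new IA32_HWP_REQUEST value and shows how it could be written through the Linux msr driver (/dev/cpu/N/msr, which needs root and the msr module). Several things here are my assumptions rather than the original procedure: pinning both Minimum_Performance and Maximum_Performance to the same level, the valid-bit positions (63 for Minimum Valid, 62 for Maximum Valid, per my reading of the Intel SDM), and all function names. Treat it as something to adapt, not a verified recipe.

```python
import os
import struct

IA32_HWP_REQUEST = 0x774
MINIMUM_VALID = 1 << 63  # assumed bit position, see lead-in
MAXIMUM_VALID = 1 << 62  # assumed bit position, see lead-in


def pin_performance(value: int, level: int) -> int:
    """Return `value` with Minimum_/Maximum_Performance both set to `level`.

    `level` is an abstract HWP performance level (8 bits), not a frequency.
    All other fields of the register are left untouched.
    """
    if not 0 <= level <= 0xFF:
        raise ValueError("performance level must fit in 8 bits")
    value &= ~0xFFFF               # clear bits 15:0 (min and max fields)
    value |= level | (level << 8)  # min in bits 7:0, max in bits 15:8
    return value | MINIMUM_VALID | MAXIMUM_VALID


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR of one logical CPU through the msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR of one logical CPU through the msr driver."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


# Hypothetical use for logical CPU 0 (level 29 is only an example):
#     old = read_msr(0, IA32_HWP_REQUEST)
#     write_msr(0, IA32_HWP_REQUEST, pin_performance(old, 29))
```

Since the register is per logical processor, the write has to be repeated for every CPU that takes part in the experiment, and it is worth reading the value back afterwards to confirm it stuck.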
