Introduce core counters into BuiltinMetrics (#179)
Summary:
Pull Request resolved: #179

Core counters are now included in BuiltinMetrics.

An unrelated job (mcrouter-oss) is breaking; it seems fine to bypass for now: https://fb.workplace.com/groups/604299349618684/permalink/25134499986171946/

Reviewed By: briancoutinho

Differential Revision: D49940539

fbshipit-source-id: ae4ecc45f91bca9864000e83cee3cd856fc4ac1c
williamsumendap authored and facebook-github-bot committed Oct 17, 2023
1 parent 950cd3f commit 15116e5
Showing 19 changed files with 379 additions and 156 deletions.
290 changes: 290 additions & 0 deletions hbt/src/perf_event/BuiltinMetrics.cpp

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions hbt/src/perf_event/BuiltinMetrics.h
@@ -12,6 +12,7 @@ namespace facebook::hbt::perf_event {

std::shared_ptr<PmuDeviceManager> makePmuDeviceManager();
std::shared_ptr<Metrics> makeAvailableMetrics();
void addCoreMetrics(std::shared_ptr<Metrics>& metrics);
struct HwCacheEventInfo {
// Not unique; this is an enum of the different cache events.
uint32_t id;
1 change: 0 additions & 1 deletion hbt/src/perf_event/json_events/generated/CpuArch.h
@@ -15,7 +15,6 @@ enum class CpuArch {
NEOVERSE_N2,
NEOVERSE_V2,
AMPERE_ONE,

// AMD Architectures
MILAN,
GENOA,
25 changes: 10 additions & 15 deletions hbt/src/perf_event/json_events/generated/intel/broadwell_core.cpp
@@ -26,8 +26,8 @@ void addEvents(PmuDeviceManager& pmu_manager) {
EventDef::Encoding{
.code = 0x00, .umask = 0x01, .cmask = 0, .msr_values = {0}},
R"(Instructions retired from execution.)",
R"(This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers.
Notes: INST_RETIRED.ANY is counted by a designated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. INST_RETIRED.ANY_P is counted by a programmable counter and it is an architectural performance event.
Counting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.)",
2000003,
std::nullopt, // ScaleUnit
@@ -76,7 +76,7 @@ Counting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as
EventDef::Encoding{
.code = 0x00, .umask = 0x03, .cmask = 0, .msr_values = {0}},
R"(Reference cycles when the core is not in halt state.)",
R"(This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events.
Note: On all current platforms this event stops counting during 'throttling (TM)' states; during duty-off periods the processor is 'halted'. This event is clocked by the base clock (100 MHz) on Sandy Bridge. Because the counter update is done at a lower clock rate than the core clock, the overflow status bit for this counter may appear 'sticky': after the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX, the reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' for bit 34 while the counter value is less than MAX. Software should ignore this case.)",
2000003,
std::nullopt, // ScaleUnit
@@ -617,7 +617,7 @@ See the table of not supported store forwards in the Optimization Guide.)",
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event L2_RQSTS.ALL_CODE_RD is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"L2_RQSTS.ALL_CODE_RD",
@@ -630,7 +630,6 @@ See the table of not supported store forwards in the Optimization Guide.)",
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -1091,7 +1090,7 @@ Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads,
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event L1D.REPLACEMENT is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"L1D.REPLACEMENT",
@@ -1104,7 +1103,6 @@ Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads,
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -1935,7 +1933,7 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
R"(BDM69)"));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event ITLB_MISSES.WALK_COMPLETED is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"ITLB_MISSES.WALK_COMPLETED",
@@ -1947,7 +1945,6 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
std::nullopt, // ScaleUnit
EventDef::IntelFeatures{},
R"(BDM69)"));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -2362,7 +2359,7 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
R"(Uops not delivered to Resource Allocation Table (RAT) per thread when backend of the machine is not stalled)",
R"(This event counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding 4 x when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases when:
a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread;
b. Resource Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions);
c. Instruction Decode Queue (IDQ) delivers four uops.)",
2000003,
std::nullopt, // ScaleUnit
@@ -3175,7 +3172,7 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
EventDef::Encoding{
.code = 0xAB, .umask = 0x02, .cmask = 0, .msr_values = {0}},
R"(Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles.)",
R"(This event counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles. These cycles do not include uops routed through because of the switch itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (IDQ) is full. DSB-to-MITE switch true penalty cycles happen after the merge mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receiving the first MITE uop.
MM is placed before Instruction Decode Queue (IDQ) to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE switch occurs.
Penalty: A Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops are delivered to the IDQ. Most often, such switches from the Decode Stream Buffer (DSB) to the legacy pipeline cost 0-2 cycles.)",
2000003,
@@ -3855,7 +3852,7 @@ Note: Writeback pending FIFO has six entries.)",
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event BR_INST_RETIRED.ALL_BRANCHES is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"BR_INST_RETIRED.ALL_BRANCHES",
@@ -3868,7 +3865,6 @@ Note: Writeback pending FIFO has six entries.)",
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -5141,7 +5137,7 @@ Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event L2_LINES_IN.ALL is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"L2_LINES_IN.ALL",
@@ -5154,7 +5150,6 @@ Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
25 changes: 10 additions & 15 deletions hbt/src/perf_event/json_events/generated/intel/broadwellde_core.cpp
@@ -25,8 +25,8 @@ void addEvents(PmuDeviceManager& pmu_manager) {
EventDef::Encoding{
.code = 0x00, .umask = 0x01, .cmask = 0, .msr_values = {0}},
R"(Instructions retired from execution.)",
R"(This event counts the number of instructions retired from execution. For instructions that consist of multiple micro-ops, this event counts the retirement of the last micro-op of the instruction. Counting continues during hardware interrupts, traps, and inside interrupt handlers.
Notes: INST_RETIRED.ANY is counted by a designated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events. INST_RETIRED.ANY_P is counted by a programmable counter and it is an architectural performance event.
Counting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as retired instructions.)",
2000003,
std::nullopt, // ScaleUnit
@@ -75,7 +75,7 @@ Counting: Faulting executions of GETSEC/VM entry/VM Exit/MWait will not count as
EventDef::Encoding{
.code = 0x00, .umask = 0x03, .cmask = 0, .msr_values = {0}},
R"(Reference cycles when the core is not in halt state.)",
R"(This event counts the number of reference cycles when the core is not in a halt state. The core enters the halt state when it is running the HLT instruction or the MWAIT instruction. This event is not affected by core frequency changes (for example, P states, TM2 transitions) but has the same incrementing frequency as the time stamp counter. This event can approximate elapsed time while the core was not in a halt state. This event has a constant ratio with the CPU_CLK_UNHALTED.REF_XCLK event. It is counted on a dedicated fixed counter, leaving the four (eight when Hyperthreading is disabled) programmable counters available for other events.
Note: On all current platforms this event stops counting during 'throttling (TM)' states; during duty-off periods the processor is 'halted'. This event is clocked by the base clock (100 MHz) on Sandy Bridge. Because the counter update is done at a lower clock rate than the core clock, the overflow status bit for this counter may appear 'sticky': after the counter has overflowed and software clears the overflow status bit and resets the counter to less than MAX, the reset value is not clocked into the counter immediately, so the overflow status bit will flip 'high (1)' and generate another PMI (if enabled), after which the reset value gets clocked into the counter. Therefore, software will get the interrupt and read the overflow status bit as '1' for bit 34 while the counter value is less than MAX. Software should ignore this case.)",
2000003,
std::nullopt, // ScaleUnit
@@ -616,7 +616,7 @@ See the table of not supported store forwards in the Optimization Guide.)",
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event L2_RQSTS.ALL_CODE_RD is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"L2_RQSTS.ALL_CODE_RD",
@@ -629,7 +629,6 @@ See the table of not supported store forwards in the Optimization Guide.)",
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -1090,7 +1089,7 @@ Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads,
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event L1D.REPLACEMENT is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"L1D.REPLACEMENT",
@@ -1103,7 +1102,6 @@ Note: In the L1D, a Demand Read contains cacheable or noncacheable demand loads,
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -1934,7 +1932,7 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
R"(BDM69)"));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event ITLB_MISSES.WALK_COMPLETED is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"ITLB_MISSES.WALK_COMPLETED",
@@ -1946,7 +1944,6 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
std::nullopt, // ScaleUnit
EventDef::IntelFeatures{},
R"(BDM69)"));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -2361,7 +2358,7 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
R"(Uops not delivered to Resource Allocation Table (RAT) per thread when backend of the machine is not stalled)",
R"(This event counts the number of uops not delivered to Resource Allocation Table (RAT) per thread adding 4 x when Resource Allocation Table (RAT) is not stalled and Instruction Decode Queue (IDQ) delivers x uops to Resource Allocation Table (RAT) (where x belongs to {0,1,2,3}). Counting does not cover cases when:
a. IDQ-Resource Allocation Table (RAT) pipe serves the other thread;
b. Resource Allocation Table (RAT) is stalled for the thread (including uop drops and clear BE conditions);
c. Instruction Decode Queue (IDQ) delivers four uops.)",
2000003,
std::nullopt, // ScaleUnit
@@ -3174,7 +3171,7 @@ Note: A prefetch promoted to Demand is counted from the promotion point.)",
EventDef::Encoding{
.code = 0xAB, .umask = 0x02, .cmask = 0, .msr_values = {0}},
R"(Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles.)",
R"(This event counts Decode Stream Buffer (DSB)-to-MITE switch true penalty cycles. These cycles do not include uops routed through because of the switch itself, for example, when Instruction Decode Queue (IDQ) pre-allocation is unavailable, or Instruction Decode Queue (IDQ) is full. DSB-to-MITE switch true penalty cycles happen after the merge mux (MM) receives Decode Stream Buffer (DSB) Sync-indication until receiving the first MITE uop.
MM is placed before Instruction Decode Queue (IDQ) to merge uops being fed from the MITE and Decode Stream Buffer (DSB) paths. Decode Stream Buffer (DSB) inserts the Sync-indication whenever a Decode Stream Buffer (DSB)-to-MITE switch occurs.
Penalty: A Decode Stream Buffer (DSB) hit followed by a Decode Stream Buffer (DSB) miss can cost up to six cycles in which no uops are delivered to the IDQ. Most often, such switches from the Decode Stream Buffer (DSB) to the legacy pipeline cost 0-2 cycles.)",
2000003,
@@ -3854,7 +3851,7 @@ Note: Writeback pending FIFO has six entries.)",
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event BR_INST_RETIRED.ALL_BRANCHES is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"BR_INST_RETIRED.ALL_BRANCHES",
@@ -3867,7 +3864,6 @@ Note: Writeback pending FIFO has six entries.)",
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
@@ -5182,7 +5178,7 @@ Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
// Event L2_LINES_IN.ALL is allowlisted
pmu_manager.addEvent(std::make_shared<EventDef>(
PmuType::cpu,
"L2_LINES_IN.ALL",
@@ -5195,7 +5191,6 @@ Note: Only two data-sources of L1/FB are applicable for AVX-256bit even though
EventDef::IntelFeatures{},
std::nullopt // Errata
));
#endif // HBT_ADD_ALL_GENERATED_EVENTS

#ifdef HBT_ADD_ALL_GENERATED_EVENTS
pmu_manager.addEvent(std::make_shared<EventDef>(
