Contributors: Kyriafinis Vasilis, Nikolaos Giannopoulos
Winter Semester 2021 - 2022
In the document that presents it, McPAT is described as an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap, including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with multiple performance simulators.
McPAT advances the state of the art in several directions compared to Wattch, which is the current standard for power research. In particular, McPAT is an integrated power, area, and timing modeling framework that enables architects to use new metrics combining performance with both power and area, such as the energy-delay-area^2 product (EDA^2P) and the energy-delay-area product (EDAP), which are useful for quantifying the cost of new architectural ideas. McPAT specifies the low-level design parameters of regular components (e.g. interconnects, caches, and other array-based structures) based on high-level constraints (clock rate and optimization target) given by the user, ensuring that the user is always modeling a reasonable design. This approach enables users, if they choose, to ignore many of the low-level details of the circuits being modeled.
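The following table shows the processors originally used to validate McPAT, comparing the published power and area figures with McPAT's results:
Processor | Published total power / area | McPAT results (power / area) | McPAT error in % (power / area) |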
---|---|---|---|
Niagara | 63 W / 378 mm^2 | 56.17 W / 295 mm^2 | -10.84 / -21.8 |
Niagara2 | 84 W / 342 mm^2 | 69.70 W / 248 mm^2 | -17.02 / -27.3 |
Alpha 21364 | 125 W / 396 mm^2 | 97.9 W / 324 mm^2 | -21.68 / -18.2 |
Xeon Tulsa | 150 W / 435 mm^2 | 116.08 W / 362 mm^2 | -22.61 / -16.7 |
Wattch (Wattch: a framework for architectural-level power analysis and optimizations) is a widely used processor power estimation tool. Wattch calculates dynamic power dissipation from switching events obtained from an architectural simulation and from capacitance models of the components of the microarchitecture. For array structures, Wattch uses capacitance models from CACTI, and for the pipeline it uses models from "Complexity-Effective Superscalar Processors". When modeling out-of-order processors, Wattch uses the synthetic RUU model that is tightly coupled to the SimpleScalar simulator (The simplescalar tool set, version 2.0). Wattch has enabled the computer architecture research community to explore power-efficient design options. As technology has progressed, however, its limitations have become apparent. First, Wattch models power without considering timing and area. Second, Wattch only models dynamic power consumption; the HotLeakage package ("HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects") partially addressed this deficiency by adding models for subthreshold leakage. Third, Wattch uses simple linear scaling models based on 0.8μm technology, which are too inaccurate to make predictions for current and future deep-submicron technology nodes.
McPAT's XML interface allows both the specification of the static microarchitecture configuration parameters and the passing of dynamic activity statistics generated by the performance simulator. McPAT can also send runtime power dissipation back to the performance simulator through the XML-based interface, so that the performance simulator can react to power (or even temperature) data. This approach makes McPAT very flexible and easy to port to other performance simulators. McPAT runs separately from the simulator and only reads performance statistics from it. The overhead on the performance simulator is minor: at most, some performance counters need to be added. Since McPAT provides complete hierarchical models from the architecture level down to the technology level, the XML interface also contains circuit implementation style and technology parameters that are specific to a particular target processor. Examples are array types, crossbar types, and the CMOS technology generation with its associated voltage and device types.
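The split between static configuration and dynamic activity can be illustrated with a small Python sketch that reads a McPAT-style XML description. The fragment below is hand-written for illustration: its layout (nested `component` elements with `param` and `stat` children) follows the files in ProcessorDescriptionFiles, but the specific names and values are assumptions, and the `collect` helper is hypothetical rather than part of McPAT.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: names and values are assumptions chosen for
# the example, not copied from a real processor description file.
EXAMPLE_XML = """
<component id="root" name="root">
  <component id="system" name="system">
    <param name="number_of_cores" value="2"/>
    <param name="core_tech_node" value="40"/>
    <stat name="total_cycles" value="100000"/>
    <component id="system.core0" name="core0">
      <param name="clock_rate" value="2000"/>
      <stat name="total_instructions" value="400000"/>
    </component>
  </component>
</component>
"""


def collect(root, tag):
    """Return {component_id: {name: value}} for every direct <param> or <stat> child."""
    result = {}
    for component in root.iter("component"):
        entries = {child.get("name"): child.get("value")
                   for child in component.findall(tag)}
        if entries:
            result[component.get("id")] = entries
    return result


root = ET.fromstring(EXAMPLE_XML)
static_config = collect(root, "param")      # microarchitecture and technology settings
dynamic_activity = collect(root, "stat")    # counters supplied by the performance simulator

for component_id, params in static_config.items():
    print(component_id, "params:", params)
for component_id, stats in dynamic_activity.items():
    print(component_id, "stats:", stats)
```

In a real flow the `stat` values would be filled in from the performance simulator's output before McPAT is run on the file.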
Switching circuits also dissipate short-circuit power due to a momentary short through the pull-up and pull-down devices. We compute the short-circuit power using the equations derived in the work by Nose (Analysis and Future Trend of Short-circuit Power), which predicts trends for short-circuit power. If the ratio of the threshold voltage to the supply voltage shrinks, short-circuit power becomes more significant. Short-circuit power is around 10% of the total dynamic power, with fluctuations within 3.1% across all the technology generations. The main reason for the stable short-circuit power is that we use ITRS technology models, which have stable Vth to Vdd ratios.
Gate leakage is an important component in 90nm and 65nm technology, accounting for 37.6% of the total leakage power at 65nm. High-k metal gate transistors (45nm High-k+Metal Gate Strain-Enhanced Transistors) are introduced at 45nm, which reduces the gate leakage by more than 90%. SOI technology and double-gate (DG) devices, which are used at 32nm and 22nm, also help to keep the subthreshold leakage under control.
All the results depend on the runtime of the program and the load it creates. There is also a constant power consumption due to the short circuits and the leakage power of the technology used by the processor, which we consider as "constant power loss".
The answer can be found by looking at the performance per watt of each processor. The second processor draws 35/25 = 1.4 times more power. If that processor could perform more than 1.4 times as many operations per second as the first processor, it would make more sense to use it despite its higher power draw, since it would consume less energy for the same work by finishing more than 1.4 times faster. For McPAT to answer this question, information about the performance of each processor would need to be provided. Finally, another consideration is the battery capacity at different output power levels. Batteries tend to be less efficient the more power they need to provide. With that in mind, a 25W processor could be preferable even if it were 1.4 times slower than a 35W one.
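The break-even argument above can be checked with a short Python sketch based on energy = power × time. The 10-second baseline runtime and the list of speedups are hypothetical values chosen only for illustration, and the power figures are treated as constant average power, which is a simplification.

```python
def energy_joules(power_watts, runtime_seconds):
    """Energy consumed at a constant average power over the given runtime."""
    return power_watts * runtime_seconds


def faster_chip_saves_energy(p_slow_watts, p_fast_watts, speedup):
    """True when the higher-power chip uses less energy for the same work.

    speedup is runtime_slow / runtime_fast; the break-even point is the
    power ratio p_fast / p_slow.
    """
    return speedup > p_fast_watts / p_slow_watts


BASE_RUNTIME_S = 10.0  # hypothetical runtime of the workload on the 25 W processor

for speedup in (1.2, 1.4, 1.6):
    e25 = energy_joules(25.0, BASE_RUNTIME_S)
    e35 = energy_joules(35.0, BASE_RUNTIME_S / speedup)
    verdict = faster_chip_saves_energy(25.0, 35.0, speedup)
    print(f"speedup {speedup}: 25 W -> {e25:.1f} J, 35 W -> {e35:.1f} J, "
          f"35 W chip saves energy: {verdict}")
```

Battery efficiency at higher discharge power is not modeled here; accounting for it would push the break-even speedup above 1.4.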
Running the following commands in the mcpat/mcpat folder:
./mcpat -infile ProcessorDescriptionFiles/Xeon.xml -print_level 2
./mcpat -infile ProcessorDescriptionFiles/ARM_A9_2GHz.xml -print_level 2
produces the results for the Xeon and the ARM A9. The table below summarizes the differences between the two.
Processor | Area | Peak Power | Total Leakage | Peak Dynamic | Subthreshold Leakage | Gate Leakage | Runtime Dynamic | Subthreshold Leakage with power gating |
---|---|---|---|---|---|---|---|---|
Xeon | 410.507 mm^2 | 134.938 W | 36.8319 W | 98.1063 W | 35.1632 W | 1.66871 W | 72.9199 W | 16.3977 W |
ARM A9 | 5.39698 mm^2 | 1.74189 W | 0.108687 W | 1.6332 W | 0.0523094 W | 0.0563774 W | 2.96053 W | - |
The metric that can be used to compare the two completely different processors is performance per watt, and more specifically FLOPS per watt. This metric shows how many floating-point operations per second the processor can perform for each watt it draws. The absolute FLOPS of the two processors are unknown, but they are not required since only the ratio between the two matters. It is given that the Xeon is 40 times faster than the ARM A9, which means that the Xeon performs in general 40 times more FLOPS than the ARM A9. Using McPAT, the peak power was found to be 134.9W for the Xeon and 1.74W for the ARM A9. Dividing the two (134.9/1.74) gives 77.53. This result shows that the Xeon would have to be 77.53 times faster than the A9 for the two processors to be equally energy efficient. Since the Xeon is only 40 times faster, it is less energy efficient than the A9 at peak power.
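The same comparison can be written as a few lines of Python. The sketch uses the rounded peak power values quoted in the text and the given speedup of 40; the variable names are chosen for this example only.

```python
XEON_PEAK_W = 134.9    # peak power reported by McPAT, rounded as in the text
A9_PEAK_W = 1.74       # peak power reported by McPAT, rounded as in the text
XEON_SPEEDUP = 40.0    # given: the Xeon is 40 times faster than the ARM A9

# Speedup the Xeon would need to match the A9 in performance per watt.
break_even_speedup = XEON_PEAK_W / A9_PEAK_W

# Performance per watt of the Xeon relative to the A9 (1.0 means equal).
relative_efficiency = XEON_SPEEDUP / break_even_speedup

print(f"break-even speedup: {break_even_speedup:.2f}")
print(f"Xeon performance per watt relative to A9: {relative_efficiency:.2f}")
```

A relative efficiency below 1.0 means the Xeon delivers less performance per watt than the A9 under these peak power figures.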
The EDAP was calculated for each benchmark and processor configuration as the product of the total power (runtime dynamic + gate leakage + subthreshold leakage) and the execution time of each benchmark (sim_seconds). The results are presented in the table below to four decimal places.
Cases | EDAP-specbzip | EDAP-speclibm |
---|---|---|
L2_size = 1MB | 0.1230 | 0.2239 |
L2_size = 4MB | 0.1198 | 0.2256 |
L2_assoc = 2-way | 0.1212 | 0.2245 |
L2_assoc = 4-way | 0.1209 | 0.2245 |
L1i_size = 64kB | 0.1525 | 0.2948 |
L1i_assoc = 4-way | 0.1252 | 0.2383 |
L1i_assoc = 8-way | 0.1289 | 0.2460 |
L1d_size = 32kB | 0.0688 | 0.1328 |
L1d_size = 128kB | 0.1478 | 0.2835 |
L1d_assoc = 4-way | 0.0950 | 0.1768 |
L1d_assoc = 8-way | 0.1031 | 0.1891 |
cacheline_size = 32B | 0.0860 | 0.2439 |
cacheline_size = 128B | 0.1504 | 0.2060 |
cacheline_size = 256B | 0.2995 | 0.3285 |
cacheline_size = 256k | 0.1724 | 0.3297 |
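The calculation described above can be sketched in Python as follows. Total power multiplied by sim_seconds gives energy; the usual definition of the energy-delay-area product multiplies that energy by the delay and by the area as well. Whether those two factors are already included in the table values is not stated, so the sketch shows both quantities separately. The power and area inputs are the ARM A9 figures from the McPAT comparison table, while the `sim_seconds` value is a hypothetical placeholder rather than a measured result.

```python
def total_power_w(runtime_dynamic_w, gate_leakage_w, subthreshold_leakage_w):
    """Total power as used in the text: runtime dynamic plus both leakage terms."""
    return runtime_dynamic_w + gate_leakage_w + subthreshold_leakage_w


def energy_j(power_w, sim_seconds):
    """Energy = total power x execution time."""
    return power_w * sim_seconds


def edap(power_w, sim_seconds, area_mm2):
    """Energy-delay-area product in its usual form: energy x delay x area."""
    return energy_j(power_w, sim_seconds) * sim_seconds * area_mm2


# ARM A9 figures from the McPAT table; sim_seconds is a hypothetical value.
power = total_power_w(2.96053, 0.0563774, 0.0523094)
sim_seconds = 0.05
area_mm2 = 5.39698

print(f"total power: {power:.4f} W")
print(f"energy:      {energy_j(power, sim_seconds):.4f} J")
print(f"EDAP:        {edap(power, sim_seconds, area_mm2):.6f} J*s*mm^2")
```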
The peak power for each case is shown in the following graphs, divided by benchmark. The orange line shows the power recorded for the MinorCPU case without any change in its characteristics.
Peak power is therefore affected only by the choices of cache size, associativity, etc. for each processor, and does not vary with the computational load of each benchmark. The graphs show that the largest influence on the final peak power is the cache line size, with 27.4066 W for the largest 256-byte option and 2.3259 W for the smallest 32-byte option.
McPAT does not perform a full transistor-level simulation of the processor circuits, as a mixed-signal simulator would. Another possible source of error is gem5, as it defaults to syscall emulation only, ignoring hardware delays, and may therefore report inaccurate run times. The TimingSimpleCPU model can reduce these errors to some extent, as it takes hardware timing into account. When the two programs are combined, their errors compound in the generated values unless a correction is made before the final calculation.