Billkyriaf/computer_architecture_3


Computer Architecture Assignment 3

Aristotle University of Thessaloniki

School of Electrical & Computer Engineering

Contributors: Kyriafinis Vasilis, Nikolaos Giannopoulos
Winter Semester 2021 - 2022


1. Step 1

1.1. Original McPAT paper and which processors were used.


As presented in the original paper, McPAT is an integrated power, area, and timing modeling framework that supports comprehensive design space exploration for multicore and manycore processor configurations ranging from 90nm to 22nm and beyond. At the microarchitectural level, McPAT includes models for the fundamental components of a chip multiprocessor, including in-order and out-of-order processor cores, networks-on-chip, shared caches, integrated memory controllers, and multiple-domain clocking. At the circuit and technology levels, McPAT supports critical-path timing modeling, area modeling, and dynamic, short-circuit, and leakage power modeling for each of the device types forecast in the ITRS roadmap, including bulk CMOS, SOI, and double-gate transistors. McPAT has a flexible XML interface to facilitate its use with multiple performance simulators.

McPAT advances the state of the art in several directions compared to Wattch, which is the current standard for power research. First, McPAT is an integrated power, area, and timing modeling framework that enables architects to use new metrics combining performance with both power and area, such as the energy-delay-area² product (EDA²P) and the energy-delay-area product (EDAP), which are useful for quantifying the cost of new architectural ideas. In addition, McPAT specifies the low-level design parameters of regular components (e.g. interconnects, caches, and other array-based structures) based on high-level constraints (clock rate and optimization target) given by the user, ensuring that the user is always modeling a reasonable design. This approach lets the user, if they choose, ignore many of the low-level details of the circuits being modeled.
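As a minimal sketch of how these two metrics behave, the snippet below computes EDAP and EDA²P from energy, delay, and area. The function names and the two example designs are hypothetical and do not come from the McPAT paper; the formulas are simply the products that the metric names spell out.

```python
def edap(energy_j, delay_s, area_mm2):
    """Energy-delay-area product: energy * delay * area."""
    return energy_j * delay_s * area_mm2


def eda2p(energy_j, delay_s, area_mm2):
    """Energy-delay-area^2 product: penalizes area more strongly."""
    return energy_j * delay_s * area_mm2 ** 2


# Hypothetical designs (illustrative numbers only, not measured data).
baseline = {"energy_j": 2.0, "delay_s": 0.5, "area_mm2": 10.0}
bigger_cache = {"energy_j": 1.8, "delay_s": 0.45, "area_mm2": 12.0}

for name, design in (("baseline", baseline), ("bigger_cache", bigger_cache)):
    print(f"{name}: EDAP = {edap(**design):.2f}, EDA2P = {eda2p(**design):.2f}")
```

With these made-up numbers the larger design wins on EDAP but loses on EDA²P, which illustrates why the choice of metric matters when area is expensive.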

The following table shows the processors that were used for validation in the original McPAT paper:

| Processor | Published total power and area | McPAT results | % McPAT error (power / area) |
| --- | --- | --- | --- |
| Niagara | 63 W / 378 mm^2 | 56.17 W / 295 mm^2 | -10.84 / -21.8 |
| Niagara2 | 84 W / 342 mm^2 | 69.70 W / 248 mm^2 | -17.02 / -27.3 |
| Alpha 21364 | 125 W / 396 mm^2 | 97.9 W / 324 mm^2 | -21.68 / -18.2 |
| Xeon Tulsa | 150 W / 435 mm^2 | 116.08 W / 362 mm^2 | -22.61 / -16.7 |
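The power error column can be reproduced from the published and modeled values with a simple relative-error calculation. The sketch below is only an illustration: the helper name `percent_error` is hypothetical, and the numbers are copied from the table above.

```python
def percent_error(modeled, published):
    """Relative error of the modeled value with respect to the published one, in percent."""
    return (modeled - published) / published * 100.0


# (published power in W, McPAT power in W), copied from the table above.
power_w = {
    "Niagara": (63.0, 56.17),
    "Niagara2": (84.0, 69.70),
    "Alpha 21364": (125.0, 97.9),
    "Xeon Tulsa": (150.0, 116.08),
}

for name, (published, modeled) in power_w.items():
    print(f"{name}: {percent_error(modeled, published):.2f} %")
```

The area column can be checked the same way, although the rounded areas listed in the table may not reproduce the published percentages exactly.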


1.2. Looking at the results that the McPAT output gives you.

Dynamic Power

Wattch ("Wattch: A Framework for Architectural-Level Power Analysis and Optimizations") is a widely used processor power estimation tool. Wattch calculates dynamic power dissipation from switching events obtained from an architectural simulation and from capacitance models of the components of the microarchitecture. For array structures, Wattch uses capacitance models from CACTI, and for the pipeline it uses models from "Complexity-Effective Superscalar Processors". When modeling out-of-order processors, Wattch uses the synthetic RUU model that is tightly coupled to the SimpleScalar simulator ("The SimpleScalar Tool Set, Version 2.0"). Wattch has enabled the computer architecture research community to explore power-efficient design options. As technology has progressed, however, its limitations have become apparent. First, Wattch models power without considering timing and area. Second, Wattch only models dynamic power consumption; the HotLeakage package ("HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects") partially addressed this deficiency by adding models for subthreshold leakage. Third, Wattch uses simple linear scaling models based on 0.8μm technology, which are inaccurate for predictions at current and future deep-submicron technology nodes.

Static Power

McPAT's XML interface allows both the specification of the static microarchitecture configuration parameters and the passing of dynamic activity statistics generated by the performance simulator. McPAT can also send runtime power dissipation back to the performance simulator through the XML-based interface, so that the performance simulator can react to power (or even temperature) data. This approach makes McPAT very flexible and easy to port to other performance simulators. McPAT runs separately from the simulator and only reads performance statistics from it. The overhead on the performance simulator is minor: at most, some performance counters need to be added. Since McPAT provides complete hierarchical models from the architecture level down to the technology level, the XML interface also contains circuit implementation style and technology parameters that are specific to a particular target processor. Examples are array types, crossbar types, and the CMOS technology generation with its associated voltage and device types.

Short-Circuit Power

Switching circuits also dissipate short-circuit power due to a momentary short through the pull-up and pull-down devices. The McPAT authors compute the short-circuit power using the equations derived in the work by Nose ("Analysis and Future Trend of Short-Circuit Power"), which predicts trends for short-circuit power. If the ratio of the threshold voltage to the supply voltage shrinks, short-circuit power becomes more significant. Short-circuit power is around 10% of the total dynamic power, with fluctuations within 3.1% across all the technology generations. The main reason for the stable short-circuit power is that McPAT uses ITRS technology models, which have stable Vth to Vdd ratios.

Leakage Power

Gate leakage is an important component in 90nm and 65nm technology, accounting for 37.6% of the total leakage power at 65nm. High-k metal gate transistors ("45nm High-k + Metal Gate Strain-Enhanced Transistors") are introduced at 45nm and reduce gate leakage by more than 90%. SOI technology and double-gate (DG) devices, which are used at 32nm and 22nm, also help keep the subthreshold leakage under control.

Running different programs on the same processor architecture

All the results depend on the runtime of the program and the load it creates. There is also, however, a constant power consumption due to short-circuit power and the leakage power of the technology used by the processor, which we consider a "constant power loss".

1.3. A system with different processors

The answer can be found by looking at the performance per watt of each processor. The second processor draws 35/25 = 1.4 times more power. If that processor could perform more than 1.4 times as many operations per second as the first processor, it would make more sense to use it despite its higher power: by finishing more than 1.4 times faster, it would consume less energy for the same work. For McPAT to answer this question, information about the performance of each processor would also need to be provided. Finally, another consideration is the battery capacity at different output power levels. Batteries tend to be less efficient the more power they need to deliver. With that in mind, a 25W processor could be preferable even if it were 1.4 times slower than a 35W one.
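The break-even argument can be checked with a short calculation. This is a sketch under simplifying assumptions: constant power draw over the whole run and an ideal battery. The 10-second runtime and the speedup values are hypothetical; only the 25W and 35W figures come from the text.

```python
def energy_j(power_w, runtime_s):
    """Energy consumed at a constant power draw over the given runtime."""
    return power_w * runtime_s


def breakeven_speedup(power_high_w, power_low_w):
    """Speedup the higher-power processor needs to match the other one's energy."""
    return power_high_w / power_low_w


slow_runtime_s = 10.0  # hypothetical runtime on the 25 W processor
print(f"break-even speedup: {breakeven_speedup(35.0, 25.0):.2f}")
print(f"25 W processor: {energy_j(25.0, slow_runtime_s):.1f} J")

for speedup in (1.2, 1.5):  # hypothetical speedups of the 35 W processor
    fast_energy = energy_j(35.0, slow_runtime_s / speedup)
    print(f"35 W processor, {speedup}x faster: {fast_energy:.1f} J")
```

Below the break-even speedup the 35W processor uses more energy for the same work; above it, less.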

1.4. Xeon VS ARM A9.

By running the following commands in the mcpat/mcpat folder:

  ./mcpat -infile ProcessorDescriptionFiles/Xeon.xml -print_level 2
  ./mcpat -infile ProcessorDescriptionFiles/ARM_A9_2GHz.xml -print_level 2

the results for the Xeon and the ARM A9 are obtained. The table below summarizes the differences between the two processors:

| Processor | Area | Peak Power | Total Leakage | Peak Dynamic | Subthreshold Leakage | Gate Leakage | Runtime Dynamic | Subthreshold Leakage with power gating |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Xeon | 410.507 mm^2 | 134.938 W | 36.8319 W | 98.1063 W | 35.1632 W | 1.66871 W | 72.9199 W | 16.3977 W |
| ARM A9 | 5.39698 mm^2 | 1.74189 W | 0.108687 W | 1.6332 W | 0.0523094 W | 0.0563774 W | 2.96053 W | - |

The metric that can be used to compare two such different processors is performance per watt, and more specifically FLOPS per watt, which shows how many floating-point operations per second the processor can perform for each watt it draws. The absolute FLOPS of the two processors are unknown, but they are not required, since only their ratio matters for the comparison. It is given that the Xeon is 40 times faster than the ARM A9, which means that the Xeon performs in general 40 times more FLOPS than the ARM A9. Using McPAT, the peak power was found to be 134.9W for the Xeon and 1.74W for the ARM A9. Dividing the two (134.9/1.74) gives 77.53. This means that the Xeon would have to be 77.53 times faster than the A9 for the two processors to be equally energy efficient. Since the Xeon is only 40 times faster, it cannot match the efficiency of the A9.
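The same comparison can be written as a ratio of performance per watt. The sketch below uses the rounded peak power values and the given 40x speedup from the text; the variable names are hypothetical, and the ARM A9 performance is normalized to 1 because only the ratio matters.

```python
XEON_PEAK_POWER_W = 134.9   # rounded McPAT peak power
A9_PEAK_POWER_W = 1.74      # rounded McPAT peak power
XEON_SPEEDUP = 40.0         # Xeon performance relative to the ARM A9 (given)

# Speedup the Xeon would need to match the A9 in performance per watt.
power_ratio = XEON_PEAK_POWER_W / A9_PEAK_POWER_W

# Performance per watt of the Xeon relative to the A9 (A9 = 1.0).
relative_efficiency = XEON_SPEEDUP / power_ratio

print(f"power ratio (break-even speedup): {power_ratio:.2f}")
print(f"Xeon performance per watt relative to A9: {relative_efficiency:.2f}")
```

A relative efficiency below 1 means the Xeon delivers less performance per watt than the A9 at peak power.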

2. Step 2.

2.1. How will you calculate the energy.

The EDAP was calculated for each benchmark and processor configuration as the product of the total power (runtime dynamic + gate leakage + subthreshold leakage) and the execution time of each benchmark (sim_seconds). The results are presented in the table below to four decimal places.
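The product described above can be sketched as follows. The function name and the example values are hypothetical; the sketch assumes that the three power components are taken from the McPAT output in watts and that sim_seconds comes from the gem5 statistics.

```python
def energy_from_mcpat(runtime_dynamic_w, gate_leakage_w,
                      subthreshold_leakage_w, sim_seconds):
    """Total power (sum of the three components) multiplied by execution time."""
    total_power_w = runtime_dynamic_w + gate_leakage_w + subthreshold_leakage_w
    return total_power_w * sim_seconds


# Hypothetical values for a single benchmark run (not taken from the results).
example = energy_from_mcpat(
    runtime_dynamic_w=1.2,
    gate_leakage_w=0.05,
    subthreshold_leakage_w=0.75,
    sim_seconds=0.08,
)
print(f"power x time = {example:.4f}")
```

Multiplying this value by the delay and the area as well would give the full energy-delay-area product in its usual definition.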

| Cases | EDAP-specbzip | EDAP-speclibm |
| --- | --- | --- |
| L2_size = 1MB | 0.1230 | 0.2239 |
| L2_size = 4MB | 0.1198 | 0.2256 |
| L2_assoc = 2-way | 0.1212 | 0.2245 |
| L2_assoc = 4-way | 0.1209 | 0.2245 |
| L1i_size = 64kB | 0.1525 | 0.2948 |
| L1i_assoc = 4-way | 0.1252 | 0.2383 |
| L1i_assoc = 8-way | 0.1289 | 0.2460 |
| L1d_size = 32kB | 0.0688 | 0.1328 |
| L1d_size = 128kB | 0.1478 | 0.2835 |
| L1d_assoc = 4-way | 0.0950 | 0.1768 |
| L1d_assoc = 8-way | 0.1031 | 0.1891 |
| cacheline_size = 32B | 0.0860 | 0.2439 |
| cacheline_size = 128B | 0.1504 | 0.2060 |
| cacheline_size = 256B | 0.2995 | 0.3285 |
| cacheline_size = 256k | 0.1724 | 0.3297 |
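To read the table more easily, the configuration with the lowest value per benchmark can be picked out programmatically. This is a small sketch: the values are copied from the table above, and the helper name `best_case` is hypothetical.

```python
# Values copied from the table above: case -> (specbzip, speclibm).
EDAP = {
    "L2_size = 1MB": (0.1230, 0.2239),
    "L2_size = 4MB": (0.1198, 0.2256),
    "L2_assoc = 2-way": (0.1212, 0.2245),
    "L2_assoc = 4-way": (0.1209, 0.2245),
    "L1i_size = 64kB": (0.1525, 0.2948),
    "L1i_assoc = 4-way": (0.1252, 0.2383),
    "L1i_assoc = 8-way": (0.1289, 0.2460),
    "L1d_size = 32kB": (0.0688, 0.1328),
    "L1d_size = 128kB": (0.1478, 0.2835),
    "L1d_assoc = 4-way": (0.0950, 0.1768),
    "L1d_assoc = 8-way": (0.1031, 0.1891),
    "cacheline_size = 32B": (0.0860, 0.2439),
    "cacheline_size = 128B": (0.1504, 0.2060),
    "cacheline_size = 256B": (0.2995, 0.3285),
    "cacheline_size = 256k": (0.1724, 0.3297),
}


def best_case(column):
    """Return (case, value) with the lowest value in the given column index."""
    case = min(EDAP, key=lambda name: EDAP[name][column])
    return case, EDAP[case][column]


for column, benchmark in enumerate(("specbzip", "speclibm")):
    case, value = best_case(column)
    print(f"{benchmark}: lowest value {value:.4f} with {case}")
```

For both benchmarks the lowest value in the table belongs to the L1d_size = 32kB case.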

2.2. Graphs.

The peak power for each case is shown in the following graphs, divided by benchmark. The orange line shows the power recorded for the MinorCPU case without any change in its characteristics.



We see that peak power is affected only by the choices of cache size, associativity, etc. for each processor configuration, and does not vary with the computational load of each benchmark. The graphs show that the cache line size has the largest influence on the final peak power, with 27.4066 W for the largest option (256 bytes) and 2.3259 W for the smallest (32 bytes).


2.3. Your results obtained in relation to the cost function.

McPAT does not perform a full transistor-level simulation of the processor circuits, as a mixed-signal simulator would. Another possible source of error is gem5, which by default runs only in syscall emulation mode, ignoring hardware delays, and may therefore report inaccurate run times. The TimingSimpleCPU model can reduce these errors to some extent, as it takes hardware timing into account. Combining the two programs compounds the errors in the generated values if no correction is made before the final calculation.

3. Source.

Paper McPAT
Github From Andreas Brokalakis
