Performance Measurements and Issues

"For better or worse, benchmarks shape a field."

Performance metrics are the measures used to evaluate a computer design, so the first step is to know them. The most common real-world performance metrics are Speed, Capacity, Cost and Energy consumption. Buyers weigh these four factors against the target application or requirement when choosing a computer, and the trade-off is among these four. Of these, Speed refers mainly to CPU performance, and Capacity to disk and/or memory storage.

In a real-world scenario, an organization planning to acquire one or more computers first freezes the specifications according to what its application demands. The options are then weighed and evaluated. An individual buyer may not detail requirements as thoroughly, since off-the-shelf systems generally exceed those requirements in capability, and hence cost might be the main criterion. In this chapter, we discuss performance with regard to the design of computers.

Performance of a computer depends on the constituent subsystems of the system including software. Each of the subsystems can be measured and tuned for performance. Thus performance can be measured for:

  • CPU performance for Scientific application, Vector processing, Business application, etc – Instructions per Second
  • Graphics performance – Rendering - Pixels per second
  • I/O Performance – Transactions Per Second
  • Internet performance and more – bandwidth utilization in Mbps or Gbps

Memory size, speed and bandwidth play a key role in both CPU and I/O performance. While good CPU performance is expected in almost any operational environment, I/O performance is critical in a transaction-processing environment.

System Performance Measurement

Three metrics are used to measure the performance of any system: Performance, Execution Time and Throughput.

Total time taken for execution of a program = CPU Time + I/O Time + Others (like Queuing time etc.) ... Eqn 7.1

Generally, the time taken to execute a program (either a standard program or an application program) is a rule-of-thumb measure of system performance. This is called the Execution Time.

Performance = 1/Execution Time

Throughput is the amount of work done in a unit of time.
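The two definitions above can be sketched in Python as follows. The function names and the sample figures (10 s and 15 s execution times, 120 jobs in 60 s) are illustrative assumptions, not values from this chapter.

```python
def performance(execution_time_s: float) -> float:
    """Performance as the reciprocal of execution time."""
    return 1.0 / execution_time_s


def throughput(jobs_completed: int, elapsed_s: float) -> float:
    """Work done per unit of time, here jobs per second."""
    return jobs_completed / elapsed_s


# Hypothetical execution times of the same program on two systems.
time_a_s = 10.0
time_b_s = 15.0

# How many times faster system A is than system B for this program.
relative_performance = performance(time_a_s) / performance(time_b_s)

# Hypothetical batch: 120 jobs finished in 60 seconds.
jobs_per_second = throughput(120, 60.0)

print(f"A is {relative_performance:.2f}x as fast as B")
print(f"throughput: {jobs_per_second:.1f} jobs/s")
```

Because performance is the reciprocal of execution time, the performance ratio of two systems is simply the inverse ratio of their execution times.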

CPU Time

CPU Time is the time for which the CPU was busy executing the program under consideration, i.e. the CPU time used by the program to execute its instructions. Any program is converted into a set of machine instructions executable by the CPU. The larger the program, the more instructions it has and the more CPU time it takes. This is exactly why, in addition to the target application program, we need a standard program with which a system or CPU is evaluated. Such a standard program is known as a benchmark program.
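As a rough illustration of the difference between CPU time and total elapsed time, the sketch below times a small workload with Python's standard time module. time.process_time() reports the CPU time consumed by the current process and time.perf_counter() reports elapsed wall-clock time. The busy_work function, and the time.sleep call standing in for I/O or queuing delay, are assumptions made only for this example.

```python
import time


def busy_work(n: int) -> int:
    """A small CPU-bound loop (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


wall_start = time.perf_counter()
cpu_start = time.process_time()

result = busy_work(200_000)  # CPU-bound part of the "program"
time.sleep(0.2)              # stands in for I/O or queuing delay

cpu_time_s = time.process_time() - cpu_start
elapsed_s = time.perf_counter() - wall_start

print(f"elapsed: {elapsed_s:.3f} s, CPU time: {cpu_time_s:.3f} s")
```

The elapsed time includes the simulated wait, while the CPU time covers only the period in which the processor was actually executing the loop.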

CPU Time in seconds (TCPU) = Number of instructions in the program / Average number of instructions executed per second by the CPU

or, equivalently,

CPU Time = Number of instructions in the program x Average clock cycles per instruction x Time per clock cycle

This can be written as a chain of ratios in which the intermediate units cancel:

CPU Time = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle)

(Equation 7.2)

Time per clock cycle = 1/ CPU clock frequency.

CPU clock frequency is the familiar CPU speed rating, usually quoted in GHz.

Equation 7.3 is technically equivalent to Equation 7.2.

CPU Time = Number of instructions in the program (N)
           x Average clock cycles per instruction (CPI)
           x Time per clock cycle (Tclk)
         = N x CPI x Tclk

(Equation 7.3)

N is the number of machine instructions. It depends on how the program is translated into executable code. The program is the software side of the equation: it can be optimized at the source level by the programmer, and during intermediate code generation and optimization by the compiler.

CPI is Cycles Per Instruction, or more precisely the average number of clock cycles the CPU requires per instruction. It depends heavily on the Instruction Set Architecture (ISA) design.

Time per clock cycle (Tclk) is a hardware feature, whose lower limit is set by the logic design at chip level and component level. Over generations of CPU design it has become more usual to quote the CPU clock frequency (f) instead. The familiar relation T = 1/f converts CPU clock speed to time per clock cycle.

Thus system performance is a combination of:

  • Hardware (increasing the clock frequency reduces Tclk),
  • Software (the efficiency of the code influences N), and
  • Architecture (influences CPI); the compiler can also influence CPI by generating instructions with a lower average CPI, or lower the instruction count through optimisation.

Figure 7.1 Performance Contributors
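To see how each contributor acts on CPU time, the following sketch applies Equation 7.3 to a baseline and then changes one factor at a time. The baseline figures (one million instructions, CPI of 4, 2 GHz clock) and the sizes of the improvements are hypothetical values chosen only for illustration.

```python
def cpu_time(n_instructions: float, cpi: float, clock_hz: float) -> float:
    """Equation 7.3: N x CPI x Tclk, with Tclk = 1 / clock frequency."""
    return n_instructions * cpi / clock_hz


# Hypothetical baseline system and program.
base = cpu_time(1_000_000, 4.0, 2.0e9)

# Software: better code or compiler output removes 20% of the instructions.
fewer_instructions = cpu_time(800_000, 4.0, 2.0e9)

# Architecture: a design with a lower average CPI.
lower_cpi = cpu_time(1_000_000, 3.0, 2.0e9)

# Hardware: a faster clock shortens the time per cycle.
faster_clock = cpu_time(1_000_000, 4.0, 2.5e9)

for label, t in [
    ("baseline", base),
    ("fewer instructions", fewer_instructions),
    ("lower CPI", lower_cpi),
    ("faster clock", faster_clock),
]:
    print(f"{label:>20}: {t * 1e3:.2f} ms")
```

Since the three factors multiply, a given percentage reduction in any one of them reduces CPU time by the same percentage.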

Let us use an example to reinforce our learning on CPU performance. A program ABCD with 15000 instructions is executed on a system whose clock frequency is 3.3 GHz and whose design gives an average of 12 cycles per instruction. Calculate the CPU time used to execute program ABCD.

Here, N = 15000, CPI = 12, Tclk = 1/3.3 GHz
        Tclk    = 1/(3.3 x 10^9) seconds
                ≈ 0.3 x 10^-9 seconds

Therefore,
CPU Execution Time TCPU = N x CPI x Tclk
                        = 15000 x 12 x 0.3 x 10^-9 seconds
                        = 54000 x 10^-9 seconds
                        = 54 x 10^-6 seconds
                        = 54 microseconds

Amazing. You are a Millennial. Your CPU can execute the program ABCD in just 54 microseconds. The same program would have taken orders of magnitude longer on the CPUs of a few decades ago.

If the same program is executed on a CPU with a 20 CPI design and the same 3.3 GHz clock, the CPU would take 90 microseconds to execute program ABCD. Thus the ISA design, and hence the architecture, is very important for CPU efficiency. In the same way, any two systems may be compared against a target application or benchmark.
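The worked example can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen for illustration. Note that the text rounds Tclk to 0.3 ns, which gives 54 and 90 microseconds, whereas using 1/(3.3 x 10^9) without rounding gives slightly larger values.

```python
N = 15_000          # instructions in program ABCD
CLOCK_HZ = 3.3e9    # 3.3 GHz clock

# CPU time = N x CPI x Tclk, with Tclk = 1 / CLOCK_HZ (no rounding).
time_cpi_12_s = N * 12 / CLOCK_HZ
time_cpi_20_s = N * 20 / CLOCK_HZ

# How much longer the 20-CPI design takes; the clock and N cancel out.
slowdown = time_cpi_20_s / time_cpi_12_s

print(f"CPI 12: {time_cpi_12_s * 1e6:.1f} microseconds")
print(f"CPI 20: {time_cpi_20_s * 1e6:.1f} microseconds")
print(f"slowdown: {slowdown:.2f}x")
```

With the instruction count and clock held fixed, the ratio of the two CPU times equals the ratio of the CPIs, 20/12.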

CPU performance Evaluation Tools

Although benchmarks evaluate systems against standard programs or procedures, they do not replace application-specific performance evaluation. Many different tools are available as standard benchmarks, each meant for a particular purpose.

MIPS – Million Instructions Per Second. MIPS is simply the rate at which an instruction, or a set of instructions, is executed. It is specific to the instruction mix and its implementation: it can produce a different figure for a different set of programs on the same machine, and hence does not truly reflect the capability of a CPU in a wider perspective. For this reason, it is not in general use these days. In the early era of computers there were not many benchmark programs, so MIPS was used with selected instructions.

MIPS = NInstr / (TE x 10^6)

where NInstr is the number of instructions executed and TE is the execution time in seconds.
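The sketch below evaluates the MIPS formula and shows why the figure depends on the instruction mix. It reuses the 15000-instruction count and 3.3 GHz clock from the earlier example, and assumes, purely for illustration, two programs whose average CPI on the same machine is 12 and 20 respectively.

```python
def mips(n_instructions: int, execution_time_s: float) -> float:
    """MIPS = number of instructions / (execution time x 10^6)."""
    return n_instructions / (execution_time_s * 1e6)


CLOCK_HZ = 3.3e9
N = 15_000

# Execution times from CPU time = N x CPI / clock frequency.
time_cpi_12_s = N * 12 / CLOCK_HZ
time_cpi_20_s = N * 20 / CLOCK_HZ

mips_cpi_12 = mips(N, time_cpi_12_s)
mips_cpi_20 = mips(N, time_cpi_20_s)

print(f"program with average CPI 12: {mips_cpi_12:.0f} MIPS")
print(f"program with average CPI 20: {mips_cpi_20:.0f} MIPS")
```

Substituting the CPU time equation into the MIPS formula gives MIPS = clock frequency / (CPI x 10^6), so the same machine reports a different MIPS rating for every program with a different average CPI.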

MFLOPS – Millions of Floating Point Operations Per Second. This measures the execution rate of floating-point Operations. This is also a crude measure of performance and not in use for the same reason as MIPS.

SPEC – The Standard Performance Evaluation Corporation, a non-profit organization that develops the SPEC benchmark suites. SPEC benchmarks are available for performance evaluation of cloud, CPU, web servers, graphics and workstations, storage, mail servers, virtualization, etc. The CPU SPEC benchmark dates back to SPEC CPU 92. The latest series is SPEC CPU 2017, which has four suites. Interested readers may visit the SPEC website.

TPC-B, TPC-C, TPC-D - These benchmark programs are meant to evaluate systems running DBMS-style transaction-processing applications, in terms of transactions per second.

Performance Enhancement Techniques

CPU execution time is improved mainly by the following factors.

  • Internal Architecture of the CPU
  • Instruction Set of the CPU
  • Memory Speed and bandwidth
  • Percentage use of the registers in execution (note: Registers are at least 5 times faster than memory).

Further, the following features of a system also enhance the overall performance:

  • Architectural extensions (Register set/GPRs/Register File)
  • Special instructions and addressing modes
  • Status register contents
  • Program control stack
  • Pipelining
  • Multiple levels of Cache Memory
  • Use of co-processors or specialized hardware for Floating-Point operations, Vector processing, Multimedia processing.
  • Virtual Memory and Memory management Unit implementation.
  • System Bus performance.
  • Super Scalar Processing

Speedup - Amdahl's Law

Performance improvement is achieved by tuning part(s) of the hardware. Note that such an improvement may not raise overall performance proportionally; the gain is limited to the extent that the tuned feature is actually used. Amdahl's law defines the measure for this speedup. Amdahl's law states that "Performance improvement from speeding up a part of a computer system is limited by the proportion of time the enhancement is used". Amdahl's equation for speedup estimation is given in Equation 7.4.


Speedup (achieved) = Execution Time (before improvement) / Execution Time (after improvement)

(Equation 7.4)
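A minimal Python sketch of the speedup calculation follows. It uses the commonly quoted closed form of Amdahl's law, Speedup = 1 / ((1 - F) + F / S), which this chapter does not write out; F (the fraction of execution time that benefits from the enhancement) and S (the speedup of that part alone) are names introduced here, and the sample values are hypothetical.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when a fraction F of the time is sped up by a factor S."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical case: 40% of a 100-second run is made 10 times faster.
time_before_s = 100.0
fraction = 0.4
factor = 10.0

# Unaffected part stays the same; the enhanced part shrinks by the factor.
time_after_s = time_before_s * (1.0 - fraction) + time_before_s * fraction / factor

speedup_by_ratio = time_before_s / time_after_s       # Equation 7.4
speedup_by_formula = amdahl_speedup(fraction, factor)

# Limit when the enhanced part takes no time at all.
upper_bound = 1.0 / (1.0 - fraction)

print(f"speedup: {speedup_by_formula:.4f} (upper bound {upper_bound:.4f})")
```

Both routes give the same figure, and the upper bound shows that no matter how fast the enhanced part becomes, the overall speedup is capped by the fraction of time that is left untouched.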

CISC vs RISC Comparison

CISC | RISC
Complex (comprehensive) Instruction Set Computer | Reduced Instruction Set Computer
Emphasis on hardware | Emphasis on software
Generally two address ISA, register-memory architecture. The result overwrites the second operand. | Generally three address ISA, register-register architecture. The source operands are never overwritten.
Small code sizes and hence less working memory | Large code sizes and hence requires more working memory
Choice available for instructions | Compiler facilitates code optimisation and better use of registers
The approach is to reduce the number of instructions per program (program code compaction) | ISA approach is one instruction per cycle
The CISC approach attempts to minimize the number of instructions per program | RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
Generally more clock cycles per instruction | Single-clock, reduced instruction only
Generally variable-length instruction format | Fixed length instruction format
Comprehensive and complex instruction set | Fewer simpler standard instructions
A large number of addressing modes supported | Very few addressing modes sufficient because of the load and store architecture
Pipelining possible although not so conducive | Because of simpler instruction, the design is more conducive for pipeline implementation
More often, instructions use identified registers and hence those registers are unavailable as GPRs. | Register independence available on the instructions. Hence all registers can be used as GPRs.
Usually, Microcoded Control Unit implementation | Hardwired Control Unit implementation.
Bigger die size and hence more power consumption | Smaller die size and hence lower power consumption
