

#### **Processor Performance**

Experimental/simulation study of instruction streams to find ways to speed up execution is central to computer architecture.

Computer performance evaluation, however, can be controversial since much is at stake.

Users care about speedy results; a faster computer runs programs in less real time.

A **synthetic workload** is often used to simulate specific execution scenarios.

Memory access and other non-CPU event delays also count toward runtime.


**Power** consumption is an important aspect that must also be considered.

In a hyper-connected world, performance must also be balanced against **security** (esp. protecting microarchitectural state).

© 2024 Dr. Muhammad Al-Hashimi

# What counts: physical run time of real programs

#### No one magic workload

#### Program runtime components

#### Performance metric

CPU time <u>component</u> of runtime (s) = CPU cycles × cycle time (s)
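As a quick check on units, the relation above can be sketched in a few lines. The cycle count and the 2 GHz clock rate below are made-up illustration values, not measurements from the slides.

```python
def cpu_time_seconds(cpu_cycles, cycle_time_s):
    """CPU time component of runtime (s) = CPU cycles x cycle time (s)."""
    return cpu_cycles * cycle_time_s

# Hypothetical numbers: 2 billion cycles on a 2 GHz clock.
clock_rate_hz = 2.0e9
cycle_time_s = 1.0 / clock_rate_hz      # cycle time is the inverse of clock rate
print(f"{cpu_time_seconds(2_000_000_000, cycle_time_s):.3f} s")
```

Doubling the clock rate halves the cycle time, so the same cycle count would take half the CPU time.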


#### Processor Performance Cycles/Instruction (CPI)

Instruction latency




#### Processor Performance Cycle Length

Quiz

How many cycles to execute the 10 instr sequence in each case?

Clearly, instruction <u>physical</u> execution time, for the same CPI, becomes smaller as cycles get shorter.



**Quiz** Suggest at least 3 ways to shorten cycle. *Answer last slide.* 


#### Assume instrs execute at ~2c on average



#### **Processor Performance Instruction Count**

Logically equivalent, perhaps by different compilers from the same high-level code, these programs are different from the datapath viewpoint.

**Exercise** What is the IC in each case?

Generally, a shorter program should run faster if instructions have the same CPI and cycle time. More instructions increase fetch-decode overheads, which include memory access.

| Program 1                | Program 2                | Program 3                        |
|--------------------------|--------------------------|----------------------------------|
| slt \$t3,\$s0,\$zero     | slt \$t3,\$s0,\$zero     | blt\* \$s0,\$zero,…              |
| bne \$t3,\$zero,Exit     | bne \$t3,\$zero,Exit     | bge\* \$s0,\$t2,Exit             |
| slt \$t3,\$s0,\$t2       | slt \$t3,\$s0,\$t2       | <mark>addi</mark> \$t1,\$zero,…  |
| beq \$t3,\$zero,Exit     | beq \$t3,\$zero,Exit     | mul\* \$t1,\$t1,\$s0             |
| add \$t1,\$s0,\$s0       | sll \$t1,\$s0,2          | add \$t1,\$t1,\$t4               |
| add \$t1,\$t1,\$t1       | add \$t1,\$t1,\$t4       | lw \$t0,0(\$t1)                  |
| add \$t1,\$t1,\$t4       | lw \$t0,0(\$t1)          | jr \$t0                          |
| lw \$t0,0(\$t1)          | jr \$t0                  |                                  |
| jr \$t0                  |                          |                                  |

\* Core RISC-V machine instructions (pseudo-instructions in the original MIPS R2000 provided by the assembler).

**Quiz** Is a processor design that produces longer programs necessarily slower? (See next slide.) Hint: multiply is a higher pipeline latency instruction than add/sub/set/shift.
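The quiz hint can be explored with a rough sketch that counts instructions and sums per-instruction cycles for the shift-based and multiply-based listings. The opcode lists are read from the listings above; the 4-cycle multiply and 1-cycle cost for everything else are assumptions for illustration only, and pipelining overlap is ignored.

```python
# Assumed cycle costs: multiply takes 4 cycles, every other opcode takes 1.
CYCLES = {"mul": 4}

def instruction_count(program):
    return len(program)

def total_cycles(program):
    # Naive sum of per-instruction cycles, no overlap between instructions.
    return sum(CYCLES.get(op, 1) for op in program)

with_shift = ["slt", "bne", "slt", "beq", "sll", "add", "lw", "jr"]
with_mul = ["blt", "bge", "addi", "mul", "add", "lw", "jr"]

for name, prog in (("shift version", with_shift), ("mul version", with_mul)):
    print(name, "IC =", instruction_count(prog), "cycles =", total_cycles(prog))
```

Under these assumed costs the program with fewer instructions needs more cycles, which is the kind of outcome the quiz is probing.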

#### Processor Performance CPU Time Components

#### Quiz

Which aspect of processor design has a <u>major</u> effect on each factor of performance?

IC due to ISA (not poor compilation); e.g., CISC tends to yield shorter programs with complex execution profiles that may be slower when timed physically.

A detailed CPI is obtained from averaging over families of instrs that share exec profiles (due to shared datapath segments) weighted by their frequencies.
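A minimal sketch of the weighted average described above. The instruction families, frequencies, and per-family cycle counts are hypothetical values chosen for illustration.

```python
def average_cpi(mix):
    """mix maps an instruction family to (relative frequency, cycles per instruction)."""
    total_freq = sum(freq for freq, _ in mix.values())
    weighted = sum(freq * cycles for freq, cycles in mix.values())
    return weighted / total_freq

# Hypothetical instruction mix (frequencies sum to 1.0).
mix = {
    "alu":        (0.5, 1),
    "load/store": (0.3, 2),
    "branch":     (0.2, 3),
}
print(f"average CPI = {average_cpi(mix):.2f}")
```

Dividing by the total frequency lets the same function accept raw instruction counts instead of fractions.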

#### Performance factors (check units)

- ① Instruction count (IC)
- ② Cycles per instr (CPI), on average
- ③ Clock cycle time

#### **CPU performance equation**

Clock cycles are a poor indicator of runtime <u>alone</u>, valid only when the other two factors are the same (i.e., misleading to compare performance based on cycles when instrs have different execution characteristics).

CPU time (s) = IC × CPI × cycle time (s)
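To see why cycle counts alone can mislead, here is a small sketch of the CPU performance equation applied to two imaginary machines. Machines X and Y and all their numbers are invented for the example.

```python
def cpu_time_s(ic, cpi, cycle_time_s):
    """CPU time (s) = IC x CPI x cycle time (s)."""
    return ic * cpi * cycle_time_s

# Hypothetical machines running logically equivalent programs.
x = {"ic": 1.0e9, "cpi": 2.0, "cycle_time_s": 0.5e-9}
y = {"ic": 1.2e9, "cpi": 1.5, "cycle_time_s": 0.6e-9}

for name, m in (("X", x), ("Y", y)):
    cycles = m["ic"] * m["cpi"]
    print(name, f"cycles = {cycles:.2e}", f"CPU time = {cpu_time_s(**m):.2f} s")
```

With these numbers Y uses fewer clock cycles than X yet takes longer, because its cycle is longer.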


# **More Performance**

Superpipeline

Pattern B

Compute Control

Make stage smaller or otherwise just shorten cycle (higher clock rates); more stages have higher (ideal) speed-up ceiling but how far can we go?

More stages increase performance, but too many incur terrible misprediction and power costs.
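The tension between deeper pipelines and misprediction cost can be sketched with a toy model. Everything here is an assumption made for illustration: 10 ns of total logic split evenly across stages, a fixed 0.1 ns latch overhead per stage, 20% branches, a 10% misprediction rate, and a flush penalty equal to the pipeline depth. Power is not modeled.

```python
def time_per_instruction_ns(depth, logic_ns=10.0, latch_ns=0.1,
                            branch_freq=0.2, mispredict_rate=0.1):
    """Toy model of average time per instruction for a pipeline of a given depth."""
    cycle_ns = logic_ns / depth + latch_ns      # shorter stages plus fixed latch overhead
    penalty_cycles = depth                      # assumed: a misprediction flushes the pipe
    cpi = 1.0 + branch_freq * mispredict_rate * penalty_cycles
    return cpi * cycle_ns

for depth in (5, 10, 20, 31, 70, 200):
    t = time_per_instruction_ns(depth)
    print(f"{depth:>3} stages: {t:.2f} ns/instr, "
          f"speed-up {10.0 / t:.1f}x (ideal {depth}x)")
```

In this model the achieved speed-up falls further behind the ideal ceiling as depth grows, and past some depth adding stages makes each instruction slower on average.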

Intel basically abandoned this approach in 2006 with move from Pentium 4 (20 stages\*) to the Core architecture.

CPI < 1 (IPC > 1) is *superscalar performance* (whereas *scalar* CPI ≥ 1); it implies more than one pipeline or execution pathway.



\*31+ stages in Pentium 4/Rev. E (Prescott, 2004), not counting front-end side fetch-decode stages apparently.

# More ILP: multiple issue

Start multiple instructions every cycle (hopefully) to exploit Pattern B


#### More Performance Multiple Issue

Issue slots



COPYRIGHT 1998 MKP 2005 ELSEVIER, INC. ALL RIGHTS RESERVED


#### Quiz

Identify data hazards (Hint: 3). Compare CPI to a hazard-free execution on the scalar original MIPS. (Hint: tricky).


Exercise

Use Amdahl's law to calculate overall speedup. (Hint: note, only managed to double performance for 2 of 5 instrs).
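As a reminder of the formula the exercise calls for, here is a generic sketch of Amdahl's law. The example fraction and factor are arbitrary and deliberately not the exercise's values.

```python
def amdahl_speedup(fraction_enhanced, enhancement_factor):
    """Overall speedup when only a fraction of execution time is sped up."""
    unchanged = 1.0 - fraction_enhanced
    return 1.0 / (unchanged + fraction_enhanced / enhancement_factor)

# Arbitrary illustration: half of the time is made 4x faster.
print(f"{amdahl_speedup(0.5, 4.0):.2f}x overall")
```

Note that the fraction refers to a share of execution time, so a share of instructions only substitutes for it when all instructions take similar time.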

Compare to near ideal scalar




# **Flipping the Metric**

Issue packet

# **Speeding the Stream**

In all cases hazards must be handled to pipeline execution effectively.

Dependence analysis performed by compiler on static code.

Dependence analysis performed by hardware at run time.

The term superscalar also refers to an older form of dynamic issue with out-of-order execution, and to a class of architectures that avoid recompiling older program binaries.

Superscalar is more powerful when combined with other execution tricks.

Multi-instr/opcode scheduling

#### → Static multiple issue

Issue long instruction words

#### Dynamic multiple issue

- Dynamic scheduling (more later)
- Hardware-based issue packet
- Out-of-order issue


#### Exercise 🎱

Discuss effects of instr dependencies on VLIW coding deficiency; suggest some mitigation strategies.

**Static** = concerning non-running instructions (in static state), as loaded in memory; not during run time.



**Exercise** Compare typical CISC, RISC, and VLIW instructions in terms of info content, bit length, and fetch-decode overhead.


# **Static Multiple Issue**

#### → VLIW

#### Compiler-based packaging

Schedule sets of ops + handle hazards

#### Issue packet

Predetermined sets of ops per cycle
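How a compiler might fill issue packets can be sketched with a simple in-order packer. This is an illustrative toy, not a real VLIW scheduler: the 2-slot packet width, the tuple encoding of operations, and the sample operations are all assumptions, every operation is treated as single-cycle, and only RAW and WAW conflicts inside a packet are checked.

```python
def pack_vliw(ops, width=2):
    """Pack ops, given as (name, dest_or_None, [sources]), into fixed-width packets."""
    packets, current, written = [], [], set()
    for name, dest, srcs in ops:
        conflicts = bool(written & set(srcs)) or dest in written
        if len(current) == width or conflicts:
            # Close the packet, padding unused slots with nops.
            packets.append(current + ["nop"] * (width - len(current)))
            current, written = [], set()
        current.append(name)
        if dest is not None:
            written.add(dest)
    if current:
        packets.append(current + ["nop"] * (width - len(current)))
    return packets

# Hypothetical operation stream.
ops = [
    ("lw t0",  "t0", ["s1"]),
    ("lw t1",  "t1", ["s2"]),
    ("add t2", "t2", ["t0", "t1"]),
    ("sw t2",  None, ["t2", "s3"]),
]
for packet in pack_vliw(ops):
    print(packet)
```

The dependent `add` and `sw` each end up alone in a packet, so two of the six slots carry nops, a small picture of how dependences waste space in long instruction words.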

# Very long instruction word

One long [not complex] instruction with multiple [MIPS-like] operations


# **Register Renaming**

# Loop unrolling

Regular loops are best <u>unrolled</u> to avoid (excessive) decision stalls.

# Antidependence

Arises from updating different memory operands needlessly through the same register.

Quiz

How could the WAR (**write-after-read**) condition on \$t1 affect executing the sw-lw sequence?

Technique can be implemented in either software (as shown here) or in hardware (next).
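To make the WAR condition in the quiz concrete, the sketch below classifies register hazards between two instructions in program order. The tuple encoding and the operand choices for the sw-lw pair are assumptions for illustration; memory dependences are not considered.

```python
def hazards(first, second):
    """Each instruction is (dest_or_None, [source registers]); first precedes second."""
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 is not None and dest1 in srcs2:
        found.add("RAW")    # second reads what first writes
    if dest2 is not None and dest2 in srcs1:
        found.add("WAR")    # second overwrites what first still has to read
    if dest1 is not None and dest1 == dest2:
        found.add("WAW")    # both write the same register
    return found

# Assumed operands: sw $t1,0($t0) followed by lw $t1,4($t0).
sw = (None, ["t1", "t0"])
lw = ("t1", ["t0"])
print(hazards(sw, lw))
```

If the load were allowed to run ahead of the store, the store could write the newly loaded value instead of the old one, which is why the pair must stay ordered unless the register is renamed.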

#### **Common pattern**



| Loop: |                |
|-------|----------------|
| lw    | \$t0,0(\$s1)   |
| addu  | \$t0,\$t0,\$s2 |
| sw    | \$t0,0(\$s1)   |
| addi  | \$s1,\$s1,-4   |
| bne   | \$s1,\$0,Loop  |

|     |                |
|-----|----------------|
| lw  | \$t0,0(\$t0)   |
| add | \$t0,\$t1,\$s0 |
| sw  | \$t0,0(\$t0)   |
| lw  | \$t1,4(\$t0)   |
| add | \$t1,\$t2,\$s1 |
| sw  | \$t1,4(\$t0)   |

#### → Data (value) dependence →

#### Name dependence (anti-dependence)
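Register renaming itself can be sketched as giving every destination a fresh name and redirecting later reads to the newest name. The `p0, p1, ...` names, the tuple encoding, and the two-iteration unrolled body below are assumptions for illustration; address offsets are left out.

```python
from itertools import count

def rename(instrs):
    """instrs: list of (op, dest_or_None, [sources]). Returns the renamed list."""
    fresh = (f"p{i}" for i in count())
    latest = {}                       # architectural register -> newest name
    renamed = []
    for op, dest, srcs in instrs:
        new_srcs = [latest.get(s, s) for s in srcs]   # read the newest producer
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed

# Assumed unrolled body: two iterations that both reuse t0.
body = [
    ("lw",   "t0", ["s1"]),
    ("addu", "t0", ["t0", "s2"]),
    ("sw",   None, ["t0", "s1"]),
    ("lw",   "t0", ["s1"]),
    ("addu", "t0", ["t0", "s2"]),
    ("sw",   None, ["t0", "s1"]),
]
for instr in rename(body):
    print(instr)
```

After renaming, the second `lw` writes `p2` while the first `sw` reads `p1`, so only the true data dependences within each iteration remain.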




# **Dynamic Multiple Issue**

Dynamic means at run time: decisions are made by control during execution, and they vary between runs.

ISA independence.

Dynamic multiple issue relies on older dynamic issue and dynamic pipeline scheduling techniques.

A dynamic pipeline picks instruction(s) to execute in a given cycle, reordering and effectively renaming registers to avoid stalls.

- Hardware-based reordering
- Transparent (hidden) to programs
- Compiler assist useful, <u>not expected</u>

#### **Dynamic pipeline**

- [Re]scheduling by hardware
- On-the-fly = during execution
- Out-of-order execution



# Basic Dynamic Pipeline

(Basic in-order) An instruction is not issued if a dependence on a previously scheduled one was not eliminated or the forwarding hardware can't bypass it (i.e., issue pauses and the pipeline stalls, even if a nearby later instruction is independent).

For multiple issue, execution and issue are essentially decoupled.

Out-of-order execution implies out-of-order completion; a queued buffer can guarantee delivery of correct results.

Dynamic scheduling is transparent to software; it can be improved by upgrading processor control algorithms and func units for a relatively cheap software performance increase.

#### Exercise

Compare: IBM's Tomasulo algorithm, CDC-6600 scoreboard, and the basic scenario (slide + fig).





#### Dynamic Issue Pipeline Operation

Dynamic pipelines operate outside **architectural state** (programmer/ISA registers and memory).

Reservation stations decouple operand values from architectural registers (register renaming effectively).

Reorder buffer provides built-in forwarding, since results may be used before committing to the final register and memory destinations.


#### 2 main things to know

Op executes as soon as operands and a functional unit are ready

Operand cases:

- Ready: available from reg. file or reorder buffer
- Not ready: not produced yet
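The two rules above (start when operands are ready, deliver results in order) can be sketched as a tiny cycle-by-cycle simulation. It is a simplified illustration, not Tomasulo's algorithm: the issue width of 2, the latencies, and the instruction tuples are assumptions, only RAW dependences are tracked (WAR/WAW are assumed removed by renaming), and functional units are assumed fully pipelined.

```python
def simulate(instrs, issue_width=2):
    """instrs: list of (name, dest_or_None, [sources], latency).
    Returns (start_cycles, commit_cycles), both in program order."""
    n = len(instrs)
    start, done, commit = [None] * n, [None] * n, [None] * n
    cycle, head = 0, 0                       # head = oldest uncommitted instruction
    while head < n:
        # Commit strictly in program order, like a reorder buffer.
        while head < n and done[head] is not None and done[head] <= cycle:
            commit[head] = cycle
            head += 1
        issued = 0
        for i, (_, _, srcs, latency) in enumerate(instrs):
            if start[i] is not None or issued == issue_width:
                continue
            ready = True
            for src in srcs:
                producers = [j for j in range(i) if instrs[j][1] == src]
                if producers:
                    finish = done[producers[-1]]
                    if finish is None or finish > cycle:
                        ready = False        # operand not produced yet
            if ready:
                start[i], done[i] = cycle, cycle + latency
                issued += 1
        cycle += 1
    return start, commit

# Hypothetical sequence: the sub is independent of the slow load.
program = [
    ("lw",  "t0", ["s1"], 3),
    ("add", "t1", ["t0", "s2"], 1),
    ("sub", "t2", ["s3", "s4"], 1),
    ("sw",  None, ["t1", "s1"], 1),
]
start, commit = simulate(program)
print("start :", start)
print("commit:", commit)
```

Here `sub` starts before `add` because its operands are ready, but it still commits after `add`, which is the ordering guarantee the reorder buffer provides.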

#### More Performance Consequences

Historically, processors were built for speed mainly; old-school speed came at a significant power cost (at times too steep to accept).

Speculative execution was born originally out of classic need-for-speed thinking.

Two architectural flaws, named *Spectre* and *Meltdown*, came to attention in 2017 (publicly early 2018).

Later a growing family (headache) of variants; they even have cute icons.



A new line of attacks emerged in 2022, *Retbleed*, which defeats the *retpoline* mitigation!

# Speed-power tradeoff

# Speculative execution

# The Spectre design "flaw"

#### -> Security design concerns

#### A new tradeoff?

https://arstechnica.com/information-technology/2022/07/intel-and-amd-cpus-vulnerable-to-a-new-speculative-execution-attack/

#### Processor Performance Conclusions

Involving the decoding component/ burdens of the front-end.

A front-facing one associated with an input instruction stream, and another one focused on speed with an internal stream.

8086/8088 (1978) binaries (.exe) from the 1980s may still be run as-is on the latest Intel Core processors (except those dependent on obsolete PC parts).

An understated way of saying don't take them numbers too seriously, especially when realistic workloads and run environments are factored in.


## -> Compiler-hardware interplay

### Processor dual personality

#### 🖒 Superscalar arch. advantage

Binary compatibility; hide machine details, hardware develops independently


Theoretical <u>peak instruction throughput</u> in hardware often not sustainable

#### Conclusions More Performance


Processor datapaths contribute relatively few cycles to actual average instruction latencies.

Realistic CPI/IPC is not easily predictable and must be <u>measured carefully</u> after we factor in interaction with memory.

Best performance trick by far is access to at least *some* memory that can keep up with short and fast processor cycles.
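A back-of-the-envelope sketch of how memory stalls inflate CPI, using the common stall-cycle approximation. The base CPI, memory references per instruction, miss rate, and miss penalty are hypothetical values, not measurements.

```python
def effective_cpi(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty_cycles):
    """Base CPI plus average memory stall cycles per instruction."""
    stall_cycles = mem_refs_per_instr * miss_rate * miss_penalty_cycles
    return base_cpi + stall_cycles

# Hypothetical workload: 1.3 memory references per instruction,
# 2% of them miss fast memory, each miss costs 100 cycles.
print(f"effective CPI = {effective_cpi(1.0, 1.3, 0.02, 100):.1f}")
```

Even with a datapath CPI of 1, the stall term dominates in this example, which is why some memory that keeps up with the processor matters so much.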

**To consider next:** extra resources are cheap but the software side of MIMD parallel processing is not easy + SIMD workloads highly significant + single processor/thread performance still important to some workloads (as of 2022).


#### Historically slow memory

May add tens to 1000s+ of cycles unpredictably

#### -> Fast and furious (next)

# What about multiprocessing?

Multicores, processor arrays, later

KAU • CS-704

# **Essay Assignment**

#### **First technical essay reminder** Pick a topic 1–5

#### **Expectations**

Contribute a small essay in discussion group (check topic for rules and topic details)

**Slide 3 Answer** Reduce work per stage (simplify instr or break it further), simplify control (output control switch faster) or reduce distances (shrink circuits), or make functional unit response times smaller (perhaps better logic). Some changes are architectural (such as reorganizing or reimplementing function/logic), and others are technological (e.g., better transistors).
