cs704fig\_ilp1.cdr Monday, March 18, 2024 12:19:06 PM Color profile: Disabled Composite Default screen

# **The Processor**

### ▷ [Processor] architecture

In earlier machines, the instruction set, addressing modes, register and memory organization closely matched the hardware.

The processor has two personalities depending on where you look!

**Instruction set architecture** (ISA), often preserved across generations of hardware, remains useful as a specification for writing machine programs (directly or via compiler agents).

The internal makeup.

© 2024 Dr. Muhammad Al-Hashimi

→ The "active part"

## **Follow** instructions Perform requested operations

# Programmer interface

What functions are available for use

# Implementation details

How parts/functions are setup/organized


In a <u>simplified scenario</u>, the processor deals with instructions and data from different threads of many programs in memory, encoded in a logical bit-stream from a <u>processing</u> viewpoint.

At least one datapath for core instructions (typically default 2's comp operands, int in C); there may be more for specialized operands and ops.

Goal, generally, to minimize instruction time (**latency**) and maximize processing rates (**throughput**).


Opcodes indicate operations to perform, how to obtain operands, and which variants to use, i.e., the **datapaths**.

**Quiz** Give examples from MIPS.


# Processor Implementation The Datapath

### Microarchitecture



perform arithmetic, logical, branch, and load-store operations

**Recall Specification (ISA)**

- Fetch instr word from instr memory
- Load base addr from \$1
- Sum base addr and constant passed in instr
- Fetch data word using sum as addr
- Store data word in \$8/\$9

**Required Functions/Resources**

- Read an Instr Memory
- Perform op in ALU
- Read a Data Memory

# **Instruction Implementation Datapath Resources**

- Architectural regs

## **Example: Load Word**




Load word has to go through specified functions in sequence regardless of how a datapath is designed.
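The fixed sequence of functions for load word can be sketched in a few lines of Python. This is only an illustration: the dict-based register file and data memory, the field names `rs`, `rt`, `offset`, and the step labels are assumptions made for the sketch, not part of the slides.

```python
# Minimal sketch of "lw rt, offset(rs)" as an ordered sequence of functions.
# Register file and data memory are modeled as plain dicts (an assumption).

def load_word(instr, regs, data_mem):
    """Perform the load-word steps in order; return the functions used."""
    steps = ["read instr memory"]       # instr word is assumed already fetched
    base = regs[instr["rs"]]            # load base addr from a register
    steps.append("read register file")
    addr = base + instr["offset"]       # sum base addr and constant in ALU
    steps.append("ALU add")
    word = data_mem[addr]               # fetch data word using sum as addr
    steps.append("read data memory")
    regs[instr["rt"]] = word            # store data word in target register
    steps.append("write register file")
    return steps


regs = {1: 100, 8: 0}                   # $1 holds the base address
data_mem = {104: 42}
steps = load_word({"rs": 1, "rt": 8, "offset": 4}, regs, data_mem)
print(steps, regs[8])
```

Whatever the datapath design, the order of the five steps stays the same; only how they are mapped to functional units and clock cycles changes.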

### Exercise

Specify a different behavior, determine the sequence of functions and suggest functional units to implement.


**Instruction Overlapping** 






# Processor Implementation Pipelining Load-Word




# Pipelining Load-Word Performance

### -> Instr throughput

- Pipeline paradox

Reproduction (concept P&H 1998-2012)

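Time between instr (time to start or *issue* nth instr)

| n  | non-pipelined | pipelined     | Speedup                                |
|----|---------------|---------------|----------------------------------------|
| 3  | $.8 \times 3$ | $.2 \times 3$ | $\frac{.8 \times 3}{.2 \times 3} = 4$  |
| 10 | ?             | ?             | 4                                      |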

**Quiz** What is the formula for the pipelined case? *Answer later slide.* 

# Time to complete (total exec time)

long run, should complete instrs as fast as can issue them!

| n       | non-pipelined    | pipelined             | Speedup                          |
|---------|------------------|-----------------------|----------------------------------|
| 3       | $.8 \times 3$    | 1.4                   | $\frac{2.4}{1.4} \approx 1.71$   |
| 10      | $.8 \times 10$   | 2.8                   | $\frac{8.0}{2.8} \approx 2.86$   |
| 1000    | $.8 \times 1000$ | $.8 + 1000 \times .2$ | $\frac{800}{200.8} \approx 3.98$ |
| 100,000 | ?                | ?                     | ≈ 4.0                            |

Seems attractive to add stages to raise speedup ceiling/headroom. (Keep an eye on this idea).
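The rows of the table can be reproduced with a short Python sketch. It assumes the table's own model: 0.8 ns per instruction without pipelining, and $0.8 + n \times 0.2$ ns to complete *n* instructions when pipelined (as in the n = 1000 row). The function and constant names are made up for the illustration.

```python
# Sketch of the total-execution-time model used in the table above.
INSTR_NS = 0.8   # non-pipelined time per instruction (from the table)
CYCLE_NS = 0.2   # pipelined cycle time (from the table)


def non_pipelined_ns(n):
    """Total time for n instructions executed one after the other."""
    return INSTR_NS * n


def pipelined_ns(n):
    """Total time per the table's model: 0.8 + n * 0.2."""
    return INSTR_NS + n * CYCLE_NS


def speedup(n):
    return non_pipelined_ns(n) / pipelined_ns(n)


for n in (3, 10, 1000, 100_000):
    print(n, round(non_pipelined_ns(n), 1), round(pipelined_ns(n), 1),
          round(speedup(n), 2))
```

As *n* grows the fixed 0.8 ns term stops mattering, so the speedup approaches 0.8/0.2 = 4, the ratio of the two per-instruction times.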

Reproduction (concept P&H 1998-2012)


# **An Execution Pipeline**




# A Simple MIPS Datapath



[Figure: pipelined MIPS datapath showing the PC, register reads/writes, sign extend (16 to 32), shift-left 2, ALU, and data memory, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB.]

# **Practice Sheet**


# **Pipeline Hazards**

### ⇒ [Pipeline] Conflicts

Realized speedup will be less than the ideal: 1) stages may not be perfectly balanced, and, more crucially, 2) one instr/cycle issue may not always be possible.

Pipelining load-word example: Speedup = time to complete *n* instrs (non-pipelined) / time to complete *n* instrs (pipelined). Note in the pipelined case, the 1st instr goes through the pipeline; after that, one instr will complete per cycle, i.e., *n* instrs should complete in $0.8 + n \times 0.2$; therefore 10 instrs complete after $0.8 + 10 \times 0.2 = 2.8$ ns, in $n + 4 = 10 + 4 = 14$ cycles if the cycle is set at 0.2 ns.

## Types: depending on cause

- Structural: functional unit conflict
- Data: operand <u>dependence</u>
- Control: decision <u>dependence</u>

# **Structural Hazards**

### [Instr] Orthogonality



# the hazard

Typically a one-time cost at design time.

# Resolving structural hazards


# Data Hazards

### Stall cycle (bubble)




## Pipeline representation: add



Exercise

The **read-after-write** data hazard will actually cause less than a 3-cycle delay in MIPS. Why? *Hint: look up the actual MIPS-2000 5-stage timing.*


## Resolving Pipeline Hazards Forward Results

Actual *ready* and *use* stages may be aligned if extra hardware were available to get ALU result early.

Additional paths, MUX and control, mostly; note ALU result must be recorded somewhere for next stage.

#### Exercise

Construct a **write-after-read** example from these instructions. (*Hint: tricky, play with \$t1 position.*)


**Forwarding** can't solve all data dependence problems.

**Exercise** Draw the diagram for the **load-use** hazard without forwarding. How many bubbles?
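One way to check a hand-drawn diagram is to count bubbles with a small Python helper. The sketch assumes a classic 5-stage pipeline (IF, ID, EX, MEM, WB) whose register file is written in the first half of a cycle and read in the second half; `stall_cycles` and its parameters are hypothetical names used only here.

```python
# Sketch: bubbles needed between a producer and a dependent consumer.
# Assumes 5 stages and a register file that can be written and then read
# within the same cycle (an assumption about the timing, stated above).

def stall_cycles(producer, distance, forwarding):
    """producer: "alu" or "load"; distance: 1 means the very next instr."""
    if distance < 1:
        raise ValueError("consumer must follow the producer")
    if not forwarding:
        required_gap = 3    # consumer ID may not come before producer WB
    elif producer == "load":
        required_gap = 2    # loaded value only exists after MEM
    else:
        required_gap = 1    # ALU result forwarded straight to the next EX
    return max(0, required_gap - distance)


for kind in ("alu", "load"):
    for fwd in (False, True):
        print(kind, "forwarding" if fwd else "no forwarding",
              stall_cycles(kind, 1, fwd))
```

Under these assumptions forwarding removes the bubbles after an ALU instruction but still leaves one after a load, which is the case forwarding alone cannot cover.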






# **Resolving Pipeline Hazards Reorder Instructions**

Cycles-per-instruction (CPI)

High-level code compiled into a "sensible" (symbolic) machine code based on <u>register allocation</u> in figures.

Assume high-level variables in memory (fig) starting at address loaded in \$1.

Exercise

Suggest a better schedule hazard-wise. (Ans. bottom corner.)






Reproduction (concept P&H 1998-2012)

Exercise

1. Draw a pipeline diagram, determine <u>number of cycles</u> to complete sequence in each case
   - (a) Without forwarding or reordering (worst case). Hint: 5 data hazards.
   - (b) With forwarding alone (hardware-only solution)
   - (c) With both forwarding and reordering (pipeline-aware compiler)
2. What's the CPI for ideal hazard-free pipeline?
3. What is the speedup in each case? (CPI case/CPI hazard-free)

A better schedule: the independent loads of vars (B, E, F), next the arithmetic, then the stores.

# **Control Hazards**

Delay slot



# Resolving Pipeline Hazards Branch Prediction

Hypothetical scenario, more changes needed to the ID stage + can bypass the obvious data hazard on \$8.

At least <u>some of the time</u> the pipeline will operate without a bubble (better than always having bubbles with no prediction).

## Branch always untaken (why reasonable?)




# **Better Prediction**

Basic block

Example **Basic block** depicting a typical loop. Refer to Glossary for definition.

## ➡ Common looping pattern Typical run: *n* taken to 1 not-taken

add \$t1,\$zero,\$zero
inner:
add \$t3,\$a1,\$t1
lw \$t3,0(\$t3)
bne \$t3,\$t4,skip
addi \$s0,\$s0,1
skip: addi \$t1,\$t1,4
bne \$t1,\$a3,inner # \$t1<(\$a3 addr >0)

Reproduction (concept P&H 1998-2012)

Basic blocks offer opportunities to overcome pipeline hazards.

# Prediction Back branch always taken

# Resolving Control Hazards

Simple prediction (guessing) improves pipeline performance by removing bubbles in some cases; better guessing should improve even more.

## Branch prediction

- Static (fixed), i.e., same, prediction
- Dynamic (changing) prediction
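The difference between the two kinds of prediction can be tried out on the common looping pattern (*n* taken to 1 not-taken). The Python sketch below compares the two static guesses with a 2-bit saturating counter, a common textbook dynamic scheme that is swapped in here for illustration and is not taken from these slides. All names and the choice of 9 iterations over 10 runs are assumptions.

```python
# Sketch: prediction accuracy on a loop's back branch.

def loop_branch_outcomes(n_taken, runs):
    """n_taken taken outcomes followed by one not-taken, repeated."""
    return ([True] * n_taken + [False]) * runs


def static_accuracy(outcomes, predict_taken):
    """Static prediction: the same guess every time."""
    hits = sum(1 for taken in outcomes if taken == predict_taken)
    return hits / len(outcomes)


def two_bit_accuracy(outcomes, state=0):
    """Dynamic prediction: states 0-1 predict not-taken, 2-3 predict taken."""
    hits = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            hits += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return hits / len(outcomes)


outcomes = loop_branch_outcomes(n_taken=9, runs=10)
print("always untaken:", static_accuracy(outcomes, False))
print("always taken:  ", static_accuracy(outcomes, True))
print("2-bit counter: ", two_bit_accuracy(outcomes))
```

For a pure loop branch, the static "back branch always taken" guess is hard to beat; a dynamic predictor pays off when the same hardware has to handle branches whose behavior is not known in advance.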

## Compiler-assisted resolution

Reordering instrs to avoid data and control hazards is a form of compiler-assisted resolution (e.g., transparently fill as many branch delay slots as possible).

**Quiz** What is the drawback of the compiler-assisted approach for a pipeline with more stages?

## Speculative execution (later)


A control unit must output, in each clock cycle, the required signals in response to an opcode to correctly: a) pick paths, b) read/write state, or c) select ALU functions.

In a scenario of two successive cycles/stages, the first one performs an ALU operation (add); the following writes a word in memory based on ALU output from the previous cycle.

#### Quiz

Determine the *ALU func* control bits (how many, source)? Guess the instruction.


Electronic signals will propagate in all circuits and invalid bits may occur everywhere in a datapath.

A **control word** ensures desired functions will compute correctly, and <u>only wanted results</u> are recorded at the end of each cycle in *some* registers or memory.


Recall, physical devices in a computer naturally encode and transform a binary state (bits).

### Machine state

SIMPLIFIED Control

- Program state






# Control How It Works

Solution → Microcode



The control unit must output a suitable **control word** in each clock cycle.

Sequence of control words needed to exec an instruction may be generated on-the-fly by fixed logic (hard-wired), or stored in a memory (microcode).
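The stored approach can be pictured, in its simplest one-word-per-instruction form, as a table lookup keyed by opcode. The Python sketch below is an assumption-laden illustration: the signal names and values follow the single-cycle MIPS control commonly shown in P&H, not a figure from these slides, and `CONTROL_STORE` and `control_word` are hypothetical names. `None` marks a don't-care signal.

```python
# Sketch: a control store mapping MIPS opcodes to control words.
from typing import NamedTuple, Optional


class ControlWord(NamedTuple):
    reg_dst: Optional[int]      # pick destination register field
    alu_src: int                # second ALU input: register or immediate
    mem_to_reg: Optional[int]   # write-back source: ALU or memory
    reg_write: int              # record a result in the register file
    mem_read: int
    mem_write: int
    branch: int
    alu_op: int                 # 2-bit hint passed on to the ALU control


CONTROL_STORE = {
    0x00: ControlWord(1, 0, 0, 1, 0, 0, 0, 0b10),        # R-format
    0x23: ControlWord(0, 1, 1, 1, 1, 0, 0, 0b00),        # lw
    0x2B: ControlWord(None, 1, None, 0, 0, 1, 0, 0b00),  # sw
    0x04: ControlWord(None, 0, None, 0, 0, 0, 1, 0b01),  # beq
}


def control_word(opcode):
    """Look up the control word for an opcode (microcode-style)."""
    return CONTROL_STORE[opcode]


print(control_word(0x23))
```

Adding an instruction here means adding a table entry, which hints at why changes are relatively cheap in a stored control; fixed logic would compute the same outputs with gates instead of a lookup.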

#### Quiz

Suggest situations where one approach to control may be preferable to the other.

Note relatively cheap to add/modify instructions in the microcode.




# Control Summary


Realistically, a multi-stage control must output for all active stages, in every cycle, control words for different instructions in different execution stages + handle forwarding, stalls, exceptions, and flow predictions.


The primary duty of the micro-architectural state is maintaining the **program** state as contractually agreed with users at the ISA level.

Pre-agreed timing relative to a fixed clock + a control that ensures correct bits at inputs and selects valid bits to record leads to a manageable physical machine to implement an abstract FSM.


# An embedded [programmable] micro machine to host (i.e., run) an instr set

# Microarchitectural state

Recall: a fixed clock <u>hides</u> (masks) continuous, unpredictable physical timings ... Simplified (relatively), reliable machine


# Processor Instructions Design to Pipeline

Instr sets can either simplify or make life harder for pipeline designers (P&H). Same applies to compiler writers.

## -> Fixed length

Regular (e.g., few, similar encodings)

Operands in the same places make **pre-fetch** possible, that is, to decode independent of opcode.

## Load-store

Memory operand alignment

## Single final write back

### Quiz

What is **interlocking** in context of pipelining? (Look up the name MIPS.)

**Exercise** Explain the impact of each feature on a pipeline, give examples from the original MIPS.

**Exercise** Discuss the costs of pipelining instructions relative to sequential execution.
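To illustrate the point about fixed-length instructions with operands in the same places, the Python sketch below extracts the register fields of a 32-bit MIPS word without looking at the opcode first. The field positions are the standard MIPS R/I-format layout; the function name and the two sample encodings (`lw $t1, 0($t0)` and `add $t3, $t1, $t2`) are chosen for the illustration.

```python
# Sketch: fixed field positions let register numbers be read before the
# opcode is decoded.

def decode_fields(word):
    """Split a 32-bit MIPS instruction word into its fixed-position fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,   # always the first source register
        "rt": (word >> 16) & 0x1F,   # second source, or lw destination
        "rd": (word >> 11) & 0x1F,   # R-format destination
        "imm": word & 0xFFFF,        # I-format constant
    }


LW_T1_0_T0 = 0x8D090000     # lw  $t1, 0($t0)
ADD_T3_T1_T2 = 0x012A5820   # add $t3, $t1, $t2

lw_fields = decode_fields(LW_T1_0_T0)
add_fields = decode_fields(ADD_T3_T1_T2)
print(lw_fields)
print(add_fields)
```

In both formats `rs` and `rt` sit in the same bit positions, so a register read can start in parallel with opcode decoding; with variable-length or irregular encodings the fields could only be located after decoding.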


# Technical Essay Assignment

See essay descriptions in *Technical Essay Assignments* conversation for this semester in the course group.

Check the RISC-V links in the Reading File (under Background).

Check the case study links in the Reading File (under Background) for REQUIRED ARM material.

Check the group for specifications.


## **Essay 1:** pick one from 1-5

1. Discuss how well the classic MIPS fits RISC criteria by Colwell et al.
2. Discuss how well MIPS fits Wulf's principles
3. Compare ISA: classic MIPS to RISC-V
4. Compare dynamic scheduling/exec schemes
5. Section Se

# Expectations

Contribute small essay in course group


# Concept Review: Pipeline Performance

Performance ultimately depends on how well **conflicts** (=hazards) are handled.

## Conflicts due to dependencies arise while executing instruction streams causing delays

## Conflict types

- Datapath resource
- Program dataflow
- Program decision


Flow is even in a perfect execution pipeline; as in a real one, each pipe contributes equally to the flow.

**Concept Review:** Pipelined Execution

A logical "pipeline" is created by overlapping instr exec

### Misprediction



Stall cycles, like bubbles in a real pipeline, reduce the throughput
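The effect of stall cycles on throughput can be put into numbers with the usual back-of-the-envelope model: effective CPI = ideal CPI + average stall cycles per instruction. The Python sketch below applies it to mispredicted branches. The function name and the sample figures (20% branches, 10% mispredicted, 3-cycle flush penalty) are hypothetical and only there to make the arithmetic concrete.

```python
# Sketch: how misprediction bubbles raise CPI and lower throughput.

def effective_cpi(branch_fraction, mispredict_rate, flush_penalty,
                  ideal_cpi=1.0):
    """Ideal CPI plus average stall cycles per instruction."""
    stalls_per_instr = branch_fraction * mispredict_rate * flush_penalty
    return ideal_cpi + stalls_per_instr


short_pipe = effective_cpi(0.20, 0.10, flush_penalty=3)
long_pipe = effective_cpi(0.20, 0.10, flush_penalty=15)
print(round(short_pipe, 2), round(long_pipe, 2))
print("throughput kept:", round(1.0 / short_pipe, 3), round(1.0 / long_pipe, 3))
```

With the same prediction accuracy, the longer pipeline loses more throughput simply because each wrong guess costs more cycles.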

More to actual latency than nominal number of stages; more to come.

On top of wasted cycles on wrong instructions, a pipeline must be *flushed*, e.g., state rolled back.

**Exercise** Discuss the misprediction cost in long pipelines and complex decision scenarios.

COPYRIGHT 1998 MKP 2005 ELSEVIER, INC. ALL RIGHTS RESERVED



## **Resolving Hazards Exercise**




## **Resolving Hazards Exercise**

Execution timings and CPI here are ideal. We will see why CPI must be measured carefully when

| Orig. # | Instr | Operands         |
|---------|-------|------------------|
| 1       | lw    | \$t1, 0(\$t0)    |
| 2       | lw    | \$t2, 4(\$t0)    |
| 5       | lw    | \$t4, 8(\$t0)    |
| 3       | add   | \$t3, \$t1,\$t2  |
| 4       | sw    | \$t3, 12(\$t0)   |
| 6       | add   | \$t5, \$t1,\$t4  |
| 7       | sw    | \$t5, 16(\$t0)   |

Re-ordered with forwarding
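Cycle counts for this sequence can be cross-checked with a small in-order pipeline model in Python. It is a sketch under stated assumptions: a 5-stage pipeline (IF, ID, EX, MEM, WB), a register file written and then read within the same cycle, forwarding that delivers ALU results to the next EX and loaded values one cycle later, and an original program order inferred from the sequence numbers 1 to 7 in the listing. `total_cycles` and the tuple format are hypothetical.

```python
# Sketch: count cycles for an instruction sequence on a 5-stage pipeline.
# Each instr is (op, dest register or None, list of source registers).

def total_cycles(program, forwarding):
    if not program:
        return 0
    last_writer = {}     # register -> (EX cycle of latest writer, its op)
    prev_ex = 2          # so that the first instr reaches EX in cycle 3
    for op, dest, sources in program:
        ex = prev_ex + 1                 # in order, one cycle apart at best
        for reg in sources:
            if reg in last_writer:
                writer_ex, writer_op = last_writer[reg]
                if not forwarding:
                    gap = 3              # wait for the writer's WB
                elif writer_op == "lw":
                    gap = 2              # loaded value ready after MEM
                else:
                    gap = 1              # ALU result forwarded to next EX
                ex = max(ex, writer_ex + gap)
        if dest is not None:
            last_writer[dest] = (ex, op)
        prev_ex = ex
    return prev_ex + 2                   # MEM and WB of the last instr


LW1 = ("lw", "t1", ["t0"])
LW2 = ("lw", "t2", ["t0"])
ADD3 = ("add", "t3", ["t1", "t2"])
SW4 = ("sw", None, ["t3", "t0"])
LW5 = ("lw", "t4", ["t0"])
ADD6 = ("add", "t5", ["t1", "t4"])
SW7 = ("sw", None, ["t5", "t0"])

ORIGINAL = [LW1, LW2, ADD3, SW4, LW5, ADD6, SW7]
REORDERED = [LW1, LW2, LW5, ADD3, SW4, ADD6, SW7]

print(total_cycles(ORIGINAL, forwarding=False))
print(total_cycles(ORIGINAL, forwarding=True))
print(total_cycles(REORDERED, forwarding=True))
```

The model treats a store like any other consumer at EX, which is slightly conservative; it is meant for checking a diagram drawn by hand, not as a replacement for drawing it.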


