ECE 401/201. Prof. Engin Ipek (Cornell, from high school to Ph.D.).
Textbook: Computer Architecture: A Quantitative Approach, 6th ed., by John L. Hennessy and David A. Patterson.
FSM design: finite state machine.
TA: Ryan Wong; Muhammad Mehdi; 527 CSB office hr.
Blackboard.
Midterm Oct 29. Final Dec 19.
Three verilog projects.
Verilog design projects. expand on basic MIPS R3000 processor.
Final lab: choices.
Sleep bag, take turns.
Pollan; ECE; pluses and minuses; undergrads partner with Ph.D. students.
Award-winning publications.
Verilator: compiles Verilog HDL to C/C++ source code;
ECE 461 (VLSI), 402 (memory), 404 (), 455 (compilers), 456 (OS)
Goals: multi-dimensional opt problem.
correctness (verification), speed, power efficiency, energy, form factor, cost, usability, safety, security, modularity, compatibility, reliability, programmability, scalability, …
ISAs: can only grow, cannot shrink: compatibility.
If history teaches us anything, it is that man, in his quest for knowledge and progress, is determined and cannot be deterred. – John F. Kennedy (1962)
Moore’s Law.
ECE accounts. Simulators.
number of transistors. Moore’s law. recent, TSMC/Samsung.
Soft errors:
RAM: bits flip due to noise, e.g. cosmic rays. 2018: NAND Flash. 11 nanoclous.
Minicomputers/mainframes/supercomputers vs. Micro-processors.
Microprocessors: ~50% performance increase per year; now slowing due to power, wire delays, and soft errors.
CMOS metal layers. M1 pitch.
process nodes shrink by roughly a factor of √2 each generation: 250, 180, 130, 90, 65, 45, …, 14, …, 11 nm
beyond pipelining, to reduce cycles per instruction (CPI):
pwr density: nuclear reactor, rocket nozzle, sun’s surface.
parallelism and locality.
dies on a round wafer.
cost of IC: die, die test, package, package test, yield ratio.
cost of die = $\frac{C_{wafer}}{N_{die} \times Y}$, where $C_{wafer}$ is the wafer cost, $N_{die}$ the dies per wafer, and $Y$ the yield ratio.
Why not a square wafer? Wafers are made by spinning during chemical processing, then sliced horizontally, so they come out round.
Cost trends of RAM: 1978 — 2001
cost of same RAM goes down: yield ratio goes up. learn from failures.
Cost of each generation of RAM goes up.
Throughput vs. latency.
Customer’s workload:
‘Standard’ workload:
Simulation at many levels
Architect: RTL level.
Analytical modeling (e.g. queueing theory): real processors are too complex for it.
How is frequency determined?
arithmetic mean vs. geometric mean
$Speedup_{enhanced}$: if the enhanced portion of the program takes 2 seconds in enhanced mode and 5 seconds in original (non-enhanced) mode, the improvement is 5/2. We call this value $Speedup_{enhanced}$ ($S_e$); it is always greater than 1.
If a fraction $f_e$ of execution time is sped up by $S_e$, the total speedup is $\frac{1}{(1-f_e) + f_e/S_e}$ (Amdahl's Law).
With an infinitely fast enhancement of the $f_e$ fraction, the speedup is bounded by $\frac{1}{1-f_e}$.
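The formulas above can be checked with a minimal Python sketch (the function name and the sample fraction are mine):

```python
def amdahl_speedup(f_e, s_e):
    """Overall speedup when a fraction f_e of execution time
    is sped up by a factor s_e (Amdahl's Law)."""
    return 1.0 / ((1.0 - f_e) + f_e / s_e)

# S_e = 5/2 = 2.5 as in the 2 s vs. 5 s example; assume the enhanced
# portion was half of total runtime:
print(amdahl_speedup(0.5, 2.5))   # 1/(0.5 + 0.2) ~ 1.43

# An infinitely fast enhancement is bounded by 1/(1 - f_e):
print(amdahl_speedup(0.5, 1e12))  # ~ 2.0
```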
Always have big picture in mind
EG: multiprocessor scalability.
CPU perf eval
$ T=N_{clk}*T_{clk}=IC*CPI*T_{clk}$
IC: instruction count; CPI: cycles per instruction.
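As a sanity check, the iron-law equation above in executable form (the numbers are made up):

```python
def exec_time(ic, cpi, t_clk):
    """Iron law of performance: T = IC * CPI * T_clk."""
    return ic * cpi * t_clk

# e.g. 1e9 instructions, CPI = 1.5, 1 GHz clock (T_clk = 1 ns):
print(exec_time(1e9, 1.5, 1e-9))  # ~1.5 seconds
```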
ISA: SW/HW interface.
ISA vs Implementation.
ISA: defines the what.
Impl: defines the how.
Classifying ISAs
type of internal storage:
- stack (abstraction of memory): `add` pops and adds the top two stack entries; JVM: Java bytecode ISA; Internet: keeps instructions small.
- accumulator: `add M` adds M to the accumulator and stores the result in the accumulator.
- register-memory: `add Rd, Rs, M` — the ALU can operate on both memory and registers.
- register-register (load-store): `add Rd, Rs, Rt` — the ALU operates only on registers; much faster by avoiding corner cases; easier to implement.
Compiler Effects/Wishlist
Three more
Little/Big endian
Alignment
Addressing modes:
VAX had them all, for ease of hand-writing assembly code.
Procedure calls
call a function written by somebody else: calling convention.
Caller saved convention:
Callee saved convention:
MIPS:
Instruction Set Encoding
Unary number: 1 1 1 1. Know the instruction without decoding.
binary num: 0b100.
Variable length instructions:
MIPS Arch
J format: jump instructions: j, jal.
branch delay slot: instruction after branch is always executed.
load delay slot: the value returned by a load cannot be used in the next cycle; the result is unspecified if this rule is violated.
RISC: any ISA defined after 1980.
Exe time = IC x CPI x $\tau$
Pipeline:
Datapath phases;
Pipeline: introducing latches between different stages.
IPC better than CPI to compare ISA impl.;
3 kinds of data dependence in the original program:
control dependences
contention for the same HW resource (e.g. register file, non-pipelined FU, divider): resolved by adding more hardware;
Read and Write of register file: RD in 2nd half clock cycle; WB in 1st half.
Two-phase non-overlapping clocks: $\phi_1, \phi_2$;
non-overlap: never high at the same time;
latches vs. flip-flops.
A D-latch: Din, Dout, Clk. Can be open or closed; when closed, it remembers the last input. If the clock is high, the output equals the input ('open'); if the clock is low, the output stays where it is and does not reflect the input ('closed').
Two D-latches form a flip-flop (master-slave).
The high phase is usually shorter than the low phase in the timing diagram.
Are there 2-D/3-D pipelines for parallelism, e.g. one input, 4x output?
Instruction scheduling Basic Block scheduling
Branch delay slot scheduling
beq.likely R4, R6, L : if not taken, squash the delay slot (turn it into a nop)
Static branch prediction:
Dynamic branch prediction.
commit
: An instruction is committed when it is guaranteed to complete. And the result will be visible to the programmer, either in register or memory.
MIPS: WB stage is for committing.
Out-of-order pipeline: 5 stage, 4 load in 10 instructions ==> ~ 400 cycles.
IF -> ID -> { EX | FP adder | FP mult | FP dividers } -> MEM -> WB
Problems: different pipelines can finish at different times (can cause out-of-order commit).
An implementation problem.
RAW:
fmult R1, R2, R3
fadd R4, R3, R5
WAW: fadd executes faster than fmult:
fmult R1 <--
fadd R1 <--
WAR: not a problem. Since only in ID (decode) stage does the read.
fmult <-- R1
fadd R1 <--
Reservation shift register (left shift <–): 0 means the resource is free; 1 means busy. One shift register per conflicting resource; the 'now' bit is the OR across all resources.
When do we know a memory op finishes? Assume an L1 hit (or assume …); recover if wrong –> speculate.
Shift Register: shift in a 0, or shift in a 1;
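A sketch of one such shift register in Python (one instance per conflicting resource; the class and method names are mine):

```python
class ReservationShiftRegister:
    """Structural-hazard shift register as described above:
    bit i == 1 means the resource is busy i cycles from now."""
    def __init__(self, length=8):
        self.bits = [0] * length

    def busy_now(self):
        return self.bits[0] == 1

    def reserve(self, start, duration):
        # Mark the resource busy for `duration` cycles beginning
        # `start` cycles from now (e.g. a non-pipelined divider).
        for i in range(start, start + duration):
            self.bits[i] = 1

    def tick(self):
        # Advance one cycle: shift toward 'now', shifting in a 0.
        self.bits = self.bits[1:] + [0]

rsr = ReservationShiftRegister()
rsr.reserve(0, 3)          # divider busy for 3 cycles
print(rsr.busy_now())      # True
rsr.tick(); rsr.tick(); rsr.tick()
print(rsr.busy_now())      # False
```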
ILP: Instruction level parallelism
EG:
fdiv R5 <--
fadd R5 <-- // stall 2 cycles
add R7 <-- // can be rescheduled to amortize the above stall
sub R8 <--
xor R11 <--
...
no ILP:
add r1, r2, r3
sub r4, r5, r1
xor r6, r7, r4
value speculation: a '90s idea; not worth the silicon.
ILP:
add r1, r2, r3
sub r4, r5, r6
xor r7, r8, r9
Next time: Verilog by TA.
CPU:
Instr queue. Reg file.
0 0
1 1 -> RS0
2 2
fadd 3 3
fadd (clk2. ) 4 4
fmul (clk1. when r1 = RS0) 5 5
/\
|| issue (always in order) ||
\/ ||
||
Reservation station. |===============>||
ready,op1,1-rdy,op2,2-rdy || -> *ready* means all operands of an instr are ready, i.e. each operand holds a value (not a pointer).
RS0 ___1___r0___1___r2____1____fmul_
RS1
RS2
RS3 ___1___r6___1___r7____1____fadd_
RS4 ___0,__RS0__0___r3____1____fadd_
___________________________________
| | | | ||
Multiplier Adder/Subtractor ||
| | ||
-----------------------------------
fmul r1, r0, r2
fadd r4, r1, r3
fsub r5, r6, r7
CLK:
0, start, fmul, fadd, fadd in queue
1, fmul out of queue;
r0, r2: ready;
fmul: ready;
fmul, r1 <- RS0;
2, fadd, src reg:
r1: pointer to RS0, not ready ;
r3: value, ready;
fadd ready = r1.rdy && r3.rdy = no
r4: RS4.
fmul, executing,
3, fsub,
r6, r7, instr: all ready;
r5: RS3
4, fadd, r1: not ready
fsub, r6: exe 1st cycle (only 1 cycle to finish).
fmul, exe: 3rd cycle of exec (need 4 cycles to finish).
5, fmul, exe: 4th cycle
fsub, r6: done. RS3 result. Finished execution stage, will WB in next stage.
6, fsub, r6: result -1, to write back: **broadcast**: <-1, r5, RS3>: to all RS and Reg files.
all RS and register file **snooping** the common data bus to receive the broadcast
- reg file get the broadcast and set r5 = -1;
fmul: 5th cycle, done exe.
7, fmul, broadcast, <0, r1, RS0>
RS4: see the broadcast, RS0 = 0 in its operands, and set it to ready;
reg file: see r1, and set r1 = 0;
8, fadd, executing (1 cycle left)
9, fadd, broadcast: <3, r4, RS4>
reg file: r4 = 3.
EG2:
fmul r4, r1, r2
fadd r4, r5, r6
CPU
Instr queue. Reg file.
0 0
1 1 -> RS0
2 2
3 3
fadd (clk2. ) 4 4
fmul (clk1. when r1 = RS0) 5 5
/\
|| issue (always in order) ||
\/ ||
||
Reservation station. |===============>||
ready,op1,1-rdy,op2,2-rdy || -> *ready* means all operands of an instr are ready, i.e. each operand holds a value (not a pointer).
RS0 ___1___r0___1___r2____1____fmul_
RS1
RS2
RS3 ___1___r6___1___r7____1____fadd_
RS4 ___0,__RS0__0___r3____1____fadd_
___________________________________
| | | | ||
Multiplier Adder/Subtractor ||
| | ||
-----------------------------------
cycle
1. issue: RS0 ready, r4 = RS0.
2. issue: RS3 ready, r4 = RS3.
fmul: start exe
3. fmul: exe (2nd cycle, 3 left)
fadd: exe (1st cycle, 1 left)
4. fadd: done, broadcast: <5, r4, RS3>
- reg file: r4 points to RS3, and set r4 = 5.
fmul exe (3rd cycle.)
5. fmul (4th cycle)
6. fmul: broadcast <0, r4, RS0>
register file: **r4 does not point to RS0**, so we do not apply 0 to r4.
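Both examples hinge on the same broadcast rule, sketched here in Python (the data layout is my own; the register file applies a CDB result only if the register still points at the broadcasting RS):

```python
# Operands are tagged either ("val", x) or ("ptr", rs_tag).
regfile = {"r1": ("ptr", "RS0"), "r4": ("ptr", "RS3")}
rs = {"RS4": {"src1": ("ptr", "RS0"), "src2": ("val", 3.0)}}

def broadcast(tag, value, dest_reg):
    # Every reservation station snoops the common data bus:
    # a matching tag is replaced by the broadcast value.
    for entry in rs.values():
        for s in ("src1", "src2"):
            if entry[s] == ("ptr", tag):
                entry[s] = ("val", value)
    # The register file applies it only if dest_reg still points to this
    # RS (cf. EG2: a younger writer may have redirected the register).
    if regfile.get(dest_reg) == ("ptr", tag):
        regfile[dest_reg] = ("val", value)

broadcast("RS0", 0.0, "r1")
print(rs["RS4"]["src1"])   # ('val', 0.0) -- operand now ready
print(regfile["r4"])       # ('ptr', 'RS3') -- unchanged, r4 points elsewhere
```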
HDL vs Prog. Lang
HDLs are concurrent
which happens first:
assign a = ~b;
assign c = d;
NAND/AND Gate:
module NAND (a,b,out);
input a;
input b;
output out;
assign out = ~(a&b);
// assign out = a&b; // AND gate
endmodule
module AND (x,y,out);
input x;
input y;
output out;
wire z;
NAND MyNAND(.a(x), .b(y), .out(z));
assign out = ~z; // AND gate built from a NAND; continuous assignment
endmodule
Buses
module AND8 (x,y,out);
input [7:0] x;
input [7:0] y;
output [7:0] out;
assign out = x&y;
endmodule
Concatenation, Repetition
Timing diagram: CLK, D, Q (Q follows D at the clock edge).
module DFF (d,q,clk);
input clk;
input d;
output q;
reg q;
always @(posedge clk) begin
q <= d;
end
endmodule
// can be negedge as well (and clk any other name)
nonblocking assignment ‘<=’ in sequential always blocks;
(slow down clock freq)
Control flow
always @(posedge clk) begin
if (r == 1'b1) begin // 1-bit literal: 1'b1; n-bit binary: n'b...; n-bit decimal: n'd...
q <= 1'b0;
end
else begin
q <= d;
end
end
endmodule
Combinational always Blocks
LHS must be assigned in every possible case
otherwise implied sequential logic!
always @(sel or a) begin
if (sel == 2'b00) begin // 2-bit compare needs 2-bit literals
z = 1'b0;
end
else if (sel == 2'b01) begin
z = a;
end
end
Extra Hardware?
Verilog is concurrent, C is not.
reg [31:0] regfile [0:7]; // 32 bits wide; 8 registers
wire [31:0] reg2; wire b4r2;
assign reg2 = regfile[2]; // register 2
assign b4r2 = reg2[4]; // bit 4 of reg2
// tri-state device: assign reg2 = b4r2 ? regfile[2] : 32'bz;
// z: high impedance
Verilog Prog:
Branch predictor
initial 5 stage pipeline.
verilator 3.876
verilog/
- ALU.v
- Decoder.v
- EXE.v
- ID.v
- MIPS.v
make ./VMIPS -f tests/cpp/class
// press enter -> next cycle.
IBM 360/91: Tomasulo's algorithm
Single issue OOO processor.
Register renaming.
Memory speculation.
Problem:
Solution:
Dynamic branch prediction.
Note:
EX:
P = 0.9 => P^N = 0.9^16 ≈ 0.18
P = 0.99 => 0.99^16 ≈ 0.85
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ | |holding reg (MSHR) stack alias
(OOO) |/ Table
(R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
Categories of dynamic branch predictions:
Always taken branch predictor:
Backward-Taken, Forward Not Taken:
==> problem: treats every branch instruction equally and always makes the same prediction for a given instruction.
EX:
// assume the branch pattern:
1 1 1 1 0 1 1 1 1 0 1 1 1 1 0
x o o o x x o o o x x o o o x
==> problem: two mispredictions at the end of every loop
Bimodal Predictor
problem:
// Two bit saturating up/down counter
strong taken 11
weak taken 10 predict taken
-------------------------------------------
weak n-taken 01 predict not-taken
strong n-t 00
// transitions:
00 --1--> 01
01 --1--> 10
10 --1--> 11
11 --1--> 11
11 --0--> 10
10 --0--> 01
01 --0--> 00
// init: 10
Ex:
10 11 11 11 11 10 11 11 11 11 10 11 11 11 11
1 1 1 1 0 1 1 1 1 0 1 1 1 1 0
o o o o x o o o o x o o o o x
error: 3/15
accuracy: 80%
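The 3/15 error rate can be reproduced with a small simulation (a sketch; the 2-bit states 00-11 are encoded as integers 0-3):

```python
def bimodal_accuracy(outcomes, init=2):
    """2-bit saturating counter: states 3/2 predict taken,
    1/0 predict not-taken; init=2 is weak taken."""
    state, correct = init, 0
    for out in outcomes:
        pred = 1 if state >= 2 else 0
        correct += (pred == out)
        # Saturating up/down update with the actual outcome.
        state = min(3, state + 1) if out else max(0, state - 1)
    return correct / len(outcomes)

# Loop pattern from the notes: four taken branches, then one not-taken.
print(bimodal_accuracy([1, 1, 1, 1, 0] * 3))  # 0.8 (12/15, as above)
```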
Goal: accurately predict repeating sequences of branch outcomes.
Prediction:
Update:
updating the corresponding PHT entry, the 2-bit saturating up/down counter.
// k=4 in BHT; after training using pattern
// assuming no aliasing, then predicts all repeating sequences
// of length <= k with 100% accuracy.
... 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 ...
------------------------------
PHT 11 00 11
------------------------------
0 1 1
1 1 1
1 1 1
1 1 0
Aliasing when the BHT indexes into the PHT: a corner case that rarely happens dynamically.
local predictor captures self-correlations
90-99% (99.99% on floating point)
What about correlations among different branches?
// an opportunity not exploited by the local predictor
// but global predictor can.
if (A & B){
}
if (B){
}
global branch predictors capture correlations among different branches.
Combine the two
Predict:
A selector (a bimodal predictor) picks which predictor to trust for the current branch
// Meta-predictor //global // local
// (selector)
PC[n] PC[n]
| ---> 00, GHR | ----> BHT
01 | | ---> PHT
10
11
|| || ||
-------------------------------------------------------
||
\/
predict
Branch-  local(0)  global(1)  Meta-    local-    global-   meta-
outcome  pred.     pred.      pred.    training  training  training
0        0         0          0/1      down      down      no change
0        0         1          0/1      down      down      down
0        1         0          0/1      down      down      up
0        1         1          0/1      down      down      no change
1        0         0          0/1      up        up        no change
1        0         1          0/1      up        up        up
1        1         0          0/1      up        up        down
1        1         1          0/1      up        up        no change
We need to predict two other pieces of information in addition to direction
Answer: Branch Target Buffer.
BTB: a cache indexed and tagged by the branch PC; it holds the branch target for each cached branch. -> typically, block size = 1 branch.
cache hit or miss. Reason of miss: 1. branch target not cached; 2. instruction is not a branch.
Update: BTB is updated after decode;
BTB and direction predictor are accessed in parallel.
LLM: the BTB is accessed before decode completes: we don't even know whether this is a branch, yet we fetch based only on the PC's address. Is this safe in a multi-process environment? Why not access the BTB after decode is done?
In MIPS, jal/jr
BTB is not useful for predicting the target of return instructions (a function has multiple call sites, so the return target varies).
Instead, use a structure called the Return Address Stack (RAS)
The RAS forgets entries if it overflows. LLM: how about spilling overflowed entries to memory, and using this HW stack instead of the compiler-managed stack?
BTB direction- what to fetch next
predictor
hit 0 PC+4
hit 1 target from BTB
miss 0 PC+4
miss 1 PC+4
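The table above as a tiny function (the names are mine):

```python
def next_fetch_pc(pc, btb_hit, predict_taken, btb_target):
    """Next-PC selection: redirect fetch only when the BTB hits
    AND the direction predictor says taken; otherwise fall through."""
    if btb_hit and predict_taken:
        return btb_target
    return pc + 4

print(hex(next_fetch_pc(0x1000, True, True, 0x2000)))   # 0x2000
print(hex(next_fetch_pc(0x1000, False, True, 0x2000)))  # 0x1004 (miss -> PC+4)
```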
Problem: the information needed to update the direction predictor is not known for many cycles; waiting for resolved outcomes before predicting makes the predictor stale and inaccurate.
Solution: assuming that the prediction is correct & update the predictor state speculatively (i.e. update direction)
(A copy of cpu figure)
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v x: FIFO x: renaming x: issue queue x:phys x: ROB x:R-RAT
FIFO: FIFO allows disruptions in either the producer (cache) or the consumer (decoder) stage to be hidden.
Goal: rename away false dependences (WAW, WAR)
Architectural registers: specified by ISA (R0-R31) (typically big R)
Physical registers: actually implemented in hardware (e.g. r0 - r511) (typically little r).
Typically, # of r is much greater than # of R. (LLM: why: stores the results for speculatively executed instructions)
Register renaming maps architectural registers to physical registers.
F-RAT table (map: 1 --> 1) Phy Reg File (1 <-- 1) Retire-RAT
entries:
R0: r1 r0 R0
R1: r1 R1
R2: r100 ...
R3:
R4
R5
..
R31: r31 R31
...
r511
Free-List
r400 r312 r5 ...
----------------------
----------------------
F-RAT: points to “speculative state” –> the program cannot see it. R-RAT: points to “architectural state” –> the program can see it.
Free-list: holds list of physical reg IDs available to be written.
Allocate an issue queue entry.
If instruction is a load/store, allocate a load/store queue entry. (issue queue is not FIFO).
Are the operands results of instructions still in the issue queue? To distinguish, use a busy bit for every physical register.
when the instruction commits, update the R-RAT with the architectural->physical mapping of the instruction.
when a branch resolves and is found mispredicted, set a bit in the branch's ROB entry.
when a mispredicted branch reaches the ROB head, recover by:
When is the earliest safe time to return a physical register to the free list?
When the next instruction that renames the same destination register commits.
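How renaming kills WAW/WAR can be sketched in a few lines of Python (structure sizes and naming are mine; commit and free-list recycling are omitted):

```python
class Renamer:
    """Minimal renaming sketch: the F-RAT maps architectural registers
    (big R) to physical registers (little r); a free list supplies fresh
    physical registers for destinations."""
    def __init__(self, num_arch=8, num_phys=16):
        self.rat = {f"R{i}": f"r{i}" for i in range(num_arch)}
        self.free = [f"r{i}" for i in range(num_arch, num_phys)]

    def rename(self, dst, srcs):
        ps = [self.rat[s] for s in srcs]   # read current source mappings
        pd = self.free.pop(0)              # fresh physical reg for dest
        self.rat[dst] = pd                 # later readers see new mapping
        return pd, ps

r = Renamer()
# WAW pair: both write R1, but each gets a distinct physical register,
# so they can complete in any order.
pd1, _ = r.rename("R1", ["R2", "R3"])
pd2, _ = r.rename("R1", ["R4", "R5"])
print(pd1 != pd2)   # True -- the false dependence is gone
```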
(A copy of cpu figure)
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming x: issue queue v:phys x: ROB v:R-RAT
fulfills two critical functions (this is the critical path; it determines clock speed):
instructions can be issued out of order.
issue queue entries:
0 1 2 3 4 5
----------------------
----------------------
| | | | | |
^^^^^^^^^^^^^^^^^^^^ (common data bus; broadcast)^^^^^^^^^^^^^^^^^^^^^^^^^^^^
one entry:
{src1} | {src1.ready} | {src2} | {src2.ready} | {instr.ready} | ...
| | /\ | /\ /\
| \/ | \/ | |
| | |or gate| | |or gate| |and gate|
| -----| --------->---------------|--------->-------|
|compare|-----
| unit |
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src1/2: physical register specifier;
src1/2.ready:
(physical registers the instruction needs)
For instructions that are 'born' ready: e.g. jump instructions.
src1.ready and src2.ready are initialized from the busy bits of the physical registers when the instruction is put into the issue queue.
Use a tree of arbiters to select among the ready instructions.
issue queue entries:
0 1 2 3 4 5 6 7 8 9 10 11
--------------------------------------------
| | | | | | | | | | | |
--------------------------------------------
read bits | | | | | | | | | | | |
\/ \/ \/ \/
Arbiters: arb01 arb02 arb03 arb04 .....
(selection
logic)
arb11 arb12
(child)
arb21
(parent)
One arbiter:
Request from
child/queue entries: R0 R1 R2 R3
Response back to child: G0 G1 G2 G3
group request sent to parent: GR
Response from parent: GG (granted by parent)
GR = or {R0, R1, R2, R3}
G0= GG and R0
G1 = GG and {not R0} and R1
G2 = GG and {not R0/R1} and R2
G3 = GG and {not R0/R1/R2} and R3
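One arbiter node in executable form (a sketch of the fixed-priority grant equations above):

```python
def arbiter(requests, gg=True):
    """One arbiter node: fixed priority R0 > R1 > R2 > ...
    Returns (grants, group_request GR); gg is the grant from the parent."""
    gr = any(requests)                    # GR = OR of all requests
    grants, seen = [], False
    for r in requests:
        # Gi = GG and (no higher-priority request) and Ri
        grants.append(gg and r and not seen)
        seen = seen or r
    return grants, gr

grants, gr = arbiter([False, False, True, True])
print(grants)  # [False, False, True, False] -- only R2 granted
print(gr)      # True
```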
// memory cell (what kind of info does it store? opcode, src1, src2, dest, immediate, or issue queue index)
- wordline (enable): when the wordline is high, the cell drives the bit line.
- bitline (read):
|M|-- wordline --------------
------|-------------- bit line --------
Issue queue wired with memory cells:
0 1 2 3 4 5 6 7 8 9 10 11
-------------------------------------------------
| | | | | | | | | | | |
-------------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------------- bit lines
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
| | | | | | | | | | | |word line
G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11
0 0 0 1 0 0 0 0 0 0 0
W&S = 1 cycle; IPC = 1
W&S = 3 cycles: IPC = 1/3 with one dependence chain; IPC = 1 if >= 3 chains
instruction dependence chains:
a -> b -> c -> d -> …. ->
a -> b -> c -> d -> … ->
a -> b -> c -> d -> … ->
==> shows up as the white (hot) spot in processor thermal die images.
Q&A: given issue queue entries: 128; load queue: 64; store queue: 64; reorder buffer: 512 (instructions);
what is the max physical register file size beyond which performance cannot improve?
RENAMING depends on the sizes of all of these structures, not any single one;
A lockup-free cache does not block future accesses on a miss. By contrast, a blocking cache does.
Hit-under-miss: continue to service hits in the shadow of an ongoing miss.
Miss-under-miss: take a miss in the shadow of multiple outstanding misses.
? What can go wrong?
Two data structures:
when load executes, it first computes the address.
MSHR:
valid,
address,
the load will retry again (be replayed) and it will hit.
One entry:
- instruction pointer
- valid
- address
Tests:
To increase MLP:
Local miss & global miss:
local L2 miss rate = (# L2 misses)/(# L2 accesses) ==> usually "50% - 60%"; local L1 miss rate is 5-6%; data accesses that reach L2 are much more random ==> locality optimization at L2 has opportunity!
global L2 miss rate = (# L2 misses)/(# all memory accesses)
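The two definitions relate multiplicatively, since every L2 access is an L1 miss; a one-line check (numbers from the rough figures above):

```python
def global_miss_rate(l1_miss_rate, l2_local_miss_rate):
    """Global L2 miss rate = fraction of ALL memory accesses that miss
    in L2 = L1 miss rate * local L2 miss rate."""
    return l1_miss_rate * l2_local_miss_rate

# ~5% L1 misses and ~50% local L2 miss rate:
print(global_miss_rate(0.05, 0.50))  # 0.025 -> 2.5% of all accesses miss L2
```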
Unified L2 cache: contains both data and code while L1 has L1-code/L1-data separated caches.
Write hits take longer than read hits (why?)
Latency vs bandwidth.
Latency:
(A copy of cpu figure)
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys x ld/st queue v: ROB v:R-RAT
depend pred.
Examples
add R1, R2, R3
sub R4, R1, R5
sw R1, 0(R7)
lw R4, 8(R12)
==> store and load address can change.
The addresses may be the same or different;
This kind of dependence cannot be resolved in the issue queue (by simply inspecting register specifiers)
==> the load searches the store queue with its effective address (the load also searches the issue queue with the effective address). If there is an older matching store, the load stalls until the store commits.
==> what if an older store has an unresolved effective address? The load stalls until it resolves.
Which store?
load will access the data cache in parallel with store queue search:
A simple one: Load Wait Table (LWT)
an array of one bit wide table
index with bits of the load's PC: |xxx| index |xx
EA: effective address (generated at the EXE stage)
Load instruction: access the LWT, cache, and store queue in parallel:

LWT bit   store queue result        action
1         older unresolved store    load waits
1         matching store forwards   use forwarded value
1         no match                  load uses cache
0         older unresolved store    load speculates from cache
0         matching store forwards   use forwarded value
0         no match                  use cache

Store instruction: search the load queue... ... update the LWT.
Quiz:
Reduce cache associativity: reduce cold miss ratio.
associativity/ num of sets.
Dynamic Random Access Memory (DRAM): differences from SRAM?
Hierarchy of channels, ranks, banks, rows, and columns ==> both parallelism and locality at the same time
Accessed using a command set.
The command set is implemented by a memory controller.
Timing constraints dictate how soon different commands can be issued.
Memory controllers reorder requests to optimize:
DRAM Cells
access transistor
____
----____-----
|
---
--- capacitor
|
Could refresh once every 30 minutes, but needs to be kept at 40 Celsius? Used in superconductor tech.
Reading the DRAM Cell
![A DRAM Cell]()
Reads destroy the cell content (reads are destructive); the content must be written back after a read.
Writing the DRAM Cell to 1:
![figure of DRAM array]()
Precharge all bitlines
Activating a row:
Sense amplifiers hold the last sensed value.
E.G.
Type  Row#  Col#
Rd    1     5
Rd    3     4
Rd    1     3
Rd    3     2
Rd    1     6
Rd    3     9
FCFS (first come first served, in-order): 600ns
Out-of-order:
1,5 –> 1,3 –> 1,6 –> 3,4 –> 3,2 –> 3,9
100ns 20ns 20ns 100ns 20ns 20ns –> 280ns
Exploits the row buffer locality:
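The arithmetic behind the two schedules, as a sketch (assumes a single bank with one open row; a row hit costs 20 ns, a row miss 100 ns):

```python
def schedule_time(requests, row_miss_ns=100, row_hit_ns=20):
    """Total service time for a single bank: a request to the currently
    open row is a row hit; any other row is a miss
    (precharge + activate + read)."""
    open_row, total = None, 0
    for row, col in requests:
        total += row_hit_ns if row == open_row else row_miss_ns
        open_row = row
    return total

fcfs  = [(1, 5), (3, 4), (1, 3), (3, 2), (1, 6), (3, 9)]
reord = [(1, 5), (1, 3), (1, 6), (3, 4), (3, 2), (3, 9)]
print(schedule_time(fcfs))   # 600 ns -- every access is a row miss
print(schedule_time(reord))  # 280 ns -- row-buffer locality exploited
```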
DRAM Memory
DRAM Array + row buffer
T = R C
Inverters cannot be used in DRAM: the bitline carries an analog signal, not a digital one.
Use array + buffer as basic building block.
Leverage the row buffer locality.
1 bank:
cache block: A B C D; // four arrays accessed in parallel
4 * (DRAM array (1k*1k) + row buffer 1k) ===> a bank of 4 arrays: 1k * 4k + 4k
a single bank only exploits spatial locality, not parallelism, so we group banks into a rank.
Leverage the bank parallelism
1 Rank:
Bank 0
Bank 1 + Common Bus
Bank 2 - commands: active, precharge, r/w, refresh
Bank 3 - address: bank ID, row/col ID; A read: bank + col ID (assume already activated)
- data: 64-bits wide;
Cache block size: 64 bytes.
Assume 3 cache blocks x, y, z, resp. in bank0/row1/col4, bank1/r57/col8, bank3/r1/col
Read x Read y Read z
How to retrieve the data?
(if there were only one bank in the rank, reads go one by one: precharge, activate, read for each.)
The command schedule with 4 banks in a rank (assume precharge: 8 cycles, activate: 8 cycles, TCL: 4 cycles):
command, time:
Not one RD every 8 cycles, but one every 4 cycles.
command bus: ----RD-----------8cycle--------RD-------------8---------RD---------8------------RD----------time
data bus: --------d0-4 data---------- 4 empty-----------d1-------------------dx ------------------------
Problem:
two requests to different rows within the same bank.
three misses may resolve faster than two: 3 misses to different banks can be overlapped, but 2 misses to the same bank cannot.
More banks help as long as the bus is wide enough not to become the bottleneck.
A Channel: contains one or more ranks;
A Channel Figure: 4 channels around a processor:
Memory controller0 – Rank 0, 1
Memory controller1 – Rank 2, 3
Memory controller2 – Rank 4, 5
Memory Controller3 – Rank 6, 7
Example: the processor walks over an array.
for (i=0; i< 1 billion; i++){
sum += a[i]; // a byte array.
}
// miss rate: 1/cache block.
DRAM Address Mapping: Physical address -> DRAM coordinate (chnl, rank, bank, row, col)
e.g 32-bit physical address, one method of mapping:
0 -------------------------------------------------- 31
| Row | Col | Bank | Rank | Channel | offset | // this keeps all channels busy.
// one XOR gate, improve performance by 2x.
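One possible decoder for such a mapping (the field widths here are my assumptions, not the lecture's):

```python
def map_address(paddr):
    """Split a 32-bit physical address into DRAM coordinates.
    Assumed widths: offset 6b, channel 2b, rank 1b, bank 2b, col 7b,
    row 14b -- low-order bits vary fastest, so streaming accesses
    rotate across all channels."""
    offset  =  paddr        & 0x3F     # 64 B cache block offset
    channel = (paddr >> 6)  & 0x3
    rank    = (paddr >> 8)  & 0x1
    bank    = (paddr >> 9)  & 0x3
    col     = (paddr >> 11) & 0x7F
    row     = (paddr >> 18) & 0x3FFF
    return channel, rank, bank, row, col, offset

# Sequential 64 B blocks rotate across the four channels:
print([map_address(b << 6)[0] for b in range(4)])  # [0, 1, 2, 3]
```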
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys x ld/st queue v: ROB v:R-RAT
depend pred.
Two philosophies to multiple issue:
VLIW.
Superscalar.
Terminology: 4-way superscalar.
4 instructions issued every cycle ==> 4-issue superscalar.
Intel: design & debug(verification) team.
Superscalar, in-order, 5-stage pipeline.
instruction fetch | inst decode(ID) | Execute (exe) | Memory (MEM) | write back (WB)
| |
PC register file | ALU data-cache |
| |
Instruction cache decoder | forwarding data-TLB | ---
<-------------------------------|
instruction TLB
Two ways to increase the bandwidth of memory structures:
i-cache or i-TLB/ d-cache or d-TLB: multiported or banked.
Multiple PCs, latch, decoders, ALU, …
Forwarding network: n^2 growth.
register file: greater degree of multiporting.
(still in order core)
Create if condition for next instruction.
Problem: it's hard to find and execute dependent instructions back-to-back within a single cycle.
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys v ld/st queue v: ROB v:R-RAT
depend pred.
pc: multiplex
i-$, i-TLB, Branch Predictor, BTB, RAS, FIFO ==> bank or multiport.
Decoder: multiple decoders.
Rename Group
1: in parallel
a) access the free list to pick physical reg specifiers for destinations.
b) compare architectural source registers against destination registers of earlier instructions in the group; find the youngest match.
2: in parallel
a) assign the physical reg specifiers picked in 1-a to the matching sources
b) at the same time, write the dest specifiers to the F-RAT
Common data bus (CDB) 1: ————————————————–
issue queue: | | | | | | | |
arbitors: [] []
[]
Common data bus (CDB) 2: ————————————————–
Unified vs Distributed Issue queues:
Unified:
one queue + Two ALUs
Distributed:
Two <one queue, one ALU> pairs
Reducing the number of register file ports:
Problem: an 8-way superscalar requires 24 RF ports
-> Area (and latency, energy) scale quadratically with # of ports.
Solutions:
(1) Replication: two (or more) copies of the same register file.
(2) Pipelining the register file accesses.
MSHR: Miss Status Holding Register. Address stack:
==> multiple misses handled in parallel.
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg stack alias
(OOO) r1 |/ |(MSHR) Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys v ld/st queue v: ROB v:R-RAT
depend pred.
Misprediction occurs for a br instruction:
br reaches the head of the ROB: all instructions before this branch have committed; everything behind it in the pipeline is garbage due to the misprediction, so we can flush the entire pipeline, easily, by setting one bit.
Problem:
A situation in the ROB: garbage insts | br | ld with miss (500 cycles) | –> head
=> we would have to wait 500 cycles for br to reach the head of the ROB before recovering from the misprediction.
=> why not recover while the ld is underway and br is not yet at the head? When is it safe to recover? ==> early recovery:
as soon as the branch executes:
Exploit thread-level parallelism (TLP) by simultaneously executing multiple threads: they can be 1) threads of the same process, or 2) different processes altogether.
Core fusion: fusing multiple cores; 50% of Ipek's Ph.D. thesis.
Why not increase superscalar width instead of designing more cores?
Starting from 4-way superscalar: $Power_{4way} = \alpha C V^2 f$
Problem: How do we keep cores busy?
execute multiple applications at the same time.
parallelize a single threaded application to take advantage of multiple cores.
==> Problem:
If you could revise
the fundamental principles of
computer system design
to improve security...
... what would you change?