ECE 401/201. Prof. Engin Ipek (Cornell, from high school to Ph.D.).
Textbook: Computer Architecture: A Quantitative Approach, 6th ed., by John L. Hennessy and David A. Patterson.
FSM design: finite state machine.
TA: Ryan Wong; Muhammad Mehdi; 527 CSB office hr.
Blackboard.
Midterm Oct 29. Final Dec 19.
Three verilog projects.
Verilog design projects. expand on basic MIPS R3000 processor.
Final lab: choices.
Sleep bag, take turns.
Pollan; ECE; pluses and minuses; undergrads partner with Ph.D. students.
Award-winning publications.
Verilator: compiles Verilog HDL to C/C++ source code;
ECE 461 (VLSI), 402 (memory), 404 (), 455 (compilers), 456 (OS)
Goals: multi-dimensional opt problem.
correctness (verification), speed, power efficiency, energy, form factor, cost, usability, safety, security, modularity, compatibility, reliability, programmability, scalability, …
ISAs: can only grow, cannot shrink: compatibility.
If history teaches us anything, it is that man, in his quest for knowledge and progress, is determined and cannot be deterred. – John F. Kennedy (1962)
Moore’s Law.
ECE accounts. Simulators.
number of transistors. Moore’s law. recent, TSMC/Samsung.
Soft errors:
RAM: bits flip due to noise, e.g. cosmic rays. 2018: NAND Flash. 11 nanoclous.
Minicomputers/mainframes/supercomputers vs. Micro-processors.
Microprocessors: ~50% performance increase per year; now slowing due to power, wire delays, and soft errors.
CMOS metal layers. M1 pitch.
process nodes shrink by roughly a factor of √2 each generation: 250, 180, 130, 90, 65, 45, …, 14, …, 11 nm
beyond pipelining, to reduce cycles per instruction (CPI):
pwr density: nuclear reactor, rocket nozzle, sun’s surface.
parallelism and locality.
dies on a round wafer.
cost of IC: die, die test, package, package test, yield ratio.
cost of die = $\frac{C_{wafer}}{N_{die} \times Y}$, where $C_{wafer}$ is the wafer cost, $N_{die}$ the dies per wafer, and $Y$ the yield ratio.
Why not a square wafer? Wafers are made by spinning during chemical processing, then sliced horizontally, so they come out round.
Cost trends of RAM: 1978 — 2001
cost of same RAM goes down: yield ratio goes up. learn from failures.
Cost of each generation of RAM goes up.
Throughput vs. latency.
Customer’s workload:
‘Standard’ workload:
Simulation at many levels
Architect: RTL level.
Analytical modeling (e.g. queueing theory): real processors are too complex for it.
How is frequency determined?
arithmetic mean vs. geometric mean
$Speedup_{enhanced}$: if the enhanced portion of the program takes 2 seconds in enhanced mode and 5 seconds in original (non-enhanced) mode, the improvement is 5/2. We call this value $Speedup_{enhanced}$ ($S_e$); it is always greater than 1.
If a fraction $f_e$ of execution time is sped up by $S_e$, the total speedup is $\frac{1}{(1-f_e) + f_e/S_e}$ (Amdahl's Law).
With an infinitely fast enhancement of the $f_e$ fraction, the speedup is bounded by $\frac{1}{1-f_e}$.
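The formulas above can be checked with a minimal Python sketch (the function name and the sample fraction are mine):

```python
def amdahl_speedup(f_e, s_e):
    """Overall speedup when a fraction f_e of execution time
    is sped up by a factor s_e (Amdahl's Law)."""
    return 1.0 / ((1.0 - f_e) + f_e / s_e)

# S_e = 5/2 = 2.5 as in the 2 s vs. 5 s example; assume the enhanced
# portion was half of total runtime:
print(amdahl_speedup(0.5, 2.5))   # 1/(0.5 + 0.2) ~ 1.43

# An infinitely fast enhancement is bounded by 1/(1 - f_e):
print(amdahl_speedup(0.5, 1e12))  # ~ 2.0
```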
Always have big picture in mind
EG: multiprocessor scalability.
CPU perf eval
$ T=N_{clk}*T_{clk}=IC*CPI*T_{clk}$
IC: instruction count; CPI: cycles per instruction.
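As a sanity check, the iron-law equation above in executable form (the numbers are made up):

```python
def exec_time(ic, cpi, t_clk):
    """Iron law of performance: T = IC * CPI * T_clk."""
    return ic * cpi * t_clk

# e.g. 1e9 instructions, CPI = 1.5, 1 GHz clock (T_clk = 1 ns):
print(exec_time(1e9, 1.5, 1e-9))  # ~1.5 seconds
```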
ISA: SW/HW interface.
ISA vs Implementation.
ISA: defines the what.
Impl: defines the how.
Classifying ISAs
type of internal storage:
- stack (abstraction of memory): `add` pops and adds the top two stack entries; JVM: Java bytecode ISA; Internet: keeps instructions small.
- accumulator: `add M` adds M to the accumulator and stores the result in the accumulator.
- register-memory: `add Rd, Rs, M` — the ALU can operate on both memory and registers.
- register-register (load-store): `add Rd, Rs, Rt` — the ALU operates only on registers; much faster by avoiding corner cases; easier to implement.
Compiler Effects/Wishlist
Three more
Little/Big endian
Alignment
Addressing modes:
VAX had them all, for ease of hand-writing assembly code.
Procedure calls
call a function written by somebody else: calling convention.
Caller saved convention:
Callee saved convention:
MIPS:
Instruction Set Encoding
Unary number: 1 1 1 1. Know the instruction without decoding.
binary num: 0b100.
Variable length instructions:
MIPS Arch
J format: jump instructions: j, jal.
branch delay slot: instruction after branch is always executed.
load delay slot: the value returned by a load cannot be used in the next cycle; the result is unspecified if this rule is violated.
RISC: any ISA defined after 1980.
Exe time = IC x CPI x $\tau$
Pipeline:
Datapath phases;
Pipeline: introducing latches between different stages.
IPC better than CPI to compare ISA impl.;
3 kinds of data dependence in the original program:
control dependences
contention for the same HW resource (e.g. register file, non-pipelined FU, divider): resolved by adding more hardware;
Read and Write of register file: RD in 2nd half clock cycle; WB in 1st half.
Two-phase non-overlapping clocks: $\phi_1, \phi_2$;
non-overlap: never high at the same time;
latches vs. flip-flops.
A D-latch: Din, Dout, Clk. Can be open or closed; when closed, it remembers the last input. If the clock is high, the output equals the input ('open'); if the clock is low, the output stays where it is and does not reflect the input ('closed').
Two D-latches form a flip-flop (master-slave).
The high phase is usually shorter than the low phase in the timing diagram.
Are there 2-D/3-D pipelines for parallelism, e.g. one input, 4x output?
Instruction scheduling Basic Block scheduling
Branch delay slot scheduling
beq.likely R4, R6, L : if not taken, squash the delay slot (turn it into a nop)
Static branch prediction:
Dynamic branch prediction.
commit
: An instruction is committed when it is guaranteed to complete. And the result will be visible to the programmer, either in register or memory.
MIPS: WB stage is for committing.
Out-of-order pipeline: 5 stage, 4 load in 10 instructions ==> ~ 400 cycles.
IF -> ID -> { EX | FP adder | FP mult | FP dividers } -> MEM -> WB
Problems: different pipelines can finish at different times (can cause out-of-order commit).
An implementation problem.
RAW:
fmult R1, R2, R3
fadd R4, R3, R5
WAW: fadd executes faster than fmult:
fmult R1 <--
fadd R1 <--
WAR: not a problem. Since only in ID (decode) stage does the read.
fmult <-- R1
fadd R1 <--
Reservation shift register (left shift <–): 0 means the resource is free; 1 means busy. One shift register per conflicting resource; the 'now' bit is the OR across all resources.
When do we know a memory op finishes? Assume an L1 hit (or assume …); recover if wrong –> speculate.
Shift Register: shift in a 0, or shift in a 1;
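A sketch of one such shift register in Python (one instance per conflicting resource; the class and method names are mine):

```python
class ReservationShiftRegister:
    """Structural-hazard shift register as described above:
    bit i == 1 means the resource is busy i cycles from now."""
    def __init__(self, length=8):
        self.bits = [0] * length

    def busy_now(self):
        return self.bits[0] == 1

    def reserve(self, start, duration):
        # Mark the resource busy for `duration` cycles beginning
        # `start` cycles from now (e.g. a non-pipelined divider).
        for i in range(start, start + duration):
            self.bits[i] = 1

    def tick(self):
        # Advance one cycle: shift toward 'now', shifting in a 0.
        self.bits = self.bits[1:] + [0]

rsr = ReservationShiftRegister()
rsr.reserve(0, 3)          # divider busy for 3 cycles
print(rsr.busy_now())      # True
rsr.tick(); rsr.tick(); rsr.tick()
print(rsr.busy_now())      # False
```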
ILP: Instruction level parallelism
EG:
fdiv R5 <--
fadd R5 <-- // stall 2 cycles
add R7 <-- // can be rescheduled to amortize the above stall
sub R8 <--
xor R11 <--
...
no ILP:
add r1, r2, r3
sub r4, r5, r1
xor r6, r7, r4
value speculation: a '90s idea; not worth the silicon.
ILP:
add r1, r2, r3
sub r4, r5, r6
xor r7, r8, r9
Next time: Verilog by TA.
CPU:
Instr queue. Reg file.
0 0
1 1 -> RS0
2 2
fadd 3 3
fadd (clk2. ) 4 4
fmul (clk1. when r1 = RS0) 5 5
/\
|| issue (always in order) ||
\/ ||
||
Reservation station. |===============>||
ready,op1,1-rdy,op2,2-rdy || -> *ready* means all operands of an instr are ready, i.e. each operand holds a value (not a pointer).
RS0 ___1___r0___1___r2____1____fmul_
RS1
RS2
RS3 ___1___r6___1___r7____1____fadd_
RS4 ___0,__RS0__0___r3____1____fadd_
___________________________________
| | | | ||
Multiplier Adder/Subtractor ||
| | ||
-----------------------------------
fmul r1, r0, r2
fadd r4, r1, r3
fsub r5, r6, r7
CLK:
0, start, fmul, fadd, fadd in queue
1, fmul out of queue;
r0, r2: ready;
fmul: ready;
fmul, r1 <- RS0;
2, fadd, src reg:
r1: pointer to RS0, not ready ;
r3: value, ready;
fadd ready = r1.rdy && r3.rdy = no
r4: RS4.
fmul, executing,
3, fsub,
r6, r7, instr: all ready;
r5: RS3
4, fadd, r1: not ready
fsub, r6: exe 1st cycle (only 1 cycle to finish).
fmul, exe: 3rd cycle of exec (need 4 cycles to finish).
5, fmul, exe: 4th cycle
fsub, r6: done. RS3 result. Finished execution stage, will WB in next stage.
6, fsub, r6: result -1, to write back: **broadcast**: <-1, r5, RS3>: to all RS and Reg files.
all RS and register file **snooping** the common data bus to receive the broadcast
- reg file get the broadcast and set r5 = -1;
fmul: 5th cycle, done exe.
7, fmul, broadcast, <0, r1, RS0>
RS4: see the broadcast, RS0 = 0 in its operands, and set it to ready;
reg file: see r1, and set r1 = 0;
8, fadd, executing (1 cycle left)
9, fadd, broadcast: <3, r4, RS4>
reg file: r4 = 3.
EG2:
fmul r4, r1, r2
fadd r4, r5, r6
CPU
Instr queue. Reg file.
0 0
1 1 -> RS0
2 2
3 3
fadd (clk2. ) 4 4
fmul (clk1. when r1 = RS0) 5 5
/\
|| issue (always in order) ||
\/ ||
||
Reservation station. |===============>||
ready,op1,1-rdy,op2,2-rdy || -> *ready* means all operands of an instr are ready, i.e. each operand holds a value (not a pointer).
RS0 ___1___r0___1___r2____1____fmul_
RS1
RS2
RS3 ___1___r6___1___r7____1____fadd_
RS4 ___0,__RS0__0___r3____1____fadd_
___________________________________
| | | | ||
Multiplier Adder/Subtractor ||
| | ||
-----------------------------------
cycle
1. issue: RS0 ready, r4 = RS0.
2. issue: RS3 ready, r4 = RS3.
fmul: start exe
3. fmul: exe (2nd cycle, 3 left)
fadd: exe (1st cycle, 1 left)
4. fadd: done, broadcast: <5, r4, RS3>
- reg file: r4 points to RS3, and set r4 = 5.
fmul exe (3rd cycle.)
5. fmul (4th cycle)
6. fmul: broadcast <0, r4, RS0>
register file: **r4 does not point to RS0**, so we do not apply 0 to r4.
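Both examples hinge on the same broadcast rule, sketched here in Python (the data layout is my own; the register file applies a CDB result only if the register still points at the broadcasting RS):

```python
# Operands are tagged either ("val", x) or ("ptr", rs_tag).
regfile = {"r1": ("ptr", "RS0"), "r4": ("ptr", "RS3")}
rs = {"RS4": {"src1": ("ptr", "RS0"), "src2": ("val", 3.0)}}

def broadcast(tag, value, dest_reg):
    # Every reservation station snoops the common data bus:
    # a matching tag is replaced by the broadcast value.
    for entry in rs.values():
        for s in ("src1", "src2"):
            if entry[s] == ("ptr", tag):
                entry[s] = ("val", value)
    # The register file applies it only if dest_reg still points to this
    # RS (cf. EG2: a younger writer may have redirected the register).
    if regfile.get(dest_reg) == ("ptr", tag):
        regfile[dest_reg] = ("val", value)

broadcast("RS0", 0.0, "r1")
print(rs["RS4"]["src1"])   # ('val', 0.0) -- operand now ready
print(regfile["r4"])       # ('ptr', 'RS3') -- unchanged, r4 points elsewhere
```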
HDL vs Prog. Lang
HDLs are concurrent
which happens first:
assign a = ~b;
assign c = d;
NAND/AND Gate:
module NAND (a,b,out);
input a;
input b;
output out;
assign out = ~(a&b);
// assign out = a&b; // AND gate
endmodule
module AND (x,y,out);
input x;
input y;
output out;
wire z;
NAND MyNAND(.a(x), .b(y), .out(z));
assign out = ~z; // AND gate built from a NAND; continuous assignment
endmodule
Buses
module AND8 (x,y,out);
input [7:0] x;
input [7:0] y;
output [7:0] out;
assign out = x&y;
endmodule
Concatenation, Repetition
Timing diagram: CLK, D, Q (Q follows D at the clock edge).
module DFF (d,q,clk);
input clk;
input d;
output q;
reg q;
always @(posedge clk) begin
q <= d;
end
endmodule
// can be negedge as well (and clk any other name)
nonblocking assignment ‘<=’ in sequential always blocks;
(slow down clock freq)
Control flow
always @(posedge clk) begin
if (r == 1'b1) begin // 1-bit literal: 1'b1; n-bit binary: n'b...; n-bit decimal: n'd...
q <= 1'b0;
end
else begin
q <= d;
end
end
endmodule
Combinational always Blocks
LHS must be assigned in every possible case
otherwise implied sequential logic!
always @(sel or a) begin
if (sel == 2'b00) begin // 2-bit compare needs 2-bit literals
z = 1'b0;
end
else if (sel == 2'b01) begin
z = a;
end
end
Extra Hardware?
Verilog is concurrent, C is not.
reg [31:0] regfile [0:7]; // 32 bits wide; 8 registers
wire [31:0] reg2; wire b4r2;
assign reg2 = regfile[2]; // register 2
assign b4r2 = reg2[4]; // bit 4 of reg2
// tri-state device: assign reg2 = b4r2 ? regfile[2] : 32'bz;
// z: high impedance
Verilog Prog:
Branch predictor
initial 5 stage pipeline.
verilator 3.876
verilog/
- ALU.v
- Decoder.v
- EXE.v
- ID.v
- MIPS.v
make ./VMIPS -f tests/cpp/class
// press enter -> next cycle.
IBM 360/91: Tomasulo's algorithm
Single issue OOO processor.
Register renaming.
Memory speculation.
Problem:
Solution:
Dynamic branch prediction.
Note:
EX:
P = 0.9 => P^N = 0.9^16 ≈ 0.18
P = 0.99 => 0.99^16 ≈ 0.85
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ | |holding reg (MSHR) stack alias
(OOO) |/ Table
(R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
Categories of dynamic branch predictions:
Always taken branch predictor:
Backward-Taken, Forward Not Taken:
==> problem: treats every branch instruction equally and always makes the same prediction for a given instruction.
EX:
// assume the branch pattern:
1 1 1 1 0 1 1 1 1 0 1 1 1 1 0
x o o o x x o o o x x o o o x
==> problem: two mispredictions at the end of every loop
Bimodal Predictor
problem:
// Two bit saturating up/down counter
strong taken 11
weak taken 10 predict taken
-------------------------------------------
weak n-taken 01 predict not-taken
strong n-t 00
// transitions:
00 --1--> 01
01 --1--> 10
10 --1--> 11
11 --1--> 11
11 --0--> 10
10 --0--> 01
01 --0--> 00
// init: 10
Ex:
10 11 11 11 11 10 11 11 11 11 10 11 11 11 11
1 1 1 1 0 1 1 1 1 0 1 1 1 1 0
o o o o x o o o o x o o o o x
error: 3/15
accuracy: 80%
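The 3/15 error rate can be reproduced with a small simulation (a sketch; the 2-bit states 00-11 are encoded as integers 0-3):

```python
def bimodal_accuracy(outcomes, init=2):
    """2-bit saturating counter: states 3/2 predict taken,
    1/0 predict not-taken; init=2 is weak taken."""
    state, correct = init, 0
    for out in outcomes:
        pred = 1 if state >= 2 else 0
        correct += (pred == out)
        # Saturating up/down update with the actual outcome.
        state = min(3, state + 1) if out else max(0, state - 1)
    return correct / len(outcomes)

# Loop pattern from the notes: four taken branches, then one not-taken.
print(bimodal_accuracy([1, 1, 1, 1, 0] * 3))  # 0.8 (12/15, as above)
```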
Goal: accurately predict repeating sequences of branch outcomes.
Prediction:
Update:
updating the corresponding PHT entry, the 2-bit saturating up/down counter.
// k=4 in BHT; after training using pattern
// assuming no aliasing, then predicts all repeating sequences
// of length <= k with 100% accuracy.
... 1 1 1 1 0 1 1 1 1 0 1 1 1 1 0 ...
------------------------------
PHT 11 00 11
------------------------------
0 1 1
1 1 1
1 1 1
1 1 0
Aliasing when the BHT indexes into the PHT: a corner case that rarely happens dynamically.
local predictor captures self-correlations
90-99% (99.99% on floating point)
What about correlations among different branches?
// an opportunity not exploited by the local predictor
// but global predictor can.
if (A & B){
}
if (B){
}
global branch predictors capture correlations among different branches.
Combine the two
Predict:
A selector (a bimodal predictor) picks which predictor to trust for the current branch
// Meta-predictor //global // local
// (selector)
PC[n] PC[n]
| ---> 00, GHR | ----> BHT
01 | | ---> PHT
10
11
|| || ||
-------------------------------------------------------
||
\/
predict
Branch-  local(0)  global(1)  Meta-    local-    global-   meta-
outcome  pred.     pred.      pred.    training  training  training
0        0         0          0/1      down      down      no change
0        0         1          0/1      down      down      down
0        1         0          0/1      down      down      up
0        1         1          0/1      down      down      no change
1        0         0          0/1      up        up        no change
1        0         1          0/1      up        up        up
1        1         0          0/1      up        up        down
1        1         1          0/1      up        up        no change
We need to predict two other pieces of information in addition to direction
Answer: Branch Target Buffer.
BTB: a cache indexed and tagged by the branch PC; it holds the branch target for each cached branch. -> typically, block size = 1 branch.
cache hit or miss. Reason of miss: 1. branch target not cached; 2. instruction is not a branch.
Update: BTB is updated after decode;
BTB and direction predictor are accessed in parallel.
LLM: the BTB is accessed before decode completes: we don't even know whether this is a branch, yet we fetch based only on the PC's address. Is this safe in a multi-process environment? Why not access the BTB after decode is done?
In MIPS, jal/jr
BTB is not useful for predicting the target of return instructions (a function has multiple call sites, so the return target varies).
Instead, use a structure called the Return Address Stack (RAS)
The RAS forgets entries if it overflows. LLM: how about spilling overflowed entries to memory, and using this HW stack instead of the compiler-managed stack?
BTB direction- what to fetch next
predictor
hit 0 PC+4
hit 1 target from BTB
miss 0 PC+4
miss 1 PC+4
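The table above as a tiny function (the names are mine):

```python
def next_fetch_pc(pc, btb_hit, predict_taken, btb_target):
    """Next-PC selection: redirect fetch only when the BTB hits
    AND the direction predictor says taken; otherwise fall through."""
    if btb_hit and predict_taken:
        return btb_target
    return pc + 4

print(hex(next_fetch_pc(0x1000, True, True, 0x2000)))   # 0x2000
print(hex(next_fetch_pc(0x1000, False, True, 0x2000)))  # 0x1004 (miss -> PC+4)
```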
Problem: the information needed to update the direction predictor is not known for many cycles; waiting for resolved outcomes before predicting makes the predictor stale and inaccurate.
Solution: assuming that the prediction is correct & update the predictor state speculatively (i.e. update direction)
(A copy of cpu figure)
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v x: FIFO x: renaming x: issue queue x:phys x: ROB x:R-RAT
FIFO: FIFO allows disruptions in either the producer (cache) or the consumer (decoder) stage to be hidden.
Goal: rename away false dependences (WAW, WAR)
Architectural registers: specified by ISA (R0-R31) (typically big R)
Physical registers: actually implemented in hardware (e.g. r0 - r511) (typically little r).
Typically, # of r is much greater than # of R. (LLM: why: stores the results for speculatively executed instructions)
Register renaming maps architectural registers to physical registers.
F-RAT table (map: 1 --> 1) Phy Reg File (1 <-- 1) Retire-RAT
entries:
R0: r1 r0 R0
R1: r1 R1
R2: r100 ...
R3:
R4
R5
..
R31: r31 R31
...
r511
Free-List
r400 r312 r5 ...
----------------------
----------------------
F-RAT: points to “speculative state” –> the program cannot see it. R-RAT: points to “architectural state” –> the program can see it.
Free-list: holds list of physical reg IDs available to be written.
Allocate an issue queue entry.
If instruction is a load/store, allocate a load/store queue entry. (issue queue is not FIFO).
Are the operands results of instructions still in the issue queue? To distinguish, use a busy bit for every physical register.
when the instruction commits, update the R-RAT with the architectural->physical mapping of the instruction.
when a branch resolves and is found mispredicted, set a bit in the branch's ROB entry.
when a mispredicted branch reaches the ROB head, recover by:
When is the earliest safe time to return a physical register to the free list?
When the next instruction that renames the same destination register commits.
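How renaming kills WAW/WAR can be sketched in a few lines of Python (structure sizes and naming are mine; commit and free-list recycling are omitted):

```python
class Renamer:
    """Minimal renaming sketch: the F-RAT maps architectural registers
    (big R) to physical registers (little r); a free list supplies fresh
    physical registers for destinations."""
    def __init__(self, num_arch=8, num_phys=16):
        self.rat = {f"R{i}": f"r{i}" for i in range(num_arch)}
        self.free = [f"r{i}" for i in range(num_arch, num_phys)]

    def rename(self, dst, srcs):
        ps = [self.rat[s] for s in srcs]   # read current source mappings
        pd = self.free.pop(0)              # fresh physical reg for dest
        self.rat[dst] = pd                 # later readers see new mapping
        return pd, ps

r = Renamer()
# WAW pair: both write R1, but each gets a distinct physical register,
# so they can complete in any order.
pd1, _ = r.rename("R1", ["R2", "R3"])
pd2, _ = r.rename("R1", ["R4", "R5"])
print(pd1 != pd2)   # True -- the false dependence is gone
```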
(A copy of cpu figure)
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming x: issue queue v:phys x: ROB v:R-RAT
fulfills two critical functions (this is the critical path; it determines clock speed):
instructions can be issued out of order.
issue queue entries:
0 1 2 3 4 5
----------------------
----------------------
| | | | | |
^^^^^^^^^^^^^^^^^^^^ (common data bus; broadcast)^^^^^^^^^^^^^^^^^^^^^^^^^^^^
one entry:
{src1} | {src1.ready} | {src2} | {src2.ready} | {instr.ready} | ...
| | /\ | /\ /\
| \/ | \/ | |
| | |or gate| | |or gate| |and gate|
| -----| --------->---------------|--------->-------|
|compare|-----
| unit |
||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
src1/2: physical register specifier;
src1/2.ready:
(physical registers the instruction needs)
For instructions that are 'born' ready: e.g. jump instructions.
src1.ready and src2.ready are initialized from the busy bits of the physical registers when the instruction is put into the issue queue.
Use a tree of arbiters to select among the ready instructions.
issue queue entries:
0 1 2 3 4 5 6 7 8 9 10 11
--------------------------------------------
| | | | | | | | | | | |
--------------------------------------------
read bits | | | | | | | | | | | |
\/ \/ \/ \/
Arbiters: arb01 arb02 arb03 arb04 .....
(selection
logic)
arb11 arb12
(child)
arb21
(parent)
One arbiter:
Request from
child/queue entries: R0 R1 R2 R3
Response back to child: G0 G1 G2 G3
group request sent to parent: GR
Response from parent: GG (granted by parent)
GR = or {R0, R1, R2, R3}
G0= GG and R0
G1 = GG and {not R0} and R1
G2 = GG and {not R0/R1} and R2
G3 = GG and {not R0/R1/R2} and R3
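One arbiter node in executable form (a sketch of the fixed-priority grant equations above):

```python
def arbiter(requests, gg=True):
    """One arbiter node: fixed priority R0 > R1 > R2 > ...
    Returns (grants, group_request GR); gg is the grant from the parent."""
    gr = any(requests)                    # GR = OR of all requests
    grants, seen = [], False
    for r in requests:
        # Gi = GG and (no higher-priority request) and Ri
        grants.append(gg and r and not seen)
        seen = seen or r
    return grants, gr

grants, gr = arbiter([False, False, True, True])
print(grants)  # [False, False, True, False] -- only R2 granted
print(gr)      # True
```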
// memory cell (what kind of info does it store? opcode, src1, src2, dest, immediate, or issue queue index)
- wordline (enable): when the wordline is high, the cell drives the bit line.
- bitline (read):
|M|-- wordline --------------
------|-------------- bit line --------
Issue queue wired with memory cells:
0 1 2 3 4 5 6 7 8 9 10 11
-------------------------------------------------
| | | | | | | | | | | |
-------------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------------- bit lines
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
--------------------------------------------
| M | M | M | M | M | M | M | M | M | M | M |
| | | | | | | | | | | |
| | | | | | | | | | | |word line
G0 G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11
0 0 0 1 0 0 0 0 0 0 0
W&S = 1 cycle; IPC = 1
W&S = 3 cycles: IPC = 1/3 with one dependence chain; IPC = 1 if >= 3 chains
instruction dependence chains:
a -> b -> c -> d -> …. ->
a -> b -> c -> d -> … ->
a -> b -> c -> d -> … ->
==> shows up as the white (hot) spot in processor thermal die images.
Q&A: given issue queue entries: 128; load queue: 64; store queue: 64; reorder buffer: 512 (instructions);
what is the max physical register file size beyond which performance cannot improve?
RENAMING depends on the sizes of all of these structures, not any single one;
A lockup-free cache does not block future accesses on a miss. By contrast, a blocking cache does.
Hit-under-miss: continue to service hits in the shadow of an ongoing miss.
Miss-under-miss: take a miss in the shadow of multiple outstanding misses.
? What can go wrong?
Two data structures:
when load executes, it first computes the address.
MSHR:
valid,
address,
the load will retry again (be replayed) and it will hit.
One entry:
- instruction pointer
- valid
- address
Tests:
To increase MLP:
Local miss & global miss:
local L2 miss rate = (# L2 misses)/(# L2 accesses) ==> usually "50% - 60%"; local L1 miss rate is 5-6%; data accesses that reach L2 are much more random ==> locality optimization at L2 has opportunity!
global L2 miss rate = (# L2 misses)/(# all memory accesses)
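The two definitions relate multiplicatively, since every L2 access is an L1 miss; a one-line check (numbers from the rough figures above):

```python
def global_miss_rate(l1_miss_rate, l2_local_miss_rate):
    """Global L2 miss rate = fraction of ALL memory accesses that miss
    in L2 = L1 miss rate * local L2 miss rate."""
    return l1_miss_rate * l2_local_miss_rate

# ~5% L1 misses and ~50% local L2 miss rate:
print(global_miss_rate(0.05, 0.50))  # 0.025 -> 2.5% of all accesses miss L2
```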
Unified L2 cache: contains both data and code while L1 has L1-code/L1-data separated caches.
Write hits take longer than read hits (why?)
Latency vs bandwidth.
Latency:
(A copy of cpu figure)
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys x ld/st queue v: ROB v:R-RAT
depend pred.
Examples
add R1, R2, R3
sub R4, R1, R5
sw R1, 0(R7)
lw R4, 8(R12)
==> store and load address can change.
The addresses may be the same or different;
This kind of dependence cannot be resolved in the issue queue (by simply inspecting register specifiers)
==> the load searches the store queue with its effective address (the load also searches the issue queue with the effective address). If there is an older matching store, the load stalls until the store commits.
==> what if an older store has an unresolved effective address? The load stalls until it resolves.
Which store?
load will access the data cache in parallel with store queue search:
A simple one: Load Wait Table (LWT)
an array of one bit wide table
index with bits of the load's PC: |xxx| index |xx
EA: effective address (generated at the EXE stage)
Load instruction: access the LWT, cache, and store queue in parallel:

LWT bit   store queue result        action
1         older unresolved store    load waits
1         matching store forwards   use forwarded value
1         no match                  load uses cache
0         older unresolved store    load speculates from cache
0         matching store forwards   use forwarded value
0         no match                  use cache

Store instruction: search the load queue... ... update the LWT.
Quiz:
Reduce cache associativity: reduce cold miss ratio.
associativity/ num of sets.
Dynamic Random Access Memory (DRAM): differences from SRAM?
Hierarchy of channels, ranks, banks, rows, and columns ==> both parallelism and locality at the same time
Accessed using a command set.
The command set is implemented by a memory controller.
Timing constraints dictate how soon different commands can be issued.
Memory controllers reorder requests to optimize:
DRAM Cells
access transistor
____
----____-----
|
---
--- capacitor
|
Could refresh once every 30 minutes, but needs to be kept at 40 Celsius? Used in superconductor tech.
Reading the DRAM Cell
![A DRAM Cell]()
Reads destroy the cell content (reads are destructive); the content must be written back after a read.
Writing the DRAM Cell to 1:
![figure of DRAM array]()
Precharge all bitlines
Activating a row:
Sense amplifiers hold the last sensed value.
E.G.
Type  Row#  Col#
Rd    1     5
Rd    3     4
Rd    1     3
Rd    3     2
Rd    1     6
Rd    3     9
FCFS (first come first served, in-order): 600ns
Out-of-order:
1,5 –> 1,3 –> 1,6 –> 3,4 –> 3,2 –> 3,9
100ns 20ns 20ns 100ns 20ns 20ns –> 280ns
Exploits the row buffer locality:
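The arithmetic behind the two schedules, as a sketch (assumes a single bank with one open row; a row hit costs 20 ns, a row miss 100 ns):

```python
def schedule_time(requests, row_miss_ns=100, row_hit_ns=20):
    """Total service time for a single bank: a request to the currently
    open row is a row hit; any other row is a miss
    (precharge + activate + read)."""
    open_row, total = None, 0
    for row, col in requests:
        total += row_hit_ns if row == open_row else row_miss_ns
        open_row = row
    return total

fcfs  = [(1, 5), (3, 4), (1, 3), (3, 2), (1, 6), (3, 9)]
reord = [(1, 5), (1, 3), (1, 6), (3, 4), (3, 2), (3, 9)]
print(schedule_time(fcfs))   # 600 ns -- every access is a row miss
print(schedule_time(reord))  # 280 ns -- row-buffer locality exploited
```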
DRAM Memory
DRAM Array + row buffer
T = R C
Inverters cannot be used in DRAM: the bitline carries an analog signal, not a digital one.
Use array + buffer as basic building block.
Leverage the row buffer locality.
1 bank:
cache block: A B C D; // four arrays accessed in parallel
4 * (DRAM array (1k*1k) + row buffer 1k) ===> a bank of 4 arrays: 1k * 4k + 4k
a single bank only exploits spatial locality, not parallelism, so we group banks into a rank.
Leverage the bank parallelism
1 Rank:
Bank 0
Bank 1 + Common Bus
Bank 2 - commands: active, precharge, r/w, refresh
Bank 3 - address: bank ID, row/col ID; A read: bank + col ID (assume already activated)
- data: 64-bits wide;
Cache block size: 64 bytes.
Assume 3 cache blocks x, y, z, resp. in bank0/row1/col4, bank1/r57/col8, bank3/r1/col
Read x Read y Read z
How to retrieve the data?
(if there were only one bank in the rank, reads go one by one: precharge, activate, read for each.)
The command schedule with 4 banks in a rank (assume precharge: 8 cycles, activate: 8 cycles, TCL: 4 cycles):
command, time:
Not one RD every 8 cycles, but one every 4 cycles.
command bus: ----RD-----------8cycle--------RD-------------8---------RD---------8------------RD----------time
data bus: --------d0-4 data---------- 4 empty-----------d1-------------------dx ------------------------
Problem:
two requests to different rows within the same bank.
three misses may resolve faster than two: 3 misses to different banks can be overlapped, but 2 misses to the same bank cannot.
More banks help as long as the bus is wide enough not to become the bottleneck.
A Channel: contains one or more ranks;
A Channel Figure: 4 channels around a processor:
Memory controller0 – Rank 0, 1
Memory controller1 – Rank 2, 3
Memory controller2 – Rank 4, 5
Memory Controller3 – Rank 6, 7
Example: the processor walks over an array.
for (i=0; i< 1 billion; i++){
sum += a[i]; // a byte array.
}
// miss rate: 1/cache block.
DRAM Address Mapping: Physical address -> DRAM coordinate (chnl, rank, bank, row, col)
e.g 32-bit physical address, one method of mapping:
0 -------------------------------------------------- 31
| Row | Col | Bank | Rank | Channel | offset | // this keeps all channels busy.
// one XOR gate, improve performance by 2x.
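One possible decoder for such a mapping (the field widths here are my assumptions, not the lecture's):

```python
def map_address(paddr):
    """Split a 32-bit physical address into DRAM coordinates.
    Assumed widths: offset 6b, channel 2b, rank 1b, bank 2b, col 7b,
    row 14b -- low-order bits vary fastest, so streaming accesses
    rotate across all channels."""
    offset  =  paddr        & 0x3F     # 64 B cache block offset
    channel = (paddr >> 6)  & 0x3
    rank    = (paddr >> 8)  & 0x1
    bank    = (paddr >> 9)  & 0x3
    col     = (paddr >> 11) & 0x7F
    row     = (paddr >> 18) & 0x3FFF
    return channel, rank, bank, row, col, offset

# Sequential 64 B blocks rotate across the four channels:
print([map_address(b << 6)[0] for b in range(4)])  # [0, 1, 2, 3]
```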
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys x ld/st queue v: ROB v:R-RAT
depend pred.
Two philosophies to multiple issue:
VLIW.
Superscalar.
Terminology: 4-way superscalar.
4 instructions issued every cycle ==> 4-issue superscalar.
Intel: design & debug(verification) team.
Superscalar, in-order, 5-stage pipeline.
instruction fetch | inst decode(ID) | Execute (exe) | Memory (MEM) | write back (WB)
| |
PC register file | ALU data-cache |
| |
Instruction cache decoder | forwarding data-TLB | ---
<-------------------------------|
instruction TLB
Two ways to increase the bandwidth of memory structures:
i-cache or i-TLB/ d-cache or d-TLB: multiported or banked.
Multiple PCs, latch, decoders, ALU, …
Forwarding network: n^2 growth.
register file: greater degree of multiporting.
(still in order core)
Create if condition for next instruction.
Problem: it's hard to find and execute dependent instructions back-to-back within a single cycle.
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg (MSHR) stack alias
(OOO) r1 |/ Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys v ld/st queue v: ROB v:R-RAT
depend pred.
pc: multiplex
i-$, i-TLB, Branch Predictor, BTB, RAS, FIFO ==> bank or multiport.
Decoder: multiple decoders.
Rename Group
1: in parallel
a) access the free list to pick physical reg specifiers for destinations.
b) compare architectural source registers against destination registers of earlier instructions in the group; find the youngest match.
2: in parallel
a) assign the physical reg specifiers picked in 1-a to the matching sources
b) at the same time, write the dest specifiers to the F-RAT
Common data bus (CDB) 1: ————————————————–
issue queue: | | | | | | | |
arbitors: [] []
[]
Common data bus (CDB) 2: ————————————————–
Unified vs Distributed Issue queues:
Unified:
one queue + Two ALUs
Distributed:
Two <one queue, one ALU> pairs
Reducing the number of register file ports:
Problem: an 8-way superscalar requires 24 RF ports
-> Area (and latency, energy) scale quadratically with # of ports.
Solutions:
(1) Replication: two (or more) copies of the same register file.
(2) Pipelining the register file accesses.
MSHR: Miss Status Holding Register. Address stack:
==> multiple misses handled in parallel.
Fetch Decode Rename Dispatch Register Read EXE Memory Retire Commit
pc --> i-$ i-TLB Front-end --> Issue Queue Physical ALU | L1 d-$ | |d-TLB reorder
Register alias ------------ reg file |\ buffer (ROB) Retirement
table (F-RAT) | | | | | | |Miss states address register
------------ r0 | |holding reg stack alias
(OOO) r1 |/ |(MSHR) Table
... (R-RAT)
branch Branch _FIFO_ --> Decoder --> _FIFO_ Free List
direction Target |OOO load queue
predictor Buffer Busy | |
Bits | | |in-order store queue
| |
return Outstanding | Memory Dependence
address branch | predictor
stack queue
v v v: FIFO v: renaming v: issue queue v:phys v ld/st queue v: ROB v:R-RAT
depend pred.
Misprediction occurs for a br instruction:
br reaches the head of the ROB: all instructions before this branch have committed; everything behind it in the pipeline is garbage due to the misprediction, so we can flush the entire pipeline, easily, by setting one bit.
Problem:
A situation in the ROB: garbage insts | br | ld with miss (500 cycles) | –> head
=> we would have to wait 500 cycles for br to reach the head of the ROB before recovering from the misprediction.
=> why not recover while the ld is underway and br is not yet at the head? When is it safe to recover? ==> early recovery:
as soon as the branch executes:
Exploit thread-level parallelism (TLP) by simultaneously executing multiple threads: they can be 1) threads of the same process, or 2) different processes altogether.
Core fusion: fusing multiple cores; 50% of Ipek's Ph.D. thesis.
Why not increase superscalar width instead of designing more cores?
Starting from 4-way superscalar: $Power_{4way} = \alpha C V^2 f$
Problem: How do we keep cores busy?
execute multiple applications at the same time.
parallelize a single threaded application to take advantage of multiple cores.
==> Problem:
If you could revise
the fundamental principles of
computer system design
to improve security...
... what would you change?