## For each of the three designated flow-dependence types, indicate the number of stalls in adjacent producer-consumer pairs as functions of ‘#m” and ‘#x’.

1. Loop Unrolling [40 marks]

Consider the following loop:

loop: l.d    f4,0(r1)    l1

l.d    f6,0(r2)    l2

mul.d  f4,f4,f0    m1

mul.d  f6,f6,f2    m2

s.d    f4,0(r1)    s1

bnez   r1,loop     br

Note: Our default is that FP arithmetics have 4 x-boxes.

a) [4 marks] Using the names ‘l1’ to ‘s1’ for the first six instructions

in the body of this loop, draw the flow-dependence graph for just these

instructions.  Label each arrow with the dependence gap between the

producer and the consumer.

graph:

In what follows, focus on three flow-dependence types: i) FP arith to

FP arith, ii) FP arith to FP store, and iii) FP load to FP arith.  Denote

the number of m-boxes in memory references by ‘#m’, and the number of

x-boxes in FP arithmetics by ‘#x’.  By the way, memrefs have one x-box,

and arithmetics have one (never-used) m-box.

b) [6 marks] For each of the three designated flow-dependence types,

indicate the number of stalls in adjacent producer-consumer pairs as

functions of ‘#m” and ‘#x’.

FP arith —> FP arith =

FP arith —> FP store =

FP load  —> FP arith =

c) [10 marks] Suppose #m = 1 and #x = 4.  How many stalls occur in one

iteration of the loop if it is executed exactly as written?  Don’t

write out a space-time diagram; use the method of Chapter 3.

ans:

e) [10 marks] Unroll the loop twice.  If one reschedules the unrolled

loop optimally, how many stalls are left?  (Keep the branch as the last

instruction.  Show the rescheduled code using the _short_ names.)

ans:

f) [10 marks] Increases the ‘mul-add dependence gap’ to 5 cycles, leaving

everything else unchanged.  Unroll the loop three times.  If one reschedules

the unrolled loop optimally, how many stalls are left?  (Keep the branch as

the last instruction.  Show the rescheduled code using the _short_ names.)

ans:

2. Dynamic Instruction Scheduling I [30 marks]

Imagine that reservation stations only track whether floating-point operands

are valid, and that integer operands appear by magic whenever needed.

a) [10 marks] Stage instructions ‘l1’ to ‘s1’ to reservation stations rs1(l1)

to rs6(s1).  Show the contents of each reservation station.  Indicate both

present and pending operands in each station.  For the loads, you may make

the operand entries ‘blank’.  That is, mark all load dependences as resolved.

For the store, just invent a regular single-operand reservation station.

rs1(l1):

rs2(l2):

rs3(m1):

rs4(m2):

rs5(a1):

rs6(s1):

b) [10 marks] After both loads have completed, but no further action has

occurred, show the contents of each reservation station.  You may mark

entire reservation stations as ‘free’.  Syntax: result[rs9(a9)]; val[f12].

You may alter the tag inside parentheses if warranted.

rs1(l1):

rs2(l2):

rs3(m1):

rs4(m2):

rs5(a1):

rs6(s1):

c) [10 marks] At this point, dispatch the two loads of the second iteration,

viz., ‘l3’ and ‘l4’, into reservation stations ‘rs1’ and ‘rs2’.  Moreover,

let instruction ‘m1’ of the first iteration complete.  Show the contents of

each reservation station.

rs1(l3):

rs2(l4):

rs3(m1):

rs4(m2):

rs5(a1):

rs6(s1):

3. Dynamic Instruction Scheduling II [15 marks]

a) [5 marks] Find a flow dependence in the original code, and show that it

cannot cause harm.

ans:

b) [5 marks] Find an antidependence in the original code, and show that it

cannot cause harm.

ans:

c) [5 marks] Find an output dependence in the original code, and show that

it cannot cause harm.

ans:

4. Load Buffers and Store Buffers [15 marks]

a) [5 marks] Consider two code sequences:

l.d f4,0(r1)      l.d f6,0(r1)

s.d f6,0(r1)      s.d f6,0(r1)

Show that ignoring the register dependence and respecting the memory

dependence provides a uniform, correct solution to issue delay in

both code sequences.

ans:

b) [5 marks] Consider:

s.d f4,0(r1)

l.d f6,0(r1)

Assume these two instructions have been loaded into a store buffer and

a load buffer, respectively.  There is a memory dependence.  If we

delay the issue of the load until the store “completes”, operationally

what do we mean by “complete”?  (Forwarding is a different approach).

ans:

c) [5 marks] A _speculated_ load is allowed to progress through more

stages of an out-of-order pipeline compared to a _speculated_ store.

What is difference and why is it necessary?