jMIPS

an open source MIPS processor in Java

3 Optimizing performance in the pipeline
- 3.1 What's new?
- 3.2 Exercises to get familiar with the optimized pipelined processor model

Optimizing performance in the pipeline

The CPU3 process model improves the Read stage in the pipelined model by making it more liberal about letting instructions in to the remainder of the pipeline together at the same time. The result is that the pipeline runs faster:

	unpipelined	pipelined	pipelined+optimized
220 instrs	0.000001580s	0.000001291s	0.000001144s

Can you beat that with still better pipeline optimization and control? You'll find detailed suggestions below, but running the simulator with the -d option will show you many remaining pipeline stalls which you will want to work to avoid in your own manner. Stalls are visible as the absence of an instruction in a pipeline stage as can be seen at numerous points in the printout below from the CPU2 code:

% java CPU/Cpu2 -d hello_mips32
...

,-----------.-----------.-----------.-----------.-----------.
|F          |D          |R          |E          |W          |
`-----------.-----------.-----------.-----------.-----------'
Fetch   0x80030080
,-----------.-----------.-----------.-----------.-----------.
|F0x80030080|D          |R          |E          |W          |
`-----------.-----------.-----------.-----------.-----------'
Decode  0x80030080
Fetch   0x80030084
,-----------.-----------.-----------.-----------.-----------.
|F0x80030084|D0x80030080|R          |E          |W          |
`-----------.-----------.-----------.-----------.-----------'
Read    0x80030080
Decode  0x80030084
Fetch   0x80030088
,-----------.-----------.-----------.-----------.-----------.
|F0x80030088|D0x80030084|R0x80030080|E          |W          |
`-----------.-----------.-----------.-----------.-----------'
Execute 0x80030080
,-----------.-----------.-----------.-----------.-----------.
|F0x80030088|D0x80030084|R          |E0x80030080|W          |
`-----------.-----------.-----------.-----------.-----------'
Write   0x80030080
0:      0.000000005s:   0x80030080:     addiu $29, $29, -32
,-----------.-----------.-----------.-----------.-----------.
|F0x80030088|D0x80030084|R          |E          |W0x80030080|
`-----------.-----------.-----------.-----------.-----------'
Done    0x80030080
Read    0x80030084
Decode  0x80030088
Fetch   0x8003008c
,-----------.-----------.-----------.-----------.-----------.
|F0x8003008c|D0x80030088|R0x80030084|E          |W          |
`-----------.-----------.-----------.-----------.-----------'
...

Each pipeline stage (Fetch, Decode, Read, ...) is reported as it completes in the simulation. You can see that the image of the pipeline shows the instructions that have just completed within each stage. All the output shown above is concerned with just the very first instruction's execution, namely "addiu $29, $29, -32".

A pipeline image is produced after each complete round of stage updates. The extra "Done" message (corresponding to no pipeline stage) is emitted as an instruction leaves the pipeline. It will eventually be where a check for interrupts is performed.

With a perfect pipeline design in this particular CPU, 220 instructions would take 0.000000220s to execute, and the pipeline would always be full (i.e. no stalls), resulting in one instruction completed every clock cycle (every nanosecond), i.e. 1000MIPS (1000 million instructions per second).

With the pipeline more or less only half-full on average, one instruction will complete only about every 2 clock cycles, i.e. 500MIPS (500 million instructions per second). The non-pipelined CPU simulator took about 5 clock ticks to execute every instruction. That's 200MIPS (200 million instructions per second).

Here's the pipeline from CPU3 executing the same instructions as shown for CPU2 above:

% java CPU/Cpu3 -d hello_mips32
...
,-----------.-----------.-----------.-----------.-----------.
|          F|          D|          R|          E|          W|
`-----------'-----------'-----------'-----------'-----------'
Fetch   0x80030080
,-----------.-----------.-----------.-----------.-----------.
|0x80030080F|          D|          R|          E|          W|
`-----------'-----------'-----------'-----------'-----------'
Decode  0x80030080
Fetch   0x80030084
,-----------.-----------.-----------.-----------.-----------.
|0x80030084F|0x80030080D|          R|          E|          W|
`-----------'-----------'-----------'-----------'-----------'
Read    0x80030080
Decode  0x80030084
Fetch   0x80030088
,-----------.-----------.-----------.-----------.-----------.
|0x80030088F|0x80030084D|0x80030080R|          E|          W|
`-----------'-----------'-----------'-----------'-----------'
Execute 0x80030080
,-----------.-----------.-----------.-----------.-----------.
|0x8003008cF|0x80030088D|0x80030084R|0x80030080E|          W|
`-----------'-----------'-----------'-----------'-----------'
Write   0x80030080
0:      0.000000005s:   0x80030080:     addiu $29, $29, -32
Read    0x80030084
Decode  0x80030088
Fetch   0x8003008c
,-----------.-----------.-----------.-----------.-----------.
|0x8003008cF|0x80030088D|0x80030084R|          E|0x80030080W|
`-----------'-----------'-----------'-----------'-----------'
Done    0x80030080
Execute 0x80030084
Read    0x80030088
Decode  0x8003008c
Fetch   0x80030090
,-----------.-----------.-----------.-----------.-----------.
|0x80030090F|0x8003008cD|0x80030088R|0x80030084E|          W|
`-----------'-----------'-----------'-----------'-----------'
Write   0x80030084
1:      0.000000007s:  0x80030084:     sw $31, 28($29)
Execute 0x80030088
Read    0x8003008c
Decode  0x80030090
Fetch   0x80030094
,-----------.-----------.-----------.-----------.-----------.
|0x80030094F|0x80030090D|0x8003008cR|0x80030088E|0x80030084W|
`-----------'-----------'-----------'-----------'-----------'
Done    0x80030084
Write   0x80030088
2:      0.000000008s:  0x80030088:     sw $30, 24($29)
Execute 0x8003008c
Read    0x80030090
Decode  0x80030094
Fetch   0x80030098
,-----------.-----------.-----------.-----------.-----------.
|0x80030098F|0x80030094D|0x80030090R|0x8003008cE|0x80030088W|
`-----------'-----------'-----------'-----------'-----------'
...

The pipeline is noticably fuller. Significantly, there is more than one instruction at a time traversing the Read/Execute/Write trio of stages!

What's new?

The Java source code still has exactly the same structure at top level:

Cpu3
static void main (String args[])	entry point; handles command line options, etc.

CPU3
CPU3 ()	CPU builder
void run ()	fetch/decode/execute cycle

Lower down in the code hierarchy, there's also no difference anywhere except in the internals of the pipeline Read stage.

There are extra dependency-calculation units involved in the CPU. In fact, there are three. They calculate the set of input registers d1.ins, d3.ins, d4.ins on which the instructions respectively in the Read, Execute, Write stages depend, and the sets d1.out, d3.out, d4.out of output registers that they affect. The input to the dependency unit is the configuration information in the little data holder at the entrance to the Read stage of the pipeline.

The dependency calculation unit is represented by a new Depend class in the java source code. It only has one method, which performs the required calculation.

Depend
Depend ()	builds an instruction dependency calculation unit
DependReturnT execute (DecodeReturnT)	return a pair of bitmasks giving input and output register indices

The return type DependReturnT carries the pair "ins, out" of long integer bit masks. In the CPU pipeline, the bit mask calculation

d1.ins & (d3.out | d4.out)

is carried out, and if there is indeed some input register for the instruction about to enter the Read stage ("d1.ins"), which will be amongst the registers written by instructions run by the Execute or Write stages ("d3.out | d4.out") then the entry into the Read stage is postponed and the pipeline stalls.

The code inserted into the pipeline managent code for the Read stage is exactly the following:

 DependReturnT d1 = depend1.execute(conf1);
 DependReturnT d3 = depend3.execute(conf3);
 DependReturnT d4 = depend4.execute(conf4);

 if ((~1 & (d1.ins & (d3.out | d4.out))) != 0) {
     return;                                           // don't let this instr into Read stage!
 }

The "~1" excludes bit zero from consideration. It's the bit representing a dependency on the $0 register, which is always zero no matter what. It does not matter if the instruction tries to read from the $0 register while another instruction ahead of it in the pipeline has yet to write to $0 because the register will not change.

The code has stuck to a high ethical standard by using three different dependency units in order to do the three calculations involved more or less simultaneously, thus "emulating" what the hardware would do inside a single clock cycle. Reusing just one dependency unit to do the three in the hardware would require complicated circuitry and a special three-times clock multiplier inside the unit, and it really would not be feasible.

Exercises to get familiar with the optimized pipelined processor model

You'll find some meaty exercises here. The idea is to improve the pipeline further so that the runtime figures for hello world get better.

Read the "Performance" chapters of Tannenbaum and of Stallings for more ideas.

[edit] Implementing a branch delay slot

Here's a radical idea: don't flush the following instruction in the pipeline after a jump or successful branch!

If you look at the assembler code generated for the Hello world program, you will see that most every jump and branch instruction is followed by a no-op. This position is called the `branch delay slot'.

Many real MIPS architectures do not flush the pipeline after a jump or a branch has rendered inappropriate the prefetched instructions following on behind. That is because it would stall the pipeline (evidently!), and since branches and jumps are common (30% of a real code load), the pipeline would perform badly on average.

Instead, the architecture lets the pipeline drain naturally for one instruction more, and relies on the compiler to make sure that that `delay slot' behind the jump or branch contains something inoffensive or even useful if executed.

It is common, for example, for a compiler to put the stack shift that precedes a subroutine return in the delay slot. It gets executed after the jump exits the subroutine on such an architecture. It will be executed just before the calling code resumes.

While the machine code for Hello world has not been generated specifically for that context, the compiler has carefully avoided placing in the delay slots anything that might be harmful, just in case it does get executed (in some hypothetical universe). So there is no harm in executing it, and there is just a chance that the speed might possibly improve overall. It is not going to be a loss. We were going to "waste" the one cycle the no-op takes in starting refilling the pipeline. We might as well execute a no-op at the same time.

But the real speed-up, if any, is going to come when we do a JAL to a subroutine and we execute the instruction in the delay-slot that should notionally follow the return from the subroutine before arriving in the subroutine.

The compiler will have ensured that the shuffle is semantically harmless, and it might be helpful.

If the compiler knows its socks, it will be helpful. The result from that instruction will be available way ahead of time, which might prevent the instruction after it from stalling waiting on the result.

It is hard to imagine how getting the result you want way ahead of time could slow things down. Perhaps when the result occupies a register that could otherwise be used by the compiler to avoid storing an intermediate in memory. But MIPS has plenty of registers, so the compiler is not likely to be tight for those.

It is simple to "not" do something. Just drop the code that does it:

At the end of the pipeline code, inside the do_write_tick method, you will find the comment "jump completions need us to abort anything else pipelined behind ...", and similarly for branch. The comment precedes the business end of a flush:

conf2 = null; conf1 = null; conf0 = null;

which wipes out the instructions within each of the instruction holders on the entries to the stages following behind.

You just have to not set to null the first of those (conf2, conf1, conf0, in that order) which is not already null. That is the delay slot instruction. It is alright to continue to null the rest - indeed, you must.

However, that's not quite the end of it. For a JAL and JALR instruction, the following JR no longer should return from the subroutine to the current PC+4 (the next instruction), but to PC+8 (the next but one instruction - the next instruction is the one in the delay slot and it already got executed after the jump itself and before the first instruction in the subroutine). So JAL and JALR have to load the RA register not with PC+4, but with PC+8. That's in the Write stage.

You need to look very carefully indeed at the way JALR and JAL pass through the Write stage and arrange by hook or by crook that that load to RA happens the new way, with PC+8 instead of PC+4.

JALR is an opcode 0 instruction, so it is dealt with in the stanza in the Write stage that deals with the register instructions. There is an opportunity to alter what happens with JALR before the write to the register unit in the stanza, and also an opportunity after. I would probably choose to add an intercept "before", where there is already a s.c=conf.pc insert for JALR. That can be changed to s.c=conf.pc+4.

The s.c value is written to the target register for the instruction, which in the case of JALR is RA. The conf.pc value is the PC value for the next instruction after the jump.

JAL is simpler to figure out, since it and J get their own dedicated little stanza in the Write stage code and it is easy to see what happens for JAL and change it just slightly from an RA.write(conf.pc) to a RA.write(conf.pc+4).

For extra kudos, do the "+4" using an extra specially dedicated ALU. That's how hardware does arithmetic.

If you don't manage the JAL and JALR modification for the jump return, don't worry too much: the delay slot instruction is just going to be executed twice each time :-). Once after the jump to the subroutine and once on the return from the subroutine. That's likely not harmful, especially as it's often a no-op. When it's not a no-op, well, then, yes, it could be harmful. Sometimes it's a LUI or other instruction that can be executed twice harmlessly. Sometimes it's not. You could probably ride your luck and get a sensible execution out of it all. But executing instructions twice will certainly ruin your execution timings.

A truly amusing side-effect of this change is that each subroutine as it ends (with a JR back to the caller) accidentally executes the first instruction of the subroutine following it in the memory map. The compiler has taken care in constructing the layout to make sure the accident is harmless!

Watch the execution trace to spot the extra instructions and see how the effect of each such 'little accident' is naturally nullified by an instruction following soon after.

Why hasn't the compiler just padded every subroutine with an extra few no-ops before the front? Probably because making subroutines further apart would negatively affect the program cache.

[edit] Changing the default prefetch prediction mechanism

Here's a hint as to how to" improve" the pipeline prefetch mechanism.

Note that this will not natively mix well with the previous hint on how to arrange to always execute the `delay slot' instructions, and so I'll finish up with extra instructions on how to do both at once.

The defaukt prefetch always brings in the instruction at PC+4 as a best guess for what comes next. The relevant code is in the Fetch stage. There it reads the PC register with

pc=PC.read()

modifies the value by adding four, fills the IR with

IR.write(memory.read32be(pc))

as the prefetch guess, and sets the PC with

PC.write(pc)

To change the default prefetch guess, all you need to alter is the "modifies ... by adding four".

If a straight jump (J or JAL) instruction is just ahead in the pipeline, in the Decode stage, then you know that it is going to set the PC if it ever gets to the Write stage, so why not guess that it will and set the PC the way the jump wants now? If we're wrong, then the jump will be preempted by some other branch instruction reaching the Write stage first and flushing the whole pipeline clean anyway, so there is not a lot to lose.

So change the Fetch stage to check the instruction in the conf1 holder that should contain the instruction that is exiting the Decode stage (the way the pipeline is executed in the simulation, the stages ahead 'go first'). The fields should be filled out. If conf1.op is J or JAL then there is work to do. The jjjjjj...jjjjj field of conf1 should by then contain the jump target address (but check .. see how the jjjjjj...jjjj field is used to generate the address in the Write stage code), so set pc=conf1.jjjjj...jjjjj instead of "adding four".

Now in the do_write_tick code you have to sometimes suppress the blind pipeline flush that follows a J or JAL instruction.

As in the delay slot hint, use the comment "jump completions need us to abort anything else pipelined behind ...", to locate the business end of the flush:

conf2 = null; conf1 = null; conf0 = null;

Ensure that you have the code that triggers on jump and not branch. If necessary, separate the conditional if into two stanzas, one for jumps and one for branches. In the/a stanza for a J or JAL instruction you have to check the immediately following instruction's address by looking for the first non-null value among conf2, conf1, conf0, and reading the conf.pc field from it. If it is 4 more than the address of the jump instruction target, it is right, and one should not flush the pipeline. If it is not right, then the flush should proceed as usual. I'm told by people who've done it that if you get the prefetch guess perfect each time for a J and JAL then you don't ever have to check for whether to flush behind a J and JAL or not, just rely on it and never flush.

That's all.

Of course, you can improve this to take account of JR and JALR instructions too. If there is one of those ahead in the Read stage (and nothing between Read and Fetch stages that might also change the PC), then the register containing the address they will jump to has been read (into the r.a field, I believe, but check!) and you can fish out the address value and set the PC from that in the Fetch stage, just as you have already done for J and JAL. Remember to also extend the suppress of the blind pipeline flush in do_write_tick.

Why stop there? Aren't branch instructions ahead also fair game for this play? By the time they are exiting the Exec stage it is known if they will branch or not. But the branch target address itself is not known until they exit the Write stage, which is a bit late. By then the branch instructions themselves would be changing the PC, so it's hardly "guessing" to have the Fetch stage do it on its own account.

If you ask me, one could move the calculation of the branch target address backwards into the Exec stage. Sure, one could continue to write the PC from it only in the Write stage, but the target address could be readied in the Exec stage.

Now the branch is fair game for the prefetch to use as an early cue. If there is one ahead in the Exec stage (and nothing between Exec and Fetch that might also change the PC) then if the branch is successful (the s.z field is set in the Exec stage) you can use the calculated branch target address to set the PC in the Fetch stage. Remember to extend the suppression of the blind pipeline flush in the do_write_tick code to cover this case too.

If you wanted to improve on even that, you would guess in Fetch which way the branch ahead were going to hop while it was still in the Read or even Decode stage, using a little cache that records which way the branches have jumped in the past. I used a 16-entry 1-way cache when I tried this. The cache records for each of the last 16 branch addresses that have been seen the address of the eventual branch target (PC+4 in the case when the branch fails). The information in the `branch cache' is brought up to date whenever a branch exits the Write stage.

You can use that same cache to also `remember' where JR and JALR instructions jumped to last time, and use that to guess where they will jump to again while they are still in the Decode stage (they only figure out for sure where they are going in the Read stage, when they actually read the register). Guess right and you avoid one `erronious' instruction following on behind the JR or JALR being flushed when the jump eventually gets around to happening in the Write stage. Guess wrong, and you lose nothing.

It's quite tricky doing all of this without getting into a mess, but the J and JAL intercepts I suggested to start with are easy. Taking it along further from there requires excellent programming skills and close control of all the interactions that develop, so go easy, test after every single change, plan your changes to be testable as you go, line by line (old programming maxim: the universe bites, don't give it a chance to), and be ready to back out one of your planned changes like a shot as soon as it goes wrong, as it will. Second or third time at it you may get it right. First time at it is always just for practice.

If you are going to mix doing this with allowing execution of the delay slot instructions (previous hint), good luck, and please realise that you will always have to let jump and branch instructions 'escape' at least as far as the Read stage before cueing off them for the prefetch, as suggested above. Only J and JAL instructions are affected by this consideration. You want the standard prefetch +4 mechanism to be still at work to drag one more instruction in behind them (the `delay slot' instruction) in the usual way while they are still in the Decode stage.

So start watching for J and JAL in the Read stage, not the Decode stage. And modify the suppression of the blind flush in the do_write_tick code so that it checks the address of the second following instruction, not the first following. You really need to skip and spare that delay slot instruction!

Other Exercises

Essentially, you want to try and keep the pipeline rather fuller than it is at present.

Medium: you could try passing register updates directly to other instructions following on in the pipeline, rather than delaying a reading instruction's progress until the pipeline ahead is clear of instructions that might write the registers it reads (as happens now). That would definitely be a major improvement in efficiency.
Easy to hard: There is some opportunity for promoting a following instruction ahead in the pipeline when the read stage is nominally barred to the instruction that should have advanced because of a pending write, but the reordering has to be done very early.
Easy to hard: you could try expanding the Write stage into two stages, the first where an instruction can read from (cache) memory and the second where an instruction can write to memory or registers. There's not obviously any advantage to be had here because no MIPS instruction tries to both read and write to memory at the same time, but the reorganization may allow further optimizations afterwards, such as running a load and a store instruction simultaneously in the two different stages. A dependency analysis is needed to prevent reading before writing on a common memory location.
Very hard: a clever strategy is to double the number of fetch and decode units and after fetching a branch prefetch both the next instruction that would result if the branch were taken and that which would result if it were not taken. Pass them both for decoding. When the branch reaches the execute stage it is known which way it will go and the appropriate following instruction can be moved out of the decode stage into the read stage, having already been decoded.

There are no easy options! The best thing to do is catalogue the stalls that you can see ocurring, and figure out some way of filling the vacancies with work to be done. The most obvious trick is to fill with instructions promoted out of order from behind, where it's allowable, but there won't always be instructions yet in the pipeline that can safely be promoted. The only answer to that is a longer pipeline, or two pipelines, so that more instructions are visible to the CPU at any one time and therefore there are more potential candidates for promotions. You need to experiment - the worst you can do is not go any faster. I'll be happy to help with structuring your code so that it achieves its aim clearly and concisely, which is 80% of the battle! (if you can't see what your code does, then you cannot see why it doesn't do what you want it to do).