A simple unoptimized pipelined processor model
From this page on, you'll find progressively more
sophisticated MIPS processor models to play with.
They're all significantly faster (and more
difficult to understand) than the basic MIPS
simulator, and each is more sophisticated than the
previous one, so you will want to work through
them one at a time.
Running the CPU2 model
If you have the jar file from the distributed
archive, you should be able to run the bytecode
directly with
% java -cp lib/CPU.jar CPU.Cpu2 misc/helloworld_mips32
The Java source code is in the Cpu2.java file and
if necessary (i.e. the jar file is missing or
damaged) you can compile it with
% javac CPU/Cpu2.java
from the src/ directory of the unarchived
distribution. That will produce bytecode that you
can run with
% java CPU.Cpu2 misc/helloworld_mips32
At the top level, the code still has the same
interface as the basic CPU1 simulator code:
Cpu2
static void main (String args[]) |
entry point; handles command line options |
CPU2
CPU2 () |
CPU builder |
void run () |
fetch/decode/execute cycle |
Internally, however, the CPU2 code is much more
sophisticated than the unpipelined code in CPU1,
and not as useful to walk through, though you may
do so.
What's new?
The major differences in the pipelined code with
respect to the unpipelined code are these. The
single fetch-decode-execute cycle loop in the
unpipelined code has been decomposed into separate
Fetch, Decode, (Read), Execute (and Write)
components ("stages"), corresponding to the
commented notes about where each stage starts and
ends that you can see in the unpipelined CPU1 run
code (e.g. "// write"). Here's a diagram of the
pipeline that has been implemented:
Notice that the pipeline contains extra little
registers (I'll call them "holders") which hold
the inputs and outputs from each stage for passing
on to the next. The pipeline is a physically
manifested pipeline, in other words. The "holders"
are like laundry baskets in which the laundry is
kept before passing it on from the washer stage to
the dryer stage in the laundry room - it's not
just a question of shovelling the laundry out of
the washer and into the dryer directly. Laundry
baskets and holders both allow for some minor
hold-up in the work-flow. To clear the washer for
another load you don't have to have the dryer
already clear - you can put the laundry from the
washer into the basket instead. The input
basket/holder for the next stage is the output
basket/holder from the previous stage.
There's a fairly boring list of names of pipeline
stage input and output holders that you'll need to
get used to eventually. They're shown in the
diagram above. I'll work through the list below,
and you can skip past it for the moment if you
aren't interested.
- The input for the Fetch stage of the pipeline
is the PC register, and the output is the IR
plus a record of the initial value of the PC in
the conf0 holder. This initial value of the PC
is needed later on in the pipeline when branch
offsets are converted into addresses, for
example.
- The IR is the input to the Decode stage of
the pipeline, and the fully detailed decode of
the instruction is output from it into the conf1
holder, along with the initial value of the PC
transferred from conf0. From then onwards the
decode information is passed along the pipeline
unaltered, passing through the conf2, conf3,
conf4 holders in turn as the instruction is
treated.
- From the Read stage of the pipeline, a double
register read result r is output and that forms
the input to the Execute stage.
- The output from the Execute stage is a paired
result and carry s from the ALU, and that forms
the input to the Write stage.
- There is a class representing each stage, and
also a class representing the whole pipeline
composed of the five stages.
- Both the stages and the pipeline as a whole
work in the simulator by executing a single tick
method.
- Each tick causes some work to be done in the
pipeline. The ticks are timed. The CPU2 run
method now simply calls tick in a continuous
loop on the CPU pipeline p until an instruction
is executed which stops the clock, thereby
ending the simulation. Here is the code:
public void run() {
while (Clock.running()) {
p.tick();
}
}
In real life the hardware runs all the pipeline
stages simultaneously. Because that can't be done
in an emulation using a single thread of
computation, the simulation instead runs the
stages one after the other but in such a way that
the correct final state is achieved.
- The simulation executes the Write stage
"first" then sets back the clock after recording
how long it took. That empties the conf3 holder
which is the input to the Write stage, and fills
the conf4 holder which is the output from the
Write stage.
- The simulation then runs the Execute stage,
filling the empty conf3 holder and emptying the
conf2 holder.
- The simulation then runs the Read stage,
filling the empty conf2 holder and emptying the
conf1 holder, etc.
- The longest time taken by any of the
individual stages is how long the whole pipeline
tick takes.
You'll see this implementation mechanism spelled
out in the code below. It's simply the easiest way
of calculating in a single thread how the parallel
hardware behaves. One has to choose to do one part
of the calculation before another.
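The run-the-stages-backwards idea can be illustrated in
miniature. The sketch below is not from the CPU2 source
(all the names in it are invented for illustration); it
simulates a two-stage pipeline by running the later
stage first, so each holder is emptied before the
earlier stage refills it:

```java
// A miniature two-stage pipeline run in reverse stage
// order. The names here are illustrative only, not
// taken from the CPU2 source.
class ReverseOrderDemo {
    static Integer holder = null;  // basket between stage A and stage B
    static Integer output = null;  // final output of stage B

    static void stageB() {         // later stage runs first ...
        if (holder != null) {
            output = holder * 2;   // consume the holder's content ...
            holder = null;         // ... emptying it for stage A
        }
    }

    static void stageA(int in) {   // earlier stage runs second
        holder = in + 1;           // fills the now-empty holder
    }

    public static void main(String[] args) {
        stageA(10);  // tick 1: prime the pipeline, holder = 11
        stageB();    // tick 2, part 1: B consumes 11, output = 22
        stageA(20);  // tick 2, part 2: A refills the holder with 21
        System.out.println(output);  // prints 22
        System.out.println(holder);  // prints 21
    }
}
```

Running the stages the other way round (A then B) within
a tick would let B see the value A wrote in that same
tick, which real simultaneous hardware would not.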
There follows the list of all the new Java
classes and their methods.
The pipeline ("Pipe") component class
Inside the CPU2 processor, you'll find a conceptual
aggregation of most of its components into a pipeline.
Because the pipeline has no physical existence
outside of the CPU, it is implemented as an interior
class in Java, to signal that "this is not real".
It's a concept in the mind of the designer, so the
public can't have access to it. It's not a public
class. Only the CPU2 object knows it's there, and
makes calls to it (and even then, it has just one
thing to say to it: "tick", please!).
Pipe
Pipe () |
builds a pipeline |
void tick () |
runs the pipeline for one cycle |
The pipeline's tick method is implemented
internally like this:
public void tick() {
do_finish_tick(); // purely for local accounting and debugging
// reset clock
do_write_tick(); // run the write stage
// reset clock
do_execute_tick(); // run the execute stage
// reset clock
do_read_tick(); // run the read stage
// reset clock
do_decode_tick(); // run the decode stage
// reset clock
do_fetch_tick(); // run the fetch stage
// set clock to max
}
Certain trickery with the clock to get the
parallel execution time-wise accounting right is
shown as commentary here. In the real code it's a
call (to "Clock.reset()", funnily enough).
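A minimal sketch of how that accounting might be
arranged follows. The method names and bookkeeping here
are assumptions for illustration, not the actual CPU2
Clock interface; the point is that each stage's cost is
measured, the clock rewound, and the whole tick finally
charged only the cost of the slowest stage:

```java
// Hypothetical sketch of max-of-stages clock accounting;
// the method names are assumptions, not the actual CPU2
// Clock interface.
class ClockDemo {
    static long time = 0;   // the global simulated time
    static long mark = 0;   // time at the start of the tick
    static long max  = 0;   // longest stage latency seen so far

    static void increment(long latency) { time += latency; }

    static void reset() {   // called between stages
        max = Math.max(max, time - mark);  // record this stage's cost
        time = mark;                       // rewind for the next stage
    }

    static void finishTick() {  // called after the last stage
        reset();
        time = mark + max;  // the tick costs only the slowest stage
        mark = time;
        max = 0;
    }

    public static void main(String[] args) {
        increment(3); reset();       // a stage taking 3 units
        increment(5); reset();       // a stage taking 5 units
        increment(2); finishTick();  // a stage taking 2 units
        System.out.println(time);    // prints 5: the stages ran "in parallel"
    }
}
```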
The Fetch component class within the pipeline
"Fetch" is the first stage of the pipeline in the
CPU. It's there to get an instruction into the IR
("instruction register"). Afterwards the
instruction will be worked on in subsequent
stages. It's a factory assembly line!
The component is again an aggregation, a way of
making units that already exist within the CPU
work together, so it is represented as an interior
class within an interior class in Java! It's a
'concept class': an interior class within the
interior class that is the pipeline class within
the CPU code. Only the pipeline object knows about
this object class - it's a figment of a pipeline's
dream, which is itself a CPU's dream.
Non-philosophically, however:
Fetch
Fetch () |
builds the fetch stage of a pipeline |
void tick () |
runs the fetch stage for one cycle |
The Fetch stage tick method Java code collects
together the statements found at the start of the
while loop in the CPU1 run method:
public void tick() {
// code from CPU1 starts here
int pc = PC.read(); // .
int ir = memory.read32be(pc); // .
IR.write(ir); // .
pc = feAlu.execute(ALU_ADDU, pc, 4).c; // .
PC.write(pc); // .
// code from CPU1 ends here
conf0 = new DecodeReturnT();
{ // conf0 is the stage output
conf0.ir = ir;
conf0.pc = pc;
}
}
The only addition with respect to CPU1 is the
filling of the conf0 holder with the IR and PC
settings obtained here.
There's just one bit of trickery that needs
mentioning here. A jump or branch instruction
ahead in the Write stage of the pipeline usually
alters the PC. Normally that change would not be
seen, reading from PC, until the next tick of the
clock comes round. Registers show what has been
written to them last clock tick on the next clock
tick. That means that one would normally expect
the Fetch stage to be prefetching rubbish that
will almost certainly be jettisoned next clock tick
while a branch or jump is being treated in the
Write stage. While that is nothing that will hurt
the functionality of the pipeline, it is
inefficient (it slows things up), and worse, it
means having to tolerate rubbish instructions that
may mean nothing at all and be improperly
formatted in the IR register. The address from
which the instruction is prefetched may not even
be in the range of the program text, but from one
instruction beyond. It's a fair can of worms.
There is a simple way to avoid it, and that has
been taken in the code. Changes in the PC made in
the Write stage are forwarded immediately to the
Fetch stage, thus avoiding any spuriously
prefetched code in the IR alongside a branch or
jump in the Write stage. Concretely, the PC is
only consulted for the pc value in the Fetch stage
when the Write stage is not dominating the PC. If
there is an instruction in the Write stage, the
conf4 holder will contain data. If that
instruction is a jump, or a branch that succeeded,
its branched field will be set and the epc field
will contain the target address of the jump or
branch that should be directed to the PC. Thus the
way the pc variable is set in the Fetch stage code
is not as shown above, but instead by the more
complicated:
int pc = (conf4 != null && conf4.branched) ? conf4.epc : PC.read();
There are alternatives to this approach (for
example: `forwarding PC writes'; the CPU5 code
does that). One could, for example, simply tell
the Fetch stage not to do any prefetching at all
while a jump or successful branch is in the Write
stage. That would create an artificial hole in the
flow of instructions through the pipeline, but
would work. It would avoid prefetches of rubbish.
It is clearly slower overall, however. We have
instead preferred to replace the pc= line as
above, since it's an easy change. It does create
some code maintenance burden - in the Write stage
the author will have to remember to fill out the
branched and epc fields of the conf4 holder
appropriately wherever the PC is written.
Is this kind of code `legal'? Does it properly
represent what happens in hardware? Yes, it does.
The pc value here is a signal level (measured in
volts) on a metal line. The Write and Fetch stages
are running at the same time. All it needs is a
bit of combinatorial logic to substitute on that
line at that moment the value being read from the
PC register in the Fetch stage by the value being
written to the PC register in the Write stage.
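In Java that combinatorial substitution is just a
conditional expression, the software analogue of a
two-way multiplexer. Here is a self-contained sketch in
which the holder and register types are simplified
stand-ins for the real ones:

```java
// Simplified stand-ins for the real holder and register
// types; only the mux logic matters here.
class Conf { boolean branched; int epc; }

class FetchMuxDemo {
    static int pcRegister = 0x1000;  // simulated PC register contents

    // The two-way "mux": take the Write stage's target if it
    // is steering the PC this tick, else read the PC register.
    static int nextFetchPc(Conf conf4) {
        return (conf4 != null && conf4.branched) ? conf4.epc : pcRegister;
    }

    public static void main(String[] args) {
        // no instruction in Write: read the PC register
        System.out.println(Integer.toHexString(nextFetchPc(null)));  // 1000

        Conf taken = new Conf();
        taken.branched = true;
        taken.epc = 0x2000;  // a branch in Write redirects the fetch
        System.out.println(Integer.toHexString(nextFetchPc(taken))); // 2000
    }
}
```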
The Decode component class within the pipeline
Decode
Decode () |
builds the decode stage of a pipeline |
void tick () |
runs the decode stage for one cycle |
Similarly the Decode tick() method groups
together the code from the CPU1
fetch/decode/execute loop that implements the
decode functionality.
public void tick() {
// input - conf0 is nonempty on entry
int pc = conf0.pc;
int ir = conf0.ir;
// output - conf1 is empty on entry
conf1 = new DecodeReturnT();
{
conf1.pc = pc;
conf1.ir = ir;
conf1.op = InstructionRegister.OPCODE(ir);
}
if (conf1.op == 0) { // code from CPU1 starts here
// ALU ops // .
conf1.sssss = InstructionRegister.aluopSRC1 (ir); // .
conf1.ttttt = InstructionRegister.aluopSRC2 (ir); // .
conf1.ddddd = InstructionRegister.aluopDST (ir); // .
conf1.func = InstructionRegister.aluopOP (ir); // .
} else ... // .
// treat other instruction types // .
} // .
Clock.increment(DECODE_LATENCY); // .
// code from CPU1 ends here
conf0 = null;
}
The only addition with respect to CPU1 has been
the very few lines of code that move the data in
the conf0 holder on entry across to the conf1
holder on exit. The stage fills out conf1 with all
the broken-out fields decoded from the IR
instruction (itself recorded as is in the ir
field).
The Read component class within the pipeline
Read
Read () |
builds the read stage of a pipeline |
void tick () |
runs the read stage for one cycle |
The Read tick() method contains precisely the
code from the CPU1 fetch/decode/execute cycle
which reads the registers specified in the
instruction, filling thereby the two-word
data-holder for the stage (called r):
public void tick() {
// conf1 is the input configuration holder
if (conf1.op == 0) {
// deal with ALU ops // r is a holder for 2 output words
r = register_unit.read(conf1.sssss, conf1.ttttt);
} else ...
// handle other instruction types
}
conf2 = conf1.copy(); // conf2 is the output configuration holder
conf1 = null;
}
All the broken-out fields from the instruction
decode are present in the conf1 holder on entry to
the stage, and are moved to the conf2 holder on
exit from the stage. The sssss and ttttt fields
are the ones which determine the registers.
The Execute component class within the pipeline
Execute
Execute () |
builds the execute stage of a pipeline |
void tick () |
runs the execute stage for one cycle |
The execute stage tick() method is implemented
using all the code from the CPU1
fetch/decode/execute loop which was dedicated to
setting up and running the ALU during an
instruction.
public void tick () {
if (conf2.op == 0) { // conf2 is the input configuration holder
// deal with ALU ops
s = alu.execute (conf2.func, r.a, r.b); // s is a holder for 2 output words
} else ...
// deal with other instruction types
}
conf3 = conf2.copy();
conf2 = null; // conf3 is the output configuration holder
}
All the broken-out fields from the instruction
decode are present in the conf2 holder on entry to
the stage, and are moved to the conf3 holder on
exit from the stage. The func field configures the
ALU behaviour. In some cases data is provided
directly from other broken-out fields (such as the
shift value in a logical or arithmetic shift
instruction, which is provided by the hhhhh
field), but usually the data provided to the ALU
comes from the output holder r from the Read
stage. The stage fills the 2-word holder (called
s) with the (two) subsequent outputs from the ALU.
They're usually a 32-bit result and a single carry
bit, but sometimes two full 32-bit words are
produced, depending on the operation carried out.
Multiplication is one of the latter cases. It
produces a notionally 64-bit result, output as two
full 32-bit words.
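For instance, the full 64-bit product of two 32-bit
words can be obtained in Java with a widening multiply
and then split into the two output words. This is a
generic sketch of the technique, not the literal ALU
code:

```java
// Generic sketch: a 32x32 -> 64-bit multiply split into
// two 32-bit output words (the hi/lo pair of MIPS MULT).
class MultSplitDemo {
    public static void main(String[] args) {
        int a = 0x0001_0000;  // 65536
        int b = 0x0001_0000;  // 65536
        long product = (long) a * (long) b;  // widen before multiplying
        int hi = (int) (product >>> 32);     // upper 32 bits
        int lo = (int) product;              // lower 32 bits
        System.out.println(hi);  // prints 1
        System.out.println(lo);  // prints 0
    }
}
```

Note the casts to long sign-extend the operands, which
is what a signed multiply wants.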
The Write component class within the pipeline
Write
Write () |
builds the write stage of a pipeline |
void tick () |
runs the write stage for one cycle |
The Write stage's tick() method does the final
write-back of results to registers or memory. It
collects together all the relevant code from the
CPU1 fetch/decode/execute loop.
public void tick() {
// conf3 is the input configuration holder
if (conf3.op == 0) {
// deal with ALU ops
register_unit.write (conf3.ddddd, s.c);
} else ...
// deal with other instruction types
}
conf4 = conf3.copy();
conf3 = null; // conf4 is the output configuration holder
...
}
The broken-out fields from the instruction decode
are present in the conf3 holder on entry to the
stage, and moved to the conf4 holder on exit from
the stage. These fields drive where the register
unit writes to, if the register unit is involved
in the write. For an arithmetic operation like
ADD, the ddddd field specifies the output
register. The instruction decode is then finally
used one more time in its embodiment as the conf4
copy just after the Write stage finishes in debug
mode in order to print out details of the
completed instruction - there's no physical need
for them in the CPU at that point.
The author has also had to carry through, here in
the Write stage code, the maintenance burden
accepted when deciding to forward changes made to
the PC to the Fetch stage immediately, without
waiting for the next clock tick. Wherever
the PC is written here (i.e., in the stanzas
dealing with the jump and branch instructions),
the conf3 holder is conscientiously updated with
the target value in its epc field, while its
branched field is set to true.
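The shape of such a stanza might look like the
following self-contained sketch. The field names are
modelled on the description above, and everything else
(the class names, the simulated PC) is invented for
illustration:

```java
// Sketch of a branch stanza in the Write stage filling
// out the holder fields that the Fetch stage forwarding
// relies on. Names modelled on the text, not the source.
class Conf3 { boolean branched; int epc; }

class BranchWriteDemo {
    static int PC = 0x1000;  // simulated PC register

    // branchTaken: the outcome computed back in the Execute stage
    static Conf3 writeBranch(boolean branchTaken, int target) {
        Conf3 conf3 = new Conf3();
        if (branchTaken) {
            PC = target;            // write the PC as CPU1 did ...
            conf3.branched = true;  // ... and also record the fact
            conf3.epc = target;     // so Fetch can pick it up at once
        }
        return conf3;
    }

    public static void main(String[] args) {
        Conf3 c = writeBranch(true, 0x2000);
        System.out.println(c.branched);                 // prints true
        System.out.println(Integer.toHexString(c.epc)); // prints 2000
        System.out.println(Integer.toHexString(PC));    // prints 2000
    }
}
```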
Notes on the pipeline code
As in the unpipelined code, there are classes
representing an ALU, a clock, a memory unit, a
register unit, registers, and so on.
If you ask for a lot of debugging output ("-d
-d"), an image of the state of the pipeline will
be output with each tick.
Because the pipeline always fetches a new
instruction as soon as the old instruction has
cleared the Fetch stage, it is effectively running
a branch predictor in which branches are predicted
to fail: the instruction that is (pre-)fetched
into the pipeline comes from PC+4. Even
though the pipeline forwards changes in the PC
from the Write stage to the Fetch stage
immediately, that leaves in the Decode, Read and
Execute stages instructions that have been
prefetched while the branch was proceeding towards
Write but had not got there yet.
Thus the pipeline needs to, and does, take care
to flush those later instructions that have been
erroneously prefetched whenever a branch
instruction reaches the Write pipeline stage
having succeeded at the Execute stage. Those
instructions should (likely) not be executed now,
so they can't safely be left in the pipeline to
eventually execute and complete! The code to do
this is to be found at the end of the Write tick()
method:
if (conf3.op == BEQZ
|| conf3.op == BEQ
|| conf3.op == BNEZ
|| conf3.op == BNE
|| conf3.op == BLTZ
|| conf3.op == BGEZ
|| conf3.op == BLE
|| conf3.op == BGT
) { // branch instruction exiting Write stage
if (s.z != 0) {
conf0 = null; // if the branch succeeded, flush following instrs
conf1 = null;
conf2 = null;
}
}
If you want to watch chaos in action, remove that
flush.
Similarly the pipeline flushes all instructions
which entered later whenever a jump (j, jal, jr,
jalr) instruction reaches the Write stage. As
discussed above, that is because the prefetched
instructions already in the pipe come from what is
now the wrong code area (with high probability).
The code to do this is at the end of the Write
stage tick() method:
if ((conf3.op == 0 && (conf3.func == ALU_JALR || conf3.func == ALU_JR))
|| conf3.op == J
|| conf3.op == JAL
) { // JALR, JR, J or JAL instruction finishes Write
conf0 = null;
conf1 = null; // flush the pipeline following this instruction
conf2 = null;
}
The pipeline also will (naively) not allow any
instruction into the Read stage (in which
registers and/or memory are read) until all
instructions ahead of it in the pipe have been
completed. That is because some of those might be
about to alter (at the Write stage of the
pipeline) data that the instruction should read
now at the Read stage. The earlier instructions
have to be given a chance to put the data in place
before the later instruction now in the Read stage
tries to read it.
The code is found at the beginning of the Read tick() method.
if (conf1.op != J
&& conf1.op != JAL
&& conf1.op != LUI
) { // we're an instr which may read registers and ..
if (conf3 != null || conf4 != null) { // there're instrs ahead in the Exec/Write stages
return; // .. so don't let this instr into Read stage yet
}
}
If you wonder why LW is apparently considered a
dangerous instruction that may read registers
(it's not in the short list of "OK" instruction
types in the code), it's because it does! "lw $2,
0($sp)" reads the base register $sp, for example,
when calculating the address in memory it should
eventually read from to put data into register $2.
This code has just J, JAL and LUI as the
instructions that don't read from registers (in
reality, there are a few extra MIPS instructions
that you haven't met yet which also indubitably do
not read from registers, and one might want to add
them to the list; leaving them out of the list
does no harm to the functionality, but it makes
the pipeline slower).
This is a very crude and inexact restriction (but
it works!) that could do with being refined
down to a more exact analysis of exactly when
there is a register data dependency between the
instructions in the pipeline and when there is
not. The more sophisticated emulator code in the
CPU3 class discussed later remedies that
situation. The CPU3 pipeline analyses the data
dependency between the read and write stage
instructions and only delays the read when it
really does need to wait for the write ahead of it
to finish first. Note that, despite the name, the
Write stage also sometimes reads from memory.
During a LW instruction the address to load from
is computed during the Read (base register
contents) stage and Execute (add displacement to
address) stage, and the Write stage hooks up the
output from the memory unit to the register unit
input for writing. Perhaps a better name than
`Write' might be `Access Memory and/or Write
Register'.
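The LW data path can be traced in miniature: the base
register is read in the Read stage, the displacement
added in the Execute stage, and the memory access wired
to the register write in the Write stage. All the names
in this end-to-end sketch are hypothetical
simplifications:

```java
import java.util.HashMap;
import java.util.Map;

// Miniature trace of the LW data path across the stages;
// the register file and memory here are hypothetical
// simplifications, not the CPU2 units.
class LwPathDemo {
    static int[] registers = new int[32];
    static Map<Integer, Integer> memory = new HashMap<>();

    public static void main(String[] args) {
        registers[29] = 0x7fff_0000;  // $sp, the base register
        memory.put(0x7fff_0004, 42);  // the word to be loaded

        // lw $2, 4($sp)
        int base = registers[29];     // Read stage: read $sp
        int addr = base + 4;          // Execute stage: add displacement
        int data = memory.get(addr);  // Write stage: memory output ...
        registers[2] = data;          // ... wired to the register input

        System.out.println(registers[2]);  // prints 42
    }
}
```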
Results from the simulation
We are gratified to report that the pipelined
simulator code executes "Hello world" slightly
faster than the unpipelined code (the simulated
clock is running at 1GHz):
           | unpipelined  | pipelined    |
220 instrs | 0.000001580s | 0.000001291s |
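As a sanity check on those figures: at the 1GHz
simulated clock the run times translate directly into
cycle counts, and dividing by the instruction count
gives cycles per instruction. A quick back-of-envelope
calculation:

```java
// Back-of-envelope check of the timing table at a 1GHz
// simulated clock (1 cycle = 1ns).
class CpiCheck {
    public static void main(String[] args) {
        long u = Math.round(0.000001580 * 1e9);  // unpipelined cycles
        long p = Math.round(0.000001291 * 1e9);  // pipelined cycles
        System.out.println(u);  // prints 1580: about 7.2 cycles/instr
        System.out.println(p);  // prints 1291: about 5.9 cycles/instr
        // relative speed as a percentage: ~22% faster pipelined
        System.out.println(Math.round(100.0 * u / p));  // prints 122
    }
}
```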
Exercises on getting familiar with the (unoptimized) pipelined CPU2 processor model Java code
Here are some suggestions for getting to know the
pipelined CPU2 code:
- Comment and/or beautify some part of the CPU2
java source code to your satisfaction, with the
aim of getting familiar with it. Send in your
changes, etc., or publish them somewhere on your
own, as you wish.
- Try adding a BGEZAL and BLTZAL instruction to
the CPU2 java source code as you did for CPU1.