A simple unoptimized pipelined processor model
From this page on, you'll find progressively more
sophisticated MIPS processor models to play with.
They're all significantly faster (and more
difficult to understand) than the basic MIPS
simulator, and each is more sophisticated than the
previous one, so you will want to work through
them one at a time.
Running the CPU2 model
If you have the jar file from the distributed
archive, you should be able to run the bytecode
directly with
% java -cp lib/CPU.jar CPU.Cpu2 misc/helloworld_mips32
The Java source code is in the Cpu2.java file and
if necessary (i.e. the jar file is missing or
damaged) you can compile it with
% javac CPU/Cpu2.java
from the src/ directory of the unarchived
distribution. That will produce bytecode that you
can run with
% java CPU.Cpu2 misc/helloworld_mips32
At the top level, the code still has the same
interface as the basic CPU1 simulator code:
Cpu2
static void main (String args[]) |
entry point; handles command line options |
CPU2
CPU2 () |
CPU builder |
void run () |
fetch/decode/execute cycle |
Internally, however, the CPU2 code is much more
sophisticated than the unpipelined code in CPU1,
and not as useful to walk through, though you may
do so.
What's new?
The major differences in the pipelined code with
respect to the unpipelined code are these. The
single fetch-decode-execute cycle loop in the
unpipelined code has been decomposed into separate
Fetch, Decode, (Read), Execute (and Write)
components ("stages"), corresponding to the
commented notes about where each stage starts and
ends that you can see in the unpipelined CPU1 run
code (e.g. "// write"). Here's a diagram of the
pipeline that has been implemented:
Notice that the pipeline contains extra little
registers (I'll call them "holders") which hold
the inputs and outputs from each stage for passing
on to the next. The pipeline is a physically
manifested pipeline, in other words. The "holders"
are like laundry baskets in which the laundry is
kept before passing it on from the washer stage to
the dryer stage in the laundry room - it's not
just a question of shovelling the laundry out of
the washer and into the dryer directly. Laundry
baskets and holders both allow for some minor
hold-up in the work-flow. To clear the washer for
another load you don't have to have the dryer
already clear - you can put the laundry from the
washer into the basket instead. The input
basket/holder for the next stage is the output
basket/holder from the previous stage.
There's a fairly boring list of names of pipeline
stage input and output holders that you'll need to
get used to eventually. They're shown in the
diagram above. I'll work through the list below,
and you can skip past it for the moment if you
aren't interested.
- The input for the Fetch stage of the pipeline
is the PC register, and the output is the IR
plus a record of the initial value of the PC in
the conf0 holder. This initial value of the PC
is needed later on in the pipeline when branch
offsets are converted into addresses, for
example.
- The IR is the input to the Decode stage of
the pipeline, and the fully detailed decode of
the instruction is output from it into the conf1
holder, along with the initial value of the PC
transferred from conf0. From then onwards the
decode information is passed along the pipeline
unaltered, passing through the conf2, conf3,
conf4 holders in turn as the instruction is
treated.
- From the Read stage of the pipeline, a double
register read result r is output and that forms
the input to the Execute stage.
- The output from the Execute stage is a paired
result and carry s from the ALU, and that forms
the input to the Write stage.
- There is a class representing each stage, and
also a class representing the whole pipeline
composed of the five stages.
- Both the stages and the pipeline as a whole
work in the simulator by executing a single tick
method.
- Each tick causes some work to be done in the
pipeline. The ticks are timed. The CPU2 run
method now simply calls tick in a continuous
loop on the CPU pipeline p until an instruction
is executed which stops the clock, thereby
ending the simulation. Here is the code:
public void run() {
while (Clock.running()) {
p.tick();
}
}
In real life the hardware runs all the pipeline
stages simultaneously. Because that can't be done
in an emulation using a single thread of
computation, the simulation instead runs the
stages one after the other but in such a way that
the correct final state is achieved.
- The simulation executes the Write stage
"first" then sets back the clock after recording
how long it took. That empties the conf3 holder
which is the input to the Write stage, and fills
the conf4 holder which is the output from the
Write stage.
- The simulation then runs the Execute stage,
filling the empty conf3 holder and emptying the
conf2 holder.
- The simulation then runs the Read stage,
filling the empty conf2 holder and emptying the
conf1 holder, etc.
- The longest time taken by any of the
individual stages is how long the whole pipeline
tick takes.
You'll see this implementation mechanism spelled
out in the code below. It's simply the easiest way
of calculating in a single thread how the parallel
hardware behaves. One has to choose to do one part
of the calculation before another.
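The run-the-stages-backwards idea can be illustrated in
miniature. The sketch below is not from the CPU2 source
(all the names in it are invented for illustration); it
simulates a two-stage pipeline by running the later
stage first, so each holder is emptied before the
earlier stage refills it:

```java
// A miniature two-stage pipeline run in reverse stage
// order. The names here are illustrative only, not
// taken from the CPU2 source.
class ReverseOrderDemo {
    static Integer holder = null;  // basket between stage A and stage B
    static Integer output = null;  // final output of stage B

    static void stageB() {         // later stage runs first ...
        if (holder != null) {
            output = holder * 2;   // consume the holder's content ...
            holder = null;         // ... emptying it for stage A
        }
    }

    static void stageA(int in) {   // earlier stage runs second
        holder = in + 1;           // fills the now-empty holder
    }

    public static void main(String[] args) {
        stageA(10);  // tick 1: prime the pipeline, holder = 11
        stageB();    // tick 2, part 1: B consumes 11, output = 22
        stageA(20);  // tick 2, part 2: A refills the holder with 21
        System.out.println(output);  // prints 22
        System.out.println(holder);  // prints 21
    }
}
```

Running the stages the other way round (A then B) within
a tick would let B see the value A wrote in that same
tick, which real simultaneous hardware would not.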
There follows the list of all the new Java
classes and their methods.
The pipeline ("Pipe") component class
Inside the CPU2 processor, you'll find a conceptual
aggregation of most of its components into a pipeline.
Because the pipeline has no physical existence
outside of the CPU, it is implemented as an interior
class in Java, to signal that "this is not real".
It's a concept in the mind of the designer, so the
public can't have access to it. It's not a public
class. Only the CPU2 object knows it's there, and
makes calls to it (and even then, it has just one
thing to say to it: "tick", please!).
Pipe
Pipe () |
builds a pipeline |
void tick () |
runs the pipeline for one cycle |
The pipeline's tick method is implemented
internally like this:
public void tick() {
do_finish_tick(); // purely for local accounting and debugging
// reset clock
do_write_tick(); // run the write stage
// reset clock
do_execute_tick(); // run the execute stage
// reset clock
do_read_tick(); // run the read stage
// reset clock
do_decode_tick(); // run the decode stage
// reset clock
do_fetch_tick(); // run the fetch stage
// set clock to max
}
Certain trickery with the clock to get the
parallel execution time-wise accounting right is
shown as commentary here. In the real code it's a
call (to "Clock.reset()", funnily enough).
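A minimal sketch of how that accounting might be
arranged follows. The method names and bookkeeping here
are assumptions for illustration, not the actual CPU2
Clock interface; the point is that each stage's cost is
measured, the clock rewound, and the whole tick finally
charged only the cost of the slowest stage:

```java
// Hypothetical sketch of max-of-stages clock accounting;
// the method names are assumptions, not the actual CPU2
// Clock interface.
class ClockDemo {
    static long time = 0;   // the global simulated time
    static long mark = 0;   // time at the start of the tick
    static long max  = 0;   // longest stage latency seen so far

    static void increment(long latency) { time += latency; }

    static void reset() {   // called between stages
        max = Math.max(max, time - mark);  // record this stage's cost
        time = mark;                       // rewind for the next stage
    }

    static void finishTick() {  // called after the last stage
        reset();
        time = mark + max;  // the tick costs only the slowest stage
        mark = time;
        max = 0;
    }

    public static void main(String[] args) {
        increment(3); reset();       // a stage taking 3 units
        increment(5); reset();       // a stage taking 5 units
        increment(2); finishTick();  // a stage taking 2 units
        System.out.println(time);    // prints 5: the stages ran "in parallel"
    }
}
```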
The Fetch component class within the pipeline
"Fetch" is the first stage of the pipeline in the
CPU. It's there to get an instruction into the IR
("instruction register"). Afterwards the
instruction will be worked on in subsequent
stages. It's a factory assembly line!
The component is again an aggregation, a way of
making units that already exist within the CPU
work together, so it is represented as an interior
class within an interior class in Java! It's a
'concept class': an interior class within the
interior class that is the pipeline class within
the CPU code. Only the pipeline object knows about
this object class - it's a figment of a pipeline's
dream, which is itself a CPU's dream.
Non-philosophically, however:
Fetch
Fetch () |
builds the fetch stage of a pipeline |
void tick () |
runs the fetch stage for one cycle |
The Fetch stage tick method Java code collects
together the statements found at the start of the
while loop in the CPU1 run method:
public void tick() {
// code from CPU1 starts here
int pc = PC.read(); // .
int ir = memory.read32be(pc); // .
IR.write(ir); // .
pc = feAlu.execute(ALU_ADDU, pc, 4).c; // .
PC.write(pc); // .
// code from CPU1 ends here
conf0 = new DecodeReturnT();
{ // conf0 is the stage output
conf0.ir = ir;
conf0.pc = pc;
}
}
The only addition with respect to CPU1 is the
filling of the conf0 holder with the IR and PC
settings obtained here.
There's just one bit of trickery that needs
mentioning here. A jump or branch instruction
ahead in the Write stage of the pipeline usually
alters the PC. Normally that change would not be
seen, reading from PC, until the next tick of the
clock comes round. Registers show what has been
written to them last clock tick on the next clock
tick. That means that one would normally expect
the Fetch stage to be prefetching rubbish that
will almost certainly be jettisoned next clock tick
while a branch or jump is being treated in the
Write stage. While that is nothing that will hurt
the functionality of the pipeline, it is
inefficient (it slows things up), and worse, it
means having to tolerate rubbish instructions that
may mean nothing at all and be improperly
formatted in the IR register. The address from
which the instruction is prefetched may not even
be in the range of the program text, but from one
instruction beyond. It's a fair can of worms.
There is a simple way to avoid it, and that has
been taken in the code. Changes in the PC made in
the Write stage are forwarded immediately to the
Fetch stage, thus avoiding any spuriously
prefetched code in the IR alongside a branch or
jump in the Write stage. Concretely, the PC is
only consulted for the pc value in the Fetch stage
when the Write stage is not dominating the PC. If
there is an instruction in the Write stage, the
conf4 holder will contain data. If that
instruction is a jump, or a branch that succeeded,
its branched field will be set and the epc field
will contain the target address of the jump or
branch that should be directed to the PC. Thus the
way the pc variable is set in the Fetch stage code
is not as shown above, but instead by the more
complicated:
int pc = (conf4 != null && conf4.branched) ? conf4.epc : PC.read();
There are alternatives to this approach (for
example: `forwarding PC writes'; the CPU5 code
does that). One could, for example, simply tell
the Fetch stage not to do any prefetching at all
while a jump or successful branch is in the Write
stage. That would create an artificial hole in the
flow of instructions through the pipeline, but
would work. It would avoid prefetches of rubbish.
It is clearly slower overall, however. We have
instead preferred to replace the pc= line as
above, since it's an easy change. It does create
some code maintenance burden - in the Write stage
the author will have to remember to fill out the
branched and epc fields of the conf4 holder
appropriately wherever the PC is written.
Is this kind of code `legal'? Does it properly
represent what happens in hardware? Yes, it does.
The pc value here is a signal level (measured in
volts) on a metal line. The Write and Fetch stages
are running at the same time. All it needs is a
bit of combinatorial logic to substitute on that
line at that moment the value being read from the
PC register in the Fetch stage by the value being
written to the PC register in the Write stage.
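In Java that combinatorial substitution is just a
conditional expression, the software analogue of a
two-way multiplexer. Here is a self-contained sketch in
which the holder and register types are simplified
stand-ins for the real ones:

```java
// Simplified stand-ins for the real holder and register
// types; only the mux logic matters here.
class Conf { boolean branched; int epc; }

class FetchMuxDemo {
    static int pcRegister = 0x1000;  // simulated PC register contents

    // The two-way "mux": take the Write stage's target if it
    // is steering the PC this tick, else read the PC register.
    static int nextFetchPc(Conf conf4) {
        return (conf4 != null && conf4.branched) ? conf4.epc : pcRegister;
    }

    public static void main(String[] args) {
        // no instruction in Write: read the PC register
        System.out.println(Integer.toHexString(nextFetchPc(null)));  // 1000

        Conf taken = new Conf();
        taken.branched = true;
        taken.epc = 0x2000;  // a branch in Write redirects the fetch
        System.out.println(Integer.toHexString(nextFetchPc(taken))); // 2000
    }
}
```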
The Decode component class within the pipeline
Decode
Decode () |
builds the decode stage of a pipeline |
void tick () |
runs the decode stage for one cycle |
Similarly the Decode tick() method groups
together the code from the CPU1
fetch/decode/execute loop that implements the
decode functionality.
public void tick() {
// input - conf0 is nonempty on entry
int pc = conf0.pc;
int ir = conf0.ir;
// output - conf1 is empty on entry
conf1 = new DecodeReturnT();
{
conf1.pc = pc;
conf1.ir = ir;
conf1.op = InstructionRegister.OPCODE(ir);
}
if (conf1.op == 0) { // code from CPU1 starts here
// ALU ops // .
conf1.sssss = InstructionRegister.aluopSRC1 (ir); // .
conf1.ttttt = InstructionRegister.aluopSRC2 (ir); // .
conf1.ddddd = InstructionRegister.aluopDST (ir); // .
conf1.func = InstructionRegister.aluopOP (ir); // .
} else ... // .
// treat other instruction types // .
} // .
Clock.increment(DECODE_LATENCY); // .
// code from CPU1 ends here
conf0 = null;
}
The only addition with respect to CPU1 has been
the very few lines of code that move the data in
the conf0 holder on entry across to the conf1
holder on exit. The stage fills out conf1 with all
the broken-out fields decoded from the IR
instruction (itself recorded as is in the ir
field).
The Read component class within the pipeline
Read
Read () |
builds the read stage of a pipeline |
void tick () |
runs the read stage for one cycle |
The Read tick() method contains precisely the
code from the CPU1 fetch/decode/execute cycle
which reads the registers specified in the
instruction, filling thereby the two-word
data-holder for the stage (called r):
public void tick() {
// conf1 is the input configuration holder
if (conf1.op == 0) {
// deal with ALU ops // r is a holder for 2 output words
r = register_unit.read(conf1.sssss, conf1.ttttt);
} else ...
// handle other instruction types
}
conf2 = conf1.copy(); // conf2 is the output configuration holder
conf1 = null;
}
All the broken-out fields from the instruction
decode are present in the conf1 holder on entry to
the stage, and are moved to the conf2 holder on
exit from the stage. The sssss and ttttt fields
are the ones which determine the registers.
The Execute component class within the pipeline
Execute
Execute () |
builds the execute stage of a pipeline |
void tick () |
runs the execute stage for one cycle |
The execute stage tick() method is implemented
using all the code from the CPU1
fetch/decode/execute loop which was dedicated to
setting up and running the ALU during an
instruction.
public void tick () {
if (conf2.op == 0) { // conf2 is the input configuration holder
// deal with ALU ops
s = alu.execute (conf2.func, r.a, r.b); // s is a holder for 2 output words
} else ...
// deal with other instruction types
}
conf3 = conf2.copy();
conf2 = null; // conf3 is the output configuration holder
}
All the broken-out fields from the instruction
decode are present in the conf2 holder on entry to
the stage, and are moved to the conf3 holder on
exit from the stage. The func field configures the
ALU behaviour. In some cases data is provided
directly from other broken-out fields (such as the
shift value in a logical or arithmetic shift
instruction, which is provided by the hhhhh
field), but usually the data provided to the ALU
comes from the output holder r from the Read
stage. The stage fills the 2-word holder (called
s) with the (two) subsequent outputs from the ALU.
They're usually a 32-bit result and a single carry
bit, but sometimes two full 32-bit words are
produced, depending on the operation carried out.
Multiplication is one of the latter cases. It
produces a notionally 64-bit result, output as two
full 32-bit words.
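For instance, the full 64-bit product of two 32-bit
words can be obtained in Java with a widening multiply
and then split into the two output words. This is a
generic sketch of the technique, not the literal ALU
code:

```java
// Generic sketch: a 32x32 -> 64-bit multiply split into
// two 32-bit output words (the hi/lo pair of MIPS MULT).
class MultSplitDemo {
    public static void main(String[] args) {
        int a = 0x0001_0000;  // 65536
        int b = 0x0001_0000;  // 65536
        long product = (long) a * (long) b;  // widen before multiplying
        int hi = (int) (product >>> 32);     // upper 32 bits
        int lo = (int) product;              // lower 32 bits
        System.out.println(hi);  // prints 1
        System.out.println(lo);  // prints 0
    }
}
```

Note the casts to long sign-extend the operands, which
is what a signed multiply wants.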
The Write component class within the pipeline
Write
Write () |
builds the write stage of a pipeline |
void tick () |
runs the write stage for one cycle |
The Write stage's tick() method does the final
write-back of results to registers or memory. It
collects together all the relevant code from the
CPU1 fetch/decode/execute loop.
public void tick() {
// conf3 is the input configuration holder
if (conf3.op == 0) {
// deal with ALU ops
register_unit.write (conf3.ddddd, s.c);
} else ...
// deal with other instruction types
}
conf4 = conf3.copy();
conf3 = null; // conf4 is the output configuration holder
...
}
The broken-out fields from the instruction decode
are present in the conf3 holder on entry to the
stage, and moved to the conf4 holder on exit from
the stage. These fields drive where the register
unit writes to, if the register unit is involved
in the write. For an arithmetic operation like
ADD, the ddddd field specifies the output
register. The instruction decode is then finally
used one more time in its embodiment as the conf4
copy just after the Write stage finishes in debug
mode in order to print out details of the
completed instruction - there's no physical need
for them in the CPU at that point.
The author has also had to carry through, here in
the Write stage code, the maintenance burden
accepted when deciding to forward changes made to
the PC to the Fetch stage immediately, without
waiting for the next clock tick. Wherever
the PC is written here (i.e., in the stanzas
dealing with the jump and branch instructions),
the conf3 holder is conscientiously updated with
the target value in its epc field, while its
branched field is set to true.
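The shape of such a stanza might look like the
following self-contained sketch. The field names are
modelled on the description above, and everything else
(the class names, the simulated PC) is invented for
illustration:

```java
// Sketch of a branch stanza in the Write stage filling
// out the holder fields that the Fetch stage forwarding
// relies on. Names modelled on the text, not the source.
class Conf3 { boolean branched; int epc; }

class BranchWriteDemo {
    static int PC = 0x1000;  // simulated PC register

    // branchTaken: the outcome computed back in the Execute stage
    static Conf3 writeBranch(boolean branchTaken, int target) {
        Conf3 conf3 = new Conf3();
        if (branchTaken) {
            PC = target;            // write the PC as CPU1 did ...
            conf3.branched = true;  // ... and also record the fact
            conf3.epc = target;     // so Fetch can pick it up at once
        }
        return conf3;
    }

    public static void main(String[] args) {
        Conf3 c = writeBranch(true, 0x2000);
        System.out.println(c.branched);                 // prints true
        System.out.println(Integer.toHexString(c.epc)); // prints 2000
        System.out.println(Integer.toHexString(PC));    // prints 2000
    }
}
```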
Notes on the pipeline code
As in the unpipelined code, there are classes
representing an ALU, a clock, a memory unit, a
register unit, registers, and so on.
If you ask for a lot of debugging output ("-d
-d"), an image of the state of the pipeline will
be output with each tick.
Because the pipeline always fetches a new
instruction as soon as the old instruction has
cleared the Fetch stage, it is effectively running
a branch predictor in which branches are predicted
to fail: the instruction that is (pre-)fetched
into the pipeline comes from PC+4. Even
though the pipeline forwards changes in the PC
from the Write stage to the Fetch stage
immediately, that leaves in the Decode, Read and
Execute stages instructions that have been
prefetched while the branch was proceeding towards
Write but had not got there yet.
Thus the pipeline needs to, and does, take care
to flush those later instructions that have been
erroneously prefetched whenever a branch
instruction reaches the Write pipeline stage
having succeeded at the Execute stage. Those
instructions should (likely) not be executed now,
so they can't safely be left in the pipeline to
eventually execute and complete! The code to do
this is to be found at the end of the Write tick()
method:
if (conf3.op == BEQZ
|| conf3.op == BEQ
|| conf3.op == BNEZ
|| conf3.op == BNE
|| conf3.op == BLTZ
|| conf3.op == BGEZ
|| conf3.op == BLE
|| conf3.op == BGT
) { // branch instruction exiting Write stage
if (s.z != 0) {
conf0 = null; // if the branch succeeded, flush following instrs
conf1 = null;
conf2 = null;
}
}
If you want to watch chaos in action, remove that
flush.
Similarly the pipeline flushes all instructions
which entered later whenever a jump (j, jal, jr,
jalr) instruction reaches the Write stage. As
discussed above, that is because the prefetched
instructions already in the pipe come from what is
now the wrong code area (with high probability).
The code to do this is at the end of the Write
stage tick() method:
if ((conf3.op == 0 && (conf3.func == ALU_JALR || conf3.func == ALU_JR))
|| conf3.op == J
|| conf3.op == JAL
) { // JALR, JR, J or JAL instruction finishes Write
conf0 = null;
conf1 = null; // flush the pipeline following this instruction
conf2 = null;
}
The pipeline also will (naively) not allow any
instruction into the Read stage (in which
registers and/or memory are read) until all
instructions ahead of it in the pipe have been
completed. That is because some of those might be
about to alter (at the Write stage of the
pipeline) data that the instruction should read
now at the Read stage. The earlier instructions
have to be given a chance to put the data in place
before the later instruction now in the Read stage
tries to read it.
The code is found at the beginning of the Read tick() method.
if (conf1.op != J
&& conf1.op != JAL
&& conf1.op != LUI
) { // we're an instr which may read registers and ..
if (conf3 != null || conf4 != null) { // there're instrs ahead in the Exec/Write stages
return; // .. so don't let this instr into Read stage yet
}
}
If you wonder why LW is apparently considered a
dangerous instruction that may read registers
(it's not in the short list of "OK" instruction
types in the code), it's because it does! "lw $2,
0($sp)" reads the base register $sp, for example,
when calculating the address in memory it should
eventually read from to put data into register $2.
This code has just J, JAL and LUI as the
instructions that don't read from registers (in
reality, there are a few extra MIPS instructions
that you haven't met yet which also indubitably do
not read from registers, and one might want to add
them to the list; leaving them out of the list
does no harm to the functionality, but it makes
the pipeline slower).
This is a very crude and inexact restriction (but
it works!) that could do with being refined
down to a more exact analysis of exactly when
there is a register data dependency between the
instructions in the pipeline and when there is
not. The more sophisticated emulator code in the
CPU3 class discussed later remedies that
situation. The CPU3 pipeline analyses the data
dependency between the read and write stage
instructions and only delays the read when it
really does need to wait for the write ahead of it
to finish first. Note that, despite the name, the
Write stage also sometimes reads from memory.
During a LW instruction the address to load from
is computed during the Read (base register
contents) stage and Execute (add displacement to
address) stage, and the Write stage hooks up the
output from the memory unit to the register unit
input for writing. Perhaps a better name than
`Write' might be `Access Memory and/or Write
Register'.
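The LW data path can be traced in miniature: the base
register is read in the Read stage, the displacement
added in the Execute stage, and the memory access wired
to the register write in the Write stage. All the names
in this end-to-end sketch are hypothetical
simplifications:

```java
import java.util.HashMap;
import java.util.Map;

// Miniature trace of the LW data path across the stages;
// the register file and memory here are hypothetical
// simplifications, not the CPU2 units.
class LwPathDemo {
    static int[] registers = new int[32];
    static Map<Integer, Integer> memory = new HashMap<>();

    public static void main(String[] args) {
        registers[29] = 0x7fff_0000;  // $sp, the base register
        memory.put(0x7fff_0004, 42);  // the word to be loaded

        // lw $2, 4($sp)
        int base = registers[29];     // Read stage: read $sp
        int addr = base + 4;          // Execute stage: add displacement
        int data = memory.get(addr);  // Write stage: memory output ...
        registers[2] = data;          // ... wired to the register input

        System.out.println(registers[2]);  // prints 42
    }
}
```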
Results from the simulation
We are gratified to report that the pipelined
simulator code executes "Hello world" slightly
faster than the unpipelined code (the simulated
clock is running at 1GHz):
           | unpipelined  | pipelined    |
220 instrs | 0.000001580s | 0.000001291s |
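As a sanity check on those figures: at the 1GHz
simulated clock the run times translate directly into
cycle counts, and dividing by the instruction count
gives cycles per instruction. A quick back-of-envelope
calculation:

```java
// Back-of-envelope check of the timing table at a 1GHz
// simulated clock (1 cycle = 1ns).
class CpiCheck {
    public static void main(String[] args) {
        long u = Math.round(0.000001580 * 1e9);  // unpipelined cycles
        long p = Math.round(0.000001291 * 1e9);  // pipelined cycles
        System.out.println(u);  // prints 1580: about 7.2 cycles/instr
        System.out.println(p);  // prints 1291: about 5.9 cycles/instr
        // relative speed as a percentage: ~22% faster pipelined
        System.out.println(Math.round(100.0 * u / p));  // prints 122
    }
}
```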
Exercises on getting familiar with the (unoptimized) pipelined CPU2 processor model Java code
Here are some suggestions for getting to know the
pipelined CPU2 code:
- Comment and/or beautify some part of the CPU2
java source code to your satisfaction, with the
aim of getting familiar with it. Send in your
changes, etc., or publish them somewhere on your
own, as you wish.
- Try adding a BGEZAL and BLTZAL instruction to
the CPU2 java source code as you did for CPU1.