jMIPS

an open source MIPS processor in Java

External Navigation

A MIPS processor in Java

This is the jMIPS easy-to-use open-source MIPS processor in Java. You should already have downloaded the archive (up to date copies are on the jMIPS project pages) within which this documentation is to be found as the contents of the doc/html subdirectory.

Detail of the basic unpipelined MIPS processor architecture

Getting to know the software here greatly aids learning about the MIPS hardware architecture and also creates familiarity with more general concepts of computer systems and architectures.

There are five processor models to look at and play with in the archive, identified here as 1 to 5. The numbers increase with increasing sophistication in the model, thus:

Basic unpipelined processor (these pages)
Unoptimized pipelined processor
Optimized pipelined processor
Optimized pipelined processor with memory cache
Optimized pipelined processor with memory cache and interrupt handling

The following pages will take you through the process of using and perhaps building the basic processor model in detail, and then go on to consider the other models. The models do not differ in how they are built or used - it's only the innards that are different in each case and that results in one model being faster than another in the same context.

You will want to get to work with each model processor in turn, perhaps with the aim in mind of improving the model to get it to run faster still.

How long it takes to run overall and per instruction is one of the default printed outputs. If you want to get out statistics such as how long each class of instruction takes to execute in particular circumstances, you will have to add in the code for that yourself. It's open source! You can do that. The code is clearly written and explained in the pages here.

How to run the basic MIPS processor model

There should be a Java archive file (a jar) in the supplied zip archive or compressed tar file (get one of them from the download link on the project page). If there isn't, then go to the next section.

You should extract the jar file from the archive with (for example, the archive name depending on your download choice)

% unzip jMIPS-1.7.zip jMIPS-1.7/lib/CPU.jar

% tar xzvf jMIPS-1.7.tgz jMIPS-1.7/lib/CPU.jar

Also extract some of the contents of the misc directory with

% unzip jMIPS-1.7.zip jMIPS-1.7/misc/hello_mips32

% tar xzvf jMIPS-1.7.tgz jMIPS-1.7/lib/hello_mips32

then the Cpu1 class can be run by a Java Virtual Machine (JVM) on any platform. You might do so under Linux with

% java -cp jMIPS-1.7/lib/CPU.jar CPU.Cpu1 -q jMIPS-1.7/misc/hello_mips32

For example:

% java -cp jMIPS-1.7/lib/CPU.jar CPU.Cpu1 -q jMIPS-1.7/misc/hello_mips32
  Hello world
%

If you are unpacking the 1.7d or later archive, you'll get a GUI to launch the five pre-built models with. Run

% jMIPS-1.7d/lib/CPU.jar

%

and you'll get the window shown at right. It's already been used to pick out a hello_mips32 executable file from the misc/ subdirectory of the source archive, so that's shown as available and by default ready to run; and the "quiet" switch has already been selected too!

To change the CPU model (set at "3" in the image), use the "model" control. Then hit "run" and a window will pop up containing the output from the run.

How to compile the basic MIPS processor model

If you want to or have to compile the Java source code to get executable code, you would do so as follows, depending on which tools you have available. Since there are different sets of tools that people are used to using on different operating systems, the following subsections are specific to an operating system.

Compiling under Linux or Unix

If you are on a Mac, and use a HFS or HFS+ format file system, go into the system properties and turn on case sensitivity for it. You will need that as otherwise you will get case folding bringing several file names into collision with each other, which you do not want.

Unarchive the zip or compressed tar file (get them from the project page download link) with

% unzip jMIPS-1.7.zip

% tar xzvf jMIPS-1.7.tgz

respectively.

Then find the src/ directory in the newly unpacked file hierarchy and change your current directory to it (use the cd jMIPS-1.7/src change directory command to do so). I prefer to produce generic Java bytecode for a Java Virtual machine (JVM), with

% javac CPU/Cpu1.java

in the src/ directory, and then the resulting Cpu1.class file can be run by a JVM on any platform. You might do so under Linux with

% java CPU/Cpu1 -q ../misc/hello_mips32

For example:

% java CPU/Cpu1 -q ../misc/hello_mips32
  Hello world
%

Building a jar file is just a question of first building all the java class files:

% javac CPU/*.java
%

and then making a jar file with

% jar cf ../lib/CPU.jar CPU/*.class
%

In reality, a jar is just a zip archive of the package done without compression, and with one extra manifest file also included. So you can make it using just a zip archiver and you don't necessarily need the jar tool. Look up the details on Java tutorials site or examine an existing jar file using the zip archiver for comparison.

Compiling under Windows

To import the source into a Java NetBeans IDE, start yourself a new NB project (called `jMIPS', presumably), making sure the IDE dialog's tick-boxes for `Main Class', etc, are all unchecked.

Once the IDE has built all the directories and control files that it needs, copy the src/CPU/ directory *.java files from the source code archive to a new src/CPU/ subdirectory of the jMIPS project directory that has just been created by NetBeans. Use an operating system copy command straight from inside the zip archive, if possible, to avoid the files touching disk on the way. Windows may fold some of the file names otherwise. If Netbeans has a menu choice to import from zip, or compile from zip, that would be perfect.

The IDE will detect you populating the source code area and it will expand its `Source Packages' tree view (the jMIPS/src directory) to include a CPU "package" and its Java files.

If you'd rather, you can systematically rename the java class files for each processor model from Cpu1.java, Cpu2.java, etc, to, for example, WinCpu1.java, WinCpu2.java, etc. You will have to rename the class declared within each file to match.

Notes on running a processor model

The meaning of the command line options is as follows:

No options: running the model without any options (in particular without -q; i.e. without "quiet") will list each machine code instruction as it is executed with timing information by the side.

% java CPU.Cpu1  hello_mips32
0:      0.000000007s:   0x80030080:     addiu $29, $29, -32
1:      0.000000012s:   0x80030084:     sw $31, 28($29)
2:      0.000000019s:   0x80030088:     sw $30, 24($29)
3:      0.000000024s:   0x8003008c:     addu $30, $29, $0
4:      0.000000030s:   0x80030090:     sw $28, 16($29)
...
218:    0.000001567s:   0x8003000c:     lui $3, -20480
219:    0.000001573s:   0x80030010:     ori $3, $3, 16
220:    0.000001580s:   0x80030014:     sb $3, 0($3)
%

This run executed 220 instructions in 0.000001580 simulated seconds (the clock rate is 1 simulated GHz). That was around 5 clock ticks per instruction execution.

-q: using -q ("quiet") suppresses the listing of the disassembled machine code instructions as they run. In other words, with -q you see only what the program being run should produce, and no extra noise.

% java CPU.Cpu1 -q hello_mips32
Hello world
%

-d: extra information is output if -d ("debug") is given. Repeated -d options increase the amount of debugging noise.

% java CPU.Cpu1 -d hello_mips32
text start at virtual addr 0x80030000 file offset 0x10000
text end   at virtual addr 0x800300e0 file offset 0x100e0
text entry at virtual addr 0x80030080 file offset 0x10080
read 224B at offset 65536 from file 'hello_mips32'
stack start at virtual addr 0xb0000000
stack end   at virtual addr 0xb0100000
0:      0.000000007s:   0x80030080:     addiu $29, $29, -32
1:      0.000000012s:   0x80030084:     sw $31, 28($29)
2:      0.000000019s:   0x80030088:     sw $30, 24($29)
...
%

-o: you can set certain internal variables in the simulation via -o MEMORY_LATENCY=1000 (for example). Look up the code in Cmdline.java and add your own possible option settings to the presently very short list!

Please do just edit away at the source in order to add anything else you like, add yourself to the credits list at the top of the source code in the file, and send in your changed code - or publish it yourself elsewhere, as you wish.

Producing MIPS machine code to run in the processor

The "Hello world" program ready-built in MIPS R3000 machine code is in the archive as the hello_mips32 file in the misc/ subdirectory.

The (C language) source code for it is in the hello_mips32.c file in the archive misc/ directory, and the MIPS assembler for it is the hello_mips32.s. The machine code has been compiled quite standardly from the given source to give the machine code, via

% gcc -static -o hello_mips32 -Wl,-e,f hello_mips32.c

on a real MIPS machine. Check the manual page for the gcc command (i.e., "man gcc") to find out exactly what the options given in this command line mean.

On non-MIPS but Unix platforms, the following should achieve the same result using the mips-gcc cross-compiler suite (you may typically have to run a "setup MIPS" command in your shell environment in order to set your executable search path to pick up the cross-compiler's component parts):

% mips-gcc -DMIPS -mips1 -mabi=32 -c hello_mips32.c
% mips-ld -Ttext 0x80003000 -e f -o hello_mips32 hello_mips32.o

The software model of a MIPS processor stands in handily for a real MIPS machine when running the "Hello world" machine code.

% java CPU/Cpu1 -q hello_mips32
Hello world

However, more complicated machine code involving interrupts and peripherals might defeat this model.

Getting into the source code

Here are suggestions for how to get yourself "into" the source code, and having fun.

Choose some part of the very simplest CPU1 Java class source code and comment it to your personal satisfaction. It is pretty well-commented as it is, but there will surely be points where you feel it needs more (or less, or different).

Edit it, credit yourself at top, and send the changes to the project.

This is generally a good way to go about getting to know code. Complain about how badly written and difficult to understand it is, and fix it.

You'll find copious notes on the code in the following section. You will want to look at the code with those notes in hand. The notes will prove most helpful in terms of understanding the large-scale features, leaving only nuances to be explained via source code comments.

Add bgezal and bltzal instructions to the MIPS instruction set handled in the emulator's single big loop.

First check on the web to see what the instructions do and what the machine code format for them is! One can say they must be a cross between a branch instruction and a jump-and-link instruction just from the name. Suppose bgezal is much the same as bgez but in the event of a successful test it does the same as a jal instruction does (places the address of the following instruction in the return address register, ra). That is useful for implementing a conditional subroutine call. Check google!

Edit the machine code for Hello world with a binary editor ("bvi" will work well for you on Unix if you are a vi user; emacs users will already know that emacs has a binary editor mode) and replace the bnez v0, foo; with the machine code for bgezal v0, foo.

Test your modified emulator on the modified machine code. You'll need to make a compensating change to the program in order to save and restore the $ra return address register around the branch, but you'll see the new branch instruction itself working.

Notes on the basic MIPS processor model code

The regularity of the MIPS instruction set architecture results in very transparent simulator code. The simulator code has also been written with the objective of producing code that is clear without being over-sophisticated.

As a result, you can check out what any MIPS instruction does by eyeballing the appropriate area of the code.

The CPU1 class code is imperative with just the minimum of object-oriented wrapping in order to make it easy to walk through. It's a state machine! And so it is, in reality. Whatever you need to see in order to understand the CPU1 processor model is in that code in the CPU1 class, and you need look nowhere else.

You'll see there's just one big while loop in the code. It runs the generic Von Neumann fetch-decode-(read data)-execute-(write data) cycle (see accompanying figure at right), as implemented in all mainstream processor designs since the 1940s, and all the action is embedded inside this one long loop, with one short contiguous part dedicated to each kind of MIPS instruction. So you'll find there are about 10 or 12 shorter contiguous sections making up the body of the loop.

For example, the part dealing with jump instructions looks like this ('fetch' has already been done at the start of the loop body, in order to read the next instruction into the IR register). It's mostly block comment:

    /*
     *  Jump instructions
     *
     *  format  J  : 000010 jjjjj jjjjj jjjjjjjjjjjjjjjj 
     *          JAL: 000011 jjjjj jjjjj jjjjjjjjjjjjjjjj
     *
     *  action  of J, JAL:
     *  PC <- (bigpageof(PC) << 28) | (jjj...jjj << 2);
     *
     *  extra action of JAL:
     *  RA <- PC
     */
  if (op == J || op == JAL) {
      conf.jjjjjjjjjjjjjjjjjjjjjjjjjj = IR.jDST();              // decode:  IR register
      pc = PC.read();                                           // read:    PC register
                                                                // execute: (not present)
      if (op == JAL)                                            // write:   RA and PC register
          register_unit.write (R_RA, pc);
      PC.write((bigpageof(pc) << 28) | (conf.jjjjjjjjjjjjjjjjjjjjjjjjjj << 2));
  }

That's just 6 lines of real code, not counting the comment lines.

In summary, the model encoded by the CPU1 class embodies an abstract Von Neumann processor design without low-level details of precisely how the electrons are being shuffled around and you should have no problem at all in looking at the code and understanding what it does as hardware (tip: when you look at a small section of any source code, look at it with a particular question in mind, such as figuring out how function X is used so you can make use of it yourself, and ignore everything else; repeat until done).

The difference with respect to real hardware is that this code can only do one thing at a time, whereas real physics results in hardware doing all these things done in sequence here all at the same time during a processor cycle.

However, execution is timed using the Clock class methods and the final accounting to you of what has happened only occurs when a complete processor cycle is up. So the order things are done in in software between those points does not particularly matter. So long as the order makes logical sense, the model works.

For example, the jump code above writes the RA register with data read from the instruction in the IR register. In real hardware, those two things happen simultaneously, as the result of an electric potential field transmitted across a conductive wire. In the code above the read happens before the write, because the Java implementation requires it that way! We can't write in Java "read A and write the result into B simultaneously". The physics of the real universe makes that easy in reality! The accounting by the components involved, however, will record the same simulation time moment of occurrence for both the read and the write, and that is all you will see in the end.

Class layout

There are the following top-level classes in the source code, all in all. Only the first five or so contain any code worth discussing, and the discussion follows below the table:

Cpu1	(containing the main entry point) sets up the cpu and runs the simulation
CPU1	the continuously running fetch-decode-execute cycle
RegistersUnit	an emulation of the registers unit in the cpu
ALU	an emulation of the ALU in the cpu
Memory	an emulation of the memory unit
IOBus	an emulation of an I/O bus
StopClock	a special I/O unit that accesses the system clock
Register	an emulation of a single register, as incorporated in RegistersUnit
SignExtend	an emulation of the 16-32 bit sign extension unit in the cpu
Console	an emulation of the I/O unit which inputs bytes from keyboard and exports bytes to screen, as mapped into addresses in memory by the memory unit
Clock	a single global clock emulation, emulating the clock signal within the cpu
Utility	a class containing useful routines - byteswapping, etc.
Globals	a class containing all the global constants, such as the numerical value of the opcode for JAL
Cmdline	this class handles the simulator command line options and arguments
OptionT	a class encapsulating a command line option description, for use by the CmdLine routines
Elf	a class modelling the structure of standard executable (`ELF') format files on disk, as produced by compilers such as gcc, used to read an executable file and capture the program code inside it, called from main()
RegReturnT	a class modelling a pair of values simultaneously read from the registers unit in the cpu
AluReturnT	a class modelling a pair of values simultaneously output from the ALU in the cpu
DecodeReturnT	a class modelling the set of attributes (opcode, source register index, branch offset, etc.) decoded from a machine code instruction in the IR within the cpu.
InstructionRegister	a class containing methods for decoding a machine code instruction in the IR within the cpu
Debug	a class containing methods which display the workings of the cpu on screen

This is a so-called domain-model design. The classes in the software code correspond to real physical components in the hardware MIPS processor. The methods correspond to actual physical operations that the hardware components can do. Virtual operations, such as those that can be composed as combinations of simpler operations, are never implemented in the software no matter how "convenient" it might seem, as they have no physical existence.

UML class diagram for jMIPS code

The class diagram at right shows the dependencies.

If you want to add a single-step facility to the Cpu1 simulator, you need to take the single big fetch-decode-execute while loop in the CPU1 code and make it pause for user input before each new cycle. You should make an s ("step") from the user execute another cyle; a c ("continue") should send it back to cycling continuously; a p ("print") and a register number or memory address should show the register or memory contents; a d ("display") should be like print but cause the printout to occur at every step.

The main routine in Cpu1

This is a quick and dirty precis of the main routine in Cpu1, the wrapper class whose job is to understand the Java command line arguments you used, and then set the CPU1 processor model going:

public void main(String argv[]) {

    Cmdline.analyse(..., argv);                                // parse command line

    Elf elf[] = new Elf[Cmdline.args.length];
    for (int j = 0; j < Cmdline.args.length; j++) {            // load `ELF format' executable files
        String filename = Cmdline.args[j];
        elf[j] = new Elf(filename);
    }
    byte stack[] = new byte[STACK_END - STACK_START];          // prepare the CPU's stack memory area

    CPU1 cpu = new CPU1();                                     // build the CPU
    {                                                          // initialize the stack area
        cpu.memory.addRegion(stack, STACK_START, STACK_END);
        for (int j = 0; j < elf.length; j++) {                 // initialize program code areas
            cpu.memory.addRegion(elf[j].text, elf[j].text_start, elf[j].text_end);
        }
    }
    cpu.SP.write(STACK_END);                                   // set stack pointer to top of stack
    cpu.PC.write(elf[0].text_entry);                           // set program counter to 1st entry pt
    cpu.run();                                                 // run the simulation!
}

You see here that main parses the command line using the Cmdline analyse method, obtaining a list of executable files whose code it will run. The executable files are in what is known as `ELF format'.

ELF is a cross-platform standard used in many operating systems in use today and it is the format produced by the gcc MIPS compilers and assemblers. The main routine then goes ahead and loads those ELF files.

That is, it calls the Elf constructor on each file name, which parses the contents of the file and extracts information such as entry points, intended virtual address locations, etc., all of which get placed in the resulting Elf object.

Then main builds the stack region of memory and the CPU, and tells the CPU's memory unit about the stack and its intended virtual address via a call to the memory unit's addRegion method. And it tells the memory unit about the various stretches of program code it has picked up from the ELF executable files via more calls to addRegion.

It remains to set the CPU's SP (`stack pointer') register to point to the top of stack, set the CPU's PC (`program counter') register to the address of the intended program entry point - obtained from the first file named on the command line - and then call the newly built and prepped CPU's run method.

That's all. So the following is all the programmer needs to know about Cpu1:

Cpu1
static void main (String args[])	entry point; handles command line options, builds CPU, runs it

The next section shows you what the 'runs it' means.

The CPU1 processor model

The following is all the programmer can see or needs to know about the CPU1 processor class. There's a (pretty skeletel) constructor, and one method.

CPU1
CPU1()	CPU1 constructor
void run()	fetch/decode/execute cycle

The run method code is not conceptually difficult. It's just the one big loop:

public void run() {
    while (Clock.running()) {
         ... /* interior of Von Neumann fetch-decode-execute cycle */ ...
    }
}

As it says in the comment, the loop contains the interior of the fetch-decode-execute cycle. Here it comes:

// FETCH!

int pc  = PC.read();                                  // read PC
int ir  = memory.read32be(pc);                        // read instr at address given by PC
IR.write(ir);                                         // write instruction to IR
pc      = feAlu.execute(ALU_ADDU, pc, 4).c;           // add 4 to PC value using  adder
PC.write(pc);                                         // route incremented PC value to PC

// DECODE!

byte op = InstructionRegister.OPCODE(ir);             // decode opcode
DecodeReturnT conf = new DecodeReturnT();
{
    conf.op = op;                                     // decode output to wires op,ir,pc
    conf.ir = ir;
    conf.pc = pc;                                     // NB: "original PC"+4 is in conf.pc
}
switch(op) { 
  case 0:                                             // deal with ALU ops

    conf.sssss = InstructionRegister.aluopSRC1(ir);   // decode remaining instr fields 
    conf.ttttt = InstructionRegister.aluopSRC2(ir);
    conf.ddddd = InstructionRegister.aluopDST (ir);
    conf.func  = InstructionRegister.aluopOP  (ir);

    Clock.increment(DECODE_LATENCY);                  // update here for decode

    // READ!                                          // read designated register contents

    RegReturnT r = register_unit.read(conf.sssss, conf.ttttt);

    // EXECUTE!

    AluReturnT s = alu.execute(conf.func, r.a, r.b);  // do the designated arithmetic

    // WRITE!

    register_unit.write(conf.ddddd, s.c);             // route result to designated register
    break;                                            // ESAC

The abstract fetch/decode/(read)/execute/(write) sequence is clearly visible in the code.

There's one conditional block dedicated to each kind of instruction (i.e., one per distinct opcode). The fragment above shows the block dedicated to treating ALU operations, which all have the opcode 0.

The different kinds of ALU operation are distinguished by different values of the func field in the instruction (the last 6 bits at the least significant end of the word in bigendian layout).

All the distinct fields of the instruction in the IR are broken out, including func, in the section labelled DECODE!, and then the instruction's functionality is implemented in the short Java code sections labelled READ!, EXECUTE!, WRITE!

What's acceptable and what's not acceptable code in a processor model

Note that the code in the CPU1 run code is conscientiously using the emulated CPU hardware to do all the work.

Instead of adding 4 to the PC value using the power of the Java programming language ("pc += 4"), for example, the FETCH section reads the PC using the PC register object's read method, adds 4 using the dedicated fetch adder's execute method, and routes the result back to the PC using the PC register object's write method. That's a long way round where "pc += 4" might have done the job! What's going on! What's allowed here and what's not?

It is acceptable in the Java model to label values output by a hardware unit and use them as inputs to other hardware units. We've done that with 'pc = ...' and '... = pc'. That sort of thing amounts to using simple wiring to connect the data signals through the CPU circuitry within a single clock cycle. That's alright.

But anything even as substantial as a +4 on the 'pc' would be cheating to do in Java because it takes time and circuitry to do it in real life. The Java objects standing in for the CPU hardware components (alu, register_unit, etc) have timing embedded in them and using them in the Java code makes the simulated clock advance correctly. They're OK to use for computations. Computations take time to do in real life. The Java objects methods advance simulated time. All OK.

Using Java code like "pc+=4" would magically achieve the required effect without taking any simulated time and not occupy any of the simulated hardware in doing it either! That's not emulating the real life situation, that's emulating magic. Not OK.

So any arithmetic has to be done using the Java objects representing CPU hardware components, not Java arithmetic.

Because the DECODE section is done explicitly in inline Java code here, which makes it seem like it takes no time at all to do in simulated time, an explicit clock update has been embedded at the end of the DECODE code in order to keep the simulator timing honest.

Contrariwise, the FETCH section needs no extra time update because the time taken is dominated by the instruction read from cache/memory, and the clock is updated properly via the memory unit access method call in that code section.

The rule in writing simulator code like this is that one can use Java code variables to hold values but for no more than (less than) what is conceptually one simulated clock interval. They're like transitory signal values. Any signal value that is to persist for one simulated clock interval or more must be held in registers.

That's what's done in the CPU1 run code.

The notionally transitory pc and ir values are routed to the PC and IR for safekeeping across multiple clock intervals.

The conf object that groups together the decode outputs looks like it persists, but as soon as the IR changes next cycle, the conf will be changed too. So it's conceptually just representing a set of wires hooked up to the IR via some decode circuitry and there is no extra storage involved.

The CPU component classes

The remainder of this section documents the Java classes that represent components within the CPU. It's a reference document! You read a section just as and when you need to, and not before.

The RegisterUnit component class

This component comprises the 'small and super fast memory' inside the CPU.

RegisterUnit
RegisterUnit()	register unit constructor
RegReturnT read (byte, byte)	read two ints from two numbered registers
void write (byte, int)	write an int to a numbered register
void write (byte, byte, int, int)	write two ints respectively to two numbered registers

RegisterUnit objects are composed of an array r of 32 (actually a little more in order to incorporate PC and IR as well) Register objects.

There is nothing surprising in the read and write method codes. Read and/or write is all that can be done by or to hardware registers. The number of arguments says how many registers are being read and written at the same time (up to two reads and one write are supported, simultaneously). The 'byte' type for the register indices as arguments is just so you don't confuse the arguments that are the 5-bit indices of registers with the arguments that are the 32-bit contents of registers. Yes, we could use 'int' to represent both, but then you couldn't see which was which looking at the method type signature, and programming errors would follow on.

private RegReturnT readAndWrite (byte x1, byte x2, bool w_e, int d1)
    {
        RegReturnT res = new RegReturnT ();
        Clock.increment (REGISTER_LATENCY);         // keep timing up to date
        res.a = r[x1].read ();                      // read two registers
        res.b = r[x2].read ();
        if (w_e && x1 != 0)                         // if write enabled, write one register
            r[x1].write (d1);
        return res;                                 // return read register (old) contents
    }
public RegReturnT write (byte x1, int d1)           // write one register, return old value
    {
        return readAndWrite (x1, 0, true, d1);
    }
public RegReturnT read (byte x1, byte x2)           // read two registers
    { 
        return readAndWrite (x1, x2, false, 0);
    }

The only thing of any note is the automatic update of the clock when using the register unit. That guarantees that using this component leads to honest time accounting in the simulation. Reads and writes happen simultaneously. When we look at the Register component class code, we'll see that writes take effect next cycle, so one can do many distinct reads and one write to a register each cycle in the software, and it'll look as though it all happened at once in the simulator at the end of the cycle.

The ALU class

The ALU component does the arithmetic within the CPU.

ALU
ALU ()	ALU constructor
AluReturnT execute (byte, int, int)	execute an ALU operation given by opcode on two ints returning the int result and an int carry

The ALU class is a substantial piece of code, but it's entirely simplistic. The execute method code consists of a large switch statement which chooses a short block of code to execute depending on the func (function code) value supplied as argument. The code combines the two input integers a and b in the appropriate way to produce the result c and a carry or zero indicator z:

public AluReturnT execute(byte func, int a, int b)
{
    AluReturnT res = new AluReturnT ();           // prepare to return two 32-bit words
    switch (func) {
      case ALU_ADD:
        long r = a; r += b;                       // internally 64-bit signed addition
        res.c = (int) (r & 0xffffffff);           // capture bottom 32 bits of result
        res.z = (int) ((r >> 32) & 0xffffffff);   // capture top 32 bits of result
        break;
      ... 
        // other ALU ops treated
    }
    Clock.increment(ALU_LATENCY);                 // note the automatic clock update
    return res;
}

The above illustrates only the addition operation code, but it is perfectly representative. None of the ALU code is any more complicated than that.

The Memory component class

The Memory component represents random access memory (RAM) as seen by the CPU. The implementation here really mashes together two real hardware components, the memory manager and the memory units. It would be a bit of fussiness to separate out the two as two different components here, because we're not concerned with modeling memory in the same detail as we are modeling the CPU, and we would rather look at memory for the moment as a shiny, smooth, black box.

Memory
Memory ()	basic memory unit constructor
void addRegion(byte[], int, int)	set up a new memory region (e.g. stack, program text, ...)
int read32be (int)	read a 4-byte bigendian int from a location in memory given by a 32-bit address
void write32be (int, int)	write a 4-byte bigendian int to a location in memory given by a 32-bit address
byte read8 (int)	read one byte from a location in memory given by a 32-bit address
void write8 (int, byte)	write one byte to a location in memory given by a 32-bit address

A memory object internally consists of an array of regions. These really exist as 'zone boundaries' configured into the memory manager side of our memory unit.

Each region consists of a first address in memory, a last address, and a sequence of bytes representing the memory region content. In real life, once configured (and the configuration changes on the fly as one process after another enters the CPU), the memory manager imposes different policies on the different regions. One region may be read-only, another may be read-write. One may be cached, another not.

The addRegion method is used to configure one more region into the memory unit.

The read32be method then simply cycles through the array of regions looking for one which contains the supplied address for lookup, and returns the data stored in the region at the supplied offset when it finds the right region. The write32be method is complementary:

public int read32be(int addr)
{
    for (int i = 0; i < nRegions; i++) {
        if (region[i].start <= addr && addr < region[i].end) {
            int data = Utility.grab32be(region[i].text, addr - region[i].start);
            Clock.increment(MEMORY_LATENCY);                      // update clock
            return data;
        }
    }                                                             // missing memory is zero
    return 0;
}
public void write32be(int addr, int data)
{
    for (int i = 0; i < nRegions; i++) {
        if (region[i].start <= addr && addr < region[i].end) {
            Utility.put32be(region[i].text, addr - region[i].start, data);
            Clock.increment(MEMORY_LATENCY);                      // update clock
            return;
        }
    }                                                             // missing memory ignored
}

One might eventually want to signal an exception when a missing (i.e., unconfigured) region is addressed, but right now this code does the simple thing and ignores the problem.

The memory unit also contains special read8 and write8 methods which read and write only one byte at a time.

While they might seem like a convenience only (and indeed, internally they read and write 4 bytes at a time when they can using the read32be and write32be stalwarts), their real use is in performing memory mapped I/O.

When the address supplied is right, the memory unit read8 and write8 methods pass the access attempt on to an iobus instead of to memory.

Behind the bus and attached to it lie several I/O units.

The code will be described in more detail below, but the most relevant thing to take in right now is that this arrangement means that talking to a special memory address allows characters to be read from the keyboard and printed to the screen. It's called memory-mapped I/O.

The way the iobus intercepts are implemented inside the memory unit code is like this:

public byte read8 (int addr)
    {   
        byte data;
        if (addr == GETCHAR_ADDRESS) {
            try {
                data = iobus.read8(0);             // do read I/O from console keyboard
            } catch (Exception e) {
                data = 0;
            }
            return data;
        }
        ...
    }
    public void write8 (int addr, byte data)
    {   
        if (addr == PUTCHAR_ADDRESS) {
            try {
                iobus.write8(1, data);             // do write I/O to console screen
            } catch (Exception e) {
                // was best effort
            }
            return;
        }
        ...
    }

"0" is the I/O bus address of the console keyboard input, and "1" is the I/O bus address of the console display output.

Via these intercepts, the memory unit translates reads from memory address GETCHAR_ADDRESS to reads from I/O bus address 0 which the bus translates to reads from the console attached to it. Similarly for PUTCHAR_ADDRESS and writes to the console.

The memory unit's read8 code is more accurately represented as follows, with a loop over a port list instead of the hard-coded if statement address matches shown in the text above:

public byte read8 (int addr)
    {   
        byte data;
        for (int j = 0; j < port.length; j++) {
            if (addr == port[j].memAddr) {
                try {
                    data = iobus.read8(port[j].busAddr);     // do read I/O from I/O unit j
                } catch (Exception e) {
                    data = 0;
                }
                return data;
            }
        }
        ...
    }

Ports just consist of a memory address memAddr paired with the bus address busAddr it should be translated to by the memory unit.

We could get away with adding a single port at a single memory address for the console unit, but I've preferred to cater for the general case in which PUTCHAR_ADDRESS and GETCHAR_ADDRESS are not necessarily the same, thus registering two different port addresses which give access to exactly the same console in exactly the same way.

It's not a significant waste. It often happens in real life that I/O units have several alternative address mappings, all of which can be used for access. The redundancy is sometimes useful, and is often mandated for reasons of backwards-compatibility.

The port list is appended to within the memory unit like this:

port.add(new Port(GETCHAR_ADDR, 0));           // translation GETCHAR_ADDR -> bus 0 added
port.add(new Port(PUTCHAR_ADDR, 1));           // translation PUTCHAR_ADDR -> bus 1 added

In a later iteration of this code, you may wish to allow certain ports to be designated READ_ONLY or WRITE_ONLY! For now, each port is expected to support both reads and writes.

Timing: raw access to memory is set to take MEMORY_LATENCY picoseconds, by default 2500, i.e., 2.5ns. You can change this value by fiddling with the MEMORY_LATENCY static variable in the Globals class. Or you can use "-o MEMORY_LATENCY=..." on the command line.

Access to the I/O devices through the I/O bus address mappings takes a shorter time, nominally CACHE_LATENCY, which is set by default to 1000 picoseconds, i.e. 1ns.

The idea is that on write all that happens is that the data is launched onto the I/O bus and then the CPU continues with its cycle. The I/O bus works on its own to send the data on to the printer (say), and the printer works on its own to buffer the data, and later print it out.

On read all that happens is that the keyboard (say) device buffer is read via the bus for any characters previously typed. The ink is not sprayed on the paper, nor the keypress executed within the 1ns period itself.

But how does the CPU cope with delays of the magnitude of memory acceses? They are longer than the nominal 1ns associated with the default clock rate of 1GHz and at least on read the memory access must take place fully within the CPU clock cycle.

The answer is that the CPU executes a single exceptionally elongated cycle for a main memory read or write. The elongated cycle is of fixed duration but of such a length that the memory is able to respond within it. It's 2.5ns long, corresponding roughly to memory that can deliver 400MB/s.

That's less than half as fast as L1 cache (which you'll meet in the CPU4 design).

You will have to wait until you meet the CPU5 code to be able to improve on this design.

The IOBus component class

The IOBus is part of a three-way communication between CPU, memory, and it. When the CPU addresses certain places in memory, the access is redirected my the memory manager part of the memory unit to the IOBus instead.

The IOBus class read8 and write8 methods take a bus address ("0", "1", etc.) instead of a memory address. The redirect substitutes a bus address for a memory address.

There's an addIO method which registers an I/O unit (e.g. a Console object) onto the bus, which gives it a bus address. The bus addresses are awarded in order of registration to each I/O unit that wants one, in turn. As the CPU and memory unit is being put together, the bus is built and a single console is registered to the bus at both of bus addresses 0 and 1 like this:

IOBus iobus     = new IOBus();
Console console = new Console();
short stdin     = iobus.addIO(console);             // console registered as bus addr 0
short stdout    = iobus.addIO(console);             // console registered as bus addr 1 also

Bus address 0 will be the mapping used for GETCHAR_ADDRESS within the memory unit, corresponding to reads from the console, and bus address 1 will be the mapping for PUTCHAR_ADDRESS, corresponding to writes to the console.

The full interface to the I/O bus methods is

IOBus
IOBus ( )	Constructor
read8 (short busaddr)	read a byte thru bus from addressed source using I/O bus address
write8 (short busaddr, byte data)	write a byte onto bus to addressed target, return number written
addIO (IOUnit u)	register a new I/O unit on bus for read and write, return bus address

IOUnit is an abstract class ("interface"). Console objects match that interface. To meet the requirements of the interface an object just needs to have read8 and write8 methods. These methods don't take an address at all as they are conceptually associated with just their own I/O unit:

IOUnit
byte read8 ( ) throws IOException	read a byte (throw exception if fail)
int write8 (byte)	write a byte, return number of bytes actually written

An IOUnit produces a byte on a read request, and absorbs a byte on a write request. Console objects in particular have these methods. The memory unit build code adds a console to the iobus and maps its bus address to a memory address via the port list. The upshot is that the console object is accessible over the iobus via memory mapped addressing from the memory unit.

Why have we gone to the trouble of introducing a bus to mediate between the memory unit and what is at the moment a single I/O unit, a single console?

It's really so that we can model delays introduced by the fact that the bus can only be used to access one I/O unit at a time, and only one byte at a time can travel down it (in one direction at a time), and the fact that the I/O bus in real life is relatively slow too - 33MHz, as compared to the CPU's 1GHz.

The CPU can execute 30 instructions in the time that it takes 1 byte to travel down this I/O bus.

If the memory unit tries to access the bus while it is already in use, there's trouble. Make sure that writes to the I/O bus are at least 30ns apart or you will find that trouble and strange things will start to happen like writes to the bus being discarded.

In practice, it's not a deal-breaker. Whatever machine code you write, it's almost unheard of to produce a loop tight enough to do two writes to the I/O bus within 30ns. The CPU1 design struggles to deliver 6 machine instruction executions in that time. That's not enough to save registers, run a jump to a subroutine, do something, jump back, restore registers, increment a loop counter, check a loop bound, etc.

There's code in there that makes a noise on the simulation error stream if the I/O bus ever gets swamped and it does not happen in practice. It could, though. For fun, try making it happen, by writing the appropriate machine code (via MIPS assembler, of course). You'll get a stream of imprecatory complaints on the screen.

Real CPUs have their limits too, and real programmers write to them. You can write nearly anything you want in assembler, and whether the metal underneath can respond as you intend or not is a subtly different question. Real programmers follow the maxim of don't do that then.

For example, if you check real MIPS assembler code you'll find that every jump instruction is followed by a NOP because nearly every real MIPS machine's pipeline and instruction prefetch mechanism makes it inevitable that the instruction after a jump will be executed!

Miserly programmers could in theory save program text space by directly following a jump with a labelled non-trivial instruction, but they don't, because they know what will happen. Evolution has taught real programmers (the ones that survive to get their next paycheck and eat their next lunch) that reality is stronger than theory. Compilers don't do that then either, for the same reason.

The StopClock component class

One more special I/O address mapping intercept is pre-set. It causes the Clock.stop method to be called when HALT_ADDRESS is written to, which ends the simulation!

        ... 
        if (addr == HALT_ADDRESS) {
            Clock.stop ();                                              // do halt
            return;
        }
        ...

There's really a special extra I/O unit called stopClock placed on the I/O bus to implement this.

If you check out what the Memory code really looks like in the source file, you'll see instead of the above something like the following in the memory unit setup code:

StopClock stopClock = new StopClock();
short stpclk        = iobus.addIO(stopClock);       // stopClock registered as bus addr 3
port.add(new Port(HALT_ADDRESS, stpclk));           // translation HALT_ADDR -> bus 3 added

And the StopClock write8 method is only a thin wrapper round that vital call to Clock.stop.

class StopClock implements IOUnit {

     public int write8(byte data) {

         Clock.stop();
     }
     public byte read8() {

        return 0;
     }
}

If you are going to build more I/O units to implement special effects (perhaps audio?), the stopClock example provides you with the model for the way to go about it: first code the unit as a class with methods write8 and read8 that really do whatever you want, then make a new instance of the class at CPU or memory build time in the CPU code and put in a registration call with iobus.addIO for it, finally add a memory mapping intercept into the memory unit code. For best results do that last by updating the port list in the memory unit using port.add.

The Register component class

Registers are the components within the CPU that remember. Everything else starts each processor cycle as though it were starting anew each time. A register remembers what value it had last cycle. 32 addressable registers (and a few more) make up the register unit component in the CPU.

Register
Register (String)	register builder
int read ()	read a 32-bit int from a register
int write (int)	write a 32-bit int to a register, return old value

Register objects protect a single internal integer value. In principle, read and write methods simply access the integer, like this:

    private int x;                              // the stored register value

    // read the 32 bit register value
    public int read() {
        return x;                               // return stored value
    }

    // write the 32 bit register value
    public int write(int x) {

        int xprev = this.x;

        this.x = x;                             // update stored value
        return xprev;                           // return old stored value
    }

However, there's a twist.

To better mimic the action of a real, clocked, register, we've arranged that the result of a write cannot be seen until the next clock tick.

That is, reading the register shows the value stored in the register as of the last clock tick. You can write the register and read the register as you like during the period of a single clock tick, but reading will always show the same old value. After you call Clock.sync then reads will start showing the value that you wrote last before the Clock.sync call.

To implement that we've added a variable xtick that holds the number of the clock cycle on which the current value of x was written.

There's also a variable xnext that holds the value inserted by a write call. Reading simply returns x if the current clock cycle is still xtick, otherwise it moves xnext into x and updates xtick, before returning x.

    private int x, xnext;
    private long xtick;

    public int read() {
        if (Clock.tick <= xtick)
            return x;
        x = xnext;
        xtick = Clock.tick;
        return this.x;
    }

    public int write(int x) {
        xnext = x;
        return this.x;
    }

To make sure that the initial settings of the PC and SP registers in the main routine at CPU start-up are effective, xtick is initialized to -1.

Then on clock cycle number 0 the PC is written, which sets xnext. But on the first read in the fetch part of the fetch-decode-execute cycle, also on cycle 0, xnext will be moved into x and returned by the read.

It's a trick, but it works. Setting PC and SP in main makes those values available in the 0th processor cycle.

The SignExtend component class

The SignExtend component is the unit within the CPU that changes a 16-bit number into a 32-bit number, preserving sign in the process. In other words, it does not simply extend to the left with 16 zero bits.

SignExtend
SignExtend ()	builds a sign extension unit
execute (short)	convert a 16-bit short to a 32-bit int preserving sign

The sign-extend object's one and only method extends a 16-bit half-word to a 32-bit full-word while preserving sign.

Fortunately, in Java, that's just achieved by a (here implicit) cast to int. So the Java code is as trivial as can be!

    public int execute (short x) {
        return x;
    }

There's nothing more to say (in Java). Of course, somebody looking with a debugger at the hardware will see a negative 16-bit number being extended to the left with 16 1s, not 0s.

The Console component class

The Console component implements the IOUnit interface of read8 and write8 methods. It's a thing that inputs and outputs characters, reading from a keyboard and writing to a screen.

Console implements IOUnit
Console ()	console unit builder
byte read8 ()	read one character (ascii code value) from keyboard; return 0 if nothing ready
int write8 (byte)	write one character to screen; lose the character if screen not ready, return number written

A Console object's methods use the Java System.out.print and System.in.read methods to move a single byte to/from screen/keyboard.

  public byte read8 () {
      byte data[] = new byte[1];
      try {
          System.in.read (data, 0, 1);
      } catch (IOException e) {
      }
      Clock.increment (CACHE_LATENCY);         // the CPU does not pause for input! It polls.
      return data[0];
  }
  public void write8 (byte data)
  {
      ...
      // if output_queue_length < OUTPUT_QUEUE_LENGTH
      // && elapsed_time < IO_LATENCY, then queue it for output

      output_queue_length++;
      System.out.print ((char) data);          // emit the char that is now inevitable
      Clock.increment (CACHE_LATENCY);
  }

On input, if no character is in the keyboard buffer, the read8 method returns 0. The fast timing encoded in the routine reflects the immediate nature of this behaviour. The CPU is polling the keyboard input buffer. There's a character in it or there is not.

On output, there is a finite screen buffer to consider. All the CPU does is place a character in the screen output queue, which is immediate. It's the console that has to slowly pass the characters in its queue out to the screen. If the CPU writes to the queue faster than the console can empty the queue to screen, then characters soon start being lost because the CPU does not pause! The characters that are lost are the last ones that the CPU attempts to place on the already full screen output queue. They go nowhere.

All that results in quite complex timing emulation code in the write8 method. It's not reproduced above.

But beware! This console unit will eventually lose characters if you attempt to write through it to screen too quickly for too long at a time without allowing it to drain. You won't find a cure for that until you look at the CPU5 code.

We've actually been kind and supposed that the Console is fast enough internally to be able to see every character passed to it by the I/O bus. It's another question as to whether it will queue, output or drop the character, but see it it will. It might have been the case that the I/O bus were considerably faster than the Console's internals. Let's not go there! Slow printers are enough of a pain as it is. The CPU5 code will allow you to deal with that problem.

The InstructionRegister component class

The InstructionRegister class encapsulates the decode combinatorial logic. Its numerous methods break out the fields from a machine code instruction of the appropriate kind, but there's nothing at all complicated or subtle about the way it's done.

InstructionRegister extends Register
static byte OPCODE (int)	read a 32-bit machine code and return the opcode field<
static int jDST (int)	read a 32-bit machine code jump instruction and return the 26-bit jump address field
static ...	...

The methods are all one-liners.

    // get the opcode (top 6 bits) from an instr
    public static byte OPCODE(int x) {
        return (byte)((x >> 26) & 0x3f);
    }
    // get the 26 bit jump address part out of a jump instruction
    public static int jDST(int x) {
        return x & 0x03ffffff;
    }
    ...

There are about 20 different decoding methods, all similar to the above. The best thing to do is to skim-read the source file and become aware of what there are but look up what they are called only when or if you need one.

The global Clock

Notice that there is only one of this component. All its methods (and its internal attributes) are marked static, meaning that there's only one.

Clock
static long increment (long)	move the clock on by increment picoseconds from current cycle start
static boolean running ()	return true if clock still running
static void stop ()	stop the clock
static void reset ()	reset the clock to the current clock cycle start point
static void sync ()	advance the clock to the next clock cycle start point

The global clock encapsulates a single 64-bit number representing the number of picoseconds t elapsed since startup.

The increment method changes the value, returning the old value.

The running method returns the boolean value true until stop has been called, after which it returns false. In addition to the global internal picosecond counter t, the clock maintains a sense of when the last whole clock cycle started and when it is due to end, which could be at virtually any picosecond value.

The reset method sets t back to the current cycle start and the sync method sets t to the current cycle end/next cycle start point.

The idea is that calls to increment advise the clock of delays introduced by CPU circuit elements within the current clock cycle.

Since the elements (ALU, registers, etc.) are arrayed in parallel all the increment delays notionally happen in parallel. So they are not cumulative.

Instead the effect of increment is to push the time t towards the maximum of all the delays within the current cycle announced via increment so far.

Come the next call of sync the value of t will be checked against the next clock cycle start point and if t has overrun a nasty message will be displayed on your screen and the CPU will break down and stop, reflecting the idea that we have been running the CPU clock too fast for the CPU components to be able to keep up with it.

Helper classes

The following sections detail a few extra classes which do not represent CPU components. They're there just to give Java a helping hand.

The Utility routines

The Utility class contains a swathe of leftover routines that are generally helpful and don't belong anywhere else at all. They are static. That is, they don't have any access to any local objects's attributes. They're just pure functions.

Utility
static int grab32be (byte[], int)	read a 32-bit bigendian int out of a byte array at a given offset
static short grab16be (byte[], int)	read a 16-bit bigendian short out of a byte array at a given offset
static void put32be (byte[], int, int)	write a 32-bit bigendian int into a byte array at a given offset
static void put16be (byte[], int, short)	write a 16-bit bigendian short into a byte array at a given offset
static int swab32 (int)	byteswap a 32-bit integer
static short swab16 (short)	byteswap a 16-bit integer

These utility routines are the stuff of elementary Java exercises, and you may wish to reimplement them to suit your own style preferences.

    public static int grab32be(byte b[], int off) {
        return    ((b[off + 3] & 0xff) << 0)
                + ((b[off + 2] & 0xff) << 8)
                + ((b[off + 1] & 0xff) << 16)
                + ((b[off + 0] & 0xff) << 24);
    }

The Cmdline routines

The Cmdline class wraps some common code that is used to help parse command-line options and arguments.

Cmdline
static void analyse (OptionT[], String[])	parse the command line against the given table of expected options
static void usage (OptionT[], String)	produce a generic error message

While the command line parse routines are really quite complex codes internally, they only implement the very well-known API for the GNU getopt library call.

Look up getopt(3) using the unix man[ual] command.

It's implemented as a private static method here and called repeatedly by analyse until the command line is exausted.

        while ((opt = getopt(..., argv)) >= 0) {
            switch (opt) {
              case 'd': Globals.debug++;
                        break;
              ...
            }
        }

The counter optind (index into the array of strings argv passed from main) is incremented every time getopt is called and opt (a short code for the option parsed) and optarg (the string representing the option's argument, if any) are set by the getopt call.

I've skimped on implementing more of the standard getopt functionality than I needed. In particular, my hasty implementation doesn't bother to reorder the command line so that the non-option arguments come last.

In consequence, you'll have to use a command line like "CPU/Cpu1 -q hello_mips32" instead of being allowed to write "CPU/Cpu1 hello_mips32 -q" at the command line. All the options have to come before all the non-option arguments.

That's something to apologise for sorry and you may wish to fix that lazy oversight which is currently spoiling the implementation of getopt.

The Elf helper

The Elf class is there to hide all the messy business of picking up an executable file from the file system and reading its machine code contents into memory. You really do not want to know about it.

Elf
Elf (String)	build an ELF executable file representation. Afterwards its fields are available for interrogation

The Elf class code is complex internally, but all it does is hack its way through an ELF-formated file picking up data as it goes.

Everything needed and more than you want to know about the ELF format standard is to be found in the unix elf(7) man[ual] page and the elf.h C header file on any system with a working C compiler. The Java code has been commented quite heavily.

ELF files consist of a header followed by several segments. Each segment is made up of sections.

The ELF header consists of a fixed format initial part and a variable length trailer. The fixed format initial part of the header defines the sizes and numbers of the entries and the offsets to two tables farther on in the trailing part of the header. As far as one can see, the header is in practice 64KB long in total, even when the data it contains is relatively little. It may be possible that it is sometimes longer than 64KB. The total header size is one of the more important bits of information in the fixed format initial part of the header!

The first table in the trailing part of the ELF header enumerates and describes the segments in the ELF file. It's called the "segment header table".

The Java code searches that table looking for a segment that is read-only and executable. There appears to be only one such segment in all ELF files in real life, but perhaps there can be more.

This is the segment that contains the program code part of the ELF file. This so-called program segment should contain both a program section, and also a read-only data section, and there usually is at least one of the latter.

It'll contain the manifest string constants used within the program, for one thing.

But I didn't bother with the distinction between those two sections within the program segment. I arranged that the whole of the program segment, containing both program instructions and read-only data, be lifted into the Elf object as a single byte array.

The segment header table in the ELF file lists the offset in the file and the length of the program segment. It also lists the virtual address that it is supposed to be loaded into memory at for the purposes of running the program code successfully. That's all information which is lifted into the Elf object by my parse. The ELF header also contains the intended entry point in virtual address space for the program, and that also gets lifted into the eventual Elf object. I did not check for consistency, and you may wish to add such a check (I'd hope the entry point address were within the area in memory where the supposed program gets loaded to!).

I essentially blithely ignored the second table in the ELF header, which lists the sections (the sub-parts of the segments), and which therefore potentially gives up more fine-grained information than the segment header table. It's called the "section header table".

Ignoring this table as I did may be fatal in some situations. In particular, the section header table provides a pointer to an area of the ELF file where the names of all the segments and sections are kept, and that's potentially a much more reliable way of locating the program segment/section than the way I did it. The program section name is always ".text". Check the elf(7) manual page for more details.

However, parsing the name table itself is not at all simple, once one has found it, and I opted not to try! If one does parse it then in particular one also has a reliable way of finding a ".bss" segment, if there is one. If it exists, that defines a range of memory addresses which are supposed to be zeroed and provided to the program to serve as writable scratch space. These bss sections are quite common in compiler output. Any variable declared static in C will be in one. I didn't even attempt to look for it, so if a program has a bss segment, the simulator will fail when it runs and doom and gloom will descend on the multitudes ...

A hack that may find a bss segment with little work may be to look for one that is described as writable and not executable. It's hard to imagine another with those two characteristics. Please add a search for such a section and prepare a corresponding memory area in the simulator runtime.

And while I assumed that a read-only program data section would be found in the executable program segment, that's also only a guess which may turn out not to be true of some executable files out there. Please add some checks which make a loud noise if the assumption is proved incorrect and there are read-only sections elsewhere.

So there's lots of room for improvement here, but it's tangential to the issue of whether the simulator itself works or works well on those executable program files it can run.

The Debug class

Debug
static String str_op (byte)	return string representation of the instruction name part by opcode
static String str_func (byte)	return string representation of the instruction name part by the function code in an ALU operation (the instruction has opcode 0)
static void print_instr (PrintStream, DecodeReturnT, long, int)	print the disassembly of an instruction given its decoded representation, time of execution and execution sequence count
static void queue_state (byte, int)	queue a register index and value for printing
static void print_state (PrintStream)	print the queued state vector

There is no code of relevance to understanding how a CPU or the CPU simulation works to be found among the Debug routines!