jMIPS: an open source MIPS processor in Java
Adding a memory cache to the model

To study the effect of caching, a memory cache has been added to the model. The augmented Java is the CPU4 class code. Whereas memory accesses take 2.5ns, cache accesses take just 1ns.

The new cache is 64KB, organised as 32KB of program cache and 32KB of data cache. The two parts are both 'direct-mapped' and 'write-through', with cache lines that are 4 bytes long. That is to say that the cache always reads 4 bytes at a time from main memory, and always writes 4 bytes at a time to main memory. If the cache needs to read the byte at address 57, then it reads the bytes at addresses 56, 57, 58, 59 all at the same time, even though bytes 56, 58, 59 are not needed. Ditto for write.

Saying that it is a direct-mapped cache means that there is only one possible place in the cache for each byte from main memory. A byte with address n goes in cache line number (n >> 2) % 8192. There are 8192 (i.e. 2 to the power of 13) cache lines in each cache, each containing 4 bytes. The byte goes in the line whose number is formed from the 13 low bits 14-2 of the byte's address. That is a fair design for a non-multitasking operating system, but the cache will thrash badly if more than one program runs alternately in the CPU. The cache line that fills with the group of 4 bytes starting at address n is given the cache tag (n >> 2) / 8192, consisting of the 17 high bits 31-15 of the byte's address. Inside the cache, the tag of the predicted cache line is examined to determine whether a wanted byte is in the cache or not.

The cache is write-through, which means that as soon as data is written to the cache, it is written on to main memory too. This makes writing to the cache as slow as writing to main memory. So why bother? The answer is that one would have to write to main memory anyway, and letting the cache snoop the data on the way there doesn't cost anything extra over that. It's just that one could have run even faster if main memory could have been updated opportunistically instead (a "write-back" cache). On the plus side, it saved me effort, as I didn't have to put in the accounting for whether data in cache has gone to main memory yet or not. Direct mapping also saved programming effort in the code because no (non-trivial!) LRU algorithm was required. There's no "choosing" which cache entry to evict when a new datum is written in a direct-mapped cache. There is only one place each datum can go, so only one possible choice for eviction.
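As a concrete illustration of the direct-mapped index/tag arithmetic described above, here is a throwaway helper (hypothetical, not part of the jMIPS source) that works out where a byte address lands:

    // Hypothetical helper illustrating the direct-mapped index/tag split described above
    static void whereDoesItGo(int n)            // n is a byte address
    {
        int word  = n >> 2;                     // the cache deals in 4-byte units
        int index = word % 8192;                // bits 14-2: which of the 8192 lines
        int tag   = word / 8192;                // bits 31-15: the tag stored with that line
        System.out.println("address " + n + " -> line " + index + ", tag " + tag);
    }

For example, address 57 lands in line 14 with tag 0, alongside addresses 56, 58 and 59.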
What's new now?

Concretely, the caching model differs from the non-caching models in terms of the Java code as follows. There is no difference visible at the top level.

There is a new Cache class representing the cache, and the only difference in the code is that the Memory unit has two Cache objects created inside it by the main() setup code. Caches have just read and write methods for 32-bit integers (4 bytes) at a time. Read can raise an exception (when the cache doesn't have the data requested); write always succeeds. When memory wants to read or write just a single byte to cache it has to diddle with the 32-bit cache data itself. Both cache read and write methods take the address/4 as argument, because they always deal in sets of 4 bytes.
The constructor builds an internal array of nlines cache line tags and a corresponding space for the lines of cache data themselves, each 4 bytes long:

    Cache(int size)
    {
        nlines = size / 4;
        tags   = new int [nlines];
        lines  = new byte[nlines][4];
    }

A cache write involves calculating the array index of the cache line that is affected (line_idx) and checking the tag on that line to see if it matches that for the address being written (line_tag). If it does, there has been a cache hit on write. If it does not, then the write is a cache miss. The hit/miss statistics are collected for later, though the code inserts that do that are not shown here.

    void write32be(int addr, int val)              // addr on input is the true addr divided by 4
    {
        int line_idx = addr % nlines;              // index of cache line affected
        int line_tag = addr / nlines;              // expected cache tag

        if (tags[line_idx] != line_tag) {
            tags[line_idx] = line_tag;             // cache miss - write new tag
        }
        Utility.put32be(lines[line_idx], 0, val);  // update the cache line
    }
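The update32be method that appears in the memory code below is described later as being write32be minus the statistics accounting, used when memory snoops freshly read data into the cache. A hedged sketch along those lines (not the actual jMIPS source) might be:

    // Sketch only: update32be as write32be minus the hit/miss statistics
    void update32be(int addr, int val)             // addr on input is the true addr divided by 4
    {
        int line_idx = addr % nlines;              // index of cache line affected
        int line_tag = addr / nlines;              // expected cache tag

        tags[line_idx] = line_tag;                 // (re)label the line unconditionally
        Utility.put32be(lines[line_idx], 0, val);  // fill the cache line
    }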
Cache reads throw an exception if there is a cache miss.

    int read32be(int addr) throws Exception        // addr on input is the true addr divided by 4
    {
        int line_idx = addr % nlines;              // index of cache line affected
        int line_tag = addr / nlines;              // expected cache tag

        if (tags[line_idx] != line_tag) {          // cache miss raises exception
            throw new Exception("addr " + addr + " not in cache");
        }
        return Utility.grab32be(lines[line_idx], 0);
    }

Again, the statistics-collecting code inserts are not shown here. Neither is some carefully engineered timing code which calculates how long the cache access would have taken. The timing code can be switched to use the timings appropriate to a write-back cache for comparison purposes: use "-o CACHE_TYPE=2". Type 1 is the (default) write-through cache.

The memory unit read methods (for single bytes and 4 bytes) now ask the cache first. If the cache read throws an exception, the main memory area will be consulted directly instead, and the cache updated with what is found. Memory write methods always update both main memory and cache ("write-through") together.

    public int read32be (int addr)                 // requested addr is aligned at 4
    {
        ...
        if (region.start <= addr && addr < region.end) {
            try {
                data = cache.read32be (0x3fffffff & (addr >> 2));
            } catch (Exception e) {
                // not in cache, get it from memory
                int physaddr = addr - region.start + region.offset;
                data = Utility.grab32be (region.text, physaddr);
                cache.update32be (0x3fffffff & (addr >> 2), data);  // snoop the data into cache too
            }
            return data;
        }
        ....
    }

    public void write32be (int addr, int data)     // requested addr is aligned at 4
    {
        ...
        if (region.start <= addr && addr < region.end) {
            int physaddr = addr - region.start + region.offset;
            cache.write32be (0x3fffffff & (addr >> 2), data);       // update the data in cache
            Utility.put32be (region.text, physaddr, data);
            return;
        }
        ...
    }

The read8 and write8 methods are not shown; they pull in a whole cache line at a time and then update or isolate only the byte that is requested (a hedged sketch of read8 appears at the end of this section).

MEMORY_LATENCY is 2.5ns in the code, versus a CACHE_LATENCY of 1ns. Memory reads which hit cache take only 1ns. Memory reads which miss cache take 2.5ns. Memory writes ("write through cache") always take 2.5ns. The CPU may be supposed to have its clock cycle elongated ("wait state") in a load or store in order to allow for the expected length of memory accesses. If the memory is slower than the 2.5ns allocated, it won't work!
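For reference, here is a hedged sketch of what read8 might look like, built on the public 32-bit read shown above (a reconstruction, not the actual jMIPS source):

    // Sketch: fetch the whole aligned 4-byte word through the cache, then isolate one byte
    public int read8(int addr)                     // any byte address
    {
        int word  = read32be(addr & ~3);           // the 32-bit read does the cache lookup
        int shift = (3 - (addr & 3)) * 8;          // big-endian: the aligned byte is the top byte
        return (word >> shift) & 0xff;
    }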
Results from the caching model

With the new caching model, the processor simulation runs the "Hello world" code as shown in the table below. The first row is obtained by temporarily setting all memory accesses to take 1ns via "-o MEMORY_LATENCY=1000". The label on the row reflects the idea that the speed seen is the same as what one would get if one surreptitiously pre-loaded all code and data into the cache before program start, and let the cache "write back" updates (later) rather than "write through" to main memory (immediately).
It is clear that the cache saves at least 200 main memory reads (about 500ns). It is enormously influential here. A look at the statistics produced by "-d" on the command line shows that the cache intercepts 320 of 370 program instruction read attempts:

    % java Cpu4 -d hello_mips32
    ...
    220: 0.000000739s: 0x80030014: sb $3, 0($3)
    ...
    prog cache read hits 320/370, write hits 0/0
    data cache read hits 52/52, write hits 12/19
    %

That's interesting, because only 220 instructions in total are executed to completion in the "Hello world" program. The extra reads occur when instructions are prefetched and then flushed from the pipeline before they can complete. Evidently 150 poor predictions were made. It would be worth trying to improve the pipeline's prediction mechanism. At the moment the prefetched instruction is always the next instruction in strict address order in the code, at the PC+4 location.

The count also shows that there are only 50 different instructions to execute (the program loops through them) and they all ended up in the cache. One has to conclude that caching the program text is a major win on its own.

The data cache intercepted all (52) of the reads that it tried, which shows that the data was always written before being read. The "data" is actually the contents of registers saved onto the stack before subroutine calls. Yes, writing does occur before reading there. The count shows that 7 different stack locations were used overall (19 writes, of which 12 hit lines already present in the cache).

That's enough data to predict the theoretical perfect performance from a psychic pipeline. It's in the last column:
For the top row (final column), we've counted just 220ns for 220 instructions, at 1ns apiece. For the middle row (final column), we've counted the 220ns for the 220 instructions, plus an extra 1.5ns for each of the 50 reads from main memory of program instructions that must necessarily take place, and an extra 1.5ns for each of the 19 data writes to main memory that did take place. For the bottom row (final column), we've counted the 220ns for the 220 instructions, plus an extra 1.5ns for each of the 220 reads from main memory of program instructions that must take place, and an extra 1.5ns for each of the 19 data writes to main memory that did take place.

You could fill in the blanks in the table by retro-fitting the more sophisticated memory-with-cache class code into the non-pipelined simulator code. It should drop right in.

Exercises with the optimized pipelined caching model

Improve the cache in the cached and optimized pipelined emulator CPU4 java class source code. It's only direct-mapped write-through by default. See how good you can get the cache figures with respect to the "perfect psychic cache" noted above.

Doubling cache line length

You may easily improve the CPU performance figures by making the cache lines longer, so that more data is read ahead at a time into the cache, saving time overall. But don't go overboard. It's not certain that even 64-bit wide data paths from cache to main memory would be approved by the project accountant! And it's unlikely that you'll get anywhere near as much improvement in doubling from 64 to 128 bits as you do in doubling from 32 to 64 bits.

Here are hints on how to extend the cache lines in the CPU4 model code from 4 bytes to 8 bytes length. That should make cache hits rather more frequent. Indeed, indications from a few test runs are that it's something like a 10% win from this rather easy coding change in the model. Whatever little more one might win with 16-byte cache lines would require substantially more coding work, and it is not worth even thinking about until you have tackled this task first.

The changes you make should be centered on the Cache.java file, which contains the cache methods read32be and write32be (plus write32be's simpler-minded cousin, update32be), in about 20 lines of code. You will modify them into read64be and write64be methods, allowing 64-bit data to be read and written from/to cache.

Start by changing the linelen=4 initialization to linelen=8, doubling the length of the cache lines in the cache and halving their number. Change the type of the read method to allow it to return a long, instead of an int. Change the declaration of the variable val that is returned from int to long, and change the way it is read from a cache line to use the utility function grab64be (write it! - hedged sketches of grab64be and put64be follow at the end of this section) instead of grab32be. Cache lines are now 64 bits long, not 32, so the change is forced. That's it for read.

Similarly, change the type of the write method to allow it to accept a long, instead of an int. Change the call to put32be that puts its argument into a cache line to use put64be (write it!) instead. Cache lines are now 64 bits long, not 32, so the change is forced. Make the same changes in the update method, which is the same as write bar the extra cache statistical accounting lines that write has and update has not. That's it - you've got a couple of 64-bit cache methods. You have now notionally widened the bus between cache and memory to 64 bits.
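The grab64be and put64be utilities are left for you to write. One hedged sketch, assuming they live alongside grab32be and put32be in the Utility class and follow the same big-endian convention, might be:

    // Sketch: assemble a big-endian 64-bit value from 8 bytes of a buffer
    static long grab64be(byte[] b, int off)
    {
        long val = 0;
        for (int i = 0; i < 8; i++) {
            val = (val << 8) | (b[off + i] & 0xffL);   // most significant byte first
        }
        return val;
    }

    // Sketch: scatter a 64-bit value big-endian into 8 bytes of a buffer
    static void put64be(byte[] b, int off, long val)
    {
        for (int i = 7; i >= 0; i--) {
            b[off + i] = (byte) val;                   // least significant byte goes last
            val >>>= 8;
        }
    }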
But now you're temporarily stuck, because the two calls in Memory.java are to the old cache read32be and write32be/update32be methods, which you just modified and renamed to read64be and write64be. The old 32-bit cache interface no longer exists. The bus between memory and cache is now wider than 32 bits - it's now 64 bits wide. You need to fix something here.

Before thinking about it, make yourself a couple of new methods in Memory.java, called read64be and write64be. Make them by modifying the existing memory read32be and write32be methods, and change the type returned by read and the type accepted by write to long, from int. Then there'll be room to pass 64-bit data through them. Mark them as private, as there'll be no external use of these methods. Once they're made, we'll see a use for these named lumps of code - they're nothing more significant than that - as the memory's end of the newly 64-bit wide memory-to-cache bus.

For read64be, change the declaration of the data variable to long, from int, and load it using a cache read64be method call, not a cache read32be method call. The address passed to the cache in the call needs to be divided by another 2 too, as the cache uses whole lines as data items and expects not to have useless zeros on the end of addresses passed to its methods (this is arguably a design mistake, and you might wish to correct it). Cache lines are twice as long as before, so there are half as many of them as before, so cache addresses are half what they were. If the cache misses, the code falls back to looking in the memory area, and the method call needs to be changed from a grab32be to a grab64be, since we are now dealing with 64-bit data. Finally, the snoop of the data brought back from memory to the cache needs to be done with a cache update64be method call, not an update32be call. And the address passed in the call needs to be divided by 2.

For write64be, exactly the same sorts of changes need to be made. There are only two lines that need changing! Remember to return 8, not 4, in case of success. The write method returns the number of bytes written successfully.
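To make those changes concrete, here is a hedged sketch of what the private read64be might end up looking like, mirroring the 32-bit memory read shown earlier (the '...' elisions stand for the unchanged surrounding code, and the exact cache address masking is an assumption):

    private long read64be (int addr)               // requested addr is aligned at 8
    {
        long data;
        ...
        if (region.start <= addr && addr < region.end) {
            try {
                data = cache.read64be (0x1fffffff & (addr >> 3));   // cache now addressed in 8-byte lines
            } catch (Exception e) {
                // not in cache, get it from memory
                int physaddr = addr - region.start + region.offset;
                data = Utility.grab64be (region.text, physaddr);
                cache.update64be (0x1fffffff & (addr >> 3), data);  // snoop the data into cache too
            }
            return data;
        }
        ....
    }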
Now you have a couple of very neat pieces of code, write64be and read64be, in Memory.java, but no other code calls them! You just made them, fresh, and they're private to memory. The existing CPUx code calls a public memory read32be and write32be method when it wants access to memory, and hopefully those vanished when you modified them into read64be and write64be, so the compiler will complain. You have to reinvent a public read32be and write32be in Memory.java for the CPU to call. There's still only a 32-bit bus between CPU and memory/cache and you need the interface for it!

For the memory read32be method, make it call the private memory read64be method that has access to the 64-bit memory-to-cache bus and throw away `the wrong half' of the result. Notice that if you are interested in getting the 32-bit int at addr (necessarily a multiple of 4) for read32be, you have to ask for the 64-bit long at addr & 0xfffffff8 (the multiple of 8 equal to or just below it) from read64be. That is, you will call

    long data64 = read64be(addr & ~7);

8-byte memory and cache blocks (`long's) exist at addresses that are multiples of 8 (called "8-byte aligned", or "64-bit aligned", addresses). They do not overlap. Ditto 4-byte `int' blocks and addresses which are multiples of 4, and 2-byte `short' blocks and addresses which are multiples of 2.

So read64be can be expected to get 64 bits from the 64-bit-aligned memory address you supply to it, and you want just 32 of those 64 returned bits. So what is `the wrong half of the result'? Here's a picture of the 64 bits at memory address 0x18, addressed byte by byte:
What if you want the 32 bits at address 0x1c? That's bytes 0x1c, 0x1d, 0x1e, 0x1f. Then you will ask read64be for the eight bytes pictured above at 0x18, and throw away the first four (the big end) and return the remainder (the little end). Here the bytes to throw away are shown in red, and those to keep are shown in green:
In other words, if addr == (addr & ~7) + 4 then you will want to return (int)data64.

What if you want the 32 bits at address 0x18? That's bytes 0x18, 0x19, 0x1a, 0x1b. Then you will ask read64be for the eight bytes at 0x18, just the same, and throw away the last four (the little end) and keep the first four (the big end):
In other words, if addr == (addr & ~7) then you will want to return (int)(data64 >> 32).

So you throw away either the big end or the little end of the 8-byte long returned from read64be, depending on the address of the 4-byte int you want. The `wrong half' is whatever half you have to throw away, and it varies from case to case.

For write32be, never mind the wastage for now: just make it call read64be first to get 64 bits, modify 32 of those with the 32-bit data you want to write, either the big end or the little end, then write all of the modified 64-bit data back with write64be. Start with the read you've been advised to do:

    long data64 = read64be(addr & ~7);

You want to write `int data' at addr. In the end you will use the expression

    data64 = (long) topbits << 32 | botbits & 0xffffffffL;

to create the 64-bit data to write with write64be from two 32-bit integer halves. Note that the 'L' means a 64-bit constant is understood here, and this one contains 32 zero bits in its top half and 32 ones in its bottom half. We'll look here at how to get the 32-bit integers topbits and botbits.
If the target integer address is addr == (addr & ~7) + 4, then int topbits = (int)(data64 >> 32) is the big end of the 64-bit data read at addr & ~7, which you do not change, and int botbits = data. Check the picture above to confirm.

If the target integer address is addr == (addr & ~7), then int topbits = data and int botbits = (int)data64. That's the little end of the 64-bit data read at address addr & ~7, which you do not change. Again, see the picture above for confirmation.

Build data64 from topbits and botbits as advised above, and write it to addr & ~7 with write64be.
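Putting those pieces together, hedged sketches of the rebuilt public 32-bit wrappers in Memory.java might look like this (a reconstruction from the recipe above, not the actual solution code):

    // Sketch: public 32-bit read on top of the private 64-bit read
    public int read32be(int addr)                  // requested addr is aligned at 4
    {
        long data64 = read64be(addr & ~7);         // fetch the enclosing 8-byte block
        if (addr == (addr & ~7) + 4)
            return (int) data64;                   // keep the little end
        else
            return (int) (data64 >> 32);           // keep the big end
    }

    // Sketch: public 32-bit write as a simple read-modify-write of the 8-byte block
    public void write32be(int addr, int data)      // requested addr is aligned at 4
    {
        long data64 = read64be(addr & ~7);         // fetch the enclosing 8-byte block
        int topbits, botbits;
        if (addr == (addr & ~7) + 4) {
            topbits = (int) (data64 >> 32);        // big end unchanged
            botbits = data;                        // new little end
        } else {
            topbits = data;                        // new big end
            botbits = (int) data64;                // little end unchanged
        }
        write64be(addr & ~7, (long) topbits << 32 | botbits & 0xffffffffL);
    }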
Yes, you can improve that. It's better to look in cache first during a 32-bit write. If the data is in cache already, then there are 64 bits of it right there that can be modified in situ, and only the 32 changed bits need be written through to memory. No memory read is required. Only if the data is not already there in cache does an extra 32 bits need to be read from memory into cache to fill the cache line completely. The 32 bits that are being written can be written straight through to memory, as well as into cache. So the memory read during memory write suggested as an expedient above can be avoided at the cost of some complicated if-then-else-ing in the code.

And then you may want to make the lines longer still.

Making the cache 2-way

Here are some coding hints on making the model cache 2-way. I would go about it in a low-level fashion, doubling the number of arrays held inside the present cache object. You, however, may wish to go about it in a more "programmerly" fashion, inventing a new cache object (class) that contains two of the old cache objects, a primary and a secondary.

On read, check both of these internal caches for the cache line that's requested. Just pass on whatever you receive in your read request to the two internal caches' read methods in turn. One of them may have it (which means not throwing an exception in reply). If neither has it, i.e., both internal caches throw exceptions on read, then throw an exception back to the caller.

On write, choose one of the internal caches to write the cache line to. If the entry at that address was already in the other internal cache, delete it from there! (Alternatively, don't write it to a different internal cache if it is already in one.) That's all. Yes, your new cache has a pretty lousy replacement policy - effectively pure happenstance. You really want to improve it! Perhaps maintain a table of which of the two internal caches has the older entry for each cache index. Then when you write, choose to write to the internal cache with the older entry at that index position. Remember to change round the table entry for which is older afterward. Update the table entry on read too. That implements an LRU (`least recently used') replacement policy. A sketch along these lines follows at the end of this section.

You can do all that in a low-level way too, modifying the internals of the existing Cache class to include two sets of what it has inside it now, plus the extra table of `oldest' indices. I'm sure I'd do that. Expand your calls to the methods of the internal caches out into flat code, and you should get what I would have written. It's a much closer representation of the hardware. For starters, you'll have to name the two sets of internal local variables (`names of traces') differently, so you will become aware that they correspond in real life to separate parts of the chip real-estate.

Making the cache 2-way or 4-way associative ought not to improve the performance figures for the single "Hello world" program, because there is no thrashing. A program larger than 32KB, or one which used data sets larger than 32KB, would cause cache thrashing. So would running two different programs alternately, as is effectively the case with the CPU5 model that we will look at next. There the interrupt handler plays the part of a second program running alternately, and performance will be improved by a 2-way associative cache.
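As promised, here is one hedged sketch of the "programmerly" 2-way variant with the crude LRU table, built purely from the existing Cache interface (the class and field names are hypothetical):

    // Sketch: a 2-way cache as a wrapper around two of the existing direct-mapped caches
    class TwoWayCache
    {
        Cache primary, secondary;                  // the two "ways"
        boolean[] primaryIsOlder;                  // LRU table: true if the primary way holds the older entry

        TwoWayCache(int size)
        {
            primary   = new Cache(size / 2);
            secondary = new Cache(size / 2);
            primaryIsOlder = new boolean[size / 2 / 4];   // one entry per line in a way
        }

        int read32be(int addr) throws Exception
        {
            int idx = addr % primaryIsOlder.length;
            try {
                int val = primary.read32be(addr);  // first way has it
                primaryIsOlder[idx] = false;       // primary is now the newer entry
                return val;
            } catch (Exception e) {
                int val = secondary.read32be(addr);// second way has it, or this throws again
                primaryIsOlder[idx] = true;        // secondary is now the newer entry
                return val;
            }
        }

        void write32be(int addr, int val)
        {
            int idx = addr % primaryIsOlder.length;
            try {                                  // if a way already holds this address,
                primary.read32be(addr);            // keep writing to that same way
                primary.write32be(addr, val);
                primaryIsOlder[idx] = false;
                return;
            } catch (Exception e) { }
            try {
                secondary.read32be(addr);
                secondary.write32be(addr, val);
                primaryIsOlder[idx] = true;
                return;
            } catch (Exception e) { }
            if (primaryIsOlder[idx]) {             // otherwise overwrite the older entry
                primary.write32be(addr, val);
                primaryIsOlder[idx] = false;
            } else {
                secondary.write32be(addr, val);
                primaryIsOlder[idx] = true;
            }
        }
    }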
Making the cache write-back

You might try making the cache write-back and check the predicted performance figures - make writes take only CACHE_LATENCY each time. How would you make the cache write-back? You'll start by taking the control of the write-through to memory out of the hands of the Memory class write method! Do it in the write method of the Cache class instead, where you can put in a call to memory.update32(addr, data). Remember to reconstruct the appropriate memory address from the address the cache receives (which is the memory address divided by 4, as I had it originally) - you will have to multiply by 4. Or by 8, if your cache lines are 8 bytes long and you are dividing memory addresses by 8 when passing a line to the cache. Etc. Of course, to call the memory update method, you will need to implant a reference to the memory object in the cache. Declare a new field and fill it just after the cache is created, in the main code.

In the cache write method, most of the time you do not want to fire the update-through-to-memory call. That will be the default action ("don't write through" - it's easy to implement). Every time you do the "don't", you will have to set a mark in a boolean table of dirty cache lines instead. That table will show which cache lines are pending on a write-through to memory. I'd simply declare a boolean[] dirty array the same size as the number of cache lines, and not try for anything fancier. If the cache write method tries to overwrite a cache line that is marked as dirty in the table with a different cache line, then and only then do you have to use the memory update call to write through the old cache line to memory immediately. The new line is dirty too (since it has not yet been written through to memory), so the dirty annotation stays the way it is. (A hedged sketch of this scheme appears at the end of this section.)

When does this cache ever get cleaner? When it spontaneously writes a line through to memory on its own account, that's when. After that you can mark that cache line as clean in the dirty table. Every tick of the pipeline clock you want to look at the instructions that are in the pipeline (I'm talking about the Pipe.tick method code here) and see if none of them are load instructions. You'll have to examine the conf1, ..., conf4 holders' op fields to determine that. If the coast is indeed clear then you can put in a call to a new cache flush method at the front of the pipeline tick code. You know that you now have 5ns in which no load instruction will be executed, and you can manage a good two memory writes in that time if pushed. The new cache flush method should simply select one dirty cache line (if each line can be written in one memory write operation, as I had it originally; if not, think again!) and update it through to memory, marking the line as clean afterwards. We know that there would be time to get this done, so call Clock.reset after the cache.flush call you've put in at the front of the pipeline tick. The cache is notionally feeding the data through to memory asynchronously in the background, so it doesn't count as time taken during a pipeline clock tick. We're simply triggering the cache flush from there and the cache is doing it while the pipeline ticks on.

What about timing? With luck and crossed fingers, setting the CACHE_TYPE to 2 in the Globals class should already get the timing right for a write-back cache. But check by eye on the standard printout from each instruction's execution that store (write) instructions are now taking less time than they used to. If it doesn't happen, let me know and we'll figure out why.
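For reference, here is a hedged sketch of the write-back scheme just described, in terms of the original 4-byte cache lines (the memory.update32 call and the field names are assumptions suggested by the text, not the actual solution code):

    // Sketch: write-back variant of the cache write method, plus the background flush
    boolean[] dirty = new boolean[nlines];         // which cache lines still need writing back
    Memory memory;                                 // reference planted by the setup code in main()

    void write32be(int addr, int val)              // addr is the true addr divided by 4
    {
        int line_idx = addr % nlines;
        int line_tag = addr / nlines;

        if (tags[line_idx] != line_tag && dirty[line_idx]) {
            // evicting a dirty line: write the old line through to memory first
            int old_addr = tags[line_idx] * nlines + line_idx;      // reconstruct the cache-unit address
            memory.update32(old_addr * 4, Utility.grab32be(lines[line_idx], 0));
        }
        tags[line_idx] = line_tag;
        Utility.put32be(lines[line_idx], 0, val);
        dirty[line_idx] = true;                    // not yet written back to memory
    }

    void flush()                                   // called from the front of Pipe.tick when no load is pending
    {
        for (int i = 0; i < nlines; i++) {
            if (dirty[i]) {                        // write back just one dirty line per call
                int old_addr = tags[i] * nlines + i;
                memory.update32(old_addr * 4, Utility.grab32be(lines[i], 0));
                dirty[i] = false;
                return;
            }
        }
    }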