ECEn 424 Lab #7: The Cache Lab
Click here for a pdf file describing this lab.
Click here for the tar file for this lab.
Notes:
Updates and tips will be posted here as needed.
Some of you have discovered that valgrind is not currently installed
on the spice machines. I've installed the latest version in
/ee2/ee424/valgrind/bin, which should always be mounted on the spice
machines. Either type the full path name or define a convenient alias
in your .bashrc file. Sorry for the added complication...
To add valgrind to your system PATH (i.e. make it so you don't need
to type the entire file path each time you run it), do the following:
Open up ~/.bashrc (or create it if it doesn't exist) and add the
following line:

    export PATH=$PATH:/ee2/ee424/valgrind/bin

Save and close the file. Then open up a new terminal, or re-source
~/.bashrc in the current terminal like so:

    . ~/.bashrc

Now any call to the command valgrind should work.
Here are some thoughts on Part A:
- The learning curve will feel steep, but the getopt() function
will simplify the task of reading in command-line options. Man pages
aren't always the easiest things to digest, but the man page for
getopt() includes a worked example that shows how the function is
meant to be used. (A short getopt() sketch also appears after this
list.)
- Reading the valgrind trace files is one of those programming
tasks that looks straightforward but is surprisingly hard to get
just right. You can probably figure out how to do it with
fscanf. An alternative is to read each input line into a char
array using fgets, then test the initial character values
directly. You can also use strtol() to convert a numeric string
(even a hex one) into a long or an int. (A sketch of this approach
also appears after this list.)
- As you think about how to model a cache, consider defining
a struct that includes everything you need to model a block entry in
a cache, then using malloc() to allocate an array of these structs
that is just the right size (based on the values of command-line
arguments). Note that this struct need not include the actual data
in the cache block. This is perhaps counter-intuitive, but in this
type of cache simulation, we're just watching blocks come and go
without any idea of the actual block contents, since data values are
not recorded in the address trace that Valgrind produces. We are
concerned only with counting hits and misses (and hence tracking
which blocks are in the cache at any instant), so each entry needs
only a valid bit, a tag, and a reference field (to implement an LRU
replacement policy). One possible struct is sketched after this list.
- Any technique you use that allows you to determine LRU
entries will be acceptable -- you don't need to reflect how hardware
would actually implement it. One simple way to do it (that works in
software but is not well suited for hardware) is to have a reference
field associated with each entry that is updated with the number of
the current memory reference (from an internal counter updated each
time you advance to the next entry in the trace). Then when you
need to find the LRU entry in a set, you march through them all and
find the entry with the smallest (and hence oldest) value in the
reference field. A sketch of that search also appears after this list.
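Here is a minimal sketch of the getopt() loop mentioned in the first
bullet above. It assumes the option letters described in the handout
(-s, -E, -b, and -t, plus -v for verbose output); adjust the option
string and the usage message to match whatever your simulator
actually accepts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>     /* getopt(), optarg */

    int main(int argc, char *argv[])
    {
        int s = 0, E = 0, b = 0, verbose = 0;
        char *trace_file = NULL;
        int opt;

        /* A ':' after a letter means that option takes an argument,
           which getopt() hands back through the global optarg. */
        while ((opt = getopt(argc, argv, "vs:E:b:t:")) != -1) {
            switch (opt) {
            case 'v': verbose = 1;          break;
            case 's': s = atoi(optarg);     break;
            case 'E': E = atoi(optarg);     break;
            case 'b': b = atoi(optarg);     break;
            case 't': trace_file = optarg;  break;
            default:
                fprintf(stderr,
                        "usage: %s [-v] -s <s> -E <E> -b <b> -t <tracefile>\n",
                        argv[0]);
                exit(EXIT_FAILURE);
            }
        }

        printf("s=%d E=%d b=%d verbose=%d trace=%s\n",
               s, E, b, verbose, trace_file ? trace_file : "(none)");
        return 0;
    }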
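Next, a sketch of the fgets()/strtoul() approach to reading a trace
(strtoul() is the unsigned cousin of strtol(), which sidesteps sign
trouble with large addresses). It assumes data lines look like
" L 10,4", " S 18,4", or " M 20,4" -- a leading space, an operation
letter, a hex address, a comma, and a size -- with instruction lines
starting with 'I' in the first column; look at a real trace before
trusting those offsets. The file name trace.txt is only a placeholder.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("trace.txt", "r");   /* placeholder file name */
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }

        char line[80];
        while (fgets(line, sizeof line, fp) != NULL) {
            if (line[0] == 'I')        /* instruction fetch: ignore */
                continue;

            char op = line[1];         /* 'L', 'S', or 'M' */
            char *end;
            unsigned long addr = strtoul(line + 3, &end, 16);  /* hex */
            int size = atoi(end + 1);  /* field after the comma */

            printf("op=%c addr=%lx size=%d\n", op, addr, size);
            /* hit/miss bookkeeping would go here */
        }

        fclose(fp);
        return 0;
    }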
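For the cache model itself, one possibility along the lines of the
struct bullet above looks like this. The names are purely
illustrative, and calloc() is used instead of malloc() only so that
every valid bit starts out zeroed; malloc() plus an explicit
initialization loop works just as well.

    #include <stdlib.h>

    /* One cache line: bookkeeping only, no data payload needed. */
    typedef struct {
        int           valid;      /* does this line hold a block?   */
        unsigned long tag;        /* tag bits of the block it holds */
        unsigned long last_used;  /* reference count, for LRU       */
    } cache_line_t;

    /* Allocate all S*E lines as one flat array, sized from the
       command-line arguments (S = 2^s sets, E lines per set). */
    cache_line_t *make_cache(int s, int E)
    {
        size_t S = (size_t)1 << s;
        return calloc(S * (size_t)E, sizeof(cache_line_t));
    }

With this layout, line j of set i lives at cache[i * E + j].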
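Finally, a sketch of the LRU search from the last bullet, continuing
the same file as the struct sketch above. It assumes you keep a
running counter that is bumped on every trace line and copied into
last_used whenever a line is hit or filled.

    /* Return the index (within one set of E lines) of the LRU line. */
    int find_lru_victim(const cache_line_t *set, int E)
    {
        int victim = 0;
        for (int j = 1; j < E; j++) {
            if (set[j].last_used < set[victim].last_used)
                victim = j;   /* smaller count = touched longer ago */
        }
        return victim;
    }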
Thoughts on Part B:
- An observation: each 32x32 matrix holds 1,024 four-byte ints, so
it is 4 KB in size and occupies 128 cache blocks. (Remember, for this
part, the cache is direct-mapped, with 32 sets and 32-byte blocks:
1 KB in all.) Each 64x64 matrix is 16 KB in size and occupies 512
cache blocks. For each run of the
transpose function, the very best we can possibly do is to
experience a compulsory miss on every cache block (when it is
initially loaded into the cache). For the 32x32 case, that will be
256 compulsory misses (128 for A and 128 for B), and a solution cannot
exceed 300 for full points. For the 64x64 case, there will be 1024
compulsory misses, and we are allowed just 1300 for full credit. To
hit numbers like these, you have to read or write the entire cache
block before it gets kicked out by a conflict miss in the
direct-mapped cache. This is the key to any solution we can come up
with.
- The rules for Part B prevent us from having more than 12 local
variables, packing multiple temporary values into a local variable,
or defining any other arrays in our code. (These restrictions force
us to think long and hard about the access patterns to the source
and destination arrays.)
- How could it help to have more local variables? They can hold
values that we've read from the source until we're ready to write
them to the destination. We might want to wait because we want to
fill up one dest cache block before loading another that maps to the
same slot in the cache. Draw a picture and convince yourself of this:
entries in the 64x64 matrix that are four rows apart map to the same
set in the cache (each row is 64 four-byte ints, or 256 bytes, so
four rows span exactly the 1 KB of the cache). This is a tough
limitation to work around.
- The real challenge here is to find a way to access both src and
dst cache blocks before they get kicked out by subsequent
accesses. Consider this idea: instead of sweeping through the src
matrix just one row at a time, suppose we do 8 rows at a time. Note
that 8 adjacent entries in a column of src become 8 adjacent entries
in a row of dst, and that row segment occupies a single cache block.
This approach is good enough to get full credit for 32x32 arrays
(provided you do all 8 reads in a column into local variables, then
write all 8 out to the dst array; a code sketch appears after this
list). In effect, this is a "blocking" optimization, in which we're
marching through memory an 8x8 block at a time. However, it does
not work for 64x64 because rows 4 apart kick each other out of the
cache. It makes sense to try blocking with 4x4 blocks, but it is
hard to structure the accesses in such a way that you have just one
miss per src and dst block (for the typical case).
- One approach that appears promising for the 64x64 case is
hinted at in the instructions: "You may, however, do whatever you
want with the contents of array B." Thus, it is possible to press
not-yet-written parts of array B into service as temporary
scratch-pad storage. I think this can be made to work, but make sure
that the cache blocks of B you use this way don't evict anything else
important, and that they remain in the cache until their final values
are written. (Our budget of misses does not allow two misses per
block.)
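Here is a sketch of the 8x8 blocking idea for the 32x32 case, written
against a transpose signature of the form used in the handout,
void trans(int M, int N, int A[N][M], int B[M][N]). The three loop
indices plus the eight temporaries come to 11 locals, within the
12-variable limit. Treat it as a starting point only; as discussed
above, it will not meet the 64x64 target because rows four apart
evict each other.

    /* 8x8 blocking sketch; assumes 8 divides both N and M. */
    void trans_8x8_sketch(int M, int N, int A[N][M], int B[M][N])
    {
        int i, j, k;
        int t0, t1, t2, t3, t4, t5, t6, t7;

        for (i = 0; i < N; i += 8) {           /* block row of A    */
            for (j = 0; j < M; j += 8) {       /* block column of A */
                for (k = j; k < j + 8; k++) {
                    /* Read one column of the 8x8 block of A... */
                    t0 = A[i][k];   t1 = A[i+1][k];
                    t2 = A[i+2][k]; t3 = A[i+3][k];
                    t4 = A[i+4][k]; t5 = A[i+5][k];
                    t6 = A[i+6][k]; t7 = A[i+7][k];
                    /* ...then write it as one row of B, which is
                       exactly one 32-byte cache block. */
                    B[k][i]   = t0; B[k][i+1] = t1;
                    B[k][i+2] = t2; B[k][i+3] = t3;
                    B[k][i+4] = t4; B[k][i+5] = t5;
                    B[k][i+6] = t6; B[k][i+7] = t7;
                }
            }
        }
    }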
Last updated 28 March 2021