ECEn 424 Lab #7: The Cache Lab
Click here for a pdf file describing this lab.
Click here for the tar file for this lab.
Notes:
Updates and tips will be posted here as needed.
Some of you have discovered that valgrind is not currently installed
on the spice machines. I've installed the latest version in
/ee2/ee424/valgrind/bin, which should always be mounted on the spice
machines. Either type the full path name or define a convenient alias
in your .bashrc file. Sorry for the added complication...
To add valgrind to your system PATH (i.e. make it so you don't need
to type the entire file path each time you run it), do the following:
Open up ~/.bashrc (or create it if it doesn't exist) and add the
following line:

    export PATH=$PATH:/ee2/ee424/valgrind/bin

Save and close the file. Then open up a new terminal, or re-source
~/.bashrc in the current terminal like so:

    . ~/.bashrc

Now any call to the command valgrind should work.
Here are some thoughts on Part A:
- The learning curve will feel steep, but the getopt() function
will simplify the task of reading in command-line options. Man pages
aren't always the easiest things to digest, but the man page for
getopt() includes a worked example that shows how the function is
meant to be used. (A short getopt() sketch also appears after this
list.)
- Reading the valgrind trace files is one of those programming
tasks that looks straightforward but is surprisingly hard to get
just right. You can probably figure out how to do it with
fscanf. An alternative is to read each input line into a char
array using fgets, then test the initial character values
directly. You can also use strtol() to convert a numeric string
(even a hex one) into a long or an int. (A sketch of this approach
also appears after this list.)
- As you think about how to model a cache, consider defining
a struct that includes everything you need to model a block entry in
a cache, then using malloc() to allocate an array of these structs
that is just the right size (based on the values of command-line
arguments). Note that this struct need not include the actual data
in the cache block. This is perhaps counter-intuitive, but in this
type of cache simulation, we're just watching blocks come and go
without any idea of the actual block contents, since data values are
not recorded in the address trace that Valgrind produces. We are
concerned only with counting hits and misses (and hence tracking
which blocks are in the cache at any instant), so each entry needs
only a valid bit, a tag, and a reference field (to implement an LRU
replacement policy). One possible struct is sketched after this list.
- Any technique you use that allows you to determine LRU
entries will be acceptable -- you don't need to reflect how hardware
would actually implement it. One simple way to do it (that works in
software but is not well suited for hardware) is to have a reference
field associated with each entry that is updated with the number of
the current memory reference (from an internal counter updated each
time you advance to the next entry in the trace). Then when you
need to find the LRU entry in a set, you march through them all and
find the entry with the smallest (and hence oldest) value in the
reference field. A sketch of that search also appears after this list.
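Here is a minimal sketch of the getopt() loop mentioned in the first
bullet above. It assumes the option letters described in the handout
(-s, -E, -b, and -t, plus -v for verbose output); adjust the option
string and the usage message to match whatever your simulator
actually accepts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>     /* getopt(), optarg */

    int main(int argc, char *argv[])
    {
        int s = 0, E = 0, b = 0, verbose = 0;
        char *trace_file = NULL;
        int opt;

        /* A ':' after a letter means that option takes an argument,
           which getopt() hands back through the global optarg. */
        while ((opt = getopt(argc, argv, "vs:E:b:t:")) != -1) {
            switch (opt) {
            case 'v': verbose = 1;          break;
            case 's': s = atoi(optarg);     break;
            case 'E': E = atoi(optarg);     break;
            case 'b': b = atoi(optarg);     break;
            case 't': trace_file = optarg;  break;
            default:
                fprintf(stderr,
                        "usage: %s [-v] -s <s> -E <E> -b <b> -t <tracefile>\n",
                        argv[0]);
                exit(EXIT_FAILURE);
            }
        }

        printf("s=%d E=%d b=%d verbose=%d trace=%s\n",
               s, E, b, verbose, trace_file ? trace_file : "(none)");
        return 0;
    }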
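Next, a sketch of the fgets()/strtoul() approach to reading a trace
(strtoul() is the unsigned cousin of strtol(), which sidesteps sign
trouble with large addresses). It assumes data lines look like
" L 10,4", " S 18,4", or " M 20,4" -- a leading space, an operation
letter, a hex address, a comma, and a size -- with instruction lines
starting with 'I' in the first column; look at a real trace before
trusting those offsets. The file name trace.txt is only a placeholder.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE *fp = fopen("trace.txt", "r");   /* placeholder file name */
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }

        char line[80];
        while (fgets(line, sizeof line, fp) != NULL) {
            if (line[0] == 'I')        /* instruction fetch: ignore */
                continue;

            char op = line[1];         /* 'L', 'S', or 'M' */
            char *end;
            unsigned long addr = strtoul(line + 3, &end, 16);  /* hex */
            int size = atoi(end + 1);  /* field after the comma */

            printf("op=%c addr=%lx size=%d\n", op, addr, size);
            /* hit/miss bookkeeping would go here */
        }

        fclose(fp);
        return 0;
    }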
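For the cache model itself, one possibility along the lines of the
struct bullet above looks like this. The names are purely
illustrative, and calloc() is used instead of malloc() only so that
every valid bit starts out zeroed; malloc() plus an explicit
initialization loop works just as well.

    #include <stdlib.h>

    /* One cache line: bookkeeping only, no data payload needed. */
    typedef struct {
        int           valid;      /* does this line hold a block?   */
        unsigned long tag;        /* tag bits of the block it holds */
        unsigned long last_used;  /* reference count, for LRU       */
    } cache_line_t;

    /* Allocate all S*E lines as one flat array, sized from the
       command-line arguments (S = 2^s sets, E lines per set). */
    cache_line_t *make_cache(int s, int E)
    {
        size_t S = (size_t)1 << s;
        return calloc(S * (size_t)E, sizeof(cache_line_t));
    }

With this layout, line j of set i lives at cache[i * E + j].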
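Finally, a sketch of the LRU search from the last bullet, continuing
the same file as the struct sketch above. It assumes you keep a
running counter that is bumped on every trace line and copied into
last_used whenever a line is hit or filled.

    /* Return the index (within one set of E lines) of the LRU line. */
    int find_lru_victim(const cache_line_t *set, int E)
    {
        int victim = 0;
        for (int j = 1; j < E; j++) {
            if (set[j].last_used < set[victim].last_used)
                victim = j;   /* smaller count = touched longer ago */
        }
        return victim;
    }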
Thoughts on Part B:
- An observation: each 32x32 matrix holds 1,024 four-byte ints, so
it is 4 KB in size and occupies 128 cache blocks. (Remember, for this
part, the cache is direct-mapped, with 32 sets and 32-byte blocks:
1 KB in all.) Each 64x64 matrix is 16 KB in size and occupies 512
cache blocks. For each run of the
transpose function, the very best we can possibly do is to
experience a compulsory miss on every cache block (when it is
initially loaded into the cache). For the 32x32 case, that will be
256 compulsory misses (128 for A and 128 for B), and a solution cannot
exceed 300 for full points. For the 64x64 case, there will be 1024
compulsory misses, and we are allowed just 1300 for full credit. To
hit numbers like these, you have to read or write the entire cache
block before it gets kicked out by a conflict miss in the
direct-mapped cache. This is the key to any solution we can come up
with.
- The rules for Part B prevent us from having more than 12 local
variables, packing multiple temporary values into a local variable,
or defining any other arrays in our code. (These restrictions force
us to think long and hard about the access patterns to the source
and destination arrays.)
- How could it help to have more local variables? They can hold
values that we've read from the source until we're ready to write
them to the destination. We might want to wait because we want to
fill up one dest cache block before loading another that maps to the
same slot in the cache. Draw a picture and convince yourself of this:
entries in the 64x64 matrix that are four rows apart map to the same
set in the cache (each row is 64 four-byte ints, or 256 bytes, so
four rows span exactly the 1 KB of the cache). This is a tough
limitation to work around.
- The real challenge here is to find a way to access both src and
dst cache blocks before they get kicked out by subsequent
accesses. Consider this idea: instead of sweeping through the src
matrix just one row at a time, suppose we do 8 rows at a time. Note
that 8 adjacent entries in a column of src become 8 adjacent entries
in a row of dst, and that row segment occupies a single cache block.
This approach is good enough to get full credit for 32x32 arrays
(provided you do all 8 reads in a column into local variables, then
write all 8 out to the dst array; a code sketch appears after this
list). In effect, this is a "blocking" optimization, in which we're
marching through memory an 8x8 block at a time. However, it does
not work for 64x64 because rows 4 apart kick each other out of the
cache. It makes sense to try blocking with 4x4 blocks, but it is
hard to structure the accesses in such a way that you have just one
miss per src and dst block (for the typical case).
- One approach that appears promising for the 64x64 case is
hinted at in the instructions: "You may, however, do whatever you
want with the contents of array B." Thus, it is possible to press
not-yet-written parts of array B into service as temporary
scratch-pad storage. I think this can be made to work, but make sure
that the cache blocks of B you use this way don't evict anything else
important, and that they remain in the cache until their final values
are written. (Our budget of misses does not allow two misses per
block.)
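Here is a sketch of the 8x8 blocking idea for the 32x32 case, written
against a transpose signature of the form used in the handout,
void trans(int M, int N, int A[N][M], int B[M][N]). The three loop
indices plus the eight temporaries come to 11 locals, within the
12-variable limit. Treat it as a starting point only; as discussed
above, it will not meet the 64x64 target because rows four apart
evict each other.

    /* 8x8 blocking sketch; assumes 8 divides both N and M. */
    void trans_8x8_sketch(int M, int N, int A[N][M], int B[M][N])
    {
        int i, j, k;
        int t0, t1, t2, t3, t4, t5, t6, t7;

        for (i = 0; i < N; i += 8) {           /* block row of A    */
            for (j = 0; j < M; j += 8) {       /* block column of A */
                for (k = j; k < j + 8; k++) {
                    /* Read one column of the 8x8 block of A... */
                    t0 = A[i][k];   t1 = A[i+1][k];
                    t2 = A[i+2][k]; t3 = A[i+3][k];
                    t4 = A[i+4][k]; t5 = A[i+5][k];
                    t6 = A[i+6][k]; t7 = A[i+7][k];
                    /* ...then write it as one row of B, which is
                       exactly one 32-byte cache block. */
                    B[k][i]   = t0; B[k][i+1] = t1;
                    B[k][i+2] = t2; B[k][i+3] = t3;
                    B[k][i+4] = t4; B[k][i+5] = t5;
                    B[k][i+6] = t6; B[k][i+7] = t7;
                }
            }
        }
    }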
Last updated 28 March 2021