ECEn 424 Lab #6

ECEn 424 Lab #6 The Performance Lab: Code Optimization

Click here for a pdf file describing this lab.

Click here for the tar file for this lab.

Notes:

Updates and tips will be posted here as needed.

It is challenging to achieve the speedup thresholds required for full credit. We will be quite reasonable with partial credit, but do your best to achieve the full speedup -- it is certainly possible! If you struggle to get the full speedup, consider the hints below.

As you optimize your code, you must think carefully about how the code in your inner loops is accessing memory, and the locality of those memory accesses. Optimizations that you should consider include loop unrolling, interchanging inner and outer loops, using pointers rather than array indices, and blocking (see the blue box on page 647).

You will come across some inconsistencies in the CPE reported across multiple runs of the same code. These arise as a function of system load (what other jobs are running), the order in which the various versions of your functions are called, whether or not GCC inlines some of the functions, etc. Don't worry too much about these inconsistencies; there is not much we can do about them anyway. Your assignment is to achieve the specified speedup with respect to the original CPE value (measured repeatedly on spice machines), which is hardwired into the driver code. To be safe, you should try to obtain a speedup that is, say, 0.2 higher than the specified threshold for full credit. The variation between different runs of the same code is typically well within this margin.

At the outset, you might want to try a few simple experiments that will help you understand some of the performance effects of inlining functions. Try creating your own smooth function that is an exact copy of the original. Even if you don't call the function, the performance for the naive solution may fall by a significant margin. What is going on? If you look through the .s file, you can see that the compiler typically inlines the function calls to avg and other helper functions that are called just once in the code. When you create another function that also calls avg, the compiler appears to be making different decisions about which calls to inline, resulting in substantially different performance on versions of smooth that you didn't even modify. Things you can do that affect this (for any given version of GCC) include using the "inline" keyword to suggest to the compiler that it inline a function, create a new copy of each helper function with a different name, using a macro instead of the function, and completely eliminating function calls. One lesson to learn here is that optimizing around function calls is always a bit problematic.

Hints

Over time, the compiler gets better and better, improving on the CPE we get with the naive (original) code. Overall, this is undoubtedly good news, but it does make it harder to transform the code to get any sort of meaningful speedup.

For rotate, it is important to recognize the poor locality in the access pattern to the dst array in the original code. Think about ways to unroll the loop so that additional elements of dst are processed in the same pass. Consider ways of eliminating unnecessary work done by the macro used to access array elements in the unrolled loop. Consider transforming the code to use pointers, and consider the use of cache blocking.

For smooth, think about all the computation, conditionals, and memory accesses wrapped up in those calls to avg() in the original code. Note that required pixels are read three times in one row pass (to write all the pixel values in one row). Consider ways of transforming the code to reduce the number of reads to the src array per row pass. (For example, could you store values read from the array in local variables for use over multiple column steps?) Consider ways of transforming the code to reduce the overhead of conditionals. (Could you handle pixels on the perimeter as special cases, so the code can go at full speed on interior pixels?) As the size of your code increases (and it can get quite lengthy), it is highly recommended that you create useful macros along the way that keep your code very readable.

Partial Credit

Rotate score

For a given speedup of s, if...

s <= 1.0
score = 0

1.0 < s <= 1.5
score = 40.0 * ((s-1.0)/(1.5-1.0))

1.5 < s <= 1.9
score = 40.0 + 10.0 * ((s-1.5) / (1.9-1.5));

s > 1.9
score = 50

Smooth score

For a given speedup of s, if...

s <= 1.0
score = 0

1.0 < s <= 3.2
score = 40.0 * ((s-1.0)/(3.2-1.0))

3.2 < s <= 3.9
score = 40.0 + 10.0 * ((s-3.2) / (3.9-3.2));

s > 3.9
score = 50

Last updated 8 March 2021