ECEn 324 Homework Set #4

Submit your (hardcopy) solutions to the problems below in the homework box by 5:00 PM on the assigned date.

  1. In this problem, you will explore some of the challenges of getting reliable performance measures for the loop-based vector code examples discussed in the book.
    1. Write a paragraph about the challenges that arise when using the time stamp counter in current x86 processors. Look at the "time stamp counter" Wikipedia entry to get some insight here. (Feel free to consult additional sources.) Be sure to say something about repeatability, and specific problems arising if the measurement interval is too short or too long.
    2. Analyze the source code in this tar file that is used to create a "getcpe" executable that can produce CPE measures for loop-based code. Write a paragraph describing how this code addresses the problems you identified above. Be sure to identify both strengths and weaknesses in the overall approach taken.
    3. Briefly describe the key parameters in this code that must be fine-tuned to get the most accurate results. For two of these parameters, devise and conduct a few experiments to see if the default parameter values work well on a system you have access to. On the basis of your results, suggest values for the two selected parameters that give the most reliable measurements.
    4. For a machine of your choice, use the getcpe tool to try to produce results along the lines of those reported in the book (Sections 5.3 through 5.6) for this set of functions: {combine1(), combine2(), combine3(), combine4()} and for these combinations of data type and operation: {long add, long multiply, double add, double multiply}. Compile for the x86-64 ISA (no "-m32" flag).
    5. Write a paragraph about how the results you obtained compare with those reported in the book. Highlight the major differences you found, and say something about what the causes might be.

  2. Chapter 5 presents a series of performance measurements showing the performance benefits of a sequence of optimizations to the original combine1() function. For this problem, you will repeat that sequence of optimizations starting with this function:
        void dotproduct1(vec_ptr u, vec_ptr v, data_t *dest)
        {
            long int i;
            *dest = 1.0;
            for (i = 0; i < vec_length(u); i++)
            {
                data_t val1;
                data_t val2;
                get_vec_element(u, i, &val1);
                get_vec_element(v, i, &val2);
                *dest = *dest + val1 * val2;
            }
        }
      
    Start with this tar file that includes dotproduct1() and a version of the getcpe timing code used in the previous homework set. Compiling and running the initial getcpe program should give you the CPE for the original code. Consistent with the treatment in the text, you should create a different version of the function for each of the six required optimizations listed below. Follow the naming conventions of the text -- dotproduct5() should be the version with 2x loop unrolling. For this assignment, you need only consider the case where data_t is a double. Your submission should include the C source code for all 6 new versions of the dotproduct function along with the reported CPE of each. (No other source code needs to be submitted.)
    After you obtain all 7 required CPE measurements, you should write a paragraph comparing your results with those in the book and, where possible, explaining the differences. Finally, state what can be inferred from your results about the functional units in the processor of the system you used. (Can you determine both latency and throughput bounds, for example?)

  3. You are to determine the branch misprediction penalty for at least one machine that you have access to. The technique employed is described on pages 208-209 of the text. The absdiff() and measurement code in this tar file will serve as a good starting point. First, write a short paragraph describing the technique used to measure the branch misprediction penalty. (Study the code and read the book.) Second, report the penalty you calculated and identify the platform you obtained it on. Finally, find a compiler option that generates code for the absdiff() function that uses a conditional move instruction rather than a branch. (Compile getbrpen.c with the -S option and examine the code generated for absdiff().) Report the optimization level you used and the results from running the brnchpen program with this version of absdiff(). How much does performance improve when the conditional move instruction is used?

  4. Problem 6.25 from the text

  5. Problem 6.27

  6. Problem 6.30

Clarifications

Problem 1, Part 1. The challenges of interest arise because of the characteristics of the system we're interested in measuring. (The challenge is not creating the assembly code to access the counter -- that is easy.)
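
To illustrate how little code the counter access itself requires, here is a minimal sketch. It assumes GCC or Clang on x86, where <x86intrin.h> provides the __rdtsc() intrinsic. The helper names read_cycles() and time_sum() and the non-x86 fallback are hypothetical and are not taken from the getcpe sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Hypothetical helper: return the current time stamp counter value. */
static uint64_t read_cycles(void)
{
    return __rdtsc();
}
#else
#include <time.h>
/* Assumed fallback for non-x86 builds: nanoseconds from a monotonic clock. */
static uint64_t read_cycles(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Sum n longs and return the number of counter ticks the loop took. */
static uint64_t time_sum(const long *a, long n, long *result)
{
    assert(a != NULL && result != NULL);
    uint64_t start = read_cycles();
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];
    uint64_t end = read_cycles();
    *result = sum;
    return end - start;
}
```

A single call to time_sum() yields one raw tick count. Whether that number can be trusted, and whether it is the same from run to run, is exactly the question Part 1 asks you to discuss.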

Problem 1, Part 2. Once you have the file cpecode.tar, place it in your directory of choice. From within that directory, typing "tar xvf *" will create a new "cpecode" subdirectory with all the files required to build the getcpe executable. Typing "make" within that subdirectory should produce an executable named getcpe. It should compile correctly under Linux or Mac OS X.

Problem 1, Part 3. To identify key parameters, start by considering which values control the number of iterations of important loops in the code.

Problem 1, Part 4. You'll be populating a 4x4 grid of CPE measurements. Do all your work on one machine, but you can use any machine you want that this code compiles and runs on -- it need not be a spice machine. Note that the book reports results on ints while you will be using longs. To get each desired data point, you'll need to edit one or more of these files: getcpe.c, vector.h, and the Makefile (to change GCC optimization levels). After your edits, you'll recompile and run the new version.
Depending on your test platform, you may encounter problems arising from the OS dynamically scaling the processor speed, a power-saving feature that kicks in when the CPU is lightly loaded. You can tell if this happens because the processor speed reported by the "getcpe" program will be a small fraction of what you would otherwise expect. The simplest solution is to make sure that a handful of other programs are running (actually consuming CPU resources) while you run "getcpe" and collect the measurements you report.

Problem 1, Part 5. Focus your attention on anything that you find surprising. (It would be surprising to me if you got exactly the results reported in the book.) Different processors and different versions of GCC can produce quite different results, and our method for measuring CPE is not the same as that used by the authors of our text.
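
For comparison, Section 5.2 of the book obtains CPE as the slope of a least squares line fitted to (number of elements, cycles) measurements. The sketch below shows that calculation under stated assumptions: the function name fit_cpe() and its array arguments are hypothetical, and nothing here is meant to describe what the getcpe code does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: least squares slope of cycles[] against n_elems[].
 * The slope is the cycles-per-element estimate; the intercept (loop
 * overhead) is not needed for CPE and is not computed. */
static double fit_cpe(const double *n_elems, const double *cycles, size_t count)
{
    assert(count >= 2);
    double sum_x = 0.0, sum_y = 0.0, sum_xx = 0.0, sum_xy = 0.0;
    for (size_t i = 0; i < count; i++) {
        sum_x += n_elems[i];
        sum_y += cycles[i];
        sum_xx += n_elems[i] * n_elems[i];
        sum_xy += n_elems[i] * cycles[i];
    }
    double denom = (double)count * sum_xx - sum_x * sum_x;
    assert(denom != 0.0); /* all n_elems equal: slope is undefined */
    return ((double)count * sum_xy - sum_x * sum_y) / denom;
}
```

As an illustration with made-up data, measurements that follow cycles = 80 + 4n exactly, such as (100, 480), (200, 880), and (300, 1280), produce a slope of 4.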

Problem 2: You may or may not see performance boosts at every step. Make a few runs for each data-point and discuss your results in your submission. Once you have the file dotprod.tar, place it in your directory of choice. From within that directory, typing "tar xvf *" will create a new "dotprod" subdirectory with all the files required to build the getcpe executable. Typing "make" within that subdirectory should produce an executable named getcpe. It should compile correctly under Linux or Mac OS X. The initial version will measure the performance of the dotproduct1(). As you add each new version of the function to getcpe.c, you'll need to change the measurement code (in measure() in getcpe.c) so that the new version of your function is called.

Problem 3: Copy the tar file to your working directory, type "tar xvf *" to create a "brnchpen" subdirectory, and then type "make" to produce a "brnchpen" executable. It should compile correctly under Linux or on a Mac. You are strongly encouraged to try this measurement on systems with different processors -- you might be surprised how different the results can be.
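
The arithmetic behind the penalty estimate can be sketched as follows, assuming the simple model in which a random branch pattern is mispredicted half the time. If t_ok is the average time per call with a predictable pattern and t_ran is the average with a random pattern, then t_ran = t_ok + 0.5 * penalty. The function name mispredict_penalty() is hypothetical and does not come from the provided code.

```c
/* Hypothetical helper: infer the misprediction penalty (in cycles) from
 * the per-call time with a predictable pattern (t_ok) and with a random
 * pattern (t_ran), assuming random branches are mispredicted 50% of the
 * time:  t_ran = t_ok + 0.5 * penalty. */
static double mispredict_penalty(double t_ok, double t_ran)
{
    return 2.0 * (t_ran - t_ok);
}
```

With illustrative (not measured) inputs of 13 cycles per call for the predictable case and 35 cycles for the random case, the function returns 44.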

Problems 4-6: No programming or source code is required for the problems assigned from the text.


Last updated 2 May 2013