Introduction to Programming with OpenMP
---------------------------------------

Practical Exercise 2 (Basics and Simple SIMD)
---------------------------------------------

You should normally start with the following commands:

    export OMP_NUM_THREADS=4
    export OMP_SCHEDULE=static

Fortran:

    gfortran -Wall -Wextra -std=f2003 -pedantic -O3 -fopenmp \
        -o -lblas .f90
    ./ Programs/matrices_f_10

C:

    gcc -Wall -Wextra -ansi -pedantic -O3 -fopenmp \
        -o -lblas .c
    ./ Programs/matrices_c_10

or:

    g++ -Wall -Wextra -ansi -pedantic -O3 -fopenmp \
        -o -lblas .cpp
    ./ Programs/matrices_c_10

When you have got the changes working, then increase the size of the
matrices to see how your tuning is going, by increasing the 10 at the
end of the argument to 1000. You can also use 100 as an intermediate
step, if it helps.

If you have more cores, you can run with more than 4 threads, and this
will help to expose any race conditions you have coded (i.e. increase
the 4 in the setting of OMP_NUM_THREADS) and make the timing effects
clearer; all of the examples should work with any number. Regrettably,
this will not show you parallelisation on the PWF, as they are all
single core systems.


Question 1
----------

In doing this question, declare every variable and array that is used
in any of the loops inside the DO/for directive blocks as either shared
or private. Declare all loop indices and temporary variables as
private, and the main arrays as shared. More details are given on this
in the next lecture.

1.1 Starting with Programs/Multiply.f90 or Programs/Multiply.c, add
directives to parallelise the outer loop in procedure Multiply, using a
combined parallel and DO/for directive. Make sure that you declare ALL
variables used as shared or private, even though the default ones do
what you want.

1.2 Change the program to use the transpose of the first matrix.
In Fortran, that needs only the addition of a TRANSPOSE call in the
MATMUL, the first argument of DGEMM changing to 't' and the sections
reversing in the DOT_PRODUCT call. In C, it needs only the index
expressions reversing in the innermost loop and the first argument of
DGEMM changing to 't'.

Notice that the output shows different values, and the coded multiply
now runs a lot faster. That is because it is more cache-friendly.

1.3 Change the program to split the directives into separate parallel
and DO/for ones. Put the former around the call to Multiply and leave
the latter where it is. Remember to declare all variables as shared or
private, even if the default is what you want, except that you will
have to leave the shared declarations off the DO/for directive (see the
next lecture for why).

You should notice no difference in the values or the times.

1.4 This example is to show some of the effects of including serial
code in parallel blocks, and why it is easier to use the combined
directives.

Change the program to move the parallel directive to outside the whole
timing code (i.e. from just before the first call to Times to just
after the last call to Check). Don't bother with declaring anything as
shared, as that is the default here.

Doing that may well go horribly wrong, and is always incorrect. Note
that, even when this example appears to work, an excessive number of
lines of output appear.