The Map-Reduce Pattern
The figure above illustrates the essence of the map-reduce pattern. A Map operation is applied to each of the data elements D1 - D6, resulting in a new array D'1 - D'6 whose elements are combined into a scalar value Res through a reduction operator. There is thus a Map operator which operates on the original array elements and a Reduction operator which combines them.
Below we have a sequential version in C++ for the dot-product operation which follows the map-reduce pattern.
Here the Map is the pair-wise multiplication of the array elements and the Reduction is the calculation of the sum of all these multiplications.
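A minimal sketch of what such a sequential version could look like (the function name `dot_product` and the use of `double` vectors are my assumptions, not necessarily the original code):

```cpp
#include <cstddef>
#include <vector>

// Sequential dot product following the map-reduce pattern:
// the map is the pair-wise multiplication a[i] * b[i],
// the reduction is the summation of all the products.
double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;  // reduction variable
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];  // map and reduce fused into one statement
    }
    return sum;
}
```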
Just like the map pattern had a C++ algorithm, there is a standard algorithm for map-reduce as well. It’s aptly called transform_reduce and is invoked like this (for the same example as above):
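A sketch of such an invocation (the vector names and the wrapper function are my assumptions):

```cpp
#include <numeric>  // std::transform_reduce (C++17)
#include <vector>

double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(
        a.begin(), a.end(),   // start and end of a
        b.begin(),            // start of b (must be the same size as a)
        0.0,                  // initial value of the reduction variable
        [](double x, double y) { return x + y; },   // reduction operator
        [](double x, double y) { return x * y; });  // map operator
}
```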
The first argument is the start and end of the vector a. The second argument is the start of vector b, which must be of the same size as a. The third argument is the initial value of the reduction variable, the sum in this case. Then come the operators. The first operator specified is the reduction operator, which in this case is a lambda function returning the sum of two values, and finally the last argument specifies the map operator, which is the multiplication of two numbers.
The Map operator can be essentially anything, but there are some requirements on the Reduction operator if you want the code to be parallel. In practice, the reduction operator has to be associative and commutative. This rules out subtraction as a reduction operator, since it is neither. But most things you want to do, like addition, multiplication, max/min value etc., fulfil these criteria and can be used to make a map-reduce in parallel.
Parallel C++ standard template library
Parallel? We haven’t seen that yet, so here we go with the C++ standard template library parallel executor in practice:
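A sketch of the parallel invocation (same assumed names as before; note that with GCC you typically need C++17 and to link against TBB for the parallel execution policy to actually run in parallel):

```cpp
#include <execution>  // std::execution::par_unseq
#include <numeric>
#include <vector>

double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(
        std::execution::par_unseq,  // the only difference from the sequential call
        a.begin(), a.end(),
        b.begin(),
        0.0,
        [](double x, double y) { return x + y; },   // reduction operator
        [](double x, double y) { return x * y; });  // map operator
}
```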
Can you spot the difference?
The execution policy execution::par_unseq specifies that the implementation of transform_reduce may execute in parallel (the par part) and that it might use vector instructions (the unseq part). Other than that, no extra work is needed. Obviously, if your lambdas happen to use references to variables that are also updated in them, you are in big trouble. Never do that!
We’ll investigate the performance issues later.
OpenMP
The next version is an almost equally simple extension of the sequential version we had at the top:
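A sketch of the OpenMP version (assumed names; compile with -fopenmp):

```cpp
#include <vector>

double dot_product(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    // The only addition compared to the sequential version: the directive
    // below parallelizes the loop and combines the per-thread sums.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < static_cast<long>(a.size()); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```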
Again, just one line added to make the code parallel when using OpenMP. It’s a wonderful world! If you remember from my first post on patterns, the OpenMP parallel programming API works mostly with compiler directives. #pragma omp signifies an OpenMP directive. The parallel directive denotes that the coming statement should be executed in parallel by a team of threads, and the for directive is a work-sharing construct making the different threads execute parts of the following loop in parallel. What’s new here is the clause reduction(+:sum). It specifies that a reduction operation (+) is to be made on the variable sum by the different threads. Each thread will do a sequential reduction on its part of the loop and, when done, will update the global variable sum.
There is an alternative way of doing it with task parallelism, but that is a topic for another post.
Threading building blocks: TBB
The Intel threading building blocks, or TBB for short, is a wonderful C++ library for writing composable parallel programs, something that neither OpenMP nor the STL algorithms can achieve. I hope to get back to the composability aspect in a later post. TBB has functions for parallel for and reduce, just like the C++ standard library.
At first sight, it does not look as neat and tidy as the STL or OpenMP versions, but that’s a price which might be well worth paying for composability1.
Let’s unpack it a little. The tbb::parallel_reduce function takes, in this version, four arguments. The first is a blocked_range, which essentially is an integer range that the TBB runtime can slice and hand out to different threads to do some work on. The second argument is the starting value of the reduction variable, 0 in this case. The third argument is the map function. In my first post on patterns, you could see that the TBB parallel_for could directly take the start and end indices, and the TBB runtime then passed one index at a time to the map function. The parallel_reduce does not (yet) have this, so we need to pass a range to the map function instead of a single index to work on. There are some advantages to this in terms of performance. For instance, a good compiler may be able to vectorize the inner loop, boosting performance even further. Finally, the fourth argument is the reduction function, which is quite similar to what you have seen above in the standard library variants.
Naked threads
Let’s go to the final and, in my opinion, ugliest parallel map-reduce, which uses naked threads.
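A sketch of such a naked-threads version (the thread count, names, and chunking scheme are my assumptions):

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <thread>
#include <vector>

double dot_product(const std::vector<double>& a, const std::vector<double>& b,
                   unsigned num_threads = 4) {
    std::vector<double> results(num_threads, 0.0);  // one partial sum per thread
    std::vector<std::thread> threads;
    const std::size_t chunk = (a.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            const std::size_t begin = t * chunk;
            const std::size_t end = std::min(a.size(), begin + chunk);
            double local = 0.0;
            for (std::size_t i = begin; i < end; ++i)
                local += a[i] * b[i];
            results[t] = local;  // each thread writes only its own slot
        });
    }
    double sum = 0.0;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads[t].join();  // wait for thread t to finish...
        sum += results[t];  // ...then fold its partial sum into the total
    }
    return sum;
}
```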
It’s similar to the threads version of the map algorithm. What is different is that we keep a vector of results, one for each thread, which each thread updates when it is done with its part of the arrays. Then, as each thread finishes and joins the creating thread, its partial result is folded into the global reduction value.
Again you can see that the same simple algorithm can be expressed in many different ways. Which one you choose surely depends on what the rest of the code looks like and how it is called.
Performance
So how do these versions compare in performance?
| Variant | Time (s) | Speedup |
|---|---|---|
| Sequential | 0.144 | |
| Threads | 0.031 | 4.61 |
| OpenMP | 0.022 | 6.65 |
| TBB | 0.022 | 6.65 |
| stl seq | 0.144 | |
| stl par | 0.021 | 6.75 |
The table above is similar to the one I had in the original post on parallel map. The main difference is that TBB now has performance on par with the best, and that the threads version instead performs significantly worse. I have yet to analyse why the threads version does not do better. I first thought it was because of false sharing in the results vector leading to cache invalidations, but that should be insignificant given the large number of elements in the arrays, and a simple experiment revealed that this was not the issue either. A topic for a future post to investigate.
Oh, by the way, the code is available here.
By composability I mean that a parallel function or piece of code can in turn call other parallel functions without needing to consider this fact. For instance, if a function creates four threads that in turn call a function that creates four threads, we are quickly up to 16 total threads and the program may not behave as you expected. A parallel programming model which is composable works by exposing parallel activity, not how it’s mapped to processor cores. TBB is composable; naked threads and OpenMP are not. ↩︎