The Map Reduce pattern

How it all comes together

The Map-Reduce Pattern

```mermaid
graph TD;
  D1 --> |Map| D'1;
  D2 --> |Map| D'2;
  D3 --> |Map| D'3;
  D4 --> |Map| D'4;
  D5 --> |Map| D'5;
  D6 --> |Map| D'6;
  D'1 --> |Red| Res;
  D'2 --> |Red| Res;
  D'3 --> |Red| Res;
  D'4 --> |Red| Res;
  D'5 --> |Red| Res;
  D'6 --> |Red| Res;
```

The figure above illustrates the essence of the map-reduce pattern. A Map operation is applied to each of the data elements D1 - D6, resulting in a new array D'1 - D'6, whose elements are combined into a scalar value Res through a reduction operator. There is thus a Map operator which operates on the original array elements and a Reduction operator to combine them.

Below we have a sequential version in C++ for the dot-product operation which follows the map-reduce pattern.
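A minimal sketch of what such a version can look like is shown below (the vectors a and b of doubles, and the function name dot_product, are assumptions for illustration, not the original listing):

```cpp
#include <cstddef>
#include <vector>

// Sequential dot product: the map (pair-wise multiply) is fused with
// the reduce (summation) in a single loop.
double dot_product(const std::vector<double>& a, const std::vector<double>& b)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];   // map: a[i]*b[i], reduce: += into sum
    }
    return sum;
}
```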

Here the Map is the pair-wise multiplication of the array elements and the Reduction is the calculation of the sum of all these multiplications.

Just like the map pattern had a C++ algorithm, there is a standard algorithm for map-reduce as well. It’s aptly called transform_reduce and is invoked like this (for the same example as above):
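A sketch of that call, under the same assumptions as above (two vectors a and b of doubles), could be:

```cpp
#include <numeric>   // std::transform_reduce

double sum = std::transform_reduce(
    a.begin(), a.end(),                          // range of the first vector
    b.begin(),                                   // start of the second vector
    0.0,                                         // initial value of the reduction
    [](double x, double y) { return x + y; },    // reduction operator
    [](double x, double y) { return x * y; });   // map operator
```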

The first argument is the start and end of the vector a. The second argument is the start of vector b, which must be of the same size as a. The third argument is the initial value of the reduction variable, the sum in this case. Then come the operators. The first operator specified is the reduction operator, which in this case is a lambda function returning the sum of two values, and finally the last argument specifies the map operator, which is the multiplication of two numbers.

The Map operator can be essentially anything, but there are some requirements on the Reduction operator if you want the code to be parallel. In practice, the reduction operator has to be associative and commutative. This rules out subtraction as a reduction operator, since it is neither (for example, (8 - 4) - 2 = 2 while 8 - (4 - 2) = 6). But most things you want to do, like addition, multiplication, max/min value etc., fulfil these criteria and can be used to make a map-reduce in parallel.

Parallel

C++ standard template library

Parallel? We haven’t seen that yet, so here we go with the C++ standard template library parallel execution policies in practice:
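Roughly like this (again a sketch, with the same assumed vectors a and b as above):

```cpp
#include <execution>   // std::execution::par_unseq
#include <numeric>     // std::transform_reduce

double sum = std::transform_reduce(
    std::execution::par_unseq,                   // may run in parallel and use vector instructions
    a.begin(), a.end(),
    b.begin(),
    0.0,
    [](double x, double y) { return x + y; },    // reduction operator
    [](double x, double y) { return x * y; });   // map operator
```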

Can you spot the difference?

The execution policy execution::par_unseq specifies that the implementation of transform_reduce may execute in parallel (the par part) and that it might use vector instructions (the unseq part). Other than that, no extra work is needed. Obviously, if your lambdas capture references to variables that they also update, you are in big trouble (data races). Never do that!

We’ll investigate the performance issues later.

OpenMP

The next version is an almost equally simple extension of the sequential version we had at the top:
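A sketch of the OpenMP version, under the same assumptions as before:

```cpp
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (std::size_t i = 0; i < a.size(); ++i) {
    sum += a[i] * b[i];   // each thread accumulates a private sum, combined at the end
}
```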

Again, just one line added to make the code parallel when using OpenMP. It’s a wonderful world! If you remember from my first post on patterns, the OpenMP parallel programming API works mostly with compiler directives. #pragma omp signifies an OpenMP directive. The parallel directive denotes that the coming statement should execute in parallel by a team of threads, and the for directive is a work-sharing construct making the different threads execute parts of the following loop in parallel. What’s new here is the clause reduction(+:sum). It specifies that a reduction operation (+) is to be made on the variable sum by the different threads. Each thread will do a sequential reduction on its part of the loop and, when done, update the global variable sum.

There is an alternative way of doing it with task parallelism, but that is a topic for another post.

Threading Building Blocks: TBB

The Intel Threading Building Blocks, or TBB for short, is a wonderful C++ library for writing composable parallel programs, something that neither OpenMP nor the STL algorithms can achieve. I hope to be able to get back to the composability aspect in a later post. TBB has functions for parallel for and reduce, just like the C++ standard library.
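A sketch of what the tbb::parallel_reduce call can look like for the dot product (variable names and the exact formulation are my assumptions, not the original listing):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

double sum = tbb::parallel_reduce(
    tbb::blocked_range<std::size_t>(0, a.size()),   // range the runtime may slice up
    0.0,                                            // starting value of the reduction
    // "map" part: reduce one sub-range to a partial sum
    [&](const tbb::blocked_range<std::size_t>& r, double local) {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            local += a[i] * b[i];
        return local;
    },
    // reduction part: combine two partial sums
    [](double x, double y) { return x + y; });
```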

At first sight, it does not look as neat and tidy as the STL or OpenMP versions, but that’s a price which might be well worth paying for composability1.

Let’s unpack it a little. The tbb::parallel_reduce function takes, in this version, four arguments. The first is a blocked_range, which essentially is an integer range that the TBB runtime can slice and hand out to different threads to do some work on. The second argument is the starting value of the reduction variable, 0 in this case. The third argument is the map function. In my first post on patterns, you could see that the TBB parallel_for could take the start and end indices directly, and the TBB runtime then passed one index at a time to the map function. parallel_reduce does not (yet) have this, so we need to pass a range to the map function instead of a single index to work on. There are some advantages to this in terms of performance. For instance, a good compiler may be able to vectorize the inner loop, boosting performance even further.

Finally, the fourth argument is the reduction function, which is quite similar to what you have seen above in the standard library variants.

Naked threads

Let’s go to the final and, in my opinion, the ugliest parallel map-reduce which uses naked threads.
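A sketch of such a version (the chunking scheme, thread count, and variable names are assumptions for illustration):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

double sum = 0.0;
const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
std::vector<double> results(nthreads, 0.0);          // one partial result per thread
std::vector<std::thread> threads;

const std::size_t chunk = (a.size() + nthreads - 1) / nthreads;
for (unsigned t = 0; t < nthreads; ++t) {
    threads.emplace_back([&, t] {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(a.size(), begin + chunk);
        double local = 0.0;
        for (std::size_t i = begin; i < end; ++i)
            local += a[i] * b[i];                    // map + sequential local reduce
        results[t] = local;                          // publish this thread's partial sum
    });
}
for (unsigned t = 0; t < nthreads; ++t) {
    threads[t].join();                               // wait for the thread to finish
    sum += results[t];                               // fold its partial sum into the result
}
```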

It’s similar to the threads version of the map algorithm. What is different is that we keep a vector with one result slot per thread, which each thread updates with its partial sum once it has finished its part of the work. Then, as each thread finishes and joins the creating thread, its partial result is folded into the global reduction value.

Again you can see that the same simple algorithm can be expressed in many different ways. Which one you choose surely depends on what the rest of the code looks like and how it is called.

Performance

So how do these versions compare in performance?

| Variant    | Time (s) | Speedup |
|------------|----------|---------|
| Sequential | 0.144    |         |
| Threads    | 0.031    | 4.61    |
| OpenMP     | 0.022    | 6.65    |
| TBB        | 0.022    | 6.65    |
| stl seq    | 0.144    |         |
| stl par    | 0.021    | 6.75    |

The table above is similar to the one I had in the original post on parallel map. The main difference is that TBB now has performance on par with the best, and that the threads version instead performs significantly worse. I have yet to analyse why the threads version is not so good. I first thought it was because of false sharing in the results vector leading to cache invalidations, but that should be insignificant given the large number of elements in the arrays, and a simple experiment revealed that this was not the issue either. A topic for a future post to investigate.

Oh, by the way, the code is available here.


  1. By composability I mean that a parallel function or piece of code can in turn call other parallel functions without needing to consider this fact. For instance, if a function creates four threads that in turn call a function that creates four threads, we are quickly up to 16 threads in total, and the program may not behave as you expected. A parallel programming model which is composable works by exposing parallel activity, not how it is mapped to processor cores. TBB is composable; naked threads and OpenMP are not. ↩︎

