This document shows how to use Intel's Threading Building Blocks (TBB) by means of a simple example. The example is available as a tarball here. The files are also available as syntax-highlighted HTML here (fatals.* and hrtime.h are elided). The latter portion of this document assumes the reader is following along using the syntax-highlighted Makefile and main.C.
Log into the machine on which you would like to use TBB (this example uses an eight-processor x86-based machine called clover-01), and create a directory in which your TBB install will reside (you do NOT need root permissions on your machine).
Get the open-source TBB tarball from http://osstbb.intel.com/download.php (select the Commercial Aligned Release). Copy or move the tarball to the directory you made above. The name of the tarball in this example is tbb20_20070815oss_src.tar.gz, but the file name may vary depending on the version of TBB.
Use tar to unpack the files:
clover-01(1)% cd ~/
clover-01(2)% mkdir tbb
clover-01(3)% cd tbb
clover-01(4)% cp $DOWNLOADS/tbb20_20070815oss_src.tar.gz .
clover-01(5)% tar -xvzf tbb20_20070815oss_src.tar.gz
Change to the new directory and run gmake to build the TBB libraries (this also runs the TBB tests). The build doesn't seem to be parallel-friendly, so don't specify any -j options. You might want to go get yourself some coffee, because this is a rather lengthy step. Note: You'll want to have X11 forwarding enabled for this step.
gmake -C "./build/linux_em64t_gcc_cc_libc2.3.4_kernel2.6.9_release" -r -f ../../build/Makefile.tbb cfg=release tbb_root=../..
gmake: Entering directory `/afs/cs.wisc.edu/u/g/i/gibson/private/tbb/tbb20_20070815oss_src/build/linux_em64t_gcc_cc_libc2.3.4_kernel2.6.9_release'
../../build/Makefile.tbb:38: CONFIG: cfg=release arch=em64t compiler=gcc os=linux runtime=cc_libc2.3.4_kernel2.6.9
gmake: Leaving directory `~/tbb/tbb20_20070815oss_src/examples'
TBB provides a host of tools for the C++ programmer to leverage when writing parallel code. Chiefly these include:
Loop parallelization is one of the easiest ways to achieve parallelism from single-threaded code. It is generally most useful for embarrassingly data-parallel applications, but can be used elsewhere with some programmer effort. The idea is simple:
The age-old problem: given two arrays, add each element of the first array to its counterpart in the second. Admittedly, the problem is not horribly interesting, but it can still benefit from parallelism, provided the arrays are reasonably large. Follow along with main.C.
Strategy: Write some serial code to sum the arrays into a third result array. Then let TBB's parallel_for parallelize the loop.
To start off, after we initialize all the memory, parse arguments, etc., we're first going to sum the arrays using a single thread. We're going to use x86's high-resolution timers to find out how long the summing task runs single-threaded, so we'll know how much speedup we've gained by processing in parallel. The single-thread summing occurs at Lines 92-102 of main.C.
There is nothing special going on here -- this code is 100% straightforward.
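For readers without main.C at hand, here is a minimal sketch of what a timed single-thread sum might look like. It is not the code from main.C: the function names are hypothetical, the arrays are std::vector<int> for brevity, and std::chrono::steady_clock stands in for the x86 tick counter that the example reads through hrtime.h, so the result is in nanoseconds rather than ticks.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from main.C): result[i] = a[i] + b[i], one thread.
void sum_arrays_serial(const std::vector<int>& a,
                       const std::vector<int>& b,
                       std::vector<int>& result) {
    assert(a.size() == b.size() && a.size() == result.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        result[i] = a[i] + b[i];
    }
}

// Times one serial pass and returns the elapsed time in nanoseconds.
// Assumption: steady_clock replaces the x86 tick counter used by the
// original example, so the numbers are not comparable to its "ticks".
long long time_serial_sum_ns(const std::vector<int>& a,
                             const std::vector<int>& b,
                             std::vector<int>& result) {
    const auto start = std::chrono::steady_clock::now();
    sum_arrays_serial(a, b, result);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling time_serial_sum_ns on three equally sized vectors fills the third with the element-wise sums and gives the baseline that the parallel version is later compared against.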
Next, we're going to implement the same loop in parallel using TBB's parallel_for template. In order to do this, we have to write some slightly arcane C++, which you can see at Lines 12-46 of main.C and at Lines 104-113 of main.C.
TBB implements parallel loops by encapsulating them inside operator functions of specialized classes. This allows the TBB library headers to handle the parallelism without making any modifications to the compiler. However, it makes the syntax a little cumbersome (more so than, say, OpenMP's #pragma omp parallel for).
In the example, class ArraySummer is actually an elaborate function definition. The empty constructor just initializes the "function parameters" (aka the class data members), and the operator() function actually runs the loop. Notice that the loop's bounds are encapsulated into a blocked_range. This is how TBB divides iterations among threads -- each thread's blocked_range includes a different, non-overlapping range of iterations.
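The "loop as a class" pattern and the non-overlapping ranges can be sketched without TBB itself. The code below is an illustration only, not the code from main.C and not how TBB schedules work internally: IndexRange is a hypothetical stand-in for tbb::blocked_range<size_t>, and run_chunks_in_threads is a hand-rolled driver that starts one std::thread per chunk where the real example would call tbb::parallel_for. The ArraySummer shown here is a guess at the general shape of such a class, with member names chosen for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>: half-open [begin, end).
struct IndexRange {
    std::size_t begin;
    std::size_t end;
};

// The loop body packaged as a class. The data members play the role of
// function parameters; operator() runs the loop over one sub-range only.
class ArraySummer {
    const int* a_;
    const int* b_;
    int* result_;
public:
    ArraySummer(const int* a, const int* b, int* result)
        : a_(a), b_(b), result_(result) {}

    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin; i != r.end; ++i) {
            result_[i] = a_[i] + b_[i];
        }
    }
};

// Splits [0, n) into at most num_chunks contiguous, non-overlapping ranges.
std::vector<IndexRange> split_range(std::size_t n, std::size_t num_chunks) {
    assert(num_chunks > 0);
    std::vector<IndexRange> chunks;
    const std::size_t step = (n + num_chunks - 1) / num_chunks;  // ceiling division
    for (std::size_t lo = 0; lo < n; lo += step) {
        chunks.push_back({lo, std::min(lo + step, n)});
    }
    return chunks;
}

// Hand-rolled driver (assumption: one thread per chunk, no work stealing).
// Each thread writes a disjoint slice of the result, so no locking is needed.
void run_chunks_in_threads(const ArraySummer& body, std::size_t n,
                           std::size_t num_chunks) {
    std::vector<std::thread> workers;
    for (const IndexRange& r : split_range(n, num_chunks)) {
        workers.emplace_back([&body, r] { body(r); });
    }
    for (std::thread& t : workers) {
        t.join();
    }
}
```

Because every index falls in exactly one range, the threads never touch the same element of the result array. That property is what makes this loop safe to run in parallel without synchronization.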
Unpack the example tarball wherever you like. To actually compile with TBB, we have to set some environment variables. A handy shell script for setting up the environment is sitting in your TBB install directory. You must source this script before building the example or any TBB-enabled application!
clover-01(8)% cd ~/tbb
clover-01(9)% mkdir SumArray
clover-01(10)% cd SumArray
clover-01(11)% cp $DOWNLOADS/tbbexample.tar.bz2 .
clover-01(12)% tar -xvpf tbbexample.tar.bz2
clover-01(13)% source ../tbb20_20070815oss_src/build/linux_em64t_gcc_cc_libc2.3.4_kernel2.6.9_release/tbbvars.csh
Making .generated/ path for object files...
Making bin/ path for binaries...
g++ -c -O2 fatals.cpp -o .generated/fatals.o
g++ -c -O2 main.C -o .generated/main.o
***Making binary sumArrayTBB...
g++ -ltbb .generated/fatals.o .generated/main.o -o bin/sumArrayTBB
My work here is done.
Use the debug version of the build if you like -- Intel's documentation details the differences in the libraries. You should source the tbbvars.csh file for C-shells and tbbvars.sh for SH-based shells.
After sourcing the environment variables, the only thing to keep in mind is that you must link with the -ltbb flag. Note Line 13 of Makefile.
Just invoke the binary with a single numeric argument (the length of the array).
1T summing time: 406 ticks
TBB summing time: 6244 ticks
clover-01(16)% ./bin/sumArrayTBB 1000000
1T summing time: 11194533 ticks
TBB summing time: 8025955 ticks
clover-01(17)% ./bin/sumArrayTBB 10000000
1T summing time: 110286568 ticks
TBB summing time: 63916293 ticks
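Notice that TBB's parallel_for only becomes useful for larger arrays. In the first run shown, the TBB version is about 15 times slower (6244 ticks versus 406), presumably because the cost of setting up and scheduling the parallel work dwarfs the loop itself. At 1,000,000 elements TBB is about 1.4 times faster than the single thread, and at 10,000,000 elements about 1.7 times faster.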
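The comparison can be made concrete by dividing the single-thread tick count by the TBB tick count. The sketch below does that; speedup and efficiency are hypothetical helper names, not part of the example's source. The processor count of eight comes from the machine description at the top of this document, and treating it as the number of threads TBB actually used is an assumption that the output does not confirm.

```cpp
#include <cassert>

// Speedup of the parallel run relative to the serial run.
// Values above 1.0 mean the parallel version was faster.
double speedup(double serial_ticks, double parallel_ticks) {
    assert(parallel_ticks > 0.0);
    return serial_ticks / parallel_ticks;
}

// Parallel efficiency: the fraction of ideal linear speedup achieved.
// Assumption: num_processors is the number of processors actually in use.
double efficiency(double serial_ticks, double parallel_ticks,
                  int num_processors) {
    assert(num_processors > 0);
    return speedup(serial_ticks, parallel_ticks) / num_processors;
}
```

For the 10,000,000-element run, efficiency(110286568, 63916293, 8) comes out at roughly 0.22, well short of the 1.0 that perfect scaling on eight processors would give. That is plausible for a loop like this one, which does very little arithmetic per element of memory it touches.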