Intel's Threading Building Blocks: HowTo

This document shows how to use Intel's Threading Building Blocks (TBB) by means of a simple example. The example is available as a tarball here. The files are also available as syntax-highlighted HTML here (fatals.* and hrtime.h are elided). The latter portion of this document assumes the reader is following along using the syntax-highlighted Makefile and main.C.

Installing Threading Building Blocks

Log into the machine on which you would like to use TBB (this example uses an eight-processor x86-based machine called clover-01), and create a directory in which your TBB install will reside (you do NOT need root permissions on your machine).

Get the open-source TBB tarball (select the Commercial Aligned Release). Copy or move the tarball to the directory you made above. The tarball in this example is named tbb20_20070815oss_src.tar.gz, but yours may be named differently depending on the version of TBB.

Use tar to unpack the files:

clover-01(1)% cd ~/
clover-01(2)% mkdir tbb
clover-01(3)% cd tbb
clover-01(4)% cp $DOWNLOADS/tbb20_20070815oss_src.tar.gz .
clover-01(5)% tar -xvzf tbb20_20070815oss_src.tar.gz
clover-01(6)% ls
tbb20_20070815oss_src/ tbb20_20070815oss_src.tar.gz

Change into the new directory and run gmake to build the TBB libraries (and run the TBB tests). The build doesn't seem to be parallel-friendly, so don't specify any -j options. You might want to go get yourself some coffee, because this is a rather lengthy step. Note: You'll want to have X11 forwarding turned on for this step.

clover-01(7)% gmake
gmake -C "./build/linux_em64t_gcc_cc_libc2.3.4_kernel2.6.9_release" -r -f ../../build/Makefile.tbb cfg=release tbb_root=../..
gmake[1]: Entering directory `/afs/'
../../build/Makefile.tbb:38: CONFIG: cfg=release arch=em64t compiler=gcc os=linux runtime=cc_libc2.3.4_kernel2.6.9


gmake[1]: Leaving directory `~/tbb/tbb20_20070815oss_src/examples'

TBB's Programming Model

TBB provides a host of tools for the C++ programmer to leverage when writing parallel code. Chiefly these include:

This example explores the first bullet, loop parallelization. The remaining points are well covered in Intel's Tutorial Document.

Loop parallelization is one of the easiest ways to extract parallelism from single-threaded code. It is generally most useful for embarrassingly data-parallel applications, but can be used elsewhere with some programmer effort. The idea is simple:

  1. Write a serial loop with no inter-iteration dependence
  2. Apply TBB's magic to the loop
  3. Voila! the loop is parallel
The obvious disadvantage of this approach is the requirement in step 1: no iteration of the loop may depend on the result of another iteration.
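To make the independence requirement concrete, here is a minimal sketch of one loop that meets it and one that does not. The function names add_arrays and prefix_sum and the int element type are chosen for illustration and are not taken from main.C.

```cpp
#include <cstddef>
#include <vector>

// Independent iterations: c[i] depends only on a[i] and b[i], so the
// iterations can run in any order, or at the same time on different threads.
// Assumes a and b hold at least c.size() elements.
void add_arrays(const std::vector<int>& a, const std::vector<int>& b,
                std::vector<int>& c) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        c[i] = a[i] + b[i];
    }
}

// Dependent iterations: iteration i reads the value written by iteration
// i - 1 (a running sum), so this loop cannot simply be split up as written.
void prefix_sum(std::vector<int>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}
```

The first loop is the kind of candidate this HowTo is about. The second would need to be restructured before any loop-parallelization template could be applied to it safely.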

Summing two Arrays with TBB

The age-old problem: given two arrays, add each element of the first array to its counterpart in the second. Admittedly, the problem is not horribly interesting, but it can still benefit from parallelism, provided the arrays are reasonably large. Follow along with main.C.

Strategy: Write some serial code to sum the arrays into a third result array. Then let TBB autoparallelize the process.

Writing Serial Code

To start off, after we initialize all the memory, parse arguments, etc., we're first going to sum the arrays using a single thread. We're going to use x86's high-resolution timers to find out how long the summing task runs single-threaded, so we'll know how much speedup we've gained by processing in parallel. The single-thread summing occurs at Lines 92-102 of main.C.

There is nothing special going on here -- this code is 100% straightforward.
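For readers without main.C at hand, the timed serial loop might look roughly like the sketch below. It is not the code from main.C: the names sum_serial and time_sum_serial and the double element type are assumptions, and std::chrono::steady_clock stands in for the tick counter provided by hrtime.h, so it reports nanoseconds rather than ticks.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Serial reference: a single thread walks the whole index range.
// Assumes a and b hold at least result.size() elements.
void sum_serial(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& result) {
    for (std::size_t i = 0; i < result.size(); ++i) {
        result[i] = a[i] + b[i];
    }
}

// Runs sum_serial once and returns the elapsed wall-clock time in
// nanoseconds. steady_clock is an assumption, not the timer used in main.C.
long long time_sum_serial(const std::vector<double>& a,
                          const std::vector<double>& b,
                          std::vector<double>& result) {
    const auto start = std::chrono::steady_clock::now();
    sum_serial(a, b, result);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Timing the serial version first gives the baseline against which the parallel run is compared.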

Applying TBB's parallel_for Template

Next, we're going to implement the same loop in parallel using TBB's parallel_for template. In order to do this, we have to write some slightly arcane C++, which you can see at Lines 12-46 of main.C and at Lines 104-113 of main.C.

TBB implements parallel loops by encapsulating them inside the operator() functions of specialized classes. This allows the TBB library headers to handle the parallelism without making any modifications to the compiler. However, it makes the syntax a little cumbersome (more so than, say, OpenMP's #pragma omp parallel for).

In the example, class ArraySummer is actually an elaborate function definition. The empty constructor just initializes the "function parameters" (aka the class data members), and the operator() function actually runs the loop. Notice that the loop's bounds are encapsulated into a blocked_range. This is how TBB divides iterations among threads -- each thread's blocked_range includes a different, non-overlapping range of iterations.
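The shape of such a class can be sketched without TBB installed. In the sketch below, IndexRange is a hypothetical stand-in for tbb::blocked_range<size_t>, run_in_chunks is a hand-rolled driver that is not TBB's scheduler, and the ArraySummer shown is a guess at the general form rather than a copy of main.C. The driver calls the body on each chunk one after another. TBB would instead hand the chunks to different worker threads.

```cpp
#include <cstddef>

// Hypothetical stand-in for tbb::blocked_range<size_t>: a half-open
// index interval [begin, end).
class IndexRange {
public:
    IndexRange(std::size_t b, std::size_t e) : begin_(b), end_(e) {}
    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
private:
    std::size_t begin_;
    std::size_t end_;
};

// The "function" packaged as a class: the data members play the role of
// parameters, and operator() is the loop body applied to one sub-range.
class ArraySummer {
public:
    ArraySummer(const double* a, const double* b, double* result)
        : a_(a), b_(b), result_(result) {}

    template <typename Range>
    void operator()(const Range& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            result_[i] = a_[i] + b_[i];
        }
    }
private:
    const double* a_;
    const double* b_;
    double* result_;
};

// Illustrative driver (NOT TBB): cuts [0, n) into at most num_chunks
// contiguous, non-overlapping pieces and applies the body to each in turn.
template <typename Body>
void run_in_chunks(std::size_t n, std::size_t num_chunks, const Body& body) {
    if (num_chunks == 0) num_chunks = 1;
    const std::size_t chunk = (n + num_chunks - 1) / num_chunks;
    for (std::size_t lo = 0; lo < n; lo += chunk) {
        const std::size_t hi = (lo + chunk < n) ? lo + chunk : n;
        body(IndexRange(lo, hi));
    }
}

// With TBB itself, the corresponding call has roughly this form:
//   tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
//                     ArraySummer(a, b, result));
```

Because every chunk covers a different set of indices and each iteration writes only result[i], the chunks can be processed in any order or concurrently without coordination.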

Setting up the Runtime

Unpack the example tarball wherever you like. To actually compile with TBB, we have to set some environment variables. A handy shell script for setting up the environment is sitting in your TBB install directory. You must source this script before building the example or any TBB-enabled application!

clover-01(8)% cd ~/tbb
clover-01(9)% mkdir SumArray
clover-01(10)% cd SumArray
clover-01(11)% cp $DOWNLOADS/tbbexample.tar.bz2 .
clover-01(12)% tar -xvpf tbbexample.tar.bz2
clover-01(13)% source ../tbb20_20070815oss_src/build/linux_em64t_gcc_cc_libc2.3.4_kernel2.6.9_release/tbbvars.csh
clover-01(14)% make

Making .generated/ path for object files...
mkdir .generated

Making bin/ path for binaries...
mkdir bin

*Compiling fatals.cpp...
g++ -c -O2 fatals.cpp -o .generated/fatals.o

*Compiling main.C...
g++ -c -O2 main.C -o .generated/main.o

***Making binary sumArrayTBB...
g++ -ltbb .generated/fatals.o .generated/main.o -o bin/sumArrayTBB

My work here is done.

Use the debug version of the build if you like; Intel's documentation details the differences between the libraries. Source tbbvars.csh for C shells and tbbvars.sh for sh-based shells.

Building with TBB

After sourcing the environment variables, the only thing to keep in mind is that you must link with the -ltbb flag. Note Line 13 of Makefile.

Running the Example Binary

Just invoke the binary with a single numeric argument (the length of the array).

clover-01(15)% ./bin/sumArrayTBB 100
1T summing time: 406 ticks
TBB summing time: 6244 ticks
clover-01(16)% ./bin/sumArrayTBB 1000000
1T summing time: 11194533 ticks
TBB summing time: 8025955 ticks
clover-01(17)% ./bin/sumArrayTBB 10000000
1T summing time: 110286568 ticks
TBB summing time: 63916293 ticks

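Notice that TBB's parallel_for only becomes useful for larger arrays: at 100 elements the TBB version takes 6244 ticks against 406 for the single thread, and it only pulls ahead once the arrays reach millions of elements.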
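The gain can be expressed as a speedup, the single-thread time divided by the TBB time. The helper below is a small illustrative sketch (the name speedup is not from main.C). Applied to the runs above, it gives about 0.07 for 100 elements, about 1.39 for 1,000,000 elements, and about 1.73 for 10,000,000 elements.

```cpp
// Speedup = serial time / parallel time. Values above 1.0 mean the
// parallel version was faster; values below 1.0 mean it was slower.
double speedup(double serial_ticks, double parallel_ticks) {
    return serial_ticks / parallel_ticks;
}
```

Even the largest run falls well short of the eightfold speedup that eight processors could give in the ideal case.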