Hwloc tutorial

Welcome to the hwloc tutorial! The presentation slides that accompany this tutorial are also available.

Part 0: installation

hwloc is already available pre-compiled in many Linux distributions. Otherwise, Windows binaries and the source code can be downloaded from the Open MPI hwloc project website, and installation from source follows the usual free-software procedure:

./configure
... check at the end that the summary shows the features you want to see enabled. For this tutorial PCI support will be useful.
make
sudo make install

If you do not have administrator rights for the make install step, you can pass e.g. --prefix=$HOME/install to ./configure and run make install without sudo. You will then need to set the following variables in your working shell:

export PATH=$PATH:$HOME/install/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/install/lib
export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$HOME/install/lib/pkgconfig
export MANPATH=$MANPATH:$HOME/install/share/man

You can also go through Part 1 without installing hwloc: simply run the tools from ./utils/

Part 1: command-line tools

lstopo

lstopo renders the topology of the machine, as discovered by hwloc. It is an intuitive way to see what the machine contains.

There are two main modes: textual rendering and graphical rendering.

Exercise 1: lstopo introduction

Run
lstopo - -v --no-io
to show the textual rendering, and
lstopo --no-io
to show the graphical rendering (if you are working on a remote machine, and do not have X11 forwarding enabled, run
lstopo --no-io -.txt
to get pseudo-graphical rendering in textmode); you may have to enlarge your terminal in order to get proper output.

Compare the outputs. The textual output shows the object hierarchy in a way similar to directory hierarchies in file managers. The bottom of the output also shows the various levels of objects that hwloc has detected.

Now run
lstopo - --no-io
to show a compressed view of the same output: when objects span the same scope of the machine, they are shown on the same line. This condensed view takes less room, but also shows more clearly the actual hierarchy levels.

Finally, run
lstopo -
and
lstopo
(or
lstopo -.txt
if you do not have graphical support) to include the I/O devices in the output. They show up as a tree of PCI bridges and devices, inserted in the machine tree where the host bridge lies.

Exercise 2: lstopo formats

In addition to on-screen rendering, hwloc supports several output file formats. The pdf, ps, png and svg formats require the cairo development libraries to be installed on the system. The fig format is always available, and can be opened in xfig.
lstopo out.png
lstopo out.fig
lstopo out.svg
The fig and svg formats are convenient for post-processing the output, e.g. to remove some parts, reorder things or change numbers before including it in slides.

The xml format can be used for instance to save the topology, and reload it on another machine:
machineA$ lstopo out.xml
machineB$ scp machineA:out.xml .
machineB$ lstopo --input out.xml
This is useful, for instance, to keep a record of the topology of your servers.

Exercise 3: lstopo output

The default lstopo output may not exactly fit your needs. lstopo --help lists options to tinker with the output. You can notably play with the following options:
--only core
--ignore cache
--no-bridges
--no-legend
--whole-io
--fontsize 20
--vert
--horiz
--horiz=core
Depending on your machine, --horiz=core may have no effect. Pass it the name of an object whose content shows vertically in the default output.

Exercise 4: synthetic topologies

A very useful feature for slides is to create arbitrary topologies! See for instance:

lstopo --input "node:1 socket:1 cache:1 cache:2 cache:1 core:1 pu:2"

which builds a machine with one NUMA node, containing one socket, containing one L3 cache, containing two L2 caches, containing 1 L1 cache each, containing one core each, containing 2 logical processors each.

Play a bit with the values to make sure you have understood how they work.

Build a system with two sockets of six hyperthreaded cores each, where the L3 cache is shared by all cores of a socket, each L2 cache is shared by a pair of cores, and L1 caches are not shared.

Exercise 5: physical vs logical indexes

Compare the output of lstopo -.txt and lstopo -l -.txt. The latter uses logical indexes, which always increase contiguously across the figure. The former uses the physical indexes provided by the OS. Follow how the physical indexes are laid out: why are they numbered this way?

hwloc-bind

hwloc-bind binds a process to a given CPU set. For instance,

hwloc-bind core:1 -- sh

will start a new shell, bound to logical core 1.

lstopo --ps

conveniently shows the bound processes inside the lstopo output.

hwloc-bind --pid 1234 core:2

will bind an existing process (with pid 1234) to logical core 2. Details on the specification possibilities are available in man hwloc-bind.

Exercise 6: CPU binding

Bind one sleep 1h process to each core of your machine. It will be useful to use
hwloc-calc --intersect core --sep " " all
in order to build a for i loop. Observe the result in lstopo --ps

Move one of the sleep processes to another core and check the result. Then bind it to the whole machine and check the result again.

Another way to observe the binding is to bind lstopo itself:

hwloc-bind core:1 -- lstopo --pid 0

With the --pid 0 option, lstopo shows in green the set of processors it is bound to. This makes it easy to check your understanding of the cpu set specification.

Exercise 7: CPU binding

By using this trick, check that you know how to specify (see man hwloc-bind)
- the second logical core of the first socket.
- the first two logical cores of the first socket.
- the whole first socket except the first core.
- the CPUs near the hard disk.
- the CPUs near the video board (the lstopo -v output might be useful).

hwloc-calc

This tool takes the same input as hwloc-bind, but shows the resulting cpuset instead of binding a process. This can be used to make advanced cpuset computations.

Exercise 8: hwloc-calc

Observe the output of hwloc-calc when given the same input as was used in exercise 7. Check that it indeed matches the physical indexes reported by lstopo.

See how the following options change the output:
- --intersect core
- --number-of core
- --hierarchical socket.core
- --single

hwloc-assembler

This tool builds network topologies. Try for instance

lstopo out.xml
hwloc-assembler out2.xml out.xml out.xml
lstopo --input out2.xml

This builds a network of two machines like yours.

Exercise 9: network assembly

Build a network of 2 clusters: one with 4 machines like yours, the other with 4 dual-socket quad-core machines.

Part 2: API

My first hwloc program

This is a very simple hwloc example (to be saved as mytest.c):

#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    int nbcores;

    hwloc_topology_init(&topology);  // initialization
    hwloc_topology_load(topology);   // actual detection

    nbcores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    printf("%d cores\n", nbcores);

    hwloc_topology_destroy(topology);
    return 0;
}

It is essentially the same as hwloc-calc --number-of core all. To compile it, the simplest way is to use gcc mytest.c -o mytest -lhwloc, but depending on the installation it may not work. The preferred way is:

gcc mytest.c -o mytest $(pkg-config --cflags hwloc) $(pkg-config --libs hwloc)

Or better, using the following Makefile:

CFLAGS += $(shell pkg-config --cflags hwloc)
LDLIBS += $(shell pkg-config --libs hwloc)
all: mytest

and simply running make. If pkg-config does not find hwloc.pc, make sure you have set PKG_CONFIG_PATH as described in part 0.

Check that it runs fine. If it does not find libhwloc.so, make sure you have set LD_LIBRARY_PATH as described in part 0.

Traversals

For the following exercises, it will be useful to have the manpages at hand. Make sure for instance that man hwloc_obj works; if not, make sure you have set MANPATH as described in part 0.

Exercise 1: simple traversals

hwloc_get_root_obj(topology) returns the hwloc_obj_t corresponding to the root of the tree of objects (usually, the machine object). Write a function which takes an hwloc_obj_t and prints its depth, type, and os_index fields, and check it on the root object.

The arity field provides the number of children of the object, and the children field is an array of the children, indexed from 0 to arity-1. Write a recursive function which prints the whole tree of objects.

The hwloc_topology_get_depth(topology) function returns the number of levels, as shown at the end of the output of lstopo -v --no-io. hwloc_get_nbobjs_by_depth(topology, depth) returns the number of objects at level depth, and hwloc_get_obj_by_depth(topology, depth, i) returns object #i of level depth. Write a function which prints the whole tree level by level.

Exercise 2: less simple traversals

hwloc_get_type_depth(topology, type) returns the depth of a given object type. This can be used to display the list of cores of the second socket, by first getting the second socket (i.e. the second object of the socket level), and then recursing from there, displaying only core objects.

Write an is_ancestor function which checks whether an object obj1 is one of the ancestors of object obj2, by looping from obj2 along the parent field until it is equal to obj1 (in which case it is indeed an ancestor) or NULL (in which case it is not an ancestor since we have reached the root without finding obj1).

Another way to list the cores of the second socket is then to use hwloc_get_obj_by_depth to iterate over all cores of the machine, and print only those for which the second socket is an ancestor.

A third way is by using cpusets. Instead of using is_ancestor, simply compare the cpuset field of each core of the machine with the cpuset field of the second socket. If hwloc_bitmap_intersects returns true, the core is somewhere inside the socket, and should thus be printed.

hwloc/helper.h contains a lot of other examples which can be studied.

Exercise 2: devices

PCI devices and OS devices are special kinds of object, on their own level. They are not exposed by default: hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES) has to be called between hwloc_topology_init and hwloc_topology_load. Add this to the recursive traversal of exercise 1 and re-run it to see the changes in the output.

hwloc_get_type_depth can be used to retrieve the depth of the OS devices by passing it HWLOC_OBJ_OS_DEVICE. Traverse this level the same way as was done in exercise 1, and print the name field. Write code which finds the eth0 device.

Starting from the hwloc object for the eth0 device, traverse the parent pointers until finding an object whose type is neither HWLOC_OBJ_PCI_DEVICE nor HWLOC_OBJ_BRIDGE. Use hwloc_obj_snprintf to print it verbosely. This tells you where the network board is connected inside the machine!

Binding

Once the target object is found, binding to it is very easy:

hwloc_set_cpubind(t, obj->cpuset, 0)

to bind the process (assumed to be single-threaded), or

hwloc_set_cpubind(t, obj->cpuset, HWLOC_CPUBIND_THREAD)

to bind only the current thread, or

hwloc_set_cpubind(t, obj->cpuset, HWLOC_CPUBIND_PROCESS)

to bind the whole process (which can be multithreaded).

Exercise 3: CPU binding

Use it to bind your test program next to the eth0 device (or to the first core if you did not complete exercise 2) and make it sleep for a long time. Check with lstopo --ps that the binding indeed worked.

Create and bind one thread to each core, making them sleep for a long time afterwards. Make sure that main joins the threads or also waits. Check with lstopo --ps that it worked properly.
