Portable abstraction of hierarchical architectures for high-performance computing
Introduction
hwloc provides command line tools and a C API to obtain the hierarchical map of key computing elements, such as: NUMA memory nodes, shared caches, processor sockets, processor cores, and processing units (logical processors or "threads"). hwloc also gathers various attributes such as cache and memory information, and is portable across a variety of different operating systems and platforms.
hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms.
Note that the hwloc project represents the merger of the libtopology project from INRIA and the Portable Linux Processor Affinity (PLPA) sub-project from Open MPI. Both of these prior projects are now deprecated. The first hwloc release is essentially a "re-branding" of the libtopology code base, but with both a few genuinely new features and a few PLPA-like features added in. More new features and more PLPA-like features will be added to hwloc over time. See Switching from PLPA to hwloc for more details about converting your application from PLPA to hwloc.
hwloc supports the following operating systems:
-
Linux (including old kernels not having sysfs topology information, with knowledge of cpusets, offline CPUs, ScaleMP vSMP, and Kerrighed support)
-
Solaris
-
AIX
-
Darwin / OS X
-
FreeBSD and its variants, such as kFreeBSD/GNU
-
OSF/1 (a.k.a., Tru64)
-
HP-UX
-
Microsoft Windows
hwloc only reports the number of processors on unsupported operating systems; no topology information is available.
For development and debugging purposes, hwloc also offers the ability to work on "fake" topologies:
-
Symmetrical tree of resources generated from a list of level arities
-
Remote machine simulation through the gathering of Linux sysfs topology files
hwloc can display the topology in a human-readable format, either in graphical mode (X11), or by exporting in one of several different formats, including: plain text, PDF, PNG, and FIG (see CLI Examples below). Note that some of the export formats require additional support libraries.
hwloc offers a programming interface for manipulating topologies and objects. It also brings a powerful CPU bitmap API that is used to describe topology objects location on physical/logical processors. See the Programming Interface below. It may also be used to binding applications onto certain cores or memory nodes. Several utility programs are also provided to ease command-line manipulation of topology objects, binding of processes, and so on.
Installation
hwloc (https://www.open-mpi.org/projects/hwloc/) is available under the BSD license. It is hosted as a sub-project of the overall Open MPI project (https://www.open-mpi.org/). Note that hwloc does not require any functionality from Open MPI -- it is a wholly separate (and much smaller!) project and code base. It just happens to be hosted as part of the overall Open MPI project.
Nightly development snapshots are available on the web site. Additionally, the code can be directly checked out of Subversion:
shell$ svn checkout http:
shell$ cd hwloc-trunk
shell$ ./autogen.sh
Note that GNU Autoconf >=2.63, Automake >=1.10 and Libtool >=2.2.6 are required when building from a Subversion checkout.
Installation by itself is the fairly common GNU-based process:
shell$ ./configure --prefix=...
shell$ make
shell$ make install
The hwloc command-line tool "lstopo" produces human-readable topology maps, as mentioned above. It can also export maps to the "fig" file format. Support for PDF, Postscript, and PNG exporting is provided if the "Cairo" development package can be found when hwloc is configured and build. Similarly, lstopo's XML support requires the libxml2 development package.
CLI Examples
On a 4-socket 2-core machine with hyperthreading, the lstopo
tool may show the following graphical output:
Here's the equivalent output in textual form:
Machine (16GB)
Socket L#0 + L3 L#0 (4096KB)
L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0
PU L#0 (P#0)
PU L#1 (P#8)
L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1
PU L#2 (P#4)
PU L#3 (P#12)
Socket L#1 + L3 L#1 (4096KB)
L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2
PU L#4 (P#1)
PU L#5 (P#9)
L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3
PU L#6 (P#5)
PU L#7 (P#13)
Socket L#2 + L3 L#2 (4096KB)
L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4
PU L#8 (P#2)
PU L#9 (P#10)
L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5
PU L#10 (P#6)
PU L#11 (P#14)
Socket L#3 + L3 L#3 (4096KB)
L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6
PU L#12 (P#3)
PU L#13 (P#11)
L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7
PU L#14 (P#7)
PU L#15 (P#15)
Finally, here's the equivalent output in XML. Long lines were artificially broken for document clarity (in the real output, each XML tag is on a single line), and only socket #0 is shown for brevity:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
<object type="Machine" os_level="-1" os_index="0" cpuset="0x0000ffff"
complete_cpuset="0x0000ffff" online_cpuset="0x0000ffff"
allowed_cpuset="0x0000ffff"
dmi_board_vendor="Dell Computer Corporation" dmi_board_name="0RD318"
local_memory="16648183808">
<page_type size="4096" count="4064498"/>
<page_type size="2097152" count="0"/>
<object type="Socket" os_level="-1" os_index="0" cpuset="0x00001111"
complete_cpuset="0x00001111" online_cpuset="0x00001111"
allowed_cpuset="0x00001111">
<object type="Cache" os_level="-1" cpuset="0x00001111"
complete_cpuset="0x00001111" online_cpuset="0x00001111"
allowed_cpuset="0x00001111" cache_size="4194304" depth="3"
cache_linesize="64">
<object type="Cache" os_level="-1" cpuset="0x00000101"
complete_cpuset="0x00000101" online_cpuset="0x00000101"
allowed_cpuset="0x00000101" cache_size="1048576" depth="2"
cache_linesize="64">
<object type="Cache" os_level="-1" cpuset="0x00000101"
complete_cpuset="0x00000101" online_cpuset="0x00000101"
allowed_cpuset="0x00000101" cache_size="16384" depth="1"
cache_linesize="64">
<object type="Core" os_level="-1" os_index="0" cpuset="0x00000101"
complete_cpuset="0x00000101" online_cpuset="0x00000101"
allowed_cpuset="0x00000101">
<object type="PU" os_level="-1" os_index="0" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001"/>
<object type="PU" os_level="-1" os_index="8" cpuset="0x00000100"
complete_cpuset="0x00000100" online_cpuset="0x00000100"
allowed_cpuset="0x00000100"/>
</object>
</object>
</object>
<object type="Cache" os_level="-1" cpuset="0x00001010"
complete_cpuset="0x00001010" online_cpuset="0x00001010"
allowed_cpuset="0x00001010" cache_size="1048576" depth="2"
cache_linesize="64">
<object type="Cache" os_level="-1" cpuset="0x00001010"
complete_cpuset="0x00001010" online_cpuset="0x00001010"
allowed_cpuset="0x00001010" cache_size="16384" depth="1"
cache_linesize="64">
<object type="Core" os_level="-1" os_index="1" cpuset="0x00001010"
complete_cpuset="0x00001010" online_cpuset="0x00001010"
allowed_cpuset="0x00001010">
<object type="PU" os_level="-1" os_index="4" cpuset="0x00000010"
complete_cpuset="0x00000010" online_cpuset="0x00000010"
allowed_cpuset="0x00000010"/>
<object type="PU" os_level="-1" os_index="12" cpuset="0x00001000"
complete_cpuset="0x00001000" online_cpuset="0x00001000"
allowed_cpuset="0x00001000"/>
</object>
</object>
</object>
</object>
</object>
<!-- ...other sockets listed here ... -->
</object>
</topology>
On a 4-socket 2-core Opteron NUMA machine, the lstopo
tool may show the following graphical output:
Here's the equivalent output in textual form:
Machine (32GB)
NUMANode L#0 (P#0 8190MB) + Socket L#0
L2 L#0 (1024KB) + L1 L#0 (64KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1 L#1 (64KB) + Core L#1 + PU L#1 (P#1)
NUMANode L#1 (P#1 8192MB) + Socket L#1
L2 L#2 (1024KB) + L1 L#2 (64KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (1024KB) + L1 L#3 (64KB) + Core L#3 + PU L#3 (P#3)
NUMANode L#2 (P#2 8192MB) + Socket L#2
L2 L#4 (1024KB) + L1 L#4 (64KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (1024KB) + L1 L#5 (64KB) + Core L#5 + PU L#5 (P#5)
NUMANode L#3 (P#3 8192MB) + Socket L#3
L2 L#6 (1024KB) + L1 L#6 (64KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (1024KB) + L1 L#7 (64KB) + Core L#7 + PU L#7 (P#7)
And here's the equivalent output in XML. Similar to above, line breaks were added and only PU #0 is shown for brevity:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
<object type="Machine" os_level="-1" os_index="0" cpuset="0x000000ff"
complete_cpuset="0x000000ff" online_cpuset="0x000000ff"
allowed_cpuset="0x000000ff" nodeset="0x000000ff"
complete_nodeset="0x000000ff" allowed_nodeset="0x000000ff"
dmi_board_vendor="TYAN Computer Corp" dmi_board_name="S4881 ">
<page_type size="4096" count="0"/>
<page_type size="2097152" count="0"/>
<object type="NUMANode" os_level="-1" os_index="0" cpuset="0x00000003"
complete_cpuset="0x00000003" online_cpuset="0x00000003"
allowed_cpuset="0x00000003" nodeset="0x00000001"
complete_nodeset="0x00000001" allowed_nodeset="0x00000001"
local_memory="7514177536">
<page_type size="4096" count="1834516"/>
<page_type size="2097152" count="0"/>
<object type="Socket" os_level="-1" os_index="0" cpuset="0x00000003"
complete_cpuset="0x00000003" online_cpuset="0x00000003"
allowed_cpuset="0x00000003" nodeset="0x00000001"
complete_nodeset="0x00000001" allowed_nodeset="0x00000001">
<object type="Cache" os_level="-1" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001" nodeset="0x00000001"
complete_nodeset="0x00000001" allowed_nodeset="0x00000001"
cache_size="1048576" depth="2" cache_linesize="64">
<object type="Cache" os_level="-1" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001" nodeset="0x00000001"
complete_nodeset="0x00000001" allowed_nodeset="0x00000001"
cache_size="65536" depth="1" cache_linesize="64">
<object type="Core" os_level="-1" os_index="0"
cpuset="0x00000001" complete_cpuset="0x00000001"
online_cpuset="0x00000001" allowed_cpuset="0x00000001"
nodeset="0x00000001" complete_nodeset="0x00000001"
allowed_nodeset="0x00000001">
<object type="PU" os_level="-1" os_index="0" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001" nodeset="0x00000001"
complete_nodeset="0x00000001" allowed_nodeset="0x00000001"/>
</object>
</object>
</object>
<!-- ...more objects listed here ... -->
</topology>
On a 2-socket quad-core Xeon (pre-Nehalem, with 2 dual-core dies into each socket):
Here's the same output in textual form:
Machine (16GB)
Socket L#0
L2 L#0 (4096KB)
L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L1 L#1 (32KB) + Core L#1 + PU L#1 (P#4)
L2 L#1 (4096KB)
L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
Socket L#1
L2 L#2 (4096KB)
L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#3 (4096KB)
L1 L#6 (32KB) + Core L#6 + PU L#6 (P#3)
L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
And the same output in XML (line breaks added, only PU #0 shown):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE topology SYSTEM "hwloc.dtd">
<topology>
<object type="Machine" os_level="-1" os_index="0" cpuset="0x000000ff"
complete_cpuset="0x000000ff" online_cpuset="0x000000ff"
allowed_cpuset="0x000000ff" dmi_board_vendor="Dell Inc."
dmi_board_name="0NR282" local_memory="16865292288">
<page_type size="4096" count="4117503"/>
<page_type size="2097152" count="0"/>
<object type="Socket" os_level="-1" os_index="0" cpuset="0x00000055"
complete_cpuset="0x00000055" online_cpuset="0x00000055"
allowed_cpuset="0x00000055">
<object type="Cache" os_level="-1" cpuset="0x00000011"
complete_cpuset="0x00000011" online_cpuset="0x00000011"
allowed_cpuset="0x00000011" cache_size="4194304" depth="2"
cache_linesize="64">
<object type="Cache" os_level="-1" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001" cache_size="32768" depth="1"
cache_linesize="64">
<object type="Core" os_level="-1" os_index="0" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001">
<object type="PU" os_level="-1" os_index="0" cpuset="0x00000001"
complete_cpuset="0x00000001" online_cpuset="0x00000001"
allowed_cpuset="0x00000001"/>
</object>
</object>
<object type="Cache" os_level="-1" cpuset="0x00000010"
complete_cpuset="0x00000010" online_cpuset="0x00000010"
allowed_cpuset="0x00000010" cache_size="32768" depth="1"
cache_linesize="64">
<object type="Core" os_level="-1" os_index="1" cpuset="0x00000010"
complete_cpuset="0x00000010" online_cpuset="0x00000010"
allowed_cpuset="0x00000010">
<object type="PU" os_level="-1" os_index="4" cpuset="0x00000010"
complete_cpuset="0x00000010" online_cpuset="0x00000010"
allowed_cpuset="0x00000010"/>
</object>
</object>
</object>
<!-- ...more objects listed here ... -->
</topology>
Programming Interface
The basic interface is available in hwloc.h. It essentially offers low-level routines for advanced programmers that want to manually manipulate objects and follow links between them. Documentation for everything in hwloc.h are provided later in this document. Developers should also look at hwloc/helper.h (and also in this document, which provides good higher-level topology traversal examples.
To precisely define the vocabulary used by hwloc, a Terms and Definitions section is available and should probably be read first.
Each hwloc object contains a cpuset describing the list of processing units that it contains. These bitmaps may be used for CPU binding and Memory binding. hwloc offers an extensive bitmap manipulation interface in hwloc/bitmap.h.
Moreover, hwloc also comes with additional helpers for interoperability with several commonly used environments. See the Interoperability With Other Software section for details.
The complete API documentation is available in a full set of HTML pages, man pages, and self-contained PDF files (formatted for both both US letter and A4 formats) in the source tarball in doc/doxygen-doc/.
NOTE: If you are building the documentation from a Subversion checkout, you will need to have Doxygen and pdflatex installed -- the documentation will be built during the normal "make" process. The documentation is installed during "make install" to $prefix/share/doc/hwloc/ and your systems default man page tree (under $prefix, of course).
Portability
As shown in CLI Examples, hwloc can obtain information on a wide variety of hardware topologies. However, some platforms and/or operating system versions will only report a subset of this information. For example, on an PPC64-based system with 32 cores (each with 2 hardware threads) running a default 2.6.18-based kernel from RHEL 5.4, hwloc is only able to glean information about NUMA nodes and processor units (PUs). No information about caches, sockets, or cores is available.
Similarly, Operating System have varying support for CPU and memory binding, e.g. while some Operating Systems provide interfaces for all kinds of CPU and memory bindings, some others provide only interfaces for a limited number of kinds of CPU and memory binding, and some do not provide any binding interface at all. Hwloc's binding functions would then simply return the ENOSYS error (Function not implemented), meaning that the underlying Operating System does not provide any interface for them. CPU binding and Memory binding provide more information on which hwloc binding functions should be preferred because interfaces for them are usually available on the supported Operating Systems.
Here's the graphical output from lstopo on this platform when Simultaneous Multi-Threading (SMT) is enabled:
And here's the graphical output from lstopo on this platform when SMT is disabled:
Notice that hwloc only sees half the PUs when SMT is disabled. PU #15, for example, seems to change location from NUMA node #0 to #1. In reality, no PUs "moved" -- they were simply re-numbered when hwloc only saw half as many. Hence, PU #15 in the SMT-disabled picture probably corresponds to PU #30 in the SMT-enabled picture.
This same "PUs have disappeared" effect can be seen on other platforms -- even platforms / OSs that provide much more information than the above PPC64 system. This is an unfortunate side-effect of how operating systems report information to hwloc.
Note that upgrading the Linux kernel on the same PPC64 system mentioned above to 2.6.34, hwloc is able to discover all the topology information. The following picture shows the entire topology layout when SMT is enabled:
Developers using the hwloc API or XML output for portable applications should therefore be extremely careful to not make any assumptions about the structure of data that is returned. For example, per the above reported PPC topology, it is not safe to assume that PUs will always be descendants of cores.
Additionally, future hardware may insert new topology elements that are not available in this version of hwloc. Long-lived applications that are meant to span multiple different hardware platforms should also be careful about making structure assumptions. For example, there may someday be an element "lower" than a PU, or perhaps a new element may exist between a core and a PU.
API Example
The following small C example (named ``hwloc-hello.c'') prints the topology of the machine and bring the process to the first logical processor of the second core of the machine.
#include <hwloc.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
static void print_children(hwloc_topology_t topology, hwloc_obj_t obj,
int depth)
{
char string[128];
unsigned i;
hwloc_obj_snprintf(string, sizeof(string), topology, obj, "#", 0);
printf("%*s%s\n", 2*depth, "", string);
for (i = 0; i < obj->arity; i++) {
print_children(topology, obj->children[i], depth + 1);
}
}
int main(void)
{
int depth;
unsigned i, n;
unsigned long size;
int levels;
char string[128];
int topodepth;
hwloc_topology_t topology;
hwloc_cpuset_t cpuset;
hwloc_obj_t obj;
hwloc_topology_init(&topology);
hwloc_topology_load(topology);
topodepth = hwloc_topology_get_depth(topology);
for (depth = 0; depth < topodepth; depth++) {
printf("*** Objects at level %d\n", depth);
for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, depth);
i++) {
hwloc_obj_snprintf(string, sizeof(string), topology,
hwloc_get_obj_by_depth(topology, depth, i),
"#", 0);
printf("Index %u: %s\n", i, string);
}
}
printf("*** Printing overall tree\n");
print_children(topology, hwloc_get_root_obj(topology), 0);
depth = hwloc_get_type_depth(topology, HWLOC_OBJ_SOCKET);
if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
printf("*** The number of sockets is unknown\n");
} else {
printf("*** %u socket(s)\n",
hwloc_get_nbobjs_by_depth(topology, depth));
}
levels = 0;
size = 0;
for (obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, 0);
obj;
obj = obj->parent)
if (obj->type == HWLOC_OBJ_CACHE) {
levels++;
size += obj->attr->cache.size;
}
printf("*** Logical processor 0 has %d caches totaling %luKB\n",
levels, size / 1024);
depth = hwloc_get_type_or_below_depth(topology, HWLOC_OBJ_CORE);
obj = hwloc_get_obj_by_depth(topology, depth,
hwloc_get_nbobjs_by_depth(topology, depth) - 1);
if (obj) {
cpuset = hwloc_bitmap_dup(obj->cpuset);
hwloc_bitmap_singlify(cpuset);
if (hwloc_set_cpubind(topology, cpuset, 0)) {
char *str;
int error = errno;
hwloc_bitmap_asprintf(&str, obj->cpuset);
printf("Couldn't bind to cpuset %s: %s\n", str, strerror(error));
free(str);
}
hwloc_bitmap_free(cpuset);
}
n = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NODE);
if (n) {
void *m;
size = 1024*1024;
obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NODE, n - 1);
m = hwloc_alloc_membind_nodeset(topology, size, obj->nodeset,
HWLOC_MEMBIND_DEFAULT, 0);
hwloc_free(topology, m, size);
m = malloc(size);
hwloc_set_area_membind_nodeset(topology, m, size, obj->nodeset,
HWLOC_MEMBIND_DEFAULT, 0);
free(m);
}
hwloc_topology_destroy(topology);
return 0;
}
hwloc provides a pkg-config
executable to obtain relevant compiler and linker flags. For example, it can be used thusly to compile applications that utilize the hwloc library (assuming GNU Make):
CFLAGS += $(pkg-config --cflags hwloc)
LDLIBS += $(pkg-config --libs hwloc)
cc hwloc-hello.c $(CFLAGS) -o hwloc-hello $(LDLIBS)
On a machine with 4GB of RAM and 2 processor sockets -- each socket of which has two processing cores -- the output from running hwloc-hello
could be something like the following:
shell$ ./hwloc-hello
*** Objects at level 0
Index 0: Machine(3938MB)
*** Objects at level 1
Index 0: Socket#0
Index 1: Socket#1
*** Objects at level 2
Index 0: Core#0
Index 1: Core#1
Index 2: Core#3
Index 3: Core#2
*** Objects at level 3
Index 0: PU#0
Index 1: PU#1
Index 2: PU#2
Index 3: PU#3
*** Printing overall tree
Machine(3938MB)
Socket#0
Core#0
PU#0
Core#1
PU#1
Socket#1
Core#3
PU#2
Core#2
PU#3
*** 2 socket(s)
shell$
Questions and Bugs
Questions should be sent to the devel mailing list (https://www.open-mpi.org/community/lists/hwloc.php). Bug reports should be reported in the tracker (https://svn.open-mpi.org/trac/hwloc/).
If hwloc discovers an incorrect topology for your machine, the very first thing you should check is to ensure that you have the most recent updates installed for your operating system. Indeed, most of hwloc topology discovery relies on hardware information retrieved through the operation system (e.g., via the /sys virtual filesystem of the Linux kernel). If upgrading your OS or Linux kernel does not solve your problem, you may also want to ensure that you are running the most recent version of the BIOS for your machine.
If those things fail, contact us on the mailing list for additional help. Please attach the output of lstopo after having given the --enable-debug option to ./configure and rebuilt completely, to get debugging output.
Further Reading
The documentation chapters include
Make sure to have had a look at those too!