Portable abstraction of hierarchical architectures for high-performance computing
See also Further Reading or the Related pages tab above for links to more sections about hwloc concepts.
hwloc Summary
hwloc provides command line tools and a C API to obtain the hierarchical map of key computing elements within a node, such as: NUMA memory nodes, shared caches, processor packages, dies and cores, processing units (logical processors or "threads") and even I/O devices. hwloc also gathers various attributes such as cache and memory information, and is portable across a variety of different operating systems and platforms.
hwloc primarily aims at helping high-performance computing (HPC) applications, but is also applicable to any project seeking to exploit code and/or data locality on modern computing platforms.
hwloc supports the following operating systems:
- Linux (including old kernels not having sysfs topology information, with knowledge of cpusets, ScaleMP vSMP support, etc.) on all supported hardware, including Intel Xeon Phi and NumaScale NumaConnect.
- Solaris (with support for processor sets and logical domains)
- AIX
- Darwin / OS X
- FreeBSD and its variants (such as kFreeBSD/GNU)
- NetBSD
- HP-UX
- Microsoft Windows
- IBM BlueGene/Q Compute Node Kernel (CNK)
Since it uses standard operating system information, hwloc's support is mostly independent of the processor type (x86, powerpc, ...) and just relies on operating system support. The main exception is BSD operating systems (NetBSD, FreeBSD, etc.) because they do not provide topology information, hence hwloc uses an x86-only CPUID-based backend (which can be used for other OSes too, see the Components and plugins section).
To check whether hwloc works on a particular machine, just try to build it and run lstopo or lstopo-no-graphics. If some things do not look right (e.g. bogus or missing cache information), see Questions and Bugs.
hwloc only reports the number of processors on unsupported operating systems; no topology information is available.
For development and debugging purposes, hwloc also offers the ability to work on "fake" topologies:
hwloc can display the topology in a human-readable format, either in graphical mode (X11), or by exporting in one of several different formats, including: plain text, LaTeX tikzpicture, PDF, PNG, and FIG (see Command-line Examples below). Note that some of the export formats require additional support libraries.
hwloc offers a programming interface for manipulating topologies and objects. It also brings a powerful CPU bitmap API that is used to describe the location of topology objects on physical/logical processors. See the Programming Interface below. It may also be used to bind applications onto certain cores or memory nodes. Several utility programs are also provided to ease command-line manipulation of topology objects, binding of processes, and so on.
Perl bindings are available from Bernd Kallies on CPAN.
Python bindings are available from Guy Streeter:
hwloc Installation
The generic installation procedure for both hwloc and netloc is described in Installation.
The hwloc command-line tool "lstopo" produces human-readable topology maps, as mentioned above. It can also export maps to the "fig" file format. Support for PDF, Postscript, and PNG exporting is provided if the "Cairo" development package (usually cairo-devel or libcairo2-dev) can be found when hwloc is configured and built.
The hwloc core may also benefit from the following development packages:
PCI and XML support may be statically built inside the main hwloc library, or as separate dynamically-loaded plugins (see the Components and plugins section).
Note that because of the possibility of GPL taint, the pciutils library libpci will not be used (remember that hwloc is BSD-licensed).
Command-line Examples
On a 4-package 2-core machine with hyper-threading, the lstopo tool may show the following graphical output:
Here's the equivalent output in textual form:
Machine
  NUMANode L#0 (P#0)
  Package L#0 + L3 L#0 (4096KB)
    L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1
      PU L#2 (P#4)
      PU L#3 (P#12)
  Package L#1 + L3 L#1 (4096KB)
    L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2
      PU L#4 (P#1)
      PU L#5 (P#9)
    L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3
      PU L#6 (P#5)
      PU L#7 (P#13)
  Package L#2 + L3 L#2 (4096KB)
    L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4
      PU L#8 (P#2)
      PU L#9 (P#10)
    L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5
      PU L#10 (P#6)
      PU L#11 (P#14)
  Package L#3 + L3 L#3 (4096KB)
    L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6
      PU L#12 (P#3)
      PU L#13 (P#11)
    L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#15)
Note that there is also an equivalent output in XML that is meant for exporting/importing topologies, but it is hardly readable by humans (see Importing and exporting topologies from/to XML files for details).
On a 4-package 2-core Opteron NUMA machine (with two cores disallowed by the administrator), the lstopo tool may show the following graphical output (with --disallowed for displaying disallowed objects):
Here's the equivalent output in textual form:
Machine (32GB total)
  Package L#0
    NUMANode L#0 (P#0 8190MB)
    L2 L#0 (1024KB) + L1 L#0 (64KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (1024KB) + L1 L#1 (64KB) + Core L#1 + PU L#1 (P#1)
  Package L#1
    NUMANode L#1 (P#1 8192MB)
    L2 L#2 (1024KB) + L1 L#2 (64KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (1024KB) + L1 L#3 (64KB) + Core L#3 + PU L#3 (P#3)
  Package L#2
    NUMANode L#2 (P#2 8192MB)
    L2 L#4 (1024KB) + L1 L#4 (64KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (1024KB) + L1 L#5 (64KB) + Core L#5 + PU L#5 (P#5)
  Package L#3
    NUMANode L#3 (P#3 8192MB)
    L2 L#6 (1024KB) + L1 L#6 (64KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (1024KB) + L1 L#7 (64KB) + Core L#7 + PU L#7 (P#7)
On a 2-package quad-core Xeon (pre-Nehalem, with 2 dual-core dies in each package):
Here's the same output in textual form:
Machine (total 16GB)
  NUMANode L#0 (P#0 16GB)
  Package L#0
    L2 L#0 (4096KB)
      L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L1 L#1 (32KB) + Core L#1 + PU L#1 (P#4)
    L2 L#1 (4096KB)
      L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L1 L#3 (32KB) + Core L#3 + PU L#3 (P#6)
  Package L#1
    L2 L#2 (4096KB)
      L1 L#4 (32KB) + Core L#4 + PU L#4 (P#1)
      L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#3 (4096KB)
      L1 L#6 (32KB) + Core L#6 + PU L#6 (P#3)
      L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
Programming Interface
The basic interface is available in hwloc.h. Some higher-level functions are available in hwloc/helper.h to reduce the need to manually manipulate objects and follow links between them. Documentation for all these is provided later in this document. Developers may also want to look at hwloc/inlines.h which contains the actual inline code of some hwloc.h routines, and at this document, which provides good higher-level topology traversal examples.
To precisely define the vocabulary used by hwloc, a Terms and Definitions section is available and should probably be read first.
Each hwloc object contains a cpuset describing the list of processing units that it contains. These bitmaps may be used for CPU binding and Memory binding. hwloc offers an extensive bitmap manipulation interface in hwloc/bitmap.h.
Moreover, hwloc also comes with additional helpers for interoperability with several commonly used environments. See the Interoperability With Other Software section for details.
The complete API documentation is available in a full set of HTML pages, man pages, and self-contained PDF files (formatted for both US letter and A4 formats) in the source tarball in doc/doxygen-doc/.
NOTE: If you are building the documentation from a Git clone, you will need to have Doxygen and pdflatex installed; the documentation will be built during the normal "make" process. The documentation is installed during "make install" to $prefix/share/doc/hwloc/ and your system's default man page tree (under $prefix, of course).
Portability
Operating systems have varying support for CPU and memory binding. Some provide interfaces for all kinds of CPU and memory binding, others provide interfaces for only a limited number of kinds, and some provide no binding interface at all. In the unsupported cases, hwloc's binding functions simply fail with the ENOSYS error (Function not implemented), meaning that the underlying operating system does not provide any interface for them. The CPU binding and Memory binding sections provide more information on which hwloc binding functions should be preferred because interfaces for them are usually available on the supported operating systems.
Similarly, the ability to report topology information varies from one platform to another. As shown in Command-line Examples, hwloc can obtain information on a wide variety of hardware topologies. However, some platforms and/or operating system versions will only report a subset of this information. For example, on a PPC64-based system with 8 cores (each with 2 hardware threads) running a default 2.6.18-based kernel from RHEL 5.4, hwloc is only able to glean information about NUMA nodes and processor units (PUs). No information about caches, packages, or cores is available.
Here's the graphical output from lstopo on this platform when Simultaneous Multi-Threading (SMT) is enabled:
And here's the graphical output from lstopo on this platform when SMT is disabled:
Notice that hwloc only sees half the PUs when SMT is disabled. PU L#6, for example, seems to change location from NUMA node #0 to #1. In reality, no PUs "moved" – they were simply re-numbered when hwloc only saw half as many (see also Logical index in Indexes and Sets). Hence, PU L#6 in the SMT-disabled picture probably corresponds to PU L#12 in the SMT-enabled picture.
This same "PUs have disappeared" effect can be seen on other platforms – even platforms / OSs that provide much more information than the above PPC64 system. This is an unfortunate side-effect of how operating systems report information to hwloc.
Note that after upgrading the Linux kernel on the same PPC64 system mentioned above to 2.6.34, hwloc is able to discover all the topology information. The following picture shows the entire topology layout when SMT is enabled:
Developers using the hwloc API or XML output for portable applications should therefore be extremely careful to not make any assumptions about the structure of data that is returned. For example, per the above reported PPC topology, it is not safe to assume that PUs will always be descendants of cores.
Additionally, future hardware may insert new topology elements that are not available in this version of hwloc. Long-lived applications that are meant to span multiple different hardware platforms should also be careful about making structure assumptions. For example, a new element may someday exist between a core and a PU.
API Example
The following small C example (available in the source tree as "doc/examples/hwloc-hello.c") prints the topology of the machine and performs some thread and memory binding. More examples are available in the doc/examples/ directory of the source tree.
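#include "hwloc.h"
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Recursively print obj and its normal children, indented by depth. */
static void print_children(hwloc_topology_t topology, hwloc_obj_t obj,
                           int depth)
{
    char type[32], attr[1024];
    unsigned i;
    hwloc_obj_type_snprintf(type, sizeof(type), obj, 0);
    printf("%*s%s", 2*depth, "", type);
    if (obj->os_index != (unsigned) -1)
        printf("#%u", obj->os_index);
    hwloc_obj_attr_snprintf(attr, sizeof(attr), obj, " ", 0);
    if (*attr)
        printf("(%s)", attr);
    printf("\n");
    for (i = 0; i < obj->arity; i++) {
        print_children(topology, obj->children[i], depth + 1);
    }
}

int main(void)
{
    int depth;
    unsigned i, n;
    unsigned long size;
    int levels;
    char string[128];
    int topodepth;
    void *m;
    hwloc_topology_t topology;
    hwloc_cpuset_t cpuset;
    hwloc_obj_t obj;

    /* Allocate the topology context, detect the machine, get its depth. */
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);
    topodepth = hwloc_topology_get_depth(topology);

    /* First example: walk the topology level by level, array style. */
    for (depth = 0; depth < topodepth; depth++) {
        printf("*** Objects at level %d\n", depth);
        for (i = 0; i < hwloc_get_nbobjs_by_depth(topology, depth);
             i++) {
            hwloc_obj_type_snprintf(string, sizeof(string),
                                    hwloc_get_obj_by_depth(topology, depth, i), 0);
            printf("Index %u: %s\n", i, string);
        }
    }

    /* Second example: walk the topology tree style. */
    printf("*** Printing overall tree\n");
    print_children(topology, hwloc_get_root_obj(topology), 0);

    /* Third example: print the number of packages. */
    depth = hwloc_get_type_depth(topology, HWLOC_OBJ_PACKAGE);
    if (depth == HWLOC_TYPE_DEPTH_UNKNOWN) {
        printf("*** The number of packages is unknown\n");
    } else {
        printf("*** %u package(s)\n",
               hwloc_get_nbobjs_by_depth(topology, depth));
    }

    /* Fourth example: sum the caches above the first logical processor. */
    levels = 0;
    size = 0;
    for (obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_PU, 0);
         obj;
         obj = obj->parent)
        if (hwloc_obj_type_is_cache(obj->type)) {
            levels++;
            size += obj->attr->cache.size;
        }
    printf("*** Logical processor 0 has %d caches totaling %luKB\n",
           levels, size / 1024);

    /* Fifth example: bind to a single PU of the last core, or of the
       closest smaller object if the OS has no notion of a core. */
    depth = hwloc_get_type_or_below_depth(topology, HWLOC_OBJ_CORE);
    obj = hwloc_get_obj_by_depth(topology, depth,
                                 hwloc_get_nbobjs_by_depth(topology, depth) - 1);
    if (obj) {
        /* Work on a copy of the cpuset and keep only one PU in it. */
        cpuset = hwloc_bitmap_dup(obj->cpuset);
        hwloc_bitmap_singlify(cpuset);
        if (hwloc_set_cpubind(topology, cpuset, 0)) {
            char *str;
            int error = errno;
            hwloc_bitmap_asprintf(&str, obj->cpuset);
            printf("Couldn't bind to cpuset %s: %s\n", str, strerror(error));
            free(str);
        }
        hwloc_bitmap_free(cpuset);
    }

    /* Sixth example: allocate memory on the last NUMA node, then bind
       some existing memory to it. There is always at least one node. */
    n = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
    obj = hwloc_get_obj_by_type(topology, HWLOC_OBJ_NUMANODE, n - 1);
    size = 1024*1024;
    m = hwloc_alloc_membind(topology, size, obj->nodeset,
                            HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
    hwloc_free(topology, m, size);
    m = malloc(size);
    hwloc_set_area_membind(topology, m, size, obj->nodeset,
                           HWLOC_MEMBIND_BIND, HWLOC_MEMBIND_BYNODESET);
    free(m);

    /* Destroy the topology context. */
    hwloc_topology_destroy(topology);
    return 0;
}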
hwloc supports pkg-config for obtaining the relevant compiler and linker flags. For example, it can be used as follows to compile applications that utilize the hwloc library (assuming GNU Make):
CFLAGS += $(shell pkg-config --cflags hwloc)
LDLIBS += $(shell pkg-config --libs hwloc)
hwloc-hello: hwloc-hello.c
	$(CC) hwloc-hello.c $(CFLAGS) -o hwloc-hello $(LDLIBS)
On a machine with 2 processor packages, each of which has two processing cores, the output from running hwloc-hello could be something like the following:
shell$ ./hwloc-hello
*** Objects at level 0
Index 0: Machine
*** Objects at level 1
Index 0: Package#0
Index 1: Package#1
*** Objects at level 2
Index 0: Core#0
Index 1: Core#1
Index 2: Core#3
Index 3: Core#2
*** Objects at level 3
Index 0: PU#0
Index 1: PU#1
Index 2: PU#2
Index 3: PU#3
*** Printing overall tree
Machine
  Package#0
    Core#0
      PU#0
    Core#1
      PU#1
  Package#1
    Core#3
      PU#2
    Core#2
      PU#3
*** 2 package(s)
*** Logical processor 0 has 0 caches totaling 0KB
shell$
History / Credits
hwloc is the evolution and merger of the libtopology project and the Portable Linux Processor Affinity (PLPA) (https://www.open-mpi.org/projects/plpa/) project. Because of functional and ideological overlap, these two code bases and ideas were merged and released under the name "hwloc" as an Open MPI sub-project.
libtopology was initially developed by the Inria Runtime Team-Project. PLPA was initially developed by the Open MPI development team as a sub-project. Both are now deprecated in favor of hwloc, which is distributed as an Open MPI sub-project.
Further Reading
The documentation chapters include
Make sure to have had a look at those too!