Dynamic linking madness: solving a bug in go-nvml

2025-02-15 · Braydon Kains

I work on open source observability software, primarily the Google Cloud Ops Agent, OpenTelemetry Collector, and Fluent Bit.
Over the past few years, I have gained an affinity for the kind of deep issues that send me as far into the weeds as I can get. In this post I’m going to go over one of those issues, partly to self-document everything I learned, but also because I think it was an interesting journey worth writing down.

The Issue: go-nvml crashes our OpenTelemetry Collector

One of the features of the Ops Agent is GPU Monitoring; if you install the Ops Agent on a GCE VM with a GPU, you will automatically get metrics for it through the NVIDIA Management Library (NVML), and optionally through DCGM. To achieve this, we built specific instrumentation using the Go bindings for NVML and for DCGM.

We learned when attempting to upgrade our build of the Collector to Go 1.21 that the Collector would crash on startup if a GPU was present on the machine. It produced a kind of panic you don’t usually see in a Go program:

SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution

Seeing PC=0x0 was very surprising to me. I had no idea how this sort of thing could occur in a Go program, even with CGO. Even more strange was that this crash was only happening on certain systems. How could something like a segfault be system dependent?
I was absolutely hooked. I would not rest until I understood why this could possibly be happening.

You can read the original issue in go-nvml and the issue I opened in golang/go to see the real discussions, or read on for my direct retelling.

Intro to dynamic libraries

This is information that I feel is important to understand the underlying issue. If you are already familiar with how dynamic libraries are loaded, you can skip to How go-nvml works.

Dynamic vs Static Linking

In C and adjacent languages, there are two ways to link a library to your application: static and dynamic. Static linking is pretty straightforward: the library is compiled into object code, and that object code is linked directly into the resulting binary at build time. When the compiled program is run and something from the library is referenced, the implementation is already present within the binary. With dynamic linking, rather than being built directly into the binary, the libraries are only referenced by the application and then loaded at runtime. These are .so files on Linux or .dll files on Windows. When the application is run, the operating system’s loader looks for the referenced libraries on the system; if they are found they are loaded for the program to use, and if not the program fails to start.

Static linking sure does sound great, right? There’s not much to think about there, the code is just included in the binary rather than needing to worry about having specific dynamic libraries on the system. Why wouldn’t you always do that? Golang agrees with you; all binaries built with pure Go are completely statically linked. This is actually a selling point of the language, and as an avid user of it I can feel the benefits. It is so nice to build a giant Go program, and just have one nice clean binary at the end with everything the binary needs. As someone working on a tool written in Go, I love that building and distributing it is so dead simple because it’s one statically linked binary. No separate instructions that certain libraries have to be apt installed onto the system, or being forced to distribute a container image for the tool to be usable.

Dynamic linking does have a purpose though, especially when writing lower level applications. One of the most common examples is the C runtime library, an implementation of which is available on any Linux distribution, or can be installed on Windows through the Visual C++ Redistributable (something I’m sure many gamers have installed and not really known why). C runtimes can be statically linked with most compilers, but it often doesn’t make much sense to statically link something that is available on almost any system the application will run on. One of the biggest reasons is binary size. I’ve seen people online be quite confused at the size of a simple Go Hello World program exceeding a megabyte (at least at the time), but the reason for this is that Go statically links its runtime into the binary, which balloons its size.

Large binaries with lots of statically linked libraries have other complications as well, such as the amount of memory the program takes to run. I’d like to write a separate blog post about this at some point, but in short, large statically linked binaries can take more memory to run because loading the binary’s instructions and data in the first place takes up more space in RAM. The difference with dynamically loaded libraries is that the memory the library occupies can be shared by any other processes using the same library. So if we take dynamically linking libc as an example, there are probably tons of other applications on the system also dynamically loading libc and all sharing that memory in RAM. If all those same binaries had statically linked libc, then each would have a private copy of libc, with all the memory that takes up, and would be unable to share it with any other process on the system.

Dynamic Loading

The other way to interact with dynamic libraries is by loading them explicitly. With dynamic linking, the names of the required libraries are recorded in the binary for the system to discover when the program is loaded. However, sometimes the exact library to be used can’t be known at compile time. There may be multiple versions of the library that the program is built to work with, and some logic needs to run at runtime to determine exactly which library is loaded. This is common with versioned APIs, where a library ships v2 versions of functions alongside the originals rather than changing the originals in place, so that backwards compatibility is maintained (which is really important for dynamic libraries).
So the alternative method is loading the libraries at runtime using dlopen on Linux, or LoadLibrary on Windows. This gives you a handle to the library loaded into program memory, and to find symbols in it you look them up through that handle using dlsym on Linux or GetProcAddress on Windows.

Exporting Dynamic Symbols (Linux ELF binaries)

We have now exceeded my knowledge of how this might work in Windows, so this section is specific to ELF binaries on Linux.

What typically happens in the linking step is that the linker tracks all external references to dynamic symbols in two sections of the binary called the PLT (Procedure Linkage Table) and the GOT (Global Offset Table). The PLT holds a small stub for each dynamic function used, while the GOT holds the actual addresses of those symbols once they are known. When code calls a dynamic function, the compiler emits a call to the PLT entry for that symbol. At the linking stage, the linker creates the matching GOT entries. At runtime, when a PLT entry is called, it jumps to the address stored in its GOT entry; if the symbol hasn’t been resolved yet, that address leads back into the dynamic linker, which resolves the symbol and stores the result in the GOT.

Let’s see this in action with a very simple C program:

#include <stdio.h>

int main() {
    printf("hi\n");
    return 0;
}

I’ll compile the binary with gcc and immediately disassemble it:

$ make
gcc -o hello -g -Wall main.c
$ objdump -d hello > hello.s

Let’s navigate the dump to the main subroutine:

0000000000001149 <main>:
    1149:	f3 0f 1e fa          	endbr64
    114d:	55                   	push   %rbp
    114e:	48 89 e5             	mov    %rsp,%rbp
    1151:	48 8d 3d ac 0e 00 00 	lea    0xeac(%rip),%rdi        # 2004 <_IO_stdin_used+0x4>
    1158:	e8 f3 fe ff ff       	call   1050 <puts@plt>
    115d:	b8 00 00 00 00       	mov    $0x0,%eax
    1162:	5d                   	pop    %rbp
    1163:	c3                   	ret
    1164:	66 2e 0f 1f 84 00 00 	cs nopw 0x0(%rax,%rax,1)
    116b:	00 00 00 
    116e:	66 90                	xchg   %ax,%ax

What we care about here is the instruction at 1158, the call to puts@plt. This is a reference to the symbol puts in the PLT, which is a result of us calling printf from stdio.h in our program (the compiler swaps a printf of a constant string ending in a newline for the cheaper puts).

In the dump we can also analyze the disassembly of the plt:

Disassembly of section .plt:

0000000000001020 <.plt>:
    1020:	ff 35 9a 2f 00 00    	push   0x2f9a(%rip)        # 3fc0 <_GLOBAL_OFFSET_TABLE_+0x8>
    1026:	ff 25 9c 2f 00 00    	jmp    *0x2f9c(%rip)        # 3fc8 <_GLOBAL_OFFSET_TABLE_+0x10>
    102c:	0f 1f 40 00          	nopl   0x0(%rax)
    1030:	f3 0f 1e fa          	endbr64
    1034:	68 00 00 00 00       	push   $0x0
    1039:	e9 e2 ff ff ff       	jmp    1020 <_init+0x20>
    103e:	66 90                	xchg   %ax,%ax

Disassembly of section .plt.got:

0000000000001040 <__cxa_finalize@plt>:
    1040:	f3 0f 1e fa          	endbr64
    1044:	ff 25 ae 2f 00 00    	jmp    *0x2fae(%rip)        # 3ff8 <__cxa_finalize@GLIBC_2.2.5>
    104a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

Disassembly of section .plt.sec:

0000000000001050 <puts@plt>:
    1050:	f3 0f 1e fa          	endbr64
    1054:	ff 25 76 2f 00 00    	jmp    *0x2f76(%rip)        # 3fd0 <puts@GLIBC_2.2.5>
    105a:	66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)

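We can see that puts@plt does an indirect jump through the pointer stored at 0x3fd0 (0x2f76 bytes past the end of the jmp instruction). That location is the GOT entry for puts, which the dynamic linker fills in at runtime with the address of puts from glibc; GLIBC_2.2.5 is the symbol version the binary was linked against.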
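The PLT and GOT indirection can be modelled in a few lines of Go. This is only a toy sketch: symbolTable, gotTable, resolveCount and callViaPLT are hypothetical names invented for the illustration, and a map lookup stands in for what the dynamic linker really does with machine addresses.

```go
package main

import "fmt"

// symbolTable stands in for the symbols exported by the shared libraries
// that are loaded into the process (hypothetical, for illustration only).
var symbolTable = map[string]func(string){
	"puts": func(s string) { fmt.Println(s) },
}

// gotTable is a toy Global Offset Table: one slot per imported symbol.
// A missing slot means the symbol has not been resolved yet.
var gotTable = map[string]func(string){}

// resolveCount records how many times the slow resolution path ran.
var resolveCount int

// callViaPLT mimics a PLT stub: jump through the GOT slot if it is filled,
// otherwise resolve the symbol once and cache the result in the GOT.
func callViaPLT(name, arg string) error {
	target, ok := gotTable[name]
	if !ok {
		resolved, found := symbolTable[name]
		if !found {
			return fmt.Errorf("undefined symbol: %s", name)
		}
		resolveCount++
		gotTable[name] = resolved
		target = resolved
	}
	target(arg)
	return nil
}

func main() {
	callViaPLT("puts", "hi")       // resolves puts, then calls it
	callViaPLT("puts", "hi again") // reuses the cached GOT slot
	fmt.Println("resolutions:", resolveCount)

	// A symbol nobody exports cannot be resolved at all.
	if err := callViaPLT("get42", ""); err != nil {
		fmt.Println(err)
	}
}
```

The model reports an error for a symbol it cannot resolve, which is the polite version of the failure; a real binary that ends up calling through an address of zero simply crashes.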

All of this will be important when we get to the bug itself, so I hope you stayed awake!

How go-nvml works

The Go NVML bindings are an interesting challenge. NVML is a closed source library, and the intended usage is to link to the shared object on the system using a public header. So the way the Go NVML bindings work is as follows:

Provide a copy of the NVML header
Using a 3rd party tool called c-for-go, generate a set of Go bindings
Wrap the Go bindings in a light API layer for user friendliness

The function that was segfaulting was actually the first function, nvmlInit. So let’s look at the process of loading this function:

The library libnvidia-ml.so.1 is loaded using dlopen with the flags RTLD_LAZY | RTLD_GLOBAL.
Much of the API is versioned in the library, so each versioned API is searched for in the loaded library using dlsym. If the v2 version of a symbol is present, then the bindings are told to use the v2 version of the symbol. In our case, we are using an NVML library that’s new enough to have nvmlInit_v2, so we will end up using that symbol.
Each of these symbols is wrapped with an exported Go function that loads the library and checks for errors before calling into the generated bindings. So we would call nvml.Init() in our Go code.
This leads to the generated bindings, which are what actually call into C through CGO using import "C" and C.nvmlInit_v2().
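The versioned lookup in the steps above can be sketched as follows. This is a simplified model, not go-nvml’s actual code: fakeLibrary, lookup and pickSymbol are hypothetical names, and a set of strings stands in for the handle that dlopen returns and the queries that dlsym answers.

```go
package main

import "fmt"

// fakeLibrary stands in for a dlopen handle: the set of symbol names the
// loaded library exports. It is a toy model, not a real loader.
type fakeLibrary map[string]bool

// lookup plays the role of dlsym: it reports whether a symbol exists.
func (l fakeLibrary) lookup(symbol string) bool { return l[symbol] }

// pickSymbol prefers the "_v2" variant of base when the library has it,
// and falls back to the unversioned name otherwise.
func pickSymbol(lib fakeLibrary, base string) (string, error) {
	for _, candidate := range []string{base + "_v2", base} {
		if lib.lookup(candidate) {
			return candidate, nil
		}
	}
	return "", fmt.Errorf("symbol %s not found", base)
}

func main() {
	newDriver := fakeLibrary{"nvmlInit": true, "nvmlInit_v2": true}
	oldDriver := fakeLibrary{"nvmlInit": true}

	for _, lib := range []fakeLibrary{newDriver, oldDriver} {
		name, err := pickSymbol(lib, "nvmlInit")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("using", name)
	}
}
```

Doing the check at runtime is what lets one build of the bindings work against several driver versions.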

The Bug

A considerable amount of time has passed since this investigation took place, so I am writing with a ton of hindsight here. This explanation will obscure a ton of straw-grasping, which you can look through in the Go GitHub issue I opened. For the sake of this post though, I’m going to skip to the part where it all came together and the issue and solution became clear.

Ignoring the deep inner workings of how the NVML Go bindings work, I will focus on the most important core of it. This project generates Go bindings for a C API based on an input header file. This header file represents the accessible API for libnvidia-ml.so.1, a proprietary binary that is expected to be installed on the user’s machine and loaded at runtime. It is not provided as part of the binding package, and will not be linked as a part of the build. To deal with this, the bindings pass the linker flag --unresolved-symbols=ignore-in-object-files. This flag makes it so the symbols from nvml.h, which are not going to be resolved in the build with the shared object missing, will be ignored by the linker and not considered an error.

Our initial knowledge was that the bug occurred under the following circumstances:

Using Go 1.21
Building on Ubuntu Jammy or newer, but not on earlier distros like Debian 10 Buster

While at this point in the investigation a lot of these concepts were somewhat new to me, I did have a feeling that, given the issue was with a dynamic library loaded through CGO, it probably had something to do with linking. I suspected the version of ld on the system was the culprit, and that something in the CGO layer of Go had changed in a way that conflicted with a newer version of ld. It took me a non-trivial amount of time to realize why, but this ended up being mostly correct.

Standalone Repro

In order to a) determine whether this was go-nvml specific or something inherent to Go, and b) avoid needing NVIDIA libraries installed while developing, I created a standalone reproduction. This confirmed that a small CGO program set up under the same circumstances (providing a header but no object and passing --unresolved-symbols=ignore-in-object-files to ld) panicked in exactly the same way. We can work with this from here on out.

Comparing Go 1.20 to 1.21

Using the reproduction, I will build 2 binaries, one with Go 1.20 and one with Go 1.21.

The repro program includes a header that defines a function get42 and makes a call to it. This symbol should be unresolved in the build, and should show up as such in our binary. If we use nm on the Go 1.20 binary, we can find our get42 existing as expected as an unresolved symbol:

$ nm cgo_dl_repro_go120 | grep get42
0000000000483760 T _cgo_49665a31f432_Cfunc_get42
                 U get42
0000000000483580 t main._Cfunc_get42.abi0
000000000051b1c8 d main._cgo_49665a31f432_Cfunc_get42

However, checking out the Go 1.21 binary shows an important difference, which is that this symbol is missing!

$ nm cgo_dl_repro_go121 | grep get42
000000000047ce70 T _cgo_49665a31f432_Cfunc_get42
000000000047cca0 t main._Cfunc_get42.abi0
000000000051b1a8 d main._cgo_49665a31f432_Cfunc_get42

The only get42 symbols are the CGO calls we make in the Go code and the symbol from the C code that CGO generates.

I did not fully grasp what I was looking at when I found this, but this turned out to be the important difference. The get42 unresolved symbol being missing actually meant that the get42 symbol did not have an entry in the PLT. This results in Go generating assembly for this program that looks like this (disassembled by go tool objdump):

TEXT _cgo_49665a31f432_Cfunc_get42(SB) 
  :0			0x47ce70		4154			PUSHQ R12			
  :0			0x47ce72		55			PUSHQ BP			
  :0			0x47ce73		53			PUSHQ BX			
  :0			0x47ce74		4889fb			MOVQ DI, BX			
  :0			0x47ce77		e88416feff		CALL _cgo_topofstack(SB)	
  :0			0x47ce7c		4989c4			MOVQ AX, R12			
  :0			0x47ce7f		31c0			XORL AX, AX			
  :0			0x47ce81		e87a31b8ff		CALL 0x0 <-- EVIL!!!!	
  :0			0x47ce86		89c5			MOVL AX, BP			
  :0			0x47ce88		e87316feff		CALL _cgo_topofstack(SB)	
  :0			0x47ce8d		4c29e0			SUBQ R12, AX			
  :0			0x47ce90		892c03			MOVL BP, 0(BX)(AX*1)		
  :0			0x47ce93		5b			POPQ BX				
  :0			0x47ce94		5d			POPQ BP				
  :0			0x47ce95		415c			POPQ R12			
  :0			0x47ce97		c3			RET

And a reminder of what that panic looks like:

SIGSEGV: segmentation violation
PC=0x0 m=0 sigcode=1
signal arrived during cgo execution

That explains how we’re getting program counter 0x0!

The Solution

While I spent a considerable amount of time experimenting and looking through go tool linker and cgo source code to try and understand what was going on, and I did learn a lot, I ended up finding the problem with a good old fashioned git bisect. I ended up at commit 1f29f39.
The message of that commit: cmd/link: don't export all symbols for ELF external linking
The problematic code change was from this:

// Force global symbols to be exported for dlopen, etc.
if ctxt.IsELF {
	argv = append(argv, "-rdynamic")
}

To this:

// Force global symbols to be exported for dlopen, etc.
if ctxt.IsELF {
	if ctxt.DynlinkingGo() || ctxt.BuildMode == BuildModeCShared || !linkerFlagSupported(ctxt.Arch, argv[0], altLinker, "-Wl,--export-dynamic-symbol=main") {
		argv = append(argv, "-rdynamic")
	} else {
		ctxt.loader.ForAllCgoExportDynamic(func(s loader.Sym) {
			argv = append(argv, "-Wl,--export-dynamic-symbol="+ctxt.loader.SymExtname(s))
		})
	}
}

What does this mean? The code used to always pass the -rdynamic flag to gcc, which passes --export-dynamic to ld under the hood. After the change, -rdynamic is only passed when dynamically linking Go, when building in c-shared mode, or when the linker does not support the --export-dynamic-symbol flag; otherwise only the symbols that cgo has explicitly marked for dynamic export are exported, with one --export-dynamic-symbol flag each. The justification for this is in this issue (TL;DR it’s because exporting everything is unnecessary in most cases and thus wastes space in a majority of binaries). While it’s hard to know exactly when the --export-dynamic-symbol flag was added to ld, its availability seems like the only plausible reason that this issue only occurs on a new enough version of ld.
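To see why the outcome depends on the ld version, here is a small standalone model of that branch. It is a sketch under assumptions, not the real cmd/link code: linkConfig and exportFlags are hypothetical names, and the booleans stand in for ctxt.DynlinkingGo(), the c-shared build mode check, and linkerFlagSupported.

```go
package main

import "fmt"

// linkConfig is a hypothetical stand-in for the state the Go linker consults.
type linkConfig struct {
	dynlinkingGo         bool
	cShared              bool
	supportsExportSymbol bool     // does ld accept --export-dynamic-symbol?
	cgoExports           []string // symbols explicitly marked for dynamic export
}

// exportFlags mirrors the branch structure of the changed code: export
// everything with -rdynamic in the fallback cases, otherwise export only
// the explicitly marked symbols.
func exportFlags(cfg linkConfig) []string {
	if cfg.dynlinkingGo || cfg.cShared || !cfg.supportsExportSymbol {
		return []string{"-rdynamic"}
	}
	flags := make([]string, 0, len(cfg.cgoExports))
	for _, sym := range cfg.cgoExports {
		flags = append(flags, "-Wl,--export-dynamic-symbol="+sym)
	}
	return flags
}

func main() {
	olderLd := linkConfig{supportsExportSymbol: false}
	newerLd := linkConfig{supportsExportSymbol: true}

	fmt.Println("older ld:", exportFlags(olderLd))
	fmt.Println("newer ld:", exportFlags(newerLd))
}
```

With an older ld the model still returns -rdynamic, matching the pre-1.21 behaviour. With a newer ld and no explicitly exported symbols it returns no export flags at all, which matches the systems where the crash appeared.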

Since -rdynamic is now not always being passed in the CGO build process, the change I landed on was to modify the binding generation in go-nvml to always pass the --export-dynamic linker flag. This doesn’t break anything if -rdynamic is also passed, but ensures that the required ld flag is still passed with newer versions of Go and ld.
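As a rough illustration, a cgo preamble that always requests --export-dynamic could look like the fragment below. It assumes the flags are supplied through a #cgo LDFLAGS directive; the exact line in go-nvml may differ, and the get42 declaration is borrowed from the repro purely as a placeholder. Building it needs cgo and a C toolchain on Linux.

```go
package main

/*
// Assumption: linker flags are supplied through a #cgo LDFLAGS directive.
// --export-dynamic is requested explicitly instead of relying on the Go
// linker to pass -rdynamic.
#cgo linux LDFLAGS: -Wl,--export-dynamic -Wl,--unresolved-symbols=ignore-in-object-files

// Declared here, defined only in a shared library that is loaded at runtime.
int get42(void);
*/
import "C"

import "fmt"

func main() {
	// get42 is deliberately not called: no library providing it is loaded here.
	fmt.Println("built with --export-dynamic requested via cgo LDFLAGS")
}
```

Putting the flag in the bindings themselves means the result no longer depends on which flags a particular Go release decides to hand to the external linker.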

Conclusion

This was a very hard issue to figure out, and was around a week’s worth of effort. The solution was 16 characters. This is why it’s hard to measure coding productivity by raw output! :)

I’m still glad I went through all of it, and glad I went through the process of re-documenting it by writing up this post. Hopefully you got some enjoyment out of my adventure!