A next-generation debugger

If you buy the premise of my previous post about how existing debuggers don’t quite work for modern cloud software, particularly in production environments, what would it take to build a debugger that does work? We’ve been building one for Go called Side-Eye, so we have some ideas.

At a low level, half of what makes debuggers tick still applies just fine: using the debug information produced by compilers in order to make sense of a program’s memory (find variable locations, decode structs, etc.). The other half needs re-engineering: we need new mechanisms for extracting data from programs without “pausing” them. A traditional debugger will use ptrace to stop the target program and get control, then it will read its memory using other system calls; this is all very slow. Instead of this, we should be using dynamic instrumentation techniques (Side-Eye generates eBPF programs that read the data from the target process’ memory based on the compiler-produced debug information).
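To make that first half a bit more concrete, here is a minimal Go sketch of reading the compiler-produced debug information (DWARF) out of a binary using nothing but the standard library. The binary path is hypothetical, and this only lists the named functions, variables and parameters it finds; a real debugger would go further and decode their types and location expressions.

package main

import (
	"debug/dwarf"
	"debug/elf"
	"fmt"
	"log"
)

func main() {
	// Open an ELF binary and walk the DWARF debug information the compiler
	// emitted for it. This is the metadata a debugger uses to find variable
	// locations and decode structs.
	f, err := elf.Open("./myprogram") // hypothetical path to a Go binary
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	data, err := f.DWARF()
	if err != nil {
		log.Fatal(err)
	}

	r := data.Reader()
	for {
		entry, err := r.Next()
		if err != nil {
			log.Fatal(err)
		}
		if entry == nil {
			break // no more entries
		}
		// Print the names of functions, variables and parameters. A real
		// debugger would also read their type and location attributes.
		switch entry.Tag {
		case dwarf.TagSubprogram, dwarf.TagVariable, dwarf.TagFormalParameter:
			if name, ok := entry.Val(dwarf.AttrName).(string); ok {
				fmt.Println(entry.Tag, name)
			}
		}
	}
}

This is the part that carries over unchanged; what changes is the other side, where the reads themselves are done by generated eBPF programs instead of by a stopped-process debugger loop.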

The low-level stuff is only part of the story. The rest is about packaging powerful raw capabilities into a product that works at scale and that people actually reach for. So, making GDB or Delve work across multiple processes is not the goal; the goal is to build a tool that is both more powerful and friendlier: to reinvent the debugger for the cloud age.


Where have all the debuggers gone?

In my previous post, I was ranting about how hard it is to debug a complex system using only the instrumentation we’ve programmed into the software. What I’ve long wanted is a more natural way of asking a program arbitrary questions: I’d like to quickly get an answer, then ask the next question, and so on until I get to the root of the matter. A debugger is an example of a tool that works like this: it has an interactive, conversational mode of use where you converse with the program and iteratively dig deeper into the problem. Crucially, a debugger doesn’t rely on any particular instrumentation being pre-programmed; it can answer questions that the software was not programmed to answer.

A debugger used to be considered a basic part of the programmer’s daily toolbox: a programming platform would come with a text editor, a compiler and a debugger. But somehow debuggers got lost along the way. They seem to have stayed stuck in the desktop software era and don’t quite work for the cloud computing / SaaS era. Traditionally, debuggers have a pretty rigid usage model: you attach to one process, which you then stop in order to poke around in its memory. There’s a human in the loop responsible for pausing the program, thinking for a while and typing some commands, then resuming the program and maybe stopping it again later. That feels a bit quaint these days.

More and more, the systems we’re debugging are distributed across multiple processes, containers and machines. You can’t “pause” any of these processes for any measurable amount of time; if you did, you’d cause a big disruption to your service, and you’d also quickly destroy the very state that you were trying to observe. Some debuggers, such as GDB and Delve, let you script your interactions with the target program, but even those interactions are far too slow. In short, existing debuggers simply don’t apply to production environments.
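To spell out how heavyweight that usage model is, here is a rough, Linux-only Go sketch of the traditional attach-and-peek dance via ptrace. The PID and the memory address are made up (in a real debugger the address would come from the debug information); the point is that the target stays stopped from the attach until the detach, and every read of its memory is a trip through the kernel.

package main

import (
	"fmt"
	"log"
	"runtime"
	"syscall"
)

func main() {
	// All ptrace requests must come from the same OS thread.
	runtime.LockOSThread()

	pid := 12345 // hypothetical PID of the target process

	// Attaching stops the target. It stays stopped until we detach,
	// which is exactly what you can't afford to do in production.
	if err := syscall.PtraceAttach(pid); err != nil {
		log.Fatal(err)
	}
	var status syscall.WaitStatus
	if _, err := syscall.Wait4(pid, &status, 0, nil); err != nil {
		log.Fatal(err)
	}
	defer syscall.PtraceDetach(pid)

	// Read a few bytes of the target's memory through the kernel.
	// A debugger would compute this address from the debug information;
	// here it's just a placeholder.
	addr := uintptr(0xc000010000) // hypothetical address
	buf := make([]byte, 16)
	if _, err := syscall.PtracePeekData(pid, addr, buf); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("memory at %#x: % x\n", addr, buf)
}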


Debugging modern systems is too damn hard

This post is a rant about the observability and debuggability of modern software. I’m coming from many years of working on CockroachDB, so I’m biased towards “cloud-native” distributed systems, and SaaS in particular. You can probably guess that I’m not very happy with how we interact with such systems; that unhappiness is what motivates the product we’re building at Data Ex Machina.

On software observability

In an old talk that really stayed with me over the years, Bryan Cantrill, one of the authors of DTrace, went metaphysical on the reasons why software observability is fundamentally hard — namely that software doesn’t “look like anything”: it doesn’t emit heat, it doesn’t attract mass. So, if software is not naturally observable, how can we observe it? When a program is not working as expected, figuring out why is frequently an ad-hoc and laborious process. If, say, a car engine is misbehaving, you can open the hood and watch the physical processes with your own eyes; no such luck with software.

There are, of course, different ways in which we do get some observability into our programs. We might add different types of instrumentation, which amounts to baking in the questions we anticipate having in the future so that the software is able to answer them on demand: what’s the rate of different operations? What are the latency distributions of different operations? Has this particular error condition happened recently? What are the stack traces for my threads? We can also get other hints about what the software is doing by observing its environment – CPU usage, memory usage, I/O. I’m talking about metrics, logging, traces and built-in platform instrumentation. All these techniques are suitable for monitoring — verifying that the software continues to work within desired parameters. But observability should be about more than monitoring; it should be about unknown unknowns, about the ability to react to and understand whatever comes up, and to service our software in the field.
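As a tiny illustration of what “baking in questions” looks like in Go, here is a sketch using only the standard library’s expvar package; the counter names and the handler are made up. The counters can answer “how many requests?” and “how many failures?”, and nothing else: any question you didn’t anticipate when writing this code is out of reach.

package main

import (
	"expvar"
	"log"
	"net/http"
	"time"
)

// Instrumentation baked in ahead of time: these counters exist because we
// anticipated wanting to know the rate of requests and of failures.
var (
	requestsTotal = expvar.NewInt("requests_total")
	failuresTotal = expvar.NewInt("failures_total")
)

func handle(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	requestsTotal.Add(1)

	// Placeholder for the real work; fail on demand for illustration.
	if r.URL.Query().Get("fail") != "" {
		failuresTotal.Add(1)
		http.Error(w, "simulated failure", http.StatusInternalServerError)
		return
	}

	log.Printf("handled %s in %s", r.URL.Path, time.Since(start))
}

func main() {
	http.HandleFunc("/", handle)
	// Importing expvar registers a handler that exposes the counters
	// (as JSON) at /debug/vars on the default mux.
	log.Fatal(http.ListenAndServe(":8080", nil))
}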


Hello world

Last year, my friend Andrew Werner and I started a company — Data Ex Machina — with a mission to improve software observability and debugging. We’ve been working on Side-Eye, a debugger for cloud software written in Go. The posts that follow on this blog introduce the motivation and thinking behind it.

Over my software engineering career I’ve worked on several large-scale systems with stringent requirements around reliability and performance (most recently, both of us worked on CockroachDB). What has always bothered me is the fact that these systems are hard to service — when they misbehave in production, it can be very difficult to figure out what’s going wrong. Complex systems fail in complex ways, and it has always seemed to me that the tools we have at our disposal for debugging them are not up to the task. In fact, in some ways the tooling is actually going backwards, because we’ve largely lost the ability to use debuggers for many of the pathologies we’re interested in understanding — we no longer have a way to “ask arbitrary questions” about the execution and state of our programs. This thought has eaten at me over the years, together with the belief that we could do a lot better than the state of the art. We eventually decided to do something about it, and we started working on our ultimate observability and debugging tool.