Improved debugging with rr

Introduction

Once a bug has been reproduced, people will often resort to one of two methods to investigate it:

  • Add traces in the code.
  • Use a debugger to instrument the execution of the bug.

The second solution is much more powerful, but it is also quite limited in the types of bugs it can handle:

  • If the instrumentation of the execution goes too far, it is impossible to go back. A new execution is needed, with new breakpoints or watchpoints, but also new addresses, new variable values…
  • If the reproduction is random, having to restart the process and reproduce the bug again can be very time consuming.
  • Instrumenting the bug with breakpoints can render the bug non-reproducible.

For those complex bugs, the investigation thus often ends up being a combination of traces and carefully placed assertions that generate core files, which can then be inspected with a debugger.

rr was created to improve debugging for those complex bugs. Its use is very simple:

$ rr record ./my-exec
I am buggy!
$ rr replay
Reading symbols from /home/thiberville/.local/share/rr/my-exec-0/mmap_hardlink_3_my-exec...done.
done.
(rr) continue
Continuing.
I am buggy!

Program received signal SIGKILL, Killed.
(rr)

An execution is first recorded, and can then be replayed in GDB as many times as needed. Since every replay is identical, two interesting properties follow:

  • Reproducibility: every replay is identical, so a random bug only needs to be recorded once, and the execution can then be investigated as much as needed. There is no risk of going too far in the execution and having to reproduce the bug again just to set an earlier breakpoint.
  • Reverse debugging: when replaying the execution, it is possible to go in reverse: reverse-continue, reverse-next, reverse-step. This is extremely useful.

To show practical examples of how much rr can speed up debugging, I will present three bugs that I investigated: cases where this tool saved me a lot of time, or where I wasted my time only to realize in retrospect how much I could have saved.

Exhibit A: Reverse continuation

Let's say you have a crash on a function failure:

if (foo(a, &err) < 0) {
    /* assert(false) aborts here, so a failure of foo() crashes the process */
    assert (false);
    logger_error("foo failed: %s", err);
    return -1;
}

This function is supposed to fill a string buffer with a description of the error, which should give a good idea of the reason for the crash:

(gdb) print err
$1 = 0x0

Oops, it looks like someone forgot to fill the buffer before returning the error. To understand the issue, we now need to find the exact line that returned the error, by stepping inside the execution of foo until the return -1 is found.
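
For illustration, the kind of error path we are looking for might look like the following sketch. Everything here is hypothetical (foo's real signature and the some_check() helper are made up); what matters is the early return that forgets to fill the error:

static int some_check(int a);               /* hypothetical validity check */

static int foo(int a, const char **errp)
{
    if (a < 0) {
        *errp = "a must be positive";
        return -1;
    }
    if (some_check(a) < 0) {
        /* BUG: this path returns an error without filling *errp,
         * which is why `err` is still NULL when the assert fires */
        return -1;
    }
    return 0;
}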

Now imagine this function call is the deserialization of an object, itself containing other objects, arrays, unions, etc. Stepping through the function calls would be a nightmare, as the error may occur, for example, when deserializing the 6th field of the 532nd element of an array. The full path of the invalid field is only available at the end of the call, just before the whole stack is unwound.

One solution would be to add traces in each deserialization function, trying to single out the element that cannot be deserialized, and to iterate by adding more and more traces…

Another solution is to use rr, which allows stepping in reverse. As the full details of the error are available only a few instructions before the assert (only stack unwinding happens between the error and the assert), stepping in reverse several times will reach the error:

$ rr replay
(rr) c
Continuing.

Program received signal SIGABRT, Aborted.
#12 0x0000000001d68d62 in ...
6007            assert (false);
(rr) # step in reverse to walk back through all the "return -1"
(rr) rs
unpack_struct at iop.blk:4627
4627    }
(rr) rs
unpack_value at iop.blk:4433
4433    }
...
(rr) rs
4372        return -1;
(rr) rs
4371    if (field_is_valid(field) < 0) {
(rr) bt
#0  0x0000000002f340cc in unpack_union at iop.blk:4815
#1  0x0000000002f326f4 in unpack_value  at iop.blk:4433
#2  0x0000000002f33485 in unpack_struct at iop.blk:4627
#3  0x0000000002f3277f in unpack_value at iop.blk:4440
#4  0x0000000002f340a0 in unpack_union at iop.blk:4814
#5  0x0000000002f326f4 in unpack_value at iop.blk:4433
#6  0x0000000002f33485 in unpack_struct at iop.blk:4627
...
#36 0x0000000002f34448 in load_object at iop.blk:4851

Now we can easily find which condition failed, and the backtrace tells us exactly which field is invalid. In my case, it was something like:

input[2].query.select.cond.expr.andOp.conditions[3].orOp.conditions[0]

Quite horrible to get to with forward stepping.

Exhibit B: Deterministic replay

Our products are composed of several daemons that communicate through RPC calls. In one daemon, the data that needs to be passed from an RPC call to its callback is saved in objects (which we call ic_msg_t) allocated from an internal memory pool. That memory pool batches allocations by creating pages, and ic_msg_t objects are just offsets inside those pages.
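
To make this concrete, here is a rough sketch of such a page-based pool. All the names and sizes below are assumptions for the example, not the real ic_msg_t pool; the point is that a message is only an offset inside a page, and that the page is the only allocation a leak detector ever sees:

#include <stddef.h>
#include <stdint.h>

typedef struct ic_msg_t {
    int payload;                  /* hypothetical content */
} ic_msg_t;

typedef struct msg_page_t {
    size_t  used;                 /* offset of the next free byte in mem  */
    size_t  live;                 /* messages created and not yet deleted */
    uint8_t mem[16 * 1024];       /* the only allocation a leak tool sees */
} msg_page_t;

static ic_msg_t *page_alloc_msg(msg_page_t *page)
{
    /* a message is just an offset inside the page */
    ic_msg_t *msg = (ic_msg_t *)(page->mem + page->used);

    page->used += sizeof(ic_msg_t);
    page->live++;
    return msg;
}

static void page_release_msg(msg_page_t *page, ic_msg_t *msg)
{
    (void)msg;
    if (--page->live == 0) {
        /* the page can only be recycled once every message in it has
         * been deleted: a single leaked message keeps it alive forever */
        page->used = 0;
    }
}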

When one message leaks, the page won't be freed, and this is detected when shutting down the daemon. Unfortunately, the tools we usually use for memory leaks cannot help us here, as they only point to the allocation of the whole page. Using rr as in the first example will not help either: we can inspect the page that leaked, but we do not know the offset of the object that caused the leak.

However, in this case, the reproducibility of an rr trace can help us. Let's start by tracing all ic_msg_t creations and deletions:

static ic_msg_t *ic_msg_init(ic_msg_t *msg)
{
    e_error("XXX create message %p", msg);
    ...
}

static void ic_msg_delete(ic_msg_t **msg)
{
    if (!msg) {
        return;
    }
    e_error("XXX delete message %p", *msg);
    ...
}

Let’s now run the leaking test:

$ rr record ./my-daemon 2>&1 | tee traces.txt
XXX create message 0x7fb5035987d0
XXX delete message 0x7fb5035987d0
XXX create message 0x7fb503591400
XXX create message 0x7fb503591948
...

As the offsets in a page can be reused, the same address will be used for different messages over time. To find the message that leaks, we look for an address that appears an odd number of times in the traces: every properly deleted message contributes a create/delete pair, while the leaked one only has a create.

# Find out the address that was used an odd-number of times
$ grep XXX traces.txt | cut -d\  -f4 | sort | uniq -c | awk '$1 % 2'
3 0x7fb503598b60

We now know the leaking address. As replaying the execution with rr does not change the addresses, we can replay it: the last allocation at this address will be the leaking one.

$ rr replay
(rr) # Run until the end, then go in reverse, to find the last alloc
(rr) c

Program received signal SIGKILL, Killed.
(rr) # Break on the allocation line
(rr) b iop-rpc-channel.blk:334 if msg == 0x7fb503598b60
(rr) rc
Continuing.

Thread 1 hit Breakpoint 1, ic_msg_init(msg=0x7fb503598b60) at iop-rpc-channel.blk:334
334         e_error("XXX create message %p", msg);
(rr) # Now we can find who leaks the message thanks to the backtrace
(rr) bt
#0  ic_msg_init (msg=0x7fb503598b60) at iop-rpc-channel.blk:334
#1  ic_msg_new () at iop-rpc-channel.blk:341
#2  _mw_query_cb (...) at platform-mw.blk:3593

In that case, a function call in _mw_query_cb had a side effect on a variable used to free the message, leading to the leak.

How would you solve this with GDB alone?

  • You cannot use the address of the leaking message as it will change on every execution.
  • You can add more traces to distinguish the different callers of ic_msg_init (e.g. a trace in _mw_query_cb), but you need an idea of where the leak is for this to be viable.
  • You can also count the number of calls to ic_msg_init before the leaking one, and add a static counter to abort right on the leaking call, as sketched after this list. However, this stops working as soon as the number of calls can vary, which is the case when running a test that involves multiple communications with different daemons.
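
For reference, that last option could look like the sketch below. The rank 1337 is purely hypothetical: it would have to be extracted from the traces of a previous run, and it is only valid as long as the sequence of calls never changes:

#include <stdlib.h>

static ic_msg_t *ic_msg_init(ic_msg_t *msg)
{
    static int call_count;

    /* abort on the call that allocates the leaking message, so that a
     * core (or a recording) stops right on it */
    if (++call_count == 1337) {
        abort();
    }
    e_error("XXX create message %p", msg);
    /* ... rest of the original initialization ... */
    return msg;
}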

In this case, we could not avoid adding traces to investigate the issue, but recording the execution ensured that those traces could be reused when investigating that same execution in depth.

Exhibit C: Multiple processes

The last example will show how combining reverse execution, reproducibility and recordings of multiple processes allows debugging some very capricious bugs.

Let's first go into a bit of detail about the bug to investigate, which involves multiple processes. Two daemons interest us for this bug: the miner and the EP. The miner is the daemon responsible for processing computations. When computations produce results, it sends the results to the EP and waits for an ACK on these results.

This works well, but in some very rare cases the miner may softlock, apparently waiting for an ACK from the EP that never comes. Looking through the logs, a disconnection/reconnection between the two daemons seems to be one trigger for the bug.

The bug can be reproduced, but it is extremely racy. The process goes like this:

  1. Start the product.
  2. Start a computation that should last a few tens of seconds.
  3. SIGSTOP the miner during the computation.
  4. Wait for a disconnection with the EP.
  5. SIGCONT the miner.
  6. If the bug is not triggered, go back to step 2.

After many tries, the right way to trigger the bug turned out to be to trigger the disconnect right before the end of the computation. The timeframe to send the SIGSTOP is very small, and several tries are often needed to re-trigger the bug.

Now that the stage is set, let's think about how to debug this:

  • Trying to generate cores to investigate afterwards would not work: it is not possible to know exactly when the bug is triggered, and the state of the processes afterwards is useless, as both end up in an IDLE state with the details of their communication already destroyed. A live execution is needed.
  • Attaching gdb to the processes could be done, but with the issues already listed above:
    • any breakpoint risks destroying the precise timings required to trigger the issue.
    • if a breakpoint is placed too late in the execution, the bug needs to be triggered once more with an earlier breakpoint.
  • Traces, more traces.

With rr, I only needed to reproduce the bug once. First the program was run while instrumenting the two processes:

INSTRUMENT="miner:rr ep:rr" ./master

Here, the master starts all the other daemons and instruments those mentioned in this environment variable.

Then, after several tries, the bug was reproduced, and the two processes could be stopped and then replayed through rr.

$ rr replay
Reading symbols from /home/thiberville/.local/share/rr/miner-0/mmap_hardlink_0_miner...done.
done.
(rr)

# In another terminal
$ rr replay ~/.local/share/rr/ep-0
Reading symbols from /home/thiberville/.local/share/rr/ep-0/mmap_hardlink_0_ep...done.
done.
(rr)

The two processes synchronize with the truncate RPC, which uses a (fid, fseqid) pair as the synchronization point. The fid is an incremental number referring to a file, and the fseqid is the record number in this file. Let's first find out the last truncate received by the miner:

# Miner
(rr) c
[logs...]

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000002 in ?? ()
(rr) c
Continuing.

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) c
Continuing.
[...]

Thread 1 received signal SIGINT, Interrupt.
0x0000000070000002 in ?? ()
(rr) b ep_enabler_truncate_impl
Breakpoint 1 at 0x578af6: file enabler.blk, line 627.
(rr) rc
Continuing.

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) rc
Continuing.

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000000 in ?? ()
(rr) rc
Continuing.

Thread 1 hit Breakpoint 1,...
627     ...
(rr) p arg
$1 = {
    fid = 8,
    fseqid = 6548860,
}
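
For reference, the synchronization point can be pictured as a small structure like the one below. The field types are assumptions; only the fid and fseqid names (and the fq_pos_t type inspected later) come from the actual session:

#include <stdint.h>

typedef struct fq_pos_t {
    uint32_t fid;      /* incremental identifier of a record file      */
    uint64_t fseqid;   /* sequence number of a record within that file */
} fq_pos_t;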

We can now look for this same (fid, fseqid) pair on the EP side:

# EP
(rr) b ep.blk:887 if dest_sync.fid == 8 && dest_sync.fseqid == 6548860
Breakpoint 1 at 0x49a837: file platform/bigdata/ep.blk, line 887.
(rr) c
Continuing.
[logs...]

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) d 1
(rr) b ep.blk:887
(rr) c
Continuing.

Thread 1 hit Breakpoint 2, ...
887     query(enabler, ..., &dest_sync);
(rr) c
Continuing.

Thread 1 hit Breakpoint 2, ...
887     query(enabler, ..., &dest_sync);
(rr) c
Continuing.

Thread 1 hit Breakpoint 2, ...
887     query(enabler, ..., &dest_sync);
(rr) c

Thread 1 received signal SIGINT, Interrupt.
0x0000000070000002 in ?? ()
(rr) rc

Thread 1 hit Breakpoint 2, ...
887     query(enabler, ..., &dest_sync);
(rr) p dest_sync
$1 = {
    fid = 8,
    fseqid = 7992094,
}

Because of the desynchronization, 3 RPCs sent by the EP are never received by the miner, the last one being at 8:7992094. On reconnection, another RPC is used to resynchronize the two daemons:

# Miner
(rr) b sync_with_ep
Breakpoint 2 at ...
(rr) c
Continuing.

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000002 in ?? ()
(rr) c
Continuing.

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) c
Continuing.

Thread 1 hit Breakpoint 2, sync_with_ep ...
(rr) p *seqid
$2 = {
    fid = 8,
    fseqid = 7992094
}

We can see that when the two daemons reconnect, the EP re-provides the last fseqid that it acked, so the bug is probably in the miner. Looking around the code, we can see that when the miner receives a synchronization point from the EP, it checks whether the fseqid is the last one for the associated fid. If it is, all records from the corresponding file have been acked, and the miner thus removes the file. Let's check if that is the case here:

# Miner
(rr) p last_offset_per_file
$3 = vector of fq_pos_t (len: 0)
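
The check described above boils down to something like the following sketch (reusing the fq_pos_t sketch from earlier). The vector layout and the remove_acked_file() helper are assumptions; only last_offset_per_file appears in the actual code:

/* assumed layout: only .len is visible in the gdb session above */
static struct {
    int       len;
    fq_pos_t *tab;
} last_offset_per_file;

static void remove_acked_file(uint32_t fid);    /* hypothetical helper */

static void miner_on_sync_point(const fq_pos_t *sync)
{
    for (int i = 0; i < last_offset_per_file.len; i++) {
        const fq_pos_t *last = &last_offset_per_file.tab[i];

        if (last->fid == sync->fid && last->fseqid == sync->fseqid) {
            /* every record of this file has been acked by the EP:
             * the file can be removed */
            remove_acked_file(sync->fid);
            return;
        }
    }
    /* with an empty vector, nothing ever matches: the file is never
     * removed and the miner waits forever */
}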

This vector is supposed to contain the last fseqid for each fid, but it is empty here. That explains why the miner then softlocks: it never removes the file. However, it should not be empty, which means it has been incorrectly cleaned somewhere:

# Miner
(rr) watch last_offset_per_file.len
(rr) rc
Continuing.

Thread 1 hit Hardware watchpoint 3: last_offset_per_file.len

Old value = 0
New value = 1
0x0000000000b10c30 in vector_reset ...
27          vec->len = 0;
(rr) bt
#0  0x0000000000b10c30 in vector_reset ...
#1  0x000000000057901f in ep_disconnected () at enabler.blk:683
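
The backtrace points at something like the following, a hypothetical reconstruction in which only ep_disconnected, vector_reset and last_offset_per_file come from the actual session:

static void ep_disconnected(void)
{
    /* ... tear down the connection state ... */

    /* BUG: this also throws away the last acked position per file,
     * which the miner needs again once the EP reconnects */
    vector_reset(&last_offset_per_file);
}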

The vector is reset on disconnection from the EP, losing information that is needed when reconnecting. Removing the reset fixes this issue. Another bug was actually hidden behind it, but I will not get into it, as it would not show anything more about what rr brings:

  • Analysis of recordings of multiple processes, making it possible to trace the communications between them.
  • Reverse stepping, with breakpoints, watchpoints, …

Conclusion

rr has been a huge time saver since I started using it regularly. The ability to step in reverse, combined with the reproducibility of replays, means that I often need only a single recording of a bug to find its cause, instead of many iterations of traces and gdb sessions.

The execution-time cost of using rr is relatively small (almost invisible in my tests). However, rr does suffer from the same kind of limitations as other runtime wrappers such as valgrind, the main one being that it schedules all threads on a single core, serializing their execution. For a race between threads, rr will usually not be able to help you.


Vincent Thiberville