Introducing auto-formatting in an existing codebase

At Intersec, we aim to maintain a consistent style throughout all the code, depending on the programming language. For example, the C codebase is the most consistent, with our coding rules being enforced during code review. This enforcement, however, can lead to significant time loss and frustration when patches have to be repushed just to comply with specific rules.

In the past, there have been several attempts at configuring auto-formatting tools. They were never fully satisfactory, because several of our coding rules did not fit into the limited configuration options of these tools. This subject was born out of those attempts. However, instead of focusing on adapting the tools to our coding rules, we considered doing the opposite. What about adapting our coding rules (in particular, some of the more peculiar ones) so that auto-formatting tools can be applied easily? After all, if some of our rules are too specific to exist in popular tools, maybe those rules do more harm than good to our code.

Python

The first step was Python, which we use extensively: for testing, in our build system, and for specific APIs. Python is the language where a specific coding style is the least enforced. Our guidelines tell us to follow PEP8, but this has never really been applied, hence a very heterogeneous codebase.

We looked into autopep8, yapf and black. Given the state of our codebase, any of these tools would cause significant changes when applied to all files. We decided to go with the most opinionated one, black, after discussing specific configuration options.

As expected, the changes were significant. A quick wc -l indicated that we had about 200k lines of Python code. Running black on every Python file resulted in about 56k added lines and 34k deleted lines. These are huge changes: although we are happy with the formatting, they practically guarantee conflicts during branch merges, which may also cause significant time loss.

Funnily enough, running black on our whole codebase revealed two cases where its output was not idempotent, in old code with dubious style. black was smart enough to detect that and provide the diff, so the formatting could be fixed by hand.

JavaScript

We considered using prettier for JavaScript, but decided to use eslint instead, as we already (happily) use eslint as a linter. By adding a few indentation rules, we can use eslint --fix as a sort of auto-formatter. It is less opinionated than prettier, but fits nicely with our codebase.

For 236k lines of code, running eslint --fix on all JS files resulted in only about 5k modified lines, which fall into two main categories:

  • Specific coding rules that did not match the eslint rules.
  • Badly indented code that was fixed.

This is pretty good. The changes are relatively minimal, and this could be used to make sure that the code is properly formatted in new commits. However, it may require some manual reformatting, as some constructs can produce very ugly indentation (for example, declaring an object on multiple lines inside a function call aligns the object fields with the opening brace of the object, which can lead to very deep indentation).

We also have around 120k lines of TypeScript. We are currently in the process of migrating our tslint configuration to eslint, which, once finished, will also allow us to format our TypeScript code using eslint.

C

Our C codebase was the most problematic. It uses Blocks, has grown to use a lot of idiosyncratic syntax, and allows higher-level data types through many custom macros. Once you are used to them, the coding rules make the code easy to read for us. For a formatting tool, however, not so much. C is notoriously hard to parse, and formatting tools can have a hard time figuring out which identifiers are types, or which macros are used to simulate keywords or blocks.

The goal was to use clang-format, which supports Blocks natively. However, the development version was needed to get support for some of the latest features, including whitelisting specific macros as keywords or blocks.
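To illustrate the problem, here is a hypothetical keyword-like macro (not taken from our codebase): without being told about it, a formatter only sees a function call followed by a dangling block.

#include <stdio.h>

/* Hypothetical iteration macro in the spirit of the ones used in our C
 * codebase: the trailing block belongs to a loop, but a formatter that does
 * not know about the macro cannot tell. */
#define array_for_each(i, len)  for (int i = 0; i < (len); i++)

int main(void)
{
    int values[] = { 1, 2, 3 };

    array_for_each(i, 3) {
        printf("%d\n", values[i]);
    }
    return 0;
}

Recent clang-format versions expose options such as ForEachMacros and StatementMacros precisely to declare this kind of macro, which is the whitelisting mentioned above.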

The result is not entirely convincing though. We triggered a few bugs that break the indentation, and some of the reformatting leaves the code arguably less readable than before. This could possibly be alleviated by significantly modifying our coding rules to improve the formatting, but that would involve a lot of changes to the whole codebase, which we would rather avoid.

Hackathon 0x09 – lib-common benchmarks

The goal was to develop benchmarks on a few of our core technologies in lib-common, in order to:

  • Be able to compare the performance of our custom implementations with standard implementations.
  • Be able to add automated tests on performance (e.g. non-regression tests ensuring that seemingly harmless changes do not worsen performance).

Benchmark library

The first step was to develop a benchmark library; the success criteria we established were the following (compared to the benchmarks already existing in our codebase):

  • Make benchmarks easy to write.
  • Standardize the output format.
  • Allow the use of external tools on the benchmarks to be factorized.

And the result looks like this, on the user side:

ZBENCH_GROUP_EXPORT(bithacks) {
    const size_t small_res = 71008;

    ZBENCH(membitcount_naive_small) {
        ZBENCH_LOOP() {
            size_t res = 0;

            ZBENCH_MEASURE() {
                res = membitcount_check_small(&membitcount_naive);
            } ZBENCH_MEASURE_END

            if (res != small_res) {
                e_fatal("expected: %zu, got: %zu", small_res, res);
            }
        } ZBENCH_LOOP_END
    } ZBENCH_END
} ZBENCH_GROUP_END

If you are familiar with lib-common, you can see that it looks very similar to the z test library.
The code more or less translates to the following:

if (BENCHMARK_RUN_GROUP("bithacks")) {
    const size_t small_res = 71008;

    if (BENCHMARK_RUN("membitcount_naive_small")) {
        for (int i = 0; i < BENCHMARKS_NB_RUNS; i++) {
            size_t res = 0;

            BENCHMARK_START();
            res = membitcount_check_small(&membitcount_naive);
            BENCHMARK_END();

            if (res != small_res) {
                e_fatal("expected: %zu, got: %zu", small_res, res);
            }
        }
    }
}

First benchmark: printf

The first benchmark we did compared the libc snprintf against our own implementation of snprintf (actually called isnprintf internally).

The result was not what we expected (the chart below shows the duration for one million calls to snprintf):

function     real min (ms)   real max (ms)   real mean (ms)
isnprintf    5.021           6.968           5.536
snprintf     2.123           3.187           2.408

As you can see, our implementation is about two times slower than the standard implementation in libc. So, it might be interesting to use the standard implementation instead of our own.

Unfortunately, our own implementation lets us define and use custom formatters that are not trivially compatible with the standard implementation.

In conclusion, this is an interesting idea to improve the speed of lib-common, but it needs some rework in order to replace our own implementation.

IOP packing/unpacking

If you have read one of our previous articles, you already know what IOP is.

Long story short, and for what matters here, we use it as a serialization library. Since it is widely used in our products, we also decided to benchmark serialization and deserialization in binary, JSON, and YAML.

The purpose of this benchmark was not really to compare the performance of packing and unpacking with other implementations, but rather to serve as a non-regression baseline and to compare JSON and YAML (indeed, as we could have expected, binary packing and unpacking is a lot faster; it is what is used for communication between daemons).

function        real min (ms)   real max (ms)   real mean (ms)
JSON pack       0.01            0.15            0.011
JSON unpack     0.018           0.192           0.024
binary pack     0.002           0.119           0.002
binary unpack   0.003           0.089           0.003
YAML pack       0.011           1.339           0.016
YAML unpack     0.033           0.328           0.048

We can see that unpacking costs more than packing, which seems normal, and that YAML unpacking seems particularly costly. This is an interesting point to keep in mind for optimization.

Bithack

The speed of lib-common is partly due to the optimization of low-level functions. One of these functions is membitcount which counts the number of bits set in a buffer.
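To give an idea of what the naive variant does, here is an illustrative sketch (not the actual lib-common code): for each byte of the buffer, the bits are counted one at a time.

#include <stddef.h>
#include <stdint.h>

/* Naive sketch: count the bits set in each byte, one bit at a time. */
static size_t membitcount_naive_sketch(const void *data, size_t len)
{
    const uint8_t *bytes = data;
    size_t count = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t b = bytes[i];

        while (b) {
            count += b & 1;
            b >>= 1;
        }
    }
    return count;
}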

We have four different implementations:

  • membitcount_naive which does the sum of a naive bitcount on each byte of the buffer.
  • membitcount_c which processes multiple bytes at once when doing the bitcount of the buffer.
  • membitcount_ssse3 which uses the SSSE3 processor instruction set.
  • membitcount_popcnt which uses the popcnt processor instruction.

The purpose of this benchmark is to check if the optimized implementations have a real impact on performance, or if we can keep a more naive implementation that is more readable and maintainable.

Small                real min (ms)   real max (ms)   real mean (ms)
membitcount naive    0.135           1.575           0.173
membitcount c        0.076           0.379           0.086
membitcount ssse3    0.044           0.161           0.051
membitcount popcnt   0.038           0.11            0.044

Big                  real min (ms)   real max (ms)   real mean (ms)
membitcount naive    0.721           11.999          0.896
membitcount c        0.215           12.882          0.248
membitcount ssse3    0.078           0.23            0.092
membitcount popcnt   0.037           0.114           0.044

There are some real differences between the four different implementations, so the optimizations are legitimate. membitcount_popcnt is the fastest one, and it is the one actually used when available.

Spinlocks/Mutexes

Recently, there were some discussions about the use of spinlocks in user space. In lib-common, we use some spinlocks in thread jobs.

The purpose of this benchmark is to check if replacing the spinlocks by mutexes has an impact on the performance of thread jobs.
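To make the comparison concrete, here is a minimal sketch using plain pthread primitives (not our actual thread-job code) of the two ways to protect a critical section:

#include <pthread.h>

static pthread_spinlock_t spin;  /* pthread_spin_init() must be called once */
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Spinlock version: a waiting thread busy-loops until the lock is free. */
static void incr_with_spinlock(void)
{
    pthread_spin_lock(&spin);
    counter++;
    pthread_spin_unlock(&spin);
}

/* Mutex version: a waiting thread can be put to sleep by the kernel. */
static void incr_with_mutex(void)
{
    pthread_mutex_lock(&mutex);
    counter++;
    pthread_mutex_unlock(&mutex);
}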

thrjobs test    real min (ms)         real max (ms)         real mean (ms)
                Mutexes    Spinlock   Mutexes    Spinlock   Mutexes    Spinlock
contention      0.165      0.189      0.427      1.712      0.211      0.32
sort job        26.987     26.885     93.645     79.923     38.372     36.968
sort block      27.027     26.943     88.772     109.533    36.849     38.277
queue           0.039      0.04       2.139      4.068      0.067      0.074
queue syn       0.725      1.289      8.968      4.889      2.737      2.726
wake up thr0    0.4        0.394      0.662      0.755      0.47       0.461
post notify     0.003      0.003      0.06       0.017      0.005      0.004
for each        195.948    182.7      224        224.167    210.577    209.49

The results are not conclusive. We see no visible impact of switching from spinlocks to mutexes in the benchmark.

This might be because:

  • There is no real macroscopic difference when switching from spinlocks to mutexes.
  • The differences are masked by the background noise of the benchmarks, which are not properly designed to test this modification.

Conclusion

Over this hackathon, we developed a new benchmark library, zbenchmark, that is easy to use and standardizes the output of the different benchmarks.

We also adapted and wrote some benchmarks to find future optimizations and avoid regressions.

Although there are still a lot of things to do (write new benchmarks, compare with standard implementations, find optimizations), the work done during this hackathon is promising.

Alexis BRASY & Nicolas PAUSS

Hackathon 0x09 – eBPF

At Intersec, we love new technologies that can improve our daily work and our code, and because it is fun! During Hackathon 0x09, I explored the possibility of using BPF for tracing and debugging our C codebase.

What is BPF?

In the beginning, BPF was a technology used for packet filtering [1]. For example, when using the command tcpdump -i lo arp, BPF is used to filter ARP packets on the loopback interface. BPF has since been enhanced. The BPF instruction set was extended in Linux kernel 3.15 and was called “extended BPF” or eBPF. Today, “BPF” can be defined as the technology that uses a set of tools to write and execute eBPF code.

So technically speaking, BPF is an in-kernel virtual machine that runs user-supplied programs inside the kernel. The instructions are checked by a BPF verifier that simulates the execution of the program and checks the stack state, out-of-range accesses, etc. An eBPF program must finish in bounded time [2], so it is not Turing-complete. The program is rejected if the verifier considers it invalid. The virtual machine includes an interpreter and a JIT compiler that generates machine instructions.

eBPF programs can be written in C: LLVM Clang can compile C code into eBPF bytecode, which can then be loaded with the bpf syscall [3]. Doing all of this by hand can be complex. Fortunately, we have the BCC frontend [4]. BCC is a set of tools to write, load and execute eBPF programs, with Python bindings.

What is it used for?

Another way to explain BPF is that this technology allows us to attach and run small user-supplied programs in a large number of places in the kernel, user applications, and libraries. It can gather application information, send events, aggregate statistics, and more. It is still used in networking for filtering packets and for processing packets in the NIC driver (XDP [5]). It is also used in security (seccomp), DDoS mitigation… and observability.

Observability is a way to get insights into the system and the applications, by tracing or sampling events. This is what we want to achieve in this hackathon.

There are already a lot of tools available for this purpose: gdb, logs, rr [6], strace… but BPF has some advantages. It is lightweight and has little impact on execution, it is programmable, and the BPF verifier guarantees the safety of the system and the application. BPF uses static or dynamic instrumentation and can gather information from the whole system, so it can be applied to libraries, applications, and the kernel at runtime. Therefore, BPF is different and can be considered a complementary investigation tool.

Purpose of this hackathon

The goal of this hackathon is to evaluate the usability of BPF in our ecosystem.

In this hackathon, I used BPF in three different ways: with USDT (Userland Statically Defined Tracing), kprobes (kernel probes) and uprobes (user probes), using the BCC frontend. As a playground, I used one of the Intersec internal applications: a highly performant distributed shard-based database.

The database (db) can run on multiple hosts but for convenience and for the exercise, we run three instances of the db on the same host.

Our first BPF program with USDT

Userland Statically Defined Tracing is a convenient way to instrument user-space code, and it allows tracing in production with low overhead. However, a tracing point must be inserted and maintained in the code. There are several ways to insert a USDT probe, for example by using systemtap [7]. I used the header provided in the BCC lib, which I incorporated into our lib-common [8] for the occasion.

Our db can create 'snapshots', so let's try to follow the snapshots by using a BPF program. During a snapshot, the data of each shard is persisted into files. This is the only modification to the code:

+#include <lib-common/usdt.h>

@@ -740,10 +741,14 @@ uint32_t db_shard_snapshot(db_shard_t *shard);
res = qps_snapshot(shard->qps, data.s, data.len, ^void (uint32_t gen) {
    int sid = shard->sid;
    struct timeval tv_end, tv_diff;

+   /* Static tracepoint. */
+   USDT(db, snapshot, sid);
+
    lp_gettv(&tv_end);

The first argument of USDT (db) is the namespace, the second argument (snapshot) is the name of the USDT probe, and the last argument (sid) exposes the shard id to the BPF program. At Intersec, we use a rewriter to provide a kind of closure in C [9]; the closure block is introduced by the caret character (^).

A static tracepoint has low overhead. Indeed, here it adds only a nop operation to the code.

  0x00000000009267ec <+63>: mov 0x18(%rax),%eax
  0x00000000009267ef <+66>: mov %eax,-0x14(%rbp)
+ 0x00000000009267f2 <+69>: nop
  0x00000000009267f3 <+70>: lea -0x50(%rbp),%rax
  0x00000000009267f7 <+74>: mov %rax,%rdi
  0x00000000009267fa <+77>: callq 0xbeed95 <lp_gettv>

More importantly, it adds information to an ELF note section of the db binary.

> readelf -n db

[...]
Displaying notes found in: .note.stapsdt
Owner Data size Description
stapsdt 0x00000032 NT_STAPSDT (SystemTap probe descriptors)
Provider: db
Name: snapshot
Location: 0x00000000009267f2, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
Arguments: -4@-20(%rbp)

Now, we can write our first eBPF program in C:

  /* Key of the per-shard snapshot counter map. */
  struct snapshot_key_t {
      int32_t sid;
  };

  BPF_HASH(snapshot_hash, struct snapshot_key_t, u64);

  int snapshot_probe0(struct pt_regs *ctx) {
      struct snapshot_key_t key = {};
      int32_t sid = 0;

      /* Read the first argument of the USDT probe (the shard id). */
      bpf_usdt_readarg(1, ctx, &sid);

      /* Send the information to user space through the trace pipe. */
      bpf_trace_printk("shard :%d\n", sid);

      /* Increment the per-shard counter in the map. */
      key.sid = sid;
      snapshot_hash.increment(key);
      return 0;
  }

It is quite simple: it reads the first argument of the USDT tracepoint and sends this information back through the common trace pipe. BPF also provides data structures (maps) that allow data to be shared between the kernel and user space. We use a map to count the snapshot occurrences for each shard id.

With the BCC frontend, attaching and executing the probe is as simple as these 3 lines in a Python script:

usdt = USDT(path=<path_to_the_binary>)
usdt.enable_probe(probe="snapshot", fn_name="snapshot_probe0")
bpf = BPF(text=<bpf_program.c>, usdt_contexts=[usdt])

We need the path of the binary or the process id to insert the eBPF program. What the kernel does is replace the nop with an int3 breakpoint. When this breakpoint is hit, the kernel executes the eBPF program. Then, when we trigger snapshots in the db, the Python script polls events and prints the collected information:

Tracing USDT db::snapshot... Hit Ctrl-C to end.
PID    PROG             TIME             SHARD ID
12717  db               4130.496204000   4
12546  db               4130.591365000   3
12546  db               4131.376820000   3
12717  db               4131.401658000   4
12717  db               4131.595490000   6
...

As you can see, when I provide only the path to the binary, BPF is able to trace all the processes that use this binary.

We can also read the map and display a summary:

     COUNT SID
         1 "6"
         2 "17"
         3 "4"
         3 "3"

kprobe

kprobes are a powerful tool to gather system information. They allow us to dynamically instrument nearly any kernel function without running the kernel in a special mode and without rebooting or recompiling it! The instrumentation sequence is the following: the instruction at the target address is saved and replaced by an int3 or a jmp instruction. When this instruction is hit, the related eBPF program is executed; then the saved instructions are executed and the normal instruction flow resumes. It is also possible to instrument a kernel function's return; this is called a kretprobe.

So let's try it with our db. When the db is snapshotting, files are written somewhere. Thus, the db will create new files, and I can guess without checking our code that it probably uses the open syscall. So, we can trace the do_sys_open kernel function, which is the endpoint of the open syscalls. My new eBPF program instruments the do_sys_open function entry (kprobe) and its return (kretprobe). On function entry, the eBPF program stores, for each call, some information (filename, flags, process details…) in a dedicated map. On function return, if do_sys_open was successful, the information stored in the map for this call is sent to a perf buffer and printed by the Python script.
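Here is a minimal sketch of what the C side of such a BCC program could look like. The event struct and its fields are illustrative; only the handler names (open_entry, open_return) and the events output correspond to the Python snippet below.

#include <uapi/linux/ptrace.h>

/* Illustrative event payload: what we remember about each open call. */
struct open_data_t {
    u32  pid;
    int  flags;
    char filename[256];
};

BPF_HASH(open_calls, u64, struct open_data_t);
BPF_PERF_OUTPUT(events);

/* kprobe: on entry of do_sys_open, store the arguments in the map. */
int open_entry(struct pt_regs *ctx, int dfd, const char __user *filename,
               int flags)
{
    u64 id = bpf_get_current_pid_tgid();
    struct open_data_t data = {};

    data.pid = id >> 32;
    data.flags = flags;
    bpf_probe_read_str(&data.filename, sizeof(data.filename), filename);
    open_calls.update(&id, &data);
    return 0;
}

/* kretprobe: on return, send the event to user space if the call succeeded. */
int open_return(struct pt_regs *ctx)
{
    u64 id = bpf_get_current_pid_tgid();
    int fd = PT_REGS_RC(ctx);
    struct open_data_t *data = open_calls.lookup(&id);

    if (data) {
        if (fd >= 0) {
            events.perf_submit(ctx, data, sizeof(*data));
        }
        open_calls.delete(&id);
    }
    return 0;
}

On the Python side, attaching both handlers and opening the perf buffer is just: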

bpf.attach_kprobe(event="do_sys_open", fn_name="open_entry")
bpf.attach_kretprobe(event="do_sys_open", fn_name="open_return")
bpf["events"].open_perf_buffer(print_open_event, page_cnt=64)

Here, we also used the previous USDT tracepoint along with the do_sys_open tracing; the results are printed together:

Tracing USDT db::snapshot... Hit Ctrl-C to end.
PID    PROG    TIME            FD  SHARD_ID   FLAGS    FILENAME
12546  db      4099.361961000      3
19793  db                      6              00100101 .lock
19793  db                      4              00100302 00000000.00000009.qpt
12546  db                      7              00101102 01000007:0000000000000071.log
...

We can note that BPF traces new db processes that were not present when the BPF program was attached: the file persistence is done by a forked process.

uprobe

BPF can also dynamically instrument user applications or libraries. It works the same way as kprobes, but it traces a file: all processes backed by this file that are already running, and all future processes that will use it, are tracked automatically. Thus, if you instrument libc:malloc, you will end up tracing every current and future process in the system that uses this libc's malloc. This is maybe not the smartest move because it will trigger a lot of events, but you will be like an omniscient demiurge in your Linux environment. And it is safe to use: if you can trace the kernel, you can trace any application.

This time, we used argdist [10], a wrapper that can write and attach a BPF program and handle its events in one line (see also bpftrace [11]). The log rotate function is triggered during the snapshot sequence and, as its name indicates, it rotates the binary log.

argdist.py 
   -i 5                                        # print every 5s
   -C'p:                                       # trace function entry
       ./db:                                   # binary path
       db_log_rotate(db_shard_t* shard):       # function prototype
       uint32_t,uint64_t:                          # values types
       shard->version.epoch,shard->version.version # print these values
       #epoch'
       -z 32
   -I'r9y.h'                                   # function header

...

[11:57:07]
epoch
        COUNT      EVENT
        1          shard->version.epoch = 16777220, shard->version.version = 422
        1          shard->version.epoch = 33554441, shard->version.version = 150
...

The BPF program traces all calls to the db_log_rotate function and prints the shard versions. It is simple and easy to use. I needed to use a sham header because of some compilation issues between the kernel headers and our lib-common. I tried to handle these compatibility issues, but due to lack of time, this will maybe be done in the next hackathon!

Dynamic instrumentation is quite powerful because you can trace all the functions of your application without modifying the code. However, it requires debug information, and it can suffer from interface instability: the interface may not exist in a given version of the binary, or the function may be inlined… Nevertheless, it is still very useful, because with BPF you can do much more than just print values as I did.

Conclusion

This hackathon was the occasion to use BPF and test the possibility of tracing our codebase. It was relatively easy, powerful and fun to use.

However, although it no longer requires a cutting-edge kernel version (> 4.x), some distributions [12] are still not compatible with BPF.

What remains to be done is the compilation of the eBPF programs for compatibility with our lib-common headers, and especially with our IOPs [13].


  1. http://www.tcpdump.org/papers/bpf-usenix93.pdf
  2. Kernel 5.3: bounded loop support: https://lwn.net/Articles/794934/
  3. http://man7.org/linux/man-pages/man2/bpf.2.html
  4. https://github.com/iovisor/bcc
  5. https://www.iovisor.org/technology/xdp
  6. https://techtalk.intersec.com/2018/03/improved-debugging-with-rr/
  7. https://sourceware.org/systemtap/wiki/AddingUserSpaceProbingToApps
  8. https://github.com/Intersec/lib-common
  9. https://techtalk.intersec.com/2014/11/blocks-rewriting-with-clang/
  10. https://github.com/iovisor/bcc/blob/master/tools/argdist.py
  11. https://github.com/iovisor/bpftrace
  12. https://www.redhat.com/en/blog/introduction-ebpf-red-hat-enterprise-linux-7
  13. https://techtalk.intersec.com/2017/08/intersec-object-packer-part-1-the-basics/

Hackathon – 5 years later

During the summer of 2014, we organized our first hackathon.
The rules are simple and are still up to date: the subjects are suggested by anyone in the company, and there is no defined framework nor limit on what they can be, although they often fall into the same categories; I'll come back to that.

Anyone can show interest in any of the proposed subjects and then work on it. In practice though, participants are mostly people from technical teams. Teams are formed and work on their topic from Thursday morning until Friday 4pm. They can even work late or at night if they wish to.
To make sure our contestants stay in good shape during that period, complimentary breakfasts are served, snacks are available and pizzas are ordered on Thursday evening.

Thursday 7pm, it's pizza time!
On Friday at 4pm, each team presents its work and has exactly 5 minutes to explain what they did. There can be slides, and there is often a demo. A vote takes place right after the presentations to nominate the favorite projects, and the three best teams are awarded prizes.

Time for some presentations

Generally, the subjects proposed and chosen fall into three main categories:

  • A killer feature using our products that didn't make it into our roadmap sooner and that usually comes with a big wow effect (these ones are usually the winners).
  • Something useful to improve the way technical teams work.
  • Fun projects that are usually very creative ways of hacking our products or the tools we use every day.

Thirty minutes before the presentations, the last finishing touches

In January, we held our ninth hackathon, and a few things have changed since the first edition. Ludovic, who was previously in charge of the organization, introduced mini challenges that allow teams to win extra points in exchange for a small amount of their precious time. Although he has since left the company, we keep up this tradition. This year the challenges were as follows:

  1. The first one was a speed test on a mathematical riddle:

    The 91 Intersec employees stand in a circle, holding hands. 81 hold the hand of a man and 20 hold the hand of a woman.
    How many women are there at Intersec?

  2. The second challenge was a programming tournament inspired by http://www.rpscontest.com/submit with a twist: the random move could only be used for the first move.
  3. The third challenge was also a speed test. The goal was to go to the following riddle website and to tell what was on the image of the fifth page (without cheating): http://www.mcgov.co.uk/riddles/level1.html

The quality of the resulting projects has also greatly increased over the editions and all these stories deserve to be told. This is what the articles to come are about.