Winning a hackathon with kepler.gl

(Written in Feb-March 2020 – Reading time: 10 minutes)

On the 23rd and 24th of January, an internal hackathon took place at Intersec. Our team “Laws of the Universe” took part in this hackathon, with the ambition of “testing” kepler.gl, an open-source solution for geodata viz and analysis.

More precisely, what we meant by “test” was a twofold objective:

  • See if we could build nice viz based on the type of data commonly processed by our solutions
  • Ideally, integrate them directly in our products, to demonstrate the feasibility of an industrialized solution based on this technology

To be honest, before the hackathon, our knowledge of kepler.gl was no more advanced than “Wow, this looks nice!” when browsing their website. Thankfully, our dream team was composed of two geo-data scientists and two full-stack developers, so we had everything in hand to make it a success!

What is kepler.gl?

According to their website, kepler.gl is “a powerful open source geospatial analysis tool for large-scale data sets”. More precisely, they claim to offer three desirable properties:

  • Performance: Built with deck.gl, kepler.gl utilizes WebGL to render large datasets quickly and efficiently.
  • Interaction: You can easily drag and drop a dataset, add filters, apply scales, and do aggregation on the fly.
  • Embeddable: Built on React & Redux, kepler.gl can be embedded inside your own mapping applications.

Now, this is the official “marketing” presentation, but what about testing it in real conditions? We will come back to the technical aspects later, but for now let’s discuss our first experiments as “dataviz users”.

The user interface is directly accessible on their website, but maps can also be downloaded to work offline (if you do not trust their claim not to store any data :)). The promise of “interaction” is clearly delivered: it is very easy to load data, you get acquainted with the basic options in minutes, and in a few clicks you build your first nice maps!

To illustrate that, we worked on what we call a “cell file”, i.e. a file listing all antennas within a given cellular network, with their coordinates. This type of file is central to our solutions that work on cellular location. Without any additional processing, we could load it as a CSV file, and Kepler automatically displayed the locations on a map.

Antennas of a French cellular network

Even with hundreds of thousands of points, it was very easy to navigate the map, zooming in and out, without any latency. In a few clicks, we were able to set the radius of the points and their colors, possibly depending on other variables.

Antennas of a French cellular network, by density

A single click also allows switching to a nice 3D mode. Here, we represented the density of antennas over the whole territory. This simultaneously gives a macro vision of dense areas (main cities) and, when zooming, a more fine-grained view of specific regions.

3D-vision of the density of antennas within a French cellular network

Zoom from previous illustration

These first examples confirmed how valuable kepler.gl could be to simply build nice viz. We now wanted to link it more closely with our products, both from a use-case and from a technical-integration point of view.

Our dataset

Before we go into more details on Intersec solutions, let’s say a few words about the dataset we built and used for some of the viz to come.

During the past months, volunteers among Intersec employees carried phones with a dedicated app tracking their position and surrounding cellular antennas. We thus had access to more than 200 daily trajectories. Our idea for this hackathon was to modify their timestamps to make all of them occur on the same day. Doing so, we simulated a population of 200 different users traveling during a given day, with the particularity of having a good chance of coming to our place of work at La Défense during work hours. This gave us relevant data to display nice maps about people's mobility.
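The timestamp trick can be sketched in a few lines of Python. This is a toy version of the idea only: the function name and the target day are made up for illustration, not taken from our actual scripts.

```python
from datetime import datetime, timezone

def collapse_to_single_day(timestamps, target_day="2020-01-23"):
    """Shift unix timestamps so that every trajectory occurs on the same
    (hypothetical) day, while keeping the original time of day."""
    day = datetime.fromisoformat(target_day).replace(tzinfo=timezone.utc)
    out = []
    for ts in timestamps:
        dt = datetime.fromtimestamp(ts, tz=timezone.utc)
        out.append(day.replace(hour=dt.hour, minute=dt.minute,
                               second=dt.second).timestamp())
    return out
```

Applied to all trajectories, this produces the simulated single-day population described above.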

Having said that, let’s go back to Intersec products!

Link with Intersec solutions: geoInsights

Among the solutions developed by Intersec to help operators leverage their mobility data, geoInsights produces anonymized statistics on the mobility of populations, for final clients in the transport, tourism or retail industries. Such statistics could be, for example, counts of people coming within a given area over a given period (let’s call it “density of population”), or flows of people between two regions, distributed by means of transportation. We decided to create a viz for each of these two use cases.


First, we wanted to see how densities of population evolved over the day within our dataset. As for the cell-file viz displayed above, it was quite easy to build a 3D map with bars whose height represents the density. Additionally, we were able to add the temporal dimension, to make the viz dynamic instead of static. Here again, Kepler's features made this easy: we simply selected the timestamp variable as a “filter”. The resulting viz is displayed below.

Densities of presence of our 200 users, evolving over the day

During night hours, we notice a high density in Neuilly-sur-Seine (home of one of the main contributors to the dataset, in the northwest of Paris), this density shifting as expected to La Défense (quite close, but a bit more northwest 😉 ) during working hours. This kind of observation illustrates how relevant this viz is for following densities of people over time (keeping in mind that it is obviously possible to pause the animated map for a more precise look at a given time). Further tests should be performed to see how larger datasets are handled, as we expect to follow millions of devices for such use cases.
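For the curious, the kind of aggregation behind such a density map can be approximated in a few lines of Python (Kepler does this internally; the grid resolution and sample points below are made up for illustration):

```python
from collections import Counter

def hourly_density(points):
    """Count points per (hour, grid cell); rounding coordinates to two
    decimals gives roughly 1 km cells at Paris latitudes."""
    counts = Counter()
    for lat, lng, hour in points:
        counts[(hour, (round(lat, 2), round(lng, 2)))] += 1
    return counts

# hypothetical fixes: two users at La Défense at 10h, one in Neuilly at 3h
sample = [(48.892, 2.236, 10), (48.893, 2.236, 10), (48.885, 2.269, 3)]
```

Each (hour, cell) count then maps directly to the height of a 3D bar at a given animation step.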


As mentioned, another use case consists in counting flows of population between regions. We used Kepler’s “Arc” display to illustrate such flows, setting the width and color of the arcs according to the number of people traveling between each pair of regions. The following figure shows an example of such a viz, based on sample data.

Example of map displaying volumes of people traveling between different regions

Even without animation this time, we clearly see the value of this type of viz, which gives a directly understandable view of the flows of population between different regions.
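The data feeding such an arc layer is essentially an origin/destination matrix. A minimal sketch (the region names are hypothetical, not our sample data):

```python
from collections import Counter

def region_flows(trips):
    """Aggregate trips into (origin, destination) -> count, the kind of
    table an arc layer consumes; same-region trips are not arcs."""
    return Counter((o, d) for o, d in trips if o != d)

trips = [("Neuilly", "La Défense"), ("Neuilly", "La Défense"),
         ("Paris", "La Défense"), ("Paris", "Paris")]
```

The resulting counts drive the width and color of each arc.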

Link with Intersec solutions: geoTrack

Among the other geo-related solutions developed by Intersec, geoTrack allows customers to follow their fleet of devices (IoT, …) over time. So here, we are talking about tracking each individual device rather than aggregating data as in geoInsights.

Our goal was to leverage Kepler’s capabilities to build an animated viz illustrating this use case. The dataset described above is a perfect example of a fleet we would like to track, with around 200 users over a given day.

It took us a bit more work to transform sequences of locations into GeoJSON shapes to enhance the visualization thanks to the “Trip” layer, but we are quite proud of what we achieved:

Position monitoring of our fleet of 200 devices, zoom over Paris area

Here again, we see flows of devices arriving at La Défense in the morning, and a nice spread leaving the zone at the end of the working day.

We can also zoom out a bit, to follow the fleet on a wider scale and see trips and presence over the whole country.

Position monitoring of our fleet of 200 devices, whole France

Beyond how beautiful this animated map is, we believe it is a good way to get a global picture of the locations of a given fleet, so it is perfectly suited to the geoTrack solution!
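About the GeoJSON transformation mentioned above: kepler.gl's “Trip” layer animates a LineString whose coordinates carry the timestamp as a fourth component. A minimal sketch of that conversion (the coordinate values are illustrative):

```python
import json

def to_trip_feature(points):
    """Convert a sequence of (lng, lat, unix_ts) fixes into a GeoJSON
    feature for kepler.gl's "Trip" layer, which reads the timestamp as
    the 4th coordinate (altitude is set to 0 here)."""
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "LineString",
            "coordinates": [[lng, lat, 0, ts] for lng, lat, ts in points],
        },
    }

feature = to_trip_feature([(2.236, 48.892, 1579770000),
                           (2.238, 48.893, 1579770060)])
```

One such feature per device yields the animated fleet shown on the maps.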

Link with Intersec solutions: technical aspects

As said above, our objective was not only to display nice maps with Kepler, but to make them accessible through our products. We are proud of having managed to do so for both geoInsights viz described above within our two-day hackathon. For those of you familiar with the GUI of our analytics solution, a new option “Kepler” was made accessible in our widgets, in addition to “Computation”, “Raw data” and “Raw data on a map”. This option allows integrating the described viz in a few clicks, as shown in the following sequence.

Steps to build a Kepler viz within our own web interface

So, how did we manage to reach this goal? Let’s have a closer look at the magic behind the scenes!

Global picture

Globally speaking, our main goal was to combine the Kepler library, our database engine and our website application.

The link between our database engine and Kepler was relatively easy to establish, as both systems work on the same kind of data: geolocation. The data stored in Intersec databases are ready to use and available through our query APIs, so forwarding them to Kepler was not a big deal.

Integrating the Kepler library into our website codebase required more effort, as described below.

Kepler environment

Kepler is available on NPM and is based on the React/Redux libraries. This is not a framework we use, our stack being based on Backbone.js, Vue.js and TypeScript, bundled by webpack. So, it did not integrate directly into our codebase.

Thankfully, Kepler allows loading all its dependencies, minified, through a CDN. This was the quickest way to set up the targeted environment, and this was the option we chose to save time.

Kepler component

Kepler uses Redux for reactivity, React for rendering.

Data are computed on our backend servers, and are then ready to use on the website. We did not intend to change the data in between, so we did not need reactivity for that, and the Redux part could be partially discarded.

Then, we had to host the rendering workflow in a dedicated widget on our website, as displayed in the image above. Widgets are Intersec analytics display units, designed to be completely agnostic to the underlying library used to display results. This design choice eased the integration with Kepler, confirming the relevance of our agnostic approach.

We also had to deal with the integration of the Kepler component itself. Most of our codebase is based on the Backbone.js framework and a virtual DOM, which does not integrate properly with ReactDOM. Fortunately, here again we were able to capitalize on previous work on similar problems, solving the issue with a deferred rendering method.

Kepler configuration

As seen in the first section of this article, Kepler offers a nice interface for playing with display types, filters, etc. Its customization options allowed us to build the look and feel of each use case we wanted to address. The way we integrated it into our products was to configure each desired use case directly through the Kepler interface, then to replicate the resulting customization within our widget configuration.
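For illustration, the configuration Kepler exports is a JSON document that can be stored alongside a widget. Below is a heavily trimmed, illustrative example of its general shape; the layer and filter contents are placeholders, not our production configuration:

```python
# Illustrative only: a trimmed-down stand-in for a kepler.gl exported
# configuration, as a widget could persist it.
widget_kepler_config = {
    "version": "v1",
    "config": {
        "visState": {
            "filters": [{"dataId": "cells", "name": "timestamp"}],
            "layers": [{"type": "hexagon", "config": {"dataId": "cells"}}],
        },
        "mapState": {"latitude": 48.89, "longitude": 2.24, "zoom": 9},
    },
}
```

Replaying such a document at widget load time reproduces the look and feel configured by hand in the Kepler interface.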

To conclude on this technical part, we can say that Kepler, in addition to the nice viz interface it offers to users, also gives developers tools to ease integration into external solutions such as ours. Even with products based on different technologies, we were able to integrate their interface into our website in two days, which is a great sign of a practical and developer-friendly technology.


From our experience of hackathons, the recipe for a winning topic could be the following:

  • Nice visual outcomes, for the famous “wow effect”
  • Fulfilled technical challenges, to charm technical people
  • Demonstrated usefulness, to prove it is not all about a geeky recreation

After two days of intense effort, we were glad to have met our initial objectives (nice maps, integrated in our products), and believe that the outcome was totally in line with this recipe. It seems that our work also convinced the attendees of our final presentation, as we had the pleasure of simultaneously winning the public vote and the product management special prize!

Congrats to all the team!

75% of the winning team 😉 : Mouna, Yohann and Arthur (unfortunately Pierre-Louis could not stay until the prizes were awarded 🙁 )

Next steps will obviously consist in capitalizing on this work to move towards a more industrialized integration of this type of viz. Stay tuned!

Pierre-Louis Cuny, Yohann Balawender, Arthur Bombarde and Mouna Rhalimi

Final note: for those of you who wonder why we were the “Laws of the Universe” team, a few elements about astrodynamics can be found here! 🙂

Hackathon 0x09 – Monitoring with Prometheus / Grafana


In our products, we use a home-made technology called QRRD (for Quick Round Robin Database) to store monitoring metrics (system CPU/memory monitoring, incoming event flows, …).

QRRD (which is written in C) was actively developed between 2009 and 2013, but we have not invested in it since, so it has not evolved. And even if it is a really great technology (especially in terms of scaling and performance), it has the following drawbacks:

  • The associated visualization tools are really old-fashioned, and not convenient at all.
  • Its data model (which is really close to the graphite one) is limited compared to key-value data models; the difference between both is well explained here.
  • Its support of alerts is very basic and their configuration is complex.
  • It is old and therefore harder and harder to maintain (or develop).
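To make the data-model point concrete, here is a small illustrative comparison (the metric names are made up): in a graphite-like model the dimensions are baked into a dotted name, while a key-value model carries explicit labels that can be selected independently.

```python
# Graphite-like model: the hierarchy is encoded in the metric name, so
# selecting "CPU of all db services" means pattern-matching on names.
graphite_metric = "intersec.host1.db.cpu.usage"

# Key-value (Prometheus-like) model: dimensions are explicit labels.
prometheus_metric = {"name": "cpu_usage",
                     "labels": {"host": "host1", "service": "db"}}

def matches(metric, **labels):
    """Label-based selection, which dotted names cannot express cleanly."""
    return all(metric["labels"].get(k) == v for k, v in labels.items())
```

Selecting by any combination of labels (host only, service only, both) then requires no knowledge of a naming hierarchy.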

With this hackathon subject, our team wanted to explore the possibility of using more modern, standard, fancy and open-source tools as a replacement for QRRD for monitoring our products. We decided to try the Prometheus/Grafana couple:

  • Prometheus is an open-source time-series database, originally built at SoundCloud, and now widely used.
  • Grafana is an open-source visualization and alerting system that can connect to several databases, including Prometheus. It is fancy, powerful and easy to use.

What was done

The first step consisted in trying to send statistics to a Prometheus server from the core binaries of our products. Since our core product is coded in C, we had to use this unofficial third-party C client library. We first had to integrate this library into our build system, and write some helpers around it to make it easy to declare Prometheus metrics in any daemon of our products.

Then, we integrated Prometheus itself into our products, as an external service (i.e. a non-Intersec daemon that is launched and monitored by our product). Every Intersec daemon that uses the Prometheus client library is automatically registered in Prometheus using file-based service discovery, so that it is not necessary to manually update its configuration.

Then it was time to actually implement some metrics in our product. We implemented the following ones, which already existed in QRRD:

  • What we call master-monitor, which is pure system metrics: CPU, memory, network, file descriptors, etc. per host and per service.
  • Metrics about the aggregation chain and data-collection (i.e. incoming data to our product, and ingestion by the database): number of incoming files/events per flow, size of queues and buffers, …
  • Some more “functional” metrics about scenarios: number of scenarios per state, size of the scenario schema.

Finally, we installed Grafana, connected it to the Prometheus source, and wrote some dashboards to display the produced metrics in a beautiful and useful way.


The main challenge was to make the Prometheus C client work in our code. After we had integrated it into our build system and coded dummy metrics for testing, our daemon crashed in the code of the client library as soon as Prometheus tried to scrape the metrics. We spent some time trying to understand what we had done wrong, before realizing that even the test program delivered with the C client library was crashing on our systems (at that time, we were using Debian 9). We noticed that it worked fine on more recent systems, but we did not have time to upgrade our workstations. So we had to set up Debian 10 containers to work in, which was pretty time-consuming.


Here are screenshots of the first monitoring dashboard we built, displaying system monitoring metrics. The selectors at the top of the dashboards allow choosing the platform to monitor and its host:

Host monitoring dashboard - part 1

Host monitoring dashboard - part 2

The second dashboard is the aggregation chain monitoring dashboard. It displays useful information about the incoming data flows in our product.

Aggregation chain dashboard - part 1

Aggregation chain dashboard - part 2

Another dashboard, showing the number of scenarios per state, along with the total size of the scenario schema:

Scenarios dashboard


Finally, we made an alerting dashboard that sends alerts to Slack when the CPU/RAM consumption goes too high, or when no events have been received on the platform for a very long time:

Alerting dashboard

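The logic of such an alert rule boils down to a sustained-threshold check. A toy version in Python (the threshold and sample count are made up; the real rules live in Grafana):

```python
def should_alert(cpu_history, threshold=90.0, sustained=3):
    """Fire when CPU stays above `threshold` for `sustained` consecutive
    samples, so a single spike does not trigger a Slack alert."""
    run = 0
    for value in cpu_history:
        run = run + 1 if value > threshold else 0
        if run >= sustained:
            return True
    return False
```

The "no events received for a long time" alert is the same pattern with a zero-rate condition instead of a high-CPU one.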


This hackathon was the occasion to experiment with a technology other than QRRD to store and display monitoring time series.

What we achieved is quite promising: Prometheus was successfully integrated in our products as a time-series database, and Grafana was used to build monitoring and alerting dashboards.

But of course, these developments are not production-ready. In order to complete them properly, we need at least to:

  • Stabilize and benchmark the Prometheus C client; depending on the results of the benchmark, we might consider writing our own Prometheus client.
  • Migrate more statistics to Prometheus, and build more “smart and ready-to-deploy” Grafana dashboards.
  • Perform tests, write documentation and automate deployment, so that this becomes the standard monitoring solution in future versions.

Introducing auto-formatting in an existing codebase

At Intersec, we aim at maintaining a consistent style throughout the code, depending on the programming language. For example, the C codebase is the most consistent, with our coding rules being enforced during code review. This enforcement, however, can lead to significant time loss and frustration when patches must be repushed to comply with specific rules.

In the past, there have been several attempts at configuring auto-formatting tools. They were never fully satisfactory, because several of our coding rules did not fit into the limited configuration options of these tools. This subject was born out of these attempts. However, instead of focusing on adapting the tools to our coding rules, we considered doing the opposite: what about adapting our coding rules (in particular, some of the more peculiar ones) so that auto-formatting tools could be applied easily? After all, if some of our rules are too specific to exist in popular tools, maybe those rules cause more harm than good to our code.


The first step was Python, which we use extensively: for testing, in our build system, as well as for specific APIs. Python is the language where a specific coding style is the least enforced: our guidelines tell us to follow PEP 8, but this has never been particularly applied, hence a very heterogeneous codebase.

We looked into autopep8, yapf and black. Given the state of our codebase, any of these tools would involve significant changes when applied to all files. We decided to go with the most opinionated one, black, after discussing specific configuration options.

As expected, the changes were significant. A quick wc -l indicated that we had about 200k lines of Python code. Launching black on every Python file resulted in about 56k additions and 34k deletions. These are huge changes: although we are happy with the formatting, they practically guarantee conflicts during branch merges, which may in turn generate significant time loss.

Funnily enough, running black on our whole codebase revealed two errors where the output of black was not idempotent, in old code with dubious style. black was smart enough to detect that and provide the diff, so that the formatting could be fixed by hand.
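The idempotency property black checks can be expressed very simply: formatting already-formatted code must be a no-op. A sketch with a trivial stand-in formatter (not black itself, just to illustrate the check):

```python
def check_idempotent(format_fn, source):
    """A formatter is safe to apply repeatedly only if formatting its own
    output changes nothing; black performs this check itself."""
    once = format_fn(source)
    return format_fn(once) == once

def toy_formatter(src):
    # trivial stand-in: strip trailing whitespace, force a final newline
    return "\n".join(line.rstrip() for line in src.splitlines()) + "\n"
```

When this check fails on real code, as it did twice for us, the tool cannot guarantee a stable result and manual fixing is needed.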


We considered using prettier for JavaScript, but decided to use eslint instead, as we already (happily) use eslint as a linter. By adding a few indentation rules, we can use eslint --fix as a sort of auto-formatter. It is less opinionated than prettier, but fits nicely with our codebase.

For 236k lines of code, running eslint --fix on all JS files resulted in only about 5k modifications, which fall into two main categories:

  • Specific coding rules that did not match with eslint rules.
  • Badly indented code that was fixed.

This is pretty good. The changes are relatively minimal, and this could be used to make sure that the code is properly formatted in new commits. However, it may require some manual reformatting, as some constructs can cause very ugly indentation (for example, declaring an object on multiple lines in a function call aligns the object fields with the opening bracket of the object, which can lead to very deep indentation).

We also have around 120k lines of TypeScript. We are currently in the process of migrating our tslint configuration to eslint which, when finished, will also allow us to format our TypeScript code using eslint.


Our C codebase was the most problematic. It uses Blocks, has grown to use a lot of idiosyncratic syntax, and allows using higher-level data types through many custom macros. When used to them, the coding rules make the code easily readable for us. For a formatting tool, however, not so much. C is notoriously hard to parse, and formatting tools can have a hard time understanding which identifiers are types, or which macros are used to simulate keywords or blocks.

The goal was to use clang-format, which supports Blocks natively. However, the development version was needed to get support for some of the latest features, including whitelisting specific macros as keywords or blocks.

The result is not entirely convincing, though. We triggered a few bugs which break the indentation, and some of the reformatting leaves the code arguably less readable than before. This could possibly be alleviated by modifying our coding rules significantly to improve the formatting, but that would involve a lot of modifications to the whole codebase, which we would rather avoid.

Hackathon 0x09 – lib-common benchmarks

The goal was to develop benchmarks for a few of our core technologies in lib-common, in order to:

  • Be able to compare the performance of our custom implementations with standard implementations.
  • Be able to add automated tests on performance (e.g. adding non-regression tests to ensure that changes which seem to be harmless do not worsen performance).

Benchmark library

The first step was to develop a benchmark library; the success criteria we established were the following (compared to the already existing benchmarks in our code base):

  • Ease writing of benchmarks
  • Standardize output format
  • Allow factorization of the use of external tools using benchmarks

And the result looks like this, on the user side:

    const size_t small_res = 71008;

    ZBENCH(membitcount_naive_small) {
        ZBENCH_LOOP() {
            size_t res = 0;

            ZBENCH_MEASURE() {
                res = membitcount_check_small(&membitcount_naive);
            } ZBENCH_MEASURE_END

            if (res != small_res) {
                e_fatal("expected: %zu, got: %zu", small_res, res);
If you are familiar with lib-common, you can see that it looks very similar to the z test library.
The code is more or less translated as follows:

if (BENCHMARK_RUN_GROUP("bithacks")) {
    const size_t small_res = 71008;

    if (BENCHMARK_RUN("membitcount_naive_small")) {
        for (int i = 0; i < BENCHMARKS_NB_RUNS; i++) {
            size_t res = 0;

            res = membitcount_check_small(&membitcount_naive);

            if (res != small_res) {
                e_fatal("expected: %zu, got: %zu", small_res, res);

First benchmark: printf

The first benchmark we did compared the libc snprintf against our own implementation of snprintf (actually called isnprintf internally).

The result was not what we expected (the chart below shows the duration for one million calls to snprintf):

function    real min (ms)   real max (ms)   real mean (ms)
isnprintf   5.021           6.968           5.536
snprintf    2.123           3.187           2.408

As you can see, our implementation is about two times slower than the standard implementation in libc. So, it might be interesting to use the standard implementation instead of our own.

Unfortunately, we can define and use some custom formatters in our own implementation that are not trivially compatible with the standard implementation.

In conclusion, this is an interesting lead for improving the speed of lib-common, but it needs some rework before we can replace our own implementation.
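As a side note, the min/max/mean methodology used in the tables of this article is easy to reproduce, for instance in Python with timeit (the measured lambda below is just a placeholder workload):

```python
import timeit

def bench(fn, runs=5, number=10000):
    """Run `fn` `number` times, repeated `runs` times, and report the
    min/max/mean wall time in milliseconds, mirroring the columns of
    the benchmark tables."""
    times = [t * 1000 for t in timeit.repeat(fn, repeat=runs, number=number)]
    return {"min": min(times), "max": max(times),
            "mean": sum(times) / len(times)}

stats = bench(lambda: "%d-%s" % (42, "x"))
```

Keeping min, max and mean together helps distinguish a genuinely slower implementation from one-off noise (a high max with a low min).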

IOP packing/unpacking

If you have read one of our previous articles, you already know what IOP is.

Long story short, and for what matters here, we use it as a serialization library. Since it is widely used in our products, we also decided to benchmark serialization and deserialization in binary, JSON, and YAML.

The purpose of this benchmark was not really to compare the performance of packing and unpacking with other implementations, but rather to provide non-regression tests and a comparison between JSON and YAML (indeed, as we could have expected, binary packing and unpacking is a lot faster – it is what is used for communication between daemons).

function        real min (ms)   real max (ms)   real mean (ms)
JSON pack       0.01            0.15            0.011
JSON unpack     0.018           0.192           0.024
binary pack     0.002           0.119           0.002
binary unpack   0.003           0.089           0.003
YAML pack       0.011           1.339           0.016
YAML unpack     0.033           0.328           0.048

We can see that unpacking costs more than packing, which seems normal, and that YAML unpacking seems particularly costly. This is an interesting point to keep in mind for optimization.
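The pack/unpack asymmetry is easy to picture with any text format: packing is mostly string building, while unpacking requires parsing and validating before rebuilding objects. A Python analogy with JSON (this is not IOP, just an illustration of the round trip):

```python
import json

record = {"user": 42, "path": [[2.236, 48.892], [2.238, 48.893]]}

packed = json.dumps(record)    # "pack": serialize to text
unpacked = json.loads(packed)  # "unpack": parse and rebuild objects
```

The same reasoning explains why a more expressive syntax like YAML, which has more to parse, is the costliest to unpack.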


The speed of lib-common is partly due to the optimization of low-level functions. One of these functions is membitcount, which counts the number of bits set in a buffer.

We have four different implementations:

  • membitcount_naive which does the sum of a naive bitcount on each byte of the buffer.
  • membitcount_c which takes into account multiple bytes at once when doing the bitcount of the buffer.
  • membitcount_ssse3 which uses the SSSE3 processor instruction set.
  • membitcount_popcnt which uses the popcnt processor instruction.

The purpose of this benchmark is to check whether the optimized implementations have a real impact on performance, or whether we can keep a more naive implementation that is more readable and maintainable.

Small                real min (ms)   real max (ms)   real mean (ms)
membitcount naive    0.135           1.575           0.173
membitcount c        0.076           0.379           0.086
membitcount ssse3    0.044           0.161           0.051
membitcount popcnt   0.038           0.11            0.044

Big                  real min (ms)   real max (ms)   real mean (ms)
membitcount naive    0.721           11.999          0.896
membitcount c        0.215           12.882          0.248
membitcount ssse3    0.078           0.23            0.092
membitcount popcnt   0.037           0.114           0.044

There are some real differences between the four different implementations, so the optimizations are legitimate. membitcount_popcnt is the fastest one, and it is the one actually used when available.
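For readers who want to play with the idea, the naive and a table-driven strategy transpose directly to Python (a sketch of the principle only, obviously nowhere near the C performance):

```python
def membitcount_naive(buf):
    """Sum a per-byte bitcount over the buffer: the membitcount_naive
    strategy, transposed to Python."""
    return sum(bin(b).count("1") for b in buf)

# Precompute the bitcount of every possible byte value: trading a small
# table for per-byte work, in the spirit of the optimized C variants.
_BITS_PER_BYTE = [bin(b).count("1") for b in range(256)]

def membitcount_table(buf):
    return sum(_BITS_PER_BYTE[b] for b in buf)
```

Both must of course agree on every input; only their cost per byte differs.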


Recently, there have been some discussions about the use of spinlocks in user space. In lib-common, we use some spinlocks in thread jobs.

The purpose of this benchmark is to check whether replacing the spinlocks with mutexes has an impact on the performance of thread jobs.

thrjobs test   real min (ms)        real max (ms)        real mean (ms)
               Mutexes   Spinlock   Mutexes   Spinlock   Mutexes   Spinlock
contention     0.165     0.189      0.427     1.712      0.211     0.32
sort job       26.987    26.885     93.645    79.923     38.372    36.968
sort block     27.027    26.943     88.772    109.533    36.849    38.277
queue          0.039     0.04       2.139     4.068      0.067     0.074
queue syn      0.725     1.289      8.968     4.889      2.737     2.726
wake up thr0   0.4       0.394      0.662     0.755      0.47      0.461
post notify    0.003     0.003      0.06      0.017      0.005     0.004
for each       195.948   182.7      224       224.167    210.577   209.49

The results are not conclusive: we see no visible impact of switching from spinlocks to mutexes in the benchmark.

This might be because:

  • There are no real macroscopic differences between spinlocks and mutexes in these scenarios.
  • The differences are masked by the background noise of the benchmarks, which are not properly designed to test the modification.
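The shape of such a contention benchmark can be sketched in Python with standard mutexes (Python has no user-space spinlock, so this only illustrates the measurement methodology, not the spinlock/mutex comparison itself):

```python
import threading
import time

def bench_lock(lock, n_threads=4, iters=20000):
    """Measure how long n_threads take to perform contended increments
    under a given lock: a toy version of the contention scenario."""
    counter = [0]

    def worker():
        for _ in range(iters):
            with lock:
                counter[0] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed_ms = (time.perf_counter() - start) * 1000
    return counter[0], elapsed_ms
```

Checking the final counter value guards against a broken lock silently producing a "fast" but wrong result.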


Over this hackathon, we developed a new benchmark library, zbenchmark, that is easy to use and standardizes the output of the different benchmarks.

We also adapted and wrote some benchmarks to find future optimizations and avoid regressions.

Although there are still a lot of things to do (write new benchmarks, compare with standard implementations, find optimizations), the work done during this hackathon is promising.

Alexis BRASY & Nicolas PAUSS

Hackathon 0x09 – eBPF

At Intersec, we love new technologies that can improve our daily tasks and our code – and also because it is fun! During Hackathon 0x09, I tested the possibility of using BPF for tracing and debugging our C codebase.

What is BPF?

In the beginning, BPF was a technology used for packet filtering1. For example, when using the command tcpdump -i lo arp, BPF is used to filter ARP packets on the loopback interface. BPF has since been enhanced: the BPF instruction set was extended in Linux kernel 3.15, and the result was called “extended BPF”, or eBPF. Today, “BPF” can be defined as the technology that uses a set of tools to write and execute eBPF code.

So technically speaking, BPF is an in-kernel virtual machine that runs user-supplied programs inside the kernel. The instructions are verified by the BPF verifier, which simulates the execution of the program and checks the stack state, out-of-range accesses, etc. An eBPF program must finish in a bounded time2, so it is not Turing-complete. The program is rejected if the verifier considers it invalid. The virtual machine includes an interpreter and a JIT compiler that generates machine instructions.

An eBPF program can be written in C: LLVM Clang can compile C code into eBPF bytecode, which can then be loaded with the bpf syscall3. However, writing a BPF program this way can be complex. Fortunately, we have the BCC frontend4: BCC is a set of tools to write, load and execute eBPF programs, with Python bindings.

What is it used for?

Another way to explain BPF is that this technology allows us to attach and run small user-supplied programs on a large number of kernel, user-application and library events. It can gather application information, send events, aggregate statistics, and more. It is still used in networking for filtering packets and processing them in the driver of the NIC (XDP5). It is also used in security (seccomp), DDoS mitigation… and observability.

Observability is a way to get insights from the system and the applications, by providing tracing or sampling events. This is what we want to achieve in this hackathon.

There are already a lot of tools available for this purpose: gdb, logs, rr6, strace… but BPF has some advantages. It is lightweight and has little impact on execution, it is programmable, and the BPF verifier guarantees the safety of the system and the application. BPF uses static or dynamic instrumentation and can gather information from the whole system, so it can be applied to libraries, applications and the kernel at runtime. Therefore, BPF is different, and it can be considered a complementary investigation tool.

Purpose of this hackathon

The goal of this hackathon was to evaluate the usability of BPF in our ecosystem.

In this hackathon, I used BPF in three different ways: with USDT (Userland Statically Defined Tracing), kprobes (kernel probes) and uprobes (user probes), using the BCC frontend. As a playground, I used one of Intersec's internal applications: a highly performant distributed shard-based database.

The database (db) can run on multiple hosts, but for convenience and for the exercise, we ran three instances of the db on the same host.

Our first BPF program with USDT

User Statically-Defined Tracing is a convenient way to instrument user-space code, and it allows tracing in production with low overhead. However, a tracing point must be inserted and maintained in the code. There are several ways to insert a USDT probe, for example by using SystemTap7. I used the header provided in the BCC lib, which I incorporated into our common lib8 for the occasion.

Our db can create ‘snapshots’, so let’s try to follow them with a BPF program. During a snapshot, the data of each shard is persisted into files. This is the only modification to the code:

+#include <lib-common/usdt.h>

@@ -740,10 +741,14 @@ uint32_t db_shard_snapshot(db_shard_t *shard);
res = qps_snapshot(shard->qps, data.s, data.len, ^void (uint32_t gen) {
    int sid = shard->sid;
    struct timeval tv_end, tv_diff;

+   /* Static tracepoint. */
+   USDT(db, snapshot, sid);

The first USDT argument (db) is the namespace, the second argument (snapshot) is the name of the probe, and the last argument (sid) exposes the shard id to the BPF program. At Intersec, we use a rewriter to provide a kind of closure in C9; the closure block is introduced by the caret character (^).

A static tracepoint has low overhead: here it adds only a nop instruction to the code.

  0x00000000009267ec <+63>: mov 0x18(%rax),%eax
  0x00000000009267ef <+66>: mov %eax,-0x14(%rbp)
+ 0x00000000009267f2 <+69>: nop
  0x00000000009267f3 <+70>: lea -0x50(%rbp),%rax
  0x00000000009267f7 <+74>: mov %rax,%rdi
  0x00000000009267fa <+77>: callq 0xbeed95 <lp_gettv>

More importantly, it records a probe descriptor in an ELF note section (.note.stapsdt) of the db binary.

> readelf -n db

Displaying notes found in: .note.stapsdt
Owner Data size Description
stapsdt 0x00000032 NT_STAPSDT (SystemTap probe descriptors)
Provider: db
Name: snapshot
Location: 0x00000000009267f2, Base: 0x0000000000000000, Semaphore: 0x0000000000000000
Arguments: -4@-20(%rbp)

Now, we can write our first eBPF program in C:

struct snapshot_key_t {
    int32_t sid;
};
BPF_HASH(snapshot_hash, struct snapshot_key_t, u64);

int snapshot_probe0(struct pt_regs *ctx) {
    struct snapshot_key_t key = {};
    int32_t sid = 0;

    /* Read the first argument of the USDT. */
    bpf_usdt_readarg(1, ctx, &sid);
    /* Send information to user space. */
    bpf_trace_printk("shard :%d\n", sid);
    /* Increment the value of the map for this key. */
    key.sid = sid;
    snapshot_hash.increment(key);
    return 0;
}

It is quite simple: it reads the first argument of the USDT tracepoint and sends this information back to user space through a shared trace pipe. BPF also provides data structures (maps) that allow data to be shared between the kernel and user space. We use a map to count the snapshot occurrences for each shard id.

With the BCC frontend, attaching and executing the probe is as simple as these 3 lines of a Python script:

usdt = USDT(path=<path_to_the_binary>)
usdt.enable_probe(probe="snapshot", fn_name="snapshot_probe0")
bpf = BPF(text=<bpf_program.c>, usdt_contexts=[usdt])

We need the path of the binary or the process id to insert the eBPF program. The kernel replaces the nop with an int3 breakpoint; when this breakpoint is hit, the kernel executes the eBPF program. Then, when we trigger snapshots in the db, the Python script polls events and prints the collected information:

Tracing USDT db::snapshot... Hit Ctrl-C to end.
PID    PROG             TIME             SHARD ID
12717  db               4130.496204000   4
12546  db               4130.591365000   3
12546  db               4131.376820000   3
12717  db               4131.401658000   4
12717  db               4131.595490000   6

As you can see, when I provide only the path to the binary, BPF is able to trace all the processes that use this binary.

We can also read the map and display a summary (snapshot count per shard id):

         1 "6"
         2 "17"
         3 "4"
         3 "3"
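In this summary, the first column is the number of snapshots and the second the shard id. Conceptually, the BPF_HASH map keyed by shard id behaves like a counter; here is a plain-Python illustration of that bookkeeping (the stream of shard ids is made up, and this is not BPF code):

```python
from collections import Counter

# Hypothetical stream of shard ids, one entry per USDT snapshot event,
# i.e. what bpf_usdt_readarg() extracts in the probe.
events = [6, 17, 17, 4, 4, 4, 3, 3, 3]

# The BPF_HASH(snapshot_hash, ...) map amounts to this counter.
snapshot_hash = Counter(events)

for sid, count in snapshot_hash.most_common():
    print(f'{count:>10} "{sid}"')
```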


kprobe

kprobe is a powerful tool to gather system information. kprobe allows us to dynamically instrument nearly any kernel function, without running the kernel in a special mode and without rebooting or recompiling it! The instrumentation sequence is the following: the target instruction is saved and replaced by an int3 or a jmp instruction. When this instruction is hit, the related eBPF program is executed; then the saved instruction is executed and the kernel instruction flow resumes. It is also possible to instrument the return of a kernel function; this is called a kretprobe.

So let’s try with our db. When the db is snapshotting, files are written somewhere; the db thus creates new files, and I can guess without checking our code that it probably uses the open syscall. So, we can trace the do_sys_open kernel function, which is the endpoint of open syscalls. My new eBPF program instruments the do_sys_open function entry (kprobe) and its return (kretprobe). On entry, the eBPF program stores, for each call, some information (filename, flags, process details…) in a dedicated map. On return, if do_sys_open was successful, the information stored in the map for this call is sent to a BPF ring buffer and printed by the Python script.

bpf.attach_kprobe(event="do_sys_open", fn_name="open_entry")
bpf.attach_kretprobe(event="do_sys_open", fn_name="open_return")
bpf["events"].open_perf_buffer(print_open_event, page_cnt=64)
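The entry/return bookkeeping described above is a classic kprobe/kretprobe pattern: stash the arguments at function entry, and emit them only if the return value indicates success. A plain-Python sketch of that pattern (this is an illustration only, not BPF code; all names here are hypothetical):

```python
# Simulation of the kprobe/kretprobe bookkeeping:
# open_entry() stores the call details in a map keyed by thread id,
# open_return() emits an event only when the call succeeded.
infotmp = {}   # plays the role of the BPF_HASH map
events = []    # plays the role of the BPF ring buffer

def open_entry(tid, filename, flags):
    """kprobe side: remember the arguments until the function returns."""
    infotmp[tid] = (filename, flags)

def open_return(tid, retval):
    """kretprobe side: emit the event only on success (fd >= 0)."""
    info = infotmp.pop(tid, None)
    if info is not None and retval >= 0:
        events.append((tid,) + info + (retval,))

open_entry(12546, "00000000.00000009.qpt", 0o100302)
open_return(12546, 4)       # success: event recorded
open_entry(12546, "/nonexistent", 0)
open_return(12546, -2)      # -ENOENT: dropped
print(events)
```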

Here, we also kept the previous USDT tracepoint along with the do_sys_open tracing; the results are printed together:

Tracing USDT db::snapshot... Hit Ctrl-C to end.
12546  db      4099.361961000      3
19793  db                      6              00100101 .lock
19793  db                      4              00100302 00000000.00000009.qpt
12546  db                      7              00101102 01000007:0000000000000071.log

We can note that BPF traces new db processes that were not present when the BPF program was attached: the file persistence is done by a forked process.


uprobe

BPF can also dynamically instrument user applications or libraries. It works the same way as kprobes, except that it traces a file: all running processes backed by this file, and all future processes that use it, are tracked automatically. Thus, if you instrument the libc malloc function, you end up tracing every current and future process that uses this libc’s malloc on the system. This is maybe not the smartest move, because it will trigger a lot of events, but you will be like an omniscient demiurge in your Linux environment. And it is safe to use: if you can trace the kernel, you can trace any application.

This time, we used argdist10, a wrapper that can write the BPF program, attach it, and handle events in one line (see also bpftrace11). The db_log_rotate function is triggered during the snapshot sequence and, as its name indicates, rotates the binary log.
argdist -i 5                                    # print every 5s
    -C 'p:                                      # trace function entry
        ./db:                                   # binary path
        db_log_rotate(db_shard_t* shard):       # function prototype
        uint32_t,uint64_t:                      # value types
        shard->version.epoch,shard->version.version' # print these values
    -z 32
    -I 'r9y.h'                                  # function header


        COUNT      EVENT
        1          shard->version.epoch = 16777220, shard->version.version = 422
        1          shard->version.epoch = 33554441, shard->version.version = 150

The BPF program traces all calls to the db_log_rotate function and prints the shard versions. It is simple and easy to use. I needed a sham header because of some compilation issues between the kernel headers and our lib-common; I tried to handle these compatibility issues, but due to lack of time, this will maybe be done at the next hackathon!

Dynamic instrumentation is quite powerful because you can trace all functions in your application without modifying the code. However, it requires debug information, and it can suffer from interface instability: the traced function may not exist in some versions of the binary, or may be inlined… Nevertheless, it is still very useful, because with BPF you can do much more than just print values as I did.


Conclusion

This hackathon was the occasion to use BPF and to test the possibility of tracing our code base. It was relatively easy, powerful, and fun to use.

However, although BPF nowadays no longer requires a cutting-edge kernel version (> 4.X), some distributions12 are still not compatible with it.

What still remains to be done is the compilation of the eBPF program for compatibility with our lib-common headers and especially with our IOPs13.

  1. could is large and csenix93.pdf []
  2. Kernel 5.3: bounded loop support: []
  3. []
  4. []
  5. []
  6. []
  7. []
  8. []
  9. []
  10. []
  11. []
  12. []
  13. []

Hackathon – 5 years later

During summer 2014 we organized our first hackathon.
The rules are simple and are still up to date: subjects are suggested by anyone in the company, and there is no defined framework nor limit on what they can be, although they often fall into the same categories; I’ll come back to that.

Anyone can show interest in any of the proposed subjects and then work on it. In practice though, participants are mostly people from technical teams. Teams are formed and work on their topic from Thursday morning until Friday 4pm. They can even work late or at night if they wish to.
To make sure our contestants stay in good shape during that period, complimentary breakfasts are served, snacks are available and pizzas are ordered on Thursday evening.

Thursday 7pm, it's pizza time!
On Friday at 4pm, each team presents its work and has exactly 5 minutes to explain what they did. There can be slides and there is often a demo. A vote takes place just after the presentations to nominate the favorite projects, and the three best teams are awarded prizes.

Time for some presentations

Generally, the subjects proposed and chosen fall into three main categories:

  • A killer feature using our products that didn’t make it into our roadmap sooner and that usually comes with a big wow effect (these ones are usually winners).
  • Something useful to improve the way technical teams work.
  • Fun projects that are usually very creative ways of hacking our products or the tools we use everyday.

30 minutes before the presentations, the final touches

In January, we threw our ninth hackathon, and a few things have changed since the first edition. Ludovic, who was previously in charge of the organization, introduced mini-challenges that allow teams to win extra points in exchange for a small amount of their precious time. Although he has left the company, we keep up with this tradition. This year the challenges were as follows:

  1. The first one was a speed test on a mathematical riddle:

    The 91 Intersec employees stand in a circle, holding hands. 81 hold the hand of a man and 20 hold the hand of a woman.
    How many women are there at Intersec?

  2. The second challenge was a programming tournament inspired from with a twist: the random move could only be used for the first move.
  3. The third challenge was also a speed test. The goal was to go to the following riddle website and to tell what was on the image of the fifth page (without cheating):

The quality of the resulting projects has also greatly increased over the editions and all these stories deserve to be told. This is what the articles to come are about.

Improved debugging with rr


To investigate a bug, once it has been reproduced, people will often resort to two methods:

  • Add traces in the code.
  • Use a debugger to instrument the execution of the bug.

The second solution is much more powerful but also quite limited in the types of bugs it can handle:

  • If the instrumentation of the execution goes too far, it is impossible to go in reverse. A new execution is needed, with new breakpoints or watches, but also new addresses, variable values…
  • If the reproduction is random, having to restart the process and reproduce the bug again can be very time consuming.
  • Instrumenting the bug with breakpoints can render the bug non-reproducible.

For those complex bugs, the investigation will thus often end up as a combination of traces and carefully placed assertions to generate core files, making it possible to investigate the state of the program with a debugger.

rr was created to improve debugging for those complex bugs. Its use is very simple:

$ rr record ./my-exec
I am buggy!
$ rr replay
Reading symbols from /home/thiberville/.local/share/rr/my-exec-0/mmap_hardlink_3_my-exec...done.
(rr) continue
I am buggy!

Program received signal SIGKILL, Killed.

An execution is first recorded, and can then be replayed in GDB as many times as needed. As every replay is identical, two interesting properties follow:

  • Reproducibility: every replay is identical, so a random bug can be recorded only once, and then the execution investigated as much as needed. There is no risk to go too far in the execution, and having to reproduce the bug once again to set a different breakpoint.
  • Reverse debugging: When replaying the execution, it is possible to go in reverse: reverse-continue, reverse-next, reverse-step. This is extremely useful.

To show practical examples of how much using rr can speed up debugging, I will present 3 bugs that I investigated, where this tool either helped me save a lot of time, or where I wasted my time only to realize in retrospect how much I could have saved.

Exhibit A: Reverse continuation

Let’s say you have a crash on a function failure:

if (foo(a, &err) < 0) {
    assert (false);
    logger_error("foo failed: %s", err);
    return -1;
}

This function should fill a string-buffer with the description of the error, which should give a good idea of the reason of the crash:

(gdb) print err
$1 = 0x0

Oops, looks like someone forgot to fill the buffer before returning the error. To understand the issue, we now need to find the exact line that returned the error, by stepping inside the execution of foo, until the return -1 is found.

Now imagine this function call is the deserialization of an object, itself containing other objects, arrays, unions, etc. Stepping through the function calls would be a nightmare, as the error may occur, for example, when deserializing the 6th field of the 532nd element in an array. The full path of the invalid field is only available at the end of the call, just before the whole stack is unwound.
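To make the problem concrete, here is a toy Python sketch (the shapes and names are hypothetical, nothing to do with our real deserializer): the failing path is known only in the deepest frame, and once the stack unwinds, the caller is left with a bare -1:

```python
deepest = []   # what a trace placed in the deepest frame would have seen

def unpack_value(path, value):
    """Toy recursive unpack: returns 0 on success, -1 on error."""
    if isinstance(value, list):
        for i, elem in enumerate(value):
            if unpack_value(f"{path}[{i}]", elem) < 0:
                return -1            # the path context is lost here
    elif isinstance(value, dict):
        for name, field in value.items():
            if unpack_value(f"{path}.{name}", field) < 0:
                return -1
    elif value is None:              # the invalid leaf
        deepest.append(path)         # only this frame knows the full path
        return -1
    return 0

obj = {"records": [{"f": 1}] * 531 + [{"f": None}]}
print(unpack_value("obj", obj))      # -1: the caller learns nothing more
print(deepest)                       # ['obj.records[531].f']
```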

One solution would be to add traces on each deserialization function, trying to single out the right element that cannot be deserialized, and iterate to add more traces, …

Another solution is to use rr, which allows stepping in reverse. As the full details of the error are only a few instructions before the assert (only stack unwinding happens between the error and the assert), stepping in reverse several times will reach the error:

$ rr replay
(rr) c

Program received SIGABRT, Killed.
#12 0x0000000001d68d62 in ...
6007            assert (false);
(rr) # step in reverse to unpop all the "return -1"
(rr) rs
unpack_struct at iop.blk:4627
4627    }
(rr) rs
unpack_value at iop.blk:4433
4433    }
(rr) rs
4372        return -1;
(rr) rs
4371    if (field_is_valid(field) < 0) {
(rr) bt
#0  0x0000000002f340cc in unpack_union at iop.blk:4815
#1  0x0000000002f326f4 in unpack_value  at iop.blk:4433
#2  0x0000000002f33485 in unpack_struct at iop.blk:4627
#3  0x0000000002f3277f in unpack_value at iop.blk:4440
#4  0x0000000002f340a0 in unpack_union at iop.blk:4814
#5  0x0000000002f326f4 in unpack_value at iop.blk:4433
#6  0x0000000002f33485 in unpack_struct at iop.blk:4627
#36 0x0000000002f34448 in load_object at iop.blk:4851

Now we can easily find which condition failed, and the backtrace tells us exactly which field is invalid. In my case, it was something like:


Quite horrible to get to with forward stepping.

Exhibit B: deterministic replay

Our products are composed of several daemons that communicate through RPC calls. In one daemon, data that needs to be passed from the RPC call to the callback is saved in an internal memory pool (we call them ic_msg_t). That memory pool factorizes allocations by creating pages, and ic_msg_t objects are just offsets in those pages.

When one message leaks, the page won’t be freed, and this will be detected when shutting down the daemon. Unfortunately, the tools we usually use for memory leaks cannot help us in this situation, as they will only point towards the allocation of the whole page. Using rr as seen in the first example will not help us either: we can inspect the page that leaked, but we do not know the offset of the object that caused the page to leak.

However, in that case, the reproducibility of an rr trace can help us. Let’s start by tracing all the ic_msg_t creations and deletions:

static ic_msg_t *ic_msg_init(ic_msg_t *msg)
{
    e_error("XXX create message %p", msg);
    /* ... */
}

static void ic_msg_delete(ic_msg_t **msg)
{
    if (!msg) {
        return;
    }
    e_error("XXX delete message %p", *msg);
    /* ... */
}

Let’s now run the leaking test:

$ rr record ./my-daemon 2>&1 | tee traces.txt
XXX create message 0x7fb5035987d0
XXX delete message 0x7fb5035987d0
XXX create message 0x7fb503591400
XXX create message 0x7fb503591948

As the offsets in the allocated page can be reused, the same address can be used for different messages. To find the one that leaks, we need an address that appears an odd number of times (as the messages are created in one unique pool, the same address can be re-used by another message):

# Find out the address that was used an odd-number of times
$ grep XXX traces.txt | cut -d\  -f4 | sort | uniq -c | awk '$1 % 2'
3 0x7fb503598b60
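The same odd-occurrence analysis as the shell pipeline can be done in a few lines of Python (the trace lines here are made up for the example):

```python
from collections import Counter

# Hypothetical trace lines in the "XXX create/delete message <addr>" format.
traces = """\
XXX create message 0x7fb5035987d0
XXX delete message 0x7fb5035987d0
XXX create message 0x7fb503598b60
XXX delete message 0x7fb503598b60
XXX create message 0x7fb503598b60
""".splitlines()

# Same logic as `grep XXX | cut -d' ' -f4 | sort | uniq -c | awk '$1 % 2'`:
# a leaked message is an address seen an odd number of times.
counts = Counter(line.split()[3] for line in traces if "XXX" in line)
leaked = [addr for addr, n in counts.items() if n % 2]
print(leaked)  # → ['0x7fb503598b60']
```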

We now know the leaking address. As replaying the traces with rr does not change the addresses, we can replay the execution; the last allocation at this address is the leaking one:

$ rr replay
(rr) # Run until the end, then go in reverse, to find the last alloc
(rr) c

Program received signal SIGKILL, Killed.
(rr) # Break on the allocation line
(rr) b iop-rpc-channel.blk:334 if msg == 0x7fb503598b60
(rr) rc

Thread 1 hit Breakpoint 1, ic_msg_init(msg=0x7fb503598b60) at iop-rpc-channel.blk:334
334         e_error("XXX create message %p", msg);
(rr) # Now we can find who leaks the message thanks to the backtrace
(rr) bt
#0  ic_msg_init (msg=0x7fb503598b60) at iop-rpc-channel.blk:334
#1  ic_msg_new () at iop-rpc-channel.blk:341
#2  _mw_query_cb (...) at platform-mw.blk:3593

In that case, a function call in _mw_query_cb had a side effect on a variable used to free the message, leading to the leak.

How do you solve this with only GDB?

  • You cannot use the address of the leaking message as it will change on every execution.
  • You can add more traces to distinguish the different callers of ic_msg_init (i.e., a trace in _mw_query_cb), but you need an idea of where the leak is for this to be viable.
  • You can also count the number of calls of ic_msg_init before the leaking one, and add a static counter to abort right on the leaking call. This however will not work as soon as the number of calls can vary, which is the case when running a test that involves multiple communications with different daemons.

In that case, we could not avoid adding traces to investigate the issue, but the recording of the execution made sure that those traces could be reused when investigating the execution in depth.

Exhibit C: multiple processes

The last example will show how combining reverse execution, reproducibility and recordings of multiple processes allows debugging some very capricious bugs.

Let’s first get into a bit of detail about the bug to investigate, which involves multiple processes. Two daemons will interest us for this bug: the miner and the EP. The miner is the daemon responsible for processing computations. When computations produce results, it sends them to the EP and waits for an ACK on these results.

This works well, but in some very rare cases, the miner may softlock, apparently waiting for an ACK from the EP that never comes. Looking through logs, a disconnection/reconnection between the two daemons seems to be one trigger for the bug.

The bug can be reproduced, but it is extremely racy. The process goes like this:

  1. Start the product
  2. Start a computation that should last a few tens of seconds.
  3. SIGSTOP the miner during the computation
  4. wait for a disconnect with the EP
  5. SIGCONT the miner
  6. If not triggered, goto 2

After many tries, the right way to trigger the bug turned out to be to trigger the disconnect right before the end of the computation. The timeframe to send the SIGSTOP is very small, and several tries are often needed to re-trigger the bug.

Now that the stage is set, let’s think about how to debug this:

  • Trying to generate cores to investigate afterwards would not work: it is not possible to know exactly when the bug is triggered, and the state of the processes afterwards is useless: both processes will be in IDLE state, with the details of their communications destroyed. A live execution is needed.
  • Attaching the processes to gdb could be done, but with the issues already listed above:
    • any breakpoints will risk destroying the precise timings required to trigger the issue.
    • if the breakpoint goes too far, the bug needs to be triggered once more, with an earlier breakpoint.
  • Traces, more traces.

With rr, I only needed to reproduce the bug once. First the program was run while instrumenting the two processes:

INSTRUMENT="miner:rr ep:rr" ./master

Here, the master will start all other daemons, and instrument those mentioned in this env variable.

Then, after several tries, the bug was reproduced, and the processes can be stopped, then replayed through rr.

$ rr replay
Reading symbols from /home/thiberville/.local/share/rr/miner-0/mmap_hardlink_0_miner...done.

# In another terminal
$ rr replay ~/.local/share/rr/ep-0
Reading symbols from /home/thiberville/.local/share/rr/ep-0/mmap_hardlink_0_ep...done.

The two processes synchronize with the truncate RPC, which uses a (fid, fseqid) pair as the synchronization point. The fid is an incremental number referring to a file, and the fseqid is the record number in this file. Let’s first find out the last truncate received by the miner:

# Miner
(rr) c

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000002 in ?? ()
(rr) c

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) c

Thread 1 received signal SIGINT, Interrupt.
0x0000000070000002 in ?? ()
(rr) b ep_enabler_truncate_impl
Breakpoint 1 at 0x578af6: file enabler.blk, line 627.
(rr) rc

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) rc

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000000 in ?? ()
(rr) rc

Thread 1 hit Breakpoint 1,...
627     ...
(rr) p arg
$1 = {
    fid = 8,
    fseqid = 6548860,

Synchronization is based on the pair (fid, fseqid).

# EP
(rr) b ep.blk:887 if dest_sync.fid == 8 && dest_sync.fseqid == 6548860
Breakpoint 2 at 0x49a837: file platform/bigdata/ep.blk, line 887.
(rr) c

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) d 1
(rr) b ep.blk:887
(rr) c

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) c

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) c

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) c

Thread 1 received signal SIGINT, Interrupt.
0x0000000070000002 in ?? ()
(rr) rc

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) p dest_sync
$1 = {
    fid = 8,
    fseqid = 7992094,

Because of the desynchronization, 3 RPCs sent by the EP are never received by the miner, the last one being 8:7992094. On reconnection, another RPC is used to resynchronize the two daemons:

# Miner
(rr) b sync_with_ep
Breakpoint 2 at ...
(rr) c

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000002 in ?? ()
(rr) c

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) c

Thread 1 hit Breakpoint 2, sync_with_ep ...
(rr) p *seqid
$2 = {
    fid = 8,
    fseqid = 7992094

We can see that when the two daemons reconnect, the EP re-provides the last fseqid that it acked, so the bug is probably in the miner. Looking at the code, we can see that when the miner receives a synchronization point from the EP, it checks whether the fseqid is the last one for the associated fid. If it is, it means all records from the corresponding file have been acked, and it thus removes the file. Let’s check if that is the case here:

# Miner
(rr) p last_offset_per_file
$3 = vector of fq_pos_t (len: 0)

This vector is supposed to contain the last fseqid for each fid, but it is empty here. That explains why the miner softlocks: it never removes the file. It should not be empty, however, which means it has been incorrectly cleaned:

# Miner
(rr) watch last_offset_per_file.len
(rr) rc

Thread 1 hit Hardware watchpoint 3: last_offset_per_file.len

Old value = 0
New value = 1
0x0000000000b10c30 in vector_reset ...
27          vec->len = 0;
(rr) bt
#0  0x0000000000b10c30 in vector_reset ...
#1  0x000000000057901f in ep_disconnected () at enabler.blk:683

The vector is reset on disconnection from the EP, losing information that is needed on reconnection. Removing the reset fixes the issue. Another bug was actually hidden behind it, but I will not get into it, as it would not show anything more about what rr brings:

  • Analysis of recording of multiple processes, allowing tracing communications between them.
  • Reverse stepping, with breakpoints, watchpoints, …


rr has been a huge timesaver since I started using it regularly. The ability to step in reverse, as well as the reproducibility, means that I often need only a single recording of a bug to find its cause, compared to many iterations of traces and gdb sessions.

The execution-time cost of using rr is relatively small (almost invisible in my tests). However, rr suffers from the same limitations as other runtime wrappers such as valgrind, the main one being that it emulates a single thread of execution. For a race between threads, rr won’t be able to help you.

Intersec Object Packer – Part 1 : the basics

This post is an introduction to a useful tool here at Intersec, a tool that we call IOP: the Intersec Object Packer.

IOP is our take on the IDL approach. It is a method to serialize structured data to use in our communication protocols or data storage technologies. It is used to transmit data over the network in a safe manner, to exchange data between different programming languages or to provide a generic interface to store (and load) C data on disk. IOP provides data integrity checking and backward-compatibility.

The concept of an IDL is not new. There are a lot of different languages available, such as Google Protocol Buffers or Thrift. IOP itself isn’t new either: its initial version was written in 2008 and it has seen a lot of evolutions during its almost decade-long life. However, IOP has proven itself solid and sufficiently well designed to avoid any backward-incompatible change during that period.

IOP package description

The first thing to do with IOP is to declare the data structures in the IOP description language. With those definitions, our IOP compiler will automatically create all the helpers needed to use these IOP data structures in different languages and to allow serialization and deserialization.

Data structure declaration is done in a C-like syntax (actually, it is almost the D language syntax) and lives inside a .iop file. As a convention, we use CamelCase in our .iop files (which differs from our .c files coding rules).

Let’s look at a quick example:

struct User {
    int    id;
    string name;
};

Here we are: an IOP object with two fields, an id (as an integer) and a name (as a string). Obviously, it is possible to create much more complex structures. To do so, here is the list of available types for our structure fields.

Basic types

IOP allows several low-level types to be used to define object members. One can use the classics:

  • int/uint (32-bit signed/unsigned integer)
  • long/ulong (64-bit signed/unsigned integer)
  • byte/ubyte (8-bit signed/unsigned integer)
  • short/ushort (16-bit signed/unsigned integer)
  • bool
  • double
  • string

and also the types:

  • bytes (a binary blob)
  • xml (for an XML payload)
  • void (to specify a lack of data).

Complex types

Four complex data types are also available for our fields.


Structure

The structure describes a record containing one or more fields. Each field has a name and a type. To see what it looks like, let’s add an address to our user data structure:

struct Address {
    int    number;
    string street;
    int    zipCode;
    string country;
};

struct User {
    int     id;
    string  name;
    Address address;
};

Of course, there is no theoretical limit on the number of struct “levels”: a struct can have a struct field which itself contains a struct field, etc.


Class

A class is an extendable structure type. A class can inherit from another class, creating a new type that adds new fields to those present in its parent class.

We will see classes in more details in a separate article.


Union

A union is a list of possibilities. Its description is very similar to a structure’s: it has typed fields, but only one of them is defined at a time. The name union is inherited from C, since the concept is very similar to C unions; however, IOP unions are tagged, which means we always know which field is defined.


union MyUnion {
    int    wantInt;
    string wantString;
    User   wantUser;
};


Enum

The last type that can be used is the enumeration. Here again, an enum is similar to the C enum: it defines several literal keys associated with integer values. Just like the C enum, the IOP enum supports the whole integer range for its values.


enum MyEnum {
    VALUE_1 = 1,
    VALUE_2 = 2,
    VALUE_3 = 3,
};

Member constraints

Now that we have all the types we need for our custom data structures, it’s time to add some features to their fields, in order to gain flexibility. These features, called constraints, are qualifiers for IOP fields. For now, there are 4 different constraints: optional, repeated, with a default value, and the implicit mandatory constraint.


Mandatory members

By default, a member of an IOP structure is mandatory. This means it must be set to a valid value in order for the structure instance to be valid. In particular, you must guarantee the field is set before serializing/deserializing the object. In the generated C structure, mandatory members are value fields: the value is inlined in the structure type and is copied. There are however some exceptions to this rule that we will see later.

The example is pretty simple:

struct Foo {
    int mandatoryInteger;
};

Optional members

An optional member is indicated by a ? following the data type. The packers/unpackers allow these members to be absent without generating an error.

struct Foo {
    int? optionalMember;
    Bar? optionalMember2;
    int  mandatoryInteger;
};

Repeated members

A repeated member is a field that can appear zero or more times in the structure (often represented by an array in programming languages). As such, a repeated field is optional (it can be present 0 times). A repeated member is indicated by “[]” following the data type.

In the next example, you can consider the repeatedInteger field as a list of integers.

struct Foo {
    int[] repeatedInteger;
    int?  optionalMember;
    Bar?  optionalMember2;
    int   mandatoryInteger;
};

With default value

A field with a default value is a kind of mandatory member that is allowed to be absent. When the member is absent, the packer/unpacker sets it to its default value.

A member with a default value is indicated by setting the default value after the field declaration.

struct Foo {
    int   defaultInteger = 42;
    int[] repeatedInteger;
    int?  optionalMember;
    Bar?  optionalMember2;
    int   mandatoryInteger;
};

Moreover, arithmetic expressions are allowed on integer (and enum) member types, like this:

struct Foo {
    int   defaultInteger = 2 * (256 << 20) + 42;
    int[] repeatedInteger;
    int?  optionalMember;
    Bar?  optionalMember2;
    int   mandatoryInteger;
};

IOP packages

The last thing to know before writing our first IOP file is packages.

An IOP file corresponds to an IOP package. Basically, the package is a kind of namespace for the data structures you declare. The filename must match the package name. Every IOP file must define its package name like this:

package foo; /*< package name of the file foo.iop */

struct Foo {
    /* ... */
};

A package can also be a sub-package, like this:

package foo.bar; /*< package name of the file foo/bar.iop */

struct Bar {
    /* ... */
};

Finally, you can import objects from another package by specifying the package name before the type:

package plop; /*< package name of the file plop.iop */

struct Plop {
    foo.bar.Bar bar;
};


How to use IOP

Before going to more complicated features on IOP, let’s see a simple example of how to use our new custom data structures that we just declared.

When compiling our code, a first pass is done on the IOP files using our own compiler. It parses the .iop files and generates the corresponding C source files, which provide helpers to serialize/deserialize our data structures. Here again, we will see this in more detail soon 🙂

Let’s see an example of code which is using IOP. First, let’s assume we have declared a new IOP package:

package user;

struct UserAddress {
    string street;
    int?   zipCode;
    string city;
};

struct User {
    ulong       id = 1;
    string      login;
    UserAddress addr;
};

This will create several C files containing the type descriptors used for data serialization/deserialization, as well as the C type declarations:

struct user__user_address__t {
    char*     street;  /*< Actually a slightly more complicated type is used for
                        *  strings, but no need to be too specific here :) */
    opt_i32_t zip_code;
    char*     city;
};

struct user__user__t {
    uint64_t                     id;
    char*                        login;
    struct user__user_address__t addr;
};

Not very different from the IOP file, right? Still, we can notice a few uncommon things:

  • The opt_i32_t type for zip_code. This is how we handle optional fields: it is a structure containing a 32-bit integer plus a boolean indicating whether the field is set.
  • The structure names are now in snake_case instead of camelCase. The package name is added as a prefix to each structure, and there is a __t suffix too. This helps us recognize IOP structures when we meet one in our C code.

All the code generated by our compiler will be available through a user.iop.h file.

Now let’s play with it in our code:

#include <assert.h>
#include "user.iop.h"


int my_func(void) {
    user__user__t user;

    /* This function initializes all the fields (and sub-fields) of the
     * structure, according to the IOP declarations. Here, everything is set
     * to 0/NULL except the field "id", which contains the value "1". The first
     * argument indicates the package + structure name of our IOP object. */
    iop_init(user__user, &user);

    /* This function packs our IOP structure into an IOP binary format and
     * returns a pointer to the created buffer containing the packed structure.
     * The structure is packed in order to use as little memory as possible.
     * Let's put aside the memory management questions for this post. */
    void *binary_data = iop_bpack(user__user, &user);

    /* This call must have failed: our constraints are not respected, as
     * several mandatory fields were not correctly filled. */
    assert(binary_data == NULL);

    user.addr.street = "221B Baker Street";
    user.addr.city   = "London";
    user.login = "SH";

    binary_data = iop_bpack(user__user, &user);

    /* This one should be the good one. Even though the "id" and
     * "addr.zip_code" fields are not filled, it is not a problem, as the
     * first one has a default value and the second one is optional. */
    assert(binary_data != NULL);

    /* Now we can do whatever we want with these data (writing them to disk
     * for example). But for now, let's just try to unpack them. Here again,
     * keep a blindfold on about memory management. */
    user__user__t user2;
    int res = iop_bunpack(binary_data, user__user, &user2);

    /* Unpacking should have been successful, and we now have a "user2"
     * struct identical to the "user" struct. */
    assert(res >= 0);

    return 0;
}

Here we are: IOP gave us the superpower of packing/unpacking data structures into a binary format in two simple function calls. These binary packed structures can be used for disk storage, but as we will see in a future article, we also use them for our network communications.

Next time, we will talk about inheritance for our IOP objects!

Middleware (or how do our processes speak to one-another)

About multi-process programming

In modern software engineering, you quickly reach the point where one process cannot handle all the tasks by itself. For performance, maintainability or reliability reasons, you have to write multi-process programs. You may also reach the point where you want your software products to talk to each other. This situation raises the question: how will my processes “talk” to each other?

If you have already written such programs in C, you are probably familiar with the concept of network sockets. Sockets are handy (at least compared to dealing with the TCP/IP layer yourself): they offer an abstraction layer and give you endpoints for sending and receiving data from one process to another. But some issues quickly arise:

  • How to handle many-to-many communications?
  • How to scale the solution?
  • How to have a clean code that doesn’t have to handle many direct connections and painful scenarios like disconnection/re-connection?
  • How can I handle safely all the corner cases with blocking/non-blocking reads/writes?

Almost every developer or company has its own way to answer those questions, usually by developing libraries responsible for inter-process communication.

Of course, we do have our own solution too 🙂

So let’s take a look at what we call MiddleWare, our abstraction layer that handles communication between our processes and software instances.

What is MiddleWare ?

At Intersec, raw sockets were quickly replaced by a first abstraction layer called ichannels. These channels basically simplify the creation of sockets, but we still deal with point-to-point communication. So we started the development of MiddleWare, inspired by the work of iMatix on ØMQ.

First, let’s see how things were done before MiddleWare:



As you can see, every daemon or process had to open a direct connection to the other daemon it wanted to talk to, which leads to the issues described above.

Now, after the introduction of our MiddleWare layer:


So what is MiddleWare about? MiddleWare offers an abstraction layer for developers: with it, there is no need to manage connections or handle scenarios such as disconnection/re-connection anymore. We now communicate with services or roles, not with processes or daemons.

MiddleWare is in charge of finding where the receiver is located and routing the message accordingly.

This solves many of the problems we were talking about earlier: the code of a daemon focuses on the applicative part, not on the infrastructure / network management part. It is now possible to have many-to-many communications (sending a message to N daemons implementing the same role) and the solution is scalable (no need to create multiple direct connections when adding a new service).

Services vs roles

MiddleWare is able to do service routing and/or role routing. A service is basically a process; the user can specify a host identifier and an instance identifier to get a channel to a specific instance of a service.

Processes can also expose roles: a role is a contract that associates a name with a duty and an interface. For example, "DB:master" can be the role of the master of the database, the one which can write to it, whereas "DB:slave" can be the role of a slave of the database, which holds a read-only replica of it. One can also imagine a "User-list:listener" role, which would allow registering a callback for any user-list update.

Roles dissociate processes from their purpose, and allow the software to be extended by adding new roles to existing processes at run time. Roles can be associated with a constraint (for example, “unique” in a cluster/site).

Those roles can also be attached to a module, as described in one of our previous posts. As modules can easily be rearranged, this adds another layer of abstraction between the code and the actual topology of the software.

Some examples from the API

What does an API for such a feature look like?

As described above, one of the main ideas of MiddleWare is to ease inter-process communication handling, and let developers focus on the applicative part of their work. So it’s important to have very few steps to use the “basic” features: create a role if needed, create a channel, use it, and handle replies.

So first of all, let’s take a look at the creation of a channel:

mw_channel_t *chan = mw_new_channel_to_service(product__db, -1, -1, false);

And here you are, no need to do more: no connection management, no need to look for the location of the service and the right network address in the product configuration. A simple function call gives you a mw_channel_t pointer you can use to send messages. The first argument is what we call a service at Intersec (as said above, it is basically a process); here we just want a channel to our DB service. The second and third arguments indicate a host identifier and an instance identifier, in case we want to target a specific instance of this service. Here, we want a channel that targets all the available instances of the DB service, so we specify -1 as both host and instance ids. Finally, the last argument indicates whether a direct connection is needed, but we will come back to this later.

Now let’s see some roles. Processes can register/unregister a role with that kind of API:

mw_register_role("db:slave");
Pretty simple, isn’t it? All you need to do is give a name to your role. If we want to use a more complex role, with a unique-in-cluster constraint, we have another function to do so:

mw_register_unique_role("db:master", role_cb);

The only difference is the need for a callback, which takes as arguments the name of the role and an enum value representing the status of the role. The callback is called when the role is granted to a process by MiddleWare: the new owner gets the MW_ROLE_OWNER status in its callback, while the others get the MW_ROLE_TAKEN value.

On the client side, if you want to reach a process through its role, all you have to do is:

mw_channel_t *chan = mw_new_channel_to_role("db:master", false);

And chan can now be used to send messages to the process which registered the "db:master" role.

How does this (wonderful) functionality work?

The key to MiddleWare is its routing tables. But to understand how they work, I need to introduce another concept of our product at Intersec: the master process. No doubt it will ring a bell, as it is a common design pattern.

In our product, a single process is responsible for launching every sub-process and for monitoring them. This process is called the master process. It does not do much, but our products could not work without it. It detects when one of its children goes down and relaunches it if needed. It also handles communications with other software instances.

Now that you know what a master is in our Intersec environment, let’s go back to MiddleWare and its routing tables.

By default, the routing is done by our master process: every message is transmitted to the master which forwards it to the right host and then the right process.

The master maintains routing tables in order to be resilient to network connectivity issues. Those routing tables are built using a path-vector-like algorithm.

So let’s take a look at another picture, which shows the communication in more detail:


As we can see, MiddleWare opens connections between every master process and its children. There are also connections between the masters. From the developer’s standpoint, this is completely transparent. One can ask for a channel from the Core daemon to the Connector one, or a channel between the two Computation daemons for example, and then start sending/receiving messages on these channels. MiddleWare routes these messages from the child lib to the master on the same host, then to the master on the receiving host, which finally transfers them to the destination process.

In case you expect a large amount of data to go through a channel, it is still possible to ask for a direct connection to a process during the creation of that channel. MiddleWare will still handle all the connection management complexity, and from that point, everything will work exactly the same. Note that in our implementation we never have the guarantee that a message will go through a direct link, as MiddleWare will still route the queries through the master if the direct link is not ready yet. However, every communication from one service to another will use the direct link as soon as it exists.


Having such a layer in a software does not come without drawbacks. The use of MiddleWare creates an overhead introduced by the abstraction cost: the routing table creation adds a bit of traffic each time a process starts or stops, or when roles are registered or unregistered.

As start-up and shutdown are not critical parts of the execution for us, it is fine to have a small overhead there. In the same way, role registrations are not frequent, so it is not an issue to add some operations during this step.

Finally, high traffic may put some load on our master process, which must route the messages. Not a big issue either, as our master does not do much besides message routing: its main responsibility is to monitor its children, with no complex calculations or time-consuming operations. Moreover, if heavy traffic is expected between two daemons, it is good practice to ask for a direct link. This decreases the load on the master, and therefore the risk of impacting MiddleWare.

C Modules

We write complex software, and like everyone doing so, we need a way to structure our source code. Our choice was to go modular. A module is a singleton that defines a feature or a set of related features, and exposes some APIs for other modules to access these features. Everything else, including the implementation details, is private. Examples of modules are the RPC layer, the threading library, the query engine of a database, the authentication engine… We can then compose the various daemons of our application by including the corresponding modules and their dependencies.

Most modules maintain an internal state, and as a consequence they have to be initialized and deinitialized at some point in the lifetime of the program. An internal rule at Intersec is to name the constructor {module}_initialize() and the destructor {module}_shutdown(). Once defined, these functions have to be called, and this is where everything becomes complicated when your program has tens of modules with complex dependencies between them.
