Improved debugging with rr

Introduction

Once a bug has been reproduced, people will often resort to one of two methods to investigate it:

  • Add traces in the code.
  • Use a debugger to instrument the execution of the bug.

The second solution is much more powerful but also quite limited in the types of bugs it can handle:

  • If the instrumented execution goes past the point of interest, it is impossible to go back. A new execution is needed, with new breakpoints or watchpoints, but also new addresses, new variable values…
  • If the reproduction is random, having to restart the process and reproduce the bug again can be very time consuming.
  • Instrumenting the bug with breakpoints can render the bug non-reproducible.

For those complex bugs, the investigation thus often ends up combining traces with carefully placed assertions that generate core files, so that the state of the program can then be inspected with a debugger.

rr was created to improve debugging for those complex bugs. Its use is very simple:

$ rr record ./my-exec
I am buggy!
$ rr replay
Reading symbols from /home/thiberville/.local/share/rr/my-exec-0/mmap_hardlink_3_my-exec...done.
done.
(rr) continue
Continuing.
I am buggy!

Program received signal SIGKILL, Killed.
(rr)

An execution is first recorded, and can then be replayed in GDB, as many times as needed. As every replay is identical, two interesting properties follow:

  • Reproducibility: every replay is identical, so a random bug only needs to be recorded once, and the execution can then be investigated as much as needed. There is no risk of going too far in the execution and having to reproduce the bug once more to set an earlier breakpoint.
  • Reverse debugging: When replaying the execution, it is possible to go in reverse: reverse-continue, reverse-next, reverse-step. This is extremely useful.

To show practical examples of how much using rr can speed up debugging, I will present three bugs that I investigated, where this tool either helped me save a lot of time, or where I wasted my time only to realize in retrospect how much I could have saved.

Exhibit A: Reverse continuation

Let’s say you have a crash on a function failure:

if (foo(a, &err) < 0) {
    assert (false);
    logger_error("foo failed: %s", err);
    return -1;
}

This function should fill a string-buffer with the description of the error, which should give a good idea of the reason of the crash:

(gdb) print err
$1 = 0x0

Oops, looks like someone forgot to fill the buffer before returning the error. To understand the issue, we now need to find the exact line that returned the error, by stepping inside the execution of foo, until the return -1 is found.

Now imagine this function call is the deserialization of an object, itself containing other objects, arrays, unions, etc. Stepping through the function calls would be a nightmare, as the error may occur, for example, when deserializing the 6th field of the 532nd element in an array. The full path of the field that is invalid is only available at the end of the call, just before the whole stack is unwound.

One solution would be to add traces on each deserialization function, trying to single out the right element that cannot be deserialized, and iterate to add more traces, …

Another solution is to use rr, which allows stepping in reverse. As the full details of the error are only a few instructions before the assert (only stack unwinding happens between the error and the assert), stepping in reverse several times will reach the error:

$ rr replay
(rr) c
Continuing.

Program received SIGABRT, Killed.
#12 0x0000000001d68d62 in ...
6007            assert (false);
(rr) # step in reverse to unpop all the "return -1"
(rr) rs
unpack_struct at iop.blk:4627
4627    }
(rr) rs
unpack_value at iop.blk:4433
4433    }
...
(rr) rs
4372        return -1;
(rr) rs
4371    if (field_is_valid(field) < 0) {
(rr) bt
#0  0x0000000002f340cc in unpack_union at iop.blk:4815
#1  0x0000000002f326f4 in unpack_value  at iop.blk:4433
#2  0x0000000002f33485 in unpack_struct at iop.blk:4627
#3  0x0000000002f3277f in unpack_value at iop.blk:4440
#4  0x0000000002f340a0 in unpack_union at iop.blk:4814
#5  0x0000000002f326f4 in unpack_value at iop.blk:4433
#6  0x0000000002f33485 in unpack_struct at iop.blk:4627
...
#36 0x0000000002f34448 in load_object at iop.blk:4851

Now we easily find which condition failed, and the backtrace can tell us exactly which field is invalid. In my case, it was something like:

input[2].query.select.cond.expr.andOp.conditions[3].orOp.conditions[0]

Quite horrible to get to with forward stepping.

Exhibit B: deterministic replay

Our products are composed of several daemons that communicate through RPC calls. In one daemon, data that needs to be passed from the RPC call to the callback is saved in objects we call ic_msg_t, which are allocated in an internal memory pool. That memory pool factorizes allocations by creating pages, and ic_msg_t objects are just offsets in those pages.

When one message leaks, the page won’t be freed, and this will be detected when shutting down the daemon. Unfortunately, the tools we usually use for memory leaks cannot help us in this situation, as they will only point towards the allocation of the whole page. Using rr as seen in the first example will not help us either: we can inspect the page that leaked, but we do not know the offset of the object that caused the page to leak.
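
To picture why the leak is invisible to standard tools, here is a hedged sketch of the kind of page-based pool described above; all names and sizes are illustrative and do not reflect the actual implementation. Individual messages are carved out of a big page, so only the page itself looks like an allocation to a leak detector:

#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the real ic_msg_t. */
typedef struct ic_msg_t { char payload[128]; } ic_msg_t;

/* A page from which messages are carved out at increasing offsets. */
typedef struct msg_page_t {
    size_t  used;            /* offset of the next free slot in data[] */
    size_t  nb_live;         /* number of live messages in the page    */
    uint8_t data[64 << 10];  /* the page seen by leak detectors        */
} msg_page_t;

static ic_msg_t *msg_page_alloc(msg_page_t *page)
{
    ic_msg_t *msg;

    if (page->used + sizeof(ic_msg_t) > sizeof(page->data)) {
        return NULL; /* page is full, a new page would be needed */
    }
    msg = (ic_msg_t *)&page->data[page->used];
    page->used += sizeof(ic_msg_t);
    page->nb_live++;
    return msg;
}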

However, in that case, the reproducibility of an rr trace can help us. Let’s start by tracing all the ic_msg_t creations and deletions:

static ic_msg_t *ic_msg_init(ic_msg_t *msg)
{
     e_error("XXX create %p", msg);
     ...
}

static void ic_msg_delete(ic_msg_t **msg)
{
    if (!msg) {
        return;
    }
    e_error("XXX delete %p", *msg);
    ...
}

Let’s now run the leaking test:

$ rr record ./my-daemon 2>&1 | tee traces.txt
XXX create message 0x7fb5035987d0
XXX delete message 0x7fb5035987d0
XXX create message 0x7fb503591400
XXX create message 0x7fb503591948
...

As the offsets in the allocated page can be reused, the same address will be used for different messages. To find the one that leaks, we need to look for an address with an odd number of traces: a properly released message produces one create and one delete trace, while the leaked one only produces a create, even if its address was later re-used by other messages:

# Find out the address that was used an odd-number of times
$ grep XXX traces.txt | cut -d\  -f4 | sort | uniq -c | awk '$1 % 2'
3 0x7fb503598b60

We now know the leaking address. As replaying the traces with rr will not change the addresses, we can replay the execution, and the last allocation of this address will be the leaking one:

$ rr replay
(rr) # Run until the end, then go in reverse, to find the last alloc
(rr) c

Program received signal SIGKILL, Killed.
(rr) # Break on the allocation line
(rr) b iop-rpc-channel.blk:334 if msg == 0x7fb503598b60
(rr) rc
Continuing.

Thread 1 hit Breakpoint 1, ic_msg_init(msg=0x7fb503598b60) at iop-rpc-channel.blk:334
334         e_error("XXX create message %p", msg);
(rr) # Now we can find who leaks the message thanks to the backtrace
(rr) bt
#0  ic_msg_init (msg=0x7fb503598b60) at iop-rpc-channel.blk:334
#1  ic_msg_new () at iop-rpc-channel.blk:341
#2  _mw_query_cb (...) at platform-mw.blk:3593
...

In that case, a function call in _mw_query_cb had a side effect on a variable used to free the message, leading to the leak.

How do you solve this with only GDB?

  • You cannot use the address of the leaking message as it will change on every execution.
  • You can add more traces to distinguish the different callers of ic_msg_init (i.e., a trace in _mw_query_cb), but you need an idea of where the leak is for this to be viable.
  • You can also count the number of calls of ic_msg_init before the leaking one, and add a static counter to abort right on the leaking call (sketched below). However, this will not work as soon as the number of calls can vary, which is the case when running a test that involves multiple communications with different daemons.
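
As a sketch of that last approach (purely illustrative: the call rank 1234 is a made-up number that would have to be deduced from a previous trace run), the counter would look like this:

static ic_msg_t *ic_msg_init(ic_msg_t *msg)
{
    static int nb_calls;

    if (++nb_calls == 1234) {
        abort();  /* dump a core right on the suspected leaking allocation */
    }
    e_error("XXX create message %p", msg);
    ...
}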

In that case, we could not avoid adding traces to investigate the issue, but the recording of the execution made sure that those traces could be reused when investigating the execution in depth.

Exhibit C: multiple processes

The last example will show how combining reverse execution, reproducibility and recordings of multiple processes allows debugging some very capricious bugs.

Let’s first get into a bit of detail about the bug to investigate, which involves multiple processes. Two daemons will interest us for this bug: the miner and the EP. The miner is the daemon responsible for processing computations. When computations produce results, it sends them to the EP, and waits for an ACK on these results.

This works well, but in some very rare cases, the miner may softlock, apparently waiting for an ACK from the EP that never comes. Looking through logs, a disconnection/reconnection between the two daemons seems to be one trigger for the bug.

The bug can be reproduced, but it is extremely racy. The process goes like this:

  1. Start the product
  2. Start a computation that should last a few tens of seconds.
  3. SIGSTOP the miner during the computation
  4. wait for a disconnect with the EP
  5. SIGCONT the miner
  6. If not triggered, goto 2

After many tries, the right way to trigger the bug turned out to be to trigger the disconnect right before the end of the computation. The timeframe to send the SIGSTOP is very small, and several tries are often needed to re-trigger the bug.

Now that the stage is set, let’s think about how to debug this:

  • Trying to generate cores to investigate afterwards would not work: it is not possible to know exactly when the bug is triggered, and the state of the processes afterwards is useless, as both processes will be in IDLE state with the details of their communications destroyed. A live execution is needed.
  • Attaching the processes to gdb could be done, but with the issues already listed above:
    • any breakpoints will risk destroying the precise timings required to trigger the issue.
    • if the breakpoint goes too far, the bug needs to be triggered once more, with an earlier breakpoint.
  • Traces, more traces.

With rr, I only needed to reproduce the bug once. First the program was run while instrumenting the two processes:

INSTRUMENT="miner:rr ep:rr" ./master

Here, the master will start all the other daemons, and instrument those mentioned in this environment variable.

Then, after several tries, the bug was reproduced, and the processes could be stopped, then replayed through rr.

$ rr replay
Reading symbols from /home/thiberville/.local/share/rr/miner-0/mmap_hardlink_0_miner...done.
done.
(rr)

# In another terminal
$ rr replay ~/.local/share/rr/ep-0
Reading symbols from /home/thiberville/.local/share/rr/ep-0/mmap_hardlink_0_ep...done.
done.
(rr)

The two processes synchronize with the truncate RPC, which uses a (fid, fseqid) pair as the synchronization point. The fid is an incremental number referring to a file, and the fseqid is the record number in this file. Let’s first find out the last truncate received by the miner:

# Miner
(rr) c
[logs...]

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000002 in ?? ()
(rr) c
Continuing.

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) c
Continuing.
[...]

Thread 1 received signal SIGINT, Interrupt.
0x0000000070000002 in ?? ()
(rr) b ep_enabler_truncate_impl
Breakpoint 1 at 0x578af6: file enabler.blk, line 627.
(rr) rc
Continuing.

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) rc
Continuing.

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000000 in ?? ()
(rr) rc
Continuing.

Thread 1 hit Breakpoint 1,...
627     ...
(rr) p arg
$1 = {
    fid = 8,
    fseqid = 6548860,
}

Synchronization is based on the (fid, fseqid) pair, so let’s look for this exact point on the EP side:

# EP
(rr) b ep.blk:887 if dest_sync.fid == 8 && dest_sync.fseqid == 6548860
Breakpoint 2 at 0x49a837: file platform/bigdata/ep.blk, line 887.
(rr) c
Continuing.
[logs...]

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) d 1
(rr) b ep.blk:887
(rr) c
Continuing.

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) c
Continuing.

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) c
Continuing.

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) c

Thread 1 received signal SIGINT, Interrupt.
0x0000000070000002 in ?? ()
(rr) rc

Thread 1 hit Breakpoint 1, ...
887     query(enabler, ..., &dest_sync);
(rr) p dest_sync
$1 = {
    fid = 8,
    fseqid = 7992094,
}

Because of the desynchronization, three RPCs sent from the EP are never received by the miner, the last one being on 8:7992094. On reconnection, another RPC is used to resynchronize the two daemons:

# Miner
(rr) b sync_with_ep
Breakpoint 2 at ...
(rr) c
Continuing.

Thread 1 received signal SIGSTOP, Stopped (signal).
0x0000000070000002 in ?? ()
(rr) c
Continuing.

Thread 1 received signal SIGCONT, Continued.
0x0000000070000002 in ?? ()
(rr) c
Continuing.

Thread 1 hit Breakpoint 2, sync_with_ep ...
(rr) p *seqid
$2 = {
    fid = 8,
    fseqid = 7992094
}

We can see that when the two daemons are reconnected, the EP re-provides the last fseqid that it acked, so the bug is probably in the miner. Looking at the code, we can see that when the miner receives a synchronization point from the EP, it checks whether the fseqid is the last one for the associated fid. If it is, it means all records from the corresponding file have been acked, and it thus removes the file. Let’s check if that is the case here:

# Miner
(rr) p last_offset_per_file
$3 = vector of fq_pos_t (len: 0)

This vector is supposed to contain the last fseqid for each fid, but it is empty here. That explains why the miner then softlocks: it never removes the file. It should not be empty, however, which means it has been incorrectly cleaned:

# Miner
(rr) watch last_offset_per_file.len
(rr) rc
Continuing.

Thread 1 hit Hardware watchpoint 3: last_offset_per_file.len

Old value = 0
New value = 1
0x0000000000b10c30 in vector_reset ...
27          vec->len = 0;
(rr) bt
#0  0x0000000000b10c30 in vector_reset ...
#1  0x000000000057901f in ep_disconnected () at enabler.blk:683

The vector is reset on disconnection with the EP, losing information that is needed when reconnecting. Removing the reset fixes this issue. Another bug was actually hidden behind it, but I will not get into it, as it won’t show anything more about what rr brings:

  • Analysis of recording of multiple processes, allowing tracing communications between them.
  • Reverse stepping, with breakpoints, watchpoints, …

Conclusion

rr has been a huge time saver since I started using it regularly. The ability to step in reverse, as well as the reproducibility, means that I often only need a single recording of a bug to find its cause, compared to many iterations of traces as well as the use of gdb.

The execution time cost of using rr is relatively small (almost invisible in my tests). However, rr does suffer from the same kind of limitations as other runtime wrappers such as valgrind, the main one being that it does not run threads in parallel: the whole execution is serialized as if on a single core. For a race between threads, rr won’t be able to help you.

Intersec Object Packer – Part 1: the basics

This post is an introduction to a useful tool here at Intersec, a tool that we call IOP: the Intersec Object Packer.

IOP is our take on the IDL approach. It is a method to serialize structured data to use in our communication protocols or data storage technologies. It is used to transmit data over the network in a safe manner, to exchange data between different programming languages or to provide a generic interface to store (and load) C data on disk. IOP provides data integrity checking and backward-compatibility.

The concept of IDL is not new. There are a lot of different languages available, such as Google Protocol Buffers or Thrift. IOP itself isn’t new either: its initial version was written in 2008 and it has seen a lot of evolutions during its almost decade-long life. However, IOP has proven itself solid and sufficiently well designed not to have required any backward-incompatible change during that period.

IOP package description

The first thing to do with IOP is to declare the data structures in the IOP description language. With those definitions, our IOP compiler will automatically create all the helpers needed to use these IOP data structures in different languages and to allow serialization and deserialization.

Data structure declaration is done in a C-like syntax (actually, it is almost the D language syntax) and lives inside a .iop file. As a convention, we use CamelCase in our iop files (which differs from our .c files coding rules).

Let’s look at a quick example:

struct User {
    int    id;
    string name;
};

Here we are. An IOP object with two fields: an id (as an integer) and a name (as a string). Obviously, it is possible to create much more complex structures. To do so, here is the list of available types for our structure fields.

Basic types

IOP allows several low-level types to be used to define object members. One can use the classics:

  • int/uint (32-bit signed/unsigned integer)
  • long/ulong (64-bit signed/unsigned integer)
  • byte/ubyte (8-bit signed/unsigned integer)
  • short/ushort (16-bit signed/unsigned integer)
  • bool
  • double
  • string

and also the types:

  • bytes (a binary blob)
  • xml (for an XML payload)
  • void (to specify a lack of data).

Complex types

Four complex data types are also available for our fields.

Structures

The structure describes a record containing one or more fields. Each field has a name and a type. To see what it looks like, let’s add an address to our user data structure:

struct Address {
    int    number;
    string street;
    int    zipCode;
    string country;
};

struct User {
    int     id;
    string  name;
    Address address;
};

Of course, there is no theoretical limitation on the number of struct “levels”. A struct can have a struct field which also contains a struct field etc.

Classes

A class is an extendable structure type. A class can inherit from another class, creating a new type that adds new fields to the ones present in its parent class.

We will see classes in more detail in a separate article.

Unions

A union is a list of possibilities. Its description is very similar to a structure’s: it has typed fields, but only one of the fields is defined at a time. The name union is inherited from C since the concept is very similar to C unions; however, IOP unions are tagged, which means we do know which of the fields is defined.

Example:

union MyUnion {
    int    wantInt;
    string wantString;
    User   wantUser;
};

Enumeration

The last type that can be used is the enumeration. Here again, an enum is similar to the C enum: it defines several literal keys associated with integer values. Just like the C enum, the IOP enum supports the whole integer range for its values.

Example:

enum MyEnum {
    VALUE_1 = 1,
    VALUE_2 = 2,
    VALUE_3 = 3,
};

Member constraints

Now that we have all the types we need for our custom data structure fields, it’s time to add some new features to them, in order to gain flexibility. Those features are called constraints. These constraints are qualifiers for IOP fields. For now, we have 4 different constraints: optional, repeated, with a default value and the implicit mandatory constraint.

Mandatory

By default, a member of an IOP structure is mandatory. This means it must be set to a valid value in order for the structure instance to be valid. In particular, you must guarantee the field is set before serializing/deserializing the object. By default, mandatory fields are value fields in the generated C structure: this means the value is inlined in the structure type and is copied. There are however some exceptions to this rule, but we will see that later.

The example is pretty simple:

struct Foo {
    int mandatoryInteger;
};

Optional members

An optional member is indicated by a ? following the data type. The packers/unpackers allow these members to be absent without generating an error.

struct Foo {
    int? optionalMember;
    Bar? optionalMember2;
    int  mandatoryInteger;
};

Repeated members

A repeated member is a field that can appear zero or more times in the structure (often represented by an array in programming languages). As such, a repeated field is optional (it can be present zero times). A repeated member is indicated by a “[]” following the data type.

In the next example, you can consider the repeatedInteger field as a list of integers.

struct Foo {
    int[] repeatedInteger;
    int?  optionalMember;
    Bar?  optionalMember2;
    int   mandatoryInteger;
};

With default value

A field with a default value is a kind of mandatory member that is allowed to be absent. When the member is absent, the packer/unpacker always sets the member to its default value.

A member with a default value is indicated by setting the default value after the field declaration.

struct Foo {
    int   defaultInteger = 42;
    int[] repeatedInteger;
    int?  optionalMember;
    Bar?  optionalMember2;
    int   mandatoryInteger;
};

Moreover, it is allowed to use arithmetic expressions on integer (and enum) member types like this:

struct Foo {
    int   defaultInteger = 2 * (256 << 20) + 42;
    int[] repeatedInteger;
    int?  optionalMember;
    Bar?  optionalMember2;
    int   mandatoryInteger;
};

IOP packages

The last thing to know before writing our first IOP file concerns packages.

An IOP file corresponds to an IOP package. Basically, the package is a kind of namespace for the data structures you are declaring. The filename must match the package name. Every IOP file must define its package name like this:

package foo; /*< package name of the file foo.iop */

struct Foo {
    [...]
};

[...]

A package can also be a sub-package, like this:

package foo.bar; /*< package name of the file foo/bar.iop */

struct Bar {
    [...]
};

[...]

Finally, you can import objects from another package by specifying the package name before the type:

package plop; /*< package name of the file plop.iop */

struct Plop {
    foo.bar.Bar bar;
};

[...]

How to use IOP

Before going to more complicated features on IOP, let’s see a simple example of how to use our new custom data structures that we just declared.

When compiling our code, a first pass is done on our IOP files using our own compiler. This compiler parses the .iop files and generates the corresponding C source files that provide helpers to serialize/deserialize our data structures. Here again, we will see this in more detail soon 🙂

Let’s see an example of code which is using IOP. First, let’s assume we have declared a new IOP package:

package user;

struct UserAddress {
    string street;
    int?   zipCode;
    string city;
};

struct User {
    ulong       id = 1;
    string      login;
    UserAddress addr;
};

This will create several C files containing the type descriptors used for data serialization/deserialization, as well as the C type declarations:

struct user__user_address__t {
    char*     street;  /*< Actually a slightly more complicated type is used for
                        *  strings, but no need to be too specific here :)
                        */
    opt_i32_t zip_code;
    char*     city;
};

struct user__user__t {
    uint64_t                     id;
    char *                       login;
    struct user__user_address__t addr;
};

Not very different from the IOP file, right? We can still notice some uncommon things:

  • The opt_i32_t type for zip_code. This is how we handle optional fields: it is a structure containing a 32-bit integer plus a boolean indicating whether the field is set or not (a possible layout is sketched after this list).
  • The structure names are now in snake_case instead of CamelCase. The name of the package is added as a prefix to each structure, and there is a __t suffix too. This helps to recognize IOP structures when we meet one in our C code.
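
As a hedged sketch (field names are illustrative, the actual definition may differ), opt_i32_t could look like this:

#include <stdbool.h>
#include <stdint.h>

typedef struct opt_i32_t {
    int32_t v;         /* the value, meaningful only when has_field is true */
    bool    has_field; /* whether the optional field was set                */
} opt_i32_t;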

All the code generated by our compiler will be available through a user.iop.h file.

Now let’s play with it in our code:

#include "user.iop.h"

[...]

int my_func(void) {
    user__user__t user;

    /* This function will initialize all the fields (and sub-fields) of the
     * structure, according to the IOP declarations. Here, everything will be set
     * to 0/NULL except the field "id", which will contain the value "1". The first
     * argument indicates the package + structure name of our IOP object.
     */
    iop_init(user__user, &user);

    /* This function will pack our IOP structure into an IOP binary format and
     * returns a pointer to the created buffer containing the packed structure.
     * The structures will be packed in order to use as little memory as possible.
     * Let's put aside the memory management questions for this post.
     */
    void *binary_data = iop_bpack(user__user, &user);

    /* This call must fail: our constraints are not respected, as several
     * mandatory fields were not correctly filled.
     */
    assert(binary_data == NULL);

    user.addr.street = "221B Baker Street";
    user.addr.city   = "London";
    user.login = "SH";

    binary_data = iop_bpack(user__user, &user);

    /* This one should be the good one. Even if "id" field and "addr.zip_code" are
     * not filled, it is not a problem as the first one got a default value and
     * the second one is an optional field.
     */
    assert(binary_data != NULL);

    /* Now we can do whatever we want with these data (writing it on disk for
     * example). But for now, let's just try to unpack it. Here again, let's
     * keep ignoring memory management.
     */
    user__user__t user2;
    int res = iop_bunpack(binary_data, user__user, &user2);

    /* Unpacking should have been successful, and we now have a "user2" struct
     * identical to "user" struct.
     */
    assert(res >= 0);
    return 0;
}

Here we are. IOP gave us the superpower of packing/unpacking data structures in a binary format in two simple function calls. These binary packed structures can be used for disk storage. But as we will see in a future article, we also use them for our network communications.

Next time, we will talk about inheritance for our IOP objects!

Middleware (or how do our processes speak to one-another)

About multi-process programming

In modern software engineering, you quickly reach the point where one process cannot handle all the tasks by itself. For performance, maintainability or reliability reasons, you have to write multi-process programs. One can also reach the point where different pieces of software need to speak to each other. This situation raises the question: how will my processes “talk” to each other?

If you have already written such programs in C, you are probably familiar with the concept of network sockets. Those are handy (at least compared to dealing with the TCP/IP layer yourself): they offer an abstraction layer and let you have endpoints for sending and receiving data from one process to another. But quickly, some issues arise:

  • How to handle many-to-many communications?
  • How to scale the solution?
  • How to have clean code that doesn’t have to handle many direct connections and painful scenarios like disconnection/re-connection?
  • How can I handle safely all the corner cases with blocking/non-blocking reads/writes?

Almost every developer or company has its own way of answering those questions, with the development of libraries responsible for communication between processes.

Of course, we do have our own solution too 🙂

So let’s take a look at what we call MiddleWare, our abstraction layer that handles communication between our processes and software instances.

What is MiddleWare?

At Intersec, the sockets were quickly replaced by a first abstraction layer called ichannels. These channels basically simplify the creation of sockets, but we still deal with point-to-point communication. So we started the development of MiddleWare, inspired by the work of iMatix on ØMQ.

First, let’s see how things were done before MiddleWare:

[Figure: before MiddleWare, each daemon opens direct connections to every daemon it talks to]

As you can see, every daemon or process had to open a direct connection to the other daemons it wanted to talk to, which leads to the issues described above.

Now, after the introduction of our MiddleWare layer:

[Figure: with MiddleWare, daemons communicate through the MiddleWare layer instead of direct connections]

So what is MiddleWare about? MiddleWare offers an abstraction layer for developers. With it, there is no need to manage connections and handle scenarios such as disconnection/re-connection anymore. We now communicate with services or roles, not with processes or daemons.

MiddleWare is in charge of finding where the receiver is located and routing the message accordingly.

This solves many of the problems we were talking about earlier: the code of a daemon focuses on the applicative part, not on the infrastructure / network management part. It is now possible to have many-to-many communications (sending a message to N daemons implementing the same role) and the solution is scalable (no need to create multiple direct connections when adding a new service).

Services vs roles

MiddleWare is able to do service routing and/or role routing. A service is basically a process; the user can specify a host identifier and an instance identifier to get a channel to a specific instance of a service.

Processes can also expose roles: a role is a contract that associates a name with a duty and an interface. Example: "DB:master" can be a role of the master of the database, the one which can write to it, whereas "DB:slave" can be a role for a slave of the database, which has a read-only replica of it. One can also imagine a "User-list:listener" role, for example, which allows registering a callback for any user-list update.

Roles dissociate the processes from the purpose and allow extensibility of the software by allowing run-time addition of new roles in existing processes. Roles can be associated with a constraint (for example “unique” in a cluster/site).

Those roles can also be attached to a module, as described in one of our previous posts. As modules can be easily rearranged, this adds another layer of abstraction between the code and the actual topology of the software.

Some examples from the API

What does an API for such a feature look like?

As described above, one of the main ideas of MiddleWare is to ease inter-process communication handling, and let developers focus on the applicative part of what they are doing. So it’s important to have very few steps to use the “basic” features: create a role if needed, create a channel, use it, and handle replies.

So first of all, let’s take a look at the creation of a channel:

mw_channel_t *chan = mw_new_channel_to_service(product__db, -1, -1, false);

And here you are, no need to do more: no connection management, no need to look for the location of the service and the right network address in the product configuration. A simple function call gives you a mw_channel_t pointer you can use to send messages. The first argument is what we call a service at Intersec (as said above, it is basically a process). Here we just want to have a channel to our DB service. The second and third arguments indicate a host identifier and an instance identifier, if we want to target a specific instance of this service. Here, we just want a channel that targets all the available instances of the DB service, by specifying -1 as both the host and instance ids. Finally, the last argument indicates whether a direct connection is needed or not, but we will come back to this later.

Now let’s see some roles. Processes can register/unregister a role with this kind of API:

mw_register_role("db:master");
mw_unregister_role("db:master");

Pretty simple, isn’t it? All you need to do is give a name to your role. If we want to use a more complex role, with a unique-in-cluster constraint, we do have another function to do so:

mw_register_unique_role("db:master", role_cb);

The only difference is the need for a callback, which takes as arguments the name of the role and an enum value. This enum represents the status of the role. The callback will be called when the role is granted to a process by MiddleWare: the new owner gets the MW_ROLE_OWNER status in its callback, the others get the MW_ROLE_TAKEN value.
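
As a hedged sketch of what such a callback could look like (the enum type name and the exhaustive list of its values are assumptions; only MW_ROLE_OWNER and MW_ROLE_TAKEN come from the description above):

typedef enum mw_role_status_t {
    MW_ROLE_OWNER, /* the role has been granted to this process */
    MW_ROLE_TAKEN, /* the role is owned by another process      */
} mw_role_status_t;

static void role_cb(const char *role, mw_role_status_t status)
{
    if (status == MW_ROLE_OWNER) {
        /* we own "db:master": start accepting writes */
    } else {
        /* someone else owns the role: stay in read-only mode */
    }
}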

On the client side, if you want to talk to this role, all you have to do is:

mw_channel_t *chan = mw_new_channel_to_role("db:master", false);

And chan can now be used to send messages to our process which registered the "db:master" role.

How does this (wonderful) functionality work?

The key to MiddleWare is its routing tables. But to understand how it works, I need to introduce another concept of our product at Intersec: the master process. No doubt it will ring a bell, as it is a common design pattern.

In our product, a single process is responsible for launching every sub-process and for monitoring them. This process is called the master process. It does not do much, but our products could not work without it. It detects when one of its children goes down and relaunches it if needed. It also handles communication with other software instances.

Now that you know what a master is in our Intersec environment, let’s go back to MiddleWare and its routing tables.

By default, the routing is done by our master process: every message is transmitted to the master which forwards it to the right host and then the right process.

The master maintains routing tables in order to be resilient to network connectivity issues. Those routing tables are built using a path-vector-like algorithm.

So let’s take a look at another picture which shows the communication in more detail:

[Figure: MiddleWare routing, with connections between each master process and its children, and between the masters of different hosts]

As we can see, MiddleWare opens connections between every master process and its children. There are also connections between each master. From the developer’s standpoint, this is completely transparent. One can ask for a channel from the Core daemon to the Connector one, or a channel between the two Computation daemons for example, and then start to send/receive messages on these channels. MiddleWare will route these messages from the child lib to the master on the same host, then to the master on the receiving host, to finally transfer them to the destination process.

In case you expect a large amount of data to go through a channel, it is still possible to ask for a direct connection to a process during the creation of that channel. MiddleWare will still handle all the connection management complexity and, from that point, everything will work exactly the same. Note that in our implementation we never have the guarantee that a message will go through a direct link, as MiddleWare will still route the queries through the master if the direct link is not ready yet. Moreover, every communication from a service to another will use the direct link as soon as it exists.

Tradeoffs

Having such a layer in a software does not come without some drawbacks. The use of MiddleWare creates an overhead introduced by the abstraction cost: the routing table creation adds a bit of traffic each time a process starts or stops, or when roles are registered or unregistered.

As start-up and shutdown are not critical parts of the execution for us, it is fine to have a small overhead here. In the same way, role registrations are not frequent, so it is not an issue to add some operations during this step.

Finally, high traffic may put some load on our master process, which must route the messages. Not a big issue either, as our master does not do much besides message routing. The main responsibility of this process is to monitor its children; there are no complex calculations or time-consuming operations here. Moreover, if heavy traffic is expected between two daemons, it is good practice to ask for a direct link. This decreases the load on the master and therefore the risk of impacting MiddleWare.

C Modules

We do write complex software, and like everyone doing so, we need a way to structure our source code. Our choice was to go modular. A module is a singleton that defines a feature or a set of related features and exposes some APIs in order for other modules to access these features. Everything else, including the details of the implementation, is private. Examples of modules are the RPC layer, the threading library, the query engine of a database, the authentication engine, … Then, we can compose the various daemons of our application by including the corresponding modules and their dependencies.

Most modules maintain an internal state and, as a consequence, they have to be initialized and deinitialized at some point in the lifetime of the program. An internal rule at Intersec is to name the constructor {module}_initialize() and the destructor {module}_shutdown(). Once defined, these functions have to be called, and this is where everything becomes complicated when your program has tens of modules with complex dependencies between them.
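
As a minimal sketch of that naming convention (the "auth" module and its state are purely illustrative; the real declaration helpers are not shown here):

#include <stdbool.h>

static struct {
    bool initialized;
    /* module private state goes here */
} auth_g;

void auth_initialize(void)
{
    auth_g.initialized = true;
}

void auth_shutdown(void)
{
    auth_g.initialized = false;
}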

Continue reading

Blocks rewriting with clang

Introduction

Back in 2009, Snow Leopard was quite an exciting OS X release. It didn’t focus on new user-visible features but instead introduced a handful of low-level technologies. Two of those technologies, Grand Central Dispatch (a.k.a. GCD) and OpenCL, were designed to help developers benefit from the new computing power of modern computer architectures: multicore processors for the former and GPUs for the latter.

Alongside the GCD engine came a C language extension called blocks. Blocks are the C-based flavor of what is commonly called a closure: a callable object that captures the context in which it was created. The syntax for blocks is very similar to the one used for functions, with the exception that the pointer star * is replaced by a caret ^. This allows inline definition of callbacks, which can often help improve the readability of the code.

#include <stdio.h>
#include <stdlib.h>

static void call_blk(void (^blk)(const char *str))
{
    blk("world");
}

int main(int argc, char *argv[])
{
    int count = argc > 1 ? atoi(argv[1]) : 1;

    call_blk(^(const char *str) {
        printf("Hello %s from %s!\n", str, argv[0]);
    });
    call_blk(^(const char *str) {
        for (int i = 0; i < count; i++) {
             printf("%s!\n", str);
        }
    });
    return 0;
}

Blocks are a desirable feature

Standard C does not contain closures. GCC supports nested functions that are closures to some extent. However, nested functions cannot escape their definition scope, and therefore cannot be used in asynchronous code.

As a consequence, continuations in C are often implemented as a pair of a callback function and a context. The context contains the variables needed by the callback to continue the execution of the program. Thus, the type of the context is specific to the callback. That’s why APIs that use continuations usually take a pair of a function and a void * pointer, the latter being given back to the function when it gets called. This is very flexible, but deeply unsafe: since using void * forbids any type checking by the compiler, the code can easily be broken during a refactoring, allowing a mismatch between the data expected by the callback and the actual content of the context it receives.

For example, this is how we could implement the previous example using callbacks instead of blocks:

#include <stdio.h>
#include <stdlib.h>

static void call_cb(void (*cb)(const char *str, void *ctx), void *ctx)
{
    (*cb)("world", ctx);
}

static void say_hello_from(const char *str, void *ctx)
{
    const char *from = ctx;

    printf("Hello %s from %s!\n", str, from);
}

static void repeat(const char *str, void *ctx)
{
    int count = (int)(intptr_t)ctx;

    for (int i = 0; i < count; i++) {
        printf("%s!\n", str);
    }
}

int main(int argc, char *argv[])
{
    int count = argc > 1 ? atoi(argv[1]) : 1;

    call_cb(&say_hello_from, argv[0]);
    call_cb(&repeat, (void *)(intptr_t)count);
    return 0;
}

On the other hand, blocks are type-safe. The compiler checks that the programmer is passing a block of the correct type and then identifies the variables from the parent scopes that are used by the block (we say those variables are captured by the block). It then automatically generates the code that creates and forwards the context. This ensures the context is always correct.

Blocks provide a safer and more concise syntax for a very common pattern.

How it works

In practice, the compiler does almost the same thing as we did with the callback and context. It goes a bit further, though, by putting the pointer to the function in the context; that way, a block is a single object that contains a pointer to the function and the associated data. Here is a second take on our rewrite of our example in plain (blockless) C. This second approach uses something very similar to what the compiler does with blocks:

#include <stdio.h>
#include <stdlib.h>

struct take_str_block {
    void (*cb)(void *ctx, const char *str);
};

static void call_cb(const struct take_str_block *blk)
{
    (*blk->cb)(blk, "world");
}

struct say_hello_from_block {
    void (*cb)(void *ctx, const char *str);
    /* Captured variables */
    const char *from;
};

static void say_hello_from(void *ctx, const char *str)
{
    const struct say_hello_from_block *blk = ctx;

    printf("Hello %s from %s!\n", str, from);
}

struct repeat_block {
    void (*cb)(void *ctx, const char *str);
    /* Captured variables */
    int count;
};

static void repeat(void *ctx, const char *str)
{
    const struct repeat_block *blk = ctx;

    for (int i = 0; i < blk->count; i++) {
        printf("%s!\n", str);
    }
}

int main(int argc, char *argv[])
{
    int count = argc > 1 ? atoi(argv[1]) : 1;

    {
        struct say_hello_from_block ctx = {
            .cb = &say_hello_from,
            .from = argv[0]
        };
        call_cb((struct take_str_block *)&ctx);
    }
    {
        struct repeat_block ctx = {
            .cb = &repeat,
            .count = count
        };
        call_cb((struct take_str_block *)&ctx);
    }
    return 0;
}

Here, the block objects are structures that extend the block type expected by the function call_cb. This ensures the function receives an object it can manage and that each block implementation can add its own variables in its context. As you can see in this example, both say_hello_from and repeat extend the take_str_block type with their captured variables. With that approach, a cast is needed to downgrade the extended structure to the base type, but there is no required cast for captured variables anymore (no more (intptr_t) cast to pass an int into a void *).

The actual ABI for blocks is a bit more complex since it must deal with shared modifications of captured variables, block persistency, … but the base principles are there.

Why we needed a rewriter

So, we wanted blocks in our code base. Back in 2010, clang was the only compiler supporting blocks within the Linux environment ((though Apple’s GCC fork supported them too)). However, at that time, clang was still a very young project and it just didn’t fit our needs in terms of optimizations compared to GCC. On top of that, our software products ran (and have always been running) on RedHat Enterprise Linux, so using a compiler supported by RHEL was highly desirable. Unfortunately, this was not the case for clang, not even in its most recent releases.

Thankfully, clang contained a rewriting infrastructure that let a custom pass manipulate the AST ((the AST, Abstract Syntax Tree, is a representation of the source code in the form of a tree which is much easier to manipulate than raw text)). Even more interestingly, clang came with a built-in Objective-C to C++ rewriter… and since Objective-C includes blocks and C++ does not, that rewriter already did the work of rewriting blocks to C++.

However, our code is written in plain C (well, GNU C99 in fact), and we actually use the C99 features. If C++ were a strict superset of C, rewriting our code to C++ wouldn’t be an issue. Sadly, C++ is not strictly backward compatible with C, and some syntax, like the designated initializer syntax I used in the previous code sample, is unavailable in C++. As a consequence, rewriting our code to C++ is not acceptable as it would forbid the use of handy C99 features. This was (and still is) a no-brainer: we needed a rewriter from C with blocks to C supported by GCC; that way, we could both have blocks in C and use GCC to compile the code:

[Figure: Clang rewriter toolchain]

Thanks to the existing rewriter, this was not too hard ((you can still find our original discussion about this on the clang mailing list)). Before starting to detail our process, let’s see what the actual output of the original Objective-C to C++ rewriter was. The following code is what we obtained by running clang -cc1 -rewrite-objc on the program of the introduction of the article (for the sake of clarity, this excludes the large generated preamble):

struct __block_impl {
  void *isa;
  int Flags;
  int Reserved;
  void *FuncPtr;
};

#include <stdio.h>
#include <stdlib.h>

static void call_blk(void (*blk)(const char *str))
{
    ((void (*)(struct __block_impl *, const char *))((struct __block_impl *)blk)->FuncPtr)((struct __block_impl *)blk,
 "world");
}

struct __main_block_impl_0 {
  struct __block_impl impl;
  struct __main_block_desc_0* Desc;
  char **argv;
  __main_block_impl_0(void *fp, struct __main_block_desc_0 *desc, char **_argv, int flags=0) : argv(_argv) {
    impl.isa = &_NSConcreteStackBlock;
    impl.Flags = flags;
    impl.FuncPtr = fp;
    Desc = desc;
  }
};
static void __main_block_func_0(struct __main_block_impl_0 *__cself, const char *str) {
  char **argv = __cself->argv; // bound by copy

        printf("Hello %s from %s!\n", str, argv[0]);
    }

static struct __main_block_desc_0 {
  size_t reserved;
  size_t Block_size;
} __main_block_desc_0_DATA = { 0, sizeof(struct __main_block_impl_0)};

struct __main_block_impl_1 {
  struct __block_impl impl;
  struct __main_block_desc_1* Desc;
  int count;
  __main_block_impl_1(void *fp, struct __main_block_desc_1 *desc, int _count, int flags=0) : count(_count) {
    impl.isa = &_NSConcreteStackBlock;
    impl.Flags = flags;
    impl.FuncPtr = fp;
    Desc = desc;
  }
};
static void __main_block_func_1(struct __main_block_impl_1 *__cself, const char *str) {
  int count = __cself->count; // bound by copy

        for (int i = 0; i < count; i++) { 
             printf("%s!\n", str); 
        }
 } 

static struct __main_block_desc_1 {
    size_t reserved; 
    size_t Block_size; 
} __main_block_desc_1_DATA = { 0, sizeof(struct __main_block_impl_1)}; 
int main(int argc, char *argv[]) 
{ 
    int count = argc > 1 ? atoi(argv[1]) : 1;

    call_blk((void (*)(const char *))&__main_block_impl_0((void *)__main_block_func_0, &__main_block_desc_0_DATA, argv));
    call_blk((void (*)(const char *))&__main_block_impl_1((void *)__main_block_func_1, &__main_block_desc_1_DATA, count));
    return 0;
}

If you look at this closely, you can notice the similarity with our take_str_block example. The compiler generates a function and an impl structure for each block. The base structure __block_impl contains the function pointer as well as information about the type of block (mostly used for block persistency).

What we did

First, we had to get rid of the Objective-C specific code: no need to keep code we don’t care about. We initially hacked the code of the Objective-C to C++ rewriter directly. But it rapidly proved really hard to maintain the patched rewriter, since we had a lot of conflicts with each new clang release, so we actually forked the Objective-C to C++ rewriter into a specialized block rewriter, introducing a -rewrite-blocks flag to clang.

Secondly, we needed to produce pure C code instead of C++ for blocks. As you can see in the generated code above, it uses structure constructors to initialize the implementation objects. Constructors do not exist in C, but we have initializers (and far better initializers than C++ ones). As a consequence, we changed the code that instantiates the impl structure to use structure initializers. Note that our implementation uses two C99 features: the designated initializers we mentioned already, as well as compound literals.

The rewriter also uses another feature that cannot be observed in our example: RAII, which is used only for variables with the __block modifier. Those variables are captured by reference instead of being captured by value. As a consequence, the runtime must perform some reference counting on those variables. RAII is used to ensure we release the reference taken by the definition scope of the variable when we exit that scope. Standard C contains no equivalent construct. As a consequence, we had to manually generate the release of the reference everywhere we exited the definition scope of the variable. Thankfully, GCC has a cleanup attribute. This attribute associates a cleanup callback with a variable and acts exactly the same way as RAII does. As a consequence, this completely solves our issue in a simple way.
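
As a minimal illustration of the cleanup attribute itself (this is not the code emitted by our rewriter): the callback receives a pointer to the variable and runs automatically whenever the scope is exited.

#include <stdio.h>

static void release_ref(int *value)
{
    printf("releasing reference on %d\n", *value);
}

static void scope_example(void)
{
    __attribute__((cleanup(release_ref))) int ref = 42;

    printf("using %d\n", ref);
    /* release_ref(&ref) is automatically called when leaving this scope,
     * even through an early return. */
}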

With those issues fixed, here is the output of our rewriter clang -cc1 -rewrite-blocks:

struct __block_impl {
  void *isa;
  int Flags;
  int Reserved;
  void *FuncPtr;
};

#include <stdio.h>
#include <stdlib.h>

static void call_blk(void (*blk)(const char *str))
{
    ((void (*)(struct __block_impl *, const char *))((struct __block_impl *)blk)->FuncPtr)((struct __block_impl *)blk,
 "world");
}

struct __main_block_impl_0 {
  struct __block_impl impl;
  struct __main_block_desc_0* Desc;
  char **argv;
#define __main_block_impl_0(__blk_fp, __blk_desc, _argv, __blk_flags) \
  { \
    .argv = (_argv), \
    .impl = { \
      .isa = &_NSConcreteStackBlock, \
      .Flags = (__blk_flags), \
      .FuncPtr = (__blk_fp), \
    }, \
    .Desc = (__blk_desc), \
  }
#define __main_block_impl_0__INST(...)  ({ \
    memcpy(&__main_block_impl_0__VAR, &(struct __main_block_impl_0)__main_block_impl_0(__VA_ARGS__), sizeof(__main_block_impl_0__VAR)); \
    &__main_block_impl_0__VAR; \
  })
};
static void __main_block_func_0(struct __main_block_impl_0 *__cself, const char *str) {
  char **argv = __cself->argv; // bound by copy

        printf("Hello %s from %s!\n", str, argv[0]);
    }

static struct __main_block_desc_0 {
  unsigned long reserved;
  unsigned long Block_size;
} __main_block_desc_0_DATA = { 0, sizeof(struct __main_block_impl_0)};

struct __main_block_impl_1 {
  struct __block_impl impl;
  struct __main_block_desc_1* Desc;
  int count;
#define __main_block_impl_1(__blk_fp, __blk_desc, _count, __blk_flags) \
  { \
    .count = (_count), \
    .impl = { \
      .isa = &_NSConcreteStackBlock, \
      .Flags = (__blk_flags), \
      .FuncPtr = (__blk_fp), \
    }, \
    .Desc = (__blk_desc), \
  }
#define __main_block_impl_1__INST(...)  ({ \
    memcpy(&__main_block_impl_1__VAR, &(struct __main_block_impl_1)__main_block_impl_1(__VA_ARGS__), sizeof(__main_block_impl_1__VAR)); \
    &__main_block_impl_1__VAR; \
  })
};
static void __main_block_func_1(struct __main_block_impl_1 *__cself, const char *str) {
  int count = __cself->count; // bound by copy

        for (int i = 0; i < count; i++) { 
             printf("%s!\n", str); 
        }
   } 

static struct __main_block_desc_1 { 
    unsigned long reserved; 
    unsigned long Block_size; 
} __main_block_desc_1_DATA = { 0, sizeof(struct __main_block_impl_1)}; 
int main(int argc, char *argv[]) 
{struct __main_block_impl_1 __main_block_impl_1__VAR;struct __main_block_impl_0 __main_block_impl_0__VAR; 
    int count = argc > 1 ? atoi(argv[1]) : 1;

    call_blk((void (*)(const char *))(void *)__main_block_impl_0__INST((void *)__main_block_func_0, &__main_block_desc_0_DATA, argv, 0
));
    call_blk((void (*)(const char *))(void *)__main_block_impl_1__INST((void *)__main_block_func_1, &__main_block_desc_1_DATA, count, 0
));
    return 0;
}

Limitations

An issue remains, though: the rewriter cannot rewrite blocks within macros. This is a direct consequence of how the preprocessor works: it is a (not so) simple text replacement, and the text of a macro is not guaranteed to be valid C. This could be worked around by rewriting the output of the preprocessor instead of rewriting the source code. However, since we need to compile the output of the rewriter using GCC in order to use the optimizer supported by RHEL (the optimizer being the most complex part of the compiler), this was not possible. Indeed, in our code base, we sometimes use compiler-specific code paths (for example, until recently, clang didn’t know about the flatten attribute, and GCC didn’t know about clang‘s __has_feature() preprocessor intrinsic…). Since the rewriter is based on clang, the macros are expanded to their clang flavour. As a consequence, had we worked on the preprocessed code, at best we would have lost some GCC-specific optimisations and at worst we would have encountered code GCC could not compile. Today, rewriting within macros remains the main limitation of our toolchain. In early versions, it even used to cause crashes of clang, a consequence of a pointer misuse inherited from the Objective-C rewriter ((It looks like the Objective-C rewriter was some kind of experiment, and it’s clearly not of production quality)).

Since we rewrite the original source file, not the preprocessed file, we also cannot rewrite blocks in included files. This means that no blocks are allowed in header files (or in any other included code file). However, we still need to declare block types and functions that take blocks as arguments in those header files. And here comes probably the part that created the greatest amount of confusion among our engineers.

A block type is defined using a caret:

typedef int (^my_block_b)(const char *str);
int call_blk(int (^blk)(const char *str));

That won’t compile with GCC. So we had to find a way to allow the declaration of block types even in header files, with two constraints: we must use the block syntax when the compiler supports blocks, but use some standard C syntax when the compiler does not. We did this by introducing a dedicated BLOCK_CARET macro that gets rewritten to a caret when the compiler supports blocks, but to a star when it does not:

#ifdef __BLOCKS__
# define BLOCK_CARET  ^
#else
# define BLOCK_CARET  *
#endif

typedef int (BLOCK_CARET my_block_b)(const char *str);
int call_blk(int (BLOCK_CARET blk)(const char *str));

This means that GCC will see block types as function types. However, we already saw that a block is not a function, but a structure. The rewriting to a function type is just a trick that makes the code acceptable to GCC, but that trick does not transform blocks into functions. Even if both blocks and functions are callable C types, they are not interchangeable: a function cannot be used where a block is expected, and vice versa. Blocks remain a patch on top of the C standard. Most languages that have built-in closures don’t have that kind of limitation; Apple itself fixed this in the Swift language… five years later.

4 years later…

We’ve been using our rewriter for over 4 years. During the first few months, we had a hard time convincing our engineers that blocks really were worth the extra cost (the extra compilation time and the various limitations of our rewriter), but today, a quarter of our code base uses blocks. The benefits of the lighter syntax introduced by blocks can be observed in various use cases spanning from asynchronous code to the core of our database engines. For example, we can easily create atomic commits using blocks:

db_do_atomically(^{
    do_some_write();
    do_some_other_write();
});

We also spotted more limitations (mostly with __block variables), and added basic support for C++ (ironically, we now think that we can generate better C++ than the original Objective-C to C++ rewriter for the small subset of C++ we support)… Probably the most frustrating issue today is the performance impact of blocks: they are too tricky to be inlined by GCC, even if defined and called in the same compilation unit. Moreover, in some use cases, we spotted limitations due to the fact that blocks get allocated on the heap when we need to persist them (using Block_copy()), causing some contention in the allocator.

Our hacked clang is available on github for all branches since clang 3.0 (we actually began working on it before clang 3.0, but that early work was lost when we switched from our own git-svn clone to the official llvm git mirror). The code is not clean and someday we’ll do a great cleanup pass in order to be able to submit our patches upstream.

Hackathon 0x1 – Pimp my Review or the Epic Birth of a Gerrit Plugin

This series of articles describes some of the best realisations made by Intersec R&D team during the 2-day Hackathon that took place on the 3rd and 4th of July.

The goal had been set a day or two prior to the beginning of the hackathon: we were hoping to make Gerrit better at recommending relevant reviewers for a given commit. For those who haven’t heard of it, Gerrit is a web-based code review system. It is a nifty Google-backed open-source project evolving amid an active community of users. We have been using this product here at Intersec since 2011, and some famous software projects also rely heavily on it for their development process.

Da Review Pimpers

This would be a good metaphor to illustrate our mindset at the beginning of the hackathon! (credits: Team Fortress 2)

Our team consisted of five people: Kamal, Romain, Thomas, Louis and Romain (myself).

Continue reading

Memory – Part 6: Optimizing the FIFO and Stack allocators

Introduction

The most used custom allocators at Intersec are the FIFO and the Stack allocators, detailed in a previous article. The stack allocator is extremely convenient, thanks to the t_scope macro, and the FIFO is well fitted to some of our use cases, such as inter-process communication. It is thus important for these allocators to be optimized extensively.

We are two interns at Intersec, and our objective for this 6-week internship was to optimize these allocators as far as possible. Optimizing an allocator can have several meanings: it can be in terms of memory overhead, resistance to contention, performance… As the FIFO allocator is designed to work in single-threaded environments, and the t_stack is thread-local, we will only cover performance and memory overhead.

Continue reading

Hackathon 0x1 – Interactive mode in Behave

This series of articles describes some of the best realisations made by Intersec R&D team during the 2-day Hackathon that took place on the 3rd and 4th of July.

Presentation of the Project

As testers, we spend a lot of time working on behave, our test automation framework ((behave is a clone of cucumber written in Python. It is based on the BDD (Behavior Driven Development) principles. Tests are described as a succession of English sentences (assumptions, then actions, then results) which are themselves mapped to the corresponding Python code.)). Our test framework is a great tool, but it takes a lot of time starting and initializing the product, and running tests one by one.

We are not only test automation developers. We also need to explore the product under test by experimenting again and again based on the information we gather along the way. Manual testing is the basic approach to perform non-trivial experiments, but it can benefit from automated testing as it offers a quick and reliable way to set up a product in any given state.

The hackathon ((a hackathon is an event in which computer programmers and others involved in software development, including graphic designers, interface designers and project managers, collaborate intensively on software projects. Intersec promotes this event to focus and enhance project innovation)) was the perfect opportunity to buy ourselves a new exploratory tool to perform interactive automated testing. To do that, we needed to be able to:

  • Gain more control over the set up steps
  • Pause the product in order to manually test the state of the product
  • Select some more steps to run, check again, and so on
  • Explore with some automatic help

We elaborated an interactive mode to manage the run of a scenario. The goal was to be able to play each sentence on demand. This mode would help us in the future to reproduce issues and to set up the environment faster for testing purposes or even for customer demonstrations.

Continue reading

DAZIO: Detecting Activity Zones based on Input/Output sms and calls activity for geomarketing and trade area analysis

Introduction

Telecom data is a rich source of information for many purposes, ranging from urban planning (Toole et al., 2012), human mobility patterns (Ficek and Kencl, 2012; Gambs et al., 2011), points of interest detection (Vieira et al., 2010), epidemic spread modeling (Lima et al., 2013), community detection (Morales et al., 2013) and disaster planning (Pulse, 2013) to social interactions (Eagle et al., 2013).

One common task for these applications is to identify dense areas where many users stay for a significant time (activity zones), the regions linking these activity zones (transit zones), as well as the interactions between identified activity zones. Thus, in the present article we will identify activity and transit zones to monitor and predict the activity levels in the telecom operator's network, based on the SMS and call input/output activity levels issued from the Telecom Italia Big Data Challenge. The results of the present study could be directly applied to:

  • Location-Based Advertising
  • defining a suitable place to open a new store in a city
  • planning where to add cell towers to improve QoS

The contribution of this work is twofold: to present a model accounting for changes of activity levels (over time) and to predict those changes using Markov chains. We also propose a methodology to detect activity and transit zones.

Continue reading