Introducing auto-formatting in an existing codebase

At Intersec, we aim at maintaining a consistent style throughout all the code, depending on the programming language. For example, the C codebase is the most consistent, with our coding rules being enforced on code review. This enforcement, however, can lead to significant time loss and frustrations when patches must be repushed to be adapted to specific rules.

In the past, there has been several attempts at configuring auto-formatting tools. They were never fully satisfactory because several of our coding rules did not fit into the limited configuration options of these tools. This subject was born out of these attempts. However, instead of focusing on adapting the tools to our coding rules, we considered doing the opposite. What about adapting our coding rules (in particular, some of those peculiar rules) so that auto-formatting tools could be applied easily? After all, if some of our rules are too specific to exist in popular tools, maybe those rules cause more harm than improvement to our code.

Python

First step was Python, which we use extensively: for testing, in our build system as well as specific APIs. Python is the language where a specific coding style is the least enforced. Our guidelines tell us to follow PEP8, but this has never been particularly applied, hence a very heterogeneous codebase.

We looked into autopep8, yapf and black. Given the state of our codebase, any of these tools would involve significant changes when applied on all files. We decided to go with the most opinionated one, after discussing specific configuration options.

As expected, the changes were significant. A quick wc -l indicated that we had about 200k lines of python code. Launching black on every python file resulted in about +56k additions and -34k deletions. These are huge changes: although we are happy about the formatting, these changes practically ensure conflicts during branches merges, which may generate significant time loss as well.

Funnily enough, runing black on our whole codebase generated two errors where the output of black was not idempotent, in old code with dubious style. black was smart enough to detect that and provide the diff, so that the formatting could be fixed by hand.

Javascript

We considered using prettier for javascript, but decided to use eslint instead, as we already use eslint (happily) as a linter. Adding a few indentation rules, we can use eslint --fix as a sort of auto formatter. It is less opinionated than prettier, but fits nicely with our codebase.

For 236k lines of code, running eslint --fix on all JS files resulted in about 5k modifications only that fall into two main categories:

Specific coding rules that did not match with eslint rules.
Badly indented code that was fixed.

This is pretty good. The changes are relatively minimal, and this could be used to make sure that the code is properly formatted on new commits. However, it may require some user reformatting, as some constructs may cause some very ugly indentation (for example, declaring an object on multiple lines in a function call aligns the object fields with the opening bracket of the object, which can lead to very high indentation).

We also have around 120k lines of typescript. We are currently in the process of migrating our tslint configuration to eslint, which, when finished, will also allow us to format our typescript code using eslint.

C

Our C Codebase was the most problematic. It uses Blocks, has grown to use a lot of idiosyncratic syntaxes and allows using higher-level data types through many custom macros. When used to it, the coding rules make the code easily readable for us. For a formatting tool however, not so much. C is notoriously hard to parse, and formatting tools can have a hard time understanding which identifiers are types, or which macros are used to simulate keywords or blocks.

The goal was to use clang-format, which supports blocks natively. However, the development version was needed to get support for some latest features, including whitelisting of specific macros as keywords or blocks.

The result is not entirely convincing though. We triggered a few bugs which breaks the indentation, and some of the reformatting leaves the code arguably less readable than before. This could possibly be alleviated by modifying our coding rules significantly to improve the formatting, but that would involve a lot of modifications to the whole codebase, which we would rather avoid.

Posted on Apr 8, 2020 at 17:54