Coverity is a proprietary static code analysis tool that can be used to find code quality defects in large-scale, complex software. It supports a number of languages and frameworks and is trusted to help ensure compliance with standards such as MISRA, AUTOSAR, and ISO 26262. Coverity provides integrations with several build systems, including Bazel; however, the official Bazel integration fell short of the expectations of our client, who wanted to leverage the Bazel remote cache in order to speed up Coverity analysis and be able to run it in normal merge request workflows. We took on that challenge.
The Coverity workflow
To understand the rest of the post it is useful to first become familiar with the Coverity workflow, which is largely linear and can be summarized as a sequence of steps:
- Configuration. In this step `cov-configure` is invoked, possibly several times with various combinations of key compiler flags (such as `-nostdinc++`, `-fPIC`, `-std=c++14`, and others). This produces a top-level XML configuration file accompanied by a directory tree that contains configurations corresponding to the key flags that were provided. Coverity is then able to pick the right configuration by dispatching on those key flags when it sees them during the build.
- Compilation. This is typically done by invoking `cov-build` or its lower-level cousin `cov-translate`. The distinction between these two utilities will become clear in the next section. For now, it is enough to point out that these tools translate original invocations of the compiler into invocations of `cov-emit` (Coverity’s own compiler), which in turn populate an SQLite database with intermediate representations and all the metadata about the built sources, such as symbols and include paths.
- Analysis. This step amounts to one or more invocations of `cov-analyze`. This utility is practically a black box, and it can take considerable time to run, depending on the number of translation units (think compilation units in C/C++). The input for the analysis is the SQLite database that was produced in the previous step. `cov-analyze` populates a directory tree that can then be used for report generation or uploading of identified software defects.
- Reporting. Listing and registration of found defects are performed with `cov-format-errors` and `cov-commit-defects`, respectively. These tools require the directory structure that was created by `cov-analyze` in the previous step. (A condensed shell sketch of the whole sequence follows this list.)
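To make these steps concrete, here is a condensed sketch of the workflow on a plain Make-based project. The flags and paths are illustrative rather than exact: the `cov-configure` switches, the stream name, and the server URL are assumptions made for the sake of the example.

```bash
IDIR=/path/to/idir                      # Coverity intermediate directory

# 1. Configuration: register the compiler (options shown are illustrative).
cov-configure --config cov-conf/coverity_config.xml --compiler gcc --comptype gcc

# 2. Compilation: wrap the normal build; compiler calls are turned into
#    cov-emit invocations that populate the emit database.
cov-build --dir "$IDIR" make -j"$(nproc)"

# 3. Analysis: whole-program analysis over the emit database.
cov-analyze --dir "$IDIR"

# 4. Reporting: list defects locally and/or upload them to the server.
cov-format-errors --dir "$IDIR"
cov-commit-defects --dir "$IDIR" --stream my-stream --url https://coverity.example.com
```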
Existing integrations
A successful integration of Coverity into Bazel has to provide a way to perform step 2 of the workflow, since this is the only step that is intertwined with the build system. Step 1 is relatively quick, and it does not make a lot of difference how it is done. Step 3 is opaque and build-system-agnostic because it happens after the build; furthermore, there is no obvious way to improve its granularity and cacheability. Step 4 is similar in that regard.
The official integration works by wrapping a Bazel invocation with `cov-build`:

```
$ cov-build --dir /path/to/idir --bazel bazel build //:my-bazel-target
```

Here, `--dir` specifies the intermediate directory where the SQLite database along with some metadata will be stored. In the `--bazel` mode of operation, `cov-build` tricks Bazel by intercepting the invocations of the compiler, which are replaced by invocations of `cov-translate`. Compared to `cov-build`, which is a high-level wrapper, `cov-translate` corresponds more closely to individual invocations of a compiler. It identifies the right configuration from the collection of configurations created in step 1 and then uses it to convert the command line arguments of the original compiler into the command line arguments of `cov-emit`, which it then invokes.
The main problem with the official integration is that it does not support caching: `bazel build` has to run from scratch with caching disabled in order to make sure that all invocations of the compiler are performed and none are skipped. Another nuance of the official integration is that one has to build a target of a supported kind, e.g. `cc_library`. If you bundle your build products together in some way (e.g. in an archive), you cannot simply build the top-level bundle as you normally would. Instead, you need to identify every compatible target of interest in some other way.
Because of this, our client did not use the official Coverity integration for Bazel. Instead, they would run Bazel with the `--subcommands` option, which makes Bazel print how it invokes all the tools that participate in the build. This long log would then be parsed and converted into a Makefile in order to leverage Coverity’s Make integration in `cov-build` instead. This approach still suffered from long execution times due to the lack of caching: it ran as a nightly build and took over 12 hours, which wasn’t suitable for a regular merge request’s CI pipeline.
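For context, the legacy setup looked roughly like the sketch below. The log-to-Makefile converter is a hypothetical placeholder for the client’s actual parsing script.

```bash
# Capture every subcommand Bazel runs (printed on stderr)...
bazel build --subcommands //:my-bazel-target 2>&1 | tee bazel-subcommands.log

# ...convert the compiler invocations into a Makefile (hypothetical script)...
./subcommands-to-makefile.py bazel-subcommands.log > coverity.mk

# ...and let Coverity's Make integration drive the translated build.
cov-build --dir /path/to/idir make -f coverity.mk
```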
Our approach
The key insight that allowed us to make the build process cacheable is the observation that individual invocations of `cov-translate` produce SQLite databases (`emit-db`s) that are granular and relatively small. These individual `emit-db`s can then be merged in order to form the final, big `emit-db` that can be used for running `cov-analyze`. Therefore, our plan was the following:
- Create a special-purpose Bazel toolchain that starts as a copy of the toolchain used for “normal” compilation, in order to match the way the original compiler (`gcc` in our case) is invoked.
- Instead of `gcc` and some other tools such as `ar`, have it invoke our own wrapper that drives invocations of `cov-translate` (a sketch of such a wrapper follows this list).
- The only useful output of the compilation step is the `emit-db` database. On the other hand, in `CppCompile` actions Bazel normally expects `.o` files to be produced, so we just rename our `emit-db` SQLite database to whatever object file Bazel is expecting.
- In the linking step, we use the `add` subcommand of `cov-manage-emit` in order to merge individual `emit-db`s.
- Once the final bundle is built, we iterate through all eligible database files and merge them together once again in order to obtain the final `emit-db` that will be used for analysis.
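As an illustration, a compiler wrapper along these lines might look like the sketch below. This is not our actual implementation: the `cov-translate` options, the layout of the intermediate directory, and the `normalize-emit-db` helper are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Hypothetical compiler wrapper registered in the Coverity toolchain.
set -euo pipefail

# Find the object file that Bazel expects (the argument following -o).
out_obj=""
args=("$@")
for i in "${!args[@]}"; do
  if [[ "${args[$i]}" == "-o" ]]; then
    out_obj="${args[$((i + 1))]}"
  fi
done

# Run cov-translate on the original gcc command line, using a fresh
# intermediate directory for this single translation unit.
idir="$(mktemp -d)"
COV_HOST=coverity cov-translate --dir "$idir" gcc "$@"

# Normalize the per-TU emit-db for reproducibility (hypothetical helper; the
# need for this step is explained below), then hand the database to Bazel
# under the name of the expected object file.
normalize-emit-db "$idir"
cp "$idir"/emit/*/emit-db "$out_obj"
```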
From Bazel’s point of view, we are simply compiling C/C++ code; however, this is a way to perform a cacheable Coverity build. If you are a regular reader of our blog, you may have noticed certain similarities to the approach we used in our CTC++ Bazel integration documented here and here.
Difficulties
The plan may sound simple, but in practice there was nothing simple about it. The key difficulty was the fact that the Coverity tooling was not designed with reproducibility in mind. At every step, starting from configuration generation (which is simply a `genrule` in our case), we had to inspect and adjust the produced outputs in order to make sure that they do not include any volatile data that could compromise reproducibility. The key output in the Coverity workflow, the `emit-db` database, contained a number of pieces of volatile information:
- Start time, end time, and process IDs of `cov-emit` invocations.
- Timestamps of source files.
- Command line IDs of translation units.
- Host name; luckily, this can be overwritten easily by setting the `COV_HOST` environment variable.
- Absolute file names of source files. These are split into path segments, with each segment stored as a separate database entry, meaning the number of entries in the `FileName` table varies with the depth of the path to the execroot, breaking hermeticity. Deleting entries is not viable since their IDs are referenced elsewhere. Our solution was to identify the variable prefix leading to the execroot and reset all segments in that prefix to a fixed string. The prefix length proved to be constant for each execution mode, allowing us to use `--strip-path` arguments in `cov-analyze` to remove them during analysis.
- In certain cases, some files from the `include` directories would be added to the database with altered Coverity-generated contents that would also include absolute paths.
We had to meticulously overwrite all of these, which was only possible because `emit-db` is in the SQLite format and there is open source tooling that makes it possible to edit it. If the output of `cov-emit` were in a proprietary format, we almost certainly wouldn’t have been able to deliver as efficient a solution for our client.
In practice, normalization of `emit-db` happens in two stages (sketched below):

- We run some SQL commands that reset the volatile pieces of data inside the database. As we were iterating on these “corrective instructions”, we made sure that we eliminated all instances of volatility by using the `sqldiff` utility, which can print differences in schema and data between tables.
- We dump the resulting database with the `.dump` command, which exports the SQL statements necessary to recreate the database with the same schema and data. Then we re-load these statements and thus obtain a binary database file that is bit-by-bit reproducible. This is necessary because simply editing a database by running SQL commands on it does not ensure that the result is binary reproducible, even if there is no difference in the actual contents.
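For illustration, the two stages boil down to something like the following. The table and column names in the UPDATE statements are hypothetical stand-ins rather than Coverity’s actual schema; the `sqlite3` and `sqldiff` invocations use the real tools.

```bash
db=emit-db

# Stage 1: reset volatile fields in place (table/column names are made up).
sqlite3 "$db" <<'SQL'
UPDATE TranslationUnit SET start_time = 0, end_time = 0, process_id = 0;
UPDATE SourceFile      SET timestamp = 0;
SQL

# While iterating on the corrective instructions, diff against a database
# from a previous run to spot any remaining volatility.
sqldiff reference/emit-db "$db"

# Stage 2: dump and re-load to obtain a bit-by-bit reproducible file.
sqlite3 "$db" .dump > emit-db.sql
rm "$db"
sqlite3 "$db" < emit-db.sql
```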
Performance considerations
Since `emit-db`s are considerably larger than typical object files, we found it highly desirable to use compression for individual SQLite databases, both for those that result from initial `cov-translate` invocations and for the merged databases that are created by `cov-manage-emit` in the linking step. Zstandard proved to be a good choice: it is fast and makes our build outputs up to 4 times smaller. Without compression, we risked filling the remote cache quickly; besides, the bigger the files, the slower the I/O operations.
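In the wrapper, this amounts to a couple of `zstd` calls around the database files, roughly as follows (the compression level and file names are illustrative):

```bash
# Compress the per-TU database before exposing it as a build output...
zstd -q -3 < emit-db > "$out_obj"

# ...and decompress cached outputs again before merging them in the linking step.
zstd -d -q < "$out_obj" > emit-db
```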
We were tempted to minimize the size of the database even further by exploring whether there is anything in `emit-db` that can be safely removed without affecting our client’s use case. Alas, every piece of information stored was required by Coverity during the analysis phase, and our attempts to truncate some of the tables led to failures in `cov-analyze`. It is worth noting that the `sqlite3_analyzer` utility (part of the SQLite project) can be used to produce a report that explains which objects store the most data. This way we found that there is an index that contributes about 20% of the database size; however, deleting it severely degrades the performance of record inserts, which is a key operation during the merging of `emit-db`s.
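`sqlite3_analyzer` takes a database file and prints a per-table and per-index breakdown of disk usage, which is how we spotted the large index (the path below is illustrative):

```bash
sqlite3_analyzer path/to/emit-db | less
```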
Linking steps, which in our approach amount to merging of `emit-db` databases, produce particularly large files. In order to reduce the amount of I/O we perform and avoid filling the remote cache too quickly, we’ve marked all operations that deal with these large files as `no-remote`. This is accomplished with the following lines in our `.bazelrc` file:
```
common:coverity --modify_execution_info=CppArchive=+no-remote,CppLink=+no-remote,CcStrip=+no-remote
# If you wish to disable RE for compilation actions with coverity, uncomment
# the following line:
# common:coverity --modify_execution_info=CppCompile=+no-remote-exec
```

On the other hand, we did end up running `CppCompile` actions with remote execution (RE) because it proved to be twice as fast as running them on CI agents. Making
RE possible required us to identify the exact collection of files from the
Coverity installation that are required during invocations of the Coverity
tooling. Once we observed RE working correctly, we were confident that the
Bazel definitions we use are hermetic.
Merging of individual `emit-db` databases found in the final bundle (i.e. after the build has finished) proved to be time-consuming. This operation cannot be parallelized, since database information cannot be written in parallel. The time required for this step grows linearly with the number of translation units (TUs) being inserted; therefore, it makes sense to pick the largest database and then merge the smaller ones into it. One could entertain the possibility of skipping merging altogether and instead running smaller analyses on individual `emit-db`s, but this does not seem advisable: Coverity performs a whole-program analysis and would lose valuable information that way. For example, one TU may be exercised in different ways by two different applications, and the analysis results for this TU cannot be correctly merged.
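The size-ordered merge can be sketched as follows. The directory layout and the exact shape of the `cov-manage-emit add` invocation are assumptions; the approach only relies on the existence of the `add` subcommand.

```bash
# Pick the largest intermediate directory and fold the smaller ones into it.
largest=$(du -s idirs/* | sort -rn | head -n 1 | cut -f 2)
for idir in idirs/*; do
  [[ "$idir" == "$largest" ]] && continue
  cov-manage-emit --dir "$largest" add "$idir"   # assumed invocation shape
done
```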
The analysis phase is a black box, and it can easily become the bottleneck of the entire workflow, thus making it impractical to run in merge request pipelines. A common solution for speeding up the analysis in merge request pipelines is to identify the files that were edited and limit the analysis to only these files with the `--tu-pattern` option, which supports a simple language for telling Coverity what to care about during analysis. We added support for this approach to our solution by automatically finding the files changed in the current merge request and passing these on to `--tu-pattern`. This restricted analysis still requires the `emit-db`s for the entire project, but most of them will be cached.
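In CI, deriving the pattern can be as simple as the sketch below. The `file()` predicate comes from Coverity’s TU pattern language; the way the predicates are combined, the CI variable, and the lack of regex escaping are simplifications and assumptions.

```bash
# Build a --tu-pattern expression from the files touched in the merge request.
pattern=""
while read -r f; do
  p="file('${f}')"                          # regex matched against the TU's path
  pattern="${pattern:+${pattern}||}${p}"
done < <(git diff --name-only "origin/${TARGET_BRANCH}...HEAD" -- '*.c' '*.cc' '*.cpp')

cov-analyze --dir "$IDIR" --strip-path "$PWD" --tu-pattern "$pattern"
```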
The results
The solution that we delivered is in the form of a `bazel run`-able target that depends on the binary bundle that needs to be analyzed. It can be invoked like this:

```
bazel run //bazel/toolchains/coverity:workflow --config=coverity -- ...
```

This solution can be used both for the nightly runs of Coverity and in the merge request pipelines. We have confirmed that the results that are produced by our solution match the findings of Coverity when it is run in the “traditional” way. A typical run in a merge request pipeline takes about 22 minutes when a couple of C++ files are edited. The time is distributed as follows:

- 8 minutes: building, similar to build times for normal builds (this step is sped up by caching)
- 10 minutes: the final merging of `emit-db`s
- 4 minutes: analysis, uploading defects, reporting (this step is sped up by `--tu-pattern`)
The execution time can grow, of course, if the edits are more substantial. The key benefit of our approach is that the Coverity build is now cacheable and therefore can be included in merge request CI pipelines.
Conclusion
In summary, integrating Coverity static analysis with Bazel in a cacheable,
reproducible, and efficient manner required a deep understanding of both
Bazel and Coverity, as well as a willingness to address the nuances of
proprietary tooling that got in the way. By leveraging granular `emit-db`
databases, normalizing volatile data, and optimizing for remote execution
and compression, we were able to deliver a solution that fits well into the
client’s CI workflow and supports incremental analysis in merge request
pipelines.
While the process involved overcoming significant challenges, particularly around reproducibility and performance, the resulting workflow enables fast static analysis without sacrificing the benefits of Bazel’s remote cache. We hope that sharing our approach will help other teams facing similar challenges and inspire improvements when integrating other static analysis tools with Bazel.
Behind the scenes
Mark is a build system expert with a particular focus on Bazel. As a consultant at Tweag he has worked with a number of large and well-known companies that use Bazel or decided to migrate to it. Other than build systems, Mark's background is in functional programming and in particular Haskell. His personal projects include high-profile Haskell libraries, tutorials, and a technical blog.
Alexey is a build systems software engineer who cares about code quality, engineering productivity, and developer experience.
If you enjoyed this article, you might be interested in joining the Tweag team.