We have shown the benefits of using a shared build cache as well as using remote build execution (RBE) to offload builds to a remote build farm. Our customers are interested in leveraging RBE to improve developer experience and reduce continuous integration (CI) run times, which has given us the opportunity to learn all aspects of deploying different RBE solutions. I would like to share how to deploy one of them, Buildbarn, and secure all communication within it.
What is it and why do we care?
We want developers to be productive. Being productive means spending as little time as possible waiting for build/test feedback, and not having to switch to a different task while the build is running.
Remote caching
One part of achieving this is to never build the same thing twice. Tools like Bazel support caching the result of every action, every tool execution. While many tools support storing results in a local directory, Bazel tracks the actions and their inputs with high granularity, resulting in more frequent “cache hits”. This is already a good gain for a single developer working on one machine. However, Bazel also supports conducting builds in a controlled environment with identical tooling and using a remote cache that can be shared between team members and CI, taking things a significant step further. You won’t have to rebuild anything that has already been built by your colleagues or by CI, which means starting up on a new machine, onboarding a new team member or reproducing issues becomes faster.
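As a sketch of what this looks like in practice (the cache endpoint below is a placeholder, not something we set up in this post), enabling a shared cache for a Bazel workspace is a couple of lines in .bazelrc:
# Placeholder endpoint; point this at your shared cache service.
echo "build --remote_cache=grpcs://cache.example.com:443" >> .bazelrc
# Upload local build results so that colleagues and CI can reuse them.
echo "build --remote_upload_local_results=true" >> .bazelrc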
Remote build execution
The second part of keeping developers productive is allowing them to use the right tools for the job. They still often need to build new things, and their local machine may not be the fastest, may not have enough charge, or may have the wrong architecture or OS. Remote build execution extends remote caching by executing actions on shared builders when their results are not cached already. This allows setting up a shared pool of the necessary hardware or virtual compute for both developers and CI. In Bazel this is implemented using the RBE API.
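For illustration, switching from remote caching to remote execution is mostly a matter of pointing Bazel at an RBE endpoint; the address here is again a placeholder, and we’ll configure a real Buildbarn endpoint later in this post:
# Placeholder endpoint; any RBE API server (such as Buildbarn) can serve this role.
echo "build --remote_executor=grpcs://rbe.example.com:443" >> .bazelrc
# Run more actions in parallel than the local machine has cores,
# since the actions execute on the remote builders.
echo "build --jobs=64" >> .bazelrc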
RBE implementations
Since the last post, RBE for Google Cloud Platform (GCP) has disappeared, and several new self-service and commercial services have been created. The RBE API has also gained popularity with different build systems, including Bazel (where it started), Buck2, and BuildStream. It is also used in projects that cannot change their build systems easily but can use reclient to wrap all build actions and forward them to an RBE service. Examples of such a setup include Android, Fuchsia and Chromium.
We’ll focus on one of the open-source RBE API servers, Buildbarn.
Securing remote cache and builds
Any shared infrastructure implies some security risks. When sending code to be built remotely we expose it on the network, where it can be intercepted or altered. When reading from the cache, we trust it to contain valid, unaltered results. When setting up a pool of compute resources, we expect them to be used only for building our code, and not for enriching third parties. All these expectations mean that we require all communications with remote infrastructure and within it to be encrypted and authenticated. The industry standard for achieving this is mTLS: Transport Layer Security (TLS) protocol with mutual authentication. It uses public key infrastructure (PKI) to allow both clients and servers to verify each other’s identities before sending any data, and makes sure that the data sent on one side matches the data received on the other side.
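As a quick sanity check you can run once such an endpoint exists (a sketch, assuming the TLS-enabled frontend we expose at 127.0.0.1:30080 later in this post), openssl can show whether a server asks clients for a certificate during the handshake:
# Connect without presenting a client certificate; the handshake output shows
# the server certificate and, for servers requesting client certificates,
# an "Acceptable client certificate CA names" section.
openssl s_client -connect 127.0.0.1:30080 -showcerts </dev/null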
Overview
In this extended blog post we’ll start by showing how to deploy Buildbarn on a Kubernetes cluster running in a local VM and configure a simple Bazel example to use it. Then we’ll turn on mTLS with the help of cert-manager for all Buildbarn pieces communicating with one another, and, finally, configure Bazel on a developer or CI machine to authenticate over the RBE API with a certificate and verify the one presented by the build server.
This blog post contains a lot of code snippets that let you follow the
installation process step by step. If you copy each command into your terminal
in order, you should see the same results as described. If you prefer to jump
to the final result and look at the complete picture, you can check out our
fork of the upstream
buildbarn/bb-deployments
repository and follow the
instructions there.
Deploying Buildbarn
In this section we’ll create a local Buildbarn deployment on a Kubernetes cluster running in a VM. First we’ll create a local VM with Kubernetes using an example config provided by lima. Then we’ll configure persistent volumes for Buildbarn storage inside that VM. After that we’ll use the Kubernetes example from a repository provided by Buildbarn to deploy Buildbarn itself.
Setting up a Kubernetes instance
If you already have access to a Kubernetes cluster that you can use, you can skip this section. Here we’ll deploy a local VM with Kubernetes running in it. The subsequent steps assume that you’re using a local VM, so you’ll have to adjust some parameters accordingly if your setup differs.
I’ve found that the easiest and most portable way to get Kubernetes running
locally is to use the lima (Linux Machines) project. You can follow the
official docs to install it. I prefer using Nix and
direnv, so I’ve created a .envrc
file containing the single line use nix, and a
shell.nix
with the following contents:
{ nixpkgs ? builtins.getFlake "nixpkgs"
, system ? builtins.currentSystem
, pkgs ? nixpkgs.legacyPackages.${system}
}:
pkgs.mkShell {
packages = with pkgs; [
kubectl
lima-bin
jq
];
}
Then you just need to run direnv allow
and it will fetch the necessary
packages and make them available in your shell.
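For reference, creating the file and trusting it looks like this (assuming direnv is already hooked into your shell):
# .envrc contains the single line "use nix"; direnv allow marks it as trusted.
echo "use nix" > .envrc
direnv allow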
Now we can create a Lima VM from the k8s
template. We remove mounts
from
the template to specify our own later. We also need to add some special options
for running on macOS:
limactl create template://k8s --name k8s --tty=false \
--set '.provision |= . + {"mode":"system","script":"#!/bin/bash
for d in /mnt/fast-disks/vol{0,1,2,3}; do sudo mkdir -p $d; sudo mount --bind $d $d; done"}' \
$([ "$(uname -s)" = "Darwin" ] && { echo "--vm-type vz"; [ "$(uname -m)" = "arm64" ] && echo "--rosetta"; })
The arguments here are:
- --name k8s sets a name for the new VM; it defaults to the template name, but let’s keep it explicit
- --set '.provision ...' uses a jq expression to add an additional provision step to the resulting YAML file, creating the necessary mountpoints for persistent volumes
- --tty=false disables console prompts and confirmations
- for macOS we also add --vm-type vz to use the native macOS Virtualization framework instead of QEMU for a faster VM
- for Apple Silicon we also add --rosetta to enable the translation layer, allowing us to run x86_64 containers in the VM with little overhead
You can start the final VM and check if it is ready with:
limactl start k8s
export KUBECONFIG=~/.lima/k8s/copied-from-guest/kubeconfig.yaml
kubectl get node
It will take some time to bootstrap Kubernetes, after which it should show you
one node called lima-k8s
with Ready
status:
NAME STATUS ROLES AGE VERSION
lima-k8s Ready control-plane 4m54s v1.29.2
Buildbarn will need some PersistentVolumes to store data. Let’s teach Kubernetes to use the mounts that we created earlier for that. First, configure a storage class:
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-disks
annotations:
storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
EOF
It should respond with storageclass.storage.k8s.io/fast-disks created.
Then start a local volume provisioner from sig-storage-local-static-provisioner:
curl -L https://raw.githubusercontent.com/kubernetes-sigs/sig-storage-local-static-provisioner/master/deployment/kubernetes/example/default_example_provisioner_generated.yaml | kubectl apply -f -
Run kubectl get pv to see that it created four volumes. They may take several
seconds to appear. You can check the provisioner’s logs for any errors with
kubectl logs daemonset/local-volume-provisioner.
Deploying Buildbarn
bb-deployments provides a Kustomize template to deploy Buildbarn. Let’s clone it, patch one service so that we can run it locally, and deploy:
git clone https://github.com/buildbarn/bb-deployments.git
pushd bb-deployments/kubernetes
cat >> kustomization.yaml <<EOF
# patch frontend service to not require external load balancers
patches:
- target:
kind: Service
name: frontend
patch: |
- op: replace
path: /spec/type
value: NodePort
- op: add
path: /spec/ports/0/nodePort
value: 30080
EOF
kubectl apply -k .
kubectl rollout status -k . 2>&1 | grep -Ev "no status|unable to decode"
The last command will wait for everything to start. We’ve filtered out all messages about resources that it doesn’t know how to wait for.
To check that the Buildbarn frontend is accessible, we can use
grpc-client-cli. Add it to the list in shell.nix, save it and run:
grpc-client-cli -a 127.0.0.1:30080 health
It should report that it is SERVING:
{
"status": "SERVING"
}
We can exit the bb-deployments
directory now:
popd
In this section we’ve deployed Buildbarn and verified that its API is accessible. Now we’ll move on to setting up a small Bazel project to use it. Then we’ll configure mTLS on Buildbarn, and finally configure Bazel to work with mTLS.
Using Buildbarn
Let’s set up a small Bazel project to use our Buildbarn instance. In this section we’ll use the Bazel examples repo and show how to build it with Bazel both locally and with RBE. We’ll also see how remote caching speeds up builds by caching intermediate results.
We will be using Bazelisk to fetch and run the upstream distribution of
Bazel. First we’ll need to install Bazelisk by adding bazelisk to shell.nix.
If you are running NixOS, you will have to create an FHS
environment to run Bazel. If you are running macOS and don’t
have the Xcode command line tools installed, you also need to provide the necessary
libraries to the bazel invocation. Add this to your shell.nix:
pkgs.mkShell {
packages = with pkgs; [
...
bazelisk
];
env = pkgs.lib.optionalAttrs pkgs.stdenv.isDarwin {
BAZEL_LINKOPTS = with pkgs.darwin.apple_sdk;
"-F${frameworks.Foundation}/Library/Frameworks:-L${objc4}/lib";
BAZEL_CXXOPTS = "-I${pkgs.libcxx.dev}/include/c++/v1";
};
# fhs is only used on NixOS
passthru.fhs = (pkgs.buildFHSUserEnv {
name = "bazel-userenv";
runScript = "zsh"; # replace with your shell of choice
targetPkgs = pkgs: with pkgs; [
libz # required for bazelisk to unpack Bazel itself
];
}).env;
}
Then on NixOS you can run nix-shell -A fhs to enter an environment where
directories like /bin, /usr and /lib are set up the way tools built for other
Linux distributions expect.
Now we can clone the Bazel examples repo and enter the simple C++ example in it:
git clone --depth 1 https://github.com/bazelbuild/examples
pushd examples/cpp-tutorial/stage1
On macOS we’ll need to configure compiler and linker flags to look for libraries in the Nix store:
echo "build:macos --action_env=BAZEL_CXXOPTS=${BAZEL_CXXOPTS}" >> .bazelrc
echo "build:macos --action_env=BAZEL_LINKOPTS=${BAZEL_LINKOPTS}" >> .bazelrc
We will be building remotely for the Linux platform later, so we should specify a concrete platform and toolchain to use for Linux:
echo "build:linux --platforms=@aspect_gcc_toolchain//platforms:x86_64_linux" >> .bazelrc
echo "build:linux --extra_execution_platforms=@aspect_gcc_toolchain//platforms:x86_64_linux" >> .bazelrc
And then build and run the example locally:
bazelisk run //main:hello-world
You should see output like:
Starting local Bazel server and connecting to it...
INFO: Analyzed target //main:hello-world (38 packages loaded, 165 targets configured).
INFO: Found 1 target...
Target //main:hello-world up-to-date:
bazel-bin/main/hello-world
INFO: Elapsed time: 7.545s, Critical Path: 0.94s
INFO: 8 processes: 6 internal, 2 processwrapper-sandbox.
INFO: Build completed successfully, 8 total actions
INFO: Running command line: bazel-bin/main/hello-world
Hello world
Note that if we run bazelisk run //main:hello-world
again, it’ll be much
faster, because Bazel only spends a fraction of a second on computing the
action graph and making sure that nothing needs to be rebuilt:
...
INFO: Elapsed time: 0.113s, Critical Path: 0.00s
INFO: 1 process: 1 internal.
INFO: Build completed successfully, 1 total action
...
We can also run bazelisk clean
to remove previous output and re-run it to
make sure we can rebuild from scratch.
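Concretely:
# Remove all previous outputs, then rebuild and run from a cold local cache.
bazelisk clean
bazelisk run //main:hello-world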
Now let’s try building it using Buildbarn. First we need to configure execution properties to match the ones set up in Buildbarn’s worker config:
echo "build:remote --remote_default_exec_properties OSFamily=linux" >> .bazelrc
echo "build:remote --remote_default_exec_properties container-image=docker://ghcr.io/catthehacker/ubuntu:act-22.04@sha256:5f9c35c25db1d51a8ddaae5c0ba8d3c163c5e9a4a6cc97acd409ac7eae239448" >> .bazelrc
Then we should tell Bazel to use Buildbarn as a remote executor:
echo "build:remote --remote_executor grpc://127.0.0.1:30080" >> .bazelrc
Now we can build it with
bazelisk build --config=linux --config=remote //main:hello-world. Note that
it will take some time to extract the Linux compiler and supplemental files
first:
INFO: Invocation ID: d70b9d30-1865-4d1f-8d52-77c6fc5ec607
INFO: Build options --extra_execution_platforms, --incompatible_enable_cc_toolchain_resolution, and --platforms have changed, discarding analysis cache.
INFO: Analyzed target //main:hello-world (3 packages loaded, 6315 targets configured).
INFO: Found 1 target...
Target //main:hello-world up-to-date:
bazel-bin/main/hello-world
INFO: Elapsed time: 96.249s, Critical Path: 52.72s
INFO: 5 processes: 3 internal, 2 remote.
INFO: Build completed successfully, 5 total actions
As you can see, two actions were executed remotely: compilation and linking. But
we can find the result locally in bazel-bin/main/hello-world
(and run it if
we’re on an appropriate platform):
% file bazel-bin/main/hello-world
bazel-bin/main/hello-world: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 4.9.0, not stripped
Now if we clean local caches and rebuild, we can see that it reuses results already stored in Buildbarn (remote cache hits):
% bazelisk clean
INFO: Invocation ID: d655d3f2-071d-48ff-b3e9-e0b1c61ae5fb
INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
% bazelisk build --config=linux --config=remote //main:hello-world
INFO: Invocation ID: d38526d8-0242-4b91-92da-20ddd110d3ae
INFO: Analyzed target //main:hello-world (41 packages loaded, 6315 targets configured).
INFO: Found 1 target...
Target //main:hello-world up-to-date:
bazel-bin/main/hello-world
INFO: Elapsed time: 0.663s, Critical Path: 0.07s
INFO: 5 processes: 2 remote cache hit, 3 internal.
INFO: Build completed successfully, 5 total actions
We can exit the examples
directory now:
popd
In this section we’ve configured a Bazel project to be built using our Buildbarn instance. Now we’ll configure mTLS on Buildbarn and then finally reconfigure this Bazel project to access Buildbarn using mTLS.
Configuring TLS in Buildbarn
We want each component of Buildbarn to have its own automatically generated certificate and use it to connect to other components. On the other side, each component that accepts connections should verify that the incoming connection is accompanied by a valid certificate as well. In this section we’ll use cert-manager to generate certificates and its CSI driver to request them and propagate them to Buildbarn components in a more secure way, keeping private keys on the node where they are used. Then we’ll configure Buildbarn components to verify both sides of each connection. Here’s how this process looks for the frontend and storage containers, for example:
Node 1 │ Kubernetes API │ Node 2
│ │
┌─────────────────────────┐ │ │ ┌─────────────────────────┐
│ Frontend pod │ │ mTLS │ │ Storage pod │
│ bb-storage process │<───────────────────────────────────────>│ bb-storage process │
├─────────────────────────┤ │ ┌──────────────┐ │ ├─────────────────────────┤
│ CSI volume ca.crt │ │ │ cert-manager │ │ │ ca.crt CSI volume │
│ tls.key tls.crt │ │ └─────┬────────┘ │ │ tls.crt tls.key │
└──────────^─────────^────┘ │ │ fills out │ └───^─────────^───────────┘
│ │ │ V │ │ │
generates stores │ apiVersion: cert-manager.io/v1 │ stores generates
│ │ kind: CertificateRequest │ │
┌┴─────────┴─┐ creates spec: ┌┴─────────┴─┐
│ CSI driver │────────> request: LS0tLS... │ CSI driver │
└────────────┘ status: └────────────┘
^ retrieves certificate: ...
└─────────── ca: ...
- The CSI driver sees the CSI volume and generates a key in tls.key in there.
- The CSI driver uses the key from tls.key to generate a Certificate Signing Request (CSR) and creates a CertificateRequest resource in the Kubernetes API with it.
- cert-manager signs the CertificateRequest with the CA certificate and puts both the resulting certificate and the CA certificate in the CertificateRequest’s status.
- The CSI driver stores them in tls.crt and ca.crt respectively in the CSI volume.
- The bb-storage process in the frontend pod uses the certificate and key from tls.crt and tls.key to establish a TLS connection to the storage pod, verifying that the latter presents a valid certificate signed by the CA certificate from ca.crt.
- On the storage side tls.key, tls.crt and ca.crt are filled out in a similar manner.
- The bb-storage process in the storage pod verifies the incoming certificate with the CA certificate from ca.crt and presents the certificate from tls.crt to the frontend.
Notice how with this approach secret keys never leave the node where they are generated and used, and the connection between frontend and storage pods is authenticated on both ends.
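Once the pieces below are deployed, you can inspect what was actually issued without entering any pod; this is a sketch that picks an arbitrary CertificateRequest created by the CSI driver and decodes its certificate with openssl:
# Take the first CertificateRequest in the buildbarn namespace and show
# the subject and DNS names of the certificate that was issued for it.
kubectl -n buildbarn get certificaterequest \
  -o jsonpath='{.items[0].status.certificate}' | base64 -d | \
  openssl x509 -noout -subject -ext subjectAltName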
Installing cert-manager
To generate certificates for our Buildbarn we need to install and configure cert-manager itself and its CSI driver. cert-manager is responsible for generating and updating certificates requested via Kubernetes API objects. The CSI driver lets users create special volumes in pods where private keys are generated locally and certificates are requested from cert-manager and provided to the pod.
First, let’s fetch all necessary manifests and add them to our deployment. The cert-manager project publishes a ready-to-use Kubernetes manifest, so we can manually fetch it:
pushd bb-deployments/kubernetes
curl -LO https://github.com/cert-manager/cert-manager/releases/download/v1.14.3/cert-manager.yaml
And then add it to the resources section of our kustomization.yaml:
resources:
- ...
- cert-manager.yaml
Unfortunately, the cert-manager CSI driver doesn’t directly provide a Kubernetes
manifest, but rather a Helm chart. Add kubernetes-helm to your shell.nix
and then run:
helm template -n cert-manager -a storage.k8s.io/v1/CSIDriver https://charts.jetstack.io/charts/cert-manager-csi-driver-v0.7.1.tgz > cert-manager-csi-driver.yaml
-a storage.k8s.io/v1/CSIDriver makes sure that the chart uses the latest version
of the Kubernetes API to register itself.
Then we can add it to the resources section of our kustomization.yaml:
resources:
- ...
- cert-manager.yaml
- cert-manager-csi-driver.yaml
Let’s deploy and wait for everything to start. We will use cmctl to check
that cert-manager is working correctly, so you’ll need to add it to shell.nix.
kubectl apply -k .
kubectl rollout status -k . 2>&1 | grep -Ev "no status|unable to decode"
cmctl check api --wait 10m
kubectl get csinode -o yaml
cmctl should report The cert-manager API is ready, and the last command
should output your only node with one driver called csi.cert-manager.io
installed:
namespace/buildbarn unchanged
namespace/cert-manager created
...
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
...
The cert-manager API is ready
apiVersion: v1
items:
- apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
...
name: lima-k8s
...
spec:
drivers:
- name: csi.cert-manager.io
nodeID: lima-k8s
topologyKeys: null
kind: List
metadata:
resourceVersion: ""
If it says drivers: null, re-run kubectl get csinode -o yaml a bit later to
allow more time for driver deployment and startup.
Creating a CA certificate
First we need to create a CA certificate and an Issuer that cert-manager will
use to generate certificates for our needs. Note that to generate a self-signed
certificate we’ll also need to create another issuer. Put this in ca.yaml
:
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: selfsigned
namespace: buildbarn
spec:
selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: ca
namespace: buildbarn
spec:
isCA: true
commonName: ca
secretName: ca
privateKey:
algorithm: ECDSA
size: 256
issuerRef:
name: selfsigned
kind: Issuer
group: cert-manager.io
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: ca
namespace: buildbarn
spec:
ca:
secretName: ca
Then add it to the resources section of our kustomization.yaml:
resources:
- ...
- ca.yaml
Then apply it and check the issuers’ status:
kubectl apply -k .
kubectl -n buildbarn get issuers -o wide
Both issuers should be there, and the ca issuer should have the Signing CA verified status:
NAME READY STATUS AGE
ca True Signing CA verified 14s
selfsigned True 14s
If it says something like secrets "ca" not found, it means it needs some time
to generate the certificate. Re-run kubectl -n buildbarn get issuers -o wide.
Generating certificates for Buildbarn components
As mentioned before, we will be generating certificates for each component
using cert-manager’s CSI driver. To do this, we need to add a volume to each
pod and mount it into the main container so that the service can read it. We
also need to pass the CA certificate into all these containers to verify the other side
of each connection. Unfortunately, Buildbarn doesn’t support reading it
from a file, so we’ll have to pass it statically via the config.
Let’s prepare this config file using this command that reads the CA certificate
via the Kubernetes API and formats it using jq
into a JSON string:
kubectl -n buildbarn get certificaterequests ca-1 -o jsonpath='{.status.ca}' | base64 -d | jq --raw-input --slurp . > config/ca-cert.jsonnet
Now we can configure all pods by adding the following patches in
kustomization.yaml
:
patches:
- ...
- target:
kind: Deployment
namespace: buildbarn
patch: |
- op: add
path: /spec/template/spec/volumes/-
value:
name: tls-cert
csi:
driver: csi.cert-manager.io
readOnly: true
volumeAttributes:
csi.cert-manager.io/issuer-name: ca
- op: add
path: /spec/template/spec/containers/0/volumeMounts/-
value:
mountPath: /cert
name: tls-cert
readOnly: true
- target:
kind: Deployment
namespace: buildbarn
name: frontend
patch: |
- op: add
path: /spec/template/spec/volumes/0/configMap/items/-
value:
key: ca-cert.jsonnet
path: ca-cert.jsonnet
- op: add
path: /spec/template/spec/volumes/1/csi/volumeAttributes/csi.cert-manager.io~1dns-names
value: frontend,frontend.${POD_NAMESPACE},frontend.${POD_NAMESPACE}.svc.cluster.local
- op: add
path: /spec/template/spec/volumes/1/csi/volumeAttributes/csi.cert-manager.io~1ip-sans
value: 127.0.0.1
- target:
kind: Deployment
namespace: buildbarn
name: browser
patch: |
- op: add
path: /spec/template/spec/volumes/0/configMap/items/-
value:
key: ca-cert.jsonnet
path: ca-cert.jsonnet
- op: add
path: /spec/template/spec/volumes/1/csi/volumeAttributes/csi.cert-manager.io~1dns-names
value: browser,browser.${POD_NAMESPACE},browser.${POD_NAMESPACE}.svc.cluster.local
- target:
kind: Deployment
namespace: buildbarn
name: scheduler-ubuntu22-04
patch: |
- op: add
path: /spec/template/spec/volumes/0/configMap/items/-
value:
key: ca-cert.jsonnet
path: ca-cert.jsonnet
- op: add
path: /spec/template/spec/volumes/1/csi/volumeAttributes/csi.cert-manager.io~1dns-names
value: scheduler,scheduler.${POD_NAMESPACE}
- target:
kind: Deployment
namespace: buildbarn
name: worker-ubuntu22-04
patch: |
- op: add
path: /spec/template/spec/volumes/1/configMap/items/-
value:
key: ca-cert.jsonnet
path: ca-cert.jsonnet
- op: add
path: /spec/template/spec/volumes/3/csi/volumeAttributes/csi.cert-manager.io~1dns-names
value: worker,worker.${POD_NAMESPACE}
- target:
kind: StatefulSet
namespace: buildbarn
name: storage
patch: |
- op: add
path: /spec/template/spec/volumes/0/configMap/items/-
value:
key: ca-cert.jsonnet
path: ca-cert.jsonnet
- op: add
path: /spec/template/spec/volumes/-
value:
name: tls-cert
csi:
driver: csi.cert-manager.io
readOnly: true
volumeAttributes:
csi.cert-manager.io/issuer-name: ca
csi.cert-manager.io/dns-names: ${POD_NAME}.storage,${POD_NAME}.storage.${POD_NAMESPACE}
- op: add
path: /spec/template/spec/containers/0/volumeMounts/-
value:
mountPath: /cert
name: tls-cert
readOnly: true
To avoid repetition, the first patch is applied to all Deployment objects, and
consecutive patches only add the proper list of DNS names for each certificate.
Note that many of those DNS names will not be used as only some of these
services actually accept connections. For the frontend
Deployment we also add the 127.0.0.1 IP address so that it can be accessed via a port
forwarded to localhost, as we currently do from the host machine. For the storage
StatefulSet we configure a unique DNS name for each Pod because they are contacted
directly and not through a common service. For each of these we also add ca-cert.jsonnet
to the list of files used from the configuration ConfigMap. We also need to add
it to the ConfigMap itself by adding it to the list in
config/kustomization.yaml
:
configMapGenerator:
- name: buildbarn-config
namespace: buildbarn
files:
- ...
- ca-cert.jsonnet
We can apply all these changes with:
kubectl apply -k .
kubectl rollout status -k . 2>&1 | grep -Ev "no status|unable to decode"
Now you can fetch the list of CertificateRequest objects to see their statuses:
kubectl -n buildbarn get certificaterequest
It will output one request for the ca
certificate named ca-1
and a bunch of
requests generated for each pod:
NAME APPROVED DENIED READY ISSUER REQUESTOR AGE
14468f64-909f-43d1-b67d-07b0844c0683 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m
1d9e41a6-e58f-4c13-b9e6-0b1ba1d5a4f6 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
2c2f1177-81fc-45e5-8487-9b66bc0d6f73 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
31fdb0ef-0c0b-4a06-94af-fb17875ee05d True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
376d0933-c0e9-4d39-b5c6-b76071c65966 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m58s
3967cdd6-7d48-4814-8cec-542041182dd0 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
464a1f35-f0ba-4236-aeec-294f880d9675 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m57s
5181e602-276e-413e-8888-76c4bd1ede21 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m57s
6f02092d-b8a3-4eb7-8ff2-5e4a433d59bb True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
710a458e-6ba0-4a44-87ab-5115b5a2c213 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m58s
753c4653-71ae-447e-bbe5-022ce35cee9d True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
8bcbb5a0-4575-40ad-b842-9c86bde8fdb8 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m56s
8df59bf5-ed23-47af-bfcc-3cf8a9053b9b True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m1s
b47fff23-40b4-43ed-8e34-35d988eb434d True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m56s
be72bdc6-c61d-4f1b-928e-f743df0f6188 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 4m57s
c14a52d5-dc20-4626-afe6-975442103d8b True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m
ca-1 True True selfsigned system:serviceaccount:cert-manager:cert-manager 3d22h
ceabf1ab-06a7-47c0-855a-2009bbbd2418 True True ca system:serviceaccount:cert-manager:cert-manager-csi-driver 5m
Using certificates
Now that we’ve generated all necessary certificates and made them available to
all pods, we can configure all components to use them. We’ll use similar
stanzas for each service, so let’s first add some helper functions to the top
of config/common.libsonnet
:
local localKeyPair = {
files: {
certificate_path: '/cert/tls.crt',
private_key_path: '/cert/tls.key',
refresh_interval: '3600s',
},
};
local grpcClientWithTLS = function(address) {
address: address,
tls: {
server_certificate_authorities: import 'ca-cert.jsonnet',
client_key_pair: localKeyPair,
},
};
local oneListenAddressWithTLS = function(address) [{
listenAddresses: [address],
authenticationPolicy: {
tls_client_certificate: {
client_certificate_authorities: import 'ca-cert.jsonnet',
validation_jmespath_expression: '`true`',
metadata_extraction_jmespath_expression: '`{}`',
},
},
tls: {
server_key_pair: localKeyPair,
},
}];
And then expose these functions to use in other configs at the end of the file:
...
grpcClientWithTLS: grpcClientWithTLS,
oneListenAddressWithTLS: oneListenAddressWithTLS,
}
Note that the local certificate and key files will be reloaded every hour per the
refresh_interval setting, but the CA certificate will need to be reconfigured
manually every time it is renewed.
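If the CA certificate is renewed, a manual refresh would look like this sketch, reusing the extraction command from above:
# Re-extract the current CA certificate into the config and roll it out again.
kubectl -n buildbarn get certificaterequests ca-1 -o jsonpath='{.status.ca}' | base64 -d | jq --raw-input --slurp . > config/ca-cert.jsonnet
kubectl apply -k .
kubectl rollout status -k . 2>&1 | grep -Ev "no status|unable to decode"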
Also note that we accept all valid certificates by setting
validation_jmespath_expression
to `true`
. This expression can be
configured later for each service if needed.
Now we’re ready to configure the Buildbarn services.
Storage
Let’s start with storage. The client side configuration is the same for all
services that connect to it and is stored in config/common.libsonnet
. Replace
lines like this one:
backend: { grpc: { address: 'storage-0.storage.buildbarn:8981' } },
with a call to our new function:
backend: { grpc: grpcClientWithTLS('storage-0.storage.buildbarn:8981') },
Keep the address the same (storage-0
and storage-1
should remain in place).
Now in config/storage.jsonnet
replace these GRPC server configuration lines:
grpcServers: [{
listenAddresses: [':8981'],
authenticationPolicy: { allow: {} },
}],
With a call to another function:
grpcServers: common.oneListenAddressWithTLS(':8981'),
Again, make sure that the address itself stays the same.
Now let’s apply it and wait for all pods to restart:
kubectl apply -k .
kubectl rollout status -k . 2>&1 | grep -Ev "no status|unable to decode"
Let’s check that the storage service is still accessible via the frontend service by rebuilding our example project:
pushd ../../examples/cpp-tutorial/stage1
bazelisk clean
bazelisk build --config=linux --config=remote //main:hello-world
popd
It should show that it fetched output from the remote cache:
...
INFO: 5 processes: 2 remote cache hit, 3 internal.
...
Scheduler
The scheduler exposes at least four GRPC endpoints, but we’ll cover only the
client (frontend) and worker sides as we don’t use other endpoints yet. Just
like with storage, you should replace clientGrpcServers
and
workerGrpcServers
settings with calls to oneListenAddressWithTLS
in
config/scheduler.jsonnet
, passing the addresses themselves as an argument:
...
clientGrpcServers: common.oneListenAddressWithTLS(':8982'),
workerGrpcServers: common.oneListenAddressWithTLS(':8983'),
...
The scheduler itself only connects to storage, and that part has already been
configured in config/common.libsonnet.
Workers
Workers only connect to the scheduler and storage. With the latter already
configured, we only need to change the scheduler setting in
config/worker-ubuntu22-04.jsonnet:
...
scheduler: common.grpcClientWithTLS('scheduler:8983'),
...
Frontend
The frontend listens for incoming connections from clients and fans them out, either
to storage or to the scheduler. Storage access has already been covered, so we
only need to replace the grpcServers and schedulers settings in
config/frontend.jsonnet:
grpcServers: common.oneListenAddressWithTLS(':8980'),
schedulers: {
'': {
endpoint: common.grpcClientWithTLS('scheduler:8982') {
addMetadataJmespathExpression: |||
{
"build.bazel.remote.execution.v2.requestmetadata-bin": incomingGRPCMetadata."build.bazel.remote.execution.v2.requestmetadata-bin"
}
|||,
},
},
},
Note that we preserve all addresses and keep the additional
addMetadataJmespathExpression
field that augments requests to the scheduler.
Applying it all
Now we can apply all these settings with:
kubectl apply -k .
kubectl rollout status -k . 2>&1 | grep -Ev "no status|unable to decode"
All deployments should eventually roll out and work. This means that all internal communications between Buildbarn components are encrypted and authenticated.
In this section we’ve achieved our goal of securing Buildbarn deployment using mTLS. Now all that’s left is to reconfigure Bazel to use and verify certificates while accessing Buildbarn’s RBE API endpoint.
Configuring certificates on the client
So far we’ve configured Buildbarn to always use TLS-encrypted connections. This
means that our current client setup will no longer work, because it
doesn’t expect TLS. In this section we’ll generate a client certificate
using the cmctl tool, configure Bazel to both validate the server certificate
and present this new client certificate when communicating with Buildbarn, and show
the final complete example.
First, as mentioned, if we run Bazel with the current client configuration it will fail because it uses an unencrypted connection to an encrypted endpoint:
pushd ../../examples/cpp-tutorial/stage1
bazelisk clean
bazelisk build --config=linux --config=remote //main:hello-world
The error will look like this:
INFO: Invocation ID: dc8188ca-e77f-4884-a596-612779c6ae33
ERROR: Failed to query remote execution capabilities: UNAVAILABLE: Network closed for unknown reason
To configure the client to use an encrypted connection, we need to replace
the grpc
protocol with grpcs
in .bazelrc
and try again:
sed -i s/grpc/grpcs/ .bazelrc
bazelisk build --config=linux --config=remote //main:hello-world
Now the error will indicate that something else is missing - in this case, a client certificate:
INFO: Invocation ID: 7dcb900f-17eb-4dbb-ab9c-df9c70bc2c92
ERROR: Failed to query remote execution capabilities: UNAVAILABLE: io exception
Channel Pipeline: [SslHandler#0, ProtocolNegotiators$ClientTlsHandler#0, WriteBufferingAndExceptionHandler#0, DefaultChannelPipeline$TailContext#0]
To address that, we need to generate client certificates and configure Bazel to use them.
Generating the client certificate
We will use cert-manager and its CLI client cmctl to generate a certificate
for our client. First, we need to create a Certificate object template in
cert-template.yaml:
cat > cert-template.yaml <<EOF
apiVersion: cert-manager.io/v1
kind: Certificate
spec:
commonName: client
usages:
- client auth
privateKey:
algorithm: ECDSA
size: 256
issuerRef:
name: ca
kind: Issuer
group: cert-manager.io
EOF
Then we can use it to create the actual certificate:
cmctl create certificaterequest -n buildbarn client --from-certificate-file cert-template.yaml --fetch-certificate
It will use this certificate template as if it were created in Kubernetes: it
will generate a key in client.key, create a Certificate Signing Request (CSR)
from it, embed that in a cert-manager CertificateRequest and send it, wait for
the server to sign it, and finally retrieve the resulting certificate to
client.crt.
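You can inspect what was issued with standard openssl commands; the file name matches the one created above:
# Show who the certificate identifies, who signed it, and its validity window.
openssl x509 -in client.crt -noout -subject -issuer -dates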
We also need a CA certificate to verify server certificates. We can use the same command we used for Buildbarn configuration here:
kubectl -n buildbarn get certificaterequests ca-1 -o jsonpath='{.status.ca}' | base64 -d > ca.crt
You can make sure that the client certificate is signed with this CA certificate by
adding openssl to shell.nix and running:
openssl verify -CAfile ca.crt client.crt
It will output client.crt: OK
if everything is correct.
Building with certificates
All that’s left is to tell Bazel to use these certificates to connect to
Buildbarn. We’ll need to convert the private key to PKCS#8 format for Bazel and
add these settings to .bazelrc:
openssl pkcs8 -topk8 -nocrypt -in client.key -out client.pem
echo "build:remote --tls_certificate=ca.crt" >> .bazelrc
echo "build:remote --tls_client_certificate=client.crt" >> .bazelrc
echo "build:remote --tls_client_key=client.pem" >> .bazelrc
Now let’s clean the Bazel cache and run the build:
bazelisk clean
bazelisk build --config=linux --config=remote //main:hello-world
You will see that the remote cache is in use, which means that TLS has been configured successfully:
...
INFO: Elapsed time: 0.601s, Critical Path: 0.10s
INFO: 5 processes: 2 remote cache hit, 3 internal.
...
To make sure that the actual build also works, we can change the source file a bit and re-run the build:
echo >> main/hello-world.cc
bazelisk build --config=linux --config=remote //main:hello-world
It will now take some time and actually show that it has built one action remotely:
...
INFO: Elapsed time: 15.866s, Critical Path: 15.69s
INFO: 2 processes: 1 internal, 1 remote.
...
Conclusion
We’ve shown how to deploy Buildbarn on Kubernetes, how to configure mTLS between all its components, and how to use TLS authentication with RBE API clients using Bazel as an example. This is a starting configuration that can be improved in several aspects not covered here:
- The Buildbarn browser and the scheduler web UIs are neither exposed nor encrypted;
- cert-manager is not configured to limit access to certificate generation, meaning that anyone with access to the Kubernetes API has access to all its capabilities;
- no limits are imposed on client certificates beyond being valid;
- there is no automation for client certificate renewal;
- and only certificates are used for authentication, which is secure but could be enhanced or replaced with OAuth, which is more flexible and provides finer-grained control.
All these are interesting topics that would each deserve their own blog post.