At Tweag, we are constantly striving to improve the developer experience by contributing tools and utilities that streamline workflows.
We recently completed a project with IMAX, where we learned that they had developed a way to simplify and optimize the process of integrating Google Cloud Storage (GCS) with Bazel. Seeing value in this tool for the broader community, we decided to publish it together under an open source license. In this blog post, we’ll dive into the features, installation, and usage of `rules_gcs`, and how it provides you with access to private resources.
## What is `rules_gcs`?
`rules_gcs` is a Bazel ruleset that facilitates downloading files from Google Cloud Storage. It is designed to be a drop-in replacement for Bazel’s `http_file` and `http_archive` rules, with features that make it particularly suited for GCS. With `rules_gcs`, you can efficiently fetch large amounts of data, leverage Bazel’s repository cache, and handle private GCS buckets with ease.
### Key Features
- **Drop-in Replacement**: `rules_gcs` provides `gcs_file` and `gcs_archive` rules that can directly replace `http_file` and `http_archive`. They take a `gs://bucket_name/object_name` URL and internally translate it to an HTTPS URL. This makes it easy to transition to GCS-specific rules without major changes to your existing Bazel setup.
- **Lazy Fetching with `gcs_bucket`**: For projects that require downloading multiple objects from a GCS bucket, `rules_gcs` includes a `gcs_bucket` module extension. This feature allows for lazy fetching, meaning objects are only downloaded as needed, which can save time and bandwidth, especially in large-scale projects.
- **Private Bucket Support**: Accessing private GCS buckets is handled seamlessly by `rules_gcs`. The ruleset supports credential management through a credential helper, ensuring secure access without the need to hardcode credentials or use `gsutil` for downloading.
- **Bazel Downloader Integration**: `rules_gcs` uses Bazel’s built-in downloader and repository cache, optimizing the download process and ensuring that files are cached efficiently across builds, even across multiple Bazel workspaces on your local machine.
- **Small Footprint**: Apart from the `gcloud` CLI tool (used to obtain authentication tokens), `rules_gcs` requires no additional dependencies or Bazel modules. This minimal approach reduces setup complexity and potential conflicts with other tools.
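To illustrate the URL handling mentioned above, here is a small sketch (not part of the ruleset itself) of how a `gs://` URL maps to the HTTPS endpoint that Bazel’s downloader ultimately fetches, using GCS’s standard `storage.googleapis.com` form:

```python
def gcs_to_https(url: str) -> str:
    """Translate a gs://bucket/object URL to its HTTPS equivalent."""
    if not url.startswith("gs://"):
        raise ValueError(f"not a gs:// URL: {url}")
    # Split off the bucket name; the remainder is the object path.
    bucket, _, obj = url[len("gs://"):].partition("/")
    return f"https://storage.googleapis.com/{bucket}/{obj}"

print(gcs_to_https("gs://my_org_assets/testdata.bin"))
# prints https://storage.googleapis.com/my_org_assets/testdata.bin
```

Because the translated URL is a plain HTTPS URL, Bazel can fetch it with its regular downloader and cache machinery.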
## Understanding Bazel Repositories and Efficient Object Fetching with `rules_gcs`
Before we dive into the specifics of `rules_gcs`, it’s important to understand some key concepts about Bazel repositories and repository rules, as well as the challenges of efficiently managing large collections of objects from a Google Cloud Storage (GCS) bucket.
### Bazel Repositories and Repository Rules
In Bazel, external dependencies are managed using repositories, which are declared in your `WORKSPACE` or `MODULE.bazel` file. Each repository corresponds to a package of code, binaries, or other resources that Bazel fetches and makes available for your build. Repository rules, such as `http_archive` or `git_repository`, and module extensions define how Bazel should download and prepare these external dependencies.
However, when dealing with a large number of objects, such as files stored in a GCS bucket, using a single repository to download all objects can be highly inefficient. This is because Bazel’s repository rules typically operate in an “eager” manner—they fetch all the specified files as soon as any target of the repository is needed. For large buckets, this means downloading potentially gigabytes of data even if only a few files are actually needed for the build. This eager fetching can lead to unnecessary network usage, increased build times, and larger disk footprints.
### The `rules_gcs` Approach: Lazy Fetching with a Hub Repository
`rules_gcs` addresses this inefficiency by introducing a more granular approach to downloading objects from GCS. Instead of downloading all objects at once into a single repository, `rules_gcs` uses a module extension that creates a “hub” repository, which then manages individual sub-repositories for each GCS object.
### How It Works
- **Hub Repository**: The hub repository acts as a central point of reference, containing metadata about the individual GCS objects. This follows the “hub-and-spoke” paradigm, with a central repository (the bucket) containing references to a large number of small repositories, one per object. This architecture is commonly used by Bazel module extensions to manage dependencies for different language ecosystems (including Python and Rust).
- **Individual Repositories per GCS Object**: For each GCS object specified in the lockfile, `rules_gcs` creates a separate repository using the `gcs_file` rule. This allows Bazel to fetch each object lazily, downloading only the files that are actually needed for the current build.
- **Methods of Fetching**: Users can choose between different methods in the `gcs_bucket` module extension. The default method of creating symlinks is efficient while preserving the file structure set in the lockfile. If you need to access objects as regular files, choose one of the other methods.
  - **Symlink**: Creates a symlink from the hub repo pointing to a file in its object repo, ensuring the object repo and the symlink pointing to it are created only when the file is accessed.
  - **Alias**: Similar to symlink, but uses Bazel’s aliasing mechanism to reference the file. No files are created in the hub repo.
  - **Copy**: Creates a copy of the file in the hub repo when it is accessed.
  - **Eager**: Downloads all specified objects upfront into a single repository.
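For readers who prefer a concrete picture, a fetch method could be selected on the module extension roughly as follows. This is a sketch: the attribute name `method` and its accepted string values are assumptions based on the methods listed above, so check the `rules_gcs` documentation for the exact API.

```starlark
gcs_bucket = use_extension("@rules_gcs//gcs:extensions.bzl", "gcs_bucket")
gcs_bucket.from_file(
    name = "trainingdata",
    bucket = "my_org_assets",
    lockfile = "@//:gcs_lock.json",
    # Assumed attribute: one of "symlink" (default), "alias", "copy", "eager".
    method = "copy",
)
```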
This modular approach is particularly beneficial for large-scale projects where only a subset of the data is needed for most builds. By fetching objects lazily, `rules_gcs` minimizes unnecessary data transfer and reduces build times.
## Integrating with Bazel’s Credential Helper Protocol
Another critical aspect of `rules_gcs` is its seamless integration with Bazel’s credential management system. Accessing private GCS buckets securely requires proper authentication, and Bazel uses a credential helper protocol to handle this.
### How Bazel’s Credential Helper Protocol Works
Bazel’s credential helper protocol is a mechanism that allows Bazel to fetch authentication credentials dynamically when accessing private resources, such as a GCS bucket. The protocol is designed to be simple and secure, ensuring that credentials are only used when necessary and are never hardcoded into build files.
When Bazel’s downloader prepares a request and a credential helper is configured, it invokes the helper with the command `get` and passes the request URI, encoded as JSON, to the helper’s standard input. The helper is expected to return a JSON object containing HTTP headers, including the necessary `Authorization` token, which Bazel will then include in its requests.
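Concretely, for an object in a private bucket, the exchange might look like this (the token value is a placeholder):

```
# Bazel writes the request to the helper's stdin:
{"uri": "https://storage.googleapis.com/my_org_assets/testdata.bin"}

# The helper responds on stdout with the headers to attach:
{"headers": {"Authorization": ["Bearer <token>"]}}
```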
Here’s a breakdown of how the `credential_helper` script used in `rules_gcs` works:

- **Authentication Token Retrieval**: The script uses the `gcloud` CLI tool to obtain an access token via `gcloud auth application-default print-access-token`. This token is tied to the user’s current authentication context and can be used to fetch any objects the user is allowed to access.
- **Output Format**: The script outputs the token in a JSON format that Bazel can directly use:

  ```json
  { "headers": { "Authorization": ["Bearer ${TOKEN}"] } }
  ```

  This JSON object includes the `Authorization` header, which Bazel uses to authenticate its requests to the GCS bucket.
- **Integration with Bazel**: To use this credential helper, you need to configure Bazel by specifying the helper in the `.bazelrc` file:

  ```
  common --credential_helper=storage.googleapis.com=%workspace%/tools/credential-helper
  ```

  This line tells Bazel to use the specified `credential_helper` script whenever it needs to access resources from `storage.googleapis.com`. If a request returns an error code or unexpected content, credentials are invalidated and the helper is invoked again.
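Putting the pieces together, a helper implementing this protocol can be quite small. The sketch below mirrors the shell script’s behavior in Python; it is a hypothetical illustration, not the script shipped with `rules_gcs`, and it assumes the `gcloud` CLI is installed and authenticated:

```python
import json
import subprocess
import sys


def format_credentials(token: str) -> str:
    """Render the JSON response Bazel expects from a credential helper."""
    return json.dumps({"headers": {"Authorization": ["Bearer " + token]}})


def get_gcloud_token() -> str:
    """Ask the gcloud CLI for an access token for the current user."""
    result = subprocess.run(
        ["gcloud", "auth", "application-default", "print-access-token"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


# Bazel invokes the helper as `<helper> get`; other invocations are ignored.
if __name__ == "__main__" and sys.argv[1:] == ["get"]:
    json.load(sys.stdin)  # Request object with the URI; unused here, since
    # the same token works for every object the user may access.
    print(format_credentials(get_gcloud_token()))
```

Because the response format is independent of the request URI, the helper stays stateless and trivially cacheable by Bazel.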
### How `rules_gcs` Hooks Into the Credential Helper Protocol
`rules_gcs` leverages this credential helper protocol to manage access to private GCS buckets securely and efficiently. By providing a pre-configured credential helper script, `rules_gcs` ensures that users can easily set up secure access without needing to manage tokens or authentication details manually.
Moreover, by limiting the scope of the credential helper to the GCS domain (`storage.googleapis.com`), `rules_gcs` reduces the risk of credentials being misused or accidentally exposed. The helper script is designed to be lightweight, relying on existing `gcloud` credentials, and integrates seamlessly into the Bazel build process.
## Installing `rules_gcs`
Adding `rules_gcs` to your Bazel project is straightforward. The latest version is available on the Bazel Central Registry. To install, simply add the following to your `MODULE.bazel` file:

```starlark
bazel_dep(name = "rules_gcs", version = "1.0.0")
```
You will also need to include the credential helper script in your repository:

```shell
mkdir -p tools
wget -O tools/credential-helper https://raw.githubusercontent.com/tweag/rules_gcs/main/tools/credential-helper
chmod +x tools/credential-helper
```
Next, configure Bazel to use the credential helper by adding the following lines to your `.bazelrc`:

```
common --credential_helper=storage.googleapis.com=%workspace%/tools/credential-helper
# optional setting to make rules_gcs more efficient
common --experimental_repository_cache_hardlinks
```
These settings ensure that Bazel uses the credential helper specifically for GCS requests. Additionally, `--experimental_repository_cache_hardlinks` allows Bazel to hardlink files from the repository cache instead of copying them into a repository. This saves time and storage space, but requires the repository cache to be located on the same filesystem as the output base.
## Using `rules_gcs` in Your Project
`rules_gcs` provides three primary entry points: the `gcs_bucket` module extension and the `gcs_file` and `gcs_archive` repository rules. Here’s a quick overview of how to use each:
- **`gcs_bucket`**: When dealing with multiple files from a GCS bucket, the `gcs_bucket` module extension offers a powerful and efficient way to manage these dependencies. You define the objects in a JSON lockfile, and `gcs_bucket` handles the rest.

  ```starlark
  gcs_bucket = use_extension("@rules_gcs//gcs:extensions.bzl", "gcs_bucket")
  gcs_bucket.from_file(
      name = "trainingdata",
      bucket = "my_org_assets",
      lockfile = "@//:gcs_lock.json",
  )
  ```

- **`gcs_file`**: Use this rule to download a single file from GCS. It’s particularly useful for pulling in assets or binaries needed during your build or test processes. Since it is a repository rule, you have to invoke it with `use_repo_rule` in a `MODULE.bazel` file (or wrap it in a module extension).

  ```starlark
  gcs_file = use_repo_rule("@rules_gcs//gcs:repo_rules.bzl", "gcs_file")
  gcs_file(
      name = "my_testdata",
      url = "gs://my_org_assets/testdata.bin",
      sha256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
  )
  ```

- **`gcs_archive`**: This rule downloads and extracts an archive from GCS, making it ideal for pulling in entire repositories or libraries that your project depends on. Like `gcs_file`, it is a repository rule, so you have to invoke it with `use_repo_rule` in a `MODULE.bazel` file (or wrap it in a module extension).

  ```starlark
  gcs_archive = use_repo_rule("@rules_gcs//gcs:repo_rules.bzl", "gcs_archive")
  gcs_archive(
      name = "magic",
      url = "gs://my_org_code/libmagic.tar.gz",
      sha256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
      build_file = "@//:magic.BUILD",
  )
  ```
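Once declared, the downloaded objects can be consumed like any other Bazel target. Given that `gcs_file` is a drop-in replacement for `http_file`, it is reasonable to assume the downloaded object is exposed under a `file` target in its repository; the BUILD snippet below is a sketch under that assumption:

```starlark
# BUILD.bazel in a hypothetical consuming package
sh_test(
    name = "uses_testdata",
    srcs = ["uses_testdata.sh"],
    # Reference the object fetched by the gcs_file declaration above.
    data = ["@my_testdata//file"],
)
```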
## Try it Out
`rules_gcs` is a versatile and simple solution for integrating Google Cloud Storage with Bazel. We invite you to try out `rules_gcs` in your projects and contribute to its development. As always, we welcome feedback and look forward to seeing how this tool enhances your workflows. Check out the full example to get started!
Thanks to IMAX for sharing their initial implementation of `rules_gcs` and allowing us to publish the code under an open source license.
## About the author
Malte is a software engineer with a background in security. In the process of improving the supply chain security and reproducibility of security-critical software, he has gained experience with Bazel and Nix. He is passionate about building secure and reliable systems and enjoys tinkering with immutable operating systems.
If you enjoyed this article, you might be interested in joining the Tweag team.