For a software team to be successful, you need excellent communication. That is why we want to build systems that foster cross-team communication. Using a monorepo is an excellent way to do that. A monorepo provides:
- Visibility: by seeing the pull requests (PRs) of colleagues, you are easily informed of what other teams are doing.
- Uniformity: by working in one central repository, it is easier to share the configuration of linters, formatters, etc. This makes it easy to use the same code style and documentation standards. Uniformity also smooths the onboarding of newcomers as well as the reassignment of engineers to different internal projects.
- Easier continuous integration: new code is picked up automatically by CI, without any manual interaction, ensuring uniformity and best practices.
- Atomic changes: because all libraries and projects are in one place, a large change can be implemented in one PR. This avoids the usual workflow of cascading updates, which causes mistakes to be caught later rather than sooner and creates friction in development.
Atomic changes are possible in our setup thanks to living at HEAD, a term popularized by Titus Winters, from Google. It means that all code in a monorepo depends on the code that is next to it on disk, as opposed to depending on released versions of the code in the same repository.
Designing a monorepo can be challenging, as it impacts the development workflow of all engineers. In addition, monorepos come with their own scaling challenges. Special care for tooling is required for a monorepo to stay performant as a team grows.
In this post, we describe a design for a Python monorepo: how we structure it; which tools we favor; alternatives that were considered; and some possible improvements. Before diving into the details, we would like to acknowledge the support we had from our client Kaiko, for which we did most of the work described in this series of blog posts.
Python environments: one global vs many local
Working on a Python project requires a Python environment (a.k.a. a sandbox), with a Python interpreter and the right Python dependencies (packages). When working on multiple projects, one can either use a single shared sandbox for all projects, or many specific ones, for each project.
On the one hand, a single sandbox for all projects makes it trivial to ensure that all developers and projects use a common set of dependencies. This is desirable as it reduces the scope of things to manage when implementing and debugging. Also, it ensures that all team members are working towards a shared, common knowledge about their software.
On the other hand, a single sandbox makes it impossible for different projects to use different versions of external dependencies. It also forces developers to install the dependencies of all projects, even when they only need a subset of them to work on a single project. These two facts can create friction among developers and reduce throughput.
To avoid losing flexibility, we decided to use multiple sandboxes, one per project. We will later improve the consistency of external dependencies across Python environments with dedicated tooling.
A sandbox can be created with Python’s standard venv module:
> python3 -m venv .venv
> source .venv/bin/activate
(.venv) > which python
/some/path/.venv/bin/python
We will later describe how this is put to use.
Choosing a Python package manager
In our scenario, we chose to stick with pip to install dependencies in sandboxes, because Poetry still doesn’t work well with PyTorch, a core library in the data ecosystem.
Over the years, pip has undergone important changes, such as the standardization of editable installs for pyproject.toml-based projects (PEP 660).
To improve reproducibility, we pin pip itself in a top-level pip-requirements.txt file, with the exact version of pip to use:
# Install a specific version of pip before installing any other package.
pip==22.2.2
It is important to install this exact version of pip before installing anything else.
Creating projects and libraries
In an organization, each team will be the owner of its own projects. For instance, there could be a web API, a collection of data processing jobs, and machine learning training pipelines. While each team is working on its own projects, it is most likely that a portion of their code is shared. Following the DRY (Don’t Repeat Yourself) principle, it is best to refactor those shared portions into libraries and make it a common effort that can benefit from everyone’s work.
In Python, there is no significant difference between projects and libraries; they are all Python packages. Because of that, we make no distinction between the two. However, for the sake of clarity, we split the monorepo structure into two top-level folders, one for projects and one for libraries:
├── libs/
└── projects/
This top-level organization highlights that libraries are shared across the entire organization.
To create a project or a library, a folder needs to be created in one or the other. It should then be populated with the following:
- A pyproject.toml file, which defines the Python package. It contains its metadata (name, version, description) and the list of its dependencies, for dependency resolution.
- A requirements.txt file, which serves as the basis for creating local sandboxes for developers and also as the default environment in continuous integration (CI). It has to list all direct dependencies, frozen at a specific version, in pip’s requirements file format. By freezing the versions, we don’t mean using pip freeze, because our requirements.txt files are manually maintained. We mean that we require the version numbers of dependencies to use the == specifier. This is sometimes also called pinning the dependencies, which goes a long way towards reproducibility. We explain below in more detail why we use both pyproject.toml and requirements.txt. In a nutshell, pyproject.toml is used as the central place for configuration and for deployment, while requirements.txt files are used for reproducibility in local environments and in the CI.
- A README.md file. Its purpose is to list the owners of this package: the people to contact if the package needs to evolve or is broken. It also contains a short description of what the package is about and example commands to run the code or test it. It is meant as a gentle introduction for newcomers to this package. Library owners are also specified in the top-level CODEOWNERS file, to tame the amount of notifications. We recommend configuring the repository so that reviewers are chosen automatically based on a pull request’s changes, by using CODEOWNERS to map changes to reviewers (see the sketch after this list).
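For illustration, here is a minimal sketch of what the top-level CODEOWNERS file could look like; the team handles and the web-api project are made up for the example:

# Fallback owners for anything not matched by a more specific rule below
*                    @mycorp/platform-team
# Map each library and project to the team that owns it
/libs/base/          @mycorp/platform-team
/projects/web-api/   @mycorp/web-team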
Formatting and linting
For formatting source code, we chose Black, because it is easy to use and is easily accepted by most developers. Our philosophy for choosing a formatter is simple: pick one and don’t discuss it for too long.
For linting source code, we chose Flake8, isort, and Pylint. We use Flake8 and isort without any tuning; again, the rationale is that the default checks are good and easily accepted. Regarding Pylint, we use it only to check that public symbols are documented. Because Pylint is more intrusive, activating more checks would have required lengthier discussions, which we decided were not worthwhile in the early days of our monorepo.
To make all the tools work well together, we need a little configuration:
> cat pyproject.toml # From the repository's top-level
[tool.black]
line-length = 100
target-version = ['py38']
[tool.pylint."messages control"]
ignore = ["setup.py", "__init__.py"]
disable = "all"
enable = [
"empty-docstring",
"missing-class-docstring",
"missing-function-docstring",
"missing-module-docstring"
]
[tool.isort]
profile = "black"
known_first_party = ["mycorp"] # see package configuration below
> cat .flake8 # From the repository's top-level
[flake8]
max-line-length = 100
# required for compatibility with Black:
extend-ignore = E203
exclude = .venv
We need to use both pyproject.toml and .flake8 because, as of writing, Flake8 doesn’t support pyproject.toml.[1]
In addition, we need pyproject.toml files in each project and library, because pyproject.toml files are used to list the direct dependencies of a package.
We will use the CI to ensure that nested pyproject.toml files and the top-level pyproject.toml file agree on the configuration of tools that are common to both.
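As a teaser of what this CI check can look like, here is an illustrative sketch in Python (not the actual script we use) that compares the shared tool sections; it relies on the standard tomllib module (Python 3.11+; the third-party tomli package offers the same API on older interpreters):

# check_tool_config.py -- illustrative sketch, run from the repository's top level.
# Verifies that every nested pyproject.toml agrees with the top-level one
# on the sections of shared tools (Black, isort, Pylint).
import sys
import tomllib  # Python 3.11+; use the third-party "tomli" package on older versions
from pathlib import Path

SHARED_SECTIONS = ["black", "isort", "pylint"]

def tool_sections(path: Path) -> dict:
    with path.open("rb") as f:
        tool = tomllib.load(f).get("tool", {})
    return {name: tool[name] for name in SHARED_SECTIONS if name in tool}

reference = tool_sections(Path("pyproject.toml"))
ok = True
for nested in Path(".").glob("*/*/pyproject.toml"):  # libs/*/ and projects/*/
    for name, config in tool_sections(nested).items():
        if name in reference and config != reference[name]:
            print(f"{nested}: [tool.{name}] differs from the top-level pyproject.toml")
            ok = False
sys.exit(0 if ok else 1)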
To pin the versions of all these tools for the entire monorepo, we have a top-level dev-requirements.txt file that contains the following:
black==22.3.0
flake8==4.0.1
isort==5.10.1
To recap what we’ve described so far, at this point our monorepo’s structure is:
├── .flake8
├── dev-requirements.txt
├── pip-requirements.txt
├── pyproject.toml
├── libs/
└── projects/
With this setup, you can format and lint code deterministically, both locally and in the CI, with:
python3 -m venv .venv
# Make the sandbox active in the current shell session
source .venv/bin/activate
# Install pinned pip first
pip install -r pip-requirements.txt
# Install shared development dependencies, in a second step
# to use the pinned pip version
pip install -r dev-requirements.txt
# Black, Flake8, and isort are now available. Use them as follows:
black --check .
flake8 .
isort --check-only .
Note that it is possible to do this at the top-level, because Black, Flake8, and isort don’t need the external dependencies of the monorepo’s libraries and projects to be installed.
Typechecking
To typecheck code, we chose Microsoft’s Pyright. In our benchmarks, it proved noticeably faster than mypy and seems more widely used than Facebook’s Pyre. Compared to mypy, Pyright also has the advantage that it can execute as you type: it gives feedback without requiring you to save the file being edited. Because mypy has a noticeable startup time, this made for a significant difference in user experience, consolidating our choice in favor of Pyright.
Pyright has different levels of checking. We stick to the default settings, called basic. These settings make the tool easily accepted: if your code is not annotated, Pyright mostly remains silent. If your code is annotated, in our experience, Pyright reports only errors that are relevant. In the rare cases where it reported false positives (i.e. reporting an error where there isn’t one), the context made them understandable, for example when type-correctness depends on a computation that cannot be statically analyzed.
Pyright also works really well no matter the amount of annotations that external dependencies (i.e. libraries outside the monorepo) ship with: it relies on external annotations if available, and falls back to type inference by crawling source code otherwise. Either way, we observed that Pyright infers correct types, even for data-science libraries with a lot of union types (such as pandas).
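To give a feel for it, here is a tiny made-up example of annotated code that Pyright’s default checks will flag:

from typing import List

def mean(values: List[float]) -> float:
    return sum(values) / len(values)

mean([1.0, 2.0, 3.0])  # fine
mean("1, 2, 3")        # flagged by Pyright: a str is not a List[float]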
To enable typechecking with Pyright, we specify a pinned version in the shared top-level dev-requirements.txt as follows:
pyright==1.1.239
and we configure it in the pyproject.toml file of every project and library:
> cat pyproject.toml
...
[tool.pyright]
reportMissingTypeArgument = true # Report generic classes used without type arguments
strictListInference = true # Use union types when inferring types of list elements, instead of Any
As with pip and other tools, pinning Pyright’s version helps make local development and the CI deterministic.
Testing
To test our code, we chose pytest. This was an obvious choice, because all concerned developers have experience with it and it wasn’t challenged.
Among pytest’s qualities, we can cite its good progress reporting while the tests are running and easy integration with test coverage.
To make pytest available, we again specify a pinned version in the shared top-level dev-requirements.txt as follows:
pytest==7.0.1
pytest-cov==3.0.0 # Coverage extension
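For instance, from a library’s directory, the test suite can be run with a coverage report as follows (mycorp being the shared namespace introduced below):

# Run the tests and report coverage of the mycorp namespace
pytest --cov=mycorp --cov-report=term-missing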
Sandboxes
With all of the above in place, we are now able to create sandboxes to obtain comfortable development environments.
For example, suppose we have one library named base, yielding the following monorepo structure:
├── .flake8
├── dev-requirements.txt
├── pip-requirements.txt
├── pyproject.toml
├── libs/
│ └── base/
│ ├── README.md
│ ├── pyproject.toml
│ └── requirements.txt
└── projects/
To create base’s development environment, go to the directory libs/base and execute:
python3 -m venv .venv
# Make the sandbox active in the current shell session
source .venv/bin/activate
# Install pinned pip first
pip install -r $(git rev-parse --show-toplevel)/pip-requirements.txt
# Install shared development dependencies and project/library-specific dependencies
pip install -r $(git rev-parse --show-toplevel)/dev-requirements.txt -r requirements.txt
# With project-specific dependencies installed, typecheck your code as follows:
pyright .
This could be shortened by using a tool like Nox.
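For reference, here is an illustrative sketch of what such a noxfile.py could look like for a library located two levels below the repository root; we did not adopt Nox, so treat it as a starting point rather than our actual setup:

# noxfile.py -- illustrative sketch, placed next to a library's pyproject.toml
import nox

@nox.session(python="3.8")
def checks(session: nox.Session) -> None:
    """Create a sandbox, install pinned tooling and dependencies, run all checks."""
    session.install("-r", "../../pip-requirements.txt")   # pinned pip first
    session.install("-r", "../../dev-requirements.txt", "-r", "requirements.txt")
    session.run("black", "--check", ".")
    session.run("flake8", ".")
    session.run("isort", "--check-only", ".")
    session.run("pyright", ".")
    session.run("pytest")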
Configuration of a package
We use a common namespace in all projects and libraries. This avoids one level of nesting by dropping the src folder (which was the historical way of doing things, called the src layout). Supposing that we choose mycorp as the namespace, this means that the code of the library libs/base lives in the directory libs/base/mycorp, and the pyproject.toml of the library must contain:
[tool.poetry]
...
packages = [
    { include = "mycorp" }
]
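Concretely, assuming the library’s modules live in a base subfolder under the namespace (our assumption for this example), libs/base would look like this; note that the mycorp folder itself has no __init__.py, so that several libraries can contribute to the same namespace:

libs/base/
├── README.md
├── pyproject.toml
├── requirements.txt
└── mycorp/
    └── base/
        ├── __init__.py
        └── ...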
We now come to an important topic: the difference between pyproject.toml and requirements.txt for declaring dependencies.
requirements.txt
requirements.txt files are meant to be used to install dependencies in sandboxes, both on the developers’ machines and in the CI. requirements.txt files specify both local dependencies (packages developed in the monorepo itself) and external dependencies (dependencies which are usually hosted on PyPI, such as NumPy and pandas).
To install local dependencies in our development sandboxes, we use editable installs. If a library A depends on a library B, this makes changes to B immediately available to developers of A: A depends on the code of B that is next to it in the monorepo, not on a released version. This allows the implementation of a live at HEAD workflow, as detailed below.
The requirements.txt file of a library should include all direct dependencies, and they should be pinned, i.e. specify exact version numbers, using the == operator. By using pinned dependencies, we achieve a good level of reproducibility. We don’t list transitive dependencies, because that would amount to maintaining a lockfile manually, which would be very tedious.
We used neither pip-compile nor pipenv, because they don’t work well in multi-platform scenarios, as visible here for the former and here for the latter.
On this topic and others, we would have loved to use Poetry, which provides a dedicated CLI for managing lockfiles. That would have allowed us to pin both direct and transitive dependencies. However, Poetry doesn’t play well with an essential data-science package: PyTorch.[2] If Poetry and PyTorch start working well together in the future, our setup can transition smoothly, because we use the pyproject.toml file as the central place for storing configuration, as we discuss below.
Note that our requirements.txt files don’t specify the hashes of dependencies (usually done with --hash in pip). Hashes are used to defend against supply-chain attacks. As our setup is meant for starting a monorepo, we intentionally don’t delve into this more advanced topic in this post.
pyproject.toml
Generally speaking, pyproject.toml files are used to configure a project or library. In this section, we only deal with specifying dependencies, i.e. how we write the [tool.poetry.dependencies] section.[3]
In our setup, pyproject.toml files specify dependencies for deployment. Because of that, dependencies in pyproject.toml files should be loose, to avoid blocking the use of your code in a variety of environments.[4] Taking NumPy as an example, a simple rule to specify dependencies in this scenario is to use numpy = "^X.Y.Z" where either (1) X.Y.Z is the version of NumPy in use when starting to depend on it, or (2) X.Y.Z is the version of NumPy introducing a feature depended upon. Poetry’s documentation provides good guidance on possible specifiers.
Example
To demonstrate how this setup works in practice, let’s introduce a new library libs/fancy that depends on the local dependency libs/base and the external dependency numpy:
...as above...
└── libs/
├── base/
│ └── ...as above...
└── fancy/
├── README.md
├── pyproject.toml
└── requirements.txt
libs/fancy/requirements.txt is as follows:
-e ../base # local dependency, use editable install
numpy==1.22.3 # external dependency, installed from PyPI
libs/fancy/pyproject.toml is as follows:
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
[tool.poetry]
name = "mycorp-fancy"
version = "0.0.1"
description = "Mycorp's fancy library"
authors = [ "Tweag <[email protected]>", ]
packages = [
{ include = "mycorp" }
]
[tool.poetry.dependencies]
python = ">=3.8"
numpy = "^1.22.3"
mycorp-base = "*"
# tooling configuration, omitted (same as top-level pyproject.toml)
In the spirit of our explanations above:
- requirements.txt uses an editable install to specify the dependency on libs/base, with -e ../base.
- pyproject.toml uses the very loose "*" qualifier to specify the dependency on libs/base.
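Because pyproject.toml declares a PEP 517 build backend (poetry-core, see the footnote below), a library can be built for deployment with standard tools; for example, assuming the build package is installed:

# Build a wheel and an sdist for the fancy library (output goes to libs/fancy/dist/)
python -m build libs/fancy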
Living at HEAD
As mentioned above, our monorepo obeys the living at HEAD workflow made popular by Google. In our simple example above, it means that the library fancy depends on the code of the library base that is next to it on disk (hence the Git term at HEAD), not on a released version of base.
This makes it possible to perform atomic updates of the entire monorepo in a single PR. Whereas in a traditional polyrepo setup with separate releases, to use a new version of base, one would have to update base first (in its own PR), then release it, then update fancy to make it use the latest version of base (in another PR). In the wild, a polyrepo setup either creates cascading PRs, increasing the time it takes to perform updates crossing various libraries, or causes code duplication.
In this vein, the use of editable installs in our setup is our Python-specific implementation of living at HEAD.
Updating dependencies
In our setup, both the top-level dev-requirements.txt file and each library’s requirements.txt file are manually maintained. Admittedly, maintaining these files manually is not going to scale in the long term, but it makes for a simple start while providing a good level of reproducibility.
Here is how updates occur:
- A library developer needs to update a dependency, because they need a new feature that was released recently. In this case, we recommend updating this dependency in the concerned library, but also in all other places where this dependency is used, to maintain a high level of consistency within the monorepo (see the sketch after this list). Because we advocate for good test coverage, we assume that, if the tests of the modified libraries still pass after the update of the dependency, the PR updating the dependencies can be safely merged.
- A number of teams reach the end of their sprints and are preparing the next iteration, or it’s a low-intensity period (e.g. summer vacation). In this case, it’s a good time to update dependencies that are getting old, while keeping friction low, because work happening in parallel is limited. This scenario aligns well with the fact that our monorepo’s setup minimizes surprises: by pinning dependencies, we minimize the chances of unplanned breakage (which could happen if we were pulling the latest versions of dependencies). As a consequence, we can separate the periods where features are being rolled out from the periods where maintenance (such as updating dependencies) happens.
- A bot checking for vulnerabilities, such as Dependabot, creates a PR to update a specific dependency. If the dependency is used with the same version everywhere in the monorepo, the bot will create a PR that updates the entire monorepo at once. Again, if working under the assumption that test coverage is good, such PRs can be merged quickly if tests pass.
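As for the first scenario, finding all the places where a dependency is pinned boils down to a simple search over the requirements files, for example:

# List every requirements file that pins numpy, so each pin can be updated consistently
git grep -n "numpy==" -- "*requirements.txt"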
Conclusion
So far we have seen a monorepo structure that features:
- a streamlined structure for libraries and projects,
- unified formatting, linting, and typechecking, and
- a Python implementation of the live at HEAD workflow.
We showed a setup that is both simple and achieves a great level of reproducibility, while being easy to communicate about; all the tools mentioned in this post are well known to seasoned Python developers.
In post two of this series, we describe how to implement a CI for this monorepo using simple GitHub Actions, and how templating can be used to ease onboarding and maintain consistency as more developers start working in the monorepo.
1. See Flake8 issue #234. ↩
2. We are not going to detail it here, but multi-platform lockfile support for projects that depend on PyTorch is still an issue today. See Poetry issues #4231, #4704, and #6939 for recent developments. In this post, we hence stick to a simpler pip-based approach, which can easily be amended to use Poetry instead. ↩
3. Despite using [tool.poetry] in pyproject.toml here, we don’t use Poetry the tool; we use Poetry the backend. This was made possible by PEP 517, which made pip capable of consuming information in pyproject.toml. This is enabled by the stanza build-backend = "poetry.core.masonry.api" above. We did so to circumvent a bug in setuptools and because it made pyproject.toml the central place for configuration. ↩
4. pip has been using a new dependency resolver since v20.3 (October 2020), which can throw an error if version bounds are incompatible between the packages to install. If version bounds are too tight for a dependency, a conflict of versions for this dependency can happen and block the installation. When code is deployed into an environment with many other packages, it will possibly be in the presence of versions of dependencies that haven’t been used so far. Specifying exact version numbers in pyproject.toml would make this impossible and as such is not desirable. ↩
About the authors
Guillaume is a versatile engineer based in Paris, with fluency in machine learning, data engineering, web development and functional programming.
Clément is a Senior Software Engineer that straddles the management/engineer boundary. Clément studied Computer Science at Telecom Nancy and received his PhD from Université Nice Sophia Antipolis, where he proved multithreaded programs using linear logic.