We have previously introduced FawltyDeps, a tool to help Python projects avoid the dreaded, and seemingly unavoidable, state where dependencies declared in the configuration do not match those actually imported in the code [1]. FawltyDeps is the perfect addition to your CI, your pre-commit hooks, or your dependency management arsenal.
Curious to know how FawltyDeps works its magic? In this sequel we’ll delve into an essential component of FawltyDeps: how it matches imports and dependencies behind the scenes, and why it is important to get this matching right.
We’ve been busy working on an improved mapping strategy that combines versatility with simplicity, and we have come a long way from the quite limited version we presented in our first announcement. By the end of this post, you’ll have a solid understanding of FawltyDeps’ brand new mapping options and how to tailor them to your project’s unique context and needs.
Matching imports and dependencies
Simply put, FawltyDeps extracts imports from your code, and dependencies declared in your project configuration, and matches them against each other:
- the imports that are not present in your declared dependencies are reported as undeclared dependencies
- the declared dependencies that are not imported in your code are reported as unused dependencies.
When matching imports and dependencies, we first assume that a dependency (specifically: the package it references) and an import have the same name. This approximation works well for many Python packages.
`numpy` is a good example: in your code, you write `import numpy`, and to install it you run `pip install numpy`, or you list `numpy` in your `requirements.txt` (or wherever you list your project dependencies).
Problem solved! So why are we even writing this post?
It turns out that, as always, things are not that simple™. Many packages provide import names that are different from the package name. For example:
- You depend on the `pyyaml` package, but you import `yaml` (as seen in Figure 1).
- You depend on the `scikit-learn` package, but you import `sklearn`.
- You depend on the `setuptools` package, but you import either `pkg_resources` or some other import, as `setuptools` exposes multiple imports.
Clearly our first approximation (hereafter referred to as the identity mapping) is not good enough. To solve this, we need a smarter mapping: a way to figure out which packages correspond to which imports. In practice, there are a few different ways to acquire these mappings, each having its advantages and limitations. Our main goal here is to lay out the mappings we support in FawltyDeps, and explain how they can be used individually or together to resolve packages into their respective imports.
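To make the problem concrete, here is a minimal sketch of the matching step. It is not FawltyDeps’ actual code: the function names `identity_mapping` and `compare` are made up for this illustration, and the imports and dependencies are hard-coded rather than extracted from a project.

```python
from __future__ import annotations


def identity_mapping(dependency: str) -> set[str]:
    """Assume a package provides exactly one import, named like the package."""
    return {dependency}


def compare(imports: set[str], dep_to_imports: dict[str, set[str]]):
    """Return (undeclared imports, unused dependencies)."""
    provided = set().union(*dep_to_imports.values())
    undeclared = imports - provided
    unused = {dep for dep, names in dep_to_imports.items() if not names & imports}
    return undeclared, unused


imports = {"numpy", "yaml"}      # what the code imports
declared = ["numpy", "pyyaml"]   # what the project declares

# With the identity mapping, pyyaml is assumed to provide "pyyaml".
naive = {dep: identity_mapping(dep) for dep in declared}
undeclared, unused = compare(imports, naive)
print(sorted(undeclared), sorted(unused))  # ['yaml'] ['pyyaml']

# With a correct mapping, the same project has nothing to report.
correct = {"numpy": {"numpy"}, "pyyaml": {"yaml"}}
print(compare(imports, correct))  # (set(), set())
```

The project is the same in both runs; only the mapping differs, which is why getting the mapping right matters so much for the final report.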
Mapping from already-installed packages
Arguably, the only correct way for FawltyDeps to match packages to imports is to actually ask each package what imports it provides. FawltyDeps can do this [3], but it first needs to find where the packages are installed, and that turns out to be more complicated than one might think.
In the first versions of FawltyDeps, we had not yet properly drilled into this issue. Instead, we only looked at the Python environment in which FawltyDeps itself was already running, and we simply assumed that your project dependencies should be installed into the same environment [4]. If a dependency of your project was not found in this environment, we would fall back to the identity mapping.
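As a rough idea of what “asking each package” can look like, the sketch below uses the standard library’s `importlib.metadata.packages_distributions()` (available from Python 3.10), which reports, for the current environment, which distributions provide each top-level import name. This is an illustration only and not necessarily how FawltyDeps does it; the helper names `invert` and `installed_mapping` are made up here.

```python
from __future__ import annotations

import importlib.metadata
from collections import defaultdict


def invert(import_to_dists):
    """Turn {import name: [distribution, ...]} into {distribution: {import name, ...}}."""
    dist_to_imports = defaultdict(set)
    for import_name, dists in import_to_dists.items():
        for dist in dists:
            dist_to_imports[dist].add(import_name)
    return dict(dist_to_imports)


def installed_mapping():
    """Package-to-imports mapping for the environment this interpreter runs in."""
    # packages_distributions() requires Python 3.10 or later.
    return invert(importlib.metadata.packages_distributions())
```

Note that this only sees the environment of the running interpreter, which is exactly the limitation discussed next, and that distribution names are reported as written in the package metadata (for example `PyYAML`), so they may need normalizing before being compared with declared dependency names.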
This meant simply pushing the problem onto the user, however, and making FawltyDeps harder to use. What we wanted instead was for FawltyDeps to resolve the dependencies wherever they may be installed. This is where things can get very complicated: In general, there is a bewildering variety of ways to install dependencies in the Python world.
We are not going to open the entire Pandora’s box of Python packaging and dependency management in this blog post, except to note some examples of where Python packages (specifically: your project’s 3rd-party dependencies) can typically be found:
- System-wide package locations, like those found under `/usr/lib/python*` or `/usr/local/lib/python*` (whether installed by your system’s package manager or system-wide `pip install`).
- User-specific packages, installed by tools like `pipx install` or `pip install --user`.
- Virtual environments (from `venv`, `virtualenv`, Poetry, PDM, etc.), located either within your project, or somewhere else.
- Other, less common, methods or locations [5] that resemble any of the above.
We would like to have FawltyDeps work with as many of these as possible, and furthermore, when it’s possible: to have FawltyDeps automatically discover and use them by default.
As of v0.13.0 we have come a long way towards realizing this vision: We support the kinds of Python environments mentioned above (for FawltyDeps’ purpose, a “Python environment” really means any directory in which Python packages could be installed), and the following diagram outlines how FawltyDeps determines which Python environments are used to look up the project’s dependencies:
In other words:
- The `--pyenv` option lets you point to one or more Python environments. All of these environments will be used when matching dependencies to imports.
- If `--pyenv` is not used, FawltyDeps will automatically find and use Python environments that exist within your project directories (i.e. within any directory that is passed as a positional argument to FawltyDeps, aka. “basepath”, or the current directory by default).
- If no Python environment is found by the two methods above, FawltyDeps will fall back to using the environment in which it’s running.
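The three rules above amount to a simple order of precedence. The sketch below spells it out; the function `choose_environments` and its parameters are hypothetical names for this illustration, and the discovery of environments inside the project is assumed to have happened elsewhere.

```python
from __future__ import annotations

from pathlib import Path


def choose_environments(
    pyenv_options: list[Path],  # paths given via --pyenv (may be empty)
    discovered: list[Path],     # environments found under the project's basepaths
    own_env: Path,              # environment in which the tool itself runs
) -> list[Path]:
    """Pick the Python environments used to look up dependencies."""
    if pyenv_options:
        return list(pyenv_options)  # explicit choice wins
    if discovered:
        return list(discovered)     # otherwise use what the project contains
    return [own_env]                # last resort: the tool's own environment
```

For example, `choose_environments([], [Path(".venv")], Path("/usr"))` returns `[Path(".venv")]`, while passing two empty lists returns only the tool’s own environment.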
There is still some way to go until all the details are perfect here [6], but we believe this approach covers most common cases well.
Temporarily installing dependencies to complete the mapping
There is an elephant in the room that we have not yet talked about: Sometimes you may be running FawltyDeps on a project where the project dependencies are not installed at all! Then what can you do? (Assuming that you don’t want to go through the bother of installing packages manually.) Until recently FawltyDeps would simply fall back to the identity mapping for any packages that it could not find locally, with the undeclared/unused report provided by FawltyDeps suffering as a result.
With the new `--install-deps` option introduced in v0.13.0, we are now able to provide a better alternative: with this option FawltyDeps will not fall back to the identity mapping; instead, it will automatically use `pip install` to install the unresolved dependencies (from PyPI, by default [7]) into a temporary virtualenv [8], and it will then use this as an additional source for the dependency-to-import mapping. For dependencies that are not found locally, this allows FawltyDeps to come up with the correct mapping (and hence produce a much better undeclared/unused report) rather than relying on the imperfect identity mapping.
Since this is a potentially expensive strategy we have chosen to hide it behind the `--install-deps` command-line option. If you want to always enable this option, you can set the corresponding `install_deps` configuration variable to `true` in the `[tool.fawltydeps]` section of your `pyproject.toml`.
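Spelled out, the configuration described above would look like this in `pyproject.toml` (a fragment only; any other settings in your file stay as they are):

```toml
[tool.fawltydeps]
install_deps = true
```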
Note that there is no guarantee that we’re able to resolve all dependencies with this method: For example, there could be a typo in your declared dependency that means it will never be found on PyPI, or there could be other circumstances (e.g. network issues) that prevent this strategy from working at all. What happens with such unresolved dependencies will be covered below.
User-defined mapping
The mappings discussed above have FawltyDeps look into packages that are actually installed (whether in an existing local environment or temporarily by FawltyDeps). But this might not always be achievable in practice. You might want to run FawltyDeps in your CI, possibly on multiple libraries, without having to either set up a local environment or access packages from outside sources (like PyPI).
A simple solution to this is to provide FawltyDeps with your own custom mapping. [9] We have chosen not to ship any database with the code as it needs to be frequently updated, with no guarantee of it covering all Python packages. Instead, we allow users to provide their own custom TOML mapping. This mapping does not have to be complete and it can be used in conjunction with the other mappings discussed in this article. We talk more about how FawltyDeps combines different mappings in the following section.
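As an illustration, a small custom mapping placed in the `[tool.fawltydeps.custom_mapping]` section of `pyproject.toml` could look like the fragment below. Each key is a package name and each value lists the import names it provides; the two entries are examples picked from this post, not a recommended set.

```toml
[tool.fawltydeps.custom_mapping]
pyyaml = ["yaml"]
scikit-learn = ["sklearn"]
```

The same `package = ["import", ...]` lines can instead live in a separate TOML file passed with `--custom-mapping-file`.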
Putting it together: FawltyDeps’ mapping strategy
Now that we have gathered all these mappings, let’s see how to best combine them.
Overall, we have three guiding principles in this endeavor:
- Completeness: we should be able to resolve all dependencies extracted from a project into associated import names, as otherwise we cannot reach any conclusions about undeclared or unused dependencies.
- Correctness: some mappings offer a higher level of correctness than others. Identity mapping, for example, is correct for many - but certainly not all - packages. Resolving a dependency via a locally installed package offers a higher guarantee of correctness.
- Transparency: we should be able to trace back what mapping was used to resolve any given dependency. This allows users to discover where they may improve the information passed to FawltyDeps (e.g. using `--pyenv` to point at the most appropriate Python environments). It also makes it much easier for us to diagnose where FawltyDeps itself might be improved.
First, let’s start by repeating our available strategies:
- Identity mapping: The simplest strategy, but also the worst. We would like to avoid using it as much as possible.
- Looking at locally installed packages: Our best option in terms of correctness, but not always complete: sometimes we have to concede that not all dependencies are available in a local Python environment, so we still need a fallback strategy.
- Installing packages (from PyPI) into a temporary virtualenv: The ultimate fallback solution, but quite heavy-weight, and not always suitable (e.g. in a restricted CI environment). Hence, we put this behavior behind the `--install-deps` option.
- Custom/user-defined mapping: Allow the user to have the final say in how dependencies are mapped into imports. This strategy should override the other strategies, but we expect few users will want to go through the fuss of defining their own mapping, so we cannot rely on this being used commonly.
Now, we need to figure out how to combine these strategies in the best way.
We have chosen to organize them in the sequence shown in Figure 3 below. Each strategy - when given the name of a dependency - can either return a successful mapping of that dependency name (into a corresponding set of import names), or return nothing (when a dependency is not found by that strategy). Dependencies that are not resolved by a strategy are passed on to the next strategy in the sequence. Since a dependency is mapped by only one strategy, namely the first that returns something, we need to organize our strategies in order of decreasing preference. In other words:
- The user-defined mapping, when provided, should always override other mappings. It thus comes first in the sequence.
- Next, we want to look at the locally installed packages.
- Finally, if we have not been able to find the dependency in either of the above, we want to use a fallback strategy:
  - If the user has enabled `--install-deps`, we attempt to install packages (subject to `pip` configuration, but from PyPI by default). If any of these packages fail to install, we abort the entire process and raise an error, as we do not expect the user wants a further fallback to the inaccurate identity mapping.
  - Otherwise, our fallback is the identity mapping, that is, we assume any unresolved dependency points to a package (as yet unseen) that provides a single import of the same name. Although this strategy is always “successful” (in terms of mapping to an import name), it is crucially not always correct!
To illustrate:
To bring this back into the overall context of FawltyDeps: once we have resolved the dependencies through the above mapping strategies, we now have an overall mapping of dependency names to provided import names, and this is the basis for the final report:
- Any import found in the project that is not covered by any dependency is reported as an undeclared dependency.
- Any dependency found to only provide imports that are never imported from anywhere is reported as a possibly unused dependency.
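The sequence of strategies can be pictured as a chain of resolvers, each of which either answers or passes. The sketch below is an assumption-laden illustration, not FawltyDeps’ implementation: `from_table`, `identity` and `resolve_all` are invented names, the custom and installed mappings are hard-coded tables, and the temporary-installation stage is not implemented. A chain that ends without the identity mapping raises an error for anything unresolved, loosely mirroring the abort behaviour described for `--install-deps`.

```python
from __future__ import annotations

from typing import Callable, Optional, Set

Resolver = Callable[[str], Optional[Set[str]]]


def from_table(table: dict[str, set[str]]) -> Resolver:
    """Build a resolver that looks names up in a fixed table."""
    return lambda dep: table.get(dep)


def identity(dep: str) -> set[str]:
    """Fallback: always 'succeeds', but is not always correct."""
    return {dep}


def resolve_all(deps, resolvers):
    """Map each dependency with the first strategy that returns something.

    Returns {dependency: (strategy name, import names)} so that the
    strategy used for each dependency stays traceable.
    """
    resolved = {}
    for dep in deps:
        for name, resolver in resolvers:
            names = resolver(dep)
            if names is not None:
                resolved[dep] = (name, names)
                break
        else:
            raise LookupError(f"could not resolve {dep!r}")
    return resolved


custom = {"scikit-learn": {"sklearn"}}  # user-defined mapping
installed = {"numpy": {"numpy"}}        # found in a local environment

chain = [
    ("custom", from_table(custom)),
    ("installed", from_table(installed)),
    ("identity", identity),
]
resolved = resolve_all(["numpy", "scikit-learn", "pyyaml"], chain)
for dep, (strategy, names) in resolved.items():
    print(dep, strategy, sorted(names))
```

Keeping the strategy name next to each result is one simple way to serve the transparency principle: it shows at a glance that `pyyaml` was only resolved by the identity mapping.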
The table below provides a summary of the available mappings, sorted in the order FawltyDeps processes them, along with options to customize them.
Priority | Mapping strategy | Options |
---|---|---|
1 | User-defined mapping | Provide a custom mapping in TOML format via `--custom-mapping-file` or a `[tool.fawltydeps.custom_mapping]` section in `pyproject.toml`. Default: no custom mapping. |
2 | Mapping from installed packages | Point to one or more environments via `--pyenv`. Default: auto-discovery of Python environments under the project’s basepath. If none are found, default to the Python environment in which FawltyDeps itself is installed. |
3a | Mapping via temporary installation of packages | Activated with the `--install-deps` option. |
3b | Identity mapping | Active by default. Deactivated when `--install-deps` is used. |
Examples
This section dives into some practical scenarios. Suppose you have a simple `requirements.txt` file:
numpy>=1.25.0
scikit-learn
pyyaml
We assume that these packages are already imported in `some_script.py` as:
import numpy
import sklearn
import yaml
As we can see, our project has defined all its dependencies as it should, so FawltyDeps should ideally not report any problems. But let’s also assume that we’re running FawltyDeps in an incomplete environment - one where `pyyaml` is not installed - to see how this affects FawltyDeps.
Example 1: running with default options
When running with default options, like so:
fawltydeps
FawltyDeps will run through the default sequence of mappings, as shown in Figure 4:
In particular:
- No custom mapping is provided.
- FawltyDeps automatically finds local environments or defaults to its own environment. In this example it finds `scikit-learn` and `numpy` in the local environment, and we can see that `scikit-learn` is correctly resolved to the `sklearn` import name.
- Identity mapping is used to resolve any dependencies not resolved via previous mappers. In this example, `pyyaml` was not found above, and was therefore incorrectly resolved by the identity mapping to `pyyaml`.
The resulting output from FawltyDeps is:
These imports appear to be undeclared dependencies:
- 'yaml'
These dependencies appear to be unused (i.e. not imported):
- 'pyyaml'
For a more verbose report re-run with the `--detailed` option.
This first example shows a common pitfall of the identity mapping.
Next, we can see how `--install-deps` improves on these situations:
Example 2: running with custom options
Let’s now take advantage of some advanced FawltyDeps options by running the following command:
fawltydeps --custom-mapping-file my_mapping.toml --pyenv venv --install-deps
Figure 5 shows the path FawltyDeps takes through the sequence of mappings:
In particular:
- We provide a partial custom mapping (e.g. via `--custom-mapping-file`). In this example, the custom mapping is defined in `my_mapping.toml` as `scikit-learn = ["sklearn"]`.
- We point to a local virtual environment (with `--pyenv`) where some dependencies are installed. (In this example, only `numpy` is installed in `venv`.)
- We pass `--install-deps`, to ask FawltyDeps to temporarily install and resolve any remaining dependencies.
FawltyDeps returns the following result:
No undeclared or unused dependencies detected.
As expected, FawltyDeps now returns a better result:
The `--install-deps` option downloads the `pyyaml` PyPI package and makes it available to the resolver, so it can now map the `yaml` import to the correct `pyyaml` dependency declaration.
Customizing your FawltyDeps’ mappers
These examples demonstrate two extremes and we expect most usage to fall somewhere in between.
With the `--json` flag, the resulting package-to-imports mapping is exposed in the output under the `.resolved_deps` key.
Using a command like this:
fawltydeps --custom-mapping-file my_mapping.toml --pyenv venv --install-deps --json | jq .resolved_deps
you can see which mappings are used to resolve a package into a set of imports, and further iterate on the mapping options to help FawltyDeps perform its best on your codebase.
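If you prefer Python over `jq`, the same extraction can be sketched as below. The only thing taken from this post is the top-level `resolved_deps` key; the structure of the value under that key is not described here, so the helper (a hypothetical name, `extract_resolved_deps`) simply returns it as-is.

```python
import json


def extract_resolved_deps(report_text: str):
    """Return whatever is stored under the "resolved_deps" key of a JSON report."""
    return json.loads(report_text)["resolved_deps"]


# Usage, assuming the report was saved first with:
#   fawltydeps --json > report.json
# with open("report.json") as f:
#     print(json.dumps(extract_resolved_deps(f.read()), indent=2))
```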
Conclusion
FawltyDeps has come a long way from the version we presented in our first announcement. While it was initially limited to resolving packages from its own environment and falling back to the identity mapping, it now supports arbitrary local environments, custom user mappings and it can temporarily install and resolve packages on its own. On top of that, it can also automatically discover virtual environments inside the analyzed project. [2]
We strive to provide a default behavior that makes sense for most projects, and to offer a customizable yet simple interface for advanced users that wish to take control over the mapping process. We believe the result is a powerful tool that delivers a complete, correct and transparent matching of your project’s dependencies and imports.
As always, we would be happy to hear your feedback! Try out the latest version of FawltyDeps and reach out to us with any problems or questions on our GitHub repository.
[1] The recent publication of Computational reproducibility of Jupyter notebooks from biomedical publications highlights that missing dependencies is a frequent occurrence in repositories hosting scientific computational experiments and has a detrimental effect on reproducibility.

[2] For completeness, here is an overview of the changes we’ve made to our mapping strategy over the last releases, and that together realize the picture presented in this blog post:
- v0.7 introduces the `--pyenv` option to allow FawltyDeps to look up packages in a different Python environment than the one in which FawltyDeps is running.
- v0.9 adds the user-defined mapping.
- v0.10 adds support for `__pypackages__` directories.
- v0.11 introduces support for multiple `--pyenv` options.
- v0.12 revamps our project traversal, allowing Python environments to be automatically found inside the project.
- v0.13 introduces the `--install-deps` option allowing missing project dependencies to be mapped correctly instead of using the identity mapping.

[3] We depend on functionality from the excellent `importlib_metadata` library to extract imports exposed in locally installed packages.

[4] This assumption was made no matter whether FawltyDeps was installed in a virtualenv or as part of the system-wide Python installation, and we only documented that FawltyDeps had to be installed into the same environment as your project dependencies. One example of where this did not work out well is when you installed FawltyDeps with `pipx install fawltydeps`: This makes `fawltydeps` available everywhere (via your `$PATH`), but pipx installs it into its own, separate, virtualenv that is isolated from your project, meaning that FawltyDeps would almost always fall back to the identity mapping, and yield poor results.

[5] Some less common locations of Python packages:
- `__pypackages__` directories (even though PEP582 was recently rejected, these still occur in the wild).
- Conda and other environment managers (not yet explicitly supported, although it’s on our radar).
- Nix closures containing Python packages, like those produced by poetry2nix.

[6] One open issue is that FawltyDeps currently does not look at package versions. This usually does not cause problems in practice, but there are corner cases where it might: Consider, for example, a `package_foo` that used to provide two import names `module_a` and `module_b`, but starting from version 2, it only provides `module_a`. Now, if your project declares a dependency on `package_foo>=2`, but you still happen to `import module_b` in your code, this should be reported by FawltyDeps as an undeclared dependency (because you’re declaring a dependency on a version of `package_foo` where `module_b` no longer exists). However, if `package_foo` version 1 (not version 2) happens to be installed in your project’s environment, FawltyDeps will simply believe that `package_foo` (whichever version) provides both `module_a` and `module_b`, and the error won’t be flagged.

[7] To customize automatic installation (for example, to use a different package index), you can use pip’s environment variables.

[8] Note that the PyPI API does not currently expose imports of the hosted packages (see here and here for relevant discussions). Downloading and unpacking these packages is therefore necessary.

[9] Some tools rely on custom mappings. A notable example is the Pants build system, which relies on static mappings provided by the user. Another example is the pipreqs library, which keeps a static database mapping packages to the import names they expose.
About the authors
Johan is a Developer Productivity Engineer at Tweag. Originally from Western Norway, he is currently based in Delft, NL, and enjoys this opportunity to discover the Netherlands and the rest of continental Europe. Johan has almost twenty years of industry experience, mostly working with Linux and open source software within the embedded realm. He has a passion for designing and implementing elegant and useful solutions to challenging problems, and is always looking for underlying root causes to the problems that face software developers today. Outside of work, he enjoys playing jazz piano and cycling.
Nour is a data scientist/engineer that recently made the leap of faith from Academia to Industry. She has worked on Machine Learning, Data Science and Data Engineering problems in various domains. She has a PhD in Computer Science and currently lives in Paris, where she stubbornly tries to get her native Lebanese plants to live on her tiny Parisian balcony.
Maria, a mathematician turned Senior Data Engineer, excels at blending her analytical prowess and software development skills within the tech industry. Her role at Tweag is twofold: she is not only a key contributor to the innovative AI projects but also heavily involved in data engineering aspects, such as building robust data pipelines and ensuring data integrity. This skill set was honed through her transition from academic research in numerical modelling of turbulence to the realm of software development and data science.
Zhihan is a data scientist/engineer with expertise in Machine Learning. Holding a PhD in Geometric Topology and Probabilities, she transitioned from a pure mathematics background to apply her knowledge in industry. Currently at Tweag, Zhihan focuses on developing data engineering solutions and cloud deployment. Beyond her technical skills, she has a passion for music, arts, literature, and cooking.