Python Packaging in the Real World: Biomedical projects vs. PyPI

24 September 2024 — by Zhihan Zhang, Dorran Howell, Maria Knorps

The Python programming language, and its huge ecosystem (there are more than 500,000 projects hosted on the main Python repository, PyPI), is used both for software engineering and scientific research. Both have similar requirements for reproducibility. But, as we will see, the practices are quite different.

In fact, the Python ecosystem and community is notorious for the countless ways it uses to declare dependencies. As we were developping FawltyDeps¹, a tool to ensure that declared dependencies match the actual imports in the code, we had to accommodate many of these ways. This got us thinking: Could FawltyDeps be used to gain insights into how packaging is done across Python ecosystems?

In this blog post, we look at project structures and dependency declarations across Python projects, both from biomedical scientific papers (as an example of scientific usage of Python) as well as from more general and widely used Python packages. We’ll try to answer the following questions:

What practices does the community actually follows? And how do they differ between software engineering and scientific research?
Could such differences be related to why it’s often hard to reproduce results from scientific notebooks published in the data science community?

Experiment setup

In the following, we discuss the experimental setup — how we decided which data to use, where to get this data from, and what tools we use to analyze it, before we discuss our results in depth.

Data

First, we need to collect the names and source code locations of projects that we want to include in the analysis. Now, where did we find these projects? We selected projects for analysis based on two key areas: impactful real-world applications and broad community adoption.

Biomedical data analysis repositories: biomedical data plays a vital role in healthcare and research. To capture its significance, we focused on packages directly linked to biomedical data, sourced from repositories supported or referenced by scientific biomedical articles. This criterion anchored our experiment in real-world scientific applications.
To analyze software engineering practices, we’ve chosen to use the most popular PyPI packages: acknowledging the importance of widely adopted packages, we included a scan of the most downloaded and frequently used PyPI packages.

Biomedical data

We leverage a recent study by Samuel, S., & Mietchen, D. (2024): Computational reproducibility of Jupyter notebooks from biomedical publications. This study analyzed 2,177 GitHub repositories associated with publications indexed in PubMed Central to assess computational reproducibility. Specifically, we reused the dataset they generated (found here) for our own analyses.

PyPI data

In order to start analyzing actual projects published to PyPI, we still needed to access some basic metadata about these projects: the project’s name, source URL, and any extra metadata which could be useful for further analysis such as project tags.

While this information is available via the PyPI REST API, this API is subject to rate limiting and is not really designed for bulk analyses such as ours. Conveniently, Google maintains a public BigQuery dataset of PyPI download statistics and project metadata which we leveraged instead. As a starting point for our analysis, we produced a CSV with relevant metadata for top packages downloaded in 2023 using a simple SQL query. Since the above-mentioned biomedical database contains 2,177 projects, we conducted a scan of the first 2,000 PyPI packages to create a dataset of comparable size.

Using FawltyDeps to analyze the source code data

Now that we have the source URLs of our projects of interest, we downloaded all sources and ran an analysis script that wraps around FawltyDeps on the packages. For safety, all of this happened in a virtual machine.

Post-processing and filtering of FawltyDeps analysis results

While the data we collected from PyPI was quite clean (modulo broken or inaccessible project URLs), the biomedical dataset contained some projects written in R and some projects written in Python 2.X, which are outside of our scope. To further filter for relevant projects that are written in Python 3.X, we applied the following rules:

there should be .py or .ipynb files in the source code directory of the data. If there are only .ipynb files and no imports, then it is most likely an R project and not taken into account.
we are also only interested in Python projects that have 3rd-party imports, as these are the project we would expect to declare their dependencies.

After these filtering steps, we have 1,260 biomedical projects and 1,118 PyPI packages to be analyzed.

Results

Now that we had crunched thousands of Python packages, we were curious to see what secrets the data produced by FawltyDeps would reveal!

Dependency declaration patterns

First, we investigated which dependency declaration file choices were made in both samples. The following pie charts show the proportion of projects with and without dependency declaration files, and whether these files actually contain dependency declarations.

distribution deps 1 — Figure 1. Percent of projects with dependency declaration files and actual dependency(ies) declared.

distribution deps 1 new — Figure 1. Percent of projects with dependency declaration files and actual dependency(ies) declared.

We find that about 60% of biomedical projects have dependency declaration files, while for PyPI packages, that number is almost 100%. That is expected, as the top PyPI projects are written to be reproducible: they are downloaded by a large group of people and if they are not working due to lack of dependency declarations, it would be noticed immediately by the users.

Interestingly, we found that some biomedical projects (6.8%) and PyPI packages (16.0%) have dependency declaration files with no dependencies listed inside them. This might be because they genuinely have no third-party dependencies, but more commonly it is a symptom of either:

setup.py files with complex dependency calculations: although FawltyDeps supports parsing simple setup.py files with a single setup()call and no computation involved for setting the install_requires and extras_require arguments, it is currently not able to analyze more complex scenarios.
pyproject.toml might be used to configure tools with sections like [tool.black] or [tool.isort], and declaring dependencies (and other project metadata) in the same file is not strictly required.

For the remainder of the analysis, we do not take these cases into account.

We then examined how different package types utilize various dependency declaration methods. The following chart shows the distribution of requirements.txt, pyproject.toml, and setup files across biomedical projects and PyPI packages (note that these three categories are not exclusive):

distribution deps 2 — Figure 2. Percent of projects with dependencies declared in `requirements.txt`, `pyproject.toml` and setup files.

For biomedical projects, requirements.txt and setup.py/setup.cfg files are a majority of declaration files. In contrast, PyPI projects show a higher occurrence of pyproject.toml compared to biomedical projects. pyproject.toml is a suggested modern way of declaring dependencies. This result should not come as a surprise: top PyPI projects are actively maintained and are more likely to follow best practices. A requirements.txt file, on the other hand, is easier to add and if you do not need to package your projects it is a simpler option.

Now let’s have a more detailed view in which categories are exclusive:

distribution deps 3 — Figure 3. Distribution of mutually exclusive dependency file choices.

For biomedical data there are a lot of projects that have either requirements.txt or setup.py/setup.cfg files (or a combination of both) present. The traditional method of using setup files utilizing setuptools to create Python packages has been around for a while and is still heavily relied upon in the scientific community.

On the PyPI side, no single method for declaring dependencies stood out, as different approaches were used with similar frequency across all projects. However, when it comes to using pyproject.toml, PyPI packages were about five times more likely to adopt this method compared to biomedical projects, suggesting that PyPI package authors tend to favor pyproject.toml significantly more often for dependency management.

Also, almost no top biomedical projects (only 2 out of 1,260) and very few PyPI packages (only 25 out of 1,118) used pyproject.toml and setup files together: it seems that projects don’t often mix the older method - setup files - with the more modern one - pyproject.toml - at the same time.

A different method of visualizing the subset of results pertaining to requirements.txt, pyproject.toml and setup.py/setup.cfg files are Venn diagrams:

distribution deps 4 — Figure 4. Venn diagram of projects with dependencies declared with categories including combination of dependency files.

While these diagrams don’t contain new insights, they show clearly how much more common pyproject.toml usage is for PyPI packages.

Source code directories

We next examined where projects store their source code, which we refer to as the “source code directory”. In the following analysis, we defined this directory as the directory that contains the highest number of Python code files and does not have names like “test”, “example”, “sample”, “doc”, or “tutorial”.

code structure — Figure 5. Source code directories choices.

We can make some interesting observations: Over half (53%) of biomedical projects store their main source code in a directory with a name different than the project itself, and source code is not commonly stored in directories named src or src-python (7%). For PyPI projects, the numbers are lower, with 37% storing their main code in a directory that matches the project name. However, naming the source code directory differently from the package name is still fairly common for PyPI projects, appearing in 36% of cases. A somewhat surprising finding: the src layout, recommended by Python packaging user guide, appears in only 14% of cases.

Another noteworthy observation is that 23% of biomedical projects store all their source code in the root directory of the project. In contrast, only 12% of PyPI projects follow this pattern. This difference makes sense, as scientists working on biomedical projects might be less concerned about maintaining a strict code structure compared to developers on PyPI. Additionally, a lot of biomedical projects might be a loose collection of notebooks/scripts not intended to be packaged/importable, and thus will typically not need to add any subdirectories at all. On the other hand, everything from the PyPI data set is an importable package. Even in the “flat” layout (according to discussion), related modules are collected in a subdirectory named after the package.

The top PyPI projects that keep their code in the root directory are often small Python modules or plugins, like “python-json-patch”, “appdirs”, and “python-json-pointer”. These projects usually have all their source code in a single file, so storing it in the root directory makes sense.

Key results

Many people have preconceptions about how a Python project should look, but the reality can be quite different. Our analysis reveals distinct differences between top PyPI projects and biomedical projects:

PyPI projects tend to use modern tools like pyproject.toml more frequently, reflecting better overall project structure and dependency management practices.
In contrast, biomedical projects display a wide variety of practices; some store code in the root directory and fail to declare dependencies altogether.

This discrepancy is partially explained by the selection criteria: popular PyPI packages, by necessity, must be usable and thus correctly declare their dependencies, while biomedical projects accompanying scientific papers do not face such stringent requirements.

Conclusion

We found that biomedical projects are written with less attention to the coding best practices, which compromises their reproducibility. There are many projects without dependencies declared. The use of pyproject.toml, which is current state-of-the-art way to declare dependencies is less frequently present in biomedical packages. In our opinion, though, it’s essential for any package to adhere to the same high standards of reproducibility as top PyPI packages. This includes implementing robust dependency management practices and embracing modern packaging standards. Enhancing these practices will not only improve reproducibility but also foster greater trust and adoption within the scientific community.

While our initial analysis revealed some interesting insights, we feel that there might be some more interesting treasures to be found within this dataset - you can check yourself in our FawltyDeps-analysis repository! We invite you to join the discussion on FawltyDeps and reproducibility in package management on our Discord channel.

Finally, this experiment also served as a real-world stress test for FawltyDeps itself and identified several edge cases we had not yet accounted for, suggesting avenues of further development for FawltyDeps: One of the main challenges was to parse unconventional require and extra-require sections in setup.py files. This issue has been addressed by the FawltyDeps project, specifically through the improvements made in FawltyDeps PR #440. Furthermore, it was also not trivial to handle projects with multiple packages declared in one. Addressing these issues will be a focus as we continue to refine and improve FawltyDeps.

Stay tuned as we will drill deeper into the data we’ve collected. So far, we’ve reused part of FawltyDeps‘ code for our analysis, but the next step will be to run the full FawltyDeps tool on a large number of packages. Join us as we examine how FawltyDeps performs under rigorous testing and what improvements can be made to enhance its capabilities!

For more insights, refer to our previous talk at PyData Global: Finding undeclared and unused dependencies in your notebooks and projects.↩

Behind the scenes

Zhihan Zhang

Zhihan is a data scientist/engineer with expertise in Machine Learning. Holding a PhD in Geometric Topology and Probabilities, she transitioned from a pure mathematics background to apply her knowledge in industry. Currently at Tweag, Zhihan focuses on developing data engineering solutions and cloud deployment. Beyond her technical skills, she has a passion for music, arts, literature, and cooking.

Dorran Howell

Maria Knorps

Maria, a mathematician turned Senior Data Engineer, excels at blending her analytical prowess and software development skills within the tech industry. Her role at Tweag is twofold: she is not only a key contributor to the innovative AI projects but also heavily involved in data engineering aspects, such as building robust data pipelines and ensuring data integrity. This skill set was honed through her transition from academic research in numerical modelling of turbulence to the realm of software development and data science.

Tech Group

Data Engineering

Data is a critical resource feeding software systems and decision-making. Open source tools which allow for correct and efficient processing of data are essential in order to effectively leverage this resource.

If you enjoyed this article, you might be interested in joining the Tweag team.

This article is licensed under a Creative Commons Attribution 4.0 International license.

← Reflecting away from definitions in Liquid Haskell Introducing rules_gcs →