pandas-stubs — How we enhanced pandas with type annotations

Joanna Sendorek · Published in VirtusLab · 9 min read · Aug 26, 2021

This article is co-authored by Zbigniew Królikowski and Joanna Sendorek.

You can check our project here: https://github.com/VirtusLab/pandas-stubs

At VirtusLab we created our own pandas-stubs library, which enhances pandas with type information that’s necessary for maintaining a high level of type-safety for all pandas-dependent projects, and has already been downloaded a few hundred thousand times. In this article we’ll describe its inception, the development process and real use cases.

What problems did we encounter?

We’ve been working on multiple projects that used both the pyspark and pandas libraries simultaneously, and ran into difficulties when importing them together. Both libraries are focused on data processing and have named their core classes alike: pd.DataFrame and spark.sql.DataFrame. We couldn’t import both DataFrame classes under the same name, so we had to work around that by referring to at least one of them through its parent module, e.g. pd.DataFrame.

This did work, except when somebody made an honest mistake. We could use full module names for both classes for better clarity, but there would still be an occasional mix-up. We also noticed that we couldn’t rely on the type checker to detect these conflicts, because it had nowhere to get type information from, and defaulted to treating all types from those libraries as the special Any type. It effectively didn’t work with code based on pandas or pyspark objects.

We didn’t expect full API type information or annotation coverage for either of them, but assumed that both would have at least rudimentary typing support. So we started using pyspark-stubs to alleviate the problem of missing type information (note that since Spark 3.0 this isn’t necessary — type annotations are already integrated into the library).

pyspark-stubs allowed us to find some issues in our use of pyspark, but unfortunately the story otherwise remained the same. The DataFrame mismatch was still happening, since pd.DataFrame was still treated as Any, and any code based on pandas remained untyped. So we decided to start looking for an equivalent pandas-stubs library, and found two projects with bits and pieces of typing support for pandas:

However, they didn’t completely suit our needs, especially given their early stage of development at the time, so we realized we’d need to look for an alternative solution.

We decided to attempt to fill the hole ourselves by producing our own stubs package, and defined the following goals:

  • Any valid pandas usage shouldn’t be marked as invalid by the type checker (completeness).
  • Invalid pandas usage should be marked as such (soundness). This takes precedence over completeness.
  • The stubs should be tested against real-life code examples.
  • The package should be installable from PyPI and Conda Forge.

What were the possibilities?

Since PEP 484 was introduced, there have been two ways of providing type information alongside code. The first, and more straightforward, is inline type annotations.
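For illustration, annotations attach types directly to parameters and return values. A minimal sketch (the function and names below are our own example, not from pandas):

```python
from typing import Dict, Optional

def find_user_id(name: str, users: Dict[str, int]) -> Optional[int]:
    """Return the id registered for `name`, or None when absent."""
    return users.get(name)

print(find_user_id("ada", {"ada": 1}))  # 1
print(find_user_id("bob", {}))          # None
```
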

This syntax resembles that of statically typed languages, but don’t be deceived: in the most common Python interpreter, CPython, the extra annotations have no bearing on the runtime whatsoever. They only come into play when using external static type-checking tools such as mypy, Pytype, Pyright and Pyre.

Python remains a dynamically and strongly typed language at its core. However, the extra safety and clarity brought by integrating one (or more) of these tools into your IDE and CI are well worth the time and effort needed to add type annotations to your code. But what about pandas? How could we deal with the missing type information there?

That’s exactly the use case behind type stubs: they provide a way to supplement existing library code with type information without modifying the library itself. They come in the form of files with a .pyi extension, much like header files in C-like languages. Their role can be thought of as declaring “interfaces” for the implementations stored in .py files.

When pandas-stubs is installed, the .pyi files are placed alongside the installed pandas source files. The type checker loads the stub files and makes decisions based on the information they contain. Note that what’s declared in .pyi files takes precedence over what’s stored in the .py files: even if the .py files contain annotations, they’ll be ignored in favour of those from the stubs.
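A stub file declares only signatures, with all bodies elided. A simplified sketch of our own (not actual pandas-stubs content):

```python
# series.pyi -- an illustrative sketch, not the real pandas-stubs file
from typing import Any, List, Optional

class Series:
    def __init__(self, data: Any = ..., index: Any = ..., name: Optional[str] = ...) -> None: ...
    def sum(self, axis: Optional[int] = ...) -> Any: ...
    def tolist(self) -> List[Any]: ...
```

The `...` bodies are the conventional stub placeholder; the type checker reads only the signatures.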

How we approached creating stubs

The least work-intensive, and at the same time least reliable, way to generate stubs is to use a tool called stubgen, which is bundled with mypy. It tries to infer the correct types from the real uses it finds in the codebase. While powerful in theory, it didn’t work for us in practice.

The problem was that stubgen didn’t effectively resolve conflicts in method signatures: instead of joining variants with Union or Optional, it just suggested a couple of alternative method signatures corresponding to different usages. Moreover, the original pandas code required some minor fixes before stubgen stopped failing.

It was still possible to generate stubs directly from the code itself without using the inference method. By using this approach, stubgen just copied the declarations over from the source files, filling all the holes in annotations by using Any. This resolved the problem of conflicts, but the drawback was that while it used the existing annotations from the pandas source code (even if incomplete), it still left most of the declarations completely ambiguous.

We decided to look to the pandas documentation for help, as it already contains hints as to what types are used in the API. However, sometimes the description was not precise enough for us to create a fully typed signature; for instance, the documentation for the update method of DataFrame describes its other parameter as a “DataFrame, or object coercible into a DataFrame”.

From that description alone, we couldn’t be sure exactly what it meant for an object to be coercible into a DataFrame, so we inferred it by looking at example code snippets. Eventually we decided that the best approach was to define a new type standing for anything that can be converted into a DataFrame.
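Such a type boils down to a union alias. A sketch of the idea (the alias name and its members here are illustrative, not the exact ones used in pandas-stubs):

```python
from typing import Any, Dict, List, Union
import pandas as pd

# Illustrative alias: anything pandas will coerce into a DataFrame.
DataFrameCoercible = Union[pd.DataFrame, pd.Series, Dict[str, Any], List[Any]]

def update_frame(df: pd.DataFrame, other: DataFrameCoercible) -> None:
    """Update df in place from any DataFrame-coercible object."""
    df.update(other)

df = pd.DataFrame({"a": [1.0, 2.0]})
update_frame(df, {"a": [10.0]})   # a plain dict is coerced internally
print(df["a"].tolist())           # [10.0, 2.0]
```
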

How we set up the package: bringing it all together

In accordance with PEP 561, we included a py.typed marker file, which signals that the package supports type checking, and explicitly specified it in the package_data parameter in setup.py.
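In setup.py this boils down to something like the following (a simplified sketch following PEP 561 conventions, not our exact configuration):

```python
from setuptools import setup

setup(
    name="pandas-stubs",
    packages=["pandas-stubs"],
    # ship the stub files and the PEP 561 marker with the distribution
    package_data={"pandas-stubs": ["*.pyi", "**/*.pyi", "py.typed"]},
)
```
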

Since the original package is named pandas, we named the stubs package pandas-stubs, following the PEP 561 naming convention that signals this relationship to type checkers and the package manager.

Testing

The tricky part of stubs development is ensuring they are correct, and fully cover the pandas API. This brings up the important topic of testing. Proper tests for stubs are often missing from other stub libraries. The consequences are easy to predict: changes in the API (such as the pandas API) lead to the stubs getting outdated, and regression errors are almost guaranteed. To prevent these problems in pandas-stubs, we focused on providing proper tests from the beginning of our development.

You can think of stub correctness from two angles: internal integrity, and application usage. Internal integrity means that type checking executed on the stubs themselves succeeds: generics are in place, inheritance maintains type consistency, and type variables are defined properly. In pandas-stubs, we ensure this by running the mypy checker on the stubs codebase.

Another aspect to test is application usage of the stubs. Are correct API calls recognized by our typing system? Are all the keywords covered, and do results have the proper types? To ensure this, we created test snippets which we run with pytest. These are simply small fragments of code exemplifying pandas API calls with different parameters and keywords. They help us see whether we broke any previously covered use cases when updating definitions.
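A test snippet of this kind is just ordinary pytest-style code exercising the API. A simplified sketch of our own (not taken from the actual test suite):

```python
import pandas as pd

def test_sum_returns_series() -> None:
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
    result = df.sum()
    # with the stubs in place, mypy also checks that `result` is a Series
    assert isinstance(result, pd.Series)
    assert result["a"] == 3

def test_merge_keywords() -> None:
    left = pd.DataFrame({"key": [1, 2], "x": ["a", "b"]})
    right = pd.DataFrame({"key": [2, 3], "y": ["c", "d"]})
    merged = left.merge(right, on="key", how="inner")
    assert merged.shape == (1, 3)
```

Running such snippets through both pytest and mypy checks runtime behavior and stub coverage at once.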

Using this approach, we have internal type checking in place, and we test our stubs on actual samples of API calls.

We also want the stubs to be compatible with multiple Python versions, and the tool which comes in handy here is Tox. Tox is a command-line tool for managing virtual environments, and we use it to run the mypy checker against four major Python versions: 3.6–3.9.
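Such a Tox setup can be sketched roughly as follows (an illustrative config, not our exact tox.ini):

```ini
[tox]
envlist = py36,py37,py38,py39

[testenv]
deps =
    mypy
    pytest
commands =
    mypy pandas-stubs
    pytest tests
```
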

Writing test snippets helps us find bugs and potential gaps in our type definitions. To our surprise, they even helped with finding a bug in pandas itself! This happened when we were testing the __setitem__ method in the DataFrame API to check the behavior of different assignment values, and we discovered that the type of the value affects the way in which subsequent assignments are evaluated. We reported the bug, and it has since been fixed.

How it works in practice

Let’s take this piece of code as an example.
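A comparable example of our own (the function and data are hypothetical) of code that runs fine but is wrong at the type level:

```python
import pandas as pd

def mean_age(df: pd.DataFrame) -> pd.DataFrame:
    # Bug: .mean() on a single column returns a scalar, not a DataFrame.
    # Without stubs, df["age"] is Any and mypy stays silent.
    return df["age"].mean()

people = pd.DataFrame({"age": [20, 30]})
print(mean_age(people))  # 25.0
```
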

Since pandas-stubs is absent from this environment, mypy won’t detect any issues. However, right after installing it, mypy actually starts reporting errors on the very same code.

Similarly, the following piece of code has typing issues that wouldn’t be detectable otherwise:
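For instance (our own illustration; the names are hypothetical):

```python
from typing import List
import pandas as pd

def total(values: List[int]) -> int:
    return sum(values)

scores = pd.DataFrame({"score": [10, 20]})["score"]
# A Series is not a List[int]; without stubs it is Any, so mypy accepts
# the call silently. With pandas-stubs installed, the mismatch is
# reported, even though the call happens to work at runtime.
print(total(scores))  # 30
```
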

These errors could pose a real risk if they went uncaught into production. Of course, a comprehensive unit test suite would likely catch them.

However, static type checking comes at no additional cost, and doesn’t rely on the developer working out all of the edge cases 100% of the time.

Has it proven useful? What are the next steps?

We mentioned that the goal of our pandas-stubs implementation was to help us improve type consistency in our codebase, but you might wonder what the outside reception was like. The GitHub download statistics show the result:

As of this writing we’ve reached 100k downloads in a month, which we are very excited about! It tells us that we’ve created a project that’s valuable to the community. We also noticed a peculiar peak in usage, which made us wonder what it meant:

Usage peak which made us wonder…

After investigating, we discovered that we owe it to another library, openai-python, which has included pandas-stubs in its requirements. So it looks like pandas-stubs helps not only individual users, but also other projects depending on pandas!

We’ve already received some external contributions, for which we are very grateful. However, there is still a lot we can improve. Among the open issues, a number of modules still need to be covered by type annotations. We also hope that with wider adoption, as more people use our small library, we will discover and fix more inconsistencies, making pandas-stubs more complete and precise, and your experience with pandas even better!

Many thanks to Paweł Lipski for technical support in creating the pandas-stubs project. Also, thanks to Hubert Pomorski for supporting both this article and the project, and to Paweł Batko for a very thorough review.

