Pandas and the geosciences: a 4.5 billion year story

Kim Pevey
Published in CyberPaleo
Apr 21, 2023

Pandas is a ubiquitous tool for time series analysis, but until recently it was unusable in the paleogeosciences, which involve timescales that encompass the entirety of Earth’s history. This is because Pandas long ago hardcoded nanoseconds as the base unit of time. Stored in a 64-bit integer, nanosecond timestamps can only represent a relatively narrow span of about 585 years, which excludes many paleogeoscience applications. Through an NSF EarthCube grant (RISE-2126510), the University of Southern California (USC) contracted with Quansight to support non-nanosecond-resolution datetimes in Pandas 2.0. This work allows for resolutions as coarse as one second, which is enough to represent dates across billions of years. USC and Quansight then worked to adopt the new Pandas functionality in the Pyleoclim Python library. Pyleoclim is maintained by USC researchers and is a leading package for analyzing and visualizing paleo datasets.
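The 585-year figure can be checked with a little arithmetic. The sketch below is plain Python, not Pandas code. It assumes a mean Gregorian year of 365.2425 days, and the helper name int64_span_years is ours, used only for illustration.

```python
# Seconds in a mean Gregorian year (365.2425 days); an assumption for this estimate.
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60


def int64_span_years(ticks_per_second: int) -> float:
    """Total span, in years, that a 64-bit tick counter can cover."""
    total_ticks = 2 ** 64
    return total_ticks / ticks_per_second / SECONDS_PER_YEAR


ns_span = int64_span_years(1_000_000_000)  # nanosecond ticks
s_span = int64_span_years(1)               # one-second ticks

print(f"nanosecond resolution: about {ns_span:,.0f} years")
print(f"second resolution: about {s_span:.2e} years")
```

At second resolution the representable span is hundreds of billions of years, comfortably more than the 4.5 billion years of Earth's history.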

Paleoclimatology presents unique challenges to time series analysis: the time axis is often unevenly spaced, time is highly uncertain, timescales vary from days to billions of years, and time is often represented as positive towards the past, with an origin point that often depends on the dating method (e.g., 1950 is standard for radiocarbon measurements, while 2000 is standard for cave deposits like speleothems). These challenges have made it difficult for the paleoclimate community to use standard libraries.
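To make the "positive towards the past" convention concrete, here is a minimal sketch of converting an age to a calendar year for a given origin. The function name age_to_year_ce is hypothetical, not a Pyleoclim or Pandas API, and the result uses astronomical year numbering, so it ignores the missing year zero in the BCE/CE convention.

```python
def age_to_year_ce(age: float, origin: float = 1950.0) -> float:
    """Convert an age counted backwards from `origin` into a calendar year.

    Negative results are years before year 0 in astronomical numbering.
    """
    return origin - age


# The same age maps to different calendar years depending on the origin.
print(age_to_year_ce(11700))               # origin 1950 -> -9750.0
print(age_to_year_ce(11700, origin=2000))  # origin 2000 -> -9700.0
```

The 50-year offset between the two conventions is small on geological timescales but matters when records dated with different methods are compared at high resolution.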

The Pandas non-nanosecond integration effort

Perhaps the biggest challenge to paleoclimatology analytics packages has been the inability to use Pandas, a core Python tool for analytics.

The first step was to adapt low-level tslibs modules (timezones.pyx, vectorized.pyx, and tzconversion.pyx) to support non-nanosecond resolution. A Localizer class was also added to support them. These changes were particularly challenging to implement without negatively affecting performance.

The team also needed to adapt higher-level Cython code such as parsing.pyx, fields.pyx, array_to_datetime, and infer_dtype.

Next, the team adapted the Pandas NaT (Not a Time), Timestamp, Timedelta, and Period classes to support and interact with non-nanosecond resolutions. Integration into several other internal functions and classes, such as DatetimeArray, TimedeltaArray, and find_common_type, was also completed. Discussion of user-facing constructors (e.g. pd.to_datetime, pd.to_timedelta) is ongoing.
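As a minimal sketch of what this looks like from the user side, assuming Pandas 2.0 or later and NumPy are installed, a DatetimeIndex built from a second-resolution NumPy array keeps that resolution instead of being coerced to nanoseconds. The dates below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Dates well outside the range that nanosecond resolution can represent.
values = np.array(
    ["0500-01-01", "1300-01-01", "3000-01-01"], dtype="datetime64[s]"
)

# Assumes Pandas >= 2.0: the second resolution of the input is preserved.
idx = pd.DatetimeIndex(values)

print(idx.dtype)        # the index keeps second resolution
print(list(idx.year))   # field accessors work on the non-nanosecond index
```

Going through a NumPy array with an explicit unit sidesteps the string-parsing constructors, which were still under discussion at the time of writing.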

The effort to integrate non-nanosecond resolution in Pandas spanned over a year and required many PRs (check out the list in the Appendix below). In the end, we were able to close an issue that had been open for nearly a decade. It was quite the challenge, but we are excited to provide this functionality to the community.

Integration into Pyleoclim — its first usage

During the initial implementation, tests were added to Pandas to ensure that everything worked as expected. However, as any developer knows, the real test is to put the code in the hands of the users. USC and Quansight led the effort to implement this new dtype into Pyleoclim, exposing and fixing a host of issues that came up. This effort was critical in that the Pandas 2.0 release team can be much more confident of its usage now that it’s been extensively alpha tested.

In particular, although the Pandas DatetimeIndex could hold non-nanosecond datetimes, errors would arise when trying to use them in operations such as resampling, formatting, and parsing. This was due to the fact that for at least a decade, the only supported resolution in Pandas was the nanosecond, and related assumptions were baked in throughout the codebase. The Pyleoclim work helped unearth several code paths where that assumption was being incorrectly made.

For instance, Pandas makes it easy to resample a time series at a fixed time resolution. Here we show how the benthic oxygen isotope stack of Lisiecki & Raymo (2004), which spans the past 5.3 million years at uneven resolution, can be easily resampled onto 5,000-year intervals:

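import pyleoclim as pyleo

# Load the LR04 benthic stack and resample it onto 5,000-year (5 ka) bins
ts = pyleo.utils.load_dataset('LR04')
ts5k = ts.resample('5ka').mean()
# Plot the original and resampled series on the same axes
fig, ax = ts.plot(invert_yaxis=True, xlim=[0, 1000])
ts5k.plot(ax=ax, color='C1')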

This project also highlighted the importance of well-defined calendars for conversion between dates. For example, the standard assumption of 24 hours per day does not apply in the distant past and model simulations often do not use the Gregorian calendar. Although Pyleoclim has implemented the use of cftime to tackle these challenges, there is still a need for the Python ecosystem to continue to develop to fully meet the needs of the paleogeosciences.

Related work in the Python scientific ecosystem

Although the Pandas extension and its incorporation into Pyleoclim represent a major stepping stone towards letting scientists in these domains make use of more open science code, work remains for interoperability with other open source libraries such as Matplotlib, Seaborn, Scikit-learn, NumPy, and SciPy. These libraries will need to add support for non-nanosecond resolutions to fully unleash the power of the Python open source ecosystem for the paleogeosciences.

One example is the inability of Matplotlib to natively plot these timespans. For fairly arbitrary reasons, its date handling is currently limited to years 1 through 9999. For now, the solution is to implement our own plotting mechanism in Pyleoclim, which ensures the time axis is provided to Matplotlib as floats (see figure above). However, it will be great when Matplotlib relaxes this restriction.
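The idea of handing Matplotlib plain floats can be illustrated with a rough sketch. This is not Pyleoclim's implementation: the helper epoch_seconds_to_decimal_year is hypothetical, and it assumes timestamps stored as seconds since 1970-01-01 and a mean Gregorian year of 365.2425 days.

```python
# Seconds in a mean Gregorian year (365.2425 days); an assumption for this sketch.
SECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60


def epoch_seconds_to_decimal_year(seconds: float) -> float:
    """Map seconds since 1970-01-01 to an approximate decimal calendar year."""
    return 1970.0 + seconds / SECONDS_PER_YEAR


# A timestamp one million mean years before the epoch, far outside years 1 to 9999.
one_million_years_ago = -1_000_000 * SECONDS_PER_YEAR
years = [epoch_seconds_to_decimal_year(s) for s in (0, one_million_years_ago)]
print(years)
```

Floats like these can be passed to any ordinary Matplotlib axis, at the cost of losing calendar-aware tick formatting.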

Another example is date parsing. If you have a CSV file with entries like 1300-01-01, then read_csv with parse_dates will simply fail to convert the dates due to the remaining datetime64 limitations in Pandas (in this case, the inability to run to_datetime(['1300-01-01'])). Additional work is needed to make the new integration fully seamless.

Conclusion

Opening the door for usage of Pandas as a core tool has brought about a significant advancement in paleogeoscience analytics. This was a monumental effort and related work efforts are still ongoing, but the basics are in place for users to check out. Now go try it out for yourself!

Appendix of Pandas PRs related to this work

https://github.com/pandas-dev/pandas/pull/46397

https://github.com/pandas-dev/pandas/pull/46410

https://github.com/pandas-dev/pandas/pull/46578

https://github.com/pandas-dev/pandas/pull/46688

https://github.com/pandas-dev/pandas/pull/46828

https://github.com/pandas-dev/pandas/pull/46839

https://github.com/pandas-dev/pandas/pull/46901

https://github.com/pandas-dev/pandas/pull/46902

https://github.com/pandas-dev/pandas/pull/46917

https://github.com/pandas-dev/pandas/pull/46959

https://github.com/pandas-dev/pandas/pull/46990

https://github.com/pandas-dev/pandas/pull/47044

https://github.com/pandas-dev/pandas/pull/47076

https://github.com/pandas-dev/pandas/pull/47120

https://github.com/pandas-dev/pandas/pull/47126

https://github.com/pandas-dev/pandas/pull/47162

https://github.com/pandas-dev/pandas/pull/47191

https://github.com/pandas-dev/pandas/pull/47230

https://github.com/pandas-dev/pandas/pull/47245

https://github.com/pandas-dev/pandas/pull/47246

https://github.com/pandas-dev/pandas/pull/47278

https://github.com/pandas-dev/pandas/pull/47299

https://github.com/pandas-dev/pandas/pull/47307

https://github.com/pandas-dev/pandas/pull/47312

https://github.com/pandas-dev/pandas/pull/47313

https://github.com/pandas-dev/pandas/pull/47316

https://github.com/pandas-dev/pandas/pull/47320

https://github.com/pandas-dev/pandas/pull/47322

https://github.com/pandas-dev/pandas/pull/47324

https://github.com/pandas-dev/pandas/pull/47333

https://github.com/pandas-dev/pandas/pull/47334

https://github.com/pandas-dev/pandas/pull/47340

https://github.com/pandas-dev/pandas/pull/47346

https://github.com/pandas-dev/pandas/pull/47355

https://github.com/pandas-dev/pandas/pull/47356

https://github.com/pandas-dev/pandas/pull/47373

https://github.com/pandas-dev/pandas/pull/47374

https://github.com/pandas-dev/pandas/pull/47394

https://github.com/pandas-dev/pandas/pull/47395

https://github.com/pandas-dev/pandas/pull/47421

https://github.com/pandas-dev/pandas/pull/47522

https://github.com/pandas-dev/pandas/pull/47535

https://github.com/pandas-dev/pandas/pull/47537

https://github.com/pandas-dev/pandas/pull/47579

https://github.com/pandas-dev/pandas/pull/47641

https://github.com/pandas-dev/pandas/pull/47668

https://github.com/pandas-dev/pandas/pull/47682

https://github.com/pandas-dev/pandas/pull/47720

https://github.com/pandas-dev/pandas/pull/47807

https://github.com/pandas-dev/pandas/pull/48261

https://github.com/pandas-dev/pandas/pull/48661

https://github.com/pandas-dev/pandas/pull/48669

https://github.com/pandas-dev/pandas/pull/48743

https://github.com/pandas-dev/pandas/pull/48748

https://github.com/pandas-dev/pandas/pull/48815

https://github.com/pandas-dev/pandas/pull/48819

https://github.com/pandas-dev/pandas/pull/48836

https://github.com/pandas-dev/pandas/pull/48894

https://github.com/pandas-dev/pandas/pull/48901

https://github.com/pandas-dev/pandas/pull/48910

https://github.com/pandas-dev/pandas/pull/48923

https://github.com/pandas-dev/pandas/pull/48928

https://github.com/pandas-dev/pandas/pull/48953

https://github.com/pandas-dev/pandas/pull/48956

https://github.com/pandas-dev/pandas/pull/48961

https://github.com/pandas-dev/pandas/pull/49008

https://github.com/pandas-dev/pandas/pull/49014

https://github.com/pandas-dev/pandas/pull/49015

https://github.com/pandas-dev/pandas/pull/49034

https://github.com/pandas-dev/pandas/pull/49050

https://github.com/pandas-dev/pandas/pull/49058

https://github.com/pandas-dev/pandas/pull/49097

https://github.com/pandas-dev/pandas/pull/49098

https://github.com/pandas-dev/pandas/pull/49104

https://github.com/pandas-dev/pandas/pull/49106

https://github.com/pandas-dev/pandas/pull/49171

https://github.com/pandas-dev/pandas/pull/49285

https://github.com/pandas-dev/pandas/pull/49290

https://github.com/pandas-dev/pandas/pull/49737

https://github.com/pandas-dev/pandas/pull/49824

https://github.com/pandas-dev/pandas/pull/50015

https://github.com/pandas-dev/pandas/pull/50348

https://github.com/pandas-dev/pandas/pull/50369

https://github.com/pandas-dev/pandas/pull/50469

https://github.com/pandas-dev/pandas/pull/50642

https://github.com/pandas-dev/pandas/pull/50719

https://github.com/pandas-dev/pandas/pull/50773

https://github.com/pandas-dev/pandas/pull/50774

https://github.com/pandas-dev/pandas/pull/50793

https://github.com/pandas-dev/pandas/pull/50835

https://github.com/pandas-dev/pandas/pull/50852

https://github.com/pandas-dev/pandas/pull/50914

https://github.com/pandas-dev/pandas/pull/50978

https://github.com/pandas-dev/pandas/pull/51039

https://github.com/pandas-dev/pandas/pull/51087

https://github.com/pandas-dev/pandas/pull/51092

https://github.com/pandas-dev/pandas/pull/51223

https://github.com/pandas-dev/pandas/pull/51274

https://github.com/pandas-dev/pandas/pull/51320

https://github.com/pandas-dev/pandas/pull/51334

https://github.com/pandas-dev/pandas/pull/51594

https://github.com/pandas-dev/pandas/pull/51978

Kim Pevey
CyberPaleo

Senior Software Engineer at Quansight with a heart for Open Source