Pandas and the geosciences: a 4.5 billion year story
Pandas is a ubiquitous tool for time series analysis, but until recently it was unusable in the paleogeosciences, whose timescales encompass the entirety of Earth’s history. This is because Pandas long ago hardcoded nanoseconds as the base unit of time. Stored as 64-bit integers, nanosecond timestamps can only span about 585 years (roughly 1677 to 2262), which excludes many paleogeoscience applications. Through an NSF EarthCube grant (RISE-2126510), the University of Southern California contracted with Quansight to support non-nanosecond-resolution datetimes in Pandas 2.0. This work allows for resolutions as coarse as one second, which is enough to represent individual days across billions of years. USC and Quansight then worked to adopt the new Pandas functionality in the Pyleoclim Python library, which is maintained by USC researchers and is a leading package for analyzing and visualizing paleo datasets.
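As a rough check on these numbers, the sketch below computes how many years a signed 64-bit integer can cover at each resolution. It assumes a Julian year of 365.25 days, and the helper name span_in_years is purely illustrative, not part of Pandas.

```python
# Number of distinct values a 64-bit integer can take.
INT64_STEPS = 2**64
# Assumption: a Julian year of 365.25 days, good enough for an order-of-magnitude check.
SECONDS_PER_YEAR = 365.25 * 86400


def span_in_years(ticks_per_second):
    """Total time span (in years) covered by 64-bit integers at a given resolution."""
    return INT64_STEPS / ticks_per_second / SECONDS_PER_YEAR


for unit, ticks in [("ns", 10**9), ("us", 10**6), ("ms", 10**3), ("s", 1)]:
    print(f"{unit:>2}: {span_in_years(ticks):.4g} years")

# At nanosecond resolution the span is roughly 584.5 years in total;
# at second resolution it is several hundred billion years.
```

The total span is split on either side of the 1970 epoch, so second resolution reaches far beyond the 4.5 billion years of Earth history in both directions.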
Paleoclimatology presents unique challenges to time series analysis: the time axis is often unevenly spaced, time is highly uncertain, timescales vary from days to billions of years, and time is often represented as positive towards the past, with an origin point that depends on the dating method (e.g., 1950 is standard for radiocarbon measurements, while 2000 is standard for cave deposits like speleothems). These challenges have made it difficult for the paleoclimate community to use standard libraries.
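To make the "positive towards the past" convention concrete, here is a minimal sketch of converting between calendar years and ages relative to a dating origin. The function names are hypothetical (they are not Pyleoclim's API), and calendar years are assumed to use astronomical numbering, in which a year 0 exists.

```python
def year_ce_to_age(year_ce, origin=1950.0):
    """Age in years before `origin`; positive values point toward the past."""
    return origin - year_ce


def age_to_year_ce(age, origin=1950.0):
    """Inverse of year_ce_to_age: recover the calendar year from an age."""
    return origin - age


def rebase_age(age, old_origin, new_origin):
    """Re-express an age relative to a different origin (e.g. 1950 -> 2000)."""
    return age + (new_origin - old_origin)


age_bp = year_ce_to_age(1850)              # 100 years before 1950
age_b2k = rebase_age(age_bp, 1950, 2000)   # 150 years before 2000
print(age_bp, age_b2k, age_to_year_ce(age_b2k, origin=2000))
```

Keeping the origin as an explicit argument avoids silently mixing records dated relative to 1950 with records dated relative to 2000.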
The Pandas non-nanosecond integration effort
Perhaps the biggest challenge to paleoclimatology analytics packages has been the inability to use Pandas, a core Python tool for analytics.
The first step was to adapt low-level tslibs functions (timezones.pyx, vectorized.pyx, and tzconversion.pyx) to support non-nanosecond resolution. A Localizer class was also added to support these. These changes were particularly challenging to implement without negatively affecting performance.
The team also needed to adapt higher-level Cython functions such as parsing.pyx, fields.pyx, array_to_datetime and infer_dtype.
Next, the team adapted the Pandas NaT (not a time), Timestamp, Timedelta, and Period classes to support and interact with non-nanosecond resolutions. Integration into several other internal functions and classes such as DatetimeArray, TimedeltaArray and find_common_type was also completed. Discussion on user-facing constructors (e.g. pd.to_datetime, pd.to_timedelta) is ongoing.
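From the user's side, the result of this work can be sketched as follows. This example assumes Pandas 2.0 or later together with NumPy; the dates are arbitrary values chosen to fall outside the nanosecond window.

```python
import numpy as np
import pandas as pd

# Dates well outside the range that nanosecond timestamps can represent.
values = np.array(["1300-01-01", "2500-01-01"], dtype="datetime64[s]")

# With Pandas >= 2.0 the second resolution is kept rather than cast to nanoseconds.
idx = pd.DatetimeIndex(values)
print(idx.dtype)  # datetime64[s]
print(idx.unit)   # s

# The difference between two such timestamps is a Timedelta longer than
# anything a nanosecond-based Timedelta could hold.
span = idx[1] - idx[0]
print(span.days)

# as_unit converts between the supported resolutions ("s", "ms", "us", "ns").
idx_ms = idx.as_unit("ms")
print(idx_ms.dtype)  # datetime64[ms]
```

Converting this index to "ns" would be expected to fail, since these dates do not fit in the nanosecond range.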
The effort to integrate non-nanosecond resolution in Pandas spanned more than a year and required many PRs (check out the list in the Appendix below). In the end, we were able to close an issue that had been open for nearly a decade. It was quite the challenge, but we are excited to provide this functionality to the community.
Integration into Pyleoclim: its first usage
During the initial implementation, tests were added to Pandas to ensure that everything worked as expected. However, as any developer knows, the real test is to put the code in the hands of users. USC and Quansight led the effort to integrate the new non-nanosecond dtypes into Pyleoclim, exposing and fixing a host of issues along the way. This effort was critical: because the feature has now been extensively alpha tested in a real library, the Pandas 2.0 release team can be much more confident in it.
In particular, although the Pandas DatetimeIndex could hold non-nanosecond datetimes, errors would arise when trying to use them in operations such as resampling, formatting, and parsing. This was because, for at least a decade, the only supported resolution in Pandas was the nanosecond, and related assumptions were baked in throughout the codebase. The Pyleoclim work helped unearth several code paths where that assumption was being incorrectly made.
For instance, Pandas makes it easy to resample a time series at a fixed time resolution. Here we show how the benthic oxygen isotope stack of Lisiecki & Raymo (2004), which spans the past 5.3 million years at uneven resolution, can be easily resampled onto 5,000-year intervals:
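import pyleoclim as pyleo
# Load the LR04 stack and resample it onto a regular 5,000-year grid
ts = pyleo.utils.load_dataset('LR04')
ts5k = ts.resample('5ka').mean()
# Plot the original series, then overlay the resampled one on the same axes
fig, ax = ts.plot(invert_yaxis=True, xlim=[0, 1000])
ts5k.plot(ax=ax, color='C1')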
This project also highlighted the importance of well-defined calendars for conversion between dates. For example, the standard assumption of 24 hours per day does not apply in the distant past, and model simulations often do not use the Gregorian calendar. Although Pyleoclim has implemented the use of cftime to tackle these challenges, there is still a need for the Python ecosystem to continue to develop to fully meet the needs of the paleogeosciences.
Related work in the Python scientific ecosystem
Although the Pandas extension and its incorporation into Pyleoclim represent a major stepping stone towards letting scientists in these domains make use of more open science code, work remains for interoperability with other open source libraries such as Matplotlib, Seaborn, scikit-learn, NumPy, and SciPy. These libraries will need to add support for non-nanosecond resolutions to fully unleash the power of the Python open source ecosystem for the paleogeosciences.
One example is the inability of Matplotlib to natively plot these timespans. For fairly arbitrary reasons, its date handling is currently limited to years 1 through 9999. For now, the solution is to implement our own plotting mechanism in Pyleoclim, which ensures the time axis is provided to Matplotlib as float (see figure above). However, it will be great when Matplotlib relaxes this restriction.
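The idea of handing Matplotlib a float time axis can be sketched as below. This is not Pyleoclim's implementation; the function to_decimal_years and the choice of a mean Gregorian year (365.2425 days) are assumptions made for illustration.

```python
import numpy as np

# Assumption: one year is a mean Gregorian year of 365.2425 days.
SECONDS_PER_YEAR = 365.2425 * 86400


def to_decimal_years(dates):
    """Convert datetime64 values to approximate float calendar years."""
    # Seconds elapsed since the NumPy epoch, 1970-01-01 (negative for earlier dates).
    seconds = dates.astype("datetime64[s]").astype("int64")
    return 1970.0 + seconds / SECONDS_PER_YEAR


# Build two dates from year offsets relative to 1970:
# the year 1970 itself and the year -500000.
dates = np.array([0, -501970]).astype("datetime64[Y]")
years = to_decimal_years(dates)
print(years)
```

The resulting float array can be passed to any plotting call as an ordinary numeric axis, which sidesteps the date range restriction at the cost of losing calendar-aware tick labels.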
Another example is reading files. If you have a CSV file with entries like 1300-01-01, then read_csv with parse_dates will simply fail to convert the dates due to the remaining datetime64 limitations in Pandas (in this case, the inability to run to_datetime(['1300-01-01'])). Additional work is needed to make the new integration fully seamless.
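Until parsing is seamless, one possible workaround is to read the dates as plain strings and let NumPy parse them at second resolution. The sketch below assumes Pandas 2.0 or later; the column names and the in-memory CSV are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# A tiny in-memory CSV standing in for a real file (hypothetical data).
csv_text = "date,value\n1300-01-01,1.0\n1301-01-01,2.0\n"

# Read without parse_dates, so the date column stays as plain strings.
df = pd.read_csv(io.StringIO(csv_text))

# Let NumPy parse the ISO 8601 strings at second resolution.
dates = np.array(df["date"].tolist(), dtype="datetime64[s]")

# Build a series indexed by the non-nanosecond dates.
series = pd.Series(
    df["value"].to_numpy(),
    index=pd.DatetimeIndex(dates, name="date"),
)
print(series.index.dtype)  # datetime64[s]
```

This only handles strings that NumPy can parse directly, so files with other date formats would need their own conversion step.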
Conclusion
Opening the door to Pandas as a core tool is a significant advancement for paleogeoscience analytics. This was a monumental effort and related work is still ongoing, but the basics are in place for users to check out. Now go try it out for yourself!
Appendix of Pandas PRs related to this work
https://github.com/pandas-dev/pandas/pull/46397
https://github.com/pandas-dev/pandas/pull/46410
https://github.com/pandas-dev/pandas/pull/46578
https://github.com/pandas-dev/pandas/pull/46688
https://github.com/pandas-dev/pandas/pull/46828
https://github.com/pandas-dev/pandas/pull/46839
https://github.com/pandas-dev/pandas/pull/46901
https://github.com/pandas-dev/pandas/pull/46902
https://github.com/pandas-dev/pandas/pull/46917
https://github.com/pandas-dev/pandas/pull/46959
https://github.com/pandas-dev/pandas/pull/46990
https://github.com/pandas-dev/pandas/pull/47044
https://github.com/pandas-dev/pandas/pull/47076
https://github.com/pandas-dev/pandas/pull/47120
https://github.com/pandas-dev/pandas/pull/47126
https://github.com/pandas-dev/pandas/pull/47162
https://github.com/pandas-dev/pandas/pull/47191
https://github.com/pandas-dev/pandas/pull/47230
https://github.com/pandas-dev/pandas/pull/47245
https://github.com/pandas-dev/pandas/pull/47246
https://github.com/pandas-dev/pandas/pull/47278
https://github.com/pandas-dev/pandas/pull/47299
https://github.com/pandas-dev/pandas/pull/47307
https://github.com/pandas-dev/pandas/pull/47312
https://github.com/pandas-dev/pandas/pull/47313
https://github.com/pandas-dev/pandas/pull/47316
https://github.com/pandas-dev/pandas/pull/47320
https://github.com/pandas-dev/pandas/pull/47322
https://github.com/pandas-dev/pandas/pull/47324
https://github.com/pandas-dev/pandas/pull/47333
https://github.com/pandas-dev/pandas/pull/47334
https://github.com/pandas-dev/pandas/pull/47340
https://github.com/pandas-dev/pandas/pull/47346
https://github.com/pandas-dev/pandas/pull/47355
https://github.com/pandas-dev/pandas/pull/47356
https://github.com/pandas-dev/pandas/pull/47373
https://github.com/pandas-dev/pandas/pull/47374
https://github.com/pandas-dev/pandas/pull/47394
https://github.com/pandas-dev/pandas/pull/47395
https://github.com/pandas-dev/pandas/pull/47421
https://github.com/pandas-dev/pandas/pull/47522
https://github.com/pandas-dev/pandas/pull/47535
https://github.com/pandas-dev/pandas/pull/47537
https://github.com/pandas-dev/pandas/pull/47579
https://github.com/pandas-dev/pandas/pull/47641
https://github.com/pandas-dev/pandas/pull/47668
https://github.com/pandas-dev/pandas/pull/47682
https://github.com/pandas-dev/pandas/pull/47720
https://github.com/pandas-dev/pandas/pull/47807
https://github.com/pandas-dev/pandas/pull/48261
https://github.com/pandas-dev/pandas/pull/48661
https://github.com/pandas-dev/pandas/pull/48669
https://github.com/pandas-dev/pandas/pull/48743
https://github.com/pandas-dev/pandas/pull/48748
https://github.com/pandas-dev/pandas/pull/48815
https://github.com/pandas-dev/pandas/pull/48819
https://github.com/pandas-dev/pandas/pull/48836
https://github.com/pandas-dev/pandas/pull/48894
https://github.com/pandas-dev/pandas/pull/48901
https://github.com/pandas-dev/pandas/pull/48910
https://github.com/pandas-dev/pandas/pull/48923
https://github.com/pandas-dev/pandas/pull/48928
https://github.com/pandas-dev/pandas/pull/48953
https://github.com/pandas-dev/pandas/pull/48956
https://github.com/pandas-dev/pandas/pull/48961
https://github.com/pandas-dev/pandas/pull/49008
https://github.com/pandas-dev/pandas/pull/49014
https://github.com/pandas-dev/pandas/pull/49015
https://github.com/pandas-dev/pandas/pull/49034
https://github.com/pandas-dev/pandas/pull/49050
https://github.com/pandas-dev/pandas/pull/49058
https://github.com/pandas-dev/pandas/pull/49097
https://github.com/pandas-dev/pandas/pull/49098
https://github.com/pandas-dev/pandas/pull/49104
https://github.com/pandas-dev/pandas/pull/49106
https://github.com/pandas-dev/pandas/pull/49171
https://github.com/pandas-dev/pandas/pull/49285
https://github.com/pandas-dev/pandas/pull/49290
https://github.com/pandas-dev/pandas/pull/49737
https://github.com/pandas-dev/pandas/pull/49824
https://github.com/pandas-dev/pandas/pull/50015
https://github.com/pandas-dev/pandas/pull/50348
https://github.com/pandas-dev/pandas/pull/50369
https://github.com/pandas-dev/pandas/pull/50469
https://github.com/pandas-dev/pandas/pull/50642
https://github.com/pandas-dev/pandas/pull/50719
https://github.com/pandas-dev/pandas/pull/50773
https://github.com/pandas-dev/pandas/pull/50774
https://github.com/pandas-dev/pandas/pull/50793
https://github.com/pandas-dev/pandas/pull/50835
https://github.com/pandas-dev/pandas/pull/50852
https://github.com/pandas-dev/pandas/pull/50914
https://github.com/pandas-dev/pandas/pull/50978
https://github.com/pandas-dev/pandas/pull/51039
https://github.com/pandas-dev/pandas/pull/51087
https://github.com/pandas-dev/pandas/pull/51092
https://github.com/pandas-dev/pandas/pull/51223
https://github.com/pandas-dev/pandas/pull/51274
https://github.com/pandas-dev/pandas/pull/51320
https://github.com/pandas-dev/pandas/pull/51334