Please pardon our appearance during renovations

Chang She
LanceDB
Feb 9, 2023 · 4 min read

Rust, Jehovah’s Witnesses, and why our docs are broken

The Lance GitHub repo is currently a construction site, and we’d like to apologize to anyone looking to get their bearings in the repo this week. We’re putting Humpty Dumpty back together and will get this sorted out soon, with a new feature that will be worth the wait!

Re-writing Lance in Rust

This is the story of how a small hackathon turned us into Jehovah’s Witnesses for Rust. Over the past holiday season, we began writing Rust code for a hack project that required us to re-write part of the Lance read path in Rust, building on the arrow-rs crate. Up to that point, we had spent the second half of last year implementing Lance in C++. While we made good progress, the pain of fighting with CMake, linking dependencies, and never feeling quite confident about segfaults really took its toll. Writing in Rust felt like such a breath of fresh air that, after 3 days, we made the call to re-write the whole thing in Rust.

This process took us about 3 weeks of focused development in Rust. v0.2.9 of Lance is the last release with the C++ implementation. v0.3.0 is the first release with the Rust implementation.

Nearing the end of that process, this pull request made us super happy:

Apparently our sentiment was echoed by Martin Horenovsky on Twitter:

How it’s going:

🔭 Some observations

Creating a new data tool in Rust was super interesting. We learned a lot during this process and here’s a summary of differences, learnings, and observations.

⏫ Arrow

Previously we integrated against Arrow C++; now we’re using arrow-rs. Arrow-rs has the basics but is much less full-featured, which is both a blessing and a curse. We’ve had to build a number of things ourselves, but in C++ land the problem was the opposite: the existing classes/abstractions were not flexible enough to customize for Lance’s enriched feature set (e.g., schema evolution, versioning, random access, and extension types, to name a few).

🐍 Python

Previously we used Cython to integrate with Python; now we’re using pyo3 plus some Python code on top. I’ve been writing Cython for a long time so I’m used to it, but pyo3 is much nicer. The only time I managed to crash Python was when I had a mismatch between the schema and the batches of a RecordBatchReader in Arrow, which leaked into the unsafe nether regions of arrow-rs and the Arrow C Data interface.

Ecosystem integration

This is where the Rust story isn’t quite as good. For example, we wanted to dress up the new Lance datasets as PyArrow datasets for DuckDB, and it wasn’t really possible. The kind folks at DuckDB Labs were gracious enough to accept this PR to fix the blocking problem, so that will go out in the next DuckDB release. That just leaves the cumbersome interface in pyarrow, which we’re discussing on the Arrow GitHub.

What’s missing from Lance still

We made a hard choice and had to take some small steps backwards. We’re still working on bringing a number of things back:

  1. Schema evolution. Dataset versioning supports appending new rows, but adding new columns hasn’t been re-implemented in Rust yet.
  2. Filter pushdown. In C++ it was easy to use Arrow Compute Expressions to get predicates pushed down from DuckDB and pyarrow. In Rust there’s no equivalent abstraction for compute expressions. We’re investigating Substrait, but we may just have to do a string-based stopgap for now 🤷‍♂
  3. NA handling. This is implemented for strings and variable-length arrays, but Lance is currently missing NA handling for fixed-stride data.
  4. Extension types. While arrow-rs has the extension data type, there is no extension registry and no real story around extension types. This makes it harder to build semantic types for images, videos, PDFs, etc. But then again, the C++ extension types had so many rough edges that maybe this is a blessing in disguise.
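To give a flavor of what a string-based filter stopgap could look like, here is a toy sketch (this is hypothetical illustration code, not Lance’s actual API): parse a simple `<column> <op> <literal>` predicate string and apply it during the scan, which is all a pushed-down filter really needs to do.

```python
import operator

# Hypothetical sketch of a string-based filter stopgap -- not Lance's
# actual API. Parse simple "<column> <op> <literal>" predicates and
# evaluate them row by row during a scan.

OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    "!=": operator.ne,
}

def parse_predicate(expr: str):
    """Turn a string like 'score >= 0.5' into a callable row filter."""
    column, op, literal = expr.split()
    fn = OPS[op]
    try:
        value = float(literal)
    except ValueError:
        value = literal.strip("'\"")  # fall back to a string literal
    return lambda row: fn(row[column], value)

def scan_with_filter(rows, expr: str):
    """Emulate a filtered scan: apply the pushed-down predicate while reading."""
    keep = parse_predicate(expr)
    return [row for row in rows if keep(row)]

rows = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.3}, {"id": 3, "score": 0.7}]
print(scan_with_filter(rows, "score >= 0.5"))  # keeps rows 1 and 3
```

It’s crude next to real compute expressions or Substrait, but it shows why a string protocol is a workable bridge: the caller only has to agree on a predicate grammar, not on a shared expression object model.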

Hey where’s the surprise you promised?!

I’ve ranted for too long, so I won’t go into details here, but we shipped a fast vector index (IVF-PQ implemented, HNSW on the roadmap). This means you can get millisecond-latency nearest neighbor search straight from disk. The current priority is making index creation fast enough that you don’t need to go do something else while waiting. I’ll lay out the details in a separate post.
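To give a flavor of why an IVF-style index can answer queries so quickly, here is a toy sketch of the inverted-file idea (this is illustration code only, not the Lance implementation, which additionally product-quantizes the vectors): cluster the vectors into partitions, and at query time scan only the partitions whose centroids are closest to the query.

```python
import math

# Toy sketch of the IVF ("inverted file") idea behind an IVF-PQ index --
# not the actual Lance implementation. Vectors are assigned to the
# nearest of a few centroids; a query then scans only the closest
# partition(s) instead of the whole dataset.

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
        partitions[nearest].append((vid, v))
    return partitions

def search(partitions, centroids, query, nprobe=1, k=1):
    """Probe the nprobe closest partitions, then rank those candidates exactly."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [item for i in probe for item in partitions[i]]
    return sorted(candidates, key=lambda iv: l2(query, iv[1]))[:k]

vectors = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids = [(0.0, 0.0), (5.0, 5.0)]  # in practice learned via k-means
ivf = build_ivf(vectors, centroids)
print(search(ivf, centroids, (5.1, 5.0), nprobe=1, k=1))  # → [(2, (5.0, 5.0))]
```

With `nprobe` much smaller than the number of partitions, the search touches only a sliver of the data, which is what makes millisecond latency from disk plausible; PQ then compresses each partition so even that sliver is cheap to read.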

Come build with us!

In between evangelizing Rust like the fanbois we now are, we’re working on bringing the Lance documentation back online and updating the tutorial notebooks we’ve broken. Please stay tuned for more fun with Rusty Lances. If you love data and you like (or want to try) Rust, ping me at chang@eto.ai !
