Peeking Duck: duckdb + lance for computer vision
SELECT predict('resnet', image) FROM dataset
There’s been a lot of excitement about a modern-data-stack-in-a-box with DuckDB, and rightly so. What if the modern data stack also included unstructured data and machine learning? What if image classification and object detection were just as easy as computing simple value counts? Since I’m writing this post, you can probably guess the answer already: yes, you can.
This is made possible using DuckDB in conjunction with Lance, a new columnar data format for computer vision (CV). Lance is like Parquet, but built with CV in mind: fast point access, partial reads, and optimisations for nested annotation columns. TL;DR: it’s much more performant. Lance can be accessed by tools like pandas and DuckDB via Apache Arrow. Let’s see it in action.
Loading the extension
First, follow the build instructions for the Lance DuckDB extension, then fire up a Jupyter notebook (or IPython session) to install and load the extension.
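A minimal loading sketch, assuming you’ve built the extension locally (the path is a placeholder; point it at your own build output). Locally built extensions are unsigned, so DuckDB needs the `allow_unsigned_extensions` flag:

```python
import duckdb

# Hypothetical path to the extension you just built; adjust to your build output
ext_path = "lance.duckdb_extension"

# allow_unsigned_extensions is required to load locally built, unsigned extensions
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})
con.execute(f"LOAD '{ext_path}'")
```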
Create the model
The extension comes with table functions like `create_pytorch_model` and `ml_models()` to load and list existing models.
Where do these models come from? As the output of the SELECT statement indicates, these are TorchScript models. For resnet, it takes three lines of code to create the .pth file I just loaded:
The data
Resnet was trained on ImageNet (1,000 classes), so it’d be fun to run it on a different dataset. Here we use the Oxford Pet Dataset, which is organized into /images and /annotations directories. The /annotations directory contains:
- data indices in txt format
- an /xmls directory of XML annotations, one per labeled image
- a /trimaps directory of trimap PNGs
To make this data queryable, we’ve converted it to Lance format, which is easily read into an Arrow dataset:
Let’s query the data
The pyarrow dataset we created is directly queryable via DuckDB. We can easily slice and dice it using SQL:
Show me the magic already!
OK, time for the real fun. The Lance DuckDB extension has a `predict` scalar function which lets you run inference by passing in the model name and a binary column of images:
The `create_pytorch_model` function from before registered the model under the name “resnet”, and the `predict` function knows to find it under that name. It then applies the model to the image column, which is a blob column containing the image data itself.
The model output is the probabilities for each of the 1,000 ImageNet classes it was trained on. To get a predicted class, we use the `list_argmax` array function to pick the class with the highest probability.
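Putting the two together, the full query might look something like this (a sketch, assuming the extension is loaded, the model is registered as “resnet”, and the Arrow dataset is visible to DuckDB as `ds`; column names are illustrative):

```sql
SELECT filename,
       list_argmax(predict('resnet', image)) AS pred
FROM ds
LIMIT 5;
```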
How about something human readable?
OK, so we got some results, but how do we know whether 258 is a reasonable answer? Since we’re using Python, this is pretty simple. First, we load a pandas DataFrame of the ImageNet labels:
We can then join this pandas DataFrame back to the predictions to map each integer class to a string label:
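The join itself is a plain pandas merge. A sketch with a hypothetical two-row slice of the label table (the real file maps all 1,000 class indices to names; 258 really is “Samoyed” in ImageNet):

```python
import pandas as pd

# Hypothetical slice of the ImageNet label table (class index -> name)
labels = pd.DataFrame({"class": [258, 259], "label": ["Samoyed", "Pomeranian"]})
# Hypothetical prediction output from the query above
preds = pd.DataFrame({"filename": ["samoyed_12.jpg"], "pred": [258]})

# Map each predicted integer class to its human-readable label
named = preds.merge(labels, left_on="pred", right_on="class")
print(named[["filename", "label"]])
```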
Conclusion
In this post we showed that you can do model inference easily in SQL, using the Lance extension for DuckDB. Using Lance via Arrow, we can manage images, metadata, and annotations all in one place, and we can query them efficiently even when the data lives in cheap remote storage.
You can find Lance here: https://github.com/eto-ai/lance. If you like it, we’d love a star on our project, and we’d appreciate your feedback even more!
Code
All of the code snippets can be found in this notebook.
If you like this, feel free to upvote on HN!