Peeking Duck: duckdb + lance for computer vision

SELECT predict('resnet', image) FROM dataset

Chang She
LanceDB
4 min read · Oct 19, 2022


(source: DALL-E)

There’s been a lot of excitement about a modern-data-stack-in-a-box with DuckDB, and rightly so. What if the modern data stack also included unstructured data and machine learning? What if image classification and object detection were as easy as computing simple value counts? Since I’m writing this post, you might guess the answer already: yes, you can.

simple query for classification

This is made possible using DuckDB in conjunction with Lance, a new columnar data format for computer vision (CV). Lance is like Parquet but built with CV in mind, with fast point access, partial reads, and optimisations for nested annotation columns. TL;DR: it’s much more performant. Lance can be accessed by tools like pandas and DuckDB via Apache Arrow. Let’s see it in action.

then you just draw the rest of the owl

Loading the extension

First, follow the build instructions for the Lance DuckDB extension, then fire up a Jupyter notebook (or IPython session) to install and load the extension.

Install and load the lance duckdb extension

Create the model

The extension comes with table functions like create_pytorch_model() to register models and ml_models() to list the existing ones.

registering a torchscript model

Where do these models come from? As the output of the select statement indicates, these are TorchScript models. For ResNet, it’s 3 lines of code to create the .pth file I just loaded:

Compiling and saving a pre-trained model to torchscript

The data

ResNet was trained on ImageNet (1,000 classes), so it’d be fun to run it on a different dataset. Here we use the Oxford Pet Dataset, which is organized into /images and /annotations directories. The /annotations directory has the following:

  1. data indices in txt format
  2. an /xmls directory with an XML annotation for each labeled image
  3. a /trimaps directory of trimap PNGs

To make this data queryable, we’ve converted it to Lance format, which is easily read into an Arrow dataset:

create a pyarrow dataset from Lance

Let’s query the data

The pyarrow dataset we created is directly queryable via DuckDB. We can easily slice and dice the dataset using SQL:

Smiling sammy!!!

Show me the magic already!

OK, time for the real fun. The Lance DuckDB extension has a predict scalar function, which lets you run inference by passing in the model name and a binary column of images:

run model inference using `predict`

The create_pytorch_model function from before registered the model under the name “resnet”, and the predict function knows to find it under that name. It then applies the model to the image column, which is a blob column containing the image data itself.

The model output is the probabilities for each of the 1000 ImageNet classes it was trained on. To get a predicted class, we use the list_argmax array function to pick the class with the highest probability.
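The argmax step itself is simple; here’s the same logic in plain Python (class 258 is “Samoyed” in the standard ImageNet ordering, which is why it shows up in the results below):

```python
# list_argmax returns the index of the largest element in a list column;
# this is the same logic in plain Python.
def list_argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)

probs = [0.001] * 1000
probs[258] = 0.93  # class 258 is "Samoyed" in the ImageNet ordering
predicted_class = list_argmax(probs)
```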

How about something human readable?

OK so we got some results, but how do we know if 258 is a reasonable answer? Since we’re using python this is pretty simple. First, we load a pandas DataFrame of the ImageNet labels:

Load dataframe with label and id

We can then join this pandas dataframe back to the predictions to map the integer class to a string label:

As Colonel Hans Landa would say, pas si mauvais

Conclusion

In this post we showed that you can do model inference easily in SQL, using the Lance extension for DuckDB. Using Lance via Arrow, we can manage images, metadata, and annotations all in one place, and we can query it efficiently even when the data lives in cheap remote storage.

You can find Lance here: https://github.com/eto-ai/lance. If you like it, we’d love a star on our project, and we’d appreciate your feedback even more!

Code

All of the code snippets can be found in this notebook.

If you like this, feel free to upvote on HN!
