Upgrade DuckDB Storage Versions

No more juggling CLI paths and versions.

Tyler White
Learning The Computers
5 min readNov 1, 2023

--

Currently, data engineers and data scientists are engaged in discussions about DuckDB, which is one of the most popular topics in the field. This database management system provides a unique and refreshing alternative approach with its file-based system. However, it may take some time to adapt to its unique functionality. DuckDB frequently releases new versions, which come with additional features and deprecate older ones. When this happens, the storage version of the database may also be upgraded, requiring users to use a previous version to work with their existing database. Nonetheless, DuckDB provides a clear explanation and solution for this behavior.

If you know the problem, scroll to the bottom to check the solution!

The DuckDB internal storage format is currently in flux, and is expected to change with each release until we reach v1.0.0.

The Problem

For most users, this is nothing more than a minor inconvenience. Understanding the complexities of storage versions is beyond the scope of most people, and a simple upgrade usually solves the problem. The real challenge lies in the fact that most users need to keep multiple versions of DuckDB installed on their devices to facilitate the upgrading process. In the past, I would manually install pinned versions, export data, reinstall the desired version, and then import the data.

I frequently encountered this problem attempting to work with demos that had already materialized DuckDB databases. I wanted to make the upgrade process much more accessible to prevent repeating steps.

The Solution

DuckDB provides helpful information related to compatible versions and how to check the storage version of a particular file. With this in mind, we can scaffold a few boilerplate solutions to check the version and download the appropriate CLI version of DuckDB for a particular platform.

Detecting and downloading versions.

With this code, we define a StorageVersion class to represent different storage versions, each associated with a list of DuckDB versions that support it. It also includes a list of predefined STORAGE_VERSIONS, detailing these associations.

The script provides two main functions. First, it has a check_version function that checks the version of a DuckDB storage file by reading its binary data. Second, it has a download_version function to download and extract the DuckDB binary for a specific version. The binary choice is determined based on the operating system (Darwin, Windows, Linux) and processor architecture (i386, arm). The downloaded binary is then given executable permissions using the subprocess module's chmod command.

Please note that the code assumes a specific directory structure and relies on the availability of DuckDB binaries on the provided download URL.

We can now chain these functions together to perform the upgrade.

The upgrade_database function is designed to facilitate upgrading a DuckDB database from an old version to a new version. It takes three parameters:

  • old_name: The name of the existing DuckDB database that you want to upgrade.
  • new_name: The name for the upgraded database.
  • new_version: The target DuckDB version (default is "v0.9.1").

Here’s a step-by-step explanation of what the function does:

  1. It starts by calling the download_version function with the specified new_version to ensure that the desired DuckDB version is downloaded and available for use.
  2. It determines the DuckDB version of the existing database (old_name) by searching through the predefined STORAGE_VERSIONS list and find the most recent DuckDB version that corresponds to the storage version of the old database. This version is stored in the duckdb_version variable.
  3. It then calls the download_version function again, this time to ensure that the DuckDB binary for the determined duckdb_version is downloaded and available.
  4. It uses the subprocess.run function to execute two DuckDB command-line operations to export the data from the old database (old_name) to a temporary location named 'tmp' and import the data from the ‘tmp’ location into a new database with the specified new_name using the new_version of DuckDB.
  5. After the database upgrade is completed, the script removes the temporary ‘tmp’ directory using shutil.rmtree("tmp").

In summary, this function automates the process of upgrading a DuckDB database. It takes care of downloading necessary DuckDB versions, exporting and importing data, and cleaning up temporary files.

Conclusion

I may consider turning this into a CLI in the future, but for now, it serves as a quick upgrade tool. It could also use a few improvements by checking better to determine if downloads are necessary and optionally using previously installed locations to avoid duplication.

I hope this is helpful if you’re running into challenges maintaining different versions of DuckDB databases.

Full Source Code

import platform
import shutil
import struct
import subprocess
import zipfile
from io import BytesIO
from pathlib import Path

import requests

BASE_DOWNLOAD_URL = "https://github.com/duckdb/duckdb/releases/download"
TMP_PATH = "../tmp"
DARWIN_FILE = "duckdb_cli-osx-universal.zip"
WINDOWS_FILE = "duckdb_cli-windows-amd64.zip"
LINUX_I386_FILE = "duckdb_cli-linux-amd64.zip"
LINUX_ARM_FILE = "duckdb_cli-linux-aarch64.zip"


class StorageVersion:
def __init__(self, version, duckdb_versions):
self.version = version
self.duckdb_versions = duckdb_versions

def __str__(self):
return f"Storage Version {self.version} supported by DuckDB versions: {', '.join(self.duckdb_versions)}"


STORAGE_VERSIONS = [
StorageVersion(64, ["v0.9.0", "v0.9.1"]),
StorageVersion(51, ["v0.8.0", "v0.8.1"]),
StorageVersion(43, ["v0.7.0", "v0.7.1"]),
StorageVersion(39, ["v0.6.0", "v0.6.1"]),
StorageVersion(38, ["v0.5.0", "v0.5.1"]),
StorageVersion(33, ["v0.3.3", "v0.3.4", "v0.4.0"]),
StorageVersion(31, ["v0.3.2"]),
StorageVersion(27, ["v0.3.1"]),
StorageVersion(25, ["v0.3.0"]),
StorageVersion(21, ["v0.2.9"]),
StorageVersion(18, ["v0.2.8"]),
StorageVersion(17, ["v0.2.7"]),
StorageVersion(15, ["v0.2.6"]),
StorageVersion(13, ["v0.2.5"]),
StorageVersion(11, ["v0.2.4"]),
StorageVersion(6, ["v0.2.3"]),
StorageVersion(4, ["v0.2.2"]),
StorageVersion(1, ["v0.2.1 and prior"]),
]


def check_version(path: str):
"""Check the version of a DuckDB storage file."""
pattern = struct.Struct("<8x4sQ")
return pattern.unpack(open(path, "rb").read(pattern.size))[-1]


def download_version(version: str):
"""Download and extract the DuckDB binary for a specific version."""
system = platform.system()
processor = platform.processor()

if system == "Darwin":
FILE_NAME = DARWIN_FILE
elif system == "Windows":
FILE_NAME = WINDOWS_FILE
elif system == "Linux":
if processor == "i386":
FILE_NAME = LINUX_I386_FILE
elif processor == "arm":
FILE_NAME = LINUX_ARM_FILE

LOCAL_VERSION_PATH = f"{TMP_PATH}/{version}"
DOWNLOAD_VERSION_PATH = f"{BASE_DOWNLOAD_URL}/{version}/{FILE_NAME}"

if not Path(LOCAL_VERSION_PATH).exists():
Path(LOCAL_VERSION_PATH).mkdir(parents=True, exist_ok=True)
zipfile.ZipFile(
BytesIO(requests.get(DOWNLOAD_VERSION_PATH).content)
).extractall(LOCAL_VERSION_PATH)

subprocess.run(["chmod", "755", f"{LOCAL_VERSION_PATH}/duckdb"])

return f"{LOCAL_VERSION_PATH}/duckdb"


def upgrade_database(old_name, new_name, new_version: str = "v0.9.1"):
"""Upgrade a DuckDB database from an old version to a new version."""
download_version(new_version)
duckdb_version = [
x.duckdb_versions[-1]
for x in STORAGE_VERSIONS
if x.version == check_version(old_name)
][0]
download_version(duckdb_version)
subprocess.run(
[f"{TMP_PATH}/{duckdb_version}/duckdb", old_name, "-c", "EXPORT DATABASE 'tmp'"]
)
subprocess.run(
[f"{TMP_PATH}/{new_version}/duckdb", new_name, "-c", "IMPORT DATABASE 'tmp'"]
)
shutil.rmtree("tmp")

--

--

Tyler White
Learning The Computers

I constantly seek new ways to learn, improve, and share my knowledge and experiences. Solutions Architect @ Snowflake.