Rust crate `derive_sql` example
Collate and interrogate Museum of Modern Art data using a Rust CLI
I am the author of the crate derive_sql
, which is the first (and only) library that I develop and make available. This article shows how to use it, as well as allow me to try it out in a practical context and improve it.
The example is inspired from the Postgres — Rust Cookbook Aggregate data. It describes the development of a simple command line (CLI) tool using derive_sql
to collect Museum of Modern Art (MOMA) Collection database — available on github — records in an SQL database and subsequently query these records. The story also touches on unit testing.
The source code is available on github.
The views/opinions expressed in this story are my own. This story relates to my personal experience and choices. The story, code extract(s), and source code are provided in the hope that they will be useful without any warranty.
The aim of the story is to develop a CLI tool to retrieve the MOMA collection records, store these records in an SQL database and run two queries against these records: (a) calculate the number of artists of each nationalities and returns the top 25 nationalities, and (b) calculate the number of artworks by artists nationalities and returns the top 25 nationalities.
The MOMA collection is composed of two JSON files: Artists.json
a list of ‘all the artists who have worked MOMA’s collection’ and Artworks.json
a list of ‘all of the works that have been accessioned into MoMA’s collection’. The list of artworks contains an array potentially storing multiple artists thaat may have contributed to one artworks. The above collection of information is collected in a database composed of 3 tables describing: (a) each artist, (b) each artwork and (c) artist to artwork relation. The following code extract shows the definition of each table.
The struct Artwork
and ArtworkAttribution
do not allow for deserialisation of the Artworks.json
JSON data structure directly. The file Artworks.json
is parsed using a dedicated JSON data structure that is then used to create instances of Artwork
and ArtworkAttribution
structs.
The functionality of collecting and storing artist and artwork information consists of 3 steps: (a) prepare the SQL table, (b) download the information and (c) store the information. Step (a) and (b) order can be reversed. The following code shows how this was done for the artist information alongside the method signature. The struct SqlArtist
is generated by the derive macro derive_sql::DeriveSqlStatement
applied on the Artist
struct.
The method signature allows for the data manipulation and insertion logic to be tested as shown below. The testing uses rusqlite
in-memory database functionalities — but the testing could also be done using a mock implementation of the derive_sql
traits, or a temporary connection to a MySQL database using mysql
. The functions to retrieve artists and artworks are injected so that API calls that would be time consuming and not controlled are avoided.
The above describe the implementation of the retrieval of the MOMA data and storage in a SQL database. The next step is to implement the ability of running a query of the database to retrieve the top 25 nationalities by number of artists, alongside the number of artists for each of these nationality. The SQL query statement is shown in the Rust Cookbook:
SELECT nationality, COUNT(nationality) AS count
FROM artist GROUP BY nationality ORDER BY count DESC LIMIT 25
The implementation of this query is done in 3 steps. In the first step, we define a struct that the query result will be converted into. The macro DeriveSqlStatement
automatically implements trait derive_sql::traits::TryFromRefRow
to allow conversion of query results into the result struct:
The second step is to define a struct that will provide the SQL query by implementing the trait derive_sql::traits::SelectStatement
. This trait underpin SELECT
queries only. The formatting of the statement uses the SQL table name and field names defined in the struct supporting the artist SQL table. This allow for any change to the latter struct to be reflected here — if the table name were changer or result in a compilation error if the field nationality
was spelt differently.
The last step is to make the query call using the functionality provided in the SelectV2
trait. This traits has a blanket implementation for any struct that implement the SelectStatement
trait and includes variations to undertake query call with filtering, ordering, limit and offset options.
This functionality can be tested similarly to the original test by starting an SQLite in-memory database, populating it with test data, running the query and checking the results as shown below:
The second query follows a very similar process to the exception that the SQL query to be run uses SQL joins to allow the query to span all 3 tables. This results in a code that is nearly identical to the one above — the QueryResult
struct could be reused, but is kept separate so that each query is self contained within its own file . The main difference lay in the implementation of the SelectStatement
trait. The SQL statement shown below exceeds the needs required for the set goals — listing nationality by number of artworks only require artwork association, but artwork details. The use of the artwork detail is shown so that one could apply a filtering on for example the department
field of the Artwork
table.
The command line interface parameters are handled using the clap
crate (‘builder’ approach) — see the github repository for details.
Bonus Notes:
The aim was to support SQLite (rusqlite
crate) and MySQL (mysql
crate) in the same compiled executable. The first attempt using a function returning a dynamic trait led to Rust compilation error E0038 — `error[E0038]: the trait `...` cannot be made into an object
. This issue was new to me, and allowed me to learn (and probably not fully understand) about ‘object-safe traits’. Initially I started along the road of re-writing the library traits into ‘object-safer’ traits. But I decided to backtrack and approach the problem with using the current library as building blocks. This was done by creating an enum that encapsulates the supported SQL connections and implement the relevant derive_sql
traits — see file database.rs
.
The MOMA data results in 15,000+ artists and 150,000+ artworks to be inserted in database. Each data was originally inserted in a sequential manner — no batch operation is/was available in derive_sql
. Optimising this operation lead to (a) introducing a new trait to insert multiple entries as batch and (b) learning about SQL transactions. This allowed for the data insertion time to decrease from minutes down to tens of seconds.
Another lesson learnt related to crate versioning, the ability of cargo
to handle multiple version of the same crate and the side effects of this ability.
The video below shows the CLI in action — note the use of export RUST_LOG=info
to display log information. The top country by number of artists and artworks is America.
Rustaceans 🚀
Thank you for being a part of the Rustaceans community! Before you go:
- Show your appreciation with a clap and follow the publication
- Discover how you can contribute your own insights to Rustaceans
- Connect with us: X | Rust Bytes Newsletter