#2: Learning Rust — Conspiracies, Databases, and Diesel.rs

This post is the second in a series where I share my experience learning Rust. I’m building out a conspiracy theories API to help me get more familiar with Rust and to have a little fun. Since I am new to Rust, I welcome any and all feedback, especially from developers who have been using Rust for quite some time; leave a comment below or contact me on Twitter. With that out of the way, it is time to put on your foil-lined hat and start storing the conspiracies in a database.

In post #1 I built a CLI tool that fetched the Conspiracy theory Wikipedia page, grabbed the page data, and populated an instance of the WikiPage struct with it. In this post, I add functionality that retrieves more pages and stores them in a SQLite database using diesel.rs. In addition to the page data, I am going to grab all the links from the Conspiracy theory page. The links are used to retrieve other pages to add to the database. Before I get started adding the new functionality, I’m going to re-organize the way the source files are arranged. Eventually, there will be two binaries in this project, and with the way the code is currently laid out, that won’t be possible. So, I’m going to make the following changes.

Re-organizing the Code

The first step of the re-org is to create the src/wiki.rs file. Now the wiki-related code lives in its own module and is usable by all modules in the project. Since I am accessing the WikiPage struct from src/db_loader/main.rs, which is a different module, I need to make it a public struct. In Rust, to make a struct, function, or method public, you add the pub keyword to the beginning of its definition.

The src/lib.rs file is where the extern crate lines live. This file is where I declare that the wiki module is public by adding the line pub mod wiki;. Making the wiki module public allows other modules in the project to load the Wikipedia related code into their scope. Now that I have the src/lib.rs file I can move the src/main.rs into a new directory, db_loader. The current directory structure looks like this:

▶ tree src 
src
├── db_loader
│   └── main.rs
├── lib.rs
└── wiki.rs

Since I’ve moved the main.rs file, I need to let cargo know that by adding a [[bin]] section to the Cargo.toml file.

[[bin]] 
name = "db_loader"
path = "src/db_loader/main.rs"

The properties in the [[bin]] section are straightforward and allow me to run cargo build --bin db_loader if I want to build only the db_loader. If I run cargo build, it builds the db_loader and any other binaries that are in the project. After updating the Cargo.toml file, it’s time to run another build to make sure everything still compiles.

Now that the wiki.rs file lives in a different module from main, I need to add an extern crate conspiracies; line at the top of main.rs and add use conspiracies::wiki::{WikiRepo};. Adding these lines brings the conspiracies code into scope, and the use statement allows me to use WikiPage instead of a fully qualified name wherever I make use of the struct.

importing the wiki module’s WikiRepo
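The code for this was an image in the original post; based on the description above, the top of src/db_loader/main.rs would look something like this:

```rust
// Top of src/db_loader/main.rs: bring the library crate and the
// wiki module's items into scope, as described above.
extern crate conspiracies;

use conspiracies::wiki::{WikiRepo};
```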

Another side effect of moving the WikiPage struct is the fact that the fields of the struct are now private to the wiki module. So if I want to access the fields, I either need to make each field public or have methods to interact with the fields indirectly. I don’t want to have each of the fields public, so I’m going to create a ‘constructor’ to create a new instance of WikiPage. Rust doesn’t have a true constructor method like C# or C++, but I can create a function that acts like one.

the WikiPage ‘constructor’
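The original post showed this as an image; here is a minimal sketch of what such a constructor might look like. The field names and types are my assumptions, not the post’s actual definition:

```rust
// Hypothetical WikiPage definition; the real field set comes from post #1.
pub struct WikiPage {
    page_id: i32,
    title: String,
    content: String,
}

impl WikiPage {
    // Note: `pub` goes on the function, not on the impl block.
    pub fn new(page_id: i32, title: String, content: String) -> WikiPage {
        WikiPage { page_id, title, content }
    }

    // Accessor methods expose the private fields indirectly.
    pub fn page_id(&self) -> i32 {
        self.page_id
    }

    pub fn title(&self) -> &str {
        &self.title
    }
}
```

Because the fields stay private, callers go through new and the accessors rather than touching the struct’s internals directly.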

There are a couple of things I’d like to point out here, the first being that the pub is not in front of the impl statement; it goes in front of the new function. If I had pub impl instead, the compiler would tell me pub not permitted here because it's implied. Adding the visibility on a function-by-function basis gives the developer control over which functions or methods are made public. With these changes in place, I can safely build the db_loader.

Remember, I can build it by running cargo build, which builds all binaries in the project, or I can build the one binary I need to by running cargo build --bin <name>. With the code re-organized I can start working on the database related code.

Using Diesel.rs to Interact with a Database

After getting the page data, I want to save it in a database. I’m going to use diesel.rs, which seems to be the most popular ORM in the Rust world. Diesel has a command-line tool that is used to set up the project, create the database, and run migrations. You’ll need to install the diesel_cli tool by running the command below.

cargo install diesel_cli --no-default-features --features sqlite

When I first attempted to install the diesel_cli, I ran it without the --no-default-features flag, and I received an error about the MySQL client library not being found. If you get that error, you can either install the missing library or use the --no-default-features flag. After installing the diesel CLI, I created the database directory, which houses the SQLite database. Next, I created a .env file in the project’s root directory and added DATABASE_URL=./database/conspiracies.sqlite3 to it. The diesel_cli tool and my db_loader utility use the value of DATABASE_URL to connect to the database. With the CLI tool installed and the .env file in place, I can now run diesel setup. The command creates the migrations directory, the database using the value of DATABASE_URL, and the __diesel_schema_migrations table. The __diesel_schema_migrations table’s purpose is to help diesel ‘remember’ which migrations were applied to the database.

Speaking of migrations, let’s go ahead and create the first one. The first table I need is the conspiracies table, so I’m going to create a new migration by running diesel migration generate create_conspiracies. Here’s what the output of the migration generation looks like:

▶ diesel migration generate create_conspiracies
Creating migrations/2018-05-01-002930_create_conspiracies/up.sql
Creating migrations/2018-05-01-002930_create_conspiracies/down.sql

It created a directory for the migration files, up.sql and down.sql. The up.sql file is where I added my CREATE TABLE definition, and in down.sql I added a DROP TABLE command. Here’s the definition of the conspiracies table.
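The definition was an image in the original; a plausible reconstruction, assuming the columns mirror the WikiPage fields described earlier (the column names are my guesses):

```sql
-- up.sql (hypothetical reconstruction)
CREATE TABLE conspiracies (
    page_id INTEGER NOT NULL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT NOT NULL
);

-- down.sql
DROP TABLE conspiracies;
```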

The table mirrors the WikiPage struct. In the down.sql file, I’ve added the DROP TABLE conspiracies; command. With those two files ready to go, I’m going to create the table with another migration command, diesel migration run, which adds my table to the database. If, for some reason, you want to remove the changes made by up.sql, you’d run diesel migration revert. The revert command runs the DROP TABLE command we added to the down.sql file.

▶ sqlite3 database/conspiracies.sqlite3
-- Loading resources from /Users/robertrowe/.sqliterc
SQLite version 3.19.3 2017-06-27 16:48:08
Enter ".help" for usage hints.
sqlite> .tables
__diesel_schema_migrations conspiracies
sqlite>

After the migration, the conspiracies table now exists in the database. Next, I need to generate the schema file by running diesel print-schema > schema.rs, which produces the file below.

diesel’s generated schema.rs file
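The generated file appeared as an image; for a table like the one above, diesel’s print-schema output takes roughly this shape (column names and types are assumptions that match the earlier sketches):

```rust
// src/schema.rs: output of `diesel print-schema` (hypothetical columns).
table! {
    conspiracies (page_id) {
        page_id -> Integer,
        title -> Text,
        content -> Text,
    }
}
```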

According to the diesel Getting Started guide, the table! macro “creates a bunch of code based on the database schema to represent all of the tables and columns.” The schema module comes in handy when I’m adding conspiracies to the database. Before I can build the binary, I need to add two dependencies: the first is diesel, and the second is dotenv.

[dependencies]
clap = "2.31.2"
wikipedia = "0.3.1"
diesel = { version = "1.0.0", features = ["sqlite"] }
dotenv = "0.9.0"

As I mentioned earlier, diesel handles the database interactions, and the dotenv crate is used to retrieve environment variables from a .env file. After updating the Cargo.toml file, I added #[macro_use] extern crate diesel; to the lib.rs file so I can use diesel; the macro_use attribute is there so the table! macro can be used in the schema.rs file. I know it seems like a lot of setup, but it doesn’t take that long to complete. The setup is finished, and it’s time to start writing some code.
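Putting the pieces together, src/lib.rs ends up looking roughly like this; the module names beyond wiki follow the descriptions later in the post, so treat this as a sketch rather than the post’s exact file:

```rust
// src/lib.rs: external crates plus the project's public modules.
#[macro_use]
extern crate diesel;
extern crate dotenv;
extern crate wikipedia;

pub mod db;
pub mod schema;
pub mod wiki;
```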

The db module

The database-related code is going to live in the src/db.rs file. Eventually, I’m going to be storing data in three different tables: conspiracies, categories_to_pages, and links_processed. The conspiracies table houses the page data for all of the processed pages. The categories_to_pages table houses relationships between a Wikipedia category and a page, using the page_id as the foreign key. The links_processed table contains the page titles of the links found on the Conspiracy theories page and is used to keep track of which pages were processed and which ones still need to be. This table allows me to batch the page requests. I had started off trying to load all the pages from the links in one shot; however, after about 80 pages I started receiving HTTP errors. So I changed to a batch approach, using the links_processed table to manage the loading process. Now that the tables have been defined, it’s time to write the code to load them.

The get_sqlite_connection function is pretty straightforward: it takes the value of the DATABASE_URL environment variable and returns a SqliteConnection, or it writes an error message and exits the program, since there isn’t anything I can do without a database connection.
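The function itself was shown as an image; a sketch consistent with that description (the exact error handling is my guess):

```rust
use std::env;
use std::process;

use diesel::prelude::*;
use diesel::sqlite::SqliteConnection;
use dotenv::dotenv;

// Reads DATABASE_URL (loaded from .env by dotenv) and connects,
// exiting with a message if either step fails.
pub fn get_sqlite_connection() -> SqliteConnection {
    dotenv().ok();
    let database_url = env::var("DATABASE_URL").unwrap_or_else(|e| {
        eprintln!("DATABASE_URL must be set: {}", e);
        process::exit(1);
    });
    SqliteConnection::establish(&database_url).unwrap_or_else(|e| {
        eprintln!("error connecting to {}: {}", database_url, e);
        process::exit(1);
    })
}
```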

The next function to write is add_conspiracy, which adds a conspiracy to the conspiracies table. It takes two parameters: a SqliteConnection and a reference to an instance of a WikiPage struct. It uses the diesel::insert_into associated function to create a new row based on the data in the WikiPage struct. All of the ‘add’ functions follow the same pattern, add_*, with two parameters: the database connection and the object to be added. To use the structs as part of the insert statement, I need to decorate them with a couple of attributes.

added database-related attributes
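The screenshot of the decorated structs is missing here; applied to WikiPage, the attributes the post describes would look something like this (fields assumed):

```rust
use schema::conspiracies;

// Debug enables pretty-printing with {:#?}; Insertable plus table_name
// lets diesel map the struct onto the conspiracies table.
#[derive(Debug, Insertable)]
#[table_name = "conspiracies"]
pub struct WikiPage {
    page_id: i32,
    title: String,
    content: String,
}
```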

The derive attribute has the Debug trait so that when I pass the struct to a println! statement and use the {:#?} format string, it prints out in a nice, easy-to-read format. The Insertable trait is what allows me to pass the structs to the values function to add a new row to the table named in the table_name attribute. Before using the table_name attribute, bring the table objects in the schema module into scope with use schema::{conspiracies, links_processed, categories_to_pages};; otherwise, you’ll receive a compiler error.
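With the struct decorated, the insert function itself can be quite short. A sketch of add_conspiracy as described above (return type is my guess):

```rust
use diesel::prelude::*;
use diesel::sqlite::SqliteConnection;

use schema::conspiracies;
use wiki::WikiPage;

// Inserts one WikiPage into the conspiracies table; the same add_*
// pattern applies to the other tables.
pub fn add_conspiracy(conn: &SqliteConnection, page: &WikiPage) -> QueryResult<usize> {
    diesel::insert_into(conspiracies::table)
        .values(page)
        .execute(conn)
}
```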

For reading data from the database, I’ve taken a slightly different approach. In the get_links_to_process function, I format a SQL statement that retrieves a set of links that have not been processed.

The function takes the connection and the maximum number of links to retrieve. It returns either an empty Vec or a Vec of LinkProcessed with up to num_link items. The SQL statement is created using the format! macro. The result of that call is passed to the sql function, which converts the string into a diesel::expression::SqlLiteral that is used to execute the query against the database. The query.load call will return either a vector of the links to be processed or an empty vector; on failure, the function writes out an error and exits.
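A sketch of what that function might look like; the SQL text, column types, and error handling are my reading of the description, and the post’s exact diesel calls may differ slightly:

```rust
use std::process;

use diesel::dsl::sql;
use diesel::prelude::*;
use diesel::sql_types::{Integer, Text};
use diesel::sqlite::SqliteConnection;

// Fetches up to `num_links` rows whose processed flag is still 0.
pub fn get_links_to_process(conn: &SqliteConnection, num_links: i64) -> Vec<LinkProcessed> {
    let query = format!(
        "SELECT title, processed FROM links_processed WHERE processed = 0 LIMIT {}",
        num_links
    );
    // sql() turns the string into a SqlLiteral that load() can execute.
    sql::<(Text, Integer)>(&query)
        .load::<LinkProcessed>(conn)
        .unwrap_or_else(|e| {
            eprintln!("error getting links to process: {}", e);
            process::exit(1);
        })
}
```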

Here’s the struct used for inserting and reading data from the links_processed table. In addition to the two traits I mentioned on the WikiPage struct, this struct adds the Queryable trait so that it can be used in the get_links_to_process query.
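The struct was shown as an image; a sketch with field names assumed from the table description:

```rust
use schema::links_processed;

// Queryable lets diesel build this struct from query rows;
// Insertable lets it be written back to links_processed.
#[derive(Debug, Insertable, Queryable)]
#[table_name = "links_processed"]
pub struct LinkProcessed {
    pub title: String,
    pub processed: i32,
}
```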

After a link is successfully processed, the mark_link_as_processed function is called to mark it as such: the link’s processed column is set to 1 to indicate that the link has been successfully processed.
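That update is a one-liner in diesel; a sketch of how it might be written (the filter column is my assumption):

```rust
use diesel::prelude::*;
use diesel::sqlite::SqliteConnection;

// Sets processed = 1 for the link with the given title.
pub fn mark_link_as_processed(conn: &SqliteConnection, link_title: &str) -> QueryResult<usize> {
    use schema::links_processed::dsl::{links_processed, processed, title};

    diesel::update(links_processed.filter(title.eq(link_title)))
        .set(processed.eq(1))
        .execute(conn)
}
```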

Now that I’ve walked you through the database code, it’s time to update the src/db_loader/main.rs file to use it.

Updating db_loader/main.rs

I’ve made a few changes to the plan for retrieving and storing the data since I wrote the first post, so I’m going to walk you through the changes I’ve made to the main.rs file. First, I’ve added two new command-line arguments, --get-links and --page-count. The --page-count argument allows you to set the number of pages you want to fetch per run. After trial and error, I determined that I was able to run batch sizes of 60 successfully, so if --page-count isn’t used, the default value is 60. The --get-links flag is used to indicate that you wish to get and store the links off of the ‘seed’ page, which is the ‘Conspiracy_theory’ page. Here’s the code that executes when the flag is present.
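The branch itself was shown as an image; a sketch of its shape (the clap argument name, the add_link helper, and the surrounding variables are my guesses):

```rust
// Inside main, after parsing arguments with clap:
if matches.is_present("get-links") {
    let conn = db::get_sqlite_connection();
    // The closure receives each link and decides how to persist it.
    wiki::get_page_links(&client, seed_title, |link| {
        if let Err(e) = db::add_link(&conn, &link) {
            println!("unable to save link: {}", e);
        }
    });
}
```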

Since --get-links takes no values, I need to check for its presence. If it exists, I call a new function in the wiki module, get_page_links. I will go over that function in a little bit, but first I want to talk about its last parameter. The third parameter is a closure that handles inserting the links into the links_processed table. Why did I do this? That’s a good question. The reason I took this approach is that when I was attempting to return a vector of the links back to main, I kept running into lifetime-related errors. While I was going through the pains of trying to figure out the lifetime issues, I happened to see this post, Strategies for Returning References In Rust by Brice Fisher-Fleig. In the post, the author lists a few approaches to help get around some of the lifetime problems I was having. I chose the closure approach. After get_page_links has retrieved the links, it calls my closure, which handles inserting the data into the database. Here’s what the function signature looks like for get_page_links.
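Reconstructed from the description that follows, the signature is roughly (the Wikipedia client type comes from the wikipedia crate; treat this as a sketch):

```rust
// 'a ties the borrow of the Wikipedia client to this call;
// F is the closure that receives each LinkProcessed.
pub fn get_page_links<'a, F>(client: &'a Wikipedia<Client>, title: &str, save_action: F)
    where F: Fn(LinkProcessed)
{
    // body shown further down
}
```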

The 'a in the angle brackets indicates the lifetime of the Wikipedia client reference. The F is a generic, but not the typical data-type generic; it is for the closure that gets passed in. The where clause says that F must be a function that takes a single parameter, a LinkProcessed struct. When the function is called, it gets the Wikipedia page, makes a call to get the page’s links, and then loops over the links, creating a LinkProcessed struct for each one. After the struct is instantiated, it calls the closure. Here’s what the body of the function looks like:
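A sketch of the body, assuming the wikipedia crate’s page_from_title/get_links API and that LinkProcessed lives in the db module (error handling is my guess):

```rust
use wikipedia::http::default::Client;
use wikipedia::Wikipedia;

use db::LinkProcessed; // assumption: the row struct lives in the db module

pub fn get_page_links<'a, F>(client: &'a Wikipedia<Client>, title: &str, save_action: F)
    where F: Fn(LinkProcessed)
{
    let page = client.page_from_title(title.to_string());
    match page.get_links() {
        Ok(links) => {
            for link in links {
                // Wrap each link title in a LinkProcessed and hand it to the caller.
                save_action(LinkProcessed {
                    title: link.title,
                    processed: 0,
                });
            }
        }
        Err(e) => println!("unable to get links for '{}': {}", title, e),
    }
}
```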

The save_action closure is called just like any other function. Using this approach allowed me to get past the issues I was having in a way that makes this function extensible: the behavior changes with the save_action the caller passes in. If I wanted to change from storing the link in a database to writing it out to a file, I wouldn’t have to change a thing inside get_page_links.
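To illustrate why this is extensible, here is a self-contained toy version of the pattern; the names are illustrative, not from the post. The same producer serves any “save” behavior the caller chooses:

```rust
use std::cell::RefCell;

// A stand-in for the post's LinkProcessed struct.
pub struct Link {
    pub title: String,
}

// The producer only generates links; what happens to each one is
// entirely up to the closure the caller passes in.
pub fn for_each_link<F: Fn(Link)>(titles: &[&str], save_action: F) {
    for t in titles {
        save_action(Link { title: t.to_string() });
    }
}

// One possible caller: "save" into an in-memory store. Swapping in a
// closure that writes to a file or database would not touch for_each_link.
pub fn collect_titles(titles: &[&str]) -> Vec<String> {
    let store = RefCell::new(Vec::new());
    for_each_link(titles, |link| store.borrow_mut().push(link.title));
    store.into_inner()
}
```

Passing a closure that prints each title, instead of collect_titles’ pushing closure, changes the behavior without any edit to for_each_link itself.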

The last change is to add code to retrieve the links to be processed, the page data and category data. It follows a similar pattern as the get_page_links; it uses a closure to handle adding the page and category data to the database.

The db::get_links_to_process function looks for the number of unprocessed links indicated by batch_size. Any unprocessed links that are found are passed to get_conspiracies. The conspiracy page data, along with any categories found, is then passed to the closure where the data is saved. Anytime there is a problem saving a conspiracy, an error is written to standard out and the loop continues. The match statement ensures that I only add categories and mark the link as processed when a page was successfully processed.
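Pieced together from that description, the processing step might look like this; get_conspiracies’ exact signature and the add_categories helper are my guesses:

```rust
let conn = db::get_sqlite_connection();
let links = db::get_links_to_process(&conn, batch_size);

wiki::get_conspiracies(&client, links, |page, categories| {
    match db::add_conspiracy(&conn, &page) {
        Ok(_) => {
            // Only record categories and flip the processed flag on success.
            db::add_categories(&conn, &categories);
            db::mark_link_as_processed(&conn, page.title());
        }
        Err(e) => println!("unable to save '{}': {}", page.title(), e),
    }
});
```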

Loading the Database

Now that I’ve re-organized the codebase and added the database functions, I can run and load the database. Since the database has the tables but no link data, I need to run the db_loader with the --get-links flag to start the database loading process.

▶ cargo run --bin db_loader -- --get-links --title Conspiracy_theories
Compiling conspiracies v0.0.1 (file:///<path to code>/conspiracies_api)
Finished dev [unoptimized + debuginfo] target(s) in 5.89 secs
Running `target/debug/db_loader --get-links --title Conspiracy_theories`
Added: Conspiracy_theory 5530
Added: 1980 Camarate air crash 40839521
Added: 1986 Mozambican Tupolev Tu-134 crash 7631057
Added: 2006 O'Hare International Airport UFO sighting 8790044
Added: 2012 phenomenon 21538638
Added: 2013 Lahad Datu standoff 38554050
Added: 9/11 5058690
Added: 9/11 Truth movement 2658444
Added: 9/11 conspiracy theories 1077137
Added: A Culture of Conspiracy 24772931
sleeping for 189 seconds starting at 2018-05-08 20:53:23.299697 -04:00

It is just a matter of time before I have a database full of conspiracy theories! One thing I didn’t mention is that I added a random sleep after every ten pages. I did that after encountering issues when running straight through with no pauses. Once I added the batch size and the random sleep times, I was able to retrieve conspiracies successfully. What am I going to do with a database full of conspiracies? I’m going to create a RESTful API! In my next post, I will share my experiences creating an API using actix-web.

In this post, I went over how to set up your project to support building more than one binary. I also introduced the diesel crate, which is used to interact with databases, and briefly introduced closures and generics. I hope you found this post useful; as always, I’m open to any and all feedback!

Diesel update

Diesel was updated to v1.3.0 while I was writing this post. Changes were made that keep your schema.rs file up to date automatically: each time a migration runs, the schema.rs file is updated. So no more running diesel print-schema >./src/schema.rs manually. The Configuring Diesel CLI documentation has the instructions to get your environment set up to take advantage of this new feature.

Resources

Downloads/Installs

Reading

Crates of Interest

Data Source

Me


Originally published at www.myprogrammingadventure.org.