Turn a Single Spreadsheet into a SQLite Database

Phan Cuong
6 min read · May 26, 2023


Photo credit: Kevin Ku on Pexels (https://www.pexels.com/vi-vn/@kevin-ku-92347/)

Introduction:

A relational database is a way of structuring information in tables, rows, and columns. A relational database can establish links, or relationships, between pieces of information by joining tables, which makes it easy to understand and gain insights about how various data points relate to each other. A spreadsheet, on the other hand, is a file made of rows and columns that helps sort, organize, and arrange data efficiently and calculate numerical data. What a spreadsheet and a relational database have in common are rows and columns, and we can treat each tab of a spreadsheet as a table in a relational database. With that thought in mind, this post shows an example of using Python to transform a spreadsheet into a SQLite database, which we can then query with Structured Query Language (SQL).

Data
The example in this post uses a demo dataset, which can be found in the data folder. This dataset is stored as an Excel workbook with five sheets: users, addresses, books, reviews, and users_books.

Import Python modules
The pandas module reads the dataset from the spreadsheet.
The sqlite3 module communicates with the SQLite database via SQL commands.

import pandas as pd
import sqlite3
# name and path of the SQLite database file (it is created on first connection)
db_name_sqlite = 'test_sqlite.db'
db_path = db_name_sqlite
# path to the spreadsheet file
excel_file = 'data/testData.xlsx'
# read each sheet of the spreadsheet into its own data frame
users = pd.read_excel(excel_file, sheet_name='users', header=0)
addresses = pd.read_excel(excel_file, sheet_name='addresses', header=0)
reviews = pd.read_excel(excel_file, sheet_name='reviews', header=0)
books = pd.read_excel(excel_file, sheet_name='books', header=0)
users_books = pd.read_excel(excel_file, sheet_name='users_books', header=0)

The first argument in read_excel is the name of the file we want to work with. We can also include the path where the file is located if needed. For example, when I ran this code, my data was in my working directory. The second argument, sheet_name=<sheet_name>, tells Python which spreadsheet in the workbook we want to work with. The last argument, header=0, indicates that the first row in the spreadsheet has the column names. Python counts from zero, so we use 0 to refer to the first row. By specifying the header row, Python will use the values in the first row of each spreadsheet as the column names for each data frame.
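If you are not sure what the workbook contains, a small optional check like the sketch below lists the sheet names first (the expected names follow this post's dataset); passing sheet_name=None is pandas' way of loading every sheet at once into a dictionary of data frames.

# confirm the sheet names before reading anything
xlsx = pd.ExcelFile(excel_file)
print(xlsx.sheet_names)  # expect: users, addresses, books, reviews, users_books

# alternatively, sheet_name=None reads every sheet at once into a dict of data frames
all_sheets = pd.read_excel(excel_file, sheet_name=None, header=0)
print(all_sheets.keys())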

Let’s check the first few rows of each data frame to make sure everything is correct.

users.head()
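The same spot check can be run over all five data frames in one loop; a minimal sketch using the data frames defined above:

# print the shape and first rows of each data frame defined above
for name, df in [('users', users), ('addresses', addresses), ('reviews', reviews),
                 ('books', books), ('users_books', users_books)]:
    print(name, df.shape)
    print(df.head(), '\n')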
Next, we define the SQL commands that create each database table, starting with addresses.
table_addresses = """
CREATE TABLE addresses (
user_id INTEGER,
street TEXT NOT NULL,
city TEXT NOT NULL,
state TEXT NOT NULL,
PRIMARY KEY (user_id),
FOREIGN KEY(user_id) REFERENCES users(user_id)
)"""

table_addresses is a variable that contains a command to create a table named addresses. Lines #3–6 create four new columns in addresses: user_id, street, city, and state. The data type of each column is specified to the right of the respective column name. More details on SQLite datatypes can be found here. Note that the order of the columns matches the order of the columns in the associated data frame.

For example, suppose the columns of the addresses SQL table were ordered user_id, city, street, and state. If we then wrote the values of the city column from the data frame into the addresses table, they would end up in the street column instead.

Line #7 establishes the table’s primary key, which is a column that contains values that uniquely identify each row. In the addresses table, user_id satisfies the requirements for a primary key because no two rows have the same user_id value.

Line #8 in the previous set of code establishes the foreign key for addresses. A foreign key is a column in one table that is the primary key in another table. For instance, in tables such as reviews or users_books, different rows can share the same user_id value, which disqualifies user_id from being the primary key there. However, in the users table each row contains a unique user_id value. As such, user_id can serve as the primary key for the users table, and other tables can refer to it as a foreign key.

Let’s return to the actual code in line #8 of the previous CREATE TABLE command. The first part of this code, FOREIGN KEY(user_id), establishes that the user_id column in addresses is a foreign key. The second part, REFERENCES users(user_id), then specifies the table and primary key to which user_id refers. We repeat the same pattern for the other tables.

table_users = """
CREATE TABLE users (
enabled boolean DEFAULT TRUE,
username TEXT NOT NULL,
user_id INTEGER,
PRIMARY KEY (user_id)
)"""
table_reviews = """
CREATE TABLE reviews (
user_id INTEGER,
book_id INTEGER,
id INTEGER,
review_content TEXT,
published_date DATE,
PRIMARY KEY (id),
FOREIGN KEY(user_id) REFERENCES users(user_id),
FOREIGN KEY(book_id) REFERENCES books(book_id)
)"""
table_books = """
CREATE TABLE books (
author TEXT NOT NULL,
title TEXT NOT NULL,
book_id INTEGER,
isbn INTEGER,
published_date DATE,
PRIMARY KEY (book_id)
)"""
table_users_books = """
CREATE TABLE users_books (
book_id INTEGER,
return_date DATE,
user_id INTEGER,
isbn INTEGER,
checkout_date DATE,
FOREIGN KEY(user_id) REFERENCES users(user_id),
FOREIGN KEY(book_id) REFERENCES books(book_id)
)"""

Finishing the Database
The first step is creating the database tables with the lines of code below. Line #1 makes a connection to the database (the file is created if it does not already exist), and the subsequent lines execute the commands one by one, first dropping any existing tables and then creating the new ones.

with sqlite3.connect(db_path) as con:
    # delete each table if it already exists
    con.execute("DROP TABLE IF EXISTS users;")
    con.execute("DROP TABLE IF EXISTS addresses;")
    con.execute("DROP TABLE IF EXISTS reviews;")
    con.execute("DROP TABLE IF EXISTS books;")
    con.execute("DROP TABLE IF EXISTS users_books;")
    # execute the commands to create the database tables
    con.execute(table_users)
    con.execute(table_books)
    con.execute(table_reviews)
    con.execute(table_addresses)
    con.execute(table_users_books)
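To confirm that the tables and keys were created as intended without leaving Python, you can also query SQLite's own metadata; a minimal sketch using the table names from this post:

with sqlite3.connect(db_path) as con:
    # list the tables that now exist in the database
    print(con.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall())
    # columns and declared types of the addresses table
    print(con.execute("PRAGMA table_info(addresses);").fetchall())
    # foreign keys declared on the addresses table
    print(con.execute("PRAGMA foreign_key_list(addresses);").fetchall())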

The diagram below shows the relationships between the tables in the database, including their primary keys and foreign keys.

[Diagram of the database tables and their primary key/foreign key relationships]

Note: This diagram was captured from DbVisualizer. See this link for how to view the relationships in a database.
We have already created five data frames with the relevant data for the database. The next step is to insert the data into the corresponding tables in the database. This can be done easily by using the pandas to_sql function on each data frame (you can find more details on to_sql here). The following code shows how to use to_sql to transfer the data from each data frame to the right table in the database.

with sqlite3.connect(db_path) as con:
    users.to_sql('users', con=con, if_exists='append', index=False)
    addresses.to_sql('addresses', con=con, if_exists='append', index=False)
    reviews.to_sql('reviews', con=con, if_exists='append', index=False)
    books.to_sql('books', con=con, if_exists='append', index=False)
    users_books.to_sql('users_books', con=con, if_exists='append', index=False)

We start by creating a connection to the database at line #1. Then, on each line from #2 to #6, we pass the name of the database table we want to write to as the first parameter and the database connection as the second. We also use if_exists='append' to tell to_sql to add the data frame's rows to the existing table; if we used if_exists='replace' instead, to_sql would drop and recreate each table, overwriting the schema we defined and losing the primary and foreign key relationships. Finally, we use index=False to tell to_sql not to include the data frame index as an extra column in the table.
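One caveat worth noting: SQLite does not actually enforce the foreign keys we declared unless enforcement is switched on for each connection. Below is a small sketch of how that could be checked; the user_id value 999999 is a hypothetical value assumed not to exist in the users table.

with sqlite3.connect(db_path) as con:
    # foreign key checks are off by default in SQLite; enable them per connection
    con.execute("PRAGMA foreign_keys = ON;")
    try:
        # this insert references a user_id that (we assume) is not in the users table
        con.execute("INSERT INTO addresses (user_id, street, city, state) "
                    "VALUES (999999, 'no street', 'no city', 'no state');")
    except sqlite3.IntegrityError as err:
        print(err)  # FOREIGN KEY constraint failed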

Conclusion:
Now we have a SQLite database that is ready to go. You can run SQL commands to query the database. For example, the lines of code below fetch the data of the users table.

with sqlite3.connect(db_path) as con:
    # get the data of the users table from the database
    print(con.execute("SELECT * FROM users").fetchall())
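Because the tables are linked through their keys, we can also join them in a single query. Below is a small sketch that joins users to addresses using the columns defined earlier in this post:

with sqlite3.connect(db_path) as con:
    # join users to addresses on the shared user_id key
    query = """
    SELECT u.username, a.street, a.city, a.state
    FROM users AS u
    JOIN addresses AS a ON a.user_id = u.user_id;
    """
    print(con.execute(query).fetchall())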

All of the code is available here.

Alright, that will do it for this post. Thanks for reading and please reach out if you have any questions and/or constructive feedback.
