Quick Introduction to Using ClickHouse and SQL
How to start ClickHouse in Docker, insert data, and run queries using the command-line client
I am a beginner with ClickHouse and SQL myself, but I wanted to create a short tutorial because ClickHouse is a popular database management system and valuable to know. I’m using the clickhouse/clickhouse-server image. Its page on Docker Hub (https://hub.docker.com/r/clickhouse/clickhouse-server/) details many more ways to use the image, but I only use the most basic:
docker run -d --name clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server
to start the server (with the name “clickhouse-server”), and to enter the command-line client,
docker exec -it clickhouse-server clickhouse-client
If I wanted to mount a volume so that my data would persist after removing the container, I could’ve also done
docker run -d --volume=intro-example:/var/lib/clickhouse --name clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server
with the name “intro-example.” The volume is created if it does not already exist.
Now that I’m in the client, I can play with the database.
Creating Tables
I find learning from examples easier, so I’ll be creating two small tables to run some queries on:
Table 1 (profiles) will store the profile information for a list of accounts, such as an ID number and a name, and table 2 (logTimes) will store the times any account has logged in, along with the corresponding ID number.
CREATE TABLE profiles (ID UInt8, name String)
ENGINE = MergeTree
PRIMARY KEY ID
CREATE TABLE logTimes (time DateTime, ID UInt8)
ENGINE = MergeTree
PRIMARY KEY ID
The syntax:
CREATE TABLE <table name>( <col name> <data type>, … )
ENGINE = MergeTree
PRIMARY KEY (<col name>, …)
I honestly don’t know much about the engines yet, but the primary key determines how the data is organized and stored. As this article (https://medium.com/datadenys/how-clickhouse-primary-key-works-and-how-to-choose-it-4aaf3bf4a8b9) nicely explains, it’s best to choose columns that will be queried often in order to optimize those queries (although with my small example, it doesn’t matter much).
SHOW TABLES can be used to display all current tables, and DROP TABLE <table name> can be used to delete a table.
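For example, a quick sanity check after creating the two tables might look like this (just a sketch; the exact output depends on the client and version):
SHOW TABLES
-- lists logTimes and profiles
SHOW CREATE TABLE profiles
-- prints the full CREATE TABLE statement, including the engine and primary key
-- DROP TABLE profiles would delete the table again (I'm keeping mine)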
Now that I’ve made a table, I can insert data into it.
Inserting Data
I am inserting some accounts and login times into my two tables:
INSERT INTO profiles VALUES (1, 'One'), (2, 'Two'), (3, 'Three')
INSERT INTO logTimes VALUES (now(),1), (now()-216000,1), (now()+216000,2),
('2020-01-01 10:00:00', 3), ('2000-01-01 10:00:00', 3)
The syntax is as follows:
INSERT INTO <table name> VALUES (<data tuple>), …
The data tuple must match the data types and ordering of the columns specified when creating the table. Strings are enclosed in single quotes, and the current DateTime can easily be generated with now(); adding or subtracting a number of seconds, like now()-216000, shifts it forward or backward.
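For example, registering a hypothetical fourth account later (not used anywhere else in this tutorial) would just be two more INSERTs:
INSERT INTO profiles VALUES (4, 'Four')
INSERT INTO logTimes VALUES (now(), 4)
-- a made-up account and login, purely for illustration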
I could continue to insert data at any time, but I will move on to queries.
Simple Queries
Here’s a super-simple query to view all the data currently in a table:
SELECT * FROM logTimes
which outputs:
┌────────────────time─┬─ID─┐
│ 2023-09-08 18:34:52 │ 1 │
│ 2023-09-06 06:34:52 │ 1 │
│ 2023-09-11 06:34:52 │ 2 │
│ 2020-01-01 10:00:00 │ 3 │
│ 2000-01-01 10:00:00 │ 3 │
└─────────────────────┴────┘
The SELECT keyword specifies which columns to display, and * indicates all columns. Alternatively, I could’ve listed out all the columns to achieve the same result (SELECT time, ID). FROM specifies which table to use. If no order is specified, the rows come back in the order they are stored, which for a MergeTree table is the primary key order (here, the ID), though the result order technically isn’t guaranteed without an ORDER BY.
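The same pattern works for pulling out specific columns from any table; for example, this (hypothetical) query lists only the account names from profiles:
SELECT name FROM profiles
-- returns one row per account (the names One, Two, and Three)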
Here’s another example on the logTimes table, but with the data ordered by recent login times:
SELECT * FROM logTimes
ORDER BY time DESC
┌────────────────time─┬─ID─┐
│ 2023-09-11 06:34:52 │ 2 │
│ 2023-09-08 18:34:52 │ 1 │
│ 2023-09-06 06:34:52 │ 1 │
│ 2020-01-01 10:00:00 │ 3 │
│ 2000-01-01 10:00:00 │ 3 │
└─────────────────────┴────┘
Notice that the table is no longer ordered by ID. ORDER BY can take multiple columns to break ties between rows that share the same value in the first column, and it sorts in ascending order by default.
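As a quick sketch of that (my own variation, not shown in the outputs above), this would sort by ID first and then by the newest login within each ID:
SELECT * FROM logTimes
ORDER BY ID ASC, time DESC
-- ascending by ID; ties broken by showing the most recent time first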
More Interesting Queries
I found this website useful when I started learning SQL, and it’s still helpful for review: https://mode.com/sql-tutorial/intro-to-intermediate-sql/. I referred to it while writing the rest of this example.
I want to see how many times each account has logged in since 2010 and what their most recent login time was, with the most recently active accounts shown first. This is a query to do so:
SELECT ID, COUNT(*) AS times_logged_in, MAX(time) AS last_logged
FROM logTimes
WHERE time >= '2010-01-01 00:00:00'
GROUP BY ID
ORDER BY last_logged DESC
┌─ID─┬─times_logged_in─┬─────────last_logged─┐
│ 2 │ 1 │ 2023-09-11 06:34:52 │
│ 1 │ 2 │ 2023-09-08 18:34:52 │
│ 3 │ 1 │ 2020-01-01 10:00:00 │
└────┴─────────────────┴─────────────────────┘
Only account 3 had a login time before 2010, and only that entry is excluded from the count. Otherwise, each login is accounted for in times_logged_in, and the most recent time is shown in the last_logged column. Columns are displayed in the order I specified in the SELECT.
Login times ‘where’ the time is 2010 or later are included in the count, and entries are ‘grouped by’ matching IDs so that I can use aggregate functions in the SELECT clause: COUNT(*) tallies the number of valid logins for each ID, and MAX(time) picks the latest login time for each ID. Along with the ID, these two values are displayed ‘as’ their own columns.
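Other aggregate functions slot in the same way. As a hypothetical variation, this would report each account’s first login instead of its last, with no WHERE filter so every row counts:
SELECT ID, COUNT(*) AS times_logged_in, MIN(time) AS first_logged
FROM logTimes
GROUP BY ID
ORDER BY ID
-- MIN(time) picks the earliest login per ID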
Here’s a final example: I’m going to modify the query to show whether an account is active or inactive (based on the most recent login time), as well as show the account’s name (using the profiles table I created earlier but never used).
SELECT ID, name, COUNT(*) AS times_logged_in, MAX(time) AS last_logged,
CASE WHEN last_logged >= '2023-09-08 18:34:52'
THEN 'active' ELSE 'inactive'
END AS activity
FROM logTimes
JOIN profiles ON profiles.ID = logTimes.ID
WHERE time >= '2010-01-01 00:00:00'
GROUP BY ID, name
ORDER BY last_logged DESC
┌─ID─┬─name──┬─times_logged_in─┬─────────last_logged─┬─activity─┐
│ 2 │ Two │ 1 │ 2023-09-11 06:34:52 │ active │
│ 1 │ One │ 2 │ 2023-09-08 18:34:52 │ active │
│ 3 │ Three │ 1 │ 2020-01-01 10:00:00 │ inactive │
└────┴───────┴─────────────────┴─────────────────────┴──────────┘
I changed 3 parts of the previous query:
- The CASE clause operates like an if-else statement: it decides whether an account is active by comparing the last_logged column to a cutoff time (I hard-coded a recent timestamp), and it returns a string to be displayed in a new column, “activity.” ELSE is optional, but END is required (if I did not include the ELSE, the activity for account 3 would’ve shown null).
- GROUP BY now includes the account name with the ID. Since I wanted to display the name column, I must also group by name in order for the aggregate functions to work. ‘Exception: Unknown column’ is thrown if I do not group by both ID and name. This edit is, however, unnecessary if I don’t add the name column in SELECT, but then what was the point of joining these two tables? Which leads into the final edit…
- An (inner) JOIN matches the ID column of profiles with the IDs of logTimes, allowing me to combine the data of the two tables for each ID and display the name of each account. There are other types of JOINs, but an INNER JOIN only outputs entries that exist in both tables. (For example, if I inserted into logTimes a new login from ID 99, it would not be displayed in an INNER JOIN because it has no corresponding ID in the profiles table, as sketched below.)
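Here is that ID 99 scenario as a rough sketch (a hypothetical test, not run as part of the tables above):
-- a login from an ID that has no matching profile
INSERT INTO logTimes VALUES (now(), 99)
-- the inner join drops it, because 99 does not exist in profiles
SELECT logTimes.ID, name
FROM logTimes
JOIN profiles ON profiles.ID = logTimes.ID
WHERE logTimes.ID = 99
-- returns zero rows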
Conclusion
For their most basic uses, ClickHouse and SQL aren’t too difficult to learn, and I hope I could demonstrate that with this tutorial. Thanks for reading.