A Life with ClickHouse UDFs

Mert Kanber
Insider Engineering
3 min readJun 20, 2022

--

With the release of ClickHouse release v21.10, 2021–10–16, There’s an announcement for a long-waited feature UDF (User defined function). For the ones who does not know what UDF is, it is function defined by a user in a program or environment.

In database, UDFs are used to extend capabilities of the database server usually by SQLs. Most of the time, it is a regular SQL statement but some data warehouse projects like Apache Hive allow users to create UDAF (user-defined aggregate function) and UDTF (user defined table-generating function).

In ClickHouse, UDFs are defined as lambda expression or even executable files. Let’s see how UDF works in ClickHouse

First, let’s define a table that generates random data by simply using GenerateRandom table engine to be used for our functions.

CREATE TABLE _local.udf_rand
(
`id` UInt32,
`a` String,
`b` String
) ENGINE = GenerateRandom(1, 25, 5)

Simply running a SELECT statement, you will see randomly generated ids and string values with the given parameters to the table engine. Now, we are good to go with our function definitions and calculate the uniqueness of a and b columns by following the documentation.

CREATE FUNCTION name AS (parameter0, ...) -> expression

It can take arbitrary number of parameters and name should be unique across cluster or server.

CREATE FUNCTION uniqTotal ON CLUSTER 'default' as (a, b) -> uniq(a) + uniq(b)

As you can see from the definition, we named is as uniqTotal since uniq is a system function. Let’s use our function with the table created beforehand.

SELECT uniqTotal(a, b) AS totalFROM (SELECT * FROM _local.udf_rand LIMIT 100000000)Query id: dee85d40-a31d-418b-80ca-d2e213f1eaf1┌─────total─┐│ 170772458 │└───────────┘

As you can see above, our query did the same process with SELECT uniq(x) + uniq(y) in a more refined way. Lets take a look on another example on how it simplifies a bit more complex queries using UDF function.

CREATE FUNCTION reduce_array_max ON CLUSTER 'default' AS size -> (SELECT arrayReduce('max', grouped_ids)FROM(SELECT arrayMap(x -> x, groupArray(id)) AS grouped_idsFROM (SELECT id FROM _local.udf_rand LIMIT size)))

By calling the function that we defined reduce_array_max with a parameter to be used on SELECT query for our random data generating table we are able to see the result. Note that, since we have subqueries in the function definition, it will return us a subquery as well. You can put an alias on the SELECT query to see in a desired name.

SELECT reduce_array_max(10000000) AS reducedQuery id: adde9baf-2474-4968-a32c-7a94ab06101c┌────reduced─┐│ 4294966617 │└────────────┘1 rows in set. Elapsed: 0.112 sec.

Creating user-defined function has few advantages;

  • Complex code is stored in one structure to be used in everywhere. This reduces error while repeating the same process.
  • Easy to test with the given input parameters.
  • One change on the function reflects on all the places that the function is being used. So you do not have to write the same query over and over again.

You can find more detailed explanation in official ClickHouse documentation

In the end, we have seen that user defined functions are very powerful. You should use them when there are repetitive calculation or processes in your database but of course as with everything else, you should not overuse them.

--

--