Step-by-Step Guide to Migrating Hive Custom Functions to BigQuery SQL UDFs
In my previous post on migrating custom functions from Hive to Google BigQuery, we dissected the landscape of UDFs, UDTFs, and UDAFs, offering a strategic roadmap for tackling this complex task. Building on that foundation, this post zeroes in on one of the most common migration paths: transforming Hive UDFs into BigQuery’s SQL UDFs. We’ll walk through a detailed, step-by-step process that not only translates your Hive functions into BigQuery SQL but also shows how to deploy them with Terraform. Whether you’re a data engineer seeking precision or a CTO aiming for a seamless transition, this guide aims to demystify the nuances and equip you with the know-how to execute this crucial aspect of your data warehouse migration effectively.
Why SQL UDF?
Because it is SQL :) Using SQL enables a wider range of professionals to maintain and understand the function, making the adoption process smoother for teams. For typical query operations, SQL UDFs in BigQuery also tend to be more performant and cheaper than the other UDF alternatives (JavaScript UDFs and Remote Functions), because they are optimized and executed directly within BigQuery. The move to SQL UDFs thus represents not just a migration of code, but an upgrade to more efficient, cost-effective, and scalable data processing practices, positioning your data stack to take full advantage of BigQuery’s powerful analytics capabilities.
The source function: ArrayFilterEmptyOrNull
To demonstrate the migration process, I have chosen a simple function that accepts an array as input, filters out the elements that are null or whose string representation is an empty string, and returns the filtered array. It is quite a simple function, but that is exactly what we are aiming for: functions that are simple enough to implement in SQL.
This is the Hive UDF function body in Java:
@Override
public Object evaluate(DeferredObject[] arguments) throws HiveException {
    List<?> inputList = inpuListObjectInspector.getList(arguments[0].get());
    // If the input array is NULL, return an empty list
    if (inputList == null) {
        return new ArrayList<>();
    }
    // Keep only the elements that are neither null nor empty when converted to a string
    List<Object> returnList = inputList.stream()
            .filter(element -> element != null && !element.toString().isEmpty())
            .collect(Collectors.toList());
    return returnList;
}
The full Java class can be found here.
The BQ SQL UDF
It should not be too hard to implement this logic in BQ SQL on your own, but here is a possible solution:
ARRAY(
  SELECT e
  FROM UNNEST(INPUT_ARRAY) AS e
  WHERE e IS NOT NULL AND CAST(e AS STRING) != ''
)
The only tricky parts we need to consider are that
- we need to wrap the body in an `ARRAY` function to tell BQ that the expression returns an array, and
- since we do not restrict the type of the array elements, we need to cast each element to a string before checking for emptiness.
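Before turning this into a UDF, you can sanity-check the expression with an ad-hoc query against a literal array, for example:
-- Quick sanity check of the filter expression against a literal array;
-- the expected result is ["ALMA", "KORTE"]
SELECT ARRAY(
  SELECT e
  FROM UNNEST(["ALMA", NULL, "", "KORTE"]) AS e
  WHERE e IS NOT NULL AND CAST(e AS STRING) != ''
) AS filtered;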
To deploy the function in a proper manner, we at Aliz always use Terraform. We can use the `google_bigquery_routine` Terraform resource for our UDF. Besides the function body, we need to specify the project and the dataset where we want to deploy the function, the name of the UDF, and the input parameters. Something like this:
resource "google_bigquery_routine" "sql_udf" {
project = var.gcp_project_id
dataset_id = google_bigquery_dataset.udfs.dataset_id
routine_id = "ARRAY_FILTER_EMPTY_OR_NULL"
routine_type = "SCALAR_FUNCTION"
language = "SQL"
definition_body = <<EOT
ARRAY(
SELECT e
FROM UNNEST(INPUT_ARRAY) AS e
WHERE e IS NOT NULL AND CAST(e AS STRING) != '')
EOT
description = "Filters out empty or null values from an array and returns the filtered array."
arguments {
name = "INPUT_ARRAY"
argument_kind = "ANY_TYPE"
}
}
You can find the full Terraform implementation here. By applying the root Terraform module, you should be able to deploy the function in your GCP environment. (If you do not have a GCP environment yet, here are the steps on how to get started with GCP.)
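If you just want to experiment with the function before wiring it into Terraform, you can also create it as a temporary function in a single query session. A minimal sketch, using the same function body (this is not part of the linked Terraform module):
-- Minimal sketch: a session-scoped version of the UDF for quick experimentation,
-- not managed by the Terraform module above
CREATE TEMP FUNCTION ARRAY_FILTER_EMPTY_OR_NULL(INPUT_ARRAY ANY TYPE)
AS (
  ARRAY(
    SELECT e
    FROM UNNEST(INPUT_ARRAY) AS e
    WHERE e IS NOT NULL AND CAST(e AS STRING) != ''
  )
);

SELECT ARRAY_FILTER_EMPTY_OR_NULL(["ALMA", NULL, "", "KORTE"]) AS filtered;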
Once the `terraform apply` is successful and the UDF is deployed, you can easily invoke the function like this:
SELECT `hive_sample_udfs.ARRAY_FILTER_EMPTY_OR_NULL`(["ALMA", NULL, "", "KORTE"])
As you can see, you invoke your custom UDF the same way as you would a native BQ function; the only difference is that you need to specify at least the dataset, and optionally the project, where the UDF is deployed. If you execute the query job in the same project where your function is deployed, you can omit the project ID.
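For example, when calling the UDF from a different project, a fully qualified reference would look something like this (the project ID below is just a placeholder for the project that hosts the UDF dataset):
-- "my-gcp-project" is a placeholder for the project that hosts the UDF dataset
SELECT `my-gcp-project.hive_sample_udfs.ARRAY_FILTER_EMPTY_OR_NULL`(["ALMA", NULL, "", "KORTE"])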
You can find more information about the deployment and usage here.
Things you need to be mindful of
The Hive UDF framework is pretty flexible regarding the input arguments: it allows developers to use a variable number and variable types of arguments in the same UDF. BigQuery SQL UDFs are stricter in this regard: if you want to have different method signatures, you need to define separate SQL UDFs. Also, BigQuery does not allow you to define UDFs with the same name and different parameter types, so you need to use different names for the different signatures, as sketched below.
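For instance, if the original Hive UDF had also accepted an optional second argument, the two signatures would have to become two separately named routines. A hypothetical sketch in DDL form (the two-argument variant and its name are made up for illustration):
-- One routine per signature: BigQuery will not let two routines share a name
-- with different parameter lists
CREATE OR REPLACE FUNCTION `hive_sample_udfs.ARRAY_FILTER_EMPTY_OR_NULL`(INPUT_ARRAY ANY TYPE)
AS (
  ARRAY(SELECT e FROM UNNEST(INPUT_ARRAY) AS e
        WHERE e IS NOT NULL AND CAST(e AS STRING) != '')
);

-- Hypothetical two-argument variant that also drops a caller-supplied value
CREATE OR REPLACE FUNCTION `hive_sample_udfs.ARRAY_FILTER_EMPTY_NULL_OR_VALUE`(INPUT_ARRAY ANY TYPE, UNWANTED ANY TYPE)
AS (
  ARRAY(SELECT e FROM UNNEST(INPUT_ARRAY) AS e
        WHERE e IS NOT NULL
          AND CAST(e AS STRING) != ''
          AND CAST(e AS STRING) != CAST(UNWANTED AS STRING))
);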
Wrapping up
And there you have it — a clear-cut, actionable guide to transitioning your Hive UDFs to BigQuery’s SQL UDFs, with a practical example to get you started. By leveraging the power of SQL within BigQuery and harnessing the efficiency of Terraform for deployment, you’re well on your way to transforming your data processes into more streamlined, cloud-native operations. So, go ahead and apply these steps to your own Hive functions, and watch as they evolve into optimized BigQuery UDFs, ready to tackle the analytical demands of today and tomorrow. Stay tuned for more insights in our series where we’ll tackle more complex functions and dive into advanced migration techniques. Until then, happy migration!