Microservices as Functions in BigQuery — Language Translation using SQL (Part 1)
Learn how to use Translation APIs in your SQL queries.
Motivation
I’ve been experimenting with BigQuery User Defined Functions (UDFs) for a while, and I’ve found them to be powerful — yet limited.
In short, UDFs allow you to define your own custom functions, written in SQL or JavaScript, to process data in BigQuery. This way, you can perform complex data transformations or operations that are not available in the standard BigQuery functions.
The limitations I have noticed are the following:
- They cannot be used to make API calls to external services. This means that if you want to access metadata from an external service, you will need a different approach.
- They can only be written in JavaScript or SQL. This means that if you are more comfortable with another programming language, such as Python, you will not be able to use UDFs.
A workaround to these limitations was released earlier this year and is called Remote Functions.
Google Documentation: "A BigQuery remote function allows you to implement your function in languages other than SQL and JavaScript, or with libraries or services that are not allowed in BigQuery user-defined functions."
In this two-part tutorial, we will build two microservices using Cloud Functions and implement them as SQL functions in BigQuery.
In Part 1, we will build a web app that uses the Azure Translator API to take text input and translate it to the desired language.
In Part 2, we will build a web app that uses MaxMind’s offline database to take an IP address as input and retrieve geolocation information such as the country, city, and postal code.
In both cases, we will be testing the final SQL function in a toy dataset in BigQuery.
Running Costs
By now you should be wondering about the costs of running a web app, as well as using a Translation API. Is it worth the trouble? Yes, it is.
For the web app, we will use Google Cloud Functions, which are more cost-effective than renting a server since you are only charged when they run. We will stay within the free tier, which currently offers 2 million invocations per month. Read more here.
For the translation itself, we will use Azure's Translator API. We will stay within the free tier, which currently offers 2 million translated characters per month. Read more here.
Language Translation Function
The full code can be found here.
0. Prerequisites
To continue we need to make sure we have done the following:
- Enable Google Cloud Functions. Read more here.
- Install and configure the gcloud CLI locally. Read more here.
- Enable Azure Translator API. Read more here.
Moreover, in order to run the code snippets that follow we need to replace the following variables with our own:
- <your-project-id>
- <gcf-conn-name> (step 2)
- <gcf-endpoint> (step 4)
An easy way to follow the tutorial is to copy Readme.md and search-and-replace the values above with your own in an editor.
1. Clone the repository (CLI)
We will start by cloning the repo.
git clone https://github.com/justdataplease/bigquery-translation.git
cd bigquery-translation
The repo directory has the following structure:
.env_sample: Includes the environment variables with some sample values.
Later on, we will rename it from .env_sample to .env and replace the sample values with our own.
translation.py: Includes the Translator class, which is used for translating text from one language to another. The translate_text() method takes in a string of text and a target language and returns the translated text. It also offers the option to specify the original language (from_language) of the text or automatically detect it. For this tutorial, we will automatically detect the original language.
The Translator class uses Azure's Translator Text API to perform the translations.
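The full implementation lives in the repo, but as a rough sketch (not the repo's exact code), a Translator built on Azure's Translator v3.0 REST endpoint could look like the following. It reuses the AZURE_TRANSLATION_KEY and AZURE_TRANSLATION_LOCATION variables from .env_sample and returns the output keys we will see in step 2:

import os

import requests

# Azure Translator v3.0 REST endpoint
AZURE_ENDPOINT = "https://api.cognitive.microsofttranslator.com"


class Translator:
    """Thin wrapper around the Azure Translator API (illustrative sketch)."""

    def __init__(self):
        self.key = os.environ["AZURE_TRANSLATION_KEY"]
        self.location = os.environ["AZURE_TRANSLATION_LOCATION"]

    def translate_text(self, text, to_language, from_language=None):
        params = {"api-version": "3.0", "to": to_language}
        if from_language:
            params["from"] = from_language  # omit to let Azure auto-detect
        headers = {
            "Ocp-Apim-Subscription-Key": self.key,
            "Ocp-Apim-Subscription-Region": self.location,
            "Content-Type": "application/json",
        }
        try:
            rs = requests.post(f"{AZURE_ENDPOINT}/translate", params=params,
                               headers=headers, json=[{"Text": text}])
            rs.raise_for_status()
            body = rs.json()[0]
            detected = body.get("detectedLanguage", {})
            translation = body["translations"][0]
            return {
                "detected_lang": detected.get("language", from_language),
                "detected_conf": detected.get("score"),
                "trans_text": translation["text"],
                "trans_lang": translation["to"],
                "error": "",
            }
        except Exception as e:
            # surface the error per row instead of failing the whole batch
            return {"detected_lang": "", "detected_conf": None, "trans_text": "",
                    "trans_lang": to_language, "error": str(e)}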
main.py: Includes the code for our Cloud Function.
import json

import functions_framework

from translation import Translator


@functions_framework.http
def translate(request):
    """
    Defines the translate Google Cloud Function.

    :param request: HTTP request whose JSON body holds the rows to process,
        e.g. {"calls": [[text, to_language], ...]}
    :return: JSON string of the form {"replies": [...]}, one reply per input row
    """
    request_json = request.get_json()
    calls = request_json['calls']
    replies = []
    trans = Translator()
    for call in calls:
        text = call[0]
        to_language = call[1]
        rs = trans.translate_text(text=text, to_language=to_language)
        # each reply is a STRING (JSON not currently supported)
        replies.append(json.dumps(rs, ensure_ascii=False))
    return json.dumps({'replies': replies})
This function has a special form in order to work with BigQuery. It accepts multiple rows (calls) as input and iterates through each one (call) to perform the translation. For each row (call), it pulls two columns (call[0] and call[1]): the first is treated as the text we want to translate (text = call[0]) and the second as the language we want to translate the text into (to_language = call[1]). Finally, we gather each response into a list named replies and convert this list into a JSON string.
test_cloud_function.py: Includes the code to test our Cloud Function. We will use it after the Cloud Function is deployed.
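For reference, a minimal test script could be as simple as the following (an assumed stand-in for the repo's test_cloud_function.py, using python-dotenv to read CLOUD_FUNCTION_URL from .env):

import os

import requests
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads CLOUD_FUNCTION_URL from .env

# same request shape that BigQuery sends to a remote function
data = {
    "calls": [
        ["Programming is great", "es"],
        ["Support me as a writer", "es"],
    ]
}

rs = requests.post(os.environ["CLOUD_FUNCTION_URL"], json=data)
print(rs.json())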
2. Deploy Cloud Function (CLI)
First of all, to deploy our Cloud Function we need to make sure that:
- we are located at the root of the bigquery-translation directory.
- we have renamed .env_sample to .env and we have changed the sample values of AZURE_TRANSLATION_KEY and AZURE_TRANSLATION_LOCATION with our own (we will specify CLOUD_FUNCTION_URL after Cloud Function is deployed).
Then we should run the following command to deploy our directory as a Cloud Function. To learn more about the parameters used read here.
gcloud functions deploy bigquery-translation --gen2 --runtime python39 --trigger-http --project=<your-project-id> --entry-point=translate --source . --region=europe-west3 --memory=128Mi --max-instances=3 --allow-unauthenticated
--allow-unauthenticated: For simplicity, we added this parameter, which makes our function public. If you reproduce this tutorial, it is better to avoid it.
From the output of the last command, we should note the URI (e.g. https://bigquery-translation-xxxxxx.a.run.app) or visit Google Cloud Console Functions.
Now to test the Cloud Function, we can update CLOUD_FUNCTION_URL in our .env file and run the following code.
python test_cloud_function.py
# Input
data = {
    "calls": [
        ["Programming is great", "es"],
        ["Support me as a writer", "es"]
    ]
}

# Output
{'replies': [
    '{"detected_lang": "en", "detected_conf": 1.0, "trans_text": "la programación es genial", "trans_lang": "es", "error": ""}',
    '{"detected_lang": "en", "detected_conf": 1.0, "trans_text": "apóyame como escritor", "trans_lang": "es", "error": ""}'
]}
From this output, we can see that our Cloud Function is working!
3. Create a connection between BigQuery and Cloud Functions (CLI)
We need to connect BigQuery and Cloud Functions in order to use a Cloud Function as a Remote Function.
gcloud components update
bq mk --connection --display_name='my_gcf_conn' --connection_type=CLOUD_RESOURCE --project_id=<your-project-id> --location=EU gcf-conn
bq show --project_id=<your-project-id> --location=EU --connection gcf-conn
From the output of the last command (bq show), we should note the name (e.g. xxxxxx.eu.gcf-conn) because we will need it later.
4. Create a toy dataset (CLI)
To test our function we will create a toy dataset. This dataset will include our Remote Function and a table with some test data.
bq mk --dataset_id=<your-project-id>:translation --location=EU
5. Create an example table (BigQuery)
To test our Remote Function we will create a table with test data.
CREATE OR REPLACE TABLE `<your-project-id>.translation.example_table` (
text STRING,
to_language STRING
);
INSERT INTO `<your-project-id>.translation.example_table`(text, to_language)
VALUES ('I love programming', 'es'),
('We are learning great things today', 'es'),
('Support me as a writer', 'es'),
('Support me as a writer', 'fr'),
('Support me as a writer', 'de');
6. Create a Remote Function (BigQuery)
Finally, we will create our Remote Function.
CREATE OR REPLACE FUNCTION `<your-project-id>.translation.translate`(text STRING, to_language STRING)
RETURNS STRING
REMOTE WITH CONNECTION `<gcf-conn-name>`
OPTIONS (
-- change this to reflect the Trigger URL of your Cloud Function (look for the TRIGGER tab)
endpoint = '<gcf-endpoint>'
);
7. Test the Remote Function (BigQuery)
To test our Remote Function, we will run it on the test data. The output of the function is a string, so we need to parse it to extract the information into columns.
WITH A AS (
  SELECT
    `<your-project-id>.translation.translate`(text, to_language) trans_rs,
    text origin_text
  FROM `<your-project-id>.translation.example_table`
)
SELECT
  origin_text,
  JSON_VALUE(trans_rs, '$.detected_lang') detected_lang,
  JSON_VALUE(trans_rs, '$.detected_conf') detected_conf,
  JSON_VALUE(trans_rs, '$.trans_text') trans_text,
  JSON_VALUE(trans_rs, '$.trans_lang') trans_lang,
  JSON_VALUE(trans_rs, '$.error') error
FROM A;
We did it! Our function is working.
Keep in mind that, under the hood, this function uses a paid external API; even though we are on the free tier, we should use it wisely.
8. Remove Everything (CLI)
To remove the Cloud Function, the toy dataset (including the Remote Function), and the BigQuery connection, we need to run the following commands:
# Remove Cloud Function (gcf)
gcloud functions delete bigquery-translation --region=europe-west3 --project=<your-project-id> --gen2
# Remove DATASET
bq rm -r -f -d <your-project-id>:translation
# Remove connection between BigQuery and Cloud Functions (gcf-conn)
bq rm --connection --location=EU <gcf-conn-name>
Conclusion
In Part 1 of this series, we demonstrated how to perform language translation using SQL, a task that could not be accomplished with standard BigQuery UDFs. In the upcoming Part 2, we will explore how to look up an IP address and retrieve geolocation data without making API calls, so stay tuned!
You may also find interesting my story on how to use BigQuery Javascript UDFs to perform complex preprocessing tasks, like Stemming.
If you enjoy reading stories like this and want to support me as a writer, please consider following me and smashing that clap button.
Thank you for your support!