Schedule your BigQuery jobs with Play Framework 3 and Akka
Built on Akka, Play provides predictable and minimal resource consumption (CPU, memory, threads) for highly-scalable applications.
BigQuery is a fully managed, AI-ready data analytics platform that helps you maximize value from your data and is designed to be multi-engine, multi-format, and multi-cloud.
This article will show how to implement a BigQuery job with Play Framework and Akka. The job will run every day at midnight.
Before starting, it is recommended to know the basics of Play Framework and Google Cloud BigQuery.
Sequence diagram
In the diagram below, we see that the application will execute a BigQuery query every midnight, which will update the filtered_table table. After the update, a response containing the number of bytes read is returned asynchronously.
So we will need:
- access allowing us to execute BigQuery queries from Java code
- a Java project to run our job
BigQuery Setup
Assuming you have a Google Cloud account:
a) Create a new project that we will call “Big Query Example”
b) Ensure that the BigQuery API is enabled
c) Create a service account to allow Java code to authenticate.
In IAM & Admin >> Service accounts, click Create service account
- Make sure the “Big Query Example” project is selected
- Service account name: Big Query
- Service account ID: big-query
Click Create and Continue
d) Add the BigQuery Admin role and click Done
e) You should now see “No keys” in the Key ID column, like this
f) We will add a key by selecting “Manage keys”
Then “Create new key”, choose “JSON”, and click “Create”
If all goes well, a .json file will be downloaded. Note these fields, as they will be used later: project_id, client_email, and private_key.
g) In BigQuery create a covid_data dataset.
Play Framework project
Before you start, you must install JDK 11 or higher and sbt, and know the basics of Play Framework
a) We will create a new project. For this, we will use the command
sbt new playframework/play-java-seed.g8
and fill in the prompts as in the image below
scala_version and sbt_giter8_scaffold_version may differ depending on the machine.
b) We will use the Alpakka library for Google Bigquery. Alpakka is a Reactive Enterprise Integration library for Java and Scala, based on Reactive Streams and Akka.
In the build.sbt file, adapt libraryDependencies to look like this
libraryDependencies ++= Seq(
guice,
"com.lightbend.akka" %% "akka-stream-alpakka-google-cloud-bigquery" % "8.0.0",
)
This will add the Alpakka connector (built on Akka), which will later help us communicate with BigQuery.
To avoid compatibility issues between Play 3 and the Jackson library, we will also add this
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.14.2"
Finally, to allow sbt to find the Akka repository when resolving dependencies, we will add this
resolvers += "Akka library repository".at("https://repo.akka.io/maven")
Your full build.sbt should look like this
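For reference, a full build.sbt consistent with the steps above could look like the sketch below. The project name, organization, and scalaVersion are placeholders; yours will depend on the answers you gave to the giter8 template.

```scala
name := """big-query-example"""
organization := "com.example"

version := "1.0-SNAPSHOT"

lazy val root = (project in file(".")).enablePlugins(PlayJava)

// May differ on your machine (set by the giter8 template)
scalaVersion := "2.13.12"

libraryDependencies ++= Seq(
  guice,
  "com.lightbend.akka" %% "akka-stream-alpakka-google-cloud-bigquery" % "8.0.0"
)

// Work around Play 3 / Jackson incompatibilities
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.14.2"

// Needed so sbt can resolve the Akka artifacts
resolvers += "Akka library repository".at("https://repo.akka.io/maven")
```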
To download the dependencies and make sure everything is OK, run
sbt compile
c) We will create the app/services folder and add the following classes:
IBigQueryJob.java
BigQueryJobImpl.java
The runJob() function in our example will update the “filtered_table” table with the data read from the “covid19_open_data” table.
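The original snippets were shown as images; below is a sketch of what these two classes could look like. The SQL statement, field names, and the exact Alpakka javadsl calls (BigQuery.query, BigQueryMarshallers.queryResponseUnmarshaller, QueryResponse.getTotalBytesProcessed) are my assumptions based on the Alpakka 8.x documentation, not the article's exact code; check them against your version of the connector.

```java
// app/services/IBigQueryJob.java
package services;

public interface IBigQueryJob {
    void runJob();
}
```

```java
// app/services/BigQueryJobImpl.java
package services;

import akka.actor.ActorSystem;
import akka.stream.alpakka.googlecloud.bigquery.javadsl.BigQuery;
import akka.stream.alpakka.googlecloud.bigquery.javadsl.BigQueryMarshallers;
import akka.stream.alpakka.googlecloud.bigquery.model.QueryResponse;
import akka.stream.javadsl.Sink;
import com.fasterxml.jackson.databind.JsonNode;
import java.util.concurrent.CompletionStage;
import javax.inject.Inject;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BigQueryJobImpl implements IBigQueryJob {

    private static final Logger log = LoggerFactory.getLogger(BigQueryJobImpl.class);

    // Hypothetical query: rebuild filtered_table from the public covid19_open_data dataset.
    private static final String QUERY =
        "CREATE OR REPLACE TABLE covid_data.filtered_table AS "
            + "SELECT date, country_name, new_confirmed "
            + "FROM `bigquery-public-data.covid19_open_data.covid19_open_data` "
            + "WHERE new_confirmed > 0";

    private final ActorSystem system;

    @Inject
    public BigQueryJobImpl(ActorSystem system) {
        this.system = system;
    }

    @Override
    public void runJob() {
        // Run the query; the materialized value carries the query metadata,
        // including the number of bytes read.
        CompletionStage<QueryResponse<JsonNode>> response =
            BigQuery.<JsonNode>query(QUERY, false, false,
                    BigQueryMarshallers.queryResponseUnmarshaller(JsonNode.class))
                .to(Sink.ignore())
                .run(system);

        response.thenAccept(r ->
            log.info("Job done, bytes processed: {}", r.getTotalBytesProcessed()));
    }
}
```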
d) Now that we have our function to execute the job, we will write the piece of code allowing us to execute this function regularly.
We will create the app/schedulers folder and we will add the classes:
- JobSchedulerActor.java, in which we indicate the routine to be executed (every day at midnight in our case). Your code should look like this
- JobScheduler.java, our scheduler, which tells the actor to trigger every day at midnight. Your code will look like this
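Since the originals were images, here is one possible sketch of the two classes. The actor message ("RUN_JOB"), the class members, and the use of Akka's scheduleAtFixedRate are my assumptions, not the article's exact code.

```java
// app/schedulers/JobSchedulerActor.java
package schedulers;

import akka.actor.AbstractActor;
import akka.actor.Props;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import services.IBigQueryJob;

public class JobSchedulerActor extends AbstractActor {

    private static final Logger log = LoggerFactory.getLogger(JobSchedulerActor.class);
    private final IBigQueryJob bigQueryJob;

    public JobSchedulerActor(IBigQueryJob bigQueryJob) {
        this.bigQueryJob = bigQueryJob;
    }

    public static Props props(IBigQueryJob bigQueryJob) {
        return Props.create(JobSchedulerActor.class, () -> new JobSchedulerActor(bigQueryJob));
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .matchEquals("RUN_JOB", msg -> {
                log.info("Tick received, running the BigQuery job");
                bigQueryJob.runJob();
            })
            .build();
    }
}
```

```java
// app/schedulers/JobScheduler.java
package schedulers;

import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import java.time.Duration;
import java.time.LocalDateTime;
import javax.inject.Inject;
import javax.inject.Singleton;
import services.IBigQueryJob;

@Singleton
public class JobScheduler {

    @Inject
    public JobScheduler(ActorSystem system, IBigQueryJob bigQueryJob) {
        ActorRef actor = system.actorOf(JobSchedulerActor.props(bigQueryJob), "jobSchedulerActor");

        // wait = time remaining until the next midnight
        LocalDateTime now = LocalDateTime.now();
        Duration wait = Duration.between(now, now.toLocalDate().plusDays(1).atStartOfDay());

        system.scheduler().scheduleAtFixedRate(
            wait,                // initial delay: first run at the coming midnight
            Duration.ofDays(1),  // then once every 24 hours
            actor,
            "RUN_JOB",
            system.dispatcher(),
            ActorRef.noSender());
    }
}
```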
The wait variable holds the time remaining until the next midnight, so that the first run triggers exactly at midnight.
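As a self-contained illustration of that computation (the class and method names here are mine, for demonstration only), java.time makes it a two-liner:

```java
import java.time.Duration;
import java.time.LocalDateTime;

public class MidnightDelay {

    // Time remaining from 'now' until the next midnight.
    static long millisUntilNextMidnight(LocalDateTime now) {
        LocalDateTime nextMidnight = now.toLocalDate().plusDays(1).atStartOfDay();
        return Duration.between(now, nextMidnight).toMillis();
    }

    public static void main(String[] args) {
        // e.g. at 23:00 there is exactly one hour left
        System.out.println(millisUntilNextMidnight(LocalDateTime.of(2024, 1, 1, 23, 0))); // 3600000
    }
}
```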
e) We need to activate our schedulers.
- create a BigQueryModule.class in the app folder. Your code should look like this
- reference this module in the conf/application.conf in which we will add this
play.modules.enabled += "BigQueryModule"
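A minimal module consistent with the steps above might look like this. It lives directly in the app folder (default package), matching the unqualified "BigQueryModule" name in application.conf; the eager-singleton bindings are my assumption about how the scheduler gets started at boot.

```java
// app/BigQueryModule.java (no package declaration)
import com.google.inject.AbstractModule;

public class BigQueryModule extends AbstractModule {
    @Override
    protected void configure() {
        bind(services.IBigQueryJob.class).to(services.BigQueryJobImpl.class);
        // Eager singleton so the scheduler is created (and armed) at application startup
        bind(schedulers.JobScheduler.class).asEagerSingleton();
    }
}
```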
f) Add the BigQuery credentials. Still in the application.conf file, add this at the end
alpakka.google {
credentials {
service-account {
project-id = "your project-id"
client-email = "your client-email"
private-key = "your private-key"
}
}
}
Replace project-id, client-email, and private-key with the corresponding values (project_id, client_email, and private_key) from the .json file downloaded in the service account section above.
g) Finally we will go to conf/logback.xml and add this in the <configuration> tag
<logger name="schedulers" level="INFO"/>
This will allow us to display the logs of the classes found in the schedulers package.
Your final tree should look like this
Finally, run sbt run, then open http://localhost:9000/.
To test, however, you would have to wait until midnight. So, just for testing, we will modify the code so that the scheduler starts right away and fires at 5-second intervals.
The JobScheduler class becomes
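The only change needed is in the scheduling call itself; a sketch of the testing variant is below (the actor reference and the "RUN_JOB" message are placeholders for whatever your JobScheduler actually uses):

```java
// Testing only: fire immediately, then every 5 seconds instead of daily at midnight.
system.scheduler().scheduleAtFixedRate(
    Duration.ZERO,           // no initial wait
    Duration.ofSeconds(5),   // 5-second interval
    actor,
    "RUN_JOB",
    system.dispatcher(),
    ActorRef.noSender());
```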
Finally, we get the following results
and in BigQuery
Conclusion
In this article, we showed how to schedule BigQuery tasks using Play Framework 3. This approach is ideal for independent tasks. If jobs depend on each other, a better solution would be a data pipeline tool such as Airflow, Beam, Oozie, Azkaban, or Luigi.
The source code of the application is available on GitHub.
Please clap for this article if you enjoyed reading it. For more about Google Cloud, data science, data engineering, AI/ML, and software architecture, follow me on LinkedIn.