Ingestion of data from Firebase Analytics

Maycon Viana Bordin
7 min read · Jul 10, 2018


Firebase is perhaps one of the best tools to track everything users do in your app across both mobile platforms (Android and iOS). You can use Firebase instead of going through each app store and adding up the numbers to get the total downloads per period.

You can check the conversion rates of your marketing campaigns, tracking users from the moment they click on a banner until they reach a certain screen or perform a specific action.

Even if you automate the ingestion of data from app stores, the data formats are different, and the granularity is not the same across platforms.

Firebase is also a very powerful tool for UI/UX improvement, as you can see in detail what users are doing on each screen and how much time it takes them to perform an action.

The Firebase Analytics interface is very useful, but it has some limitations regarding how much you can do with your data. On top of that, Firebase will only store the last 500 events.

Moving your data out of Firebase (to BigQuery)

To overcome those limitations you will have to upgrade to the Blaze plan; that is the only way to get your data out of Firebase (though not out of Google, more on that later).

In order to do that, you will have to enable your app to send data to BigQuery. After that you will see that Firebase created a dataset for your app named analytics_XXXXXXXXX.

Recently Firebase changed the way it sends data to BigQuery: it used to create one dataset per platform, such as io_company_app_ANDROID, but now all your data lives in the same dataset.

Within your dataset you will find a list of historical events_ tables with the date as a suffix in the YYYYMMDD format. You will also see a table with the prefix events_intraday_; this table holds the data from the current day and is updated in real time.

An intraday table is created at midnight GMT and can stay around for more than 24 hours before it gets renamed, so in the meantime you might have two intraday tables. Even after being renamed, the table can still receive events.
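If you automate a daily ingestion, you will typically derive the name of yesterday's consolidated table from the date. Here is a minimal sketch, assuming the YYYYMMDD suffix follows UTC dates, in line with the midnight GMT rollover described above:

import java.time.LocalDate
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// build yesterday's table name, e.g. events_20180709
val suffix  = LocalDate.now(ZoneOffset.UTC).minusDays(1).format(DateTimeFormatter.BASIC_ISO_DATE)
val tableId = s"events_$suffix"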

With your tables in place you can start exploring your dataset with BigQuery. Below is a sample record from a Firebase events table, illustrating its schema:

{
"event_date": "20180703",
"event_timestamp": "1530664885570000",
"event_name": "screen_view",
"event_params": [...],
"event_previous_timestamp": "1530664885510000",
"event_value_in_usd": null,
"event_bundle_sequence_id": "496",
"event_server_timestamp_offset": "-4148908",
"user_id": "00000000000",
"user_pseudo_id": "9f6e6800cfae7749eb6c486619254b9c",
"user_properties": [...],
"user_first_touch_timestamp": "1530219848238000",
"user_ltv": null,
"device": {
"category": "mobile",
"mobile_brand_name": "Samsung",
"mobile_model_name": "Galaxy S6",
"mobile_marketing_name": null,
"mobile_os_hardware_model": "S6",
"operating_system": "ANDROID",
"operating_system_version": "6.0",
"vendor_id": null,
"advertising_id": null,
"language": "pt-br",
"is_limited_ad_tracking": "No",
"time_zone_offset_seconds": "-10800"
},
"geo": {
"continent": "Americas",
"country": "Brazil",
"region": "Sao Paulo",
"city": "Sao Paulo"
},
"app_info": {
"id": "io.company.app",
"version": "1.0.0",
"install_store": null,
"firebase_app_id": "x:xxxx:android:xxxxxx",
"install_source": "com.android.vending"
},
"traffic_source": {
"name": null,
"medium": "organic",
"source": "google-play"
},
"stream_id": "1100000000",
"platform": "ANDROID"
}

A great improvement made to the Firebase dataset recently was breaking single records containing multiple events into multiple records, each with a single event.

Moving your data out of BigQuery

The next step is to move your data out of BigQuery. To do that we need to create an export Job on BigQuery to send our data to Google Cloud Storage (GCS), and from there we can download the dataset.

BigQuery offers three export formats: CSV, JSON and AVRO. I prefer AVRO as it saves space, which reduces the cost of downloading from GCS. With AVRO you can also create an external Hive table over the files to explore them.

First of all you will need the libraries from Google to access BigQuery and GCS. Here are the dependencies for Gradle:

compile group: 'com.google.apis', name: 'google-api-services-storage', version: 'v1-rev120-1.23.0'
compile group: 'com.google.apis', name: 'google-api-services-bigquery', version: 'v2-rev387-1.23.0'

After that you will need to create an authentication key on Google Cloud Platform with permissions to BigQuery and GCS. Download the JSON key and add it to your project.

A few pieces of information are required to write the code that exports the table to GCS. You will also need to create a bucket on GCS to store the exported tables.

val appName    = "my-app"
val projectId  = "myapp-x0000"
val datasetId  = "analytics_XXXXXXXXX"
val tableId    = "events_YYYYMMDD"
val credPath   = "/MyApp-xxxxxxxxxxxx.json"
val outputPath = "gs://myapp-bucket/my_table.avro"

You can find the project, dataset and table IDs in the BigQuery dashboard, under the table's Details tab, in the Table ID row:

+----------+--------------------------------------+
| Table ID | <project_id>:<dataset_id>:<table_id> |
+----------+--------------------------------------+

Next, you will need to create an instance of the BigQuery client:

// imports needed for the client setup
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
import com.google.api.client.http.javanet.NetHttpTransport
import com.google.api.client.json.jackson2.JacksonFactory
import com.google.api.services.bigquery.{Bigquery, BigqueryScopes}
import com.google.api.services.storage.StorageScopes
import scala.collection.JavaConverters._

// default transport and JSON library
val transport: NetHttpTransport = GoogleNetHttpTransport.newTrustedTransport
val jsonFactory: JacksonFactory = JacksonFactory.getDefaultInstance

// set up the GCP credential from the service account key; the read-only storage scope
// is also requested so the same credential can download the exported files from GCS later
val credential: GoogleCredential = GoogleCredential.fromStream(getClass.getResourceAsStream(credPath))
  .createScoped(List(BigqueryScopes.BIGQUERY, StorageScopes.DEVSTORAGE_READ_ONLY).asJava)

// create an instance of the BigQuery client
val bigquery: Bigquery = new Bigquery.Builder(transport, jsonFactory, credential)
  .setApplicationName(appName)
  .setHttpRequestInitializer(credential)
  .build()

With your client created you can start using BigQuery. In our case we are going to create a Job of type Extract, providing information about the table to be extracted and the desired format for extraction. Some configuration properties are specific to some formats (e.g. field delimiter, compression).

import com.google.api.services.bigquery.model.{Job, JobConfiguration, JobConfigurationExtract, TableReference}

val job = new Job
val configuration = new JobConfiguration

// extract configuration: which table to export, where to put it and in which format
val extractConfig = new JobConfigurationExtract()
  .setDestinationFormat("AVRO")
  .setDestinationUris(List(outputPath).asJava)
  .setSourceTable(new TableReference()
    .setProjectId(projectId)
    .setDatasetId(datasetId)
    .setTableId(tableId)
  )

configuration.setExtract(extractConfig)
job.setConfiguration(configuration)

val insert = bigquery.jobs().insert(projectId, job)
insert.setProjectId(projectId)

val jobReference = insert.execute().getJobReference

println(s"Created new job with ID ${jobReference.getJobId}")

After the job is submitted, you will receive a reference to the job so that you can check out its status later. You will use the BigQuery API to retrieve the Job and see its current status.

While in progress the status will be RUNNING; when finished it becomes DONE. At that point you can check whether there is an error result; if there is none, the job has succeeded.

var done = false

while (!done) {
  // poll the job status using the reference returned on submission
  val pollJob = bigquery.jobs().get(jobReference.getProjectId, jobReference.getJobId).execute()

  if (pollJob.getStatus.getState.equals("DONE")) {
    if (pollJob.getStatus.getErrorResult != null) {
      println(s"Failed to execute job ${jobReference.getJobId}: ${pollJob.getStatus.getErrorResult}")
    }

    done = true
  } else {
    println(s"Waiting 2 seconds for Job ${jobReference.getJobId} to finish. Current status: ${pollJob.getStatus.getState}")
    Thread.sleep(2000)
  }
}

println(s"Job ${jobReference.getJobId} is done.")

After the Job has finished successfully, you can download the data from the GCS bucket, in the specified outputPath.

First you need to create an instance of the Storage client, then you can use it to download the object and receive an InputStream. As per our example the get call would be like storage.objects().get("myapp-bucket", "my_table.avro").

import com.google.api.services.storage.Storage

// bucket and object name taken from the outputPath (gs://myapp-bucket/my_table.avro)
val bucketName = "myapp-bucket"
val key        = "my_table.avro"

val storage = new Storage.Builder(transport, jsonFactory, credential)
  .setApplicationName(appName)
  .build()
val getObject = storage.objects().get(bucketName, key)

getObject.getMediaHttpDownloader.setDirectDownloadEnabled(true)
val is = getObject.executeMediaAsInputStream()

If you choose to export as JSON or CSV you can simply convert your input stream into a string:

val data: String = scala.io.Source.fromInputStream(is, "UTF-8").getLines.mkString("\n")

However, if you choose AVRO you will need the library to read the file format.

compile group: 'org.apache.avro', name: 'avro', version: '1.8.2'

In the code below we create a generic Avro reader (you can also provide custom classes with a specific data schema) and add every record in the file to a list. The GenericData.get().toString(record) call converts the Avro record into a JSON string.

import org.apache.avro.file.DataFileStream
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericRecord}
import scala.collection.mutable.ListBuffer

// generic reader: the schema is taken from the Avro file itself
val reader = new GenericDatumReader[GenericRecord]()
val dataFileStream = new DataFileStream[GenericRecord](is, reader)

var record: GenericRecord = null
val items: ListBuffer[String] = new ListBuffer[String]()

while (dataFileStream.hasNext) {
  // reuse the same record instance to avoid unnecessary allocations
  record = dataFileStream.next(record)
  items += GenericData.get().toString(record)
}

dataFileStream.close()

If you are sending the data downstream it is better to emit each record's JSON as it is read instead of appending it to the list buffer.
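Here is a minimal sketch of that variation, where emit is a hypothetical placeholder for whatever sink you use (a Kafka producer, a message queue, a file writer); in this sketch it just prints the JSON:

// 'emit' is a hypothetical sink: replace the println with your downstream producer
def emit(json: String): Unit = println(json)

var record: GenericRecord = null
while (dataFileStream.hasNext) {
  record = dataFileStream.next(record)
  emit(GenericData.get().toString(record))
}

dataFileStream.close()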

Most of the record schema is flat or only one level deep, and can still be mapped as a flat object; this is the case for the user_properties, geo, app_info and traffic_source objects.

The user_properties object contains custom fields that can be added to Firebase events, such as the user identification. The easiest way to retrieve a specific field is with JSON Path:

$.user_properties[?(@.key=="email")].value.string_value
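As a sketch of how that expression could be applied in Scala using the Jayway json-path library directly (assuming data holds a single event record as a JSON string, as produced by the Avro loop above; note that a filter expression always returns a list of matches):

import com.jayway.jsonpath.JsonPath
import net.minidev.json.JSONArray

// filter expressions return an array, even when only one property matches
val matches: JSONArray = JsonPath.read(data,
  "$.user_properties[?(@.key=='email')].value.string_value")

val email: Option[String] =
  if (matches.isEmpty) None else Option(matches.get(0)).map(_.toString)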

We also have the event_params object within the event. This one has custom field names that may change from event to event. In order to flatten this object we can iterate over each custom field and retrieve its value:

def getValue(value: Map[String, AnyRef]): String = {
  if (value.getOrElse("string_value", null) != null) value("string_value").toString
  else if (value.getOrElse("int_value", null) != null) value("int_value").toString
  else if (value.getOrElse("float_value", null) != null) value("float_value").toString
  else if (value.getOrElse("double_value", null) != null) value("double_value").toString
  else null
}

// JsonUtil.toMapList is a helper (not shown) that turns the JSON array into a List[Map[String, AnyRef]]
val params = JsonUtil.toMapList[AnyRef](JsonPath.read[String]("$.event_params", data))

// flatten each parameter into a (name, value) pair
val event = params.map(param => {
  param("key").toString -> getValue(param("value").asInstanceOf[Map[String, AnyRef]])
})
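If you prefer to look values up by parameter name, you can collect the flattened pairs into a map (assuming JsonUtil.toMapList returns a Scala list, so that event is a list of key/value pairs):

// collect the flattened parameters into a Map keyed by parameter name
val eventMap: Map[String, String] = event.toMap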

Conclusion

After finishing all the code and gluing it together, you can move to the parsing step, extracting the fields you need and deciding how you are going to present the data.

You can also decide whether to run the ingestion once per day, or every few minutes on the intraday table. In my experience, the intraday table can change after it has been consolidated (the timestamps can change, to be specific), which can cause duplicate records if you ingest it.

My recommendation would be to ingest the consolidated tables in one place and the intraday table in a different place, used only for showing near real-time data.

In any case, Firebase is a rich source of information and can help answer many questions about the user experience, how customers use the app and what issues might be blocking them.

Having a daily ingestion of Firebase data into your Data Lake will make your data scientists very happy, and it will also help you improve your app experience.
