How to read your entire dataset in Firestore

Published in Firebase Tips & Tricks, Jan 25, 2021

Motivation

Firestore documents are flying in your app: it writes, updates and deletes documents and collections, either through Cloud Functions or directly on the client side with the various client SDKs.

As the app administrator, at some point you may need to look into your dataset, or modify a bunch of documents, for various purposes:

maintenance: for example to clean up some mess left during development, or to retrieve some piece of data for debugging
development: for example to reorganize your collections, or to add a field to some documents to support a feature before deploying the code

There are several ways to access your data:

read and write data from your app with the client SDKs
navigate your dataset in the Firebase console
export your dataset to Google Cloud Storage
explore your dataset with the Admin SDK through a script or Cloud Function

In this article we will discuss the last option: we will write a script leveraging the Admin SDK to act on all documents. It is nice to feel that our data in Firestore is at our fingertips: a script lets us filter data and act on documents however we need (because we are in our own script!). As an aside, it also allows visiting sub-collections of non-existing documents, a topic which frequently raises questions on Stack Overflow.

Please note that exploring your dataset is subject to Firestore billing and will cost you 1 read per document.

Data structure

In Firestore, root collections contain documents, which can contain collections (sometimes called sub-collections), which contain documents, and so on. This recursive structure forms a tree: documents and collections are the nodes, and each node has a single parent and zero or more children.

In order to read the entire data set, we need to implement a tree traversal. It cannot be achieved with the client SDKs, and this is on purpose: this is not a nominal use of Firestore, most probably because reading every document does not scale (each visited document is billed as a read, as noted above). It should not be part of a normal app flow, and should be reserved for ad-hoc usage.

In this article we will inspect a script that can be run locally to target a remote Firebase project or the Emulators. This code can also be wrapped in a Cloud Function which could be available from an admin dashboard.

Let’s get coding!

The whole code can be found in this gist. It consists of three modules: admin.js which sets up the Admin SDK, traverse.js which implements the traversal, and myscript.js which uses the traversal to display the entire data set. Let’s look into the implementation of the tree traversal. The module traverse.js exposes two functions:

traverseFromRoot, which visits the entire data set,
traverseFromPath, which visits all collections and documents under the specified path.

Here is the code, explanations follow:

Explanations

Both functions take two user-defined callbacks as parameters: onDocument and onCollection. These callbacks are executed on each document or collection accordingly. They can inspect the data and stop the traversal of the current branch as needed.

The function traverseFromRoot visits all root collections returned by listCollections.

The function traverseFromPath takes a path as input. This is a string of the form: root-collection/document/collection/… An odd number of levels points to a collection, an even number points to a document. Armed with this knowledge we can obtain a DocumentReference or CollectionReference, then visit it.
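The parity rule above can be sketched as follows. This is a minimal illustration, not the code from the gist: the helper name refFromPath is hypothetical, and db is assumed to be a Firestore instance obtained from the Admin SDK (admin.firestore()), whose collection() and doc() methods both accept slash-separated paths.

```javascript
// Hypothetical helper (not from the gist): turn a slash-separated path
// into a Firestore reference.
// `db` is assumed to be a Firestore instance, e.g. admin.firestore().
function refFromPath(db, path) {
  // Drop empty segments so leading or trailing slashes do not skew the count.
  const segments = path.split('/').filter((segment) => segment.length > 0);
  if (segments.length === 0) {
    throw new Error('Expected a non-empty path');
  }
  const cleanPath = segments.join('/');
  // Odd number of segments: collection. Even number of segments: document.
  return segments.length % 2 === 1
    ? db.collection(cleanPath)
    : db.doc(cleanPath);
}
```

For instance, refFromPath(db, 'chatrooms/firebase/messages') would yield a CollectionReference, while refFromPath(db, 'chatrooms/firebase') would yield a DocumentReference.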

The function visit is the most interesting. As you have noticed, this function is recursive: it calls itself to visit the children of the current node. Let’s detail how this function works:

It first determines whether the node is a document or a collection by testing the presence of listCollections, which is only found in documents.
It executes the appropriate user callback, passing the current node (document or collection) as argument.
If the user callback returns false, the function returns: the traversal does not go deeper in this branch.
Otherwise, the traversal continues deeper: we list the children of the current node with listDocuments or listCollections, and visit each child sequentially.
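The steps above can be sketched as follows. This is a reconstruction from the description, not the code from the gist. Two details are assumptions: db is passed in as a parameter (assumed to be admin.firestore()) rather than imported from a module, and only a strict false return value prunes a branch, so callbacks that return nothing let the traversal continue.

```javascript
// Sketch of the recursive visit described above (pre-order traversal).
// onDocument / onCollection are user callbacks; they may be async.
async function visit(node, onDocument, onCollection) {
  // Among references, only DocumentReference exposes listCollections;
  // CollectionReference exposes listDocuments instead.
  const isDocument = typeof node.listCollections === 'function';
  const callback = isDocument ? onDocument : onCollection;

  // Run the user callback before descending.
  // Assumption: only a strict `false` stops the traversal of this branch.
  const goDeeper = await callback(node);
  if (goDeeper === false) return;

  // List the children and visit them one at a time, in order.
  const children = isDocument
    ? await node.listCollections()
    : await node.listDocuments();
  for (const child of children) {
    await visit(child, onDocument, onCollection);
  }
}

// Visit every root collection, sequentially.
// `db` is assumed to be a Firestore instance, e.g. admin.firestore().
async function traverseFromRoot(db, onDocument, onCollection) {
  const rootCollections = await db.listCollections();
  for (const collection of rootCollections) {
    await visit(collection, onDocument, onCollection);
  }
}
```

With the Admin SDK initialized, a call might look like traverseFromRoot(admin.firestore(), onDocument, onCollection). Because each child is awaited before the next one starts, the visiting order is deterministic.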

Caution!

This code loads entire document lists in memory at once, so it can exceed available memory. See “Bonus: the stream version” for code that reads documents one by one to prevent memory exhaustion.

Results

I ran this script on my chat application to print every document (see the gist for details). Here is the output:

chatrooms
   firebase: {"subject":"Firebase"}
      messages
         m1: {"author":"user1","content":"Welcome everyone!"}
         m2: {"author":"user2","content":"hello user1!"}
   flash: "not-existing"
      messages
         m1: {"author":"user3","content":"Welcome to Flashboard"}
   react: {"subject":"React.js"}
      messages
         m1: {"author":"user3","content":"Welcome to the chatroom!"}
         m2: {"author":"user4","content":"Let's start coding!"}
users
   user1: {"name":"Alice"}
   user2: {"name":"Bob"}
   user3: {"name":"Carol"}

We can see my two root collections: chatrooms and users. The output shows the three chatrooms documents, even though chatrooms/flash does not exist (more on this in another article). The existing chatrooms documents have a subject in their data, and each chatroom has a sub-collection, messages, which contains all its message documents.
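Callbacks producing this kind of output could look like the sketch below. It is an illustration rather than the gist's myscript.js: the names indentFor, formatCollection and formatDocument are hypothetical, and the three-space indentation and the "not-existing" marker are inferred from the output shown above.

```javascript
// Hypothetical callbacks reproducing the layout of the output above.
const INDENT = '   '; // three spaces per level, as in the printed output

// Depth is derived from the path: a root collection has 1 segment,
// its documents have 2, their sub-collections have 3, and so on.
function indentFor(ref) {
  const depth = ref.path.split('/').length - 1;
  return INDENT.repeat(depth);
}

function formatCollection(collection) {
  return indentFor(collection) + collection.id;
}

async function formatDocument(doc) {
  // listDocuments returns references only, so the data must be fetched here.
  const snapshot = await doc.get();
  // A document that only exists as a parent of sub-collections has no data.
  const data = snapshot.exists ? snapshot.data() : 'not-existing';
  return `${indentFor(doc)}${doc.id}: ${JSON.stringify(data)}`;
}

// Callbacks to hand to the traversal; they return nothing,
// so no branch is skipped.
const onCollection = (collection) => {
  console.log(formatCollection(collection));
};
const onDocument = async (doc) => {
  console.log(await formatDocument(doc));
};
```

Keeping the formatting separate from console.log makes the callbacks easy to check against fake references before running them on a real project.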

Additional notes

For simplicity, we used listDocuments to list the children of a collection. To increase scalability, it could be replaced with get, which retrieves the data of each document directly instead of having to fetch it in onDocument. However, get overlooks non-existing documents. It could also be replaced with stream for scalability (see the code at the end).
Limitation: listCollections and listDocuments return an array containing all references. This data is not streamed so it could exceed your available memory if you have many documents with long ids.
Pricing: listDocuments costs 1 read per document. According to this SO answer, we can reasonably expect that listCollections costs 1 read per collection. So overall the traversal itself costs 1 read per document + 1 read per collection in your data set, plus the reads done in onDocument and onCollection at the user's discretion.
Our implementation visits documents and collections sequentially, because we favored determinism of the ordering and memory protection (only one document read at a time) over speed for this maintenance script. It can be parallelized, see the code at the end.
Firestore documents can contain references to other documents or collections, making the structure a graph instead of a tree (nodes can have multiple parents). Since our traversal does not follow these references, the structure it explores remains a tree.
We have implemented a pre-order traversal; feel free to expand it to a post-order traversal if needed. In-order does not apply because our tree is not binary.
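As an illustration of the last note, a post-order variant could look like the sketch below. The function name visitPostOrder is hypothetical, and the sketch assumes the same convention as the rest of the article: a node is a document if it exposes listCollections. Since the callback runs after the children, returning false can no longer prune a branch.

```javascript
// Hypothetical post-order variant: children first, then the node itself.
// This ordering can suit operations that must process descendants
// before their parent.
async function visitPostOrder(node, onDocument, onCollection) {
  // Among references, only DocumentReference exposes listCollections.
  const isDocument = typeof node.listCollections === 'function';

  const children = isDocument
    ? await node.listCollections()
    : await node.listDocuments();
  for (const child of children) {
    await visitPostOrder(child, onDocument, onCollection);
  }

  // The callback runs once all descendants have been visited.
  // Its return value is ignored: there is nothing left to skip.
  if (isDocument) {
    await onDocument(node);
  } else {
    await onCollection(node);
  }
}
```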

Bonus: the parallel version