How Does Google Sheets work?

Abhinav Sharma
Google Cloud - Community
8 min read · Apr 17, 2021

We have all used Google Sheets, but rarely thought about how it actually works so effectively, collaborative editing included. Designing Sheets is not as simple as designing a Google Doc: the numerous advanced functionalities that Sheets provides make it far more complex. So this blog focuses on the high-level design of Google Sheets and is written for readers at all levels of experience.

Questions to ask yourself before you dive into the design:

Question 1: If you had to recreate something like Google Sheets from scratch, which database would you prefer: relational or non-relational? Which of these will be more effective?

Answer: If you are thinking of an app at large scale, then yes, you should go for NoSQL, as storing the row/column data and performing continuous CRUD operations scales far better in NoSQL than in an RDBMS.

Question 2: Which sheet operations do we need to focus on for our design?

Answer: In our case we are paying attention to the basic CRUD operations on a sheet, including collaborative editing.

Collaborative editing, in layman's terms, is just multiple users contributing to and sharing the same resource at the same time. But it is not as easy to implement as it sounds, because of the numerous factors we need to consider when designing it for a large number of users.

Question 3: How many requests are we currently serving?

Answer: We are currently focusing on the DB design and the basic functionality first, and will scale accordingly afterwards.

There are more under-the-hood questions that should arise in your mind:

  1. How will we implement Collaborative editing functionality keeping in mind the consistency issues?
  2. Will sheets allow an auto-save kind of feature?

Having answered those questions, let's get our hands dirty by digging into the database design.

Database Designing

Let's now dive into the design and talk about the DB design and usage first.

In Google Sheets we are dealing with rows and columns extensively, which makes the database much harder to design than one for Google Docs.

Here is a sample schema that could easily be one of a developer's first thoughts when sketching a relational design.

This is a design we can easily land on while building our first few solutions. But is it efficient under so many continuous updates? Is it good for fetching data? The answer is a BIG NO.

Introducing a separate table for the column values of the datasheet puts a huge performance overhead on table joins. Imagine we hit the maximum number of rows in the datasheet, which is hypothetically 10³, and each row has, let's say, 100 columns. To fetch the whole datasheet we will have to read:

10³ * 10² = 1000 * 100 = 10⁵

And let's say this sheet is being accessed by 100 different users; every time they query the server to get the sheet, the load becomes:

10⁵ * 10² = 10⁷
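These back-of-the-envelope numbers are easy to verify (the row, column, and user counts are the hypothetical values from above, not real limits):

```python
# Cost of fetching a full sheet from the join-based relational design.
rows = 10**3           # hypothetical maximum number of rows
cols = 10**2           # columns per row
users = 10**2          # concurrent readers

cells_per_fetch = rows * cols           # records read per full-sheet fetch
total_reads = cells_per_fetch * users   # records read if every user fetches

print(cells_per_fetch)  # 100000 (10^5)
print(total_reads)      # 10000000 (10^7)
```

Ten million record reads just to serve one sheet to its readers, before any update traffic.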

Now someone will say that we should fetch it from cache rather than hitting the database directly. The answer to that is that a cache has limited memory, and storing whole-table data for multiple sheets in it would eventually be a bad idea.

Someone else will say: I will come up with a more dynamic database design, creating rows and columns only when needed, as follows.

A new cell record is created only if it doesn't already exist. But let's again do the math with the values above.

10³(rows) * 10²(columns) = 10⁵ cells in total.

So when a request comes in to update a whole row, we need to make 100 update queries. And if we insert a row in between, we need to update all the rows below it with new row numbers. This puts a very heavy load on the DB, and it is not scalable in terms of fetching data and paginating responses.
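The update amplification is easy to see in a sketch that models the sequential-row-number design in memory (an illustrative toy, not a real database):

```python
# Sketch: inserting a row when rows are keyed by a sequential rowNo.
# Every row at or below the insertion point must be rewritten.
def insert_row_sequential(rows, position, new_row):
    """rows: list of dicts with a 'rowNo' key. Returns the number of
    existing records that had to be updated (one UPDATE query each)."""
    updates = 0
    for r in rows:
        if r["rowNo"] >= position:
            r["rowNo"] += 1      # each of these is one UPDATE query
            updates += 1
    rows.append({"rowNo": position, **new_row})
    return updates

sheet = [{"rowNo": i, "data": {}} for i in range(1, 1001)]  # 1000 rows
print(insert_row_sequential(sheet, 2, {"data": {}}))  # 999 rows shifted
```

Inserting one row near the top of a 1000-row sheet forces 999 record updates, which is exactly the write load we want to avoid.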

Now let's look at the NoSQL design that can be more effective than the above ones.

Here, sheetID identifies the sheet that contains the row, rowNo is the sequential number of the row, and rowJSON holds the JSON for the row's data. The row data looks something like this:
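As a text version of that idea, here is a hypothetical shape for one record in the rows table (the field names, column keys, and values are made up for illustration, not Google's actual schema):

```python
import json

# One record in the NoSQL rows table: the whole row's cells live in a
# single JSON blob instead of a joined cells table.
row_record = {
    "sheetID": "sheet-42",
    "rowNo": 7,
    "rowJSON": json.dumps({
        "A": {"value": "Revenue"},
        "B": {"value": 1200},
        "C": {"value": "=B7*0.1"},
    }),
}

print(json.loads(row_record["rowJSON"])["B"]["value"])  # 1200
```

One read now returns an entire row, with no join against a cells table.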

The approach of storing the datasheet's column values as JSON within the row record itself saves that hassle when it comes to updating a row's data.

Now the above data model has two major flaws. If you haven't spotted them yet, pause for a minute and then come back.

  1. What if we add a row to the datasheet? Our rows table needs an updated rowNo for every subsequent row. This is a huge DB-write issue: quite a few records get updated as a result of adding just one row in between.
  2. What if we add a column to the datasheet? All our rows need to be updated to include that extra blank datasheet column in the raw JSON of each row.

Let's think of it in terms of data structures and algorithms, and ask ourselves which data structure would be the most efficient in this case.

Problem: Shift the existing rows ahead and add a new record to the sheet.

Solution: Which data structure performs best at shifting records?

What if we use a LinkedList? Ohh man, yes, you are thinking in the right direction. We have unique rowIDs; let's take them as LinkedList nodes, and once we know which row follows which, we know the whole sequence.

This means we only need to change the next pointer and we are done in a single update request, and we no longer have to shift the whole data one by one to the left/ right.

So in a way, my first data model turned out to be a pretty bad design. I fixed it by replacing rowNo with rowID in the rows table, and introducing a new column, prevRowID, which contains the ID of the previous row. Now, when a new row gets added to the datasheet, we only create one new record in the rows table and update one existing record to change its previous-row pointer. A simplified version of the updated rows table looks something like the one below. This approach is not without flaws either, but for brevity let us proceed with it.
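A minimal sketch of that prevRowID bookkeeping, using an in-memory dict in place of the rows table (IDs and field names are illustrative):

```python
# Each row points at the row before it, so inserting a row touches
# exactly two records: one INSERT and one UPDATE.
def insert_row(rows, prev_row_id, new_row_id):
    """rows: dict rowID -> {'prevRowID': ...}. Returns rows updated."""
    updated = 0
    # The row that used to follow prev_row_id now follows the new row.
    for rec in rows.values():
        if rec["prevRowID"] == prev_row_id:
            rec["prevRowID"] = new_row_id   # the single UPDATE
            updated += 1
            break
    rows[new_row_id] = {"prevRowID": prev_row_id}  # the single INSERT
    return updated

rows = {"r1": {"prevRowID": None}, "r2": {"prevRowID": "r1"}}
print(insert_row(rows, "r1", "r9"))  # 1 - only one existing row updated
print(rows["r2"]["prevRowID"])       # r9 - r2 now follows the new row
```

Compare this with the sequential-rowNo design, where the same insert rewrote every subsequent row.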

A similar problem exists with the raw JSON. Along the same lines, we can solve it by storing a prevColumnID in each object in the raw JSON. Something like this:
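A hypothetical version of that raw JSON, with a helper that rebuilds the left-to-right column order by walking the prevColumnID links (column IDs and values here are made up):

```python
# Each cell is keyed by a columnID and carries the ID of the column
# before it, so inserting a column never rewrites the other cells.
raw = {
    "c3": {"prevColumnID": "c1", "value": 42},
    "c1": {"prevColumnID": None, "value": "name"},
    "c7": {"prevColumnID": "c3", "value": True},
}

def column_order(raw):
    """Rebuild left-to-right column order by following prevColumnID."""
    nxt = {cell["prevColumnID"]: cid for cid, cell in raw.items()}
    order, cur = [], nxt.get(None)  # the leftmost column has no previous
    while cur is not None:
        order.append(cur)
        cur = nxt.get(cur)
    return order

print(column_order(raw))  # ['c1', 'c3', 'c7']
```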

So we have solved the problems of adding a new row and a new column. We will also need another table in the database for datasheet column metadata.

Now that we have our database ready, let's move on to the next challenge.

Collaborative Editing

Building mutually editable documents is not an easy task. There are several issues that we face.

  1. Concurrency: multiple computations happening at the same time. It is everywhere in modern programming, whether we like it or not: multiple computers in a network, multiple applications running on one computer, multiple processor cores on a single chip.
  2. Latency: since we access the shared document over the internet, network delays can reorder updates and aggravate the concurrency problems above.

What comes to mind first when we hear concurrency? Locks.

But can they really save us in this case? NO, NEVER.

With multiple users concurrently editing the same document, locks would force a turn-based architecture, which is neither efficient nor easy to implement. So locks are not always the best choice for concurrency issues. Let's keep that in mind and not jump straight to them whenever we hear the word concurrency :P

We need a lock-free architecture. Umm, what are the ways we could do this?

Sync on the basis of row updates?

Sync on the basis of cell updates?

Basically, there are 3 major ways to do this:

  1. Diff Sync - just like a git diff.
  2. Event Sync - sync whenever anything gets updated by the user.
  3. Row Sync - sync whenever a row is updated, after the update is done.

What to use? What do Google and other large competitors use?

Yes, you guessed correctly: it's the 2nd one, Event-Based Sync. In the technical world, it's called OT (Operational Transformation).

Operational Transformation (OT) is a technology for supporting a range of collaboration functionalities in advanced collaborative software systems. OT was originally invented for consistency maintenance and concurrency control in the collaborative editing of plain-text documents. Its applications have since expanded to include group undo, locking, conflict resolution, operation notification and compression, and application sharing.
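The core OT idea can be sketched in a few lines: each site applies its own operation immediately and transforms incoming remote operations against it so that all replicas converge. This is a minimal, assumed sketch for insert-only plain text; real OT also handles deletes and needs a tie-break (e.g. by site ID) when two users insert at the same position:

```python
# Minimal OT sketch for concurrent text inserts. An operation is
# (position, text_to_insert).
def transform(op, against):
    """Shift op's position if 'against' inserted at or before it."""
    pos, text = op
    a_pos, a_text = against
    if a_pos <= pos:
        pos += len(a_text)
    return (pos, text)

def apply(doc, op):
    pos, text = op
    return doc[:pos] + text + doc[pos:]

doc = "sheet"
op_a = (0, "G")   # user A inserts "G" at the front
op_b = (5, "s")   # user B concurrently appends "s"

# Each site applies its own op first, then the transformed remote op.
site_a = apply(apply(doc, op_a), transform(op_b, op_a))
site_b = apply(apply(doc, op_b), transform(op_a, op_b))
print(site_a, site_b)  # Gsheets Gsheets - both replicas converge
```

Without the transform step, site B would apply A's raw insert at a stale position and the replicas would diverge.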

I will write a separate blog explaining Operational Transformation in depth.

High-Level Design

Let's now talk about how my design actually looks at a high level.

So at a very high level, I am suggesting something like the diagram below. It is very simplified, with fewer functionalities, and does not include the flow of the Operational Transformation design.

Satellite view of the architecture

Here we are using a distributed-systems approach with horizontal scaling, for availability and efficiency of operations.

Further challenge: we never know which server will pick up which update command from the browser. One solution is to use a data-streaming technology (e.g. Kafka) and key the messages so that commands for the same datasheet go to the same data-processor server. We can then scale up the Kafka clusters as needed.
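The routing idea boils down to keyed partitioning: hash the sheet ID to pick a partition, so every command for one datasheet lands on the same consumer. The partition count and hash choice below are illustrative, not what Google or Kafka's default partitioner actually uses:

```python
import hashlib

# Route every update command for a given sheet to the same partition,
# and therefore to the same data-processor server consuming it.
def partition_for(sheet_id, num_partitions=8):
    digest = hashlib.md5(sheet_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# The mapping is deterministic: the same sheet always hits the same
# partition, while different sheets spread across the cluster.
print(partition_for("sheet-42") == partition_for("sheet-42"))  # True
```

This keeps all edits to one sheet ordered on a single consumer, while letting us add partitions (and processors) to scale across sheets.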

There is a lot to system design, and every part of it is always questionable.

So it's always a good approach to discuss with multiple minds and get the best solution you can, which will again be questionable to others.

THIS IS A PART OF GOOGLE SHEETS DESIGN. There is much, much more to it.

Cheers !! :)
