System Design — Google Sheets

A mini version of Google Sheets for a small-to-medium enterprise

Manpreet Singh
7 min read · Sep 25, 2020

A dear friend and ex-colleague reached out to me with a question. Below is a snippet of the conversation with him:

Friend: If I had to recreate something like Google Sheets from scratch, how would I handle the backend DB? Simple Relational Database? I can’t find any design pattern discussions online

Me: Would it be an internet-scale app?

Friend: It is an internal enterprise app

Me: Are you expecting a huge no. of rows in the sheet?

Friend: Less than 1000

Me: Then the relational database will work. But would your sheets feature allow multiple users to edit at the same time?

Friend: Nopes

Me: Then it is quite simple

The conversation seemed to have ended here. But a couple of questions started coming to my mind:

  1. Why are off-the-shelf datasheet products not being used and integrated into the solution?
  2. Will the custom sheet allow an auto-save kind of feature?

Now, my friend is strong on the business side and the solution architecture, which made me hold back on my first question. There might have been compliance reasons, licensing reasons, or something else. But I did bring up the second question in a follow-up conversation:

Me: Would your custom sheet software have an auto-save feature?

Friend: Yes

Me: Then the synchronization problem needs to be taken care of

Friend: As in?

Me: Let us say a user edited the value of a column to ‘value2’ and then to ‘value3’. The final value that goes into the database should be ‘value3’. But with distributed systems, you can never be sure which request reaches the server or the database first, ‘value2’ or ‘value3’. Hence you might need to make sure that you sequence the requests going from the browser.

Friend: How do I solve this?

Me: You could sequence based on a client timestamp passed in the update command to the server. But even that could lead to problems sometimes because of automatic time synchronization on the clients. It could be that there was an error in the time settings on the client and it got fixed by time synchronization, and if the correction moves the clock backwards, later edits end up carrying earlier timestamps. So in a way, the timestamp is also not the right way.

Friend: How do we go ahead with this?

Me: I think you should have a sequence no. on the commands from the client end. If the server receives something out of sequence, it should hold the commands in memory for some time. And if the server isn’t able to reconcile things within a few seconds, the client could throw an error, or replay the commands from the point where it hasn’t received a success.

So at a very high level, I was suggesting something like the below. This is an oversimplified version, just meant to convey the concept.

[Figure: Satellite view of the architecture]
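To make the sequencing idea from the conversation a bit more concrete, here is a minimal Python sketch of the server-side piece, under my own assumptions: each client stamps its commands with an incrementing sequence number, and the server buffers anything that arrives out of order for a short while before asking for a replay. The names (CommandSequencer, apply_fn, hold_seconds) are made up for illustration, not part of any real design.

```python
import time

class CommandSequencer:
    """Applies one client's commands strictly in sequence-number order."""

    def __init__(self, hold_seconds=5):
        self.expected_seq = 1          # next sequence number we can apply
        self.pending = {}              # out-of-order commands held in memory
        self.hold_seconds = hold_seconds
        self.first_gap_seen_at = None  # when we first noticed a missing command

    def receive(self, seq, command, apply_fn):
        """Returns 'ok', or 'replay_needed' if the client should resend from expected_seq."""
        if seq < self.expected_seq:
            return "ok"                # duplicate of something already applied
        self.pending[seq] = command

        # Drain everything that is now in order.
        while self.expected_seq in self.pending:
            apply_fn(self.pending.pop(self.expected_seq))
            self.expected_seq += 1

        if not self.pending:           # no gap left
            self.first_gap_seen_at = None
            return "ok"

        # A command is still missing: hold on for a while, then ask for a replay.
        if self.first_gap_seen_at is None:
            self.first_gap_seen_at = time.monotonic()
        if time.monotonic() - self.first_gap_seen_at > self.hold_seconds:
            return "replay_needed"
        return "ok"

sequencer = CommandSequencer()
sequencer.receive(1, {"cell": "B3", "value": "value2"}, print)  # applied immediately
sequencer.receive(3, {"cell": "B3", "value": "value3"}, print)  # held back: no. 2 is missing
```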

The challenge here is that there has to be more than one server to take care of availability. We will need at least two servers, each in a different availability zone, and we never know which update command from the browser will be picked up by which server. To take care of this, one option is to implement session stickiness, where requests from one user always go to the same server. This isn’t a great way of handling the problem, since the load on the servers could be skewed depending on the usage pattern. Another option is to use a data streaming technology (e.g. Kafka) that routes commands of the same datasheet to the same command processor server. Then we can scale up the command processor servers without worrying about commands from the same datasheet landing on different processors. There could be a bit of load skew even in this case, because the stickiness is at the level of the datasheet instead of the user session, but it will be significantly less than the skew with session stickiness.
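As a rough illustration of the Kafka option (the topic name, message shape, and the kafka-python client are all my assumptions, not a prescribed setup): keying every update command by its sheetID makes Kafka hash all commands of the same sheet to the same partition, so a single consumer in the processor group sees that sheet’s commands in order.

```python
import json
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_update(sheet_id, client_seq, cell_ref, new_value):
    """Send one cell-update command, partitioned by sheet."""
    command = {
        "sheetId": sheet_id,
        "clientSeq": client_seq,   # the per-client sequence number discussed above
        "cell": cell_ref,
        "value": new_value,
    }
    # Same key => same partition => same command processor server.
    producer.send("sheet-update-commands", key=sheet_id, value=command)

publish_update("sheet-42", client_seq=7, cell_ref="B3", new_value="value3")
producer.flush()
```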

Coming to a bit more detail: what should be the schema of the table containing the row data? Here is a sample of what could potentially be one of the first thoughts in a developer’s mind. Again, I have a knack for oversimplifying things, so please excuse the over-simplification.
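The sample itself is an image in the original post; based on the column names described just below (sheetID, rowNo, rowJSON), an equivalent first-cut table might look like this in SQLite (the types are my guess):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rows (
        sheetID  TEXT    NOT NULL,   -- which datasheet this row belongs to
        rowNo    INTEGER NOT NULL,   -- sequential position of the row in the sheet
        rowJSON  TEXT    NOT NULL,   -- the row's cell values as JSON
        PRIMARY KEY (sheetID, rowNo)
    )
""")
```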

Here sheetID represents the sheet that contains the row, rowNo represents the sequential no. of the row, and rowJSON represents the JSON for the row data. Something like below:
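The JSON example is also an image in the original post; one hypothetical reading of it, with the row holding its cell values in column order (the 5834266 value is the one the aside below refers to):

```python
import json

row_json = json.dumps([
    {"value": "India"},
    {"value": "Total COVID19 cases"},
    {"value": "5834266"},
])
# => '[{"value": "India"}, {"value": "Total COVID19 cases"}, {"value": "5834266"}]'
```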

BTW, 5834266 is the total no. of COVID19 cases in India as of 25th Sept 2020 :(

Each object in the JSON contains the sequential column data. Now many would ask why, after suggesting a relational database to my friend, I am proposing this hybrid model where a row contains JSON. I think that introducing a separate table for the column values of the datasheet adds a huge performance overhead from table joins. Imagine that we hit the upper limit of the no. of rows in the datasheet, which is hypothetically 1k as per my friend, and each datasheet row has, let us say, 100 columns. To fetch the whole datasheet we would have to read 100k rows from the column-values table. The approach I mentioned, keeping the datasheet column values as JSON in the row table itself, saves that hassle by colocating one datasheet row’s data in one table row.

Now the above data model has two major flaws. If you haven’t spotted them so far, pause for a minute and then come back.

  1. If we add a row in the datasheet, our rows table needs to update rowNo for all the subsequent rows. This could be a huge DB write issue, requiring quite a few records to be updated as a result of just adding one row in between.
  2. If we add a column in the datasheet, all our rows need to be updated to include that extra blank datasheet column in the rowJSON of each row.

So in a way, my first data model turned out to be a pretty bad design. I fixed it by replacing rowNo with rowID in the rows table and introducing a new column, prevRowID, which contains the ID of the previous row. Now when a new row gets added in the datasheet, we only create one new row in the rows table and update one existing row to change its previous-row identifier. A simplified version of the updated rows table looks something like below. This approach is also not without flaws, but for brevity let us proceed with it.
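The updated table is again shown as an image in the original post; below is a self-contained sketch of the structure it describes (rowID plus prevRowID) and of what inserting a row in the middle now costs: one UPDATE plus one INSERT, no matter how many rows follow. The UUID choice and the helper name are my own.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE rows (
        rowID      TEXT PRIMARY KEY,  -- stable identifier for the row
        sheetID    TEXT NOT NULL,
        prevRowID  TEXT,              -- rowID of the row just above; NULL for the first row
        rowJSON    TEXT NOT NULL
    )
""")

def insert_row_after(sheet_id, prev_row_id, row_values):
    """Insert a new datasheet row right after prev_row_id (not at the top of the sheet)."""
    new_row_id = str(uuid.uuid4())
    # Whatever row used to follow prev_row_id must now follow the new row.
    # (The UPDATE runs before the INSERT so it cannot match the new row itself;
    # inserting at the very top would need extra IS NULL handling, omitted here.)
    conn.execute(
        "UPDATE rows SET prevRowID = ? WHERE sheetID = ? AND prevRowID = ?",
        (new_row_id, sheet_id, prev_row_id),
    )
    conn.execute(
        "INSERT INTO rows (rowID, sheetID, prevRowID, rowJSON) VALUES (?, ?, ?, ?)",
        (new_row_id, sheet_id, prev_row_id, json.dumps(row_values)),
    )
    conn.commit()
    return new_row_id
```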

A similar problem exists with the rowJSON. On similar lines, we can solve it by storing a columnID in each object of the rowJSON. Something like this:
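The example here is again an image in the original; this is the same hypothetical row as before, but with each object carrying the columnID of the datasheet column it belongs to (the IDs are made up). A newly added column simply has no entry in older rows and renders as a blank cell.

```python
import json

row_json = json.dumps([
    {"columnID": "col-a1", "value": "India"},
    {"columnID": "col-b2", "value": "Total COVID19 cases"},
    {"columnID": "col-c3", "value": "5834266"},
])
```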

So we have solved the problems of a new row and a new column. But we will certainly require another table in the database to contain the datasheet column metadata: things like the previousColumnID for each datasheet column, on similar lines to the datasheet rows.
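A sketch of that column-metadata table, mirroring the previous-row trick from the rows table (anything beyond the columns named in the text is my assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE columns (
        columnID          TEXT PRIMARY KEY,
        sheetID           TEXT NOT NULL,
        previousColumnID  TEXT,   -- columnID of the column to the left; NULL for the first
        name              TEXT    -- header text shown for the column
    )
""")
```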

Walking you through the system design of a mini version of Google Sheets wasn’t my real intention. Below is what I wanted to convey:

  • System design is always based on the situation. There is no use in building a full-blown, highly scalable version of a system without the need to do so. E.g. the Google Sheets equivalent for a small-to-medium enterprise needn’t be built the same way as the original Google Sheets system. No system design is wrong or right; every design is based on the constraints defined by business, technology, cost, team, and the time available.
  • System design will most likely not be optimal on the first attempt. E.g. I had to change the data model to take care of basic needs like a new datasheet row and a new datasheet column. And even the above system design is far from perfect; there might be quite a few flaws still hidden.

Cheers!
