Introduction to System Design Interview Questions
Seeing the big picture is a skill required for success in any role in the tech industry.
System design problems test your problem-solving skills. Let us tackle problems in different domains: spam, storage, latency, and the design of a whole system.
Problem 1— Spam Detection
People are copying content across websites. Let us assume we are a part of the Google Search webspam team. How would we detect duplicate websites? We will attempt to determine which one is the original and which one is copied.
1. Apply a hash function (#) to the content: Subsequent pages whose content produces the same hash are duplicates.
Risks: Google may incorrectly index the duplicate page before the original.
2. Count inbound links: The page with the most inbound links is the original.
Risks: Bad actors can game the system by inflating the number of inbound links.
3. Embed a unique ID in a page: Later pages without a unique ID but with the same content are duplicates.
Risks: This would require additional work from web admins. Thus we may not get 100% compliance.
4. Compare timestamps for similar web pages: Assume the earlier one is the original.
Risks: Web admins can fake early timestamps.
5. Factor in domain reputation: Domains that are known to copy original content are penalized.
Risks: This could require manual intervention, which would be slow and costly.
Given these pros and cons, we can combine two of the approaches: a hash function together with a unique ID.
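The combination of a content hash and a unique ID could be sketched as follows. This is a minimal illustration, not how any real search engine works: the names fingerprint and DuplicateIndex are hypothetical, the normalization (lowercasing and collapsing whitespace) is an assumption, and the sketch does not verify that a unique ID is genuine.

```python
import hashlib
import re


def fingerprint(content: str) -> str:
    """Hash the page text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", content).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateIndex:
    """Tracks, per content hash, which URL is currently treated as the original."""

    def __init__(self):
        # fingerprint -> (url, has_unique_id)
        self._originals = {}

    def classify(self, url, content, unique_id=None):
        """Return "original" or "duplicate" for a crawled page."""
        key = fingerprint(content)
        current = self._originals.get(key)
        if current is None:
            # First page seen with this content.
            self._originals[key] = (url, unique_id is not None)
            return "original"
        original_url, original_has_id = current
        if url == original_url:
            return "original"
        if unique_id is not None and not original_has_id:
            # A page carrying a unique ID displaces an earlier page without one.
            # This covers the case where the copy was crawled before the original.
            self._originals[key] = (url, True)
            return "original"
        return "duplicate"

    def original_for(self, content):
        """Return the URL treated as the original for this content, or None."""
        entry = self._originals.get(fingerprint(content))
        return entry[0] if entry else None
```

Because the hash only matches identical text after normalization, a copy that changes even one word would slip through; catching near-duplicates would need a similarity measure rather than an exact hash.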
Problem 2 — Reduce Gmail’s mail storage size
There are at least five different ways Gmail can save storage space. Let us describe each one and discuss the pros and cons of each.
1. Compression
Apply a compression algorithm to each mail, or concatenate mails before compressing them.
Cons: It will make the mails slower to access.
Mitigation: Selectively/smartly choosing mails to compress.
2. Auto-deletion
Automatically delete mails that are unlikely to be needed again.
Cons: Auto-deleting mails goes against Gmail’s marketing.
Mitigation: Auto-delete mails only after a certain number of days, or use AI to identify mails that will likely never be needed.
3. Single Item storage
Keep a single copy of emails, images, and attachments included in multiple mails.
4. Client-Side storage
Store some mails on the client machine (similar to WhatsApp).
Cons: The user won’t be able to access all mails across different machines.
Mitigation: Selectively/smartly choosing mails to store locally at the client machine.
5. Off-site storage
We can store some mails on off-site storage, which is significantly cheaper.
Cons: It will make the mails slower to access.
Mitigation: Selectively/smartly choosing mails to store off-site.
We can also determine the storage reduction strategy based on the types of Gmail users/customers and their use cases/problems.
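Single item storage, combined with compression, could look roughly like the sketch below. It is an illustration under assumptions, not Gmail's actual design: the class name SingleInstanceStore and its methods are hypothetical, blobs are keyed by their SHA-256 digest, and an in-memory dict stands in for real storage.

```python
import hashlib
import zlib


class SingleInstanceStore:
    """Keeps one compressed copy of each distinct blob, however many mails reference it."""

    def __init__(self):
        self._blobs = {}     # digest -> compressed bytes
        self._refcount = {}  # digest -> number of mails referencing the blob

    def put(self, data: bytes) -> str:
        """Store data if it is new and return the key a mail should keep instead of the bytes."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = zlib.compress(data)
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        """Decompress and return the original bytes; this is the extra access cost of compression."""
        return zlib.decompress(self._blobs[digest])

    def stored_bytes(self) -> int:
        """Total size of what is actually kept on the storage side."""
        return sum(len(blob) for blob in self._blobs.values())
```

An attachment forwarded to a thousand recipients is stored once, and each mail holds only the digest. The reference count is there so a blob can be deleted safely once no mail points at it.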
Problem 3— Resolve a server bottleneck
Suppose you have an application running on a server that stores your clients’ insurance information. Your colleagues need to access and update these documents, but they complain about long wait times and network disconnections.
Causes for slow response times:
- Network: Network latency can be caused by slow or overloaded network connections at the data center or the end user’s location. A slow Internet backbone can also be the cause of network latency.
- Server: Server latency is caused by slow processors and inefficient server hardware architectures.
- Storage: Storage latency is due to slow performing storage devices. Solid state drives and in-memory solutions offer higher performance.
- Database: Database (DB) latency occurs when the application makes frequent database trips. It is especially challenging when storing binary data, since DBs are designed to store relational data rather than large binary objects.
- Application: Applications are slow if:
  - They use suboptimal data structures and poor algorithms.
  - They run on operating systems that aren’t optimized for the latest hardware.
  - They run into deadlock scenarios.
Let’s say the IT team has ruled out network, server, storage, and application latency and has isolated the problem to database latency.
How can we solve it?
1. Use a different backend for our application. Many companies are now using NoSQL solutions such as MongoDB to store non-relational data.
2. Migrate our binary data into documents stored on a file system or SAN/NAS storage. This minimizes DB trips and DB latency.
3. Utilize an in-memory distributed cache. This would alleviate database traffic with a cache system that’s optimized for read operations.
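The caching idea in option 3 can be sketched as a read-through cache with least-recently-used eviction. This is a single-process simplification: a real deployment would use a distributed cache service shared by all application servers. The class name ReadThroughCache and the load_from_db callback are hypothetical.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serves repeated reads from memory and only goes to the database on a miss."""

    def __init__(self, load_from_db, capacity=1024):
        self._load = load_from_db       # callable: key -> value, performs the DB trip
        self._capacity = capacity
        self._entries = OrderedDict()   # ordered from least to most recently used

    def get(self, key):
        if key in self._entries:
            # Cache hit: mark as most recently used, no database trip.
            self._entries.move_to_end(key)
            return self._entries[key]
        # Cache miss: one database trip, then remember the result.
        value = self._load(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value

    def invalidate(self, key):
        """Call after a write so the next read fetches fresh data."""
        self._entries.pop(key, None)
```

Since colleagues also update the documents, every write path would need to call invalidate (or update the cached value), otherwise readers could see stale data.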
Problem 4— Design a Blogging Application
Let us end by designing the data model and key functions of a Blogging website.
Blog post data model:
- Blog post number
- Blog author’s email
Comments data model:
- Comment number
- Comment author’s email
Let us walk through how all this comes together, i.e., how user input interacts with each part of the program:
- The user visits the blog home page. This calls GetAllPosts(), which gets the last 10 blog posts in reverse chronological order.
- Once the data is retrieved, ShowAllPosts() puts the information in the appropriate view, including the HTML rendered in the user’s browser.
- A user can click to see a specific blog post. This calls GetSinglePost() and a subroutine, GetComments(), which retrieve that specific blog post and its comments from the database.
- Once the data is retrieved, ShowSinglePost() puts the information in the appropriate view.
- If the user decides to add a comment, it calls AddComment(), which saves the author’s comment to the database after appropriate authentication.
- If the user decides to add a new blog post, it calls AddNewPost(), which saves the author’s post to the database after appropriate authentication.
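The data model and the retrieval and save functions above could be sketched as follows. Several things here are assumptions for illustration: the function names are snake_case versions of GetAllPosts(), GetSinglePost(), GetComments(), AddComment(), and AddNewPost(); the body fields are not in the data model listed above; an in-memory dict stands in for the database; authentication is reduced to a boolean flag; and post numbers are assumed to increase over time, so sorting by number gives chronological order. The view functions ShowAllPosts() and ShowSinglePost() are left out.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Comment:
    number: int
    author_email: str
    body: str  # assumed field, not in the listed data model


@dataclass
class Post:
    number: int
    author_email: str
    body: str  # assumed field, not in the listed data model
    comments: list = field(default_factory=list)


class Blog:
    def __init__(self):
        self._posts = {}  # post number -> Post; stands in for the database
        self._next_post = count(1)
        self._next_comment = count(1)

    def add_new_post(self, author_email, body, authenticated):
        if not authenticated:
            raise PermissionError("authentication required")
        post = Post(next(self._next_post), author_email, body)
        self._posts[post.number] = post
        return post

    def add_comment(self, post_number, author_email, body, authenticated):
        if not authenticated:
            raise PermissionError("authentication required")
        comment = Comment(next(self._next_comment), author_email, body)
        self._posts[post_number].comments.append(comment)
        return comment

    def get_all_posts(self, limit=10):
        """Return the most recent posts, newest first."""
        newest_first = sorted(self._posts.values(), key=lambda p: p.number, reverse=True)
        return newest_first[:limit]

    def get_comments(self, post_number):
        return list(self._posts[post_number].comments)

    def get_single_post(self, post_number):
        """Return the post together with its comments."""
        return self._posts[post_number], self.get_comments(post_number)
```

Keeping retrieval separate from rendering mirrors the split between the Get and Show functions in the walk-through, so the views can change without touching the data access code.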