System Design — Online Judge with Data Modelling
A good system design question can be to design an online coding challenge judge.
Problem: A coding challenge is a competition in which a set of coding questions is released for a period of a few hours or days. Pre-registered competitors participate by solving these questions and submitting solutions, which are run against hidden test cases, and competitors are scored based on the test results. A platform hosting such challenges is called an online judge, e.g. SPOJ, Topcoder.
Features
The design can be very open ended and there is no one true solution. That said, a (non-exhaustive) list of features expected from you is…
- A participant should be able to register for future competitions
- A participant should be able to submit solutions to the problems during the competitions
- A participant should be able to fetch their profile, which includes personal details along with participation history
- The platform should maintain a score representing a participant's performance, calculated cumulatively across all the competitions they have participated in
- A participant should be able to fetch the leader board of a competition
- Should have a cumulative leader board that includes score contributions from all challenges
- Each question should have some hidden test cases, editorials(accessible after the competition), comments
- Should have practice problems which don't contribute to scoring
- Should have a mechanism to evaluate the submitted solutions against the underlying test cases and generate scores
Design Black Boxes
- A user communicates to the application via browser client
- There can be multiple application servers that query data from an underlying datastore to distribute the load
- All the servers are stateless, which lets us put a simple load balancer in front to distribute the load across these application servers
- An underlying datastore that stores metadata of the users, competitions, questions etc.
- A set of CPU-intensive machines to run the submitted code against the test cases
Scale
- The total number of users on a typical online judge is on the order of 100K; as an upper cap, say a million
- Say the upper cap on the number of users registered for any competition is approximately 15K
CAP: Consistency vs Availability
Optional read: A well written introduction to CAP Theorem.
The latency of our system needs to be as low as possible: the scoring mechanism rewards how quickly a participant completes a question, so low latency buys participants time to work on other questions. Trading availability for consistency is a bad idea, since the whole competition is a failure if the system is unavailable while it is running. So availability wins.
Low Level Components
Entities Identification
Model the system using ER diagrams. A good starting point is to identify all the major models, then decide which should be entities and which can be simple attributes of those entities. The main models are…
- users
- challenges
- questions
- editorials
- comments
- submissions
- participations/registrations
- question test cases
- challenge/cumulative leader boards
As per our application query patterns, there is no querying over editorials, comments or test cases, and these belong entirely to a question. So they can be made attributes of the questions table rather than separate entities.
Entity Relationships
This section is optional. A very neat way to identify and analyse the relations is to mark the entities using Ruby on Rails (RoR) associations. Below are a few popular associations…
- has_many
- belongs_to
- has_many through
Entity Associations
- A user has_many challenges through participations — user.challenges — used for queries such as “Get all challenges userA registered for”
- A challenge has_many users through participations — challenge.users — “Get all the users registered for challengeA”
- A challenge has_many questions — challenge.questions — “Get all the questions for challengeA”
- A question belongs_to challenge — question.challenge — “What challenge questionA appeared in?”. belongs_to association implies question has a foreign key of challengeId
- A question has_many submissions — question.submissions — “Get all the submissions of questionA” or “Get all the submissions of questionA by userA”
- A submission belongs_to user — Implies submission has a foreign key of userId
- A submission belongs_to question — Implies submission has a foreign key of questionId
- A participation belongs_to user — Implies participation has a foreign key of userId
- A participation belongs_to challenge — Implies participation has a foreign key of challengeId
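The associations above can be sketched directly as foreign keys and a join. A minimal illustration using SQLite (table and column names follow the Data Modelling section below; the sample ids and names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users          (userId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE challenges     (challengeId TEXT PRIMARY KEY, name TEXT);
CREATE TABLE participations (participationId TEXT PRIMARY KEY,
                             userId TEXT REFERENCES users(userId),
                             challengeId TEXT REFERENCES challenges(challengeId));
""")
conn.execute("INSERT INTO users VALUES ('u1', 'userA')")
conn.execute("INSERT INTO challenges VALUES ('c1', 'challengeA')")
conn.execute("INSERT INTO participations VALUES ('p1', 'u1', 'c1')")

# user.challenges — "a user has_many challenges through participations"
rows = conn.execute("""
    SELECT c.name FROM challenges c
    JOIN participations p ON p.challengeId = c.challengeId
    WHERE p.userId = 'u1'
""").fetchall()
print(rows)  # [('challengeA',)]
```

The `belongs_to` side of each association is simply the `REFERENCES` (foreign key) column on the participations table.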
Data Modelling
The list of refined entities (along with their attributes) and the number of bytes occupied by each column (using MySQL storage numbers for estimation) is…
- Users — userId(36), name(150), dateOfBirth(3), email(30), college(30), username(20), password(20), cumulativeScore(8), addedAt(8), updatedAt(8)
- Challenges — challengeId(36), name(150), challengeStartTime(8), challengeEndTime(8)
- Questions — questionId(36), challengeId — foreign Key(36), name(150), description(300), questionText(2000), testCases(2000), author(100), addedAt(8), updatedAt(8), editorial(3500), comments(2000), points(8), difficultyLevel(10)
- Submissions — submissionId(36), userId — foreign Key(36), questionId — foreign Key(36), programmingLanguage(15), submittedCode(3000), score(8), createdAt(8)
- Participations (or Registrations) — participationId(36), userId — foreign Key (36), challengeId — foreign Key(36), score(8), acceptedSubmissions(150)
Note: We’re using UUIDs as the primary keys of all rows, and a UUID string occupies 36 bytes
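The 36-byte figure for a UUID primary key is just the length of its canonical string form (32 hex digits plus 4 hyphens), which is easy to verify:

```python
import uuid

# A random (version 4) UUID rendered in its canonical string form.
key = str(uuid.uuid4())
print(len(key))  # 36
```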
Data Size
Per row, the Users table takes ~300 bytes, Challenges ~200 bytes, Questions ~11KB, Submissions ~3KB and Participations ~250 bytes.
As per the estimates, there are around a million users, i.e. ~300MB for the users table. A maximum of 10,000 challenges gives a ~2MB challenges table. A maximum of 50K problems gives a questions table of ~550MB. Submission behaviour is heavily skewed towards a small set of very active users, so assume they collectively produce on the order of 2 million submissions (e.g. heavy users averaging 4 attempts per question over ~500 questions), giving a submissions table of ~6GB. Finally, with the cap of ~15,000 participants per challenge, the participations table occupies ~37.5GB.
As one can see, the total size of all the data rows put together (without DB metadata) is roughly 44GB, which easily fits on a single machine. Indexes are needed on some of the tables for faster querying; these can also fit on a single machine.
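As a back-of-envelope check, the per-row sizes from this section can be multiplied out; the row counts (1M users, 10K challenges, 50K questions, ~2M submissions, ~15K participants per challenge) are assumptions labelled in the estimate above:

```python
GB = 10 ** 9  # using decimal GB for rough estimation

users          = 1_000_000 * 300          # 1M users x ~300 B/row
challenges     = 10_000 * 200             # 10K challenges x ~200 B/row
questions      = 50_000 * 11 * 1024       # 50K questions x ~11 KB/row
submissions    = 2_000_000 * 3 * 1024     # assumed ~2M submissions x ~3 KB/row
participations = 10_000 * 15_000 * 250    # 10K challenges x 15K participants/row

total = users + challenges + questions + submissions + participations
print(round(total / GB, 1))  # roughly 44.5
```

Participations dominate, so the participant-per-challenge assumption is the one worth stress-testing in an interview.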
Note: The data size we calculated might be far lesser than what might be present for a real online judge.
Datastores — SQL vs NoSQL
Optional: this Stack Overflow link describes why SQL might not scale as well as NoSQL in some cases
While making the choice of SQL vs NoSQL, one question that needs to be asked is: are the data and indexes so large that maintaining ACID properties would hamper our latencies? A typical SQL system can easily handle a few hundred GBs, and our data size is well below that — SQL wins.
Are there extensive relations between the system entities, with a good amount of joins? That is true in our case (as later parts of this article show) — SQL wins
Based on the above criteria, an SQL-based datastore is sufficient and well suited, and there isn’t any need for NoSQL.
APIs
There can be lots of APIs for our application, but a few important ones along with their happy flows are below…
registerUserToCompetition(userId, competitionId)
The API reads the parameters, calls the DB module to add an entry to the participations table (thereby registering the user), and returns a success/failure response.
submitCode(userId, questionId, programmingLanguage, codeText)
The API reads the parameters, creates an entry in the submissions table through the DB module, calls the code-execution module to run the submitted code against the hidden test cases, gets the score, and finally calls the DB module to update the execution status and scores in the submissions, participations and users tables.
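The submitCode happy flow can be sketched as below. `FakeDB` and `FakeExecutor` are stand-ins for the DB module and code-execution module described above; their method names are illustrative, not a real API:

```python
def submit_code(db, executor, user_id, question_id, language, code_text):
    # 1. Record the submission.
    submission_id = db.insert_submission(user_id, question_id, language, code_text)
    # 2. Run it against the hidden test cases on the execution layer.
    score = executor.run(question_id, language, code_text)
    # 3. Persist the result; participation/user totals derive from this.
    db.update_submission_score(submission_id, score)
    return submission_id, score

class FakeDB:
    def __init__(self):
        self.submissions = {}
    def insert_submission(self, user_id, question_id, language, code):
        sid = len(self.submissions) + 1
        self.submissions[sid] = {"user": user_id, "question": question_id,
                                 "language": language, "score": None}
        return sid
    def update_submission_score(self, sid, score):
        self.submissions[sid]["score"] = score

class FakeExecutor:
    def run(self, question_id, language, code):
        return 100  # pretend all hidden tests passed

db, ex = FakeDB(), FakeExecutor()
sid, score = submit_code(db, ex, "u1", "q1", "python", "print(42)")
print(sid, score)  # 1 100
```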
getHistory(userId)
The API reads the parameters, fetches the entries from user and competitions table to supply the user profile info and score history respectively
fetchLeaderBoard(challengeId)
This API fetches the score board from the caching layer based on the challenge id; it falls back to the main datastore in case the cache doesn’t have the leader board information (say, due to machine failures)
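The fallback behaviour is the classic cache-aside pattern: try the cache, and on a miss query the datastore and repopulate. A minimal sketch, with a plain dict standing in for the caching layer and a dict lookup standing in for the DB query:

```python
cache = {}  # stand-in for the caching layer (e.g. Redis)

def fetch_leader_board(challenge_id, db_query):
    board = cache.get(challenge_id)
    if board is None:                    # cache miss, e.g. after a cache failure
        board = db_query(challenge_id)   # fall back to the main datastore
        cache[challenge_id] = board      # repopulate for subsequent reads
    return board

# Hypothetical datastore contents for one challenge.
db = {"c1": [("userA", 300), ("userB", 250)]}
board = fetch_leader_board("c1", db.get)
print(board[0])  # ('userA', 300)
```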
getLastSuccessfulSubmission(userId, questionId, programmingLanguage)
This API fetches the last successful submission from the submissions table
Scoring Mechanism
There are different factors that might be taken into account in deciding a multiplier for scoring a problem submission. To list a few…
- How soon after the problem opened the correct solution was submitted
- Partial scoring for sub-optimal solutions(ex: submitted O(n²) solution for expected O(n) solution)
- Number of incorrect submissions made
Based on factors like the above, a user’s submission can be scored. If the user submits a solution multiple times and the new submission scores the same or higher than the previous one, we update the stored score. These scores live in the score column of the submissions table, and a participant’s total score in a challenge is kept up to date in the participations table
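The resubmission rule above boils down to "keep the best score seen so far", which can be stated in a few lines:

```python
def updated_score(previous, new):
    """Score to store after a (re)submission: only improve, never regress."""
    return new if previous is None or new >= previous else previous

print(updated_score(None, 40))  # 40 -> first submission is stored
print(updated_score(40, 70))    # 70 -> a better attempt overwrites
print(updated_score(70, 50))    # 70 -> a worse attempt is ignored
```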
Leader Boards
To get a leaderboard for a challenge, a query of the form “select * from participations where challengeId = ‘currentChallenge’ order by score desc” can be issued.
To support the cumulative leader board, we also update the user’s score in the user.cumulativeScore field along with the current challenge score in the participations table
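The leaderboard query can be run end to end against a scratch SQLite database (illustrative ids and scores; only the columns the query touches are created):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participations (userId TEXT, challengeId TEXT, score INT)")
conn.executemany("INSERT INTO participations VALUES (?, ?, ?)",
                 [("u1", "c1", 250), ("u2", "c1", 300), ("u3", "c2", 100)])

# Leaderboard for challenge c1, best score first.
rows = conn.execute(
    "SELECT userId, score FROM participations "
    "WHERE challengeId = 'c1' ORDER BY score DESC").fetchall()
print(rows)  # [('u2', 300), ('u1', 250)]
```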
High Level Components
Caching
Caching is an important part of any design. In our case, some queries appear very often and some are quite inefficient, for example…
- Fetch the leader boards
- What are all the set of questions that are asked in a challenge?
- Get profile information of an active user
- Get an active user’s score history of past challenges
- Get practice problems sorted by difficulty level
Caching can reduce the latency of all the above read queries drastically. Redis, with its powerful data structures, can be used to cache all of these in memory
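Leader boards in particular fit Redis sorted sets: scores can be updated incrementally and ranked reads stay cheap. A sketch of those semantics with a plain dict standing in for one sorted set per challenge (with real Redis, the two helpers map roughly to ZADD and ZREVRANGE WITHSCORES):

```python
leader_board = {}  # member -> score, one such structure per challenge

def zadd(member, score):
    """Insert or update a member's score (Redis ZADD-like)."""
    leader_board[member] = score

def top(n):
    """Top-n members by score, best first (Redis ZREVRANGE-like)."""
    return sorted(leader_board.items(), key=lambda kv: -kv[1])[:n]

zadd("userA", 250)
zadd("userB", 300)
zadd("userA", 320)  # score update on resubmission
print(top(2))  # [('userA', 320), ('userB', 300)]
```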
Machine Failures
Follow a master-slave architecture: the master takes read and write requests, the slave takes only reads, and any writes to the master become eventually consistent on the slave within a few seconds. If the master dies, the slave gets promoted to master and starts taking the complete read and write traffic.
Since the data size is less, a single machine system can be used as master without any sharding.
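The routing and failover described above can be sketched with two toy nodes; the replication here is trivially synchronous for brevity, whereas the text assumes eventual consistency within seconds:

```python
class Node:
    def __init__(self, name):
        self.name, self.alive, self.data = name, True, {}
    def write(self, key, value):
        if not self.alive:
            raise ConnectionError(self.name)
        self.data[key] = value

class Cluster:
    def __init__(self, master, replica):
        self.master, self.replica = master, replica
    def write(self, key, value):
        try:
            self.master.write(key, value)
            self.replica.data[key] = value   # replication (simplified)
        except ConnectionError:
            self.master = self.replica       # promote the replica to master
            self.master.write(key, value)
    def read(self, key):
        return self.replica.data.get(key)    # reads served by the replica

c = Cluster(Node("master"), Node("replica"))
c.write("score", 100)
c.master.alive = False   # the master dies
c.write("score", 120)    # the replica is promoted and takes the write
print(c.read("score"))   # 120
```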
Code Execution Layer — Running the code against tests
Use special containers running on high-CPU machines to run the submitted code. Sandboxing is necessary so that each execution…
- doesn’t consume too many resources
- runs with restricted privileges, so the code can’t peek into system config
- is subject to time limits
Plagiarism Checks
There is a chance of plagiarism in the submitted solutions, and such users need to be flagged and disqualified from the current challenge. Real-time plagiarism checks are impractical given the large time it takes to find code duplicates, so the online judge should also run post-challenge plagiarism checks, comparing the code of different participants using software like MOSS
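MOSS is far more sophisticated (it is robust to renaming and reordering), but the core idea of pairwise similarity scoring can be sketched with the standard library's difflib; the snippets and the 0.9 threshold are illustrative:

```python
from difflib import SequenceMatcher

def similarity(code_a, code_b):
    """Rough pairwise similarity ratio in [0, 1] between two submissions."""
    return SequenceMatcher(None, code_a, code_b).ratio()

a = "for i in range(n): total += i"
b = "for j in range(n): total += j"   # trivially renamed copy of a
c = "print(sum(range(n)))"            # independent solution

print(similarity(a, b) > 0.9)  # True  -> flag the pair for manual review
print(similarity(a, c) > 0.9)  # False
```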
Conclusion
This is just one way to tackle the problem; these interviews are usually very open ended and can go in specific directions depending on the interviewer and the company. One may be asked to go deep into any of the above modules.
References
- Inspiration from Interview bit
Also, please check out my article on Rate limiter Modelling
Thanks to Bhargav for reviewing and Anubhav for feedback. Please clap if you think it’s useful and comment your opinions about this article.