Second Class Test for Databases
Hey! This blog post features all the content from the lecture slides, things Rasmus has said, important discussions, material from the course textbook & things from Google, as well as all the tutorials that are required for this class test.
I’m going to somehow embed my tutorial answers / write-ups here, once I figure out what software Medium supports for embedding PDFs 😁✨
Most of the test is focused on chapter 3 (I asked Rasmus), so if you’re strapped for time, focus only on chapter 3 😁🔥
Note: If you fancy supporting the work I do, my PayPal is https://www.paypal.me/brandonskerritt 😉😜 Or if you have Monzo, you can support me here: https://monzo.me/brandonskerritt
Flashcard deck for this module can be found here.
Aborts
Why might a transaction abort?
- Deadlock
- Errors while executing transactions
- Explicit request to abort in the code
- Media failures (hard drive dying)
- Fires, floods etc
- System failures
Deadlock
Deadlock occurs when each transaction in a set of two or more transactions is waiting for some item that is locked by some other transaction in the same set.
Without looking at the graph, you can clearly see that both transactions are trying to write an item that the other transaction holds a lock on.
As you can see, it’s rather easy for a transaction to enter deadlock. We’ll talk more about deadlock later.
Logging
The idea of logging is to write important transactions to a log so that if those transactions fail, we can try again in the future and restore the database to the correct state. Logging should work even in the case of system failures.
There are 3 main types of logging:
- Undo logging (atomicity)
- Redo logging (durability)
- Combinations of the two
Undo Logging
We log with the intention to restore the database back to a previous state (undoing the work transactions have done).
The syntax for this is:
- <start T> — transaction T has started
- <commit T> — Transaction T has been committed
- <abort T> — transaction T has been aborted
- <T, X, v> — transaction T has updated database item X, and the old value of X was v.
If transaction T updates database item X and the old value was v, then <T, X, v> must be written to the log on disk before X is written to disk.
If transaction T commits, then <COMMIT T> must be written to disk as soon as all database elements changed by T have been written to disk.
So the idea is: before we change anything on disk, we log it. Commits are the other way round: only once all of T’s changes are on disk do we write the commit to the log, and we do that straight away, with nothing else in between.
Undo essentially ensures Atomicity.
Recovery with Undo Logs
If an error occurs, the recovery manager restores the last consistent database state. It does this by traversing the undo log backwards.
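To make that backwards traversal concrete, here’s a minimal Python sketch of undo recovery. The log format (tuples like ("UPDATE", T, X, old)) and the function name are made up by me for illustration; it’s not the exact structure from the slides.

```python
# Minimal sketch of undo-log recovery (hypothetical log format).
# Records: ("START", T), ("COMMIT", T), ("ABORT", T), ("UPDATE", T, X, old_value)

def undo_recovery(log, disk):
    finished = {r[1] for r in log if r[0] in ("COMMIT", "ABORT")}
    # Traverse the log backwards (newest record first).
    for record in reversed(log):
        if record[0] == "UPDATE":
            _, t, x, old_value = record
            if t not in finished:
                disk[x] = old_value          # T never finished: undo its change
    # Write an ABORT record for every incomplete transaction.
    started = {r[1] for r in log if r[0] == "START"}
    for t in started - finished:
        log.append(("ABORT", t))
    return disk

# Example: T1 committed, T2 didn't, so T2's write to Y is undone.
log = [("START", "T1"), ("UPDATE", "T1", "X", 1),
       ("START", "T2"), ("UPDATE", "T2", "Y", 8),
       ("COMMIT", "T1")]
print(undo_recovery(log, {"X": 10, "Y": 15}))   # {'X': 10, 'Y': 8}
```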
Redo Logs
Logs activities with the goal of restoring committed transactions (ignores incomplete transactions).
The syntax is the same as before, but <T, X, v> has a new meaning.
“Transaction T has updated the value of database item X & the new value of X is v”
To do this, we need to change the procedure for logging.
- T first writes all log records for all updates to disk
- T writes <COMMIT T> to the log on disk
- T writes all updates to disk
Redo essentially ensures Durability.
Recovery with Redo Logs
This is essentially the reverse of undo logs.
- Identify all the transactions with a COMMIT log record
- Traverse the log from first to the last item.
- If we see <T, X, v> and T has a COMMIT log record, then change the value of X on disk to v.
- For each incomplete transaction T, write <ABORT T> into the log on disk.
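Here’s the same sort of rough Python sketch for the steps above (same made-up log format as before, except the UPDATE records now carry the new value of the item):

```python
# Rough sketch of redo recovery (hypothetical log format; UPDATE holds the NEW value).
def redo_recovery(log, disk):
    committed = {r[1] for r in log if r[0] == "COMMIT"}
    # Traverse the log forwards (oldest record first).
    for record in log:
        if record[0] == "UPDATE":
            _, t, x, new_value = record
            if t in committed:
                disk[x] = new_value          # redo the committed change
    # Incomplete transactions never had their updates written to disk,
    # so there's nothing to undo; just log an ABORT for them.
    started = {r[1] for r in log if r[0] == "START"}
    for t in started - committed:
        log.append(("ABORT", t))
    return disk

log = [("START", "T1"), ("UPDATE", "T1", "X", 5), ("COMMIT", "T1"),
       ("START", "T2"), ("UPDATE", "T2", "Y", 7)]        # T2 never committed
print(redo_recovery(log, {"X": 0, "Y": 0}))   # {'X': 5, 'Y': 0}
```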
What if a transaction aborts?
Use the undo log to undo all changes made by the transaction.
Undo Logging Example
Rasmus said that this will be in the exam.
The first thing we do is start transaction 1. We write this into the log.
When we get to line 3, we log <T1, X, 1>. That is, T1 has changed X, and we write down what it was before: X = 1.
We now start transaction 2. We write some write logs.
The next thing we do is flush the log into the hard disk. Everything written in Log(buffer) we put onto the log(disk).
Afterwards we write the values of X and Y out to the hard disk; this doesn’t change anything in the log. T1 has finished, so we commit it, and T2 has finished, so we commit it too.
We write the logs (commit T1, commit T2) to the hard disk (flush_log).
Redo Logging Example
Very similar to before.
Which transaction here can cause us to want to use undo / redo logging?
The hint is that you can delete one of these lines and still get the same outcome if you’re using undo / redo logging. Deleting line 13 is a good idea. Line 13 is about outputting X and there’s already another transaction (t2) playing around with X.
As we know X will be output later on anyway, we don’t need to output X here.
Undo Without Redo
Ensuring atomicity and durability without logging at all.
Stealing means the working memory is full (think of registers in assembly: very limited working memory), so our transaction needs to clear some data from working memory in order to carry out what it wants to do. Say you have two transactions, T1 and T2. T2 writes 10 to x; T1 wants to write 5 to x, so x gets forced out to the database and T1 writes 5 to it. T2 then goes to reuse x, as it hasn’t finished, but something else has stolen x from it and changed the value. Forcing means that every time a transaction commits, all of the affected variables are pushed to the database; this is inefficient, as multiple transactions can end up writing the same x to the database.
By using steal / no-force we speed things up a lot, because we ignore all of these rules. By using logging, if we have any issues with stolen memory we can just redo / undo to fix it.
Checkpoints
The idea is to checkpoint the log periodically. After t minutes, or after m transactions, etc., we create a checkpoint. It is annoying to go through the entire log file when we want to redo operations, so we place checkpoints in the log file to give us good places to recover from. It goes as follows:
- Stop accepting new transactions
- Wait until all active transactions finish and have written a COMMIT or ABORT record to the log.
- Flush the log to disk.
- Write a log record <CHECKPOINT>.
- Flush the log to disk again.
- Resume accepting transactions.
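To see why this helps, here’s a tiny made-up sketch: after a checkpoint, recovery only has to look at the tail of the log, because everything before the <CHECKPOINT> record is already safely on disk.

```python
# Tiny sketch: with a checkpoint, recovery only needs the tail of the log.
def log_tail_after_checkpoint(log):
    last_checkpoint = -1
    for i, record in enumerate(log):
        if record[0] == "CHECKPOINT":
            last_checkpoint = i
    return log[last_checkpoint + 1:]    # everything before it is already on disk

log = [("START", "T1"), ("UPDATE", "T1", "X", 5), ("COMMIT", "T1"),
       ("CHECKPOINT",),
       ("START", "T2"), ("UPDATE", "T2", "Y", 7), ("COMMIT", "T2")]
print(log_tail_after_checkpoint(log))   # only T2's records remain to be replayed
```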
Cascading rollbacks
If you abort a transaction, you need to find all transactions that depend on this transaction.
If a transaction, T, aborts then:
- Find all transactions that have read items written by T.
- Recursively abort all transactions that have read items written by an aborted transaction
If we don’t abort all of the transactions, then we break isolation. If we do abort all of them, we may break durability.
We want to find a method where neither of these break.
The durability problem with cascading rollbacks occurs when a transaction T_1 reads data written by some transaction T_2, then T_1 commits, and afterwards T_2 aborts.
Recoverable Schedules
A schedule s is recoverable if the following is true:
- if a transaction t_1 commits and has read an item x that was written earlier by a different transaction t_2,
- then t_2 must commit before t_1 commits
This schedule is recoverable as it ensures we don’t get problems with durability (and isolation).
If you write things to the log in a certain order, they appear on disk in that order.
Recoverable schedules can still lead us to cascading rollbacks.
Both are active, neither has committed, so no durability problems here.
Cascadeless Schedules
Cascading rollbacks are kinda slow to handle, so it’s better to avoid them entirely.
A schedule is cascadeless if each transaction in it reads only values that were written by transactions that have already committed. This avoids dirty reads and cascading rollbacks.
This isn’t allowed: it’s a recoverable schedule, but we read X before T2 commits.
But this is allowed
All of them read data written by something that isn’t committed yet.
Cascadeless schedules are recoverable. In general, they are not serializable. Rasmus gives a proof for why they’re recoverable. Tbh I don’t care that much about proofs and I’m sure you don’t either, so I’m not going to include that here.
Strict Schedules
Strict schedules give us both no cascading aborts & serializability.
A schedule is strict if each transaction in it reads and writes only values that were written by transactions that have already committed.
So this is not possible. But this is:
Strict Two-Phase locking
Strict 2PL ensures the schedule is both conflict-serializable and strict.
You can only unlock things at the very end of the transaction (after it commits or aborts).
Left is 2PL as we have all unlocks following all locks, but it’s not strict yet.
If a schedule consists only of strict 2PL transactions, then:
- The schedule is conflict-serializable
- The schedule is strict
This is an example of something that is serializable, but not conflict-serializable and not recoverable (because it’s reading something that was written before but not committed).
In total, we have all of these types of reads:
The problem with these is that deadlock may happen.
1. RECOVERABLE SCHEDULE
A recoverable schedule is one where, for each pair of transactions Ti and Tj, if Tj reads a data item previously written by Ti, then the commit operation of Ti appears before the commit operation of Tj.
Suppose that after its Read(x) operation, T2 commits, and then somehow T1 fails. Transaction T2 would have to be aborted to ensure atomicity. However, since T2 has already committed, it can’t be aborted, so we arrive at a situation where it is impossible to recover.
And hence it’s necessary that Ti commits before Tj.
2. CASCADELESS SCHEDULE
Transaction T1 writes x, which is read by transaction T2. Transaction T2 writes x, which is read by transaction T3. Suppose at this point T1 fails. T1 must be rolled back; since T2 is dependent on T1, T2 must be rolled back, and since T3 is dependent on T2, T3 must be rolled back.
This phenomenon, in which a single transaction failure leads to a series of transaction rollbacks, is called cascading rollback.
- Cascading rollback is undesirable, since it leads to the undoing of a significant amount of work.
- It is desirable to restrict schedules to those where cascading rollbacks cannot occur. Such schedules are called cascadeless schedules.
- Formally, a cascadeless schedule is one where, for each pair of transactions Ti and Tj such that Tj reads a data item previously written by Ti, the commit operation of Ti appears before the read operation of Tj.
Every cascadeless schedule is also a recoverable schedule.
Cascadeless Schedule Example:
3. STRICT SCHEDULE
Let’s take a cascadeless schedule as an example:
In this schedule, the Write(x) of transaction T2 overwrites the previous value written by T1, and hence an overwrite conflict arises. This problem is taken care of in a strict schedule.
Strict Schedule is a schedule in which a transaction can neither Read(x) nor Write(x) until the last transaction that wrote x has committed or aborted.
The hierarchy is:
- Strict
- Cascadeless
- Recoverable
So if you’re asked “What is the strictest recoverability condition that each schedule satisfies”, see if it’s strict; if not, see if it’s cascadeless, etc. Go down the pyramid of weird schedule rules.
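If you want to sanity-check that pyramid on a concrete schedule, here’s a rough Python sketch that classifies one. The tuple format is my own invention (it only knows about reads, writes, and commits), so treat it as an illustration of the definitions, not a reference implementation.

```python
# Operations: ("R", T, X) = read, ("W", T, X) = write, ("C", T) = commit.
def classify(schedule):
    recoverable = cascadeless = strict = True
    last_writer = {}      # item -> transaction that last wrote it
    committed_at = {}     # transaction -> position of its commit
    committed = set()
    reads_from = []       # (reader, writer) pairs

    for pos, op in enumerate(schedule):
        if op[0] == "C":
            committed.add(op[1])
            committed_at[op[1]] = pos
            continue
        _, t, x = op
        w = last_writer.get(x)
        if w is not None and w != t:
            if op[0] == "R":
                reads_from.append((t, w))
                if w not in committed:
                    cascadeless = False      # dirty read
            if w not in committed:
                strict = False               # touching dirty data at all
        if op[0] == "W":
            last_writer[x] = t

    inf = float("inf")
    for reader, writer in reads_from:
        # recoverable: the writer must commit before the reader commits
        if committed_at.get(writer, inf) > committed_at.get(reader, inf):
            recoverable = False

    if strict:
        return "strict"
    if cascadeless:
        return "cascadeless"
    if recoverable:
        return "recoverable"
    return "not even recoverable"

# T2 reads X written by the uncommitted T1 and commits first: not recoverable.
print(classify([("W", "T1", "X"), ("R", "T2", "X"), ("C", "T2"), ("C", "T1")]))
```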
Remember that really complicated Venn diagram with like 50 different rings? This is the TL;DR of that.
Deadlock Detection
Timeouts
We assign each transaction some sort of time limit; if it takes longer than that, it is more than likely to be in deadlock.
Wait-for graphs
One node is created in the wait-for graph for each transaction that is currently executing. Whenever a transaction is waiting to lock an item that is currently locked by another transaction, a directed edge is created.
If there is a cycle in this graph we have deadlock. Remember COMP124? This is the same principle as taught there.
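Checking for a cycle in the wait-for graph is just a depth-first search. A small Python sketch (the graph format is my own):

```python
# Deadlock detection = finding a cycle in the wait-for graph.
# waits_for maps each transaction to the set of transactions it is waiting on.
def has_deadlock(waits_for):
    visited, on_stack = set(), set()

    def dfs(t):
        visited.add(t)
        on_stack.add(t)
        for u in waits_for.get(t, set()):
            if u in on_stack:            # found a cycle -> deadlock
                return True
            if u not in visited and dfs(u):
                return True
        on_stack.discard(t)
        return False

    return any(dfs(t) for t in list(waits_for) if t not in visited)

# T1 waits for T2 and T2 waits for T1: classic deadlock.
print(has_deadlock({"T1": {"T2"}, "T2": {"T1"}}))   # True
```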
Timestamp
Each transaction is given a unique timestamp upon arrival. Timestamps do not change, even after a restart.
Younger transactions can wait for the older transaction but not the other way around.
Wait-Die Scheme
Older transactions always wait for unlocks.
In wait-die, older transactions are allowed to wait for a younger transaction, whereas a younger transaction requesting an item held by an older transaction is aborted and restarted.
Wound-Wait Scheme
In wound-wait, a younger transaction is allowed to wait for an older one, whereas an older transaction requesting an item held by a younger transaction preempts the younger transaction by aborting it.
Both wait-die and wound-wait end up aborting the younger of the two transactions that may be involved in a deadlock, assuming that this will waste less processing. It can be shown that these two techniques are deadlock-free.
Eventually, any finite number of transactions finishes under wound-wait.
The argument for wait-die is similar, but there we look at the oldest transaction among those that it (recursively) waits for.
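A tiny Python sketch of the two decisions, assuming a smaller timestamp means an older transaction (the function names are mine):

```python
# Smaller timestamp = older transaction.
# requester wants an item currently locked by holder.
def wait_die(ts_requester, ts_holder):
    # Older requester waits; younger requester dies (aborts and is restarted
    # later, keeping its ORIGINAL timestamp).
    return "wait" if ts_requester < ts_holder else "abort requester"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (aborts) the younger holder; younger requester waits.
    return "abort holder" if ts_requester < ts_holder else "wait"

print(wait_die(1, 2))     # older asks younger  -> wait
print(wait_die(2, 1))     # younger asks older  -> abort requester
print(wound_wait(1, 2))   # older asks younger  -> abort holder
print(wound_wait(2, 1))   # younger asks older  -> wait
```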
Timestamp-based alternative to strict two-phase locking
The idea is to schedule transactions so that the effect is the same as executing each transaction instantaneously when it is started.
This is equivalent to a serial schedule that has all transactions in the order of their start times.
Timestamp-based schedulers
Nothing new here.
Timestamps in the scheduler
Each transaction T is assigned a unique integer timestamp TS(T) when it starts (the timestamp of T). This is different from earlier, where the timestamp was assigned on arrival; here it’s assigned on start, so restarts get a new timestamp.
We use these timestamps to define what will happen with a read/write request.
Before, timestamps were assigned on arrival, but this time it’s on start, so if you restart you get a new timestamp. T1 starts and we give it timestamp 1, T2 starts and gets 2, T3 starts and gets 3, T4 starts and gets 4. At this point T2 gets aborted, and when it restarts it gets a fresh timestamp.
You cannot read something that happens in the future; you can only read the past.
Example
We write Y later on but try to read it earlier; this can’t happen in T_1, so we roll it back. We also need to roll back T_2, as we read an item before writing to it.
Rasmus adds “additional bookkeeping” below.
We do everything for T1 at time 1. Later on we write this item at timestamp 2. If T1 was really happening at timestamp 1, it cannot read from the future, so we have to restart it.
We’re going to see the requests as they come in and then decide what to do about each one.
At time 0, all read & write times are 0. We have 2 transactions, T1 & T2, and each one has a timestamp: T1 has 1, T2 has 2.
T1 reads X; X has now been read at time 1, so we write 1 in the read time for X.
We add 100 to X; nothing changes, as we haven’t read or written anything in the database.
At this point, T2 decides to do something, so we set the read time of the item it reads to 2.
At write_item(Y) we update the write time. Whenever we read / write, we update those times; the read / write time is always the max over everything that has read / written the item.
If transaction T1 now reads Y, it cannot, as Y has been written in the future (at timestamp 2), so we have to abort here & restart T1 so that it can read transaction 2’s Y in the past.
Now it’s timestamp 3, it can read timestamp 2 stuff (read item Y).
We haven’t done anything with X in the next transaction, so it goes back to 0, 0.
So we redo transaction 1, but this time when we run read_item(Y) it actually works.
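That bookkeeping boils down to two rules: a read is rejected if the item was written “in the future”, and a write is rejected if the item was already read or written “in the future”. Here’s a rough Python sketch (the data structures and names are mine, and I’m ignoring commits and the “additional bookkeeping”):

```python
# Rough sketch of basic timestamp ordering.
read_ts, write_ts = {}, {}     # per-item read/write times, default 0

def read_item(ts, x):
    if ts < write_ts.get(x, 0):
        return "abort"                      # X was written "in the future"
    read_ts[x] = max(read_ts.get(x, 0), ts)
    return "ok"

def write_item(ts, x):
    if ts < read_ts.get(x, 0) or ts < write_ts.get(x, 0):
        return "abort"                      # a later transaction already used X
    write_ts[x] = ts
    return "ok"

print(write_item(2, "Y"))   # T2 writes Y          -> ok, write time of Y is now 2
print(read_item(1, "Y"))    # T1 (older) reads Y   -> abort, Y is from the future
print(read_item(3, "Y"))    # restarted T1 (now 3) -> ok
```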
Ensuring Strictness
To make a timestamp-based schedule strict, we need to enforce this rule:
“Delay read or write requests until the youngest transaction who wrote X before has committed or aborted”
So in this example we have transactions that write to X; we delay the read and write requests of transaction 1 until transaction 3 either commits to the database or aborts.
Deadlock Prevention
The above stuff is just stuff we’ve seen; this below is slightly different and important (but we’ve still seen most of it; he repeats stuff in the slides).
This is basically what we’ve seen earlier. He likes to repeat slides a lot.
We can either start reading down T1 or down T2. Let’s read down T2.
We have an item Y written to at time 2, and we try to read it at time 1. This cannot happen, so we’re going to roll T1 back.
Yeah uhm, this looks like it’s exactly the same as we’ve done earlier.
Chapter 3
Basically search but with filters.
A query is evaluated inside-out: take the FROM tables, apply the condition, then the SELECT list: (SELECT(CONDITION(FROM))).
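As a throwaway Python illustration of that nesting (the table and column names are made up), SELECT name FROM people WHERE age >= 18 reads as “filter the rows, then pick the columns”:

```python
# project(filter(people)): the WHERE condition runs first, then the SELECT list.
people = [{"name": "Ada", "age": 36}, {"name": "Bob", "age": 12}]

filtered = [row for row in people if row["age"] >= 18]    # the CONDITION
projected = [row["name"] for row in filtered]             # the SELECT list
print(projected)   # ['Ada']
```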
Join is about pairing things up with their attributes.
I made a YouTube video here about natural joins:
Unlike normal trees, we proceed from the bottom to the top of the plan.
He goes over the entire for loop. This is the important part:
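I can’t embed the slide, but the important for loop is (roughly) a nested-loop join: for every pair of rows from the two tables, keep the pair if it matches on the shared attributes. A minimal Python sketch of the natural-join version (the relation format is my own):

```python
# Nested-loop natural join: for every pair of rows, keep the pair if the rows
# agree on all shared attributes.
def natural_join(r, s):
    result = []
    for row_r in r:                       # outer loop over R
        for row_s in s:                   # inner loop over S
            shared = set(row_r) & set(row_s)
            if all(row_r[a] == row_s[a] for a in shared):
                result.append({**row_r, **row_s})
    return result

r = [{"id": 1, "city": "Liverpool"}, {"id": 2, "city": "Leeds"}]
s = [{"id": 1, "module": "COMP207"}]
print(natural_join(r, s))   # [{'id': 1, 'city': 'Liverpool', 'module': 'COMP207'}]
```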
We can go faster
Tbh there are so many slides of just this; I’ll share the important ones. Chapter 3 lecture 2, around slide 23, if you want to see all of this.
Loads of these slides too… Let’s skip to the end…
I don’t think he’s going to ask us about the speed of algorithms… But just in case here they are.
Summary:
Yeah, I’m going to skip ahead like… so many slides just to find the stuff that is in the exam. Chapter 3 lecture 3 if you care about this.
It says (idea) because this is what they’re supposed to do, not what they actually do. It’s probably important to learn these in case he asks “in an ideal world, how does this work blah blah blah” meaning that in the world where the idea worked. Probably not that important, but file this among “not important, but just important enough to actually matter”
It might be a good idea to watch these lectures (last 2–3 lectures of chapter 3) because he’s going over an example and it’s hard to portray the example here.
The pointers are the arrows literally pointing to things, I believe.
Smaller is left, larger is right.
To get the values ≥ 11, find the first number that is 11 or higher, then just go right along the bottom (the leaf level), printing out 11, 13, 17, 19, 23, etc.
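As a rough sketch of what that bottom-level scan does (the structure is heavily simplified; real B+ tree leaves are linked nodes, not a Python list of lists, and we’d descend the tree to find the starting leaf):

```python
# Range scan at the leaf level of a B+ tree: find the first key >= 11,
# then just keep walking right along the (sorted, linked) leaves.
leaves = [[2, 3, 5], [7, 11, 13], [17, 19, 23]]   # sorted, linked left-to-right

def range_scan(leaves, low):
    out = []
    for leaf in leaves:
        for key in leaf:
            if key >= low:
                out.append(key)
    return out

print(range_scan(leaves, 11))   # [11, 13, 17, 19, 23]
```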
These heuristics are just general tips on what we want to achieve to optimise things, not strict rules.
He has a lot of stuff on estimating size / speed; it might be in the exam? I’ll ask, find out, and then update this text.
And we’re done. That’s everything. Good luck!