Okay so what is database indexing ? What are we trying to solve ? How does this ‘thing’ make anything fast.
Lets step back a bit. Think of a scenario : you’re in a library where none of the books are organised in alphabetical order. Yeah doesn’t sound like a pretty picture does it?Let’s try dramatise this further, where you have to find this book to win a million dollars, and you’re sure that the book is in the library.
Why is it realistically needed ?
If there are 5000 books in the library and it takes you 1 second to read the title of each book, we can safely assume it would need take you 5000 seconds and access to every book presuming the worst case where your book is the last book you’ll find. Now flip side, if the books were alphabetically ordered on table, (evidently the longest table to ever exist), your life would be a lot easier, at every point you’ll need access to only (number of books/2 — log2 N access — ) books. Why ? If you were looking for ‘One Minute Manager’ you would certainly know that the book would be in the latter half of the table, since ‘O’ is in the latter half of the alphabet series. This fastens up our search incredibly.
Cue music, we can think of database indexing in a similar format.
Where is it needed in software ?
Data is stored in disks as blocks of data. Blocks of data are self aware, similar to linked lists, they have a pointer to the location of the next block. Data blocks are stored in context to a singular field, i.e they can’t be stored based on both their ‘age’ property as well as their ‘name’ property. In a situation where the data is stored in reference to the ‘name’ attribute,but we’re trying to find a data node based on ‘age’, we’ll need access to the entire table storing all the nodes, i.e N access. And likewise in a situation we need access to information on a field with which the nodes were organised, we’ll need ‘N/2’ access (N being total blocks). In a linear search, i.e the former the resource exhaustion would be a lot more as compared to in the latter system where we can execute a binary search which has a log2 N,
Now what is indexing ?
Indexing is a way of sorting nodes based on multiple fields. When we create an index based on a field in a database our database system creates another ‘reference’ data structure, whose task is to hold the field value and a pointer to the original record that it is associated with, now this data structure is sorted which lets us perform a binary search on our table, which as we concluded above to be relatively more performant.
Lets try get a little practical
We are assuming the use of a MySQL database using MyISAM as the default storage engine, which uses a default block size of 1024 bytes or 1KB.
Suppose you have a data node, that consists of :
Let’s say each of these nodes have a weight of 204 B which calculatively announces that we have 5 records (Blocking factor) in a disk storage block. If our sample DB had 10,000,000 records it would be ideal to say that the total number of blocks required would be 10,000,000/5 = 2,000,000 storage blocks.
If we were to do a linear search on this assuming the value we’re searching for is not the one used to index the data nodes, then we would need access to an average of 1,000,000 blocks. However if we were searching based on the id field, i.e a field that is already sorted we would be looking at a binary search i.e log2 2000000 = 20.93 block accesses. Voila we just went from 1,000,000 block access requirements to 20 by simply working with a sorted attribute.
Optimising for a non-sorted attributes
What if we’re trying to effectively do a search for the address, a binary search wouldn’t work since we haven’t sorted out our database based on that, we would need to do an linear search i.e requiring access to an average of 1,000,000 records, which is no good.
Indexing in this situation makes sense, where we create a table with nodes containing just two attributes:
- The pointer (which are 2–5 bytes in size)
Assuming that the address is of type CHAR i.e size on disk is 50 bytes. We can say that each node is around 55 bytes, working off a similar procedure the blocking factor would be 1024/55 = 18.6 ~ 19 records per disk block. Using this approach the index table will consist of 10,000,000 / 19 = 526,315 records. This is sorted so we can do a binary search, doing a log2 526,315 calculation reveals we need access to 19 blocks. This again is significantly lower than the initial access to 5,000,000 for a linear search.
What we need to consider when indexing DB attributes is the speed to space expense, is the excess speed in indexing worth the cost in DB expansion.