Database Indexing

Mohammed ElDoheiri
3 min readJan 26, 2024

You have a large amount to data, how can you make it as efficient as possible to look up the information you need and pin point it without having to scan the whole volume of data, for example the oxford dictionary has over 280,000 words, without an index at the begining of the book, looking for a certain word would take days to scan the whole list of words, but the index which is in essense a data structure that captures the position of sections of the data, would make it much easier to find what you’re looking for, i.e words are divided by the first letter they start with and each first letter section start page is registered in the index, so if you’re looking for the word “Expand”, then you’ll first look for the start page of the section of the words the starts with “e”, then scan this section until you find the word, assuming that the number of words that start with “e” is not that big

But what if the list of words that starts with “e” is still too big to scan? then maybe have another index dedicated just for words that start with “e”, i.e divide the words that start with “e” according to the second letter, in that case you would first look in the first index for the page dedicated for words that starts with “e”, lets say page 21, you go to page 21, you’ll be presented with yet another index, you then look for the start page of words whoes second letter is “x” -we are searching for the word Expand-, lets say page 27, then scan for the word “expand” starting page 27

Now we can continue dividing the words of the dictionary further and further and have more and more nested indexes, i.e indexes that point to other indexes, recognize this datastructure? its a tree, the root index divides the words by the first letter, and for each first letter, the page on which the child index resides is registered, then you go to that child index page, you would find the index for words divided by the second letter, and depending on the depth of the tree, you go deeper and deeper specifying the word you want letter by letter

Of course the oxford dictionary doesn’t have enough words to have multi level indexes, just one root index is good enough!

Now for a table of data that contains 100,000,000 rows who’s primary key ‘id’ is a number from 1–100,000,000 , and you want to find the row whoes id is 45434

without an index, the only way to find this row, is to just scan the whole table looking for the id you want

But what if we divid the rows into two halves, id’s from 1–50,000,000 and 50,000,001–100,000,000, but still 50,000,000 rows is still too big to scan, so we further divide each half into quarters and each quarter into 1/8th and so on, lets say if we stop at 1/16th, that mean we would have a tree data structure of depth 4 levels, and then finally you reach the leaf node of the tree, which contains the start address of the range of id’s you’re searching for of length 6,250,000, i.e 100,000,000 / 16

But of course in reality the index will divide probably much further than 4 levels, until pin pointing the row id that you want is found in a few milliseconds, and also each node doesn’t necessarily have two child nodes, but maybe hundreds, to limit the depth of the tree specially with large datasets of 100s of millions of rows, balancing the number of children for each node such that the tree depth is not too big is the secret sauce that database engines use to optimize data access

This is a very nice tool to visualize this datastructure

https://www.cs.usfca.edu/~galles/visualization/BTree.html

--

--