Stories by Ravishankar Joshi on Medium

How useful and relevant are Computer Science college courses in a software engineering job?

Ravishankar Joshi — Mon, 24 Jul 2023 18:27:42 GMT

Context: We will be comparing the Computer Science college courses from tier 1 Indian colleges with a typical FAAMG/big tech and startup software engineer’s job to see their relevance and importance. We will be ignoring IIIT Hyderabad since it has an exceptional curriculum.

As a reference, we shall use BITS Pilani’s B.E. CS curriculum, but the analysis shall be generic enough to apply to all tier 1 Indian colleges. This is not intended to defame BITS or any college and is not based on my experiences at BITS or my employer(s). It is a general analysis based on all tier 1 colleges and general tech companies.

Computer science vs Software engineering

There are major 5 types of software engineers: Front end (web), backend, data, machine learning, and mobile (Android and iOS). There are a few minor and niche types too like research, hardware/IoT, testing, SRE, DevOps etc.

First-year courses

In the first year, we have common courses like biology, chemistry, physics, multi-variate calculus (maths 1), mechanical engineering workshop, thermodynamics, basic electrical science, linear algebra and complex analysis (maths 2). These common courses are rarely used in any software engineering job. You’ll have to work hard to find an exception where they are used. Although these common courses can be useful to understand real-world phenomena outside work or just for curiosity. However, it doesn’t seem like the best use of the first year.

(We may note that the common first-year structure has been copied by Indian universities from US universities, where they allow students to choose their branch at the end of the first year based on their CGPA and preference. However, in India, all colleges admit students in a particular branch (like BE CS), not without a branch. Hence common first year doesn’t seem very meaningful, since you won’t have a choice to move branches freely anyway.)

Probability and statistics, linear algebra, and multi-variate calculus have limited use in machine learning-related areas.

Engineering graphics can be tangentially useful in developing an eye for graphics design, game development, UI/UX design or web development roles. But a direct course in UI UX design might be more useful. The engineering graphics course is designed more for the mechanical engineering branch.

Technical report writing and computer programming are probably the most useful courses from the first year. As a SWE, most of your time is going to be spent coding and writing documents. Most young, inexperienced SWE are weak at writing documents.

Second-year courses

Maths 3 (differential equations): Can’t think of much use. Maybe in some ML areas?

Logic in CS: Theorem proving, program verification. But very very few people might use it.

Principles of economics, and management: Very useful outside work. Helps in navigating the economy and business. Useful in making a lot of decisions. Demand and supply equations are everywhere.

Digital design, microprocessor programming: Not useful at all. Unless you choose to join an electronics-based company like Nvidia, Intel, Samsung or some low-level Kernel/OS teams at Microsoft.

DSA, Discrete maths: Most important for Leetcode interviews.

Object-oriented programming, Database systems: Useful in day-to-day work. However, they leave out a lot of advanced stuff like software engineering, sharding, NoSQL, real-world database use examples etc.

Software engineering should be a compulsory course instead of an elective, where you build a project using GitHub. It can also include how to work with legacy code and focus on business impact rather than coding.

Third-year courses

Theory of computation: Probably not much useful. Automata and regex are useful in real work, but most of ToC is very theoretical maths and has very less use in work. SWE aren’t mathematicians after all.

Operating systems, and networks: These can be useful for someone working in OS or networks like Microsoft or Cisco. Can be useful sometimes for platform/systems software engineers. For others, most of the time, these things are abstracted out and very rarely do you need to dive deep into these areas. Also, pedagogy needs to be completely changed from “building an OS” perspective to “using an OS”. Can be useful and interesting if taught from the perspective of how you would solve certain issues in real microservice systems using OS stuff. I didn’t like this area when I was in college, but Hussain Nasser’s backend engineering changed it completely.

Computer architecture: Useful only at hardware companies like Nvidia, Intel etc. Maybe Microsoft and sometimes systems.

Principles of programming languages: Typically taught in a very theoretical way. Can be interesting and useful if it includes functional programming, trying out different languages and different paradigms, and new languages like Rust, Golang, Kotlin etc. Identifying the right tool for the job is necessary, which this course should teach.

Compiler construction: This can be useful for someone working on compilers at Nvidia, Intel, Microsoft, Google etc. Probably not useful for most SWE. Sometimes can be useful in other areas which require parsing like web scraping, information retrieval (search) etc.

Design and analysis of algorithms: Sometimes useful for interviews, mostly not useful for real work, can be theoretical.

So overall, this would be the scorecard:

Out of 30 compulsory courses, we have 11 essential, 3 somewhat useful and 16 not useful courses. Hmm, 50% efficiency. Not very efficient and effective IMO.

Electives

Most of the useful courses seem to be in electives. Instead of diving deep into electives, I’ll just give a quick summary.

Useful electives: data mining, parallel computing, ML, Neural networks, software engineering, quantum computing, and information retrieval.

Somewhat useful: computer graphics, network programming, data storage technologies, software development for portable devices*, image processing, internetworking technologies, AI.

Not much useful: Combinatorial maths, multimedia computing, Cryptography, Number theory, graph theory, software for embedded systems.

Software dev for portable devices*: In such courses, the mobile app dev part is useful. IoT part not so much.

In the AI course, most of the stuff is theoretical and algorithmic, which might not be as useful in real work. Only the machine learning-based part might be useful for work.

Things required in real-world that are not taught in colleges

Most of the CS curriculum is taught in a backend/systems-oriented way, leaving other SWE areas almost unknown or unexplored. I would have loved to have at least one course on web development and mobile development. There should also be an elective for Product management and UI/UX design.

Microservices

Microservices, distributed systems and big data are all-pervasive in real work, but colleges might not even introduce these to you.

A course on software design and architecture is necessary. It can include monoliths, service-oriented architecture, microservices, their tradeoffs, how to manage microservices, their complexity, testing, deployment, CI/CD, monitoring, alerting, staging environments, cloud vs on-prem DCs etc.

A distributed systems or cloud computing or big data course introducing MapReduce/Hadoop, Hive, Presto, Spark, Scala, NoSQL dbs, CAP theorem, system design, Apache Flink, Kafka etc, designed like this and having the “Designing data-intensive applications” as the textbook. It should include distributed systems theoretical aspects, designs of (say) Spark, and how to use them. It should give a practical introduction to data engineering. It should also give at least a basic idea about what SREs would normally do and what are incidents/outages and how they’re mitigated.

Big data technologies

Courses on web/mobile, distributed systems and data engineering should ideally be compulsory courses.

There can be electives on security and new technologies like AR/VR, blockchain, and parallel computing using GPUs.

There should also be non-technical electives like -

Design thinking, Product design (Can be combined with PM and UI/UX design courses.)
Professionalism
Employee rights (like the LLA channel, including the history of the industry, capitalism, and labour laws)
Law (IP, copyrights, source code licensing, patents, GDPR, understanding offer letters/contracts, different laws applicable in software and outside)
Personal health and well-being (how to stay healthy and fit)
Personal finance (how to manage one’s money and build wealth. It should also include how RSU/ESOPs work.)
Writing and communication skills
Management and leadership (conflict resolution, negotiation, managing up and down, navigating politics, giving and receiving feedback etc)
Morality, ethics and philosophy (Esp stoicism and Gita’s philosophy.)
History of the software industry and computing (from assembly to Rust.)
Rise and fall of software companies (E.g. IBM)
Venture capital, entrepreneurship, innovation and startups (to help students understand how tech startups work, what is funding, funding rounds etc. Even the Shark Tank show would suffice. :P. It’s great if they get a chance at building a small startup, but even understanding how they work, whether to work at a startup or not and how to identify a good startup from a bad one is good.)

We did have good courses on writing and communication skills at BITS, but many colleges might not have. IIIT Hyderabad does have compulsory morning physical exercise in their first year, which their students hate. :P

Many software engineers face issues with their employee rights at small startups. I have heard of cases where people are not paid salaries for months, have 3 months' notice periods, bonds etc.

I have seen a lot of software engineers suffer due to the lack of knowledge in the above-mentioned areas. Most of them learn these from the school of life. :)

Common criticism

But this is how the curriculum has been in all Indian colleges for the last 20 years.
-> CS and SWE is a fast-moving and ever-evolving field. How companies worked 20 years ago is very different from how they work now. (E.g. Microservices and AWS) Colleges must keep updating the curriculum regularly to keep up.
But the curriculum has been decided by expert professors who have N years of teaching and research experience.
-> Experts can also be wrong. (Newton has also been wrong. 🤷) Most Indian professors have only worked in academia, but haven’t worked as software engineers. Also, the feedback mechanism from industry to colleges is broken.
But what if you don’t study common and electronics courses, but during your job, you suddenly need them?
-> Well, then, one can always self-study them when required. Let’s say there is a 10% chance of me needing any electronics knowledge, but there is >80% needing microservices and cloud knowledge. It’s easy to see which one is more important.
If we remove electronics courses from the CS program, it becomes an IT or software engineering program.
-> Then we should let it be. It’s merely a question of naming convention. Most tier 1 colleges (IIT, BITS, IIITs) except a few NITs, offer only a CS program, not IT or SWE BTech. Hence, if someone doesn’t want or need electronics courses, they shouldn’t be forced to take them.
These courses make CS “engineering”; one can always go to boot camps like Interviewbit Scaler Academy if one just wants a job as a “coder”. Bootcamp grads only know how to use OS. Only real “engineers” can build an OS.
-> I doubt the academic rigour of most CS BTech programs. Most CS BTech grads won’t have built an OS as a part of their OS curriculum and would have to learn a lot if they want to build one. Besides, only a handful of companies have the kind of opportunities to implement an OS/next React/Kafka/Spark etc. Most other companies just have jobs where you just use out-of-the-box tools (OS, React, Kafka, Spark). (Think MERN stack)
If we remove the electronics courses, what if someone wants to work at a hardware company?
-> They can be kept available as electives, rather than compulsory for everyone. From a utilitarian perspective, it’s best to optimize for the greater good, rather than for a few people. ≤ 15% of CS BTech students might be going for electronics roles. Colleges can study trends in demand for electronics SWE roles (Nvidia, Intel, Samsung etc) in the last 20 years, and decide based on that.
These courses are required for people who go into research.
-> Still common first-year courses like PCB don’t help anyone I think. They can be removed or kept as electives. That itself will give some space. Also, ≤ 15% of students might be going for MS. Others don’t need to study those same courses. (By utilitarianism) Besides, most research students would rather benefit if these useless courses were removed and they had more space for advanced electives like information retrieval, ML, distributed systems, information theory etc.
Is the purpose of college only to get a job?
-> Now we come to the root problem. The answer is yes. 🙃 College is higher education, which is an investment of a student into themselves. It’s fine to expect returns from it in the form of a job, startup, higher edu opportunities (MS/MBA/PhD) etc. We have been taught to never expect any kind of return from college. If the college had no expected return, most middle-class people won’t have spare 15–20 lakhs to spend on it. Investors always invest in something expecting some returns. College is a vidya mandir (temple of education), not a party bar/pub/club, which it unfortunately has been turned into, due to Western culture. Besides, the common or electronics courses don’t help with any kind of overall development, startup or MBA etc. Effectively their utility is very little.
Why change anything when CS students get good jobs and have good careers despite the current curriculum?
-> They’re successful despite the current curriculum, not because of it. Even electronics, electrical, ECE, mechanical etc students have started getting SWE jobs, which shows that SWE jobs are in high demand, the supply of CS students is not enough and some non-CS students can beat CS students at their game (interviews). This might indicate that companies don’t see any extra value in the CS curriculum, a clear red signal.
One can always learn on one’s own. There is no need for colleges to teach these things.
-> Well, yes. But why can’t colleges teach this? US universities like UIUC, Stanford, CMU etc regularly teach advanced courses like these. As a customer, it’s on us to demand more from colleges.
One should focus on competitive coding/development/hackathons/XYZ (enter the latest buzzword here)
-> Yes, one can. However, still, that means thousands of person-hours are being wasted in useless courses. Just hiding our heads in the sand like an ostrich is not a good approach. This is just an acknowledgement of colleges’ inefficiency.
But SWE are just coders and is a boring job, and we should all try to do research(MS/PhD) / PM / MBA / content creation / startup / XYZ (insert the latest buzzwords here)
-> Still doesn’t change the fact that >80% of CS BTech graduates might be taking up a software engineer or similar job. Most Indian students who go to the US for MS CS also do a SWE job only in the US. Very few of them do PhD and get into research. PhD and research are painful and not for most people.
Whether a job is boring or interesting is subjective and depends on each person. I may find PM, consulting, content creation, research etc boring. Even if SWE is a boring job, those who signed up for BTech CS have to study it. As simple as that.
In India, there are a lot of SWE jobs, fairly easy to get for a tier 1 college CS grad, high paying, good WLB and offer other perks. It would really not be smart to pass that opportunity up unless you have a better alternative and you know what you’re doing.
Software is eating the world, all engineers (electronics, mechanical, civil etc) and many even non-engineers are getting into SWE jobs. There are even BCom students who have completed a bootcamp in Bangalore and got a SWE job. (Not saying it’s easy, but it’s possible.)

Examples of better curriculums

Most US universities have an advanced curriculum as mentioned above. E.g. UIUC, Stanford, MIT etc. It doesn’t need a lot of money (funding) and labs for many courses. Our colleges already have decent computer labs. It just needs a will (and demand from students).

In India too, IIIT Hyderabad, IIIT Delhi, other IIITs, NIT K, and PES Bangalore have a good curriculum. When we go through PES’ curriculum, we feel nothing but excitement and respect. Almost all the courses mentioned in this article are taught in PES Bangalore, despite it being underrated compared to IIT/NIT/BITS. One can definitely feel the amount of effort their management and professors have put into designing the syllabus.

People often say IIIT Hyderabad can have a better curriculum because it’s focused on CS, and doesn’t have other departments, but NITK and PES are good counterexamples. US universities also offer all degrees and still have a rigorous CS curriculum. In future, I hope we see the growth of IIITs and colleges like NITK and PES, which have a good curriculum.

Bootcamps vs colleges

Bootcamps provide good competition to college degrees, since they’re cheaper, have a lower admission bar than clearing JEE, are more efficient (teach only useful stuff) and are decently effective (their grads can get jobs).

In economics, competition is always good for customers (students here) since they can get better services. Colleges will need to improve to compete with boot camps. The same old curriculum might not be worthwhile forever. Also, private colleges are getting extremely costly every year, with high fees and a 13% fee raise every year, making them out of reach for the middle class.

Appendix: This article was written as an answer to this post: https://www.linkedin.com/posts/arpitbhayani_asliengineers-activity-7089184760060657664-LDwq?utm_source=share&utm_medium=member_desktop
This post gets too optimistic about how many SWE really get a chance to work on a new OS, Kafka, React etc.

How to predict things and not be an ostrich. I saw it coming.

Ravishankar Joshi — Sat, 11 Feb 2023 13:30:27 GMT

We can predict many things without astrology. Here’s how.

I don’t believe in astrology. You shouldn’t too. After all, it’s not logical or rational. Planets cannot determine anything. They move in their orbits without bothering about what happens on Earth. Two people born on the same day live very different lives. One becomes rich, while the other becomes poor. Astrology and Hinduism have very little in common. One can reject astrology and still believe in Hinduism (Dharma). Most things in Hinduism stay unaffected by astrology. Still, many people give more importance to astrology than Dharma itself to predict things.

Now that astrology is out of the way, what if I told you we can still predict many things in life, with just a common sense statistical realist approach?

Accepting the reality — stoicism and realism

The first step to predicting reality is to accept it the way it is. Many people run away, shy away or have other intense emotions when they face reality. First, we should get rid of them. We should become cold-hearted/ emotionless/ stoic about reality. If we have strong emotions about reality, we can’t accept it for what it is. If we can’t accept reality, we can’t move forward. For this, one should read Bhagwad Gita or ancient Greek stoic philosophy. Gita calls such a person स्थित-प्रज्ञ (Sthit-pragya) — steady-minded. I.e. one who doesn’t get affected by anything.

A famous billionaire hedge fund manager Ray Dalio explains what it means to be a realist in his video on “how to be successful”. A realist is someone who accepts reality as it is. He approaches everything as a machine. History repeats itself. We can predict reality by looking at events of the past. Watch the complete video to get his ideas on success since it gives an excellent introduction to realism.

Different philosophical schools

Realism is one way of thought. Other important schools are:
Nihilism: Life is meaningless.
Idealism: We must follow ideals and behave ideally.
Liberalism: We must protect and enhance individual freedom.
Hedonism: Pleasure is the most important goal.
You can read more about them here: https://bigthink.com/thinking/10-schools-of-philosophy-and-why-you-should-know-them/.

Among these, realism is the most useful while dealing with the realities of life and predicting them.

Probability and probability distributions

Probability means the chances of occurring something. E.g. probability of heads when tossing a coin is 1/2 = 0.5 = 50%. If we toss two coins, P(“HH”) (we get both heads) is 1/4 = 0.25. The probability that the sun rises in the East is 1 = 100%.

A normal curve — probability distribution

A probability distribution is a table/chart showing the probability of different events. The chart above is a normal curve, which is a common probability distribution occurring in nature in many different things. It shows that P(x in mu to mu+sigma) = 34.1%. P(x in mu+sigma to mu+2*sigma) = 13.6% and so on. If we sum all probabilities from -infinity to infinity, it will become 1. Since x can take value only in this range.

This means that when we measure x, we don’t know for sure what will be its value beforehand, but we do know that it’s most likely going to be in mu-sigma to mu+sigma, with 68% chances.

Physics’ view on reality

Physics allows us to understand most natural phenomena but doesn’t help with man-made realities such as economics, politics, society etc. Statistical mechanics is a branch of thermodynamics that predicts gas particles' motion purely from statistical concepts and probability distributions. Similar statistical concepts also extend in quantum physics while predicting/understanding the behaviour of sub-atomic particles like electrons, protons, and neutrons. What a surprising revelation that physics provides that ‘every event in nature has a certain probability’! Nothing is certain.

I.e. we can’t be sure whether a particular oxygen atom has a certain speed. It can have different probabilities (chances) of different speeds.

------------------------------------------------------
| speed        | <1 m/s | 1-2 m/s | 2-3 m/s | 3+ m/s |
------------------------------------------------------
| probability  |  0.03  | 0.5     | 0.4     | 0.07   |
------------------------------------------------------

This is just some random data that I came up with. It says the chance that the atom has a speed < 1 m/s is 3%. The chance that the speed is between 1 to 2 m/s is 50%. And so on. Using this, we estimate that the speed will be between 1 to 3 m/s with a 90% chance.

Understanding realities using statistics

I believe it can be extended to include most man-made phenomena like economics, politics, society, etc. Let’s take a few hypothetical examples to understand it better.

If we know that an average mechanical engineer earns x and an average software engineer earns 2x, then it makes total sense to choose computer science or become a software engineer if one is optimizing for money. Although the first step would be to find reliable data for this so that one can believe it. I’m sure you’ll find examples of people who don’t earn x and 2x from both the sample sets, but as we know, it’s a probability distribution. There are going to be people on the whole range of the spectrum from 0 to infinity, but each with a certain probability. There are going to be exceptional mechanical engineers who earn 3x, and weak SWE who earn 0.5x, but that doesn’t change anything. Those are exceptions, and they are going to be rare. Just a few per cent of the whole curve. Exceptions shouldn’t lead us to believe that the fact is false.

If we know that 90% of startups fail, then we shouldn’t expect startups to run forever. If one is doing a startup or going to work at a startup, they should mentally accept the failure stat. They should be ok with switching every 3 years or so.

Japan is known to be earthquake-prone. Hence while living in Japan, one should be aware of this risk. If one knows this and still expects not to be hit by an earthquake, then well, that person is being delusional, i.e. ignoring reality, which is the biggest sin in realism.

Drinking and smoking are among the main contributors to cancer. Smoking causes cancer is a very well-known fact, but even drinking causes cancer was surprising to me when I first discovered it in 2020 or so. Still, most people either don’t know these, or they willingly ignore them like an ostrich. Or they expect others to get it, but they won’t get it. They sometimes get hit by reality while taking insurance when they have to disclose whether they do these two things. It’s possible to do either of these things and not catch cancer till the age of 70, since as we know, it’s a probability distribution. No one can know for sure. But we know that it increases the probability of getting cancer. Hence a smart realist won’t take these two addictions.

An escapist approach — like an ostrich

In Hindi, Gujarati and possibly many Indian languages, a person who ignores reality is called an ostrich. Since there’s a saying that ostriches hide their heads in the sand when they are afraid of a predator, while it might not be biologically true, it’s a good proverb regardless. This approach is called escapism — escaping from reality.

Ostrich burying its head in the sand. - A person who ignores the reality

People often use “it depends on each person”/ “to each his own” / “my life, my choice” or some variant of it when faced with brutal realities, since their mind refuses to accept them. It’s ok, but nature doesn’t care about you. It’s purely statistical. Your believing or not believing doesn’t change anything. There is no “depends on x and y”. There are only probabilities.

Reality is only affected by some real variables. E.g. for smoking, how many times you take it affects your chances of getting cancer. I’m sure there would be stats on that. Something like: Once a day for 10 years has a chance of 1%. Twice a day has a chance of 2%. And so on. It won’t be affected by the person (smoker). Nature doesn’t care about you.

“Nothing is certain” is also often used saying against this approach. Well, yes, nothing is certain. But the probabilities are certain. Whenever we do anything, we’re playing a game of chance. If the chances of an aeroplane crash are 1%, I can take 100 flights and survive, while someone can get killed on their first trip. Still, we should firmly believe in the stats.

Our insignificance against nature

“I am but a speck of dust in this vast universe. But I am a star among stars such as you. In this infinite world we are only small from the vantage point of our minute perspective. We are the universe and the universe is us.” — M. Mondesir

This is a great quote explaining our insignificance in this vast universe. We’re nothing. God has created the universe and is playing a game of dice with us. The universe won’t change its laws for us. Your power, money, love or whatever won’t save you from the laws of nature. This humbles us when we get too much ego. Only in stupid movies, can a heroine save her hero from cancer with her love. Being illogical creates good stories.

Limitations of the statistical approach

Finding good, reliable statistics can be difficult.
Deriving logical inferences from statistics can be difficult for complex phenomena.
Sometimes statistics can lead to misleading conclusions.
This approach can’t provide the reason behind the phenomena. It has to be found by other approaches. (E.g. why does drinking cause cancer.)
It can usually provide a way to deal with reality, but not always. For some realities, no remedy might be possible or easy.
Knowing reality can lead to stress, sorrow or even depression. Not everyone can mentally accept reality like a stoic. It’s not for the faint-hearted. They can consider “ignorance is bliss”. It needs a lion’s heart to accept the brutal realities.
Since it’s statistical, it cannot answer with 100% confidence. Nothing is fully certain.

Conclusion

We can’t expect reality to change to suit us. Reality is what it is. We can’t run away from it.

Our only hope is to either

willingly not know it (ignorance is bliss) and hope we don’t get hit by it, or
be prepared for it.

A statistical realist approach to life is just common sense. It’s not rocket science. But sometimes common sense can be seriously uncommon!

Hope this article helps you find some interesting stats and make decisions based on them! :)

Cache, cache everywhere Part 2

Ravishankar Joshi — Sun, 12 Apr 2020 11:27:29 GMT

A beginner-friendly and breadth-first introduction to caches. This article tries to explain caches from a software perspective, without introducing any hardware-level details.

The first part of this article: https://medium.com/@ravishankarjoshi/cache-cache-everywhere-part-1-988ba01a627c

Why use caching? What are the basic principles behind them?
Types of caches: caching in data structures and algorithms (theory)
Caches for code
Caches for data (RW data from RAM)
RAM as a cache for file I/O
Database caches
Network caches
How does YouTube work? Distributed caches on the internet — CDNs.
Why not use caches everywhere?
What is compiler optimization? Can it hurt?

Topics 5 to 10 are covered in this article, and 1 to 4 are covered in part 1.

5. RAM as a cache for disk I/O (Page Cache)

OS uses the Unused RAM as the page cache to load “pages” of the disk (HDD or SSD). It can help for the file I/O operations.
Executable binaries like applications and libraries are also brought into RAM using the page cache. Since the binary code is not going to be modified, multiple running processes can share it.

When a file read/write occurs, OS searches for the block of the file in the cached pages. If OS finds the block in the page cache, it reads/writes operation from/to the page in the main memory (page cache). If the block does not exist in the page cache, OS fetches the page(s) from disk and applies the requested operations.

Since reading/writing to files goes through such page caching mechanism, the file I/O APIs for languages like C, C++, Java, and Python use a file pointer, which points to the current location in the file. Thereby (generally) only allowing sequential access and not random access.

At disk-level too, HDD operations require mechanical disk rotations (disk seek), which makes them slow. Hence they buffer all the changes before writing.

6. Database caches

Databases store vast amounts of data nowadays, in much software, particularly in the web-based ones. We primarily focus on SQL (relational) databases here.

Database engines internally optimize user-written SQL queries to reduce latency. However, they might not be able to optimize all the queries. To write queries optimally, we often need to understand how DB engines implement/execute them internally.

Joins are among the costliest SQL queries. Hence, optimizing SQL queries often focuses on optimizing joins, which becomes the bottleneck operation.
DB engines can implement joins in three ways:
1. Nested Loop Join
2. Hash Join
3. Sort-Merge Join

This article talks in detail about how all 3 kinds of joins work and how CrateDB used hash join. https://crate.io/a/lab-notes-how-we-made-joins-23-thousand-times-faster-part-one/

Let’s say the query is something like:

SELECT *
FROM t1 JOIN t2 
ON t1.a = t2.b;

If the DB engine can execute the above join as a nested-loop join in the following two ways:

for each t1_entry in t1:
   for each t2_entry in t2:
      if t1_entry.a == t2_entry.b
         print(t1_entry, t2_entry)

Or as —

for each t2_entry in t2:
   for each t1_entry in t1:
      if t1_entry.a == t2_entry.b
         print(t1_entry, t2_entry)

Now, as these queries run on a real computer, the latencies depend on the size of t1 and t2.
If size(t2) < size(cache of the Db server) < size(t1), the first implementation would be faster than the second one, since the cache can store the whole of t2 at once. (Similar to arrays)
(Here, the nested-loop join is discussed for the sake of simplicity and brevity. The same logic can be extended to the other 2 kinds of joins too.)

DB engines also optimize database queries by caching the results of the recent queries directly. (Assuming no writes/updates.)

E.g. if the cricket world cup tournament is going on and Google search keeps getting queries from Indian users like “The schedule of India’s matches in the current cricket world cup”. Some database might store the cricket schedules, but due to the high query load, it can optimize the queries by caching the result of this particular query.

However, this kind of optimization might not help much in real-time queries such as “India’s current score” due to the incoming updates. (When the Indian team is playing a cricket match.)

7. Network caches

Since sending/receiving packets on network/internet are much slower than even disk accesses, networks use various forms of caches to optimize it.
Web browsers and web proxies cache static content like web pages and images from previous HTTP/HTTPS responses from web servers. This caching is useful because if a user accesses a static webpage frequently, it is less likely to change between two consecutive visits.
We can call this as some kind of “less frequent updates” rule of thumb for web caching.

Domain Name Service (DNS) is a service which translates human-readable domain names (E.g. https://google.com) into IP addresses (e.g. 216.58.203.46). It uses a hierarchy of cache servers to do so. It has a set of worldwide servers, where each server maintains domain name to IP mappings, to reduce the load on the top-most servers in the hierarchy.

Peer-to-peer (P2P) forms of caches are also possible, where nodes (computers) in the same LAN or same city might decide to use each other as a cache to reduce network usage and improve response times.
A canonical example of this would be Windows 10 allowing users to download updates from other computers in the same network automatically (if possible) rather than downloading them from Microsoft servers directly.

8. How does YouTube work? Distributed caches on the internet — CDNs.

This topic extends the previous one. YouTube and similar websites get a vast number of users to watch static (and often data-intensive) content. Rather than sending all the user queries to YouTube servers and videos from YouTube servers to users, they make use of content delivery networks (CDNs), to exploit several usage patterns and to reduce load.

For web caching, an important rule is Geographical locality of access. I.e. the content of a particular type might be high in demand in a geographically specific area of the world.
E.g. the videos in Hindi might get many viewers in India, but not in other countries. Also, since network latency also depends on the distance between the server and client, it makes sense to cache such Hindi videos into the servers nearest to Indian users, and not in say US servers. (Since if a particular Hindi video is trending in India, a lot of Indians might want to watch it.)

Also, the number of users of a particular service might be very high in a country, but very low in another country. E.g. due to high population, India might have 10x more users for a particular website than some small country. Hence, it might make sense to keep a data centre near India.

A content delivery network, or content distribution network (CDN), is a geographically distributed network of proxy servers and their data centres.
CDNs increase the complexity of the infrastructure and code of the website using them. E.g. counting the number of viewers in a CDN-based YouTube can become more complicated than counting the same in a single server-based website.
It becomes even more difficult for websites, which do not just majorly serve static content but also have real-time updates, such as Twitter. However, understanding the caching solution behind Youtube and Twitter is beyond the scope of this article.

9. Why not use caches everywhere?

As discussed in the initial sections (processor instruction and data caches), caches are expensive compared to regular memory devices like RAM, while providing faster access. Hence, it is a trade-off that one needs to make on how much cache to have against how costly it would be.

Also, often, caching reduces read-times, but at the cost of delayed writes. This impact becomes even more interesting in the database caches and CDNs, where the latest update might take longer to reflect into the caches (depending on the implementation).

10. What are compiler optimizations? Can they hurt?

In simple words, we use caches to make our code run fast. However, other performance improvement techniques like compiler optimizations, multi-threading, using multiple processors, distributed computing can interfere with caches, leading to problems ranging from slow performance to Heisenbugs.

How can multi-threading work with caches?
Let’s say we have two threads A and B are running on a single-core, single-processor computer. When the context switches from thread A to thread B, the (shared) cache would evict all the blocks stored by A and write them back to the RAM (if needed), and bring the blocks for B. Hence, multi-threading can potentially lead to more cache misses, if not done correctly.
One way to avoid this would be by creating only as many threads as the number of cores. Hence, no thread would ever context-switch, and no cache misses would happen.

Compiler optimization means compiler statically optimizing the code based on the properties it can prove about it.

Some of you might have heard of the Java keyword “volatile”. It can help avoid some such bugs.

When one declares a variable volatile, the compiler avoids any kind of optimizations on that variable, because otherwise, it can lead to bugs. I.e. the volatile variable would be read from the main memory (RAM) every time it is accessed, bypassing the cache.

Volatile keyword is helpful when multiple threads share a variable and its value is used in some conditions to decide the branch. If the compiler optimizes it like the usual, single-threaded code, the code might not read the value of the variable before testing the condition. This compiler optimization bug can happen if threads A and B share some variable.

Thread A’s pseudo-code:

x = 1 // shared variable, but not declared as volatile
while(true) {
   mutex.lock()
   if (x < 10) {
       print("Volatile bug")
   }
   mutex.unlock()
}

Thread B’s pseudo-code:

x = 1 // shared variable, but not declared as volatile
while(true) {
   mutex.lock()
   if (someFunction()) {
       x++
   }
   mutex.unlock()
}

Here, the compiler might think that thread A is not modifying x, hence it might optimize the first code to —

x = 1 // shared variable, but not declared as volatile
while(true) {
   mutex.lock()
   print("Volatile bug")
   mutex.unlock()
}

Removing the if condition, and creating a bug.

We generally believe that protecting a variable with a mutex would assure that only one thread reads/writes to it at a time, and hence all the other threads would see the updates in the correct order. However, it does not hold in the presence of caches.

Let’s see what happens if we do not declare x as volatile, but the compiler does not optimize the code either. Hence, the original code is run. Can this still lead to bugs?
The answer is yes.
What if thread A and B are running on different processors?
In such a case, they would not share the cache. Hence, the threads would read and write to “x” from their processor’s cache, getting an illusion that they are reading/writing from/to the actual variable from RAM. Hence, they would have their own “versions of x”. Hence, this results in an inconsistent value of x.

Hence, it is necessary to have a cache coherency protocol, which maintains caches from the different processors in a coherent manner. Typically, such coherency protocols are implemented at the cache-block level, which means when thread B writes to the variable x, it might not reflect to thread A immediately! The value of x might be reflected only when cache writes x’s block to the RAM. We can avoid this race condition by declaring x as volatile.
The code reads and writes to/from RAM directly on each access of the volatile variables.

Further reading:
https://medium.com/google-developer-experts/on-properly-using-volatile-and-synchronized-702fc05faac2
https://stackoverflow.com/questions/7872175/c-volatile-variables-and-cache-memory
https://dzone.com/articles/why-do-we-need-the-volatile-keyword
(However, the volatile keyword does not guarantee thread-safety in C/C++ and is discouraged now. But it guarantees atomic reads and writes from RAM in Java. Hence, both thread and cache safety.)

Cache, cache everywhere!

From processor level caches to CDNs.

Cache, cache everywhere! Part 1

Ravishankar Joshi — Sun, 12 Apr 2020 11:26:08 GMT

A beginner-friendly and breadth-first introduction to caches. This article tries to explain caches from a software perspective, without introducing any hardware-level details.

“Water, water, everywhere,
And all the boards did shrink;
Water, water, everywhere,
Nor any drop to drink.”
- Samuel Taylor Coleridge (in The Rime of the Ancient Mariner)

When we hear the word cache, most of us think about the processor-level caches and their low-level (hardware) design and how they work. However, there is more to caches than just this. Most of us recall our (often dreadful! :P) computer architecture classes in university, where we learnt about the hardware caches for processor instructions. However, we might not have seen or used them anywhere else.

Caches are used almost everywhere in computer science, be it data structures and algorithms (theory), processor instructions, reading/writing data from RAM, reading/writing from disk (file I/O), caches for databases, network I/O or surfing the Internet (CDNs/ distributed caches).

In general, the design of read-only caches is simpler compared to read-write caches, since optimizing writes complicates the cache design. Hence this article mainly focuses on read-only caches.

Why use caching? What are the basic principles behind them?
Types of caches: caching in data structures and algorithms (theory)
Caches for code
Caches for data (RW data from RAM)
RAM as a cache for file I/O
Database caches
Network caches
How does YouTube work? Distributed caches on the internet — CDNs.
Why not use caches everywhere?
What are compiler optimizations? Can they hurt?

This article covers topics 1 to 4. Part 2 of this article covers topics 5 to 10.

1. Why use caching? What are the basic principles behind them?

A basic idea in computer science is to write software that runs faster so that we can solve larger and larger computational problems.

Client: Hey, can our website handle 1000 users?
You: Yes, it can.
Client: Hey, now it went viral, and we have 10,00,000 users overnight. Will it still work?
You: Yeah, I will make it work. 😒

So, we want to improve our software performance. One way to achieve high performance is by removing redundant computations and doing them only once.

Benchmarking: Studying the performance of some software with various standardized inputs, finding the performance problems and improving it.

Enter caches.
Caching means storing some data to reuse it soon to avoid fetching/calculating it again. Caching is an example of the time-space tradeoff. We sacrifice some memory to reduce the time taken for some future computation.

Umm, well, why store the data at all?
The principles behind caches are Spatial and Temporal locality of memory access.
Let’s say you are working with some integer array ‘arr’, and doing some operations on its elements.
Temporal locality of memory access: It is a rule of thumb that if you access some memory location (say arr[i]), you will reaccess it soon.
Spatial locality of memory access: If you access some memory location (arr[i]), you might also access its nearby locations (arr[i+1], arr[i-1]) in the near future.

So, the question becomes, can’t we just access the data (array) from the RAM every time? Why do we need to “backup” it in the cache?

Jeff Dean, the current head of Google AI, published a few numbers in 2012, in an article titled “The latency numbers every programmer should know”:

Source: https://gist.github.com/jboner/2841832

https://medium.com/media/53a88e25ef2fba126803707a78c00308/href

Based on these, we can see that for reading/writing data, the overall latency order is -
Processor speed < cache < main memory (which we call as RAM) < Solid-state disk (SSD) < hard disk (HDD) < network.

Notice that sometimes there is an order of magnitude (10x) increase in latency as we go from the left to right in these memory devices.

Hence, we use caching to store data from a slow memory device into a faster device, hoping the Spatio-temporal locality principles hold and we can get more performance.

What is a cache hit and a cache miss?
Cache hit: When the processor needs some instructions/data, and it is already present in the cache, it can directly use it.
Cache miss: If the cache does not have the required instruction/data, the processor has to bring the data into the cache.
The higher the cache hit rate, the faster your program.
Cache-friendly program: The program which has the least number of cache misses. (i.e. Has maximum hit rate)

2. Caching in Data Structures and Algorithms (Theory)

In data structures and algorithms, many a few times, memoization is used, which is a kind of caching.

To take a natural example, let’s say you are given an N-ary tree, which does not support node insertion and deletions (i.e. constant tree), but it has to support queries of the form: Given a node-id, find the number of nodes in its subtree.

An N-ary tree

Let’s say the tree looks like above. When you get the query: findSubtreeNodes(string nodeId), a simple solution to this would be traversing all nodes in the subtree of the node and finding the number of the nodes. However, that would make each query O(N), where N = number of nodes in the sub-tree.
Its optimal solution would be to precompute the number of nodes in sub-tree of each node and then answer each query in O(1).

An augmented N-ary tree, where each node stores the number of children its subtree

The tree with the above precomputation looks like this. This precomputation is an example of memoization or caching, using which, one can answer such queries in O(1).

Similarly, solving a problem faster by memoizing values from recursive function calls is called dynamic programming. Its most famous example being computing Nth Fibonacci numbers.

3. Caches for code

When we compile high-level language programs, they eventually become assembly-code or bytecode, depending on our language. Programs compile down to assembly since processors understand only their specific assembly language. When the JVM runs the bytecode, JVM translates it into your processor-specific assembly code. Hence, eventually, every code becomes a platform-specific assembly before running.

RAM speed is generally much slower than the processor speed.

When a processor runs the assembly code, it has to fetch one instruction at a time and run it. This (1 instruction at a time) is a very simplified model.
However, this processor model is very slow because each time an instruction is fetched, RAM is accessed, and before RAM gives the instruction, the processor is idle, since it doesn’t have any other instruction to run. To solve this, processors have a fast memory called instruction cache, where say the next 10 instructions are stored. The processor keeps getting the next instructions from the RAM and keeps running instructions simultaneously. This cache allows the processor to reduce the impact of the RAM instruction fetch.

Here, we are mainly talking about sequential programs.
What happens when we add a condition or a loop into your program?

for(i = 2; i<100; i++) { // prints all the primes in [2, 100).
 isPrime = true
 for(j = 2; (j*j)<=i; j++){
   if(i%j == 0) 
     isPrime = false
 }
 if(isPrime){
   print(i) 
 }
}

Before the loop condition and the if conditions, the processor instruction cache might create trouble, since it does not know which branch will be taken in the code beforehand. It would know the result of the if condition (true or false) only after executing it, which means it cannot pre-fetch instructions after the if condition.

To solve this, processors do some magic called “branch prediction”. I.e. it guesses beforehand which branch would be taken based on historical information. If its prediction comes true, it doesn’t face any penalty. Otherwise, it has to empty the cache and fetch instructions again.

(Branch prediction is a complicated topic in itself. To simplify it, it decides the branch based on what happened in the last few if condition tests. If last 100 times the same if condition (in a loop) resulted in “false”, it would assume that for 101st time also, it would result in “false”.)

One of the most exciting applications of branch predictions is this SO question: https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array
An SO user benchmarked the performance of iterating (and checking an if condition) on a sorted array vs unsorted array in C++ and Java. He found that iterating on a sorted array is much faster (if you are running a certain kind of code).
Here, it seems that the code iterating on a sorted array is cache-friendly (for instruction cache), but the one using unsorted arrays is not. Because the unsorted array causes the branch prediction to fail 50% of the time.

(Often, such benchmarking results also depend on the processor, the operating system, the programming language, the compiler used.)

4. Caches for data

Processors also cache data that the instructions operate on similar to how they cache instructions. Generally, computer architecture courses taught in Indian universities rarely give enough importance to data caches.

Most of the content below is from my previous Quora answer: https://www.quora.com/What-is-the-difference-between-cache-unfriendly-code-and-the-cache-friendly-code-in-C-C/answer/Ravishankar-Joshi

What is random v/s sequential access? Are arrays random access? Arrays v/s Linked lists

Random access: If we can access any member of a data structure directly, in random order, and in O(1) (constant) run-time, it would be called a random access data structure.
Sequential access: If we can access members of a data structure only by iterating from its beginning or end, the data structure would be called a sequential access data structure. (I.e. random access is not possible.)

Hence, we call arrays as “random access” data structures, while linked lists are sequential data structures.

However, is random access as fast as sequential access on an array?

The answer is no. While an array does support random access in O(1), it is not as fast as sequential access on the array. To find out why read below!

A typical laptop/PC might have 8 or 16 GBs of RAM, but usually up to 10 MBs of processor cache.

So, the workflow is like this: If you are working with an array of say 10 KB, you go to RAM and bring it to the cache. Do all your (say 10000) operations and put it back to RAM. So, you have just 2 RAM accesses instead of 10000 RAM accesses. (Of course, here, it is an over-simplified model.) A typical RAM access might have latencies of the order of 100*cache latencies.

When your program accesses an array element (say arr[i]), the processor automatically brings that element (arr[i]) and a few of its nearby elements — arr[i+1], arr[i+2], … into the cache (depending on cache size). (Spatio-temporal locality)
Typically caches store next 64 bytes of memory when accessing an array element. Hence, if you access arr[i+1] after arr[i], arr[i+1] will not require another RAM access.
Hence, if the array is accessed sequentially, it can produce many cache hits, while randomly accessing it will produce cache misses.

What happens for 2D or N-D arrays?

int arr[1000][1000];
for(int i=0; i<1000; i++)
{   for(int j=0; j<1000; j++)
    {   
        arr[i][j] = i*j; // some random operation.
    }
}

Compared to

int arr[1000][1000];
for(int i=0; i<1000; i++)
{   for(int j=0; j<1000; j++)
    {   
        arr[j][i] = i*j; // some random operation.
    }
}

To compare these two pieces of code, we need to understand the memory model of the programming language used to write this code. If this code is written in a language like C, C++ or NumPy in Python, the arrays follow row-major order. (Some languages like MATLAB, Octave and Julia follow column-major order.)

In row-major order, the consecutive elements of a row of a multi-dimensional array reside next to each other in memory (RAM). I.e. arr[i][j], arr[i][j+1], arr[i][j+2], … are next to each other in the memory. Also, arr[i][999] (last element of ith row) will be next to arr[i+1][0] (first element of i+1th row) in the above example.

Hence, the first code only accesses contiguous elements, thereby benefitting from the cache. In contrast, the following code jumps from one row to the next, hence resulting in many cache misses. Hence, the first code is cache-friendly and much faster than the other code.

Challenge problem: Can you optimize the above first code (cache-friendly) even further (for large array sizes)? (It is possible to optimize it. If you can answer, reply in the comments section below.)

What about pointers or data structures which use pointers?

Let’s see what happens to trees in the presence of caches. We can make a tree as an array vs using pointers like this:

// A binary tree made with an array like so:
int node[1024]; //0th element is to be kept unused.

where left child of node[i] is node[2*i] and right child is node[2*i +1].

Now, on this tree, if we apply any standard tree operation like BFS, DFS, in-order traversal, it results in a cache hit, and high performance, but if the implementation of the tree uses pointers like:

struct node {
    int data;
    node* left;
    node* right;
};

If we apply the same operation (BFS, DFS etc.) on this tree, the performance would be much slower because pointers would not be from contiguous memory locations. And, when you try to get one node into the cache, the cache would store the pointers node->left and node->right (waste of cache memory). However, the cache will not store those nodes themselves.
Hence, the first example above is cache-friendly, while the second is not.

While we can optimize RAM data reads using caches, what happens with writes?
We did not have to worry about writes for the instruction cache since the program instructions can (generally) be only read, not written. However, data does require writes.

Well, write operations are applied on the cache first, and when the cache is full, it is flushed to the RAM.

Even while using delayed writes, reading (from cache) gives the correct value, since the processor tries to search for an element in the cache first, and only if it is not there, it is fetched from the RAM.

Also, what happens when the cache becomes full?
When the cache becomes full, to make the space for an incoming element, one element has to be removed from the cache, while reducing performance impact. For this, caches follow a cache replacement policy. Replacing the least recently used (LRU) element is a standard cache replacement policy. Based on the temporal locality principle, the LRU element is the least likely to be accessed soon.

For further reading: https://www.cs.cornell.edu/courses/cs3110/2012sp/lectures/lec25-locality/lec25.html
https://www.cl.cam.ac.uk/teaching/1718/ProgC/lectures/lecture6.pdf

To learn about
RAM as a cache for file I/O,
Database caches,
Network caches,
How does YouTube work?
Why not use cache everywhere?
What is compiler optimization?
Please check out part 2 of this article: https://medium.com/@ravishankarjoshi/cache-cache-everywhere-part-2-35c74a79f90.